Stata code – Tim Plante, MD MHS

Julia for Stata users 2 – Fundamentals of loading, merging, and viewing data

(See the part 1 post here.)

Viewing and changing your working directory

One step will be changing your working directory to the place where you actually want to be working. See your current working directory with “pwd()“. Note: if you get a message saying “pwd (generic function with 1 method)”, you entered “pwd” and not “pwd()”.

Anyway, here’s what I see:

What’s up with the double slashes in Windows directories? Windows is unique among OSes in that it uses backslashes (\) and not forward slashes (/) in its file structure. It turns out that the backward slash is also special character in coding so you have to do two slashes for “normal” slashes as a translation in a lot of coding programs (details here). If you want to change your directory in Windows, you’ll need to deal with single to double backslashes conversion by either (1) adding a second backslash everywhere it shows up in a directory, or (2) declare the string to be raw as is described here. See more on the second, easier option below.

You can change the directory in Julia with “cd([your directory])“. In Windows explorer, you can copy the path by right-clicking on the desired folder and clicking “copy as path”.

So, to change your directory in windows, you need to deal with how Julia handles backslashes. Adding “raw” before the string of the directory is simpler. The following two are equivalent:

cd("C:\\Users\\USER\\julia code")
cd(raw"C:\Users\USER\julia code")

The second allows you to simply paste the directory from Windows Explorer. So, use raw before your directory string! I don’t know if people will have this problem with other operating systems, I’m guessing not.

You can view the contents of your working directory (e.g., like “ls” or “dir”) with “readdir()“. In this example, I have already used cd() to change my pwd() to the folder that has a Julia script in it:

Hey! It’s my “count to a billion” code from part 1. Since it’s in my current directory, I can load it and run it using the include() command as follows:

include("test.jl")

This is actually quite a helpful command if you are writing your script in Notepad++ like we covered in Part 1. Rather than copying/pasting the contents of your script from Notepad++ into Julia/REPL, you can simply do your editing in Notepad++, save the updated script, and then re-run it using include(filename.jl)! A bit less clunky.

Importing files from other statistical packages (SAS’s sas7bdat, Stata’s .dta) using TidierFiles

As discussed in part 1, Tidier is a meta-package that includes a ton of sub-packages in an attempt to create Tidyverse in Julia. One of Tidier’s included packages is the TidierFiles package, which reads and writes all sorts of stuff. Note: You load all Tidier packages, including TidierFiles, when you load Tidier itself. TidierFiles uses Julia’s revered DataFrame package, FYI. You’ll see “df” to name imported data sitting in a dataframe.

Step 1: Installing Tidier (and with it, TidierFiles) and DataFrame

In the package manager (hit “]” to enter), type “add Tidier”. After they install, hit backspace to get back to Julia’s REPL. Alternatively, you can run the following in the REPL or in a script to install Tidier:

using Pkg # need to remember to load Pkg itself!!
Pkg.add("Tidier") # This loads lots, including TiderFiles and DataFrame
Pkg.status() # see that it installed. 
using Tidier # load tidier
# type ? and "Tidier" to see all of the packages that come with it.
# type ? and "Tidier.TidierFiles" to read about that specific package.

Step 2: Importing SAS, Stata, CSV and other files

Importing SAS: Now let’s download the airline.sas7bdat dataset from here: https://www.principlesofeconometrics.com/sas.htm — save it to your pwd(). The following command (1) uses Tidier’s TidierFiles to import the airline.sas7bdat as a DataFrame called “df” then (2) shows that it loaded correctly using varinfo().

Note: You can opt to use “read_file” instead of “read_sas” and let TidierFiles figure out what type of file it is. In 2025, this gives an error if you are trying to automate the download using a string, so better to use the “read_sas” command here.

using Tidier # this loads TidierFiles
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir() # the airline file should be in the pwd()
# the following saves the sas file as a dataframe called "df1"
df1 = read_sas("airline.sas7bdat") 
# see that it loaded as a dataframe: 
varinfo()
# fin

In theory, you should also be able to load the above file directly from the web, but I’m getting an IO error doing that. I’ll come back and try to debug later. This should be the correct code, I’m not sure why it’s not working.

using Tidier
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()
df2 = read_sas("http://www.principlesofeconometrics.com/sas/airline.sas7bdat")
# see that it loaded as a dataframe: 
varinfo()
# fin

Alternatively, you can instead use Julia to download a file to your pwd() in a script. Let’s say you want to download the andy.sas7bdat file and save it to the pwd(). the Downloads package can help with that. Install it with “]” and “add Downloads” or “using Pkg” and “Pkg.add(“Downloads”)”.

using Tidier
using Downloads
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()

# specify the URL
url = "http://www.principlesofeconometrics.com/sas/andy.sas7bdat"

# Specify the destination, but need to explicitly name the file
# so grab the filename from the end of the URL and list it
# along with the pwd using the joinpath command
filename = split(url,"/") |> last 
dest = joinpath(pwd(), filename)
Downloads.download(url, dest)

# see that the file is downloaded in your pwd()
readdir()

# Now use the above script to import the andy file, using the
# captured filename string to automate the import
df3 = read_sas(filename)
# see that it loaded as a dataframe: 
varinfo()
# fin

Importing Stata: This is essentially identical to importing a SAS file, but use the “read_dta” command in place of “read_sas”. If you had the auto.dta file in your working directory, this is how you’d import it.

using Tidier
using Downloads
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()
df4 = read_dta("auto.dta")
# see that it loaded as a dataframe: 
varinfo()
# fin

Here’s how that’d look using the auto.dta file from Stata’s website (https://www.stata-press.com/data/r17/r.html):

using Tidier
using Downloads
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()

# specify the URL
url = "https://www.stata-press.com/data/r17/auto.dta"

# Specify the destination, but need to explicitly name the file
# so grab the filename from the end of the URL and list it
# along with the pwd using the joinpath command
filename = split(url,"/") |> last 
dest = joinpath(pwd(), filename)
Downloads.download(url, dest)

# see that the file is downloaded in your pwd()
readdir()
# now import that file as a dataframe:
df5 = read_dta(filename)
# see that it loaded as a dataframe: 
varinfo()
# fin

Here you go!

Importing CSV and other filetypes – TidierFiles will import “csv”, “tsv”, “xlsx”, “delim”, “table”, “fwf”, “sav”, “sas”, “dta”, “arrow”, “parquet”, “rdata”, “rds, and Google sheets, you just need to select the correct command to do so (see details here) to replace “read_sas” and “read_dta” above.

Step 3: Merging DataFrames together

We’ll use NHANES data (saved as sas7bdat) and merge on SEQN, aka a unique identifier. The following script downloads the DEMO file and a cholesterol file, and saves them in DataFrames called df_demo and df_trigly

using Tidier
using Downloads
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()

# Demo file
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.xpt"
filename = split(url, "/") |> last 
dest = joinpath(pwd(), filename)
Downloads.download(url, dest)
readdir()
df_demo=read_sas(filename)


# Cholesterol file
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2013/DataFiles/TRIGLY_H.xpt"
filename = split(url, "/") |> last 
dest = joinpath(pwd(), filename)
Downloads.download(url, dest)
readdir()
df_chol=read_sas(filename)


# look at dataframes and strings loaded:
varinfo()
# fin

Now, let’s merge/join the df_demo and df_chol dataframes together and save it as a dataframe called df_merge. This will use TidierData’s join function, you can read about here. There are a bunch of different joins, we’ll be doing a full_join which will preserve all rows without dropping any for missing. Read about joins (“merge”) here. Of note, the join command as implemented in TidierData will infer the matching variable based upon identical columns in the two datasets. Continuing with the prior script:

df_merge = @full_join(df_demo, df_chol)

Combined code:

using Tidier
using Downloads
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()

# Demo file
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2013/DataFiles/DEMO_H.xpt"
filename = split(url, "/") |> last 
dest = joinpath(pwd(), filename)
Downloads.download(url, dest)
readdir()
df_demo=read_sas(filename)


# Cholesterol file
url = "https://wwwn.cdc.gov/Nchs/Data/Nhanes/Public/2013/DataFiles/TRIGLY_H.xpt"
filename = split(url, "/") |> last 
dest = joinpath(pwd(), filename)
Downloads.download(url, dest)
readdir()
df_chol=read_sas(filename)

# join/merge datasets, save as "df_merge" dataframe
df_merge = @full_join(df_demo, df_chol)
# look at dataframes that are there:
varinfo()
# fin

Exporting data

You can export your work using TidierData, just use “write_sas”, “write_dta”, “write_csv” or whatever you want. Details are here. For example, you can append the following to the end of the prior code and save the df_merge dataframe as a CSV file in your pwd(), including the column names as the first row.:

write_csv(df_merge, "NHANES_merged.csv", col_names=true)
# fin

Saving your work in Julia’s JLD2 format and then later loading it

The JLD2 file format seems to be designed to be compatible with future changes in Julia. See details here. Install it in the Pkg interface (hit “]”) then “add JLD2” or “using Pkg” and “Pkg.add(“JLD2″)”. You can save things in your memory (type “varinfo()” to see what’s loaded). For example, you can save the df_merge from 2 sections above with:

using JLD2
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()
@save "mydataframe.jld2" df_merge
# look at files in pwd()
readdir()
# fin

I haven’t quite found an elegant way to reload JLD2 with Tidier, it instead uses DataFrames (which is actually included in Tidier but doesn’t correctly work with this). You’ll need to separately install DataFrames in the Pkg interface (hit “]” and type “add DataFrames” or in REML type “using Pkg” and “Pkg.add(“DataFrames”)

# close and reopen Julia, load JLD2 and DataFrames
using JLD2
using DataFrames
# set pwd() with cd(), confirm you did it correctly with pwd(), then read the pwd() contents with readdir()
cd(raw"C:\YOUR DIRECTORY")
pwd()
readdir()
# load the dataframe as df_merge.
@load "mydataframe.jld2" df_merge
# look at dataframes and strings loaded:
varinfo()
# fin

Viewing your data in the browser

One great Stata feature is “browse”, allowing you to look at all of the data. The “BrowseTables” package allows you to do something similar in Julia’s interface and (more importantly) a browser window. We will demonstrate this using the merged NHANES dataset above.

Jump to the pkg interface (hit “]”) and type “add BrowseTables” or add in REPL with “using Pkg” and “Pkg.add(“BrowseTables”) then the following:

using BrowseTables
# view 'whatever fits' in the julia terminal:
HTMLTable(df_merge)

# open up the nhanes_merge example from above in a browser:
open_html_table(df_merge)
# fin

The first command, HTMLTable(df_merge), will show output in the Julia browser, truncating the columns and rows so that it fits like such:

The second command, open_html_table(df_merge), will open the entire dataset in your default browser, like so:

Stata’s ‘browse’ feature is nice in that it allows you to browse subsets of data, e.g., ‘browse if sex==”M”‘ would show data for just males. I’ll play around with this a bit more and see if there’s a simple way to subset the rendered tables to just a sample of the data.

Using Stata and Graphviz to make social network graphs and hierarchical graphs

I recently had to make a figure that showed relationship between variables. I tried a few different software packages and ultimately decided that Graphviz is the easiest. Thanks to dreampuf for their Graphviz online program! I used this web-based implementation and didn’t have to install Graphviz on my computer. So for this, we’ll be using this online Graphviz package: https://dreampuf.github.io/GraphvizOnline

I wrote a basic Stata script that inputs data and then outputs Graphviz code that you can copy and paste right into the Graphviz website above. (I strongly strongly strongly recommend saving your Graphviz code locally on your computer as a text file. On Windows I use Notepad++. Don’t save this code in a word processor because it will do unpredictable things to the quotes.) You can then tweak the settings of the outputted Graphviz code to your liking. See all sorts of settings in the left-sided menu here: https://graphviz.org/docs/nodes/

Originally I wanted this to be a network graph (“neato”) but ultimately liked how it looked best in the hierarchical graph (“dot”). You can change between graph types using the engine dropdown menu on the top right of the Graphviz online website. You can also change the file type to something that journals will use, like PNG, on the top right.

Code to output Graphviz code from your Stata database

Run the following in its entirety from a Stata do file.

clear all // clear memory
//
// Now input variables called 'start' and 'end#'.
// 'start' is the originating node and 'end#' is
// every node that 'start' connects to.
// If you need additional 'end#' variables, just add them
// using strL (capital L) then the next 'end#' number.
// In this example, there's 1 'start' and 4 'end#'
// so there are 5 total columns. 
input strL start strL end1 strL end2 strL end3 strL end4
"ant" "bat" "cat" "dog" "fox"
"bat" "ent" "" "" ""
"cat" "ent" "fox" "" ""
"dog" "ent" "fox" "" ""
end // end input
// 
// The following code reshapes the data from wide
// to long and drops all subsequent blank variables
// from that rotation (if any). 
gen row = _n //make row variable
reshape long end, i(row) j(nodenum) // reshape
drop if end=="" // drop empty cells
keep start end // just need these variables
//
// The following loop renders the current Stata
// dataset as Graphviz code. Since this uses loops
// and local macros, it needs to be run all at once 
// in a do file rather than line by line. 
local max = _N
quietly {
	// Start of graphviz code:
	noisily di ""
	noisily di ""
	noisily di ""
	noisily di ""
	noisily di "// Copy everything that follows"
	noisily di "// and paste it into here:"
	noisily di "// https://dreampuf.github.io/GraphvizOnline"
	noisily di "digraph g {"
	//
	// This prints out the connection between
	// each 'start' node and all connected
	// 'end#' nodes one by one.
	forvalues n = 1/`max' {
		noisily di start[`n'] " -> " end[`n'] ";"
	}
	//
	// Global graph attributes follows. 
	// "bb" sets the size of the figure
	// from lower left x, y, then upper right x, y.
	// There are lots of other settings here: 
	// https://graphviz.org/docs/graph/
	// ...if adding more, just add between the final
	// comma and closing bracket. If adding several
	// additional settings here, separate each with 
	// a comma. 
	// Note that this has an opening and closing tick
	// so the quotes inside print like characters
	// and not actual stata code quotes.
	noisily di `"graph [bb="0,0,100,1000",];"' 
	//
	// The next block generates code to render each
	// node. First, we need to reshape long(er) so that
	// all of the 'start' and 'end#' variables are all
	// in a single column, delete duplicates, and 
	// sort them. 
	rename start thing1
	rename end thing2
	gen row = _n
	reshape long thing, i(row) j(nodenum) 
	keep thing
	duplicates drop
	sort thing
	//
	// Now print out settings for each node. These
	// can be fine tuned. Lots of options for 
	// node formatting here: 
	// https://graphviz.org/docs/nodes/
	local max = _N
	forvalues n= 1/`max' {
		noisily di thing[`n'] `" [width="0.1", height="0.1", fontsize="8", shape=box];"'
	}
	// End of graphviz code: 
	noisily di "}"
	noisily di "// don't copy below this line"
	noisily di ""
	noisily di ""
	noisily di ""
	noisily di ""
}
// that's it!

The above Stata code prints the following Graphviz code in the Stata output window. This code can be copied/pasted to the Graphviz website linked above. (Make sure to save a backup of this Graphviz code as a txt file on your computer!!) Make sure your Stata screen is full size before running the above Stata code or it might insert some line breaks that you have to manually delete since the output width is (usually) determined by the window size. Also, if your node settings get long, it’ll also insert line breaks that you’ll have to manually delete.





// Copy everything that follows
// and paste it into here:
// https://dreampuf.github.io/GraphvizOnline
digraph g {
ant -> bat;
ant -> cat;
ant -> dog;
ant -> fox;
bat -> ent;
cat -> ent;
cat -> fox;
dog -> ent;
dog -> fox;
graph [bb="0,0,100,1000",];
ant [width="0.1", height="0.1", fontsize="8", shape=box];
bat [width="0.1", height="0.1", fontsize="8", shape=box];
cat [width="0.1", height="0.1", fontsize="8", shape=box];
dog [width="0.1", height="0.1", fontsize="8", shape=box];
ent [width="0.1", height="0.1", fontsize="8", shape=box];
fox [width="0.1", height="0.1", fontsize="8", shape=box];
}
// don't copy below this line

Example figures from outputted code above using the different Graphviz engines

Clicking the dropdown in the top right “engine” toggles between the figures below. You can learn more about these here: https://graphviz.org/docs/layouts/

Not shown below are “nop” and “nop2” which don’t render correctly for unclear reasons. Some of these will need to be tweaked to be publication quality, some of them frankly don’t work with this dataset. For this made up code, I think dot and neato look great!

Dot (hierarchical or layered drawing of directed graphs, my favorite for this project):

Neato (a nice network graph, called “spring model” layout):

Circo, aka circular layout:

fdp (force-directed placement):

sfdp (scalable force-directed placement):

twopi (radial layout):

osage (clustered graphs):

Patchwork (clustered graph using squarified treemap):

Creating a desktop shortcut to Stata in Ubuntu Linux

I’m a Linux novice and installed Stata 18 on my new Ubuntu 24.10 dual boot laptop. But! Stata doesn’t show up as an installed program in my launcher. I found the installed files, including the executable for the GUI-based version of Stata 18 (“xstata-se”) in the /usr/local/stata18/ folder. I wanted to make a desktop shortcut to that folder but there doesn’t seem to be an option to make a shortcut from the Ubuntu file launcher. Instead, I followed the directions from the user ‘forester’ here and did it from the terminal.

Pop open your terminal by clicking ctrl+alt+T and plop in the text that follows, substituting your user name where indicated below

sudo ln -s /usr/local/stata18/ /home/[your user name]/Desktop

After you enter your password, a new link should appear on your desktop

FYI: If you are using a different version of Stata, the number will be different for the Stata folder (e.g., stata19).

Also, I had tried to create a desktop link to the xstata-se file itself, but clicking the link wouldn’t run Stata. Popping open the parent folder that the executable lives in is pretty close so I’ll stick with it.

Part 8: Regressions

There are all sorts of models out there for performing regressions. This page focuses on 3 that are used for binary outcomes:

Logistic regression
Modified Poisson regression
Cox proportional hazards models.

Getting set up

Before you get going, you want to explicitly define your outcome of interest (aka dependent variable), primary exposure (aka independent variable) and covariates that you are adjusting for in your model (aka other independent variables). You’ll also want to know your study design (eg case-control? cross-sectional? prospective? time to event?).

Are your independent variables continuous or factor/categorical?

In these regression models, you can specify that independent variables (primary exposure and covariates that are in your model) are continuous or factor/categorical. Continuous variables can be specified with a leading “c.”, and examples might include age, height, or length. (“c.age”.) In contrast, factor/categorical variables might be New England states (e.g., 1=Connecticut, 2=Maine, 3=Mass, 4=New Hampshire, 5=RI, and 6=Vermont). So, you’d want to specifiy a variable for “newengland” as i.newengland in your regression. Stata defaults to treating the smallest number (here, 1=Connecticut) as the reference group. You can change that by using the “ib#.” prefix instead, where the # is the reference group, or here ib2.newengland to have Maine as the reference group.

Read about factor variables in –help fvvarlist–.

What about binary/dichotomous variables (things that are ==0 or 1)? Well, it doesn’t change your analysis if you specify a binary/dichotomous as either continuous (“c.”) or factor/categorical (“i.”). The math is the same on the back end.

Checking for interactions

In general, when checking for an interaction, you will need to specify if the two variables of interest are categorical or factor/categorical and drop two pound signs in between the two. See details on –help fvvarlist–. Here’s an example of how this would look, looking for an interaction between sex on age groups.

regress bp i.sex##c.agegrp

      Source |       SS           df       MS      Number of obs   =       240
-------------+----------------------------------   F(3, 236)       =     23.86
       Model |   9519.9625         3  3173.32083   Prob > F        =    0.0000
    Residual |  31392.8333       236   133.02048   R-squared       =    0.2327
-------------+----------------------------------   Adj R-squared   =    0.2229
       Total |  40912.7958       239  171.183246   Root MSE        =    11.533

------------------------------------------------------------------------------
          bp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
         sex |
     Female  |     -4.275   3.939423    -1.09   0.279    -12.03593    3.485927
      agegrp |     7.0625   1.289479     5.48   0.000      4.52214     9.60286
             |
sex#c.agegrp |
     Female  |      -1.35   1.823599    -0.74   0.460    -4.942611    2.242611
             |
       _cons |   143.2667   2.785593    51.43   0.000     137.7789    148.7545
------------------------------------------------------------------------------

You’ll see the sex#c.agegrp P-value is 0.460, so that wouldn’t qualify as a statistically significant interaction.

Regressions using survey data?

If you are using survey analyses (eg need to account for pweighting), generally you have to use the svy: prefixes for your analyses. This includes svy: logistic, poisson, etc. Type –help svy_estimation– to see what the options are.

Logistic regression

Logistic regressions provide odds ratios of binary outcomes. Odds ratios don’t approximate the risk ratio if the outcome is common (eg >10% occurrence) so I tend to avoid them as I study hypertension, which occurs commonly in a population.

There are oodles of details on logistic regression in the excellent UCLA website. In brief, you want want to use a regression command, you can use “logit” and get the raw betas or “logistic” and get the odds ratios. Most people will want to use “logistic” to get the odds ratios.

Here’s an example of logistic regression in Stata. In this dataset, “race” is Black/White/other, so you’ll need to specify this as a factor/categorical variable with the “i.” prefix, however “smoke” is binary so you can either specify it as a continuous or as a factor/categorical variable. If you don’t specify anything, then it is treated as continuous, which is fine.


webuse lbw, clear
codebook race // see race is 1=White, 2=Black, 3=other
codebook smoke // smoke is 0/1
logistic low i.race smoke 

// output: 
Logistic regression                                     Number of obs =    189
                                                        LR chi2(3)    =  14.70
                                                        Prob > chi2   = 0.0021
Log likelihood = -109.98736                             Pseudo R2     = 0.0626

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      Black  |   2.956742   1.448759     2.21   0.027     1.131716    7.724838
      Other  |   3.030001   1.212927     2.77   0.006     1.382616     6.64024
             |
       smoke |   3.052631   1.127112     3.02   0.003     1.480432    6.294487
       _cons |   .1587319   .0560108    -5.22   0.000     .0794888    .3169732
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

So, here you see that the odds ratio (OR) for the outcome of “low” is 2.96 (95% CI 1.13, 7.72) for Black relative to White participants and the OR for the outcome of “low” is 3.03 (95% CI 1.38, 6.64) for Other race relative to White participants. Since it’s the reference group, you’ll notice that White race isn’t shown. But what if we want Black race as the reference? You’d use “ib2.race” instead of “i.race”. Example:

logistic low ib2.race smoke 
// output:

Logistic regression                                     Number of obs =    189
                                                        LR chi2(3)    =  14.70
                                                        Prob > chi2   = 0.0021
Log likelihood = -109.98736                             Pseudo R2     = 0.0626

------------------------------------------------------------------------------
         low | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      White  |   .3382101   .1657178    -2.21   0.027     .1294525    .8836136
      Other  |   1.024777   .5049663     0.05   0.960     .3901157    2.691938
             |
       smoke |   3.052631   1.127112     3.02   0.003     1.480432    6.294487
       _cons |   .4693292   .2059043    -1.72   0.085     .1986269    1.108963
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Modified Poisson regression

Modified Poisson regression (sometimes called Poisson Regression with Robust Variance Estimation or Poisson Regression with Sandwich Variance Estimation) is pretty straightforward. Note: this is different from ‘conventional’ Poisson regression, which is used for counts and not dichotomous outcomes. You use the “poisson” command with subcommands “, vce(robust) irr” to use the modified Poisson regression type. Note: with svy data, robust VCE is the default so you just need to use the subcommand “, irr”.

As with the logistic regression section above, race is a 3-level nominal variable so we’ll use the “i.” or “ib#.” prefix to specify that it’s not to be treated as a continuous variable.

webuse lbw, clear
codebook race // see race is 1=White, 2=Black, 3=other
codebook smoke // smoke is 0/1
// set Black race as the reference group
poisson low ib2.race smoke, vce(robust) irr
// output:

Iteration 0:  Log pseudolikelihood = -122.83059  
Iteration 1:  Log pseudolikelihood = -122.83058  

Poisson regression                                      Number of obs =    189
                                                        Wald chi2(3)  =  19.47
                                                        Prob > chi2   = 0.0002
Log pseudolikelihood = -122.83058                       Pseudo R2     = 0.0380

------------------------------------------------------------------------------
             |               Robust
         low |        IRR   std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        race |
      White  |    .507831    .141131    -2.44   0.015     .2945523    .8755401
      Other  |   1.038362   .2907804     0.13   0.893     .5997635      1.7977
             |
       smoke |   2.020686   .4300124     3.31   0.001     1.331554    3.066471
       _cons |   .3038099   .0800721    -4.52   0.000     .1812422    .5092656
------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

Here you see that the relative risk (RR) of “low” for White relative to Black participants is 0.51 (95% CI 0.29, 0.88), and the RR for other race relative to Black participants is 1.04 (95% CI 0.60, 1.80). As above, the “Black” group isn’t shown since it’s the reference group.

Cox proportional hazards model

Cox PH models are used for time to event data. This is a 2-part. First is to –stset– the data, thereby telling Stata that it’s time to event data. In the –stset– command, you specify the outcome/failure variable. Second is to use the –stcox– command to specify the primary exposure/covariates of interest (aka independent variables). There are lots of different steps to setting up the stset command, so make sure to check the –help stset– page. Ditto –help stcox–.

Here, you specify the days of follow-up as “days_hfhosp” and failure as “hfhosp”. Since follow-up is in days, you set the scale to 365.25 to have it instead considered as years.

stset days_hfhosp, f(hfhosp) scale(365.25)

Then you’ll want to use the stcox command to estimate the risk of hospitalization by beta blocker use by age, sex, and race with race relative to the group that is equal to 1.

stcox betablocker age sex ib1.race

You’ll want to make a Kaplan Meier figure at the same time, read about how to do that on this other post.

Printing hazard ratio on Kaplan Meier curve in Stata

I recently made a figure that estimates a hazard ratio and renders it right on top of a Kaplan Meier curve in Stata. Here’s some example code to make this.

Good luck!


// Load example dataset. I got this from the --help stset-- file
webuse diet, clear

// First, stset the data. 
stset dox /// dox is the event or censor date
, ///
failure(fail) /// "fail" is the failure vs censor variable
scale(365.25)


// Next, estimate a cox ph model by "hienergy"
stcox hienergy
// now grab the bits from output of this
local hrb=r(table)[1,1]
local hrlo=r(table)[5,1]
local hrhi=r(table)[6,1]
local pval = r(table)[4,1]
// now format the p-value so it's pretty
if `pval'>=0.056 {
	local pvalue "P=`: display %3.2f `pval''"
}
if `pval'>=0.044 & `pval'<0.056 {
	local pvalue "P=`: display %5.4f `pval''"
}
if `pval' <0.044 {
	local pvalue "P=`: display %4.3f `pval''"
}
if `pval' <0.001 {
	local pvalue "P<0.001"
}
if `pval' <0.0001 {
	local pvalue "P<0.0001"
}

di "original P is " `pval' ", formatted is " "`pvalue'"
di "HR " %4.2f `hrb' " (95% CI " %4.2f `hrlo' "-" %4.2f `hrhi' "; `pvalue')"

// Now make a km plot. this example uses CIs
sts graph ///
, ///
survival /// 
by(hienergy) ///
plot1opts(lpattern(dash) lcolor(red)) /// options for line 1
plot2opts(lpattern(solid) lcolor(blue)) /// options for line 2
ci /// add CIs
ci1opts(color(red%20)) /// options for CI 1
ci2opts(color(blue%20)) /// options for CI 2
/// Following this is the legend, placed in the 6 O'clock position. 
/// Only graphics 5 and 6 are needed, but all 6 are shown so you 
/// see that other bits that can show up in the legend. Delete 
/// everything except for 5 and 6 to hide the rest of the legend components
legend(order(1 "[one]" 2 "[two]" 3 "[three]" 4 "[four]" 5 "First group" 6 "Second group") position(6)) ///
/// Risk table to print at the bottom:
risktable(0(5)20 , size(small) order(1 "First group" 2 "Second group")) ///
title("Title") ///
t1title("Subtitle") ///
xtitle("Year of Follow-Up") ///
ytitle("Event-Free Survival") ///
/// Here's how you render the HR. Change the first 2 numbers to move it:
text(0 0 "`: display "HR " %4.2f `hrb' " (95% CI " %4.2f `hrlo' "-" %4.2f `hrhi' "; `pvalue')"'", placement(e) size(medsmall)) ///
yla(0(0.2)1)

Stata graph box boxplots with different colors for –over– groups

I recently had to make some boxplots with Stata’s –graph box– command. I find this to be challenging to use since it varies from syntax from the –twoway– command set that I use all the time. I was using the –over– subcommand x2 and wanted to change the colors of each box & dots by group from the first –over– subcommand. I found some helpful details on the Statalist Forum here and here. Here’s what I did to accomplish this, using the help from the Statalist forum.

Some tweaks here: I wanted to show rotate some labels 45 degrees with –angle– and I also aggressively labeled variables and their values so I didn’t need to manually relabel the figure (which is done with the –relabel– subcommand if needed). It takes an extra 30 seconds to label variables and values, and will save you lots of headbanging fiddling with the –relabel– command, so just label your variables and values from the start.

This example uses fake data. Code follows the picture. Good luck!

clear all
set obs 1000 // blank dataset with 1000 observations
set seed 8675309 // jenny seed
gen group = round(runiform()) // make group that is 0 or 1
gen time = round(3*runiform()) // make 4 times, 0 through 3
replace time = time+1 // now time is 1 through 4
gen tommy2tone = 100*rbeta(3,10) // fake skewed data

// now apply labels to variables.
// technically you only need to label the 3rd one
// of these since categorical variable value labels
// are shown instead of the variable label itself,
// but might as well do all 3 in case you need them
// labeled somewhere else.
label variable group "Group"
label variable time "Time"
label variable tommy2tone "Jenny Lyrics remembered, %"

// now make value labels.
* group
label define grouplab 0 "Tommy" 1 "Tutone"
label values group grouplab //DON'T FORGET TO APPLY LABELS

* time
label define timelab 1 "Time 1" 2 "Time 2" 3 "Time 3" 4 "Time 4"
label values time timelab //DON'T FORGET TO APPLY LABELS

// code for boxplot
graph box tommy2tone ///
, ///
over(group, label(angle(45))) ///
over(time) ///
scale(1.3) /// embiggen labels & figure components
box(1, color(red)) marker(1, mcolor(red)) ///
box(2, color(blue)) marker(2, mcolor(blue)) ///
asyvars showyvars leg(off)

Making a table in Stata for regression results (and other output) using frames

Frames were introduced in Stata 16 and are handy for (a) storing/manipulating multiple datasets simultaneously and (b) building datasets on the fly. I’ve had good luck making a table using frames. This strategy includes (1) making a new frame with as many columns as you need, specifying they are long strings (strL), and printing the top row, (2) running regressions or whatnot, (3) saving components of the regression as local macros that print as text, (4) making “display” macros for each column, (5) saving those “display” local macros to the new frame, repeating steps 2-5 as many times as needed, and (6) exporting the new frame as an Excel file.

Stata has recently introduced “tables” and “collect” features that allow you to do similar things. I find those features overly confusing, so I’ve developed this approach. I hope you find it useful! Here’s what we are trying to make:

Note: these details depend on use of concepts discussed on this other post (“return list” “ereturn list” “matrix list r(table)”, local macros, etc.). Make sure to familiarize yourself with that post.

Problems you might run into here:

This uses local macros, which disappear when a do file stops running. You need to run the entire do file from top to bottom every time, not line by line.
Stata’s -frames post- functionality is finicky. It expects exactly as many posted variables as there are variables in the new frame, and that each posted format matches the predefined format on the frame’s variable.
Three forward slashes (“///”) extends things across lines. Two forward slashes (“//”) allows commenting but does not continue to the next line. Check your double and triple forward slashes.
Stata can’t write to an open excel file. Close the excel file before re-running.

Code

* Reset frames and reload data
frames reset
sysuse auto, clear
*
* STEP 1 - Make a new frame with long strings
*
* Make new frame called "table" and 
* define all columns/variables as being strings.
* Name all variables as col#, starting with col1.
* You can extend this list as long as you'd like.
frame create table /// <--TRIPLE SLASH
strL col1 /// name 
strL col2  /// n
strL col3 /// beta
strL col4 /// 95% CI
strL col5 // <-- DOUBLE SLASH, P-value 
* If you want to add more col#s, make sure you change 
* the last double slashes to triple slashes, all new 
* col#s should have triple slashes except the last,
* which should have double (or no) slashes
*
* Prove to yourself that you have made a new frame
* called "table" in addition to the "default" one
* with the auto dataset. 
frames dir
* You could switch to the new table frame with 
* the "cwf table" command, if interested. To switch
* back, type "cwf default". 
*
* STEP 1B - print out your first row
* 
frame post table /// <--TRIPLE SLASH
("Variable name") /// col1
("N") /// col2
("Beta") /// col3
("95% CI") /// col4
("P-value") // <--DOUBLE SLASH col5 
*
* You can repeat this step as many times as you want 
* to add as many rows of custom text as needed. 
* Note that if you want a blank column, you need
* to still include the quotations within the parentheses.
* eg, if you didn't want the first column to have 
* "variable name", you'd put this instead:
* strL ("") /// col1
*
* If you wanted to flip over to the table frame and 
* see what's there, you'd type:
*cwf table 
*bro
*
* ...and to flip back to the default frame, type:
*cwf default
*
* STEP 2 - run your regression and look where the 
* coefficients of interest are stored 
* and
* STEP 3 - saving components of regression as local
* macros
*
regress price weight
*
ereturn list
* The N is here under e(N), save that as a local macro
local n1_1 = e(N)
*
matrix list r(table)
* The betas is at [1,1], the 95% thresholds
* are at [5,1] and [6,1], the se is
* at [2,1], and the p-value is at [4,1].
* We weren't planning on using se in this table,
* but we'll grab it anyway in case we change our minds
local beta1_1 = r(table)[1,1]
local ll1_1 = r(table)[5,1]
local ul1_1 = r(table)[6,1]
local se1_1 = r(table)[2,1]
local p1_1 = r(table)[4,1]
*
* STEP 4 - Making "display" macros
* 
* We are going to use local macros to grab things
* by column with a "display" commmand. Note that 
* column 1 is name, which we are going to call "all" here. 
* You could easily grab the variable name here with 
* the "local lab:" command detailed here:
* https://www.stata.com/statalist/archive/2002-11/msg00087.html
* or: 
*local label_name: variable label foreign 
* ...then call `label_name' instead of "all"
* 
* Now this is very important, we are not going to just
* capture these variables as local macros, we are going
* to capture a DISPLAY command followed by how we
* want the text to be rendered in the table. 
* Immediately after defining the local macro, we are going
* to call the macro so we can see what it will render 
* as in the table. This is what it will look like:
*local col1 di "All"
*`col1'
*
* IF YOU HAVE A BLANK CELL IN THIS ROW OF YOUR TABLE, 
* USE TWO EMPTY QUOTATION MARKS AFTER THE "di" COMMAND, eg: 
* local col1 di ""
* 
* now we will grab all pieces of the first row, 
* column-by-column, at the same time:
local col1 di "All" // name
`col1'
local col2 di `n1_1' // n
`col2'
local col3 di %4.2f `beta1_1' // betas
`col3'
local col4 di %4.2f `ll1_1' " to " %4.2f `ul1_1' // 95% ci
`col4'
local col5 di %4.3f `p1_1' // p-value
`col5' 
*
* note for the P-value, you can get much fancier in 
* formatting, see my script on how to do that here: 
* https://blog.uvm.edu/tbplante/2022/10/26/formatting-p-values-for-stata-output/
*
* STEP 5 - Posting the display macros to the table frame
* 
* Now post all columns row-by-row to the table frame
frame post table ///
("`: `col1''") ///
("`: `col2''") ///
("`: `col3''") ///
("`: `col4''") ///
("`: `col5''") // <-- DOUBLE SLASH
*
* Bonus! Insert a text row
*
frame post table /// <--TRIPLE SLASH
("By domestic vs. foreign status") /// col1
("") /// col2
("") /// col3
("") /// col4
("") // <--DOUBLE SLASH col5 
* Now repeat by foreign status
* domestic
regress price weight if foreign ==0 
ereturn list
local n1_1 = e(N)
*
matrix list r(table)
local beta1_1 = r(table)[1,1]
local ll1_1 = r(table)[5,1]
local ul1_1 = r(table)[6,1]
local se1_1 = r(table)[2,1]
local p1_1 = r(table)[4,1]
*
* note: you could automate col1 with the following command,
* which would grabe the label from the foreign==0 value of
* foreign:
*local label_name: label foreign 0
* ... then call `label_name' in the "local col1 di". 
local col1 di "  Domestic" // name, with 2 space indent
`col1'
local col2 di `n1_1' // n
`col2'
local col3 di %4.2f `beta1_1' // betas
`col3'
local col4 di %4.2f `ll1_1' " to " %4.2f `ul1_1' // 95% ci
`col4'
local col5 di %4.3f `p1_1' // p-value
`col5' 
*
frame post table ///
("`: `col1''") ///
("`: `col2''") ///
("`: `col3''") ///
("`: `col4''") ///
("`: `col5''") // <-- DOUBLE SLASH
*
* foreign
regress price weight if foreign ==1
ereturn list
local n1_1 = e(N)
*
matrix list r(table)
local beta1_1 = r(table)[1,1]
local ll1_1 = r(table)[5,1]
local ul1_1 = r(table)[6,1]
local se1_1 = r(table)[2,1]
local p1_1 = r(table)[4,1]
*
local col1 di "  Foreign" // name, with 2 space indent
`col1'
local col2 di `n1_1' // n
`col2'
local col3 di %4.2f `beta1_1' // betas
`col3'
local col4 di %4.2f `ll1_1' " to " %4.2f `ul1_1' // 95% ci
`col4'
local col5 di %4.3f `p1_1' // p-value
`col5' 
*
frame post table ///
("`: `col1''") ///
("`: `col2''") ///
("`: `col3''") ///
("`: `col4''") ///
("`: `col5''") // <-- DOUBLE SLASH
*
* STEP 6 - export the table frame as an excel file
* 
* This is the present working directory, where this excel
* file will be saved if you don't specify a directory:
pwd
*
* Switch to the table frame and take a look at it:
cwf table
bro 
* Now export it to excel
export excel using "table.xlsx", replace
*
* That's it!

Generating overlapping/overlaying decile frequency histograms in Stata

I recently had a dataset with two groups (0 or 1), and a continuous variable. I wanted to show how the overall deciles of that continuous variable varied by group. Step 1 was to generate an overall decile variable with an –xtile– command. Step 2 was to make a frequency histogram. BUT! I wanted these histograms to overlap and not be side-by-side. Stata’s handy –histogram– is a quick and easy way to make histograms by groups using the –by– command, but it makes them side-by-side like this, and not overlapping. (Note: see how to use –twoway histogram– to make overlapping histograms at the end of this post.)

I instead used a collapse command to generate a count of # in each decile by group (using the transparent color command as color percent sign number), like this:

Here’s the code to make both:

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

// make overall deciles of continuousvalue
xtile decilesbygroup = continuousvalue, nq(10)

// now make a frequency histogram of those deciles
set scheme s1color // I like this scheme
hist decilesbygroup, by(var0or1) frequency bin(10)

// make a variable equal to 1 that we will sum in collapse
gen countbygroup = 1
// now sum that variable by the 0 or 1 indicator and deciles
collapse (sum) countbygroup, by(var0or1 decilesbygroup)
// now render the count from above as a bar graph:
set scheme s1color // I like this scheme
twoway ///
(bar countbygroup decilesbygroup if var0or1==0, vertical color(red%40)) ///
(bar countbygroup decilesbygroup if var0or1==1, vertical color(blue%40)) ///
, ///
legend(order(1 "var0or1==0" 2 "var0or1==1")) ///
title("Title!") ///
xtitle("Decile of continuousvalue") ///
xla(1(1)10) ///
yla(0(10)70, angle(0)) ///
ytitle("N in Decile")

You could also offset the deciles by the var0or1 and shrink the bar width a bit to get a frequency histogram where the bars are next to each other, like this:

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

// make overall deciles of continuousvalue
xtile decilesbygroup = continuousvalue, nq(10)

// now make a frequency histogram of those deciles
set scheme s1color // I like this scheme
hist decilesbygroup, by(var0or1) frequency bin(10)

// offset the decilesbygroup by var0or1 a bit:
replace decilesbygroup = decilesbygroup - 0.2 if var0or1==0
replace decilesbygroup = decilesbygroup + 0.2 if var0or1==1

// make a variable equal to 1 that we will sum in collapse
gen countbygroup = 1
// now sum that variable by the 0 or 1 indicator and deciles
collapse (sum) countbygroup, by(var0or1 decilesbygroup)
// now render the count from above as a bar graph:
set scheme s1color // I like this scheme
twoway ///
(bar countbygroup decilesbygroup if var0or1==0, vertical color(red%40) barwidth(0.4)) ///
(bar countbygroup decilesbygroup if var0or1==1, vertical color(blue%40) barwidth(0.4)) ///
, ///
legend(order(1 "var0or1==0" 2 "var0or1==1")) ///
title("Title!") ///
xtitle("Decile of continuousvalue") ///
xla(1(1)10) ///
yla(0(10)70, angle(0)) ///
ytitle("N in Decile")

A few quick notes here: The way that I am specifying the “bins” for the histograms here is different than how Stata specifies bins for histograms, since I’m forcing it to render by decile. If you were to generate a histogram of the “continuousvalue” instead of the above example using “decilebygroup”, you’ll notice that the resulting histograms looks a bit different from each other:

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

// make overall deciles of continuousvalue
xtile decilesbygroup = continuousvalue, nq(10)

// now make a frequency histogram of those deciles
set scheme s1color // I like this scheme
hist decilesbygroup, title("hist decilesbygroup") by(var0or1) frequency bin(10) name(a)
hist continuousvalue, title("hist continuousvalue") by(var0or1)  frequency bin(10) name(b)

Also, this code will only render frequency histograms, not density histograms, which are the default in Stata. You can also use the –twoway hist– command to overlay two bar graphs, but these might not perfectly align with the deciles. But, using the –twoway hist– allows you to use density histograms instead. See the example that follows. I suspect that most people will get what they need with the –twoway hist– command in Stata.

clear all

// make fake data
set obs 1000
set seed 8675309
gen id=_n // ID of 1 though 100
gen var0or1 = round(runiform())
gen continuousvalue = 100*runiform()

set scheme s1color // I like this scheme
twoway ///
(hist continuousvalue if var0or1==0, bin(10) color(red%40) density) ///
(hist continuousvalue if var0or1==1, bin(10) color(blue%40) density) ///
, ///
legend(order(1 "var0or1==0" 2 "var0or1==1")) ///
title("Title!") ///
xtitle("Grouping in 10 Bins")

Making Box Plots in Stata from scratch

Stata has some handy built-in features to make boxplots. If you are trying to make a typical boxplot in Stata, go read up on the –graph box– command as described on this post.

I was asked to abstract a boxplot from an old paper and re-render it in Stata.

I used the excellent WebPlotDigitizer to abstract the points in these figure. It took me a few rounds of data abstraction to get all of the data. I generated the following variables

(1) “row” – rows 1-8 — the “A-D” labels on the x axis are offset at 1.5, 3.5, 5.5, and 7.5, respectively,
(2) “group” – an indicator for group 1 or 2,
(3-5) “boxlow” “boxmid” and “boxhigh” – the lower, mid, and upper bounds of the box,
(6-7) “bar1” and “bar2” – the low and upper end of the bars, and
(8) “dot” – extreme values.

I used “input” at the top of a do file to load these data. The colors were taken from ColorBrewer2. The “box” are just really wide pcspike lines with overlaying horizontal bars as floating text boxes to indicate the bottom, median, and top of the box. One problem is the placement of the horizontal lines on and above/below the box itself, since the “―” ascii character isn’t perfectly in the middle of the vertical space that is used to render it. To get around this, I included an offset value that could be modified to shift these lines up and down so they lined up with the boxes. The “whiskers” are an rcap. The extreme values are just scatterplot dots.

Code to make this follows, data here are fake.

// abstracted with this: https://apps.automeris.io/wpd/
clear all

input row group boxlow boxmid boxhigh bar1 bar2 dot
1 1 110 130 140 100 150 .
2 2 115 125 135 90 150 .
3 1 135 145 155 120 175 . 
4 2 80 110 115 70 125 .
5 1 160 175 180 140 200 .
6 2 120 130 140 110 160 .
7 1 145 160 170 135 190 .
8 2 120 135 155 110 160 .
1 . . . . . . 95
1 . . . . . . 100
1 . . . . . . 155
1 . . . . . . 160
2 . . . . . . 80
2 . . . . . . 85
2 . . . . . . 155
2 . . . . . . 160
3 . . . . . . 110
3 . . . . . . 115
3 . . . . . . 180
3 . . . . . . 185
3 . . . . . . 190
4 . . . . . . 60
4 . . . . . . 65
4 . . . . . . 130
4 . . . . . . 135
5 . . . . . . 130
5 . . . . . . 135
5 . . . . . . 210
5 . . . . . . 140
6 . . . . . . 100
6 . . . . . . 105
6 . . . . . . 170
6 . . . . . . 175
7 . . . . . . 125
7 . . . . . . 130
7 . . . . . . 200
7 . . . . . . 210
8 . . . . . . 100
8 . . . . . . 105
8 . . . . . . 170
8 . . . . . . 175
end



// offset to move the bars on the box up or down on Y-axis,
// tweak as needed:
local offset = 2

set scheme s1mono

twoway ///
(rcap bar1 bar2 row, vert lcolor(black)) /// code for 95% CI
(pcspike boxlow row boxhigh row if group==1, vert lwidth(vvvthick) lcolor("141 211 199")) ///
(pcspike boxlow row boxhigh row if group==2, vert lwidth(vvvthick) lcolor("255 255 179")) ///
(scatter dot row , mcolor(black) msymbol(o) msize(tiny)) ///
, ///
text(`=boxmid[1]+`offset'' 1 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[1]+`offset'' 1 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[1]+`offset'' 1 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[2]+`offset'' 2 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[2]+`offset'' 2 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[2]+`offset'' 2 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[3]+`offset'' 3 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[3]+`offset'' 3 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[3]+`offset'' 3 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[4]+`offset'' 4 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[4]+`offset'' 4 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[4]+`offset'' 4 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[5]+`offset'' 5 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[5]+`offset'' 5 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[5]+`offset'' 5 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[6]+`offset'' 6 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[6]+`offset'' 6 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[6]+`offset'' 6 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[7]+`offset'' 7 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[7]+`offset'' 7 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[7]+`offset'' 7 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
text(`=boxmid[8]+`offset'' 8 "―", placement(c) size(medlarge) color(black)) ///
text(`=boxlow[8]+`offset'' 8 "{bf:―}", placement(c) size(medlarge) color(black)) ///
text(`=boxhigh[8]+`offset'' 8 "{bf:―}", placement(c) size(medlarge) color(black)) ///
///
legend(order(2 "Group 1" 3 "Group 2") row(2) size(small)) ///
scale(1.3) ///
///
xtitle("X Title") /// 
ytitle("Y Title") ///
///
title("Title") ///
xsize(5) ///
ysize(6) ///
yla(0(50)200, angle(0)) ///
xscale(r(.5 8.5)) ///
xla(1.5 "A" 3.5 "B" 5.5 "C" 7.5 "D") ///
xline(2.5 4.5 6.5)

Making a subgroup analysis figure in Stata

I was the analyst for the myPACE trial (published here in JAMA Cardiology), and needed to put together a subgroup analysis figure. I didn’t find any helpful stock code, so I wrote my own. The code uses the Frames feature that was introduced in Stata 16. It will (a) make a new frame with the required variables but no data, (b) generate a new dichotomous variable from continuous variables, (c) generate labels, (d) grab the Ns, point estimates, and 95% CI from a logistic regression, (e) grab the P-value for interaction for the primary exposure dependent variable*, (f) write the point estimate/95% CI/P-value for interaction to the new frame, then (g) switch to the new frame and make this figure. This script uses local macros so needs to be run all at once in a do file, not line by line.

This uses a stock Stata dataset called “catheter”. The outcome of interest/dependent variable is “infect”, the primary exposure/independent variable of interest is “time”, and the subgroups are age, sex, and patient number. This uses logistic regression, but you can easily swap this model out for any other model.

*You can get more complex code to format the P-values here.

Here’s the code!

frame reset // drop all existing frames and data
webuse catheter, clear // load analytical dataset
version 16 // need Stata version 16 or newer
*
* Make an empty frame with the variables we'll add later row by row
* The variables "rowname" and "pvalue" will be strings so when
* you add to these variables with the --frame post-- command,
* you need to use quotes. 
frame create subgroup str30 rowname n beta low95 high95 str30 pval
*
*** Age
* Need to generate a dichotomous age variable 
* You don't need to do this if the variable is already dichotomous, 
* ordinal, or nominal
generate agesplit = (age>=50) if !missing(age) // below 50 is 0, 50 and above is 1
* 
* Now generate a label for the overall grouping
local label_group "{bf:Age}" 
* 
* Group 0, below age 50
* Generate label for this subgroup
local label_subgroup_0 "Under 50y" 
* Now run the model for this subgroup
logistic infect time if agesplit==0 
* Now save the N, beta, and 95% CI as local macros.
* There's lots you can save after a regression, type --return list--, 
* --ereturn list--, and --matrix list r(table)-- to see what's there
local n_0 = e(N) 
local beta_0 = r(table)[1,1] 
local low95_0 = r(table)[5,1]
local high95_0 = r(table)[6,1]
* print above local macros to prove you collected them correctly: 
di "For `label_subgroup_0', n=" `n_0' ", beta (95% CI)=" %4.2f `beta_0' " (" %4.2f `low95_0' " to " %4.2f `high95_0' ")" 
*
* Group 1, at least 50 years old
local label_subgroup_1 "At least 50y" 
logistic infect time if agesplit==1 
local n_1 = e(N)
local beta_1 = r(table)[1,1] 
local low95_1 = r(table)[5,1]
local high95_1 = r(table)[6,1]
di "For `label_group', subgroup `label_subgroup_1', n=" `n_1' ", beta (95% CI)=" %4.2f `beta_1' " (" %4.2f `low95_1' " to " %4.2f `high95_1' ")" 
* 
* Now run the model with an interaction term between the primary exposure
* and the subgroup. 
logistic infect c.time##i.agesplit
* Grab the p-value as a local macro saved as pval
local pval = r(table)[4,5] 
* Print that local macro to see that you've grabbed it correctly
di "P-int = " %4.3f `pval'
* Format that P-value and save that as a new local macro called pvalue
local pvalue "P=`: display %3.2f `pval''"
di "`pvalue'"
* 
* Now write your local macros to the the "subgroup" frame
* Each "frame post" command will add 1 additional row to the frame.
* We will graph these line by line.
* First line with overall group name a P-value for interaction:
frame post subgroup ("`label_group'") (.)  (.) (.) (.) ("`pvalue'") 
* Now each subgroup by itself:
frame post subgroup ("`label_subgroup_0'") (`n_0')  (`beta_0') (`low95_0') (`high95_0') ("") 
frame post subgroup ("`label_subgroup_1'") (`n_1')  (`beta_1') (`low95_1') (`high95_1') ("") 
* Optional blank line:
frame post subgroup ("") (.) (.) (.) (.) ("") 
*
*** Female sex
* This is already dichotomous, so don't need to create a new variable
* like we did for age.
local label_group "{bf:Sex}"  
*group 0, males
local label_subgroup_0 "Males" 
logistic infect time if female==0 
local n_0 = e(N) 
local beta_0 = r(table)[1,1] 
local low95_0 = r(table)[5,1]
local high95_0 = r(table)[6,1]
*group 1, females
local label_subgroup_1 "Females" 
logistic infect time if female==1 
local n_1 = e(N) // N
local beta_1 = r(table)[1,1] 
local low95_1 = r(table)[5,1]
local high95_1 = r(table)[6,1]
*interaction P-value
logistic infect c.time##i.female
local pval = r(table)[4,5]
local pvalue "P=`: display %3.2f `pval''"
*write to subgroup frame
frame post subgroup ("`label_group'") (.)  (.) (.) (.) ("`pvalue'") 
frame post subgroup ("`label_subgroup_0'") (`n_0')  (`beta_0') (`low95_0') (`high95_0') ("") 
frame post subgroup ("`label_subgroup_1'") (`n_1')  (`beta_1') (`low95_1') (`high95_1') ("") 
frame post subgroup ("") (.) (.) (.) (.) ("") 
*
*** patient
* need to generate a patient dichotomous variable
generate patientsplit = (patient>=20) if !missing(patient) // below 20 is 0, 20 and above is 1
local label_group "{bf:Patient}" 
*group 0, below 20
local label_subgroup_0 "Under 20th patient" 
logistic infect time if patientsplit==0 
local n_0 = e(N) 
local beta_0 = r(table)[1,1] 
local low95_0 = r(table)[5,1]
local high95_0 = r(table)[6,1]
*group 1, 20 and above
local label_subgroup_1 "At least the 20th patient" 
logistic infect time if patientsplit==1 
local n_1 = e(N) 
local beta_1 = r(table)[1,1] 
local low95_1 = r(table)[5,1]
local high95_1 = r(table)[6,1]
*interaction P-value
logistic infect c.time##i.agesplit
local pval = r(table)[4,5] 
local pvalue "P=`: display %3.2f `pval''"
*write to subgroup frame
frame post subgroup ("`label_group'") (.)  (.) (.) (.) ("`pvalue'") 
frame post subgroup ("`label_subgroup_0'") (`n_0')  (`beta_0') (`low95_0') (`high95_0') ("") 
frame post subgroup ("`label_subgroup_1'") (`n_1')  (`beta_1') (`low95_1') (`high95_1') ("") 
frame post subgroup ("") (.) (.) (.) (.) ("") 
*
*** Now make the figure. You'll have to modify this so the number of rows 
*   in your subgroup frame matches the labels and whatnot
set scheme s1mono // I like this scheme
* Change frame to the subgroup frame
cwf subgroup
* Generate a row number by the current order of the data in this frame
gen row=_n
* Here's the code to make the figure
twoway ///
(scatter row beta, msymbol(d) mcolor(black) msize(medium)) ///
(rcap low95 high95 row, horizontal lcolor(black) lwidth(medlarge)) ///
, ///
legend(off) ///
xline(1, lcolor(red) lpattern(dash) lwidth(medium)) ///
title("Title") ///
yti("Y Title") ///
xti("X Title") ///
yscale(reverse) ///
yla( ///
1 "`=rowname[1]'" ///
2 "`=rowname[2]', n=`=n[2]'" ///
3 "`=rowname[3]', n=`=n[3]'" ///
4 " " /// blank since it's a blank row
5 "`=rowname[5]'" ///
6 "`=rowname[6]', n=`=n[6]'" ///
7 "`=rowname[7]', n=`=n[7]'" ///
8 " " /// blank since it's a blank row
9 "`=rowname[9]'" ///
10 "`=rowname[10]', n=`=n[10]'" ///
11 "`=rowname[11]', n=`=n[11]'" ///
12 " " /// blank since it's a blank row
, angle(0) labsize(small) noticks) ///
xla(0.8(.2)2.2) ///
text(1 1.1 "`=pval[1]'", placement(e) size(small)) /// these are the p-value labels
text(5 1.1 "`=pval[5]'", placement(e) size(small)) ///
text(9 1.1 "`=pval[9]'", placement(e) size(small)) 
*
* Now export your figure as a PNG file
graph export "myfigure.png", replace width(1000)

University of Vermont

Tim Plante, MD MHS

Category: Stata code