I recently made these two figures with a Stata do file that (A) generates some fake data, (B) plots the data overall with a fitted line, (C) plots each participant individually with a fitted line, and (D) merges the four individual plots into one overall plot. One tricky piece was getting the –graph combine– command to render the four figures as squares; I had to fiddle with –ysize– to make that work. I had originally tried the –aspect– option, but it builds in lots of white space between the figures. The –ysize– approach seemed to work.
Here’s the Stata code to make the above figures.
************************
*****clear data/graphs**
****set graph schemes***
************************
clear
graph drop _all
set scheme s1mono
************************
*****generate data******
************************
set seed 8675309
// generate 40 random values
set obs 40
//...called x and y that range from -9 to +9
gen y = runiform(-9,9)
gen x = runiform(-9,9)
// now make an indicator for each grouping of 10
gen n =.
replace n= 1 if _n>=1 & _n<=10
replace n= 2 if _n>10 & _n<=20
replace n= 3 if _n>20 & _n<=30
replace n= 4 if _n>30 & _n<=40
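// (equivalently, in one line: gen n = ceil(_n/10))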
// now for each successive 10 rows, add 100, 120, 140, and 160
replace y = y+100 if n==1
replace x = x+100 if n==1
replace y = y+120 if n==2
replace x= x+120 if n==2
replace y= y+140 if n==3
replace x= x+140 if n==3
replace y= y+160 if n==4
replace x= x+160 if n==4
************************
*****single figure******
************************
twoway ///
/// line of unity (45 degree line):
(function y = x, ra(90 170) clpat(solid) clwidth(medium) clcolor(black)) ///
/// line of fit:
(lfit y x, lcolor(red) lpattern(dash) lwidth(thick)) ///
/// dots for each group, using colors from colorbrewer2
(scatter y x if n==1, mcolor("230 97 1") msymbol(O) mlcolor(black)) ///
(scatter y x if n==2, mcolor("253 184 99") msymbol(O) mlcolor(black)) ///
(scatter y x if n==3, mcolor("178 171 210") msymbol(O) mlcolor(black)) ///
(scatter y x if n==4, mcolor("94 60 153") msymbol(O) mlcolor(black)) ///
, ///
/// force size:
ysize(4) xsize(4) ///
/// axis titles and rows:
xtitle("Reference BP, mm Hg") ///
ytitle("Cuffless BP, mm Hg") ///
xla(90(20)170) ///
yla(90(20)170) ///
legend(off) ///
/// labels:
text(102 109 "←Participant A", placement(e)) ///
text(119 127 "←Participant B", placement(e)) ///
text(142 135 "Participant C→", placement(w)) ///
text(160 154 "Participant D→", placement(w))
/// export figure:
graph export figure1av2.png, replace width(1000)
************************
***four-piece figures***
*****to merge***********
************************
twoway ///
(function y = x if n==1, ra(90 110) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==1, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==1, mcolor("230 97 1") msymbol(O) mlcolor(black)) ///
, ///
title("Participant A") ///
xtitle(" ") ///
ytitle(" ") ///
yla(90(5)110) ///
xla(90(5)110) ///
legend(off) ///
name(figure1b1)
twoway ///
(function y = x if n==2, ra(110 130) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==2, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==2, mcolor("253 184 99") msymbol(O) mlcolor(black)) ///
, ///
title("Participant B") ///
xtitle(" ") ///
ytitle(" ") ///
yla(110(5)130) ///
xla(110(5)130) ///
legend(off) ///
name(figure1b2)
twoway ///
(function y = x if n==3, ra(130 150) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==3, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==3, mcolor("178 171 210") msymbol(O) mlcolor(black)) ///
, ///
title("Participant C") ///
xtitle(" ") ///
ytitle(" ") ///
yla(130(5)150) ///
xla(130(5)150) ///
legend(off) ///
name(figure1b3)
twoway ///
(function y = x if n==4, ra(150 170) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==4, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==4, mcolor("94 60 153") msymbol(O) mlcolor(black)) ///
, ///
title("Participant D") ///
xtitle(" ") ///
ytitle(" ") ///
yla(150(5)170) ///
xla(150(5)170) ///
legend(off) ///
name(figure1b4)
***********************
***combine figures*****
***********************
graph combine figure1b1 figure1b2 figure1b3 figure1b4, ///
b1title("Reference BP, mm Hg") ///
l1title("Cuffless BP, mm Hg") ///
/// this forces the four scatterplots to be the same height/width
/// details here:
/// https://www.stata.com/statalist/archive/2013-07/msg00842.html
cols(2) iscale(.7273) ysize(5.9) graphregion(margin(zero))
/// export to figure
graph export figure1bv2.png, replace width(1000)
Stata 16 introduced the new Frames functionality, which allows multiple datasets to be stored in memory, with each dataset stored in its own “Frame”. This allows for dynamic manipulation of multiple datasets across multiple Frames. Stata is still simplest to use when manipulating a single dataset (or frame). So, Stata users will probably be most interested in building a single analytical dataset/Frame from variables taken from multiple datasets/Frames.
One handy application of Frames is to import non-Stata datasets as separate frames and combine them (really, merge them) into a single analytical dataset/Frame. Before Frames, I had imported non-Stata datasets, saved them locally, then merged them one by one. With Frames, you just import each dataset into its own Frame and “merge” them directly, skipping the intermediate “save as Stata dta file” step.
Here’s my approach to building a single analytical dataset from multiple imported datasets, with frames. We’ll do this with NHANES data. This is a modification of the code on this post.
Step 1: Drop (reset) all frames, create new ones to import the new datasets, and run commands within each new frame to import NHANES/SAS datasets.
// Here's the NHANES website, FYI:
// https://wwwn.cdc.gov/nchs/nhanes/default.aspx
//
// drop all frames from memory. This will delete all unsaved data so be careful!!
frames reset
//
// make a blank frame called "DEMO_F"
// you could type "frame create" or the brief synonym "mkf" for
// "make frame"
mkf DEMO_F
// ...Then run the sas import command within it, grabbing it from the CDC website.
// Since this command needs to be run from within the DEMO_F frame,
// we can tell Stata to run the command from that frame without
// actually changing to it using the "frame [name]:" prefix
frame DEMO_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear
//
// ditto for the "BPQ_F" and "KIQ_U_F" datasets.
mkf BPQ_F
frame BPQ_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
//
mkf KIQ_U_F
frame KIQ_U_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
//
// let's see a list of current frames:
frames dir
//
// Which frame are you using though?
// pwf is present working frame, or the current one in use.
pwf
// You'll see that the pwf is "default".
Step 2: Create an “analytical” frame that will contain the data you need to complete your analysis, and copy the variable that links all of your data to this frame. Also switch to that new analytical frame.
// The "linking" variable in this dataset is called "seqn", which
// we will copy ("put") from the DEMO_F frame.
// (Your file might have a linking variable called "id".)
// This creates a new frame called "analytical" and also moves the "seqn"
// variable from DEMO_F to the new analytical frame in one line.
frame DEMO_F: frame put seqn, into(analytical)
//
// see current list of frames and present working frame:
frames dir
pwf
// Now change from the default frame to the new analytical one.
// You can change frames with "cwf" for "change working frame".
cwf analytical
Step 3: Now that you are in the new analytical frame, link all of your frames using the “linking” variable (“seqn” here).
// use the "frlink" command to link frames. This can be 1:1 linking or 1:m.
//
// remember that your cwf should be analytical right now.
//
frlink 1:1 seqn, frame(DEMO_F)
frlink 1:1 seqn, frame(BPQ_F)
frlink 1:1 seqn, frame(KIQ_U_F)
//
// You are still within the "analytical" frame, but now your frames are all
// linked or connected to each other.
// if you look at your dataset with the --browse-- command, you'll see there
// are now new DEMO_F, BPQ_F, and KIQ_U_F variables. These are the "rows" for
// linked IDs in those other frames, so Stata knows where to look for
// variables in those other frames.
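// for example, eyeball the linkage (the new variables hold the matching
// observation numbers in the linked frames):
browse seqn DEMO_F BPQ_F KIQ_U_F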
Step 4: “Get” the specific variables you want from each frame. This is how you merge individual variables from multiple frames.
// from my NHANES post, we need to grab weighting variables from DEMO_F
// and a BP variable from BPQ_F. Just for kicks, we'll also grab self-
// reported "weak kidneys" from KIQ_U_F.
//
// remember that your cwf should be analytical right now.
//
frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_F)
frget bpq020, from(BPQ_F)
frget kiq022, from(KIQ_U_F)
//
// now you have a nice merged database!
Step 5 (optional): If you are satisfied with your analytical dataset and no longer need the other frames, you can now drop the “linking” variables from the analytical dataset, save your analytical dataset, clear your frames, and reopen your analytical dataset.
// You can save this analytical frame as a Stata
// dataset now if you are done manipulating the other frames.
//
// You might opt to drop the new linking variables prior to saving
// for simplicity.
drop DEMO_F BPQ_F KIQ_U_F
//
// FYI: Stata's "save" command only saves the pwf, not any other frames.
// But in this example, we are done with other frames.
save analytical.dta, replace
//
// Drop all other frames so you don't get an annoying pop-up
// about unsaved frames in memory.
// Be careful! This will drop all data from memory!!
frames reset
//
// now reopen your previously saved analytical dataset.
use analytical.dta, clear
Bonus: Appending frames
You might need to append frames. There are some details here about how to do this. I’m using the –fframeappend– command by Jürgen Wiemers.
// install fframeappend, only need to do once:
ssc install fframeappend
// change to the frame you want to append all of your data to,
// in this example it's called "appended"
// you'll have to use "mkf appended" if you don't already have one
// called that.
cwf appended
//now append your frame "a" to the current open frame
fframeappend, using(a)
// now repeat using frame "b" to the current open frame
fframeappend, using(b)
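Here's a self-contained sketch of the whole process, using hypothetical frames "a" and "b" that each just hold a copy of the auto dataset:
mkf a
frame a: sysuse auto
mkf b
frame b: sysuse auto
mkf appended
frame appended: sysuse auto
cwf appended
fframeappend, using(a)
fframeappend, using(b)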
Bonus: Here’s steps 1-4 in a single loop using global macros
For advanced users: Here’s a loop and some global macros that are adaptable to downloading several years of data. NHANES uses different letters for files from different years; the 2009-2010 files use “F”.
frames reset
global url "https://wwwn.cdc.gov/Nchs/Nhanes/"
global F "2009-2010/"
global files DEMO BPQ KIQ_U
foreach x in F {
    foreach y in $files {
        mkf `y'_`x'
        frame `y'_`x': import sasxport5 "${url}${`x'}`y'_`x'.xpt"
    }
    frame DEMO_`x': frame put seqn, into(analytical_`x')
    cwf analytical_`x'
    foreach y in $files {
        frlink 1:1 seqn, frame(`y'_`x')
    }
    frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_`x')
    frget bpq020, from(BPQ_`x')
    frget kiq022, from(KIQ_U_`x')
}
frames dir
pwf
Here’s the same as above, but just for a single year.
frames reset
global dir "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/"
global files DEMO_F BPQ_F KIQ_U_F
foreach y in $files {
    mkf `y'
    frame `y': import sasxport5 "${dir}`y'.xpt"
}
frame DEMO_F: frame put seqn, into(analytical)
cwf analytical
foreach y in $files {
    frlink 1:1 seqn, frame(`y')
}
frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_F)
frget bpq020, from(BPQ_F)
frget kiq022, from(KIQ_U_F)
frames dir
pwf
Mediation analysis is a commonly used tool in epidemiology. Inverse odds ratio-weighted (IORW) mediation was described in 2013 by Eric J. Tchetgen Tchetgen in this publication. It’s a robust mediation technique that can be used in many sorts of analyses, including logistic regression, modified Poisson regression, etc. It is also considered valid when there is an observed exposure*mediator interaction on the outcome.
There have been a handful of publications that describe the implementation of IORW (and its cousin inverse odds weighting, or IOW) in Stata, including this 2015 AJE paper by Quynh C Nguyen and this 2019 BMJ Open paper by Muhammad Zakir Hossin (see the supplements of each for actual code). I recently had to implement this in a REGARDS project using a binary mediation variable and wanted to post my code here to simplify things. Check out the Nguyen paper above if you need to modify the following code to run IOW instead of IORW, or if you are using a continuous mediation variable rather than a binary one.
A huge thank you to Charlie Nicoli MD, Leann Long PhD, and Boyi Guo (almost) PhD, who helped clarify implementation pieces. Also, Charlie wrote about 90% of this code, so it’s really his work. I mostly cleaned it up and aligned the approach with the examples from Drs. Nguyen and Hossin in the papers above.
IORW using pweight data (see below for unweighted version)
The particular analysis I was running uses pweighting. This code won’t work in data that doesn’t use weighting. This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.
*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. (Although you'll probably
// have to adjust the svyset commands in the
// program below to work in your dataset, in all
// actuality)
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets.
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row.
global title "my cool analysis, model 1"
//
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called the "independent variable" or
// "treatment") is also binary. This code would need to be modified
// for a continuous exposure variable. See details in Step 2.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Components of weighting to go into "svyset" command.
// You might have to adjust these to match your dataset.
global samplingweight mysamplingweightvar
global id id_num // ID for your participants
global strata mystratavar
//
// Now pick number of bootstraps. Aim for 1000 when you are actually
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed.
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset.
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw_weighted
program iorw_weighted, rclass
// drop all variables that will be generated below.
capture drop predprob
capture drop inverseoddsratio
capture drop weight_iorw
//
*Step 1: svyset data since your dataset is weighted. If your dataset
// does NOT require weighting for its analysis, do not use this program.
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
//
*Step 2: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are
// using logistic regression (logit). If the exposure is a normally-distributed
// continuous variable, use linear regression instead.
svy linearized, subpop(${ifstatement}): logit ${exposure} ${mediator} ${covariates}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights.
scalar med_beta=e(b)[1,1]
//
*Step 3: Generate predicted probability and create inverse odds ratio and its
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
//
// Calculate inverse odds ratio weights. Since our dataset uses sampling
// weights, we need to multiply the dataset's weights times the IORW for the
// exposure group. This step is fundamentally different for non-weighted
// datasets.
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures.
gen weight_iorw = ${samplingweight} if ${exposure}==0
replace weight_iorw = inverseoddsratio*${samplingweight} if ${exposure}==1
//
*Step 4: TOTAL EFFECTS (ie no mediator) without IORW weighting yet.
// (same as direct effect, but without the IORW)
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log)
matrix bb_total= e(b)
scalar b_total=bb_total[1,1]
return scalar b_total=bb_total[1,1]
//
*Step 5: DIRECT EFFECTS; using IORW weights instead of the weighting that
// is used typically in your analysis.
svyset ${id}, strata(${strata}) weight(weight_iorw) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 6: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 7: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 8: Now bootstrap the program defined above.
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw_weighted
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // the 95% confidence intervals using percentiles
// (ie, the 2.5th and 97.5th percentiles) rather than Stata's default normal
// distribution method. The rationale for using percentiles rather than the
// normal distribution is in the 4th paragraph of the "analyses" section here
// (refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the ranges of the 95% CI;
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in
// a clean manner, exponentiating them when required.
quietly {
noisily di "${title} (bootstrap count=" e(N_reps) ")*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f `high954' "%" _col(50) %4.2f `beta4' "% (" %4.2f `low954' "%, " %4.2f `high954' "%)"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
noisily di " "
}
// the end.
IORW for datasets that don’t use weighting (probably the one you are looking for)
Here is the code above, except without consideration of weighting in the overall dataset. (Obviously, IORW uses weighting itself.) This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.
*************************************************
// this HEADER is all you should have to change
// to get this to run as unweighted data with binary
// exposure using IORW.
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets.
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row.
global title "my cool analysis, model 1"
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or
// "treatment") is also binary. This code would need to be modified
// for a continuous exposure variable. See details in Step 1.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Now pick number of bootstraps. Aim for 1000 when you are actually
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed.
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset.
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw
program iorw, rclass
// drop all variables that will be generated below.
capture drop predprob
capture drop inverseoddsratio
capture drop weight_iorw
//
//
*Step 1: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are
// using logistic regression (logit). If the exposure is a normally-distributed
// continuous variable, use linear regression instead.
logit ${exposure} ${mediator} ${covariates} ${ifstatement}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights.
scalar med_beta=e(b)[1,1]
//
*Step 2: Generate predicted probability and create inverse odds ratio and its
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
//
// Calculate inverse odds ratio weights.
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures.
gen weight_iorw = 1 if ${exposure}==0
replace weight_iorw = inverseoddsratio if ${exposure}==1
//
*Step 3: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. (same as direct effect, but without the IORW)
glm ${outcome} ${exposure} ${covariates} ${ifstatement}, family(poisson) link(log) vce(robust)
matrix bb_total= e(b)
scalar b_total=bb_total[1,1]
return scalar b_total=bb_total[1,1]
//
*Step 4: DIRECT EFFECTS; using IORW weights
glm ${outcome} ${exposure} ${covariates} ${ifstatement} [pweight=weight_iorw], family(poisson) link(log) vce(robust)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 5: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 6: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 7: Now bootstrap the program defined above.
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // the 95% confidence intervals using percentiles
// (ie, the 2.5th and 97.5th percentiles) rather than Stata's default normal
// distribution method. The rationale for using percentiles rather than the
// normal distribution is in the 4th paragraph of the "analyses" section here
// (refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the ranges of the 95% CI;
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in
// a clean manner, exponentiating them when required.
quietly {
noisily di "${title} (bootstrap count=" e(N_reps) ")*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%" _col(35) %4.2f `high954' "%" _col(50) %4.2f `beta4' "% (" %4.2f `low954' "%, " %4.2f `high954' "%)"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
noisily di " "
}
// the end.
If you are looking to automate the printing of values from macros/regressions or whatnot on your figures, make sure to check out this post.
I love Stata’s macro functionality. It allows you to grab information from r- or e-level data after executing a Stata command and call it back later. (Local macros are temporary; global macros are more persistent.) This is particularly useful for summary statistics collected in macros after the –sum– command, or for displaying a subset of components from a regression, such as the beta coefficient and 95% confidence intervals, or P-values (details on how to manipulate regression function results with macros are here).
One problem in using macros is that raw r- or e-level data are really long and not amenable to output in tables for publication without formatting. I hadn’t previously been able to apply formatting (eg %4.2f) while generating macros, aside from using the –round– command. (I don’t like the round command because it’s tricky to code in a program for reasons I won’t get into.) Instead, I had applied the number formatting when displaying the macro. That creates issues when generating string output from numerical macros, since applying numerical formatting (eg %4.2f) didn’t work when displaying a numerical macro within a string (i.e., embedded within quotations). I wanted to apply the format while generating the macro itself.
It turns out that not only can you apply formatting while generating a macro with the “: display” subcommand, you can also trim extra spaces from the generated macro at the same time. (Note that the “trim” command has been replaced by “strltrim” and a lot of other related commands that you can find in –help string functions–.) As a bonus, it turns out that you can also apply formatting (e.g., %4.2f) of a macro when displaying within a string using the “: display” subcommand strategically surrounded by opening and closing tick marks.
Here’s a demo script of what I’m getting at.
(Note: I strongly recommend against formatting/rounding/reducing precision of any macros that you are generating if you will later do math on them. Formatting during generation of macros is only useful for macros intended to be displayed later on.)
sysuse auto, clear
sum mpg
//
// 1. generate an unformatted macro, equals sign
// is required
local mpg_mean = r(mean)
//
// 2. apply formatting while generating the local
// macro, note the colon for the macro
// subcommand, instead of an equals sign
local mpg_mean_fmt: display %10.2f r(mean)
//
// 3. here's formatting mixed with a trim command,
// this combines an equals sign and colon.
local mpg_mean_neat = strltrim("`: display %10.2f r(mean)'")
//
//
// Now let's call the unformatted macro from #1:
display "here it is unformatted: `mpg_mean'"
// note that it's really long after the decimal
//
// Let's apply formatting to #1 OUTSIDE of quotations:
display "here it is formatted " %10.2f `mpg_mean'
// ...and here's formatting to #1 WITHIN quotations:
display "here it is formatted `: display %10.2f `mpg_mean''"
// see all of the leading spaces? Using %3.2f would fix
// that, but I wanted to show the trim function.
//
// here's the format applied during macro generation in
// #2 (without a trim function):
display "here's an alt format: `mpg_mean_fmt'"
// still lots of leading spaces.
//
// here's trimming and formatting mixed from #3:
display "here's fmt & trim: `mpg_mean_neat'"
// Bingo!
Stata allows the labeling of variables and also of the individual values of categorical or ordinal variables. For example, in the –sysuse auto– database, “foreign” is labeled as “Car origin”, 0 is “Domestic”, and 1 is “Foreign”. It isn’t terribly intuitive to extract the variable label of foreign (here, “Car origin”) or the labels of the categorical values (here, “Domestic” and “Foreign”).
Here’s a script that you might find helpful to extract these labels and save them as macros that can later be called back. This example generates a second string variable that applies those labels. I use this sort of code to automate the labeling of figures with the value labels, but this is a pretty simple example for now.
Remember to run the entire script from top to bottom or else Stata might drop your macros. Good luck!
sysuse auto, clear
// Note: saving variable label details at
// --help macro--, under "Macro functions
// for extracting data attributes"
//
// 1. Extract the label for the variable itself.
// If we look at the --codebook-- for the
// variable "foreign", we see...
//
codebook foreign
//
// ...that the label for "foreign" is "Car
// origin" (see it in the top right of
// the output). Here's how we grab the
// label of "foreign", save it as a macro,
// and print it.
//
local foreign_lab: variable label foreign
di "`foreign_lab'"
//
// 2. Extract the label for the values of the variable.
// If we look at --codebook-- again for "foreign",
// we see...
//
codebook foreign
//
// ...that 0 is "Domestic" and 1 is "Foreign".
// Here's how to grab those labels, save macros,
// and print them
//
local foreign_vallab_0: label (foreign) 0
local foreign_vallab_1: label (foreign) 1
di "The label of `foreign_lab' 0 is `foreign_vallab_0' and 1 is `foreign_vallab_1'"
//
// 3. Now you can make a variable for the value labels.
//
gen strL foreign_vallab = "" // this makes a string
replace foreign_vallab="`foreign_vallab_0'" if foreign==0
replace foreign_vallab="`foreign_vallab_1'" if foreign==1
//
// 4. You can also label this new string variable
// using the label from #1
//
label variable foreign_vallab "`foreign_lab' as string"
//
// BONUS: You can automate this with a loop, using
// --levelsof-- to extract options for each
// categorical variable. There is only one
// labeled categorical variable in this dataset
// (foreign) so this loop only uses the single one.
//
sysuse auto, clear
foreach x in foreign {
    local `x'_lab: variable label `x' // #1 from above
    gen strL `x'_vallab = "" // start of #3
    label variable `x'_vallab "``x'_lab' string"
    levelsof `x', local(valrange)
    foreach n of numlist `valrange' {
        local `x'_vallab_`n': label (`x') `n' // #2 from above
        replace `x'_vallab = "``x'_vallab_`n''" if `x' == `n' // end of #3
    }
}
Stata is great because of its intuitive syntax, reasonable learning curve, and dependable implementation. But R has some cutting-edge functionality and graphical tools that are missing in Stata. I came across the Rcall package, which allows Stata to interface with R and use some of these advanced features. (Note: this worked for me as far back as Stata 16. I have no reason to think it wouldn’t work with Stata 13 or newer.)
Below, I detail:
How to get set up with Rcall
Example 1: How to make figures with the ‘ggplot2’ and ‘ggstatsplot’ R packages
Example 2: How to estimate the Charlson comorbidity index or Elixhauser comorbidity score with the ‘comorbidity’ R package
Example 3: Estimating race by last name and sex by first name with the ‘predictrace’ R package
Example 4: Making a heatplot with lattice/levelplot
Installation of R, R packages, and Rcall (you only need to do this once)
Download and install R. Open R and install the readstata13 package, which is required to install Rcall. While you’re at it, install ggplot2 and ggstatsplot. Note: ggplot2 is included in the excellent multi-package collection called Tidyverse. We are going to install Tidyverse instead of ggplot2 alone. Tidyverse also installs several other packages useful in data science that you might need later, including dplyr, tidyr, readr, purrr, tibble, stringr, and forcats. I have also gotten an error saying “‘Rcpp_precious_remove’ not provided by package ‘Rcpp'”, which was fixed by installing Rcpp, so install that too.
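In R, that install step looks something like this (one install.packages() call per package named above):
install.packages("readstata13") # required for Rcall
install.packages("tidyverse")   # includes ggplot2
install.packages("ggstatsplot")
install.packages("Rcpp")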
It’ll prompt you to set up an install directory and choose your mirror/repository. Just pick one geographically close to you. After these finish installing, you can close R.
Rcall’s installation is within Stata (as usual for Stata programs) but originates from Github, not the usual SSC install. You need to install a separate package to allow you to install things from Github in Stata. From the Stata command line, type:
net install github, from("https://haghish.github.io/github/")
Now install Rcall itself from the Stata command line:
github install haghish/rcall, stable
If all goes well, it should install!
Edit: In July 2024, there seems to be a problem with the installation. If you get an error saying “github package was not found” and “please update your GitHub package”, you might try to install an older version of rcall than the current version (which is 3.1.0 in 7/2024) as such:
github install haghish/rcall, version(3.0.7)
Using Rcall
You should read the details in the Rcall help file (type –help rcall– in Stata) for an overview. Also read the Rcall overview on Github. In brief, you can send datasets from Stata to R using –rcall st.data()–. You can kick things back to Stata with –st.load(name of R data frame)–. –rcall clear– reboots R as a new instance.
There are four modes for using Rcall: vanilla, sync, interactive, and console. For our purposes, we are going to focus on the interactive mode since this allows you to manipulate R from within a do file.
Example 1: Make a figure in ggplot2 using Stata and Rcall
Here’s some demo code to make a figure with ggplot2, which is the standard for figures in R. There’s a handy cheat sheet here. This intro page is quite helpful. This overview is excellent. Check out the demo figures from this page as well. If your ggplot command extends across multiple lines, make sure to end each line (except the final line) with the three forward slash (“///”) line break notation that is used by Stata.
// load sysuse auto dataset
sysuse auto, clear
// set up rcall, clean session and load necessary packages
rcall clear // starts with a new instance of R
rcall: library(ggplot2) // load the ggplot2 library
// move Stata's auto dataset over to R and prove it's there.
rcall: data<- st.data() // move auto dataset to r
rcall: names(data) // prove that you can see the variables.
rcall: head(data, n=10) // now look at the first 10 rows of the data in R
// now make a scatterplot with ggplot2, note the three slashes for line break
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
geom_point()
rcall: ggsave("ggtest.png", plot=e)
// figure out where that PNG is saved:
rcall: getwd()
Note: rather than using the three forward slashes, you can also change the delimiter to a semicolon, like the following. Just remember to change it back to normal (“cr”). Here’s an equivalent to above with semicolon delimiters. Note that the ggplot bit that spreads across two lines no longer has any slashes. This looks a bit more like “true R code”.
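Here's a sketch of that semicolon-delimited equivalent:
#delimit ;
sysuse auto, clear ;
rcall clear ;
rcall: library(ggplot2) ;
rcall: data <- st.data() ;
rcall: e <- ggplot(data, aes(x=mpg, y=weight)) +
geom_point() ;
rcall: ggsave("ggtest.png", plot=e) ;
rcall: getwd() ;
#delimit cr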
Here’s what it made! It was saved in my Documents folder, but check the output above to see where your working directory is.
You can get much more complex with the figure, like specifying colors by foreign status, specifying dot size by headroom size, adding a loess curve with 95% CI, and adding some labels. You can swap out the “rcall: e <- ggplot(…)" bit above for the following. Remember to end every non-final line with the three forward slashes.
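rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
geom_point(aes(col=foreign, size=headroom)) + ///
geom_smooth(method="loess") + ///
labs(title="ggplot2 demo", x="MPG", y="Weight", caption="Caption!")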
Let’s make a figure in ggstatsplot using Stata and Rcall
Here’s some demo code to make a figure with ggstatsplot (which is very awesome and you should check it out). If your ggstatsplot command extends across multiple lines, make sure to end each line (except the final line) with the three forward slash (“///”) line break notation that is used by Stata.
// load sysuse auto dataset
sysuse auto, clear
// set up rcall, clean session and load necessary packages
rcall clear // starts with a new instance of R
rcall: library(ggstatsplot) // load the ggstatsplot library
rcall: library(ggplot2) // need ggplot2 to save the png
// move Stata's auto dataset over to R and prove it's there.
rcall: data<- st.data() // move auto dataset to r
rcall: names(data) // prove that you can see the variables.
rcall: head(data, n=10) // now look at the first 10 rows of the data in R
// let's make a violin plot using ggstatsplot
rcall: f <- ggbetweenstats( data = data, x=foreign, y=weight, title="title")
rcall: ggsave("ggstatsplottest.png", plot=f)
// figure out where that PNG is saved:
rcall: getwd()
If you check your working directory (it was my “Documents” folder in Windows), you’ll find this figure as a PNG:
You can automate the output of ggstatsplot figures by editing the ggplot2 components that make them up. For example, you’d insert the following into the parentheses following “ggbetweenstats” in the ggstatsplot code above to put the y scale on a log axis:
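Assuming your version of ggstatsplot supports the ggplot.component argument, something like:
ggplot.component = ggplot2::scale_y_continuous(trans = "log10")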
Quick do file to automate R-Stata integration and make ggplot2 or ggstatsplot figures
I made a do file that simplifies the setup of Rcall. Specifically, it 1. sets R’s working directory to match your current Stata working directory, 2. starts with a fresh R instance, 3. loads your current Stata dataset in R, and 4. loads ggplot2 and ggstatsplot in R.
To use it, just load your data, run a “do” command followed by the URL to my do file, then run whatever ggplot2 or ggstatsplot commands you want. This assumes you have installed R, the required packages, and Rcall (see the very top of this page). If you get an error, try using this alternative version of the do file that doesn’t try to match Stata and R’s working directories.
Example code:
// Step 1: open dataset
sysuse auto, clear
// Step 2: run the do file, hosted on my UVM directory:
do https://www.uvm.edu/~tbplante/rcall_ggplot2_ggstatsplot_setup_v1_0.do
// if errors with above, use this do file instead:
// do https://www.uvm.edu/~tbplante/rcall_ggplot2_ggstatsplot_setup_alt_v1_0.do
// Step 3: run whatever ggplot2 or ggstatsplot code you want:
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
geom_point(aes(col=foreign, size=headroom)) + ///
geom_smooth(method="loess") + ///
labs(title="ggplot2 demo", x="MPG", y="Weight", caption="Caption!")
rcall: ggsave("ggtest.png", plot=e)
Example 2: Using “comorbidity” R package in Stata with Rcall to estimate Charlson comorbidity index or Elixhauser comorbidity score
Here’s some semicolon-delimited Stata code to run from a Stata do file to apply the Charlson comorbidity index to some Stata data.
webuse australia10, clear // load a default stata dataset with an ICD10 variable
gen id=_n // make an ID by row as there's no ID variable in this dataset
#delimit ;
rcall clear ;
rcall: library(comorbidity) ; // load comorbidity package
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at data
rcall: head(data, n=10) ; // look at rows
rcall: charlston <- comorbidity(x=data, id="id", code = "cause",
map = "charlson_icd10_quan", assign0 = FALSE) ;
rcall: score(x=charlston, weights = "charlson", assign0=FALSE) ;
rcall: mergeddata <- merge(data, charlston, by="id") ; // merge the original & new charlson data
rcall: head(mergeddata, n=10) ; // look at rows
rcall: st.load(mergeddata) ; // kick the merged data back to stata
#delimit cr
Example 3: Predicting race by last name and sex by first name using the ‘predictrace’ package
Here’s how to use this package in Stata using Rcall. There is a lot of nuance in this package so make sure to read the paper and the github page.
In R, type:
install.packages("predictrace")
Then in Stata, write a do file that generates a variable called “lastnames” that is lower case last names (for race matching) and “firstnames” that is lowercase first names (for sex matching). Below is the code to estimate race by last name (first batch of code) and then sex by first name (second batch of code). This is semicolon delimited.
Estimating race by last name
(Note: this considers “Hispanic” to be a race and not an ethnicity…)
// Clear memory, input dataset of first and last names.
// Here, I'm formatting the strings so they are up to 100 characters
// in length so they don't get clipped (str100).
// If I specified str5 then "Flintstone" would be "Flint".
clear all
input str100 firstname str100 lastname
"Jacob" "Peralta"
"Rosa" "Diaz"
"Terrence" "Jeffords"
"Amy" "Santiago"
"Charles" "Boyle"
"Regina" "Linetti"
"Raymond" "Holt"
"Michael" "Hitchcock"
"Norman" "Scully"
end
compress firstname // optional, shortens string format from 100 char to the minimum length
compress lastname // optional, shortens string format from 100 char to the minimum length
// now replace all names with their lower case variant:
replace firstname = lower(firstname) // first name isn't used here fyi
replace lastname = lower(lastname)
#delimit ;
rcall clear ;
rcall: library(predictrace) ; // load predictrace
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at names of data
rcall: head(data, n=10) ; // look at first 10 rows
rcall: lastnamevector <-data[, "lastname"] ; // make vector from lastname column
rcall: data = merge(data, predict_race(lastnamevector), by.x = 'lastname', by.y = 'name', sort = FALSE) ;
rcall: head(data, n=10) ; // look at rows
rcall: st.load(data) ; // kick the merged data back to stata
#delimit cr
Estimating sex by first name
// Clear memory, input dataset of first and last names.
// Here, I'm formatting the strings so they are up to 100 characters
// in length so they don't get clipped (str100).
// If I specified str5 then "Flintstone" would be "Flint".
clear all
input str100 firstname str100 lastname
"Jacob" "Peralta"
"Rosa" "Diaz"
"Terrence" "Jeffords"
"Amy" "Santiago"
"Charles" "Boyle"
"Regina" "Linetti"
"Raymond" "Holt"
"Michael" "Hitchcock"
"Norman" "Scully"
end
compress firstname // optional, shortens string format from 100 char to the minimum length
compress lastname // optional, shortens string format from 100 char to the minimum length
// now replace all names with their lower case variant:
replace firstname = lower(firstname)
replace lastname = lower(lastname) // last name isn't used here fyi
#delimit ;
rcall clear ;
rcall: library(predictrace) ; // load predictrace
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at names of data
rcall: head(data, n=10) ; // look at first 10 rows
rcall: firstnamevector <-data[, "firstname"] ; // make vector from firstname column
rcall: data = merge(data, predict_gender(firstnamevector), by.x = 'firstname', by.y = 'name', sort = FALSE) ;
rcall: head(data, n=10) ; // look at rows
rcall: st.load(data) ; // kick the merged data back to stata
#delimit cr
Special thanks to Katherine Wilkinson for her R brilliance in debugging this.
Example 4: Making a heatplot with lattice/levelplot
I wanted to make a heatplot ranging from -1 to +1 and wanted the negatives to be a different color from the positives, and have them turn more muted as they get to zero. I couldn’t quite figure out how to do this with the –twoway contour– or –plotmatrix– commands. It was pretty simple to do these in R, just needed to use the RColorBrewer package, specifying “BrBG” as the palette. I used the “lattice” package and its “levelplot” command as described in this post.
In R, install the lattice and RColorBrewer packages (lattice was already installed on my R desktop):
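install.packages("RColorBrewer")
install.packages("lattice") # usually ships with R already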
Now in Stata, input the data you’re interested in rendering, ordered column by column from left to right. Then kick it to R, load the necessary libraries, grab your colorbrewer scheme of choice, label the axes, and use “levelplot” to render it. It will save a PDF in your R working directory. I couldn’t for the life of me figure out how to save it as a PNG, but there were diminishing returns on figuring that out.
clear all
input x y z
-0.8 0.2 0.4
0.7 -0.4 0.1
0.9 -0.2 -0.4
end
#delimit ;
rcall clear ;
rcall: data<- st.data() ; // move data to r
rcall: names(data) ; // look at names of data
rcall: head(data, n=10) ; // look at first 10 rows
rcall: library(lattice) ; // load packages
rcall: library(RColorBrewer) ;
rcall: colors <- colorRampPalette(brewer.pal(11, "BrBG")) ; // get colors (BrBG tops out at 11)
rcall: colnames(data) <-c("alfa", "bravo", "charlie") ; //names of columns and rows
rcall: rownames(data) <-c("one", "two", "three") ;
rcall: levelplot(
t(data[c(nrow(data):1) , ]),
col.regions=colors,
xlab="xlabel!",
ylab="ylabel!",
at=seq(-1, 1, length.out=100),
scales=list(y=list(rot=0), x=list(rot=45))
) ; // pull data, apply colors & labels, set color axis range, rotate labels
rcall: getwd() ; // figure out where that PDF is saved:
#delimit cr
Note in 2023: We are using R and not Stata this summer so this post doesn’t apply to this year’s projects.
Stata is a popular commercial statistical software package that was first released 30+ years ago. It has some really nice features, loads of top-rate documentation, a very active community, and approachable syntax. For beginners, I think it’s the simplest to learn.
Learning how to use Stata
Stata has really, really, really good documentation.
The documentation is outstanding. Let’s say that you want to learn how to use the –destring– command. In the command line (1a under “Stata’s Interface” below), type:
help destring
…and up will pop a focused help file. There’s the “View complete PDF manual entry” option that has EXTENSIVE documentation of the command. (Note: This file seems to only work well with Adobe PDF reader, not alternative PDF readers like Sumatra). If the focused help file isn’t sufficient to answer your questions, try the complete PDF manual.
The focused help file has multiple parts, but the syntax example is gold. Further down you’ll see example uses of the command.
Web searches will find even more answers
Odds are that someone has already hit the same problem you have in using Stata. Queries in your favorite search engine are likely to find answers on the Statalist archive or UCLA’s excellent website.
You can install Stata programs that other users have written
There are MANY MANY MANY user-written programs out there that can be installed and used in your code. You only need to install them once. Most are on Boston College’s repository called SSC. I use the table1_mc program extensively (it makes pretty table 1s, you can read about it here). To install table1_mc from SSC, you type:
ssc install table1_mc
…and Stata will download and install it for you. It’s ready to use when it finishes installing. And there’s no need to re-install it; it will load each time you start Stata.
Quirks of Stata
Stata only works with rectangular datasets
Think of a rectangular dataset as a single spreadsheet in Excel. It has vertical columns and horizontal rows. There’s no data on a Z axis coming out of the computer at your face.
A rectangular dataset is the only type that Stata works with. Other statistical software like R or Python can handle many more complex data structures. For learners, forcing data to fit within a rectangular dataset is a huge advantage in my mind since that structure is intuitive, and you can always browse your data with the built-in data browser (see 3c under Stata’s Interface, below).
Stata only works with one dataset at a time*
One dataset in Stata is akin to one spreadsheet in a workbook in Excel. In Excel, you can have multiple spreadsheets in one .xlsx workbook file, with each spreadsheet appearing on a different tab at the bottom. All spreadsheets are in memory at the same time, and you can do math across spreadsheets in a workbook, e.g., summarize costs in one column of spreadsheet A and have the result appear in one cell of spreadsheet B. In Stata, you can only have one spreadsheet (here, dataset) open at a time.* Because of this, Stata users spend a good deal of time merging and appending multiple datasets to make a single dataset that has all of the necessary variables in the best format from the get-go.
A big problem historically with Stata was that datasets are loaded in the RAM, and big datasets would be too big for conventional computers. That’s not an issue anymore since even cheap computers have several gigabytes of RAM.
*This isn’t true anymore. Starting in version 16, Stata can actually now have multiple datasets in memory, each stored in its own frame. These frames can be very useful in certain scenarios, but for our purposes, we are going to pretend that you can have just one dataset open at a time.
Data are either string or numeric. Their color changes in the data browser
Strings are basically text that is thought to be words and not numbers. But sometimes things that are actually numbers (“1.5”, “2.5” in different rows of the same column) will be imported and considered strings rather than numbers. This might be because they were imported incorrectly, or because farther down the column there is a word in a cell (“1.5”, “2.5”, “Specimen error”). If any row of a variable contains something that isn’t a number, Stata makes the entire column, and with it the variable, a string.
IMPORTANT: When viewing strings in the data browser (3c under “Stata’s Interface” below), they appear in RED text. When specifying strings in commands, you need to enclose them in quotations (eg count if name==”Old”). Missing strings are two quotes with nothing in between them (eg count if name==””).
In order to do math, you need to have things be numbers. There are several different numerical formats that you can read about here. If something is an integer (nothing after the decimal), it can be byte, int, and long. If something has a decimal point, it’s float or double. Stata does a nice job selecting which numerical format your data should be in, so you probably don’t need to think much about the difference between byte, int, long, float, or double again.
IMPORTANT: When viewing numeric variables in the data browser, they appear in BLACK text (or BLUE if they have a label applied). When specifying numbers in commands, no quotations are needed (eg count if quartile==1). Missing numeric values are periods (eg count if quartile==.), and a period is treated as positive infinity (a missing value is bigger than a value of one billion).
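A quick way to see this in Stata:
display . > 1000000000 // displays 1, meaning "true"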
To convert from a string to a numerical value (change the “1” to a 1), you use the –destring– command. You might need to include the force and replace options, but read up on those by typing –help destring–.
To convert from a numerical value to a string (change the 1 to a “1”), use the –tostring– command. Note that missing numerical values will go from a dot to a dot in quotations (. becomes “.”), which is not the same as a missing value for a string, which is empty quotations (“”). It’s a good idea to follow up a –tostring– command with a command that replaces “.” values with “” values.
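Here's a minimal sketch with a hypothetical variable called bp:
destring bp, replace force // "1.5" becomes 1.5; non-number cells become .
tostring bp, replace force // 1.5 becomes "1.5"; missing . becomes "."
replace bp = "" if bp == "." // make missings true empty strings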
Stata’s output is only 255 characters wide, max
The output window of Stata will print (“display”) the inputted command and results from that command. It will clip the output at up to 255 characters, and insert a line break to the next row. You can specify:
set linesize 255
…so that the output is always 255 characters wide. Otherwise, it’ll adjust the output to match how wide your output window is.
The working directory is your “Documents” folder unless you manually set the working directory with the cd command or open up Stata by double clicking on a .do file in Windows Explorer
The working directory is where Stata is working from. If you save a dataset with the –save– command, it’ll save it in the working directory unless you specify the full file path from the C: drive on. If you double click on the Stata icon to open it up in Windows and type the present working directory command to see where it’s working from (that’s –pwd–), it’ll print out:
. pwd
C:\Users\USERNAME\Documents
So, if you type:
save "dataset.dta", replace
…it’ll save dataset.dta in C:\Users\USERNAME\Documents
Let’s say that you really want to be working in your OneDrive folder because that’s secure and backed up and your Documents folder isn’t. To save your dataset there, you could specify the full path:
save "C:\Users\USERNAME\OneDrive\Research project\Analysis\dataset.dta", replace
Note that there’s a space in the Research project folder name so the directory needs to be in quotations. If there was no space anywhere in the directory, you could omit the quotations. I’m including quotations everywhere here because it’s good practice.
One option is to change your working directory to the OneDrive folder. You use the –cd– command to do that then any save command will automatically save in that folder:
cd "C:\Users\USERNAME\OneDrive\Research project\Analysis\"
save "dataset.dta", replace
Alternatively, you can save your project’s do file in the “C:\Users\USERNAME\OneDrive\Research project\Analysis\” folder. Rather than opening Stata by clicking on the icon, find the do file in your OneDrive folder in Windows Explorer and double click on it. It’ll open Stata AND set that folder as the working directory!! For a new project, this means opening Stata by clicking on its icon, opening a blank do file, saving that do file in your OneDrive folder, closing the Do File Editor and Stata, then reopening Stata by double clicking on your blank do file in Windows Explorer.
Stata is most effectively used with command-line input, specifically through the Do File Editor. There is also a graphical user interface that can be handy.
I think that everything in Stata should be completed through do files. These are text files with sequential lines of code that make Stata perform commands in order.
I strongly, strongly, strongly recommend having a numbered set of do files. The first (“01 cleanup.do”) will merge original data files, generate new variables, print out your inclusion flow diagram, and save your analytical dataset (“analytical.dta”). The rest of the files each do just one thing, like make a table or render a figure. All files except the first will (a) open your analytical dataset (“analytical.dta”) and then (b) run some sort of stats without modifying the analytical dataset (see the sketch after the list below). I also keep one last miscellaneous do file for random things that you might need for the text (“99 misc things for text.do”), like estimating median and IQR follow-up or some one-off t-tests. If you realize in a later do file (say, “02 table 1.do”) that you need to generate a new variable or merge in another raw dataset that you didn’t realize you needed when you first built your “analytical.dta” dataset, resist the impulse to make new variables in your “02 table 1.do” file. Instead, go back and rework your “01 cleanup.do” file to generate that variable or merge in that raw dataset, rerun “01 cleanup.do” so it overwrites your “analytical.dta” dataset, and then continue working on your “02 table 1.do” file.
So, your do file directory might look like this:
01 cleanup.do – (a) Merges original data files, (b) generates new variables, (c) prints out an inclusion flow diagram, and (d) builds your “analytical.dta” analytical dataset that all other do files will use.
02 table 1.do – Opens “analytical.dta” then uses table1_mc or something similar to make a table 1.
03 histogram of exposure.do – Opens “analytical.dta” then uses the –histogram– command to draw a distribution of your exposure.
04 event table.do – Opens “analytical.dta” then uses some sum commands to print out the # of events by group of interest.
05 regressions.do – Opens “analytical.dta” then runs your regressions.
06 spline.do – Opens “analytical.dta” then runs some code to make restricted cubic splines.
99 misc things for text.do – Opens “analytical.dta” then runs one-off code for random details that you might need for the text (e.g., median and IQR follow-up).
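To make this concrete, here’s a minimal sketch of what one of those downstream do files might look like (the variable names here are made-up placeholders):
// 02 table 1.do
// (a) open the analytical dataset built by "01 cleanup.do"
use "analytical.dta", clear
// (b) run stats without modifying the analytical dataset
tab treatmentgroup, missing
sum age, detail
Notice there’s no –save– at the end: the downstream files never overwrite “analytical.dta”.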
GUI in Stata
There is a graphical user interface (GUI) with clickable menus. You can click through commands and it’ll generate the code and run the command of interest, and these can be handy for stealing syntax to run an annoying command. The command from the GUI will appear in the Command History (1c below) and you can right click and copy/paste it into your do file.
I find –import excel– to be frustrating and use the GUI probably 90% of the time to generate that command then copy/paste the syntax into my do file.
You’ll be much better off if you use a do file for nearly everything. Try to minimize the use of the GUI.
Stata won’t let you close a dataset in memory or overwrite an existing dataset without some effort
The –use– command will open a dataset into memory. If you don’t have a dataset opened yet, this will open one:
use dataset.dta
Remember that Stata can only have one dataset opened at a time, so any time you open one when you already have a different dataset opened in memory, Stata will need to drop the open dataset. If you spent a lot of time on the open dataset creating new variables or merging with other datasets, closing it will make you lose all of your work unless you have also saved it. Stata doesn’t want you to make this mistake so if you already have a dataset opened and you type in the above command, Stata will say “No” and you won’t be opening the new dataset.
Instead, you need to put “–, clear–” at the end of the command, like this:
use dataset.dta, clear
And now Stata will drop whatever you have open. It’s really just a nice check to keep you from discarding your work accidentally.
Similarly, if you are trying to save a dataset with the –save– command under a filename that doesn’t already exist in that folder, you just need to type:
save newdataset.dta
…and Stata will save it, no problem. HOWEVER, if you are trying to overwrite an existing dataset with that same name, Stata will say “No” and you won’t be saving your dataset today. This is another check. Instead, you just need to use “–, replace–” to overwrite. Example:
save newdataset.dta, replace
Stata’s interface
Here’s a quick overview of the Stata interface in Windows.
Ways to input and interact with commands:
1a. Command line – This is where you type command by command. Unless you are just poking around in your data, you should avoid using this. Anything that you want to reproduce in your analysis should be done in the Do File Editor.
1b. Open Do file editor button – The Do File Editor is the most important part of Stata in my opinion. A do file is a long text file saving command after command. This is where you should do all of your analytical work.
1c. Command history – If you use the command line or GUI to make a command, it’ll be saved here. You can right click on old commands and copy/paste them into your do file.
2. Output window – Your command will appear here with a preceding dot (“. sysuse auto” means that I had previously typed in “sysuse auto”). The output from your do file or command will appear immediately below.
Ways to interact with data:
3a. Variable list – This is a list of variables in the open dataset. You can double click on them and the variable name will be copied to the command line, or ctrl+click to select multiple and copy them to the clipboard. This is quite handy.
3b. Variable and dataset properties – This will let you see details about a selected variable in the variable list and the current dataset in memory.
3c. Data browser – You can also pop this open with the –bro– command. This views all data in a spreadsheet format that looks like Excel.
The very excellent table1_mc program will automate generation of your Table 1 for nearly all needs (read about it here), except for datasets using pweights. I’ve been toying around with automating Excel table generation using Stata v16+’s Frames features, and I recently started working on a database that requires pweighting for analyses, so I used this as an opportunity to build a Frames-based automation of a pweight-adjusted Table 1. To use it, put this line:
do https://www.uvm.edu/~tbplante/p_weight_table1_v1_3.do
…in your Stata do file and it’ll pull in the entire script! Note that full instructions will show up in the Stata output window when you run the above line. The instructions below are incomplete. This code does not produce P-values.
How to use this do file
// Step 1a: close all open frames, drop all macros,
// and open your dataset
frames reset
macro drop _all
webuse multistage, clear
//
// Step 1b: Figure out where your present working directory is,
// this is where the excel spreadsheet will be saved.
// Change the working directory with the "cd"
// command as needed.
pwd
//
// Step 2: Declare your data to be pweighted
svyset county [pweight=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
//
// Step 3: If your columns require the generation of pweighted
// tertiles, quartiles, or whatnot, do that now.
// For this example, we'll do by quartile of weight.
// note: per this website: https://www.stata.com/support/faqs/statistics/percentiles-for-survey-data/
// ...Only the pweight needs to be specified when making
// weighted quartiles.
xtile weightquart=weight [pweight=sampwgt], n(4)
//
// Step 4: Recode binary variables so they are 0 and 1 (if needed)
// Note: in this dataset, it's 1 and 2 for male and female,
// respectively.
gen female = 1 if sex==2 // recode sex to female, where 1 is female
replace female=0 if sex==1 // male is now 0
//
// Step 5: Name your variables and options for multiple options
// Note: The variables are already labeled but we are doing it
// again for completeness' sake.
//
// Continuous variables
label variable weight "Weight in lbs"
label variable height "Height in in" // I don't know why people are 400 in tall. that's 33 ft.
// Nominal variables (same process would happen for ordinal or continuous varibles)
label variable race "Race" // Race is nominal so need to also define values of race
label define racelabels 1 "White" 2 "Black" 3 "Other"
label values race racelabels // Apply the labels!!!
// Binary variables, no need to apply labels
label variable female "Female sex"
//
// Step 6: Call the do file
// Note: Instructions on this program's use will show right
// after it's called. Look at the Stata output window.
do https://www.uvm.edu/~tbplante/p_weight_table1_v1_3.do
//
// Step 7: Now follow the instructions! That are in the stata
// output window!
table1pweight_start table1 1 4 weightquart weight %10.1f
table1pweight_contn_sd table1 1 4 weightquart height %10.1f
table1pweight_bin table1 1 4 weightquart female %10.0f
table1pweight_cat table1 1 4 weightquart race %10.0f
table1pweight_end table1 1 4 weightquart weight %10.2f
//
// CLOSE THE NEW EXCEL FILE OR YOU'LL GET AN ERROR WHEN
// RUNNING STEP 7.
//
// Step 8: Look at the excel output! Here, it's a file called
// table1.xlsx that's sitting in your pwd (see step
// 1b above). You might notice blanks for the 2nd and
// 3rd columns, but that's because of a strata with a single
// sampling unit. You can confirm numbers using the survey
// tools.
// remember! This is where your excel file is saved:
pwd
Note: this post was written prior to Stata 17, which now allows Python to control Stata and vice versa. In Stata 16, Python could not control Stata but Stata could control Python. Because of this functionality, there’s a more streamlined approach to getting Jupyter to play nicely with Stata 17, detailed here. You’ll still need to install Python and Jupyter, so this post is still helpful.
Stata 16 now integrates with Python. I’m pretty stoked about using some of the Python figure packages. Getting it up and running has been a bit of a challenge. Here’s how I got it to work.
Of note, since I started this post, Stata’s blog has started a series on using Python, which you should check out here.
Install part 1:
Installing Anaconda (free for individuals, not for institutions)
Anaconda comes with many built-in statistical packages. Download the (free) individual version of Anaconda from here. Just make sure to check the “set as path” button during the Anaconda install!
Installing the (universally free) traditional Python distribution
As of July 2020, Python apparently has two versions that are commonly used, the 2.x version and the 3.x version. The end-of-life of 2.x versions is this year, so I wouldn’t recommend using it (current highest version is 2.7). Instead, use the 3.x version, currently the 3.8 version. You can find it at the Windows Python Download Page.
Make sure to install the version matching your Stata install! Stata comes as 32-bit or 64-bit. In Stata, type –about– to see which version you have; mine is the 64-bit version. If you have a relatively modern computer, you are probably running 64-bit Stata, but Windows with a 64-bit processor can run either, so do yourself a favor and just check.
Make sure that you install the corresponding version of Python. The highlighted one (x86-64) is the 64 bit. The other one (x86) is the 32-bit version. For this example, since I have the 64-bit version of stata, I installed the x86-64, 64-bit version of Python.
I had originally installed the 32-bit version of Python and Stata couldn’t load it. Installing the 64-bit version of Python solved that. There’s actually a big “download now” button on the main Python webpage that will give you the 32-bit version. Make sure to select the specific stable release in the picture above.
For the love of Pete, check this PATH box when you install it.
PATH is a list of the folders that Windows searches for programs when you type a command at the command line; adding Python to it is what lets you run Python from anywhere.
See this check box right here? Select it. If you don’t, you’ll have a heck of a time getting anything to run from the command line. This should be checked by default; I have no idea why it’s not. If you forgot to check this box, uninstall Python and reinstall it with the box checked.
Also, notice that it says “64 bit” on the installer screen above. If it says “32 bit”, you probably downloaded the wrong version. Go back and try again!
What the heck did I just install?
There are two Python shell apps/programs that came along with the default Python setup. IDLE is a more user-friendly Python shell. It resembles the command line in Stata, but it has the syntax highlighting of the Do file editor.
The app called “Python 3.8 (64-bit)” is the shell without any markup. If you want to play around with Python, I recommend using IDLE.
Making your first program in IDLE
Anything you run in Python should be from a script, or a *.py file. Pop one open from within IDLE by hitting Ctrl+N. Enter the following:
print("Hello world!")
Then save it and run it (by pressing F5) and you’ll get the hello world!
How does that look in Stata?
Let’s do the same thing in a Stata do file. In order to open up the Python shell within Stata, you have to type –python– on its own line, your intended python code, then –end– on its own line. Here, I have entered:
python
print("hello world!")
end
Then just hit ctrl+d or the run button to get it to work in Stata!
Install Part 2: How do I get the Pandas, Matplotlib, SciPy, Sklearn, and NumPy libraries installed in Python?
Note: Anaconda comes with all of these except sklearn, so Anaconda users just need to complete the sklearn step below.
Python by itself can do some stuff, but the heavy lifting for stats and visualization comes from add-in libraries that aren’t included with the default Python install and must be added before doing much of anything else. (Note: Anaconda does come with those; it’s a Python installation geared towards science, but we’re doing the classic install here.) Installing these additional libraries can be done with the included pip tool, which automates all downloads and installations. BUT pip has to be called from the Windows command line, not in a Python shell (i.e., not in IDLE). You’ll know you’re in the shell if the line starts with this:
>>>
So if you type “pip install pandas” in the shell (after the “>>>”), you’ll get an error and you will not be getting pandas.
To pop up the Windows command line, hit the Start button and type “cmd”, or hit Windows key+R and type “cmd”. If you correctly checked the PATH checkbox in the install, you should get the version reported when you type the following:
python --version
If you get some sort of error, it’s probably because you didn’t check the PATH box during the install. Uninstall Python, then reinstall it and make sure you check that stupid PATH box.
A note about the Windows 10 command line: If you type “python” and hit enter, Windows pops up the Windows Store and tries to get you to install the version of Python that they host. This is by far the dumbest Windows feature ever, and I have seen BOB. So, avoid ever typing the word “python” in the command line. Instead, use the handy “py” command (the Python launcher), which does everything you’ll need it to do.
To call pip, you type “py”, then “-m”, then “pip” and its subcommands; the “-m” flag runs a library module as a script. If it’s been a while since you updated pip, now’s a good time to do that from the command line, which also shows how pip is called:
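For example (these package names are the standard ones on PyPI; note that sklearn installs under the name scikit-learn):
py -m pip install --upgrade pip
py -m pip install pandas matplotlib numpy scipy scikit-learn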
How do I use Pandas, Matplotlib, NumPy, Scikit-learn (sklearn), and SciPy in Stata?
Once the libraries are installed, you can integrate them into your scripts. Each time you want to use them, you need to import them so you can call them. The convention is to import them under common abbreviations so you don’t have to type “pandas” over and over again; you can just use “pd”. Ditto for the other libraries:
python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import sklearn as sk
end
Install part 3: How do I install Jupyter Lab and get it to work with Stata? (This is the updated version of Jupyter Notebook)
Note: Jupyter lab comes installed with Anaconda, but node.js, npm, and stata_kernel still must be installed.
Jupyter is a super popular way to cleanly complete analyses. Its origins were in Python, but it now works in R and Stata. Details on installing Jupyter are here. Specific instructions for getting it to interface with Stata are here. Here’s how I got it installed:
First: install node.js (warning: this takes a long time and requires a reboot)
Make sure that you have the current version of Python installed! The node.js installer will install the current version of Python if yours is outdated, which can create a huge headache.
First, install node.js. Download here. I checked the box in the node.js install to also install additional software. At the end of the install, it pops open Windows powershell window and installed a bunch of stuff, including Python if you do not have the current version. It also installs Chocolatey and a few other things. NOW STOP HERE AND REBOOT YOUR COMPUTER. Come back when it restarts.
Next: install other npm, jupyterlab, and Stata plugins (warning: this also takes a while)
First, it’s always good habit to make sure you have the newest version of pip!
py -m pip install --upgrade pip
Then, type the following in the Windows command line:
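Roughly, these follow the stata_kernel documentation’s pip-based route (npm itself comes bundled with node.js, so pip handles the rest; the last line registers the Stata kernel with Jupyter):
py -m pip install jupyterlab
py -m pip install stata_kernel
py -m stata_kernel.install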
When later trying to run jupyter lab, I got an error saying “Exception: Jupyter command jupyter-lab not found.” I updated pip then uninstalled and reinstalled npm, jupyterlab, and stata_kernel (just replace the word “install” with “uninstall” above) then installed everything again. This fixed things.
Finally, you need to do a last step described under “windows specific steps” here to get this to work in Windows. I found my Stata executable at “C:\Program Files\Stata16\StataSE-64.exe”, not Program Files (x86). You can delete the new Stata desktop shortcut once you run it as an administrator one time.
You might also need to configure Jupyter to work with your Stata install. Details are here. I found the configuration file, named .stata_kernel.conf, at this location: C:\Users\MYUSERNAME\.stata_kernel.conf
In reviewing the configuration file, it seems to have correctly identified my Stata SE 16 setup. I changed the graph format from svg to png, but left the rest unchanged.
Now that I have Jupyter Lab installed, how do I open and use it?
In Anaconda Navigator, just click the “Jupyter lab” button. For a traditional Python install, open the Windows command line, type:
jupyter lab
It should open up your web browser to a Jupyter page. Keep the Windows command line terminal open in the background; if you close it, Jupyter will cease to work. (Note: This isn’t true for Anaconda if you open Jupyter from the GUI.) Click the “Stata” notebook button to start.
It’ll open up a scripting page for your code. At the bottom it should say “stata idle”. That’s how you know you set it up correctly.
Now you can use traditional Stata code!
There are additional commands called “Magics,” detailed here, that help Stata integrate more seamlessly with Jupyter. Each of these commands begins with a % symbol, and there are specific ways to modify them (a couple of examples follow this list).
%browse – lists the first 200 rows
%head – first 10 rows
%tail – last 10 rows
%set – can change the graph_format, graph_scale, graph_width, and graph_height
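For example, here are a couple of sketches of the %set syntax (setting name followed by value, per the stata_kernel documentation; the values here are just illustrations):
%set graph_format png
%set graph_width 600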
How do I use SFI to interface Stata and Python?
Stata and Python talk to each other using the Stata function interface, or SFI. MORE TO COME ON THIS.
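In the meantime, here’s a minimal sketch using two of the SFI classes (Data and Macro, both documented in Stata 16’s SFI reference) from within a python/end block; the local macro name is a made-up example:
python
from sfi import Data, Macro
# count the observations in the dataset currently in memory
print(Data.getObsTotal())
# create a Stata local macro from within Python
Macro.setLocal("fromPython", "hello from Python")
end
display "`fromPython'"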
I love restricted cubic splines, made famous by Frank Harrell (see his approach starting on page 58 here). Dr. Harrell made a package for automating these in R. I’m not aware of an equivalent package for Stata. Here’s my approach to making this specific restricted cubic spline in Stata.
The model here is modified Poisson regression using the Zou 2004 method, since the outcome is binary. Since it’s coded as a GLM, it’s relatively easy to swap out this one specific model for other models, like logistic regression, using the appropriate link and family. It’s good habit to show the probability density of the exposure across its continuum, so that is plopped on the bottom here. It wouldn’t take much work to replace it with a histogram.
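For example, here’s a sketch of that swap, reusing the same right-hand side as the group 1 model in the code below; with the binomial family and logit link, –eform– reports odds ratios rather than risk ratios:
glm ${outcome} c.${exposure}_spline1* ${covariates} ${thing1}, ///
fam(binomial) link(logit) eform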
This is for two groups (group1=blue, group2=red). It wouldn’t be too hard to make this for just one group, deleting everything having to do with group 2 or having the number 2 in it. Or, duplicate lines for a group 3. Also, sometimes folks like to present a kernel density plot for each outcome, so you’d just duplicate the kdensity lines and add some code to specify that one of each is for folks with outcome==0 then outcome==1.
One quirk is that the xbrcspline code depends on a list of all of the possible values of the exposure, which is brought in from –levelsof– to xbrcspline as a numlist. As you can see in –help limits–, numlists are capped at 2,500 numbers. If you get an error saying “values() invalid — invalid numlist has too many elements”, then you have too many distinct values for your exposure. This might be because your exposure is a continuous variable with a huge number of digits after the decimal place, so your >2,500 observations each come with their own unique value for the exposure. Rounding can reduce this count, if it makes sense analytically and clinically. (This is okay for BMI, for example, because there isn’t a clinically relevant difference between a BMI of 27.0400558235 and 27.04.) To generate a rounded value, plop –gen bmi_round=round(bmi, 0.01)– in the first few lines, right after opening your data. Then use “bmi_round” as your exposure variable rather than “bmi”.
You should also consider centering your exposure variables at the mean. To do this, use the –sum– command for each variable, then –gen [variable’s name]_center = [variable’s name] - r(mean)–. Then use “[variable’s name]_center” as your exposure variables.
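For example, with a hypothetical exposure named bmi:
sum bmi
gen bmi_center = bmi - r(mean)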
This code is a mostly rewritten version of one that is used at the Welch Center at Hopkins, where it has been handed down through generations of doctoral students and post-docs.
I’ve recently needed to make these figures using pweights, so I put some comments throughout to simplify that process.
Note: One glitch with WordPress (the blog that hosts this) is that it can’t render certain code below unless it’s the final line in a block of code, so I’ve had to chop it up this script, which should be a single block of code, into several blocks of code. If you copy and paste this into a do file and try to run it, you’ll notice some blank lines in the “Make a Figure” section that you’ll need to delete.
Enjoy!
Code to make the figure above
*************************************************
***************load data!************************
****************and drop any open graphs*********
*************************************************
webuse fvex, clear
graph drop _all
macro drop _all
// if weighted, declare weighting here
*************************************************
**************define variables here**************
*************************************************
// Define the OUTCOME, = to 0 or 1, after the (first) word "outcome"
global outcome outcome // this database's outcome is called outcome...
// Define the EXPOSURE following the word "exposure"
global exposure age // this has to be a continuous variable
// Define the COVARIATES following word "covariate"
global covariates sex distance
// Define what makes GROUP 1 here after "thing1" (blue)
global thing1 if arm==1
// Define what makes GROUP 2 here after "thing2" (red)
global thing2 if arm==2
*************************************************
************auto-generate subgroups and**********
*****************local macros needed*************
********************for spline*******************
*************************************************
// need to make the exposure by group for the kernel density plots.
gen exposuresubgroup1 = ${exposure} ${thing1}
gen exposuresubgroup2 = ${exposure} ${thing2}
// The xbrcspline command below needs the middle value of the exposure
// for each group. This is often times the median, but if there are
// an even number of values for the exposure, the median will be an
// average of the two middle values of the exposure variable. The following
// code defines the median as a local macro (median1temp) then grabs the
// actual value for the exposure closest to that number (median_1).
// _pctile can be used with pweight, if needed.
//
_pctile exposuresubgroup1, p(50)
// _pctile exposuresubgroup1 [pweight=samplingweight], p(50)
local median1temp = r(r1)
gen mediandiff1=abs(exposuresubgroup1-`median1temp')
gsort mediandiff1
local median_1 = ${exposure}[1]
//
_pctile exposuresubgroup2, p(50)
// _pctile exposuresubgroup2 [pweight=samplingweight], p(50)
local median2temp = r(r1)
gen mediandiff2=abs(exposuresubgroup2-`median2temp')
gsort mediandiff2
local median_2 = ${exposure}[1]
//
// get rid of these variables, since they are not needed anymore
drop mediandiff1 mediandiff2
*************************************************
*************make splines and view knots!********
*************************************************
// NOTE: look at the stata output here, the knots will be displayed
// you'll need to list these in the footer
//
// Harrell's method recommends using the # of knots ('k') as 3, 4 or 5
// with n >=100 as 5 and n<30 as 3. It's just a guideline.
// Details are on page 58 here:
// https://hbiostat.org/doc/rms.pdf
mkspline ${exposure}_spline1=exposuresubgroup1, nknots(4) cubic displayknots
mat knots1=r(knots)
mkspline ${exposure}_spline2=exposuresubgroup2, nknots(4) cubic displayknots
mat knots2=r(knots)
// above won't work for pweight weighted data. instead, you need
// to specifically define the percentiles from the table on pg 5 here:
// https://www.stata.com/manuals13/rmkspline.pdf
// But basically,
// 3 knots, percentiles are at: 10 50 90
// 4 knots, percentiles are at: 5 35 65 95
// 5 knots, percentiles are at: 5 27.5 50 72.5 95
// 6 knots, percentiles are at: 5 23 41 59 77 95
// 7 knots, percentiles are at: 2.5 18.33 34.17 50 65.83 81.67 97.5
// the code for 4 knots follows in hidden code:
//
//_pctile exposuresubgroup1 [pweight=samplingweight], p(5 35 65 95)
//return list
//local gr1knot1 = r(r1)
//local gr1knot2 = r(r2)
//local gr1knot3 = r(r3)
//local gr1knot4 = r(r4)
//di "Knots for group 1 are at " `gr1knot1' ", " `gr1knot2' ", " ///
//`gr1knot3' ", " `gr1knot4' "."
//
//
//mkspline ${exposure}_spline1=exposuresubgroup1, ///
//knots(`gr1knot1' `gr1knot2' `gr1knot3' `gr1knot4' ) ///
//cubic displayknots
//
//mat knots1=r(knots)
//
//_pctile exposuresubgroup2 [pweight=samplingweight], p(5 35 65 95)
//return list
//local gr2knot1 = r(r1)
//local gr2knot2 = r(r2)
//local gr2knot3 = r(r3)
//local gr2knot4 = r(r4)
//
//di "Knots for group 2 are at " `gr2knot1' ", " `gr2knot2' ", " ///
// `gr2knot3' ", " `gr2knot4' "."
//
//
//mkspline ${exposure}_spline2=exposuresubgroup2, ///
//knots(`gr2knot1' `gr2knot2' `gr2knot3' `gr2knot4') ///
//cubic displayknots
//
//mat knots2=r(knots)
*************************************************
********Generate models to use in splines********
*************************************************
// model time!
// the models here are modified poisson regressions using sandwich variance estimators
// which is the Zou method from this classic paper:
// https://pubmed.ncbi.nlm.nih.gov/15033648/
// GLMs can be used with pweight, if needed.
//
// First group
glm ${outcome} c.${exposure}_spline1* ${covariates} ${thing1}, ///
fam(poisson) link(log) eform robust
// robust is required for modified poisson regression by zou's method
// (i.e., modified poisson regression with sandwich variance estimators)
// for pweight or survey data, use:
// svy: glm ${outcome} c.${exposure}_spline1* ${covariates} ${thing1}, ///
// fam(poisson) link(log) eform
levelsof(${exposure}) ${thing1} // generates r(levels) for next line
xbrcspline ${exposure}_spline1, values(`r(levels)') ///
ref(`median_1') matknots(knots1) ///
eform gen(lpnt1 hr1 lb1 ub1)
// Second group
glm ${outcome} c.${exposure}_spline2* ${covariates} ${thing2}, ///
fam(poisson) link(log) eform robust
// for pweight or survey data, use:
// svy: glm ${outcome} c.${exposure}_spline2* ${covariates} ${thing2}, ///
// fam(poisson) link(log) eform
//
levelsof(${exposure}) ${thing2} // generates r(levels) for next line
xbrcspline ${exposure}_spline2, values(`r(levels)') ///
ref(`median_2') matknots(knots2) ///
eform gen(lpnt2 hr2 lb2 ub2)
*************************************************
***********generate extremes to drop*************
*************************************************
// should drop the extremes, here the 0.5 and 99.5th percentiles
// but need to define these as local macros.
// _pctile can be used with pweight, if needed.
_pctile exposuresubgroup1, p(0.5 99.5)
return list
local cut_a1 = r(r1)
local cut_b1 = r(r2)
_pctile exposuresubgroup2, p(0.5 99.5)
return list
local cut_a2 = r(r1)
local cut_b2 = r(r2)
*************************************************
***************make the figure*******************
*************************************************
set scheme s1mono // my favorite scheme
twoway ///
/// spline for group 1:
(line hr1 lpnt1 if lpnt1 > `cut_a1' & ///
lpnt1 < `cut_b1' , yaxis(1) lp(solid) lc(blue) lwidth(medthick) ) ///
/// spline for group 2:
(line hr2 lpnt2 if lpnt2 > `cut_a2' & ///
lpnt2 < `cut_b2' , yaxis(1) lp(solid) lc(red) lwidth(medthick) ) ///
/// kernel density for one group
/// note: can alternatively install kdens and moremata to
/// use pweighting in kernel density.
(kdensity exposuresubgroup1 if exposuresubgroup1 > `cut_a1' & ///
exposuresubgroup1 < `cut_b1' , yaxis(2) lp(shortdash) lcolor(blue)) ///
/// kernel density for other group
(kdensity exposuresubgroup2 if exposuresubgroup2 > `cut_a2' & ///
exposuresubgroup2 < `cut_b2' , yaxis(2) lp(shortdash) lcolor(red)) ///
, ///
yline(1, lpattern(solid) lcolor(black) axis(1)) ///
/// labels for the left axis, hide the following line if you aren't sure
/// of what the range should be:
ylabel(0.25 "0.25" 0.5 "0.5" 1 "1" 2.5 "2.5" 5 "5", axis(1) angle(0)) ///
/// range for the left axis. the BOTTOM of this range has to be WAAAY lower
/// than the bottom ylabel in the line above in order for it to sit
/// on the way top of the figure.
yscale(r(0.001 2) log axis(1)) /// log scale here
/// ditto for the right axis, but the TOP of the range on the yscale needs
/// to be WAAAY higher than the top ylabel
ylabel(0 "0" 0.01 "0.01" 0.02 "0.02" 0.03 "0.03" 0.04 "0.04" 0.05 "0.05", ///
axis(2) labsize(vsmall) angle(0)) ///
yscale(r(0.0 .2) axis(2)) /// not log scale here
/// you'll notice that the x label doesn't span the entire
/// range of the exposure because the local macro cuts above.
///xlabel(20(5)60) in case you want to specify the x axis
/// for the titles, need to put a bunch of spaces so things align
ytitle(" {bf:Risk Ratio (95% CI)}", axis(1)) ///
ytitle("{bf:Probability Density} ", ///
justification(left) axis(2)) ///
xtitle("{bf:Exposure Here}") ///
title("{bf:The Title}") ///
legend(order(1 "Group 1" 5 "Group 2")) ///
name(mygraph1)
*************************************************
*****************Save figure!********************
*************************************************
// save as png, change to tif if needed for submission
graph export "mygraph1.png", replace width(1000)