Using Stata’s Frames feature to build an analytical dataset

Stata 16 introduced the new Frames functionality, which allows multiple datasets to be stored in memory, with each dataset stored in its own “Frame”. This allows for dynamic manipulation of multiple datasets across multiple Frames. Stata is still simplest to use when manipulating a single dataset (or, frame). So, Stata users will probably be interested in building a single dataset/Frame for a specific analysis that is built from variables taken from multiple datasets/Frames.

One handy application of Frames is to import non-Stata datasets as separate frames and combine them (really, merge) into a single analytical dataset/Frame. Before using Frames, I had previously imported non-Stata datasets, saved them locally, then merged them 1 by 1. With Frames, you just import each dataset into its own Frame, and “merge” them directly, skipping the intermediate “save as Stata dta file” step.

Here’s my approach to building a single analytical dataset from multiple imported datasets, with frames. We’ll do this with NHANES data. This is a modification of the code on this post.

Step 1: Drop (reset) all frames, create new ones to import the new datasets, and run commands within each new frame to import NHANES/SAS datasets.

// Here's the NHANES website, FYI:
// https://wwwn.cdc.gov/nchs/nhanes/default.aspx
//
// drop all frames from memory. This will delete all unsaved data so be careful!!
frames reset
//
// make a blank frame called "DEMO_F"
// you could type "frame create" or the brief synonym "mkf" for 
// "make frame"
mkf DEMO_F
// ...Then run the sas import command within it, grabbing it from the CDC website. 
// Since this command needs to be run from within the DEMO_F frame,
// we can tell Stata to run the command from that frame without 
// actually changing to it using the "frame [name]:" prefix
frame DEMO_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear 
//
// ditto for the "BPQ_F" and "KIQ_U_F" datasets.
mkf BPQ_F
frame BPQ_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
//
mkf KIQ_U_F
frame KIQ_U_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
// 
// let's see a list of current frames:
frames dir 
//
// Which frame are you using though? 
// pwf is present working frame, or the current one in use.
pwf 
// You'll see that the pwf if "default". 

Step 2: Create an “analytical” frame that will contain the data you need to complete your analysis, and copy the variable that links all of your data to this frame. Also switch to that new analytical frame.

// The "linking" variable in this dataset is called "seqn", which 
// we will copy ("put") from the DEMO_F frame. 
// (Your file might have a linking variable called "id".)
// This creates a new frame called "analytical" and also moves the "seqn"
// variable from DEMO_F to the new analytical frame in one line.
frame DEMO_F: frame put seqn, into(analytical)
//
// see current list of frames and present working frame:
frames dir
pwf
// Now change from the default frame to the new analytical one.
// You can change frames with "cwf" for "change working frame".
cwf analytical

Step 3: Now that you are are in the new analytical frame, link all of your frames using the “linking” variable (“seqn” here).

// use the "frlink" command to link frames. This can be 1:1 linking or 1:m.
//
// remember that your cwf should be analytical right now. 
//
frlink 1:1 seqn, frame(DEMO_F)
frlink 1:1 seqn, frame(BPQ_F)
frlink 1:1 seqn, frame(KIQ_U_F)
// 
// You are still within the "analytical" frame, but now your frames are all 
// linked or connected to each other. 
// if you look at your dataset with the --browse-- command, you'll see there
// are now new DEMO_F, BPQ_F, and KIQ_U_F variables. These are the "rows" for
// linked IDs in those other frames, so Stata knows where to look for 
// variables in those other frames. 

Step 4: “Get” the specific variables you want from each frame. This is how you merge individual variables from multiple frames.

// from my NHANES post, we need to grab weighting variables from DEMO_F
// and a BP variable from BPQ_F. Just for kicks, we'll also grab self-
// reported "weak kidneys" from KIQ_U_F. 
// 
// remember that your cwf should be analytical right now. 
//
frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_F)
frget bpq020, from(BPQ_F)
frget kiq022, from(KIQ_U_F)
//
// now you have a nice merged database! 

Step 5 (optional): If you are satisfied with your analytical dataset and no longer need the other frames, you can now drop the “linking” variables from the analytical dataset, save your analytical dataset, clear your frames, and reopen your analytical dataset

// You can save this analytical frame as a Stata
// dataset now if you are done manipulating the other frames. 
// 
// You might opt to drop the new linking variables prior to saving
// for simplicity.
drop DEMO_F BPQ_F KIQ_U_F
//
// Stata's "save" command won't save other frames as FYI, just the pwf.
// But in this example, we are done with other frames. 
save analytical.dta, replace
//
// Drop all other frames so you don't get an annoying pop-up 
// about unsaved frames in memory. 
// Be careful! This will drop all data from memory!!
frames reset 
//
// now reopen your previously saved analytical dataset.
use analytical.dta, clear

Bonus: Here’s steps 1-4 in a single loop using global macros

For advanced users: Here’s a loop and some global macros that is adaptable to downloading several years. NHANES uses different letters for files from different years, the 2009-2010 one uses “F”.

frames reset
global url "https://wwwn.cdc.gov/Nchs/Nhanes/"
global F "2009-2010/"
global files DEMO BPQ KIQ_U

foreach x in F {
	foreach y in $files {
		mkf `y'_`x'
		frame `y'_`x': import sasxport5 "${url}${`x'}`y'_`x'.xpt"
	}
frame DEMO_`x': frame put seqn, into(analytical_`x')
cwf analytical_`x'
	foreach y in $files {
		frlink 1:1 seqn, frame(`y'_`x') 
	}
	frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_`x')
	frget bpq020, from(BPQ_`x')
	frget kiq022, from(KIQ_U_`x')
}
frames dir 
pwf 

Here’s the same as above, but just for a single year.

frames reset
global dir "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/"
global files DEMO_F BPQ_F KIQ_U_F

foreach y in $files {
	mkf `y'
	frame `y': import sasxport5 "${dir}`y'.xpt"
}
frame DEMO_F: frame put seqn, into(analytical)
cwf analytical
foreach y in $files {
	frlink 1:1 seqn, frame(`y') 
}
frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_F)
frget bpq020, from(BPQ_F)
frget kiq022, from(KIQ_U_F)
frames dir 
pwf 

Mediation analysis in Stata using IORW (inverse odds ratio-weighted mediation)

Mediation is a commonly-used tool in epidemiology. Inverse odds ratio-weighted (IORW) mediation was described in 2013 by Eric J. Tchetgen Tchetgen in this publication. It’s a robust mediation technique that can be used in many sorts of analyses, including logistic regression, modified Poisson regression, etc. It is also considered valid if there is an observed exposure*mediator interaction on the outcome.

There have been a handful of publications that describe the implementation of IORW (and its cousin inverse odds weighting, or IOW) in Stata, including this 2015 AJE paper by Quynh C Nguyen and this 2019 BMJ open paper by Muhammad Zakir Hossin (see the supplements of each for actual code). I recently had to implement this in a REGARDS project using a binary mediation variable and wanted to post my code here to simplify things. Check out the Nguyen paper above if you need to modify the following code to run IOW instead of IOWR, or if you are using a continuous mediation variable, rather than a binary one.

A huge thank you to Charlie Nicoli MD, Leann Long PhD, and Boyi Guo (almost) PhD who helped clarify implementation pieces. Also, Charlie wrote about 90% of this code so it’s really his work. I mostly cleaned it up, and clarified the approach as integrated in the examples from Drs. Nguyen and Hossin from the papers above.

IORW using pweight data (see below for unweighted version)

The particular analysis I was running uses pweighting. This code won’t work in data that doesn’t use weighting. This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. (Although you'll probably 
// have to adjust the svyset commands in the 
// program below to work in your dataset, in all 
// actuality)
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
//
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 2.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Components of weighting to go into "svyset" command. 
// You might have 
global samplingweight mysamplingweightvar
global id id_num // ID for your participants
global strata mystratavar
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw_weighted
program iorw_weighted, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
*Step 1: svyset data since your dataset is weighted. If your dataset 
// does NOT require weighting for its analysis, do not use this program. 
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
//
*Step 2: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
svy linearized, subpop(${ifstatement}): logit ${exposure} ${mediator} ${covariates}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
//
*Step 3: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights.  Since our dataset uses sampling
// weights, we need to multiply the dataset's weights times the IOR for the 
// exposure group. This step is fundamentally different for non-weighted 
// datasets. 
gen weight_iorw = ${samplingweight} if ${exposure}==0
replace weight_iorw = inverseoddsratio*${samplingweight} if ${exposure}==1
//
*Step 4: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. 
// (same as direct effect, but without the IORW)
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log) 
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 5: DIRECT EFFECTS; using IORW weights instead of the weighting that
// is used typically in your analysis. 
svyset ${id}, strata(${strata}) weight(weight_iorw) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 6: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 7: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 8: Now run the above bootstraping program
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw_weighted
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // this is the 95% confidence intervals using the "percentiles" (ie 2.5th and 97.5th percentiles) rather than the default normal distribution method in stata. The rational for using percentiles rather than normal distribution is found in the 4th paragraph of the "analyses" section here (by refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title}*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951'
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952'
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953'
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
}
// the end.

IORW for datasets that don’t use weighting (probably the one you are looking for)

Here is the code above, except without consideration of weighting in the overall dataset. (Obviously, IORW uses weighting itself.) This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. 
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 2.
global outcome  mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw
program iorw, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
//
*Step 1: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
logit ${exposure} ${mediator} ${covariates} ${ifstatement}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
//
*Step 2: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights. 
gen weight_iorw = 1 if ${exposure}==0
replace weight_iorw = inverseoddsratio if ${exposure}==1
//
*Step 3: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. (same as direct effect, but without the IORW)
glm ${outcome} ${exposure} ${covariates} ${ifstatement}, family(poisson) link(log) vce(robust)
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 4: DIRECT EFFECTS; using IORW weights
glm ${outcome} ${exposure} ${covariates} ${ifstatement} [pweight=weight_iorw], family(poisson) link(log) vce(robust)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 5: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 6: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 7: Now run the above bootstraping program
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // this is the 95% confidence intervals using the "percentiles" (ie 2.5th and 97.5th percentiles) rather than the default normal distribution method in stata. The rational for using percentiles rather than normal distribution is found in the 4th paragraph of the "analyses" section here (by refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title}*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951'
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952'
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953'
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
}
// the end.

Rounding/formatting a value while creating or displaying a Stata local or global macro

I love Stata’s macro functionality. It allows you to grab information from r- or e-level data after executing a Stata command then calling that back later. (Local macros are temporary, global macros are more persistent.) This is particularly useful when generating summary statistics collected in macros after the –sum– command, or displaying a subset of components from a regression, such as the beta coefficient and 95% confidence intervals, or P-values (details on how to manipulate regression function results with macros are here).

One problem in using macros is that raw r- or e-level data are really long and not amenable to output in tables for publication without formatting. I’ve hadn’t previously been able to apply formatting (eg %4.2f) while generating macros, outside of applying the “round” command. (I don’t like the round command because it’s tricky to code in a program for reasons I won’t get into.) Instead, I have applied the number formatting when displaying the macro. That creates issues when generating string output from numerical macros, since my prior strategy of applying numerical formatting (eg %4.2f) didn’t work when displaying a numerical macro in a string (i.e., embedding it within quotations). Instead, I wanted to apply the format while also generating the macro itself.

I came across this post on the Stata List by Nick Cox, which details how to do just that: https://www.stata.com/statalist/archive/2011-05/msg00269.html

It turns out that not only can you apply formatting while generating a macro with the “: display” subcommand, you can also trim extra spaces from the generated macro at the same time. (Note that the “trim” command has been replaced by “strltrim” and a lot of other related commands that you can find in –help string functions–.) As a bonus, it turns out that you can also apply formatting (e.g., %4.2f) of a macro when displaying within a string using the “: display” subcommand strategically surrounded by opening and closing tick marks.

Here’s a demo script of what I’m getting at.

(Note: I strongly recommend against formatting/rounding/reducing precision of any macros that you are generating if you will later do math on them. Formatting during generation of macros is only useful for macros intended to be displayed later on.)

sysuse auto, clear
sum mpg
//
// 1. generate an unformatted macro, equals sign 
//    is required
local mpg_mean = r(mean) 
// 
// 2. apply formatting while generating the local 
//    macro, note the colon for the macro 
//    subcommand, instead of an equals sign
local mpg_mean_fmt: display %10.2f r(mean)
//
// 3. here's formatting mixed with a trim command,
//    this combines an equals sign and colon. 
local mpg_mean_neat = strltrim("`: display %10.2f r(mean)'")
//
//
// Now let's call the unformatted macro from #1:
display "here it is unformatted: `mpg_mean'"
// note that it's really long after the decimal
//
// Let's apply formatting to #1 OUTSIDE of quotations:
display "here it is formatted " %10.2f `mpg_mean'
// ...and here's formatting to #1 WITHIN quotations:
display "here it is formatted `: display %10.2f `mpg_mean''"
// see all of the leading spaces? Using %3.2f would fix 
// that, but I wanted to show the trim function.
//
// here's the format applied during macro generation in 
// #2 (without a trim function):
display "here's an alt format: `mpg_mean_fmt'"
// still lots of leading spaces.
//
// here's trimming and formatting mixed from #3:
display "here's fmt & trim: `mpg_mean_neat'"
// Bingo!

Extracting variable labels and categorical/ordinal value labels in Stata

Stata allows the labeling of variables and also the individual values of categorical or ordinal variable values. For example, in the –sysuse auto– database, “foreign” is labeled as “Car origin”, 0 is “Domestic”, and 1 is “Foreign”. It isn’t terribly intuitive to extract the variable label of foreign (here, “Car origin”) or the labels from the categorical values (here, “Domestic” and “Foreign”).

Here’s a script that you might find helpful to extract these labels and save them as macros that can later be called back. This example generates a second string variable that applies those labels. I use this sort of code to automate the labeling of figures with the value labels, but this is a pretty simple example for now.

Remember to run the entire script from top to bottom or else Stata might drop your macros. Good luck!

sysuse auto, clear
// Note: saving variable label details at 
//       --help macro--, under "Macro functions 
//       for extracting data attributes"
// 
// 1. Extract the label for the variable itself.
//    If we look at the --codebook-- for the 
//    variable "foreign", we see...
//
codebook foreign
//
//    ...that the label for "foreign" is "Car 
//    origin" (see it in the top right of 
//    the output).  Here's how we grab the 
//    label of "foreign", save it as a macro, 
//    and print it.
//
local foreign_lab: variable label foreign
di "`foreign_lab'"
//
// 2. Extract the label for the values of the variable.
//    If we look at --codebook-- again for "foreign", 
//    we see...
//
codebook foreign
// 
//    ...that 0 is "Domestic" and 1 is "Foreign". 
//    Here's how to grab those labels, save macros, 
//    and print them
//
local foreign_vallab_0: label (foreign) 0 
local foreign_vallab_1: label (foreign) 1 
di "The label of `foreign_lab' 0 is `foreign_vallab_0' and 1 is `foreign_vallab_1'"
//
// 3. Now you can make a variable for the value labels.
//  
gen strL foreign_vallab = "" // this makes a string 
replace foreign_vallab="`foreign_vallab_0'" if foreign==0
replace foreign_vallab="`foreign_vallab_1'" if foreign==1
// 
// 4. You can also label this new string variable 
//    using the label from #1
// 
label variable foreign_vallab "`foreign_lab' as string"
// 
// BONUS: You can automate this with a loop, using 
//    --levelsof-- to extract options for each 
//    categorical variable. There is only one 
//    labeled categorical variable in this dataset 
//    (foreign) so this loop only uses the single one. 
// 
sysuse auto, clear
foreach x in foreign {
	local `x'_lab: variable label `x' // #1 from above
	gen strL `x'_vallab = "" // start of #3
	label variable `x'_vallab "``x'_lab' string"
	levelsof `x', local(valrange)
	foreach n of numlist `valrange' { 
		local `x'_vallab_`n': label (`x') `n' // #2 from above
		replace `x'_vallab = "``x'_vallab_`n''" if `x' == `n' // end of #3
	}
}

Stata-R integration with Rcall

Stata is great because of its intuitive syntax, reasonable learning curve, and dependable implementation. There’s some cutting edge functionality and graphical tools in R that are missing in Stata. I came across the Rcall package that allows Stata to interface with R and use some of these advanced features. (Note: I’m currently using Stata 16. I have no reason to think that this wouldn’t work with Stata 13 or newer.)

Installation of R, R packages, and Rcall (you only need to do this once)

Download and install R. Open R and install the readstata13 package, which is required to install Rcall. While you’re at it, install ggplot2 and ggstatsplot. Note: ggplot2 is included in the excellent multi-package collection called Tidyverse. We are going to install Tidyverse instead of ggplot2 alone. Tidyverse also installs several other packages useful in data science that you might need later. This includes dplyr, tidyr, readr, purrr, tibble, stringr, forcats, import, wrangle, program, and model. I have also gotten an error saying “‘Rcpp_precious_remove’ not provided by package ‘Rcpp'”, which was fixed by installing Rcpp, so install that too.

In R, type:

install.packages("readstata13")
install.packages("tidyverse")
install.packages("ggstatsplot")
install.packages("Rcpp")

It’ll prompt you to set up an install directory and choose your mirror/repository. Just pick one geographically close to you. After these finish installing, you can close R.

Rcall’s installation is within Stata (as usual for Stata programs) but originates from Github, not the usual SSC install. You need to install a separate package to allow you to install things from Github in Stata. From the Stata command line, type:

net install github, from("https://haghish.github.io/github/")

Now install Rcall itself from the Stata command line:

github install haghish/rcall, stable

If all goes well, it should install!

Using Rcall

You should read details on the Rcall help file (type –help rcall– in Stata) for an overview. Also read the Rcall overview on Github. In brief, you can send datasets from Stata to R using –rcall st.data()–. –rcall clear– reboots R as a new instance.

There are four modes for using Rcall: vanilla, sync, interactive, and console. For our purposes, we are going to focus on the interactive mode since this allows you to manipulate R from within a do file.

Let’s make a figure in ggplot2 using Stata and Rcall

Here’s some demo code to make a figure with ggplot2, which is the standard for figures in R. There’s a handy cheat sheet here. This intro page is quite helpful. This overview is excellent. Check out the demo figures from this page as well. If your ggplot command extends across multiple lines, make sure to end each line (except the final line) with the three forward slash (“///”) line break notation that is used by Stata.

// load sysuse auto dataset
sysuse auto, clear
// set up rcall, clean session and load necessary packages
rcall clear // starts with a new instance of R
rcall: library(ggplot2) // load the ggplot2 library
// move Stata's auto dataset over to R and prove it's there.
rcall: data<- st.data() // move auto dataset to r
rcall: names(data) // prove that you can see the variables. 
rcall: head(data, n=10) // now look at the first 10 rows of the data in R
// now make a scatterplot with ggplot2, note the three slashes for line break
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
       geom_point()
rcall: ggsave("ggtest.png", plot=e)
// figure out where that PNG is saved:
rcall: getwd()

Here’s what it made! It was saved in my Documents folder, but check the output above to see where you working directory is.

You can get much more complex with the figure, like specifying colors by foreign status, specifying dot size by headroom size, adding a loess curve with 95% CI, and adding some labels. You can swap out the “rcall: e <- ggplot(…)" bit above for the following. Remember to end every non-final line with the three forward slashes.

rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
	geom_point(aes(col=foreign, size=headroom)) + ///
	geom_smooth(method="loess") +   ///
	labs(title="ggplot2 demo", x="MPG", y="Weight", caption="Caption!")

Here’s what I got. Varying dot size by a third variable can be done in Stata using weighted markers, as FYI.

Let’s make a figure in ggstatsplot using Stata and Rcall

Here’s some demo code to make a figure with ggstatsplot (which is very awesome and you should check it out). If your ggstatsplot command extends across multiple lines, make sure to end each line (except the final line) with the three forward slash (“///”) line break notation that is used by Stata.

// load sysuse auto dataset
sysuse auto, clear
// set up rcall, clean session and load necessary packages
rcall clear // starts with a new instance of R
rcall: library(ggstatsplot) // load the ggstatsplot library
rcall: library(ggplot2) // need ggplot2 to save the png
// move Stata's auto dataset over to R and prove it's there.
rcall: data<- st.data() // move auto dataset to r
rcall: names(data) // prove that you can see the variables. 
rcall: head(data, n=10) // now look at the first 10 rows of the data in R
// let's make a violin plot using ggstatsplot
rcall: f <- ggbetweenstats( data = data, x=foreign, y=weight, title="title")
rcall: ggsave("ggstatsplottest.png", plot=f)
// figure out where that PNG is saved:
rcall: getwd()

If you check your working directory (it was my “Documents” folder in Windows), you’ll find this figure as a PNG:

You can automate the output of ggstatsplot figures by editing the ggplot2 components that make it up. You’d insert the following into the ggstats plot code in the parentheses following “ggbetweenstats” to make the y scale on a log axis, for example:

ggplot.component = ggplot2::scale_y_continuous(trans='log') 

Quick do file to automate R-Stata integration and make ggplot2 or ggstatsplot figures

I made a do file that simplifies the setup of Rcall. Specifically, it 1. Sets R’s working directory to match your current Stata working directory, 2. Starts with a fresh R install, 3. Loads your current Stata dataset in R, and 3. Loads ggplot2 and ggstatsplot in R.

To use, just load your data, run a “do” command followed by the URL to my do file, then run whatever ggplot2 or ggstatsplots commands you want.This assumes you have installed R, the required packages, and Rcall (see very top of this page). If you get an error, try using this alternative version of the do file that doesn’t try to match Stata and R’s working directory.

Example code:

// Step 1: open dataset
sysuse auto, clear
// Step 2: run the do file, hosted on my UVM directory:
do http://www.uvm.edu/~tbplante/rcall_ggplot2_ggstatsplot_setup_v1_0.do
// if errors with above, use this do file instead:
// do http://www.uvm.edu/~tbplante/rcall_ggplot2_ggstatsplot_setup_alt_v1_0.do
// Step 3: run whatever ggplot2 or ggstatsplot code you want:
rcall: e<- ggplot(data, aes(x=mpg, y=weight)) + ///
	geom_point(aes(col=foreign, size=headroom)) + ///
	geom_smooth(method="loess") +   ///
	labs(title="ggplot2 demo", x="MPG", y="Weight", caption="Caption!")
rcall: ggsave("ggtest.png", plot=e)

Part 3: Introduction to Stata

Stata is a popular commercial statistical software package that was first released 30+ years ago. It has some really nice features, loads of top-rate documentation, a very active community, and approachable syntax. For beginners, I think it’s the simplest to learn.

Learning how to use Stata

Stata has really, really, really good documentation.

The documentation is outstanding. Let’s say that you want to learn how to use the –destring– command. In the command line (1a under “Stata’s Interface” below), type:

help destring

…and up will pop a focused help file. There’s the “View complete PDF manual entry” option that has EXTENSIVE documentation of the command. (Note: This file seems to only work well with Adobe PDF reader, not alternative PDF readers like Sumatra). If the focused help file isn’t sufficient to answer your questions, try the complete PDF manual.

The focused help file has multiple parts, but the syntax example is gold. Further down you’ll see example uses of the command.

Web searches will find even more answers

Odds are that someone has already hit the same problem you have in using Stata. Queries in your favorite search engine are likely to find answers on the Statalist archive or UCLA’s excellent website.

You can install Stata programs that other users have written

There are MANY MANY MANY user-written programs out there that can be installed and used in your code. You only need to install them once. Most are on BU’s repository called SSC. I use the table1_mc program extensively (it makes pretty table 1s, you can read about it here). To install table1_mc from SSC, you type:

ssc install table1_mc

…and Stata will download it and install it for you. It’s ready to use when it finishes installing. And, there’s no need to re-install it, it will load each time you start Stata.

Quirks of Stata

Stata only works with rectangular datasets

Think of a rectangular dataset as a single spreadsheet in Excel. It has vertical columns (like a y axis) and horizontal rows (like an x axis). There’s no data on a Z axis coming out of the computer at your face.

A rectangular dataset is the only type that Stata works with. Other statistical software like R or Python can handle many more complex data structures. For learners, forcing data to fit within a rectangular dataset is a huge advantage in my mind since that structure is intuitive, and you can always browse your data with the built-in data browser (see 3c under Stata’s Interface, below).

Stata only works with one dataset at a time*

One dataset in Stata is akin to one spreadsheet in a workbook in Excel. In Excel, you can have multiple spreadsheets in one .xlsx file, with each spreadsheet appearing on a different tab at the bottom. All spreadsheets are in the memory at the same time. You can do math across spreadsheets in a workbook in excel, summarize costs in one column in spreadsheet A and have the result appear in one cell on spreadsheet B. In Stata, you can only have one spreadsheet (here, dataset) open at a time.* Because of this, Stata users spend a good deal of time merging and appending multiple datasets to make a single dataset that has all of the necessary variables in the best format from the get-go.

A big problem historically with Stata was that datasets are loaded in the RAM, and big datasets would be too big for conventional computers. That’s not an issue anymore since even cheap computers have several gigabytes of RAM.

*This isn’t true anymore. Starting in version 16, Stata can actually now have multiple datasets in memory, each stored in its own frame. These frames can be very useful in certain scenarios, but for our purposes, we are going to pretend that you can have just one dataset open at a time.

Data are either string or numeric. Their color changes in the data browser

Strings are basically text that are thought to be words and not numbers. But sometimes a dataset will be imported wrong and things that are actually numbers (“1.5”, “2.5” in different rows of the same column) will be imported and considered to be strings and not numbers. This might be because they were imported incorrectly. This might be that later down in the list there is a word in a different cell (“1.5”, “2.5”, “Specimen error”). If any row of a variable contains something that isn’t a number, Stata makes the entire column, and with it the variable, a string.

IMPORTANT: When viewing strings in the data browser (3c under “Stata’s Interface” below), they appear in RED text. When specifying strings in commands, you need to enclose them in quotations (eg count if name==”Old”). Missing strings are two quotes with nothing in between them (eg count if name==””).

In order to do math, you need to have things be numbers. There are several different numerical formats that you can read about here. If something is an integer (nothing after the decimal), it can be byte, int, and long. If something has a decimal point, it’s float or double. Stata does a nice job selecting which numerical format your data should be in, so you probably don’t need to think much about the difference between byte, int, long, float, or double again.

IMPORTANT: When viewing numeric variables in the data browser, they appear in BLACK text (or BLUE if they have a label applied). When specifying strings in commands, no quotations are needed (eg count if quartile==1). Missing strings are periods (eg count if quartile==.), and a period is positive infinity (a missing value is bigger than a value of one billion).

To convert from a string to a numerical value (change the “1” to a 1), you use the –destring– command. You might need to include the force and replace options, but read up on those by typing –help destring–.

To convert from a numerical value to a string (change the “1” to a 1), use the –tostring– command. Note that missing numerical values will go from a dot to a dot in quotations (. becomes “.”), which is not the same as a missing value for a string, which is just empty quotations (“”). It’s a good idea to follow up a –tostring– command with a command that replaces “.” values with “” values.

Stata’s output is only 255 characters wide, max

The output window of Stata will print (“display”) the inputted command and results from that command. It will clip the output at up to 255 characters, and insert a line break to the next row. You can specify:

set linesize 255

…so that the output is always 255 characters wide. Otherwise, it’ll adjust the output to match how wide your output window is.

The working directory is your “documents folder” unless you manually set the working directory with the cd command or open up Stata by double clicking on a .do file in Windows explorer

The working directory is where Stata is working from. If you save a dataset with the –save– command, it’ll save it in the working directory unless you specify all of the files from the C: drive on. If you double click on the Stata icon to open it up in Windows and type the present working directory command to see where it’s working from (that’s –pwd–), it’ll print out:

. pwd 
C:\Users\USERNAME\Documents

So, if you type:

save "dataset.dta", replace

…it’ll save dataset.dta in C:\Users\USERNAME\Documents

Let’s say that you really want to be working in your OneDrive folder because that’s secure and backed up and your Documents folder isn’t. The directory for your desired folder is:

C:\Users\USERNAME\OneDrive\Research project\Analysis

In order to save your file there, you’d type:

save "C:\Users\USERNAME\OneDrive\Research project\Analysis\dataset.dta", replace

Note that there’s a space in the Research project folder name so the directory needs to be in quotations. If there was no space anywhere in the directory, you could omit the quotations. I’m including quotations everywhere here because it’s good practice.

One option is to change your working directory to the OneDrive folder. You use the –cd– command to do that then any save command will automatically save in that folder:

cd "C:\Users\USERNAME\OneDrive\Research project\Analysis\"
save "dataset.dta", replace

Alternatively, you can save your project’s Do file in the “C:\Users\USERNAME\OneDrive\Research project\Analysis\” folder. Rather than opening Stata by clicking on the icon, find the Do file in your OneDrive folder in Windows Explorer and double click on it. It’ll open Stata AND set that folder as the working directory!! For a new project, this means opening Stata by clicking on its icon, opening a blank do file, saving that do file in your OneDrive folder, closing the Do File Editor and Stata, then reopening stata by double clicking on your blank do file in Windows Explorer.

Stata is most effectively used with with command-line input, specifically through the Do File Editor. There is a graphical user interface that can be handy.

I think that everything in Stata should be completed through Do files. These are text files with sequential lines of codes that make Stata perform commands in order.

There is a graphical user interface (GUI) with clickable menus. You can click through commands and it’ll generate the code and run the command of interest, and these can be handy for stealing syntax to run an annoying command. The command from the GUI will appear in the Command History (1c below) and you can right click and copy/paste it into your do file.

I find –import excel– to be frustrating and use the GUI probably 90% of the time to generate that command then copy/paste the syntax into my do file.

Stata won’t let you close a dataset in the memory or overwrite an existing dataset without some effort

The –use– command will open up a dataset in the memory. If you don’t have a dataset opened yet, this will open one:

use dataset.dta

Remember that Stata can only have one dataset opened at a time, so any time you open one when you already have a different dataset opened in memory, Stata will need to drop the open dataset. If you spent a lot of time on the open dataset creating new variables or merging with other datasets, closing it will make you lose all of your work unless you have also saved it. Stata doesn’t want you to make this mistake so if you already have a dataset opened and you type in the above command, Stata will say “No” and you won’t be opening the new dataset.

Instead, you need to put “–, clear–” at the end of the command, like this:

use dataset.dta, clear

And now Stata will drop whatever you have open. It’s really just a nice check to keep you from discarding your work accidentally.

Similarly, if you are trying to save a dataset with the –save– command into an empty folder, you just need to type:

save newdataset.dta

…and Stata will save it no problem. HOWEVER if you are trying to overwrite an existing dataset with that same name, Stata will say “No” and you won’t be saving your dataset today. This is another check. instead, you just need to use “–, replace–” to overwrite. Example:

save newdataset.dta, replace

Stata’s interface

Here’s a quick overview of the Stata interface in Windows. Note: the Mac interface looks a bit different. There’s some way to make the Mac interface look like the Windows interface, but I don’t know how to do that. I’ll try to remember to update this page when I help a Mac user in the future.

  1. Ways to input and interact with commands:
    1a. Command line – This is where you type command by command. Unless you are just poking around in your data, you should avoid using this. Anything that you want to reproduce in your analysis should be done in the Do file editor.
    1b. Open Do file editor button – The Do File Editor is the most important part of Stata in my opinion. A do file is a long text file saving command after command. This is where you should do all of your analytical work.
    1c. Command history – If you use the command line or GUI to make a command, it’ll be saved here. You can right click on old commands and copy/paste them into your do file.
  2. Output window – Your command will appear here with a preceding dot (“. sysuse auto” means that I had previously typed in “sysuse auto”). The output from your do file or command will appear immediately below.
  3. Ways to interact with data
    3a. Variable list – This is a list of variables in the open dataset. You can double click on them and the variable name will be copied to the command line. You can ctrl+click and select multiple and then copy them to the clipboard. This is quite handy.
    3b. Variable and dataset properties – This will let you see details about a selected variable in the variable list and the current dataset in memory.
    3c. Data browser – You can also pop this open with the –bro– command. this views all data in a spreadsheet format that looks like Excel.

Table 1 with pweights in Stata

The very excellent table1_mc program will automate generation of your Table 1 for nearly all needs (read about it here), except for datasets using pweight. I’ve been toying around with automating Excel table generation using Stata v16+ Frames features. I recently started working on a database that requires pweighting for analyses, and opted to use this to as an opportunity to use Frames to generate the automation of a pweight adjusted Table 1.

v1.2 of my code (updated 2021-09-20) to automate this lives here: http://www.uvm.edu/~tbplante/p_weight_table1_v1_2.do

You can just put:

do http://www.uvm.edu/~tbplante/p_weight_table1_v1_2.do

…in your Stata do file and it’ll pull in the entire script! Note that full instructions will show up in the Stata output window when you run the above line. The instructions below are incomplete. This code does not produce P-values.

How to use this do file

// Step 1a: close all open frames, drop all macros, 
//          and open your dataset
frames reset
macro drop _all
webuse multistage, clear
//
// Step 1b: Figure out where your present working directory is, 
//          this is where the excel spreadsheet will be saved. 
//          Change the working directory with the "cd"
//          command as needed. 
pwd
//
// Step 2: Declare your data to be pweighted
svyset county [pweight=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
//
// Step 3: If your columns require the generation of pweighted 
//         tertiles, quartiles, or whatnot, do that now. 
//         For this example, we'll do by quartile of weight. 
// note: per this website: https://www.stata.com/support/faqs/statistics/percentiles-for-survey-data/
//       ...Only the pweight needs to be specified when making 
//        weighted quartiles. 
xtile weightquart=weight [pweight=sampwgt], n(4) 
//
// Step 4: Recode binary variables so they are 0 and 1 (if needed)
// Note: in this dataset, it's 1 and 2 for male and female, 
//       respectively. 
gen female = 1 if sex==2 // recode sex to female, where 1 is female
replace female=0 if sex==1 // male is now 0
//
// Step 5: Name your variables and options for multiple options
// Note: The variables are already labeled but we are doing it 
//       again for completeness' sake. 
//
// Continuous variables
label variable weight "Weight in lbs"
label variable height "Height in in" // I don't know why people are 400 in tall. that's 33 ft.
// Nominal variables (same process would happen for ordinal or continuous varibles)
label variable race "Race" // Race is nominal so need to also define values of race
label define racelabels 1 "White" 2 "Black" 3 "Other"
label values race racelabels // Apply the labels!!!
// Binary variables, no need to apply labels
label variable female "Female sex"
//
// Step 6: Call the do file
// Note: Instructions on this program's use will show right 
//       after it's called. Look at the Stata output window. 
do http://www.uvm.edu/~tbplante/p_weight_table1_v1_2.do
//
// Step 7: Now follow the instructions! That are in the stata 
//         output window!
table1pweight_start table1 1 4 weightquart weight %10.1f
table1pweight_contn_se  table1 1 4 weightquart height %10.1f
table1pweight_bin_se  table1 1 4 weightquart female %10.0f
table1pweight_cat_se  table1 1 4 weightquart race %10.0f
table1pweight_end table1 1 4 weightquart weight %10.2f
//
// CLOSE THE NEW EXCEL FILE OR YOU'LL GET AN ERROR WHEN 
// RUNNING STEP 7.
//
// Step 8: Look at the excel output! Here, it's a file called 
//         table1.xlsx that's sitting in  your pwd (see step 
//         1b above). You might notice blanks for the 2nd and 
//         3rd columns, but that's because of a strata with a single
//         sampling unit. You can confirm numbers using the survey
//         tools.
// Example: manually check the height by quartile of weight
svy: mean height if weightquart==1
svy: mean height if weightquart==2
svy: mean height if weightquart==3
svy: mean height if weightquart==4
// now check % female sex by quartile
svy: proportion female if weightquart==1
svy: proportion female if weightquart==2
svy: proportion female if weightquart==3
svy: proportion female if weightquart==4

Getting Python and Jupyter to work with Stata in Windows

Note: this post was written prior to Stata 17, which now allows Python to control Stata and vice versa. In Stata 16, Python could not control Stata but Stata could control Python. Because of this functionality, there’s a more streamlined approach to getting Jupyter to play nicely with Stata 17, detailed here. You’ll still need to install Python and Jupyter, so this post is still helpful.

Stata 16 now integrates with Python. I’m pretty stoked about using some of the Python figure packages. Getting it up and running has been a bit of a challenge. Here’s how I got it to work.

Of note, since I started this post, Stata’s blog has started a series on using Python, which you should check out here.

Install part 1:

Installing Anaconda (free for individuals, not for institutions)

Anaconda comes with many built-in statistical packages. The (free) individual version of Anaconda from here. Just make sure to check the “set as path” button during the Anaconda install!

Installing the (universally free) traditional Python distribution

As of July 2020, Python apparently has two versions that are commonly used, the 2.x version and the 3.x version. The end-of-life of 2.x versions is this year, so I wouldn’t recommend using it (current highest version is 2.7). Instead, use the 3.x version, currently the 3.8 version. You can find it at the Windows Python Download Page.

Make sure to install the version matching your Stata install! Stata comes as 32 bit or 64 bit. In Stata, type –about– to see what version you have. You’ll see that mine is running the 64-bit version of Stata. If you have a relatively modern computer, you are probably running the 64-bit version of Stata. Windows can actually run either 32-bit or 64-bit versions if you have a 64-bit processor, so do yourself a favor and just check. Type -about- in Stata to confirm your version.

Make sure that you install the corresponding version of Python. The highlighted one (x86-64) is the 64 bit. The other one (x86) is the 32-bit version. For this example, since I have the 64-bit version of stata, I installed the x86-64, 64-bit version of Python.

I had originally installed the 32-bit version of Python and Stata couldn’t load it. Installing the 64-bit version of Python solved that. There’s actually a big “download now” button on the main Python webpage that will give you the 32-bit version. Make sure to select the specific stable release in the picture above.

For the love of Pete, check this PATH box when you install it.

PATH is a list of commands that can be run from the Windows command line, and where their relative program exists.

See this check box right here? Select it. If you don’t, you’ll have a heck of a time getting anything to run from the command line. This should be checked on default, I have no idea why it’s not. If you forgot to check this box, uninstall Python and reinstall it after checking this box.

Also, notice that it says “64 bit” on the installer screen above. If it says “32 bit”, you probably downloaded the wrong version. Go back and try again!

What the heck did I just install?

There are two Python shell apps/programs that came along with the default Python setup. IDLE is a more user-friendly Python shell. It resembles the command line in Stata, but it has the syntax highlighting of the Do file editor.

The app called “Python 3.8 (64-bit)” is the shell without any markup. If you want to play around with Python, I recommend using IDLE.

Making your first program in IDLE

Anything you run in Python should be from a script, or a *.py file. Pop one open from within IDLE by hitting Ctrl+N. Enter the following:

print("Hello world!")

Then save it and run it (by pressing F5) and you’ll get the hello world!

How does that look in Stata?

Let’s do the same thing in a Stata do file. In order to open up the Python shell within Stata, you have to type –python– on its own line, your intended python code, then –end– on its own line. Here, I have entered:

python
print("hello world!")
end

Then just hit ctrl+d or the run button to get it to work in Stata!

Install Part 2: How do I get the Pandas, Matplotlib, SciPy, Sklearn, and NumPy libraries installed in Python?

Note: Anaconda comes with all of these except sklearn. For below, just complete the sklearn step.

Python by itself can do some stuff, but the heavy lifting for stats and visualization is from add-in libraries that aren’t included with the default Python and must be added in before doing much of anything else. (Note: Anaconda does come with those and is a Python installation geared towards science, but we’re doing the classic install here.) Installing these additional libraries can be done with the included pip library, which automates all downloads and installations. BUT pip it has to be called from the Windows command line, not in a Python shell (i.e., not in IDLE). You’ll know you’re in the shell if the line starts with this:

>>>

So if you type “pip install pandas” in the shell (after the “>>>”), you’ll get an error and you will not be getting pandas.

To pop up the Windows command line, hit the Start button then type “cmd” to open it up. Or hit windows key+r and type “cmd” to open it up. If you correctly checked the PATH checkbox in the install, you should get the version reported if you type the following in:

python --version

If you get some sort of error, it’s probably because you didn’t check the PATH box during the install. Uninstall Python then reinstall it and make for sure you check that stupid PATH box.

A note about the Windows 10 command line: If you type “Python” and hit enter, Windows pops up the Windows store and tries to get you to install the version of Python that they host. This is by far the dumbest Windows feature ever, and I have seen BOB. So, avoid ever typing the word “python” in the command line. Instead, use the handy “py” command, which does everything you’ll need it to do. Py is the python launcher.

To call pip, you want to type in “py” then “-m” then “pip” and its commands. the “-m” allows you to run library commands as a script. If it’s been a while since you updated pip, now’s a good time to update it in the command line to show how to call pip:

py -m pip install --upgrade pip

Now let’s install the core packages for Python:

py -m pip install pandas
py -m pip install matplotlib
py -m pip install numpy
py -m pip install scipy
py -m pip install sklearn

You’ll see a screen like this, and it’ll let

You only need to do this installation step once.

How do I use Pandas, Matplotlib, NumPy, Scikit-learn (sklearn), and SciPy in Stata?

Once the libraries are installed, you can then integrate them into your scripts. Each time you want to use them, you need to import them so you can call them. The convention is to import these using common so you don’t have to type “pandas” over and over again, you can just use “pd”. Ditto for other libraries:

python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import sklearn as sk
end

Install part 3: How do I install Jupyter Lab and get it to work with Stata? (This is the updated version of Jupyter Notebook)

Note: Jupyter lab comes installed with Anaconda, but node.js, npm, and stata_kernel still must be installed.

Jupyter is a super popular way to cleanly complete analyses. Its origins were in Python, but it now works in R and Stata. Details on installing Jupyter are here. Specific instructions for getting it to interface with Stata are here. Here’s how I got it installed:

First: install node.js (warning: this takes a long time and requires a reboot)

Make sure that you have the current version of Python installed! Node.js will install the current version and if your version is outdated, it’ll create a huge headache.

First, install node.js. Download here. I checked the box in the node.js install to also install additional software. At the end of the install, it pops open Windows powershell window and installed a bunch of stuff, including Python if you do not have the current version. It also installs Chocolatey and a few other things. NOW STOP HERE AND REBOOT YOUR COMPUTER. Come back when it restarts.

Next: install other npm, jupyterlab, and Stata plugins (warning: this also takes a while)

First, it’s always good habit to make sure you have the newest version of pip!

py -m pip install --upgrade pip

Then, type the following in the Windows command line:

py -m pip install npm
py -m pip install jupyterlab
py -m pip install stata_kernel
py -m stata_kernel.install
jupyter labextension install jupyterlab-stata-highlight

When later trying to run jupyter lab, I got an error saying “Exception: Jupyter command jupyter-lab not found.” I updated pip then uninstalled and reinstalled npm, jupyterlab, and stata_kernel (just replace the word “install” with “uninstall” above) then installed everything again. This fixed things.

Finally, you need to do a last step described under “windows specific steps” here to get this to work in Windows. I found my Stata executable at “C:\Program Files\Stata16\StataSE-64.exe”, not Program Files (x86). You can delete the new Stata desktop shortcut once you run it as an administrator one time.

You might also need to configure Jupyter to work with your Stata install. Details are here. I found the configuration file named .stata_kernel.conf sitting in this folder: C:\Users\MYUSERNAME\.stata_kernel.conf

In reviewing the configuration file, it seems to have correctly identified my Stata SE 16 setup. I changed the graph format from svg to png, but left the rest unchanged.

Now that I have Jupyter Lab installed, how do I open and use it?

In Anaconda Navigator, just click the “Jupyter lab” button. For a traditional Python install, open the Windows command line, type:

jupyter lab

It should open up your web browser to a Jupyter page. Keep the Windows Command line terminal open in the background. If you close it, Jupyter will cease to work. (Note: This isn’t true for anaconda if you open Jupyter from the GUI.) Click the “Stata” notebook button to start.

It’ll open up a scripting page for your code. At the bottom it should say “stata idle”. That’s how you know you set it up correctly.

Now you can use traditional Stata code!

There are additional programs called “Magics” detailed here that help Stata integrate more seamlessly with Jupyter. Each of these commands begins with a % symbol. There are specific ways to modify these commands.

  • %browse – lists the first 200 rows
  • %head – first 10 rows
  • %tail – last 10 rows
  • %set – can change the graph_format, graph_scale, graph_width, and graph_height

How do I use SFI to interface Stata and Python?

Stata and Python talk to each other using the Stata function interface, or SFI. MORE TO COME ON THIS.

Making Restricted Cubic Splines in Stata

I love restricted cubic splines, made famous by Frank Harrell (see his approach starting on page 58 here). Dr. Harrell made a package for automating these in R. I’m not aware of an equivalent package for Stata. Here’s my approach to making this specific restricted cubic spline in Stata.

The model here is modified Poisson regression using the Zou 2004 method since the outcome is binary. Since it’s coded as a GLM, it’ll be relatively easy to swap out this one specific model for other models, like logistic regression using the appropriate link & family. It’s good habit to have the probability density of the outcome across the continuum of exposure, so that is plopped on the bottom here. It wouldn’t take much work to replace with a histogram.

This is for two groups (group1=blue, group2=red). It wouldn’t be too hard to make this for just one group, deleting everything having to do with group 2 or having the number 2 in it. Or, duplicate lines for a group 3. Also, sometimes folks like to present a kernel density plot for each outcome, so you’d just duplicate the kdensity lines and add some code to specify that one of each is for folks with outcome==0 then outcome==1.

One quirk is that the xbrcspline code depends on a list of all of the possible options for the exposure, which is brought in from listof to xbrcspline as a numlist. As you can see in –help limits–, numlists are capped at 2,500 numbers. If you get an error saying “values() invalid — invalid numlist has too many elements”, then you have too many individual options for your exposure. This might be because your exposure is an integer with a huge amount of digits after the decimal place, so your >2,500 observations all come with their own unique value for the exposure. Rounding can help reduce this count, if it analytically and clinically makes sense. (This is okay for BMI, for example, because there isn’t clinically relevant difference between a bmi of 27.0400558235 and 27.04). To generate a rounded value, plop —gen bmi_round=(bmi, 0.01)– in the first few lines, right after opening your data. Then use “bmi_round” as your exposure variable rather than “bmi”.

You also should consider centering your exposure variables at the mean. To do this, use the –sum– command for each variable then –gen [variable’s name]_center = [variable’s name]/r(mean)–. Then use the “[variable’s name]_center” as exposures variables.

This code is a mostly rewritten version of one that is used at the Welch Center at Hopkins, where it has been handed down through generations of doctoral students and post-docs.

I’ve recently needed to make these figures using pweights, so I put some comments throughout to simplify that process.

Enjoy!

Code to make the figure above

*************************************************
***************load data!************************
****************and drop any open graphs*********
*************************************************
webuse fvex, clear
graph drop _all
macro drop _all
// if weighted, declare weighting here
*************************************************
**************define variables here**************
*************************************************
// Define the OUTCOME, = to 0 or 1, after the (first) word "outcome"
global outcome outcome // this database's outcome is called outcome...
// Define the EXPOSURE following the word "exposure"
global exposure age // this has to be a continuous variable
// Define the COVARIATES following word "covariate"
global covariates sex distance
// Define what makes GROUP 1 here after "thing1" (blue)
global thing1 if arm==1 
// Define what makes GROUP 2 here after "thing2" (red)
global thing2 if arm==2

*************************************************
************auto-generate subgroups and**********
*****************local macros needed*************
********************for spline*******************
*************************************************
// need to make the exposure by group for the kernel density plots. 
gen exposuresubgroup1 = ${exposure} ${thing1}
gen exposuresubgroup2 = ${exposure} ${thing2}

// The xbrcspine command below needs the middle value of the exposure 
// for each group. This is often times the median, but if there are 
// an even number of values for the exposure, the median will be an 
// average of the two middle values of the exposure variable. The following 
// code defines the median as a local macro (median1temp) then grabs the
// actual value for the exposure closest to that number (median_1). 
// _pctile can be used with pweight, if needed.
//
_pctile exposuresubgroup1, p(50)
// _pctile exposuresubgroup1 [pweight=samplingweight], p(50)
local median1temp = r(r1)
gen mediandiff1=abs(exposuresubgroup1-`median1temp')
gsort mediandiff1
local median_1 = ${exposure}[_n==1]
//
_pctile exposuresubgroup2, p(50)
// _pctile exposuresubgroup2 [pweight=samplingweight], p(50)
local median2temp = r(r1)
gen mediandiff2=abs(exposuresubgroup2-`median2temp')
gsort mediandiff2
local median_2 = ${exposure}[_n==1]
//
// get rid of these variables, since they are not needed anymore
drop mediandiff1 mediandiff2

*************************************************
*************make splines and view knots!********
*************************************************
// NOTE: look at the stata output here, the knots will be displayed
// you'll need to list these in the footer
// 
// Harrell's method recommends using the # of knots ('k') as 3, 4 or 5
// with n >=100 as 5 and n<30 as 3. It's just a guideline. 
// Details are on page 58 here: 
// https://hbiostat.org/doc/rms.pdf
mkspline ${exposure}_spline1=exposuresubgroup1, nknots(4) cubic displayknots
mat knots1=r(knots)
mkspline ${exposure}_spline2=exposuresubgroup2, nknots(4) cubic displayknots
mat knots2=r(knots)
// above won't work for pweight weighted data. instead, you need 
// to specifically define the percentiles from the table on pg 5 here:
// https://www.stata.com/manuals13/rmkspline.pdf
// But basically, 
// 3 knots, percentiles are at: 10 50 90
// 4 knots, percentiles are at: 5 35 65 95
// 5 knots, percentiles are at: 5 27.5 50 72.5 95
// 6 knots, percentiles are at: 5 23 41 59 77 95
// 7 knots, percentiles are at: 2.5 18.33 34.17 50 65.83 81.67 97.5
// the code for 4 knots follows in hidden code:
//
//_pctile exposuresubgroup1 [pweight=samplingweight], p(5 35 65 95) 
//return list
//local gr1knot1 = r(r1)
//local gr1knot2 = r(r2)
//local gr1knot3 = r(r3)
//local gr1knot4 = r(r4)
//di "Knots for group 1 are at " `gr1knot1' ", " `gr1knot2' ", " ///
//`gr1knot3' ", " `gr1knot4' "."
//
//
//mkspline ${exposure}_spline1=exposuresubgroup1,  ///
//knots(`gr1knot1' `gr1knot2' `gr1knot3' `gr1knot4' ) ///
//cubic displayknots
//
//mat knots1=r(knots)
//
//_pctile exposuresubgroup2 [pweight=samplingweight], p(5 35 65 95) 
//return list
//local gr2knot1 = r(r1)
//local gr2knot2 = r(r2)
//local gr2knot3 = r(r3)
//local gr2knot4 = r(r4)
//
//di "Knots for group 2 are at " `gr2knot1' ", " `gr2knot2' ", " ///
// `gr2knot3' ", " `gr2knot4' "."
//
//
//mkspline ${exposure}_spline2=exposuresubgroup2,  ///
//knots(`gr2knot1' `gr2knot2' `gr2knot3' `gr2knot4') ///
//cubic displayknots
//
//mat knots2=r(knots)
*************************************************
********Generate models to use in splines********
*************************************************
// model time!
// the models here are modified poisson regressions using sandwich variance estimators
// which is the Zou method from this classic paper:
// https://pubmed.ncbi.nlm.nih.gov/15033648/
// GLMs can be used with pweight, if needed.
//
// First group 
glm ${outcome} c.${exposure}_spline1* ${covariates}  ${thing1},  ///
fam(poisson) link(log) eform robust
// robust is required for modified poisson regression by zou's method 
// (i.e., modified poisson regression with sandwich variance estimators)
// for pweight or survey data, use:
// svy: glm ${outcome} c.${exposure}_spline1* ${covariates}  ${thing1},  ///
// fam(poisson) link(log) eform
levelsof(${exposure})  ${thing1}  // generates r(levels) for next line
xbrcspline ${exposure}_spline1, values(`r(levels)') ///
ref(`median_1') matknots(knots1) /// 
eform gen(lpnt1 hr1 lb1 ub1)

// Second group
glm ${outcome} c.${exposure}_spline2* ${covariates} ${thing2}, ///
fam(poisson) link(log) eform robust
// for pweight or survey data, use:
// svy: glm ${outcome} c.${exposure}_spline2* ${covariates}  ${thing2},  ///
// fam(poisson) link(log) eform
//
levelsof(${exposure}) ${thing2} // generates r(levels) for next line
xbrcspline ${exposure}_spline2, values(`r(levels)') ///
ref(`median_2') matknots(knots2) /// 
eform gen(lpnt2 hr2 lb2 ub2)

*************************************************
***********generate extremes to drop*************
*************************************************
// should drop the extremes, here the 0.5 and 99.5th percentiles
// but need to define these as local macros.
// _pctile can be used with pweight, if needed.
_pctile exposuresubgroup1, p(0.05 99.5)
return list
local cut_a1 = r(r1)
local cut_b1 = r(r2)

_pctile exposuresubgroup2, p(0.05 99.5)
return list
local cut_a2 = r(r1)
local cut_b2 = r(r2)
*************************************************
***************make the figure*******************
*************************************************
set scheme s1mono // my favorite scheme
twoway ///
/// spline for one group
(line  hr1 lpnt1 if lpnt1 > `cut_a1' & ///
lpnt1 < `cut_b1' , yaxis(1) lp(solid) lc(blue) lwidth(medthick) ) ///
(rarea lb1 ub1 lpnt1 if lpnt1 > `cut_a1' & ///
lpnt1 < `cut_b1' , yaxis(1) color(blue%5)) ///
(line  lb1 lpnt1 if lpnt1 > `cut_a1' & ///
lpnt1 < `cut_b1', yaxis(1)  lp(dash) lc(blue) lwidth(thin) ) ///
(line  ub1 lpnt1 if lpnt1 > `cut_a1' & ///
lpnt1 < `cut_b1' , yaxis(1) lp(dash) lc(blue) lwidth(thin) ) ///
/// spline for other group
(line  hr2 lpnt2 if lpnt2 > `cut_a2' & ///
lpnt2 < `cut_b2' , yaxis(1) lp(solid) lc(red) lwidth(medthick) ) ///
(rarea lb2 ub2 lpnt2 if lpnt2 > `cut_a2' & ///
lpnt2 < `cut_b2' , yaxis(1) color(red%5)) ///
(line  lb2 lpnt2 if lpnt2 > `cut_a2' & ///
lpnt2 < `cut_b2' , yaxis(1)  lp(dash) lc(red) lwidth(thin) ) ///
(line  ub2 lpnt2 if lpnt2 > `cut_a2' & ///
lpnt2 < `cut_b2' , yaxis(1) lp(dash) lc(red) lwidth(thin) ) ///
(line  ub2 lpnt2 if lpnt2 > `cut_a2' & ///
lpnt2 < `cut_b2' , yaxis(1) lp(dash) lc(red) lwidth(thin) ) ///
/// kernel density for one group
/// note: can alternatively install kdens and moremata to 
/// use pweighting in kernel density.
(kdensity exposuresubgroup1 if exposuresubgroup1 > `cut_a1' & ///
exposuresubgroup1 < `cut_b1' , yaxis(2) lp(shortdash) lcolor(blue)) ///
/// kernel density for other group
(kdensity exposuresubgroup2 if exposuresubgroup2 > `cut_a2' & ///
exposuresubgroup2 < `cut_b2' , yaxis(2) lp(shortdash) lcolor(red)) ///
, ///
yline(1, lpattern(solid) lcolor(black) axis(1)) ///
/// labels for the left axis, hide the following line if you aren't sure 
/// of what the range should be:
ylabel(0.25 "0.25" 0.5 "0.5" 1 "1" 2.5 "2.5" 5 "5", axis(1) angle(0)) ///
/// range for the left axis. the BOTTOM of this range has to be WAAAY lower 
/// than the bottom ylabel in the line above in order for it to sit 
/// on the way top of the figure. 
yscale(r(0.001 2) log axis(1)) /// log scale here
/// ditto for the right axis, but the TOP of the range on the yscale needs 
/// to be WAAAY higher than the top ylabel
ylabel(0 "0" 0.01 "0.01" 0.02 "0.02" 0.03 "0.03" 0.04 "0.04" 0.05 "0.05", ///
axis(2) labsize(vsmall) angle(0)) ///
yscale(r(0.0 .2) axis(2)) /// not log scale here
/// you'll notice that the x label doesn't span the entire 
/// range of the exposure because the local macro cuts above. 
///xlabel(20(5)60) in case you want to specify the x axis
/// for the titles, need to put a bunch of spaces so things align
ytitle("                {bf:Risk Ratio (95% CI)}", axis(1)) ///
ytitle("{bf:Probability Density}                                          ", ///
 justification(left) axis(2)) ///
xtitle("{bf:Exposure Here}") ///
title("{bf:The Title}") ///
legend(order(1 "Group 1" 5 "Group 2")) ///
name(mygraph1)
			
*************************************************
*****************Save figure!********************
*************************************************
// save as png, change to tif if needed for submission			
graph export "mygraph1.png", replace width(1000)

Use Stata to download the NY Times COVID-19 database and render a Twitter-compatible US mortality figure

Here’s the figure!

Code follows

Comments are in-line below. Some unique strategies in this code:

  • This will automatically download the latest NY Times dataset, but the date of “last day of follow-up” needs to be specifically defined. I find that the label locations need to be tweaked every day, and this process isn’t simple to automate.
  • The colors are defined by global macros once and are applied multiple times by calling those macros.
  • Text blocks are rendered next to the last day of follow-up with a translucent white background and non-translucent colored border that matches the dotted line.
  • Twitter figures should be output at 1100 x 628, per this blog. This script does that. Twitter clips images that aren’t this size.
****************************************************
// step 1: download  and save NY times database
****************************************************
version 15.1 // my version of Stata when this was written

import delimited using ///
"https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv", ///
varn(1) clear

// Now make the date a stata date. Load this handy date-fixing program 
// I wrote. The syntax is 'fixdate [variable name] [mdy, ymd, etc]
do http://www.uvm.edu/~tbplante/fixdate_v1_0.do
fixdate date ymd
// rename state to state_fullname
rename state state_fullname
****************************************************
// step 2: keep 50 states+DC, apply abbreviations
****************************************************
gen state=" " 
replace state="AL" if state_fullname=="Alabama"
replace state="AK" if state_fullname=="Alaska"
replace state="AZ" if state_fullname=="Arizona"
replace state="AR" if state_fullname=="Arkansas"
replace state="CA" if state_fullname=="California"
replace state="CO" if state_fullname=="Colorado"
replace state="CT" if state_fullname=="Connecticut"
replace state="DE" if state_fullname=="Delaware"
replace state="FL" if state_fullname=="Florida"
replace state="GA" if state_fullname=="Georgia"
replace state="HI" if state_fullname=="Hawaii"
replace state="ID" if state_fullname=="Idaho"
replace state="IL" if state_fullname=="Illinois"
replace state="IN" if state_fullname=="Indiana"
replace state="IA" if state_fullname=="Iowa"
replace state="KS" if state_fullname=="Kansas"
replace state="KY" if state_fullname=="Kentucky"
replace state="LA" if state_fullname=="Louisiana"
replace state="ME" if state_fullname=="Maine"
replace state="MD" if state_fullname=="Maryland"
replace state="MA" if state_fullname=="Massachusetts"
replace state="MI" if state_fullname=="Michigan"
replace state="MN" if state_fullname=="Minnesota"
replace state="MS" if state_fullname=="Mississippi"
replace state="MO" if state_fullname=="Missouri"
replace state="MT" if state_fullname=="Montana"
replace state="NE" if state_fullname=="Nebraska"
replace state="NV" if state_fullname=="Nevada"
replace state="NH" if state_fullname=="New Hampshire"
replace state="NJ" if state_fullname=="New Jersey"
replace state="NM" if state_fullname=="New Mexico"
replace state="NY" if state_fullname=="New York"
replace state="NC" if state_fullname=="North Carolina"
replace state="ND" if state_fullname=="North Dakota"
replace state="OH" if state_fullname=="Ohio"
replace state="OK" if state_fullname=="Oklahoma"
replace state="OR" if state_fullname=="Oregon"
replace state="PA" if state_fullname=="Pennsylvania"
replace state="RI" if state_fullname=="Rhode Island"
replace state="SC" if state_fullname=="South Carolina"
replace state="SD" if state_fullname=="South Dakota"
replace state="TN" if state_fullname=="Tennessee"
replace state="TX" if state_fullname=="Texas"
replace state="UT" if state_fullname=="Utah"
replace state="VT" if state_fullname=="Vermont"
replace state="VA" if state_fullname=="Virginia"
replace state="WA" if state_fullname=="Washington"
replace state="WV" if state_fullname=="West Virginia"
replace state="WI" if state_fullname=="Wisconsin"
replace state="WY" if state_fullname=="Wyoming"

replace state="DC" if state_fullname=="District of Columbia"

drop if state==" " // drop guam, VI, PR. would be reasonable to add them back
// would need to get their populations for the list below. 
****************************************************
// step 3: apply population by state
****************************************************
// ref: 
// https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html
// http://www2.census.gov/programs-surveys/popest/datasets/2010-2019/national/totals/nst-est2019-alldata.csv?#
gen statepop=.
replace statepop=4903185 if state=="AL"
replace statepop=731545 if state=="AK"
replace statepop=7278717 if state=="AZ"
replace statepop=3017804 if state=="AR"
replace statepop=39512223 if state=="CA"
replace statepop=5758736 if state=="CO"
replace statepop=3565287 if state=="CT"
replace statepop=973764 if state=="DE"
replace statepop=705749 if state=="DC"
replace statepop=21477737 if state=="FL"
replace statepop=10617423 if state=="GA"
replace statepop=1415872 if state=="HI"
replace statepop=1787065 if state=="ID"
replace statepop=12671821 if state=="IL"
replace statepop=6732219 if state=="IN"
replace statepop=3155070 if state=="IA"
replace statepop=2913314 if state=="KS"
replace statepop=4467673 if state=="KY"
replace statepop=4648794 if state=="LA"
replace statepop=1344212 if state=="ME"
replace statepop=6045680 if state=="MD"
replace statepop=6892503 if state=="MA"
replace statepop=9986857 if state=="MI"
replace statepop=5639632 if state=="MN"
replace statepop=2976149 if state=="MS"
replace statepop=6137428 if state=="MO"
replace statepop=1068778 if state=="MT"
replace statepop=1934408 if state=="NE"
replace statepop=3080156 if state=="NV"
replace statepop=1359711 if state=="NH"
replace statepop=8882190 if state=="NJ"
replace statepop=2096829 if state=="NM"
replace statepop=19453561 if state=="NY"
replace statepop=10488084 if state=="NC"
replace statepop=762062 if state=="ND"
replace statepop=11689100 if state=="OH"
replace statepop=3956971 if state=="OK"
replace statepop=4217737 if state=="OR"
replace statepop=12801989 if state=="PA"
replace statepop=1059361 if state=="RI"
replace statepop=5148714 if state=="SC"
replace statepop=884659 if state=="SD"
replace statepop=6829174 if state=="TN"
replace statepop=28995881 if state=="TX"
replace statepop=3205958 if state=="UT"
replace statepop=623989 if state=="VT"
replace statepop=8535519 if state=="VA"
replace statepop=7614893 if state=="WA"
replace statepop=1792147 if state=="WV"
replace statepop=5822434 if state=="WI"
replace statepop=578759 if state=="WY"

****************************************************
// step 4: make daily death count per capita
****************************************************
// now make variables for cases and deaths per capita in each state (per million persons)
gen statepopave_deaths = (deaths/statepop) *1000000

****************************************************
// step 5: make a variable for when the death rate is 
// >=1/1,000,000 people in each state, and count days
// following that
****************************************************
sort state date
gen days_1_death=.
replace days_1_death=0 if statepopave_deaths < 1 
replace days_1_death=1 if (statepopave_deaths >= 1 & statepopave_deaths[_n-1] <1 ) ///
& (state==state[_n-1])
//
replace days_1_death = days_1_death[_n-1]+1 if state==state[_n-1] ///
& days_1_death[_n-1]!=0
****************************************************
// step 6: save database
****************************************************

save nytimes_state_fu.dta, replace

****************************************************
// step 7: specify last day of follow-up and 
// get rank of states and location
// to put state names in x,y location for the 
// last day of follow-up
****************************************************
// reload
use nytimes_state_fu.dta, clear

// ****THIS NEEDS TO BE EDITED EVERY DAY.****
// Set the final date of follow-up. 
// as of today (3/29/2020), 3/27/2020 is the most
// recent day of data in the NY times database.
// 
// This is intentionally not automated because I want to manually adjust
// labels and range each time. 
global month Mar // needs to be in 3 letter abbreviation for month
global date 27 // 2 number day in month

// drop any day beyond the specified date
drop if date>date("${date}${month}2020", "DMY")

// this global will make the x axis 1 day longer than the current follow-up
sum days_1_death
global maxdate = r(max)+1

// actually determine the order of states on the last day of follow-up,
// which is how the labels and colors are applied.
// need to drop all but the last date of follow-up
keep if date==date("${date}${month}2020", "DMY")
gsort -statepopave_deaths // sort in reverse order
gen n=_n // make variable that contains order based upon sort
drop if n >10 // drop those not in the top 10

// need to figure out where to put the labels of state names
// this loop plucks out the state name and x&y coordinates for the last
// day of follow-up. 
// it also prints the order of the states. 
foreach x in 1 2 3 4 5 6 7 8 9 10 {
global statename`x'=state[`x'] // pull state name
global datecount`x' = days_1_death[`x'] + 0.2 // x axis, need to offset by 0.2 
//                                      so the label isn't on top of the dot
global statedeath`x' = statepopave_deaths[`x'] // yaxis

di "State rank #`x': ${statename`x'}"
di "(x axis) # of days: ${datecount`x'}" 
di "(y axis) deaths/million: ${statedeath`x'}"
di " "
}
//
// The labels might overlap each other. This you can manually readjust the 
// location on the y axis following here. This won't alter data in the 
// figure, just the location of the labels. 

global statedeath1 = ${statedeath1} // don't need to move label
global statedeath2 = ${statedeath2} // don't need to move label 
global statedeath3 = ${statedeath3}  // don't need to move label
global statedeath4 = ${statedeath4}  // don't need to move label
global statedeath5 = ${statedeath5}  // don't need to move label
global statedeath6 = ${statedeath6}  // don't need to move label
global statedeath7 = ${statedeath7}  // don't need to move label
global statedeath8 = ${statedeath8}+1.5 // move GA up on y axis
global statedeath9 = ${statedeath8}-2 // move DC down on y axis
global statedeath10 = ${statedeath10} // don't need to move label
****************************************************
// step 8: specify colors, make figure, save figure
// in size compatible with twitter
****************************************************

// reload the full dataset
use nytimes_state_fu.dta, replace
// drop any day beyond the specified date. 
drop if date>date("${date}${month}2020", "DMY")

// I like the s1mono scheme. Default stata theme is ugly. 
set scheme s1mono
// colors for these states, taken from colorbrewer website
// ref: https://colorbrewer2.org/#type=diverging&scheme=RdYlBu&n=10
// these are RGB triads
global color1 165 0 38
global color2 215 48 39
global color3 244 109 67
global color4 253 174 97
global color5 254 224 144
global color6 224 243 248
global color7 171 217 233
global color8 116 173 209
global color9 69 117 180
global color10 49 54 149

// the actual graphic!
// note: you need to put 'sort' after the 'twoway scatter' command so the line doesn't loop back around. 
twoway ///
(scatter statepopave_deaths days_1_death if state=="${statename1}" & days_1_death>=1 & date>=1, ///
mcolor("${color1}") msymbol(O) lpattern(solid) lcolor("${color1}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename2}" & days_1_death>=1 & date>=1, ///
mcolor("${color2}") msymbol(O) lpattern(solid) lcolor("${color2}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename3}" & days_1_death>=1 & date>=1, ///
mcolor("${color3}") msymbol(O) lpattern(solid) lcolor("${color3}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename4}" & days_1_death>=1 & date>=1, ///
mcolor("${color4}") msymbol(O) lpattern(solid) lcolor("${color4}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename5}" & days_1_death>=1 & date>=1, ///
mcolor("${color5}") msymbol(O) lpattern(solid) lcolor("${color5}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename6}" & days_1_death>=1 & date>=1, ///
mcolor("${color6}") msymbol(O) lpattern(solid) lcolor("${color6}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename7}" & days_1_death>=1 & date>=1, ///
mcolor("${color7}") msymbol(O) lpattern(solid) lcolor("${color7}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename8}" & days_1_death>=1 & date>=1, ///
mcolor("${color8}") msymbol(O) lpattern(solid) lcolor("${color8}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename9}" & days_1_death>=1 & date>=1, ///
mcolor("${color9}") msymbol(O) lpattern(solid) lcolor("${color9}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename10}" & days_1_death>=1 & date>=1, ///
mcolor("${color10}") msymbol(O) lpattern(solid) lcolor("${color10}") connect(L) sort) ///
, ///
yline(30, lcolor(gs14)) ///will need at add additional horizontal lines as figure grows
yline(20, lcolor(gs14)) ///
yline(10, lcolor(gs14)) ///
title("COVID-19 Cumulative Mortality by US State") ///
t1title("Top 10 states, through $month $date, 2020") ///
xla(1(2)$maxdate) ///
yla(0(5)40) ///
yti("# COVID19 Deaths/Million Persons") ///
xti("Day Since ≥1 Death/Million Persons") ///
legend(off) ///
/// the following will render each label with a surrounding box that's the same color as the line. 
text(${statedeath1} ${datecount1} "${statename1}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color1}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath2} ${datecount2} "${statename2}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color2}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath3} ${datecount3} "${statename3}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color3}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath4} ${datecount4} "${statename4}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color4}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath5} ${datecount5} "${statename5}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color5}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath6} ${datecount6} "${statename6}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color6}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath7} ${datecount7} "${statename7}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color7}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath8} ${datecount8} "${statename8}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color8}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath9} ${datecount9} "${statename9}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color9}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath10} ${datecount10} "${statename10}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color10}%100") lstyle(solid) lwidth(thin)) ///
caption("Using NY Times COVID19 database"  ///
"https://github.com/nytimes/covid-19-data/blob/master/us-states.csv",  ///
size(small)) ///
xsize(15.3) ysize(9.0) 
// twitter default width & height is 1100x628 pixels. 
//This last line sets the corresponding height and width in inches using 72 dpi. 

graph export "COVID_mortality_2020_${month}_${date}_continuous.png", replace width(1100) 
// width(1100) sets the output to be default width on twitter, or 1100 dpi.