Rendering XKCD #2023 “Misleading Graph Makers” in Stata

Let’s render an XKCD comic using Stata!

I loved today’s XKCD comic, so I took some time while eating my sandwich to write a .do file that renders it in Stata. Stata doesn’t have great smooth line options unless you work out the exact function for each line, so I approximated the data points. One interesting problem was including quotation marks in the X-axis label, since quotation marks are also what define the label and its line breaks. The solution was compound double quotes: open each line with a tick followed by a double quote (the tick is the ` key to the left of number 1 on your keyboard) and close it with a double quote followed by an apostrophe. This is also a nice example of how to input data in a .do file.
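Here is a minimal sketch of compound double quotes on their own, separate from the comic code. The quoted phrases and the use of Stata's built-in auto dataset are just for illustration.

```stata
// Plain double quotes would end the string early, so this line would fail:
// display "She said "hello""

// Compound double quotes (`" to open, "' to close) can hold inner quotes:
display `"She said "hello""'

// They work the same way inside a graph option that expects quoted text.
sysuse auto, clear
scatter mpg weight, xtitle(`"Weight in "pounds""')
```

The first display line prints the sentence with its inner quotation marks intact, and the X-axis title of the scatterplot shows the word pounds in quotes.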

End result:


clear all

input id proportion band1 band2 band3 band4 band5 band6 band7 band8 band9 band10
0 . 21 22 23 24 25 26 27 28 29 30
0.3 . 21 22 23.7 25.5 26.3 28 28.8 29.2 29.5 30
0.5 . 20.8 22.5 24.7 27 28 29 29.2 29.4 29.7 30
0.7 . 20.6 25 27.4 28.4 29 29.3 29.5 29.6 29.9 30
0.9 . 20.1 28 28.5 29 29.3 29.5 29.7 29.8 29.9 30
1 23 20.1 28.5 29 29.3 29.5 29.6 29.7 29.8 29.9 30
1.3 . 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
2 23.5 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
3 22.3 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
4 23.5 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
5 23 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
6 28 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
end

set scheme s1mono

graph twoway ///
(connected proportion id, lcolor(gs0) mcolor(gs0)) ///
(scatter band1 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band2 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band3 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band4 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band5 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band6 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band7 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band8 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band9 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band10 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
, ///
title(Y-Axis) ///
xlabel(none) /// 
xline(1(1)6, lpattern(solid) lcolor(gs12)) ///
ylabel(20 "0%" 25 "50%" 30 "100%", angle(0)) ///
aspect(1) ///
ytitle("") ///
legend(off) ///
xtitle(`"People have wised up to the "carefully"' ///
`"chosen Y-axis range" trick, so we misleading"' ///
"graph makers have had to get creative.")

graph export xkcd_2023.png, width(1000) replace


My lunch break is over!  Back to work.

Making Scatterplots and Bland-Altman plots in Stata

Scatterplots with a fitted line

This is pretty straightforward in Stata. This is a variation of a figure that I made for a JAMA Internal Medicine paper. (I like to think that the original figure was publication quality, but they had their graphics team redo it in their own format.) It contains a diagonal line of unity and cutoffs for systolic hypertension as vertical/horizontal lines. Unlike the publication version, this one includes a thick dashed fitted line.

This code actually makes two different scatterplots then merges them together and puts labels over the merged version.

The scatter command here wants the Y-axis variable first and the X-axis variable second. The two variables were ibp_sbp_pair (Y-axis) and average_sbp_omron_23 (X-axis) for systolic and ibp_dbp_pair and average_dbp_omron_23 for diastolic.


**********scatterplot ave with lines of fit
twoway (lfit ibp_sbp_pair average_sbp_omron_23, lcolor(gray) lpattern(dash) lwidth(vthick)) /// line of fit code
(function y=x, ra(average_sbp_omron_23) clcolor(gs4)) /// diagonal line of unity
(scatter ibp_sbp_pair average_sbp_omron_23 , mcolor(black) msize(vsmall)), /// make dots appear for scatter, y x axis
legend(off) /// hide legend
title("Systolic BP", color(black)) ///
ytitle("") /// no title, will add when merging SBP and DBP
xtitle("") /// ditto
xline(140, lpattern(solid) lcolor(gray)) /// cutoff for systolic hypertension
yline(140, lpattern(solid) lcolor(gray)) /// ditto
graphregion(color(white)) ylabel(, grid glcolor(gs14)) /// white background, light gray lines
xlabel(90(20)170) ylabel(90(20)170) /// where X and Y labels occur
aspectratio(1) // force figure to be a 1x1 square, not a rectangle
graph save 20_sbp_scatterplot_fit.gph, replace // need graph to merge later
graph export 20_sbp_scatterplot_fit.png, width(4000) replace

twoway (lfit ibp_dbp_pair average_dbp_omron_23, lcolor(gray) lpattern(dash) lwidth(vthick)) /// 
(function y=x, ra(average_dbp_omron_23) clcolor(gs4)) ///
(scatter ibp_dbp_pair average_dbp_omron_23, mcolor(black) msize(vsmall)), /// 
legend(off) ///
title("Diastolic BP", color(black)) ///
ytitle("") ///
xtitle("") ///
xline(90, lpattern(solid) lcolor(gray)) ///
yline(90, lpattern(solid) lcolor(gray)) ///
graphregion(color(white)) ylabel(, grid glcolor(gs14)) ///
xlabel(30(20)110) ylabel(30(20)110) // where X and Y labels occur
graph save 21_dbp_scatterplot_fit.gph, replace
graph export 21_dbp_scatterplot_fit.png, width(4000) replace

****combined scatterplot
graph combine 20_sbp_scatterplot_fit.gph 21_dbp_scatterplot_fit.gph, /// 
graphregion(color(white)) ///
b1title("Standard (mmHg)") ///
l1title("IBP (mmHg)") // label for the shared Y-axis
graph save combined_scatterplots_fit.gph, replace // 
graph export combined_scatterplots_fit.png, width(4000) replace

Bland-Altman plots

This is from the same paper.

Again, two different B-A plots that are merged then labels applied. The dotted line is the mean difference, and the long dashed lines are the mean +/- 2 SD.

As far as Stata’s graph maker is concerned, this is a scatterplot. You just need to set up all of the variables intentionally to trick it into rendering a B-A plot. The Y-axis is the difference between the variables and the X-axis is a mean of the variables.


***prep for figure
gen mean_sbp_ave=(average_sbp_omron_23+ibp_sbp_pair)/2 // this will be the x-axis
gen diff_sbp_ave=ibp_sbp_pair-average_sbp_omron_23 // this will be y-axis
sum diff_sbp_ave // this allows you to make a macro of the mean ("r(mean)") of the y axis variable
global mean1=r(mean) // this saves the macro as mean1, to be called later
global lowerCL1=r(mean) - 2*r(sd) // this saves a macro for the mean minus 2 times the SD ("r(sd)")
global upperCL1=r(mean) + 2*r(sd)
***make graph
graph twoway scatter diff_sbp_ave mean_sbp_ave, ///
legend(off) mcolor(black) ///
ytitle("") /// ytitle("Reference Minus Comparator (mmHg)")
xtitle("") /// xtitle("Average of Reference and Comparator (mmHg)")
title("Systolic BP", color(black)) /// 
yline($mean1, lpattern(shortdash) lcolor(gray)) /// calls the macro from above
yline($lowerCL1, lpattern(dash) lcolor(gray)) /// ditto
yline($upperCL1, lpattern(dash) lcolor(gray)) /// 
graphregion(color(white)) ylabel(, grid glcolor(gs14)) /// white background
ylabel(-40(20)40) xlabel(90(20)170) /// 
aspectratio(1.08) // annoyingly, this wasn't a perfectly square figure so this line fixes it.
***save graph
graph save 1_sbp_bland_altman_ave.gph, replace
graph export 1_sbp_bland_altman_ave.png, width(4000) replace

***prep for figure
gen mean_dbp_ave=(average_dbp_omron_23+ibp_dbp_pair)/2
gen diff_dbp_ave=ibp_dbp_pair-average_dbp_omron_23
sum diff_dbp_ave
global mean1=r(mean)
global lowerCL1=r(mean) - 2*r(sd)
global upperCL1=r(mean) + 2*r(sd)
***make graph
graph twoway scatter diff_dbp_ave mean_dbp_ave, ///
legend(off) mcolor(black) ///
ytitle("") /// 
xtitle("") ///
title("Diastolic BP", color(black)) /// 
msize(vsmall) ///
yline($mean1, lpattern(shortdash) lcolor(gray)) ///
yline($lowerCL1, lpattern(dash) lcolor(gray)) ///
yline($upperCL1, lpattern(dash) lcolor(gray)) /// 
graphregion(color(white)) ylabel(, grid glcolor(gs14)) ///
ylabel(-40(20)40) xlabel(30(20)110) // where Y and X labels occur
***save graph
graph save 2_dbp_bland_altman_ave.gph, replace
graph export 2_dbp_bland_altman_ave.png, width(4000) replace

***********combined image bland altman
graph combine 1_sbp_bland_altman_ave.gph 2_dbp_bland_altman_ave.gph, ///
ycommon /// so the y axes are on the same scale 
graphregion(color(white)) ///
b1title("Average of IBP and Standard (mmHg)") ///
l1title("IBP Minus Standard (mmHg)") // label for the shared Y-axis
graph save combined_dbp_sbp_ba.gph, replace // 
graph export combined_dbp_sbp_ba.png, width(4000) replace

Code to make a dot and 95% confidence interval figure in Stata

Dot and confidence interval figures in Stata

Stata has a pretty handy -twoway dot- command that can be combined with -twoway rcap- to make the figure below. The only annoying thing is that -twoway dot- draws a dotted line connecting your dot to the Y-axis. The code below makes this sharp-looking figure without the connecting line.

Also, there is a bonus code at the end to make this figure:


First step, make an Excel file

I made an Excel file called “dot and 95 percent ci data.xlsx” with the columns below and saved it in the same folder as my .do file. The figure displays row 1 at the top and row 14 at the bottom. The gaps between the groups of lines come from the absent rows 3, 6, 9, and 12. Group tells Stata whether you want a hollow or solid dot. Proportion is the point estimate, and low95 and high95 are the bounds of the surrounding 95% confidence interval.

Note that the data here are made up and are not related to any actual ongoing clinical investigation. 
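If you would rather skip Excel, a sketch like the one below types the same column layout straight into a .do file with -input-. The numbers are invented for illustration and are not the values behind the figure; only the column names (row, group, proportion, low95, high95) come from the figure code.

```stata
clear
// Made-up values; row 3 is left out on purpose to create a gap in the figure.
input row group proportion low95 high95
1 1 1.10 1.02 1.18
2 2 1.20 1.08 1.32
4 1 1.05 0.98 1.12
5 2 1.25 1.10 1.40
end
list, clean noobs
```

With the data entered this way, the import excel and destring lines would not be needed.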

Next step, make a .do file

In the same folder as the Excel file, copy/paste/save the code below as a .do file. Close Excel and close Stata then find the .do file from Windows Explorer and double click it. Doing this will force Stata to set the working directory as the folder containing the .do file (and the Excel file).

If the code won’t work, you probably have Excel open. Close it and try again.

That’s it!

Share and enjoy. Code below.

******************************IMPORT DATA HERE**********************************
import excel "dot and 95 percent ci data.xlsx", firstrow clear
destring, replace

******************************CODE STARTS HERE********************************** 
capture ssc install scheme-burd, replace // this installs more color patterns. 
set scheme burd4 // I like burd4.

twoway ///
(dot proportion row if group ==1, horizontal dcolor(bg)) /// dcolor(bg) hides dotted line 
/// connecting the dot with the Y axis by matching it with the background. 
(dot proportion row if group ==2, horizontal dcolor(bg)) ///
(rcap low95 high95 row, horizontal), /// code for 95% CI
legend(row(1) order(1 "legend 1" 2 "legend 2") pos(6)) /// legend at 6 o'clock position
ylabel(1.5 "Model A" 4.5 "Model B" 7.5 "Model C" 10.5 "Model D" 13.5 "Model E", angle(0) noticks) ///
/// note that the labels are 1.5, 4.5, etc so they are between rows 1&2, 4&5, etc.
/// also note that there is a space in between different rows by leaving out rows 3, 6, 9, and 12 
xlabel(.95 " " 1 "1.0" 1.1 "1.1" 1.2 "1.2" 1.3 "1.3" 1.4 "1.4" 1.5 "1.5" 1.6 " ", angle(0)) /// no 1.6 label
xtitle("X axis") /// 
ytitle("Y axis") /// 
yscale(reverse) // y axis is flipped; an aspect() option can be added to set how tall or wide the figure is

graph export "dot and 95 percent ci figure.png", replace width(2000)
//graph export "dot and 95 percent ci figure.tif", replace width(2000)

Bonus code for an AJC paper figure

This figure is from a 2018 manuscript that we published in AJC using SPRINT data.

Here’s a view of our “024a arr.xlsx” Excel file that contained the data that went into this figure. The label1 and label2 were to help me make sense of everything while writing this code and aren’t actually used. All that is called by the code below is row, group, proportion, low95, and high95.

Here is the code:

******************************IMPORT DATA HERE**********************************
import excel "024a arr.xlsx", firstrow clear
destring, replace

******************************CODE STARTS HERE********************************** 
set scheme s1color

twoway ///
(dot proportion row, horizontal ndots(0) mcolor(black)) /// ndots(0) hides dotted line 
/// connecting the dot with the Y axis 
(rcap low95 high95 row, horizontal lcolor(black)), /// code for 95% CI
legend(row(2) order(1 "Point Estimate" 2 "95% CI") pos(6)) /// legend at 6 o'clock position
ylabel(1 "All" 3 "Q1" 4 "Q2" 5 "Q3" 6 "Q4" 8 "<10%" 9 "≥10%" 12 "All" 14 "Q1" 15 "Q2" 16 "Q3" 17 "Q4" 19 "<10%" 20 "≥10%", angle(0) noticks) ///
/// gaps between groups come from row numbers that are skipped (e.g., rows 2 and 7) 
xlabel(-5 "-5%" 0 "0%" 5 "5%" 10 "10%", angle(0)) ///
xtitle("Risk difference") /// 
ytitle(" SAEs ASCVD Events") /// the leading space just makes the label sit centered in the figure
xline(0, lcolor(gs4)) ///
yscale(reverse) /// 
text(4.5 11 "P=0.84", place(e) size(small)) ///
text(8.5 11 "P=0.67", place(e) size(small)) ///
text(15.5 11 "P=0.82", place(e) size(small)) ///
text(19.5 11 "P=0.95", place(e) size(small))

graph export "024a arr figure.png", replace width(2000) 
graph export "024a arr figure.tif", replace width(2000) // tif file sent in for publication

Making a horizontal stacked bar graph with -graph twoway rbar- in Stata

Making a horizontal stacked bar graph in Stata

I spent a bit of time making a variation of this figure today. (The data here are made up.) I’m pleased with how it came out. I first tried the -graph bar, horizontal- command, but it didn’t give me as much customization as -graph twoway rbar…, horizontal-. I think it looks pretty slick.



Start with an Excel file

I made an Excel file called stacked bar graph data.xlsx that I saved in the same folder as a .do file. I closed Stata and reopened that .do file from Windows explorer so that Stata set the working directory as the same folder that contains the .do file. More importantly, it set the working directory as the same folder that also contains the Excel file.


Group is the number of the individual bars, bottom is the bottom of the first segment of a bar, q1 is the top of the first segment of each bar. The rest should be obvious. I made this for quartiles, hence the q1-4 names. You can tweak the numbers by editing the Excel file, hitting save/close, and rerunning the .do file.

Make sure that you save and close Excel before running the .do file or the .do file won’t run and you will be sad. 

My .do file for making this horizontal stacked bar graph

Here’s my code! I hope it’s useful.


******************************IMPORT DATA HERE**********************************
import excel "stacked bar graph data.xlsx", firstrow clear // make sure that excel
//                                                           is closed before you
//                                                           run this script!
destring, replace

******************************CODE STARTS HERE**********************************
capture ssc install scheme-burd, replace // this installs nicer color schemes
// see the schemes here:
set scheme burd4 

graph twoway ///
(rbar bottom q1 group, horizontal) ///
(rbar q1 q2 group, horizontal) ///
(rbar q2 q3 group, horizontal) ///
(rbar q3 q4 group, horizontal) ///
, /// if you modify this file and it stops working, check the placement of this comma
xscale(log) /// make the x axis log scale
xla(0.25 "0.25" 0.5 "0.5" 1 "1" 2.5 "2.5" 5 "5" 10 "10" 25 "25" 50 "50" 100 "100") ///
yla(1 "Group 1" 2 "Group 2" 3 "Group 3" 4 "Group 4") ///
ytitle("Y title") ///
xtitle("X title") ///
text(4 1.2 "Q1", color(white)) /// first number is y second is x.
text(4 4.3 "Q2", color(white)) ///
text(4 20 "Q3", color(white)) ///
text(4 90 "Q4", color(white)) ///
legend(off) // no legend. An aspect() option can be added here to change the shape of the figure.
// export the graph as a PNG file
graph export "stacked graph.png", replace width(2000)
// graph export "stacked graph.tif", replace width(2000) // in case you want as a tiff

Downloading and analyzing NHANES datasets with Stata in a single .do file

Learn to love NHANES

NHANES is a robust, nationally-representative cross-sectional study. For the past ~18 years it sampled different communities across the US in 2 year continuous cycles. A few of these years are linked to National Death Index data, so you can assess risk factors at the time of the survey and use time-to-event mortality data to identify novel risk factors for death. Manuscripts using NHANES data have been published across the spectrum of medical journals, all the way up to NEJM. The best part? You can download almost all of NHANES data from the CDC website right now for free.

Manipulating NHANES data is challenging for beginners because of the sheer quantity of individual files and the requirement for weighting. Plus, all of the files are in SAS XPT format, so you have to download, import, save, and merge before you can even think about starting an analysis. To make this data management task slightly more complex, the CDC sporadically publishes interval updates of the source data files on their website. Files may be updated for errors or removed entirely without you knowing about it. (I strongly recommend subscribing to the NHANES Listserv to get real-time updates.) If you have NHANES files saved locally from 2-3 years ago, there’s a reasonable chance that you are using outdated files, which could yield some false conclusions. Re-downloading all of the many files every time you want to do a project is a big headache.

Leverage Stata’s internet connectivity to make NHANES analyses easy

I love that Stata will download datasets for you with just a URL. The .do file below shows you how easy it is to download just the needed files on the fly then do some simple analyses. This means that you don’t have to worry about maintaining your personal database of NHANES files. If the source files are updated by the CDC, no worry! Every time you run this .do file, it’ll grab the freshest data files available. If the source data files are removed for inaccuracies, the file won’t run and you’ll be prompted to investigate. For example, the 2011-2012 Folate lab results were withdrawn February 2018. If you tried to download the FOLATE_G file, CDC’s website will give their version of a 404 error and Stata will stop the .do file cold in its tracks.

What this .do file does

In this short script, you’ll see how to 1. Import the NHANES SAS XPT files directly from the CDC website with just the XPT file’s URL, 2. Save data as Stata .dta files, 3. Merge the .dta files, 4. Review basic coding issues, 5. Run an analysis using weighting, and 6. Display data.

In this example, we’ll look at the 2009-2010 NHANES results and apply weighting to estimate the amount of the US population who have been told that they have high blood pressure. Just copy/paste the code below and save into a .do file. Set the working directory to be in the same folder as your .do file. Or, copy/paste the .do file, save it, close Stata, open the .do file through Explorer, and then run! Opening the .do file from Windows Explorer with Stata closed sets the .do file’s parent folder to be Stata’s working directory.

What this .do file doesn’t

You won’t be able to merge files with multiple entries per participant. For example, the Prescription Medications – Drug Information questionnaire has a separate row for each medication, along with the ICD code for its use, for every participant. You’ll need to use the reshape wide command on those variables before merging. BTW, that code is:

bys seqn: gen j = _n
reshape wide rxddrug rxddrgid rxqseen rxddays, i(seqn) j(j)
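To see what that reshape does, here is a toy version with made-up rows and only one of the medication variables (rxddrug). The drug names and seqn values are invented for illustration and are not NHANES data.

```stata
clear
// Long format: one row per medication, so seqn repeats.
input seqn str12 rxddrug
1 "LISINOPRIL"
1 "METFORMIN"
2 "ATORVASTATIN"
end
bysort seqn: gen j = _n   // number the medications within each participant
reshape wide rxddrug, i(seqn) j(j)
list, clean noobs         // now one row per seqn, with rxddrug1 and rxddrug2
isid seqn                 // errors out if seqn is not unique
```

Once isid passes, the file has one row per participant and can be used in a merge 1:1 on seqn like the other files.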

This also won’t merge with the National Death Index files, which are hosted elsewhere. The NHANES-NDI linkage website provides an example Stata .do script that would be straightforward to include below.

Finally, if you are trying to combine analyses from multiple NHANES cycles (say, combining 2009-2010 with 2011-2012), things get a bit more complicated. You’ll need to append .dta files and consider adjusting the weights.
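Here is a rough sketch of the append-and-reweight idea, using two tiny made-up datasets in place of real cycle files. It assumes the common rule of thumb of dividing the 2-year weight by the number of cycles combined (here, 2); check the CDC's analytic guidelines for the cycles you are actually using. The variable wtint4yr, the cycle labels, and all of the numbers are invented for this example.

```stata
// Stand-in for the first cycle (made-up values)
clear
input seqn wtint2yr
1 1000
2 3000
end
gen cycle = "2009-2010"
tempfile cycle_f
save `cycle_f'

// Stand-in for the second cycle (made-up values)
clear
input seqn wtint2yr
3 2000
4 4000
end
gen cycle = "2011-2012"

// Stack the two cycles and build a 4-year weight
append using `cycle_f'
gen wtint4yr = wtint2yr / 2   // assumption: 2 cycles combined, so halve the weight
summarize wtint4yr
```

Run this from top to bottom in one go, since the temporary file disappears when the .do file finishes. In a real analysis, the new weight would take the place of wtint2yr in the svyset line.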

// Link to all NHANES datasets:
// 1. Click on years of interest (e.g., 2009-2010).
// 2. Scroll down, click on data of interest (e.g., demographics).
// 3. Right click on the XPT data file and copy the URL.
// 4. Note that the DOC file containing an overview of the file is right there.
//    Take a peek at its contents and return with questions. 
*********************download the demographics!!**********************
// The demographics file contains the weights.
// You ALWAYS need the demographics file.
// Paste the URL for the demographics of interest below:
import sasxport "", clear
sort seqn // Sort the file in order of the unique identifiers. Not necessary,
//           but rarely having an unsorted file will cause analytic issues.
// Save as a Stata dataset
save "DEMO_F.dta", replace
*********************download other files*****************************
// Let's look at the "questionnaire" for "blood pressure & cholesterol"
// and "kidney conditions - urology"
// Lost? Go back to the link on the very first line of this .do file
// and click on the year of interest again (2009-2010), scroll down and
// click "questionnaire". 
// Paste the URL for the BP & cholesterol below:
import sasxport "", clear
// Sort and save as a Stata dataset
sort seqn 
save "BPQ_F.dta", replace
// We aren't going to use the kidney file in this analysis, but just an
// example of how to merge a second dataset, copy/paste URL for kidney conditions
import sasxport "", clear
// Sort and save as a Stata dataset
sort seqn
save "KIQ_U_F.dta", replace
*********************merge your datasets******************************
clear all // clear all memory
// start with the demographics file
use "DEMO_F.dta", clear
// Merge demographics with bp & chol
// The unique identifier in this dataset is "seqn", it may be different in other
// NHANES datasets.
merge 1:1 seqn using "BPQ_F.dta"
drop  _merge // this is a merge variable that needs to be dropped prior to 
//              merging other datasets. This variable will be helpful 
//              in exploring causes of merging issues. 
// Merge with kidney conditions datafile
merge 1:1 seqn using "KIQ_U_F.dta"
drop  _merge
// you can now save as a merged data file if you want. 
save nhanesBPall.dta, replace
*********************do an analysis!!*********************************
// Fortunately, Stata makes this easy with built in survey commands.
// You need to tell it that it's survey data with the "svyset" command then use 
// specific functions designed for weighted analyses.
// To explore these available functions: 
// 1. Check out the drop-down menu in stata, statistics --> survey data
// 2. Stata 13 documentation about this code: 
// 3. A helpful video from Stata:
// For the more recent continuous NHANES ones, here are the variables needed for
// weighting:
//   weight for interview data: wtint2yr 
//   weight for laboratory (MEC) data: wtmec2yr 
//   Sampling units (PSU): sdmvpsu
//   Strata: sdmvstra
//   Single unit: there isn't one. 
// NOTE: Older NHANES datasets use different variables
// We are using interview data (questionnaires), so our command is
svyset sdmvpsu [pweight = wtint2yr], strata(sdmvstra) vce(linearized) singleunit(missing)
// Syntax: "svy: COMMAND var" - for a list of commands, type "help svy_estimation"
// The 4 basic survey descriptive commands:
// 1. mean
// 2. proportion
// 3. ratio
// 4. total
// there are a bunch of other commands, including logistic regression, etc. 
// let's see what the mean age was, the variable is "ridageyr"
svy: mean ridageyr
// you see that the mean age of the US population (301 million people) in 2009-2010 was
// 36.7 years. COOL!
// NEXT:
// Let's see the amount of people who have been told they have high blood pressure
// VARIABLE: bpq020
// First, look at the documentation for this variable on the source XPT file's DOC page
// note that this is coded as 1=yes, 2=no, 7=refused, 9=don't know, .=missing
// Only asked if age >=16, so not the entire 301 million US population will be here
// let's look at the responses given
svy: proportion bpq020
// let's make a new variable with 0=no, 1=yes, .=all others
gen toldhtn0n1y =.
replace toldhtn0n1y=0 if bpq020==2 // no (you need a double equals sign after the if)
replace toldhtn0n1y=1 if bpq020==1 // yes
//                                    all others remain as =.
// let's see the proportion who have been told they have high BP
svy: proportion toldhtn0n1y
// SO: of the 234 million people >=16 years, 28% have been told that they have htn
// A little program to spit out the results in english:
matrix htnansw = e(b) // e(b) is where the estimated proportions are stored. Type "ereturn list" to see
//                       which matrices are available. Type "matrix list [name of matrix]" to 
//                       see content of each matrix. This "matrix htnansw = e(b)" command will
//                       save the temporary matrix for e(b) under the permanent matrix
//                       "htnansw" that we can then manipulate.
// If you typed "matrix list htnansw", you'd see that the proportion answering "yes"
// is saved in the second column of the first row of this matrix, or [1,2].
local yesbp = htnansw[1,2] // pluck out the value of the 1,2 cell in the saved matrix 
//                            (the yes proportion) as a macro to call later
// NOTE: You call macros by opening with the tick to the left of number 1 on your keyboard,
// writing the name of the macro, then closing with a traditional apostrophe.
// Read about macros here:
// save # of people who answered y/n (234 million)
matrix subpop = e(_N_subp) // pluck out the # of people in this population, aka the # of 
//                            americans >=16 years old, as a permanent matrix named "subpop"
local population = subpop[1,1] // make a macro plucking the 1,1 cell where the total
//                                # of americans are in this population
// how data can be presented:
di "Unrounded " "`yesbp'" // just to prove how you can present it. Note that yesbp is a macro.
di "Rounded " round(`yesbp'*100) // round at the decimal after mult by 100
di "Total population is " `population' // note exponent
di "Total population is " %18.0fc `population' // note no exponent but helpful commas
di "Among Americans >=16 years, " round(`yesbp'*100) "% have been told that they have high BP."
di "Among Americans >=16 years, " round(`yesbp'*`population') " Americans have been told they have high BP."
// NOTE: If you run this line-by-line, stata may drop the macros above.  
// Run the script from the very top if you are getting errors in these last few lines. 
// Fin.

Generic start of a Stata .do file

I took the Stata programming class at the Johns Hopkins School of Public Health during grad school. It was taught by Dorry Segev. If you are at the school, I highly, highly, highly recommend taking it and doing all of the assignments in term 4. It saved me many hours of labor in writing up my thesis. It’s a phenomenal class.

One of the biggest takeaways from the class was using a .do file as much as possible when interacting with Stata. As in 99% of the time.

Below is the stock header and footer of every .do file that I make. Steps to success:

  1. Open a blank .do file
  2. Paste the code from below
  3. Save it in the same folder as your dataset
  4. Close Stata
  5. In Windows File Explorer, find your new .do file and open it up then get rolling.

By opening the .do file through file explorer, Stata automatically knows which folder you are working in. Then you don’t have to write the entire directory to start. For example, you can write:

use data.dta, clear

…and not

use c:\windows\users\myname\work\research\001project\data\data.dta, clear
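If you are not sure which folder Stata is working in, a quick check like this sketch will tell you. The commented-out path is a placeholder, not a real folder.

```stata
pwd                      // prints the current working directory
display "`c(pwd)'"       // the same folder, pulled from Stata's system values
// cd "path/to/your/project" // placeholder path; only needed if pwd shows the wrong folder
```

If the folder shown is the one holding your .do file and dataset, the short version of the use command will work.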

******************************HEADER STARTS HERE********************************
// at the beginning of every do file:
macro drop _all // remove macros from previous work, if any
capture log close // Close any open logs. Capture will ignore a command that gives 
//                   an error. So if there isn't an open log, instead of giving you 
//                   an error and stopping here, it'll just move onto the next line.
clear all // clean the belfries
drop _all // get rid of everything!

log using output.log, replace text // change the name of this to whatever you'd like

// The purpose of this .do file is... [say why you are writing this do file]

version 15 // Every version of Stata is slightly different, but all are backwards 
//            compatible with previous ones. If you open up this do file with a way 
//            newer version, it'll run it in version 15 compatibility mode. Change 
//            this to the current version of Stata that you are using. This will 
//            also keep your code from running on older versions of stata that will 
//            break with new code that it isn't designed to handle. 

set more off, permanently // so you don't have to keep clicking through stata to 
//                           keep it running

set linesize 255 // this keeps longer lines from getting clipped. Helpful for making 
//                  tables.

capture shell md pictures // this makes a folder called pictures in the Windows 
//                           version of stata. Save your pictures here.
capture shell mkdir pictures // ditto, except in the Mac version.

******************************IMPORT DATA HERE**********************************
* working with stata dta file: 
// use "data.dta", clear // change this with whatever data file you are using and 
//                        remove the double slashes. Note: putting quotes around 
//                        filenames lets you open files with spaces in the name.
* working with excel file: 
// import excel using "name of file.xlsx", firstrow clear // firstrow imports 
//                       the first row of the sheet as variable names. Change to the 
//                       appropriate name and delete the double slashes as needed. 
* working with csv file:
// import delim "name of file.csv", clear
******************************CODE STARTS HERE********************************** 
// ... you get the idea

******************************FOOTER STARTS HERE********************************
// At the very end of your .do file: 
log close
// Fin.

Table 1 program

Here are a few simple Stata programs that will write a CSV file for your Table 1.

This has three parts:

  1. Header (writes the first line of the table with names and the row with Ns),
  2. Program for continuous variables, and
  3. Program for dichotomous variables.

Each time you want to add another line to your table, just call the appropriate program followed by the variable of interest.

clear all // get rid of everything in memory
webuse auto.dta, clear

set seed 12345
gen treatment = .
replace treatment = round(runiform()) // make a random treatment 
// variable that's 0 or 1

// Header + second row with Ns
quietly {
capture log close table1 // force closes any tables with the same name
log using "my_table_1.csv", text replace name(table1) //replace will erase 
// any CSV files you started already with the same name
noisily disp ",All,Group 1,Group 2" // line 1
count // total N; this is what fills r(N) for the next line
local nall=r(N)
count if treatment==0
local ntreatment0=r(N)
count if treatment==1
local ntreatment1=r(N)
noisily disp "N," `nall' "," `ntreatment0' "," `ntreatment1'
log close table1
} // end quietly

// Program for continuous variables
capture program drop table1_cont // drops any programs with the same name
program define table1_cont
quietly {
syntax varlist
capture log close table1
log using "my_table_1.csv", text append name(table1) // append will 
// keep writing onto existing tables
foreach var of varlist `varlist' {
sum `var'
local `var'mean = r(mean)
local `var'sd = r(sd)
local `var'n=r(N)
sum `var' if treatment==0
local `var'mean0 = r(mean)
local `var'sd0 = r(sd)
local `var'n0=r(N)
sum `var' if treatment==1
local `var'mean1 = r(mean)
local `var'sd1 = r(sd)
local `var'n1=r(N)

noisily disp "`var' (Mean (SD))," ///
%3.1f ``var'mean' " (" %3.1f ``var'sd' "),"  ///
 %3.1f ``var'mean0' " (" %3.1f ``var'sd0' "),"  ///
 %3.1f ``var'mean1' " (" %3.1f ``var'sd1' ")"
} // end varlist loop
log close table1
} // end quietly
end // end of program table1_cont

// program for dichotomous variables
capture program drop table1_dichotomous
program define table1_dichotomous
quietly {
syntax varlist
capture log close table1
log using "my_table_1.csv", text append name(table1)
foreach var of varlist `varlist' {
sum `var'
local `var'n= r(N)
local `var'mean = r(mean)*100
sum `var' if treatment==0
local `var'n0 = r(N)
local `var'mean0 = r(mean)*100
sum `var' if treatment==1
local `var'n1 = r(N)
local `var'mean1 = r(mean)*100
noisily disp "`var' (N (%))," ///
``var'n' " (" %3.1f ``var'mean' "),"  ///
``var'n0' " (" %3.1f ``var'mean0' ")," ///
``var'n1' " (" %3.1f ``var'mean1' ")" 
} // end varlist loop
log close table1
} // end quietly
end // end of program table1_dichotomous
// now just call these programs as needed: 

table1_cont trunk 
table1_cont weight 
table1_dichotomous foreign 
// and so-on

Here is the output from the example above (the row of Ns that the header writes is not shown):

,All,Group 1,Group 2
trunk (Mean (SD)),13.8 (4.3),13.0 (3.9),14.8 (4.6)
weight (Mean (SD)),3019.5 (777.2),2906.0 (746.1),3176.8 (804.1)
foreign (N (%)),74 (29.7),43 (32.6),31 (25.8)

If opened with MS Excel, it will look like this: