Rendering XKCD #2023 “Misleading Graph Makers” in Stata

Let’s render an XKCD comic using Stata!

I loved today’s XKCD comic, so I took some time while eating my sandwich to write a .do file that renders it in Stata. Stata doesn’t have great smooth-line options without figuring out the exact function for each curve, so I approximated the data points instead. One interesting problem was including quotation marks in the X-axis label, since quotation marks are what define the label text and its line breaks. The solution is Stata’s compound double quotes: open each line with a backtick plus a double quote (the backtick is to the left of the number 1 on your keyboard) and close it with a double quote plus an apostrophe. This is also a nice example of how to input data directly in a .do file.
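To see the compound quote trick in isolation, here is a minimal sketch (y and x are placeholder variable names, not from the comic):

twoway scatter y x, ///
xtitle(`"A label with "quotes" inside it"' ///
"and a plain second line")

The `" opens the compound quote and the "' closes it, which is what lets plain quotation marks survive inside the label.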

End result:

Code:

clear all

input id proportion band1 band2 band3 band4 band5 band6 band7 band8 band9 band10
0 . 21 22 23 24 25 26 27 28 29 30
0.3 . 21 22 23.7 25.5 26.3 28 28.8 29.2 29.5 30
0.5 . 20.8 22.5 24.7 27 28 29 29.2 29.4 29.7 30
0.7 . 20.6 25 27.4 28.4 29 29.3 29.5 29.6 29.9 30
0.9 . 20.1 28 28.5 29 29.3 29.5 29.7 29.8 29.9 30
1 23 20.1 28.5 29 29.3 29.5 29.6 29.7 29.8 29.9 30
1.3 . 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
2 23.5 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
3 22.3 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
4 23.5 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
5 23 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
6 28 20.1 29.2 29.3 29.4 29.5 29.6 29.7 29.8 29.9 30
end

set scheme s1mono

graph twoway ///
(connected proportion id, lcolor(gs0) mcolor(gs0)) ///
(scatter band1 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band2 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band3 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band4 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band5 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band6 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band7 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band8 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band9 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
(scatter band10 id, conn(j) lstyle(solid) lcolor(gs12) mstyle(none)) ///
, ///
title(Y-Axis) ///
xlabel(none) /// 
xline(1(1)6, lpattern(solid) lcolor(gs12)) ///
ylabel(20 "0%" 25 "50%" 30 "100%", angle(0)) ///
aspect(1) ///
ytitle("") ///
legend(off) ///
xtitle(`"People have wised up to the "carefully"' ///
`"chosen Y-axis range" trick, so we misleading"' ///
"graph makers have had to get creative.")

graph export xkcd_2023.png, width(1000) replace

 

My lunch break is over!  Back to work.

Making Scatterplots and Bland-Altman plots in Stata

Scatterplots with a fitted line

This is pretty straightforward in Stata. This is a variation of a figure that I made for a JAMA Internal Medicine paper. (I like to think that the original figure was publication quality, but they had their graphics team redo it in their own format.) It contains a diagonal line of unity and cutoffs for systolic hypertension as vertical/horizontal lines. Unlike the publication version, this includes a thick dashed fitted line.

This code actually makes two different scatterplots then merges them together and puts labels over the merged version.

The scatter function here wants a Y-axis variable and X-axis variable. The two variables were ibp_sbp_pair (Y-axis) and average_sbp_omron_23 (X-axis) for systolic and ibp_dbp_pair and average_dbp_omron_23 for diastolic.

Code:

**********scatterplot ave with lines of fit
***sbp
twoway (lfit ibp_sbp_pair average_sbp_omron_23, lcolor(gray) lpattern(dash) lwidth(vthick)) /// line of fit code
(function y=x, ra(average_sbp_omron_23) clcolor(gs4)) /// diagonal line of unity
(scatter ibp_sbp_pair average_sbp_omron_23 , mcolor(black) msize(vsmall)), /// make dots appear for scatter, y x axis
legend(off) /// hide legend
title("Systolic BP", color(black)) ///
ytitle("") /// no title, will add when merging SBP and DBP
xtitle("") /// ditto
xline(140, lpattern(solid) lcolor(gray)) /// cutoff for systolic hypertension
yline(140, lpattern(solid) lcolor(gray)) /// ditto
graphregion(color(white)) ylabel(, grid glcolor(gs14)) /// white background, light gray lines
xlabel(90(20)170) ylabel(90(20)170) /// where X and Y labels occur
aspectratio(1) // force figure to be a 1x1 square, not a rectangle
graph save 20_sbp_scatterplot_fit.gph, replace // need graph to merge later
graph export 20_sbp_scatterplot_fit.png, width(4000) replace

***dbp
twoway (lfit ibp_dbp_pair average_dbp_omron_23, lcolor(gray) lpattern(dash) lwidth(vthick)) /// 
(function y=x, ra(average_dbp_omron_23) clcolor(gs4)) ///
(scatter ibp_dbp_pair average_dbp_omron_23, mcolor(black) msize(vsmall)), /// 
legend(off) ///
title("Diastolic BP", color(black)) ///
ytitle("") ///
xtitle("") ///
xline(90, lpattern(solid) lcolor(gray)) ///
yline(90, lpattern(solid) lcolor(gray)) ///
graphregion(color(white)) ylabel(, grid glcolor(gs14)) ///
xlabel(30(20)110) ylabel(30(20)110) ///
aspectratio(1)
graph save 21_dbp_scatterplot_fit.gph, replace
graph export 21_dbp_scatterplot_fit.png, width(4000) replace

****combined scatterplot
graph combine 20_sbp_scatterplot_fit.gph 21_dbp_scatterplot_fit.gph, /// 
graphregion(color(white)) ///
b1title("Standard (mmHg)") ///
l1title("IBP (mmHg)") ///
ysize(3)
graph save combined_scatterplots_fit.gph, replace // 
graph export combined_scatterplots_fit.png, width(4000) replace

Bland-Altman plots

This is from the same paper.

Again, two different B-A plots are made and merged, then labels are applied to the combined figure. The dotted line is the mean difference; the long-dashed lines are +/- 2 SD.

As far as Stata’s graph maker is concerned, this is a scatterplot. You just need to set up all of the variables intentionally to trick it into rendering a B-A plot. The Y-axis is the difference between the variables and the X-axis is a mean of the variables.

Code:

***sbp
***prep for figure
gen mean_sbp_ave=(average_sbp_omron_23+ibp_sbp_pair)/2 // this will be the x-axis
gen diff_sbp_ave=ibp_sbp_pair-average_sbp_omron_23 // this will be y-axis
sum diff_sbp_ave // this allows you to make a macro of the mean ("r(mean)") of the y axis variable
global mean1=r(mean) // this saves the macro as mean1, to be called later
global lowerCL1=r(mean) - 2*r(sd) // this saves a macro for the mean minus 2 times the SD ("r(sd)")
global upperCL1=r(mean) + 2*r(sd)
***make graph
graph twoway scatter diff_sbp_ave mean_sbp_ave, ///
legend(off) mcolor(black) ///
ytitle("") /// ytitle("Reference Minus Comparator (mmHg)")
xtitle("") /// xtitle("Average of Reference and Comparator (mmHg)")
title("Systolic BP", color(black)) /// 
yline($mean1, lpattern(shortdash) lcolor(gray)) /// calls the macro from above
yline($lowerCL1, lpattern(dash) lcolor(gray)) /// ditto
yline($upperCL1, lpattern(dash) lcolor(gray)) /// 
graphregion(color(white)) ylabel(, grid glcolor(gs14)) /// white background
ylabel(-40(20)40) xlabel(90(20)170) /// 
aspectratio(1.08) // annoyingly, this wasn't a perfectly square figure so this line fixes it.
***save graph
graph save 1_sbp_bland_altman_ave.gph, replace
graph export 1_sbp_bland_altman_ave.png, width(4000) replace

***dbp
***prep for figure
gen mean_dbp_ave=(average_dbp_omron_23+ibp_dbp_pair)/2
gen diff_dbp_ave=ibp_dbp_pair-average_dbp_omron_23
sum diff_dbp_ave
global mean1=r(mean)
global lowerCL1=r(mean) - 2*r(sd)
global upperCL1=r(mean) + 2*r(sd)
***make graph
graph twoway scatter diff_dbp_ave mean_dbp_ave, ///
legend(off) mcolor(black) ///
ytitle("") /// 
xtitle("") ///
title("Diastolic BP", color(black)) /// 
msize(vsmall) ///
yline($mean1, lpattern(shortdash) lcolor(gray)) ///
yline($lowerCL1, lpattern(dash) lcolor(gray)) ///
yline($upperCL1, lpattern(dash) lcolor(gray)) /// 
graphregion(color(white)) ylabel(, grid glcolor(gs14)) ///
ylabel(-40(20)40) xlabel(30(20)110) ///
aspectratio(1.08)
***save graph
graph save 2_dbp_bland_altman_ave.gph, replace
graph export 2_dbp_bland_altman_ave.png, width(4000) replace

***********combined image bland altman
graph combine 1_sbp_bland_altman_ave.gph 2_dbp_bland_altman_ave.gph, ///
ycommon /// so the y axes are on the same scale 
graphregion(color(white)) ///
b1title("Average of IBP and Standard (mmHg)") ///
l1title("IBP Minus Standard (mmHg)") ///
ysize(3)
graph save combined_dbp_sbp_ba.gph, replace // 
graph export combined_dbp_sbp_ba.png, width(4000) replace

The confusing nomenclature of epidemiology and biostatistics

This should be simpler.

Epidemiology and biostatistics are awash with synonyms and each institution has its own preferred nomenclature to describe the same general concepts. I started this page as a central place to document the various terms by concept. I’ll plan on revisiting and updating over time.

Regression

Fundamentals

You probably learned the fundamentals of regression in introductory algebra but may not realize it.  Remember drawing a graph from a slope-intercept equation? Draw a graph where Y is equal to 1/4x plus 5. (Here is the relevant Khan Academy Algebra I video about this.) You take the general equation:

Y = mx + b

…where Y is the y-axis, m is the slope of the line, and b is where the line crosses the y-axis. The equation you will write is:

Y=1/4x + 5

…and you will draw:

 

Sound familiar? When you do a linear regression, you do the same thing, except now you regress Y on X, or:

Y = β1x1 + β0

Fitting in the variables here: say you want to figure out the predicted cholesterol level for folks at a given age. You would regress cholesterol level on age:

Cholesterol level = β1*Age + β0

Here, β1 is the slope of the line for age and β0 is the intercept on the Y-axis, essentially the same as the m and b in Y=mx+b. When you run a regression in Stata, you type

regress y x

or here,

regress cholesterol age

Let’s say that Stata spits out something like:

      Source |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx  
-------------+------------------------------    
       Model |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx   
    Residual |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx    
-------------+------------------------------    
       Total |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx   

------------------------------------------------------------------------------
  cholesterol|      coeff       se         t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     0.500     xxxxxxxx    xxxxx   0.000     0.4000      0.60000
       _cons |     100       xxxxxxxx    xxxxx   0.000     90.000      110.000
------------------------------------------------------------------------------

The β coefficient for age is 0.5. The intercept, or β0, is 100. You would interpret this as cholesterol level = 0.5*age in years + 100. You could plot this using your Algebra I skills.

Cholesterol = 0.5*age + 100

Or you can substitute in actual numbers. What is the predicted cholesterol at age 50? Answer: 125.
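You can check that substitution with Stata’s -display- command:

display 0.5*50 + 100
// 125

Plug in any other age the same way.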

If you want to make it more complex and add more variables to explain cholesterol level, it’s no longer a straight line on a graph, but the concept is the same. A multiple linear regression adds more X variables. You can figure out what a predicted cholesterol level will be for folks by age, sex, and BMI. You would regress cholesterol level on age, sex, and BMI. (You would code sex as 0 or 1, like female = 1 and male = 0.)

Y =  β1x1 + β2x2 + β3x3 + β0

Or,

Y =  β1*Age + β2*Sex + β3*BMI + β0

You get the idea.
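In Stata, this is the same -regress- command with the additional X variables listed after the outcome (the variable names here are hypothetical):

regress cholesterol age sex bmi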

Names of Y and X

This is what irks me. There are so many synonyms for Y and X variables. Here is a chart that I’ll update over time with synonyms seen in the wild.

Y variable            X variable
Dependent             Independent
Outcome               Predictor
                      Covariate
                      Factor
                      Exposure variable
                      Explanatory variable

Code to make a dot and 95% confidence interval figure in Stata

Dot and confidence interval figures in Stata

Stata has a pretty handy -twoway dot- command that can be combined with -twoway rcap- in the code below. The only annoying thing is that -twoway dot- inserts a line that connects your dot with the Y-axis. The code below will make this sharp-looking figure without the connecting line.

Also, there is a bonus code at the end to make this figure:

 

First step, make an Excel file

I made an Excel file with the columns below, called “dot and 95 percent ci data.xlsx”, saved in the same folder as my .do file. This figure will display row 1 at the top and row 14 at the bottom. The gaps between the lines are the absent rows 3, 6, 9, and 12. Group tells Stata whether you want a hollow or solid dot. Proportion is the point estimate, and low95 and high95 are the surrounding 95% confidence intervals.

Note that the data here are made up and are not related to any actual ongoing clinical investigation. 

Next step, make a .do file

In the same folder as the Excel file, copy/paste/save the code below as a .do file. Close Excel and close Stata then find the .do file from Windows Explorer and double click it. Doing this will force Stata to set the working directory as the folder containing the .do file (and the Excel file).

If the code won’t work, you probably have Excel open. Close it and try again.

That’s it!

Share and enjoy. Code below.

********************************************************************************
******************************IMPORT DATA HERE**********************************
********************************************************************************
import excel "dot and 95 percent ci data.xlsx", firstrow clear
destring, replace

******************************************************************************** 
******************************CODE STARTS HERE********************************** 
******************************************************************************** 
capture ssc install scheme-burd, replace // this installs more color patterns. 
set scheme burd4 // I like burd4.

twoway ///
(dot proportion row if group ==1, horizontal dcolor(bg)) /// dcolor(bg) hides dotted line 
/// connecting the dot with the Y axis by matching it with the background. 
(dot proportion row if group ==2, horizontal dcolor(bg)) ///
(rcap low95 high95 row, horizontal), /// code for 95% CI
legend(row(1) order(1 "legend 1" 2 "legend 2") pos(6)) /// legend at 6 o'clock position
ylabel(1.5 "Model A" 4.5 "Model B" 7.5 "Model C" 10.5 "Model D" 13.5 "Model E", angle(0) noticks) ///
/// note that the labels are 1.5, 4.5, etc so they are between rows 1&2, 4&5, etc.
/// also note that there is a space in between different rows by leaving out rows 3, 6, 9, and 12 
xlabel(.95 " " 1 "1.0" 1.1 "1.1" 1.2 "1.2" 1.3 "1.3" 1.4 "1.4" 1.5 "1.5" 1.6 " ", angle(0)) /// no 1.6 label
xtitle("X axis") /// 
ytitle("Y axis") /// 
yscale(reverse) /// y axis is flipped
/// aspect (next line) is how tall or wide the figure is
aspect(.5)

graph export "dot and 95 percent ci figure.png", replace width(2000)
//graph export "dot and 95 percent ci figure.tif", replace width(2000)

Bonus code for an AJC paper figure

This figure is from a 2018 manuscript that we published in AJC using SPRINT data.

Here’s a view of our “024a arr.xlsx” Excel file that contained the data that went into this figure. The label1 and label2 columns were there to help me make sense of everything while writing this code and aren’t actually used. All that the code below calls is row, group, proportion, low95, and high95.

Here is the code:

********************************************************************************
******************************IMPORT DATA HERE**********************************
********************************************************************************
import excel "024a arr.xlsx", firstrow clear
destring, replace

******************************************************************************** 
******************************CODE STARTS HERE********************************** 
******************************************************************************** 
set scheme s1color

twoway ///
(dot proportion row, horizontal ndots(0) mcolor(black)) /// ndots(0) hides dotted line 
/// connecting the dot with the Y axis 
(rcap low95 high95 row, horizontal lcolor(black)), /// code for 95% CI
legend(row(2) order(1 "Point Estimate" 2 "95% CI") pos(6)) /// legend at 6 o'clock position
ylabel(1 "All" 3 "Q1" 4 "Q2" 5 "Q3" 6 "Q4" 8 "<10%" 9 "≥10%" 12 "All" 14 "Q1" 15 "Q2" 16 "Q3" 17 "Q4" 19 "<10%" 20 "≥10%", angle(0) noticks) ///
/// there is a space in between different rows by leaving out rows 3, 6, 9, and 12 
xlabel(-5 "-5%" 0 "0%" 5 "5%" 10 "10%", angle(0)) ///
xtitle("Risk difference") /// 
ytitle(" SAEs ASCVD Events") /// the leading space just makes the label sit centered in the figure
xline(0, lcolor(gs4)) ///
yscale(reverse) /// 
text(4.5 11 "P=0.84", place(e) size(small)) ///
text(8.5 11 "P=0.67", place(e) size(small)) ///
text(15.5 11 "P=0.82", place(e) size(small)) ///
text(19.5 11 "P=0.95", place(e) size(small)) ///
aspect(1.68)


graph export "024a arr figure.png", replace width(2000) 
graph export "024a arr figure.tif", replace width(2000) // tif file sent in for publication

Making a horizontal stacked bar graph with -graph twoway rbar- in Stata

Making a horizontal stacked bar graph in Stata

I spent a bit of time making a variation of this figure today. (The data here are made up.) I’m pleased with how it came out. I first tried to use the -graph bar, horizontal- command, but it didn’t give me as much customization as -graph twoway rbar…, horizontal-. I think it looks pretty slick.

 

 

Start with an Excel file

I made an Excel file called stacked bar graph data.xlsx that I saved in the same folder as a .do file. I closed Stata and reopened that .do file from Windows Explorer so that Stata set the working directory to the folder containing the .do file and, more importantly, the Excel file.

 

Group is the number of the individual bars, bottom is the bottom of the first segment of a bar, q1 is the top of the first segment of each bar. The rest should be obvious. I made this for quartiles, hence the q1-4 names. You can tweak the numbers by editing the Excel file, hitting save/close, and rerunning the .do file.

Make sure that you save and close Excel before running the .do file or the .do file won’t run and you will be sad. 
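If you’d rather test the graphing code without building the Excel file first, you could replace the import step with an -input- block, like in the XKCD post above (these numbers are illustrative, not the ones from my spreadsheet):

clear
input group bottom q1 q2 q3 q4
1 0.25 1.1 4.5 21 95
2 0.25 1.3 4.0 18 82
3 0.25 0.9 5.5 26 100
4 0.25 1.2 4.8 22 90
end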

My .do file for making this horizontal stacked bar graph

Here’s my code! I hope it’s useful.

 

********************************************************************************
******************************IMPORT DATA HERE**********************************
********************************************************************************
import excel "stacked bar graph data.xlsx", firstrow clear // make sure that excel
//                                                           is closed before you
//                                                           run this script!
destring, replace

********************************************************************************
******************************CODE STARTS HERE**********************************
********************************************************************************
capture ssc install scheme-burd, replace // this installs nicer color schemes
// see the schemes here: https://github.com/briatte/burd
set scheme burd4 

graph twoway ///
(rbar bottom q1 group, horizontal) ///
(rbar q1 q2 group, horizontal) ///
(rbar q2 q3 group, horizontal) ///
(rbar q3 q4 group, horizontal) ///
, /// if you modify this file and it stops working, check the placement of this comma
xscale(log) /// make the x axis log scale
xla(0.25 "0.25" 0.5 "0.5" 1 "1" 2.5 "2.5" 5 "5" 10 "10" 25 "25" 50 "50" 100 "100") ///
yla(1 "Group 1" 2 "Group 2" 3 "Group 3" 4 "Group 4") ///
ytitle("Y title") ///
xtitle("X title") ///
text(4 1.2 "Q1", color(white)) /// first number is y second is x.
text(4 4.3 "Q2", color(white)) ///
text(4 20 "Q3", color(white)) ///
text(4 90 "Q4", color(white)) ///
legend(off) /// no legend, aspect is the shape of the figure. 1 is tall and thin.
aspect(.25) 
//
// export the graph as a PNG file
graph export "stacked graph.png", replace width(2000)
// graph export "stacked graph.tif", replace width(2000) // in case you want as a tiff

Downloading and analyzing NHANES datasets with Stata in a single .do file

Learn to love NHANES

NHANES is a robust, nationally representative cross-sectional study. For the past ~18 years, it has sampled different communities across the US in continuous 2-year cycles. A few of these years are linked to National Death Index data, so you can assess risk factors at the time of the survey and use time-to-event mortality data to identify novel risk factors for death. Manuscripts using NHANES data have been published across the spectrum of medical journals, all the way up to NEJM. The best part? You can download almost all of the NHANES data from the CDC website right now for free.

Manipulating NHANES data is challenging for beginners because of the sheer quantity of individual files and the requirement for weighting. Plus, all of the files are in SAS XPT format, so you have to download, import, save, and merge before you can even think about starting an analysis. To make this data management task slightly more complex, the CDC sporadically publishes interval updates of the source data files on its website. Files may be updated for errors or removed entirely without you knowing about it. (I strongly recommend subscribing to the NHANES Listserv to get real-time updates.) If you have NHANES files saved locally from 2-3 years ago, there’s a reasonable chance that you are using outdated databases, which could yield some false conclusions. Re-downloading all of the many files every time you want to do a project is a big headache.

Leverage Stata’s internet connectivity to make NHANES analyses easy

I love that Stata will download datasets for you with just a URL. The .do file below shows you how easy it is to download just the needed files on the fly then do some simple analyses. This means that you don’t have to worry about maintaining your personal database of NHANES files. If the source files are updated by the CDC, no worry! Every time you run this .do file, it’ll grab the freshest data files available. If the source data files are removed for inaccuracies, the file won’t run and you’ll be prompted to investigate. For example, the 2011-2012 Folate lab results were withdrawn February 2018. If you tried to download the FOLATE_G file, CDC’s website will give their version of a 404 error and Stata will stop the .do file cold in its tracks.

What this .do file does

In this short script, you’ll see how to 1. Import the NHANES SAS XPT files directly from the CDC website with just the XPT file’s URL, 2. Save the data as Stata .dta files, 3. Merge the .dta files, 4. Review basic coding issues, 5. Run an analysis using weighting, and 6. Display the data.

In this example, we’ll look at the 2009-2010 NHANES results and apply weighting to estimate the amount of the US population who have been told that they have high blood pressure. Just copy/paste the code below and save into a .do file. Set the working directory to be in the same folder as your .do file. Or, copy/paste the .do file, save it, close Stata, open the .do file through Explorer, and then run! Opening the .do file from Windows Explorer with Stata closed sets the .do file’s parent folder to be Stata’s working directory.

What this .do file doesn’t do

You won’t be able to merge files with multiple entries per user. For example, the Prescription Medications – Drug Information questionnaire has a row for each medication (and the ICD code for its use) for every participant. You’ll need to use the -reshape wide- command on those variables before merging. BTW, that code is:

bys seqn: gen j=_n
reshape wide rxddrug rxddrgid rxqseen rxddays, i(seqn) j(j)

This also won’t merge with the National Death Index files, which are hosted elsewhere. The NHANES-NDI linkage website provides an example Stata .do script that would be straightforward to include below.

Finally, if you are trying to combine analyses from multiple NHANES cycles (say, combining 2009-2010 with 2011-2012), things get a bit more complicated. You’ll need to append the .dta files and consider adjusting the weights.
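As a minimal sketch for interview data, assuming you have already downloaded and saved the 2009-2010 (DEMO_F) and 2011-2012 (DEMO_G) demographics files as .dta files, the NHANES analytic guidelines have you divide each cycle’s 2-year weight by the number of cycles combined:

use "DEMO_F.dta", clear
append using "DEMO_G.dta"
gen wtint4yr = wtint2yr/2 // 4-year weight = 2-year weight divided by 2 cycles
svyset sdmvpsu [pweight = wtint4yr], strata(sdmvstra) vce(linearized) singleunit(missing)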

// Link to all NHANES datasets: https://wwwn.cdc.gov/nchs/nhanes/default.aspx
// 1. Click on years of interest (e.g., 2009-2010).
// 2. Scroll down, click on data of interest (e.g., demographics).
// 3. Right click on the XPT data file and copy the URL.
// 4. Note that the DOC file containing an overview of the file is right there.
//    Take a peek at its contents and return with questions. 
//
**********************************************************************
*********************download the demographics!!**********************
**********************************************************************
// The demographics file contains the weights.
// You ALWAYS need the demographics file.
//
// Paste the URL for the demographics of interest below:
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear
sort seqn // Sort the file in order of the unique identifiers. Not necessary,
//           but rarely having an unsorted file will cause analytic issues.
// Save as a Stata dataset
save "DEMO_F.dta", replace
//
**********************************************************************
*********************download other files*****************************
**********************************************************************
// Let's look at the "questionnaire" for "blood pressure & cholesterol"
// and "kidney conditions - urology"
//
// Lost? Go back to the link on the very first line of this .do file
// and click on the year of interest again (2009-2010), scroll down and
// click "questionnaire". 
//
// Paste the URL for the BP & cholesterol below:
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
// Sort and save as a Stata dataset
sort seqn 
save "BPQ_F.dta", replace
//
// We aren't going to use the kidney file in this analysis, but just an
// example of how to merge a second dataset, copy/paste URL for kidney conditions
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
// Sort and save as a Stata dataset
sort seqn
save "KIQ_U_F.dta", replace
//
**********************************************************************
*********************merge your datasets******************************
**********************************************************************
clear all // clear all memory
//
// start with the demographics file
// ALWAYS INCLUDE THE DEMOGRAPHICS FILE SINCE IT HAS WEIGHTS
use "DEMO_F.dta", clear
//
// Merge demographics with bp & chol
// The unique identifier in this dataset is "seqn", it may be different in other
// NHANES datasets.
merge 1:1 seqn using "BPQ_F.dta"
drop  _merge // this is a merge variable that needs to be dropped prior to 
//              merging other datasets. This variable will be helpful 
//              in exploring causes of merging issues. 
// Merge with kidney conditions datafile
merge 1:1 seqn using "KIQ_U_F.dta"
drop  _merge
//
// you can now save as a merged data file if you want. 
save nhanesBPall.dta, replace
//
**********************************************************************
*********************do an analysis!!*********************************
**********************************************************************
// YOU NEED TO APPLY WEIGHTING.
// Fortunately, Stata makes this easy with built in survey commands.
// You need to tell it that it's survey data with the "svyset" command then use 
// specific functions designed for weighted analyses.
// To explore the available functions: 
// 1. Check out the drop-down menu in stata, statistics --> survey data
// 2. Stata 13 documentation about this code: 
//            https://www.stata.com/manuals13/svysvyset.pdf
// 3. A helpful video from Stata: https://www.youtube.com/watch?v=lRTl8GKsZTE
//
// USING SVYSET:
// For the more recent continuous NHANES ones, here are the variables needed for
// weighting:
//   weight for interview data: wtint2yr 
//   weight for laboratory (MEC) data: wtmec2yr 
//   Sampling units (PSU): sdmvpsu
//   Strata: sdmvstra
//   Single unit: there isn't one. 
// NOTE: Older NHANES datasets use different variables
//
// We are using interview data (questionnaires), so our command is
svyset sdmvpsu [pweight = wtint2yr], strata(sdmvstra) vce(linearized) singleunit(missing)
//
// DOING ANALYSES:
// Syntax: "svy: COMMAND var" - for a list of commands, type "help svy_estimation"
//
// The 4 basic survey descriptive commands:
// 1. mean
// 2. proportion
// 3. ratio
// 4. total
//
// there are a bunch of other commands, including logistic regression, etc. 
//
// let's see what the mean age was, the variable is "ridageyr"
svy: mean ridageyr
// you see that the mean age of the US population (301 million people) in 2009-2010 was
// 36.7 years. COOL!
//
// NEXT:
// Let's see the amount of people who have been told they have high blood pressure
// VARIABLE: bpq020
// First, look at the documentation for this variable on the source XPT file's DOC page
// https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.htm
// note that this is coded as 1=yes, 2=no, 7=refused, 9=don't know, .=missing
//
// Only asked if age >=16, so not the entire 301 million US population will be here
//
// let's look at the responses given
svy: proportion bpq020
// RECODE:
// let's make a new variable with 0=no, 1=yes, .=all others
gen toldhtn0n1y =.
replace toldhtn0n1y=0 if bpq020==2 // no (you need a double equals sign after the if)
replace toldhtn0n1y=1 if bpq020==1 // yes
//                                    all others remain as =.
//
// let's see the population on BP meds
svy: proportion toldhtn0n1y
// SO: of the 234 million adults >=16 years, 28% have been told that they have htn
// A little program to spit out the results in English:
matrix htnansw = e(b) // e(b) is where the # of people is stored. Type "ereturn list" to see
//                       which matrices are available. Type "matrix list [name of matrix]" to 
//                       see content of each matrix. This "matrix htnansw = e(b)" command will
//                       save the temporary matrix for e(b) under the permanent matrix
//                       "htnansw" that we can then manipulate.
//
// If you typed "matrix list htnansw", you'd see that the proportion answering "yes"
// is saved in the second column of the first row of this matrix, or [1,2].
//
local yesbp = htnansw[1,2] // pluck out the value of the 1,2 cell in the saved matrix 
//                            (the yes proportion) as a macro to call later
//
// NOTE: You call macros by opening with the tick to the left of number 1 on your keyboard,
// writing the name of the macro, then closing with a traditional apostrophe.
// Read about macros here: https://www.ssc.wisc.edu/sscc/pubs/stata_prog1.htm
//
// save # of people who answered y/n (234 million)
matrix subpop = e(_N_subp) // pluck out the # of people in this population, aka the # of 
//                            americans >=16 years old, as a permanent matrix named "subpop"
//
local population = subpop[1,1] // make a macro plucking the 1,1 cell where the total
//                                # of americans are in this population
//
// how data can be presented:
di "Unrounded " "`yesbp'" // just to prove how you can present it. Note that yesbp is a macro.
di "Rounded " round(`yesbp'*100) // round at the decimal after mult by 100
di "Total population is " `population' // note exponent
di "Total population is " %18.0fc `population' // note no exponent but helpful commas
di "Among Americans >=16 years, " round(`yesbp'*100) "% have been told that they have high BP."
di "Among Americans >=16 years, " round(`yesbp'*`population') " Americans have been told they have high BP."
//
// NOTE: If you run this script line-by-line, Stata may drop the local macros above
// (locals only survive a single execution). Run the script from the very top if you
// are getting errors in these last few lines.
//
// Fin.

ClipSpeak: The most user-friendly, simple text-to-speech app ever

I came across the excellent text-to-speech app, ClipSpeak, last week. It’s a freeware program that you can download here.

Installing and opening ClipSpeak

I extracted the .zip file to my desktop and created a shortcut on my Windows Taskbar for ClipSpeak. Open it up and you’ll notice that nothing happened. (You may notice a new yellow icon next to your clock.)

Using ClipSpeak

Open up anything with text. Yes, anything. A browser. A Word document. A .txt file. Highlight what you want read out loud and hit “copy” (Ctrl-C).

WHAT’S THAT SONOROUS NOISE?

Yes, that’s ClipSpeak reading your copied text back to you. You can right-click on ClipSpeak’s system tray icon (next to the clock, possibly buried under the ^ symbol) and increase or decrease the speed under Select Voice.

Why I love this

This is the simplest program ever. There is no interface. It does what you want it to do. Want it to stop reading? Quit the program and open it back up to clear the audio playback.

MS Word’s new Read Aloud feature: Helpful for dyslexia and typo-finding

New in MS Word 365 is Read Aloud. It’s a pretty straightforward, stripped down/simplified text-to-speech (TTS) tool. If you have dyslexia or otherwise are having a difficult time finding typos in your work, try this feature. It’s very helpful.

To turn it on, just head over to the Review tab and click on the Read Aloud button.

It’ll automatically start reading from your cursor forward. Traditional icons for audio playback will float in the top right corner of your document window. These include play/pause, skip back (one paragraph), and skip forward (one paragraph).

There is an icon of a dude with radio waves coming out of his head. That’s the settings button.

I like cranking the speed up by a few notches. You can also switch to other installed voices. I’m a fan of David’s sonorous speech. I’d totally pay a few bucks to have a Gilbert Gottfried voice though.

What if you have an older version of MS Word? Well, before this feature was released, I was using WordTalk v4.3. It’s a free, standalone app from the folks at the University of Edinburgh. It has some nice keyboard shortcuts that you can enable. The interface is very clunky though. In a pinch, it’ll do the trick.

2017-09-28 update

After using Read Aloud consistently for the past month, I still enjoy and recommend its use. However, I find it to be a bit glitchy with documents that have many co-author comments or embedded references from Zotero. In these documents, the playback will stop mid-word on occasion.

There are also some keyboard shortcuts for Read Aloud, documented here:

  • CTRL+Alt+Space – Start or quit Read Aloud
  • CTRL+Space – Play/pause
  • CTRL+Left arrow or CTRL+Right arrow – Skip back or forward a paragraph
  • Alt+Left or Alt+Right – Decrease or increase reading speed

2017-11-08 update

I received a new computer that ships with MS Office 2016, not 365. Read Aloud isn’t in Office 2016. It turns out that you can enable the older Text-to-Speech functionality in a pinch. If you don’t have the MS Speech Platform installed, you can download it here.

First, right click on the ribbon and select “customize the ribbon”.

Next, make a new group under the Review tab. I called mine Speak. Then under Choose Commands From, select All Commands. Scroll down to Speak and add it to your new group called Speak. Hit save.

You should now have a Speak button on your Review tab. Highlight what you want it to read and click Speak. Click it again to stop the reading.

You can adjust the reading speed under the Windows Control Panel –> Speech Recognition –> Text to Speech (on the left).

Generic start of a Stata .do file

I took the Stata programming class at the Johns Hopkins School of Public Health during grad school. It was taught by Dorry Segev. If you are at the school, I highly, highly, highly recommend taking it and doing all of the assignments in term 4. It saved me many hours of labor in writing up my thesis. It’s a phenomenal class.

One of the biggest takeaways from the class was using a .do file as much as possible when interacting with Stata. As in 99% of the time.

Below is the stock header and footer of every .do file that I make. Steps to success:

  1. Open a blank .do file
  2. Paste the code from below
  3. Save it in the same folder as your dataset
  4. Close Stata
  5. In Windows File Explorer, find your new .do file and open it up then get rolling.

By opening the .do file through file explorer, Stata automatically knows which folder you are working in. Then you don’t have to write out the full directory path to get started. For example, you can write:

use data.dta, clear

…and not

use c:\windows\users\myname\work\research\001project\data\data.dta, clear


 
********************************************************************************
******************************HEADER STARTS HERE********************************
********************************************************************************
// at the beginning of every do file:
macro drop _all // remove macros from previous work, if any
capture log close // Close any open logs. Capture will ignore a command that gives 
//                   an error. So if there isn't an open log, instead of giving you 
//                   an error and stopping here, it'll just move onto the next line.
clear all // clean the belfries
drop _all // get rid of everything!

log using output.log, replace text // change the name of this to whatever you'd like

// The purpose of this .do file is... [say why you are writing this do file]

version 15 // Every version of Stata is slightly different, but all are backwards 
//            compatible with previous ones. If you open up this do file with a way 
//            newer version, it'll run it in version 15 compatibility mode. Change 
//            this to the current version of Stata that you are using. This will 
//            also keep your code from running on older versions of Stata, which 
//            would break on newer code they aren't designed to handle. 

set more off, permanently // so you don't have to keep clicking through stata to 
//                           keep it running

set linesize 255 // this keeps longer lines from getting clipped. Helpful for making 
//                  tables.

capture shell md pictures // this makes a folder called pictures in the Windows 
//                           version of stata. Save your pictures here.
capture shell mkdir pictures // ditto, except in the Mac version.

********************************************************************************
******************************IMPORT DATA HERE**********************************
********************************************************************************
* working with stata dta file: 
// use "data.dta", clear // change this with whatever data file you are using and 
//                        remove the double slashes. Note: putting quotes around 
//                        filenames lets you open files with spaces in the name.
* working with excel file: 
// CLOSE YOUR EXCEL FILE BEFORE TRYING TO OPEN IT OR STATA WILL STOP HERE.
// import excel using "name of file.xlsx", firstrow clear // firstrow imports 
//                       the first row of the sheet as variable names. Change to the 
//                       appropriate filename and delete the double slashes as needed. 
* working with csv file:
// import delim "name of file.csv", clear
******************************************************************************** 
******************************CODE STARTS HERE********************************** 
******************************************************************************** 
// ... you get the idea

********************************************************************************
******************************FOOTER STARTS HERE********************************
********************************************************************************
// At the very end of your .do file: 
log close
// Fin.

Keeping your digital work organized

I’m not an organized person by nature. However, I’m the kind of guy that likes to implement systems to overcome problems. In medical school, I didn’t have any sort of structure for my projects and found it frustrating to find the most recent versions of papers, posters, and other writings. Here’s an approach that I started using in fellowship that works for me.

Step 1: Save all of your work in one folder

This seems pretty obvious, but modern operating systems, cloud-based storage systems, and network drives all compete for your documents. If you are a windows user, I’m willing to bet that you have different projects saved on the Desktop, in your My Documents folder, in your cloud storage, and in your email in/outboxes. You probably also have alternate working drafts of your writings on different computers. This is a recipe for disaster.

Instead, make one folder that is accessible from your usual workstation. Call it Work, or something more creative. Save freaking everything in your Work folder from now until the end of time. 

Step 2: Make subfolders for different types of your work

Within your Work folder, make a subfolder for each flavor of project that you do. I have individual folders for Grants, Editorials, Non research (for blog posts and other miscellaneous work), and Research.

Step 3: Numerically and descriptively label each of your projects in its own sub-subfolder

Let’s say you are about to have your first meeting for a new research project studying the effects of Fro-yo vs. ice cream on cholesterol among adults. All of the relevant documents for your next research project will now be saved in the 001 froyo ice cream folder, seen here:

(You also had a good idea for a maple syrup study that you have also started simultaneously.)

What is the threshold to start a new numerically and descriptively labeled sub-subfolder? Mine is pretty low. If I think I have at least a marginally good idea and have started emailing folks about it, I go ahead and start a new folder. A PDF copy of those initial emails are often the first things that I save in that new numerically and descriptively labeled sub-subfolder. (I loathe MS Outlook but have to use it. Its search is terrible. Saving copies of important emails saves me some headache.)

What kinds of things are you going to save into these folders? That’s up to you. I recommend saving your literature searches, copies of relevant important emails, research proposals, and so on. I like to make sub-sub-subfolders with headings like Data and analysis, Manuscripts, PDFs (for relevant literature saved from your library’s journal subscriptions), and Presentations. You should probably number these folders to match the overall sub-subfolder number as well.

Step 4: Name each item within your sub-sub-subfolders starting with the a) sub-subfolder name, b) a short descriptor of the thing you are writing, and c) ending with a revision and version number 

Example: You finished all of your awesome dessert-related research and are starting to write up the main findings. You are going to start your first MS Word version of the first draft. I suggest naming it like this: “001 froyo ice cream main findings r00 v01.docx”.

Using version numbers

The r00 (that’s r zero zero) means that this particular draft has never undergone a major revision from a prior draft. The v01 (that’s v zero one) means that it’s the first version ever. You’ll email out a file titled:

001 froyo ice cream main findings r00 v01.docx 

…to your primary collaborator, who will send back something with edits. They’ll probably save it as:

001 froyo ice cream main findings r00 v01 BHO edits.docx

…which is nice but doesn’t help with your naming structure. Immediately save that to Work\Research\001 froyo ice cream\Manuscripts\ as version 02, or:

001 froyo ice cream main findings r00 v02.docx

Accept or refuse their edits and save that version as:

001 froyo ice cream main findings r00 v03.docx

…then send this out to more collaborators. You’ll wind up going through somewhere between 5 and 20 (or more) manuscript versions before you are ready to submit to a journal. What’s nice about this system is that you’ll have no problem figuring out which version of a file you are getting edits on. Remember, at >5 collaborators, the probability of one co-author checking their email <1 time per month approaches 100%. You'll be working on the 10th version of your draft and receive an email from the slow reviewer with edits to version 03. Sticking to this scheme helps you figure out how to integrate their suggestions into the current version by tracking the intermediate changes.
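If you ever want to script against this naming scheme (say, to find the latest version in a folder), the revision/version pair is easy to pick out. Here's a hypothetical Python helper of my own devising, using the example filename from above:

```python
import re

def parse_rv(filename):
    """Return (revision, version) from names like '... r00 v03.docx'."""
    m = re.search(r"r(\d{2}) v(\d{2})", filename)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(parse_rv("001 froyo ice cream main findings r00 v03.docx"))  # (0, 3)
```

With the pair as integers, a simple `max()` over a folder's filenames gets you the newest revision and version.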

Note: I suggest using two digits (v01-v99) for the revision and version numbers rather than starting with one digit (v1-v9). This helps keep files in alphabetical-numerical order when sorting in your file explorer. Computers aren’t smart enough to figure out that v12 doesn’t come between v1 and v2.
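You can see the sorting problem with any language's default string sort; here's a quick Python illustration:

```python
# Default string sorting compares character by character, so "v12"
# lands between "v1" and "v2" when version numbers have mixed widths.
one_digit = sorted(["notes v1.docx", "notes v2.docx", "notes v12.docx"])
print(one_digit)  # ['notes v1.docx', 'notes v12.docx', 'notes v2.docx']

# Zero-padding to two digits keeps the order you actually want.
two_digit = sorted(["notes v01.docx", "notes v02.docx", "notes v12.docx"])
print(two_digit)  # ['notes v01.docx', 'notes v02.docx', 'notes v12.docx']
```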

Using revision numbers

For some reason, the New England Journal of Medicine doesn’t accept your manuscript for publication after sending it out for review. Lucky for you, it was sent out for review and the reviewers gave you some high-quality feedback that will improve your paper. Now it’s time to start revision 01! When prepping the inevitable JAMA submission, you’ll be starting on revision 01, version 01, or:

001 froyo ice cream main findings r01 v01.docx 

In the end, you’ll have a folder looking like this:

Look at how pristine and well-organized this is!

Step 4.5: Use trailing letters for projects with multiple publications

Let’s say that you decide to write both a main findings manuscript and a subgroup analysis manuscript. How to stay organized? Just split the 001 Manuscripts folder into two sub-sub-sub-subfolders, one for each publication, with 001a for the main findings and 001b for the subgroup analysis. It’ll look like this:

The naming scheme changes slightly as filenames will start with 001a instead of 001, like this:

Step 5: For the love of all things decent, regularly back up your Work folder!!

I can’t stress this enough. Keeping your Work folder in a commercial cloud storage drive doesn’t count as backing things up. Dropbox only saves things for 30 days in their basic plan. If you aren’t actively backing up your data, you can safely assume it isn’t backed up. One dead hard drive in a non-backed up folder can mean months or years of lost work.

Remember to check your institution’s policies for determining your backup method of choice. If you are working with PHI, saving it on an unencrypted thumb drive on your keychain won’t cut it. If it’s okay with your data policies, I recommend buying a DVD-R spindle or a bunch of cheap thumb drives and setting a calendar date to write your Work folder to a different DVD-R or thumb drive once per month (i.e., a different disc or drive each month). Keep these in a locked, secured, fireproof area that is consistent with the policies of your organization. Better yet, keep them in multiple locked, secured, fireproof areas if possible. For bonus points, get M-Disc DVDs and a compatible burner, which should be stable for >1,000 years, unlike the usual 10-20 years of conventional burnable DVD-Rs and CD-Rs.
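If part of your monthly routine is making a local archive before writing it to disc, that step is easy to automate. Here's a minimal Python sketch of my own, assuming your Work folder sits in the current directory (it does nothing about encryption, so check your data policies first):

```python
import datetime
import os
import shutil

os.makedirs("Work", exist_ok=True)  # no-op if your Work folder already exists

# Zip Work into e.g. work-backup-2017-11.zip, stamped by month,
# so each monthly run produces a distinct archive.
stamp = datetime.date.today().strftime("%Y-%m")
archive = shutil.make_archive(f"work-backup-{stamp}", "zip", "Work")
print(archive)  # full path to the new .zip file
```

Point a monthly scheduled task at a script like this, then copy the resulting .zip to your disc or drive of the month.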

Or, just speak with your friendly neighborhood IT professional about how your institution recommends backing up the data you are working with. Chances are that they have a server that they back up 3 times per day that you can just use for a modest price. It’ll be worth the sorrow of lost data and drafts.