Code to make a dot and 95% confidence interval figure in Stata

Dot and confidence interval figures in Stata

Stata has a pretty handy -twoway scatter- code that can be combined with -twoway rcap- to make the figure below. Example code at the bottom.

Horizontal version

Vertical version

First step, make an Excel file

I made an excel file with the below columns called “dot and 95 percent ci data.xlsx” saved in the same folder as my .do file. This figure will display row 1 at the top and row 14 at the bottom. The gaps in between the lines are the absent rows 3,6, 9, and 12. Group tells Stata if you want a red diamond or blue square. Proportion is the point estimate and low95 and high95 are the surrounding 95% confidence intervals.

Note that the data here are made up and are not related to any actual ongoing clinical investigation. 

Next step, make a .do file

In the same folder as the Excel file, copy/paste/save the code below as a .do file. Close Excel and close Stata then find the .do file from Windows Explorer and double click it. Doing this will force Stata to set the working directory as the folder containing the .do file (and the Excel file).

If the code won’t work, you probably have Excel open. Close it and try again.

That’s it! Code for both versions of the figure follows.

********************************************************************************
******************************IMPORT DATA HERE**********************************
********************************************************************************
import excel "dot and 95 percent ci data.xlsx", firstrow clear
destring, replace


******************************************************************************** 
******************************CODE STARTS HERE********************************** 
******************************************************************************** 
set scheme s1mono // black and white

twoway ///
(rcap low95 high95 row, horizontal) /// code for 95% CI
(scatter row proportion if group ==1, mcolor(red)) /// dot for group 1
(scatter row proportion if group ==2, mcolor(blue)) /// dot for group 2
, legend(row(1) order(2 "legend 1" 3 "legend 2") pos(6)) /// legend at 6 o'clock position
ylabel(1.5 "Model A" 4.5 "Model B" 7.5 "Model C" 10.5 "Model D" 13.5 "Model E", angle(0) noticks) ///
/// note that the labels are 1.5, 4.5, etc so they are between rows 1&2, 4&5, etc.
/// also note that there is a space in between different rows by leaving out rows 3, 6, 9, and 12 
xlabel(.95 " " 1 "1.0" 1.1 "1.1" 1.2 "1.2" 1.3 "1.3" 1.4 "1.4" 1.5 "1.5" 1.6 " ", angle(0)) /// no 1.6 label
title("Title") ///
xtitle("X axis") /// 
ytitle("Y axis") /// 
yscale(reverse) /// y axis is flipped
xline(1.0, lpattern(dash) lcolor(gs8)) ///
/// aspect (next line) is how tall or wide the figure is
aspect(.5)

graph export "dot and 95 percent ci figure horiz.png", replace width(2000)
//graph export "dot and 95 percent ci figure horiz.tif", replace width(2000)


********************************************************************************
******************************CODE STARTS HERE**********************************
********************************************************************************
set scheme s1mono // black and white

twoway ///
(rcap low95 high95 row, vert) /// code for 95% CI
(scatter proportion row if group ==1, mcolor(red)) /// dot for group 1
(scatter proportion row if group ==2, mcolor(blue)) /// dot for group 2
, legend(row(1) order(2 "legend 1" 3 "legend 2") pos(6)) /// legend at 6 o’clock position
xlabel(1.5 "Model A" 4.5 "Model B" 7.5 "Model C" 10.5 "Model D" 13.5 "Model E", angle(0) noticks) ///
/// note that the labels are 1.5, 4.5, etc so they are between rows 1&2, 4&5, etc.
/// also note that there is a space in between different rows by leaving out rows 3, 6, 9, and 12
ylabel(0.9 "0.9" 1 "1.0" 1.1 "1.1" 1.2 "1.2" 1.3 "1.3" 1.4 "1.4" 1.5 "1.5" , angle(0)) /// no 1.6 label
title("Title") ///
xtitle("X axis") ///
ytitle("Y axis") ///
yline(1.0, lpattern(dash) lcolor(gs8)) ///
/// aspect (next line) is how tall or wide the figure is
aspect(.5)

graph export "dot and 95 percent ci figure vert.png", replace width(2000)
//graph export "dot and 95 percent ci figure vert.tif", replace width(2000)

Making a horizontal stacked bar graph with -graph twoway rbar- in Stata

Making a horizontal stacked bar graph in Stata

I spent a bit of time making a variation of this figure today. (The data here are made up.) I’m pleased with how it came out. I first tried to use the -graph bar, horizontal- command, but it didn’t give me as much customization as -twoway graph rbar…, horizontal-. I think it looks pretty slick.

Start with an Excel file

I made an Excel file called stacked bar graph data.xlsx that I saved in the same folder as a .do file. I closed Stata and reopened that .do file from Windows explorer so that Stata set the working directory as the same folder that contains the .do file. More importantly, it set the working directory as the same folder that also contains the Excel file.

Group is the number of the individual bars, bottom is the bottom of the first segment of a bar, q1 is the top of the first segment of each bar. The rest should be obvious. I made this for quartiles, hence the q1-4 names. You can tweak the numbers by editing the Excel file, hitting save/close, and rerunning the .do file.

Make sure that you save and close Excel before running the .do file or the .do file won’t run and you will be sad. 

My .do file for making this horizontal stacked bar graph

Here’s my code! I hope it’s useful.

********************************************************************************
******************************IMPORT DATA HERE**********************************
********************************************************************************
import excel "stacked bar graph data.xlsx", firstrow clear // make sure that excel
//                                                           is closed before you
//                                                           run this script!
destring, replace

********************************************************************************
******************************CODE STARTS HERE**********************************
********************************************************************************
capture ssc install scheme-burd, replace // this installs nicer color schemes
// see the schemes here: https://github.com/briatte/burd
set scheme burd4 

graph twoway ///
(rbar bottom q1 group, horizontal) ///
(rbar q1 q2 group, horizontal) ///
(rbar q2 q3 group, horizontal) ///
(rbar q3 q4 group, horizontal) ///
, /// if you modify this file and it stops working, check the placement of this comma
xscale(log) /// make the x axis log scale
xla(0.25 "0.25" 0.5 "0.5" 1 "1" 2.5 "2.5" 5 "5" 10 "10" 25 "25" 50 "50" 100 "100") ///
yla(1 "Group 1" 2 "Group 2" 3 "Group 3" 4 "Group 4") ///
ytitle("Y title") ///
xtitle("X title") ///
text(4 1.2 "Q1", color(white)) /// first number is y second is x.
text(4 4.3 "Q2", color(white)) ///
text(4 20 "Q3", color(white)) ///
text(4 90 "Q4", color(white)) ///
legend(off) /// no legend, aspect is the shape of the figure. 1 is tall and thin.
aspect(.25) 
//
// export the graph as a PNG file
graph export "stacked graph.png", replace width(2000)
// graph export "stacked graph.tif", replace width(2000) // in case you want as a tiff

Downloading and analyzing NHANES datasets with Stata in a single .do file

Learn to love NHANES

NHANES is a robust, nationally-representative cross-sectional study. For the past ~18 years it sampled different communities across the US in 2 year continuous cycles. A few of these years are linked to National Death Index data, so you can assess risk factors at the time of the survey and use time-to-event mortality data to identify novel risk factors for death. Manuscripts using NHANES data have been published across the spectrum of medical journals, all the way up to NEJM. The best part? You can download almost all of NHANES data from the CDC website right now for free. (Here’s a list of all data variables available.)

Manipulating NHANES data is challenging for beginners because of the sheer quantity of individual files and requirement for weighting. Plus, all of the files are in SAS XPT format so you have to download, import, save, and merge before you can even think about starting an analysis. To make this data management task slightly more complex, the CDC sporadically publishes interval updates of the source data files on their website. Files may be updated for errors or removed entirely without you knowing about it. (I strongly recommend subscribing the the NHANES Listserv to get real-time updates.) If you have NHANES files saved locally from 2-3 years ago, there’s a reasonable chance that you are using outdated databases, which could yield some false conclusions. Re-downloading all of the many files every time you want to do a project is a big headache.

Leverage Stata’s internet connectivity to make NHANES analyses easy

I love that Stata will download datasets for you with just a URL. The .do file below shows you how easy it is to download just the needed files on the fly then do some simple analyses. This means that you don’t have to worry about maintaining your personal database of NHANES files. If the source files are updated by the CDC, no worry! Every time you run this .do file, it’ll grab the freshest data files available. If the source data files are removed for inaccuracies, the file won’t run and you’ll be prompted to investigate. For example, the 2011-2012 Folate lab results were withdrawn February 2018. If you tried to download the FOLATE_G file, CDC’s website will give their version of a 404 error and Stata will stop the .do file cold in its tracks.

What this .do file does

In this short script, you’ll see how to 1. Import the NHANES SAS XPT files directly from the CDC website with just the XPT file’s URL, 2. Save data as Stata .dta files, 3. Merge the .dta files, 4. Review basic coding issues, 5. Run an analysis using weighting, and 5. Display data.

In this example, we’ll look at the 2009-2010 NHANES results and apply weighting to estimate the amount of the US population who have been told that they have high blood pressure. Just copy/paste the code below and save into a .do file. Set the working directory to be in the same folder as your .do file. Or, copy/paste the .do file, save it, close Stata, open the .do file through Explorer, and then run! Opening the .do file from Windows Explorer with Stata closed sets the .do file’s parent folder to be Stata’s working directory.

What this .do file doesn’t

You won’t be able to merge files with multiple entries per users. For example, the Prescription Medications – Drug Information questionnaire has a row for each medication and ICD code its use for all participants. You’ll need to use the reshape wide command on those variables before merging. BTW, that code is: .bys seqn: gen j= _n [linebreak] .reshape wide rxddrug rxddrgid rxqseen rxddays, i(seqn) j(j)

This also won’t merge with the National Death Index files, which are hosted elsewhere. The NHANES-NDI linkage website provides an example Stata .do script that would be straightforward to include below.

Also, if you are trying to combine analyses from multiple NHANES cycles (say, combining 2009-2010 with 2011-2012), things get a bit more complicated. You’ll need to append .dta files and consider adjusting the weights.

Finally, if you are interested in using Frames functionality to “merge” a dataset (aka skip saving multiple dta files in the process of building an analytical dataset), then check out this post

// Link to all NHANES datasets: https://wwwn.cdc.gov/nchs/nhanes/default.aspx
// 1. Click on years of interest (e.g., 2009-2010).
// 2. Scroll down, click on data of interest (e.g., demographics).
// 3. Right click on the XPT data file and copy the URL.
// 4. Note that the DOC file containing an overview of the file is right there.
//    Take a peek at its contents and return with questions. 
//
**********************************************************************
*********************download the demographics!!**********************
**********************************************************************
// The demographics file contains the weights.
// You ALWAYS need the demographics file.
//
// Paste the URL for the demographics of interest below:
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear
// NOTE: if you are using Stata 15 or older, above should be "sasxport"
// and not "sasxport5". Ditto for all "sasxport" commands that follow. 
sort seqn // Sort the file in order of the unique identifiers. Not necessary,
//           but rarely having an unsorted file will cause analytic issues.
// Save as a Stata dataset
save "DEMO_F.dta", replace
//
**********************************************************************
*********************download other files*****************************
**********************************************************************
// Let's look at the "questionnaire" for "blood pressure & cholesterol"
// and "kidney conditions - urology"
//
// Lost? Go back to the link on the very first line of this .do file
// and click on the year of interest again (2009-2010), scroll down and
// click "questionnaire". 
//
// Paste the URL for the BP & cholesterol below:
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
// Sort and save as an Stata dataset
sort seqn 
save "BPQ_F.dta", replace
//
// We aren't going to use the kidney file in this analysis, but just an
// example of how to merge a second dataset, copy/paste URL for kidney conditions
import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
// Sort and save as an Stata dataset
sort seqn
save "KIQ_U_F.dta", replace
//
**********************************************************************
*********************merge your datsets*******************************
**********************************************************************
clear all // clear all memory
//
// start with the demographics file
// ALWAYS INCLUDE THE DEMOGRAPHICS FILE SINCE IT HAS WEIGHTS
use "DEMO_F.dta", clear
//
// Merge demographics with bp & chol
// The unique identifier in this dataset is "seqn", it may be different in other
// NHANES datasets.
merge 1:1 seqn using "BPQ_F.dta"
drop  _merge // this is a merge variable that needs to be dropped prior to 
//              merging other datasets. This variable will be helpful 
//              in exploring causes of merging issues. 
// Merge with kidney conditions datafile
merge 1:1 seqn using "KIQ_U_F.dta"
drop  _merge
//
// you can now save as a merged data file if you want. 
save nhanesBPall.dta, replace
//
**********************************************************************
*********************do an analysis!!*********************************
**********************************************************************
// YOU NEED TO APPLY WEIGHTING.
// Fortunately, Stata makes this easy with built in survey commands.
// You need to tell it that it's survey data with the "svyset" command then use 
// specific functions designed for weighted analyses.
// To explore available these functions: 
// 1. Check out the drop-down menu in stata, statistics --> survey data
// 2. Stata 13 documentation about this code: 
//            https://www.stata.com/manuals13/svysvyset.pdf
// 3. A helpful video from Stata: https://www.youtube.com/watch?v=lRTl8GKsZTE
//
// USING SVYSET:
// For the more recent continuous NHANES ones, here are the variables needed for
// weighting:
//   weight for interview data: wtint2yr 
//   weight for laboratory (MEC) data: wtmec2yr 
//   Sampling units (PSU): sdmvpsu
//   Strata: sdmvstra
//   Single unit: there isn't one. 
// NOTE: Older NHANES datasets use different variables
//
// We are using interview data (questionnaires), so our command is
svyset sdmvpsu [pweight = wtint2yr], strata(sdmvstra) vce(linearized) singleunit(missing)
//
// DOING ANALYSES:
// Syntax: "svy: COMMAND var" - for a list of commands, type "help svy_estimation"
//
// The 4 basic survey descriptive commands:
// 1. mean
// 2. proportion
// 3. ratio
// 4. total
//
// there are a bunch of other commands, including logistic regression, etc. 
//
// let's see what the mean age was, the variable is "ridageyr"
svy: mean ridageyr
// you see that the mean age of the US population (301 million people) in 2009-2010 was
// 36.7 years. COOL!
//
// NEXT:
// Let's see the amount of people who have been told they have high blood pressure
// VARIABLE: bpq020
// First, look at the documentation for this variable on the source XPT file's DOC page
// https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.htm
// note that this is coded as 1=yes, 2=no, 7=refused, 9=don't know, .=missing
//
// Only asked if age >=16, so not the entire 301 million US population will be here
//
// let's look at the responses given
svy: proportion bpq020
// RECODE:
// let's make a new variable with 0=no, 1=yes, .=all others
gen toldhtn0n1y =.
replace toldhtn0n1y=0 if bpq020==2 // no (you need a double equals sign after the if)
replace toldhtn0n1y=1 if bpq020==1 // yes
//                                    all others remain as =.
//
// let's see the population on BP meds
svy: proportion toldhtn0n1y
// SO: of the 234 million adults >16 years, 28% have been told that hey have htn
// A little program to spit out the results in english:
matrix htnansw = e(b) // e(b) is where the # of people is stored. Type "ereturn list" to see
//                       which matrices are available. Type "matrix list [name of matrix]" to 
//                       see content of each matrix. This "matrix htnansw = e(b)" command will
//                       save the temporary matrix for e(b) under the permanent matrix
//                       "htnansw" that we can then manipulate.
//
// If you typed "matrix list htnansw", you'd see that the proportion answering "yes"
// is saved in the second column of the first row of this matrix, or [1,2].
//
local yesbp = htnansw[1,2] // pluck out the value of the 1,2 cell in the saved matrix 
//                            (the yes proportion) as a macro to call later
//
// NOTE: You call macros by opening with the tick to the left of number 1 on your keyboard,
// writing the name of the macro, then closing with a traditional apostrophe.
// Read about macros here: https://www.ssc.wisc.edu/sscc/pubs/stata_prog1.htm
//
// save # of people who answered y/n (234 million)
matrix subpop = e(_N_subp) // pluck out the # of people in this population, aka the # of 
//                            americans >=16 years old, as a permanent matrix named "subpop"
//
local population = subpop[1,1] // make a macro plucking the 1,1 cell where the total
//                                # of americans are in this population
//
// how data can be presented:
di "Unrounded " "`yesbp'" // just to prove how you can present it. Note that yesbp is a macro.
di "Rounded " round(`yesbp'*100) // round at the decimal after mult by 100
di "Total population is " `population' // note exponent
di "Total population is " %18.0fc `population' // note no exponent but helpful commas
di "Among Americans >=16 years, " round(`yesbp'*100) "% have been told that they have high BP."
di "Among Americans >=16 years, " round(`yesbp'*`population') " Americans have been told they have high BP."
//
// NOTE: If you run this line-by-line, stata may drop the macros above.  
// Run the script from the very top if you are getting errors in these last few lines. 
//
// Fin.