Downloading and analyzing NHANES datasets with Stata in a single .do file

Learn to love NHANES

NHANES is a robust, nationally-representative cross-sectional study. For the past ~18 years it sampled different communities across the US in 2 year continuous cycles. A few of these years are linked to National Death Index data, so you can assess risk factors at the time of the survey and use time-to-event mortality data to identify novel risk factors for death. Manuscripts using NHANES data have been published across the spectrum of medical journals, all the way up to NEJM. The best part? You can download almost all of NHANES data from the CDC website right now for free.

Manipulating NHANES data is challenging for beginners because of the sheer quantity of individual files and requirement for weighting. Plus, all of the files are in SAS XPT format so you have to download, import, save, and merge before you can even think about starting an analysis. To make this data management task slightly more complex, the CDC sporadically publishes interval updates of the source data files on their website. Files may be updated for errors or removed entirely without you knowing about it. (I strongly recommend subscribing the the NHANES Listserv to get real-time updates.) If you have NHANES files saved locally from 2-3 years ago, there’s a reasonable chance that you are using outdated databases, which could yield some false conclusions. Re-downloading all of the many files every time you want to do a project is a big headache.

Leverage Stata’s internet connectivity to make NHANES analyses easy

I love that Stata will download datasets for you with just a URL. The .do file below shows you how easy it is to download just the needed files on the fly then do some simple analyses. This means that you don’t have to worry about maintaining your personal database of NHANES files. If the source files are updated by the CDC, no worry! Every time you run this .do file, it’ll grab the freshest data files available. If the source data files are removed for inaccuracies, the file won’t run and you’ll be prompted to investigate. For example, the 2011-2012 Folate lab results were withdrawn February 2018. If you tried to download the FOLATE_G file, CDC’s website will give their version of a 404 error and Stata will stop the .do file cold in its tracks.

What this .do file does

In this short script, you’ll see how to 1. Import the NHANES SAS XPT files directly from the CDC website with just the XPT file’s URL, 2. Save data as Stata .dta files, 3. Merge the .dta files, 4. Review basic coding issues, 5. Run an analysis using weighting, and 5. Display data.

In this example, we’ll look at the 2009-2010 NHANES results and apply weighting to estimate the amount of the US population who have been told that they have high blood pressure. Just copy/paste the code below and save into a .do file. Set the working directory to be in the same folder as your .do file. Or, copy/paste the .do file, save it, close Stata, open the .do file through Explorer, and then run! Opening the .do file from Windows Explorer with Stata closed sets the .do file’s parent folder to be Stata’s working directory.

What this .do file doesn’t

You won’t be able to merge files with multiple entries per users. For example, the Prescription Medications – Drug Information questionnaire has a row for each medication and ICD code its use for all participants. You’ll need to use the reshape wide command on those variables before merging. BTW, that code is: .bys seqn: gen j= _n [linebreak] .reshape wide rxddrug rxddrgid rxqseen rxddays, i(seqn) j(j)

This also won’t merge with the National Death Index files, which are hosted elsewhere. The NHANES-NDI linkage website provides an example Stata .do script that would be straightforward to include below.

Finally, if you are trying to combine analyses from multiple NHANES cycles (say, combinine 2009-2010 with 2011-2012), things get a bit more complicated. You’ll need to append .dta files and consider adjusting the weights.

// Link to all NHANES datasets: https://wwwn.cdc.gov/nchs/nhanes/default.aspx
// 1. Click on years of interest (e.g., 2009-2010).
// 2. Scroll down, click on data of interest (e.g., demographics).
// 3. Right click on the XPT data file and copy the URL.
// 4. Note that the DOC file containing an overview of the file is right there.
//    Take a peek at its contents and return with questions. 
//
**********************************************************************
*********************download the demographics!!**********************
**********************************************************************
// The demographics file contains the weights.
// You ALWAYS need the demographics file.
//
// Paste the URL for the demographics of interest below:
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear
sort seqn // Sort the file in order of the unique identifiers. Not necessary,
//           but rarely having an unsorted file will cause analytic issues.
// Save as a Stata dataset
save "DEMO_F.dta", replace
//
**********************************************************************
*********************download other files*****************************
**********************************************************************
// Let's look at the "questionnaire" for "blood pressure & cholesterol"
// and "kidney conditions - urology"
//
// Lost? Go back to the link on the very first line of this .do file
// and click on the year of interest again (2009-2010), scroll down and
// click "questionnaire". 
//
// Paste the URL for the BP & cholesterol below:
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
// Sort and save as an Stata dataset
sort seqn 
save "BPQ_F.dta", replace
//
// We aren't going to use the kidney file in this analysis, but just an
// example of how to merge a second dataset, copy/paste URL for kidney conditions
import sasxport "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
// Sort and save as an Stata dataset
sort seqn
save "KIQ_U_F.dta", replace
//
**********************************************************************
*********************merge your datsets*******************************
**********************************************************************
clear all // clear all memory
//
// start with the demographics file
// ALWAYS INCLUDE THE DEMOGRAPHICS FILE SINCE IT HAS WEIGHTS
use "DEMO_F.dta", clear
//
// Merge demographics with bp & chol
// The unique identifier in this dataset is "seqn", it may be different in other
// NHANES datasets.
merge 1:1 seqn using "BPQ_F.dta"
drop  _merge // this is a merge variable that needs to be dropped prior to 
//              merging other datasets. This variable will be helpful 
//              in exploring causes of merging issues. 
// Merge with kidney conditions datafile
merge 1:1 seqn using "KIQ_U_F.dta"
drop  _merge
//
// you can now save as a merged data file if you want. 
save nhanesBPall.dta, replace
//
**********************************************************************
*********************do an analysis!!*********************************
**********************************************************************
// YOU NEED TO APPLY WEIGHTING.
// Fortunately, Stata makes this easy with built in survey commands.
// You need to tell it that it's survey data with the "svyset" command then use 
// specific functions designed for weighted analyses.
// To explore available these functions: 
// 1. Check out the drop-down menu in stata, statistics --> survey data
// 2. Stata 13 documentation about this code: 
//            https://www.stata.com/manuals13/svysvyset.pdf
// 3. A helpful video from Stata: https://www.youtube.com/watch?v=lRTl8GKsZTE
//
// USING SVYSET:
// For the more recent continuous NHANES ones, here are the variables needed for
// weighting:
//   weight for interview data: wtint2yr 
//   weight for laboratory (MEC) data: wtmec2yr 
//   Sampling units (PSU): sdmvpsu
//   Strata: sdmvstra
//   Single unit: there isn't one. 
// NOTE: Older NHANES datasets use different variables
//
// We are using interview data (questionnaires), so our command is
svyset sdmvpsu [pweight = wtint2yr], strata(sdmvstra) vce(linearized) singleunit(missing)
//
// DOING ANALYSES:
// Syntax: "svy: COMMAND var" - for a list of commands, type "help svy_estimation"
//
// The 4 basic survey descriptive commands:
// 1. mean
// 2. proportion
// 3. ratio
// 4. total
//
// there are a bunch of other commands, including logistic regression, etc. 
//
// let's see what the mean age was, the variable is "ridageyr"
svy: mean ridageyr
// you see that the mean age of the US population (301 million people) in 2009-2010 was
// 36.7 years. COOL!
//
// NEXT:
// Let's see the amount of people who have been told they have high blood pressure
// VARIABLE: bpq020
// First, look at the documentation for this variable on the source XPT file's DOC page
// https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.htm
// note that this is coded as 1=yes, 2=no, 7=refused, 9=don't know, .=missing
//
// Only asked if age >=16, so not the entire 301 million US population will be here
//
// let's look at the responses given
svy: proportion bpq020
// RECODE:
// let's make a new variable with 0=no, 1=yes, .=all others
gen toldhtn0n1y =.
replace toldhtn0n1y=0 if bpq020==2 // no (you need a double equals sign after the if)
replace toldhtn0n1y=1 if bpq020==1 // yes
//                                    all others remain as =.
//
// let's see the population on BP meds
svy: proportion toldhtn0n1y
// SO: of the 234 million adults >16 years, 28% have been told that hey have htn
// A little program to spit out the results in english:
matrix htnansw = e(b) // e(b) is where the # of people is stored. Type "ereturn list" to see
//                       which matrices are available. Type "matrix list [name of matrix]" to 
//                       see content of each matrix. This "matrix htnansw = e(b)" command will
//                       save the temporary matrix for e(b) under the permanent matrix
//                       "htnansw" that we can then manipulate.
//
// If you typed "matrix list htnansw", you'd see that the proportion answering "yes"
// is saved in the second column of the first row of this matrix, or [1,2].
//
local yesbp = htnansw[1,2] // pluck out the value of the 1,2 cell in the saved matrix 
//                            (the yes proportion) as a macro to call later
//
// NOTE: You call macros by opening with the tick to the left of number 1 on your keyboard,
// writing the name of the macro, then closing with a traditional apostrophe.
// Read about macros here: https://www.ssc.wisc.edu/sscc/pubs/stata_prog1.htm
//
// save # of people who answered y/n (234 million)
matrix subpop = e(_N_subp) // pluck out the # of people in this population, aka the # of 
//                            americans >=16 years old, as a permanent matrix named "subpop"
//
local population = subpop[1,1] // make a macro plucking the 1,1 cell where the total
//                                # of americans are in this population
//
// how data can be presented:
di "Unrounded " "`yesbp'" // just to prove how you can present it. Note that yesbp is a macro.
di "Rounded " round(`yesbp'*100) // round at the decimal after mult by 100
di "Total population is " `population' // note exponent
di "Total population is " %18.0fc `population' // note no exponent but helpful commas
di "Among Americans >=16 years, " round(`yesbp'*100) "% have been told that they have high BP."
di "Among Americans >=16 years, " round(`yesbp'*`population') " Americans have been told they have high BP."
//
// NOTE: If you run this line-by-line, stata may drop the macros above.  
// Run the script from the very top if you are getting errors in these last few lines. 
//
// Fin.