Part 7: Making a table for your outcome of interest (Table 2?)

As we learned in part 5, Table 1 describes your analytical population at baseline by your exposure. For those using a continuous variable as an exposure, it’s by quantile (e.g., tertile, quartile) of the exposure. I propose a table known as “Table 2” that describes the outcome of interest by the exposure used in Table 1. You might have seen something along the lines of this in papers before, and we are going to call it “Table 2”. It’s not a universal table in observational epidemiology, so calling it “Table 2” is a bit much. But we’ll call it “Table 2” for our purposes.

Columns

The columns should be identical to that in your Table 1. (I suggest having an “All” column if you don’t have one in your Table 1 though.)

Rows

In Table 1, I suggested having an N and range for continuous variables of your quantiles. I suggest not including those in your Table 2 if they are already in your Table 1, since it’s duplicative. I suppose it might be helpful for error checking to have them in table 2, and confirming that they are identical to your Table 1. But, I suggest not including a row for Ns and ranges in your Table 2 that is included in the manuscript.

In a very simple Table 2, there might be a single row: the outcome in the analytical population. It might look like this:

AllTertile 1Tertile 2Tertile 3
All

BUT! There might be a stratification of interest in your table. in the REGARDS study, we often stratify by Black vs. White race or by sex. So, you might also include rows for each subsequent group, like this:

AllTertile 1Tertile 2Tertile 3
All
Black participants
White participants

Finally, for subgroups, you might opt to include a minimally-adjusted regression comparing your strata. in this example, we could use a modified Poisson regression (i.e., Poisson regression with sandwich or robust variance estimators, Zou’s method) to compare risk of the outcome overall an in each tertile. I’d just adjust for age and sex in this example. So:

AllTertile 1Tertile 2Tertile 3
All
Black participants
White participants
RR, Black vs. White (ref)

Cell

Here, I suggest presenting # with event/# at risk (percentage with event) in each cell, except in the RR row, which would present RR and the 95% confidence interval. Example (totally made up numbers here and some placeholder ##’s, as FYI):

AllTertile 1Tertile 2Tertile 3
All1000/3000 (33%)300/1000 (30%)400/1000 (40%)500/1000 (50%)
Black participants400/1000 (40%)##/## (##%)##/## (##%)##/## (##%)
White participants600/2000 (30%)##/## (##%)##/## (##%)##/## (##%)
RR, Black vs. White (ref)1.25 (1.20-1.30)1.20 (1.15-1.25)1.22 (1.18-1.30)1.21 (1.19-1.28)

That’s it! Even if you don’t include this table, it’s super handy to have to describe the outcome in the text.

Part 6: Visualizing your continuous exposure at baseline

Visualization of your continuous exposure in an observational epidemiology research project

As we saw in Part 5, it’s important to describe the characteristics of your baseline population by your exposure. This helps readers get a better understanding of internal validity. For folks completing analyses with binary exposures, part 6 isn’t for you. If your analysis includes continuous exposures or ordinal exposures with at least a handful of options, read on.

I think it’s vitally important to visualize your exposure before moving any further forward with your analyses. There are a few reasons that I do this:

  1. Understand the distribution of your exposure. Looking at the raw spread of your data will help you understand if it has a relatively normal distribution, if it’s skewed, if it is multimodal (eg, has several peaks), or if it’s just plain old weird looking. If your exposure is non-normally distributed, then you’ll need to consider the implications of the spread in your analysis. This may mean log-transforming, square root-transforming (if you have lots of zeros in your exposure’s values), or some other sort of transformation.
    • Note: make sure to visualize your transformed exposure value!
  2. Look for patterns that need exploring. You might notice a huge peak at a value of “999”. This might represent missing values, which will need to be recoded. You might notice that values towards the end of the tails of the bell curve might spike up at a particular value. This might represent values that were really above or below the lower limit of detection. You’ll need to think about how to handle such values, possibly recoding them following the NHANES approach as the limit of detection divided by the square root of 2.
  3. Understand the distribution of your exposure by key subgroups. In REGARDS, our analyses commonly focus on racial differences in CVD events. Because of this, I commonly visualize exposures with overlaid histograms for Black and White participants, and see how these exposure variables differ by race. This could easily be done for other sociodemgraphics (notably, by sex), anthropometrics, and disease states.

My approach to building histograms in Stata

A conventional histogram splits continuous data into several evenly-spaced discrete groups called “bins”, and visualizes these discrete groups as a bar graph. Let’s see how to do this in Stata, with some randomly generated data that approximates a normal distribution. While we’re at it, we’ll make a variable called “group” that’s 0 or 1 that we’ll use later. (Also note that this dataset doesn’t use discrete values, so I’m not specifying the discrete option in my “hist” code. If you see spaces in your histogram because you are using discrete values, add “, discrete” after your variable name in the histogram line.)

clear all
// set observations to be 1000:
set obs 1000
// set a random number seed for reproducibility: 
set seed 12345
// make a normally-distributed variable, with mean of 5 and SD of 10:
gen myvariable = rnormal(5,10)
// make a 0 or 1 variable for a group, following instructions for "generate ui":
// https://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/
gen group = floor((1-0+1)*runiform() + 0)
// now make a histogram
hist myvariable

Here’s the overall histogram:

On the X axis you see the ranges of the values of variable of interest, from around -30 to about +40. On the Y axis you see the density plot. I want to show this same figure by group, however, and the bins are not currently transparent. You won’t be able to tell one group from another. So, in Stata, you need to use the “twoway histogram” option instead of just “histogram” and specify transparent colors of the figure using the %NUMBER notation. We’ll also add a legend. We’ll set the scheme to s1mono to get rid of the ugly default blue surrounding box as well. Example:

// set the scheme to s1mono:
set scheme s1mono
// now make your histogram:
twoway ///
(hist myvariable if group==0, color(blue%30)) ///
(hist myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Here’s what you get:

You can modify things as needed. Something you might consider is changing the density to count or frequency, which is done by adding “frequency” or “percent” after the commas but before the colors. You might also opt to select different colors, which you can read about selection of colors in this editorial I wrote with Mary Cushman.

Considerations for designing histograms

One question is how many bins you want. I found this nice 2019 article by Regina L. Nuzzo, PhD (PDF link, PubMed listing) that goes over lots of considerations for the design of histograms. I specifically like the Box, which lists equations to determine number of bins and bid width. In general, if you have too many bins, your data will look choppy:

// using 100 bins here
twoway ///
(hist myvariable if group==0, color(blue%30) bin(100)) ///
(hist myvariable if group==1, color(red%30)  bin(100)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

And if you have too few, you won’t be able to make sense of the overall structure of the data.

// using 2 bins here
twoway ///
(hist myvariable if group==0, color(blue%30) bin(2)) ///
(hist myvariable if group==1, color(red%30)  bin(2)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Be thoughtful about how thinly you want to splice your data.

A quick note on kernel density plots

Kernel density plots have the same idea to histograms, except it shows a smoothed line over your data’s distribution instead of a histogram. I prefer to use histograms when looking at laboratory data, since kernel density plots can smooth over outliers described in point #2 above. The code for a kernel density plot in Stata is nearly identical to the “twoway hist” code above.

twoway ///
(kdensity myvariable if group==0, color(blue%30)) ///
(kdensity myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Output:

You can even combine histograms and kernel density plots!

twoway ///
(hist myvariable if group==0, color(blue%30)) ///
(hist myvariable if group==1, color(red%30)) ///
(kdensity myvariable if group==0, color(blue%30)) ///
(kdensity myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

I’ve never done this myself for a manuscript, but just showing that it’s possible.

Part 5: Baseline characteristics in a Table 1 for a prospective observational study

What’s the deal with Table 1?

Tables describing the baseline characteristics of your analytical sample are ubiquitous in observational epidemiology manuscripts. They are critical to help the reader understand the study population and potential limitations of your analysis. A table characterizing baseline characteristics is so important that it’s typically the first table that appears in any observational epidemiology (or clinical trial) manuscript, so it’s commonly referred to as a “Table 1“. Table 1s are critically important because they help the readers understand internal validity of your study. If your study has poor internal validity, then your results and findings aren’t useful.

The details here are specific to prospective observational studies (e.g., cohort studies), but are generalizable to other sorts of studies (e.g., RCTs, case-control studies).

If you are a Stata user, you might be interested into my primer of using Table1_mc to generate a Table 1.

Guts of a Table 1

There are several variations of the Table 1, here’s how I do it.

COLUMNS: This is your exposure of interest (i.e., dependent variable). This is not the outcome of interest. There’s a few way to divvy up these columns, depending on what sort of data you have:

  • Continuous exposure (e.g., baseline LDL-cholesterol level): Cut this up into quantiles. I commonly use tertiles (3 groups) or quartiles (4 groups). People have very, very strong opinions about whether you use tertiles or quartiles. I don’t see much of a fuss in using either. Of note, there usually is no need to transform your data prior to splitting into quantiles. (And, log transforming continuous data that includes values of zero will replace those zeros with missing data!)
  • Discrete exposure:
    • Dichotomous/binary exposure (e.g., prevalent diabetes status as no/0 or yes/1): This is easy, column headers should be 0 or 1. Make sure to use a descriptive column header like “No prevalent diabetes” and “Prevalent diabetes” instead of numbers 0 and 1.
    • Ordinal exposure, not too many groups (e.g., never smoker/0, former smoker/1, current smoker/2): This is also easy, column headers should be 0, 1, or 2. Make sure to use descriptive column headers.
    • Ordinal exposure, a bunch of groups (e.g., extended Likert scale ranging from super unsatisfied/1 to super satisfied/7): This is a bit tricker. On one hand, there isn’t any real limitation on how wide a table can be in a software package so you could have columns 1, 2, 3, 4, 5 ,6 and 7. This is a bit unwieldy for the reader, however. I personally think it’s better to collapse really wide groupings into a few groups. Here, you could collapse all of the negative responses (1, 2 and 3), leave the neutral response as its own category (4), and collapse all of the positive responses (5, 6, and 7). Also use descriptive column headers, but also be sure to describe how you collapsed groups in the footer of the table.
    • Nominal exposure, not too many groups (e.g., US Census regions of Northeast, Midwest, South, and West): This is easy, just use the groups. Be thoughtful about using a consistent order of these groups throughout your manuscript.
    • Nominal exposure, a bunch of groups (e.g., favorite movie): As with ‘Ordinal data, a bunch of groups’ above, I would collapse these into groups that relate to each other, such as genre of movie.
  • (Optional) Additional first column showing “Total” summary statistics. This presents summary statistics for the entire study population as a whole, instead of by quantile or discrete groupings. I don’t see much value in these and typically don’t include them.
  • (Optional) Additional first column showing count of missingness for each row. This presents a count of missing values for that entire row. I think these are nice to include, but they don’t show missingness by column so are an imperfect way to show missingness. See the section below on ‘cell contents’ for alternative strategies to show missingness.
    • Note: Table1_mc for Stata cannot generate a “missingness” row.
  • (Optional, but suggest to avoid) Following P-value column that shows comparisons across rows. These have fallen out of favor for clinical trial Table 1s. I see little value of them for prospective observational studies and also avoid them.

ROWS: These include the N for each column, the range of values for continuous exposures, and baseline values. Note that the data here are from baseline.

  • N for each group. Make sure that these Ns add up to the expected N in your analytical population at the bottom of your inclusion flow diagram. If it doesn’t match, you’ve done something wrong.
  • (For continuous exposures) Range of values for your quantiles and yes I mean minimum and maximum for each quantile, not IQRs.
  • Sociodemographics (age, sex, race, ± income, ± region, ± education level, etc.)
  • Anthropometrics (height, weight, waist circumference, BMI, etc.)
  • Medical problems as relevant to your study (eg, proportion with hypertension, diabetes, etc.)
  • Medical data as relevant to your study (eg, laboratory assays, details with radiological imaging, details from cardiology reports)
  • Suggest avoiding the outcome(s) of interest as additional rows. I think that presenting the outcomes in this table is inadequate. I prefer to have a separate table or figure dedicated to the outcome of interest that goes much more in-depth than a Table 1 does. Plus, the outcome isn’t ascertained at baseline in a prospective observational study, and describing the population at baseline is the general purpose of Table 1.
  • And for the love of Pete, please make sure that all covariates in your final model appear as rows. If you have a model that adjusts for Epworth Sleepiness Score, for example, make sure that fits in somewhere above.

The first column of your Table 1 will describe each row. The appearance of this row will vary based upon the type of data you have.

  • Overall style of row descriptions as it appears in the first column:
    • N row – I suggest simply using “N”, though some folks use N (upper case) to designate the entire population and n (lower case) to designate subpopulations, so perhaps you might opt to put “n”.
    • Continuous variables (including the row for range)– I suggest a descriptive name and the units. Eg, “Height, cm”
    • Discrete variables – I suggest a descriptive name alone. Some opt to put a hint to the contents of the cell here (eg, adding a percentage sign such as “Female sex, %“), but I think that is better included in the footer of the table. This will probably be determined by the specific journal you are submitting to.
      • Dichotomous/binary values – In this example, sex is dichotomous (male vs. female) since that’s how it has historically been collected in NIH studies. For dichotomous variables, you can include either (1) a row for ‘Male’ and a row for ‘Female’, or (2) simply a row for one of the two sexes (eg, just ‘Female’) since the other row will be the other sex.
      • Other discrete variables (eg, ordinal or nominal) – In this example, we will consider the nominal variable of Race. I suggest having a leading row that provides description of the following rows (eg, “Race group”) then add two spaces before each following race group so the nominal values for the race groups seem nested under the heading.
    • (Optional) Headings for groupings of rows – I like including bold/italicized headings for groupings of data to help keep things organized.

Here’s an example of how I think a blank table should appear:

Table 1 – Here is a descriptive title of your Table 1 followed by an asterix that leads to the footer. I suggest something like “Sociodemographics, anthropometrics, medical problems, and medical data ascertained baseline among [#] participants in [NAME OF STUDY] with [BRIEF INCLUSION CRITERIA] and without [BRIEF EXCLUSION CRITERIA] by [DESCRIPTION OF EXPOSURE LIKE ‘TERTILE OF CRP’ OR ‘PREVALENT DIABETES STATUS’]*”

(Optional)
Missing, N
(Optional)
Total
Exposure
Variable
Tertile 1
Exposure
Variable
Tertile 2
Exposure
Variable
Tertile 3
(Optional,
suggest
avoiding)
P-value
N
Range, ng/mL
Sociodemographics
Age, y
Female sex
Race group
Black
White
Asian
Anthropometrics
Height, cm
Weight, kg
BMI, kg/m²
Medical problems
[List out here]
Medical data
[List out here]

*Footer of your Table 1. I suggest describing the appearance of the cells, eg “Range is minimum and maximum of the exposure for each quantile. Presented as mean (SD) for normally distributed and median (IQR) for skewed continuous variables. Discrete data are presented as column percents.”

Cell contents

The cell contents varies by type of variable and your goal in this table:

  • Simplicity as goal:
    • Normally distributed continuous variables: Mean (SD)
    • Non-normally distributed continuous variables: Median (IQR)
    • Discrete variables: Present column percentages. Not row percentages. For example we’ll consider “income >$75k” by tertile of CRP. A column percentage would show the % of participants in that specific quantile have an income >$75k. A row percentage would show the percentage of participants with income >$75K who were in that specific tertile.
  • Clarity of completeness of data as goal (would not also include “missing” column if doing this)
    • Continuous variables: Present as mean (SD) or median (IQR) as outlined above based upon normality, but also include an ‘n’. Example for age: “75 (6), n=455”
      • Note: Table1_mc in Stata cannot report an ‘n’ with continuous variables.
    • Dichotomous variables: Present column percentage plus ‘n’. Example for female sex: “45%, n=244”.

A word on rounding: I think there is little value on including numbers after the decimal place. I suggest aggressively rounding at the decimal for most things. For example, for BMI, I suggest showing “27 (6)” and not “26.7 (7.2)”. For things that are obtained at the decimal place, I strongly recommend reporting at the decimal. For example, BP is always measured as a whole number, so reporting out a tenth place for BP isn’t of much value. For example, systolic BP is measured as 142, 112, and 138 — not 141.8, 111.8 and 138.4. For discrete variables, I always round the proportion/percentage at the decimal, but clarify very small proportions to be “<1%" if there are any in that group, but it would round to zero or "0%" if there are none in that group.

The one exception to my aggressive “round at the decimal place” strategy is variables that are commonly reported past the decimal place, such as many laboratory values. Serum creatinine is commonly reported to the hundredths place (e.g., “0.88”), so report the summary statistic for that value to the hundredths place, like 0.78 (0.30).

Descriptive labels of metrics assessing discrimination

Discrimination and calibration of models predicting risk

Risk prediction is common in medicine, eg the Framingham Risk Score and the Pooled Cohort Equation/10-year ASCVD risk prediction models. Machine learning models that also predict risk are growing in popularity.

Discrimination and calibration are discussed in this excellent JAMA article. In brief, discrimination relates to how a model divvies up low and high risk individuals. So, in a population of folks of various risk levels, high discriminating models will score higher risk folks higher than low risk folks. For example, a CVD model should say that a 60 year old with hypertension, hyperlipidemia, and diabetes has a higher risk of CVD than a 20 year old with none of those conditions. Calibration, on the other hand, describes how well a model predicts risk for an individual person.

Discrimination descriptive labels

I have to look this up every time, so I am putting this here. Here’s a widely-used labeling for discrimination, which in this manuscript, we called it the “Hosmer-Lemeshow criteria”. These authors applied this to ROC curves, but I think it’s also fair to apply to C-statistics.

  • 0.5: No discrimination
  • 0.5 to <0.7: Poor discrimination
  • 0.7 to <0.8: Acceptable discrimination
  • 0.8 to <0.9: Excellent discrimination
  • ≥0.9: Outstanding discrimination

The reference is here:

Hosmer JD, Lemeshow S, Sturdivant R. Applied Logistic Regression. Hoboken, NJ, USA: John Wiley & Sons Inc; 2013. 

Here’s the Google Books listing for this, in case you want to grab the metadata for this reference using Zotero or whatnot. You’ll see the above labels on page 177 — you can see this with the “Preview” view.

Mediation analysis in Stata using IORW (inverse odds ratio-weighted mediation)

Mediation is a commonly-used tool in epidemiology. Inverse odds ratio-weighted (IORW) mediation was described in 2013 by Eric J. Tchetgen Tchetgen in this publication. It’s a robust mediation technique that can be used in many sorts of analyses, including logistic regression, modified Poisson regression, etc. It is also considered valid if there is an observed exposure*mediator interaction on the outcome.

There have been a handful of publications that describe the implementation of IORW (and its cousin inverse odds weighting, or IOW) in Stata, including this 2015 AJE paper by Quynh C Nguyen and this 2019 BMJ open paper by Muhammad Zakir Hossin (see the supplements of each for actual code). I recently had to implement this in a REGARDS project using a binary mediation variable and wanted to post my code here to simplify things. Check out the Nguyen paper above if you need to modify the following code to run IOW instead of IOWR, or if you are using a continuous mediation variable, rather than a binary one.

A huge thank you to Charlie Nicoli MD, Leann Long PhD, and Boyi Guo (almost) PhD who helped clarify implementation pieces. Also, Charlie wrote about 90% of this code so it’s really his work. I mostly cleaned it up, and clarified the approach as integrated in the examples from Drs. Nguyen and Hossin from the papers above.

IORW using pweight data (see below for unweighted version)

The particular analysis I was running uses pweighting. This code won’t work in data that doesn’t use weighting. This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. (Although you'll probably 
// have to adjust the svyset commands in the 
// program below to work in your dataset, in all 
// actuality)
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
//
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 2.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Components of weighting to go into "svyset" command. 
// You might have 
global samplingweight mysamplingweightvar
global id id_num // ID for your participants
global strata mystratavar
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw_weighted
program iorw_weighted, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
*Step 1: svyset data since your dataset is weighted. If your dataset 
// does NOT require weighting for its analysis, do not use this program. 
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
//
*Step 2: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
svy linearized, subpop(${ifstatement}): logit ${exposure} ${mediator} ${covariates}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
//
*Step 3: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights.  Since our dataset uses sampling
// weights, we need to multiply the dataset's weights times the IORW for the 
// exposure group. This step is fundamentally different for non-weighted 
// datasets. 
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures. 
gen weight_iorw = ${samplingweight} if ${exposure}==0
replace weight_iorw = inverseoddsratio*${samplingweight} if ${exposure}==1
//
*Step 4: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. 
// (same as direct effect, but without the IORW)
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log) 
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 5: DIRECT EFFECTS; using IORW weights instead of the weighting that
// is used typically in your analysis. 
svyset ${id}, strata(${strata}) weight(weight_iorw) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 6: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 7: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 8: Now run the above bootstraping program
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw_weighted
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // this is the 95% confidence intervals using the "percentiles" (ie 2.5th and 97.5th percentiles) rather than the default normal distribution method in stata. The rational for using percentiles rather than normal distribution is found in the 4th paragraph of the "analyses" section here (by refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title}*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%" _col(50) %4.2f `beta4' "% (" %4.2f `low954' "%, " %4.2f `high954' "%)"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
}
// the end.

IORW for datasets that don’t use weighting (probably the one you are looking for)

Here is the code above, except without consideration of weighting in the overall dataset. (Obviously, IORW uses weighting itself.) This uses modified Poisson regression implemented as GLMs. These can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper on how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. 
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 2.
global outcome  mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw
program iorw, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
//
*Step 1: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
logit ${exposure} ${mediator} ${covariates} ${ifstatement}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
//
*Step 2: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights. 
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures. 
gen weight_iorw = 1 if ${exposure}==0
replace weight_iorw = inverseoddsratio if ${exposure}==1
//
*Step 3: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. (same as direct effect, but without the IORW)
glm ${outcome} ${exposure} ${covariates} ${ifstatement}, family(poisson) link(log) vce(robust)
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 4: DIRECT EFFECTS; using IORW weights
glm ${outcome} ${exposure} ${covariates} ${ifstatement} [pweight=weight_iorw], family(poisson) link(log) vce(robust)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 5: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 6: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 7: Now run the above bootstraping program
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // this is the 95% confidence intervals using the "percentiles" (ie 2.5th and 97.5th percentiles) rather than the default normal distribution method in stata. The rational for using percentiles rather than normal distribution is found in the 4th paragraph of the "analyses" section here (by refs 37 & 38): https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title}*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%"  _col(50) %4.2f `beta4' " (" %4.2f `low954' ", " %4.2f `high954' ")"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
}
// the end.

ZIP code and county data sets for use in epidemiological research

Everyone knows their (5-digit) ZIP and it can be linked to population-level data. ZIP Codes have limitations since they were designed for mail delivery and not for population details. You can easily get county data from these data as well.

In epidemiological studies (especially EMR and survey data), you’ll almost certainly have a ZIP code or county, and almost never a census tract. It’s easy to find census data sets, but finding the analogous ZIP code dataset is a bit tricker. Every time I try to do a project with ZIP codes, I kick myself for not keeping a list of ZIP code data sets. So, this page will keep a running list of ZIP code-linked datasets. It’ll be updated periodically. If there’s a useful resource that I have missed, please email me at timothy.plante@uvm.edu and I’ll add it.

A few technical notes:

  • US Postal Service (USPS) ZIP codes – It seems that some datasets use a variation of USPS’s active ZIP codes. These are constantly being updated by the US Postal Service. ZIP codes are either the ‘standard’ 5-digit ZIP code or ZIP+4 (e.g., 9 digit). You can narrow down a lot further with the ZIP+4 version, but often times you only have the 5-digit ZIP.
  • ZCTA stands for ZIP Code Tabulation Area and is the US Census’s take on representing the topology of ZIP codes. These are produced for the q10y census. There are different ZCTAs for the 2000 Census and 2010 Census (as of 12/2020). Details about the US Census’s approach to developing ZCTAs can be found here.
    • You can read about the differences between ZCTA and USPS ZIP codes here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1762013/
    • The 2010 ZCTA to county, subdivision, tract, congressional district, metropolitan and micropolitan statistical area, and New England City and Town Area can be found here.
    • Note, there is also a non-ZCTA 5-digit ZIP code standard used by the US Census are specific to the ZIP Code Business Patterns Survey. So, these relate to businesses, not people. Details are here.
  • USPS ZIP code to ZCTA crosswalk – this is provided by UDS at this website: https://udsmapper.org/zip-code-to-zcta-crosswalk/
  • The US Housing and Urban Development (HUD) also has its own ZIP linkage, which can be found here. You can read about the details of the HUD ZIP crosswalk here. This is not the same as ZCTA. The nice thing about the HUD ZIP crosswalk is that it’s updated quarterly, it links to 2000 or 2010 US Census county or tract GEOID via FIPS code, and the OMB’s core-based statistical area (CBSA; basically definitions of urban groups), and congressional districts. It also provides some details about residential vs. business vs. other addresses in that zip code.

Linking ZIP to county

There’s a dataset on the HUD website here, on the “select crosswalk type” dropdown, select ZIP-County. From my read, this is ZCTA ZIP code for more recent datasets, but that isn’t explicitly stated.

Cartography

Here are some resources if you want to make a map.

US Census (ZCTA)

Here’s some mapping files provided by the US Census.

Here’s a great Stata-specific page with both the ZCTA and US Postal Service ZIP files. I recommend the ZCTA if you will be using US census data.

HUD-ZIP linkage

Details are here.

USPS ZIP code

Here’s a commonly-used dataset from Esri’s ArcGIS.

Here’s the USPS ZIP code for Stata.

Distance between zip codes

Can be found here from the NBER here. They also have distance by

Commercial datasets

You might want to take a look at datasets available, some for free, some for a fee, on the AWS marketplace here (this is a link to the search term “Zip”). There are at least a few companies that sell datasets with things like demographics for Zip+4 (eg V12 on AWS), which you might be interested in. While you’re at it, you might also opt to look at Census files on this marketplace here.

Demographics

US Census (ZCTA)

The US Census used to distribute their summary files via FTP for their 2000 census and 2010 census. [Note: those are links to the Summary File 1, which doesn’t include rurality. Those are in Summary File 2.] These 39 and 47 files that must be merged by some convoluted process that I’m not going to try to figure out. Fortunately, the National Bureau of Economic Research (NBER) generated Summary File 1-ZCTA linked files for Stata, SAS, and CSV files that can be downloaded here:

As an example of how to use the NBER files, let’s look at the 2010 files. Files are indexed in this Census Summary File 1 (SF1) document. Search for “File 03” in that PDF to find the details for File 03 on page 184. Note that “P0030002” through “P0030008” are variables for race in the entire population. File 04 then has race and ethnicity among adults (male sex is “P0120002”, female sex is “P0120026”). File 07 has sex by race/ethnicity and age, and so on. You’ll want to save the specific variables from each of these files and generate your own dataset, depending on what you are attempting to do.

But what about rurality? That’s in the Summary File 2 (SF2) document. The US Census data used to be on a website called American Fact Finder, which was simple to use and wasn’t annoying. More recently it was moved to data.census.gov, which is a spiffy looking website that is in all actuality, quite terrible and I want it to go away. I can’t figure out how to download what I want. I tried to make a walkthrough of how to download urban/rurality by ZCTA but it gave me a blank table. Fortunately, I had downloaded it from American Fact Finder before it went offline. You can download the version that I saved here.

An alternative to data.census.gov in the wake of the loss of American Fact Finder is the NHGIS website. You can also try IPUMS, which also has data related to global health, GIS, time use, health surveys (NHIS and MEPS), and higher education information. Finally, there are a whole host of free Census files on AWS, I haven’t tried these but you might want to take a look here.

Social Determinants of Health

Take a look at this handy review of SDoH definitions by Lori Dean and Roland Thorpe at JHU. Before I go any further, I recommend looking at the PhenX toolkit for their resources on SDoH. There are a few conceptual frameworks for SDOH. Here’s Healthy People 2030. Here’s the WHO one. I like the KFF’s Figure 1 here, which defines the following factors (note there’s plenty of overlap between Healthy People 2030 and KFF). Here’s a working list of resources that I’ll keep adding to, organized by KFF’s figure 1 overview:

As I expand these, I will do my best to cover as many of these as possible, as how they apply to ZIP code and county.

Social Deprivation Index or SDI (ZCTA)

Derived from the American Community Survey 5-year estimates. Details include overall SDI score, income, education, employment, housing (% living in crowded rentals), household characteristics (% of single parent households with dependents who are minors), transportation (% car non-ownership), demographics (% black population, % high needs population). Details and download files can be found here.

Here’s the original description, prior to the use of ZCTA. This manuscript only discusses the Primary Care Service Areas (PCSAs), from the Dartmouth Atlas: https://pubmed.ncbi.nlm.nih.gov/22816561/

Area Deprivation Index or ADI (ZIP+4)

More to come. Download site is here. You need to make a free account to access the data. You have to download each state individually, as an FYI.

HUD datasets on housing, income, etc. (can use the HUD-ZIP crosswalk)

Here is the website: https://hudgis-hud.opendata.arcgis.com/

I haven’t explored these data files much, but some details are below. The only file that natively includes the ZCTA is the Difficult Development Areas, under Community Development below.

  • Agency administration – How the HUD is divided. Yawn.
  • Community development – Community development block grant, LIHTC Qualified Census Tracts (aka low income), Difficulty Development Areas for Low Income Housing Tax Credit (LIHTC; high cost of living relative to Area Median Gross Income; interestingly using the ZCTA for metropolitan areas), Neighborhood Stabilization Program (purchase of abandoned buildings), Empowerment Zone/Enterprise Community/Renewal Community (economic growth tax incentives), Revitalization Areas.
  • Community indicators – Details by American Community Survey, self-reported perceived rural/urban status (see Rurality section below), low-to-moderate income population by tract from the American Community Survey, Location Affordability Index from the American Community Survey, extreme temperatures by 1 degree latitude and longitude.
  • Fair housing – More to come.
  • Housing counseling – More to come.
  • Initiatives and demonstrations – More to come.
  • Location affordability – More to come.
  • Mortgage insurance programs – More to come.
  • Rental assistance programs – More to come.
  • Disaster recovery – More to come.

Rurality

RUCA codes (Unclear ZIP type)

There was a bug in the 2010 US Census-derived RUCA-ZIP and the linkage was updated in 2020, and can be found here. I’m trying to figure out whether RUCA is most similar to ZCTA or USPS ZIP Codes. I’ll come back and update what I find out. Update: I didn’t get a response to my inquiry. Since this is linked to the Census data, so possibly ZCTA.

American Housing Survey (AHS) from HUD

Urbanization Perceptions Small Area Index. This was self-reported neighborhood as urban, suburban, or rural. Link is here.

US Census (ZCTA)

The US Census details their rurality take on rurality here. The actual rurality details for the 2010 census are in “Summary File 2”, details of which can be found here. As documented above, data.census.gov is a barrier to downloading census data. Fortunately, I grabbed rurality by ZCTA from American Fact Finder before it was shut down. You can grab my file here.

NCHS Urban-Rural Classification (Counties)

This is a very popular classification methodology people frequently use this scheme so I’m including it here. Details are here.

Health data

County health rankings (county)

Much county-level data can be obtained from the excellent County Health Rankings website from UWI, sponsored by RWJF. These include “ranked” and “unranked” data, the sources of these datapoints are listed in the Excel files that you can download on the website (eg, Vermont’s is here). Ranked includes premature death (deaths <75y), poor fair health, poor physical health, poor mental health, low birthweight, adult smoking adult obesity, food environment index, physical inactivity, access to exercise opportunities, excessive drinking alcohol-impaired driving deaths, STIs, teen births, % uninsured <65, ratio of population to PCPs, ratio of population to dentists, preventable health stays, mammography screening, flu vaccinations, level of education, unemployment, % children in poverty, income inequality, children in single-parent households, social associations, violent crime, injury deaths, air pollution by particulate matter, drinking water violations, households with overcrowding/high housing costs/lack of kitchen facilities/lack of plumbing facilities, % that drive to work alone, long commutes. Unranked includes life expectancy, premature age-adjusted mortality, child mortality, infant mortality, quality of life metrics (frequent physical distress, frequent mental distress, diabetes and HIV prevalence), food insecurity, limited access to healthy foods, drug overdose deaths, motor vehicle crash deaths, insufficient sleep, uninsured adults, uninsured children, ratio of population to primary care providers, disconnected youth (% of 16-19 yo not in school or working), reading scores, math scores, median income, % children eligible for free or reduced price lunch, residential segregation, homicides, suicidies, ,firearm fatalities, juvenile arrests, traffic volume, home ownership, severe housing cost burden, and specific census details.

I can’t find a “download all” option, but the datasets use a preserved naming structure in the download directory, so if you copy the link for one state, you can replace it with the name for another state (replacing spaces with percent sign 20 if spaces if needed) to get that download. It’d be easy to build a loop in Stata to automate the download for all of these datasets.

Other

CBSA – Core-based statistical area

The White House’s OMB defines the CBSA, which is broadly metropolitan areas. So NYC has NYC itself as well as the suburban areas of NYC (NJ, Westchester, etc.) HUD provides USPS ZIP crosswalk here.

Writing your first epidemiology scientific manuscript? Here’s a generic MS Word document to get you started.

Your first manuscript will be be very hard

The first manuscript you’ll ever write is probably best described as a ‘slog’. It’ll take 2-3 times longer than you expect. This’ll be from a few different reasons:

  • Unfamiliarity with typical structure
  • Lack of a structured approach to writing a first draft
  • Developing the analysis too late in the drafting of the manuscript (i.e., not as a first step in drafting)
  • Deciding the tables and figures to include too late in the manuscript (i.e., after completing the analysis)
  • Not knowing how to use MS Word’s advanced features that can help optimize drafting

Here’s a resource that can help

I developed this generic research manuscript over several years of slogging through first drafts of epidemiologic manuscripts. It attempts to address the common problems and includes recommendations for the first drafts.

Here’s what it contains:

  • Page 1 – Helpful hints
  • Page 2 – Suggested steps to bring this to publication
  • Page 3 – ‘fill in the blanks’ cover letter
  • Page 4 – ‘fill in the blanks’ title page
  • Pages 5+ – ‘fill in the blanks’ for the rest of the manuscript

Download:

Click here to download (updated Aug 16, 2021).

I hope it helps!

Use Stata to download the NY Times COVID-19 database and render a Twitter-compatible US mortality figure

Note: this code probably doesn’t work anymore with changes in the NY Times database. I’m keeping it here for historical purposes (4/22/2022).

Here’s the figure!

Code follows

Comments are in-line below. Some unique strategies in this code:

  • This will automatically download the latest NY Times dataset, but the date of “last day of follow-up” needs to be specifically defined. I find that the label locations need to be tweaked every day, and this process isn’t simple to automate.
  • The colors are defined by global macros once and are applied multiple times by calling those macros.
  • Text blocks are rendered next to the last day of follow-up with a translucent white background and non-translucent colored border that matches the dotted line.
  • Twitter figures should be output at 1100 x 628, per this blog. This script does that. Twitter clips images that aren’t this size.
****************************************************
// step 1: download  and save NY times database
****************************************************
version 15.1 // my version of Stata when this was written

import delimited using ///
"https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv", ///
varn(1) clear

// Now make the date a stata date. Load this handy date-fixing program 
// I wrote. The syntax is 'fixdate [variable name] [mdy, ymd, etc]
do https://www.uvm.edu/~tbplante/fixdate_v1_0.do
fixdate date ymd
// rename state to state_fullname
rename state state_fullname
****************************************************
// step 2: keep 50 states+DC, apply abbreviations
****************************************************
gen state=" " 
replace state="AL" if state_fullname=="Alabama"
replace state="AK" if state_fullname=="Alaska"
replace state="AZ" if state_fullname=="Arizona"
replace state="AR" if state_fullname=="Arkansas"
replace state="CA" if state_fullname=="California"
replace state="CO" if state_fullname=="Colorado"
replace state="CT" if state_fullname=="Connecticut"
replace state="DE" if state_fullname=="Delaware"
replace state="FL" if state_fullname=="Florida"
replace state="GA" if state_fullname=="Georgia"
replace state="HI" if state_fullname=="Hawaii"
replace state="ID" if state_fullname=="Idaho"
replace state="IL" if state_fullname=="Illinois"
replace state="IN" if state_fullname=="Indiana"
replace state="IA" if state_fullname=="Iowa"
replace state="KS" if state_fullname=="Kansas"
replace state="KY" if state_fullname=="Kentucky"
replace state="LA" if state_fullname=="Louisiana"
replace state="ME" if state_fullname=="Maine"
replace state="MD" if state_fullname=="Maryland"
replace state="MA" if state_fullname=="Massachusetts"
replace state="MI" if state_fullname=="Michigan"
replace state="MN" if state_fullname=="Minnesota"
replace state="MS" if state_fullname=="Mississippi"
replace state="MO" if state_fullname=="Missouri"
replace state="MT" if state_fullname=="Montana"
replace state="NE" if state_fullname=="Nebraska"
replace state="NV" if state_fullname=="Nevada"
replace state="NH" if state_fullname=="New Hampshire"
replace state="NJ" if state_fullname=="New Jersey"
replace state="NM" if state_fullname=="New Mexico"
replace state="NY" if state_fullname=="New York"
replace state="NC" if state_fullname=="North Carolina"
replace state="ND" if state_fullname=="North Dakota"
replace state="OH" if state_fullname=="Ohio"
replace state="OK" if state_fullname=="Oklahoma"
replace state="OR" if state_fullname=="Oregon"
replace state="PA" if state_fullname=="Pennsylvania"
replace state="RI" if state_fullname=="Rhode Island"
replace state="SC" if state_fullname=="South Carolina"
replace state="SD" if state_fullname=="South Dakota"
replace state="TN" if state_fullname=="Tennessee"
replace state="TX" if state_fullname=="Texas"
replace state="UT" if state_fullname=="Utah"
replace state="VT" if state_fullname=="Vermont"
replace state="VA" if state_fullname=="Virginia"
replace state="WA" if state_fullname=="Washington"
replace state="WV" if state_fullname=="West Virginia"
replace state="WI" if state_fullname=="Wisconsin"
replace state="WY" if state_fullname=="Wyoming"

replace state="DC" if state_fullname=="District of Columbia"

drop if state==" " // drop guam, VI, PR. would be reasonable to add them back
// would need to get their populations for the list below. 
****************************************************
// step 3: apply population by state
****************************************************
// ref: 
// https://www.census.gov/data/tables/time-series/demo/popest/2010s-state-total.html
// http://www2.census.gov/programs-surveys/popest/datasets/2010-2019/national/totals/nst-est2019-alldata.csv?#
gen statepop=.
replace statepop=4903185 if state=="AL"
replace statepop=731545 if state=="AK"
replace statepop=7278717 if state=="AZ"
replace statepop=3017804 if state=="AR"
replace statepop=39512223 if state=="CA"
replace statepop=5758736 if state=="CO"
replace statepop=3565287 if state=="CT"
replace statepop=973764 if state=="DE"
replace statepop=705749 if state=="DC"
replace statepop=21477737 if state=="FL"
replace statepop=10617423 if state=="GA"
replace statepop=1415872 if state=="HI"
replace statepop=1787065 if state=="ID"
replace statepop=12671821 if state=="IL"
replace statepop=6732219 if state=="IN"
replace statepop=3155070 if state=="IA"
replace statepop=2913314 if state=="KS"
replace statepop=4467673 if state=="KY"
replace statepop=4648794 if state=="LA"
replace statepop=1344212 if state=="ME"
replace statepop=6045680 if state=="MD"
replace statepop=6892503 if state=="MA"
replace statepop=9986857 if state=="MI"
replace statepop=5639632 if state=="MN"
replace statepop=2976149 if state=="MS"
replace statepop=6137428 if state=="MO"
replace statepop=1068778 if state=="MT"
replace statepop=1934408 if state=="NE"
replace statepop=3080156 if state=="NV"
replace statepop=1359711 if state=="NH"
replace statepop=8882190 if state=="NJ"
replace statepop=2096829 if state=="NM"
replace statepop=19453561 if state=="NY"
replace statepop=10488084 if state=="NC"
replace statepop=762062 if state=="ND"
replace statepop=11689100 if state=="OH"
replace statepop=3956971 if state=="OK"
replace statepop=4217737 if state=="OR"
replace statepop=12801989 if state=="PA"
replace statepop=1059361 if state=="RI"
replace statepop=5148714 if state=="SC"
replace statepop=884659 if state=="SD"
replace statepop=6829174 if state=="TN"
replace statepop=28995881 if state=="TX"
replace statepop=3205958 if state=="UT"
replace statepop=623989 if state=="VT"
replace statepop=8535519 if state=="VA"
replace statepop=7614893 if state=="WA"
replace statepop=1792147 if state=="WV"
replace statepop=5822434 if state=="WI"
replace statepop=578759 if state=="WY"

****************************************************
// step 4: make daily death count per capita
****************************************************
// now make variables for cases and deaths per capita in each state (per million persons)
gen statepopave_deaths = (deaths/statepop) *1000000

****************************************************
// step 5: make a variable for when the death rate is 
// >=1/1,000,000 people in each state, and count days
// following that
****************************************************
sort state date
gen days_1_death=.
replace days_1_death=0 if statepopave_deaths < 1 
replace days_1_death=1 if (statepopave_deaths >= 1 & statepopave_deaths[_n-1] <1 ) ///
& (state==state[_n-1])
//
replace days_1_death = days_1_death[_n-1]+1 if state==state[_n-1] ///
& days_1_death[_n-1]!=0
****************************************************
// step 6: save database
****************************************************

save nytimes_state_fu.dta, replace

****************************************************
// step 7: specify last day of follow-up and 
// get rank of states and location
// to put state names in x,y location for the 
// last day of follow-up
****************************************************
// reload
use nytimes_state_fu.dta, clear

// ****THIS NEEDS TO BE EDITED EVERY DAY.****
// Set the final date of follow-up. 
// as of today (3/29/2020), 3/27/2020 is the most
// recent day of data in the NY times database.
// 
// This is intentionally not automated because I want to manually adjust
// labels and range each time. 
global month Mar // needs to be in 3 letter abbreviation for month
global date 27 // 2 number day in month

// drop any day beyond the specified date
drop if date>date("${date}${month}2020", "DMY")

// this global will make the x axis 1 day longer than the current follow-up
sum days_1_death
global maxdate = r(max)+1

// actually determine the order of states on the last day of follow-up,
// which is how the labels and colors are applied.
// need to drop all but the last date of follow-up
keep if date==date("${date}${month}2020", "DMY")
gsort -statepopave_deaths // sort in reverse order
gen n=_n // make variable that contains order based upon sort
drop if n >10 // drop those not in the top 10

// need to figure out where to put the labels of state names
// this loop plucks out the state name and x&y coordinates for the last
// day of follow-up. 
// it also prints the order of the states. 
foreach x in 1 2 3 4 5 6 7 8 9 10 {
global statename`x'=state[`x'] // pull state name
global datecount`x' = days_1_death[`x'] + 0.2 // x axis, need to offset by 0.2 
//                                      so the label isn't on top of the dot
global statedeath`x' = statepopave_deaths[`x'] // yaxis

di "State rank #`x': ${statename`x'}"
di "(x axis) # of days: ${datecount`x'}" 
di "(y axis) deaths/million: ${statedeath`x'}"
di " "
}
//
// The labels might overlap each other. This you can manually readjust the 
// location on the y axis following here. This won't alter data in the 
// figure, just the location of the labels. 

global statedeath1 = ${statedeath1} // don't need to move label
global statedeath2 = ${statedeath2} // don't need to move label 
global statedeath3 = ${statedeath3}  // don't need to move label
global statedeath4 = ${statedeath4}  // don't need to move label
global statedeath5 = ${statedeath5}  // don't need to move label
global statedeath6 = ${statedeath6}  // don't need to move label
global statedeath7 = ${statedeath7}  // don't need to move label
global statedeath8 = ${statedeath8}+1.5 // move GA up on y axis
global statedeath9 = ${statedeath8}-2 // move DC down on y axis
global statedeath10 = ${statedeath10} // don't need to move label
****************************************************
// step 8: specify colors, make figure, save figure
// in size compatible with twitter
****************************************************

// reload the full dataset
use nytimes_state_fu.dta, replace
// drop any day beyond the specified date. 
drop if date>date("${date}${month}2020", "DMY")

// I like the s1mono scheme. Default stata theme is ugly. 
set scheme s1mono
// colors for these states, taken from colorbrewer website
// ref: https://colorbrewer2.org/#type=diverging&scheme=RdYlBu&n=10
// these are RGB triads
global color1 165 0 38
global color2 215 48 39
global color3 244 109 67
global color4 253 174 97
global color5 254 224 144
global color6 224 243 248
global color7 171 217 233
global color8 116 173 209
global color9 69 117 180
global color10 49 54 149

// the actual graphic!
// note: you need to put 'sort' after the 'twoway scatter' command so the line doesn't loop back around. 
twoway ///
(scatter statepopave_deaths days_1_death if state=="${statename1}" & days_1_death>=1 & date>=1, ///
mcolor("${color1}") msymbol(O) lpattern(solid) lcolor("${color1}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename2}" & days_1_death>=1 & date>=1, ///
mcolor("${color2}") msymbol(O) lpattern(solid) lcolor("${color2}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename3}" & days_1_death>=1 & date>=1, ///
mcolor("${color3}") msymbol(O) lpattern(solid) lcolor("${color3}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename4}" & days_1_death>=1 & date>=1, ///
mcolor("${color4}") msymbol(O) lpattern(solid) lcolor("${color4}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename5}" & days_1_death>=1 & date>=1, ///
mcolor("${color5}") msymbol(O) lpattern(solid) lcolor("${color5}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename6}" & days_1_death>=1 & date>=1, ///
mcolor("${color6}") msymbol(O) lpattern(solid) lcolor("${color6}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename7}" & days_1_death>=1 & date>=1, ///
mcolor("${color7}") msymbol(O) lpattern(solid) lcolor("${color7}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename8}" & days_1_death>=1 & date>=1, ///
mcolor("${color8}") msymbol(O) lpattern(solid) lcolor("${color8}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename9}" & days_1_death>=1 & date>=1, ///
mcolor("${color9}") msymbol(O) lpattern(solid) lcolor("${color9}") connect(L) sort) ///
(scatter statepopave_deaths days_1_death if state=="${statename10}" & days_1_death>=1 & date>=1, ///
mcolor("${color10}") msymbol(O) lpattern(solid) lcolor("${color10}") connect(L) sort) ///
, ///
yline(30, lcolor(gs14)) ///will need at add additional horizontal lines as figure grows
yline(20, lcolor(gs14)) ///
yline(10, lcolor(gs14)) ///
title("COVID-19 Cumulative Mortality by US State") ///
t1title("Top 10 states, through $month $date, 2020") ///
xla(1(2)$maxdate) ///
yla(0(5)40) ///
yti("# COVID19 Deaths/Million Persons") ///
xti("Day Since ≥1 Death/Million Persons") ///
legend(off) ///
/// the following will render each label with a surrounding box that's the same color as the line. 
text(${statedeath1} ${datecount1} "${statename1}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color1}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath2} ${datecount2} "${statename2}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color2}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath3} ${datecount3} "${statename3}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color3}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath4} ${datecount4} "${statename4}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color4}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath5} ${datecount5} "${statename5}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color5}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath6} ${datecount6} "${statename6}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color6}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath7} ${datecount7} "${statename7}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color7}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath8} ${datecount8} "${statename8}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color8}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath9} ${datecount9} "${statename9}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color9}%100") lstyle(solid) lwidth(thin)) ///
text(${statedeath10} ${datecount10} "${statename10}", ///
size(small) place(e) just(left) box bcolor(white%40) lcolor("${color10}%100") lstyle(solid) lwidth(thin)) ///
caption("Using NY Times COVID19 database"  ///
"https://github.com/nytimes/covid-19-data/blob/master/us-states.csv",  ///
size(small)) ///
xsize(15.3) ysize(9.0) 
// twitter default width & height is 1100x628 pixels. 
//This last line sets the corresponding height and width in inches using 72 dpi. 

graph export "COVID_mortality_2020_${month}_${date}_continuous.png", replace width(1100) 
// width(1100) sets the output to be default width on twitter, or 1100 dpi. 

The confusion nomenclature of epidemiology and biostatistics

This should be more simple.

Epidemiology and biostatistics are awash with synonyms and each institution has its own preferred nomenclature to describe the same general concepts. I started this page as a central place to document the various terms by concept. I’ll plan on revisiting and updating over time.

Regression

Fundamentals

You probably learned the fundamentals of regression in introductory algebra but may not realize it.  Remember drawing a graph from a slope-intercept equation? Draw a graph where Y is equal to 1/4x plus 5. (Here is the relevant Khan Academy Algebra I video about this.) You take the general equation:

Y = mx + b

…where Y is the y-axis, m is the slope of the line, and b is where the line crosses the y-axis. The equation you will write is:

Y=1/4x + 5

…and you will draw:

This sounding familiar? When you do a linear regression, you do the same thing. Instead, you regress Y on X, or:

Y = β1x1 + β0

And fitting in the variables here, you want to figure out what a predicted cholesterol level will be for folks by a given age. You would regress cholesterol level on age:

Cholesterol level = β1*Age + β0

Here, x1 is the slope of the line for age and β0 is the intercept on the Y-axis, essentially the same as the b in Y=mx+b. When you run a regression in Stata, you type

regress y x

or here,

regress cholesterol age

Let’s say that Stata spits out something like:

      Source |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx  
-------------+------------------------------    
       Model |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx   
    Residual |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx    
-------------+------------------------------    
       Total |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx   

------------------------------------------------------------------------------
  cholesterol|      coeff       se         t     P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |     0.500     xxxxxxxx    xxxxx   0.000     0.4000      0.60000
       _cons |     100       xxxxxxxx    xxxxx   0.000     90.000      110.000
------------------------------------------------------------------------------

The βcoefficient for age is 0.5. The intercept, or β0 is 100. You would interpret this as cholesterol level = 0.5*age in years + 100. You could plot this using your Algebra 1 skills.

Cholesterol = 0.5*age + 100

Or you can substitute in actual numbers. What is the predicted cholesterol at age 50? Answer: 125.

If you want to make it more complex and add more variables to explain cholesterol level, it’s no longer a straight line on a graph, but the concept is the same. A multiple linear regression adds more X variables. You can figure out what a predicted cholesterol level will be for folks by age, sex, and BMI. You would regress cholesterol level on age, sex, and BMI. (You would code sex as 0 or 1, like female = 1 and male = 0.)

Y =  β1x1 + β2x2 + β3x3 + β0

Or,

Y =  β1*Age + β2*Sex + β3*BMI + β0

You get the idea.

Names of Y and X

This is what irks me. There are so many synonyms for Y and X variables. Here is a chart that I’ll update over time with synonyms seen in the wild.

 Y=x
Dependent Independent
Outcome Predictor
  Covariate
  Factor
  Exposure variable
  Explanatory variable