Making a scatterplot with R squared and percent coefficient of variation in Stata

The code below outputs this figure. I grabbed the %CV-from-a-regression code from Mehmet Mehmetoglu’s CV program. (Type –ssc install cv– to grab that one; it’s not needed for the code below.) By the way, you might also be interested in how to make a related Bland-Altman plot here.

Here’s the code:

// note this uses local macros so you need to run this entire thing
// top to bottom in a do file, not line by line. 
//
// Step 1: input some fake data:
clear all

input id y x
1 2 5
2 3 5
3 6 5
4 7 7
5 3 2
6 10 12
7 1 2
8 11 12
9 4 6
10 5 5
end

// Step 2a: Grab the R2 from a regression
regress y x
local rsq = e(r2)
// Step 2b, from the same regression, calculate the percent CV
local rmse = e(rmse)
local depvar = e(depvar)
sum `depvar' if e(sample), meanonly
local depvarmean = r(mean)
local cv = `rmse'/`depvarmean'*100
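// optional sanity check (my addition): display the stored locals
// before graphing, to confirm they hold what you expect
display "R-squared = " %4.3f `rsq'
display "%CV = " %4.1f `cv'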

// Step 3: graph!
set scheme s1mono // I like this scheme

twoway ///
/// Line of unity, you'll need to tweak the range for your data:
(function y=x , range(0 12) lcolor(red) lpattern(solid) lwidth(medium)) ///
/// line of fit:
(lfit y x,  lcolor(red) lpattern(dash) lwidth(medium)) ///
/// Scatter plot dots:
(scatter y x, msymbol(O) msize(medium) mcolor(black)) ///
, ///
aspect(1) /// force output to be square
/// here's the text box with R2 and CV,
/// you'll need to tweak the first two numbers so the y,x
/// coordinates drop the text where you want it
text(12 0 "{bf:R²=`:di %3.1f `rsq''}" "{bf:CV=`:di %4.1f `cv''%}", placement(se) justification(left)) ///
legend(off) /// Shut off the legend
yla(0(3)12, angle(0)) /// you'll need to tweak range for your data
xla(0(3)12) /// you'll need to tweak range for your data
ytitle("Y!") ///
xtitle("X!") ///
title("Title!") 

Generate random data, make scatterplot with fitted line, and merge multiple figures in Stata

I recently made these two figures with a Stata do file that (A) generates some fake data, (B) plots the data overall with a fitted line, (C) plots each participant individually with a fitted line, and (D) merges these four individual plots into one overall plot. One tricky piece was getting the –graph combine– command to render the four figures as squares; I had to fiddle with the –ysize– option. I had originally tried using the –aspect– option, but it builds in lots of white space between the figures. This way seemed to work.

Here’s the Stata code to make the above figures.

************************
*****clear data/graphs**
****set graph schemes***
************************
clear
graph drop _all
set scheme s1mono

************************
*****generate data******
************************
set seed 8675309
// generate 40 random values
set obs 40
//...called x and y that range from -9 to +9
gen y = runiform(-9,9)
gen x = runiform(-9,9)
// now make an indicator for each grouping of 10
gen n =.
replace n= 1 if _n>=1 & _n<=10
replace n= 2 if _n>10 & _n<=20
replace n= 3 if _n>20 & _n<=30
replace n= 4 if _n>30 & _n<=40
// now for each successive 10 rows, add 100, 120, 140, and 160
replace y = y+100 if n==1
replace x = x+100 if n==1

replace y = y+120 if n==2
replace x= x+120 if n==2

replace y= y+140 if n==3
replace x= x+140 if n==3

replace y= y+160 if n==4
replace x= x+160 if n==4

************************
*****single figure******
************************
twoway ///
/// line of unity (45 degree line): 
(function y = x, ra(90 170) clpat(solid) clwidth(medium) clcolor(black)) ///
/// line of fit: 
(lfit y x, lcolor(red) lpattern(dash) lwidth(thick)) ///
/// dots for each group, using colors from colorbrewer2
(scatter y x if n==1, mcolor("230 97 1") msymbol(O) mlcolor(black)) ///
(scatter y x if n==2, mcolor("253 184 99") msymbol(O) mlcolor(black)) ///
(scatter y x if n==3, mcolor("178 171 210") msymbol(O) mlcolor(black)) ///
(scatter y x if n==4, mcolor("94 60 153") msymbol(O) mlcolor(black)) ///
, ///
/// force size:
ysize(4) xsize(4) ///
/// axis titles and rows:
xtitle("Reference BP, mm Hg") ///
ytitle("Cuffless BP, mm Hg") ///
xla(90(20)170) ///
yla(90(20)170) ///
legend(off) ///
/// labels:
text(102 109 "←Participant A", placement(e)) ///
text(119 127 "←Participant B", placement(e)) ///
text(142 135 "Participant C→", placement(w)) ///
text(160 154 "Participant D→", placement(w)) 
// export figure:
graph export figure1av2.png, replace width(1000)


************************
***four-piece figures***
*****to merge***********
************************
twoway ///
(function y = x if n==1, ra(90 110) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==1, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==1, mcolor("230 97 1") msymbol(O) mlcolor(black)) ///
, ///
title("Participant A") ///
xtitle(" ") ///
ytitle(" ") ///
yla(90(5)110) ///
xla(90(5)110) ///
legend(off) ///
name(figure1b1)

twoway ///
(function y = x if n==2, ra(110 130) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==2, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==2, mcolor("253 184 99") msymbol(O) mlcolor(black)) ///
, ///
title("Participant B") ///
xtitle(" ") ///
ytitle(" ") ///
yla(110(5)130) ///
xla(110(5)130) ///
legend(off) ///
name(figure1b2)

twoway ///
(function y = x if n==3, ra(130 150) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==3, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==3, mcolor("178 171 210") msymbol(O) mlcolor(black)) ///
, ///
title("Participant C") ///
xtitle(" ") ///
ytitle(" ") ///
yla(130(5)150) ///
xla(130(5)150) ///
legend(off) ///
name(figure1b3)

twoway ///
(function y = x if n==4, ra(150 170) clpat(solid) clwidth(medium) clcolor(black)) ///
(lfit y x if n==4, lcolor(red) lpattern(dash) lwidth(thick)) ///
(scatter y x if n==4, mcolor("94 60 153") msymbol(O) mlcolor(black)) ///
, ///
title("Participant D") ///
xtitle(" ") ///
ytitle(" ") ///
yla(150(5)170) ///
xla(150(5)170) ///
legend(off) ///
name(figure1b4)

***********************
***combine figures*****
***********************
graph combine figure1b1 figure1b2 figure1b3 figure1b4 /*figure1b5*/, ///
b1title("Reference BP, mm Hg") ///
l1title("Cuffless BP, mm Hg") ///
/// this forces the four scatterplots to be the same height/width
/// details here:
/// https://www.stata.com/statalist/archive/2013-07/msg00842.html
cols(2) iscale(.7273) ysize(5.9) graphregion(margin(zero))

// export figure:
graph export figure1bv2.png, replace width(1000)

Diapers, baby wipes, and other baby-related things for new parents

I became a dad during fellowship and have had several other kids since then. From this experience, I wanted to jot down some advice for new parents on what to get in anticipation of the transition to parenthood. Note: My recommendations here represent my personal opinion and come from my experience as a dad who overengineers things, not as a doctor. (I am an internist and take care of adults, not kids, anyway.) I don’t have vested interests in any of the products here and don’t receive any income from links or whatnot.

Diapers

Costco’s Kirkland Signature Diapers. These rock. They are on par with any high-end diaper company out there. (I think they have better performance than the super expensive environmentally-marketed ones that are free from bleaching and whatnot.) Kirkland Signature Diapers come in massive boxes and go on sale once or twice per year. We typically stock up in multiple sizes when they go on sale. They don’t have a newborn size, but our kids only wore newborn-size diapers for about 2 weeks, and honestly they could have all been in size 1 diapers from the get-go. If you are a new parent and your kid is ≥7 lbs (≥3,200 g) at delivery, maybe pick up a 20 pack of some other brand’s newborn size then anticipate flipping over to Size 1 diapers when you run out of newborn size.

Some online commenters have said that Kirkland Signature diapers have changed in quality in recent years. As of 8/2022, we have had ≥1 of our kids in these diapers continuously for 6+ years, and I have not noticed any differences in that time. So, no, I don’t think the quality has changed. If you don’t have a Costco nearby, it might be cost-effective to get a membership just so you can order these online.

If we are traveling and run out of diapers, we get Huggies or Pampers as a backup.

If you go for cloth diapers, more power to you. They didn’t work for us.

Baby wipes

Unlike Costco’s diapers, which I think are equivalent to other high end diapers, Costco’s Kirkland Signature Baby Wipes are in a league of their own. They are far and away the best wipes we have ever used, and we have used pretty much all major brands, including the environmentally-marketed ones. These Kirkland Signature wipes also come in massive boxes. I don’t think they have changed in quality in the past 6+ years. We get the unscented ones.

Do yourself a favor and get the Oxo Tot Perfect Pull Wipes Dispenser too. It’ll allow you to get a wet wipe with one hand, which is essential when you are trying to clean up an angry, messy baby.

Baby bag

I had an old Timbuk2 medium-sized messenger bag that I used instead of a backpack in med school. It was big enough to fit a 14-inch laptop, and it works flawlessly as a baby bag. We keep it packed by the door and grab it on the way out as a reflex; just make sure to restock it when you get back from an excursion! Other messenger bags will probably work just fine, but I’ve only had this Timbuk2 one. There’s a lot to love about the messenger-bag form factor for a baby bag, like the ability to flip the cover open without putting the bag down or even taking it off your shoulder.

Either a small or medium messenger bag is reasonable, but if you aren’t sure, or if you might have >1 kid, get the medium. These are easy to find used online, and my recommendation is to get the cheapest, ugliest, most beat-up messenger bag that you can find. It’ll be easier to spot in public or at the airport, and it will only get more beat up over time.

Inside the bag, we had a compact changing pad and 2 mesh/zippered luggage packing cubes. I recommend getting packing cubes that come in multiple sizes and using whichever ones fit in the bag. (You’ll find uses for the others!)

  • In one of the cubes, we kept (1) diapers and (2) a handful of baby wipes inside two separate quart or gallon freezer bags (the same kind of bag you probably already have in your kitchen drawer). We also kept (3) a tube of petroleum jelly, (4) hand sanitizer, and (5) dog poop cleanup bags for used diapers and other garbage, and yes I mean the type of dog bags that you see hooked to a leash.
  • In the other cube we kept changes of kid clothes inside a gallon freezer bag.

We also had some miscellaneous items, like acetaminophen, pacifiers, small books, small toys, etc. tucked in the front pouch.

Garbage can to put diapers in (“diaper pail”)

We have (a) a massive, expensive airtight diaper pail and (b) a smaller, non-airtight, 8-liter/2.1-gallon garbage can with a foot-triggered pop-up lid. The (a) big, fancy diaper pail is so big that you don’t have to change it as often, and the diapers get absolutely rancid in there. Every time you open it when it’s more than half full, you get this “puff” of disgusting smell that lingers in the air for a while. I prefer the (b) little, non-airtight garbage can. No, it’s not airtight, but I don’t think that matters, since the smaller size forces you to empty it every few days before it gets too nasty. I wouldn’t get anything larger than 10.2 liters/2.7 gallons, or else you’ll have a huge pile of gross diapers and the smell issue. I also wouldn’t go any smaller than 8 liters/2.1 gallons, since then you’ll be emptying it constantly. Here’s a search to get you going.

Our local Costco annoyingly no longer sells small trash can liners, so instead I’m using the waaay oversized 13-gallon bags. I actually like these bags for this use, though, since they are very strong and give me an opportunity to empty other trash cans into the bag when taking out the diapers.

FYI: I asked my spouse which diaper pail they like and they like the (a) massive one more than the (b) small one. I still like the (b) small one better!

Baby and little kid clothing

When you are expecting, friends and family will get you newborn-sized clothing, which will be cute and all but your kid will grow out of these almost immediately. Make sure to guide potential gift-givers to 3-6 mo, 6-12 mo, and 12-18 mo clothing.

The older the kid, the rougher they are on everything, including clothing. Starting at 1 year old, think about getting more durable clothing brands that will last long enough to be a hand-me-down within or between families. Clothing from box stores or mall chain stores doesn’t seem to dependably last through 1 kid aged ≥1 year, let alone a second kid. Brands that we have found durable include:

All of these brands are more expensive when not on sale or second hand. Since they are super durable, the second hand stuff is usually great. We’ve lucked out with some great sales over the years. You might want to sign up for their email advertisements so you catch sales when they pop up.

Toys

I strongly, strongly, strongly recommend implementing a rule that your house doesn’t have toys that take batteries. That’s my only advice.

Miscellaneous items

Here are some items that I’ve been surprised to use as much as we do:

  • 24-pack of white washcloths and 12-pack of white hand towels from Costco. These are in the style you’d find in a hotel. Thirsty and boring. We keep them in their own drawer in the kitchen and preferentially use them for cleanups instead of paper towels. We use washcloths for smaller messes and hand towels for bigger messes. These cloths/towels are easy to clean in a bleach load. Ours are discolored at this point from heavy use, but they have saved many trees since they keep us from using paper products.
  • A bouncy chair, like this Bright Starts one. We place this outside the bathroom so my spouse can put down our infant during a shower or whatnot. We also have one in the kitchen for when we are making meals or cleaning. Just don’t leave it in a place where you’ll drop something on your kid or trip over your kid.
  • The Couchcoaster to hold drinks on the couch. Kids need somewhere to put their water bottles when they are older or else you’ll have a leaking upside down bottle on your couch. These are perfect.
  • Stick on Avery No Iron Clothing Labels. We lost 4 or 5 pairs of winter gloves at daycare until we started using these; now we never lose them. Just write your kid’s last name on these with a permanent marker and stick them on the clothing. (Using last names instead of first names is good practice if you think you might have multiple kids so you won’t have to relabel things.) They survive the washer and dryer no problem, but occasionally require a re-do of the name. We use them on hats, gloves, sweaters, sweat shirts, blankets, towels, boots, and whatnot. For most clothing, we put the label inside. For gloves, put it on the outside. Peeling off these labels from clothing leaves behind a goopy residue that won’t come off. Just stick a blank, similar sized label in the same spot after removing one. Easy fix.
  • A label maker. We stick our kids’ last name on all hard objects (bento boxes, water bottles, food storage, etc.). Having a label maker is surprisingly handy, since the labels don’t fade and survive dishwashers. And my handwriting is stereotypically bad for a doctor. We have an Epson one that isn’t made anymore as far as I can tell. This Brother model looks very similar.
  • Nalgene 4oz and 8oz plastic storage jars. These are made from “Tritan” material and are clear. Nalgene also makes some in PPCO/polypropylene, which are opaque/cloudy; we haven’t gotten that type, as they are marketed for laboratory use and aren’t leakproof per some of the listings. The links above are to the Nalgene website, which is the cheapest source, but they are frequently out of stock. It might be prudent to sign up for their “stock alerts” email if they aren’t in stock. They are about 2x the cost on a website that rhymes with zamerzon, but seem to be more reasonably priced on this other website that charges ~$20 for shipping. Why are they great? Just like the iconic Nalgene water bottles, these jars are seemingly indestructible and leakproof. They make packing lunches and snacks super easy since you don’t have to worry about them leaking or coming open. They store breast milk/formula in a pinch. (I wouldn’t freeze liquid in these.) With our first kid, we picked up about 10 of the 4 oz and 5 of the 8 oz to keep up with our usage. Put a label from your label maker on these to keep them from disappearing at daycare/school. Get these Oxo mini scrub brushes so you can get any crud that might build up in the inside of the lid. We also got one of these dishwasher baskets to hold the lids upright in our 2-rack dishwasher, but don’t need the baskets anymore with our new 3-rack dishwasher since the lids lie comfortably flat on the top rack.
  • White noise machine. We’ve tried a few different models and this is our favorite. It has a good balance of low power (running on 5v USB!) and good quality sound. We have this in our room and the kids rooms running all the time at a low volume with the deeper white noise sound (i.e., brown noise).
  • Cable (“zip”) ties. Sometimes you just need to tie some inanimate object down quickly and semi-permanently. Zip ties are awesome for that. Get ones that are at least 18 inches long. This brand has been good.
  • Junk drawer tools. Here are some items that I find useful and use several times per month, so I keep them in the kitchen and not in the tool box. This screwdriver with flat and Phillips head in two sizes. These little cutting pliers and needle nose pliers (keep out of reach of little hands!).
  • House project supplies. I assume you have a basic set of tools. If you have drywall, you’ll be using oodles of drywall anchors while baby proofing. The anchors that come in baby proofing kits are usually terrible, so get these awesome drywall anchors (they require a matching drill bit to go with a power drill, as FYI). Get a stud finder too. I don’t really like ours so won’t recommend it.
  • Screw thread locker. Kids stuff seems to come unscrewed pretty easily. A dab of this blue Loctite Threadlocker on the side of a screw’s threading will keep it from coming undone. This is a cheat code.
  • Bathroom items. These nail clippers are awesome when kids are a little older. For infants, these ones are great. At some point your kid will get a splinter, and you will want a really good 2-piece set of tweezers like these before that happens.
  • Oxiclean, a Rubbermaid dishpan, and some nitrile gloves. Doing an “Oxiclean soak” in a dishpan is remarkably effective at getting nasty stains out. We have a great outdoor gear consignment shop in Burlington, VT (Outdoor Gear Exchange). Some of the high-end consignment kids’ winter gear (e.g., Patagonia) has really caked-in dirt/stains but is otherwise in great shape. The price is usually dramatically lowered because of these stains. We have had 100% success in getting these stains out with an Oxiclean soak, getting these items looking more-or-less brand new. It also works well for other run-of-the-mill clothing stains and deodorant/antiperspirant stains. There are instructions on how to do this soak on the back of the Oxiclean box, but our approach is to (A) put on nitrile gloves, (B) fill the ~4 gallon/~15 liter dishpan about half-way with warm/hot water, (C) put the dishpan on a junky old white towel on top of the not-running dryer (towel to catch any bubbles that might spill over), then (D) add 1-2 full scoops (~1-2 cups/~250-500 mL) of Oxiclean. (E) Mix the Oxiclean/water solution around a bit, then (F) add the clothing items, gently agitating them without trying to spill out the liquid. (G) Leave the batch for at least a couple of hours or overnight, then (H) dump everything in the dishpan into the washing machine (also put the junky towel in the washing machine), add some mild laundry detergent, and run a normal small-sized wash cycle. After it’s done, we usually then (I) add in a full load of clothes on top of the just-Oxicleaned-then-washed items and run again with mild detergent. I’m not sure we need to do this re-wash, but it doesn’t take much more effort and it gives a nice peace of mind that the Oxiclean has washed out of the kids’ clothes. Note: Make sure to use the nitrile gloves when handling this solution and the clothes before they get washed. Don’t mix Oxiclean with bleach or other solvents. Wipe up any spills with paper towels and discard the paper towels immediately. Don’t use the dishpan for other purposes after you’ve used it for Oxicleaning. This soak might discolor some things, but we haven’t had that happen with any commercially-produced kids or adult clothing items. It unfortunately won’t remove paint or permanent marker.

Part 7: Making a table for your outcome of interest (Table 2?)

As we learned in part 5, Table 1 describes your analytical population at baseline by your exposure. For those using a continuous variable as an exposure, it’s by quantile (e.g., tertile, quartile) of the exposure. I propose a table, which we’ll call “Table 2,” that describes the outcome of interest by the exposure used in Table 1. You may have seen something along these lines in papers before; it’s not a universal table in observational epidemiology, so calling it “Table 2” is a bit much, but “Table 2” will do for our purposes.

Columns

The columns should be identical to those in your Table 1. (I suggest adding an “All” column if you don’t have one in your Table 1, though.)

Rows

In Table 1, I suggested having an N and a range for each quantile of a continuous exposure. I suggest not repeating those in your Table 2, since it’s duplicative. I suppose it might be helpful for error checking to have them in Table 2, to confirm that they are identical to your Table 1. But the version of Table 2 that goes in the manuscript shouldn’t include rows for Ns and ranges.

In a very simple Table 2, there might be a single row: the outcome in the analytical population. It might look like this:

                            All | Tertile 1 | Tertile 2 | Tertile 3
All

BUT! There might be a stratification of interest in your table. In the REGARDS study, we often stratify by Black vs. White race or by sex. So, you might also include rows for each subsequent group, like this:

                            All | Tertile 1 | Tertile 2 | Tertile 3
All
Black participants
White participants

Finally, for subgroups, you might opt to include a minimally-adjusted regression comparing your strata. In this example, we could use a modified Poisson regression (i.e., Poisson regression with sandwich or robust variance estimators, Zou’s method) to compare risk of the outcome overall and in each tertile. I’d just adjust for age and sex in this example (a Stata sketch follows the table below). So:

                            All | Tertile 1 | Tertile 2 | Tertile 3
All
Black participants
White participants
RR, Black vs. White (ref)
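If it’s helpful, here’s a minimal sketch of what that modified Poisson regression might look like in Stata; the variable names (outcome, race, tertile, age, sex) are hypothetical placeholders for your own data:

// modified Poisson regression: Poisson with robust variance (Zou's method);
// the eform option reports exponentiated coefficients, ie risk ratios.
// Variable names here are hypothetical.
glm outcome i.race age i.sex, family(poisson) link(log) vce(robust) eform
// tertile-specific RRs:
forvalues t = 1/3 {
	glm outcome i.race age i.sex if tertile == `t', family(poisson) link(log) vce(robust) eform
}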

Cells

Here, I suggest presenting # with event/# at risk (percentage with event) in each cell, except in the RR row, which would present the RR and its 95% confidence interval. Example (totally made-up numbers here and some placeholder ##’s, as FYI):

                            All              | Tertile 1        | Tertile 2        | Tertile 3
All                         1000/3000 (33%)  | 300/1000 (30%)   | 300/1000 (30%)   | 400/1000 (40%)
Black participants          400/1000 (40%)   | ##/## (##%)      | ##/## (##%)      | ##/## (##%)
White participants          600/2000 (30%)   | ##/## (##%)      | ##/## (##%)      | ##/## (##%)
RR, Black vs. White (ref)   1.25 (1.20-1.30) | 1.20 (1.15-1.25) | 1.22 (1.18-1.30) | 1.21 (1.19-1.28)

That’s it! Even if you don’t include this table, it’s super handy to have to describe the outcome in the text.

Part 6: Visualizing your continuous exposure at baseline

Visualization of your continuous exposure in an observational epidemiology research project

As we saw in Part 5, it’s important to describe the characteristics of your baseline population by your exposure. This helps readers get a better understanding of internal validity. For folks completing analyses with binary exposures, part 6 isn’t for you. If your analysis includes continuous exposures or ordinal exposures with at least a handful of options, read on.

I think it’s vitally important to visualize your exposure before moving any further forward with your analyses. There are a few reasons that I do this:

  1. Understand the distribution of your exposure. Looking at the raw spread of your data will help you understand if it has a relatively normal distribution, if it’s skewed, if it is multimodal (eg, has several peaks), or if it’s just plain old weird looking. If your exposure is non-normally distributed, then you’ll need to consider the implications of the spread in your analysis. This may mean log-transforming, square root-transforming (if you have lots of zeros in your exposure’s values), or some other sort of transformation.
    • Note: make sure to visualize your transformed exposure value!
  2. Look for patterns that need exploring. You might notice a huge peak at a value of “999”. This might represent missing values, which will need to be recoded. You might also notice spikes at particular values in the tails of the bell curve; these might represent values that were really above the upper or below the lower limit of detection. You’ll need to think about how to handle such values, possibly recoding them following the NHANES approach as the limit of detection divided by the square root of 2 (see the sketch after this list).
  3. Understand the distribution of your exposure by key subgroups. In REGARDS, our analyses commonly focus on racial differences in CVD events. Because of this, I commonly visualize exposures with overlaid histograms for Black and White participants to see how these exposure variables differ by race. This could easily be done for other sociodemographics (notably, by sex), anthropometrics, and disease states.
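Here’s a minimal sketch of what that recoding might look like in Stata; the variable name and the limit of detection value are hypothetical, so swap in your own:

// recode "999" placeholders as Stata's missing value:
replace biomarker = . if biomarker == 999
// recode values below an assumed lower limit of detection (0.05 here)
// as LOD/sqrt(2), per the NHANES approach:
replace biomarker = 0.05/sqrt(2) if biomarker < 0.05 & !missing(biomarker)
// re-check the distribution after recoding:
hist biomarker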

My approach to building histograms in Stata

A conventional histogram splits continuous data into several evenly-spaced discrete groups called “bins”, and visualizes these discrete groups as a bar graph. Let’s see how to do this in Stata, with some randomly generated data that approximates a normal distribution. While we’re at it, we’ll make a variable called “group” that’s 0 or 1 that we’ll use later. (Also note that this dataset doesn’t use discrete values, so I’m not specifying the discrete option in my “hist” code. If you see spaces in your histogram because you are using discrete values, add “, discrete” after your variable name in the histogram line.)

clear all
// set observations to be 1000:
set obs 1000
// set a random number seed for reproducibility: 
set seed 12345
// make a normally-distributed variable, with mean of 5 and SD of 10:
gen myvariable = rnormal(5,10)
// make a 0 or 1 variable for a group, following instructions for "generate ui":
// https://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/
gen group = floor((1-0+1)*runiform() + 0)
// now make a histogram
hist myvariable

Here’s the overall histogram:

On the X axis you see the range of values of the variable of interest, from around -30 to about +40. On the Y axis you see the density. I want to show this same figure by group, however, and the bins are not currently transparent; you won’t be able to tell one group from another. So, in Stata, you need to use the “twoway histogram” syntax instead of just “histogram” and specify transparent colors using the %NUMBER notation. We’ll also add a legend, and we’ll set the scheme to s1mono to get rid of the ugly default blue surrounding box. Example:

// set the scheme to s1mono:
set scheme s1mono
// now make your histogram:
twoway ///
(hist myvariable if group==0, color(blue%30)) ///
(hist myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Here’s what you get:

You can modify things as needed. Something you might consider is changing the Y axis from density to frequency (counts) or percent, which is done by adding “frequency” or “percent” after the comma but before the colors. You might also opt for different colors; you can read about selecting colors in this editorial I wrote with Mary Cushman.
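For example, here’s the same overlaid histogram switched from density to percent; everything else is unchanged from the code above:

twoway ///
(hist myvariable if group==0, percent color(blue%30)) ///
(hist myvariable if group==1, percent color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))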

Considerations for designing histograms

One question is how many bins you want. I found this nice 2019 article by Regina L. Nuzzo, PhD (PDF link, PubMed listing) that goes over lots of considerations for the design of histograms. I specifically like the box, which lists equations to determine the number of bins and the bin width. In general, if you have too many bins, your data will look choppy:

// using 100 bins here
twoway ///
(hist myvariable if group==0, color(blue%30) bin(100)) ///
(hist myvariable if group==1, color(red%30)  bin(100)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

And if you have too few, you won’t be able to make sense of the overall structure of the data.

// using 2 bins here
twoway ///
(hist myvariable if group==0, color(blue%30) bin(2)) ///
(hist myvariable if group==1, color(red%30)  bin(2)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Be thoughtful about how thinly you want to slice your data.

A quick note on kernel density plots

Kernel density plots follow the same idea as histograms, except they show a smoothed line tracing your data’s distribution instead of bars. I prefer to use histograms when looking at laboratory data, since kernel density plots can smooth over the outliers described in point #2 above. The code for a kernel density plot in Stata is nearly identical to the “twoway hist” code above.

twoway ///
(kdensity myvariable if group==0, color(blue%30)) ///
(kdensity myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Output:

You can even combine histograms and kernel density plots!

twoway ///
(hist myvariable if group==0, color(blue%30)) ///
(hist myvariable if group==1, color(red%30)) ///
(kdensity myvariable if group==0, color(blue%30)) ///
(kdensity myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

I’ve never done this myself for a manuscript, but just showing that it’s possible.

Part 5: Baseline characteristics in a Table 1 for a prospective observational study

What’s the deal with Table 1?

Tables describing the baseline characteristics of your analytical sample are ubiquitous in observational epidemiology manuscripts. They are critical for helping the reader understand the study population and the potential limitations of your analysis. A table characterizing baseline characteristics is so important that it’s typically the first table in any observational epidemiology (or clinical trial) manuscript, which is why it’s commonly referred to as a “Table 1”. Table 1s help readers judge the internal validity of your study, and if your study has poor internal validity, your results and findings aren’t useful.

The details here are specific to prospective observational studies (e.g., cohort studies), but are generalizable to other sorts of studies (e.g., RCTs, case-control studies).

If you are a Stata user, you might be interested in my primer on using Table1_mc to generate a Table 1.

Guts of a Table 1

There are several variations of the Table 1; here’s how I do it.

COLUMNS: These come from your exposure of interest (i.e., the independent variable). This is not the outcome of interest. There are a few ways to divvy up these columns, depending on what sort of data you have:

  • Continuous exposure (e.g., baseline LDL-cholesterol level): Cut this up into quantiles. I commonly use tertiles (3 groups) or quartiles (4 groups). People have very, very strong opinions about whether you should use tertiles or quartiles; I don’t see much of a fuss in using either. Of note, there usually is no need to transform your data prior to splitting into quantiles. (And log-transforming continuous data that includes values of zero will replace those zeros with missing data!) See the sketch after this list for one way to cut tertiles in Stata.
  • Discrete exposure:
    • Dichotomous/binary exposure (e.g., prevalent diabetes status as no/0 or yes/1): This is easy: one column per level. Make sure to use descriptive column headers like “No prevalent diabetes” and “Prevalent diabetes” instead of the numbers 0 and 1.
    • Ordinal exposure, not too many groups (e.g., never smoker/0, former smoker/1, current smoker/2): This is also easy: one column per level, again with descriptive column headers.
    • Ordinal exposure, a bunch of groups (e.g., extended Likert scale ranging from super unsatisfied/1 to super satisfied/7): This is a bit trickier. On one hand, there isn’t any real limitation on how wide a table can be in a software package, so you could have columns 1 through 7. This is unwieldy for the reader, however. I personally think it’s better to collapse really wide groupings into a few groups. Here, you could collapse all of the negative responses (1, 2, and 3), leave the neutral response as its own category (4), and collapse all of the positive responses (5, 6, and 7). Use descriptive column headers here as well, and be sure to describe how you collapsed groups in the footer of the table.
    • Nominal exposure, not too many groups (e.g., US Census regions of Northeast, Midwest, South, and West): This is easy, just use the groups. Be thoughtful about using a consistent order of these groups throughout your manuscript.
    • Nominal exposure, a bunch of groups (e.g., favorite movie): As with ‘Ordinal data, a bunch of groups’ above, I would collapse these into groups that relate to each other, such as genre of movie.
  • (Optional) Additional first column showing “Total” summary statistics. This presents summary statistics for the entire study population as a whole, instead of by quantile or discrete groupings. I don’t see much value in these and typically don’t include them.
  • (Optional) Additional first column showing count of missingness for each row. This presents a count of missing values for that entire row. I think these are nice to include, but they don’t show missingness by column so are an imperfect way to show missingness. See the section below on ‘cell contents’ for alternative strategies to show missingness.
    • Note: Table1_mc for Stata cannot generate a “missingness” row.
  • (Optional, but I suggest avoiding) A final P-value column showing comparisons across each row. These have fallen out of favor for clinical trial Table 1s. I see little value in them for prospective observational studies and avoid them there as well.
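Here’s the sketch promised in the first bullet above: cutting a continuous exposure into tertiles in Stata and checking them. The variable names (crp, crp_tertile) are hypothetical:

// cut a continuous exposure (eg, baseline CRP) into tertiles:
xtile crp_tertile = crp, nq(3)
// check the N and the range (min/max) of the exposure within each
// tertile, which become rows in your Table 1 (see ROWS below):
tabstat crp, by(crp_tertile) stat(n min max)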

ROWS: These include the N for each column, the range of values for continuous exposures, and baseline values. Note that the data here are from baseline.

  • N for each group. Make sure that these Ns add up to the expected N in your analytical population at the bottom of your inclusion flow diagram. If it doesn’t match, you’ve done something wrong.
  • (For continuous exposures) Range of values for each quantile, and yes, I mean the minimum and maximum for each quantile, not the IQR.
  • Sociodemographics (age, sex, race, ± income, ± region, ± education level, etc.)
  • Anthropometrics (height, weight, waist circumference, BMI, etc.)
  • Medical problems as relevant to your study (eg, proportion with hypertension, diabetes, etc.)
  • Medical data as relevant to your study (eg, laboratory assays, details with radiological imaging, details from cardiology reports)
  • Suggest avoiding the outcome(s) of interest as additional rows. I think that presenting the outcomes in this table is inadequate. I prefer to have a separate table or figure dedicated to the outcome of interest that goes much more in-depth than a Table 1 does. Plus, the outcome isn’t ascertained at baseline in a prospective observational study, and describing the population at baseline is the general purpose of Table 1.
  • And for the love of Pete, please make sure that all covariates in your final model appear as rows. If you have a model that adjusts for Epworth Sleepiness Score, for example, make sure that fits in somewhere above.

The first column of your Table 1 will describe each row. The appearance of these descriptions will vary based upon the type of data you have.

  • Overall style of row descriptions as it appears in the first column:
    • N row – I suggest simply using “N”, though some folks use N (upper case) to designate the entire population and n (lower case) to designate subpopulations, so perhaps you might opt to put “n”.
    • Continuous variables (including the row for range) – I suggest a descriptive name and the units. Eg, “Height, cm”
    • Discrete variables – I suggest a descriptive name alone. Some opt to put a hint to the contents of the cell here (eg, adding a percentage sign such as “Female sex, %“), but I think that is better included in the footer of the table. This will probably be determined by the specific journal you are submitting to.
      • Dichotomous/binary values – In this example, sex is dichotomous (male vs. female) since that’s how it has historically been collected in NIH studies. For dichotomous variables, you can include either (1) a row for ‘Male’ and a row for ‘Female’, or (2) simply a row for one of the two sexes (eg, just ‘Female’) since the other row will be the other sex.
      • Other discrete variables (eg, ordinal or nominal) – In this example, we will consider the nominal variable of Race. I suggest having a leading row that provides description of the following rows (eg, “Race group”) then add two spaces before each following race group so the nominal values for the race groups seem nested under the heading.
    • (Optional) Headings for groupings of rows – I like including bold/italicized headings for groupings of data to help keep things organized.

Here’s an example of how I think a blank table should appear:

Table 1 – Here is a descriptive title of your Table 1, followed by an asterisk that leads to the footer. I suggest something like “Sociodemographics, anthropometrics, medical problems, and medical data ascertained at baseline among [#] participants in [NAME OF STUDY] with [BRIEF INCLUSION CRITERIA] and without [BRIEF EXCLUSION CRITERIA] by [DESCRIPTION OF EXPOSURE LIKE ‘TERTILE OF CRP’ OR ‘PREVALENT DIABETES STATUS’]*”

                      (Optional)   (Optional)   Exposure     Exposure     Exposure     (Optional, suggest
                      Missing, N   Total        variable,    variable,    variable,    avoiding)
                                                tertile 1    tertile 2    tertile 3    P-value
N
Range, ng/mL
Sociodemographics
  Age, y
  Female sex
  Race group
    Black
    White
    Asian
Anthropometrics
  Height, cm
  Weight, kg
  BMI, kg/m²
Medical problems
  [List out here]
Medical data
  [List out here]

*Footer of your Table 1. I suggest describing the appearance of the cells, eg “Range is minimum and maximum of the exposure for each quantile. Presented as mean (SD) for normally distributed and median (IQR) for skewed continuous variables. Discrete data are presented as column percents.”

Cell contents

The cell contents vary by the type of variable and your goal in this table (a Stata sketch follows the list below):

  • Simplicity as goal:
    • Normally distributed continuous variables: Mean (SD)
    • Non-normally distributed continuous variables: Median (IQR)
    • Discrete variables: Present column percentages, not row percentages. For example, we’ll consider “income >$75k” by tertile of CRP. A column percentage would show the % of participants in that specific tertile who have an income >$75k. A row percentage would show the % of participants with income >$75k who fall in that specific tertile.
  • Clarity of completeness of data as goal (if doing this, you would not also include the “missing” column):
    • Continuous variables: Present as mean (SD) or median (IQR) as outlined above based upon normality, but also include an ‘n’. Example for age: “75 (6), n=455”
      • Note: Table1_mc in Stata cannot report an ‘n’ with continuous variables.
    • Dichotomous variables: Present column percentage plus ‘n’. Example for female sex: “45%, n=244”.
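Here’s a minimal sketch of pulling these cell contents in Stata, again with hypothetical variable names (age, crp, female, crp_tertile):

// mean (SD) and n for a normally distributed continuous variable:
tabstat age, by(crp_tertile) stat(mean sd n)
// median and IQR bounds for a skewed continuous variable:
tabstat crp, by(crp_tertile) stat(p50 p25 p75)
// column percentages (plus counts) for a discrete variable:
tab female crp_tertile, col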

A word on rounding: I think there is little value in including numbers after the decimal place. I suggest aggressively rounding at the decimal for most things. For example, for BMI, I suggest showing “27 (6)” and not “26.7 (7.2)”. For things that are measured as whole numbers, reporting a decimal place adds no value: systolic BP is measured as 142, 112, and 138, not 141.8, 111.8, and 138.4. For discrete variables, I always round the proportion/percentage at the decimal, but I report very small proportions as “<1%” if there are any participants in that group and the percentage would otherwise round to zero; “0%” is reserved for groups with truly none.

The one exception to my aggressive “round at the decimal place” strategy is variables that are commonly reported past the decimal place, such as many laboratory values. Serum creatinine is commonly reported to the hundredths place (e.g., “0.88”), so report the summary statistic for that value to the hundredths place, like 0.78 (0.30).

How I use the Zotero reference manager for collaborative grants or manuscripts

Why Zotero?

Zotero is an excellent and free reference manager that is my go-to for writing grants and manuscripts. It has some very handy features, like word processor plugins, web browser plugins that will grab PDF versions of documents if available, and ability to share “group libraries” with collaborators. It has all of the standard features of reference managers, like auto-formatting of references to meet submission requirements, automatic renumbering of in-line references, etc.

There are some silly things about Zotero’s initial setup that are important to get out of the way. For example, you don’t necessarily need to have a Zotero account to use Zotero (or at least you didn’t when I used it the first time). Make sure that you read about how to get set up with Zotero under the “Zotero” heading on this page. After you do that, come back here and read on!

How to use Zotero for collaborative projects

Zotero works well with MS Word and Google Docs. Examples here are taken from MS Word, but are also applicable to Google Docs. The main difference between Google Docs and MS Word is that the web browser plugin is also the Google Docs plugin. MS Word has a plugin separate from the web browser plugin. Regardless, whenever you use a Zotero plugin (eg the MS Word, Google Docs, or browser extensions), you also need to have the Zotero desktop app open. You’ll get an error if you try to insert a reference into a document or snag a reference from PubMed/a journal website if the Zotero desktop app isn’t also open.

Organizing your folders (“collections”) and subfolders (“subcollections”).

In your desktop app, navigate to the shared library that I’ll send you. Make folders/collections or subfolders/subcollections in there to help stay organized. To make a new folder/collection, right click on the shared group library and click “New Collection…”. To make subfolders/subcollections, right click on that new folder/subfolder that you made.

I suggest making collections/folders by section of your document, and numbering them so they stay in order, so: “01 Introduction”, “02 Methods”, “03 Results”, and “04 Discussion”. If you are writing something that doesn’t follow a usual flow (eg an opinion piece), number/name things by the major sections in your outline. You can always rename these folders/collections and renumber them so they show up in order.

Now, within each of these folders/collections, make specific groupings of subfolders/subcollections by topic. For example, in the introduction, you might have a sentence talking about the epidemiology/population prevalence of hypertension, then the costs (eg DALY lost) of hypertension, then an overview of the pathophysiology of hypertension, then how some biomarker relates to blood pressure. I recommend having a subfolder for each of these concepts separately in the “01 introduction” folder. You can also order these with numbers or letters, but it also might make sense to keep them unordered if you aren’t sure of how the introduction (or any other section) will flow.

Now repeat this for all of the other folders/collections. The results folder might be pretty thin because usually (for me at least) there aren’t many references in that section. For the discussion section, I recommend including the suggested sections from my “your first epidemiology manuscript” example under “Download” here. It’ll look like this when you are done:

Next: Grabbing citations.

You need to install the Zotero browser extension, and make sure that you have the Zotero desktop app open when you are grabbing citations. I STRONGLY RECOMMEND GRABBING JOURNAL CITATIONS FROM PubMed AND ONLY PubMed. (For textbooks, Google Books works quite well.) Zotero uses metadata from websites to populate the reference. PubMed’s metadata is pristine. Journal websites have metadata, but it’s inconsistent and often incomplete. So, stick with PubMed when grabbing citations.

The first step is to find your article on PubMed. Let’s say you want to grab the AHA’s statistical update document, and using Google you find it on the journal’s website. Welp, that’s not PubMed, so don’t even think about grabbing the reference from there.

Instead, highlight the title and search Google again, appending the word “PubMed” to the search. This will get you to the PubMed listing for that article.

Now this is very important! Go over to your Zotero desktop app and choose the subcollection/subfolder you want this to go in. This will be in the Introduction/Population Prevalence of HTN subfolder. You’ll notice it’s empty.

Now go back to the PubMed page for your document and click the Zotero plugin button.

Now when you go back to the Zotero desktop app, you’ll see that the AHA Statistical Update is now saved in your “population prevalence of HTN” folder. If a PDF is available through Unpaywall (a separate service integrated into Zotero that grabs free/legal copies of journal PDFs), then a PDF will be saved as well.

When you grab references, make sure to sort them into a specific folder along the way. You’ll thank yourself later.

Next: Inserting citations in an MS Word document

Open up your manuscript file in MS Word. Make sure you click/set your cursor in the place in your document where you want the citation to land. (I’m a “citation goes after the period and not before the period” guy myself.) Then, click the Zotero tab. (If you don’t see the Zotero tab, you might need to reboot or manually install the plugin; see the “Setting Things Up/Zotero” section here.) Now, click the “Add/Edit Citation” button. If this is the first citation/reference you are adding to this document, you’ll be prompted to select a formatting style; just pick anything, since you can always change it later (I like the American Medical Association one to start with). You won’t see this prompt when adding subsequent references.

Now you’ll see the hovering Zotero window. This isn’t a part of MS Word; it’s actually the Zotero desktop app. Sometimes this window gets lost among the various programs/windows on your desktop and you need to go find it; it’ll be under the Zotero icon in your taskbar on Windows, or Alt+Tab until you find it. Anyway, this window allows you to find a citation by text search (ie, by typing in the author name or title), but if you select the dropdown menu here, you can switch to the “classic view” and manually grab citations from your subcollections/subfolders. I recommend adding citations through the classic view.

In the classic view you will see all of your subcollections/subfolders. Navigate to your subcollection/subfolder of interest, click on the citation that you’d like to insert, and hit “OK”. Notice that in the classic view, you can select multiple references at the same time by clicking the “Multiple Sources…” button at the bottom.

Now your citation is in-line! See the floating “1” after the first sentence. But where is the reference list? Let’s plop one in. I added a new heading for references in MS Word, and we’ll add it there. Click on the line after your “References” header, go to the Zotero tab, and click “Add/Edit Bibliography”.

End product is below. This reference list will update while you insert references in your manuscript. The reference numbers will also update automatically as you go.

Good luck!

Descriptive labels of metrics assessing discrimination

Discrimination and calibration of models predicting risk

Risk prediction is common in medicine, eg the Framingham Risk Score and the Pooled Cohort Equations/10-year ASCVD risk prediction models. Machine learning models that predict risk are also growing in popularity.

Discrimination and calibration are discussed in this excellent JAMA article. In brief, discrimination relates to how well a model divvies up low- and high-risk individuals: in a population of folks at various risk levels, a highly discriminating model will score higher-risk folks higher than lower-risk folks. For example, a CVD model should say that a 60 year old with hypertension, hyperlipidemia, and diabetes has a higher risk of CVD than a 20 year old with none of those conditions. Calibration, on the other hand, describes how well a model predicts risk for an individual person.

Discrimination descriptive labels

I have to look this up every time, so I am putting it here. Here’s a widely-used labeling for discrimination, which we called the “Hosmer-Lemeshow criteria” in this manuscript. These authors applied this to areas under ROC curves, but I think it’s also fair to apply it to C-statistics. (A quick Stata sketch for obtaining a C-statistic follows the list.)

  • 0.5: No discrimination
  • 0.5 to <0.7: Poor discrimination
  • 0.7 to <0.8: Acceptable discrimination
  • 0.8 to <0.9: Excellent discrimination
  • ≥0.9: Outstanding discrimination
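If you need to generate the C-statistic you’re labeling, here’s a quick sketch in Stata with a hypothetical outcome and predictors; after fitting a logistic model, the –lroc– postestimation command reports the area under the ROC curve:

// fit a logistic model (hypothetical variables):
logistic cvd_event age sbp diabetes
// report the area under the ROC curve (the C-statistic) without a plot:
lroc, nograph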

The reference is here:

Hosmer DW, Lemeshow S, Sturdivant RX. Applied Logistic Regression. 3rd ed. Hoboken, NJ, USA: John Wiley & Sons Inc; 2013.

Here’s the Google Books listing for this, in case you want to grab the metadata for this reference using Zotero or whatnot. You’ll see the above labels on page 177 — you can see this with the “Preview” view.

Using Stata’s Frames feature to build an analytical dataset

Stata 16 introduced the new Frames functionality, which allows multiple datasets to be stored in memory, with each dataset stored in its own “Frame”. This allows for dynamic manipulation of multiple datasets across multiple Frames. Stata is still simplest to use when manipulating a single dataset (or, frame). So, Stata users will probably be interested in building a single dataset/Frame for a specific analysis that is built from variables taken from multiple datasets/Frames.

One handy application of Frames is to import non-Stata datasets as separate frames and combine them (really, merge them) into a single analytical dataset/Frame. Before Frames, I would import non-Stata datasets, save them locally, then merge them one by one. With Frames, you just import each dataset into its own Frame and “merge” them directly, skipping the intermediate “save as Stata dta file” step.

Here’s my approach to building a single analytical dataset from multiple imported datasets, with frames. We’ll do this with NHANES data. This is a modification of the code on this post.

Step 1: Drop (reset) all frames, create new ones to import the new datasets, and run commands within each new frame to import NHANES/SAS datasets.

// Here's the NHANES website, FYI:
// https://wwwn.cdc.gov/nchs/nhanes/default.aspx
//
// drop all frames from memory. This will delete all unsaved data so be careful!!
frames reset
//
// make a blank frame called "DEMO_F"
// you could type "frame create" or the brief synonym "mkf" for 
// "make frame"
mkf DEMO_F
// ...Then run the sas import command within it, grabbing it from the CDC website. 
// Since this command needs to be run from within the DEMO_F frame,
// we can tell Stata to run the command from that frame without 
// actually changing to it using the "frame [name]:" prefix
frame DEMO_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/DEMO_F.XPT", clear 
//
// ditto for the "BPQ_F" and "KIQ_U_F" datasets.
mkf BPQ_F
frame BPQ_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/BPQ_F.XPT", clear
//
mkf KIQ_U_F
frame KIQ_U_F: import sasxport5 "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/KIQ_U_F.XPT", clear
// 
// let's see a list of current frames:
frames dir 
//
// Which frame are you using though? 
// pwf is present working frame, or the current one in use.
pwf 
// You'll see that the pwf is "default". 

Step 2: Create an “analytical” frame that will contain the data you need to complete your analysis, and copy the variable that links all of your data to this frame. Also switch to that new analytical frame.

// The "linking" variable in this dataset is called "seqn", which 
// we will copy ("put") from the DEMO_F frame. 
// (Your file might have a linking variable called "id".)
// This creates a new frame called "analytical" and also moves the "seqn"
// variable from DEMO_F to the new analytical frame in one line.
frame DEMO_F: frame put seqn, into(analytical)
//
// see current list of frames and present working frame:
frames dir
pwf
// Now change from the default frame to the new analytical one.
// You can change frames with "cwf" for "change working frame".
cwf analytical

Step 3: Now that you are in the new analytical frame, link all of your frames using the “linking” variable (“seqn” here).

// use the "frlink" command to link frames. This can be 1:1 linking or 1:m.
//
// remember that your cwf should be analytical right now. 
//
frlink 1:1 seqn, frame(DEMO_F)
frlink 1:1 seqn, frame(BPQ_F)
frlink 1:1 seqn, frame(KIQ_U_F)
// 
// You are still within the "analytical" frame, but now your frames are all 
// linked or connected to each other. 
// if you look at your dataset with the --browse-- command, you'll see there
// are now new DEMO_F, BPQ_F, and KIQ_U_F variables. These are the "rows" for
// linked IDs in those other frames, so Stata knows where to look for 
// variables in those other frames. 

Step 4: “Get” the specific variables you want from each frame. This is how you merge individual variables from multiple frames.

// from my NHANES post, we need to grab weighting variables from DEMO_F
// and a BP variable from BPQ_F. Just for kicks, we'll also grab self-
// reported "weak kidneys" from KIQ_U_F. 
// 
// remember that your cwf should be analytical right now. 
//
frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_F)
frget bpq020, from(BPQ_F)
frget kiq022, from(KIQ_U_F)
//
// now you have a nice merged database! 

Step 5 (optional): If you are satisfied with your analytical dataset and no longer need the other frames, you can now drop the “linking” variables from the analytical dataset, save your analytical dataset, clear your frames, and reopen your analytical dataset

// You can save this analytical frame as a Stata
// dataset now if you are done manipulating the other frames. 
// 
// You might opt to drop the new linking variables prior to saving
// for simplicity.
drop DEMO_F BPQ_F KIQ_U_F
//
// Stata's "save" command won't save other frames as FYI, just the pwf.
// But in this example, we are done with other frames. 
save analytical.dta, replace
//
// Drop all other frames so you don't get an annoying pop-up 
// about unsaved frames in memory. 
// Be careful! This will drop all data from memory!!
frames reset 
//
// now reopen your previously saved analytical dataset.
use analytical.dta, clear

Bonus: Appending frames

You might need to append frames. There are some details here about how to do this. I’m using the –fframeappend– command by Jürgen Wiemers.

// install fframeappend, only need to do once:
ssc install fframeappend
// change to the frame you want to append all of your data to,
// in this example it's called "appended"
// you'll have to use "mkf appended" if you don't already have one
// called that. 
cwf appended
//now append your frame "a" to the current open frame
fframeappend, using(a)
// now repeat using frame "b" to the current open frame
fframeappend, using(b)

Bonus: Here’s steps 1-4 in a single loop using global macros

For advanced users: Here’s a loop with some global macros that is adaptable to downloading files from several years. NHANES uses a different letter suffix for files from different years; the 2009-2010 files use “F”.

frames reset
global url "https://wwwn.cdc.gov/Nchs/Nhanes/"
global F "2009-2010/"
global files DEMO BPQ KIQ_U

foreach x in F {
	foreach y in $files {
		mkf `y'_`x'
		frame `y'_`x': import sasxport5 "${url}${`x'}`y'_`x'.xpt"
	}
	frame DEMO_`x': frame put seqn, into(analytical_`x')
	cwf analytical_`x'
	foreach y in $files {
		frlink 1:1 seqn, frame(`y'_`x') 
	}
	frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_`x')
	frget bpq020, from(BPQ_`x')
	frget kiq022, from(KIQ_U_`x')
}
frames dir 
pwf 

Here’s the same as above, but just for a single year.

frames reset
global dir "https://wwwn.cdc.gov/Nchs/Nhanes/2009-2010/"
global files DEMO_F BPQ_F KIQ_U_F

foreach y in $files {
	mkf `y'
	frame `y': import sasxport5 "${dir}`y'.xpt"
}
frame DEMO_F: frame put seqn, into(analytical)
cwf analytical
foreach y in $files {
	frlink 1:1 seqn, frame(`y') 
}
frget wtint2yr wtmec2yr sdmvpsu sdmvstra, from(DEMO_F)
frget bpq020, from(BPQ_F)
frget kiq022, from(KIQ_U_F)
frames dir 
pwf 

Mediation analysis in Stata using IORW (inverse odds ratio-weighted mediation)

Mediation is a commonly-used tool in epidemiology. Inverse odds ratio-weighted (IORW) mediation was described in 2013 by Eric J. Tchetgen Tchetgen in this publication. It’s a robust mediation technique that can be used in many sorts of analyses, including logistic regression, modified Poisson regression, etc. It is also considered valid if there is an observed exposure*mediator interaction on the outcome.
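
At its core, IORW decomposes the total effect of the exposure on the outcome into direct and indirect (mediated) pieces by taking a difference of regression coefficients, which is exactly what the code below computes (on the log scale, since we use log-link models):

$$\beta_{\text{indirect}} = \beta_{\text{total}} - \beta_{\text{direct}}, \qquad \%\text{ mediated} = \frac{\beta_{\text{total}} - \beta_{\text{direct}}}{\beta_{\text{total}}} \times 100$$

Here β_total comes from the outcome model fit without the mediator, and β_direct comes from the same model refit with IORW weights: 1 for the unexposed, and the inverse of the mediator–exposure odds ratio for the exposed.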

There have been a handful of publications that describe the implementation of IORW (and its cousin inverse odds weighting, or IOW) in Stata, including this 2015 AJE paper by Quynh C Nguyen and this 2019 BMJ Open paper by Muhammad Zakir Hossin (see the supplements of each for actual code). I recently had to implement this in a REGARDS project using a binary mediation variable and wanted to post my code here to simplify things. Check out the Nguyen paper above if you need to modify the following code to run IOW instead of IORW, or if you are using a continuous mediation variable rather than a binary one.

A huge thank you to Charlie Nicoli MD, Leann Long PhD, and Boyi Guo (almost) PhD who helped clarify implementation pieces. Also, Charlie wrote about 90% of this code so it’s really his work. I mostly cleaned it up, and clarified the approach as integrated in the examples from Drs. Nguyen and Hossin from the papers above.

IORW using pweight data (see below for unweighted version)

The particular analysis I was running uses pweights (probability/sampling weights), so this code won’t work in data that doesn’t use weighting. It uses modified Poisson regression implemented as GLMs, which can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. a different weighting scheme, or 4. IOW instead of IORW. See the supplement in the above Nguyen paper for how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as weighted data with binary
// exposure using IORW. (Although, in all likelihood, 
// you'll have to adjust the svyset commands in the 
// program below to match your dataset.)
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
//
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 2.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Components of weighting to go into "svyset" command. 
// You might have different survey design variables in your dataset. 
global samplingweight mysamplingweightvar
global id id_num // ID for your participants
global strata mystratavar
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw_weighted
program iorw_weighted, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
*Step 1: svyset data since your dataset is weighted. If your dataset 
// does NOT require weighting for its analysis, do not use this program. 
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
//
*Step 2: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
svy linearized, subpop(${ifstatement}): logit ${exposure} ${mediator} ${covariates}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
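// (Note: e(b)[1,1] grabs the first coefficient in the model, which is 
// the mediator here only because ${mediator} is listed first in the 
// logit command above. If you reorder the model, update this index.)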
//
*Step 3: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights.  Since our dataset uses sampling
// weights, we need to multiply the dataset's weights times the IORW for the 
// exposure group. This step is fundamentally different for non-weighted 
// datasets. 
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures. 
gen weight_iorw = ${samplingweight} if ${exposure}==0
replace weight_iorw = inverseoddsratio*${samplingweight} if ${exposure}==1
//
*Step 4: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. 
// (same as direct effect, but without the IORW)
svyset ${id}, strata(${strata}) weight(${samplingweight}) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log) 
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 5: DIRECT EFFECTS; using IORW weights instead of the weighting that
// is used typically in your analysis. 
svyset ${id}, strata(${strata}) weight(weight_iorw) vce(linearized) singleunit(certainty)
svy linearized, subpop(${ifstatement}): glm ${outcome} ${exposure} ${covariates}, family(poisson) link(log)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 6: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 7: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 8: Now run the above program under the --bootstrap-- prefix. 
// (Bootstrapping is needed because the IORW weights are estimated 
// from the data, so model-based standard errors would be too narrow.)
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw_weighted
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // these are the 95% confidence intervals
// using percentiles (ie the 2.5th and 97.5th percentiles) rather than
// Stata's default normal-distribution method. The rationale for using
// percentiles rather than the normal distribution is in the 4th paragraph
// of the "analyses" section here (by refs 37 & 38):
// https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title} (bootstrap count=" e(N_reps) ")*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%" _col(50) %4.2f `beta4' "% (" %4.2f `low954' "%, " %4.2f `high954' "%)"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
noisily di " "
}
// the end.

IORW for datasets that don’t use weighting (probably the one you are looking for)

Here is the code above, except without consideration of weighting in the overall dataset. (Obviously, IORW uses weighting itself.) It uses modified Poisson regression implemented as GLMs, which can be swapped out for other models as needed. You will have to modify this script if you are using 1. a continuous exposure, 2. more than 1 mediator, 3. survey weights (see the weighted version above), or 4. IOW instead of IORW. See the supplement in the above Nguyen paper for how to modify this code for those changes.

*************************************************
// this HEADER is all you should have to change
// to get this to run as unweighted data with binary
// exposure using IORW. 
*************************************************
// BEGIN HEADER
//
// Directory of your dataset. You might need to
// include the entire file location (eg "c:/user/ ...")
// My file is located in my working directory so I just list
// a name here. Alternatively, can put URL for public datasets. 
global file "myfile.dta"
//
// Pick a title for the table that will output at the end.
// This is just to help you stay organized if you are running
// a few of these in a row. 
global title "my cool analysis, model 1"
// Components of the regression model. Outcome is binary,
// the exposure (sometimes called "dependent variable" or 
// "treatment") is also binary. This code would need to be modified 
// for a continuous exposure variable. See details in Step 1.
global outcome mi_incident
global exposure smoking
global covariates age sex
global mediator c.biomarker
global ifstatement if mi_prevalent==0 & biomarker<. & mi_incident <.
//
// Now pick number of bootstraps. Aim for 1000 when you are actually 
// running this, but when debugging, start with 50.
global bootcount 50
// and set a seed. 
global seed 8675309
// END HEADER
*************************************************
//
//
//
// Load your dataset. 
use "${file}", clear
//
// Drop then make a program.
capture program drop iorw
program iorw, rclass
// drop all variables that will be generated below. 
capture drop predprob 
capture drop inverseoddsratio 
capture drop weight_iorw
//
//
*Step 1: Run the exposure model, which regresses the exposure on the
// mediator & covariates. In this example, the exposure is binary so we are 
// using logistic regression (logit). If the exposure is a normally-distributed 
// continuous variable, use linear regression instead. 
logit ${exposure} ${mediator} ${covariates} ${ifstatement}
//
// Now grab the beta coefficient for mediator. We'll need that to generate
// the IORW weights. 
scalar med_beta=e(b)[1,1]
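// (Note: e(b)[1,1] grabs the first coefficient in the model, which is 
// the mediator here only because ${mediator} is listed first in the 
// logit command above. If you reorder the model, update this index.)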
//
*Step 2: Generate predicted probability and create inverse odds ratio and its 
// weight.
predict predprob, p
gen inverseoddsratio = 1/(exp(med_beta*${mediator}))
// 
// Calculate inverse odds ratio weights. 
// Also note that this is for binary exposures, need to revise
// this code for continuous exposures. 
gen weight_iorw = 1 if ${exposure}==0
replace weight_iorw = inverseoddsratio if ${exposure}==1
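// (Intuition: weighting the exposed by the inverse of the mediator's 
// odds ratio for exposure removes the exposure-mediator association, 
// so the IORW-weighted outcome model in Step 4 estimates the direct 
// effect.)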
//
*Step 3: TOTAL EFFECTS (ie no mediator) without IORW weighting yet. 
// (same as direct effect, but without the IORW)
glm ${outcome} ${exposure} ${covariates} ${ifstatement}, family(poisson) link(log) vce(robust)
matrix bb_total= e(b)
scalar b_total=bb_total[1,1] 
return scalar b_total=bb_total[1,1]
//
*Step 4: DIRECT EFFECTS; using IORW weights
glm ${outcome} ${exposure} ${covariates} ${ifstatement} [pweight=weight_iorw], family(poisson) link(log) vce(robust)
matrix bb_direct=e(b)
scalar b_direct=bb_direct[1,1]
return scalar b_direct=bb_direct[1,1]
//
*Step 5: INDIRECT EFFECTS
// indirect effect = total effect - direct effects
scalar b_indirect=b_total-b_direct
return scalar b_indirect=b_total-b_direct
//scalar expb_indirect=exp(scalar(b_indirect))
//return scalar expb_indirect=exp(scalar(b_indirect))
//
*Step 6: calculate % mediation
scalar define percmed = ((b_total-b_direct)/b_total)*100
// since indirect is total minus direct, above is the same as saying:
// scalar define percmed = (b_indirect)/(b_total)*100
return scalar percmed = ((b_total-b_direct)/b_total)*100
//
// now end the program.
end
//
*Step 7: Now run the above program under the --bootstrap-- prefix. 
bootstrap r(b_total) r(b_direct) r(b_indirect) r(percmed), seed(${seed}) reps(${bootcount}): iorw
matrix rtablebeta=r(table) // this is the beta coefficients
matrix rtableci=e(ci_percentile) // these are the 95% confidence intervals
// using percentiles (ie the 2.5th and 97.5th percentiles) rather than
// Stata's default normal-distribution method. The rationale for using
// percentiles rather than the normal distribution is in the 4th paragraph
// of the "analyses" section here (by refs 37 & 38):
// https://bmjopen.bmj.com/content/9/6/e026258.long
//
// Just in case you are curious, here are the ranges of the 95% CI, 
// realize that _bs_1 through _bs_3 need to be exponentiated:
estat bootstrap, all // percentiles is "(P)", normal is "(N)"
//
// Here's a script that will display your beta coefficients in 
// a clean manner, exponentiating them when required. 
quietly {
noisily di "${title} (bootstrap count=" e(N_reps) ")*"
noisily di _col(15) "Beta" _col(25) "95% low" _col(35) "95% high" _col(50) "Together"
local beta1 = exp(rtablebeta[1,1])
local low951 = exp(rtableci[1,1])
local high951 = exp(rtableci[2,1])
noisily di "Total" _col(15) %4.2f `beta1' _col(25) %4.2f `low951' _col(35) %4.2f `high951' _col(50) %4.2f `beta1' " (" %4.2f `low951' ", " %4.2f `high951' ")"
local beta2 = exp(rtablebeta[1,2])
local low952 = exp(rtableci[1,2])
local high952 = exp(rtableci[2,2])
noisily di "Direct" _col(15) %4.2f `beta2' _col(25) %4.2f `low952' _col(35) %4.2f `high952' _col(50) %4.2f `beta2' " (" %4.2f `low952' ", " %4.2f `high952' ")"
local beta3 = exp(rtablebeta[1,3])
local low953 = exp(rtableci[1,3])
local high953 = exp(rtableci[2,3])
noisily di "Indirect" _col(15) %4.2f `beta3' _col(25) %4.2f `low953' _col(35) %4.2f `high953' _col(50) %4.2f `beta3' " (" %4.2f `low953' ", " %4.2f `high953' ")"
local beta4 = (rtablebeta[1,4])
local low954 = (rtableci[1,4])
local high954 = (rtableci[2,4])
noisily di "% mediation" _col(15) %4.2f `beta4' "%" _col(25) %4.2f `low954' "%"_col(35) %4.2f  `high954' "%"  _col(50) %4.2f `beta4' " (" %4.2f `low954' ", " %4.2f `high954' ")"
noisily di "*Confidence intervals use 2.5th and 97.5th percentiles"
noisily di " rather than default normal distribution in Stata."
noisily di " "
}
// the end.