Part 6: Visualizing your continuous exposure at baseline

Visualization of your continuous exposure in an observational epidemiology research project

As we saw in Part 5, it’s important to describe the characteristics of your baseline population by your exposure. This helps readers get a better understanding of internal validity. For folks completing analyses with binary exposures, part 6 isn’t for you. If your analysis includes continuous exposures or ordinal exposures with at least a handful of options, read on.

I think it’s vitally important to visualize your exposure before moving any further forward with your analyses. There are a few reasons that I do this:

  1. Understand the distribution of your exposure. Looking at the raw spread of your data will help you understand if it has a relatively normal distribution, if it’s skewed, if it is multimodal (eg, has several peaks), or if it’s just plain old weird looking. If your exposure is non-normally distributed, then you’ll need to consider the implications of the spread in your analysis. This may mean log-transforming, square root-transforming (if you have lots of zeros in your exposure’s values), or some other sort of transformation.
    • Note: make sure to visualize your transformed exposure value!
  2. Look for patterns that need exploring. You might notice a huge peak at a value of “999”. This might represent missing values, which will need to be recoded. You might notice that values towards the end of the tails of the bell curve might spike up at a particular value. This might represent values that were really above or below the lower limit of detection. You’ll need to think about how to handle such values, possibly recoding them following the NHANES approach as the limit of detection divided by the square root of 2.
  3. Understand the distribution of your exposure by key subgroups. In REGARDS, our analyses commonly focus on racial differences in CVD events. Because of this, I commonly visualize exposures with overlaid histograms for Black and White participants, and see how these exposure variables differ by race. This could easily be done for other sociodemgraphics (notably, by sex), anthropometrics, and disease states.

My approach to building histograms in Stata

A conventional histogram splits continuous data into several evenly-spaced discrete groups called “bins”, and visualizes these discrete groups as a bar graph. Let’s see how to do this in Stata, with some randomly generated data that approximates a normal distribution. While we’re at it, we’ll make a variable called “group” that’s 0 or 1 that we’ll use later. (Also note that this dataset doesn’t use discrete values, so I’m not specifying the discrete option in my “hist” code. If you see spaces in your histogram because you are using discrete values, add “, discrete” after your variable name in the histogram line.)

clear all
// set observations to be 1000:
set obs 1000
// set a random number seed for reproducibility: 
set seed 12345
// make a normally-distributed variable, with mean of 5 and SD of 10:
gen myvariable = rnormal(5,10)
// make a 0 or 1 variable for a group, following instructions for "generate ui":
// https://blog.stata.com/2012/07/18/using-statas-random-number-generators-part-1/
gen group = floor((1-0+1)*runiform() + 0)
// now make a histogram
hist myvariable

Here’s the overall histogram:

On the X axis you see the ranges of the values of variable of interest, from around -30 to about +40. On the Y axis you see the density plot. I want to show this same figure by group, however, and the bins are not currently transparent. You won’t be able to tell one group from another. So, in Stata, you need to use the “twoway histogram” option instead of just “histogram” and specify transparent colors of the figure using the %NUMBER notation. We’ll also add a legend. We’ll set the scheme to s1mono to get rid of the ugly default blue surrounding box as well. Example:

// set the scheme to s1mono:
set scheme s1mono
// now make your histogram:
twoway ///
(hist myvariable if group==0, color(blue%30)) ///
(hist myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Here’s what you get:

You can modify things as needed. Something you might consider is changing the density to count or frequency, which is done by adding “frequency” or “percent” after the commas but before the colors. You might also opt to select different colors, which you can read about selection of colors in this editorial I wrote with Mary Cushman.

Considerations for designing histograms

One question is how many bins you want. I found this nice 2019 article by Regina L. Nuzzo, PhD (PDF link, PubMed listing) that goes over lots of considerations for the design of histograms. I specifically like the Box, which lists equations to determine number of bins and bid width. In general, if you have too many bins, your data will look choppy:

// using 100 bins here
twoway ///
(hist myvariable if group==0, color(blue%30) bin(100)) ///
(hist myvariable if group==1, color(red%30)  bin(100)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

And if you have too few, you won’t be able to make sense of the overall structure of the data.

// using 2 bins here
twoway ///
(hist myvariable if group==0, color(blue%30) bin(2)) ///
(hist myvariable if group==1, color(red%30)  bin(2)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Be thoughtful about how thinly you want to splice your data.

A quick note on kernel density plots

Kernel density plots have the same idea to histograms, except it shows a smoothed line over your data’s distribution instead of a histogram. I prefer to use histograms when looking at laboratory data, since kernel density plots can smooth over outliers described in point #2 above. The code for a kernel density plot in Stata is nearly identical to the “twoway hist” code above.

twoway ///
(kdensity myvariable if group==0, color(blue%30)) ///
(kdensity myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

Output:

You can even combine histograms and kernel density plots!

twoway ///
(hist myvariable if group==0, color(blue%30)) ///
(hist myvariable if group==1, color(red%30)) ///
(kdensity myvariable if group==0, color(blue%30)) ///
(kdensity myvariable if group==1, color(red%30)) ///
, ///
legend(order(1 "Group==0" 2 "Group==1"))

I’ve never done this myself for a manuscript, but just showing that it’s possible.