Part 4: Defining your population, exposure, and outcome

Getting the population, exposure, and outcome correct in your analytical dataset, and being able to come back and fix goofs later

Defining a study population, exposure variable, and outcome variable is a critical early step after determining your analysis plan. Most epidemiology projects come as a huge set of datasets, and you’ll probably need to join multiple files into one when defining your analytical population. Defining your analytical population is an easy place to make errors so you’ll want to have a specific script that you can come back and edit again if and when you find goofs.

For the love of Pete — Please generate your population, exposure, and outcome variables using a script so you can go back and reproduce these variables and fix any bugs you might find!

When you make these variables, you’ll likely need to combine several datasets. This will require mastery of importing datasets (if not in the native format for your statistical program) and combining datasets. For Stata, this means using –import– and –save– commands to bring everything over into Stata format, and then using –merge– commands to combine multiple datasets.

Make a variable for your population that is 0 (not included) or 1 (included)

One option in generating your dataset is to drop everyone who isn’t in your dataset. I recommend against dropping individuals who aren’t in your dataset. Instead, create a variable to define your population. Name it something simple like “included”, “primary population”, “group_a” or whatnot. If you will have multiple populations (say, one defined by prevalent hypertension using JNC7 vs. ACC/AHA 2017 hypertension thresholds), then you should have a variable for each addended with a simple way to tell them apart. Like “group_jnc7” and “group_aha2017”.

Useful code in R and Stata to do this:

  • Count
  • Generate and replace (Stata), mutate (R)
  • Combine these with assigning single equals sign “=” (Stata & R, I say out loud “assign” when using this) and “<-" (R)
  • use –if–, –and–, & –or– statements
  • Tests of equality: >, =, <=, != (not), == ("equals exactly"), not single equal sign

Example Stata code to count # of people with diabetes, generate a variable for group_a and assign someone to group_a if they have diabetes.

count if diabetes==1
gen group_a=0
replace group_a=1 if diabetes==1

Here’s example R code to do the same (df=data frame).

nrow( df %>% filter(diabetes == 1) )
df = df %>% mutate(group_a = ifelse(diabetes == 1, 1, 0) )

Make an inclusion flowchart

These are essential charts in observational epidemiology. As you define your population, generate this sort of figure. Offshoots of the nodes define why folks are dropped from the subsequent node. Here’s how I approach this, folks might have different approaches:

  • Node 1 is the overall population.
  • Node 2 is individuals who you would not drop for baseline eligibility reasons (had prior event that discounts them or missing data to prevent assessment of their eligibility)
  • Node 3 is individuals who you would not drop because you can assess them for necessary follow-up (incomplete follow-up, died before required follow-up time, missing data)
  • Node 4 is individuals who you would not drop because they had all required exposure covariates (if looking at stroke by cholesterol level, people who all have cholesterol). This is your analytical population.

If you have two separate populations (eg, different hypertension populations by JNC7 or ACC/AHA 2017), you might opt to make two entirely separate figures. If you have slightly different populations because of multiple exposures (e.g., 3 different inflammatory biomarkers, but you have different missingness between the 3), you might have the last node fork off into different nodes, like this:

I generate these via text output in Stata then manually generate them in PowerPoint.

Defining exposure and outcome

This seems simple, but define clearly what your exposure is and your outcome is. Each should have a simple 0 or 1 variable (if dichotomous) with an intuitive name. You might need 2 separate outcomes if you are using different definitions, like “incident_htn_jnc7” and “incident_htn_aha2017”.

Table 1

“Table 1” shows core features of the population by the exposure. Don’t include the outcome as a row, but include demographics and key risk factors/covariates for outcome (eg if CVD, then diabetes, blood pressure, cholesterol, etc.). Some folks include a 2nd column that presents the N for that row. Some folks also include a P-value comparison as a final row. I tend to generate the P value every time but only present it if the reviewers ask for it.

In Stata, I use the excellent table1_mc program to generate these, which you can read about here. For R, I am told that gtsummary works well.

Part 2: Effective collaborations in epidemiology projects

Pin down your authorship list

Determine authorship list before you lift a finger on any analysis and get buy-in from collaborators on that list.

Stay organized

Have one folder where you save everything, and have subfolders within that for groups of documents. I suggest these subfolders:

  • Manuscript
  • Abstract
  • Data and analysis
  • Paper proposal (for REGARDS projects & others with manuscript proposal documents)

…and keep relevant documents within each one. You might want to put an “archive” folder within each subfolder (e.g., manuscript\archive) and move old drafts into the archive folder to reduce clutter.

Give documents a descriptive name. Don’t call it “manuscript [versioning system].docx”– use terms for your projects. If you are doing a REGARDS paper looking at CRP and risk of hypertension, name it “regards crp htn abstract [versioning system].docx”.

Use an intuitive versioning system. I like revision # then version # (eg r00 v01). Many people use dates. If you use dates to keep track of versions, append your documents with the ISO8601 date convention of YYYY-MM-DD. Trust me. Lots of details on this post.

Have realistic goals and stick to deadlines

Come up with some firm deadlines and do your best to stick with them. Here are some goals to accomplish in moving a project forward, if you wanted an example.

  • Combine all existing written documents (eg, proposal) into one manuscript.
  • Draft blank tables and decide what figures you want to make. Write methods.
  • Generate baseline characteristics. Describe in results.
  • Generate descriptive statistics/histograms for your exposure and outcome(s). Describe in results.
  • Estimate primary and secondary outcome(s). Describe in results.
  • Complete secondary analyses. Describe in results.
  • Finish first draft. Send to your primary mentor or collaborator.
  • Integrate feedback from primary mentor or collaborator into a second draft. Circulate to coauthors.
  • Integrate feedback from coauthors into a document to be submitted to a journal.
  • Format your manuscript for a specific journal and submit it. (This takes a surprisingly large amount of time.)

Managing your mentor: Send reminder emails more frequently than you probably realize

I block off time to work on your stuff, but clinical priorities or other professional/parenting challenges might bump that time. I try to find other time to work on your stuff, but a big crisis might mean that I don’t have a chance to reschedule.

Please, please, please, please email me early and persistently about your projects. This will never annoy me — these emails are very helpful. Quick focused emails are helpful here, especially if you re-forward your prior email threads. Eg, “Hi Tim, wondering if you had a chance to take a look at that draft from last week, reforwarded here. Thanks, [name].”

Working on revisions

Use tracked changes

And remember to turn them on when you send around a draft!

Append your initials to the end of the document that you are editing for someone else

For me, I’ll change a name to “My cool document v1 tbp.docx”.