Part 5: Baseline characteristics in a Table 1 for a prospective observational study

What’s the deal with Table 1?

Tables describing the baseline characteristics of your analytical sample are ubiquitous in observational epidemiology manuscripts. They are critical to helping the reader understand the study population and the potential limitations of your analysis. A table of baseline characteristics is so important that it’s typically the first table to appear in any observational epidemiology (or clinical trial) manuscript, so it’s commonly referred to as a “Table 1”. Table 1s matter because they help readers judge the internal validity of your study. If your study has poor internal validity, then your results and findings aren’t useful.

The details here are specific to prospective observational studies (e.g., cohort studies), but are generalizable to other sorts of studies (e.g., RCTs, case-control studies).

If you are a Stata user, you might be interested in my primer on using Table1_mc to generate a Table 1.

Guts of a Table 1

There are several variations of the Table 1; here’s how I do it.

COLUMNS: This is your exposure of interest (i.e., independent variable). This is not the outcome of interest. There are a few ways to divvy up these columns, depending on what sort of data you have:

  • Continuous exposure (e.g., baseline LDL-cholesterol level): Cut this up into quantiles. I commonly use tertiles (3 groups) or quartiles (4 groups). People have very, very strong opinions about whether you use tertiles or quartiles. I don’t see much reason to fuss over either. Of note, there usually is no need to transform your data prior to splitting into quantiles. (And, log transforming continuous data that includes values of zero will replace those zeros with missing data!)
  • Discrete exposure:
    • Dichotomous/binary exposure (e.g., prevalent diabetes status as no/0 or yes/1): This is easy: one column for 0 and one column for 1. Make sure to use descriptive column headers like “No prevalent diabetes” and “Prevalent diabetes” instead of the numbers 0 and 1.
    • Ordinal exposure, not too many groups (e.g., never smoker/0, former smoker/1, current smoker/2): This is also easy: one column each for 0, 1, and 2. Make sure to use descriptive column headers.
    • Ordinal exposure, a bunch of groups (e.g., extended Likert scale ranging from super unsatisfied/1 to super satisfied/7): This is a bit trickier. On one hand, there isn’t any real limitation on how wide a table can be in a software package, so you could have columns 1, 2, 3, 4, 5, 6, and 7. This is a bit unwieldy for the reader, however. I personally think it’s better to collapse really wide groupings into a few groups. Here, you could collapse all of the negative responses (1, 2, and 3), leave the neutral response as its own category (4), and collapse all of the positive responses (5, 6, and 7). Use descriptive column headers here too, and be sure to describe how you collapsed groups in the footer of the table.
    • Nominal exposure, not too many groups (e.g., US Census regions of Northeast, Midwest, South, and West): This is easy; just use the groups. Be thoughtful about using a consistent order of these groups throughout your manuscript.
    • Nominal exposure, a bunch of groups (e.g., favorite movie): As with ‘Ordinal exposure, a bunch of groups’ above, I would collapse these into groups that relate to each other, such as genre of movie.
  • (Optional) Additional first column showing “Total” summary statistics. This presents summary statistics for the entire study population as a whole, instead of by quantile or discrete groupings. I don’t see much value in these and typically don’t include them.
  • (Optional) Additional first column showing count of missingness for each row. This presents a count of missing values for that entire row. I think these are nice to include, but they don’t show missingness by column so are an imperfect way to show missingness. See the section below on ‘cell contents’ for alternative strategies to show missingness.
    • Note: Table1_mc for Stata cannot generate a “missingness” column.
  • (Optional, but suggest to avoid) Additional final column showing P-values for comparisons across each row. These have fallen out of favor for clinical trial Table 1s. I see little value in them for prospective observational studies and also avoid them.
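
The column groupings described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a replacement for your statistical package. The function names, the LDL values, and the Likert labels are all made up for this example, and different packages compute quantile cut points slightly differently, so your group boundaries may not match exactly.

```python
from bisect import bisect_left
from statistics import quantiles


def assign_quantile(values, n_groups=3):
    """Return a 1-based quantile group for each value (1 = lowest group)."""
    # quantiles() returns n_groups - 1 cut points (default 'exclusive' method).
    cut_points = quantiles(values, n=n_groups)
    # bisect_left puts a value equal to a cut point into the lower group.
    return [bisect_left(cut_points, v) + 1 for v in values]


def collapse_likert(score):
    """Collapse a 7-point Likert score into three descriptive groups."""
    if score <= 3:
        return "Unsatisfied (1-3)"
    if score == 4:
        return "Neutral (4)"
    return "Satisfied (5-7)"


# Hypothetical baseline LDL-cholesterol values (mg/dL), left untransformed.
ldl = [70, 85, 90, 100, 110, 120, 130, 145, 160]
ldl_tertile = assign_quantile(ldl, n_groups=3)

# Hypothetical Likert responses collapsed into wider column groups.
likert_group = [collapse_likert(s) for s in range(1, 8)]
```

Whatever collapsing rule you use, write it down in the table footer so the reader can see how the columns were built.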

ROWS: These include the N for each column, the range of values for continuous exposures, and baseline values. Note that the data here are from baseline.

  • N for each group. Make sure that these Ns add up to the expected N in your analytical population at the bottom of your inclusion flow diagram. If they don’t match, you’ve done something wrong.
  • (For continuous exposures) Range of values for your quantiles. And yes, I mean the minimum and maximum for each quantile, not IQRs.
  • Sociodemographics (age, sex, race, ± income, ± region, ± education level, etc.)
  • Anthropometrics (height, weight, waist circumference, BMI, etc.)
  • Medical problems as relevant to your study (eg, proportion with hypertension, diabetes, etc.)
  • Medical data as relevant to your study (eg, laboratory assays, details with radiological imaging, details from cardiology reports)
  • Suggest avoiding the outcome(s) of interest as additional rows. I think that presenting the outcomes in this table is inadequate. I prefer to have a separate table or figure dedicated to the outcome of interest that goes much more in-depth than a Table 1 does. Plus, the outcome isn’t ascertained at baseline in a prospective observational study, and describing the population at baseline is the general purpose of Table 1.
  • And for the love of Pete, please make sure that all covariates in your final model appear as rows. If you have a model that adjusts for Epworth Sleepiness Score, for example, make sure that fits in somewhere above.
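
As a rough sketch of the first two rows, here is one way to get the N and the minimum/maximum of the exposure for each group, plus a check that the Ns add up to the analytical N from your flow diagram. The CRP values, the tertile assignments, and the expected N of 9 are invented for illustration, as is the function name.

```python
from collections import defaultdict


def n_and_range_by_group(exposure, group):
    """Return {group: {"n", "min", "max"}} for the exposure within each group."""
    values_by_group = defaultdict(list)
    for value, g in zip(exposure, group):
        values_by_group[g].append(value)
    return {
        g: {"n": len(vals), "min": min(vals), "max": max(vals)}
        for g, vals in sorted(values_by_group.items())
    }


# Hypothetical baseline CRP values (ng/mL) and their tertile assignments.
crp = [0.4, 0.9, 1.1, 1.8, 2.5, 3.0, 4.2, 6.5, 9.8]
crp_tertile = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# Hypothetical N from the bottom of the inclusion flow diagram.
expected_n = 9

summary = n_and_range_by_group(crp, crp_tertile)
total_n = sum(row["n"] for row in summary.values())
if total_n != expected_n:
    raise ValueError(f"Table 1 Ns sum to {total_n}, expected {expected_n}")
```

Note that the range row reports the minimum and maximum within each group, not an IQR.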

The first column of your Table 1 will describe each row. The appearance of each description will vary based upon the type of data you have.

  • Overall style of row descriptions as it appears in the first column:
    • N row – I suggest simply using “N”, though some folks use N (upper case) to designate the entire population and n (lower case) to designate subpopulations, so perhaps you might opt to put “n”.
    • Continuous variables (including the row for range) – I suggest a descriptive name and the units. Eg, “Height, cm”
    • Discrete variables – I suggest a descriptive name alone. Some opt to put a hint to the contents of the cell here (eg, adding a percentage sign such as “Female sex, %”), but I think that is better included in the footer of the table. This will probably be determined by the specific journal you are submitting to.
      • Dichotomous/binary values – In this example, sex is dichotomous (male vs. female) since that’s how it has historically been collected in NIH studies. For dichotomous variables, you can include either (1) a row for ‘Male’ and a row for ‘Female’, or (2) a row for just one of the two sexes (eg, just ‘Female’), since the other sex is implied by the remainder.
      • Other discrete variables (eg, ordinal or nominal) – In this example, we will consider the nominal variable of Race. I suggest having a leading row that describes the following rows (eg, “Race group”), then adding two spaces before each following race group so the nominal values for the race groups appear nested under the heading.
    • (Optional) Headings for groupings of rows – I like including bold/italicized headings for groupings of data to help keep things organized.

Here’s an example of how I think a blank table should appear:

Table 1 – Here is a descriptive title of your Table 1 followed by an asterisk that leads to the footer. I suggest something like “Sociodemographics, anthropometrics, medical problems, and medical data ascertained at baseline among [#] participants in [NAME OF STUDY] with [BRIEF INCLUSION CRITERIA] and without [BRIEF EXCLUSION CRITERIA] by [DESCRIPTION OF EXPOSURE LIKE ‘TERTILE OF CRP’ OR ‘PREVALENT DIABETES STATUS’]*”

                    Missing, N    Tertile 1    Tertile 2    Tertile 3
Range, ng/mL
Age, y
Female sex
Race group
Height, cm
Weight, kg
BMI, kg/m²
Medical problems
  [List out here]
Medical data
  [List out here]

*Footer of your Table 1. I suggest describing the appearance of the cells, eg “Range is minimum and maximum of the exposure for each quantile. Presented as mean (SD) for normally distributed and median (IQR) for skewed continuous variables. Discrete data are presented as column percents.”

Cell contents

The cell contents vary by type of variable and your goal in this table:

  • Simplicity as goal:
    • Normally distributed continuous variables: Mean (SD)
    • Non-normally distributed continuous variables: Median (IQR)
    • Discrete variables: Present column percentages, not row percentages. For example, we’ll consider “income >$75k” by tertile of CRP. A column percentage would show the percentage of participants in that specific tertile who have an income >$75k. A row percentage would show the percentage of participants with income >$75k who were in that specific tertile.
  • Clarity of completeness of data as goal (would not also include “missing” column if doing this)
    • Continuous variables: Present as mean (SD) or median (IQR) as outlined above based upon normality, but also include an ‘n’. Example for age: “75 (6), n=455”
      • Note: Table1_mc in Stata cannot report an ‘n’ with continuous variables.
    • Dichotomous variables: Present column percentage plus ‘n’. Example for female sex: “45%, n=244”.
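
To make the column-versus-row distinction concrete, here is a small sketch with made-up counts for “income >$75k” by tertile of CRP, along with one possible way to build a “mean (SD), n=” cell that only counts non-missing values. The function names and all numbers are hypothetical.

```python
from statistics import mean, stdev


def column_percent(n_with_trait, n_in_column):
    """Percent of participants in this column (exposure group) with the trait."""
    return 100 * n_with_trait / n_in_column


def row_percent(n_with_trait, n_with_trait_all_columns):
    """Percent of all participants with the trait who fall in this column."""
    return 100 * n_with_trait / n_with_trait_all_columns


def mean_sd_cell(values):
    """Format 'mean (SD), n=...' using only non-missing (non-None) values."""
    observed = [v for v in values if v is not None]
    return f"{mean(observed):.0f} ({stdev(observed):.0f}), n={len(observed)}"


# Hypothetical counts: participants per tertile and those with income >$75k.
n_per_tertile = [100, 100, 100]
n_high_income = [40, 30, 10]

# Column percentages: what belongs in Table 1.
col_pct = [column_percent(k, n) for k, n in zip(n_high_income, n_per_tertile)]

# Row percentages: shown only for contrast, not for Table 1.
row_pct = [row_percent(k, sum(n_high_income)) for k in n_high_income]

# Hypothetical ages with one missing value.
age_cell = mean_sd_cell([70, 75, 80, None])
```

Here the column percentages are 40%, 30%, and 10%, while the row percentages are 50%, 37.5%, and 12.5%. Only the first set describes each exposure group.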

A word on rounding: I think there is little value in including numbers after the decimal place. I suggest aggressively rounding at the decimal for most things. For example, for BMI, I suggest showing “27 (7)” and not “26.7 (7.2)”. For things that are measured as whole numbers, I strongly recommend reporting whole numbers. For example, BP is always measured as a whole number, so reporting out a tenths place for BP isn’t of much value: systolic BP is measured as 142, 112, and 138, not 141.8, 111.8, and 138.4. For discrete variables, I always round the proportion/percentage at the decimal. If a group has some participants but the percentage would round to zero, I report it as “<1%”; I report “0%” only if there are none in that group.

The one exception to my aggressive “round at the decimal place” strategy is variables that are commonly reported past the decimal place, such as many laboratory values. Serum creatinine is commonly reported to the hundredths place (e.g., “0.88”), so report the summary statistic for that value to the hundredths place, like 0.78 (0.30).
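
These rounding rules are easy to encode. The sketch below is one possible take, with hypothetical function names and numbers. One assumption to flag: it labels any nonzero percentage below 1 as “<1%”, including values between 0.5% and 1% that would otherwise round up to 1%.

```python
def format_percent(count, total):
    """Whole-number percent, with '<1%' for small nonzero groups."""
    if count == 0:
        return "0%"
    pct = 100 * count / total
    if pct < 1:  # assumption: any nonzero value under 1% is shown as '<1%'
        return "<1%"
    return f"{pct:.0f}%"


def format_mean_sd(mean_value, sd_value, decimals=0):
    """Format 'mean (SD)' with a chosen number of decimal places."""
    return f"{mean_value:.{decimals}f} ({sd_value:.{decimals}f})"


# Hypothetical examples.
bmi_cell = format_mean_sd(28.4, 5.8)                     # whole numbers
creatinine_cell = format_mean_sd(0.78, 0.30, decimals=2)  # hundredths place
rare_cell = format_percent(2, 500)                        # some, but under 1%
none_cell = format_percent(0, 500)                        # truly none
```

The `decimals` argument is where the exception for laboratory values comes in: leave it at 0 for most rows and raise it only for variables that are conventionally reported past the decimal place.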