## This should be simpler.

Epidemiology and biostatistics are awash with synonyms, and each institution has its own preferred nomenclature to describe the same general concepts. I started this page as a central place to document the various terms by concept. I plan to revisit and update it over time.

## Regression

### Fundamentals

You probably learned the fundamentals of regression in introductory algebra but may not realize it. Remember drawing a graph from a slope-intercept equation? *Draw a graph where Y is equal to 1/4x plus 5*. (Here is the relevant Khan Academy Algebra I video about this.) You take the general equation:

Y = mx + b

…where Y is the value on the y-axis, x is the value on the x-axis, m is the slope of the line, and b is where the line crosses the y-axis. The equation you will write is:

Y = 1/4x + 5

…and you will draw:
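A few points on that line are enough to sketch it by hand. Here is a minimal Python sketch that computes them (Python is only used for illustration here, and the function name `line` is made up):

```python
def line(x, m=0.25, b=5):
    """Slope-intercept form: Y = m*x + b, here with m = 1/4 and b = 5."""
    return m * x + b

# Every 4 steps along x, the line rises by 1 (slope = 1/4).
for x in [0, 4, 8]:
    print(x, line(x))
```

The first point, where x is 0, is the intercept b = 5.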

Sound familiar? When you do a linear regression, you do the same thing, except that you *regress Y on X*, or:

Y = β_{1}x_{1} + β_{0}

Plugging in real variables, say you want to *figure out what a predicted cholesterol level will be for folks of a given age*. You would *regress cholesterol level on age*:

Cholesterol level = β_{1}*Age + β_{0}

Here, β_{1} is the slope of the line for age (essentially the same as the m in Y=mx+b) and β_{0} is the intercept on the Y-axis, essentially the same as the b in Y=mx+b. When you run a regression in Stata, you type

regress y x

or here,

regress cholesterol age

Let’s say that Stata spits out something like:

          Source |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx
    -------------+------------------------------
           Model |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx
        Residual |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx
    -------------+------------------------------
           Total |  xxxxxxxxxxxxxxxxxxxxxxxxxxxx
    ------------------------------------------------------------------------------
     cholesterol |      coeff        se        t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |      0.500  xxxxxxxx    xxxxx    0.000       0.4000     0.60000
           _cons |        100  xxxxxxxx    xxxxx    0.000       90.000     110.000
    ------------------------------------------------------------------------------

The β_{1} coefficient for age is 0.5. The intercept, or β_{0}, is 100. You would interpret this as *cholesterol level = 0.5 × age in years + 100*. You could plot this using your Algebra I skills.

Cholesterol = 0.5*age + 100

Or you can substitute in actual numbers. What is the predicted cholesterol at age 50? Answer: 125.
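To see roughly what a regression is doing behind the scenes, here is a minimal Python sketch of a least-squares line fit. The five data points are invented so that the fit lands on the same 0.5 and 100 as the example output above. `fit_line` and `predict` are illustrative names, and none of this is meant to describe how Stata works internally.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one X variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared X deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(age, slope, intercept):
    """Plug an age into: cholesterol = slope * age + intercept."""
    return slope * age + intercept

# Made-up data for illustration only (not real measurements).
ages = [30, 40, 50, 60, 70]
cholesterol = [116, 118, 127, 128, 136]

slope, intercept = fit_line(ages, cholesterol)
predicted_at_50 = predict(50, slope, intercept)
print(slope, intercept, predicted_at_50)
```

The points do not sit exactly on a line, but the best-fitting line through them has slope 0.5 and intercept 100, so the predicted value at age 50 is 0.5 × 50 + 100 = 125.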

If you want to make it more complex and add more variables to explain cholesterol level, it’s no longer a straight line on a graph, but the concept is the same. A *multiple linear regression* adds more X variables. You can *figure out what a predicted cholesterol level will be for folks by age, sex, and BMI*. You would *regress cholesterol level on age, sex, and BMI*. (You would code sex as 0 or 1, like female = 1 and male = 0.)

Y = β_{1}x_{1} + β_{2}x_{2} + β_{3}x_{3} + β_{0}

Or,

Y = β_{1}*Age + β_{2}*Sex + β_{3}*BMI + β_{0}

You get the idea.
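Once you have the coefficients, prediction is still just plugging in numbers. Here is a minimal Python sketch. The coefficient values are made up purely for illustration (they are not estimates from any real data), and `predict_cholesterol` is a hypothetical name.

```python
# Hypothetical coefficients, invented for illustration only.
B0 = 80.0     # intercept
B_AGE = 0.5   # change in cholesterol per year of age
B_SEX = -5.0  # sex coded as female = 1, male = 0
B_BMI = 1.0   # change in cholesterol per unit of BMI

def predict_cholesterol(age, sex, bmi):
    """Y = B_AGE*Age + B_SEX*Sex + B_BMI*BMI + B0"""
    return B_AGE * age + B_SEX * sex + B_BMI * bmi + B0

female = predict_cholesterol(age=50, sex=1, bmi=25)
male = predict_cholesterol(age=50, sex=0, bmi=25)
print(female, male)
```

With sex coded 0 or 1, the sex coefficient is simply the difference in predicted cholesterol between a female and a male of the same age and BMI.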

### Names of Y and X

This is what irks me. There are so many synonyms for Y and X variables. Here is a chart that I’ll update over time with synonyms seen in the wild.

| Y | X |
|---|---|
| Dependent | Independent |
| Outcome | Predictor |
| | Covariate |
| | Factor |
| | Exposure variable |
| | Explanatory variable |