Propensity Score Methods: Working with Natural Experiments

In the natural sciences, experimentation is understood to be the foundation of how a field conducts its work. In the social sciences, however, experimentation can be more difficult. Instead of working with lasers or minerals or lab rats, we work with human beings situated in their worlds. The idea of conducting a scientific experiment on people may seem simple, but it gets difficult quickly once we start to plan it. To make a strong comparison, we would need individuals who are identical on every characteristic that could influence how they respond to the experiment. This is clearly a difficult task.

Even if it were possible to identify such a group, we would then need to give one group the intervention or treatment of our experiment and withhold it from the other group. This is even more difficult if we believe in the utility of the treatment. Consider access to computers in the classroom. Even if we were able to find two classes of students who were quite similar, we would then need to give one class computers while not providing computers to the second class. This is clearly inequitable.

There are, however, statistical methods that address such a conundrum. Many experiments in the social sciences are so-called “natural experiments.” This means that the difference in treatment occurred as a consequence of larger forces. Consider, for example, that one town approved a school budget that included new computers for all students, and the next town over failed to approve a similar measure. The students can be considered participants in a natural experiment, with the distribution of the computer “treatment” determined by the voters in their towns.

If we want to compare outcomes, we need to measure how similar the students in the two towns are to each other. Propensity scoring is a statistical technique that creates a composite score for each individual based on selected characteristics. This technique is frequently used in quasi-experimental settings, like the example with the computers, where random assignment is not (or ethically cannot be) used. The propensity score is, broadly, the probability of an individual being selected into the treatment group given the observed covariates. In our example, it is the probability that a student would attend the school that received computers, based on characteristics we can observe. This composite value can be used in a number of ways when working with quasi-experimental data.
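As a rough illustration of how such a score might be estimated, the sketch below fits a logistic regression of treatment on two covariates using only the Python standard library. Logistic regression is one common choice, not the only one. The covariates (a standardized prior test score and a free-lunch indicator), the ten students, and the function names are all hypothetical and chosen only for this example.

```python
import math

def predict(beta, x):
    """Predicted probability of treatment for covariate vector x."""
    linear = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-linear))

def fit_logistic(X, z, lr=0.1, epochs=2000):
    """Fit a logistic regression by batch gradient ascent.

    Returns [intercept, coefficient_1, coefficient_2, ...].
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for x, zi in zip(X, z):
            err = zi - predict(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical students: [standardized prior test score, free-lunch indicator]
X = [[-1.2, 1], [-0.5, 1], [0.1, 0], [0.3, 1], [0.8, 0],
     [1.1, 0], [-0.9, 0], [1.5, 0], [-0.2, 1], [0.6, 1]]
# z = 1 if the student attends the school that received computers
z = [0, 0, 1, 0, 0, 1, 0, 1, 0, 1]

beta = fit_logistic(X, z)
scores = [predict(beta, x) for x in X]  # one propensity score per student
```

In practice a statistics package would normally do the fitting; the hand-written optimizer here only keeps the sketch self-contained.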

Propensity scores are frequently used to match participants from different groups. In matching techniques, control and treatment participants are matched by propensity scores. The matched cases then constitute a new restricted data set for use in further analysis. This matching can be done with different ratios of treatment and control individuals (commonly 1:1), with different selection techniques (e.g. nearest neighbor, optimal), with or without replacement, and with different caliper values, which limit the acceptable difference in propensity scores that can still constitute a match (Beal & Kupzyk, 2014).
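The matching step can be sketched as follows. This is a minimal greedy 1:1 nearest-neighbor matcher without replacement, with a caliper expressed directly in propensity score units (an assumption made for simplicity; calipers are often defined on other scales). The participant IDs and scores are hypothetical.

```python
def match_nearest(treated, control, caliper=0.1):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping participant id -> propensity score.
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(control)  # controls not yet used in a match
    pairs = []
    # Process treated cases from highest to lowest propensity score.
    for t_id, t_ps in sorted(treated.items(), key=lambda kv: -kv[1]):
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        # Accept the match only if it falls within the caliper.
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement
    return pairs

# Hypothetical propensity scores
treated = {"t1": 0.81, "t2": 0.62, "t3": 0.35}
control = {"c1": 0.78, "c2": 0.60, "c3": 0.55, "c4": 0.10}

pairs = match_nearest(treated, control, caliper=0.05)
```

Here `t3` is left unmatched because no remaining control falls within the caliper, which illustrates how matching produces a restricted data set. Optimal matching, by contrast, would minimize the total distance across all pairs rather than proceeding greedily.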

Additionally, propensity scores can be used to weight the data. A weight makes certain cases more or less influential in the final results. In the widely used inverse probability weighting approach, each case is weighted by the inverse of the probability of receiving the treatment it actually received, so treated individuals who were unlikely to be treated, and control individuals who were likely to be treated, count for more. This approach has the benefit of maximizing the usable data, as no cases are removed from the data set; they are just weighted differently. It also preserves the variability in the propensity scores through the weights. For techniques for using propensity scores in weighting, a starting point is the work of Lunceford and Davidian (2004).
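One common weighting scheme is inverse probability weighting, sketched below under the assumption that the propensity scores are already known. Treated cases receive weight 1/e and control cases 1/(1 - e), where e is the propensity score. The eight cases and their outcomes are hypothetical and constructed so that the true effect is 5 points.

```python
def ipw_weights(z, ps):
    """Inverse probability weights: 1/e for treated, 1/(1 - e) for controls."""
    return [1.0 / p if zi == 1 else 1.0 / (1.0 - p) for zi, p in zip(z, ps)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical data: high-propensity students are mostly treated and
# also have higher baseline outcomes, which confounds a naive comparison.
z  = [1, 1, 1, 1, 0, 0, 0, 0]
ps = [0.75, 0.75, 0.75, 0.25, 0.75, 0.25, 0.25, 0.25]
y  = [85, 85, 85, 65, 80, 60, 60, 60]

w = ipw_weights(z, ps)

treated = [(yi, wi) for yi, wi, zi in zip(y, w, z) if zi == 1]
control = [(yi, wi) for yi, wi, zi in zip(y, w, z) if zi == 0]

# Unweighted difference in means mixes the treatment effect with selection.
naive = (sum(v for v, _ in treated) / len(treated)
         - sum(v for v, _ in control) / len(control))

# Weighted difference in means.
weighted = (weighted_mean(*zip(*treated)) - weighted_mean(*zip(*control)))
```

With these numbers the unweighted difference is 15 points, while the weighted difference recovers the 5-point effect built into the data. All eight cases remain in the analysis.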

Propensity scores can also be used for stratification, where categories are created based on ranges of propensity scores, and analyses are performed separately on the different strata (Rosenbaum & Rubin, 1984). Different models can then be used for different strata. For example, we could group the students in the two towns into four groups based on ranges of propensity scores. This would allow us to see whether computers had the same effect across groups of students.
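The four-group example might look like the sketch below, which assigns each student to one of four equal-width propensity score ranges and compares treated and control means within each range. Equal-width ranges are an assumption made here for simplicity; strata are also often formed from quantiles. The records are hypothetical.

```python
from collections import defaultdict

def stratum_of(ps, n_strata=4):
    """Map a propensity score in [0, 1] to a stratum index 0..n_strata-1."""
    return min(int(ps * n_strata), n_strata - 1)

def effects_by_stratum(records, n_strata=4):
    """records: iterable of (z, ps, y) tuples.

    Returns {stratum: treated mean - control mean} for every stratum
    that contains at least one treated and one control case.
    """
    groups = defaultdict(lambda: {0: [], 1: []})
    for z, ps, y in records:
        groups[stratum_of(ps, n_strata)][z].append(y)
    effects = {}
    for s, g in sorted(groups.items()):
        if g[0] and g[1]:  # both groups are needed for a comparison
            effects[s] = sum(g[1]) / len(g[1]) - sum(g[0]) / len(g[0])
    return effects

# Hypothetical (treatment, propensity score, outcome) records
records = [
    (1, 0.10, 64), (0, 0.15, 60), (0, 0.20, 61),
    (1, 0.30, 70), (0, 0.40, 66),
    (1, 0.55, 76), (0, 0.60, 70), (1, 0.70, 75),
    (1, 0.85, 84), (1, 0.90, 86), (0, 0.95, 80),
]

effects = effects_by_stratum(records)
```

Comparing the per-stratum differences shows whether the estimated effect is similar for students with low and high propensity scores.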

Finally, propensity scores can be included as a covariate in a regression equation. This method can be interpreted as estimating the treatment effect while holding the probability of being in the treatment group constant. Put another way, this method investigates whether the treatment variable matters when the likelihood of receiving treatment is held constant. However, this procedure requires that the propensity scores meet some very specific standards. For further discussion of these standards, see the work of Rubin (2001).
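A minimal sketch of this regression adjustment follows. It computes the coefficient on the treatment indicator in an ordinary least squares regression of the outcome on treatment and propensity score, using the Frisch-Waugh-Lovell partialling-out identity so that no linear algebra library is needed. The data are hypothetical and generated without noise from y = 50 + 5z + 40e, so the coefficient on treatment should come out at 5.

```python
def residualize(v, e):
    """Residuals from a simple linear regression of v on e (with intercept)."""
    n = len(v)
    mean_v, mean_e = sum(v) / n, sum(e) / n
    slope = (sum((a - mean_v) * (b - mean_e) for a, b in zip(v, e))
             / sum((b - mean_e) ** 2 for b in e))
    return [(a - mean_v) - slope * (b - mean_e) for a, b in zip(v, e)]

def adjusted_effect(y, z, ps):
    """Coefficient on z in the OLS regression y ~ 1 + z + ps."""
    ry = residualize(y, ps)  # outcome with the propensity score partialled out
    rz = residualize(z, ps)  # treatment with the propensity score partialled out
    return sum(a * b for a, b in zip(ry, rz)) / sum(b * b for b in rz)

# Hypothetical data: treated students tend to have higher propensity scores.
z  = [1, 1, 0, 1, 1, 0, 0, 0]
ps = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
y  = [50 + 5 * zi + 40 * p for zi, p in zip(z, ps)]

effect = adjusted_effect(y, z, ps)
```

The sketch assumes the outcome is linear in the propensity score, which is exactly the kind of condition that has to be checked before relying on this approach.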

Propensity scoring depends on the assumption of strong ignorability. That is, every initial difference between the treatment and control groups that, absent treatment effects, might result in differences in the outcomes must be accounted for by covariates that are included in the design and analysis. An argument for the plausibility of this assumption is necessary when using propensity scoring techniques, and, as some authors have noted, many researchers invoke this assumption without fully making their case (Thoemmes & Kim, 2011). Strong ignorability necessitates the careful and methodical selection of covariates. These covariates should not be collinear with one another, but they should be correlated with both the dependent and independent variables.
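In the potential-outcomes notation commonly used for this topic, strong ignorability can be written compactly. The symbols are introduced here for illustration and do not appear elsewhere in this post: Z is the treatment indicator, X is the set of observed covariates, Y(1) and Y(0) are the outcomes an individual would have with and without treatment, and e(X) is the propensity score.

```latex
\bigl(Y(0),\, Y(1)\bigr) \;\perp\!\!\!\perp\; Z \mid X
\qquad\text{and}\qquad
0 < e(X) = \Pr(Z = 1 \mid X) < 1
```

The first condition says that, among individuals with the same covariates, treatment assignment is unrelated to the potential outcomes. The second says that every individual has some chance of ending up in either group.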

Propensity score methodology is being used with increasing frequency in educational research. This is due to the difficulty of conducting a true randomized controlled trial of an intervention, given the ethical concerns surrounding withholding potentially beneficial educational techniques from children. However, issues of interpretability and feasibility should be considered before using this methodology. Becoming familiar with propensity scoring methods may take some time, but skillful use of the methodology can yield high-quality results in situations where randomized controlled experiments are not realistic.



Beal, S. J., & Kupzyk, K. A. (2014). An Introduction to Propensity Scores: What, When, and How. The Journal of Early Adolescence, 34(1), 66–92. doi:10.1177/0272431613503215

Imbens, G. W. (2000). The role of the propensity score in estimating dose-response functions. Biometrika, 87(3), 706–710.

Lunceford, J. K., & Davidian, M. (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: a comparative study. Statistics in Medicine, 23(19), 2937–2960.

Rosenbaum, P. R., & Rubin, D. B. (1984). Reducing bias in observational studies using subclassification on the propensity score. Journal of the American Statistical Association, 79(387), 516–524.

Rubin, D. B. (2001). Using propensity scores to help design observational studies: application to the tobacco litigation. Health Services and Outcomes Research Methodology, 2(3-4), 169–188.

Thoemmes, F. J., & Kim, E. S. (2011). A systematic review of propensity score methods in the social sciences. Multivariate Behavioral Research, 46(1), 90–118.

Additional Reading
Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127(8:2), 757–763.

This entry was posted in Research methods by Mark W. Olofson.

About Mark W. Olofson

Mark is in the third year of the EDLP Ph.D. program at the University of Vermont. His research interests include modeling teacher knowledge in technology-rich learning environments, the effects of adverse childhood experiences and residential mobility on early learners, and the globalization of public school privatization policies. When he isn't reading, writing, or discussing education, Mark enjoys backpacking, whitewater paddling, and bicycle touring.