Observational Causal Inference with Machine Learning

Cory Bonn
Data Scientist

Imagine you are the CEO of an online education startup and are interested in comparing the effects of different courses you offer on students’ subsequent career successes. For example, you might want to know whether completion of particular courses results in higher salaries or shorter times to employment offers. It is tempting to simply measure the size of salaries or times to employment offers and conclude that courses with the highest-performing students offer the most effective instruction. However, because students choose their own courses for themselves, as opposed to being randomly assigned each course, average differences in employment and salary outcomes across courses might reflect pre-existing differences in students that led them to choose each course. For example, course selections and career outcomes may both be driven by individual differences in students’ aptitudes or experience.

These alternative, explanatory variables (here, pre-existing individual differences) are confounds, and controlling them (without the benefit of a randomized experiment) is the main challenge in causal inference research.

At a high level, there are two common approaches to handling confounds: 

The first is to balance the data on confounding variables by limiting the analysis to reasonably similar individuals in an attempt to uncover a hidden randomized or blocked experiment within the existing data. The most popular family of methods includes techniques for matching treatment to control cases, which prunes control observations for which there is little to no overlap with treatment observations on observed covariates, propensity scores, or both. 

The second technique is to adjust the estimated causal effect size by analyzing data within particular subgroups of covariate space (known as stratification) or by including the confounding variables in a fitted (generalized) linear model, such as treating confounders as additional predictors in a regression of outcomes on treatment status. When the effect of covariates on treatment selection or outcomes is a nonlinear function that must be inferred using machine learning techniques–a popular approach is to use one or more ad hoc models built on observed covariates to provide more flexible means of inferring the estimated treatment effect for different regions of covariate space.

Provided that you have accounted for all necessary competing explanations (as well as made some assumptions, e.g., that the treatment-effect size for each individual does not change regardless of the treatment assignment mechanism), these techniques can be an effective means of recovering estimates of causal effects in the presence of nonrandom assignment, as in observational causal inference.

Balancing on observed confounders using matching techniques

Following the preceding example, imagine that you, as CEO, administered a survey to program graduates that asked about students’ post-course salaries in the year after taking one of two, similar PyTorch programming courses. These courses have different difficulty ratings. In addition, you have access to their standardized performance data from final exams in a prerequisite course. You would like to know which PyTorch course led to better salary outcomes. Unfortunately, it looks like test scores from the prerequisite course predict post-course salaries, so average performance qualifies as a confounder. Even more unfortunate is the fact that the less difficult course was more popular overall and somewhat more popular among lower-scoring students, so the two groups are highly unbalanced.

Since this is a simulation, we know the system of equations generating the data below. The resulting simulated, raw data are plotted in Figure 1.1.

\[\begin{eqnarray}\text{Performance} & \sim & N(\mu = 0, \sigma = 1) \\\text{Propensity Score} & = & logit(-2 + \text{Performance}) \\\text{Course Chosen} & \sim & Bernoulli(\text{Propensity Score})\\\text{Expected Salary} & = & 160 + 70 \times \text{Course Chosen} + 50 \times \text{Performance} \\\text{Observed Salary} & \sim & N(\mu = \text{Expected Salary}, \sigma = 40)\end{eqnarray}\]

Figure 1.1. Raw, simulated salary data.

In a matching procedure, the investigator selects a limited number of points in the control condition closest to the points in the treatment condition in covariate space, discarding control points without matches. Here, provided that there is a reasonable amount of overlap in the distributions of treatment and control points in final exam scores from the prerequisite course, the remaining points will be approximately balanced, and this confounder can be ignored in any subsequent statistical tests.

Though many matching algorithms exist and are an active area of research, the standard technique we use here for demonstration is Mahalanobis-distance matching (using the MatchIt package, Ho et al. 2011). In the case of one covariate (here, prerequisite exam performance), Mahalanobis distance is simply the distance between two points in standard deviation units, though the metric generalizes to multiple covariates and is often used for calculating (Euclidean) distance in standard deviation units between observations in multivariate data. The selected subset of control observations connected to their closest matches are displayed in Figure 1.2, alongside a plot comparing the two distributions while ignoring exam-performance scores.

Figure 1.2. Matched pairs minimizing Mahalanobis distance on performance.

This conceptually simple approach has a number of advantages. First, there is no need to explicitly model any relationship between confounders and either course selection or outcome. As long as there is enough overlap among samples in covariate space, good matching is possible. In addition, when such good matches exist, the investigator increases the power of subsequent statistical tests in the same way that a priori matching in a case-control design with (pseudo-)random assignment increases power and thus testing efficiency. This advantage can outweigh the cost of discarding data.

Though matching is straightforward when the number of features is small, it can be more difficult to guarantee increased balance over all features when the number of confounding features grows quickly relative to the number of samples. In high-dimensional data, match quality decreases due to this sparsity. An alternative approach is to model the relationship between confounders and treatment selection to derive a propensity score, or a predicted probability of selecting a given treatment, and matching only on this score to reduce complexity. Matching on the propensity score alone has become an enormously popular technique, though there are costs to this simplifying procedure. Because the analyst is not directly matching on covariates, propensity-score matching procedures cannot guarantee increased balance in the original covariate space; in extreme cases of over-pruning of observations, balance can actually worsen (King and Nielsen 2019). A final limitation of the technique is that because data focus on a particular region of covariate space of overlapping observations, the estimate of the causal effect size may not generalize well beyond this subregion of feature space. (For further resources on matching methods, see Iacus, King, and Porro 2019.)

Adjusting for confounders by including them in models

In contrast to matching, incorporating confounding variables explicitly into models allows the analyst to take advantage of all of the data. These approaches adjust for confounding variables by modeling a direct relationship between them and the outcome–for example, using a linear or generalized linear model, or using machine-learning approaches to capture more complex, non-linear relationships that are difficult to specify a priori in the absence of substantive theory. The specific meaning of ‘adjustment’ in this context varies by the particular model specification.

Adjustment with (generalized) linear models

In the classical, multiple-regression approach, the analyst assumes treatment and confounding causal effects on the outcome are captured by regression coefficients with simple additive effects or multiplicative interactions to account for subgroup-specific relationships. We demonstrate the regression approach on our existing example as an alternative to matching.

For our simulated data, a multiple linear regression recovers the causal effect and the other coefficients quite well, which is unsurprising given the simple, linear nature of the simulation:










< 0.001

Course Chosen




< 0.001





< 0.001

However, we may not know in advance the functional form of the causal relationship. The real world is not always so simple: heterogeneity in treatment effect size for treatment and control groups may preclude straightforward estimation of treatment effects with a single regression coefficient in a simple, parametric model. For example, the function relating previous exam performance to salary outcomes may not be linear; in the case of a single confounder it may be practical for the analyst to explore multiple functional forms to find the best fit, but as the number of confounders grows this solution may become impractical.

Modeling heterogeneous or nonlinear effects of confounders with meta-learners

In situations where heterogeneity in treatment effect size precludes straightforward estimation of causal effects with a parametric model, more flexible alternatives may be required. The general approach we outline here allows for the possibility that the size of the unobservable, within-person, treatment effect may vary as an unknown function of covariates–one that must be estimated using more flexible models.

Though a number of approaches and algorithms for estimating heterogeneous treatment effects have been developed, we focus on a general-purpose family of estimators called ‘metalearners’ that are relatively simple to understand and implement–even without specialized software packages such as the causalToolbox in R and causalml in python. Metalearners use information contained in auxiliary features to simulate within-person treatment effects. For every observed outcome, a counterfactual, alternative outcome is generated from a fitted model. This simulated outcome is then used to generate an estimate of the treatment effect. Tree-based models are a particularly popular choice for modeling the effects of confounders in the causal-inference literature, though there are no general heuristics to guide investigators. The choice of model is a matter of choosing which model is likely to capture patterns in the data for a given domain.

Metalearners break estimation of within-person treatment effects into (at least) two, separate steps: (1) models are trained to predict outcomes for treatment and control groups and then (2) the within-person treatment effect is calculated based on different imputation strategies using these models. Here we discuss three, closely related approaches, the S-, T- and X- learners (Künzel et al. 2019):

  1. S-Learners use a single model to predict expected outcomes for treatment and control groups using an indicator variable for treatment status. In the second step, the indicator variables are altered to obtain predictions for the control group under treatment and predictions for the treatment group under control. The estimated treatment effect for a given individual is thus the difference between the observed value and the imputed value. Under additive/linear effects of treatment and covariates, the S-Learner reduces to the simple multiple regression model introduced above.
  2. T-learners and X-learners begin by estimating two separate models to estimate the within-person treatment effect: one estimated on treatment observations and the other estimated on control observations. In subsequent steps:
    • T-learners calculate treatment effects by subtracting the predicted values from the treatment model minus the predicted values from the control model, given particular features. No information from the control group is used in developing the treatment-group model and no information from the treatment group is used in developing the control-group model. As a result, this model may be a particularly appropriate choice when the modeler wishes to avoid over-regularization at the potential expense of generalizability, though one option may be to introduce simple weighting schemes to mitigate overfitting.
    • X-learners (1) calculate the treatment effect for those in the treatment group by subtracting the predicted values under the control model from their observed values under treatment and calculate the treatment effect for those in the control group by subtracting the observed values for control from their predicted values under the treatment model, (2) model each of these imputed differences as a function of the covariates, and (3) calculate a weighted average of these modeled effects to create an adjusted estimate of the within-person treatment effect. In the best-case scenario, weights should be chosen to minimize the residual variance of the estimated effect, but separately derived propensity scores are also a common choice. The X-learner shares some common features with (generalized) linear, multilevel modeling in that steps (2) and (3) leverage information from the whole dataset to arrive at models of imputed differences as well as the weighted average that generates the final treatment effects. This efficient use of information from both groups resembles partial pooling of information across subgroups in multilevel models, in which grand means are estimated by weighting group means according to their uncertainty.

Here we demonstrate the T- and X-learner approach using generalized additive models as base learners. As with our previous example, course selection and salary are partially a function of the final exam performance from the prerequisite course. However, in this simulated dataset, both the baseline salary and the within-person treatment effect are nonlinear functions of previous course performance. Figure 2.2.1 shows both the simulated data (left) and the underlying treatment effect from which the data are generated.

Figure 2.2.1. Left: Simulated data with smoothed conditional mean estimates of the course-specific functions relating previous course performance to salary. Right: The underlying, polynomial CATE function with scatterplot of individual effects resulting from noise perturbation.

The first step of both the T- and X-learners is to fit separate models to the treatment and control data. Here we used generalized additive models with cubic-spline bases to estimate these smooth functions.

Both the T-learner and X-learner then use predictions from the models to impute counterfactual outcomes: predictions from the control model are used to impute control estimates for the treated units and predictions from the treatment model are used to generate predicted treatment estimates for the control units. Figure 2.2.2 shows the fitted and imputed/predicted observations from the generalized additive models.

The T-learner then estimates the within-person treatment effect as the difference between the fitted model predictions (not the raw values; e.g., easy course predictions for easy-course subjects from the model fitted only on easy-course subjects) and the imputed/simulated/predicted counterfactual estimates for each person (e.g., difficult-course predictions for easy-course subjects from the model fitted on difficult-course subjects).

Figure 2.2.2. Modeled conditional mean estimates of the course-specific functions relating previous course performance to salary (solid line) with jittered addition of counterfactual outcomes (dotted line).

The X-learner diverges from the T-learner in its more elaborate treatment of information from the observed data and counterfactual, imputed estimates. The differences between the observed, raw data and the counterfactual data are separately modeled for treatment and control groups in a second step and combined in a weighted average to yield a final estimate. Figure 2.2.3 shows the estimated treatment effects for the T-learner, an X-learner where the second-step models are weighted by the standard error of their residuals (precision weighting), and an X-learner where the second-step models are weighted by an estimated propensity score generated by predicted values from a logistic regression of course choice on previous course performance.

Though the T- and X-learners recover the overall shape of the actual treatment effects used to generate the data, at the low and high ends of the distribution of exam-performance data—where data from both groups are sparse—both meta-learners under- and over-shoot the treatment effect. This could potentially be remedied with further regularization in the base learners at any stage of model fitting, either by imposing stronger penalties, tuning model parameters via cross-validation, or placing a prior on effect sizes, depending on the particular approach of choice.

These techniques handle information contained in potentially complex, multivariate spaces in a principled manner while taking advantage of all existing data, which is useful when the analyst wishes to improve the generalizability of the model. However, as with all data-imputation techniques, the estimates are only as good as the model generating the imputed data in the best case. In general, this is a very difficult quantity to measure, as there is no ground truth. In addition, given that these models are meant to handle complex, multidimensional spaces with big data that are not easy to handle for the human observer (a kind of push toward automation), it is likely much more difficult to scrutinize the dataset for potential violations of assumptions necessary for valid causal inference.

Figure 2.2.3. The underlying, polynomial CATE function with T- and X-Learner estimates superimposed.

One particular limitation of T- and X-learners is that they address discrete, two-valued causes. While extensions to many-valued, discrete causes are relatively straightforward to imagine, alternative approaches are necessary for addressing continuous-valued treatments. S-Learners, on the other hand, could—at least in principle—be used with continuous treatment variables, with conditional treatment effects estimated for any counterfactual treatment values of interest to the analyst.

Conclusions and limitations

The methods we have covered are restricted to problems where we can identify, measure, and model all necessary covariates. Covariates should be selected carefully, as inclusion of covariates that are themselves outcomes of other covariates, the treatment status, or the target outcome, may introduce deleterious biases in estimates of causal effects. Conditioning on auxiliary outcomes can induce spurious associations between a priori unrelated variables. A particularly insidious source of biased causal estimates when the auxiliary outcome is selection for inclusion in the dataset itself (e.g., selection bias affects assessment of risk factors for COVID-19 illness and disease outcomes; see Griffith et al. 2020), and is thus not directly observable as a covariate.

For example, imagine a scenario where students choose PyTorch courses more or less at random. However, one course encouraged students to respond more to post-course surveys. In addition, students with higher salaries are more likely to engage in surveys to express their satisfaction with their educational experience. If we partition the data based on who reports (akin to adjusting for reporting, had we been able to observe it directly as a covariate), a spuriously negative effect of course difficulty on salary arises; see Figure 3.1 for a visualization of this effect.

Figure 3.1. What happens when conditioning on a common outcome? Spurious associations may arise. Here, a negative causal effect of course taken is observed within each reporting status, yet there is no overall effect of course taken.

In addition to careful consideration of which confounding mechanisms to consider, the methods we have covered assume that the causal effect is a fixed quantity for each individual, independent of treatment assignment. The analyst must be prepared to defend the assumptions their modeling approach makes and must be hyper-vigilant to potential violations. One possible approach, after a model has been fitted, is the attempt to refute a model by stress-testing the causal estimates and modeling assumptions; many of these fall into the category of data perturbation techniques.

Though automation of causal model identification and inference is an area of active research, there are currently few heuristics to help guide researchers through the many decisions they must make from the beginning to the end of a modeling project, and as such it is particularly important to document each decision made and its underlying motivation. Researchers will inevitably apply many approaches in order to understand their data, and it is especially important in an emerging discipline to understand how alternative modeling decisions may affect the end result.

Further reading

For excellent introductions to causal inference in textbook form, we recommend consulting Counterfactuals and Causal Inference: Methods and Principles for Social Research (Morgan and Winship 2014) and Elements of Causal Inference: Foundations and Learning Algorithms (Peters, Janzing, and Schölkopf 2017).


Griffith, Gareth J., Tim T. Morris, Matthew J. Tudball, Annie Herbert, Giulia Mancano, Lindsey Pike, Gemma C. Sharp, et al. 2020. “Collider Bias Undermines Our Understanding of COVID-19 Disease Risk and Severity.” Nature Communications 11 (1): 5749.

Ho, Daniel E., Kosuke Imai, Gary King, and Elizabeth A. Stuart. 2011. “MatchIt: Nonparametric Preprocessing for Parametric Causal Inference.” Journal of Statistical Software 42 (8): 1–28.

Iacus, Stefano M., Gary King, and Giuseppe Porro. 2019. “A Theory of Statistical Inference for Matching Methods in Causal Research.” Political Analysis 27 (1): 46–68.

King, Gary, and Richard Nielsen. 2019. “Why Propensity Scores Should Not Be Used for Matching.” Political Analysis 27 (4): 435–54.

Künzel, Sören R., Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. 2019. “Metalearners for Estimating Heterogeneous Treatment Effects Using Machine Learning.” Proceedings of the National Academy of Sciences 116 (10): 4156–65.

Morgan, Stephen L., and Christopher Winship. 2014. Counterfactuals and Causal Inference: Methods and Principles for Social Research. 2nd ed. Analytical Methods for Social Research. Cambridge: Cambridge University Press.

Peters, J., D. Janzing, and B. Schölkopf. 2017. Elements of Causal Inference: Foundations and Learning Algorithms. Cambridge, MA, USA: MIT Press.