One of the key skills in data science acquired through experience across many projects is learning to identify the key questions being asked and match them to the right tools for the job.
Now, there are many kinds of questions to ask of your data but, broadly speaking, many questions can be broken down into two categories: questions about patterns and correlations, or questions about causes and effects.
They often go hand-in-hand in a project. For example, clients typically come to us first with a patterns/correlations question: What aspects of my users and their behaviors using my product will tell us who has a high risk of churning? Later, after such features have been identified through tools particularly well-suited to that job, that question morphs into new questions about causes/effects: Okay, so if I implement this change to the product, will I lower churn rates? How much larger should our discount be to retain a customer? What if we stopped sending that email that prompted them to unsubscribe?
The difference between the first question and the second set of questions is subtle yet extremely important when it comes to matching the right tools to the job and achieving the client's goal. While the first question is ultimately about identifying reliable correlations, the latter questions require the much more challenging task of causal inference. Specifically, they are questions that demand counterfactual reasoning, ultimately requiring us to consider possible worlds in which something was different and how they might have led to a different outcome.
Here we wanted to dive deeper into the challenges of causal inference, and specifically review why these questions are hard yet ultimately essential to making better decisions for your organization.
Patterns aren’t always enough
Even the most sophisticated of today’s machine learning models rely heavily on pattern recognition: identifying correlations between predictors and outcomes. While these are incredibly useful tools for answering pattern-recognition questions, they become uninformative, even misleading, when used to answer causal questions.
We can describe this in a simple example.
Let’s say we have house sales data. There appears to be a positive correlation between a house’s price and its age. If we were to predict a house’s price given its age, we could expect to do so with reasonable accuracy:
Our pattern-recognition tool has found a pattern! Specifically, it found a relation between house price and house age that looks quite reliable.
We might be tempted to draw a causal inference here as well: Houses become more expensive as they get older.
But what happens if we use these same pattern-recognition tools to account for the location of the house?
Now we see something different. Older houses are less expensive when we look within a specific location. Here, location is a confounder that influences both the house price and age. Is the house expensive because it’s old or because it’s Downtown?
We might reasonably draw a very different causal inference from these data. Given these data, we might conclude that houses become less expensive over time.
The “flip” in these results is one case of a phenomenon called Simpson’s paradox, which may seem contrived but actually happens all the time in the real world (see, e.g., this example in the New York Times).
What this quick example illustrates is that pattern-recognition tools are good at finding patterns. Indeed, by carving up the data in two slightly different ways, we found two completely different patterns! Both patterns are completely correct, yet neither observed pattern is sufficient for making causal inferences. The second result could just as easily be wrong as the first.
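The flip is easy to reproduce on synthetic data. The sketch below is illustrative only: the Suburb and Midtown neighborhoods, the prices, the age ranges, and the assumed effect of age (each year lowers price by 2, in thousands of dollars) are all made up for the example and are not taken from real house sales.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical neighborhoods: (typical age in years, baseline price in $1000s).
# Assumption: Downtown houses tend to be both older and pricier.
LOCATIONS = {"Suburb": (20, 300), "Midtown": (45, 500), "Downtown": (70, 700)}
TRUE_AGE_EFFECT = -2.0  # assumed: each extra year of age lowers price by $2k

houses = []
for name, (typical_age, base_price) in LOCATIONS.items():
    for _ in range(200):
        age = rng.uniform(typical_age - 15, typical_age + 15)
        price = base_price + TRUE_AGE_EFFECT * age + rng.gauss(0, 20)
        houses.append((name, age, price))

# Pattern 1: ignore location and regress price on age over all houses.
pooled_slope = ols_slope([a for _, a, _ in houses], [p for _, _, p in houses])

# Pattern 2: regress price on age separately within each location.
within_slopes = {
    name: ols_slope(
        [a for loc, a, _ in houses if loc == name],
        [p for loc, _, p in houses if loc == name],
    )
    for name in LOCATIONS
}

print(f"pooled slope: {pooled_slope:+.2f}")
for name, slope in within_slopes.items():
    print(f"{name:>8} slope: {slope:+.2f}")
```

In this simulated world the pooled slope comes out positive while every within-location slope is negative. Only the within-location slopes reflect the effect we built in, and we know that only because we wrote the data-generating process ourselves.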
To be confident in making causal inferences, we will need to put aside the pattern-recognizers and draw on a different set of tools.
Picking up our causal-inference toolkit
As is clear from the previous examples, one of the central challenges in causal inference is understanding which variables are in play.
The first reason why mapping out relevant variables is important is that we are specifically interested in which variables might be true causal variables. We don’t want to mistakenly identify one variable as a cause when in fact it is itself driven by some unrecognized variable.
The second reason to spend time identifying relevant variables is that one can only identify a causal variable by looking for something which reliably leads to a change in outcome while holding all other variables (potential confounding variables) constant. In the house example, this would mean holding location constant (along with all other identified and yet-to-be-identified confounders), increasing a specific house’s age, and watching its price change.
The causal effect is then the magnitude by which the outcome is changed by some change in the causal variable. And “holding all other variables constant” is the key to measuring those effects. In hypothetical settings (i.e., when we are simply observing historical data and trying to make sense of it, as is often the case), we call this counterfactual reasoning. In other settings, we call this a randomized experiment.
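One common way to write the causal effect described above uses do-notation. This notation is an assumption of this sketch, since the post does not otherwise use it: do(age = a) means "set age to a while leaving everything else as it was", and the symbol τ is just a label chosen here.

```latex
\tau(a, a') = \mathbb{E}\left[\text{price} \mid \operatorname{do}(\text{age} = a')\right]
            - \mathbb{E}\left[\text{price} \mid \operatorname{do}(\text{age} = a)\right]
```

The purely observational contrast, E[price | age = a'] - E[price | age = a], is only guaranteed to match this quantity when there is no confounding, which is exactly what "holding all other variables constant" is meant to secure.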
Graphically, we might map out variables in a causal or interventional diagram like so:
Critically, age must be independent of location to properly estimate the causal effect of age; otherwise, location is confounding our measure of the effect.
Randomized experiments allow us to guarantee the independence of these variables by specifically manipulating one while not manipulating the other. Often, non-manipulated variables are simply left to be randomly distributed throughout the experimental sample. In other cases, they may be explicitly matched between manipulated (treatment) and non-manipulated (control) samples. The data resulting from either experiment has an ‘interventional’ distribution.
However, randomized experiments aren’t always a viable option, as they are often too costly, unethical, illegal, or even impossible. The house example demonstrates the last of these. It is impossible to take the same house and age or de-age it to measure its price; instead, we have to try to make causal inferences from naturally occurring observational data.
Limited to observational techniques, we have to be careful to account for confounders like location when causal relations like this are possible:
Identify, estimate, and refute
At this point, we know why we need to go beyond pattern recognition, and we know the perils of confounders. But how should we proceed?
The first step is identification. We want to identify all of the potential variables that might be playing a role in our causal system. For those who like to throw models at a dataset, this can be a bit disconcerting. To be successful, you must develop an understanding of the context of the causal question you’re asking, and that means thinking deeply about the domain of the problem and the data-generation process that led to your observations.
In client projects, it is essential to leverage clients’ own domain knowledge to help identify the causal structure of the question we’re trying to answer.
Considering our house example, here’s a diagram with more (though certainly not all) of the variables identified:
With variables identified, we need to actually estimate the causal effect of the variables of interest. There are several methods to do this, each with their own strengths and weaknesses:
- Regression: Controlling for the confounders’ own independent effects on the outcome, what is the magnitude/direction of the effect of our variable of interest? Note that confounders may have interaction effects (where one confounder mediates or moderates the effect of another), which means it’s never as simple as a first-order linear model.
- Matching/Stratification: Build a dataset where we have matched examples of houses — selected to be equivalent on all but our variable of interest and outcome. Then, compare the relationship between our variable of interest (e.g., age) and outcome (e.g., price) between the matched pairs. Does the older house have a higher or lower price?
- Weighting: Weight each individual’s outcome (e.g., price) by the inverse probability of having a certain value of our variable of interest (e.g., age), based on what is commonly referred to as a propensity model. This weighting attempts to mimic a randomized experiment by balancing the data set.
- Uplift modeling: Estimate a causal effect for each individual, sometimes called an individual treatment effect. This means estimating each individual’s outcome (e.g., price) under the assumption of a different value of our variable of interest (e.g., age), which is essentially the counterfactual estimate of the outcome. The difference between this counterfactual prediction and the observed outcome (e.g., price) is used as an estimate of the individual treatment effect.
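To make two of these estimators concrete, here is a minimal sketch of stratification and inverse-probability weighting on simulated data. It rests on several simplifying assumptions that are not part of the discussion above: age is collapsed to a binary "old" flag, location is the only confounder, and all numbers (prices, the share of old houses per location, a true effect of -50) are invented. The propensity model is simply the observed share of old houses in each location.

```python
import random
from collections import defaultdict

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# All values below are made up for illustration (prices in $1000s).
BASE_PRICE = {"Suburb": 300.0, "Downtown": 600.0}
P_OLD = {"Suburb": 0.2, "Downtown": 0.8}  # old houses cluster Downtown
TRUE_EFFECT = -50.0                       # assumed effect of being old

rows = []
for _ in range(4000):
    loc = rng.choice(list(BASE_PRICE))
    old = rng.random() < P_OLD[loc]
    price = BASE_PRICE[loc] + TRUE_EFFECT * old + rng.gauss(0, 30)
    rows.append((loc, old, price))

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Naive comparison: ignores location, so it is confounded.
naive = mean(p for _, o, p in rows if o) - mean(p for _, o, p in rows if not o)

# Stratification: compare old vs. new houses within each location,
# then average the differences, weighting each location by its size.
by_loc = defaultdict(list)
for loc, old, price in rows:
    by_loc[loc].append((old, price))

stratified = 0.0
for group in by_loc.values():
    diff = mean(p for o, p in group if o) - mean(p for o, p in group if not o)
    stratified += diff * len(group) / len(rows)

# Weighting: the propensity of being old is estimated per location,
# and each house is weighted by the inverse probability of its own status.
propensity = {loc: mean(o for o, _ in group) for loc, group in by_loc.items()}

def weighted_mean(pairs):
    return sum(w * p for w, p in pairs) / sum(w for w, _ in pairs)

old_pairs = [(1 / propensity[loc], p) for loc, o, p in rows if o]
new_pairs = [(1 / (1 - propensity[loc]), p) for loc, o, p in rows if not o]
ipw = weighted_mean(old_pairs) - weighted_mean(new_pairs)

print(f"naive: {naive:+.1f}  stratified: {stratified:+.1f}  ipw: {ipw:+.1f}")
```

The naive difference comes out positive because old houses are concentrated Downtown, while both adjusted estimates land near the assumed effect of -50. With a single categorical confounder and a propensity estimated per category, the normalized weighting estimate used here coincides with the stratified one; the two diverge once the propensity comes from a fitted model over many covariates.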
The last step of any causal inference analysis is to challenge our results through a process sometimes called refutations. Unlike most types of model validations, there is no known answer, but ultimately this period of trying to refute our results can increase one’s confidence in them (if they survive, that is…).
Techniques for refutation include:
- Assessing covariate balance: If our causal variable can truly vary independently of its confounders and have an independent effect on the outcome, we should see that confounders are distributed similarly across values of the causal variable (here, age). For example, we should see houses of all ages in each location.
- Add random variables to the confounders: If the statistical tool we are using to estimate the effect is robust, we should be able to add confounders that are noise, or add noise to confounders, and measure similar effects.
- Divide data into subsets: If the causal variable has an independent effect on your outcome, it should be consistent when measured across different subsets of your data. If not, one should question whether its effect is truly independent or, instead, whether it’s moderated or perhaps even carried by another variable.
- Randomize or permute the causal variable: If we randomly re-assign values to the causal variable, we should expect the causal effect to approach zero. For example, if we conduct our causal analysis only on houses that are 30 years old, we should expect the average price difference to be zero. If not, there’s a reasonable chance we’re not truly holding everything else constant.
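Two of these checks, permuting the causal variable and re-estimating on subsets, can be sketched in a few lines. As before, this is a hedged illustration on simulated data: the neighborhoods, prices, and assumed age effect of -2 are invented, and `adjusted_slope` is a helper written for this example (a slope of price on age after removing each location's means, which is equivalent to a linear regression with location fixed effects), not a function from any library.

```python
import random

rng = random.Random(2)  # fixed seed so the sketch is repeatable

# Hypothetical neighborhoods: (typical age in years, baseline price in $1000s).
LOCATIONS = {"Suburb": (20, 300), "Midtown": (45, 500), "Downtown": (70, 700)}
TRUE_AGE_EFFECT = -2.0  # assumed effect of one extra year of age

houses = []
for loc, (typical_age, base_price) in LOCATIONS.items():
    for _ in range(300):
        age = rng.uniform(typical_age - 15, typical_age + 15)
        price = base_price + TRUE_AGE_EFFECT * age + rng.gauss(0, 20)
        houses.append((loc, age, price))

def adjusted_slope(data):
    """Slope of price on age after subtracting each location's means."""
    groups = {}
    for loc, age, price in data:
        groups.setdefault(loc, []).append((age, price))
    num = den = 0.0
    for pairs in groups.values():
        mean_age = sum(a for a, _ in pairs) / len(pairs)
        mean_price = sum(p for _, p in pairs) / len(pairs)
        num += sum((a - mean_age) * (p - mean_price) for a, p in pairs)
        den += sum((a - mean_age) ** 2 for a, _ in pairs)
    return num / den

estimate = adjusted_slope(houses)

# Refutation 1: permute the causal variable. Shuffled ages carry no
# information about price, so the estimated effect should be near zero.
ages = [age for _, age, _ in houses]
placebo_slopes = []
for _ in range(200):
    rng.shuffle(ages)
    shuffled = [(loc, a, price) for (loc, _, price), a in zip(houses, ages)]
    placebo_slopes.append(adjusted_slope(shuffled))
placebo_mean = sum(placebo_slopes) / len(placebo_slopes)

# Refutation 2: re-estimate on two random halves of the data.
# A stable effect should look similar in both.
order = houses[:]
rng.shuffle(order)
half = len(order) // 2
subset_estimates = [adjusted_slope(order[:half]), adjusted_slope(order[half:])]

print(f"estimate: {estimate:+.2f}")
print(f"placebo mean: {placebo_mean:+.3f}")
print("subset estimates:", [f"{s:+.2f}" for s in subset_estimates])
```

Here the estimate survives both checks because the simulation satisfies the estimator's assumptions by construction. On real data, a placebo effect far from zero or subset estimates that disagree would be a signal to revisit the identified variables.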
Making the correct causal inferences on observational data is challenging, yet causal questions are essential questions to ask as (1) early research begins to identify interesting patterns and (2) stakeholders begin to want to make interventions that improve outcomes.
Here we’ve highlighted some conceptual tools to bring to bear on these problems as you frame your questions and causal research. In an upcoming post, we will dive into some statistical tools that turn these concepts into tried-and-tested causal estimators.