## Simpson’s paradox and causal inference with observational data

**TL;DR** Simpson's paradox is a favorite mathematical curiosity. It is inextricably linked to causal inference. When building machine learning models on real-world medical evidence, Simpson's Paradox is something we should be prepared for. This is one aspect of drawing inferences from observational data that marks it as different from randomized controlled trials. When we are interested in making predictions without intervention, Simpson's Paradox is not a danger, merely a confusing annoyance; it won't lead us to false predictions, though it may hamper interpretation. However, when we want to draw causal conclusions, some assumptions about the world are necessary to escape the paradox.

```
__author__ = 'Adam Foster'
```

## Contents

- Prediction and causal inference with observational data
- Prediction vs. causal inference
- Data in this post
- Simpson's paradox: take 1
- Simpson's paradox: take 2
- For prediction, there is no paradox
- Protecting against Simpson's Paradox when venturing causal inferences
- Conclusion
- Appendix A: Data simulation
- Appendix B: Temporal ordering?
- References

## Prediction and causal inference with observational data

In the world of healthcare, there is increasing pressure to leverage observational data, collected amongst a broad population outside of experimental controls, to better understand efficacy, safety, and the value of treatments in practice. The data for these efforts can come from numerous, disparate sources: prescription patterns, public health indicators, medical ontologies, and more traditional health data sources like insurance claims and electronic health records. A single study might integrate many such data sets to produce a vast, detailed picture of the relevant populations, behaviors, and responses.

In this post, we consider two different goals one might have for studies using such observational data:

- *Prediction*: To make predictions about future outcomes, based on models we train from the data we have now.
- *Causal inference*: To understand which of the many hundreds of variables in our model are driving the change – which have a causative effect on the outcome.

Simpson's Paradox (a special case of the reversal paradox [1]) is a famous counter-intuitive result in statistics showing that results can appear, disappear, or even become reversed within the same data set depending on how one chooses to break the population into subgroups. This is relevant to both goals, but it affects them differently. For the prediction problem, Simpson's Paradox is really just an annoyance. For the causal inference problem, it is more serious.
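The mechanism behind such reversals is nothing more exotic than weighted averaging. A minimal sketch with made-up numbers (purely illustrative, not drawn from any data set) shows a treatment that is better within every subgroup yet worse in the pooled comparison, because each treatment is concentrated in a different subgroup:

```python
# Toy numbers only: subgroup mean responses (lower = better) and
# subgroup sizes for two treatments across a "severe" and a "mild" subgroup.
means = {('severe', 1): 5.0, ('severe', 2): 6.0,   # Treatment 1 better here...
         ('mild', 1): 1.0, ('mild', 2): 2.0}       # ...and here
sizes = {('severe', 1): 90, ('severe', 2): 10,     # Treatment 1 taken mostly by the severe group
         ('mild', 1): 10, ('mild', 2): 90}         # Treatment 2 taken mostly by the mild group

def pooled_mean(t):
    """Size-weighted mean response for treatment t, ignoring subgroups."""
    groups = [('severe', t), ('mild', t)]
    return sum(means[g] * sizes[g] for g in groups) / sum(sizes[g] for g in groups)

# Treatment 1 is better (lower) in both subgroups, yet worse overall:
print(pooled_mean(1), pooled_mean(2))  # 4.6 2.4
```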

Let's start by stating the two goals more precisely and identifying just how they differ.

## Prediction vs. causal inference

An associational predictive statement might look like,

People who take this treatment are likely to get better

where the intention is to identify a pattern that one expects to hold for new cases, without making any claim about *why* the pattern occurs.

In contrast, a causal statement like

This treatment makes people get better

explicates the nature of the connection between treatment and outcome and implicitly says that people who don't get the treatment will not get better (assuming everything else is held constant). We expect the predictive statement also to be true if the causal one is, but the reverse is not true. As the saying goes, correlation does not imply causation.

In the context of healthcare, making a causal claim where only a predictive one holds can have very serious consequences. The randomized controlled design of clinical trials is meant to help us avoid such mistakes. With observational data, the risks of such mistakes are greater, and Simpson's Paradox helps us to see why, as it results from faulty causal reasoning [2]. In this way, the paradox serves as a warning bell that we need some solid causal assumptions before we can proceed with causal inference from our observational data.

## Data in this post

This post tells a story about making causal inferences using real-world evidence. However, **all the data in the post are simulated**. The primary advantage is that we can specify the data generation process ourselves, controlling the correlations and causal effects to expose interesting patterns.
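The support module and its generators are not reproduced in this post, but a hypothetical generator of the same flavor might look like the sketch below, in which location drives both treatment choice and baseline severity. All names and numbers here are illustrative assumptions, not the post's actual code:

```python
import random

def toy_generator():
    """Hypothetical stand-in for a generator like data_generator1:
    location confounds treatment choice and baseline severity."""
    while True:
        location = random.choice(['Urban', 'Rural'])
        if location == 'Urban':
            treatment = 1 if random.random() < 0.9 else 2  # urban dwellers mostly get Treatment 1
            baseline = 5.0                                 # ...and suffer more from the disease
        else:
            treatment = 2 if random.random() < 0.9 else 1  # rural dwellers mostly get Treatment 2
            baseline = 1.0
        effect = -1.0 if treatment == 1 else 0.0           # Treatment 1 truly lowers response more
        yield {'location': location,
               'treatment': treatment,
               'response': baseline + effect + random.gauss(0, 0.5)}

gen = toy_generator()
rows = [next(gen) for _ in range(5)]
```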

## Simpson's paradox: take 1

In our first example, we illustrate a classic Simpson's Paradox pattern: a relationship that holds for the overall population is reversed for every subgroup.

Suppose we wish to study the effect on response of two treatments: Treatment 1 (an existing therapy) and Treatment 2 (an experimental gene therapy). A lower response is better. Suppose the data arose from a real-world study of the two treatments, so there is inherent uncertainty about how to properly define our model.

We begin by plotting the response of each patient, graphically separated by which treatment they were receiving (all code for this post is available here):

```
%matplotlib inline
from causal_inference_support import *
```

```
data = get_dataframe(data_generator1(), 200)
```

```
plt.figure(figsize=(8, 6))
sns.boxplot(x='treatment', y='response', data=data)
sns.swarmplot(x='treatment', y='response', data=data, color='.25')
plt.title("Response (lower = better) against treatment")
plt.show()
```

The means for each group are

```
data.groupby('treatment').mean()
```

Is the difference between the means significant? A classical test for this is Student's $t$-test. Let's perform this test now:

```
run_ttest(data, 'treatment', 1, 2, 'response')
```
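The helper `run_ttest` lives in the support module, which is not shown here. As a rough sketch of what such a helper might compute, here is Welch's two-sample $t$ statistic (allowing unequal variances) written out with the standard library; the function name and samples below are illustrative assumptions:

```python
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples with unequal variances."""
    na, nb = len(sample_a), len(sample_b)
    se = (variance(sample_a) / na + variance(sample_b) / nb) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / se

# Clearly separated toy samples give a large |t| (strong evidence of a difference)
welch_t([5.1, 4.9, 5.3, 5.0], [2.1, 1.9, 2.2, 2.0])
```

A full helper would also convert the statistic to a $p$-value via the $t$ distribution with Welch's degrees-of-freedom approximation.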

A natural conclusion at this point: "Treatment 2 is a more effective treatment. The result is significant."

And if we had only treatment and response data, our analysis would have to stop there. But if we live in a world of connected health data, we'll likely know a lot more than that about our participants.

Suppose, then, that we use connections between our data and public geographical data to find that people living in urban areas experience the disease differently than those living in rural areas. Let's consider the effect of the two treatments on urban and rural dwellers separately:

```
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
for location, ax in zip(['Urban', 'Rural'], axes):
    sns.boxplot(ax=ax, x='treatment', y='response',
                data=data[data['location'] == location])
    sns.swarmplot(ax=ax, x='treatment', y='response',
                  data=data[data['location'] == location], color='.25')
    ax.set_title("{} dwellers".format(location))
```

Now the group means are

```
data.groupby(['location', 'treatment']).mean()
```

```
print("Urban")
run_ttest(data[data['location'] == 'Urban'], 'treatment', 1, 2, 'response')
print("Rural")
run_ttest(data[data['location'] == 'Rural'], 'treatment', 1, 2, 'response')
```

This is a classic example of Simpson's Paradox. Just above, we showed that, for the population overall, Treatment 2 appears better than Treatment 1. *But*, for both subgroups considered on their own, Treatment 1 is better than Treatment 2.

A story to explain the paradox in our example is as follows. People living in urban areas who suffer much more from the disease (perhaps due to higher pollution levels in cities) are more likely to use Treatment 1. Perhaps it is more available, or better marketed here. People living in rural areas where the effects of the disease are less, tend to use Treatment 2 even though it's not as good at curing the disease. When we look at the whole population, Treatment 2 manages to look better because it's overwhelmingly taken by people in rural areas where the effects of the disease aren't so severe.

## Simpson's paradox: take 2

In the previous example, it's natural to resolve the apparent paradox by including subgroup information in the estimate of causal effects. We now consider an example where subgroup information leads us astray.

We still have two treatment options – Treatment 1 and Treatment 2. Now however, the disease is the same for urban and rural dwellers. The third variable is $S$. Say $S=1$ indicates that the patient went scuba diving in the last month, and $S=0$ indicates they did not. We look at the whole population first:

```
data2 = get_dataframe(data_generator2(), 120)
```

```
plt.figure(figsize=(8, 6))
sns.boxplot(x='treatment', y='response', data=data2)
sns.swarmplot(x='treatment', y='response', data=data2, color='.25')
plt.title("Response against treatment")
plt.show()
```

And, as before, let's look at the response means for the two treatments:

```
data2.drop(columns='scuba_diving').groupby('treatment').mean()
```

This looks like a clear result that Treatment 1 is better than Treatment 2. However, the urban–rural example should be a warning to us. Let's break down the data into scuba divers and non-scuba divers:

```
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
for scuba, ax in zip([True, False], axes):
    sns.boxplot(ax=ax, x='treatment', y='response',
                data=data2[data2['scuba_diving'] == scuba])
    sns.swarmplot(ax=ax, x='treatment', y='response',
                  data=data2[data2['scuba_diving'] == scuba], color='.25')
    ax.set_title("Scuba divers" if scuba else "Non-scuba divers")
```

And the response means for the four groups:

```
data2.groupby(['scuba_diving', 'treatment']).mean()
```

We've got the same subgroup reversal effect as before, except that things happen the other way round. When we look at the overall population, Treatment 1 appears more effective. When we split into the two groups – scuba divers and non-scuba divers – it appears that for each group Treatment 2 is more effective.

To understand what's going on, let's look at the original plot, colored by scuba diving:

```
plt.figure(figsize=(8, 6))
pal = sns.color_palette("Set2", 10)
sns.swarmplot(x='treatment', y='response', data=data2, hue='scuba_diving', palette=pal)
plt.title("Response colored by scuba diving")
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., title='Scuba diver?')
plt.show()
```

Here is an intuitive description of what is happening. People who recover are more likely to go scuba diving, but those on Treatment 2 have to be doing really well to go diving because its side effects are so severe. So when we look at scuba divers, we see only the most fully recovered people on Treatment 2. Meanwhile, only the really sick people on Treatment 1 do *not* go scuba diving, so looking at non-scuba divers, Treatment 1 again appears worse.

The common sense (causal) conclusion is that Treatment 1 is more effective and scuba diving is a distracting variable which we should ignore: in this story it is a consequence of treatment and response, not a cause of either, so splitting on it distorts the comparison.

## For prediction, there is no paradox

Let's consider these examples as predictive machine learning problems. To distance ourselves from the interpretation, we'll relabel Treatment as $T$, Response as $R$, Location (urban or rural) as $L$, and Scuba diving as $S$.

```
data.columns = ['L', 'R', 'T']
```

```
data2.columns = ['R', 'S', 'T']
```

### Example 1 with only treatment data

To begin with, we predict $R$ from $T$ alone. In the table, we show the value of $R$ that is predicted when we observe a certain value of $T$. (It is the mean of $R$ restricted to the group $T=t$ for $t=1,2$.)

```
data.groupby('T').mean()
```
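In code, the "model" that the tree diagram below depicts is just a lookup table of group means. A minimal sketch with toy $(T, R)$ pairs (illustrative numbers, not the simulated data above):

```python
from collections import defaultdict

rows = [(1, 4.6), (1, 5.0), (1, 4.2), (2, 2.2), (2, 2.6)]  # toy (T, R) pairs

# "Fit": accumulate the sum and count of R for each observed T, then take means
totals = defaultdict(lambda: [0.0, 0])
for t, r in rows:
    totals[t][0] += r
    totals[t][1] += 1
model = {t: total / count for t, (total, count) in totals.items()}

def predict(t):
    """Predict R for a new patient as the training mean of their T group."""
    return model[t]
```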

The following diagram is a visualization of the predictive model that we would use when only treatment information is available. Upon seeing a new patient, we follow the tree from left to right to arrive at our prediction of the response.

```
display(Image('img/ex1.1.png'))
```

This is the best predictive model available when only $T$ has been measured for our new observation (and $L$ is unknown).

### Example 2 with only treatment data

We repeat the calculation for the scuba diving example, assuming that only $T$ had been measured:

```
data2.drop(columns='S').groupby('T').mean()
```

Here's the tree diagram for this predictive model:

```
display(Image('img/ex2.1.png'))
```

### Example 1 with treatment and location data

Now we measure a new feature, $L$. We include this in our model and have four levels of prediction. (Again, these are simply the means of $R$ restricted to one of the four groups.)

```
data.groupby(['L', 'T']).mean()
```

For an observation with both $T$ and $L$ measured, our predictive model would become more sophisticated. Once again, we follow the tree from left to right to arrive at a prediction:

```
display(Image('img/ex1.2.png'))
```

### Example 2 with treatment and scuba diving data

For the second example, things proceed much the same when we add a new feature, $S$:

```
data2.groupby(['S', 'T']).mean()
```

The tree diagram would then be:

```
display(Image('img/ex2.2.png'))
```

### Discussion

From a predictive machine learning point of view, these models are the best we can get for the given features. If we receive a new observation and are told only that the treatment was Treatment 2, then our model tells us what the prediction will be.

For the urban–rural example, it does not matter whether the prediction is accurate because the treatment affects the response directly, or merely because the treatment helps to predict the patient's location, which in turn influences the response. For the scuba diving example, it is not important whether recovery leads to scuba diving or scuba diving leads to recovery.

Thus, if we only want to make predictions, Simpson's Paradox is not a pressing concern. We accept that sometimes trends switch when we add a new feature to the model. That's OK – we'll know how to make predictions regardless.

## Protecting against Simpson's Paradox when venturing causal inferences

It's only when we want to make causal statements that the paradoxical nature of Simpson's Paradox becomes apparent. Consider our urban–rural example again. It cannot be true medically that Treatment 2 is better overall at curing the disease, but Treatment 1 is better for both subgroups. This is where the air of paradox comes from.

The problem lies with the way different variables affect (or confound) one another: people in urban areas, where the disease is more severe, happened to be more likely to be using Treatment 1. That makes it look bad, even though it is in fact the better treatment.

A doctor wishing to prescribe the best treatment would choose Treatment 1. Common sense tells us that the subgroup model is the correct causal model, and the overall model is not. We can encode our common sense in a diagram:

```
display(Image('img/location.png', width=600))
```

This encodes our assumption that location influences both treatment and response, and treatment influences response. (A precise mathematical meaning can be attached to such diagrams [3].) With this diagram as an assumption, it is possible to deduce that including subgroup information is the right decision in example 1.
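One standard way to use a diagram like this (following [3]) is the back-door adjustment: estimate the effect of setting $T=t$ by averaging the subgroup means over the marginal distribution of the confounder, $\sum_l E[R \mid T=t, L=l] \, P(L=l)$, rather than over the treatment-specific mix of locations. A sketch with illustrative toy numbers (not the post's simulated data):

```python
# Toy subgroup means E[R | T=t, L=l] and location marginals P(L=l)
subgroup_mean = {('Urban', 1): 5.0, ('Urban', 2): 6.0,
                 ('Rural', 1): 1.0, ('Rural', 2): 2.0}
p_location = {'Urban': 0.5, 'Rural': 0.5}

def adjusted_mean(t):
    """Back-door adjusted mean response under do(T=t): average the
    subgroup means over the population mix of locations."""
    return sum(subgroup_mean[(l, t)] * p_location[l] for l in p_location)

# Treatment 1 now correctly comes out better (lower response)
print(adjusted_mean(1), adjusted_mean(2))  # 3.0 4.0
```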

In contrast, in the scuba-diving example, common sense said that including subgroup information was not the right thing to do, as we do not expect recent scuba-diving history to affect treatment outcomes in this case. The following captures this commonsense view of the problem:

```
display(Image('img/scuba.png', width=600))
```