I came across a cool example for a DAG (thanks to a comment by Paul Bürkner) while working on some simulation and to me it became a somewhat minimal example for how DAGs work, how they can help inform modeling decisions and what can go wrong if one ignores causality. This post is basically about this paper by Cinelli, Forney, and Pearl (2020) but as it turns out, Richard McElreath spent an entire lecture on the same paper as well. While you should read the paper itself, Richard’s lectures are a treat and I can not recommend them enough.
Before we start, here are the libraries I will use throughout the rest of the code:
library(ggdag) # uses dagitty for DAG logic and creates nice plots from them library(brms) # used to fit bayesian models instead of raw stan library(GGally) # used for the nice pairs plot set.seed(1235813)
The paper shows examples for good, neutral and bad controls. Good controls being those where asymptotic bias is reduced by controlling for the respective variable and bad controls being those where bias is increased. There are two kinds of neutral controls in the paper: Those that increase and those that decrease precision when controlled for. By combining models 1, 8, 9 and 17 from the paper (in case you want to look them up) we get a very compact DAG that allows us to include all four kinds of controls presented in the paper at the same time. It’s also a cool dag for parameter recovery simulation studies, as there is a single effect we are interested in.
The DAG below shows the outcome
y, the treatment/exposure
x we’d like to
estimate the average causal effect of (see the paper for the difference between
average and direct effect) and the four additional control variables
z1, z2, z3, z4.
tidy_ggdag <- tidy_dagitty( dagify( y ~ x + z1 + z2, x ~ z1 + z3, z4 ~ x + y, outcome = "y", exposure = "x" ) ) ggdag(tidy_ggdag, layout = "circle") + theme_dag()
ggdag allow us to do automatically is to calculate
the so called adjustment set. Given an exposure and outcome, the adjustment set
tells us which additional variables to add to a model, to receive an unbiased
estimator for the treatment effect.
ggdag_adjustment_set(tidy_ggdag, node_size = 14) + theme(legend.position = "bottom")
In our case, the only necessary control for an unbiased estimation is
however, that there is more to an estimation besides being unbiased as we will see
in a second.
Alright, let’s generate some data for this DAG. To do this I simply used the formulas from the DAGs definition as a basis for the data generation. To keep things simple, I used normal distributions and simple additive linear combinations of the variables. However DAGs do not rely on the type of model, so you can do the same for more complex model types. The chosen coefficients are just for convenience. While they influence the results, they do not change the overall behavior. The sample size was set rather low, as the observed differences in precision decrease with sample size and I wanted to make them more visible.
N = 100 z1 = rnorm(N) z2 = rnorm(N) z3 = rnorm(N) x = rnorm(N, mean = (z1 + z3), sd = 1) y = rnorm(N, mean = (x + z1 + z2), sd = 1) z4 = rnorm(N, mean=(y + x), sd = 1) data = data.frame(x = x, z1 = z1, z2 = z2, z3 = z3, z4 = z4, y = y) ggpairs(data)
You can see some quite high correlations in the pairs plot, as we would expect from the way the data were generated. However, the correlations alone would not tell us which variables to include or exclude from our models. Only the causal model of the DAG can help us with that.
After the data generation let’s inspect how including or excluding the different
variables changes our model estimations.
One thing to remember is that we are interested in estimating the effect of
y and not on the effects of the four control variables.
These are also the results of single models fit to a single dataset. For proper
parameter recovery we would have to repeat these many times over, something I
might be working on but felt a little much for this post.
As a baseline, we will start with the true model. True in the sense, that it
recreates the original data generating process (of
y). I refrained from adding
z4 to keep the modeling simpler.
As we would expect from the true model, it recovers the effect for
x quite well.
true_model <- brm( y ~ x + z1 + z2, family = gaussian(), data=data, cores = 4 )
fixef(true_model, pars = c("x"))
## Estimate Est.Error Q2.5 Q97.5 ## x 1.017021 0.07421161 0.8687428 1.160975
For the fist bad model, we leave out
z1 from the controls. This opens up the
x -> z1 -> y and should bias the estimation of the effect
And lo and behold, that is exactly what we can observe from the summary and
posterior intervals. The model is quite certain about the effect of
it is rather off target. There is nothing in the summary that would hint at a
sampling problem or some kind of misspecification. So without our DAG, we might
just roll with it and be confident in an effect that is roughly overestimated
by one third.
bad_model1 <- brm( y ~ x + z2, family = gaussian(), data=data, cores = 4 )
fixef(bad_model1, pars = c("x"))
## Estimate Est.Error Q2.5 Q97.5 ## x 1.270522 0.07936353 1.116095 1.432161
Next, we leave out
z2 from the controls. While this does not open any
backdoor paths, it increases the variation of
y and thus should reduce the
precision of the estimated effect of
x. This is one of the differences that
reduces with growing sample sizes.
And again, we can observe just that. While the true effect of 1 is close to the mean of the posterior interval, the interval itself is wider than for the true model.
bad_model2 <- brm( y ~ x + z1, family = gaussian(), data=data, cores = 4 )
fixef(bad_model2, pars = c("x"))
## Estimate Est.Error Q2.5 Q97.5 ## x 1.008441 0.1010231 0.8123737 1.209088
Onward we go to include
z3, which again does not open any backdoor paths.
This time, it decreases the variation of
x and should reduce the
precision of the estimated effect of
x that way.
Again, this is one of the differences that reduces with growing sample sizes.
As with the model for
z2 the interval mean is close to the true effect of 1 but
the interval itself is wider again.
bad_model3 <- brm( y ~ x + z1 + z2 + z3, family = gaussian(), data=data, cores = 4 )
fixef(bad_model3, pars = c("x"))
## Estimate Est.Error Q2.5 Q97.5 ## x 0.9835386 0.1048379 0.7787038 1.190063
Finally we arrive at the final boss of the DAG, the collider
z4. By controlling
for it we open the backdoor path
x -> z4 -> y.
While there is slightly more to this, you’ll have to read the paper for that.
Noticing the backdoor path will be enough for our case here.
As with opening backdoor paths, we would expect the resulting estimation to be biased and again, we get what we were hoping for. This time, a lot of the posterior interval even lies on the wrong side of 0. And as before, there is nothing in the summary and diagnostics to warn us of our grave error.
bad_model4 <- brm( y ~ x + z1 + z2 + z4, family = gaussian(), data=data, cores = 4 )
fixef(bad_model4, pars = c("x"))
## Estimate Est.Error Q2.5 Q97.5 ## x -0.03079065 0.1093182 -0.2445922 0.1817436
One last thing we can do with the five models we have on hand is to use some kind of model comparison to figure out, which of those models is the best. For this we will use the elpd based loo function to compare the models on their out of sample predictions.
loo_compare( loo(true_model), loo(bad_model1), loo(bad_model2), loo(bad_model3), loo(bad_model4) )
## elpd_diff se_diff ## bad_model4 0.0 0.0 ## true_model -40.9 7.7 ## bad_model3 -42.2 7.8 ## bad_model1 -61.1 7.7 ## bad_model2 -72.0 8.1
And, we have a clear winner.
bad_model4 has a significant advantage over all
I guess that settles it.
Turns out we have been all wrong,
x has barely any effect on
y and if any,
it is probably negative.
And with that we will go and publish our findings for fame and fortune. But for some reason, when it gets really quiet, there is this wailing from beyond the veil. It almost sounds like the tormented souls of a million ignored DAGs.
Don’t ignore your DAGs, it makes them sad.
Cinelli, Carlos, Andrew Forney, and Judea Pearl. 2020. “A Crash Course in Good and Bad Controls.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3689437.