Original version posted April 12, 2018
We want to know the causal effect of dietary animal fat on the risk of developing cardiovascular disease (CVD). This causal effect is the effect coefficient β1. This coefficient tells us how our expected risk changes if we make the lifestyle intervention of changing the amount of animal fat in our diet. This is a pretty common type of question in human medicine.
To understand how hard it is to estimate this causal effect, let’s pretend that only two things in the world cause cardiovascular disease: 1) the amount of dietary animal fat (the variable Diet) and 2) how long we take off from work to enjoy lunch (the variable Break). Here is a graphical model of the truth.
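library(ggplot2)
library(data.table)
# dummy data so ggplot has x and y columns; all positions come from the constants below
fd <- data.table(x = 1:3, y = 1:3)
# label and (horizontal, vertical) position of each variable
x1 <- 'Diet'
x1.h <- 1
x1.v <- 3
x2 <- 'Break'
x2.h <- 1
x2.v <- 1
y <- 'CVD'
y.h <- 3.5
y.v <- 2
base <- 14 # text size for all labels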
gg <- ggplot(data=fd, aes(x=x, y=y)) +
# variables
annotate(geom="text", x=x1.h, y=x1.v, label=x1, size=base) +
annotate(geom="text", x=x2.h, y=x2.v, label=x2, size=base) +
annotate(geom="text", x=y.h, y=y.v, label=y, size=base) +
# paths
geom_segment(aes(x = x1.h+0.5, y = x1.v - 0.00, xend = y.h-0.5, yend = y.v+0.1), size=2, arrow = arrow(length = unit(0.4,"cm"))) +
geom_segment(aes(x = x2.h+0.7, y = x2.v + 0.00, xend = y.h-0.5, yend = y.v-0.1), size=2, arrow = arrow(length = unit(0.4,"cm"))) +
geom_curve(aes(x = x1.h-0.5, y = x1.v - 0.05, xend = x2.h-0.7, yend = x2.v+0.05), curvature = 0.5, size=2, arrow = arrow(length = unit(0.4,"cm"))) +
geom_curve(aes(xend = x1.h-0.5, yend = x1.v - 0.05, x = x2.h-0.7, y = x2.v+0.05), curvature = -0.5, size=2, arrow = arrow(length = unit(0.4,"cm"))) +
# parameters
annotate(geom="text", x=(x1.h + y.h)/2+0.25, y=(x1.v+y.v)/2+0.25, label="beta[1]", size=base, parse=TRUE) +
annotate(geom="text", x=(x1.h + y.h)/2+0.25, y=(x2.v+y.v)/2-0.25, label="beta[2]", size=base, parse=TRUE) +
annotate(geom="text", x=x1.h-1.25, y=(x1.v+x2.v)/2, label="rho", size=base, parse=TRUE) +
xlim(0-0.5, (y.h + 0.5)) +
ylim(0.5, 3.5) +
theme_void()
gg
Figure 1: The truth. A pretend model of the two variables that causally affect cardiovascular disease risk (CVD).
A single-headed arrow indicates the direction of a causal effect, and the Greek letter β (“beta”) represents the magnitude of the causal effect. The true causal effect of dietary animal fat is β1 and the true effect of lunch break duration is β2. The double-headed arrow indicates a correlation: the correlation between Diet and Break is ρ (the Greek letter “rho”). I haven’t given you numbers for these parameters, but we can come up with a plausible story where β1 is a positive number (the more animal fat in your diet, the higher the risk of CVD), β2 is a negative number (the longer the lunch break, the lower the risk of CVD; yay lunch break!), and ρ is a negative number (people who take short lunch breaks eat fast food such as hamburgers or pepperoni pizza). We can also write this model of the true effects using the equation
$$\mathrm{CVD} = \beta_0 + \beta_1 \mathrm{Diet} + \beta_2 \mathrm{Break}$$
β0 is the intercept: it is the base rate, that is, the rate of CVD in people who eat zero animal fat and take zero break for lunch. A one-unit increase (1 gram per day) of animal fat adds β1 risk to this base rate. If β1<0 then risk is decreased. A one-unit increase (1 minute per day) of lunch break duration adds β2 risk to this base rate. Again, if β2<0 then risk is decreased. This looks like a regression equation, but it isn’t. It is the equation describing the true causal effect of Diet and Break on CVD risk (again, in our pretend world).
Let’s do a pretend study of the effects of dietary animal fat and lunch duration on CVD using an observational design. In this study, we measure daily intake of dietary animal fat in the variable Diet and the duration of the daily lunch break in the variable Break in a bunch of people. And we follow those people over time to see who does and who does not have CVD events.
Given the data we’ve collected, we estimate the effects of Diet and Break on CVD risk using regression. Here is the regression equation:
$$\mathrm{E}(\mathrm{CVD}) = b_0 + b_1 \mathrm{Diet} + b_2 \mathrm{Break}$$
Here is what is super important about the regression equation: If (and this is a big if) dietary animal fat and lunch duration are the only two things in the world that causally affect CVD risk, then the regression coefficients are unbiased estimates of the effect coefficients. This means that the regression coefficients are not systematically off in either direction, and if we do the study on a bunch more people, our regression coefficients will be closer to the truth (the effect coefficients). In our pretend world, then, the regression can be used to get these causal effects.
What if, in our pretend world, it just doesn’t occur to us that lunch duration affects CVD risk and so we don’t bother measuring Break? A causal variable that is correlated with a variable in the regression model but is not itself included is a missing confounder. The consequence is that the regression coefficient b1 of Diet is no longer an unbiased estimate of the causal effect of Diet (β1). In other words, we think we are estimating the causal effect β1 but we are really estimating
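To make this concrete, here is a minimal simulation sketch of the pretend study. The parameter values, the sample size, the normal distributions, and the added random noise term are all my assumptions for illustration (the pretend world above specifies none of them), and CVD is treated as a continuous risk score.

```r
set.seed(1)    # arbitrary seed, only for reproducibility

# Assumed (hypothetical) parameter values for the pretend world
beta0 <- 0.1   # base rate
beta1 <- 0.2   # true effect of Diet
beta2 <- -0.8  # true effect of Break
rho   <- 0.7   # correlation between Diet and Break
n     <- 1e5   # number of people in the pretend study

# Diet and Break: standardized variables with correlation rho
Diet  <- rnorm(n)
Break <- rho * Diet + sqrt(1 - rho^2) * rnorm(n)

# CVD risk from the true model, plus noise that is independent of
# Diet and Break (an assumption, so the fit is not trivially perfect)
CVD <- beta0 + beta1 * Diet + beta2 * Break + rnorm(n)

# Regression that includes both causal variables
fit_full <- lm(CVD ~ Diet + Break)
print(coef(fit_full))  # b1 and b2 should land close to beta1 and beta2
```

Because both causes are in the model, the coefficients for Diet and Break should differ from β1 and β2 only by sampling error, which shrinks as n grows.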
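$$\mathrm{E}[b_1] = \beta_1 + \rho \beta_2$$
where ρβ2 is the omitted variable bias. (This simple form assumes that Diet and Break are measured on scales with the same standard deviation, for example as standardized variables; otherwise ρ is multiplied by the ratio of the standard deviation of Break to that of Diet.)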
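Here is a small simulation sketch of what happens when Break is left out. As before, the parameter values, the normal distributions, and the noise term are assumptions I made up for illustration. Diet and Break are simulated with the same standard deviation, so the expected coefficient is β1 + ρβ2.

```r
set.seed(2)    # arbitrary seed, only for reproducibility

# Assumed (hypothetical) parameter values
beta0 <- 0.1
beta1 <- 0.2   # true effect of Diet
beta2 <- -0.8  # true effect of Break
rho   <- 0.7   # correlation between Diet and Break
n     <- 1e5

# Standardized Diet and Break with correlation rho
Diet  <- rnorm(n)
Break <- rho * Diet + sqrt(1 - rho^2) * rnorm(n)
CVD   <- beta0 + beta1 * Diet + beta2 * Break + rnorm(n)

# Regression that omits Break, the missing confounder
fit_short <- lm(CVD ~ Diet)
b1_short  <- coef(fit_short)[["Diet"]]

# Compare the estimate with the true effect and with beta1 + rho * beta2
print(c(estimate = b1_short,
        truth    = beta1,
        expected = beta1 + rho * beta2))
```

With these assumed values the estimate should sit near β1 + ρβ2 rather than near β1, even though the sample is large.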
Here are some consequences of this:
If the correlation between Diet and Break is zero then ρβ2 is zero and there is no bias. The regression works, yaaay!
If there is no effect of lunch duration on CVD risk then ρβ2 is zero and, again, there is no bias. The regression works, yaaay!
If both ρ and β2 are anything other than zero, there will be omitted variable bias. How bad this bias will be depends on the magnitude of the correlation and the effect of lunch duration. If these are big, then a regression will not come close to estimating the true effect. Here are two examples:
The true effect of Diet is relatively big (β1=0.8), while the true effect of Break is relatively small (β2=−0.2) and the correlation between the two is small (ρ=0.2). The expected regression coefficient for Diet is 0.8 + (0.2 × −0.2) = 0.76, which is pretty close to the true value. But what if the magnitudes are reversed?
The true effect of Diet is relatively small (β1=0.2), while the true effect of Break is relatively big (β2=−0.8) and the correlation between the two is big (ρ=0.7). The expected regression coefficient for Diet is now 0.2 + (0.7 × −0.8) = −0.36. In other words, we think dietary animal fat has a negative effect on CVD; that is, the more fat in the diet, the lower the risk of CVD. This is the opposite of the truth in our pretend world.
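The arithmetic in these two examples can be checked with a few lines of R. The helper name `expected_b1` and the grid of correlations are hypothetical, made up here for illustration; the function just evaluates β1 + ρβ2.

```r
# Expected coefficient of Diet when Break is omitted: beta1 + rho * beta2
# (hypothetical helper; assumes Diet and Break have the same standard deviation)
expected_b1 <- function(beta1, beta2, rho) {
  beta1 + rho * beta2
}

print(expected_b1(beta1 = 0.8, beta2 = -0.2, rho = 0.2))  # 0.76
print(expected_b1(beta1 = 0.2, beta2 = -0.8, rho = 0.7))  # -0.36

# How the expected coefficient drifts from the truth (0.2) as the
# correlation grows, holding beta1 = 0.2 and beta2 = -0.8 fixed
rho_grid <- c(0, 0.2, 0.5, 0.7)
print(data.frame(rho = rho_grid,
                 expected_b1 = expected_b1(0.2, -0.8, rho_grid)))
```

The table makes the sign flip easy to see: with zero correlation the expected coefficient equals the true effect, and it moves further away as ρ increases.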
So here is the deal. Observational studies will always have missing confounders. Consequently, estimates of causal effects in observational studies will always be biased, and we generally won’t know how big this bias is because we don’t know what is missing (if we did, we would include it in the model). And increasing sample size does not decrease this bias. We can guess at what the big confounders are, measure them, include them in a regression model, and hope that any remaining bias is small. More importantly, we can do good fundamental physiology and generate rigorously probed working models for how the potential causal effects cause the outcomes of interest. For some things, we can also