Problem Set 7 Digestion

7.1 Estimating Causal Effects

Think about headlines in human health, performance and disease: red wine decreases colon cancer or coffee increases dementia, or oxygenated water increases marathon performance. The mathematical way to think about these is X -> Y, or “X causes Y”. Importantly, if a scientist says something like “X causes Y”, this does not mean that X is the only cause of Y – other things may also cause Y. For example: vegetarian diet -> low blood cholesterol AND running -> low blood cholesterol AND statins -> low blood cholesterol.

Most importantly, “cause” is not binary (causes vs. doesn’t cause) but has some magnitude (trivially small, small, big, or huge). Here, we use the Greek letter \(\beta\) (“beta”) to indicate effect size.

We are going to use Google Sheets to create fake data that were generated by a known causal process (known \(\beta\)), and then use a statistical model to estimate the causal process (estimate \(\beta\)) from the fake data. The statistical model is regression, which is the principal statistical method used in the biological sciences to estimate causal effects. We are purposefully using abstract notation (X and Y) instead of meaningful variables (dietary cholesterol and atherosclerotic plaque development) because it is good to be able to think abstractly.

7.2 Simulation 1

Open your Google spreadsheet and work through the following steps.

7.2.1 Step 1. Set up the parameters

  1. In column A, cells 2-4, insert “beta_0”, “beta_1”, “E[b1]” (see figure above)
  2. In row 1, columns B and C, insert “True Value”, “Estimate”
  3. In B2, insert a number (it doesn’t matter)
  4. In B3, insert 0.5 (this is the true generating effect of X on Y)
  5. In B4, insert =B3 (this is the expected value of the generating effect of X on Y given the statistical model)

7.2.2 Step 2. Generate fake data

  1. In row 9, columns A-C, insert “ID”, “X”, “Y”
  2. In A10 insert “1”
  3. In B10 insert =normsinv(rand())
  4. In C10 insert =$B$2 + $B$3*B10 + sqrt(1-$B$3^2)*normsinv(rand())
  5. In A11 insert =A10 + 1
  6. Highlight cells B10 and C10. Click on the handle on the lower right corner of the box and drag down 1 row. Your formulas from row 10 should now be in row 11.
  7. Highlight cells A11, B11, C11. Click on the handle on the lower right corner of the box and drag down and down and down until you get to row 1000. You should have copied all three formulas all the way down.

What is Step 2 doing? It is creating fake data. The Y value is caused by three things: the value in cell B2, the product of B3 and X, and a random number. The value in B3 is the contribution of X to Y, or how “X causes Y”, or the “causal effect of X on Y”. If B3 is 0 then there is no causal effect. If B3 is 1 or -1, then the random component is zero.
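The same generating process can be sketched outside the spreadsheet. This is a minimal illustration, not part of the original exercise: the function name `simulate_xy` and the value 2.0 for beta_0 are arbitrary choices made here, and Python's `random.gauss(0, 1)` is assumed to play the role of `normsinv(rand())`.

```python
import math
import random
import statistics


def simulate_xy(beta_0, beta_1, n=991, seed=None):
    """Generate x and y columns the way columns B and C are filled in Step 2."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):  # rows 10 to 1000 give 991 rows
        x = rng.gauss(0, 1)  # stands in for =normsinv(rand())
        # Random component, scaled so it vanishes when beta_1 is 1 or -1.
        noise = math.sqrt(1 - beta_1**2) * rng.gauss(0, 1)
        xs.append(x)
        ys.append(beta_0 + beta_1 * x + noise)
    return xs, ys


# beta_0 = 2.0 is an arbitrary illustration value ("it doesn't matter").
xs, ys = simulate_xy(beta_0=2.0, beta_1=0.5, seed=1)
print(round(statistics.stdev(xs), 2), round(statistics.stdev(ys), 2))
```

Both printed standard deviations should land near 1, which is the same check the spreadsheet performs in the next step.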

You have just created fake data with a known generating mechanism! But it is imperative to check that the equations you entered don’t have bugs. If the equations were entered correctly, the standard deviations of the X and Y columns should both be one. Check this.

7.2.3 Step 3. Fake data check

  1. In A8, insert “sd”
  2. In B8, insert =stdev(B10:B1000)
  3. Copy B8 and paste in C8.

These numbers should be close to 1.0 (something is probably wrong if it is less than 0.95 or more than 1.05). Refresh the spreadsheet by typing command-R (Mac) or control-R (Windows).
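Why should the standard deviation of Y be one? A short derivation follows. It assumes that X and the second random draw are independent, each with variance one. It writes \(\beta_1\) for the value in B3 and \(Z\) for the second draw, a symbol introduced here only for this derivation. The constant in B2 shifts every Y by the same amount and so adds no variance.

\[\begin{equation} \mathrm{Var}(Y) = \beta_1^2\,\mathrm{Var}(X) + (1 - \beta_1^2)\,\mathrm{Var}(Z) = \beta_1^2 + (1 - \beta_1^2) = 1 \end{equation}\]

This is the reason the random component is multiplied by sqrt(1-B3^2): it tops the variance of Y back up to one whatever effect size is chosen.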

7.2.4 Step 4. Does a statistical model recover the known effect?

  1. In C3, insert =slope(C10:C1000, B10:B1000)
  2. In C3, round to three places after the decimal

This is the slope of the regression (the statistical model) of Y on X. It is the estimate of the causal effect. The number should be very close to the true value.

This slope is the regression coefficient b1. The cell labeled “E[b1]” is the “expectation of b1” or the expected value of b1. Your estimate of beta_1 should also be very close to E(b1) since E(b1) is equal to the true generating effect (beta_1).
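For readers who want to see what a slope function computes, here is a minimal sketch of the least-squares slope: the sum of cross-products of X and Y divided by the sum of squares of X, each taken about its mean. The function name `slope` is chosen here to echo the spreadsheet function, and the data are a hypothetical re-creation of Simulation 1 with beta_0 = 2 and beta_1 = 0.5.

```python
import random
import statistics


def slope(ys, xs):
    """Least-squares slope of y on x (argument order follows =slope(Y, X))."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical re-creation of Simulation 1: beta_0 = 2 (arbitrary), beta_1 = 0.5.
rng = random.Random(7)
xs = [rng.gauss(0, 1) for _ in range(991)]
ys = [2.0 + 0.5 * x + (1 - 0.5**2) ** 0.5 * rng.gauss(0, 1) for x in xs]
print(round(slope(ys, xs), 3))  # an estimate of beta_1; varies with the seed
```

Changing the seed plays the same role as refreshing the spreadsheet: each new draw gives a slightly different estimate scattered around the true value.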

7.2.5 What you did

7.2.5.1 … in a nutshell

You generated \(Y\) using a “data generating” mechanism and then, using the available data (X and Y), ran a statistical analysis to see if you could recover this data generating mechanism. The data generating mechanism is the set of two coefficients \(\beta_0\) and \(\beta_1\).

7.2.5.2 the data generating mechanism in a little more detail

The fake data are two variables, X and Y. Y is caused by three things:

\[\begin{equation} y_i = \beta_0 + \beta_1 x_i + \sigma_i \end{equation}\]

The subscript \(i\) refers to the \(i\)th individual (if ID=7 then i=7). The three components generating \(y_i\) are

  1. \(\beta_0\) is “the intercept”; it is common to all \(i\)
  2. \(\beta_1 x_i\) is the product of the effect (\(\beta_1\)) and an individual’s value of \(x\). \(\beta_1\) is the same for all \(i\) but the product is unique to each \(i\).
  3. \(\sigma_i\) is “the error”; this is the random variation due to other factors that “cause” Y but are unique to each \(i\). That is, these factors are not correlated with \(X\).

7.2.6 The model you fit is

\[\begin{equation} y_i = b_0 + b_1 x_i + e_i \end{equation}\]
  1. \(b_0\) is the intercept
  2. \(b_1\) is the slope
  3. \(e_i\) is the residual (the difference between the modeled value and the actual value)
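The three pieces of the fitted model can be computed directly. The sketch below is an illustration only: the function name `fit_line` and the four data points are made up here, not taken from the exercise.

```python
import statistics


def fit_line(xs, ys):
    """Return intercept b0, slope b1, and the residuals e_i of y on x."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x  # the fitted line passes through the two means
    residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    return b0, b1, residuals


# Made-up data for illustration.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
b0, b1, e = fit_line(xs, ys)
print(round(b0, 2), round(b1, 2), [round(v, 2) for v in e])
```

With an intercept in the model, the residuals always sum to zero (up to rounding), which is a quick sanity check on any fit.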

Notice that the statistical model is the same as the generating model. It is not at all surprising that the statistical model “recovers” the data generating mechanism (or the “true values”). The problem in science is, we don’t know the data generating model so we don’t know the correct statistical model. This will hopefully make more sense in the next exercise.

7.2.7 Step 5. Set up the parameters

  1. In column E, rows 2-6, insert the labels “beta_0”, “beta_1”, “beta_2”, “r”, “E(b_1)”
  2. In row 1, columns F and G, insert the labels “True Value”, “Estimate”
  3. In F2, insert a number (it doesn’t matter; this is the baseline value of the generating model)
  4. In F3, insert 0.5 (this is the true generating effect of X1 on Y)
  5. In F4, insert -0.7 (this is the true generating effect of X2 on Y)
  6. In F5, insert 0.7 (this is the true correlation between X1 and X2)

7.2.8 Step 6. Generate fake data

  1. In row 9, columns E-H, insert “Z”, “X1”, “X2”, “Y”
  2. In E10 insert =normsinv(rand())
  3. In F10 insert =sqrt($F$5)*$E10 + sqrt(1-$F$5)*normsinv(rand())
  4. Copy the formula from F10 and paste it into G10
  5. In H10, insert =$F$2 + $F$3*F10 + $F$4*G10 + sqrt(1-$F$3^2 - $F$4^2 - 2*$F$3*$F$4*$F$5)*normsinv(rand())
  6. Highlight cells E10 through H10. Click on the handle on the lower right corner of the box and drag down and down and down until you get to row 1000. You should have copied all four formulas all the way down.

What is Step 6 doing? Like Step 2 above, it is creating fake data. But here the \(Y\) value is caused by five things:

\[\begin{equation} y_i = \beta_0 + \beta_1 x1_i + \beta_2 x2_i + \sigma_i \end{equation}\]
  1. \(\beta_0\) is “the intercept”; it is common to all \(i\)
  2. \(\beta_1 x1_i\) is the product of the effect (\(\beta_1\)) and an individual’s value of \(x1\). \(\beta_1\) is the same for all \(i\) but the product is unique to each \(i\). This is the causal or generating effect of \(X1\) on \(Y\)
  3. \(\beta_2 x2_i\) is the product of the effect (\(\beta_2\)) and an individual’s value of \(x2\). \(\beta_2\) is the same for all \(i\) but the product is unique to each \(i\). This is the causal or generating effect of \(X2\) on \(Y\)
  4. \(\sigma_i\) is “the error”; this is the random variation due to other factors that “cause” Y but are unique to each \(i\). That is, these factors are not correlated with \(X1\) or \(X2\).

What is the 5th cause of Y?

  5. \(r\) is the correlation between \(X1\) and \(X2\). A correlation is a measure of association and always satisfies \(-1 \le r \le 1\).
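Step 6 can also be mirrored in code. This is a sketch under stated assumptions, not the exercise itself: the names `simulate_two_causes` and `correlation` are invented here, beta_0 = 2.0 is arbitrary, and `random.gauss(0, 1)` stands in for `normsinv(rand())`. Because the construction takes the square root of \(r\), this particular recipe only works for correlations between 0 and 1.

```python
import math
import random
import statistics


def simulate_two_causes(beta_0, beta_1, beta_2, r, n=991, seed=None):
    """Generate the X1, X2, and Y columns as in Step 6 (needs 0 <= r <= 1)."""
    rng = random.Random(seed)
    # Scale of the random component, chosen so that Y has variance 1.
    resid_sd = math.sqrt(1 - beta_1**2 - beta_2**2 - 2 * beta_1 * beta_2 * r)
    x1s, x2s, ys = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)  # shared factor Z (column E)
        x1 = math.sqrt(r) * z + math.sqrt(1 - r) * rng.gauss(0, 1)
        x2 = math.sqrt(r) * z + math.sqrt(1 - r) * rng.gauss(0, 1)
        x1s.append(x1)
        x2s.append(x2)
        ys.append(beta_0 + beta_1 * x1 + beta_2 * x2 + resid_sd * rng.gauss(0, 1))
    return x1s, x2s, ys


def correlation(us, vs):
    """Pearson correlation, the quantity =correl() reports."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    suv = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    suu = sum((u - mu) ** 2 for u in us)
    svv = sum((v - mv) ** 2 for v in vs)
    return suv / math.sqrt(suu * svv)


x1s, x2s, ys = simulate_two_causes(2.0, 0.5, -0.7, 0.7, seed=5)
print(round(correlation(x1s, x2s), 2), round(statistics.stdev(ys), 2))
```

X1 and X2 are correlated only because both contain the same Z; neither one causes the other.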

7.2.9 Step 7. Fake data check

  1. Check the standard deviation of X1, X2, and Y as in Step 3 above. All of these should be close to 1.0
  2. In G5, insert =correl(F10:F1000, G10:G1000). This should be close to the true correlation in F5 (the starting correlation is 0.7, so the estimate should be 0.67-0.73)
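Both targets can be checked by hand. The derivation below assumes that Z and the two extra random draws are independent standard normal variables. The extra draws are called \(U_1\) and \(U_2\) here; those names are introduced only for this derivation. The shared factor Z is the only source of covariance between X1 and X2:

\[\begin{equation} \mathrm{Cov}(X1, X2) = \mathrm{Cov}(\sqrt{r}\,Z + \sqrt{1-r}\,U_1,\ \sqrt{r}\,Z + \sqrt{1-r}\,U_2) = r\,\mathrm{Var}(Z) = r \end{equation}\]

Since \(\mathrm{Var}(X1) = \mathrm{Var}(X2) = r + (1 - r) = 1\), the covariance is also the correlation. For Y, the variance contributed by X1 and X2 together is \(\beta_1^2 + \beta_2^2 + 2\beta_1\beta_2 r\), and the random component supplies exactly the remainder:

\[\begin{equation} \mathrm{Var}(Y) = \beta_1^2 + \beta_2^2 + 2\beta_1\beta_2 r + (1 - \beta_1^2 - \beta_2^2 - 2\beta_1\beta_2 r) = 1 \end{equation}\]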

7.2.10 Step 8. Does a statistical model recover the known effect?

  1. In G3, insert =slope(H10:H1000, F10:F1000)
  2. In G3, round to three places after the decimal

As in Step 4 above, this is the slope of the regression (the statistical model) of Y on X1. It is the estimate of the causal effect of X1. The number will not be very close to the true value, at least using the default values specified in Step 5.

This slope is the regression coefficient b1. The cell labeled “E(b_1)” is the “expectation of b1” or the expected value of b1 given the statistical model. Here is what is happening:

  1. E(b1) should not equal the true value of beta_1 (at least using default values in Step 5), unlike in Simulation 1.
  2. Your estimate of beta_1 should be very close to E(b1) but not to beta_1

What’s going on is the whole point of this exercise.
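One way to see the gap is to fit two models to the same fake data: Y on X1 alone, and Y on X1 and X2 together. The sketch below is an illustration under assumptions, not the author's stated method. It relies on the standard omitted-variable result: when all variables have unit variance and X2 is left out of the model, the expected slope is \(\beta_1 + r\beta_2\). With the default values this is 0.5 + 0.7 × (-0.7) = 0.01. The function names are invented here, and beta_0 is set to zero because it does not affect the slopes.

```python
import math
import random
import statistics


def centered_sums(us, vs):
    """Sum of cross-products of two columns after centering each on its mean."""
    mu, mv = statistics.fmean(us), statistics.fmean(vs)
    return sum((u - mu) * (v - mv) for u, v in zip(us, vs))


def compare_models(beta_1=0.5, beta_2=-0.7, r=0.7, n=991, seed=None):
    """Return (b1 from Y on X1 alone, b1 and b2 from Y on X1 and X2)."""
    rng = random.Random(seed)
    resid_sd = math.sqrt(1 - beta_1**2 - beta_2**2 - 2 * beta_1 * beta_2 * r)
    x1s, x2s, ys = [], [], []
    for _ in range(n):
        z = rng.gauss(0, 1)
        x1 = math.sqrt(r) * z + math.sqrt(1 - r) * rng.gauss(0, 1)
        x2 = math.sqrt(r) * z + math.sqrt(1 - r) * rng.gauss(0, 1)
        x1s.append(x1)
        x2s.append(x2)
        ys.append(beta_1 * x1 + beta_2 * x2 + resid_sd * rng.gauss(0, 1))
    s11, s22 = centered_sums(x1s, x1s), centered_sums(x2s, x2s)
    s12 = centered_sums(x1s, x2s)
    s1y, s2y = centered_sums(x1s, ys), centered_sums(x2s, ys)
    b1_alone = s1y / s11  # what =slope(H10:H1000, F10:F1000) estimates
    det = s11 * s22 - s12**2
    b1_both = (s22 * s1y - s12 * s2y) / det  # two-predictor least squares
    b2_both = (s11 * s2y - s12 * s1y) / det
    return b1_alone, b1_both, b2_both


b1_alone, b1_both, b2_both = compare_models(seed=3)
print("Y on X1 alone:", round(b1_alone, 3))
print("Y on X1 and X2:", round(b1_both, 3), round(b2_both, 3))
```

The one-predictor slope should hover near 0.01 rather than 0.5. X1 carries part of the negative effect of its correlated partner X2, while the two-predictor fit should land near the generating values 0.5 and -0.7.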