# Chapter 13 Issues in inference

## 13.1 Replicated experiments – include $$\texttt{Experiment}$$ as a random factor (better than one-way ANOVA of means)

Replicated experiments are the norm in bench biology. Researchers most commonly 1) analyze the mean response from each replicate, 2) pool all the data, ignoring that they come from multiple, independent experiments, or 3) analyze a “representive” experiment.

The best practice for analyzing data from multiple, independent experiments, one that is almost never used outside of certain subfields where pipelines for this alternative exist, is a linear mixed model with the experiment id added as a random factor. A mixed-effect or repeated measures ANOVA is equivalent to one version of this kind of linear model. The analysis of data from pooled independent experiments with a linear mixed model is a best practice because it increases precision and power relative to a t-test/ANOVA of experiment means and avoids pseudoreplication.

It’s important to understand the assumptions of the different models for analyzing replicated experiments. In a study with multiple, replicated experiments, the replications are “independent” – solutions are re-made, instruments have drifted and are re-calibrated, the grad student doing the work has lived another day. Each experiment has a unique set of factors that contribute to the error variance of the response variable. All measures within an experiment share the component of the error variance unique to that experiment and, as a consequence, the error (residuals) within an experiment are more similar to each other than they are to the residuals between experiments. This is a violation of the independent sampling assumption of the linear model and the lack of independence creates correlated error. The consequences of this on the three analyses are:

1. Linear model/t-test/ANOVA of pooled data ignoring experiment. This model assumes perfect replication between experiments, that essentially, the world was exactly the same for each. If this assumption is true, each replicate within each experiment is an independent data point. A study with n = 5 mice per treatment replicated five times has n = 25. If there are two treatment levels, the t-test will have 48 df. But, if the measures within an experiment share a common source contributing to the variance of the response within that experiment (the temperature in the room was 1°C cooler causing a machine to give a slightly different reading), there are not 25 independent measures because the five measures within each experiment are not independendent. The true df will be less than 48, how much less depends on the magnitude of the correlated error resulting from the non-independence. Pooling data across experiments with correlated error creates an incorrectly small standard error of the mean, resulting in incorrectly narrow confidence intervals and incorrectly small p-values. Pooling data inflates false discovery.
2. Linear model/t-test/ANOVA of experiment means. Analysis of the means violates the independence assumption because the means of the treatment levels within an independent experiment are expected to share common sources of measure variability of that experiment. Instead of inflating false discovery, this correlated error generally works against discovery. The correlated error among the means can be modeled to increase precision and gain power.
3. Linear model of the experiment means with $$\texttt{experiment_id}$$ added as a random factor to account for correlated error due to the unique sources of variance within each experiment. This model is equivalent to a mixed effect ANOVA of the full data set.

Reproducibility blues. Researchers are improving best practices by archiving data on public servers. Nevertheless, it is extremely rare to find the full data for many experiments. For analyses of the means from multiple experiments, researchers almost always archive the means – the data are means pooled – and not the full set of measurements from each experiment. Or, the full data are archived but the experiment id is not included – the data are complete pooled. There is no way to truly replicate a result if only the means are archived (were the means computed correctly?). And, there is no way to re-analyze the data with alternative models that require the full data with the experiment identified.

### 13.1.1 Multiple experiments Example 1 (wound healing Exp4d)

data source

The experiment was designed to estimate the effect of a STAT3(https://en.wikipedia.org/wiki/STAT3){target="_blank"} knockout on the number of cells expressing the cytoskeletal protein vimentin in the wounded tissue.

Following Fisher, we can be more confident of an effect if we consistently get small p-values from replicate experiments.

ids <- levels(exp4d[, experiment_id])
n_exp <- length(ids)
p <- list()
for(exp_i in ids){
exp4d_m1 <- lm(vimentin_cells ~ genotype,
data = exp4d[experiment_id == exp_i,]) # subset
exp4d_m1_test <- coef(summary(exp4d_m1))
p[[exp_i]] <- exp4d_m1_test["genotypeSTAT3_ko", "Pr(>|t|)"]
}

data.table(
experiment = ids,
p = unlist(p)
) %>%
kable(digits = 3,
caption = "P-values from the three independent experients of the exp4d data.") %>%
kable_styling()
Table 13.1: P-values from the three independent experients of the exp4d data.
experiment p
Exp1 0.023
Exp2 0.022
Exp3 0.017

There are methods for combining p-values but we do not need to do this because we have the raw data.

### 13.1.2 Models for combining replicated experiments

A linear mixed model for combining replicated experiments is most easily fit using the aov_4 function from the afex package. The aov_4 function fits a mixed-effects ANOVA model, which is a factorial ANOVA that includes both fixed and random factors. In the Experiment 4d data, the variable $$\texttt{genotype}$$ is a fixed factor and the variable $$\texttt{experiment_id}$$ is a random factor. More information on fixed and random factors is given in the Linear Mixed Model chapter.

The mixed-effect ANOVA models fit by aov_4 are not linear mixed models but inference is equivalent in balanced designs – I give more detail below. The aov_4 function uses a linear mixed model formula interface and is a useful substitute for actually fitting the equivalent linear mixed model because fitting a linear mixed model can be sensitive to the data and sometimes fails, even if the mixed-effect ANOVA model “works.”

fit the model

exp4d_m1 <- aov_4(vimentin_cells ~ genotype + (genotype | experiment_id),
data = exp4d)

inference from the model

exp4d_m1_emm <- emmeans(exp4d_m1, specs = "genotype")

exp4d_m1_pairs <- contrast(exp4d_m1_emm,
method = "revpairwise") %>%
summary(infer = TRUE)
exp4d_m1_pairs %>%
kable(digits = c(1,2,3,1,2,2,1,5)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
STAT3_ko - STAT3_wt -3.5 0.048 2 -3.7 -3.29 -72.5 0.00019

plot the model

# wrangling to get this to work
# ggplot_the_response doesn't like the fit object from afex. That's okay,
# we use it only to get the individual points
exp4d_m0 <- lm(vimentin_cells ~ genotype,
data = exp4d)
exp4d_m0 <- lm(vimentin_cells ~ genotype * experiment_id,
data = exp4d)
exp4d_m0_emm <- emmeans(exp4d_m0, specs = c("genotype", "experiment_id"))

ggplot_the_response(fit = exp4d_m0,
fit_emm = exp4d_m0_emm,
fit_pairs = exp4d_m1_pairs,
y_label = "vimGFP+ cell count",
palette = pal_okabe_ito_blue) ### 13.1.3 Understanding Model exp4d_m1

1. The univariate and multivariate models fit by AOV_4 gives researchers an easy tool in R for replicating ANOVA results from other software, such as Graphpad Prism or JMP. But it is really the contrasts that we want, not the ANOVA table, and we can get these contrasts by passing the aov_4 object to emmeans.
2. The formula in Model exp4d_m1 is misleading. The formula specifies a linear mixed model with a random intercept and a random slope (see Models with random factors – Blocking and pseudoreplication but this is not the model that is fit.
3. The aov_4 function actually fits two models.
• A linear model of the aggregated data with adjusted degrees of freedom (the “aov” or univariate model). “Aggregated” data are the means of each treatment level within each experiment (the means-pooled data).
• A multivariate linear model (the multivariate model) of the aggregated data. A multivariate model has multiple Y variables. For the exp4d data, each experiment (containing the means for each treatments) is a seperate $$Y$$ variable.
1. The default output is the univariate model but the researcher can choose the multivariate model.
exp4d_m1_pairs_multi <- emmeans(exp4d_m1,
specs = "genotype",
model = "multivariate") %>%
contrast(method = "revpairwise") %>%
summary(infer = TRUE)
Table 13.2: Contrasts from the univariate and multivarate models fit to the exp4d data.
contrast estimate SE df lower.CL upper.CL t.ratio p.value
univariate model (m1)
STAT3_ko - STAT3_wt -3.5 0.048 2 -3.7 -3.29 -72.52 0.00019
multivarite model (m2)
STAT3_ko - STAT3_wt -3.5 0.048 2 -3.7 -3.29 -72.52 0.00019
1. The univariate and multivariate models are the same when there is only a single factor with two treatment levels, as with these data. Otherwise the two differ. See Section 16.2.10 in the Linear Mixed Models chapter.
2. The univariate model is equivalent to the linear mixed model exp4d_m2 (see next section) of the aggregated data if there are no missing data.
3. If there are the same number of replicates within all treatment by experiment combinations, then the univariate model is equivalent to the linear mixed model exp4d_m3 of the full (not aggregated) data.

### 13.1.4 The univariate model is equivalent to a linear mixed model of the aggregated data (Model exp4d_m2)

aggregate the data

# compute the experiment means
exp4d_means <- exp4d[, .(vimentin_cells = mean(vimentin_cells)),
by = .(genotype, experiment_id)]

fit the model

exp4d_m2 <- lmer(vimentin_cells ~ genotype + (1 | experiment_id),
data = exp4d_means)
Table 13.3: Contrasts from the univariate model fit by AOV_4 (exp4d_m1) and the linear mixed model of the aggregated data (exp4d_m2)
contrast estimate SE df lower.CL upper.CL t.ratio p.value
univariate model (exp4d_m1)
STAT3_ko - STAT3_wt -3.5 0.048 2 -3.7 -3.29 -72.52 0.00019
linear mixed model (exp4d_m2)
STAT3_ko - STAT3_wt -3.5 0.048 2 -3.7 -3.29 -72.52 0.00019

Notes

1. The univariate model is equivalent to Model exp4d_m2 only if there is no missing data (for example, one treatment has no data in experiment 2). 2.The factor $$\texttt{experiment_id}$$ is added as a random intercept using the formula notation (1 | experiment_id). This random intercept models common variation shared within experiments.
2. Random intercepts were introduced in Example 1 (exp1b) – a paired t-test is a special case of a linear mixed model in the Violations chapter and are described in more detail in Example 1 – A random intercepts and slopes explainer (demo1) in the Linear Mixed Model chapter.

### 13.1.5 A linear mixed model of the full data

exp4d_m3 <- lmer(vimentin_cells ~ genotype +
(1|experiment_id) +
(1|experiment_id:genotype),
data = exp4d)

Notes

1. Model exp4d_m3 is equivalent to the univariate mixed-effect ANOVA (exp4d_m1) and the linear mixed model of the aggregated data (exp4d_m2) if there are the same number of subsamples in every genotype by experiment combination.
2. Model exp4d_m3 is a linear mixed model with the random factor added as two random intercepts.
3. The (1|experiment_id) intercept models common variation shared within experiments.
4. The (1|experiment_id:genotype) is an interaction intercept that models common variation shared within experiment by genotype combinations.
5. The function lmer returns the “boundary (singular) fit” message, which is a warning to researchers to use caution if using the model for inference. This message will not be rare when the the number of experiments is small (2-4).

Note 5 raises the question, why bother with the linear mixed model of the full data when the aov_4 function always “works” and is the same when the design is balanced? The short answer is, just use the aov_4 model with the default (univariate model) outputs. The long answer is, read chapter Models with random factors – linear mixed models.

### 13.1.6 Analysis of the experiment means has less precision and power

Compare inference from the mixed-ANOVA exp4d_m1 or the linear mixed model exp4d_m2 (these are the same) to a linear model of the experiment means (equivalent to a t-test/ANOVA)

exp4d_m4 <- lm(vimentin_cells ~ genotype, data = exp4d_means)
Table 13.4: Contrasts from the univariate model fit by AOV_4 (exp4d_m1) and the linear model of the experiment means (exp4d_m4)
contrast estimate SE df lower.CL upper.CL t.ratio p.value
univariate model (exp4d_m1)
STAT3_ko - STAT3_wt -3.5 0.048 2 -3.70 -3.29 -72.52 0.00019
linear model (exp4d_m4)
STAT3_ko - STAT3_wt -3.5 0.486 4 -4.84 -2.15 -7.20 0.00197

Notes

1. Model exp4d_m4 violates the independence assumption because residuals within an experiment are expected to be more similar. Analyzing the experiment means does not fix this. With the aggregated data, this correlated error masks the true effect.
2. The SE for the estimate of the $$\texttt{genotype}$$ effect is much smaller for the mixed ANOVA/random intercept model than for the fixed effect model of experiment means.
3. The degrees of freedom for the fixed effect model is slighly larger than that for the mixed model exp4d_m1. Degrees of freedom is a measure of independent variation – the fixed effect model assumes that every measure is independent. They aren’t. The inflated degrees of freedom is pseudoreplication.
4. Despite the (slightly) inflated degrees of freedom of the fixed effects model, the mixed model has much more narrow CIs and smaller p-value. This is because of item 2.
5. The increased precision and power of the mixed model relative to the fixed effect model of the experiment means is a general result and not one specific to this example.

### 13.1.7 Don’t do this – a t-test/fixed-effect ANOVA of the full data

exp4d_m5 is a fixed effect model (there are no added random factors).

exp4d_m5 <- lm(vimentin_cells ~ genotype, data = exp4d)
(#tab:issues-exp4d_m5-show)Contrasts from the univariate model fit by AOV_4 (exp4d_m1) and the linear model of the full data (exp4d_m5)
contrast estimate SE df lower.CL upper.CL t.ratio p.value
univariate model (exp4d_m1)
STAT3_ko - STAT3_wt -3.50 0.048 2 -3.70 -3.29 -72.52 0.00019
linear model (exp4d_m5)
STAT3_ko - STAT3_wt -3.49 0.789 58 -5.07 -1.91 -4.42 0.00004

Notes

1. Model exp4d_m4 violates the independence assumption because residuals within an experiment are expected to be more similar.
2. The SE for the estimate of the $$\texttt{genotype}$$ effect is much smaller for the mixed ANOVA/random intercept model than for the fixed effect model but the degrees of freedom for the fixed effect model is much greater than that for the mixed model exp4d_m1. Degrees of freedom is a measure of independent variation – the fixed effect model assumes that every measure is independent. They aren’t. The inflated degrees of freedom is pseudoreplication.
3. The consequence of the pseudoreplication is a too-small p-value. Pseudoreplication leads to more false discovery.

## 13.2 Comparing change from baseline (pre-post) Figure 13.1: Serum DPP4 levels at baseline and 3 months in two groups. (A) The actual experiment: the two treatment levels are randomly allocated to treatment at baseline. We are interested in the effect of DMAB at 3 months and use the baseline measure to increase precision. (B) A thought experiment: the two treatment levels are different groups at baseline. We give DMAB to both groups at baseline to estimate the difference in response.

In a longitudinal experiment, the response variable is measured on the same individuals both before (baseline) and after (post-baseline) some condition is applied. Experiments in which only one post-baseline measure is taken are known as pre-post experiments. For simplicity, I call the baseline measure $$\texttt{pre}$$ and the post-baseline measure $$\texttt{post}$$.

Researchers often analyze pre-post data by comparing the change score ($$\texttt{post} - \texttt{pre}$$) between the groups using a $$t$$-test or one-way ANOVA (or a non-parametric equivalent such as Mann-Whitney-Wilcoxon). The linear model form of the t-test is

$\texttt{post} - \texttt{pre} = \beta_{0} + \beta_{1}(\texttt{treatment}_{\texttt{tr}}) + \varepsilon$

This is the change score model. A similar analysis is a t-test or ANOVA using the percent change from baseline as the response:

$\frac{\texttt{post} - \texttt{pre}}{\texttt{pre}} \times 100 = \beta_{0} + \beta_{1}(\texttt{treatment}_{\texttt{tr}}) + \varepsilon$

The best practice for how to estimate the treatment effect depends very much on when the treatment is applied relative to the baseline ($$\texttt{pre}$$) measure. Consider the two experiments in Figure 13.1

1. In 13.1A, the individuals are randomized into the “Placebo” and the “Denosumab” groups. Plasma DPP4 is measured at baseline, the treatment is applied, and, three months later, plasma DPP4 is re-measured. The treatment is “Denosumab” and we want to compare this to “Placebo.” The key feature of this design is the individuals are from the same initial group and have been treated the same up until the baseline measure. Consequently, the expected difference in means for any measure at baseline is zero.

2. In 13.1B, the treatment of interest (knockout) was applied prior to baseline, DPP4 is measured at baseline without Denosumab and then both groups are given Denosumab and measured again three months later. The treatment is not Denosumab but “knockout” and we want to compare this to “wild type” in the two different conditions. The key here is that individuals are randomly sampled from two different groups (wild type and knockout). Consequently, we cannot expect the expected difference in means for any measure at baseline to be zero.

The best practice for Design 2 is the change score model given above (or equivalents discussed below). The best practice for Design 1 is not the change score model but a linear model in which the baseline measure is added as a covariate.

$\texttt{post} = \beta_{0} + \beta_{1}(\texttt{treatment}_{\texttt{tr}}) + \beta_{2}(\texttt{pre}) + \varepsilon$

This model is commonly known as the ANCOVA model (Analysis of Covariance) even if no ANOVA table is generated. The explanation for this best practice is given below, in the section regression to the mean. The ANCOVA linear model is common in clinical medicine and pharmacology, where researchers are frequently warned about regression to the mean from statisticians. By contrast, the ANCOVA linear model is rare in basic science experimental biology. The analysis of linear models with added covariates is the focus of the chapter Linear models with added covariates.

Alert! If the individuals are sampled from the same population and treatment is randomized at baseline, do not test for a difference in means of the response variable at baseline and then use the ANCOVA linear model if $$p > 0.05$$ and the change score model $$p < 0.05$$. The best practice is only a function of the design, which leads to the expectation of the difference in means at baseline.

1. If the individuals are sampled from the same population and treatment is randomized at baseline, the expected difference at baseline is zero.
1. If the individuals are sampled from the two populations because the treatment was applied prior to baseline, we cannot expect the difference at baseline to be zero (even if the treatment is magic).
• Use the change score model.
• What can go wrong if we use the ANCOVA linear model? Two things. First, the ANCOVA linear model computes a biased estimate of the true effect, meaning that as the sample size increases the estimate does not converge on the true value but the true value plus some bias. Second, the change score has more power than the ANCOVA linear model under the assumption of sampling from different populations at baseline.

Alert! Researchers also use two-way ANOVA or repeated measures ANOVA to analyze data like these. Two-way ANOVA is invalid because the multiple measures on each individual violates the independence assumption. Repeated measures ANOVA will give equivalent results for a pre-post experiment but not for better practice methods that are available when there are more than one post-baseline measure.

The better practice methods are linear models for correlated error, including GLS and linear mixed models. For a pre-post design, these models give equivalent results to the change score model but also allows the estimate of additional effects of interest (see below). For more detail on the analysis of pre-post experiments, see the Linear models for longitudinal experiments – I. pre-post designs chapter.

### 13.2.1 Pre-post example 1 (DPP4 fig4c)

The data in the left panel of Figure 13.1 are from an experiment to estimate the effect of Denosumab on the plasma levels of the enzyme DPP4 in humans. Denosumab is a monocolonal antibody that inhibits osteoclast maturation and survival. Osteoclasts secrete the enzyme DPP4.

#### 13.2.1.1 For the ANCOVA linear model, the data need to be in wide format

In the ANCOVA linear model, the baseline measure is added as a covariate and thought of as a separate variable and not a “response.” This makes sense – how could it be a “response” to treatment when the treatment hasn’t been applied? This means the baseline and post-baseline measures of DPP4 have to be in separate columns of the data (Table 13.5.

Table 13.5: First six rows of the DPP4 data in wide format showing the baseline and post-baseline measures of DPP4 as separate variables.
treatment DPP4_baseline DPP4_post id
Denosumab 436 405 human_1
Denosumab 434 392 human_2
Denosumab 534 480 human_3
Denosumab 317 266 human_4
Denosumab 440 397 human_5
Denosumab 336 370 human_6

#### 13.2.1.2 Fit the ANCOVA model

m1 <- lm(DPP4_post ~ treatment + DPP4_baseline,
data = fig4c)

#### 13.2.1.3 Inference

The coefficient table

m1_coef <- cbind(coef(summary(m1)),
confint(m1))

m1_coef %>%
kable(digits = c(1,2,2,4,1,1)) %>%
kable_styling()
Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 70.4 24.85 2.83 0.0070 20.2 120.5
treatmentDenosumab -27.0 10.25 -2.63 0.0117 -47.7 -6.3
DPP4_baseline 0.8 0.06 14.08 0.0000 0.7 0.9

Notes

1. Here, we care only about the $$\texttt{treatmentDenosumab}$$ row. This is the estimated effect of Denosumab on serum DPP4 adjusted for baseline (do not use the word “control” as the baseline values were not controlled in any manipulative sense).
2. Chapter Linear models with added covariates explains the interpretation of these coefficients in more detail.

The emmeans table

m1_emm <- emmeans(m1, specs = "treatment")

m1_emm %>%
summary() %>%
kable(digits = c(1,2,2,0,1,1)) %>%
kable_styling()
treatment emmean SE df lower.CL upper.CL
Placebo 403.5 7.09 43 389.2 417.8
Denosumab 376.5 7.40 43 361.6 391.4

Notes

1. The means are conditional on treatment and a value of $$\texttt{DPP4_baseline}$$. emmeans uses the grand mean of $$\texttt{DPP4_baseline}$$ to compute the conditional means. The conditional means are adjusted for baseline DPP4. Note that these means are not equal to the sample means.

The contrast table

m1_pairs <- contrast(m1_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

m1_pairs %>%
kable(digits = c(1,3,2,0,1,1,1,3)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
Denosumab - Placebo -27.003 10.25 43 -47.7 -6.3 -2.6 0.012

Notes

1. As in the coefficient table, the contrast is the estimated effect of Denosumab on serum DPP4. It is the difference in means adjusted (not controlled!) for baseline values.
ggplot_the_model(m1,
m1_emm,
m1_pairs,
y_label = "Serum DPP4 (ng/mL)",
effect_label = "Effect (ng/mL)",
palette = pal_okabe_ito_blue,
legend_position = "none") Figure 13.2: Estimated effect of Denosumab on serum DPP4 relative to placebo.

Notes

1. The treatment means in Figure 13.2 are conditional means adjusted for the baseline measure and are, therefore, not equal to the sample means. The estimated effect is the difference between the conditional means and not the sample means and the inferential statistics (CI, p-value) are based on this difference between the conditional and not the sample means.

### 13.2.2 What if the data in example 1 were from from an experiment where the treatment was applied prior to the baseline measure?

#### 13.2.2.1 Fit the change score model

m1 <- lm(DPP4_post - DPP4_baseline ~ genotype,
data = fig4c_fake)
m1_emm <- emmeans(m1, specs = "genotype")
m1_pairs <- contrast(m1_emm,
method = "revpairwise")
m1_pairs %>%
kable() %>%
kable_styling()
contrast estimate SE df t.ratio p.value
ko - wt -25.83333 11.53394 44 -2.239766 0.0302079

Notes

1. The change score is created on the LHS of the model formula. Alternatively, the change score could be created as a variable in the fig4c_fake data.table using fig4c_fake[, change := DPP4_post - DPP4_baseline]. Making the change in the model formula shows the flexibility of the model formula method of fitting linear models in R.

#### 13.2.2.2 Rethinking a change score as an interaction

If the treatment is randomized at baseline, a researcher should focus on the effect of treatment (the difference between the post-basline measures), adjusted for the baseline measures. The addition of the baseline variable as a covariate increases the precision of the treatment effects and the power of the significance test.

If the treatment is generated prior to baseline, a researcher should focus on the difference in the change from baseline to post-baseline, which is

$effect = (post_{ko} - pre_{ko}) - (post_{wt} - pre_{wt})$

This is the difference in change scores, which is a difference of differences. This difference of differences is the interaction effect between the genotype (“wt” or “ko”) and the time period of the measurement of DPP4 (baseline or post-baseline). The more usual way to estimate interaction effects is a linear model with two crossed factors, which is covered in more detail in the chapter # Linear models with two categorical X – Factorial designs (“two-way ANOVA”). The interaction effect (equal to the difference among the mean of the change scores) can be estimated with the model

$\texttt{dpp4} = \beta_{0} + \beta_{1}(\texttt{genotype}_{\texttt{ko}}) + \beta_{2}(\texttt{time}_{\texttt{post}}) + \beta_{3}(\texttt{genotype}_{\texttt{ko}} \times \texttt{time}_{\texttt{post}}) + \epsilon$

The R script for this looks like this

m2_fixed <- lm(dpp4 ~ genotype*time,
data = fig4c_fake_long)

This model has two factors ($$\texttt{genotype}$$ and $$\texttt{time}$$), each with two levels. The two levels of $$\texttt{time}$$ are “pre” and “post.” $$genotype_{ko}$$ is an indicator variable for “ko” and $$time_{post}$$ is an indicator variable for “post”

Don’t fit this model – the data violate the independence assumption! This violation arises because $$\texttt{dpp4}$$ is measured twice in each individual and both these measures are components of the response (both $$\texttt{dpp4}$$ measures are stacked into a single column). This violation doesn’t arise in the ANCOVA linear model because the baseline measures are a covariate and not a response (remember that the independence assumption only applies to the response variable).

Alert! It is pretty common to see this model fit to pre-post and other longitudinal data. The consequence of the violation is invalid (too large) degrees of freedom for computing standard errors and p-values. This is a kind of pseudoreplication.

To model the correlated error due to the two measures per individual, we use a linear mixed model using $$\texttt{id}$$ as the added random factor. Linear mixed models were introduced in the Violations of independence, homogeneity, or Normality chapter and are covered in more detail in the Models with random factors – Blocking and pseudoreplication chapter.

m2 <- lmer(dpp4 ~ genotype*time + (1|id),
data = fig4c_fake_long)
m2_emm <- emmeans(m2,
specs = c("genotype", "time"),
lmer.df = "Satterthwaite")
m2_ixn <- contrast(m2_emm,
interaction = "revpairwise")
m2_ixn %>%
kable() %>%
kable_styling()
genotype_revpairwise time_revpairwise estimate SE df t.ratio p.value
ko - wt post - baseline -25.83333 11.53394 44 -2.239766 0.0302079

Notes

1. This is the same result as that for the change score score model.
2. The interaction effect is one of the coefficients in the model but to get the same CIs as those in the change score model, we need to use Satterthwaite’s formula for the degrees of freedom. We pass this to the emmeans function using lmer.df = "Satterthwaite"
3. The interaction contrast is computed using the contrast function but using interaction = "revpairwise" instead of method = "revpairwise".

#### 13.2.2.3 The linear mixed model estimates additional effects that we might want

While the linear mixed model and change score model give the same result for the effect of treatment in response to the different conditions, the linear mixed model estimates additional effects that may be of interest. I use a real example (Example 2) to demonstrate this.

### 13.2.3 Pre-post example 2 (XX males fig1c) The experiments in this paper were designed to measure the independent effects of the sex chromosome complement (X or y) and gonads on phenotypic variables related to fat storage, fat metabolism, and cardiovascular disease.

Response variable$$\texttt{fat_mass}$$. Fat mass was measured in each mouse at baseline (exposed to the standard chow diet) and after one week on a western diet.

Fixed factor – The design is two crossed factors (sex chromosome complement and gonad type) each with two levels but here I collapse the four treatment combinations into a single factor $$\texttt{treatment}$$ with four levels: “female_xx,” “female_xy,” “male_xx,” “male_xy.” Male and female are not the typical sex that is merely observed but are constructed by the presence or absence of SRY on an autosome using the Four Core Genotype mouse model. SRY determines the gonad that develops (ovary or testis). Females do not have the autosome with SRY. Males do. Similarly, the chromosome complement is not observed but manipulated. In “xx,” neither chromosome has SRY as the natural condition because there are two X chromosomes. In “xy,” SRY has been removed from the Y chromosome.

Random factor $$\texttt{id}$$. The identification of the individual mouse.

Planned comparisons

1. “female_xy” - “female_xx” at baseline (chow diet)
2. “male_xx” - “male_xy” at baseline (chow diet)
3. “female_xy” - “female_xx” at one week (western diet)
4. “male_xx” - “male_xy” at one week (western diet)
5. the interaction contrast (3 - 1) which addresses, is the effect of the chromosome complement in females conditional on diet
6. the interaction contrast (4 - 2) which addresses, is the effect of the chromosome complement in males conditional on diet

#### 13.2.3.1 Fit the change score model

The change score model only estimates planned comparisons 5 and 6.

m1 <- lm(week_1 - baseline ~ treatment,
data = fig1c_wide)

Inference from the change score model

m1_emm <- emmeans(m1, specs = "treatment")

m1_planned <- contrast(m1_emm,
method = "revpairwise",
summary(infer = TRUE)

m1_planned[c(1, 6),] %>%
kable(digits = c(1,2,2,1,2,2,2,3)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
1 female_xy - female_xx -0.54 0.48 16 -1.55 0.47 -1.13 0.277
6 male_xy - male_xx -1.00 0.48 16 -2.01 0.01 -2.09 0.053

#### 13.2.3.2 Using the linear mixed model to compute all six planned comparisons

m2 <- lmer(fat_mass ~ treatment*time + (1|id), data = fig1c)
m2_emm <- emmeans(m2,
specs = c("treatment", "time"),
lmer.df = "Satterthwaite")

Notes

1. important to add lmer.df = "Satterthwaite" argument

interaction contrasts

The interaction contrasts estimate planned comparisons 5 and 6.

m2_ixn <- contrast(m2_emm,
interaction = c("revpairwise"),
by = NULL,
summary(infer = TRUE)

m2_planned_ixn <- m2_ixn[c(1,6), ] %>%
data.table()

m2_planned_ixn %>%
kable(digits = c(1,1,2,2,1,2,2,2,3)) %>%
kable_styling()
treatment_revpairwise time_revpairwise estimate SE df lower.CL upper.CL t.ratio p.value
female_xy - female_xx week_1 - baseline -0.54 0.48 16 -1.55 0.47 -1.13 0.277
male_xy - male_xx week_1 - baseline -1.00 0.48 16 -2.01 0.01 -2.09 0.053

Notes

1. These are same results as those using the change scores.

Simple effects

The simple effects estimate planned comparisons 1-4.

# get simple effects from model

m2_pairs <- contrast(m2_emm,
method = c("revpairwise"),
simple = "each",
combine = TRUE,
summary(infer = TRUE)

# reduce to planned contrasts
m2_planned_simple <- m2_pairs[c(1,6,7,12),] %>%
data.table()

# clarify contrast
m2_planned_simple[, contrast := paste(time, contrast, sep = ": ")]

# dump first two cols
keep_cols <- names(m2_planned_simple)[-(1:2)]
m2_planned_simple <- m2_planned_simple[, .SD, .SDcols = keep_cols]

m2_planned_simple %>%
kable(digits = c(1,3,3,1,2,2,2,5)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
baseline: female_xy - female_xx -0.046 0.502 24.6 -1.08 0.99 -0.09 0.92766
baseline: male_xy - male_xx -1.422 0.502 24.6 -2.46 -0.39 -2.84 0.00902
week_1: female_xy - female_xx -0.582 0.502 24.6 -1.62 0.45 -1.16 0.25701
week_1: male_xy - male_xx -2.418 0.502 24.6 -3.45 -1.38 -4.82 0.00006

We can combine the six planned comparisons into a single table.

# create contrast table for ixns
m2_planned_ixn[, contrast := paste0("ixn: ", treatment_revpairwise)]

# dump first two cols
keep_cols <- names(m2_planned_ixn)[-(1:2)]
m2_planned_ixn <- m2_planned_ixn[, .SD, .SDcols = keep_cols]

# row bind -- smart enought to recognize column order
m2_planned <- rbind(m2_planned_simple,
m2_planned_ixn)

m2_planned %>%
kable(digits = c(1,2,2,1,2,2,2,5)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
baseline: female_xy - female_xx -0.05 0.50 24.6 -1.08 0.99 -0.09 0.92766
baseline: male_xy - male_xx -1.42 0.50 24.6 -2.46 -0.39 -2.84 0.00902
week_1: female_xy - female_xx -0.58 0.50 24.6 -1.62 0.45 -1.16 0.25701
week_1: male_xy - male_xx -2.42 0.50 24.6 -3.45 -1.38 -4.82 0.00006
ixn: female_xy - female_xx -0.54 0.48 16.0 -1.55 0.47 -1.13 0.27697
ixn: male_xy - male_xx -1.00 0.48 16.0 -2.01 0.01 -2.09 0.05280

### 13.2.4 Regression to the mean

Regression to the mean is the phenomenon that if an extreme value is sampled, the next sample will likely be less extreme – it is closer to the mean. This makes sense. If we randomly sample a single human male and that individual is 6’10" (about four standard deviations above the mean), the height of the next human male that we randomly sample will almost certainly be closer to the mean (about 5’10" in the united states). This phenomenon also applies to sampling a mean. If we randomly sample five human males and the mean height in the group is 5’6" (about 3 SEM below the mean), the mean height of the next sample of five human males that we measure will almost certainly be closer to the mean. And, this phenomenon applies to sampling a difference in means. If we randomly sample two groups of five human males and the difference between the mean heights is 5.7" (about 3 SED), then the difference in mean height in the next sample of two groups of human males that we measure will almost certainly be closer to zero.

How does regression to the mean apply to the analysis of change scores in a pre-post experiment? Consider an experiment where the response is body weight in mice. In a pre-post experiment, mice are randomized to treatment group at baseline. Weight is measured at baseline and at post-baseline. We expect the difference in means at baseline to be zero. If there is no treatment effect, we expect the difference in means at post-baseline to be zero. Because of sampling error there is a difference in weight at baseline. Do we expect the mean difference to be the same post-baseline? No. Even if taken after only 1 hour, the post-baseline weight of each mouse would not equal the baseline weight because of variation in water intake and loss, food intake, fecal weight, and other variables that affect body weight (this is a within-mouse variance). There is a correlation between the pre and post measures – individual mice that weigh more than the mean at baseline will generally weigh more than the mean at post-baseline. But, the bigger this within-mouse variance (or, the longer the time difference between baseline and post-baseline), the smaller this correlation. The consequence is that all these uncontrolled variables contribute to the sampling variance of the means and the difference in means at baseline and post-baseline. If the difference is unusually large at baseline because of these uncontrolled factors contribution to within mouse variance, we expect the difference at post-baseline to be less large – there is regression to the mean (Figure 13.3A). This regression to the mean between the baseline and post-baseline measures will emerge as a treatment by time interaction, or, equivalently, a difference in the mean change score between treatments (Figure 13.3B). Figure 13.3: Regression to the mean. The individual values in (A) are fake data sampled from a normal distribution with true mean equal to 30 at baseline (gray line) or 35 at post-baseline(gray line). The unusually large different in means at baseline is an extreme event. Consequently, the difference at post-baseline is much smaller. This regression to the mean is easily visualized by the lines that converge at the post-baseline value. (B) The results of a linear model (or t-test) fit to the data in (A) using change scores as the response variable. The apparent treatment effect is due to regression to the mean.

## 13.3 Longitudinal designs with more than one-post baseline measure

A rigorous analysis of longitudinal data typically requires sophisticated statistical models but for many purposes, longitudinal data can be analyzed with simple statistical models using summary statistics of the longitud data. Summary statistics include the slope of a line for a response that is fairly linear or the Area Under the Curve (AUC) for humped responses.

### 13.3.1 Area under the curve (AUC)

An AUC (Area under the curve) is a common and simple summary statistic for analyzing data from a glucose tolerance test and many other longitudinal experiments. Here I use the AUC of glucose tolerance tests (GTT) as an example.

#### 13.3.1.1 AUC and iAUC

Let’s generate and plot fake GTT data for a single individual in order to clarify some AUC measurements and define new ones.

fake_auc <- data.table(
time = c(0, 15, 30, 60, 120),
glucose = c(116, 268, 242, 155, 121)
)
fake_auc[, glucose_change := glucose - glucose] Figure 13.4: Glucose tolerance curve measures for the fake GTT data. A. Glucose values for an individual, B. The glucose values shifted so the baseline value is zero. The post-baseline values are change scores (glucose - baseline), or change from baseline. The filled area is the AUC in (A) and the iAUC in (B). The dashed line is the mean glucose over the test period in (A) and the mean change from baseline over the test period in (B).

AUC – The glucose tolerance curve for an individual is a connected set of straight lines that serves as a proxy for the continuous change of glucose over the period (Figure 13.4A). The AUC is the area under this set of straight lines (Figure 13.4A) and is conveniently computed as the sum of the areas of each of the connected trapezoids created by the connected lines.

iAUC – the incremental AUC (iAUC) is the baseline-zeroed AUC. It can be visualized as the area under the connected lines that have been rigidly shifted down so that the baseline value is zero (Figure 13.4B). The iAUC is computed using the trapezoid rule after first subtracting the individual’s baseline value from all of the individual’s values (including the baseline).

#### 13.3.1.2 Rethinking the iAUC as a change-score

The baseline-zeroed values of glucose used to compute iAUC are change scores from the baseline measure. This makes the iAUC a change score – it is the AUC minus the area under the baseline.

#### 13.3.1.3 Rethinking a t-test of the AUC as a t-test of the glucuose concentration averaged over the test period

The glucose concentration averaged over the post-baseline period for an individual is $$glucose_{gtt-post} = \frac{AUC}{Period}$$. Importantly, a t-test of $$glucose_{gtt-post}$$ is equivalent to a t-test of $$AUC$$ because the mean glucose values are simply the AUC values times the same constant (the test period) for all individuals.

#### 13.3.1.4 Rethinking a t-test of the iAUC as a t-test of the change from baseline (change score) averaged over the test period

The change from baseline (change score) averaged over the test period for an individual is $$glucose_{gtt-change} = \frac{iAUC}{Period}$$. As above, a t-test of $$glucose_{gtt-change}$$ is equivalent to a t-test of $$iAUC$$ because the $$glucose_{gtt-change}$$ values are simply the iAUC values times the same constant (the test period) for all individuals.

#### 13.3.1.5 Rethinking AUC as a pre-post design.

We can now rethink data used to construct an AUC as a pre-post design using the baseline value ($$glucose_0$$) as a measure of $$pre$$, $$glucose_{gtt-post}$$ as a measure of $$post$$ and $$glucose_{gtt-change}$$ as a measure of $$post - pre$$ (note that $$glucose_{gtt-change} = glucose_{gtt-post} - glucose_0$$). And, we can use the principles outlined in Comparing change from baseline (pre-post) above to determine best practices.

#### 13.3.1.6 Best practice strategies for analyzing AUC

1. If the treatment was applied prior to the baseline measure, then use the change score model (Or use the linear mixed model if you want to estimate the effect at baseline). This is the most common kind of design in the experimental biology literature
1. glucose_gtt_post - glucose_0 ~ treatment
2. iauc ~ treatment. The t and p values are equivalent to those in 1a
3. glucose ~ treatment*time + (1|id). This is a linear mixed model that allows the computation of both the effect of treatment at baseline and the effect of treatment on the change in the response to the condition (the interaction effect). This is not a LMM of the glucose values at all time periods but a pre-post LMM with the values of $$\texttt{glucose_0}$$ and $$\texttt{glucose_gtt_post}$$ stacked in the data column $$\texttt{glucose}$$. The t and p values for the interaction effect are equivalent to those in 1a and 1b.
2. If treatment is randomized at baseline, use the ANCOVA linear model.
1. glucose_post ~ treatment + glucose_0.
2. auc ~ treatment + glucose_0. The t and p values are equivalent to those in 2a but the units of the response or the effect.

#### 13.3.1.7 The difference in iAUC between groups is an interaction effect. This is an important recognition.

If the treatment was not randomized at baseline, then the potential effects in a glucose tolerance test are:

1. the effect of treatment at baseline, which is a measure of what is going on independent of added glucose. This is the baseline effect.
2. the effect of glucose infusion, which is a measure of the physiological response during the absorptive state. This effect in each treatment level are the change scores. Researchers typically are not interested in this effect (we know glucose levels rise then fall).
3. the difference in the change from baseline to the new condition (glucose infusion), which is equivalent to the difference in the change scores. This is the interaction effect. An interaction effect is evidence that the difference between treatment and control during the post-baseline (absorptive state) period is not more of the same difference occurring at baseline (fasted state) but something different (Figure 13.5). Figure 13.5: Effects in a glucose tolerance data in an experiment in which the treatment is not randomized at baseline. “something different is the interaction effect.” It is the difference in the change from baseline (or the difference in the change scores).

iAUC is a change score (see Rethinking the iAUC as a change-score above). Recall (or re-read) from section 13.2.2.2 above that the difference in the mean change-score between two groups is the interaction effect of the linear model with two factors $} (“cn” and “tr”) and$} (“pre” and “post”) and their interaction.

$\texttt{glucose} = \beta_0 + \beta_1 (\texttt{treatment}_\texttt{tr}) + \beta_2 (\texttt{time}_\texttt{post}) + \beta_3 (\texttt{treatment}_\texttt{tr} \times \texttt{time}_\texttt{post}) + \varepsilon$

This means a t-test of iAUC is equivalent to thet-test of the interaction effect of treatment by time.

This recognition adds an important perspective to the controversy of using iAUC in the analysis of glucose tolerance curves. iAUC is often used in place of AUC to “adjust” for baseline variation in glucose with the belief that this adjustment makes the AUC measure independent of (uncorrelated with) baseline glucose. As Allison et al. note, a change score doesn’t do this for us. The correct way to adjust for baseline variation is adding the baseline measure as a covariate in the linear model (Section 13.2 above).

$\texttt{glucose_mean_post} = \beta_0 + \beta_1 (\texttt{treatment}_\texttt{tr}) + \beta_2 (\texttt{glucose_0}) + \varepsilon$

Nevertheless, in a pre-post design, if the treatment is applied prior to the baseline measure, it is the interaction effect that we want as the measure of the treatment effect and not the difference between post-baseline means conditional on (adjusted for) baseline. That is we want the change score model and not the ANCOVA linear model.

#### 13.3.1.8 Issues in the analysis of designs where the treatment is applied prior to the baseline measure

Almost all glucose tolerance tests in the experimental biology literature have designs where the treatment (genotype, diet, exercise) was applied prior to the baseline measure and we cannot expect the difference in means to be zero at baseline. In these, it is common, but far from standard, for researchers to analyze the data using t tests or post-hoc tests with the AUC adjusted for baseline as the response. This is equivalent to the recommended linear model in 1b.

Two common alternative analyses with potential consequences that can severely mislead the researcher because of conflated effects are

1. Separate t tests at each time point.
2. t test/post-hoc tests of $$\texttt{auc}$$ (the standard AUC)

The reason that these can mislead is because the results conflate the baseline effect and the interaction effect (Figure 13.5). Both the baseline effect and the interaction effect are of physiological interest. The difference in the mean of the AUC combines these two effects. The difference in the means at any post-baseline time point combines these two effects. A t-test of the AUC or separate t-tests at the post-baseline time points conflate these effects. The conflated results muddle the physiology. If a researcher wants to simply conclude “The knockout causes glucose intolerance” then the full AUC is okay but the researcher should recognize that this is the question they are trying to answer. But if a researcher is asking, “is the difference between treatment groups in the absorptive (post-baseline) state something different, or more of the same, as the difference between treatment groups in the fasted (baseline) state?” then the researcher should avoid t-tests of the AUC or the separate t-tests at post-baseline times.

Two common, alternative analyses that can mislead because of model assumptions that are more severely violated than best practice models are

1. two-way ANOVA with treatment and time as the factors, followed by post-hoc tests. The advantage of this analysis is the ability to estimate the effect at baseline and the interaction effect. But, the correlated error due to the multiple measures on each individual violates the independence assumption. Inference from this model will generally be optimisitic – the CIs will be too narrow and the p-values too small. This is an example of pseudoreplication. The correlated error can be modeled with a GLS linear model or a Linear Mixed Model.
2. repeated measures ANOVA with treatment and time as the factors, followed by post-hoc tests. Like the two-way ANOVA, a repeated measures ANOVA can be used to estimate both the baseline and interaction effects. Unlike the two-way ANOVA, a repeated measures ANOVA models the correlated error. But the model for the correlated error is too unrealistic. The better alternatives for modeling the correlated error are the GLS linear model or the Linear Mixed Model.

#### 13.3.1.9 AUC Example 1 – Treatment applied prior to baseline

Source data: Fig. 3f

##### 13.3.1.9.1 An initial plot ##### 13.3.1.9.2 Inference

change-score linear model (model 1a), which estimates the interaction effect (the effect of treatment on the difference in the change from baseline to post-baseline).

m1 <- lm(glucose_gtt_post - glucose_0 ~ treatment,
data = fig3f_wide)

m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

m1_pairs %>%
kable(digits = 3) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
KO - WT 48.304 36.904 12 -32.104 128.711 1.309 0.215
# increased digits to compare with lmm below

Notes

1. The estimate is not the average difference over the period but the difference in the average change from baseline. The knockout has an average change from baseline that is 48.3 mg/dL larger than the average change from baseline of the wildtype. Over the period, blood glucose in the knockout is 48.3 mg/dL larger than the expected difference if the only mechanisms generating a difference post-baseline are the same as the mechanisms generating the differences at baseline.
2. This is equivalent to a t test of the AUC adjusted for baseline ($) m1_t <- t.test(iauc ~ treatment, data = fig3f_wide, var.equal = TRUE) %>% tidy() m1_t[1, c(1, 4:8)] ## # A tibble: 1 × 6 ## estimate statistic p.value parameter conf.low conf.high ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -5796. -1.31 0.215 12 -15445. 3852. Notes 1. The estimate of the effect is the average change from baseline over the period times the period. Can anyone look at this number and claim with sincerity, wow that is huge!. The estimate from the change-score model, which is simply the difference in the average change from baseline over the period is a number that should be interpretable. 2. The change-score model (or the analysis of the AUC adjusted for baseline) does not estimate the effect of treatment at baseline (the effect in the fasted state). Researchers probably want this. For this we need a linear model with correlated error such as a linear mixed model. Linear mixed model (model 1c), which estimates the interaction effect and the baseline effect. m2 <- lmer(glucose ~ treatment*time + (1|id), data = fig3f_long) m2_emm <- emmeans(m2, specs = c("treatment", "time"), lmer.df = "Satterthwaite") m2_pairs <- contrast(m2_emm, method = "revpairwise", simple = "each", combine = TRUE, adjust = "none") %>% summary(infer = TRUE) m2_ixn <- contrast(m2_emm, interaction = "trt.vs.ctrl", adjust = "none") %>% summary(infer = TRUE) m2_pairs %>% kable(digits = 3) %>% kable_styling() time treatment contrast estimate SE df lower.CL upper.CL t.ratio p.value glucose_0 . KO - WT 111.571 37.454 18.976 33.172 189.970 2.979 0.008 glucose_gtt_post . KO - WT 159.875 37.454 18.976 81.476 238.274 4.269 0.000 . WT glucose_gtt_post - glucose_0 191.518 26.095 12.000 134.661 248.375 7.339 0.000 . KO glucose_gtt_post - glucose_0 239.821 26.095 12.000 182.965 296.678 9.190 0.000 m2_ixn %>% kable(digits = 3) %>% kable_styling() treatment_trt.vs.ctrl time_trt.vs.ctrl estimate SE df lower.CL upper.CL t.ratio p.value KO - WT glucose_gtt_post - glucose_0 48.304 36.904 12 -32.104 128.711 1.309 0.215 # increased digits to compare to change-score model above Notes 1. The treatment effect at baseline is in the first row of the m2_pairs contrast table from the linear mixed model. 2. The interaction effect (the effect of treatment on the change from baseline) is in the m2_ixn contrast table from the linear mixed model. The estimate, SE, confidence intervals, t-value, and p-value are the same as those from the change score model in m1_pairs. ##### 13.3.1.9.3 A t test of the AUC or separate t-tests at each time point result in ambiguous inference t-test of the AUC m3_t <- t.test(auc ~ treatment, data = fig3f_wide, var.equal = TRUE) %>% tidy() m3_t[1, c(1, 4:8)] ## # A tibble: 1 × 6 ## estimate statistic p.value parameter conf.low conf.high ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -19185 -3.52 0.00422 12 -31057. -7313. Separate t-tests at each time point m4_t0 <- t.test(time_0 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t30 <- t.test(time_30 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t60 <- t.test(time_60 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t90 <- t.test(time_90 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t120 <- t.test(time_120 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t <- data.table( Time = times, p.value = c(m4_t0$p.value,
m4_t30$p.value, m4_t60$p.value,
m4_t90$p.value, m4_t120$p.value
)
)

m4_t %>%
kable %>%
kable_styling()
Time p.value
0 0.0014957
30 0.0006251
60 0.0096282
90 0.0137238
120 0.0263193

## 13.4 Normalization – the analysis of ratios

### 13.4.1 Kinds of ratios in experimental biology

1. The ratio is a density (count per length/area/volume) or a rate (count/time).
• Example: number of marked cells per area of tissue.
• Best practice: GLM for count data with an offset in the model, where an offset is the denominator of the ratio.
1. The ratio is relative to a standard (“normalized”). Example: expression of focal mRNA relative to expression of a standard mRNA that is thought not to be affected by treatment. Best practice: GLM for count data with an offset in the model, where an offset is the denominator of the ratio.
2. The ratio is a proportion (or percent).
• Example: Number of marked cells per total number of cells.
• Best practice: GLM logistic.
1. The ratio is relative to a whole and both the thing in the numerator and the thing in the denominator grow (allometric data).
• Example: adipose mass relative to total lean body mass.
• Best practice: ANCOVA linear model.
• Alert! – It has been known for more than 100 years, and repeatedly broadcasted, that inference from ratios of allometric data range from merely wrong (the inferred effect size is in the right direction, but wrong) to absurd (the direction of the inferred effect is opposite that of the true effect).

### 13.4.2 Example 1 – The ratio is a density (number of something per area)

Researchers frequently count objects and compare the counts among treatments. A problem that often arises is that the counts are made in samples with different areas or volumes of tissue. As a consequence, the variation in treatment response is confounded with tissue size – samples with higher counts may have higher counts because of a different response to treatment, a larger amount of tissue, or some combination. The common practice in experimental biology is to adjust for tissue size variation by constructing the ratio $$\frac{count}{area}$$ and then testing for a difference in the ratio using either a linear model NHST (t-test/ANOVA) or a non-parametric NHST (Mann-Whitney-Wilcoxan). The issue is, the initial counts will have some kind of count distribution (Poisson or negative binomial) that can sometimes look like a sample from a normal distribution (see Introducing Generalized Linear Models using a count data example). The ratio will have some kind of ratio distribution that is hard to model correctly.

A better practice is to model the count using a Generalized Linear Model (GLM) and an offset that adjusts for differences in the area of the sample. NHST of the ratios will perform okay in the sense of Type I error that is close to nominal but will have relatively low power compared to a generalized linear model with offset. If the researcher is interested in best practices including the reporting of uncertainty of estimated effects, a GLM will have more useful confidence intervals – for example CIs from linear model assuming Normal error can often include absurd values such as ratios less than zero.

Source article (Fernández, Álvaro F., et al. “Disruption of the beclin 1–BCL2 autophagy regulatory complex promotes longevity in mice.” Nature 558.7708 (2018): 136-140.)https://www.nature.com/articles/s41586-018-0162-7

Public source

Source data for Fig. 3

The example here is from Fig 3b.

Response variable – number of TUNEL+ cells measured in kidney tissue, where a positive marker indicates nuclear DNA damage.

Background. The experiments in Figure 3 were designed to measure the effect of a knock-in mutation in the gene for the beclin 1 protein on autophagy and tissue health in the kidney and heart. The researchers were interested in autophagy because there is evidence in many non-mammalian model organisms that increased autophagy reduces age-related damage to tissues and increases health and lifespan. BCL2 is an autophagy inhibitor. Initial experiments showed that the knock-in mutation in beclin 1 inhibits BCL2. Inhibiting BCL2 with the knock-in mutation should increase autophagy and, as a consequence, reduce age-related tissue damage.

The researchers measured Tunel+ cells in both 2 month old and 20 month old mice in order to look at age related effects. This design is factorial with two factors ($$\texttt{age}$$ and $$\texttt{genotype}$$. Because we haven’t covered factorial designs in this text, I limit the analysis to the 20 month old mice.

Design - single factor $$\texttt{genotype}$$ with levels “WT” (wildtype) and “KI” (knock-in).

The code for importing the exp3b data is in Section 13.8.6 below.

#### 13.4.2.1 Fit the models

exp3b_m1 <- lm(count_per_area ~ genotype, data = exp3b)

exp3b_m2 <- glm.nb(positive_nuclei ~ genotype +
offset(log(area_mm2)),
data = exp3b)

exp3b_m3 <- wilcox.test(count_per_area ~ genotype, data = exp3b)

#### 13.4.2.2 Check the models

ggcheck_the_model(exp3b_m1) exp3b_m2_simulation <- simulateResiduals(fittedModel = exp3b_m2, n = 250)
plot(exp3b_m2_simulation, asFactor = FALSE) #### 13.4.2.3 Inference from the model

exp3b_m2_coef <- cbind(coef(summary(exp3b_m2)),
confint(exp3b_m2))
Estimate Std. Error z value Pr(>|z|) 2.5 % 97.5 %
(Intercept) 0.79 0.295 2.7 0.007 0.26 1.43
genotypeKI -0.76 0.393 -1.9 0.053 -1.55 0.01
exp3b_m2_emm <- emmeans(exp3b_m2,
specs = c("genotype"),
type="response")
genotype response SE df asymp.LCL asymp.UCL
WT 58.6 17.25 Inf 32.9 104.32
KI 27.4 7.12 Inf 16.5 45.63
exp3b_m2_pairs <- contrast(exp3b_m2_emm,
method = "revpairwise") %>%
summary(infer = TRUE)
contrast ratio SE df asymp.LCL asymp.UCL null z.ratio p.value
KI / WT 0.47 0.184 Inf 0.22 1.01 1 -1.93266 0.1

Notes

#### 13.4.2.4 Plot the model The code for this plot is in Section 13.8.7 below.

### 13.4.3 Example 2 – The ratio is normalizing for size differences

data_from <- "A big-data approach to understanding metabolic rate and response to obesity in laboratory mice"
file_name <- "mmpc_all_phases.csv"
file_path <- here(data_folder, data_from, file_name)

geometric_mean <- function(x){
gm <- exp(mean(log(x)))
return(gm)
}
exp1 <- exp1[acclimation == "TRUE", ]

exp1[, resid_mass := total_mass - fat_mass - lean_mass]
exp1[, nonfat_mass := total_mass - fat_mass]

size_cols <- c("fat_mass", "lean_mass", "resid_mass")
exp1[, size := apply(exp1[, .SD, .SDcols = size_cols], 1, geometric_mean)]
size_cols <- c("fat_mass", "lean_mass", "resid_mass", "total_mass", "size")
cor(exp1[, .SD, .SDcols = size_cols])
##              fat_mass  lean_mass  resid_mass total_mass      size
## fat_mass    1.0000000  0.6486331 -0.03980370 0.94096284 0.8192087
## lean_mass   0.6486331  1.0000000 -0.41585035 0.78512117 0.4214359
## resid_mass -0.0398037 -0.4158503  1.00000000 0.05819842 0.4595406
## total_mass  0.9409628  0.7851212  0.05819842 1.00000000 0.8589070
## size        0.8192087  0.4214359  0.45954060 0.85890700 1.0000000

## 13.5 Don’t do this stuff

### 13.5.1 Normalize the response so that all control values are equal to 1.

In many experiments, the response variable is a measure with units that are not readily interpretable biologically. In these experiments, researchers often normalize the response by the mean of the reference group.

$y_i \; \mathrm{(normalized)} = \frac{y_i}{\overline{y}_{ref}}$

The normalized response variable is a multiple of (or a fraction) of the mean response of the reference group.

Less often, researchers normalize by the value of the reference within a batch – for example, the mean of the control group within each independent experiment – and then analyze the batch means with a t-test or ANOVA. If the goal of this by-experiment_id normalization is to adjust for the variance among the experiments, the proper way to do this is Linear mixed model for combining replicated experiments above.

Regardless of the goal, don’t do this. There is no variance in the reference group as all n values are 1. A researcher is telling the t-test or ANOVA that there is no natural variability in the reference group (and the values were measured without measurement error). A classical (Student) t-test or ANOVA will have an incorrectly small error variance ($$\sigma$$) because this zero-variance will go into the computation of the pooled (modeled) variance. The consequence is incorrectly small standard errors, confidence intervals, and p-values, and inflated false discovery.

Let’s do this, to show why we don’t do this. All examples that I’ve found in the literature only archive the batch-normalized data and not the raw data so I’m using fake data for this.

set.seed(9)
n_exp <- 8 # number of experiments
beta_0 <- 100
beta_1 <- 1.5 # genotype effect as a multiple
mu <- c(beta_0, beta_0*beta_1)
sigma <- (beta_1 - 1)*beta_0 # iid error
rho <- 0.7 # correlated error
Sigma <- matrix(c(sigma^2, rho*sigma^2, rho*sigma^2, sigma^2),
nrow = 2)
fake_data <- data.table(NULL)
fake_data[, genotype := rep(c("wt", "ko"), each = n_exp)]
fake_data[, genotype := factor(genotype,
levels = c("wt", "ko"))]

fake_data[, experiment_id := rep(paste0("exp_", 1:n_exp), 2)]

# modeling the experiment means
raw <- rmvnorm(n_exp, mean = mu, sigma = Sigma)
fake_data[, y := c(raw)]

# mean-normalize by the mean of the experiment means of wt group
fake_data[, y_correct := c(raw/mean(raw[,1]))]

# id-normalize by experiment mean of wt group for each experiment_id
fake_data[, y_wrong := c(raw/raw[,1])]

The best practice is a linear mixed model using the mean-normalized response (the paired t-test is a specific case of this).

m1 <- lmer(y_correct ~ genotype +
(1|experiment_id),
data = fake_data)

m1_emm <- emmeans(m1, specs = "genotype")
m1_pairs <- contrast(m1_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

The common practice is a linear model using the mean-normalized response (the Student t-test is a specific case of this). This model violates the independence assumption but this violation usually results in less precision and lower power.

m2 <- lm(y_correct ~ genotype,
data = fake_data)

m2_emm <- emmeans(m2, specs = "genotype")
m2_pairs <- contrast(m2_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

The incorrect practice is a linear model (or t-test or ANOVA) using the id-normalized response. This model violates heterogeneity and estimates the error variance using a group with zero variation.

m3 <- lm(y_wrong ~ genotype,
data = fake_data)

m3_emm <- emmeans(m3, specs = "genotype")
m3_pairs <- contrast(m3_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

A summary table of the three models …

method contrast estimate SE df lower.CL upper.CL t.ratio p.value
paired t ko - wt 0.206 0.09 7 0.00 0.41 2.42 0.0459
student t ko - wt 0.206 0.16 14 -0.13 0.54 1.33 0.2047
dont do this t ko - wt 0.318 0.13 14 0.03 0.60 2.40 0.0308

and a plot of the mean-normalized and id-normalized data and modeled means and CIs.

gg1 <- ggplot_the_response(m1, m1_emm, m1_pairs)
gg2 <- ggplot_the_response(m3,
m3_emm,
m3_pairs,
dots = "jitter")
set.seed(1)
plot_grid(gg1, gg2, ncol = 2, labels = "AUTO") Figure 13.6: Plot of the fake normalized data and modeled means and CI from A) model m1, the linear mixed model with experiment_id as a random factor and B) model m3, the linear model of the response normalized to each experiment mean.

Notes

## 13.6 A difference in significance is not necessarily significant

Thermal stress induces glycolytic beige fat formation via a myogenic state

Figure 2j

## 13.8 Hidden code

### 13.8.1 Import exp4d vimentin cell count data (replicate experiments example)

data_from <- "Distinct inflammatory and wound healing responses to complex caudal fin injuries of larval zebrafish"
file_name <- "elife-45976-fig4-data2-v1.xlsx"
file_path <- here(data_folder, data_from, file_name)

sheet = "Sheet1",
range = "A16:C49") %>%
data.table()

exp4d_wide[, experiment_id := nafill(Replicate, type = "locf")]
exp4d_wide[, experiment_id := paste0("Exp", experiment_id)]

old_cols <- c("STAT3 +/+", "STAT3 -/-")
genotype_levels <- c("STAT3_wt", "STAT3_ko")
setnames(exp4d_wide, old = old_cols, new = genotype_levels)
exp4d <- melt(exp4d_wide,
id.vars = "experiment_id",
measure.vars = genotype_levels,
variable.name = "genotype",
value.name = "vimentin_cells") %>% # cell count
na.omit()
exp4d[, genotype := factor(genotype,
levels = genotype_levels)]

### 13.8.2 Import Fig4c data

data_from <- "Identification of osteoclast-osteoblast coupling factors in humans reveals links between bone and energy metabolism"
file_name <- "41467_2019_14003_MOESM6_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

fig4c_type <- c("text", rep("numeric", 8))
sheet = "Fig4C and 4E",
range = "A3:I48",
col_names = FALSE,
col_types <- fig4c_type) %>%
data.table()

# DPP4 (ng/mL), GLP-1 (pM), Glucose (mmol/L),   Insulin (uIU/mL)
measures <- c("DPP4", "GLP1", "Glucose", "Insulin")
# post is 3 months post treatment
period <- c("baseline", "post")
new_colnames <- paste(rep(measures, 2),
rep(period, each = 4),
sep = "_")

old_colnames <- colnames(fig4c)
setnames(fig4c,
old = old_colnames,
new = c("treatment",
new_colnames))

treatment_levels <- c("Placebo", "Denosumab")
fig4c[, treatment := factor(treatment,
levels = treatment_levels)]
fig4c[, id := paste0("human_", .I)] # .I inserts row number
#View(fig4c)
fig4c_long <- melt(fig4c,
id.vars = c("id", "treatment"),
measure.vars = c("DPP4_baseline", "DPP4_post"),
variable.name = "time",
value.name = "dpp4")

fig4c_long[, time := substr(as.character(time),
6,
nchar(as.character(time)))]
fig4c_long[, time := factor(time,
levels = c("baseline", "post"))]

# View(fig4c_long)

### 13.8.3 XX males fig1c

data_from <- "XX sex chromosome complement promotes atherosclerosis in mice"
file_name <- "41467_2019_10462_MOESM6_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)
sheet = "Figure 1C",
range = "C3:G6",
col_names = FALSE) %>%
data.table()
setnames(fig1c_1,
old = colnames(fig1c_1),
new = paste("mouse", 1:5))
fig1c_1[, time := rep(c("baseline", "week_1"), each = 2)]
fig1c_1[, sex := rep(c("female", "male"), 2)]

sheet = "Figure 1C",
range = "I3:M6",
col_names = FALSE) %>%
data.table()
setnames(fig1c_2,
old = colnames(fig1c_2),
new = paste("mouse", 6:10))
fig1c_2[, time := rep(c("baseline", "week_1"), each = 2)]
fig1c_2[, sex := rep(c("female", "male"), 2)]

fig1c <- rbind(data.table(chromosome = "xx",
melt(fig1c_1,
id.vars = c("sex", "time"),
variable.name = "id",
value.name = "fat_mass")),
data.table(chromosome = "xy",
melt(fig1c_2,
id.vars = c("sex", "time"),
variable.name = "id",
value.name = "fat_mass")))
# id is not unique
fig1c[, id := paste(sex, id, sep="_")] # now it is

fig1c[, treatment := paste(sex, chromosome, sep = "_")]
# unique(fig1c\$treatment)
# "female_xx" "male_xx"   "female_xy" "male_xy"
treatment_levels <- c("female_xx", "female_xy", "male_xx", "male_xy")
fig1c[, treatment := factor(treatment,
levels = treatment_levels)]

fig1c_wide <- dcast(fig1c, chromosome + sex + id + treatment ~ time, value.var = "fat_mass")
fig1c_wide[, percent_change := (week_1 - baseline)/baseline*100]

### 13.8.4 Generation of fake data to illustrate regression to the mean

n_iter <- 1000
n <- 6
np1 <- n + 1
N <- n*2
mu <- 30
beta_1 <- 5
sigma_among <- 2
sigma_within <- sigma_among/2
rho <- sigma_among/(sigma_among + sigma_within)
max_baseline_d <- -9999

time_levels <- c("pre", "post")
treatment_levels <- c("cn","tr")
fake_data <- data.table(
treatment = rep(rep(treatment_levels, each = n), 2),
time = rep(time_levels, each = n*2),
id = rep(paste0("mouse_0", 1:N), 2)
)
fake_data[, time := factor(time,
levels = time_levels)]
fake_data[, treatment := factor(treatment,
levels = treatment_levels)]
diff_pre <- numeric(n_iter)
diff_post <- numeric(n_iter)
ixn <- numeric(n_iter)
ixn_p <- numeric(n_iter)
p_ancova <- numeric(n_iter)
p_change <- numeric(n_iter)
seed_i <- 0
for(i in 1:n_iter){
seed_i <- seed_i + 1
set.seed(seed_i)
# the contribution to variance due to individual differences
mu_i <- rnorm(N, mean = mu, sd = sigma_among)

# the contribution to variance due to variation within individual
mu_i_pre <- rnorm(N, sd = sigma_within)
mu_i_post <- rnorm(N, sd = sigma_within)

pre <- mu_i + mu_i_pre
post <- mu_i + beta_1 + mu_i_post
fake_data[, weight := c(pre, post)]
fit <- lm(weight ~ treatment*time, data = fake_data)
diff_pre[i] <- mean(pre[np1:N]) - mean(pre[1:n])
diff_post[i] <- mean(post[np1:N]) - mean(post[1:n])
ixn[i] <- coef(summary(fit))[4, "Estimate"]
ixn_p[i] <- coef(summary(fit))[4, "Pr(>|t|)"]

fake_data_wide <- dcast(fake_data,
id + treatment ~ time,
value.var = "weight")
fake_data_wide[, change := post-pre]
fake_data_wide[, percent := (post-pre)/pre*100]
p_ancova[i] <- coef(summary(lm(post ~ treatment + pre,
data = fake_data_wide)))[2, "Pr(>|t|)"]
p_change[i] <- coef(summary(lm(change ~ treatment,
data = fake_data_wide)))[2, "Pr(>|t|)"]

}

diverge_list <- which(abs(diff_pre) < 0.1 & abs(diff_post) > 1)
converge_list <- which(abs(diff_pre) > 1 & abs(diff_post) < 0.1)
p_list <- which(p_ancova > 0.1 & p_change < 0.01)
keep <- intersect(converge_list, p_list)

max_seed_con <- which(abs(ixn) == max(abs(ixn)[converge_list]))

# convergence ixn
set.seed(max_seed_con)
mu_i <- rnorm(N, mean = mu, sd = sigma_among)
# the contribution to variance due to variation within individual
mu_i_pre <- rnorm(N, sd = sigma_within)
mu_i_post <- rnorm(N, sd = sigma_within)
pre <- mu_i + mu_i_pre
post <- mu_i + beta_1 + mu_i_post
fake_data[, weight := c(pre, post)]
fake_data_wide <- dcast(fake_data,
id + treatment ~ time,
value.var = "weight")
fake_data_wide[, change := post-pre]
fake_data_wide[, percent := (post-pre)/pre*100]
# coef(summary(lm(post ~ treatment + pre,
#                 data = fake_data_wide)))
# coef(summary(lm(change ~ treatment,
#                 data = fake_data_wide)))
# coef(summary(lm(percent ~ treatment,
#                 data = fake_data_wide)))

# not much we can do about divergence so limit to convergence

fake_means <- fake_data[, .(weight = mean(weight)),
by = c("time", "treatment")]

### 13.8.5 Import fig3f

data_from <- "Innervation of thermogenic adipose tissue via a calsyntenin 3β–S100b axis"
file_name <- "41586_2019_1156_MOESM5_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

sheet = "Figure 3f",
range = "H2:M15",
col_names = FALSE) %>%
data.table()

time_cols <- c(0, 30, 60, 90, 120)
setnames(fig3f_wide,
old = names(fig3f_wide),
new = c("id",
paste0("time_", time_cols)))

treatment_levels <- c("WT", "KO")
fig3f_wide[, treatment := factor(substr(id, 1, 2),
treatment_levels)]

fig3f_wide[, glucose_0 := time_0]

times <- c(0, 30, 60, 90, 120)
n_times <- length(times)
period_full <- times[n_times] - times
period_post <- times[n_times] - times

time_cols <- paste0("time_", times)
Y <- fig3f_wide[, .SD, .SDcols = time_cols] %>%
as.matrix()

# auc
fig3f_wide[, auc := apply(Y, 1, trap.rule, x = times)]

# mean response over full period
fig3f_wide[, glucose_gtt_post := auc/period_full]

# auc of the change. Equal to iauc of le Floch
fig3f_wide[, iauc := apply(Y - Y[,1], 1, trap.rule, x = times)]

# mean change over full period
fig3f_wide[, glucose_gtt_change := iauc/period_full]

# View(fig3f_wide)

**convert fig3f to long format for LMM

### 13.8.6 Import exp3b

data_from <- "Disruption of the beclin 1–BCL2 autophagy regulatory complex promotes longevity in mice"
file_name <- "41586_2018_162_MOESM5_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

sheet = "Fig. 3b",
range = "H4:K50",
col_names = TRUE) %>%
clean_names() %>%
tidyr::fill(genotype) %>%
data.table()
exp3b[, count_per_area := positive_nuclei/area_mm2]
genotype_levels <- c("WT", "KI")
exp3b[, genotype := factor(genotype,
levels = genotype_levels)]
exp3b[, age := "Old"]

### 13.8.7 Plot the model of exp3b (glm offset data)

gg1 <- ggplot_the_effects(exp3b_m2,
exp3b_m2_pairs,
effect_label = "Effect Ratio")
# gg1

xdat_old <- expand.grid(
area_mm2 = seq(min(exp3b[age == "Old", area_mm2]),
max(exp3b[age == "Old", area_mm2]),
length.out = 50),
genotype = genotype_levels,
age = c("Old")
) %>%
data.table()

xdat_old[, y := predict(exp3b_m2, xdat_old, type = "response")]

gg2 <- ggplot(data = exp3b[age == "Old"],
aes(x = area_mm2,
y = positive_nuclei,
color = genotype)) +
geom_point() +
geom_path(data = xdat_old,
aes(x = area_mm2,
y = y,
color = genotype)) +
xlab(expression(paste("Area (", mm^2, ")"))) +
ylab("TUNEL+ nuclei") +
scale_color_manual(values = pal_okabe_ito_blue) +

theme_pubr() +
theme(legend.position = "left") +
NULL

#gg2

plot_grid(gg1, gg2,
nrow = 2,
rel_heights = c(0.4,1),
align = c("v"),
axis = "lr") 