# Chapter 13 Issues in inference

## 13.1 Comparing change from baseline (pre-post)

In a longitudinal experiment, the response variable is measured on the same individuals both before (baseline) and after (post-baseline) some condition is applied. Experiments in which only one post-baseline measure is taken are known as pre-post experiments. For simplicity, I call the baseline measure $$\texttt{pre}$$ and the post-baseline measure $$\texttt{post}$$.

Researchers often analyze pre-post data by comparing the change score ($$\texttt{post} - \texttt{pre}$$) between the groups using a $$t$$-test or one-way ANOVA (or a non-parametric equivalent such as Mann-Whitney-Wilcoxon). The linear model form of the t-test is

$\texttt{post} - \texttt{pre} = \beta_{0} + \beta_{1}(\texttt{treatment}_{\texttt{tr}}) + \varepsilon$

This is the change score model. A similar analysis is a t-test or ANOVA using the percent change from baseline as the response:

$\frac{\texttt{post} - \texttt{pre}}{\texttt{pre}} \times 100 = \beta_{0} + \beta_{1}(\texttt{treatment}_{\texttt{tr}}) + \varepsilon$

The best practice for how to estimate the treatment effect depends very much on when the treatment is applied relative to the baseline ($$\texttt{pre}$$) measure. Consider the two experiments in Figure 13.1

1. In 13.1A, the individuals are randomized into the “Placebo” and the “Denosumab” groups. Plasma DPP4 is measured at baseline, the treatment is applied, and, three months later, plasma DPP4 is re-measured. The treatment is “Denosumab” and we want to compare this to “Placebo.” The key feature of this design is the individuals are from the same initial group and have been treated the same up until the baseline measure. Consequently, the expected difference in means for any measure at baseline is zero.

2. In 13.1B, the treatment of interest (knockout) was applied prior to baseline, DPP4 is measured at baseline without Denosumab and then both groups are given Denosumab and measured again three months later. The treatment is not Denosumab but “knockout” and we want to compare this to “wild type” in the two different conditions. The key here is that individuals are randomly sampled from two different groups (wild type and knockout). Consequently, we cannot expect the expected difference in means for any measure at baseline to be zero.

The best practice for Design 2 is the change score model given above (or equivalents discussed below). The best practice for Design 1 is not the change score model but a linear model in which the baseline measure is added as a covariate.

$\texttt{post} = \beta_{0} + \beta_{1}(\texttt{treatment}_{\texttt{tr}}) + \beta_{2}(\texttt{pre}) + \varepsilon$

This model is commonly known as the ANCOVA model (Analysis of Covariance) even if no ANOVA table is generated. The explanation for this best practice is given below, in the section regression to the mean. The ANCOVA linear model is common in clinical medicine and pharmacology, where researchers are frequently warned about regression to the mean from statisticians. By contrast, the ANCOVA linear model is rare in basic science experimental biology. The analysis of linear models with added covariates is the focus of the chapter Linear models with added covariates.

Alert! If the individuals are sampled from the same population and treatment is randomized at baseline, do not test for a difference in means of the response variable at baseline and then use the ANCOVA linear model if $$p > 0.05$$ and the change score model $$p < 0.05$$. The best practice is only a function of the design, which leads to the expectation of the difference in means at baseline.

1. If the individuals are sampled from the same population and treatment is randomized at baseline, the expected difference at baseline is zero.
1. If the individuals are sampled from the two populations because the treatment was applied prior to baseline, we cannot expect the difference at baseline to be zero (even if the treatment is magic).
• Use the change score model.
• What can go wrong if we use the ANCOVA linear model? Two things. First, the ANCOVA linear model computes a biased estimate of the true effect, meaning that as the sample size increases the estimate does not converge on the true value but the true value plus some bias. Second, the change score has more power than the ANCOVA linear model under the assumption of sampling from different populations at baseline.

Alert! Researchers also use two-way ANOVA or repeated measures ANOVA to analyze data like these. Two-way ANOVA is invalid because the multiple measures on each individual violates the independence assumption. Repeated measures ANOVA will give equivalent results for a pre-post experiment but not for better practice methods that are available when there are more than one post-baseline measure.

The better practice methods are linear models for correlated error, including GLS and linear mixed models. For a pre-post design, these models give equivalent results to the change score model but also allows the estimate of additional effects of interest (see below). For more detail on the analysis of pre-post experiments, see the Linear models for longitudinal experiments – I. pre-post designs chapter.

### 13.1.1 Example 1 (DPP4 fig4c)

The data in the left panel of Figure 13.1 are from an experiment to estimate the effect of Denosumab on the plasma levels of the enzyme DPP4 in humans. Denosumab is a monocolonal antibody that inhibits osteoclast maturation and survival. Osteoclasts secrete the enzyme DPP4.

#### 13.1.1.1 For the ANCOVA linear model, the data need to be in wide format

In the ANCOVA linear model, the baseline measure is added as a covariate and thought of as a separate variable and not a “response.” This makes sense – how could it be a “response” to treatment when the treatment hasn’t been applied? This means the baseline and post-baseline measures of DPP4 have to be in separate columns of the data (Table 13.1.

Table 13.1: First six rows of the DPP4 data in wide format showing the baseline and post-baseline measures of DPP4 as separate variables.
treatment DPP4_baseline DPP4_post id
Denosumab 436 405 human_1
Denosumab 434 392 human_2
Denosumab 534 480 human_3
Denosumab 317 266 human_4
Denosumab 440 397 human_5
Denosumab 336 370 human_6

#### 13.1.1.2 Fit the ANCOVA model

m1 <- lm(DPP4_post ~ treatment + DPP4_baseline,
data = fig4c)

#### 13.1.1.3 Inference

The coefficient table

m1_coef <- cbind(coef(summary(m1)),
confint(m1))

m1_coef %>%
kable(digits = c(1,2,2,4,1,1)) %>%
kable_styling()
Estimate Std. Error t value Pr(>|t|) 2.5 % 97.5 %
(Intercept) 70.4 24.85 2.83 0.0070 20.2 120.5
treatmentDenosumab -27.0 10.25 -2.63 0.0117 -47.7 -6.3
DPP4_baseline 0.8 0.06 14.08 0.0000 0.7 0.9

Notes

1. Here, we care only about the $$\texttt{treatmentDenosumab}$$ row. This is the estimated effect of Denosumab on serum DPP4 adjusted for baseline (do not use the word “control” as the baseline values were not controlled in any manipulative sense).
2. Chapter Linear models with added covariates explains the interpretation of these coefficients in more detail.

The emmeans table

m1_emm <- emmeans(m1, specs = "treatment")

m1_emm %>%
summary() %>%
kable(digits = c(1,2,2,0,1,1)) %>%
kable_styling()
treatment emmean SE df lower.CL upper.CL
Placebo 403.5 7.09 43 389.2 417.8
Denosumab 376.5 7.40 43 361.6 391.4

Notes

1. The means are conditional on treatment and a value of $$\texttt{DPP4_baseline}$$. emmeans uses the grand mean of $$\texttt{DPP4_baseline}$$ to compute the conditional means. The conditional means are adjusted for baseline DPP4. Note that these means are not equal to the sample means.

The contrast table

m1_pairs <- contrast(m1_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

m1_pairs %>%
kable(digits = c(1,3,2,0,1,1,1,3)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
Denosumab - Placebo -27.003 10.25 43 -47.7 -6.3 -2.6 0.012

Notes

1. As in the coefficient table, the contrast is the estimated effect of Denosumab on serum DPP4. It is the difference in means adjusted (not controlled!) for baseline values.
ggplot_the_model(m1,
m1_emm,
m1_pairs,
y_label = "Serum DPP4 (ng/mL)",
effect_label = "Effect (ng/mL)",
palette = pal_okabe_ito_blue,
legend_position = "none")

Notes

1. The treatment means in Figure 13.2 are conditional means adjusted for the baseline measure and are, therefore, not equal to the sample means. The estimated effect is the difference between the conditional means and not the sample means and the inferential statistics (CI, p-value) are based on this difference between the conditional and not the sample means.

### 13.1.2 What if the data in example 1 were from from an experiment where the treatment was applied prior to the baseline measure?

#### 13.1.2.1 Fit the change score model

m1 <- lm(DPP4_post - DPP4_baseline ~ genotype,
data = fig4c_fake)
m1_emm <- emmeans(m1, specs = "genotype")
m1_pairs <- contrast(m1_emm,
method = "revpairwise")
m1_pairs %>%
kable() %>%
kable_styling()
contrast estimate SE df t.ratio p.value
ko - wt -25.83333 11.53394 44 -2.239766 0.0302079

Notes

1. The change score is created on the LHS of the model formula. Alternatively, the change score could be created as a variable in the fig4c_fake data.table using fig4c_fake[, change := DPP4_post - DPP4_baseline]. Making the change in the model formula shows the flexibility of the model formula method of fitting linear models in R.

#### 13.1.2.2 Rethinking a change score as an interaction

If the treatment is randomized at baseline, a researcher should focus on the effect of treatment (the difference between the post-basline measures), adjusted for the baseline measures. The addition of the baseline variable as a covariate increases the precision of the treatment effects and the power of the significance test.

If the treatment is generated prior to baseline, a researcher should focus on the difference in the change from baseline to post-baseline, which is

$effect = (post_{ko} - pre_{ko}) - (post_{wt} - pre_{wt})$

This is the difference in change scores, which is a difference of differences. This difference of differences is the interaction effect between the genotype (“wt” or “ko”) and the time period of the measurement of DPP4 (baseline or post-baseline). The more usual way to estimate interaction effects is a linear model with two crossed factors, which is covered in more detail in the chapter # Linear models with two categorical X – Factorial designs (“two-way ANOVA”). The interaction effect (equal to the difference among the mean of the change scores) can be estimated with the model

$\texttt{dpp4} = \beta_{0} + \beta_{1}(\texttt{genotype}_{\texttt{ko}}) + \beta_{2}(\texttt{time}_{\texttt{post}}) + \beta_{3}(\texttt{genotype}_{\texttt{ko}} \times \texttt{time}_{\texttt{post}}) + \epsilon$

The R script for this looks like this

m2_fixed <- lm(dpp4 ~ genotype*time,
data = fig4c_fake_long)

This model has two factors ($$\texttt{genotype}$$ and $$\texttt{time}$$), each with two levels. The two levels of $$\texttt{time}$$ are “pre” and “post.” $$genotype_{ko}$$ is an indicator variable for “ko” and $$time_{post}$$ is an indicator variable for “post”

Don’t fit this model – the data violate the independence assumption! This violation arises because $$\texttt{dpp4}$$ is measured twice in each individual and both these measures are components of the response (both $$\texttt{dpp4}$$ measures are stacked into a single column). This violation doesn’t arise in the ANCOVA linear model because the baseline measures are a covariate and not a response (remember that the independence assumption only applies to the response variable).

Alert! It is pretty common to see this model fit to pre-post and other longitudinal data. The consequence of the violation is invalid (too large) degrees of freedom for computing standard errors and p-values. This is a kind of pseudoreplication.

To model the correlated error due to the two measures per individual, we use a linear mixed model using $$\texttt{id}$$ as the added random factor. Linear mixed models were introduced in the Violations of independence, homogeneity, or Normality chapter and are covered in more detail in the Models with random factors – Blocking and pseudoreplication chapter.

m2 <- lmer(dpp4 ~ genotype*time + (1|id),
data = fig4c_fake_long)
m2_emm <- emmeans(m2,
specs = c("genotype", "time"),
lmer.df = "Satterthwaite")
m2_ixn <- contrast(m2_emm,
interaction = "revpairwise")
m2_ixn %>%
kable() %>%
kable_styling()
genotype_revpairwise time_revpairwise estimate SE df t.ratio p.value
ko - wt post - baseline -25.83333 11.53394 44 -2.239766 0.0302079

Notes

1. This is the same result as that for the change score score model.
2. The interaction effect is one of the coefficients in the model but to get the same CIs as those in the change score model, we need to use Satterthwaite’s formula for the degrees of freedom. We pass this to the emmeans function using lmer.df = "Satterthwaite"
3. The interaction contrast is computed using the contrast function but using interaction = "revpairwise" instead of method = "revpairwise".

#### 13.1.2.3 The linear mixed model estimates additional effects that we might want

While the linear mixed model and change score model give the same result for the effect of treatment in response to the different conditions, the linear mixed model estimates additional effects that may be of interest. I use a real example (Example 2) to demonstrate this.

### 13.1.3 Example 2 (XX males fig1c)

The experiments in this paper were designed to measure the independent effects of the sex chromosome complement (X or y) and gonads on phenotypic variables related to fat storage, fat metabolism, and cardiovascular disease.

Response variable$$\texttt{fat_mass}$$. Fat mass was measured in each mouse at baseline (exposed to the standard chow diet) and after one week on a western diet.

Fixed factor – The design is two crossed factors (sex chromosome complement and gonad type) each with two levels but here I collapse the four treatment combinations into a single factor $$\texttt{treatment}$$ with four levels: “female_xx,” “female_xy,” “male_xx,” “male_xy.” Male and female are not the typical sex that is merely observed but are constructed by the presence or absence of SRY on an autosome using the Four Core Genotype mouse model. SRY determines the gonad that develops (ovary or testis). Females do not have the autosome with SRY. Males do. Similarly, the chromosome complement is not observed but manipulated. In “xx,” neither chromosome has SRY as the natural condition because there are two X chromosomes. In “xy,” SRY has been removed from the Y chromosome.

Random factor $$\texttt{id}$$. The identification of the individual mouse.

Planned comparisons

1. “female_xy” - “female_xx” at baseline (chow diet)
2. “male_xx” - “male_xy” at baseline (chow diet)
3. “female_xy” - “female_xx” at one week (western diet)
4. “male_xx” - “male_xy” at one week (western diet)
5. the interaction contrast (3 - 1) which addresses, is the effect of the chromosome complement in females conditional on diet
6. the interaction contrast (4 - 2) which addresses, is the effect of the chromosome complement in males conditional on diet

#### 13.1.3.1 Fit the change score model

The change score model only estimates planned comparisons 5 and 6.

m1 <- lm(week_1 - baseline ~ treatment,
data = fig1c_wide)

Inference from the change score model

m1_emm <- emmeans(m1, specs = "treatment")

m1_planned <- contrast(m1_emm,
method = "revpairwise",
summary(infer = TRUE)

m1_planned[c(1, 6),] %>%
kable(digits = c(1,2,2,1,2,2,2,3)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
1 female_xy - female_xx -0.54 0.48 16 -1.55 0.47 -1.13 0.277
6 male_xy - male_xx -1.00 0.48 16 -2.01 0.01 -2.09 0.053

#### 13.1.3.2 Using the linear mixed model to compute all six planned comparisons

m2 <- lmer(fat_mass ~ treatment*time + (1|id), data = fig1c)
m2_emm <- emmeans(m2,
specs = c("treatment", "time"),
lmer.df = "Satterthwaite")

Notes

1. important to add lmer.df = "Satterthwaite" argument

interaction contrasts

The interaction contrasts estimate planned comparisons 5 and 6.

m2_ixn <- contrast(m2_emm,
interaction = c("revpairwise"),
by = NULL,
summary(infer = TRUE)

m2_planned_ixn <- m2_ixn[c(1,6), ] %>%
data.table()

m2_planned_ixn %>%
kable(digits = c(1,1,2,2,1,2,2,2,3)) %>%
kable_styling()
treatment_revpairwise time_revpairwise estimate SE df lower.CL upper.CL t.ratio p.value
female_xy - female_xx week_1 - baseline -0.54 0.48 16 -1.55 0.47 -1.13 0.277
male_xy - male_xx week_1 - baseline -1.00 0.48 16 -2.01 0.01 -2.09 0.053

Notes

1. These are same results as those using the change scores.

Simple effects

The simple effects estimate planned comparisons 1-4.

# get simple effects from model

m2_pairs <- contrast(m2_emm,
method = c("revpairwise"),
simple = "each",
combine = TRUE,
summary(infer = TRUE)

# reduce to planned contrasts
m2_planned_simple <- m2_pairs[c(1,6,7,12),] %>%
data.table()

# clarify contrast
m2_planned_simple[, contrast := paste(time, contrast, sep = ": ")]

# dump first two cols
keep_cols <- names(m2_planned_simple)[-(1:2)]
m2_planned_simple <- m2_planned_simple[, .SD, .SDcols = keep_cols]

m2_planned_simple %>%
kable(digits = c(1,3,3,1,2,2,2,5)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
baseline: female_xy - female_xx -0.046 0.502 24.6 -1.08 0.99 -0.09 0.92766
baseline: male_xy - male_xx -1.422 0.502 24.6 -2.46 -0.39 -2.84 0.00902
week_1: female_xy - female_xx -0.582 0.502 24.6 -1.62 0.45 -1.16 0.25701
week_1: male_xy - male_xx -2.418 0.502 24.6 -3.45 -1.38 -4.82 0.00006

We can combine the six planned comparisons into a single table.

# create contrast table for ixns
m2_planned_ixn[, contrast := paste0("ixn: ", treatment_revpairwise)]

# dump first two cols
keep_cols <- names(m2_planned_ixn)[-(1:2)]
m2_planned_ixn <- m2_planned_ixn[, .SD, .SDcols = keep_cols]

# row bind -- smart enought to recognize column order
m2_planned <- rbind(m2_planned_simple,
m2_planned_ixn)

m2_planned %>%
kable(digits = c(1,2,2,1,2,2,2,5)) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
baseline: female_xy - female_xx -0.05 0.50 24.6 -1.08 0.99 -0.09 0.92766
baseline: male_xy - male_xx -1.42 0.50 24.6 -2.46 -0.39 -2.84 0.00902
week_1: female_xy - female_xx -0.58 0.50 24.6 -1.62 0.45 -1.16 0.25701
week_1: male_xy - male_xx -2.42 0.50 24.6 -3.45 -1.38 -4.82 0.00006
ixn: female_xy - female_xx -0.54 0.48 16.0 -1.55 0.47 -1.13 0.27697
ixn: male_xy - male_xx -1.00 0.48 16.0 -2.01 0.01 -2.09 0.05280

### 13.1.4 Regression to the mean

Regression to the mean is the phenomenon that if an extreme value is sampled, the next sample will likely be less extreme – it is closer to the mean. This makes sense. If we randomly sample a single human male and that individual is 6’10" (about four standard deviations above the mean), the height of the next human male that we randomly sample will almost certainly be closer to the mean (about 5’10" in the united states). This phenomenon also applies to sampling a mean. If we randomly sample five human males and the mean height in the group is 5’6" (about 3 SEM below the mean), the mean height of the next sample of five human males that we measure will almost certainly be closer to the mean. And, this phenomenon applies to sampling a difference in means. If we randomly sample two groups of five human males and the difference between the mean heights is 5.7" (about 3 SED), then the difference in mean height in the next sample of two groups of human males that we measure will almost certainly be closer to zero.

How does regression to the mean apply to the analysis of change scores in a pre-post experiment? Consider an experiment where the response is body weight in mice. In a pre-post experiment, mice are randomized to treatment group at baseline. Weight is measured at baseline and at post-baseline. We expect the difference in means at baseline to be zero. If there is no treatment effect, we expect the difference in means at post-baseline to be zero. Because of sampling error there is a difference in weight at baseline. Do we expect the mean difference to be the same post-baseline? No. Even if taken after only 1 hour, the post-baseline weight of each mouse would not equal the baseline weight because of variation in water intake and loss, food intake, fecal weight, and other variables that affect body weight (this is a within-mouse variance). There is a correlation between the pre and post measures – individual mice that weigh more than the mean at baseline will generally weigh more than the mean at post-baseline. But, the bigger this within-mouse variance (or, the longer the time difference between baseline and post-baseline), the smaller this correlation. The consequence is that all these uncontrolled variables contribute to the sampling variance of the means and the difference in means at baseline and post-baseline. If the difference is unusually large at baseline because of these uncontrolled factors contribution to within mouse variance, we expect the difference at post-baseline to be less large – there is regression to the mean (Figure 13.3A). This regression to the mean between the baseline and post-baseline measures will emerge as a treatment by time interaction, or, equivalently, a difference in the mean change score between treatments (Figure 13.3B).

## 13.2 Longitudinal designs with more than one-post baseline measure

A rigorous analysis of longitudinal data typically requires sophisticated statistical models but for many purposes, longitudinal data can be analyzed with simple statistical models using summary statistics of the longitud data. Summary statistics include the slope of a line for a response that is fairly linear or the Area Under the Curve (AUC) for humped responses.

### 13.2.1 Area under the curve (AUC)

An AUC (Area under the curve) is a common and simple summary statistic for analyzing data from a glucose tolerance test and many other longitudinal experiments. Here I use the AUC of glucose tolerance tests (GTT) as an example.

#### 13.2.1.1 AUC and iAUC

Let’s generate and plot fake GTT data for a single individual in order to clarify some AUC measurements and define new ones.

fake_auc <- data.table(
time = c(0, 15, 30, 60, 120),
glucose = c(116, 268, 242, 155, 121)
)
fake_auc[, glucose_change := glucose - glucose[1]]

AUC – The glucose tolerance curve for an individual is a connected set of straight lines that serves as a proxy for the continuous change of glucose over the period (Figure 13.4A). The AUC is the area under this set of straight lines (Figure 13.4A) and is conveniently computed as the sum of the areas of each of the connected trapezoids created by the connected lines.

iAUC – the incremental AUC (iAUC) is the baseline-zeroed AUC. It can be visualized as the area under the connected lines that have been rigidly shifted down so that the baseline value is zero (Figure 13.4B). The iAUC is computed using the trapezoid rule after first subtracting the individual’s baseline value from all of the individual’s values (including the baseline).

#### 13.2.1.2 Rethinking the iAUC as a change-score

The baseline-zeroed values of glucose used to compute iAUC are change scores from the baseline measure. This makes the iAUC a change score – it is the AUC minus the area under the baseline.

#### 13.2.1.3 Rethinking a t-test of the AUC as a t-test of the glucuose concentration averaged over the test period

The glucose concentration averaged over the post-baseline period for an individual is $$glucose_{gtt-post} = \frac{AUC}{Period}$$. Importantly, a t-test of $$glucose_{gtt-post}$$ is equivalent to a t-test of $$AUC$$ because the mean glucose values are simply the AUC values times the same constant (the test period) for all individuals.

#### 13.2.1.4 Rethinking a t-test of the iAUC as a t-test of the change from baseline (change score) averaged over the test period

The change from baseline (change score) averaged over the test period for an individual is $$glucose_{gtt-change} = \frac{iAUC}{Period}$$. As above, a t-test of $$glucose_{gtt-change}$$ is equivalent to a t-test of $$iAUC$$ because the $$glucose_{gtt-change}$$ values are simply the iAUC values times the same constant (the test period) for all individuals.

#### 13.2.1.5 Rethinking AUC as a pre-post design.

We can now rethink data used to construct an AUC as a pre-post design using the baseline value ($$glucose_0$$) as a measure of $$pre$$, $$glucose_{gtt-post}$$ as a measure of $$post$$ and $$glucose_{gtt-change}$$ as a measure of $$post - pre$$ (note that $$glucose_{gtt-change} = glucose_{gtt-post} - glucose_0$$). And, we can use the principles outlined in Comparing change from baseline (pre-post) above to determine best practices.

#### 13.2.1.6 Best practice strategies for analyzing AUC

1. If the treatment was applied prior to the baseline measure, then use the change score model (Or use the linear mixed model if you want to estimate the effect at baseline). This is the most common kind of design in the experimental biology literature
1. glucose_gtt_post - glucose_0 ~ treatment
2. iauc ~ treatment. The t and p values are equivalent to those in 1a
3. glucose ~ treatment*time + (1|id). This is a linear mixed model that allows the computation of both the effect of treatment at baseline and the effect of treatment on the change in the response to the condition (the interaction effect). This is not a LMM of the glucose values at all time periods but a pre-post LMM with the values of $$\texttt{glucose_0}$$ and $$\texttt{glucose_gtt_post}$$ stacked in the data column $$\texttt{glucose}$$. The t and p values for the interaction effect are equivalent to those in 1a and 1b.
2. If treatment is randomized at baseline, use the ANCOVA linear model.
1. glucose_post ~ treatment + glucose_0.
2. auc ~ treatment + glucose_0. The t and p values are equivalent to those in 2a but the units of the response or the effect.

#### 13.2.1.7 The difference in iAUC between groups is an interaction effect. This is an important recognition.

If the treatment was not randomized at baseline, then the potential effects in a glucose tolerance test are:

1. the effect of treatment at baseline, which is a measure of what is going on independent of added glucose. This is the baseline effect.
2. the effect of glucose infusion, which is a measure of the physiological response during the absorptive state. This effect in each treatment level are the change scores. Researchers typically are not interested in this effect (we know glucose levels rise then fall).
3. the difference in the change from baseline to the new condition (glucose infusion), which is equivalent to the difference in the change scores. This is the interaction effect. An interaction effect is evidence that the difference between treatment and control during the post-baseline (absorptive state) period is not more of the same difference occurring at baseline (fasted state) but something different (Figure 13.5).

iAUC is a change score (see Rethinking the iAUC as a change-score above). Recall (or re-read) from section 13.1.2.2 above that the difference in the mean change-score between two groups is the interaction effect of the linear model with two factors $} (“cn” and “tr”) and$} (“pre” and “post”) and their interaction.

$\texttt{glucose} = \beta_0 + \beta_1 (\texttt{treatment}_\texttt{tr}) + \beta_2 (\texttt{time}_\texttt{post}) + \beta_3 (\texttt{treatment}_\texttt{tr} \times \texttt{time}_\texttt{post}) + \varepsilon$

This means a t-test of iAUC is equivalent to thet-test of the interaction effect of treatment by time.

This recognition adds an important perspective to the controversy of using iAUC in the analysis of glucose tolerance curves. iAUC is often used in place of AUC to “adjust” for baseline variation in glucose with the belief that this adjustment makes the AUC measure independent of (uncorrelated with) baseline glucose. As Allison et al. note, a change score doesn’t do this for us. The correct way to adjust for baseline variation is adding the baseline measure as a covariate in the linear model (Section 13.1 above).

$\texttt{glucose_mean_post} = \beta_0 + \beta_1 (\texttt{treatment}_\texttt{tr}) + \beta_2 (\texttt{glucose_0}) + \varepsilon$

Nevertheless, in a pre-post design, if the treatment is applied prior to the baseline measure, it is the interaction effect that we want as the measure of the treatment effect and not the difference between post-baseline means conditional on (adjusted for) baseline. That is we want the change score model and not the ANCOVA linear model.

#### 13.2.1.8 Issues in the analysis of designs where the treatment is applied prior to the baseline measure

Almost all glucose tolerance tests in the experimental biology literature have designs where the treatment (genotype, diet, exercise) was applied prior to the baseline measure and we cannot expect the difference in means to be zero at baseline. In these, it is common, but far from standard, for researchers to analyze the data using t tests or post-hoc tests with the AUC adjusted for baseline as the response. This is equivalent to the recommended linear model in 1b.

Two common alternative analyses with potential consequences that can severely mislead the researcher because of conflated effects are

1. Separate t tests at each time point.
2. t test/post-hoc tests of $$\texttt{auc}$$ (the standard AUC)

The reason that these can mislead is because the results conflate the baseline effect and the interaction effect (Figure 13.5). Both the baseline effect and the interaction effect are of physiological interest. The difference in the mean of the AUC combines these two effects. The difference in the means at any post-baseline time point combines these two effects. A t-test of the AUC or separate t-tests at the post-baseline time points conflate these effects. The conflated results muddle the physiology. If a researcher wants to simply conclude “The knockout causes glucose intolerance” then the full AUC is okay but the researcher should recognize that this is the question they are trying to answer. But if a researcher is asking, “is the difference between treatment groups in the absorptive (post-baseline) state something different, or more of the same, as the difference between treatment groups in the fasted (baseline) state?” then the researcher should avoid t-tests of the AUC or the separate t-tests at post-baseline times.

Two common, alternative analyses that can mislead because of model assumptions that are more severely violated than best practice models are

1. two-way ANOVA with treatment and time as the factors, followed by post-hoc tests. The advantage of this analysis is the ability to estimate the effect at baseline and the interaction effect. But, the correlated error due to the multiple measures on each individual violates the independence assumption. Inference from this model will generally be optimisitic – the CIs will be too narrow and the p-values too small. This is an example of pseudoreplication. The correlated error can be modeled with a GLS linear model or a Linear Mixed Model.
2. repeated measures ANOVA with treatment and time as the factors, followed by post-hoc tests. Like the two-way ANOVA, a repeated measures ANOVA can be used to estimate both the baseline and interaction effects. Unlike the two-way ANOVA, a repeated measures ANOVA models the correlated error. But the model for the correlated error is too unrealistic. The better alternatives for modeling the correlated error are the GLS linear model or the Linear Mixed Model.

#### 13.2.1.9 Example 1 – Treatment applied prior to baseline

Source data: Fig. 3f

##### 13.2.1.9.2 Inference

change-score linear model (model 1a), which estimates the interaction effect (the effect of treatment on the difference in the change from baseline to post-baseline).

m1 <- lm(glucose_gtt_post - glucose_0 ~ treatment,
data = fig3f_wide)

m1_emm <- emmeans(m1, specs = "treatment")
m1_pairs <- contrast(m1_emm,
method = "revpairwise") %>%
summary(infer = TRUE)

m1_pairs %>%
kable(digits = 3) %>%
kable_styling()
contrast estimate SE df lower.CL upper.CL t.ratio p.value
KO - WT 48.304 36.904 12 -32.104 128.711 1.309 0.215
# increased digits to compare with lmm below

Notes

1. The estimate is not the average difference over the period but the difference in the average change from baseline. The knockout has an average change from baseline that is 48.3 mg/dL larger than the average change from baseline of the wildtype. Over the period, blood glucose in the knockout is 48.3 mg/dL larger than the expected difference if the only mechanisms generating a difference post-baseline are the same as the mechanisms generating the differences at baseline.
2. This is equivalent to a t test of the AUC adjusted for baseline ($) m1_t <- t.test(iauc ~ treatment, data = fig3f_wide, var.equal = TRUE) %>% tidy() m1_t[1, c(1, 4:8)] ## # A tibble: 1 x 6 ## estimate statistic p.value parameter conf.low conf.high ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -5796. -1.31 0.215 12 -15445. 3852. Notes 1. The estimate of the effect is the average change from baseline over the period times the period. Can anyone look at this number and claim with sincerity, wow that is huge!. The estimate from the change-score model, which is simply the difference in the average change from baseline over the period is a number that should be interpretable. 2. The change-score model (or the analysis of the AUC adjusted for baseline) does not estimate the effect of treatment at baseline (the effect in the fasted state). Researchers probably want this. For this we need a linear model with correlated error such as a linear mixed model. Linear mixed model (model 1c), which estimates the interaction effect and the baseline effect. m2 <- lmer(glucose ~ treatment*time + (1|id), data = fig3f_long) m2_emm <- emmeans(m2, specs = c("treatment", "time"), lmer.df = "Satterthwaite") m2_pairs <- contrast(m2_emm, method = "revpairwise", simple = "each", combine = TRUE, adjust = "none") %>% summary(infer = TRUE) m2_ixn <- contrast(m2_emm, interaction = "trt.vs.ctrl", adjust = "none") %>% summary(infer = TRUE) m2_pairs %>% kable(digits = 3) %>% kable_styling() time treatment contrast estimate SE df lower.CL upper.CL t.ratio p.value glucose_0 . KO - WT 111.571 37.454 18.976 33.172 189.970 2.979 0.008 glucose_gtt_post . KO - WT 159.875 37.454 18.976 81.476 238.274 4.269 0.000 . WT glucose_gtt_post - glucose_0 191.518 26.095 12.000 134.661 248.375 7.339 0.000 . KO glucose_gtt_post - glucose_0 239.821 26.095 12.000 182.965 296.678 9.190 0.000 m2_ixn %>% kable(digits = 3) %>% kable_styling() treatment_trt.vs.ctrl time_trt.vs.ctrl estimate SE df lower.CL upper.CL t.ratio p.value KO - WT glucose_gtt_post - glucose_0 48.304 36.904 12 -32.104 128.711 1.309 0.215 # increased digits to compare to change-score model above Notes 1. The treatment effect at baseline is in the first row of the m2_pairs contrast table from the linear mixed model. 2. The interaction effect (the effect of treatment on the change from baseline) is in the m2_ixn contrast table from the linear mixed model. The estimate, SE, confidence intervals, t-value, and p-value are the same as those from the change score model in m1_pairs. ##### 13.2.1.9.3 A t test of the AUC or separate t-tests at each time point result in ambiguous inference t-test of the AUC m3_t <- t.test(auc ~ treatment, data = fig3f_wide, var.equal = TRUE) %>% tidy() m3_t[1, c(1, 4:8)] ## # A tibble: 1 x 6 ## estimate statistic p.value parameter conf.low conf.high ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -19185 -3.52 0.00422 12 -31057. -7313. Separate t-tests at each time point m4_t0 <- t.test(time_0 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t30 <- t.test(time_30 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t60 <- t.test(time_60 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t90 <- t.test(time_90 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t120 <- t.test(time_120 ~ treatment, data = fig3f_wide, var.equal = TRUE) m4_t <- data.table( Time = times, p.value = c(m4_t0$p.value,
m4_t30$p.value, m4_t60$p.value,
m4_t90$p.value, m4_t120$p.value
)
)

m4_t %>%
kable %>%
kable_styling()
Time p.value
0 0.0014957
30 0.0006251
60 0.0096282
90 0.0137238
120 0.0263193

## 13.4 Comparing ratios

1. The ratio is a density (count per length/area/volume) or a rate (count/time).
• Example: number of marked cells per area of tissue.
• Best practice: GLM for count data with an offset in the model, where an offset is the denominator of the ratio.
1. The ratio is relative to a standard (“normalized”). Example: expression of focal mRNA relative to expression of a standard mRNA that is thought not to be affected by treatment. Best practice: GLM for count data with an offset in the model, where an offset is the denominator of the ratio.
2. The ratio is a proportion (or percent).
• Example: Number of marked cells per total number of cells.
• Best practice: GLM logistic.
1. The ratio is relative to a whole and both the thing in the numerator and the thing in the denominator grow (allometric data).
• Example: adipose mass relative to total lean body mass.
• Best practice: ANCOVA linear model.
• Alert! – It has been known for more than 100 years, and repeatedly broadcasted, that inference from ratios of allometric data range from merely wrong (the inferred effect size is biased) to absurd (the direction of the inferred effect is opposite that of the true effect).

## 13.5 Don’t do this stuff

1. Normalize so all control values are 1.

## 13.7 Hidden code

### 13.7.1 Import Fig4c data

data_from <- "Identification of osteoclast-osteoblast coupling factors in humans reveals links between bone and energy metabolism"
file_name <- "41467_2019_14003_MOESM6_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

fig4c_type <- c("text", rep("numeric", 8))
sheet = "Fig4C and 4E",
range = "A3:I48",
col_names = FALSE,
col_types <- fig4c_type) %>%
data.table()

# DPP4 (ng/mL), GLP-1 (pM), Glucose (mmol/L),   Insulin (uIU/mL)
measures <- c("DPP4", "GLP1", "Glucose", "Insulin")
# post is 3 months post treatment
period <- c("baseline", "post")
new_colnames <- paste(rep(measures, 2),
rep(period, each = 4),
sep = "_")

old_colnames <- colnames(fig4c)
setnames(fig4c,
old = old_colnames,
new = c("treatment",
new_colnames))

treatment_levels <- c("Placebo", "Denosumab")
fig4c[, treatment := factor(treatment,
levels = treatment_levels)]
fig4c[, id := paste0("human_", .I)] # .I inserts row number
#View(fig4c)
fig4c_long <- melt(fig4c,
id.vars = c("id", "treatment"),
measure.vars = c("DPP4_baseline", "DPP4_post"),
variable.name = "time",
value.name = "dpp4")

fig4c_long[, time := substr(as.character(time),
6,
nchar(as.character(time)))]
fig4c_long[, time := factor(time,
levels = c("baseline", "post"))]

# View(fig4c_long)

### 13.7.2 XX males fig1c

data_from <- "XX sex chromosome complement promotes atherosclerosis in mice"
file_name <- "41467_2019_10462_MOESM6_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)
sheet = "Figure 1C",
range = "C3:G6",
col_names = FALSE) %>%
data.table()
setnames(fig1c_1,
old = colnames(fig1c_1),
new = paste("mouse", 1:5))
fig1c_1[, time := rep(c("baseline", "week_1"), each = 2)]
fig1c_1[, sex := rep(c("female", "male"), 2)]

sheet = "Figure 1C",
range = "I3:M6",
col_names = FALSE) %>%
data.table()
setnames(fig1c_2,
old = colnames(fig1c_2),
new = paste("mouse", 6:10))
fig1c_2[, time := rep(c("baseline", "week_1"), each = 2)]
fig1c_2[, sex := rep(c("female", "male"), 2)]

fig1c <- rbind(data.table(chromosome = "xx",
melt(fig1c_1,
id.vars = c("sex", "time"),
variable.name = "id",
value.name = "fat_mass")),
data.table(chromosome = "xy",
melt(fig1c_2,
id.vars = c("sex", "time"),
variable.name = "id",
value.name = "fat_mass")))
# id is not unique
fig1c[, id := paste(sex, id, sep="_")] # now it is

fig1c[, treatment := paste(sex, chromosome, sep = "_")]
# unique(fig1c\$treatment)
# "female_xx" "male_xx"   "female_xy" "male_xy"
treatment_levels <- c("female_xx", "female_xy", "male_xx", "male_xy")
fig1c[, treatment := factor(treatment,
levels = treatment_levels)]

fig1c_wide <- dcast(fig1c, chromosome + sex + id + treatment ~ time, value.var = "fat_mass")
fig1c_wide[, percent_change := (week_1 - baseline)/baseline*100]

### 13.7.3 Generation of fake data to illustrate regression to the mean

n_iter <- 1000
n <- 6
np1 <- n + 1
N <- n*2
mu <- 30
beta_1 <- 5
sigma_among <- 2
sigma_within <- sigma_among/2
rho <- sigma_among/(sigma_among + sigma_within)
max_baseline_d <- -9999

time_levels <- c("pre", "post")
treatment_levels <- c("cn","tr")
fake_data <- data.table(
treatment = rep(rep(treatment_levels, each = n), 2),
time = rep(time_levels, each = n*2),
id = rep(paste0("mouse_0", 1:N), 2)
)
fake_data[, time := factor(time,
levels = time_levels)]
fake_data[, treatment := factor(treatment,
levels = treatment_levels)]
diff_pre <- numeric(n_iter)
diff_post <- numeric(n_iter)
ixn <- numeric(n_iter)
ixn_p <- numeric(n_iter)
p_ancova <- numeric(n_iter)
p_change <- numeric(n_iter)
seed_i <- 0
for(i in 1:n_iter){
seed_i <- seed_i + 1
set.seed(seed_i)
# the contribution to variance due to individual differences
mu_i <- rnorm(N, mean = mu, sd = sigma_among)

# the contribution to variance due to variation within individual
mu_i_pre <- rnorm(N, sd = sigma_within)
mu_i_post <- rnorm(N, sd = sigma_within)

pre <- mu_i + mu_i_pre
post <- mu_i + beta_1 + mu_i_post
fake_data[, weight := c(pre, post)]
fit <- lm(weight ~ treatment*time, data = fake_data)
diff_pre[i] <- mean(pre[np1:N]) - mean(pre[1:n])
diff_post[i] <- mean(post[np1:N]) - mean(post[1:n])
ixn[i] <- coef(summary(fit))[4, "Estimate"]
ixn_p[i] <- coef(summary(fit))[4, "Pr(>|t|)"]

fake_data_wide <- dcast(fake_data,
id + treatment ~ time,
value.var = "weight")
fake_data_wide[, change := post-pre]
fake_data_wide[, percent := (post-pre)/pre*100]
p_ancova[i] <- coef(summary(lm(post ~ treatment + pre,
data = fake_data_wide)))[2, "Pr(>|t|)"]
p_change[i] <- coef(summary(lm(change ~ treatment,
data = fake_data_wide)))[2, "Pr(>|t|)"]

}

diverge_list <- which(abs(diff_pre) < 0.1 & abs(diff_post) > 1)
converge_list <- which(abs(diff_pre) > 1 & abs(diff_post) < 0.1)
p_list <- which(p_ancova > 0.1 & p_change < 0.01)
keep <- intersect(converge_list, p_list)

max_seed_con <- which(abs(ixn) == max(abs(ixn)[converge_list]))

# convergence ixn
set.seed(max_seed_con)
mu_i <- rnorm(N, mean = mu, sd = sigma_among)
# the contribution to variance due to variation within individual
mu_i_pre <- rnorm(N, sd = sigma_within)
mu_i_post <- rnorm(N, sd = sigma_within)
pre <- mu_i + mu_i_pre
post <- mu_i + beta_1 + mu_i_post
fake_data[, weight := c(pre, post)]
fake_data_wide <- dcast(fake_data,
id + treatment ~ time,
value.var = "weight")
fake_data_wide[, change := post-pre]
fake_data_wide[, percent := (post-pre)/pre*100]
# coef(summary(lm(post ~ treatment + pre,
#                 data = fake_data_wide)))
# coef(summary(lm(change ~ treatment,
#                 data = fake_data_wide)))
# coef(summary(lm(percent ~ treatment,
#                 data = fake_data_wide)))

# not much we can do about divergence so limit to convergence

fake_means <- fake_data[, .(weight = mean(weight)),
by = c("time", "treatment")]

### 13.7.4 Import fig3f

data_from <- "Innervation of thermogenic adipose tissue via a calsyntenin 3β–S100b axis"
file_name <- "41586_2019_1156_MOESM5_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

sheet = "Figure 3f",
range = "H2:M15",
col_names = FALSE) %>%
data.table()

time_cols <- c(0, 30, 60, 90, 120)
setnames(fig3f_wide,
old = names(fig3f_wide),
new = c("id",
paste0("time_", time_cols)))

treatment_levels <- c("WT", "KO")
fig3f_wide[, treatment := factor(substr(id, 1, 2),
treatment_levels)]

fig3f_wide[, glucose_0 := time_0]

times <- c(0, 30, 60, 90, 120)
n_times <- length(times)
period_full <- times[n_times] - times[1]
period_post <- times[n_times] - times[2]

time_cols <- paste0("time_", times)
Y <- fig3f_wide[, .SD, .SDcols = time_cols] %>%
as.matrix()

# auc
fig3f_wide[, auc := apply(Y, 1, trap.rule, x = times)]

# mean response over full period
fig3f_wide[, glucose_gtt_post := auc/period_full]

# auc of the change. Equal to iauc of le Floch
fig3f_wide[, iauc := apply(Y - Y[,1], 1, trap.rule, x = times)]

# mean change over full period
fig3f_wide[, glucose_gtt_change := iauc/period_full]

# View(fig3f_wide)

**convert fig3f to long format for LMM