# Chapter 12 Violations of independence, homogeneity, or Normality

## 12.1 Lack of independence

### 12.1.2 Inferences from the linear mixed model and paired t-tests are not the same when there are more than two groups

#### 12.1.2.1 Example 2 (diHOME fig2a) – more than two groups

The structure of the Figure 2A experiment from the chapter Linear models with a single, categorical *X* is ambiguous. It is not clear from the archived data if the measures of 12,13 diHome were on separate mice within each of the three treatments (“Control,” “1 hour cold,” “30 min NE”) or if the same mice were used in each set. If each mouse was measured three times, I assume the researchers would have stated this, which they didn’t. But, to outline modeling correlated data when we have more than two groups, let’s say each mouse in the Figure 2A experiment was subjected to all three treatments. We have three measures per mouse and the response measure violates the independence assumption.

The linear model (in regression notation) is

\[\begin{equation} diHome = (\beta_0 + \beta_{0_i}) + \beta_1 treatment\_cold\ + \beta_2 treatment\_NE\ + \varepsilon \end{equation}\]

#### 12.1.2.2 Fit the model

```
# fit the model
<- lmer(diHOME ~ treatment + (1|id), data = fig2a) fig2a_m2
```

Notes

- If a treatment factor has more than two groups, nothing special has to be done relative to a model with a treatment factor with only two groups.

#### 12.1.2.3 Inference

Inference using the linear model with added random factor:

```
# estimated marginal means table
<- emmeans(fig2a_m2, specs = "treatment")
fig2a_m2_emm
# contrasts table
<- contrast(fig2a_m2_emm,
fig2a_m2_planned method = "trt.vs.ctrl",
adjust = "none") %>%
summary(infer = TRUE)
%>%
fig2a_m2_planned kable(digits = c(1,1,2,1,1,1,5,5)) %>%
kable_styling()
```

contrast | estimate | SE | df | lower.CL | upper.CL | t.ratio | p.value |
---|---|---|---|---|---|---|---|

1 hour cold - Control | 8.1 | 3.88 | 9.4 | -0.6 | 16.8 | 2.08124 | 0.06589 |

30 min NE - Control | 14.8 | 3.62 | 9.0 | 6.6 | 23.0 | 4.08718 | 0.00271 |

Inference using two paired t-tests:

```
<- t.test(
fig2a_ttest_cold x = fig2a[treatment == "1 hour cold", diHOME],
y = fig2a[treatment == "Control", diHOME],
paired = TRUE
) fig2a_ttest_cold
```

```
##
## Paired t-test
##
## data: fig2a[treatment == "1 hour cold", diHOME] and fig2a[treatment == "Control", diHOME]
## t = 2.1973, df = 4, p-value = 0.09293
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.925189 16.533005
## sample estimates:
## mean of the differences
## 7.303908
```

```
<- t.test(
fig2a_ttest_ne x = fig2a[treatment == "30 min NE", diHOME],
y = fig2a[treatment == "Control", diHOME],
paired = TRUE
) fig2a_ttest_ne
```

```
##
## Paired t-test
##
## data: fig2a[treatment == "30 min NE", diHOME] and fig2a[treatment == "Control", diHOME]
## t = 3.7564, df = 5, p-value = 0.01321
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.670273 24.918436
## sample estimates:
## mean of the differences
## 14.79435
```

Notes

**important**– the*t*-test for the comparison of “1 hour cold” and “Control” excluded mouse 6 because of a missing “1 hour cold” measure. The linear mixed model was fit to all data, including the “Control” and “30 min NE” measures. The consequence is higher power/more precision in the linear mixed model.- The paired t-test
*t*and*p*values are not equal to those from the linear mixed model. This is partly because of the missing data (note 1). Even if there were no missing data, these wouldn’t be equal because the linear mixed model uses an estimate of \(\sigma\) computed from the single model with all three groups to compute the standard errors. See the comparison of t-tests and the contrast table from a linear model in Section**??**. - The results of the separate, paired t-tests and the linear mixed model differ in a way that might affect our inference about the system. Which is correct? Neither – they simply make different assumptions about the data generating model.

- The linear model strategy has more power and precision but this advantage is small. The best reason to use linear models instead of separate
*t*-tests is learning how to use linear models, and their extensions, gives you phenomenal cosmic power. - Do not compute both separate paired
*t*-tests**and**the linear models and then convince yourself that that the assumption of the method with the*p*-value that matches your hypothesis is correct. See the p-hacking discussion above.

## 12.2 Heterogeneity of variances

Heterogeneity of variance among treatment groups is a problem for inference, especially if the sample size is unequal among groups (statisticians tend to agree that heterogeneity is much more problematic than a non-normal response).

#### 12.2.0.1 A Welch t-test is a special case of a linear model for heterogeneity

A “which test?” strategy points to a Welch’s *t*-test in place of Student’s *t*-test if there is heterogeneity of variances between treatment groups. A Welch t-test is infrequent in the experimental biology literature, perhaps because

- it is poorly known and it doesn’t occur to researchers to use a test that models heterogeneity of variances.
- heterogeneity often arises in right-skewed data, which is often analyzed with a non-parametric test like the Mann-Whitney U test.

The Welch *t*-test is a special case of a linear model that explicitly models the within-group variance using **generalized least squares** (GLS). The 95% CI of a mean differences and *p*-values from the fit gls linear model and from Welch’s t-test are the same. Advantages of using a linear modeling strategy is that a researcher uses the model to estimate effects (difference in means) and measures of uncertainty in the effects (standard errors or confidence intervals of the difference). Advantages of specifically using a GLS linear model is that it is easily expanded to analyze more complex designs including 1) more than one factor, 2) added covariates, 3) correlated residuals due to non-independence.

Some statisticians argue that researchers should *always* use a Welch t-test instead of Student’s t-test. Given this logic, researchers should consider using GLS linear models for more complex experimental designs (added covariates, factorial) in place of classical ANCOVA and two-way ANOVA.

Modeling variance heterogeneity is the focus of Chapter 18 so the account here is brief. Heterogeneity can be modeled using a generalized least squares linear model with the `gls`

function. The `weights`

argument is used to model the variances using each group’s sample variance. In this example, I use the data from the Figure 1b experiment, which can be compared to the analysis of the same data in Example 2 above.

#### 12.2.0.2 Fit the model

```
# gls fails with missing data
<- fig2a[is.na(diHOME) == FALSE,] # omit rows with missing data
subdata <- gls(diHOME ~ treatment,
fig2a_m3 data = subdata,
weights = varIdent(form = ~ 1 | treatment))
```

The model `fig2a_m3`

uses variance computed in each group separately as the estimate of \(\sigma\) for that group. The coefficient table of the GLS model is

#### 12.2.0.3 Inference from the linear model

```
<- cbind(coef(summary(fig2a_m3)),
fig2a_m3_coef confint(fig2a_m3))
%>%
fig2a_m3_coef kable(digits = c(4,4,4,6,4,4)) %>%
kable_styling()
```

Value | Std.Error | t-value | p-value | 2.5 % | 97.5 % | |
---|---|---|---|---|---|---|

(Intercept) | 12.0231 | 1.2001 | 10.0181 | 0.000000 | 9.6708 | 14.3753 |

treatment1 hour cold | 7.1404 | 3.1635 | 2.2571 | 0.040505 | 0.9401 | 13.3407 |

treatment30 min NE | 14.7944 | 4.5688 | 3.2382 | 0.005951 | 5.8398 | 23.7490 |

Notes

**Important**for reporting CIs and*p*-values. Unlike the linear model modeling homogenous variance, the CIs and*p*-values for the coefficients of \(\texttt{treatment1 hour cold}\) and \(\texttt{treatment30 min NE}\) are*not*the same as the*p*-values of these equivalent contrasts in the contrasts table (see below). The reason is, the computation of the CI and*p*-values in the two tables use two different degrees of freedom. Report the CI and*p*-values from the contrast table using the Satterthwaite df.

The modeled means and contrasts are computed as above for the `lm`

object

```
<- emmeans(fig2a_m3, specs="treatment")
fig2a_m3_emm fig2a_m3_emm
```

```
## treatment emmean SE df lower.CL upper.CL
## Control 12.0 1.20 5 8.94 15.1
## 1 hour cold 19.2 2.93 4 11.04 27.3
## 30 min NE 26.8 4.41 5 15.49 38.1
##
## Degrees-of-freedom method: satterthwaite
## Confidence level used: 0.95
```

Notes

- The SE of the means in this table are modeled SEs but are equal to the sample SE of the means, because this was specified in the GLS model.

```
<- contrast(fig2a_m3_emm,
fig2a_m3_pairs method = "revpairwise",
adjust = "none") %>%
summary(infer = TRUE)
%>%
fig2a_m3_pairs kable(digits = c(1,4,4,1,4,4,4,5)) %>%
kable_styling()
```

contrast | estimate | SE | df | lower.CL | upper.CL | t.ratio | p.value |
---|---|---|---|---|---|---|---|

1 hour cold - Control | 7.1404 | 3.1635 | 5.3 | -0.8394 | 15.1202 | 2.2571 | 0.07026 |

30 min NE - Control | 14.7944 | 4.5688 | 5.7 | 3.4896 | 26.0991 | 3.2382 | 0.01889 |

30 min NE - 1 hour cold | 7.6540 | 5.2915 | 8.4 | -4.4595 | 19.7675 | 1.4465 | 0.18452 |

Notes

- Compare the statistics for “1 hour cold - Control” and “30 min NE - Control” to the coefficients for \(\texttt{treatment1 hour cold}\) and \(\texttt{treatment30 min NE}\) in the coefficient table. The estimates, SE, and
*t*are the same but the CIs and*p*values differ. The contrast function is using a different method (“satterthwaite”) for computing the degrees of freedom and this results in a different value of the*t*-distribution tail area used to compute the CI and*p*value.

#### 12.2.0.4 Inference using Welch t-test

```
<- t.test(fig2a[treatment == "1 hour cold", diHOME],
test1 == "Control", diHOME],
fig2a[treatment var.equal = FALSE)
<- t.test(fig2a[treatment == "30 min NE", diHOME],
test2 == "Control", diHOME],
fig2a[treatment var.equal = FALSE)
<- t.test(fig2a[treatment == "30 min NE", diHOME],
test3 == "1 hour cold", diHOME],
fig2a[treatment var.equal = FALSE)
```

Notes

- the default
`t.test`

is the Welch t-test. However, I’ve included`var.equal = FALSE`

so the method is transparent.

#### 12.2.0.5 Compare inference from the linear model and Welch t-tests

Compare the contrast *p* values to the three Welch *t*-tests of all pairs of treatment levels in the fig2a experiment.

contrast | t (gls lm) | p (gls lm) | t (welch t) | p (welch t) |
---|---|---|---|---|

1 hour cold - Control | 2.257139 | 0.0702626 | 2.257139 | 0.0702622 |

30 min NE - 1 hour cold | 1.446454 | 0.1845154 | 1.446454 | 0.1845155 |

30 min NE - Control | 3.238157 | 0.0188908 | 3.238158 | 0.0188904 |

The *t* and *p*-values computed from the GLS linear model and from the three, pairwise Welch *t*-tests are the same (to about the 6th decimal place). They are the same because each is estimating \(\sigma^2\) separately for each group and not as the pooled (among two groups for *t*-test or three groups for the linear model) estimate and because they use the same degrees of freedom to compute the *p*-value.

Let’s summarize these comparisons

- Inference from a linear model using homogenous variance (the
`lm`

function) and from a Student’s*t*-test are the same if there are only two levels in the treatment variable. - Inference from a linear model using homogenous variance (the
`lm`

function) and from the series of pairwise, Student’s*t*-tests differ when there are more than two levels in the treatment variable. - Inference from a GLS linear model using heterogenous variance (the
`gls`

function) and from a Welch*t*-test are the same regardless of the number of levels in the treatment variable.

Even though the linear model that models heterogeneity and the Welch *t*-test produce the same results, researchers should use the linear model because

- A linear modeling strategy encourages researchers to think about the effect and uncertainty in the effect and not just a
*p*-value. - The linear model is nearly infinitely flexible and expandible while the
*t*-test has extremely limited flexibility (The Welch*t*-test is one way to expand the classical, Student’s*t*-test).

### 12.2.1 When groups of the focal test have >> variance

## 12.3 The conditional response isn’t Normal

It is common to hear or read from a statistitican that of all the issues with inference from a classical linear model (including *t*-tests and ANOVA), non-normal “data” is the least worrisome – we should be much more concerned about lack of independence and heterogeneity.

- The theoretical basis for this sentiment is the
**central limit theorem**. In short, if we repeatedly 1) sample a non-normal distribution and 2) compute the sample mean, the distribution*of these resampled means*is approximately normal (the distribution converges to normal as the sample size increases). Because inferential statistics using a classical linear model assume a normal distribution of the means, inference from classical linear models are increasingly robust to non-normal distributions as the sample size increases. The main issue for applied statistics in experimental biology is the small sample sizes. - Empirical results from simulations support the theory. Actual type I error using classical linear models with non-normal data tends to be close to the nominal (say, 0.05) level.
- That said,
**generalized linear models that explicitly model the distribution have more power than classical linear models and this is a pretty good reason to use GLMs**. - Confidence intervals computed using the normal distribution on non-normal data can be awkward, for example, the CIs might include impossible values such as negative counts or percent of cells expressing a certain marker above 100 %.

The normality assumption of the classical linear model does not apply to the distribution of the response (*Y*) variable and certainly not to the distribution of the treatment (*X*) variable but to the *residuals of the fit model* (actually, it applies to the distribution of the means, but if the sampling distribution is normal then the means are normal). An equivalent way to think about this is that the normality assumption applies to the distribution of the response if all values of \(X\) in the sample are the same – this is known as the **conditional response**. Contrary to common practice by experimental biologists, and the advice of some textbooks, there is no good reason to use the *p*-value from a test of normality as a decision tool to use either a *t*-test or a non-parametric test like the Mann-Whitney-Wilcoxon. A test of normality for small samples will frequently result in \(p > 0.05\) (“normal”), even for fake data sampled from a nonnormal distribution, while a test with a large sample size will frequently result in \(p < 0.05\) (“not-normal”) even if a histogram of the residuals appears normal.

A Normal Q-Q plot (Section 11.3.2) is a useful tool for the decision to use some alternative to inference using a Normal assumption. If a Q-Q plot indicates that the residuals are far from expected for a normal distribution, then the researcher has several alternatives, the best choice of which partially depends on the goals of the analysis. Some alternative to a linear model assuming normal distribution include:

- Count data – count data tend to be right skewed and groups with higher mean counts tend to have larger variance.

- Generalized linear models using either a negative binomial or quasi-poisson distribution family are a good alternative to consider.
- Bootstrap confidence intervals and permutation
*p*-values are resampling techniques. Bootstrap confidence intervals are a good alternative if the sample size is much larger (say, > 40) than typical in experimental biology. Permutation*p*-values are a good alternative regardless of sample size. - A GLS linear model to account for any heterogeneity in variance that typically accompanies non-normal data. This doesn’t address the non-normal distribution but, again, heterogeneity is typically a much larger problem than non-normality.
- Log transformation of a count response is controversial at best. log transformations can be pretty good at making the response look more like it was sampled from a normal distribution (and can make variances more similar). But…If the counts include zero, researchers have to add a kludge factor (typically equal to 1) to all counts prior to transformation. The value of the kludge factor is arbitrary and matters (adding .1 gives different results than adding 1). A log transformation (or any transformation) raises interpretation problems. For the log transformation, the effect is the difference between the means of the log-transformed counts. I don’t know what the magnitude of an effect on a log scale means biologically. If we backtransform this, the effect is the ratio of the geometric means of the two groups. A ratio effect is easy enough to interpret but do we really want to model the geometric and not the arithmetic means? Finally, log transformation does not specifically address either the shape of the distribution or the heterogeneity that often comes with the non-normal shape. A generalized linear model can specifically model both the shape and the variance.
- Classic non-parametric tests such as Mann-Whitney-Wilcoxon were invented when computers were not fast enough to perform permutation tests and before the development of Generalized Linear Models (and other modern methods). Since at least 1990, there are better alternatives to non-parametric tests.

- Fraction (proportion) data – For example, the number of cells expressing a certain marker relative to all counted cells in the sample. Fraction data have hard bounds at 0 and 1, or at 0 and 100 if converted to a percent. Fraction data tend to be right skewed if the mean is closer to zero and left skewed if the mean is closer to the upper bound.

- Generalized linear model using a binomial distribution family with a logit link (“logistic regression”) is a good alternative to consider. This model is used for Bernouili (success/fail) responses (for example the subject “lived” or “died”), where success is assigned the value 1 and fail is assigned the value 0. Using this model with proportion data is equivalent to assigning a 0 (“does not express marker”) or 1 (“expresses marker”) to all cells in the count.
- Bootstrap confidence intervals and permutation
*p*-values are resampling techniques. Bootstrap confidence intervals are a good alternative if the sample size is much larger (say, > 40) than typical in experimental biology. Permutation*p*-values are a good alternative regardless of sample size. - A GLS linear model to account for any heterogeneity in variance that typically accompanies non-normal data. This doesn’t address the non-normal distribution but, again, heterogeneity is typically a much larger problem than non-normality.
- arcsin transformation. See [The arcsine is asinine: the analysis of proportions in ecology] (https://esajournals.onlinelibrary.wiley.com/doi/full/10.1890/10-0340.1)
- Classic non-parametric tests such as Mann-Whitney-Wilcoxon were invented when computers were not fast enough to perform permutation tests and before the development of Generalized Linear Models (and other modern methods). Since at least 1990, there are better alternatives to non-parametric tests.

Generalized linear models that explicitly model the distribution of the response is covered in more detail in the chapter Linear models for count data – Generalized Linear Models I.

### 12.3.1 Example 1 (fig6f) – Linear models for non-normal count data

Source article: Exercise reduces inflammatory cell production and cardiovascular inflammation via instruction of hematopoietic progenitor cells

#### 12.3.1.1 Fit a Linear Model – Inference assuming Normal conditional response

**fit the model**

`<- lm(neutrophil_count ~ treatment, data = fig6f) m1 `

**check the model**

`ggcheck_the_model(m1)`

The left plot shows classic right skew. The right plot shows evidence for positive relationship between mean and variance. This suggest we fit a negative binomial or quasi-poisson model.

**inference**

```
<- emmeans(m1, specs = "treatment")
m1_emm <- contrast(m1_emm,
m1_pairs method = "trt.vs.ctrl") %>%
summary(infer = TRUE)
```

#### 12.3.1.2 Fit a Generalized Linear Model – inference assuming negative binomial conditional response

Generalized linear models for count data are covered more thoroughly in the chapter Linear models for count data – Generalized Linear Models I.

**fit the model**

```
<- glmmTMB(neutrophil_count ~ treatment,
m2 data = fig6f,
family = nbinom2(link = "log"))
```

Notes

- The function
`glmmTMB`

allows fitting from multiple distribution families and works like the`lm`

function with a few tweaks, including an argument for the distribution family.

**inference**

```
<- coef(summary(m2))$cond
m2_coef
%>%
m2_coef kable(digits = c(2,3,1,5)) %>%
kable_styling()
```

Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|

(Intercept) | 15.11 | 0.085 | 177.9 | 0.00000 |

treatmentExercise | -0.53 | 0.137 | -3.8 | 0.00013 |

Notes

- The
`coef`

function for a glmmTMB object is a list with several elements. the element \(\texttt{\$cond}\)" returns the coefficient table. - \(b_0\) (\(\texttt{(intercept)}\)) is the natural log of the mean of the reference (“sedentary”) group.
- \(b_1\) (\(\texttt{(intercept)}\)) is the difference of the natural log of the mean of exercise group mean and the natural log of the mean of the sedentary group mean.
- The coefficient values are on a
**link scale**, which is the scale of the transformation of the group means in the GLM fit. The link transformation for this fit was`log`

(the natural log transformation). In general, values on the**response scale**are more interpretable. \(exp(b_0)\) is the modeled mean of the reference group on the response scale (because the model had no covariates, it is the sample mean of the reference group). \(exp(b_1)\) is the**ratio**of the exercise to sedentary means. Remember: \(exp(log(a) - log(b)) = a/b\)

```
<- emmeans(m2,
m2_emm specs = "treatment",
type = "response")
<- contrast(m2_emm,
m2_pairs method = "trt.vs.ctrl",
type = "response") %>%
summary(infer = TRUE)
```

Notes

- The argument
`type = "response"`

tells`emmeans`

and`contrast`

to return the estimates on the response (not link) scale. See the notes for the coefficient table above for a brief explanation of the link and response scale. For more detailed explanation, see the Chapter Linear models for count data – Generalized Linear Models I

#### 12.3.1.3 Compare inference from the LM and GLM models

```
<- ggplot_the_model(
gg1
m1,
m1_emm,
m1_pairs,legend_position = "none",
y_label = "Neutrophil Count (/mL)",
effect_label = "Effect (/mL)",
palette = pal_okabe_ito_blue,
rel_heights = c(0.75,1)
)
<- ggplot_the_model(
gg2
m2,
m2_emm,
m2_pairs,legend_position = "none",
y_label = "Neutrophil Count (/mL)",
effect_label = "Relative Effect",
palette = pal_okabe_ito_blue,
rel_heights = c(0.75,1),
effect_x_lim = c(0.35, 1.25)
)
plot_grid(gg1, gg2, nrow=2, labels = "AUTO")
```

Notes

- Generalized linear models are covered more thoroughly in Chapter 17.
- The effect estimated from the GLM is not a difference but the ratio of the exercise group mean to the sedentary group mean. Both differences and ratios indicate how much bigger (or smaller) the mean of one group is compared to another but in a linear model “how much bigger” is a difference while in the glm here “how much bigger” is a multiple (or “times”) – the mean of the exercise group is .59X the mean of the sedentary group.
- The CIs for the means and for the effect are asymmetric in the GLM. The CIs from a

### 12.3.2 My data aren’t normal, what is the best practice?

- The major advantage of using a GLM for count data instead of the classical linear model is the increased power of a significance test and increased precision of estimates. And, CIs will reflect the asymmetry in the uncertainty, unlike the CIs of a classical linear model. However, researchers should gaurd against the slight increase in Type I error of a GLM.

sim | normal | type | n | theta | lm | gls | glm_nb | glm_qp | lmp | mww |
---|---|---|---|---|---|---|---|---|---|---|

1 | TRUE | type 1 | 10 | 0.05 | 0.050 | 0.076 | 0.052 | 0.053 | 0.046 | |

2 | FALSE | type 1 | 10 | 6.6 | 0.04 | 0.041 | 0.075 | 0.044 | 0.046 | 0.039 |

3 | FALSE | type 1 | 10 | 1 | 0.04 | 0.034 | 0.075 | 0.046 | 0.046 | 0.041 |

4 | FALSE | type 1 | 12, 8 | 6.6 | 0.05 | 0.049 | 0.080 | 0.048 | 0.050 | 0.046 |

5 | TRUE | power | 10 | 0.57 | 0.567 | 0.653 | 0.572 | 0.571 | 0.528 | |

6 | FALSE | power | 10 | 6.6 | 0.49 | 0.476 | 0.590 | 0.494 | 0.499 | 0.440 |

7 | FALSE | power | 10 | 1 | 0.10 | 0.082 | 0.171 | 0.113 | 0.114 | 0.088 |

8 | FALSE | power | 12, 8 | 6.6 | 0.44 | 0.501 | 0.578 | 0.452 | 0.444 | 0.438 |

9 | FALSE | power | 12, 8 | 1, 6.6 | 0.03 | 0.082 | 0.144 | 0.054 | 0.041 | 0.045 |

10 | FALSE | type 1 | 12, 8 | 6.6, 1 | 0.13 | 0.103 | 0.139 | 0.116 | 0.132 | 0.144 |

Table 12.1 summarizes the results of a simulation to compare the performance of alternatives to a *t*-test. The simulated data were modeled to look like the Figure 6f (Figure 12.1). The methods are the classical linear model (lm) (equivalent to the *t*-test), GLS linear model (gls), negative binomial GLM (glm-nb), quasi-poisson GLM (glm-qp), permutation test (lmp), and Mann-Whitney-Wilcoxon test. Performance is based on the the *p*-values from the test. The frequency of \(p < 0.05\) is given in the column for each method. If the simulation was run with no treatment effect, this frequency is the simulated Type I error, given the nominal error of \(\alpha = 0.05\). Values less then 0.05 are conservative and values greater than 0.05 are liberal. If the simulation was run with a treatment effect, the frequency is the simulated power of the test.

Over the limited parameter space of the non-normal distribution (the shape of the count distribution including the relationship between mean and variance, the sample size, the effect size), the negative binomial GLM has substantially higher power than the classical linear model, although this comes at some cost to inflated Type I error. Even with a Normal simulated distribution, the negative binomial GLM has higher power than the classical linear model, but again, at some cost to Type I error.

In the simulations with a negative binomial distributed response, the quasipoisson GLM has consistently higher power than the classical linear model and excellent Type I error control. The performance of the permutation test is very similar to that of the quasipoisson GLM. In the simulations with a negative binomial distributed response, the GLS linear model has about the same power as the classical linear model. Perhaps surprisingly, the Mann-Whitney-Wilcoxon fails to outperform the classical linear model and even underperforms the classical linear in some parts of the parameter space.