# Chapter 20 Linear models for count data – Generalized Linear Models I

Biologists frequently count stuff, and design experiments to estimate the effects of different factors on these counts. Count data can cause numerous problems with linear models that assume a normal, conditional distribution, including 1) counts are discrete, and can be zero or positive integers only, 2) counts tend to bunch up on the small side of the range, creating a distribution with a positive skew, 3) a sample of counts can have an abundance of zeros, and 4) the variance of counts increases with the mean (see Figure 20.1 for some of these properties). Some count data can be approximated by and reasonably modeled with a normal distribution. More often, count data should modeled with a Poisson distribution or negative binomial distribution using a generalized linear model. Poisson and negative binomial distributions are **discrete probability distributions** with two important properties: 1) the distribution contains only zero and positive integers and 2) the variance is a function of the mean. Back before modern computing and fast processors, count data were often analyzed by either **transforming** the response or by **non-parametric hypothesis tests**. One reason to prefer a statistical modeling approach with a GLM is that we can get interpretable parameter estimates. By contrast, both the analysis of transformed data and non-parametric hypothesis tests are really tools for computing “correct” *p*-values.

## 20.1 The Generalized Linear Model (GLM)

As outlined in Two specifications of a linear model, a common way that biological researchers are taught to think about a response variable is

\[ response = expected + error \]

or, using the notation of this text,

\[ \begin{align} y &= \beta_0 + \beta_1 \texttt{treatment} + \varepsilon \\ \varepsilon &\sim \operatorname{Normal}(0, \sigma^2) \tag{20.1} \end{align} \]

That is, we can think of a response as the sum of some systematic part (\(\beta_0 + \beta_1 \texttt{treatment}\)) and a stochastic (“random error”) part (\(\varepsilon\)), where the stochastic part is a random draw from a normal distribution with mean zero and variance \(\sigma^2\). This way of thinking about the generation of the response is useful for linear models, and model checking linear models, but is much less useful for thinking about generalized linear models or model checking generalized liner models. For example, if we want to model the number of angiogenic sprouts in response to some combination of GAS6 and treatment using a Poisson distribution, the following is the **wrong** way to think about the statistical model

\[ \begin{align} \texttt{sprouts} = &\ \beta_0 + \beta_1 \texttt{treatment}_{\texttt{GAS6}} + \beta_2 \texttt{genotype}_{\texttt{FAK_ko}} + \\ &\ \beta_3 \texttt{treatment}_{\texttt{GAS6}}:\texttt{genotype}_{\texttt{FAK_ko}} + \varepsilon_i\\ \varepsilon \sim &\ \operatorname{Poisson}(\lambda) \tag{20.2} \end{align} \] That is, we should not think of a count as the sum of a systematic part and a random draw from a Poisson distribution. Why? Because it is the counts, conditional on the \(\treattt{treatment\) and \(\treattt{genotype\), that are poisson distributed, not the residuals from the fit model.

Thinking about the distribution of count data using model (20.2) leads to absurd consequences. For example, if we set the mean of the Poisson “error” to zero (like with a normal distribution), then the error term for every observation would *have to* be zero (because the only way to get a mean of zero with non-negative integers is if every value is zero). Or, if the study is modeling the effect of a treatment on the counts (that is, the \(X\) are dummy variables) then \(\beta_0\) is the expected mean count of the control (or reference) group. But if we add non-zero Poisson error to this, then the mean of the control group would be larger than \(\beta_0\). This doesn’t make sense. And finally, equation (20.2) generates a continuous response, instead of an integer, because \(\beta_0\) and \(\beta_1\) are continuous.

A better way to think about the data generation for a linear model that naturally leads to the *correct* way to think about data generation for a generalized linear model, is

\[ \begin{align} y_i &\sim N(\mu_i, \sigma^2)\\ \mathrm{E}(y_i| \texttt{treatment}) &= \mu_i\\ \mu_i &= \beta_0 + \beta_1 \texttt{treatment}_i \tag{20.3} \end{align} \]

That is, a response is a random draw from a normal distribution with mean \(mu\) (not zero!) and variance \(\sigma^2\). Line 1 is the stochastic part of this specification. Line 3 is the systematic part.

The specification of a generalized linear model has both stochastic and systematic parts but adds a third part, which is a **link function** connecting the stochastic and systematic parts.

**The stochastic part**, which is a probability distribution from the exponential family (this is sometimes called the “random part”) \[ \begin{equation} y_i \sim \operatorname{Prob}(\mu_i) \end{equation} \]**the systematic part**, which is a linear predictor (I like to think about this as the deterministic part) \[ \begin{equation} \eta = \beta_0 + \beta_1 \texttt{treatment}_i \end{equation} \]- a
**link function**connecting the two parts \[ \begin{equation} \eta_i = g(\mu_i) \end{equation} \]

\(\mu\) (the Greek symbol mu) is the conditional mean (or expectation \(\mathrm{E}(Y|X)\)) of the response on the **response scale** and \(\eta\) (the Greek symbol eta) is the conditional mean of the response on the **link scale**. The response scale is the scale of the raw measurements and has units of the raw measurements. The link scale is the scale of the transformed mean. A GLM models the response with a distribution specified in the stochastic part. The probability distributions introduced in this chapter are the Poisson and Negative Binomial. The natural link function, and default link function in R, for the Poisson and Negative Binomial is the “log link,” \(\eta = log(\mu)\). More generally, while each distribution has a natural (or, “canonical”) link function, one can use alternatives. Given this definition of a generalized linear model, a linear model is a GLM with a normal distribution and an **Identity** link (\(\eta = \mu\)).

\[ \begin{align} y_i &\sim \operatorname{Normal}(\mu_i, \sigma^2)\\ \mathrm{E}(y_i| \texttt{treatment}) &= \mu_i\\ \eta &= \beta_0 + \beta_1 \texttt{treatment}\\ \mu_i &= \eta_i \end{align} \]

Think about the link function and GLM more generally like this: the GLM constructs a linear model that predicts the conditional means *on the link scale*. If the model uses a “log link” (\(\eta_i = \mathrm{log}(\mu_i)\)), then \(\eta_i\) – the conditional mean on the link scale – is the *log of the modeled mean on the response scale*. The modeled mean on the response scale is the inverse link function.

\[ \begin{equation} \mu_i = g^{-1}(\eta_i) \end{equation} \]

For a log link, a modeled mean is \(\mathrm{exp}(\eta_i)\))$.

Importantly, in a GLM, the individual data values *are not transformed*. A GLM with a log link is not the same as a linear model on log transformed data.

When modeling counts using the Poisson or negative binomial distributions with a log link, the link scale is linear, and so the effects are additive on the link scale, while the response scale is nonlinear (it is the exponent of the link scale), and so the *effects are multiplicative on the response scale*. If this doesn’t make sense now, an example is worked out below. The inverse of the link function backtransforms the parameters from the link scale back to the response scale. So, for example, a prediction on the response sale is \(\mathrm{exp}(\hat{\eta})\) and a coefficient on the response scale is \(\mathrm{exp}(b_j)\).

## 20.2 Kinds of experimental biology data that are modeled by a GLM

- If a response is a count then use a Poisson, quasi-Poisson or negative binomial family with a log link.
- If a response is binary response, for example, presence/absence or success/failure or survive/die, then use the binomial family with a logistic link (other link functions are also useful). This is classically known as
**logistic regression**. - If a response is a fraction that is a ratio of counts, for example the fraction of total cells that express some marker, then use the binomial family with a logistic link. Think of each count as a “success.”
- If a response is a fraction of counts per “effort” (cells per area or volume or time), then use a Poisson, quasi-Poisson or negative binomial family with a log link on the raw count and include the measure of effort as an offset.
- If a response is continuous but the variance increases with mean then use the gamma family
- If a response is a fraction of a continuous measure per “effort” (tumor area per total area or volume or time), then use a gamma family on the raw measure and include the measure of effort as an offset.

## 20.3 Example 1 – A generalized linear model explainer (“angiogenic sprouts” exp3a)

### 20.3.1 Understand the data

The researchers designed a set of experiments to investigate the effects of pericyte derived FAK (focal adhesion kinase, a protein tyrosine kinase) on the Gas6/Axl pathway regulating angiogenesis promoting tumor growth. Pericytes are cells immediately deep to the endothelium (the epithelial lining) of the smallest blood vessels, including capillaries, arterioles, and venules. Angiogenesis is the growth of new blood vessels. GAS6 (growth arrest specific 6) is a protein commonly expressed in tumors.

The example data is from the experiment for Figure 3a. The design is \(2 \times 2\) – two factors (\(\texttt{treatment}\) and \(\texttt{genotype}\)) each with two levels.

- Factor 1: \(\texttt{treatment}\)

- reference level: “PBS.” Phosphate buffered saline added to tissue. This is the control treatment.
- treatment level: “GAS6.” Added to tissue in solution. The experiment is designed to test its effect on promoting angiogenesis in the development of tumors.

- Factor 2: \(\texttt{genotype}\)

- reference level: “FAK_wt.” The functional genotype.
- treatment level: “FAK_ko.” Tissue-specific FAK deletion in pericytes. The experiment is designed to test the effect of pericyte-derived FAK on slowing angiogenesis in the development of tumors. If true, then its deletion should result in increased tumor development.

The four treatment by genotype combinations are

- Control (“PBS FAK_wt”) – negative control
- FAK_ko (“PBS FAK_ko”) – PBS control. Unknown response given deletion of putative angiogenesis inhibitor but no added putative angiogenesis promotor
- GAS6 (“GAS6 FAK_wt”) – GAS6 control. GAS6 expected to promote angiogenesis but this the response is expected to be inhibited by some amount by FAK
- GAS6+FAK_ko (“GAS6 FAK_ko”) – focal treatment. Expected positive angiogenesis

The planned contrasts are

- (PBS FAK_ko) - (PBS FAK_wt) – the effect of FAK deletion given the control treatment
- (GAS6 FAK_ko) - (GAS6 FAK_wt) – the effect of FAK deletion given the GAS6 treatment
- ((PBS FAK_ko) - (PBS FAK_wt)) - ((GAS6 FAK_ko) - (GAS6 FAK_wt)). The interaction effect giving the effect of the combined treatment relative to the individual effects.

### 20.3.2 Model fit and inference

#### 20.3.2.1 Fit the model

```
<- lm(sprouts ~ treatment * genotype, data = exp3a)
exp3a_m1 <- glm(sprouts ~ treatment * genotype,
exp3a_m2 family = "poisson",
data = exp3a)
<- glm.nb(sprouts ~ treatment * genotype,
exp3a_m3 data = exp3a)
```

#### 20.3.2.2 Check the linear model

`ggcheck_the_model(exp3a_m1)`

Notes

- left panel shows classic right skew conditional distribution with larger values much larger than expected with a normal distribution.
- right panel shows heterogeneity and specifically the variance increasing with the mean.

#### 20.3.2.3 Check the poisson model

**Check shape and homogeneity**

```
# from the DHARMa package
<- simulateResiduals(fittedModel = exp3a_m2, n = 250)
exp3a_m2_simulation plot(exp3a_m2_simulation, asFactor = FALSE)
```

Notes

- poisson glm fails to generated scaled residuals approximating uniform distribution.

**Check dispersion**

```
<- simulateResiduals(fittedModel = exp3a_m2,
exp3a_m2_simulation_refit n = 250,
refit = TRUE)
<- testDispersion(exp3a_m2_simulation_refit) exp3a_m2_test_dispersion
```

Notes

- large overdispersion

**Check zero inflation**

`<- testZeroInflation(exp3a_m2_simulation_refit) exp3a_m2_test_zi `

Notes

- The data has too many zeros relative to the expected number from a poisson GLM.

#### 20.3.2.4 Check the negative binomial model

**Check shape and homogeneity**

```
# from the DHARMa package
<- simulateResiduals(fittedModel = exp3a_m3,
exp3a_m3_simulation n = 250)
plot(exp3a_m3_simulation, asFactor = FALSE)
```

Notes

- uniform q-q for negative binomial GLM looks good.
- spread-location plot looks good.

**Check dispersion**

```
<- simulateResiduals(fittedModel = exp3a_m3,
exp3a_m3_simulation_refit n = 250,
refit = TRUE)
<- testDispersion(exp3a_m3_simulation_refit) exp3a_m3_test_dispersion
```

Notes

- good

**Check zero inflation**

`<- testZeroInflation(exp3a_m3_simulation_refit) exp3a_m3_test_zi `

#### 20.3.2.5 Inference from the model

```
<- cbind(coef(summary(exp3a_m3)),
exp3a_m3_coef confint(exp3a_m3))
```

Estimate | Std. Error | z value | Pr(>|z|) | 2.5 % | 97.5 % | |
---|---|---|---|---|---|---|

(Intercept) | 0.90 | 0.231 | 3.9 | 0.000 | 0.45 | 1.36 |

treatmentGAS6 | 1.30 | 0.287 | 4.5 | 0.000 | 0.74 | 1.86 |

genotypeFAK_ko | 0.04 | 0.357 | 0.1 | 0.900 | -0.65 | 0.75 |

treatmentGAS6:genotypeFAK_ko | 0.46 | 0.425 | 1.1 | 0.275 | -0.37 | 1.30 |

```
<- emmeans(exp3a_m3,
exp3a_m3_emm specs = c("treatment", "genotype"),
type="response")
```

treatment | genotype | response | SE | df | asymp.LCL | asymp.UCL |
---|---|---|---|---|---|---|

PBS | FAK_wt | 2.47 | 0.6 | Inf | 1.57 | 3.88 |

GAS6 | FAK_wt | 9.05 | 1.5 | Inf | 6.48 | 12.64 |

PBS | FAK_ko | 2.58 | 0.7 | Inf | 1.52 | 4.40 |

GAS6 | FAK_ko | 15.04 | 2.4 | Inf | 11.06 | 20.46 |

```
# exp3a_m3_emm # print in console to get row numbers
# set the mean as the row number from the emmeans table
<- c(1,0,0,0)
pbs_fak_wt <- c(0,1,0,0)
gas6_fak_wt <- c(0,0,1,0)
pbs_fak_ko <- c(0,0,0,1)
gas6_fak_ko
#1. (PBS FAK_ko) - (PBS FAK_wt)
#2. (GAS6 FAK_ko) - (GAS6 FAK_wt)
#3. ((PBS FAK_ko) - (PBS FAK_wt)) - (GAS6 FAK_ko) - (GAS6 FAK_wt).
<- contrast(
exp3a_m3_planned
exp3a_m3_emm,method = list(
"(PBS FAK_ko) - (PBS FAK_wt)" = c(pbs_fak_ko - pbs_fak_wt),
"(GAS6 FAK_ko) - (GAS6 FAK_wt)" = c(gas6_fak_ko - gas6_fak_wt),
"Interaction" = c(gas6_fak_ko - gas6_fak_wt) -
c(pbs_fak_ko - pbs_fak_wt)
),adjust = "none"
%>%
) summary(infer = TRUE)
```

contrast | ratio | SE | df | asymp.LCL | asymp.UCL | z.ratio | p.value |
---|---|---|---|---|---|---|---|

(PBS FAK_ko) / (PBS FAK_wt) | 1.05 | 0.373 | Inf | 0.52 | 2.10 | 0.13 | 0.90040 |

(GAS6 FAK_ko) / (GAS6 FAK_wt) | 1.66 | 0.385 | Inf | 1.06 | 2.62 | 2.19 | 0.02820 |

Interaction | 1.59 | 0.676 | Inf | 0.69 | 3.66 | 1.09 | 0.27535 |

Notes

#### 20.3.2.6 Plot the model

#### 20.3.2.7 Alternaplot the model

## 20.4 Understanding Example 1

### 20.4.1 Modeling strategy

Instead of testing assumptions of a model using formal hypothesis tests *before* fitting the model, a better strategy is to 1) fit one or more models based on initial evaluation of the data, and then do 2) **model checking** using **diagnostic plots**, diagnostic statistics, and simulation (see Section All statistical analyses should be followed by model checking).

For the exp3a data, I fit a linear model, a Poisson GLM, and a negative binomial GLM. I use the diagnostic plots and statistics to help me decide which model to report.

### 20.4.2 Model checking fits to count data

We use the fit models to check

- the compatibility between the quantiles of the observed residuals and the distribution of expected quantiles from the family in the model fit
- if the observed distribution is over or under dispersed
- if there are more zeros than expected by the theoretical distribution. If so, the observed distribution is
**zero-inflated**

#### 20.4.2.1 Checking the linear model exp3a_m1 – a Normal-QQ plot

Figure **??**A shows a histogram of the residuals from the fit linear model. The plot shows that the residuals seem to be clumped at the negative end of the range, which suggests that a model with a normally distributed conditional outcome (or normal error) is not well approximated.

A better way to investigate this is with the **Normal-QQ** plot in Figure **??**B, which plots the sample quantiles for a variable against their theoretical quantiles. If the conditional outcome approximates a normal distribution,

- the points should roughly follow the robust regression line,
- the points should be largely inside the 95% CI (gray) bounds for sampling a normal distribution with the variance estimated by the model, and
- the robust regression line should be largely inside the CI bounds.

For the sprout data, the points are above the line at the positive end, hug the upper bound of the 95% CI at the negative end, are well above the 95% CI at the positive end, and the robust regression is distinctly shallower than a line bisecting the CI bounds. At the left (negative) end, the observed values are more positive than the theoretical values. Remembering that this plot is of residuals, if we think about this as counts, this means that our smallest counts are not as small as we would expect given the mean, the variance, and a normal distribution. This shouldn’t be surprising – the counts range down to zero and counts cannot be below zero. At the positive end, the sample values are again more positive than the theoretical values. Thinking about this as counts, this means that are largest counts are larger than expected given the mean, the variance, and a normal distribution. This pattern is what we’d expect of count data.

#### 20.4.2.2 Checking the linear model exp3a_m1 – Spread-level plot for checking homoskedasticity

A linear model also assumes that the error is **homoskedastic** – the error variance is not a function of the value of the \(X\) variables). Non-homoskedastic error is **heteroskedastic**. I will typically use “homogenous variance” and “heterogenous variance” since these terms are more familiar to biologists. The fit model can be checked for homogeneity using a spread-level (also known as a scale-location) plot, which comes in several forms. I like a spread-level plot that is a scatterplot of the positive square-root of the standardized residuals against the fitted values (remember that the fitted values are the values computed by the linear predictor of the model – they are the “predicted values” of the observed data). If the residuals approximate a normal distribution, then a regression line through the scatter should be close to horizontal. The regression line in the spread-level plot of the fit of the linear model to the sprout data shows a distinct increase in the “scale” (the square root of the standardized residuals) with increased fitted value, which is expected of data sampled from a distribution in which the variance increases the mean.

#### 20.4.2.3 Two distributions for count data – Poisson and Negative Binomial

The pattern in the Normal-QQ plot in Figure **??**B should discourage a researcher from modeling the data with a normal distribution and instead model the data with an alternative distribution using a Generalized Linear Model. There is no unique mapping between observed data and a data generating mechanism with a specific distribution, so this decision is not as easy as thinking about the data generation mechanism and then simply choosing the “correct” distribution. Section 4.5 in Bolker (xxx) is an excellent summary of how to think about the generating processes for different distributions in the context of ecological data. Since the response in the angiogenic sprouts data are counts, we need to choose a distribution that generates integer values, such as the Poisson or the negative binomial.

- Poisson – A Poisson distribution is the probability distribution of the number of occurrences of some thing (a white blood cell, a tumor, or a specific mRNA transcript) generated by a process that generates the thing at a constant rate per unit effort (duration or space). This constant rate is the parameter \(\lambda\), which is the expectation (the expected mean of the counts), so \(\mathrm{E}(Y) = \mu = \lambda\). Because the rate per effort is constant,
*the variance of a Poisson variable equals the mean*, \(\sigma^2 = \mu = \lambda\). Figure**??**shows three samples from a Poisson distribution with \(\lambda\) set to 1, 5, and 10. The plots show that, as the mean count (\(\lambda\)) moves away from zero, a Poisson distribution 1) becomes less skewed and more closely approximates a normal distribution and 2) has an increasingly low probability of including zero (less than 1% zeros when the mean is 5).

A Poisson distribution, then, is useful for count data in which the conditional variance is close to the conditional mean. Very often, biological count data are not well approximated by a Poisson distribution because the variance is either less than the mean, an example of **underdispersion**^{5}, or greater than the mean, an example of **overdispersion**^{6}. A useful distribution for count data with overdispersion is the negative binomial.

- Negative Binomial – The negative binomial distribution is a discrete probability distribution of the number of successes that occur before a specified number of failures \(k\) given a probability
*p*of success. This isn’t a very useful way of thinking about modeling count data in biology. What is useful is that the Negative Binomial distribution can be used simply as way of modeling an “overdispersed” Poisson process. Using the parameterization in the`MASS::glm.nb`

function, the mean of a negative binomial variable is \(\mu\) and the variance is \(\sigma^2 = \mu + \frac{\mu^2}{\theta}\). As a method for modeling an overdispersed Poisson variable, \(\theta\) functions as a **dispersion parameter* controlling the amount of overdispersion and can be any real, positive value (not simply a positive integer), including values less than 1. As \(\theta\) approaches positive infinity, the “overdispersion” bit \(\frac{\mu^2}{\theta}\) goes to zero and the variance goes to \(\mu\), which is the same as the Poisson.

#### 20.4.2.4 Model checking a GLM I – the quantile-residual uniform-QQ plot

Normal-QQ plots were introduced in Section @ref{normal-qq} of the Model Checking chapter and applied to the linear model fit of the angiogenic sprout data in Section 20.4.2.1 above. We cannot use a Normal-QQ plot with a Poisson or negative binomial GLM fit because the residuals from this fit are not expected to be normally distributed. An alternative to a Normal-QQ plot for a GLM fit is a **quantile-residual uniform-QQ** plot of observed **quantile residuals**.

Notes

- The x-axis (“Expected”) contains the expected quantiles from a uniform distribution.
- The y-axis (“Observed”) contains the observed quantile residuals from a GLM fit, which are the residuals from the fit model that are transformed in a way that the expected distribution is uniform under the fit model family. This means that we’d expect the quantile residuals to closely approximate the expected quantiles from a uniform distribution. If the approximation is close, the points will fall along the “y = x” line in the plot.
- The gray shaded area is a 95% confidence interval computed using a parametric bootstrap. At any value of the expected quantile, the interval will include an observed quantile 95% of the time. This gray area gives us a sense of the variability we’d get when we fit models to random samples from the specified model.
- In the quantile-residual QQ plot for Model
`exp3a_m2`

, the observed residuals are far outside the 95% boundary. The observed residuals are smaller than expected at the negative (left) end and larger than expected at the right (high) end. This means the residuals are more spread out than expected for a Poisson sample. The data are overdispersed*for this model*. Understand that overdispersion is not a property of data but of the residuals from a specific model fit to the data.

**Misconceivable** – A common misconception is that if the distribution of the response approximates a Poisson distribution, then the residuals of a GLM fit with a Poisson distribution should be normally distributed, which could then be checked with a Normal-QQ plot, and homoskedastic, which could be checked with a scale-location plot. Neither of these is true because a GLM does not transform the data and, in fact, the model definition does not specify anything about the distribution of an “error” term – there is no \(\varepsilon\) in the model definition above! This is why thinking about the definition of a linear model by specifying an error term with a normal distribution can be confusing and lead to misconceptions when learning GLMs.

#### 20.4.2.5 Model checking a GLM II – Spread-level plot for checking homoskedasticity

Notes

#### 20.4.2.6 Model checking a GLM III – Checking dispersion

Notes

- This plot is a histogram of the sum of squared
**Pearson residuals**of fake data sampled from the fit model. Pearson residuals are the raw residuals divided by the square root of the fitted value. Remember that in the Poisson distribution, the variance is equal to the expectation (mean), so a Pearson residual is the raw residual divided by the standard deviation of the residual. A way to think about this is, Pearson residuals “correct” for the heterogeneity in variance that arises among groups with different mean counts. - The sum of squared
**Pearson residuals**is a measure of the dispersion of the residuals. - The red line is the observed sum of the squared Pearson residuals of the fit model.
- If the observed dispersion approximates that expected from sampling from the fit model, the red line will be within the histogram.
- The red line here is far larger than expected given the histogram, which indicates that the residuals are overdispersed given the fit model.
- Overdispersion will be common with Poisson GLM fits to biological data.

#### 20.4.2.7 Model Checking a count GLM – Check zero inflation

Counts can have the value zero. Data that have more zeros than expected given a fit count GLM model (Poisson, quasi-Poisson, negative binomial) is **zero-inflated**.

`<- testZeroInflation(simulation_output) zero_inflation_test `

- This plot is a histogram of the number of zeros in each of the fake data sets generated by the fit model. An observed number of zeros at the extremes of this distribution are unlikely given the fit model. The number of zeros in the observed data is greater than expected by the model.

### 20.4.3 Biological count data are rarely fit well by a Poisson GLM. Instead, fit a quasi-poisson or negative binomial GLM model.

Here are the diagnostic plots of the negative binomial GLM fit to the exp3a data

### 20.4.4 A GLM is a linear model on the link scale

The negative-binomial GLM fit to the angiogenic sprout data (exp3a) is

\[ \begin{align} \texttt{sprouts}_i &\sim \operatorname{NB}(\mu_i, \theta)\\ \mathrm{E}(\texttt{sprouts}\ | \ \texttt{treatment, genotype}) &= \mu\\ \mu_i &= \mathrm{exp}(\eta_i)\\ \eta_i &= \beta_0 \ + \\ & \quad \; \beta_1 \texttt{treatment}_{\texttt{GAS6}} \ + \\ & \quad \; \beta_2 \texttt{genotype}_{\texttt{FAK_ko}} \ + \\ & \quad \; \beta_3 \texttt{treatment}_{\texttt{GAS6}}:\texttt{genotype}_{\texttt{FAK_ko}} \tag{20.4} \end{align} \]

- The first line of Model (20.4) is the stochastic part stating the response is modeled as a random Negative Binomial variable with conditional mean \(\mu_i\), and variance \(\mu + \frac{\mu^2}{\theta}\). Fitting the model to the data estimates \(\mu_i\) for all
*i*. \(\mu_i\) will be the same for all mice within each treatment by genotype combination because they share the same conditions (the values of the indicator variables for treatment, genotype, and their interaction). - The second line states the \(\mu\) is the mean conditional on the value of \(\texttt{treatment}\) and \(\texttt{genotype}\)
- The third line connects the conditional mean on the link scale (\(\eta\)) with the conditional mean on the response scale (\(\mu\)). This is the backtransformation.
- The fourth line is the
**linear predictor**– it is a linear model on the link scale. The linear predictor includes three indicator variables. - Remember that a linear model is a model in which the coefficients are additive, meaning that the coefficients do not have exponents or are not multiplied by each other.

### 20.4.5 Coeffecients of a Generalized Linear Model with a log-link function are on the link scale.

The coefficients of the fit negative binomial model are

Estimate | Std. Error | z value | Pr(>|z|) | 2.5 % | 97.5 % | |
---|---|---|---|---|---|---|

(Intercept) | 0.90446 | 0.231 | 3.9 | 0.000 | 0.45 | 1.36 |

treatmentGAS6 | 1.29805 | 0.287 | 4.5 | 0.000 | 0.74 | 1.86 |

genotypeFAK_ko | 0.04462 | 0.357 | 0.1 | 0.900 | -0.65 | 0.75 |

treatmentGAS6:genotypeFAK_ko | 0.46382 | 0.425 | 1.1 | 0.275 | -0.37 | 1.30 |

Notes

- The coefficients are on the link scale, which is a linear (additive) scale.

#### 20.4.5.1 The intercept of a Generalized Linear Model with a log-link function is the mean of the reference on the link scale

- In the linear model
`exp3a_m1`

, the intercept is the modeled mean of the reference group. In a GLM with a log-link (including Models`exp3a_m2`

and`exp3a_m3`

), the intercept is the mean of the reference group*on the link scale*. The modeled mean of the reference on the response scale is computed by the backtransformaiton \(\exp(b_1)\). The function “exp(x)” is the exponent, which is often written using the notation \(e^{x}\). - The transformation between link scale and response scale is part of the specification of a Generalized Linear Model. For the negative binomial GLM fit to the exp3a data, this specification is in line 3 of the specification (Model (20.4)).

Compare the computation here with the modeled mean in the emmeans table (Section 20.3.2.5 or with a computation of the sample mean.

```
<- coef(exp3a_m3)[1]
b1 exp(b1)
```

```
## (Intercept)
## 2.470588
```

`mean(exp3a[t_by_g == "PBS FAK_wt", sprouts])`

`## [1] 2.470588`

#### 20.4.5.2 The coefficients of the indicator variables of a Generalized Linear Model with a log-link function are effect-ratios on the response scale

- In the linear model
`exp3a_m1`

, the coefficient of an indicator variable is a difference in means. In a GLM with a log-link, the coefficient of an indicator variable is the difference (group minus reference) of the modeled means*on the link scale*. Let’s check this with some computations.

The modeled means on the link scale are in the emmeans table

```
<- emmeans(exp3a_m3,
exp3a_m3_emm_link specs = c("treatment", "genotype")) %>%
summary() %>%
data.table()
```

Using this table, the difference between the link-scale mean of “GAS6 FAK_wt” and “PBS FAK_wt” (the reference) is

```
== "GAS6" &
exp3a_m3_emm_link[treatment == "FAK_wt", emmean] -
genotype == "PBS" & genotype == "FAK_wt", emmean] exp3a_m3_emm_link[treatment
```

`## [1] 1.298045`

And the coefficient for the \(\texttt{treatment_gas6}\) indicator variable is

```
# coefficient for treatment_gas6
<- coef(exp3a_m3)[2]
b2 b2
```

```
## treatmentGAS6
## 1.298045
```

- In a GLM with a log-link, the exponent of the coefficient of an indicator variable is the ratio of the modeled means
*on the response scale*. Let’s check this with some computations.

The modeled means on the response scale are in the emmeans table

```
<- emmeans(exp3a_m3,
exp3a_m3_emm_response specs = c("treatment", "genotype"),
type = "response") %>%
summary() %>%
data.table()
```

Using this table, the ratio of the response-scale mean of “GAS6 FAK_wt” to the response-scale mean of “PBS FAK_wt” (the reference) is

```
== "GAS6" &
exp3a_m3_emm_response[treatment == "FAK_wt", response]/
genotype == "PBS" & genotype == "FAK_wt", response] exp3a_m3_emm_response[treatment
```

`## [1] 3.662132`

And the coefficient for the \(\texttt{treatment_gas6}\) indicator variable backtransformed to the response scale is

```
<- exp(b2)
exp_b2 exp_b2
```

```
## treatmentGAS6
## 3.662132
```

- Since this backtransformed coefficient is both an effect and a ratio, I call it an
**effect ratio**. It’s value is how many times bigger (or smaller if less than one) the non-reference response is relative to the reference response. The response of the GAS6 treatment is 3.66\(\times\) that of the reference treatment.

### 20.4.6 Modeled means in the emmeans table of a Generalized Linear Model can be on the link scale or response scale.

#### 20.4.6.1 The emmeans table containing the statistics on the link scale

treatment | genotype | emmean | SE | df | asymp.LCL | asymp.UCL |
---|---|---|---|---|---|---|

PBS | FAK_wt | 0.90446 | 0.23 | Inf | 0.45 | 1.36 |

GAS6 | FAK_wt | 2.20250 | 0.17 | Inf | 1.87 | 2.54 |

PBS | FAK_ko | 0.94908 | 0.27 | Inf | 0.42 | 1.48 |

GAS6 | FAK_ko | 2.71094 | 0.16 | Inf | 2.40 | 3.02 |

- The values are the statistics on the link scale. The mean and CI, but not the SE, can be meaningfully backtransformed to the response scale using the exponent function to get the mean and CI on the response scale.
- The CIs are “asymptotic,” meaning they are computed using infinite degrees of freedom. The consequences is that the CIs will be too narrow, especially for small
*n*. - Check the math! Given asymptotic CIs, the lower CI should be (mean - 1.96 \(\times\) SE) and the upper should be (mean + 1.96 \(\times\) SE).

`0.90446 - 1.96*0.23`

`## [1] 0.45366`

- A GLM is linear on the link scale. This means the model is additive on the link scale. Modeled means on the link scale are computed by
*adding*the coefficients.

```
<- coef(exp3a_m3) # the model coefficients
b
# the way I typically compute these
<- b[1]
pbs_fak_wt_link <- b[1] + b[2]
gas6_fak_wt_link <- b[1] + b[3]
pbs_fak_ko_link <- b[1] + b[2] + b[3] + b[4]
gas6_fak_ko_link
# but this is the algebra more consistent with the linear model math
<- b[1] + b[2]*0 + b[3]*0 + b[4]*0
pbs_fak_wt_link <- b[1] + b[2]*1 + b[3]*0 + b[4]*0
gas6_fak_wt_link <- b[1] + b[2]*0 + b[3]*1 + b[4]*0
pbs_fak_ko_link <- b[1] + b[2]*1 + b[3]*1 + b[4]*1
gas6_fak_ko_link
# combine into a table
<- data.table(
exp_3a_means group = t_by_g_levels,
"mean (link scale)" = c(pbs_fak_wt_link,
gas6_fak_wt_link,
pbs_fak_ko_link,
gas6_fak_ko_link) )
```

group | mean (link scale) |
---|---|

PBS FAK_wt | 0.90446 |

GAS6 FAK_wt | 2.20250 |

PBS FAK_ko | 0.94908 |

GAS6 FAK_ko | 2.71094 |

Compare the values here to those in the emmeans table on the link scale (Table @ref(tab:exp3a_m3_emm_link)).

#### 20.4.6.2 The emmeans table containing the statistics on the response scale

treatment | genotype | response | SE | df | asymp.LCL | asymp.UCL |
---|---|---|---|---|---|---|

PBS | FAK_wt | 2.47059 | 0.57 | Inf | 1.57 | 3.88 |

GAS6 | FAK_wt | 9.04762 | 1.54 | Inf | 6.48 | 12.64 |

PBS | FAK_ko | 2.58333 | 0.70 | Inf | 1.52 | 4.40 |

GAS6 | FAK_ko | 15.04348 | 2.36 | Inf | 11.06 | 20.46 |

- The values in the column “response” are the modeled means on the response scale. These values are the exponent of the values in the “emmean” column of the emmeans table on the link scale.
- The values in the columns “asymp.LCL” and “asymp.UCL” are the 95% confidence intervals on the response scale. These values are the exponent of the values in the same columns of the emmeans table on the link scale.
- Don’t do additive math on the response scale! Remember, a GLM is linear on the link scale. The CIs on the response scale are not the mean plus or minus 1.96 \(\times\) SE!

`2.47059 - 1.96*0.57`

`## [1] 1.35339`

`# oops`

- Modeled means on the response scale are computed by backtransforming the modeled means on the link scale. Since the fit model used a log-link, the backtransformation is the exponent.

```
<- exp(pbs_fak_wt_link)
pbs_fak_wt_response <- exp(gas6_fak_wt_link)
gas6_fak_wt_response <- exp(pbs_fak_ko_link)
pbs_fak_ko_response <- exp(gas6_fak_ko_link)
gas6_fak_ko_response
# add a column to the exp_3a_means data.table
"mean (response scale)" := c(pbs_fak_wt_response,
exp_3a_means[,
gas6_fak_wt_response,
pbs_fak_ko_response, gas6_fak_ko_response)]
```

group | mean (link scale) | mean (response scale) |
---|---|---|

PBS FAK_wt | 0.90446 | 2.47059 |

GAS6 FAK_wt | 2.20250 | 9.04762 |

PBS FAK_ko | 0.94908 | 2.58333 |

GAS6 FAK_ko | 2.71094 | 15.04348 |

Compare the values here to those in the emmeans table on the response scale (Table @ref(tab:exp3a_m3_emm_response)).

### 20.4.7 Some consequences of fitting a linear model to count data

#### 20.4.7.1 One – linear models can make absurd predictions

`plot_grid(gg1, gg2, gg3, ncol=3, labels = "AUTO")`

Notes

- A prediction interval is a confidence interval of a prediction – using the fit model to predict future responses given the same conditions (here, assignment to one of the four different treatment combinations).
- Left panel: The prediction intervals from the linear model imply that negative sprouts could be sampled. This is absurd.
- Middle panel: The fit linear model is used to make 100 fake predictions in each group.
- Right panel: The fit negative binomial GLM is used to make 100 fake predictions in each group. Nothing absurd here.

#### 20.4.7.2 Two – linear models can perform surprisingly well if one is only interested in *p*-values

*P*-values are a function of the sampling distribution of group means and differences in means, and, due to the magic of the central limit theorem, linear models fit to count data perform surprisingly well in the sense of

- Type I error that approximates the nominal value
- Reasonable power compared to GLM models and many non-parametric tests.

## 20.5 Working in R

### 20.5.1 Fitting GLMs to count data

The poisson family is specified with the base R `glm()`

function. For negative binomial, use `glm.nb`

from the MASS package

```
# poisson - less likely to fit to real biological data well
# because of overdispersion
<- glm(y ~ treatment, family = "poisson", data = dt)
fit
# two alternatives to overdispersed poisson fit
# quasipoisson
<- glm(y ~ treatment, family = "quasipoisson", data=dt)
fit
# note that "family" is not an argument since this function is used only to fit a negative binomial distribution!
<- glm.nb(y ~ treatment, data = dt) fit
```

### 20.5.2 Fitting a GLM to a continuous conditional response with right skew.

The Gamma family is specified with the base R `glm()`

function.

`<- glm(y ~ treatment, family = Gamma(link = "log"), data = dt) fit `

### 20.5.3 Fitting a GLM to a binary (success or failure, presence or absence, survived or died) response

The binomial family is specified with base R `glm()`

function.

```
# if the data includes a 0 or 1 for every observation of y
<- glm(y ~ treatment, family = "binomial", data = dt)
fit
# if the data includes the frequency of success AND there is a measure of the total n
:= n - success]
dt[ , failure <- glm(cbind(success, failure) ~ treatment, family = "binomial", data = dt) fit
```

### 20.5.4 Fitting Generalized Linear Mixed Models

Generalized linear mixed models are fit with `glmer`

from the lmer package.

```
# random intercept of factor "id"
<- glmer(y ~ treatment + (1|id), family = "poisson", data = dt)
fit
# random intercept and slope of factor "id"
<- glmer(y ~ treatment + (treatment|id), family = Gamma(link = "log"), data = dt)
fit
# Again, negative binomial uses a special function
<- glmer.nb(y ~ treatment + (treatment|id), data = dt) fit
```

Another good package for GLMMs is `glmmTMB`

from the glmmTMB package

```
# negative binomial
<- glmmTMB(y ~ treatment + (1|id), family="nbinom2", data = dt)
fit
# nbinom1, the mean variance relationship is that of quasipoisson
<- glmmTMB(y ~ treatment + (1|id), family="nbinom1", data = dt) fit
```

## 20.6 Model checking GLMs

The `DHARMa`

package has an excellent set of model checking tools. The DHARMa package uses simulation to generate fake data sampled from the fit model using the function `simulateResiduals`

.

```
<- simulateResiduals(fittedModel = exp3a_m3,
simulation_output n = 250,
refit = FALSE)
```

```
<- simulateResiduals(fittedModel = exp3a_m2,
simulation_output n = 250,
refit = FALSE)
```

- The DHARMa package uses simulation to

`plot(simulation_output)`

`plotQQunif(simulation_output)`

- Three test statistics are superimposed. Use these
*p*-values cautiously – they are guides and not thresholds of demarcation. The two we care about here are

- The KS statistic indicates that the quantile residuals are not very compatible with a Poisson model – think of this as having a very low probability of sampling these counts from a Poisson with the estimated \(\lambda\).
- The dispersion statistic indicates that the value of the dispersion of the quantile residuals is not very compatible with a Poisson model – think of this as having a very low probability of sampling counts with this dispersion from a Poisson with the estimated \(\lambda\).