Chapter 20 Linear models for count data – Generalized Linear Models I

Biologists frequently count stuff, and design experiments to estimate the effects of different factors on these counts. Count data can cause numerous problems with linear models that assume a normal conditional distribution, including 1) counts are discrete, and can be zero or positive integers only, 2) counts tend to bunch up on the small side of the range, creating a distribution with a positive skew, 3) a sample of counts can have an abundance of zeros, and 4) the variance of counts increases with the mean (see Figure 20.1 for some of these properties). Some count data can be approximated by and reasonably modeled with a normal distribution. More often, count data should be modeled with a Poisson distribution or negative binomial distribution using a generalized linear model. Poisson and negative binomial distributions are discrete probability distributions with two important properties: 1) the distribution contains only zero and positive integers and 2) the variance is a function of the mean. Before modern computing and fast processors, count data were often analyzed by either transforming the response or by non-parametric hypothesis tests. One reason to prefer a statistical modeling approach with a GLM is that we can get interpretable parameter estimates. By contrast, both the analysis of transformed data and non-parametric hypothesis tests are really tools for computing “correct” p-values.

Figure 20.1: Histogram of the number of angiogenic sprouts in response to two of the four treatment combinations for the experiment in Example 1.

20.1 The Generalized Linear Model (GLM)

As outlined in Two specifications of a linear model, a common way that biological researchers are taught to think about a response variable is

\[ response = expected + error \]

or, using the notation of this text,

\[ \begin{align} y &= \beta_0 + \beta_1 \texttt{treatment} + \varepsilon \\ \varepsilon &\sim \operatorname{Normal}(0, \sigma^2) \tag{20.1} \end{align} \]

That is, we can think of a response as the sum of some systematic part (\(\beta_0 + \beta_1 \texttt{treatment}\)) and a stochastic (“random error”) part (\(\varepsilon\)), where the stochastic part is a random draw from a normal distribution with mean zero and variance \(\sigma^2\). This way of thinking about the generation of the response is useful for linear models, and model checking linear models, but is much less useful for thinking about generalized linear models or model checking generalized linear models. For example, if we want to model the number of angiogenic sprouts in response to some combination of treatment (GAS6) and genotype (FAK deletion) using a Poisson distribution, the following is the wrong way to think about the statistical model

\[ \begin{align} \texttt{sprouts} = &\ \beta_0 + \beta_1 \texttt{treatment}_{\texttt{GAS6}} + \beta_2 \texttt{genotype}_{\texttt{FAK_ko}} + \\ &\ \beta_3 \texttt{treatment}_{\texttt{GAS6}}:\texttt{genotype}_{\texttt{FAK_ko}} + \varepsilon\\ \varepsilon \sim &\ \operatorname{Poisson}(\lambda) \tag{20.2} \end{align} \]

That is, we should not think of a count as the sum of a systematic part and a random draw from a Poisson distribution. Why? Because it is the counts, conditional on \(\texttt{treatment}\) and \(\texttt{genotype}\), that are Poisson distributed, not the residuals from the fit model.

Thinking about the distribution of count data using model (20.2) leads to absurd consequences. For example, if we set the mean of the Poisson “error” to zero (like with a normal distribution), then the error term for every observation would have to be zero (because the only way to get a mean of zero with non-negative integers is if every value is zero). Or, if the study is modeling the effect of a treatment on the counts (that is, the \(X\) are dummy variables) then \(\beta_0\) is the expected mean count of the control (or reference) group. But if we add non-zero Poisson error to this, then the mean of the control group would be larger than \(\beta_0\). This doesn’t make sense. And finally, equation (20.2) generates a continuous response, instead of an integer, because \(\beta_0\) and \(\beta_1\) are continuous.

A better way to think about the data generation for a linear model that naturally leads to the correct way to think about data generation for a generalized linear model, is

\[ \begin{align} y_i &\sim N(\mu_i, \sigma^2)\\ \mathrm{E}(y_i| \texttt{treatment}) &= \mu_i\\ \mu_i &= \beta_0 + \beta_1 \texttt{treatment}_i \tag{20.3} \end{align} \]

That is, a response is a random draw from a normal distribution with mean \(\mu_i\) (not zero!) and variance \(\sigma^2\). Line 1 is the stochastic part of this specification. Line 3 is the systematic part.
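A minimal simulation sketch of specification (20.3) may make the two parts concrete. The parameter values (`beta_0`, `beta_1`, `sigma`) and the sample size are made up for illustration only and are not estimates from any data in this chapter.

```r
# Hypothetical parameter values, chosen only for illustration
set.seed(1)
n <- 1000                                  # observations per group
beta_0 <- 10                               # mean of the reference group
beta_1 <- 3                                # effect of treatment
sigma <- 2                                 # conditional standard deviation

treatment <- rep(c(0, 1), each = n)        # dummy-coded treatment
mu <- beta_0 + beta_1 * treatment          # systematic part (line 3)
y <- rnorm(2 * n, mean = mu, sd = sigma)   # stochastic part (line 1)

# the fit linear model should recover values close to beta_0 and beta_1
fit <- lm(y ~ treatment)
coef(fit)
```

Note that no "error" is added to anything: each response is drawn directly from a normal distribution whose mean is the conditional mean.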

The specification of a generalized linear model has both stochastic and systematic parts but adds a third part, which is a link function connecting the stochastic and systematic parts.

  1. The stochastic part, which is a probability distribution from the exponential family (this is sometimes called the “random part”) \[ \begin{equation} y_i \sim \operatorname{Prob}(\mu_i) \end{equation} \]
  2. The systematic part, which is a linear predictor (I like to think about this as the deterministic part) \[ \begin{equation} \eta_i = \beta_0 + \beta_1 \texttt{treatment}_i \end{equation} \]
  3. A link function connecting the two parts \[ \begin{equation} \eta_i = g(\mu_i) \end{equation} \]

\(\mu\) (the Greek symbol mu) is the conditional mean (or expectation \(\mathrm{E}(Y|X)\)) of the response on the response scale and \(\eta\) (the Greek symbol eta) is the conditional mean of the response on the link scale. The response scale is the scale of the raw measurements and has units of the raw measurements. The link scale is the scale of the transformed mean. A GLM models the response with a distribution specified in the stochastic part. The probability distributions introduced in this chapter are the Poisson and Negative Binomial. The default link function in R for both the Poisson and Negative Binomial is the “log link,” \(\eta = \mathrm{log}(\mu)\), which is also the natural (or, “canonical”) link function for the Poisson. More generally, while each distribution has a canonical link function, one can use alternatives. Given this definition of a generalized linear model, a linear model is a GLM with a normal distribution and an Identity link (\(\eta = \mu\)).

\[ \begin{align} y_i &\sim \operatorname{Normal}(\mu_i, \sigma^2)\\ \mathrm{E}(y_i| \texttt{treatment}) &= \mu_i\\ \eta_i &= \beta_0 + \beta_1 \texttt{treatment}_i\\ \mu_i &= \eta_i \end{align} \]

Think about the link function and GLM more generally like this: the GLM constructs a linear model that predicts the conditional means on the link scale. If the model uses a “log link” (\(\eta_i = \mathrm{log}(\mu_i)\)), then \(\eta_i\) – the conditional mean on the link scale – is the log of the modeled mean on the response scale. The modeled mean on the response scale is recovered by applying the inverse link function to \(\eta_i\).

\[ \begin{equation} \mu_i = g^{-1}(\eta_i) \end{equation} \]

For a log link, a modeled mean is \(\mathrm{exp}(\eta_i)\).
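The three parts of a GLM can be sketched as a data-generating simulation. This is an illustration under assumed values: the coefficients `beta_0` and `beta_1` are hypothetical (chosen so the reference mean is 2 counts and the treatment triples it) and the data are fake.

```r
# Hypothetical link-scale coefficients, chosen only for illustration
set.seed(2)
n <- 2000                                  # observations per group
beta_0 <- log(2)                           # reference mean is 2 counts
beta_1 <- log(3)                           # treatment triples the mean

treatment <- rep(c(0, 1), each = n)
eta <- beta_0 + beta_1 * treatment         # systematic part: linear predictor
mu <- exp(eta)                             # inverse of the log link
y <- rpois(2 * n, lambda = mu)             # stochastic part: Poisson draws

# coefficients are estimated on the link scale;
# exponentiate to move them to the response scale
fit <- glm(y ~ treatment, family = poisson(link = "log"))
b <- coef(fit)
exp(b)
```

The simulated responses are non-negative integers even though the linear predictor is continuous, because the counts are drawn from a Poisson distribution with mean \(\mu_i\) rather than built by adding an error to \(\eta_i\).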

Importantly, in a GLM, the individual data values are not transformed. A GLM with a log link is not the same as a linear model on log transformed data.
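A small sketch of this difference, using a handful of hypothetical counts (not the sprout data): with a single factor, the Poisson GLM with a log link returns the arithmetic group means after backtransformation, while the linear model fit to log-transformed counts returns the geometric group means.

```r
# Hypothetical counts for two groups (no zeros, so log(y) is defined)
dat <- data.frame(
  treatment = factor(rep(c("control", "treated"), each = 6),
                     levels = c("control", "treated")),
  y = c(2, 3, 3, 5, 8, 15,
        4, 9, 10, 12, 25, 60)
)

# GLM with a log link: the *mean* is modeled on the log scale
glm_fit <- glm(y ~ treatment, family = poisson(link = "log"), data = dat)
# linear model on log-transformed data: the mean of log(y) is modeled
lm_log_fit <- lm(log(y) ~ treatment, data = dat)

# backtransformed group means (intercept, then intercept + effect)
glm_means <- exp(cumsum(coef(glm_fit)))
lm_means <- exp(cumsum(coef(lm_log_fit)))

arithmetic_means <- tapply(dat$y, dat$treatment, mean)
rbind(arithmetic = arithmetic_means, glm = glm_means, lm_log = lm_means)
```

The backtransformed means from the log-transformed linear model are smaller than the group means because a geometric mean is never larger than the arithmetic mean of the same positive values.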

When modeling counts using the Poisson or negative binomial distributions with a log link, the link scale is linear, and so the effects are additive on the link scale, while the response scale is nonlinear (it is the exponential of the link scale), and so the effects are multiplicative on the response scale. If this doesn’t make sense now, an example is worked out below. The inverse of the link function backtransforms the parameters from the link scale back to the response scale. So, for example, a prediction on the response scale is \(\mathrm{exp}(\hat{\eta})\) and a coefficient on the response scale is \(\mathrm{exp}(b_j)\).
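As a short sketch of why additive effects on the link scale become multiplicative on the response scale, take a single dummy-coded \(\texttt{treatment}\) (0 for the reference level, 1 for the treated level) and a log link:

\[ \begin{align} \eta_i &= \beta_0 + \beta_1 \texttt{treatment}_i\\ \mu_i &= \mathrm{exp}(\eta_i) = \mathrm{exp}(\beta_0) \times \mathrm{exp}(\beta_1 \texttt{treatment}_i) \end{align} \]

The modeled mean of the reference group is \(\mathrm{exp}(\beta_0)\) and the modeled mean of the treated group is \(\mathrm{exp}(\beta_0) \times \mathrm{exp}(\beta_1)\), so \(\mathrm{exp}(\beta_1)\) is the ratio of the treated mean to the reference mean, not their difference.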

20.2 Kinds of experimental biology data that are modeled by a GLM

  1. If a response is a count then use a Poisson, quasi-Poisson or negative binomial family with a log link.
  2. If a response is binary, for example, presence/absence or success/failure or survive/die, then use the binomial family with a logit link (other link functions are also useful). This is classically known as logistic regression.
  3. If a response is a fraction that is a ratio of counts, for example the fraction of total cells that express some marker, then use the binomial family with a logit link. Think of each count as a “success.”
  4. If a response is a count per unit “effort” (cells per area or volume or time), then use a Poisson, quasi-Poisson or negative binomial family with a log link on the raw count and include the measure of effort as an offset.
  5. If a response is continuous but the variance increases with the mean, then use the gamma family.
  6. If a response is a continuous measure per unit “effort” (tumor area per total area or volume or time), then use a gamma family on the raw measure and include the measure of effort as an offset.
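Item 4 in the list above can be sketched with simulated data. Everything here is hypothetical: the variable names (`cells`, `area`), the rate of 10 cells per unit area in the reference group, and the 1.5-fold treatment effect are assumptions for illustration.

```r
# Hypothetical counts of cells in sampled areas of different size
set.seed(3)
n <- 200                                          # observations per group
treatment <- rep(c(0, 1), each = n)
area <- runif(2 * n, min = 0.5, max = 2)          # the measure of "effort"
rate <- exp(log(10) + log(1.5) * treatment)       # cells per unit area
cells <- rpois(2 * n, lambda = rate * area)       # raw counts

# model the raw count; log(area) enters the linear predictor as an offset,
# which is a term with its coefficient fixed at 1
fit <- glm(cells ~ treatment + offset(log(area)),
           family = poisson(link = "log"))

# response-scale coefficients: the reference rate per unit area
# and the ratio of the treated rate to the reference rate
exp(coef(fit))
```

Because \(\mathrm{log}(\texttt{cells}/\texttt{area}) = \mathrm{log}(\texttt{cells}) - \mathrm{log}(\texttt{area})\), adding \(\mathrm{log}(\texttt{area})\) as an offset models the rate while keeping the response a count.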

20.3 Example 1 – A generalized linear model explainer (“angiogenic sprouts” exp3a)

20.3.1 Understand the data

Article source Lechertier, Tanguy, et al. “Pericyte FAK negatively regulates Gas6/Axl signalling to suppress tumour angiogenesis and tumour growth.” Nature communications 11.1 (2020): 1-14.

data source

The researchers designed a set of experiments to investigate the effects of pericyte derived FAK (focal adhesion kinase, a protein tyrosine kinase) on the Gas6/Axl pathway regulating angiogenesis promoting tumor growth. Pericytes are cells immediately deep to the endothelium (the epithelial lining) of the smallest blood vessels, including capillaries, arterioles, and venules. Angiogenesis is the growth of new blood vessels. GAS6 (growth arrest specific 6) is a protein commonly expressed in tumors.

The example data is from the experiment for Figure 3a. The design is \(2 \times 2\) – two factors (\(\texttt{treatment}\) and \(\texttt{genotype}\)) each with two levels.

  1. Factor 1: \(\texttt{treatment}\)
  • reference level: “PBS.” Phosphate buffered saline added to tissue. This is the control treatment.
  • treatment level: “GAS6.” Added to tissue in solution. The experiment is designed to test its effect on promoting angiogenesis in the development of tumors.
  2. Factor 2: \(\texttt{genotype}\)
  • reference level: “FAK_wt.” The functional genotype.
  • treatment level: “FAK_ko.” Tissue-specific FAK deletion in pericytes. The experiment is designed to test the effect of pericyte-derived FAK on slowing angiogenesis in the development of tumors. If true, then its deletion should result in increased tumor development.

The four treatment by genotype combinations are

  1. Control (“PBS FAK_wt”) – negative control
  2. FAK_ko (“PBS FAK_ko”) – PBS control. Unknown response given deletion of putative angiogenesis inhibitor but no added putative angiogenesis promoter
  3. GAS6 (“GAS6 FAK_wt”) – GAS6 control. GAS6 expected to promote angiogenesis but the response is expected to be inhibited by some amount by FAK
  4. GAS6+FAK_ko (“GAS6 FAK_ko”) – focal treatment. Expected positive angiogenesis

The planned contrasts are

  1. (PBS FAK_ko) - (PBS FAK_wt) – the effect of FAK deletion given the control treatment
  2. (GAS6 FAK_ko) - (GAS6 FAK_wt) – the effect of FAK deletion given the GAS6 treatment
  3. ((GAS6 FAK_ko) - (GAS6 FAK_wt)) - ((PBS FAK_ko) - (PBS FAK_wt)) – the interaction effect, giving the effect of the combined treatment relative to the individual effects.

20.3.2 Model fit and inference

20.3.2.1 Fit the model

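library(MASS) # provides glm.nb()

# linear model (normal conditional distribution)
exp3a_m1 <- lm(sprouts ~ treatment * genotype, data = exp3a)
# Poisson GLM (log link is the default)
exp3a_m2 <- glm(sprouts ~ treatment * genotype,
                family = "poisson",
                data = exp3a)
# negative binomial GLM (log link is the default)
exp3a_m3 <- glm.nb(sprouts ~ treatment * genotype,
                data = exp3a)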

20.3.2.2 Check the linear model

ggcheck_the_model(exp3a_m1)

Notes

  1. left panel shows classic right skew conditional distribution with larger values much larger than expected with a normal distribution.
  2. right panel shows heterogeneity and specifically the variance increasing with the mean.

20.3.2.3 Check the Poisson model

Check shape and homogeneity

# simulateResiduals() is from the DHARMa package
library(DHARMa)
exp3a_m2_simulation <- simulateResiduals(fittedModel = exp3a_m2, n = 250)
plot(exp3a_m2_simulation, asFactor = FALSE)

Notes

  1. The Poisson GLM fails to generate scaled residuals that approximate a uniform distribution.

Check dispersion

exp3a_m2_simulation_refit <- simulateResiduals(fittedModel = exp3a_m2,
                                       n = 250,
                                       refit = TRUE)
exp3a_m2_test_dispersion <- testDispersion(exp3a_m2_simulation_refit)

Notes

  1. large overdispersion

Check zero inflation

exp3a_m2_test_zi <- testZeroInflation(exp3a_m2_simulation_refit)

Notes

  1. The data have too many zeros relative to the number expected from a Poisson GLM.

20.3.2.4 Check the negative binomial model

Check shape and homogeneity

# from the DHARMa package
exp3a_m3_simulation <- simulateResiduals(fittedModel = exp3a_m3,
                                         n = 250)
plot(exp3a_m3_simulation, asFactor = FALSE)

Notes

  1. uniform q-q for negative binomial GLM looks good.
  2. spread-location plot looks good.

Check dispersion

exp3a_m3_simulation_refit <- simulateResiduals(fittedModel = exp3a_m3,
                                               n = 250,
                                               refit = TRUE)
exp3a_m3_test_dispersion <- testDispersion(exp3a_m3_simulation_refit)

Notes

  1. The dispersion check looks good.

Check zero inflation

exp3a_m3_test_zi <- testZeroInflation(exp3a_m3_simulation_refit)

20.3.2.5 Inference from the model

exp3a_m3_coef <- cbind(coef(summary(exp3a_m3)),
                       confint(exp3a_m3))

Coefficient                   Estimate  Std. Error  z value  Pr(>|z|)  2.5 %  97.5 %
(Intercept)                       0.90       0.231      3.9     0.000   0.45    1.36
treatmentGAS6                     1.30       0.287      4.5     0.000   0.74    1.86
genotypeFAK_ko                    0.04       0.357      0.1     0.900  -0.65    0.75
treatmentGAS6:genotypeFAK_ko      0.46       0.425      1.1     0.275  -0.37    1.30

exp3a_m3_emm <- emmeans(exp3a_m3,
                        specs = c("treatment", "genotype"),
                        type="response")

treatment  genotype  response   SE   df  asymp.LCL  asymp.UCL
PBS        FAK_wt        2.47  0.6  Inf       1.57       3.88
GAS6       FAK_wt        9.05  1.5  Inf       6.48      12.64
PBS        FAK_ko        2.58  0.7  Inf       1.52       4.40
GAS6       FAK_ko       15.04  2.4  Inf      11.06      20.46

# exp3a_m3_emm # print in console to get row numbers
# set the mean as the row number from the emmeans table
pbs_fak_wt <- c(1,0,0,0)
gas6_fak_wt <- c(0,1,0,0)
pbs_fak_ko <- c(0,0,1,0)
gas6_fak_ko <- c(0,0,0,1)

#1. (PBS FAK_ko) - (PBS FAK_wt)
#2. (GAS6 FAK_ko) - (GAS6 FAK_wt)
#3. ((GAS6 FAK_ko) - (GAS6 FAK_wt)) - ((PBS FAK_ko) - (PBS FAK_wt))
exp3a_m3_planned <- contrast(
  exp3a_m3_emm,
  method = list(
    "(PBS FAK_ko) - (PBS FAK_wt)" = c(pbs_fak_ko - pbs_fak_wt),
    "(GAS6 FAK_ko) - (GAS6 FAK_wt)" = c(gas6_fak_ko - gas6_fak_wt),
    "Interaction" = c(gas6_fak_ko - gas6_fak_wt) -
      c(pbs_fak_ko - pbs_fak_wt)
  ),
  adjust = "none"
) %>%
  summary(infer = TRUE)

contrast                       ratio     SE   df  asymp.LCL  asymp.UCL  z.ratio  p.value
(PBS FAK_ko) / (PBS FAK_wt)     1.05  0.373  Inf       0.52       2.10     0.13  0.90040
(GAS6 FAK_ko) / (GAS6 FAK_wt)   1.66  0.385  Inf       1.06       2.62     2.19  0.02820
Interaction                     1.59  0.676  Inf       0.69       3.66     1.09  0.27535
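The link between the coefficient table (link scale) and the tables of means and contrasts (response scale) can be sketched with a little arithmetic. Using the printed coefficients, with \(b_0\) the intercept, \(b_1\) the \(\texttt{treatmentGAS6}\) coefficient, \(b_2\) the \(\texttt{genotypeFAK_ko}\) coefficient, and \(b_3\) the interaction coefficient, the modeled means are approximately

\[ \begin{align} \mathrm{exp}(b_0) &= \mathrm{exp}(0.90) \approx 2.46\\ \mathrm{exp}(b_0 + b_1) &= \mathrm{exp}(2.20) \approx 9.03\\ \mathrm{exp}(b_0 + b_2) &= \mathrm{exp}(0.94) \approx 2.56\\ \mathrm{exp}(b_0 + b_1 + b_2 + b_3) &= \mathrm{exp}(2.70) \approx 14.88 \end{align} \]

and the contrasts, which are ratios on the response scale, are approximately

\[ \begin{align} \mathrm{exp}(b_2) &= \mathrm{exp}(0.04) \approx 1.04\\ \mathrm{exp}(b_2 + b_3) &= \mathrm{exp}(0.50) \approx 1.65\\ \mathrm{exp}(b_3) &= \mathrm{exp}(0.46) \approx 1.58 \end{align} \]

These values are close to, but not exactly, the values in the tables (2.47, 9.05, 2.58, 15.04 for the means and 1.05, 1.66, 1.59 for the ratios), presumably because the printed coefficients are rounded to two decimal places.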

Notes

20.3.2.6 Plot the model

20.3.2.7 Alternaplot the model

20.4 Understanding Example 1

20.4.1 Modeling strategy

Instead of testing assumptions of a model using formal hypothesis tests before fitting the model, a better strategy is to 1) fit one or more models based on initial evaluation of the data, and then do 2) model checking using diagnostic plots, diagnostic statistics, and simulation (see Section All statistical analyses should be followed by model checking).

For the exp3a data, I fit a linear model, a Poisson GLM, and a negative binomial GLM. I use the diagnostic plots and statistics to help me decide which model to report.

20.4.2 Model checking fits to count data

We use the fit models to check

  1. the compatibility between the quantiles of the observed residuals and the distribution of expected quantiles from the family in the model fit
  2. if the observed distribution is over or under dispersed
  3. if there are more zeros than expected by the theoretical distribution. If so, the observed distribution is zero-inflated

20.4.2.1 Checking the linear model exp3a_m1 – a Normal-QQ plot

Figure 20.2A shows a histogram of the residuals from the fit linear model. The plot shows that the residuals seem to be clumped at the negative end of the range, which suggests that the data are not well approximated by a model with a normally distributed conditional outcome (or normal error).

Figure 20.2: Diagnostic plots of angiogenic sprout data (exp3a). A) Distribution of the residuals of the fit linear model. B) Normal-QQ plot of the residuals of the fit linear model.

A better way to investigate this is with the Normal-QQ plot in Figure 20.2B, which plots the sample quantiles for a variable against their theoretical quantiles. If the conditional outcome approximates a normal distribution,

  1. the points should roughly follow the robust regression line,
  2. the points should be largely inside the 95% CI (gray) bounds for sampling a normal distribution with the variance estimated by the model, and
  3. the robust regression line should be largely inside the CI bounds.

For the sprout data, the points are above the line at the positive end, hug the upper bound of the 95% CI at the negative end, are well above the 95% CI at the positive end, and the robust regression is distinctly shallower than a line bisecting the CI bounds. At the left (negative) end, the observed values are more positive than the theoretical values. Remembering that this plot is of residuals, if we think about this as counts, this means that our smallest counts are not as small as we would expect given the mean, the variance, and a normal distribution. This shouldn’t be surprising – the counts range down to zero and counts cannot be below zero. At the positive end, the sample values are again more positive than the theoretical values. Thinking about this as counts, this means that our largest counts are larger than expected given the mean, the variance, and a normal distribution. This pattern is what we’d expect of count data.

20.4.2.2 Checking the linear model exp3a_m1 – Spread-level plot for checking homoskedasticity

A linear model also assumes that the error is homoskedastic (the error variance is not a function of the value of the \(X\) variables). Non-homoskedastic error is heteroskedastic. I will typically use “homogeneous variance” and “heterogeneous variance” since these terms are more familiar to biologists. The fit model can be checked for homogeneity using a spread-level (also known as a scale-location) plot, which comes in several forms. I like a spread-level plot that is a scatterplot of the square root of the absolute value of the standardized residuals against the fitted values (remember that the fitted values are the values computed by the linear predictor of the model – they are the “predicted values” of the observed data). If the variance is homogeneous, then a regression line through the scatter should be close to horizontal. The regression line in the spread-level plot of the fit of the linear model to the sprout data shows a distinct increase in the “scale” (the square root of the absolute value of the standardized residuals) with increased fitted value, which is expected of data sampled from a distribution in which the variance increases with the mean.

20.4.2.3 Two distributions for count data – Poisson and Negative Binomial

The pattern in the Normal-QQ plot in Figure 20.2B should discourage a researcher from modeling the data with a normal distribution and instead model the data with an alternative distribution using a Generalized Linear Model. There is no unique mapping between observed data and a data generating mechanism with a specific distribution, so this decision is not as easy as thinking about the data generation mechanism and then simply choosing the “correct” distribution. Section 4.5 in Bolker (xxx) is an excellent summary of how to think about the generating processes for different distributions in the context of ecological data. Since the responses in the angiogenic sprouts data are counts, we need to choose a distribution that generates integer values, such as the Poisson or the negative binomial.

  1. Poisson – A Poisson distribution is the probability distribution of the number of occurrences of some thing (a white blood cell, a tumor, or a specific mRNA transcript) generated by a process that generates the thing at a constant rate per unit effort (duration or space). This constant rate is the parameter \(\lambda\), which is the expectation (the expected mean of the counts), so \(\mathrm{E}(Y) = \mu = \lambda\). Because the rate per effort is constant, the variance of a Poisson variable equals the mean, \(\sigma^2 = \mu = \lambda\). Figure ?? shows three samples from a Poisson distribution with \(\lambda\) set to 1, 5, and 10. The plots show that, as the mean count (\(\lambda\)) moves away from zero, a Poisson distribution 1) becomes less skewed and more closely approximates a normal distribution and 2) has an increasingly low probability of including zero (less than 1% zeros when the mean is 5).

A Poisson distribution, then, is useful for count data in which the conditional variance is close to the conditional mean. Very often, biological count data are not well approximated by a Poisson distribution because the variance is either less than the mean, an example of underdispersion[5], or greater than the mean, an example of overdispersion[6]. A useful distribution for count data with overdispersion is the negative binomial.

  2. Negative Binomial – The negative binomial distribution is a discrete probability distribution of the number of successes that occur before a specified number of failures \(k\) given a probability \(p\) of success. This isn’t a very useful way of thinking about modeling count data in biology. What is useful is that the Negative Binomial distribution can be used simply as a way of modeling an “overdispersed” Poisson process. Using the parameterization in the MASS::glm.nb function, the mean of a negative binomial variable is \(\mu\) and the variance is \(\sigma^2 = \mu + \frac{\mu^2}{\theta}\). As a method for modeling an overdispersed Poisson variable, \(\theta\) functions as a dispersion parameter controlling the amount of overdispersion and can be any real, positive value (not simply a positive integer), including values less than 1. As \(\theta\) approaches positive infinity, the “overdispersion” bit \(\frac{\mu^2}{\theta}\) goes to zero and the variance goes to \(\mu\), which is the same as the Poisson.
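The mean-variance relationships of the two distributions can be checked with a quick simulation. This is a sketch with made-up parameter values; it assumes that the `size` argument of base R's `rnbinom()` (used together with `mu`) plays the role of \(\theta\), which gives the variance \(\mu + \mu^2/\theta\).

```r
set.seed(4)
n <- 2e5   # large samples so the sample moments are stable

# Poisson: the variance equals the mean, and zeros become rare as lambda grows
lambdas <- c(1, 5, 10)
poisson_summary <- t(sapply(lambdas, function(lambda) {
  y <- rpois(n, lambda = lambda)
  c(lambda = lambda, mean = mean(y), variance = var(y),
    prop_zero = mean(y == 0))
}))

# Negative binomial: the variance is mu + mu^2/theta, which approaches
# the Poisson variance (mu) as theta gets large
mu <- 5
thetas <- c(0.5, 2, 1000)
nb_summary <- t(sapply(thetas, function(theta) {
  y <- rnbinom(n, size = theta, mu = mu)
  c(theta = theta, mean = mean(y), variance = var(y),
    expected_variance = mu + mu^2 / theta)
}))

round(poisson_summary, 3)
round(nb_summary, 3)
```

With \(\theta\) less than 1 the variance is many times the mean, while with \(\theta = 1000\) the sample is nearly indistinguishable from a Poisson sample with the same mean.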

20.4.2.4 Model checking a GLM I – the quantile-residual uniform-QQ plot

Normal-QQ plots were introduced in the Model Checking chapter and applied to the linear model fit of the angiogenic sprout data in Section 20.4.2.1 above. We cannot use a Normal-QQ plot with a Poisson or negative binomial GLM fit because the residuals from this fit are not expected to be normally distributed. An alternative to a Normal-QQ plot for a GLM fit is a quantile-residual uniform-QQ plot of observed quantile residuals.

Quantile-residual uniform-QQ plot of the Poisson GLM fit to the angiogenic sprouts (exp3a) data.

Notes

  1. The x-axis (“Expected”) contains the expected quantiles from a uniform distribution.
  2. The y-axis (“Observed”) contains the observed quantile residuals from a GLM fit, which are the residuals from the fit model that are transformed in a way that the expected distribution is uniform under the fit model family. This means that we’d expect the quantile residuals to closely approximate the expected quantiles from a uniform distribution. If the approximation is close, the points will fall along the “y = x” line in the plot.
  3. The gray shaded area is a 95% confidence interval computed using a parametric bootstrap. At any value of the expected quantile, the interval will include an observed quantile 95% of the time. This gray area gives us a sense of the variability we’d get when we fit models to random samples from the specified model.
  4. In the quantile-residual QQ plot for Model exp3a_m2, the observed residuals are far outside the 95% boundary. The observed residuals are smaller than expected at the negative (left) end and larger than expected at the right (high) end. This means the residuals are more spread out than expected for a Poisson sample. The data are overdispersed for this model. Understand that overdispersion is not a property of data but of the residuals from a specific model fit to the data.
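To make the idea of a quantile residual concrete, here is a hand-rolled sketch using randomized quantile residuals computed directly from the Poisson cumulative distribution function. This is not DHARMa's implementation (DHARMa computes scaled residuals by simulation from the fit model); the function `quantile_residuals()` and the simulated data are hypothetical and are meant only to show why the residuals should look uniform when the model family fits.

```r
set.seed(6)
n <- 500
y_pois <- rpois(n, lambda = 4)            # fake data compatible with a Poisson
y_nb <- rnbinom(n, size = 1, mu = 4)      # fake data overdispersed for a Poisson

# hypothetical helper: fit an intercept-only Poisson GLM and return
# randomized quantile residuals on the uniform (0, 1) scale
quantile_residuals <- function(y) {
  fit <- glm(y ~ 1, family = poisson(link = "log"))
  mu_hat <- fitted(fit)
  lower <- ppois(y - 1, lambda = mu_hat)  # P(Y < y) under the fit model
  upper <- ppois(y, lambda = mu_hat)      # P(Y <= y) under the fit model
  # a random draw between the two makes the discrete residuals continuous
  runif(length(y), min = lower, max = upper)
}

res_pois <- quantile_residuals(y_pois)
res_nb <- quantile_residuals(y_nb)

# compare each set of residuals to a uniform distribution
ks_pois <- ks.test(res_pois, "punif")$p.value
ks_nb <- ks.test(res_nb, "punif")$p.value
c(poisson_data = ks_pois, overdispersed_data = ks_nb)
```

Plotting `sort(res_nb)` against `ppoints(n)` gives a rough version of the uniform-QQ plot: the overdispersed data pile up near 0 and 1, which is the pattern of points falling below the line at the left end and above it at the right end.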

Misconceivable – A common misconception is that if the distribution of the response approximates a Poisson distribution, then the residuals of a GLM fit with a Poisson distribution should be normally distributed, which could then be checked with a Normal-QQ plot, and homoskedastic, which could be checked with a scale-location plot. Neither of these is true because a GLM does not transform the data and, in fact, the model definition does not specify anything about the distribution of an “error” term – there is no \(\varepsilon\) in the model definition above! This is why thinking about the definition of a linear model by specifying an error term with a normal distribution can be confusing and lead to misconceptions when learning GLMs.

20.4.2.5 Model checking a GLM II – Spread-level plot for checking homoskedasticity

Spread-level plot of the Poisson GLM fit to the angiogenic sprouts (exp3a) data.

Notes

20.4.2.6 Model checking a GLM III – Checking dispersion

Dispersion plot of the Poisson GLM fit to the angiogenic sprouts (exp3a) data.

Notes

  1. This plot is a histogram of the sum of squared Pearson residuals of fake data sampled from the fit model. Pearson residuals are the raw residuals divided by the square root of the fitted value. Remember that in the Poisson distribution, the variance is equal to the expectation (mean), so a Pearson residual is the raw residual divided by the standard deviation of the residual. A way to think about this is, Pearson residuals “correct” for the heterogeneity in variance that arises among groups with different mean counts.
  2. The sum of squared Pearson residuals is a measure of the dispersion of the residuals.
  3. The red line is the observed sum of the squared Pearson residuals of the fit model.
  4. If the observed dispersion approximates that expected from sampling from the fit model, the red line will be within the histogram.
  5. The red line here is far larger than expected given the histogram, which indicates that the residuals are overdispersed given the fit model.
  6. Overdispersion will be common with Poisson GLM fits to biological data.
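The quantities in the notes above can be computed by hand. The sketch below uses fake, deliberately overdispersed counts (negative binomial with made-up means and \(\theta\)) and, unlike the DHARMa call with `refit = TRUE` used earlier, does not refit the model to each simulated data set, so it is a simplified approximation of that check rather than a reproduction of it.

```r
set.seed(7)
n <- 100                                            # observations per group
treatment <- factor(rep(c("control", "treated"), each = n))
mu <- ifelse(treatment == "control", 3, 9)          # hypothetical group means
y <- rnbinom(2 * n, size = 1.5, mu = mu)            # overdispersed fake counts

fit <- glm(y ~ treatment, family = poisson(link = "log"))
mu_hat <- fitted(fit)

# Pearson residual: raw residual divided by the square root of the fitted value
pearson <- residuals(fit, type = "pearson")
pearson_by_hand <- (y - mu_hat) / sqrt(mu_hat)
all.equal(unname(pearson), unname(pearson_by_hand))

# observed sum of squared Pearson residuals (the "red line")
observed_stat <- sum(pearson^2)
# a common summary: values well above 1 suggest overdispersion
dispersion <- observed_stat / df.residual(fit)

# reference distribution: the same statistic for fake data
# sampled from the fit Poisson model
sim <- simulate(fit, nsim = 250)
simulated_stat <- sapply(sim, function(y_sim) sum((y_sim - mu_hat)^2 / mu_hat))

dispersion
range(simulated_stat)
observed_stat
```

If the Poisson model were adequate, `observed_stat` would fall within the range of `simulated_stat`. Here it falls far above it.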

20.4.2.7 Model Checking a count GLM – Check zero inflation

Counts can have the value zero. Data that have more zeros than expected given a fit count GLM (Poisson, quasi-Poisson, negative binomial) are zero-inflated.

zero_inflation_test <- testZeroInflation(simulation_output)

  1. This plot is a histogram of the number of zeros in each of the fake data sets generated by the fit model. An observed number of zeros at the extremes of this distribution is unlikely given the fit model. Here, the number of zeros in the observed data is greater than expected by the model.
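The logic of this check can be sketched without DHARMa by counting zeros in fake data sets simulated from a fit model. The data here are hypothetical (negative binomial counts with a made-up mean and \(\theta\)), and the model is an intercept-only Poisson GLM chosen to keep the example short.

```r
set.seed(8)
n <- 300
y <- rnbinom(n, size = 0.8, mu = 3)       # fake counts with many zeros

fit <- glm(y ~ 1, family = poisson(link = "log"))

# number of zeros in the observed data
observed_zeros <- sum(y == 0)

# number of zeros in each of 250 fake data sets sampled from the fit model
sim <- simulate(fit, nsim = 250)
simulated_zeros <- colSums(sim == 0)

# ratio of observed to expected zeros; values well above 1 suggest
# more zeros than the fit model generates
zero_ratio <- observed_zeros / mean(simulated_zeros)

observed_zeros
range(simulated_zeros)
zero_ratio
```

An excess of zeros relative to a Poisson fit does not by itself mean that a zero-inflated model is needed. The same check should be repeated on the negative binomial fit, since overdispersion alone can generate many zeros.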

20.4.3 Biological count data are rarely fit well by a Poisson GLM. Instead, fit a quasi-poisson or negative binomial GLM model.

Here are the diagnostic plots of the negative binomial GLM fit to the exp3a data

Quantile-residual uniform-QQ plot of the negative binomial GLM fit to the angiogenic sprouts (exp3a) data.

Spread-level plot of the negative binomial GLM fit to the angiogenic sprouts (exp3a) data.

Dispersion plot of the negative binomial GLM fit to the angiogenic sprouts (exp3a) data.

20.4.7 Some consequences of fitting a linear model to count data

20.4.7.1 One – linear models can make absurd predictions

plot_grid(gg1, gg2, gg3, ncol=3, labels = "AUTO")

Notes

  1. A prediction interval is a confidence interval of a prediction – using the fit model to predict future responses given the same conditions (here, assignment to one of the four different treatment combinations).
  2. Left panel: The prediction intervals from the linear model imply that negative sprouts could be sampled. This is absurd.
  3. Middle panel: The fit linear model is used to make 100 fake predictions in each group.
  4. Right panel: The fit negative binomial GLM is used to make 100 fake predictions in each group. Nothing absurd here.

20.4.7.2 Two – linear models can perform surprisingly well if one is only interested in p-values

P-values are a function of the sampling distribution of group means and differences in means, and, due to the magic of the central limit theorem, linear models fit to count data perform surprisingly well in the sense of

  1. Type I error that approximates the nominal value
  2. Reasonable power compared to GLM models and many non-parametric tests.
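A small simulation can illustrate the first point. This is a sketch under assumed conditions (two groups of 10 negative binomial counts with the same made-up mean and \(\theta\), so the null hypothesis is true by construction), not a general proof. Other sample sizes, means, or amounts of overdispersion could behave differently.

```r
set.seed(9)
n_sim <- 2000                             # number of fake experiments
n <- 10                                   # observations per group
treatment <- rep(c("a", "b"), each = n)

# both groups are sampled from the same distribution, so any
# "significant" p-value is a Type I error
p_values <- replicate(n_sim, {
  y <- rnbinom(2 * n, size = 2, mu = 5)
  summary(lm(y ~ treatment))$coefficients[2, 4]
})

# frequency of p < 0.05; compare this to the nominal 0.05
type_1_error <- mean(p_values < 0.05)
type_1_error
```

A similar loop with different means in the two groups, and with a GLM fit alongside the linear model, would give a rough comparison of power.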

20.5 Working in R

20.5.1 Fitting GLMs to count data

The Poisson family is specified with the base R glm() function. For the negative binomial, use glm.nb from the MASS package.

# poisson - less likely to fit to real biological data well 
# because of overdispersion
fit <- glm(y ~ treatment, family = "poisson", data = dt)

# two alternatives to overdispersed poisson fit

# quasipoisson
fit <- glm(y ~ treatment, family = "quasipoisson", data=dt)

# negative binomial: glm.nb() is from the MASS package. Note that "family" is not
# an argument since this function is used only to fit a negative binomial distribution!
library(MASS)
fit <- glm.nb(y ~ treatment, data = dt)

20.5.2 Fitting a GLM to a continuous conditional response with right skew.

The Gamma family is specified with the base R glm() function.

fit <- glm(y ~ treatment, family = Gamma(link = "log"), data = dt)

20.5.3 Fitting a GLM to a binary (success or failure, presence or absence, survived or died) response

The binomial family is specified with the base R glm() function.

# if the data includes a 0 or 1 for every observation of y
fit <- glm(y ~ treatment, family = "binomial", data = dt)

# if the data includes the frequency of success AND there is a measure of the total n
dt[ , failure := n - success]
fit <- glm(cbind(success, failure) ~ treatment, family = "binomial", data = dt)

20.5.4 Fitting Generalized Linear Mixed Models

Generalized linear mixed models are fit with glmer from the lme4 package.

# random intercept of factor "id"
fit <- glmer(y ~ treatment + (1|id), family = "poisson", data = dt)

# random intercept and slope of factor "id"
fit <- glmer(y ~ treatment + (treatment|id), family = Gamma(link = "log"), data = dt)

# Again, negative binomial uses a special function
fit <- glmer.nb(y ~ treatment + (treatment|id), data = dt)

Another good function for fitting GLMMs is glmmTMB from the glmmTMB package.

# negative binomial
fit <- glmmTMB(y ~ treatment + (1|id), family="nbinom2", data = dt)

# nbinom1, the mean variance relationship is that of quasipoisson
fit <- glmmTMB(y ~ treatment + (1|id), family="nbinom1", data = dt)

20.6 Model checking GLMs

The DHARMa package has an excellent set of model checking tools. The DHARMa package uses simulation to generate fake data sampled from the fit model using the function simulateResiduals.

simulation_output <- simulateResiduals(fittedModel = exp3a_m3,
                                       n = 250,
                                       refit = FALSE)
simulation_output <- simulateResiduals(fittedModel = exp3a_m2,
                                       n = 250,
                                       refit = FALSE)
plot(simulation_output)

plotQQunif(simulation_output)

  1. Three test statistics are superimposed. Use these p-values cautiously – they are guides and not thresholds of demarcation. The two we care about here are
  • The KS statistic indicates that the quantile residuals are not very compatible with a Poisson model – think of this as having a very low probability of sampling these counts from a Poisson with the estimated \(\lambda\).
  • The dispersion statistic indicates that the value of the dispersion of the quantile residuals is not very compatible with a Poisson model – think of this as having a very low probability of sampling counts with this dispersion from a Poisson with the estimated \(\lambda\).

20.7 Hidden code

20.7.1 Import Example 1 data (exp3a – “angiogenic sprouts”)

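# here(), read_excel(), data.table()/melt()/tstrsplit(), and %>% come from these packages.
# data_folder is assumed to be defined earlier in the project setup.
library(here)
library(readxl)
library(data.table)
library(magrittr)

data_from <- "Pericyte FAK negatively regulates Gas6-Axl signalling to suppress tumour angiogenesis and tumour growth"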
file_name <- "41467_2020_16618_MOESM3_ESM.xlsx"
file_path <- here(data_folder, data_from, file_name)

exp3a_wide <- read_excel(file_path,
                         sheet = "Figure 3",
                         range = "B4:E26",
                         col_names = FALSE) %>%
  data.table()

input_labels <- c("PBS FAK_wt", "PBS FAK_ko", "GAS6 FAK_wt", "GAS6 FAK_ko")
colnames(exp3a_wide) <- input_labels

exp3a <- melt(exp3a_wide,
              measure.vars = input_labels,
              variable.name = "t_by_g",
              value.name = "sprouts") %>%
  na.omit()

# change order of factor levels
t_by_g_levels <- c("PBS FAK_wt", "GAS6 FAK_wt", "PBS FAK_ko", "GAS6 FAK_ko")

exp3a[, c("treatment", "genotype"):= tstrsplit(t_by_g,
                                             " ",
                                             fixed = TRUE)]

exp3a[, t_by_g := factor(t_by_g, levels = t_by_g_levels)]
treatment_levels <- c("PBS", "GAS6")
exp3a[, treatment := factor(treatment, levels = treatment_levels)]
genotype_levels <- c("FAK_wt", "FAK_ko")
exp3a[, genotype := factor(genotype, levels = genotype_levels)]

  5. the variance is less than that expected by the probability model

  6. the variance is greater than that expected by the probability model