Chapter 9 Model Checking

We us a linear model (or statistical model more generally) to infer effects or predict future outcomes. Our inference is uncertain. Given some modeling assumptions, we can quantify this uncertainty with standard errors, and from these standard errors we can compute confidence intervals and p-values.

9.1 All statistical analyses should be followed by model checking

Model checking is a series of graphical analyses to explore how well the data approximate the fit model and model assumptions.

Many introductory textbooks in biostatistics recommend that researchers uses formal hypothesis tests to check assumptions. A Shapiro-Wilk test is often used to test “normality” (the residuals from the model fit are normally distributed) and a Bartlett’s or Levine test is often used to test homogenity of variances (the residuals from different groups have equal variance). These textbooks often recommend to 1) start with the test of the assumption (the “assumption test”) and then use the results to 2a) choose a “correct” hypothesis test or 2b) transform the data so that it “conforms” to the assumptions of the parametric test. For example, if the \(p < 0.05\) for the Shapiro-Wilk then use a non-parametric test such as a Mann-Whitney U/Wilcoxan test. Or log-transform the response and then use a standard t-test. The logic of this two-test workflow (assumption test followed by the hypothsis test) suffers from several issues. First, the subsequent p-value of the hypothesis test is not valid because this p-value is the long-run frequency of a test-statistic as large or larger than the observed statistic conditional on the null – not conditional on the subset of nulls with \(p > 0.05\) in the assumption test. Second, no real population from which some data are drawn is truly normal or homoskedastic. That is, with increased sample size, any empirical data will “fail” a test of normality or homogeneity of variance. We shouldn’t dichotomize data into “normal” and “not normal”. Instead, data falls on a continuum from something approximating normal or homoskedastic to something far from normal or homoskedastic. Third, and most importantly, our analysis should follow the logic of our goals. If our goal is the estimation of effects, we cannot get meaningful estimates from a non-parametric test (with a few exceptions) or a transformed response, as these methods are entirely about computing a “correct” p-value. Alternatives to classic non-parametric tests and transformations are bootstrap estimates of confidence limits, permutation tests, and generalized linear models.

The random error specification of the linear model is

\[\begin{align} Y &= \beta_0 + \beta_1 X + \varepsilon\\ \varepsilon &\sim N(0, \sigma) \tag{9.1} \end{align}\]

where \(N(0, \sigma)\) is read as “normal distribution with mean zero and standard deviation sigma”. Any inference about the parameter \(\beta_1\) (such as confidence intervals or hypothesis tests) assumes that the error (\(\varepsilon\)) is IID Normal where IID is independent and identically distributed and Normal refers to the Normal (or Gaussian) distribution.

  1. Independent means that the error for one case cannot be predicted from the error of any other case. There are lots or reasons that errors might be correlated. For example, measures that are taken from sites that are closer together or measures taken closer in time or measures from more closely related biological species will tend to have more similar error than measures taken from sites that are further apart or from times that are further apart or from species that are less closely related. Space and time and phylogeny create spatial and temporal and phylogenetic autocorrelation. Correlated error due to space or time or phylogeny can be modeled with Generalized Least Squares (GLS) models. A GLS model is a variation of model (9.1).

If there are measures both within and among field sites (or humans or rats) then we’d expect the measures within the same site (or human or rat) to err from the model in the same direction. Multiple measures within experimental units (a site or individual) creates “clusters” of error. Lack of independence or clustered error can be modeled using models with random effects. These models go by many names including linear mixed models (common in Ecology), hierarchical models, multilevel models, and random effects models. A linear mixed model is a variation of model (9.1).

  1. Identical means that the errors are “drawn” from the same distribution. Since the model is a linear model, this distribution is a Normal distribution. A consequence of “indentical” is that the error variance is homoskedastic, or constant, or independent of \(X\). If the error variance differs among the \(X\) then the errors are heteroskedastic. Many biological processes generate data in which the error is a function of the mean. For example, measures of biological variables that grow, such as lengths of body parts or population size, have variances that are “grow” with the mean. Or, measures of counts, such as the number of cells damaged by toxin, the number of eggs in a nest, or the number of mRNA transcripts per cell have variances that are a function of the mean. Both growth and count measures can be reasonably modeled using a linear model they are more often modeled using a generalized linear model (GLM), which is an extension of the linear model in equation (9.1). Heteroskedasitc error arising for other reasons, both biological and experimental, can be modeled with Generalized Least Squares (GLS) or with linear mixed models. GLS models are variations of model (9.1).

  2. Normal (Gaussian) error means that 1) the response is continuous and 2) the probability of sampling an individual measuring 0.5 units below the population mean is the same as the probability of sampling an individual measuring 0.5 units above the population mean. Counts (number of cells, number of eggs, number of mRNA transcripts) and binary responses (sucessful escape or sucessful infestation of host) are not continous and often often have asymmetric probablity distributions that are skewed to the right and while sometimes both can be reasonably modeled using a linear model they are more often modeled using a generalized linear model (GLM), which, again, is an extension of the linear model in equation (9.1).

A common misconception is that inference from a linear model assumes that the response (\(Y\)) is normally distributed. Models (9.1) and (5.3) show precisely why this conception is wrong. Model (9.1) states explicitly that it is the error that has the normal distribution – the distribution of \(Y\) is a mix of the distribution of \(X\) and the error. Model (5.3) states that the conditional outcome has a normal distribution, that is, the distribution after adjusting for variation in \(X\).

The linear model

\[\begin{equation} Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \end{equation}\]

has several important assumptions for the computation of correct standard errors and any statistic derived from standard errors including a confidence interval and a \(p\)-value.

\[\begin{align} Y &\sim N(\mu, \sigma)\\ \mathrm{E}(Y|X) = \mu \mu &= \beta_0 + \beta_1 X \tag{5.3} \end{align}\]
  1. Normality: the errors \(\varepsilon\) have mean zero and a normal distribution. Often this is mistakenly interpreted as the response \(Y\) has to be normally distributed and sometimes that even \(X\) has to be normally distributed. But the assumption applies only to the residuals of the fit model.

  2. Independence: the errors are independent of each other

  3. Homoskedasticity: the variance of the errors is homogenous

  4. the errors are identically distributed

  5. the error is independent of any \(X\)

  6. the error is independent of the response

Here I model fake data in which \(X\) is niformly distributed. \(Y\) is a linear function of \(X\) + normally distributed error, so the \(Y\) is also uniformly distributed but the error is normal. The histogram of \(Y\) shows this uniform distribution clearly. The histogram of the residuals from the fit model show the normal distribution of the error clearly. d

n <- 1000
x <- runif(n)*100
sigma <- 1.0
y <- 5 + 1.2*x + rnorm(n, sd=sigma)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

fit <- lm(y ~ x)
## Call:
## lm(formula = y ~ x)
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3213 -0.6956  0.0173  0.6719  3.4108 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.050577   0.066996   75.39   <2e-16 ***
## x           1.199067   0.001164 1030.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 1.046 on 998 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.9991 
## F-statistic: 1.062e+06 on 1 and 998 DF,  p-value: < 2.2e-16
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

There are tests for normality but I do not recommend these because the test doesn’t add any value from what one can gain by simply inspecting the residuals.

n <- 30
h <- runif(n)*90 + 10
w <- h^3
sigma <- rnorm(n)*.2
logh <- log(h)
logw <- log(w) + sigma
qplot(logh, logw)

x <- h
y <- exp(logw)
qplot(x=x, y=y)

fit <- lm(y ~ x)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.