Chapter 3 An Introduction to Statistical Modeling

This chapter introduces statistical modeling using the linear model. All students are familiar with the idea of a linear model from learning the equation of a line, which is

\[\begin{equation} Y = mX + b \tag{3.1} \end{equation}\]

where \(m\) is the slope of the line and \(b\) is the \(Y\)-intercept. It is useful to think of equation (3.1) as a function that maps values of \(X\) to values of \(Y\). Using this function, if we input some value of \(X\), we always get the same value of Y as the output.

A linear model is a function, like that in equation (3.1), that is fit to a set of data, often to model a process that generated the data or something like the data. The line in Figure 3.1A is just that, a line, but the line in Figure 3.1B is a linear model fit to the data in Figure 3.1B.

A line vs. a linear model. (A) the line $y=-3.48X + 105.7$ is drawn. (B) A linear model fit to the data. The model coefficients are numerically equal to the slope and intercept of the line in A.

Figure 3.1: A line vs. a linear model. (A) the line \(y=-3.48X + 105.7\) is drawn. (B) A linear model fit to the data. The model coefficients are numerically equal to the slope and intercept of the line in A.

3.1 Two specifications of a linear model

3.1.1 The “error draw” specification

A linear model is commonly specified using

\[\begin{align} Y &= \beta_0 + \beta_1 X + \varepsilon\\ \tag{3.2} \end{align}\]

This specification of a linear model has two parts: the linear predictor \(Y = \beta_0 + \beta_1 X\) and the error \(\varepsilon\). The linear predictor part looks like the equation for a line except that 1) \(\beta_0\) is used for the intercept and \(\beta_1\) for the slope and 2) the intercept term precedes the slope term. This re-labeling and re-arrangement make the notation for a linear model more flexible for more complicated linear models. For example \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon\) is a model where \(Y\) is a function of two \(X\) variables.

As with the equation for a line, the linear predictor part of a linear model is a function that maps a specific value of \(X\) to a value of \(Y\). This mapped value is the expected value given a specific input value of \(X\). This is often written as \(\mathrm{E}[Y|X]\), which is read as “the expected value of \(Y\) given \(X\)”, where “given X” means a specific value of X. Importantly, \(\mathrm{E}[Y|X]\) is the conditional mean, which is the modeled value of \(Y\) for all observations in which \(X\) takes some specific value \(x\).

Introductory textbooks almost always introduce linear models using equation (3.2) above. The key part of the model that is missing from the specification above is a second line \[\begin{equation} \varepsilon \sim N(0, \sigma^2) \end{equation}\]

which is read as “epsilon is distributed as Normal with mean zero and variance sigma squared”. This line explicitly specifies the distribution of the error part. The error part of a linear model is a random “draw” from a normal distribution with mean zero and variance \(\sigma^2\). Think of this as adding some random value to the expected value.

3.1.2 The “conditional draw” specification

A second way of specifying a linear model is

\[\begin{align} y_i &\sim N(\mu_i, \sigma^2)\\ \mathrm{E}(Y|X) &= \mu\\ \mu_i &= \beta_0 + \beta_1 x_i \tag{3.3} \end{align}\]

The first line states that the response variable \(Y\) is a random variable independently drawn from a normal distribution with mean \(\mu\) and variance \(\sigma^2\). This first line is the stochastic part of the statistical model. The second line simply states that \(\mu\) (the greek letter “mu”) from the first line is the conditional mean or conditional expectation. The third line states how \(\mu_i\) is generated given that \(X=x_i\). This is the linear predictor, which is the systematic (or deterministic) part of the statistical model. It is systematic because the same value of \(x_i\) will always generate the same \(\mu_i\).

3.1.3 Comparing the two ways of specifying the linear model

These two ways of specifying the model encourage slightly different ways of thinking about how the data (the response varible \(Y\)) were generated. The error draw specification “generates” data by randomly drawing some error \(\varepsilon_i\) from the specified distribution and adding this to \(x_i\). The conditional draw specification “generates” data by constructing what \(y_i\) “should be” given \(x_i\) (the conditional expection), and then drawing a random variable from a distribution with this expectation. This random draw is \(y_i\) and not the “error”. For the error draw generation, we need only one hat of random numbers, but for the conditional draw generation, we need a hat for each value of \(x_i\).

The conditional draw specification explicitly defines all parameters, including the parameters of the linear predictor (\(\beta_0\) and \(\beta_1\)), the conditional mean \(\mu\) and the variance \(\sigma^2\). The error draw specification only defines the parameters of the linear predictor, and often these are referrred to as “the parameters” in the sense that there are not other parameters. The error draw specification is most useful for thinking about model checking a fit linear model. The random draw specification is more generally useful in that it is easily generalized to more complex models, including hierarchical models, generalized linear models, and Bayesian models. In fact, thinking about a model as a predictor plus error can lead to the misconception that in a generalized linear models, the error has the distribution (binomial, poisson, etc.) modeled.

Although a linear model (or statistical model more generally) is a model of a data-generating process, linear models are not typically used to actually generate any data. Instead, when we use a linear model to understand something about a real dataset, we think of our data as one realization of a process that generates data like ours. A linear model is a model of that process. That said, it is incredibly useful to use linear models to create fake datasets for at least two reasons: to probe our understanding of statistical modeling generally and, more specifically, to check that a model actually creates data like that in the real dataset that we are analyzing.

3.2 What do we call the \(X\) and \(Y\) variables?

The inputs to a linear model (the \(X\) variables) have many names including “independent variables,” “predictor variables,”, “explanatory variables,” “treatment variables,” and “covariates”. The output of a linear model (the \(Y\) variable or variables if the model is multivariate) is the “dependent variable,” “response,” or “outcome.” The \(\beta\) in the linear model are model parameters and if a parameter is multiplied by an \(X\) variable then it is also a coefficient (for example, \(\beta_1\) in model (3.2) is a coefficient). The coefficients of the \(X\) in a linear model (\(\beta_1\) in model (3.2)) are often called “the effects” (so \(\beta_1\) is the effect of \(X_1\)).

3.3 Statistical models are used for prediction, explanation, and description

Researchers typically use statistical models to understand relationships between one or more \(Y\) variables and one or more \(X\) variables. These relationships include

  1. Descriptive modeling. Sometimes a researcher merely wants to describe the relationship between \(Y\) and a set of \(X\) variables, perhaps to discover patterns. For example, the arrival of a spring migrant bird (\(Y\)) as a function of sex (\(X_1\)) and age (\(X_2\)) might show that males and younger individuals arrive earlier. Importantly, if another \(X\) variable is added to the model (or one dropped), the coefficients, and therefore, the precise description, will change. That is, the interpretation of a coefficient as a descriptor is conditional on the other covariates (\(X\) variables) in the model. In a descriptive model, there is no implication of causal effects and the goal is not prediction. Nevertheless, it is very hard for humans to discuss a descriptive model without using causal language, which probably means that it is hard for us to think of these models as mere description. Like natural history, descriptive models are useful as patterns in want of an explanation, using more explicit causal models including experiments.

  2. Predictive modeling. Predictive modeling is very common in applied research. For example, fisheries researchers might model the relationship between population density and habitat variables to predict which subset of ponds in a region are most suitable for brook trout (Salvelinus fontinalis) reintroduction. The goal is to build a model with minimal prediction error, which is the error between predicted and actual values for a future sample. In predictive modeling, the \(X\) (“predictor”) variables are largely instrumental – how these are related to \(Y\) is not a goal of the modeling, although sometimes an investigator may be interested in the relative importance among the \(X\) for predicting \(Y\) (for example, collecting the data may be time consuming, or expensive, or enviromentally destructive, so know which subset of \(X\) are most important for predicting \(Y\) is a useful strategy).

  3. Explanatory (causal) modeling. Very often, researchers are explicitly interested in how the \(X\) variables are causally related to \(Y\). The fisheries researchers that want to reintroduce trout may want to develop and manage a set of ponds to maintain healthy trout populations. This active management requires intervention to change habitat traits in a direction, and with a magnitude, to cause the desired response. This model is predictive – a specific change in \(X\) predicts a specific response in \(Y\) – because the coefficients of the model provide knowledge on how the system functions – how changes in the inputs cause change in the output. Causal interpretation of model coefficients requires a set of strong assumptions about the \(X\) variables in the model. These assumptions are typically met in experimental designs but not observational designs.

With observational designs, biologists are often not very explicit about which of these is the goal of the modeling and use a combination of descriptive, predictive, and causal language to describe and discuss results. Many papers read as if the researchers intend explanatory inference but because of norms within the biology community, mask this intention with “predictive” language. Here, I advocate embracing explicit, explanatory modeling by being very transparent about the model’s goal and assumptions.

3.4 Modeling strategy

  1. choose a model. Statistical modeling includes a diverse array of models, yet almost all methods used by researchers in biology, and all models in this book, are generalizations of the linear model specified in (3.3).

  2. fit the model, in order to estimate the model parameters and the uncertainty in these estimates.

  3. check the model, which means to use a series of diagnostic plots and computations of model output to check that the data reasonably approximate the chosen model.

  4. inference from the model, which means to use the fit parameters to learn, with uncertainty, about the system, or to predict future observations, with uncertainty.

  5. plot the model, which means to plot the estimated parameters (or other results dervived from the estimates) with their uncertainty.

In order to use a statistical model to describe, predict, or explain, we need to fit a model to data in order to estimate the parameters. A linear model fit to some data is

\[\begin{align} \hat{y}_i &= b_0 + b_1 x_i + e_i\\ \tag{3.4} \end{align}\]

\(\hat{y}_i\) (“y hat”) is the predicted value of individual \(i\), \(b_0\) and \(b_1\) are the coefficients of the model fit (though technically \(b_0\) is not a coefficient), and \(e_i\) is the residual. Sometimes \(\hat{y}_i\) is simply called “the prediction”.

If our goal is inference – to infer something about the “population” from the sample using the fit model, then \(\hat{y}_i\) is the point estimate of the parameter \(\mu_i\), the coefficients \(b_0\) and \(b_1\) are point estimates of the parameters \(\beta_0\) and \(\beta_1\), and the standard deviation of the \(e_i\) is an estimate of \(\sigma\). “Population” is in quotes because it is a very abstract concept. Throughout this book, Greek letters refer to a theoretical parameter and Roman letters refer to point estimates.

Throughout this text, I recommend reporting and interpreting interval estimates of the point estimate. A confidence interval is a type of interval estimate. A confidence interval of a parameter is a measure of the uncertainty in the estimate. A 95% confidence interval has a 95% probability (in the sense of long-run frequency) of containing the parameter This probability is a property of the population of intervals that could be computed using the same sampling and measuring procedure. It is not correct, without further assumptions, to state that there is a 95% probability that the parameter lies within the interval. Perhaps a more useful interpretation is that the interval is a compatability interval in that it contains the range of estimates that are compatible with the data, in the sense that a \(t\)-test would not reject the null hypothesis of a difference between the estimate and any value within the interval (this interpretation does not imply anything about the true value).

HarrellPlot of vole data.

Figure 3.2: HarrellPlot of vole data.

For the model fit to the data in Figure 3.1B, the coefficient of \(X\) is the slope of the line. Perhaps surprisingly, we can fit a model like equation (3.2) to data in which the \(X\) variable is categorical. A simple example is the experiment of antioxidants (vitamins C and E) on lifespan in Voles (Fig. 3.2). In this experiment, the \(X\) variable is categorical, with three levels: “Control”, “Vitamin_E” and “Vitamin_C”. Categorical \(X\) variables are often called factors. The trick to using a statistical model with categorical \(X\) is to recode the factor levels into numbers – how this is done is explained in Chapter xxx. When the \(X\) variable is categorical, the coefficients of the \(X\) are differences in group means. The linear model fit to the vole data has two coefficients, one for Vitamin E and one for vitamin C. The estimate and uncertainty of the these two coefficients are shown in the top part of Figure 3.2. The bottom part shows the raw data, as well as the group (factor level) means and the uncertainty in the estimate of these means.

3.5 A mean is the simplest model

The simplest possible model that can be fit to the data is

\[\begin{equation} \mathrm{E}[Y] = b_0 \tag{3.5} \end{equation}\]

which is simply the mean of \(Y\), or, more specifically, the unconditional mean of \(Y\), since its value is not conditional on any value of \(X\).

3.6 Assumptions for inference with a statistical model

Inference refers to using the fit model to generalize from the sample to the population, which assumes that the response is drawn from some specified probability distribution (Normal, or Poisson, or Bernouli, etc.). Throughout this text, I emphasize reporting and interpreting point estimates and confidence intervals. Another kind of inference is a significance test, which is the computation of the probability of “seeing the data” or something more extreme than the data, given a specified null hypothesis. A significance test results in a p-value, which can be reported with the point estimate and confidence interval. Somewhat related to a significance test is a hypothesis test, or what is now often perjoratively called a Null-Hypothesis Signficance Test (NHST), in which the \(p\)-value from a significance test is compared to a pre-specified error rate called \(\alpha\). NHST may be useful for some very limited kinds of science but, in general, is not useful for most biological research and, instead, leads to large misconceptions. A general rule of thumb is, do not compare a reported \(p\)-value to \(\alpha\).

  1. The data were generated by a process that is “linear in the parameters”, which means that the different components of the model are added together. This additive part of the model containing the parameters is the linear predictor in specifications (3.2) and (3.3) above. For example, a cubic polynomial model
\[\begin{equation} \mathrm{E}(Y|X) = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 \end{equation}\]

is a linear model, even though the function is non-linear, because the different components are added. Because a linear predictor is additive, it can be compactly defined using matrix algebra

\[\begin{equation} \mathrm{E}(Y|X) = \mathbf{X}\boldsymbol{\beta} \end{equation}\]

where \(mathbf{X}\) is the model matrix and \(\boldsymbol{\beta}\) is the vector of parameters. We discuss these more in chapter xxx.

A Generalized Linear Model (GLM) has the form \(g(\mu_i) = \eta_i\) where \(\eta\) (the Greek letter “eta”) is the linear predictor

\[\begin{equation} \eta = \mathbf{X}\boldsymbol{\beta} \end{equation}\]

GLMs are extensions of linear models. There are non-linear models that are not linear in the parameters, that is, the predictor is not a simple dot product of the model matrix and a vector of parameters. For example, the Michaelis-Menten model is a non-linear model

\[\begin{equation} \mathrm{E}(Y|X) = \frac{\beta_1 X}{\beta_2 + X} \end{equation}\]

that is non-linear in the parameters because the parts are not added together. This text covers linear models and generalized linear models, but not non-linear models that are also non-linear in the parameters.

  1. The draws from the probability distribution are independent. Independence implies uncorrelated \(Y\) conditional on the \(X\), that is, for any two \(Y\) with the same value of \(X\), we cannot predict the value of one given the value of the other. For example, in the vole data above, uncorrelated implies that we cannot predict the lifespan of one vole within the Vitamin E treatment given the lifespan of another vole in the Vitamin E treatment. For linear models, this assumption is often stated as “independent errors” (the \(\varepsilon\) in model (3.2)) instead of independent observations.

There are lots of reasons that conditional responses might be correlated. In the vole example, perhaps the voles were housed in batches of 5 individuals, and slight differences in the environment among the housing containers, caused all the voles in some containers to have shorter lifespans than expected given their treatment assigment and all voles in other containers to have longer lifespans than expected given their treatment assigment. More generally, if there are measures both within and among experimental units (field sites or humans or rats) then we’d expect the measures within the same unit to err from the model in the same direction. Multiple measures within experimental units (a site or individual) creates “clustered” observations. Lack of independence or clustered observations can be modeled using models with random effects. These models go by many names including linear mixed models (common in Ecology), hierarchical models, multilevel models, and random effects models. A linear mixed model is a variation of model (3.2). This text introduces linear mixed models in chapter xxx.

Measures that are taken from sites that are closer together or measures taken closer in time or measures from more closely related biological species will tend to have more similar values than measures taken from sites that are further apart or from times that are further apart or from species that are less closely related. Space and time and phylogeny create spatial and temporal and phylogenetic autocorrelation. Correlated error due to space or time or phylogeny can be modeled with Generalized Least Squares (GLS) models. A GLS model is a variation of model (3.2).

3.7 Specific assumptions for inference with a linear model

  1. Constant variance or homoskedasticity. The most common way of thinking about this is the error term \(\varepsilon\) has constant variance, which is a short way of saying that random draws of \(\varepsilon\) in model (3.2) are all from the same (or identical) distribution. This is explicitly stated in the second line of model specification (3.2). If we were to think about this using model specification (3.3), then homoskedasticity means that \(\sigma\) in \(N(\mu, \sigma)\) is constant for all observations (or that the conditional probability distributions are identical, where conditional would mean adjusted for \(\mu\))

Many biological processes generate data in which the error is a function of the mean. For example, measures of biological variables that grow, such as lengths of body parts or population size, have variances that “grow” with the mean. Or, measures of counts, such as the number of cells damaged by toxin, the number of eggs in a nest, or the number of mRNA transcripts per cell have variances that are a function of the mean. Heteroskedastic error can be modeled with Generalized Least Squares, a generalization of the linear model, and with Generalized Linear Models (GLM), which are “extensions” of the classical linear model.

  1. Normal or Gaussian probability distribution. As above, the most common way of thinking about this is the error term \(\varepsilon\) is Normal. Using model specification (3.3), we’d say the conditional probablity distribution of the response is normal. A normal probability distribution implies that 1) the response is continuous and 2) the conditional probability is symmetric around \(mu_i\). If the conditional probability distribution has a long left or right tail it is skewed left or right. Counts (number of cells, number of eggs, number of mRNA transcripts) and binary responses (sucessful escape or sucessful infestation of host) are not continuous and often often have asymmetric probablity distributions that are skewed to the right and while sometimes both can be reasonably modeled using a linear model they are more often modeled using generalized linear models, which, again, is an extension of the linear model in equation (3.3).

A common misconception is that inference from a linear model assumes that the unconditional response (\(Y\)) is normally distributed. Both the “linear model” and “statistical model” ways of specifying the model show precisely why this conception is wrong. Model (3.2) states explicitly that it is the error that has the normal distribution – the distribution of \(Y\) is a mix of the distribution of \(X\) and the error. Model (3.3) states that the conditional outcome has a normal distribution, that is, the distribution after adjusting for variation in \(X\).

3.8 “Statistical model” or “regression model”?

Statistical modeling terminology can be confusing. The \(X\) variables in a statistical model may be quantitative (continuous or integers) or categorical (names or qualitative amounts) or some mix of the two. Linear models with all quantitative independent variables are often called “regression models.” Linear models with all categorical independent variables are often called “ANOVA models.” Linear models with a mix of quantitative and categorical variables are often called “ANCOVA models” if the focus is on one of the categorical \(X\) or “regression models” if there tend to be many independent variables. Other patterns occur. For example “ANCOVA models” often include interaction effects but “regression models” rarely do. To avoid thinking of statistical analysis as “regression vs. ANOVA” (the type of thinking encouraged by many textbooks in biostatistics), I will most often use the term “statistical model” for general usage, and use a more specific term only to emphasize something about the model in that particluar context.

3.9 GLM vs. GLM vs. GLS

Linear models are sometimes called “general linear models” with the abbreviation GLM. This is unfortunate because the abbreviation GLM usually refers to generalized linear models. Regardless, don’t confuse either version of GLM with GLS, which is the abbreviation of generalized least squares. GLS generalizes the linear model to allow for heteroskedastic and/or correlated error (using the “linear model” way of thinking about model specification)