# Chapter 16 Predictive Models

This chapter focuses on modeling **observational data** with multiple \(X\) variables, both continuous and categorical. The classical analysis of multiple \(X\) variables is **multiple regression**, sometimes called **multivariable regression** and occasionally, but incorrectly, called **multivariate regression** ("multivariate" refers to multiple \(Y\) variables).

The models in this chapter have the structure

\[\begin{equation} Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_p X_p + \varepsilon \end{equation}\]

where \(p\) is the number of \(X\) variables, or **predictors**, in the model. This equation is easily generalized to generalized linear models, linear mixed models, and generalized linear mixed models.
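
As a minimal sketch of this model structure, the Python code below simulates data from the equation with three predictors (two continuous, one two-level categorical coded as 0/1) and then recovers the coefficients by least squares. The coefficient values, sample size, error standard deviation, and seed are hypothetical choices made only for illustration, and NumPy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for reproducibility only

n = 1000                                  # hypothetical sample size
x1 = rng.normal(size=n)                   # continuous predictor
x2 = rng.normal(size=n)                   # continuous predictor
x3 = rng.integers(0, 2, size=n)           # two-level categorical predictor, dummy coded 0/1

# Hypothetical "true" parameters: beta_0 (intercept), beta_1, beta_2, beta_3
beta = np.array([10.0, 2.0, -1.0, 0.5])

# Model matrix: a column of ones for the intercept, then one column per predictor
X = np.column_stack([np.ones(n), x1, x2, x3])

# Y = beta_0 + beta_1 X_1 + beta_2 X_2 + beta_3 X_3 + epsilon
epsilon = rng.normal(scale=1.0, size=n)
y = X @ beta + epsilon

# Least-squares estimates of the coefficients
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b_hat, 2))
```

The estimates in `b_hat` should be close to, but not exactly equal to, the values in `beta`; the difference is the sampling error contributed by \(\varepsilon\).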

## 16.1 Overfitting

When a model is fit to data, the model coefficients are estimates of the parameters that “generated the data”. The value of an estimate is partly a function of the signal (the parameter) and partly a function of the noise, which is unique to the sample. At a low signal-to-noise ratio, a model is mostly fitting the noise. A measure of how well the model “fits” the data is \(R^2\), which is

\[\begin{equation} R^2 = 1 - \frac{SS_{residual}}{SS_{total}} \end{equation}\]

As \(X\) variables are added to a model, the \(R^2\) necessarily increases (at the very least, it cannot decrease). Part of this increase is due to added signal, but part is due to added noise. If the added noise is more than the added signal, then the model fit (that is, the parameter estimates) increasingly reflects the noise unique to the sample rather than the signal common to every sample. This is the basis of **overfitting**.
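
The behavior of \(R^2\) can be sketched with simulated data. The Python code below is only an illustration under stated assumptions, not the analysis used in this chapter: it computes \(R^2\) from the formula above and refits the model as random predictors are added one at a time to a response that is pure noise. The function name `r_squared`, the sample size, the number of predictors, and the seed are all hypothetical choices, and NumPy is assumed to be available.

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - SS_residual / SS_total for a least-squares fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])   # add the intercept column
    b, *_ = np.linalg.lstsq(design, y, rcond=None)   # least-squares coefficients
    residuals = y - design @ b
    ss_residual = np.sum(residuals ** 2)
    ss_total = np.sum((y - y.mean()) ** 2)
    return 1 - ss_residual / ss_total

rng = np.random.default_rng(1)   # arbitrary seed, for reproducibility only
n, p = 50, 10                    # hypothetical sample size and number of predictors

y = rng.normal(size=n)           # response is pure noise: not a function of any X
X = rng.normal(size=(n, p))      # predictors are pure noise too

# R^2 for the nested models using the first 1, 2, ..., p predictors
r2_path = [r_squared(X[:, :k], y) for k in range(1, p + 1)]
for k, r2 in enumerate(r2_path, start=1):
    print(f"{k:2d} predictors: R^2 = {r2:.3f}")
```

Because there is no signal in these simulated data, any \(R^2\) above zero here comes entirely from fitting noise that is unique to this sample.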

To demonstrate overfitting, I fit completely random \(X\) variables to the lifespans for the control voles.

Think about it this way: if I create fake data in which there are ten \(X\) variables and \(Y\) is a simple column of random, normal variables that are not a function of