# Chapter 8 A linear model with a single, categorical X

## 8.1 A linear model with a single, categorical X is the engine behind a single factor (one-way) ANOVA and a t-test is a special case of this model.

To introduce modeling with a single, categorical $$X$$ variable, I’ll use the vole data from

2. File: “RSBL-2013-0432 vole data.xlsx”
3. Sheet: “COLD VOLES LIFESPAN”

Normal cellular metabolism creates reactive oxygen species (ROS) that can disrupt cell function and potentially cause cell damage. Anti-oxidants are molecules that bind ROS, inhibiting their ability to disrupt cell activity. A working hypothesis for many years has been that supplemental anti-oxidants should improve cell function and, scaling up, whole-animal function (such as lifespan). The vole study explores this hypothesis by supplementing the diet of the short-tailed field vole (*Microtus agrestis*) with Vitamins C and E, which are anti-oxidants.

The goal of the study is to measure the effect of anti-oxidants on lifespan. The researchers randomly assigned the voles to one of three treatment levels: “control”, “vitamin E”, and “vitamin C”. The variable $$treatment$$ is a single, categorical $$X$$ variable. Categorical variables are often called factors and the treatment levels are often called factor levels. There are no units to a categorical $$X$$ variable (even though a certain amount of each anti-oxidant was supplemented). The response ($$Y$$) is $$lifespan$$ measured in days.

The linear model with a categorical $$X$$ variable with three levels is not immediately obvious, and so I don’t present the model until after showing the table of model coefficients.

### 8.1.1 Table of model coefficients

Here is the table of coefficients from the fit linear model:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:---|---:|---:|---:|---:|
| (Intercept) | 503.4 | 27.4 | 18.4 | 0.000 |
| treatmentvitamin_E | -89.9 | 52.5 | -1.7 | 0.090 |
| treatmentvitamin_C | -115.1 | 54.5 | -2.1 | 0.037 |

The table has estimates for three parameters. The first estimate (the intercept) is the mean response in the reference level. Here the reference level is the “control” group. The additional estimates are the differences in the mean between each of the other treatment levels and the reference level. These are the “effects” in the model. So typically with categorical $$X$$, when we speak of an effect we mean a difference in means. These estimates and their meaning are illustrated in Figure 8.1.
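To make this reading of the coefficients concrete, here is a minimal sketch that fits the model to only the six lifespan values listed in the dummy-variable illustration later in this section. This is not the full vole data set, so the numbers will not match the table above. The object names `vole_mini`, `m1`, `b`, and `group_means` are made up for this illustration.

```r
# Six lifespan values (days) from the illustrative rows in this chapter.
# This is NOT the full vole data; it only demonstrates what the coefficients mean.
vole_mini <- data.frame(
  lifespan = c(621, 865, 583, 561, 315, 157),
  treatment = factor(
    rep(c("control", "vitamin_E", "vitamin_C"), each = 2),
    levels = c("control", "vitamin_E", "vitamin_C") # first level is the reference
  )
)

# Fit the linear model with a single, categorical X
m1 <- lm(lifespan ~ treatment, data = vole_mini)
b <- coef(m1)

# Group means computed directly from the data
group_means <- tapply(vole_mini$lifespan, vole_mini$treatment, mean)

# The intercept is the mean of the reference (control) group
b["(Intercept)"]
group_means["control"]

# The other coefficients are differences in means from the control group
b["treatmentvitamin_E"]
group_means["vitamin_E"] - group_means["control"]

b["treatmentvitamin_C"]
group_means["vitamin_C"] - group_means["control"]
```

Each pair of printed values should agree, which is all that "the effect is a difference in means" says.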

(Note: The default in R is to set the level that is first alphabetically as the reference level. In the vole data, “control” comes before “vitamin_E” and “vitamin_C” alphabetically, and so by default, it is the reference level. This makes sense for these data because we want to compare the lifespan of the vitamin E and vitamin C groups to that of the control group. The reference level can be changed, of course.)
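As a sketch of how the reference level can be inspected and changed, the base R functions `levels`, `factor`, and `relevel` are enough. The vector `treatment` and the derived object names below are made up for this illustration and are not the vole data.

```r
# A small, made-up treatment vector (not the vole data)
treatment <- factor(c("control", "vitamin_E", "vitamin_C", "control"))

# By default the levels are sorted alphabetically, so "control" comes first
# and is therefore the reference level
levels(treatment)

# Set the full level order explicitly; the first level is the reference
treatment_ordered <- factor(
  treatment,
  levels = c("control", "vitamin_E", "vitamin_C")
)
levels(treatment_ordered)

# Change only the reference level, here to vitamin_E purely for illustration
treatment_ref_E <- relevel(treatment, ref = "vitamin_E")
levels(treatment_ref_E)
```

Setting the order explicitly with `factor` also controls the order in which the non-reference levels appear in the coefficient table.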

### 8.1.2 The linear model

We can see an immediate difference between the coefficient table for a linear model fit to a single, categorical $$X$$ and that for a single, continuous $$X$$. For the latter, there is a single coefficient for $$X$$. For the former, there is a coefficient for each level of the categorical $$X$$ except the “reference” level.

The linear model for a single, categorical $$X$$ with three factor levels is

$$$lifespan = \beta_0 + \beta_1 vitamin\_E + \beta_2 vitamin\_C + \varepsilon$$$

and the estimates in the coefficient table are the coefficients of the fit model

$$$lifespan_i = b_0 + b_1 vitamin\_E + b_2 vitamin\_C + e_i \tag{8.1}$$$

Remember, $$b_0$$ is the mean of the control group, $$b_1$$ is the difference in means between the vitamin E and control groups, and $$b_2$$ is the difference in means between the vitamin C and control groups (Figure 8.1).
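As a worked example, the estimated group means can be recovered from the coefficient table by setting the dummy variables to zero or one. The values below use the rounded estimates from the table, so they are approximate.

$$$\begin{aligned}
\overline{lifespan}_{control} &= b_0 = 503.4 \\
\overline{lifespan}_{vitamin\_E} &= b_0 + b_1 = 503.4 + (-89.9) = 413.5 \\
\overline{lifespan}_{vitamin\_C} &= b_0 + b_2 = 503.4 + (-115.1) = 388.3
\end{aligned}$$$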

In this model, $$vitamin\_E$$ and $$vitamin\_C$$ are dummy variables: each contains a one if the observation is from that treatment level and a zero otherwise. This is called dummy coding or treatment coding. The lm function creates these dummy variables behind the scenes, in something called the model matrix, which we’ll cover in the next chapter. You won’t see these columns in your data, but if you did, they would look something like this

| lifespan | treatment | vitamin_E | vitamin_C |
|---:|:---|---:|---:|
| 621 | control | 0 | 0 |
| 865 | control | 0 | 0 |
| 583 | vitamin_E | 1 | 0 |
| 561 | vitamin_E | 1 | 0 |
| 315 | vitamin_C | 0 | 1 |
| 157 | vitamin_C | 0 | 1 |
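The dummy columns can be made visible with `model.matrix`, which builds the same matrix that lm uses internally. This is a small sketch using only the six rows shown above. The names `vole_mini` and `X` are made up for this illustration.

```r
# The six illustrative rows from the listing above (not the full vole data)
vole_mini <- data.frame(
  lifespan = c(621, 865, 583, 561, 315, 157),
  treatment = factor(
    rep(c("control", "vitamin_E", "vitamin_C"), each = 2),
    levels = c("control", "vitamin_E", "vitamin_C") # control is the reference
  )
)

# Build the model matrix for a model with treatment as the only X
X <- model.matrix(~ treatment, data = vole_mini)

# Columns: an intercept column of ones plus one dummy column
# for each non-reference level
colnames(X)
X

# Show the dummy columns next to the data (dropping the intercept column)
cbind(vole_mini, X[, -1])
```

R names each dummy column by pasting the factor name onto the level name, which is why the coefficient table has rows labeled treatmentvitamin_E and treatmentvitamin_C.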

There are alternative coding methods. Dummy coding is the default in R. Note that the method of coding can make a difference in an ANOVA table, and many published papers using R have reported incorrect interpretations of ANOVA table output. This is both getting ahead of ourselves and somewhat moot, because I don’t advocate publishing ANOVA tables.
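As one illustration of an alternative coding, the sketch below refits the same six illustrative rows with sum-to-zero coding (`contr.sum`, one of several coding schemes built into R). The coefficients change meaning, but the fitted group means do not. The names `vole_mini`, `m_dummy`, and `m_sum` are made up for this illustration.

```r
# The six illustrative rows used earlier in this section (not the full vole data)
vole_mini <- data.frame(
  lifespan = c(621, 865, 583, 561, 315, 157),
  treatment = factor(
    rep(c("control", "vitamin_E", "vitamin_C"), each = 2),
    levels = c("control", "vitamin_E", "vitamin_C")
  )
)

# The coding R uses by default for unordered and ordered factors
getOption("contrasts")

# Default dummy (treatment) coding: intercept = control mean
m_dummy <- lm(lifespan ~ treatment, data = vole_mini)

# Sum-to-zero coding: intercept = mean of the three group means,
# and the other coefficients are deviations from that mean
m_sum <- lm(
  lifespan ~ treatment,
  data = vole_mini,
  contrasts = list(treatment = "contr.sum")
)

coef(m_dummy)
coef(m_sum)

# The two codings describe the same fitted group means
all.equal(fitted(m_dummy), fitted(m_sum))
```

Because the coefficients answer different questions under different codings, it is worth checking which coding was used before interpreting any coefficient or ANOVA table.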