Elements of Applied Biostatistics
Preface
0.1
Math
0.2
R and programming
Part I: Getting Started
1
Getting Started – R Projects and R Markdown
1.1
R vs R Studio
1.2
Download and install R and R Studio
1.3
Open R Studio and modify the workspace preference
1.4
If you didn’t modify the workspace preferences from the previous section, go back and do it
1.5
R Markdown in a nutshell
1.6
Install R Markdown
1.7
Importing Packages
1.8
Create an R Studio Project for this textbook
1.9
Working on a project, in a nutshell
1.10
Create an R Markdown file for this Chapter
1.10.1
Modify the yaml header
1.10.2
Modify the “setup” chunk
1.10.3
Create a “fake-data” chunk
1.10.4
Create a “plot” chunk
1.10.5
Knit
2
Analyzing experimental data with a linear model
2.1
This text is about the estimation of treatment effects and the uncertainty in our estimates using linear models. This raises the question, what is “an effect”?
Background physiology to the experiments in Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
Analyses for Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
2.2
Setup
2.3
Data source
2.4
control the color palette
2.5
useful functions
2.6
figure 2b – effect of ASK1 deletion on growth (body weight)
2.6.1
figure 2b – import
2.6.2
figure 2b – exploratory plots
2.7
Figure 2c – Effect of ASK1 deletion on final body weight
2.7.1
Figure 2c – import
2.7.2
Figure 2c – check own computation of weight change vs. imported value
2.7.3
Figure 2c – exploratory plots
2.7.4
Figure 2c – fit the model: m1 (lm)
2.7.5
Figure 2c – check the model: m1
2.7.6
Figure 2c – fit the model: m2 (gamma glm)
2.7.7
Figure 2c – check the model, m2
2.7.8
Figure 2c – inference from the model
2.7.9
Figure 2c – plot the model
2.7.10
Figure 2c – report
2.8
Figure 2d – Effect of ASK1 KO on glucose tolerance (whole curve)
2.8.1
Figure 2d – Import
2.8.2
Figure 2d – exploratory plots
2.8.3
Figure 2d – fit the model
2.8.4
Figure 2d – check the model
2.8.5
Figure 2d – inference
2.8.6
Figure 2d – plot the model
2.9
Figure 2e – Effect of ASK1 deletion on glucose tolerance (summary measure)
2.9.1
Figure 2e – massage the data
2.9.2
Figure 2e – exploratory plots
2.9.3
Figure 2e – fit the model
2.9.4
Figure 2e – check the model
2.9.5
Figure 2e – inference from the model
2.9.6
Figure 2e – plot the model
2.10
Figure 2f – Effect of ASK1 deletion on glucose infusion rate
2.10.1
Figure 2f – import
2.10.2
Figure 2f – exploratory plots
2.10.3
Figure 2f – fit the model
2.10.4
Figure 2f – check the model
2.10.5
Figure 2f – inference
2.10.6
Figure 2f – plot the model
2.11
Figure 2g – Effect of ASK1 deletion on tissue-specific glucose uptake
2.11.1
Figure 2g – import
2.11.2
Figure 2g – exploratory plots
2.11.3
Figure 2g – fit the model
2.11.4
Figure 2g – check the model
2.11.5
Figure 2g – inference
2.11.6
Figure 2g – plot the model
2.12
Figure 2h
2.13
Figure 2i – Effect of ASK1 deletion on liver TG
2.13.1
Figure 2i – fit the model
2.13.2
Figure 2i – check the model
2.13.3
Figure 2i – inference
2.13.4
Figure 2i – plot the model
2.13.5
Figure 2i – report the model
2.14
Figure 2j
Part III: R fundamentals
3
Data – Reading, Wrangling, and Writing
3.1
Learning from this chapter
3.2
Working in R
3.2.1
Importing data
3.3
Data wrangling
3.3.1
Reshaping data – Wide to long
3.3.2
Reshaping data – Transpose (turning the columns into rows)
3.3.3
Combining data
3.3.4
Subsetting data
3.3.5
Wrangling columns
3.3.6
Missing data
3.4
Saving data
3.5
Exercises
4
Plotting Models
4.1
Pretty good plots show the model and the data
4.1.1
Pretty good plot component 1: Modeled effects plot
4.1.2
Pretty good plot component 2: Modeled mean and CI plot
4.1.3
Combining Effects and Modeled mean and CI plots – an Effects and response plot
4.1.4
Some comments on plot components
4.2
Working in R
4.2.1
Source data
4.2.2
How to plot the model
4.2.3
How to use the Plot the Model functions
4.2.4
How to generate a Response Plot using ggpubr
4.2.5
How to generate a Response Plot with a grid of treatments using ggplot2
4.2.6
How to generate an Effects Plot
4.2.7
How to combine the response and effects plots
4.2.8
How to add the interaction effect to response and effects plots
Part IV: Some Fundamentals of Statistical Modeling
5
Variability and Uncertainty (Standard Deviations, Standard Errors, Confidence Intervals)
5.1
The sample standard deviation vs. the standard error of the mean
5.1.1
Sample standard deviation
5.1.2
Standard error of the mean
5.2
Using Google Sheets to generate fake data to explore the standard error
5.2.1
Steps
5.3
Using R to generate fake data to explore the standard error
5.3.1
part I
5.3.2
part II - means
5.3.3
part III - how do SD and SE change as sample size (n) increases?
5.3.4
Part IV – Generating fake data with for-loops
5.4
Bootstrapped standard errors
5.4.1
An example of bootstrapped standard errors using vole data
5.5
Confidence Interval
5.5.1
Interpretation of a confidence interval
6
P-values
6.1
A p-value is the probability of sampling a value as or more extreme than the test statistic if sampling from a null distribution
6.2
Pump your intuition – Creating a null distribution
6.3
A null distribution of t-values – the t distribution
6.4
P-values from the perspective of permutation
6.5
Parametric vs. non-parametric statistics
6.6
frequentist probability and the interpretation of p-values
6.6.1
Background
6.6.2
This book covers frequentist approaches to statistical modeling and when a probability arises, such as the p-value of a test statistic, this will be a frequentist probability.
6.6.3
Two interpretations of the p-value
6.6.4
NHST
6.7
Some major misconceptions of the p-value
6.7.1
Misconception: p is the probability that the null is true and \(1-p\) is the probability that the alternative is true
6.7.2
Misconception: a p-value is repeatable
6.7.3
Misconception: 0.05 is the lifetime rate of false discoveries
6.7.4
Misconception: a low p-value indicates an important effect
6.7.5
Misconception: a low p-value indicates high model fit or high predictive capacity
6.8
What the p-value does not mean
6.9
Recommendations
6.9.1
Primary sources for recommendations
6.10
Problems
7
Errors in inference
7.1
Classical NHST concepts of wrong
7.1.1
Type I error
7.1.2
Power
7.2
A non-Neyman-Pearson concept of power
7.2.1
Estimation error
7.2.2
Coverage
7.2.3
Type S error
7.2.4
Type M error
Part V: Introduction to Linear Models
8
An introduction to linear models
8.1
Two specifications of a linear model
8.1.1
The “error draw” specification
8.1.2
The “conditional draw” specification
8.1.3
Comparing the error-draw and conditional-draw ways of specifying the linear model
8.1.4
ANOVA notation of a linear model
8.2
A linear model can be fit to data with continuous, discrete, or categorical \(X\) variables
8.2.1
Fitting linear models to experimental data in which the \(X\) variable is continuous or discrete
8.2.2
Fitting linear models to experimental data in which the \(X\) variable is categorical
8.3
Statistical models are used for prediction, explanation, and description
8.4
What do we call the \(X\) and \(Y\) variables?
8.5
Modeling strategy
8.6
Predictions from the model
8.7
Inference from the model
8.7.1
Assumptions for inference with a statistical model
8.7.2
Specific assumptions for inference with a linear model
8.8
“linear model,” “regression model,” or “statistical model”?
9
Linear models with a single, continuous X (“regression”)
9.1
A linear model with a single, continuous X is classical “regression”
9.1.1
Analysis of “green-down” data
9.1.2
Learning from the green-down example
9.1.3
Using a regression model for “explanation” – causal models
9.1.4
Using a regression model for prediction – prediction models
9.1.5
Using a regression model for creating a new response variable – comparing slopes of longitudinal data
9.1.6
Using a regression model for calibration
9.2
Working in R
9.2.1
Fitting the linear model
9.2.2
Getting to know the linear model: the summary function
9.2.3
Inference – the coefficient table
9.2.4
How good is our model? – Model checking
9.2.5
Plotting models with continuous X
9.2.6
Creating a table of predicted values and 95% prediction intervals
9.3
Hidden code
9.3.1
Import and plot of fig2c (ecosystem warming experimental) data
9.3.2
Import and plot of efig_3d (ecosystem warming observational) data
9.3.3
Import and plot of fig1f (methionine restriction) data
9.4
Try it
9.4.1
A prediction model from the literature
9.5
Intuition pumps
9.5.1
Correlation and \(R^2\)
10
Linear models with a single, categorical X (“t-tests” and “ANOVA”)
10.1
A linear model with a single, categorical X variable estimates the effects of the levels of X on the response
10.1.1
Example 1 (fig3d) – two treatment levels (“groups”)
10.1.2
Understanding the analysis with two treatment levels
10.1.3
Example 2 – three treatment levels (“groups”)
10.1.4
Understanding the analysis with three (or more) treatment levels
10.2
Working in R
10.2.1
Fit the model
10.2.2
Controlling the output in tables using the coefficient table as an example
10.2.3
Using the emmeans function
10.2.4
Using the contrast function
10.2.5
How to generate ANOVA tables
10.3
Hidden Code
10.3.1
Importing and wrangling the fig_3d data for example 1
10.3.2
Importing and wrangling the fig2a data for example 2
11
Model Checking
11.1
All statistical analyses should be followed by model checking
11.2
Linear model assumptions
11.2.1
A bit about IID
11.3
Diagnostic plots use the residuals from the model fit
11.3.1
Residuals
11.3.2
A Normal Q-Q plot is used to check for characteristic departures from Normality
11.3.3
Mapping QQ-plot departures from Normality
11.3.4
Model checking homoskedasticity
11.4
Using R
11.4.1
Normal Q-Q plots
12
Violations of independence, homogeneity, or Normality
12.1
Lack of independence
12.1.1
A paired t-test is a special case of a linear model for correlated data with two groups
12.1.2
Inferences from the linear mixed model and paired t-tests are not the same when there are more than two groups
12.2
Heterogeneity of variances
12.2.1
When groups of the focal test have >> variance
12.3
The conditional response isn’t Normal
12.3.1
Example 1 (fig6f) – Linear models for non-normal count data
12.3.2
My data aren’t normal, what is the best practice?
12.4
Hidden Code
12.4.1
Importing and wrangling the fig1b data
12.4.2
Importing and wrangling the fig2a data
12.4.3
Importing and wrangling the fig6f data
13
Issues in inference
13.1
Comparing change from baseline (pre-post)
13.1.1
Example 1 (DPP4 fig4c)
13.1.2
What if the data in example 1 were from an experiment where the treatment was applied prior to the baseline measure?
13.1.3
Example 2 (XX males fig1c)
13.1.4
Regression to the mean
13.2
Longitudinal designs with more than one post-baseline measure
13.2.1
Area under the curve (AUC)
13.3
Comparing responses normalized to a standard
13.4
Comparing ratios
13.4.1
The ratio is a density
13.5
Don’t do this stuff
13.6
Researcher degrees of freedom
13.7
Hidden code
13.7.1
Import Fig4c data
13.7.2
XX males fig1c
13.7.3
Generation of fake data to illustrate regression to the mean
13.7.4
Import fig3f
Part VI: More than one \(X\) – Multivariable Models
14
Linear models with added covariates (“ANCOVA”)
14.1
Adding covariates can increase the precision of the effect of interest
14.2
Understanding a linear model with an added covariate – heart necrosis data
14.2.1
Fit the model
14.2.2
Plot the model
14.2.3
Interpretation of the model coefficients
14.2.4
Everything adds up
14.2.5
Interpretation of the estimated marginal means
14.2.6
Interpretation of the contrasts
14.2.7
Adding the covariate improves inference
14.3
Understanding interaction effects with covariates
14.3.1
Fit the model
14.3.2
Plot the model with interaction effect
14.3.3
Interpretation of the model coefficients
14.3.4
What is the effect of a treatment, if interactions are modeled? – it depends.
14.3.5
Which model do we use, \(\mathcal{M}_1\) or \(\mathcal{M}_2\)?
14.4
Understanding ANCOVA tables
14.5
Working in R
14.5.1
Importing the heart necrosis data
14.5.2
Fitting the model
14.5.3
Using the emmeans function
14.5.4
ANCOVA tables
14.5.5
Plotting the model
14.6
Best practices
14.6.1
Do not use a ratio of part:whole as a response variable – instead add the denominator as a covariate
14.6.2
Do not use change from baseline as a response variable – instead add the baseline measure as a covariate
14.6.3
Do not “test for balance” of baseline measures
14.7
Best practices 2: Use a covariate instead of normalizing a response
15
Linear models with two categorical \(X\) – Factorial linear models (“two-way ANOVA”)
15.1
A linear model with crossed factors estimates interaction effects
15.1.1
An interaction is a difference in simple effects
15.1.2
A linear model with crossed factors includes interaction effects
15.1.3
factorial experiments are frequently analyzed as flattened linear models in the experimental biology literature
15.2
Example 1 – Estimation of a treatment effect relative to a control effect (“Something different”) (Experiment 2j glucose uptake data)
15.2.1
Understand the experimental design
15.2.2
Fit the linear model
15.2.3
Inference
15.2.4
Plot the model
15.3
Understanding the linear model with crossed factors 1
15.3.1
What the coefficients are
15.3.2
The interaction effect is something different
15.3.3
Why we want to compare the treatment effect to a control effect
15.3.4
The order of the factors in the model tells the same story differently
15.3.5
Power for the interaction effect is less than that for simple effects
15.3.6
Planned comparisons vs. post-hoc tests
15.4
Example 2: Estimation of the effect of background condition on an effect (“it depends”) (Experiment 3e lesion area data)
15.4.1
Understand the experimental design
15.4.2
Fit the linear model
15.4.3
Check the model
15.4.4
Inference from the model
15.4.5
Plot the model
15.5
Understanding the linear model with crossed factors 2
15.5.1
Conditional and marginal means
15.5.2
Simple (conditional) effects
15.5.3
Marginal effects
15.5.4
The additive model
15.5.5
Reduce models for the right reason
15.5.6
The marginal means of an additive linear model with two factors can be weird
15.6
Example 3: Estimation of synergy (“More than the sum of the parts”) (Experiment 1c JA data)
15.6.1
Examine the data
15.6.2
Fit the model
15.6.3
Model check
15.6.4
Inference from the model
15.6.5
Plot the model
15.6.6
Alternative plot
15.7
Understanding the linear model with crossed factors 3
15.7.1
Thinking about the coefficients of the linear model
15.8
Issues in inference
15.8.1
For pairwise comparisons, it doesn’t matter if you analyze the data with a factorial or a flattened linear model
15.8.2
Adjusting p-values for multiple tests
15.9
Two-way ANOVA
15.9.1
How to read a two-way ANOVA table
15.9.2
What do the main effects in an ANOVA table mean?
15.10
Working in R
15.10.1
Model formula
15.10.2
Using the emmeans function
15.10.3
Contrasts
15.10.4
Practice safe ANOVA
15.10.5
Better to avoid these
15.11
Hidden Code
15.11.1
Import exp2j (Example 1)
Part VII: Expanding the Linear Model
16
Linear models for longitudinal experiments – I. pre-post designs
16.1
Best practice models
16.2
Common alternatives that are not recommended
16.3
Advanced models
16.4
Understanding the alternative models
16.4.1
(M1) Linear model with the baseline measure as the covariate (ANCOVA model)
16.4.2
(M2) Linear model of the change score (change-score model)
16.4.3
(M3) Linear model of post-baseline values without the baseline as a covariate (post model)
16.4.4
(M4) Linear model with factorial fixed effects (fixed-effects model)
16.4.5
(M5) Repeated measures ANOVA
16.4.6
(M6) Linear mixed model
16.4.7
(M7) Linear model with correlated error
16.4.8
(M8) Constrained fixed effects model with correlated error (cLDA model)
16.4.9
Comparison table
16.5
Example 1 – a single post-baseline measure (pre-post design)
16.6
Working in R
16.7
Hidden code
16.7.1
Import and wrangle mouse sociability data
17
Linear models for count data – Generalized Linear Models I
17.1
The generalized linear model
17.2
Count data example – number of trematode worm larvae in eyes of threespine stickleback fish
17.2.1
Modeling strategy
17.2.2
Checking the model I – a Normal Q-Q plot
17.2.3
Checking the model II – scale-location plot for checking homoskedasticity
17.2.4
Two distributions for count data – Poisson and Negative Binomial
17.2.5
Fitting a GLM with a Poisson distribution to the worm data
17.2.6
Model checking fits to count data
17.2.7
Fitting a GLM with a Negative Binomial distribution to the worm data
17.3
Working in R
17.3.1
Fitting a GLM to count data
17.3.2
Fitting a generalized linear mixed model (GLMM) to count data
17.3.3
Fitting a generalized linear model to continuous data
17.4
Problems
18
Linear models with heterogeneous variance
19
Simulations – Count data (alternatives to a t-test)
19.1
Use data similar to Figure 6f from Example 1
19.2
Functions
19.3
Simulations
19.3.1
Type I, Pseudo-Normal distribution
19.3.2
Type I, neg binom, equal n
19.3.3
Type I, neg binom, equal n, small theta
19.3.4
Type I, neg binom, unequal n
19.3.5
Power, Pseudo-Normal distribution, equal n
19.3.6
Power, neg binom, equal n
19.3.7
Power, neg binom, small theta
19.3.8
Power, neg binom, unequal n
19.3.9
Power, neg binom, unequal n, unequal theta
19.3.10
Type I, neg binom, equal n, unequal theta
19.4
Save it, Read it
19.5
Analysis
Appendix 1: Getting Started with R
19.6
Get your computer ready
19.6.1
Start here
19.6.2
Install R
19.6.3
Install R Studio
19.6.4
Install R Markdown
19.6.5
(optional) Alternative LaTeX installations
19.7
Start learning R Studio
Appendix 2: Online Resources for Getting Started with Statistical Modeling in R
Elements of Statistical Modeling for Experimental Biology
Chapter 18
Linear models with heterogeneous variance