Elements of Applied Biostatistics
Preface
0.1 Math
0.2 R and programming
Part I: Getting Started
1 Getting Started – R Projects and R Markdown
1.1 R vs RStudio
1.2 Download and install R and RStudio
1.3 Install R Markdown
1.4 Importing packages
1.5 Create an RStudio project for this textbook
1.5.1 Create an R Markdown file for this chapter
1.5.2 Create a “fake-data” chunk
1.5.3 Create a “plot” chunk
1.5.4 Knit
Part II: An introduction to the analysis of experimental data with a linear model
2 Analyzing experimental data with a linear model
2.1 This text is about the estimation of treatment effects and the uncertainty in our estimates using linear models. This raises the question, what is “an effect”?
Background physiology to the experiments in Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
Analyses for Figure 2 of “ASK1 inhibits browning of white adipose tissue in obesity”
2.2 Setup
2.3 Data source
2.4 Control the color palette
2.5 Useful functions
2.6 Figure 2b – Effect of ASK1 deletion on growth (body weight)
2.6.1 Figure 2b – import
2.6.2 Figure 2b – exploratory plots
2.7 Figure 2c – Effect of ASK1 deletion on final body weight
2.7.1 Figure 2c – import
2.7.2 Figure 2c – check own computation of weight change vs. imported value
2.7.3 Figure 2c – exploratory plots
2.7.4 Figure 2c – fit the model: m1 (lm)
2.7.5 Figure 2c – check the model: m1
2.7.6 Figure 2c – fit the model: m2 (gamma glm)
2.7.7 Figure 2c – check the model: m2
2.7.8 Figure 2c – inference from the model
2.7.9 Figure 2c – plot the model
2.7.10 Figure 2c – report
2.8 Figure 2d – Effect of ASK1 KO on glucose tolerance (whole curve)
2.8.1 Figure 2d – import
2.8.2 Figure 2d – exploratory plots
2.8.3 Figure 2d – fit the model
2.8.4 Figure 2d – check the model
2.8.5 Figure 2d – inference
2.8.6 Figure 2d – plot the model
2.9 Figure 2e – Effect of ASK1 deletion on glucose tolerance (summary measure)
2.9.1 Figure 2e – massage the data
2.9.2 Figure 2e – exploratory plots
2.9.3 Figure 2e – fit the model
2.9.4 Figure 2e – check the model
2.9.5 Figure 2e – inference from the model
2.9.6 Figure 2e – plot the model
2.10 Figure 2f – Effect of ASK1 deletion on glucose infusion rate
2.10.1 Figure 2f – import
2.10.2 Figure 2f – exploratory plots
2.10.3 Figure 2f – fit the model
2.10.4 Figure 2f – check the model
2.10.5 Figure 2f – inference
2.10.6 Figure 2f – plot the model
2.11 Figure 2g – Effect of ASK1 deletion on tissue-specific glucose uptake
2.11.1 Figure 2g – import
2.11.2 Figure 2g – exploratory plots
2.11.3 Figure 2g – fit the model
2.11.4 Figure 2g – check the model
2.11.5 Figure 2g – inference
2.11.6 Figure 2g – plot the model
2.12 Figure 2h
2.13 Figure 2i – Effect of ASK1 deletion on liver TG
2.13.1 Figure 2i – fit the model
2.13.2 Figure 2i – check the model
2.13.3 Figure 2i – inference
2.13.4 Figure 2i – plot the model
2.13.5 Figure 2i – report the model
2.14 Figure 2j
Part III: R fundamentals
3 Data – Reading, Wrangling, and Writing
3.1 Learning from this chapter
3.2 Working in R
3.2.1 Importing data
3.3 Data wrangling
3.3.1 Reshaping data – Wide to long
3.3.2 Reshaping data – Transpose (turning the columns into rows)
3.3.3 Combining data
3.3.4 Subsetting data
3.3.5 Wrangling columns
3.3.6 Missing data
3.4 Saving data
3.5 Exercises
4 Plotting Models
4.1 Pretty good plots show the model and the data
4.1.1 Pretty good plot component 1: Modeled effects plot
4.1.2 Pretty good plot component 2: Modeled mean and CI plot
4.1.3 Combining effects and modeled mean and CI plots – an effects-and-response plot
4.2 Some comments on plot components
4.3 Working in R
4.3.1 Unpooled SE bars and confidence intervals
4.3.2 Adding bootstrap intervals
4.3.3 Adding modeled means and error intervals
4.3.4 Adding p-values
4.3.5 Adding custom p-values
4.3.6 Plotting two factors
4.3.7 Interaction plot
4.3.8 Plot components
Part IV: Some Fundamentals of Statistical Modeling
5 Variability and Uncertainty (Standard Deviations, Standard Errors, Confidence Intervals)
5.1 The sample standard deviation vs. the standard error of the mean
5.1.1 Sample standard deviation
5.1.2 Standard error of the mean
5.2 Using Google Sheets to generate fake data to explore the standard error
5.2.1 Steps
5.3 Using R to generate fake data to explore the standard error
5.3.1 Part I
5.3.2 Part II – means
5.3.3 Part III – how do SD and SE change as sample size (n) increases?
5.3.4 Part IV – generating fake data with for-loops
5.4 Bootstrapped standard errors
5.4.1 An example of bootstrapped standard errors using vole data
5.5 Confidence intervals
5.5.1 Interpretation of a confidence interval
6 P-values
6.1 A p-value is the probability of sampling a value as or more extreme than the test statistic if sampling from a null distribution
6.2 Pump your intuition – Creating a null distribution
6.3 A null distribution of t-values – the t distribution
6.4 P-values from the perspective of permutation
6.5 Parametric vs. non-parametric statistics
6.6 Frequentist probability and the interpretation of p-values
6.6.1 Background
6.6.2 This book covers frequentist approaches to statistical modeling, and when a probability arises, such as the p-value of a test statistic, this will be a frequentist probability.
6.6.3 Two interpretations of the p-value
6.6.4 NHST
6.7 Some major misconceptions of the p-value
6.7.1 Misconception: p is the probability that the null is true and \(1-p\) is the probability that the alternative is true
6.7.2 Misconception: a p-value is repeatable
6.7.3 Misconception: 0.05 is the lifetime rate of false discoveries
6.7.4 Misconception: a low p-value indicates an important effect
6.7.5 Misconception: a low p-value indicates high model fit or high predictive capacity
6.8 What the p-value does not mean
6.9 Recommendations
6.9.1 Primary sources for recommendations
6.10 Problems
7 Errors in inference
7.1 Classical NHST concepts of wrong
7.1.1 Type I error
7.1.2 Power
7.2 A non-Neyman-Pearson concept of power
7.2.1 Estimation error
7.2.2 Coverage
7.2.3 Type S error
7.2.4 Type M error
Part V: Introduction to Linear Models
8 An introduction to linear models
8.1 Two specifications of a linear model
8.1.1 The “error draw” specification
8.1.2 The “conditional draw” specification
8.1.3 Comparing the two ways of specifying the linear model
8.2 A linear model can be fit to data with continuous, discrete, or categorical \(X\) variables
8.2.1 Fitting linear models to experimental data in which the \(X\) variable is continuous or discrete
8.2.2 Fitting linear models to experimental data in which the \(X\) variable is categorical
8.3 Statistical models are used for prediction, explanation, and description
8.4 What do we call the \(X\) and \(Y\) variables?
8.5 Modeling strategy
8.6 Predictions from the model
8.7 Inference from the model
8.7.1 Assumptions for inference with a statistical model
8.7.2 Specific assumptions for inference with a linear model
8.8 “Linear model”, “regression model”, or “statistical model”?
9 Linear models with a single, continuous X
9.1 A linear model with a single, continuous X is classical “regression”
9.1.1 Analysis of “green-down” data
9.1.2 Learning from the green-down example
9.1.3 Using a regression model for “explanation” – causal models
9.1.4 Using a regression model for prediction – prediction models
9.1.5 Using a regression model for creating a new response variable – comparing slopes of longitudinal data
9.1.6 Using a regression model for calibration
9.2 Working in R
9.2.1 Fitting the linear model
9.2.2 Getting to know the linear model: the summary function
9.2.3 Inference – the coefficient table and confidence intervals
9.2.4 How good is our model? – Model checking
9.2.5 Plotting models with continuous X
9.2.6 Creating a table of predicted values and 95% prediction intervals
9.3 Hidden code
9.3.1 Import and plot of fig2c (ecosystem warming experimental) data
9.3.2 Import and plot of efig_3d (ecosystem warming observational) data
9.3.3 Import and plot of fig1f (methionine restriction) data
9.4 Try it
9.4.1 A prediction model from the literature
9.5 Intuition pumps
9.5.1 Correlation and \(R^2\)
10 Linear models with a single, categorical X
10.1 A linear model with a single, categorical X variable estimates the effects of the levels of X on the response.
10.1.1 Example 1 – two treatment levels (“groups”)
10.1.2 Understanding the analysis with two treatment levels
10.1.3 Example 2 – three treatment levels (“groups”)
10.1.4 Understanding the analysis with three (or more) treatment levels
10.2 Working in R
10.2.1 Specifying the contrasts
10.2.2 Adjustment for multiple comparisons
10.2.3 Plotting models with a single, categorical \(X\)
10.3 Issues in inference in models with a single, categorical \(X\)
10.3.1 Lack of independence
10.3.2 Heterogeneity of variances
10.3.3 The conditional response isn’t Normal
10.3.4 Pre-post designs
10.3.5 Longitudinal designs
10.3.6 Comparing responses normalized to a standard
10.3.7 Comparing responses that are ratios
10.3.8 Researcher degrees of freedom
10.4 Hidden code
10.4.1 fig2a data
11 Model Checking
11.1 All statistical analyses should be followed by model checking
11.2 Linear model assumptions
11.2.1 A bit about IID
11.3 Diagnostic plots use the residuals from the model fit
11.3.1 Residuals
11.3.2 A Normal Q-Q plot is used to check for characteristic departures from Normality
11.3.3 Mapping Q-Q plot departures from Normality
11.3.4 Model checking homoskedasticity
11.4 Using R
11.4.1 Normal Q-Q plots
12 Model Fitting and Model Fit (OLS)
12.1 Least Squares Estimation and the Decomposition of Variance
12.2 OLS regression
12.3 How well does the model fit the data? \(R^2\) and “variance explained”
13 Best practices – issues in inference
13.1 Multiple testing
13.1.1 Some background
13.1.2 Multiple testing – working in R
13.1.3 False discovery rate
13.2 Difference in p is not different
13.3 Inference when data are not Normal
13.3.1 Working in R
13.3.2 Bootstrap confidence intervals
13.3.3 Permutation test
13.3.4 Non-parametric tests
13.3.5 Log transformations
13.3.6 Performance of parametric tests and alternatives
Part VI: More than one \(X\) – Multivariable Models
14 Adding covariates to a linear model
14.1 Adding covariates can increase the precision of the effect of interest
14.2 Understanding a linear model with an added covariate – heart necrosis data
14.2.1 Fit the model
14.2.2 Plot the model
14.2.3 Interpretation of the model coefficients
14.2.4 Everything adds up
14.2.5 Interpretation of the estimated marginal means
14.2.6 Interpretation of the contrasts
14.2.7 Adding the covariate improves inference
14.3 Understanding interaction effects with covariates
14.3.1 Fit the model
14.3.2 Plot the model with interaction effect
14.3.3 Interpretation of the model coefficients
14.3.4 What is the effect of a treatment, if interactions are modeled? – it depends.
14.3.5 Which model do we use, \(\mathcal{M}_1\) or \(\mathcal{M}_2\)?
14.4 Understanding ANCOVA tables
14.5 Working in R
14.5.1 Importing the heart necrosis data
14.5.2 Fitting the model
14.5.3 ANCOVA tables
14.5.4 Plotting the model
14.6 Best practices
14.6.1 Do not use a ratio of part:whole as a response variable – instead add the denominator as a covariate
14.6.2 Do not use change from baseline as a response variable – instead add the baseline measure as a covariate
14.6.3 Do not “test for balance” of baseline measures
14.7 Best practices 2: Use a covariate instead of normalizing a response
15 Two (or more) Categorical \(X\) – Factorial designs
15.1 Factorial experiments
15.1.1 Model coefficients: an interaction effect is what is left over after adding the treatment effects to the control
15.1.2 What is the biological meaning of an interaction effect?
15.1.3 The interpretation of the coefficients in a factorial model is entirely dependent on the reference…
15.1.4 Estimated marginal means
15.1.5 In a factorial model, there are multiple effects of each factor (simple effects)
15.1.6 Marginal effects
15.1.7 The additive model
15.1.8 Reduce models for the right reason
15.1.9 What about models with more than two factors?
15.2 Reporting results
15.2.1 Text results
15.3 Working in R
15.3.1 Model formula
15.3.2 Modeled means
15.3.3 Marginal means
15.3.4 Contrasts
15.3.5 Simple effects
15.3.6 Marginal effects
15.3.7 Plotting results
15.4 Problems
16 ANOVA Tables
16.1 Summary of usage
16.2 Example: a one-way ANOVA using the vole data
16.3 Example: a two-way ANOVA using the urchin data
16.3.1 How to read an ANOVA table
16.3.2 How to read ANOVA results reported in the text
16.3.3 Better practice – estimates and their uncertainty
16.4 Unbalanced designs
16.4.1 What is going on in unbalanced ANOVA? – Type I, II, III sums of squares
16.4.2 Back to interpretation of main effects
16.4.3 The ANOVA tables for Type I, II, and III sums of squares are the same if the design is balanced.
16.5 Working in R
16.5.1 Type I sums of squares in R
16.5.2 Type II and III sums of squares
17 Predictive Models
17.1 Overfitting
17.2 Model building vs. variable selection vs. model selection
17.2.1 Stepwise regression
17.2.2 Cross-validation
17.2.3 Penalization
17.3 Shrinkage
Part VII – Expanding the Linear Model
18 Models with random effects – Blocking and pseudoreplication
18.1 Random effects
18.2 Random effects in statistical models
18.3 Linear mixed models are flexible
18.4 Blocking
18.4.1 Visualizing variation due to blocks
18.4.2 Blocking increases precision of point estimates
18.5 Pseudoreplication
18.5.1 Visualizing pseudoreplication
18.6 Mapping NHST to estimation: A paired t-test is a special case of a linear mixed model
18.7 Advanced topic – Linear mixed models shrink coefficients by partial pooling
18.8 Working in R
18.8.1 coral data
19 Models for longitudinal experiments – pre-post designs
19.1 Best practice models
19.2 Common alternatives that are not recommended
19.3 Advanced models
19.4 Understanding the alternative models
19.4.1 (M1) Linear model with the baseline measure as the covariate (ANCOVA model)
19.4.2 (M2) Linear model of the change score (change-score model)
19.4.3 (M3) Linear model of post-baseline values without the baseline as a covariate (post model)
19.4.4 (M4) Linear model with factorial fixed effects (fixed-effects model)
19.4.5 (M5) Repeated measures ANOVA
19.4.6 (M6) Linear mixed model
19.4.7 (M7) Linear model with correlated error
19.4.8 (M8) Constrained fixed-effects model with correlated error (cLDA model)
19.4.9 Comparison table
19.5 Example 1 – a single post-baseline measure (pre-post design)
19.6 Working in R
19.7 Hidden code
19.7.1 Import and wrangle mouse sociability data
20 Generalized linear models I: Count data
20.1 The generalized linear model
20.2 Count data example – number of trematode worm larvae in eyes of threespine stickleback fish
20.2.1 Modeling strategy
20.2.2 Checking the model I – a Normal Q-Q plot
20.2.3 Checking the model II – scale-location plot for checking homoskedasticity
20.2.4 Two distributions for count data – Poisson and Negative Binomial
20.2.5 Fitting a GLM with a Poisson distribution to the worm data
20.2.6 Model checking fits to count data
20.2.7 Fitting a GLM with a Negative Binomial distribution to the worm data
20.3 Working in R
20.3.1 Fitting a GLM to count data
20.3.2 Fitting a generalized linear mixed model (GLMM) to count data
20.3.3 Fitting a generalized linear model to continuous data
20.4 Problems
21 Linear models with heterogeneous variance
21.1 gls
22 Plotting functions (#ggplotsci)
22.1 odd-even
22.2 Estimate response and effects with emmeans
22.3 emm_table
22.4 pairs_table
22.5 gg_mean_error
22.6 gg_ancova
22.7 gg_mean_ci_ancova
22.8 gg_effects
Appendix 1: Getting Started with R
22.9 Get your computer ready
22.9.1 Start here
22.9.2 Install R
22.9.3 Install RStudio
22.9.4 Install R Markdown
22.9.5 (optional) Alternative LaTeX installations
22.10 Start learning RStudio
Appendix 2: Online Resources for Getting Started with Statistical Modeling in R
Appendix 3: Fake Data Simulations
22.11 Performance of Blocking relative to a linear model