Elements of Applied Biostatistics
Preface
0.1 Math
0.2 R and programming
Part I: R fundamentals
1 Organization – R Projects and R Notebooks
1.1 Importing Packages
1.2 Create an R Studio Project for this Class
1.3 R Notebooks
1.3.1 Create an R Notebook for this Chapter
1.3.2 Create a “load-packages” chunk
1.3.3 Create a “simple plot” chunk
1.3.4 Create more R chunks and explore options and play with R code
2 Data – Reading, Writing, and Wrangling
2.1 Create a new notebook for this chapter
2.2 Importing data
2.2.1 Excel file
2.2.2 Text file
2.3 Data wrangling
2.3.1 Combining data
2.3.2 Subsetting data
2.3.3 Missing data
2.3.4 Reshaping data
2.4 Saving data
2.5 Problems
Part II: Some Fundamentals of Statistical Modeling
3 An Introduction to Statistical Modeling
3.1 Two specifications of a linear model
3.1.1 The “error draw” specification
3.1.2 The “conditional draw” specification
3.1.3 Comparing the two ways of specifying the linear model
3.2 What do we call the \(X\) and \(Y\) variables?
3.3 Statistical models are used for prediction, explanation, and description
3.4 Modeling strategy
3.5 A mean is the simplest model
3.6 Assumptions for inference with a statistical model
3.7 Specific assumptions for inference with a linear model
3.8 “Statistical model” or “regression model”?
3.9 GLM vs. GLM vs. GLS
4 Variability and Uncertainty (Standard Deviations, Standard Errors, Confidence Intervals)
4.1 The sample standard deviation vs. the standard error of the mean
4.1.1 Sample standard deviation
4.1.2 Standard error of the mean
4.2 Using Google Sheets to generate fake data to explore the standard error
4.2.1 Steps
4.3 Using R to generate fake data to explore the standard error
4.3.1 Part I
4.3.2 Part II – Means
4.3.3 Part III – How do SD and SE change as sample size (n) increases?
4.3.4 Part IV – Generating fake data with for-loops
4.4 Bootstrapped standard errors
4.5 Confidence Interval
5 Covariance and Correlation
6 P-values
6.1 \(p\)-values
6.2 Creating a null distribution
6.2.1 The null distribution
6.2.2 t-tests
6.2.3 P-values from the perspective of permutation
6.3 Statistical modeling instead of hypothesis testing
6.4 Frequentist probability and the interpretation of p-values
6.4.1 Background
6.4.2 This book covers frequentist approaches to statistical modeling; when a probability arises, such as the \(p\)-value of a test statistic, it is a frequentist probability
6.4.3 Two interpretations of the \(p\)-value
6.4.4 NHST
6.4.5 Some major misconceptions of the \(p\)-value
6.4.6 Recommendations
6.5 Problems
7 Creating Fake Data
7.0.1 Continuous X (fake observational data)
7.0.2 Categorical X (fake experimental data)
7.0.3 Correlated X (fake observational data)
Part III: Introduction to Linear Models
8 A linear model with a single, continuous X
8.1 A linear model with a single, continuous X is classical “regression”
8.1.1 Using a linear model to estimate explanatory effects
8.1.2 Using a linear model for prediction
8.1.3 Reporting results
8.2 Working in R
8.2.1 Exploring the bivariate relationship between Y and X
8.2.2 Fitting the linear model
8.2.3 Getting to know the linear model: the summary function
8.2.4 display: An alternative to summary
8.2.5 Confidence intervals
8.2.6 How good is our model?
8.2.7 Exploring an lm object
8.3 Problems
9 A linear model with a single, categorical X
9.1 A linear model with a single, categorical X estimates the effects of X on the response
9.1.1 Table of model coefficients
9.1.2 The linear model
9.1.3 Reporting results
9.2 Comparing the results of a linear model to classical hypothesis tests
9.2.1 t-tests are special cases of a linear model
9.2.2 ANOVA is a special case of a linear model
9.3 Working in R
9.3.1 Fitting the model
9.3.2 Changing the reference level
9.3.3 An introduction to contrasts
9.3.4 Harrell plot
10 Model Checking
10.1 Do coefficients make numeric sense?
10.2 All statistical analyses should be followed by model checking
10.3 Linear model assumptions
10.4 Diagnostic plots use the residuals from the model fit
10.4.1 Residuals
10.4.2 A Normal Q-Q plot is used to check normality
10.4.3 Outliers – an outlier is a point that is highly unexpected given the modeled distribution
10.5 Model checking homoskedasticity
10.6 Model checking independence – happiness adverse example
10.7 Using R
11 Model Fitting and Model Fit (OLS)
11.1 Least Squares Estimation and the Decomposition of Variance
11.2 OLS regression
11.3 How well does the model fit the data? \(R^2\) and “variance explained”
12 Plotting Models
12.1 Pretty good plots show the model and the data
12.1.1 Pretty good plot component 1: Modeled effects plot
12.1.2 Pretty good plot component 2: Modeled mean and CI plot with jittered raw data
12.1.3 Combining effects and modeled mean and CI plots – an effects-and-response plot
12.2 Some comments on plot components
12.3 Working in R
12.3.1 Unpooled SE bars and confidence intervals
12.3.2 Adding bootstrap intervals
12.3.3 Adding modeled error intervals
12.3.4 Adding p-values
12.3.5 Adding custom p-values
12.3.6 Plotting two factors
12.3.7 Interaction plot
Part IV: More than one \(X\) – Multivariable Models
13 Adding covariates to a linear model
13.1 Adding covariates can increase the precision of the effect of interest
13.2 Adding covariates can decrease prediction error in predictive models
13.3 Adding covariates can reduce bias due to confounding in explanatory models
13.4 Best practices 1: A pre-treatment measure of the response should be a covariate and not subtracted from the post-treatment measure (regression to the mean)
13.4.1 Regression to the mean in words
13.4.2 Regression to the mean in pictures
13.4.3 Do not use percent change, believing that percents account for effects of initial weights
13.4.4 Do not “test for balance” of baseline measures
13.5 Best practices 2: Use a covariate instead of normalizing a response
14 Two (or more) Categorical \(X\) – Factorial designs
14.1 Factorial experiments
14.1.1 Model coefficients: an interaction effect is what is left over after adding the treatment effects to the control
14.1.2 What is the biological meaning of an interaction effect?
14.1.3 The interpretation of the coefficients in a factorial model is entirely dependent on the reference…
14.1.4 Estimated marginal means
14.1.5 In a factorial model, there are multiple effects of each factor (simple effects)
14.1.6 Marginal effects
14.1.7 The additive model
14.1.8 Reduce models for the right reason
14.1.9 What about models with more than two factors?
14.2 Reporting results
14.2.1 Text results
14.3 Working in R
14.3.1 Model formula
14.3.2 Modeled means
14.3.3 Marginal means
14.3.4 Contrasts
14.3.5 Simple effects
14.3.6 Marginal effects
14.3.7 Plotting results
14.4 Problems
15 ANOVA Tables
15.1 Summary of usage
15.2 Example: a one-way ANOVA using the vole data
15.3 Example: a two-way ANOVA using the urchin data
15.3.1 How to read an ANOVA table
15.3.2 How to read ANOVA results reported in the text
15.3.3 Better practice – estimates and their uncertainty
15.4 Unbalanced designs
15.4.1 What is going on in unbalanced ANOVA? – Type I, II, III sums of squares
15.4.2 Back to interpretation of main effects
15.4.3 The ANOVA tables for Type I, II, and III sums of squares are the same if the design is balanced
15.5 Working in R
15.5.1 Type I sums of squares in R
15.5.2 Type II and III sums of squares
16 Predictive Models
16.1 Overfitting
16.2 Model building vs. variable selection vs. model selection
16.2.1 Stepwise regression
16.2.2 Cross-validation
16.2.3 Penalization
16.3 Shrinkage
Part V: Expanding the Linear Model – Generalized Linear Models and Multilevel (Linear Mixed) Models
17 Generalized linear models I: Count data
17.1 The generalized linear model
17.2 Count data example – number of trematode worm larvae in eyes of threespine stickleback fish
17.2.1 Modeling strategy
17.2.2 Checking the model I – a Normal Q-Q plot
17.2.3 Checking the model II – scale-location plot for checking homoskedasticity
17.2.4 Two distributions for count data – Poisson and Negative Binomial
17.2.5 Fitting a GLM with a Poisson distribution to the worm data
17.2.6 Model checking fits to count data
17.2.7 Fitting a GLM with a Negative Binomial distribution to the worm data
17.3 Working in R
17.4 Problems
18 Linear mixed models
18.1 Random effects
18.2 Random effects in statistical models
18.3 Linear mixed models are flexible
18.4 Visualizing block effects
18.5 Linear mixed models can increase precision of point estimates
18.6 Linear mixed models are used to avoid pseudoreplication
18.7 Linear mixed models shrink coefficients by partial pooling
18.8 Working in R
18.8.1 Coral data
19 Linear models with heterogeneous variance
19.1 gls
Appendix 1: Getting Started with R
19.2 Get your computer ready
19.2.1 Install R
19.2.2 Install R Studio
19.2.3 Resources for installing R and R Studio
19.2.4 Install LaTeX
19.3 Start learning
19.3.1 Start with Data Camp Introduction to R
19.3.2 Then move to Introduction to R Studio
19.3.3 Develop your project with an R Studio Notebook
19.4 Getting Data into R
19.5 Additional R learning resources
19.6 Packages used extensively in this text
Appendix 2: Online Resources for Getting Started with Statistical Modeling in R