Elements of Applied Biostatistics
Preface
0.1
Math
0.2
R and programming
Part I: R fundamentals
1
Organization – R Projects and R Notebooks
1.1
Importing Packages
1.2
Create an R Studio Project for this Class
1.3
R Notebooks
1.3.1
Create an R Notebook for this Chapter
1.3.2
Create a “load-packages” chunk
1.3.3
Create a “simple plot” chunk
1.3.4
Create more R chunks and explore options and play with R code
2
Data – Reading, Wrangling, and Writing
2.1
Create new notebook for this chapter
2.2
Importing data
2.2.1
Excel file
2.2.2
Text file
2.3
Troubleshooting file import
2.3.1
Rule number one in R scripting
2.4
Data wrangling
2.4.1
Converting a single column with all combinations of a 2 x 2 factorial experiment into two columns, each containing the two levels of a factor
2.4.2
Combining data
2.4.3
Subsetting data
2.4.4
Missing data
2.4.5
Reshaping data
2.4.6
Miscellaneous data wrangling
2.5
Saving data
Part II: Some Fundamentals of Statistical Modeling
3
An Introduction to Statistical Modeling
3.1
Two specifications of a linear model
3.1.1
The “error draw” specification
3.1.2
The “conditional draw” specification
3.1.3
Comparing the two ways of specifying the linear model
3.2
What do we call the \(X\) and \(Y\) variables?
3.3
Statistical models are used for prediction, explanation, and description
3.4
Modeling strategy
3.5
A mean is the simplest model
3.6
Assumptions for inference with a statistical model
3.7
Specific assumptions for inference with a linear model
3.8
“Statistical model” or “regression model”?
3.9
GLM vs. GLM vs. GLS
4
Variability and Uncertainty (Standard Deviations, Standard Errors, Confidence Intervals)
4.1
The sample standard deviation vs. the standard error of the mean
4.1.1
Sample standard deviation
4.1.2
Standard error of the mean
4.2
Using Google Sheets to generate fake data to explore the standard error
4.2.1
Steps
4.3
Using R to generate fake data to explore the standard error
4.3.1
Part I
4.3.2
Part II – means
4.3.3
Part III – how do SD and SE change as sample size (n) increases?
4.3.4
Part IV – Generating fake data with for-loops
4.4
Bootstrapped standard errors
4.4.1
An example of bootstrapped standard errors using vole data
4.5
Confidence Interval
4.5.1
Interpretation of a confidence interval
5
Covariance and Correlation
6
P-values
6.1
\(p\)-values
6.2
Creating a null distribution.
6.2.1
the Null Distribution
6.2.2
t-tests
6.2.3
P-values from the perspective of permutation
6.3
Statistical modeling instead of hypothesis testing
6.4
frequentist probability and the interpretation of p-values
6.4.1
Background
6.4.2
This book covers frequentist approaches to statistical modeling and when a probability arises, such as the p-value of a test statistic, this will be a frequentist probability.
6.4.3
Two interpretations of the p-value
6.4.4
NHST
6.4.5
Some major misconceptions of the \(p\)-value
6.4.6
Recommendations
6.5
Problems
7
Creating Fake Data
7.0.1
Continuous X (fake observational data)
7.0.2
Categorical X (fake experimental data)
7.0.3
Correlated X (fake observational data)
Part III: Introduction to Linear Models
8
A linear model with a single, continuous X
8.1
A linear model with a single, continuous X is classical “regression”
8.1.1
Using a linear model to estimate explanatory effects
8.1.2
Using a linear model for prediction
8.1.3
Reporting results
8.2
Working in R
8.2.1
Exploring the bivariate relationship between Y and X
8.2.2
Fitting the linear model
8.2.3
Getting to know the linear model: the summary function
8.2.4
display: An alternative to summary
8.2.5
Confidence intervals
8.2.6
How good is our model?
8.2.7
exploring a lm object
8.3
Problems
9
A linear model with a single, categorical X
9.1
A linear model with a single, categorical X estimates the effects of X on the response.
9.1.1
Table of model coefficients
9.1.2
The linear model
9.1.3
Reporting results
9.2
Comparing the results of a linear model to classical hypothesis tests
9.2.1
t-tests are special cases of a linear model
9.2.2
ANOVA is a special case of a linear model
9.3
Working in R
9.3.1
Fitting the model
9.3.2
Changing the reference level
9.3.3
An introduction to contrasts
9.3.4
Harrell plot
10
Model Checking
10.1
Do coefficients make numeric sense?
10.2
All statistical analyses should be followed by model checking
10.3
Linear model assumptions
10.4
Diagnostic plots use the residuals from the model fit
10.4.1
Residuals
10.4.2
A Normal Q-Q plot is used to check normality
10.4.3
Outliers - an outlier is a point that is highly unexpected given the modeled distribution.
10.5
Model checking homoskedasticity
10.6
Model checking independence - happiness adverse example.
10.7
Using R
11
Model Fitting and Model Fit (OLS)
11.1
Least Squares Estimation and the Decomposition of Variance
11.2
OLS regression
11.3
How well does the model fit the data? \(R^2\) and “variance explained”
12
Best Practices – Issues in Inference
12.1
Power
12.1.1
“Types” of Error
12.2
multiple testing
12.2.1
Some background
12.2.2
Multiple testing – working in R
12.2.3
False Discovery Rate
12.3
difference in p is not different
12.4
Inference when data are not Normal
12.4.1
Working in R
12.4.2
Bootstrap Confidence Intervals
12.4.3
Permutation test
12.4.4
Non-parametric tests
12.4.5
Log transformations
12.4.6
Performance of parametric tests and alternatives
12.5
max vs. mean
12.6
pre-post, normalization
13
Plotting Models
13.1
Pretty good plots show the model and the data
13.1.1
Pretty good plot component 1: Modeled effects plot
13.1.2
Pretty good plot component 2: Modeled mean and CI plot
13.1.3
Combining Effects and Modeled mean and CI plots – an Effects and response plot.
13.2
Some comments on plot components
13.3
Working in R
13.3.1
Unpooled SE bars and confidence intervals
13.3.2
Adding bootstrap intervals
13.3.3
Adding modeled means and error intervals
13.3.4
Adding p-values
13.3.5
Adding custom p-values
13.3.6
Plotting two factors
13.3.7
Interaction plot
13.3.8
Plot components
Part IV: More than one \(X\) – Multivariable Models
14
Adding covariates to a linear model
14.1
Adding covariates can increase the precision of the effect of interest
14.2
Adding covariates can decrease prediction error in predictive models
14.3
Adding covariates can reduce bias due to confounding in explanatory models
14.4
Best practices 1: A pre-treatment measure of the response should be a covariate and not subtracted from the post-treatment measure (regression to the mean)
14.4.1
Regression to the mean in words
14.4.2
Regression to the mean in pictures
14.4.3
Do not use percent change, believing that percents account for effects of initial weights
14.4.4
Do not “test for balance” of baseline measures
14.5
Best practices 2: Use a covariate instead of normalizing a response
15
Two (or more) Categorical \(X\) – Factorial designs
15.1
Factorial experiments
15.1.1
Model coefficients: an interaction effect is what is leftover after adding the treatment effects to the control
15.1.2
What is the biological meaning of an interaction effect?
15.1.3
The interpretation of the coefficients in a factorial model is entirely dependent on the reference…
15.1.4
Estimated marginal means
15.1.5
In a factorial model, there are multiple effects of each factor (simple effects)
15.1.6
Marginal effects
15.1.7
The additive model
15.1.8
Reduce models for the right reason
15.1.9
What about models with more than two factors?
15.2
Reporting results
15.2.1
Text results
15.3
Working in R
15.3.1
Model formula
15.3.2
Modeled means
15.3.3
Marginal means
15.3.4
Contrasts
15.3.5
Simple effects
15.3.6
Marginal effects
15.3.7
Plotting results
15.4
Problems
16
ANOVA Tables
16.1
Summary of usage
16.2
Example: a one-way ANOVA using the vole data
16.3
Example: a two-way ANOVA using the urchin data
16.3.1
How to read an ANOVA table
16.3.2
How to read ANOVA results reported in the text
16.3.3
Better practice – estimates and their uncertainty
16.4
Unbalanced designs
16.4.1
What is going on in unbalanced ANOVA? – Type I, II, III sum of squares
16.4.2
Back to interpretation of main effects
16.4.3
The anova tables for Type I, II, and III sum of squares are the same if the design is balanced.
16.5
Working in R
16.5.1
Type I sum of squares in R
16.5.2
Type II and III Sum of Squares
17
Predictive Models
17.1
Overfitting
17.2
Model building vs. Variable selection vs. Model selection
17.2.1
Stepwise regression
17.2.2
Cross-validation
17.2.3
Penalization
17.3
Shrinkage
Part V: Expanding the Linear Model – Generalized Linear Models and Multilevel (Linear Mixed) Models
18
Linear mixed models
18.1
Random effects
18.2
Random effects in statistical models
18.3
Linear mixed models are flexible
18.4
Blocking
18.4.1
Visualizing variation due to blocks
18.4.2
Blocking increases precision of point estimates
18.5
Pseudoreplication
18.5.1
Visualizing pseudoreplication
18.6
Mapping NHST to estimation: A paired t-test is a special case of a linear mixed model
18.7
Advanced topic – Linear mixed models shrink coefficients by partial pooling
18.8
Working in R
18.8.1
coral data
19
Generalized linear models I: Count data
19.1
The generalized linear model
19.2
Count data example – number of trematode worm larvae in eyes of threespine stickleback fish
19.2.1
Modeling strategy
19.2.2
Checking the model I – a Normal Q-Q plot
19.2.3
Checking the model II – scale-location plot for checking homoskedasticity
19.2.4
Two distributions for count data – Poisson and Negative Binomial
19.2.5
Fitting a GLM with a Poisson distribution to the worm data
19.2.6
Model checking fits to count data
19.2.7
Fitting a GLM with a Negative Binomial distribution to the worm data
19.3
Working in R
19.3.1
Fitting a GLM to count data
19.3.2
Fitting a generalized linear mixed model (GLMM) to count data
19.3.3
Fitting a generalized linear model to continuous data
19.4
Problems
20
Linear models with heterogeneous variance
20.1
gls
Appendix 1: Getting Started with R
20.2
Get your computer ready
20.2.1
Install R
20.2.2
Install R Studio
20.2.3
Resources for installing R and R Studio
20.2.4
Install LaTeX
20.3
Start learning
20.3.1
Start with Data Camp Introduction to R
20.3.2
Then Move to Introduction to R Studio
20.3.3
Develop your project with an R Studio Notebook
20.4
Getting Data into R
20.5
Additional R learning resources
20.6
Packages used extensively in this text
Appendix 2: Online Resources for Getting Started with Statistical Modeling in R
Appendix 3: Fake Data Simulations
20.7
Performance of Blocking relative to a linear model
Elementary Statistical Modeling for Applied Biostatistics
Part IV: More than one \(X\) – Multivariable Models