Chapter 1 Getting Started – R Projects and R Markdown

A typical statistical modeling project will consist of:

  1. importing data from Excel or text (.csv or .txt) files
  2. cleaning data
  3. initial exploratory plots
  4. analysis
  5. model checking
  6. generating plots
  7. generating tables
  8. writing text to describe the project, the methods, the analysis, and the interpretation of the results (plots and tables)

The best practice for reproducible research is to have all of these steps in a single document and all of the files for this project in a single folder (directory), preferably on a cloud drive. Too many research projects are not reproducible because the data were cleaned in Excel, then different parts of the data were separately imported into GUI statistics software for analysis, and then output from the statistics software was transcribed to Excel to make a table. Other parts of the analysis are used to create a plot in some plotting software. Finally, the tables and plots are pasted into Microsoft Word to create a report. Any change at any step in this process requires the researcher to remember every downstream part that depends on the change and to redo each affected analysis, table, or plot.

R Studio encourages best practices by creating a project folder that contains all project documents and by implementing a version of Markdown called R Markdown. An R Markdown document can explicitly link all parts of the workflow so that changes in earlier steps automatically flow into the later steps. At the completion of a project, a researcher can choose “run all” from the menu and the data are read, cleaned, analyzed, plotted, tabled, and put into a report with the text.

1.1 R vs R Studio

R is a programming language. It runs under the hood. You never see it. To use R, you need another piece of software that provides a user interface. The software we will use for this is R Studio. R Studio is a slick (very slick) graphical user interface (GUI) for developing R projects.

1.2 Download and install R and R Studio

Download R for your OS

Download R Studio Desktop

If you need help installing R and R Studio, see Andy Field’s video tutorial “Installing R and RStudio”.

1.3 Install R Markdown

In this text, we will write code to analyze data using R Markdown. R Markdown is a version of Markdown. Markdown is a tool for creating a document containing text (like Microsoft Word), images, tables, and code that can be output to the three modern output formats: html (web pages), pdf (reports and documents), and Microsoft Word (okay, this isn’t modern but it is widely used).

Directions for installing R Markdown

R Markdown can output pdf files. The mechanism for this is to first create a LaTeX (“la-tek”) file. LaTeX is an amazing tool for creating professional pdf documents. You do not need PDF output for this text, but I encourage you to download and install the tinytex distribution, which was created especially for R Markdown in R Studio.

The tinytex distribution is here.
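
As a rough sketch, both the rmarkdown package and the tinytex distribution can also be installed from the R console. The interactive() guard and the has_latex check are additions for illustration, not part of the original directions; the guard keeps the downloads from running when the code is executed non-interactively (for example, while knitting).

```r
# Install only when typed or run at the console, never while knitting
if (interactive()) {
  install.packages(c("rmarkdown", "tinytex"))  # the R packages
  tinytex::install_tinytex()                   # the LaTeX distribution itself
}

# TRUE if a pdflatex executable is currently on the search path
# (a restart of R Studio may be needed after a fresh install)
has_latex <- unname(nzchar(Sys.which("pdflatex")))
print(has_latex)
```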

1.4 Importing Packages

The R scripts you write will include functions in packages that are not included in Base R. These packages need to be downloaded from an internet server to your computer. You only need to do this once (although you have to redo it each time you update R). But, each time you start a new R session, you will need to load a package using the library() function. Now is a good time to install the packages that we will use.

Open R Studio and choose the menu item “Tools” > “Install Packages”. In the “packages” input box, enter the names of the packages to install. The names can be separated by spaces or commas, for example “data.table, emmeans, ggplot2”. Make sure that “install dependencies” is checked before you click “Install”. Packages that we will use in this book are

  1. import and wrangling packages
  • here – we use this to read from and write to the correct folder
  • janitor – we use the function clean_names from this package
  • readxl – elegant importing from Microsoft Excel spreadsheets
  • data.table – improves functionality of data frames
  2. analysis packages
  • nlme – we use this for gls models
  • lme4 – we use this for linear mixed models
  • lmerTest – we use this for inference with linear mixed models
  • glmmTMB – we use this for generalized linear models
  • MASS – we will use glm.nb from this package
  • afex – we use this for classic ANOVA
  • emmeans – we use this to compute modeled means and contrasts
  3. graphing packages
  • ggplot2 – we use this for plotting
  • ggsci – we use this for the color palettes
  • ggpubr – we use this to make ggplots a bit easier
  • ggforce – we use this for improved jitter plots
  • dabestr – we use this to make several plot types
  • cowplot – we use this to combine plots

Once these are installed, you don’t need to install them again, although there will be additional packages that you might install later. You simply need to load them with the library() function at the start of a markdown script.
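
The menu-based installation above can also be done in code. The following is a minimal sketch, assuming you prefer the console to the menu; missing_packages() is a hypothetical helper written for this example, not a function from any package. It finds which of the book's packages are not yet installed and installs only those, and only in an interactive session.

```r
# Packages used in this book, as listed above
book_packages <- c(
  "here", "janitor", "readxl", "data.table",
  "nlme", "lme4", "lmerTest", "glmmTMB", "MASS", "afex", "emmeans",
  "ggplot2", "ggsci", "ggpubr", "ggforce", "dabestr", "cowplot"
)

# Hypothetical helper: return the names in pkgs that are not installed
missing_packages <- function(pkgs) {
  pkgs[!pkgs %in% rownames(installed.packages())]
}

to_install <- missing_packages(book_packages)

# Download only when run at the console, not while knitting a document
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, dependencies = TRUE)
}
```

Running this a second time should install nothing, which matches the "install once, load every session" rule described above.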

1.5 Create an R Studio Project for this textbook

  1. Create a project folder within the Documents folder (Mac OS) or My Documents folder (Windows OS). All files associated with this book will reside inside this folder. The name of the project folder should be something meaningful, such as “Applied_Biostatistics” or the name of your class (for students in my Applied Biostatistics class, this folder could be named “BIO_413”).
  2. Within the project folder, create new folders named
    1. “Rmd” – this is where your R markdown files are stored
    2. “R” – this is where additional R script files are stored
    3. “data” – this is where data that we download from public archives are stored
    4. “output” – this is where you will store fake data generated in this class
    5. “images” – this is where image files are stored
  3. Open R Studio and click the menu item File > New Project…
  4. Choose “Existing Directory” and navigate to your project folder
  5. Choose “Create Project”
  6. Check that a “.Rproj” file is in your project folder
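
The folder structure in step 2 can be created by hand in your file browser, or with a few lines of R. This is only a sketch: project_dir points at a temporary directory so that the example is safe to run anywhere, and you would replace it with the path to your own project folder.

```r
# Assumed location for illustration only; replace with your own project folder
project_dir <- file.path(tempdir(), "BIO_413")

# The five subfolders described in step 2
subfolders <- c("Rmd", "R", "data", "output", "images")

for (folder in subfolders) {
  # recursive = TRUE also creates project_dir itself if it does not exist yet
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List what was created
print(list.files(project_dir))
```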

1.5.1 Create an R Markdown file for this Chapter

  1. The top-left icon in R Studio is a little plus sign within a green circle. Click this and choose “R Markdown” from the pull-down menu.
  2. Give the file a meaningful title like “Chapter 1 – Organization”
  3. Delete all text below the first code chunk, starting with the header “## R Markdown”

1.5.1.1 Modify the yaml header

Replace “output: html_document” in the yaml header with the following in order to create a table of contents (toc) on the left side of the page and to enable code folding:

output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide

1.5.1.2 Modify the “setup” chunk

The setup chunk should look something like this

knitr::opts_chunk$set(echo = TRUE)

# wrangling packages
library(here)
library(janitor)
library(readxl)
library(data.table)

# analysis packages
library(MASS)

# graphing packages
library(ggsci)
library(ggpubr)
library(ggforce)
library(cowplot)

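# make sure here() refers to the here package, even if another
# loaded package also defines a function named here()
here <- here::here
# name of the folder, inside the project folder, that holds the data files
data_path <- "data"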
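
The last two lines of the setup chunk are there so that later chunks can build file paths relative to the project folder. A minimal sketch of how they might be combined is below; the file name my_experiment.xlsx is hypothetical and stands in for whatever file you have saved in the “data” folder.

```r
library(here)

here <- here::here
data_path <- "data"

# Hypothetical file name, for illustration only
file_name <- "my_experiment.xlsx"

# here() starts from the project folder, so this path works on any
# computer that has a copy of the project
file_path <- here(data_path, file_name)
print(file_path)
```

The resulting path could then be passed to an import function such as read_excel() from the readxl package.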

1.5.2 Create a “fake-data” chunk

  1. Let’s play around with an R Markdown file. Create a new chunk and label it “fake-data”. Insert the following R script and then click the chunk’s run button
set.seed(4)
n <- 10
fake_data <- data.table(
    treatment = rep(c("cn", "tr"), each = n),
    neutrophil_count_exp1 = rnegbin(n*2, 
                                    mu = rep(c(10, 15), each = n),
                                    theta = 1),
    neutrophil_count_exp2 = rnegbin(n*2, 
                                    mu = rep(c(10, 20), each = n),
                                    theta = 1)
)
# View(fake_data)

This chunk creates fake neutrophil counts in two different experiments. The comment (#) sign before View(fake_data) “comments out” the line of code, so it is not run. View the data by highlighting View(fake_data) and choosing “Run selected line(s)” from the Run menu.
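
Another way to check the fake data, without opening the viewer, is to print a small summary by treatment group. The sketch below repeats the fake-data code so that it runs on its own; the name group_summary and the choice of summary columns are illustrative assumptions.

```r
library(data.table)
library(MASS)  # provides rnegbin()

# Same fake data as in the chunk above
set.seed(4)
n <- 10
fake_data <- data.table(
  treatment = rep(c("cn", "tr"), each = n),
  neutrophil_count_exp1 = rnegbin(n * 2,
                                  mu = rep(c(10, 15), each = n),
                                  theta = 1),
  neutrophil_count_exp2 = rnegbin(n * 2,
                                  mu = rep(c(10, 20), each = n),
                                  theta = 1)
)

# Sample size and mean count per treatment group in each experiment
group_summary <- fake_data[, .(
  n = .N,
  mean_exp1 = mean(neutrophil_count_exp1),
  mean_exp2 = mean(neutrophil_count_exp2)
), by = treatment]

print(group_summary)
```

Because of set.seed(4), the summary should be identical every time the chunk is run.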

1.5.3 Create a “plot” chunk

  1. Create a new chunk and label it “plot”. Insert the following R script and then click the chunk’s run button
# experiment 1: raw counts, group mean and SE, and t-test p-value
gg_1 <- ggstripchart(data = fake_data,
                     x = "treatment",
                     y = "neutrophil_count_exp1",
                     color = "treatment",
                     palette = "jco",
                     add = "mean_se",
                     legend = "none") +
  ylab("Neutrophil Count (Exp. 1)") +
  stat_compare_means(method = "t.test",
                     label.y = 50,
                     label = "p.format") +
  NULL

# experiment 2: same plot for the second fake experiment
gg_2 <- ggstripchart(data = fake_data,
                     x = "treatment",
                     y = "neutrophil_count_exp2",
                     color = "treatment",
                     palette = "jco",
                     add = "mean_se",
                     legend = "none") +
  ylab("Neutrophil Count (Exp. 2)") +
  stat_compare_means(method = "t.test",
                     label.y = 65,
                     label = "p.format") +
  NULL

# combine the two plots into one figure with panel labels A and B
plot_grid(gg_1, gg_2, labels = "AUTO")

Each plot shows the mean count for each group, the standard error of the mean count, and the p-value from a t-test. This statistical analysis and plot are typical of those found in experimental biology journals. This text will teach alternatives that implement better practices.

1.5.4 Knit

  1. Knit to an html file
  2. Knit to a pdf file, if you’ve installed tinytex (or some other LaTeX distribution)
  3. Knit to a Word document
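
Knitting can also be done from the console instead of the Knit button. The following is a minimal sketch that assumes the rmarkdown package is installed; the file name chapter_1.Rmd is hypothetical, so substitute the name you gave your own file. The code does nothing if the file is not found.

```r
# Hypothetical file name inside the project's "Rmd" folder
rmd_file <- file.path("Rmd", "chapter_1.Rmd")

# Output formats to knit to; add "pdf_document" if a LaTeX
# distribution such as tinytex is installed
formats <- c("html_document", "word_document")

if (file.exists(rmd_file) && requireNamespace("rmarkdown", quietly = TRUE)) {
  for (fmt in formats) {
    # render() runs every chunk in a fresh environment and writes the output file
    rmarkdown::render(rmd_file, output_format = fmt)
  }
}
```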