Chapter 2 Data – Importing and Saving Data
2.1 Create new notebook for this chapter
Be sure to save the notebook in the “notebooks” folder of your BIO_413 project. Annotate your notebook with notes! Update it as you learn more! We will use data.table for importing text files in tab-delimited or comma-separated formats and the readxl package for importing Excel files.
Start your notebook with a chunk named “setup”. It is important that “setup” is the name. The text inside the curly brackets at the top of the chunk should be “r setup, message=FALSE”
knitr::opts_chunk$set(fig.width=6, fig.height=4)
# use this in a chunk called "setup" to force the working directory to be
# at the level of the project file.
knitr::opts_knit$set(root.dir = rprojroot::find_rstudio_root_file())
# relative paths to project folders
data_path <- "data" # path to data that are imported
output_path <- "output" # path to data that are saved
Add a “load-packages” chunk.
library(ggplot2) # ggplot environment
library(ggpubr) # publication ready plots
library(data.table) # make data.frames data.tables
library(readxl) # import excel
library(emmeans) # get estimated marginal means and CIs, used for plot
library(janitor) # clean_names to clean col labels of imported data
2.2 Importing Data
Throughout this book, we will download data from the Dryad Digital Repository, which is a major resource for increasing reproducibility in science. My own view is that all data should be archived on some public server (exceptions include data that are proprietary or contain sensitive information – such as human health measures).
The downloaded data will be inserted into the “data” folder. To access these data in an R script, the script needs to know “where to look” or the “address.” This address is the path. The default working directory for an R notebook is the directory containing the notebook .Rmd file so the default path is the path to this directory and this file should always be saved to the “notebooks” folder within “BIO_413”. The setup chunk above changes this default working directory to the project folder, which makes it a little easier to navigate to the data folder.
The “BIO_413” folder is the parent of the “notebooks” folder. It is also the parent of the “data” folder. To see any content within the “data” folder, the R script needs to tell R to move back (or up) the directory structure out of the “notebooks” folder into the parent “BIO_413” folder and then forward (or down) into the “data” folder. This would be done with the relative path “../data”, where “..” means “move up one level.” Because the setup chunk changes the working directory to the project folder, we can instead use the simpler relative path “data”.
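As a sketch in code (using the folder names above): from inside “notebooks” the script would need to climb up and back down the directory tree, while from the project folder the path is just “data”.

```r
# From inside "notebooks": up one level (..) and then down into "data"
path_from_notebooks <- file.path("..", "data")
path_from_notebooks  # "../data"

# After the setup chunk sets the working directory to the project folder,
# the relative path is simply the folder name
path_from_project <- "data"
```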
2.2.1 Excel File
The Excel dataset is from an experiment on the growth response of zebra finch chicks to an incubation call that presumably signals “hot environment” to the embryos (Mariette, M.M. and Buchanan, K.L., 2016. Prenatal acoustic communication programs offspring for high posthatching temperatures in a songbird. Science, 353(6301), pp.812-814). The source file is from the Dryad Repository here:
file name: “allDatasetsMarietteBuchanan2016.xls”
- Copy the title of the Dryad page, which is “Data from: Prenatal acoustic communication programs offspring for high post-hatching temperatures in a songbird”
- Create a new folder within “data” and paste in the copied title as the folder name
- Remove the colon from the name, so the folder name is “Data from Prenatal acoustic communication programs offspring for high post-hatching temperatures in a songbird”
- Download the .xls file into this folder
A .xls file is an old (pre-2007) Microsoft Excel file type. It is a binary file and can only be opened into a readable format with specialized software. The more modern Excel file type is .xlsx, which contains within it multiple xml components. An xml file is a text file, and so contains readable content, but the content is xml code to display something. In general, I am a big advocate of archiving stuff as text files (manuscripts, data, scripts, blog posts) because these will always be readable by future software. Microsoft Excel is not likely to die anytime soon, and software that can read .xls and especially .xlsx files (again, .xlsx files are built from text files) is even less likely to disappear, but we can feel even more confident if data are archived as text files. That said, a single Microsoft Excel file with multiple sheets is an efficient method for distributing data, and the readxl package provides excellent tools for reading different sheets of a single .xls or .xlsx file.
data_folder <- "Data from Prenatal acoustic communication programs offspring for high post-hatching temperatures in a songbird"
filename <- "allDatasetsMarietteBuchanan2016.xls"
file_path <- paste(data_path, data_folder, filename, sep="/")
chick <- read_excel(file_path, sheet="nestlingMass") # read the excel file
chick <- data.table(chick) # convert to data.table
In this book, we will consistently use this protocol for storing and retrieving downloaded files. The first three lines in the script above create the directory path to the file. This path includes
- data_path – the relative path into the folder “data” (relative to the location of the project file)
- data_folder – the name of the folder within “data” containing the file
- filename – the name of the file to read
These are all put together into a single path using the function paste(). Read about paste() – it will be used repeatedly. The line read_excel(file_path, sheet="nestlingMass") reads the nestlingMass sheet only. The next line, chick <- data.table(chick), converts the data.frame into a data.table.
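The path-building step can be run on its own to see what paste() returns (the values are the same ones assigned above; file.path() is a base R alternative that always uses "/" as the separator):

```r
data_path <- "data"  # assigned in the setup chunk
data_folder <- "Data from Prenatal acoustic communication programs offspring for high post-hatching temperatures in a songbird"
filename <- "allDatasetsMarietteBuchanan2016.xls"

# paste() glues its arguments into one string, separated by sep
file_path <- paste(data_path, data_folder, filename, sep="/")

# file.path() does the same thing with "/" as the built-in separator
identical(file_path, file.path(data_path, data_folder, filename))  # TRUE
```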
2.2.1.1 Troubleshooting File Import
If you get an error that starts with “Error: path does not exist:” then R is not “seeing” your specified file given the path you’ve given it.
- Make sure your “setup” chunk is called “setup” and that it includes the line to change the working directory to the project directory and that you have run the chunk
- Make sure the path to the data folder is in your setup chunk.
- Make sure your “data” folder is one level inside your project folder. “one level” means it is not buried deeper inside other folders within the project folder.
- Make sure your “Data from …” folder is one level inside your “data” folder
- Make sure you have the name of the “Data from …” folder correct in your script. To do this, I do not type the name but instead go to the Finder (on Mac OS), highlight the “Data from …” folder, copy the name – the whole name – and then paste it between the quotation marks in the line
data_folder <- ""
- Make sure your data file is one level inside the correct “Data from” folder.
- Make sure the file name is correct in the script. As with the folder name, I go to the Finder and copy the file name and paste it in place.
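The checklist above can also be run as code: test each level of the path with dir.exists() and file.exists(). This is a sketch using the folder and file names from this chapter; the first FALSE tells you which level of the path is wrong.

```r
data_path <- "data"
data_folder <- "Data from Prenatal acoustic communication programs offspring for high post-hatching temperatures in a songbird"
filename <- "allDatasetsMarietteBuchanan2016.xls"

dir.exists(data_path)  # is "data" visible from the working directory?
dir.exists(paste(data_path, data_folder, sep="/"))  # is the "Data from ..." folder inside "data"?
file.exists(paste(data_path, data_folder, filename, sep="/"))  # is the file inside that folder?
getwd()  # if the first check fails, confirm this is the project folder
```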
More generally, R is very literal when it comes to:

- Spelling. Humans are very good at understanding misspelled words, but the R language (or any computer language) is not. “Data from Quantifying the effects of” and “Data from Quantifying the efects of” are different values.
- Capitalization. R is case sensitive (some programming languages are not). “Data from Quantifying”, “Data from quantifying”, and “data from quantifying” are all different values.

Spelling and capitalization have to be perfect, not simply close.
2.2.1.2 Peek at the imported data.table to check that the file was imported correctly and to learn about the contents
Insert the following after your import script and run:
head(chick) # check -- are there headers? are there the correct number of columns?
## chick ID brood ID brood composition sex rank in nest
## 1: N1.10LF3 N1.10m3 mixed F 2
## 2: N1.10noCut3 N1.10m3 mixed M 4
## 3: N1.10RB3 N1.10m3 mixed F 2
## 4: N1.10RF3 N1.10m3 mixed F 5
## 5: N1.12LB3 N1.12m3 mixed F 3
## 6: N1.12LF3 N1.12m3 mixed F 1
## playback treatment nest temperature above ambient
## 1: treat 4.289583
## 2: cont 4.289583
## 3: cont 4.289583
## 4: cont 4.289583
## 5: cont 3.972917
## 6: treat 3.972917
## max daily temp hatch day mean max temp hatch to day2
## 1: 17.4 18.83333
## 2: 19.0 20.53333
## 3: 17.4 18.83333
## 4: 19.0 20.53333
## 5: 29.0 24.63333
## 6: 25.1 24.80000
## mean max temp hatch to day10 mean max temp hatch to day13 hatching mass
## 1: 22.70 23.05714 0.7
## 2: 24.53 23.41429 0.6
## 3: 22.70 23.05714 0.7
## 4: 24.53 23.41429 0.6
## 5: 22.85 22.91429 0.7
## 6: 23.35 23.24286 0.6
## day1 mass day2 mass day10 mass day13 mass
## 1: 1.1000000000000001 1.2 NA 9.8
## 2: 0.80000000000000004 1.1000000000000001 NA 9.1
## 3: 0.90000000000000002 1.3999999999999999 NA 9.3
## 4: 0.5 0.90000000000000002 NA 7.7
## 5: 1 1.3999999999999999 9.4000000000000004 10.1
## 6: 0.90000000000000002 1.3999999999999999 8.0999999999999996 9.6
## day13 tarsus
## 1: 14.11
## 2: 12.90
## 3: 13.60
## 4: 13.06
## 5: 14.08
## 6: 13.46
head(chick) displays the first few lines of the data.table chick. This is one way to check that the data were imported correctly, to examine the column labels to ensure they conform to best practices, and to simply get to know the contents of the imported file. In this case, it is easy to see that the column names have spaces in them. Spaces in column names are bad practice because a column name is a variable and variable names should not include spaces (is brood id one variable or two?). The best practice is to replace a blank with an underscore, for example brood_id. Some coders just mash words together into a single word like broodid, but this should generally be avoided because the result can be hard to read. Some coders also use caps to designate new words, like BroodId. This is easier to read than simple concatenation, but the underscore is the easiest to read.
The janitor package has a nice function, clean_names(), to clean the column names of a data frame. The default clean changes any uppercase letter to lowercase and replaces spaces with underscores. Many coders like to work with all-lowercase variable names to avoid having to hit the shift key.
chick <- clean_names(chick)
Importantly, resist the temptation to change the column names in the data file, which reduces reproducibility. Leave original data files original. Always increase reproducibility!
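To see what the default clean does, here is a minimal sketch that imitates janitor::clean_names() by hand on a toy data frame (the labels are hypothetical, modeled on the chick file; for real work, still use clean_names()):

```r
# Toy data frame with the kind of column labels seen in the chick file
toy <- data.frame(check.names = FALSE,
                  `brood ID`   = 1:2,
                  `day13 mass` = c(9.8, 9.1))
names(toy)  # "brood ID"  "day13 mass"

# The default janitor::clean_names() behavior, imitated with base R:
# lower-case everything and replace spaces with underscores
names(toy) <- gsub(" ", "_", tolower(names(toy)))
names(toy)  # "brood_id"  "day13_mass"
```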
2.2.1.3 Best practices for creating data files
https://www.youtube.com/watch?time_continue=309&v=Ry2xjTBtNFE – An excellent video introduction to best practices for organizing data in a spreadsheet that will subsequently be analyzed by statistics software.
Broman, K. W., & Woo, K. H. (2017). Data organization in spreadsheets (No. e3183v1). https://doi.org/10.7287/peerj.preprints.3183v1 – An excellent review of best practices for organizing data in a spreadsheet.
2.2.1.4 Explore with plots
Just for fun, let’s plot the data and reproduce Fig. 2A and B. We are using the qplot function, which is from the ggplot2 package. Two plots are made, and only a subset of the rows is plotted in each (in A, the subset in which playback_treatment=="treat" and, in B, the subset in which playback_treatment=="cont"). This book uses the ggplot2 package extensively.
qplot(x=nest_temperature_above_ambient, y=day13_mass, data=chick[playback_treatment=="treat"]) + geom_smooth(method="lm")
qplot(x=nest_temperature_above_ambient, y=day13_mass, data=chick[playback_treatment=="cont"]) + geom_smooth(method="lm")
2.2.2 Text File
The example dataset comes from an experiment on the effect of neonicotinoid pesticides on bumble bee colony growth.
file name: “Whitehorn, O’Connor, Wackers, Goulson (2012) Data from ‘Neonicotinoid pesticide reduces bumblebee colony growth and queen production’.csv.csv”
- Copy the title of the Dryad page, which is “Data from: Neonicotinoid pesticide reduces bumblebee colony growth and queen production”
- Create a new folder within “data” and paste in the copied title as the folder name
- Remove the colon from the name, so the folder name is “Data from Neonicotinoid pesticide reduces bumblebee colony growth and queen production”
- Download the .csv file into this folder
A .csv file is a text file that is comma-delimited, which means that the entries of a row are separated by commas. A text file is readable by any text editor software and most other kinds of software. Datasets that are stored as text files are typically saved as either .csv (where the entries of a row are separated by commas) or .txt (where the entries are separated by tabs). The base R way to read a .csv file is with the function read.csv(). The read.table() function is more versatile, as the delimiter can be specified. The function fread() from the data.table package is fast, smart, and flexible. It is smart in the sense that it guesses what the delimiter is. Unfortunately, because of spaces in the column labels for this file, fread guesses incorrectly (another reason why spaces in column labels should be avoided). To overcome this, the statement below specifies that the file contains a “header” (a line containing column labels).
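To see fread()’s guessing in action without downloading anything, fread() can read directly from a string via its text= argument (a toy sketch; the columns are hypothetical, loosely modeled on the bee file):

```r
library(data.table)

# A small comma-separated dataset as a string (hypothetical columns)
txt <- "colony,treatment,new_queens\n1,Control,5\n2,High,0\n"

# fread guesses the delimiter (comma) and that the first line is a header
bee_toy <- fread(text = txt)

# When the guess fails (e.g., spaces in the header row), be explicit
bee_toy2 <- fread(text = txt, header = TRUE, sep = ",")

all.equal(bee_toy, bee_toy2)  # TRUE
```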
data_folder <- "Data from Neonicotinoid pesticide reduces bumblebee colony growth and queen production"
filename <- "Whitehorn, O'Connor, Wackers, Goulson (2012) Data from 'Neonicotinoid pesticide reduces bumblebee colony growth and queen production'.csv.csv"
file_path <- paste(data_path, data_folder, filename, sep="/")
bee <- fread(file_path, header=TRUE)
Here, as with the import of the Excel file, the first three lines create the directory path to the file. Peek at the file in the console. Again, there are spaces in the column names. Here I’ll leave it to you to change this.
Here is a reproduction of Fig 2 from the journal article.
bee[, treatment := factor(treatment, c("Control", "Low", "High"))] # reorder factor levels
ggbarplot(data=bee, x="treatment", y="new_queens", add = "mean_se")
The plot immediately suggests some problems with the plot itself and the associated analysis. First, the y-axis is counts, which means that negative values are impossible. But the standard error bars look like they use standard errors computed from a model that allows infinitely large negative values, and the illustrated standard error bars imply that negative values exist. So these error bars are misleading. Second, it is good practice, especially if sample sizes are modest or small, to “show the data,” which means show the individual data points and not just a summary of the distribution.
Here are three alternative plots for exploratory purposes. The first simply “shows the data” but still uses the misleading standard error bars. The second uses a box plot. The last plots the means and 95% confidence intervals modeled with a GLM (generalized linear model) to account for the count data (the model used could be improved). Notice that the bar length above the mean is longer than the bar length below the mean (that is, the interval is asymmetric about the mean). In order to stay focused on importing data, I leave explanation of these plots and analysis to later chapters.
ggbarplot(data=bee, x="treatment", y="new_queens", add = c("mean_se", "point"))
ggboxplot(data=bee, x="treatment", y="new_queens")
fit.glm <- glm(new_queens ~ treatment, data=bee, family=poisson())
means.glm <- emmeans(fit.glm, specs="treatment", type = "response")
gg <- ggplot(data=data.frame(means.glm), aes(x=treatment, y=rate)) +
  geom_col(fill="gray") +
  geom_errorbar(aes(x=treatment, ymin=asymp.LCL, ymax=asymp.UCL), width=0.3) +
  ylab("New queens") +
  NULL
gg
2.3 Saving Data
For many projects, it is uncommon to save data. I might save simulated data if it takes a long time (hours or days) to generate these and I simply want to work with the simulated data in the future (and not have to regenerate it). Or I might save processed data if it takes a long time to process it and I want to analyze the processed data in the future and not have to re-process it.
Here, we will create a small, simulated data set and save it to the “output” folder. You should have an “output” folder directly (one level) inside the project folder. Create a folder within the output folder named “chapter_02”. Make sure that the line assigning the path to the output folder is in your setup chunk.
set.seed(1)
n <- 10
beta_0 <- 10.3
beta_1 <- 1.1
sigma <- 1.5
fake_data <- data.table(treatment=rep(c("Cn","Tr"), each=n),
                        weight=c(rnorm(n, mean=beta_0, sd=sigma),
                                 rnorm(n, mean=beta_0 + beta_1, sd=sigma)))
ggstripchart(data=fake_data, x="treatment", y="weight", add="mean_ci")
save_data_folder <- "chapter_02"
save_filename <- "my_first_fake_data.txt"
save_file_path <- paste(output_path, save_data_folder, save_filename, sep="/")
write.table(fake_data, save_file_path, sep="\t", quote=FALSE)
Look at your project directory to make sure the file is where it should be! We used write.table() to create a tab-delimited text file, using sep="\t" to specify tabs to separate the row elements. "\t" is the standard character string for a tab. Check in your “chapter_02” folder within “output” and open the file in a text editor.
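A quick way to convince yourself the saved file is intact is a round-trip: write it, read it back, compare. This sketch writes a small stand-in data frame (hypothetical values) to a temporary file rather than the “output” folder so it can run anywhere, and adds row.names=FALSE (which the chunk above does not use) so the read-back matches exactly:

```r
# Small stand-in for fake_data (hypothetical values)
fake_data <- data.frame(treatment = c("Cn", "Cn", "Tr", "Tr"),
                        weight = c(10.1, 9.8, 11.4, 11.0))

tmp <- tempfile(fileext = ".txt")
write.table(fake_data, tmp, sep = "\t", quote = FALSE, row.names = FALSE)

# read.table with header=TRUE and sep="\t" reverses write.table
back <- read.table(tmp, header = TRUE, sep = "\t")
all.equal(fake_data, back)  # TRUE if the file round-tripped cleanly
```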
2.4 Exercises

Download the dataset “data-Lodjak.et.al-2016-FuncEcol.xlsx” from the Dryad repository at https://datadryad.org/resource/doi:10.5061/dryad.rd01s. The .xlsx file presents the data cleanly, but the trade-off is that (1) multiple header rows, (2) spaces in the header labels, and (3) parentheses in the header labels make it more complex to import in a usable way. Import the data and plot Body Mass against Age (that is, make Body Mass the “Y” variable and Age the “X” variable) using the qplot function. You should recode the column labels to remove spaces and parentheses using the setnames function.
Download the dataset “Results2015.txt” from the Dryad repository at https://datadryad.org//resource/doi:10.5061/dryad.65vk4. Try to reproduce Fig. 1. It’s not easy. I’ve inserted the figure below.
(grad students only) Download and plot data from a Dryad Repository dataset of your choice.
(grad students only) Create fake experimental data with three treatment levels (control, lo_temp, high_temp). This will require three parameters: an intercept (beta_0), an effect of lo_temp (beta_1), and an effect of high_temp (beta_2). You should be able to plug and play from the script above even if you don’t understand at this point what any of it is! Plot it as a strip chart, as above.