Importing and Manipulating Data Frames
Using data in data frames
Learning Objectives
- Load external tabular data from a .csv file into R.
- Describe what an R data frame is.
- Summarize the contents of a data frame in R.
Looking at Metadata
We are studying a population of Escherichia coli (designated Ara-3), which were propagated for more than 40,000 generations in a glucose-limited minimal medium. This medium was supplemented with citrate which E. coli cannot metabolize in the aerobic conditions of the experiment. Sequencing of the populations at regular time points reveals that spontaneous citrate-using mutants (Cit+) appeared at around 31,000 generations. This metadata describes information on the Ara-3 clones. Read more about the Long Term Evolution Experiment here from Lenski.
The metadata file required for this lesson can be downloaded directly here. This data was taken from the Blount et al. paper in Nature.
The columns stand for:
Column | Description |
---|---|
sample | clone name |
generation | generation when the sample was frozen |
clade | based on phylogeny-based tree |
strain | ancestral strain |
cit | citrate-using mutant status |
run | Sequence read archive sample ID |
genome_size | size in Mbp (made up data for this lesson) |
Hint: If you can’t find the Ecoli_metadata.csv file, or have lost track of it, download the file directly using the R
download.file() function
Hint: Make sure you make a new folder to save the new Ecoli_metadata.csv file.
dir.create('data')
download.file("https://raw.githubusercontent.com/theheking/intro-to-r/gh-pages/data/Ecoli_metadata.csv", "data/Ecoli_metadata.csv")
Two methods to import data
Using the visual IDE
1) Select File --> Import Dataset --> From Text (base)...
2) Navigate to where you have downloaded your Ecoli_metadata.csv file.
3) It should look like the image below. Please try out the different settings.
4) This will automatically run the import in the console. Please make sure you copy and paste into your R script.
Directly in the console
You are now ready to load the data. We are going to use the R function read.csv()
to load the data file into memory (as a data.frame
):
metadata <- read.csv('data/Ecoli_metadata.csv')
This statement doesn’t produce any output because the assignment doesn’t display anything. If we want to check that our data has been loaded, we can print the variable’s value: metadata
Let’s check the top (the first 6 lines) of this data.frame
using the function head()
:
head(metadata)
## sample generation clade strain cit run genome_size
## 1 REL606 0 <NA> REL606 unknown 4.62
## 2 REL1166A 2000 unknown REL606 unknown SRR098028 4.63
## 3 ZDB409 5000 unknown REL606 unknown SRR098281 4.60
## 4 ZDB429 10000 UC REL606 unknown SRR098282 4.59
## 5 ZDB446 15000 UC REL606 unknown SRR098283 4.66
## 6 ZDB458 20000 (C1,C2) REL606 unknown SRR098284 4.63
We’ve just done two very useful things.
-
We’ve read our data in to R, so now we can work with it in R
-
We’ve created a data frame (with the read.csv command) in the standard way R works with data.
Debugging errors
Let’s create a very common error you might encounter.
(metadata <- read.csv('dataa/Ecoli_metadata.csv'))
Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") :
cannot open file 'dataa/Ecoli_metadata.csv': No such file or directory
Exercise
On your table can you please try to create a unique error message?
What to do when you get an error
- Don’t panic - you are not special
- First read the error message - what does it say
- Next, check your code for easily done errors (a) matched brackets (b) matched quotation marks (c) correct names
- Then, google the error message. However, this doesn’t always work very well because package developers often rely on the error catching provided by R.
- You end up with general error messages that might not be very helpful to diagnose a problem (e.g. “subscript out of bounds”).
- However, you should check stackoverflow.com. Search using the [r] tag. Most questions have already been answered, but the challenge is to use the right words in the search to find the answers: http://stackoverflow.com/questions/tagged/r
What are data frames?
A data.frame
is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str()
function is useful for inspecting the data types of the columns.
By default, data.frame
coerce) columns that contain characters (i.e., text) into the factor
data type. Depending on what you want to do with the data, you may want to keep these columns as character
. To do so, read.csv()
and read.table()
have an argument called stringsAsFactors
which can be set to FALSE
:
Let’s now check the structure of this data.frame
in more details with the function str()
:
str(metadata)
Inspecting data.frame
objects
We already saw how the functions head()
and str()
can be useful to check the content and the structure of a data.frame
. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.
- Size:
dim()
- returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)nrow()
- returns the number of rowsncol()
- returns the number of columns
- Content:
head()
- shows the first 6 rowstail()
- shows the last 6 rows
- Names:
names()
- returns the column names (synonym ofcolnames()
fordata.frame
objects)rownames()
- returns the row names
- Summary:
str()
- structure of the object and information about the class, length and content of each columnsummary()
- summary statistics for each column
Note: most of these functions are “generic”, they can be used on other types of objects besides data.frame
.
Challenge
Based on the given table of functions to asses data structure, can you answer the following questions?
1) What is the class of the object metadata?
2) How many rows and how many columns are in this object?
3) What are the column names for this data frame?
Subsetting Data frames
The metadata data frame has rows and columns (it has 2 dimensions), if we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers (i.e. [row, column]).
metadata[1, 2] # 1st element in the 2nd column
metadata[1, 6] # 1st element in the 6th column
metadata[1:3, 7] # First three elements in the 7th column
metadata[3, ] # 3rd element for all columns
metadata[, 7] # Entire 7th column
Using column names to subset data frames
For larger datasets, it can be tricky to remember the column number that corresponds to a particular variable. Sometimes the column number for a particular variable can change if your analysis adds or removes columns. The best practice when working with columns in a data frame is to refer to them by name. This also makes your code easier to read and your intentions clearer.
There are two ways to select a column by name from a data frame:
- Using
dataframe[ , "column_name"]
- Using
dataframe$column_name
You can do operations on a particular column, by selecting it using the $
sign.
In this case, the entire column is a vector. To see the possible columns names run
names(metadata)
colnames(metadata)
To subset just the strain column from the metadata or
metadata[ , "strain"]
metadata$strain
The first method allows you to select multiple columns at once. Suppose we wanted strain and clade information:
metadata[, c("strain", "clade")]
You can even access columns by column name and select specific rows of interest. For example, if we wanted the strain and clade of just rows 4 through 7, we could do:
metadata[4:7, c("strain", "clade")]
Exercise
Create a new variable of a dataframe that contains just the sample and genome size.
Why would you create a subset of the original dataframe?
Factors
In the data frame we just imported, let’s do:
str(metadata)
As you can see, many of the columns in our data frame are character datatypes. However, there is also a special class called factor
.
They are very useful but not necessarily intuitive and therefore require some attention.
Factors are used to represent categorical data. Factors can be ordered or unordered. They are an important class for statistical analysis and for plotting.
Factors are:
- stored as integers
- have labels associated with these unique integers.
- While factors look (and often behave) like character vectors, they are actually integers under the hood, and you need to be careful when treating them like strings.
Let us convert the clade columns from a character to a factor column using the factor
command.
factor(metadata[,3])
The levels associated with this factor are Levels: (C1,C2) C1 C2 C3 Cit+ UC unknown
.
Exercise
Let’s reread our metadata data frame, but set every character column to be a factor.
By looking at the help page, can you edit the command below with a new argument that would read all strings as factors?
metadata <- read.csv('data/Ecoli_metadata.csv')
Once created, factors can only contain a pre-defined set of values known as levels. By default, R always sorts levels in alphabetical order.
For instance, we see that cit
is a Factor w/ 3 levels, minus
, plus
and unknown
.
Material adapted from (https://datacarpentry.org/R-genomics/01-intro-to-R.html) and (https://datacarpentry.org/semester-biology/materials/r-intro/)