7 Data Frames
This is the big one! All of that stuff about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical and predictive models expect one and they are the most common way to pass data in and out of R. Although critical to understand, data frames are very, very easy to get. What’s a data frame? It’s a table. That’s it. No, really, that’s it.
By the end of this chapter you will know the following:
- What’s a data frame and how do I create one?
- How can I access and assign elements to and from a data frame?
- How do I read and write external data?
7.1 What’s a data frame?
Underneath the hood, a data frame is actually a mashup of lists and vectors. Every data frame is a list, but it's constrained so that each list element is a vector and all of those vectors have the same length. We can see how this exploits some of the fundamental properties of lists and vectors. Because all of the elements of a vector must have the same data type, there's no danger that I'll get character data when I only want dates or integers or whatever. At the same time, the list's flexibility means that we can store different data types in a single data construct.
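If you want to see this for yourself, try poking at a data frame with list functions. A minimal sketch, using the mtcars data frame that ships with R:
# a data frame is a list whose elements are equal-length vectors
is.list(mtcars)
#> [1] TRUE
lengths(mtcars)   # every column is a vector with the same length (the row count)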
7.1.1 Creating a data frame
Although there are many functions that will return a data frame, let's construct one from scratch with the data.frame function. We'll first create some vectors and then join them together.
set.seed(1234)
State <- rep(c("TX", "NY", "CA"), 10)
EarnedPremium <- rlnorm(length(State), meanlog = log(50000), sdlog=1)
Loss <- rlnorm(length(State), meanlog = log(50000 * 0.6), sdlog=1)
df <- data.frame(State, EarnedPremium, Loss, stringsAsFactors=FALSE)
Note that I've set the "stringsAsFactors" argument to FALSE. If I hadn't, the column "State" would be a factor, rather than a character vector. See Data Types: factors for some reasons why we might not want our data to be a factor.
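As a quick check of the difference, compare the two constructions below; dfFactor is just an illustrative name. (Note also that since R 4.0, data.frame defaults to stringsAsFactors = FALSE, so the explicit argument matters mostly on older versions.)
# with stringsAsFactors = FALSE, State stays character
class(df$State)
#> [1] "character"
# the same construction with stringsAsFactors = TRUE turns State into a factor
dfFactor <- data.frame(State, EarnedPremium, Loss, stringsAsFactors = TRUE)
class(dfFactor$State)
#> [1] "factor"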
We didn't have to create the vectors before constructing the data frame. If we like, we can pass them in as the result of function calls within the call to data.frame.
set.seed(1234)
df <- data.frame(State = rep(c("TX", "NY", "CA"), 10)
, EarnedPremium = round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3)
, stringsAsFactors = FALSE)
When constructing a data frame, there will always be some implicit metadata. For instance, every column must be named; if you don’t provide one, names will be provided for you. R will use variable names by default. If there aren’t any to be found, you’ll get the ugliness you see in the second example below.
df <- data.frame(State, EarnedPremium)
names(df)
#> [1] "State" "EarnedPremium"
df <- data.frame(rep(c("TX", "NY", "CA"), 10)
, round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3))
names(df)
#> [1] "rep.c..TX....NY....CA....10."
#> [2] "round.rlnorm.length.State...meanlog...log.50000...sdlog...1..."
Data frames will also have dimensional attributes set automatically.
dim(df)
#> [1] 30 2
This example is small, but it doesn't have to be. We could have a lot more rows and a lot more columns. I lose focus beyond about 50 columns (i.e. once we've moved past column "BA" in a spreadsheet), but I've worked with data frames having more than one million rows on a number of occasions. Is that big data? I couldn't say, but I'm not all that fussed about whether my data is "big" or not. The basic concept of a data frame scales incredibly well. I'll use the same code to construct a model whether my data has 10 observations or 10 million.
7.1.2 Basic properties of a data frame
Once created, it’s straightforward to get some basic information about the data we have.
summary(df)
#> rep.c..TX....NY....CA....10.
#> CA:10
#> NY:10
#> TX:10
#>
#>
#>
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> Min. : 5652
#> 1st Qu.: 16503
#> Median : 23835
#> Mean : 50519
#> 3rd Qu.: 35784
#> Max. :259781
str(df)
#> 'data.frame': 30 obs. of 2 variables:
#> $ rep.c..TX....NY....CA....10. : Factor w/ 3 levels "CA","NY","TX": 3 2 1 3 2 1 3 2 1 3 ...
#> $ round.rlnorm.length.State...meanlog...log.50000...sdlog...1...: num 150554 31076 24596 30288 9805 ...
head(df)
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
tail(df)
#> rep.c..TX....NY....CA....10.
#> 25 TX
#> 26 NY
#> 27 CA
#> 28 TX
#> 29 NY
#> 30 CA
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 25 42508.90
#> 26 87801.52
#> 27 259781.39
#> 28 23073.15
#> 29 249119.48
#> 30 15708.70
We can also query metadata about our data frame. Note that, for a data frame, names and colnames will return the same result. dim will give the number of rows and columns, or you can use nrow and ncol directly. In particular, note what result is returned by the length function. If you think about the fact that a data frame is actually a list, you may be able to guess why length returns the value it does.
names(df)
#> [1] "rep.c..TX....NY....CA....10."
#> [2] "round.rlnorm.length.State...meanlog...log.50000...sdlog...1..."
colnames(df)
#> [1] "rep.c..TX....NY....CA....10."
#> [2] "round.rlnorm.length.State...meanlog...log.50000...sdlog...1..."
dim(df)
#> [1] 30 2
length(df)
#> [1] 2
nrow(df)
#> [1] 30
ncol(df)
#> [1] 2
7.1.3 Combining data frames
If you've got more than one data frame, it's possible to combine them in several different ways: rbind, cbind and merge.
rbind will append rows to a data frame. The data frames being combined must have the same columns, with matching names and data types.
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
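Note that rbind matches columns by name, not just by position. A quick sketch of what that means in practice; dfD here is a hypothetical data frame with different column names.
nrow(rbind(dfA, dfB))
#> [1] 20
dfD <- data.frame(Region = "TX", Premium = 100)
# rbind(dfA, dfD)   # fails: the column names do not match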
cbind will append columns. The new columns must have the same number of rows as the data frame.
dfC = dfA[, 1:2]
cbind(dfA, dfC)
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> 7 TX
#> 8 NY
#> 9 CA
#> 10 TX
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
#> 7 5651.852
#> 8 13079.287
#> 9 37252.876
#> 10 31378.579
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> 7 TX
#> 8 NY
#> 9 CA
#> 10 TX
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
#> 7 5651.852
#> 8 13079.287
#> 9 37252.876
#> 10 31378.579
merge is similar to a JOIN operation in a relational database. If you're used to VLOOKUP in Excel, you'll love merge. It's possible to use multiple columns (e.g. state and effective date) when specifying how to join. If no columns are specified, merge will use whatever column names the two data frames have in common. Below, we merge a set of rate changes to our premium data.
dfRateChange = data.frame(State = c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
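merge also takes a by argument (or by.x and by.y when the key columns are named differently in the two data frames) and the all.x / all.y / all arguments to control outer joins. A sketch, where dfPremium stands in for any data frame with a State column:
# left join: keep every premium row, filling RateChange with NA when there is no match
dfJoined <- merge(dfPremium, dfRateChange, by = "State", all.x = TRUE)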
7.1.4 expand.grid
Consider expand.grid. This will create a data frame as the cartesian product (i.e. every combination of elements) of one or more vectors. One use case is to check for missing combinations in another data frame, as sketched after the example below.
dfStateClass <- expand.grid(State = c("TX", "CA", "NY")
, Class = c(1776, 1066, 1492))
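Here is a sketch of that missing-data check. dfObserved is a hypothetical data frame of the State/Class combinations that actually appear in our data, with a Premium column:
# left join the full grid against the observed data
dfCheck <- merge(dfStateClass, dfObserved, by = c("State", "Class"), all.x = TRUE)
# combinations with no observed premium are the missing ones
dfMissing <- dfCheck[is.na(dfCheck$Premium), ]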
7.2 Access and Assignment
Access and assignment will feel like a weird combination of matrices and lists, though with an emphasis on the mechanics of a 2D matrix. We'll often use the [ ] operator to specify which row and column we'd like. The first argument will refer to the row and the second will refer to the column. If either argument is left blank, it will refer to every element.
df[1, 2]
df[2, ]
df[, 2]
df[2, -1]
Note the interesting case when we specify one argument but leave the other blank. I'll wrap it in a head function to minimize the output.
head(df[2])
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
Once again, if you bear in mind that a data frame is a list, this should make sense. Without the additional comma, R will assume that we're dealing with a list and will respond accordingly. In this case, that means returning the second element of the list. You can extend this by requesting multiple elements of the list.
head(df[1:2])
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
It'll probably be rare that you want this sort of behavior. There are many times that you'll want to return a single vector from a data frame, and in those cases, I'd recommend using either the $ or the [[ ]] operator. It's common to use $ to return a single column from a data frame. This will force you to use the name of the column and so provides a more expressive way to work with your data.
df$EarnedPremium
head(df[[1]])
head(df[["EarnedPremium"]])
We can also pass a character vector to get one or more columns by name, rather than by position. Note that a single name (with the comma) returns a vector, while multiple names return a data frame.
head(df[, "EarnedPremium"])
#> [1] 14953.677 65986.637 147889.324 4789.018 76795.627 82936.802
head(df[, c("EarnedPremium", "State")])
#> EarnedPremium State
#> 1 14953.677 TX
#> 2 65986.637 NY
#> 3 147889.324 CA
#> 4 4789.018 TX
#> 5 76795.627 NY
#> 6 82936.802 CA
7.2.1 Subsetting
Remember that the same logical and positional indexing which applies to vectors may be used to pick out rows of a data frame.
dfTX = df[df$State == "TX", ]
dfBigPolicies = df[df$EarnedPremium >= 50000, ]
whichState = (df$State == "TX")
dfTX = df[whichState, ]
df10 <- df[1:10, ]
If I'm only looking to filter rows, an easier way would be to use the subset function.
dfTX = subset(df, State == "TX")
dfBigPolicies = subset(df, EarnedPremium >= 50000)
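subset also has a select argument, so we can filter rows and pick columns in a single call:
# rows for TX, keeping only State and EarnedPremium
dfTXPremium <- subset(df, State == "TX", select = c(State, EarnedPremium))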
Bringing that all together, we can combine row and column access in crazy ways. Here, we'll get the premium and state for the rows which have losses greater than some value. Let's say we want everything above the 80th percentile, but rounded to the nearest thousand.
minLoss <- round(quantile(df$Loss, 0.8), -3)
myDF <- df[df$Loss >= minLoss, c("EarnedPremium", "State")]
7.2.2 Altering and adding columns
We can add columns by using the same operators as for access, though it's common practice to use $ and the name of our new column.
df$LossRatio = df$Loss / df$EarnedPremium
The transform function will return a new data frame with transformed columns.
df <- transform(df, LogLoss = log(Loss), LogPremium = log(EarnedPremium))
I find with to be a bit easier than transform, though it can't (easily) be used to create more than one column at a time. with will refer to the columns of a data frame in the statement where it's called. We typically use this to return a single vector which we then assign as a new column to a data frame.
df$LogLoss = with(df, log(Loss))
7.2.3 Eliminating columns
We can eliminate columns in one of two ways. If we only want to remove one column, we can assign the value NULL to it.
df$LossRatio <- NULL
If we want to eliminate more than one column, we may construct a new data frame which only includes the columns we’d like to keep.
df <- df[, 1:2]
df <- df[, c("State", "EarnedPremium")]
If we'd like to remove specific columns, Hadley Wickham (Wickham, n.d.) showed a nice way to do this using set differences.
df <- df[, setdiff(colnames(df), c("RateChange", "Loss"))]
7.2.4 Altering column names
We can rename a column by assigning a new value to the corresponding element of colnames. Below, we create a loss ratio column and then give it a friendlier name.
df$LossRatio = with(df, Loss / EarnedPremium)
colnames(df)[4] = "Loss Ratio"
colnames(df)
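Renaming by position breaks if the column order ever changes. A sketch that matches on the current name instead (assuming the "Loss Ratio" column created above exists):
# rename by matching the existing name rather than relying on position
colnames(df)[colnames(df) == "Loss Ratio"] <- "LossRatio"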
7.2.5 Ordering
For vectors, we can use sort to get a sorted vector. For a data frame, we'll need to use order. Remember that order functions a lot like which. It will return indices which may then be used to return specific contents.
order(df$EarnedPremium)
#> [1] 4 26 1 28 12 30 18 10 19 13 25 7 9 8 17 22 11 23 16 29 14 21 2
#> [24] 5 24 6 27 15 3 20
df = df[order(df$EarnedPremium), ]
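order will take more than one key, and a numeric key may be negated to sort in descending order. A sketch:
# sort by State, then by EarnedPremium within each state, largest first
df = df[order(df$State, -df$EarnedPremium), ]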
7.2.6 attach
If the name of a data frame is long, typing it to access column elements might start to seem tedious. The attach function will alleviate this by attaching the data frame onto something called the "search path" (which I might have described in the section on packages). What's the search path? Well, all evidence to the contrary, R will look high and low every time you refer to something. As soon as it finds a match, it'll proceed with whatever calculation you've asked it to do. Attaching the data frame to the search path means that the column names of the data frame will be added to the list of places where R will search.
attach(dfA)
# do some stuff
attach(dfB)
#> The following objects are masked from dfA:
#>
#> rep.c..TX....NY....CA....10.,
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
There are many references that suggest using attach. I don't. I'll actually advise against it. Why is attach a bad idea? If you attach a number of similar data frames or packages, it can quickly get hard to keep track of where things are coming from. The worst case scenario comes when you attach two data frames that share column names and then proceed to carry out analysis.
mean(EarnedPremium)
#> [1] 61457.34
Which EarnedPremium am I talking about? "Well to tell you the truth in all this excitement I kinda lost track myself." attach won't necessarily harm you. But it is a loaded gun. If you get tired of typing the name of a data frame, consider using with.
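A sketch of that alternative: with names the data frame once and leaves nothing on the search path. And if you have already attached something, remember to detach it.
# the columns of df are visible only inside this one expression
with(df, mean(EarnedPremium))
# clean up anything attached earlier
detach(dfA)
detach(dfB)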
7.3 Summarizing
We can use the standard suite of aggregation functions to summarize a single column of a data frame.
sum(df$EarnedPremium)
#> [1] 1843720
max(df$EarnedPremium)
#> [1] 559956
For multiple columns, base R comes with the aggregate function. Its easiest implementation uses a formula which shows how we want the variable split. In the call below, we're splitting earned premium by state.
aggregate(EarnedPremium ~ State, data = df, FUN = sum)
#> State EarnedPremium
#> 1 CA 673095.7
#> 2 NY 939215.3
#> 3 TX 231409.1
That's easy to type, but will only give me one column at a time. If I want multiple columns, I'll need to do something like the following:
aggregate(df[, c("EarnedPremium", "Loss")], by = df["State"], FUN = sum)
#> State EarnedPremium Loss
#> 1 CA 673095.7 257519.6
#> 2 NY 939215.3 423081.7
#> 3 TX 231409.1 228745.6
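The formula interface can also handle more than one measure at a time by wrapping them in cbind, which I find a bit tidier than the list form above. A sketch:
# sum both EarnedPremium and Loss by State in one formula
aggregate(cbind(EarnedPremium, Loss) ~ State, data = df, FUN = sum)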
Although aggregate gets the job done, I often find it cumbersome for anything complex. Ever since I started using it, I've been hooked on the dplyr package. We'll not go into it here, but I'll recommend checking it out. However, spend a bit of time playing with aggregate first. There are truckloads of packages designed to make manipulation of data frames easier. They're all powerful and they're all a little different. I'd suggest learning the functions in base R first, then moving on to tools like dplyr and data.table. There's a lot to be gained from understanding the problems those packages were created to solve.
7.4 Reading and writing external data
First, we’ll write out some data. Then we’ll read it back in. The comma separated value (CSV) format is one of the easiest to work with.
write.csv(df, "SomeFile.csv")
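By default, write.csv also writes a column of row names, which you usually don't want when the file is headed somewhere else; setting row.names = FALSE suppresses it.
# omit the leading column of row names
write.csv(df, "SomeFile.csv", row.names = FALSE)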
To read data back in, we use the (wait for it) read.csv function.
myDF <- read.csv('SomeFile.csv')
We can use this same function to read data from the web. If it’s accessible and we’ve provided a proper URL, we can treat it the same as anything on our local file system.
URL <- "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df <- read.csv(URL, stringsAsFactors = FALSE)
7.5 Exercises
- Create a data frame with 500 rows. Include a column for policy effective date, policy expiration date and claim count. For claim count, consider using the rpois function to simulate from a Poisson distribution.
- Save this data to a .CSV file and then reload that file.
- Which policy had the most claims? Which policy year?
- Add a column for