7 Data Frames
This is the big one! All of that stuff about vectors and lists was prologue to this. The data frame is a seminal concept in R. Most statistical and predictive models expect one and they are the most common way to pass data in and out of R. Although critical to understand, data frames are very, very easy to get. What’s a data frame? It’s a table. That’s it. No, really, that’s it.
By the end of this chapter you will know the following:
- What’s a data frame and how do I create one?
- How can I access and assign elements to and from a data frame?
- How do I read and write external data?
7.1 What’s a data frame?
Underneath the hood, a data frame is actually a mashup of lists and vectors. Every data frame is a list, but it's constrained so that each list element is a vector and all of those vectors have the same length. We can see how this exploits some of the fundamental properties of lists and vectors. Because all of the elements of a vector must have the same data type, there's no danger that I'll get character data when I only want dates or integers or whatever. At the same time, the list's flexibility means that we can store different data types in a single data construct.
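If you want to see this for yourself, try poking at a data frame with list functions. A minimal sketch, using the mtcars data frame that ships with R:
# a data frame is a list whose elements are equal-length vectors
is.list(mtcars)
#> [1] TRUE
lengths(mtcars)   # every column is a vector with the same length (the row count)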
7.1.1 Creating a data frame
Although there are many functions that will return a data frame, let's construct one from scratch with the data.frame function. We'll first create some vectors and then join them together.
set.seed(1234)
State <- rep(c("TX", "NY", "CA"), 10)
EarnedPremium <- rlnorm(length(State), meanlog = log(50000), sdlog=1)
Loss <- rlnorm(length(State), meanlog = log(50000 * 0.6), sdlog=1)
df <- data.frame(State, EarnedPremium, Loss, stringsAsFactors=FALSE)
Note that I've set the "stringsAsFactors" argument to FALSE. If I hadn't, the column "State" would be a factor, rather than a character vector. See Data Types: factors for some reasons why we might not want our data to be a factor.
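As a quick check of the difference, compare the two constructions below; dfFactor is just an illustrative name. (Note also that since R 4.0, data.frame defaults to stringsAsFactors = FALSE, so the explicit argument matters mostly on older versions.)
# with stringsAsFactors = FALSE, State stays character
class(df$State)
#> [1] "character"
# the same construction with stringsAsFactors = TRUE turns State into a factor
dfFactor <- data.frame(State, EarnedPremium, Loss, stringsAsFactors = TRUE)
class(dfFactor$State)
#> [1] "factor"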
We didn't have to create the vectors before constructing the data frame. If we like, we can pass them in as the result of function calls within the call to data.frame.
set.seed(1234)
df <- data.frame(State = rep(c("TX", "NY", "CA"), 10)
, EarnedPremium = round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3)
, stringsAsFactors = FALSE)
When constructing a data frame, there will always be some implicit metadata. For instance, every column must be named; if you don’t provide one, names will be provided for you. R will use variable names by default. If there aren’t any to be found, you’ll get the ugliness you see in the second example below.
df <- data.frame(State, EarnedPremium)
names(df)
#> [1] "State" "EarnedPremium"
df <- data.frame(rep(c("TX", "NY", "CA"), 10)
, round(rlnorm(length(State), meanlog = log(50000), sdlog=1), 3))
names(df)
#> [1] "rep.c..TX....NY....CA....10."
#> [2] "round.rlnorm.length.State...meanlog...log.50000...sdlog...1..."
Data frames will also have dimensional attributes set automatically.
dim(df)
#> [1] 30 2
This example is small, but it doesn't have to be. We could have a lot more rows and a lot more columns. I lose focus beyond about 50 columns (i.e. once we've moved past column "BA" in a spreadsheet), but I've worked with data frames having more than one million rows on a number of occasions. Is that big data? I couldn't say, but I'm not all that fussed about whether my data is "big" or not. The basic concept of a data frame scales incredibly well. I'll use the same code to construct a model whether my data has 10 observations or 10 million.
7.1.2 Basic properties of a data frame
Once created, it’s straightforward to get some basic information about the data we have.
summary(df)
#> rep.c..TX....NY....CA....10.
#> CA:10
#> NY:10
#> TX:10
#>
#>
#>
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> Min. : 5652
#> 1st Qu.: 16503
#> Median : 23835
#> Mean : 50519
#> 3rd Qu.: 35784
#> Max. :259781
str(df)
#> 'data.frame': 30 obs. of 2 variables:
#> $ rep.c..TX....NY....CA....10. : Factor w/ 3 levels "CA","NY","TX": 3 2 1 3 2 1 3 2 1 3 ...
#> $ round.rlnorm.length.State...meanlog...log.50000...sdlog...1...: num 150554 31076 24596 30288 9805 ...
head(df)
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
tail(df)
#> rep.c..TX....NY....CA....10.
#> 25 TX
#> 26 NY
#> 27 CA
#> 28 TX
#> 29 NY
#> 30 CA
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 25 42508.90
#> 26 87801.52
#> 27 259781.39
#> 28 23073.15
#> 29 249119.48
#> 30 15708.70
We can also query metadata about our data frame. Note that, for a data frame, names and colnames will return the same result. dim will give the number of rows and columns, or you can use nrow and ncol directly. In particular, note what result is returned by the length function. If you think about the fact that a data frame is actually a list, you may be able to guess why length returns the value it does.
names(df)
#> [1] "rep.c..TX....NY....CA....10."
#> [2] "round.rlnorm.length.State...meanlog...log.50000...sdlog...1..."
colnames(df)
#> [1] "rep.c..TX....NY....CA....10."
#> [2] "round.rlnorm.length.State...meanlog...log.50000...sdlog...1..."
dim(df)
#> [1] 30 2
length(df)
#> [1] 2
nrow(df)
#> [1] 30
ncol(df)
#> [1] 2
7.1.3 Combining data frames
If you've got more than one data frame, it's possible to combine them in several different ways: rbind, cbind and merge.
rbind will append rows to a data frame. The data frames being combined must have the same columns, with matching names and data types.
dfA = df[1:10,]
dfB = df[11:20, ]
rbind(dfA, dfB)
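Note that rbind matches columns by name, not just by position. A quick sketch of what that means in practice; dfD here is a hypothetical data frame with different column names.
nrow(rbind(dfA, dfB))
#> [1] 20
dfD <- data.frame(Region = "TX", Premium = 100)
# rbind(dfA, dfD)   # fails: the column names do not match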
cbind will append columns. The new columns must have the same number of rows as the data frame.
dfC = dfA[, 1:2]
cbind(dfA, dfC)
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> 7 TX
#> 8 NY
#> 9 CA
#> 10 TX
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
#> 7 5651.852
#> 8 13079.287
#> 9 37252.876
#> 10 31378.579
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> 7 TX
#> 8 NY
#> 9 CA
#> 10 TX
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
#> 7 5651.852
#> 8 13079.287
#> 9 37252.876
#> 10 31378.579
merge is similar to a JOIN operation in a relational database. If you're used to VLOOKUP in Excel, you'll love merge. It's possible to use multiple columns (e.g. state and effective date) when specifying how to join. If no columns are specified, merge will use whatever column names the two data frames have in common. Below, we merge a set of rate changes to our premium data.
dfRateChange = data.frame(State = c("TX", "CA", "NY"), RateChange = c(.05, -.1, .2))
df = merge(df, dfRateChange)
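merge also takes a by argument (or by.x and by.y when the key columns are named differently in the two data frames) and the all.x / all.y / all arguments to control outer joins. A sketch, where dfPremium stands in for any data frame with a State column:
# left join: keep every premium row, filling RateChange with NA when there is no match
dfJoined <- merge(dfPremium, dfRateChange, by = "State", all.x = TRUE)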
7.1.4 expand.grid
Consider expand.grid. This will create a data frame as the cartesian product (i.e. every combination of elements) of one or more vectors. One use case is to check for missing combinations in another data frame, as sketched after the example below.
dfStateClass <- expand.grid(State = c("TX", "CA", "NY")
, Class = c(1776, 1066, 1492))
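Here is a sketch of that missing-data check. dfObserved is a hypothetical data frame of the State/Class combinations that actually appear in our data, with a Premium column:
# left join the full grid against the observed data
dfCheck <- merge(dfStateClass, dfObserved, by = c("State", "Class"), all.x = TRUE)
# combinations with no observed premium are the missing ones
dfMissing <- dfCheck[is.na(dfCheck$Premium), ]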
7.2 Access and Assignment
Access and assignment will feel like a weird combination of matrices and lists, though with an emphasis on the mechanics of a 2D matrix. We'll often use the [ ] operator to specify which row and column we'd like. The first argument will refer to the row and the second will refer to the column. If either argument is left blank, it will refer to every element.
df[1, 2]
df[2, ]
df[, 2]
df[2, -1]
Note the interesting case when we specify one argument but leave the other blank. I'll wrap it in a head function to minimize the output.
head(df[2])
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
Once again, if you bear in mind that a data frame is a list, this should make sense. Without the additional comma, R will assume that we're dealing with a list and will respond accordingly. In this case, that means returning the second element of the list. You can extend this by requesting multiple elements of the list.
head(df[1:2])
#> rep.c..TX....NY....CA....10.
#> 1 TX
#> 2 NY
#> 3 CA
#> 4 TX
#> 5 NY
#> 6 CA
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
#> 1 150553.808
#> 2 31075.817
#> 3 24595.979
#> 4 30288.404
#> 5 9805.364
#> 6 15555.336
It'll probably be rare that you want this sort of behavior. There are many times that you'll want to return a single vector from a data frame, and in those cases, I'd recommend using either the $ or the [[ ]] operator. It's common to use $ to return a single column from a data frame. This will force you to use the name of the column and so provides a more expressive way to work with your data.
df$EarnedPremium
head(df[[1]])
head(df[["EarnedPremium"]])
We can also pass a character vector to get one or more columns by name, rather than by position. Note that a single name (with the comma) returns a vector, while multiple names return a data frame.
head(df[, "EarnedPremium"])
#> [1] 14953.677 65986.637 147889.324 4789.018 76795.627 82936.802
head(df[, c("EarnedPremium", "State")])
#> EarnedPremium State
#> 1 14953.677 TX
#> 2 65986.637 NY
#> 3 147889.324 CA
#> 4 4789.018 TX
#> 5 76795.627 NY
#> 6 82936.802 CA
7.2.1 Subsetting
Remember that the same logical and positional indexing which applies to vectors may be used to pick out rows of a data frame.
dfTX = df[df$State == "TX", ]
dfBigPolicies = df[df$EarnedPremium >= 50000, ]
whichState = (df$State == "TX")
dfTX = df[whichState, ]
df10 <- df[1:10, ]
If I'm only looking to filter rows, an easier way would be to use the subset function.
dfTX = subset(df, State == "TX")
dfBigPolicies = subset(df, EarnedPremium >= 50000)
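subset also has a select argument, so we can filter rows and pick columns in a single call:
# rows for TX, keeping only State and EarnedPremium
dfTXPremium <- subset(df, State == "TX", select = c(State, EarnedPremium))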
Bringing that all together, we can combine row and column access in crazy ways. Here, we'll get the premium and state for the rows which have losses greater than some value. Let's say we want everything above the 80th percentile, but rounded to the nearest thousand.
minLoss <- round(quantile(df$Loss, 0.8), -3)
myDF <- df[df$Loss >= minLoss, c("EarnedPremium", "State")]
7.2.2 Altering and adding columns
We can add columns by using the same operators as for access, though it's common practice to use $ and the name of our new column.
df$LossRatio = df$Loss / df$EarnedPremium
The transform function will return a new data frame with transformed columns.
df <- transform(df, LogLoss = log(Loss), LogPremium = log(EarnedPremium))
I find with to be a bit easier than transform, though it can't (easily) be used to create more than one column at a time. with will refer to the columns of a data frame in the statement where it's called. We typically use this to return a single vector which we then assign as a new column to a data frame.
df$LogLoss = with(df, log(Loss))
7.2.3 Eliminating columns
We can eliminate columns in one of two ways. If we only want to remove one column, we can assign the value NULL to it.
df$LossRatio <- NULL
If we want to eliminate more than one column, we may construct a new data frame which only includes the columns we’d like to keep.
df <- df[, 1:2]
df <- df[, c("State", "EarnedPremium")]
If we'd like to remove specific columns, Hadley Wickham (Wickham, n.d.) showed a nice way to do this using set differences.
df <- df[, setdiff(colnames(df), c("RateChange", "Loss"))]
7.2.4 Altering column names
We can rename a column by assigning a new value to the corresponding element of colnames. Below, we create a loss ratio column and then give it a friendlier name.
df$LossRatio = with(df, Loss / EarnedPremium)
colnames(df)[4] = "Loss Ratio"
colnames(df)
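Renaming by position breaks if the column order ever changes. A sketch that matches on the current name instead (assuming the "Loss Ratio" column created above exists):
# rename by matching the existing name rather than relying on position
colnames(df)[colnames(df) == "Loss Ratio"] <- "LossRatio"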
7.2.5 Ordering
For vectors, we can use sort to get a sorted vector. For a data frame, we'll need to use order. Remember that order functions a lot like which. It will return indices which may then be used to return specific contents.
order(df$EarnedPremium)
#> [1] 4 26 1 28 12 30 18 10 19 13 25 7 9 8 17 22 11 23 16 29 14 21 2
#> [24] 5 24 6 27 15 3 20
df = df[order(df$EarnedPremium), ]
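order will take more than one key, and a numeric key may be negated to sort in descending order. A sketch:
# sort by State, then by EarnedPremium within each state, largest first
df = df[order(df$State, -df$EarnedPremium), ]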
7.2.6 attach
If the name of a data frame is long, typing it to access column elements might start to seem tedious. The attach function will alleviate this by attaching the data frame onto something called the "search path" (which I might have described in the section on packages). What's the search path? Well, all evidence to the contrary, R will look high and low every time you refer to something. As soon as it finds a match, it'll proceed with whatever calculation you've asked it to do. Attaching the data frame to the search path means that the column names of the data frame will be added to the list of places where R will search.
attach(dfA)
# do some stuff
attach(dfB)
#> The following objects are masked from dfA:
#>
#> rep.c..TX....NY....CA....10.,
#> round.rlnorm.length.State...meanlog...log.50000...sdlog...1...
There are many references that suggest using attach. I don't. I'll actually advise against it. Why is attach a bad idea? If you attach a number of similar data frames or packages, it can quickly get hard to keep track of where things are coming from. The worst case scenario comes when you attach two data frames that share column names and then proceed to carry out analysis.
mean(EarnedPremium)
#> [1] 61457.34
Which EarnedPremium am I talking about? "Well to tell you the truth in all this excitement I kinda lost track myself." attach won't necessarily harm you. But it is a loaded gun. If you get tired of typing the name of a data frame, consider using with.
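A sketch of that alternative: with names the data frame once and leaves nothing on the search path. And if you have already attached something, remember to detach it.
# the columns of df are visible only inside this one expression
with(df, mean(EarnedPremium))
# clean up anything attached earlier
detach(dfA)
detach(dfB)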
7.3 Summarizing
We can use the standard suite of aggregation functions to summarize a single column of a data frame.
sum(df$EarnedPremium)
#> [1] 1843720
max(df$EarnedPremium)
#> [1] 559956
For multiple columns, base R comes with the aggregate function. Its easiest implementation uses a formula which shows how we want the variable split. In the call below, we're splitting earned premium by state.
aggregate(EarnedPremium ~ State, data = df, FUN = sum)
#> State EarnedPremium
#> 1 CA 673095.7
#> 2 NY 939215.3
#> 3 TX 231409.1
That's easy to type, but will only give me one column at a time. If I want multiple columns, I'll need to do something like the following:
aggregate(df[, c("EarnedPremium", "Loss")], by = df["State"], FUN = sum)
#> State EarnedPremium Loss
#> 1 CA 673095.7 257519.6
#> 2 NY 939215.3 423081.7
#> 3 TX 231409.1 228745.6
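The formula interface can also handle more than one measure at a time by wrapping them in cbind, which I find a bit tidier than the list form above. A sketch:
# sum both EarnedPremium and Loss by State in one formula
aggregate(cbind(EarnedPremium, Loss) ~ State, data = df, FUN = sum)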
Although aggregate gets the job done, I often find it cumbersome for anything complex. Ever since I started using it, I've been hooked on the dplyr package. We'll not go into it here, but I'll recommend checking it out. However, spend a bit of time playing with aggregate first. There are truckloads of packages designed to make manipulation of data frames easier. They're all powerful and they're all a little different. I'd suggest learning the functions in base R first, then moving on to tools like dplyr and data.table. There's a lot to be gained from understanding the problems those packages were created to solve.
7.4 Reading and writing external data
First, we’ll write out some data. Then we’ll read it back in. The comma separated value (CSV) format is one of the easiest to work with.
write.csv(df, "SomeFile.csv")
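By default, write.csv also writes a column of row names, which you usually don't want when the file is headed somewhere else; setting row.names = FALSE suppresses it.
# omit the leading column of row names
write.csv(df, "SomeFile.csv", row.names = FALSE)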
To read data back in, we use the (wait for it) read.csv function.
myDF <- read.csv('SomeFile.csv')
We can use this same function to read data from the web. If it’s accessible and we’ve provided a proper URL, we can treat it the same as anything on our local file system.
URL <- "http://www.casact.org/research/reserve_data/ppauto_pos.csv"
df <- read.csv(URL, stringsAsFactors = FALSE)
7.5 Exercises
- Create a data frame with 500 rows. Include a column for policy effective date, policy expiration date and claim count. For claim count, consider using the rpois function to simulate from a Poisson distribution.
- Save this data to a .CSV file and then reload that file.
- Which policy had the most claims? Which policy year?
- Add a column for