More fun with data frames

Data frames are such a straightforward and essential element of R that it’s easy to lose sight of some of their peculiarities. Last week, I developed some code which would tear apart some data frames and create new ones based on columns specified by the user. This would allow me to dynamically create new data frames for later processing. Everything worked fine until one of those downstream processes threw an error. A bit of digging allowed me to see that the behavior when extracting data frame columns is ever so slightly different when extracting only one column.

Guess what output I get from the two names() calls below.

myData = data.frame(State = c("NY","NY", "TX", "TX")
                    , Premium = c(100,200,150,75)
                    , Loss = c(80,175,80,80)
                    , ALAE = c(10, 20, 15, 5))

whichColumns = c("State", "Premium")
myColumns = myData[, whichColumns]
names(myColumns)

whichColumns = "State"
myColumns = myData[, whichColumns]
names(myColumns)

Of course it’s blindingly obvious after a moment’s reflection. The first names call gives “State” and “Premium”; the second gives NULL. If you’re only retrieving one column with the two-index form, R won’t return something more complex than a plain vector, and that vector doesn’t carry the column name with it. OK. Let’s have it return a data frame.

myColumns = as.data.frame(myData[, whichColumns])
names(myColumns)

This probably isn’t what one would want. The name of the column in the data frame isn’t “State” as it was in the original data frame.
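For anyone who wants to check this without running the snippets one at a time, here is a small sketch that inspects each result. It rebuilds the same toy data; the variable names twoCols, oneCol and rewrapped are mine, introduced only for illustration.

```r
# Same toy data as above.
myData = data.frame(State = c("NY", "NY", "TX", "TX")
                    , Premium = c(100, 200, 150, 75)
                    , Loss = c(80, 175, 80, 80)
                    , ALAE = c(10, 20, 15, 5))

twoCols = myData[, c("State", "Premium")]
oneCol = myData[, "State"]

is.data.frame(twoCols)   # TRUE: two columns come back as a data frame
is.data.frame(oneCol)    # FALSE: one column is dropped to a bare vector
names(oneCol)            # NULL: the column name did not come along

# Wrapping the bare vector again gives a data frame, but R has to
# make up a column name because "State" is no longer attached.
rewrapped = as.data.frame(oneCol)
names(rewrapped)
```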

The help for data frames has a rather telling statement: “How the names of the data frame are created is complex, and the rest of this paragraph is only the basic story.” What really matters is that extracting a single column discards the column name. Again, a plain vector doesn’t carry the name of the column it came from. Once the name is gone, as.data.frame has nothing to use for a name and does the best it can. The remedy is to change the name of the resultant data frame. Hardly a catastrophe, but it means that I now need two lines of code, rather than one.

myColumns = as.data.frame(myData[, whichColumns])
names(myColumns) = whichColumns

Actually I don’t need to do this. As it happens (and as I’m sure many of you know), it’s possible to extract columns of a data frame while preserving their names with only one statement. This relies on the “[” operator, but with only one index. With a single index, a data frame is treated like a list of its columns, so the result is always a data frame, however many columns are selected. I hadn’t used this before because using the blank row index as a wildcard works in most circumstances. This had led me to blindly presume that both indexes were needed.

myColumns = myData[whichColumns]
names(myColumns)
# Hurrah!
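A related option, which I did not use in the code described here, is to keep the two-index form and pass drop = FALSE, an argument of “[” for data frames that switches off the simplification to a vector. The sketch below puts it next to the single-index form; viaSingleIndex and viaDropFalse are illustrative names only.

```r
myData = data.frame(State = c("NY", "NY", "TX", "TX")
                    , Premium = c(100, 200, 150, 75)
                    , Loss = c(80, 175, 80, 80)
                    , ALAE = c(10, 20, 15, 5))

whichColumns = "State"

# Single index: treat the data frame as a list of columns.
viaSingleIndex = myData[whichColumns]

# Two indexes, but ask R not to drop to a vector.
viaDropFalse = myData[, whichColumns, drop = FALSE]

is.data.frame(viaSingleIndex)   # TRUE
is.data.frame(viaDropFalse)     # TRUE
names(viaSingleIndex)           # "State"
names(viaDropFalse)             # "State"
```

Either form should behave the same whether whichColumns holds one name or several, which is what matters when the user supplies the columns.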

What’s the lesson? It’s worth your time to get familiar with very basic operators like “[”. The distinctions between that operator and “[[” and “$” are subtle, but very important. “[[” and “$” will only ever return one item (try sending them a vector of column names). “$” allows you to enter the column name using only the first few letters of its name. Again, it’s obvious once you understand what behavior R intends to support. Until then, you’re likely to get a bug which, at first glance, may seem mysterious.
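To make the three operators concrete, here is a short sketch on the same toy data. The name attempt is mine, and the tryCatch wrapper is only there so the script keeps running; as far as I understand it, a vector of names passed to “[[” is treated as a request to index recursively, which is why it fails here.

```r
myData = data.frame(State = c("NY", "NY", "TX", "TX")
                    , Premium = c(100, 200, 150, 75)
                    , Loss = c(80, 175, 80, 80)
                    , ALAE = c(10, 20, 15, 5))

# "[" with one index keeps the data frame wrapper and the name.
class(myData["Premium"])     # "data.frame"

# "[[" pulls out the single column itself.
class(myData[["Premium"]])   # "numeric"

# "$" accepts an unambiguous prefix of the column name.
myData$Pre                   # same values as myData$Premium

# "[[" with two column names does not return two columns.
attempt = tryCatch(myData[[c("State", "Premium")]]
                   , error = function(e) "error")
attempt
```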

2 Responses to More fun with data frames

  1. Pingback: Stuff I’ve gotten horribly wrong | PirateGrunt

  2. Jim says:

    to avoid that problem, I try to use the subset function as much as possible. I find that there’s enough quirkiness that it’s better to use subset to ensure you are getting the right thing.
