More fun with data frames
June 12, 2013
Data frames are such a straightforward and essential element of R that it’s easy to lose sight of some of their peculiarities. Last week, I developed some code which would tear apart some data frames and create new ones based on columns specified by the user. This would allow me to dynamically create new data frames for later processing. Everything worked fine until one of those downstream processes threw an error. A bit of digging allowed me to see that the behavior when extracting data frame columns is ever so slightly different when extracting only one column.
Guess what output I get when running the following:
myData = data.frame(State = c("NY", "NY", "TX", "TX"),
                    Premium = c(100, 200, 150, 75),
                    Loss = c(80, 175, 80, 80),
                    ALAE = c(10, 20, 15, 5))

whichColumns = c("State", "Premium")
myColumns = myData[, whichColumns]
names(myColumns)

whichColumns = "State"
myColumns = myData[, whichColumns]
names(myColumns)
Of course it’s blindingly obvious after a moment’s reflection. The first call to names() returns the two column names; the second returns NULL. If you’re only retrieving one column, R won’t return something more complex than a primitive vector, and primitive vectors don’t necessarily have names. OK. Let’s have it return a data frame.
myColumns = as.data.frame(myData[, whichColumns])
names(myColumns)
This probably isn’t what one would want. The name of the column in the data frame isn’t “State” as it was in the original data frame.
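The two results so far can be checked directly. This is a minimal sketch that reuses a cut-down copy of the sample data; the variable names `single` and `wrapped` are introduced here purely for illustration.

```r
# Cut-down copy of the sample data from the post
myData = data.frame(State = c("NY", "NY", "TX", "TX"),
                    Premium = c(100, 200, 150, 75))

# Two-index extraction of a single column yields a plain vector
single = myData[, "State"]
is.data.frame(single)          # FALSE
names(single)                  # NULL

# Wrapping the vector gives a data frame again, but the
# original column name does not come back with it
wrapped = as.data.frame(myData[, "State"])
is.data.frame(wrapped)         # TRUE
"State" %in% names(wrapped)    # FALSE
```

Printing names(wrapped) shows whatever label as.data.frame came up with in place of the lost name.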
The help for data frames has a rather telling statement: “How the names of the data frame are created is complex, and the rest of this paragraph is only the basic story.” What really matters is that extraction of a single column removes the name. Again, primitive vectors don’t necessarily have names. Once the name is gone, as.data.frame has nothing to use for a name and does the best it can. The remedy is to change the name of the resultant data frame. Hardly a catastrophe, but it means that I now need two lines of code, rather than one.
myColumns = as.data.frame(myData[, whichColumns])
names(myColumns) = whichColumns
Actually I don’t need to do this. As it happens (and as I’m sure many of you know), it’s possible to extract columns of a data frame while preserving their names with only one statement. This relies on the “[” operator, but with only one index. I hadn’t used this before because using the blank row index as a wildcard works in most circumstances. This had led me to blindly presume that both indexes were needed.
myColumns = myData[whichColumns]
names(myColumns) # Hurrah!
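For comparison, here is a sketch of a second option that the post itself does not use: the two-index form also accepts a `drop = FALSE` argument, which asks R to keep the data frame structure even for a single column. The names `oneIndex` and `twoIndex` are illustrative only.

```r
# Cut-down copy of the sample data from the post
myData = data.frame(State = c("NY", "NY", "TX", "TX"),
                    Premium = c(100, 200, 150, 75))
whichColumns = "State"

# One index: the result stays a data frame and keeps its column name
oneIndex = myData[whichColumns]

# Two indexes with drop = FALSE: also stays a data frame
twoIndex = myData[, whichColumns, drop = FALSE]

is.data.frame(oneIndex)   # TRUE
names(oneIndex)           # "State"
is.data.frame(twoIndex)   # TRUE
names(twoIndex)           # "State"
```

Either form should behave the same whether whichColumns holds one name or several, which is what code driven by user-specified columns needs.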
What’s the lesson? It’s worth your time to get familiar with very basic operators like “[”. The distinctions between that operator and “[[” and “$” are subtle, but very important. “[[” and “$” will only ever return one item (try sending them a vector of column names). “$” allows you to enter the column name using only the first few letters of its name. Again, it’s obvious once you understand what behavior R intends to support. Until then, you’re likely to get a bug which, at first glance, may seem mysterious.
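The differences between the three operators can be seen side by side. This is a small sketch on a cut-down copy of the sample data; the `result` variable and the "error" / "no error" labels are just illustrative devices for catching the failure.

```r
# Cut-down copy of the sample data from the post
myData = data.frame(State = c("NY", "NY", "TX", "TX"),
                    Premium = c(100, 200, 150, 75))

# "[" with one index returns a data frame
is.data.frame(myData["Premium"])   # TRUE

# "[[" returns the column itself, as a vector
myData[["Premium"]]                # 100 200 150 75

# "$" accepts a unique prefix of the column name
myData$Prem                        # 100 200 150 75

# "[[" with a vector of column names does not return two columns;
# here the attempt is caught so the script keeps running
result = tryCatch({
  myData[[c("State", "Premium")]]
  "no error"
}, error = function(e) "error")
result                             # "error"
```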