# Computed columns for dataframes

May 8, 2013 7 Comments

Everyone loves to aggregate data. Everyone loves to create new columns based on other columns. Everyone hates to do the same thing twice. In my continuing work on multilevel view of loss reserving, I reached a point where I realized that I needed a robust mechanism to aggregate computed columns. SQL server and (I’m assuming) other DBMSs have such a construct. Further, it’s something I use routinely in an Excel pivot table. Replicating this in R, doesn’t appear to be straightforward.

Here’s an example, that will be familiar to anyone working in insurance:

myData = data.frame(State = c("NY","NY", "TX", "TX") , Premium = c(100,200,150,75) , Loss = c(80,175,80,80) , ALAE = c(10, 20, 15, 5))

What I’d like to do is calculate loss and loss+ALAE ratios. The former is simply the losses divided by the premium. The latter is the sum of loss and ALAE divided by the premium. I’ll also calculate the ratio of ALAE to loss+ALAE.

The first thing I do is create something called a ComputedColumn. This is simply a list with two elements, 1) the column name and 2) an expression which defines the calculation. The expression will be evaluated later during construction of a new class which I’ve defined. The class is called a “Mesa”. For now, I’m using it to create computed columns and collapse along selected dimensions. In the future, I’d like to use it to generate nice looking TEX or other output for tables.

I’ll note that Ben Escoto did quite a bit of very good work on well-formatted dataframes in his package FAViR. If you haven’t seen it, take a look at it here: http://www.favir.net/start.

Mesa works by evaluating the expressions which are passed into it and constructing a new dataframe for the results. The expressions are kept in the Mesa object so that they may be reapplied whenever we want to alter the Mesa.

ComputedColumn = function(ColumnName, ColumnFormula) { arguments <- as.list(match.call()) ColumnFormula = arguments$ColumnFormula x = list(ColumnName = ColumnName, ColumnFormula = ColumnFormula) x } new_Mesa = function(df, ComputedColumns) { columnVals = lapply(ComputedColumns, function(x) { y = eval(x$ColumnFormula, df)}) newMesa = do.call("cbind", columnVals) colnames(newMesa) = sapply(ComputedColumns, "[[", "ColumnName") newMesa = cbind(df, newMesa) mesa = new("Mesa", MesaData = newMesa, MesaColumns = ComputedColumns) mesa }

When we run the following code, we'll get a table which has the first for columns of myData, augmented with three new ones.

col1 = ComputedColumn("LossRatio", Loss / Premium) col2 = ComputedColumn("ALAERatio", ALAE / (Loss+ALAE)) myCols = list(col1, col2, ComputedColumn("LossAndLAERatio", (Loss + ALAE) / Premium)) myMesa = new_Mesa(myData, myCols) rstudio::viewData(myMesa@MesaData)

State | Premium | Loss | ALAE | LossRatio | ALAERatio | LossAndLAERatio |

NY | 100 | 80 | 10 | 80% | 11% | 90% |

NY | 200 | 175 | 20 | 88% | 10% | 98% |

TX | 150 | 80 | 15 | 53% | 16% | 63% |

TX | 75 | 80 | 5 | 107% | 6% | 113% |

Cool. But what if I'd like to know how New York and Texas are doing in total?

CollapseMesa = function(x, Along) { computedCols = x@MesaColumns df = x@MesaData cols = colnames(df) computedColnames = sapply(x@MesaColumns, "[[", "ColumnName") cols = cols[!cols %in% computedColnames] cols = cols[!cols %in% Along] dfSum = aggregate(df[,cols], by=list(df[,Along]), sum) colnames(dfSum) = gsub("Group.1", Along, colnames(dfSum)) newMesa = new_Mesa(dfSum, x@MesaColumns) newMesa } stateMesa = CollapseMesa(myMesa, "State") rstudio::viewData(stateMesa@MesaData)

State | Premium | Loss | ALAE | LossRatio | ALAERatio | LossAndLAERatio |

NY | 300 | 255 | 30 | 85% | 11% | 95% |

TX | 225 | 160 | 20 | 71% | 11% | 80% |

At present, the grouping mechanism is extremely primitive. It presumes that I’m only interested in collapsing along one dimension and that I only ever want to take the sum. I’ll get around to fixing that at some point in the near future, but this fits my needs for now.

Have a look here: https://github.com/PirateGrunt/Mesa

Hi, I am not sure if it is what are you looking for but have you checked the data.table package?

I’m familiar with data.table, but must confess that I’ve not used it in detail. So far as I can recall, it doesn’t directly support computed columns in the manner that I’ve done, though I’ll have another look. I do know that it’s a very cool package and serves as a nice construct for folks (like me) who come from a RDBMS perspective.

library(plyr)

ddply(myData, .(State), summarize, Premium1 = sum(Premium), Loss1 = sum(Loss), LossRatio = sum(Loss)/sum(Premium), ALAERatio = sum(ALAE)/sum(Loss + ALAE), LossAndLAERatio = sum(Loss + ALAE)/sum(Premium))

Yeah, that’s kinda … I mean, yeah. Yeah, that probably works. But I’m sort of at a loss as to what’s going on there. I know that I can calculate the stuff. I want to express the stuff.

The point of doing it using plyr was to show you it exists. Hadley Wickham has done a tonne of work in this package and it solves your basic problem plus a lot more besides. Have a read of the split-apply-combine pattern http://www.jstatsoft.org/v40/i01/paper

I’m familiar with plyr and use it from time to time. The focus there is more on aggregation than on defining new variables which are functions of others and preserving those variables pre and post aggregation.

Frankly, the thing which does this very well is a pivot table in Excel. Say what you will about Excel, pivot tables are fantastic for basic exploratory analysis.

“The focus there is more on aggregation than on defining new variables which are functions of others and preserving those variables pre and post aggregation.”

That’s what aggregation is.

“Frankly, the thing which does this very well is a pivot table in Excel.”

That’s what the plyr package does. Read the article reffed by A.

So, this post is great if you learned something from the exercise. And it’s great for others who have the same problem. They now know that plyr is a functional approach to this problem. For loops are prone to error, and you’ve even acknowledged that your approach is limited in functionality. Why reinvent the wheel? (Unless it’s for learning, which I’ve mentioned is a legitimate reason.)

For the rest of us, we should be encouraged to use packages that are proven and error tested.