Computed columns for dataframes

Everyone loves to aggregate data. Everyone loves to create new columns based on other columns. Everyone hates to do the same thing twice. In my continuing work on multilevel view of loss reserving, I reached a point where I realized that I needed a robust mechanism to aggregate computed columns. SQL server and (I’m assuming) other DBMSs have such a construct. Further, it’s something I use routinely in an Excel pivot table. Replicating this in R, doesn’t appear to be straightforward.

Here’s an example, that will be familiar to anyone working in insurance:

myData = data.frame(State = c("NY","NY", "TX", "TX")
                    , Premium = c(100,200,150,75)
                    , Loss = c(80,175,80,80)
                    , ALAE = c(10, 20, 15, 5))

What I’d like to do is calculate loss and loss+ALAE ratios. The former is simply the losses divided by the premium. The latter is the sum of loss and ALAE divided by the premium. I’ll also calculate the ratio of ALAE to loss+ALAE.

The first thing I do is create something called a ComputedColumn. This is simply a list with two elements, 1) the column name and 2) an expression which defines the calculation. The expression will be evaluated later during construction of a new class which I’ve defined. The class is called a “Mesa”. For now, I’m using it to create computed columns and collapse along selected dimensions. In the future, I’d like to use it to generate nice looking TEX or other output for tables.

I’ll note that Ben Escoto did quite a bit of very good work on well-formatted dataframes in his package FAViR. If you haven’t seen it, take a look at it here:
http://www.favir.net/start
.

Mesa works by evaluating the expressions which are passed into it and constructing a new dataframe for the results. The expressions are kept in the Mesa object so that they may be reapplied whenever we want to alter the Mesa.

ComputedColumn = function(ColumnName, ColumnFormula)
{
  arguments <- as.list(match.call())
  ColumnFormula = arguments$ColumnFormula
  x = list(ColumnName = ColumnName, ColumnFormula = ColumnFormula)
  x
}

new_Mesa = function(df, ComputedColumns)
{
  columnVals = lapply(ComputedColumns, function(x) {
    y = eval(x$ColumnFormula, df)})
  
  newMesa = do.call("cbind", columnVals)
  colnames(newMesa) = sapply(ComputedColumns, "[[", "ColumnName")
  newMesa = cbind(df, newMesa)
  
  mesa = new("Mesa", MesaData = newMesa, MesaColumns = ComputedColumns)
  mesa
}

When we run the following code, we'll get a table which has the first for columns of myData, augmented with three new ones.

col1 = ComputedColumn("LossRatio", Loss / Premium)
col2 = ComputedColumn("ALAERatio", ALAE / (Loss+ALAE))
myCols = list(col1, col2, ComputedColumn("LossAndLAERatio", (Loss + ALAE) / Premium))

myMesa = new_Mesa(myData, myCols)
rstudio::viewData(myMesa@MesaData)
State Premium Loss ALAE LossRatio ALAERatio LossAndLAERatio
NY 100 80 10 80% 11% 90%
NY 200 175 20 88% 10% 98%
TX 150 80 15 53% 16% 63%
TX 75 80 5 107% 6% 113%

Cool. But what if I'd like to know how New York and Texas are doing in total?

CollapseMesa = function(x, Along)
{
  computedCols = x@MesaColumns
  df = x@MesaData

  cols = colnames(df)
  computedColnames = sapply(x@MesaColumns, "[[", "ColumnName")
  cols = cols[!cols %in% computedColnames]
  cols = cols[!cols %in% Along]
  
  dfSum = aggregate(df[,cols], by=list(df[,Along]), sum)
  colnames(dfSum) = gsub("Group.1", Along, colnames(dfSum))
  newMesa = new_Mesa(dfSum, x@MesaColumns)
  newMesa
}
stateMesa = CollapseMesa(myMesa, "State")
rstudio::viewData(stateMesa@MesaData)
State Premium Loss ALAE LossRatio ALAERatio LossAndLAERatio

NY 300 255 30 85% 11% 95%
TX 225 160 20 71% 11% 80%

At present, the grouping mechanism is extremely primitive. It presumes that I’m only interested in collapsing along one dimension and that I only ever want to take the sum. I’ll get around to fixing that at some point in the near future, but this fits my needs for now.

Have a look here:
https://github.com/PirateGrunt/Mesa

Passing columns of a dataframe to a function without quotes

I love the syntax of calls to lm and ggplot, wherein the dataframe is specified as a variable and specific columns are referenced as though they were separate variables. While developing some of my functions, I’d wanted to introduce something similar. I often find that I have a single large dataframe and want to execute the same function to many columns. I wanted the ability to do this interactively, which ruled out the brute force method of something like lapply. The resulting code in the called function was always a bit messy passing in a character string or position for the column and then writing something like df[,MyColName]. Actually, looking at it now, it seems fairly straightforward. I suppose I just didn’t like the green colored font in RStudio and just wanted to know how it was done. If that smells like a caveat, it is. I’m not 100% certain of the purity of this convention and am open to other views and suggestions.

Turns out the answer is straightforward and relies on use of the eval function. eval lets you specify the environment in which a variable is evaluated and that environment may include a dataframe. Here’s a very simple example, which simply sums the values in a column of a dataframe.

someFunction = function(y, data)
{
  arguments <- as.list(match.call())
  y = eval(arguments$y, data)
  sum(y)
}

First, we pull the arguments out using match.call(). I’ll be honest. I read up on that last week until my brain melted. Here’s more or less what it amounts to. match.call() will return a call object, which has all of the items in the function signature unevaluated. This means that arguments exist as quotes. Quotes describe your variable and sit around waiting to be evaluated. Here, we’re grabbing them before anything else happens so that we can control how that happens. The eval function will use the local environment, unless we tell it to use something else. In this case, we tell it to use the dataframe that we’ve passed in. This allows us to do something cool like the following:

myData = data.frame(A = c(1,2,3), B = c(10,9,8))
someFunction(A, data=myData)
someFunction(B, data=myData)
someFunction(A)

So that’s loads of fun and I love how the function calls look. I also like that I get an error if I try to pass in a column without specifying the dataframe. However, beware. There’s nothing which insists that the first argument to the function must live in the dataframe. Note what happens when we pass in something else

X = c(1,2,3,4,5,6)
someFunction(X)
someFunction(X, data=myData)

This may not be catastrophic, but it’s probably a situation we’d want to be informed of, at least via a warning. I went to the trouble of creating the dataframe and passing it into a function, I’d like to know if it’s being ignored. Even worse, if I create a variable called A, then someFunction(A) will now work without an error. However, it won’t be using the column labelled A in the dataframe. Try the following:

A = c(1,2)
someFunction(A)
someFunction(A, data=myData)

I’m still monkeying around with this, trying to sort out what looks right and is most robust. As always, other views are welcome.

Big RxR Blues

I’ve updated the GitHub repository for RxR with some bits of SQL that are hopefully useful. This is a very quick stab at establishing a basic schema definition for (re)insurance data. This first cut simply creates a table to store accounts, written premium and a query to get earned premium.

The account table has an index field called “SegmentID” which could be used to map to marketing/demographic characteristics. Apart from that, there’s nothing here to describe the exposure other than a field which defines the insurance coverage being offered. That puts quite a burden on the SegmentID field, but it’s fairly straightforward to build some basics like territory, agency, program, etc.

The MonthMap table is the easiest solution I came up with to quickly move from aggregating across year, quarter or semi-annual periods. It feels a bit kludgy at first, but with the date udfs, I can now get appropriate totals without having to think too hard. And I hate having to think hard.

SQL is obviously only one option for data storage. If anyone knows more about XML, JSON or other formats, feel free to chime in. For my work, I use either R or Excel to interact with the data. I manage that with an ODBC DSN. Typically, I’m using a query, so everything is in a nice, tabular format and all of the joins have already been sorted out. In other words, once the data comes to me from the warehouse, I need to do nothing further to it other than draw meaningful conclusions.

Future updates would include:

  • A table or field to identify a specific policy.
  • Addition of exposure elements and rating values to construct the written premium information.
  • Once a policy field has been established, creation of a claim table and associated information.
  • Lots of other things.

Grab a fork and have a look at the repository: 
https://github.com/PirateGrunt/RxR

Callback functions for GUI widgets

Of all the things I dislike about R, one of the biggest is the fact that you can declare a function within the list of arguments to another function. I’ve gotten over it for very minor operations needed by things like lapply, but it can drive me bonkers elsewhere. One such instance is writing an event handler using the gwidgets package. Here’s an example, inspired by the example in “Programming Graphical User Interfaces in R“:

# Create a basic window with an output section
require(gWidgets)
options(guiToolkit = RGtk2)
window = gwindow("File search", visible = FALSE)
paned = gpanedgroup(cont = window)
frmOutput = gframe("Output: ", cont=paned, horizontal = FALSE)
txtOutput = gtext("", cont = frmOutput, expand = TRUE)
size(txtOutput) = c(350, 200)

# Create a button to open a .csv file
btnImportFile = gfilebrowse(text = "Select a file ", quote = FALSE, type = "open", cont = container, filter = "*.csv")
addHandlerChanged(btnImportFile, handler = function(h, ...){
    svalue(txtOutput) = svalue(h$obj)    
  })

OK, that was a fairly tame handler. All that we’re doing is placing the name of the file into the output area. What I’d like to do is use the output to show the names of columns contained in the file. Here’s what that would look like:

btnDescribeColumns = gbutton(text = "Describe columns", cont = container)
addHandlerChanged(btnDescribeColumns, handler = function(h, ...){
   filename = svalue(txtOutput)
   if( filename == ""){
     svalue(txtOutput) = "File not found"
    } else {
     df = read.csv(filename, header = TRUE)
     svalue(txtOutput) = names(df)
    }
})

That’s not too dreadful in isolation, but the event handler sits in the same function where I’ve defined another event handler. I’m now defining two functions inside a third function. For any rich interface, this will quickly get complex. Making changes to a specific handler means trolling through one giant, monolithic function. (Yes, yes, I could just search, but the aesthetics of my code are still poor.)

How about this instead? I code a function to create the widget and the handler is defined there.

AddBtnDescribeColumns = function(container, txtOutput){
  require(gWidgets)
  btnDescribeColumns = gbutton(text = "Describe columns", cont = container)
  
   addHandlerChanged(btnDescribeColumns, handler = function(h, ...){
     filename = svalue(txtOutput)
     if( filename == ""){
       svalue(txtOutput) = "File not found"
      } else {
       df = read.csv(filename, header = TRUE)
       svalue(txtOutput) = names(df)
      }
 })
 
  return (btnDescribeColumns)
}

With a similar approach to the file open button (not shown, but available on Github), the main function for my dialog box is beginning to look fairly readable. Note that I must declare the output box before the buttons, so that it may be passed to the button constructors.

Main = function(WhichToolkit = "RGtk2"){
  require(gWidgets)
  options(guiToolkit = WhichToolkit)
  
  window = gwindow("File search", visible = FALSE)
  
  paned = gpanedgroup(cont = window)
  
  group = ggroup(cont = paned, horizontal = FALSE)
  frmOutput = gframe("Output: ", cont=paned, horizontal = FALSE)
  
  txtOutput = gtext("", cont = frmOutput, expand = TRUE)
  
  size(txtOutput) = c(350, 200)
  
  AddBtnImportFile(group, txtOutput)
  
  AddBtnDescribeColumns(group, txtOutput)
  
  visible(window) = TRUE
}

Another nice aspect of this construct is that I can declare multiple copies of the same button and place them wherever I like. They’ll all have the same event handler, which I only need to code once.

I’ve been working with gWidgets for a sum total of about three hours, so there may be something obvious I’ve missed. If so, please feel free to comment.

Clustering Loss Development Factors

 

Anytime I get a new hammer, I waste no time in trying to find something to bash with it. Prior to last year, I wouldn’t have known what a cluster was, other than the first half of a slang term used to describe a poor decision-making process. Now I’ve seen it in action a few times and have searched for a practical case where I could use it. It occurred to me that it made for a very interesting look at the problem of how to group loss development factors.

As I may have ranted previously, I’ve tried to revert to first principles and challenge the notion that every development lag deserves its own model factor. Moreover, I’m not 100% satisfied with the idea that development age is the ideal way to categorize model factors. Just for fun, let’s see how a k-means clustering algorithm would group loss development factors.

We’ll work with data from a “big company” pulled from the public NAIC data set. (The data set and the logic to create the Triangle object are available in the MRMR package, which is on GitHub here.)

First, we’ll calculate the development factors using a standard multiplicative chain ladder model. For a 10×10 loss triangle, we’ll have 45 factors. We can plot them on the real number line to see where they sit relative to one another. We’re going to display two plots. The upper plot will show the factors colored by development lag. The numbers are listed above as a further guide to the identity of the points. Development lags 2, 3 and 4 appear clustered in groups, but the higher lags are a bit more difficult to distinguish.

The lower plot shows the dots colored based on their assignment to clusters using the kmeans function. The 2nd and 3rd lags are now divided into finer clusters. Also, the 7 through 10 factors in the tail now sort together. (This is a bit difficult to see. Suggestions on how to present this more clearly are welcome.)

Clustered dev factors

Development lag 2 is particularly interesting as it’s the largest factor and the one with the greatest volatility. I’m curious if the clustering suggests something happening as a calendar year effect, or if it’s merely an artifact of the kmeans clustering algorithm (garbage in, garbage out, etc.). So, let’s add a label for the calendar periods. When we do this, we see that lag 2 does appear to have a meaningful calendar period effect. 1989 and 1990 sort together, then there is a cluster from 1991-1994 and a final cluster for calendar years 1995-1997.

Clustered dev factors wCY

The higher lags are more difficult to see, but let’s have a look at lag 3 and higher. Again we notice that calendar years 1995-1997 get sorted apart from the others. At this point, we should have a strong suspicion that there is a calendar year trend which affects our loss development pattern, though the effect may be dampened for higher lags. We can account for this in a number of ways- Berquist-Sherman adjustment or use of a Zehnwirth type fitting formula.

Clustered dev factors gt 2

Note that we’ll get different results whenever we run the kmeans algorithm. Your mileage may vary.

The code is long and not all that interesting, however I welcome any comments. You can find the Gist here.

Building a package in RStudio is actually very easy

So, you’ve written some code and you use it routinely. Now you’ve written some code and you’d like to use version control to ensure that development continues in a robust fashion. You do that and you use Github or something so that not only are changes tracked, but the general public receives the benefit of your knowledge. At the same time, you receive the benefit of their editing and ideas, but in a secure way. Now what? All that useful, well-functioning code needs a home, some documentation and a little dignity. It needs to become a package.

This is actually very easy to do. I had a few false starts, but now that I know what’s going on, I can create an R package faster than you can say Hadley Wickham.

  1. Start with a Github repository. This will have a set of R sourcecode files, which collectively do something interesting.
  2. Within RStudio, create a new project, using your local Github repository as the project’s source directory.
  3. Within the “Build Tools” section of the “Project Options” make sure that you’ve selected “Package” from the “Project Build Tools” dropdown box.
  4. In that same dialog box, enter the “–no-examples” flag of the “Check Package” build options. (At this stage, you probably don’t have any examples composed. The default assumes that you do. I don’t know why this is.)
  5. If RStudio, didn’t create one, create a subdirectory in your project folder called “R”. Move all your code here. Github will reflect the deletion and new file creation.
  6. Create a DESCRIPTION file. There are numerous sources which explain how to do this.
  7. Make sure your code doesn’t have any errors in it. Easiest way to do that is to source all of them.
  8. Build the package.

And that’s it. If you press CTRL+SHIFT+B (in Windows, at any rate) RStudio will build the package and load it for you. All of your functions are now in memory, but won’t appear in the Workspace pane.

RStudio has fine- if brief- documentation on their site, with links to more detailed guidance. The most useful of these to me, so far, was the wiki for Wickham’s devtools package.

Of course, that’s only about half the story. You still have to write documentation for all your functions, provide a demo file, compose a vignette, etc. Then, you may want to submit to CRAN. For now, I’m just using Github, which will be fine for the forseeable future.

The MRMR project may now be loaded directly into R via Github. The documentation is non-existent for now. It’s a very bare bones application, so could be useful as a template for others. Meanwhile, Github has loads of good examples of R packages.

Want to load MRMR? Instructions below will make that happen.

library(devtools)
install_github(repo = "MRMR", username = "PirateGrunt")
library(MRMR)
df = GetNAICData()
head(df)

Object Orientation in R – Notes from a novice

Having posted some code to Git a few days ago and having been wholly dissatisfied with it, I began to do what I often do with code I don’t like. I started re-writing it bigger and weirder and more philosophically pure. Part of this search for Platonic code lead me to explore object oriented programming in R. To me, OOP is very similar to relational database theory; it was love at first sight. That doesn’t mean that I fully appreciate all of the academic nuances, or that I’m very good at it, or even that I agree with all of it, but I’m an unrepentant fan. My dabblings with R to date have been purely modular. I’ve now written my first object. Here’s how I went about it.

I revisited the Coursera lecture slides about creating an S4 object. That was enough for me to define an object and overload a generic method.

setClass("Triangle", 
         representation(TriangleData = "data.frame"
                        , TriangleName = "character"
                        , LossPeriodType = "character"
                        , LossPeriodInterval = "Period"
                        , DevelopmentInterval = "Period"))

setMethod("show", "Triangle"
          , function(object){
            cat("This is a loss triangle\n")
            cat("Its name is", object@TriangleName, "\n")
            cat("Its columns are", colnames(object@TriangleData), "\n")
            print(head(object@TriangleData))
          })

 

One fairly important “gotcha”: The call to the setClass method appears to require that the arguments are named. On more than one occasion when I failed to do so, I get one of R’s typically cryptic error messages, an example of which looks like:

Error in summary(chainLadder@LinearFit) :
error in evaluating the argument ‘object’ in selecting a method for function ‘summary’: Error: no slot of name “LinearFit” for this object of class “TriangleModel”

The code above is sufficient to construct an object, but it’s not terribly useful or robust.  Among other thing, I’d like to ensure that the object will respond in some fashion to inappropriate inputs and do a bit more than just house several data elements and print a summary. So, let’s start to dig deeper.

Beyond the Coursera material, there are three papers that I consulted:

The Leisch paper is a good introduction to OOP in S3. However, because (according to Roger Peng, anyway) S4 is a more “pure” form of OOP, I looked for other sources. The Genolini and Hankin papers are sufficient to give someone enough information to get going with creating an object. In addition, the code for the lubridate package is on GitHub and is a splendid example of clear, detailed S4 objects in action.

With that knowledge in hand, I returned to my Triangle object. To ensure that the object behaves the way I want it to, I wrote a constructor function which will build the object using some sensible inputs. That winds up being a fairly lengthy function, so I’ll not post it here. If you’re curious, here’s the Gist: 

. Basically, I need to allow the user to specify which column contains loss and development time period information. If some of the inputs have not been specified properly, I return an informative error. I wrote a very rudimentary validation function to ensure that the type of loss period is something sensible. (Brief aside: all of my code is English and I’m not all that happy about that. Does anyone have any good suggestions about how to write multilingual code?)

While I’m at it, I overload some generic functions, including one for plotting. This means that my default plot for a triangle will be something which looks sensible. Cool. I can also write custom behaviors, such as a “LatestDiagonal” function, which will return the most recent observation for a set of loss (or origin) periods. To write a custom method, you must first define a new generic function. This seems a bit odd to me, but whatever. I can imagine a way that it makes sense somewhere within R’s engine.

Finally, I set a method to assign a name to the triangle. This is a bit crazy and I’ll be the first to admit that there may be something here which I don’t get. Hankin writes these functions, but doesn’t use them. They simply use the “@” operator to access object properties directly. This makes me wonder what the point of writing an access function is, other than clean looking code.

So what do I think of OOP in R? I tend to view OOP as having four key properties: Encapsulation, Inheritance, Polymorphism and Methods. I barely need to add that this is hardly a canonical list, merely one biased person’s view of what they like to see in an OO langauge.

  • Encapsulation: This is ability to hide the internals of an object and to control how an object’s properties are manipulated. R gets an F here. One may control property assignments, but that comes in one of a few ways: either through coding a setReplace function, a setX function or through a single setValidity function. It would be possible to code a validation function for a specific property, which other functions could call, but (unless I’ve miss something) that function won’t be private. A single setValidity function is inefficient, both in terms of run-time and in development effort. Moreover, there’s nothing stopping the user from modifying an object’s internal properties via a direct call to the “@” reference to an object’s data. 
  • Inheritance: I’ll give R an incomplete as I’ve not yet had a need to construct an object hierarchy with inherited properties and behaviors, though I expect that I will. At present, the model object defaults to OLS. Obviously, I ‘d like to extend that to other structures. Watch this space.
  • Polymorphism: This gets an A. I get the feeling that this is why there’s any OO in R at all. You can barely get through a bit of  documentation without reference to “generic” methods. OO allows a developer to overload standard functions like plot, summary, sum, etc. R’s support here is fairly straightforward and welcome. Provision of a default plot method for a triangle object is helpful to ensure that users get a useful output without much effort on their part.
  • Methods: I never know quite what to call this, but to me it’s the ability of an object to have behaviors. When I was first learning about OO, this seemed to be what separated an object from a mere structure. A structure is a composition of primitive data types, but an object can actually DO something. Here, R is a bit mixed, so I’ll give them a C. Again, I may be missing something, but there doesn’t seem to be any straightforward support for private methods which would allow an object to manage its own data. The requirement to declare a generic and then a specific is bizarre, but I’m content to write it off as a minor sacrifice to the R gods.

OOP purists and other academics will have a different view about what’s important and how well it’s implemented in R. I’ve barely scratched the surface in my own development and look forward to bringing this technology to bear on problems where this is appropriate. Comments are more than welcome. I’m certain that I’ve gotten a few things horribly wrong.

All the code may be found in the MRMR project here:
https://github.com/PirateGrunt/MRMR

Brief demo:

# Demo script

#=============================
# Source the necessary code
source("https://raw.github.com/PirateGrunt/MRMR/master/RegressionSupport.r")
source("https://raw.github.com/PirateGrunt/MRMR/master/NAIC.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/ReservingVisualization.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/Triangle.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/TriangleModel.R")
source("https://raw.github.com/PirateGrunt/MRMR/master/TriangleProjection.R")

#=============================
# Get some data from the big NAIC database 
# and get a triangle we can project
df = GetNAICData("wkcomp_pos.csv")
bigCompany = as.character(df[which(df$CumulativePaid == max(df$CumulativePaid)),"GroupName"])

df.BigCo = subset(df, GroupName == bigCompany)

df.UpperTriangle = subset(df.BigCo, DevelopmentYear <=1997)
df.LowerTriangle = subset(df.BigCo, DevelopmentYear > 1997)

#=============================
# Construct the triangle and display 
# some basic properties
tri = Triangle(TriangleData = df.UpperTriangle
               , TriangleName = bigCompany
               , LossPeriodType = "accident"
               , LossPeriodInterval = years(1)
               , DevelopmentInterval = years(1)
               , LossPeriodColumn = "LossPeriodStart"
               , DevelopmentColumn = "DevelopmentLag")

tri@TriangleName
tri

is(tri, "Triangle")
is.Triangle(tri)

plt = ShowTriangle(tri@TriangleData, bigCompany)

plot(tri)
head(LatestDiagonal(tri))
length(LatestDiagonal(tri)[,1])

plt = ShowTriangle(tri@TriangleData, bigCompany, Cumulative=FALSE)
#Note the apparent calendar year impact in 1996. This is invisible in the cumulative display.

setName(tri) = "AnotherName"
tri@TriangleName
setName(tri) = bigCompany
tri@TriangleName
tri@TriangleName = "Another name"

#===========================
# Now let's fit a model

chainLadder = TriangleModel("CumulPaid"
                            , BaseTriangle = tri
                            , ResponseName = "CumulativePaid"
                            , PredictorName = "DirectEP"
                            , CategoryName = "DevelopmentLag"
                            , MinimumCategoryFrequency = 1
                            , delta = 0)
summary(chainLadder@LinearFit)

Follow

Get every new post delivered to your Inbox.

Join 224 other followers