How to spend an inordinate amount of time becoming efficient

I’ve spent a good deal of 2012 constructing a data warehouse to manage all the various data elements that my company has. Although we’re a small enterprise, the richness and complexity of the information are rather high. Moreover, as a data-driven organization, there’s a strong impetus to construct meaningful analysis from every bit of input we can lay our hands on. Also, we don’t have piles of cash, so the available tools are off the shelf and free. Want something customized? You gotta roll your own.

This leads me naturally to R. I initially used the RODBC package to pull information from the warehouse and analyze the data frame it returned. Earlier this year, I had also used RODBC to store information, but the data volume was fairly modest and the structure very simple. Our proprietary claims systems are happy to provide me with huge Excel or CSV files, but this means I have to write some occasionally tedious VBA to move the data into the central data store. Having grown tired of the manual effort, I gave R a go.
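Pulling data down is the easy direction. Here’s a minimal sketch of what that looks like; the DSN and table name are invented for illustration:

library(RODBC)
# Connect to the warehouse through an ODBC data source name
channel = odbcConnect("MyDSN")
# sqlQuery hands the result set back as a data frame, ready for analysis
df = sqlQuery(channel, "SELECT * FROM dbo.tblOccurrenceWC", stringsAsFactors = FALSE)
odbcClose(channel)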

At first, it was easy. Read the CSV, and then run sqlSave. Heck, once I’ve identified the ODBC channel and where the file is, that’s just two lines of code. Throw in a third to alter the column names. Here’s a dummy example of what that would look like:

library(RODBC)
# Connect to the warehouse through an ODBC data source name
channel = odbcConnect("MyDSN")
# Location of the weekly extract from the claims system
filepath = "C:\\Users\\Documents\\SomeData\\"
filename = "Weekly Data.csv"
# Rename the columns on the way in so they match the warehouse table
myColnames = c("SystemOccurrenceID", "AccountNumber", "PolicyNumber", "SomeData")
df = read.csv(paste(filepath, filename, sep = ""), stringsAsFactors = FALSE, col.names = myColnames)
# Append the new rows to the existing table
sqlSave(channel, df, "dbo.tblOccurrenceWC", rownames = FALSE, append = TRUE)

Easy, right? Well, it was until this: “Invalid character value for cast specification”. I don’t know if you’ll get that with YOUR data, but I absolutely did with mine. My intuition told me that one of the data types was being rendered as a character when the database was expecting something else. I further presumed that it had something to do with a date field. I can barely spell POSIX, so I’m not wholly sure what happens when SQL Server tries to read that into a datetime field. Some interweb searching suggested that SQL Server might complain if the date had no hours or minutes, but I hadn’t experienced that with other tables and uploads. I went through several iterations of omitting the append=TRUE parameter so that sqlSave would create a new table; I still managed to produce the error. I was on the verge of sending an e-mail to Brian Ripley, one of the authors of RODBC, when I stumbled on two things:

1. One field held a character string which occasionally contained a long dash, a character that RStudio doesn’t render. I’ll research it at some point in the future, but the fact that it doesn’t easily map to a character set is probably part of the confusion.

2. Once that was resolved, the INSERT made it through about 50 records and then hit one it couldn’t process. It turned out I had to alter the column definition from int to bigint. The numbers I was feeding it should have fit within an int, but there you are.

Once that was done, everything ran smoothly.
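For anyone who hits the same wall, here’s a rough sketch of the two remedies. The column and table names are the dummy ones from above, and the gsub pattern is just one way to strip characters outside the printable ASCII range:

# Scrub non-ASCII characters (such as that long dash) from the offending text field
df$SomeData = gsub("[^\\x20-\\x7E]", "", df$SomeData, perl = TRUE)
# Widen the key column on the database side before re-running sqlSave
sqlQuery(channel, "ALTER TABLE dbo.tblOccurrenceWC ALTER COLUMN SystemOccurrenceID bigint")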

So now I can just click a button and my data warehouse updates painlessly, because I’ve already been through that weird, Kafkaesque sense of confusion and torment that you only get with R. Still a tremendous tool, though.


3 Responses to How to spend an inordinate amount of time becoming efficient

  1. Vitor Chagas says:

    R is not the best tool to use as an ETL.

    Try using KNIME (http://www.knime.org/) instead; you get a complete workflow system for free, with R integration included in the package :)

    Working in a SAS-based environment, I just use this at home for my experiments.
    It could make for a nice post …

  2. PirateGrunt says:

    I have to confess that I’m not sure what ETL means. I’m not familiar with KNIME, but a superficial glance at their website looks intriguing. I’m trying to be judicious about diving into new open source tools. My computer is already choked with programs that I’ve not gotten into learning. (Hi Weka!) But I’m definitely open to something to help corral my data.

    I agree that R is not meant to act as a data workflow administrator. And yet, it gets the job done often enough that I’m beginning to rely on it for some simple workflow tasks.

    Thanks for the feedback.
