Passing columns of a dataframe to a function without quotes

I love the syntax of calls to lm and ggplot, wherein the dataframe is specified as a variable and specific columns are referenced as though they were separate variables. While developing some of my own functions, I wanted to introduce something similar. I often find that I have a single large dataframe and want to apply the same function to many columns. I wanted the ability to do this interactively, which ruled out the brute force method of something like lapply. The resulting code in the called function was always a bit messy: I’d pass in a character string or position for the column and then write something like df[,MyColName]. Actually, looking at it now, that seems fairly straightforward. I suppose I just didn’t like the green colored font in RStudio and wanted to know how it was done. If that smells like a caveat, it is. I’m not 100% certain of the purity of this convention and am open to other views and suggestions.

Turns out the answer is straightforward and relies on use of the eval function. eval lets you specify the environment in which a variable is evaluated and that environment may include a dataframe. Here’s a very simple example, which simply sums the values in a column of a dataframe.

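someFunction = function(y, data = NULL)
{
  # capture the call with its arguments still unevaluated
  arguments <- as.list(match.call())
  # evaluate y using the columns of data; with data = NULL, eval falls back to the enclosing scope
  y = eval(arguments$y, data)
  sum(y)
}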
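
Stripped of the function wrapper, the mechanism can be sketched on its own. This is only a minimal illustration: it assumes a small data frame (the same myData used in the examples in this post), and quote() stands in for the unevaluated argument that the function captures.

```r
myData <- data.frame(A = c(1, 2, 3), B = c(10, 9, 8))

# quote() captures the expression without evaluating it
expr <- quote(A + B)

# Evaluate it with the data frame's columns available as variables
eval(expr, myData)
# [1] 11 11 11
```

Anything not found among the columns (here, the `+` function) is looked up in the surrounding scope, which is also where the surprises described later come from.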

First, we pull the arguments out using match.call(). I’ll be honest. I read up on that last week until my brain melted. Here’s more or less what it amounts to. match.call() returns a call object: the function name plus every argument that was actually supplied, matched to its formal name and left unevaluated. This means that the arguments exist as unevaluated expressions (here, the bare symbol A) rather than as values. They describe your variable and sit around waiting to be evaluated. Here, we’re grabbing them before anything else happens so that we can control how that happens. By default, eval evaluates in the environment it was called from, unless we tell it to use something else. In this case, we tell it to use the dataframe that we’ve passed in. This allows us to do something cool like the following:

myData = data.frame(A = c(1,2,3), B = c(10,9,8))
someFunction(A, data=myData)   # 6
someFunction(B, data=myData)   # 27
someFunction(A)                # fails with an error
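
To see what match.call() actually hands back, it can help to look at the captured call directly. The helper below, showCall(), is hypothetical and exists only for inspection; it is a sketch, not part of the function above.

```r
# showCall() is a hypothetical helper, used only to inspect match.call()
showCall <- function(y, data = NULL) {
  arguments <- as.list(match.call())
  str(arguments)          # print the structure of the captured call
  invisible(arguments)
}

myData <- data.frame(A = c(1, 2, 3), B = c(10, 9, 8))
captured <- showCall(A, myData)   # second argument supplied by position

names(captured)            # "" "y" "data": positional arguments get their formal names
is.symbol(captured$y)      # TRUE: the name A, not the values in the column
is.symbol(captured$data)   # TRUE: the name myData, not the data frame itself
```

The first, unnamed element is the function name; the rest are the supplied arguments, still waiting to be evaluated.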

So that’s loads of fun and I love how the function calls look. I also like that I get an error if I try to pass in a column without specifying the dataframe. However, beware. There’s nothing which insists that the first argument to the function must live in the dataframe. Note what happens when we pass in something else:

X = c(1,2,3,4,5,6)
someFunction(X)
someFunction(X, data=myData)

This may not be catastrophic, but it’s probably a situation we’d want to be informed of, at least via a warning. I went to the trouble of creating the dataframe and passing it into a function, so I’d like to know if it’s being ignored. Even worse, if I create a variable called A, then someFunction(A) will now work without an error. However, it won’t be using the column labelled A in the dataframe. Try the following:

A = c(1,2)
someFunction(A)
someFunction(A, data=myData)
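
One possible way to get that warning is to compare the names used in the captured expression against the columns of the dataframe before evaluating. This is a rough sketch rather than a settled answer: someFunctionChecked and the wording of the warning are made up for illustration, and it assumes data defaults to NULL.

```r
# Hypothetical variant of someFunction(); the name and the warning text are
# illustrative only.
someFunctionChecked <- function(y, data = NULL) {
  arguments <- as.list(match.call())
  expr <- arguments$y

  if (!is.null(data)) {
    # all.vars() lists the variable names used in the unevaluated expression
    notInData <- setdiff(all.vars(expr), names(data))
    if (length(notInData) > 0) {
      warning("not a column of data, looked up elsewhere instead: ",
              paste(notInData, collapse = ", "))
    }
  }

  sum(eval(expr, data))
}

myData <- data.frame(A = c(1, 2, 3), B = c(10, 9, 8))
X <- c(1, 2, 3, 4, 5, 6)

someFunctionChecked(A, data = myData)  # uses column A, no warning
someFunctionChecked(X, data = myData)  # still sums X, but warns
```

This does nothing about the case where data is omitted and a variable named A happens to exist; there the function has no dataframe to check against.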

I’m still monkeying around with this, trying to sort out what looks right and is most robust. As always, other views are welcome.

12 Responses to Passing columns of a dataframe to a function without quotes

  1. anspiess says:

    How about:

    someFunction = function(y, data)
    {
    arguments <- as.list(match.call())
    if(is.null(arguments$data)) y <- eval(arguments$y) else y <- eval(arguments$y, data)
    sum(y)
    }

    Cheers,
    Andrej

    • PirateGrunt says:

      Andrej,

      I had originally included a check to see if the data was null. I left it out so the function block didn’t look quite so busy.

      Cheers,
      PG

  2. Fr. says:

    You do not mention with(), which works a lot like your code: with(myData, sum(A))

    • PirateGrunt says:

      What’s really crazy is that didn’t occur to me, so I didn’t test it. I just did and it doesn’t appear to work. Give this a try and see if you get the same error I did:

      someFunction = function(y, data = NULL)
      {
        y = with(data, y)
        sum(y)
      }
      
      myData = data.frame(A = c(1,2,3), B = c(10,9,8))
      someFunction(A, data=myData)
      
      • Colin says:

        Hi PirateGrunt,
        I tested your code, but I already had a variable A declared. It uses that A, even though your code seems to explicitly overwrite A within the environment of the function. I’m not sure what’s going on here, but I would sorely like to.

      • PirateGrunt says:

        Colin,

        That’s behavior which you should expect if there is already a variable declared which shares the name of one of the columns of your dataframe. There’s an example of this in the post. For this reason, although I like the way the syntax looks, I’m not wholly convinced that it’s sound practice.

        -PG

      • Fr. says:

        I do get the same error, but that’s expected since A is undefined in the environment. What I mean is, can’t you solve your problem by doing something like:

        myData = data.frame(A = c(1,2,3), B = c(10,9,8))
        sumSquares = function(x) { sum(x^2) }
        with(myData, sumSquares(A))
        with(myData, sumSquares(B))

      • PirateGrunt says:

        Ah, I see what you mean. Sure, I could do that. But I don’t want to. :-)

  3. andydolman says:

    Use the enclos argument to eval. This specifies where R will look for objects not found in the specified envir.

    If you create an empty environment inside your function definition

    empty <- emptyenv()

    then modify the eval line in your function by adding enclos = empty

    y = eval(arguments$y, envir=data, enclos=empty)

    R will look in the empty environment if the object is missing from data.

    This is probably still a hacky way of doing it.

    • PirateGrunt says:

      I had a look at enclos, but to be honest, it didn’t make loads of sense to me. I’m a novice when it comes to manipulating various environments. For instance, the notion of an “empty environment” sounds either like a Zen koan or my brain after a rough day at work. Environments are definitely something I need to understand better.

      -PG

  4. psychometriko says:

    I’ve always admired this type of syntax as well, but haven’t wanted to go to the trouble of learning it. This post will be very helpful to me in the future, thanks!

  5. R. Mark Sharp says:

    I ran debug(someFunction)
    Once I got arguments defined, I did some looking around and testing so that I am now really confused.

    debug at #4: y = eval(arguments$y, data)
    Browse[2]> arguments
    [[1]]
    someFunction

    $y
    A

    $data
    myData

    ## looking at what the function uses
    Browse[2]> eval(arguments$y, data)
    [1] 1 2 3

    ## looking at what I thought should work
    Browse[2]> eval(arguments$y, myData)
    [1] 1 2 3

    ## If that works, why does this not work
    Browse[2]> eval(arguments$y, arguments$data)
    Error in eval(arguments$y, arguments$data) :
    invalid ‘envir’ argument of type ‘symbol’

    ## I am sure the answer is somewhere in understanding why this also works
    Browse[2]> eval(arguments$y, eval(arguments$data))
    [1] 1 2 3
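
The debugger session above can be reproduced outside the function. In this sketch, quote(A) and quote(myData) stand in for arguments$y and arguments$data; the point it tries to show is that arguments$data holds the name myData rather than the dataframe, and eval() will not accept a bare name as the place to look things up.

```r
myData <- data.frame(A = c(1, 2, 3), B = c(10, 9, 8))

ySym    <- quote(A)        # stands in for arguments$y
dataSym <- quote(myData)   # stands in for arguments$data

# A symbol is not a valid place to look things up, so this call fails;
# tryCatch() keeps the error message instead of stopping
result <- tryCatch(eval(ySym, dataSym), error = function(e) conditionMessage(e))
result

# Evaluating the symbol first yields the data frame, which eval() accepts
eval(ySym, eval(dataSym))
# [1] 1 2 3
```

Inside the function, data and myData both refer to the dataframe itself, which is why those two calls work.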
