March 28, 2015
Quite some time ago (October 2013, according to Amazon), I bought a copy of “Reproducible Research with R and RStudio” by Christopher Gandrud. And it was awesome. Since then, I’ve been using knitr and RMarkdown quite a lot. However, until recently, I never bothered with a makefile. At the time, I had assumed that it was something only available to people on *nix systems, and back then I was developing exclusively on PC. I even wrote some R scripts that were more or less makefiles: reading the contents of a directory, checking for output and running knit or whatever. My workflow continued to evolve and get standardised and I moved to Linux, so I picked up Gandrud’s book again to review the bits about makefiles. I’m not sure when it was that I realized that RTools includes a make program for Windows, but I wish someone had told me that a couple of years ago.
So, enough preamble. What’s the benefit and how does it work?
A makefile ensures that all of your work gets done, but only when it needs to, and that each step has the raw material it needs to work. Identify what output you expect to see and how to generate that output, and the make utility will go to work. If the output is already there and is newer than the files it depends on, make will skip to the next thing which needs to get done. Like a 12-bar blues, it is very simple in concept, but easy to extend to all sorts of complex derivations. Here’s my approach:
- Use RMarkdown files as your default. This will allow you to comment on everything that you’re doing and construct high quality output that you can share with folks down the line. For most steps, I render output to Word. Yes, yes, but my audience likes Word and there’s no drama about different browsers and I can easily edit the content, if I need to.
- The workflow breaks down into four discrete steps: Gather data, Process data, Analyze data, Present data. This is pretty close to what Gandrud proposes.
- Save output in .rda files at each step of the process.
- Gather data. Fetch it from the internet, from your data warehouse, or from wherever. This step makes a copy of that information and records where it came from, how you got it and how it’s structured. Save everything in a folder called ‘raw’. At this stage, I try to make no adjustments at all.
- Process data. Take the raw information and alter it. This step typically involves ensuring that data types are righteous: factors are characters if necessary, dates are dates, etc. Calculated and convenience columns (storing the year as well as the date, for example) are created. I might merge data frames into a single table, and spread and/or gather as appropriate. Often, though, not a lot happens.
- Analyze data. This is usually exploratory, or even just descriptive. I’ll produce (hopefully) tons of plots and summary tables. At some point, I’ll come to a conclusion about models that I think make sense.
- Present. At the moment, my preference is to use slidy for presentation output. This keeps things fairly clean and simple. More complex explication should use something like LaTeX or Word. I can’t stand technical writing and I’m awful at it, so I usually stick to pictures and bullet points.
How does it work?
I don’t really know. Sorry. I’ve had a go with the GNU documentation and it’s pretty overwhelming. I took Christopher’s basic example and modified it for my purposes. Boiling it down to basic principles, I know several things:
- make operates by building “targets”. Once the full set of targets is built, make has nothing left to do.
- Each target has (probably) a “prerequisite”, which it needs in order to get built. The prerequisite may also be (and often is) a target itself.
- The rule for building the target is called a “recipe” and is typically a shell command.
- make makes liberal use of variables and wildcards. A variable is often written in all caps. To refer to it, enclose it within parentheses and precede it with a dollar sign, e.g. $(MY_VARIABLE).
- You can also include a “clean” step, which will wipe out all the targets. This will ensure that everything gets rebuilt.
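To make those terms concrete, here is a minimal sketch of a makefile. It is not taken from the baseball project: the file names report.Rmd, data.rda and report.docx and the variable OUT are hypothetical, and the clean recipe assumes a Unix-like shell where rm is available.

```make
# Hypothetical file names, for illustration only.
OUT = report.docx

# "all" is the first target, so a plain `make` builds it.
all: $(OUT)

# Target: report.docx. Prerequisites: report.Rmd and data.rda.
# The recipe is a shell command and must be indented with a tab.
report.docx: report.Rmd data.rda
	Rscript -e "rmarkdown::render('report.Rmd')"

# "clean" wipes out the target so that everything gets rebuilt next time.
clean:
	rm -f $(OUT)

# all and clean are not files, so tell make not to look for them on disk.
.PHONY: all clean
```

Running make a second time without touching report.Rmd or data.rda should do nothing, because the target is newer than both prerequisites. This sketch also assumes report.Rmd declares Word output in its header, so that render() produces the report.docx that make is waiting for.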
I took a few minutes to straighten out my just-for-fun Baseball repository to align it with my current preferred workflow. Here’s what I did: first, ensure that the directory structure stuck to the gather -> process -> analyze -> present flow. In this case, that just meant a bit of tidying in my data directory. Second, copy the boilerplate makefile from my gist. Finally, open “Tools/Project Options” in RStudio and change the build tool from “none” to “Makefile”. That’s it. Let me say that again. That’s it.
NOTE: Please observe the copyright and limited use license at the Lahman site.
Let’s walk through the makefile. The first thing we do is establish where the root directories are. Note the use of variable substitution in defining the data directory.
RDIR = .
DATA_DIR = $(RDIR)/data
Next, we’ll use wildcards to establish all of the .Rmd files in each of our four steps as prerequisites. I’ll just show this for the “gather” step. In the second line, the wildcard function pulls every .Rmd file in the gather directory. The third line performs a substitution to construct a list of targets.
GATHER_DIR = $(DATA_DIR)/gather
GATHER_SOURCE = $(wildcard $(GATHER_DIR)/*.Rmd)
GATHER_OUT = $(GATHER_SOURCE:.Rmd=.docx)
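If you want to see what those definitions expand to, GNU make’s info function prints a value while the makefile is being read. The following is only a sketch for poking around, assuming GNU make (which is what RTools ships); the do-nothing “all” target is my own addition so that the file can be run by itself.

```make
RDIR = .
DATA_DIR = $(RDIR)/data

GATHER_DIR = $(DATA_DIR)/gather
GATHER_SOURCE = $(wildcard $(GATHER_DIR)/*.Rmd)
GATHER_OUT = $(GATHER_SOURCE:.Rmd=.docx)

# $(info ...) is a GNU make function; it prints while the makefile is parsed.
$(info GATHER_SOURCE = $(GATHER_SOURCE))
$(info GATHER_OUT = $(GATHER_OUT))

# A do-nothing target, so that `make` has something to build.
all: ;
```

With two hypothetical files, foo.Rmd and bar.Rmd, in the gather directory, GATHER_SOURCE would list both .Rmd paths and GATHER_OUT the same paths ending in .docx. If the directory holds no .Rmd files, both variables come out empty.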
The step with the target of “all” is the key. “all” has prerequisites that are targets of each of the four steps listed above. Just before defining our ultimate target, we define a variable which will act as the “recipe” for each of the steps. The $< will be substituted with the name of the first prerequisite of whichever rule is running, which here is one of our various .Rmd files.
KNIT = Rscript -e "require(rmarkdown); render('$<')"

all: $(GATHER_OUT) $(PROCESS_OUT) $(ANALYSIS_OUT) $(PRESENTATION_OUT)
We’re ready to roll. Again, we’ll just show the “gather” step. This will use another form of wildcard which will associate a .Rmd file with its target. That prerequisite name will be fed into the KNIT variable we defined earlier.
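A sketch of what such a rule can look like for the gather step, written as a complete small makefile so it can be tried by itself. The variable definitions are repeated from earlier in this post; the pattern rule at the bottom is my reconstruction of the idea rather than a verbatim copy of the gist. It assumes each .Rmd declares Word output in its header, so that render() produces a .docx with the same name.

```make
RDIR = .
DATA_DIR = $(RDIR)/data

GATHER_DIR = $(DATA_DIR)/gather
GATHER_SOURCE = $(wildcard $(GATHER_DIR)/*.Rmd)
GATHER_OUT = $(GATHER_SOURCE:.Rmd=.docx)

# Defined with "=", so $< is not expanded here but later,
# when the recipe runs for a particular target.
KNIT = Rscript -e "require(rmarkdown); render('$<')"

all: $(GATHER_OUT)

# Pattern rule: % matches the file stem, so any .docx in the gather
# directory is built from the .Rmd with the same stem.
# The recipe line must begin with a tab.
$(GATHER_DIR)/%.docx: $(GATHER_DIR)/%.Rmd
	$(KNIT)
```

For a hypothetical data/gather/foo.Rmd, make would see that data/gather/foo.docx is missing or older than its .Rmd, substitute the .Rmd path for $< and run the Rscript command on it.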
Within RStudio, you can just hit CTRL-SHIFT-B to execute make. You’ll see the markdown engine zip through its files and eventually you’ll see a pile of documentation produced. If everything went well, the next time you execute make it will tell you that nothing needs to be done. Change a .Rmd file in the processing step, though, and it will recreate that file and reperform all of the analysis. Of course, it’s possible to define things in such a way that not all of the analysis, or all of the processing, or whatever gets done if something changes upstream. I tend to customize this basic makefile to be a bit more fine-tuned. However, I’ll always keep these steps in. This will give me an extra margin of safety to make sure that I’ve not ignored a critical dependency.
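One way to express the coarse, safe version of those dependencies is to make every output of a step depend on all of the outputs of the step before it. This is a fragment to append to the makefile, not a file that runs alone, and it is a sketch of the idea rather than necessarily what the gist does. It assumes PROCESS_OUT, ANALYSIS_OUT and PRESENTATION_OUT are defined the same way as GATHER_OUT and that each step has its own pattern rule.

```make
# These rules have no recipe; they only add prerequisites.
# The pattern rule for each step still supplies the recipe, and in a
# pattern rule $< stays the prerequisite named by the pattern (the .Rmd).
$(PROCESS_OUT): $(GATHER_OUT)
$(ANALYSIS_OUT): $(PROCESS_OUT)
$(PRESENTATION_OUT): $(ANALYSIS_OUT)
```

A more fine-tuned makefile would replace these blanket lines with rules naming only the individual files that a given document really reads.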
Below is a list of things that are awesome:
- Makefiles. GNU/Linux is like the Age of Enlightenment and the invention of movable type combined.
- Christopher Gandrud’s book. Seriously, buy it.
- Yihui Xie. I’ve not read his book yet, but his stuff is amazing.
## R version 3.1.3 (2015-03-09)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.2 LTS
##
## locale:
##  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
##  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
##  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
##  LC_PAPER=en_US.UTF-8 LC_NAME=C
##  LC_ADDRESS=C LC_TELEPHONE=C
##  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
##  stats graphics grDevices utils datasets methods base
##
## other attached packages:
##  knitr_1.6 RWordPress_0.2-3
##
## loaded via a namespace (and not attached):
##  digest_0.6.4 evaluate_0.5.5 formatR_0.10 htmltools_0.2.6
##  RCurl_1.95-4.1 rmarkdown_0.5.1 stringr_0.6.2 tools_3.1.3
##  XML_3.98-1.1 XMLRPC_0.3-0 yaml_2.1.13