represtools supports a specific, yet flexible analytic workflow. The guiding principles are:
rmarkdownfor code and documentation.
makeprogram is used to ensure that data and content is (re)built when needed by any module.
The four steps are verbs. Their outputs are adjectives which describe the data contained. Each verb and adjective gets its own directory. Let’s say we’d like to analyze baseball data. The command below will create this structure.
The resulting directory tree is as shown below:
- Baseball/ | - analyze/ | - analyzed/ | - cook/ | - cooked/ | - gather/ | - gathered/ | - present/ | - presented/ | - Baseball.Rproj | - Makefile
NewResearch will add an RStudio project file to the directory. That project file will presume that
make is used to build the project.
NewResearch adds this makefile to the project directory. Note that the makefile uses some extensions that are present in the GNU version of
make as described in the
Every analysis begins with data. To start gathering data, simply execute the
Gather function with a file name which describes the data being gathered.
This will create a new .Rmd file in the “gather” directory. The template will set various
knitr options as below:
knitr::opts_knit$set(root.dir = normalizePath('../')) knitr::opts_chunk$set(echo=TRUE, message=TRUE, warning=TRUE, error=TRUE)
normalizePath option ensures that we may run in a child directory of the analysis and use paths relative to that directory. For the gather step, the default is to echo every line of code and generate every message, warning and error.
After data has been gathered, the template will collect the names of every object in memory to save to an .rda file. By default, it will search for objects which start with the characters “df”, “plt” and “fit”. This is a kind of Hungarian notation to identify the most common analysis objects: “df” - a data frame (or tbl_df), “plt” - a plot object and “fit” - a model fit.
DescribeObjects function will apply a vector of functions to a vector of objects. The default is to apply the
str function. Note that we wrap the object names with a call to
NamesToObjects. This will seach for objects in memory based on variable name and return them in a list.
lstObjects <- represtools::ListObjects() represtools::DescribeObjects(represtools::NamesToObjects(lstObjects))
Though not strictly enforced, I find it a best practice to have one Gather file for each data object (typically a data frame).
Gathered data is raw. The opposite of raw data is cooked data. “Cooking”, in this context, refers to all of the manipulation one performs to render data fit for analysis. This is everything from renaming columns to filtering and merging multiple data sets. This could also encompass treatment of data fields like scaling, centering and so on.
inputFiles argument will add those values to the params section of the YAML. By default, the cook .Rmd template will load those files into memory as the first step of the cooking process.
Note that there is a new code chunk just after initializing knitr. the
LoadObjects function will load everything from all of the .rda files listed in
inputFiles. It will return a character string of all the objects loaded. Note that there is a possibility for naming collisions. Be careful not to use the same object name in multiple files.
loadedObjects <- represtools::LoadObjects(params)
If I’ve properly cooked my data, I’m good to go. The
Analyze function will produce a new .Rmd file with the input that I need for analysis.
represtools::Analyze("HitterPerformance", inputFiles = "Hitters")
This template will be less aggressive at informing the user in the published document. Only errors will be shown.
knitr::opts_chunk$set(echo=FALSE, message=FALSE, warning=FALSE, error=TRUE)
Present is slightly different. There will be no data output for further stage. Instead the goal is to generate output for an audience.
represtools::Present("Handedness" , inputFiles = "HitterPerformance" , title = "On the quality of right-handed batters" , output = "word_document")
There is a dummy parameter at the end of the .Rmd file which is used to generate an .rda file. This prevents the make utility from recreating a presentation document. This is also a useful control for the user to ensure that the file is recreated until they are satisfied with the results.
represtools relies on the tool
make to control the workflow.
make keeps track of file dependencies between various modules of the project. A full description of
make may be found in the GNU Make documentation.
Make is an option for constructing an RStudio project and this will be reflected in the RStudio project file, if the user asked for one. One may also invoke Make by calling the
Make function within represtools. The
make program must be installed an available on the system path in order for this to work.
Make accepts one parameter, which indicates which recipe to create. By default, every
represtools Makefile comes with at least four: “all”, “clean”, “cooked” and “cleanCooked”.
represtools::NewResearch("Baseball") represtools::Gather("Hitters") # write some code represtools::Cook("Hitters") # write some code represtools::Analyze("Handedness") # write some code represtools::Present("Handedness", title = "On the quality of right-handed batters", output = "html") # write some code represtools::Make()
And that would do it.
Let’s say that I’d like to augment my analysis with another set of batters. I’ll need a new data frame, so I need a new gather file. It’s fairly straightforward to add a new data file and re-cook. If I’m lucky, my cooked data is such that I don’t need to alter my analysis code to generate new output.
For now, I’m sticking with .rda. It enables you to save more than one item in a file and doesn’t lose the names of the objects. I think it’s possible to introduce confusion if a user loads the same data into two different named objects. As the goal is reproduction, we should try to avoid this. Yes, it’s possible to copy an object to a new variable, but I think this is more transparent in a review of code. I think most R users may not even be aware of the difference between .rda and .rds.
The short answer is nothing. John Myles White doesn’t need my help in getting more efficient at analyzing data. His structure works for him and, I’m sure, loads of other people. It just didn’t work for me. My tiny brain can’t handle more than about 4 folders at a time. I struggle to understand the difference between “data” and “cache”, “doc” and “reports”, “lib” and “src”. ProjectTemplate’s philosophy seems monolithic: load all of the data, work with it and produce a single output. I like to break things into bite-size chunks and only look at whatever element of data is relevant for a certain piece of the analysis. I also find that gathering and cooking tend to be steps that get reworked often, even in a later stage of analysis.
None of this is meant as crticism. John White is loads smarter than me and he’s created something very useful for people whose brains function like his. Some people like Coke. Some people like Pepsi.