Pro Football Data

I’ve made the acquaintance of a group of data analysts here in the Triangle and have agreed to arrange an outing to see the Durham Bulls, the local minor league baseball team. Because it’s for stat nerds and because I was curious, I went looking for some baseball data to analyze. I found loads of it here, but soon got distracted by the presence of NFL statistics. The season is already well underway, but I thought it might be fun to try to build a predictive model for the sport.

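The first step is to get some data. The R function below pulls the HTML tables of game results from pro-football-reference.com, one season at a time. It relies on the XML package for readHTMLTable and on lubridate for the date handling.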

library(XML)        # readHTMLTable()
library(lubridate)  # mdy(), month(), year()

GetGamesHistory = function(FirstYear = 1985, LastYear = 2011)
{
  games.URL.stem = "http://www.pro-football-reference.com/years/"

  for (year in FirstYear:LastYear)
  {
    URL = paste(games.URL.stem, year, "/games.htm", sep="")

    games = readHTMLTable(URL)

    dfThisSeason = games[[1]]

    # Clean up the df
    dfThisSeason = subset(dfThisSeason, Week!="Week")
    dfThisSeason = subset(dfThisSeason, Week!="")
    dfThisSeason$Date = as.character(dfThisSeason$Date)
    dfThisSeason$GameDate = mdy(paste(dfThisSeason$Date, year))

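    # Late-season and playoff games fall in the next calendar year,
    # so bump the year for any date in January through June
    year(dfThisSeason$GameDate) = with(dfThisSeason, ifelse(month(GameDate) <=6, year(GameDate)+1, year(GameDate)))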

    if (year == FirstYear)
    {
      dfAllSeasons = dfThisSeason
    } else {
      dfAllSeasons = rbind(dfAllSeasons, dfThisSeason)
    }

  }

  # Keep only the columns named below; these positions depend on the site's table layout
  dfAllSeasons = dfAllSeasons[,c(14, 1, 5, 7, 8, 9)]

  colnames(dfAllSeasons) = c("GameDate", "Week", "Winner", "Loser", "WinnerPoints", "LoserPoints")

  dfAllSeasons$Winner = as.character(dfAllSeasons$Winner)
  dfAllSeasons$Loser = as.character(dfAllSeasons$Loser)
  dfAllSeasons$WinnerPoints = as.integer(as.character(dfAllSeasons$WinnerPoints))
  dfAllSeasons$LoserPoints = as.integer(as.character(dfAllSeasons$LoserPoints))
  dfAllSeasons$ScoreDifference = dfAllSeasons$WinnerPoints - dfAllSeasons$LoserPoints

  dfAllSeasons = subset(dfAllSeasons, !is.na(ScoreDifference))

  return (dfAllSeasons)

}


So I wrote this code about a week ago and already I can see that I don’t like it. For one, I try to avoid loops in R unless absolutely necessary. I’ll often start out with one just to get going, but they can usually be replaced with one of the apply functions or something similarly succinct. Two, I need to better understand the behavior of the readHTMLTable function. I remember going a couple of rounds with the points data, which is read in as a factor. That leads to the extremely ugly bit of code where I convert it to a character and then to an integer. If anyone has a better way, I’m all ears. Three, I need to revisit the basic idea of extracting columns by name; extraction by number is dangerous and confusing. Finally, I’d like to revise the data cleansing so that each game lists the home team, the visitor and the winner. That would make it easier to test whether a home field advantage exists.
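Here is a rough sketch of what those four changes might look like. It is not tested against the live site. FakeSeason is a made-up stand-in for readHTMLTable(URL)[[1]], and the column names (Week, Winner, At, Loser, PtsW, PtsL) are my assumptions, including the guess that an "@" marks games the winner played on the road.

```r
# Hypothetical stand-in for readHTMLTable(URL)[[1]]: one season's table with
# every column read in as a factor. Column names and values are assumptions.
FakeSeason = function(year)
{
  data.frame(Week   = c("1", "Week", "2"),
             Winner = c("Bears", "Winner", "Jets"),
             At     = c("", "", "@"),
             Loser  = c("Lions", "Loser", "Colts"),
             PtsW   = c("21", "PtsW", "30"),
             PtsL   = c("14", "PtsL", "3"),
             stringsAsFactors = TRUE)
}

CleanSeason = function(year)
{
  df = FakeSeason(year)
  df = subset(df, Week != "Week")   # drop the repeated header row
  df$Season = year
  # Select columns by name rather than by position
  df[, c("Season", "Week", "Winner", "At", "Loser", "PtsW", "PtsL")]
}

# lapply() builds one data frame per season; do.call(rbind, ...) stacks them,
# which replaces the for loop and the "first year" special case
dfAllSeasons = do.call(rbind, lapply(2010:2011, CleanSeason))
rownames(dfAllSeasons) = NULL

# as.integer() on a factor returns the internal level codes, not the scores,
# so the trip through character is still needed; a helper at least hides it
FactorToInt = function(x) as.integer(as.character(x))
dfAllSeasons$PtsW = FactorToInt(dfAllSeasons$PtsW)
dfAllSeasons$PtsL = FactorToInt(dfAllSeasons$PtsL)

# Derive home and visitor from the (assumed) "@" marker
dfAllSeasons$Winner = as.character(dfAllSeasons$Winner)
dfAllSeasons$Loser  = as.character(dfAllSeasons$Loser)
WinnerAway = dfAllSeasons$At == "@"
dfAllSeasons$Home    = ifelse(WinnerAway, dfAllSeasons$Loser,  dfAllSeasons$Winner)
dfAllSeasons$Visitor = ifelse(WinnerAway, dfAllSeasons$Winner, dfAllSeasons$Loser)
```

As far as I know, readHTMLTable can also be called with stringsAsFactors = FALSE, which should remove the need for the character step altogether, but I have not checked that against this particular table.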

All that said, the code works and gives me piles of data. How I look at it will be the subject of the next post.
