Tidy Tuesday: 2019 Week 2

2019 January 8

This will be my first go at the weekly Tidy Tuesday exercises. I likely won’t make every one, but will take a crack at as many as I can. I was particularly drawn to this one because it’s data about popular culture, one of my favorite things to think about.

I’ll be doing this in my favorite Gather -> Cook -> Analyze -> Present format, as implemented in my represtools package. We won’t be presenting results - this is just EDA for fun. I’ll also not be using the full suite of represtools functionality, just breaking down the steps within this single blog post.

Before we get started, though, we load in the tidyverse package.

library(tidyverse)
## ── Attaching packages ──────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Gather

First, we’ll fetch the data. In this case, it’s a simple read_csv() call on the URL.

url <- 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-01-08/IMDb_Economist_tv_ratings.csv'

tbl_raw <- read_csv(url)
## Parsed with column specification:
## cols(
##   titleId = col_character(),
##   seasonNumber = col_integer(),
##   title = col_character(),
##   date = col_date(format = ""),
##   av_rating = col_double(),
##   share = col_double(),
##   genres = col_character()
## )
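
If I wanted the Gather step to work offline, a minimal sketch (the file name here is my own choice, not part of the original workflow) would cache the download locally and read from that copy:

# Cache a local copy of the raw data so we don't hit the network every time
local_file <- 'tv_ratings.csv'
if (!file.exists(local_file)) download.file(url, local_file)
tbl_raw <- read_csv(local_file)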

Cook: Title table

Raw data in hand, I start to cook what I have. This is largely normalization. It’s overkill, but I’m fussy and struggle to look at data any other way.

The raw data is a denormalized mashup of title, season and genre. We can split those out into separate tables.

Title is easy to pick up. Simply select() the ID and the title and then run unique().

tbl_title <- tbl_raw %>% 
  select(titleId, title) %>% 
  unique()

tbl_title
## # A tibble: 876 x 2
##    titleId   title               
##    <chr>     <chr>               
##  1 tt2879552 11.22.63            
##  2 tt3148266 12 Monkeys          
##  3 tt1837492 13 Reasons Why      
##  4 tt0285331 24                  
##  5 tt5345490 24: Legacy          
##  6 tt1598754 24: Live Another Day
##  7 tt2197797 666 Park Avenue     
##  8 tt0115083 7th Heaven          
##  9 tt0312081 8 Simple Rules      
## 10 tt7235466 9-1-1               
## # ... with 866 more rows
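
Out of curiosity, we could also check whether any title string maps to more than one titleId - remakes and revivals would land here. A quick sketch, without chasing down the results:

# Titles that appear under more than one titleId
tbl_title %>% 
  count(title) %>% 
  filter(n > 1)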

Cook: Check genre

We might expect that genre is a property of the title. That is, each title has one genre value which remains the same over time. That’s not actually the case. First, each title can have multiple genres. This makes sense for some titles - Gilmore Girls can do drama and comedy equally well - but is an awkward fit for others. I love Roseanne, but would say that it has much more comedic than dramatic elements. We’ll have more to say about this in a bit.

The other concern is fairly minor. At least one show switches genres from one season to another. We can flag this by using the n_distinct() function to see how many genres there are for each title. We also could have called unique() before group_by(titleId) and gotten the same result.

tbl_genre_switch <- tbl_raw %>% 
  select(titleId, genres) %>% 
  group_by(titleId) %>% 
  summarise(num_genre = n_distinct(genres)) %>% 
  ungroup() %>% 
  filter(num_genre > 1) %>% 
  inner_join(tbl_raw, by = 'titleId') %>% 
  select(title, date, genres)

tbl_genre_switch
## # A tibble: 3 x 3
##   title      date       genres             
##   <chr>      <date>     <chr>              
## 1 Twin Peaks 1990-04-29 Crime,Drama,Mystery
## 2 Twin Peaks 1991-01-11 Crime,Drama,Mystery
## 3 Twin Peaks 2017-07-09 Crime,Drama,Fantasy

Twin Peaks has different genre information between its two incarnations; it seems the revival swapped mystery for fantasy. It’s not a big deal and is something we could correct manually, but I’m not going to touch it and will simply let it slide.
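
For completeness, the unique()-before-group_by() variant mentioned above is only a small change and should flag the same titles:

# Alternative: collapse to distinct titleId/genre pairs first, then count
tbl_raw %>% 
  select(titleId, genres) %>% 
  unique() %>% 
  group_by(titleId) %>% 
  summarise(num_genre = n()) %>% 
  filter(num_genre > 1)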

Cook: Seasons

This is simply dropping the genre and title fields. In addition, we will add columns for the year and semester, which will help us visualize later. Year is a bit problematic, as the TV season does not align 100% with a calendar year. If we were particularly fussy, we could use some pivot date like the end of May (when the TV season in the US typically ends) to adjust the year. In this example, we’re not bothered by this.

tbl_season <- tbl_raw %>% 
  select(-genres, -title) %>% 
  mutate(
      year = lubridate::year(date)
    , semester = lubridate::semester(date, with_year = TRUE)
  )

tbl_season
## # A tibble: 2,266 x 7
##    titleId   seasonNumber date       av_rating share  year semester
##    <chr>            <int> <date>         <dbl> <dbl> <dbl>    <dbl>
##  1 tt2879552            1 2016-03-10      8.49 0.510 2016.    2016.
##  2 tt3148266            1 2015-02-27      8.34 0.460 2015.    2015.
##  3 tt3148266            2 2016-05-30      8.82 0.250 2016.    2016.
##  4 tt3148266            3 2017-05-19      9.04 0.190 2017.    2017.
##  5 tt3148266            4 2018-06-26      9.14 0.380 2018.    2018.
##  6 tt1837492            1 2017-03-31      8.44 2.38  2017.    2017.
##  7 tt1837492            2 2018-05-18      7.51 2.19  2018.    2018.
##  8 tt0285331            1 2002-02-16      8.56 6.67  2002.    2002.
##  9 tt0285331            2 2003-02-09      8.70 7.13  2003.    2003.
## 10 tt0285331            3 2004-02-09      8.72 5.88  2004.    2004.
## # ... with 2,256 more rows
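
If we did care about the TV-season boundary, a sketch using an end-of-May pivot might look like the following. The cutoff month is my own assumption, not something in the data.

# Label each season by the year the US TV season began: anything airing
# through May counts toward the season that started the prior calendar year
tbl_season %>% 
  mutate(
    tv_year = if_else(lubridate::month(date) <= 5, year - 1, year)
  )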

Cook: Split genres

The genres are stored as a comma-separated string in a single column. This is not how I typically like to work with data, so we’ll normalize it. We can use separate() to change that single column into multiple columns, but first we need to know how many columns to create. We can call str_count() on the comma separator to do that. We need to make sure to add 1 to the result, i.e. one comma implies two genres. Note the use of unlist() to convert max_genres into a vector.

tbl_genre <- tbl_raw %>% 
  select(titleId, genre = genres)

tbl_genre
## # A tibble: 2,266 x 2
##    titleId   genre                  
##    <chr>     <chr>                  
##  1 tt2879552 Drama,Mystery,Sci-Fi   
##  2 tt3148266 Adventure,Drama,Mystery
##  3 tt3148266 Adventure,Drama,Mystery
##  4 tt3148266 Adventure,Drama,Mystery
##  5 tt3148266 Adventure,Drama,Mystery
##  6 tt1837492 Drama,Mystery          
##  7 tt1837492 Drama,Mystery          
##  8 tt0285331 Action,Crime,Drama     
##  9 tt0285331 Action,Crime,Drama     
## 10 tt0285331 Action,Crime,Drama     
## # ... with 2,256 more rows
max_genres <- tbl_genre %>% 
  mutate(
    commas = str_count(genre, ',')
  ) %>% 
  summarise(
    max_col = max(commas) + 1
  ) %>% 
  unlist()

Now, we can split the single genre column into 3 new columns. We will be warned that not every row has all three. That’s cool. We’ll filter them out, after using gather() to translate our data into a table which stores a titleId and a genre value.

tbl_genre <- tbl_genre %>% 
  tidyr::separate(genre, into = as.character(seq_len(max_genres)), sep = ',') %>% 
  gather(id, genre, -titleId) %>% 
  filter(!is.na(genre)) %>% 
  select(-id) %>% 
  unique()
## Warning: Expected 3 pieces. Missing pieces filled with `NA` in 686 rows [6,
## 7, 21, 22, 23, 31, 35, 37, 46, 62, 69, 73, 74, 75, 105, 106, 107, 118, 121,
## 122, ...].
tbl_genre
## # A tibble: 2,246 x 2
##    titleId   genre    
##    <chr>     <chr>    
##  1 tt2879552 Drama    
##  2 tt3148266 Adventure
##  3 tt1837492 Drama    
##  4 tt0285331 Action   
##  5 tt5345490 Action   
##  6 tt1598754 Action   
##  7 tt2197797 Drama    
##  8 tt0115083 Drama    
##  9 tt0312081 Comedy   
## 10 tt7235466 Action   
## # ... with 2,236 more rows

In structuring the data this way, we are assuming that we can map the ratings data to genre, irrespective of season. That’s not 100% accurate, as we saw earlier. The practical effect here means that Twin Peaks will map to four different genres.
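
Incidentally, tidyr::separate_rows() would handle the count-the-commas, separate() and gather() steps in one go. A sketch:

# Alternative: split the comma-delimited genre string straight into rows
tbl_raw %>% 
  select(titleId, genre = genres) %>% 
  tidyr::separate_rows(genre, sep = ',') %>% 
  unique()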

Cook: Consolidate genres

There are many genres, so let’s focus on the top 8. forcats::fct_lump() will do this. We add the result as a new column so that we can still see the more granular genre value, if we’d like.

tbl_genre <- tbl_genre %>% 
  mutate(
    genre_other = forcats::fct_lump(genre, n = 8)
  )
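
A quick way to see which genres survive the lumping, and how many rows fall into the “Other” bucket, is a simple count:

# Rows per lumped genre, most common first
tbl_genre %>% 
  count(genre_other, sort = TRUE)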

Analyze: Genre over time

Our data has now been lovingly fussed over and we are ready to answer some questions. The first thing I’m curious about is the change in genre over time. I feel as though the ’80s were a decade with lots of comedies, whereas the TV industry started to emphasize drama in the ’90s. Because our data begins in the 1990s, we can’t address this question, but we can look at more recent trends.

To explore this, we must join tbl_genre to tbl_season.

tbl_genre_time <- tbl_genre %>% 
  inner_join(tbl_season %>% select(titleId, date, year, av_rating), by = 'titleId')

Notice that the number of records in this table is more than in our raw data. This is the result of having normalized the genre vector.
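
A quick sanity check on the row counts bears that out:

# More rows than the raw data because a title can carry several genres
c(raw = nrow(tbl_raw), genre_time = nrow(tbl_genre_time))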

We can visualize this with a simple time series.

tbl_genre_time %>% 
  group_by(genre_other, year) %>% 
  summarise(
    count = n()
  ) %>% 
  ggplot(aes(year, count, color = genre_other)) + 
  geom_point()

The data bears out my theory of a switch to drama in recent years, though the timing is not quite what I had thought. It looks as though the change really took off after the year 2000. We can maybe call this “the Sopranos effect”.

Analyze: Change in rating over time

We’ve been hearing the term “peak TV” for years now. Has this changed how we critique the shows we watch? That is, are the ratings getting higher over time? Or, are our expectations being recalibrated so that average ratings are relatively stable? Does this vary by genre?

We can alter our time series to show the average rating as the response. We can add a smoother to see whether there is any inherent trend.

tbl_genre_time %>% 
  ggplot(aes(year, av_rating, color = genre_other)) + 
  geom_point() +
  geom_smooth() +
  facet_wrap(~genre_other)
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Overall, things look fairly smooth, apart from a comedic nadir in the mid-’90s.

I love using boxplots as a way to isolate outliers from central tendencies. The plot below suggests a great deal more volatility for dramas than other genres. Moreover, when they deviate from the norm, they tend to be poorly received. There is a similar effect in the past 8 years or so for the crime genre. This isn’t one that I pay attention to, but if I had to guess, I would imagine that there may be some fatigue about the various flavors of CSI or Law and Order.

tbl_genre_time %>% 
  mutate(year = as.factor(year)) %>% 
  ggplot(aes(year, av_rating, color = genre_other)) + 
  geom_boxplot() +
  facet_wrap(~genre_other)

Analyze: More about genre

As we saw earlier, there are a number of shows lurking in the drama genre that I would hesitate to label drama. We can pull out the series which are drama only by first creating a new table containing single-genre shows and then filtering it to drama.

tbl_genre_single <- tbl_genre %>% 
  group_by(titleId) %>% 
  summarise(
    num_genres = n()
  ) %>% 
  filter(
    num_genres == 1
  ) %>% 
  inner_join(tbl_genre)
## Joining, by = "titleId"
tbl_pure_drama <- tbl_genre_single %>% 
  filter(genre == "Drama") %>% 
  select(titleId) %>% 
  inner_join(tbl_genre_time)
## Joining, by = "titleId"

The “pure” dramas show a bit more volatility, but we’ve dropped quite a few of the low outliers from recent years.

tbl_pure_drama %>% 
  mutate(year = as.factor(year)) %>% 
  ggplot(aes(year, av_rating)) + 
  geom_boxplot()

Switching from box plots back to points allows us to see that “pure” dramas are not as well received recently. There’s a clear downward trend in the last few years.

tbl_pure_drama %>% 
  ggplot(aes(year, av_rating)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And yet, there are more pure dramas recently …

tbl_pure_drama %>% 
  ggplot(aes(year)) + 
  geom_bar()

Analyze: Non-comedy drama

The idea of a “pure” drama feels a bit too restrictive. For example, “The Sopranos” lists both crime and drama, but not comedy.1 Instead, we can focus on dramas which do not list comedy as one of their genres. We proceed much as we did to identify pure dramas, except that here we use an anti_join() to drop all of the records which include comedy in their genre listing.

tbl_drama_non_comedy <- tbl_genre %>% 
  anti_join(
      tbl_genre %>% filter(genre == 'Comedy')
    , by = 'titleId'
  ) %>% 
  select(titleId) %>% 
  unique() %>% 
  inner_join(tbl_season)
## Joining, by = "titleId"

We will reproduce our box and point plots for this data set.

tbl_drama_non_comedy %>% 
  ggplot(aes(year, av_rating)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

tbl_drama_non_comedy %>% 
  mutate(year = as.factor(year)) %>% 
  ggplot(aes(year, av_rating)) + 
  geom_boxplot()

It looks as though in recent years the overall quality has improved, but there are more negative outliers. That is, dramas are marginally better overall, but the bad ones are worse.

Analyze: Skew over time

We can get a sense for this by examining the skewness of the samples. It’s clear that every year is negatively skewed, but the degree of skew is not constant over time. The e1071 package has skewness() and kurtosis() functions that allow us to take sample estimates of moments higher than the mean and standard deviation.
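
As a reminder of what skewness() measures, here is a rough by-hand version for a single year (2017 picked arbitrarily). e1071 offers a few finite-sample variants via its type argument, so the number may differ slightly from the packaged function.

# Third standardized moment, computed directly
x <- tbl_drama_non_comedy %>% 
  filter(year == 2017) %>% 
  pull(av_rating)
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
m3 / m2 ^ (3 / 2)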

We can calculate all of those sample estimates with the summarize_at() function from dplyr. We simply ask it to apply a set of functions to a particular response or set of responses.

library(e1071)

tbl_drama_non_comedy_year <- tbl_drama_non_comedy %>% 
  group_by(year) %>% 
  summarise_at(
      'av_rating'
    , .funs = funs(mean, median, sd, var, skewness, kurtosis)
  )
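
As an aside, funs() was soft-deprecated in later releases of dplyr; the same summary would be written with a named list (column naming may vary a little by version):

# Equivalent call in more recent dplyr, sketched for reference
tbl_drama_non_comedy %>% 
  group_by(year) %>% 
  summarise_at(
      'av_rating'
    , .funs = list(mean = mean, median = median, sd = sd, var = var
                   , skewness = skewness, kurtosis = kurtosis)
  )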

The median rating has increased over time (peak TV!).

tbl_drama_non_comedy_year %>% 
  ggplot(aes(year, median)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This requires a grain of salt, as we are observing things against a y-axis whose lower limit is about 7.6. The picture looks much more stable when the y-axis starts at zero.

tbl_drama_non_comedy_year %>% 
  ggplot(aes(year, median)) +
  geom_point() +
  geom_smooth() + 
  scale_y_continuous(limits = c(0, NA))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
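
expand_limits() is another way to pull zero into the y-range without setting hard limits:

tbl_drama_non_comedy_year %>% 
  ggplot(aes(year, median)) +
  geom_point() +
  geom_smooth() + 
  expand_limits(y = 0)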

Setting the y-axis to zero is likely overkill. There is probably a practical floor to the ratings which is higher than zero; the smallest value in our data is more than 2.5.

tbl_season$av_rating %>% 
  summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.704   7.731   8.115   8.061   8.490   9.682

That understood, the standard deviation in rating has crept up a bit since 2010, after falling for the first twenty years in our data. This gives some qualitative support to the idea that we’re getting used to higher quality shows and have altered our expectations. This is not a robust statistical conclusion, however. There are any number of additional experimental design elements that one would want to control for to support that conclusion. So, this is more suggestive of a hypothesis than anything else.

tbl_drama_non_comedy_year %>% 
  ggplot(aes(year, sd)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Similarly, we see that the skewness has indeed grown more negative over time. The assessment of quality has grown more varied over the years, and when ratings deviate from the center, they are far more likely to land well below it than above it.

tbl_drama_non_comedy_year %>% 
  ggplot(aes(year, skewness)) +
  geom_point() +
  geom_smooth() +
  ggtitle("Dramas ")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Assessment of higher order moments is a meaningful issue for financial analysis. There are a number of tradable instruments which attempt to capture a price for market volatility; the VIX is one such example. Examining the phenomenon with “toy” data like television ratings can help us explore the problem in another domain. I’m very curious to posit a time series function which captures my thoughts about the changes in television ratings. A shame there’s no liquid market for an instrument like that.

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
##  [1] e1071_1.6-8     bindrcpp_0.2.2  forcats_0.3.0   stringr_1.3.1  
##  [5] dplyr_0.7.8     purrr_0.2.5     readr_1.1.1     tidyr_0.8.2    
##  [9] tibble_1.4.2    ggplot2_3.0.0   tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_0.2.5    xfun_0.4            lattice_0.20-35    
##  [4] haven_1.1.1         colorspace_1.3-2    generics_0.0.1.9000
##  [7] htmltools_0.3.6     yaml_2.2.0          mgcv_1.8-24        
## [10] utf8_1.1.3          rlang_0.3.0.1       pillar_1.2.1       
## [13] glue_1.3.0          withr_2.1.2         modelr_0.1.2       
## [16] readxl_1.1.0        bindr_0.1.1         plyr_1.8.4         
## [19] munsell_0.5.0       blogdown_0.10       gtable_0.2.0       
## [22] cellranger_1.1.0    rvest_0.3.2         evaluate_0.10.1    
## [25] labeling_0.3        knitr_1.20          class_7.3-14       
## [28] curl_3.2            broom_0.5.0.9001    Rcpp_1.0.0         
## [31] scales_1.0.0        backports_1.1.2     jsonlite_1.5       
## [34] hms_0.4.2           digest_0.6.15       stringi_1.2.2      
## [37] bookdown_0.9        grid_3.4.4          rprojroot_1.3-2    
## [40] cli_1.0.0           tools_3.4.4         magrittr_1.5       
## [43] lazyeval_0.2.1      crayon_1.3.4        pkgconfig_2.0.2    
## [46] Matrix_1.2-14       xml2_1.2.0          lubridate_1.7.4    
## [49] assertthat_0.2.0    rmarkdown_1.10      httr_1.3.1         
## [52] rstudioapi_0.8      R6_2.2.2            nlme_3.1-137       
## [55] compiler_3.4.4

  1. Although The Sopranos had some comedic moments that were hilarious.
