Tidying and Visualizing TV Ratings Data in R

About three years ago, I received a letter in the mail from Nielsen inviting me to participate in one of their panels. After spending a while on the phone with a representative to determine that it wasn’t a scam, I figured I’d give it a go. I tend to take great interest in knowing where data come from (especially when reporters and media sources try to use statistics to make a point), and as an avid tv watcher, it was cool to learn more about how Nielsen generates ratings and estimates program viewership. My randomly sampled “household” would represent thousands of similarly characterized “households” - white male, early 20s, unmarried, lives alone, etc. - and as long as I kept a small meter in my pocket, my televison-watching habits would contribute to “the ratings” (in retrospect, I’m sure I watched way more RuPaul’s Drag Race and Project Runway than the average early 20s male… 👀).

After completing the panel, I’ve been more curious about how companies use Nielsen data to inform decision making (i.e. renewing/canceling shows, optimizing time slots), and about how the data are shared publicly (snippets of ratings information get shared on Wikipedia, on blogs, etc). Since I’ve been wanting to start blogging more, I thought it would be fun to write my first post about how to wrangle and analyze TV ratings data in R!


Finding and tidying data

I happened to come across this blog, which has a wealth of data on tv/movie ratings, such as weekly reports on the viewership of every program on a major broadcast network network (here’s an example of one such report, which I’ll try to download and tidy).

In order to read in the PDF text in R, I decided to use pdftools. There are a few other packages out there that do a similar job, like tabularize, but I found pdftools to be the simplest and easiest to use. Given the formatting of the table (variable names and values containing spaces, nothing cleanly or consistently delimited), there was no clean and simple way to turn the text into a data frame, so I had to use some clever regular expressions to turn this:

## [1] "      Complete Nielsen Ratings: Jan. 29 - Feb 4, 2018\nP2+ R A18-34 R A18-49 R A25-54 R Day Time  Net Program                         P2+ * A18-34 A18-49 A25-54 P2 +/-\n 19      5        8        11    Mon  8:00 ABC The Bachelor                     6.36  1.5    1.7    2.0\n 32     45        49       47    Mon 10:00 ABC The Good Doctor [R]              5.00  0.5    0.8    1.1      -21%\n 12     22        15       10    Mon  8:00 CBS Kevin Can Wait                   7.28  0.6    1.2    2.0\n 17     37        21       13    Mon  8:30 CBS Man with a Plan                  6.64  0.5    1.1    1.8       -9%\n 25     42        30       18    Mon  9:00 CBS Superior Donuts                  5.80  0.5    1.0    1.7      -13%\n 34     52        40       32    Mon  9:30 CBS 9JKL                             4.82  0.4    0.9    1.4      -17%\n 24     51        38       29    Mon 10:00 CBS Scorpion                         5.82  0.4    0.9    1.4       21%\n 73     61        66       66    Mon  8:00 CW  Supergirl                        2.11  0.4    0.6    0.8\n 84     85        88       86    Mon  9:00 CW  Valor (SF)                       1.07  0.2    0.2    0.3      -49%\n 46     31        50       42    Mon  8:00 FOX Lucifer                          3.67  0.6    0.8    1.2\n 36     28        42       35    Mon  9:00 FOX The Resident                     4.75  0.6    0.9    1.3       29%\n 29     26        22       59    Mon  8:00 NBC The Wall                         5.42  0.6    1.1    0.9\n 35     46     "

into something nice and tidy like this!

latest_date = "2018-01-29"
get_ratings(latest_date)
program network weekday time date viewers_millions P2_ranking A18_34_ranking A18_49_ranking A25_54_ranking A18_34 A18_49 A25_54 P2_change start_of_week rerun
The Bachelor ABC Mon 8:00 2018-01-29 6.36 19 5 8 11 1.5 1.7 2.0 NA 2018-01-29 FALSE
The Good Doctor ABC Mon 10:00 2018-01-29 5.00 32 45 49 47 0.5 0.8 1.1 -21% 2018-01-29 TRUE
Kevin Can Wait CBS Mon 8:00 2018-01-29 7.28 12 22 15 10 0.6 1.2 2.0 NA 2018-01-29 FALSE
Man with a Plan CBS Mon 8:30 2018-01-29 6.64 17 37 21 13 0.5 1.1 1.8 -9% 2018-01-29 FALSE
Superior Donuts CBS Mon 9:00 2018-01-29 5.80 25 42 30 18 0.5 1.0 1.7 -13% 2018-01-29 FALSE
9JKL CBS Mon 9:30 2018-01-29 4.82 34 52 40 32 0.4 0.9 1.4 -17% 2018-01-29 FALSE
Scorpion CBS Mon 10:00 2018-01-29 5.82 24 51 38 29 0.4 0.9 1.4 21% 2018-01-29 FALSE
Supergirl CW Mon 8:00 2018-01-29 2.11 73 61 66 66 0.4 0.6 0.8 NA 2018-01-29 FALSE
Valor CW Mon 9:00 2018-01-29 1.07 84 85 88 86 0.2 0.2 0.3 -49% 2018-01-29 FALSE
Lucifer FOX Mon 8:00 2018-01-29 3.67 46 31 50 42 0.6 0.8 1.2 NA 2018-01-29 FALSE
The Resident FOX Mon 9:00 2018-01-29 4.75 36 28 42 35 0.6 0.9 1.3 29% 2018-01-29 FALSE
The Wall NBC Mon 8:00 2018-01-29 5.42 29 26 22 59 0.6 1.1 0.9 NA 2018-01-29 FALSE
Better Late Than Never NBC Mon 9:00 2018-01-29 4.78 35 46 41 34 0.5 0.9 1.3 -12% 2018-01-29 FALSE
The Brave NBC Mon 10:00 2018-01-29 3.90 41 54 56 55 0.4 0.7 1.0 -18% 2018-01-29 FALSE
Fresh off The Boat ABC Tue 8:00 2018-01-30 4.05 39 30 33 36 0.6 1.0 1.3 NA 2018-01-29 FALSE
Fresh off The Boat ABC Tue 8:30 2018-01-30 3.46 51 33 45 44 0.6 0.9 1.2 -15% 2018-01-29 FALSE
State Of The Union N/A ABC Tue 9:00 2018-01-30 NA 94 94 94 94 NA NA NA NA 2018-01-29 FALSE
SOTU Analysis ABC Tue 10:30 2018-01-30 3.61 49 65 61 60 0.3 0.6 0.9 NA 2018-01-29 FALSE
Super Bowl Greatest Commercials CBS Tue 8:00 2018-01-30 7.31 11 39 20 17 0.5 1.1 1.7 NA 2018-01-29 FALSE
State Of The Union N/A CBS Tue 9:00 2018-01-30 NA 95 95 95 95 NA NA NA NA 2018-01-29 FALSE
SOTU Analysis CBS Tue 10:30 2018-01-30 4.91 33 27 39 41 0.6 0.9 1.2 NA 2018-01-29 FALSE
The Flash CW Tue 8:00 2018-01-30 2.60 69 20 48 53 0.7 0.9 1.1 NA 2018-01-29 FALSE
Black Lightning CW Tue 9:00 2018-01-30 2.12 72 36 54 64 0.6 0.8 0.9 -18% 2018-01-29 FALSE
Lethal Weapon FOX Tue 8:00 2018-01-30 2.87 64 58 63 62 0.4 0.6 0.9 NA 2018-01-29 FALSE
State Of The Union N/A FOX Tue 9:00 2018-01-30 NA 96 96 96 96 NA NA NA NA 2018-01-29 FALSE
Ellen’s Game of Games NBC Tue 8:00 2018-01-30 7.58 10 9 7 8 1.1 1.8 2.4 NA 2018-01-29 FALSE
State Of The Union N/A NBC Tue 9:00 2018-01-30 NA 97 97 97 97 NA NA NA NA 2018-01-29 FALSE
SOTU Analysis NBC Tue 10:30 2018-01-30 5.92 23 17 13 15 0.7 1.3 1.8 NA 2018-01-29 FALSE
The Goldbergs ABC Wed 8:00 2018-01-31 4.33 37 47 23 27 0.5 1.1 1.5 NA 2018-01-29 TRUE
Speechless ABC Wed 8:30 2018-01-31 3.20 57 50 47 50 0.5 0.9 1.1 -26% 2018-01-29 TRUE
Modern Family ABC Wed 9:00 2018-01-31 3.49 50 48 44 43 0.5 0.9 1.2 9% 2018-01-29 TRUE
American Housewife ABC Wed 9:30 2018-01-31 2.97 63 57 53 52 0.4 0.8 1.1 -15% 2018-01-29 TRUE
Match Game ABC Wed 10:00 2018-01-31 2.85 66 60 59 58 0.4 0.7 1.0 -4% 2018-01-29 FALSE
The Amazing Race CBS Wed 8:00 2018-01-31 6.57 18 12 12 14 0.8 1.3 1.8 NA 2018-01-29 FALSE
SEAL Team CBS Wed 9:00 2018-01-31 6.65 16 41 27 24 0.5 1.0 1.5 1% 2018-01-29 FALSE
Criminal Minds CBS Wed 10:00 2018-01-31 5.42 30 44 32 31 0.5 1.0 1.4 -18% 2018-01-29 FALSE
Riverdale CW Wed 8:00 2018-01-31 1.34 81 63 70 76 0.4 0.5 0.5 NA 2018-01-29 FALSE
Dynasty CW Wed 9:00 2018-01-31 0.65 88 86 89 92 0.2 0.2 0.2 -51% 2018-01-29 FALSE
The X-Files FOX Wed 8:00 2018-01-31 3.64 47 32 43 38 0.6 0.9 1.3 NA 2018-01-29 FALSE
9-1-1 FOX Wed 9:00 2018-01-31 6.21 21 8 9 9 1.2 1.6 2.1 71% 2018-01-29 FALSE
The Blacklist NBC Wed 8:00 2018-01-31 6.11 22 25 29 26 0.6 1.0 1.5 NA 2018-01-29 FALSE
Law & Order: SVU NBC Wed 9:00 2018-01-31 5.48 28 18 18 21 0.7 1.2 1.6 -10% 2018-01-29 FALSE
Chicago PD NBC Wed 10:00 2018-01-31 6.72 15 16 17 20 0.7 1.2 1.6 23% 2018-01-29 FALSE
Grey’s Anatomy ABC Thu 8:00 2018-02-01 8.93 8 4 6 6 1.7 2.3 2.7 NA 2018-01-29 FALSE
Scandal ABC Thu 9:00 2018-02-01 5.62 27 10 11 16 1.0 1.4 1.8 -37% 2018-01-29 FALSE
HTGAWM ABC Thu 10:00 2018-02-01 3.68 45 19 34 37 0.7 1.0 1.3 -35% 2018-01-29 FALSE
The Big Bang Theory CBS Thu 8:00 2018-02-01 14.69 4 6 4 4 1.4 2.9 4.3 NA 2018-01-29 FALSE
Young Sheldon CBS Thu 8:30 2018-02-01 12.92 5 7 5 5 1.2 2.3 3.5 -12% 2018-01-29 FALSE
Mom CBS Thu 9:00 2018-02-01 9.11 7 15 10 7 0.7 1.5 2.4 -29% 2018-01-29 FALSE
Life in Pieces CBS Thu 9:30 2018-02-01 7.11 14 23 16 12 0.6 1.2 1.8 -22% 2018-01-29 FALSE

While the full code for the function get_ratings can be found in this Github repo, I’ll explain a little about the regular expressions I used. Let’s use this one line of the PDF text as an example:

## [1] " 32     45        49       47    Mon 10:00 ABC The Good Doctor [R]              5.00  0.5    0.8    1.1      -21%"

First, I identified some groups of elements to extract together. Since I have a fairly limited knowledge of regex semantics, the process of extracting elements from these lines was made significantly easier by using the rebus package in R. Here’s a flow of how the rebus expressions I wrote translated as regular expressions, followed by the elements extracted:

Group 1:

Group 2:

Group 3:

Bless you, rebus 🙏🏻. Next, I removed all of the elements above from the string to leave the network and program name, which I then split up into separate elements. All in all, it took a lot of trial and error to figure out the best regular expressions patterns to use, and while this may not be the prettiest or most efficient way to extract the data, it got the job done!

Visualizing the data

Once the data were tidy, the fun could begin! First, I wanted to see what the 20 most viewed programs were last week. I decided to exclude any reruns of shows to keep the list about new TV content.

data %>% 
  filter(rerun == FALSE) %>% 
  top_n(n=20,wt = viewers_millions) %>% 
  select(network,program,viewers_millions,P2_ranking) %>% 
  ggplot()+geom_bar(aes(x = reorder(program,viewers_millions),y = viewers_millions,fill=network),stat = 'identity')+coord_flip()+
  labs(x = "TV Show", y = "Viewers (millions)", title="Most Watched Programs on Broadcast TV from January 29-February 4, 2018") + theme_minimal() + scale_y_continuous(expand=c(0,0))

Unsurprisingly, the Super Bowl and its Post Game commentary dominated the weekly ratings, and This Is Us probably benifited a great deal from the post-Super Bowl time slot. CBS seems to have a lot of popular shows, while Fox and the CW didn’t even make the list. Unfortunately, data from each network’s airing of the State of the Union were not included in the PDF, so they did not make the cut either (though contrary to what one man thinks, this was not the most watched SOTU ever).

Next, I was curious to see which network received the highest average number of views on a daily basis. I decided to just look at Monday-Saturday, since we all know what everyone was watching this past Sunday…

data %>% 
  group_by(network,weekday) %>% 
  filter(!weekday == "Sun") %>% 
  summarise(avgviews = mean(viewers_millions,na.rm=TRUE)) %>% 
  ggplot()+geom_line(aes(x = weekday,y = avgviews,group=network,col=network))+
  scale_x_discrete(limits = c("Mon","Tue","Wed","Thu","Fri","Sat"))+ labs(y = "Average Number of Viewers (millions)")+theme_minimal() + scale_y_continuous(expand=c(0,0))

It looks like the CW consistently underperforms in ratings compared to the other major broadcast networks. Also, Thursday night is clearly the best night for ABC (Shondaland, anybody?? 🍷🍿).

Lastly, I was wondering how viewership and ratings changed over the past month. By defining a vector of dates over the past month when the PDFs were posted, and then using the map_df function in the purrr package, I was able to pretty easily gather all of the past month’s data in a single data frame. Again, I restricted the rankings to only show new content, and I compiled lists of the 10 most viewed programs by week (I used spread from tidyr to distribute the resulting names across four columns, which formed a neat table to display!)

dates = as.Date(latest_date)-7*c(0:3)

dates %>% 
  map_df(~get_ratings(.)) %>% 
  filter(rerun == FALSE) %>% 
  group_by(start_of_week) %>% 
  top_n(n=10,wt = viewers_millions) %>% 
  arrange(desc(start_of_week),desc(viewers_millions)) %>% 
  select(program,start_of_week) %>% 
  ungroup() %>% 
  bind_cols(ranking = rep(1:10,4)) %>% 
  spread(start_of_week,program) %>% 
  kable("html") %>% 
  kable_styling()
ranking 2018-01-08 2018-01-15 2018-01-22 2018-01-29
1 NFL Playoff Overrun NFC Championship Game: PHI v MIN 60th Grammy Awards Super Bowl LII: PHI v NE
2 NFL Div. Playoff: TEN v NE NFC Post-Game NCIS SB Post Game
3 The OT NFC Trophy BULL This Is Us
4 The Big Bang Theory The Big Bang Theory The Good Doctor The Big Bang Theory
5 NCIS Young Sheldon This Is Us Young Sheldon
6 Young Sheldon This Is Us NCIS: New Orleans Blue Bloods
7 BULL Blue Bloods Grey’s Anatomy Mom
8 Blue Bloods The Good Doctor Ellen’s Game of Games Grey’s Anatomy
9 This Is Us MOM Kevin Can Wait Hawaii Five-O
10 MOM Hawaii Five-O Grammy Red Carpet Ellen’s Game of Games

Again, no shock that Americans looooooove their football… 🏈

So there you have it! I had a lot of exploring these data, and there are tons of other visualizations and analyses that these data can be used for. Again, the code with the function to obtain the data can be found on Github - I would love to see what else you can come up with while using it! 😬🎉