Archer and Tidy Data Principles (Part 1)

Updated 2018-03-25

Motivation

After reading The Life Changing Magic of Tidying Text and A tidy text analysis of Rick and Morty, I wanted to do something similar for Rick and Morty, and I did. Now I’m doing the same for Archer.

In this post I’ll focus on the Tidy Data principles. In any case, here is the GitHub repo with the scripts to scrape the subtitles of Rick and Morty and other shows.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

Let’s scrape

The subtools package returns a data frame after reading srt files. In addition to that resulting data frame, I wanted to explicitly record the season and episode of each line of the subtitles. To do that I had to scrape the subtitles and then use str_replace_all. To follow the steps, clone the repo from GitHub:

git clone https://github.com/pachamaltese/rick_and_morty_tidy_text
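
As a rough sketch of the idea (this is not the repo’s actual script, and the file name below is made up for illustration), the season and episode labels can be derived from each subtitle file name with str_replace_all:

library(stringr)

# hypothetical example: one subtitle file per episode, named like "archer_s01e01.srt"
fname <- "archer_s01e01.srt"

# extract the sXX and eXX parts and uppercase them into the labels used later
season  <- toupper(str_replace_all(fname, "^.*(s[0-9]{2})e[0-9]{2}.*$", "\\1"))  # "S01"
episode <- toupper(str_replace_all(fname, "^.*s[0-9]{2}(e[0-9]{2}).*$", "\\1"))  # "E01"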

Archer Can Be So Tidy

After reading the tidy file I created when scraping the subtitles, I use unnest_tokens to split the subtitles into words. This function uses the tokenizers package to separate each line into tokens. The default tokenization is by words, but other options include characters, n-grams, sentences, lines, paragraphs, or separation around a regex pattern.

# install pacman if needed, then use it to install/load the required packages
if (!require("pacman")) install.packages("pacman")
library(pacman)
p_load(data.table, tidyr, tidytext, dplyr, ggplot2, viridis, ggstance, igraph, ggraph)
p_load_gh("dgrtwo/widyr")

# read the scraped subtitles (one line of dialogue per row)
archer_subs <- as_tibble(fread("../../data/2017-10-13-rick-and-morty-tidy-data/archer_subs.csv"))

# one row per word, with common stop words removed
archer_subs_tidy <- archer_subs %>% 
  unnest_tokens(word, text) %>% 
  anti_join(stop_words)

The data is in one-word-per-row format, and we can manipulate it with tidy tools like dplyr. For example, in the last chunk I used an anti_join to remove extremely common words such as “a”, “an” or “the”.
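
The stop_words dataset that feeds that anti_join ships with tidytext and combines three stop word lexicons:

# stop_words combines the onix, SMART and snowball stop word lists
stop_words %>% 
  count(lexicon)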

Then we can use count to find the most common words across all the Archer episodes as a whole.

archer_subs_tidy %>%
  count(word, sort = TRUE)
# A tibble: 15,732 x 2
   word       n
   <chr>  <int>
 1 archer  4548
 2 lana    2800
 3 yeah    1478
 4 cyril   1473
 5 malory  1462
 6 pam     1300
 7 god      878
 8 wait     846
 9 uh       835
10 gonna    748
# ... with 15,722 more rows

Sentiment analysis can be done as an inner join. Three sentiment lexicons are available in the sentiments dataset that comes with the tidytext package. Let’s examine how sentiment changes during each season: we’ll find a sentiment score for each word using the Bing lexicon, then count the number of positive and negative words in defined sections of each season.

bing <- sentiments %>%
  filter(lexicon == "bing") %>%
  select(-score)

bing
# A tibble: 6,788 x 3
   word        sentiment lexicon
   <chr>       <chr>     <chr>  
 1 2-faced     negative  bing   
 2 2-faces     negative  bing   
 3 a+          positive  bing   
 4 abnormal    negative  bing   
 5 abolish     negative  bing   
 6 abominable  negative  bing   
 7 abominably  negative  bing   
 8 abominate   negative  bing   
 9 abomination negative  bing   
10 abort       negative  bing   
# ... with 6,778 more rows
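
A note for newer tidytext versions: the sentiments dataset now ships only the Bing lexicon (without the lexicon and score columns), so if the chunk above fails, the equivalent object can be obtained with get_sentiments:

# equivalent on current tidytext releases
bing <- get_sentiments("bing")
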
archer_sentiment <- archer_subs_tidy %>%
  inner_join(bing) %>% 
  count(episode_name, index = linenumber %/% 50, sentiment) %>% 
  spread(sentiment, n, fill = 0) %>% 
  mutate(sentiment = positive - negative) %>%
  left_join(archer_subs_tidy[,c("episode_name","season","episode")] %>% distinct()) %>% 
  arrange(season,episode) %>% 
  mutate(episode_name = paste(season,episode,"-",episode_name),
         season = factor(season, labels = paste("Season", 1:8))) %>% 
  select(episode_name, season, everything(), -episode)

archer_sentiment
# A tibble: 1,068 x 6
   episode_name        season   index negative positive sentiment
   <chr>               <fct>    <dbl>    <dbl>    <dbl>     <dbl>
 1 S01 E01 - Mole Hunt Season 1    0.      12.       8.       -4.
 2 S01 E01 - Mole Hunt Season 1    1.       3.       6.        3.
 3 S01 E01 - Mole Hunt Season 1    2.      25.       6.      -19.
 4 S01 E01 - Mole Hunt Season 1    3.       8.      10.        2.
 5 S01 E01 - Mole Hunt Season 1    4.       8.       5.       -3.
 6 S01 E01 - Mole Hunt Season 1    5.      17.      10.       -7.
 7 S01 E01 - Mole Hunt Season 1    6.      17.       5.      -12.
 8 S01 E01 - Mole Hunt Season 1    7.      13.       8.       -5.
 9 S01 E01 - Mole Hunt Season 1    8.      12.       3.       -9.
10 S01 E01 - Mole Hunt Season 1    9.       7.       8.        1.
# ... with 1,058 more rows

Now we can plot these sentiment scores across the plot trajectory of each season.

ggplot(archer_sentiment, aes(index, sentiment, fill = season)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~season, nrow = 3, scales = "free_x", dir = "v") +
  theme_minimal(base_size = 13) +
  labs(title = "Sentiment in Archer",
       y = "Sentiment") +
  scale_fill_viridis(end = 0.75, discrete = TRUE) +
  scale_x_discrete(expand = c(0.02,0)) +
  theme(strip.text = element_text(hjust = 0)) +
  theme(strip.text = element_text(face = "italic")) +
  theme(axis.title.x = element_blank()) +
  theme(axis.ticks.x = element_blank()) +
  theme(axis.text.x = element_blank())

Looking at Units Beyond Words

Lots of useful work can be done by tokenizing at the word level, but sometimes it is useful or necessary to look at different units of text. For example, some sentiment analysis algorithms look beyond unigrams (i.e. single words) to try to understand the sentiment of a sentence as a whole. These algorithms try to understand that “I am not having a good day” is a negative sentence, not a positive one, because of negation.

archer_sentences <- archer_subs %>% 
  group_by(season) %>% 
  unnest_tokens(sentence, text, token = "sentences") %>% 
  ungroup()

Let’s look at just one.

archer_sentences$sentence[12238]
[1] "phrasing!"

We can use tidy text analysis to ask questions such as: what are the most negative episodes in each of Archer’s seasons? First, let’s get the list of negative words from the Bing lexicon. Second, let’s make a data frame of how many words are in each episode so we can normalize for episode length. Then, let’s find the number of negative words in each episode and divide by the total words in that episode. Which episode of each season has the highest proportion of negative words?

bingnegative <- sentiments %>%
  filter(lexicon == "bing", sentiment == "negative")

wordcounts <- archer_subs_tidy %>%
  group_by(season, episode) %>%
  summarize(words = n())

archer_subs_tidy %>%
  semi_join(bingnegative) %>%
  group_by(season, episode) %>%
  summarize(negativewords = n()) %>%
  left_join(wordcounts, by = c("season", "episode")) %>%
  mutate(ratio = negativewords/words) %>%
  top_n(1)
# A tibble: 8 x 5
# Groups:   season [8]
  season episode negativewords words  ratio
  <chr>  <chr>           <int> <int>  <dbl>
1 S01    E05               130  1135 0.115 
2 S02    E05               127  1078 0.118 
3 S03    E09               127  1102 0.115 
4 S04    E08               129  1315 0.0981
5 S05    E11               133  1397 0.0952
6 S06    E03               136  1081 0.126 
7 S07    E06               138  1108 0.125 
8 S08    E02               123   939 0.131 

Networks of Words

A useful function from the widyr package is pairwise_count, which counts pairs of items that occur together within a group. Let’s count the words that occur together in the lines of the first season.

archer_words <- archer_subs_tidy %>%
  filter(season == "S01")

word_cooccurences <- archer_words %>%
  pairwise_count(word, linenumber, sort = TRUE)

word_cooccurences
# A tibble: 165,920 x 3
   item1  item2      n
   <chr>  <chr>  <dbl>
 1 cyril  archer   61.
 2 archer cyril    61.
 3 lana   archer   59.
 4 archer lana     59.
 5 yeah   archer   58.
 6 archer yeah     58.
 7 god    archer   54.
 8 archer god      54.
 9 yeah   cyril    45.
10 cyril  yeah     45.
# ... with 165,910 more rows

This can be useful, for example, to plot a network of co-occurring words with the igraph and ggraph packages.

set.seed(1717)

word_cooccurences %>%
  filter(n >= 25) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "#a8a8a8") +
  geom_node_point(color = "darkslategray4", size = 8) +
  geom_node_text(aes(label = name), vjust = 2.2) +
  ggtitle(expression(paste("Word Network in Archer's ", 
                           italic("Season One")))) +
  theme_void()