Rick and Morty and Tidy Data Principles (Part 3)

Motivation

The first and second parts of this analysis left me feeling that I had done a lot of scraping and processing, and that the resulting data deserved more analysis to be put to good use. In this third and final part I'm again borrowing a lot of ideas from Julia Silge's blog.

In the GitHub repo of this project you'll find processed subs not just for Rick and Morty, but also for Archer, Bojack Horseman, Gravity Falls and Stranger Things. Why? In this post I'm going to compare the different shows.

Note: If some images appear too small on your screen you can open them in a new tab to show them in their original size.

Word Frequencies

Comparing frequencies across different shows can tell us how similar Gravity Falls, for example, is to Rick and Morty. I'll use the subtitles from the other shows, which I scraped using the same procedure I used for Rick and Morty.

# Install pacman if it is missing, then load it so that p_load() is available
if (!require("pacman")) { install.packages("pacman"); library(pacman) }
p_load(data.table,tidyr,tidytext,dplyr,ggplot2,viridis,ggstance,stringr,scales)
p_load_gh("dgrtwo/widyr")

rick_and_morty_subs   = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/rick_and_morty_subs.csv"))
archer_subs           = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/archer_subs.csv"))
bojack_horseman_subs  = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/bojack_horseman_subs.csv"))
gravity_falls_subs    = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/gravity_falls_subs.csv"))
stranger_things_subs  = as_tibble(fread("2017-10-13_rick_and_morty_tidy_data/stranger_things_subs.csv"))

rick_and_morty_subs_tidy = rick_and_morty_subs %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words)

archer_subs_tidy = archer_subs %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words)

bojack_horseman_subs_tidy = bojack_horseman_subs %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words)

gravity_falls_subs_tidy = gravity_falls_subs %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words)

stranger_things_subs_tidy = stranger_things_subs %>% 
  unnest_tokens(word,text) %>% 
  anti_join(stop_words)

With this processing we can compare frequencies across different shows. Here's an example of the top ten words for each show:

bind_cols(rick_and_morty_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
          archer_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
          bojack_horseman_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
          gravity_falls_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10),
          stranger_things_subs_tidy %>% count(word, sort = TRUE) %>% filter(row_number() <= 10)) %>% 
  setNames(., c("rm_word","rm_n","a_word","a_n","bh_word","bh_n","gf_word","gf_n","st_word","st_n"))
# A tibble: 10 x 10
   rm_word  rm_n a_word   a_n bh_word  bh_n gf_word  gf_n st_word  st_n
     <chr> <int>  <chr> <int>   <chr> <int>   <chr> <int>   <chr> <int>
 1   morty  1898 archer  4548  bojack   956   mabel   457    yeah   485
 2    rick  1691   lana  2800    yeah   704     hey   453     hey   318
 3   jerry   646   yeah  1478     hey   575      ha   416    mike   271
 4    yeah   484  cyril  1473   gonna   522    stan   369   sighs   262
 5   gonna   421 malory  1462    time   451  dipper   347      uh   189
 6  summer   409    pam  1300      uh   382   gonna   345  dustin   179
 7     hey   391    god   878      na   373    time   314   lucas   173
 8      uh   331   wait   846   diane   345    yeah   293   gonna   172
 9    time   319     uh   835    todd   339      uh   265   joyce   161
10    beth   301  gonna   748    love   309    guys   244     mom   157

Some words, such as "yeah", show up in the top ten of all five shows.

Now I'll combine the frequencies of all the shows and plot the top 50 words of each show to look for similarities with Rick and Morty:

tidy_others = bind_rows(mutate(archer_subs_tidy, show = "Archer"),
                        mutate(bojack_horseman_subs_tidy, show = "Bojack Horseman"),
                        mutate(gravity_falls_subs_tidy, show = "Gravity Falls"),
                        mutate(stranger_things_subs_tidy, show = "Stranger Things"))

frequency = tidy_others %>%
  mutate(word = str_extract(word, "[a-z]+")) %>%
  count(show, word) %>%
  rename(other = n) %>%
  inner_join(count(rick_and_morty_subs_tidy, word), by = "word") %>%
  rename(rick_and_morty = n) %>%
  # compute proportions within each show rather than over all shows pooled
  group_by(show) %>%
  mutate(other = other / sum(other),
         rick_and_morty = rick_and_morty / sum(rick_and_morty)) %>%
  ungroup()

frequency_top_50 = frequency %>% 
  group_by(show) %>% 
  arrange(-other,-rick_and_morty) %>% 
  filter(row_number() <= 50)

ggplot(frequency_top_50, aes(x = other, y = rick_and_morty, color = abs(rick_and_morty - other))) +
  geom_abline(color = "gray40") +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.4, height = 0.4) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_x_log10(labels = percent_format()) +
  scale_y_log10(labels = percent_format()) +
  scale_color_gradient(limits = c(0, 0.5), low = "darkslategray4", high = "gray75") +
  facet_wrap(~show, ncol = 4) +
  theme_minimal(base_size = 14) +
  theme(legend.position="none") +
  labs(title = "Comparing Word Frequencies",
       subtitle = "Word frequencies in Rick and Morty episodes versus other shows'",
       y = "Rick and Morty", x = NULL)

[Plot: Comparing Word Frequencies. Word frequencies in Rick and Morty episodes versus the other shows, one panel per show.]

Now the analysis becomes interesting. Archer is a show that is basically about annoying or seducing people, presented the way only good writers can, and Gravity Falls is about two kids who spend the summer with their great-uncle. Archer doesn't share as many words with Rick and Morty as Gravity Falls does. And while Gravity Falls uses "yeah" about as much as Rick and Morty, the "summer" they talk about is the season and not Morty's sister.

A difference like that is only noticeable if you have seen the shows, which suggests that we should also explore global measures of lexical variety such as mean word frequency and type-token ratios.
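As a rough sketch of what those two global measures could look like, here is a small helper that works on any one-word-per-row table. The toy data and the name lexical_variety are made up for illustration and are not part of the original analysis. I'm taking the type-token ratio to be distinct words divided by total words, and mean word frequency to be its reciprocal.

```r
library(dplyr)

# Hypothetical toy data standing in for a *_subs_tidy tibble (one word per row)
toy_tidy = tibble(word = c("morty", "rick", "morty", "yeah", "rick", "morty"))

# lexical_variety() is an illustrative helper, not a tidytext function
lexical_variety = function(tidy_df) {
  tidy_df %>%
    summarise(tokens = n(),               # total number of words
              types  = n_distinct(word)) %>%  # number of distinct words
    mutate(ttr            = types / tokens,   # type-token ratio
           mean_word_freq = tokens / types)   # average uses per distinct word
}

toy_variety = lexical_variety(toy_tidy)
```

With the real data the call would be something like lexical_variety(rick_and_morty_subs_tidy). Keep in mind that stop words were already removed from those tables, and that the type-token ratio tends to drop as a text gets longer, so shows with very different amounts of dialogue are not directly comparable.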

Before going ahead let’s quantify how similar and different these sets of word frequencies are using a correlation test. How correlated are the word frequencies between Rick and Morty and the other shows?

cor.test(data = filter(frequency, show == "Archer"), ~ other + rick_and_morty)
    Pearson's product-moment correlation

data:  other and rick_and_morty
t = 63.351, df = 4651, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6648556 0.6957166
sample estimates:
      cor 
0.6805879 
cor.test(data = filter(frequency, show == "Bojack Horseman"), ~ other + rick_and_morty)
    Pearson's product-moment correlation

data:  other and rick_and_morty
t = 34.09, df = 4053, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.4477803 0.4956335
sample estimates:
      cor 
0.4720545 
cor.test(data = filter(frequency, show == "Gravity Falls"), ~ other + rick_and_morty)
    Pearson's product-moment correlation

data:  other and rick_and_morty
t = 61.296, df = 3396, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7083772 0.7403234
sample estimates:
      cor 
0.7247395 
cor.test(data = filter(frequency, show == "Stranger Things"), ~ other + rick_and_morty)
    Pearson's product-moment correlation

data:  other and rick_and_morty
t = 22.169, df = 2278, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3868980 0.4544503
sample estimates:
      cor 
0.4212582 

The correlation tests suggest that, of the shows considered, Gravity Falls is the most similar to Rick and Morty (a correlation of about 0.72), followed by Archer (about 0.68).
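If you only need the correlation estimates and not the full test output, the four calls above could be condensed into one grouped summary. This is a minimal sketch on a made-up table that has the same columns as frequency. The toy values and the name cor_by_show are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical toy table with the same columns as `frequency`
toy_frequency = tibble(
  show           = rep(c("Show A", "Show B"), each = 4),
  word           = rep(c("yeah", "hey", "time", "gonna"), times = 2),
  other          = c(0.4, 0.3, 0.2, 0.1,  0.1, 0.2, 0.3, 0.4),
  rick_and_morty = c(0.4, 0.3, 0.2, 0.1,  0.4, 0.3, 0.2, 0.1)
)

# One Pearson correlation per show, sorted from most to least similar
cor_by_show = toy_frequency %>%
  group_by(show) %>%
  summarise(pearson_r = cor(other, rick_and_morty)) %>%
  arrange(desc(pearson_r))
```

Unlike cor.test(), cor() returns just the estimate, so the confidence intervals and p-values are lost. In exchange you get a single table that is easy to sort or plot.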

The end

My analysis is now complete, but the GitHub repo is open to anyone interested in using it for their own analysis. I mostly covered microanalysis, that is, the analysis of words as isolated units, and offered only rough bits of analysis beyond single words, which would deserve more and longer posts.

If you found this material useful, you may want to explore global measures. One option is to read Text Analysis with R for Students of Literature, which I reviewed some time ago.

Other interesting topics to explore are hapax richness and keywords in context, which belong to mesoanalysis, or even macroanalysis, with clustering, classification and topic modelling.
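To give a flavour of the first of those topics, here is a minimal sketch of hapax richness, which I'm taking to mean the share of all words that occur exactly once. The toy data and the helper name hapax_richness are made up for illustration, and other definitions of the measure exist.

```r
library(dplyr)

# Hypothetical toy data standing in for a *_subs_tidy tibble (one word per row)
toy_words = tibble(word = c("morty", "rick", "morty", "portal", "rick", "wubba"))

# hapax_richness() is an illustrative helper:
# words used exactly once, divided by the total number of words
hapax_richness = function(tidy_df) {
  word_counts = count(tidy_df, word)
  sum(word_counts$n == 1) / sum(word_counts$n)
}

toy_hapax = hapax_richness(toy_words)
```

Here "portal" and "wubba" are the only words used once, so two of the six words are hapaxes.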