Stack Overflow Developer Survey Results 2019 and Bipartite Networks With R

Dataset

The Developer Survey Results are available here. The question I want to analyze is: “Which of the following programming, scripting, and markup languages have you done extensive development work in over the past year, and which do you want to work in over the next year? (If you both worked with the language and want to continue to do so, please check both boxes in that row.)”

Bipartite Networks

An excellent resource to study networks, with R and other tools, is Katherine Ognyanova’s blog.

An example of a bipartite network is the Product Space, a network that clusters similar products based on the countries that export those products.

Following a similar idea, I can connect respondents to the programming/scripting/markup languages they use and create a network of similar programming languages.
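To make the idea concrete, here is a minimal base R sketch of how a respondent-language table turns into a language-language network. The respondents (r1 to r4) and the three languages are made up for illustration; they are not taken from the survey.

```r
# Toy incidence matrix: rows are hypothetical respondents, columns are languages.
# A 1 means the respondent uses the language.
m <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("r", 1:4), c("R", "Python", "SQL"))
)

# One-mode projection onto languages: entry [i, j] counts the respondents
# who use both language i and language j.
co_use <- crossprod(m)  # same as t(m) %*% m

# The diagonal holds the number of users per language.
diag(co_use)

# Off-diagonal entries are the raw edge weights of the language network.
co_use["R", "Python"]
```

Two languages end up connected when the same respondents use both, which is the same logic the Product Space applies to products and exporting countries.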

Visualizing Networks With R

Data Wrangling

The first step is to read and re-arrange the data:

library(tidyverse)
library(janitor)

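# Read the public survey file and keep the respondent id plus the two language questions
survey_results <- read_csv(
  "~/Downloads/developer_survey_2019/survey_results_public.csv") %>% 
  clean_names() %>% 
  select(respondent, language_worked_with, language_desire_next_year) %>% 
  # One row per respondent and question
  gather(category, answer, -respondent) %>% 
  # Answers are ";"-separated; split them into up to 28 columns (unused ones become NA)
  separate(answer, into = paste0("lang", 1:28), sep = ";") %>% 
  # One row per respondent, question and language
  gather(lang_aux, language, -respondent, -category) %>% 
  select(-lang_aux) %>% 
  drop_na() %>% 
  # Collapse every "Other..." answer into a single label
  mutate(language = str_replace_all(language, "Other.*", "Other(s)"))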

# Respondents who worked with each language, and each language's share of all mentions
users_per_language_worked_with <- survey_results %>% 
  filter(category == "language_worked_with") %>% 
  group_by(language) %>% 
  summarise(n = n()) %>% 
  mutate(share = n / sum(n))

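# x = 1 when a respondent both worked with a language and wants to keep using it
# (the pair appears under both questions, so n == 2); otherwise x = 0
binary_relation <- survey_results %>% 
  group_by(respondent, language) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(x = ifelse(n == 2, 1, 0)) %>% 
  select(-n)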
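The rule behind x is easier to see on a tiny example. The following base R sketch uses a made-up data frame called toy with the same three columns as survey_results, and counts how many of the two questions mention each respondent-language pair.

```r
# Hypothetical long-format answers, mimicking the shape of survey_results.
toy <- data.frame(
  respondent = c(1, 1, 1, 2, 2),
  category = c("language_worked_with", "language_desire_next_year",
               "language_worked_with", "language_worked_with",
               "language_desire_next_year"),
  language = c("R", "R", "SQL", "Python", "Go"),
  stringsAsFactors = FALSE
)

# Count how many questions mention each respondent-language pair (0, 1 or 2).
n <- table(respondent = toy$respondent, language = toy$language)

# x = 1 only when the pair appears under both questions.
x <- (n == 2) * 1L
x
```

Only respondent 1 with R gets a 1 here, because that is the only pair that appears under both questions.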

Network

economiccomplexity functions are designed for country-product relations, so here the “countries” are respondents and the “products” are languages.

Let’s explore the programming language complexity index. Here “complexity” is not related to difficulty but to specialization. A higher index value means that the language has a small group of users and/or is used for specific purposes.

library(economiccomplexity)

rca <- ec_rca(
  binary_relation, "respondent", "language", "x"
)

com <- ec_complexity_measures(rca, tbl = TRUE)

names(com$complexity_index_p) <- c("language","complexity_index")
com$complexity_index_p
# A tibble: 28 x 2
   language    complexity_index
   <chr>                  <dbl>
 1 Erlang           27.6       
 2 F#                0.272     
 3 WebAssembly       0.0729    
 4 Elixir            0.0480    
 5 Clojure           0.0102    
 6 Dart              0.000784  
 7 VBA               0.000288  
 8 Objective-C       0.000178  
 9 Scala             0.00000487
10 Assembly          0.00000480
# … with 18 more rows
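As rough intuition for what feeds an index like this, the sketch below computes diversity, ubiquity and the first step of the method of reflections on a made-up matrix in base R. This is only an illustration of those building blocks; I am not claiming it reproduces what ec_complexity_measures() does internally, and the respondents and languages A, B and C are hypothetical.

```r
# Toy binary matrix: rows are hypothetical respondents, columns are languages.
m <- matrix(
  c(1, 1, 1,
    1, 1, 0,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("r1", "r2", "r3"), c("A", "B", "C"))
)

# Diversity: how many languages each respondent uses.
diversity <- rowSums(m)

# Ubiquity: how many respondents use each language.
ubiquity <- colSums(m)

# First reflection for languages: the average diversity of each language's users.
avg_user_diversity <- as.vector(crossprod(m, diversity)) / ubiquity
avg_user_diversity
```

In this toy case language C has the lowest ubiquity and the most diversified users, which is the kind of pattern that pushes a language toward the specialized end.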

At this point I shall apply a trick. ec_proximity() can compute both a language-language relation and a respondent-respondent relation, and the two are computed independently inside the function, so I shall use the compute parameter to request only the language-language one. This avoids a very large computation (75,816 respondents) that I won’t use.

pro <- ec_proximity(rca, u = com$ubiquity, compute = "product", tbl = TRUE)
pro$proximity_p
# A tibble: 378 x 3
   from                  to        value
   <chr>                 <chr>     <dbl>
 1 Bash/Shell/PowerShell Assembly 0.0532
 2 C                     Assembly 0.168 
 3 C#                    Assembly 0.0252
 4 C++                   Assembly 0.101 
 5 Clojure               Assembly 0.0269
 6 Dart                  Assembly 0.0293
 7 Elixir                Assembly 0.0236
 8 Erlang                Assembly 0.0260
 9 F#                    Assembly 0.0269
10 Go                    Assembly 0.0403
# … with 368 more rows
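To make the proximity values less opaque, here is a minimal base R sketch of the usual Product Space proximity: the number of shared users divided by the larger of the two ubiquities. I am assuming ec_proximity() follows this definition; the toy matrix and its names are made up.

```r
# Toy binary matrix: rows are hypothetical respondents, columns are languages.
m <- matrix(
  c(1, 1, 0,
    1, 1, 0,
    0, 1, 1,
    0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("r", 1:4), c("R", "Python", "SQL"))
)

# Ubiquity: number of respondents per language.
ubiquity <- colSums(m)

# Shared users for every pair of languages.
co_use <- crossprod(m)

# Proximity: shared users divided by the larger ubiquity of the pair.
proximity <- co_use / outer(ubiquity, ubiquity, pmax)

# A language's proximity to itself is not informative, so set it to zero.
diag(proximity) <- 0
proximity
```

Dividing by the larger ubiquity keeps the measure symmetric and bounded between 0 and 1, so a very popular language does not look close to everything just because it has many users.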

Finally I can create the network.

library(igraph)
library(ggraph)

set.seed(1724)

net <- ec_networks(
  pc = NULL,
  pp = pro$proximity_p,
  cutoff_p = 0.2,
  tbl = TRUE,
  compute = "product"
)

share <- 100 * users_per_language_worked_with$share
names(share) <- users_per_language_worked_with$language

g <- net$network_p %>% 
  rename(proximity = value) %>% 
  graph_from_data_frame(directed = FALSE)

g %>% 
  ggraph(layout = "kk") +
  geom_edge_link(aes(edge_alpha = proximity, edge_width = proximity), 
                 edge_colour = "#a8a8a8") +
  geom_node_point(colour = "darkslategray4",
                  size = share[match(V(g)$name,names(share))]) +
  geom_node_text(aes(label = name), vjust = 2.2) +
  ggtitle("Stackoverflow Developer Survey Languages Connection") +
  theme_void()
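The node size in the plot relies on match() to line up the share vector with the order of the vertices in the graph. A small base R sketch of that idiom, with made-up shares and a hypothetical vertex_names vector standing in for V(g)$name:

```r
# Hypothetical shares, named by language.
share <- c(Python = 20, R = 5, SQL = 15)

# Hypothetical vertex order, as it might come out of the graph.
vertex_names <- c("SQL", "Python")

# match() returns the position of each vertex name inside names(share),
# so the sizes come back in vertex order rather than in share order.
node_size <- share[match(vertex_names, names(share))]
node_size
```

A vertex whose name is missing from share would get NA as its size, so it is worth checking that every language in the network also appears in users_per_language_worked_with.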