Working With SPSS© Data in R

Introduction

I needed to import SPSS© data for work. There are several options, and I have used both the foreign and haven R packages. I prefer haven because it integrates better with the tidyverse. I started using it instead of foreign once I verified that it behaves well with factors and avoids the deprecated factor labels problem in newer R versions.

The Data

For this post I use the Diego Portales University National Survey. It is a publicly available survey, conducted nationwide since 2005, that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinion on same-sex marriage, restrictions on smoking areas, and more.

Importing Data

#devtools::install_github("ropenscilabs/skimr")

# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)
library(readr)

# Import foreign statistical formats
library(haven)

# Data
url = "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
sav = "2017-06-24_working_with_spss_data_in_r/udp_national_survey_2015.sav"

# Download only once; mode = "wb" keeps the binary .sav file intact on Windows
if (!file.exists(sav)) download.file(url, sav, mode = "wb")

survey = read_sav(sav)
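
To see what read_sav() returns without downloading the survey, here is a minimal sketch that writes a toy dataset to a temporary .sav file and reads it back. The toy values are made up for illustration; only the variable label and value labels mirror the ones in the survey.

```r
library(haven)
library(tibble)

# Toy data standing in for the survey (values are invented)
toy <- tibble(
  sex_id = labelled(
    c(1, 2, 2, 1),
    labels = c(Hombre = 1, Mujer = 2),
    label = "Sexo Entrevistado"
  ),
  age = c(34, 51, 28, 67)
)

# Round trip through a temporary SPSS file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
toy_back <- read_sav(tmp)

# The column keeps its numeric codes and carries the labels as attributes
is.labelled(toy_back$sex_id)
attr(toy_back$sex_id, "labels")
attr(toy_back$sex_id, "label")
```

The point is that haven does not convert labelled columns to factors on import. It keeps the codes and stores the labels as metadata, so I can decide later how to use them.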

Exploring data

Before exploring the data, keep in mind that the survey is in Spanish: "fecha" means "date", "edad" means "age", and "sexo" means "sex".

# How many surveys do I have by day?
daily = survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>% 
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) +
  geom_line()

[Figure: line plot of the number of surveys per day]

# How is the age distributed?
summary(survey$Edad_Entrevistado)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
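  18.00   32.00   48.00   47.92   61.00   89.00 

# Assign the result back to the column; an unnamed mutate() would add a new column instead
age = survey %>%
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>% 
  rename(age = Edad_Entrevistado) %>% 
  group_by(age) %>%
  summarise(n = n())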

ggplot(age, aes(age, n)) +
  geom_line()

[Figure: line plot of the number of respondents by age]

# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>% 
  group_by(sex_id) %>%
  summarise(n = n())
# A tibble: 2 x 2
     sex_id     n
  <dbl+lbl> <int>
1         1   651
2         2   651

Exploring labels

The last tibble gives no indication of what 1 and 2 mean. The value labels stored in the column answer that:

survey %>%
  select(Sexo_Entrevistado) %>% 
  rename(sex_id = Sexo_Entrevistado) %>% 
  distinct() %>% 
  mutate(sex = as_factor(sex_id))
# A tibble: 2 x 2
     sex_id    sex
  <dbl+lbl> <fctr>
1         2  Mujer
2         1 Hombre

The last column (in Spanish) shows that in this survey 1 is "Hombre" (male) and 2 is "Mujer" (female).

To get the counts with English labels, I could run:

survey %>%
  rename(sex = Sexo_Entrevistado) %>% 
  mutate(sex = as.integer(sex)) %>% 
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>% 
  group_by(sex) %>%
  summarise(n = n())
# A tibble: 2 x 2
     sex     n
   <chr> <int>
1 Female   651
2   Male   651
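
As an alternative to hard-coding the numeric codes, here is a minimal sketch that starts from the labels stored in the column and translates them with a lookup vector. The toy vector and the `translation` lookup are my own illustration, not part of the survey.

```r
library(haven)

# Toy labelled vector mirroring the coding of Sexo_Entrevistado (values invented)
sex_id <- labelled(
  c(1, 2, 2, 1, 2),
  labels = c(Hombre = 1, Mujer = 2),
  label = "Sexo Entrevistado"
)

# Step 1: turn the codes into their Spanish labels
sex_es <- as_factor(sex_id)

# Step 2: translate the labels with a named lookup vector (hypothetical helper)
translation <- c(Hombre = "Male", Mujer = "Female")
sex_en <- factor(
  unname(translation[as.character(sex_es)]),
  levels = unname(translation)
)

table(sex_en)
```

This way the mapping from code to label comes from the file itself, and only the Spanish to English translation is written by hand.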

The columns carry variable labels as well. Here sjlabelled helps if I want to know, for example, what "P12" means. But instead of just translating labels one at a time, I'll describe the complete dataset.
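
For reference, this is roughly how looking up a single variable label works. It is a minimal sketch on a toy tibble: the values are made up, and only the label texts come from the survey.

```r
library(sjlabelled)
library(tibble)

# Toy tibble with variable labels attached (values are invented)
toy <- tibble(
  sex_id = haven::labelled(
    c(1, 2, 1),
    labels = c(Hombre = 1, Mujer = 2),
    label = "Sexo Entrevistado"
  ),
  P12 = haven::labelled(c(1, 2, 1), label = "Apoyo a la democracia")
)

# Label of a single variable
get_label(toy$P12)

# Labels of all variables, as a named character vector
get_label(toy)
```

On the real data the same call would be get_label(survey$P12), assuming the column is named P12 as in the description table.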

Describing the dataset

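# NOTE: this relies on the long-format skim() output (columns var, type, stat,
# level, value) of the skimr development version installed above; other skimr
# releases may return a different layout

# Share of valid (non-missing) replies per variable, with its label
valid_replies = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="complete") %>% 
  mutate(description = get_label(survey)) %>% 
  select(var,description,everything()) %>% 
  select(-c(stat,level,type)) %>% 
  rename(pcent_valid = value) %>% 
  mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))

# Text histogram per variable
histograms = survey %>% 
  mutate_if(is.labelled,as.numeric) %>% 
  skim() %>%
  filter(stat=="hist") %>% 
  select(var,level) %>% 
  rename(histogram = level)

# Combine both tables and save the result
survey_description = valid_replies %>% 
  left_join(histograms, by = "var") %>% 
  write_csv("2017-06-24_working_with_spss_data_in_r/survey_description.csv")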

survey_description
# A tibble: 203 x 4
                 var          description pcent_valid  histogram
               <chr>                <chr>       <chr>      <chr>
 1        PONDERADOR           Ponderador        100% ▂▇▇▅▅▃▁▁▁▁
 2             Folio                Folio        100% ▇▇▇▇▇▇▇▇▇▇
 3            Región               Región        100% ▁▁▂▁▂▁▁▁▇▁
 4            Comuna               Comuna        100% ▁▁▂▁▁▂▁▁▇▁
 5             Fecha     Fecha entrevista        100%       <NA>
 6  Sexo_Encuestador   Sexo Entrevistador         91% ▂▁▁▁▁▁▁▁▁▇
 7               GSE           GSE Visual        100% ▁▁▂▁▇▁▁▆▁▁
 8 Sexo_Entrevistado    Sexo Entrevistado        100% ▇▁▁▁▁▁▁▁▁▇
 9 Edad_Entrevistado    Edad Entrevistado        100% ▇▆▅▆▇▇▅▃▃▂
10       Hora_Inicio Hora Inicio Medición        100%       <NA>
# ... with 193 more rows
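
The var, description and pcent_valid columns of the table above can also be computed without depending on the layout of the skim() output. This is a minimal sketch on toy data, not the code that produced the table: the values are made up and the histogram column is left out.

```r
library(haven)
library(tibble)

# Toy data with missing values (values are invented)
toy <- tibble(
  sex_id = labelled(
    c(1, 2, NA, 1),
    labels = c(Hombre = 1, Mujer = 2),
    label = "Sexo Entrevistado"
  ),
  age = labelled(c(34, NA, NA, 67), label = "Edad Entrevistado")
)

description <- tibble(
  var = names(toy),
  # Variable label, or NA when a column has none
  description = vapply(toy, function(x) {
    lbl <- attr(x, "label", exact = TRUE)
    if (is.null(lbl)) NA_character_ else lbl
  }, character(1), USE.NAMES = FALSE),
  # Share of non-missing replies, formatted as a percentage
  pcent_valid = paste0(
    round(100 * vapply(toy, function(x) mean(!is.na(x)), numeric(1),
                       USE.NAMES = FALSE)),
    "%"
  )
)

description
```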

Exploring the last tibble reveals some interesting questions. For example, P12 refers to "Apoyo a la democracia", that is, "Do you support democracy?".