Working With SPSS Data in R

Updated 2018-03-26

Introduction

I needed to import SPSS data for work. R offers several options, and I have used both the foreign and haven packages. I prefer haven because it integrates better with the tidyverse. I switched from foreign once I verified that haven handles factors well and avoids the deprecated factor labels problem in newer R versions.

The Data

For this post I use the Diego Portales University National Survey. It is a publicly available, nationwide survey conducted since 2005 that asks people about their trust in institutions (e.g. government, police, firefighters) and their opinions on same-sex marriage, restrictions on smoking areas, and more.

Importing Data

#devtools::install_github("ropenscilabs/skimr")

# Exploratory Data Analysis tools
library(ggplot2)
library(dplyr)
library(sjlabelled)
library(skimr)

# Read different formats
library(readr) # csv/tsv/txt
library(haven) # sav

# Data
url <- "http://encuesta.udp.cl/descargas/banco%20de%20datos/2015/Encuesta%20Nacional%20UDP%202015.sav"
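# Create the data folder (the same one used for the file path below)
data_dir <- "../../data/2017-06-24-working-with-spss-data-in-r"
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)

sav <- file.path(data_dir, "udp_national_survey_2015.sav")

# mode = "wb" keeps the binary .sav file intact on Windows
if (!file.exists(sav)) {download.file(url, sav, mode = "wb")}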

survey <- read_sav(sav)
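
To see what read_sav() returns without downloading the survey, here is a minimal sketch that writes a toy dataset to a temporary .sav file and reads it back. The column names, values and labels are invented for illustration and are not taken from the survey file.

```r
library(haven)

# Toy data standing in for the survey (values and labels are invented)
toy <- tibble::tibble(
  sexo = labelled(c(1, 2, 2),
                  labels = c(Hombre = 1, Mujer = 2),
                  label = "Sexo entrevistado"),
  edad = c(34, 51, 27)
)

# Round trip through a temporary SPSS file
tmp <- tempfile(fileext = ".sav")
write_sav(toy, tmp)
back <- read_sav(tmp)

# Coded columns come back as labelled vectors, not as factors
class(back$sexo)
attr(back$sexo, "labels") # value labels
attr(back$sexo, "label")  # variable label
```

The point of the sketch is that haven keeps the numeric codes and stores the labels as attributes, so nothing is lost at import time.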

Exploring data

To explore the data, keep in mind that the survey is in Spanish. So “fecha” means date, “edad” means age, and “sexo” means sex.

# How many surveys do I have by day?
daily <- survey %>%
  mutate(Fecha = as.Date(Fecha, "%d-%m-%Y")) %>%
  rename(date = Fecha) %>% 
  group_by(date) %>%
  summarise(n = n())

ggplot(daily, aes(date, n)) +
  geom_line()

# How is the age distributed?
summary(survey$Edad_Entrevistado)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  18.00   32.00   48.00   47.92   61.00   89.00 
age <- survey %>%
  mutate(Edad_Entrevistado = as.integer(Edad_Entrevistado)) %>% 
  rename(age = Edad_Entrevistado) %>% 
  group_by(age) %>%
  summarise(n = n())

ggplot(age, aes(age, n)) +
  geom_line()

# How is the sex distributed?
survey %>%
  rename(sex_id = Sexo_Entrevistado) %>% 
  group_by(sex_id) %>%
  summarise(n = n())
# A tibble: 2 x 2
  sex_id        n
  <dbl+lbl> <int>
1 1           651
2 2           651

Exploring labels

The last tibble gives no indication of what 1 and 2 mean.

survey %>%
  select(Sexo_Entrevistado) %>% 
  rename(sex_id = Sexo_Entrevistado) %>% 
  distinct() %>% 
  mutate(sex = as_factor(sex_id))
# A tibble: 2 x 2
  sex_id    sex   
  <dbl+lbl> <fct> 
1 2         Mujer 
2 1         Hombre

The last column (in Spanish) shows that in this survey 1 means male (“Hombre”) and 2 means female (“Mujer”).

Knowing that, I could recode the values by hand:

survey %>%
  rename(sex = Sexo_Entrevistado) %>% 
  mutate(sex = as.integer(sex)) %>% 
  mutate(sex = recode(sex, `1` = "Male", `2` = "Female")) %>% 
  group_by(sex) %>%
  summarise(n = n())
# A tibble: 2 x 2
  sex        n
  <chr>  <int>
1 Female   651
2 Male     651
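
An alternative to recoding the integers by hand is to convert the labelled vector to a factor and translate its levels. This is a minimal sketch on invented values that follow the survey's coding (1 = Hombre, 2 = Mujer); the translation vector is my own helper, not part of haven.

```r
library(haven)

# Hypothetical coded answers using the survey's coding
sex_id <- labelled(c(1, 2, 2, 1), labels = c(Hombre = 1, Mujer = 2))

# as_factor() replaces each code with its value label
sex_es <- as_factor(sex_id)

# Hypothetical lookup table from Spanish labels to English ones
translation <- c(Hombre = "Male", Mujer = "Female")

# Translate the labels and fix the level order explicitly
sex_en <- factor(unname(translation[as.character(sex_es)]),
                 levels = c("Male", "Female"))

table(sex_en)
```

Working from the labels instead of the codes means the translation still holds if another variable uses a different numeric coding.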

The columns carry descriptive labels as well. Here sjlabelled helps if I want to know, for example, what “P12” means. But instead of just looking up individual labels, I’ll describe the complete dataset.
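
Looking up a single variable label could be sketched as follows. The toy data frame is invented; only the column names Folio and P12 and their labels match the survey, and the values are placeholders.

```r
library(sjlabelled)

# Toy data frame with variable labels stored in the "label" attribute,
# which is where haven puts them when reading a .sav file
toy <- data.frame(Folio = c(1, 2, 3), P12 = c(1, 2, 1))
attr(toy$Folio, "label") <- "Folio"
attr(toy$P12, "label") <- "Apoyo a la democracia"

# get_label() returns a named character vector, one entry per column
var_labels <- get_label(toy)

# Look up a single variable by name
var_labels[["P12"]]
```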

Describing the dataset

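survey %>% 
  skim() %>%
  # keep one row per variable: the count of complete (non-missing) values
  filter(stat == "complete") %>% 
  # attach the variable labels (assumes the same order as the columns of survey)
  mutate(description = get_label(survey)) %>% 
  rename(pcent_valid = value) %>% 
  # express the complete count as a percentage of all rows
  mutate(pcent_valid = paste0(100*round(pcent_valid / nrow(survey),2),'%'))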
# A tibble: 203 x 7
   variable          type   stat  level pcent_valid formatted description 
   <chr>             <chr>  <chr> <chr> <chr>       <chr>     <chr>       
 1 PONDERADOR        numer… comp… .all  100%        1302      Ponderador  
 2 Folio             numer… comp… .all  100%        1302      Folio       
 3 Región            chara… comp… .all  100%        1302      Región      
 4 Comuna            chara… comp… .all  100%        1302      Comuna      
 5 Fecha             chara… comp… .all  100%        1302      Fecha entre…
 6 Sexo_Encuestador  chara… comp… .all  91%         1186      Sexo Entrev…
 7 GSE               chara… comp… .all  100%        1302      GSE Visual  
 8 Sexo_Entrevistado chara… comp… .all  100%        1302      Sexo Entrev…
 9 Edad_Entrevistado numer… comp… .all  100%        1302      Edad Entrev…
10 Hora_Inicio       chara… comp… .all  100%        1302      Hora Inicio…
# ... with 193 more rows
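
The skim() pipeline above depends on the columns that the installed skimr version returns, which may differ between releases. As a dependency-light sketch of the same idea, the table of labels and valid percentages can be built with base R alone. The toy data frame and the helper get_var_label() are invented for illustration.

```r
# Toy stand-in for the survey (values are invented)
toy <- data.frame(Folio = c(1, 2, 3, 4), P12 = c(1, NA, 2, 1))
attr(toy$Folio, "label") <- "Folio"
attr(toy$P12, "label") <- "Apoyo a la democracia"

# Hypothetical helper: the variable label, or NA when a column has none
get_var_label <- function(x) {
  lab <- attr(x, "label")
  if (is.null(lab)) NA_character_ else lab
}

description <- data.frame(
  variable = names(toy),
  type = vapply(toy, function(x) class(x)[1], character(1)),
  # share of non-missing values per column, as a percentage string
  pcent_valid = paste0(round(100 * colMeans(!is.na(toy))), "%"),
  description = vapply(toy, get_var_label, character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

description
```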

The last tibble reveals some interesting questions. For example, P12 refers to “Apoyo a la democracia”, that is, “Do you support democracy?”.