Structural Equation Modelling in R (Part 1)

Brief explanation

Structural Equation Modelling (SEM) is a state of art methodology and fulfills much of broader discusion about statistical modeling, and allows to make inference and causal analysis. SEM models are regression models broadly used in Marketing, Human Resources, Biostatistics and Medicine, revealing their flexibility as analytical tool.

Despite being a state-of-the-art methodology, SEM does not create new developments of statistical theory, since it consists in a combination of multivariate analysis, factor analysis and regression analysis. SEM makes it possible to validate or reject one or more hypothesis about an existing relation between different variables, to estimate simultaneously dependency relations between variables and estimate measurement error in model's variables. SEM may be understood as a net of unidirectional or bidirectional paths linking different variables. Such net can be validated using multivariate data.

Confirmatory Factor Analysis

A usual methodology for model evaluation is Confirmatory Factor Analysis (CFA) that is a particular case of SEM. It is a process which consists in specifying quantity and kinds of observed variables to one or more latent variables and analyze how well those variables measure the latent variable itself. Think of a latent variable as an artificial variable that is represented as a linear combination of observed variables.

Holzinger and Swineford (1939) proposed one the most famous CFA models. In their work they analysed the mental ability based test scores of children from two different schools (details here)

We can represent Holzinger and Swineford's model in a diagram:

cfa_example.svg

In this model, \(\newcommand{\R}{\mathbb{R}} x_i\) are exogenous variables that record scores and \(y_i\) are latent (and endogenous) variables that represent visual ability, textual ability and speed ability respectively, double-headed arrows represent an association between \(y_1\), \(y_2\), and \(y_3\), single-headed arrows represent direct effects, and \(\delta_i\) represent error terms.

A factor loading is a measurable effect which reflects the incidence of an observed variable over another either observed or latent variable. Interpreting factor loadings is equivalent to interpreting direct effects in a linear econometric model. In this example factor loadings are well defined as there are no missing relations between variables, is known which observed variables have an effect over defined latent variables, and both latent variables covariate so there is a relation between latent variables.

Model evaluation

When I wrote my thesis I had to study SEM's goodness of fit and its indicators. The interested reader may visit my Github repo to read more about some important linear algebra aspects of SEM but here I present a table that synthesizes 50% of my thesis:

Indicator Type Classification Direction Ideal values Normed (0-1)
Chi-squared significance adjusted more is better \(>2df\) no
RMSEA significance adjusted less is better \(\leq 0,05\) no
SMSR significance unadjusted less is better \(<0,1\) no
GFI significance unadjusted more is better \(>0,9\) yes

This table is useful for a cheatsheet and to keep in mind what to look when comparing models.

Using Lavaan to obtain parameters estimates

I highly recommend using lavaan. I have used AMOS and Stata but I guess R's flexibility is unbeatable at the moment. Lavaan also provides HolzingerSwineford1939 dataset that contains the complete recordings made by Holzinger and Swineford.

First I load the needed packages to perform the analysis:

# Needed packages to perform the analysis
load <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
} 

packages <- c("lavaan","semPlot","semTools","nonnest2","htmlTable")
load(packages)

Then I fit the model using the cfa() function:

# 1.2. Specify the model
HS.model <- ' visual  =~ x1 + x2 + x3      
              textual =~ x4 + x5 + x6
              speed   =~ x7 + x8 + x9 '

# 1.3. Fit the model
fit <- cfa(HS.model, data = HolzingerSwineford1939)

Then I obtain a model summary, confidence intervals and goodness of fit indicators:

# 2.1. Display summary output
summary(fit, fit.measures=FALSE)
lavaan (0.5-23.1097) converged normally after  35 iterations

  Number of observations                           301

  Estimator                                         ML
  Minimum Function Test Statistic               85.306
  Degrees of freedom                                24
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Information                                 Expected
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  visual =~                                           
    x1                1.000                           
    x2                0.554    0.100    5.554    0.000
    x3                0.729    0.109    6.685    0.000
  textual =~                                          
    x4                1.000                           
    x5                1.113    0.065   17.014    0.000
    x6                0.926    0.055   16.703    0.000
  speed =~                                            
    x7                1.000                           
    x8                1.180    0.165    7.152    0.000
    x9                1.082    0.151    7.155    0.000

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  visual ~~                                           
    textual           0.408    0.074    5.552    0.000
    speed             0.262    0.056    4.660    0.000
  textual ~~                                          
    speed             0.173    0.049    3.518    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
   .x1                0.549    0.114    4.833    0.000
   .x2                1.134    0.102   11.146    0.000
   .x3                0.844    0.091    9.317    0.000
   .x4                0.371    0.048    7.779    0.000
   .x5                0.446    0.058    7.642    0.000
   .x6                0.356    0.043    8.277    0.000
   .x7                0.799    0.081    9.823    0.000
   .x8                0.488    0.074    6.573    0.000
   .x9                0.566    0.071    8.003    0.000
    visual            0.809    0.145    5.564    0.000
    textual           0.979    0.112    8.737    0.000
    speed             0.384    0.086    4.451    0.000
# 2.2. Obtain confidence intervals for the estimated coefficients
parameterEstimates(fit, ci = TRUE, level = 0.95)
       lhs op     rhs   est    se      z pvalue ci.lower ci.upper
1   visual =~      x1 1.000 0.000     NA     NA    1.000    1.000
2   visual =~      x2 0.554 0.100  5.554      0    0.358    0.749
3   visual =~      x3 0.729 0.109  6.685      0    0.516    0.943
4  textual =~      x4 1.000 0.000     NA     NA    1.000    1.000
5  textual =~      x5 1.113 0.065 17.014      0    0.985    1.241
6  textual =~      x6 0.926 0.055 16.703      0    0.817    1.035
7    speed =~      x7 1.000 0.000     NA     NA    1.000    1.000
8    speed =~      x8 1.180 0.165  7.152      0    0.857    1.503
9    speed =~      x9 1.082 0.151  7.155      0    0.785    1.378
10      x1 ~~      x1 0.549 0.114  4.833      0    0.326    0.772
11      x2 ~~      x2 1.134 0.102 11.146      0    0.934    1.333
12      x3 ~~      x3 0.844 0.091  9.317      0    0.667    1.022
13      x4 ~~      x4 0.371 0.048  7.779      0    0.278    0.465
14      x5 ~~      x5 0.446 0.058  7.642      0    0.332    0.561
15      x6 ~~      x6 0.356 0.043  8.277      0    0.272    0.441
16      x7 ~~      x7 0.799 0.081  9.823      0    0.640    0.959
17      x8 ~~      x8 0.488 0.074  6.573      0    0.342    0.633
18      x9 ~~      x9 0.566 0.071  8.003      0    0.427    0.705
19  visual ~~  visual 0.809 0.145  5.564      0    0.524    1.094
20 textual ~~ textual 0.979 0.112  8.737      0    0.760    1.199
21   speed ~~   speed 0.384 0.086  4.451      0    0.215    0.553
22  visual ~~ textual 0.408 0.074  5.552      0    0.264    0.552
23  visual ~~   speed 0.262 0.056  4.660      0    0.152    0.373
24 textual ~~   speed 0.173 0.049  3.518      0    0.077    0.270
# 2.3 Obtain goodness of fit indicators of the model
fitMeasures(fit, c("chisq", "rmsea", "srmr", "gfi", "ecvi"))
 chisq  rmsea   srmr    gfi   ecvi 
85.306  0.092  0.065  0.943  0.423 
reliability(fit)
          visual   textual     speed     total
alpha  0.6261171 0.8827069 0.6884550 0.7604886
omega  0.6253180 0.8851754 0.6877600 0.8453351
omega2 0.6253180 0.8851754 0.6877600 0.8453351
omega3 0.6120052 0.8850608 0.6858417 0.8596204
avevar 0.3705589 0.7210163 0.4244883 0.5145874

Exporting lavaan output to tables

Lavaan produces a lot of potential pieces of output, but I suggest that at minimum that you export the above model summary, confidence intervals and goodness of fit indicators as part of your final tables:

htmlTable(txtRound(parameterEstimates(fit, ci = TRUE, level = 0.95), digits=3, excl.cols=c(1,3)), align="l|r|r|r|r|r|r|r|r|r")
lhs op rhs est se z pvalue ci.lower ci.upper
1 visual =~ x1 1.000 0.000 1.000 1.000
2 visual =~ x2 0.554 0.100 5.554 2.798 0.358 0.749
3 visual =~ x3 0.729 0.109 6.685 2.313 0.516 0.943
4 textual =~ x4 1.000 0.000 1.000 1.000
5 textual =~ x5 1.113 0.065 17.014 0.000 0.985 1.241
6 textual =~ x6 0.926 0.055 16.703 0.000 0.817 1.035
7 speed =~ x7 1.000 0.000 1.000 1.000
8 speed =~ x8 1.180 0.165 7.152 8.564 0.857 1.503
9 speed =~ x9 1.082 0.151 7.155 8.398 0.785 1.378
10 x1 ~~ x1 0.549 0.114 4.833 1.344 0.326 0.772
11 x2 ~~ x2 1.134 0.102 11.146 0.000 0.934 1.333
12 x3 ~~ x3 0.844 0.091 9.317 0.000 0.667 1.022
13 x4 ~~ x4 0.371 0.048 7.779 7.327 0.278 0.465
14 x5 ~~ x5 0.446 0.058 7.642 2.132 0.332 0.561
15 x6 ~~ x6 0.356 0.043 8.277 2.220 0.272 0.441
16 x7 ~~ x7 0.799 0.081 9.823 0.000 0.640 0.959
17 x8 ~~ x8 0.488 0.074 6.573 4.923 0.342 0.633
18 x9 ~~ x9 0.566 0.071 8.003 1.110 0.427 0.705
19 visual ~~ visual 0.809 0.145 5.564 2.640 0.524 1.094
20 textual ~~ textual 0.979 0.112 8.737 0.000 0.760 1.199
21 speed ~~ speed 0.384 0.086 4.451 8.533 0.215 0.553
22 visual ~~ textual 0.408 0.074 5.552 2.818 0.264 0.552
23 visual ~~ speed 0.262 0.056 4.660 3.168 0.152 0.373
24 textual ~~ speed 0.173 0.049 3.518 4.346 0.077 0.270
htmlTable(txtRound(reliability(fit),3), align="l|r|r|r|r")
visual textual speed total
alpha 0.626 0.883 0.688 0.760
omega 0.625 0.885 0.688 0.845
omega2 0.625 0.885 0.688 0.845
omega3 0.612 0.885 0.686 0.860
avevar 0.371 0.721 0.424 0.515

There are other packages to convert R output to tables such as knitr and xtables that can also export to \(\rm\LaTeX\).

Creating a diagram for the model in R

Now let's get on to producing the exciting part of an SEM analysis - the model diagram. This code might not be good-looking but the result is!

semPaths(fit, style="lisrel", 
        whatLabels = "std", edge.label.cex = .6, node.label.cex = .6, 
        label.prop=0.9, edge.label.color = "black", rotation = 4, 
        equalizeManifests = FALSE, optimizeLatRes = TRUE, node.width = 1.5, 
        edge.width = 0.5, shapeMan = "rectangle", shapeLat = "ellipse", 
        shapeInt = "triangle", sizeMan = 4, sizeInt = 2, sizeLat = 4, 
        curve=2, unCol = "#070b8c")

plot of chunk sem_part_1_cfa_4

If you are running a script instead of a R Markdoen document, you can save the plot as svg with these lines:

svg("../../images/2016-12-07_sem_r_part_1/cfa_example_2.svg")
semPaths(fit, style="lisrel", 
        whatLabels = "std", edge.label.cex = .6, node.label.cex = .6, 
        label.prop=0.9, edge.label.color = "black", rotation = 4, 
        equalizeManifests = FALSE, optimizeLatRes = TRUE, node.width = 1.5, 
        edge.width = 0.5, shapeMan = "rectangle", shapeLat = "ellipse", 
        shapeInt = "triangle", sizeMan = 4, sizeInt = 2, sizeLat = 4, 
        curve=2, unCol = "#070b8c")
#dev.off() # uncomment if you run a script but there's no need to use this in R Markdown

Appendix

If you are curious about the first diagram, I made it with TikZ. This is the code

\documentclass[border=3mm]{standalone}
\usepackage[applemac]{inputenc}
\usepackage[protrusion=true,expansion=true]{microtype}
\usepackage[bb=lucida,bbscaled=1,cal=boondoxo]{mathalfa}
\usepackage[stdmathitalics=true,math-style=iso,lucidasmallscale=true,romanfamily=bright]{lucimatx}
\usepackage{tikz}
\usetikzlibrary{arrows,positioning}
\newcommand{\at}{\makeatletter @\makeatother}
\begin{document}
\begin{tikzpicture}[auto,node distance=.5cm,
    latent/.style={fill=red!20,circle,draw, thick,inner sep=0pt,minimum size=8mm,align=center},
    observed/.style={fill=blue!20,rectangle,draw, thick,inner sep=0pt,minimum width=8mm,minimum height=8mm,align=center},
    error/.style={fill=yellow!20,circle,draw, thick,inner sep=0pt,minimum width=8mm,minimum height=8mm,align=center},
    paths/.style={->,  thick, >=stealth'},
    paths2/.style={<-,  thick, >=stealth'},
    twopaths/.style={<->,  thick, >=stealth'}
]

%Define observed variables
\node [latent] (y1) at (0,0) {$y_1$};
\node [latent] (y2) [below=3cm of y1]  {$y_2$};
\node [latent] (y3) [below=3cm of y2]  {$y_3$};

%%%

\node [observed] (x2) [left=1.8cm of y1]  {$x_2$};
\node [observed] (x1) [above=of x2]  {$x_1$};
\node [observed] (x3) [below=of x2]  {$x_3$};

\node [error] (ex1) [left=0.5cm of x1]  {$\delta_1$};
\node [error] (ex2) [left=0.5cm of x2]  {$\delta_2$};
\node [error] (ex3) [left=0.5cm of x3]  {$\delta_3$};

%%%

\node [observed] (x5) [left=1.8cm of y2]  {$x_5$};
\node [observed] (x4) [above=of x5]  {$x_4$};
\node [observed] (x6) [below=of x5]  {$x_6$};

\node [error] (ex4) [left=0.5cm of x4]  {$\delta_4$};
\node [error] (ex5) [left=0.5cm of x5]  {$\delta_5$};
\node [error] (ex6) [left=0.5cm of x6]  {$\delta_6$};

%%%

\node [observed] (x8) [left=1.8cm of y3]  {$x_8$};
\node [observed] (x7) [above=of x8]  {$x_7$};
\node [observed] (x9) [below=of x8]  {$x_9$};

\node [error] (ex7) [left=0.5cm of x7]  {$\delta_7$};
\node [error] (ex8) [left=0.5cm of x8]  {$\delta_8$};
\node [error] (ex9) [left=0.5cm of x9]  {$\delta_9$};

% Draw paths form latent to observed variables

\foreach \all in {x1,x2,x3}{
    \draw [paths] (y1.west) to node {} (\all.east);
}

\draw [paths] (x1.west) to (ex1.east);
\draw [paths] (x2.west) to (ex2.east);
\draw [paths] (x3.west) to (ex3.east);

%%%

\foreach \all in {x4,x5,x6}{
    \draw [paths] (y2.west) to node {} (\all.east);
}

\draw [paths] (x4.west) to (ex4.east);
\draw [paths] (x5.west) to (ex5.east);
\draw [paths] (x6.west) to (ex6.east);

%%%

\foreach \all in {x7,x8,x9}{
    \draw [paths] (y3.west) to node {} (\all.east);
}

\draw [paths] (x7.west) to (ex7.east);
\draw [paths] (x8.west) to (ex8.east);
\draw [paths] (x9.west) to (ex9.east);

%%%

\draw [twopaths] (y1.east) to [bend left=90] (y2.east);
\draw [twopaths] (y2.east) to [bend left=90] (y3.east);

\draw [twopaths] (y1.east) to [bend left=90] (y3.east);


\end{tikzpicture}
\end{document} 

and then I ran pdf2svg cfa_example.pdf cfa_example.svg.