Structural Equation Modelling (SEM) is a state of art methodology and fulfills much of broader discusion about statistical modeling, and allows to make inference and causal analysis. SEM models are regression models broadly used in Marketing, Human Resources, Biostatistics and Medicine, revealing their flexibility as analytical tool.

Despite being a state-of-the-art methodology, SEM does not create new developments of statistical theory, since it consists in a combination of multivariate analysis, factor analysis and regression analysis. SEM makes it possible to validate or reject one or more hypothesis about an existing relation between different variables, to estimate simultaneously dependency relations between variables and estimate measurement error in model's variables. SEM may be understood as a net of unidirectional or bidirectional paths linking different variables. Such net can be validated using multivariate data.

A usual methodology for model evaluation is *Confirmatory Factor Analysis* (CFA) that is a particular case of SEM. It is a process which consists in specifying quantity and kinds of observed variables to one or more latent variables and analyze how well those variables measure the latent variable itself. Think of a latent variable as an artificial variable that is represented as a linear combination of observed variables.

Holzinger and Swineford (1939) proposed one the most famous CFA models. In their work they analysed the mental ability based test scores of children from two different schools (details here)

We can represent Holzinger and Swineford's model in a diagram:

In this model, \(\newcommand{\R}{\mathbb{R}} x_i\) are exogenous variables that record scores and \(y_i\) are latent (and endogenous) variables that represent visual ability, textual ability and speed ability respectively, double-headed arrows represent an association between \(y_1\), \(y_2\), and \(y_3\), single-headed arrows represent direct effects, and \(\delta_i\) represent error terms.

A *factor loading* is a measurable effect which reflects the incidence of an observed variable over another either observed or latent variable. Interpreting factor loadings is equivalent to interpreting direct effects in a linear econometric model. In this example factor loadings are well defined as there are no missing relations between variables, is known which observed variables have an effect over defined latent variables, and both latent variables covariate so there is a relation between latent variables.

When I wrote my thesis I had to study SEM's goodness of fit and its indicators. The interested reader may visit my Github repo to read more about some important linear algebra aspects of SEM but here I present a table that synthesizes 50% of my thesis:

Indicator | Type | Classification | Direction | Ideal values | Normed (0-1) |
---|---|---|---|---|---|

Chi-squared | significance | adjusted | more is better | \(>2df\) | no |

RMSEA | significance | adjusted | less is better | \(\leq 0,05\) | no |

SMSR | significance | unadjusted | less is better | \(<0,1\) | no |

GFI | significance | unadjusted | more is better | \(>0,9\) | yes |

This table is useful for a cheatsheet and to keep in mind what to look when comparing models.

I highly recommend using lavaan. I have used AMOS and Stata but I guess R's flexibility is unbeatable at the moment. Lavaan also provides `HolzingerSwineford1939`

dataset that contains the complete recordings made by Holzinger and Swineford.

First I load the needed packages to perform the analysis:

```
# Needed packages to perform the analysis
load <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
}
packages <- c("lavaan","semPlot","semTools","nonnest2","htmlTable")
load(packages)
```

Then I fit the model using the `cfa()`

function:

```
# 1.2. Specify the model
HS.model <- ' visual =~ x1 + x2 + x3
textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9 '
# 1.3. Fit the model
fit <- cfa(HS.model, data = HolzingerSwineford1939)
```

Then I obtain a model summary, confidence intervals and goodness of fit indicators:

```
# 2.1. Display summary output
summary(fit, fit.measures=FALSE)
```

```
lavaan (0.5-23.1097) converged normally after 35 iterations
Number of observations 301
Estimator ML
Minimum Function Test Statistic 85.306
Degrees of freedom 24
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
visual =~
x1 1.000
x2 0.554 0.100 5.554 0.000
x3 0.729 0.109 6.685 0.000
textual =~
x4 1.000
x5 1.113 0.065 17.014 0.000
x6 0.926 0.055 16.703 0.000
speed =~
x7 1.000
x8 1.180 0.165 7.152 0.000
x9 1.082 0.151 7.155 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
visual ~~
textual 0.408 0.074 5.552 0.000
speed 0.262 0.056 4.660 0.000
textual ~~
speed 0.173 0.049 3.518 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.x1 0.549 0.114 4.833 0.000
.x2 1.134 0.102 11.146 0.000
.x3 0.844 0.091 9.317 0.000
.x4 0.371 0.048 7.779 0.000
.x5 0.446 0.058 7.642 0.000
.x6 0.356 0.043 8.277 0.000
.x7 0.799 0.081 9.823 0.000
.x8 0.488 0.074 6.573 0.000
.x9 0.566 0.071 8.003 0.000
visual 0.809 0.145 5.564 0.000
textual 0.979 0.112 8.737 0.000
speed 0.384 0.086 4.451 0.000
```

```
# 2.2. Obtain confidence intervals for the estimated coefficients
parameterEstimates(fit, ci = TRUE, level = 0.95)
```

```
lhs op rhs est se z pvalue ci.lower ci.upper
1 visual =~ x1 1.000 0.000 NA NA 1.000 1.000
2 visual =~ x2 0.554 0.100 5.554 0 0.358 0.749
3 visual =~ x3 0.729 0.109 6.685 0 0.516 0.943
4 textual =~ x4 1.000 0.000 NA NA 1.000 1.000
5 textual =~ x5 1.113 0.065 17.014 0 0.985 1.241
6 textual =~ x6 0.926 0.055 16.703 0 0.817 1.035
7 speed =~ x7 1.000 0.000 NA NA 1.000 1.000
8 speed =~ x8 1.180 0.165 7.152 0 0.857 1.503
9 speed =~ x9 1.082 0.151 7.155 0 0.785 1.378
10 x1 ~~ x1 0.549 0.114 4.833 0 0.326 0.772
11 x2 ~~ x2 1.134 0.102 11.146 0 0.934 1.333
12 x3 ~~ x3 0.844 0.091 9.317 0 0.667 1.022
13 x4 ~~ x4 0.371 0.048 7.779 0 0.278 0.465
14 x5 ~~ x5 0.446 0.058 7.642 0 0.332 0.561
15 x6 ~~ x6 0.356 0.043 8.277 0 0.272 0.441
16 x7 ~~ x7 0.799 0.081 9.823 0 0.640 0.959
17 x8 ~~ x8 0.488 0.074 6.573 0 0.342 0.633
18 x9 ~~ x9 0.566 0.071 8.003 0 0.427 0.705
19 visual ~~ visual 0.809 0.145 5.564 0 0.524 1.094
20 textual ~~ textual 0.979 0.112 8.737 0 0.760 1.199
21 speed ~~ speed 0.384 0.086 4.451 0 0.215 0.553
22 visual ~~ textual 0.408 0.074 5.552 0 0.264 0.552
23 visual ~~ speed 0.262 0.056 4.660 0 0.152 0.373
24 textual ~~ speed 0.173 0.049 3.518 0 0.077 0.270
```

```
# 2.3 Obtain goodness of fit indicators of the model
fitMeasures(fit, c("chisq", "rmsea", "srmr", "gfi", "ecvi"))
```

```
chisq rmsea srmr gfi ecvi
85.306 0.092 0.065 0.943 0.423
```

```
reliability(fit)
```

```
visual textual speed total
alpha 0.6261171 0.8827069 0.6884550 0.7604886
omega 0.6253180 0.8851754 0.6877600 0.8453351
omega2 0.6253180 0.8851754 0.6877600 0.8453351
omega3 0.6120052 0.8850608 0.6858417 0.8596204
avevar 0.3705589 0.7210163 0.4244883 0.5145874
```

Lavaan produces a lot of potential pieces of output, but I suggest that at minimum that you export the above model summary, confidence intervals and goodness of fit indicators as part of your final tables:

```
htmlTable(txtRound(parameterEstimates(fit, ci = TRUE, level = 0.95), digits=3, excl.cols=c(1,3)), align="l|r|r|r|r|r|r|r|r|r")
```

lhs | op | rhs | est | se | z | pvalue | ci.lower | ci.upper | |
---|---|---|---|---|---|---|---|---|---|

1 | visual | =~ | x1 | 1.000 | 0.000 | 1.000 | 1.000 | ||

2 | visual | =~ | x2 | 0.554 | 0.100 | 5.554 | 2.798 | 0.358 | 0.749 |

3 | visual | =~ | x3 | 0.729 | 0.109 | 6.685 | 2.313 | 0.516 | 0.943 |

4 | textual | =~ | x4 | 1.000 | 0.000 | 1.000 | 1.000 | ||

5 | textual | =~ | x5 | 1.113 | 0.065 | 17.014 | 0.000 | 0.985 | 1.241 |

6 | textual | =~ | x6 | 0.926 | 0.055 | 16.703 | 0.000 | 0.817 | 1.035 |

7 | speed | =~ | x7 | 1.000 | 0.000 | 1.000 | 1.000 | ||

8 | speed | =~ | x8 | 1.180 | 0.165 | 7.152 | 8.564 | 0.857 | 1.503 |

9 | speed | =~ | x9 | 1.082 | 0.151 | 7.155 | 8.398 | 0.785 | 1.378 |

10 | x1 | ~~ | x1 | 0.549 | 0.114 | 4.833 | 1.344 | 0.326 | 0.772 |

11 | x2 | ~~ | x2 | 1.134 | 0.102 | 11.146 | 0.000 | 0.934 | 1.333 |

12 | x3 | ~~ | x3 | 0.844 | 0.091 | 9.317 | 0.000 | 0.667 | 1.022 |

13 | x4 | ~~ | x4 | 0.371 | 0.048 | 7.779 | 7.327 | 0.278 | 0.465 |

14 | x5 | ~~ | x5 | 0.446 | 0.058 | 7.642 | 2.132 | 0.332 | 0.561 |

15 | x6 | ~~ | x6 | 0.356 | 0.043 | 8.277 | 2.220 | 0.272 | 0.441 |

16 | x7 | ~~ | x7 | 0.799 | 0.081 | 9.823 | 0.000 | 0.640 | 0.959 |

17 | x8 | ~~ | x8 | 0.488 | 0.074 | 6.573 | 4.923 | 0.342 | 0.633 |

18 | x9 | ~~ | x9 | 0.566 | 0.071 | 8.003 | 1.110 | 0.427 | 0.705 |

19 | visual | ~~ | visual | 0.809 | 0.145 | 5.564 | 2.640 | 0.524 | 1.094 |

20 | textual | ~~ | textual | 0.979 | 0.112 | 8.737 | 0.000 | 0.760 | 1.199 |

21 | speed | ~~ | speed | 0.384 | 0.086 | 4.451 | 8.533 | 0.215 | 0.553 |

22 | visual | ~~ | textual | 0.408 | 0.074 | 5.552 | 2.818 | 0.264 | 0.552 |

23 | visual | ~~ | speed | 0.262 | 0.056 | 4.660 | 3.168 | 0.152 | 0.373 |

24 | textual | ~~ | speed | 0.173 | 0.049 | 3.518 | 4.346 | 0.077 | 0.270 |

```
htmlTable(txtRound(reliability(fit),3), align="l|r|r|r|r")
```

visual | textual | speed | total | |
---|---|---|---|---|

alpha | 0.626 | 0.883 | 0.688 | 0.760 |

omega | 0.625 | 0.885 | 0.688 | 0.845 |

omega2 | 0.625 | 0.885 | 0.688 | 0.845 |

omega3 | 0.612 | 0.885 | 0.686 | 0.860 |

avevar | 0.371 | 0.721 | 0.424 | 0.515 |

There are other packages to convert R output to tables such as `knitr`

and `xtables`

that can also export to \(\rm\LaTeX\).

Now let's get on to producing the exciting part of an SEM analysis - the model diagram. This code might not be good-looking but the result is!

```
semPaths(fit, style="lisrel",
whatLabels = "std", edge.label.cex = .6, node.label.cex = .6,
label.prop=0.9, edge.label.color = "black", rotation = 4,
equalizeManifests = FALSE, optimizeLatRes = TRUE, node.width = 1.5,
edge.width = 0.5, shapeMan = "rectangle", shapeLat = "ellipse",
shapeInt = "triangle", sizeMan = 4, sizeInt = 2, sizeLat = 4,
curve=2, unCol = "#070b8c")
```

If you are running a script instead of a R Markdoen document, you can save the plot as svg with these lines:

```
svg("../../images/2016-12-07_sem_r_part_1/cfa_example_2.svg")
semPaths(fit, style="lisrel",
whatLabels = "std", edge.label.cex = .6, node.label.cex = .6,
label.prop=0.9, edge.label.color = "black", rotation = 4,
equalizeManifests = FALSE, optimizeLatRes = TRUE, node.width = 1.5,
edge.width = 0.5, shapeMan = "rectangle", shapeLat = "ellipse",
shapeInt = "triangle", sizeMan = 4, sizeInt = 2, sizeLat = 4,
curve=2, unCol = "#070b8c")
#dev.off() # uncomment if you run a script but there's no need to use this in R Markdown
```

If you are curious about the first diagram, I made it with TikZ. This is the code

```
\documentclass[border=3mm]{standalone}
\usepackage[applemac]{inputenc}
\usepackage[protrusion=true,expansion=true]{microtype}
\usepackage[bb=lucida,bbscaled=1,cal=boondoxo]{mathalfa}
\usepackage[stdmathitalics=true,math-style=iso,lucidasmallscale=true,romanfamily=bright]{lucimatx}
\usepackage{tikz}
\usetikzlibrary{arrows,positioning}
\newcommand{\at}{\makeatletter @\makeatother}
\begin{document}
\begin{tikzpicture}[auto,node distance=.5cm,
latent/.style={fill=red!20,circle,draw, thick,inner sep=0pt,minimum size=8mm,align=center},
observed/.style={fill=blue!20,rectangle,draw, thick,inner sep=0pt,minimum width=8mm,minimum height=8mm,align=center},
error/.style={fill=yellow!20,circle,draw, thick,inner sep=0pt,minimum width=8mm,minimum height=8mm,align=center},
paths/.style={->, thick, >=stealth'},
paths2/.style={<-, thick, >=stealth'},
twopaths/.style={<->, thick, >=stealth'}
]
%Define observed variables
\node [latent] (y1) at (0,0) {$y_1$};
\node [latent] (y2) [below=3cm of y1] {$y_2$};
\node [latent] (y3) [below=3cm of y2] {$y_3$};
%%%
\node [observed] (x2) [left=1.8cm of y1] {$x_2$};
\node [observed] (x1) [above=of x2] {$x_1$};
\node [observed] (x3) [below=of x2] {$x_3$};
\node [error] (ex1) [left=0.5cm of x1] {$\delta_1$};
\node [error] (ex2) [left=0.5cm of x2] {$\delta_2$};
\node [error] (ex3) [left=0.5cm of x3] {$\delta_3$};
%%%
\node [observed] (x5) [left=1.8cm of y2] {$x_5$};
\node [observed] (x4) [above=of x5] {$x_4$};
\node [observed] (x6) [below=of x5] {$x_6$};
\node [error] (ex4) [left=0.5cm of x4] {$\delta_4$};
\node [error] (ex5) [left=0.5cm of x5] {$\delta_5$};
\node [error] (ex6) [left=0.5cm of x6] {$\delta_6$};
%%%
\node [observed] (x8) [left=1.8cm of y3] {$x_8$};
\node [observed] (x7) [above=of x8] {$x_7$};
\node [observed] (x9) [below=of x8] {$x_9$};
\node [error] (ex7) [left=0.5cm of x7] {$\delta_7$};
\node [error] (ex8) [left=0.5cm of x8] {$\delta_8$};
\node [error] (ex9) [left=0.5cm of x9] {$\delta_9$};
% Draw paths form latent to observed variables
\foreach \all in {x1,x2,x3}{
\draw [paths] (y1.west) to node {} (\all.east);
}
\draw [paths] (x1.west) to (ex1.east);
\draw [paths] (x2.west) to (ex2.east);
\draw [paths] (x3.west) to (ex3.east);
%%%
\foreach \all in {x4,x5,x6}{
\draw [paths] (y2.west) to node {} (\all.east);
}
\draw [paths] (x4.west) to (ex4.east);
\draw [paths] (x5.west) to (ex5.east);
\draw [paths] (x6.west) to (ex6.east);
%%%
\foreach \all in {x7,x8,x9}{
\draw [paths] (y3.west) to node {} (\all.east);
}
\draw [paths] (x7.west) to (ex7.east);
\draw [paths] (x8.west) to (ex8.east);
\draw [paths] (x9.west) to (ex9.east);
%%%
\draw [twopaths] (y1.east) to [bend left=90] (y2.east);
\draw [twopaths] (y2.east) to [bend left=90] (y3.east);
\draw [twopaths] (y1.east) to [bend left=90] (y3.east);
\end{tikzpicture}
\end{document}
```

and then I ran `pdf2svg cfa_example.pdf cfa_example.svg`

.