Suppose you want to develop a survey, like we recently did for creativity in higher computing education. You come up with a bunch of questions (*Time seemed to fly while programming!*, *I adapted and merged ideas from others*, etc), recruit victims who are willing to answer them using a scale (*Totally agreed*, *Agreed*, *No way*, etc), and then what? Chances are you picked certain questions from certain subthemes because you suspect a pattern will emerge. How to validate that?

That’s where “factor analysis” comes in: it’s a statistical method—meaning it can analyze quantitative data, like answers from the above scale—to detect *variability* and *correlations* among the questions. My research is very much qualitative (looking for patterns using interviews and discussions), so I had no idea on how to handle that. This blog post servers as a summary for myself and can perhaps entertain others.

## Step 1: Is the data up for it?

First things first: it’s possible that your questions are so bad there’s never going to emerge any pattern. In that case, the “factorability” of the survey is reduced to zero.

There’s a statistic for that (That sentence comes in handy later!), it’s called the Kaiser-Meyer-Olkin (KMO) sampling adequacy. Their original paper suggests a value of at least `.60`

, but preferably higher than `.80`

.

Let’s use a well-known sample data set for our exercise: the psychological Big Five personalty test developed in the eighties. Scientists devised five big personality types, exactly using this very method: openness (`O`

), conscientiousness (`C`

), extraversion (`E`

), agreeableness (`A`

), neuroticism (`N`

). The `psych`

package contains a sample dataset of 2800 responses that can be loaded into the variable `bfi`

with `data('bfi')`

.

I’ll deliberately avoid using statistical formulas here and let R Suite do the talking:

```
library('psych') #import fa functions
data('bfi')
KMO(bfi)
```

The result:

```
Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = bfi)
Overall MSA = 0.84
MSA for each item = ...
```

All right, so far so good. What’s next? An overall item-total correlation (Rit) check to validate the reliability of the questions asked (the items). There are, of course, standardized rules for that, laid out by Ebel and Frisbie. Rit values below `.30`

are considered doubtful or even poor. How are we doing on that front? Let’s try out the `alpha(bfi)`

command and inspect the results:

```
Reliability analysis
Call: alpha(x = bfi)
raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
0.21 0.51 0.66 0.036 1.1 0.019 4.6 0.53 0.017
lower alpha upper 95% confidence boundaries
0.18 0.21 0.25
Reliability if an item is dropped:
raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
A1 0.24 0.53 0.68 0.041 1.15 0.018 0.035 0.021
A2 0.19 0.49 0.64 0.034 0.95 0.019 0.034 0.017
A3 0.19 0.49 0.64 0.034 0.94 0.019 0.033 0.017
...
Item statistics
n raw.r std.r r.cor r.drop mean sd
A1 2784 -0.028 0.0573 -0.057 -0.127 2.4 1.41
A2 2773 0.285 0.3736 0.356 0.210 4.8 1.17
A3 2774 0.267 0.3874 0.379 0.184 4.6 1.30
A4 2781 0.269 0.2938 0.239 0.174 4.7 1.48
A5 2784 0.265 0.3345 0.314 0.185 4.6 1.26
C1 2779 0.204 0.2723 0.215 0.125 4.5 1.24
C2 2776 0.229 0.3506 0.324 0.142 4.4 1.32
C3 2780 0.184 0.2487 0.181 0.099 4.3 1.29
C4 2774 0.040 0.1285 0.060 -0.055 2.6 1.38
C5 2784 0.083 0.1242 0.049 -0.027 3.3 1.63
E1 2777 0.032 0.0023 -0.107 -0.078 3.0 1.63
...
```

Lots of numbers, where to look? First, the `raw_alpha`

is a total alpha value that should be as close to one as possible. A low alpha value (`.21`

is not good) predicts that some questions can be left out while retaining the core and purpose of the survey.

The `r.cor`

column is what we’re interested in. Surprisingly, there are quite a few bad values here. For instance, `A1`

could be scrapped, `C5`

and `E1`

are also worthless. At this point it’s up to the researcher to decide whether or not to throw out questions. We did for a few, after carefully inspecting the question itself. Some participants might have misinterpreted the question. Of course, just throwing out all bad ones to up the statistical relevance is not the way to go.

You can re-match the questions to the codes in the R documentation: `A1`

is “I Am indifferent to the feelings of others.”, `E1`

is “I Don’t talk a lot.” Since we already know the 5 factors it’s a bit difficult to judge whether or not those questions are relevant to the personality types. Anyway, let’s move on and pretend we didn’t see the low alpha value.

## Step 2: Confirm your suspicions

There are two ways to do a factor analysis: *confirmatory* or *exploratory*. With the first, you suspect certain items will belong together, and hope that the statistics will confirm that. With the latter, you have no idea what will emerge, and let the analysis tell you which ones should be grouped together.

Usually, researchers have a pretty good idea, especially if they designed the questions themselves. They’re just looking for confirmation—and a way to publish their results. Lets pretend we think O1, C1, E1, A1, N1 and O2, C2, … belong together, comprising four factors in total. We have to feed R our model ourselves, and then pass it to the `cfa`

function from the lavaan project:

```
library('lavaan')
model <- '
# measurement model
cool =~ O1 + C1 + E1 + A1 + N1
mad =~ O2 + C2 + E2 + A2 + N2
sad =~ O3 + C3 + E3 + A3 + N3
glad =~ O4 + C4 + E4 + A4 + N4
'
fit <- cfa(model, bfi)
summary(fit, fit.measures = TRUE)
```

The `summary`

function spews out about every imaginable statistic you can think of. We’re looking for a few things here:

- The
*Comparative Fit Index*(CFI) that should ideally be`>.80`

. Here, it’s`.433`

. - The
*Root Mean Square Error of Approximation*(RMSEA) and its P-value that should be below`.05`

. Here, it’s`.135`

.

All other stuff (variances, covariances, latent variables) are useless if the CFI is low, as is the case. That means your model is wrong, and the *confirmatory* factor analysis confirmed that your hunch is… well… just a hunch. What’s next, scrapping the data and having another go at collecting “better” ones? Not yet. It just means that our cool, mad, sad, glad factors are hogwash.

## Step 3: Explore other possibilities

At this point, it’s a good idea to trade *confirmatory* for *exploratory*, although fiddling with the measurement model yourself is always an option—in my case, it only resulted in frustration. R still asks the researcher for an educated guess when it comes to the number of factors, though.

If you have no clue where to begin, a parallel analysis is a good idea, which comes with a bunch of optional parameters, including which factor method to use, which I won’t go into here:

```
eigenvalues <- fa.parallel(bfi)
```

This will take a while, but you can limit the result set by using `bfi[1:100]`

.

The result is one line of output (“Parallel analysis suggests that the number of factors = 7 and the number of components = 7”) and a graph that is called a “scree plot”:

The `x`

axis contains the number of potential factors, and the `y`

axis the “eigenvalues”, which explains how much of the variance of the items a factor explains. An eigenvalue of one (the threshold) means that factor on the `x`

axis explains more variance than a unique variable. In other words, If it goes below one, introducing new factors won’t be very beneficial, as the data of the questionnaire contributes little to that factor.

These graphs traditionally resemble a kind of elbow: after x factors, adding another one becomes futile. This is basically R telling us how many factors we should be aiming for. In this case, it could be seven, as suggested, but it could also be six or five. Remember: it’s still an educated guess.

Next, let’s try out that guess. Seven? Sure, why not.

```
factordata <- fa(bfi, 7, rotate="varimax")
> print(factordata$loadings, cutoff=0.3)
```

The result:

```
Loadings:
MR1 MR2 MR3 MR4 MR5 MR6 MR7
A1 -0.538
A2 0.391 0.513
A3 0.524 0.434
A4 0.302 0.311
A5 0.576
C1 0.544
C2 0.670
C3 0.542
C4 -0.609
C5 -0.537
E1 -0.504 0.359
E2 -0.597 0.386
E3 0.657
E4 0.683
E5 0.510 0.308
N1 0.796
N2 0.794
N3 0.723
N4 0.551 0.314
N5 0.518
O1 0.346 -0.393
O2 0.517
O3 0.428 -0.498
O4 0.350
O5 0.591
gender 0.366
education 0.389
age 0.564
```

Whoops, what are those last few items? Demographic info can be removed but I was too lazy to do that. What are we looking at here? The aim for these “loadings” is to (1) have every item (question) loaded on a separate factor, and not multiple, and (2) have the questions grouped in the factors in a way that makes sense. `MR1`

, `MR2`

, etc can afterwards be named. Since we’ve kind of spoiled the ending, this is a bit ridiculous now: obviously, we’re looking to load all the `A`

questions under a factor (“agreeableness”), all the `C`

s, etc.

Now is a good time to re-calculate that comparative fit index: `.97`

. Wow, that’s great! Of course, since R basically suggested seven factors.

What does `rotate=varimax`

mean? There are different kinds of matrix rotations available, grouped into orthogonal and oblique. As the analysis looks for the correlations to determine the best fit, it *can* rotate around an axis to find a better fit. In short, if you assume your factors are going to be correlated (which is mostly the case in psychology!), choose an oblique rotation. If you think they’re not, choose an orthogonal one.

In the above loading output, you can see that for instance `E5`

loads high on `MR1`

(`.510`

) and `MR3`

(`.308`

). That’s annoying. Is this question related to construct 1 or 2? Perhaps five factors is a better fit. Let’s try again with five and apply an `oblimin`

rotation, because we suspect they’re correlated:

```
Loadings:
MR2 MR1 MR3 MR5 MR4
A1 -0.514
A2 0.631
A3 0.581
A4 0.397
A5 0.320 0.458
C1 0.547
C2 0.662
C3 0.563
C4 -0.623
C5 -0.560
E1 -0.539
E2 -0.641
E3 0.490
E4 0.669
E5 0.407
N1 0.782
N2 0.762
N3 0.729
N4 0.502 -0.356
N5 0.531
O1 0.522
O2 -0.436
O3 0.616
O4 0.366
O5 -0.522
```

Bingo. Everything seems to be (almost) neatly grouped into five different factors—the ones we started with: openness (`O`

), conscientiousness (`C`

), extraversion (`E`

), agreeableness (`A`

), neuroticism (`N`

). Of course, when scientists were originally analyzing this data, they didn’t know that yet, and had to come up with those cool labels. What about our CFI? `.89`

—still great.

A few other functions that can be of use:

`principal`

, calculating aforementioned statistics for an exploratory analysis`fa.diagram(bfi)`

visualizes the loaded factors

If you’re still keen on grinding away in R, you can verify the EFA findings by creating a revised model and doing a second CFA. The results should more or less match, including the high CFI value.

## Step 4: Interpret with caution

What’s the point of all those statistics? The emerging factors are never 100% the expected factors. For instance, we expected to see the same themes as the ones we identified in a focus group, as the questions were based on interviewees' opinions, supplemented with academic literature. We instead found (1) less factors, (2) a unique group of sub-themes we didn’t foresee, and (3) discrepancies in our questions.

That means the process of coming up with questions, gathering data, and analyzing them using something like a factor analysis, is an *incremental* process. It should—in theory—be done multiple times using revisions of your questionnaire. In practice, gathering test subjects just once already involves a lot of trouble. I’ve found a lot of psychometric tests that never receive a published revision.

One more thing: to get your stuff really, *really* correct, you need a **Step 0: Validate beforehand**, or a “face validity” phase. There, you present your survey to a small group of experts/critics/potential candidates to iron out the biggest syntactic mistakes. That’s the only qualitative part of a survey design.

Now you know the basics of setting up and validating your own questionnaire. I had no idea what I was doing, and I still have no clue on some of the statistical calculations behind these functions, but at least now I know how to use them and interpret their results. Good luck!