Suppose you want to develop a survey, like we recently did for creativity in higher computing education. You come up with a bunch of questions (Time seemed to fly while programming!, I adapted and merged ideas from others, etc), recruit victims who are willing to answer them using a scale (Totally agreed, Agreed, No way, etc), and then what? Chances are you picked certain questions from certain subthemes because you suspect a pattern will emerge. How to validate that?
That’s where “factor analysis” comes in: it’s a statistical method—meaning it can analyze quantitative data, like answers from the above scale—to detect variability and correlations among the questions. My research is very much qualitative (looking for patterns using interviews and discussions), so I had no idea on how to handle that. This blog post servers as a summary for myself and can perhaps entertain others.
Step 1: Is the data up for it?
First things first: it’s possible that your questions are so bad there’s never going to emerge any pattern. In that case, the “factorability” of the survey is reduced to zero.
There’s a statistic for that (That sentence comes in handy later!), it’s called the Kaiser-Meyer-Olkin (KMO) sampling adequacy. Their original paper suggests a value of at least
.60, but preferably higher than
Let’s use a well-known sample data set for our exercise: the psychological Big Five personalty test developed in the eighties. Scientists devised five big personality types, exactly using this very method: openness (
O), conscientiousness (
C), extraversion (
E), agreeableness (
A), neuroticism (
psych package contains a sample dataset of 2800 responses that can be loaded into the variable
I’ll deliberately avoid using statistical formulas here and let R Suite do the talking:
library('psych') #import fa functions data('bfi') KMO(bfi)
Kaiser-Meyer-Olkin factor adequacy Call: KMO(r = bfi) Overall MSA = 0.84 MSA for each item = ...
All right, so far so good. What’s next? An overall item-total correlation (Rit) check to validate the reliability of the questions asked (the items). There are, of course, standardized rules for that, laid out by Ebel and Frisbie. Rit values below
.30 are considered doubtful or even poor. How are we doing on that front? Let’s try out the
alpha(bfi) command and inspect the results:
Reliability analysis Call: alpha(x = bfi) raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r 0.21 0.51 0.66 0.036 1.1 0.019 4.6 0.53 0.017 lower alpha upper 95% confidence boundaries 0.18 0.21 0.25 Reliability if an item is dropped: raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r A1 0.24 0.53 0.68 0.041 1.15 0.018 0.035 0.021 A2 0.19 0.49 0.64 0.034 0.95 0.019 0.034 0.017 A3 0.19 0.49 0.64 0.034 0.94 0.019 0.033 0.017 ... Item statistics n raw.r std.r r.cor r.drop mean sd A1 2784 -0.028 0.0573 -0.057 -0.127 2.4 1.41 A2 2773 0.285 0.3736 0.356 0.210 4.8 1.17 A3 2774 0.267 0.3874 0.379 0.184 4.6 1.30 A4 2781 0.269 0.2938 0.239 0.174 4.7 1.48 A5 2784 0.265 0.3345 0.314 0.185 4.6 1.26 C1 2779 0.204 0.2723 0.215 0.125 4.5 1.24 C2 2776 0.229 0.3506 0.324 0.142 4.4 1.32 C3 2780 0.184 0.2487 0.181 0.099 4.3 1.29 C4 2774 0.040 0.1285 0.060 -0.055 2.6 1.38 C5 2784 0.083 0.1242 0.049 -0.027 3.3 1.63 E1 2777 0.032 0.0023 -0.107 -0.078 3.0 1.63 ...
Lots of numbers, where to look? First, the
raw_alpha is a total alpha value that should be as close to one as possible. A low alpha value (
.21 is not good) predicts that some questions can be left out while retaining the core and purpose of the survey.
r.cor column is what we’re interested in. Surprisingly, there are quite a few bad values here. For instance,
A1 could be scrapped,
E1 are also worthless. At this point it’s up to the researcher to decide whether or not to throw out questions. We did for a few, after carefully inspecting the question itself. Some participants might have misinterpreted the question. Of course, just throwing out all bad ones to up the statistical relevance is not the way to go.
You can re-match the questions to the codes in the R documentation:
A1 is “I Am indifferent to the feelings of others.”,
E1 is “I Don’t talk a lot.” Since we already know the 5 factors it’s a bit difficult to judge whether or not those questions are relevant to the personality types. Anyway, let’s move on and pretend we didn’t see the low alpha value.
Step 2: Confirm your suspicions
There are two ways to do a factor analysis: confirmatory or exploratory. With the first, you suspect certain items will belong together, and hope that the statistics will confirm that. With the latter, you have no idea what will emerge, and let the analysis tell you which ones should be grouped together.
Usually, researchers have a pretty good idea, especially if they designed the questions themselves. They’re just looking for confirmation—and a way to publish their results. Lets pretend we think O1, C1, E1, A1, N1 and O2, C2, … belong together, comprising four factors in total. We have to feed R our model ourselves, and then pass it to the
cfa function from the lavaan project:
library('lavaan') model <- ' # measurement model cool =~ O1 + C1 + E1 + A1 + N1 mad =~ O2 + C2 + E2 + A2 + N2 sad =~ O3 + C3 + E3 + A3 + N3 glad =~ O4 + C4 + E4 + A4 + N4 ' fit <- cfa(model, bfi) summary(fit, fit.measures = TRUE)
summary function spews out about every imaginable statistic you can think of. We’re looking for a few things here:
- The Comparative Fit Index (CFI) that should ideally be
>.80. Here, it’s
- The Root Mean Square Error of Approximation (RMSEA) and its P-value that should be below
.05. Here, it’s
All other stuff (variances, covariances, latent variables) are useless if the CFI is low, as is the case. That means your model is wrong, and the confirmatory factor analysis confirmed that your hunch is… well… just a hunch. What’s next, scrapping the data and having another go at collecting “better” ones? Not yet. It just means that our cool, mad, sad, glad factors are hogwash.
Step 3: Explore other possibilities
At this point, it’s a good idea to trade confirmatory for exploratory, although fiddling with the measurement model yourself is always an option—in my case, it only resulted in frustration. R still asks the researcher for an educated guess when it comes to the number of factors, though.
If you have no clue where to begin, a parallel analysis is a good idea, which comes with a bunch of optional parameters, including which factor method to use, which I won’t go into here:
eigenvalues <- fa.parallel(bfi)
This will take a while, but you can limit the result set by using
The result is one line of output (“Parallel analysis suggests that the number of factors = 7 and the number of components = 7”) and a graph that is called a “scree plot”:
x axis contains the number of potential factors, and the
y axis the “eigenvalues”, which explains how much of the variance of the items a factor explains. An eigenvalue of one (the threshold) means that factor on the
x axis explains more variance than a unique variable. In other words, If it goes below one, introducing new factors won’t be very beneficial, as the data of the questionnaire contributes little to that factor.
These graphs traditionally resemble a kind of elbow: after x factors, adding another one becomes futile. This is basically R telling us how many factors we should be aiming for. In this case, it could be seven, as suggested, but it could also be six or five. Remember: it’s still an educated guess.
Next, let’s try out that guess. Seven? Sure, why not.
factordata <- fa(bfi, 7, rotate="varimax") > print(factordata$loadings, cutoff=0.3)
Loadings: MR1 MR2 MR3 MR4 MR5 MR6 MR7 A1 -0.538 A2 0.391 0.513 A3 0.524 0.434 A4 0.302 0.311 A5 0.576 C1 0.544 C2 0.670 C3 0.542 C4 -0.609 C5 -0.537 E1 -0.504 0.359 E2 -0.597 0.386 E3 0.657 E4 0.683 E5 0.510 0.308 N1 0.796 N2 0.794 N3 0.723 N4 0.551 0.314 N5 0.518 O1 0.346 -0.393 O2 0.517 O3 0.428 -0.498 O4 0.350 O5 0.591 gender 0.366 education 0.389 age 0.564
Whoops, what are those last few items? Demographic info can be removed but I was too lazy to do that. What are we looking at here? The aim for these “loadings” is to (1) have every item (question) loaded on a separate factor, and not multiple, and (2) have the questions grouped in the factors in a way that makes sense.
MR2, etc can afterwards be named. Since we’ve kind of spoiled the ending, this is a bit ridiculous now: obviously, we’re looking to load all the
A questions under a factor (“agreeableness”), all the
Now is a good time to re-calculate that comparative fit index:
.97. Wow, that’s great! Of course, since R basically suggested seven factors.
rotate=varimax mean? There are different kinds of matrix rotations available, grouped into orthogonal and oblique. As the analysis looks for the correlations to determine the best fit, it can rotate around an axis to find a better fit. In short, if you assume your factors are going to be correlated (which is mostly the case in psychology!), choose an oblique rotation. If you think they’re not, choose an orthogonal one.
In the above loading output, you can see that for instance
E5 loads high on
.308). That’s annoying. Is this question related to construct 1 or 2? Perhaps five factors is a better fit. Let’s try again with five and apply an
oblimin rotation, because we suspect they’re correlated:
Loadings: MR2 MR1 MR3 MR5 MR4 A1 -0.514 A2 0.631 A3 0.581 A4 0.397 A5 0.320 0.458 C1 0.547 C2 0.662 C3 0.563 C4 -0.623 C5 -0.560 E1 -0.539 E2 -0.641 E3 0.490 E4 0.669 E5 0.407 N1 0.782 N2 0.762 N3 0.729 N4 0.502 -0.356 N5 0.531 O1 0.522 O2 -0.436 O3 0.616 O4 0.366 O5 -0.522
Bingo. Everything seems to be (almost) neatly grouped into five different factors—the ones we started with: openness (
O), conscientiousness (
C), extraversion (
E), agreeableness (
A), neuroticism (
N). Of course, when scientists were originally analyzing this data, they didn’t know that yet, and had to come up with those cool labels. What about our CFI?
A few other functions that can be of use:
principal, calculating aforementioned statistics for an exploratory analysis
fa.diagram(bfi)visualizes the loaded factors
If you’re still keen on grinding away in R, you can verify the EFA findings by creating a revised model and doing a second CFA. The results should more or less match, including the high CFI value.
Step 4: Interpret with caution
What’s the point of all those statistics? The emerging factors are never 100% the expected factors. For instance, we expected to see the same themes as the ones we identified in a focus group, as the questions were based on interviewees' opinions, supplemented with academic literature. We instead found (1) less factors, (2) a unique group of sub-themes we didn’t foresee, and (3) discrepancies in our questions.
That means the process of coming up with questions, gathering data, and analyzing them using something like a factor analysis, is an incremental process. It should—in theory—be done multiple times using revisions of your questionnaire. In practice, gathering test subjects just once already involves a lot of trouble. I’ve found a lot of psychometric tests that never receive a published revision.
One more thing: to get your stuff really, really correct, you need a Step 0: Validate beforehand, or a “face validity” phase. There, you present your survey to a small group of experts/critics/potential candidates to iron out the biggest syntactic mistakes. That’s the only qualitative part of a survey design.
Now you know the basics of setting up and validating your own questionnaire. I had no idea what I was doing, and I still have no clue on some of the statistical calculations behind these functions, but at least now I know how to use them and interpret their results. Good luck!