Tuesday, October 27, 2015

Cluster Analysis

Updated November 2, 2015

Cluster analysis is a descriptive tool to find interesting subgroups of participants in your data. It's somewhat analogous to sorting one's recyclables at a recycling center: glass items go in one bin, clear plastic in another, opaque plastic in another, etc. The items in any one bin are similar to each other, but the contents of one bin are different from those in another bin.

Typically, the researcher will have a domain of interest, such as political attitudes, dining preferences, or goals in life. Several items (perhaps 5-10) in the relevant domain will be submitted to cluster analysis. Participants who answer similarly to the items will be grouped into the same cluster, so that each cluster will be internally homogeneous but distinct from the other clusters.

Here's a very conceptual illustration, using the music-preference items from the 1993 General Social Survey...


The color-coding shows the beginning stage of dividing the respondents into clusters:
  • People whose responses are shaded orange tend to like big band, blues, Broadway musicals, and jazz, and dislike rap and heavy metal. 
  • Those in yellow really like jazz, and are moderately favorable toward country, blues, and rap.
  • Those in green appear to dislike music generally!
Unlike the other techniques we've learned thus far, cluster analysis offers no significance tests. However, after conducting a cluster analysis, the groups you derive can be compared on other variables via MANOVA, discriminant analysis, or log-linear modeling.

Though the general approach of cluster analysis is relatively straightforward, actual implementation is fairly technical. There are two main approaches -- k-means/iterative and hierarchical -- which are discussed below. Key to both methods is determining the similarity (or, conversely, the distance) between cases. The more similar cases are to each other, the more likely they are to end up in the same cluster.
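To make the distance idea concrete, here is a minimal sketch in Python (the items and ratings are invented for illustration and are not from any of the analyses discussed here):

import numpy as np

# Hypothetical 1-5 liking ratings on five music items:
# big band, blues, jazz, rap, heavy metal
person_a = np.array([5, 4, 4, 1, 2])
person_b = np.array([5, 5, 4, 2, 1])
person_c = np.array([1, 2, 1, 5, 5])

def euclidean(x, y):
    # Euclidean distance: smaller values mean more similar response profiles
    return np.sqrt(np.sum((x - y) ** 2))

print(euclidean(person_a, person_b))  # small distance: likely the same cluster
print(euclidean(person_a, person_c))  # large distance: likely different clusters

Persons A and B have similar profiles and so would likely land in the same cluster; Person C, whose tastes run the opposite way, would not.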

k-means/Iterative -- This approach is spatial, very much like discriminant analysis. One must specify in advance the number (k) of clusters one seeks, each cluster having a centroid (again like discriminant analysis). Cases are sorted into groups (clusters) based on which centroid they're closest to. The analysis then goes through an iterative (repetitive) process of relocating the centroids and re-measuring the data points' distances from them, until the solution doesn't change anymore, as I illustrate in the following graphic.


Methods for locating initial centroids are discussed here. Naftali Harris has an excellent interactive webpage called "Visualizing K-Means Clustering," which illustrates many of the above steps.
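For readers who want to experiment, here is a brief k-means sketch in Python with scikit-learn (my own analyses use SPSS; the data below are simulated around three made-up taste profiles, so treat this as an illustration of the assign-and-relocate cycle, not a reproduction of anything above):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 150 simulated respondents rating 5 items, generated around 3 taste profiles
profiles = np.array([[5.0, 4.0, 4.0, 1.0, 1.0],
                     [2.0, 2.0, 5.0, 4.0, 3.0],
                     [1.0, 1.0, 1.0, 1.0, 1.0]])
X = np.vstack([p + rng.normal(0, 0.5, size=(50, 5)) for p in profiles])

# k is specified in advance; the algorithm then alternates between assigning
# each case to its nearest centroid and relocating the centroids, stopping
# once the assignments no longer change
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_.round(2))  # final centroids (mean item profile per cluster)
print(np.bincount(km.labels_))       # cluster sizes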

There are also different criteria for the distance between clusters -- linkage rules such as single-linkage, average-linkage, etc. (See slide 17 of this slideshow.)
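To see the difference in miniature, here is a hypothetical sketch using SciPy (the coordinates are made up): single-linkage defines the distance between two clusters as that of their closest pair of cases, whereas average-linkage uses the mean of all cross-cluster pairs.

import numpy as np
from scipy.spatial.distance import cdist

cluster_1 = np.array([[1.0, 1.0], [1.5, 1.2]])  # two cases in one cluster
cluster_2 = np.array([[4.0, 4.0], [6.0, 5.5]])  # two cases in another

d = cdist(cluster_1, cluster_2)  # all pairwise distances across the two clusters
print(d.min())   # single linkage: distance of the closest cross-cluster pair
print(d.mean())  # average linkage: mean of all cross-cluster pairs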

Hierarchical -- This approach uses a dendrogram (tree-diagram), which looks like the sports-tournament bracket that college-basketball fans fill out every March with their predictions. As Rapkin and Luke (1993) describe:

Agglomerative hierarchical algorithms start with all cases as separate entities. Cases are combined (agglomerated) in sequence, so that those closest together are placed into the same cluster early in the hierarchy. As the analysis proceeds, small clusters of cases combine to form continually larger and more heterogeneous clusters, until all cases are joined into a single cluster (p. 267).

A particular kind of hierarchical clustering technique is Ward's method, which is said to be conducive to balanced cluster sizes (i.e., each cluster having a roughly equal number of cases, rather than some huge and some tiny clusters).
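As a rough illustration (again with simulated data, here via SciPy, rather than any published analysis), one can run the agglomeration with Ward's method and then cut the resulting tree at a desired number of clusters:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# 60 simulated cases on 5 items, drawn around three different means
X = np.vstack([rng.normal(loc, 0.5, size=(20, 5)) for loc in (1.0, 3.0, 5.0)])

Z = linkage(X, method='ward')  # the full agglomeration history (the "tree")
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the tree into 3 clusters
print(np.bincount(labels)[1:])  # cluster sizes (fcluster labels start at 1)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the bracket-style diagram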

An "oldie, but goody" introductory article on cluster analysis is the following. The software descriptions are obviously out-of-date, but the general overview is excellent. Of particular value, in my view, is the set of recommendations for determining the number of clusters to retain (pp. 268-270).

Rapkin, B., & Luke, D. (1993). Cluster analysis in community research: Epistemology and practice. American Journal of Community Psychology, 21, 247-277.

An article that illustrates the use of cluster analysis, including how to characterize and name the clusters, is:

Schrick, B., Sharp, E. A., Zvonkovic, A., & Reifman, A. (2012). Never let them see you sweat: Silencing and striving to appear perfect among US college women. Sex Roles, 67, 591-604.

One final issue is the stability of cluster solutions. Even within k-means/iterative methods alone, or hierarchical methods alone, there are many ways to implement cluster analysis. To ensure your cluster solution is not merely the peculiar result of one method, you can apply more than one method to the same dataset (e.g., one k-means/iterative method and one hierarchical method). You can save the assigned cluster memberships in SPSS for both methods and then run a cross-tab of the memberships to verify that, for the most part, the same people end up grouped together under the two solutions.
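The same check can be sketched outside SPSS. Here is one hypothetical way in Python, clustering simulated data with both a k-means method and a hierarchical (Ward) method and then cross-tabulating the two sets of memberships:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(2)
# 120 simulated cases on 5 items, drawn around three different means
X = np.vstack([rng.normal(loc, 0.5, size=(40, 5)) for loc in (1.0, 3.0, 5.0)])

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ward_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# With a stable solution, most cases pile up in a few cells: each k-means
# cluster maps mainly onto one hierarchical cluster (the numeric labels
# themselves may be permuted across methods)
print(pd.crosstab(kmeans_labels, ward_labels,
                  rownames=['k-means'], colnames=['hierarchical (Ward)']))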

As an analogy, think of the "Sorting Hat" in the Harry Potter series, which assigns new students at Hogwarts School into one of the four houses (clusters). Imagine that Headmaster Dumbledore decides to run a quality check on the Sorting Hat, bringing in another hat to independently conduct a second sorting of the kids into houses, so it can be seen if the two hats arrive at similar solutions. In the following hypothetical set of results, the two hats indeed arrive at largely similar solutions, although there are a few disagreements.


And, of course, we have a song...

Run Me Off a Cluster
Lyrics by Alan Reifman
May be sung to the tune of “Hey, Soul Sister” (Monahan/Bjørklund/Lind)

O-K, what we’ll learn, today,
What we’ll learn, today,

Are there, groupings?
Of participants, with the same things?
We use, formulas, for distance,
Compactness, is our insistence,
Within each cluster,

But different sets,
Should be as, far apart as,
Distance gets, oh yeah,
So pick an operation,
There are two main, realizations,
Of clustering techniques,

Hey there, Buster,
Run me off, a cluster...ing, analysis,
A synthesis, to help group our participants,

Hey there, Buster,
Run me off, a cluster...ing, analysis,
Tonight,

Hey, hey,
Hey, hey, hey, hey, hey,
Hey, hey, hey, hey, hey,

A dendrogram,
Is at the heart of,
Hierarchy-based plans, oh yeah,
Each case starts out, as a cluster,
Into pairs, they all will muster,
Then form, larger groups,

In number space,
Clusters and k-mean spots,
Will take their place, oh yeah,
Testing maps by, iteration,
Till it hits, the final station,
Then we’ll all take, a vacation,

Hey there, Buster,
Run me off, a cluster...ing, analysis,
A synthesis, to help group our participants,

Hey there, Buster,
Run me off, a cluster...ing, analysis,
Tonight,

There’s no official way,
To decide, how many clusters stay,
There are some criteria,
To take consideration of,

Your clusters, need interpretation,
On items, from their derivation,
I hope you find, some cool combinations,

Hey there, Buster,
Run me off, a cluster...ing, analysis,
A synthesis, to help group our participants,

Hey there, Buster,
Run me off, a cluster...ing, analysis,
Tonight,

Hey, hey,
Hey, hey, hey, hey, hey,
Hey, hey, hey, hey, hey,

Tonight,

Hey, hey,
Hey, hey, hey, hey, hey,
Hey, hey, hey, hey, hey,

Tonight...