Friday, November 27, 2015

Brief Overview of Advanced Topics: HLM, Missing Data, Big Data

Given that this was my first time teaching this course (Fall 2015), I could only guess how much time each topic would take to cover. As it turned out, we ran out of time for three topics, so I wanted to provide brief overviews of them.

Hierarchical Linear Modeling (HLM; also known as Multilevel Modeling). Updated Nov. 18, 2016. When participants are naturally organized in a hierarchical structure (e.g., students within classes, classes within schools), HLM is a suitable statistical approach. The researcher can study dynamics at both a more micro level (in the classroom) and a more macro level (at the school).

For example, an elementary school may have 20 classrooms, each with 30 students (total N = 600). One approach would be to analyze relations among your variables of interest (e.g., students' academic efficacy, work habits, and grades) simply within your 600-student sample, not taking into account who each student's teacher is (an undifferentiated sample). However, statistical research (as well as several popular movies, such as this, this, and this) suggests that particular teachers can have an impact on students.

A variation on this approach would be to use the 600-student sample, adding a series of 19 dummy variables (one for each teacher's name, e.g., "Smith," "Jones," etc., with one teacher omitted as a reference category) to indicate whether a student had (scored as 1) or did not have (scored as 0) a particular teacher (dummy-variable refresher; a brief coding sketch appears below). Doing so will reveal relationships between variables (e.g., efficacy, work habits) in the 600 students, controlling for which teacher students had.
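
For readers who want to see what this dummy coding looks like in practice, here is a minimal sketch in Python using pandas. The column names and values are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical student-level data: each row is one student.
df = pd.DataFrame({
    "teacher":  ["Smith", "Jones", "Smith", "Lee"],
    "efficacy": [3.2, 4.1, 2.8, 3.9],
    "grade":    [82, 91, 75, 88],
})

# drop_first=True omits one teacher as the reference category, so k
# teachers yield k - 1 dummy variables (19 dummies for 20 teachers).
dummies = pd.get_dummies(df["teacher"], prefix="teacher", drop_first=True)
df = pd.concat([df, dummies], axis=1)
print(df)
```

Each dummy column could then be entered alongside the substantive predictors in an ordinary regression.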

Yet another approach would be to average all the variables within each class, creating a dataset with an N of 20 (i.e., each class would be a case). Each class would have variables such as average student efficacy, average student work habits, and average student grade. In doing so, however, one commits the well-known ecological fallacy. The 2016 U.S. presidential election provides an example:
  • At the state level, six of the eight states with the largest African-American populations (as a percent of the state's total population) went Republican (this figure counts the District of Columbia as a state). Before we conclude that African-American voters strongly support the Republicans and Donald Trump, however, we should note that...
  • At the individual level, national exit polls estimate that 88% of Black voters went for the Democrat, Hillary Clinton.
As these examples should make clear, neither a purely individualistic analysis (600 students, not taking teachers into account) nor a purely group-level analysis (20 classrooms, with all students' scores on each variable averaged together within a class) is optimal.

The need for a better analytic technique sets the stage for HLM. Here's an introductory video by Mark Tranmer (we'll start at the 10:00 point). Before watching the video, here are some important concepts to know (a brief modeling sketch follows the list):
  • Intercept (essentially, the predicted value of the dependent variable y when all predictors x are set to zero) and slope (relationship between x and y) of a line (click here for review).
  • Fixed vs. random effects (summary).
  • Individual- vs. group-level variables. Variables such as age and gender describe properties of an individual (e.g., 11-year-old girl, 9-year-old boy). Variables such as the age of school buildings (mentioned in the video) and the number of books in the library are characteristics of the school itself and hence are school- or group-level. Finally, individual characteristics are sometimes averaged or otherwise aggregated to create a group-level variable (e.g., a school's percentage of students on free or reduced-cost meals, as an indicator of economic status). Whether a specific student receives a free/reduced-cost meal at school is an individual characteristic, but when portrayed as an average or percentage for the school, it becomes a group (school) characteristic. We might call these "derived" or "constructed" group-level variables, as they're assembled from individual-level data, as opposed to being inherently school/group-level (like building age).
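To make the transition from these concepts to an actual analysis more concrete, here is a minimal sketch of a random-intercept multilevel model in Python's statsmodels package (one of several programs that can fit such models). The file name and variable names are hypothetical:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical student-level data: one row per student, with a column
# identifying each student's classroom (the grouping/level-2 unit).
df = pd.read_csv("students.csv")  # columns: grade, efficacy, work_habits, classroom

# Random-intercept model: efficacy and work_habits are fixed effects,
# while each classroom receives its own intercept.
model = smf.mixedlm("grade ~ efficacy + work_habits",
                    data=df, groups=df["classroom"])
result = model.fit()
print(result.summary())
```

The groups argument is what separates this from ordinary regression: students are treated as nested within classrooms, so classroom-to-classroom differences in average grades are modeled explicitly rather than ignored.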
One of my Master's students several years ago used a simplified version of HLM to study residents of different European nations. The student was interested in predictors of egalitarian attitudes toward household tasks and childcare, both among individuals within a given country and between countries. The reference is:

Apparala, M. L., Reifman, A., & Munsch, J. (2003). Cross-national comparison of attitudes toward fathers' and mothers' participation in household tasks and childcare. Sex Roles, 48, 189–203.

Here's a link to the United Nations' Gender Empowerment Measure, which was a country-level predictor of citizens' attitudes toward couple egalitarianism in doing household tasks and childcare.


Missing Data. Nearly all datasets will have missing data, whether through participants declining to answer certain questions, accidentally overlooking items, or other reasons. Increasingly sophisticated techniques for dealing with missing data have been developed, but researchers must be careful that they're using appropriate methods. A couple of excellent overview articles on handling missing data are the following:

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012–1028.
 
Schlomer, G. L., Bauman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57, 1–10.

Also see additional resources in the links section to the right.

When one hears the term "mean substitution," it is important to distinguish between two possible meanings.
  • One involves filling in a participant's missing value on a variable with the mean for the sample on the same variable (i.e., sample-mean substitution). This approach appears to be widely discouraged. 
  • A second meaning involves a participant lacking data on a small number of items that are part of a multiple-item instrument (e.g., someone who answered only eight items on a 10-item self-esteem scale). In this case, it seems acceptable to let the mean of the answered items (that respondent's personal mean) stand in for what the mean of all the items would have been had no answers been omitted. Essentially, one assumes that the respondent would have answered the omitted items the same way he/she did the completed items. I personally use a two-thirds rule for personal-mean substitution (e.g., with nine items on a scale, I would require at least six to be answered; otherwise, the respondent would receive a "system-missing" value on the scale; see the sketch below). See also the spot on this website where it says, "Allow for missing values."
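
Here is a minimal sketch of personal-mean substitution with the two-thirds rule in Python (pandas); the scale and responses are made up for illustration:

```python
import numpy as np
import pandas as pd

def personal_mean_score(items: pd.DataFrame, min_fraction: float = 2/3) -> pd.Series:
    """Scale scores via personal-mean substitution.

    Each respondent's answered items are averaged, but only if at least
    min_fraction of the items were answered (the two-thirds rule);
    otherwise the scale score is left missing (NaN), the analogue of
    SPSS's "system-missing" value.
    """
    n_items = items.shape[1]
    n_answered = items.notna().sum(axis=1)
    means = items.mean(axis=1, skipna=True)  # mean of answered items only
    return means.where(n_answered >= np.ceil(min_fraction * n_items))

# Made-up 10-item scale (responses 1-5); NaN marks unanswered items.
rng = np.random.default_rng(0)
items = pd.DataFrame(rng.integers(1, 6, (5, 10)).astype(float))
items.iloc[0, :4] = np.nan  # respondent 0 answered only 6 of 10 items
print(personal_mean_score(items))  # respondent 0 comes back missing
```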

Big Data. The amount of capturable data people generate, in commerce, health care, law enforcement, sports, and other domains, is truly mind-boggling. We will watch a brief video (see links section to the right) listing numerous examples. The study of Big Data (also known as "Data Mining") applies statistical techniques such as correlation and cluster analysis to discern patterns in the data and make predictions.
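
As a toy illustration of the cluster-analysis side, here is a sketch using scikit-learn's k-means routine on made-up customer-spending data (the segments and numbers are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up data: rows are customers, columns are annual spending (in
# dollars) in two product categories.
rng = np.random.default_rng(42)
spending = np.vstack([
    rng.normal([100, 20], 10, (50, 2)),  # one invented shopper segment
    rng.normal([30, 90], 10, (50, 2)),   # another invented segment
])

# Cluster analysis: partition customers into groups with similar patterns.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(spending)
print(kmeans.cluster_centers_)

# Correlation between the two spending categories across all customers.
print(np.corrcoef(spending[:, 0], spending[:, 1])[0, 1])
```

Real applications differ mainly in scale: the same techniques run on millions of rows and many more variables.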

Among the many books appearing on the topic of Big Data, I would recommend Supercrunchers (by Ian Ayres) on business applications, The Victory Lab (by Sasha Issenberg) on political campaigns, and three books pertaining to baseball: Moneyball (by Michael Lewis) on the Oakland Athletics; The Extra 2% (by Jonah Keri) on the Tampa Bay Rays; and Big Data Baseball (by Travis Sawchik) on the Pittsburgh Pirates. These three teams play in relatively small cities by Major League Baseball standards and thus bring in less local-television revenue than the big-city teams. Therefore, teams such as Oakland, Tampa Bay, and Pittsburgh must use statistical techniques to (a) discover "deceptively good" players, whose skills are not well-known to other teams and who can thus be paid non-exorbitant salaries, and (b) identify effective strategies that other teams aren't using (yet). The Pirates' signature strategy, now copied by other teams, is the defensive shift, which positions fielders according to statistical "spray charts" unique to each batter.

To end the course, here's one final song...

Big Data
Lyrics by Alan Reifman
May be sung to the tune of “Green Earrings” (Fagen/Becker for Steely Dan)

Sales, patterns,
Companies try,
Running equations,
To predict, what we’ll buy,

Big data,
Lots of numbers,
Floating in, the cloud,
For computers,
To analyze, now,
We know, how,

Sports, owners,
Wanting to win,
Seeking, advantage,
In the numbers, it’s no sin,

Big data,
Lots of numbers,
Floating in, the cloud,
For computers,
To analyze, now,
We know, how

Instrumentals/solos

Big data,
Lots of numbers,
Floating in, the cloud,
For computers,
To analyze, now,
We know, how

Instrumentals/solos

Tuesday, November 17, 2015

Multidimensional Scaling

Updated November 23, 2015

Multidimensional Scaling (MDS) is a descriptive technique for uncovering the underlying dimensions or structure behind a set of objects. For a given set of objects, the similarity or dissimilarity between each pair must first be determined. This MDS overview document presents different ways of operationalizing similarity/dissimilarity. The result is a visual diagram in which more-similar objects appear physically closer to each other. As with other techniques we've learned this semester (e.g., log-linear models, cluster analysis), there is no "official" way to determine which of the different possible solutions to accept. MDS provides various guidelines for how many dimensions to accept, one of which is the "stress" value (p. 13 of the linked article).

The input to MDS in SPSS is either a similarity matrix (e.g., how similar is Object A to Object B? how similar is A to C? how similar is B to C?) or a dissimilarity/distance matrix. Zeroes are placed along the diagonal of the matrix, as the comparison of an object with itself carries no information for the analysis.
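
For readers working outside SPSS, here is a minimal sketch of the same idea using scikit-learn's MDS routine, with a made-up four-object distance matrix. (Note that scikit-learn reports raw stress, not the normalized Kruskal stress value discussed in the overview document.)

```python
import numpy as np
from sklearn.manifold import MDS

# Made-up symmetric dissimilarity (distance) matrix for four objects,
# with zeroes on the diagonal.
D = np.array([
    [0.0, 2.0, 5.0, 6.0],
    [2.0, 0.0, 4.0, 5.0],
    [5.0, 4.0, 0.0, 1.0],
    [6.0, 5.0, 1.0, 0.0],
])

# dissimilarity="precomputed" tells MDS the distances are supplied
# directly; the output is a 2-D map in which similar objects plot
# close together.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords)
print(mds.stress_)  # raw stress; lower values indicate a better fit
```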

A video on running MDS in SPSS can be accessed via the links column to the right. Once you select your visual solution, you get to name the dimensions, based on where the objects appear in the graph. The video illustrates one particular MDS program called PROXSCAL, in which the numerical values in the input matrix can represent either similarities (i.e., higher numbers = greater similarity) or distances (i.e., higher numbers = greater dissimilarity).

However, the SPSS version we have in our computer lab does not provide access to PROXSCAL (not easily, at least) and only makes a program called ALSCAL readily available. In ALSCAL, higher numbers in the input matrix are read only as distances.

This is presumably where our initial analysis in last Thursday's class went awry. In trying to map the dimensions underlying our Texas Tech Human Development and Family Studies faculty members' research interests, we used the number of times each pair of faculty members had been co-authors on the same article as the measure of similarity. A high number of co-authorships would thus signify that the two faculty members in question had similar research interests. However, ALSCAL treats high numbers as indicative of greater distance (which I failed to catch at the time), thus messing up our analysis.

Once the numbers in the matrix are reverse-scored, so that a high number of co-authorships between a pair of faculty members is converted to a low distance value, the MDS graph becomes more understandable (a sketch of one such conversion appears below). Following that is an annotated screen capture from SPSS. (The graph does not show some of our newer faculty members, who would not yet have had much opportunity to publish with their faculty colleagues, or some of our faculty members who publish primarily with graduate students or with faculty from outside Texas Tech.)
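
Here is a minimal sketch of one such reverse-scoring in Python, using a made-up co-authorship matrix (an illustration of the general idea, not necessarily the exact transformation we used in class):

```python
import numpy as np

# Made-up co-authorship counts (a similarity matrix: higher numbers =
# more shared publications = more similar research interests).
coauthorships = np.array([
    [0., 6., 1.],
    [6., 0., 0.],
    [1., 0., 0.],
])

# Subtract each count from the maximum, so frequent co-authors end up
# with small distances; the diagonal is reset to zero.
distances = coauthorships.max() - coauthorships
np.fill_diagonal(distances, 0.0)
print(distances)  # suitable input for a distance-based program like ALSCAL
```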


The stress values shown in the output are somewhere between good and fair, according to the overview document linked above.


And now, our song for this topic...

Multidimensional Scaling is Fun
Lyrics by Alan Reifman
May be sung to the tune of “Minuano (Six Eight)” (Metheny/Mays)

YouTube video of performance here.

Multidimensional scaling, is fun,
Multidimensional scaling, is fun, to run, yeah,
Measure the objects’ similarities,
Or you can enter, as disparities, yes you can,

Multidimensional scaling, is fun,
SPSS is one place, that it’s done,
Submit your matrix in, and a spatial map, will come out,
Multidimensional scaling, is fun,

Multidimensional scaling, is fun,
Multidimensional scaling, is fun, to run, yeah,
Aim for a stress value, below point-ten,
Or you’ll have to run, your model again, yes you will,

Multidimensional scaling, is fun,
ALSCAL is one version, that you can run,
Submit your matrix in, and a spatial map, will come out,
Multidimensional scaling, is fun,

(Guitar improvisation 1:39-3:38, then piano and percussion interlude 3:38-4:55)

Multidimensional scaling, is fun,
Multidimensional scaling, is fun, to run, yeah,
Measure the objects’ similarities,
Or you can enter, as disparities, yes you can,

Multidimensional scaling, is fun,
SPSS is one place, that it’s done,
Submit your matrix in, and a spatial map, will come out,
Multidimensional scaling, is fun,

Multidimensional scaling, is fun,
Multidimensional scaling, is fun, to run, yeah,
Aim for a stress value, below point-ten,
Or you’ll have to run, your model again, yes you will,

Multidimensional scaling, is fun,
PROXSCAL’s another version, you can run,
Submit your matrix in, and a spatial map, will come out,
Multidimensional scaling, is fun,

Yes it’s fun,
Yes it’s fun,
Yes it’s fun,
Let it run!