Friday, November 27, 2015

Brief Overview of Advanced Topics: HLM, Missing Data, Big Data

Given that this was my first time teaching this course (Fall 2015), I could only guess as to how much time each topic would take to cover. As it turned out, we ran out of time to cover three topics, so I wanted to provide brief overviews of them.

Hierarchical Linear Modeling (HLM; also known as Multilevel Modeling). Updated Nov. 18, 2016. When participants are naturally organized in a hierarchical structure (e.g., students within classes, classes within schools), HLM is a suitable statistical approach. The researcher can study dynamics at both a more micro level (in the classroom) and more macro level (at the school).

For example, an elementary school may have 20 classrooms, each with 30 students (total N = 600). One approach would be to analyze relations among your variables of interest (e.g., students' academic efficacy, work habits, and grades) simply within your 600-student sample, not taking into account who each student's teacher is (an undifferentiated sample). However, statistical research (as well as several popular movies, such as this, this, and this) suggests that particular teachers can have an impact on students.

A variation on this approach would be to use the 600-student sample, adding a series of 19 dummy variables (a dummy variable for each teacher's name, e.g., "Smith," "Jones," etc., with one teacher omitted as a reference category) to indicate if a student had (scored as 1) or did not have (scored as 0) a particular teacher (dummy-variable refresher). Doing so will reveal relationships between variables (e.g., efficacy, work habits) in the 600 students, controlling for which teacher students had.
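The dummy-coding scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the actual course dataset: the teacher names are hypothetical, only three of the 20 teachers are shown for brevity, and "Smith" is arbitrarily chosen as the omitted reference category.

```python
# Sketch of dummy coding for teachers (hypothetical names).
# One teacher ("Smith") is omitted as the reference category, so a
# class roster of k teachers yields k - 1 dummy variables.
teachers = ["Smith", "Jones", "Garcia"]  # 3 of the 20 teachers, for brevity
reference = "Smith"
dummy_names = [t for t in teachers if t != reference]

def dummy_code(student_teacher):
    """Return the 0/1 dummy-variable values for one student."""
    return {t: int(student_teacher == t) for t in dummy_names}

# A student in Jones's class is scored Jones=1, Garcia=0; a student in
# Smith's class scores 0 on every dummy (the reference pattern).
print(dummy_code("Jones"))   # {'Jones': 1, 'Garcia': 0}
print(dummy_code("Smith"))   # {'Jones': 0, 'Garcia': 0}
```

In a real analysis these k - 1 columns would be entered alongside efficacy and work habits as predictors in the regression.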

Yet another approach would be to average all the variables within each class, creating a dataset with an N of 20 (i.e., each class would be a case). Each class would have variables such as average student efficacy, average student work habits, and average student grade. In doing so, however, one commits the well-known ecological fallacy. The recent 2016 U.S. presidential election provides an example:
  • At the state level, six of the eight states with the largest African-American populations (as a percent of the state's total population) went Republican (this figure counts the District of Columbia as a state). Before we conclude that African-American voters strongly support the Republicans and Donald Trump, however, we should note that...
  • At the individual level, national exit polls estimate that 88% of Black voters went for the Democrat, Hillary Clinton.
As these examples should make clear, neither a purely individualistic analysis (600 students, not taking teachers into account) nor a purely group-level analysis (20 classrooms, with all students' scores on each variable averaged together within a class) is optimal.
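The ecological fallacy can also be demonstrated numerically. The toy data below are fabricated purely for illustration (two "classes" of three students each): within each class the x-y relationship is perfectly negative, yet an analysis of the two class averages yields a perfectly positive correlation.

```python
# Toy demonstration of the ecological fallacy: the group-level
# (averaged) correlation need not match the individual-level one.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

class_a = {"x": [1, 2, 3], "y": [3, 2, 1]}   # within-class r = -1
class_b = {"x": [4, 5, 6], "y": [6, 5, 4]}   # within-class r = -1

# Group-level analysis: one averaged case per class
mean_x = [mean(class_a["x"]), mean(class_b["x"])]
mean_y = [mean(class_a["y"]), mean(class_b["y"])]

print(pearson_r(class_a["x"], class_a["y"]))  # -1.0
print(pearson_r(mean_x, mean_y))              # 1.0
```

Concluding from the group-level r of +1.0 that individual students with higher x have higher y would be exactly the fallacy described above.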

The need for a better analytic technique sets the stage for HLM. Here's an introductory video by Mark Tranmer (we'll start at the 10:00 point). Before watching the video, here are some important concepts to know:
  • Intercept (the predicted value of a dependent variable y when all predictors x are set to zero) and slope (the relationship between x and y) of a line (click here for review).
  • Fixed vs. random effects (summary).
  • Individual- vs. group-level variables. Variables such as age and gender describe properties of an individual (e.g., 11-year-old girl, 9-year-old boy). Variables such as age of school buildings (mentioned in the video) and number of books in the library are characteristics of the school itself and hence are school- or group-level. Finally, individual person characteristics are sometimes averaged or otherwise aggregated to create a group-level variable (e.g., schools' percentages of students on free or reduced-cost meals, as an indicator of economic status). Whether a specific student receives a free/reduced-cost meal at school is an individual characteristic, but when portrayed as an average or percentage for the school, it becomes a group (school) characteristic. We might call these "derived" or "constructed" group-level variables, as they're assembled from individual-level data, as opposed to being inherently school/group-level (like building age).
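To make the fixed-vs.-random distinction concrete, here is a small simulation sketch of the two-level idea behind HLM: each class draws its own intercept (a random effect), while the efficacy-to-grade slope is held fixed across classes. All of the numbers (grand intercept of 70, slope of 2.5, the variances) are made up for illustration, matching the 20-classes-of-30 example above.

```python
# Sketch of a two-level (random-intercept) data structure, with
# fabricated parameter values. Each class j gets its own intercept
# (grand intercept + u_0j), but the slope is the same in every class.
import random

random.seed(1)
grand_intercept = 70.0   # overall expected grade when efficacy = 0
fixed_slope = 2.5        # individual-level effect of efficacy on grade

def simulate_class(n_students):
    """Generate one class: its own intercept plus n students."""
    class_intercept = grand_intercept + random.gauss(0, 5)   # u_0j
    students = []
    for _ in range(n_students):
        efficacy = random.gauss(5, 1)                        # x_ij
        grade = (class_intercept + fixed_slope * efficacy
                 + random.gauss(0, 3))                       # e_ij
        students.append((efficacy, grade))
    return class_intercept, students

# 20 classes of 30 students (N = 600): intercepts vary by class,
# the slope does not.
classes = [simulate_class(30) for _ in range(20)]
print(len(classes))                   # 20
```

Fitting the reverse problem, estimating those variance components from real data, is what HLM software (or, e.g., mixed-model routines) does.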
One of my Master's students several years ago used a simplified version of HLM to study residents of different European nations. The student was interested in predictors of egalitarian attitudes toward household tasks and childcare, both among individuals within a given country and between countries. The reference is:

Apparala, M. L., Reifman, A., & Munsch, J. (2003). Cross-national comparison of attitudes toward fathers' and mothers' participation in household tasks and childcare. Sex Roles, 48, 189-203.

Here's a link to the United Nations' Gender Empowerment Measure, which was a country-level predictor of citizens' attitudes toward couple egalitarianism in doing household tasks and childcare.


Missing Data. Nearly all datasets will have missing data, whether through participants declining to answer certain questions, accidentally overlooking items, or other reasons. Increasingly sophisticated techniques for dealing with missing data have been developed, but researchers must be careful that they're using appropriate methods. A couple of excellent overview articles on handling missing data are the following:

Acock, A. C. (2005). Working with missing values. Journal of Marriage and Family, 67, 1012–1028.
 
Schlomer, G. L., Bauman, S., & Card, N. A. (2010). Best practices for missing data management in counseling psychology. Journal of Counseling Psychology, 57, 1–10.

Also see additional resources in the links section to the right.

When one hears the term "mean substitution," it is important to distinguish between two possible meanings.
  • One involves filling in a participant's missing value on a variable with the mean for the sample on the same variable (i.e., sample-mean substitution). This approach appears to be widely discouraged. 
  • A second meaning involves a participant lacking data on a small number of items that are part of a multiple-item instrument (e.g., someone who answered only eight items on a 10-item self-esteem scale). In this case, it seems acceptable to let the mean of the answered items (that respondent's personal mean) stand in for what the mean of all the items on the instrument would have yielded had no answers been omitted. Essentially, one is assuming that the respondent would have answered the same way on the items left blank that he/she did on the completed items. I personally use a two-thirds rule for personal-mean substitution (e.g., if there were nine items on the scale, I would require at least six of the items to be answered; if not, the respondent would receive a "system-missing" value on the scale). See on this website where it says, "Allow for missing values."
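The personal-mean substitution just described, with the two-thirds rule, can be sketched as a short function. This is my own illustration of the rule, not code from the linked website; None stands in for a skipped item, and the example responses are invented.

```python
# Sketch of personal-mean substitution with a two-thirds rule:
# if at least 2/3 of a scale's items were answered, score the scale
# as the mean of the answered items; otherwise mark the whole scale
# missing ("system-missing").
from statistics import mean

def scale_score(item_responses, min_fraction=2/3):
    """Score a multi-item scale; None marks an unanswered item."""
    answered = [r for r in item_responses if r is not None]
    if len(answered) / len(item_responses) < min_fraction:
        return None  # system-missing
    return mean(answered)

# 8 of 10 items answered (80% >= 2/3) -> personal mean of the 8
print(scale_score([4, 3, 4, None, 5, 4, None, 3, 4, 5]))      # 4.0
# Only 5 of 9 answered (~56% < 2/3) -> scale is system-missing
print(scale_score([4, 3, None, None, 5, None, 3, None, 4]))   # None
```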

Big Data. The amount of capturable data people generate is truly mind-boggling, from commerce, health care, law enforcement, sports, and other domains. We will watch a brief video (see links section to the right) listing numerous examples. The study of Big Data (also known as "Data Mining") applies statistical techniques such as correlation and cluster analysis to discern patterns in the data and make predictions.
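As a taste of the cluster analysis mentioned above, here is a tiny one-dimensional k-means sketch, grouping customers by annual spending. The spending figures, starting centers, and k = 3 are all fabricated for illustration; real Big Data work would use many variables and specialized software.

```python
# Illustrative 1-D k-means: repeatedly (a) assign each value to its
# nearest cluster center and (b) move each center to the mean of its
# assigned values, until the centers settle.
from statistics import mean

def kmeans_1d(values, centers, iterations=10):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # (a) Assign each value to the nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # (b) Update each center to the mean of its cluster
        centers = [mean(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical annual spending (in dollars) for eight customers
spending = [120, 130, 110, 980, 1010, 990, 450, 470]
centers, clusters = kmeans_1d(spending, centers=[100, 500, 1000])
print(sorted(round(c) for c in centers))   # [120, 460, 993]
```

The three recovered centers correspond to low-, medium-, and high-spending segments, the kind of pattern a company might use to target marketing.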

Among the many books appearing on the topic of Big Data, I would recommend Supercrunchers (by Ian Ayres) on business applications, The Victory Lab (by Sasha Issenberg) on political campaigns, and three books pertaining to baseball: Moneyball (by Michael Lewis) on the Oakland Athletics; The Extra 2% (by Jonah Keri) on the Tampa Bay Rays; and Big Data Baseball (by Travis Sawchik) on the Pittsburgh Pirates. These three teams play in relatively small cities by Major League Baseball standards and thus bring in less local-television revenue than the big-city teams. Therefore, teams such as Oakland, Tampa Bay, and Pittsburgh must use statistical techniques to (a) discover "deceptively good" players, whose skills are not well-known to other teams and who can thus be paid non-exorbitant salaries, and (b) identify effective strategies that other teams aren't using (yet). The Pirates' signature strategy, now copied by other teams, is the defensive shift, using statistical "spray charts" unique to each batter.

To end the course, here's one final song...

Big Data
Lyrics by Alan Reifman
May be sung to the tune of “Green Earrings” (Fagen/Becker for Steely Dan)

Sales, patterns,
Companies try,
Running equations,
To predict, what we’ll buy,

Big data,
Lots of numbers,
Floating in, the cloud,
For computers,
To analyze, now,
We know, how,

Sports, owners,
Wanting to win,
Seeking, advantage,
In the numbers, it’s no sin,

Big data,
Lots of numbers,
Floating in, the cloud,
For computers,
To analyze, now,
We know, how

Instrumentals/solos

Big data,
Lots of numbers,
Floating in, the cloud,
For computers,
To analyze, now,
We know, how

Instrumentals/solos