Wednesday, August 26, 2015

Preliminary Data-Analysis Issues

Before we get to the actual multivariate techniques, let's review some principles of data analysis and management. I want to review three principles, in particular: types of variables; dummy variables for including nominal variables in quantitative analyses; and controlling for (or "holding constant") extraneous variables.

Types of variables. We'll review nominal, ordinal, and ratio variables, as I describe on the following webpage (scroll about halfway down).

Dummy variables. For nominal (i.e., qualitative) variables such as race-ethnicity, religious affiliation, and favorite ice-cream flavor, the scoring system is arbitrary and there's no logic to a 1 => 2 => 3 => 4 => 5 progression, the way there would be for an ordinal (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree) or ratio (pounds, inches, minutes, seconds) variable. Correlating the original 1-through-5 race-ethnicity variable in the following chart with a quantitative variable such as age would be nonsensical. Therefore, the original race-ethnicity variable must be converted to a set of dummy (1, 0) variables as shown in the chart. (You can click on the graphics to enlarge them.)


As a concrete example, Black/African-American was coded as 2 on the original race-ethnicity variable, but now, members of this category receive a score of 1 (for "yes") on the newly created dummy variable known as "Black." Members of all the non-Black categories get a 0 (for "no") on the "Black" variable. A similar logic holds in the creation of dummy variables called "Hispanic," "Asian," and "Other." Such conversions can be done in SPSS, using the Recode technique under the Transform tab.
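For readers who don't use SPSS, the same recoding can be sketched in a few lines of Python. This is only an illustration: the text above confirms that Black was coded 2 and that White is left without a dummy, but the codes I assign to Hispanic, Asian, and Other (3, 4, 5), the function name `make_dummies`, and the sample participants are all assumptions made up for the example.

```python
# Assumed coding scheme: 1 = White, 2 = Black (as stated in the post),
# 3 = Hispanic, 4 = Asian, 5 = Other (the last three are hypothetical).
DUMMY_CODES = {"Black": 2, "Hispanic": 3, "Asian": 4, "Other": 5}


def make_dummies(race_code):
    """Return the four 1/0 dummy scores for one participant's original code."""
    return {name: int(race_code == code) for name, code in DUMMY_CODES.items()}


# Made-up participants, one original race-ethnicity code each.
participants = [1, 2, 3, 4, 5, 2]
rows = [make_dummies(code) for code in participants]

for code, row in zip(participants, rows):
    print(code, row)
```

A participant coded 2 gets a 1 on "Black" and 0 on the other three dummies; a participant coded 1 (White) gets 0 on all four.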

Before doing the conversion, however, one of the original categories must be excluded from the set of new dummy variables for mathematical reasons. In this example, "White" is the excluded category, also known as the "referent" category. Notice in the above chart that, even though there is no dummy variable called "White," the computer can still recognize which participants are White from the fact that White participants have all zeroes ("no") on the Black, Hispanic, Asian, and Other dummy variables. Creating a fifth dummy variable for White (1 = yes, 0 = no) would thus be redundant: the five dummies would always sum to exactly 1 for every participant, duplicating the constant term in a regression, and the software would generate error messages!
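The redundancy is easy to verify by hand. The sketch below builds all five dummies (including a hypothetical "White" dummy) for a handful of made-up participants, again assuming the 1-through-5 coding with White = 1, and checks that the fifth dummy carries no information beyond the other four.

```python
# Made-up original codes (assumed scheme: 1 = White, 2 = Black,
# 3 = Hispanic, 4 = Asian, 5 = Other).
codes = [1, 2, 3, 4, 5, 1, 2]

# Full set of FIVE dummies per participant, in the order
# White, Black, Hispanic, Asian, Other.
full = [[int(code == k) for k in range(1, 6)] for code in codes]

# Every row sums to 1, so the five columns together duplicate a
# column of all ones (the regression constant).
row_sums = [sum(row) for row in full]

# Equivalently, "White" can be rebuilt from the other four dummies.
white_recovered = [1 - sum(row[1:]) for row in full]
white_actual = [row[0] for row in full]

print(row_sums)
print(white_recovered == white_actual)
```

Because the White column can always be reconstructed from the other four, including it gives the regression no unique solution, which is why one category has to be left out.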

With White as the referent category, each other group gets compared to White: Black vs. White, Hispanic vs. White, etc. There is no direct comparison between Black and Hispanic, for example. To use a sports analogy, suppose someone wanted to determine which was the better football team in the 2015 season, the University of Texas or the University of Arkansas. They didn't play each other that season, but both played against Texas Tech. We could thus see whether UT or Arkansas did better against Tech, with Tech serving as a "referent category" of sorts. Back in the domain of real data analysis, considerations in choosing which category to use as the referent are discussed here.

The next chart shows how the set of dummy variables would be used in a regression analysis.


If a variable is already dichotomous (e.g., male = 0, female = 1), it's ready to go, and no further transformations are needed.
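To make the regression use of dummy variables concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the outcome scores are made up, the coding scheme (1 = White through 5 = Other) is the hypothetical one used for the chart, and the helper functions `solve` and `ols` are my own small implementations of ordinary least squares via the normal equations, not what SPSS does internally.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column up.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data: two participants per group, with an invented outcome score.
codes = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y = [10, 12, 14, 16, 9, 11, 20, 22, 13, 15]

# Design matrix: constant, then Black, Hispanic, Asian, Other dummies.
# White (code 1) is the referent, so it gets no column of its own.
X = [[1] + [int(code == k) for k in range(2, 6)] for code in codes]

coefs = ols(X, y)
labels = ["Constant (White mean)", "Black", "Hispanic", "Asian", "Other"]
for label, coef in zip(labels, coefs):
    print(label, round(coef, 6))
```

With only the dummies as predictors, the constant should equal the mean of the referent group (White), and each dummy's coefficient should equal that group's mean minus the White mean. That is the "each group vs. White" comparison described above.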

Here's an actual research study that used dummy variables (Moussavi, S., et al. 2007. Depression, chronic diseases, and decrements in health: Results from the World Health Surveys. The Lancet).

Statistical control/holding constant. Finally, let's discuss how a researcher can control for (or hold constant) an extraneous (or "lurking") variable, to get a more direct look at the relationship between the variables of primary interest. We'll start with two real-world examples: the association between body mass and successful in vitro fertilization (controlling for the women's age), and the association between having health insurance and cancer survival (controlling for type of cancer; see pp. 96-98).

As another example, let's work through a partial correlation, which, as you learned in QM I, is a correlation between two variables, controlling for one or more extraneous variables.
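For one control variable, the first-order partial correlation can be computed from the three ordinary (zero-order) correlations using the standard formula r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The sketch below applies it to made-up data; the variable names, the numbers, and the helper functions `pearson` and `partial_r` are all hypothetical, chosen only to show the mechanics.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up scores: x and y are the variables of interest, z is the
# extraneous variable (e.g., age) to be held constant.
x = [2, 4, 5, 7, 9, 10]
y = [1, 3, 6, 6, 10, 9]
z = [20, 25, 30, 35, 40, 45]

r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
r_partial = partial_r(r_xy, r_xz, r_yz)

print("zero-order r_xy:", round(r_xy, 3))
print("partial r_xy.z: ", round(r_partial, 3))
```

Comparing the zero-order and partial correlations shows how much of the x-y relationship remains once z is held constant. If z is uncorrelated with both x and y, the two values are identical.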

One additional look at statistical control comes from this article on Analysis of Covariance (see the paragraph on page 2 beginning with "The analysis..." and the paragraph on page 3 beginning with "The logic...").

Statistical control is actually far more challenging than it might at first seem, as journalist Ezra Klein discusses here.

Monday, August 24, 2015

Welcome!

Welcome to our class on multivariate statistical analysis (Quantitative Methods III). This fall (2015) marks 30 years since I took multivariate myself as a graduate student at the University of Michigan, as seen in the following syllabus segment. (I didn't really tear the syllabus; it's just a visual effect to convey that I'm showing only part of the document.)


I am entering my 19th year on the faculty at Texas Tech, yet this is the first time I've taught multivariate. However, I've taught two other graduate stat courses in the department, QM I/Intro and QM IV/SEM, many times. Further, I've published articles using many of the techniques we'll cover in QM III, so I think I'll do OK.

What do we mean by "multivariate statistics"? I think it's most useful to contrast the term with other related ones.

If we're looking at just one variable at a time, then we're dealing with univariate analyses. Here are some examples.

If we're looking at the relationship of two variables, then we're dealing with bivariate analyses. For example, we might want to know how, among adults, age correlates with annual earnings. Or, using a chi-square for nominal variables, we can test whether self-identification with Democratic, Republican, or other political parties differs in parents vs. non-parents.
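As a quick illustration of the chi-square example, the sketch below computes the Pearson chi-square statistic for a 2 x 3 table of parental status by party identification. The counts are invented for the example, and `chi_square` is my own helper written from the textbook formula (sum of (observed - expected)^2 / expected), not a library function.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Made-up counts. Columns: Democratic, Republican, Other.
observed = [
    [30, 20, 10],  # parents
    [20, 20, 20],  # non-parents
]

stat, df = chi_square(observed)
print("chi-square:", round(stat, 3), "df:", df)
```

The statistic would then be compared against a chi-square distribution with the given degrees of freedom (here, 2) to judge whether party identification differs between parents and non-parents.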

Finally, we have multivariate analysis. The prefix "multi" (as in "multiple") would suggest a multivariate analysis is anything using three or more variables. Thus, a multiple-regression analysis featuring one dependent variable and six predictor variables (seven variables total) would count as multivariate. However, some authors confine the term "multivariate" to analyses with multiple dependent variables. Multivariate Analysis of Variance (MANOVA) would be one example under this restricted definition of multivariate. In this course, I plan to be more inclusive in deciding what is a multivariate analysis.