Wednesday, August 26, 2015

Preliminary Data-Analysis Issues

Before we get to the actual multivariate techniques, let's review some principles of data analysis and management. I want to review three principles, in particular: types of variables; dummy variables for including nominal variables in quantitative analyses; and controlling for (or "holding constant") extraneous variables.

Types of variables. We'll review nominal, ordinal, and ratio variables, as I describe on the following webpage (scroll about halfway down).

Dummy variables. For nominal (i.e., qualitative) variables such as race-ethnicity, religious affiliation, and favorite ice-cream flavor, the scoring system is arbitrary and there's no logic to a 1 => 2 => 3 => 4 => 5 progression, the way there would be for an ordinal (strongly disagree, somewhat disagree, neutral, somewhat agree, strongly agree) or ratio (pounds, inches, minutes, seconds) variable. Correlating the original 1-through-5 race-ethnicity variable in the following chart with a quantitative variable such as age would be nonsensical. Therefore, the original race-ethnicity variable must be converted to a set of dummy (1, 0) variables as shown in the chart. (You can click on the graphics to enlarge them.)


As a concrete example, Black/African-American was coded as 2 on the original race-ethnicity variable, but now, members of this category receive a score of 1 (for "yes") on the newly created dummy variable known as "Black." Members of all the non-Black categories get a 0 (for "no") on the "Black" variable. A similar logic holds in the creation of dummy variables called "Hispanic," "Asian," and "Other." Such conversions can be done in SPSS, using the Recode into Different Variables option under the Transform menu.

Before doing the conversion, however, one of the original categories must be excluded from the set of new dummy variables. The reason is mathematical: if every category had its own dummy variable, the dummies would sum to 1 for every participant, so any one of them could be predicted perfectly from the others (perfect multicollinearity in a regression that includes a constant). In this example, "White" is the excluded category, also known as the "referent" category. Notice in the above chart that, even though there is no dummy variable called "White," the computer can still recognize which participants are White from the fact that White participants have all zeroes ("no") on the Black, Hispanic, Asian, and Other dummy variables. Creating a fifth dummy variable for White (1 = yes, 0 = no) would thus be redundant and generate error messages!
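For readers who like to see the recoding spelled out, here is a minimal Python sketch of the same conversion (the course itself uses SPSS, so this is only an illustration). Only "Black = 2" comes from the discussion above; the other numeric codes, the function name dummy_code, and the list of participants are assumptions made up for the example.

```python
# Hypothetical coding scheme: only "Black = 2" comes from the text;
# the remaining numeric codes are assumed for illustration.
CODES = {1: "White", 2: "Black", 3: "Hispanic", 4: "Asian", 5: "Other"}
REFERENT = "White"  # excluded category: it gets no dummy variable of its own


def dummy_code(race_code):
    """Return the 1/0 dummy variables for one participant's original score."""
    category = CODES[race_code]
    return {name: int(name == category)
            for name in CODES.values() if name != REFERENT}


participants = [1, 2, 3, 4, 5, 2]  # made-up original race-ethnicity scores
for code in participants:
    print(code, dummy_code(code))

# A participant coded 2 gets Black = 1 and 0 on everything else;
# a participant coded 1 (White) gets all zeroes.
```

Note that White participants are still identifiable (all four dummies equal 0), which is why no fifth variable is needed.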

With White as the referent category, each other group gets compared to White: Black vs. White, Hispanic vs. White, etc. There is no direct comparison between Black and Hispanic, for example. To use a sports analogy, suppose someone wanted to determine who was the better football team in the 2015 season between University of Texas and University of Arkansas. They didn't play each other that season, but both played against Texas Tech. We could thus see whether UT or Arkansas did better against Tech, with Tech thus serving as a "referent category" of sorts. Back in the domain of real data-analysis, considerations in choosing which category to use as the referent are discussed here.

The next chart shows how the set of dummy variables would be used in a regression analysis.


If a variable is already dichotomous (e.g., male = 0, female = 1), it's ready to go, and no further transformations are needed.
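To make the regression use of dummy variables concrete, here is a rough Python sketch with made-up outcome scores (none of these numbers come from a real study). The small least-squares routine ols, the helper design_row, and the data are all hypothetical; in practice you would let SPSS or another package do the fitting. The sketch illustrates two points from above: each coefficient is that group's mean difference from the referent group, and adding a dummy for the referent makes the model unsolvable.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


GROUPS = ["White", "Black", "Hispanic", "Asian", "Other"]  # White = referent

# Made-up (group, outcome score) pairs, two participants per group.
data = [("White", 10.0), ("White", 12.0), ("Black", 14.0), ("Black", 16.0),
        ("Hispanic", 9.0), ("Hispanic", 11.0), ("Asian", 13.0), ("Asian", 13.0),
        ("Other", 8.0), ("Other", 10.0)]


def design_row(group, include_referent=False):
    """Constant term followed by the 1/0 dummy variables for one participant."""
    names = GROUPS if include_referent else GROUPS[1:]
    return [1.0] + [float(group == g) for g in names]


y = [score for _, score in data]
coefs = ols([design_row(g) for g, _ in data], y)
for name, b in zip(["Intercept (White mean)"] + GROUPS[1:], coefs):
    print(name, round(b, 2))

# Adding a fifth dummy for White makes the predictors redundant.
try:
    ols([design_row(g, include_referent=True) for g, _ in data], y)
except ValueError as err:
    print("Five dummies plus a constant:", err)
```

With these made-up scores, the intercept comes out at about 11 (the White mean), and the Black, Hispanic, Asian, and Other coefficients at about 4, -1, 2, and -2, i.e., each group's mean minus the White mean.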

Here's an actual research study that used dummy variables (Moussavi, S., et al. 2007. Depression, chronic diseases, and decrements in health: Results from the World Health Surveys. The Lancet).

Statistical control/holding constant. Finally, let's discuss how a researcher can control for (or hold constant) an extraneous (or "lurking") variable, to get a more direct look at the relationship between the variables of primary interest. We'll start with two real-world examples: the association between body mass and successful in vitro fertilization (controlling for the women's age), and the association between having health insurance and cancer survival (controlling for type of cancer; see pp. 96-98).

As another example, let's work through a partial correlation, which, as you learned in QM I, is a correlation between two variables, controlling for one or more extraneous variables.
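As a refresher on the arithmetic, here is a small Python sketch using the standard first-order partial correlation formula, r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The data are invented, and deliberately constructed so that x and y are related only through z; the function names pearson_r and partial_r are likewise just for illustration.

```python
import math


def pearson_r(a, b):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    s_aa = sum((ai - mean_a) ** 2 for ai in a)
    s_bb = sum((bi - mean_b) ** 2 for bi in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def partial_r(x, y, z):
    """Correlation between x and y, controlling for a single variable z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up scores: x and y each equal z plus noise unrelated to z or each other.
z = [1, 2, 3, 4]
x = [2, 1, 2, 5]
y = [2, -1, 6, 3]

print("r_xy   =", round(pearson_r(x, y), 3))        # about 0.333
print("r_xy.z =", abs(round(partial_r(x, y, z), 3)))  # approximately 0
```

The zero-order correlation of about .33 disappears once z is held constant, which is the pattern you would expect when a lurking variable accounts for the entire x-y relationship.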

One additional examination of statistical control comes from this article on Analysis of Covariance (see the paragraph on page 2 beginning with "The analysis..." and the paragraph on page 3 beginning with "The logic...").

Statistical control is actually far more challenging than it might at first seem, as journalist Ezra Klein discusses here.