Sunday, September 27, 2015

Discriminant Function Analysis

We've just finished logistic regression, which uses a set of variables to predict status on a two-category outcome, such as whether college students graduate or don't graduate. What if we wanted to make finer distinctions, say into three categories: graduated, dropped out, and transferred to another school?

There is an extension of logistic regression, known as multinomial logistic regression, which uses a series of pairwise comparisons (e.g., dropped out vs. graduated, transferred vs. graduated). See the explanatory PowerPoint in the links section to the right.

Discriminant function analysis (DFA) allows you to put all three (or more) groups into one analysis. DFA uses spatial-mathematical principles to map out the three (or more) groups' locations (with each group having a mean or "centroid") on a system of axes defined by the predictor variables. As a result, you get neat diagrams such as this, this, and this.

DFA, like statistical modeling in general, generates a somewhat oversimplified solution that is accurate for a large proportion of cases, but has some error. An example can be seen in this document (see Figure 4). Classification accuracy is one of the statistics one receives in DFA output.
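
For anyone who wants to experiment outside SPSS, below is a minimal sketch of a three-group DFA in Python, using scikit-learn's LinearDiscriminantAnalysis. The student data, predictor variables, and group sizes are all fabricated for illustration; only the general workflow (fit the functions, map cases onto them, locate the centroids, check classification accuracy) mirrors what SPSS reports.

```python
# A minimal DFA sketch with fabricated three-group student data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical predictors (GPA, credit hours earned) for three groups:
# 0 = graduated, 1 = dropped out, 2 = transferred.
X = np.vstack([
    rng.normal([3.2, 110], [0.4, 15], size=(50, 2)),  # graduated
    rng.normal([2.1, 60],  [0.5, 20], size=(50, 2)),  # dropped out
    rng.normal([2.8, 45],  [0.4, 15], size=(50, 2)),  # transferred
])
y = np.repeat([0, 1, 2], 50)

dfa = LinearDiscriminantAnalysis()
scores = dfa.fit_transform(X, y)  # cases mapped onto the discriminant functions

# Each group's mean location (its "centroid") on the functions:
for g, label in enumerate(["graduated", "dropped out", "transferred"]):
    print(label, scores[y == g].mean(axis=0))

# Classification accuracy, one of the statistics DFA output reports:
print("accuracy:", dfa.score(X, y))
```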

(A solution that would be accurate for all cases might be popular, but wouldn't be useful. As Nate Silver writes in his book The Signal and the Noise, you would have "an overly specific solution to a general problem. This is overfitting, and it leads to worse predictions" (p. 163).)

The axes, known as canonical discriminant functions, are defined in the structure matrix, which shows correlations between your predictor variables and the functions. An example appears in this document dealing with classification of obsidian archaeological finds (see Figure 7-17 and Table 7-18). A warning: Archaeology is a career that often ends in ruins!
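
Continuing the sketch above, a rough stand-in for the structure matrix can be computed by correlating each predictor with the scores on each discriminant function. (This simple version uses total-sample correlations; SPSS's structure matrix uses pooled within-groups correlations, so treat it as an approximation.)

```python
# Approximate structure matrix: correlation of each predictor with
# each discriminant function (total-sample, not pooled within-groups).
pred_names = ["GPA", "credit_hours"]
for j, name in enumerate(pred_names):
    corrs = [np.corrcoef(X[:, j], scores[:, f])[0, 1]
             for f in range(scores.shape[1])]
    print(name, np.round(corrs, 3))
```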

[The presence of groups and coefficients may remind you of MANOVA. According to lecture notes from Andrew Ainsworth, "MANOVA and discriminant function analysis are mathematically identical but are different in terms of emphasis. [Discriminant] is usually concerned with actually putting people into groups (classification) and testing how well (or how poorly) subjects are classified. Essentially, discrim is interested in exactly how the groups are differentiated not just that they are significantly different (as in MANOVA)."]

The following article illustrates a DFA with a mainstream HDFS topic:

Hazan, C., & Shaver, P. R. (1987). Romantic love conceptualized as an attachment process. Journal of Personality and Social Psychology, 52, 511-524.

Finally, this video, as well as this document, explains how to implement and interpret DFA in SPSS. And here's our latest song...

Discriminant! 
Lyrics by Alan Reifman
May be sung to the tune of “Notorious” (LeBon/Rhodes/Taylor for Duran Duran)

Disc-disc-discriminant, discriminant!
Disc-disc-discriminant!

(Funky bass groove)

You’ve got multiple groups, all made from categories,
To predict membership, IV’s can tell their stories,
A technique, you can use,
It’s called discriminant -- the results are imminent,
You get an equation, for who belongs in the sets,

Number of functions, you subtract one, from sets,
To form the functions, you get the coefficients,
These weight the IV’s, to yield a composite score,
These scores determine, how it sorts the people,
That’s how, discriminant runs,

Disc-disc...

You can see in a graph, how all the groups are deployed,
Each group has a home base, which is known, as a “centroid,”
Weighted IV’s on axes, how you keep track -- it's just like, you're reading a map,
See how each group differs, from all the other ones there,

Number of functions, you subtract one, from sets,
To form the functions, you get the coefficients,
These weight the IV’s, to yield a composite score,
These scores determine, how it sorts the people,
That’s how, discriminant runs,

Disc-
Disc-disc...

(Brief interlude)

Discriminant,

Number of functions, you subtract one, from sets,
To form the functions, you get the coefficients,
These weight the IV’s, to yield a composite score, 
These scores determine, how it sorts the people,

Number of functions, you subtract one, from sets,
To form the functions, you get the coefficients,
These weight the IV’s, to yield a composite score,
These scores determine, how it sorts the people,
That’s how, discriminant runs,

Disc-discriminant,
Disc-Disc,
That’s how, discriminant runs,

Disc-
Yeah, that’s how, discriminant runs,

Disc-Disc,

(Sax improvisation)

Yeah...That’s how, discriminant runs,
Disc-discriminant,
Disc-disc-discriminant,
That’s how, discriminant runs,
Disc-discriminant,
Disc-disc-discriminant...

Monday, September 21, 2015

Logistic Regression

This week, we'll review ordinary regression (for quantitative dependent variables such as dollars of earnings or GPA at college graduation) and then begin coverage of logistic regression (for dichotomous DV's). Both kinds of regression accommodate all types of predictor variables (quantitative and categorical/dummy). Logistic regression involves mathematical elements that may be unfamiliar to some, so we'll go over everything step-by-step.

The example we'll work through is a bit unconventional, but one with a Lubbock connection. Typically, our cases are persons. In this example, however, the cases are songs -- Paul McCartney songs. McCartney, of course, was a member of the Beatles (1960-1970), considered by many the greatest rock-and-roll band of all time. After the Beatles broke up, McCartney led a new group called Wings (1971-1981), before performing as a solo act. For many years after the Beatles' break-up, he declined to perform his old Beatles songs, but finally resumed doing so in 1989.

Given that McCartney has a catalog of close to 500 songs (excluding ones written entirely or primarily by other members of the Beatles), the question was which songs he would play in his 2014 Lubbock concert. I obtained lists of songs written by McCartney here and here, and a playlist from his Lubbock concert here. Any given song could be played or not played in Lubbock -- a dichotomous dependent variable. The independent variable was whether McCartney wrote the song while with the Beatles or post-Beatles (for Wings or as a solo performer).

This analysis could be done as a 2 (Beatles/post-Beatles era) X 2 (yes/no played in Lubbock) chi-square, but we'll examine it via logistic regression for illustrative purposes. Note that logistic-regression analyses usually would have multiple predictor variables, not just one. The null hypothesis would be that any given non-Beatles song would have the same probability of being played as any given Beatles song. What I really expected, however, was that Beatles songs would have a higher probability of being played than non-Beatles songs.
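
As a preview of what the SPSS output will show, here is the same kind of analysis sketched in Python with statsmodels. The played/not-played cell counts below are invented for illustration (the real tallies appear in the slides); the point is the setup: a dichotomous DV and a single dummy-coded predictor.

```python
# Hedged sketch of the McCartney example; the cell counts are made up.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# beatles = 1 if written during the Beatles era, 0 if post-Beatles
# played  = 1 if performed at the 2014 Lubbock concert
df = pd.DataFrame({
    "beatles": [1] * 200 + [0] * 300,
    "played":  [1] * 25 + [0] * 175 + [1] * 12 + [0] * 288,
})

model = sm.Logit(df["played"], sm.add_constant(df["beatles"])).fit()
print(model.summary())

# e raised to the B coefficient gives the odds ratio:
print("odds ratio for Beatles era:", np.exp(model.params["beatles"]))
```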

Following are some PowerPoint slides I made to explain logistic regression, using the McCartney concert example. We'll start out with some simple cross-tabular frequencies and introduction of the concept of odds.


Next are some mathematical formulations of logistic regression (as opposed to the general linear model that informs ordinary regression) and part of the SPSS output from the McCartney example.


(Here's a reference documenting that any number raised to the zero power is one; technically, any non-zero number raised to the zero power is one.)

Note that odds ratios apply not only when moving from a score of zero to a score of one on a predictor variable (as in the song example), but across any one-unit increment: the prior odds are multiplied by the same factor (the OR) whether moving from zero to one, one to two, or two to three.
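
A few lines of arithmetic make this concrete. The coefficient below is hypothetical (it implies an OR of about 1.5, the value used in the tenure example in the song at the end of this post):

```python
# Hypothetical illustration: the OR applies the same multiplicative
# bump at every one-unit step of the predictor.
import numpy as np

B = 0.405               # hypothetical logistic-regression coefficient
odds_ratio = np.exp(B)  # e^B, roughly 1.5

odds = 0.20             # hypothetical odds when the predictor = 0
for x in range(1, 4):
    odds *= odds_ratio
    print(f"predictor = {x}: odds = {odds:.3f}")
# The prior odds are multiplied by the same ~1.5 factor whether
# moving from 0 to 1, 1 to 2, or 2 to 3.
```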

The last slide is a chart showing general guidelines for interpreting logistic-regression B coefficients and odds ratios. Logistic regression is usually done with unstandardized predictor variables.


The book Applied Logistic Regression by Hosmer, Lemeshow, and Sturdivant is a good resource. We'll also look at some of the materials in the links column to the right and some articles that used logistic regression, and run some example analyses in SPSS.

--------------------------------------------------------------------------------------------------------------------------

One last thing I like to do when working with complex multivariate statistics is run a simpler analysis as an analogue, to understand what's going on. Hopefully, the results from the actual multivariate analysis and the simplified analogue will be similar. A basic cross-tab can be used to simulate what a logistic regression is doing. Consider the following example from the General Social Survey (GSS) 1993 practice data set in SPSS. The dichotomous DV is whether a respondent had obtained a college degree, and the predictor variables are age, mother's highest educational level, father's highest educational level, one's own number of children, and one's attitude toward Broadway musicals (high value = strong dislike).

The logistic-regression equation, shown at the top of the following graphic, reveals that father's education (an ordinal variable ranging from less than a high-school diploma [0] to graduate-school degree [4]) had an odds ratio (OR) of 1.53. This tells us that, controlling for all other predictors, each one-unit increment on father's level of educational attainment would raise the respondent's odds of obtaining a college degree by a multiplicative factor of 1.53.


One might expect, therefore, that if we run a cross-tab of father's education (rows) by own degree status (columns), the odds of the respondent having a college degree will increase by 1.53 times, as father's education goes up a level. This is not the case, as shown in the graphic. When the father's educational attainment was less than a high-school diploma, the grown child's odds of having a college degree were .142. When father's education was one level higher, namely a high-school diploma (scored as 1), the grown child's odds of having a college degree became .376. The value .376 is 2.65 times greater than the previous odds of .142, not 1.53 times greater.
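
A quick arithmetic check of the step multiplier implied by those cross-tab odds:

```python
# Odds values taken from the cross-tab described above.
odds_father_lt_hs = 0.142  # odds of a college degree, father < HS diploma
odds_father_hs = 0.376     # odds of a college degree, father = HS diploma
print(odds_father_hs / odds_father_lt_hs)  # ~2.65, not the 1.53 from the full model
```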

A couple of things can be said at this point. First, the cross-tab utilizes only two variables, father's education and grown child's college-degree status; none of the other predictor variables are controlled for. Second, it is an obvious oversimplification to say that an individual's odds of having a college degree should increase by a uniform multiplier (in this case, 1.53) for each increment in father's educational attainment. In reality, the odds might go up by somewhat more than 1.53 between some levels of father's education and by somewhat less than 1.53 between other levels of father's education. However, as long as the 1.53 factor matches the step-by-step multipliers from the cross-tabs reasonably well, it simplifies things greatly to have a single value for the multiplier. (We will discuss this idea of accuracy vs. simplicity later in the course.)

One question that might have occurred to some of you is whether the multiplier values in the cross-tab (2.65, 2.09, etc.) might match logistic-regression results more accurately if we ran a logistic regression with father's education as the only predictor. In fact, averaging the four blue multipliers in the graphic matches very closely with the OR from such an analysis. Whether such a match will generally occur or just occurred this time by chance, I don't know.

--------------------------------------------------------------------------------------------------------------------------

Finally, we have a song. I actually wrote it back in 2007, for a guest lecture I gave in QM III.

e Raised to the B Power 
Lyrics by Alan Reifman
(May be sung to the tune of “Rikki Don’t Lose That Number,” Becker/Fagen for Steely Dan)

(SLOW) Running logistic regression...
For a dichotomous outcome, it’s designed,
Predictor variables can be, any kind,
But what will be your key result?

e raised to the B power,
That’s what gives you the, Odds Ratio,
This is something important, you must know,

e raised to the B power,
When an IV rises one,
DV odds multiply by O.R.,
It’s so much fun!

Will faculty make tenure, yes or no?
Say the O.R. for pubs, is 1.5,
For each new article, then, we multiply,
By this determined ratio,

e raised to the B power,
That’s what gives you the, Odds Ratio,
This is something important, you must know,

e raised to the B power, 
When an IV rises one,
DV odds multiply by O.R.,
And then you’re done...

 (Guitar solo)

Monday, September 7, 2015

MANOVA (Multivariate Analysis of Variance)

This week, we will graduate from ANOVA up to MANOVA. The difference is that MANOVA includes multiple dependent variables from a given conceptual area. The example below (and the song at the end) uses college drinking as the topic.


We'll be drawing heavily from the following article, which contains worked-out examples and many helpful tips:

Grice, J. W., & Iwasaki, M. (2007). A truly multivariate approach to MANOVA. Applied Multivariate Research, 12, 199-226.

Grice and Iwasaki's example involves one independent variable (culture/nationality: European-Americans; Asian-Americans; Asian-Internationals) and the "Big Five" personality traits as multiple dependent variables (neuroticism, extraversion, openness, agreeableness, and conscientiousness).

MANOVA takes the multiple DV's and adds them up in a linear-weighted combination (see "Step 2" on p. 206, the grey box on p. 207, and the equations toward the bottom of p. 209). According to Grice and Iwasaki, "MANOVA maximizes the differences between group means on linear combinations of the dependent variables" (p. 216). See also the paragraph beginning "Reasoning multivariately..." on p. 202.

There are four different MANOVA significance tests in most outputs (Pillai, Wilks, Hotelling, Roy). Bryan Manly (Multivariate Statistical Methods: A Primer, 3rd edition, 2005) writes that, "Generally, the four tests... can be expected to give similar significance levels, so there is no real need to choose between them... They are all also considered to be fairly robust [to violations of assumptions] if the sample sizes are equal or nearly so for the [cells]" (p. 49). Manly also notes that Pillai's Trace appears to be most robust to violations of assumptions.
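
For those curious how this looks outside SPSS, here is a minimal one-way MANOVA sketch in Python (statsmodels), with fabricated drinking data for three hypothetical student groups. The mv_test() output reports all four statistics just mentioned.

```python
# One-way MANOVA sketch with fabricated college-drinking data.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(1)

def group_scores(mean_vol, mean_drunk, mean_binge, n=40):
    # Three DV's from one conceptual area (drinking), per the example.
    return pd.DataFrame({
        "volume": rng.normal(mean_vol, 2.0, n),
        "times_drunk": rng.normal(mean_drunk, 1.5, n),
        "binge_days": rng.normal(mean_binge, 1.0, n),
    })

df = pd.concat([
    group_scores(10, 4, 3).assign(group="greek"),
    group_scores(7, 2, 2).assign(group="athlete"),
    group_scores(5, 1, 1).assign(group="other"),
], ignore_index=True)

fit = MANOVA.from_formula("volume + times_drunk + binge_days ~ group", data=df)
# Prints Pillai's trace, Wilks' lambda, Hotelling-Lawley trace,
# and Roy's greatest root for the group effect:
print(fit.mv_test())
```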

If the overall MANOVA is significant, it has been customary to follow up with a series of "regular" univariate ANOVA's to see which one or more of the multiple DV's in the set differs across the IV groups (e.g., running one ANOVA with just the personality trait of neuroticism as the DV, running another ANOVA with just the trait of extraversion as the DV, etc.). However, this routine has come into question by many statisticians (see Grice and Iwasaki, p. 203, paragraph beginning with "Second, many researchers...").
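
Continuing the sketch above, the customary univariate follow-ups would look like this (again, illustrative only; note the caveats below):

```python
# One "regular" ANOVA per DV, following a significant overall MANOVA.
from scipy.stats import f_oneway

for dv in ["volume", "times_drunk", "binge_days"]:
    samples = [df.loc[df["group"] == g, dv] for g in df["group"].unique()]
    F, p = f_oneway(*samples)
    print(f"{dv}: F = {F:.2f}, p = {p:.4f}")
```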

In fact, under some circumstances, one might want to skip the MANOVA altogether and just run a separate ANOVA on each DV. Write Grice and Iwasaki:

Are we truly interested in examining the multivariate, linear combinations of Big Five traits, or are we content with considering each trait separately? ... if we have no intention of interpreting the multivariate composites (that is, the linear combinations of traits -- the dependent variables), then the univariate analyses... are perfectly sufficient. There is certainly no shame in conducting multiple ANOVAs and separately interpreting the results for each dependent variable. It is more than a methodological faux pas, however, to conduct a MANOVA with no intent of interpreting the multivariate combination of variables (pp. 202-203; red highlight by Dr. Reifman, other emphases in original).

(See also the discussion on p. 203 of controlling the error-rate of significance levels when performing multiple tests, as well as this webpage.)

In order to get everything you need from MANOVA, you must run your analysis twice in SPSS: once through the point-and-click (Windows) interface and once via syntax (to get the weighting coefficients). See the link in the right-hand column "Getting More Extensive Output in SPSS" (especially pages 29 and 33).

MANOVA anticipates Discriminant Analysis, which we'll cover later in the course, and even Structural Equation Modeling, which is the subject matter of QM IV.

Let's conclude with a song:

It’s a MANOVA 
Lyrics by Alan Reifman
(May be sung to the tune of “Maneater,” Hall/Oates/Allen; for audio of a performed version, click here)

You have your, IV’s set up,
Sometimes, as a 2 X 4,
You look for effects, on the dependent variable, yes you do,
Multiple measures,
Aren’t what you’re used to seeing, just a single outcome,

You study, alcohol use,
By gender and, by student groups,
You can measure drinking, different ways, to get a broader view,
Multiple measures,
Volume, times drunk, and bingeing days, just to name a few,

(Slow) Mul-ti-ple DV’s,
Analyze ’em, all at once,
Doing so’s, a breeze,
It’s a MANOVA,

(Slow) Mul-ti-ple DV’s,
Analyze ’em, all at once,
Doing so’s, a breeze,
It’s a MANOVA,

 (Brief interlude)

The DV’s are, given weights,
To create, a composite,
The IV groups, are then compared, on these composite scores,
Multiple results,
Are printed out, on which you can, follow through,

(Slow) Mul-ti-ple DV’s,
Analyze ’em, all at once,
Doing so’s, a breeze,
It’s a MANOVA,

(Slow) Mul-ti-ple DV’s,
Analyze ’em, all at once,
Doing so’s, a breeze,
It’s a MANOVA

Thursday, September 3, 2015

Brief Summary of ANOVA Review Points

Here are some key points from today's class, reviewing ANOVA. (Web documents alluded to below are available in the right-hand links column.)
  • A one-way ANOVA compares means of three or more groups on one factor (e.g., comparing GPA at graduation among physical-science, social-science, and arts/humanities majors).
    • As seen in the linked overview of conducting one-way ANOVA by hand, the "between-group" portion compares conditions (e.g., did participants given one word-memorization strategy memorize more words on average than participants given other types of instructions?).
    • The "within-group" or "error" section of the results refers not to error in the sense of mistakes, but in the sense of imperfect potency of a given instruction. In other words, the fact that not all participants given rhyming (for example) as a strategy for memorizing words ended up memorizing the exact same number of words reflects this imperfection or "error." 
    • The larger the ratio of the between-group mean-square to the error mean-square, the larger the F ratio and the greater the likelihood of statistical significance. (The F-statistic, by the way, is named after Sir Ronald Fisher, the inventor of ANOVA.) A worked sketch of these computations appears after this list.
    • In the old days, one would have to look at an F-table to see if a given result attained significance (such as the one on this site for p < .05). Nowadays, however, the computer output will tell you the significance of your results.
  • A two-way ANOVA yields three types of mean-comparison effects. In the example we worked out in SPSS, gender (men, women) and marital status (married, widowed, divorced, separated, or single/never-married) were the factors or independent variables, and attitude toward Broadway musicals was the dependent variable. We thus had a 2 X 5 ANOVA. The results display thus included:
    • Main-effect of gender: Did men's mean attitude (collapsing over all marital statuses) differ from women's mean attitude (collapsing over all marital statuses)?
    • Main-effect of marital status: Did married persons' mean attitude toward Broadway musicals (collapsing over men and women) differ from widowed persons' mean attitude, divorced persons' mean attitude, etc.?
    • Interaction of gender X marital status: Did any one or more cells representing combinations of gender and marital status (e.g., divorced men or single women) stand out from the other cells in their mean attitudes toward Broadway musicals?
  • The linked document from David Lane shows how to use graphs to discern whether you have main-effects and/or interactions.
  • There are two further topics we didn't cover today, but will review briefly next Tuesday:
    • Just as a t-test has two versions for comparing two independent, non-overlapping groups (such as Democrats vs. Republicans) and for comparing paired conditions/groups (such as men and women who are heterosexually married to each other; or participants who receive medicine for four weeks and placebo for four weeks), ANOVA has analogous alternatives (here and here).
    • Significant effects in ANOVA tell us only that at least two means differ significantly within a factor. If the main-effect of marital status described above were significant, we would not immediately know which two or more of the five marital-status groups differed from each other on attitude toward Broadway musicals. For example, were married people more favorable than divorced people? Separated people more favorable than widowed people? When a factor has more than two conditions, we must follow up significant ANOVA results with contrasts and comparisons between cells (see links).
  • We can't forget to sing the song "ANOVA Man" by Mark Glickman at the next class (link to CAUSE Fun Resources)! Note the song's reference to "mu," the population mean on a given variable in a given condition. Even though we conduct ANOVA's with sample means in our own studies, the inference is back to the larger population!
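
As promised in the F-ratio bullet above, here is a worked sketch of the one-way ANOVA computations, using made-up word-memorization scores for three instruction groups:

```python
# One-way ANOVA "by hand," with fabricated memorization scores.
import numpy as np
from scipy.stats import f

groups = [
    np.array([12, 15, 14, 11, 13]),  # rhyming strategy
    np.array([9, 8, 11, 10, 9]),     # imagery strategy
    np.array([7, 6, 8, 9, 7]),       # no strategy
]

k = len(groups)                      # number of conditions
N = sum(len(g) for g in groups)      # total participants
grand_mean = np.concatenate(groups).mean()

# Between-group: how far each condition mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group ("error"): participants' spread around their own condition
# mean -- imperfect potency of the instruction, not mistakes.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (N - k)
F_ratio = ms_between / ms_within
p = f.sf(F_ratio, k - 1, N - k)  # what the old F-tables were used for
print(f"F({k - 1}, {N - k}) = {F_ratio:.2f}, p = {p:.4f}")
```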