Monday, September 21, 2015

Logistic Regression

This week, we'll review ordinary regression (for quantitative dependent variables such as dollars of earnings or GPA at college graduation) and then begin coverage of logistic regression (for dichotomous DVs). Both kinds of regression can accommodate quantitative and categorical (dummy) predictor variables. Logistic regression involves mathematical elements that may be unfamiliar to some, so we'll go over everything step by step.

The example we'll work through is a bit unconventional, but one with a Lubbock connection. Typically, our cases are persons. In this example, however, the cases are songs -- Paul McCartney songs. McCartney, of course, was a member of the Beatles (1960-1970), considered by many the greatest rock-and-roll band of all time. After the Beatles broke up, McCartney led a new group called Wings (1971-1981), before performing as a solo act. For many years after the Beatles' break-up, he declined to perform his old Beatles songs, but finally resumed doing so in 1989.

Given that McCartney has a catalog of close to 500 songs (excluding ones written entirely or primarily by other members of the Beatles), the question was which songs he would play in his 2014 Lubbock concert. I obtained lists of songs written by McCartney here and here, and a playlist from his Lubbock concert here. Any given song could be played or not played in Lubbock -- a dichotomous dependent variable. The independent variable was whether McCartney wrote the song while with the Beatles or post-Beatles (for Wings or as a solo performer).

This analysis could be done as a 2 (Beatles/post-Beatles era) X 2 (yes/no played in Lubbock) chi-square, but we'll examine it via logistic regression for illustrative purposes. Note that logistic-regression analyses usually would have multiple predictor variables, not just one. The null hypothesis would be that any given non-Beatles song would have the same probability of being played as any given Beatles song. What I really expected, however, was that Beatles songs would have a higher probability of being played than non-Beatles songs.

Following are some PowerPoint slides I made to explain logistic regression, using the McCartney concert example. We'll start out with some simple cross-tabular frequencies and introduction of the concept of odds.
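As a rough sketch of the odds idea, here is how odds and an odds ratio could be computed from a 2 X 2 cross-tab. The counts below are hypothetical, made up purely for illustration; they are not the actual McCartney frequencies from the slides. The variable and function names are likewise my own.

```python
import math

# HYPOTHETICAL counts for illustration only; NOT the real McCartney data.
counts = {
    "beatles": {"played": 20, "not_played": 180},
    "post_beatles": {"played": 10, "not_played": 290},
}

def odds(played, not_played):
    """Odds = number of 'yes' cases divided by number of 'no' cases."""
    return played / not_played

odds_beatles = odds(**counts["beatles"])
odds_post = odds(**counts["post_beatles"])

# The odds ratio compares the two groups' odds.
odds_ratio = odds_beatles / odds_post

# With a single 0/1 predictor, the logistic-regression B equals ln(OR).
b_coefficient = math.log(odds_ratio)

print(f"Odds (Beatles era):      {odds_beatles:.3f}")
print(f"Odds (post-Beatles era): {odds_post:.3f}")
print(f"Odds ratio:              {odds_ratio:.3f}")
print(f"B = ln(OR):              {b_coefficient:.3f}")
```

Note that odds are not the same as probabilities: odds divide the "yes" count by the "no" count, whereas a probability divides the "yes" count by the total.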


Next are some mathematical formulations of logistic regression (as opposed to the general linear model that informs ordinary regression) and part of the SPSS output from the McCartney example.


(Here's a reference documenting that any number raised to the zero power is one; technically, any non-zero number raised to the zero power is one.)
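For reference, the standard logistic model with one predictor can be written out as follows, which also shows where the zero-power rule comes in. The symbols p, B_0, B_1, and X are generic notation that I'm assuming here; they may not match the slides exactly.

```latex
\begin{align*}
\ln\!\left(\frac{p}{1-p}\right) &= B_0 + B_1 X \\[4pt]
\text{odds} = \frac{p}{1-p} &= e^{B_0 + B_1 X} = e^{B_0}\,\bigl(e^{B_1}\bigr)^{X} \\[4pt]
\text{odds at } X = 0 &= e^{B_0}\,\bigl(e^{B_1}\bigr)^{0} = e^{B_0} \cdot 1 = e^{B_0} \\[4pt]
\text{odds at } X = 1 &= e^{B_0}\, e^{B_1} \\[4pt]
\text{OR} = \frac{\text{odds at } X = 1}{\text{odds at } X = 0} &= e^{B_1}
\end{align*}
```

In words: when the predictor equals zero, the e-to-the-B term drops out (because it is raised to the zero power), leaving just the baseline odds.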

Note that odds ratios apply not only when moving from a score of zero to a score of one on a predictor variable (as in the song example). The prior odds are multiplied by the same factor (the OR) for every one-unit increase, whether moving from zero to one, one to two, two to three, etc.
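Here is a minimal sketch of that constant-multiplier property. The coefficients B0 and B1 are hypothetical values I picked for illustration, not output from any analysis discussed here.

```python
import math

# Hypothetical coefficients, chosen only for illustration.
B0 = -2.0  # intercept: log-odds when X = 0
B1 = 0.5   # slope: change in log-odds per one-unit increase in X

def odds_at(x):
    """Odds implied by the logistic model at predictor value x."""
    return math.exp(B0 + B1 * x)

def probability_from_odds(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

odds_by_x = [odds_at(x) for x in range(4)]  # X = 0, 1, 2, 3

# Ratio of each step's odds to the previous step's odds.
step_multipliers = [odds_by_x[i + 1] / odds_by_x[i] for i in range(3)]

probabilities = [probability_from_odds(o) for o in odds_by_x]

print("OR = e^B1 =", round(math.exp(B1), 4))
for x, (o, p) in enumerate(zip(odds_by_x, probabilities)):
    print(f"X = {x}: odds = {o:.4f}, probability = {p:.4f}")
print("Step multipliers:", [round(m, 4) for m in step_multipliers])
```

Every step multiplier equals e raised to the B1 power, no matter where on the predictor scale we start. The probabilities, by contrast, do not change by a constant amount from step to step; the constancy holds for odds only.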

The last slide is a chart showing general guidelines for interpreting logistic-regression B coefficients and odds ratios. Logistic regression is usually done with unstandardized predictor variables.


The book Applied Logistic Regression by Hosmer, Lemeshow, and Sturdivant is a good resource. We'll also look at some of the materials in the links column to the right and some articles that used logistic regression, and run some example analyses in SPSS.

--------------------------------------------------------------------------------------------------------------------------

One last thing I like to do when working with complex multivariate statistics is run a simpler analysis as an analogue, to understand what's going on. Hopefully, the results from the actual multivariate analysis and the simplified analogue will be similar. A basic cross-tab can be used to simulate what a logistic regression is doing. Consider the following example from the General Social Survey (GSS) 1993 practice data set in SPSS. The dichotomous DV is whether a respondent had obtained a college degree or not, and the predictor variables were age, mother's highest educational level, father's highest educational level, one's own number of children, and one's attitude toward Broadway musicals (high value = strong dislike).

The logistic-regression equation, shown at the top of the following graphic, reveals that father's education (an ordinal variable ranging from less than a high-school diploma [0] to graduate-school degree [4]) had an odds ratio (OR) of 1.53. This tells us that, controlling for all other predictors, each one-unit increment on father's level of educational attainment would raise the respondent's odds of obtaining a college degree by a multiplicative factor of 1.53.


One might expect, therefore, that if we run a cross-tab of father's education (rows) by own degree status (columns), the odds of the respondent having a college degree will be multiplied by 1.53 each time father's education goes up a level. This is not the case, as shown in the graphic. When the father's educational attainment was less than a high-school diploma (scored as 0), the grown child's odds of having a college degree were .142. When father's education was one level higher, namely a high-school diploma (scored as 1), the grown child's odds of having a college degree became .376. The value .376 is 2.65 times the previous odds of .142, not 1.53 times.
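The arithmetic in this comparison can be checked with a few lines of Python. The three input values (.142, .376, and 1.53) come from the text above; the variable names are my own, and the "implied" odds simply apply the regression OR to the cross-tab baseline for the sake of contrast.

```python
# Values taken from the text above.
observed_odds_level0 = 0.142  # father: less than high-school diploma (0)
observed_odds_level1 = 0.376  # father: high-school diploma (1)
model_or = 1.53               # OR for father's education from the full model

# Multiplier actually observed in the cross-tab.
observed_multiplier = observed_odds_level1 / observed_odds_level0

# Odds at level 1 if the baseline odds were simply multiplied by the OR.
implied_odds_level1 = observed_odds_level0 * model_or

print(f"Observed multiplier: {observed_multiplier:.2f}")
print(f"Odds at level 1 implied by OR of 1.53: {implied_odds_level1:.3f}")
print(f"Odds at level 1 observed in cross-tab: {observed_odds_level1:.3f}")
```

The observed multiplier comes out to about 2.65, and the odds implied by a uniform 1.53 multiplier (about .217) fall well short of the observed .376.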

A couple of things can be said at this point. First, the cross-tab utilizes only two variables, father's education and grown child's college-degree status; none of the other predictor variables are controlled for. Second, it is an obvious oversimplification to say that an individual's odds of having a college degree should increase by a uniform multiplier (in this case, 1.53) for each increment in father's educational attainment. In reality, the odds might go up by somewhat more than 1.53 between some levels of father's education and by somewhat less than 1.53 between other levels of father's education. However, as long as the 1.53 factor matches the step-by-step multipliers from the cross-tabs reasonably well, it simplifies things greatly to have a single value for the multiplier. (We will discuss this idea of accuracy vs. simplicity later in the course.)

One question that might have occurred to some of you is whether the multiplier values in the cross-tab (2.65, 2.09, etc.) might match logistic-regression results more accurately if we ran a logistic regression with father's education as the only predictor. In fact, the average of the four blue multipliers in the graphic matches the OR from such an analysis very closely. Whether such a match will generally occur or just occurred this time by chance, I don't know.

--------------------------------------------------------------------------------------------------------------------------

Finally, we have a song. I actually wrote it back in 2007, for a guest lecture I gave in QM III.

e Raised to the B Power 
Lyrics by Alan Reifman
(May be sung to the tune of “Rikki Don’t Lose That Number,” Becker/Fagen for Steely Dan)

(SLOW) Running logistic regression...
For a dichotomous outcome, it’s designed,
Predictor variables can be, any kind,
But what will be your key result?

e raised to the B power,
That’s what gives you the, Odds Ratio,
This is something important, you must know,

e raised to the B power,
When an IV rises one,
DV odds multiply by O.R.,
It’s so much fun!

Will faculty make tenure, yes or no?
Say the O.R. for pubs, is 1.5,
For each new article, then, we multiply,
By this determined ratio,

e raised to the B power,
That’s what gives you the, Odds Ratio,
This is something important, you must know,

e raised to the B power, 
When an IV rises one,
DV odds multiply by O.R.,
And then you’re done...

 (Guitar solo)