The logistic regression model is used when the dependent variable is categorical. A categorical variable is one whose numerical values serve only as labels distinguishing different categories. When a categorical variable has only two mutually exclusive outcomes, the binary logistic regression model is used. The logistic regression model had its origins in the biological sciences of the early twentieth century (Berkson 1944) but has subsequently found wide applicability in many areas of social science. The logistic regression model can be used for all data types but is most commonly used for cross-sectional data.
There are three different ways to derive or view the logistic regression model. In the first approach, one assumes that there is an unobserved or latent variable related to the observed outcome. For example, an individual’s decision to enter the labor force is made by comparing his or her unobserved reservation wage to the market wage. Only if the market wage exceeds the reservation wage does the individual enter the labor force. Secondly, the model can be viewed as a probability model for the dependent variable. Thirdly, the model can be derived from random utility theory or discrete choice model formulation (see McFadden 1974).
In the binary case, some event Y either occurs (Y = 1) or not (Y = 0). The linear probability regression model (LPM) is given by:
Y = βX + e
where βX = β0 + β1 * X1 + β2 * X2 + … + βk * Xk, and e is the random error term.
The set of independent variables X affecting the event Y can be either continuous or categorical. The LPM is problematic. First, the predicted value of Y is not constrained to lie between 0 and 1, and thus the model may produce nonsense probabilities. Second, the linearity assumption holds that for every unit increase in X, the probability of Y occurring increases by the same amount. In many applications this assumption is not tenable. For example, the effect of an additional child on the probability of a female entering the labor force is assumed to be constant. It makes more sense to assume decreasing marginal effects of the number of children on the probability of a female entering the labor force. Third, the effect of a change in a specific X on the probability of Y does not depend on the other independent variables. In most applications this is also an unrealistic assumption. In the example above, the effect of an additional child on the probability of a female entering the labor force is assumed to be independent of, say, her husband’s income.
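The first of these problems can be illustrated with a small numerical sketch. The data below are hypothetical (number of children and labor force participation for eight women), and the helper name fit_lpm is introduced only for illustration; the fit is ordinary least squares with a single regressor.

```python
# Hypothetical data: number of children (x) and labor force participation (y).
children = [0, 1, 2, 3, 4, 5, 6, 7]
in_labor_force = [1, 1, 1, 1, 0, 0, 0, 0]

def fit_lpm(x, y):
    """Ordinary least squares for Y = b0 + b1 * X + e with one regressor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx              # constant change in "probability" per child
    b0 = mean_y - b1 * mean_x
    return b0, b1

b0, b1 = fit_lpm(children, in_labor_force)

# Fitted "probabilities" from the linear probability model.
fitted = [b0 + b1 * xi for xi in children]

# Collect the observations whose fitted value falls outside [0, 1].
out_of_range = [(xi, p) for xi, p in zip(children, fitted) if p < 0 or p > 1]

for xi, p in zip(children, fitted):
    print(f"children={xi}  fitted probability={p:.3f}")
print("outside [0, 1]:", out_of_range)
```

In this sketch the fitted value exceeds 1 for women with no children and falls below 0 for women with seven children, and the estimated effect of each additional child is the same regardless of how many children are already present.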
The logistic regression model was formulated to address these issues and can be written as:
ln[P / (1 − P)] = βX
where P = Probability (Y = 1)
The dependent variable is the log of the odds of the event Y occurring, or the logit of Y. Since the probability is between 0 and 1, the odds range from 0 to ∞ and the logit (an increasing function of P) ranges from -∞ to +∞. Thus, the dependent variable in the logistic model is not constrained. The logistic model uses the cumulative logistic probability function to constrain the probability P to be between 0 and 1. In theory, any probability distribution can be used; however, the most popular choices are the normal, uniform, and logistic distributions. The uniform distribution gives rise to the linear probability model, while the normal distribution gives rise to the probit model. In most binary applications, the logit and the probit models give very similar results. Historically, for ease of computation, mathematical tractability, and ease of interpretability, the logistic model was the preferred choice. Solving for P in the logistic model, one obtains:
P = exp(βX) / [1 + exp(βX)] = 1 / [1 + exp(−βX)]
The probability P is now nonlinear in X. The effect of a change in X on the probability P is smallest at the extreme values of X, that is, when P is close to 0 or 1, which makes sense in most applications. In addition, the effect of a unit change in a specific X on P now depends on the values of all of the independent variables X. The appropriate estimation method for the logistic model is maximum likelihood estimation (MLE), since only Y and not P is observed. The MLE estimates are consistent, asymptotically normal, and asymptotically efficient. In addition, the standard likelihood ratio and Wald tests on the coefficients can be used. Since this is a nonlinear model, the marginal effect of X on P can be estimated either by evaluating it at the means of the variables or by averaging the marginal effect over all of the sample observations. The standard measures of goodness of fit are overly pessimistic in this model. Other measures that are used include the count R-squared statistic, which gives the proportion of correct predictions of the model over the two outcomes of Y, and the pseudo R-squared statistic, which is based on the likelihood ratio comparing the general model with the restricted model in which all of the slope parameters are zero (β1 = β2 = … = βk = 0).
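The estimation steps described above can be sketched for the simplest case of one independent variable. Everything specific here is an assumption made for illustration: the data are hypothetical, the function names are invented, Newton-Raphson is only one of several ways to maximize the likelihood, the 0.5 cutoff for the count R-squared is a convention, and the pseudo R-squared shown is one common definition (one minus the ratio of the two log-likelihoods).

```python
import math

# Hypothetical data: number of children (x) and labor force participation (y).
# The outcomes overlap in x, so a finite maximum likelihood estimate exists.
children = [0, 1, 2, 3, 4, 5, 6, 7]
in_labor_force = [1, 1, 0, 1, 1, 0, 0, 0]

def logistic(z):
    """Cumulative logistic function: P = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, x, y):
    """Log-likelihood of the binary logit model with one regressor."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def fit_logit(x, y, max_iter=50, tol=1e-10):
    """Maximize the log-likelihood by Newton-Raphson (illustrative only)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1 - p)
            g0 += yi - p             # score with respect to b0
            g1 += (yi - p) * xi      # score with respect to b1
            h00 += w                 # entries of the information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1

b0, b1 = fit_logit(children, in_labor_force)
fitted = [logistic(b0 + b1 * xi) for xi in children]

# The marginal effect of x on P is b1 * P * (1 - P), so it varies with x.
mean_x = sum(children) / len(children)
p_at_mean = logistic(b0 + b1 * mean_x)
effect_at_means = b1 * p_at_mean * (1 - p_at_mean)
average_effect = sum(b1 * p * (1 - p) for p in fitted) / len(fitted)

# Count R-squared: share of observations classified correctly at a 0.5 cutoff.
correct = sum((p > 0.5) == (yi == 1) for p, yi in zip(fitted, in_labor_force))
count_r2 = correct / len(fitted)

# Restricted model: intercept only, so P equals the sample share of Y = 1.
p_bar = sum(in_labor_force) / len(in_labor_force)
ll_null = log_likelihood(math.log(p_bar / (1 - p_bar)), 0.0, children, in_labor_force)
ll_full = log_likelihood(b0, b1, children, in_labor_force)
lr_statistic = 2 * (ll_full - ll_null)   # likelihood ratio test statistic
pseudo_r2 = 1 - ll_full / ll_null

print(f"b0={b0:.3f}  b1={b1:.3f}")
print(f"marginal effect at means={effect_at_means:.3f}  average={average_effect:.3f}")
print(f"count R2={count_r2:.2f}  pseudo R2={pseudo_r2:.3f}  LR={lr_statistic:.3f}")
```

The two marginal effect summaries generally differ because P(1 − P) is largest where P is near 0.5 and shrinks toward the extremes. In applied work a statistical package would normally be used instead of a hand-written optimizer.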
There are several approaches to modeling the polychotomous (more than two categories) case, including the multinomial logistic model. Similarly, the multivariate logistic model may be used in the modeling of two or more choice variables. Other relevant distinctions are made among ordered, unordered, and sequential choice variables. An individual choosing among no work, part-time work, or full-time work is facing an ordered variable. The choice of mode of daily transportation by car, bus, train, or bicycle is an example of an unordered variable. The decision of a high school graduate to attend college or not and then, if college is chosen, to decide on a major program is an example of a sequential choice variable.
A problematic feature of the multinomial logistic model is the property of independence from irrelevant alternatives (IIA). The IIA property states that the choice between any two alternatives is made independently of the remaining alternatives. In situations where there are close substitutes in the set of alternatives, this property is unlikely to hold true. In the mode of transportation example, adding a blue bus to the choice set lowers the probabilities assigned to all of the existing categories proportionally, including that of driving a car. Thus, by adding enough buses of different colors, one can make the probability of driving a car arbitrarily small. One can formally test for IIA in the multinomial logistic model or use other models, such as the multinomial probit model, which does not have this property. Unlike the dichotomous single variable situation, there are now major differences between the logistic and probit models and many more types of models to choose from.
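The bus example can be made concrete with a short sketch of multinomial logistic choice probabilities, in which each alternative's probability is proportional to the exponential of its systematic utility. The utility values and the function name mnl_probabilities are hypothetical and chosen only for illustration.

```python
import math

def mnl_probabilities(utilities):
    """Multinomial logit: P(j) = exp(V_j) / sum over k of exp(V_k)."""
    denom = sum(math.exp(v) for v in utilities.values())
    return {name: math.exp(v) / denom for name, v in utilities.items()}

# Hypothetical systematic utilities for two modes of transportation.
base = {"car": 1.0, "red bus": 0.5}
two_modes = mnl_probabilities(base)

# Add a blue bus that is identical to the red bus in everything but color.
with_blue = dict(base, **{"blue bus": 0.5})
three_modes = mnl_probabilities(with_blue)

# IIA: the odds of car versus red bus are the same in both choice sets.
odds_before = two_modes["car"] / two_modes["red bus"]
odds_after = three_modes["car"] / three_modes["red bus"]

def car_share(n_buses):
    """Probability of choosing the car when n identical buses are offered."""
    choices = {"car": 1.0}
    for i in range(n_buses):
        choices[f"bus {i}"] = 0.5
    return mnl_probabilities(choices)["car"]

print("odds car : red bus before and after:", odds_before, odds_after)
for n in (1, 2, 10, 50):
    print(f"{n} buses -> P(car) = {car_share(n):.3f}")
```

Because the odds between the car and any one bus never change, every added bus color draws probability away from the car, which is implausible when the buses are perfect substitutes for one another.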
SEE ALSO Maximum Likelihood Regression; Probabilistic Regression; Probability Distributions; Regression; Specification Tests
Amemiya, Takeshi. 1981. Qualitative Response Models: A Survey. Journal of Economic Literature 19 (4): 1483–1536.
Berkson, Joseph. 1944. Application of the Logistic Function to Bio-Assay. Journal of the American Statistical Association 39: 357–365.
Hosmer, David W., and Stanley Lemeshow. 2000. Applied Logistic Regression. 2nd ed. New York: Wiley.
Long, J. Scott. 1997. Regression Models for Categorical and Limited Dependent Variables. Thousand Oaks, CA: Sage.
McFadden, Daniel. 1974. Conditional Logit Analysis of Qualitative Choice Behaviour. In Frontiers in Econometrics, ed. Paul Zarembka, 105–142. New York: Academic Press.
Menard, Scott. 2002. Applied Logistic Regression Analysis. 2nd ed. Thousand Oaks, CA: Sage.
The results of logistic regression models can be expressed in the form of odds ratios, telling us how much change there is in the odds of being unemployed, receiving university education, voting Republican (or whatever), given a unit change in any other given variable, while holding all other variables in the analysis constant. More simply, the results (as measured by the changed odds on being found in a particular category) tell us how much a hypothesized cause has affected this outcome, taking the role of all other hypothesized causes into account.
Most published accounts of research using this particular technique report three statistics for the models. The first of these is the beta (parameter estimate or standardized regression coefficient), which is, crudely speaking, a measure of the size of the effect that an independent variable (let us say social class) has on a dependent variable (for example the probability of being found in employment rather than among the unemployed), after the effects of another variable (such as educational attainment) have been taken into account. The second is the standard error, which provides us with a means of judging the accuracy of our predictions about the effect in question. One rule of thumb is that the beta should be at least twice the size of the standard error in absolute value. Finally, many investigators include the odds ratios themselves, since these tend to make the relative probabilities being described in the model intuitively easier to grasp.
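The relationship among these three statistics can be sketched with a few lines of arithmetic. The beta, standard error, and baseline probability below are hypothetical values invented for illustration; the odds ratio is obtained by exponentiating the beta.

```python
import math

# Hypothetical estimates for one independent variable in a logistic model.
beta = 0.693            # change in the log odds per unit change in the variable
standard_error = 0.25

# The odds ratio is the exponential of the beta.
odds_ratio = math.exp(beta)

# Rule of thumb: the beta should be at least twice its standard error.
ratio_to_se = abs(beta) / standard_error
passes_rule_of_thumb = ratio_to_se >= 2

# An odds ratio multiplies the odds, not the probability. Starting from a
# hypothetical baseline probability of 0.20, convert to odds, apply the
# odds ratio, and convert back to a probability.
baseline_probability = 0.20
baseline_odds = baseline_probability / (1 - baseline_probability)
new_odds = baseline_odds * odds_ratio
new_probability = new_odds / (1 + new_odds)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"beta / standard error = {ratio_to_se:.2f}, passes: {passes_rule_of_thumb}")
print(f"probability moves from {baseline_probability:.2f} to {new_probability:.3f}")
```

In this sketch an odds ratio of about 2 doubles the odds but does not double the probability; how far the probability moves depends on where it starts.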
For a short introduction to the technique see Anthony Walsh, Statistics for the Social Sciences (1990). A more advanced discussion will be found in J. Aldrich and F. Nelson, Linear Probability, Logit, and Probit Models (1984).