Probabilistic Regression

Probabilistic regression, also known as probit regression, is a statistical technique used to make predictions about a limited dependent variable using information from one or more independent variables. It is one of several techniques available when the presence of a limited dependent variable prevents the use of the more common ordinary-least-squares regression.

A limited variable here refers to both nominal-level and ordinal-level variables. A nominal-level variable is one that (1) can take distinct values, but (2) whose values cannot be placed in any meaningful numerical order. A variable where yes and no are the only two possible values is one common type of nominal-level variable. An ordinal-level variable is one that (1) can take several possible values and (2) whose values can be placed in some logical numerical order. A variable that measures whether a person (1) strongly agrees, (2) agrees, (3) is indifferent, (4) disagrees, or (5) strongly disagrees with a given statement is a common type of ordinal-level variable.

In order to use ordinary-least-squares regression, there must be a linear relationship between one's dependent and independent variables. One can check whether this linearity requirement has been met by first making a scatterplot with the independent variable on the x-axis and the dependent variable on the y-axis, then calculating the mean of the dependent variable at each value of the independent variable and plotting this series of means on the scatterplot. If the series of means looks approximately like a straight line, the linearity requirement has been met; if not, other regression techniques must be employed.
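
As a rough illustration, this check can be automated. The following sketch uses hypothetical simulated data (the variable names and the built-in linear relationship are invented for illustration) with the numpy and matplotlib libraries:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: x takes whole-number values 1 through 10,
# and y is continuous with a linear relationship built in.
rng = np.random.default_rng(0)
x = rng.integers(1, 11, size=200)
y = 2.0 * x + rng.normal(0, 3, size=200)

# Scatterplot of the raw observations.
plt.scatter(x, y, alpha=0.3, label="observations")

# Mean of y at each distinct value of x, overlaid on the scatterplot.
levels = np.unique(x)
means = [y[x == v].mean() for v in levels]
plt.plot(levels, means, "ro-", label="mean of y at each x")

plt.xlabel("independent variable")
plt.ylabel("dependent variable")
plt.legend()
plt.show()
```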

There are two main reasons why the linearity requirement of ordinary-least-squares regression is seldom met when the dependent variable of interest is nominal or ordinal. First, because there are a limited number of values that the dependent variable can take (as few as two in some cases), any straight line imposed on the scatterplot could, at sufficiently low or high values of the independent variable, extend beyond the possible values of the dependent variable. The second reason is best described in the simple case where the dependent variable takes only two possible values. Because the scatterplot in this instance appears as a series of dots dispersed along two parallel horizontal lines, any single line imposed on it would cross each of these horizontal lines at only one point. Such a line would not pass near many of the data points and therefore would not achieve the high degree of fit originally sought. In this situation, a nonlinear function can achieve a much closer fit to the data.
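
The first problem can be seen numerically in a small simulation. This sketch (hypothetical data, invented coefficients) fits an ordinary-least-squares line to a binary outcome and shows the fitted line escaping the 0-to-1 range at extreme values of the independent variable:

```python
import numpy as np

# Hypothetical binary outcome whose probability of equaling 1 rises with x.
rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=500)
p_true = 1 / (1 + np.exp(-2 * x))
y = rng.binomial(1, p_true)

# Ordinary-least-squares fit of y on x (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, 1)

# Predictions at the extremes fall below 0 and above 1 --
# impossible values for a probability.
for xv in (-3.0, 0.0, 3.0):
    print(xv, intercept + slope * xv)
```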

When one's dependent variable fails to meet the linearity requirement of ordinary-least-squares regression, one attempts to mathematically transform the dependent variable so that there is a linear relationship between the independent variable(s) and the transformed dependent variable. Common transformations include taking the natural logarithm or the square root of the dependent variable. When the dependent variable is limited and therefore cannot take values above or below certain numbers, the mathematical function specified for the transformation must likewise be incapable of going above or below those same numbers. Probit regression uses the s-shaped cumulative distribution function of the normal distribution to meet this requirement.
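
A quick sketch of the transformation idea, using scipy's implementation of the normal CDF and its inverse (the proportions below are made up):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical observed proportions, all trapped between 0 and 1.
p = np.array([0.02, 0.10, 0.30, 0.50, 0.70, 0.90, 0.98])

# The probit transform (inverse normal CDF) maps (0, 1) onto the whole
# real line, removing the bounds so a linear model can be applied.
print(norm.ppf(p))

# The forward transform (the s-shaped normal CDF) maps any real number
# back into (0, 1), so predictions can never escape the valid range.
print(norm.cdf(np.array([-10.0, 0.0, 10.0])))
```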

Probit regression predicts the probability of seeing a given value of the dependent variable by fitting the available data to a mathematical function from which probabilities can be calculated. Specifically, this function is the inverse of the cumulative distribution function of the normal (Gaussian) distribution (also known as the quantile function of the normal distribution). Formally, this function is written as

probit(p) = √2 · erf⁻¹(2p − 1)

where erf⁻¹ is the inverse error function and p is the probability of observing a particular outcome on the dependent variable.
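
The formula can be checked directly against a library implementation. A minimal sketch using scipy (scipy.special.erfinv and scipy.stats.norm.ppf are real functions; the test probability is arbitrary):

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import norm

def probit(p):
    """Inverse standard-normal CDF, written via the inverse error function."""
    return np.sqrt(2) * erfinv(2 * p - 1)

p = 0.975
print(probit(p))    # about 1.96
print(norm.ppf(p))  # scipy's quantile function gives the same value
```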

Probit regression was first proposed by the entomologist Chester Ittner Bliss (1899-1979) in a 1934 article in Science titled "The Method of Probits." Noting that his dependent variable was too limited to be modeled using ordinary-least-squares regression, Bliss sought to overcome this limitation by transforming it into a new dependent variable that did not share those limitations. Using a table derived from the cumulative distribution function of the normal distribution, Bliss converted the units of his dependent variable into what he called "probability units." Bliss then used these probability units as his new dependent variable and fit them to the rest of his data using standard ordinary-least-squares regression. The word probit is simply an abbreviation of the phrase "probability unit."
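
Bliss's two-step procedure translates directly into a few lines of modern code. The dose-response numbers below are invented, and norm.ppf stands in for Bliss's printed probit tables (his tables also added 5 to every value to avoid negative numbers; that shift is omitted here):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical dose-response data: the proportion responding at each dose.
dose = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
proportion = np.array([0.05, 0.20, 0.50, 0.80, 0.95])

# Step 1: convert each proportion into a "probability unit" (probit).
probits = norm.ppf(proportion)

# Step 2: fit the probits to dose with ordinary least squares.
slope, intercept = np.polyfit(dose, probits, 1)
print(slope, intercept)
```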

Probit regression, when it was first developed, provided a powerful yet computationally simple way to model limited dependent variables. It required only widely available statistical tables and knowledge of the ordinary-least-squares regression technique. Yet probit regression is not the preferred method for modeling relationships between limited dependent variables and a set of independent variables in the social sciences. According to Adrian Raftery (2001), this is largely because the dependent variable, once transformed into probability units, is not easy to interpret. Probability units have no commonsense interpretation by themselves, so they must be converted back to simple probabilities before their meaning can be conveyed effectively in words. A competing and more popular method, logistic regression, handles the same types of dependent variable but has the additional advantage that its results are more easily expressed in words.
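
Converting back is a single application of the normal CDF. A sketch with invented coefficients from a hypothetical fitted probit model:

```python
from scipy.stats import norm

# Invented coefficients from a hypothetical probit fit: probit(p) = -1.0 + 0.8*x.
intercept, slope = -1.0, 0.8

for x in (0, 1, 2):
    p = norm.cdf(intercept + slope * x)  # probit scale back to a probability
    print(f"x = {x}: predicted probability = {p:.3f}")
```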

For binary dependent variables, there is little that distinguishes probit regression from its more popular competitor, logistic regression. Probit regression is computationally simpler than logistic regression, but this advantage is negated through the use of computer software. Ironically, however, the development of generalized linear modeling by John Nelder and Robert Wedderburn beginning in 1972, and the creation of computer software for performing the technique, re-created a space for probit regression. Generalized linear modeling was born out of the observation that the many different regression techniques identified at the time were more alike than they were different. Probit regression and logistic regression produce similar predictions for limited dependent variables because they are so similar at this general level. But the process of outlining the similarities between the two techniques also highlighted key differences. When one has an ordinal-level dependent variable and each category of the dependent variable occurs approximately an equal number of times, logistic regression is most appropriate. When the middle categories of the dependent variable occur much more frequently than either the low or high categories, probit regression achieves a better degree of fit. Thus, increased computing power and the development of more general statistical models helped to pinpoint a niche for probit regression.
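
The closeness of the two techniques for binary outcomes is easy to demonstrate with the statsmodels library (the data here are simulated; Probit and Logit are statsmodels' standard model classes):

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

# Simulated binary data generated from a probit relationship.
rng = np.random.default_rng(2)
x = rng.normal(size=300)
y = rng.binomial(1, norm.cdf(-0.5 + 1.2 * x))
X = sm.add_constant(x)

# Fit both models to the same data.
probit_fit = sm.Probit(y, X).fit(disp=0)
logit_fit = sm.Logit(y, X).fit(disp=0)

# The fitted probabilities from the two models rarely differ by much.
print(np.abs(probit_fit.predict(X) - logit_fit.predict(X)).max())
```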

For a more in-depth treatment of probit regression and logistic regression in a social science setting, the reader is referred to John H. Aldrich and Forrest D. Nelson's book Linear Probability, Logit, and Probit Models (1984).

SEE ALSO Distribution, Normal; Distribution, Poisson; Distribution, Uniform; Method of Moments; Nonlinear Regression; Ordinality; Pareto, Vilfredo; Probability Distributions; Regression; Regression Analysis

BIBLIOGRAPHY

Aldrich, John H., and Forrest D. Nelson. 1984. Linear Probability, Logit, and Probit Models. London: Sage.

Bliss, Chester Ittner. 1934. "The Method of Probits." Science 79: 38-39.

Raftery, Adrian. 2001. "Statistics in Sociology, 1950-2000: A Selective Review." Sociological Methodology 31: 1-45.

David J. Roelfs