Heckman Selection Correction Procedure
The Heckman selection correction procedure, introduced by American economist James J. Heckman, is a statistical solution to a form of sample selection bias. Sample selection bias can emerge when a population parameter of interest is estimated with a sample obtained from that population by other than random means. Such sampling yields a distorted empirical representation of the population of interest with which to estimate such parameters (Heckman 1990), possibly leading to biased estimates of them. Heckman (1979) was specifically concerned with this possibility in a certain regression context.
Suppose that we are interested in estimating the population regression model
y = β0 + β1 · x1 + β2 · x2 + ε1.
With a truly random sample from the population of interest, it should be straightforward to estimate β0, β1, and β2 via ordinary least squares. Suppose, however, that we observe y only if the units of observation in that random sample make some decision. For instance, we might observe y only if
v* = γ0 + γ1 · x1 + γ2 · x2 + ε2 > 0
or, equivalently, only if
ε2 > –γ0 –γ1 · x1 –γ2 · x2.
This allows us to characterize the sample selection bias that might emerge from attempting to estimate the regression with only the subsample for whom we observe y.
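We wish to evaluate the population expectation
E (y | x1, x2) = β0 + β1 · x1 + β2 · x2.
However, if we observe y only when v* > 0, the available data allow us to evaluate only the expectation
E (y | x1, x2, ε2 > –γ0 –γ1 · x1 –γ2 · x2) = β0 + β1 · x1 + β2 · x2 + E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2).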
In this setting, it will generally not be possible to separately identify β0, β1, and β2. To see this more clearly, assume that
E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2) ≠ 0
and consider a linear approximation of E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2),
w0 + w1 · x1 + w2 · x2 + ζ
where E (ζ) = 0. Then, the overall expectation that can be gleaned from available data would be
E (y | x1, x2, ε2 > –γ0 –γ1 · x1 –γ2 · x2) = (β0 + w0) + (β1 + w1) · x1 + (β2 + w2) · x2 + ζ.
Thus, in general, one would not be able to recover valid estimates of β0, β1, and β2 with the subsample for which we observe y. The only exception is when
E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2) = 0.
In other words, if ε1 and ε2 are independent, sample selection bias disappears.
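The argument above can be illustrated with a small simulation. The following is a minimal sketch, not part of the original exposition: the coefficient values, the sample size, and the function names `simulate` and `ols` are assumptions chosen purely for illustration. It draws a random sample, discards y wherever v* ≤ 0, and compares ordinary least squares on the full sample with ordinary least squares on the selected subsample, once with independent errors and once with correlated errors.

```python
import numpy as np

def simulate(rho, n=100_000, seed=0):
    """Draw a random sample and flag the units for which y would be observed."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    # (eps1, eps2) are bivariate normal with unit variances and correlation rho.
    eps2 = rng.normal(size=n)
    eps1 = rho * eps2 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    # Illustrative (assumed) parameters: beta = (1, 2, -1), gamma = (0.5, 1, 0.5).
    y = 1.0 + 2.0 * x1 - 1.0 * x2 + eps1
    v_star = 0.5 + 1.0 * x1 + 0.5 * x2 + eps2
    observed = v_star > 0
    X = np.column_stack([np.ones(n), x1, x2])
    return X, y, observed

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

for rho in (0.0, 0.8):
    X, y, observed = simulate(rho)
    full = ols(X, y)                          # infeasible in practice: y is partly unobserved
    selected = ols(X[observed], y[observed])  # what the available data allow
    print(f"rho={rho}: full sample {full.round(2)}, selected subsample {selected.round(2)}")
```

Under these assumed values, the selected-subsample estimates should sit close to the true coefficients when rho is zero and drift away from them when rho is 0.8, which is the pattern the linear approximation with w0, w1, and w2 predicts.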
The Heckman selection correction procedure can recover unbiased estimates of β0, β1, and β2 with available data (i.e., where x1 and x2 are observed for the full random sample from the population of interest and y is observed only for the subsample for which v* > 0). The departure point for this technique is to recognize that the sample selection bias problem really stems from a type of specification error. With the subsample for which y is observed we estimate
E (y | x1, x2, ε2 > –γ0 –γ1 · x1 –γ2 · x2) = β0 + β1 · x1 + β2 · x2 + E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2).
The problem, from the standpoint of recovering unbiased estimates of β0, β1, and β2, is that we do not observe E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2) (which is a function of x1 and x2) and hence cannot separately identify β0, β1, and β2 from it. However, if one could form an estimate of E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2) and regress y on x1, x2, and that estimate, it would be possible to identify β0, β1, and β2 separately.
To form an estimate of
E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2)
the Heckman procedure begins by assuming that ε1 and ε2 follow a bivariate normal distribution with correlation ρ (with the variance of ε2 normalized to one, as in the probit model). Then, using well-known properties of the bivariate normal distribution, we have
E (ε1 | ε2 > –γ0 –γ1 · x1 –γ2 · x2) = ρ · σε1 · λ(γ0 + γ1 · x1 + γ2 · x2), with λ(·) = φ(·) / Φ(·),
where φ(·) and Φ(·) are the standard normal density and cumulative distribution functions, respectively, and λ(·) is referred to as the inverse Mills ratio. An estimate of it can be formed from the fitted model emerging from estimation of a probit regression of a dummy variable indicating whether y is observed on x1 and x2. Once that estimate of λ(·) has been formed, a second-stage regression of y on x1, x2, and that estimate recovers unbiased estimates of β0, β1, and β2. The two steps are:
(1) Estimate the probit model under which the binary status of y (i.e., missing/not missing) is a function of x1 and x2. From the fitted model, form an estimate of the inverse Mills ratio (λ) for each observation in the subsample for which y is observed.
(2) Regress y on x1, x2, and λ with the subsample for which y is observed.
Owing to heteroskedasticity concerns, it is common practice to estimate the equation of interest via a procedure such as weighted least squares.
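The two steps can be sketched in code. What follows is a minimal illustration on simulated data rather than a reference implementation: the parameter values, the helper names (`norm_pdf`, `norm_cdf`, `fit_probit`), and the hand-rolled Fisher-scoring probit are all assumptions made to keep the example self-contained. The second stage uses plain ordinary least squares and reports point estimates only; it does not implement the weighted least squares refinement or corrected standard errors.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def fit_probit(Z, d, iters=20):
    """Probit coefficients by Fisher scoring (an illustrative, hand-rolled fitter)."""
    g = np.zeros(Z.shape[1])
    for _ in range(iters):
        idx = Z @ g
        pdf = norm_pdf(idx)
        cdf = np.clip(norm_cdf(idx), 1e-10, 1.0 - 1e-10)
        w = pdf / (cdf * (1.0 - cdf))
        score = Z.T @ (w * (d - cdf))
        info = (Z * (w * pdf)[:, None]).T @ Z
        g = g + np.linalg.solve(info, score)
    return g

# Simulated data with assumed parameters: beta = (1, 2, -1), gamma = (0.5, 1, 0.5), rho = 0.8.
rng = np.random.default_rng(1)
n, rho = 100_000, 0.8
x1, x2 = rng.normal(size=n), rng.normal(size=n)
eps2 = rng.normal(size=n)
eps1 = rho * eps2 + math.sqrt(1.0 - rho**2) * rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + eps1
observed = (0.5 + 1.0 * x1 + 0.5 * x2 + eps2) > 0
Z = np.column_stack([np.ones(n), x1, x2])

# Step 1: probit of the observation indicator on x1 and x2, then the inverse
# Mills ratio for each unit in the subsample where y is observed.
gamma_hat = fit_probit(Z, observed.astype(float))
index = Z[observed] @ gamma_hat
lam = norm_pdf(index) / norm_cdf(index)

# Step 2: regress y on x1, x2, and the estimated inverse Mills ratio,
# using only the subsample where y is observed.
W = np.column_stack([Z[observed], lam])
coef = np.linalg.lstsq(W, y[observed], rcond=None)[0]
naive = np.linalg.lstsq(Z[observed], y[observed], rcond=None)[0]

print("naive OLS (b0, b1, b2):      ", naive.round(2))
print("two-step  (b0, b1, b2):      ", coef[:3].round(2))
print("coefficient on lambda:       ", coef[3].round(2))
```

In this simulated setting the last coefficient is an estimate of ρ · σε1, which equals 0.8 under the assumed parameters. In applied work one would normally rely on an established statistical package rather than a hand-written probit.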
The estimated coefficient on λ in the second-stage regression is an estimate of ρ · σε1. A test of its significance is thus in practice a test for correlation between ε1 and ε2. Since ε1 and ε2 are assumed to follow a bivariate normal distribution, this is tantamount to testing whether they are independent. There is a full-information maximum-likelihood version of this model, but this “two-step” procedure has been more widely utilized in applied work.
A few caveats are in order. First, in the decade following its introduction, it became increasingly clear that the model often performs poorly without some source of identification beyond the assumption of joint normality of the errors ε1 and ε2. Technically, it should be identified by the nonlinearity of the inverse Mills ratio (which arises naturally from the assumption of joint normality). The underlying reason for the poor performance evident in some applications is that the inverse Mills ratio is nearly linear over much of its range, introducing potentially severe multicollinearity between x1, x2, and the inverse Mills ratio during the second-stage regression. The best remedy for this problem is to introduce an instrument z to the first-stage probit estimation that provides some source of variation in the inverse Mills ratio unrelated to that provided by x1 and x2.
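The near-linearity concern can be checked numerically. The sketch below is an illustration only: the function names `inverse_mills` and `linear_r2` and the two index ranges are assumptions. It computes the R² from a linear fit of λ(z) on z over an evenly spaced grid. An R² very close to one means that λ adds almost no variation beyond what a linear function of the probit index already provides.

```python
import math

def inverse_mills(z):
    """Inverse Mills ratio: standard normal density divided by the cumulative distribution."""
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return pdf / cdf

def linear_r2(lo, hi, steps=201):
    """R^2 from regressing lambda(z) on z over an evenly spaced grid on [lo, hi]."""
    zs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    ls = [inverse_mills(z) for z in zs]
    mz = sum(zs) / steps
    ml = sum(ls) / steps
    szz = sum((z - mz) ** 2 for z in zs)
    sll = sum((l - ml) ** 2 for l in ls)
    szl = sum((z - mz) * (l - ml) for z, l in zip(zs, ls))
    return szl ** 2 / (szz * sll)

narrow = linear_r2(-0.5, 0.5)  # little variation in the probit index
wide = linear_r2(-3.0, 3.0)    # substantial variation in the probit index
print(f"R^2 of a linear fit, narrow index range: {narrow:.4f}")
print(f"R^2 of a linear fit, wide index range:   {wide:.4f}")
```

When the fitted index varies only over a narrow range, the linear fit is nearly perfect, which is the multicollinearity problem described above. A wider range, for instance one induced by a strong instrument z, leaves more curvature to exploit.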
Second, there have been growing warnings about misapplication of the model (an excellent example of such a critique is presented by William Dow and Edward Norton [2003]). The basic concern has been with the popularity of the Heckman procedure in applications where a continuous dependent variable y is characterized by significant mass at zero (with the first-stage probit component used to explain whether y = 0 or y > 0 and the second-stage linear regression component applied to the subsample for which y > 0). However, the Heckman procedure is appropriate only in cases where the zeroes emerge from the censoring of some true value for y. In cases where the zeroes are valid values for y, and as such represent a legitimate corner solution for that variable, some other tool (such as the two-part model) might be more appropriate.
Finally, even in the face of variation in common first- and second-stage regressors sufficient to exploit nonlinearity in the inverse Mills ratio (which might obviate the need for first-stage identifying instruments), the model still relies on an assumed joint normality of errors to achieve identification by nonlinearity. If the error terms are not in fact jointly normally distributed, the parameter estimates provided by the Heckman selection correction procedure might be biased (and, depending on the precise nature of the problem at hand, the degree of bias might exceed that which would result from simply ignoring the censoring and estimating the equation of interest over only those for whom the outcome y is observed). A variety of alternatives to the assumption of joint normality have been proposed in response to this. In principle, one tack would be to rely on different parametric joint distributional assumptions. However, since most of the readily obvious alternatives would likely involve no less heroic an assumption than joint normality, the focus has instead shifted to semiparametric approaches that rely less on explicit assumptions regarding the joint distribution of the error terms.
SEE ALSO Regression; Selection Bias
Dow, William H., and Edward C. Norton. 2003. Choosing Between and Interpreting the Heckit and Two-Part Models for Corner Solutions. Health Services and Outcomes Research Methodology 4 (1): 5–18.
Heckman, James J. 1979. Sample Selection Bias as a Specification Error. Econometrica 47 (1): 153–161.
Heckman, James J. 1990. Selection Bias and Self-selection. In The New Palgrave: Econometrics, eds. John Eatwell, Murray Milgate, and Peter Newman, 201–224. New York: Norton.
Peter M. Lance