Selection Bias

An important aspect of empirical investigations in the social sciences is drawing inferences about the whole population of interest when one has data only on a subsample from that population. A first step in conducting such inference is to assume that the subsample under examination is drawn randomly from the population. However, in many instances in economics, and in the social sciences in general, it is not possible to make such an assumption. For example, suppose one is interested in drawing inferences regarding the determinants of wages for women when one has information on wages only for women working in market employment. If the sample of market-employed women differs from the population in some systematic way, then using the sample of employed women may lead to incorrect inferences regarding the determinants of wages for all women. This problem, in which inference based on a subsample is not appropriate for the entire population, is known as selection bias.

The example of employed women is useful for further illustrating the concept of selection bias. In fact, the original papers on selection bias were motivated by empirical work on precisely this topic. Suppose one has a sample of women of whom, without loss of generality, half work in market employment and half do not. Furthermore, suppose each woman is characterized by observable characteristics, such as race, age, and education, and unobservable characteristics, such as motivation and ability. Also assume that these unobservable characteristics are uncorrelated with the observable characteristics, so as to demonstrate that selection bias arises even with exogenous conditioning variables. Suppose the objective of the researcher is to explore the relationship between the observable characteristics and wages for all women by using only the data on the subsample of workers.

First assume that the decision to work in market employment is random. That is, each individual tosses a coin and on the basis of the coin toss decides whether or not to work. In this case examining the subsample leads to correct inferences because the process of selection for the sample, which is used to draw inferences, is random; thus there are no differences between the employed sample and the nonemployed sample. As a result there is no selection bias.

Second, suppose that only the highly educated individuals in the population are observed to be working in market employment. In this case individuals with higher levels of education would be overrepresented in the working sample. Thus, the sample of employed workers will have different average characteristics from the sample of those not working in market employment.

However, since the topic of interest is how the individual’s characteristics affect wages, one can control for the role of any observable characteristics when performing estimation over the employed sample. Thus differences in observable characteristics alone will not lead to selection bias.
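The first two cases can be checked with a small simulation. The sketch below is illustrative only: the variable names, the linear log-wage equation, and every numeric value (for instance the 0.10 return to education) are assumptions made for the example, not taken from the literature. It compares the estimated effect of education on wages when the working sample is chosen by a coin toss and when it consists only of the more educated women.

```python
import numpy as np

def ols_slope(x, y):
    """Return the slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(0)
n = 200_000
TRUE_RETURN = 0.10                       # assumed effect of education on log wage

education = rng.normal(12.0, 2.0, n)     # observable characteristic
motivation = rng.normal(0.0, 1.0, n)     # unobservable, independent of education
log_wage = 1.0 + TRUE_RETURN * education + 0.5 * motivation

# Case 1: a coin toss decides who works.
works_random = rng.random(n) < 0.5
# Case 2: only women with above-median education work.
works_educated = education > np.median(education)

slope_random = ols_slope(education[works_random], log_wage[works_random])
slope_educated = ols_slope(education[works_educated], log_wage[works_educated])

print(f"mean education, all women:       {education.mean():.2f}")
print(f"mean education, educated workers: {education[works_educated].mean():.2f}")
print(f"slope, random selection:          {slope_random:.3f}")
print(f"slope, selection on education:    {slope_educated:.3f}")
```

Under these assumptions both slopes should land close to the assumed value of 0.10, even though the second working sample is visibly more educated than the population: selection that operates only through a variable included in the regression does not distort the estimated relationship.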

Finally, consider the case in which all the highly motivated individuals in the sample of women are the ones observed to work in market employment. If motivation affected only the decision to be employed but not the wage, then this would not induce selection bias; even though the sample of working women would have a higher level of motivation on average than the whole population, one could still directly control for all the determinants of wages when examining the relationship between these determinants and wages. However, consider a case in which motivation affects wages as well as the market work decision. In this case, the sample of employed women would have wages that were determined by the observable characteristics, which one could control for, and by their level of motivation, which one could not control for. Moreover, as one has only a sample of motivated individuals working, one must conclude that for this employed sample the role of motivation, on average, is to increase wages. Failure to account for the role of motivation leads to a selection bias and incorrect inferences regarding the determinants of wages. Thus, in general, selection bias occurs when the unobservable features determining the probability of being observed in the subsample used for inference (in this example, the sample of employed women) are correlated with the unobservables determining the outcome of primary interest (in this example, wages).
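The role of the unobservable can also be illustrated by simulation. In the sketch below, which again rests entirely on assumed names, functional forms, and parameter values, a woman works when a combination of her education and her motivation is high enough. Two hypothetical wage equations are then compared over the same working sample: one in which motivation does not enter wages, and one in which it does.

```python
import numpy as np

def ols_slope(x, y):
    """Return the slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

rng = np.random.default_rng(1)
n = 200_000
TRUE_RETURN = 0.10                       # assumed effect of education on log wage

education = rng.normal(12.0, 2.0, n)     # observable
motivation = rng.normal(0.0, 1.0, n)     # unobservable, independent of education
noise = rng.normal(0.0, 0.3, n)          # other wage determinants

# Assumed participation rule: education and motivation both raise
# the chance of working.
works = 0.5 * (education - 12.0) + motivation > 0

# Wage equation A: motivation affects participation only.
wage_a = 1.0 + TRUE_RETURN * education + noise
# Wage equation B: motivation affects participation and wages.
wage_b = wage_a + 0.2 * motivation

slope_a = ols_slope(education[works], wage_a[works])
slope_b = ols_slope(education[works], wage_b[works])
mean_motivation_workers = motivation[works].mean()

print(f"mean motivation among workers: {mean_motivation_workers:.2f}")
print(f"slope, motivation not in wage: {slope_a:.3f}")
print(f"slope, motivation in wage:     {slope_b:.3f}")
```

With this setup the first slope should stay near 0.10, while the second should fall well below it. The reason is that, among workers, women with little education must be unusually motivated to have been selected, so motivation and education become negatively related in the working sample even though they are unrelated in the population.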

The issue of selection bias arises in a large number of empirical investigations in economics. Accounting for selection bias has therefore become a critical feature of empirical work in economics. One can see from the example above that adjusting for selection bias essentially requires controlling for the role of the unobservables. The first efforts to account for selection bias were conducted in a fully parametric setting. That is, the distribution of all the unobservables in the model was fully specified up to unknown parameters. This approach was suggested by James J. Heckman (1974, 1979) and ever since has had a substantial impact on both theoretical and empirical microeconometrics. Subsequent theoretical and empirical work in this area focused on relaxing the distributional assumptions in the model and thus attempted to make inference more robust to the assumptions employed in the earliest investigations. The 1998 article by Francis Vella surveys these departures from the original Heckman formulation and treatment. Another important innovation, as discussed in Charles Manski’s 1989 article, has been the use of bounds in this literature. According to this approach, one attempts to infer upper and lower bounds on the object of interest (for example, the effects of the determinants of wages) when one relaxes various assumptions underlying the process that generates the selection.
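As a rough illustration of the fully parametric approach, the following sketch applies a two-step correction of the kind associated with Heckman (1979) to simulated data, assuming jointly normal unobservables. Everything specific here is an assumption made for the example: the variable names, the parameter values, the hand-written probit routine, and in particular the hypothetical `children` variable, which is taken to affect the work decision but not wages. It is a minimal sketch, not a substitute for a tested econometrics package.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    """Standard normal density."""
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def fit_probit(X, d, iterations=25):
    """Estimate probit coefficients by Fisher scoring."""
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        index = X @ beta
        p = np.clip(norm_cdf(index), 1e-10, 1.0 - 1e-10)
        pdf = norm_pdf(index)
        score = X.T @ (pdf * (d - p) / (p * (1.0 - p)))
        info = (X * (pdf ** 2 / (p * (1.0 - p)))[:, None]).T @ X
        beta = beta + np.linalg.solve(info, score)
    return beta

def ols(X, y):
    """Return OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Simulated data (all values are assumptions for illustration).
rng = np.random.default_rng(2)
n = 50_000
education = rng.normal(12.0, 2.0, n)
children = rng.poisson(1.0, n).astype(float)   # hypothetical excluded variable
v = rng.normal(0.0, 1.0, n)                    # unobservable in the work decision
u = 0.2 * v + rng.normal(0.0, 0.3, n)          # wage unobservable, correlated with v
works = 0.5 * (education - 12.0) - 0.5 * children + 0.5 + v > 0
log_wage = 1.0 + 0.10 * education + u          # used only where works is True

# Step 1: probit for the work decision, estimated on all women.
ones = np.ones(n)
Z = np.column_stack([ones, education - 12.0, children])
gamma = fit_probit(Z, works.astype(float))

# Step 2: wage regression on workers, adding the inverse Mills ratio.
index = Z[works] @ gamma
inverse_mills = norm_pdf(index) / norm_cdf(index)
X_naive = np.column_stack([ones[works], education[works]])
X_heckman = np.column_stack([X_naive, inverse_mills])
naive = ols(X_naive, log_wage[works])
heckman = ols(X_heckman, log_wage[works])

print(f"education coefficient, uncorrected: {naive[1]:.3f}")
print(f"education coefficient, two-step:    {heckman[1]:.3f}")
print(f"coefficient on inverse Mills ratio: {heckman[2]:.3f}")
```

In this simulated setting the uncorrected coefficient should fall short of the assumed 0.10, while the two-step estimate should be close to it. The added regressor acts as an estimate of the average unobservable among workers with given characteristics. The ordinary standard errors from the second step are not valid without further adjustment, and the correction leans on the normality assumption, which is the kind of restriction the later literature sought to relax.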

SEE ALSO Classical Statistical Analysis; Descriptive Statistics; Heckman Selection Correction Procedure; Sampling