Sample attrition is a feature of longitudinal or panel data in which individual observations drop out of the study over time. Attrition may occur for a number of reasons, including insufficient compensation for survey response, induction into military service, transfer of residence with no follow-up information, or death of the respondent.
A dataset suffering from attrition is referred to as an attrited sample, and the individual observations that drop out over the course of the panel are referred to as attriters. If a longitudinal sample is drawn at random at the inception of the data collection process and attrition then occurs randomly, attrition poses no challenge for estimation with the attrited panel. This would be the case if the attriters are a random selection of the individuals in the survey and the underlying causes of attrition are independent of the survey response being studied.
However, attrition in actual panel data is rarely random, because the probability of attrition usually depends on observable and unobservable attributes of the individual observations that also affect the response variable being studied. For example, in firm-level data used to study firm profits over time, a firm may decide to shut down (and thus be removed from the sample) on the basis of observable characteristics as well as unobservable ones, such as expected operating revenues and productivity, that directly or indirectly determine firm profits too. Less productive firms thus attrite from the sample, leaving a nonrandom sample for analysis. Quantitative inference about the entire population of firms based on the attrited sample alone would be misleading, because the attrited sample is not representative of the underlying population. In such cases, estimators that ignore sample attrition are biased and inconsistent, and inference based on them is incorrect.
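The contrast between random and nonrandom attrition can be illustrated with a small simulation. The sketch below is not drawn from any of the studies cited here; the function names, the parameter values, and the data-generating process (a single regressor, normal errors, and an attrition rule whose error is correlated with the outcome error) are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_slope(x, y):
    """Least-squares slope of y on x (with an intercept)."""
    slope, _intercept = np.polyfit(x, y, 1)
    return slope


def simulate_attrition(n=200_000, beta=1.0, rho=0.8, seed=0):
    """Compare OLS slopes under random and nonrandom attrition.

    All names and parameter values are hypothetical.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)  # observable characteristic
    u = rng.normal(size=n)  # unobservable that drives the outcome
    # v drives attrition and is correlated with u (correlation = rho)
    v = rho * u + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    y = beta * x + u        # outcome, e.g. firm profit

    stay_random = rng.random(n) < 0.5    # attrition unrelated to x and u
    stay_selected = (0.5 * x + v) > 0.0  # units with low x or low v drop out

    return {
        "full": fit_slope(x, y),
        "random_attrition": fit_slope(x[stay_random], y[stay_random]),
        "nonrandom_attrition": fit_slope(x[stay_selected], y[stay_selected]),
    }


slopes = simulate_attrition()
for label, slope in slopes.items():
    print(f"{label:>20s}: {slope:.3f}")
```

Under these assumptions the slope estimated after random attrition stays close to the true value of 1, while the slope estimated from the nonrandomly attrited sample falls noticeably below it, because the retained units have outcome errors that are systematically related to the regressor.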
Sample attrition was first formally described and analyzed by J. Hausman and D. Wise in Econometrica (1979). Using the well-known random-effects specification, Hausman and Wise considered models in which the unobservable errors that determine the attrition decision are correlated with the unobservable errors that determine the response variable. Because of this choice of specification, the model is sometimes referred to as the Selection on Unobservables model. The specification is easily manipulated to show that least squares regression using only the retained data is biased and inconsistent for the parameters of interest. The model also shows that using only the first-period randomized sample leads to biased and inconsistent estimators if future periods are affected by attrition; this is a consequence of the unavoidable correlation of error terms induced by latent individual-specific effects that do not change over time. Under a joint normality assumption on the errors, however, consistent estimators of the parameters of interest can be obtained by maximum likelihood (ML). Generalizations, including models with relaxed distributional assumptions, have since been suggested by various researchers, including M. Verbeek and T. Nijman in the International Economic Review (1992) and Jeffrey Wooldridge in the Journal of Econometrics (1995). Cheng Hsiao provides a review in his Analysis of Panel Data (1986).
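The source of the least squares bias can be seen in a stylized two-period version of such a model. The notation below is an assumption made for illustration and simplifies the original specification: the attrition error is normalized to unit variance, the errors are jointly normal and independent of the regressors, and z collects whatever observables enter the attrition decision.

```latex
\begin{aligned}
y_{it} &= x_{it}'\beta + \varepsilon_{it}, \qquad
\varepsilon_{it} = \mu_i + \eta_{it}, \qquad t = 1, 2, \\
a_i &= \mathbf{1}\{\, z_i'\gamma + v_i > 0 \,\}, \qquad
y_{i2} \text{ is observed only if } a_i = 1, \\
E[\, y_{i2} \mid x_{i2}, z_i, a_i = 1 \,]
  &= x_{i2}'\beta + \sigma_{\varepsilon v}\,
     \frac{\phi(z_i'\gamma)}{\Phi(z_i'\gamma)},
\qquad \sigma_{\varepsilon v} = \operatorname{Cov}(\varepsilon_{i2}, v_i).
\end{aligned}
```

Here \phi and \Phi denote the standard normal density and distribution function. Whenever \sigma_{\varepsilon v} is nonzero, the second term acts as an omitted variable in a regression on the retained observations, which is why least squares on the attrited sample is inconsistent in this sketch.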
Several theoretical advances in the analysis of sample attrition models were proposed in the late twentieth century. These include models in which an observable determinant of sample attrition is uncorrelated with the response variable but possibly correlated with the unobservable determinant of the response variable. Because it differs in this way from the original attrition model, this model is known as Selection on Observables, as studied by John Fitzgerald, Peter Gottschalk, and Robert Moffitt (1998). It is complementary to the classical attrition model studied by Hausman and Wise: each is formulated on a different assumption, and each is of independent interest in empirical research.
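When attrition depends only on observables, one common correction is to reweight the retained observations by the inverse of their estimated retention probability, the approach treated in the Robins, Rotnitzky, and Zhao and the Wooldridge entries in the bibliography. The sketch below is a stylized illustration rather than the setup of any of those studies: it assumes a single binary observable z that is related to the outcome, lets retention depend on z alone, and estimates retention probabilities by cell frequencies. The function name and all parameter values are hypothetical.

```python
import numpy as np


def ipw_mean(y, stayed, z):
    """Inverse-probability-weighted mean of y using stayers only.

    P(stay | z) is estimated by the share of stayers within each
    cell of the discrete observable z. Assumes every cell has at
    least one stayer; otherwise the weight is undefined.
    """
    p_stay = np.empty(len(z), dtype=float)
    for value in np.unique(z):
        cell = z == value
        p_stay[cell] = stayed[cell].mean()
    weights = 1.0 / p_stay[stayed]
    return np.sum(weights * y[stayed]) / np.sum(weights)


# Hypothetical data: retention depends only on the observable z.
rng = np.random.default_rng(0)
n = 100_000
z = rng.integers(0, 2, size=n)    # binary observable
y = 2.0 * z + rng.normal(size=n)  # population mean of y is about 1.0
stayed = rng.random(n) < np.where(z == 1, 0.9, 0.3)

naive = y[stayed].mean()           # ignores attrition
weighted = ipw_mean(y, stayed, z)  # reweights stayers
print(f"naive stayer mean: {naive:.3f}")
print(f"IPW mean:          {weighted:.3f}")
```

In this design the unweighted mean among stayers overstates the population mean because units with z = 1 are overrepresented, while the weighted mean is close to the population value. The correction relies on retention being independent of the outcome once z is taken into account; it would not remove the bias in a Selection on Unobservables setting.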
A separate consideration in the theoretical work on sample attrition has been the possibility of obtaining a refreshment sample, which refers to supplemental data randomly sampled from the population to augment the attrited sample. This type of approach has been considered by, among others, K. Hirano and colleagues in 2000, who proposed consistent estimators of the parameters of interest when such refreshment samples are available.
In 2004, Mitali Das considered a generalized model of attrition. This model permits the estimand of interest to be either a parameter (as in previous work) or a flexible and unknown function, permits the errors to have an unspecified joint distribution, and usefully permits attriters to reappear in future periods.
SEE ALSO Pooled Time Series and Cross-sectional Data; Research, Longitudinal; Selection Bias
Fitzgerald, John, Peter Gottschalk, and Robert Moffitt. 1998. An Analysis of Sample Attrition in Panel Data. Journal of Human Resources 33 (2): 251–299.
Hsiao, Cheng. 1986. Analysis of Panel Data. Cambridge, U.K.: Cambridge University Press.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1995. Analysis of Semiparametric Regression Models for Repeated Outcomes in the Presence of Missing Data. Journal of the American Statistical Association 90: 106–121.
Wooldridge, Jeffrey M. 2002. Inverse Probability Weighted M-Estimation for Sample Selection, Attrition and Stratification. Portuguese Economics Journal 1: 117–139.