Random Effects Regression


The random effects estimator is applicable in the context of panel data, that is, data comprising observations on two or more “units” or “groups” (e.g., persons, firms, countries) in two or more time periods. The simplest regression model for such data is pooled Ordinary Least Squares (OLS), the specification for which may be written as

$$y_{it} = X_{it}\beta + u_{it} \qquad (1)$$

where y_it is the observation on the dependent variable for cross-sectional unit i in period t, X_it is a 1 × k vector of independent variables observed for unit i in period t, β is a k × 1 vector of parameters, and u_it is an error or disturbance term specific to unit i in period t.

One of the assumptions required for OLS to be optimal is that the error term is independently and identically distributed (IID). In the panel context, the IID assumption means that the variance E(u_it²), in relation to equation (1), equals a constant, σ_u², for all i and t, whereas the covariance E(u_is u_it) equals zero for all s ≠ t, and the covariance E(u_jt u_it) equals zero for all j ≠ i.

This may be inappropriate in a panel data context, because it amounts to saying that y_it is no more different from y_jt than it is from y_is. That is, observations from the same individual at a different time are just as independent from y_it as those coming from different individuals. Because this assumption is hard to maintain in many situations, the probabilistic model is often taken to be

$$y_{it} = X_{it}\beta + v_i + \varepsilon_{it} \qquad (2)$$

where we decompose u_it into a unit-specific and time-invariant component, v_i, and an observation-specific error, ε_it.
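A small simulation can illustrate what this decomposition implies for the disturbances. The sketch below is only an illustration: the panel dimensions and the standard deviations `sigma_v` and `sigma_eps` are assumed values, not taken from the text. It builds u_it = v_i + ε_it and compares the covariance of two disturbances from the same unit with that of disturbances from different units.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 5000, 2       # hypothetical panel dimensions
sigma_v, sigma_eps = 1.5, 1.0      # assumed standard deviations of v_i and eps_it

v = rng.normal(0.0, sigma_v, size=(n_units, 1))               # unit effect, constant over time
eps = rng.normal(0.0, sigma_eps, size=(n_units, n_periods))   # observation-specific error
u = v + eps                                                    # composite disturbance u_it

var_u = u.var()                                   # sample variance of u_it
cov_same_unit = np.mean(u[:, 0] * u[:, 1])        # same unit, two different periods
cov_other_unit = np.mean(u[:-1, 0] * u[1:, 0])    # two different units, same period

print(var_u, cov_same_unit, cov_other_unit)
```

With these assumed values the first number should be close to sigma_v² + sigma_eps², the second close to sigma_v², and the third close to zero, so disturbances from the same unit are correlated and the IID assumption fails.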

The fixed-effects and random-effects models differ in their interpretations of the v_i term: In the fixed-effects model, the v_i terms are treated as fixed parameters (unit-specific y-intercepts); in the random-effects model, in contrast, they are treated as random drawings from a given probability distribution.

In the fixed-effects approach, we merely acknowledge that differences between individuals exist. Therefore, the parameter β can be estimated by including a dummy variable for each cross-sectional unit and suppressing the global constant. If, however, we are willing to go the extra step of modeling the v_i, greater efficiency may be attained by using Generalized Least Squares (GLS), taking into account the structure of the error term. This is the random effects approach (for an early and influential example of this, see Balestra and Nerlove 1966).
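The dummy-variable form of the fixed-effects estimator can be sketched on simulated data. Everything in the block below (panel size, true slope, variable names) is an illustrative assumption. It estimates β once with one dummy per unit and no global constant, and once by subtracting unit means from the data, a well-known equivalent form often called the "within" estimator.

```python
import numpy as np

rng = np.random.default_rng(1)
m, T, beta = 50, 6, 2.0                # hypothetical: 50 units, 6 periods, true slope 2
unit = np.repeat(np.arange(m), T)      # unit index of each stacked observation
v = rng.normal(size=m)                 # unit-specific effects
x = rng.normal(size=m * T) + v[unit]   # regressor (allowed to correlate with the effects)
y = beta * x + v[unit] + rng.normal(size=m * T)

# Dummy-variable regression: one dummy per unit, global constant suppressed
D = (unit[:, None] == np.arange(m)).astype(float)
coef_lsdv, *_ = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)
b_lsdv = coef_lsdv[0]

# Equivalent "within" form: subtract each unit's mean, then run OLS
def demean(z):
    means = np.bincount(unit, weights=z) / T
    return z - means[unit]

xd, yd = demean(x), demean(y)
b_within = (xd @ yd) / (xd @ xd)

print(b_lsdv, b_within)
```

The two slope estimates should agree up to floating-point error; the demeaned form avoids building the full matrix of dummies when the number of units is large.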

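Consider observations on a given unit i at two times s and t. Let σ_v² and σ_ε² denote the variances of v_i and ε_it. From the hypotheses above, it follows that Var(u_is) = Var(u_it) = σ_v² + σ_ε², whereas the covariance between u_is and u_it is E(u_is u_it) = σ_v². In matrix notation, we may group the T_i observations for unit i into the vector y_i and write

$$y_i = X_i\beta + u_i \qquad (3)$$

The vector u_i, which includes all the disturbances for unit i, has covariance matrix

$$\mathrm{Var}(u_i) = \Sigma_i = \sigma_\varepsilon^2 I + \sigma_v^2 J \qquad (4)$$

where J is a square matrix with all elements equal to 1. It can be shown that the matrix

$$K_i = I - \frac{\theta}{T_i} J$$

where

$$\theta = 1 - \sqrt{\frac{\sigma_\varepsilon^2}{\sigma_\varepsilon^2 + T_i \sigma_v^2}}$$

has the property

$$K_i \Sigma_i K_i' = \sigma_\varepsilon^2 I$$

It follows that the transformed system

$$K_i y_i = K_i X_i \beta + K_i u_i \qquad (5)$$

satisfies the Gauss-Markov conditions, and OLS estimation of (5) provides efficient inference. But because

$$K_i y_i = y_i - \theta \bar{y}_i$$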

GLS estimation is equivalent to OLS using “quasi-demeaned” data, that is, variables from which we subtract a fraction θ of their average. Notice that as the variance of ε_it, σ_ε², goes to 0, θ → 1, whereas as the variance of v_i, σ_v², goes to 0, θ → 0. If all the variance is attributable to the individual effects, the fixed-effects estimator is optimal; if, instead, individual effects are negligible, then pooled OLS is optimal.
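The algebra behind quasi-demeaning can be checked numerically. The sketch below assumes the standard forms K_i = I − (θ/T_i)J and θ = 1 − sqrt(σ_ε² / (σ_ε² + T_i σ_v²)), with Σ_i = σ_ε² I + σ_v² J as the covariance matrix of u_i. The values of `T_i`, `s2_eps`, `s2_v`, and the data vector `y_i` are arbitrary illustrative choices.

```python
import numpy as np

T_i = 4                       # hypothetical number of periods for unit i
s2_eps, s2_v = 1.0, 2.5       # assumed variances of eps_it and v_i

I = np.eye(T_i)
J = np.ones((T_i, T_i))
Sigma = s2_eps * I + s2_v * J                 # covariance matrix of u_i
theta = 1.0 - np.sqrt(s2_eps / (s2_eps + T_i * s2_v))
K = I - (theta / T_i) * J

# Transforming by K should leave a scalar covariance matrix, s2_eps * I
transformed_cov = K @ Sigma @ K.T
cov_is_scalar = np.allclose(transformed_cov, s2_eps * I)

# Applying K to a data vector subtracts theta times the unit mean
y_i = np.array([3.0, 1.0, 4.0, 2.0])
is_quasi_demeaned = np.allclose(K @ y_i, y_i - theta * y_i.mean())

print(cov_is_scalar, is_quasi_demeaned)
```

Both checks print `True`: the transformation removes the within-unit correlation, and it never requires forming K_i explicitly, since subtracting θ times the unit mean gives the same result.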

To implement GLS we need to calculate θ, which in turn requires estimates of the variances σ_ε² and σ_v². (These are often referred to as the “within” and “between” variances, respectively, because the former refers to variation within each cross-sectional unit and the latter to variation between the units.) Several means of estimating these magnitudes have been suggested in the literature (see Baltagi 2005).
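One possible way of carrying out these steps is sketched below for a balanced panel with a single regressor. It is only one of the several approaches mentioned above, in the spirit of the Swamy-Arora method: σ_ε² is taken from the residuals of the fixed-effects (within) regression, and σ_v² from the residual variance of a regression on unit means, which estimates σ_v² + σ_ε²/T. The simulated data, dimensions, and variable names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
m, T, beta = 200, 5, 1.0                   # hypothetical balanced panel
unit = np.repeat(np.arange(m), T)
v = rng.normal(0.0, 1.0, size=m)           # unit effects, true variance 1
x = rng.normal(size=m * T)                 # regressor independent of the unit effects
y = 0.5 + beta * x + v[unit] + rng.normal(0.0, 1.0, size=m * T)  # true error variance 1

def unit_means(z):
    return np.bincount(unit, weights=z) / T

# "Within" variance: residuals from the demeaned (fixed-effects) regression
xd, yd = x - unit_means(x)[unit], y - unit_means(y)[unit]
b_fe = (xd @ yd) / (xd @ xd)
s2_eps = np.sum((yd - b_fe * xd) ** 2) / (m * T - m - 1)

# "Between" regression on unit means; its residual variance
# estimates sigma_v^2 + sigma_eps^2 / T
Xb = np.column_stack([np.ones(m), unit_means(x)])
yb = unit_means(y)
resid_b = yb - Xb @ np.linalg.lstsq(Xb, yb, rcond=None)[0]
s2_v = max(resid_b @ resid_b / (m - 2) - s2_eps / T, 0.0)   # truncate at zero

theta = 1.0 - np.sqrt(s2_eps / (s2_eps + T * s2_v))

# Random effects: OLS on quasi-demeaned data (the constant column becomes 1 - theta)
Xq = np.column_stack([np.full(m * T, 1.0 - theta), x - theta * unit_means(x)[unit]])
yq = y - theta * unit_means(y)[unit]
b_re = np.linalg.lstsq(Xq, yq, rcond=None)[0]   # [intercept, slope]

print(s2_eps, s2_v, theta, b_re)
```

Truncating the estimate of σ_v² at zero is a pragmatic choice made here so that θ is always defined; with θ = 0 the procedure reduces to pooled OLS.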

The above derivation presupposes that the ε_it term is IID. Departures from this assumption (e.g., heteroskedasticity) have been analyzed, leading to a sizable body of literature (see, for example, Baltagi 2005 and Arellano 2003).

When is the random effects estimator preferable to fixed effects? If the panel comprises observations on a fixed and relatively small set of units of interest (for example, the member states of the European Union), there is a presumption in favor of fixed effects, because it makes little sense to consider the v_i terms as sampled from an underlying population: In the case of the European Union states, the sample and the population coincide, and even a thought experiment in which the units are different would be audacious. If, instead, the sample comprises observations on a large number of randomly selected individuals (as in many epidemiological and other longitudinal studies), there is a presumption in favor of random effects. Besides this general heuristic, however, certain statistical issues must be taken into account.

First, some panel data sets contain variables whose values are specific to the cross-sectional unit but which do not vary over time (for example, the gender of an individual). If such variables are to be included in the model, the fixed-effects option is simply not available. (When the fixed-effects model is implemented using the dummy variables approach, the trouble is that any time-invariant variables are perfectly collinear with the unit dummies.) Second, the random-effects estimator can be shown to be a matrix-weighted average of pooled OLS and the “between” estimator (a regression using the group means, and hence ignoring the intragroup variation). Suppose we have observations on m units and there are k independent variables of interest. If k > m, the “between” estimator is undefined—we have only m effective observations—and hence so is the random-effects estimator.

If one does not fall foul of one or other of these issues, the choice between fixed effects and random effects may be expressed as a tradeoff between robustness and efficiency.

The robustness of the fixed-effects approach stems from the fact that it makes no hypotheses regarding the differences in mean across the units, except that such differences exist. This estimator “always works,” but at the cost of not being able to estimate the effect of time-invariant regressors.

The richer hypothesis set of the random-effects estimator allows for estimation of the parameters for time-invariant regressors, and ensures that estimation of the parameters for time-varying regressors is performed more efficiently. But these advantages are tied to the validity of the additional hypotheses. If the individual effects are correlated with some of the explanatory variables, then the random-effects estimator is inconsistent, whereas fixed-effects estimates are still valid. It is on this principle that the “Hausman test” (Hausman 1978) is built: If the fixed- and random-effects estimates agree, to within the usual statistical margin of error, then there is no reason to believe the additional hypotheses are invalid, and as a consequence, no reason not to use the more efficient random-effects estimator.
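The comparison underlying the Hausman test can be sketched for the simplest case of one time-varying regressor in a balanced panel. The function name `hausman_one_regressor`, the simulated data, and the variance-component estimates are illustrative assumptions, and both slope variances use the same within-regression estimate of σ_ε², a simplification that keeps their difference positive. The statistic H = (b_FE − b_RE)² / (Var(b_FE) − Var(b_RE)) is compared with a chi-square distribution with one degree of freedom.

```python
import math
import numpy as np

def hausman_one_regressor(y, x, unit, T):
    """Hausman comparison for one regressor in a balanced panel (illustrative sketch)."""
    m = int(unit.max()) + 1
    means = lambda z: (np.bincount(unit, weights=z) / T)[unit]

    # Fixed-effects (within) slope and its variance
    xd, yd = x - means(x), y - means(y)
    b_fe = (xd @ yd) / (xd @ xd)
    s2_eps = np.sum((yd - b_fe * xd) ** 2) / (m * T - m - 1)
    var_fe = s2_eps / (xd @ xd)

    # Variance of the unit effects from the regression on unit means
    xb, yb = np.bincount(unit, weights=x) / T, np.bincount(unit, weights=y) / T
    Xb = np.column_stack([np.ones(m), xb])
    rb = yb - Xb @ np.linalg.lstsq(Xb, yb, rcond=None)[0]
    s2_v = max(rb @ rb / (m - 2) - s2_eps / T, 0.0)
    theta = 1.0 - math.sqrt(s2_eps / (s2_eps + T * s2_v))

    # Random-effects slope from quasi-demeaned data; centering removes the intercept
    xq, yq = x - theta * means(x), y - theta * means(y)
    xc = xq - xq.mean()
    b_re = (xc @ (yq - yq.mean())) / (xc @ xc)
    var_re = s2_eps / (xc @ xc)

    H = (b_fe - b_re) ** 2 / (var_fe - var_re)
    p_value = math.erfc(math.sqrt(H / 2.0))   # upper tail of chi-square with 1 df
    return b_fe, b_re, H, p_value

rng = np.random.default_rng(3)
m, T = 300, 4                                  # hypothetical panel dimensions
unit = np.repeat(np.arange(m), T)
v = rng.normal(size=m)                         # unit effects
eps = rng.normal(size=m * T)
regressors = {
    "uncorrelated": rng.normal(size=m * T),            # independent of the unit effects
    "correlated": rng.normal(size=m * T) + v[unit],    # correlated with the unit effects
}
results = {}
for label, x in regressors.items():
    y = 1.0 + 2.0 * x + v[unit] + eps
    results[label] = hausman_one_regressor(y, x, unit, T)
    print(label, results[label])
```

In the correlated case the two estimates should differ markedly and the statistic should be large, pointing toward fixed effects; in the uncorrelated case both estimates should be close to the true slope of 2.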

SEE ALSO Regression

BIBLIOGRAPHY

Arellano, Manuel. 2003. Panel Data Econometrics. Oxford: Oxford University Press.

Balestra, Pietro, and Marc Nerlove. 1966. Pooling Cross-Section and Time Series Data in the Estimation of a Dynamic Model: The Demand for Natural Gas. Econometrica 34 (3): 585–612.

Baltagi, Badi H. 2005. Econometric Analysis of Panel Data. 3rd ed. New York: Wiley.

Hausman, James A. 1978. Specification Tests in Econometrics. Econometrica 46: 1251–1271.

Allin Cottrell

Riccardo “Jack” Lucchetti

International Encyclopedia of the Social Sciences