# Random Effects Regression

The random effects estimator is applicable in the context of panel data—that is, data comprising observations on two or more “units” or “groups” (e.g., persons, firms, countries) in two or more time periods. The simplest regression model for such data is pooled Ordinary Least Squares (OLS), the specification for which may be written as

$$
y_{it} = X_{it}\beta + u_{it} \tag{1}
$$

where $y_{it}$ is the observation on the dependent variable for cross-sectional unit $i$ in period $t$, $X_{it}$ is a $1 \times k$ vector of independent variables observed for unit $i$ in period $t$, $\beta$ is a $k \times 1$ vector of parameters, and $u_{it}$ is an error or disturbance term specific to unit $i$ in period $t$.

One of the assumptions required for OLS to be optimal is that the error term is independently and identically distributed (IID). In the panel context, the IID assumption means that $E(u_{it}^2)$, in relation to equation (1), equals a constant for all $i$ and $t$, whereas the covariance $E(u_{is}u_{it})$ equals zero for all $s \neq t$, and the covariance $E(u_{jt}u_{it})$ equals zero for all $j \neq i$.

This may be inappropriate in a panel data context, because it amounts to saying that $y_{it}$ is no more closely related to $y_{is}$, an observation on the same unit in a different period, than it is to $y_{jt}$, an observation on a different unit. That is, observations from the same individual at a different time are treated as just as independent from $y_{it}$ as those coming from different individuals.

Because this assumption is hard to maintain in many situations, the probabilistic model is often taken to be

$$
y_{it} = X_{it}\beta + v_i + \varepsilon_{it} \tag{2}
$$

where we decompose $u_{it}$ into a unit-specific and time-invariant component, $v_i$, and an observation-specific error, $\varepsilon_{it}$.

The fixed-effects and random-effects models differ in their interpretations of the $v_i$ term: in the fixed-effects model, the $v_i$ are treated as fixed parameters (unit-specific $y$-intercepts); in the random-effects model, in contrast, they are treated as random drawings from a given probability distribution.

In the fixed-effects approach, we merely acknowledge that differences between individuals exist. The parameter $\beta$ can therefore be estimated by including a dummy variable for each cross-sectional unit and suppressing the global constant. If, however, we are willing to go the extra step of modeling the $v_i$, greater efficiency may be attained by using Generalized Least Squares (GLS), taking into account the structure of the error term. This is the random-effects approach (for an early and influential example, see Balestra and Nerlove 1966).
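The dummy-variable route to the fixed-effects estimate can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the panel dimensions, the true slope of 2.0, and all variable names are hypothetical and not taken from the article. It compares the slope from a regression with one dummy per unit (and no global constant) with the slope obtained after subtracting unit means, which should coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 5, 4                         # hypothetical panel dimensions
unit = np.repeat(np.arange(n_units), n_periods)   # unit index for each row

# Simulated data: one regressor, unit-specific intercepts v_i, assumed slope 2.0
x = rng.normal(size=n_units * n_periods)
v = rng.normal(size=n_units)
y = 2.0 * x + v[unit] + rng.normal(scale=0.5, size=x.size)

# (a) Dummy-variable regression: one dummy per unit, no global constant
dummies = (unit[:, None] == np.arange(n_units)[None, :]).astype(float)
X_lsdv = np.column_stack([x, dummies])
coef_lsdv, *_ = np.linalg.lstsq(X_lsdv, y, rcond=None)
beta_lsdv = coef_lsdv[0]

# (b) Within transformation: subtract each unit's mean from its observations
def demean(z):
    means = np.bincount(unit, weights=z) / np.bincount(unit)
    return z - means[unit]

x_w, y_w = demean(x), demean(y)
beta_within = (x_w @ y_w) / (x_w @ x_w)

print(beta_lsdv, beta_within)   # the two slope estimates agree
```

The agreement is numerical rather than approximate in any statistical sense: demeaning by unit is what the unit dummies do implicitly.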

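Consider observations on a given unit $i$ at two times $s$ and $t$. Let $\sigma^2_v$ and $\sigma^2_\varepsilon$ denote the variances of $v_i$ and $\varepsilon_{it}$, respectively. From the hypotheses above, it follows that $\mathrm{Var}(u_{is}) = \mathrm{Var}(u_{it}) = \sigma^2_v + \sigma^2_\varepsilon$, whereas the covariance between $u_{is}$ and $u_{it}$ is $E(u_{is}u_{it}) = \sigma^2_v$. In matrix notation, we may group the $T_i$ observations for unit $i$ into the vector $y_i$ and write

$$
y_i = X_i\beta + u_i \tag{3}
$$

The vector $u_i$, which includes all the disturbances for unit $i$, has covariance matrix

$$
\mathrm{Var}(u_i) = \Sigma_i = \sigma^2_\varepsilon I + \sigma^2_v J \tag{4}
$$

where $I$ is the identity matrix and $J$ is a square matrix with all elements equal to 1. It can be shown that the matrix

$$
K_i = I - \frac{\theta}{T_i} J,
$$

where

$$
\theta = 1 - \sqrt{\frac{\sigma^2_\varepsilon}{\sigma^2_\varepsilon + T_i \sigma^2_v}},
$$

has the property

$$
K_i \Sigma_i K_i' = \sigma^2_\varepsilon I.
$$

It follows that the transformed system

$$
K_i y_i = K_i X_i \beta + K_i u_i \tag{5}
$$

satisfies the Gauss-Markov conditions, and OLS estimation of (5) provides efficient inference. But because $K_i y_i = y_i - \theta \bar{y}_i$, where $\bar{y}_i$ is the mean of the observations on unit $i$, GLS estimation is equivalent to OLS using “quasi-demeaned” data, that is, variables from which we subtract a fraction $\theta$ of their average. Notice that for $\sigma^2_\varepsilon \to 0$, $\theta \to 1$, whereas for $\sigma^2_v \to 0$, $\theta \to 0$. If all the variance is attributable to the individual effects, the fixed-effects estimator is optimal; if, instead, individual effects are negligible, then pooled OLS is optimal.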
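The quasi-demeaning result can be checked numerically. The following is a minimal sketch, assuming the usual forms $K_i = I - (\theta/T_i)J$ and $\theta = 1 - \sqrt{\sigma^2_\varepsilon / (\sigma^2_\varepsilon + T_i\sigma^2_v)}$; the variance values, the helper names `theta` and `quasi_demean`, and the sample vector are hypothetical choices made for illustration.

```python
import numpy as np

def theta(sigma2_eps, sigma2_v, T):
    """Quasi-demeaning fraction for a unit observed T times (assumed formula)."""
    return 1.0 - np.sqrt(sigma2_eps / (sigma2_eps + T * sigma2_v))

def quasi_demean(z, th):
    """Subtract a fraction th of the unit mean from one unit's observations."""
    return z - th * z.mean(axis=0)

T = 4
sigma2_eps, sigma2_v = 1.5, 0.8          # hypothetical variance components
th = theta(sigma2_eps, sigma2_v, T)

I, J = np.eye(T), np.ones((T, T))
Sigma = sigma2_eps * I + sigma2_v * J    # covariance matrix of u_i
K = I - (th / T) * J                     # transformation matrix

# After transformation the disturbances should be spherical: sigma2_eps * I
transformed_cov = K @ Sigma @ K.T

# Applying K to a data vector is the same as quasi-demeaning it
y_i = np.array([1.0, 2.0, 4.0, 5.0])
print(np.allclose(K @ y_i, quasi_demean(y_i, th)))
print(np.allclose(transformed_cov, sigma2_eps * I))
```

Evaluating `theta` at extreme variance ratios reproduces the two limiting cases: a value near 1 corresponds to full demeaning (fixed effects), and a value of 0 leaves the data untouched (pooled OLS).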

To implement GLS we need to calculate $\theta$, which in turn requires estimates of the variances $\sigma^2_\varepsilon$ and $\sigma^2_v$. (These are often referred to as the “within” and “between” variances, respectively, because the former refers to variation within each cross-sectional unit and the latter to variation between the units.) Several means of estimating these magnitudes have been suggested in the literature (see Baltagi 2005).
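One possible way of obtaining these estimates, shown here only as a sketch, uses the residuals from a within (unit-demeaned) regression for $\sigma^2_\varepsilon$ and the residuals from a regression on unit means for $\sigma^2_v$, in the spirit of what is usually called the Swamy-Arora approach. The balanced panel, the simulated data, the assumed true variances, and all names below are hypothetical; other estimators discussed in the literature will give somewhat different numbers.

```python
import numpy as np

rng = np.random.default_rng(42)
N, T, k = 200, 5, 1                          # units, periods, regressors (assumed)
unit = np.repeat(np.arange(N), T)

# Simulated balanced panel with assumed sigma2_v = 1.0 and sigma2_eps = 0.49
x = rng.normal(size=N * T)
v = rng.normal(scale=1.0, size=N)
eps = rng.normal(scale=0.7, size=N * T)
y = 1.0 + 2.0 * x + v[unit] + eps

def unit_means(z):
    """Mean of z within each unit (balanced panel)."""
    return np.bincount(unit, weights=z) / T

# "Within" variance: residual variance of the unit-demeaned regression
x_w = x - unit_means(x)[unit]
y_w = y - unit_means(y)[unit]
b_within = (x_w @ y_w) / (x_w @ x_w)
resid_w = y_w - b_within * x_w
s2_eps = resid_w @ resid_w / (N * (T - 1) - k)

# "Between" regression on unit means; its residual variance estimates
# sigma2_v + sigma2_eps / T, so subtract the second term (truncating at zero)
Xb = np.column_stack([np.ones(N), unit_means(x)])
yb = unit_means(y)
coef_b, *_ = np.linalg.lstsq(Xb, yb, rcond=None)
resid_b = yb - Xb @ coef_b
s2_between = resid_b @ resid_b / (N - k - 1)
s2_v = max(0.0, s2_between - s2_eps / T)

# Plug the estimates into the quasi-demeaning fraction
theta_hat = 1.0 - np.sqrt(s2_eps / (s2_eps + T * s2_v))
print(s2_eps, s2_v, theta_hat)
```

The truncation at zero is a design choice: in small samples the difference can come out negative, in which case this sketch falls back to $\hat{\theta} = 0$, that is, pooled OLS.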

The above derivation presupposes that the $\varepsilon_{it}$ term is IID. Departures from this assumption (e.g., heteroskedasticity) have been analyzed, leading to a sizable body of literature (see, for example, Baltagi 2005 and Arellano 2003).

When is the random-effects estimator preferable to fixed effects? If the panel comprises observations on a fixed and relatively small set of units of interest (for example, the member states of the European Union), there is a presumption in favor of fixed effects, because it makes little sense to consider the $v_i$ terms as sampled from an underlying population: in the case of the European Union states, the sample and the population coincide, and even a thought experiment in which the units are different would be audacious. If, instead, the sample comprises observations on a large number of randomly selected individuals (as in many epidemiological and other longitudinal studies), there is a presumption in favor of random effects. Besides this general heuristic, however, certain statistical issues must be taken into account.

First, some panel data sets contain variables whose values are specific to the cross-sectional unit but which do not vary over time (for example, the gender of an individual). If such variables are to be included in the model, the fixed-effects option is simply not available. (When the fixed-effects model is implemented using the dummy variables approach, the trouble is that any time-invariant variables are perfectly collinear with the unit dummies.) Second, the random-effects estimator can be shown to be a matrix-weighted average of pooled OLS and the “between” estimator (a regression using the group means, and hence ignoring the intragroup variation). Suppose we have observations on *m* units and there are *k* independent variables of interest. If *k > m,* the “between” estimator is undefined—we have only *m* effective observations—and hence so is the random-effects estimator.
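The first point, perfect collinearity between a time-invariant variable and the unit dummies, can be seen in a tiny example. This is a sketch with made-up data: the panel dimensions and the 0/1 coding of the time-invariant variable are hypothetical.

```python
import numpy as np

n_units, n_periods = 4, 3                          # hypothetical panel dimensions
unit = np.repeat(np.arange(n_units), n_periods)
dummies = (unit[:, None] == np.arange(n_units)).astype(float)

# A time-invariant characteristic, coded 0/1 per unit and repeated each period
gender = np.array([0.0, 1.0, 1.0, 0.0])[unit]

# Design matrix with the time-invariant variable plus one dummy per unit
X = np.column_stack([gender, dummies])
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)   # number of columns exceeds the rank

# Equivalently, demeaning by unit wipes the variable out entirely
unit_mean = dummies.T @ gender / n_periods
gender_demeaned = gender - dummies @ unit_mean
print(np.allclose(gender_demeaned, 0.0))
```

Because the design matrix is rank deficient, the coefficient on the time-invariant variable is not identified separately from the unit intercepts.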

If one does not fall foul of one or other of these issues, the choice between fixed effects and random effects may be expressed as a tradeoff between robustness and efficiency.

The robustness of the fixed-effects approach stems from the fact that it makes no hypotheses regarding the differences in mean across the units, except that such differences exist. This estimator “always works,” but at the cost of not being able to estimate the effect of time-invariant regressors.

The richer hypothesis set of the random-effects estimator allows for estimation of the parameters for time-invariant regressors, and ensures that estimation of the parameters for time-varying regressors is performed more efficiently. But these advantages are tied to the validity of the additional hypotheses. If the individual effects are correlated with some of the explanatory variables, then the random-effects estimator is inconsistent, whereas fixed-effects estimates are still valid. It is on this principle that the “Hausman test” (Hausman 1978) is built: if the fixed- and random-effects estimates agree, to within the usual statistical margin of error, then there is no reason to believe the additional hypotheses are invalid, and as a consequence, no reason *not* to use the more efficient random-effects estimator.
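The comparison behind the test is commonly written as a quadratic form in the difference between the two sets of estimates. The notation here is an assumption introduced for illustration: $\hat{\beta}_{FE}$ and $\hat{\beta}_{RE}$ denote the fixed- and random-effects estimates of the coefficients on the time-varying regressors, and $\widehat{\mathrm{Var}}(\cdot)$ their estimated covariance matrices.

$$
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[ \widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE}) \right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})
$$

Under the null hypothesis that the individual effects are uncorrelated with the regressors, $H$ is asymptotically distributed as chi-square with degrees of freedom equal to the number of coefficients compared. A large value of $H$ indicates that the two sets of estimates disagree by more than sampling error would explain, which counts against the random-effects specification.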

**SEE ALSO** *Regression*

## BIBLIOGRAPHY

Arellano, Manuel. 2003. *Panel Data Econometrics*. Oxford: Oxford University Press.

Balestra, Pietro, and Marc Nerlove. 1966. Pooling Cross-Section and Time Series Data in the Estimation of a Dynamic Model: The Demand for Natural Gas. *Econometrica* 34 (3): 585–612.

Baltagi, Badi H. 2005. *Econometric Analysis of Panel Data*. 3rd ed. New York: Wiley.

Hausman, James A. 1978. Specification Tests in Econometrics. *Econometrica* 46: 1251–1271.

*Allin Cottrell*

*Riccardo “Jack” Lucchetti*
