Generalized Least Squares

Generalized least squares (GLS) is a method for fitting coefficients of explanatory variables that help to predict the outcomes of a dependent random variable. As its name suggests, GLS includes ordinary least squares (OLS) as a special case. GLS is also called “Aitken’s estimator,” after A. C. Aitken (1935). The principal motivation for generalizing OLS is the presence of covariance among the observations of the dependent variable or of different variances across these observations, conditional on the explanatory variables. Both phenomena lead to problems with statistical inference procedures commonly used with OLS. Most critically, the standard methods for estimating sampling variances and testing hypotheses become biased. In addition, the OLS-fitted coefficients are less precise than the GLS-fitted coefficients: they remain unbiased but have larger sampling variances.

In its simplest form, the linear model of statistics postulates the existence of a linear conditional expectation for a scalar, dependent random variable y given a set of non-random scalar explanatory variables {x₁, …, x_K}:

E[y] = β₁x₁ + … + β_K x_K

where the β_k, k = 1, …, K, are constant parameters for all values of the x_k. Interest focuses on estimating the β_k given a sample of N observations of y, denoted here by y₁, …, y_N, and corresponding observations of the x_k, denoted x_1k, …, x_Nk for each explanatory variable indexed by k. Using matrix notation, the linear conditional expectations for the sample are

E[y] = Xβ,

where y = [y₁, …, y_N]′ is an N × 1 column vector, X = [x_nk; n = 1, …, N, k = 1, …, K] is an N × K matrix, and β = [β₁, …, β_K]′ is a K × 1 column vector. It is generally assumed that the explanatory variables in X are not linearly dependent, so that N ≥ K and there is no α ∈ ℝ^K, α ≠ 0, such that Xα = 0.

In addition, the linear model assumes that the variances of the y_n are equal to a common, finite positive constant σ² and that the covariances among the y_n are equal to zero. In matrix notation, these assumptions assign to y a scalar variance-covariance matrix:

Var[y] = σ² · I

where I denotes an N × N identity matrix. The fundamental difference between such a linear model and one leading to generalized least squares is that the latter permits an unrestricted variance-covariance matrix, often denoted by

Var[y] = Σ

where Σ = [σ_mn; m, n = 1, …, N] is an N × N positive semidefinite matrix. In this extension of the linear model, the variances along the diagonal of Σ may vary across observations, and the covariances in the off-diagonal positions of Σ may be nonzero and may also vary across pairs of observations. In this essay, Σ is also assumed to be non-singular.

Many authors refer to the generalized model as the linear model with nonspherical errors. This term derives, in part, from viewing y as the sum of Xβ and an additional, unobserved variable that is an error term. Rather than making assumptions about the observable y and X as above, these writers make equivalent assumptions about the unobserved error term. The term nonspherical refers to the type of variance-covariance matrix possessed by the error term. Multivariate distributions with scalar variance-covariance matrices are often called spherical. This term can be traced to interpreting the set

{u ∈ ℝ^N | u′(σ² · I)⁻¹u = 1}

as an N-dimensional sphere (or spheroid) with radius σ. In the nonscalar case, the set

{u ∈ ℝ^N | u′Σ⁻¹u = 1}

is an N -dimensional ellipsoid and distributions with non-scalar variance-covariance matrices are called nonspherical. Hence, a linear regression accompanied by a nonscalar variance-covariance matrix may be called the case with nonspherical errors.

EXAMPLES

Leading examples motivating nonscalar variance-covariance matrices include heteroskedasticity and first-order autoregressive serial correlation. Under heteroskedasticity, the variances σ_nn differ across observations n = 1, …, N but the covariances σ_mn, m ≠ n, all equal zero. This occurs, for example, in the conditional distribution of individual income given years of schooling where high levels of schooling correspond to relatively high levels of the conditional variance of income. This heteroskedasticity is explained in part by the narrower range of job opportunities faced by people with low levels of schooling compared to those with high levels.

Serial correlation arises in time-series data where the observations are ordered sequentially by the time period of each observation; y_n is observed in the nth time period. First-order autoregressive (AR(1)) serial correlation occurs when deviations from means (also called errors) satisfy the linear model

E[y_n − E[y_n] | y₁, …, y_{n−1}] = ρ(y_{n−1} − E[y_{n−1}]), |ρ| < 1, (1)

while maintaining the assumption that the marginal variance of y_n equals a constant σ². Nonzero covariances of the form

Cov[y_n, y_{n−j}] = ρ^j σ², j = 1, 2, …,

are implied by the recursion

Cov[y_n, y_{n−j}] = ρ · Cov[y_{n−1}, y_{n−j}]. (2)

A time series of monthly unemployment rates exhibits such autoregressive serial correlation, reflecting unobserved social, economic, and political influences that change relatively slowly as months pass.
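As a small illustration of the AR(1) covariance pattern, the following sketch builds a variance-covariance matrix whose entries are assumed to take the form σ²ρ^|m−n| and checks the one-step recursion numerically. The function name `ar1_covariance` and the parameter values are hypothetical, chosen only for this example.

```python
import numpy as np

def ar1_covariance(n_obs, rho, sigma2=1.0):
    """Variance-covariance matrix with entries sigma2 * rho**|m - n|."""
    idx = np.arange(n_obs)
    return sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

# Hypothetical parameter values for illustration.
Sigma = ar1_covariance(5, rho=0.6, sigma2=2.0)

# Each covariance equals rho times the covariance one period closer in time.
recursion_holds = np.allclose(Sigma[2:, 0], 0.6 * Sigma[1:-1, 0])
```

The covariances shrink geometrically as the observations move further apart in time, which is the feature that distinguishes this case from the panel data example that follows.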

A second leading example of serial correlation occurs in panel data models, designed for datasets with two sampling dimensions, typically one cross-sectional and the other time-series. Repetitive testing of a cross-section of subjects in a laboratory gives this structure as do repeated surveys of a cross-section of households. Panel data models are usually expressed in an error components form:

y_nt = x′_nt β + α_n + ε_nt

where α_n and ε_nt are unobserved error terms with E[α_n] = E[ε_nt] = 0 and Var[α_n] = σ_α², Var[ε_nt] = σ_ε², and Cov[α_m, ε_nt] = 0, Cov[ε_nt, ε_js] = 0 for all m, n, j = 1, …, N, t, s = 1, …, T, and n ≠ j or s ≠ t. The α_n are individual effects that recur for all observations of a particular individual and they induce serial correlation:

Cov[y_mt, y_ns] = σ_α²

for m = n and t ≠ s. Unlike the AR(1) case, this covariance does not diminish as the time between observations increases. Instead, all of the observations for an individual are equally correlated.
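The equal-correlation structure can be made concrete with a short sketch that assembles the T × T covariance block for one individual. It assumes the individual effect has variance σ_α² and the idiosyncratic error has variance σ_ε²; the function name and the numbers are hypothetical.

```python
import numpy as np

def error_components_covariance(T, sigma2_alpha, sigma2_eps):
    """T x T covariance of one individual's observations y_n1, ..., y_nT."""
    return sigma2_eps * np.eye(T) + sigma2_alpha * np.ones((T, T))

block = error_components_covariance(T=4, sigma2_alpha=1.5, sigma2_eps=0.5)

# Every off-diagonal entry equals the variance of the individual effect,
# regardless of how far apart the two time periods are.
off_diagonal = block[~np.eye(4, dtype=bool)]
```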

Correlation also occurs in cross-sectional data. In the seemingly unrelated regressions (SUR) setting, there are several dependent variables and corresponding mean functions:

E[y_ng] = x′_ng β_g, g = 1, …, G.

Such dependent variables are typically related as different characteristics of a single experiment or observational unit. For example, the y_ng might be test scores for substantively different tests written by the same individual. Even after accounting for observable differences among the tests and test takers with x_ng, covariance among the test scores may reflect the influence of unobserved personal abilities that affect all of the tests taken by a particular person. Alternatively, the y_ng could be total income in countries during the same time period, so that neighboring countries possess similar underlying characteristics or face similar environments that induce covariance among their incomes.

STATISTICAL ISSUES

The general linear model motivates two principal issues with statistical inferences about β in the simpler linear model. First, hypothesis tests and estimators of sampling variances and confidence intervals developed under the linear model are biased when Σ is not scalar. Second, the OLS estimator for β generally will not be the minimum-variance linear unbiased estimator. The OLS estimator

β̂_OLS = (X′X)⁻¹X′y

is a linear (in y) and unbiased estimator even when Σ is not scalar. However, its sampling variance is

Var[β̂_OLS] = (X′X)⁻¹X′ΣX(X′X)⁻¹, (3)

which is generally not proportional to (X′X)⁻¹, an outcome implied by the simple linear model. When Σ is nonsingular, the GLS estimator

β̂_GLS = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y

is the minimum-variance linear and unbiased estimator. Its variance-covariance matrix is

Var[β̂_GLS] = (X′Σ⁻¹X)⁻¹.
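The efficiency comparison can be checked numerically. The sketch below assumes the usual sandwich form (X′X)⁻¹X′ΣX(X′X)⁻¹ for the OLS sampling variance and uses a hypothetical heteroskedastic Σ; it then verifies that the difference between the OLS and GLS variance matrices has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = np.column_stack([np.ones(N), rng.normal(size=N)])

# Hypothetical Sigma: variances that grow with the squared regressor.
Sigma = np.diag(0.5 + X[:, 1] ** 2)
Sigma_inv = np.linalg.inv(Sigma)

XtX_inv = np.linalg.inv(X.T @ X)
var_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv      # sandwich form
var_gls = np.linalg.inv(X.T @ Sigma_inv @ X)

# Var[OLS] - Var[GLS] should be positive semidefinite.
gap_eigenvalues = np.linalg.eigvalsh(var_ols - var_gls)
```

Nonnegative eigenvalues mean that every linear combination of the coefficients is estimated at least as precisely by GLS as by OLS.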

GLS can be understood as OLS applied to a linear model transformed to satisfy the scalar variance-covariance restriction. For every Σ, one can always find a matrix A such that Σ = AA' . We will give some examples shortly. Given such an A , it follows that

Var[A⁻¹y] = A⁻¹Var[y](A⁻¹)′ = A⁻¹Σ(A⁻¹)′ = I

or, in words, that ỹ = A⁻¹y has a scalar variance-covariance matrix. At the same time,

E[A⁻¹y] = A⁻¹E[y] = A⁻¹Xβ

so that the expectation of the transformed y has corresponding transformed explanatory variables X̃ = A⁻¹X. Applying OLS to estimate β with the transformed variables yields the GLS estimator:

β̂_GLS = (X̃′X̃)⁻¹X̃′ỹ = (X′(A⁻¹)′A⁻¹X)⁻¹X′(A⁻¹)′A⁻¹y = (X′Σ⁻¹X)⁻¹X′Σ⁻¹y

because Σ⁻¹ = (A⁻¹)′A⁻¹. In a similar fashion, one sees that the OLS criterion function is transformed into the GLS criterion function:

(ỹ − X̃b)′(ỹ − X̃b) = (y − Xb)′Σ⁻¹(y − Xb).
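One way to see the equivalence in practice is to take A as the Cholesky factor of Σ, which is one valid choice satisfying Σ = AA′ among many. The sketch below, with simulated data and a hypothetical positive definite Σ, compares OLS on the transformed variables with the direct GLS formula.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 40, 3
X = rng.normal(size=(N, K))

# Hypothetical positive definite Sigma, built only for illustration.
B = rng.normal(size=(N, N))
Sigma = B @ B.T + N * np.eye(N)

A = np.linalg.cholesky(Sigma)                    # Sigma = A A'
y = X @ np.array([1.0, -2.0, 0.5]) + A @ rng.normal(size=N)

# Transform the data with A^{-1} and apply OLS.
y_tilde = np.linalg.solve(A, y)
X_tilde = np.linalg.solve(A, X)
beta_transformed = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

# Direct GLS formula (X' Sigma^-1 X)^-1 X' Sigma^-1 y.
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```

Solving triangular systems with A avoids forming Σ⁻¹ explicitly, which is usually the numerically safer route when N is large.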

Heteroskedasticity produces a simple example. To produce observations with equal variances, each data point is divided by the standard deviation σ_n = √σ_nn:

ỹ_n = y_n / σ_n, x̃_nk = x_nk / σ_n, k = 1, …, K.

This corresponds to choosing A⁻¹ equal to a diagonal matrix with the reciprocals of these standard deviations arrayed along its diagonal. The estimation criterion function is

∑_{n=1}^{N} (y_n − x′_n b)² / σ_n²

where x′_n denotes the nth row of X. This criterion is a weighted sum of squared residuals. For this reason, in this special case GLS is often called weighted least squares (WLS). WLS puts most weight on the observations with the smallest variances, showing how GLS improves upon OLS, which puts equal weight on all observations. Observations y_n for which σ_n is relatively small tend to lie closest to their means E[y_n] and, hence, are more informative about β.
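A minimal sketch of weighted least squares follows, assuming the standard deviations σ_n are known. The helper name `weighted_least_squares` and the simulated data are hypothetical; the check compares the result with the weighted normal equations.

```python
import numpy as np

def weighted_least_squares(X, y, sigma):
    """OLS after dividing each observation by its standard deviation sigma[n]."""
    X_tilde = X / sigma[:, None]
    y_tilde = y / sigma
    return np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]

rng = np.random.default_rng(2)
N = 200
X = np.column_stack([np.ones(N), rng.uniform(0.0, 10.0, size=N)])
sigma = 0.5 + 0.3 * X[:, 1]        # hypothetical: noise grows with the regressor
y = X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=N)

beta_wls = weighted_least_squares(X, y, sigma)

# The same estimate from the weighted normal equations X' W X b = X' W y,
# with weights equal to the reciprocal variances.
W = np.diag(1.0 / sigma ** 2)
beta_check = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```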

Faced with AR(1) serial correlation in a time series, the appropriate choice of A transforms each data point (except the first) into differences:

ỹ_n = y_n − ρy_{n−1},

x̃_nk = x_nk − ρx_{n−1,k}, k = 1, …, K.

These transformed ỹ_n display zero covariances: for m < n,

Cov[ỹ_n, ỹ_m] = Cov[y_n, y_m] − ρCov[y_{n−1}, y_m] − ρCov[y_n, y_{m−1}] + ρ²Cov[y_{n−1}, y_{m−1}] = 0

using (2) for the first and third terms on the right-hand side, so that the first term cancels the second and the third cancels the fourth. This transformation uncovers the new or additional information available in each observation, whereas OLS treats highly correlated observations the same way as uncorrelated observations, giving the former relatively too much weight in that estimator.
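The effect of this differencing can be verified directly on the covariance matrix. The sketch below assumes AR(1) covariances of the form σ²ρ^|m−n| with hypothetical parameter values, applies the transformation to observations 2 through N, and inspects the covariance matrix of the transformed observations.

```python
import numpy as np

N, rho, sigma2 = 6, 0.7, 1.0       # hypothetical values
idx = np.arange(N)
Sigma = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

# Row i of D forms y_{i+1} - rho * y_i (zero-based), i.e. observations 2..N.
D = np.eye(N)[1:] - rho * np.eye(N)[:-1]
transformed_cov = D @ Sigma @ D.T

# Expected result: a scalar matrix with variance sigma2 * (1 - rho^2).
expected = sigma2 * (1.0 - rho ** 2) * np.eye(N - 1)
```

The first observation is left out here, as in the text; full GLS treats it separately by rescaling it so that its variance matches the others.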

The panel data model has a simple GLS transformation as well:

ỹ_nt = y_nt − (1 − ω)ȳ_n,

x̃_ntk = x_ntk − (1 − ω)x̄_nk, k = 1, …, K

where ȳ_n and x̄_nk are the individual averages over time of y_nt and x_ntk, respectively, and

ω = √(σ_ε² / (σ_ε² + Tσ_α²)).

If there is no serial correlation, then σ_α = 0 and ỹ_nt = y_nt. Conversely, the greater σ_α is, the more important the individual average ȳ_n becomes. Like the AR(1) case, a weighted difference removes the covariance among the original y_nt. In this case, however, a common time-series sample average appears in every difference, reflecting the equal covariance structure.
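The following sketch checks the panel transformation for a single individual. It assumes the weight ω = √(σ_ε² / (σ_ε² + Tσ_α²)), the standard random-effects choice, and hypothetical variance values; the transformed observations should then have covariance matrix σ_ε² · I.

```python
import numpy as np

T, sigma2_alpha, sigma2_eps = 5, 2.0, 1.0     # hypothetical values
ones = np.ones((T, 1))

# Covariance of one individual's T observations under error components.
Sigma_n = sigma2_eps * np.eye(T) + sigma2_alpha * (ones @ ones.T)

omega = np.sqrt(sigma2_eps / (sigma2_eps + T * sigma2_alpha))

# Subtract the fraction (1 - omega) of the individual time average.
P = np.eye(T) - (1.0 - omega) * (ones @ ones.T) / T
transformed_cov = P @ Sigma_n @ P.T
```

With σ_α² = 0 the weight ω equals one and the transformation leaves the data unchanged; as Tσ_α² grows, ω approaches zero and the transformation approaches full demeaning.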

Note that the GLS estimator is an instrumental variables (IV) estimator,

β̂_IV = (Z′X)⁻¹Z′y,

for an N × K matrix Z of instrumental variables such that Z′X is invertible. For GLS, Z = Σ⁻¹X. Researchers use instrumental variables estimators to overcome omission of explanatory variables in models of the form

y = x′β + ε

where ε is an unobserved term. Even though E[ε] = 0, correlation between the explanatory variables in x and ε biases β̂_OLS, and the IV estimator is employed to overcome this bias by using instrumental variables, the variables in Z, that are uncorrelated with ε yet correlated with the explanatory variables. In some cases of the linear model, the GLS estimator provides such instrumental variables. If, for example, x_n includes the lagged value of y_n in a time-series application, then residual serial correlation usually invalidates the OLS estimator while GLS still produces an estimator for β.

In the panel data setting, particular concern about the behavior of the unobserved individual effect α_n has led researchers to compare the GLS estimator with another IV estimator. The concern is that the expected value of α_n may vary with some of the observed explanatory variables in x_nt. Various observable characteristics of individuals or households are typically correlated, so that one would expect the unobserved characteristics captured in α_n to be correlated with the observed characteristics in x_nt as well. In this situation, the OLS- and GLS-fitted coefficients are not estimators for β because these fitted coefficients pick up the influence of the α_n omitted as explanatory variables. An IV estimator of β that is robust to such correlation is the so-called fixed effects estimator. This estimator is often described as the OLS fit of y_nt − ȳ_n to the explanatory variables x_ntk − x̄_nk, k = 1, …, K, but an equivalent IV estimator uses the instrumental variables z_ntk = x_ntk − x̄_nk. In the special case when ω = 0, the fixed effects and GLS estimators are equal. The GLS estimator is often called the random effects estimator in this context, and the difference between the fixed-effects and random-effects estimators is often used as a diagnostic test for the reliability of GLS estimation (Hausman 1978).
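A simulated sketch can illustrate why the fixed effects estimator is robust here. The data-generating process below is hypothetical: the single regressor is constructed to be correlated with the individual effect, so pooled OLS picks up the omitted effect while the regression on deviations from individual means does not.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, beta = 100, 5, 2.0
alpha = rng.normal(size=N)

# Hypothetical design: the regressor is correlated with the individual effect.
x = alpha[:, None] + rng.normal(size=(N, T))
y = beta * x + 3.0 * alpha[:, None] + 0.1 * rng.normal(size=(N, T))

# Pooled OLS with an intercept ignores the individual effect.
X_pooled = np.column_stack([np.ones(N * T), x.ravel()])
beta_pooled = np.linalg.lstsq(X_pooled, y.ravel(), rcond=None)[0][1]

# Fixed effects: regress deviations from individual means on each other.
x_dev = x - x.mean(axis=1, keepdims=True)
y_dev = y - y.mean(axis=1, keepdims=True)
beta_fe = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
```

Demeaning removes α_n exactly because it is constant within each individual, which is why the slope from the deviations stays close to the true coefficient while the pooled slope does not.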

The OLS and GLS estimators are equal for a general Σ if the GLS instrument matrix Σ⁻¹X produces the same set of fitted values as the explanatory variable matrix X. Formally, β̂_OLS = β̂_GLS if and only if every vector Xα, α ∈ ℝ^K, equals Σ⁻¹Xγ for some γ ∈ ℝ^K, and vice versa. A practical situation in which this occurs approximately is when AR(1) serial correlation is accompanied by explanatory variables that are powers of n or trigonometric functions of n. Another example arises when all covariances are equal (and not necessarily zero) and the regression function includes an intercept (or constant term), as it usually does. A third example is the case of SUR where the explanatory variables are identical for all equations, so that x_ng = x_n, g = 1, …, G.
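The equal-covariance example can be confirmed numerically. The sketch below uses hypothetical values (variance 1.0, common covariance 0.4) and a design matrix that includes a column of ones, then computes both estimators on the same simulated data.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # includes intercept
y = rng.normal(size=N)

# Equal variances (1.0) and equal covariances (0.4), hypothetical values.
Sigma = 0.6 * np.eye(N) + 0.4 * np.ones((N, N))
Sigma_inv = np.linalg.inv(Sigma)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```

The two coefficient vectors agree because Σ⁻¹X differs from a multiple of X only by a multiple of the constant vector, which already lies in the column space of X.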

FEASIBLE METHODS

Feasible inference for β in the general linear model typically must overcome the fact that Σ is unknown. There are two popular strategies: (1) to specify Σ as a function of a few parameters that can be replaced with estimators, and (2) to use heteroskedasticity-consistent variance estimators.

The AR(1) serial correlation model illustrates the first approach. A natural estimator for the autocorrelation parameter ρ is the fitted OLS coefficient ρ̂ for predicting the OLS-fitted residual y_n − x′_n β̂_OLS with the single explanatory variable y_{n−1} − x′_{n−1}β̂_OLS, the lagged OLS-fitted residual:

ρ̂ = ∑_{n=2}^{N} ê_n ê_{n−1} / ∑_{n=2}^{N} ê_{n−1}², where ê_n = y_n − x′_n β̂_OLS.

Under certain conditions, this ρ̂ can replace ρ in Σ(ρ) to estimate the variance-covariance matrix of β̂_OLS, as in

(X′X)⁻¹X′Σ̂X(X′X)⁻¹

where Σ̂ = Σ(ρ̂), or to compute the feasible GLS (FGLS) estimator

β̂_FGLS = (X′Σ̂⁻¹X)⁻¹X′Σ̂⁻¹y.

Similarly, one estimates the variance-covariance matrix of β̂_FGLS with (X′Σ̂⁻¹X)⁻¹. In large samples, the differences between the feasible and infeasible versions are negligible. In small samples, many researchers use an estimator that requires iterative calculations to find a ρ̂ and β̂ that are mutually consistent: The fitted residuals produced by β̂ yield ρ̂, and the variance-covariance matrix produced by ρ̂ yields β̂ as the fitted FGLS coefficients. Maximum likelihood estimators, based on an additional assumption that the y_n possess a joint multivariate normal distribution, are leading examples of such estimators.
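The two-step procedure can be sketched as follows, assuming Σ(ρ) has entries proportional to ρ^|m−n| (the scale σ² cancels in the GLS formula). The function name `ar1_fgls` and the simulated data are hypothetical, and this is a single pass rather than the iterated version mentioned above.

```python
import numpy as np

def ar1_fgls(X, y):
    """Two-step feasible GLS under an assumed AR(1) error structure."""
    beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_ols
    # Regress the residual on its own lag (no intercept) to estimate rho.
    rho_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
    idx = np.arange(len(y))
    # Sigma(rho_hat) up to the scale sigma^2, which cancels below.
    Sigma_hat = rho_hat ** np.abs(idx[:, None] - idx[None, :])
    Si_X = np.linalg.solve(Sigma_hat, X)               # Sigma_hat^{-1} X
    beta_fgls = np.linalg.solve(X.T @ Si_X, Si_X.T @ y)
    return beta_fgls, rho_hat

# Simulated data with true rho = 0.6 and coefficients (1, 2).
rng = np.random.default_rng(5)
N, rho = 500, 0.6
X = np.column_stack([np.ones(N), rng.normal(size=N)])
innov = rng.normal(size=N)
errors = np.empty(N)
errors[0] = innov[0] / np.sqrt(1.0 - rho ** 2)         # stationary start
for n in range(1, N):
    errors[n] = rho * errors[n - 1] + innov[n]
y = X @ np.array([1.0, 2.0]) + errors

beta_fgls, rho_hat = ar1_fgls(X, y)
```

Repeating the residual and ρ̂ steps with β̂_FGLS in place of β̂_OLS until the values settle gives one version of the iterative estimator described in the text.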

We will use the pure heteroskedasticity case to illustrate heteroskedasticity-consistent variance estimators. The unknown term in Var[β̂_OLS] (shown in (3)) can be written as a sample average:

(1/N) X′ΣX = (1/N) ∑_{n=1}^{N} σ_n² x_n x′_n

where σ_n² = σ_nn, the nth diagonal element of Σ. In a heteroskedasticity-consistent variance estimator this average is replaced by

(1/N) ∑_{n=1}^{N} (y_n − x′_n β̂_OLS)² x_n x′_n

so that the unknown variances are replaced by the squared OLS fitted residuals. Such estimators do not require a parametric model for Σ and, hence, are more widely applicable. Their justification rests, in part, on

E[(y_n − x′_n β)²] = σ_n²

so that one can show that

(1/N) ∑_{n=1}^{N} (y_n − x′_n β)² x_n x′_n

is a valid estimator for (1/N) X′ΣX. The feasible heteroskedasticity-consistent variance estimator replaces the unknown β with its estimator β̂_OLS. This variance-covariance estimator is often called the “Eicker-White estimator,” for Friedhelm Eicker and Halbert White.
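A compact sketch of the Eicker-White calculation follows. It implements the basic form without small-sample corrections, in which the squared OLS residuals stand in for the unknown variances inside the sandwich; the function name and the simulated data are hypothetical.

```python
import numpy as np

def eicker_white_covariance(X, y):
    """Heteroskedasticity-consistent covariance estimate for the OLS coefficients."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y
    resid = y - X @ beta_ols
    # Middle term: sum over n of resid_n^2 * x_n x_n'.
    meat = (X * resid[:, None] ** 2).T @ X
    return XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(6)
N = 100
X = np.column_stack([np.ones(N), rng.normal(size=N)])
# Hypothetical heteroskedastic errors whose spread depends on the regressor.
y = X @ np.array([0.5, 1.5]) + np.abs(X[:, 1]) * rng.normal(size=N)

cov_hc = eicker_white_covariance(X, y)
std_errors = np.sqrt(np.diag(cov_hc))
```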

The heteroskedasticity-consistent variance estimator does not yield a direct counterpart to β̂_FGLS. Nevertheless, estimators that dominate OLS are available. The transformed linear model

E[Z′y] = Z′Xβ

has a corresponding variance-covariance matrix

Var[Z′y] = Z′ΣZ

which has a heteroskedasticity-consistent counterpart

Ŝ = ∑_{n=1}^{N} (y_n − x′_n β̂_OLS)² z_n z′_n

and the FGLS analogue

β̂ = (X′Z Ŝ⁻¹ Z′X)⁻¹ X′Z Ŝ⁻¹ Z′y,

where z′_n denotes the nth row of Z. This estimator reduces to OLS if Z = X and produces superior estimators to the extent that Σ^{1/2}Z provides a better linear predictor of Σ^{−1/2}X than Σ^{1/2}X does.

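The heteroskedasticity-consistent variance estimator has been extended to cover time-series cases with nonzero covariances as well. For example, if only first-order covariances are nonzero then

(1/N) X′ΣX = (1/N) ∑_{n=1}^{N} σ_nn x_n x′_n + (1/N) ∑_{n=2}^{N} σ_{n,n−1} (x_n x′_{n−1} + x_{n−1} x′_n)

because σ_{n,n−j} = 0 for j > 1. This term in the OLS variance-covariance matrix can be estimated by

(1/N) ∑_{n=1}^{N} ê_n² x_n x′_n + (1/N) ∑_{n=2}^{N} ê_n ê_{n−1} (x_n x′_{n−1} + x_{n−1} x′_n), where ê_n = y_n − x′_n β̂_OLS,

a heteroskedasticity and autocorrelation consistent (HAC) variance-covariance matrix estimator. This works because the second average behaves much like the first in that

E[(y_n − x′_n β)(y_{n−1} − x′_{n−1} β)] = σ_{n,n−1}

so that one can show that

(1/N) ∑_{n=2}^{N} (y_n − x′_n β)(y_{n−1} − x′_{n−1} β)(x_n x′_{n−1} + x_{n−1} x′_n)

is an estimator for the second term.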

One can extend the HAC approach to cover m-dependence in which only covariances up to mth order are nonzero for a finite m. However, in practice m should be small relative to the number of observations N. To illustrate the difficulties with large m, consider setting m = N − 1 so that all of the covariances in Σ are replaced by a product of OLS-fitted residuals. Then this approach yields the estimator

X′(y − Xβ̂_OLS)(y − Xβ̂_OLS)′X

which is the outer product of the K × 1 column vector X′(y − Xβ̂_OLS). It follows that this matrix has a rank of at most one (indeed, the OLS normal equations make this vector zero), contradicting the property that X′ΣX has a rank of K. Nevertheless, the heteroskedasticity-consistent variance-covariance estimator has been generalized to cover situations where all of the covariances may be nonzero. The Newey-West estimator is a popular choice:

Ω̂₀ + ∑_{j=1}^{m} w(j, m)(Ω̂_j + Ω̂′_j)

where

Ω̂_j = (1/N) ∑_{n=j+1}^{N} ê_n ê_{n−j} x_n x′_{n−j}, ê_n = y_n − x′_n β̂_OLS,

and

w(j, m) = 1 − j/(m + 1).

The supporting approximate distribution theory requires m to depend on the sample size N and methods for choosing m are available.
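The Newey-West calculation can be sketched in a few lines, assuming the usual linearly declining (Bartlett) weights 1 − j/(m + 1) applied to residual cross-products at lags j = 1, …, m. The function name `newey_west_meat`, the lag choice m = 4, and the simulated data are hypothetical.

```python
import numpy as np

def newey_west_meat(X, resid, m):
    """Bartlett-weighted estimate of (1/N) X' Sigma X using m lags."""
    N = X.shape[0]
    Xe = X * resid[:, None]                 # row n is resid_n * x_n'
    S = Xe.T @ Xe / N                       # lag-0 (heteroskedasticity) term
    for j in range(1, m + 1):
        # (1/N) * sum over n of resid_n resid_{n-j} x_n x_{n-j}'
        Omega_j = Xe[j:].T @ Xe[:-j] / N
        S += (1.0 - j / (m + 1.0)) * (Omega_j + Omega_j.T)
    return S

rng = np.random.default_rng(7)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=N)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

S_hac = newey_west_meat(X, resid, m=4)
S_hc = newey_west_meat(X, resid, m=0)       # reduces to the Eicker-White term
```

The declining weights are what keep the estimate positive semidefinite, in contrast to the unweighted sum with m = N − 1 discussed above.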

Often statistical inference for β based upon estimation of Σ or X′ΣX can treat these terms as equal to the objects that they estimate. For example, the statistical distribution theory typically shows that

Q = (Rβ̂_GLS − Rβ)′(R(X′Σ⁻¹X)⁻¹R′)⁻¹(Rβ̂_GLS − Rβ)

is approximately (or exactly) distributed as a chi-squared random variable. This pivotal statistic yields a hypothesis test or confidence interval for Rβ. In large samples,

Q̂ = (Rβ̂_FGLS − Rβ)′(R(X′Σ̂⁻¹X)⁻¹R′)⁻¹(Rβ̂_FGLS − Rβ)

may be treated as an equivalent statistic. Researchers have shown that bootstrap methods, appropriately applied, can provide better probability approximations in situations with small sample sizes.
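As a small worked example of the quadratic form, the sketch below evaluates the statistic for a hypothesized value r of Rβ. The function name `wald_statistic` and all numbers are hypothetical; `cov_beta` stands in for an estimate of (X′Σ⁻¹X)⁻¹.

```python
import numpy as np

def wald_statistic(beta_hat, cov_beta, R, r):
    """Q = (R b - r)' [R V R']^{-1} (R b - r) for the hypothesis R beta = r."""
    diff = R @ beta_hat - r
    return float(diff @ np.linalg.solve(R @ cov_beta @ R.T, diff))

beta_hat = np.array([1.0, 2.0])
cov_beta = 0.25 * np.eye(2)        # hypothetical estimate of (X' Sigma^-1 X)^-1
R = np.array([[0.0, 1.0]])         # picks out the second coefficient
r = np.array([1.0])                # hypothesized value of R beta

# Here Q = (2 - 1)^2 / 0.25 = 4.0
Q = wald_statistic(beta_hat, cov_beta, R, r)
```

Under the approximation described in the text, Q would be compared with a chi-squared distribution whose degrees of freedom equal the number of rows of R.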

SEE ALSO Autoregressive Models; Bootstrap Method; Covariance; Heteroskedasticity; Least Squares, Ordinary; Least Squares, Two-Stage; Residuals; Serial Correlation; Specification; Variance

BIBLIOGRAPHY

Aitken, A. C. 1935. On Least Squares and Linear Combination of Observations. Proceedings of the Royal Society of Edinburgh 55: 42–48.

Cragg, John G. 1983. More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form. Econometrica 51 (3): 751–764.

Eicker, Friedhelm. 1967. Limit Theorems for Regressions with Unequal and Dependent Errors. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, ed. Lucien Le Cam and Jerzy Neyman, 59–82. Berkeley: University of California Press.

Hausman, Jerry A. 1978. Specification Tests in Econometrics. Econometrica 46 (6): 1251–1272.

Newey, Whitney K., and Kenneth D. West. 1987. A Simple, Positive Semi-definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica 55 (3): 703–708.

White, Halbert. 1980. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 48 (4): 817–838.

Paul A. Ruud

International Encyclopedia of the Social Sciences