Generalized Least Squares
Generalized least squares (GLS) is a method for fitting coefficients of explanatory variables that help to predict the outcomes of a dependent random variable. As its name suggests, GLS includes ordinary least squares (OLS) as a special case. GLS is also called “Aitken’s estimator,” after A. C. Aitken (1935). The principal motivation for generalizing OLS is the presence of covariance among the observations of the dependent variable or of different variances across these observations, conditional on the explanatory variables. Both phenomena lead to problems with statistical inference procedures commonly used with OLS. Most critically, the standard methods for estimating sampling variances and testing hypotheses become biased. In addition, the OLS-fitted coefficients are less precise than the GLS-fitted coefficients.
In its simplest form, the linear model of statistics postulates the existence of a linear conditional expectation for a scalar, dependent random variable y given a set of nonrandom scalar explanatory variables {x _{1}, …, x _{K}}:
E[y ] = β _{1}x _{1} + … + β_{K}x _{K}
where the β_{k}, k = 1, …, K, are constant parameters for all values of the x_{k}. Interest focuses on estimating the β_{k} given a sample of N observations of y, denoted here by y_{1}, …, y_{N}, and corresponding observations of the x_{k}, denoted x_{1k}, …, x_{Nk} for each explanatory variable indexed by k. Using matrix notation, the linear conditional expectations for the sample are
E[y ] = Xβ ,
where y = [y_{1}, …, y_{N}]′ is an N × 1 column vector, X = [x_{nk}; n = 1, …, N, k = 1, …, K] is an N × K matrix, and β = [β_{1}, …, β_{K}]′ is a K × 1 column vector. It is generally assumed that the explanatory variables in X are not linearly dependent, so that N ≥ K and there is no α ∈ ℝ^{K}, α ≠ 0, such that Xα = 0.
In addition, the linear model assumes that the variances of the y_{n} are equal to a common, finite positive constant σ^{2} and that the covariances among the y_{n} are equal to zero. In matrix notation, these assumptions assign to y a scalar variance-covariance matrix:
Var[y] = σ^{2} · I
where I denotes an N × N identity matrix. The fundamental difference between such a linear model and one leading to generalized least squares is that the latter permits an unrestricted variance-covariance matrix, often denoted by
Var[y ] = Σ
where Σ = [σ_{mn}; m, n = 1, …, N] is an N × N positive semidefinite matrix. In this extension of the linear model, the variances along the diagonal of Σ may vary across observations, and the covariances in the off-diagonal positions of Σ may be nonzero and may also vary across pairs of observations. In this essay, Σ is also assumed to be nonsingular.
Many authors refer to the generalized model as the linear model with nonspherical errors. This term derives, in part, from viewing y as the sum of Xβ and an additional, unobserved variable that is an error term. Rather than making assumptions about the observable y and X as above, these writers make equivalent assumptions about the unobserved error term. The term nonspherical refers to the type of variance-covariance matrix possessed by the error term. Multivariate distributions with scalar variance-covariance matrices are often called spherical. This term can be traced to interpreting the set
{u ∈ ℝ^{N} | u′(σ^{2} · I)^{−1}u = 1}
as an N-dimensional sphere (or spheroid) with radius σ. In the nonscalar case, the set
{u ∈ ℝ^{N} | u′Σ^{−1}u = 1}
is an N-dimensional ellipsoid, and distributions with nonscalar variance-covariance matrices are called nonspherical. Hence, a linear regression accompanied by a nonscalar variance-covariance matrix may be called the case with nonspherical errors.
EXAMPLES
Leading examples motivating nonscalar variance-covariance matrices include heteroskedasticity and first-order autoregressive serial correlation. Under heteroskedasticity, the variances σ_{nn} differ across observations n = 1, …, N but the covariances σ_{mn}, m ≠ n, all equal zero. This occurs, for example, in the conditional distribution of individual income given years of schooling, where high levels of schooling correspond to relatively high levels of the conditional variance of income. This heteroskedasticity is explained in part by the narrower range of job opportunities faced by people with low levels of schooling compared to those with high levels.
Serial correlation arises in time-series data where the observations are ordered sequentially by the time period of each observation; y_{n} is observed in the nth time period. First-order autoregressive (AR(1)) serial correlation occurs when deviations from means (also called errors) satisfy the linear model
E[y_{n} | y_{1}, …, y_{n−1}] − E[y_{n}] = ρ (y_{n−1} − E[y_{n−1}]) (1)
while maintaining the assumption that the marginal variance of y_{n} equals a constant σ^{2}. Nonzero covariances of the form
Cov[y_{n}, y_{n−j}] = ρ^{j} σ^{2}, j = 1, …, n − 1,
are implied by the recursion
Cov[y_{n}, y_{n−j}] = ρ · Cov[y_{n−1}, y_{n−j}]. (2)
A time series of monthly unemployment rates exhibits such autoregressive serial correlation, reflecting unobserved social, economic, and political influences that change relatively slowly as months pass.
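As a minimal sketch of the AR(1) covariance structure, the following NumPy fragment builds a matrix whose (m, n) entry is σ^{2}ρ^{|m − n|} and checks numerically that each covariance equals ρ times the covariance one period earlier. The function name `ar1_covariance` and the values of ρ and σ^{2} are illustrative assumptions, not part of the original text.

```python
import numpy as np

def ar1_covariance(n_obs, rho, sigma2=1.0):
    """Return the N x N matrix with (m, n) entry sigma2 * rho**|m - n|."""
    idx = np.arange(n_obs)
    lags = np.abs(idx[:, None] - idx[None, :])
    return sigma2 * rho ** lags

rho, sigma2 = 0.6, 2.0          # illustrative values
Sigma = ar1_covariance(5, rho, sigma2)

# Check the recursion Cov[y_n, y_k] = rho * Cov[y_{n-1}, y_k] for every k before n
# (the strictly lower triangle of Sigma).
rows, cols = np.tril_indices(5, k=-1)
recursion_holds = np.allclose(Sigma[rows, cols], rho * Sigma[rows - 1, cols])
```

The diagonal holds the common variance σ^{2}, and the covariances decay geometrically with the distance between time periods.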
A second leading example of serial correlation occurs in panel data models, designed for datasets with two sampling dimensions, typically one cross-sectional and the other time-series. Repetitive testing of a cross-section of subjects in a laboratory gives this structure, as do repeated surveys of a cross-section of households. Panel data models are usually expressed in an error components form:
y_{nt} = x′_{nt}β + α_{n} + ε_{nt}
where α_{n} and ε_{nt} are unobserved error terms with E[α_{n}] = E[ε_{nt}] = 0 and Var[α_{n}] = σ_{α}^{2}, Var[ε_{nt}] = σ_{ε}^{2}, and Cov[α_{m}, ε_{nt}] = 0, Cov[ε_{nt}, ε_{js}] = 0 for all m, n, j = 1, …, N and t, s = 1, …, T with n ≠ j or s ≠ t. The α_{n} are individual effects that recur for all observations of a particular individual and they induce serial correlation:
Cov[y_{mt}, y_{ns}] = σ_{α}^{2}
for m = n and t ≠ s. Unlike the AR(1) case, this covariance does not diminish as the time between observations increases. Instead, all of the observations for an individual are equally correlated.
Correlation also occurs in cross-sectional data. In the seemingly unrelated regressions (SUR) setting, there are several dependent variables and corresponding mean functions:
E[y_{ng}] = x′_{ng}β_{g}, g = 1, …, G.
Such dependent variables are typically related as different characteristics of a single experiment or observational unit. For example, the y _{ng} might be test scores for substantively different tests written by the same individual. Even after accounting for observable differences among the tests and test takers with x _{ng}, covariance among the test scores may reflect the influence of unobserved personal abilities that affect all of the tests taken by a particular person. Alternatively, the y _{ng} could be total income in countries during the same time period so that neighboring states possess similar underlying characteristics or face similar environments that induce covariance among their incomes.
STATISTICAL ISSUES
The general linear model motivates two principal issues with statistical inferences about β in the simpler linear model. First, hypothesis tests and estimators of sampling variances and confidence intervals developed under the linear model are biased when Σ is not scalar. Second, the OLS estimator for β generally will not be the minimum-variance linear unbiased estimator. The OLS estimator
β̂_{OLS} = (X′X)^{−1} X′y
is a linear (in y) and unbiased estimator when Σ is not scalar. However, its sampling variance is
Var[β̂_{OLS}] = (X′X)^{−1} X′ΣX (X′X)^{−1} (3)
which is generally not proportional to (X′X)^{−1}, an outcome implied by the simple linear model. When Σ is nonsingular, the GLS estimator
β̂_{GLS} = (X′Σ^{−1}X)^{−1} X′Σ^{−1}y
is the minimum-variance linear and unbiased estimator. Its variance-covariance matrix is
Var[β̂_{GLS}] = (X′Σ^{−1}X)^{−1}.
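The comparison between the two estimators can be sketched numerically. The fragment below, written under the assumption of a known diagonal Σ with a made-up variance function, computes the OLS and GLS coefficients together with their sampling variances, (X′X)^{−1}X′ΣX(X′X)^{−1} and (X′Σ^{−1}X)^{−1}, and examines the eigenvalues of their difference. All data and parameter values are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 2.0])                  # illustrative true coefficients
sigma2_n = np.exp(X[:, 1])                   # hypothetical variance function
Sigma = np.diag(sigma2_n)
y = X @ beta + rng.normal(size=N) * np.sqrt(sigma2_n)

# OLS estimator and its sandwich-form sampling variance under a general Sigma.
XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y
var_ols = XtX_inv @ X.T @ Sigma @ X @ XtX_inv

# GLS estimator and its sampling variance.
Sigma_inv = np.linalg.inv(Sigma)
var_gls = np.linalg.inv(X.T @ Sigma_inv @ X)
beta_gls = var_gls @ X.T @ Sigma_inv @ y

# Minimum variance means Var[OLS] - Var[GLS] is positive semidefinite,
# so these eigenvalues should be nonnegative up to rounding error.
gap_eigenvalues = np.linalg.eigvalsh(var_ols - var_gls)
```

Forming the N × N inverse explicitly is acceptable for a small illustration; for large N one would normally exploit the structure of Σ instead.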
GLS can be understood as OLS applied to a linear model transformed to satisfy the scalar variancecovariance restriction. For every Σ, one can always find a matrix A such that Σ = AA' . We will give some examples shortly. Given such an A , it follows that
Var[A^{−1}y] = A^{−1} Var[y] A^{−1}′ = A^{−1} Σ A^{−1}′ = I
or, in words, that ỹ = A^{−1}y has a scalar variance-covariance matrix. At the same time,
E[A^{−1}y] = A^{−1}E[y] = A^{−1}Xβ
so that the expectation of the transformed y has corresponding transformed explanatory variables X̃ = A^{−1}X. Applying OLS to estimate β with the transformed variables yields the GLS estimator:
(X̃′X̃)^{−1} X̃′ỹ = (X′A^{−1}′A^{−1}X)^{−1} X′A^{−1}′A^{−1}y = (X′Σ^{−1}X)^{−1} X′Σ^{−1}y = β̂_{GLS}
because Σ^{−1} = (A^{−1})′A^{−1}. In a similar fashion, one sees that the OLS criterion function is transformed into the GLS criterion function:
(ỹ − X̃b)′(ỹ − X̃b) = (y − Xb)′Σ^{−1}(y − Xb).
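A short numerical check of this equivalence may be helpful. In the sketch below, the matrix A is taken to be the Cholesky factor of Σ, which is one convenient choice among many satisfying Σ = AA′; the data and the positive definite Σ are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 50, 3
X = rng.normal(size=(N, K))
y = rng.normal(size=N)
B = rng.normal(size=(N, N))
Sigma = B @ B.T + N * np.eye(N)       # an arbitrary positive definite matrix

A = np.linalg.cholesky(Sigma)          # lower triangular, Sigma = A A'
y_t = np.linalg.solve(A, y)            # transformed y: A^{-1} y
X_t = np.linalg.solve(A, X)            # transformed X: A^{-1} X

# OLS on the transformed data ...
beta_transformed, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)

# ... against the GLS formula on the original data.
Sigma_inv = np.linalg.inv(Sigma)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```

The two coefficient vectors agree up to floating-point error, and solving with the triangular factor avoids forming Σ^{−1} explicitly in the transformed route.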
Heteroskedasticity produces a simple example. To produce observations with equal variances, each data point is divided by the standard deviation σ_{n} = (σ_{nn})^{1/2}:
ỹ_{n} = y_{n}/σ_{n}, x̃_{nk} = x_{nk}/σ_{n}, k = 1, …, K.
This corresponds to choosing A^{−1} equal to a diagonal matrix with the reciprocals of these standard deviations arrayed along its diagonal. The estimation criterion function is
∑_{n=1}^{N} (y_{n} − x′_{n}b)^{2} / σ_{n}^{2}
which is a weighted sum of squared residuals. For this reason, in this special case GLS is often called weighted least squares (WLS). WLS puts most weight on the observations with the smallest variances, showing how GLS improves upon OLS, which puts equal weight on all observations. Those n for which σ_{n} is relatively small tend to be closest to the mean of y_{n} and, hence, more informative about β.
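The weighted least squares calculation can be sketched as follows, assuming the standard deviations σ_{n} are known. The helper name `weighted_least_squares` and the simulated data are illustrative assumptions.

```python
import numpy as np

def weighted_least_squares(X, y, sigma_n):
    """OLS after dividing each observation by its standard deviation."""
    X_t = X / sigma_n[:, None]
    y_t = y / sigma_n
    coef, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
    return coef

rng = np.random.default_rng(2)
N = 100
X = np.column_stack([np.ones(N), rng.uniform(1.0, 5.0, size=N)])
sigma_n = 0.5 * X[:, 1]                      # hypothetical standard deviations
y = X @ np.array([1.0, 2.0]) + sigma_n * rng.normal(size=N)

beta_wls = weighted_least_squares(X, y, sigma_n)

# The same estimate from the GLS formula with Sigma^{-1} = diag(1 / sigma_n^2).
W = np.diag(1.0 / sigma_n**2)
beta_gls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

Both routes minimize the same weighted sum of squared residuals, so the coefficients coincide up to rounding.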
Faced with AR(1) serial correlation in a time series, the appropriate choice of A transforms each data point (except the first) into differences:
ỹ_{n} = y_{n} − ρy_{n−1},
x̃_{nk} = x_{nk} − ρx_{n−1,k}, k = 1, …, K.
These transformed ỹ_{n} display zero covariances. For an observation m earlier than n,
Cov[ỹ_{n}, ỹ_{m}] = Cov[y_{n}, y_{m}] − ρ Cov[y_{n−1}, y_{m}] − ρ Cov[y_{n}, y_{m−1}] + ρ^{2} Cov[y_{n−1}, y_{m−1}] = 0
using (2) for the first and third terms on the right-hand side. This transformation uncovers the new or additional information available in each observation, whereas OLS treats highly correlated observations the same way as uncorrelated observations, giving the former relatively too much weight in that estimator.
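The effect of this differencing can be verified directly on the covariance matrix. The sketch below assumes the AR(1) covariances σ^{2}ρ^{|m − n|} and illustrative values of ρ and σ^{2}; the matrix D, a name introduced here, stacks the differences for observations 2 through N.

```python
import numpy as np

N, rho, sigma2 = 6, 0.7, 1.5                 # illustrative values
idx = np.arange(N)
Sigma = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])

# Each row of D forms y_n - rho * y_{n-1} for n = 2, ..., N.
D = np.eye(N)[1:] - rho * np.eye(N)[:-1]

# Covariance matrix of the transformed observations.
Sigma_t = D @ Sigma @ D.T
```

Under these assumptions `Sigma_t` is the scalar matrix σ^{2}(1 − ρ^{2}) · I of order N − 1, so the differenced observations are uncorrelated with a common variance.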
The panel data model has a simple GLS transformation as well:
ỹ_{nt} = y_{nt} − (1 − ω)ȳ_{n},
x̃_{ntk} = x_{ntk} − (1 − ω)x̄_{nk}, k = 1, …, K
where ȳ_{n} and x̄_{nk} are the individual averages over time, (1/T)∑_{t=1}^{T} y_{nt} and (1/T)∑_{t=1}^{T} x_{ntk} respectively, and
ω = [σ_{ε}^{2} / (σ_{ε}^{2} + Tσ_{α}^{2})]^{1/2}.
If there is no serial correlation, then σ_{α} = 0 and y_{nt} = ỹ_{nt}. Conversely, the greater σ_{α} is, the more important the individual average ȳ_{n} becomes. Like the AR(1) case, a weighted difference removes the covariance among the original y_{nt}. In this case, however, a common time-series sample average appears in every difference, reflecting the equal covariance structure.
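A small numerical sketch can confirm that the weighted difference removes the covariance. It assumes the standard random-effects weight ω = [σ_{ε}^{2} / (σ_{ε}^{2} + Tσ_{α}^{2})]^{1/2} and works with the T × T covariance block of a single individual; the variance values are illustrative.

```python
import numpy as np

T, sigma2_alpha, sigma2_eps = 4, 2.0, 1.0    # illustrative values

# Covariance of one individual's T observations: sigma_eps^2 on top of the
# common sigma_alpha^2 contributed by the individual effect.
Sigma_n = sigma2_eps * np.eye(T) + sigma2_alpha * np.ones((T, T))

omega = np.sqrt(sigma2_eps / (sigma2_eps + T * sigma2_alpha))

# P maps the vector y_n to y_n - (1 - omega) * (time average of y_n).
P = np.eye(T) - (1.0 - omega) * np.ones((T, T)) / T

Sigma_t = P @ Sigma_n @ P.T
```

Under these assumptions `Sigma_t` equals σ_{ε}^{2} · I, so the transformed observations for an individual are uncorrelated with equal variances.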
Note that the GLS estimator is an instrumental variables (IV) estimator,
β̂_{IV} = (Z′X)^{−1} Z′y, for an N × K matrix Z of instrumental variables such that Z′X is invertible. For GLS, Z = Σ^{−1}X. Researchers use instrumental variables estimators to overcome omission of explanatory variables in models of the form
y = x′β + ε
where ε is an unobserved term. Even though E[ε] = 0, correlation between the explanatory variables in x and ε biases β̂_{OLS}, and the IV estimator is employed to overcome this bias by using instrumental variables, the variables in Z, that are uncorrelated with ε yet correlated with the explanatory variables. In some cases of the linear model, the GLS estimator provides such instrumental variables. If, for example, x_{n} includes the lagged value of y_{n} in a time-series application, then residual serial correlation usually invalidates the OLS estimator while GLS still produces an estimator for β.
In the panel data setting, particular concern about the behavior of the unobserved individual effect α_{n} has led researchers to compare the GLS estimator with another IV estimator. The concern is that the expected value of α_{n} may vary with some of the observed explanatory variables in x_{nt}. Various observable characteristics of individuals or households are typically correlated, so that one would expect the unobserved characteristics captured in α_{n} to be correlated with the observed characteristics in x_{nt} as well. In this situation, the OLS- and GLS-fitted coefficients are not estimators for β because these fitted coefficients pick up the influence of the α_{n} omitted as explanatory variables. An IV estimator of β that is robust to such correlation is the so-called fixed effects estimator. This estimator is often described as the OLS fit of y_{nt} − ȳ_{n} to the explanatory variables x_{ntk} − x̄_{nk}, k = 1, …, K, but an equivalent IV estimator uses the instrumental variables z_{ntk} = x_{ntk} − x̄_{nk}. In the special case when ω = 0, the fixed effects and GLS estimators are equal. The GLS estimator is often called the random effects estimator in this context, and the difference between the fixed-effects and random-effects estimators is often used as a diagnostic test for the reliability of GLS estimation (Hausman 1978).
The OLS and GLS estimators are equal for a general Σ if the GLS instrument matrix Σ^{−1}X produces the same set of fitted values as the explanatory variable matrix X. Formally, β̂_{OLS} = β̂_{GLS} if and only if every vector Xα, α ∈ ℝ^{K}, equals Σ^{−1}Xγ for some γ ∈ ℝ^{K}, and vice versa. A practical situation in which this occurs approximately is when AR(1) serial correlation is accompanied by explanatory variables that are powers of n or trigonometric functions of n. Another example arises when all covariances are equal (and not necessarily zero) and the regression function includes an intercept (or constant term), as it usually does. A third example is the case of SUR where the explanatory variables are identical for all equations, so that x_{ng} = x_{n}, g = 1, …, G.
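The equal-covariance example can be checked numerically. In the sketch below, Σ has a common variance and a common covariance (both values are illustrative), and X contains an intercept column; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 30
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])   # includes an intercept
y = rng.normal(size=N)

# All covariances equal 0.5 and all variances equal 1.5 (illustrative values).
Sigma = 1.0 * np.eye(N) + 0.5 * np.ones((N, N))
Sigma_inv = np.linalg.inv(Sigma)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
beta_gls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
```

The two coefficient vectors coincide here because, with a constant column in X, the columns of Σ^{−1}X span the same space as the columns of X. Dropping the intercept column would generally break the equality.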
FEASIBLE METHODS
Feasible inference for β in the general linear model typically must overcome the fact that Σ is unknown. There are two popular strategies: (1) to specify Σ as a function of a few parameters that can be replaced with estimators, and (2) to use heteroskedasticity-consistent variance estimators.
The AR(1) serial correlation model illustrates the first approach. A natural estimator for the autocorrelation parameter ρ is the fitted OLS coefficient ρ̂ for predicting the OLS-fitted residual y_{n} − x′_{n}β̂_{OLS} with the single explanatory variable y_{n−1} − x′_{n−1}β̂_{OLS}, the lagged OLS-fitted residual:
ρ̂ = ∑_{n=2}^{N} (y_{n} − x′_{n}β̂_{OLS})(y_{n−1} − x′_{n−1}β̂_{OLS}) / ∑_{n=2}^{N} (y_{n−1} − x′_{n−1}β̂_{OLS})^{2}.
Under certain conditions, this ρ̂ can replace ρ in Σ(ρ) to estimate the variance-covariance matrix of β̂_{OLS}, as in
(X′X)^{−1} X′Σ̂X (X′X)^{−1}
where Σ̂ = Σ(ρ̂), or to compute the feasible GLS (FGLS) estimator
β̂_{FGLS} = (X′Σ̂^{−1}X)^{−1} X′Σ̂^{−1}y.
Similarly, one estimates the variance-covariance matrix of β̂_{FGLS} with (X′Σ̂^{−1}X)^{−1}. In large samples, the differences between the feasible and infeasible versions are negligible. In small samples, many researchers use an estimator that requires iterative calculations to find a ρ̂ and β̂ that are mutually consistent: the fitted residuals produced by β̂ yield ρ̂, and the variance-covariance matrix produced by ρ̂ yields β̂ as the fitted FGLS coefficients. Maximum likelihood estimators, based on an additional assumption that the y_{n} possess a joint multivariate normal distribution, are leading examples of such estimators.
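The two-step procedure can be sketched as follows. This is a minimal illustration rather than a production routine: the function name `fgls_ar1` is hypothetical, the data are simulated, and Σ(ρ̂) is formed with entries ρ̂^{|m − n|}, leaving out the scale σ^{2} because it cancels in the coefficient formula.

```python
import numpy as np

def fgls_ar1(X, y):
    """Two-step FGLS under an assumed AR(1) covariance structure."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    # OLS slope of the residual on its own lag (no intercept).
    rho_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
    idx = np.arange(len(y))
    # Sigma(rho_hat) up to the scale factor sigma^2, which cancels below.
    Sigma_hat = rho_hat ** np.abs(idx[:, None] - idx[None, :])
    Sigma_inv = np.linalg.inv(Sigma_hat)
    beta_fgls = np.linalg.solve(X.T @ Sigma_inv @ X, X.T @ Sigma_inv @ y)
    return beta_fgls, rho_hat

# Simulated data with AR(1) errors (all values are illustrative).
rng = np.random.default_rng(4)
N, rho = 500, 0.6
eps = rng.normal(size=N)
u = np.empty(N)
u[0] = eps[0] / np.sqrt(1.0 - rho**2)   # start from the stationary distribution
for n in range(1, N):
    u[n] = rho * u[n - 1] + eps[n]
X = np.column_stack([np.ones(N), np.linspace(0.0, 1.0, N)])
y = X @ np.array([1.0, 2.0]) + u

beta_fgls, rho_hat = fgls_ar1(X, y)
```

An iterated version would recompute the residuals from `beta_fgls`, update `rho_hat`, and repeat until the two stop changing.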
We will use the pure heteroskedasticity case to illustrate heteroskedasticity-consistent variance estimators. The unknown term in Var[β̂_{OLS}] (shown in (3)) can be written as a sample average:
(1/N) X′ΣX = (1/N) ∑_{n=1}^{N} σ_{n}^{2} x_{n}x′_{n}
where σ_{n}^{2} = σ_{nn}, the nth diagonal element of Σ. In a heteroskedasticity-consistent variance estimator this average is replaced by
(1/N) ∑_{n=1}^{N} (y_{n} − x′_{n}β̂_{OLS})^{2} x_{n}x′_{n}
so that the unknown variances are replaced by the squared OLS fitted residuals. Such estimators do not require a parametric model for Σ and, hence, are more widely applicable. Their justification rests, in part, on
E[(y_{n} − x′_{n}β)^{2}] = σ_{n}^{2}
so that one can show that
(1/N) ∑_{n=1}^{N} (y_{n} − x′_{n}β)^{2} x_{n}x′_{n}
is a valid estimator for (1/N) X′ΣX. The feasible heteroskedasticity-consistent variance estimator replaces the unknown β with its estimator β̂_{OLS}. This variance-covariance estimator is often called the “Eicker-White estimator,” for Friedhelm Eicker and Halbert White.
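A minimal sketch of the Eicker-White calculation follows. It uses the basic form of the estimator, without the small-sample corrections that some software applies; the function name `eicker_white_covariance` and the simulated data are illustrative assumptions.

```python
import numpy as np

def eicker_white_covariance(X, y):
    """OLS coefficients with a heteroskedasticity-consistent covariance matrix."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y
    resid = y - X @ beta_ols
    # Sum over n of (squared residual) * x_n x_n', standing in for X' Sigma X.
    meat = (X * resid[:, None] ** 2).T @ X
    return beta_ols, XtX_inv @ meat @ XtX_inv

rng = np.random.default_rng(5)
N = 400
X = np.column_stack([np.ones(N), rng.uniform(0.0, 2.0, size=N)])
# Error standard deviation grows with the regressor (a hypothetical design).
y = X @ np.array([1.0, 0.5]) + (0.2 + X[:, 1]) * rng.normal(size=N)

beta_ols, cov_hc = eicker_white_covariance(X, y)
std_errors = np.sqrt(np.diag(cov_hc))
```

The square roots of the diagonal of `cov_hc` serve as heteroskedasticity-robust standard errors for the OLS coefficients.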
The heteroskedasticity-consistent variance estimator does not yield a direct counterpart to β̂_{FGLS}. Nevertheless, estimators that dominate OLS are available. The transformed linear model
E[Z′y] = Z′Xβ
has a corresponding variance-covariance matrix
Var[Z′y] = Z′ΣZ
which has a heteroskedasticity-consistent counterpart
V̂ar[Z′y] = ∑_{n=1}^{N} (y_{n} − x′_{n}β̂_{OLS})^{2} z_{n}z′_{n}
where z′_{n} denotes the nth row of Z, and the FGLS analogue
β̂_{FGLS} = [X′Z (V̂ar[Z′y])^{−1} Z′X]^{−1} X′Z (V̂ar[Z′y])^{−1} Z′y.
This estimator reduces to OLS if Z = X and produces superior estimators to the extent that Σ^{1/2}Z provides a better linear predictor of Σ^{−1/2}X than Σ^{1/2}X does.
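The heteroskedasticity-consistent variance estimator has been extended to cover time-series cases with nonzero covariances as well. For example, if only first-order covariances are nonzero then
(1/N) X′ΣX = (1/N) ∑_{n=1}^{N} σ_{nn} x_{n}x′_{n} + (1/N) ∑_{n=2}^{N} σ_{n,n−1} (x_{n}x′_{n−1} + x_{n−1}x′_{n})
because σ_{n,n−j} = 0 for j > 1. This term in the OLS variance-covariance matrix can be estimated by
(1/N) ∑_{n=1}^{N} (y_{n} − x′_{n}β̂_{OLS})^{2} x_{n}x′_{n} + (1/N) ∑_{n=2}^{N} (y_{n} − x′_{n}β̂_{OLS})(y_{n−1} − x′_{n−1}β̂_{OLS}) (x_{n}x′_{n−1} + x_{n−1}x′_{n}),
a heteroskedasticity and autocorrelation consistent (HAC) variance-covariance matrix estimator. This works because the second average behaves much like the first in that
E[(y_{n} − x′_{n}β)(y_{n−1} − x′_{n−1}β)] = σ_{n,n−1}
so that one can show that
(1/N) ∑_{n=2}^{N} (y_{n} − x′_{n}β)(y_{n−1} − x′_{n−1}β) (x_{n}x′_{n−1} + x_{n−1}x′_{n})
is an estimator for the second term.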
One can extend the HAC approach to cover m-dependence, in which only covariances up to the mth order are nonzero for a finite m. However, in practice m should be small relative to the number of observations N. To illustrate the difficulties with large m, consider setting m = N − 1 so that all of the covariances in Σ are replaced by a product of OLS-fitted residuals. Then this approach yields the estimator
(1/N) X′(y − Xβ̂_{OLS})(y − Xβ̂_{OLS})′X
which is the outer product of the K × 1 column vector X′(y − Xβ̂_{OLS}). It follows that this matrix has a rank of at most one (indeed, the OLS normal equations make this vector exactly zero), contradicting the property that X′ΣX has a rank of K. Nevertheless, the heteroskedasticity-consistent variance-covariance estimator has been generalized to cover situations where all of the covariances may be nonzero. The Newey-West estimator is a popular choice:
Ω̂_{0} + ∑_{j=1}^{m} w(j, m) (Ω̂_{j} + Ω̂′_{j})
where
Ω̂_{j} = (1/N) ∑_{n=j+1}^{N} (y_{n} − x′_{n}β̂_{OLS})(y_{n−j} − x′_{n−j}β̂_{OLS}) x_{n}x′_{n−j}
and
w(j, m) = 1 − j/(m + 1).
The supporting approximate distribution theory requires m to depend on the sample size N and methods for choosing m are available.
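The Newey-West calculation can be sketched in a few lines. The version below uses the linearly declining weights w(j, m) = 1 − j/(m + 1) of Newey and West (1987); the function name `newey_west_meat`, the choice m = 4, and the simulated data are illustrative assumptions, and no rule for selecting m is implemented.

```python
import numpy as np

def newey_west_meat(X, resid, m):
    """Estimate X' Sigma X / N using weights w(j, m) = 1 - j / (m + 1)."""
    N = X.shape[0]
    Xe = X * resid[:, None]                  # row n holds residual_n * x_n'
    omega = Xe.T @ Xe / N                    # j = 0 term (the Eicker-White average)
    for j in range(1, m + 1):
        # (1/N) * sum over n of residual_n * residual_{n-j} * x_n x_{n-j}'
        gamma_j = Xe[j:].T @ Xe[:-j] / N
        omega += (1.0 - j / (m + 1.0)) * (gamma_j + gamma_j.T)
    return omega

rng = np.random.default_rng(6)
N = 300
X = np.column_stack([np.ones(N), rng.normal(size=N)])
y = X @ np.array([1.0, -1.0]) + rng.normal(size=N)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_ols

omega_hat = newey_west_meat(X, resid, m=4)
XtX_inv = np.linalg.inv(X.T @ X)
cov_hac = N * XtX_inv @ omega_hat @ XtX_inv  # HAC covariance of the OLS coefficients
```

Setting m = 0 recovers the heteroskedasticity-consistent average, and the declining weights keep the estimate positive semidefinite.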
Often statistical inference for β based upon estimation of Σ or X′ΣX can treat these terms as equal to the objects that they estimate. For example, the statistical distribution theory typically shows that
Q = (Rβ̂_{GLS} − Rβ)′ (R (X′Σ^{−1}X)^{−1} R′)^{−1} (Rβ̂_{GLS} − Rβ)
is approximately (or exactly) distributed as a chi-squared random variable. This pivotal statistic yields a hypothesis test or confidence interval for Rβ. In large samples,
Q̂ = (Rβ̂_{FGLS} − Rβ)′ (R (X′Σ̂^{−1}X)^{−1} R′)^{−1} (Rβ̂_{FGLS} − Rβ)
may be treated as an equivalent statistic. Researchers have shown that bootstrap methods, appropriately applied, can provide better probability approximations in situations with small sample sizes.
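The quadratic form can be computed with a short helper. In this sketch the function name `wald_statistic` and the argument `r`, the hypothesized value of Rβ, are introduced for illustration; `cov_beta` stands for whichever variance-covariance estimate is in use, such as (X′Σ̂^{−1}X)^{−1}. The numbers in the example are made up.

```python
import numpy as np

def wald_statistic(beta_hat, cov_beta, R, r):
    """Quadratic form (R b - r)' [R V R']^{-1} (R b - r)."""
    diff = R @ beta_hat - r
    return float(diff @ np.linalg.solve(R @ cov_beta @ R.T, diff))

# Toy example: test whether the two coefficients are equal.
beta_hat = np.array([1.0, 2.0])
cov_beta = np.eye(2)                         # hypothetical covariance estimate
R = np.array([[1.0, -1.0]])
r = np.array([0.0])

Q = wald_statistic(beta_hat, cov_beta, R, r)
```

Here the difference Rβ̂ − r equals −1 and RVR′ equals 2, so Q is 0.5. The statistic would be compared with a chi-squared distribution whose degrees of freedom equal the number of rows of R.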
SEE ALSO Autoregressive Models; Bootstrap Method; Covariance; Heteroskedasticity; Least Squares, Ordinary; Least Squares, Two-Stage; Residuals; Serial Correlation; Specification; Variance
BIBLIOGRAPHY
Aitken, A. C. 1935. On Least Squares and Linear Combination of Observations. Proceedings of the Royal Society of Edinburgh 55: 42–48.
Cragg, John G. 1983. More Efficient Estimation in the Presence of Heteroscedasticity of Unknown Form. Econometrica 51 (3): 751–764.
Eicker, Friedhelm. 1967. Limit Theorems for Regressions with Unequal and Dependent Errors. In Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, ed. Lucien Le Cam and Jerzy Neyman, 59–82. Berkeley: University of California Press.
Hausman, Jerry A. 1978. Specification Tests in Econometrics. Econometrica 46 (6): 1251–1272.
Newey, Whitney K., and Kenneth D. West. 1987. A Simple, Positive Semidefinite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix. Econometrica 55 (3): 703–708.
White, Halbert. 1980. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 48 (4): 817–838.
Paul A. Ruud
"Generalized Least Squares." International Encyclopedia of the Social Sciences. Encyclopedia.com. (September 20, 2018). http://www.encyclopedia.com/socialsciences/appliedandsocialsciencesmagazines/generalizedleastsquares