Least Squares, Ordinary

Ordinary least squares (OLS) is a method for fitting lines or curves to observed data in situations where one variable (the response variable) is believed to be explained or caused by one or more other explanatory variables. OLS is most commonly used to estimate the parameters of linear regression models of the form

$$Y_i = \beta_1 + \beta_2\, g_2(Z_{i2}) + \cdots + \beta_K\, g_K(Z_{iK}) + \varepsilon_i. \tag{1}$$

The subscripts i index observations; ε_i is a random variable with zero expected value (i.e., E(ε_i) = 0 for all observations i = 1, …, n); and the functions g₁(), …, g_K() are known. The right-hand side explanatory variables Z_i2, …, Z_iK are assumed to be exogenous and to cause the dependent, endogenous left-hand side response variable Y_i. The β₁, …, β_K are unknown parameters, and hence must be estimated. Setting X_ij = g_j(Z_ij) for all j = 1, …, K and i = 1, …, n (with X_i1 = 1 for the intercept term), the model in (1) can be written as

$$Y_i = \beta_1 + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \varepsilon_i. \tag{2}$$

The random error term ε_i represents statistical noise due to measurement error in the dependent variable, unexplained variation due to random variation, or perhaps the omission of some explanatory variables from the model.

In the context of OLS estimation, an important feature of the models in (1) and (2) is that they are linear in parameters. Since the functions g_j () are known, it does not matter whether these functions are linear or nonlinear. Given n observations on the variables Y_i and X_ij, the OLS method involves fitting a line (in the case where K = 2), a plane (for K = 3), or a hyperplane (when K > 3) to the data that describes the average, expected value of the response variable for given values of the explanatory variables. With OLS, this is done using a particular criterion as shown below, although other methods use different criteria.
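The point that only linearity in the parameters matters can be sketched in code. The example below is a minimal illustration, assuming NumPy and hypothetical noise-free data; the particular transformations (a logarithm and a square) are chosen purely for illustration and do not come from the article.

```python
import numpy as np

# Hypothetical noise-free data generated from
#   Y = 1.0 + 2.0 * log(Z2) + 0.5 * Z3**2
z2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z3 = np.array([0.5, 1.5, 1.0, 2.5, 2.0, 3.0])
y = 1.0 + 2.0 * np.log(z2) + 0.5 * z3**2

# Apply the known (nonlinear) transformations, then fit a model that is
# linear in the parameters. The first column of ones carries the intercept.
X = np.column_stack([np.ones_like(z2), np.log(z2), z3**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# With no noise, the fit recovers the generating coefficients up to rounding.
print(beta_hat)
```

Because the transformed regressors enter additively, the fitting step is the same as for any other linear model, even though the relationship between Y and the original Z variables is curved.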

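An estimator is a random variable whose realizations are regarded as estimates of some parameter of interest. Replacing the unknown parameters β₁, …, β_K and errors ε_i in (2) with estimators β̂₁, …, β̂_K and residuals ε̂_i yields

$$Y_i = \hat{\beta}_1 + \hat{\beta}_2 X_{i2} + \cdots + \hat{\beta}_K X_{iK} + \hat{\varepsilon}_i. \tag{3}$$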

The relation in (2) is called the population regression function, whereas the relation in (3) is called the sample regression function. The population regression function describes the unobserved, true relationship between the explanatory variables and the response variable that is to be estimated. The sample regression function is an estimator of the population regression function.

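Rearranging terms in (3), the residual ε̂_i can be expressed as

$$\hat{\varepsilon}_i = Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_{i2} - \cdots - \hat{\beta}_K X_{iK}. \tag{4}$$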

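The OLS estimators β̂₁, …, β̂_K of the parameters β₁, …, β_K in (2) are obtained by minimizing the error sum-of-squares

$$\sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2} = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_1 - \hat{\beta}_2 X_{i2} - \cdots - \hat{\beta}_K X_{iK} \right)^2 \tag{5}$$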

with respect to β̂₁, …, β̂_K. Hence the OLS estimator of the population regression function (2) minimizes the sum of squared residuals, which are the squared distances (in the direction of the Y-axis) between each observed value Y_i and the sample regression function given by (3).
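The minimization criterion can be checked numerically. The sketch below assumes NumPy and a small hypothetical data set; the helper name `error_sum_of_squares` is introduced here for illustration only. It computes the sum of squared residuals at the least squares solution and at a few perturbed coefficient vectors.

```python
import numpy as np

def error_sum_of_squares(beta, X, y):
    """Sum of squared residuals for a candidate coefficient vector."""
    residuals = y - X @ beta
    return float(residuals @ residuals)

# Hypothetical data: a column of ones (intercept) and one explanatory variable.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares solution and the criterion value it attains.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
ess_min = error_sum_of_squares(beta_hat, X, y)

# Moving the coefficients in any direction increases the criterion.
deltas = ([0.1, 0.0], [0.0, -0.1], [-0.2, 0.05])
worse = [error_sum_of_squares(beta_hat + np.array(d), X, y) for d in deltas]
print(ess_min, min(worse))
```

Since the columns of this X are not collinear, the minimizer is unique and every perturbed vector gives a strictly larger sum of squares.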

By minimizing the sum of squared residuals, disproportionate weight may be given to observations that are outliers, that is, those that are atypical and that lie apart from the majority of observations. An alternative approach is to minimize the sum

$$\sum_{i=1}^{n} \left| \hat{\varepsilon}_i \right|$$

of absolute deviations; the resulting estimator is called the least absolute deviations (LAD) estimator. Although this estimator has some attractive properties, it requires linear programming methods for computation, and the estimator is biased in small samples.
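The sensitivity to outliers is easiest to see in the simplest possible model, one containing only a constant. In that case the least squares estimate is the sample mean, while the value minimizing the sum of absolute deviations is the sample median. The following sketch assumes NumPy and hypothetical data with one deliberately extreme observation.

```python
import numpy as np

# Hypothetical observations; the last value is an outlier.
y = np.array([9.8, 10.1, 10.0, 9.9, 50.0])

ols_estimate = y.mean()      # minimizes the sum of squared deviations
lad_estimate = np.median(y)  # minimizes the sum of absolute deviations

# Brute-force check of the LAD criterion over a grid of candidate constants.
grid = np.linspace(5.0, 55.0, 5001)
abs_dev = np.abs(y[:, None] - grid[None, :]).sum(axis=0)
grid_minimizer = grid[abs_dev.argmin()]

print(ols_estimate, lad_estimate, grid_minimizer)
```

The single outlier pulls the mean well away from the bulk of the data, while the median stays with the majority of observations.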

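To illustrate, first consider the simple case where K = 1. Then equation (2) becomes

$$Y_i = \beta_1 + \varepsilon_i.$$

In this simple model, random variables Y_i are random deviations (to the left or to the right) away from the constant β₁. The error-sum-of-squares in (5) becomes

$$\sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2} = \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_1 \right)^2.$$

This can be minimized by differentiating with respect to β̂₁ and setting the derivative equal to 0; that is,

$$\frac{d}{d\hat{\beta}_1} \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_1 \right)^2 = -2 \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_1 \right) = 0,$$

and then solving for β̂₁ to obtain

$$\hat{\beta}_1 = \frac{1}{n} \sum_{i=1}^{n} Y_i.$$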

In this simple model, therefore, the OLS estimator of β₁ is merely the sample mean of the Y_i. Given a set of n observations on Y_i, these values can be used to compute an OLS estimate of β₁ by simply adding them and then dividing the sum by n.
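As a quick numerical check of this result, the sketch below (assuming NumPy and hypothetical observations; the function names are illustrative) evaluates the derivative of the error sum of squares at the sample mean and compares the criterion at a nearby value.

```python
import numpy as np

y = np.array([4.0, 7.0, 5.0, 8.0, 6.0])  # hypothetical observations
beta1_hat = y.sum() / len(y)              # add the values, then divide by n

def ess(b):
    """Error sum of squares for the constant-only model."""
    return float(((y - b) ** 2).sum())

def ess_derivative(b):
    """Derivative of the error sum of squares with respect to b."""
    return float(-2.0 * (y - b).sum())

print(beta1_hat, ess_derivative(beta1_hat), ess(beta1_hat), ess(beta1_hat - 0.1))
```

The derivative vanishes at the sample mean, and moving away from it in either direction raises the sum of squares.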

In the slightly more complicated case where K = 2, the population regression function (2) becomes

$$Y_i = \beta_1 + \beta_2 X_{i2} + \varepsilon_i. \tag{8}$$

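Minimizing the error sum of squares in this model yields OLS estimators

$$\hat{\beta}_2 = \frac{\sum_{i=1}^{n} (X_{i2} - \bar{x}_2)(Y_i - \bar{y})}{\sum_{i=1}^{n} (X_{i2} - \bar{x}_2)^2}$$

and

$$\hat{\beta}_1 = \bar{y} - \hat{\beta}_2\, \bar{x}_2,$$

where x̄₂ and ȳ are the sample means of X_i2 and Y_i, respectively.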
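For the two-parameter model, the standard closed-form expressions are the slope, equal to the sum of cross-products of deviations from the means divided by the sum of squared deviations of the explanatory variable, and the intercept, equal to ȳ minus the slope times x̄₂. A minimal sketch, assuming NumPy and hypothetical data:

```python
import numpy as np

# Hypothetical observations on the explanatory and response variables.
x2 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x2_bar, y_bar = x2.mean(), y.mean()

# Slope: cross-products of deviations over squared deviations of x2.
beta2_hat = ((x2 - x2_bar) * (y - y_bar)).sum() / ((x2 - x2_bar) ** 2).sum()

# Intercept: forces the fitted line through the point of sample means.
beta1_hat = y_bar - beta2_hat * x2_bar

print(beta1_hat, beta2_hat)
```

The intercept formula implies that the fitted line always passes through the point (x̄₂, ȳ).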

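In the more general case where K > 2, it is useful to think of the model in (2) as a system of equations, such as:

$$\begin{aligned}
Y_1 &= \beta_1 + \beta_2 X_{12} + \cdots + \beta_K X_{1K} + \varepsilon_1 \\
Y_2 &= \beta_1 + \beta_2 X_{22} + \cdots + \beta_K X_{2K} + \varepsilon_2 \\
&\;\;\vdots \\
Y_n &= \beta_1 + \beta_2 X_{n2} + \cdots + \beta_K X_{nK} + \varepsilon_n
\end{aligned}$$

This can be written in matrix form as

$$Y = X\beta + \varepsilon, \tag{10}$$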

where Y is an (n × 1) matrix containing elements Y₁, …, Y_n; X is an (n × K) matrix with element X_ij in the i-th row, j-th column (elements in the first column of X are equal to 1); β is a (K × 1) matrix containing the unknown parameters β₁, …, β_K; and ε is an (n × 1) matrix containing the random variables ε₁, …, ε_n. Minimizing the error-sum-of-squares yields the OLS estimator

$$\hat{\beta} = (X'X)^{-1} X'Y. \tag{11}$$

This is a (K × 1) matrix containing elements β̂₁, …, β̂_K. In modern software, β̂ is typically computed without explicitly inverting the (K × K) matrix X′X; a common approach applies the Q-R decomposition to X, which is numerically more stable. The OLS estimator can always be computed, unless X′X is singular, which happens when there is an exact linear relationship among any of the columns of the matrix X.
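The two computational routes can be compared directly. The sketch below assumes NumPy and simulated data (the seed, sample size, and coefficient values are arbitrary choices made for illustration); it solves the normal equations (X′X)b = X′Y and, separately, uses a Q-R factorization of X.

```python
import numpy as np

# Simulated data: intercept column plus two random regressors.
rng = np.random.default_rng(0)
n, K = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Route 1: solve the normal equations (X'X) b = X'y.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Route 2: factor X = QR (Q has orthonormal columns, R is upper
# triangular), then solve R b = Q'y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)

# An exact linear relationship among columns would show up as rank < K.
rank = np.linalg.matrix_rank(X)

print(beta_normal, beta_qr, rank)
```

On well-conditioned data such as this, both routes agree to rounding error; the difference between them matters mainly when columns of X are nearly collinear.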

The Gauss-Markov theorem establishes that provided

i. the relationship between Y and X is linear as described by (10);

ii. the elements of X are fixed, nonstochastic, and there exists no exact linear relationship among the columns of X; and

iii. E(ε) = 0 and E(εε′) = σ²I,

the OLS estimator β̂ is the best (in the sense of minimum variance) linear, unbiased estimator of β. Modified versions of the OLS estimator (e.g., weighted least squares, feasible generalized least squares) can be used when data do not conform to the assumptions given above.

The variance-covariance matrix of the OLS estimator given in (11) is σ²(X′X)⁻¹, a (K × K) matrix whose diagonal elements give the variances of the K elements of the vector β̂, and whose off-diagonal elements give the corresponding covariances among the elements of β̂. This matrix can be estimated by replacing the unknown variance of ε, namely σ², with the estimator σ̂² = ε̂′ε̂/(n − K). The estimated variances and covariances can then be used for hypothesis testing.
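The estimated variance-covariance matrix can be computed in a few lines. The following is a minimal sketch, assuming NumPy and a small hypothetical data set; the explicit inverse is used only because the example is tiny, and the t-ratios shown are the usual statistics for testing whether an individual coefficient is zero.

```python
import numpy as np

# Hypothetical data: intercept column plus one explanatory variable.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
n, K = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Residuals and the estimator of the error variance, e'e / (n - K).
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - K)

# Estimated variance-covariance matrix of beta_hat and standard errors.
cov_beta_hat = sigma2_hat * XtX_inv
std_errors = np.sqrt(np.diag(cov_beta_hat))

# t-ratios for the hypotheses that each coefficient equals zero.
t_stats = beta_hat / std_errors

print(sigma2_hat)
print(cov_beta_hat)
print(t_stats)
```

The divisor n − K, rather than n, accounts for the K parameters estimated from the data.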

DISCOVERY OF THE METHOD

Robin L. Plackett (1972) and Stephen M. Stigler (1981) describe the debate that exists over who discovered the OLS method. The first publication to describe the method was Adrien-Marie Legendre’s Nouvelles méthodes pour la détermination des orbites des comètes in 1806 (the term least squares comes from the French words moindres quarrés in the title of an appendix to Legendre’s book), followed by publications by Robert Adrain (1808) and Carl Friedrich Gauss ([1809] 2004). However, Gauss made several claims that he developed the method as early as 1794 or 1795; Stigler discusses evidence that supports these claims in his article “Gauss and the Invention of Least Squares” (1981).

Both Legendre and Gauss considered bivariate models such as the one given in (8), and they used OLS to predict the position of comets in their orbits about the sun using data from astronomical observations. Legendre’s approach was purely mathematical in the sense that he viewed the problem as one of solving for two unknowns in an overdetermined system of n equations. Gauss ([1809] 2004) was the first to give a probabilistic interpretation to the least squares method. Gauss reasoned that for a sequence Y₁, …, Y_n of n independent random variables whose density functions satisfy certain conditions, if the sample mean

$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i$$

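is the most probable combination for all values of the random variables and each n ≥ 1, then for some σ² > 0, the density function of the random variables is given by the normal, or Gaussian, density function

$$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y - \mu)^2}{2\sigma^2} \right),$$

where μ denotes the mean of the distribution.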

In the nineteenth century, this was known as the law of errors. This argument led Gauss to consider the regression equation as containing independent, normally distributed error terms.

The least squares method quickly gained widespread acceptance. At the beginning of the twentieth century, Karl Pearson remarked in his article in Biometrika (1902, p. 266) that “it is usually taken for granted that the right method for determining the constants is the method of least squares.” Beginning in the 1870s, the method was used in biological, genetic applications by Pearson, Francis Galton, George Udny Yule, and others. In his article “Regression towards Mediocrity in Hereditary Stature” (1886), Galton used the term regression to describe the tendency of the progeny of exceptional parents to be, on average, less exceptional than their parents, but today the term regression analysis has a rather different meaning.

Pearson provided substantial empirical support for Galton’s notion of biological regression by looking at hereditary data on the color of horses and dogs and their offspring and the heights of fathers and their sons. Pearson worked in terms of correlation coefficients, implicitly assuming that the response and explanatory variables were jointly normally distributed. In a series of papers (1896, 1900, 1902, 1903a, 1903b), Pearson formalized and extended notions of correlation and regression from the bivariate model to multivariate models. Pearson began by considering, for the bivariate model, bivariate distributions for the explanatory and response variables. In the case of the bivariate normal distribution, the conditional expectation of Y can be derived in terms of a linear expression involving X, suggesting the model in (8). Similarly, by assuming that the response variable and several explanatory variables have a multivariate normal joint distribution, one can derive the expectation of the response variable, conditional on the explanatory variables, as a linear equation in the explanatory variables, suggesting the multivariate regression model in (2).

Pearson’s approach, involving joint distributions for the response and explanatory variables, led him to argue in later papers that researchers should in some situations consider nonsymmetric joint distributions for response and explanatory variables, which could lead to nonlinear expressions for the conditional expectation of the response variable. These arguments had little influence; Pearson could not offer tangible examples because the joint normal distribution was the only joint distribution to have been characterized at that time. Today, however, there are several examples of bivariate joint distributions that lead to nonlinear regression curves, including the bivariate exponential and bivariate logistic distributions. Aris Spanos describes such examples in his book Probability Theory and Statistical Inference: Econometric Modeling with Observational Data (1999, chapter 7). Ronald A. Fisher (1922, 1925) later formulated the regression problem closer to Gauss’s characterization by assuming that only the conditional distribution of the response variable is normal, without requiring that the joint distribution be normal.

CURRENT USES

Today, OLS and its variants are probably the most widely used statistical techniques. OLS is frequently used in the behavioral and social sciences as well as in biological and physical sciences. Even in situations where the relationship between dependent and explanatory variables is nonlinear, it is often possible to transform variables to arrive at a linear relationship. In other situations, assuming linearity may provide a reasonable approximation to nonlinear relationships over certain ranges of the data. The numerical difficulties faced by Gauss and Pearson in their day are now viewed as trivial given the increasing speed of modern computers. Computational problems are encountered, however, in problems with very large numbers of dimensions. Most of the problems that are encountered involve obtaining accurate solutions for the inverse matrix (X′X)⁻¹ using computers with finite precision; care must be taken to ensure that solutions are not contaminated by round-off error. Åke Björck, in Numerical Methods for Least Squares Problems (1996), gives an extensive treatment of numerical issues involved with OLS.
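One way to see why round-off error becomes a concern is through condition numbers: in the 2-norm, the condition number of X′X is the square of the condition number of X, so forming X′X roughly doubles the number of significant digits at risk. The sketch below assumes NumPy and a hypothetical design matrix with two nearly collinear columns, built only to make the effect visible.

```python
import numpy as np

# Hypothetical design matrix: the third column is almost a copy of the second.
t = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(t), t, t + 1e-3 * t**2])

# 2-norm condition numbers (ratio of largest to smallest singular value).
cond_X = np.linalg.cond(X)
cond_XtX = np.linalg.cond(X.T @ X)

print(cond_X, cond_XtX, cond_X**2)
```

This is one reason that methods working directly with X, such as those based on orthogonal factorizations, are generally preferred to forming and inverting X′X.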

SEE ALSO Classical Statistical Analysis; Cliometrics; Econometric Decomposition; Galton, Francis; Ordinary Least Squares Regression; Pearson, Karl; Regression Analysis; Regression Towards the Mean; Statistics

BIBLIOGRAPHY

Adrain, Robert. 1808. Research Concerning the Probabilities of the Errors Which Happen in Making Observations, &c. The Analyst, or Mathematical Museum 1: 93–109.

Björck, Åke. 1996. Numerical Methods for Least Squares Problems. Philadelphia: SIAM.

Fisher, Ronald Aylmer. 1922. The Goodness of Fit of Regression Formulae and the Distribution of Regression Coefficients. Journal of the Royal Statistical Society 85 (4): 597–612.

Fisher, Ronald Aylmer. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.

Galton, Francis. 1886. Regression towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute 15: 246–263.

Gauss, Carl Friedrich. [1809] 2004. Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections. Trans. Charles Henry Davis. Mineola, NY: Dover.

Gauss, Carl Friedrich. 1866–1933. Werke. 12 vols. Göttingen, Germany: Gedruckt in der Dieterichschen Universitätsdruckerei.

Legendre, Adrien-Marie. 1806. Nouvelles méthodes pour la détermination des orbites des comètes. Paris: Courcier.

Pearson, Karl. 1896. Contributions to the Mathematical Theory of Evolution III: Regression, Heredity, and Panmixia. Proceedings of the Royal Society of London 59: 69–71.

Pearson, Karl. 1900. Contributions to the Mathematical Theory of Evolution VIII: On the Correlation of Characters Not Quantitatively Measurable. Proceedings of the Royal Society of London 66: 241–244.

Pearson, Karl. 1902. On the Systematic Fitting of Curves to Observations and Measurements. Biometrika 1 (3): 265–303.

Pearson, Karl. 1903a. Contributions to the Mathematical Theory of Evolution: On Homotyposis in Homologous but Differentiated Organs. Proceedings of the Royal Society of London 71: 288–313.

Pearson, Karl. 1903b. The Law of Ancestral Heredity. Biometrika 2 (2): 211–228.

Plackett, Robin L. 1972. The Discovery of the Method of Least Squares. Biometrika 59 (2): 239–251.

Spanos, Aris. 1999. Probability Theory and Statistical Inference: Econometric Modeling with Observational Data. Cambridge, U.K.: Cambridge University Press.

Stigler, Stephen M. 1981. Gauss and the Invention of Least Squares. Annals of Statistics 9 (3): 465–474.

Paul W. Wilson

International Encyclopedia of the Social Sciences