Least Squares, Ordinary
Ordinary least squares (OLS) is a method for fitting lines or curves to observed data in situations where one variable (the response variable) is believed to be explained or caused by one or more other explanatory variables. OLS is most commonly used to estimate the parameters of linear regression models of the form

Yi = β1g1(Zi1) + β2g2(Zi2) + … + βKgK(ZiK) + εi,  i = 1, …, n.  (1)
The subscripts i index observations; εi is a random variable with zero expected value (i.e., E(εi) = 0 for all observations i = 1, …, n); and the functions g1(), …, gK() are known (typically g1() ≡ 1, so that β1 is an intercept). The right-hand-side explanatory variables Zi2, …, ZiK are assumed to be exogenous and to cause the dependent, endogenous left-hand-side response variable Yi. The β1, …, βK are unknown parameters, and hence must be estimated. Setting Xij = gj(Zij) for all j = 1, …, K and i = 1, …, n, the model in (1) can be written as

Yi = β1Xi1 + β2Xi2 + … + βKXiK + εi.  (2)
The random error term εi represents statistical noise arising from measurement error in the dependent variable, inherent random variation, or the omission of some explanatory variables from the model.
In the context of OLS estimation, an important feature of the models in (1) and (2) is that they are linear in parameters. Since the functions gj() are known, it does not matter whether these functions are linear or nonlinear. Given n observations on the variables Yi and Xij, the OLS method involves fitting a line (in the case where K = 2), a plane (for K = 3), or a hyperplane (when K > 3) to the data that describes the average, expected value of the response variable for given values of the explanatory variables. With OLS, this is done using a particular criterion as shown below, although other methods use different criteria.
An estimator is a random variable whose realizations are regarded as estimates of some parameter of interest. Replacing the unknown parameters β1, …, βK and errors εi in (2) with estimators β̂1, …, β̂K and residuals ∊̂i yields

Yi = β̂1Xi1 + β̂2Xi2 + … + β̂KXiK + ∊̂i.  (3)
The relation in (2) is called the population regression function, whereas the relation in (3) is called the sample regression function. The population regression function describes the unobserved, true relationship between the explanatory variables and the response variable that is to be estimated. The sample regression function is an estimator of the population regression function.
Rearranging terms in (3), the residual ∊̂i can be expressed as

∊̂i = Yi − β̂1Xi1 − β̂2Xi2 − … − β̂KXiK.  (4)
The OLS estimators β̂1, …, β̂K of the parameters β1, …, βK in (2) are obtained by minimizing the error sum-of-squares

SSE = ∑_{i=1}^{n} ∊̂i² = ∑_{i=1}^{n} (Yi − β̂1Xi1 − … − β̂KXiK)²  (5)
with respect to β̂1, …, β̂K. Hence the OLS estimator of the population regression function (2) minimizes the sum of squared residuals, which are the squared distances (in the direction of the Y-axis) between each observed value Yi and the sample regression function given by (3).
By minimizing the sum of squared residuals, disproportionate weight may be given to observations that are outliers—that is, those that are atypical and that lie apart from the majority of observations. An alternative approach is to minimize the sum of absolute deviations

∑_{i=1}^{n} |∊̂i|;

the resulting estimator is called the least absolute deviations (LAD) estimator. Although this estimator has some attractive properties, it requires linear programming methods for computation, and the estimator is biased in small samples.
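The contrast between the two criteria is easiest to see in the intercept-only case: minimizing squared deviations yields the sample mean, while minimizing absolute deviations yields the sample median, which is far less sensitive to outliers. Below is a minimal sketch in Python; the data are invented, and a brute-force grid search stands in for the linear-programming computation mentioned above.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # toy sample with one large outlier

# Evaluate both criteria over a fine grid of candidate constants b.
grid = np.linspace(0.0, 110.0, 110001)
sse = ((y[:, None] - grid[None, :]) ** 2).sum(axis=0)  # sum of squared deviations
sad = np.abs(y[:, None] - grid[None, :]).sum(axis=0)   # sum of absolute deviations

ols_hat = grid[sse.argmin()]  # sample mean (22.0), dragged upward by the outlier
lad_hat = grid[sad.argmin()]  # sample median (3.0), unaffected by the outlier
```

Here the single outlier pulls the least squares estimate far from the bulk of the data, while the LAD estimate stays at the median.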
To illustrate, first consider the simple case where K = 1. Since Xi1 = 1 for all i, equation (2) becomes

Yi = β1 + εi.  (6)
In this simple model, the random variables Yi are random deviations away from the constant β1. The error-sum-of-squares in (5) becomes

SSE = ∑_{i=1}^{n} (Yi − β̂1)².  (7)
This can be minimized by differentiating with respect to β̂1 and setting the derivative equal to zero; that is,

d(SSE)/dβ̂1 = −2 ∑_{i=1}^{n} (Yi − β̂1) = 0,
and then solving for β̂1 to obtain

β̂1 = n⁻¹ ∑_{i=1}^{n} Yi = ȳ.
In this simple model, therefore, the OLS estimator of β1 is merely the sample mean of the Yi. Given a set of n observations on Yi, these values can be used to compute an OLS estimate of β1 by simply adding them and then dividing the sum by n.
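This can be verified numerically: for simulated data, the sample mean attains a smaller sum of squared deviations than nearby alternatives. A small sketch follows; the sample size, seed, and distribution are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(5.0, 2.0, size=50)  # simulated observations on Yi

beta1_hat = y.sum() / y.size  # the OLS estimate: add the Yi, divide by n

def sse(b):
    """Error sum-of-squares for a candidate constant b."""
    return np.sum((y - b) ** 2)

# The sample mean yields a smaller SSE than perturbed candidates.
assert sse(beta1_hat) < sse(beta1_hat + 0.01)
assert sse(beta1_hat) < sse(beta1_hat - 0.01)
```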
In the slightly more complicated case where K = 2, the population regression function (2) becomes

Yi = β1 + β2Xi2 + εi.  (8)
Minimizing the error sum of squares in this model yields OLS estimators

β̂2 = ∑_{i=1}^{n} (Xi2 − x̄2)(Yi − ȳ) / ∑_{i=1}^{n} (Xi2 − x̄2)²  and  β̂1 = ȳ − β̂2x̄2,  (9)
where x̄2 and ȳ are the sample means of Xi2 and Yi, respectively.
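These closed-form expressions can be checked on simulated data against a general-purpose least squares routine. A minimal sketch; the true coefficients, noise level, and seed are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x2 = rng.uniform(0.0, 10.0, n)
y = 1.0 + 3.0 * x2 + rng.normal(0.0, 1.0, n)  # true beta1 = 1, beta2 = 3

# Closed-form OLS estimators for the K = 2 model.
xbar2, ybar = x2.mean(), y.mean()
beta2_hat = np.sum((x2 - xbar2) * (y - ybar)) / np.sum((x2 - xbar2) ** 2)
beta1_hat = ybar - beta2_hat * xbar2

# Cross-check against a library least squares solver.
X = np.column_stack([np.ones(n), x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.allclose([beta1_hat, beta2_hat], b)
```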
In the more general case where K > 2, it is useful to think of the model in (2) as a system of n equations:

Y1 = β1X11 + β2X12 + … + βKX1K + ε1
Y2 = β1X21 + β2X22 + … + βKX2K + ε2
⋮
Yn = β1Xn1 + β2Xn2 + … + βKXnK + εn.
This can be written in matrix form as

Y = Xβ + ε,  (10)
where Y is an (n × 1) matrix containing elements Y1, …, Yn; X is an (n × K) matrix with element Xij in the i-th row, j-th column (elements in the first column of X are equal to 1); β is a (K × 1) matrix containing the unknown parameters β1, …, βK; and ε is an (n × 1) matrix containing the random variables ε1, …, εn. Minimizing the error-sum-of-squares yields the OLS estimator

β̂ = (X′X)⁻¹X′Y.  (11)
This is a (K × 1) matrix containing elements β̂1, …, β̂K. In modern software, β̂ is typically computed not by explicitly inverting the (K × K) matrix X′X but via the QR decomposition of X, which is numerically more stable. The OLS estimator can always be computed unless X′X is singular, which happens when there is an exact linear relationship among any of the columns of the matrix X.
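The two computational routes, solving the normal equations (X′X)b = X′Y directly and using a QR decomposition of X, can be compared on well-conditioned simulated data, where they agree to machine precision. A sketch; the dimensions and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 500, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])  # first column: ones
beta = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta + rng.normal(size=n)

# Textbook route: solve the normal equations (X'X) b = X'Y.
b_normal = np.linalg.solve(X.T @ X, X.T @ y)

# QR route: factor X = QR, then solve the triangular system R b = Q'Y.
Q, R = np.linalg.qr(X)
b_qr = np.linalg.solve(R, Q.T @ y)

assert np.allclose(b_normal, b_qr)
```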
The Gauss-Markov theorem establishes that provided
i. the relationship between Y and X is linear as described by (10);
ii. the elements of X are fixed, nonstochastic, and there exists no exact linear relationship among the columns of X; and
iii. E(ε) = 0 and E(εε′) = σ²I,
the OLS estimator β̂ is the best (in the sense of minimum variance) linear, unbiased estimator of β. Modified versions of the OLS estimator (e.g., weighted least squares, feasible generalized least squares) can be used when data do not conform to the assumptions given above.
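The unbiasedness part of the theorem can be illustrated by Monte Carlo simulation: holding the design matrix X fixed and redrawing the errors many times, the average of the OLS estimates settles near the true β. A sketch under assumptions (i)–(iii), with invented values for β, n, and the number of replications:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, n)])  # fixed, nonstochastic design
beta = np.array([2.0, 0.5])  # true parameters

reps = 2000
estimates = np.empty((reps, 2))
for r in range(reps):
    # Errors satisfy E(eps) = 0 and E(eps eps') = sigma^2 I, with sigma = 1.
    y = X @ beta + rng.normal(0.0, 1.0, n)
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

# The Monte Carlo average of beta-hat is close to beta, consistent with unbiasedness.
assert np.allclose(estimates.mean(axis=0), beta, atol=0.05)
```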
The variance-covariance matrix of the OLS estimator given in (11) is σ²(X′X)⁻¹, a (K × K) matrix whose diagonal elements give the variances of the K elements of the vector β̂, and whose off-diagonal elements give the corresponding covariances among the elements of β̂. This matrix can be estimated by replacing the unknown variance of ε, namely σ², with the estimator σ̂² = ∊̂′∊̂/(n − K). The estimated variances and covariances can then be used for hypothesis testing.
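Putting the pieces together, the estimated covariance matrix and the resulting standard errors can be computed directly. A minimal sketch on simulated data; σ = 2 and the other values are invented.

```python
import numpy as np

rng = np.random.default_rng(5)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta = np.array([1.0, 2.0, -0.5])
y = X @ beta + rng.normal(0.0, 2.0, n)  # errors with sigma = 2

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y                    # OLS estimate (X'X)^{-1} X'Y
resid = y - X @ b                        # residuals
sigma2_hat = resid @ resid / (n - K)     # estimator of sigma^2: e'e / (n - K)
cov_hat = sigma2_hat * XtX_inv           # estimated variance-covariance matrix
std_err = np.sqrt(np.diag(cov_hat))      # standard errors for hypothesis tests
```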
Robin L. Plackett (1972) and Stephen M. Stigler (1981) describe the debate over who discovered the OLS method. The first publication to describe the method was Adrien-Marie Legendre's Nouvelles méthodes pour la détermination des orbites des comètes in 1806 (the term least squares comes from the French moindres quarrés in the title of an appendix to Legendre's book), followed by publications by Robert Adrain (1808) and Carl Friedrich Gauss (2004). However, Gauss claimed to have developed the method as early as 1794 or 1795; Stigler discusses evidence that supports these claims in his article "Gauss and the Invention of Least Squares" (1981).
Both Legendre and Gauss considered bivariate models such as the one given in (8), and they used OLS to predict the positions of comets in their orbits about the sun using data from astronomical observations. Legendre's approach was purely mathematical in the sense that he viewed the problem as one of solving for two unknowns in an overdetermined system of n equations. Gauss (2004) was the first to give a probabilistic interpretation to the least squares method. Gauss reasoned that for a sequence Y1, …, Yn of n independent random variables whose density functions satisfy certain conditions, if the sample mean

ȳ = n⁻¹ ∑_{i=1}^{n} Yi

is the most probable combination for all values of the random variables and each n ≥ 1, then for some σ² > 0, the density function of the random variables is given by the normal, or Gaussian, density function

f(y) = (2πσ²)^{−1/2} exp(−(y − μ)²/(2σ²)),

where μ is the common mean of the Yi.
In the nineteenth century, this was known as the law of errors. This argument led Gauss to consider the regression equation as containing independent, normally distributed error terms.
The least squares method quickly gained widespread acceptance. At the beginning of the twentieth century, Karl Pearson remarked in his article in Biometrika (1902, p. 266) that “it is usually taken for granted that the right method for determining the constants is the method of least squares.” Beginning in the 1870s, the method was used in biological, genetic applications by Pearson, Francis Galton, George Udny Yule, and others. In his article “Regression towards Mediocrity in Hereditary Stature” (1886), Galton used the term regression to describe the tendency of the progeny of exceptional parents to be, on average, less exceptional than their parents, but today the term regression analysis has a rather different meaning.
Pearson provided substantial empirical support for Galton’s notion of biological regression by looking at hereditary data on the color of horses and dogs and their offspring and the heights of fathers and their sons. Pearson worked in terms of correlation coefficients, implicitly assuming that the response and explanatory variables were jointly normally distributed. In a series of papers (1896, 1900, 1902, 1903a, 1903b), Pearson formalized and extended notions of correlation and regression from the bivariate model to multivariate models. Pearson began by considering, for the bivariate model, bivariate distributions for the explanatory and response variables. In the case of the bivariate normal distribution, the conditional expectation of Y can be derived in terms of a linear expression involving X, suggesting the model in (8). Similarly, by assuming that the response variable and several explanatory variables have a multivariate normal joint distribution, one can derive the expectation of the response variable, conditional on the explanatory variables, as a linear equation in the explanatory variables, suggesting the multivariate regression model in (2).
Pearson’s approach, involving joint distributions for the response and explanatory variables, led him to argue in later papers that researchers should in some situations consider nonsymmetric joint distributions for response and explanatory variables, which could lead to nonlinear expressions for the conditional expectation of the response variable. These arguments had little influence; Pearson could not offer tangible examples because the joint normal distribution was the only joint distribution to have been characterized at that time. Today, however, there are several examples of bivariate joint distributions that lead to nonlinear regression curves, including the bivariate exponential and bivariate logistic distributions. Aris Spanos describes such examples in his book Probability Theory and Statistical Inference: Econometric Modeling with Observational Data (1999, chapter 7). Ronald A. Fisher (1922, 1925) later formulated the regression problem closer to Gauss’s characterization by assuming that only the conditional distribution of the response variable is normal, without requiring that the joint distribution be normal.
Today, OLS and its variants are probably the most widely used statistical techniques. OLS is frequently used in the behavioral and social sciences as well as in biological and physical sciences. Even in situations where the relationship between dependent and explanatory variables is nonlinear, it is often possible to transform variables to arrive at a linear relationship. In other situations, assuming linearity may provide a reasonable approximation to nonlinear relationships over certain ranges of the data. The numerical difficulties faced by Gauss and Pearson in their day are now viewed as trivial given the increasing speed of modern computers. Computational problems are encountered, however, in problems with very large numbers of dimensions. Most of the problems that are encountered involve obtaining accurate solutions for the inverse matrix (X′X)⁻¹ using computers with finite precision; care must be taken to ensure that solutions are not contaminated by round-off error. Åke Björck, in Numerical Methods for Least Squares Problems (1996), gives an extensive treatment of numerical issues involved with OLS.
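The round-off issue can be made concrete: when two columns of X are nearly collinear, forming X′X roughly squares the condition number, so solving the normal equations can lose about twice as many digits of accuracy as working with X itself. A small sketch with artificially collinear columns; the perturbation size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0.0, 1.0, n)
# The third column nearly duplicates the second, so X is badly conditioned.
X = np.column_stack([np.ones(n), x, x + 1e-6 * rng.normal(size=n)])

cond_X = np.linalg.cond(X)          # condition number of X
cond_XtX = np.linalg.cond(X.T @ X)  # roughly the square of cond_X

assert cond_XtX > 10 * cond_X
```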
SEE ALSO Classical Statistical Analysis; Cliometrics; Econometric Decomposition; Galton, Francis; Ordinary Least Squares Regression; Pearson, Karl; Regression Analysis; Regression Towards the Mean; Statistics
BIBLIOGRAPHY

Adrain, Robert. 1808. Research Concerning the Probabilities of the Errors Which Happen in Making Observations, &c. The Analyst, or Mathematical Museum 1: 93–109.
Björck, Åke. 1996. Numerical Methods for Least Squares Problems. Philadelphia: SIAM.
Fisher, Ronald Aylmer. 1922. The Goodness of Fit of Regression Formulae and the Distribution of Regression Coefficients. Journal of the Royal Statistical Society 85 (4): 597–612.
Fisher, Ronald Aylmer. 1925. Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Galton, Francis. 1886. Regression towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute 15: 246–263.
Gauss, Carl Friedrich. 2004. Theory of the Motion of the Heavenly Bodies Moving about the Sun in Conic Sections. Trans. Charles Henry Davis. Mineola, NY: Dover.
Gauss, Carl Friedrich. 1866–1933. Werke. 12 vols. Göttingen, Germany: Gedruckt in der Dieterichschen Universitätsdruckerei.
Legendre, Adrien-Marie. 1806. Nouvelles méthodes pour la détermination des orbites des comètes. Paris: Courcier.
Pearson, Karl. 1896. Contributions to the Mathematical Theory of Evolution III: Regression, Heredity, and Panmixia. Proceedings of the Royal Society of London 59: 69–71.
Pearson, Karl. 1900. Contributions to the Mathematical Theory of Evolution VIII: On the Correlation of Characters Not Quantitatively Measurable. Proceedings of the Royal Society of London 66: 241–244.
Pearson, Karl. 1902. On the Systematic Fitting of Curves to Observations and Measurements. Biometrika 1 (3): 265–303.
Pearson, Karl. 1903a. Contributions to the Mathematical Theory of Evolution: On Homotyposis in Homologous but Differentiated Organs. Proceedings of the Royal Society of London 71: 288–313.
Pearson, Karl. 1903b. The Law of Ancestral Heredity. Biometrika 2 (2): 211–228.
Plackett, Robin L. 1972. The Discovery of the Method of Least Squares. Biometrika 59 (2): 239–251.
Spanos, Aris. 1999. Probability Theory and Statistical Inference: Econometric Modeling with Observational Data. Cambridge, U.K.: Cambridge University Press.
Stigler, Stephen M. 1981. Gauss and the Invention of Least Squares. Annals of Statistics 9 (3): 465–474.
Paul W. Wilson