The term regression was first conceptualized by Francis Galton (1822-1911) in the context of the inheritance of characteristics from fathers to sons. In his famous 1886 paper, Galton examined the average regression relationship between the heights of fathers and the heights of their sons. A more formal treatment of multiple regression modeling and correlation was introduced in 1903 by Galton’s friend Karl Pearson (1857-1936). Regression analysis is the statistical methodology of estimating a relationship between a single dependent variable (Y) and a set of predictor (explanatory or independent) variables (X2, X3, …, Xk) based on a theoretical or empirical concept. In some natural science or engineering applications this relationship is described exactly, but in most social science applications it is not. These inexact models are probabilistic in nature and capture only approximate features of the relationship. For example, an energy analyst may want to model how the demand for heating oil varies with the price of oil and the average daily temperature. The problem is that energy demand may also be determined by other factors, and the nature of the relationship between energy demand and the explanatory variables is unknown. Thus, only an approximate relation can be modeled.
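The simplest relationship between Y and X is a linear regression model, which is typically written as
$$Y_i = \beta_1 + \beta_2 X_{2i} + \beta_3 X_{3i} + \dots + \beta_k X_{ki} + U_i, \qquad i = 1, \dots, n, \tag{1}$$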
where the index i represents the i-th observation and Ui is the random error term. The β coefficients measure the net effect of each explanatory variable on the dependent variable, and the random disturbance term captures the net effect of all the factors that affect Yi other than the predictor variables. In other words, Ui is the difference between the actual value of the dependent variable and its conditional mean, E(Yi | X2i, …, Xki) = β1 + β2 X2i + β3 X3i + … + βk Xki. It is customary to assume that the Ui (i = 1, …, n) are independently and normally distributed with zero mean and constant variance σ². The parameters are estimated from a random sample of n (n > k) observations. The explanatory variables are assumed to be nonstochastic and uncorrelated with the error term. The meaning of a β coefficient varies with the functional form of the regression model. For example, if the variables are in linear form, the net effect represents a rate of change, and if the variables are in log form, the net effect represents an elasticity, which can be interpreted as the percentage change in the dependent variable with respect to a 1 percent change in the independent variable.
The β parameters are estimated by minimizing the residual sum of squares, a procedure known as the ordinary least squares (OLS) method. The minimization problem results in k equations in k unknowns, which gives a unique estimate as long as the explanatory variables are not perfectly collinear. An unbiased estimate of σ² is obtained from the residual sum of squares. When the random error term satisfies the standard assumptions, the OLS method gives the best linear unbiased estimates (BLUE). The statistical significance of each coefficient is tested with the usual t-statistic (t = β̂i / s.e.(β̂i)), which follows a t-distribution with n - k degrees of freedom. In fact, some researchers use this t-statistic as a criterion in a stepwise process to add or delete variables from the preliminary model.
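The estimation steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `solve` and `ols` and the heating-oil figures are hypothetical, and the normal equations are solved with a small hand-written elimination routine so that the example has no dependencies. In practice a numerical library would normally be used.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: regressors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """OLS for rows of regressors (first entry 1.0 for the intercept).

    Returns (beta, sigma2, standard errors, t-statistics).
    """
    n, k = len(rows), len(rows[0])
    # Normal equations: (X'X) beta = X'y, i.e. k equations in k unknowns.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * v for b, v in zip(beta, r)) for r, yi in zip(rows, y)]
    # Unbiased estimate of the error variance: RSS / (n - k).
    sigma2 = sum(e * e for e in resid) / (n - k)
    # Diagonal of (X'X)^{-1}, obtained one unit vector at a time.
    inv_diag = [solve(xtx, [1.0 if i == j else 0.0 for i in range(k)])[j]
                for j in range(k)]
    se = [(sigma2 * d) ** 0.5 for d in inv_diag]
    t_stats = [b / s for b, s in zip(beta, se)]
    return beta, sigma2, se, t_stats


# Hypothetical data: heating-oil demand, oil price, average daily temperature.
price = [1.0, 1.2, 1.1, 1.5, 1.4, 1.8]
temp = [30.0, 35.0, 28.0, 40.0, 33.0, 45.0]
demand = [52.0, 47.0, 53.0, 41.0, 46.0, 35.0]
rows = [[1.0, p, t] for p, t in zip(price, temp)]
beta, sigma2, se, t_stats = ols(rows, demand)
print("coefficients:", beta)
print("t-statistics:", t_stats)  # compare with a t-distribution, n - k = 3 d.f.
```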
The fit of the regression equation is evaluated by the statistic R², which measures the extent of the variation in Y explained by the regression equation. The value of R² ranges from 0 to 1, where 0 means no fit and 1 means a perfect fit. The adjusted R² (R̄² = 1 - (1 - R²)(n - 1)/(n - k)), which compensates for n and k, is more suitable than R² for comparing models with different subsets of explanatory variables. Sometimes researchers choose the model with the highest R², but the purpose of regression analysis is to obtain the best model based on a theoretical concept or an empirically observed phenomenon. Therefore, in formulating models, researchers should consider the logical and theoretical links, and any prior knowledge, connecting the dependent and explanatory variables. Nevertheless, it is not unusual to get a high R² with coefficient signs that are inconsistent with prior knowledge or expectations. Note that the R² values of two different models are comparable only if the dependent variables and the number of observations are the same, because R² measures the fraction of the total variation in the dependent variable explained by the regression equation. In addition to R², there are other measures, such as Mallows’s Cp statistic and the information criteria AIC (Akaike) and BIC (Bayesian), for choosing between different combinations of regressors. Francis Diebold (1998) showed that neither AIC nor BIC has an obvious advantage over the other, and both are routinely calculated in many statistical programs.
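The two fit measures in this paragraph can be computed directly from the observed and fitted values. The following is a minimal sketch; the function names and the small data set are hypothetical, and the fitted values are assumed to come from an OLS regression with an intercept.

```python
def r_squared(y, fitted):
    """Fraction of the total variation in y explained by the regression."""
    mean_y = sum(y) / len(y)
    total_ss = sum((yi - mean_y) ** 2 for yi in y)
    residual_ss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return 1.0 - residual_ss / total_ss


def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared: 1 - (1 - R^2)(n - 1)/(n - k).

    n is the number of observations and k the number of estimated
    coefficients, including the intercept.
    """
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


# Hypothetical observations and fitted values from a two-coefficient model.
y = [2.1, 3.9, 6.2, 7.8, 10.1]
fitted = [2.04, 4.03, 6.02, 8.01, 10.00]
r2 = r_squared(y, fitted)
print(r2, adjusted_r_squared(r2, n=len(y), k=2))
```

Because the adjustment penalizes each additional coefficient, the adjusted value never exceeds R² when k > 1.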
Regression models are often plagued with data problems, such as multicollinearity, heteroskedasticity, and autocorrelation, that violate standard OLS assumptions. Multicollinearity occurs when two or more regressors (or a combination of regressors) are highly correlated. Multicollinearity is suspected when inconsistencies in the estimates are observed, such as a high R² together with low t-values, or high pairwise correlations among regressors. Remedial measures for multicollinearity include dropping a variable that causes multicollinearity, using a priori information to combine coefficients, pooling time series and cross-section data, adding new observations, transforming variables, and using ridge and Stein-rule estimates. It is worthwhile to note that transforming variables and dropping variables may cause specification errors, and that Stein-rule and ridge estimates are biased, accepting some bias in exchange for a smaller variance.
The assumption of constant variance of the error term is somewhat unreasonable in empirical research. For example, expenditure on food is steady for low-income households, while it varies substantially for high-income households. Furthermore, heteroskedasticity is more common in cross-sectional data (e.g., sample surveys) than in time series data, because in time series data changes in all variables are of more or less the same magnitude. Plotting OLS residuals against the predicted Y values provides a visual picture of the heteroskedasticity problem. Formal tests for heteroskedasticity are based on regressing the OLS residuals on various functional forms of the regressors. Halbert White’s (1980) general procedure is widely used among practitioners in detecting and correcting for heteroskedasticity.
Often, regression models with variables observed over time conflict with the classical assumption of uncorrelated errors. For example, in modeling the impact of public expenditures on economic growth, the OLS model would not be appropriate because the levels of public expenditures are correlated over time. OLS estimates of the β coefficients in the presence of autocorrelated errors are unbiased, but there is a tendency for σ² to be underestimated, leading to an overestimated R² and invalid t- and F-tests. First-order autocorrelation is detected by the Durbin-Watson test and is corrected by estimating difference equations or by the Cochrane-Orcutt iterative procedure.
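The building blocks of the autocorrelation check and correction mentioned above can be sketched as follows. The function names are hypothetical. The code computes the Durbin-Watson statistic from a residual series, estimates the first-order autocorrelation coefficient by regressing each residual on its predecessor without an intercept, and forms the quasi-differenced series that a Cochrane-Orcutt step would then re-estimate by OLS. The full iterative procedure and the Durbin-Watson critical values are omitted.

```python
def durbin_watson(resid):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the residual sum of squares."""
    num = sum((resid[t] - resid[t - 1]) ** 2 for t in range(1, len(resid)))
    den = sum(e * e for e in resid)
    return num / den


def estimate_rho(resid):
    """Slope from regressing e_t on e_{t-1} without an intercept."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(e * e for e in resid[:-1])
    return num / den


def quasi_difference(series, rho):
    """Transform z_t into z_t - rho * z_{t-1} (the first observation is lost)."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


# Hypothetical residuals that decay smoothly: positive autocorrelation.
residuals = [1.0, 0.5, 0.25, 0.125]
print(durbin_watson(residuals))   # well below 2 for positive autocorrelation
rho = estimate_rho(residuals)
print(quasi_difference([10.0, 12.0, 13.0, 15.0], rho))
```

Values of the statistic near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation.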
The OLS method gives the best linear unbiased estimates contingent upon the standard assumptions of the model and the sample data. Any errors in the model or the data may produce misleading results. For example, the true model may have variables in log form, while the estimated model is linear. Although the true nature of the model is almost always unknown in the social sciences, researchers’ understanding of the variables and the topic may help to formulate a reasonable functional form for the model. In a situation where the functional form is unknown, possible choices include transforming variables into log form, polynomial regression, a translog model, and the Box-Cox transformation. Parameters of the Box-Cox model are estimated by the maximum likelihood method because the variable transformation and the parameter estimation of the model are inseparable.
Specification errors also occur either when relevant explanatory variables are missing from the model or when irrelevant variables are added to the model. For example, let the true regression model be Yi = β1 + β2 X2i + β3 X3i + Ui and the estimated model be Yi = β1 + β2 X2i + Ui. If the omitted variable X3 is correlated with the included variable X2, then the estimates of β1 and β2 are biased and inconsistent. The extent and the direction of the bias depend on the true parameter β3 and the correlation between the variables X2 and X3. In addition, incorrect estimation of σ² may lead to misleading significance tests. Researchers do not commit these specification errors willingly; often the errors are due to unavailability of data or lack of understanding of the topic. Researchers sometimes include all conceivable variables without paying much attention to the underlying theoretical framework. This leads to unbiased but inefficient estimates, a less serious problem than omitting a relevant variable.
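The dependence of the omitted-variable bias on β3 and on the relationship between X2 and X3 can be illustrated numerically. The sketch below uses hypothetical, noise-free data (so Ui = 0 for every observation) to isolate the bias: the slope from the short regression of Y on X2 alone equals β2 plus β3 times the slope from regressing X3 on X2. All names and numbers are illustrative assumptions.

```python
def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope from a simple regression of y on x with an intercept."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical true model: Y = 1 + 2*X2 + 3*X3, with no error term.
beta1, beta2, beta3 = 1.0, 2.0, 3.0
x2 = [1.0, 2.0, 3.0, 4.0, 5.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0]          # positively related to x2
y = [beta1 + beta2 * a + beta3 * b for a, b in zip(x2, x3)]

short_slope = slope(x2, y)              # estimate of beta2 when X3 is omitted
delta = slope(x2, x3)                   # slope from regressing X3 on X2
bias = beta3 * delta                    # omitted-variable bias in the slope

print(short_slope, beta2 + bias)        # the two values coincide
```

If X2 and X3 were uncorrelated in the sample, `delta` would be zero and omitting X3 would leave the slope estimate unaffected.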
Instead of dropping an unobserved variable, it is common practice to use proxy variables. For example, most researchers use general aptitude test scores as proxies for individual abilities. When a proxy variable is substituted for a dependent variable, OLS gives unbiased estimates of the coefficients with a larger variance than in the model with no measurement error. However, when an independent variable is measured with an error, OLS gives biased and inconsistent estimates of the parameters. In general, there is no satisfactory way to handle measurement error problems. A real question that arises is whether to omit a variable altogether or instead use a proxy variable. B. T. McCallum (1972) and Michael Wickens (1972) showed that omitting a variable is more severe than using a proxy variable, even if the proxy variable is a poor one.
In regression analysis, researchers frequently encounter qualitative variables, especially in survey data. A person’s decision to join a labor union or a homemaker’s decision to join the workforce are examples of dichotomous dependent variables. Dichotomous variables also appear in models when the underlying variable is not observed. In general, a dichotomous variable model can be defined as
$$Y_i^* = X_i \beta + U_i, \tag{2}$$
where Yi* is an unobservable variable. An observable binary outcome variable Yi is defined as
$$Y_i = \begin{cases} 1 & \text{if } Y_i^* > 0, \\ 0 & \text{otherwise.} \end{cases}$$
Substituting Yi for Yi* and estimating the model by OLS is unsuitable primarily because Xiβ represents the probability that Yi = 1 and it could lie outside the range 0 to 1. In the probit and logit formulations of the above model, the probability that Yi is equal to 1 is defined through a probability distribution function,
$$P(Y_i = 1) = P(U_i > -X_i \beta) = 1 - F(-X_i \beta),$$
where F is the cumulative distribution function of U. The choice of the distribution function F translates the regression function Xiβ into a number between 0 and 1. Parameters of the model are estimated by maximizing the likelihood function,
$$L = \prod_{Y_i = 0} F(-X_i \beta) \prod_{Y_i = 1} \left[ 1 - F(-X_i \beta) \right].$$
Choosing the normal probability distribution for F yields the probit model, and choosing the logistic distribution for F gives the logit model, where P(Yi = 1) is given by
$$P(Y_i = 1) = \Phi\!\left( \frac{X_i \beta}{\sigma} \right) \ \text{(probit)}, \qquad P(Y_i = 1) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} \ \text{(logit)},$$
where Φ(.) is the cumulative normal distribution function. Many statistical packages have standard routines to estimate both probit and logit models. In the probit model, the β parameters and σ appear as a ratio, and therefore cannot be estimated separately. Hence, in the probit model, σ is set to 1 without any loss of generality. The cumulative distributions of the logistic and the normal are close to each other, except in the tails, and therefore the estimates from the two models will not differ much from each other. Since the variance of the normal distribution is set to one, and the logistic distribution has a variance of π²/3, it is necessary to divide the logit estimates by the corresponding standard deviation, π/√3, to make the estimates comparable. In practical applications, it is useful to compare the effect of each explanatory variable on the probability. The changes in probability in the probit and logit models for a unit change in the k-th variable are given by φ(Xiβ)βk and exp(Xiβ)βk / [1 + exp(Xiβ)]², respectively, where φ(.) is the standard normal probability density function.
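The probabilities, the log-likelihood, and the changes in probability discussed above can be written out with the standard library alone. This is a sketch under assumptions: the function names are hypothetical, σ is set to 1 in the probit case, and `index` stands for the value of Xiβ at which the effect is evaluated. No maximization routine is included.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logistic_cdf(z):
    """Logistic cdf, equal to exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))


def log_likelihood(y, index, cdf):
    """Binary-choice log-likelihood for outcomes y and index values X_i*beta.

    Both distributions are symmetric, so P(Y_i = 0) = F(-X_i*beta)
    equals 1 - F(X_i*beta).
    """
    return sum(math.log(cdf(z)) if yi == 1 else math.log(1.0 - cdf(z))
               for yi, z in zip(y, index))


def probit_marginal_effect(index, beta_k):
    """Change in P(Y = 1) per unit change in the k-th variable (probit)."""
    return normal_pdf(index) * beta_k


def logit_marginal_effect(index, beta_k):
    """Change in P(Y = 1) per unit change in the k-th variable (logit).

    exp(z) / (1 + exp(z))**2 is the same as p * (1 - p) with p = F(z).
    """
    p = logistic_cdf(index)
    return p * (1.0 - p) * beta_k


def logit_to_probit_scale(beta_logit):
    """Rescale a logit coefficient by the logistic standard deviation."""
    return beta_logit / (math.pi / math.sqrt(3.0))


print(probit_marginal_effect(0.0, 1.0), logit_marginal_effect(0.0, 1.0))
```

The effects are largest where the index is near zero, that is, where the predicted probability is close to one half, and they shrink in the tails of either distribution.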
In the multivariate probit models, dichotomous variables are jointly distributed with appropriate multivariate distributions. Likelihood functions are based on the possible combinations of the binary choice variable (Yi ) values. A practical difficulty in multivariate models is the presence of multiple integrals in the likelihood function. Some authors have proposed methods for simulating the multivariate probabilities; details were published in the November 1994 issue of the Review of Economics and Statistics. This problem of multiple integration does not arise in the multinomial logit model because the cumulative logistic distribution has a closed form.
The Tobit model can be considered an extension of the probit and logit models in which the dependent variable is observed in the positive range and is unobserved or unavailable in the negative range. Consider the model described in equation (2), where
$$Y_i = \begin{cases} Y_i^* & \text{if } Y_i^* > 0, \\ 0 & \text{if } Y_i^* \le 0. \end{cases}$$
This is a censored regression model, in which Y* observations below zero are censored or not available, but the X observations are available. On the other hand, in the truncated regression model, both Y* and X observations are unavailable for Y* values below zero. Estimation of the parameters is carried out by the maximum likelihood method, where the likelihood function of the model is based on Pr(Y* > 0) and Pr(Y* ≤ 0). When Ui is normally distributed, the likelihood function is given by
$$L = \prod_{Y_i > 0} \frac{1}{\sigma}\, \phi\!\left( \frac{Y_i - X_i \beta}{\sigma} \right) \prod_{Y_i = 0} \Phi\!\left( -\frac{X_i \beta}{\sigma} \right),$$
where φ(.) and Φ(.) are the standard normal density and distribution functions.
Unlike the probit model, where σ was arbitrarily set to 1, the Tobit model estimates σ along with the β parameters. In the probit model, only the effects of changes in X values on the probabilities are meaningful to a practitioner. However, in the Tobit model, a practitioner may be interested in how the predicted values change due to a unit change in one X variable under three possible scenarios: the linear functional form of the latent variable, E(Yi*) = Xiβ; the unconditional mean of the observed variable, E(Yi); and the conditional mean, E(Yi | Yi* > 0). The impacts of the variable Xj on these three expressions are given by βj, Φ(Z)βj, and βj[1 - Zφ(Z)/Φ(Z) - (φ(Z)/Φ(Z))²], respectively, where Z = Xiβ/σ, and φ(.) and Φ(.) are the standard normal density and distribution functions. An extension of the Tobit model to the presence of heteroskedasticity has been proposed by James Powell (1984).
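The censored-regression log-likelihood under normal errors and the three effects listed above can be sketched as follows. The function names are hypothetical, `xb` denotes the value of Xiβ, and observations recorded as zero are treated as censored. The code evaluates the expressions only and does not perform the maximization.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def tobit_loglik(y, xb, sigma):
    """Log-likelihood of a Tobit model censored at zero.

    Uncensored observations contribute the normal density of the error,
    censored ones contribute Pr(Y* <= 0) = Phi(-X_i*beta / sigma).
    """
    total = 0.0
    for yi, m in zip(y, xb):
        if yi > 0:
            total += math.log(normal_pdf((yi - m) / sigma) / sigma)
        else:
            total += math.log(normal_cdf(-m / sigma))
    return total


def tobit_marginal_effects(xb, sigma, beta_j):
    """Effects of X_j on E(Y*), E(Y), and E(Y | Y* > 0), in that order."""
    z = xb / sigma
    ratio = normal_pdf(z) / normal_cdf(z)      # phi(Z) / Phi(Z)
    latent = beta_j
    unconditional = normal_cdf(z) * beta_j
    conditional = beta_j * (1.0 - z * ratio - ratio ** 2)
    return latent, unconditional, conditional


print(tobit_marginal_effects(xb=0.0, sigma=1.0, beta_j=1.0))
```

Both the unconditional and the conditional effects are smaller in magnitude than βj itself, which is why reporting βj alone can overstate the response of the observed variable.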
The concept of causality in time series variables was introduced by Norbert Wiener (1956) and Clive W. J. Granger (1969), and refers to the predictability of one variable Xt from its past behavior, another variable Yt, and a set of auxiliary variables Zt. In other words, causality refers to a certain type of statistical feedback between the variables. Some variables have a tendency to move together—for example, average household income and level of education. The question is: Does improvement in education level cause the income to increase or vice versa? It is also worthwhile to note that even with a strong statistical relation between variables, causation between the variables may not exist. The idea of causation must come from a plausible theory rather than from statistical methodology. Causality can be tested by regressing each variable on its own and other variables’ lagged values.
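The lagged-regression test mentioned in the last sentence can be sketched as a comparison of two regressions. The example below is a simplified, one-directional version under stated assumptions: the function names and the data are hypothetical, no auxiliary variables Zt are included, and the F-statistic is left for comparison with tabulated critical values. It asks whether lagged values of Yt improve the prediction of Xt beyond what the past of Xt already provides.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("regressors are perfectly collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(columns, target):
    """Residual sum of squares from an OLS regression of target on columns."""
    k = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(u * t for u, t in zip(columns[i], target)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[t] for b, col in zip(beta, columns))
              for t in range(len(target))]
    return sum((yt - ft) ** 2 for yt, ft in zip(target, fitted))


def granger_f(x, y, lags=1):
    """F-statistic for the hypothesis that lagged y does not help predict x."""
    n = len(x) - lags
    target = x[lags:]
    const = [1.0] * n
    x_lags = [x[lags - j:len(x) - j] for j in range(1, lags + 1)]
    y_lags = [y[lags - j:len(y) - j] for j in range(1, lags + 1)]
    rss_r = rss([const] + x_lags, target)            # own lags only
    rss_u = rss([const] + x_lags + y_lags, target)   # own lags plus lags of y
    k_u = 1 + 2 * lags
    f_stat = ((rss_r - rss_u) / lags) / (rss_u / (n - k_u))
    return f_stat, rss_r, rss_u


# Hypothetical series in which x follows last period's y with a small error.
y = [2.0, 5.0, 3.0, 8.0, 1.0, 7.0, 4.0, 6.0, 2.0, 9.0, 5.0, 3.0]
x = [3.0] + [y[t - 1] + 0.1 * (-1) ** t for t in range(1, len(y))]
f_stat, rss_r, rss_u = granger_f(x, y, lags=1)
print(f_stat)  # compare with an F(lags, n - k_u) critical value
```

A full test would repeat the calculation with the roles of the two series reversed. As the paragraph above stresses, a large statistic indicates predictive content only, not causation in the theoretical sense.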
A closely related concept in time series variables, as well as in cross-sectional data to a lesser degree, is simultaneity, where the behavior of g (g > 1) stochastic variables is characterized by g structural regression equations. Some of the regressors in these equations are stochastic variables that are directly correlated with the random error term. OLS applied to such a model produces biased estimates, and the extent of the bias is called simultaneity bias. The implications of simultaneity were recognized long before the estimation methods were devised. Within the regression analysis context, the identification problem has another dimension: the difficulty of determining whether a variable is truly random or not. A more fundamental question is whether the parameters of the model can be estimated at all.
SEE ALSO Causality; Censoring, Left and Right; Censoring, Sample; Frequency Distributions; Galton, Francis; Identification Problem; Linear Regression; Logistic Regression; Multicollinearity; Ordinary Least Squares Regression; Pearson, Karl; Probabilistic Regression; Probability; Regression; Regression Towards the Mean; Serial Correlation; Specification Error; Specification Tests; Student’s T-Statistic; Test Statistics; Tobit
Aigner, Dennis. 1974. MSE Dominance of Least Squares with Errors-of-Observation. Journal of Econometrics 2 (4): 365-372.
Angrist, Joshua D., and Alan B. Krueger. 2001. Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments. Journal of Economic Perspectives 15 (4): 69-85.
Box, G. E. P., and D. R. Cox. 1964. An Analysis of Transformations. Journal of the Royal Statistical Society, Series B 26 (2): 211-252.
Chamberlain, Gary. 1982. The General Equivalence of Granger and Sims Causality. Econometrica 50 (3): 569-582.
Diebold, Francis X. 1998. Elements of Forecasting. 3rd ed. Cincinnati, OH: South-Western.
Draper, Norman, and Harry Smith. 1998. Applied Regression Analysis. 3rd ed. New York: Wiley.
Dufour, Jean-Marie, Denis Pelletier, and Éric Renault. 2006. Short Run and Long Run Causality in Time Series: Inference. Journal of Econometrics 132 (2): 337-362.
Galton, Francis. 1886. Regression Towards Mediocrity in Hereditary Stature. Journal of the Anthropological Institute 15: 246-263.
Granger, C. W. J. 1969. Investigating Causal Relations by Econometric Models and Cross-Spectral Methods. Econometrica 37 (3): 424-438.
Greene, William H. 2003. Econometric Analysis. Upper Saddle River, NJ: Prentice Hall.
Gujarati, Damodar M. 2003. Basic Econometrics. 4th ed. Boston: McGraw-Hill.
Hocking, R. R. 1976. The Analysis and Selection of Variables in Linear Regression. Biometrics 32 (1): 1-49.
Maddala, G. S. 1983. Limited Dependent and Qualitative Variables in Econometrics. Cambridge, U.K.: Cambridge University Press.
McCallum, B. T. 1972. Relative Asymptotic Bias from Errors of Omission and Measurement. Econometrica 40 (4): 757-758.
Pearson, Karl, and Alice Lee. 1903. On the Laws of Inheritance in Man. Biometrika 2 (4): 357-462.
Powell, James L. 1984. Least Absolute Deviations Estimation for the Censored Regression Model. Journal of Econometrics 25 (3): 303-325.
Powell, James L. 1986. Symmetrically Trimmed Least Squares Estimation for Tobit Models. Econometrica 54 (6): 1435-1460.
Sims, Christopher A. 1972. Money, Income, and Causality. American Economic Review 62 (4): 540-552.
Tobin, James. 1958. Estimation of Relationships for Limited Dependent Variables. Econometrica 26 (1): 24-36.
Vinod, Hrishikesh D. 1978. A Survey of Ridge Regression and Related Techniques for Improvements over Ordinary Least Squares. Review of Economics and Statistics 60 (1): 121-131.
White, Halbert. 1980. A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity. Econometrica 48 (4): 817-838.
Wickens, Michael. 1972. A Note on the Use of Proxy Variables. Econometrica 40 (4): 759-760.
Wiener, Norbert. 1956. The Theory of Prediction. In Modern Mathematics for the Engineer, Series 1, chapter 8, ed. Edwin F. Beckenbach. New York: McGraw-Hill.
Zellner, Arnold. 1979. Causality and Econometrics. In Three Aspects of Policy and Policymaking: Knowledge, Data, and Institutions, eds. Karl Brunner and Allan H. Meltzer, 9-50. Amsterdam, NY: North Holland.
A linear regression model is one in which the theoretical mean value, μ, of an observation y is a linear combination of independent variables,
$$\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k$$
when k x-variables are included in the model. The multiples β0, β1, …, βk are parameters of the model and are the quantities to be estimated; they are known as regression coefficients, β0 being the intercept or constant term. A model with more than one x-variable is known as a multiple regression model.
Nonlinear regression models express μ as a general function of the independent variables. The general functions include curves such as exponentials and ratios of polynomials, in which there are parameters to be estimated.
Various procedures have been devised to detect variables that make a significant contribution to the regression equation, and to find the combination of variables that best fits the data using as few variables as possible. Analysis of variance is used to assess the significance of the regression model. See also generalized linear model, influence.