Linear regression refers to a linear estimation of the relationship between a dependent variable and one or more independent variables.
Social researchers typically assume that two variables are linearly related unless they have strong reasons to believe the relationship is nonlinear. In general, a linear relationship between a dependent variable (Y) and an independent variable (X) can be expressed by the equation Y = a + bX, where a and b are fixed constants. The value of the dependent variable (Y) equals the constant (a) plus the slope (b) times the value of the independent variable (X). The slope (b) shows the amount of change in Y for every one-unit change in X. The constant (a) is also called the Y-intercept; it gives the value of Y when X = 0.
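As a minimal illustration of the roles of the intercept and the slope, the line Y = a + bX can be written as a short Python function. The function name predict and the values a = 2.0 and b = 0.5 are hypothetical, chosen only for this sketch.

```python
def predict(x, a, b):
    """Return the value of Y on the line Y = a + bX."""
    return a + b * x

# Hypothetical coefficients, chosen only for illustration.
a, b = 2.0, 0.5

# At X = 0 the line returns the Y-intercept a.
print(predict(0, a, b))  # 2.0

# A one-unit increase in X changes Y by exactly the slope b.
print(predict(4, a, b) - predict(3, a, b))  # 0.5
```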
Theoretically, if the dependent variable Y could be perfectly estimated by the independent variable X, then every observed Y value would lie exactly on the predicted line. The equation of the predicted line can be expressed as Ŷ = a + bX, where Ŷ (“Y hat”) represents the predicted value of Y. However, actual social data never follow a perfect linear relationship, and the observed value of Y is rarely on the predicted line. It is therefore necessary to take the deviations between the predicted and actual values into account in the linear regression model. For every X value in the data, the linear equation predicts a Y value on the “best-fitting” line, which is called the regression line. The linear regression model is then expressed as Y = a + bX + e, where e is the error term, or residual, which represents the distance between the predicted value (Ŷ) and the actual Y value in the data.
The goal of linear regression estimation is to identify the straight line that provides the best fit for a given set of data. The basic approach is to estimate the two regression coefficients (a and b) from the observed data by minimizing the residuals. In other words, the prediction errors produced by the regression equation must be smaller, taken together, than the errors produced by any other straight line. The method of Ordinary Least Squares (OLS) is the most popular way to measure how close the predicted scores are to the observed scores and to estimate the line accordingly.
OLS estimates the regression coefficients (a and b) that minimize the error sum of squares. That is, the OLS approach sums the squared differences between each observed score (Y) and the score predicted by the regression equation (Ŷ), and selects the line for which this sum is smaller than for any other straight line. The result is a measure of overall squared error between the line and the data: Total squared error = Σ(Y - Ŷ)².
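For a single independent variable, the coefficients that minimize the total squared error have a well-known closed form: b = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)² and a = Ȳ - bX̄, where X̄ and Ȳ are the sample means. The sketch below applies these formulas in plain Python. The function names ols_fit and sum_squared_error and the five data points are hypothetical and serve only as an example.

```python
def ols_fit(xs, ys):
    """Return the OLS intercept a and slope b for one independent variable."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of cross-products and sum of squares, both around the means.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

def sum_squared_error(xs, ys, a, b):
    """Total squared error between observed Y and predicted Y-hat."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, used only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = ols_fit(xs, ys)
print(round(a, 4), round(b, 4))                    # 2.2 0.6
print(round(sum_squared_error(xs, ys, a, b), 4))   # 2.4
```

Note that ols_fit divides by Σ(X - X̄)², so it fails if every X value is identical; in that case no slope can be estimated.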
In Figure 1 the distance between the actual data point (Y) and the predicted point on the line (Ŷ) is defined as Y - Ŷ. Intuitively, the best-fitting line should keep the absolute values of Y - Ŷ as small as possible overall. The signed distances cannot simply be added up, however, because some are positive and some are negative: for the OLS regression line they cancel exactly, so the sum of the residuals, Σ(Y - Ŷ) = Σe, is always zero. Squaring removes the negative signs, so that the sum of the squared errors is greater than zero whenever any point lies off the line. OLS chooses the line that minimizes this sum of squared prediction errors.
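Both properties can be checked numerically. The sketch below uses five hypothetical data points for which the OLS coefficients are a = 2.2 and b = 0.6 (obtainable from the standard closed-form OLS formulas). It confirms that the residuals cancel, and that a few nearby alternative lines, chosen arbitrarily, all give a larger sum of squared errors.

```python
# Hypothetical data; a = 2.2 and b = 0.6 are the OLS coefficients for it.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = 2.2, 0.6

def residuals(a, b):
    """Signed distances Y - Y-hat for the line Y-hat = a + bX."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def sse(a, b):
    """Sum of squared residuals for the line Y-hat = a + bX."""
    return sum(e ** 2 for e in residuals(a, b))

# Positive and negative residuals cancel (zero up to rounding error).
print(abs(sum(residuals(a, b))) < 1e-9)  # True

# The squared residuals do not cancel.
print(round(sse(a, b), 4))  # 2.4

# Shifting the intercept or the slope slightly always increases the error.
alternatives = [(a + 0.1, b), (a - 0.1, b), (a, b + 0.05), (a, b - 0.05)]
print(all(sse(a2, b2) > sse(a, b) for a2, b2 in alternatives))  # True
```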
Under certain ideal conditions, OLS is in fact superior to other estimators. If the necessary assumptions hold, OLS is the best linear unbiased estimator of the corresponding population parameters. The OLS assumptions include:
- Linearity. The relationship between the dependent variable (Y) and the independent variable(s) (Xs) is assumed to be linear, although sometimes the true relationship is better described by a curve.
- Normality. It is assumed that Y has a normal distribution for every possible value of X. This assumption is especially important in small samples, where marked skewness or severe outliers in the residuals can distort the results.
- Homoscedasticity. It is assumed that the population variance of Y is the same at all levels of X. This assumption of homoscedasticity is required for regression t and F tests, as well as difference-of-means tests and analysis of variance.
- No autocorrelation. The sample observations, and hence their errors, are assumed to be independent of each other, as they would be if they were individually and randomly selected from a large population. In addition, there should be no correlation between the errors and the X variables (Hamilton 1992, pp. 110-111).
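Some of these assumptions can be screened informally by inspecting the residuals of a fitted line. The sketch below is a rough illustration rather than a set of formal tests. It computes the Durbin-Watson statistic (values near 2 suggest no first-order autocorrelation, which is meaningful only when the observations have a natural order), the ratio of residual variance between the upper and lower halves of X (values far from 1 hint at heteroscedasticity), and the skewness of the residuals (values far from 0 hint at non-normality). The function names and the residual values are hypothetical.

```python
import statistics

def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

def variance_ratio(xs, residuals):
    """Residual variance in the upper half of X over that in the lower half."""
    pairs = sorted(zip(xs, residuals))
    half = len(pairs) // 2
    low = [e for _, e in pairs[:half]]
    high = [e for _, e in pairs[-half:]]
    return statistics.pvariance(high) / statistics.pvariance(low)

def skewness(residuals):
    """Average cubed standardized residual (population skewness)."""
    m = statistics.fmean(residuals)
    s = statistics.pstdev(residuals)
    return sum(((e - m) / s) ** 3 for e in residuals) / len(residuals)

# Hypothetical X values and residuals, used only for illustration.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
res = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3]

print(round(durbin_watson(res), 2))
print(round(variance_ratio(xs, res), 2))
print(round(skewness(res), 2))
```

These quantities are descriptive only. Judging whether a departure is large enough to matter calls for the formal tests and graphical diagnostics described in regression texts.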
However, data that satisfy all the OLS assumptions are rather rare in practice. The assumptions may be violated by a nonlinear relationship, the omission of relevant independent variables (Xs), the inclusion of irrelevant independent variables (Xs), correlation between X and the errors, measurement error in X, heteroscedasticity, non-normally distributed errors, or multicollinearity. To deal with these problems, certain nonlinear or robust estimators can perform better and more efficiently than OLS, and can yield unbiased estimates of the relationship between the X and Y variables. The choice of regression technique therefore depends heavily on the characteristics of the data at hand.
SEE ALSO Logistic Regression; Ordinary Least Squares Regression
Hamilton, Lawrence C. 1992. Regression with Graphics: A Second Course in Applied Statistics. Belmont, CA: Duxbury Press.