A mathematical model involves an explicit set of equations that describe the relationships among the variables contained in the model. It is important to know not only which variables to include in the model, but also the proper functional forms of the mathematical equations. In the natural sciences, the functional forms are often known from the laws of nature. If $a$ = acceleration, $t$ = time, and $d$ = displacement, the laws of physics dictate the functional form $d = at^2/2$. The problem of selecting the proper functional form is particularly difficult in the social sciences because the laws of human behavior are not as precise as the laws of nature. Moreover, the social sciences are not like the laboratory sciences, which allow for repeated experiments with the aim of determining the precise mathematical relationships among the variables.
Consider the general functional form for the case of two independent variables $x_1$ and $x_2$: $y = f(x_1, x_2, \varepsilon)$, where $y$ is the dependent variable, $x_1$ and $x_2$ are the independent variables, and $\varepsilon$ is the error term representing the variation in $y$ not explained by $x_1$ and $x_2$. Although a particular theory might specify the signs of the partial relationships between $y$, $x_1$, and $x_2$, the form of the function $f(\cdot)$ is typically unknown. The standard procedure is to posit a linear relationship and to estimate the coefficients $a_0$, $a_1$, and $a_2$ in the regression equation $y = a_0 + a_1 x_1 + a_2 x_2 + \varepsilon$.
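The estimation step described above can be sketched with ordinary least squares. This is a minimal illustration on simulated data, assuming NumPy is available; the sample size, the true coefficient values, and the variable names are all assumptions made for the example, not part of the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 200

# Simulated data (illustrative only): the "true" coefficients are assumed.
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
eps = rng.normal(0, 0.1, n)
y = 1.0 + 2.0 * x1 - 3.0 * x2 + eps

# Design matrix with a column of ones for the intercept a0.
X = np.column_stack([np.ones(n), x1, x2])

# Ordinary least squares: pick (a0, a1, a2) minimizing the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a0_hat, a1_hat, a2_hat = coef

print(f"a0 = {a0_hat:.3f}, a1 = {a1_hat:.3f}, a2 = {a2_hat:.3f}")
```

Because the simulated relationship really is linear, the estimates should land close to the assumed values of 1, 2, and -3.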
The linear form can be viewed as a first-order approximation of the function $f(\cdot)$. For many circumstances, such approximations can be quite useful. However, if the continued application of either independent variable has, say, diminishing effects on $y$, the linear approximation will be invalid when the change in $x_1$ or $x_2$ is large. Similarly, the linear approximation might be poor if the effect of $x_1$ on $y$ depends on the level of the variable $x_2$.
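A small numerical sketch can make the point about large changes concrete. It assumes, purely for illustration, that the true relationship is the natural logarithm (a function with diminishing effects) and compares it with its first-order approximation around the hypothetical point x0 = 1.

```python
import math

def f(x):
    # Assumed "true" relationship with diminishing effects.
    return math.log(x)

def linear_approx(x, x0=1.0):
    # First-order approximation: f(x0) + f'(x0) * (x - x0), where f'(x) = 1/x.
    return math.log(x0) + (x - x0) / x0

# Approximation error for a small change and for a large change in x.
small_error = abs(f(1.1) - linear_approx(1.1))  # about 0.005
large_error = abs(f(3.0) - linear_approx(3.0))  # about 0.90

print(f"error at x = 1.1: {small_error:.4f}")
print(f"error at x = 3.0: {large_error:.4f}")
```

The approximation is nearly exact close to x0 but overstates the effect badly once x moves far from it.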
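A popular alternative to the standard linear model is to express some or all of the variables in logarithmic form. Consider equation (1):
$$\ln(y) = a_0 + a_1 \ln(x_1) + a_2 \ln(x_2) + \varepsilon_1 \tag{1}$$
This is equivalent to $y = c\, x_1^{a_1} x_2^{a_2} \varepsilon$, where $a_0 = \ln(c)$ and $\varepsilon_1 = \ln(\varepsilon)$.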
Notice that equation (1) is linear in the coefficients as a result of the logarithms, so the coefficients can be easily estimated using linear regression methods. Also, the coefficients have the straightforward interpretation that $a_i$ ($i = 1, 2$) is approximately the percentage change in $y$ resulting from a 1 percent change in $x_i$, holding the other variable constant; that is, $a_i$ is the elasticity of $y$ with respect to $x_i$.
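The logarithmic specification can be sketched as follows, again on simulated data and assuming NumPy. The values chosen for c, a1, and a2 and the multiplicative error are assumptions for illustration. The last lines check the percentage-change interpretation by raising x1 by 1 percent from an arbitrary level.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Assumed multiplicative model: y = c * x1**a1 * x2**a2 * eps.
c, a1, a2 = 2.0, 0.7, -0.3
x1 = rng.uniform(1, 10, n)
x2 = rng.uniform(1, 10, n)
eps = np.exp(rng.normal(0, 0.05, n))  # positive error, so ln(eps) is well defined
y = c * x1**a1 * x2**a2 * eps

# Taking logs makes the model linear in a0 = ln(c), a1, and a2.
X = np.column_stack([np.ones(n), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
a0_hat, a1_hat, a2_hat = coef
c_hat = np.exp(a0_hat)

# Predicted percentage change in y when x1 rises by 1 percent (x2 held fixed).
x1_level = 5.0
pct_change = ((1.01 * x1_level) ** a1_hat / x1_level ** a1_hat - 1.0) * 100.0

print(f"c = {c_hat:.3f}, a1 = {a1_hat:.3f}, a2 = {a2_hat:.3f}")
print(f"1% rise in x1 changes y by about {pct_change:.3f}%")
```

The computed percentage change is close to the estimated a1 itself, which is the sense in which the coefficient can be read as an elasticity.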
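Other popular specifications include those using powers of the variables and their products. In the two-variable case, a second-order approximation is equation (2):
$$y = a_0 + a_1 x_1 + a_2 x_2 + a_{11} x_1^2 + a_{22} x_2^2 + a_{12} x_1 x_2 + \varepsilon \tag{2}$$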
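One way to see what the squared and product terms add is to differentiate the second-order form. The sketch below assumes the squared terms carry coefficients $a_{11}$ and $a_{22}$ and the cross-product term carries $a_{12}$; under that labeling the marginal effects work out as:

```latex
\frac{\partial y}{\partial x_1} = a_1 + 2a_{11}x_1 + a_{12}x_2,
\qquad
\frac{\partial y}{\partial x_2} = a_2 + 2a_{22}x_2 + a_{12}x_1
```

With $a_{11} < 0$ the effect of $x_1$ diminishes as $x_1$ grows, and with $a_{12} \neq 0$ it depends on the level of $x_2$. These are the two situations in which a purely linear form tends to fit poorly.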
It is important to use the correct functional form to obtain unbiased and consistent coefficient estimates of the effects of the independent variables on the dependent variable $y$. One way to select the functional form is to use a general-to-specific methodology: estimate a very general nonlinear form and, through hypothesis testing, determine whether it is possible to pare down the model to a more specific form. In equation (2), if $a_{11}$, $a_{22}$, and $a_{12}$ are not significantly different from zero, it can be claimed that the linear form is more appropriate than the second-order approximation. When searching over many different functional forms, however, the usual t-tests and F-tests of statistical significance, as well as the usual measures of fit, such as $R^2$, are generally not appropriate. As one adds regressors and allows for more general functional forms, the in-sample fit of the regression cannot worsen and will typically improve. Moreover, since every sample has a few unusual observations, there is the danger that a general-to-specific search will lead to overfitting the data, in the sense of selecting an overly complicated functional form.
Hence, in a specification search for the most appropriate functional form, most researchers use the specific-to-general methodology: estimate a simple model, and perform a number of diagnostic checks to determine whether the model is adequate. Then estimate a more complicated specification only if there is some sort of diagnostic failure. Some diagnostic tests, such as Ramsey’s (1969) RESET, attempt to determine whether the regression’s error terms are truly random. Others, such as Brown, Durbin, and Evans’s (1975) CUSUM test, attempt to determine whether the regression coefficients are constant over the entire sample. Nonrandom errors and/or non-constant coefficients indicate that the functional form of the estimating equation is incorrect.
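A diagnostic check in the spirit of RESET can be sketched as below. This is a simplified illustration, not a full implementation of Ramsey's procedure: it assumes the common variant that adds the squared and cubed fitted values to the simple model and forms an F-statistic for their joint significance. The function names, the simulated data, and the coefficient values are all hypothetical, and NumPy is assumed.

```python
import numpy as np

def ols_fit(y, X):
    """Return OLS fitted values and the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    resid = y - fitted
    return fitted, float(resid @ resid)

def reset_f_stat(y, X):
    """F-statistic for adding fitted**2 and fitted**3 to the regression of y on X."""
    fitted, rss_restricted = ols_fit(y, X)
    X_aug = np.column_stack([X, fitted**2, fitted**3])
    _, rss_unrestricted = ols_fit(y, X_aug)
    q = 2                             # number of added regressors
    df = len(y) - X_aug.shape[1]      # residual degrees of freedom, augmented model
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df)

rng = np.random.default_rng(3)
n = 200
x1 = rng.uniform(1, 5, n)
x2 = rng.uniform(1, 5, n)
noise = rng.normal(0, 0.5, n)
X = np.column_stack([np.ones(n), x1, x2])  # the simple linear specification

y_linear = 1.0 + 2.0 * x1 - 1.0 * x2 + noise  # linear form is correct
y_curved = y_linear + 0.5 * x1**2             # linear form omits curvature

f_linear = reset_f_stat(y_linear, X)
f_curved = reset_f_stat(y_curved, X)
print(f"F (linear truth): {f_linear:.2f}")
print(f"F (curved truth): {f_curved:.2f}")
```

Under the null hypothesis that the simple form is adequate, the statistic is compared with an F distribution with 2 and n - 5 degrees of freedom. A large value, as in the curved case here, is the kind of diagnostic failure that would justify moving to a more general specification.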
Another method that can be used to select the most appropriate functional form is out-of-sample forecasting (in which the researcher holds back a portion of the observations from the estimation process and estimates the alternative models over the shortened span of data). Forecasts from the alternative models are compared to the actual values of the data held back. The model with the smallest forecast errors is deemed the best.
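The holdout comparison can be sketched as follows, assuming NumPy. The data are simulated with a logarithmic effect of x1, and the two candidate specifications, the split sizes, and the helper names are hypothetical choices for illustration. Root mean squared forecast error is used here as one reasonable summary of the forecast errors.

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_holdout = 240, 60

# Simulated data: the assumed truth has a diminishing (logarithmic) effect of x1.
x1 = rng.uniform(1, 20, n)
x2 = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * np.log(x1) + 0.5 * x2 + rng.normal(0, 0.3, n)

def design_linear(x1, x2):
    return np.column_stack([np.ones(len(x1)), x1, x2])

def design_log(x1, x2):
    return np.column_stack([np.ones(len(x1)), np.log(x1), x2])

def holdout_rmse(design, x1, x2, y, n_holdout):
    """Estimate on the shortened sample, then forecast the held-back observations."""
    X = design(x1, x2)
    X_fit, y_fit = X[:-n_holdout], y[:-n_holdout]
    X_new, y_new = X[-n_holdout:], y[-n_holdout:]
    coef, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
    errors = y_new - X_new @ coef
    return float(np.sqrt(np.mean(errors**2)))

rmse_linear = holdout_rmse(design_linear, x1, x2, y, n_holdout)
rmse_log = holdout_rmse(design_log, x1, x2, y, n_holdout)
print(f"out-of-sample RMSE, linear form: {rmse_linear:.3f}")
print(f"out-of-sample RMSE, log form:    {rmse_log:.3f}")
```

Because the held-back observations play no part in estimation, a more flexible form gains nothing from fitting unusual observations in the estimation sample, which is what makes this comparison a guard against overfitting.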
SEE ALSO Regression
Brown, R. L., J. Durbin, and J. M. Evans. 1975. Techniques for Testing the Constancy of Regression Relationships over Time. Journal of the Royal Statistical Society, Series B, 37: 149–192.
Ramsey, J. B. 1969. Tests for Specification Errors in Classical Least-Squares Regression Analysis. Journal of the Royal Statistical Society, Series B, 31: 350–371.