Nonparametric Regression


A regression model may be written generally as

$$ y_i = m(x_i \mid \theta) + \varepsilon_i, \qquad (1) $$

where the subscript i = 1, …, n indexes observations, εi is a random error term, uncorrelated with the regressor xi and with zero expectation, and θ represents a vector of parameters. Given E(εi) = 0, E(yi | xi) = m(xi | θ); consequently, the function m(xi | θ) is often called the conditional mean function.

In ordinary least squares regression, the parameter vector is assumed to be of finite length K < ∞, and the conditional mean function is assumed to be linear in parameters, e.g.,

$$ m(x_i \mid \theta) = \theta_1 + x_{i2}\theta_2 + \cdots + x_{iK}\theta_K. $$

Moreover, in small samples the errors are typically assumed to be normally distributed in order to facilitate inference. With maximum likelihood estimation, the parameter vector is again assumed to be of finite length K, and a particular form is assumed for the conditional mean function, although linearity in parameters is not necessary. In addition, the errors are assumed to come from a particular family of distributions (e.g., normal, beta, gamma, etc.); with independent sampling, the likelihood function can then be derived as a product of marginal probability density functions, each evaluated at one of n sample observations.
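For contrast with the nonparametric approach discussed below, the parametric case can be illustrated with a minimal sketch that fits a conditional mean linear in parameters by least squares; the simulated data, coefficient values, and variable names are illustrative assumptions, not taken from the text.

```python
# Illustrative sketch: OLS estimation of a conditional mean that is linear in
# parameters, m(x_i | theta) = theta_1 + x_i2*theta_2 + x_i3*theta_3 (K = 3 here).
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3.0, 3.0, size=(n, 2))      # two hypothetical regressors
eps = rng.normal(scale=0.5, size=n)          # zero-mean errors, uncorrelated with x
y = 1.0 + 0.8 * x[:, 0] - 0.5 * x[:, 1] + eps

# Design matrix with an intercept column; theta has finite length K = 3.
X = np.column_stack([np.ones(n), x])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta_hat)                              # estimates of (theta_1, theta_2, theta_3)
```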

Nonparametric regression involves using one of several techniques to avoid the need to assume specific functional forms for the conditional mean function as well as the distribution of the error terms in (1). By making fewer assumptions, one avoids the risk of misspecifying either the conditional mean function or the distribution of the errors, which can lead to biased and inconsistent estimation. Moreover, with nonparametric regression techniques, the underlying, true model that is estimated can itself be viewed as nonparametric, in the sense that it cannot be represented by a function involving a finite number of parameters. In other words, the parameter vector θ can be assumed to have an infinite number of elements. Consequently, the variety of shapes of conditional mean functions that can be estimated by nonparametric regression methods is far greater than what is possible with more conventional parametric estimation methods such as ordinary least squares or maximum likelihood. However, the increased flexibility comes with some costs; in particular, inference is often more difficult, computational burdens are greater, and rates of convergence are slower with nonparametric methods than with parametric methods.

Several nonparametric regression estimators have been developed. One of the most widely used is the Nadaraya-Watson estimator (Nadaraya, 1964; Watson, 1964), given by

$$ \hat{m}(x) = \frac{\sum_{i=1}^{n} K_h(x - x_i)\, y_i}{\sum_{i=1}^{n} K_h(x - x_i)}, \qquad (2) $$

where h is a smoothing parameter, or bandwidth, and Kh(·) = K(·/h) is a weighting, or kernel, function that is symmetric and integrates to 1; i.e., K(−t) = K(t) and ∫K(t)dt = 1. A wide variety of functions satisfy these conditions; for example, any symmetric probability density function can be used as a kernel function, as can any even-order polynomial over some interval from −ζ to +ζ with coefficients chosen so that the polynomial integrates to 1 over this interval.
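
A minimal sketch of the estimator in (2), assuming the standard normal density as the kernel, is given below; the function and variable names are illustrative, not from the text.

```python
# Minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel.
import numpy as np

def gaussian_kernel(t):
    """Standard normal density: symmetric and integrates to 1."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def nadaraya_watson(x_eval, x, y, h):
    """Kernel-weighted average of y in a neighborhood of each point in x_eval."""
    x_eval = np.atleast_1d(x_eval)
    # Weights K((x - x_i)/h); any common scale factor cancels in the ratio (2).
    w = gaussian_kernel((x_eval[:, None] - x[None, :]) / h)
    return (w * y).sum(axis=1) / w.sum(axis=1)

# Example use on simulated nonlinear data (illustrative only).
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
grid = np.linspace(0, 10, 101)
m_hat = nadaraya_watson(grid, x, y, h=1.0)
```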

Although researchers using the Nadaraya-Watson estimator (as well as other nonparametric estimators) must choose both a kernel function and a value for the smoothing parameter, the choice of kernel function is less important in determining estimation error than the choice of a value for the bandwidth. The bandwidth h determines the degree of smoothness of the estimator. As h → 0, the bias of the estimator diminishes but its variance increases; as h → ∞, bias increases while variance decreases. A number of data-driven techniques have been developed to optimize the choice of h with respect to one of several criteria; for example, cross-validation methods can be used to choose a value for h that minimizes either mean integrated square error or asymptotic mean integrated square error; see Pagan and Ullah (1999) for details. Alternatively, the less computationally burdensome plug-in procedures proposed by Sheather and Jones (1991) can be used to choose an optimal value for the bandwidth.
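
One common data-driven approach is leave-one-out (least-squares) cross-validation, sketched below for the Nadaraya-Watson estimator with a Gaussian kernel; the simulated data and the candidate grid of bandwidths are illustrative assumptions, and this is only one of the criteria mentioned above.

```python
# Sketch of leave-one-out cross-validation for choosing the bandwidth h.
import numpy as np

def loo_cv_score(x, y, h):
    """Average squared error when each y_i is predicted from a
    Nadaraya-Watson fit that leaves observation i out."""
    t = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel weights
    np.fill_diagonal(w, 0.0)                          # leave out own observation
    m_loo = (w * y).sum(axis=1) / w.sum(axis=1)
    return np.mean((y - m_loo) ** 2)

# Choose h minimizing the cross-validation criterion over a candidate grid.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
grid = np.linspace(0.1, 4.0, 40)
h_star = min(grid, key=lambda h: loo_cv_score(x, y, h))
print(h_star)
```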

Figure 1 provides an example illustrating the Nadaraya-Watson estimator. With the same data displayed in each panel of Figure 1, estimated regression lines are shown for the ordinary least squares (OLS) estimate of the model

$$ y_i = \theta_1 + \theta_2 x_i + \varepsilon_i \qquad (3) $$

and Nadaraya-Watson kernel estimates of (1) with the standard normal density function used as the kernel function and with three different values of h: 0.1, 1.0, and 4.0. The data clearly follow a nonlinear pattern, which the OLS estimator of (3) cannot replicate. With h = 1.0, the kernel estimator captures the pattern of the data nicely, providing a reasonably smooth estimate of the regression curve.

When the bandwidth is reduced to 0.1, the kernel estimate becomes jagged, while increasing the bandwidth to 4.0 results in an estimated curve that is much flatter than the estimates obtained with smaller bandwidths. The two panels in the bottom row of Figure 1 illustrate the tradeoff between bias and variance associated with larger or smaller values of h as discussed above.
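
The comparison in Figure 1 can be approximated in spirit with the sketch below; the data behind the figure are not given, so simulated nonlinear data are used, and the statsmodels KernelReg class (with reg_type='lc', the local-constant or Nadaraya-Watson estimator) is one possible implementation, an assumption rather than the method actually used to produce the figure.

```python
# Sketch in the spirit of Figure 1: OLS fit versus Nadaraya-Watson fits
# with bandwidths 0.1, 1.0, and 4.0, on simulated (illustrative) data.
import numpy as np
from statsmodels.nonparametric.kernel_regression import KernelReg

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 10, 150))
y = np.sin(x) + rng.normal(scale=0.3, size=150)
grid = np.linspace(0, 10, 200)

# Linear OLS fit, analogous to an estimate of a model like (3).
b1, b0 = np.polyfit(x, y, 1)
ols_fit = b0 + b1 * grid

# Local-constant (Nadaraya-Watson) fits with the three bandwidths.
kernel_fits = {h: KernelReg(y, x, var_type='c', reg_type='lc', bw=[h]).fit(grid)[0]
               for h in (0.1, 1.0, 4.0)}
```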

The Nadaraya-Watson estimator, when evaluated at an arbitrary point x, is simply a weighted average of the values yi in a neighborhood of x; the size of the neighborhood and the weights given to each yi are determined jointly by the kernel function K(·) and the bandwidth h. Local polynomial estimators, by contrast, estimate the conditional mean function by fitting locally a pth-order polynomial to the data. This approach offers several advantages over the earlier Nadaraya-Watson estimator, as discussed by Fan and Gijbels (1996). Setting p = 1 yields the local linear estimator, which has less bias but no greater variance than the Nadaraya-Watson estimator. In addition, while the Nadaraya-Watson estimator is inconsistent near the boundaries of the support of x, local polynomial estimators remain consistent near such boundaries. And while estimation of regression derivatives is possible using modifications of the Nadaraya-Watson estimator, estimation of derivatives is straightforward with local polynomial estimators. As with the Nadaraya-Watson estimator, a number of techniques have been developed to optimize the choice of bandwidth when local polynomial estimators are used, as discussed by Fan and Gijbels (1996).
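
A minimal sketch of the local linear (p = 1) case, in which a line is fit by kernel-weighted least squares at each evaluation point, is given below; the Gaussian kernel and all names are illustrative assumptions.

```python
# Minimal sketch of a local linear (p = 1) estimator with a Gaussian kernel.
# At each evaluation point x0, a line is fit by weighted least squares with
# weights K((x_i - x0)/h); the fitted intercept estimates m(x0).
import numpy as np

def local_linear(x_eval, x, y, h):
    x_eval = np.atleast_1d(x_eval)
    fitted = np.empty(len(x_eval))
    for j, x0 in enumerate(x_eval):
        w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # Gaussian kernel weights
        X = np.column_stack([np.ones_like(x), x - x0])  # local design: 1, (x_i - x0)
        W = np.diag(w)
        beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        fitted[j] = beta[0]       # intercept = m_hat(x0); beta[1] estimates m'(x0)
    return fitted

# Example use on simulated data (illustrative only).
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=200)
m_hat = local_linear(np.linspace(0, 10, 101), x, y, h=1.0)
```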

SEE ALSO Data; Econometric Decomposition; Frequency Distributions; Linear Regression; Ordinary Least Squares Regression; Regression; Regression Analysis; Statistics

BIBLIOGRAPHY

Fan, Jianqing, and Irène Gijbels. 1996. Local Polynomial Modelling and Its Applications. London: Chapman and Hall.

Nadaraya, E. A. 1964. On Estimating Regression. Theory of Probability and Its Applications 10: 186–190.

Pagan, Adrian, and Aman Ullah. 1999. Nonparametric Econometrics. Cambridge, U.K.: Cambridge University Press.

Sheather, S. J., and M. C. Jones. 1991. A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society, Series B 53 (3): 683–690.

Watson, G. S. 1964. Smooth Regression Analysis. Sankhya, Series A 26: 359–372.

Paul W. Wilson