Statistical Noise

Statistical noise refers to variability within a sample, stochastic disturbance in a regression equation, or estimation error. This noise is often represented as a random variable.

In the case of a regression equation

Y = m(X | θ) + ε,

with E(ε) = 0, the random variable ε is called a disturbance or error term and reflects statistical noise. Noise in this context is usually viewed as arising from omitted explanatory variables; as such, the error term is a proxy for variables not included in the regression. Variables may be omitted from the regression for several reasons. The theory determining the behavior of the dependent variable Y may be incomplete, or perhaps some variables known to influence Y are unavailable to the researcher. Variables that have only slight influence on Y might be eliminated from the regression in order to maintain a parsimonious model. If the conditional mean function m(x | θ) is specified parametrically (e.g., as in ordinary least squares), the error term might reflect error in this specification, which is perhaps only an approximation to the true form of m(x | θ).
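
The idea that the error term stands in for omitted variables can be illustrated with a small simulation. The following is a minimal sketch, not a general regression routine: the data-generating process (Y = 2 + 3·X1 + 0.5·X2 with no other randomness), the variable names, and the helper simple_ols are all assumptions made for illustration. X2 is drawn independently of X1 and then left out of the regression, so everything it contributes to Y shows up in the residuals.

```python
import random
import statistics


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 5000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]  # omitted variable, independent of x1

# Hypothetical data-generating process: Y depends on x1 and x2 only.
y = [2.0 + 3.0 * a + 0.5 * b for a, b in zip(x1, x2)]

# Regress Y on x1 alone; x2's contribution is absorbed by the error term.
intercept, slope = simple_ols(x1, y)
residuals = [yi - (intercept + slope * a) for yi, a in zip(y, x1)]

print(f"estimated slope on x1: {slope:.3f}")
print(f"mean residual:         {statistics.fmean(residuals):.2e}")
print(f"residual variance:     {statistics.pvariance(residuals):.3f}")
```

Under these assumptions the residual variance should be close to 0.5² × Var(X2) = 0.25, even though Y contains no randomness beyond the two explanatory variables. If X2 were correlated with X1, the estimated slope on X1 would also be affected, which this sketch deliberately avoids.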

Even if the regression equation includes all relevant variables, and if the conditional mean function is correctly specified, the error term may reflect either measurement error in Y or intrinsic randomness in Y. Intrinsic randomness might be the result of nonsystematic variation in human behavior if Y describes the action of individuals. Tastes, preferences, and the like may be explained partly by other variables, but notions of bounded rationality in microeconomic theory suggest that some behavior is inexplicable.

In the typical estimation paradigm, a finite sample of size n is drawn and used to compute an estimate θ̂ of some quantity θ that is of interest. Even if the estimator is statistically consistent, the estimate that is obtained will typically differ from the true quantity θ because the researcher does not have an infinite amount of data, but only a finite sample. The difference between θ̂ and θ can be expressed by writing

θ̂ = θ + ε,

where again ε represents statistical noise, which can be positive or negative. In principle, one could draw many samples of size n and compute estimates of θ from each sample; each estimate would differ from the true θ. These random differences constitute a form of statistical noise. In this case, the noise arises from the fact that each sample of size n will not have exactly the same characteristics (e.g., the means, variances, etc. of observations on individual variables will differ across samples, and will also differ from the mean, variance, etc. of the underlying population from which the data are drawn).
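
The repeated-sampling thought experiment described above can be sketched in code. This example assumes a normal population with mean θ = 10 and standard deviation 2, and uses the sample mean as the estimator; these choices, and the function name sampling_errors, are illustrative assumptions rather than part of the original discussion.

```python
import random
import statistics

THETA = 10.0  # true population mean (assumed for illustration)
SIGMA = 2.0   # population standard deviation (assumed for illustration)


def sampling_errors(n, reps, rng):
    """Draw `reps` samples of size n and return theta_hat - THETA for each."""
    errors = []
    for _ in range(reps):
        sample = [rng.gauss(THETA, SIGMA) for _ in range(n)]
        theta_hat = statistics.fmean(sample)  # estimate from this sample
        errors.append(theta_hat - THETA)      # the noise term for this sample
    return errors


rng = random.Random(1)  # fixed seed so the run is reproducible
for n in (25, 100, 400):
    eps = sampling_errors(n, reps=1000, rng=rng)
    print(
        f"n={n:4d}  mean error={statistics.fmean(eps):+.4f}  "
        f"sd of error={statistics.pstdev(eps):.4f}  "
        f"sigma/sqrt(n)={SIGMA / n ** 0.5:.4f}"
    )
```

Each entry of eps is one realization of ε in θ̂ = θ + ε. The errors scatter on both sides of zero, and for the sample mean their spread should be close to σ/√n, so the noise shrinks as the sample size grows but does not vanish for any finite n.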

Statistical noise plays a large role in determining what can be learned from a sample of data in any estimation setting. The variance of regression residuals determines, in part, the goodness of fit of an estimated regression line as well as the variance of estimators of regression parameters and other quantities. The variance of an estimator determines the precision of estimates that are obtained from data, which in turn affects the width of confidence intervals and the ability to reject null hypotheses of the form H₀: θ = 0. Statistical noise is often assumed to be normally distributed, but this assumption is inappropriate in many settings.
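
The link between noise, interval width, and the ability to reject H₀: θ = 0 can be made concrete with a short sketch. It assumes the quantity of interest is a population mean, uses a normal-approximation confidence interval, and treats the true mean (0.5), the sample size (100), and the two noise levels as hypothetical values chosen for illustration; mean_ci is an illustrative helper, not a standard library function.

```python
import random
import statistics


def mean_ci(sample, level=0.95):
    """Normal-approximation confidence interval for a population mean.

    Returns (estimate, lower bound, upper bound).
    """
    n = len(sample)
    theta_hat = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5  # estimated standard error
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return theta_hat, theta_hat - z * se, theta_hat + z * se


rng = random.Random(2)  # fixed seed so the run is reproducible
theta, n = 0.5, 100     # hypothetical true mean and sample size
for sigma in (1.0, 5.0):  # low-noise and high-noise settings
    sample = [rng.gauss(theta, sigma) for _ in range(n)]
    est, lower, upper = mean_ci(sample)
    rejects_h0 = not (lower <= 0.0 <= upper)  # two-sided test of theta = 0
    print(
        f"sigma={sigma}: estimate={est:.3f}, "
        f"95% CI=({lower:.3f}, {upper:.3f}), reject H0: {rejects_h0}"
    )
```

With the same true θ and the same n, the noisier setting produces an interval roughly five times as wide, which makes it harder for the interval to exclude zero. The normal approximation is itself an assumption; as noted above, it is not appropriate in every setting.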

SEE ALSO Least Squares, Ordinary; Measurement Error; Properties of Estimators (Asymptotic and Exact); Sampling; Semiparametric Estimation; Specification Error; Variables, Random; White Noise