Maximum Likelihood Regression

Maximum likelihood is a methodology used to estimate the parameters of an econometric or statistical model. It was first proposed by Ronald Aylmer Fisher (1890–1962) and is now considered the workhorse of modern econometrics, not only because of its flexibility but also due to the availability of computer power, which has permitted the resolution of complicated numerical problems associated with this technique. Maximum likelihood estimation seeks to determine the parameters of a statistical process that have the highest probability of generating the observed sample of data.

Consider the following regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \varepsilon_i$ for $i = 1, 2, \dots, n$. In the simplest case, one can assume that the error term $\varepsilon_i$ is an independent and identically distributed (iid) normal random variable with mean zero and variance $\sigma^2$; that is, $\varepsilon_i \sim N(0, \sigma^2)$. It will be shown below that the assumption of a particular density function for $\varepsilon_i$ is essential for writing the likelihood function. Under the assumption of normality, the probability density function of the error term is written as

$$f(\varepsilon_i; \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$

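Since $\varepsilon_i = Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki}$, the assumption of normality of the error term is equivalent to the assumption of normality for the conditional probability density function of $Y$ given $X$: $Y_i \mid X_{1i}, X_{2i}, \dots, X_{ki} \sim N(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}, \sigma^2)$; that is,

$$f(Y_i \mid X_{1i}, X_{2i}, \dots, X_{ki}; \beta_0, \beta_1, \dots, \beta_k, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki})^2}{2\sigma^2}\right).$$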

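The objective is to estimate the parameter vector $\theta \equiv (\beta_0, \beta_1, \beta_2, \dots, \beta_k, \sigma^2)'$. For a sample of size $n$, and because of the iid assumption, the joint probability density function of $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ is the product of the marginal densities: $f(\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n; \theta) = f(\varepsilon_1; \theta) f(\varepsilon_2; \theta) \cdots f(\varepsilon_n; \theta)$. The likelihood function $L(\theta; \varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)$ is based on the joint probability density function of $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$, where $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are taken as fixed data and the parameter vector $\theta$ is the argument of the function: $L(\theta; \varepsilon_1, \varepsilon_2, \dots, \varepsilon_n) = f(\varepsilon_1; \theta) f(\varepsilon_2; \theta) \cdots f(\varepsilon_n; \theta)$. Note that $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are functions of the data $(Y_i, X_{1i}, X_{2i}, \dots, X_{ki})$ through the regression model; that is to say, $\varepsilon_i = Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \dots - \beta_k X_{ki}$.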

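Though the aforementioned regression model deals with a cross-sectional data set, the maximum likelihood principle also applies to time-series data and panel data. For instance, a time-series regression such as $Y_t = \Phi Y_{t-1} + \varepsilon_t$ for $t = 1, 2, \dots, T$, with iid $\varepsilon_t \sim N(0, \sigma^2)$, implies that the conditional density function of $Y_t$ given $Y_{t-1}$ is also normal, $Y_t \mid Y_{t-1} \sim N(\Phi Y_{t-1}, \sigma^2)$; that is,

$$f(Y_t \mid Y_{t-1}; \Phi, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_t - \Phi Y_{t-1})^2}{2\sigma^2}\right).$$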

As before, the object of interest is to estimate $\theta \equiv (\Phi, \sigma^2)'$, and the likelihood function for a sample of size $T$ is $L(\theta; Y_1, Y_2, \dots, Y_T) = f(Y_1; \theta) f(Y_2 \mid Y_1; \theta) \cdots f(Y_T \mid Y_{T-1}; \theta)$. This function requires knowledge of the marginal density of the first observation, $f(Y_1; \theta)$. When the sample size is very large, the contribution of the first observation is negligible for practical purposes. Conditioning on the first observation (that is, treating $Y_1$ as known), we define the conditional likelihood function as $L(\theta; Y_T, Y_{T-1}, \dots, Y_2 \mid Y_1) = f(Y_2 \mid Y_1; \theta) \cdots f(Y_T \mid Y_{T-1}; \theta)$.
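The conditional likelihood for this time-series model can be sketched in a few lines of Python. The example below is illustrative only: the simulated series, the true values 0.6 and 1.0, and the function names are assumptions made for the sketch. It relies on the fact that, under normality, the conditional likelihood is maximized by regressing $Y_t$ on $Y_{t-1}$ without an intercept.

```python
import math
import random


def conditional_log_likelihood(phi, sigma2, y):
    """Log of f(Y_2|Y_1) * ... * f(Y_T|Y_{T-1}) for the AR(1) model with normal errors."""
    total = 0.0
    for t in range(1, len(y)):
        e = y[t] - phi * y[t - 1]
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - e * e / (2.0 * sigma2)
    return total


def conditional_mle(y):
    """Closed-form maximizer of the conditional log-likelihood."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    phi_hat = num / den
    ssr = sum((y[t] - phi_hat * y[t - 1]) ** 2 for t in range(1, len(y)))
    sigma2_hat = ssr / (len(y) - 1)  # T - 1 terms enter the conditional likelihood
    return phi_hat, sigma2_hat


# Simulated series; the true values (Phi = 0.6, sigma^2 = 1.0) are assumptions of this sketch.
random.seed(0)
y = [0.0]
for _ in range(499):
    y.append(0.6 * y[-1] + random.gauss(0.0, 1.0))

phi_hat, sigma2_hat = conditional_mle(y)
print(phi_hat, sigma2_hat)
```

Evaluating `conditional_log_likelihood` at nearby parameter values gives a quick check that the closed-form solution is indeed the maximizer.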

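Mathematically, it is more convenient to work with the logarithm of the likelihood function. For the regression model above, the log-likelihood function is defined as

$$\ln L(\theta) = \sum_{i=1}^{n} \ln f(Y_i \mid X_{1i}, X_{2i}, \dots, X_{ki}; \theta).$$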
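To make the relationship between the likelihood (a product of densities) and the log-likelihood (a sum of log densities) concrete, here is a minimal Python sketch for a regression with a single regressor. The toy data, the two candidate parameter vectors, and all function names are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(e, sigma2):
    """Density of N(0, sigma2) evaluated at e."""
    return math.exp(-e * e / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def residuals(theta, y, x):
    """Errors implied by the model Y_i = b0 + b1 * X_i + e_i."""
    b0, b1, _ = theta
    return [yi - b0 - b1 * xi for yi, xi in zip(y, x)]


def likelihood(theta, y, x):
    """Product of the individual normal densities."""
    value = 1.0
    for e in residuals(theta, y, x):
        value *= normal_pdf(e, theta[2])
    return value


def log_likelihood(theta, y, x):
    """Sum of the log densities, written in closed form under normality."""
    sigma2 = theta[2]
    n = len(y)
    ssr = sum(e * e for e in residuals(theta, y, x))
    return (-0.5 * n * math.log(2.0 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - ssr / (2.0 * sigma2))


# Hypothetical toy data (made up for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

theta_a = (0.0, 2.0, 0.05)  # close to the pattern in the toy data
theta_b = (1.0, 1.0, 0.05)  # a poor description of the toy data

# The log of the product equals the sum of the logs (up to rounding).
print(log_likelihood(theta_a, y, x), math.log(likelihood(theta_a, y, x)))
# The parameter vector that fits better has the higher log-likelihood.
print(log_likelihood(theta_a, y, x) > log_likelihood(theta_b, y, x))
```

In larger samples the raw product of densities quickly underflows to zero in floating-point arithmetic, which is a practical reason (beyond mathematical convenience) to work with the sum of logs.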

The maximum likelihood estimator (MLE) of $\theta$ is the value of $\theta$ that maximizes the likelihood of observing the sample $(Y_i, X_{1i}, X_{2i}, \dots, X_{ki})$, $i = 1, \dots, n$. In estimating a regression model using maximum likelihood, the question is which value of $\theta \equiv (\beta_0, \beta_1, \beta_2, \dots, \beta_k, \sigma^2)'$, out of all the possible values, makes the observed data most probable. Since the log transformation is monotonic, the value of $\theta$ that maximizes the likelihood function is the same as the one that maximizes the log-likelihood function. The statistical inference problem is thus reduced to a mathematical problem, which under the assumption of normality looks like

$$\max_{\theta} \ln L(\theta) = \max_{\theta} \left[ -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki})^2 \right].$$

Equating the first-order conditions (the score vector of first-order partial derivatives) to zero results in a system of $k + 2$ equations with $k + 2$ unknowns. The solution to this system is the maximum likelihood estimator. The solution is a maximum of the function if the Hessian (the matrix of second-order partial derivatives), evaluated at the solution, is negative definite.
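As a sketch of what these first-order conditions look like under normality (writing $X_{0i} \equiv 1$ for the intercept, a notational convenience introduced here for compactness), differentiating the normal log-likelihood with respect to each parameter gives

$$\frac{\partial \ln L}{\partial \beta_j} = \frac{1}{\sigma^2}\sum_{i=1}^{n} X_{ji}\,(Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki}) = 0, \qquad j = 0, 1, \dots, k,$$

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki})^2 = 0.$$

The first $k + 1$ equations are linear in the $\beta$ coefficients once the common factor $1/\sigma^2$ is dropped, so they can be solved without knowing $\sigma^2$. Substituting that solution into the last equation yields $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \hat{\varepsilon}_i^2$, where $\hat{\varepsilon}_i$ denotes the fitted residual.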

For a linear regression model, the MLE is very easy to compute. First, we compute the solution for the system of equations corresponding to the parameter vector $(\beta_0, \beta_1, \dots, \beta_k)'$. This system is linear, and its solution is identical to the ordinary least squares (OLS) estimator; that is, $\hat{\beta}_{mle} = \hat{\beta}_{ols} = (X'X)^{-1}X'Y$. Second, the maximum likelihood estimator of the variance is straightforward to compute once the MLE $\hat{\beta}_j$'s are obtained. The MLE $\hat{\sigma}^2_{mle}$ is the average of the squared residuals, which is a biased estimator of the population variance because it divides by $n$ rather than by $n - k - 1$. Note that $\hat{\beta}_{mle}$ is identical to $\hat{\beta}_{ols}$ when the likelihood function is constructed under the assumption of normality.
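For the special case of a single regressor ($k = 1$), the closed-form solution can be sketched as follows. The toy data and function names are hypothetical; the point of the example is only to show the two-step computation and the difference between the ML variance estimator (dividing by $n$) and the usual unbiased one (dividing by $n - 2$ when $k = 1$).

```python
import math


def fit_simple_regression_mle(y, x):
    """MLE of (b0, b1, sigma^2) for Y_i = b0 + b1 * X_i + e_i with normal errors."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                 # step 1: same as the OLS slope
    b0 = y_bar - b1 * x_bar        # step 1: same as the OLS intercept
    ssr = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
    sigma2_mle = ssr / n           # step 2: ML estimator (biased)
    s2_unbiased = ssr / (n - 2)    # degrees-of-freedom correction, shown for comparison
    return b0, b1, sigma2_mle, s2_unbiased


def normal_log_likelihood(b0, b1, sigma2, y, x):
    """Normal log-likelihood of the simple regression at the given parameters."""
    n = len(y)
    ssr = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ssr / (2.0 * sigma2)


# Hypothetical toy data (made up for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1, sigma2_mle, s2_unbiased = fit_simple_regression_mle(y, x)
print(b0, b1, sigma2_mle, s2_unbiased)
```

The ML variance estimate is always smaller than the degrees-of-freedom-corrected one in a finite sample, and the log-likelihood evaluated at `sigma2_mle` is higher than at `s2_unbiased`, as expected for the maximizer.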

For a nonlinear regression model, the system of equations is usually nonlinear, and numerical optimization methods are needed to obtain the MLE. In addition, there is the possibility of heteroscedasticity. This is the case when the variance of the error term is not constant but depends on a set of variables, for instance, $\sigma^2(X, \gamma)$. In this instance, there is an additional set of parameters $\gamma$ to estimate. The system of equations will be nonlinear, and the solution will again be obtained by numerical optimization. In nonlinear models, the system of equations may have several solutions, for the likelihood function may exhibit a complicated profile with several local maxima. In this case, the researcher needs to make sure that the global maximum has been achieved, either by plotting the profile of the likelihood function (when possible) or by using different initial values to start the iterative procedures within the optimization routine.
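The problem of several local maxima can be illustrated with a deliberately simple one-parameter nonlinear model, $Y_i = \sin(\beta X_i) + \varepsilon_i$. This model, the data, and the grid are assumptions chosen for the sketch, not part of the discussion above. The code evaluates the log-likelihood on a grid of $\beta$ values with $\sigma^2$ concentrated out (replaced by its MLE for each $\beta$), which is the numerical analogue of plotting the profile.

```python
import math


def profile_log_likelihood(beta, y, x):
    """Log-likelihood of Y_i = sin(beta * X_i) + e_i with sigma^2 concentrated out."""
    n = len(y)
    ssr = sum((yi - math.sin(beta * xi)) ** 2 for yi, xi in zip(y, x))
    sigma2_hat = ssr / n  # MLE of sigma^2 for this value of beta
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sigma2_hat) + 1.0)


# Hypothetical data: true beta = 2 plus a small deterministic disturbance.
x = [0.1 * i for i in range(50)]
noise = [0.002 * (((7 * i) % 11) - 5) for i in range(50)]
y = [math.sin(2.0 * xi) + ei for xi, ei in zip(x, noise)]

# Evaluate the profile on a grid of candidate values, beta in (0, 10].
grid = [0.01 * j for j in range(1, 1001)]
values = [profile_log_likelihood(b, y, x) for b in grid]

# Interior grid points that beat both neighbours are local maxima.
local_maxima = [grid[j] for j in range(1, len(grid) - 1)
                if values[j] > values[j - 1] and values[j] > values[j + 1]]

# The global maximizer on the grid.
beta_hat = max(zip(values, grid))[1]

print(len(local_maxima), beta_hat)
```

An iterative optimizer started near one of the other entries in `local_maxima` would typically stop there, which is why inspecting the profile or trying several starting values matters. In practice the grid maximizer would serve as the starting value for a proper numerical optimization routine.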

The significance of the maximum likelihood estimator derives from its optimal asymptotic properties. In large samples and under correct model specification, the MLE $\hat{\theta}_{mle}$ is in most cases consistent, asymptotically efficient, and asymptotically normal. Among these properties, maximum efficiency (attaining the smallest asymptotic variance) is the most significant property of the ML estimator. Maximum efficiency is a very desirable property because it allows the construction of more powerful tests and narrower confidence intervals than those based on less efficient estimators.

Within the maximum likelihood framework, and after obtaining the ML estimators, we can also perform hypothesis testing by comparing the estimates with hypothesized fixed values. The likelihood ratio test assesses whether the data could plausibly have been generated by a different set of parameter values (those specified under the null hypothesis). The test compares the value of the likelihood function under the null hypothesis with that under the alternative. The test statistic is computed as two times the difference between the unrestricted and the restricted log-likelihood functions, and under the null hypothesis it is asymptotically distributed as a chi-square with as many degrees of freedom as the number of restrictions imposed by the null.
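A minimal sketch of the likelihood ratio test for a single restriction, $H_0: \beta_1 = 0$ in a regression with one regressor, is given below. The toy data and names are hypothetical. Because only one restriction is tested, the chi-square distribution has one degree of freedom, and its upper-tail probability can be written with the complementary error function as $\mathrm{erfc}(\sqrt{LR/2})$.

```python
import math


def concentrated_log_likelihood(ssr, n):
    """Normal log-likelihood evaluated at sigma^2 = SSR / n."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(ssr / n) + 1.0)


# Hypothetical toy data (made up for illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.3]
n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Unrestricted model: intercept and slope estimated freely.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
ssr_u = sum((yi - b0 - b1 * xi) ** 2 for yi, xi in zip(y, x))

# Restricted model under H0 (slope = 0): the MLE of the intercept is the sample mean.
ssr_r = sum((yi - y_bar) ** 2 for yi in y)

ll_u = concentrated_log_likelihood(ssr_u, n)
ll_r = concentrated_log_likelihood(ssr_r, n)

# Two times the difference of the log-likelihoods.
lr = 2.0 * (ll_u - ll_r)

# Upper-tail probability of a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr / 2.0))

print(lr, p_value)
```

With normal errors the statistic simplifies to $n \ln(SSR_r / SSR_u)$, so it can be computed from the two residual sums of squares alone. Note that the chi-square approximation is asymptotic; with a sample as small as this toy example it is only indicative.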

The maximum likelihood estimator is also known as a full information estimator because the estimation is based on the most comprehensive characterization of a random variable, which is the specification of its probability density function. In practice, it is not known what density function should be assumed, and a potential shortcoming of the maximum likelihood estimator is that one could write a likelihood function under a false probability density function. The consequences of this are severe, for the asymptotic properties will no longer hold. However, one can still apply quasi-maximum likelihood estimation (QMLE). The QML estimator requires that the conditional mean and conditional variance be correctly specified. It assumes that the error is normally distributed, though this may be a false assumption. The quasi-maximum likelihood estimator is still consistent, though the maximum efficiency property is lost. The efficiency loss depends on how far the normal density is from the true density. In practice, estimation is done within the quasi-maximum likelihood framework, because full knowledge of the density is rare. Other estimators can recover some of the efficiency loss, but these belong to the family of nonparametric and semiparametric estimators.

SEE ALSO Linear Regression; Models and Modeling; Probability Distributions; Properties of Estimators (Asymptotic and Exact); Regression; Regression Analysis

Gloria González-Rivera

International Encyclopedia of the Social Sciences