Maximum Likelihood Regression
Maximum likelihood is a methodology used to estimate the parameters of an econometric or statistical model. It was first proposed by Ronald Aylmer Fisher (1890–1962) and is now considered the workhorse of modern econometrics, not only because of its flexibility but also due to the availability of computer power, which has permitted the resolution of complicated numerical problems associated with this technique. Maximum likelihood estimation seeks to determine the parameters of a statistical process that have the highest probability of generating the observed sample of data.
Consider the following regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i$ for $i = 1, 2, \ldots, n$. In the simplest case, one can assume that the error term $\varepsilon_i$ is an independent and identically distributed (iid) normal random variable with mean zero and variance $\sigma^2$; that is, $\varepsilon_i \sim N(0, \sigma^2)$. It will be shown below that the assumption of a particular density function for $\varepsilon_i$ is essential for writing the likelihood function. Under the assumption of normality, the probability density function of the error term is written as

$$f(\varepsilon_i; \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$
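Since $\varepsilon_i = Y_i - \beta_0 - \beta_1 X_{1i} - \cdots - \beta_k X_{ki}$, the assumption of normality of the error term is equivalent to the assumption of normality for the conditional probability density function of $Y$ given $X$: $Y_i \mid X_{1i}, X_{2i}, \ldots, X_{ki} \sim N(\beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki}, \sigma^2)$; that is,

$$f(Y_i \mid X_{1i}, \ldots, X_{ki}; \beta_0, \beta_1, \ldots, \beta_k, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_i - \beta_0 - \beta_1 X_{1i} - \cdots - \beta_k X_{ki})^2}{2\sigma^2}\right).$$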
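The objective is to estimate the parameter vector $\theta \equiv (\beta_0, \beta_1, \beta_2, \ldots, \beta_k, \sigma^2)'$. For a sample of size $n$, and because of the iid assumption, the joint probability density function of $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ is the product of the marginal densities: $f(\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n; \theta) = f(\varepsilon_1; \theta) f(\varepsilon_2; \theta) \cdots f(\varepsilon_n; \theta)$. The likelihood function $L(\theta; \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)$ is based on the joint probability density function of $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$, where $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are taken as fixed data and the parameter vector $\theta$ is the argument of the function: $L(\theta; \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n) = f(\varepsilon_1; \theta) f(\varepsilon_2; \theta) \cdots f(\varepsilon_n; \theta)$. Note that $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are functions of the data $(Y_i, X_{1i}, X_{2i}, \ldots, X_{ki})$ through the regression model; that is to say, $\varepsilon_i = Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \cdots - \beta_k X_{ki}$.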
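Though the aforementioned regression model deals with a cross-sectional data set, the maximum likelihood principle also applies to time-series data and panel data. For instance, a time-series regression such as $Y_t = \Phi Y_{t-1} + \varepsilon_t$ for $t = 1, 2, \ldots, T$, with iid $\varepsilon_t \sim N(0, \sigma^2)$, implies that the conditional density function of $Y_t$ given $Y_{t-1}$ is also normal, $Y_t \mid Y_{t-1} \sim N(\Phi Y_{t-1}, \sigma^2)$; that is,

$$f(Y_t \mid Y_{t-1}; \Phi, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_t - \Phi Y_{t-1})^2}{2\sigma^2}\right).$$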
As before, the object of interest is to estimate $\theta \equiv (\Phi, \sigma^2)'$, and the likelihood function for a sample of size $T$ is $L(\theta; Y_1, Y_2, \ldots, Y_T) = f(Y_1; \theta) f(Y_2 \mid Y_1; \theta) \cdots f(Y_T \mid Y_{T-1}; \theta)$. This function requires knowledge of the marginal density of the first observation, $f(Y_1; \theta)$. When the sample size is very large, the contribution of the first observation is almost negligible for practical purposes. Conditioning on the first observation ($Y_1$ is known), we define the conditional likelihood function as $L(\theta; Y_T, Y_{T-1}, \ldots, Y_2 \mid Y_1) = f(Y_2 \mid Y_1; \theta) \cdots f(Y_T \mid Y_{T-1}; \theta)$.
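A minimal sketch of the conditional likelihood for this first-order autoregression follows, written in plain Python. The simulated series, the parameter values (0.6 and 1.0), and all function names are assumptions made for illustration. Under normality, the conditional likelihood is maximized in closed form by regressing $Y_t$ on $Y_{t-1}$ without a constant, which is what `conditional_mle` does.

```python
import math
import random


def simulate_ar1(phi, sigma, T, seed=0):
    """Simulate Y_t = phi * Y_{t-1} + eps_t with iid N(0, sigma^2) errors."""
    rng = random.Random(seed)
    y = [rng.gauss(0.0, sigma)]  # arbitrary starting value for Y_1
    for _ in range(T - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, sigma))
    return y


def conditional_loglik(phi, sigma2, y):
    """Log of f(Y_2 | Y_1) * ... * f(Y_T | Y_{T-1}) under normality."""
    m = len(y) - 1  # number of conditional densities in the product
    ssr = sum((y[t] - phi * y[t - 1]) ** 2 for t in range(1, len(y)))
    return -0.5 * m * math.log(2 * math.pi * sigma2) - ssr / (2 * sigma2)


def conditional_mle(y):
    """Closed-form maximizer of the conditional log-likelihood."""
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    phi_hat = num / den
    m = len(y) - 1
    ssr = sum((y[t] - phi_hat * y[t - 1]) ** 2 for t in range(1, len(y)))
    return phi_hat, ssr / m


# Hypothetical example: true phi = 0.6, sigma = 1.0, T = 2000.
y = simulate_ar1(phi=0.6, sigma=1.0, T=2000)
phi_hat, sigma2_hat = conditional_mle(y)
print(phi_hat, sigma2_hat)
```

The first observation enters only as a conditioning value, so the product runs over $T - 1$ terms rather than $T$.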
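Mathematically it is more convenient to work with the logarithm of the likelihood function. For the regression model above, the log-likelihood function is defined as:

$$\ln L(\theta) = \sum_{i=1}^{n} \ln f(Y_i \mid X_{1i}, \ldots, X_{ki}; \theta).$$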
The maximum likelihood estimator (MLE) of $\theta$ is the value of $\theta$ that maximizes the likelihood of observing the sample $(Y_i, X_{1i}, X_{2i}, \ldots, X_{ki})$, $i = 1, \ldots, n$. In estimating a regression model using maximum likelihood, the question is which value of $\theta \equiv (\beta_0, \beta_1, \beta_2, \ldots, \beta_k, \sigma^2)'$, out of all the possible values, makes the probability of occurrence of the observed data the largest. Since the log transformation is monotonic, the value of $\theta$ that maximizes the likelihood function is the same as the one that maximizes the log-likelihood function. The statistical inference problem is reduced to a mathematical problem, which under the assumption of normality, looks like:

$$\max_{\theta} \ln L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln \sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_{1i} - \cdots - \beta_k X_{ki}\right)^2.$$
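As a rough illustration, the normal log-likelihood of the linear regression model, $-\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_i \varepsilon_i^2$, can be coded directly and evaluated at candidate parameter values. The data set below is made up, and the function name and argument layout are assumptions chosen for this sketch.

```python
import math


def gaussian_loglik(beta, sigma2, X, y):
    """Log-likelihood of y_i = beta[0] + beta[1]*x_1i + ... + eps_i, eps_i ~ iid N(0, sigma2).

    X is a list of rows [x_1i, ..., x_ki] (no constant column); beta has k + 1 entries.
    """
    n = len(y)
    ssr = 0.0
    for row, yi in zip(X, y):
        fitted = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        ssr += (yi - fitted) ** 2
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * n * math.log(sigma2) - ssr / (2 * sigma2)


# Tiny made-up data set with one regressor, for illustration only.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

ll_good = gaussian_loglik([1.0, 2.0], 0.05, X, y)  # close to the pattern in the data
ll_bad = gaussian_loglik([0.0, 1.0], 0.05, X, y)   # far from it
print(ll_good, ll_bad)
```

Parameter values that track the data more closely give a larger log-likelihood, which is the quantity the estimator maximizes.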
Setting the first-order conditions (the score vector of first-order partial derivatives) equal to zero results in a system of $k + 2$ equations with $k + 2$ unknowns. The solution to this system is the maximum likelihood estimator. The solution is a maximum of the function if the Hessian (the matrix of second-order partial derivatives), evaluated at the solution, is negative definite.
For a linear regression model, the MLE is very easy to compute. First, we solve the system of equations corresponding to the parameter vector $(\beta_0, \beta_1, \ldots, \beta_k)'$. This system is linear, and its solution is identical to the ordinary least squares (OLS) estimator; that is, $\hat{\beta}_{mle} = \hat{\beta}_{ols} = (X'X)^{-1}X'Y$. Second, the maximum likelihood estimator of the variance is straightforward to compute once the $\hat{\beta}_i$'s are obtained. The MLE of $\sigma^2$ is the average of the squared residuals, $\hat{\sigma}^2_{mle} = \frac{1}{n}\sum_{i=1}^{n}\hat{\varepsilon}_i^2$, which is a biased estimator of the population variance. The estimator $\hat{\beta}_{mle}$ is identical to $\hat{\beta}_{ols}$ when the likelihood function is constructed under the assumption of normality.
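The two steps can be sketched for the special case of a single regressor, where the OLS formulas reduce to ratios of sums. The data are made up and the names are illustrative assumptions; with several regressors one would use the matrix formula instead.

```python
def ols_simple(x, y):
    """OLS (and, under normal errors, ML) estimates for y_i = b0 + b1 * x_i + eps_i."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss


# Made-up data, for illustration only.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]
b0, b1, rss = ols_simple(x, y)
n, n_coef = len(y), 2

sigma2_mle = rss / n                  # ML estimator: divides by n (biased)
sigma2_unbiased = rss / (n - n_coef)  # usual degrees-of-freedom correction
print(b0, b1, sigma2_mle, sigma2_unbiased)
```

The ML variance estimate is always smaller than the degrees-of-freedom-corrected one, and the gap vanishes as $n$ grows.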
For a nonlinear regression model, the system of equations is usually nonlinear, and numerical optimization methods are needed to obtain the MLE solution. In addition, there is the possibility of heteroscedasticity. This is the case when the variance of the error term is not constant but depends on a set of variables, for instance, $\sigma^2(X, \gamma)$. In this instance, there is an additional set of parameters $\gamma$ to estimate. The system of equations will be nonlinear, and the solution will again be obtained by numerical optimization. In nonlinear models, the system of equations may have several solutions, for the likelihood function may exhibit a complicated profile with several local maxima. In this case, the researcher needs to make sure that the global maximum has been achieved, either by plotting the profile of the likelihood function (when possible) or by using different initial values to start the iterative procedures within the optimization routine.
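The multiple-starting-values strategy can be sketched as follows. Everything here is an assumption made for illustration: the model $Y_i = \sin(\theta X_i) + \varepsilon_i$ is a hypothetical example whose likelihood has several local maxima, $\sigma^2$ is concentrated out of the Gaussian log-likelihood, and `hill_climb` is a deliberately crude stand-in for a proper optimization routine.

```python
import math
import random


def concentrated_loglik(theta, x, y):
    """Gaussian log-likelihood of y_i = sin(theta * x_i) + eps_i with sigma^2 concentrated out."""
    n = len(y)
    ssr = sum((yi - math.sin(theta * xi)) ** 2 for xi, yi in zip(x, y))
    return -0.5 * n * (math.log(2 * math.pi) + math.log(ssr / n) + 1.0)


def hill_climb(f, start, step=0.05, tol=1e-8):
    """Crude derivative-free local search: move uphill, halve the step when stuck."""
    theta, value = start, f(start)
    while step > tol:
        for candidate in (theta + step, theta - step):
            cand_value = f(candidate)
            if cand_value > value:
                theta, value = candidate, cand_value
                break
        else:
            step /= 2.0
    return theta, value


# Simulated data from the hypothetical model with true theta = 2.
rng = random.Random(0)
x = [10.0 * i / 99 for i in range(100)]
y = [math.sin(2.0 * xi) + rng.gauss(0.0, 0.3) for xi in x]

starts = [0.5 * j for j in range(1, 9)]  # 0.5, 1.0, ..., 4.0
results = [hill_climb(lambda t: concentrated_loglik(t, x, y), s) for s in starts]
theta_hat, best_value = max(results, key=lambda r: r[1])
print(theta_hat, best_value)
```

Each starting value leads to the nearest local maximum; keeping the run with the highest log-likelihood guards against reporting a merely local solution, although it cannot guarantee that the global maximum has been found.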
The significance of the maximum likelihood estimator derives from its optimal asymptotic properties. In large samples and under correct model specification, the MLE $\hat{\theta}_{mle}$ is in most cases consistent, asymptotically efficient, and asymptotically normal. Among these properties, asymptotic efficiency (attaining the smallest asymptotic variance) is the most significant property of the ML estimator. Efficiency is a very desirable property because it allows the construction of more powerful tests and narrower confidence intervals than those based on less efficient estimators.
Within the maximum likelihood framework and after obtaining the ML estimators, we can also perform hypothesis testing by comparing the value of the estimates with other fixed values. The likelihood ratio test assesses the likelihood that the data may have been generated by a different set of parameter values (those under the null hypothesis). The test compares the value of the likelihood function under the null hypothesis with that under the alternative. The test statistic is computed as two times the difference between the maximized log-likelihood under the alternative and that under the null, and it is asymptotically distributed as a chi-square with as many degrees of freedom as the number of restrictions imposed by the null hypothesis.
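A minimal sketch of the likelihood ratio test follows, for a regression with one regressor and the hypothetical null hypothesis $\beta_1 = 0$. The simulated data and function names are assumptions. Because the null imposes a single restriction, the statistic is compared with a chi-square distribution with one degree of freedom, whose upper-tail probability can be written with the complementary error function.

```python
import math
import random


def rss_unrestricted(x, y):
    """Residual sum of squares from regressing y on a constant and x."""
    n = len(y)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


def rss_restricted(y):
    """Residual sum of squares under H0: beta_1 = 0 (constant-only model)."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)


def max_loglik(rss, n):
    """Gaussian log-likelihood evaluated at the MLE, where sigma^2_hat = rss / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1.0)


# Simulated data with a nonzero slope (hypothetical values).
rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [1.0 + 0.5 * xi + rng.gauss(0.0, 1.0) for xi in x]
n = len(y)

lr_stat = 2.0 * (max_loglik(rss_unrestricted(x, y), n) - max_loglik(rss_restricted(y), n))
# One restriction, so compare with chi-square(1); its survival function is erfc(sqrt(q / 2)).
p_value = math.erfc(math.sqrt(lr_stat / 2.0))
print(lr_stat, p_value)
```

In this Gaussian setting the statistic simplifies to $n \ln(RSS_0 / RSS_1)$, where $RSS_0$ and $RSS_1$ are the residual sums of squares under the null and the alternative.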
The maximum likelihood estimator is also known as a full information estimator because the estimation is based on the most comprehensive characterization of a random variable, which is the specification of its probability density function. In practice, it is not known what density function should be assumed, and a potential shortcoming of the maximum likelihood estimator is that one could write a likelihood function under a false probability density function. The consequences of this are severe, for the asymptotic properties will no longer hold. However, one can still apply quasi-maximum likelihood estimation (QMLE). The QML estimator requires that the conditional mean and conditional variance be correctly specified. It assumes that the error is normally distributed, though this may be a false assumption. The quasi-maximum likelihood estimator is still consistent, though the maximum efficiency property is lost. The efficiency loss depends on how far the normal density is from the true density. In practice, estimation is done within the quasi-maximum likelihood framework, because full knowledge of the density is rare. Other estimators can recover some of the efficiency loss, but these belong to the family of nonparametric and semiparametric estimators.
SEE ALSO Linear Regression; Models and Modeling; Probability Distributions; Properties of Estimators (Asymptotic and Exact); Regression; Regression Analysis
González-Rivera, Gloria, and Feike C. Drost. 1999. Efficiency Comparisons of Maximum-Likelihood-Based Estimators in GARCH Models. Journal of Econometrics 93 (1): 93–111.
White, Halbert L. 1994. Estimation, Inference, and Specification Analysis. Cambridge, UK: Cambridge University Press.