## Maximum Likelihood Regression

*Maximum likelihood* is a methodology used to estimate the parameters of an econometric or statistical model. It was first proposed by Ronald Aylmer Fisher (1890–1962) and is now considered the workhorse of modern econometrics, not only because of its flexibility but also due to the availability of computer power, which has permitted the resolution of complicated numerical problems associated with this technique. *Maximum likelihood estimation* seeks to determine the parameters of a statistical process that have the highest probability of generating the observed sample of data.

Consider the following regression model: $Y_i = \beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki} + \varepsilon_i$ for $i = 1, 2, \dots, n$. In the simplest case, one can assume that the error term $\varepsilon_i$ is an independent and identically distributed (iid) normal random variable with variance $\sigma^2$; that is, $\varepsilon_i \sim N(0, \sigma^2)$. It will be shown below that the assumption of a particular density function for $\varepsilon_i$ is paramount to write the likelihood function. Under the assumption of normality, the probability density function of the error term is written as

$$f(\varepsilon_i; \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$

Since $\varepsilon_i = Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki}$, the assumption of normality of the error term is equivalent to the assumption of normality for the conditional probability density function of $Y_i$ given $X$: $f(Y_i \mid X_{1i}, X_{2i}, \dots, X_{ki}; \beta_0, \beta_1, \dots, \beta_k, \sigma^2) \sim N(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}, \sigma^2)$; that is,

$$f(Y_i \mid X_{1i}, \dots, X_{ki}; \beta_0, \dots, \beta_k, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki})^2}{2\sigma^2}\right).$$

The objective is to estimate the parameter vector $\theta \equiv (\beta_0, \beta_1, \beta_2, \dots, \beta_k, \sigma^2)'$. For a sample of size $n$, and because of the iid assumption, the joint probability density function of $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ is then the product of the marginal densities: $f(\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n; \theta) = f(\varepsilon_1; \theta) f(\varepsilon_2; \theta) \cdots f(\varepsilon_n; \theta)$. The *likelihood function* $L(\theta; \varepsilon_1, \varepsilon_2, \dots, \varepsilon_n)$ is based on the joint probability density function of $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$, where $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are taken as fixed data and the parameter vector $\theta$ is the argument of the function: $L(\theta; \varepsilon_1, \varepsilon_2, \dots, \varepsilon_n) = f(\varepsilon_1; \theta) f(\varepsilon_2; \theta) \cdots f(\varepsilon_n; \theta)$. Note that $\varepsilon_1, \varepsilon_2, \dots, \varepsilon_n$ are functions of the data $(Y_i, X_{1i}, X_{2i}, \dots, X_{ki})$ through the regression model; that is to say, $\varepsilon_i = Y_i - \beta_0 - \beta_1 X_{1i} - \beta_2 X_{2i} - \dots - \beta_k X_{ki}$.
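The construction of the likelihood as a product of normal densities can be sketched in a few lines of Python. This is only an illustration under the normality assumption: the function names (`normal_pdf`, `residuals`, `likelihood`) and the toy sample are hypothetical and do not come from the original entry.

```python
import math

def normal_pdf(eps, sigma2):
    """Density of a N(0, sigma2) variable evaluated at eps."""
    return math.exp(-eps ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def residuals(beta, y, X):
    """eps_i = Y_i - beta_0 - beta_1*X_1i - ... - beta_k*X_ki.

    beta = [beta_0, beta_1, ..., beta_k]; each row of X holds (X_1i, ..., X_ki).
    """
    return [yi - beta[0] - sum(b * x for b, x in zip(beta[1:], row))
            for yi, row in zip(y, X)]

def likelihood(beta, sigma2, y, X):
    """L(theta) = f(eps_1; theta) * ... * f(eps_n; theta) under iid normal errors."""
    value = 1.0
    for eps in residuals(beta, y, X):
        value *= normal_pdf(eps, sigma2)
    return value

# Hypothetical toy sample with a single regressor (k = 1).
y = [1.1, 1.9, 3.2, 3.9]
X = [[0.0], [1.0], [2.0], [3.0]]

close_fit = likelihood([1.0, 1.0], 0.05, y, X)  # line that passes near the data
poor_fit = likelihood([0.0, 2.0], 0.05, y, X)   # line that misses the data
print(close_fit > poor_fit)
```

The data are held fixed and only the parameter values change between the two calls, which is the sense in which the parameter vector is the argument of the likelihood function. The parameter values that track the data more closely receive the larger likelihood.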

Though the aforementioned regression model deals with a cross-sectional data set, the maximum likelihood principle also applies to time-series data and panel data. For instance, a time-series regression such as $Y_t = \phi Y_{t-1} + \varepsilon_t$ for $t = 1, 2, \dots, T$, with iid $\varepsilon_t \sim N(0, \sigma^2)$, implies that the conditional density function of $Y_t$ given $Y_{t-1}$ is also normal, $f(Y_t \mid Y_{t-1}) \sim N(\phi Y_{t-1}, \sigma^2)$; that is,

$$f(Y_t \mid Y_{t-1}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(Y_t - \phi Y_{t-1})^2}{2\sigma^2}\right).$$

As before, the object of interest is to estimate $\theta \equiv (\phi, \sigma^2)'$, and the likelihood function for a sample of size $T$ is $L(\theta; Y_1, Y_2, \dots, Y_T) = f(Y_1; \theta) f(Y_2 \mid Y_1; \theta) \cdots f(Y_T \mid Y_{T-1}; \theta)$. This function requires knowledge of the marginal density of the first observation, $f(Y_1; \theta)$. When the sample size is very large, the contribution of the first observation is almost negligible for any practical purposes. Conditioning on the first observation ($Y_1$ is known), we define the *conditional likelihood function* as $L(\theta; Y_T, Y_{T-1}, \dots, Y_2 \mid Y_1) = f(Y_2 \mid Y_1; \theta) \cdots f(Y_T \mid Y_{T-1}; \theta)$.

Mathematically it is more convenient to work with the logarithm of the likelihood function, which turns the product of densities into a sum. For the cross-sectional regression model, the *log-likelihood function* is defined as

$$\ln L(\theta; \varepsilon_1, \dots, \varepsilon_n) = \sum_{i=1}^{n} \ln f(\varepsilon_i; \theta).$$
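For the time-series example, the logarithm of the conditional likelihood can be evaluated directly. The sketch below is a minimal illustration, assuming normal errors and treating the first observation as known. The function name `ar1_conditional_loglik`, the short series, and the choice to hold the variance fixed during the grid search are assumptions made for this example.

```python
import math

def ar1_conditional_loglik(phi, sigma2, y):
    """Log of f(Y_2|Y_1) * ... * f(Y_T|Y_{T-1}) for Y_t = phi*Y_{t-1} + eps_t,
    with eps_t iid N(0, sigma2). The first observation is conditioned on."""
    total = 0.0
    for prev, curr in zip(y[:-1], y[1:]):
        eps = curr - phi * prev
        total += -0.5 * math.log(2.0 * math.pi * sigma2) - eps ** 2 / (2.0 * sigma2)
    return total

# Hypothetical series in which each value is roughly half the previous one.
y = [2.0, 1.1, 0.5, 0.3, 0.1]

# Coarse grid search over phi, holding sigma2 fixed (an assumption for simplicity).
grid = [i / 10.0 for i in range(-9, 10)]
best_phi = max(grid, key=lambda phi: ar1_conditional_loglik(phi, 0.01, y))
print(best_phi)
```

On this series the grid search selects 0.5, the grid point closest to the value of about 0.53 that maximizes the conditional log-likelihood exactly. A finer grid or a numerical optimizer would be used in practice.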

The *maximum likelihood estimator* (MLE) of $\theta$ is the value of $\theta$ that maximizes the likelihood of observing the sample $(Y_i, X_{1i}, X_{2i}, \dots, X_{ki})$, $i = 1, \dots, n$. In estimating a regression model using maximum likelihood, the question is which value of $\theta \equiv (\beta_0, \beta_1, \beta_2, \dots, \beta_k, \sigma^2)'$, out of all the possible values, makes the probability of occurrence of the observed data the largest. Since the log transformation is monotonic, the value of $\theta$ that maximizes the likelihood function is the same as the one that maximizes the log-likelihood function. The statistical inference problem is reduced to a mathematical problem, which under the assumption of normality looks like:

$$\max_{\theta} \ln L(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(Y_i - \beta_0 - \beta_1 X_{1i} - \dots - \beta_k X_{ki}\right)^2.$$

Setting the score vector (the vector of first-order partial derivatives of the log-likelihood) equal to zero gives the first-order conditions, a system of *k* + 2 equations in *k* + 2 unknowns. The solution to this system is the maximum likelihood estimator. The solution is a maximum of the function if the Hessian (the matrix of second-order partial derivatives) is negative definite at that point.
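As a small numerical check of the first-order conditions, the sketch below writes out the score for the normal linear model with a single regressor (so *k* + 2 = 3 equations) and evaluates it at the closed-form solution, namely the least-squares coefficients together with the mean squared residual. The function name `score` and the toy data are hypothetical and serve only as an illustration.

```python
def score(beta0, beta1, sigma2, y, x):
    """First-order partial derivatives of the Gaussian log-likelihood
    for Y_i = beta0 + beta1*X_i + eps_i, eps_i iid N(0, sigma2)."""
    n = len(y)
    eps = [yi - beta0 - beta1 * xi for yi, xi in zip(y, x)]
    d_beta0 = sum(eps) / sigma2
    d_beta1 = sum(xi * e for xi, e in zip(x, eps)) / sigma2
    d_sigma2 = -n / (2.0 * sigma2) + sum(e ** 2 for e in eps) / (2.0 * sigma2 ** 2)
    return [d_beta0, d_beta1, d_sigma2]

# Hypothetical toy sample.
y = [1.1, 1.9, 3.2, 3.9]
x = [0.0, 1.0, 2.0, 3.0]
n = len(y)

# Closed-form solution of the three equations.
x_bar = sum(x) / n
y_bar = sum(y) / n
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar
sigma2_hat = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y)) / n

print(score(beta0_hat, beta1_hat, sigma2_hat, y, x))
```

All three entries of the printed score are zero up to floating-point rounding, whereas evaluating `score` at any other parameter values generally gives nonzero entries.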

For a linear regression model, the MLE is very easy to compute. First, we compute the solution for the system of equations corresponding to the parameter vector $(\beta_0, \beta_1, \dots, \beta_k)'$. This system is linear, and its solution is identical to the ordinary least squares (OLS) estimator; that is, $\hat{\beta}_{mle} = \hat{\beta}_{ols} = (X'X)^{-1}X'Y$. Second, the maximum likelihood estimator of the variance is straightforward to compute once the MLE $\hat{\beta}_i$'s are obtained. The MLE $\hat{\sigma}^2$ corresponds to the sample variance of the residuals (the sum of squared residuals divided by $n$), which is a biased estimator of the population variance. The $\hat{\beta}_{mle}$ is identical to the $\hat{\beta}_{ols}$ when the likelihood function is constructed under the assumption of normality.

For a nonlinear regression model, the system of equations is usually nonlinear, and numerical optimization methods are needed to obtain the MLE solution. In addition, there is the possibility of heteroscedasticity. This is the case when the variance of the error term is not constant but depends on a set of variables, for instance, $\sigma^2(X, \gamma)$. In this instance, there is an additional set of parameters $\gamma$ to estimate. The system of equations will be nonlinear and the solution will again be obtained by numerical optimization. In nonlinear models, the system of equations may have several solutions, for the likelihood function may exhibit a complicated profile with several local maxima. In this case, the researcher needs to make sure that the global maximum has been achieved by either plotting the profile of the likelihood function (when possible), or by using different initial values to start the iterative procedures within the optimization routine.
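The advice to try several initial values can be illustrated with a deliberately simple sketch. Everything here is an assumption made for the example: the nonlinear model $Y_i = \sin(bX_i) + \varepsilon_i$, the artificial data, the known variance, and the crude `hill_climb` routine, which stands in for whatever optimization routine is actually used.

```python
import math

def loglik(b, sigma2, y, x):
    """Gaussian log-likelihood for the hypothetical model y_i = sin(b * x_i) + eps_i."""
    n = len(y)
    ssr = sum((yi - math.sin(b * xi)) ** 2 for yi, xi in zip(y, x))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - ssr / (2.0 * sigma2)

def hill_climb(f, start, step=0.1, tol=1e-8, max_iter=10000):
    """Crude derivative-free local search: move uphill, halve the step when stuck."""
    b = start
    for _ in range(max_iter):
        if step < tol:
            break
        if f(b + step) > f(b):
            b += step
        elif f(b - step) > f(b):
            b -= step
        else:
            step /= 2.0
    return b

# Artificial data generated with b = 2 plus a small deterministic disturbance.
x = [0.1 * i for i in range(1, 41)]
y = [math.sin(2.0 * xi) + 0.1 * math.cos(7.0 * i) for i, xi in enumerate(x, start=1)]
sigma2 = 0.01  # treated as known so that the search is one-dimensional

def objective(b):
    return loglik(b, sigma2, y, x)

starts = [0.5 * j for j in range(1, 11)]  # 0.5, 1.0, ..., 5.0
local_maxima = [hill_climb(objective, s) for s in starts]
best = max(local_maxima, key=objective)
print(sorted(set(round(b, 2) for b in local_maxima)))
print(round(best, 2))
```

Each starting value leads the local search to some stationary point of the log-likelihood, and these points may differ across starting values. Keeping the candidate with the highest log-likelihood is a practical safeguard, not a guarantee, that the global maximum has been found.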

The significance of the maximum likelihood estimator derives from its optimal asymptotic properties. In large samples and under correct model specification, the MLE $\hat{\theta}_{mle}$ is in most cases consistent, asymptotically efficient, and asymptotically normal. Among these properties, *maximum efficiency* (attaining the smallest asymptotic variance) is the most significant property of the ML estimator. Maximum efficiency is a very desirable property because it allows the construction of more powerful tests and narrower confidence intervals than those based on less efficient estimators.

Within the maximum likelihood framework and after obtaining the ML estimators, we can also perform hypothesis testing by comparing the value of the estimates with other fixed values. The *likelihood ratio test* assesses whether the data may have been generated by a different set of parameter values (those under the null hypothesis). The test compares the maximized value of the likelihood function under the null hypothesis with that under the alternative. The test statistic is computed as two times the difference between the unrestricted and the restricted log-likelihood functions, and it is asymptotically distributed as a chi-square with as many degrees of freedom as the number of restrictions imposed under the null.
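A likelihood ratio test can be sketched for the simplest case: a linear regression with one regressor and the null hypothesis that its slope is zero. The data and the helper name `max_loglik` are hypothetical. The sketch relies on two standard facts: the maximized Gaussian log-likelihood of a linear regression depends on the data only through the sum of squared residuals, and for one degree of freedom the chi-square tail probability equals `erfc(sqrt(x / 2))`.

```python
import math

def max_loglik(ssr, n):
    """Maximized Gaussian log-likelihood of a linear regression,
    with sigma^2 replaced by its MLE, ssr / n."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(ssr / n) + 1.0)

# Hypothetical toy sample.
y = [1.1, 1.9, 3.2, 3.9, 5.2, 5.8]
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Unrestricted model: intercept and slope.
beta1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
beta0_hat = y_bar - beta1_hat * x_bar
ssr_u = sum((yi - beta0_hat - beta1_hat * xi) ** 2 for xi, yi in zip(x, y))

# Restricted model under H0: slope = 0, so the fitted value is the sample mean.
ssr_r = sum((yi - y_bar) ** 2 for yi in y)

# Two times the difference of the maximized log-likelihoods.
lr = 2.0 * (max_loglik(ssr_u, n) - max_loglik(ssr_r, n))

# One restriction, so compare with a chi-square with 1 degree of freedom.
p_value = math.erfc(math.sqrt(lr / 2.0))
print(round(lr, 2), p_value < 0.05)
```

With these data the statistic is far above 3.841, the 5 percent critical value of a chi-square with one degree of freedom, so the null hypothesis of a zero slope is rejected. With only six observations the asymptotic approximation is rough, and the sample is kept small purely for readability.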

The maximum likelihood estimator is also known as a *full information estimator* because the estimation is based on the most comprehensive characterization of a random variable, which is the specification of its probability density function. In practice, it is not known what density function should be assumed, and a potential shortcoming of the maximum likelihood estimator is that one could write a likelihood function under a false probability density function. The consequences of this are severe, for the asymptotic properties will no longer hold. However, one can still apply *quasi-maximum likelihood estimation* (QMLE). The QML estimator requires that the conditional mean and conditional variance are correctly specified. It assumes that the error is normally distributed, though this may be a false assumption. The quasi-maximum likelihood estimator is still consistent, though the maximum efficiency property is lost. The efficiency loss depends on how far the normal density is from the true density. In practice, estimation is done within the quasi-maximum likelihood framework, because full knowledge of the density is rare. Other estimators can recover some of the efficiency loss, but these belong to the family of nonparametric and semiparametric estimators.

**SEE ALSO** *Linear Regression; Models and Modeling; Probability Distributions; Properties of Estimators (Asymptotic and Exact); Regression; Regression Analysis*

*Gloria González-Rivera*