# Simultaneous Equation Estimation

The distinction between partial and general equilibrium analysis in economic theory is well grounded [*See* ECONOMIC EQUILIBRIUM]. Early work in econometrics paid inadequate attention to this distinction and overlooked for many years the possibilities of improving statistical estimates of individual economic relationships by embedding them in models of the economy as a whole [*see* ECONOMETRIC MODELS, AGGREGATE]. The earliest studies in econometrics were concerned with estimating parameters of demand functions, supply functions, production functions, cost functions, and similar tools of economic analysis. The principal statistical procedure used was to estimate the α’s in the relation

$$y_t = \alpha_1 x_{1t} + \alpha_2 x_{2t} + \cdots + \alpha_n x_{nt} + u_t, \qquad t = 1, \cdots, T,$$

using the criterion that $\sum_{t=1}^{T} u_t^2$ be minimized. This is the principle of “least squares” applied to a single equation in which *y_{t}* is chosen as the dependent variable and x_{1t}, · · ·, x_{nt} are chosen as the independent variables. The criterion is the minimization of the sum of squared “disturbances” (*u_{t}*), which are assumed to be unobserved random errors. The estimation of the unknown parameters α_{i} is based on the sample of *T* observations of *y_{t}* and x_{1t}, · · ·, x_{nt}. This is the usual statistical model and estimation procedure that is used in controlled experimental situations where the set of independent variables consists of selected, fixed variates for the experimental readings on *y_{t}*, the dependent variable. [*See* LINEAR HYPOTHESES, article on REGRESSION].

However, economics, like most other social sciences, is largely a nonexperimental science, and it is generally not possible to control the values of x_{1t}, · · ·, x_{nt}. The values of the independent variables, like those of the dependent variable, are produced from the general outcome of economic life, and the econometrician is faced with the problem of making statistical inferences from nonexperimental data. This is the basic reason for the use of simultaneous equation methods of estimation in econometrics. In some situations x_{1t}, · · ·, x_{nt} may not be controlled variates, but they may have a one-way causal influence on *y_{t}*. The main point is that least squares yields desirable results only if u_{t} is independent of x_{1t}, · · ·, x_{nt}, that is, if *E(u_{t}x_{it})* = 0 for all *i* and *t*.

**Properties of estimators.** If the *x_{it}* are fixed variates, estimators of α_{i} obtained by minimizing $\sum_{t} u_t^2$ are *best linear unbiased* estimators. They are linear estimators because, as shown below, they are linear functions of *y_{t}*. An estimator, α̂_{i}, of α_{i} is called unbiased if

$$E(\hat{\alpha}_i) = \alpha_i,$$

i.e., if the expected value of the estimator equals the true value. An estimator is best if among all unbiased estimators it has the least variance, i.e., if

$$\operatorname{var}(\hat{\alpha}_i) \leq \operatorname{var}(\tilde{\alpha}_i),$$

where α̃_{i} is any other unbiased estimator. Clearly, the properties of being unbiased and best are desirable ones. These properties are defined without reference to sample size. Two related but weaker properties, which are defined for large samples, are *consistency* and *efficiency.*

An estimator, α̂_{i}, is consistent if plim α̂_{i} = α_{i}, that is, if

$$\lim_{T \to \infty} P\big(\,|\hat{\alpha}_i - \alpha_i| < \varepsilon\,\big) = 1.$$

This states that the probability that α̂_{i} deviates from α_{i} by an amount less than any arbitrarily small ε tends to unity as the sample size *T* tends to infinity.

Consider now the class of all consistent estimators that are normally distributed as *T* → ∞. An *efficient estimator* of α_{i} is a consistent estimator whose asymptotic normal distribution has a smaller variance than any other member of this class. [*See* ESTIMATION].

**Inconsistency of least squares.** The choice of estimators, α̂_{i}, such that $\sum_{t} \hat{u}_t^2$ is minimized is formally equivalent to the empirical implementation of the condition that *E(u_{t}x_{it})* = 0, since the first-order condition for a minimum is

$$\sum_{t=1}^{T} \hat{u}_t x_{it} = 0, \qquad i = 1, \cdots, n.$$

On the one hand, the α_{i} are estimated so as to minimize the residual sum of squares. On the other hand, they are estimated so that the residuals are uncorrelated with x_{1t}, · · ·, x_{nt}. The possible inconsistency of this method is clearly revealed by the latter criterion, for if it is assumed that the *u_{t}* are independent of x_{1t}, · · ·, x_{nt} when they actually are not, the estimators will be inconsistent. This is shown by the formula

$$\hat{\alpha}_i = \alpha_i + \frac{\sum_{j=1}^{n} m_{uj} M_{ji}}{|M|},$$

where *M* is the moment matrix whose typical element is Σ_{t}x_{it}x_{jt}; |*M*| is the determinant of *M*; *m_{uj}* is Σ_{t}u_{t}x_{jt}; and *M_{ji}* is the *j, i* cofactor of *M*. The inconsistency in the estimator is due to the nonvanishing probability limit of *m_{uj}*. In a nonexperimental sample of data, such as that observed as the joint outcome of the uncontrolled simultaneous economic process, we would expect many or all the *x_{it}* in a problem to be dependent on *u_{t}*.

_{t}**Identifying restrictions** . Since economic models consist of a set of simultaneous equations generating nonexperimental data, the equations of the model must be *identified* prior to statistical estimation. Unless some restrictions are imposed on specific relationships in a linear system of simultaneous equations, every equation may look alike to the statistician faced with the job of estimating the unknown coefficients. The economist must place a priori restrictions, in advance of statistical estimation, on each of the equations in order to identify them. These restrictions may specify that certain coefficients are known in advance-especially that they are zero, for this is equivalent to excluding the associated variable from an economic relation. Other restrictions may specify linear relationships between the different coefficients. Consider the generalization of a single equation,

where *E(z _{it} u_{t})* = 0 for all

*i*and

*t*, to a whole system,

where *E(z _{jft}u_{it})* = 0 for all

*k, i,*and

*t*. Every variable enters every equation linearly without restriction, and the statistician has no way of distinguishing one relation from another. Zero restrictions, if imposed, would have the form

*β*or

_{rs}= 0*γ*, for some

_{pq}= 0*r, s, p,*or

*q*. In many equations, we may be interested in specifying that sums or differences of variables are economically relevant combinations, i.e., that

*β*or that

_{rs}= β_{ru}*β*or, more generally, that

_{rv}= ─ γ_{rw}The last restriction implies that a homogeneous linear combination of parameters in the rth equation is specified to hold on a priori grounds. The weights w, and v, are known in advance.

If general linear restrictions are imposed on the equations of a linear system, we may state the following rule: an equation in a linear system is identified if it is not possible to reproduce by linear combination of some or all of the equations in the system an equation having the same statistical form as the equation being estimated.

If the restrictions are of the *zero* type, a necessary condition for identification of an equation in a linear system of *n* equations is that the number of variables excluded from that equation be greater than or equal to n – 1. A necessary and sufficient condition is that it is possible to form at least one nonvanishing determinant of order n – 1 out of those coefficients, properly arranged, with which the excluded variables appear in the n – 1 other equations (Koopmans et al. 1950).

Criteria for identifiability are stated here for linear equation systems. A more general treatment in nonlinear systems is given by Fisher (1966). [*See* STATISTICAL IDENTIFIABILITY].
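These order and rank conditions can be checked mechanically once the coefficient matrix of the system is written down. The following sketch tests each equation of a hypothetical three-equation system (all coefficient values illustrative, not from the text) under zero-type restrictions:

```python
import numpy as np

# Hypothetical 3-equation linear system: rows are equations, columns are
# coefficients on (y1, y2, y3, z1, z2, z3).  Zeros mark excluded variables.
A = np.array([
    [1.0, -0.5,  0.0, 2.0, 0.0,  0.0],   # eq. 1 excludes y3, z2, z3
    [0.3,  1.0, -0.2, 0.0, 1.5,  0.0],   # eq. 2 excludes z1, z3
    [0.0,  0.4,  1.0, 0.0, 0.0, -1.0],   # eq. 3 excludes y1, z1, z2
])
n = A.shape[0]  # number of equations

def identified(A, i):
    """Check identification of equation i under zero restrictions."""
    excluded = np.isclose(A[i], 0.0)        # variables absent from eq. i
    order = excluded.sum() >= n - 1         # necessary (order) condition
    # Coefficients of the excluded variables in the n - 1 other equations.
    other = np.delete(A, i, axis=0)[:, excluded]
    rank = np.linalg.matrix_rank(other) == n - 1   # rank condition
    return order and rank

for i in range(n):
    print(f"equation {i + 1} identified: {identified(A, i)}")
```

Here each equation excludes at least n – 1 = 2 variables, and the excluded-variable coefficients in the other equations form a matrix of rank n – 1, so all three equations are identified.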

## Alternative estimation methods

Assuming that we are dealing with an identified system, let us turn to the problems of estimation. In the system of equations above, the y_{it} are *endogenous or dependent* variables and are equal in number to the number of equations in the system, n. The z_{kt} are exogenous variables and are assumed to be independent of the disturbances, u_{it}.

In one of the basic early papers in simultaneous equation estimation (Mann & Wald 1943), it was shown that large-sample theory would, under fairly general conditions, permit lagged values of endogenous variables to be treated like purely exogenous variables as far as consistency in estimation is concerned. Exogenous and lagged endogenous variables are called *predetermined* variables.

Early econometric studies, for example, that of Tinbergen (1939), were concerned with the estimation of a number of individual relationships in which the possible dependence between variables and disturbances was ignored. These studies stimulated Haavelmo (1943) to analyze the consistency problem, for he noted that the Tinbergen model contained many single-equation least squares estimates of equations that were interrelated in the system Tinbergen was constructing, which was intended to be a theoretical framework describing the economy that generated the observations used.

The lack of independence between disturbances and variables can readily be demonstrated. Consider the two-equation system

$$\beta_{11} y_{1t} + \beta_{12} y_{2t} + \gamma_{11} z_{1t} + \cdots + \gamma_{1m} z_{mt} = u_{1t},$$
$$\beta_{21} y_{1t} + \beta_{22} y_{2t} + \gamma_{21} z_{1t} + \cdots + \gamma_{2m} z_{mt} = u_{2t}.$$

The z_{kt} are by assumption independent of u_{1t} and u_{2t}. Some of the γ’s are specified to be zero or are otherwise restricted so that the two equations are identified. Suppose we wish to estimate the first equation. To apply least squares to this equation, we would have to select either y_{1} or y_{2} as the dependent variable. Suppose we select y_{1} and set β_{11} equal to unity. We would then compute the least squares regression of y_{1} on y_{2} and the z_{k} according to the relation

$$y_{1t} = -\beta_{12} y_{2t} - \sum_{k} \gamma_{1k} z_{kt} + u_{1t},$$

which incorporates all the identifying restrictions on the γ’s.

For this procedure to yield consistent estimators, y_{2t} must be independent of u_{1t}. The question is whether the existence of the second equation has any bearing on the independence of y_{2t} and u_{1t}. Multiplying the second equation by u_{1t} and forming expectations, we have

$$\beta_{21} E(y_{1t} u_{1t}) + \beta_{22} E(y_{2t} u_{1t}) = E(u_{2t} u_{1t}).$$

From the first equation (with β_{11} = 1), we have

$$E(y_{1t} u_{1t}) = -\beta_{12} E(y_{2t} u_{1t}) + E(u_{1t}^2).$$

Combining these two expressions, we obtain

$$E(y_{2t} u_{1t}) = \frac{E(u_{1t} u_{2t}) - \beta_{21} E(u_{1t}^2)}{\beta_{22} - \beta_{21}\beta_{12}}.$$

In general, this expression does not vanish, and we find that y_{2t} and u_{1t} are not independent.
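The nonvanishing of this expectation is easy to verify numerically. The sketch below simulates the two-equation system with hypothetical parameter values (β_{11} = β_{22} = 1 by scale normalization, one exogenous variable per equation) and compares the sample moment of y_{2t}u_{1t} with the expression just derived:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200_000

# Hypothetical structural parameters (beta_11 = beta_22 = 1 by normalization).
b12, b21 = 0.5, -0.4
g11, g22 = 1.0, 2.0
s11, s22, s12 = 1.0, 1.5, 0.3   # var(u1), var(u2), cov(u1, u2)

# Draw exogenous variables and correlated disturbances.
z1 = rng.normal(size=T)
z2 = rng.normal(size=T)
u = rng.multivariate_normal([0, 0], [[s11, s12], [s12, s22]], size=T)
u1, u2 = u[:, 0], u[:, 1]

# Solve the structural system  B y = u - Gamma z  for (y1, y2) at each t:
#   y1 + b12*y2 + g11*z1 = u1
#   b21*y1 + y2 + g22*z2 = u2
B = np.array([[1.0, b12], [b21, 1.0]])
rhs = np.column_stack([u1 - g11 * z1, u2 - g22 * z2])
y = rhs @ np.linalg.inv(B).T
y2 = y[:, 1]

sample_cov = np.mean(y2 * u1)
theory = (s12 - b21 * s11) / (1.0 - b21 * b12)
print(sample_cov, theory)   # close to each other, and both nonzero
```

With these illustrative values the theoretical moment is 0.7/1.2 ≈ 0.58, and the simulated moment agrees closely: y_{2t} is plainly not independent of u_{1t}.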

**The maximum likelihood method.** The maximum likelihood method plays a normative role in the estimation of economic relationships, much like that played by perfect competition in economic theory. This method provides consistent and efficient estimators under fairly general conditions. It rests on specific assumptions, and it may be hard to realize all these assumptions in practice or, indeed, to make all the difficult calculations required for solution of the estimation equations.

For the single-equation model, the maximum likelihood method is immediately seen to be equivalent to ordinary least squares estimation for normally distributed disturbances. Let us suppose that u_{1}, · · ·, u_{T} are *T* independent, normally distributed variables. The *T*-element sample has the probability density function

$$p(u_1, \cdots, u_T) = \prod_{t=1}^{T} \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left(-\frac{u_t^2}{2\sigma^2}\right).$$

By substitution we can transform this joint density of u_{1}, · · ·, u_{T} into a joint density of y_{1}, · · ·, y_{T}, given x_{11}, · · ·, x_{n1}, x_{12}, · · ·, x_{n2}, · · ·, x_{nT}, namely,

$$L = \prod_{t=1}^{T} \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left[-\frac{1}{2\sigma^2}\Big(y_t - \sum_{i=1}^{n} \alpha_i x_{it}\Big)^{2}\right].$$

This function will be denoted as L, the likelihood function of the sample, and is seen to depend on the unknown parameters α_{1}, · · ·, α_{n} and σ. We maximize this function by imposing the following conditions:

$$\frac{\partial \log L}{\partial \alpha_i} = 0, \qquad i = 1, \cdots, n; \qquad \frac{\partial \log L}{\partial \sigma} = 0,$$

which reduce to

$$\sum_{t=1}^{T}\Big(y_t - \sum_{j=1}^{n} \hat{\alpha}_j x_{jt}\Big)x_{it} = 0, \qquad \hat{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T}\Big(y_t - \sum_{j=1}^{n} \hat{\alpha}_j x_{jt}\Big)^{2}.$$

These are recognized as the “normal” equations of single-equation least squares theory and the estimation equation for the residual variance, apart from adjustment for degrees of freedom used in estimating σ².
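In matrix terms the normal equations read (X′X)α̂ = X′y, with σ̂² the mean squared residual. A minimal sketch with simulated data and hypothetical parameter values:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 500, 3

# Fixed regressors and a hypothetical true parameter vector.
X = rng.normal(size=(T, n))
alpha = np.array([2.0, -1.0, 0.5])
y = X @ alpha + rng.normal(scale=0.1, size=T)

# Solve the "normal" equations  (X'X) a = X'y  of least squares theory.
a_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood estimator of the residual variance (no d.f. adjustment).
sigma2_hat = np.mean((y - X @ a_hat) ** 2)
print(a_hat, sigma2_hat)
```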

In a system of simultaneous equations, we wish to estimate the parameters in

$$B y_t + \Gamma z_t = u_t, \qquad t = 1, \cdots, T.$$

Here we have n linear simultaneous equations in n endogenous and m exogenous variables. The parameters to be estimated are the elements of the n × n coefficient matrix

**B** = (β_{ij}),

the n × m coefficient matrix

Γ = (γ_{ik}),

and the n × n variance-covariance matrix

Σ = (σ_{ij}).

The variances and covariances are defined by

σ_{ij} = *E(u_{i}u_{j})*.

A rule of normalization is applied for each equation; in practice, one element of β_{i} = *(β_{i1}, · · ·, β_{in})* in each equation is singled out and assigned a value of unity.

The likelihood function for the whole system is

$$L = (2\pi)^{-nT/2}\,(\operatorname{mod}|B|)^{T}\,|\Sigma|^{-T/2} \exp\!\left[-\frac{1}{2}\sum_{t=1}^{T}(B y_t + \Gamma z_t)'\,\Sigma^{-1}(B y_t + \Gamma z_t)\right],$$

where y_{t} = (y_{1t}, · · ·, y_{nt}), z_{t} = (z_{1t}, · · ·, z_{mt}), | B | is the determinant of B, and mod | B | is the absolute value of the determinant of B (Koopmans et al. 1950). The matrix B enters this expression as the Jacobian of the transformation from the variables *u_{1t}, · · ·, u_{nt}* to *y_{1t}, · · ·, y_{nt}*. The problem of maximum likelihood estimation is to maximize *L* or log *L* with respect to the elements of B, Γ, and Σ. This is especially difficult compared with the similar problem for single equations shown above, because log *L* is highly nonlinear in the unknown parameters, a difficult source of nonlinearity coming from the Jacobian expression mod | B |.

Maximizing log *L* with respect to Σ^{-1}, we obtain the maximum likelihood estimator of Σ, which is

$$\hat{\Sigma} = (B\ \ \Gamma)\, M\, (B\ \ \Gamma)',$$

where M is the moment matrix of the observations, i.e.,

$$M = \frac{1}{T}\sum_{t=1}^{T}\begin{pmatrix} y_t \\ z_t \end{pmatrix}\begin{pmatrix} y_t' & z_t' \end{pmatrix}.$$

Substitution of Σ̂ into the likelihood function yields the *concentrated form* of the likelihood function

$$\log L = \text{Const.} + T \log \operatorname{mod}|B| - \frac{T}{2}\log|\hat{\Sigma}|,$$

where Const. is a constant. Hence, we seek estimators of B and Γ that maximize

$$T \log \operatorname{mod}|B| - \frac{T}{2}\log|\hat{\Sigma}|.$$

In the single-equation case we minimize the one-element variance expression, written as a function of the α_{i}. In the simultaneous equation case, we maximize the concentrated likelihood, but this can be shown to be equivalent (Chow 1964) to minimization of | Σ̂ |, subject to a normalization rule that holds a quadratic form in the coefficients of each equation equal to a constant C. This normalization is direction normalization, and as long as it is taken into account, scale normalization (such as β_{ii} = 1, cited previously) is arbitrary. Viewed in this way, the method of maximum likelihood applied to a system of equations appears to be a natural generalization of the method of maximum likelihood applied to a single equation, in which case we minimize σ² subject to a direction-normalization rule.

*Recursive systems*. The concentrated form of the likelihood function shows clearly that a new element is introduced into the estimation process, through the presence of the Jacobian determinant, which makes calculations of the maximizing values of B and Γ highly nonlinear. It is therefore worthwhile to search for special situations in which estimation methods simplify at least to the point of being based on linear calculations.

It is evident that the concentrated form of the likelihood function would lend itself to simpler methods of estimating B and Γ if | B | were a known constant. This would be the case if B were triangular, for then, by a scale normalization, we would have β_{ii} = 1 and | B | = 1. If B is triangular, the system of equations is called a recursive system. We then simply minimize | Σ̂ | with respect to the unknown coefficients; this can be looked upon as a generalized variance minimization, an obvious analogue of least squares applied to a single equation.

If, in addition, it can be assumed that Σ is diagonal, maximum likelihood estimators become a series of successive single-equation least squares estimators. Since the matrix B is assumed to be triangular, there must be an equation with only one unlagged endogenous variable. This variable (with unit coefficient) is to be regressed on all the predetermined variables in that equation. Next, there will be an equation with one new endogenous variable. This variable is regressed on the preceding endogenous variable and all the predetermined variables in that equation. In the third equation, another new endogenous variable is introduced. It is regressed on the two preceding endogenous variables and all the predetermined variables in that equation, and so on.
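For a triangular B with diagonal Σ, these successive single-equation regressions can be sketched as follows, using a hypothetical two-equation recursive system with independent disturbances (all parameter values illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
T = 100_000

# Hypothetical recursive (triangular) system with independent disturbances:
#   y1 = a*z1 + u1
#   y2 = b*y1 + c*z1 + u2
a, b, c = 1.5, 0.8, -0.5
z1 = rng.normal(size=T)
u1 = rng.normal(size=T)
u2 = rng.normal(size=T)
y1 = a * z1 + u1
y2 = b * y1 + c * z1 + u2

def ols(y, X):
    """Solve the least squares normal equations (X'X) beta = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Successive single-equation least squares, one equation at a time:
# first the equation with no endogenous regressors, then the next.
a_hat = ols(y1, z1[:, None])[0]
b_hat, c_hat = ols(y2, np.column_stack([y1, z1]))
print(a_hat, b_hat, c_hat)
```

Because u_{2} is independent of u_{1} (and hence of y_{1}), the second regression is consistent even though y_{1} is endogenous to the system as a whole.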

If Σ is not diagonal, a statistically consistent procedure would be to use values of the endogenous variables computed from preceding equations in the triangular array instead of using their actual values. Suppose one equation in the system specifies *y_{1}* as a function of certain z’s. We would regress *y_{1}* on these z’s and then compute values of *y_{1}* from the relation

$$\hat{y}_{1t} = -\sum_{k} \hat{\gamma}_{1k} z_{kt},$$

where the γ̂_{1k} are the least squares regression estimators of the γ_{1k}. (Some of the γ_{1k} are zero, as a result of the identifying restrictions imposed prior to computing the regression.) Suppose a second equation in the system specifies *y*_{2} as a function of *y*_{1} and certain z’s. Our next step would be to regress *y*_{2} on ŷ_{1} and the included z’s and then compute values of *y*_{2} from the relation

$$\hat{y}_{2t} = -\hat{\beta}_{21}\hat{y}_{1t} - \sum_{k} \hat{\gamma}_{2k} z_{kt}.$$

The procedure would be continued until all *n* equations are estimated.

Methods of dealing with recursive systems have been studied extensively by Wold, and a summary appears in Strotz and Wold (1960). A recursive system without a diagonal Σ-matrix is found in Barger and Klein (1954). One of the most familiar types of recursive systems studied in econometrics is the cobweb model of demand and supply for agricultural products [*see* BUSINESS CYCLES, article on MATHEMATICAL MODELS].

*Limited-information maximum likelihood*. Another maximum likelihood approach that is widely used is the limited-information maximum likelihood method. It does not hinge on a specific formulation of the model, as do methods for recursive systems; it is a simplified method because it neglects information. As we have seen, identifying restrictions for an equation take the form of specifying zero values for some parameters or of imposing certain linear relations on some parameters. The term “limited information” refers to the fact that only the restrictions relating to the particular equation (or subset of equations) being estimated are used. Restrictions on other equations in the system are ignored when a particular equation is being estimated.

Let us again consider the linear system

$$B y_t + \Gamma z_t = u_t.$$

These equations make up the structural form of the system and are referred to as structural equations. We denote the *reduced form* of this system by

$$y_t = \Pi z_t + v_t, \qquad \Pi = -B^{-1}\Gamma, \qquad v_t = B^{-1}u_t.$$

From the reduced form equations select a subset corresponding to the n_{1} endogenous variables in a particular structural equation, say equation i, which is

$$\sum_{j=1}^{n_1} \beta_{ij} y_{jt} + \sum_{k=1}^{m_1} \gamma_{ik} z_{kt} = u_{it}.$$

The summation limit m_{1} indicates the number of predetermined variables included in this equation; we have excluded all zero elements in γ_{i} and indexed the z’s accordingly. Form the joint distribution of v_{1t}, · · ·, v_{n_1t} over the sample observations and maximize it with respect to the unknown parameters in the ith structural equation, subject to the restrictions on this equation alone. The restrictions usually take the form

$$\gamma_{ik} = 0, \qquad k = m_1 + 1, \cdots, m,$$

where there are m predetermined variables in the whole system; that is, the γ_{ik}, k = m_{1} + 1, · · ·, *m*, are specified to be zero. The estimated coefficients, β̂_{ij}, γ̂_{ik}, and σ̂_{i}², obtained from this restricted likelihood maximization are the limited-information estimators. Methods of obtaining these estimators and a study of their properties are given in Anderson and Rubin (1949).

Linear regression calculations are all that are needed in this type of estimation, save for the extraction of a characteristic root of a matrix with dimensionality n_{1} × n_{1}. A quickly convergent series of iterations involving matrix multiplication leads to the computation of this root and associated vector. The vector obtained, properly normalized by making one coefficient unity, provides estimates of the β_{ij}. The estimates of the γ_{ik} are obtained from

$$\hat{\gamma}_{ik} = -\sum_{j=1}^{n_1} \hat{\beta}_{ij}\hat{\pi}_{jk}, \qquad k = 1, \cdots, m_1,$$

where the π̂_{jk} are least squares regression coefficients from the reduced form equations.

It is significant that both full-information and limited-information maximum likelihood estimators are essentially unchanged no matter which variable is selected to have a unit coefficient in each equation. That is to say, if we divide through an estimated equation by the coefficient of any endogenous variable, we get a set of coefficients that would have been obtained by applying the estimation methods under the specification that the same variable have the unit coefficient. Full-information and limited-information maximum likelihood estimators are invariant under this type of scale normalization. Other estimators are not.

**Two-stage least squares.** The classical method of least squares multiple regression applied to a single equation that is part of a larger simultaneous system is inconsistent by virtue of the fact that some of the “explanatory” variables in the regression (the variables with unknown coefficients) may not be independent of the error variable. If we can “purify” such variables to make them independent of the error terms, we can apply ordinary least squares methods to the transformed variables. The method of two-stage least squares does this for us.

Let us return to the equation estimated above by limited information. Choose y_{1}, say, as the dependent variable, that is, set β_{i1} equal to unity. In place of y_{2t}, · · ·, y_{n_1t}, we shall use the computed values

$$\hat{y}_{jt} = \sum_{k=1}^{m} \hat{\pi}_{jk} z_{kt}, \qquad j = 2, \cdots, n_1,$$

as explanatory variables. The ŷ_{jt} are computed values from the least squares regressions of y_{jt} on all the z_{kt} in the system (k = 1, · · ·, m). The coefficients π̂_{jk} are the computed regression coefficients. The regression of y_{1t} on ŷ_{2t}, · · ·, ŷ_{n_1t}, z_{1t}, · · ·, z_{m_1t} provides a two-stage least squares estimator of the single equation. All the equations of a system may be estimated in this way. This can be seen to be a generalization, to systems with nontriangular Jacobians, of the method suggested previously for recursive models in which the variance-covariance matrix of disturbances is not diagonal.

We may write the “normal” equations for these least squares estimators as

$$\sum_{t=1}^{T} w_t w_t' \binom{b}{c} = \sum_{t=1}^{T} w_t y_{1t}, \qquad w_t = (\hat{y}_{2t}, \cdots, \hat{y}_{n_1t}, z_{1t}, \cdots, z_{m_1t})'.$$

In this notation ŷ_{t} = (ŷ_{2t}, · · ·, ŷ_{n_1t}) is the vector of computed values; z_{t} is the vector (z_{1t}, · · ·, z_{mt}); b is the estimator of the vector (β_{i2}, · · ·, β_{in_1}); and c is the estimator of the vector (γ_{i1}, · · ·, γ_{im_1}). It should be noted that the computed values ŷ_{jt} are orthogonal to the corresponding reduced form residuals. It should be further observed that, in consequence, moments of the ŷ_{jt} with the observed y_{jt} equal moments of the ŷ_{jt} with themselves. In evaluating the relevant moment matrices for the computed values, the whole vector z_{t} = (z_{1t}, · · ·, z_{mt}), which includes all the predetermined variables in the system, is used.
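The two stages can be sketched directly for a hypothetical overidentified two-equation system (all parameter values illustrative): the first stage regresses the endogenous regressor on all predetermined variables, and the second stage regresses the dependent variable on the computed values and the included predetermined variable.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200_000

# Hypothetical overidentified structural pair:
#   y1 = b*y2 + g1*z1 + u1,   y2 = d*y1 + g2*z2 + g3*z3 + u2
b, g1 = 0.6, 1.0
d, g2, g3 = 0.4, 1.5, -1.0
Z = rng.normal(size=(T, 3))   # columns: z1, z2, z3 (predetermined)
u = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=T)

# Solve the two structural equations simultaneously for y1, y2.
B = np.array([[1.0, -b], [-d, 1.0]])
rhs = np.column_stack([g1 * Z[:, 0] + u[:, 0],
                       g2 * Z[:, 1] + g3 * Z[:, 2] + u[:, 1]])
y1, y2 = (rhs @ np.linalg.inv(B).T).T

def ols(y, X):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Stage 1: regress y2 on ALL predetermined variables -> "purified" values.
y2_hat = Z @ ols(y2, Z)
# Stage 2: regress y1 on the computed y2 and the included z1.
b_2sls, g1_2sls = ols(y1, np.column_stack([y2_hat, Z[:, 0]]))
# Ordinary least squares on the same equation, for contrast.
b_ols, _ = ols(y1, np.column_stack([y2, Z[:, 0]]))
print(b_2sls, b_ols)   # 2SLS near the true 0.6; OLS biased away from it
```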

**k-class estimators.** Theil (1958) and Basmann (1957), independently, were the first to advocate the method of two-stage least squares. Theil suggested a whole system of estimators, called the *k*-class. He defined these as the solutions to

$$\begin{pmatrix} \sum_t y_t y_t' - k\sum_t \hat{v}_t\hat{v}_t' & \sum_t y_t z_t' \\ \sum_t z_t y_t' & \sum_t z_t z_t' \end{pmatrix}\begin{pmatrix} b \\ c \end{pmatrix} = \begin{pmatrix} \sum_t (y_t - k\hat{v}_t)\, y_{1t} \\ \sum_t z_t\, y_{1t} \end{pmatrix},$$

where y_{t} here denotes the vector of explanatory endogenous variables (y_{2t}, · · ·, y_{n_1t}) and z_{t} the vector of included predetermined variables (z_{1t}, · · ·, z_{m_1t}). In this expression v̂_{t} is the vector of residuals computed from the reduced form regressions of y_{2t}, · · ·, y_{n_1t} on all the z_{kt}. If k = 0, we have ordinary least squares estimators. If k = 1, we have two-stage least squares estimators. If k = 1 + λ and λ is the smallest root of the determinantal equation arising in the limited-information method, we have limited-information maximum likelihood estimators. This is a succinct way of showing the relationships between various single-equation methods of estimation. Of these three members of the k-class, ordinary least squares is not consistent; the other two are.
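The k-class family can be written as one routine in which k is a free parameter; setting k = 0 reproduces ordinary least squares and k = 1 reproduces two-stage least squares. A sketch with a hypothetical data-generating process (one endogenous regressor, correlated disturbances, illustrative coefficients):

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50_000

# Hypothetical equation to estimate: y1 = b*y2 + g1*z1 + u1.
b, g1 = 0.6, 1.0
Z = rng.normal(size=(T, 3))    # all predetermined variables z1, z2, z3
u = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=T)
y2 = Z @ np.array([0.5, 1.5, -1.0]) + u[:, 1]   # reduced form for y2
y1 = b * y2 + g1 * Z[:, 0] + u[:, 0]

Y = y2[:, None]    # included endogenous regressors
Z1 = Z[:, :1]      # included predetermined variables (z1 only)
# Reduced-form residuals of Y on all predetermined variables.
V = Y - Z @ np.linalg.solve(Z.T @ Z, Z.T @ Y)

def k_class(k):
    """Theil's k-class estimator: k=0 is OLS, k=1 is two-stage least squares."""
    top = np.hstack([Y.T @ Y - k * V.T @ V, Y.T @ Z1])
    bot = np.hstack([Z1.T @ Y, Z1.T @ Z1])
    lhs = np.vstack([top, bot])
    rhs = np.concatenate([(Y - k * V).T @ y1, Z1.T @ y1])
    return np.linalg.solve(lhs, rhs)

print(k_class(0.0))   # ordinary least squares: inconsistent here
print(k_class(1.0))   # two-stage least squares: consistent
```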

**Three-stage least squares.** Other members of the *k*-class could be cited for special discussion, but a more fruitful generalization of Theil’s approach to estimation theory lies in the direction of what he and Zellner (see Zellner & Theil 1962) call three-stage least squares. Chow (1964) has shown that three-stage least squares estimators are better viewed as simultaneous two-stage least squares estimators. Let us denote û_{t} = (û_{1t}, · · ·, û_{nt}) as the vector of residuals associated with each equation in a system in which the included endogenous variables have been replaced by their computed values ŷ_{jt}, that is,

$$\hat{u}_{it} = y_{it} + \sum_{j \neq i} \beta_{ij}\hat{y}_{jt} + \sum_{k=1}^{m} \gamma_{ik} z_{kt}.$$

In this formulation Σ_{j≠i} means summation over all values of j except j = i, and it is assumed that the ith endogenous variable in the ith equation has a unit coefficient. Some elements of β_{ij} and γ_{ik} are zero or otherwise restricted for identification. Define

$$\hat{\Sigma} = \Big(\frac{1}{T}\sum_{t=1}^{T}\hat{u}_{it}\hat{u}_{jt}\Big).$$

Minimization of | Σ̂ | with respect to the β_{ij} and γ_{ik} yields the estimators sought. Thus, we have an extension of the principle of least squares in which the generalized variance is minimized. This is like the principle of maximum likelihood, which also minimizes | Σ̂ |, expressed in terms of the β_{ij} and γ_{ik}, but the direction normalization is different.

Theil and Zellner termed their method three-stage least squares because they first derived two-stage least squares estimators for each single equation in the system. They computed the residual variance for each equation and used these as estimators of the variances of the true (unobserved) random disturbances. They then used Aitken’s generalized method of least squares (1935) to estimate all the equations in the system simultaneously. Aitken’s method applies to systems of equations in which the variance-covariance matrix for disturbances is a general known positive definite matrix. Theil and Zellner used the two-stage estimator of this variance-covariance matrix as though it were known. The advantage of this method is that it is of the full-information variety, making use of restrictions on all the equations of the system.

**Other methods.** If the conditions for identification of a single equation are such that there are just enough restrictions to transform linearly and uniquely the reduced form coefficients into the structural coefficients, an indirect least squares method of estimation can be used. Exact identification under zero-type restrictions would enable one to solve

$$\sum_{j=1}^{n_1} \hat{\beta}_{ij}\hat{\pi}_{jk} = 0, \qquad k = m_1 + 1, \cdots, m,$$

for a unique set of estimated β_{ij}, apart from scale normalization, given a set of estimated π_{jk}. The latter would be determined from least squares estimators of the reduced forms. Since there are n_{1} – 1 of the β_{ij} to be determined, the necessary condition for exact identification here is that n_{1} – 1 = m – m_{1}.

If there is underidentification, i.e., too few a priori restrictions, structural estimation cannot be completed but unrestricted reduced forms can be estimated by the method of least squares. This is the most information that the econometrician can extract when there is lack of identification. Least squares estimators of the reduced form equations are consistent in the underidentified case, but estimates of the structural parameters cannot be made.

*Instrumental variables*. The early discussion of estimation problems in simultaneous equation models contained, on many occasions, applications of a method known as the method of instrumental variables. In estimating the ith equation of a linear system, i.e.,

$$\sum_{j=1}^{n_1} \beta_{ij} y_{jt} + \sum_{k=1}^{m_1} \gamma_{ik} z_{kt} = u_{it},$$

we may choose (n_{1} – 1) + m_{1} variables that are independent of u_{it}. These are known as the instrumental set. Naturally, the exogenous variables in the equation (z_{1t}, · · ·, z_{m_1t}) are possible members of this set. In addition, we need n_{1} – 1 more instruments from the list of exogenous variables in the system but not in the ith equation. For this problem let these be denoted as x_{2t}, · · ·, x_{n_1t}. Since E(z_{st}u_{it}) = 0, s = 1, · · ·, m_{1}, and E(x_{rt}u_{it}) = 0, r = 2, · · ·, n_{1}, we can estimate the unknown parameters from

$$\sum_{t=1}^{T}\hat{u}_{it} z_{st} = 0, \quad s = 1, \cdots, m_1; \qquad \sum_{t=1}^{T}\hat{u}_{it} x_{rt} = 0, \quad r = 2, \cdots, n_1,$$

where û_{it} = Σ_{j}β̂_{ij}y_{jt} + Σ_{k}γ̂_{ik}z_{kt}. With a scale-normalization rule, such as β̂_{i1} = 1, we have (n_{1} – 1) + m_{1} linear equations in the same number of unknown coefficients. In exactly identified models there is no problem in picking the x_{rt}, for there will always be exactly n_{1} – 1 z’s excluded from the ith equation. The method is then identical with indirect least squares. If m – m_{1} > n_{1} – 1, i.e., if there are more exogenous variables outside the ith equation than there are endogenous variables minus one, we have overidentification, and the number of possible instrumental variables exceeds the minimum needed. In order to avoid the problem of subjective or arbitrary choice among instruments, we turn to the methods of limited information or two-stage least squares. In fact, it is instructive to consider how the method of two-stage least squares resolves this matter. In place of single variables as instruments, it uses linear combinations of them. The computed values

$$\hat{y}_{jt} = \sum_{k=1}^{m} \hat{\pi}_{jk} z_{kt}, \qquad j = 2, \cdots, n_1,$$

are the new instruments. We can view the method either as the regression of y_{1t} on ŷ_{2t}, · · ·, ŷ_{n_1t}, z_{1t}, · · ·, z_{m_1t} or as instrumental-variable estimators with ŷ_{2t}, · · ·, ŷ_{n_1t}, z_{1t}, · · ·, z_{m_1t} as the instruments. Both come to the same thing. The method of instrumental variables yields consistent estimators.
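In an exactly identified equation the instrumental-variable estimator solves a square linear system of moment conditions. A sketch with one endogenous regressor and one excluded exogenous variable serving as its instrument (all parameter values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
T = 100_000

# Hypothetical exactly identified equation: y1 = b*y2 + g*z1 + u1, with one
# excluded exogenous variable x2 available as the instrument for y2.
b, g = 0.7, 1.0
z1 = rng.normal(size=T)
x2 = rng.normal(size=T)
u = rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=T)
y2 = 0.8 * z1 + 1.2 * x2 + u[:, 1]   # y2 endogenous: its error correlates with u1
y1 = b * y2 + g * z1 + u[:, 0]

# Instrumental set (x2, z1): solve the moment conditions  W'X beta = W'y1.
X = np.column_stack([y2, z1])
W = np.column_stack([x2, z1])
b_iv, g_iv = np.linalg.solve(W.T @ X, W.T @ y1)
print(b_iv, g_iv)
```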

*Subgroup averages*. The instrumental-variables method can be applied in different forms. One form was used by Wald (1940) to obtain consistent estimators of a linear relationship between two variables each of which is subject to error. This gives rise to a method that can be used in estimating econometric systems. Wald proposed that the estimator of β in

y_{t} = α + βx_{t},

where *y_{t}* and x_{t} are both measured with error, be computed from

$$\hat{\beta} = \frac{\bar{y}_{(2)} - \bar{y}_{(1)}}{\bar{x}_{(2)} - \bar{x}_{(1)}}.$$

He proposed ordering the sample in ascending magnitudes of the variable x. From the two halves of the sample, we determine two sets of mean values of y and x, denoted here by the subscripts (1) and (2). The line joining these means will have a slope given by β̂. Wald showed the conditions under which these estimates are consistent.

This may be called the method of subgroup averages. It is a very simple method, which may readily be applied to equations with more than two parameters. The sample is split into as many groups as there are unknown parameters to be determined in the equation under consideration. If there are three parameters, for example, the sample may be split into thirds and the parameters estimated from the three equations in the three sets of subgroup means,

$$\bar{y}_{(i)} = \hat{\alpha} + \hat{\beta}\bar{x}_{(i)} + \hat{\gamma}\bar{w}_{(i)}, \qquad i = 1, 2, 3.$$

The extension to more parameters is obvious. The method of subgroup averages can be shown to be a form of the instrumental-variables method by an appropriate assignment of values to “dummy” instrumental variables.

Subgroup averages is a very simple method, and it is consistent, but it is not very efficient.
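Wald’s two-group estimator can be sketched as follows, with hypothetical parameter values. As Wald’s consistency conditions require, the grouping is assumed independent of the measurement errors; the simulation imposes this by ordering on the true values:

```python
import numpy as np

rng = np.random.default_rng(5)
T = 20_000

# Hypothetical errors-in-variables line y = alpha + beta*x; x and y are both
# observed with error.
alpha, beta = 1.0, 2.0
x_true = rng.uniform(-3, 3, size=T)
x = x_true + rng.normal(scale=0.5, size=T)            # x measured with error
y = alpha + beta * x_true + rng.normal(scale=0.5, size=T)

# Wald's method of subgroup averages: split the sample into halves by a
# grouping criterion independent of the errors, then join the group means.
order = np.argsort(x_true)
lo, hi = order[: T // 2], order[T // 2:]
beta_hat = (y[hi].mean() - y[lo].mean()) / (x[hi].mean() - x[lo].mean())
alpha_hat = y.mean() - beta_hat * x.mean()

# Ordinary least squares on the error-ridden x, for contrast.
beta_ols = np.cov(x, y)[0, 1] / np.var(x)
print(beta_hat, beta_ols)
```

The least squares slope is attenuated toward zero by the measurement error in x, while the subgroup-averages slope stays near the true β.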

*Simultaneous least squares*. The simultaneous least squares method, suggested by Brown (1960), minimizes the sum of squares of all reduced form disturbances, subject to the parameter restrictions imposed on the system, i.e., it minimizes

$$\sum_{i=1}^{n}\sum_{t=1}^{T} v_{it}^{2}$$

subject to restrictions. Suppose that the v_{it} are expressed as functions of the observables and parameters, with all restrictions included; then Brown’s method minimizes the sum of the elements on the main diagonal of Ω̂, where Ω̂ is the variance-covariance matrix of reduced form disturbances, whereas full-information maximum likelihood minimizes | Ω̂ |.

Brown’s method has the desirable property of being a full-information method; it is distribution free; it is consistent; but it has the drawback that its results are not invariant under linear transformations of the variables. This drawback can be removed by expressing the reduced form disturbances in standard units,

$$\frac{v_{it}}{\sigma_{v_i}},$$

and minimizing

$$\sum_{i=1}^{n}\sum_{t=1}^{T}\Big(\frac{v_{it}}{\sigma_{v_i}}\Big)^{2}.$$

## Evaluation of alternative methods

The various approaches to estimation of whole systems of simultaneous equations or individual relationships within such systems are consistent except for the single-equation least squares method. If the system is recursive and disturbances are independent between equations, least squares estimators are also consistent. In fact, they are maximum likelihood estimators for normally distributed disturbances. But generally, ordinary least squares estimators are not consistent. They are included in the group of alternatives considered here because they have a time-honored status and because they have minimum variance. In large-sample theory, maximum likelihood estimators of parameters are generally efficient compared with all other estimators. That is why we choose full-information maximum likelihood estimators as norms. They are consistent and efficient. Least squares estimators are minimum-variance estimators if their variances are measured about their own (inconsistent) means. If their dispersion is measured about the true, or population, values, it is not certain that they are efficient.

Limited-information estimators are less efficient than full-information maximum likelihood estimators. This should be intuitively obvious, since full-information estimators make use of more a priori information; it is proved in Klein (1960). Two-stage least squares estimators have asymptotically the same variance-covariance matrix as limited-information estimators, and three-stage (or simultaneous two-stage) least squares estimators have the same variance-covariance matrix as full-information maximum likelihood estimators. Thus, asymptotically the two kinds of limited-information estimators have the same efficiency, and the two kinds of full-information estimators have the same efficiency. The instrumental-variables or subgroup-averages methods are generally inefficient. Of course, the instrumental-variables method can be pushed to the point where it is the same as two-stage least squares estimation and can thereby gain efficiency.

A desirable aspect of the method of maximum likelihood is that its properties are preserved under a single-valued transformation. Thus, efficient estimators of structural parameters by this method transform into efficient estimators of reduced form parameters. The apparently efficient method of least squares may lose its efficiency under this kind of transformation. In applications of models, we use the reduced form in most cases, not the individual structural equations; therefore the properties under conditions of transformation from structural to reduced form equations are of extreme importance. Limited-information methods are a form of maximum likelihood methods. Therefore the properties of limited information are preserved under transformation.

To obtain limited-information estimators of the single equation

we maximize the joint likelihood of v_{1t}, · · ·, v_{n_it} in

subject to the restrictions on the ith equation. In this case only the n_i reduced forms corresponding to y_{1t}, · · ·, y_{n_it} are used. It is also possible to simplify calculations, and yet preserve consistency (although at the expense of efficiency), by using fewer than all m predetermined variables in the reduced forms. In this sense the reduced forms of limited-information estimation are not necessarily unique, and the same endogenous variable appearing in different structural equations of a system may not have the same reduced form expression for each equation estimator. There is yet another sense in which we may derive reduced forms for the method of limited information. After each equation of a complete system has been estimated by the method of limited information, we can derive algebraically a set of reduced forms for the whole system. These would, in fact, be the reduced forms used in forecasting, multiplier analysis, and similar applications of systems. The efficiency property noted above for limited and full information has not been proved for reduced forms derived in this way, but it has been studied in numerical analysis (see below).

**Ease of computation**. Finally we come to an important practical matter in the comparison of the different methods of estimation: relative ease of computation. Naturally, calculations are simpler and smaller in magnitude for single-equation least squares than for any of the other methods except that of subgroup averages. The method of instrumental variables is of similar computational complexity, but for equations with four or more variables it pays to have the advantage of symmetry in the moment matrices, as is the case with single-equation least squares. This is hardly a consideration with modern electronic computing machines, but it is worth bearing in mind if electric desk machines are being used.

The next-simplest calculations are those for two-stage least squares. These consist of a repeated application of least squares regression techniques of calculation, but the first regressions computed are of substantial size. There are as many independent variables in the regression as there are predetermined variables in the system, provided there are enough degrees of freedom. Essentially, the method amounts to the calculation of parameters and computed dependent variables in

Only the “forward” part of this calculation by the standard Gauss-Doolittle method need be made in order to obtain the moment matrix of the ŷ_{it}. In the next stage we compute the regression

Two important computing problems arise in the first stage. In many systems m > T; i.e., there are insufficient degrees of freedom in the sample for evaluation of the reduced forms. We may choose a subset of the z_{kt}, or we may use principal components of the z_{kt} (Kloek & Mennes 1960). Systematic and efficient ways of choosing subsets of the z_{kt} have been developed by taking account of the recursive structure of the model (Fisher 1965). In many economic models m has been as large as 30 or more, and it is often difficult to make sufficiently accurate evaluation of the reduced form regression equations of this size, given the amount of multicollinearity found in economic data with common trends and cycles. The same procedures used in handling the degrees-of-freedom problem are recommended for getting round the difficulties of multicollinearity. Klein and Nakamura (1962) have shown that multicollinearity problems are less serious in ordinary than in two-stage least squares. They have also shown that these problems increase as we move on to the methods of limited-information and then full-information maximum likelihood.
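The two regression stages just described can be sketched on a small simulated system. This is a minimal illustration, not the article's own computation: the two-equation model, its coefficient values, and the use of NumPy's least squares routine are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200

# Assumed two-equation system (illustrative values only):
#   y1 = 0.5*y2 + 1.0*z1 + u1   <- the equation we estimate
#   y2 = -0.8*y1 + 1.0*z2 + u2
z1, z2 = rng.normal(size=(2, T))
u1, u2 = rng.normal(size=(2, T))
y1 = (0.5 * (z2 + u2) + z1 + u1) / (1 + 0.5 * 0.8)  # solved (reduced) form
y2 = -0.8 * y1 + z2 + u2

# First stage: regress the endogenous regressor y2 on all
# predetermined variables in the system, keep the fitted values.
Z = np.column_stack([np.ones(T), z1, z2])
y2_hat = Z @ np.linalg.lstsq(Z, y2, rcond=None)[0]

# Second stage: ordinary regression of y1 on the fitted y2 and
# the equation's own predetermined variable z1.
X = np.column_stack([np.ones(T), y2_hat, z1])
beta = np.linalg.lstsq(X, y1, rcond=None)[0]  # consistent for [0, 0.5, 1.0]
```

Only the point estimates carry over from the naive second-stage regression; its standard errors are not the correct two-stage least squares standard errors and would need the usual adjustment.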

Limited-information methods require all the computations of two-stage least squares and, in addition, the extraction of a root of an n_i × n_i determinantal equation. The latter calculation can be done in a straightforward manner by iterative matrix multiplication, usually involving fewer than ten iterations.

Both limited information and two-stage least squares are extremely well adapted to modern computers and can be managed without much trouble on electric desk machines.

Three-stage least squares estimators involve the computation of two-stage estimators for each equation of a system, estimation of a variance-covariance matrix of structural disturbances, and simultaneous solution of a linear equation system of the order of all coefficients in the system. This last step may involve a large number of estimating equations for a model of 30 or more structural equations.
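The three steps can be traced on a small artificial two-equation system. Everything below (the model, its coefficient values, the disturbance covariance) is an assumption made for the sketch; it follows the Zellner-Theil estimator in its textbook form, not any particular applied model.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 300

# Assumed two-equation system with correlated disturbances:
#   y1 = 0.6*y2 + 1.0*x1           + u1
#   y2 = 0.4*y1 + 0.8*x2 + 0.5*x3  + u2
x1, x2, x3 = rng.normal(size=(3, T))
U = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], size=T)
B = np.array([[1.0, -0.6], [-0.4, 1.0]])
G = np.array([[1.0, 0.0, 0.0], [0.0, 0.8, 0.5]])
Y = np.linalg.solve(B, G @ np.vstack([x1, x2, x3]) + U.T).T
y1, y2 = Y[:, 0], Y[:, 1]

Z = np.column_stack([x1, x2, x3])          # all predetermined variables
P = Z @ np.linalg.solve(Z.T @ Z, Z.T)      # projection onto the instruments

X1 = np.column_stack([y2, x1])             # regressors of equation 1
X2 = np.column_stack([y1, x2, x3])         # regressors of equation 2

# Step 1: two-stage least squares, equation by equation.
def tsls(X, y):
    return np.linalg.solve(X.T @ P @ X, X.T @ P @ y)

d1, d2 = tsls(X1, y1), tsls(X2, y2)

# Step 2: disturbance covariance matrix from the 2SLS residuals.
R = np.column_stack([y1 - X1 @ d1, y2 - X2 @ d2])
S = R.T @ R / T

# Step 3: one joint generalized least squares solution for all coefficients.
Xs = np.zeros((2 * T, 5))                  # block-diagonal regressor matrix
Xs[:T, :2], Xs[T:, 2:] = X1, X2
ys = np.concatenate([y1, y2])
W = np.kron(np.linalg.inv(S), P)
delta = np.linalg.solve(Xs.T @ W @ Xs, Xs.T @ W @ ys)
# delta is consistent for [0.6, 1.0, 0.4, 0.8, 0.5]
```

The final solve is the "large linear system" the text refers to: its order equals the total number of coefficients in the whole model, here only five, but several hundred in a 30-equation system.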

All the previous methods consist of standard linear matrix operations. The extraction of a characteristic root is the only operation that involves nonlinearities, and the desired root can quickly be found by an iterative process of matrix multiplication. Full-information maximum likelihood methods, however, are quite different. The estimation equations are highly nonlinear. For small systems of two, three, or four equations, estimates have been made without much trouble on large computers (Eisenpress 1962) and on desk machines (Chernoff & Divinsky 1953). The problem of finding the maximum of a function as complicated as the joint likelihood function of a system of 15 to 20 or more equations is, however, formidable. Electronic machine programs have been developed for this purpose. The most standardized sets of full-information maximum likelihood calculations are for systems that are fully linear in both parameters and variables. Single-equation methods require linearity only in unknown parameters, and this is a much weaker restriction. Much progress in computation has been made since the first discussion of these econometric methods of estimation, in 1943, but the problem is far from solved, and there is no simple, push-button computation. This is especially true of full-information maximum likelihood.

Efficient programs have recently been developed for calculating full-information maximum likelihood estimates in either linear or nonlinear systems, and these have been applied to models of as many as 15 structural equations, involving more than 60 unknown parameters.

**Generalization of assumptions**. The basis for comparing different estimation methods or for preferring one method over another rests on asymptotic theory. The property of consistency is a large-sample property, and the sampling errors used to evaluate efficiency measures are asymptotic formulas. Unfortunately, samples of economic data are frequently not large, especially time series data. Neither the small-sample bias nor the small-sample confidence intervals for parameter estimators are generally known in specific formulas. Constructed numerical experiments, designed according to Monte Carlo methods, have thrown some light on the small-sample properties. These are reported below.

Another assumption sometimes made for the basic model is that the error terms are mutually independent. We noted above that successive least squares treatment of equations in recursive systems is identical with maximum likelihood estimation when the variance-covariance matrix of structural disturbances is diagonal. This implies mutual independence among contemporaneous disturbances. In a time series model we usually make another assumption, namely, that

E(u_{it}u_{jt′}) = 0,  t ≠ t′, for all *i, j*.

The simplest way in which this assumption can be modified is to allow the errors to be related in some linear autoregressive process, such as

where E(e_{it}e_{jt′}) = 0 (t ≠ t′, for all *i, j*). In a formal sense, joint maximum likelihood estimation of structural parameters and autoregressive coefficients, ρ_{ij}, can be laid out in estimation equations, but there are no known instances where these have been solved on a large scale, for the estimation equations are very complicated. For single-equation models or for recursive systems which split into a series of single-equation regressions, the autoregressive parameters of first order have been jointly estimated with structural parameters (Barger & Klein 1954). The principal extensions to larger systems have been in cases where the autoregressive parameters are known a priori. Then it is easy to make known autoregressive transformations of the variables and proceed as in the case of independent disturbances. [*See*TIME SERIES].
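For the known-parameter case just mentioned, the transformation is simple quasi-differencing. The sketch below treats a single equation with a first-order autoregressive disturbance and an assumed known ρ; every number and variable name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
T, rho = 400, 0.7           # rho assumed known a priori (illustrative value)
alpha, beta = 3.0, 0.5

x = rng.normal(size=T)
e = rng.normal(scale=0.5, size=T)
u = np.zeros(T)
for t in range(1, T):       # u_t = rho*u_{t-1} + e_t
    u[t] = rho * u[t - 1] + e[t]
y = alpha + beta * x + u

# Quasi-difference both sides:
#   y_t - rho*y_{t-1} = alpha*(1 - rho) + beta*(x_t - rho*x_{t-1}) + e_t
# so the transformed equation has serially independent disturbances
# and ordinary least squares applies as in the independent case.
ys = y[1:] - rho * y[:-1]
xs = x[1:] - rho * x[:-1]
X = np.column_stack([np.ones(T - 1), xs])
a_star, b_hat = np.linalg.lstsq(X, ys, rcond=None)[0]
a_hat = a_star / (1 - rho)  # recover the original intercept
```

When ρ is not known, the same transformation sits inside an iterative or joint estimation scheme, which is where the complications described above arise.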

Related to the above two points is the treatment of lagged values of endogenous variables as predetermined variables. The presence of lagged endogenous variables reflects serial correlation among endogenous variables rather than among disturbances. In large samples it can be shown that for purposes of estimation we are justified in treating lagged variables as predetermined, but in small samples we incur bias on this account.

Another assumption regarding the disturbances in simultaneous equation systems is that they are mainly due to neglected or unmeasurable variables that affect or disturb each equation of the model. They are regarded as errors in behavior or technology. From a formal mathematical point of view, they could equally well be regarded as a direct error in observation of the normalized dependent variable in each equation, assuming that the system is written so that there is a different normalized dependent variable in each equation. There is an implicit assumption that the exogenous variables are measured without error. If we change the model to one in which random errors enter through disturbances to each relation and also through inaccurate observation of each individual variable, we have a more complicated probability scheme, whose estimation properties have not been developed in full generality. This again has been a case for numerical treatment by Monte Carlo methods.

The procedures of estimating simultaneous equation models as though errors are mutually independent when they really are not and as though variables are accurately measured when they really are not are specification errors. Other misspecifications of models can occur. For simplicity we assume linearity or, at least, linearity in unknown parameters, but the true model may have a different functional form. Errors may not follow the normal distribution, as we usually assume. [*See*ERRORS, article on EFFECTS OF ERRORS IN STATISTICAL ASSUMPTIONS.]

Full-information methods are sensitive to specification error because they depend on restrictions imposed throughout an entire system. Single-equation methods depend on a smaller set of restrictions. If an investigator has particular interest in just one equation or in a small sector of the economy, he may incur large specification error by making too superficial a study of the parts of the economy that do not particularly interest him. There is much to be said for using single-equation methods (limited information or two-stage least squares) in situations where one does not have the resources to specify the whole economy adequately.

There are numerous possibilities for specifying models incorrectly. These probably introduce substantial errors in applied work, but they cannot be studied in full generality for there is no particular way of showing all the misspecifications that can occur. We can, however, construct artificial numerical examples of what we believe to be the major specification errors. These are discussed below.

**Sampling experiments**. The effect on estimation methods of using simplified assumptions that are not fully met in real life often cannot be determined by general mathematical analysis. Econometricians have therefore turned to constructing sampling experiments with large-scale computers to test proposed methods of estimation where (1) the sample is small; and (2) there is specification error in the statement of the model, such as (a) nonzero parameters assumed to be zero, (b) dependent exogenous variables and errors assumed to be independent, (c) imperfectly measured exogenous variables assumed to be perfectly measured, or (d) serially correlated errors assumed to be not serially correlated.

So-called Monte Carlo methods are used to perform the sampling experiments that conceptually underlie sampling error calculations. These sampling experiments are never, in fact, carried out with nonexperimental sources of data, for we cannot relive economic life over and over again; but we can instruct a machine to simulate such an experiment.

Consider a single equation to be estimated by different methods, for example,

y_{t} = α + βx_{t} + u_{t},  t = 1, 2, · · ·, *T*.

Fix α and β at, say, 3.0 and 0.5, respectively, and set *T* = 30. This would correspond to the process

We also fix the values of the predetermined variables x_{1}, x_{2}, · · ·, x_{30} once and for all. We set *T* = 30 to indicate that we are dealing with a 30-element small sample. A sample of 30 annual observations would be the prototype.

Employing a source of random numbers scaled to have a realistic standard deviation and a zero mean, we draw a set of random numbers u_{1}, · · ·, u_{30}. We then instruct a machine to use u_{1}, · · ·, u_{30} and x_{1}, · · ·, x_{30} to compute y_{1}, · · ·, y_{30} from the above formulas. From the samples of data, y_{1}, · · ·, y_{30} and x_{1}, · · ·, x_{30}, we estimate α and β by the methods being studied. Let α̂ and β̂ be the estimated values. We then draw a new set of random numbers, u_{1}, · · ·, u_{30}, and repeat the process, using the same values of x_{1}, · · ·, x_{30}. From many such repetitions, say 100, we have sampling distributions of α̂ and β̂. Means of these distributions, when compared with α (= 3.0) and β (= 0.5), indicate bias, if any, and standard deviations or root-mean-square values about 3.0 or 0.5 indicate efficiency. From these sampling distributions we may compare different estimators of α and β.
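The experiment just described can be run in a few lines. Here only ordinary least squares is applied at each replication (the simultaneous-equation estimators would enter the same loop); α = 3.0, β = 0.5, and T = 30 follow the text, while the disturbance scale and the use of NumPy are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
T, reps = 30, 100
alpha, beta = 3.0, 0.5
x = rng.normal(size=T)            # predetermined values, fixed once for all

estimates = np.empty((reps, 2))
for r in range(reps):
    u = rng.normal(size=T)        # fresh drawing of disturbances
    y = alpha + beta * x + u      # compute the sample from the true process
    X = np.column_stack([np.ones(T), x])
    estimates[r] = np.linalg.lstsq(X, y, rcond=None)[0]

# Mean versus truth indicates bias; spread about truth indicates efficiency.
bias = estimates.mean(axis=0) - np.array([alpha, beta])
rmse = np.sqrt(((estimates - [alpha, beta]) ** 2).mean(axis=0))
```

With exogenous x, least squares here is unbiased; the interesting comparisons arise when the same loop is fed data from a simultaneous system, in which some "independent" variables are correlated with the disturbances.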

What we have said about this simple type of experiment for a single equation can readily be extended to an entire system:

In this case we must start with assumed values of B and Γ. We choose a T-element vector of values for each element of x_{t}, the predetermined variables, and repeated T-element vectors of values for each element of u_{t}. The random variables are chosen so that their variance-covariance matrix equals some specified set of values. As in the single-equation case, T = 30 or some likely small-sample value. The x_{t} are often chosen in accordance with the values of predetermined variables used in actual models. In practice, Monte Carlo studies of simultaneous equation models have dealt with small systems having only two, three, or four equations.

Two sets of results are of interest from these studies. Estimates of individual elements in B and Γ can be studied and compared for different estimators; estimates of B^{-1}Γ, the reduced form coefficients, can be similarly investigated. In addition, we could form some over-all summary statistic, such as standard error of forecast, for different estimators.

The simplest Monte Carlo experiments have been made to test for small-sample properties alone; they have not introduced measurement errors, serial correlation of disturbances, or other specification errors. Generally speaking, these studies clearly show the bias in single-equation least squares estimates where some of the “independent” variables in the regression calculation are not independent of the random disturbances. Maximum likelihood estimators (full or limited information) show comparatively small bias. The standard deviations of individual parameter estimators are usually smallest for the single-equation least squares method, but this standard deviation is computed about the biased sample mean. If estimated about the true mean, least squares sometimes does not show up well, indicating that bias outweighs efficiency. Full-information maximum likelihood shows up as an efficient method, whether judged in terms of variation about the sample or the true mean. Two-stage least squares estimators appear to have somewhat smaller variance about the true values than do limited-information estimators, and both methods measure up to the efficiency of single-equation least squares methods when variability is measured about the true mean.

Asymptotically, limited-information and two-stage estimators have the same variance-covariance matrices, and they are both inefficient compared with full-information estimators. The Monte Carlo results for small samples are not surprising, although the particular experiments studied give a slight edge to two-stage estimators.

When specification error is introduced, in the form of making an element of Γ zero in the estimation process when it is actually nonzero in the population, we find that full-information methods are very sensitive. Both limited-information and two-stage estimators perform better than full-information maximum likelihood. Two-stage estimators are the best among all methods examined in this situation. Limited-information estimators are very sensitive to intercorrelation among predetermined variables.

The principal result for Monte Carlo estimators of reduced form parameters is that transformed single-equation least squares values lose their efficiency properties. Being seriously biased as well, these estimates show a poor over-all rating when used for estimating reduced forms for a system as a whole. Full-information estimators, which are shown in these experiments to be sensitive to specification error, do better in estimating reduced form coefficients than in estimating structural coefficients. Their gain in making use of all the a priori information outweighs the losses due to the misspecification introduced and, in the end, gives them a favorable comparison with ordinary least squares estimators of the reduced form equations that make no use of the a priori information and have no specification error.

If a form of specification error is introduced in a Monte Carlo experiment by having common time trends in elements of x_{t} and u_{t}, so that they are not independent as hypothesized, we find that limited-information estimators are as strongly biased as are ordinary least squares values. If time trend is introduced as an additional variable, however, the limited-information method has small bias.

When observation errors are imposed on the x_{t}, both least squares and limited-information estimators show little change in bias but increases in sampling errors. In this model, it turns out as before that the superior efficiency of least squares estimators of individual structural parameters does not carry over to the estimators of reduced form parameters.

A comprehensive sampling-experiment study of alternative estimators under correctly specified and under misspecified conditions is given in Summers (1965), and Johnston (1963) compares results from several completed Monte Carlo studies. This approach is in its infancy, and further investigations will surely throw new light on the relative merits of different estimation methods.

For some years economists were digesting the modern approach to simultaneous equation estimation introduced by Haavelmo, Mann and Wald, Anderson and Rubin, and Koopmans, Rubin, and Leipnik, and there was a period of little change in this field. Since the development of the two-stage least squares method by Theil, there have been a number of developments. The methods are undergoing interpretation and revision. New estimators are being suggested, and it is likely that many new results will be forthcoming in the next few decades. Wold (1965) has proposed a method based on iterative least squares that recommends itself by its adaptability to modern computers, its consistency, and its capacity to make use of a priori information on all equations simultaneously and to treat some types of nonlinearity with ease. Also, excellent recent books, by Christ (1966), Goldberger (1964), and Malinvaud (1964), greatly aid instruction in this subject.

Lawrence R. Klein

[*See also*LINEAR HYPOTHESES, article on REGRESSION.]

## BIBLIOGRAPHY

Aitken, A. C. 1935 On Least Squares and Linear Combination of Observations. Royal Society of Edinburgh, Proceedings 55:42-48.

Anderson, T. W.; and Rubin, Herman 1949 Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations. Annals of Mathematical Statistics 20:46-63.

Barger, Harold; and Klein, Lawrence R. 1954 A Quarterly Model for the United States Economy. Journal of the American Statistical Association 49:413-437.

Basmann, R. L. 1957 A Generalized Classical Method of Linear Estimation of Coefficients in a Structural Equation. Econometrica 25:77-83.

Brown, T. M. 1960 Simultaneous Least Squares: A Distribution Free Method of Equation System Structure Estimation. International Economic Review 1: 173-191.

Chernoff, Herman; and Divinsky, Nathan 1953 The Computation of Maximum-likelihood Estimates of Linear Structural Equations. Pages 236-269 in Cowles Commission for Research in Economics, Studies in Econometric Method. Edited by William C. Hood and Tjalling C. Koopmans. New York: Wiley.

Chow, Gregory C. 1964 A Comparison of Alternative Estimators for Simultaneous Equations. Econometrica 32:532-553.

Christ, C. F. 1966 Econometric Models and Methods. New York: Wiley.

Eisenpress, Harry 1962 Note on the Computation of Full-information Maximum-likelihood Estimates of Coefficients of a Simultaneous System. Econometrica 30:343-348.

Fisher, Franklin M. 1965 Dynamic Structure and Estimation in Economy-wide Econometric Models. Pages 589-635 in James S. Duesenberry et al., The Brookings Quarterly Econometric Model of the United States. Chicago: Rand McNally.

Fisher, Franklin M. 1966 The Identification Problem in Econometrics. New York: McGraw-Hill.

Goldberger, Arthur S. 1964 Econometric Theory. New York: Wiley.

Haavelmo, Trygve 1943 The Statistical Implications of a System of Simultaneous Equations. Econometrica 11:1-12.

Johnston, John 1963 Econometric Methods. New York: McGraw-Hill.

Klein, Lawrence R. 1960 The Efficiency of Estimation in Econometric Models. Pages 216-232 in Ralph W. Pfouts (editor), Essays in Economics and Econometrics: A Volume in Honor of Harold Hotelling. Chapel Hill: Univ. of North Carolina Press.

Klein, Lawrence R.; and Nakamura, Mitsugu 1962 Singularity in the Equation Systems of Econometrics: Some Aspects of the Problem of Multicollinearity. International Economic Review 3:274-299.

Kloek, T.; and Mennes, L. B. M. 1960 Simultaneous Equations Estimation Based on Principal Components of Predetermined Variables. Econometrica 28:45-61.

Koopmans, Tjalling C.; Rubin, Herman; and Leipnik, R. B. (1950) 1958 Measuring the Equation Systems of Dynamic Economics. Pages 53-237 in Tjalling C. Koopmans (editor), Statistical Inference in Dynamic Economic Models. Cowles Commission for Research in Economics, Monograph No. 10. New York: Wiley.

Malinvaud, Edmond (1964) 1966 Statistical Methods of Econometrics. Chicago: Rand McNally. First published in French.

Mann, H. B.; and Wald, Abraham 1943 On the Statistical Treatment of Linear Stochastic Difference Equations. Econometrica 11:173-220.

Strotz, Robert H.; and Wold, Herman 1960 A Triptych on Causal Chain Systems. Econometrica 28:417-463.

Summers, Robert 1965 A Capital Intensive Approach to the Small Sample Properties of Various Simultaneous Equation Estimators. Econometrica 33:1-41.

Theil, Henri (1958) 1961 Economic Forecasts and Policy. 2d ed., rev. Amsterdam: North-Holland Publishing.

Tinbergen, Jan 1939 Statistical Testing of Business-cycle Theories. Volume 2: Business Cycles in the United States of America: 1919-1932. Geneva: League of Nations, Economic Intelligence Service.

Wald, Abraham 1940 The Fitting of Straight Lines if Both Variables Are Subject to Error. Annals of Mathematical Statistics 11:284-300.

Wold, Herman 1965 A Fix-point Theorem With Econometric Background. Arkiv för Matematik 6:209-240.

Zellner, Arnold; and Theil, Henri 1962 Three-stage Least Squares: Simultaneous Estimation of Simultaneous Equations. Econometrica 30:54-78.