Linear Hypotheses
I. Regression, by E. J. Williams
II. Analysis of Variance, by Julian C. Stanley
III. Multiple Comparisons, by Peter Nemenyi
I REGRESSION
Regression analysis, as it is presented in this article, is an important and general statistical tool. It is applicable to situations in which one observed variable has an expected value that is assumed to be a function of other variables; the function usually has a specified form with unspecified parameters. For example, an investigator might assume that under appropriate circumstances the expected score on an examination is a linear function of the length of training period. Here there are two parameters, the slope and the intercept of the line. The techniques of regression analysis may be classified into two kinds: (1) testing the concordance of the observations with the assumed model, usually in the framework of some broader model, and (2) carrying out estimation, or other sorts of inferences, about the parameters when the model is assumed to be correct. This area of statistics is sometimes known as “least squares,” and in older publications it was called “the theory of errors.”
In the regression relations discussed in this article only one variable is regarded as random; the others are either fixed by the investigator (where experimental control is possible) or selected in some way from among the possible values. The relation between the expected value of the random variable (called the dependent variable, the predictand, or the regressand) and the nonrandom variables (called regression variables, independent variables, predictors, or regressors) is known as a regression relation. Thus, if a random variable Y, depending on a variable x, varies at random about a linear function of x, we can write
Y = β_{0} + β_{1}x + e,
which expresses a linear regression relation. The parameters β_{0} and β_{1} are the regression coefficients or parameters, and e is a random variable with expected value zero. Usually the e’s corresponding to different values of Y are assumed to be uncorrelated and to have the same variance. If η denotes the expected value of Y, the basic relation may be expressed alternatively as
E(Y)= η = β_{0} + β_{1}x.
The parameters in the relation will be either unknown or given by theory; observations of Y for different values of x provide the means of estimating these parameters or testing the concordance of the simple linear structure with the data.
Linear models, linear hypotheses
A regression relation that is linear in the unknown parameters is known as a linear model, and the assertion of such a model as a basis for inference is the assertion of a linear hypothesis. Often the term “linear hypothesis” refers to a restriction on the linear model (for example, specifying that a parameter has the value 7 or that two parameters are equal) that is to be tested. The importance of the linear model lies in its ease of application and understanding; there is a well-developed body of theory and techniques for the statistical treatment of linear models, in particular for the estimation of their parameters and the testing of hypotheses about them.
Needless to say, the description of a phenomenon by means of a linear model is usually a matter of convenience; the model is accepted until some more elaborate one is required. Nevertheless, the linear model has a wide range of applicability and is of great value in elucidating relationships, especially in the early stages of an investigation. Often a linear model is applicable only after transformations of the independent variables (like x in the above example), the dependent variable (Y, above), or both [see STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, article on TRANSFORMATIONS OF DATA].
In its most general form, regression analysis includes a number of other statistical techniques as special cases. For instance, it is not necessary that the x’s be defined as metric variables. If the values of the observations on Y are classified into a number of groups, say p, then the regression relation is written E(Y) = β_{1}x_{1} + β_{2}x_{2} + ... + β_{p}x_{p}, and x_{i} may be taken to be 1 for all observations in the ith group and 0 for all the others. The p x-variables will then specify the different groups, and the regression relation will define the mean value of Y for each group. In the simplest case, with two groups,
E(Y) = β_{1}x_{1} + β_{2}x_{2},
where x_{1} = 1 and x_{2} = 0 for the first group, and vice versa for the second.
The estimation of the population mean from a sample is a special case, since the model is then just
E(Y) = β_{0},
β_{0} being the mean of the population.
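The equivalence between indicator-variable regression and the direct computation of group means can be verified numerically. The following sketch uses wholly illustrative data and NumPy merely as a convenient least squares routine; it fits the two-group model E(Y) = β_{1}x_{1} + β_{2}x_{2} with 0/1 regressors and recovers the two group means as the fitted coefficients.

```python
import numpy as np

# Hypothetical scores for two groups (all values illustrative).
y = np.array([4.0, 5.0, 6.0, 10.0, 12.0])
group = np.array([0, 0, 0, 1, 1])  # group membership of each observation

# Indicator (0/1) regressors: x1 = 1 in the first group, x2 = 1 in the second.
X = np.column_stack([(group == 0).astype(float),
                     (group == 1).astype(float)])

# Least squares: the fitted coefficients are exactly the group means.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Here b[0] equals the mean of the first group and b[1] the mean of the second, which is what the regression formulation of group comparison asserts.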
This treatment of the comparison of different groups is somewhat artificial, although it is important to note that it falls under the regression rubric. Such comparisons are generally carried out by means of the technique known as the analysis of variance [see LINEAR HYPOTHESES, article on ANALYSIS OF VARIANCE].
When a regressor is not measured quantitatively but is given only as a ranking (for example, order in time of archeological specimens, social position of occupation), it may still provide a regression relation suitable for estimation or prediction. The simplest way to include such a variable in a relation is to replace the qualitative values (rankings) by arbitrary numerical scores, equally spaced (see, for example, Strodtbeck et al. 1957). More refined methods would use scores spaced according to some measure of “distance” between successive rankings; thus, in some instances the scores have been chosen so that their frequency distribution approximates a grouped normal distribution. Since any method of scoring is arbitrary, the method that is used must be judged by the relations based on it as well as by its theoretical cogency. Simple scoring systems, which can be easily understood, are usually to be preferred.
When both the dependent variable and the regression variable are qualitative, each may be replaced by arbitrary scores as indicated above. Alternative methods determine scores for the dependent variable that are most highly correlated (formally) with the regressor scores or, if the regressor scores for any set of data are open to choice, choose scores for both variables so that the correlation is maximized. The calculation and interpretation of the regression relations for such situations have been discussed by Yates (1948) and by Williams (1952).
Regression, correlation, functional relation
The regression relation is a one-way relation between variables in which the expected value of one random variable is related to nonrandom values of the other variables. It is to be distinguished from other types of statistical relations, in particular from correlation and functional relationships. [See MULTIVARIATE ANALYSIS, articles on CORRELATION.] Correlation is a relation between two or more random variables and may be described in terms of the amount of variation in one of the variables associated with variation in the other variable or variables. The functional relation, by contrast, is a relation between the expected values of random variables. If quantities related by some physical law are subject to errors of measurement, the functional relation between expected values, rather than the regression relation, is what the investigator generally wants to determine.
Although the regression relation relates a random variable to other, nonrandom variables, in many situations it will apply also when the regression variables are random; then the regression, conditional on the observed values of the random regression variables, is determined. Here the expected value of one random variable is related to the observed values of the other random variables. For a discussion of the fitting of regression lines when the regression variables are subject to error, see Madansky (1959). When more than one variable is to be considered as random, the problem is usually thought of as one of multivariate analysis [see MULTIVARIATE ANALYSIS].
History
The method of least squares, on which most methods of estimation for linear models are based, was apparently first published by Adrien Legendre (1805), but the first treatment along the lines now familiar was given by Carl Friedrich Gauss (1821; see also 1855). Gauss showed that the method gives estimators of the unknown parameters with minimum variance among unbiased linear estimators. This basic result is sometimes known as the Gauss-Markov theorem, and the least squares estimators as Gauss-Markov estimators.
The term “regression” was first used by Francis Galton, who applied it to certain relations in the theory of heredity, but the term is now applied to relationships in general and to nonlinear as well as to linear relationships.
The linearity of linear hypotheses rests in the way the parameters appear; the x’s may be highly nonlinear functions of underlying nonrandom variables. For example,

η = β_{1}e^{x_{1}} + β_{2} tan x_{1}

falls squarely under the linear hypothesis model, whereas

η = β_{1}e^{β_{2}x_{1}}

does not fit that model.
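The point can be illustrated by fitting a relation of the first kind by ordinary least squares: although e^{x_{1}} and tan x_{1} are nonlinear in x_{1}, they enter the model simply as known regressors. The sketch below uses simulated data with assumed coefficients β_{1} = 2, β_{2} = 3 (all values illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.1, 1.0, 20)

# Simulate data from eta = 2*exp(x) + 3*tan(x) plus a small error term.
beta = np.array([2.0, 3.0])
y = beta[0] * np.exp(x) + beta[1] * np.tan(x) + 0.01 * rng.standard_normal(x.size)

# The model is linear in beta1 and beta2 even though it is nonlinear in x:
# each regressor is just a known nonlinear transform of x.
X = np.column_stack([np.exp(x), np.tan(x)])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
```

No iterative nonlinear fitting is needed; the second relation above, by contrast, could not be handled this way because β_{2} appears inside the exponential.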
There is now a vast literature dealing with the linear model, and the subject is also treated in most statistical textbooks.
Application in the social sciences
There has been a good deal of discussion about the type of model that should be used to describe relations between variables in the social sciences, particularly in economics. Linear regression models have often been considered inadequate for complex economic phenomena, and more complicated models have been developed. Recent work, however, indicates that ordinary linear regression methods have a wider scope than had been supposed. For example, there has been much discussion about how to treat data correlated in time, for which the residuals from the regression equation (the e’s) show autocorrelation. This autocorrelation may be the result of autocorrelation in the variables not included in the model. Geary (1963) suggests that in such circumstances the inclusion of additional regression variables may effectively eliminate the autocorrelation among the residuals, so that standard methods may be applied.
Further discussion of the applicability of regression methods to economic data is given by Ezekiel and Fox (1930, chapters 20 and 24) and also by Wold and Juréen (see Wold 1953).
Investigators should be encouraged to employ the simple methods of regression analysis as a first step before turning to more elaborate techniques. Despite the relative simplicity of its ideas, it is a powerful technique for elucidating relations, and its results are easily understood and applied. More elaborate techniques, by contrast, do not always provide a readily comprehensible interpretation.
Assumptions in regression analysis
A regression model may be expressed in the following way:
E(Y) = η = β_{0} + β_{1}x_{1} + ... + β_{p}x_{p},
Y = η + e,
where Y is the random variable, η is its expected value, the x’s are known variables, the β’s are unknown coefficients, and e is a random error or deviation with zero mean. In the notation for variables, either fixed or random, subscripts are used only to distinguish the different variables but not to distinguish different observations of the same variable. The context generally makes the meaning clear. Thus, the above expression is an abbreviated form of
E(Y_{j}) = η_{j} = β_{0} + β_{1}x_{1j} + β_{2}x_{2j} + ... + β_{p}x_{pj},   j = 1, 2, ..., n;   Y_{j} = η_{j} + e_{j}.
This model is perfectly general; however, in estimating the coefficients, it is usually assumed that the e_{j} are mutually uncorrelated and are of equal variance (homoscedastic).
If there is no regressor variable that is identically one (as in the two-sample situation described earlier), the β_{0} term might well be omitted. This is primarily a matter of notational convention.
The additional assumption that the errors are normally distributed is convenient and simplifies the theory. It can be shown that, on this assumption, the linear estimators given by least squares are in fact the maximum likelihood (m.l.) estimators of the parameters. In addition, the residual sum of squares ∑(Y − η̂)^{2} (see below) is the basis for the m.l. estimator of the error variance (σ^{2}).
Apart from the theoretical advantages of the assumption of normality of the e’s, there are the practical advantages that efficient methods of estimation and suitable tests of significance are relatively easy to apply and that the test statistics have well-known properties and are extensively tabulated. The normality assumption is often reasonable in applications, since even appreciable departures from it do not as a rule seriously invalidate regression analyses based upon normality [see ERRORS, article on EFFECTS OF ERRORS IN STATISTICAL ASSUMPTIONS].
Some departures from assumptions may be expected in certain situations. For example, if some of the measurements of Y are much larger than others, the associated errors, either errors of measurement or errors resulting from uncontrolled random effects, may well be correspondingly larger, so that the variances of the errors will be heterogeneous (the errors are heteroscedastic). Again, with annual data it is to be expected that errors may arise from unobserved factors whose influence from year to year will be associated, so the errors will not be independent (but see Geary 1963). It is often possible in particular cases to transform the data so that they conform more closely to the assumptions; for instance, a logarithmic or square-root transformation of Y will often give a variable whose errors have approximately constant variance [see STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, article on TRANSFORMATIONS OF DATA]. This will amount to replacing the linear model for Y with a linear model for the transformed variable. In practice this often gives a satisfactory representation of the data, any departure from the model being attributed to error.
The method of least squares determines, for the parameters β_{i} in the regression equation, estimators that minimize the sum of squares of deviations of the Y-values from the values given by the equation. This sum of squares is
Σ(Y – η)^{2} = Σ(Y – β_{0} – β_{1}x_{1} – ... – β_{p}x_{p})^{2}.
In the following discussion the estimated β’s are denoted by b’s, and the corresponding estimator of η is denoted by η̂, so that
η̂ = b_{0} + b_{1}x_{1} + ... + b_{p}x_{p}
and the minimized sum of squared deviations is ∑(Y − η̂)^{2}.
The method has the twofold merit of minimizing not only the sum of squares of deviations but also the variance of the estimators b_{i} (among unbiased linear estimators). Thus, for most practical purposes the method of least squares gives estimators with satisfactory properties. Sometimes such estimators are not appropriate—for example, when errors in one direction are more serious than those in the other—but those cases are usually apparent to the investigator.
The method of least squares applies equally well when the errors are heteroscedastic or even correlated, provided the covariance structure of the errors is known (apart from a constant of proportionality, which may be estimated from the data).
The method can be generalized to take account of the general correlation structure or, equivalently, a linear transformation of the observations may be used to reduce the problem to the simpler case of uncorrelated homoscedastic errors. (Details may be found in Rao 1965, chapter 4.)
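The reduction to the uncorrelated case can be sketched as follows, assuming for illustration an autoregressive-type error covariance that is known up to a constant of proportionality: premultiplying both the regressors and the observations by the inverse of a Cholesky factor of the covariance matrix produces uncorrelated, homoscedastic errors, after which ordinary least squares applies.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])

# An assumed covariance structure for the errors (AR(1)-like, illustrative).
rho = 0.6
V = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Simulate correlated errors and the response, with true coefficients (1, 2).
L = np.linalg.cholesky(V)
y = X @ np.array([1.0, 2.0]) + 0.1 * (L @ rng.standard_normal(n))

# Generalized least squares: whiten with L^{-1}, then apply ordinary
# least squares to the transformed observations.
Linv = np.linalg.inv(L)
b_gls, *_ = np.linalg.lstsq(Linv @ X, Linv @ y, rcond=None)
```

The whitened problem satisfies the standard assumptions, so all the usual least squares machinery (variances, tests) carries over.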
When the correlation structure is unknown, the method of least squares may still be applied. If the data are analyzed as though the errors are uncorrelated and homoscedastic, the estimators of the parameters will be unbiased, although they will be less precise than if based on the correct model.
On the other hand, if the assumed linear model is incorrect—for example, if the relation is quadratic in one of the variables but only a linear model is fitted—then the estimators are liable to serious bias.
Since the form of the underlying model is almost always unknown there is usually a corresponding risk of bias. This problem has been studied in various contexts, but there is still much to be done; see Box and Wilson (1951), Box and Andersen (1955), and Plackett (1960, chapter 2).
Simple linear regression
In the simple linear regression model the expected value of Y is a linear function of a single variable, x_{1}:
E(Y) = η = β_{0} + β_{1}x_{1}
The parameter β_{0} is the intercept, and the parameter β_{1} is the slope, of the regression line. This model is a satisfactory one in many cases, even if a number of variables affect the expected value of Y, for one of these may have a predominating influence, and although the omission of variables from the relation will lead to some bias, this may not be important in view of the increased simplicity of the model.
In studying the relation between two variables it is almost always desirable to plot a scatter diagram of the points representing the observations. The x-axis, or abscissa, is usually used for the regression variable and the y-axis, or ordinate, for the random variable. If the regression relation is linear the points should show a tendency to fall near a straight line, though if the variation is large this tendency may well be masked. Although for some purposes a line drawn “by eye” is adequate to represent the regression, in general such a line is not sufficiently accurate. There is always the risk of bias in both the position and the slope of the line. Because there is a tendency for the deviations from the line in both the x and y directions to be taken into account in determining the fit, lines fitted by eye are often affected by the scales of measurement used for the two axes. Since Y is the random variable, only the deviations in the y direction should be taken into account in determining the fit of the line. Often the investigator, knowing that there may be error in x_{1}, may attempt to take it into account. It should be understood that this procedure will give an estimate not of the regression relation but of an underlying structural relation, which often differs from the regression relation. Another and more serious shortcoming of lines drawn by eye is that they do not provide an estimate of the variance about the line, and such an estimate is almost always required.
The method of least squares is commonly used when an arithmetical method of fitting is required, because of its useful properties and its relative ease of application. The equations for the least squares estimators, b_{i}, based on n pairs of observations (x_{1}, Y), are as follows:
b_{1} = ∑Y(x_{1} − x̄_{1})/∑(x_{1} − x̄_{1})^{2},
b_{0} = Ȳ − b_{1}x̄_{1}.
Here the summation is over the observed values, with x̄_{1} = ∑x_{1}/n and Ȳ = ∑Y/n. (Note that the observations on x_{1} need not be all different, although they must not all be the same.) The estimated regression function is
η̂ = b_{0} + b_{1}x_{1}.
The minimized sum of squares of deviations is

∑(Y − η̂)^{2} = ∑(Y − Ȳ)^{2} − b_{1}^{2}∑(x_{1} − x̄_{1})^{2}.
The standard errors (estimated standard deviations) of the estimators may be derived from the minimized sum of squares of deviations. Two independent linear parameters have been fitted, and it may readily be shown that the expected value of this minimized sum of squares is (n — 2) σ^{2}, where σ^{2} is the common variance of the residual errors. Consequently, an unbiased estimator of σ^{2} is given by
s^{2} = ∑(Y − η̂)^{2}/(n − 2),
and this is the conventional estimator of σ^{2}. The sum of squares for deviations, ∑(Y − η̂)^{2}, is said to have n − 2 degrees of freedom, representing the number of linearly independent quantities on which it is based.
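The computations for b_{1}, b_{0}, and s^{2} can be collected in a few lines, following the formulas above (the data are illustrative):

```python
import numpy as np

# Illustrative data: x values fixed by the investigator, Y observed.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n = x.size

# Least squares estimators of slope and intercept.
xbar, ybar = x.mean(), y.mean()
b1 = np.sum(y * (x - xbar)) / np.sum((x - xbar) ** 2)
b0 = ybar - b1 * xbar

# Residual sum of squares and the unbiased variance estimator s^2,
# on n - 2 degrees of freedom (two parameters fitted).
eta_hat = b0 + b1 * x
s2 = np.sum((y - eta_hat) ** 2) / (n - 2)
```

With these six points the slope is about 1.98 and the intercept about 0.08; s^{2} then estimates the error variance about the line.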
The estimated variances of the estimators are

est. var(b_{1}) = s^{2}/∑(x_{1} − x̄_{1})^{2},
est. var(b_{0}) = s^{2}[1/n + x̄_{1}^{2}/∑(x_{1} − x̄_{1})^{2}],

and the estimated covariance is

est. cov(b_{0}, b_{1}) = −s^{2}x̄_{1}/∑(x_{1} − x̄_{1})^{2}.
Separate confidence limits for the parameters β_{0} and β_{1} may be determined from the estimators and their standard errors, using Student’s t-distribution [see ESTIMATION, article on CONFIDENCE INTERVALS AND REGIONS]. If t_{α,n−2} denotes the α-level of this distribution for n − 2 degrees of freedom, the 1 − α confidence limits for β_{1} are

b_{1} ± t_{α,n−2}s/√∑(x_{1} − x̄_{1})^{2}.

Confidence limits for the intercept, β_{0}, may be determined in a similar way but are not usually of interest. In a few cases it may be necessary to determine whether the estimator b_{0} is in agreement with some theoretical value of the intercept. Thus, in some situations it is reasonable to expect the regression line to pass through the origin, so that β_{0} = 0. It will then be necessary to test the significance of the departure of b_{0} from zero or, equivalently, to determine whether the confidence limits for β_{0} include zero.
When it is assumed that β_{0} = 0 and there is no need to test this hypothesis, then the regression has only one unknown parameter; in such a case the sum of squares for deviations from the regression line, used to estimate the residual variance, will have n — 1 degrees of freedom.
When the parameters β_{0} and β_{1} are both of interest, a joint confidence statement about them may be useful. The joint confidence region is usually an ellipse centered at (b_{0}, b_{1}) and containing all values of (β_{0}, β_{1}) that do not differ from (b_{0}, b_{1}) with statistical significance as measured by an F-test (see the section on significance testing, below). [The question of joint confidence regions is discussed further in LINEAR HYPOTHESES, article on MULTIPLE COMPARISONS.]
Choice of experimental values
The formula for the variance of the regression coefficient b_{1} shows that b_{1} is determined the more accurately, the larger the sum of squares ∑(x_{1} − x̄_{1})^{2} of the values of x_{1} about their mean. This is in accordance with common sense, since a greater spread of experimental values will magnify the regression effect yet will in general leave the error component unaltered. If accurate estimation of β_{1} were the only criterion, the optimum allocation of experimental points would be in equal numbers at the extreme ends of the possible range. However, the assumption that a regression is linear, although satisfactory over most of the possible range, is often likely to fail near the ends of the range; for this and other reasons it may be desirable to check the linearity of the regression, and to do so points other than the two extreme values must be observed. In practice, where little is known about the form of the regression relation it is usually desirable to take points distributed uniformly throughout the range. If the experimental points are equally spaced, this will facilitate the fitting of quadratic or higher degree polynomials, using tabulated orthogonal polynomials as described below.
Confidence limits for the regression line
The estimated regression function is
η̂ = b_{0} + b_{1}x_{1}
= Ȳ + b_{1}(x_{1} − x̄_{1}),
and corresponding to any specified value, x_{1}*, of x_{1}, the variance of η̂ is estimated as

est. var(η̂) = s^{2}[1/n + (x_{1}* − x̄_{1})^{2}/∑(x_{1} − x̄_{1})^{2}].
Thus, for any specified value of x_{1}, confidence limits for η can be determined according to the formula

η̂ ± t_{α,n−2}s√[1/n + (x_{1}* − x̄_{1})^{2}/∑(x_{1} − x̄_{1})^{2}].
The locus of these limits consists of the two branches of a hyperbola, lying on either side of the fitted regression line; this locus defines what may be described as a confidence curve. A typical regression line fitted to a set of points is shown in Figure 1 with the 95 per cent confidence curve shown as the two inside upper and lower curves, YL.
The above limits are appropriate for the estimated value of η corresponding to a given value of x_{1}. They do not, however, set limits to the whole line. Such limits are given by a method developed by Working and Hotelling, as described, for example, by Kendall and Stuart ([1943–1946] 1958–1966, vol. 2, chapter 28). [See also LINEAR HYPOTHESES, article on MULTIPLE COMPARISONS.] As might be expected, these limits lie outside the corresponding limits for the same probability for a single value of x_{1}. The limits may be regarded as arising from the envelope of all lines whose parameters fall within a suitable confidence region. These limits are given by

η̂ ± √(2F_{1−α;2,n−2}) s√[1/n + (x_{1} − x̄_{1})^{2}/∑(x_{1} − x̄_{1})^{2}],
where F_{1−α;2,n−2} is the tabulated value of the F-distribution with 2 and n − 2 degrees of freedom at confidence level 1 − α. These limits, for a 95 per cent confidence level, are shown as a pair of broken lines in Figure 1.
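Both sets of limits can be sketched with SciPy’s t and F quantiles (the data and the 95 per cent level are illustrative): the pointwise limits use t_{α,n−2}, while the Working-Hotelling limits replace it by √(2F) and so are uniformly wider.

```python
import numpy as np
from scipy import stats

# Illustrative data and the fitted line.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])
n = x.size
xbar = x.mean()
sxx = np.sum((x - xbar) ** 2)
b1 = np.sum(y * (x - xbar)) / sxx
b0 = y.mean() - b1 * xbar
s2 = np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2)

# Standard error of the estimated eta at each point of a grid of x* values.
xs = np.linspace(x.min(), x.max(), 7)
eta_hat = b0 + b1 * xs
se_eta = np.sqrt(s2 * (1.0 / n + (xs - xbar) ** 2 / sxx))

alpha = 0.05
# Pointwise limits: Student's t on n - 2 degrees of freedom.
t_crit = stats.t.ppf(1 - alpha / 2, n - 2)
lo_pt, hi_pt = eta_hat - t_crit * se_eta, eta_hat + t_crit * se_eta

# Working-Hotelling limits for the whole line: sqrt(2 F) in place of t.
wh_crit = np.sqrt(2 * stats.f.ppf(1 - alpha, 2, n - 2))
lo_wh, hi_wh = eta_hat - wh_crit * se_eta, eta_hat + wh_crit * se_eta
```

Both loci are hyperbolas about the fitted line, narrowest at x̄_{1} and widening as x_{1}* departs from the mean.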
Figure 1 — Regression line and associated 95 per cent confidence regions*
* The Y_{P} curves, although they appear straight in the figure, are hyperbolas like the other Y curves.
Source of data: Martin, Jean I., 1965, Refugee Settlers: A Study of Displaced Persons in Australia. Canberra: Australian National University.
The user of the confidence limits must be clear about which type of limits he requires. If he is interested in the limits on the estimated η for a particular value x_{1}*, or in only one pair of limits at a time, the inner limits, Y_{L}, will be appropriate; but if he is interested in limits for many values of x_{1}* (some of which may not be envisaged when the calculations are being made), the Working-Hotelling limits, Y_{WH}, will be needed.
Application of the regression equation
The regression equation is usually determined not only to provide an empirical law relating variables but also as a means of making future estimates or predictions. Thus, in studies of demand, regression relations of demand on price and other factors enable demand to be predicted for future occasions when one or more of these factors is varied. Such prediction is provided directly by the regression equation. It should be noted, however, that the standard error of prediction will be greater than the standard error of the estimated points (η̂) on the regression line. This is because a future observation will vary about its regression value with variance equal to the variance of individual values about the regression in the population. When standard errors are being quoted, it is important to distinguish between the standard error of the point η̂ on the regression line and the standard error of prediction. The estimated variance of prediction is

s^{2}[1 + 1/n + (x_{1}* − x̄_{1})^{2}/∑(x_{1} − x̄_{1})^{2}].
The outside upper and lower curves in Figure 1 are confidence limits for prediction, Y_{P}, based on this variance. Clearly, for making predictions of this sort there is little point in determining the regression line with great accuracy. The major part of the error in such cases will be the variance of individual values.
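The distinction between the two standard errors amounts to a single extra term, s^{2}, in the variance, as the following fragment shows (the summary quantities are carried over for illustration only):

```python
# Summary quantities from a fitted simple regression (values illustrative).
n, s2, xbar, sxx = 6, 0.0277, 3.5, 17.5
x_star = 5.0

# Variance of the estimated point on the regression line at x_star ...
var_eta = s2 * (1.0 / n + (x_star - xbar) ** 2 / sxx)

# ... versus the variance of a prediction of a *new* observation there,
# which adds the variance s^2 of an individual value about the line.
var_pred = s2 * (1.0 + 1.0 / n + (x_star - xbar) ** 2 / sxx)
```

For prediction, the added s^{2} usually dominates, which is why refining the estimate of the line itself buys little extra accuracy.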
The formula for the standard error of η̂, or of the prediction, shows that the error of estimation increases as the x_{1}-value departs from the mean of the sample, so that when the deviation from the mean is large the variance of estimate can be so great as to make the estimate worthless. This is one reason why investigators should be discouraged from attempting to draw inferences beyond the range of the observed values of x_{1}. The other reason is that the assumed linear regression, even though satisfactory within the observed range, may not hold true outside this range.
Inverse estimation
In many situations the investigator is primarily interested in determining the value of x_{1} corresponding to a given level or value, η*. Thus, although it is still appropriate to determine the regression of the random variable Y on the fixed variable x_{1}, the inference has to be carried out in reverse. For example, if a drug that affects the reaction time of individuals is being tested at different levels, the reaction time Y will be a random variable with regression on the dose level x_{1}. However, the purpose of the investigation may be to determine a dose level that will lead to a given time of reaction on the average. The experimental doses, being fixed, cannot be treated as random, so that it is inappropriate to determine a regression of x_{1} on Y, and such a pseudo regression would give spurious results. In such situations the value of x_{1} corresponding to a given value of η has to be estimated from the regression of Y on x_{1}.
The regression equation can be rearranged to give an estimator of x_{1} corresponding to a given value, η*:

x̂_{1}* = (η* − b_{0})/b_{1}.
The approximate estimated variance of the estimator is

est. var(x̂_{1}*) ≈ (s^{2}/b_{1}^{2})[1/n + (x̂_{1}* − x̄_{1})^{2}/∑(x_{1} − x̄_{1})^{2}].
A more precise method of treating such a problem is to determine confidence limits for η given x_{1} and to determine from these, by rearranging the equation, confidence limits for x_{1}. For the regression shown in Figure 1, the 95 per cent confidence curves (the inner curves, Y_{L}, on either side of the line) will in this way give confidence limits for x_{1} corresponding to a given value of η. The point at which the horizontal line Y = η* cuts the regression line gives the estimate of x_{1}; the points at which the line cuts the upper and lower curves give, respectively, lower and upper confidence limits for x_{1}. This may be demonstrated by an extension of the reasoning leading to confidence limits. [See ESTIMATION, article on CONFIDENCE INTERVALS AND REGIONS.]
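The point estimate and its approximate variance can be sketched directly (the coefficients and summary quantities below are illustrative, carried over from the earlier example):

```python
# Fitted regression (values illustrative): eta = b0 + b1 * x1.
b0, b1 = 0.08, 1.977
s2, n, xbar, sxx = 0.0277, 6, 3.5, 17.5

# Point estimate of the x1 that gives a target expected response eta*.
eta_star = 8.0
x_hat = (eta_star - b0) / b1

# Approximate (first-order) variance of the inverse estimator,
# as in the formula above.
var_x_hat = (s2 / b1 ** 2) * (1.0 / n + (x_hat - xbar) ** 2 / sxx)
```

The graphical inversion described in the text is preferable when b_{1} is poorly determined, since the first-order variance approximation then breaks down.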
Sometimes, rather than a hypothetical regression value, η*, a single observed value, y* (not in the basic sample), is given, and limits are required for the value of x_{1} that could be associated with such a value. The estimator x̂_{1}* is given by

x̂_{1}* = (y* − b_{0})/b_{1},
and its approximate estimated variance (which must take into account the variation between responses on Y to a given value of x_{1}) is

est. var(x̂_{1}*) ≈ (s^{2}/b_{1}^{2})[1 + 1/n + (x̂_{1}* − x̄_{1})^{2}/∑(x_{1} − x̄_{1})^{2}].
Using this augmented variance, confidence limits on x_{1} corresponding to a given y* may be found. For more precise determination of the confidence limits for prediction, the locus of limits for y* given x_{1} may be inverted to give limits for x_{1} given y*. In Figure 1, the outer curves are these loci (for the 95 per cent confidence level); the 95 per cent limits for x_{1} will be given by the intersection of the line Y = y* with these confidence curves for prediction.
Multiple regression
In many situations where a single regression variable is not adequate to represent the variation in the random variable Y, a multiple regression is appropriate. In other situations there may be only one regression variable, but the assumed relation, rather than being linear, is a quadratic or a polynomial of higher degree. Since both multiple linear regression and polynomial regression relations are linear in the unknown parameters, the same techniques are applicable to both; in fact, polynomial regression is a special case of multiple regression. The number of variables to include in a multiple regression, or the degree of polynomial to be applied, is to some extent a matter of judgment and convenience, although it must be remembered that a regression equation containing a large number of variables is usually inconvenient to use as well as difficult to calculate. With the use of electronic computers, however, there is greater scope for increasing the number of regression variables, since the computations are routine.
Consider the multiple regression equation
E(Y) = η = β_{0} + β_{1}x_{1} + ... + β_{p}x_{p},
with p regression variables and a constant term. The estimation of these p + 1 unknown parameters can be systematically carried out if β_{0} is also regarded as a regression coefficient, corresponding to a regression variable x_{0} that is always unity. As in simple regression, the method of least squares provides unbiased linear estimators of the coefficients with minimum variance and also provides estimators of the standard errors of these coefficients. The quantities required for determining the estimators are the sums of squares and products of the x-values, the sums of products of the observed Y with each of the x-values, and the sum of squares of Y. The method of least squares gives a set of linear equations for the b’s, called the normal equations:
t_{h0}b_{0} + t_{h1}b_{1} + ... + t_{hp}b_{p} = u_{h},   h = 0, 1, ..., p,

where t_{hi} = t_{ih} = ∑x_{h}x_{i} and u_{i} = ∑Yx_{i}. These equations can be written in matrix form as
Tb = u,
where T = (t_{hi}) and u is the vector of the u_{i}. The solution requires the inversion of the matrix T, the inverse matrix being denoted by T^{−1} (with typical element t^{hi}). The solution may be written in matrix form as
b = T^{−1}u
or in extended form as
b_{0} = t^{00}u_{0} + t^{01}u_{1} + ... + t^{0p}u_{p},
and so forth.
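In matrix terms the whole computation is short. The following sketch uses simulated data (all values illustrative); solving Tb = u directly is numerically preferable to forming T^{−1} explicitly, although the inverse is still needed for the estimated variances t^{ii}s^{2}.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 2

# Regressors, with x0 identically 1 carrying the constant term.
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
X = np.column_stack([np.ones(n), x1, x2])

# Simulated response with assumed coefficients (1.0, 0.5, -2.0).
y = 1.0 + 0.5 * x1 - 2.0 * x2 + 0.1 * rng.standard_normal(n)

# Normal equations: T b = u, with T = X'X and u = X'y.
T = X.T @ X
u = X.T @ y
b = np.linalg.solve(T, u)

# Estimated variances of the b_i: t^{ii} * s^2, with t^{ii} from T^{-1}.
Tinv = np.linalg.inv(T)
s2 = np.sum((y - X @ b) ** 2) / (n - p - 1)
se = np.sqrt(s2 * np.diag(Tinv))
```

The diagonal and off-diagonal elements of s^{2}T^{−1} estimate the variances and covariances of the coefficients, exactly as stated in the text.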
The variance of b_{i} is t^{ii}σ^{2}, and the covariance of b_{i} and b_{j} is t^{ij}σ^{2}. It should be remarked that in the special case of “regression through the origin” — that is, when the constant term β_{0} is assumed to be zero — the first equation and the first term of each other equation are omitted; the constant regressor x_{0} and its coefficient β_{0} thus have the same status as any other regression variable.
When the constant term is included, computational labor may be reduced and arithmetical accuracy increased if the sums of squares and products are taken about the means. That is, the t_{hi} and u_{i} are replaced by
t_{hi} = ∑(x_{h} − x̄_{h})(x_{i} − x̄_{i})
and
u_{i} = ∑(Y − Ȳ)(x_{i} − x̄_{i}),
respectively. All the sums of products with zero subscripts then vanish, and the sums of squares are reduced in magnitude. The constant term has to be estimated separately; it is given by
b_{0} = Ȳ − b_{1}x̄_{1} − ··· − b_{p}x̄_{p}.
The computational aspects of matrix inversion and the determination of the regression coefficients are dealt with in many statistical texts, including Williams (1959); in addition, many programs for matrix inversion are available for electronic computers.
Effect of heteroscedasticity
When the error variance of the dependent variable Y is different in different parts of its range (or, strictly, of the range of its expected value, η), estimators of regression coefficients ignoring the heteroscedasticity will be unbiased but of reduced accuracy, as already mentioned. The calculation of improved estimators may then sometimes be necessary.
There are some problems in taking heteroscedasticity into account. Among them is the problem of specification: defining the relation between expected value and variance. Often, with adequate data, the estimated values η̂ from the usual unweighted regression line can be grouped and the mean squared deviation from these values for each group used as a rough measure of the variance. The regression can then be refitted, each value being given a weight inversely proportional to the estimated variance. Two iterations of this method are likely to give estimates of about the accuracy practically attainable. If an empirical relation between expected value and error variance can be deduced, this simplifies the problem somewhat; however, the weight for each observation has to be determined from a provisionally fitted relation, so iteration is still required.
To calculate a weighted regression, each observation Y_{j} (j = 1, 2, ..., n) is given a weight w_{j} instead of unit weight as in the standard calculation. These weights will be the reciprocals of the estimated variances of each value. Then, if weighted quantities are distinguished by the subscript w,
t_{whi} = Σwx_{h}x_{i},
u_{wi} = ΣwYx_{i},
and the normal equations are
t_{w00}b_{w0} + t_{w01}b_{w1} + ··· + t_{w0p}b_{wp} = u_{w0},
and so on, or in matrix form
T_{w}b_{w} = u_{w}.
The solution is
b_{w} = T_{w}^{−1}u_{w},
and the variances of the estimators b_{wi} are approximately t_{w}^{ii}σ^{2}.
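The weighted calculation parallels the unweighted one, each sum of squares or products simply carrying the weights. A small sketch, again with hypothetical data and with one regressor plus the constant term:

```python
# Weighted least squares: each observation carries a weight w_j (the
# reciprocal of its estimated variance), so t_whi = sum w x_h x_i and
# u_wi = sum w Y x_i. Hypothetical data and weights; a sketch only.

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.9, 2.1, 2.9, 4.2]
ws = [4.0, 4.0, 1.0, 1.0]   # later observations assumed noisier

t00 = sum(ws)
t01 = sum(w * x for w, x in zip(ws, xs))
t11 = sum(w * x * x for w, x in zip(ws, xs))
u0 = sum(w * y for w, y in zip(ws, ys))
u1 = sum(w * y * x for w, y, x in zip(ws, ys, xs))

# Solve the 2x2 weighted normal equations T_w b_w = u_w
det = t00 * t11 - t01 * t01
bw0 = (t11 * u0 - t01 * u1) / det
bw1 = (t00 * u1 - t01 * u0) / det
```

Setting every weight to unity recovers the ordinary least-squares estimators.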
When the weights are estimated from the data, as in the iterative method just described, some allowance has to be made in an exact analysis for errors in the weights. This inaccuracy will somewhat reduce the precision of the estimators. However, for most practical purposes, and provided that the number of observations in each group for which weights are estimated is not too small, the errors in the weights may be ignored. (For further discussion of this question see Cochran & Carroll 1953.)
Estimability of the coefficients
It is intuitively clear in a general way that the p + 1 regression variables included in a regression equation should not be too nearly linearly dependent on one another, for then it might be expected that these regression variables could be approximately expressed in terms of a smaller number.
More precisely, in order that meaningful estimators of the regression coefficients exist, it is necessary that the variables be linearly independent (or, equivalently, T must be nonsingular). That is, no one variable should be expressible as a linear combination of the others or, expressed symmetrically, no linear combination of the variables vanishes unless all its coefficients are zero. Clearly, if only p − r of the variables are linearly independent, then the regression relation may be represented as a regression on these p − r, together with arbitrary multiples of the vanishing linear combinations. From the practical point of view, this lack of estimability will cause no problems, provided that the regression on a set of p − r linearly independent variables is calculated. Estimation from the equation will be unaffected, but for testing the significance of the regression it must be noted that the regression sum of squares has not p + 1 but p − r degrees of freedom, and the residual has n − p + r.
However, if the lack of estimability is ignored, the calculations to determine the p + 1 coefficients either will fail (since the matrix T, being singular, has no inverse) or will give misleading results (if an approximate value of T, having an inverse, is used in calculation and the lack of estimability is obscured).
When the regression variables, although linearly independent, are barely so (in the sense that the matrix T, although of rank p + 1, is “almost singular,” having a small but nonvanishing determinant), the regression coefficients will be estimable but will have large standard errors. In typical cases, many of the estimated coefficients will not differ with statistical significance from zero; this merely reflects the fact that the corresponding regression variable may be omitted from the equation and the remaining coefficients adjusted without significant worsening of the fit.
In this situation, as in the case of linear dependence, these effects are not usually important in practice; however, they may suggest the advisability of reducing the number of regression variables included in the equation. [For further discussion, see Statistical Identifiability.]
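The inflation of standard errors by near-collinearity can be seen numerically: two nearly proportional regressors give a matrix T with a tiny determinant, and the diagonal elements of T^{−1}, which multiply σ² to give the coefficient variances, become large. A sketch with hypothetical data, using regression through the origin for brevity:

```python
# Two nearly collinear regressors make T almost singular: the determinant
# is tiny and the diagonal of T^{-1} (the variance multipliers for the
# coefficients) is large. Hypothetical data; no constant term, for brevity.

x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.1, 2.9, 4.0]   # nearly a copy of x1

t11 = sum(a * a for a in x1)
t12 = sum(a * b for a, b in zip(x1, x2))
t22 = sum(b * b for b in x2)

det = t11 * t22 - t12 * t12     # close to zero for collinear regressors
v1 = t22 / det                  # t^{11}: variance multiplier for b1
v2 = t11 / det                  # t^{22}: variance multiplier for b2
```

Here the determinant is well below 1 while the variance multipliers exceed 50, so each coefficient alone is poorly determined even though the pair jointly fits well.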
Conditions on the coefficients
Sometimes the regression coefficients βi are assumed to satisfy some conditions based on theory. Provided these conditions are expressible as linear equations in the coefficients, the method of least squares carries through and leads, as before, to unbiased estimators satisfying the conditions and with minimum variance among linear estimators. It will be clear that with p + 1 regression coefficients subject to r + 1 independent linear restrictions, r + 1 of the coefficients may be eliminated, so that the restricted regression is equivalent to one with p — r coefficients. Thus, in principle there is a choice between expressing the model in terms of p — r unrestricted coefficients or p + 1 restricted ones; often the latter has advantages of symmetry and interpretability.
A simple example of restricted regression is one in which η is a weighted average of the x’s but with unknown weights, β_{1}, ..., β_{p}. Here the side conditions would be β_{0} = 0, β_{1} + ··· + β_{p} = 1.
As the introduction of side conditions effectively reduces the number of linearly independent coefficients, such conditions are useful in restoring estimability when the coefficients are nonestimable. In many problems these side conditions may be chosen to have practical significance. For example, where an overall mean and a number of treatment “effects” are being estimated, it is conventional to specify the effects so that their mean vanishes; with this specification they represent deviations from the overall mean.
When a restricted regression is being estimated, it will often be possible and of interest to estimate the unrestricted regression as well, in order to see the effect of the restrictions and to test whether the data are concordant with the conditions assumed. The test of significance consists of comparing the (p + 1)-variable (unrestricted) regression with the (p − r)-variable (restricted) regression, in the manner described in the section on significance testing. This test of concordance is independent of the test of significance of any of the restricted coefficients.
Further details and examples of restricted regression are given by Rao (1965, p. 189) and Williams (1959, pp. 49–58). In the remainder of this article, the notation will presume unrestricted regression.
Missing values
When observations on some of the variables are missing, the simplest and usually the only practicable procedure is to ignore the corresponding values of the other variables — that is, to work only with complete sets of observations. However, it is sometimes possible to make use of the incomplete data, provided some additional assumptions are made. Methods have been developed under the assumption that (a) the missing values are in some sense randomly deleted, or the assumption that (b) the variables are all random and follow a multivariate normal distribution. Assumption (b) is treated by Anderson (1957) and Rao (1952, pp. 161–165). It is sometimes found, after the least squares equations for the constants in a regression relation have been set up, that some of the values of the dependent variable are unreliable or missing altogether. Rather than recalculate the equations it is often more convenient to replace the missing value by the value expected from the regression relation. This substitution conserves the form of the estimating equations, usually with little disturbance to the significance tests or the variances of the estimators.
The techniques of “fitting missing values” have been most fully developed for experiments designed in such a way that the estimators of various constants are either uncorrelated or have a symmetric pattern of correlations and the estimating equations have a symmetry of form that simplifies their solution. Missing values in such experiments destroy the symmetry and make estimation more difficult; it is therefore a great practical convenience to replace the missing values. Details of the method applied to designed experiments will be found in Cochran and Cox (1950). For applications to general regression models see Kruskal (1961).
The technique is itself an application of the method of least squares. To replace a missing value Y_{j}, a value η̂_{j} is chosen so as to minimize its contribution to the residual sum of squares. Thus, the estimate is equivalent to the one that would have been obtained by a fresh analysis; the calculation is simplified by the fact that estimates for only one or a few values are being calculated. The degrees of freedom for the residual sum of squares are reduced by the number of values thus fitted. For most practical purposes it is then sufficiently accurate to treat the fitted values as though they were original observations. The exact analysis is described by Yates (1933) and, in general terms, by Kruskal (1961).
Significance testing
In order to determine the standard errors of the regression coefficients and to test their significance, it is necessary to estimate the residual variance, σ^{2}. The sum of squares of deviations, ∑(Y − η̂)^{2}, which may readily be shown to satisfy
∑(Y − η̂)^{2} = ∑Y^{2} − ∑b_{i}u_{i},
is found under p + 1 constraints and so may be said to have n − p − 1 degrees of freedom; if the model assumed is correct, so that the deviations are purely random, the expected value of the sum of squares is (n − p − 1)σ^{2}. Accordingly, the residual mean square,
s^{2} = (∑Y^{2} − ∑b_{i}u_{i})/(n − p − 1),
is an unbiased estimator of the residual variance. The variances of the regression coefficients are estimated by
est. var(b_{i}) = t^{ii}s^{2},
and the standard errors are the square roots of these quantities. The inverse matrix thus is used both in the calculation of the estimators and in the determination of their standard errors. From the off-diagonal elements t^{hi} of the inverse matrix are derived the estimated covariances between the estimators,
est. cov(b_{h}, b_{i}) = t^{hi}s^{2}.
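Continuing the earlier straight-line sketch (hypothetical data), the residual mean square and the standard errors follow directly from the inverse elements:

```python
# Residual mean square s^2 = (sum Y^2 - sum b_i u_i) / (n - p - 1) and
# standard errors se(b_i) = sqrt(t^{ii} s^2) for a straight-line fit.
# Hypothetical data; a sketch of the computations described above.
import math

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n, p = len(xs), 1

t00, t01, t11 = n, sum(xs), sum(x * x for x in xs)
u0, u1 = sum(ys), sum(y * x for y, x in zip(ys, xs))
det = t00 * t11 - t01 * t01
inv00, inv01, inv11 = t11 / det, -t01 / det, t00 / det  # elements of T^{-1}
b0 = inv00 * u0 + inv01 * u1
b1 = inv01 * u0 + inv11 * u1

ss_total = sum(y * y for y in ys)
ss_regr = b0 * u0 + b1 * u1               # sum b_i u_i
s2 = (ss_total - ss_regr) / (n - p - 1)   # residual mean square
se_b0 = math.sqrt(inv00 * s2)             # t^{00} s^2
se_b1 = math.sqrt(inv11 * s2)             # t^{11} s^2
```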
The splitting of the total sum of squares of Y into two parts, a part associated with the regression effects and a residual part independent of them, is a particular example of what is known as the analysis of variance [see Linear Hypotheses, article on Analysis of Variance].
Testing for regression effects. The regression sum of squares, being based on p + 1 estimated quantities, will have p + 1 degrees of freedom. When regression effects are nonexistent, the expected value of each part is proportional to its degrees of freedom. Accordingly, it is often convenient and informative to present these two parts, and their corresponding mean squares, in an analysis-of-variance table, such as Table 1.
In the table, the final column gives the expected values of the two mean squares; it shows that real regression effects inflate the regression sum of squares but not the residual sum of squares. This fact provides the basis for tests of significance of a calculated regression, since large values of the ratio of regression mean square to residual mean square give evidence for the existence of a regression relation.
Significance of a single coefficient. The question may arise whether one or more of the regression variables contribute to the relation anything that is not already provided by the other variables. In such circumstances the relevant hypothesis to be examined is that the β’s corresponding to these variables are zero. A more general hypothesis that may sometimes need to be tested is that certain of the β’s take assigned values.
The simplest test is that of the statistical significance of a single coefficient — say, b_{i}. The test will be of its departure from zero, if the contribution of x_{i} to the regression is in question. More generally, when β_{i} is specified, as, say, β^{*}_{i}, it will be relevant to test the significance of departure of b_{i} from β^{*}_{i}. The significance test in either case is the same; the squared difference between estimated and hypothesized values is compared with the estimated variance of that difference, which is s^{2}t^{ii}.
The ratio F = (b_{i} − β^{*}_{i})^{2}/(s^{2}t^{ii}) has the F-distribution with 1 and n − p − 1 degrees of freedom if the difference is in fact due to sampling fluctuations alone; in this case, the F-statistic is just the square of the usual t-statistic. When β_{i} differs
Table 1 – Analysis-of-variance table for testing regression effects

Source     | Degrees of freedom | Sum of squares                          | Mean square         | Expected mean square
Regression | p + 1              | ∑b_{i}u_{i}                             | ∑b_{i}u_{i}/(p + 1) | σ^{2} + ∑_{h}∑_{i}β_{h}β_{i}t_{hi}/(p + 1)
Residual   | n − p − 1          | ∑Y^{2} − ∑b_{i}u_{i} = (n − p − 1)s^{2} | s^{2}               | σ^{2}
Total      | n                  | ∑Y^{2}                                  |                     |
from β^{*}_{i}, the F-statistic will tend to be larger, so that a right-tail test is indicated.
Testing several coefficients. To test a number of regression variables — or, more precisely, their regression coefficients — the method of least squares is equivalent to fitting a regression with and without the variables in question and testing the difference in the regression sums of squares against the estimated error variance. To choose a specific example, suppose the last q coefficients in a p-variable regression are to be tested. If the symbol S^{2} is used to stand for sum of squares, the sum of squares for regression on all p variables may be written
with p + 1 degrees of freedom, and the corresponding sum of squares on the first p — q variables as
with p − q + 1 degrees of freedom. The difference, a sum of squares with q degrees of freedom, provides a criterion for testing the significance of the q regression coefficients. The ratio
has, under the null hypothesis that the last q coefficients are zero, the F-distribution with q and n − p − 1 degrees of freedom. This simultaneous test of q coefficients may also be adapted to testing the departure of the q coefficients from theoretical values, not necessarily zero.
The significance test may be conveniently set out as in Table 2, where only the mean squares required for the significance test appear in the last column.
When q = 1, this test reduces to the test for a single regression coefficient, and the F-ratio
is then identical with the F-ratio given above for making such a test.
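The comparison of regression sums of squares with and without the variables in question can be sketched numerically. Here, with hypothetical data, q = 1: the full model is a straight line and the reduced model is the constant alone:

```python
# F-test for added variables: the gain in regression sum of squares from
# adding q regressors is compared with the residual mean square of the full
# model. Here q = 1 (the slope term). Hypothetical data; a sketch only.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
n, p, q = len(xs), 1, 1

# Full model: solve the 2x2 normal equations for b0, b1
t00, t01, t11 = n, sum(xs), sum(x * x for x in xs)
u0, u1 = sum(ys), sum(y * x for y, x in zip(ys, xs))
det = t00 * t11 - t01 * t01
b0 = (t11 * u0 - t01 * u1) / det
b1 = (t00 * u1 - t01 * u0) / det

ss_full = b0 * u0 + b1 * u1          # regression SS on all variables
ss_reduced = u0 * u0 / n             # regression SS on the constant alone
s2 = (sum(y * y for y in ys) - ss_full) / (n - p - 1)

F = (ss_full - ss_reduced) / q / s2  # F with q and n - p - 1 d.f.
```

A large F, referred to the F-distribution with q and n − p − 1 degrees of freedom, indicates that the added variables contribute to the relation.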
Linear combinations of coefficients. Sometimes it is necessary to test the significance of one or more linear combinations of the coefficients — that is, to test hypotheses about linear combinations of the β’s. A common example is the comparison of two coefficients, β_{1} and β_{2}, say, for which the comparison b_{1} − b_{2} is relevant. The F-test applies to such comparisons also. Thus, for the difference b_{1} − b_{2}, the estimated variance is s^{2}(t^{11} − 2t^{12} + t^{22}), and F = (b_{1} − b_{2})^{2}/[s^{2}(t^{11} − 2t^{12} + t^{22})], with 1 and n − p − 1 degrees of freedom.
In general, to test the departure from zero of k linear combinations of regression coefficients the procedure is as follows. Let the linear combinations (expressed in matrix notation) be
Γ′b,
where Γ is a (p + 1) × k matrix of known constants. Then the estimated covariance matrix of these linear combinations is
s^{2}Γ′T^{−1}Γ,
and the F-ratio is
F = (Γ′b)′(Γ′T^{−1}Γ)^{−1}(Γ′b)/(ks^{2}),
with k and n − p − 1 degrees of freedom. Of course, this test can also be adapted to testing the departure of these linear combinations from preassigned values other than zero.
When the population coefficients β_{i} are in fact nonzero, the expected value of the regression mean square in the analysis of variance shown in Table 1 will be larger than σ^{2} by a term that depends on both the magnitude of the coefficients and the accuracy with which they are estimated (see, for example, the last column of Table 1). Clearly, the greater this term, called the noncentrality, the greater the probability that the null hypothesis will be rejected at the adopted significance level. The F-test has certain optimum properties, but other tests may be preferred in special circumstances.
Table 2 – Analysis-of-variance table for testing several regression coefficients

Source                        | Degrees of freedom | Sum of squares | Mean square
Regression on p − q variables | p − q + 1          | ...            |
Additional q variables        | q                  | ...            | (additional sum of squares)/q
Regression on all p variables | p + 1              | ...            |
Residual                      | n − p − 1          | ...            | s^{2}
Multivariate analogues
Although hitherto only the regression of a single dependent variable Y on one or more regressors x_{i} has been discussed, it will be realized that often the simultaneous regressions of a number of random variables on the same regressors will be of importance. For instance, in a sociological study of immigrants the regressions of annual income and size of family on age, educational level, and period of residence in the country may be determined; here there are two dependent variables and three regressors.
Often the relations among the different dependent variables will also be of interest, or various linear combinations of the variables, rather than the original variables themselves, may be studied. The linear combination that is most highly correlated with the regressors may sometimes be relevant to the investigation, but the linear compounds will usually be chosen for their practical relevance rather than their statistical properties. [For further discussion of multivariate analogues, see Multivariate Analysis, especially the general article, Overview, and the article on Classification and Discrimination.]
Polynomial regression
When the relation between two variables, x_{1} and Y, appears to be curvilinear, it is natural to fit some form of smooth curve to the data. For some purposes a freehand curve is adequate to represent the relation, but if the curve is to be used for prediction or estimation and standard errors are required, some mathematical method of fitting, such as the method of least squares, must be used. The freehand fitting of a curvilinear relation has all the disadvantages of freehand fitting of a straight line, with the added disadvantage that it is more difficult to distinguish real trends from random fluctuations.
The polynomial form is
η = β_{0} + β_{1}x_{1} + β_{2}x_{1}^{2} + ··· + β_{p}x_{1}^{p}.
Being a linear model, it has the advantages of simplicity, flexibility, and relative ease of calculation. It is for such reasons, not because it necessarily represents the theoretical form of the relation, that a polynomial regression is often fitted to data.
Orthogonal polynomials
The computations in polynomial regression are exactly the same as those in multiple regression, except that some simplification of the arithmetic may be introduced if the same values of x_{1} are used repeatedly. Then instead of using the powers of x_{1} as the regression variables, these are replaced by orthogonal polynomials of successively increasing degree, so defined that the sum of products of any pair of them, over their chosen values, is zero.
This procedure has the twofold advantage that, first, all the off-diagonal elements of the matrix T are zero, so the calculation of regression coefficients and their standard errors is much simplified, and, second, the regression coefficient on each polynomial and the corresponding sum of squares can be independently determined.
Because it is common for investigators to use data with values of the independent variables equally spaced, the orthogonal polynomials for this particular case have been extensively tabulated. Fisher and Yates (1938) tabulate these orthogonal polynomials up to those of fifth degree, for numbers of equally spaced points up to 75. However, if the data are not equally spaced the tabulated polynomials are not applicable, and the regression must be calculated directly.
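For x-values not covered by the tables, the orthogonal polynomials can be built directly by successive orthogonalization of 1, x, x², and so on (a Gram-Schmidt sketch; for five equally spaced points this reproduces the tabulated quadratic values 2, −1, −2, −1, 2):

```python
# Orthogonal polynomials over a chosen set of x-values: each polynomial is
# adjusted so that its sum of products with every earlier one vanishes.

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
n = len(xs)

p0 = [1.0] * n                                 # degree 0
mean = sum(xs) / n
p1 = [x - mean for x in xs]                    # degree 1, orthogonal to p0

sq = [x * x for x in xs]                       # start from x^2 for degree 2
c0 = sum(sq) / n                               # component along p0
c1 = sum(s * v for s, v in zip(sq, p1)) / sum(v * v for v in p1)
p2 = [s - c0 - c1 * v for s, v in zip(sq, p1)] # degree 2, orthogonal to both

# Sums of products of each pair vanish, as required
dot01 = sum(a * b for a, b in zip(p0, p1))
dot02 = sum(a * b for a, b in zip(p0, p2))
dot12 = sum(a * b for a, b in zip(p1, p2))
```

Because the polynomial values are mutually orthogonal, the coefficient on each is simply the sum of products with Y divided by the polynomial's own sum of squares, with no matrix inversion required.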
Testing adequacy of fit
The question of what degree of polynomial is appropriate to fit to a set of data is discussed below (see “Considerations in regression model choice”). If for each value of x_{1} there is an array of values for Y, the variation in the data can be analyzed into parts between and within arrays by the techniques of analysis of variance [see Linear Hypotheses, article on Analysis of Variance]. The sum of squares between arrays can be further analyzed into that part accounted for by regression and that part not so accounted for (deviation from regression). The adequacy of a polynomial fitted to the data is indicated by nonsignificant deviation from regression.
When there is but one observation of Y for each value of x_{1}, such an analysis is not possible. To test the adequacy of a pth-degree polynomial regression, a common though not strictly defensible procedure is to fit a polynomial of degree p + 1 and test whether the coefficient b_{p+1} of x_{1}^{p+1} is significant. Anderson (1962) has treated this problem as a multiple decision problem and has provided optimal procedures that can readily be applied.
Estimation of maxima
Sometimes a polynomial regression is fitted in order to estimate the value of x_{1} that yields a maximum value of η. A detailed discussion of the estimation of maxima is given by Hotelling (1941). To give an idea of the methods that are used, consider a quadratic regression of the form
η̂ = b_{0} + b_{1}x_{1} + b_{2}x_{1}^{2}.
A maximum (or minimum) value of η̂ occurs at the point x_{m} = −b_{1}/(2b_{2}), and this value is taken as the estimated position of the maximum. Confidence limits for the position can be determined by means of the following device. If the position of the maximum of the true regression curve is ξ, then ξ = −β_{1}/(2β_{2}), so that β_{1} + 2β_{2}ξ = 0. Consequently the quantity
b_{1} + 2b_{2}ξ
is distributed with mean zero and estimated variance
s^{2}(t^{11} + 4ξt^{12} + 4ξ^{2}t^{22}).
The confidence limits for ξ with confidence coefficient 1 − α are given by the roots of the equation
where F_{α;1,n−3} is the α-point of the F-distribution with 1 and n − 3 degrees of freedom, abbreviated below as F_{α}. The solution of this equation may be simplified by writing
so that the confidence limits become
Note that these limits are not, in general, symmetrically placed about the estimated value −b_{1}/(2b_{2}), since allowance is made for the skewness of the distribution of the ratio. Note also that the limits will include infinite values and will therefore not be of practical use, unless b_{2} is significant at the α-level. In terms of the g-values, this means that g_{22} must not exceed 1.
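The point estimate of the maximum can be sketched numerically. The data below are hypothetical and noise-free, and the x-values are centered so that the odd power sums vanish and the normal equations partly decouple; the Fieller-type confidence limits above would further require s² and the g-values:

```python
# Estimated position of the maximum of a fitted quadratic:
# x_m = -b1 / (2 b2). Hypothetical, error-free data generated from
# eta = 3 + 2x - x^2, whose maximum lies at x = 1.

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-5.0, 0.0, 3.0, 4.0, 3.0]
n = len(xs)

s2x = sum(x * x for x in xs)          # sum x^2 (odd power sums are zero)
s4x = sum(x ** 4 for x in xs)         # sum x^4
u0 = sum(ys)
u1 = sum(y * x for y, x in zip(ys, xs))
u2 = sum(y * x * x for y, x in zip(ys, xs))

b1 = u1 / s2x                         # the slope equation decouples
det = n * s4x - s2x * s2x             # 2x2 system remains for b0, b2
b0 = (s4x * u0 - s2x * u2) / det
b2 = (n * u2 - s2x * u0) / det

x_max = -b1 / (2.0 * b2)              # estimated position of the maximum
```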
When the regression model is a polynomial in two or more variables, investigation of maxima and other aspects of shape becomes more complex. [A discussion of this problem appears in Experimental Design, article on Response Surfaces.]
Nonlinear models
In a nonlinear model the regression function is nonlinear in one or more of the parameters. Familiar examples are the exponential regression,
η = β_{0} + β_{1}e^{β_{2}x_{1}},
and the logistic curve,
β_{2} being the nonlinear parameter in each example. Such nonlinear models usually originate from theoretical considerations but nevertheless are often useful for applying to observational data.
Sometimes the model can be reduced to a linear form by a transformation of variables (and a corresponding change in the specification of the errors). The exponential regression with β_{0} = 0 may thus be reduced by taking logarithms of the dependent variable and assuming that the errors of the logarithms, rather than the errors of the original values, are distributed about zero. If Z = log_{e} Y and E(Z) = ξ, the exponential model with β_{0} = 0 reduces to ξ = log_{e} β_{1} + β_{2}x_{1}, a linear model.
The general models shown above cannot be reduced to linear models in this way. For nonlinear models generally, the nonlinear parameters must be estimated by successive approximation. The following method is straightforward and of general applicability.
Suppose the model is
η = β_{0} + β_{1}f(x_{1}, β_{2}),
where f(x_{1}, β_{2}) is a nonlinear function of β_{2} and the estimated regression, determined by least squares, is
η̂ = b_{0} + b_{1}f(x_{1}, c).
If c_{0} is a trial value of c (estimated by graphical or other means), the values of f(x_{1}, c) and its first derivative with respect to c (denoted, for brevity, by f and f′, respectively) are calculated for each value of x_{1}, with c = c_{0}. The regression of Y on f and f′ is then determined in the usual way, yielding the regression equation
η̂ = b_{0} + b_{1}f + b_{2}f′.
A first adjustment to c_{0} is given by b_{2}/b_{1}, giving the new approximation
c_{1} = c_{0} + b_{2}/b_{1}.
The process of recalculating the regression on f and f’ and determining successive approximations to c can be continued until the required accuracy is attained (for further details see Williams 1959).
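One way to carry out this successive-approximation scheme is sketched below for the exponential case f(x, c) = e^{cx}, so that f′ = xe^{cx}. The data are hypothetical, generated without error from η = 1 + 2e^{−0.5x}, and the starting value c₀ = −0.45 is an assumption of the sketch:

```python
# Iterative fitting of eta = b0 + b1*f(x, c): at each step regress Y on
# f and its derivative f', then update c <- c + b2/b1, as described above.
import math

def solve3(a, r):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    m = [row[:] + [v] for row, v in zip(a, r)]
    for i in range(3):
        piv = max(range(i, 3), key=lambda k: abs(m[k][i]))
        m[i], m[piv] = m[piv], m[i]
        for k in range(i + 1, 3):
            factor = m[k][i] / m[i][i]
            for j in range(i, 4):
                m[k][j] -= factor * m[i][j]
    x = [0.0] * 3
    for i in range(2, -1, -1):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return x

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * math.exp(-0.5 * x) for x in xs]   # error-free data

c = -0.45                                   # trial value c0
for _ in range(10):
    f = [math.exp(c * x) for x in xs]
    fp = [x * math.exp(c * x) for x in xs]  # derivative of f with respect to c
    cols = [[1.0] * len(xs), f, fp]
    T = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    u = [sum(y * v for y, v in zip(ys, col)) for col in cols]
    b0, b1, b2 = solve3(T, u)
    c += b2 / b1                            # the adjustment b2/b1
```

With error-free data the iteration settles on the generating values, c near −0.5 with b0 near 1 and b1 near 2; with observational data it converges to the least squares estimates.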
The method is an adaptation of the delta method, which utilizes the principle of propagation of error. If a small change, δβ_{2}, is made in a parameter β_{2}, the corresponding change in a function f(β_{2}) is, to a first approximation, f′(β_{2})δβ_{2}. The use of this method allows the replacement of the nonlinear equations for the parameters by approximate linear equations for the adjustments. For a regression relation of the form
η = β_{0} + β_{1}e^{β_{2}x_{1}},
Stevens (1951) provides a table to facilitate the calculation of the nonlinear parameter by a method similar to that described above, and Pimentel Gomes (1953) provides tables from which, with a few preliminary calculations, the least squares estimate of the nonlinear parameter can be read off easily.
Considerations in regression model choice
In deciding which of several alternative models shall be used to interpret a relationship, a number of factors must be taken into account. Other things being equal, the model which represents the predictands most closely (where “closeness” is measured in terms of some criterion such as minimum mean square error among linear estimators) will be used. However, questions of convenience and simplicity should also be considered. A regression equation that includes a large number of regression variables is not convenient to use, and an equation with fewer variables may be only slightly less accurate. In deciding between alternative models, the residual variance is therefore not the only factor to take into account.
In polynomial regression particularly, the assumed polynomial form of the model is usually chosen for convenience, so that a polynomial of given degree is not assumed to be the true regression model. Because of this, the testing of individual polynomial coefficients is little more than a guide in deciding on the degree of polynomial to be fitted. Of far more importance is a decision on what degree of variability about the regression model is acceptable, and this decision will be based on practical rather than merely statistical considerations.
Besides the question of including additional variables in a regression, for which significance tests have already been described, there is also the question of alternative regression variables. The alternatives for a regression relation could be different variables or different functions of the same variable — for instance, x_{1} and log x_{1}.
For comparison of two or more individual variables as predictors, a test devised by Hotelling (1940) is suitable, although not strictly accurate. It is based on the correlations between Y and the different predictors and of the predictors among themselves. For comparing two regression variables x_{1} and x_{2}, the test statistic is
which is distributed approximately as F with 1 and n — 3 degrees of freedom. Here, as before,
and s^{2} is the mean square of residuals from the regression of Y on x_{1} and x_{2}, with n − 3 degrees of freedom.
E. J. Williams
BIBLIOGRAPHY
Anderson, T. W. 1957 Maximum Likelihood Estimates for a Multivariate Normal Distribution When Some Observations Are Missing. Journal of the American Statistical Association 52:200–203.
Anderson, T. W. 1962 The Choice of the Degree of a Polynomial Regression as a Multiple Decision Problem. Annals of Mathematical Statistics 33:255–265.
Box, George E. P.; and Andersen, S. L. 1955 Permutation Theory in the Derivation of Robust Criteria and the Study of Departures From Assumption. Journal of the Royal Statistical Society Series B 17:1–26.
Box, George E. P.; and Wilson, K. B. 1951 On the Experimental Attainment of Optimum Conditions. Journal of the Royal Statistical Society Series B 13:1–45. → Contains seven pages of discussion.
Cochran, William G.; and Carroll, Sarah P. 1953 A Sampling Investigation of the Efficiency of Weighting Inversely as the Estimated Variance. Biometrics 9:447–459.
Cochran, William G.; and Cox, Gertrude M. (1950) 1957 Experimental Designs. 2d ed. New York: Wiley.
Ezekiel, Mordecai; and Fox, Karl A. (1930) 1961 Methods of Correlation and Regression Analysis: Linear and Curvilinear. New York: Wiley.
Fisher, R. A.; and Yates, Frank (1938) 1963 Statistical Tables for Biological, Agricultural and Medical Research. 6th ed., rev. & enl. Edinburgh: Oliver & Boyd; New York: Hafner.
Gauss, Carl F. 1855 Methode des moindres carrés: Mémoires sur la combinaison des observations. Translated by J. Bertrand. Paris: MalletBachelier. → An authorized translation of Carl Friedrich Gauss’s works on least squares.
Geary, R. C. 1963 Some Remarks About Relations Between Stochastic Variables: A Discussion Document. Institut International de Statistique,Revue 31:163–181.
Hotelling, Harold 1940 The Selection of Variates for Use in Prediction With Some Comments on the General Problem of Nuisance Parameters. Annals of Mathematical Statistics 11:271–283.
Hotelling, Harold 1941 Experimental Determination of the Maximum of a Function. Annals of Mathematical Statistics 12:20–45.
Kendall, Maurice G.; and Stuart, Alan (1943–1946) 1958–1966 The Advanced Theory of Statistics. New ed. 3 vols. New York: Hafner; London: Griffin. → Volume 1: Distribution Theory, 1958. Volume 2: Inference and Relationship, 1961. Volume 3: Design and Analysis, and Time-series, 1966. Kendall was the sole author of the 1943–1946 edition.
Kruskal, William H. 1961 The Coordinatefree Approach to GaussMarkov Estimation and Its Application to Missing and Extra Observations. Volume 1, pages 435–451 in Symposium on Mathematical Statistics and Probability, Fourth, Berkeley, Proceedings. Berkeley and Los Angeles: Univ. of California Press.
Legendre, Adrien M. (1805) 1959 On a Method of Least Squares. Volume 2, pages 576–579 in David Eugene Smith, A Source Book in Mathematics. New York: Dover. → First published as “Sur la méthode des moindres carrés” in Legendre’s Nouvelles méthodes pour la détermination des orbites des comètes.
Madansky, Albert 1959 The Fitting of Straight Lines When Both Variables Are Subject to Error. Journal of the American Statistical Association 54:173–205.
Pimentel Gomes, Frederico 1953 The Use of Mitscherlich’s Regression Law in the Analysis of Experiments With Fertilizers. Biometrics 9:498–516.
Plackett, R. L. 1960 Principles of Regression Analysis. Oxford: Clarendon.
Rao, C. Radhakrishna 1952 Advanced Statistical Methods in Biometric Research. New York: Wiley.
Rao, C. Radhakrishna 1965 Linear Statistical Inference and Its Applications. New York: Wiley.
Stevens, W. L. 1951 Asymptotic Regression. Biometrics 7:247–267.
Strodtbeck, Fred L.; Mcdonald, Margaret R.; and Rosen, Bernard C. 1957 Evaluation of Occupations: A Reflection of Jewish and Italian Mobility Differences. American Sociological Review 22:546–553.
Williams, Evan J. 1952 Use of Scores for the Analysis of Association in Contingency Tables. Biometrika 39: 274289.
Williams, Evan J. 1959 Regression Analysis. New York: Wiley.
Wold, Herman 1953 Demand Analysis: A Study in Econometrics. New York: Wiley.
Yates, Frank 1933 The Analysis of Replicated Experiments When the Field Results Are Incomplete. Empire Journal of Experimental Agriculture 1: 129142.
Yates, Frank 1948 The Analysis of Contingency Tables With Groupings Based on Quantitative Characters. Biometrika 35:176–181.
II ANALYSIS OF VARIANCE
Analysis of variance is a body of statistical procedures for analyzing observational data that may be regarded as satisfying certain broad assumptions about the structure of means, variances, and distributional form. The basic notion of analysis of variance (or ANOVA) is that of comparing and dissecting empirical dispersions in the data in order to understand underlying central values and dispersions.
This basic notion was early noted and developed in special cases by Lexis and von Bortkiewicz [see Lexis; Bortkiewicz]. Not until the pioneering work of R. A. Fisher (1925; 1935), however, were the fundamental principles of analysis of variance and its most important techniques worked out and made public [see Fisher, R. A.]. Early applications of analysis of variance were primarily in agriculture and biology. The methodology is now used in every field of science and is one of the most important statistical areas for the social sciences. (For further historical material see Sampford 1964.)
Much basic material of analysis of variance may usefully be regarded as a special development of regression analysis [see Linear Hypotheses, article on Regression]. Analysis of variance extends, however, to techniques and models that do not strictly fall under the regression rubric.
In analysis of variance all the standard general theories of statistics, such as point and set estimation and hypothesis testing, come into play. In the past there has sometimes been overemphasis on testing hypotheses.
One-factor analysis of variance
A simple experiment will now be described as an example of ANOVA. Suppose that the publisher of a junior-high-school textbook is considering styles of printing type for a new edition; there are three styles to investigate, and the same chapter of the book has been prepared in each of the three styles for the experiment. Junior-high-school pupils are to be chosen at random from an appropriate large population of such pupils, randomly assigned to read the chapter in one of the three styles, and then given a test that results in a reading-comprehension score for each pupil.
Suppose that the experiment is set up so that P_{1}, P_{2}, and P_{3} pupils (where P_{1} = P_{2} = P_{3} = P) read the chapter in styles 1, 2, and 3, respectively, and that X_{ps} denotes the comprehension score of the pth pupil reading style s. (Here s = 1, 2, 3; in general, s = 1, 2, ..., S.) There is a hypothetical mean, or expected, value of X_{ps}, μ_{s}, but X_{ps} differs from μ_{s} because, first, the pupils are chosen randomly from a population of pupils with different inherent means and, second, a given pupil, on hypothetical repetitions of the experiment, would not always obtain the same score. This is expressed by writing
X_{ps} = μ_{s} + e_{ps}. (1)
Then the assumptions are made that the e_{ps} are all independent, that they are all normally distributed, and that they have a common (usually unknown) variance, σ^{2}. By definition, the expectation of e_{ps} is zero.
Because differences among the pupils reading a particular style of type are thrown into the random “error” terms (e_{ps}), μ_{s}, the expectation of X_{ps}, does not depend on p. It is convenient to rewrite (1) as
X_{ps} = μ + (μ_{s} – μ) + e_{ps},
where μ = (∑μ_{s})/S, the average of the μ_{s}. For simplicity, set α_{s} = μ_{s} – μ (so that α_{1} + α_{2} + ... + α_{S} = 0) and write the structural equation finally in the conventional form
X_{ps} = μ + α_{s} + e_{ps}.
Here α_{s} is the differential effect on comprehension scores of style s for the relevant population of pupils. The unknowns are μ, the α_{s}, and σ^{2}.
Note that this structure falls under the linear regression hypothesis with coefficients 0 or 1. For example, if E(X_{ps}) represents the expected value of X_{ps},
E(X_{p1}) = 1·μ + 1·α_{1} + 0·α_{2} + 0·α_{3} + ... + 0·α_{S},
E(X_{p2}) = 1·μ + 0·α_{1} + 1·α_{2} + 0·α_{3} + ... + 0·α_{S}.
Consider how this illustrative experiment might be conducted. After defining the population to which he wishes to generalize his findings, the experimenter would use a table of random numbers to choose pupils to read the chapter printed in the different styles. (Actually, he would probably have to sample intact school classes rather than individual pupils, so the observations analyzed might be class means instead of individual scores, but this does not change the analysis in principle.) After the three groups have read the same chapter under conditions that differ only in style of type, a single test covering comprehension of the material in the chapter would be administered to all pupils.
The experimenter’s attention would be focused on differences between average scores of the three style groups (that is, ̄X._{1} versus ̄X._{2}, ̄X._{2} versus ̄X._{3}, and ̄X._{1} versus ̄X._{3}) relative to the variability of the test scores within these groups. He estimates the µ_{s} via the ̄X._{s}, and he attempts to determine which of the three averages, if any, differ with statistical significance from the others. Eventually he hopes to help the publisher decide which style of type to use for his new edition.
ANOVA of random numbers—an example
An imaginary experiment of the kind outlined above will be analyzed here to illustrate how ANOVA is applied. Suppose that the three P_{s} are each 20, that in fact the µ_{s} are all exactly equal to 0, and that σ = 1 (setting µ_{s} = 0 is just a convenience corresponding to a conventional origin for the comprehension-score scale).
Sixty random normal deviates, with mean 0 and variance 1, were chosen by use of an appropriate table (RAND Corporation 1955). They are listed in Table 1, where the second column from the left should be disregarded for the moment—it will be used later, in a modified example. From the “data” of Table 1 the usual estimates of the μ_{s} are just the column averages, ̄X._{1} = –0.09, ̄X._{2} = 0.10, and ̄X._{3} = 0.08. The estimate of µ is the overall mean, ̄X.. = 0.03, and the estimates of the α_{s} are –0.09 – 0.03 = –0.12, 0.10 – 0.03 = 0.07, and 0.08 – 0.03 = 0.05. Note that these add to zero,
Table 1 — Data for hypothetical experiment; 60 random normal deviates  

X_{p1}  X_{p1} + 1*  X_{p2}  X_{p3}  
*This column was obtained by adding 1 to each deviate of the first column.  
0.477  1.477  –0.987  1.158  
–0.017  0.983  2.313  0.879  
0.508  1.508  0.016  0.068  
–0.512  0.488  0.483  1.116  
–0.188  0.812  0.157  0.272  
–1.073  –0.073  1.107  –0.396  
–0.412  0.588  –0.023  –0.983  
1.201  2.201  0.898  –0.267  
–0.676  0.324  –1.404  0.3207  
–1.012  –0.012  –0.080  0.929  
0.997  1.997  –1.258  –0.603  
–0.127  0.873  –0.017  0.493  
1.178  2.178  1.607  –1.243  
–1.507  –0.507  0.005  –0.145  
1.010  2.010  0.163  1.334  
–0.528  0.472  –0.771  –0.906  
–0.139  0.861  0.485  –1.633  
0.621  1.621  0.147  0.424  
–2.078  –1.078  –1.764  –0.433  
0.485  1.485  0.986  1.245  
Mean  –0.09  0.91  0.10  0.08 
Variance  0.83  0.83  1.03  0.78 
as required. In ANOVA, for this case, two quantities are compared. The first is the dispersion of the three µ_{s} estimates—that is, the sum of the (̄X._{s} – ̄X..)^{2}, conveniently multiplied by 20, the common sample size. This is called the between-styles dispersion or sum of squares. Here it is 0.4466. (These calculations, as well as those below, are made with the raw data of Table 1, not with the rounded means appearing there.) The second quantity is the within-sample dispersion, the sum of the three quantities ∑_{p}(X_{ps} – ̄X._{s})^{2}. This is called the within-styles dispersion or sum of squares. Here it is 50.1253.
This comparison corresponds to the decomposition
X_{ps} – ̄X.. = (̄X._{s} – ̄X..) + (X_{ps} – ̄X._{s})
and to the sum-of-squares identity
∑_{s}∑_{p}(X_{ps} – ̄X..)^{2} = 20∑_{s}(̄X._{s} – ̄X..)^{2} + ∑_{s}∑_{p}(X_{ps} – ̄X._{s})^{2},
which shows how the factor of 20 arises. Such identities in sums of squares are basic in most elementary expositions of ANOVA.
The fundamental notion is that the within-styles dispersion, divided by its so-called degrees of freedom (here, degrees of freedom for error), unbiasedly estimates σ^{2}. Here the degrees of freedom for error are 57 (equals 60 [for the total number of
Table 2 – Analysis-of-variance table for one-factor experiment  

(a) ANOVA of 60 random normal deviates  
Source of variation  df  SS  MS  F  Tabled F_{0.05;2,57} 
Between styles  3 – 1 = 2  0.4466  0.2233  0.25  3.16 
Within styles  60 – 3 = 57  50.1253  0.8794^{*}  
Total  60 – 1 = 59  50.5719  
^{*} Actually, σ^{2} here is known to be 1.  
(b) ANOVA of general one-factor experiment with S treatments  
Source of variation  df  MS  EMS  
Between treatments  S – 1    σ^{2} + ∑_{s}P_{s}α_{s}^{2}/(S – 1)  
Within treatments  P_{+} – S^{†}    σ^{2}  
Total  P_{+} – 1^{†} 
^{†} Here P_{+} is used for ∑_{s}P_{s}, the total number of observations.  
observations] minus 3 [for the number of μ_{s} estimated]). On the other hand, the between-styles dispersion, divided by its degrees of freedom (here 2), estimates σ^{2} unbiasedly if and only if the µ_{s} are equal; otherwise the estimate will tend to be larger than σ^{2}. Furthermore, the between-styles and within-styles dispersions are statistically independent. Hence, it is natural to look at the ratio of the two dispersions, each divided by its degrees of freedom. The result is the F-statistic, here
F = 0.2233/0.8794 = 0.25.
In repeated trials with the null hypothesis (that there are no differences between the µ_{s}) true, the F-statistic follows an F-distribution with (in this case) 2 and 57 degrees of freedom [see Distributions, Statistical, article on Special Continuous Distributions]. Level of significance is denoted by “α” (which should not be confused with the totally unrelated “α_{s},” denoting style effect; the notational similarity stems from the juxtaposition of two terminological traditions and the finite number of Greek letters). The F-test at level of significance α of the null hypothesis that the styles are equivalent rejects that hypothesis when the F-statistic is too large, greater than its 100α percentage point, here F_{α;2,57}. If α = 0.05, which is a conventional level, then F_{0.05;2,57} = 3.16, so 0.25 is much smaller than the cutoff point, and the null hypothesis is, of course, not rejected. This is consonant with the fact that the null hypothesis is true in the imaginary experiment under discussion.
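The between- and within-styles computation just described can be sketched in a few lines of code. The simulated data, random seed, and SciPy calls below are illustrative assumptions; the sketch generates fresh null-hypothesis data rather than reproducing Table 1 exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# S = 3 styles, P = 20 pupils each, all true style means equal
# (null hypothesis true), sigma = 1 -- as in the Table 1 setup.
S, P = 3, 20
X = rng.standard_normal((P, S))  # X[p, s] = score of pupil p under style s

grand_mean = X.mean()
style_means = X.mean(axis=0)

# Between-styles and within-styles sums of squares.
ss_between = P * ((style_means - grand_mean) ** 2).sum()
ss_within = ((X - style_means) ** 2).sum()

df_between, df_within = S - 1, S * (P - 1)   # 2 and 57
ms_between = ss_between / df_between
ms_within = ss_within / df_within            # unbiased estimate of sigma^2

F = ms_between / ms_within
critical = stats.f.ppf(0.95, df_between, df_within)  # tabled F_{0.05;2,57}
print(F, critical, F > critical)
```

Note that the sum-of-squares identity guarantees ss_between + ss_within equals the total dispersion about the grand mean.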
Table 2 summarizes the above discussion in both algebraic and numerical form. The algebraic form is for S styles with P_{s} students at the sth style.
To reiterate, in an analysis of variance each kind of effect (treatment, factor, and others to be discussed later) is represented by two basic numbers. The first is the so-called sum of squares (SS) corresponding to the effect; it is random, depending upon the particular sample, and has two fundamental properties: (a) if the effect in question is wholly absent, its sum of squares behaves probabilistically like a sum of squared independent normal deviates with zero means; (b) if the effect in question is present, its sum of squares tends to be relatively large; in fact, it behaves probabilistically like a sum of squared independent normal deviates with not all means zero.
The second number is the so-called degrees of freedom (df). This quantity is not random but depends only on the structure of the experimental design. The df is the number of independent normal deviates in the description of sums of squares just given.
A third (derived) number is the so-called mean square (MS), which is computed by dividing the sum of squares by the degrees of freedom. When an effect is wholly absent, its mean square is an unbiased estimator of underlying variance, σ^{2}. When an effect is present, its mean square has an expectation greater than σ^{2}.
In the example considered here, each observation is regarded as the sum of (a) a grand mean, (b) a printing-style effect, and (c) error. It is conventional in analysis-of-variance tables not to have a line corresponding to the grand mean and to work with sample residuals centered on it; that convention is followed here. Printing-style effect and error differ in that the latter is assumed to be wholly random, whereas the former is not random but may be zero. The mean square for error estimates underlying variance unbiasedly and is a yardstick for judging other mean squares.
In the standard simple designs to which ANOVA is applied, it is customary to define effects so that the several sums of squares are statistically independent, from which additivity both of sums of squares and of degrees of freedom follows [see Probability, article on Formal Probability]. In the example, SS_{between} + SS_{within} = SS_{total}, and df_{b} + df_{w} = df_{total}. (Here, and often below, the subscripts “b” and “w” are used to stand for “between” and “within,” respectively.) This additivity is computationally useful, either to save arithmetic or to verify it.
Analysis-of-variance tables, which, like Table 2, are convenient and compact summaries of both the relevant formulas and the computed numbers, usually also show expected mean squares (EMS), the average value of the mean squares over a (conceptually) infinite number of experiments. In fixed-effects models (such as the model of the example) these are always of the form σ^{2} (the underlying variance) plus an additional term that is zero when the relevant effect is absent and positive when it is present. The additional term is a convenient measure of the magnitude of the effect.
Expected mean squares, such as those given by the two formulas in Table 2, provide a necessary condition for the F-statistic to have an F-distribution when the null hypothesis is true. (Other conditions, such as independence, must also be met.) Note that if the population mean of the sth treatment, µ_{s}, is the same for all treatments (that is, if α_{s} = 0 for all s), then the expected value of MS_{b} will be σ^{2}, the same as the expected value of MS_{w}. If the null hypothesis is true, the average value of the F from a huge number of identical experiments employing fresh, randomly sampled experimental units will be (P_{+} – S)/(P_{+} – S – 2), which is very nearly 1 when, as is usually the case, the total number of experimental units, P_{+}, is large compared with S. Expected mean squares become particularly important in analyses based on models of a nature somewhat different from the one illustrated in Tables 1 and 2, because in those cases it is not always easy to determine which mean square should be used as the denominator of F (see the discussion of some of these other models, below).
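The claim about the long-run average of F under the null hypothesis can be checked by simulation. The sketch below (the number of repetitions and the seed are arbitrary choices) repeats the 60-deviate null experiment many times and compares the average F with (P_{+} – S)/(P_{+} – S – 2) = 57/55:

```python
import numpy as np

rng = np.random.default_rng(1)
S, P = 3, 20
n_experiments = 20000

# Each slab X[i] is one simulated null experiment with equal style means.
X = rng.standard_normal((n_experiments, P, S))
style_means = X.mean(axis=1, keepdims=True)       # shape (n, 1, S)
grand_means = X.mean(axis=(1, 2), keepdims=True)  # shape (n, 1, 1)

ss_b = P * ((style_means - grand_means) ** 2).sum(axis=(1, 2))
ss_w = ((X - style_means) ** 2).sum(axis=(1, 2))
F = (ss_b / (S - 1)) / (ss_w / (S * P - S))

# Under H0 the long-run average of F is 57/55, approximately 1.036.
print(F.mean(), 57 / 55)
```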
The simplest t-tests
It is worth digressing to show how the familiar one-sample and two-sample t-tests (or Student tests) fall under the analysis-of-variance rubric, at least for the symmetrical two-tail versions of these tests.
Single-sample t-test. In the single-sample t-test context, one considers a random sample, X_{1}, X_{2}, ..., X_{P}, of independent normal observations with the same unknown mean, µ, and the same unknown variance, σ^{2}. Another way of expressing this is to write
X_{p} = µ + e_{p},  p = 1, ..., P,
where the e_{p} are independent normal random variables, with mean 0 and common variance σ^{2}. The usual estimator of µ is ̄X., the average of the X_{p} , and this suggests the decomposition into average and deviation from average,
X_{p} = ̄X. + (X_{p} –̄X.),
from which one obtains the sum-of-squares identity
∑X_{p}^{2} = P̄X.^{2} + ∑(X_{p} – ̄X.)^{2}
(since ∑(X_{p} – ̄X.) = 0), a familiar algebraic relationship. Since the usual unbiased estimator of σ^{2} is s^{2} = ∑(X_{p} – ̄X.)^{2}/(P – 1), the sum-of-squares identity may be written
∑X_{p}^{2} = P̄X.^{2} + (P – 1)s^{2}.
Ordinarily the analysisofvariance table is not written out for this simple case; it is, however, the one shown in Table 3. In Table 3 the total row is the actual total including all observations; it is of the essence that the row for mean is separated out.
Table 3  

Effect  df  SS  EMS 
Mean  1  P̄X.^{2}  σ^{2} + Pμ^{2} 
Error  P – 1  ∑(X_{p} – ̄X.)^{2}  σ^{2} 
Total  P  ∑X_{p}^{2} 
The F-statistic for testing that µ = 0 is the ratio of the mean squares for mean and error,
F = P̄X.^{2}/s^{2},
which, under the null hypothesis, has an F-distribution with 1 and P – 1 degrees of freedom. Notice that the above F-statistic is the square of
t = √P ̄X./s,
which is the ordinary t-statistic (or Student statistic) for testing µ = 0. If a symmetrical two-tail test is wanted, it is immaterial whether one deals with the t-statistic or its square. On the other hand, for a one-tail test the t-statistic would be referred to the t-distribution with P – 1 degrees of freedom [see Distributions, Statistical, article on Special Continuous Distributions].
It is important to note that a confidence interval for µ may readily be established from the above discussion [see Estimation, article on Confidence Intervals and Regions]. The symmetrical form is
̄X. – (s/√P)√F_{α;1,P–1} ≤ µ ≤ ̄X. + (s/√P)√F_{α;1,P–1}.
Alternatively, √F_{α;1,P–1} can be replaced by the upper 100(α/2) per cent point of the t-distribution with P – 1 degrees of freedom, t_{α/2;P–1}.
Suppose, for example, that from a normally distributed population there has been drawn a random sample of 25 observations for which the sample mean, ̄x., is 34.213 and the sample variance, s^{2}, is 49.000. What is the population mean, µ? The usual point estimate from this sample is 34.213. How different from µ is this value likely to be? For α = 0.05, a 95 per cent confidence interval is constructed by looking up t_{.025;24} = 2.064 in a table (for instance, McNemar [1949] 1962, p. 430) and substituting in the formula
̄x. – t_{.025;24}s/√P ≤ µ ≤ ̄x. + t_{.025;24}s/√P.
Thus,
34.213 – (2.064)(7/5) ≤ µ ≤ 34.213 + (2.064)(7/5),
that is, 31.32 ≤ µ ≤ 37.10.
This result means that if an infinite number of samples, each of size P = 25, were drawn randomly from a normally distributed population and a confidence interval for each sample were set up in the above way, only 5 per cent of the intervals would fail to cover the mean of the population (which is a certain fixed value).
Similarly, from this one sample the unbiased point estimate of σ^{2} is the value of s^{2}, 49.000. Brownlee ([1960] 1965, p. 282) shows how to find confidence intervals for σ^{2} [see also Variances, Statistical Study of].
Is it “reasonable” to suppose that the mean of the population from which this sample was randomly chosen is as large as, say, 40? No, because that number does not lie within even the 99 per cent confidence interval. Therefore it would be unreasonable to conclude that the sample was drawn from a population with a mean as great as 40. The relevant test of statistical significance is
t = (34.213 – 40)/(7/5) = –4.13,
the absolute magnitude of which lies beyond the 0.9995 percentile point (3.745) in the tabled t-distribution for 24 degrees of freedom. Therefore, the difference is statistically significant beyond the 0.0005 + 0.0005 = 0.001 level. The null hypothesis being tested was H_{0}: µ = 40, against the alternative hypothesis H_{a}: µ ≠ 40. Just as the confidence interval indicated that it is unreasonable to suppose the mean to be equal to 40, this test shows that 40 lies outside the 99 per cent confidence interval; of the two procedures, however, the confidence interval gives more information than the significance test.
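A minimal sketch of this worked example, assuming SciPy for the t percentage points (the summary statistics are those given above):

```python
import math
from scipy import stats

# Summary statistics from the worked example: P = 25, mean 34.213, s^2 = 49.
P, xbar, s2 = 25, 34.213, 49.000
s = math.sqrt(s2)
se = s / math.sqrt(P)             # 7/5 = 1.4

# 95 per cent confidence interval for mu.
t_crit = stats.t.ppf(0.975, P - 1)     # approximately 2.064
lo, hi = xbar - t_crit * se, xbar + t_crit * se
print(round(lo, 2), round(hi, 2))

# t-statistic for H0: mu = 40 against Ha: mu != 40.
t = (xbar - 40) / se
p_two_tail = 2 * stats.t.sf(abs(t), P - 1)
print(round(t, 2), p_two_tail < 0.001)
```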
Two-sample t-test. In the two-sample t-test context, there are two random samples from normal distributions assumed to have the same variance, σ^{2}, and to have means µ_{1} and µ_{2}. Call the observations in the first sample X_{11}, ..., X_{P_{1}1} and the observations in the second sample X_{12}, ..., X_{P_{2}2}. The most usual null hypothesis is µ_{1} = µ_{2}, and for that the t-statistic is
t = (̄X._{1} – ̄X._{2})/√[s^{2}(1/P_{1} + 1/P_{2})],
where the P’s are the sample sizes, the ̄X’s are the sample means, and s^{2} is the estimate of σ^{2} based on the pooled within-sample sum of squares,
s^{2} = [∑_{p}(X_{p1} – ̄X._{1})^{2} + ∑_{p}(X_{p2} – ̄X._{2})^{2}]/(P_{1} + P_{2} – 2).
Here P_{1} + P_{2} – 2 is the number of degrees of freedom for error, the total number of observations less the number of estimated means (̄X._{1} and ̄X._{2} estimate µ_{1} and µ_{2}, respectively). Under the null hypothesis, the t-statistic has the t-distribution with P_{1} + P_{2} – 2 degrees of freedom.
The basic decomposition is
X_{ps} – ̄X.. = (̄X._{s} – ̄X..) + (X_{ps} – ̄X._{s}),
leading to the sum-of-squares decomposition
∑_{s}∑_{p}(X_{ps} – ̄X..)^{2} = ∑_{s}P_{s}(̄X._{s} – ̄X..)^{2} + ∑_{s}∑_{p}(X_{ps} – ̄X._{s})^{2}.
Since s has only the values 1 and 2,
∑_{s}P_{s}(̄X._{s} – ̄X..)^{2} = P_{1}(̄X._{1} – ̄X..)^{2} + P_{2}(̄X._{2} – ̄X..)^{2},
Table 4  

Effect  df  SS  EMS* 
Style  1  (̄X._{1} – ̄X._{2})^{2}/(1/P_{1} + 1/P_{2})  σ^{2} + (µ_{1} – µ_{2})^{2}/(1/P_{1} + 1/P_{2}) 
Error  P_{1} + P_{2} – 2  ∑_{s}∑_{p}(X_{ps} – ̄X._{s})^{2}  σ^{2} 
Total  P_{1} + P_{2} – 1  ∑_{s}∑_{p}(X_{ps} – ̄X..)^{2} 
* Note that the expected mean square for style is σ^{2} plus what is obtained by formal substitution of their respective expectations for the random variables (̄X._{1}, ̄X._{2}) in the sum of squares (divided by df, which here is 1). This relationship is a perfectly general one in the analysis-of-variance model now under discussion, but it must be changed for other models that will be mentioned later.  
and therefore the between-styles sum of squares reduces to
(̄X._{1} – ̄X._{2})^{2}/(1/P_{1} + 1/P_{2}).
The analysis-of-variance table may be written as in Table 4. The F-statistic for the null hypothesis that µ_{1} = µ_{2} is
F = MS_{style}/MS_{error} = (̄X._{1} – ̄X._{2})^{2}/[s^{2}(1/P_{1} + 1/P_{2})],
and this is exactly the square of the t-statistic for the two-sample problem.
Note that the two-sample problem as it is analyzed here is only a special case (with S = 2) of the S-sample problem presented earlier.
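The identity F = t² for the two-sample case can be verified numerically; the data below are simulated under assumed means and sample sizes, so the particular values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal(0.0, 1.0, size=12)   # first sample, P1 observations
X2 = rng.normal(0.5, 1.0, size=15)   # second sample, P2 observations
P1, P2 = len(X1), len(X2)

# Pooled estimate of sigma^2 from the within-sample sums of squares.
ss_within = ((X1 - X1.mean()) ** 2).sum() + ((X2 - X2.mean()) ** 2).sum()
s2 = ss_within / (P1 + P2 - 2)

# Two-sample t-statistic for H0: mu1 = mu2.
t = (X1.mean() - X2.mean()) / np.sqrt(s2 * (1 / P1 + 1 / P2))

# The same comparison as a one-factor ANOVA with S = 2: F equals t^2.
grand = np.concatenate([X1, X2]).mean()
ss_between = P1 * (X1.mean() - grand) ** 2 + P2 * (X2.mean() - grand) ** 2
F = (ss_between / 1) / s2
print(t ** 2, F)
```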
The numerical example continued
Returning to the numerical example of Table 1, add 1 to every number in the leftmost column to obtain the second column, and consider the numbers in the second column as the observations for style 1. Now µ_{1} = 1 and µ_{2} = µ_{3} = 0. What happens to the analysis of variance and the F-test? Table 5 shows the result; the F-statistic is 5.41, which is of high statistical significance since F_{0.01;2,57} = 5.07. Thus, one would correctly reject the null hypothesis of equality among the three µ_{s}.
The actual value of µ is 1/3 ≅ 0.33, and that of α_{1} is 2/3 ≅ 0.67. The estimate of µ is 0.36, and that of α_{1} is 0.55.
With three styles, one can consider many contrasts—for example, style 1 versus style 2, style 1 versus style 3, style 2 versus style 3, ½(style 1 + style 2) versus style 3. There are special methods for dealing with several contrasts simultaneously [see Linear Hypotheses, article on Multiple Comparisons].
ANOVA with more than one factor
In the illustrative example being considered here, suppose that the publisher had been interested not only in style of type but also in a second factor, such as the tint of the printing ink (t). If he had three styles and four tints, a complete “crossed” factorial design would require 3 × 4 = 12 experimental conditions (s_{1}t_{1}, s_{1}t_{2}, ..., s_{3}t_{4}). From 12P experimental units he would assign P units at random to each of the 12 conditions, conduct his experiment, and obtain outcome measures to analyze. The total variation between the 12P outcome measures can be partitioned into four sources rather than into the two found with one factor. The sources of variation are the following: between styles, between tints, interaction of styles with tints, and within style–tint combinations (error).
The usual model for the two-factor crossed design is
X_{pst} = µ + α_{s} + β_{t} + γ_{st}+e_{pst},
where ∑_{s}α_{s} = ∑_{t}β_{t} = ∑_{s}γ_{st} = ∑_{t}γ_{st} = 0, and the e_{pst} are independent normally distributed random variables with mean 0 and equal variance σ^{2} for each st combination. The analysis-of-variance procedure for this design appears in Table 6. The α_{s} and β_{t} represent main effects of the styles and tints; the γ_{st} denote (two-factor) interactions.
Table 5 – One-factor ANOVA of 60 transformed random normal deviates  

Source of variation  df  SS  MS  EMS  F 
Between styles  2  8.9246  4.4623  σ^{2} + 10∑_{s}α_{s}^{2}  5.41  
Within styles  57  50.1253  0.8794  σ^{2}  
Total  59  59.0499 
Table 6 – ANOVA of a complete, crossed-classification, two-factor factorial design with P experimental units for each factor-level combination  

Source of variation  df  SS  EMS 
Between styles  S – 1  
Between tints  T – 1  
Styles × tints (interaction)  (S – 1)(T – 1)  
Within style–tint combinations  ST(P – 1)  σ^{2}  
Total  PST – 1 
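The partition of Table 6 can be sketched numerically. The factor counts, seed, and simulated scores below are illustrative assumptions (no true effects are built in):

```python
import numpy as np

rng = np.random.default_rng(5)
S, T, P = 3, 4, 5   # styles, tints, replicates per style-tint cell

# Simulated scores X[p, s, t] under the two-factor model with no true effects.
X = rng.standard_normal((P, S, T))

grand = X.mean()
cell = X.mean(axis=0)            # (S, T) cell means
style = X.mean(axis=(0, 2))      # style means
tint = X.mean(axis=(0, 1))       # tint means

ss_style = P * T * ((style - grand) ** 2).sum()
ss_tint = P * S * ((tint - grand) ** 2).sum()
ss_inter = P * ((cell - style[:, None] - tint[None, :] + grand) ** 2).sum()
ss_within = ((X - cell) ** 2).sum()

df = [S - 1, T - 1, (S - 1) * (T - 1), S * T * (P - 1)]
ss = [ss_style, ss_tint, ss_inter, ss_within]
print(list(zip(df, [round(v, 3) for v in ss])))
```

The four sums of squares and degrees of freedom add to the totals in the last row of Table 6.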
Interaction
The two-factor design introduces interaction, a concept not relevant in one-factor experiments. It might be found, for example, that, although in general s_{1} is an ineffective style and t_{3} is an ineffective tint, the particular combination s_{1}t_{3} produces rather good results. It is then said that style interacts with tint to produce nonadditive effects; if the effects were additive, an ineffective style combined with an ineffective tint would produce an ineffective combination.
Interaction is zero if E(X_{pst}) = µ + α_{s} + β_{t} for every st, because under this condition the population mean of the stth combination is the population grand mean plus the sum of the effects of the sth style and the tth tint. Then the interaction effect, γ_{st}, is zero for every combination. Table 7 contains hypothetical data showing population means, ̄µ_{st}, for zero interaction (Lubin 1961 discusses types of interaction). Note that for every cell of Table 7, ̄µ_{st} – (̄µ_{s.} – µ) – (̄µ_{.t} – µ) = µ = 3. (Here ̄µ.. is written as µ for simplicity.) For example, for tint 1 and style 1, 3 – (5 – 3) – (1 – 3) = 3.
One tests for interaction by computing F = MS_{styles×tints}/MS_{within style–tint} and comparing this F with the F’s tabled at various significance levels for (S – 1)(T – 1) and ST(P – 1) degrees of freedom.
Table 7 – Zero interaction of two factors (hypothetical population means ̄μ_{st})  

Style \ Tint  1  2  3  4  Row means (̄μ_{s.}) 
1  3  4  5  8  5 
2  0  1  2  5  2 
3  0  1  2  5  2 
Column means (̄μ_{.t})  1  2  3  6  3 = μ 
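The zero-interaction property of Table 7 can be checked cell by cell. A minimal sketch, taking the rows as styles and the columns as tints in accordance with the row- and column-mean labels:

```python
import numpy as np

# Population means of Table 7: rows are styles (s = 1, 2, 3),
# columns are tints (t = 1, 2, 3, 4).
mu = np.array([[3, 4, 5, 8],
               [0, 1, 2, 5],
               [0, 1, 2, 5]], dtype=float)

grand = mu.mean()                      # the grand mean, 3
row_effects = mu.mean(axis=1) - grand  # style main effects
col_effects = mu.mean(axis=0) - grand  # tint main effects

# Interaction = cell mean minus grand mean and both main effects;
# for Table 7 it is zero in every cell.
interaction = mu - grand - row_effects[:, None] - col_effects[None, :]
print(interaction)
```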
If there were but one subject reading with each style–tint combination (that is, if there were no replication), further assumptions would have to be made to permit testing of hypotheses about main effects. In particular, it is commonly then assumed that the style × tint interaction is zero, so that the expected mean square for interaction in Table 6 reduces to the underlying variance, and the MS_{styles×tints} may be used in the denominator of the F’s for testing main effects. No test of the assumption of additivity is possible through MS_{within style–tint}, because this quantity cannot be calculated. However, Tukey (1949; see also Winer 1962, pp. 216–220) has provided a one-degree-of-freedom test for interaction, or nonadditivity, of a special kind that can be used for testing the hypothesis of no interaction for these unreplicated experiments of the fixed-effects kind. (See Scheffé 1959, pp. 129–134.)
The factorial design may be extended to three or more factors. With three factors there are four sums of squares for interactions: one for the three-factor interaction (sometimes called a second-order interaction, because a one-factor “interaction” is a main effect) and one each for the three two-factor (that is, first-order) interactions. If the three factors are A, B, and C, their interactions might be represented as A × B × C, A × B, A × C, and B × C. For example, a style of type that for the experiment as a whole yields excellent comprehension may, when combined with a generally effective size of type and a tint of paper that has overall facilitative effect, yield rather poor results. One three-factor factorial experiment permits testing of the hypothesis that there is no second-order interaction and permits the magnitude of such interaction to be estimated, whereas three one-factor experiments or a two-factor experiment and a one-factor experiment do not. Usually, three-factor nonadditivity is difficult to explain substantively.
A large number of more complex designs, most of them more or less incomplete in some respect as compared with factorial designs of the kind discussed above, have been proposed. [See Experimental Design; see also Winer 1962; Fisher 1935.]
The analysis of covariance
Suppose that the publisher in the earlier, style-of-type example had known reading-test scores for his 60 pupils prior to the experiment. He could have used these antecedent scores in the analysis of the comprehension scores to reduce the magnitude of the mean square within styles, which, as the estimate of underlying variance, is the denominator of the computed F. At the same time he would adjust the style means to account for initial differences between reading-test-score means in the three groups. One way of carrying out this more refined analysis would be to perform an analysis of variance of the differences between final comprehension scores and initial reading scores—say, X_{ps} – Y_{ps}. A better prediction of the outcome measure, X_{ps}, might be secured by computing α + βY_{ps}, where α and β are constants to be estimated.
By a statistical procedure called the analysis of covariance, one or more antecedent variables may be used to reduce the magnitude of the sum of squares within styles and also to adjust the observed style means for differences between groups in average initial reading scores. If β ≠ 0, then the adjusted sum of squares within treatments (which provides the denominator of the F-ratio) will be less than the unadjusted SS_{w} of Table 2, thereby tending to increase the magnitude of F. For each independent antecedent variable one uses, one degree of freedom is lost for SS_{w} and none for SS_{b}; the loss of degrees of freedom for SS_{w} will usually be more than compensated for by the decrease in its magnitude.
A principal statistical condition needed for the usual analysis of covariance is that the regression of outcome scores on antecedent scores be the same for every style, because one computes a single within-styles regression coefficient to use in adjusting the within-styles sum of squares. Homogeneity of regression can be tested statistically; see Winer (1962, chapter 11). Some procedures to adopt in the case of heterogeneity of regression are given in Brownlee (1960).
The regression model chosen must be appropriate for the data if the use of one or more antecedent variables is to reduce MS_{w} appreciably. Usually the regression of outcome measures on antecedent measures is assumed to be linear.
The analysis of covariance can be extended to more than one antecedent variable and to more complex designs. (For further details see Cochran 1957; Smith 1957; Winer 1962; McNemar 1949.)
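The covariance adjustment can be sketched under the stated assumptions (a linear regression of outcome on antecedent scores with a common within-styles slope). The data, the slope value 0.6, and the seed below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
S, P = 3, 20

# Hypothetical data: Y = antecedent reading score, X = outcome score,
# linearly related to Y within each style with common slope beta = 0.6.
Y = rng.normal(0, 1, size=(P, S))
X = 0.6 * Y + rng.normal(0, 1, size=(P, S))

# Ordinary within-styles sum of squares (as in Table 2).
x_dev = X - X.mean(axis=0)
y_dev = Y - Y.mean(axis=0)
ss_w = (x_dev ** 2).sum()

# Pooled within-styles regression coefficient of X on Y.
beta_hat = (x_dev * y_dev).sum() / (y_dev ** 2).sum()

# Covariance-adjusted within sum of squares; one df is lost for beta.
ss_w_adj = ss_w - beta_hat * (x_dev * y_dev).sum()
df_w_adj = S * (P - 1) - 1
print(ss_w, ss_w_adj, beta_hat)
```

The adjusted sum of squares can never exceed the unadjusted one, which is why the adjustment tends to increase F when β ≠ 0.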
Models—fixed, finite, random, and mixed
In the example, the publisher’s “target population” of styles of print consisted of just those 3 styles that he tried out, so he exhausted the population of styles of interest to him. Suppose that, instead, he had been considering 39 different styles and had drawn at random from these 39 the 3 styles he used in the experiment. His intention is to determine from the experiment based on these 3 styles whether it would make any difference which one of the 39 styles he used for the textbook (of course, in practice a larger sample of styles would be drawn). If the styles did seem to differ in effectiveness, he would estimate from his experimental data involving only 3 styles the variance of the 39 population means of the styles. Then he might perform further experiments to find the most effective styles.
Finite-effects models
Thus far in this article the model assumed has been the fixed-effects model, in which one uses in the experiment itself all the styles of type to which one wishes to generalize. The 3-out-of-39 experiment mentioned above illustrates a finite-effects model, with only a small percentage (8 per cent, in the example given) of the styles drawn at random for the experiment but where one has the intention of testing the null hypothesis
H_{0}: µ_{1} = µ_{2} = ... = µ_{39}
against all alternative hypotheses and estimating the variance of the 39 style means from the between-styles and within-styles mean squares.
Random-effects models
If the number of “levels” of the factor is very large, so that the number of levels drawn randomly for the experiment is a negligible percentage of the total number, then one has a random-effects model, sometimes called a components-of-variance model or Model II. This model would apply if, for example, one drew 20 raters at random from an actual or hypothetical population of 100,000 raters and used those 20 to rate each of 25 subjects who had been chosen at random from a population of half a million. (Strictly speaking, the number of raters and the number of subjects in the respective populations would have to be infinite to produce the random-effects model, but for practical purposes 100,000/20 and 500,000/25 are sufficiently large.) If every rater rated every subject on one trait (say, gregariousness) there would be 20 × 25 = 500 ratings, one for each experimental combination—that is, one for each rater–subject combination.
This, then, would be a two-factor design without replication, that is, with just one rating per rater–subject combination. (Even if the experimenter had used available raters and subjects rather than drawing them randomly from any populations, he would probably want to generalize to other raters and subjects “like” them; see Cornfield & Tukey 1956, p. 913.)
The usual model for an experiment thus conceptualized is
X_{rs} = µ + a_{r} + b_{s} + e_{rs},
where µ is a grand mean, the a’s are the (random) rater effects, the b’s are the (random) subject effects, and the e’s combine interaction and inherent measurement error. The 20 + 25 + (20 × 25) random variables are supposed to be independent, with var(a_{r}) = σ^{2}_{r}, var(b_{s}) = σ^{2}_{s}, and var(e_{rs}) = σ^{2}_{e}.
For F-testing purposes, the a’s, b’s, and e’s are supposed to be normally distributed.
The analysis-of-variance table in such a case is similar to those presented earlier, except that the expected-mean-square column is changed to the one shown in Table 8.
Table 8

Effect      EMS
Rater       σ^{2}_{e} + 25σ^{2}_{r}
Subject     σ^{2}_{e} + 20σ^{2}_{s}
Error       σ^{2}_{e}
The F-statistic for testing the hypothesis that the main effect of subjects is absent (that is, that σ^{2}_{s} = 0) is MS_{subject}/MS_{error}. Under the null hypothesis that σ^{2}_{s} = 0 the F-statistic has an F-distribution with 24 and 19 × 24 degrees of freedom. (A similar F-statistic is used for testing σ^{2}_{r} = 0.) An unbiased estimator of σ^{2}_{s} is
(MS_{subject} − MS_{error})/20,
with a similar estimator for σ^{2}_{r}. A serious difficulty with these estimators is that they may take negative values; perhaps the best resolution of that difficulty is to enlarge the model. See Nelder (1954), and for another approach and a bibliography, see Thompson (1962).
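As a numerical illustration (not part of the original exposition), the following sketch simulates the 20-rater by 25-subject random-effects design and applies the method-of-moments estimators just described; the true variance components chosen here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
R, S = 20, 25                            # raters, subjects
sig_r, sig_s, sig_e = 1.0, 2.0, 0.5      # assumed true standard deviations

# Simulate X_rs = mu + a_r + b_s + e_rs
a = rng.normal(0.0, sig_r, size=(R, 1))
b = rng.normal(0.0, sig_s, size=(1, S))
X = 5.0 + a + b + rng.normal(0.0, sig_e, size=(R, S))

grand = X.mean()
ms_rater = S * ((X.mean(axis=1) - grand) ** 2).sum() / (R - 1)
ms_subject = R * ((X.mean(axis=0) - grand) ** 2).sum() / (S - 1)
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_error = (resid ** 2).sum() / ((R - 1) * (S - 1))   # 19 x 24 = 456 df

# Method-of-moments estimators from the EMS column; these can go negative
var_s_hat = (ms_subject - ms_error) / R   # estimates sigma^2_s
var_r_hat = (ms_rater - ms_error) / S     # estimates sigma^2_r
F_subject = ms_subject / ms_error         # 24 and 456 degrees of freedom
```

With these (large) true components the estimates land near 4.0 and 1.0; in designs with small components the same formulas can yield the negative estimates discussed above.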
Note that here it appears impossible to separate random interaction from inherent variability, both of which contribute to σ^{2}_{e}, the variance of the e’s; in the random-effects model, however, this does not jeopardize significance tests for main effects.
In more complex Model II situations, the F-tests used are inherently different from their Model I analogues; in particular, sample components of variance are often most reasonably compared, not with the “bottom” estimator of σ², but with some other—usually an interaction—component of variance. (See Hays 1963, pp. 356–489; Brownlee [1960] 1965, pp. 309–396, 467–529.)
Mixed models
If all the levels of one factor are used in an experiment while a random sample of the levels of another factor is used, a mixed model results. Mixed models present special problems of analysis that have been discussed by Scheffé (1959, pp. 261–290) and by Mood and Graybill (1963).
Other topics in ANOVA
Robustness of ANOVA
Fixed-effects models are better understood than the other models and therefore, where appropriate, can be used with considerable confidence. Fixed-effects ANOVA seems “robust” for type I errors to departures from certain mathematical assumptions underlying the F-test, provided that the number of experimental units is the same for each experimental combination. Two of these assumptions are that the e’s are normally distributed and that they have common variance σ² for every one of the experimental combinations. In particular, the common-variance assumption can be relaxed without greatly affecting the probability values for computed F’s. If the number of experimental units does not vary from one factor-level combination to another, then it may be unnecessary to test for heterogeneity of variances preliminary to performing an ANOVA, because ANOVA is robust to such heterogeneity. (In fact, it may be unwise to make such a test, because the usual test for heterogeneity of variance is more sensitive to nonnormality than is ANOVA.) For further discussion of this point see Lindquist (1953, pp. 78–86), Winer (1962, pp. 239–241), Brownlee ([1960] 1965, chapter 9), and Glass (1966). Brownlee (1960) and others have provided the finite-model expected mean squares for the complete three-factor factorial design, from which one can readily determine expected mean squares for three-factor fixed, mixed, and random models.
Analysis-of-variance F’s are unaffected by linear transformation of the observations—that is, by changes in the X_{rs} of the form a + bX_{rs}, where a and b are constants (b ≠ 0). Multiplying every observation by b multiplies every mean square by b². Adding a to every observation does not change the mean squares. Thus, if observations are two-decimal numbers running from, say, –1.22 upward, one could, to simplify calculations, drop the decimal (multiply each number by 100) and then add 122 to each observation. The lowest observation would become 100(–1.22) + 122 = 0. Each mean square would become 100² = 10,000 times as large as for the decimal fractions. With the increasing availability of high-speed digital computers, coding of data is becoming less important than it was formerly.
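The invariance just described is easy to verify numerically. The sketch below (with invented data, not the text’s examples) applies the coding 100X + 122 to three samples and checks that the one-way F is unchanged while the between-groups mean square is multiplied by 100² = 10,000.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
groups = [rng.normal(loc=m, scale=1.0, size=10) for m in (4.0, 4.5, 5.0)]
coded = [100 * g + 122 for g in groups]        # drop decimals, shift origin

F_raw = stats.f_oneway(*groups).statistic
F_coded = stats.f_oneway(*coded).statistic     # identical: F is coding-free

def ms_between(samples):
    # between-groups mean square for equal-sized samples
    n = len(samples[0])
    means = np.array([s.mean() for s in samples])
    return n * ((means - means.mean()) ** 2).sum() / (len(samples) - 1)

ratio = ms_between(coded) / ms_between(groups)  # 100**2 = 10,000
```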
A brief classification of factors
The ANOVA “factors” considered thus far are style of printing type, tint of ink, rater, and subject. Styles differ from each other qualitatively, as do raters and subjects. Tint of ink might vary more quantitatively than do styles, raters, and subjects—as would, for example, size of printing type or temperature in a classroom. Thus, one basis for classifying factors is whether or not their levels are ordered and, if they are, whether meaningful numbers can be associated with the factor levels.
Another basis for classification is whether the variable is manipulated by the experimenter. In order to conduct a “true” experiment, one must assign his experimental units in some (simple or restricted) random fashion to the levels of at least one manipulated factor. ANOVA may be applied to other types of data, such as the scores of Englishmen versus Americans on a certain test, but this is an associational study, not a stimulus–response experiment. Obviously, nationality is not an independent variable in the same sense that printing type is. The direct “causal” inference possible from a well-conducted style-of-type experiment differs from the associational information obtained from the comparison of Englishmen’s scores with those of Americans (see Stanley 1961; 1965; 1966; Campbell & Stanley 1963). Some variables, such as national origin, are impossible to manipulate in meaningful ways, whereas others, such as “enrolls for Latin versus does not enroll for Latin,” can in principle be manipulated, even though they usually are not.
Experimenters use nonmanipulated, classification variables for two chief reasons. First, they may wish to use a factor explicitly in a design in order to isolate the sum of squares for the main effect of that factor so that it will not inflate the estimate of underlying variance—that is, so it will not make the denominator mean square of F unnecessarily large. For example, if the experimental units available for experimentation are children in grades seven, eight, and nine, and if IQ scores are available, it is wise in studying the three styles of type to use the three (ordered) grades as one fixed-effects factor and a number of ordered IQ levels—say, four—as another fixed-effects factor. If the experimenter suspects that girls and boys may react differently to the styles, he will probably use this two-level, unordered classification (girls versus boys) as the third factor. This would produce 3 × 4 × 2 × 3 = 72 experimental combinations, so with at least 2 children per combination he needs not less than 144 children.
Probably most children in the higher grades read better, regardless of style, than do most children in the lower grades, and children with high IQ’s tend to read better than children with lower IQ’s, so the main effects of grade and of IQ should be large. Therefore, the variation within grade–IQ–sex–style groups should be considerably less than within styles alone.
A second reason for using such stratifying or leveling variables is to study their interactions with the manipulated variable. Ninth graders might do relatively better with one style of type and seventh graders relatively better with another style, for example. If so, the experimenter might decide to recommend one style of type for ninth graders and another for seventh graders. With the above design one can isolate and examine one four-factor interaction, four three-factor interactions, six two-factor interactions, and four main effects, a total of 2^{4} – 1 = 15 sources of variation across conditions. In the fixed-effects model all of these are tested against the variation within the experimental combinations, pooled from all combinations. Testing 15 sources of variation instead of 1 will tend to cause more apparently significant F’s at a given tabled significance level than would be expected under the null hypothesis. For any one of the significance tests, given that the null hypothesis is true, one expects 5 spurious rejections of the true null hypothesis out of 100 tests; thus, if an analyst keeps making F-tests within an experiment, he has more than a .05 probability of securing at least one statistically significant F, even if no actual effects exist. There are systematic ways to guard against this (see, for example, Pearson & Hartley 1954, pp. 39–40). At least, one should be suspicious of higher-order interactions that seem to be significant at or near the .05 level. Many an experimenter utilizing a complex design has worked extremely hard trying to interpret a spuriously significant high-order interaction and in the process has introduced his fantasies into the journal literature.
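The inflation can be quantified under the simplifying assumption that the 15 tests are independent (they are not quite, since they share a denominator mean square, so this is only an approximation):

```python
# Chance of at least one spurious "significant" F among 15 tests,
# each run at the .05 level, treated as if independent
alpha, m = 0.05, 15
p_any = 1 - (1 - alpha) ** m
print(round(p_any, 3))   # 0.537
```

That is, even with every null hypothesis true, better-than-even odds of at least one apparently significant F.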
Studies in which researchers do not manipulate any variables are common and important in the social sciences. These include opinion surveys, studies of variables related to injury in automobile accidents, and studies of the Hiroshima and Nagasaki survivors. ANOVA proves useful in many such investigations. [See Campbell & Stanley 1963; Lindzey 1954; see also Experimental design, article on quasi-experimental design.]
“Nesting” and repeated measurements
Many studies and experiments in the social sciences involve one or more factors whose levels do not “cross” the levels of certain other factors. Usually these occur in conjunction with repeated measurements taken on the same individuals. For example, if one classification is school and another is teacher within school, where each teacher teaches two classes within her school with different methods, then teachers are said to be “nested” within schools. Schools can interact with methods (a given method may work relatively better in one school than in another) and teachers can interact with methods within schools (a method that works relatively better for one teacher does not necessarily produce better results for another teacher in the same school), but schools cannot interact with teachers, because teachers do not “cross” schools—that is, the same teacher does not teach at more than one school.
This does not mean that a given teacher might not be more effective in another school but merely that the experiment provides no evidence on that point. One could, somewhat inconveniently, devise an experiment in which teachers did cross schools, teaching some classes in one school and some in another. But an experimenter could not, for example, have boys cross from delinquency to nondelinquency and vice versa, because delinquency–nondelinquency is a personal rather than an environmental characteristic. (For further discussion of nested designs see Brownlee [1960] 1965, chapters 13 and 15.)
If the order of repeated measurements on each individual is randomized, as when each person undergoes several treatments successively in random order, there is more likelihood that ANOVA will be appropriate than when the order cannot be randomized, as occurs, for instance, when the learning process is studied over a series of trials. Complications occur also if the successive treatments have differential residual effects; taking a difficult test first may discourage one person in his work on the easier test that follows but make another person try harder. These residual effects seem likely to be of less importance if enough time occurs between successive treatment levels for some of the immediate influence of the treatment to dissipate. Human beings cannot have their memories erased like calculating machines, however, so repeated-measurement designs, although they usually reduce certain error terms because intraindividual variability tends to be less than interindividual variability, should not be used indiscriminately when analogous designs without repeated measurements are experimentally and financially feasible. (For further discussion see Winer 1962; Hays 1963, pp. 455–456; Campbell & Stanley 1963.)
Missing observations
For two factors with levels s = 1, 2, ..., S and t = 1, 2, ..., T in the experiment, such that the number of experimental units for the (s, t)th experimental combination is n_{st}, one usually designs the experiment so that n_{st} = n, a constant for all s and t. A few missing observations at the end of the experiment do not rule out a slightly adjusted simple ANOVA, if they were not caused differentially by the treatments. If, for example, one treatment was to administer a severe shock on several occasions, and the other was to give ice cream each time, it would not be surprising to find that fewer shocked than fed experimental subjects come for the final session. The outcome measure might be arithmetical-reasoning score; but if only the more shock-resistant subjects take the final test, comparison of the two treatments may be biased. There would be even more difficulty with, say, a male–female by shocked–fed design, because shocking might drive away more women than men (or vice versa).
When attrition is not caused differentially by the factors one may, for one-factor ANOVA, perform the usual analysis. For two or more factors, adjustments in the analysis are required to compensate for the few missing observations. (See Winer 1962, pp. 281–283, for example, for appropriate techniques.)
The power of the F-test
There are two kinds of errors that one can make when testing a null hypothesis against alternative hypotheses: one can reject the null hypothesis when in fact it is true, or one can fail to reject the null hypothesis when in fact it is false. Rejecting a true null hypothesis is called an “error of the first kind,” or a “type I error.” Failing to reject an untrue null hypothesis is called an “error of the second kind,” or a “type II error.” The probability of making an error of the first kind is called the size of the significance test and is usually signified by α. The probability of making an error of the second kind is usually signified by β. The quantity 1 – β is called the power of the significance test.
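For the one-way fixed-effects F-test these quantities can be computed from the noncentral F-distribution. The sketch below is an illustration, not part of the original text; the design numbers (k groups, n units per group, unit error variance, means patterned on the printing-styles example) are assumptions.

```python
from scipy import stats

# Power of a one-way fixed-effects F-test at alpha = .05
k, n, sigma2 = 3, 20, 1.0        # groups, units per group, error variance
mus = [1.0, 0.0, 0.0]            # hypothetical population means
mu_bar = sum(mus) / k

# Standard noncentrality parameter: n * sum (mu_i - mu_bar)^2 / sigma^2
lam = n * sum((m - mu_bar) ** 2 for m in mus) / sigma2

dfn, dfd = k - 1, k * (n - 1)
f_crit = stats.f.ppf(0.95, dfn, dfd)           # alpha = .05 critical value
power = stats.ncf.sf(f_crit, dfn, dfd, lam)    # power = 1 - beta
```

With these values the computed power comes out near .9, consistent with the sample-size example discussed below.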
If there is no limitation on the number of experimental units available one can fix both α and β at any desired levels prior to the experiment. To do this some prior estimate of σ² is required, and it is also necessary to state what nonnull difference among the factor-level means is considered large enough to be worth detecting. This latter requirement is quite troublesome in many social science experiments, because a good scale of value (such as dollars) is seldom available. For example, how much is a one-point difference between the means of style 1 and style 2 on a reading-comprehension test worth educationally? Intelligence quotients and averages of college grades are quasi-utility scales, although one seldom thinks of them in just that way. How much is a real increase in IQ from 65 to 70 worth? How much more utility for the college does a grade-point average of 2.75 (where C = 2 and B = 3) have than a grade-point average of 2.50? (For further discussion of this topic see Chernoff & Moses 1959.)
In the hypothetical printing-styles example (Tables 1 and 5) it is known that σ² = 1 and that the population mean of style 1 is one point greater than the population means of styles 2 and 3, so with this information it is simple to enter Winer’s Table B.11 (1962, p. 657) with, for example, α = .05 and β = .10 and to find that for each of the three styles n = 20 experimental units are needed.
In actual experiments, where σ² and the differences among means of interest to the experimenter are usually not known, the situation is more difficult (see Brownlee [1960] 1965, pp. 97–111; McNemar [1949] 1962, pp. 63–69; Hays 1963; and especially Scheffé 1959, pp. 38–42, 62–65, 437–455).
Alternatives to analysis of variance
If one conducted an experiment to determine how well ten-year-old boys add two-digit numbers at five equally spaced atmospheric temperatures, he could use the techniques of regression analysis to determine the equation for the line that best fits the five means (in the sense of minimum squared discrepancies). This line might be of the simple form α + βT (that is, straight with slope β and intercept α) or it might be based on some other function of T. [See Winer 1962 for further discussion of trend analysis; see also Linear hypotheses, article on regression.]
The symmetrical two-tail t-test is a special case of the F-test (t² = F with one numerator degree of freedom); likewise, the unit normal deviate (z), called the “critical ratio” in old statistics textbooks when used for testing significance, is a special case of F: z² = F_{1,∞}. The F-distribution is closely related to the chi-square distribution. [For further discussion of these relationships, see Distributions, statistical, article on special continuous distributions.]
For speed and computational ease, or when assumptions of ANOVA are violated so badly that results would seem dubious even if the data were transformed, there are other procedures available (see Winer 1962). Some of these procedures involve consecutive, untied ranks, whose means and variances are parameters dependent only on the number of ranks; an important example is the Kruskal–Wallis analysis of variance for ranks (Winer 1962, pp. 622–623). Other procedures employ the binomial expansion (p + q)^{n} or the chi-square approximation to it for “sign tests.” Still others involve dichotomizing the values for each treatment at the median and computing χ². Range tests may be used also. [See Winer 1962, p. 77; McNemar (1949) 1962, chapter 19. Some of these procedures are discussed in Nonparametric statistics.]
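As an illustration of one such rank procedure, the Kruskal–Wallis test can be applied to small invented samples (the scores below are hypothetical, not from the text’s tables):

```python
from scipy import stats

# Hypothetical reading-comprehension scores under three styles of type
style1 = [7, 9, 6, 8, 10]
style2 = [5, 6, 4, 7, 6.5]
style3 = [3, 4.5, 2, 5.5, 3.5]

# The H statistic is referred to a chi-square distribution with k - 1 = 2 df
H, p = stats.kruskal(style1, style2, style3)
```

Only the ranks of the pooled observations enter H, so the test is unaffected by any monotone transformation of the scores.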
When the normal assumption is reasonable, there are often available testing and other procedures that are competitive with the F-test. The latter has factotum utility, and it has optimal properties when the alternatives of interest are symmetrically arranged relative to the null hypothesis. But when the alternatives are asymmetrically arranged, or in other special circumstances, competitors to F procedures may be preferable. Particularly worthy of mention are Studentized range tests (see Scheffé 1959, pp. 82–83) and half-normal plotting (see Daniel 1959).
Special procedures are useful when the alternatives specify an ordering. For example, in the style-of-type example it might be known before the experiment that if there is any difference between the styles, style 1 is better than style 2, and style 2 better than style 3 (see Bartholomew 1961; Chacko 1963).
It is also important to mention here the desirability of examining residuals (observations less the estimates of their expectations) as a check on the model and as a source of suggestions toward useful modifications. [See Statistical analysis, special problems of, article on transformations of data; see also Anscombe & Tukey 1963.
Often an observed value appears to be so distant from the other values that the experimenter is tempted to discard it before performing an ANOVA. For a discussion of procedures in such cases, see Statistical analysis, special problems of, article on outliers.]
Multivariate analysis of variance
The analysis of variance is multivariate in the independent variables (the factors) but univariate in the dependent variables (the outcome measures). S. N. Roy (for example, see Roy & Gnanadesikan 1959) and others have developed a multivariate analysis of variance (MANOVA), multivariate with respect to both independent and dependent variables, of which ANOVA is a special case. A few social scientists (for example, Rodwan 1964; Bock 1963) have used MANOVA, but as yet it has not been used widely by workers in these disciplines.
Julian C. Stanley
BIBLIOGRAPHY
Anscombe, F. J.; and Tukey, John W. 1963 The Examination and Analysis of Residuals. Technometrics 5:141–160.
Bartholomew, D. J. 1961 Ordered Tests in the Analysis of Variance. Biometrika 48:325–332.
Bock, R. Darrell 1963 Programming Univariate and Multivariate Analysis of Variance. Technometrics 5: 95–117.
Brownlee, Kenneth A. (1960) 1965 Statistical Theory and Methodology in Science and Engineering. 2d ed. New York: Wiley.
Campbell, Donald T.; and Stanley, Julian C. 1963 Experimental and Quasi-experimental Designs for Research on Teaching. Pages 171–246 in Nathaniel L. Gage (editor), Handbook of Research on Teaching. Chicago: Rand McNally. → Republished in 1966 as a separate monograph titled Experimental and Quasi-experimental Designs for Research.
Chacko, V. J. 1963 Testing Homogeneity Against Ordered Alternatives. Annals of Mathematical Statistics 34:945–956.
Chernoff, Herman; and Moses, Lincoln E. 1959 Elementary Decision Theory. New York: Wiley.
Cochran, William G. 1957 Analysis of Covariance: Its Nature and Uses. Biometrics 13:261–281.
Cornfield, Jerome; and Tukey, John W. 1956 Average Values of Mean Squares in Factorials. Annals of Mathematical Statistics 27:907–949.
Daniel, Cuthbert 1959 Use of Half-normal Plots in Interpreting Factorial Two-level Experiments. Technometrics 1:311–341.
Fisher, R. A. (1925) 1958 Statistical Methods for Research Workers. 13th ed. New York: Hafner. → Previous editions were also published by Oliver & Boyd.
Fisher, R. A. (1935) 1960 The Design of Experiments. 7th ed. London: Oliver & Boyd; New York: Hafner.
Glass, Gene V. 1966 Testing Homogeneity of Variances. American Educational Research Journal 3:187–190.
[Gosset, William S.] (1908) 1943 The Probable Error of a Mean. Pages 11–34 in William S. Gosset, “Student’s” Collected Papers. London: University College, Biometrika Office. → First published in Volume 6 of Biometrika.
Hays, William L. 1963 Statistics for Psychologists. New York: Holt.
Lindquist, Everet F. 1953 Design and Analysis of Experiments in Psychology and Education. Boston: Houghton Mifflin.
Lindzey, Gardner (editor) (1954) 1959 Handbook of Social Psychology. 2 vols. Cambridge, Mass.: Addison–Wesley. → Volume 1: Theory and Method. Volume 2: Special Fields and Applications. A second edition, edited by Gardner Lindzey and Elliot Aronson, is in preparation.
Lubin, Ardie 1961 The Interpretation of Significant Interaction. Educational and Psychological Measurement 21:807–817.
McLean, Leslie D. 1967 Some Important Principles for the Use of Incomplete Designs in Behavioral Research. Chapter 4 in Julian C. Stanley (editor), Improving Experimental Design and Statistical Analysis. Chicago: Rand McNally.
McNemar, Quinn (1949) 1962 Psychological Statistics. 3d ed. New York: Wiley.
Mood, Alexander M.; and Graybill, Franklin A. 1963 Introduction to the Theory of Statistics. 2d ed. New York: McGraw-Hill. → The first edition was published in 1950.
Nelder, J. A. 1954 The Interpretation of Negative Components of Variance. Biometrika 41:544–548.
Pearson, Egon S.; and Hartley, H. O. (editors) (1954) 1966 Biometrika Tables for Statisticians. Volume 1. 3d ed. Cambridge Univ. Press. → A revision of Tables for Statisticians and Biometricians (1914), edited by Karl Pearson.
Rand Corporation 1955 A Million Random Digits With 100,000 Normal Deviates. Glencoe, Ill.: Free Press.
Rodwan, Albert S. 1964 An Empirical Validation of the Concept of Coherence. Journal of Experimental Psychology 68:167–170.
Roy, S. N.; and Gnanadesikan, R. 1959 Some Contributions to ANOVA in One or More Dimensions: I and II. Annals of Mathematical Statistics 30:304–317, 318–340.
Sampford, Michael R. (editor) 1964 In Memoriam Ronald Aylmer Fisher, 1890–1962. Biometrics 20, no. 2:237–373.
Scheffé, Henry 1959 The Analysis of Variance. New York: Wiley.
Smith, H. Fairfield 1957 Interpretation of Adjusted Treatment Means and Regressions in Analysis of Covariance. Biometrics 13:282–308.
Stanley, Julian C. 1961 Studying Status vs. Manipulating Variables. Phi Delta Kappa Symposium on Educational Research, Annual Phi Delta Kappa Symposium on Educational Research: [Proceedings] 2:173–208. → Published in Bloomington, Indiana.
Stanley, Julian C. 1965 Quasi–experimentation. School Review 73:197–205.
Stanley, Julian C. 1966 A Common Class of Pseudo–experiments. American Educational Research Journal 3:79–87.
Thompson, W. A., Jr. 1962 The Problem of Negative Estimates of Variance Components. Annals of Mathematical Statistics 33:273–289.
Tukey, John W. 1949 One Degree of Freedom for Non–additivity. Biometrics 5:232–242.
Winer, B. J. 1962 Statistical Principles in Experimental Design. New York: McGraw–Hill.
III MULTIPLE COMPARISONS
Multiple comparison methods deal with a dilemma arising in statistical analysis: On the one hand, it would be unfortunate not to analyze the data thoroughly in all its aspects; on the other hand, performing several significance tests, or constructing several confidence intervals, for the same data compounds the error rates (significance levels), and it is often difficult to compute the overall error probability.
Multiple comparison and related methods are designed to give simple overall error probabilities for analyses that examine several aspects of the data simultaneously. For example, some simultaneous tests examine all differences between several treatment means.
Cronbach (1949, especially pp. 399–403) describes the problem of inflation of error probabilities in multiple comparisons. The solutions now available are, for the most part, of a later date (see Ryan 1959; Miller 1966). Miller’s book provides a comprehensive treatment of the major aspects of multiple comparisons.
Normal means—confidence regions, tests
1. Simultaneous limits for several means
As a simple example of a situation in which multiple comparison methods might be applied, suppose that independent random samples are drawn from three normal populations with unknown means, μ_{1}, μ_{2}, μ_{3}, but known variances, σ^{2}_{1}, σ^{2}_{2}, σ^{2}_{3}. If only the first sample were available, a 99 per cent confidence interval could be constructed for μ_{1}:
(1) ̄X_{1} − 2.58σ_{1}/√n_{1} ≤ μ_{1} ≤ ̄X_{1} + 2.58σ_{1}/√n_{1},
where ̄X_{1} is the sample mean, and n_{1} the size, of the first sample. In hypothetical repetitions of the procedure, the confidence interval covers, or includes, the true value of μ_{1} 99 per cent of the time in the long run. [See Estimation, article on confidence intervals and regions.]
If all three samples are used, three statements like (1) can be made, successively replacing the subscript “1” by “2” and “3.” The probability that all three statements together are true, however, is not .99 but .99 × .99 × .99, or .9703.
In a coordinate system with three axes marked μ_{1}, μ_{2}, and μ_{3}, the three intervals together define a 97 per cent (approximately) confidence box. This confidence box is shown in Figure 1. In order to obtain a 99 per cent confidence box—that is, to have all three statements hold simultaneously with probability .99—the confidence levels for the three individual statements must be increased. One method would be to make each individual confidence level equal to .9967, the cube root of .99.
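The arithmetic of the confidence box is easy to check; the normal multiplier computed at the end is an illustrative addition (assuming known variances, as in the example):

```python
from scipy import stats

joint = 0.99 ** 3                # coverage of three independent 99 per cent intervals
per_interval = 0.99 ** (1 / 3)   # per-interval level giving a joint .99 box

# Two-sided normal multiplier at the .9967 per-interval level,
# i.e. the number that would replace 2.58 in each interval
z = stats.norm.ppf((1 + per_interval) / 2)
print(round(joint, 4), round(per_interval, 4))   # 0.9703 0.9967
```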
The simple two-tail test of the null hypothesis H_{0}: μ_{1} = 0 rejects it (at significance level .01) if the value 0 is not caught inside the confidence interval (1). It is natural to think of extending this test to the composite null hypothesis μ_{1} = 0 and μ_{2} = 0 and μ_{3} = 0 by rejecting the composite hypothesis if the point (0,0,0) is outside the confidence box corresponding to (1). The significance level of this procedure, however, is not .01 but 1 − .9703, almost .03. In order to reduce the significance level to .01, “2.58” in (1) must be replaced by a higher number. If this is done symmetrically, the significance level for each of the three individual statements like (1) must be .0033. In this argument any hypothetical values of the means may be used in place of 0, 0, 0 to specify the null hypothesis; the corresponding point then takes the place of (0,0,0).
The same principles can be applied just as easily to the case where the three variances are not known but are estimated from the respective samples, in which case 1 per cent points of Student’s t-distribution take the place of 2.58. Of course, any other significance levels may also be used instead of 1 per cent.
Pooled estimate of variance. The problem considered so far is atypically simple because the three intervals are statistically independent, so that probabilities can simply be multiplied. This is no longer true if the variances are unknown but are assumed to be equal and are estimated by a single pooled estimate of variance, σ̂², which is the sum of the three within-sample sums of squares divided by n_{1} + n_{2} + n_{3} − 3. This is equal to the mean square used in the denominator of an analysis-of-variance F [see Linear hypotheses, article on analysis of variance]. The conditions
̄X_{i} − Mσ̂/√n_{i} ≤ μ_{i} ≤ ̄X_{i} + Mσ̂/√n_{i},   i = 1, 2, 3
(where M is a constant to be chosen), use the same σ̂ and hence are not statistically independent. Thus, the probability that all three hold simultaneously is not the product of the three separate probabilities, although this is still a surprisingly good approximation, adequate for most purposes.
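A small numerical sketch of the pooled estimate (the data are invented for illustration):

```python
import numpy as np

# Pooled estimate of the common variance from three samples
samples = [np.array([4.1, 5.0, 3.8, 4.6]),
           np.array([5.2, 4.9, 5.8]),
           np.array([3.9, 4.4, 4.0, 4.3, 4.7])]

# Sum of the within-sample sums of squares ...
ss_within = sum(((x - x.mean()) ** 2).sum() for x in samples)
# ... divided by n1 + n2 + n3 - 3
ddf = sum(len(x) for x in samples) - len(samples)
sigma2_hat = ss_within / ddf
```

The estimate σ̂² appears in every one of the three conditions above, which is exactly why they are not independent.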
Critical values, M_{β}, have, however, been computed for β = .05 and .01 and for any number of degrees of freedom (n_{1} + n_{2} + n_{3} − 3) of σ̂². If M_{β} is substituted for M in the three intervals, the probability that all three conditions simultaneously hold is 1 − β (Tukey 1953).
Exactly the same principles described for the problem of estimating, or testing, three population means also apply to k means. A table providing critical values M_{β} for k = 2, 3, …, 10 and for various numbers of degrees of freedom, N − k, has been computed by Pillai and Ramachandran (1954). Part of the table is reproduced in Miller (1966). The square of M_{β} was tabulated earlier by Nair (1948a) for use in another context (see Section 7, below). This table is reproduced in Pearson and Hartley ([1954] 1966, table 19).
Notation. In the following exposition, “̄X_{i}” and “μ_{i}” represent sample and population means, respectively (i = 1, …, k); “σ²” the population variance, generally assumed to be common to all k populations; “σ̂²” the pooled sample estimate of σ²; and “SE” the estimated standard error of a statistic (SE will depend on σ̂², on the particular statistic, and on the sample sizes involved). The symbol “∑” always denotes summation over i, from 1 to k, unless otherwise specified; N denotes ∑n_{i}, the total sample size; and “ddf” stands for “denominator degrees of freedom,” the degrees of freedom of σ̂².
2. Treatments versus control (Dunnett)
Many studies are concerned with the difference between means rather than with the means themselves. For example, sample 1 may consist of controls (that is, observations taken under standard conditions) to be used for comparison with samples 2, 3, …, k (taken under different treatments or nonstandard conditions), for the purpose of estimating the treatment effects, μ_{2} − μ_{1}, …, μ_{k} − μ_{1}. For k = 3, 4, …, 10, for any number of denominator degrees of freedom, N − k, greater than 4, and for β = .05 and .01, Dunnett (1955; also in Miller 1966) has tabulated critical values D_{β} such that with probability approximately equal to 1 − β, all k − 1 statements
will be simultaneously true—that is, all k − 1 effects μ_{i} − μ_{1} will be covered by confidence intervals centered at ̄X_{i} − ̄X_{1} with half-lengths D_{β}SE, where SE = σ̂√(1/n_{i} + 1/n_{1}).
The overall probability is exactly 1 − β if all k sample sizes are equal. It is not the product of k − 1 probabilities (obtained from Student’s t-distribution) of the separate confidence statements, because these are not statistically independent; dependence comes not only from the common estimator σ̂ in all statements but also from the correlation (ρ = .5 for sample sizes roughly the same) between any two differences ̄X_{i} − ̄X_{1} with ̄X_{1} in common. Surprisingly enough, the product rule gives a close approximation just the same.
Viewed as restrictions on the point (μ_{1}, μ_{2}, μ_{3}) in three-space, the two (pairs of) inequalities for k = 3 define a confidence region that is the intersection of the slab bounded by the two parallel planes μ_{2} – μ_{1} = X̄_{2} – X̄_{1} ± D_{α}SE and another slab at 45° to the first slab. This is illustrated in Figure 2, where for simplicity all n_{i} are assumed to be equal. The region is a prism that is infinite in length, is parallel to the 45° line μ_{1} = μ_{2} = μ_{3}, and has a rhombus as its cross section.
Dunnett’s significance test rejects the null hypothesis, H_{0}: μ_{2} = … = μ_{k} = μ_{1}, in favor of the alternative hypothesis that one or more of the μ_{i} differ from μ_{1} if the k – 1 confidence intervals do not all contain the value 0 or, equivalently, if

(2)   |t_{i1}| = |X̄_{i} – X̄_{1}|/SE ≥ D_{α}

for any i (i = 2, …, k). If the null hypothesis is of the less trivial form μ_{i} – μ_{1} = d_{i1}, where the d_{i1} are any specified constants, then d_{i1} is subtracted from the differences of sample means in the numerators of the t_{i1}.
The probability of rejecting H_{0} if it is true, called the error rate experimentwise, is exactly the stated α if all sample sizes are equal, and is approximately α for unequal n_{i}, provided the inequality is not gross. Dunnett (1955) showed that a design using equal n_{i}, i = 2, …, k, but with n_{1} larger in about the proportion √(k – 1), is most efficient. Unfortunately this leads to true error rates exceeding the stated α if Dunnett’s table is used, and it is then safer to substitute a Bonferroni t-statistic for Dunnett’s D_{α} if k is as big as 6 or 10 (for Bonferroni t,
see Section 14, below; see also Miller 1966, table 2).
Simultaneous one-tail tests are of the same form as (2), above, except that the absolute-value signs are removed and an appropriate smaller critical value D_{α}, also tabulated in Dunnett (1955), is used. The corresponding confidence intervals are one-sided, extending to infinity on the other side.
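As a sketch of the computation (not Dunnett's own tables), the simultaneous treatment-versus-control intervals can be formed with a Bonferroni critical value in place of D_{α}; the text notes that the product rule, and hence Bonferroni, gives a close, slightly conservative approximation. The normal quantile used below is adequate only when ddf is large, and the function name and data are illustrative.

```python
import math
from statistics import NormalDist

def dunnett_like_intervals(control, treatments, alpha=0.05):
    """Simultaneous two-sided intervals for mu_i - mu_1 (treatments vs. control).

    Sketch only: a Bonferroni critical value from the normal distribution
    stands in for Dunnett's tabulated D_alpha (a close, slightly
    conservative approximation when ddf is large).
    """
    k_minus_1 = len(treatments)
    samples = [control] + treatments
    n_total = sum(len(s) for s in samples)
    means = [sum(s) / len(s) for s in samples]
    # pooled variance estimate sigma-hat^2, with N - k degrees of freedom
    ss = sum(sum((x - m) ** 2 for x in s) for s, m in zip(samples, means))
    var = ss / (n_total - len(samples))
    # Bonferroni: upper (alpha / (2(k-1)))-point of the normal, in place of D_alpha
    crit = NormalDist().inv_cdf(1 - alpha / (2 * k_minus_1))
    intervals = []
    for s, m in zip(treatments, means[1:]):
        se = math.sqrt(var * (1 / len(s) + 1 / len(control)))
        diff = m - means[0]
        intervals.append((diff - crit * se, diff + crit * se))
    return intervals
```

Each returned interval is centered at X̄_{i} – X̄_{1}; with equal sample sizes the true joint coverage is slightly above the nominal 1 – α.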
3. All differences—Tukey method
In order to compare several means with one another rather than only with a single control, a method of Tukey’s (1953) is suitable. It provides simultaneous confidence intervals (or significance tests, if desired) for all ½k(k – 1) differences, μ_{i} – μ_{j}, among k means.
A constant, T_{α}, is chosen so that the probability is at least 1 – α that all ½k(k – 1) statements
|(X̄_{i} – X̄_{j}) – (μ_{i} – μ_{j})| < T_{α}SE,
or, equivalently,
|t_{ij}| = |(X̄_{i} – X̄_{j}) – (μ_{i} – μ_{j})|/SE < T_{α}
will be simultaneously true. Here SE = σ̂ √(1/n_{i} + 1/n_{j}), so that the confidence intervals are centered at X̄_{i} – X̄_{j} and have half-lengths T_{α}SE. In a significance test of the null hypothesis, H_{0}, that the differences, μ_{i} – μ_{j}, have any specified (mutually consistent) values, d_{ij} (often 0), one substitutes d_{ij} for μ_{i} – μ_{j} in the t-ratios and rejects H_{0} if the largest ratio is not less than T_{α}.
The constant T_{α} is R_{α}/√2 ≈ .707R_{α}, where R_{α} is the upper α-point in the distribution of the Studentized range. Table 29 of Pearson and Hartley ([1954] 1966) shows R_{α} for α = .10, .05, and .01, for values of k up to 20, and for any number of ddf. Briefer tables are found in Vianelli (1959) and in a number of textbooks—for example, Winer (1962). More extensive tables prepared by Harter (1960) can also be found in Miller (1966).
Geometrically, Tukey’s (1 – α)-confidence region can be obtained, for k = 3, by widening and thickening Dunnett’s prism (Figure 2) in the proportion T_{α} : D_{α} and then removing a pair of triangular prisms by intersection with a third slab. The cross section is hexagonal.
Tukey’s multiple comparisons are frequently used after an F-test rejects H_{0} but may also be used in place of F.
Simplified multiple t-tests. Simplified multiple t-tests, which were developed by Tukey, use the sum of sample ranges in place of σ̂ and a critical value adjusted accordingly. (See Kurtz et al. 1965.)
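A minimal sketch of the Tukey intervals, assuming T_{α} = R_{α}/√2 has already been looked up in Studentized-range tables (for instance, T_{α} = 2.34 for k = 3 at the 5 per cent level with ddf = ∞, as in Table 4, below); the function name and data are illustrative.

```python
import math
from itertools import combinations

def tukey_intervals(samples, t_alpha):
    """Simultaneous intervals for all pairwise differences mu_i - mu_j.

    Sketch only: t_alpha = R_alpha / sqrt(2) must come from
    Studentized-range tables for the appropriate k and ddf.
    """
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    means = [sum(s) / len(s) for s in samples]
    # pooled variance estimate sigma-hat^2, with N - k degrees of freedom
    ss = sum(sum((x - m) ** 2 for x in s) for s, m in zip(samples, means))
    var = ss / (n_total - k)
    intervals = {}
    for i, j in combinations(range(k), 2):
        se = math.sqrt(var * (1 / len(samples[i]) + 1 / len(samples[j])))
        d = means[i] - means[j]
        intervals[(i, j)] = (d - t_alpha * se, d + t_alpha * se)
    return intervals
```

A pair is declared significantly different exactly when its interval excludes 0 (or excludes the hypothesized d_{ij}).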
4. One outlying mean (slippage)
In comparing k populations it may be desirable to find out whether one of them (which one is not specified in advance) is outstanding (has “slipped”) relative to the others. Then, using k independent treatment samples, one may examine the differences X̄_{i} – X̄, where X̄ is the mean of the combined sample of all N observations.
Halperin provided critical values H_{α} such that with probability approximately 1 – α,

|X̄_{i} – X̄|/SE < H_{α}, where SE = σ̂/√n_{i},

simultaneously for i = 1, …, k (Halperin et al. 1955). The probability is exactly 1 – α in the case of equal n_{i}. This provides two-sided tests for the null hypothesis that all the μ_{i} are equal and simultaneous confidence intervals for all the μ_{i} – μ̄ in the usual way. In case the table is not at hand, a good approximation to H_{α} is (upper (α/2k)-point of Student’s t) × √[(k – 1)/k].
Critical values for the corresponding one-sided test, to ascertain whether one of the means has slipped in a specified direction (for example, whether it has slipped down), were first computed by Nair (1952). David (1962a; 1962b) provides improved tables. A refinement of Nair’s test and of Halperin’s is presented by Quesenberry and David (1961). In Pearson and Hartley ([1954] 1966), tables 26a and 26b (and the explanation on p. 51) pertain to these methods, whereas table 26 is Nair’s statistic.
5. Contrasts—Scheffé method
A contrast in k population means is a linear combination, ∑c_{i}μ_{i}, with coefficients adding up to zero: ∑c_{i} = 0. A contrast is always equal to a multiple of the difference between weighted averages of two sets of means—that is, constant × (∑a_{h}μ_{h} – ∑b_{j}μ_{j}), with the summations running over two subsets of the subscripts (1, …, k) having no subscript in common and with ∑a_{h} = 1, ∑b_{j} = 1. The simple differences, μ_{i} – μ_{j}, are special contrasts. Other examples include contrasts representing a difference between two groups of means (for example, ⅓[μ_{2} + μ_{3} + μ_{5}] – ½[μ_{1} + μ_{4}]), slippage of one mean (for example, μ_{2} – μ̄, since this is equal to {[k – 1]/k}[μ_{2} – (μ_{1} + μ_{3} + … + μ_{k})/(k – 1)]), or trend (for example, –3μ_{1} – μ_{2} + μ_{3} + 3μ_{4}).
In an exploratory study to compare k means when little is known to suggest a specific pattern of differences in advance, any and all striking contrasts revealed by the data will be of interest. Also, when looking for slippage or simple differences one may wish to take account of some other, unanticipated, pattern displayed by the data.
Any of the systems of multiple comparisons discussed in sections 1–4 can be adapted to obtain tests, or simultaneous intervals, for all contrasts. For example, the k – 1 simultaneous Dunnett conditions of Section 2, above, with critical value D_{α}, imply that every contrast, ∑c_{i}μ_{i}, falls into an interval centered at ∑c_{i}X̄_{i} with half-length D_{α}SE∑_{i≠1}|c_{i}|, in the case of equal sample sizes.
The following method, developed by Scheffé, however, is more efficient for all-contrasts analyses, because it yields shorter intervals for most contrasts. Scheffé proved that the largest of all the (infinitely many) Studentized contrasts, the maximum over contrasts of |∑c_{i}(X̄_{i} – μ_{i})|/SE, is equal to √[(k – 1)F], where F is the analysis-of-variance F-ratio for testing equality of all the μ_{i} and where SE = σ̂√(∑c_{i}^{2}/n_{i}). Thus, simultaneous confidence intervals for all contrasts, ∑c_{i}μ_{i}, are centered at ∑c_{i}X̄_{i} and have half-lengths √[(k – 1)F_{α}] SE. The confidence level is exactly the stated 1 – α, regardless of whether sample sizes are equal.
For k = 3, any particular interval can be depicted in (μ_{1}, μ_{2}, μ_{3})-space by a pair of parallel planes equidistant from the line given by μ_{1} – X̄_{1} = μ_{2} – X̄_{2} = μ_{3} – X̄_{3} through the point (X̄_{1}, X̄_{2}, X̄_{3}). Together these planes constitute all the tangent planes of the cylinder (in the “variables” μ_{1}, μ_{2}, μ_{3})

∑n_{i}(X̄_{i} – μ_{i})^{2} – [∑n_{i}(X̄_{i} – μ_{i})]^{2}/N = (3 – 1)σ̂^{2}F_{α},
where F_{α} has degrees of freedom 3 – 1 and N – 3. This cylinder, like the prism of Figure 2, is infinite in length and equally inclined to the coordinate axes. (As in the case of the regions for Dunnett’s and Tukey’s procedures, the addition of the same constant to each of the coordinates μ_{1}, μ_{2}, μ_{3} of a point on the surface will move this point along the surface.) See Figure 3.
Significance test. A value of F ≥ F_{α} implies that |∑c_{i}X̄_{i}| ≥ √[(k – 1)F_{α}] SE for at least one contrast (namely, at least for the maximum Studentized contrast). Scheffé’s multiple comparison test declares ∑c_{i}X̄_{i} to be statistically significant—that is, ∑c_{i}μ_{i} different from zero—for all those contrasts for which the inequality is true. Thus, one may test every contrast of interest, or every contrast that looks promising, and incur a risk of just α of falsely declaring any ∑c_{i}μ_{i} whatsoever to be different from zero; in other words, the probability of making no false statement of the form ∑c_{i}μ_{i} ≠ 0 is 1 – α, and the probability of making one or more such statements is α. Of course, the Scheffé approach gives a larger confidence interval (or decreased power) than the analogous single-contrast procedure if only a single contrast is of interest.
General linear combinations. Simultaneous confidence intervals, or tests, can also be obtained for all possible linear combinations, ∑c_{i}μ_{i}, with the restriction ∑c_{i} = 0 lifted. Then Scheffé’s confidence and significance statements for contrasts remain applicable, except that (k – 1)F_{α} is changed to kF_{α} and the numerator degrees of freedom of F are changed from k – 1 to k. (See Miller 1966, chapter 2, sec. 2.)
A confidence region for all (standardized) linear combinations consists of the ellipsoid, in the k-dimensional space with axes labeled μ_{1}, μ_{2}, …, μ_{k}, given by ∑n_{i}(X̄_{i} – μ_{i})^{2} ≤ kσ̂^{2}F_{α}. For k = 3, any particular interval can be depicted in (μ_{1}, μ_{2}, μ_{3})-space by a pair of parallel planes equidistant from the point (X̄_{1}, X̄_{2}, X̄_{3}). Together these planes constitute all the tangent planes of the confidence ellipsoid (in the “variables” μ_{1}, μ_{2}, μ_{3}).
Tukey (1953) and Miller (1966) also discuss the generalization of the intervals based on the Studentized range (referred to in Section 3, above) to take care of all linear combinations. Simultaneous intervals for all linear combinations can also be based on the Studentized maximum modulus (Section 1); the half-lengths become M_{α}σ̂∑(|c_{i}|/√n_{i}) (Tukey 1953).
All of these methods dealing with contrasts and general linear combinations are described in Miller (1966).
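The Scheffé interval for any single contrast can be sketched as follows, with the critical value crit = √[(k – 1)F_{α}] supplied from F (or chi-square) tables—for instance, 2.45 for k = 3 at the 5 per cent level with large ddf, as in Table 4, below. The function name and toy numbers are illustrative.

```python
import math

def scheffe_interval(means, ns, var, coeffs, crit):
    """Scheffe simultaneous interval for the contrast sum(c_i * mu_i).

    Sketch only: crit = sqrt((k - 1) * F_alpha) comes from tables, and
    `var` is the pooled variance estimate sigma-hat^2.
    """
    assert abs(sum(coeffs)) < 1e-9, "contrast coefficients must sum to zero"
    est = sum(c * m for c, m in zip(coeffs, means))
    # SE of the estimated contrast: sigma-hat * sqrt(sum c_i^2 / n_i)
    se = math.sqrt(var * sum(c * c / n for c, n in zip(coeffs, ns)))
    return est - crit * se, est + crit * se
```

Because the same crit serves every contrast at once, any number of contrasts—including contrasts suggested by the data—may be examined at joint level 1 – α.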
Further discussion of normal populations
6. Newman–Keuls and Duncan procedures
The Newman–Keuls procedure is a multiple comparison test for all differences. It does not provide a confidence region. The sample means are arranged and renumbered in order of magnitude, so that X̄_{1} ≤ X̄_{2} ≤ … ≤ X̄_{k}. The first step is the same as Tukey’s test: the null hypothesis is rejected or accepted according as X̄_{k} – X̄_{1}, the range of the sample means, is ≥ or < T_{α:k}SE, where T_{α:k} is the upper α-point of Tukey’s statistic for k means and N – k ddf.
Accepting H_{0} means that there is not enough evidence to establish differences between any of the population means, and the analysis is complete (all k means are then called “homogeneous”). On the other hand, if the null hypothesis is rejected, so that μ_{k}, the population mean corresponding to the largest sample mean, is declared to be different from μ_{1}, the population mean corresponding to the smallest sample mean, the next step is to test X̄_{k–1} – X̄_{1} and X̄_{k} – X̄_{2} similarly, but with T_{α:k–1} in place of T_{α:k} (the original pooled variance estimator σ̂^{2} and N – k ddf are used throughout). A subrange of means that is not found statistically significant is called homogeneous. As long as a subrange is statistically significant, the two subranges obtained by removing in one case its largest and in the other case its smallest X̄_{i} are tested, using a critical value T_{α:h}, where h is the number of means left in the new subranges—but testing is limited by the rule that every subrange contained in a homogeneous range of means is not tested but is automatically declared to be homogeneous. The result of the whole procedure is to group the means into homogeneous sets, which may also be represented diagrammatically by connecting lines, as in the example presented in Section 10, below.
Critics of the Newman–Keuls method object that the error probabilities, such as that of falsely declaring μ_{2} ≠ μ_{5}, are not even known in this test; its supporters, however, argue that power should not be wasted by judging subranges by the same stringent criterion used for the full range of all k sample means.
Duncan (1955) goes a step further, arguing that even T_{α:h} is too stringent a criterion, because the differences between h means have only h – 1 degrees of freedom. He concludes that T_{γ:h} should be used instead, where 1 – γ = (1 – α)^{h–1}. This further increases the power—and the effective Type I error probability. For a study of error rates of Tukey, Newman–Keuls, Duncan, and Student tests, see Harter (1957).
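The stepwise Newman–Keuls logic can be sketched as a recursion, assuming for simplicity a common standard error for all subranges (the example in Section 10, below, actually uses per-pair standard errors); the function name is illustrative, and the crit dictionary holds the tabulated T_{α:h} values.

```python
def newman_keuls(sorted_means, se, crit):
    """Newman-Keuls stepwise test, a sketch.

    sorted_means: sample means in ascending order.
    se:           common standard error of a difference (a simplification).
    crit:         dict mapping subrange size h to the critical value T_{alpha:h}
                  from Studentized-range tables.
    Returns the maximal homogeneous subranges as (start, end) index pairs.
    """
    homogeneous = []

    def test(lo, hi):
        h = hi - lo + 1
        if h < 2:
            return
        # a subrange inside an already-homogeneous range is not tested
        if any(a <= lo and hi <= b for a, b in homogeneous):
            return
        if sorted_means[hi] - sorted_means[lo] < crit[h] * se:
            homogeneous.append((lo, hi))
        else:
            test(lo, hi - 1)   # drop the largest mean
            test(lo + 1, hi)   # drop the smallest mean

    test(0, len(sorted_means) - 1)
    return homogeneous
```

Applied to the ordered mean scores of Section 10 (.000, .136, .245, .443) with an assumed common SE of .1243 and the 1 per cent critical values 3.11 (h = 4) and 2.91 (h = 3), the sketch reproduces the two homogeneous sets found there.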
7. General Model I design
The F-test in the one-way analysis of variance and the multiple comparison methods already discussed are based on the fact that ddf times σ̂^{2}/σ^{2} has a chi-square distribution and is independent of the sample means. This condition is also satisfied by the residual variance used in randomized blocks, factorial designs, Latin squares, and all Model I designs. Therefore, all these designs permit the use of the methods, and tables, of sections 1–6 to compare the means defined by any one factor, provided that these are independent.
In certain instances of nonparametric multiple comparisons and in certain instances of multiple comparisons of interactions in balanced factorial designs, where the (adjusted or transformed) observations are not independent but equicorrelated, the multiple comparison methods of sections 2–6 still apply: the use of the adjusted error variance, (1 – ρ)σ̂^{2}, to compute standard errors fully compensates for the effect of equal correlations (see Tukey 1953; Scheffé 1953; Miller 1966, pp. 41–42, 46–47). Scheffé’s method can also be adapted for use with unequal correlations (see Miller 1966, p. 53).
When several factors, and perhaps some interactions, are t-tested in the same experiment, the question arises whether extra adjustment should not be made for the resulting additional compounding of error probabilities. One method open to an experimenter willing to sacrifice power for strict experimentwise control of Type I error is the conservative one of using an error rate per t-test of α/(number of t-tests contemplated), that is, using Bonferroni t-statistics (see Section 14). For experimentwise control of error rates in the special case of a 2^{r} factorial design, Nair (1948a) has tabulated percentage points of the largest of r independent χ^{2}’s with one degree of freedom, divided by an independent variance estimator (Pearson & Hartley [1954] 1966, table 19). The statistic is equal to the square of the Studentized maximum modulus introduced in Section 1.
8. An example—juxtaposition of methods
Three competing theories about how hostility evoked in people by willfully imposed frustration may be diminished led Rothaus and Worchel (1964) to goad 192 experimental subjects into hostility by unfair administration of a test of coordination and then to apply the following “treatments” to four groups, each composed of 48 subjects: (1) no treatment (control); (2) fair readministration of the test, seemingly as a result of a grievance procedure (instrumental communication); (3) an opportunity for verbal expression of hostility (catharsis); (4) conversation to the effect that the test was unfair and the result therefore not indicative of failure on the subjects’ part (ego support). After treatment all subjects were given another—
Table 1 — Analysis of variance of hostility scores

Source                                        df    Mean square    F-ratio
4 treatments                                   3       369.77       3.38*
3 subgroups                                    2       151.31       1.38
2 sexes                                        1        41.14       0.38
2 BIHS levels                                  1         2.68       0.02
All the interactions, none of them
  statistically significant                   40
4 replications (nested)                      144       109.54 = σ̂^{2}

* Denotes statistical significance at the 5 per cent level.
fair—test of coordination. Each treatment group was subdivided into three subgroups, a different experimenter working with each subgroup. All subjects had been given Behavioral Items for Hostility Scales (BIHS) three weeks before the experiment.
The experimental plan was factorial: 4 treatments × 3 subgroups × 2 sexes × 2 BIHS score groups (high versus low) × 4 replications. The study variable, X, was hostility measured on the Social Sensitivity Scale at the end of the experiment.
The sample means (unordered) for the four treatment groups were x̄_{1} = 47.08, x̄_{2} = 42.00, x̄_{3} = 48.53, x̄_{4} = 45.40.
In fact, the numbers in Table 1 reflect an analysis of covariance. The mean squares shown are adjusted mean squares, the sample means are adjusted means, and σ̂^{2} has 143 df. But for the sake of simplicity of interpretation the data will be treated as if they had come from a 4 × 3 × 2 × 2 factorial analysis of variance. The estimated standard error for differences between two means, SE, is √(2 × 109.54/48) = 2.136.
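The arithmetic behind SE and the first row of Table 2 can be checked in a few lines; all constants are taken from the text.

```python
import math

# Pooled residual mean square from Table 1 and 48 observations per treatment group.
var = 109.54
n = 48

# Estimated standard error of a difference between two treatment means.
se_diff = math.sqrt(2 * var / n)          # about 2.136

# Two-sided Dunnett half-length and the (1)-(2) interval of Table 2.
half = 2.40 * se_diff                     # D_alpha = 2.40, about 5.13
ci_12 = (5.08 - half, 5.08 + half)        # about (-0.05, 10.21)
```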
Dunnett comparisons. The Dunnett method, with α = .05, would be applied to the data of the experiment, as analyzed in Table 2. As indicated in Table 2, the one-tail test in the direction of the theory (H_{1}) under study declares μ_{2} to be less than μ_{1}. Thus, the conclusion, if the one-sided Dunnett test and the 5 per cent significance level are adopted, is that instrumental communication reduces hostility but that the evidence does not confirm any reduction due to ego support or catharsis. If the two-tail test had been chosen, allowing for
Table 2 — Dunnett comparisons of control with three treatments, α = .05

                               DUNNETT METHOD: TWO-SIDED                DUNNETT METHOD: ONE-SIDED
Pair     X̄_i – X̄_j   t_{i1}   Test          Confidence interval        Test          Confidence interval
                              (D_α = 2.40)  (half-length = 2.136D_α    (D_α = 2.08)  (lower limit = X̄_i – X̄_j
                                            = 5.13)                                  – 2.136D_α = X̄_i – X̄_j – 4.44)
(1)–(2)    5.08       2.38    (near         (–0.05, 10.21)             *             (0.64, ∞)
                              significance)
(1)–(3)   –1.45      –0.68    –             (–6.58, 3.68)              –             (–5.89, ∞)
(1)–(4)    1.68       0.79    –             (–3.45, 6.81)              –             (–2.76, ∞)

* Statistically significant at the 5 per cent level; all other comparisons do not reach statistical significance at the 5 per cent level.
Table 3 — All pairs, by Tukey and by Scheffé method, α = .05

                               TUKEY METHOD                            SCHEFFÉ METHOD
Pair     X̄_i – X̄_j   t_{ij}   Test          Confidence interval        Test               Confidence interval
                              (T_α = 2.60)  (half-length = 2.136T_α    (√(3F_α) = 2.80)   (half-length = 2.136 × 2.80
                                            = 5.55)                                       = 5.98)
(1)–(2)    5.08       2.38    –             (–0.47, 10.63)             –                  (–0.90, 11.06)
(1)–(3)   –1.45      –0.68    –             (–7.00, 4.10)              –                  (–7.43, 4.53)
(1)–(4)    1.68       0.79    –             (–3.87, 7.23)              –                  (–4.30, 7.66)
(2)–(3)   –6.53      –3.06    *             (–12.08, –0.98)            *                  (–12.51, –0.55)
(2)–(4)   –3.40      –1.59    –             (–8.95, 2.15)              –                  (–9.38, 2.58)
(3)–(4)    3.13       1.47    –             (–2.42, 8.68)              –                  (–2.85, 9.11)

* Statistically significant at the 5 per cent level; all other pairs do not reach statistical significance at the 5 per cent level.
a possible increase in hostility due to treatment, the conclusion would be that there is insufficient evidence to reject.
All pairs—Tukey and Scheffé methods. A comparison of all possible pairs of means by the methods of Tukey and Scheffé is shown in Table 3. The tests of Tukey and Scheffé in this case both discount x̄_{1} – x̄_{2} but declare x̄_{2} – x̄_{3} “significant.” The conclusion is that instrumental communication leaves the mean hostility of frustrated subjects lower than ego support does, but no other difference is established; specifically, neither test would conclude that instrumental communication actually reduces hostility as compared with no treatment or that ego support increases (or reduces) it.
In addition to the simple differences, the data suggest testing a contrast related to the alternative hypothesis μ_{2} < μ_{4} < μ_{1} < μ_{3}—for example, the contrast –3μ_{2} – μ_{4} + μ_{1} + 3μ_{3}. (It is legitimate, for these procedures, to choose such a contrast after inspecting the data.) For the present example, –3x̄_{2} – x̄_{4} + x̄_{1} + 3x̄_{3} = 21.27; the SE for a Scheffé test is σ̂√(∑c_{i}^{2}/48) = √(109.54 × 20/48) = 6.755, and t = 21.27/6.755 = 3.15, statistically significant at the 5 per cent level (3.15 > 2.80). The conclusion is that μ_{2} ≤ μ_{4} ≤ μ_{1} ≤ μ_{3}, with at least one strict inequality holding. A Scheffé 95 per cent confidence interval for –3μ_{2} – μ_{4} + μ_{1} + 3μ_{3} is (2.36, 40.18). A Tukey test would not find this contrast statistically significant. For this analysis 2.136 is multiplied by ½(|–3| + |–1| + |1| + |3|) = 4, instead of by √(∑c_{i}^{2}/2) = √10, yielding 8.544. Thus, in this case t = 21.27/8.544 = 2.49, which is less than 2.60, and a confidence interval is (–0.94, 43.48).
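The Scheffé and Tukey treatments of this contrast differ only in the multiplier applied to the SE of a difference, 2.136; the figures above can be reproduced directly (all constants from the text).

```python
import math

# Adjusted treatment means and contrast coefficients from the example.
means = {1: 47.08, 2: 42.00, 3: 48.53, 4: 45.40}
coeffs = {1: 1, 2: -3, 3: 3, 4: -1}       # the contrast -3mu2 - mu4 + mu1 + 3mu3

est = sum(coeffs[i] * means[i] for i in means)   # about 21.27

se_diff = 2.136                                  # SE of a simple difference
# Scheffe denominator: se_diff * sqrt(sum(c^2) / 2)
se_scheffe = se_diff * math.sqrt(sum(c * c for c in coeffs.values()) / 2)
# Tukey denominator: se_diff * (1/2) * sum(|c|)
se_tukey = se_diff * 0.5 * sum(abs(c) for c in coeffs.values())

t_scheffe = est / se_scheffe    # about 3.15, exceeds the critical value 2.80
t_tukey = est / se_tukey        # about 2.49, below the critical value 2.60
```

The same contrast is thus significant by Scheffé's criterion but not by Tukey's, as the text concludes.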
The SE for an individual x̄_{i}, also used in slippage statistics, is σ̂/√48 = √(109.54/48) = 1.51. For k = 4, M_{.05} = 2.50, and simultaneous confidence intervals for the four μ_{i} are centered at the x̄_{i} and have half-lengths 2.50 × 1.51 = 3.78.
The 5 per cent critical value tabulated by Halperin et al. for two-sided slippage tests is 2.23; thus |x̄_{2} – x̄|/1.51 = 2.48 is statistically significant, whereas the other three t-ratios for slippage are not. The conclusion of this test would be that mean hostility after instrumental communication is low compared with that after other treatments; no other treatment can be singled out as leaving hostility either low or high compared with that after other treatments.
An example of Newman–Keuls and Duncan tests is given in Section 10, below.
Other multiple comparison methods
9. Nonparametric multiple comparisons
The multiple comparison approach has been articulated with nonparametric (or distribution-free) methods in several ways [for background see Nonparametric statistics].
For example, one of the simplest nonparametric tests is the sign test. Suppose that an experiment concerning techniques for teaching reading deals with school classes and that each class is divided in half at random. One half is taught by method 1, the other by method 2, the methods being allocated at random. Suppose further that improvement in average score on a reading test after two months is the basic observation but that one chooses to consider only whether the pupils taught by method 1 gain more than the pupils taught by method 2, or vice versa, and not the magnitude of the difference. If C is the number of classes for which the pupils taught with method 1 have a larger average gain than those taught with method 2, minus the number of classes for which the reverse holds, then the (two-sided) sign test rejects the null hypothesis of equal method effect when the absolute value of C is larger than a critical value. The critical value comes simply from a symmetrical binomial distribution.
Suppose now that there are k teaching methods, where k might be 3 or 4, and the classes are each divided at random into k groups and assigned to methods. Let C_{ij} (i ≠ j) be the number of classes for which the average gain in reading-test score for the group taught by method i is greater than that for the group taught by method j, minus the number of classes for which the reverse holds. Each C_{ij} taken separately has (under the null hypothesis that the corresponding two methods are equally effective) a symmetric binomial distribution, which is approximated asymptotically by P(|C_{ij}| ≥ c) ≈ P(|z| ≥ [c – ½]/√n), where n is the number of classes, z is a standard normal variable, and ½ is a continuity correction. But to test for the equality of all k methods, the largest |C_{ij}| should be used. The critical values of this statistic may be approximated by T_{α}√n, where T_{α} is the upper α-point for Tukey’s statistic with k groups and ddf = ∞.
The same procedure is feasible for other two-sample test statistics—for example, rank sums. An analogous method works for comparing k – 1 treatments with a control; in the teaching-method experiment, if method 1 were the control, this would mean using as the test statistic the maximum over j ≠ 1 of |C_{1j}| (or of C_{1j} in the one-sided case). For a discussion of this material, see Steel (1959).
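A sketch of the pairwise sign statistics, with C_{ij} taken as the number of classes favoring method i minus the number favoring method j (ties ignored); the data layout and function name are illustrative.

```python
from itertools import combinations

def sign_statistics(gains):
    """Pairwise sign statistics C_ij for k methods observed on n classes.

    gains[c][i] is the average gain of the group taught by method i in
    class c.  C_ij = (classes where method i beats method j) minus
    (classes where method j beats method i), ties ignored, so C_ij is
    symmetrically distributed about 0 under the null hypothesis.
    """
    k = len(gains[0])
    c = {}
    for i, j in combinations(range(k), 2):
        wins_i = sum(1 for row in gains if row[i] > row[j])
        wins_j = sum(1 for row in gains if row[j] > row[i])
        c[(i, j)] = wins_i - wins_j
    return c
```

To test equality of all k methods one would refer max |C_{ij}| to approximately T_{α}√n, with T_{α} taken from Studentized-range tables at ddf = ∞, as described above.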
Joint nonparametric confidence intervals may sometimes be obtained in a similar way. Given a confidence interval estimation procedure related to any two-sample test statistic with critical value S_{α} (see Moses 1953; 1965), the same procedure with S_{α} replaced by its multiple comparison analogue C_{α} yields confidence intervals with a joint confidence level of 1 – α.
A second class of nonparametric multiple comparison tests arises by analogy with normal theory analysis of variance for the one-way classification and other simple designs [see Linear hypotheses, article on analysis of variance]. The procedures start by transforming the observations into ranks or other kinds of simplified scores (except that the so-called permutation tests leave the observations unaltered). The analysis is conditional on the totality of scores and uses as its null distribution that obtained from random allocations of the observed scores to treatments. The test statistic may be the ordinary F-ratio on the scores, but modified so that the denominator is the exact overall variance of the given scores. This statistic’s null distribution is approximately F, with k – 1 and ∞ as degrees of freedom (where k is the number of treatments); equivalently, k – 1 times the F-test statistic has as approximate null distribution the chi-square distribution with k – 1 degrees of freedom. Similar adaptations hold for the Tukey test statistic and others. The approach may also be extended to randomized block designs; in another direction, it may be extended to compare dispersion, rather than location. Discussions of this material are given by Nemenyi (1963) and Miller (1966, chapter 2, sec. 1.4, and chapter 4, sec. 7.5).
A difficulty with these test procedures is that confidence sets cannot generally be obtained in a straightforward way.
A third nonparametric approach to multiple comparisons is described by Walsh (1965, pp. 535–536). The basic notion applies when there are a number of observations for each treatment or treatment combination. Such a set of observations is divided into several subsets; the average of each subset is taken. These averages are then treated by normal theory procedures of the kind discussed earlier.
For convenient reference, a few 5 per cent and 1 per cent critical points of multiple comparison statistics with ddf = ∞ are listed in Table 4.
Table 4 — Selected 5 per cent and 1 per cent critical points of multiple comparison statistics with ddf = ∞

        DUNNETT                 TUKEY         NAIR–HALPERIN          SCHEFFÉ   DUNCAN
        (k – 1 versus one)      (all pairs)   (outlier tests)
k       one-tail   two-tail                   one-tail   two-tail

5 per cent level
2       1.64       1.96         1.96          1.39       1.39        1.96      1.96
3       1.92       2.21         2.34          1.74       1.91        2.45      2.06
4       2.06       2.35         2.57          1.94       2.14        2.80      2.13
5       2.16       2.44         2.73          2.08       2.28        3.08      2.18
6       2.23       2.51         2.85          2.18       2.39        3.33      2.23

1 per cent level
2       2.33       2.58         2.58          1.82       1.82        2.58      2.58
3       2.56       2.79         2.91          2.22       2.38        3.03      2.68
4       2.68       2.92         3.11          2.43       2.61        3.37      2.76
5       2.77       3.00         3.25          2.57       2.76        3.64      2.81
6       2.84       3.06         3.36          2.68       2.87        3.88      2.86
Table 5 — Frequency of church attendance of scientists in four different fields

                         (1)          (2)          (3)          (4)
                         Chemical                                            Combined      Score
Church attendance        engineers    Physicists   Zoologists   Geologists   sample        u
Never                     44           65           66           72           247          –1
Not often                 38           19           21           30           108           0
Often                     52           46           49           38           185           1
Very often                33           29           19           17            98           2
Sample size, n_i         167          159          155          157          N = 638
T_i = ∑ frequency × u     74           39           21            0*         T = 134
ū_i = T_i/n_i              0.443        0.245        0.136        0.000      ū = 0.210
1/n_i                      0.005988     0.006289     0.006452     0.006369   1/N = 0.001567

* It is purely accidental that T_4 exactly equals 0.
Source: Vaughan et al. 1966.
10. An example
As an illustration of some distributionfree multiple comparison methods, consider the following data from Vaughan, Sjoberg, and Smith (1966), who sent questionnaires to a sample of scientists listed in American Men of Science in order to compare scientists in four different fields with respect to the role that traditional religion plays in their lives. Table 5 summarizes responses to the question about frequency of church attendance and shows some of the calculations.
Using the data of Table 5, illustrative significance tests of the null hypothesis of four identical population distributions, against various alternatives, will be performed at the 1 per cent level.
The method of Yates (1948) begins by assigning ascending numerical scores, u, to the four ordered categories; arithmetically convenient scores, as shown in the last column of Table 5, are –1, 0, 1, and 2. Sample totals of scores are calculated—for example, T_{1} = 44(–1) + 38(0) + 52(1) + 33(2) = 74—and the average score for sample i is ū_{i} = T_{i}/n_{i}. From the combined sample (margin) Yates computes an average score, ū = T/N = .210, and the variance of scores,
σ̂^{2} = [N/(N – 1)][∑fu^{2}/N – ū^{2}] = (638/637){[247(1) + 108(0) + 185(1) + 98(4)]/638 – .210^{2}} = 1.2494,

where f denotes the combined-sample frequency of each score.
Yates then computes a variance between means, ∑(T_{i}^{2}/n_{i}) – T^{2}/N = 17.05, and the critical ratio used is either F = (17.05/3)/1.2494 or χ^{2} = 17.05/1.2494 = 13.7. The second of these is referred to a table of chi-square with 3 df and found significant at the 1 per cent level (in fact, P = .0034).
It follows that some contrasts must be statistically significant. The almost linear progression of the sample mean scores suggests calculating the trend contrast 3ū_{1} + ū_{2} – ū_{3} – 3ū_{4} = 1.438. For the denominator, (3^{2}/167 + 1^{2}/159 + 1^{2}/155 + 3^{2}/157)σ̂^{2} = .1240 × 1.2494 = .1549, so that χ^{2} = 1.438^{2}/.1549 = 13.35, or its square root, z = 3.65. (This comes close to the value √13.7 = 3.70 of the largest standardized contrast—see Section 5.) When 3.65 is referred to the Scheffé table (in Table 4, above) for k = 4, or when 13.35 is referred to a table of chi-square with 3 df, each is found to be statistically significant (in fact, P = .0040). The conclusion that can be drawn from this one-sided test for trend is that the population mean scores are ordered μ̄_{1} ≥ μ̄_{2} ≥ μ̄_{3} ≥ μ̄_{4}, with at least one strict inequality holding. Had a trend in this particular order been predicted ahead of time and postulated as the sole alternative hypothesis to be considered, z = 3.65 could have been judged by the normal table, yielding P = .00013. The two-tail version of this test is Yates’s one-degree-of-freedom chi-square for trend (1948).
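The Yates-scores arithmetic of Table 5 can be reproduced directly; the frequencies are those of Table 5, and the variable names are illustrative.

```python
# Columns of Table 5, rows in the order: Never, Not often, Often, Very often.
counts = {
    1: [44, 38, 52, 33],       # chemical engineers
    2: [65, 19, 46, 29],       # physicists
    3: [66, 21, 49, 19],       # zoologists
    4: [72, 30, 38, 17],       # geologists
}
u = [-1, 0, 1, 2]              # Yates scores for the ordered categories

n = {i: sum(col) for i, col in counts.items()}
t = {i: sum(f * s for f, s in zip(col, u)) for i, col in counts.items()}
big_n = sum(n.values())                     # N = 638
big_t = sum(t.values())                     # T = 134
u_bar = big_t / big_n                       # about .210

# Variance of the scores in the combined sample (margin of Table 5).
total = [sum(counts[i][r] for i in counts) for r in range(4)]
var = (big_n / (big_n - 1)) * (sum(f * s * s for f, s in zip(total, u)) / big_n
                               - u_bar ** 2)           # about 1.2494

# Yates's chi-square between the k samples, about 13.7 on 3 df.
chi2 = (sum(t[i] ** 2 / n[i] for i in counts) - big_t ** 2 / big_n) / var
```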
Another contrast that may be tested is the simple difference ū_{1} – ū_{4} = .443 – .000. Here SE = √[(1/167 + 1/157)σ̂^{2}] = .1243, and z_{14} = .443/.1243 = 3.57. Because this is greater than 3.37, the contrast is statistically significant. Similarly, z_{13} = (.443 – .136)/.1239 = 2.48, but this is not significant at the 1 per cent level, and the other simple differences are still smaller.
If Tukey’s test had been adopted instead of Scheffé’s, the same ratios would be compared with the critical value 3.11 (k = 4, α = .01). The conclusions would be the same in the present case. Tukey’s method could also be used to test other contrasts.
In the present example, the Newman–Keuls procedure would also have led to the same conclusions about simple differences: z_{14} = 3.57 is called significant because it is greater than 3.11; then z_{13} (which equals 2.48) and z_{24} (which is still smaller) are compared with 2.91 and found “not significant,” and the procedure ends. The conclusions may be summarized as follows:
.443   .245   .136   .000
_________________
       _________________
where the absence of a line connecting ū_{1} with ū_{4} signifies that μ̄_{1} and μ̄_{4} are declared unequal. It may be argued that a conclusion of the form “A, B, and C homogeneous, B, C, and D homogeneous, but A, B, C, and D not homogeneous” is self-contradictory. This is not necessarily the case if the interpretation is the usual one that A, B, and C may be equal (not enough evidence to prove them unequal) and B, C, and D may be equal, but A and D are not equal.
In Duncan’s procedure the critical value 3.11 used in the first stage would be replaced by 2.76 (see Table 4), and the critical value 2.91 used at the second stage (k = 3) would be replaced by 2.68. Since 3.57 > 2.76 but 2.48 < 2.68, Duncan’s test leads to the same conclusion in the present example as the Newman–Keuls procedure.
A Halperin outlier test would use max |ū_{i} – ū|, in this case .443 – .210 = .233, divide it by its estimated standard error, and compare the resulting ratio, 2.72, with the critical value, 2.61 (k = 4, 1 per cent level). The next largest ratio is (.210 – .000)/.08944 = 2.35. The conclusion is that chemical engineers tend to report more frequent church attendance than the other groups, but nothing can be said about geologists. If the outlier contrasts had been tested as part of a Scheffé test for all contrasts, none of them would have been found significant at the 1 per cent level (critical value 3.37) or even at the 5 per cent level.
What would happen if unequally spaced scores had been used instead of −1, 0, 1, 2 to quantify the four degrees of religious loyalty? In fact, Vaughan and his associates described the ordered categories not verbally but as frequency of church attendance per month, grouped into 0, 1, 2–4, and 5+. Although we do not know whether frequency of church attendance is a linear measure of the importance of religion in a person’s life, the scores (0, 1, 3, 6) could reasonably have been assigned. In the present case this would lead to essentially the same conclusions that the other scoring led to: The mean scores become 2.35, 2.08, 1.82, and 1.57; Yates’s X^{2} changes from 13.7 to 12.5; the standardized contrast for trend changes very little; z_{14} changes from 3.57 to 3.36; and z_{13} and z_{24} again have values too small for statistical significance by Tukey’s criterion or by Newman–Keuls’.
A fundamentally different assignment of scores —for example, 1, 0, 0, 1—would be used to test for differences in spread. It yields sample means, ū_{i}, of .461, .591, .548, .576, with ū = .541, and a variance, σ^{2}, of .2484. Yates’s analysis-of-variance X^{2} is 1.449/.2484, that is, only 5.83, so P = .12. Thus, no contrast is called significant in a Scheffé test (or, it turns out, in any other multiple comparison test at the 1 per cent significance level). In the present example these tests for spread are unreliable, because the presence of sample location differences, noted above, can vitiate the results of the test for differences in spread.
Throughout the numerical calculations in this section, the continuity correction has been neglected. In the case of unequal sample sizes it is difficult to determine what continuity correction would yield the most accurate results, and the effect of the adjustment would be slight anyway. When sample sizes are equal, the use of a continuity-corrected ǀT_{i} − T_{j}ǀ is recommended, as it frequently (although not invariably) improves the fit of the asymptotic approximation used.
11. Comparisons for differences in scale
Standard multiple comparisons of variances of k normal populations, by Cochran (1941), David (1952), and others, utilize ratios of the sample variances. These methods should be used with caution, because they are ultrasensitive to slight nonnormality.
Distribution-free multiple comparison tests for scale differences are also available. Any rank test may be used with a Siegel–Tukey reranking [see Nonparametric Statistics, article on Ranking Methods]. Such methods, too, require caution, because—especially in a joint ranking of all k samples—any sizable location differences may masquerade as differences in scale (Moses 1963).
Safer methods—but with efficiencies of only about 50 per cent for normal distributions—are adaptations of some tests by Moses (1963). In these tests a small integer, s, such as 2 or 3, is chosen, and each sample is randomly subdivided into subgroups of s observations. Let y be the range or variance of a subgroup. Then any multiple comparison test may be applied to the k samples of y’s (or log y’s), at the sacrifice of between-subgroups information. The effective sample sizes have been reduced to [n_{i}/s]; if these are small (about 6), either a nonparametric test or, at any rate, log y’s should be used. (Some nonparametric multiple comparison tests, such as the median test, have no power—that is, they cannot possibly reject the null hypothesis—at significance levels such as .05 with small samples. But rank tests can be used with several samples as small as 4 or 5.)
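A minimal sketch of this Moses-type procedure, using made-up data: each sample is randomly partitioned into subgroups of s = 3, and the log range of each subgroup becomes one observation for an ordinary k-sample comparison.

```python
import math
import random

def moses_subsample_values(sample, s, rng):
    """Randomly partition `sample` into subgroups of size s and return the
    log range of each subgroup (leftover observations are discarded)."""
    data = list(sample)
    rng.shuffle(data)                 # the random subdivision
    n_sub = len(data) // s            # effective sample size [n_i / s]
    return [math.log(max(data[i * s:(i + 1) * s]) -
                     min(data[i * s:(i + 1) * s]))
            for i in range(n_sub)]

# Made-up samples: two populations with spread 1, one with spread 3.
rng = random.Random(0)
samples = [[rng.gauss(0, sd) for _ in range(24)] for sd in (1.0, 1.0, 3.0)]
ys = [moses_subsample_values(x, 3, rng) for x in samples]
# Any k-sample multiple comparison test may now be applied to the ys;
# each list holds 24 // 3 = 8 values.
print([len(y) for y in ys])
```

The between-subgroups information is sacrificed, as the text notes, but location differences between the original samples no longer masquerade as scale differences.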
12. Multiple comparisons of proportions
A simultaneous test for all differences between k proportions p_{1}, …, p_{k}, based on large samples, can be obtained by comparing
with a critical value of Tukey’s statistic (Section 3), where X_{i}, i = 1, …, k, denotes the number of “successes” in sample i and X = ∑X_{i}. Analogous asymptotic tests can be used for comparison of several treatments with a control and other forms of multiple comparisons. If X/N is small, the sample sizes must be very large for this asymptotic approximation to be adequate. (For a similar method see Ryan 1960.)
Small-sample multiple comparison tests of proportions may be carried out by transforming the counts into normal variables with known equal variances and then applying any test of sections 1–7 to these standardized variables (using ∞ ddf). [See Statistical Analysis, Special Problems of, article on Transformations of Data; see also Siotani & Ozawa 1958.]
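One standard instance of such a transformation (though not necessarily the one intended here) is the angular transformation 2·arcsin√(x/n), which is approximately normal with known variance 1/n. A sketch with invented counts:

```python
import math

def angular(x, n):
    """Angular (arcsine) transform of a count x out of n: approximately
    normal with known variance 1/n for moderate n."""
    return 2.0 * math.asin(math.sqrt(x / n))

# Invented (successes, sample size) pairs for three groups.
counts = [(45, 100), (30, 100), (28, 100)]
transformed = [angular(x, n) for x, n in counts]
variances = [1.0 / n for _, n in counts]   # known; equal here since n's match
# The transformed values can now go into any normal-theory multiple
# comparison of sections 1-7, using infinite ddf.
print([round(t, 3) for t in transformed], variances)
```

Because the variances are known rather than estimated, the critical values are those for infinite denominator degrees of freedom.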
A (1 − α)-confidence region for k population proportions is composed of a (1 − α/k)-confidence interval for each of them. Simultaneous confidence intervals for a set of differences of proportions may be approximated by using Bonferroni’s inequality (see Section 14). For a discussion of confidence regions for multinomial proportions, see Goodman (1965).
Some discussion of multiple comparisons of proportions can be found in Walsh (1965, for example, pp. 536–537).
13. Selection and ranking
The approach called selection or ranking assumes a difference between populations and seeks to select the population(s) with the highest mean—or variance or proportion—or to arrange all k populations in order [see Screening and Selection; see also Bechhofer 1958]. Bechhofer, Kiefer, and Sobel (1967) have written a monograph on the subject.
Error rates, choice of method, history
14. Error rates and choice of method
In a significance test comparing two populations, the significance level, α, is defined as the long-run proportion of erroneous rejections in repeated use of the same criterion. This is termed the error rate per comparison. The corresponding confidence level for confidence intervals is 1 − α.
For analyses of k-sample experiments one may instead define the error rate per experiment, α′, the expected number of erroneous conclusions per experiment. This is related to what Miller (1966) terms the “expected error rate.” For m (computed or implied) comparisons per experiment, α′ = mα and α = α′/m (see Stanley 1957).
Standard multiple comparison tests specify an error rate experimentwise (or “familywise”): the probability that the analysis of an experiment yields at least one erroneous conclusion. Miller refers to this as the “probability of a nonzero family error rate.”
The only difference between α′ and α is that α counts multiple rejections in a single experiment as only one error, whereas α′ counts them as more than one. Hence α ≤ α′; this is termed Bonferroni’s inequality.
On the other hand, it is also true that unless α′ is large, α is almost equal to α′, so that α and α′ may be used interchangeably for practical purposes. For example, D_{α} for six treatments and a control, M_{α} for k = 6, and T_{α} for k = 4 (m = 6 pairs) are all approximately equal to the two-tailed critical value t_{α/12} of Student’s t. More generally, m individual comparisons may safely be made using any statistic at significance level α/m per comparison when it is desired to avoid error rates greater than α experimentwise; this procedure may be applied to comparisons of several correlation coefficients or other quantities for which multiple comparison tables are not available. Only when α is about .10 or more, or when m is very big, does this lead to serious waste. Then α′ grossly overstates α, power is lost, and confidence intervals are unnecessarily long (see Stanley 1957; Ryan 1959; Dunn 1961).
Some authors refer to (α/m)-points as Bonferroni statistics and to their use in multiple comparisons as the Bonferroni method. Table 2 in Miller (1966) shows Bonferroni t-statistics, the (.05/2m)-points of Student’s t, for various m and various numbers of ddf.
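Such Bonferroni t-statistics are easy to compute directly rather than from tables. A sketch using SciPy’s t distribution: for m two-tailed comparisons at an experimentwise rate of .05, each test uses the upper .05/(2m) point of Student’s t.

```python
from scipy import stats

def bonferroni_t(alpha, m, ddf):
    """Two-tailed Bonferroni critical value: the upper alpha/(2m) point
    of Student's t with ddf denominator degrees of freedom."""
    return stats.t.ppf(1 - alpha / (2 * m), ddf)

# Critical values grow slowly as the number of comparisons m increases.
for m in (1, 3, 6, 10):
    print(m, round(bonferroni_t(0.05, m, 30), 3))
```

The slow growth of these critical values with m is the reason the Bonferroni method wastes little power unless m is very large or α is large.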
Bonferroni’s second inequality (see Halperin et al. 1955, p. 191) may sometimes be used to obtain an upper limit for the discrepancy α′ − α and a second approximation to critical values for error rates α experimentwise. This works best in the case of slippage statistics and was used by Halperin and his associates (1955), Doornbos and Prins (1958), Thompson and Willke (1963), and others.
The choice between “experimentwise” and “per comparison” is largely a matter of taste. An experimenter should make it consciously, aware of the implications: A given error probability, α, per comparison implies that the risk of at least one type I error in the analysis is much greater than α; indeed, about α × m such errors will probably occur.
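The inflation described here is easy to verify by simulation. The sketch below uses made-up settings (k = 5 equal normal populations, all 10 pairwise z tests at .05 with known variance, for simplicity) and estimates how often an experiment yields at least one false “significant difference.”

```python
import itertools
import random
import statistics

def one_experiment(rng, k=5, n=20, crit=1.96):
    """Draw k samples from the SAME normal population and report whether
    any of the k(k-1)/2 pairwise z tests at the .05 level rejects."""
    means = [statistics.fmean(rng.gauss(0, 1) for _ in range(n))
             for _ in range(k)]
    se = (2 / n) ** 0.5               # SE of a difference of two means
    return any(abs(a - b) / se > crit
               for a, b in itertools.combinations(means, 2))

rng = random.Random(1)
trials = 2000
rate = sum(one_experiment(rng) for _ in range(trials)) / trials
print(round(rate, 2))  # experimentwise error rate, far above .05
```

With ten correlated tests each at the .05 level, the estimated experimentwise rate comes out in the neighborhood of .25 to .30, illustrating why the per-comparison and experimentwise rates must be distinguished.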
Perhaps analyses reporting error rates experimentwise are generally the most honest, or transparent. However, too dogmatic an application of this principle would lead to all sorts of difficulties. Should not the researcher who in the course of his career analyzes 86 experiments involving 1,729 relevant contrasts control the error rate lifetimewise? If he does not, he is almost bound to make a false positive inference sooner or later.
Sterling (1959) discusses the related problem of the concentration of type I errors in the literature that results from the habit of selecting significant findings for publication [see Fallacies, Statistical, for further discussion of this problem].
There is another context in which the problem of choosing error rates arises: If an experimenter laboriously sets up expensive apparatus for an experiment to compare two treatments or conditions in which he is especially interested, he often feels that it would be unfortunate to pass up the opportunity to obtain additional data of secondary interest at practically no extra cost or trouble; so he makes observations on populations 3, 4, …, k as well. It is then possible that the results are such that a two-sample test on the data of primary interest would have shown statistical significance, but no “significant differences” are found in a multiple comparison test. If the bonus observations thus drown out, so to speak, the significant difference, was the experimenter wrong to take them? He was not—the opportunity to obtain extra information should not be wasted, but the analysis should be planned ahead of time with the experimenter’s interests and priorities in mind. He could decide to analyze his primary and subsidiary results as if they had come from separate experiments, or he could conduct multiple comparisons with an overall error rate enlarged to avoid undue loss of power, or he could use a method of analysis which subdivides α, allocating a certain (large) part to the primary comparison and the rest to “data snooping” among the extra observations (Miller 1966, chapter 2, sec. 2.3).
Whenever it is decided to specify error rates experimentwise, a choice between different systems of multiple comparisons (different shapes of confidence regions) remains to be made. In order to study simple differences or slippage only, one of the methods of sections 2–4 above (or a nonparametric version of them) is best—that is, yields the shortest confidence intervals and most powerful tests—provided the n_{i} are (nearly) equal. But Scheffé’s approach (see section 5) is better if a variety of contrasts may receive attention.
When sample sizes are grossly unequal, probability statements based on existing Tukey or Dunnett tables, computed for equal n’s, become too inaccurate. Pending the appearance of appropriate new tables, it is better to use Scheffé’s method, which furnishes exact probabilities. The Bonferroni statistics discussed above offer an alternative solution, preferable whenever attention is strictly limited to a few contrasts chosen ahead of time. Miller (1966, especially chapter 2, secs. 2.3 and 3.3) discusses these questions in some detail.
15. History of multiple comparisons
An early, isolated example of a multiple comparison method was one developed by Working and Hotelling (1929) to obtain a confidence belt for a regression line (see Miller 1966, chapter 3; Kerrich 1955). This region also corresponds to simultaneous confidence intervals for the intercept and slope [see Linear Hypotheses, article on Regression]. Hotelling (1927) had developed the idea of simultaneous confidence interval estimation even earlier, in connection with the fitting of logistic curves to population time series. In his famous paper introducing the T^{2}-statistic, Hotelling (1931) also introduced the idea of simultaneous tests and a confidence ellipsoid for the components of a multivariate normal mean.
The systematic development of multiple comparison methods and theory began later, in connection with the problem of comparing several normal means. The usual method had been the analysis-of-variance F-test, sometimes accompanied by t-tests at a stated significance level, α (usually 5 per cent), per comparison.
Fisher, in the 1935 edition of The Design of Experiments, pointed out the problem of inflation of error probabilities in such multiple t-tests and recommended the use of t-tests at a stated level α′ per experiment. Pearson and Chandra Sekar further discussed the problem (1936). Newman (1939), acting on an informal suggestion by Student, described a test for all differences based on tables of the Studentized range and furnished a table of approximate 5 per cent and 1 per cent points. Keuls formulated Newman’s test more clearly much later (Keuls 1952).
Nair made two contributions in 1948, the one-sided test for slippage of means and a table for simultaneous F-tests in a 2^{r} factorial design. Also in the late 1940s, Duncan and Tukey experimented with various tests for normal means which were forerunners of the multiple comparison tests now associated with their names.
The standard methods for multiple comparisons of normal means were developed between 1952 and 1955 by Tukey, Scheffé, Dunnett, and Duncan. Tukey wrote a comprehensive volume on the subject which was widely circulated in duplicated form and extensively quoted but which has not been published (1953). The form of Tukey’s method described in Section 3 for unequal n’s was given independently by Kurtz and by Kramer in 1956. Also in the early and middle 1950s, some multiple comparison methods for normal variances were published, by Hartley, David, Truax, Krishnaiah, and others. Cochran’s slippage test for normal variances was published, for use as a substitute for Bartlett’s test for homogeneity of variances, as early as 1941 (see Cochran 1941).
Selection and ranking procedures for means, variances, and proportions have been developed since 1953 by Bechhofer and others.
An easy, distribution-free slippage test was proposed by Mosteller in 1948—simply count the number of observations in the most extreme sample lying beyond the most extreme value of all the other samples and refer to a table by Mosteller and Tukey (1950). Other distribution-free multiple comparison methods—although some of them can be viewed as applications of S. N. Roy’s work of 1953—did not begin to appear until after 1958.
The most important applications of the very general methodology developed by the school of Roy and Bose since 1953 have been multivariate multiple comparison tests and confidence regions. Such work by Roy, Bose, Gnanadesikan, Krishnaiah, Gabriel, and others is generally recognizable by the word “simultaneous” in the title—for example, SMANOVA, that is, simultaneous multivariate analysis of variance (see Miller 1966, chapter 5).
Another recent development is the appearance of some Bayesian techniques for multiple comparisons. These are discussed by Duncan in the May 1965 issue of Technometrics, an issue which is devoted to articles on multiple comparison methods and theory and reflects a cross section of current trends in this field.
Peter Nemenyi
BIBLIOGRAPHY
The only comprehensive source for the subject of multiple comparisons to date is Miller 1966. Multiple comparisons of normal means (and variances) are summarized by a number of authors, notably David 1962a and 1962b. Several textbooks on statistics—e.g., Winer 1962—also cover some of this ground. Many of the relevant tables, for normal means and variances, can also be found in David 1962a and 1962b; Vianelli 1959; and Pearson & Hartley 1954; these volumes also provide explanations of the derivation and use of the tables.
Bechhofer, R. E. 1958 A Sequential Multipledecision Procedure for Selecting the Best One of Several Normal Populations With a Common Unknown Variance, and Its Use With Various Experimental Designs. Biometrics 14:408–429.
Bechhofer, R. E.; Kiefer, J.; and Sobel, M. 1967 Sequential Ranking Procedures. Unpublished manuscript. → Projected for publication by the University of Chicago Press in association with the Institute of Mathematical Statistics.
Cochran, W. G. 1941 The Distribution of the Largest of a Set of Estimated Variances as a Fraction of Their Total. Annals of Eugenics 11:47–52.
Cronbach, Lee J. 1949 Statistical Methods Applied to Rorschach Scores: A Review. Psychological Bulletin 46:393–429.
David, H. A. 1952 Upper 5 and 1% Points of the Maximum F-ratio. Biometrika 39:422–424.
David, H. A. 1962a Multiple Decisions and Multiple Comparisons. Pages 144–162 in Ahmed E. Sarhan and Bernard G. Greenberg (editors), Contributions to Order Statistics. New York: Wiley.
David, H. A. 1962b Order Statistics in Shortcut Tests. Pages 94–128 in Ahmed E. Sarhan and Bernard G. Greenberg (editors), Contributions to Order Statistics. New York: Wiley.
Doornbos, R.; and Prins, H. J. 1958 On Slippage Tests. Part 3: Two Distribution-free Slippage Tests and Two Tables. Indagationes mathematicae 20:438–447.
Duncan, David B. 1955 Multiple Range and Multiple F Tests. Biometrics 11:1–42.
Duncan, David B. 1965 A Bayesian Approach to Multiple Comparisons. Technometrics 7:171–222.
Dunn, Olive J. 1961 Multiple Comparisons Among Means. Journal of the American Statistical Association 56:52–64.
Dunnett, Charles W. 1955 A Multiple Comparison Procedure for Comparing Several Treatments With a Control. Journal of the American Statistical Association 50:1096–1121.
Fisher, R. A. (1935) 1960 The Design of Experiments. 7th ed. London: Oliver & Boyd; New York: Hafner.
Fisher, R. A.; and Yates, Frank (1938) 1963 Statistical Tables for Biological, Agricultural, and Medical Research. 6th ed., rev. & enl. Edinburgh: Oliver & Boyd; New York: Hafner.
Gabriel, K. R. 1966 Simultaneous Test Procedures for Multiple Comparisons on Categorical Data. Journal of the American Statistical Association 61:1081–1096.
Goodman, Leo A. 1965 On Simultaneous Confidence Intervals for Multinomial Proportions. Technometrics 7:247–252.
Halperin, M.; Greenhouse, S.; Cornfield, J.; and Zalokar, J. 1955 Tables of Percentage Points for the Studentized Maximum Absolute Deviate in Normal Samples. Journal of the American Statistical Association 50:185–195.
Harter, H. Leon 1957 Error Rates and Sample Sizes for Range Tests in Multiple Comparisons. Biometrics 13:511–536.
Harter, H. Leon 1960 Tables of Range and Studentized Range. Annals of Mathematical Statistics 31: 1122–1147.
Hartley, H. O. 1950 The Maximum F-ratio as a Shortcut Test for Heterogeneity of Variance. Biometrika 37:308–312.
Hotelling, Harold 1927 Differential Equations Subject to Error, and Population Estimates. Journal of the American Statistical Association 22:283–314.
Hotelling, Harold 1931 The Generalization of Student’s Ratio. Annals of Mathematical Statistics 2: 360–378.
Kerrich, J. E. 1955 Confidence Intervals Associated With a Straight Line Fitted by Least Squares. Statistica neerlandica 9:125–129.
Keuls, M. 1952 The Use of “Studentized Range” in Connection With an Analysis of Variance. Euphytica 1:112–122.
Kramer, Clyde Y. 1956 Extension of Multiple Range Tests to Group Means With Unequal Number of Replications. Biometrics 12:307–310.
Kramer, Clyde Y. 1957 Extension of Multiple Range Tests to Group Correlated Adjusted Means. Biometrics 13:13–18.
Krishnaiah, P. R. 1965a On a Multivariate Generalization of the Simultaneous Analysis of Variance Test. Institute of Statistical Mathematics (Tokyo), Annals 17, no. 2:167–173.
Krishnaiah, P. R. 1965b Simultaneous Tests for the Equality of Variance Against Certain Alternatives. Australian Journal of Statistics 7:105–109.
Kurtz, T. E. 1956 An Extension of a Method of Making Multiple Comparisons (Preliminary Report). Annals of Mathematical Statistics 27:547 only.
Kurtz, T. E.; Link, R. F.; Tukey, J. W.; and Wallace, D. L. 1965 Shortcut Multiple Comparisons for Balanced Single and Double Classifications. Part 1: Results. Technometrics 7:95–169.
McHugh, Richard B.; and Ellis, Douglas S. 1955 The “Post Mortem” Testing of Experimental Comparisons. Psychological Bulletin 52:425–428.
Miller, Rupert G. 1966 Simultaneous Statistical Inference. New York: McGraw-Hill.
Moses, Lincoln E. 1953 Nonparametric Methods. Pages 426–450 in Helen M. Walker and Joseph Lev, Statistical Inference. New York: Holt.
Moses, Lincoln E. 1963 Rank Tests of Dispersion. Annals of Mathematical Statistics 34:973–983.
Moses, Lincoln E. 1965 Confidence Limits From Rank Tests (Reply to a Query). Technometrics 7:257–260.
Mosteller, Frederick W.; and Tukey, John W. 1950 Significance Levels for a k-sample Slippage Test. Annals of Mathematical Statistics 21:120–123.
Nair, K. R. 1948a The Studentized Form of the Extreme Mean Square Test in the Analysis of Variance. Biometrika 35:16–31.
Nair, K. R. 1948b The Distribution of the Extreme Deviate From the Sample Mean and Its Studentized Form. Biometrika 35:118–144.
Nair, K. R. 1952 Tables of Percentage Points of the “Studentized” Extreme Deviate From the Sample Mean. Biometrika 39:189–191.
Nemenyi, Peter 1963 Distribution-free Multiple Comparisons. Ph.D. dissertation, Princeton Univ.
Newman, D. 1939 The Distribution of the Range in Samples From a Normal Population, Expressed in Terms of an Independent Estimate of Standard Deviation. Biometrika 31:20–30.
Pearson, Egon S.; and Chandra Sekar, C. 1936 The Efficiency of Statistical Tools and a Criterion for the Rejection of Outlying Observations. Biometrika 28:308–320.
Pearson, Egon S.; and Hartley, H. O. (editors) (1954) 1966 Biometrika Tables for Statisticians. Vol. 1. 3d ed. Cambridge Univ. Press. → Only the first volume of this edition has as yet been published.
Pillai, K. C. S.; and Ramachandran, K. V. 1954 On the Distribution of the Ratio of the ith Observation in an Ordered Sample From a Normal Population to an Independent Estimate of the Standard Deviation. Annals of Mathematical Statistics 25:565–572.
Quesenberry, C. P.; and David, H. A. 1961 Some Tests for Outliers. Biometrika 48:379–390.
Roessler, R. G. 1946 Testing the Significance of Observations Compared With a Control. American Society for Horticultural Science, Proceedings 47:249–251.
Rothaus, Paul; and Worchel, Philip 1964 Ego-support, Communication, Catharsis, and Hostility. Journal of Personality 32:296–312.
Roy, S. N.; and Bose, R. C. 1953 Simultaneous Confidence Interval Estimation. Annals of Mathematical Statistics 24:513–536.
Roy, S. N.; and Gnanadesikan, R. 1957 Further Contributions to Multivariate Confidence Bounds. Biometrika 44:399–410.
Ryan, Thomas A. 1959 Multiple Comparisons in Psychological Research. Psychological Bulletin 56:26–47.
Ryan, Thomas A. 1960 Significance Tests for Multiple Comparisons of Proportions, Variances, and Other Statistics. Psychological Bulletin 57:318–328.
Scheffé, Henry 1953 A Method for Judging All Contrasts in the Analysis of Variance. Biometrika 40:87–104.
Siotani, M.; and Ozawa, Masaru 1958 Tables for Testing the Homogeneity of k Independent Binomial Experiments on a Certain Event Based on the Range. Institute of Statistical Mathematics (Tokyo), Annals 10:47–63.
Stanley, Julian C. 1957 Additional “Post Mortem” Tests of Experimental Comparisons. Psychological Bulletin 54:128–130.
Steel, Robert G. D. 1959 A Multiple Comparison Sign Test: Treatments vs. Control. Journal of the American Statistical Association 54:767–775.
Sterling, Theodore D. 1959 Publication Decisions and Their Possible Effects on Inferences Drawn From Tests of Significance—or Vice Versa. Journal of the American Statistical Association 54:30–34.
Thompson, W. A. JR.; and Willke, T. A. 1963 On an Extreme Rank Sum Test for Outliers. Biometrika 50: 375–383.
Truax, Donald R. 1953 An Optimum Slippage Test for the Variances of k Normal Populations. Annals of Mathematical Statistics 24:669–674.
Tukey, J. W. 1953 The Problem of Multiple Comparisons. Unpublished manuscript, Princeton Univ.
Vaughan, Ted R.; Sjoberg, G.; and Smith, D. H. 1966 Religious Orientations of American Natural Scientists. Social Forces 44:519–526.
Vianelli, Silvio 1959 Prontuari per calcoli statistici: Tavole numeriche e complementi. Palermo: Abbaco.
Walsh, John E. 1965 Handbook of Nonparametric Statistics. Volume 2: Results for Two and Several Sample Problems, Symmetry, and Extremes. Princeton, N.J.: Van Nostrand.
Winer, B. J. 1962 Statistical Principles in Experimental Design. New York: McGraw-Hill.
Working, Holbrook; and Hotelling, Harold 1929 Application of the Theory of Error to the Interpretation of Trends. Journal of the American Statistical Association 24 (March Supplement) : 73–85.
Yates, Frank 1948 The Analysis of Contingency Tables With Groupings Based on Quantitative Characters. Biometrika 35:176–181.
"Linear Hypotheses." International Encyclopedia of the Social Sciences. . Encyclopedia.com. 18 Jul. 2019 <https://www.encyclopedia.com>.