Multivariate Analysis

views updated May 18 2018


I. OVERVIEW Ralph A. Bradley




III. CORRELATION (2) Harold Hotelling





Multivariate analysis in statistics is devoted to the summarization, representation, and interpretation of data when more than one characteristic of each sample unit is measured. Almost all data-collection processes yield multivariate data. The medical diagnostician examines pulse rate, blood pressure, hemoglobin, temperature, and so forth; the educator observes for individuals such quantities as intelligence scores, quantitative aptitudes, and class grades; the economist may consider at points in time indexes and measures such as per capita personal income, the gross national product, employment, and the Dow-Jones average. Problems using these data are multivariate because inevitably the measures are interrelated and because investigations involve inquiry into the nature of such interrelationships and their uses in prediction, estimation, and methods of classification. Thus, multivariate analysis deals with samples in which for each unit examined there are observations on two or more stochastically related measurements. Most of multivariate analysis deals with estimation, confidence sets, and hypothesis testing for means, variances, covariances, correlation coefficients, and related, more complex population characteristics.

Only a sketch of the history of multivariate analysis is given here. The procedures of multivariate analysis that have been studied most are based on the multivariate normal distribution discussed below.

Robert Adrian considered the bivariate normal distribution early in the nineteenth century, and Francis Galton understood the nature of correlation near the end of that century. Karl Pearson made important contributions to correlation, including multiple correlation, and to regression analysis early in the present century. G. U. Yule and others considered measures of association in contingency tables, and thus began multivariate developments for counted data. The pioneering work of “Student” (W. S. Gosset) on small-sample distributions led to R. A. Fisher’s distributions of simple and multiple correlation coefficients. J. Wishart derived the joint distribution of sample variances and covariances for small multivariate normal samples. Harold Hotelling generalized the Student t-statistic and t-distribution for the multivariate problem. S. S. Wilks provided procedures for additional tests of hypotheses on means, variances, and covariances. Classification problems were given initial consideration by Pearson, Fisher, and P. C. Mahalanobis through measures of racial likeness, generalized distance, and discriminant functions, with some results similar to the work of Hotelling. Both Hotelling and Maurice Bartlett made initial studies of canonical correlations, intercorrelations between two sets of variates. More recent research by S. N. Roy, P. L. Hsu, Meyer Girshick, D. N. Nanda, and others has dealt with the distributions of certain characteristic roots and vectors as they relate to multivariate problems, notably to canonical correlations and multivariate analysis of variance. Much attention has also been given to the reduction of multivariate data and its interpretation through many papers on factor analysis and principal components.
[For further discussion of the history of these special areas of multivariate analysis and of their present-day applications, see Counted Data; Distributions, Statistical, article on Special Continuous Distributions; Factor Analysis; Multivariate Analysis, articles on Correlation and Classification and Discrimination; Statistics, Descriptive, article on Association; and the biographies of Fisher, R. A.; Galton; Girshick; Gosset; Pearson; Wilks; Yule.]

Basic multivariate distributions

Scientific progress is made through the development of more and more precise and realistic representations of natural phenomena. Thus, science, and to an increasing extent social science, uses mathematics and mathematical models for improved understanding, such mathematical models being subject to adoption or rejection on the basis of observation [see Models, Mathematical]. In particular, stochastic models become necessary as the inherent variability in nature becomes understood.

The multivariate normal distribution provides the stochastic model on which the main theory of multivariate analysis is based. The model has sufficient generality to represent adequately many experimental and observational situations while retaining relative simplicity of mathematical structure. The possibility of applying the model to transforms of observations increases its scope [see Statistical Analysis, Special Problems Of, article on Transformations Of Data]. The large-sample theory of probability and the multivariate central limit theorem add importance to the study of the multivariate normal distribution as it relates to derived distributions. Inquiry and judgment about the use of any model must be the responsibility of the investigator, perhaps in consultation with a statistician. There is still a great deal to be learned about the sensitivity of the multivariate model to departures from that distributional assumption. [See Errors, article on Effects Of Errors In Statistical Assumptions.]

The multivariate normal distribution

Suppose that the characteristics or variates to be measured on each element of a sample from a population, conceptual or real, obey the probability law described through the multivariate normal probability density function. If these variates are p in number and are designated by X1, …, Xp, the multivariate normal density contains p parameters, or population characteristics, μ1, …, μp, representing, respectively, the means or expected values of the variates, and parameters σij, i, j = 1, …, p, σij = σji, representing variances and covariances of the variates. Here σii is the variance of Xi (corresponding to the variance σ² of a variate X in the univariate case) and σij = σji is the covariance of Xi and Xj. The correlation coefficient between Xi and Xj is

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\,\sigma_{jj}}}.$$

The multivariate normal probability density function provides the probability density for the variates X1, …, Xp at each point x1, …, xp in the sample or observation space. Its specific mathematical form is

$$f(x_1, \ldots, x_p) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\left[-\tfrac{1}{2}\,(x - \mu)'\,\Sigma^{-1}(x - \mu)\right],$$

−∞ < xi < ∞, i = 1, …, p. [For the explicit form of this density in the bivariate case (p = 2), see Multivariate Analysis, article on Correlation (1).]

(Vector and matrix notation and an understanding of elementary aspects of matrix algebra are important for any real understanding or application of multivariate analysis. Thus, x′ is the vector (x1, …, xp), μ′ is the vector (μ1, …, μp), and (x − μ)′ is the vector (x1 − μ1, …, xp − μp). Also, Σ is the p × p symmetric matrix which has elements σij, Σ = [σij], |Σ| is the determinant of Σ, and Σ⁻¹ is its inverse. The prime indicates “transpose,” and thus (x − μ)′ is the transpose of (x − μ), a column vector.)

Comparison of f(x1, …, xp) with f(x), the univariate normal probability density function, may assist understanding; for a univariate normal variate X with mean μ and variance σ²,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right],$$

where −∞ < x < ∞.
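As a small numerical illustration of that comparison, the sketch below evaluates the p = 2 density in its usual form, (2π)^(−p/2) |Σ|^(−1/2) exp[−½(x − μ)′Σ⁻¹(x − μ)], and checks that with zero covariance it equals the product of two univariate normal densities. The function names and the numbers are hypothetical, chosen only for illustration.

```python
import math

def bivariate_normal_pdf(x1, x2, mu1, mu2, s11, s22, s12):
    """Density f(x1, x2) of the p = 2 normal law with dispersion [[s11, s12], [s12, s22]]."""
    det = s11 * s22 - s12 * s12          # |Sigma|, must be positive
    if det <= 0:
        raise ValueError("Sigma must be positive definite")
    d1, d2 = x1 - mu1, x2 - mu2
    # Quadratic form (x - mu)' Sigma^{-1} (x - mu), using the explicit 2 x 2 inverse.
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))

def univariate_normal_pdf(x, mu, var):
    """Univariate normal density with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

# With zero covariance the joint density factors into the two univariate densities.
joint = bivariate_normal_pdf(1.0, -0.5, 0.0, 0.0, 4.0, 1.0, 0.0)
product = univariate_normal_pdf(1.0, 0.0, 4.0) * univariate_normal_pdf(-0.5, 0.0, 1.0)
```

The two values agree to rounding error, which is the sense in which the univariate density is the p = 1 case of the multivariate form.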

The multivariate normal density may be characterized in various ways. One direct method begins with p independent, univariate normal variables, U1, …, Up, each with zero mean and unit variance. From the independence assumption, their joint density is the product

$$g(u_1, \ldots, u_p) = \prod_{i=1}^{p} (2\pi)^{-1/2} e^{-u_i^2/2} = (2\pi)^{-p/2} \exp\left(-\tfrac{1}{2} \sum_{i=1}^{p} u_i^2\right),$$

a very special case of the multivariate normal probability density function. If variates X1, …, Xp are linearly related to U1, …, Up so that X = AU + μ in matrix notation, with X, U, and μ being column vectors and A being a p × p nonsingular matrix of constants aij, then

Xi = ai1U1 + … + aipUp + μi, i = 1, …, p.

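Clearly, the mean of Xi is E(Xi) = μi, where μi is a known constant and E represents “expectation.” The variance of Xi is

$$\operatorname{var}(X_i) = \sum_{k=1}^{p} a_{ik}^2 = \sigma_{ii},$$

and the covariance of Xi and Xj, i ≠ j, is

$$\operatorname{cov}(X_i, X_j) = \sum_{k=1}^{p} a_{ik} a_{jk} = \sigma_{ij},$$

so that Σ = AA′.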

Standard density function manipulations then yield the joint density function of X1, …, Xp as that already given as the general p-variate normal density. If the matrix A is singular, the results for E(Xi), var(Xi), and cov(Xi, Xj) still hold and X1, …, Xp are said to have a singular multivariate normal distribution; although the joint density function cannot be written, the concept is useful.
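The construction X = AU + μ can be sketched directly. The example below forms the implied dispersion matrix AA′ for a hypothetical 2 × 2 matrix A and then draws a simulated sample to compare empirical moments with it. The matrix, the mean vector, the seed, and the sample size are all illustrative assumptions.

```python
import random

def mat_mul_transpose(a):
    """Return A A' for a square matrix given as a list of rows."""
    p = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(p)) for j in range(p)] for i in range(p)]

A = [[2.0, 0.0], [1.0, 1.0]]   # hypothetical nonsingular matrix of constants a_ij
mu = [10.0, 5.0]               # hypothetical mean vector
sigma = mat_mul_transpose(A)   # implied dispersion matrix: [[4, 2], [2, 2]]

def draw(rng):
    """One observation X = A U + mu built from independent standard normal U_1, U_2."""
    u = [rng.gauss(0.0, 1.0) for _ in A]
    return [sum(A[i][k] * u[k] for k in range(len(A))) + mu[i] for i in range(len(A))]

rng = random.Random(0)                       # fixed seed so the run is repeatable
sample = [draw(rng) for _ in range(20000)]
n = len(sample)
means = [sum(x[i] for x in sample) / n for i in range(2)]
cov12 = sum((x[0] - means[0]) * (x[1] - means[1]) for x in sample) / (n - 1)
```

With a sample this large, `means` should lie close to `mu` and `cov12` close to the off-diagonal element of `sigma`; the agreement is statistical, not exact.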

A second characterization of the p-variate normal distribution is the following: X1, …, Xp have a p-variate normal distribution if and only if a1X1 + … + apXp is univariate normal for all choices of the coefficients ai, that is, if and only if all linear combinations of the Xi are univariate normal.

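The multivariate normal cumulative distribution function represents the probability of the joint occurrence of the events X1 ≤ x1, …, Xp ≤ xp and may be written

$$F(x_1, \ldots, x_p) = \int_{-\infty}^{x_1} \cdots \int_{-\infty}^{x_p} f(u_1, \ldots, u_p)\, du_p \cdots du_1,$$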

indicating that probabilities that observations fall into regions of the p-dimensional variate space may be obtained by integration. Tables of F(x1,..., xp) are available for p = 2, 3 (see Greenwood & Hartley 1962).

Some basic properties of the p-variate normal distribution in terms of X = (X1, ..., Xp) are the following.

(a) Any subset of the Xi has a multivariate normal distribution. In fact, any set of q linear combinations of the Xi has a q-variate normal distribution, a result following directly from the linear combination characterization, q ≤ p.

(b) The conditional distribution of q of the Xi, given the p − q others, is q-variate normal.

(c) If σij = 0, i ≠ j, then Xi and Xj are independent.

(d) The expectation and variance of $\sum_i a_i X_i$ are $\sum_i a_i \mu_i$ and $\sum_i \sum_j a_i a_j \sigma_{ij}$.

(e) The covariance of $\sum_i a_i X_i$ and $\sum_i b_i X_i$ is $\sum_i \sum_j a_i b_j \sigma_{ij}$.

A cautionary note is that X1, …, Xp may be separately (marginally) univariate normal while the joint distribution may be very nonnormal.

The geometric properties of the p-dimensional surface defined by y = f(x1, …, xp) are interesting. Contours of the surface are p-dimensional ellipsoids. All inflection points of the surface occur at constant y and hence fall on the same horizontal ellipsoidal cross section. Any vertical cross section of the surface leads to a subsurface that is normal or multivariate normal in form and is capable of representation as a normal probability density surface except for a proportionality constant.

Characteristic and moment-generating functions yield additional methods of description of random variables [see Distributions, Statistical, article on Special Continuous Distributions]. For the multivariate normal distribution, the moment-generating function is

$$M(t_1, \ldots, t_p) = E\left[e^{t'X}\right] = \exp\left(t'\mu + \tfrac{1}{2}\, t'\Sigma t\right),$$

where t′ = (t1, …, tp). The moment-generating function may describe either the nonsingular or singular p-variate normal distribution. Note that, from its definition, the matrix Σ may be shown to be nonnegative definite. When Σ is positive definite the multivariate density may be specified as f(x1, …, xp). When Σ is singular, Σ⁻¹ does not exist, and the density may not be given. However, M(t1, …, tp) may still be given and can thus describe the singular multivariate normal distribution. To say that X has a singular distribution is to say that X lies in some hyperplane of dimension less than p.

The multivariate normal sample

Table 1 illustrates a multivariate sample with p = 4 and sample size N = 10; the data here are head measurements.

Table 1 — Measurements taken on first and second adult sons in a sample of ten families
Columns: head length, first son; head length, second son; head breadth, first son; head breadth, second son.
Source: Based on original data by G. P. Frets, presented in Rao 1952, table 7b.2β.

One can anticipate covariance or correlation between head length and head breadth and between head measurements of first and second sons. Hence, for most purposes it will be important to treat the data as a single multivariate sample rather than as several univariate samples.

General notation for a multivariate sample is developed in terms of the variates X1α, …, Xpα representing the p observation variates for the αth sample unit (for example, the αth family in the sample), α = 1, …, N. In a parallel way xiα may be regarded as the realization of Xiα in a particular set of sample data. For multivariate normal procedures, standard data summarization involves calculation of the sample means,

$$\bar{x}_i = \frac{1}{N} \sum_{\alpha=1}^{N} x_{i\alpha}, \quad i = 1, \ldots, p,$$

and the sample variances and covariances,

$$s_{ij} = \frac{1}{N-1} \sum_{\alpha=1}^{N} (x_{i\alpha} - \bar{x}_i)(x_{j\alpha} - \bar{x}_j), \quad i, j = 1, \ldots, p.$$

Sample correlation coefficients may be computed from $r_{ij} = s_{ij}/\sqrt{s_{ii}\, s_{jj}}$. For the data of Table 1, the sample values of the statistics are given in Table 2.

Table 2 — Sample statistics for measurements taken on sons
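The summary statistics described above (sample means, variances and covariances with divisor N − 1, and correlations) can be computed as in the following sketch. The data are hypothetical and are not the Table 1 measurements; the function name is likewise an assumption.

```python
import math

def sample_statistics(rows):
    """Return (means, covariance matrix s, correlation matrix r) for N rows of p variates."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[i] for row in rows) / n for i in range(p)]
    # s_ij uses the divisor N - 1, matching the usual unbiased sample covariance.
    s = [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / (n - 1)
          for j in range(p)] for i in range(p)]
    r = [[s[i][j] / math.sqrt(s[i][i] * s[j][j]) for j in range(p)] for i in range(p)]
    return means, s, r

# Hypothetical bivariate data (NOT the Table 1 measurements).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 5.0], [4.0, 9.0]]
means, s, r = sample_statistics(data)
```

For these four observations the means are 2.5 and 5.0, the covariance s12 is 11/3, and the correlation r12 is 11/√130, roughly 0.96.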

The required assumptions for the simpler multivariate normal procedures are that the observation vectors (X1α, …, Xpα) are independent in probability and that each such observation vector consists of p variates following the same multivariate normal law, that is, having the same probability density f(x1, …, xp) with the same parameters, elements of μ and Σ. The joint density for the p × N random variables Xiα is, by the independence assumption, just the product of N p-variate normal densities, each having the same μ and σ’s. The joint density may be expressed in terms of μ and Σ and x̄ and s, where s is the symmetric p × p matrix with elements sij and x̄′ = (x̄1, …, x̄p).

Elements of S, the matrix of random variables corresponding to s, and of the vector X̄ constitute a set of sufficient statistics for the parameters in Σ and μ [see Sufficiency]. Furthermore, it may be shown that S and X̄ are independent.

Basic derived distributions. The distribution of the vector of sample means, X̄′ = (X̄1, …, X̄p), is readily described for the random sampling under discussion. That distribution is again p-variate normal with the same mean vector, μ, as in the underlying population but with covariance matrix N⁻¹Σ.

There is complete analogy here with the univariate case.

The joint probability density function of the sample variances and covariances, Sij, has been named the Wishart distribution after its developer. With n = N − 1, this density is

$$w(s) = \frac{(n/2)^{pn/2}\, |s|^{(n-p-1)/2} \exp\left[-\tfrac{n}{2} \operatorname{tr}\left(\Sigma^{-1} s\right)\right]}{\pi^{p(p-1)/4}\, |\Sigma|^{n/2} \prod_{i=1}^{p} \Gamma\left[\tfrac{1}{2}(n + 1 - i)\right]},$$

where −∞ < sij < ∞, i < j, 0 ≤ sii < ∞, i, j = 1, …, p, and the matrix s is positive definite.

The Wishart density is a generalization of the chi-square density with N − 1 degrees of freedom for (N − 1)S²/σ² in the univariate case, in which S² is the sample variance based on N independent observations from a univariate normal population. Anderson (1958, sec. 14.3) has a note on the noncentral Wishart distribution, a generalization of the noncentral chi-square distribution.

Procedures on means, variances, covariances

Many of the simpler multivariate statistical procedures were developed as extensions of useful univariate methods dealing with tests of hypotheses and related confidence intervals for means, variances, and covariances. Small-sample distributions of important statistics of multivariate analysis have been found; almost invariably the starting point in the derivations is the joint probability density of sample means and sample variances and covariances, the product of a multivariate normal density and a Wishart density, or one of these densities separately.

Inferences on means, dispersion known. If µ* is a p-element column vector of given constants and if the elements of Σ are known, it was shown long ago, perhaps first by Karl Pearson, that when µ = µ*, $Q(\bar{X}) = N(\bar{X} - \mu^*)'\,\Sigma^{-1}(\bar{X} - \mu^*)$ has the central chi-square distribution with p degrees of freedom [see Distributions, Statistical, article on Special Continuous Distributions]. It was later shown more generally that Q(X̄) has the noncentral chi-square distribution with p degrees of freedom and noncentrality parameter $\tau^2 = N(\mu - \mu^*)'\,\Sigma^{-1}(\mu - \mu^*)$ when µ ≠ µ*. (The symbol τ² is consistent with the notation of Anderson 1958, sec. 5.4.)

A null hypothesis, H01: µ = µ*, specifying the means of the multivariate normal density when Σ is known and when the alternative hypothesis is general, µ ≠ µ*, may be of interest in some experimental situations. With significance level α, the critical region of the sample space, the region of rejection of the hypothesis H01, is that region where $Q(\bar{x}) \ge \chi^2_{p;\alpha}$, $\chi^2_{p;\alpha}$ being the tabular value of a chi-square variate with p degrees of freedom such that $P\{\chi^2_p \ge \chi^2_{p;\alpha}\} = \alpha$ [see Hypothesis Testing]. The power of this test may be computed when H01 is false, that is, when µ ≠ µ*, by evaluation of the probability $P\{\chi'^2_p \ge \chi^2_{p;\alpha}\}$, where $\chi'^2_p$ is a noncentral chi-square variate with p degrees of freedom and noncentrality τ².

When the alternative hypotheses are one-sided, in the sense that each component of µ is taken to be greater than or equal to the corresponding component of µ*, the problem is more difficult. First steps have been taken toward the solution of this problem (see Kudô 1963; Nüesch 1966).

Since µ is unknown, it is estimated by x̄. Corresponding to the test given above, the confidence region with confidence coefficient 1 − α for the µi consists of all values µ* for which the inequality $Q(\bar{x}) \le \chi^2_{p;\alpha}$ holds [see Estimation, article on Confidence Intervals and Regions]. This confidence region is the surface and interior of an ellipsoid centered at the point whose coordinates are the elements of x̄ in the p-dimensional parameter space of the elements of µ.
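A minimal sketch of the known-dispersion test for p = 2 follows. It assumes the statistic has the usual quadratic form Q = N(x̄ − µ*)′Σ⁻¹(x̄ − µ*) and uses the fact that a chi-square variate with 2 degrees of freedom has upper tail probability exp(−c/2), so the critical value is available in closed form for this one case. All numerical inputs and function names are hypothetical.

```python
import math

def q_statistic(xbar, mu_star, sigma, n):
    """Q = N (xbar - mu*)' Sigma^{-1} (xbar - mu*) for p = 2, Sigma known."""
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 * s12
    d1, d2 = xbar[0] - mu_star[0], xbar[1] - mu_star[1]
    return n * (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det

def chi2_upper_point_2df(alpha):
    """Upper-alpha point of chi-square with 2 d.f., from P{chi2 >= c} = exp(-c / 2)."""
    return -2.0 * math.log(alpha)

# Hypothetical inputs: N = 10 observations, known Sigma, hypothesized mean mu* = 0.
sigma = [[4.0, 2.0], [2.0, 2.0]]
q = q_statistic([1.0, 0.0], [0.0, 0.0], sigma, 10)
critical = chi2_upper_point_2df(0.05)
reject = q >= critical
```

Here Q = 5.0 falls below the 5 per cent point (about 5.99), so the hypothesis is not rejected; equivalently, µ* = 0 lies inside the 95 per cent confidence ellipse for these inputs. For p other than 2 the critical value would have to come from chi-square tables or a statistical library.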

Paired sample problems may also be handled. Let Y1, …, Y2p be 2p variates with means ζ1, …, ζ2p having a multivariate normal density, and let yjα, j = 1, …, 2p, α = 1, …, N, be independent multivariate observations from this multivariate normal population. Suppose that Yi and Yp+i, i = 1, …, p, are paired variates. Then Xi = Yi − Yp+i, i = 1, …, p, make a set of multivariate normal variates with parameters that again may be designated as the elements of µ and Σ, µi = ζi − ζp+i. Similarly, take xiα = yiα − yp+i,α and x̄i = ȳi − ȳp+i. Inferences on the means, µi, of the difference variates, Xi, when Σ is known may be made on the same basis as above for the simple sample. In the paired situation it will often be appropriate to take µ* = 0, that is, H01: µ = 0. Here 0 denotes a vector of 0’s. For example, in Table 1 the data can be paired through the association of first and second sons in a family; a pertinent inquiry may relate to the equalities of both mean head lengths and mean head breadths of first and second sons. For association with this paragraph, columns in Table 1 should have variate headings Y1, Y3, Y2, and Y4, indicating that p = 2; then X1 = Y1 − Y3 measures difference in head lengths of first and second sons and X2 = Y2 − Y4 measures difference in head breadths.

There are also nonpaired versions of these procedures. In a table similar to Table 1 the designations “first son” and “second son” might be replaced by “adult male American Indian” and “adult male Eskimo.” Then the data could be considered to consist of ten bivariate observations taken at random from each of the two indicated populations with no basis for the pairing of the observation vectors. Anthropological study might require comparisons of mean head lengths and mean head breadths for the two racial groups. The procedures of this section may be adapted to this problem. Suppose that $X^{(1)}_1, \ldots, X^{(1)}_p$ and $X^{(2)}_1, \ldots, X^{(2)}_p$ are the p variates for the two populations, the two sets of variates being stochastically independent of each other and having multivariate normal distributions with common dispersion matrix Σ but with means $\mu^{(1)}_1, \ldots, \mu^{(1)}_p$ and $\mu^{(2)}_1, \ldots, \mu^{(2)}_p$, respectively. The corresponding sample means, $\bar{x}^{(1)}_i$ and $\bar{x}^{(2)}_i$, are based respectively on samples of independent observations of sizes N1 and N2 from the two populations. Definition of $\mu = \mu^{(1)} - \mu^{(2)}$ and $\bar{x} = \bar{x}^{(1)} - \bar{x}^{(2)}$, and replacement of NΣ⁻¹ by [N1N2/(N1 + N2)]Σ⁻¹, permits association and use of Q(x̄) and its properties for this two-sample problem. If the dispersion matrices of the two populations are known but different, a slight modification of the procedure is readily available.

Jackson and Bradley (1961) have extended these methods to sequential multivariate analysis [see Sequential Analysis].

Generalized Student procedures

In the preceding section it was assumed that Σ was known, but in most applications this is not the case. Rather, Σ must be estimated from the data, and the generalized Student statistic or Hotelling’s T², $T^2(\bar{X}, S) = N(\bar{X} - \mu^*)'\,S^{-1}(\bar{X} - \mu^*)$, comparable to Q(X̄), is almost always used. (For procedures that are not based on T² see Šidák 1967.) It has been shown that $F(\bar{X}, S) = (N - p)\,T^2(\bar{X}, S)/[p(N - 1)]$ has the variance-ratio or F-distribution with p and N − p degrees of freedom [see Distributions, Statistical, article on Special Continuous Distributions]. The F-distribution is central when µ = µ*, that is, when the mean vector of the multivariate normal population is equal to the constant vector µ*, and is noncentral otherwise with noncentrality parameter τ² already defined.

The hypothesis H01: µ = µ* is of interest, as before. The statistic F(X̄, S) takes the role of Q(X̄), and $F_{p,N-p;\alpha}$ takes the role of $\chi^2_{p;\alpha}$, where $F_{p,N-p;\alpha}$ is the tabular value of the variance-ratio variate $F_{p,N-p}$ with p and N − p degrees of freedom such that $P\{F_{p,N-p} \ge F_{p,N-p;\alpha}\} = \alpha$. The confidence region for the elements of µ consists of all values µ* for which the inequality $F(\bar{x}, s) \le F_{p,N-p;\alpha}$ holds; the region is again an ellipsoid centered at x̄ in the p-dimensional parameter space, and the confidence coefficient is 1 − α.
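The generalized Student computation can be sketched for p = 2 as follows, assuming the usual forms T² = N(x̄ − µ*)′s⁻¹(x̄ − µ*) and F = (N − p)T²/[p(N − 1)]. The summary statistics are hypothetical and the function names are illustrative only.

```python
def hotelling_t2(xbar, mu_star, s, n):
    """T^2 = N (xbar - mu*)' s^{-1} (xbar - mu*) for p = 2."""
    (s11, s12), (_, s22) = s
    det = s11 * s22 - s12 * s12
    d1, d2 = xbar[0] - mu_star[0], xbar[1] - mu_star[1]
    return n * (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det

def f_from_t2(t2, n, p):
    """Variance-ratio form of T^2, referred to F with p and N - p degrees of freedom."""
    return (n - p) * t2 / (p * (n - 1))

# Hypothetical sample summary: N = 10, mean vector and dispersion matrix as below.
xbar = [1.0, 0.5]
s = [[4.0, 1.0], [1.0, 2.0]]
t2 = hotelling_t2(xbar, [0.0, 0.0], s, 10)
f = f_from_t2(t2, 10, 2)
```

For these inputs T² = 20/7 and F = 160/126, about 1.27, which would be compared with a tabular value of F with 2 and 8 degrees of freedom.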

Visualization of the confidence region for the elements of µ is often difficult. When p = 2, the ellipsoid becomes a simple ellipse and may be plotted (see Figure 1). When p > 2, two-dimensional elliptical cross sections of the ellipsoid may be plotted, and parallel tangent planes to the ellipsoid may be found that yield crude bounds on the various parameters. One or more linear contrasts among the elements of µ may be of special interest, and then the dimensionality of the whole problem, including the confidence region, is reduced. Some of the problems of multiple comparisons arise when linear contrasts are used [see Linear Hypotheses, article on Multiple Comparisons].

For the simple one-sample problem, s = [sij] is computed as shown in Table 2. For the paired sample problem, s in F(x̄, s) is the sample variance-covariance matrix computed from the derived multivariate sample of differences, and x̄ is the sample vector of mean differences, as before. For the unpaired two-sample problem, it is necessary to replace Ns⁻¹ in F(x̄, s), just as it was necessary to replace NΣ⁻¹ when Σ was known. Each population has the dispersion matrix Σ, and two sample dispersion matrices $s^{(1)}$ and $s^{(2)}$ may be computed, one for each multivariate sample, to estimate Σ. A “pooled” estimate of the dispersion matrix Σ is $s^* = [(N_1 - 1)\,s^{(1)} + (N_2 - 1)\,s^{(2)}]/(N_1 + N_2 - 2)$, the multivariate generalization of the pooled estimate of variance often used in univariate statistics. For the two-sample problem, Ns⁻¹ in F(x̄, s) is replaced by [N1N2/(N1 + N2)]s*⁻¹. All of the assumptions about the populations and about the samples discussed in the preceding section apply for the corresponding generalized Student procedures.
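The two-sample substitution can be illustrated as below. The sketch assumes the usual pooled form, [(N1 − 1)s(1) + (N2 − 1)s(2)]/(N1 + N2 − 2), and the multiplier N1N2/(N1 + N2) in place of N; the summary statistics and function names are hypothetical.

```python
def pooled_dispersion(s1, n1, s2, n2):
    """Pooled estimate [(N1 - 1) s1 + (N2 - 1) s2] / (N1 + N2 - 2), element by element."""
    p = len(s1)
    return [[((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
             for j in range(p)] for i in range(p)]

def two_sample_t2(xbar1, xbar2, s_pooled, n1, n2):
    """T^2 with N s^{-1} replaced by [N1 N2 / (N1 + N2)] s*^{-1}, for p = 2 and mu* = 0."""
    (s11, s12), (_, s22) = s_pooled
    det = s11 * s22 - s12 * s12
    d1, d2 = xbar1[0] - xbar2[0], xbar1[1] - xbar2[1]
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return n1 * n2 / (n1 + n2) * quad

# Hypothetical summary statistics for two independent bivariate samples of size 10.
s_star = pooled_dispersion([[4.0, 1.0], [1.0, 2.0]], 10, [[6.0, 1.0], [1.0, 4.0]], 10)
t2 = two_sample_t2([3.0, 1.0], [1.0, 1.0], s_star, 10, 10)
```

With equal sample sizes the pooled matrix is simply the average of the two sample matrices, here [[5, 1], [1, 3]], and T² works out to 30/7.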

An application of the generalized Student procedures for paired samples may be made for the data in Table 1. The bivariate (p = 2) sample of paired differences (in Table 1, column 1 minus column 2, column 3 minus column 4) is exhibited in Table 3.

Table 3 — Difference data on head measurements, first adult son minus second adult son
Columns: head-length difference; head-breadth difference.
Table 4 — Sample statistics for measurement differences on sons

The sample mean differences and sample variances and covariance of the difference data are given in Table 4, along with the elements $s^{ij}$ of s⁻¹. The column headings and statistics in Table 3 and Table 4 have the arguments d simply to distinguish them from the symbols in Tables 1 and 2. For a comparison of first and second sons, it may be appropriate to take µ* = 0 and compute $T^2 = N\,\bar{x}(d)'\,s^{-1}(d)\,\bar{x}(d)$ and $F = (N - p)\,T^2/[p(N - 1)]$ with N = 10 and p = 2.

If a significance level α = .10 is chosen, then $F_{2,8;.10}$ = 3.11 and the differences between paired means are not statistically significant; indeed, they are less than ordinary variation would lead one to expect. (For some sets of data this sort of result should lead to re-examination of possible biases or nonindependence in the data-collection process.)

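To find those values µ* in the confidence region for µ, the value µ* = 0 must be replaced by a general µ* in T²(x̄, s); thus,

$$T^2 = N \sum_{i=1}^{2} \sum_{j=1}^{2} s^{ij}\,(\bar{x}_i - \mu_i^*)(\bar{x}_j - \mu_j^*).$$

The corresponding confidence region on µ1 and µ2 with confidence coefficient 1 − α consists of those points in the (µ1, µ2)-space inside or on the ellipse described by

$$\frac{N - p}{p(N - 1)}\, T^2 = F_{2,8;\alpha}.$$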

This ellipse is plotted in Figure 1 for α = .05, .10, .25, $F_{2,8;\alpha}$ = 4.46, 3.11, 1.66, for clearer insight into the nature of the region.
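The tabular values quoted for F with 2 and 8 degrees of freedom can be checked without tables, because when the numerator degrees of freedom equal 2 the upper tail probability has the closed form P{F ≥ f} = (1 + 2f/m)^(−m/2). The sketch below uses that special case and then tests whether a point lies inside the confidence ellipse; the summary statistics are hypothetical (not the Table 4 values) and the function names are illustrative.

```python
def f_upper_point_2_m(alpha, m):
    """Upper-alpha point of F with 2 and m d.f.

    Uses the closed form P{F >= f} = (1 + 2 f / m) ** (-m / 2), which is
    valid only when the numerator degrees of freedom equal 2.
    """
    return (m / 2.0) * (alpha ** (-2.0 / m) - 1.0)

def in_confidence_region(mu_star, xbar, s, n, alpha):
    """True if mu_star lies inside or on the 1 - alpha confidence ellipse (p = 2)."""
    (s11, s12), (_, s22) = s
    det = s11 * s22 - s12 * s12
    d1, d2 = xbar[0] - mu_star[0], xbar[1] - mu_star[1]
    t2 = n * (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    f = (n - 2) * t2 / (2 * (n - 1))
    return f <= f_upper_point_2_m(alpha, n - 2)

points = {alpha: round(f_upper_point_2_m(alpha, 8), 2) for alpha in (0.05, 0.10, 0.25)}

# Hypothetical summary statistics (not the Table 4 values), N = 10.
xbar, s = [1.0, 0.5], [[4.0, 1.0], [1.0, 2.0]]
origin_inside = in_confidence_region([0.0, 0.0], xbar, s, 10, 0.10)
far_point_inside = in_confidence_region([10.0, 10.0], xbar, s, 10, 0.10)
```

The rounded values in `points` reproduce 4.46, 3.11, and 1.66. Tracing the boundary of the ellipse for plotting would require solving the quadratic for µ2 at each µ1, which this sketch does not attempt.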

A number of variants of the generalized Student procedure have been developed, and other variants are bound to be developed in the future. For example, one may wish to test null hypotheses specifying relationships between the coordinates of µ (see Anderson 1958, sec. 5.3.5). Again, one may wish to test that certain coordinates of µ have given values, knowing the values of the other coordinates.

For another sort of variant, recall that it was assumed for the two-sample application that the dispersion matrices for the two parent populations were identical. If this assumption is untenable, then a multivariate analogue of the Behrens-Fisher problem must be considered (Anderson 1958, sec. 5.6). Sequential extensions of the generalized Student procedures have been given by Jackson and Bradley (1961).

Generalized variances

Tests of hypotheses and confidence intervals on variances are conducted easily in univariate cases through the use of the chi-square and variance-ratio distributions. The situation is much more difficult in multivariate analysis.

For the multivariate one-sample problem, hypotheses and confidence regions for elements of the dispersion matrix, Σ, may be considered. A first possible hypothesis is H02: Σ = Σ*, a null hypothesis specifying all of the elements of Σ. (This hypothesis is of limited interest per se, except when Σ* = I or as an introduction to procedures on multivariate linear hypotheses.) It is clear that a test statistic should depend on the elements Sij of S; it is not clear what function of these elements might be appropriate.

The statistic |S| has been called the generalized sample variance, and |Σ| has been called the generalized variance. The test statistic |S|/|Σ*| was proposed by Wilks, who examined its distribution; simple, exact, small-sample distributions are known only when p = 1, 2. An asymptotic or limiting distribution is available for large N; the statistic $\sqrt{N - 1}\,(|S|/|\Sigma^*| - 1)/\sqrt{2p}$ has the limiting univariate normal density with zero mean and unit variance. It is clear that when Σ = Σ* under H02, S estimates Σ*, and the ratio |S|/|Σ*| should be near unity; it is not clear that the ratio may not be near unity when Σ ≠ Σ*. However, values of |S| that differ substantially from |Σ*| should lead to rejection of H02 (see Anderson 1958, sec. 7.5).

Wilks’s use of generalized variances is only one possible generalization of univariate procedures. Other comparisons of S and Σ* are possible. In nondegenerate cases, Σ* is nonsingular, and the product matrix SΣ*⁻¹ should be approximately an identity matrix. All of the characteristic roots from the determinantal equation |SΣ*⁻¹ − λI| = 0, where I is the p × p identity matrix, should be near unity; the trace, tr SΣ*⁻¹, should be near p. Roy (1957, sec. 14.9) places major emphasis on the largest and smallest roots of S and Σ and gives approximate confidence bounds on the roots of the latter in terms of those of the former. A test of H02 may be devised with the hypothesis being rejected when the corresponding roots of Σ* fail to fall within the confidence bounds. These and other similar considerations have led to extensive study of the distributions of roots of determinantal equations. Complete and exact solutions to these multivariate problems are not available.
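The quantities discussed here (the determinant ratio, the characteristic roots of SΣ*⁻¹, and its trace) are easy to compute for p = 2, where the determinantal equation reduces to a quadratic. The matrices below are hypothetical, and the helper names are assumptions made for this sketch.

```python
import math

def inverse_2x2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mat_mul_2x2(x, y):
    """Product of two 2 x 2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def roots_2x2(m):
    """Roots of |M - lambda I| = 0 for a 2 x 2 matrix with real roots, largest first."""
    tr = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    return (tr + disc) / 2.0, (tr - disc) / 2.0

s = [[4.4, 2.1], [2.1, 1.9]]            # hypothetical sample dispersion matrix
sigma_star = [[4.0, 2.0], [2.0, 2.0]]   # hypothesized dispersion matrix
m = mat_mul_2x2(s, inverse_2x2(sigma_star))
largest, smallest = roots_2x2(m)
trace = m[0][0] + m[1][1]               # should be near p = 2 under the hypothesis
ratio = ((s[0][0] * s[1][1] - s[0][1] ** 2)
         / (sigma_star[0][0] * sigma_star[1][1] - sigma_star[0][1] ** 2))
```

The product of the two roots equals the determinant ratio |S|/|Σ*| and their sum equals the trace, so the three criteria are different summaries of the same roots. Deciding how far from unity the roots may fall still requires the distribution theory cited in the text.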

Suppose that two independent multivariate normal populations have dispersion matrices Σ(1) and Σ(2), and samples of independent observation vectors of sizes N1 and N2 yield, respectively, sample dispersion matrices S(1) and S(2). The hypothesis of interest is H03: Σ(1) = Σ(2). In the univariate case (p = 1), the statistic F = S(1)/S(2) is the simple variance ratio and, under H03, has the F-distribution with N1 − 1 and N2 − 1 degrees of freedom. The general likelihood ratio criterion for testing H03 is, with minor adjustment,


If p = 1, then λ is a monotone function of F. By asymptotic theory for large N1 and N2, −2 loge λ may be taken to have the central chi-square distribution with p(p + 1)/2 degrees of freedom under H03. Anderson (1958, secs. 10.2, 10.4-10.6) discusses these problems further.

Roy (1957, sec. 14.10) prefers again to consider characteristic roots and develops test procedures and confidence procedures based on the largest and smallest roots of $S^{(1)}(S^{(2)})^{-1}$. Heck (1960) has provided some charts of upper percentage points of the distribution of the largest characteristic root.

Multivariate analysis of variance

Multivariate analysis of variance bears the same relationship to the problems of generalized variances as does univariate analysis of variance to simple variances. An understanding of the basic principles of the analysis of variance is necessary to consider the multivariate generalization. The theory of general linear hypotheses is pertinent, and concepts of experimental design carry over to the multivariate case. [See Experimental Design, article on the Design of Experiments; Linear Hypotheses, article on Analysis of Variance.]

Consider the univariate randomized block design with v treatments and b blocks. A response, Xγδ, on treatment γ in block δ is expressed in the fixed-effects model (Model I) as the linear function Xγδ = µ + τγ + βδ + εγδ, where µ is the over-all mean level of response, τγ is the modifying effect of treatment γ, βδ is the special influence of block δ, and εγδ is a random error such that the set of vb errors are independent univariate normal variates with zero means and equal variances, σ². The multivariate generalization of this model replaces the scalar variate Xγδ with a p-variate column vector Xγδ with elements Xiγδ, i = 1, …, p, consisting of responses on each of p variates for treatment γ in block δ. Similarly, the scalars µ, τγ, βδ, and εγδ are replaced by p-element column vectors, and the vectors εγδ constitute a set of vb independent multivariate normal vector variates with zero means and common dispersion matrices, Σ.

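In univariate analysis of variance, treatment and error mean squares are calculated. If these are $s_T^2$ and $s_W^2$, their forms are

$$s_T^2 = \frac{b \sum_{\gamma=1}^{v} (\bar{X}_{\gamma\cdot} - \bar{X}_{\cdot\cdot})^2}{v - 1}, \qquad s_W^2 = \frac{\sum_{\gamma=1}^{v} \sum_{\delta=1}^{b} (X_{\gamma\delta} - \bar{X}_{\gamma\cdot} - \bar{X}_{\cdot\delta} + \bar{X}_{\cdot\cdot})^2}{(v - 1)(b - 1)},$$

where $\bar{X}_{\gamma\cdot} = \sum_{\delta} X_{\gamma\delta}/b$, $\bar{X}_{\cdot\delta} = \sum_{\gamma} X_{\gamma\delta}/v$, and $\bar{X}_{\cdot\cdot} = \sum_{\gamma}\sum_{\delta} X_{\gamma\delta}/(vb)$. The test of treatment equality is the test of the hypothesis H04: τ1 = … = τv (= 0); the statistic used is $F = s_T^2/s_W^2$, distributed as F with v − 1 and (v − 1)(b − 1) degrees of freedom under H04, with large values of F statistically significant. When H04 is true, both $s_T^2$ and $s_W^2$ provide unbiased estimates of σ² and are independent in probability, whereas when H04 is false, $s_W^2$ still gives an unbiased estimate of σ², but $s_T^2$ tends to be larger.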
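The univariate randomized block computation can be sketched as follows, assuming the usual treatment and error mean squares with v − 1 and (v − 1)(b − 1) degrees of freedom. The responses are hypothetical and the function name is an assumption.

```python
def randomized_block_f(x):
    """x[g][d] is the response on treatment g in block d; returns (ms_treat, ms_error, F)."""
    v, b = len(x), len(x[0])
    grand = sum(sum(row) for row in x) / (v * b)
    treat = [sum(row) / b for row in x]                              # treatment means
    block = [sum(x[g][d] for g in range(v)) / v for d in range(b)]   # block means
    ms_t = b * sum((t - grand) ** 2 for t in treat) / (v - 1)
    ms_w = sum((x[g][d] - treat[g] - block[d] + grand) ** 2
               for g in range(v) for d in range(b)) / ((v - 1) * (b - 1))
    return ms_t, ms_w, ms_t / ms_w

# Hypothetical responses: v = 3 treatments (rows) in b = 2 blocks (columns).
x = [[1.0, 2.0], [2.0, 4.0], [6.0, 6.0]]
ms_t, ms_w, f = randomized_block_f(x)
```

For these data the treatment mean square is 10.5, the error mean square is 0.5, and F = 21 on 2 and 2 degrees of freedom. The multivariate version replaces each squared deviation by a product of deviations for variates i and j, giving matrices in place of the two mean squares.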

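The multivariate generalization of analysis of variance involves comparison of p × p dispersion matrices ST and SW, the elements of which correspond to $s_T^2$ and $s_W^2$:

$$s_{T,ij} = \frac{b \sum_{\gamma} (\bar{X}_{i\gamma\cdot} - \bar{X}_{i\cdot\cdot})(\bar{X}_{j\gamma\cdot} - \bar{X}_{j\cdot\cdot})}{v - 1}, \qquad s_{W,ij} = \frac{\sum_{\gamma} \sum_{\delta} (X_{i\gamma\delta} - \bar{X}_{i\gamma\cdot} - \bar{X}_{i\cdot\delta} + \bar{X}_{i\cdot\cdot})(X_{j\gamma\delta} - \bar{X}_{j\gamma\cdot} - \bar{X}_{j\cdot\delta} + \bar{X}_{j\cdot\cdot})}{(v - 1)(b - 1)},$$

for i, j = 1, …, p. It can be shown that ST and SW have independent Wishart distributions with v − 1 and (v − 1)(b − 1) degrees of freedom and identical dispersion matrices, Σ, under H04. Thus, the multivariate analysis-of-variance problem is reduced again to the problem of comparing two dispersion matrices, ST and SW, like S(1) and S(2) of the preceding section. This is the general situation in multivariate analysis of variance, even though this illustration is for a particular experimental design.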

Wilks (1932a; 1935) recommended use of the statistic |SW|/|SW + ST|, Roy (1953) considered the largest root of $S_T S_W^{-1}$, and Lawley (1938) suggested $\operatorname{tr}(S_T S_W^{-1})$. These statistics correspond roughly to criteria on the product of characteristic roots, the largest root, and the sum of the roots, respectively. They lead to equivalent tests in the univariate case (where only one root exists), but the tests are not equivalent in the multivariate case.
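The relation between the three criteria and the characteristic roots can be seen numerically for p = 2. The sketch below takes the determinant-ratio form |SW|/|SW + ST| at face value, treats the roots as those of ST·SW⁻¹, and uses hypothetical matrices; the function names are assumptions.

```python
import math

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def roots_of_st_sw_inverse(s_t, s_w):
    """Roots of |S_T S_W^{-1} - lambda I| = 0 for 2 x 2 matrices, largest first."""
    d = det2(s_w)
    inv = [[s_w[1][1] / d, -s_w[0][1] / d], [-s_w[1][0] / d, s_w[0][0] / d]]
    m = [[sum(s_t[i][k] * inv[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    tr = m[0][0] + m[1][1]
    disc = math.sqrt(max(tr * tr - 4.0 * det2(m), 0.0))
    return (tr + disc) / 2.0, (tr - disc) / 2.0

s_t = [[6.0, 2.0], [2.0, 3.0]]   # hypothetical treatment matrix
s_w = [[2.0, 0.0], [0.0, 1.0]]   # hypothetical error matrix
lam1, lam2 = roots_of_st_sw_inverse(s_t, s_w)

s_sum = [[s_w[i][j] + s_t[i][j] for j in range(2)] for i in range(2)]
wilks = det2(s_w) / det2(s_sum)   # determinant-ratio criterion
roy = lam1                        # largest-root criterion
lawley = lam1 + lam2              # sum of roots, the trace of S_T S_W^{-1}
```

The determinant ratio equals 1/[(1 + λ1)(1 + λ2)], a function of the product of the quantities 1 + λi, while the other two criteria use only the largest root or the sum of the roots. With a single root (p = 1) all three are monotone functions of that root, which is why the tests coincide in the univariate case.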

Pillai (1964; 1965) has tables and references on the distribution of the largest root. A paper by Smith, Gnanadesikan, and Hughes (1962) is recommended as an elementary expository summary with a realistic example.

Other procedures

Other, more specialized statistical procedures have been developed for means, variances, and covariances for multivariate normal populations, particularly tests of special hypotheses.

Many models based on the univariate normal distribution may be regarded as special cases of multivariate normal models. In particular, it is often assumed that observations are independent in probability and have homogeneous variances, $\sigma^2$. A test of such assumptions may sometimes be made if the sample is regarded as N observation vectors from a p-variate multivariate normal population with special dispersion matrix under a null hypothesis $H_{05}: \Sigma = \sigma^2 I$, where $I$ is the $p \times p$ identity matrix and $\sigma^2$ is the unknown common variance. This test and a generalization of it are discussed by Anderson (1958, sec. 10.7). See also Wilks (1962, problem 18.21).

Wilks (1946; 1962, problem 18.22) developed a series of tests on means, variances, and covariances for multivariate normal populations. He considered three hypotheses,

$$H_{06}: \mu_1 = \dots = \mu_p,\ \sigma_{ii} = \sigma^2,\ \sigma_{ij} = \rho\sigma^2\ (i \ne j);$$
$$H_{07}: \sigma_{ii} = \sigma^2,\ \sigma_{ij} = \rho\sigma^2\ (i \ne j);$$
$$H_{08}: \mu_1 = \dots = \mu_p,\ \text{given } \sigma_{ii} = \sigma^2,\ \sigma_{ij} = \rho\sigma^2\ (i \ne j).$$

$H_{06}$ implies equality of means, equality of variances, and equality of covariances; $H_{07}$ makes no assumption about the means but implies equality of variances and equality of covariances; $H_{08}$ is a hypothesis about equality of means given the special dispersion matrix, $\Sigma$, specified through equality of its diagonal elements and equality of its nondiagonal elements. In these hypotheses $\rho$ is the intraclass correlation, which has been considered in various contexts by other authors [see Multivariate Analysis, article on Correlation (1)]. Wilks showed that the test of $H_{08}$ leads to the usual, univariate, analysis-of-variance test for treatments in a two-way classification. For $H_{06}$ and $H_{07}$, likelihood ratio tests were devised and moments of the test statistics were obtained, with exact distributions in special cases and asymptotic ones otherwise.

Other topics of multivariate analysis

This general discussion of multivariate analysis would not be complete without mention of basic concepts of other major topics discussed elsewhere in this encyclopedia.

Discriminant functions

Classification problems are encountered in many contexts [see Multivariate Analysis, article on Classification and Discrimination]. Several populations are known to exist, and information on their characteristics is available, perhaps from samples of individuals or items identified with the populations. A particular individual or item of unknown population is to be classified into one of the several populations on the basis of its particular characteristics. This and related problems were considered by early workers in the field and more recently in the context of statistical decision theory, which seems particularly appropriate for this subject [see Decision Theory].


Correlation

The simple product-moment correlation coefficient between variates $X_i$ and $X_j$ was defined above as $\rho_{ij}$, with similarly defined sample correlation, $r_{ij}$ [see Multivariate Analysis, articles on Correlation]. In the bivariate case (p = 2) the exact small-sample distributions of $r_{12}$ based on the bivariate normal model were developed by Fisher and Hotelling. The multiple correlation between $X_1$, say, and the set $X_2, \dots, X_p$ may be defined as the maximum simple correlation between $X_1$ and a linear function $\beta_2 X_2 + \dots + \beta_p X_p$, maximized through choice of $\beta_2, \dots, \beta_p$.

Partial correlations have been developed as correlations in conditional distributions.

Canonical correlations extend the notion of multiple correlation to two groups of variates. If the variate vector, $X$, is subdivided so that

$$X = \begin{pmatrix} X^{(s)} \\ X^{(t)} \end{pmatrix},$$

$X^{(s)}$ being the column vector with elements $X_1, \dots, X_s$ and $X^{(t)}$ being the column vector with elements $X_{s+1}, \dots, X_p$ ($p = s + t$), the largest canonical correlation is the maximum simple correlation between two linear functions, $Y_1^{(s)}$ of the elements of $X^{(s)}$ and $Y_1^{(t)}$ of the elements of $X^{(t)}$. The second largest canonical correlation is the maximum simple correlation between two new linear functions, $Y_2^{(s)}$ and $Y_2^{(t)}$, similar to $Y_1^{(s)}$ and $Y_1^{(t)}$ but uncorrelated with $Y_1^{(s)}$ and $Y_1^{(t)}$, and so on. Distribution theory and related problems are given by Anderson (1958, chapter 12) and Wilks (1962, sec. 18.9).

The theory of rank correlation is well developed in the bivariate case [see Nonparametric Statistics, article on Ranking Methods]. Tetrachoric and biserial correlation coefficients have been considered for special situations.

Principal components

The problem of principal components and factor analysis is a problem in the reduction of the number of variates and in their interpretation [see Factor Analysis]. The method of principal components considers uncorrelated linear functions of the p original variates with a view to expressing major characteristic variation in terms of a reduced set of new variates. Hotelling has been responsible for much of the development of principal components, and the somewhat parallel treatments of factor analysis have been developed more by psychometricians than by statisticians. References for principal components are Anderson (1958, chapter 11), Wilks (1962, sec. 18.6), and Kendall ([1957] 1961, chapter 2). Kendall ([1957] 1961, chapter 3) gives an expository account of factor analysis.

Counted data

The multinomial distribution plays an important role in analysis when multivariate data consist of counts of the number of individuals or items in a sample that have specified categorical characteristics. The multivariate analysis of counted data follows consideration of contingency tables and relationships between the probability parameters of the multinomial distribution. Much has been done on tests of independence in such tables, and recently investigators have developed more systematically analogues of standard multivariate techniques for contingency tables [see Counted Data].

Nonparametric statistics

There has been a paucity of multivariate techniques in nonparametric statistics. Except for work on rank correlation, only a few isolated multivariate methods have been developed, for example, bivariate sign tests. The difficulty appears to be that adequate models for multivariate nonparametric methods must contain measures of association (or of nonindependence) that sharply limit the application of the permutation techniques of nonparametric statistics [see Nonparametric Statistics].

Missing values

Only limited results are available in multivariate analysis when some observations are missing from observation vectors. Wilks (1932b), in considering the bivariate normal distribution with missing observations, provided several methods of parameter estimation and compared them. Maximum likelihood estimation was somewhat complicated, but two ad hoc methods proved simpler and yielded exact forms of sampling distributions. Basically, one may obtain estimates of means and variances through weighted averages of means and variances of the available data and estimate correlations from the available data on pairs of variates. If only a few observations are missing, usual analyses should not be much affected; if many observations are missing, little advice may be given except to suggest the use of maximum likelihood techniques and computers for the special situation. It is clearly inappropriate to treat missing observations as zero observations—as has sometimes been done.

Some useful references are Anderson (1957), Buck (1960), Nicholson (1957), and Matthai (1951).
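The ad hoc approach described above (means from the available data on each variate, correlations from the available pairs) can be contrasted with the inappropriate practice of treating missing observations as zeros. The following Python sketch is only an illustration under simple assumptions: missing values are marked with `None`, and the helper names and data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def available_mean(values):
    """Mean of the observations that are present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def pairwise_r(xs, ys):
    """Correlation computed from the pairs in which both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return pearson_r([p[0] for p in pairs], [p[1] for p in pairs])


# Hypothetical data: y = 2x exactly, with one y observation missing.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, None, 8, 10, 12]

r_pairwise = pairwise_r(x, y)
y_zero_filled = [0 if v is None else v for v in y]
r_zero_filled = pearson_r(x, y_zero_filled)
print(available_mean(y), r_pairwise, r_zero_filled)
```

In this example the pairwise estimate recovers the perfect linear relation, while zero filling lowers both the mean of y and the correlation.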

Other multivariate results

In a general discussion of multivariate analysis, it is not possible to consider all areas where multivariate data may arise or all theoretical results of probability and statistics that may be pertinent to multivariate analysis. Many of the theorems of probability admit of multivariate extensions; results in stochastic processes, the theory of games, decision theory, and so on, may have important, although perhaps not implemented, multivariate generalizations.

Ralph A. Bradley


Multivariate analysis is complex in theory, in application, and in interpretation. Basic works should be consulted, and examples of applications in various subject areas should be examined critically. The theory of multivariate analysis is well presented in Anderson 1958; its excellent bibliography and reference notations by section make it a good guide to works in the field. Among books on mathematical statistics, other major works are Rao 1952; Kendall & Stuart (1946) 1961; Roy 1957; Wilks 1946; 1962. Greenwood & Hartley 1962 gives references to tables. T. W. Anderson is completing a bibliography of multivariate analysis. Books more related to the social sciences are Cooley & Lohnes 1962; Talbot & Mulhall 1962. Papers that are largely expository and bibliographical are Tukey 1949; Bartlett 1947; Wishart 1955; Feraud 1942; and Smith, Gnanadesikan, & Hughes 1962. Some applications in the social sciences are given in Tyler 1952; Rao & Slater 1949; Tintner 1946; Kendall 1957.

Anderson, T. W. 1957 Maximum Likelihood Estimates for a Multivariate Normal Distribution When Some Observations Are Missing. Journal of the American Statistical Association 52:200-203.

Anderson, T. W. 1958 An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Bartlett, M. S. 1947 Multivariate Analysis. Journal of the Royal Statistical Society Series B 9 (Supplement): 176-190. → A discussion of Bartlett’s paper appears on pages 190-197.

Buck, S. F. 1960 A Method of Estimation of Missing Values in Multivariate Data Suitable for Use With an Electronic Computer. Journal of the Royal Statistical Society Series B 22:302-306.

Cooley, William W.; and Lohnes, Paul R. 1962 Multivariate Procedures for the Behavioral Sciences. New York: Wiley.

Feraud, L. 1942 Problème d’analyse statistique à plusieurs variables. Lyon, Université de, Annales 3d Series, Section A 5:41-53.

Greenwood, J. Arthur; and Hartley, H. O. 1962 Guide to Tables in Mathematical Statistics. Princeton Univ. Press. → A sequel to the guides to mathematical tables produced by and for the Committee on Mathematical Tables and Aids to Computation of the National Academy of Sciences–National Research Council of the United States.

Heck, D. L. 1960 Charts of Some Upper Percentage Points of the Distribution of the Largest Characteristic Root. Annals of Mathematical Statistics 31:625-642.

Jackson, J. Edward; and Bradley, Ralph A. 1961 Sequential χ²- and T²-tests. Annals of Mathematical Statistics 32:1063-1077.

Kendall, M. G. (1957) 1961 A Course in Multivariate Analysis. London: Griffin.

Kendall, M. G.; and Stuart, Alan (1946) 1961 The Advanced Theory of Statistics. Rev. ed. Volume 2: Inference and Relationship. New York: Hafner; London: Griffin. → The first edition was written by Kendall alone.

Kudo, Akio 1963 A Multivariate Analogue of the One-sided Test. Biometrika 50:403-418.

Lawley, D. N. 1938 Generalization of Fisher’s z Test. Biometrika 30:180-187.

Matthai, Abraham 1951 Estimation of Parameters From Incomplete Data With Application to Design of Sample Surveys. Sankhyā 11:145-152.

Morrison, Donald F. 1967 Multivariate Statistical Methods. New York: McGraw-Hill. → Written for investigators in the life and behavioral sciences.

Nicholson, George E. Jr. 1957 Estimation of Parameters From Incomplete Multivariate Samples. Journal of the American Statistical Association 52:523-526.

Nüesch, Peter E. 1966 On the Problem of Testing Location in Multivariate Populations for Restricted Alternatives. Annals of Mathematical Statistics 37:113-119.

Pillai, K. C. Sreedharan 1964 On the Distribution of the Largest of Seven Roots of a Matrix in Multivariate Analysis. Biometrika 51:270-275.

Pillai, K. C. Sreedharan 1965 On the Distribution of the Largest Characteristic Root of a Matrix in Multivariate Analysis. Biometrika 52:405-414.

Rao, C. Radhakrishna 1952 Advanced Statistical Methods in Biometric Research. New York: Wiley.

Rao, C. Radhakrishna; and Slater, Patrick 1949 Multivariate Analysis Applied to Differences Between Neurotic Groups. British Journal of Psychology Statistical Section 2:17-29. → See also “Correspondence,” page 124.

Roy, S. N. 1953 On a Heuristic Method of Test Construction and Its Use in Multivariate Analysis. Annals of Mathematical Statistics 24:220-238.

Roy, S. N. 1957 Some Aspects of Multivariate Analysis. New York: Wiley.

Sidak, Zbynek 1967 Rectangular Confidence Regions for the Means of Multivariate Normal Distributions. Journal of the American Statistical Association 62: 626-633.

Smith, H.; Gnanadesikan, R.; and Hughes, J. B. 1962 Multivariate Analysis of Variance (MANOVA). Biometrics 18:22-41.

Talbot, P. Amaury; and Mulhall, H. 1962 The Physical Anthropology of Southern Nigeria: A Biometric Study in Statistical Method. Cambridge Univ. Press.

Tintner, Gerhard 1946 Some Applications of Multivariate Analysis to Economic Data. Journal of the American Statistical Association 41:472-500.

Tukey, John W. 1949 Dyadic ANOVA: An Analysis of Variance for Vectors. Human Biology 21:65-110.

Tyler, Fred T. 1952 Some Examples of Multivariate Analysis in Educational and Psychological Research. Psychometrika 17:289-296.

Wilks, S. S. 1932a Certain Generalizations in the Analysis of Variance. Biometrika 24:471-494.

Wilks, S. S. 1932b Moments and Distributions of Estimates of Population Parameters From Fragmentary Samples. Annals of Mathematical Statistics 3:163-195.

Wilks, S. S. 1935 On the Independence of k Sets of Normally Distributed Statistical Variables. Econometrica 3:309-326.

Wilks, S. S. 1946 Sample Criteria for Testing Equality of Means, Equality of Variances, and Equality of Covariances in a Normal Multivariate Distribution. Annals of Mathematical Statistics 17:257-281.

Wilks, S. S. 1962 Mathematical Statistics. New York: Wiley. → An earlier version of some of this material was issued in 1943.

Wishart, John 1955 Multivariate Analysis. Applied Statistics 4:103-116.


Correlation (1) is a general overview of the topic; Correlation (2) goes into more detail about certain aspects.

The term “correlation” has been used in a variety of contexts to indicate the degree of interrelation between two or more entities. One reads, for example, of the correlation between intelligence and wealth, between illiteracy and prejudice, and so on. When used in this sense the term is not sufficiently operational for scientific work. One must instead speak of correlation between numerical measures of entities—in short, of correlation between variables.

If statistical inference is to be used, the variables must be random variables, and for them a probability model must be specified. For two random variables, X and Y, this model will describe the probabilities (or probability densities) with which (X, Y) takes values (x, y); that is, it will describe probabilities in the (X, Y)-population. One of the characteristics of this population is the correlation coefficient; the available information concerning it is usually in the form of a random sample, (X1, Y1), … , (Xn, Yn). Thus, correlation theory is concerned with the use of samples to estimate, test hypotheses, or carry out other procedures concerning population correlations.

Surprisingly enough, confusion occasionally sets in, even at this early stage. There are deplorable examples in the literature in which the authors of a study are concerned with whether a certain sample coefficient of correlation can be computed instead of with whether it will be useful to compute it in the light of the research goal and of some special model.

The so-called Pearson product-moment correlation coefficient, usually denoted by $\rho$ in the population and $r$ in the sample, and usually termed just the correlation coefficient, is the one most frequently encountered, and the purpose of this article is to survey the situations in which it is employed. Other sorts of correlation include rank correlation, serial correlation, and intraclass correlation. [For a discussion of rank correlation, see Nonparametric statistics, article on Ranking methods; for serial correlation, see Time series. Intraclass correlation will be touched on briefly at the end of this article.]

First, simple correlation between X and Y will be considered, then multiple correlation between a single variable, X0, and a set of variables, (X1, … ,Xp), and finally canonical correlation between two sets, (Y1, … ,Yk) and (X1, … ,Xp). Partial correlation will be discussed in connection with multiple correlation. The case of two variables is a sufficient setting in which to discuss relationships with regression theory and to point out common errors made in applying correlation methods.

The two most important models for correlation theory are the linear regression model, discussed below (see also Binder 1959), and the joint normal model. The joint normal model plays a central role in the theory for several reasons. First, the conditions for its approximate validity are frequently met. Second, it is mathematically tractable. Finally, of those joint probability laws for which $\rho$ is actually a measure of independence, the joint normal model is perhaps the simplest to deal with. For any two random variables, X and Y, it follows from the definition of $\rho$ that if the variables are independent, they are uncorrelated; hence, to conclude that the hypothesis of zero correlation is false is to assert dependence for X and Y. In the other direction, if X, Y follow a bivariate normal law and are uncorrelated, then X and Y are independent, but this conclusion does not hold in general; the assumption of normality (or some other, similar restriction) is essential. It is even possible that X and Y are uncorrelated and also perfectly related by a (nonlinear) function. If the probability law for X, Y is only approximately bivariate normal, conventional normal theory can still be applied; in fact, considerable departure from normality may be tolerated (Gayen 1951). For large samples, $r$ itself is in any case approximately normal with mean $\rho$ and with a standard deviation that can be derived if enough is known about the joint probability distribution of X, Y.

Many misconceptions prevail about the interpretation of correlation. These stem in part from the fact that early work in the field reflected confusion about the distinction between sample estimators and their population counterparts. For some time workers were also under the impression that high correlation implies the existence of a cause-and-effect relation when in fact neither correlation, regression, nor any other purely statistical procedure would validate such a relation.

Historically, research in the theory of correlation may be divided into four phases. In the latter part of the nineteenth century Galton and others realized the value of correlation in their work but could deal with it only in a vague, descriptive way [see Galton]. About the turn of the century Karl Pearson, Edgeworth, and Yule developed some real theory and systematized the use of correlation [see Edgeworth; Pearson; Yule]. From about 1915 to 1928, R. A. Fisher placed the theory of correlation on a more or less rigorous footing by deriving exact probability laws and methods of estimation and testing [see Fisher, R. A.]. Finally, in the 1930s first Hotelling and then Wilks, M. G. Kendall, and others, spurred on by psychologists, particularly Spearman and Thurstone, developed principal component analysis (closely related to factor analysis) and canonical correlation. [For a discussion of principal component analysis, see Factor analysis; for a discussion of canonical correlation, see Multivariate analysis, article on Correlation (2), and the section “Canonical correlation” below. See also the biographies of Spearman and Thurstone.] Along with the mathematical development there occurred an increasing realization among social scientists of the value of mathematics in their work. This produced better communication between them and statistical theorists and also led them to discard the older, and often incorrect, treatments of correlation on which they had relied.

Correlation theory is now recognized as an important tool in experimentation, especially in those situations involving many variables. Its main value is in suggesting lines along which further research can be directed in a search for possible cause-and-effect relations in complex situations [see Causation].

In every field of application there are books describing correlation methods and, just as important, acquainting the reader with the types of data he will handle. Some examples are the works of McNemar (1949) in psychology, Croxton, Cowden, and Klein (1939) in economics and sociology, and Johnson (1949) in education. Mathematical treatments on several levels are also available. An excellent elementary work is the book by Wallis and Roberts (1956), which requires very little knowledge of mathematics, yet presents statistical concepts carefully and fully. Those equipped with more mathematics should find the books of Anderson and Bancroft (1952) and Yule and Kendall (1958), at an intermediate level, and Kendall and Stuart (1958-1966), at an advanced level, quite useful.

In the mathematical study of correlations between several variables the natural language is that of classical matrix theory; some knowledge of matrices, linear transformations, quadratic forms, and determinantal equations is required. This expository presentation, however, will not require background in these topics.

Simple correlation

For two jointly distributed random variables, X and Y, denote their population standard deviations by $\sigma_X$ and $\sigma_Y$ and their population covariance by $\sigma_{XY}$. The correlation coefficient is then defined as $\rho_{XY} = \sigma_{XY}/(\sigma_X\sigma_Y)$. (Both standard deviations are positive, except for the uninteresting case in which one or both variables are constant; then the correlation coefficient is undefined.) As Feller has remarked, this definition would lead a physicist to regard $\rho_{XY}$ as “dimensionless covariance.”

Elementary properties of $\rho_{XY}$ are that it lies between −1 and +1, that it is unchanged if constants are added to X and Y or if X and Y are multiplied by nonzero constants of the same sign (if the signs are different, the sign of $\rho_{XY}$ will be changed), and that it takes one of its extreme values only if a perfect linear relation, Y = a + bX, exists (−1 for b < 0, +1 for b > 0). Also, since the variance of a linear combination is frequently needed, one should note the important relation $\operatorname{var}(aX + bY) = a^2\sigma_X^2 + b^2\sigma_Y^2 + 2ab\,\rho_{XY}\sigma_X\sigma_Y$ and in particular that variances are additive in the presence of zero correlation.

The usual estimator for $\rho_{XY}$, based on a random sample, $(X_1, Y_1), \dots, (X_n, Y_n)$, is

$$r_{XY} = \frac{\sum_{i=1}^{n}(X_i-\bar X)(Y_i-\bar Y)}{\sqrt{\sum_{i=1}^{n}(X_i-\bar X)^2\,\sum_{i=1}^{n}(Y_i-\bar Y)^2}} = \frac{s_{XY}}{s_X s_Y},$$

where $\bar X$, $\bar Y$ are the sample means and $s_X$, $s_Y$, $s_{XY}$ are the sample standard deviations and covariance. Regardless of the specific model adopted, $r_{XY}$ can be used to estimate $\rho_{XY}$ and will have some desirable properties: $r_{XY}$ lies between −1 and +1 and has approximately $\rho_{XY}$ for its population mean; $r_{XY}$ is a consistent estimator of $\rho_{XY}$, that is, if the sample size is increased indefinitely, $\Pr(|r_{XY} - \rho_{XY}| < \epsilon)$ approaches 1, no matter how small a positive constant $\epsilon$ is chosen.
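As a concrete illustration of the estimator and of the elementary invariance properties noted earlier, here is a minimal Python sketch using only the standard library. The function name `pearson_r` and the data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Sample product-moment correlation coefficient r_XY."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample of five (X, Y) pairs.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson_r(x, y)

# Adding constants or multiplying by a positive constant leaves r unchanged;
# multiplying one variable by a negative constant reverses its sign.
r_shifted = pearson_r([2 * v + 3 for v in x], y)
r_flipped = pearson_r([-v for v in x], y)
print(r, r_shifted, r_flipped)
```

The divisor used for the sample variances and covariance (n or n - 1) cancels in the ratio, so the sketch works directly with sums of squares and products.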

Normal model

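If the joint probability law is bivariate normal, that is, probability is interpreted as volume under the surface

$$f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho_{XY}^2}}\exp\left\{-\frac{1}{2(1-\rho_{XY}^2)}\left[\frac{(x-\mu_X)^2}{\sigma_X^2} - 2\rho_{XY}\frac{(x-\mu_X)(y-\mu_Y)}{\sigma_X\sigma_Y} + \frac{(y-\mu_Y)^2}{\sigma_Y^2}\right]\right\},$$

then $f(x,y)$ factors into an expression in x times an expression in y (the condition defining independence of X, Y) if and only if $\rho_{XY} = 0$.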

Under normality, $r_{XY}$ is the maximum likelihood estimator of $\rho_{XY}$. Further, the probability law of $r_{XY}$ has been derived (Fisher 1915) and tabulated (David 1938). The statistic $t = r_{XY}\sqrt{n-2}/\sqrt{1-r_{XY}^2}$ can be referred to the t-table with n − 2 degrees of freedom to test the hypothesis H: $\rho_{XY} = 0$. In addition, charts (David 1938), which have been reproduced in many books, are available for the determination of confidence intervals. The variable $z = \tanh^{-1} r_{XY}$ is known (Fisher 1925, pp. 197 ff.) to have an approximate normal law with mean $\tanh^{-1}\rho_{XY}$ and standard deviation $1/\sqrt{n-3}$ even for n as small as 10; thus, the z-transformation is especially useful, for example, in testing whether two (X,Y)-populations have the same correlation. Also, it has the advantage of stabilizing variances, that is, the approximate variance of z depends on n but not on $\rho_{XY}$. The quantity $r_{XY}$ itself, though approximately normal with mean $\rho_{XY}$ and standard deviation $(1-\rho_{XY}^2)/\sqrt{n}$ for very large n, will still be far from normal for moderate n when $\rho_{XY}$ is not near zero.
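The procedures in this paragraph can be sketched in Python. The sketch assumes the standard forms t = r√(n − 2)/√(1 − r²) for the test of zero correlation and 1/√(n − 3) for the approximate standard deviation of z = tanh⁻¹ r; the function names are hypothetical, and the interval is an approximation rather than the exact chart-based interval.

```python
import math
from statistics import NormalDist


def t_statistic(r, n):
    """t statistic, on n - 2 degrees of freedom, for testing rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def fisher_z_interval(r, n, level=0.95):
    """Approximate confidence interval for rho via the z-transformation."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


def compare_correlations(r1, n1, r2, n2):
    """Approximate standard normal deviate for testing equal correlations."""
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / se


t = t_statistic(0.6, 27)
low, high = fisher_z_interval(0.5, 28)
deviate = compare_correlations(0.5, 28, 0.3, 103)
print(t, (low, high), deviate)
```

Because the variance of z does not depend on the population correlation, the interval is symmetric on the z scale and becomes asymmetric about r only after transforming back.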

Even in the bivariate normal case currently under discussion, $r_{XY}$ does not have population mean exactly $\rho_{XY}$, but the slight discrepancy can be greatly reduced (Olkin & Pratt 1958) by using instead $r_{XY}[1 + (1 - r_{XY}^2)/(2(n-4))]$ as an estimator for $\rho_{XY}$.

Biserial and point-biserial correlations. If one variable, say Y, is dichotomized at some unknown point $w$, then the data from the (X,Y)-population appear in the form of a sample from an (X,Z)-population, with Z one or zero according as $Y \ge w$ or $Y < w$. If $\rho_{XY}$ and $w$ are of interest, they can be estimated by $r_b$ (biserial r) and $w_b$, or by maximum likelihood estimators $\hat\rho_{XY}$ and $\hat w$ (Tate 1955a; 1955b). The latter estimators are jointly normal for large n, and tables of standard deviations are available (Prince & Tate 1966). If $\rho_{XZ}$ is desired, it can be estimated by $r_{XZ}$, usually called point-biserial r. If, however, the assumption of underlying bivariate normality is correct, then $\rho_{XZ}$ is smaller in absolute value than $\rho_{XY}$, so $r_{XZ}$ would be a bad estimator of $\rho_{XY}$. If one thinks in terms of models rather than data, there is no need for confusion on this point. Tate (1955b) gives an expository discussion of both models.
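A small simulation may make the distinction between the two models concrete. The Python sketch below assumes standardized bivariate normal variables with a population correlation of 0.6 and a dichotomy at w = 0; for that case the standard result is ρ_XZ = ρ_XY φ(0)/√(0.5 × 0.5) ≈ 0.80 ρ_XY, where φ is the standard normal density. The seed, sample size, and names are arbitrary choices for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)
rho, n = 0.6, 20000
xs, ys = [], []
for _ in range(n):
    g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
    xs.append(g1)
    # Y has unit variance and correlation rho with X.
    ys.append(rho * g1 + math.sqrt(1 - rho ** 2) * g2)

# Dichotomize Y at w = 0: Z is one if Y >= w and zero otherwise.
zs = [1 if y >= 0 else 0 for y in ys]

r_xy = pearson_r(xs, ys)
r_xz = pearson_r(xs, zs)  # point-biserial r
expected_r_xz = rho * (1 / math.sqrt(2 * math.pi)) / 0.5
print(r_xy, r_xz, expected_r_xz)
```

The point-biserial coefficient settles near its own population value, which is noticeably below the correlation of the underlying normal variables.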

Tetrachoric correlation. If in the bivariate normal case both X and Y are observable only in dichotomized form, the sample values can be arranged in a 2 x 2 table, and one can calculate rt, the so-called tetrachoric r. Unfortunately, the tetrachoric model is not amenable to the same type of simple mathematical treatment as is the biserial model. On this point the reader should consult Kendall and Stuart (1958-1966).

Relation of correlation to regression

The notion of regression is appropriate in a situation in which one needs to predict Y, or to estimate the conditional population mean of Y, for given X [see Linear hypotheses, article on Regression]. The discussion given here will be sufficiently general to bring out the meanings of the correlation coefficient and the correlation ratio in regression analysis and to indicate connections between them. The reader should keep two facts in mind: (1) predictions are described by regression relations, whereas their accuracy is measured by correlation, and (2) assumptions of bivariate normality are not required in order to introduce the notion of regression and to carry its development quite far.

A prediction of Y, $\phi(X)$, is judged “best” (in the sense of least squares), quite apart from assumptions of normality, if it makes the expected mean-square error, $E(Y - \phi(X))^2$, a minimum. It turns out that for best prediction, $\phi(X)$ must be $\mu_{Y|X}$, the mean of the conditional probability law for Y given X (also often referred to as the regression function), but that if only straight lines are allowed as candidates, the “best” such gives the prediction $A + BX$, with $A = \mu_Y - \mu_X(\rho_{XY}\sigma_Y/\sigma_X)$, $B = \rho_{XY}\sigma_Y/\sigma_X$. The basic quantities of interest, $\mu_{Y|X}$ and $A + BX$, lead to the following decomposition of $Y - \mu_Y$:

$$Y - \mu_Y = (Y - \mu_{Y|X}) + (A + BX - \mu_Y) + (\mu_{Y|X} - A - BX).$$

It can be shown that the right-hand terms are uncorrelated and that, therefore, by squaring and taking expected values they satisfy the basic relation

$$\sigma_Y^2 = E(Y - \mu_{Y|X})^2 + E(A + BX - \mu_Y)^2 + E(\mu_{Y|X} - A - BX)^2.$$

These terms may be conveniently interpreted as portions of the variation of Y: the first is the variation “unexplained” by X, and the sum of the second and third is the variation explained by the “best” prediction, the second term being the amount explained by the “best” linear prediction. The quantity

$$\eta_{XY}^2 = 1 - \frac{E(Y - \mu_{Y|X})^2}{\sigma_Y^2},$$

the squared correlation ratio for Y on X, is the proportion of variation in Y “explained” by X (that is, by regression). Since it can be shown that $E(A + BX - \mu_Y)^2 = \rho_{XY}^2\sigma_Y^2$, the basic relation may be rewritten as

$$1 = (1 - \eta_{XY}^2) + \rho_{XY}^2 + (\eta_{XY}^2 - \rho_{XY}^2).$$

If the regression is linear, then $\mu_{Y|X} = A + BX$, the third term drops out of the decomposition of $Y - \mu_Y$, and $\eta_{XY}^2$, the proportion of “explained” variation, coincides with $\rho_{XY}^2$. If in addition $\sigma_{Y|X}^2$, the variance of the conditional law for Y given X, is constant, then this variance coincides with $E(Y - \mu_{Y|X})^2$, and

$$\sigma_{Y|X}^2 = \sigma_Y^2(1 - \rho_{XY}^2).$$

When X, Y follow a bivariate normal law, both conditions are met, and hence this last relation is satisfied. In any event one can see from the basic relation, and conditions of nonnegativity for mean squares, that $0 \le \rho_{XY}^2 \le \eta_{XY}^2 \le 1$, with $\rho_{XY}^2 = \eta_{XY}^2$ if and only if the regression is linear; when that linearity of regression holds, both quantities equal zero if and only if the regression is actually constant, and both quantities equal one if and only if the point (X,Y) must always lie on a straight line. It should be noted that in general $\eta_{XY}^2 \ne \eta_{YX}^2$, whereas $\rho_{XY}$ is symmetric: $\rho_{XY} = \rho_{YX}$. Traditional terms, now rarely used, are “coefficient of determination” for $\rho_{XY}^2$, “coefficient of nondetermination” for $1 - \rho_{XY}^2$, and “coefficient of alienation” for $\sqrt{1 - \rho_{XY}^2}$.
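A sample analogue of this decomposition can be computed when several Y observations are available at each value of X. The Python sketch below assumes the usual sample versions (within-array sum of squares for the unexplained part, squared sample correlation for the linear part); the function name and the data, chosen so that the relation is strong but purely nonlinear, are hypothetical.

```python
import math


def variation_shares(groups):
    """Return (unexplained, linear, nonlinear) shares of the variation in Y.

    groups maps each x value to the list of y values observed at that x.
    """
    xs = [x for x, group in groups.items() for _ in group]
    ys = [y for group in groups.values() for y in group]
    n = len(ys)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_total = sum((y - my) ** 2 for y in ys)
    # Variation about the array means: the part "unexplained" by X.
    ss_within = sum(
        (y - sum(group) / len(group)) ** 2
        for group in groups.values()
        for y in group
    )
    eta2 = 1 - ss_within / ss_total   # squared correlation ratio
    rho2 = sxy ** 2 / (sxx * ss_total)  # squared correlation coefficient
    return 1 - eta2, rho2, eta2 - rho2


# Hypothetical arrays of Y values at three X values, roughly Y = X squared.
data = {-1: [0.9, 1.1], 0: [-0.1, 0.1], 1: [0.9, 1.1]}
unexplained, linear, nonlinear = variation_shares(data)
print(unexplained, linear, nonlinear)
```

Here the linear share is essentially zero while the correlation ratio is close to one, the situation in which a correlation coefficient alone would badly understate the degree of relationship.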

The use of data to predict Y from X, by fitting a sample regression curve, evidently involves two types of error, the error in estimating the true regression curve by a sample curve and the inherent sampling variability of Y (which cannot be reduced by statistical analysis) about the true regression curve. (The reader may consult Kruskal 1958 for a concise summary of the above material, together with further interpretive remarks, and Tate 1966 for an extension of these ideas to the case of three or more variables and the consequent consideration of generalized variances.)

It cannot be too strongly emphasized that the correlation coefficient is a measure of the degree of linear relationship. It is frequently the case that for variables Y and X, the regression of Y on X is linear, or at least approximately linear, for those values of X which are of interest or are likely to be encountered. For a given set of data one can test the hypothesis of linearity of regression (see Dixon & Massey [1951] 1957, secs. 11-15). If it is accepted, then the degree of relationship may be measured by a correlation coefficient. If not, then one can in any event measure the degree of relationship by a correlation ratio. In some cases it may be desirable to give two measures (that is, to give estimates of both $\rho_{XY}^2$ and $\eta_{XY}^2 - \rho_{XY}^2$), one for the degree of linear relationship and one for the degree of additional nonlinear relationship. In the case of nonlinear relationship, however, $\mu_{Y|X}$ cannot be estimated satisfactorily for any specific value X = x unless either a whole array of Y observations is available for that x or some specific nonlinear functional form is assumed for $\mu_{Y|X}$. In view of the advantages of using normal theory, it is best whenever possible to make the regression approximately linear by a suitable change of variable and to check the procedure by testing for approximate normality and linearity of regression. [See Statistical Analysis, Special Problems Of, article on Transformations Of Data.]

When X, Y follow a bivariate normal law, one has not only linear regression for Y on X but also normality for the conditional law of Y given X and for the marginal law of X. If the conditions of the bivariate normal model are relaxed in order to allow X to have some type of law other than normal, while the remaining properties just mentioned are present, some interesting results can be obtained. It is known, for example (Tate 1966), that for large n, $r_{XY}$ is approximately normal with mean $\rho_{XY}$ and standard deviation $(1 - \rho_{XY}^2)\sqrt{(1 + \gamma\rho_{XY}^2/4)/n}$, with $\gamma$ denoting the coefficient of excess (kurtosis minus 3) for the X-population. (For a general treatment of aspects of this case, see Gayen 1951.)

It is an important fact that there is value in $r_{xY}$ even if there exists no population counterpart for it. This arises in the following way: Let x be a fixed variable subject to selection by the experimenter, and let Y have a normal law with mean A + Bx and standard deviation $\sigma$. This is called the linear regression model. The usual mathematical theory developed for this model requires that $\sigma$ actually not depend on x, although slight deviations from constancy are not serious. If a definite dependence on x exists, it can sometimes be removed by an appropriate transformation of Y [see Statistical Analysis, Special Problems of, article on Transformations of Data]. Note that the nonrandom character of x here is stressed by use of a lower-case letter.

The quantities A and B may be estimated, as before, by least squares, and the strength of the resulting relationship may be measured by rxy. Distribution theory for rxy is, of course, not the same as in the bivariate normal case, since rxy is only formally the same as rxy. In other words, it is important to take into consideration in any given case whether (X, Y) actually has a bivariate distribution or whether X = x behaves as a parameter, an index for possible Y distributions.

Errors in correlation methods

Three common errors in correlation methods have already been mentioned: focusing attention on the data and ignoring the model, concluding that the presence of correlation implies causation, and assuming that no relation between variables is present if correlation is lacking. The literature contains many confused articles resulting from the first type of error and also many illustrations, some humorous, of what could occur if the second type of error were committed, for example, in connection with the high correlation between the number of children and the number of storks' nests in towns of northwestern Europe (Wallis & Roberts 1956, p. 79). The source of the correlation presumably is some factor such as economic status or size of house. As an artificial but mildly surprising example of the third type of error one should consider the fact that for a standard normal variable X, Y and X are uncorrelated if $Y = X^2$.
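The third type of error can be checked numerically. The following is a minimal sketch, assuming a simulated standard normal sample; the helper `pearson_r`, the seed, and the sample size are illustrative choices and not part of the original discussion.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(0)  # fixed seed, chosen only for reproducibility
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y = [v * v for v in x]  # Y is completely determined by X

r_xy = pearson_r(x, y)                        # close to 0: no linear relationship
r_absx_y = pearson_r([abs(v) for v in x], y)  # large: the dependence is real
print(round(r_xy, 3), round(r_absx_y, 3))
```

The near-zero value of `r_xy` reflects only the absence of a linear relationship; the second correlation shows that Y is in fact strongly dependent on X.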

A different type of error arises when one tries to control some unwanted condition or source of variation by introducing additional variables. If U = X/Z and V = Y/Z, then it is entirely possible that $\rho_{UV}$ will differ greatly from $\rho_{XY}$. For example, $\rho_{XY}$ may be zero but $\rho_{UV}$ very different from zero. The difficulty is clear in this example, but similar difficulties can enter data analysis in insidious ways. Using percentages instead of initial observations can also produce gross misunderstanding. As a very simple example consider U = X/(X + Y) and V = Y/(X + Y) and the fact that $\rho_{UV} = -1$ even if X and Y are independent. Of course, if additional variables, say Z and W, are involved, the magnitude of the correlation between X/(X + Y + Z + W) and Y/(X + Y + Z + W) will not be so great. The adjective usually applied to this type of correlation is “spurious,” though “artificially induced” would be better. A spurious correlation can in certain circumstances be useful; for instance, the idea of so-called part-whole correlation (see McNemar [1949] 1962, chapter 10) deserves consideration in certain situations. If, for example, a test score T is made up of scores on separate questions or subtests, say $T = T_1 + T_2 + \cdots + T_m$, a high correlation $r_{TT_i}$ could not be ascribed wholly to spuriousness. It is altogether possible that $T_i$ would serve as well as T for the purpose at hand.
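Both kinds of artificially induced correlation can be reproduced with independent simulated variables. This is an illustrative sketch only: the uniform distributions on (1, 2), the seed, and the helper `pearson_r` are assumptions made for the example.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # arbitrary seed for reproducibility
n = 5000
x = [rng.uniform(1.0, 2.0) for _ in range(n)]  # X, Y, Z mutually independent
y = [rng.uniform(1.0, 2.0) for _ in range(n)]
z = [rng.uniform(1.0, 2.0) for _ in range(n)]

r_xy = pearson_r(x, y)  # near zero, as expected for independent X and Y

# Dividing both variables by a common Z induces a positive correlation.
r_ratio = pearson_r([a / c for a, c in zip(x, z)],
                    [b / c for b, c in zip(y, z)])

# Shares of a two-part total are perfectly negatively correlated.
r_share = pearson_r([a / (a + b) for a, b in zip(x, y)],
                    [b / (a + b) for a, b in zip(x, y)])

print(round(r_xy, 3), round(r_ratio, 3), round(r_share, 3))
```

Here `r_share` equals –1 up to rounding error because V = 1 – U exactly, whatever the joint law of X and Y.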

Multiple and partial correlation

If more than two variables are observed for each individual, say $X_1, X_2, \ldots, X_p$, there are more possibilities to be considered for correlation relationships: simple correlations, $\rho_{ij}$ ($i \neq j$), multiple correlations between any variable and a set of the others, and partial correlations between any two variables with all or some of the others held fixed. (In this section capital letters for random variables will be omitted in subscripts; only the numerical indexes will be used.)

Multiple correlation

The multiple correlation between $X_0$ and the set $(X_1, \ldots, X_p)$, denoted by $R_{0\cdot 12\cdots p}$, is defined to be the largest simple correlation obtainable between $X_0$ and $a_1X_1 + \cdots + a_pX_p$, where the coefficients $a_i$ are allowed to vary. It possesses the following properties: $R_{0\cdot 12\cdots p}$ is non-negative and is at least as large as the absolute value of any simple correlation $\rho_{0j}$; if additional variables, $X_{p+1}, X_{p+2}, \ldots$, are included, the multiple correlation cannot decrease. It thus follows that if $R_{0\cdot 12\cdots p} = 0$, all $\rho_{0j}$ are zero. Also, if $R_{0\cdot 12\cdots p} = 1$, then a perfect linear relationship, $X_0 = a_0 + a_1X_1 + \cdots + a_pX_p$, exists for some $a_0, a_1, \ldots, a_p$. The usual estimator of $R_{0\cdot 12\cdots p}$, based on a random sample of vector observations on $(X_0, X_1, \ldots, X_p)$, is the sample correlation $r_{0\cdot 12\cdots p}$ between $X_0$ and its least squares prediction based on $X_1, \ldots, X_p$. Under the joint normal model, $H\colon R_{0\cdot 12\cdots p} = 0$ can be tested by referring $(n - p - 1)\,r^2_{0\cdot 12\cdots p}/[p\,(1 - r^2_{0\cdot 12\cdots p})]$ to an F-table with p and n – p – 1 degrees of freedom (Fisher 1928). Also, $r_{0\cdot 12\cdots p}$, like $r_{XY}$, is approximately normal for large n with mean $R_{0\cdot 12\cdots p}$ and standard deviation $(1 - R^2_{0\cdot 12\cdots p})/\sqrt{n}$, provided $R_{0\cdot 12\cdots p} \neq 0$; if $R_{0\cdot 12\cdots p} = 0$, then $n\,r^2_{0\cdot 12\cdots p}$ has approximately a chi-square law with p degrees of freedom. Fisher’s z-transformation applies as before, except when $R_{0\cdot 12\cdots p}$ is zero (Hotelling 1953). Note that $R_{0\cdot 12\cdots p}$ and $r_{0\cdot 12\cdots p}$ do not reduce to simple correlations if p = 1. Instead, one finds that $R_{0\cdot 1} = |\rho_{01}|$ and $r_{0\cdot 1} = |r_{01}|$.

Regression relationships, in which $X_0$ is predicted by $X_1, \ldots, X_p$, are analogous to those for simple correlation; for example, when regression is linear and conditional variances are constant, $R^2_{0\cdot 12\cdots p}$ is the portion of $\sigma_0^2$ which is “explained” by regression, namely $R^2_{0\cdot 12\cdots p} = 1 - \sigma^2_{0\cdot 12\cdots p}/\sigma_0^2$, with $\sigma^2_{0\cdot 12\cdots p}$ denoting the expected mean-square difference between $X_0$ and its best prediction based on $X_1, X_2, \ldots, X_p$: $\mu_{0\cdot 12\cdots p} = B_0 + B_1X_1 + \cdots + B_pX_p$. Calculations are more difficult but follow the same principles. The coefficients, $B_1, B_2, \ldots, B_p$, are traditionally known as partial regression coefficients but are now usually termed just regression coefficients, as in the bivariate case. Each coefficient gives the change in $\mu_{0\cdot 12\cdots p}$ per unit change in the variable associated with that coefficient. It is clear that the relative importance of the contributions of the separate independent variables $(X_1, X_2, \ldots, X_p)$ cannot be measured by relative sizes of the coefficients, since the independent variables need not be measured in the same units. In chapters 12 and 13 of their book Yule and Kendall (1958) give many examples, along with interpretation and practical advice.

Statements made above in reference to simple correlation of biserial data carry over to multiple correlation (Hannan & Tate 1965), and the same tables (Prince & Tate 1966) are applicable.

Partial correlation

The coefficient of partial correlation, $\rho_{01\cdot 2}$, is, roughly speaking, what $\rho_{01}$ would be if the linear effect of $X_2$ were removed. One can measure “$X_0$ with the linear effect of $X_2$ removed” and “$X_1$ with the linear effect of $X_2$ removed” by subtracting “best” linear predictions, $A_0 + A_2X_2$ and $A'_0 + A'_2X_2$, and obtaining the residuals, $X_0 - A_0 - A_2X_2$ and $X_1 - A'_0 - A'_2X_2$. Then $\rho_{01\cdot 2}$ is defined to be the simple correlation between these two residuals. In the same way, the effect of more than one additional variable can be removed, and one may consider $\rho_{01\cdot 23\cdots p}$. Partials between any two other variables, with any of the remaining p – 1 “held fixed,” are similarly defined by rearrangement of subscripts. For the case of three variables $\rho_{01\cdot 2}$ can be expressed in terms of simple correlations as

$$\rho_{01\cdot 2} = \frac{\rho_{01} - \rho_{02}\,\rho_{12}}{\sqrt{(1 - \rho_{02}^2)(1 - \rho_{12}^2)}}.$$

Alternatively, if the joint probability law is normal, $\rho_{01\cdot 2}$ can be defined as the simple correlation between $X_0$ and $X_1$ calculated from the conditional law for $X_0$ and $X_1$ given $X_2$, but this is not true in general. Also, since $\rho_{01\cdot 2}$ is the ordinary correlation between the residuals defined above, $\rho^2_{01\cdot 2}$ may be characterized in terms of the unexplained variance in one residual after linear prediction from the other, namely

$$\rho^2_{01\cdot 2} = 1 - \frac{\sigma^2_{0\cdot 12}}{\sigma^2_{0\cdot 2}}.$$

To see an important relation between multiple and partial correlation, think of the variables $X_1, X_2, \ldots, X_p$ as being introduced one at a time and producing increases in multiple correlation with $X_0$. Then

$$1 - R^2_{0\cdot 12\cdots p} = (1 - \rho^2_{01})(1 - \rho^2_{02\cdot 1})\cdots(1 - \rho^2_{0p\cdot 12\cdots(p-1)}).$$

From this it follows that

$$1 - R^2_{0\cdot 12\cdots p} = (1 - R^2_{0\cdot 12\cdots(p-1)})(1 - \rho^2_{0p\cdot 12\cdots(p-1)}),$$

which yields a recursion relation that allows for the correction of a multiple correlation when a variable is added or subtracted. Elaborate and useful computational schemes are available for adding and subtracting variables in correlation analysis. One viewpoint (see Ezekiel and Fox [1930] 1959, appendix 2) is that one should generally start with the largest feasible number of independent variables and then subtract one at a time those that are negligibly useful in predicting $X_0$. Other approaches begin with the best single predictor among $X_1, X_2, \ldots, X_p$ and then add others one at a time until further additions make no substantial improvement.

Many expressions and statements analogous to the above relationships can of course be obtained by rearrangement of subscripts, including those which employ only some of the p + 1 variables. Since all parameters involved in this discussion are actually only simple correlations between appropriate pairs of random variables, one can construct estimators by calculating the corresponding sample simple correlations. Thus, for example, $r_{01\cdot 2}$ is calculable from the observation pairs, $(X_{0i} - \hat{A}_0 - \hat{A}_2X_{2i},\; X_{1i} - \hat{A}'_0 - \hat{A}'_2X_{2i})$, of sample residuals.
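The residual construction can be compared with the three-variable expression in simple correlations, $r_{01\cdot 2} = (r_{01} - r_{02}r_{12})/\sqrt{(1 - r_{02}^2)(1 - r_{12}^2)}$. The sketch below is illustrative only: the simulated data, coefficients, seed, and helper names (`pearson_r`, `residuals`) are assumptions. It also forms the t statistic for testing a zero partial correlation, using n – p – 1 degrees of freedom with p = 2.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residuals(ys, xs):
    """Residuals of ys after subtracting its least-squares line in xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(2)  # arbitrary seed for reproducibility
n = 500
x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [-0.4 * c + rng.gauss(0, 1) for c in x2]
x0 = [0.7 * b + 0.5 * c + rng.gauss(0, 1) for b, c in zip(x1, x2)]

r01, r02, r12 = pearson_r(x0, x1), pearson_r(x0, x2), pearson_r(x1, x2)
via_formula = (r01 - r02 * r12) / math.sqrt((1 - r02 ** 2) * (1 - r12 ** 2))
via_residuals = pearson_r(residuals(x0, x2), residuals(x1, x2))

p = 2  # X1 and X2 are the predictors of X0
t_stat = math.sqrt(n - p - 1) * via_residuals / math.sqrt(1 - via_residuals ** 2)
print(round(via_formula, 6), round(via_residuals, 6), round(t_stat, 2))
```

The two routes agree to rounding error, since for sample quantities the identity is purely algebraic.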

Finally, it has been shown (Fisher 1928) that if the multivariate normal model is assumed, many results for $r_{01\cdot 23\cdots p}$ can be obtained from those for $r_{01}$ by replacing n – 2 by n – p – 1. For example, $(n - p - 1)^{1/2}\, r_{01\cdot 23\cdots p}/(1 - r^2_{01\cdot 23\cdots p})^{1/2}$ can be referred to the t-table with n – p – 1 degrees of freedom as a test of $H\colon \rho_{01\cdot 23\cdots p} = 0$.

An example

As an example of the applications of multiple and partial correlation, consider an experiment in which $X_0$ represents grade point average, $X_1$ represents IQ, $X_2$ represents hours of study per week, and the relationship is sought between $X_0$ and $X_1$ with $X_2$ held fixed (Keeping 1962, p. 363). Results based on a sample of 450 school children showed that $r_{0\cdot 12} = 0.82$, $r_{01} = 0.60$, $r_{02} = 0.32$, $r_{12} = -0.35$, and $r_{01\cdot 2} = 0.80$. The positive correlation between $X_0$ and $X_1$, together with the negative correlation between $X_1$ and $X_2$ (a more intelligent student need not study so long), obscured somewhat the strength of the relationship between $X_0$ and $X_1$. It should perhaps be mentioned that from the relation $1 - r^2_{0\cdot 12} = (1 - r^2_{02})(1 - r^2_{01\cdot 2})$ it is clear that $r_{0\cdot 12} \geq |r_{01\cdot 2}|$, with equality if and only if $r_{02} = 0$. It is true in general, for parameters or sample estimators, that a multiple correlation between a given variable and others is at least as large in magnitude as any simple or partial correlation between that variable and any of the others.
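The figures quoted in this example can be reproduced from the three simple correlations alone. The sketch below assumes the standard three-variable expression for the partial correlation and the relation $1 - r^2_{0\cdot 12} = (1 - r^2_{02})(1 - r^2_{01\cdot 2})$; the variable names are illustrative.

```python
import math

# Simple correlations reported in the example (Keeping 1962).
r01, r02, r12 = 0.60, 0.32, -0.35

# Partial correlation of X0 and X1 with X2 held fixed.
r01_2 = (r01 - r02 * r12) / math.sqrt((1 - r02 ** 2) * (1 - r12 ** 2))

# Multiple correlation of X0 on (X1, X2), obtained by adding X2 first, then X1.
R0_12 = math.sqrt(1 - (1 - r02 ** 2) * (1 - r01_2 ** 2))

print(round(r01_2, 2), round(R0_12, 2))  # 0.8 0.82
```

Both values agree with the reported 0.80 and 0.82, and the multiple correlation is indeed no smaller than the partial one.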

Reduction of the number of variables

Yule and Kendall (1958, chapter 13) offer practical advice of an elementary nature in relation to economy in the number of variables to be considered. In this connection one thing is more or less certain: if the number of variables is, say, greater than ten, an attempt to analyze the interrelations between variables by using their whole correlation matrix offers too many possibilities for the mind to encompass, or for methods to isolate, and is therefore probably a waste of time.

There are less elementary techniques for dealing with problems involving large sets of variables, which have been treated in depth and are worthy of wide application. These include canonical correlation, principal components, and factor analysis.

Canonical correlation

There are cases in which an experimenter wishes to study the interrelations between two sets of variables, $(Y_1, \ldots, Y_k)$ and $(X_1, \ldots, X_p)$. The purpose of canonical correlation theory (Hotelling 1936) is to replace these sets by new (and smaller) sets, at the same time preserving the correlation structure as much as possible. The method is as follows: Linear combinations, one from each set of variables, are so constructed as to have maximum simple correlation with each other. These linear combinations, denoted by $U_1$ and $V_1$, are called the first pair of canonical variables; their correlation, $\rho_1$, is the first canonical correlation. The process is continued by the construction of further pairs of linear combinations, with the provision that each new canonical variable be uncorrelated with all previous ones. If $k \leq p$, the process will terminate with $U_1, U_2, \ldots, U_k$, $V_1, V_2, \ldots, V_k$ and canonical correlations $\rho_1, \rho_2, \ldots, \rho_k$. If k = 1, the resulting single canonical correlation is the multiple correlation for $Y_1$ on $X_1, X_2, \ldots, X_p$. Since $\rho_1 \geq \rho_2 \geq \cdots \geq \rho_k \geq 0$ and since many canonical correlations may be small, it is clear that the canonical pairs worth preserving may be few.

The usual model specifies a joint normal law in p + k variables, and estimation of canonical correlations can be carried out with a sample by a scheme which parallels that for the construction of $\rho_1, \ldots, \rho_k$. The joint probability law for sample canonical correlations is known both in exact form and in approximate form for large n.

Before canonical correlations are estimated, it may be wise to carry out an initial test for possible complete lack of correlation between the two sets of variables. The hypothesis that $\rho_1 = \rho_2 = \cdots = \rho_k = 0$, or, equivalently, that all correlations between an $X_i$ and a $Y_j$ are zero, may be tested essentially by a procedure of Wilks (see Tate 1966). The hypothesis being tested can be rewritten as a form analogous to that of other, related tests. There are various tests available for this hypothesis; one should try to choose the one with highest power against the alternative hypotheses of interest. (See Anderson 1958, sec. 14.2; Hotelling 1936.)

Principal components and factor analysis

One of the central problems arising in the application of correlations is that of holding the variables considered down to a manageable number. This was mentioned above in connection with canonical correlation and is also the guiding principle underlying principal components analysis [for a discussion of principal components, see Hotelling 1933 and Factor Analysis, article on Statistical Aspects]. There one deals with a single set of variables, forming linear compounds that are uncorrelated with one another and arranged in order of decreasing variance. The basis of principal components analysis is the assumption that the more interesting observable quantities are those with larger variation. Factor analysis, which is of vast importance in psychological testing, utilizes a similar idea, except that the number of linear compounds to be considered is prescribed by the model. Connections between these two methods are discussed in a monograph by Kendall (1957).

Other methods of correlation

Intraclass correlation

In the discussion of sampling from an (X, Y)-population and the consequent use of the sample to estimate $\rho_{XY}$, there has been no question as to the separate identification of the X and Y for each observation. Thus, one can think of $\rho_{XY}$ as a measure of the interrelation between two classes, an X-class and a Y-class, and hence the term interclass correlation may be used. As an example of a situation in which the identification of X and Y is not clear, consider measuring the correlation between the weights of identical twins at, say, age five. Here there is in effect only one class, that of pairs of weights of twins. Any establishment of two classes (for example, by considering X the weight of the taller twin and Y the weight of the shorter twin) would be wholly arbitrary and not helpful. The population of weight pairs has a correlation coefficient, and this gives the intraclass correlation, the correlation coefficient between the two weights of a pair in random order. The method for handling this situation works as well with data involving triplets (one is still, however, interested in correlation for weights in the same family) or any number of children. Consider n observations (families) on k-tuplets, with $k \geq 2$. The method consists essentially in the averaging of products of deviations over all possible k(k – 1) pairs of children. If $X_{ij}$ represents the weight of the jth child in the ith family, then the intraclass correlation, r, is given by

$$r = \frac{1}{nk(k-1)\,s^2}\sum_{i=1}^{n}\sum_{j \neq l}(X_{ij} - \bar{X})(X_{il} - \bar{X}),$$

with $s^2 = \sum_i\sum_j (X_{ij} - \bar{X})^2/nk$, the within-families sample variance, and $s_m^2 = \sum_i(\bar{X}_i - \bar{X})^2/n$, the between-families sample variance. Thus,

$$r = \frac{k\,s_m^2 - s^2}{(k-1)\,s^2}.$$

It is clear that $r \geq -1/(k-1)$ and that for a single family (n = 1), $r = -1/(k-1)$. Intraclass correlation is closely related to components of variance models in the analysis of variance [see Linear Hypotheses, article on Analysis of Variance].
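A small numerical sketch may help here. It assumes that the intraclass correlation is computed either by averaging products of deviations from the grand mean over all ordered pairs within a family, or by the equivalent shortcut $r = (k s_m^2 - s^2)/((k-1)s^2)$ with $s^2$ and $s_m^2$ as defined in the text; the twin weights and function names are hypothetical.

```python
def intraclass_r(families):
    """Shortcut form; families is a list of n lists, each with k members."""
    n, k = len(families), len(families[0])
    grand = sum(sum(f) for f in families) / (n * k)
    s2 = sum((x - grand) ** 2 for f in families for x in f) / (n * k)
    s2m = sum((sum(f) / k - grand) ** 2 for f in families) / n
    return (k * s2m - s2) / ((k - 1) * s2)


def intraclass_r_pairs(families):
    """Same quantity, averaging products over all k(k - 1) ordered pairs."""
    n, k = len(families), len(families[0])
    grand = sum(sum(f) for f in families) / (n * k)
    s2 = sum((x - grand) ** 2 for f in families for x in f) / (n * k)
    products = sum((f[j] - grand) * (f[l] - grand)
                   for f in families
                   for j in range(k) for l in range(k) if j != l)
    return products / (n * k * (k - 1) * s2)


# Hypothetical weights (kg) for four pairs of twins.
twins = [[18.1, 17.6], [20.3, 21.0], [16.9, 17.2], [22.4, 21.8]]
r_short = intraclass_r(twins)
r_pairs = intraclass_r_pairs(twins)
r_single = intraclass_r([[1.0, 2.0, 4.0]])  # one family of k = 3
print(round(r_short, 3), round(r_pairs, 3), r_single)
```

For the single family the result is the lower bound –1/(k – 1), here –0.5, because the between-families variance vanishes.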


Observations on random variables are frequently subject to measurement errors or, at any rate, are observable only in combination with other random variables, so that in attempting to observe U, V one must instead accept X = U + E, Y = V + F. Previous methods lead to information about $\rho_{XY}$, when what is relevant is information about $\rho_{UV}$. If E and F are assumed to be uncorrelated with U, V, and each other, then the relation between $\rho_{UV}$ and $\rho_{XY}$ is given by

$$\rho_{XY} = \frac{\rho_{UV}}{\sqrt{(1 + \sigma_E^2/\sigma_U^2)(1 + \sigma_F^2/\sigma_V^2)}},$$

which shows that $|\rho_{XY}| \leq |\rho_{UV}|$, with equality occurring only in the trivial case in which E, F are both constant. The coefficient $\rho_{UV}$ is said to be attenuated by the effect of E and F. Correction for attenuation consists in applying to the above relation known or assumed information relative to $\rho_{XY}$, $\sigma_E/\sigma_U$, and $\sigma_F/\sigma_V$ in order to estimate $\rho_{UV}$. (For further discussion, see McNemar 1949.)
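The correction can be sketched in a few lines. The example assumes the relation $\rho_{XY} = \rho_{UV}/\sqrt{(1 + \sigma_E^2/\sigma_U^2)(1 + \sigma_F^2/\sigma_V^2)}$, which follows when E and F are uncorrelated with U, V, and each other; the function names and numerical values are hypothetical.

```python
import math


def attenuated(rho_uv, ratio_e_u, ratio_f_v):
    """Observed correlation rho_XY, given sigma_E/sigma_U and sigma_F/sigma_V."""
    return rho_uv / math.sqrt((1 + ratio_e_u ** 2) * (1 + ratio_f_v ** 2))


def corrected(rho_xy, ratio_e_u, ratio_f_v):
    """Correction for attenuation: recover rho_UV from rho_XY."""
    return rho_xy * math.sqrt((1 + ratio_e_u ** 2) * (1 + ratio_f_v ** 2))


rho_xy = attenuated(0.8, 0.5, 0.5)      # error s.d. half the true-score s.d.
rho_uv = corrected(rho_xy, 0.5, 0.5)    # undoes the attenuation
print(round(rho_xy, 4), round(rho_uv, 4))  # 0.64 0.8
```

In practice the ratios are rarely known exactly, so the corrected value is only as trustworthy as the assumed error variances.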


Anderson, Richard L.; and Bancroft, T. A. 1952 Statistical Theory in Research. New York: McGraw-Hill.

Anderson, T. W. 1958 An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Binder, Arnold 1959 Considerations of the Place of Assumptions in Correlational Analysis. American Psychologist 14:504-510.

Croxton, F. E.; Cowden, D. J.; and Klein, S. (1939) 1967 Applied General Statistics. 3d ed. Englewood Cliffs, N.J.: Prentice-Hall. → Klein became a co-author with the third edition.

David, F. N. 1938 Tables of the Ordinates and Probability Integral of the Distribution of the Correlation Coefficient in Small Samples. London: University College, Biometrika Office.

Dixon, Wilfrid J.; and Massey, Frank J. Jr. (1951) 1957 Introduction to Statistical Analysis. 2d ed. New York: McGraw-Hill.

Ezekiel, Mordecai; and Fox, Karl A. (1930) 1959 Methods of Correlation and Regression Analysis: Linear and Curvilinear. 3d ed. New York: Wiley.

Fisher, R. A. 1915 Frequency Distribution of the Values of the Correlation Coefficient in Samples From an Indefinitely Large Population. Biometrika 10:507-521.

Fisher, R. A. (1925) 1958 Statistical Methods for Research Workers. 13th ed. New York: Hafner. → Previous editions were published by Oliver & Boyd.

Fisher, R. A. 1928 On a Distribution Yielding the Error Functions of Several Well Known Statistics. Volume 2, pages 805-813 in International Congress of Mathematicians (New Series), Second, Toronto, 1924, Proceedings. Univ. of Toronto Press.

Gayen, A. K. 1951 The Frequency Distribution of the Product-moment Correlation Coefficient in Random Samples of Any Size Drawn From Non-normal Universes. Biometrika 38:219-247.

Hannan, J. F.; and Tate, R. F. 1965 Estimation of the Parameters for a Multivariate Normal Distribution When One Variable Is Dichotomized. Biometrika 52:664-668.

Hotelling, Harold 1933 Analysis of a Complex of Statistical Variables Into Principal Components. Journal of Educational Psychology 24:417-441, 498-520.

Hotelling, Harold 1936 Relations Between Two Sets of Variates. Biometrika 28:321-377.

Hotelling, Harold 1953 New Light on the Correlation Coefficient and Its Transforms. Journal of the Royal Statistical Society Series B 15:193-225.

Johnson, Palmer O. 1949 Statistical Methods in Research. New York: Prentice-Hall.

Keeping, E. S. 1962 Introduction to Statistical Inference. Princeton, N.J.: Van Nostrand.

Kendall, M. G. (1957) 1961 A Course in Multivariate Analysis. London: Griffin.

Kendall, M. G.; and Stuart, Alan 1958-1966 The Advanced Theory of Statistics. New ed. 3 vols. New York: Hafner; London: Griffin. → Volume 1: Distribution Theory, 1958. Volume 2: Inference and Relationship, 1961. Volume 3: Design and Analysis, and Time Series, 1966. The first edition, published in 1943-1946, was written by Kendall alone.

Kruskal, William H. 1958 Ordinal Measures of Association. Journal of the American Statistical Association 53:814-861.

McNemar, Quinn (1949) 1962 Psychological Statistics. 3d ed. New York: Wiley.

Olkin, Ingram; and Pratt, John W. 1958 Unbiased Estimation of Certain Correlation Coefficients. Annals of Mathematical Statistics 29:201-211.

Prince, Benjamin M.; and Tate, Robert F. 1966 The Accuracy of Maximum Likelihood Estimates of Correlation for a Biserial Model. Psychometrika 31:85-92.

Tate, R. F. 1955a The Theory of Correlation Between Two Continuous Variables When One Is Dichotomized. Biometrika 42:205-216.

Tate, R. F. 1955b Applications of Correlation Models for Biserial Data. Journal of the American Statistical Association 50:1078-1095.

Tate, R. F. 1966 Conditional-normal Regression Models. Journal of the American Statistical Association 61:477-489.

Wallis, W. Allen; and Roberts, Harry V. 1956 Statistics: A New Approach. Glencoe, Ill.: Free Press. → A revised and abridged paperback edition of the first section was published in 1962 by Collier.

Yule, G. Udny; and Kendall, M. G. 1958 An Introduction to the Theory of Statistics. 14th ed., rev. & enl. London: Griffin. → The first edition was published in 1911 with Yule as sole author. Kendall has been a joint author since the eleventh edition (1937), and the 1958 edition was revised by him. A 1965 printing contains new material.


Correlation, in a broad sense, is any probabilistic relationship between random variables (or sets of random variables) other than stochastic independence. Two random variables are said to be independent when the conditional distribution of one, given the other, does not depend on the given value. Viewed another way, independence means that the probability that both random variables are simultaneously in some given intervals is simply the product of the separate interval probabilities. Whenever independence does not hold, the two random variables are dependent, or correlated. (Terminology is not wholly standard, for the word “correlated” is sometimes used to refer to special kinds of dependence only.) [See Probability, article on Formal Probability.]

Two sets of random variables—that is, two random vectors—are independent when the conditional distribution of one set, given the other, does not depend on the given values.

The idea of a numerical measure of association between two random variables seems to have originated with Francis Galton, in the last part of the nineteenth century [see Galton]. From crude beginnings at his hands, the concept passed into those of F. Y. Edgeworth and particularly into those of Karl Pearson, whose academic training had been in mathematical physics but who caught Galton’s enthusiasm and devoted the rest of his life to statistics [see Edgeworth; Pearson]. From them there came the definition and exploration of the important correlation coefficient,

$$r = \frac{\sum_{i=1}^{N}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N}(X_i - \bar{X})^2\,\sum_{i=1}^{N}(Y_i - \bar{Y})^2}},$$

in sample form. Here $(X_1, Y_1), \ldots, (X_N, Y_N)$ are the members of an N-fold bivariate sample, and $\bar{X}$ and $\bar{Y}$ are the corresponding sample averages. Of course r may be written more compactly as

$$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2\,\sum (Y - \bar{Y})^2}}$$

or still more compactly as

$$r = \frac{\sum xy}{\sqrt{\sum x^2\,\sum y^2}},$$

where $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ are the residuals, or deviations from the sample averages. Another way of expressing r is obtained by dividing the numerator and the denominator by N – 1:

$$r = \frac{s_{xy}}{\sqrt{s_{xx}\,s_{yy}}},$$

where $s_{xx}$ and $s_{yy}$ are the conventional modes of expressing sample variance and $s_{xy}$ that of sample covariance.

The population, or underlying, correlation coefficient between random variables X and Y is

$$\rho = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}\,X\;\mathrm{var}\,Y}},$$

where $\mathrm{var}\,X = E(X - EX)^2$, $\mathrm{var}\,Y = E(Y - EY)^2$, and $\mathrm{cov}(X, Y) = E[(X - EX)(Y - EY)]$. When the sample of $(X_i, Y_i)$ is random, r is the usual estimator of ρ.
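The equivalent ways of writing r can be checked on a toy sample. This is a minimal sketch: the five data pairs are hypothetical, and the formulas used are the deviation form $\sum xy/\sqrt{\sum x^2 \sum y^2}$ and the variance-covariance form with divisor N – 1.

```python
import math

X = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical observations
Y = [2.0, 1.0, 4.0, 3.0, 5.0]
N = len(X)

x_bar = sum(X) / N
y_bar = sum(Y) / N
x = [v - x_bar for v in X]  # deviations from the sample averages
y = [v - y_bar for v in Y]

# Deviation form: sum(xy) / sqrt(sum(x^2) * sum(y^2)).
r_dev = sum(a * b for a, b in zip(x, y)) / math.sqrt(
    sum(a * a for a in x) * sum(b * b for b in y))

# Variance-covariance form: numerator and denominator divided by N - 1.
s_xx = sum(a * a for a in x) / (N - 1)
s_yy = sum(b * b for b in y) / (N - 1)
s_xy = sum(a * b for a, b in zip(x, y)) / (N - 1)
r_cov = s_xy / math.sqrt(s_xx * s_yy)

print(r_dev, r_cov)
```

Both forms give 0.8 for these data up to rounding error, since the factor N – 1 cancels.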

Instead of centering the quantities entering into the expressions for r and ρ on (̄X, ̄Y) and (EX, EY), respectively, estimated or true conditional expectations, given other variables, may be used. Then the correlation coefficients are called partial correlation coefficients.

The adjectives “Pearsonian” and “product-moment” are sometimes used in naming the correlation coefficient. Although Pearson was the first to study it with care, later workers, especially R. A. Fisher, pushed both the theoretical study and the applications of correlation much further [see Fisher, R. A.].

Applications of correlation

The initial application of correlation was to genetics, although that science remained at a rudimentary stage in England and other Western countries until the early 1900s, when the basic principles published in 1866 by the Austrian monk Gregor Mendel were rediscovered. Subsequent genetic research revealed specific correlations that result from various degrees of relationship, from the extent of random mating, and from other conditions. A substantial compendium of this correlational theory of genetics was published by Fisher (1918). These specific correlations made it important to compare certain hypotheses suggested by theoretical considerations about the value of a correlation coefficient with the observed results. For example, a theoretical correlation of 1/2 between stature of father and stature of son is suggested by a hypothesis of random mating; this correlation may, however, be obscured by the fluctuations of random sampling. Because of such problems the probability distribution of r in samples from a basic distribution with correlation ρ became an object of mathematical inquiry, of which an account will be given below. Some of the mathematical problems of great complexity are still only partly solved.

Analysis of human abilities

Even before the rediscovery of Mendelian genetics, psychologists became interested in correlation with a view to detecting and analyzing variations in human abilities. A pioneer work was Charles E. Spearman’s paper of 1904, which was later revised and expanded into his book The Abilities of Man (1927), leading to the theory that each of the various human abilities tested is the sum of a greater or less quota of “general intelligence” and another independent fraction of an ability special to the particular thing tested [see Intelligence and Intelligence Testing; the biography of Spearman]. These special abilities were initially thought of as being independent of general intelligence and of each other [see Factor Analysis]. If $\rho_{ij}$ denotes the true, or population, correlation between the ith and jth test scores, the original Spearman theory holds, as a consequence of its assumptions, that for all different subscripts i, j, k, l,

ρij ρkl = ρik ρjl = ρil ρjk.
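The origin of these equalities can be illustrated with a small sketch. It assumes, as in a single-general-factor model with mutually independent special abilities, that each population correlation factors as a product of two test-specific loadings; the loading values and the function name `rho` are hypothetical.

```python
# Hypothetical correlations of four tests with the general factor.
loadings = [0.9, 0.8, 0.7, 0.6]


def rho(i, j):
    """Population correlation between tests i and j under one general factor."""
    return loadings[i] * loadings[j]


i, j, k, l = 0, 1, 2, 3
tetrads = (rho(i, j) * rho(k, l),
           rho(i, k) * rho(j, l),
           rho(i, l) * rho(j, k))
print(tetrads)
```

All three products equal the product of the four loadings, so their differences vanish; with sample correlations the differences are only approximately zero, which is the source of the estimation problem discussed next.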

The population correlations, $\rho_{ij}$, cannot, however, be derived from theory but must be estimated by the sample correlations, $r_{ij}$, obtained from actual test scores. After the problem was recognized, there ensued a long period of wrestling with the difficult mathematics and logic of this problem and of attempting to reformulate the early theory to apply with greater generality to situations involving group factors and other elaborations. Greatly enlarged testing programs supplied vast amounts of data. In the 1920s and 1930s new views of the problem were introduced in numerous articles in journals, in the work of L. L. Thurstone, in a book by Truman L. Kelley (1928), and in a work of Karl Holzinger and Harry Harman (1941) that was the culmination of work done and papers published during the 1930s [see Kelley; Thurstone]. One of those who introduced new ideas and methods was Spearman himself, when he became convinced that his original formulation was inadequate (1927).

Rank-order correlation

Spearman introduced a correlation coefficient for ranked observations that avoids any assumption, either of normality or of any other particular form of distribution [see Nonparametric Statistics, article on Ranking Methods]. It has been used extensively by statisticians unwilling to make assumptions of particular forms for their data. An exact standard error for the Spearman coefficient was published by Hotelling and Pabst in 1936. Maurice G. Kendall provided another rank correlation coefficient in Biometrika in 1938 and reviewed the subject at length in chapter 16 of his Advanced Theory of Statistics (1943–1946, vol. 1). See also Kendall’s monograph on ranking methods (1948).

The correlation ratio

The correlation ratio was originally introduced to deal with nonlinear regression when the data are grouped. It has a strong formal similarity with analysis of variance in the one-way layout. Its theory is treated by Hotelling (1925) and Wishart (1932).
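The similarity to the one-way layout can be seen in a short sketch. It assumes the usual definition of the squared correlation ratio of Y on X for grouped data, namely the between-group sum of squares divided by the total sum of squares; the data and function names are hypothetical.

```python
import math


def correlation_ratio_sq(groups):
    """Between-group sum of squares over total sum of squares of Y."""
    all_y = [y for grp in groups.values() for y in grp]
    grand = sum(all_y) / len(all_y)
    between = sum(len(grp) * (sum(grp) / len(grp) - grand) ** 2
                  for grp in groups.values())
    total = sum((y - grand) ** 2 for y in all_y)
    return between / total


def pearson_r(xs, ys):
    """Sample product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical grouped data: Y rises and then falls as X goes from 1 to 3.
groups = {1: [1.0, 2.0, 3.0], 2: [5.0, 6.0, 7.0], 3: [1.0, 2.0, 3.0]}
xs = [x for x, grp in groups.items() for _ in grp]
ys = [y for grp in groups.values() for y in grp]

eta_sq = correlation_ratio_sq(groups)
r = pearson_r(xs, ys)
print(round(eta_sq, 4), round(r, 4))
```

For these data r is zero while the squared correlation ratio is 32/38, about 0.84, so the relationship is strong but entirely nonlinear.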

Effect of deviations from assumptions

The correlation coefficient, r, is sensitive to deviations from the usual basic assumptions of normality, independence of observations, and uniform variance among the observations. Extreme deviations in variance, particularly in the form of large deviations of both X and Y in the same term, may cause an exaggeration of r above ρ. Effects of nonnormality may be serious and will be discussed later. These effects are generally ignored in the literature.

Lack of independence, another sort of deviation from assumptions, particularly between different observations on the same variate, has been felt to be so serious a menace as to impair deeply the reliability of many correlation coefficients, especially for economic time series. Partial correlation, equivalent to removal of a set of variables that are considered extraneous from both X and Y by least squares, is a useful method. A special case of it is the elimination of trends—best done by least squares—which may be combined with the elimination of seasonal variation, for which special methods have been devised. Caution is needed in such enterprises to obtain “models” that are truly reasonable and do not involve removing too much with the trend, throwing out the baby with the bath water. But the penalty for such a sin is often very light, usually being limited to a reduction in the number of degrees of freedom, whereas a failure to remove significant components of trend, such as secular and seasonal components, may grossly exaggerate the correlation.

Autocorrelation and serial correlation

Autocorrelation, in which each observation on X is matched with another observation on X, where there is a fixed time interval between the two observations, may be measured by the same formula as r or slight variations of it. Lag correlation is given by the usual formula with a fixed time interval between each X and the corresponding Y. In both these situations the distribution is different from that of r based on a random sample. The choice of suitable types of autocorrelations and serial correlations should be made with a view to what is known or believed about the interrelations of the actual observations. Since these interrelations are seldom known exactly, the choice of a particular statistic can often be made so as to relate it suitably both to the true matrix of correlation and to manageable forms for its own distribution. (For methods useful in finding some such distributions, see papers by Tjalling C. Koopmans 1942 and R. L. Anderson 1942.)
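As a rough illustration, the ordinary formula for r can be applied to a series paired with a shifted copy of itself or of another series. The sketch below is one of the possible variations, not a recommended estimator; the alternating series and the function names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Sample product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def lag_correlation(xs, ys, lag):
    """Correlation between xs[t] and ys[t + lag] for a fixed lag >= 1."""
    return pearson_r(xs[:-lag], ys[lag:])


series = [1.0, -1.0] * 10  # hypothetical strictly alternating series
auto_lag1 = lag_correlation(series, series, 1)  # each value vs. its successor
auto_lag2 = lag_correlation(series, series, 2)  # each value vs. two steps on
print(auto_lag1, auto_lag2)
```

The lag-1 value is –1 and the lag-2 value is +1 up to rounding error. As the text notes, significance levels for the ordinary r do not carry over to such statistics.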

Other applications

Correlation enters biometrics in many places other than genetics. Areas in which it has been widely used are quality control and quantitative anthropology [see Physical Anthropology; Quality Control, Statistical].

The precision of r

The formula for r is the same as that used in solid analytic geometry for the cosine of the angle between two lines through the origin, one to each of the points with coordinates $(x_1, \ldots, x_N)$, $(y_1, \ldots, y_N)$, except that the formula given in the textbooks is usually confined to three dimensions. Since it is a cosine, r cannot exceed 1 or be less than – 1, but when the variates are distributed under reasonable assumptions of continuity, r can take either of these extreme values. If (and only if) r = ±1, the Y’s of the sample are linearly related to their corresponding X’s, with the linear function increasing if r = 1 and decreasing if r = –1.

In order to make substantial use of r, it is necessary to have at least an approximation to its probability distribution, which will involve both the true value and the sample size. The probability distribution was first deduced for random samples with ρ ≠ 0 from the ’bivariate normal population by Fisher (1915), but the results, although correct, were very difficult to use until simplifying transformations could be found. One simplification, which in the end proved too drastic, is to use the standard error of r, a function of ρ and N, and to treat r as normally distributed about ρ. This had been done by Karl Pearson and L. N. G. Filon (1898). An earlier version contained an error, whose cause it is instructive to examine: the two sample standard deviations in the denominator of r were regarded as fixed, or the same in all samples; this introduced into the denominator of the standard error of r an extraneous factor, . The error was corrected in the 1898 paper, which provided the equivalent of the formula

$$\sigma_r = \frac{1 - r^2}{\sqrt{n}},$$

where n = N – 1, the number of so-called degrees of freedom in this case. (N could be used instead of n in the above expression, but it is useful and conventional to use the degrees of freedom.) The above expression appeared in textbooks for several decades, puzzling students by the obvious absurdity that the standard error of r appears as a function of r itself. Of course the meaning of the above formula is that the asymptotic (or large-sample) standard error of r is (1 – ρ²)/√n, which is estimated by substituting r for ρ in the expression. The notation of the period was one in which parameters and their estimators were often denoted by the same symbol, a pernicious practice that sometimes misled even those statisticians who presumably used it only as a convenient shorthand. The need for a notational distinction between the two concepts of parameter and estimator was not well understood, even by mathematical statisticians, until after the publication of Fisher’s paper of 1915.
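A small numerical sketch may show why treating r as normally distributed about ρ is too drastic. It assumes the large-sample standard error (1 – ρ²)/√n with r substituted for ρ; the sample values and the multiplier 1.96 (the usual two-sided 5 per cent normal point) are illustrative choices, not taken from the text.

```python
import math

def estimated_se_r(r, N):
    """Large-sample standard error (1 - rho^2)/sqrt(n), with r put in place of rho."""
    n = N - 1                      # degrees of freedom, as in the text
    return (1 - r * r) / math.sqrt(n)

def naive_normal_interval(r, N, z_crit=1.96):
    """Interval obtained by treating r as normal about rho (illustration only)."""
    se = estimated_se_r(r, N)
    return r - z_crit * se, r + z_crit * se

se = estimated_se_r(0.9, 10)
low, high = naive_normal_interval(0.9, 10)
```

With r = 0.9 and N = 10 the upper limit exceeds 1, a value no correlation can take, which reflects the skewness that the normal approximation ignores.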

The development of mathematical theory

The first publication of an exact distribution of a correlation coefficient seems to have been by William S. Gosset (1908), a chemist publishing under the name “Student” because of his employers’ opposition to publication [see Gosset]. The data were supposed to represent a random sample from a bivariate normal population with correlation ρ = 0. Fisher’s 1915 paper supplied for the first time an exact distribution of r with ρ ≠ 0. This paper has led to others by various authors, and will stand as a great triumph.

The matter had been on Pearson’s mind, and after the publication of Fisher’s paper he mobilized the resources of his entire Biometric Laboratory in London to improve the results. In what has come to be referred to as the Cooperative Study (Soper et al. 1917), Pearson, with four collaborators, began with a series expression for the distribution which is remarkable in that although it converges, it does so with extreme slowness. When multiplied by an appropriate factor, however, and integrated to get the moments, the new series converges with great rapidity, especially for large samples. The Cooperative Study also effected other mathematical improvements and provided handsome plates showing the frequency function as a surface with horizontal coordinates r and ρ, with drawings and tables. But then came a fateful step.

Difficulties about the foundations of statistical inference were coming more clearly into view, partly as a result of all the work on r. It seemed only natural to Pearson to invoke Bayes’ theorem of inverse probability to provide a solution of these unsolved problems. The Cooperative Study has a section on the application of the results, with a priori probabilities provided by Pearson’s experience and judgment and with far-reaching inferences from hypothetical samples.

Fisher had already taken a stand against Bayesian inference and wrote a rebuttal to the inverse probability argument of the Cooperative Study. However, because of Pearson’s opposition, Fisher, still a young man and comparatively unknown, was unable to publish his paper in England. It finally appeared in 1921 in Corrado Gini’s new journal Metron, published in Rome. In the 1921 volume and in that of 1924, besides pointing out the absurdities arising from application of inverse probability by Pearson’s methods to certain data, Fisher made an important constructive contribution regarding the application of the same distribution to partial correlations with a reduction in the number of degrees of freedom equal to the number of variates eliminated.

Florence N. David, a member of the Pearson group at University College, London, computed a very fine table (1938) of the correlation distribution in random samples from a normal distribution. It far exceeded in scope and accuracy the short tables previously published in Fisher’s initial paper (1915) and in the Cooperative Study (Soper et al. 1917). Her principal computational tool was the numerical solution of the two second-order difference equations previously discovered, which she adapted.

The appropriate formula for the variance of the correlation coefficient, equivalent to the 1898 result of Pearson and Filon, is only the first term of an infinite series of powers of n⁻¹ with coefficients involving increasing powers of ρ. Additional terms may be computed by various methods, for example, by the rapidly convergent series for the moments of r used in the Cooperative Study (Soper et al. 1917) or as given by Hotelling (1953, p. 212).

All these approximations to the variance of r, however, require a knowledge of ρ, which is ordinarily not obtainable. Moreover, when ρ ≠ 0, the distribution of r is skew, and if ρ is close to ±1 and the sample is of moderate size, the distribution is very skew indeed. A serious problem is thus created for statisticians who wish to determine, for example, whether the values of r in two independent samples differ significantly from each other or to find a suitably weighted average of several quite different and independent values of r, corresponding either to distinct values of ρ or to one common value. Fisher proposed as a solution for such problems the transformation

$$z = \tfrac{1}{2}\log_e\frac{1 + r}{1 - r} = \tanh^{-1} r,$$

abandoning an inferior transformation of his 1915 paper, and announced that, to a close approximation and with moderately large samples, z has a nearly normal distribution, with means and variances nearly independent of ρ. F. N. David examines, in her volume of tables (1938), the accuracy of these statements by Fisher and is inclined to consider them accurate enough for practical use. These descriptive terms are, however, relative, and it still seems that for some cases, especially with small samples, use of the z transformation is not sufficiently accurate.
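The two uses of z named above, comparing independent values of r and averaging them, can be sketched as follows. The sketch takes z = tanh⁻¹ r and assumes the familiar large-sample variance 1/(N – 3) for z, which is not stated in the text; the function names are hypothetical, and the caveats about small samples apply in full.

```python
import math

def fisher_z(r):
    """Fisher's transform z = (1/2) log((1 + r)/(1 - r)) = atanh(r)."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return math.atanh(r)

def compare_correlations(r1, N1, r2, N2):
    """Approximate normal deviate for the difference of two independent r's.

    Assumes var(z) is roughly 1/(N - 3), an approximation not given in the text.
    """
    se = math.sqrt(1 / (N1 - 3) + 1 / (N2 - 3))
    return (fisher_z(r1) - fisher_z(r2)) / se

def pooled_correlation(rs, Ns):
    """Average several independent r's on the z scale, weighting by N - 3."""
    weights = [N - 3 for N in Ns]
    z_bar = sum(w * fisher_z(r) for w, r in zip(weights, rs)) / sum(weights)
    return math.tanh(z_bar)

deviate = compare_correlations(0.5, 50, 0.5, 80)   # equal r's give zero
pooled = pooled_correlation([0.2, 0.6], [30, 30])  # lies between the two r's
```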

In Fisher’s original calculation there are small errors in the mean and variance of z, which are not carried beyond terms of order n⁻¹. These are corrected and the series are carried out to terms of order n⁻² in a paper by Hotelling (1953). These series provide apparent improvements in the accuracy of z, at least for large samples. This paper also contains revised calculations on many other aspects of the correlation distribution.

A frequent practical problem is to test the null hypothesis ρ = 0 from a single observed correlation. Under normality this null hypothesis corresponds to independence. To this end, r is usually transformed into z, which is treated as normally distributed about 0. This practice, however, is not to be recommended. It is far more accurate in such cases to use one of the other three methods of testing the hypothesis ρ = 0 (given in the first part of Hotelling 1953).

This 1953 paper is a careful reworking of most of the earlier theory of correlation, with considerable additions. These include three new formulas for the distribution of r when ρ = 0 and one formula, involving a very rapidly convergent hypergeometric series, good for all |ρ| < 1. With these series there are easily calculated and usually small upper bounds for the error of stopping with any term. There are also attractive series for the probability integral and for the moments of r and of Fisher’s transform, z = tanh⁻¹ r. Simple improvements are obtained for Fisher’s estimates of the bias and variance of z. These eliminate certain small errors and go further in the series of powers of n⁻¹ to terms of order n⁻³ and carry these through for moments of orders lower than 5. For moments of order 5 or more, all terms are of order n⁻⁴ or higher. The moments of r through the sixth are given through terms of order n⁻³. The skewness and kurtosis are also given and differ slightly from Fisher’s values.

Finally, it is proposed that z be modified, particularly for large samples, by using in its place either the first two or all three of the terms of

$$z - \frac{3z + r}{4n} - \frac{23z + 33r - 5r^3}{96n^2}.$$

Here, as throughout the 1953 paper, n means the number of degrees of freedom, which is ordinarily less by unity than the sample number.
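The modified transforms can be sketched as below. The sketch assumes that the three-term expression is z – (3z + r)/(4n) – (23z + 33r – 5r³)/(96n²), the form usually quoted from Hotelling (1953); the coefficients should be checked against pp. 223–224 of that paper before serious use. The names `z_star` and `z_2star` stand for the two-term and three-term versions.

```python
import math

def hotelling_transforms(r, n):
    """Return (z, z*, z**) for a sample correlation r with n degrees of freedom.

    The correction terms are an assumption taken from the usual statement
    of Hotelling's 1953 result, not from the text of this article.
    """
    z = math.atanh(r)
    z_star = z - (3 * z + r) / (4 * n)                                # first two terms
    z_2star = z_star - (23 * z + 33 * r - 5 * r ** 3) / (96 * n ** 2)  # all three terms
    return z, z_star, z_2star

# Hypothetical example: r = 0.6 from a sample of N = 10, so n = 9.
z, z_star, z_2star = hotelling_transforms(0.6, 9)
```

For positive r each correction pulls the value toward zero, and both corrections shrink as n grows.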

A further method for testing ρ = 0 is to restate this hypothesis as asserting that the regression coefficient of one variate on the other is truly 0, and to test this by means of Student’s t, the ratio of the estimated regression coefficient to its estimated standard error; this is a function of r.
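That the regression test is a function of r can be checked numerically. The sketch below computes Student’s t once from r, using the standard identity t = r√(N – 2)/√(1 – r²) with N – 2 degrees of freedom (an identity assumed here, since the text does not write it out), and once as the estimated slope divided by its estimated standard error. The data are hypothetical.

```python
import math

def sums_of_squares(xs, ys):
    """Return the centered sums Sxx, Syy, Sxy."""
    N = len(xs)
    mx, my = sum(xs) / N, sum(ys) / N
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxx, syy, sxy

def t_from_r(r, N):
    """Student's t for the hypothesis rho = 0, written as a function of r."""
    return r * math.sqrt(N - 2) / math.sqrt(1 - r * r)

def t_from_regression(xs, ys):
    """Ratio of the least-squares slope of y on x to its estimated standard error."""
    N = len(xs)
    sxx, syy, sxy = sums_of_squares(xs, ys)
    slope = sxy / sxx
    residual_ss = syy - slope * sxy
    se_slope = math.sqrt(residual_ss / (N - 2) / sxx)
    return slope / se_slope

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 3.2, 4.8, 5.1, 5.9]
sxx, syy, sxy = sums_of_squares(xs, ys)
r = sxy / math.sqrt(sxx * syy)
t_r = t_from_r(r, len(xs))
t_b = t_from_regression(xs, ys)
```

The two values agree up to rounding, so either may be referred to tables of Student’s distribution with N – 2 degrees of freedom.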

All these methods are accurate only in the case of random sampling from a normal distribution. However, even in this standard situation the use of z is more or less inaccurate, especially for small samples and large values of r.

As stated above, Fisher recommends the use of z instead of r also for purposes other than testing ρ = 0, such as testing the difference between two independent correlation coefficients or the dispersion among several such values of r or the weights to be applied in averaging them or the accuracy of the average. This idea was carried further by R. L. Thorndike (1933) in a study of the stability of the IQ. Each of his experiments resulted in a correlation between the results of the test given at an earlier and at a later date. With the magnitude of such a correlation coefficient is associated the number of persons in the sample and also the time elapsed between tests. Since the weights to be applied to the independent experiments are inversely proportional to the variances in the several cases, and since the reciprocals of the variances are approximately proportional to the number of cases in the samples when the correlations are transformed into values of z, essentially uniform variances are obtained. Thus, in fitting a curve to the several correlations, the method of least squares is appropriate because its assumptions are approximately satisfied. The weights are taken as the numbers of persons in the experiments. More accuracy could presumably be obtained by using instead of z the slightly different expressions z* and z** obtained by Hotelling (1953, pp. 223–224).

Variance of r in nonnormal cases

In addition to the unreliability of inferences involving correlation coefficients mentioned above, because of correlations between different observations on the same variate and because of nonuniform variances, a quite different source of errors is the nonnormal bivariate distributions that often affect observations. When these distributions, or their first four moments, are known or approximated, the variance of r is given, to a first approximation, by the formula

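$$\sigma_r^2 = \frac{\rho^2}{n}\left[\frac{\mu_{22}}{\mu_{11}^2} + \frac{1}{4}\left(\frac{\mu_{40}}{\mu_{20}^2} + \frac{\mu_{04}}{\mu_{02}^2} + \frac{2\mu_{22}}{\mu_{20}\,\mu_{02}}\right) - \left(\frac{\mu_{31}}{\mu_{11}\,\mu_{20}} + \frac{\mu_{13}}{\mu_{11}\,\mu_{02}}\right)\right],$$

in which μij (i, j = 0, 1, 2, 3, 4) is the expectation E[(X – EX)^i (Y – EY)^j]. This formula was established by Arthur L. Bowley (1901, p. 423 in the 1920 edition) and later by Maurice G. Kendall (1943-1946, vol. 1, p. 212).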

If the moments of the bivariate normal distribution are substituted in this formula, the result is (1 – ρ²)²/n, the well-known first approximation. A second approximation is found by multiplying this result by 1 + 11ρ²/(2n), as shown, with considerable extensions, by Hotelling (1953, p. 212).

If instead of being normal the distribution is of uniform density within an ellipse centered at the origin and tilted with respect to the coordinate axes if ρ ≠ 0, and if the density is 0 outside this ellipse, the formula for the variance, given above, is multiplied by ⅔. This is a substantial reduction.

Another case is a distribution over only four points, with probabilities

$$P(1, 1) = P(-1, -1) = \tfrac{1}{4}(1 + \rho), \qquad P(1, -1) = P(-1, 1) = \tfrac{1}{4}(1 - \rho),$$

and with ρ taking any value between –1 and 1. The moments needed are easily found; since x³ = x and y³ = y for the values ±1, which are the only ones considered, any subscript of 2 or more may be reduced by 2 or 4. The result is (1 – ρ²)/n, and ρ is the correlation. This variance is larger than that for samples from a normal distribution by the factor (1 – ρ²)⁻¹.

A collection of such cases would be useful in practice because of the importance of nonnormality in correlation.
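The cases of this section can be checked with a short sketch. It assumes the first-approximation formula in the form usually quoted from Kendall, var(r) ≈ (ρ²/n)[μ22/μ11² + ¼(μ40/μ20² + μ04/μ02² + 2μ22/(μ20μ02)) – (μ31/(μ11μ20) + μ13/(μ11μ02))], and, for the four-point case, probabilities (1 + ρ)/4 at (1, 1) and (–1, –1) and (1 – ρ)/4 at (1, –1) and (–1, 1). Both are assumptions of the sketch, and the formula requires μ11 ≠ 0.

```python
import math

def var_r_first_approx(mu, n):
    """First approximation to var(r) from central moments mu[(i, j)]; needs mu11 != 0."""
    m20, m02, m11 = mu[(2, 0)], mu[(0, 2)], mu[(1, 1)]
    rho = m11 / math.sqrt(m20 * m02)
    bracket = (
        mu[(2, 2)] / m11 ** 2
        + 0.25 * (mu[(4, 0)] / m20 ** 2 + mu[(0, 4)] / m02 ** 2
                  + 2 * mu[(2, 2)] / (m20 * m02))
        - (mu[(3, 1)] / (m11 * m20) + mu[(1, 3)] / (m11 * m02))
    )
    return rho ** 2 / n * bracket

def normal_moments(rho):
    """Central moments of a bivariate normal with unit variances and correlation rho."""
    return {(2, 0): 1.0, (0, 2): 1.0, (1, 1): rho, (2, 2): 1 + 2 * rho ** 2,
            (4, 0): 3.0, (0, 4): 3.0, (3, 1): 3 * rho, (1, 3): 3 * rho}

def four_point_moments(rho):
    """Moments of the assumed four-point distribution on (+-1, +-1); means are zero."""
    points = {(1, 1): (1 + rho) / 4, (-1, -1): (1 + rho) / 4,
              (1, -1): (1 - rho) / 4, (-1, 1): (1 - rho) / 4}
    keys = [(2, 0), (0, 2), (1, 1), (2, 2), (4, 0), (0, 4), (3, 1), (1, 3)]
    return {(i, j): sum(p * x ** i * y ** j for (x, y), p in points.items())
            for (i, j) in keys}

rho, n = 0.5, 20
v_normal = var_r_first_approx(normal_moments(rho), n)
v_four = var_r_first_approx(four_point_moments(rho), n)
```

With ρ = 0.5 and n = 20 the normal case reproduces (1 – ρ²)²/n and the four-point case is larger by the factor (1 – ρ²)⁻¹, in agreement with the statements above.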

Partial and multiple correlation—geometry

Suppose that tests of arithmetical and reading abilities, yielding scores X1 and X2, are applied to a group of seventh-grade school children and the correlation between these abilities is sought. A difficulty is that proficiency in both tests depends on age, X3, and general advancement, X4.

In this case either or both of X3 and X4 may be incorporated in regression functions fitted by least squares to X1 and X2, and the deviations of X1 and X2 from these functions may be correlated in a way more nearly independent of age and general advancement than X1 and X2 by themselves. Such a correlation is called a sample partial correlation of order 1 or 2, according to the number of variables eliminated, and is denoted by r12.3, r12.4, or r12.34.

If all four variables are measured on each of N children, the results may be pictured as the N coordinates, in a space of N dimensions, of four points, and each of these determines a vector from the origin. If the coordinates are replaced by deviations from the respective four means, this is equivalent to projecting each of the four vectors orthogonally onto the flat subspace through the origin for which the sum of a point’s coordinates is zero. Consider the four vectors from the origin to the four projections; the cosines of their angles are the correlations among the original variables. The above projections may be regarded as the original vectors, from each of which is subtracted its orthogonal projection on the equiangular line (the line of all points whose coordinates are equal among themselves). The sample partial correlations may be regarded similarly, except that the subtracted projection is onto a subspace that includes the equiangular line, and more. For example, r12.3 may be described geometrically as follows: Begin with the plane determined by the equiangular line and the vector from the origin to the point determined by the N observations on X3 as coordinates. Project the vector from the origin to the X1 point onto that plane, and subtract the resulting vector from the X1 vector. This gives the residual values of the X1 observations after best “removing” the effects of a constant and of X3. Now go through the same procedure for X2. Then r12.3 is the cosine of the angle between the two vectors of residuals. In order to compute r12.3 it is not necessary to go through this process arithmetically, for r12.3 is a simple function of the ordinary correlation coefficients:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}}.$$

From this geometry, which was described by Dunham Jackson (1924), it is easy to see that if X1 and X2 have a joint normal distribution, with independence among the different persons, and X3 is fixed or has an arbitrary distribution, then the deviations of X1 and X2 from their regressions on X3 have a correlation distribution of the same kind, with the sample number reduced by unity.

The definition of r12.4 is equivalent to the formula above, with “3” replaced by “4.” It may be given a geometrical interpretation like those above. In general, the subscripts before the dot, called primary subscripts, pertain to the variables whose correlation is sought; they are interchangeable. The subscripts after the dot are called secondary subscripts, refer to the variables being eliminated, and may be permuted among themselves in any order without changing the value of the partial correlation provided by the formula. If p variates and the arithmetic means are eliminated, with N values for each variable, the number of degrees of freedom is reduced to n = N – 1 – p.

Partial correlations may also be expressed as ratios of determinants of simple correlations. This fact is useful in proving theorems, but in numerical work the recursive formulas like those above are generally used.
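The agreement between the geometric description and the formula in simple correlations can be verified numerically. The sketch below assumes the standard first-order formula r12.3 = (r12 – r13 r23)/√((1 – r13²)(1 – r23²)) and compares it with the correlation of least-squares residuals; the scores are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r12, r13, r23):
    """First-order partial correlation from the three simple correlations."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

def residuals(ys, xs):
    """Deviations of ys from its least-squares line on xs (constant term included)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

# Hypothetical scores: x1 arithmetic, x2 reading, x3 age in months.
x1 = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 12.0]
x2 = [2.0, 3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0]
x3 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]

by_formula = partial_r(pearson_r(x1, x2), pearson_r(x1, x3), pearson_r(x2, x3))
by_residuals = pearson_r(residuals(x1, x3), residuals(x2, x3))
```

Higher-order partial correlations follow by applying the same formula recursively to partial correlations of the next lower order.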

Partial correlations were used extensively by Yule (see Yule & Kendall [1911] 1958) in investigations of social phenomena, generally on the basis of the poor-law union as a unit.

Multiple correlation is the correlation of one predictand (“dependent variate”) with two or more predictor variables, with least squares as the method of prediction or estimation. The multiple correlation coefficient is the correlation between the observations, y, and the predicted values, Y [see Linear hypotheses, article on Regression]. The exact sampling distribution of the multiple correlation coefficient R, like that of r, was discovered by Fisher.
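For two predictors the definition can be illustrated without matrix software. The sketch fits the least-squares plane by solving the 2 × 2 normal equations on centered data, correlates the observations with the fitted values, and compares the result with the standard closed form R² = (r_y1² + r_y2² – 2 r_y1 r_y2 r_12)/(1 – r_12²), which is not given in the text. The data are hypothetical.

```python
import math

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def corr(a, b):
    da, db = center(a), center(b)
    return dot(da, db) / math.sqrt(dot(da, da) * dot(db, db))

def multiple_r_two_predictors(y, x1, x2):
    """Correlation between y and its least-squares prediction from x1 and x2."""
    dy, d1, d2 = center(y), center(x1), center(x2)
    a11, a12, a22 = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    c1, c2 = dot(d1, dy), dot(d2, dy)
    det = a11 * a22 - a12 * a12            # zero only if x1 and x2 are collinear
    b1 = (c1 * a22 - c2 * a12) / det
    b2 = (c2 * a11 - c1 * a12) / det
    fitted = [b1 * u + b2 * v for u, v in zip(d1, d2)]  # fitted deviations from the mean
    return corr(y, fitted)

y = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 12.0]
x1 = [2.0, 3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0]
x2 = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]

R = multiple_r_two_predictors(y, x1, x2)
ry1, ry2, r12 = corr(y, x1), corr(y, x2), corr(x1, x2)
R_closed_form = math.sqrt((ry1 ** 2 + ry2 ** 2 - 2 * ry1 * ry2 * r12) / (1 - r12 ** 2))
```

R is never smaller in absolute value than either simple correlation of y with a single predictor.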

Canonical correlations

The situation of multiple correlation is generalized to the case where one has two sets of variables, with two or more variables in each set, and wishes to use and analyze the relations between the sets. The multiple correlation case is that in which one set consists of only a single variable, whereas in the new situation there are at least two variables in each set. This problem was dealt with in a brief paper by Hotelling (1935). A longer, definitive version of it and of many related problems appeared the following year (Hotelling 1936a). T. W. Anderson (1958), working with slightly different notation and subject matter, deals with canonical correlations and canonical variates in a population in chapter 12 and in a sample in chapter 13, with related subjects.

A primary objective of canonical correlation analysis is to determine two linear functions, one of variates in the first set, the other of those in the second set, so that the correlation between these two functions is as great as possible. Without loss of generality, one may require the variances of these two linear combinations to be unity, so that covariance is to be maximized. This permits use of the Lagrange multiplier approach, with two fixed conditions. The resulting equation for maximization is a determinantal equation in the Lagrange multiplier, λ, written in terms of all the original correlations. If there are s variates in the first set and t in the second, with s ≤ t as a matter of convention, it turns out that the determinantal equation has 2s real roots, all less than or equal to one in absolute value. (They come in pairs of equal magnitude and opposite sign.) If one of the roots is substituted for λ, the determinant of the determinantal equation is 0; then, if its matrix be used to form linear equations, their solution provides the coefficients of two linear functions of the s and t variates, respectively. The correlations between those pairs of linear functions, lying between 0 and +1, constitute the canonical correlations of the system. The linear functions are the canonical variates and may be regarded either as determined only to within an arbitrary common multiplier or as determined by the conditions that their variance shall equal unity. The greatest root and its corresponding pair of linear functions provide the solution of the primary problem.

If all roots are 0, then every correlation of a variate in one set with a variate in the other is 0.

For s = t = 2 the calculations are easy by elementary methods. For larger values of s and t, however, elementary methods rapidly grow more laborious and may well be superseded by iterative procedures. Such processes are available; different but similar processes are described by Hotelling (1936a; 1936b).
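The elementary case s = t = 2 may be sketched as follows. Instead of the determinantal equation in λ described above, the sketch uses an equivalent standard formulation: the squared canonical correlations are the characteristic roots of A⁻¹CB⁻¹Cᵀ, where A and B are the correlation matrices within the two sets and C holds the correlations between them. For 2 × 2 matrices the roots come from a quadratic. The six correlations are hypothetical.

```python
import math

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_inv(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]

def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]

def canonical_correlations_2x2(r12, r34, r13, r14, r23, r24):
    """Canonical correlations between the sets (x1, x2) and (x3, x4)."""
    A = [[1.0, r12], [r12, 1.0]]
    B = [[1.0, r34], [r34, 1.0]]
    C = [[r13, r14], [r23, r24]]
    M = mat_mul(mat_mul(mat_inv(A), C), mat_mul(mat_inv(B), transpose(C)))
    trace = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(max(trace * trace - 4 * det, 0.0))
    lam_big, lam_small = (trace + disc) / 2, (trace - disc) / 2
    return math.sqrt(lam_big), math.sqrt(max(lam_small, 0.0))

rc1, rc2 = canonical_correlations_2x2(0.5, 0.3, 0.4, 0.2, 0.1, 0.3)
```

The larger value answers the primary problem of the greatest attainable correlation between a linear function of one set and a linear function of the other.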

Canonical correlations and variates may be computed for the population, if its correlation matrix is known, exactly as for the sample. If a population canonical correlation, ρ, is a single, not a multiple, root of its equation, then large-sample first approximations to it will tend to normality with a standard error that to a first approximation is (1 – ρ²)/√n, exactly as in the case of elementary correlation. For multiple roots the large-sample approximations have a distribution tending to the chi-square form, with the number of degrees of freedom equal to the multiplicity.

There is some awkwardness in using canonical correlations that may sometimes be avoided, according to the particular purpose, by using functions of them. Symmetric functions often bring special simplicity. If the roots are r1, r2,..., rs, two of the most useful symmetric functions are q = ± r1r2... rs and z = (1 – r1²)(1 – r2²)...(1 – rs²); q has been called the vector correlation coefficient and z the vector alienation coefficient. They may be used to test different types of deviations from independence between the two sets, but the same is true of other functions of r1,..., rs, for example, the greatest root.

Between the set (x1, x2) and the set (x3, x4) the vector correlation coefficient is

$$q = \frac{r_{13}\,r_{24} - r_{14}\,r_{23}}{\sqrt{(1 - r_{12}^2)(1 - r_{34}^2)}}.$$

This vanishes if the tetrad difference (the numerator) does so. Thus, the tetrad difference, of great importance in factor analysis, may sometimes be tested appropriately by testing q. It is shown by Hotelling (1936a, p. 362) that if complete independence exists between the two sets, the probability that q is exceeded in a sample of N from a quadrivariate normal distribution is exactly (1 – |q|)^(N – 3). (Many other matters involved in the statistics of pairs of variates are also included in Hotelling 1936a and other publications.)
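A minimal sketch of the test on q follows. It assumes that q for two sets of two variates is the tetrad difference r13 r24 – r14 r23 divided by √((1 – r12²)(1 – r34²)), and it uses the probability (1 – |q|)^(N – 3) quoted above. The correlations and the sample size are hypothetical.

```python
import math

def vector_correlation(r12, r34, r13, r14, r23, r24):
    """Vector correlation q between the sets (x1, x2) and (x3, x4)."""
    tetrad = r13 * r24 - r14 * r23
    return tetrad / math.sqrt((1 - r12 ** 2) * (1 - r34 ** 2))

def prob_q_exceeded(q, N):
    """Probability, under complete independence, of a |q| at least this large."""
    return (1 - abs(q)) ** (N - 3)

q = vector_correlation(0.5, 0.3, 0.4, 0.2, 0.1, 0.3)
p = prob_q_exceeded(q, 20)

# A vanishing tetrad difference gives q = 0 and hence probability 1.
q_zero = vector_correlation(0.5, 0.3, 0.4, 0.2, 0.2, 0.1)
p_zero = prob_q_exceeded(q_zero, 20)
```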

A study of causes of death related to alcoholism in France carried out by Sully Lederman was the starting point of a utilization of canonical correlations and canonical variates by Luu-Mau-Thanh, of the Institut de Statistique de L’Université de Paris and the Institut National d’Etudes Démographiques (Luu-Mau-Thanh 1963). The first set of variates consisted of three causes of death: alcoholism, liver diseases, and cerebral hemorrhage. The other set consisted of seven other causes of death. The canonical correlations were found to be .812, .450, and .279. The author also calculated principal components for the two sets. He illustrated another kind of application of canonical analysis by some data on grain collected by Frederick V. Waugh (1942) and analyzed by Maurice G. Kendall (1957). Luu-Mau-Thanh commented that the progress of canonical correlation analysis has been hampered by the heavy computational labor required but that the arrival of modern electronic computers will abolish this difficulty.

Harold Hotelling

[See also Statistics, Descriptive, article on Association.]


Anderson, R. L. 1942 Distribution of the Serial Correlation Coefficient. Annals of Mathematical Statistics 13:1-13.

Anderson, Theodore W. 1958 An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Bowley, Arthur L. (1901) 1937 Elements of Statistics. 6th ed. New York: Scribner; London: King.

David, Florence N. 1938 Tables of the Ordinates and Probability Integral of the Distribution of the Correlation Coefficient in Small Samples. London: University College, Biometrika Office.

Fisher, R. A. 1915 Frequency Distribution of the Values of the Correlation Coefficient in Samples From an Indefinitely Large Population. Biometrika 10:507-521.

Fisher, R. A. 1918 The Correlation Between Relatives on the Supposition of Mendelian Inheritance. Royal Society of Edinburgh, Transactions 52:399-433.

Fisher, R. A. 1921 On the “Probable Error” of a Coefficient of Correlation Deduced From a Small Sample. Metron 1, no. 4:3-32.

Fisher, R. A. 1924 The Distribution of the Partial Correlation Coefficient. Metron 3:329-333.

[Gosset, William S.] (1908) 1943 Probable Error of a Correlation Coefficient. Pages 35-42 in William S. Gosset, “Student’s” Collected Papers. Edited by E. S. Pearson and John Wishart. London: University College, Biometrika Office.

Holzinger, Karl J.; and Harman, Harry H. 1941 Factor Analysis: A Synthesis of Factorial Methods. Univ. of Chicago Press.

Hotelling, Harold 1925 The Distribution of Correlation Ratios Calculated From Random Data. National Academy of Sciences, Proceedings 11:657-662.

Hotelling, Harold 1935 The Most Predictable Criterion. Journal of Educational Psychology 26:139-142.

Hotelling, Harold 1936a Relations Between Two Sets of Variates. Biometrika 28:321-377.

Hotelling, Harold 1936b Simplified Calculation of Principal Components. Psychometrika 1:27-35.

Hotelling, Harold 1943 Some New Methods in Matrix Calculation. Annals of Mathematical Statistics 14: 1-34.

Hotelling, Harold 1953 New Light on the Correlation Coefficient and Its Transforms. Journal of the Royal Statistical Society Series B 15:193-225.

Hotelling, Harold; and Pabst, Margaret R. 1936 Rank Correlation and Tests of Significance Involving No Assumption of Normality. Annals of Mathematical Statistics 7:29-43.

Jackson, Dunham 1924 The Trigonometry of Correlation. American Mathematical Monthly 31:275-280.

Kelley, Truman L. 1928 Crossroads in the Mind of Man: A Study of Differentiable Mental Abilities. Stanford Univ. Press.

Kendall, Maurice G. 1938 A New Measure of Rank Correlation. Biometrika 30:81-93.

Kendall, Maurice G. 1943-1946 The Advanced Theory of Statistics. 2 vols. London: Griffin. → A new edition, written by Maurice G. Kendall and Alan Stuart, was published in 1958-1966.

Kendall, Maurice G. (1948) 1955 Rank Correlation Methods. 2d ed. London: Griffin; New York: Hafner.

Kendall, Maurice G. (1957) 1961 A Course in Multivariate Analysis. London: Griffin.

Koopmans, Tjalling C. 1942 Serial Correlation and Quadratic Forms in Normal Variables. Annals of Mathematical Statistics 13:14-33.

Luu-Mau-Thanh 1963 Analyse canonique et analyse factorielle. Institut de Science Économique Appliquée, Cahiers Series E Supplement 138:127-164.

Pearson, Karl; and Filon, L. N. G. (1898) 1948 Mathematical Contributions to the Theory of Evolution. IV: On the Probable Errors of Frequency Constants and on the Influence of Random Selection on Variation and Correlation. Pages 179-261 in Karl Pearson’s Early Statistical Papers. Cambridge Univ. Press. → First published in Volume 191 of the Philosophical Transactions of the Royal Society of London, Series A.

Soper, H. E. et al. 1917 On the Distribution of the Correlation Coefficient in Small Samples: A Cooperative Study. Biometrika 11, no. 4:328-413.

Spearman, Charles E. 1904 The Proof and Measurement of Association Between Two Things. American Journal of Psychology 15:72-101.

Spearman, Charles E. 1927 The Abilities of Man: Their Nature and Measurement. London: Macmillan.

Thorndike, R. L. 1933 The Effect of the Interval Between Test and Retest Upon the Constancy of the IQ. Journal of Educational Psychology 24:543-549.

Waugh, Frederick V. 1942 Regressions Between Sets of Variables. Econometrica 10:290-310.

Wishart, John 1932 Note on the Distribution of the Correlation Ratio. Biometrika 24:441-456.

Yule, G. Udny; and Kendall, Maurice G. (1911) 1958 An Introduction to the Theory of Statistics. 14th ed., rev. & enl. London: Griffin. → Maurice G. Kendall has been a joint author since the eleventh edition (1937). The 1958 edition was revised by Maurice G. Kendall.


Classification is the identification of the category or group to which an individual or object belongs on the basis of its observed characteristics. When the characteristics are a number of numerical measurements, the assignment to groups is called by some statisticians discrimination, and the combination of measurements used is called a discriminant function. The problem of classification arises when the investigator cannot associate the individual directly with a category but must infer the category from the individual’s measurements, responses, or other characteristics. In many cases it can be assumed that there are a finite number of populations from which the individual may have come and that each population is described by a statistical distribution of the characteristics of individuals. The individual to be classified is considered as a random observation from one of the populations. The question is, Given an individual with certain measurements, from which population did he arise?

R. A. Fisher (1936), who first developed the linear discriminant function in terms of the analysis of variance, gave as an example the assigning of iris plants to one of two species on the basis of the lengths and widths of the sepals and petals. Indian men have been classified into three castes on the basis of stature, sitting height, and nasal depth and height (Rao 1948). Six measurements on a skull found in England were used to determine whether it belonged to the Bronze Age or the Iron Age (Rao 1952). Scores on a battery of tests in a college entrance examination may be used to classify a prospective student into the population of students with potentialities of completing college successfully or into the population of students lacking such potentialities. (In this example the classification into populations implies the prediction of future performance.) Medical diagnosis may be considered as classification into populations of disease.

The problem of classification was formulated as part of statistical decision theory by Wald (1944) and von Mises (1945). [See Decision theory.] There are a number of hypotheses; each hypothesis is that the distribution of the observation is a given one. One of these hypotheses must be accepted and the others rejected. If only two populations are admitted, the problem is the elementary one of testing one hypothesis of a specified distribution against another, although usually in hypothesis testing one of the two hypotheses, the null hypothesis, is singled out for special emphasis [see Hypothesis Testing]. If a priori probabilities of the individual belonging to the populations are known, the Bayesian approach is available [see Bayesian Inference]. In this article it is assumed throughout that the populations have been determined. (Sometimes the word classification is used for the setting up of categories, for example, in taxonomy or typology.) [See Clustering; Typologies.]

The characteristics can be numerical measurements (continuous variables), attributes (discrete variables), or both. Here the case of numerical measurements with probability density functions will be treated, but the case of attributes with frequency functions is treated similarly. The theory applies when only one measurement is available (p = 1) as well as when several are (p ≥ 2). The classification functions based on the approach of statistical decision theory and on the Bayesian approach automatically take into account any correlation between variables. (Karl Pearson’s coefficient of racial likeness, introduced in a paper by M. L. Tildesley [1921] and used as a basis of classification, suffered from its neglect of correlation between measurements.)

Classification for two populations

Suppose that an individual with certain measurements (x1,..., xp) has been drawn from one of two populations, π1 and π2. The properties of these two populations are specified by given probability density functions (or frequency functions), p1(x1,..., xp) and p2(x1,..., xp), respectively. (Each infinite population is an idealization of the population of all possible observations.) The goal is to define a procedure for classifying this individual as coming from π1 or π2. The set of measurements x1,..., xp can be presented as a point in a p-dimensional space. The space is to be divided into two regions, R1 and R2. If the point corresponding to an individual falls in R1 the individual will be classified as drawn from π1, and if the point falls in R2 the individual will be classified as drawn from π2.

Standards for classification

The two regions are to be selected so that on the average the bad effects of misclassification are minimized. In following a given classification procedure, the statistician can make two kinds of errors: If the individual is actually from π1 the statistician may classify him as coming from π2, or if he is from π2 the statistician may classify him as coming from π1. As shown in Table 1, the relative undesirability of these two kinds of misclassification is measured by C(2|1), the “cost” of misclassifying an individual from π1 as coming from π2, and C(1|2), the cost of misclassifying an individual from π2 as coming from π1. These costs may be measured in any consistent units; it is only the ratio of the two costs that is important. While the statistician may not know the costs in each case, he will often have at least a rough idea of them. In practice the costs are often taken as equal.

Table 1: Costs of correct and incorrect classification

                              Population
Statistician’s decision       π1           π2
π1                            0            C(1|2)
π2                            C(2|1)       0

In the example mentioned earlier of classifying prospective students, one “cost of misclassification” is a measure of the undesirability of starting a student through college when he will not be able to finish and the other is a measure of the undesirability of refusing to admit a student who can complete his course. In the case of medical diagnosis with respect to a specified disease, one cost of misclassification is the serious effect on the patient’s health of the disease going undetected and the other cost is the discomfort and waste of treating a healthy person.

If the observation is drawn from π1, the probability of correct classification, P(1|1, R), is the probability of falling into R1, and the probability of misclassification, P(2|1, R) = 1 − P(1|1, R), is the probability of falling into R2. (In each of these expressions R is used to denote the particular classification rule.) For instance,

$$P(1|1, R) = \int_{R_1} p_1(x_1, \ldots, x_p)\, dx_1 \cdots dx_p. \tag{1}$$

The integral in (1) effectively stands for the sum of the probabilities of measurements from π1 in R1. Similarly, if the observation is from π2, the probability of correct classification is P(2|2, R), the integral of p2(x1,..., xp) over R2, and the probability of misclassification is P(1|2, R). If the observation is drawn from π1, there is a cost or loss when the observation is incorrectly classified as coming from π2; the expected loss, or risk, is the product of the cost of a mistake times the probability of making it, r(1, R) = C(2|1)P(2|1, R). Similarly, when the observation is from π2, the expected loss due to misclassification is r(2, R) = C(1|2)P(1|2, R).

In many cases there are a priori probabilities of drawing an observation from one or the other population, perhaps known from relative abundances. Suppose that the a priori probability of drawing from π1 is q1 and from π2 is q2. Then the expected loss due to misclassification is the sum of the products of the probability of drawing from each population times the expected loss for that population:

$$q_1 C(2|1) P(2|1, R) + q_2 C(1|2) P(1|2, R). \tag{2}$$

The regions, R1 and R2, should be chosen to minimize this expected loss.

If one does not have a priori probabilities of drawing from π1 and π2, he cannot write down (2). Then a procedure R must be characterized by the two risks r(1, R) and r(2, R). A procedure R is said to be at least as good as a procedure R* if r(1, R) ≤ r(1, R*) and r(2, R) ≤ r(2, R*), and R is better than R* if at least one inequality is strict. A class of procedures may then be sought so that for every procedure outside the class there is a better one in the class (called a complete class). The smallest such class contains only admissible procedures; that is, no procedure out of the class is better than one in the class. As far as the expected costs of misclassification go, the investigator can restrict his choice of a procedure to a complete class and in particular to the class of admissible procedures if it is available.
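The comparison of procedures by their pairs of risks can be illustrated with a small sketch. The procedure names and risk values below are hypothetical, and the sketch only finds the procedures that are admissible within a finite list of candidates, which is a simplification of the general definition.

```python
def at_least_as_good(r, r_star):
    """r and r_star are (r(1, R), r(2, R)) pairs; True if r is nowhere worse."""
    return r[0] <= r_star[0] and r[1] <= r_star[1]


def better(r, r_star):
    """True if r is at least as good as r_star and strictly better in one risk."""
    return at_least_as_good(r, r_star) and (r[0] < r_star[0] or r[1] < r_star[1])


def admissible(risks):
    """Names of procedures that no other listed procedure is better than."""
    return [name for name, r in risks.items()
            if not any(better(other, r) for other in risks.values())]


# Hypothetical risk pairs (r(1, R), r(2, R)) for four candidate procedures.
risks = {"A": (0.10, 0.30), "B": (0.20, 0.20), "C": (0.25, 0.30), "D": (0.30, 0.10)}

admissible_names = admissible(risks)

# Minimax choice among the candidates: smallest value of the larger risk.
minimax_name = min(risks, key=lambda name: max(risks[name]))
```

Here procedure C is dropped because A is better than it, while A, B, and D cannot be ordered against one another; a further principle such as minimax is needed to pick one of them.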

Usually a complete class consists of more than one procedure. To determine a single procedure as optimum, some statisticians advocate the minimax principle. For a given procedure, R, the less desirable case is to have a drawing from the population with the greater risk. A conservative principle to follow is to choose the procedure so as to minimize the maximum risk [see Decision Theory].

Classification into one of two populations

Known probability distributions

Consider first the case of two populations when a priori probabilities of drawing from π1 and π2 are known; then joint probabilities of drawing from a given population and observing a set of variables within given ranges can be defined. The probability that an observation comes from π1 and that the ith variate is between xi and xi + dxi (i = 1,..., p) is approximately q1p1(x1,..., xp) dx1 ... dxp. Similarly, the probability of drawing from π2 and obtaining an observation with the ith variate falling between xi and xi + dxi (i = 1,..., p) is approximately q2p2(x1,..., xp) dx1 ... dxp. For an actual observation x1,..., xp, the conditional probability that it comes from π1 is

$$\frac{q_1 p_1(x_1, \ldots, x_p)}{q_1 p_1(x_1, \ldots, x_p) + q_2 p_2(x_1, \ldots, x_p)}, \tag{3}$$

and the conditional probability that it comes from π2 is

$$\frac{q_2 p_2(x_1, \ldots, x_p)}{q_1 p_1(x_1, \ldots, x_p) + q_2 p_2(x_1, \ldots, x_p)}. \tag{4}$$

The conditional expected loss if the observation is classified into π2 is C(2|1) times (3), and the conditional expected loss if the observation is classified into π1 is C(1|2) times (4). Minimization of the conditional expected loss is equivalent to the rule

$$R_1: C(2|1)\, q_1 p_1(x_1, \ldots, x_p) \ge C(1|2)\, q_2 p_2(x_1, \ldots, x_p), \qquad R_2: C(2|1)\, q_1 p_1(x_1, \ldots, x_p) < C(1|2)\, q_2 p_2(x_1, \ldots, x_p). \tag{5}$$

(The case of equality in (5) can be neglected if the density functions are such that the probability of equality is zero; if equality in (5) may occur with positive probability, then when such an observation occurs it may be classified as from π1 with an arbitrary probability and from π2 with the complementary probability.) Inequalities (5) may also be written

$$R_1: \frac{p_1(x_1, \ldots, x_p)}{p_2(x_1, \ldots, x_p)} \ge k, \qquad R_2: \frac{p_1(x_1, \ldots, x_p)}{p_2(x_1, \ldots, x_p)} < k, \tag{6}$$

where k = [C(1|2)q2]/[C(2|1)q1]. This is the Bayes solution. These results were first obtained in this way by Welch (1939) for the case of equal costs of misclassification.
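The likelihood-ratio rule (6) can be sketched in a few lines. The two univariate normal populations, the cost values, and the function names below are assumptions chosen only for illustration; any pair of density functions could be substituted.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_threshold(cost_1_given_2, cost_2_given_1, q1):
    """k = [C(1|2) q2] / [C(2|1) q1], with q2 = 1 - q1."""
    return (cost_1_given_2 * (1.0 - q1)) / (cost_2_given_1 * q1)


def classify(x, p1, p2, k):
    """Return 1 if x lies in R1 (p1/p2 >= k), otherwise 2."""
    # Comparing p1 >= k * p2 avoids dividing by a density that may be zero.
    return 1 if p1(x) >= k * p2(x) else 2


# Hypothetical populations: pi_1 is N(0, 1) and pi_2 is N(2, 1).
def p1(x):
    return normal_pdf(x, 0.0, 1.0)


def p2(x):
    return normal_pdf(x, 2.0, 1.0)


k_equal = bayes_threshold(1.0, 1.0, 0.5)   # equal costs and priors give k = 1
k_costly = bayes_threshold(1.0, 4.0, 0.5)  # C(2|1) = 4 gives k = 0.25
```

For these assumed populations the ratio p1(x)/p2(x) equals exp(2 − 2x), so with k = 1 the boundary between R1 and R2 is x = 1, and lowering k to 0.25 moves the boundary to the right and enlarges R1.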

These inequalities seem intuitively reasonable. If the probability of drawing from π1 is decreased or if the cost of misclassifying into π2 is decreased, the inequality in (6) for R1 is satisfied by fewer points. Since the regions depend on q1 and q2 the expected loss does also. The curve A in Figure 1 is a graph of this minimum expected loss as a function of q1.

It may very well happen that the statistician errs in assigning his a priori probabilities. (The probabilities might be estimated from a sample of individuals whose populations of origin are known or can be identified by means other than the measurements for classification; for example, disease categories might be identified by subsequent autopsy.) Suppose that the statistician uses q̄1 and q̄2 (= 1 − q̄1) when q1 and q2 (= 1 − q1) are the actual probabilities of drawing from π1 and π2, respectively. Then the actual expected loss is

q1C(2|1)P(2|1, R̄) + (1 − q1)C(1|2)P(1|2, R̄),

where R̄1 and R̄2 are based on q̄1 and q̄2. Given the regions R̄1 and R̄2, this is a linear function of q1, graphed as the line B in Figure 1, a line that touches A at q1 = q̄1. The line cannot go below A because the best regions are defined by (6). From the graph it is clear that a small error in q1 is not very important.
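The relation between the curve A and the line B can be checked numerically. The sketch below assumes two univariate normal populations, N(0, 1) and N(d, 1), and equal costs by default; these choices and the function names are illustrative assumptions, not part of the original discussion.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_loss(q1_true, q1_assumed, d=2.0, c21=1.0, c12=1.0):
    """Actual expected loss when the regions are built from q1_assumed.

    Assumed populations: pi_1 is N(0, 1) and pi_2 is N(d, 1), so that
    p1(x)/p2(x) = exp(d*d/2 - d*x) and R1 is the half-line x <= t.
    """
    k = (c12 * (1.0 - q1_assumed)) / (c21 * q1_assumed)
    t = d / 2.0 - math.log(k) / d
    p_2_given_1 = 1.0 - phi(t)      # observation from pi_1 falls in R2
    p_1_given_2 = phi(t - d)        # observation from pi_2 falls in R1
    return q1_true * c21 * p_2_given_1 + (1.0 - q1_true) * c12 * p_1_given_2


def curve_a(q1):
    """Minimum expected loss: regions based on the true q1."""
    return expected_loss(q1, q1)


def line_b(q1, q1_assumed=0.5):
    """Expected loss when the regions are fixed by an assumed prior."""
    return expected_loss(q1, q1_assumed)


grid = [i / 20.0 for i in range(1, 20)]
gaps = [line_b(q) - curve_a(q) for q in grid]
```

In this setting the gaps are never negative and vanish at q1 = 0.5, the assumed prior, which is the tangency described above.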

When the statistician cannot assign a priori probabilities to the two populations, he uses the fact that the class of Bayes solutions (6) is identical (in most cases) to the class of admissible solutions. A complete class of procedures is given by (6) with k ranging from 0 to ∞. (If the probability that the ratio is equal to k is positive a complete class would have to include procedures that randomize between the two classifications when the value of the ratio is k.)

The minimax procedure is one of the admissible procedures. Since R2 increases as k increases, and hence r(1, R) increases as k increases, and at the same time r(2, R) decreases, the choice of k giving the minimax solution is the one for which r(1, R) = r(2, R). This is then the average loss, for it is immaterial which population is drawn from. The graph of the risk against a priori probability q1 is, therefore, a horizontal line (labeled C in Figure 1). Since there is one value of q1, say q1*, such that k = [C(1|2)(1 − q1*)]/[C(2|1)q1*], the line C must touch A.

Two known multivariate normal populations. An important example of the general theory is that in which the populations have multivariate normal distributions with the same set of variances and correlations but with different sets of means. [See Multivariate Analysis: Overview.]

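Suppose that x1,..., xp have a joint normal distribution with means in π1 of μ1(1),..., μp(1) and in π2 of μ1(2),..., μp(2). Let the common set of variances and correlations be σ1²,..., σp², ρ12, ρ13,..., ρp−1,p. It is convenient to write (6) as

$$R_1: \ln \frac{p_1(x_1, \ldots, x_p)}{p_2(x_1, \ldots, x_p)} \ge \ln k, \qquad R_2: \ln \frac{p_1(x_1, \ldots, x_p)}{p_2(x_1, \ldots, x_p)} < \ln k,$$

where “ln” denotes the natural logarithm. In this particular case

$$\ln \frac{p_1(x_1, \ldots, x_p)}{p_2(x_1, \ldots, x_p)} = \sum_{i=1}^{p} \lambda_i x_i - \frac{1}{2} \sum_{i=1}^{p} \lambda_i \left(\mu_i^{(1)} + \mu_i^{(2)}\right), \tag{7}$$

where λ1,..., λp form the solution of the linear equations

$$\sum_{j=1}^{p} \sigma_i \sigma_j \rho_{ij} \lambda_j = \mu_i^{(1)} - \mu_i^{(2)}, \qquad i = 1, \ldots, p \quad (\rho_{ii} = 1).$$

The first term on the right side of (7) is the well-known linear discriminant function obtained by Fisher (1936) by choosing that linear function for which the difference in expected values for the two populations relative to the standard deviation is a maximum. The second term is a constant consisting of the average discriminant function at the two population means. The regions are given by

$$R_1: \sum_{i=1}^{p} \lambda_i x_i - \frac{1}{2} \sum_{i=1}^{p} \lambda_i \left(\mu_i^{(1)} + \mu_i^{(2)}\right) \ge \ln k, \qquad R_2: \sum_{i=1}^{p} \lambda_i x_i - \frac{1}{2} \sum_{i=1}^{p} \lambda_i \left(\mu_i^{(1)} + \mu_i^{(2)}\right) < \ln k. \tag{8}$$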

If a priori probabilities are assigned, then k is [C(1|2)q2]/[C(2|1)q1]. In particular, if k = 1 (for example, if C(1|2) = C(2|1) and q1 = q2 = ½), ln k = 0, and the procedure is to compare the discriminant function of the observations with the discriminant function of the averages of the respective means.
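A minimal sketch of the linear discriminant rule for two normal populations with known parameters follows. It assumes the common variances and correlations have been assembled into a covariance matrix, solves the linear equations for the coefficients with a small hand-written elimination routine, and uses hypothetical means and covariances for a case with p = 2.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def discriminant(mu1, mu2, cov):
    """Return the coefficients and the function ln[p1(x)/p2(x)]."""
    lam = solve(cov, [a - b for a, b in zip(mu1, mu2)])
    const = 0.5 * sum(l * (a + b) for l, a, b in zip(lam, mu1, mu2))

    def u(x):
        return sum(l * xi for l, xi in zip(lam, x)) - const

    return lam, u


def classify(x, u, k=1.0):
    """Classify into pi_1 when the discriminant is at least ln k."""
    return 1 if u(x) >= math.log(k) else 2


# Hypothetical parameters: unit variances and correlation 0.5.
mu1, mu2 = [1.0, 0.0], [-1.0, 0.0]
cov = [[1.0, 0.5], [0.5, 1.0]]
lam, u = discriminant(mu1, mu2, cov)
```

With k = 1 the boundary passes through the midpoint of the two mean vectors, where the discriminant function is zero.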

If a priori probabilities are not known, the same class of procedures (8) is used as the admissible class. Suppose the aim is to find ln k = c, say, so that the expected loss when the observation is from π1 is equal to the expected loss when the observation is from π2. The probabilities of misclassification can be computed from the distribution of

$$U = \sum_{i=1}^{p} \lambda_i x_i - \frac{1}{2} \sum_{i=1}^{p} \lambda_i \left(\mu_i^{(1)} + \mu_i^{(2)}\right)$$

when x1,..., xp are from π1 and when x1,..., xp are from π2. Let Δ² be the Mahalanobis measure of distance between π1 and π2,

$$\Delta^2 = \sum_{i=1}^{p} \lambda_i \left(\mu_i^{(1)} - \mu_i^{(2)}\right).$$

The distribution of U is normal with variance Δ².

If the observation is from π1 the mean of U is ½Δ²; if the observation is from π2 the mean is −½Δ².

The probability of misclassification if the observation is from π1 is

$$P(2|1, R) = \Phi\left(\frac{c - \frac{1}{2}\Delta^2}{\Delta}\right),$$

where Φ(z) is the probability that a normal deviate with mean 0 and variance 1 is less than z. The probability of misclassification if the observation is from π2 is

$$P(1|2, R) = 1 - \Phi\left(\frac{c + \frac{1}{2}\Delta^2}{\Delta}\right).$$

Figure 2 indicates the two probabilities as the shaded portion in the tails. The aim is to choose c so that

$$C(1|2)\left[1 - \Phi\left(\frac{c + \frac{1}{2}\Delta^2}{\Delta}\right)\right] = C(2|1)\, \Phi\left(\frac{c - \frac{1}{2}\Delta^2}{\Delta}\right).$$

If the costs of misclassification are equal, c = 0 and the common probability of misclassification is Φ(−½Δ). In case the costs of misclassification are unequal, c can be determined to sufficient accuracy by a trial-and-error method with the normal tables.
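The trial-and-error search for c can be mechanized by bisection, since the risk for π1 increases with c while the risk for π2 decreases. The sketch below assumes U is normal with variance Δ² and means ±½Δ², as stated above; the function names, the search interval, and the numerical values of Δ and the costs are illustrative assumptions.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def risks(c, delta, cost_2_given_1, cost_1_given_2):
    """Return (r(1, R), r(2, R)) for the rule: classify into pi_1 when U >= c."""
    p_2_given_1 = phi((c - 0.5 * delta ** 2) / delta)
    p_1_given_2 = 1.0 - phi((c + 0.5 * delta ** 2) / delta)
    return cost_2_given_1 * p_2_given_1, cost_1_given_2 * p_1_given_2


def minimax_c(delta, cost_2_given_1, cost_1_given_2, tol=1e-10):
    """Bisect for the c at which the two risks are equal."""
    # Interval wide enough that r(1, R) < r(2, R) at lo and the reverse at hi
    # for costs of moderate ratio (an assumption of this sketch).
    lo = -0.5 * delta ** 2 - 10.0 * delta
    hi = 0.5 * delta ** 2 + 10.0 * delta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        r1, r2 = risks(mid, delta, cost_2_given_1, cost_1_given_2)
        if r1 < r2:
            lo = mid   # r(1, R) still too small: move c upward
        else:
            hi = mid
    return 0.5 * (lo + hi)


c_equal = minimax_c(2.0, 1.0, 1.0)      # equal costs: c should be 0
c_unequal = minimax_c(2.0, 4.0, 1.0)    # C(2|1) = 4, C(1|2) = 1
```

When C(2|1) is the larger cost, the solution has c below zero, which enlarges R1 and reduces the chance of sending an observation from π1 to π2.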

If the set of variances and correlations in one population is not the same as the set in the other population, the general theory can be applied, but ln[p1(x1,..., xp)/p2(x1,..., xp)] is a quadratic, not a linear, function of x1,..., xp. Anderson and Bahadur (1962) treat linear functions for this case.

Classification with estimated parameters

In most applications of the theory the populations are not known but must be inferred from samples, one from each population.

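Two multivariate normal populations. Consider now the case in which there are available random samples from two normal populations and in which the aim is to use that information in classifying another observation as coming from one of the two populations. Suppose the sample xiα(1) (i = 1,..., p; α = 1,..., N(1)) is from π1 and the sample xiα(2) (i = 1,..., p; α = 1,..., N(2)) is from π2. Then μi(1) can be estimated by the mean x̄i(1) of the ith variate of the first sample and μi(2) by the mean x̄i(2) of the second sample. The usual estimate of the common covariance σiσjρij based on the two samples is

$$s_{ij} = \frac{1}{N^{(1)} + N^{(2)} - 2} \left[ \sum_{\alpha=1}^{N^{(1)}} \left(x_{i\alpha}^{(1)} - \bar{x}_i^{(1)}\right)\left(x_{j\alpha}^{(1)} - \bar{x}_j^{(1)}\right) + \sum_{\alpha=1}^{N^{(2)}} \left(x_{i\alpha}^{(2)} - \bar{x}_i^{(2)}\right)\left(x_{j\alpha}^{(2)} - \bar{x}_j^{(2)}\right) \right].$$

These estimates may then be substituted into the definition of U, to obtain a new linear function of x1,..., xp depending on these estimates. The classification function is

$$\sum_{i=1}^{p} l_i x_i - \frac{1}{2} \sum_{i=1}^{p} l_i \left(\bar{x}_i^{(1)} + \bar{x}_i^{(2)}\right),$$

where the coefficients l1,..., lp are the solution to

$$\sum_{j=1}^{p} s_{ij} l_j = \bar{x}_i^{(1)} - \bar{x}_i^{(2)}, \qquad i = 1, \ldots, p.$$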

Since there are now sampling variations in the estimates of parameters, it is no longer possible to state that this procedure is best in either of the senses used earlier, but it seems to be a reasonable procedure. (A result of Das Gupta [1965] shows that when N(1) = N(2) and the costs of misclassification are equal, the procedure with c = 0 is minimax and admissible.)

The exact distributions of the classification statistic based on estimated coefficients cannot be given explicitly; however, the distribution can be indicated as an integral (with respect to three variables). It can be shown that as the sample sizes increase, the distributions of this statistic approach those of the statistic used when the parameters are known. Thus for sufficiently large samples one can proceed exactly as if the parameters were known. Asymptotic expansions of the distributions are available (Bowker & Sitgreaves 1961).

A mnemonic device for the computation of the discriminant function (Fisher 1938) is the introduction of the dummy variate, y, which is equal to a constant (say, 1) when the observation is from π1 and is equal to another constant (say, 0) when the observation is from π2. Then (formally) the regression of this dummy variate, y, on the observed variates x1,..., xp over the two samples gives a linear function proportional to the discriminant function. In a sense this linear function is a predictor of the dummy variate, y.
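The proportionality between the dummy-variate regression and the sample discriminant function can be verified on a small example. The two samples below are invented for illustration, and the helper functions are assumptions of this sketch; the regression is fitted with an intercept by working with deviations from the overall means.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def mean(rows):
    """Vector of column means."""
    return [sum(r[i] for r in rows) / len(rows) for i in range(len(rows[0]))]


def sscp(rows, center):
    """Sums of squares and cross-products of the rows about a center."""
    p = len(center)
    return [[sum((r[i] - center[i]) * (r[j] - center[j]) for r in rows)
             for j in range(p)] for i in range(p)]


# Hypothetical samples from pi_1 and pi_2 with p = 2 variates.
sample1 = [[2.0, 3.0], [3.0, 5.0], [4.0, 4.0], [5.0, 6.0]]
sample2 = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [1.0, 3.0]]
n1, n2 = len(sample1), len(sample2)
m1, m2 = mean(sample1), mean(sample2)
p = len(m1)

# Discriminant coefficients from the pooled covariance estimate.
w1, w2 = sscp(sample1, m1), sscp(sample2, m2)
pooled = [[(w1[i][j] + w2[i][j]) / (n1 + n2 - 2) for j in range(p)]
          for i in range(p)]
disc = solve(pooled, [a - b for a, b in zip(m1, m2)])

# Regression of the dummy variate y (1 for pi_1, 0 for pi_2) on x.
rows = sample1 + sample2
y = [1.0] * n1 + [0.0] * n2
mx, my = mean(rows), sum(y) / len(y)
sxy = [sum((r[i] - mx[i]) * (yi - my) for r, yi in zip(rows, y))
       for i in range(p)]
reg = solve(sscp(rows, mx), sxy)

# Each regression coefficient divided by the matching discriminant coefficient.
ratios = [b / l for b, l in zip(reg, disc)]
```

The entries of `ratios` agree up to rounding, so the regression coefficients differ from the discriminant coefficients only by a common scale factor.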

In practice the investigator might not be certain that the two populations differ. To test the null hypothesis that μi(1) = μi(2), i = 1,..., p, he can use the discriminant function of the difference in sample means,

$$\sum_{i=1}^{p} l_i \left(\bar{x}_i^{(1)} - \bar{x}_i^{(2)}\right),$$

which is (N(1) + N(2))/(N(1)N(2)) times Hotelling’s generalized T². The T²-test may thus be considered as part of discriminant analysis. [See Multivariate Analysis: Overview.]

Classification for several populations

So far, classification into one of only two groups has been discussed; consider now the problem of classifying an observation into one of several groups. Let π1,..., πm be m populations with density functions p1(x1,..., xp),..., pm(x1,..., xp), respectively. The aim is to divide the space of observations into m mutually exclusive and exhaustive regions R1,..., Rm. If an observation falls into Rg it will be considered to have come from πg. Let the cost of classifying an observation from πg as coming from πh be C(h|g). The probability of this misclassification is

$$P(h|g, R) = \int_{R_h} p_g(x_1, \ldots, x_p)\, dx_1 \cdots dx_p.$$

If the observation is from πg, the expected loss or risk is

$$r(g, R) = \sum_{h \ne g} C(h|g) P(h|g, R).$$

Given a priori probabilities of the populations, q1,..., qm, the expected loss is

$$\sum_{g=1}^{m} q_g\, r(g, R) = \sum_{g=1}^{m} q_g \sum_{h \ne g} C(h|g) P(h|g, R).$$

R1,..., Rm are to be chosen to make this a minimum.

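Using a priori probabilities for the populations, one can define the conditional probability that an observation comes from a specified population, given the values of observed variates, x1,..., xp. The conditional probability of the observation coming from πg is

$$\frac{q_g p_g(x_1, \ldots, x_p)}{\sum_{k=1}^{m} q_k p_k(x_1, \ldots, x_p)}.$$

If the observation is classified as from πh, the expected loss is

$$\sum_{g \ne h} \frac{q_g p_g(x)}{\sum_{k=1}^{m} q_k p_k(x)}\, C(h|g), \tag{9}$$

where x stands for the set x1,..., xp. The expected loss is minimized at this point if h is chosen to minimize (9). The regions are

$$R_k: \sum_{g \ne k} q_g p_g(x)\, C(k|g) \le \sum_{g \ne h} q_g p_g(x)\, C(h|g), \qquad h = 1, \ldots, m;\ h \ne k.$$

If C(h|g) = 1 for all g and h (g ≠ h), then x1,..., xp is in Rk if

$$q_k p_k(x) \ge q_h p_h(x), \qquad h = 1, \ldots, m;\ h \ne k. \tag{10}$$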

In this case the point x1,..., xp is in Rk if k is the index for which qgpg(x) is a maximum, that is, πk is the most probable population, given the observation. If equalities can occur with positive probability so that there is not a unique maximum, then any maximizing population may be chosen without affecting the expected loss.
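The rule for several populations can be sketched as a direct minimization of the conditional expected loss. The three univariate normal populations, the priors, and the cost table below are hypothetical; the common denominator of the conditional probabilities is omitted because it does not affect which classification gives the smallest loss.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def classify(x, priors, densities, cost=None):
    """Return the (1-based) index h minimizing sum over g != h of q_g p_g(x) C(h|g).

    cost[h][g] holds C(h|g) with 0-based indices; if cost is None, every
    misclassification costs 1 and the rule picks the largest q_g p_g(x).
    """
    m = len(priors)
    weights = [q * p(x) for q, p in zip(priors, densities)]
    if cost is None:
        cost = [[0.0 if h == g else 1.0 for g in range(m)] for h in range(m)]
    losses = [sum(weights[g] * cost[h][g] for g in range(m) if g != h)
              for h in range(m)]
    return min(range(m), key=lambda h: losses[h]) + 1


# Hypothetical populations: N(0, 1), N(3, 1), and N(6, 1) with equal priors.
densities = [lambda x, mu=mu: normal_pdf(x, mu, 1.0) for mu in (0.0, 3.0, 6.0)]
priors = [1.0 / 3.0] * 3

# Hypothetical costs: calling a pi_3 observation pi_2 is ten times as costly.
costs = [[0.0, 1.0, 1.0],
         [1.0, 0.0, 10.0],
         [1.0, 1.0, 0.0]]
```

With equal costs the point x = 4.4 is assigned to π2, the most probable population; under the unequal cost table it is assigned to π3 instead, because the expensive error is avoided.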

If a priori probabilities are not given, an unconditional expected loss for a classification procedure cannot be defined. Then one must consider the risks r(g, R) over all values of g and ask for the admissible procedures; the form is (10) when C(h|g) = 1 for all g and h (g ≠ h). The minimax solution is (10) when q1,..., qm are found so that

$$r(1, R) = r(2, R) = \cdots = r(m, R). \tag{11}$$

This number is the expected loss. (The theory was first given for the case of equal costs of misclassification by von Mises [1945].)

Several multivariate normal populations

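As an example of the theory, consider the case of m multivariate normal populations with the same set of variances and correlations. Let the mean of xi in πg be μi(g). Then

$$u_{gh}(x_1, \ldots, x_p) = \ln \frac{p_g(x_1, \ldots, x_p)}{p_h(x_1, \ldots, x_p)} = \sum_{i=1}^{p} \lambda_i^{(g,h)} x_i - \frac{1}{2} \sum_{i=1}^{p} \lambda_i^{(g,h)} \left(\mu_i^{(g)} + \mu_i^{(h)}\right), \tag{12}$$

where λ1(g,h),..., λp(g,h) are the solution to

$$\sum_{j=1}^{p} \sigma_i \sigma_j \rho_{ij} \lambda_j^{(g,h)} = \mu_i^{(g)} - \mu_i^{(h)}, \qquad i = 1, \ldots, p.$$

For the sake of simplicity, assume that the costs of misclassification are equal. If a priori probabilities, q1,..., qm, are known, the regions are defined by

$$R_k: u_{kh}(x_1, \ldots, x_p) \ge \ln q_h - \ln q_k, \qquad h = 1, \ldots, m;\ h \ne k, \tag{13}$$

where ugh(x1,..., xp) is (12). If a priori probabilities are not known, the admissible procedures are given by (13), with ln qh replaced by suitable constants ch. The minimax procedure is (13), for which (11) holds. To determine the constants ch, use the fact that if the observation is from πg, then ugh(x1,..., xp), h = 1,..., m and h ≠ g, have a joint normal distribution with means

$$\frac{1}{2} \sum_{i=1}^{p} \lambda_i^{(g,h)} \left(\mu_i^{(g)} - \mu_i^{(h)}\right). \tag{14}$$

The variance of ugh(x1,..., xp) is twice (14), and the covariance between the variables ugh(x1,..., xp) and ugk(x1,..., xp) is

$$\sum_{i=1}^{p} \lambda_i^{(g,h)} \left(\mu_i^{(g)} - \mu_i^{(k)}\right).$$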

From these one can determine P(h|g, R) for any set of constants c1,..., cm.

This procedure divides the space by means of hyperplanes. If p = 2 and m = 3, the division is by half-lines, as in Figure 3.
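The division of the space by pairwise linear functions can be sketched for p = 2 and m = 3. The means, the common covariance matrix, and the equal priors below are hypothetical, and the region test assumes the rule takes the form: assign to πk when ukh(x) ≥ ln(qh/qk) for every h ≠ k.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def u(g, h, x, means, cov):
    """Pairwise linear function ln[p_g(x)/p_h(x)] for a common covariance."""
    lam = solve(cov, [a - b for a, b in zip(means[g], means[h])])
    const = 0.5 * sum(l * (a + b) for l, a, b in zip(lam, means[g], means[h]))
    return sum(l * xi for l, xi in zip(lam, x)) - const


def classify(x, means, cov, priors):
    """Return the 1-based k whose region contains x, under equal costs."""
    m = len(means)
    for k in range(m):
        if all(u(k, h, x, means, cov) >= math.log(priors[h] / priors[k])
               for h in range(m) if h != k):
            return k + 1
    return None  # reachable only through rounding exactly on a boundary


# Hypothetical parameters for three bivariate normal populations.
means = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
cov = [[2.0, 0.5], [0.5, 1.0]]
priors = [1.0 / 3.0] * 3
```

Each boundary ukh(x) = ln(qh/qk) is a straight line in the plane, and the three regions meet along pieces of these lines.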

If the populations are unknown, the parameters may be estimated from samples, one from each population. If the samples are large enough, the above procedures can be used as if the parameters were known.

An example of classification into three populations has been given in Anderson (1958).

The problem of classification when (x1,..., xp) are continuous variables with density functions has been treated here. The same solutions are obtained when the variables are discrete, that is, take on a finite or countable number of values. Then p1(x1,..., xp), p2(x1,..., xp), and so on are the respective probabilities (or frequency functions) of (x1,..., xp) in π1, π2, and so on. (See Birnbaum & Maxwell 1960; Cochran & Hopkins 1961.) In this case randomized procedures are essential.

For other expositions see Anderson (1951) and Brown (1950). For further examples see Mosteller and Wallace (1964) and Smith (1947).

T. W. Anderson

[Directly related are the entries Clustering; Screening And Selection.]


Anderson, T. W. 1951 Classification by Multivariate Analysis. Psychometrika 16:31–50.

Anderson, T. W. 1958 An Introduction to Multivariate Statistical Analysis. New York: Wiley.

Anderson, T. W.; and Bahadur, R. R. 1962 Classification Into Two Multivariate Normal Distributions With Different Covariance Matrices. Annals of Mathematical Statistics 33:420–431.

Birnbaum, A.; and Maxwell, A. E. 1960 Classification Procedures Based on Bayes’s Formula. Applied Statistics 9:152–169.

Bowker, Albert H.; and Sitgreaves, Rosedith 1961 An Asymptotic Expansion for the Distribution Function of the W-classification Statistic. Pages 293–310 in Herbert Solomon (editor), Studies in Item Analysis and Prediction. Stanford Univ. Press.

Brown, George W. 1950 Basic Principles for Construction and Application of Discriminators. Journal of Clinical Psychology 6:58–60.

Cochran, William G.; and Hopkins, Carl E. 1961 Some Classification Problems With Multivariate Qualitative Data. Biometrics 17:10–32.

Das Gupta, S. 1965 Optimum Classification Rules for Classification Into Two Multivariate Normal Populations. Annals of Mathematical Statistics 36:1174–1184.

Fisher, R. A. 1936 The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7:179–188.

Fisher, R. A. 1938 The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8:376–386.

Mosteller, Frederick; and Wallace, David L. 1964 Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley.

Rao, C. Radhakrishna 1948 The Utilization of Multiple Measurements in Problems of Biological Classification. Journal of the Royal Statistical Society Series B 10:159–193.

Rao, C. Radhakrishna 1952 Advanced Statistical Methods in Biometric Research. New York: Wiley.

Smith, Cedric A. B. 1947 Some Examples of Discrimination. Annals of Eugenics 13:272–282.

Tildesley, M. L. 1921 A First Study of the Burmese Skull. Biometrika 13:176–262.

Von Mises, Richard 1945 On the Classification of Observation Data Into Distinct Groups. Annals of Mathematical Statistics 16:68–73.

Wald, Abraham 1944 On a Statistical Problem Arising in the Classification of an Individual Into One of Two Groups. Annals of Mathematical Statistics 15:145–162.

Welch, B. L. 1939 Note on Discriminant Functions. Biometrika 31:218–220.

multivariate analysis

views updated May 14 2018

multivariate analysis Univariate analysis consists in describing and explaining the variation in a single variable. Bivariate analysis does the same for two variables taken together (covariation). Multivariate analysis (MVA) considers the simultaneous effects of many variables taken together. A crucial role is played by the multivariate normal distribution, which allows simplifying assumptions to be made (such as the fact that the interrelations of many variables can be reduced to information on the correlations between each pair), which make it feasible to develop appropriate models. MVA models are often expressed in algebraic form (as a set of linear equations specifying the way in which the variables combine with each other to affect the dependent variable) and can also be thought of geometrically. Thus, the familiar bivariate scatter-plot of individuals in the two dimensions representing two variables can be extended to higher-dimensional (variable) spaces, and MVA can be thought of as discovering how the points cluster together.

The most familiar and often-used variants of MVA include extensions of regression analysis and analysis of variance, to multiple regression and multivariate analysis of variance respectively, both of which examine the linear effect of a number of independent variables on a single dependent variable. This forms the basis for estimating the relative (standardized) effects of networks of variables specified in so-called path (or dependence or structural equational) analysis—commonly used to model, for example, complex patterns of intergenerational occupational inheritance. Variants now exist for dichotomous, nominal, and ordinal variables.

A common use of MVA is to reduce a large number of inter-correlated variables into a much smaller number of variables, preserving as much as possible of the original variation, whilst also having useful statistical properties such as independence. These dimensionality-reducing models include principal components analysis, factor analysis, and multi-dimensional scaling. The first (PCA) is a descriptive tool, designed simply to find a small number of independent axes or components which contain decreasing amounts of the original variation. Factor analysis, by contrast, is based on a model which postulates different sources of variation (for example common and unique factors) and generally only attempts to explain common variation. Factor analysis has been much used in psychology, especially in modelling theories of intelligence.

Variants of MVA less commonly used in the social sciences include canonical analysis (where the effects are estimated of a number of different variables on a number of—that is not just one—dependent variables); and discriminant analysis (which maximally differentiates between two or more subgroups in terms of the independent variables).

Recently, much effort has gone into the development of MVA for discrete (nominal and ordinal) data, of particular relevance to social scientists interested in analysing complex cross-tabulations and category counts (the most common form of numerical analysis in sociology). Of especial interest is loglinear analysis (akin both to analysis of variance and chi-squared analysis) which allows the interrelationships in a multi-way contingency table to be presented more simply and parsimoniously.

multivariate analysis

views updated May 21 2018

multivariate analysis The study of multiple measurements on a sample. It embraces many techniques related to a range of different problems.

Cluster analysis seeks to define homogeneous classes within the sample on the basis of the measured variables. Discriminant analysis is a technique for deciding whether an individual should be assigned to a particular predefined class on the basis of the measured variables. Principal component analysis and factor analysis aim to reduce the number of variables in the study to a few (say two or three) that express most of the variation within a sample.

Multivariate probability distributions define probabilities for sets of random variables.

multivariate analysis

views updated May 08 2018

multivariate analysis In a statistical analysis, the measurement of several different attributes of each unit of observation.