Nonparametric Statistics


The articles under this heading are to be regarded not as a handbook of nonparametric procedures but as an introduction that stresses principles through discussion of some important examples. The first article deals mainly with nonparametric inferences about measures of location for one and two populations. The article on ranking methods presents further examples of nonparametric methods, all involving ranking of observations. Other related topics are treated in the articles on order statistics and on runs.

I THE FIELD

Nonparametric, or distribution-free, statistical methods are based on explicitly weaker assumptions than such classical parametric procedures as Student’s t-test, analysis of variance, and standard techniques connected with the correlation coefficient. Examples of nonparametric methods are the sign test, the Wilcoxon tests, and certain confidence interval methods for population quantiles (median, quartile, etc.).

The basic distinction is that parametric procedures make demanding, relatively narrow assumptions about the probability distributions from which empirical observations arise; the most common such distributional assumption is that of normality. Nonparametric methods, in contrast, do not make specific distributional assumptions of this kind.

The dividing line is, of course, not a sharp one; some parametric methods are so robust against errors in distributional assumptions as to be almost nonparametric in practice, while most nonparametric methods are distribution free for only some of their characteristics. For example, most nonparametric hypothesis tests are distribution free under the null hypothesis but distribution dependent with regard to power [see Errors, article on Effects of Errors in Statistical Assumptions].

The underlying motivation for the use of nonparametric methods is the reluctance to make the traditional parametric assumptions, in particular (but not only) the assumption of normality. The relaxation of assumptions is paid for by decreased sharpness of inference when the narrower parametric assumptions are in fact true and by less flexibility in mode of inference. An advantage of some, but by no means all, nonparametric methods is that they are easy and quick to apply; some authors call them “rough and ready.”

Nonparametric procedures are, of course, concerned with parameters, that is, numerical characteristics of distributions, but usually with parameters that have desirable invariance properties under modifications in scale of measurement. For example, the median is an important parameter for nonparametric analysis; if X is a random variable with median m and if f is a strictly increasing function, then the median of f(X) is f(m).

History. Many of the first statistical techniques were applied only to massive quantities of data, so massive that the computed statistics had effectively known normal distributions, usually because the statistics were based on sample moments. Later, the need for methods appropriate to small samples became clear, and the exact distributions of many statistics were derived for specific assumed forms (usually normal) of the underlying distribution. This work, begun by William S. Gosset (“Student”) and R. A. Fisher, led to the standard statistical tests: for instance, t, chi-square, and F.

Somewhat later, although there was substantial overlap, procedures were developed having exact properties without the need for special distributional assumptions. Procedures were also developed that simplified numerical analysis of data; much of the motivation for all this work was a desire for procedures that could be applied to data given in the form of ranks or comparisons. These developments often arose from the work of social scientists dealing with data that clearly did not come from distributions of standard form; for example, rank sum tests were proposed by G. Deuchler and Leon Festinger (psychologists), and one form of analysis of variance by ranks was proposed by Milton Friedman (an economist).

When dealing with data that clearly do not arise from a standard distribution, an alternative to the use of nonparametric methods is the use of transformations, that is, application to each observation of some common function, chosen so as to make the resulting transformed data appear nearly normal or nearly of some other standard distributional form [see Statistical Analysis, Special Problems of, article on Transformations of Data].

In the first section below, the nonparametric analysis of experiments generating a single sample is examined. Artificial data are used in order to concentrate attention on the formal aspects of the statistical analysis. The second section is a less detailed discussion of two-sample problems, and the last mentions some additional important nonparametric problems.

One-sample problems

A single sample of seven numerical observations will be used to illustrate the application of nonparametric procedures. Suppose the observations are (A) -1.96; (B) -.77; (C) -.59; (D) +1.21; (E) +.75; (F) +4.79; and (G) +6.95. These data might be the scores achieved by seven people on a test, the per cent change in prices of seven stocks between two points in time, the ratings of seven communities on some sociological scale, and so forth.

There are several kinds of statistical inferences for each of several aspects of the parent population. Also, for each kind of inference about each aspect of the population there exist several nonparametric techniques. The selection of a kind of inference about an aspect by a particular technique is guided by the experimenter’s interests; the available resources for analysis; the relative costs; the basic probability structure of the data; the criteria of optimality, such as power of tests or length of confidence intervals; and the sensitivity (robustness) of the technique when the underlying assumptions are not satisfied perfectly.

Point estimation of a mean

Assume the data represent a random sample. As a first problem, the mean (a location parameter) of the population is the aspect of interest; the form of inference desired is a point estimate, and the technique of choice is the sample mean,

x̄ = (x₁ + ... + x_n)/n = 1.483.

The following are some justifications for the use of x̄: (a) If the sample is from a normal population, then x̄ is the maximum likelihood estimator of the population mean, θ. (b) If the sample is from a population with finite mean and finite variance, then x̄ is the Gauss-Markov estimator; that is, among linear functions of the observations that have the mean as expected value, x̄ has smallest variance. (c) The least squares value is x̄; that is, x̄ is the value of y that minimizes (-1.96 - y)² + ... + (6.95 - y)². (This last holds whether or not the sample is random.)

Result a is parametric since a functional form is selected for the population; in contrast, b is nonparametric. The least squares result, c, is not dependent on probability considerations.

Point estimation of a median

The sample median, +.75, is sometimes used as a point estimator of the population mean. (When the sampled population is symmetric, the population mean and median are equal.)

For some populations (two-tailed exponential, for example) the sample median is the maximum likelihood estimator of the population mean. The sample median minimizes the sum (equivalently, the mean) of the absolute deviations, |-1.96 - y| + ... + |6.95 - y|.
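The least squares property of the sample mean and the least absolute deviations property of the sample median can be checked numerically on the seven observations. The following is only a sketch: the grid search over candidate values y (step .01, range -2 to 7) and the helper names `sum_sq` and `sum_abs` are illustrative choices, not part of the original discussion.

```python
from statistics import mean, median

# The seven observations A through G from the text.
data = [-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95]

def sum_sq(y):
    """Sum of squared deviations of the data from y."""
    return sum((x - y) ** 2 for x in data)

def sum_abs(y):
    """Sum of absolute deviations of the data from y."""
    return sum(abs(x - y) for x in data)

x_bar = mean(data)
x_med = median(data)

# Crude grid search over candidate values y = -2.00, -1.99, ..., 7.00.
grid = [i / 100 for i in range(-200, 701)]
best_sq = min(grid, key=sum_sq)    # grid point closest to the sample mean
best_abs = min(grid, key=sum_abs)  # grid point at the sample median

print(f"sample mean   = {x_bar:.3f}; squared loss is smallest at y = {best_sq:.2f}")
print(f"sample median = {x_med:.2f}; absolute loss is smallest at y = {best_abs:.2f}")
```

Neither check involves any probability model, which is the sense in which the least squares result is "not dependent on probability considerations."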

Parametric confidence intervals for a mean

If it can be assumed that the data are from a normal population (mean θ and variance σ² unknown), to form the conventional two-sided confidence interval with confidence level 1 - α one first computes x̄ and s² (the sample variance, with divisor n - 1, where n is sample size) and then forms the interval

x̄ ± t_{n-1,α/2} s/n^(1/2),

where t_{n-1,α/2} is the upper α/2 quantile of the t-distribution with n - 1 degrees of freedom. For the present data, x̄ = 1.483, s² = 10.44, and the conventional 95 per cent confidence interval is (-1.508, 4.474).

Nonparametric confidence intervals for a median

In a sample of size n let x_(1) < x_(2) < ... < x_(n) be the ordered observations, the order statistics. (It will be assumed throughout that no ties occur between observations, that is, that the observations come from a continuous population.) If Med is the population median, then the probability of an observation's being less (greater) than Med is 1/2. The probability that all the observations are less (greater) than Med is 2^(-n). The event x_(i) < Med < x_(n-i+1) occurs when at least i of the observations are less than Med and at least i of the observations are greater than Med. Hence, x_(i) < Med < x_(n-i+1) has the same probability as does obtaining at least i heads and at least i tails in n tosses of a fair coin. From the binomial distribution, one obtains

Pr{x_(i) < Med < x_(n-i+1)} = Σ_{j=i}^{n-i} C(n, j) 2^(-n),

where C(n, j) = n!/[j!(n - j)!] is the binomial coefficient.

In words, (x_(i), x_(n-i+1)) is a confidence interval for Med with confidence level given by the above formula; for example, (x_(1), x_(n)) is a confidence interval for Med with confidence level 1 - 2^(-n+1). Thus, for the present data with i = 1, (-1.96, 6.95) is a confidence interval with confidence level 63/64 = .984, and with i = 2, (-.77, 4.79) has confidence level 7/8 = .875.
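The coin-tossing argument translates directly into a short computation. The sketch below lists every order-statistic interval available for the seven observations together with its exact confidence level; the function name `median_interval_level` is an illustrative choice.

```python
from fractions import Fraction
from math import comb

def median_interval_level(n, i):
    """Exact confidence level of (x_(i), x_(n-i+1)) as an interval for the median.

    This is the probability of at least i heads and at least i tails
    in n tosses of a fair coin.
    """
    return Fraction(sum(comb(n, j) for j in range(i, n - i + 1)), 2 ** n)

data = sorted([-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95])
n = len(data)

# i runs over the values for which x_(i) lies below x_(n-i+1).
for i in range(1, (n + 1) // 2):
    lower, upper = data[i - 1], data[n - i]
    level = median_interval_level(n, i)
    print(f"i = {i}: ({lower}, {upper}), confidence level {level} = {float(level):.3f}")
```

Exact fractions are used so that the small set of attainable confidence levels is visible without rounding.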

Confidence intervals for the mean of a normal population are available at any desired confidence level. For the above nonparametric procedure, particularly for very small sample sizes, there are sharp restrictions on the available confidence levels. The restriction occurs because nonparametric procedures are often based on random variables with discrete distributions even though the underlying random variables have continuous distributions.

Comparisons of confidence intervals

The confidence statement appears more definitive the shorter the confidence interval. One relevant criterion is the expected value of the squared length of the confidence interval. The value of expected squared length for the confidence interval based on the t-distribution is 4t²_{n-1,α/2}σ²/n, since the interval has length 2t_{n-1,α/2}s/n^(1/2) and s² has expected value σ². If the normality assumption is dropped, then the t-distribution confidence interval will, in general, no longer have the desired confidence level, although its expected squared length remains the same. (For some specific nonnormal assumptions, uniform distributions for example, confidence intervals at the desired confidence level may be obtained with much smaller expected squared lengths.) The expected squared length of the order statistic confidence interval, assuming only symmetry of the distribution, is E[(x_(n-i+1) - x_(i))²]. If normality actually holds, the confidence interval based on the t-distribution has smaller expected squared length: with n = 7 the ratio of expected squared lengths is .796 for confidence level .984 and .691 for confidence level .875. A general approximate result is that to obtain equal expected squared lengths one needs a sample size for the order statistics interval of about π/2 times the sample size for the t-distribution interval.

Tests of hypotheses about a mean

Aside from the estimation of a parameter, one often wishes to test hypotheses about it. If the data are from a normal population, then the best procedure, uniformly most powerful unbiased of similar tests, is based on t = n^(1/2)(x̄ - θ₀)/s, where θ₀ is the hypothetical value for the population mean. Consider testing the null hypothesis θ₀ = -.5 on the basis of the specific sample of seven; the resulting value of the t-statistic is 1.62, which is statistically significant at the .1575 level (two-sided). For these data and this test statistic the null hypothesis would be rejected if the required significance level exceeds .1575, but if the significance level is smaller, the null hypothesis would not be rejected. The power of this test depends on the quantity (θ - θ₀)²/σ², where θ is the population mean [see Hypothesis Testing].
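The t-statistic for the seven observations can be recomputed from its definition, t = n^(1/2)(x̄ - θ₀)/s. This is a minimal sketch using only the standard library; it stops at the statistic itself, because converting it to a significance level requires a table or routine for the t-distribution with n - 1 degrees of freedom, which is not included here.

```python
from math import sqrt
from statistics import mean, stdev

data = [-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95]
theta0 = -0.5                 # hypothetical value of the population mean

n = len(data)
x_bar = mean(data)
s = stdev(data)               # sample standard deviation, divisor n - 1

# Standardized distance of the sample mean from the hypothesized mean.
t_stat = sqrt(n) * (x_bar - theta0) / s

print(f"x_bar = {x_bar:.3f}, s^2 = {s * s:.2f}, t = {t_stat:.2f} on {n - 1} degrees of freedom")
```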

Sign test for a median. Perhaps the simplest nonparametric test comparable to the t-test is the sign test. It is easily applied, which makes it useful for the preliminary analysis of data and for the analysis of data of passing interest. The null hypothesis for the sign test specifies the population median, Med₀. In the example, the null hypothesis is that the median is -.5. The test statistic is the number of observations greater than Med₀. In the example, the value of the statistic is 4. (The term "sign test" arises because one counts the number of differences, observation minus Med₀, that are positive.) If one rejects the null hypothesis whenever the number of positive signs is less than i or greater than n - i, i < n/2, the significance level will be

2 Σ_{j=0}^{i-1} C(n, j) 2^(-n).

In this example the possible significance levels are 0 (i = 0), 1/64 (i = 1), 1/8 (i = 2), 29/64 (i = 3), and 1 (i = 4). For these data, the sample significance level is 1. The sign test leads to exact levels of significance when the observations are independent and each has probability 1/2 of exceeding Med₀. It is not necessary to have each observation from the same population. One needs only to compare each observation with the hypothetical median. The test can be applied even when quantitative measurements cannot be made, as long as the signs of the comparisons are available.
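The sign test for the example can be sketched in a few lines. The function names are illustrative, and the two-sided sample significance level is taken here as twice the smaller binomial tail (capped at 1), which agrees with the rejection rule described above.

```python
from fractions import Fraction
from math import comb

data = [-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95]
med0 = -0.5   # hypothesized median

def sign_statistic(xs, m0):
    """Number of observations greater than the hypothesized median."""
    return sum(1 for x in xs if x > m0)

def sign_test_level(n, i):
    """Level of the rule: reject when the count is < i or > n - i."""
    return min(Fraction(1), Fraction(2 * sum(comb(n, j) for j in range(i)), 2 ** n))

def sign_test_p_value(n, z):
    """Two-sided sample significance level for an observed count z."""
    lower = sum(comb(n, j) for j in range(0, z + 1))   # counts with Z <= z
    upper = sum(comb(n, j) for j in range(z, n + 1))   # counts with Z >= z
    return min(Fraction(1), Fraction(2 * min(lower, upper), 2 ** n))

n = len(data)
z = sign_statistic(data, med0)
levels = [sign_test_level(n, i) for i in range(5)]
p_value = sign_test_p_value(n, z)

print("number of positive signs:", z)
print("available levels:", [str(level) for level in levels])
print("sample significance level:", p_value)
```

Only the signs of the differences enter the computation, so the same code applies when the comparisons with Med₀ are available but quantitative measurements are not.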

The power of the sign test depends on p, the probability that an observation exceeds Med₀. When p ≠ 1/2, the power of the sign test approaches 1 as the sample size increases. Tests having this property are said to be consistent. All reasonable tests, such as those discussed below, will have such consistency properties. When p ≠ 1/2, the power of the sign test is always greater than the significance level. Tests having this property are said to be unbiased.

The sample significance level can be found from a binomial table with p = 1/2, and the power from a general binomial table. If z is the number of positive results in an experiment with moderately large n, one can approximate the significance level by computing t' = (2z - n)/n^(1/2) and referring t' to a normal table. A somewhat better approximation can be obtained by replacing z with z + 1/2 (for a lower-tail probability; z - 1/2 for an upper-tail probability) in computing t'; this is called a continuity correction and often arises when a discrete distribution is approximated by a continuous distribution.
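The effect of the continuity correction can be seen by comparing the normal approximation with the exact binomial tail. The numbers below (14 positive signs among 40 observations) are hypothetical and chosen only for illustration; the shift of one half is the usual continuity correction for a lower-tail probability.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_lower_tail(n, z):
    """Exact Pr(Z <= z) when Z is binomial with parameters n and 1/2."""
    return sum(comb(n, j) for j in range(z + 1)) / 2 ** n

def approx_lower_tail(n, z, shift=0.0):
    """Normal approximation: refer t' = (2(z + shift) - n)/sqrt(n) to a normal table."""
    t_prime = (2 * (z + shift) - n) / sqrt(n)
    return NormalDist().cdf(t_prime)

n, z = 40, 14   # hypothetical data: 14 positive signs in 40 observations

exact = exact_lower_tail(n, z)
plain = approx_lower_tail(n, z)                 # no continuity correction
corrected = approx_lower_tail(n, z, shift=0.5)  # with continuity correction

print(f"exact lower tail       = {exact:.4f}")
print(f"normal, uncorrected    = {plain:.4f}")
print(f"normal, with shift 1/2 = {corrected:.4f}")
```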

The sign test and the nonparametric confidence intervals described above are related in the same manner as the t-test and confidence intervals based on the t-distribution.

The sign procedure is easily put into a sequential form; experiments in this form often save money and time [see Sequential Analysis].

Conventionalized data—signed ranks

One can think of the sign statistic as the result of replacing the observations by certain conventional numbers, 1 for a positive difference and 0 otherwise, and then proceeding with the analysis of the modified data. A more interesting example is to replace the observations by their signed ranks, that is, to replace the observation whose difference from Med₀ is smallest in absolute value by +1 or -1, as that difference is positive or negative, etc. Thus for the present data with Med₀ = -.5, the signed ranks are -4, -2, -1, +5, +3, +6, and +7.

Signed-rank test or Wilcoxon test for a median. The one-sample signed-rank Wilcoxon statistic, W, is the sum of the positive signed ranks. In the example, W = 21. The exact null distribution of W can be found when the observations are mutually independent and it is equally likely that the jth smallest absolute value has a negative or positive sign, for example, when the observations come from populations that are symmetrical about the median specified by the null hypothesis.

When the null hypothesis is true, the probability that the Wilcoxon statistic is exactly equal to w is found by counting the number of possible samples that yield w as the value of the statistic and then dividing the count by 2ⁿ. When n = 7, the largest value for W, 28, is yielded by the one sample having all positive ranks. In that case Pr(W = 28) = 2^(-7). The distribution of W under the null hypothesis is symmetric around the quantity n(n + 1)/4, so that Pr(W = w) = Pr(W = n(n + 1)/2 - w). Thus for n = 7, Pr(W = 21) = Pr(W = 7). A value of 7 for W can be obtained when the positive ranks are (7), (1,6), (2,5), (3,4), (1,2,4); each parenthesis represents a sample. Thus Pr(W = 21) = Pr(W = 7) = 5/2⁷. By enumeration, 19 of the 128 samples give W ≤ 7, so the probability that W ≥ 21 or W ≤ 7 is 38/128 = .297. The present data are therefore statistically significant at the .297 level when using the Wilcoxon test (two-sided).
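The signed ranks and the exact null distribution of W can be obtained by direct enumeration. The sketch below assumes no ties; the counting routine `null_counts` is an illustrative implementation that tallies, for each possible w, how many of the 2ⁿ assignments of signs to the ranks 1, ..., n give W = w.

```python
from fractions import Fraction

data = [-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95]
med0 = -0.5   # hypothesized median

def signed_ranks(xs, m0):
    """Rank |x - m0| from smallest to largest and attach the sign of x - m0."""
    diffs = [x - m0 for x in xs]
    order = sorted(range(len(diffs)), key=lambda k: abs(diffs[k]))
    ranks = [0] * len(diffs)
    for rank, k in enumerate(order, start=1):
        ranks[k] = rank if diffs[k] > 0 else -rank
    return ranks

def null_counts(n):
    """counts[w] = number of subsets of {1, ..., n} whose elements sum to w."""
    counts = [0] * (n * (n + 1) // 2 + 1)
    counts[0] = 1
    for r in range(1, n + 1):
        # Go downward so that each rank is used at most once.
        for w in range(len(counts) - 1, r - 1, -1):
            counts[w] += counts[w - r]
    return counts

n = len(data)
ranks = signed_ranks(data, med0)
W = sum(r for r in ranks if r > 0)
counts = null_counts(n)

# Two-sided tail: values of w at least as far from n(n + 1)/4 as the observed W.
# Distances are doubled to stay in integers.
far = abs(2 * W - n * (n + 1) // 2)
tail = sum(c for w, c in enumerate(counts) if abs(2 * w - n * (n + 1) // 2) >= far)

print("signed ranks:", ranks)
print("W =", W)
print("Pr(W = 7) =", Fraction(counts[7], 2 ** n))
print("two-sided significance level =", Fraction(tail, 2 ** n))
```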

The sign test has about n/2 possible significance levels, and the Wilcoxon test has about n(n + 1)/4 possible significance levels. For small samples, tables of the exact null distribution of W are available. Under the null hypothesis the mean of W is given by n(n + 1)/4 and the variance of W is given by n(n + 1)(2n + 1)/24. The standardized variable that is derived from W by setting t' = (W - EW)/(var W)^(1/2) has approximately a normal distribution. Most statements about the power of the W test are based on very small samples, Monte Carlo sampling results, or large-sample theory. For small samples, W is easy to compute. As the sample size increases, ranking the data becomes more difficult than computing the t-statistic.

W can be computed by making certain comparisons, without detailed measurements. Denote by m′ (n′) the number of negative (positive) observations. Let u₁, ..., u_n′ be the values of the positive and v₁, ..., v_m′ be the absolute values of the negative observations. Let ∆_ij = ∆(u_i, v_j) be equal to 1 if u_i > v_j and equal to 0 otherwise. Define

S = Σ_iΣ_j∆_ij

Then

W = S + n′(n′ + 1)/2.

If m′ > 0 and n′ > 0, the statistic S/(m′n′) is an unbiased estimator of the probability that a positive observation will be larger than the absolute value of a negative observation. Hence S is of independent interest. The relationship between S and W involves the random quantity n′, so that inferences drawn from S and W need not be the same.
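The relationship W = S + n′(n′ + 1)/2 can be checked on the example. The sketch assumes that, for a hypothesized median Med₀ = -.5, "positive" and "negative" observations mean positive and negative differences from Med₀; that reading is an assumption made here for the illustration.

```python
data = [-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95]
med0 = -0.5

diffs = [x - med0 for x in data]
u = [d for d in diffs if d > 0]    # positive differences (n' of them)
v = [-d for d in diffs if d < 0]   # absolute values of negative differences (m' of them)

# S counts the pairs in which a positive difference exceeds
# the absolute value of a negative difference.
S = sum(1 for ui in u for vj in v if ui > vj)

n_pos = len(u)
W = S + n_pos * (n_pos + 1) // 2

print("n' =", n_pos, " m' =", len(v))
print("S =", S, " S/(m'n') =", S / (len(u) * len(v)))
print("W = S + n'(n' + 1)/2 =", W)
```

Only the outcomes of the comparisons u_i > v_j are used, so W is obtained here without ranking the observations.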

Wilcoxon statistic confidence intervals. Given a population distribution symmetric about the median, the Wilcoxon test generates confidence intervals for the population median, Med. In the example, the significance level 10/128 = .0781 corresponds to rejecting the null hypothesis when W ≥ 25 or W ≤ 3. Thus the corresponding confidence interval at confidence level 1 - 10/128 = .9219 consists of values of Med that will make 4 ≤ W ≤ 24. An examination of the original data (the signed ranks are not now sufficient) yields the interval (-.77, 4.08). An examination of some trial values of Med will help in understanding this result. Thus if Med = 4.2, one sees that F has rank 1, G has rank 2, no other observation has positive rank, and W = 3, which means this null hypothesis would be rejected. If Med = 4, then F has rank 1, G has rank 3, no other observation has positive rank, and W = 4, which means this null hypothesis would be accepted. At the lower end, if Med = -1, only A has negative rank (rank 3), so that W = 25 and this null hypothesis would be rejected.
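The trial-value search can be replaced by a computation on pairwise averages. The sketch relies on a standard identity that the article does not state: when there are no ties, W for a hypothesized median equals the number of averages (x_i + x_j)/2, i ≤ j, that exceed that median. The function names are illustrative.

```python
from itertools import combinations_with_replacement

data = [-1.96, -0.77, -0.59, 1.21, 0.75, 4.79, 6.95]

def pairwise_averages(xs):
    """All averages (x_i + x_j)/2 with i <= j, in increasing order."""
    return sorted((a + b) / 2 for a, b in combinations_with_replacement(xs, 2))

def w_statistic(xs, med):
    """W for the hypothesized median med, counted as the number of
    pairwise averages that exceed med (no ties assumed)."""
    return sum(1 for avg in pairwise_averages(xs) if avg > med)

def wilcoxon_interval(xs, c):
    """Values of Med not rejected by the rule
    'reject when W <= c or W >= n(n + 1)/2 - c'."""
    averages = pairwise_averages(xs)
    return averages[c], averages[-(c + 1)]

for trial in (-0.5, 4.0, 4.2):
    print(f"Med = {trial}: W = {w_statistic(data, trial)}")

lower, upper = wilcoxon_interval(data, 3)
print(f"interval for the rule with c = 3: ({lower:.2f}, {upper:.2f})")
```

The original observations, not just the signed ranks, are needed because the pairwise averages depend on the actual magnitudes.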

Permutation test for a location parameter

The final test considered here for a location parameter (like the median or mean) is the so-called permutation test on the original data. Under the null hypothesis, the observations are mutually independent and come from populations symmetric about the median Med₀. This includes both the cases where the signed ranks are the basic observations and the cases where the signs are the basic observations (scoring +1 for a positive observation and -1 for a negative observation). Given the absolute values of the observations minus Med₀, under the null hypothesis there are 2ⁿ equally likely possible assignments of signs to the absolute values. The nonparametric statistic to be considered is the total, T, of the positive deviations from Med₀. One works with the conditional distribution of T, given the absolute values of the observed differences. Using Med₀ = -.5, for the present data T = 15.70. There are 14 configurations of the signs (the observed one included) that will give a value of T at least this large and another 14 configurations that will give a value of 17.52 - 15.70 = 1.82 or smaller, the latter being the lower tail of the symmetrical distribution of T. Thus the significance level of the permutation test on the present data is 28/128 = .219. With the test, each multiple of 1/64 is a possible significance level. The computations for this procedure are prohibitive in cost if the sample size is not very small. A partial solution is to use some of the assignments of signs (say a random sample) to estimate the conditional distribution of T. The ordinary t-statistic and T are monotone functions of each other; thus a significance test based on t is equivalent to one based on T. The distribution of t for the permutation test is approximately normal for large values of n. Actually, the randomization that occurs in the design of experiments yields the nonparametric structure discussed here.

The power of the permutation test is approximately the same as that of the t-test when the data are from a normal population.
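For a sample of seven the conditional distribution of T can be enumerated completely. In the sketch below the observations are entered in hundredths so that all sums are exact integers; this scaling, and the variable names, are illustrative choices.

```python
from fractions import Fraction
from itertools import product

# Observations and hypothesized median, in hundredths to avoid rounding error.
data = [-196, -77, -59, 121, 75, 479, 695]
med0 = -50

abs_dev = [abs(x - med0) for x in data]
total = sum(abs_dev)                                  # sum of all absolute deviations
t_obs = sum(x - med0 for x in data if x > med0)       # observed T

# T for each of the 2**n equally likely assignments of signs
# (1 marks a deviation counted as positive).
t_values = [
    sum(a for a, positive in zip(abs_dev, signs) if positive)
    for signs in product((0, 1), repeat=len(abs_dev))
]

# Two-sided count: assignments at least as far from total/2 as the observed T.
# Distances are doubled to stay in integers.
extreme = sum(1 for t in t_values if abs(2 * t - total) >= abs(2 * t_obs - total))
p_value = Fraction(extreme, len(t_values))

print("T observed =", t_obs / 100, " sum of absolute deviations =", total / 100)
print("assignments at least as extreme:", extreme, "of", len(t_values))
print("two-sided significance level =", p_value)
```

For larger samples the full enumeration becomes impractical, and one would replace the `product` loop by a random sample of sign assignments, as the text suggests.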

Finally, one can construct confidence intervals for Med₀ using the permutation procedure. The confidence interval with level 122/128 = .955 is (-1.107, 4.376) for the above example.

Tests for a location parameter compared

All the tests are consistent; for a particular alternative and large sample size, each will have power near 1. To compare the large sample power of the tests, a sequence of alternatives approaching the hypothesized null value, Med₀, is usually introduced. Let Med_N be the alternative at the Nth step of this sequence, and assume that Med_N - Med₀ = c/N^(1/2), where c is a constant. As N increases, the alternative hypothesis approaches the null hypothesis and larger sample sizes will be required to obtain a desired power for a specified level of significance. The efficiency of test II compared to test I is defined as the ratio E_I,II(N) = n_I(N)/n_II(N), where n_I and n_II are the sample sizes yielding the desired power for test I and test II. For large N, E_I,II(N) is usually almost independent of test size, power, and the constant c; that is, as N grows, E_I,II(N) usually has a limit, E_I,II, called the Pitman efficiency of test II compared to test I. The Pitman efficiency will depend on the tests under consideration and the family of possible distributions. Consider the two experiments, I and II, to test the hypothesized null value, Med₀. They have sample sizes n_I(N) and n_II(N) and costs per observation γ_I and γ_II. Then the cost of experiment I divided by the cost of experiment II is γ_I n_I(N)/[γ_II n_II(N)] = (γ_I/γ_II)E_I,II(N). When this ratio is greater than 1 (less than 1), experiment II costs less (more) than experiment I and the two experiments have the same power functions. The γ associated with the sign test can be much smaller than the γ's associated with the Wilcoxon and permutation tests.

For normal alternatives, where the parameter is the mean, one has

E_{t-test, permutation} = 1,

E_{t-test, Wilcoxon test} = 3/π = .955,

E_{t-test, sign test} = 2/π = .637.

Tolerance intervals

Tolerance intervals resemble confidence intervals, but the former permit an interpretation in terms of hypothetical future observations. One nonparametric tolerance interval method, based on the order statistics of a random sample of size n, provides the assertion that at least 90 per cent (say) of the underlying population lies in the interval from x_(i) to x_(n-i+1). This assertion is made in the confidence sense; that is, its probability of being correct (before the observations are taken) is, say, .95. The relationship between i, the confidence level (here .95), and the coverage level (here .90) is in terms of the incomplete beta function. Special tables and graphs are available (Walsh 1962-1965).

Another form of nonparametric tolerance interval interprets the interval from x_(i) to x_(n-i+1) as giving a random interval whose probability under the underlying distribution is approximately given by (n - 2i + 1)/(n + 1). In this context the word "approximately" is used to mean the following: the probability measure of the random interval [x_(i), x_(n-i+1)] is itself random; its expectation is (n - 2i + 1)/(n + 1). The probability (before any of the observations are made) that a future observation will lie in a tolerance interval of this form is (n - 2i + 1)/(n + 1).

For the data set, with i = 1, the interval [-1.96, 6.95] is obtained. This interval is expected (in the above sense) to contain [(7-2+1)/(7+1)] = 75 per cent of the underlying population.
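Both kinds of tolerance statement can be computed without special tables. The sketch below uses a standard fact not spelled out in the text: for a continuous population the coverage of [x_(i), x_(n-i+1)] has a beta distribution with parameters n - 2i + 1 and 2i, and the corresponding incomplete beta function can be written as a binomial sum. The function names, and the search for a sample size giving coverage .90 with confidence .95, are illustrative.

```python
from fractions import Fraction
from math import comb

def expected_coverage(n, i):
    """Expected proportion of the population in [x_(i), x_(n-i+1)]."""
    return Fraction(n - 2 * i + 1, n + 1)

def coverage_confidence(n, i, beta):
    """Probability that [x_(i), x_(n-i+1)] contains at least a proportion
    beta of the population: Pr(Binomial(n, beta) <= n - 2i)."""
    return sum(
        comb(n, j) * beta ** j * (1 - beta) ** (n - j)
        for j in range(0, n - 2 * i + 1)
    )

print("expected coverage for n = 7, i = 1:", expected_coverage(7, 1))
print("confidence that coverage is at least .90 for n = 7, i = 1:",
      round(coverage_confidence(7, 1, 0.90), 4))

# Smallest sample size for which (x_(1), x_(n)) covers at least 90 per cent
# of the population with confidence at least .95.
n_needed = next(n for n in range(2, 1000) if coverage_confidence(n, 1, 0.90) >= 0.95)
print("sample size needed for coverage .90 with confidence .95:", n_needed)
```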

Other one-sample tests

Goodness of fit procedures are used to test the hypothesis that the data came from a particular population or to form confidence belts for the whole distribution function. [These are discussed in Goodness of Fit.]

Tests of the randomness of a sample (that is, of the assumption that the observations are independent and drawn from the same distribution) can be carried out by counting the number of runs above and below the sample median. [Such techniques are discussed in Nonparametric Statistics, article on Runs.]

The statistic based on the number of runs, the Wald-Wolfowitz statistic, was originally proposed as a test of goodness of fit. It is not, however, recommended for this purpose, because other, more powerful tests are available.

Tied observations

The discussion so far has presumed that no pairs of observations have the same value and that no observation equals Med₀. In practice such ties will occur. If ties are not extensive their consequence will be negligible. If exact results are required, the analysis can in principle be carried out conditionally on the observed pattern of ties.

Two-sample problems

Experiments involving two samples allow comparisons between the sampled populations without the necessity of using matched pairs of observations. Comparisons arise naturally when considering the relative advantages of two experimental conditions or two treatments, for example, a treatment and a control, or when absolute standards are unknown or are not available.

Most of the one-sample procedures have two-sample analogues (the exceptions are tests of randomness and tolerance intervals). With two samples, central interest has been focused on the difference between location parameters of the two populations. To be specific, let x₁, ..., x_m be the observed values in the first sample and y₁, ..., y_n be the observed values in the second sample. Let M_x and M_y be the corresponding location parameters with ∆ = M_y - M_x.

Estimation of difference of two means

As a point estimator of ∆ when M_x and M_y represent the population means, one often uses the difference of the sample averages, ȳ - x̄. This is the maximum likelihood estimator of ∆ when all of the observations are independent and come from normal populations; it is also the Gauss-Markov estimator whenever the populations have finite variances, and it is the least squares estimator. With the normal assumption, and also assuming equal variances, confidence intervals for ∆ can be obtained, utilizing the t-distribution. As in the normal case, nonparametric confidence intervals for ∆ will not have the prescribed confidence level unless the two populations differ in location parameter only.

Brown-Mood procedures for two medians. The analogue of the confidence procedure based on signs, the Brown-Mood procedure, is constructed in the following manner: Let w₁, ..., w_(m+n) be all of the observations arranged in increasing order, that is, w₁ is the smallest observation in both samples; w₂ is the second smallest observation in both samples; etc. Denote by w* the median of the combined sample. (It will be assumed that m + n is odd.) Let m* be the number of w's from the x-population that are greater than w*. When the two populations are the same, m* has a hypergeometric distribution and a nonparametric test of the hypothesis that the two populations are the same, specifically that ∆ = 0, is based on the distribution of m*. To obtain confidence intervals, replace the x-sample with x'_i = x_i + ∆ (i = 1, ..., m); form w', the analogue of the w sequence; compute the median of the w' sequence and call it w'*; compute the number of observations from the x' sequence above w'* and call it m'*; see if one would accept the null hypothesis of no difference between the x' and y populations; if one accepts the null hypothesis, then ∆ is in the confidence interval, and if one rejects the null hypothesis, then ∆ is not in the confidence interval.
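The test based on m* can be sketched as follows. The two samples are hypothetical values invented for the illustration, and the two-sided significance level is computed by one common convention (summing the hypergeometric probabilities that do not exceed the probability of the observed m*); the article itself does not prescribe a particular two-sided rule.

```python
from fractions import Fraction
from math import comb
from statistics import median

def brown_mood_test(xs, ys):
    """Return (m*, two-sided significance level); assumes m + n odd and no ties."""
    m, n = len(xs), len(ys)
    total = m + n
    w_star = median(xs + ys)          # median of the combined sample
    above = (total - 1) // 2          # number of combined observations above w*
    m_star = sum(1 for x in xs if x > w_star)

    def prob(k):
        """Hypergeometric Pr(m* = k) when the two populations are the same."""
        return Fraction(comb(m, k) * comb(n, above - k), comb(total, above))

    support = range(max(0, above - n), min(m, above) + 1)
    p_obs = prob(m_star)
    level = sum(prob(k) for k in support if prob(k) <= p_obs)
    return m_star, level

# Hypothetical samples, used only to exercise the function.
x = [11, 23, 7, 35]
y = [42, 50, 29, 61, 38]

m_star, level = brown_mood_test(x, y)
print("m* =", m_star, " two-sided significance level =", level)
```

A confidence interval for ∆ could be traced out by calling the same function on the shifted sample x_i + ∆ for a range of trial values of ∆ and keeping those values that are not rejected.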

Wilcoxon two-sample procedure. The analogue to the one-sample Wilcoxon procedure is to assign ranks (a set of conventional numbers) to the w's, that is, w₁ is given rank 1, w₂ is given rank 2, etc. The test statistic is the sum of the ranks of those w's from the x-population. When the two populations are identical, the distribution of this test statistic will not depend on the underlying common distribution. The test based on this statistic is called the Wilcoxon test or Mann-Whitney test.

The Wilcoxon test can be used if it is possible to compare each observation from the x-population with each observation from the y-population. The Mann-Whitney version of the Wilcoxon statistic (a linear function of the Wilcoxon statistic) is the number of times an observation from the y-population exceeds an observation from the x-population. When this number is divided by mn, it becomes an unbiased estimator of Pr(Y > X), the probability that a randomly selected y will be larger than a randomly selected x. This parameter has many interpretations and uses; for instance, if stresses and strains are brought together by random selection, it is the probability that the stress will exceed the strain (the system will function).
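The linear relation between the two versions of the statistic can be verified numerically. The samples below are hypothetical, and the relation used in the check (rank sum of the x's equals m(m + 1)/2 + mn minus the Mann-Whitney count, when there are no ties) is the standard identity rather than a formula quoted from the article.

```python
# Hypothetical samples, used only for illustration (no ties).
x = [11, 23, 7, 35]
y = [42, 50, 29, 61, 38]
m, n = len(x), len(y)

def rank_sum_x(xs, ys):
    """Wilcoxon statistic: sum of the ranks of the x's in the combined sample."""
    combined = sorted(xs + ys)
    return sum(combined.index(value) + 1 for value in xs)

def mann_whitney_count(xs, ys):
    """Number of (x, y) pairs in which the y observation exceeds the x observation."""
    return sum(1 for xv in xs for yv in ys if yv > xv)

wilcoxon = rank_sum_x(x, y)
count = mann_whitney_count(x, y)

print("rank sum of the x-sample:", wilcoxon)
print("Mann-Whitney count:", count)
print("estimate of Pr(Y > X):", count / (m * n))
print("identity holds:", wilcoxon == m * (m + 1) // 2 + m * n - count)
```

The count uses only pairwise comparisons, which is why the test remains available when each x can be compared with each y but neither can be measured on a numerical scale.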

Permutation procedure. The permutation procedure is based on the conditional distribution of the sum of the observations in the x-sample given the w sequence, when each selection of m of the values from the w sequence is considered equally likely. There are C(m + n, m) = (m + n)!/(m! n!) such possible selections. The sum of the observations in the x-sample is an increasing function of the usual t-statistic. The importance of the permutation procedure is that it often can be made to mimic the optimal procedures for the parametric situation while retaining exact nonparametric properties.
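For small samples the conditional distribution can be enumerated directly. The sketch below uses hypothetical integer-valued samples and computes a one-sided significance level, the proportion of selections whose sum is no larger than the observed sum of the x-sample; the choice of a lower-tail test is an assumption made for the illustration.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical samples, used only for illustration.
x = [11, 23, 7, 35]
y = [42, 50, 29, 61, 38]

w = x + y            # combined sample
m = len(x)
observed = sum(x)    # sum of the observations in the x-sample

# Sum for every equally likely selection of m values from the combined sample.
sums = [sum(selection) for selection in combinations(w, m)]

as_small = sum(1 for s in sums if s <= observed)
p_lower = Fraction(as_small, len(sums))

print("number of selections:", len(sums))
print("selections with sum <= observed:", as_small)
print("one-sided significance level:", p_lower)
```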

Comparisons of scale parameters

To compare spread or scale parameters, one can rank the observations in terms of their distances from w*, the combined sample median. The sum of ranks corresponding to the x-sample is a useful statistic [see Nonparametric Statistics, article on Ranking Methods].

When the null hypothesis does not include the assumption that both populations have the same median, the observations in each sample can be replaced by their deviations from their medians, and then the w-sequence can be formed from the deviations in both samples. This ranking procedure will not be exactly nonparametric, but it yields results with significance and confidence levels near the nominal levels.

BIBLIOGRAPHY

The extensive literature of nonparametric statistics is indexed in the bibliography of Savage 1962. In this index the classification scheme roughly parallels the structure of this article. The index includes citations made to earlier works. Named procedures in this article can be examined in detail by examining the corresponding author’s articles listed in the index. Detailed information for applying many nonparametric procedures has been given in Siegel 1956 and Walsh 1962-1965. The advanced mathematical theory of nonparametric statistics has been outlined in Fraser 1957 and Lehmann 1959. Noether 1967 is an intermediate-level text, particularly useful for its treatment of ties.

Fraser, Donald A. S. 1957 Nonparametric Methods in Statistics. New York: Wiley.

Lehmann, Erich L. 1959 Testing Statistical Hypotheses. New York: Wiley.

Noether, Gottfried E. 1967 Elements of Nonparametric Statistics. New York: Wiley.

Savage, I. Richard 1957 Nonparametric Statistics. Journal of the American Statistical Association 52:331-344. → A review of Siegel 1956.

Savage, I. Richard 1962 Bibliography of Nonparametric Statistics. Cambridge, Mass.: Harvard Univ. Press.

Siegel, Sidney 1956 Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

Walsh, John E. 1962-1965 Handbook of Nonparametric Statistics. 2 vols. Princeton, N.J.: Van Nostrand. → Volume 1: Investigation of Randomness, Moments, Percentiles, and Distributions. Volume 2: Results for Two and Several Sample Problems, Symmetry and Extremes. A third volume is in preparation.

II ORDER STATISTICS

Order statistics is a branch of statistics that considers the rank of an observation in a sample as well as its algebraic magnitude. Applications arise in all parts of statistics, from broadly distribution-free (nonparametric) problems to those in which a specific form for the parent population is assumed.

Order statistics methods are particularly useful when a complete sample is unavailable. For example, suppose a biologist is studying treated animals and observes the time of survival from treatment to death. He obtains his observations automatically as order statistics and might end the experiment after a fixed time has elapsed or after a certain number (or proportion) of the animals have died. In such cases estimation of, and testing of hypotheses about, parameters describing survival time can be successfully handled by order statistics.

Early literature in the field was on the use of order statistics for complete samples, usually of small size. Interest in order statistics can be traced back to the use of the median and the range as estimators of location and scale. This method of estimating location and scale has been generalized into linear functions of order statistics for use in censored small samples and as easily computed, but somewhat inefficient, estimators in complete samples. [See Statistics, Descriptive, article on Location and Dispersion.]

Applications of order statistics also include methods of studying extreme observations in a sample. This theory of extreme values is helpful in the analysis of the statistical aspects of rainfall and floods, in fatigue testing, in the analysis of injury experience of industrial personnel, and in the analysis of oldest ages among survivors. Order statistics are of fundamental importance in screening the extreme individuals in a sample either in a specific selection procedure or in judging whether or not these extreme observations are contaminated and should be rejected as outliers [see Screening and Selection; Statistical Analysis, Special Problems of, article on Outliers]. The range, the “Studentized range,” and other functions of order statistics have been incorporated into analysis of variance itself [see Linear Hypotheses, article on Multiple Comparisons].

There were early and isolated investigations concerning order statistics, such as computations of the relative efficiency of the median versus the mean in samples from the normal distribution (for example, Encke as early as 1834), but the first systematic development of sampling theory in order statistics occurred in 1902, when Pearson considered the “Galton difference problem.” He found the expected value of the difference between the rth and (r + 1)st order statistics in a sample of size n. Daniell (1920) derived the expected value of the rth order statistic and of products of order statistics. He also considered linear estimators for measures of location and scale in the normal distribution. In 1921, Bortkiewicz considered the distribution theory of the sample range, and in 1925, L. H. C. Tippett found the mean value of the sample range and tabulated the cumulative distribution function (cdf) of the largest order statistic for a sample (n ≤ 1,000) from the normal distribution.

Definitions and distribution theory

Suppose X is a continuous random variable with cdf F(x) and probability density function f(x). (For discrete random variables the work of Soliman Hasan Abdel-Aty in 1954 and Irving W. Burr in 1955 may be helpful; see the guide at the beginning of the bibliography.) If the elements of a random sample, X₁, X₂, . . ., X_n, are rearranged in order of ascending magnitude such that

X₍₁₎ <... < X_(n),

then X_(r) is called the rth order statistic in the sample of n. Arrangement from least to greatest is always possible, since the probability that two or more X’s are equal is zero. In actual samples ties may occur because of rounding or because of insensitive measuring devices; there are special rules designed to handle such cases for each procedure (see, for example, Kendall 1948, chapter 3).

Exact distribution of X_(r). The probability element, φ(x_(r)) dx_(r), for the rth order statistic is

$$\varphi(x_{(r)})\,dx_{(r)} = \frac{n!}{(r-1)!\,(n-r)!}\,[F(x_{(r)})]^{r-1}\,[1 - F(x_{(r)})]^{n-r}\,f(x_{(r)})\,dx_{(r)}.$$

The heuristic argument for the above expression can be described in terms of Figure 1, showing the x-axis. To say that the rth order statistic lies in a small interval, (x_(r), x_(r) + dx_(r)), is to say that r - 1 unordered observations are less than x_(r), that n - r observations are greater than x_(r) + dx_(r), and that one observation is in the small interval. These probabilities correspond to the three factors above just to the left of dx_(r). The factorial factor at their left allows for rearrangements of the unordered observations.
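As a numerical check on the density of the rth order statistic, the sketch below evaluates the factorial factor times the three probability factors for a rectangular parent on (0, 1) and confirms that the result integrates to one with mean r/(n + 1). The function names, the choice of parent, and the simple midpoint integration rule are illustrative assumptions.

```python
from math import factorial

def order_stat_density(x, r, n, cdf, pdf):
    """Density of the r-th order statistic at x for a sample of n."""
    coef = factorial(n) / (factorial(r - 1) * factorial(n - r))
    below = cdf(x) ** (r - 1)             # r - 1 observations below x
    above = (1.0 - cdf(x)) ** (n - r)     # n - r observations above x
    return coef * below * above * pdf(x)  # one observation at x

def midpoint_integral(g, a, b, steps=20000):
    """Plain midpoint rule, adequate for this smooth integrand."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

# Rectangular (uniform) parent on (0, 1): F(x) = x, f(x) = 1.
n, r = 5, 2
dens = lambda x: order_stat_density(x, r, n, lambda t: t, lambda t: 1.0)
total = midpoint_integral(dens, 0.0, 1.0)                  # should be close to 1
mean = midpoint_integral(lambda x: x * dens(x), 0.0, 1.0)  # close to r/(n+1)
```

Any other continuous parent can be substituted by passing its cdf and density, with integration limits chosen to cover the support.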

Similar expressions exist for the joint distribution of the ith and jth order statistics, i < j. This can lead to the cdf of the sample range and other useful measures.

Limit distributions

Limit distributions will be discussed first for extreme order statistics and then for proportionately defined order statistics.

Extreme order statistics. Limiting (large sample, asymptotic) distributions for the extreme order statistics—the largest and the smallest—have been much studied because of their applicability to the analysis of floods, strength of materials, and so on. For brevity, only the largest order statistic, X_(n), will be considered here.

The large sample behavior of X_(n) clearly depends on the behavior of F(x) for large values of x, and since X_(n) will in general become large and highly variable as n grows, an appropriate kind of centering and scaling must be introduced.

Perhaps the most important case is that in which F(x) approaches unity at an exponential rate when x grows but in which F(x) never attains the value unity. More precisely, the case considered is that in which, for some a > 0, [1 - F(x)]/e^(-ax) has the limit zero as x grows but in which 1 - F(x) never actually attains the value zero. If a centering sequence, {u_n}, is then defined by

F(u_n) = (n - 1)/n,

and a scaling sequence, {α_n}, is defined by

$$\alpha_n = n f(u_n),$$

where f is the probability density function associated with F, then the basic result is that Y = α_n(X_(n) - u_n) has as its limit distribution the double exponential, with cumulative distribution function exp(-e^(-y)).
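The limit result can be illustrated without simulation for a unit exponential parent, F(x) = 1 - e^(-x), chosen here only as a convenient assumption. For this parent F(u_n) = (n - 1)/n gives u_n = ln n, and n f(u_n) = 1, so the exact distribution of the centered maximum can be compared directly with the double exponential. Function names are hypothetical.

```python
import math

def centered_max_cdf_exponential(y, n):
    """Exact P(alpha_n * (X_(n) - u_n) <= y) for a unit exponential parent."""
    u_n = math.log(n)                    # solves F(u_n) = (n - 1)/n
    alpha_n = n * math.exp(-u_n)         # n * f(u_n), equal to 1 for this parent
    x = u_n + y / alpha_n
    if x <= 0:
        return 0.0
    return (1.0 - math.exp(-x)) ** n     # F(x)^n is the cdf of the largest value

def double_exponential_cdf(y):
    """Limit distribution exp(-e^(-y))."""
    return math.exp(-math.exp(-y))

n = 10 ** 6
gaps = [abs(centered_max_cdf_exponential(y, n) - double_exponential_cdf(y))
        for y in (-1.0, 0.0, 1.0, 2.0)]
```

For n of one million the two distribution functions agree to several decimal places at each of the points checked.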

There are two other kinds of such limit distributions, depending on the structure of F for large x; a detailed discussion was given by E. J. Gumbel in 1958. One of these is, in effect, the distribution studied by W. Weibull in 1949 to investigate the breaking strengths of metals.

The three kinds of limit distributions are intertransformable by changes of variable. Since the limit distributions are relatively simple, large sample estimation is relatively easy; in fact, graphical methods have been worked out and were explained by Gumbel; an exposition of procedural steps is given in Botts (1957).

Proportionately defined order statistics. A second kind of limit distribution is for the rth order statistic where one thinks of r as growing proportionally to n. For any number λ, 0 < λ < 1, the λ-quantile of the population is defined as that value of x, ξ_λ, such that

$$F(\xi_\lambda) = \lambda.$$

For instance, if λ = .5, ξ_.50 is the median, or the 50th percentile (a percentile is a quantile expressed on the base of 100). A λ-quantile, like the median, may sometimes be indeterminate in an interval where f(x) = 0. In such a case every ξ_λ in the interval that satisfies F(ξ_λ) = λ can be taken as the λ-quantile.

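The λ-quantile of a sample is defined by

$$z_\lambda = \begin{cases} X_{([n\lambda]+1)} & \text{if } n\lambda \text{ is not an integer},\\[4pt] \tfrac{1}{2}\left(X_{(n\lambda)} + X_{(n\lambda+1)}\right) & \text{if } n\lambda \text{ is an integer},\end{cases}$$

where [nλ] is the largest integer not exceeding nλ. The second line is an arbitrary definition for the indeterminate case, since z_λ can be any value in the interval (X_(nλ), X_(nλ+1)).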

N. V. Smirnoff showed in 1935 that if r/n → λ as n → ∞, and if f(x) is continuous and positive at the λ-quantile, ξ_λ, then $\sqrt{n}\,(X_{(r)} - \xi_\lambda)$ is asymptotically normal with mean zero and variance $\lambda(1-\lambda)/[f(\xi_\lambda)]^{2}$. This result in the special case of the median (λ = 1/2) was studied by Encke (1834).

The joint limit distribution of two order statistics was also given by Smirnoff. Frederick Mosteller in 1946 extended this to a set of k order statistics with a normal k-variate limit distribution.

The normalized sample quantile given by transforming z_λ to $\sqrt{n}\,(z_\lambda - \xi_\lambda)$ has the same asymptotic distribution as $\sqrt{n}\,(X_{(r)} - \xi_\lambda)$, discussed above. For sampling from a normal distribution with parameters μ and σ, the median of a sample (z_.50) of size n is asymptotically normal with mean μ and standard deviation $\sigma\sqrt{\pi/(2n)}$. The sample average has mean μ and standard deviation $\sigma/\sqrt{n}$, so the ratio of standard deviations is $\sqrt{\pi/2} \approx 1.25$. In other words, for normal sampling it takes about 57 per cent more observations (the ratio of variances is π/2 ≈ 1.57) to obtain the same precision with the sample median as with the sample average. For samples from other distributions, this relationship may be reversed.
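The comparison of median and mean under normal sampling follows from the asymptotic variance λ(1 - λ)/(n[f(ξ_λ)]²) of a sample quantile. The short derivation below is a worked check added for illustration; it uses only the normal density at its center.

$$
\lambda = \tfrac{1}{2},\qquad f(\xi_{.50}) = f(\mu) = \frac{1}{\sigma\sqrt{2\pi}}
\quad\Longrightarrow\quad
\operatorname{Var}(z_{.50}) \approx \frac{\tfrac{1}{2}\cdot\tfrac{1}{2}}{n\,[f(\mu)]^{2}} = \frac{\pi\sigma^{2}}{2n},
$$

$$
\frac{\operatorname{Var}(z_{.50})}{\operatorname{Var}(\bar{X})} = \frac{\pi\sigma^{2}/(2n)}{\sigma^{2}/n} = \frac{\pi}{2} \approx 1.57,
\qquad
\sqrt{\pi/2} \approx 1.25 .
$$

The variance ratio gives the relative number of observations needed for equal precision; the square root gives the ratio of standard deviations.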

Nonparametric procedures for quantiles

Although the distribution of the X_(i), the order statistics, depends upon the underlying distribution with cumulative distribution function F, it is important to observe that so long as F is continuous, the distribution of the F(X_(i)) does not depend upon F. A great many nonparametric procedures are based on this fact, and some of them will now be briefly described. (See Mood & Graybill [1950] 1963.)

Confidence limits for quantiles

A simple confidence interval for the λ-quantile, ξ_λ, at confidence level 1 - α is obtained by choosing integers 1 ≤ r < s ≤ n such that

I_λ(r, n - r + 1) - I_λ(s, n - s + 1) = 1 - α,

where I is the incomplete beta distribution [see Distributions, Statistical, article on Special Continuous Distributions]. The consequent confidence interval is just X_(r) ≤ ξ_λ ≤ X_(s). If the order statistics are symmetrically placed (s = n - r + 1), further simplification is achieved. If, further, the quantile is the median, the interval is particularly easy to use; its testing analogue is called the median test. Binomial tail summations may equivalently be used for the incomplete beta distribution values.

An example. Given a sample of 30 observations from any continuous distribution, symmetric confidence intervals for the population median, with confidence coefficient 1 - α, are as shown in Table 1.

*Table 1 — Symmetric confidence intervals for the population median, based on a sample of 30 observations*
Order statistics	Confidence coefficient
X₍₈₎ and X₍₂₃₎	.995
X₍₁₀₎ and X₍₂₁₎	.95
X₍₁₁₎ and X₍₂₀₎	.90
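The entries of Table 1 can be reproduced with binomial tail sums, as the text indicates. The sketch below assumes the symmetric case s = n - r + 1 and λ = 1/2; the function name is hypothetical.

```python
from math import comb

def median_interval_confidence(n, r):
    """Confidence coefficient of the interval (X_(r), X_(n-r+1)) for the median."""
    # The number of observations below the median is Binomial(n, 1/2).
    # The interval misses the median when that count is below r or above n - r;
    # by symmetry the two tails are equal.
    lower_tail = sum(comb(n, k) for k in range(r)) / 2 ** n
    return 1.0 - 2.0 * lower_tail

levels = {r: median_interval_confidence(30, r) for r in (8, 10, 11)}
```

For n = 30 the computed coefficients are about .995, .957, and .901 for r = 8, 10, and 11, so the tabled .95 and .90 appear to be nominal levels that the exact coefficients slightly exceed.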

For the quartiles (λ = .25, .50, .75), with n going from 2 to 1,000, and for several values of α, section 12.3 of Owen (1962) provides convenient tables. Noether (1949) gives an elementary discussion of these procedures and the ones for tolerance intervals.

Tolerance intervals

Nonparametric tolerance intervals also may be based on the concept of choosing integers determined by the incomplete beta distribution. A tolerance interval says that at least 100β per cent of the probability distribution described by F is contained in a random interval, so chosen that the statement is true with probability 1 - α. If the ends of the interval are X_(r) and X_(s), with r < s ≤ n, and if β is given, then α may be computed in terms of the incomplete beta distribution; similarly, if α is given, β may be computed. This was discussed further by R. B. Murphy in 1948.
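A minimal sketch of the computation of 1 - α for given β follows. It relies on the standard fact that, for a continuous parent, the fraction of the population between X_(r) and X_(s) follows a beta distribution with parameters s - r and n - s + r + 1, whose tail can be written as a binomial sum. The function name and the numerical case are illustrative assumptions.

```python
from math import comb

def tolerance_confidence(n, r, s, beta):
    """P{(X_(r), X_(s)) contains at least a fraction beta of the population}."""
    # Upper tail of Beta(s - r, n - s + r + 1) at beta, written as the
    # probability that a Binomial(n, beta) count is at most s - r - 1.
    return sum(comb(n, k) * beta ** k * (1.0 - beta) ** (n - k)
               for k in range(s - r))

n = 50
conf_extremes = tolerance_confidence(n, 1, n, 0.90)   # interval (X_(1), X_(n))
```

With the sample extremes as end points this reduces to the closed form 1 - nβ^(n-1) + (n - 1)β^n, about .966 for n = 50 and β = .90.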

Furthermore, many nonparametric tests of hypotheses are based squarely on the above fundamental property of the F(X_(i)). [Many of these tests are discussed elsewhere in Nonparametric Statistics; see also Goodness of Fit.]

Linear systematic statistics

Mean and standard deviation

A linear systematic statistic is a linear combination of the sample order statistics. This article deals with the use of linear systematic statistics to estimate location parameters (μ) and scale parameters (σ) from a random sample with underlying distribution of known form up to unknown μ and/or σ. The distributions for which linear systematic statistics have been considered include the normal, exponential, rectangular, gamma, logistic, and extreme-value distributions and others of only theoretical interest, such as the right triangular. The choice of the coefficients of a linear systematic statistic should be optimal in some sense, for example, in terms of bias, sampling variance, and computational convenience.

The expectation of X_(r) can be expressed as a linear function of μ and σ, and the variance and covariance of the X_(r) can be computed up to a scalar constant. Using generalized least squares, E. H. Lloyd in 1952 showed how to find the minimum variance unbiased estimators among all linear combinations. Linear combinations can also be constructed if only a portion of the sample order statistics are used, either from necessity or for convenience (the use of only certain order statistics corresponds to the requirement that certain coefficients of the linear combination be zero). The following general notation for linear systematic statistics will be used:

$$\mu^{*} = \sum_{i=1}^{n} a_{1i}\,X_{(i)}, \qquad \sigma^{*} = \sum_{i=1}^{n} a_{2i}\,X_{(i)},$$

where the a’s are the coefficients.

For the normal distribution Daniel Teichroew in 1956 calculated with ten-decimal precision the expected values of the order statistics and of the products of pairs of order statistics for samples of size 20 and under. These were used by Sarhan and Greenberg in 1956 to compute variances and covariances of the order statistics and to derive tables of optimal coefficients a_1i and a_2i.

Censored samples. For complete samples this procedure does not represent much of an achievement, although the loss in efficiency in comparison to optimal unbiased minimum variance estimators is often negligible. When some of the sample observations are censored, however, the tables become almost indispensable. [See Statistical Analysis, Special Problems of, article on Truncation and Censorship.] Observations may be censored, usually at the extremes, because the errors of measurement at the extremes are greater than those in the central portion of the distribution or because it is difficult and/or costly to determine their exact magnitude. If censorship is by a fixed point on the abscissa (regardless of whether this point is known to the investigator), the censorship is referred to as Type I. If a certain predetermined percentage of the observations is censored, the censorship is referred to as trimming or as Type II censoring. (F. N. David and Norman L. Johnson in 1956 used the terms “type B” and “type A” to denote Type I and Type II censoring. Another sort of censoring is known as Winsorization, discussed below in relation to the rejection of outlying observations.) Censorship and trimming may be on one or both ends of the ordered sample.

If r₁ and r₂ observations are censored from the left-hand and right-hand sides, respectively, estimates of the parameters can be based upon the values of the n - r₁ - r₂ observations remaining after censoring, together with the corresponding coefficients a’_1i and a’_2i. Tables of the coefficients for the best linear estimators under Type II censoring are given by Sarhan and Greenberg (1962). They also include tables of variances (and covariances) of the estimators and their efficiencies relative to the best linear estimators using complete samples. These tables are valuable because such a use of order statistics gives the best possible estimation among the class of linear statistics for samples up to size 20. Another use of the tables is for the rapid appraisal of patterns of relative information in each order statistic.

An example. Students made measurements (shown in Table 2) of strontium-90 concentration in samples of a test milk in which 9.22 micromicrocuries per liter was the known correct value. One observation was censored at each side because of suspected unreliability. The resulting estimates are μ* = 9.27 (with variance .1043σ²) and σ* = 2.05 (with variance .0824σ²). If the two censored observations had been trimmed from one side rather than symmetrically, the efficiency in estimating the mean would have decreased from 95.9 per cent to 93.0 per cent, but that for estimating the standard deviation would have increased slightly, from 69.9 per cent to 70.9 per cent.

Table 2 — Measurements of strontium-90 concentration
i	Ordered observation	a′_1i	a′_2i
1.	(censored)	0	0
2.	7.1	.1884	-.4034
3.	8.2	.1036	-.1074
4.	8.4	.1040	-.0616
5.	9.1	.1041	-.0201
6.	9.8	.1041	.0201
7.	9.9	.1040	.0616
8.	10.5	.1036	.1074
9.	11.3	.1884	.4034
10.	(censored)	0	0
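The estimates quoted for the strontium-90 example are simply the weighted sums of the eight uncensored order statistics with the Table 2 coefficients. A short check, with a hypothetical helper name:

```python
# Uncensored ordered observations (positions 2 through 9 of 10) from Table 2
obs = [7.1, 8.2, 8.4, 9.1, 9.8, 9.9, 10.5, 11.3]
# Coefficients for the location and scale estimators, also from Table 2
a1 = [.1884, .1036, .1040, .1041, .1041, .1040, .1036, .1884]
a2 = [-.4034, -.1074, -.0616, -.0201, .0201, .0616, .1074, .4034]

def linear_systematic(coeffs, ordered):
    """Linear combination of order statistics with the given coefficients."""
    return sum(a * x for a, x in zip(coeffs, ordered))

mu_star = linear_systematic(a1, obs)      # about 9.27
sigma_star = linear_systematic(a2, obs)   # about 2.05
```

The location coefficients sum to one (apart from rounding in the table) and the scale coefficients are antisymmetric, so σ* is a weighted sum of the subranges.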

Other linear systematic statistics. In samples larger than size 20, the calculations for optimal linear systematic statistics are cumbersome, and tables of coefficients are not available. Alternative nearly best and nearly unbiased estimators were proposed by A. K. Gupta in 1952, Sarhan and Greenberg in 1956, Gunnar Blom in 1958, and Sarndal in 1962. These alternatives have coefficients that are easier to compute than a′_1i and a′_2i.

Estimators from less than available sample

The linear systematic statistics discussed above were based on all the order statistics, or on all those available after censorship. Optimality was stressed in terms of minimum variance. The present section considers simpler linear systematic statistics based on less than all available order statistics. They are simpler in the sense that many of the coefficients are zero and/or the nonzero coefficients take only a few, convenient values.

Small samples

Simplified estimators in small samples will be considered for several different distributions.

Normal distribution. Wilfrid J. Dixon in 1957 suggested simple estimators of the mean of a normal distribution that are highly efficient for small samples. Two of these were:

(1) The mean of two optimally selected order statistics;

(2) $\sum_{i=2}^{n-1} X_{(i)}/(n-2)$, that is, the mean of all order statistics except the largest and smallest.

Estimator (1) is, more specifically, the average of the order statistics with indices ([.27n] + 1) and ([.73n] + 1), where [k] is the largest integer in k. Its asymptotic efficiency, relative to the arithmetic mean of the sample, is 81.0 per cent, and in small samples its efficiency, measured in the same way, is never below this figure and rapidly approaches 100 per cent as n declines toward 2. (Asymptotic estimation is further discussed below.) When n is a multiple of 100, computation of the 27th and 73d percentiles runs into the same problem of indeterminacy as does computation of the median of an even-numbered sample. As explained in the definition of quantiles above, an arbitrary working rule for the 27th percentile is $\tfrac{1}{2}\left(X_{(.27n)} + X_{(.27n+1)}\right)$.
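Both of Dixon's simple estimators of the mean are easy to compute. The sketch below uses hypothetical function names and data, and it does not implement the averaging rule for the indeterminate case in which n is a multiple of 100.

```python
def dixon_two_point_mean(sample):
    """Estimator (1): average of the order statistics [.27n]+1 and [.73n]+1."""
    xs = sorted(sample)
    n = len(xs)
    lo = (27 * n) // 100 + 1      # 1-based index [.27n] + 1, integer arithmetic
    hi = (73 * n) // 100 + 1      # 1-based index [.73n] + 1
    return 0.5 * (xs[lo - 1] + xs[hi - 1])

def mean_without_extremes(sample):
    """Estimator (2): mean of all order statistics except largest and smallest."""
    xs = sorted(sample)
    return sum(xs[1:-1]) / (len(xs) - 2)

data = [4, 9, 1, 7, 10, 2, 6, 3, 8, 5]   # hypothetical sample, n = 10
est1 = dixon_two_point_mean(data)         # uses X_(3) and X_(8)
est2 = mean_without_extremes(data)
```

Integer arithmetic is used for the indices to avoid floating-point surprises when .27n or .73n lies close to a whole number.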

The 27th and 73d percentiles as used here are relevant to the problem of optimum groupings discussed below for large samples in the univariate and bivariate normal distributions. Their application to a problem in educational psychology was originally pointed out by Kelley (1939), seeking to select the upper and lower group of persons with a normally distributed score for the validation of test items. He recommended selection of 27 per cent at each extreme to be reasonably certain that the groups were different in respect to the trait. Cureton (1957), using the same method, and David R. Cox in 1957 noted that one-third at each extreme is optimal when the distribution is rectangular.

The second estimator has higher efficiency than the first for n ≥ 5, and its efficiency asymptotically approaches unity. Its chief advantage comes about when there is a possible outlier (discussed below).

Dixon in 1957 also considered linear estimators of the standard deviation for the normal distribution. These estimators are better than, although related to, the sample range. One such estimator is based upon the sum of that set of subranges W_(i) = X_(n-i+1) - X_(i) that gives minimum variance. For a sample of size 10, for example, the unbiased estimator of the standard deviation based upon the range is .325 W₍₁₎ and has an efficiency of 85.0 per cent relative to the unbiased complete sample standard deviation. Dixon’s improved estimator is .1968(W₍₁₎ + W₍₂₎), and its efficiency is 96.4 per cent.

Exponential distributions. The one-parameter exponential distribution, with density

$$f(x) = \frac{1}{\sigma}\,e^{-x/\sigma}, \qquad x \ge 0,\ \sigma > 0,$$

is most useful in life-testing situations, where each observation is automatically an ordered waiting time to failure or death. Benjamin Epstein and Milton Sobel in 1953 showed that the maximum likelihood estimator of σ based only on the first r order statistics is

$$\hat{\sigma}_{r,n} = \frac{1}{r}\left[\sum_{i=1}^{r} X_{(i)} + (n-r)\,X_{(r)}\right],$$

and that 2rσ̂_r,n/σ is distributed as chi-square with 2r degrees of freedom.

This estimator is a multiple of a simple weighted average of the first r uncensored observations. For other conditions of censoring, the best linear estimator σ* described by Sarhan and Greenberg in 1957 may be used. Other simple estimators include those of Harter (1961) and Kulldorff (1963a).
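For a life test stopped at the rth failure, the maximum likelihood estimator is the total observed time on test divided by the number of failures: the r observed lifetimes plus (n - r) times the last observed one, all over r. A minimal sketch, with a hypothetical function name and hypothetical lifetimes:

```python
def censored_exponential_mle(first_r, n):
    """Estimate sigma from the r smallest of n exponential lifetimes."""
    xs = sorted(first_r)
    r = len(xs)
    # The n - r unfailed items have each survived at least to the r-th failure time.
    total_time_on_test = sum(xs) + (n - r) * xs[-1]
    return total_time_on_test / r

failures = [10.0, 20.0, 30.0, 40.0]       # hypothetical first 4 failure times
sigma_hat = censored_exponential_mle(failures, n=10)
```

With r = n the expression reduces to the ordinary sample mean.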

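For the two-parameter exponential distribution (with unknown left end point of range, α) Epstein and Sobel in 1954 considered best linear unbiased estimators α* and σ* based again upon the first r order statistics. These are

$$\alpha^{*} = X_{(1)} - \frac{\sigma^{*}}{n}$$

and

$$\sigma^{*} = \frac{1}{r-1}\left[\sum_{i=1}^{r} X_{(i)} + (n-r)\,X_{(r)} - n\,X_{(1)}\right].$$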

Still further simplified estimators for both the one-parameter and two-parameter exponential distributions have been considered; they use the optimal k out of n sample order statistics. Tables and instructions for the use of these can be found in Sarhan, Greenberg, and Ogawa (1963), Kulldorff (1963b), and Saleh and Ali (1966).

Large samples

In a large sample, unbiased estimators of location and scale with high asymptotic efficiency may be derived by selecting k suitably spaced order statistics, where k is considerably less than n. The spacing of the quantiles λ₁, λ₂, . . ., λ_k which produces an estimator with maximum efficiency is termed an optimum spacing.

The problem of optimum spacing of sample quantiles for purposes of estimation is related to asymptotically optimum grouping of observations for convenience in exposition or for purposes of contrast. Cox in 1957 considered optimum grouping for the normal distribution and obtained the same set of quantiles as Ogawa found in 1951 for optimum spacings. Kulldorff in 1958 derived optimal groupings for the exponential distribution, and the results were identical with those sample quantiles derived by Sarhan, Greenberg, and Ogawa (1963) for optimum spacings. The optimum spacing for estimation in the rectangular distribution, however, is obtained from the extreme observations, whereas optimal grouping requires equal frequencies.

Normal distribution. For the mean, Mosteller in 1946 considered the estimator

$$\frac{1}{k}\sum_{i=1}^{k} X_{(n_i)}.$$

The integers n_i are determined by k fixed numbers, λ₁, . . ., λ_k, such that 0 < λ₁ < . . . < λ_k < 1, and n_i = [nλ_i] + 1. For k = 3, for example, Mosteller obtained λ₁ = .1826, λ₂ = .5000, and λ₃ = .8174, with asymptotic efficiency equal to 87.9 per cent. A different set of quantiles and a weighted average of them was selected by Ogawa in 1951 to minimize asymptotic variance. Tables for Ogawa’s estimators are given in Sarhan and Greenberg (1962).

Estimation of both mean and standard deviation is a more complex problem; it has been solved only for symmetric spacing with k = 2, in which case the quantiles selected are λ₁ = .134 and λ₂ = .866.

Exponential distribution. For the one-parameter exponential distribution, estimators of the form

$$\sigma^{*} = \sum_{i=1}^{k} b_i\,X_{(n_i)},$$

that is, linear combinations of k selected order statistics, have been obtained with high maximum asymptotic relative efficiency for k = 1(1)15 by Sarhan, Greenberg, and Ogawa (1963). Kulldorff (1963a) has also done this, with greater arithmetic precision.

As an example, for k = 3,

has asymptotic relative efficiency .89.

Bivariate normal distribution. An estimator of the correlation coefficient, ρ, was devised by Mosteller in 1946 using the ranks of 2n < N observations. The procedure is to order the N observations on the x coordinate and to distribute the n largest and n smallest of them into an upper and a lower set based upon the y coordinate. The dividing line between the upper and lower sets is the median of the 2n observations on the y coordinate when the means and variances are unknown. Using the number of cases in the resultant four corners, an estimate of ρ can be obtained with the aid of a graph provided by Mosteller for varying levels of n/N. The optimal value of n/N is approximately 27 per cent, in which case the method has an efficiency slightly in excess of 50 per cent in comparison to the corresponding Pearson correlation coefficient. This method has merit where data are on punch cards and machines can be used for rapid sorting. By using the upper and lower 27 per cent of a sample, a similar adaptation can be made when fitting straight lines by regression, especially when both variables are subject to error (see Cureton 1966).

Outlying observations

An observation with an abnormally large residual (deviation from its estimated expectation) is called an outlier; it can arise either because of large inherent variability or because of a spurious measurement. A rule to reject outliers should be considered an insurance policy and not really a test of significance. The first attempt to develop a rejection criterion was suggested by Benjamin Peirce in 1852 while he was studying observations from astronomy.

C. P. Winsor in 1941 proposed a procedure that now bears his name: a suspected outlier should not be rejected completely, but its original value should be replaced by the nearest value of an observation that is not suspect. For the normal distribution the symmetrically Winsorized mean is somewhat similar to the second Dixon estimator discussed above (the mean of all order statistics except the largest and smallest), but the former shows only a small loss of efficiency and is more stable than the latter (see Tukey 1962). The Dixon estimator was proposed for a different purpose but has been used when an outlying observation was suspect. In evaluating the utility of either of these procedures, or of any other, one must consider the probability of falsely rejecting a valid observation as well as the bias caused by retaining a spurious item. The usefulness of rejection criteria should be measured in terms of the residual error variance. [These problems are discussed in more detail in Statistical Analysis, Special Problems of, article on Outliers.] A few examples will be given here to illustrate rejection rules when it is suspected that the observation has an error in location and when the underlying distribution is normal with mean μ and variance σ².
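The difference between Winsorizing and trimming can be seen on a small sample. The sketch below assumes symmetric treatment of k observations at each end; the function names and the five data values are illustrative choices.

```python
def winsorized_mean(sample, k=1):
    """Replace the k smallest and k largest values by their nearest retained neighbors."""
    xs = sorted(sample)
    n = len(xs)
    xs[:k] = [xs[k]] * k                # pull the low extremes up
    xs[n - k:] = [xs[n - k - 1]] * k    # pull the high extremes down
    return sum(xs) / n

def trimmed_mean(sample, k=1):
    """Drop the k smallest and k largest values and average the rest."""
    xs = sorted(sample)
    return sum(xs[k:len(xs) - k]) / (len(xs) - 2 * k)

data = [23, 29, 31, 44, 63]
w_mean = winsorized_mean(data)   # mean of 29, 29, 31, 44, 44
t_mean = trimmed_mean(data)      # mean of 29, 31, 44
```

The Winsorized mean keeps the full sample size while limiting the influence of the extremes; the trimmed version is the second Dixon estimator when k = 1.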

Case of σ known

To test whether X_(n) is an outlier when μ is known, use can be made of

whereas when μ is unknown, one can use

For detecting an outlier, the performance of B₁ is better than that of B₂ (and better than those of others not listed here), although B₂ is easier to compute. Tables for using B₁ can be found in Pearson and Hartley (1954, vol. 1, 1966 edition).

Case of σ unknown

If an independent estimator of σ is available from outside the sample in question, one can substitute it in B₁ to obtain a modified criterion. Tables of upper percentage points for this externally Studentized extreme deviate from the sample mean were prepared by David in 1956.

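When no estimator of σ is available except from the observed sample, two of the appropriate tests are

$$B_3 = \frac{X_{(n)} - \bar{X}}{s}$$

and

$$r_{10} = \frac{X_{(n)} - X_{(n-1)}}{X_{(n)} - X_{(1)}},$$

where X̄ is the sample mean and s is the sample standard deviation.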

An example. Given five observations, 23, 29, 31, 44, and 63, is the largest observation a spurious one?

From the tables prepared by Frank E. Grubbs in 1950 for B₃ with n = 5, the 90th upper percentage point is 1.791, and the 95th upper percentage point is 1.869. Thus, the value of 63 gives a result above the 10 per cent level and would not be rejected by B₃ at ordinary levels. From the table prepared by Dixon in 1951 for r₁₀, the sample significance level is again seen to be between 10 per cent and 20 per cent, so 63 would again not usually be rejected, and the suspect value would be retained.
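The two criteria in this example can be evaluated directly. The sketch assumes that B₃ is the largest deviation from the sample mean divided by the sample standard deviation, and that this standard deviation uses divisor n; that divisor is an assumption, chosen because it puts the statistic on the same scale as the percentage points 1.791 and 1.869 quoted above. The ratio r₁₀ is taken as the gap between the two largest observations divided by the range.

```python
import math

obs = sorted([23, 29, 31, 44, 63])
n = len(obs)
mean = sum(obs) / n

# Standard deviation with divisor n (an assumption, see the lead-in).
s = math.sqrt(sum((x - mean) ** 2 for x in obs) / n)

b3 = (obs[-1] - mean) / s                          # extreme Studentized deviate
r10 = (obs[-1] - obs[-2]) / (obs[-1] - obs[0])     # gap divided by the range
```

Under these assumptions B₃ is about 1.75, below the quoted 90th percentage point of 1.791, and r₁₀ is .475, which agrees with the conclusion that 63 would not be rejected at ordinary levels.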

Multiple outliers

Tests can also be constructed for multiple outliers, and the efficiency of various procedures can be compared. Excellent reviews of this field were given by Anscombe (1960) and by Dixon in 1962.

Tests of significance

Counterparts to the standard tests of significance can be derived by substituting the median, midrange, or quasi midrange for the mean and by using the range or subranges in lieu of the sample standard deviation. Such tests will usually be lower in power, but in small samples the differences may be negligible.

Hypotheses on location

The test criterion for the difference between the location measure of a sample and a hypothetical population value based upon substitution of the range in the standard t-test, published by Joseph F. Daly in 1946 and by E. Lord in 1947, is

$$\frac{|\bar{x} - \mu_0|}{w},$$

where μ₀ is the hypothetical population value and w is the sample range.

An example. Walsh (1949) poses the problem of testing whether the mean (x̄ = 1.05) of the following sample of 10 observations differs significantly from the hypothesized population mean of 0. Ordered observations are -1.2, -1.1, -.2, .1, .7, 1.3, 1.8, 2.0, 3.4, and 3.7. The range is 3.7 - (-1.2) = 4.9, so the criterion equals 1.05/4.9 = .214.

As tabulated by Lord, the critical values for a two-tailed test are 10 per cent = .186 and 5 per cent = .230, and therefore the hypothetical mean would not be rejected at the 5 per cent level.
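The arithmetic of this example takes only a few lines. The sketch assumes the criterion is the absolute deviation of the sample mean from the hypothetical value divided by the sample range; the critical values are those quoted from Lord's table.

```python
obs = [-1.2, -1.1, -0.2, 0.1, 0.7, 1.3, 1.8, 2.0, 3.4, 3.7]
mu0 = 0.0                                # hypothesized population mean

mean = sum(obs) / len(obs)               # 1.05
w = max(obs) - min(obs)                  # range, 4.9
u = abs(mean - mu0) / w                  # range analogue of the t-statistic

crit_10, crit_05 = 0.186, 0.230          # two-tailed critical values for n = 10
reject_at_10 = u > crit_10
reject_at_05 = u > crit_05
```

The statistic is about .214, which exceeds the 10 per cent value but not the 5 per cent value.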

Difference between means of two samples

Lord in 1947 devised the following test criterion for the difference between the means of two samples of size n from a normal population:

$$\frac{|\bar{x} - \bar{y}|}{w_x + w_y},$$

where w_x and w_y are the ranges of the two samples.

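An example. The problem is to test whether two samples, shown in Table 3, have a common population mean, assuming a normal distribution and equal but unknown variances.

Table 3
X-sample	Y-sample
27.6	43.3
35.5	48.7
45.0	53.6
46.7	56.5
47.6	63.9

These data give x̄ = 40.5, ȳ = 53.2, and sample ranges of 20.0 and 20.6, so that the criterion, the absolute difference of the means divided by the sum of the two ranges, equals 12.7/40.6 = .313.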

The critical value for 5 per cent is .307; for 1 per cent it is .448. The null hypothesis of equality of the means is rejected at the 5 per cent level.
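The two-sample computation for Table 3 can be sketched in the same way. The form of the criterion, the difference of means divided by the sum of the two sample ranges, is inferred from the quoted critical values and should be treated as an assumption; the function name is hypothetical.

```python
x = [27.6, 35.5, 45.0, 46.7, 47.6]
y = [43.3, 48.7, 53.6, 56.5, 63.9]

def lord_two_sample(a, b):
    """Absolute difference of means divided by the sum of the two ranges."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    range_sum = (max(a) - min(a)) + (max(b) - min(b))
    return abs(mean_a - mean_b) / range_sum

u2 = lord_two_sample(x, y)
reject_at_05 = u2 > 0.307      # 5 per cent critical value quoted in the text
reject_at_01 = u2 > 0.448      # 1 per cent critical value quoted in the text
```

The statistic is about .313, so equality of the means is rejected at the 5 per cent level but not at the 1 per cent level.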

Equivalent tests were devised by J. Edward Jackson and Eleanor L. Ross in 1955 without the restriction that the two sample sizes be the same.

Tests on variances

One-sample, two-sample, and k-sample tests of variances and analysis of variance can be performed with the use of ranges and subranges. A good reference on this subject is Chapter 7 by David in Sarhan and Greenberg (1962).

Bernard G. Greenberg

BIBLIOGRAPHY

Those citations to the literature which are given in the text but are not specified below can be found in the bibliography of Wilks 1948 or that of Sarhan & Greenberg 1962. The discussion in the latter monograph corresponds closely to the coverage in this article.

Anscombe, F. J. 1960 Rejection of Outliers. Technometrics 2:123-147.

Botts, Ralph R. 1957 “Extreme-value” Methods Simplified. Agricultural Economics Research 9:88-95.

Cureton, Edward E. 1957 The Upper and Lower Twenty-seven Per Cent Rule. Psychometrika 22:293-296.

Cureton, Edward E. 1966 Letter to the Editor. American Statistician 20, no. 3:49.

Daniell, P. J. 1920 Observations Weighted According to Order. American Journal of Mathematics 42:222-236.

Encke, J. F. (1834) 1841 On the Method of Least Squares. Volume 2, pages 317-369, in Scientific Memoirs: Selected From the Transactions of Foreign Academies of Science and Learned Societies, and From Foreign Journals. Edited by Richard Taylor. London: Taylor. → First published in German in the Astronomisches Jahrbuch.

Harter, H. Leon 1961 Estimating the Parameters of Negative Exponential Populations From One or Two Order Statistics. Annals of Mathematical Statistics 32:1078-1090.

Kelley, Truman L. 1939 The Selection of Upper and Lower Groups for the Validation of Test Items. Journal of Educational Psychology 30:17-24.

Kendall, M. G. (1948) 1963 Rank Correlation Methods. 3d ed., rev. & enl. New York: Hafner; London: Griffin.

Kulldorff, Gunnar 1963a On the Optimum Spacing of Sample Quantiles From an Exponential Distribution. Unpublished manuscript, Univ. of Lund, Department of Statistics.

Kulldorff, Gunnar 1963b Estimation of One or Two Parameters of the Exponential Distribution on the Basis of Suitably Chosen Order Statistics. Annals of Mathematical Statistics 34:1419-1431.

Mood, Alexander M.; and Graybill, Franklin A. (1950) 1963 Introduction to the Theory of Statistics. 2d ed. New York: McGraw-Hill.

Noether, Gottfried E. 1949 Confidence Limits in the Non-parametric Case. Journal of the American Statistical Association 44:89-100.

Owen, Donald B. 1962 Handbook of Statistical Tables. Reading, Mass.: Addison-Wesley. → A list of addenda and errata is available from the author.

Pearson, Egon S.; and Hartley, H. O. (editors) (1954) 1958 Biometrika Tables for Statisticians. Vol. 1. 2d ed. Cambridge Univ. Press. → A third edition of Volume 1 was published in 1966.

Saleh, A. K. MD. Ehsanes; and Ali, Mir M. 1966 Asymptotic Optimum Quantiles for the Estimation of the Parameters of the Negative Exponential Distribution. Annals of Mathematical Statistics 37:143-151.

Sarhan, Ahmed E.; and Greenberg, Bernard G. (editors) 1962 Contributions to Order Statistics. New York: Wiley.

Sarhan, Ahmed E.; Greenberg, Bernard G.; and Ogawa, Junjiro 1963 Simplified Estimates for the Exponential Distribution. Annals of Mathematical Statistics 34:102-116.

Särndal, Carl E. 1962 Information From Censored Samples. Stockholm: Almqvist & Wiksell.

Särndal, Carl E. 1964 Estimation of the Parameters of the Gamma Distribution by Sample Quantiles. Technometrics 6:405-414.

Tukey, John W. 1962 The Future of Data Analysis. Annals of Mathematical Statistics 33:1-67, 812. → Page 812 is a correction.

Walsh, John E. 1949 Applications of Some Significance Tests for the Median Which Are Valid Under Very General Conditions. Journal of the American Statistical Association 44:342-355.

Wilks, S. S. 1948 Order Statistics. American Mathematical Society, Bulletin 54:6-50.

III RUNS

A “run” is a sequence of events of one type occurring together in some ordered sequence of events. For example, if a coin is spun 20 times and the results are heads (H) or tails (T) in the following order:

there are 10 runs in the sequence, as indicated by the lines shown. If the tossings occur at random, it is clearly more reasonable to expect the number of runs occurring in the above configuration than those in the configuration

HHHHHHHHHTTTTTTTTTTT

which has the same number of heads and tails but only two runs. In dealing with runs as a statistical phenomenon, two general cases can be distinguished. In the first, the number of heads and the number of tails are fixed and interest is centered on the distribution of the number of runs, given the number of heads and tails. In the second, the number of heads and the number of tails themselves are random variables, with only the total number of spins fixed. The distribution of the number of runs in this case will be different.
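Counting runs is mechanical: a new run begins wherever a symbol differs from its predecessor. The following Python sketch illustrates this on the two-run configuration above; the function name `count_runs` is an illustrative choice, not part of the source.

```python
from itertools import groupby

def count_runs(sequence):
    """Number of maximal blocks of identical adjacent symbols."""
    return sum(1 for _ in groupby(sequence))

print(count_runs("HHHHHHHHHTTTTTTTTTTT"))  # 2
print(count_runs("HTHT"))                  # 4: every toss starts a new run
```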

A study of runs can assist one in deciding upon the randomness or nonrandomness of temporal arrangements of observations. Runs can also be used to form certain nonparametric or distribution-free tests of hypotheses that are usually tested in other ways. Examples of these uses are given below.

Simple runs

Interest in runs can be traced back to Abraham de Moivre’s Doctrine of Chances, first published in 1718. Whitworth (1867) devotes some space to runs, and Karl Pearson (1897) discusses in an interesting manner the runs of colors and numbers occurring in roulette plays at Monte Carlo. His analysis is, in fact, faulty in certain respects, a point that was picked up by Mood (1940). (Some errors in Mood’s own paper have been pointed out by Krishna Iyer [1948].) In an unusual article on runs, Solterer (1941) examines the folklore which holds that accidents or tragedies occur in triplets by analyzing the dates of death of 597 Jesuit priests in the United States for the period 1900-1939. Using a simple run test, he shows that some grouping of deaths does, in fact, occur. Wallis and Moore (1941a; 1943) give further illustrations of the uses to which runs have been put. Many writers treat the following basic problem:

Given r₁ elements of one kind and r₂ elements of a second kind, with the r = r₁ + r₂ elements arranged on a line at random, what is the probability distribution of the total number of runs? The general answer is a bit complex in expression, but the first two moments of d, the number of runs, may be simply written out:

$$E(d) = \frac{2 r_1 r_2}{r} + 1 \tag{1}$$

and

$$\operatorname{var}(d) = \frac{2 r_1 r_2\,(2 r_1 r_2 - r)}{r^2 (r - 1)}. \tag{2}$$

Furthermore, for large values of r₁ and r₂ the distribution of d tends to normality. This provides a ready and straightforward means of carrying out significance tests for the random arrangement of the r elements.
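For small r the two moments can be checked by listing every arrangement. The Python sketch below is illustrative only: it assumes the closed forms E(d) = 2r₁r₂/r + 1 and var(d) = 2r₁r₂(2r₁r₂ − r)/[r²(r − 1)] for (1) and (2), and the function names are not from the source. Exact fractions are used so that the comparison is free of rounding.

```python
from fractions import Fraction
from itertools import combinations, groupby

def exact_moments(r1, r2):
    """Mean and variance of d over all equally likely arrangements."""
    r = r1 + r2
    counts = []
    for positions in combinations(range(r), r1):   # places of the first kind
        chosen = set(positions)
        arrangement = [i in chosen for i in range(r)]
        counts.append(sum(1 for _ in groupby(arrangement)))
    n = len(counts)
    mean = Fraction(sum(counts), n)
    var = Fraction(sum(c * c for c in counts), n) - mean ** 2
    return mean, var

def formula_moments(r1, r2):
    """Closed forms assumed for equations (1) and (2)."""
    r = r1 + r2
    mean = Fraction(2 * r1 * r2, r) + 1
    var = Fraction(2 * r1 * r2 * (2 * r1 * r2 - r), r * r * (r - 1))
    return mean, var

for r1, r2 in [(2, 2), (4, 3), (5, 6)]:
    assert exact_moments(r1, r2) == formula_moments(r1, r2)
    print(r1, r2, formula_moments(r1, r2))
```

Enumeration grows quickly with r, so a check of this kind is practical only for small samples; for larger r the normal approximation described above is the usual route.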

The power of the test will be good either when elements form rather too few groups—that is, when there is strong positive association of the elements—or when there are too many groups because of a disposition for the elements to alternate. The test may be applied to quantitative observations by dichotomizing them as above or below the sample median (or some other convenient quantile), as in the following example.

Example—lake levels. The data for this example are given in Table 1 and consist of the highest monthly mean level of Lake Michigan-Huron for each of the 96 years from 1860 to 1955 inclusive. Inspection of the figures shows that the median height was 581.3 feet. Each level below this figure is marked B, and each height equal to or above this figure is marked A. Thus, for these data, r = 96, r_A = 49, and r_B = 47. A dichotomization just a bit off the sample median (to avoid boundary problems) has been used. Although a dichotomization near the sample median (as opposed to some theoretical cutting point) was used to define the categories A and B, the null distribution of d is unaffected. The number of runs is counted and found to be 15. From (1) and (2) it is found that E(d) = 49 and σ(d) = 4.87. Testing the observed number of runs against the expected, the quantity

$$\frac{d + 0.5 - E(d)}{\sigma(d)} = \frac{15 + 0.5 - 48.98}{4.87} \approx -6.87$$

(with E(d) carried as 48.98 before rounding) is referred to the unit normal distribution. (The 0.5 above is the so-called continuity correction; see Wallis & Roberts 1956, pp. 372-375.) The probability of observing a value of −6.87 or smaller by chance, if the arrangement of A’s and B’s is random, is well below 0.00001, and hence it can be concluded that the observations are not randomly ordered. In other words, the lake has tended to be high for several years at a time and then low for several years. In practical terms, this means that an estimate of next year’s highest level will usually be closer to this year’s level than to the 96-year average.

Example—anthropological interval sift. The anthropological interval sift technique is described in a paper by Naroll and D’Andrade (1963). Traits often diffuse between neighboring cultures, forming clusters of neighbors with like traits. A sifting method seeks to sift out a cross-cultural sample, so that, ideally, from each geographical cluster of neighbors with like traits only one example will be considered in the cross-cultural sample. The object of the test is to see whether the sift has been successful in removing the correlation effects of neighboring societies with like traits.

The method is applied to narrow strips of the globe 600 nautical miles wide and thousands of miles long. From a random start, equal intervals are marked off along the length of the strip every so many miles. The first society encountered after each mark is included in the sample. The interval chosen must be large enough to ensure that neighboring members of a single diffusion patch are not included any more frequently than would be produced in a random geographic distribution. Everything depends upon the interval chosen, and a run test is used to decide whether the interval used is suitable.

For this purpose the 40 ordered societies obtained in the sample were dichotomized on each of four characteristics. For example, the first dichotomization was according to whether or not the society had a residence rule of a type theoretically associated with bilateral or bilineal kinship systems (bilocal, neolocal, uxorineolocal, or uxoribilocal). Although the values of the r_i in this example may seem to be random, it is reasonable to regard them as fixed and to apply the appropriate tests given the r_i. Hence, for each of the four characteristics looked at, the run test was applied, with the results shown in Table 2. Because of the relationships between the four characteristics, strong dependences exist among them; this is the reason for the identity, or near identity, of the results for characteristics 1 and 3, and for 2 and 4. Thus the four tests are appreciably dependent in the statistical sense.

*Table 1 — Lake Michigan-Huron, highest monthly mean level for each calendar year, 1860-1955*
Year	Level in feet^a	Two categories^b	Change^c	Year	Level in feet^a	Two categories^b	Change^c
1860	583.3	A		1910	580.5	B	−
1861	583.5	A	+	1911	580.0	B	−
1862	583.2	A	−	1912	580.7	B	+
1863	582.6	A	−	1913	581.3	A	+
1864	582.2	A	−	1914	580.7	B	−
1865	582.1	A	−	1915	580.0	B	−
1866	581.7	A	−	1916	581.1	B	+
1867	582.2	A	+	1917	581.87	A	+
1868	581.6	A	−	1918	581.91	A	+
1869	582.1	A	+	1919	581.3	A	−
1870	582.7	A	+	1920	581.0	B	−
1871	582.8	A	+	1921	580.5	B	−
1872	581.5	A	−	1922	580.6	B	+
1873	582.2	A	+	1923	579.8	B	−
1874	582.3	A	+	1924	579.6	B	−
1875	582.1	A	−	1925	578.49	B	−
1876	583.6	A	+	1926	578.49	B	0
1877	582.7	A	−	1927	579.6	B	+
1878	582.5	A	−	1928	580.6	B	+
1879	581.5	A	−	1929	582.3	A	+
1880	582.1	A	+	1930	581.2	B	−
1881	582.2	A	+	1931	579.1	B	−
1882	582.6	A	+	1932	578.6	B	−
1883	583.3	A	+	1933	578.7	B	+
1884	583.1	A	−	1934	578.0	B	−
1885	583.3	A	+	1935	578.6	B	+
1886	583.7	A	+	1936	578.7	B	+
1887	582.9	A	−	1937	578.6	B	−
1888	582.3	A	−	1938	579.7	B	+
1889	581.8	A	−	1939	580.0	B	+
1890	581.6	A	−	1940	579.3	B	−
1891	580.9	B	−	1941	579.0	B	−
1892	581.0	B	+	1942	580.2	B	+
1893	581.3	A	+	1943	581.5	A	+
1894	581.4	A	+	1944	580.8	B	−
1895	580.2	B	−	1945	581.00	B	+
1896	580.0	B	−	1946	580.96	B	−
1897	580.85	B	+	1947	581.1	B	+
1898	580.83	B	−	1948	580.8	B	−
1899	581.1	B	+	1949	579.7	B	−
1900	580.7	B	−	1950	580.0	B	+
1901	581.1	B	+	1951	581.6	A	+
1902	580.83	B	−	1952	582.7	A	+
1903	580.82	B	−	1953	582.1	A	−
1904	581.5	A	+	1954	581.7	A	−
1905	581.6	A	+	1955	581.5	A	−
1906	581.5	A	−
1907	581.6	A	+
1908	581.8	A	+
1909	581.1	B	−
a. Data for certain years are shown to two decimals to avoid ties.
b. A: 581.3 feet or more; B: less than 581.3 feet.
c. This column will be used in a later example.
Source: Wallis & Roberts 1956, p. 566.

In all four cases the observed number of runs was slightly below the expected number. Application of the test described above showed, however, that there was no statistically significant evidence of overclustering and, hence, no statistically significant evidence of diffusion. Thus the sift had been successful in removing the statistically significant effects of diffusion. Of course, considerations of power would be necessary for a full analysis.

*Table 2*
Characteristic	d	r₁	r₂	Sample significance level*
1	9	34	6	0.137
2	19	18	22	0.337
3	9	34	6	0.137
4	19	22	18	0.337
* Probability of obtaining the observed number of runs or a more extreme result, if the two kinds of sign were arranged at random.
Source: Naroll & D’Andrade 1963, p. 1061. Reproduced by permission of the American Anthropological Association.
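The significance levels in Table 2 can be recomputed from d, r₁, and r₂ alone. The sketch below is a hedged illustration: it assumes that the tabled levels are lower-tail probabilities from the normal approximation with the 0.5 continuity correction used in the lake example, and the function name `runs_test` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def runs_test(d, r1, r2):
    """Lower-tail normal approximation for an observed number of runs d."""
    r = r1 + r2
    mean = 2 * r1 * r2 / r + 1                                  # equation (1)
    sd = sqrt(2 * r1 * r2 * (2 * r1 * r2 - r) / (r * r * (r - 1)))  # from (2)
    z = (d + 0.5 - mean) / sd          # 0.5 is the continuity correction
    return mean, sd, z, NormalDist().cdf(z)

# (d, r1, r2) for the two distinct rows of Table 2, then the lake example
for d, r1, r2 in [(9, 34, 6), (19, 18, 22), (15, 49, 47)]:
    mean, sd, z, p = runs_test(d, r1, r2)
    print(f"d={d}: E(d)={mean:.2f}, sd={sd:.2f}, z={z:.2f}, p={p:.3g}")
```

Under these assumptions the first two rows give probabilities of about 0.137 and 0.337, in agreement with the table, and the lake data give a standardized value of about −6.87.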

Runs with more than two categories

The foregoing can be extended to the situation where the r elements are subdivided into more than two types of element, say k types, so that there are r₁ elements of type 1, r₂ of type 2, and so on up to r_k of type k, where Σ_j r_j = r. When the r elements are arranged at random, the first two moments of the number of runs, d, are now

$$E(d) = \frac{r(r + 1) - F_2}{r} \tag{3}$$

and

$$\operatorname{var}(d) = \frac{F_2\,[F_2 + r(r + 1)] - 2 r F_3 - r^3}{r^2 (r - 1)}, \tag{4}$$

where F_2 = Σ_j r_j² and F_3 = Σ_j r_j³. These formulas, provided that r and the r_j are all reasonably large, permit tests of significance for random groupings to be carried out.
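As with two categories, the moments for k categories can be checked by brute force when r is small. The sketch below is illustrative: it assumes the closed forms E(d) = [r(r + 1) − Σr_j²]/r and var(d) = {Σr_j²[Σr_j² + r(r + 1)] − 2rΣr_j³ − r³}/[r²(r − 1)] for (3) and (4), and the function names are not from the source.

```python
from fractions import Fraction
from itertools import groupby, permutations

def formula_moments(counts):
    """Closed forms assumed for (3) and (4); counts = (r_1, ..., r_k)."""
    r = sum(counts)
    f2 = sum(c ** 2 for c in counts)
    f3 = sum(c ** 3 for c in counts)
    mean = Fraction(r * (r + 1) - f2, r)
    var = Fraction(f2 * (f2 + r * (r + 1)) - 2 * r * f3 - r ** 3,
                   r * r * (r - 1))
    return mean, var

def exact_moments(counts):
    """Same moments from every ordering of the individually labelled elements."""
    items = [kind for kind, c in enumerate(counts) for _ in range(c)]
    runs = [sum(1 for _ in groupby(p)) for p in permutations(items)]
    n = len(runs)
    mean = Fraction(sum(runs), n)
    var = Fraction(sum(d * d for d in runs), n) - mean ** 2
    return mean, var

for counts in [(2, 1, 1), (3, 2, 2), (2, 2, 2, 1)]:
    assert exact_moments(counts) == formula_moments(counts)
    print(counts, formula_moments(counts))
```

Permuting labelled elements counts each distinguishable arrangement the same number of times, so the averages equal those over equally likely arrangements.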

Example—runs with multiple types. The example of runs with multiple types presented here is based on the work of E. S. Pearson and is quoted in David and Barton (1957). The data relate to the falls in the prices of shares on the London Stock Exchange during the Suez crisis period, November 6, 1956-December 8, 1956, inclusive. Five types of industrial activity were considered: A, insurance; B, breweries and distilleries; C, electrical and radio; D, motor and aircraft; E, oil. The closing prices as given in The Times (of London) for 18 businesses of each type were taken, and Table 3 shows, for each day, the type of industrial activity for which the greatest number of the 18 showed a fall in price from the previous day. In the few cases where there were equal numbers for two types, that type which also showed the fewer rises in price was taken.

*Table 3*
Nov. 6-A	Nov. 13-B	Nov. 20-E	Nov. 27-B	Dec. 4-C
Nov. 7-A	Nov. 14-C	Nov. 21-C	Nov. 28-E	Dec. 5-C
Nov. 8-D	Nov. 15-C	Nov. 22-E	Nov. 29-A	Dec. 6-D
Nov. 9-D	Nov. 16-C	Nov. 23-E	Nov. 30-E	Dec. 7-C
Nov. 10-A	Nov. 17-E	Nov. 24-E	Dec. 1-E	Dec. 8-B

For the data shown in Table 3, r = 25 with r_A = 4, r_B = 3, r_C = 7, r_D = 3, r_E = 8. There are 16 runs (d). From (3) and (4) above, the mean and standard deviation of d, given a random grouping of the letters, are E(d) = 20.12, σ(d) = 1.88. The probability of getting the observed value of d or a lower one by chance if the grouping is random is therefore found by referring the quantity

$$\frac{d + 0.5 - E(d)}{\sigma(d)} = \frac{16 + 0.5 - 20.12}{1.88} \approx -1.93$$

to a unit normal distribution. The appropriate probability is about 0.03, and it is therefore concluded that the test is picking out the fact that during the Suez crisis there was some persistence from day to day in the way in which different classes of shares were affected.
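The quantities in this example can be recomputed from Table 3. The Python sketch below is illustrative only: the string `sequence` is transcribed from Table 3 in date order, the moments use the standard closed forms for the mean and variance of the number of runs with several categories (taken here as equations (3) and (4)), and the 0.5 continuity correction is an assumption carried over from the lake example.

```python
from itertools import groupby
from math import sqrt
from statistics import NormalDist

# Table 3 read down each column, i.e. in date order (Nov. 6 to Dec. 8, 1956).
sequence = "AADDA" "BCCCE" "ECEEE" "BEAEE" "CCDCB"

counts = {kind: sequence.count(kind) for kind in sorted(set(sequence))}
d = sum(1 for _ in groupby(sequence))            # observed number of runs

r = len(sequence)
f2 = sum(c ** 2 for c in counts.values())        # sum of squared counts
f3 = sum(c ** 3 for c in counts.values())        # sum of cubed counts
mean = (r * (r + 1) - f2) / r                    # equation (3)
var = (f2 * (f2 + r * (r + 1)) - 2 * r * f3 - r ** 3) / (r * r * (r - 1))  # (4)
z = (d + 0.5 - mean) / sqrt(var)                 # with continuity correction
p = NormalDist().cdf(z)                          # lower-tail probability

print(counts, "runs:", d)
print(f"E(d)={mean:.2f}, sd={sqrt(var):.2f}, z={z:.2f}, p={p:.3f}")
```

For these data the sketch gives 16 runs, E(d) ≈ 20.12, and a standard deviation of about 1.88.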

Runs in a circle

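A modification of the type of run described above occurs when the line is bent into a circle so that there is no formal starting and ending point. If each configuration around the circle is equally probable, then it is possible to find the distribution of the total number of runs, d (see, for example, David & Barton 1962, p. 94, or Stevens 1939). For two kinds of element, r₁ of one kind and r₂ of the other with r = r₁ + r₂, the first two moments are

$$E(d) = \frac{2 r_1 r_2}{r - 1} \tag{5}$$

and

$$\operatorname{var}(d) = \frac{4 r_1 r_2 (r_1 - 1)(r_2 - 1)}{(r - 1)^2 (r - 2)}. \tag{6}$$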

For small r, say, up to 20, the exact probabilities need to be enumerated to carry out significance tests (see, for example, Walsh 1962-1965), but for larger values of r the distribution can be assumed to be normal with the appropriate parameters. It should also be noted that in this analysis it is assumed that the circle cannot be turned over, as would be the case in considering the number of runs among the beads threaded on a bracelet.
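For small r the circular case can be enumerated directly. The sketch below is a hedged illustration for two kinds of element: it treats the r positions on the circle as distinguishable and equally likely to receive either kind (subject to the fixed totals), and it compares the enumeration with the closed forms 2r₁r₂/(r − 1) and 4r₁r₂(r₁ − 1)(r₂ − 1)/[(r − 1)²(r − 2)], which are assumed here for the mean and variance. The function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def circular_runs(seq):
    """Runs around a closed ring: count changes between cyclic neighbours."""
    changes = sum(seq[i] != seq[i - 1] for i in range(len(seq)))
    return changes if changes else 1       # a one-kind ring is a single run

def exact_moments(r1, r2):
    """Mean and variance of d over all placements of r1 elements on r places."""
    r = r1 + r2
    counts = []
    for positions in combinations(range(r), r1):
        chosen = set(positions)
        counts.append(circular_runs([i in chosen for i in range(r)]))
    n = len(counts)
    mean = Fraction(sum(counts), n)
    return mean, Fraction(sum(c * c for c in counts), n) - mean ** 2

def formula_moments(r1, r2):
    """Closed forms assumed for the circular mean and variance."""
    r = r1 + r2
    mean = Fraction(2 * r1 * r2, r - 1)
    var = Fraction(4 * r1 * r2 * (r1 - 1) * (r2 - 1), (r - 1) ** 2 * (r - 2))
    return mean, var

for r1, r2 in [(2, 2), (3, 4), (5, 5)]:
    assert exact_moments(r1, r2) == formula_moments(r1, r2)
    print(r1, r2, formula_moments(r1, r2))
```

On a circle with both kinds present the number of runs is always even, since every run of one kind is followed by a run of the other.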

Random totals

In the foregoing analyses it is assumed that the total composition of the sequence is known, that is, that the r_i are known quantities. The probability distribution and moments are thus conditional upon the r_i being known. It is possible to take the argument a stage further and to consider the situation in which the r_i are obtained as a result of some sort of sampling and are thus themselves random variables. For purposes of illustration, suppose that there are two kinds of elements and that the sampling is of a binomial form, with the chance of an element being of the first kind equal to p and the chance of its being of the second kind equal to q (where p + q = 1), the elements being independent of one another. Then it can be shown that

$$E(d) = 1 + 2(r - 1)pq \tag{7}$$

and

$$\operatorname{var}(d) = 2pq\,[\,2r - 3 - 2pq(3r - 5)\,]. \tag{8}$$
A test based on the above moments might be used in a quality control situation where a previously estimated proportion, p, of articles is expected to have some characteristic. Consecutive items are examined, and the number of runs, d, found in r items is counted in order to see whether the machine concerned is producing a random ordering of quality or not [see Quality Control, Statistical].
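When the composition is itself random, the moments of d for a short sequence can be found by weighting every possible sequence by its binomial probability. The sketch below is illustrative: it assumes r independent items, each of the first kind with probability p, and compares the enumeration with the closed forms E(d) = 1 + 2(r − 1)pq and var(d) = 2pq[2r − 3 − 2pq(3r − 5)], which are assumed here. The value p = 1/10 is hypothetical.

```python
from fractions import Fraction
from itertools import groupby, product

def exact_moments(r, p):
    """Mean and variance of d for r independent items, Pr(first kind) = p."""
    q = 1 - p
    mean = Fraction(0)
    second = Fraction(0)
    for seq in product((True, False), repeat=r):
        k = sum(seq)                          # items of the first kind
        weight = p ** k * q ** (r - k)        # probability of this sequence
        d = sum(1 for _ in groupby(seq))
        mean += weight * d
        second += weight * d * d
    return mean, second - mean ** 2

def formula_moments(r, p):
    """Closed forms assumed for the unconditional mean and variance (r >= 2)."""
    pq = p * (1 - p)
    return 1 + 2 * (r - 1) * pq, 2 * pq * (2 * r - 3 - 2 * pq * (3 * r - 5))

p = Fraction(1, 10)        # hypothetical proportion, for illustration only
for r in (2, 5, 10):
    assert exact_moments(r, p) == formula_moments(r, p)
    print(r, formula_moments(r, p))
```

In a quality control setting the observed d for r consecutive items would then be compared with these moments, using the normal approximation when r is large.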

Runs up and down

Given a sequence of numbers, a sign, either + or −, can be attached to each number other than the first, the sign being + if the current number is above the previous number numerically or − if it is below. Using this sequence of +’s and −’s, one can count the number of runs (d’). If the original sequence of numbers was such that long sequences of roughly equal numbers were placed together, the number of runs as defined here would be large, since there would be a considerable element of chance as to whether two adjacent numbers were in ascending or descending order of magnitude. If there is, however, a gentle rise followed by a gentle fall in the original sequence, perhaps repeated several times, then the number of runs will be much smaller. Hence, the number of runs calculated in this manner is a measure of persistence of the trend. The probability distribution is a complicated one (details are given in David & Barton 1962, p. 154), but the first two moments of it can be evaluated fairly readily and give

$$E(d') = \frac{2n - 1}{3} \tag{9}$$

and

$$\operatorname{var}(d') = \frac{16n - 29}{90}, \tag{10}$$

where n is the total number of original observations, giving n − 1 signs in the sequence. Again it may be shown that for large values of n, the distribution tends to normality.
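The moments of d’ can be checked for short sequences by running through every ordering of n distinct values. The sketch below is illustrative and assumes the forms (2n − 1)/3 and (16n − 29)/90 for (9) and (10); the variance form holds for n of 4 or more. The function names are not from the source.

```python
from fractions import Fraction
from itertools import groupby, permutations

def runs_up_down(values):
    """Number of runs in the signs of successive differences (no ties)."""
    signs = [b > a for a, b in zip(values, values[1:])]
    return sum(1 for _ in groupby(signs))

def exact_moments(n):
    """Mean and variance of d' over all n! equally likely orderings."""
    counts = [runs_up_down(p) for p in permutations(range(n))]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    return mean, Fraction(sum(c * c for c in counts), total) - mean ** 2

for n in (4, 5, 6, 7):
    expected = (Fraction(2 * n - 1, 3), Fraction(16 * n - 29, 90))
    assert exact_moments(n) == expected
    print(n, expected)
```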

Example—runs up and down. The data in Table 1 are used again; in this example the aim is to see whether the directions of movement in water level tend to persist. It would be possible for high and low values to cluster together simply as a result of a few large changes, with changes of direction within the clusters varying as for independent observations.

To carry out a suitable test of the hypothesis that there is no persistence in the directions of the measurements, a sign was inserted against the maximum level for each year except the first. The sign is + if the level was higher than for the preceding year, − if it was lower, 0 if it was the same. (In this case the total number of runs, d’, would be the same whether the 0 was considered as + or as −.) The longest movements in one direction were the two five-year declines from 1861 to 1866 and from 1886 to 1891. There were two four-year movements, the rise from 1879 to 1883 and the decline from 1922 to 1926.

The total number of runs, d’, was 48. On the hypothesis that the + and − signs were arranged at random, the distribution of d’ is approximately normal with moments from (9) and (10): E(d’) = 63.67 and σ(d’) = 4.10. Testing the observed against the expected value of d’, the quantity

$$\frac{d' + 0.5 - E(d')}{\sigma(d')} = \frac{48 + 0.5 - 63.67}{4.10} \approx -3.70$$

is referred to a unit normal distribution. A one-tailed probability is required, since the alternative hypothesis is the one-sided one that there will be fewer runs than the null hypothesis indicates. The probability of a value of −3.7 or smaller arising by chance if there were no real persistence of movement in the same direction would be less than 0.001. Thus, the lake evidently has a tendency to move consecutively in the same direction more often than would be the case with independent observations. The clustering of high and low values, therefore, is not due (at least not exclusively) to a few large changes but in some part to cumulative movements up and down.
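Both run counts quoted for the lake data can be recomputed from the levels in Table 1. The Python sketch below is illustrative only: the list `levels` is transcribed from Table 1 (1860 to 1955), the variable names are not from the source, the tie in 1926 is treated as a minus, and the moments assume the forms 2r₁r₂/r + 1 and 2r₁r₂(2r₁r₂ − r)/[r²(r − 1)] for runs above and below the cut, and (2n − 1)/3 and (16n − 29)/90 for runs up and down.

```python
from itertools import groupby
from math import sqrt

levels = [
    583.3, 583.5, 583.2, 582.6, 582.2, 582.1, 581.7, 582.2, 581.6, 582.1,
    582.7, 582.8, 581.5, 582.2, 582.3, 582.1, 583.6, 582.7, 582.5, 581.5,
    582.1, 582.2, 582.6, 583.3, 583.1, 583.3, 583.7, 582.9, 582.3, 581.8,
    581.6, 580.9, 581.0, 581.3, 581.4, 580.2, 580.0, 580.85, 580.83, 581.1,
    580.7, 581.1, 580.83, 580.82, 581.5, 581.6, 581.5, 581.6, 581.8, 581.1,
    580.5, 580.0, 580.7, 581.3, 580.7, 580.0, 581.1, 581.87, 581.91, 581.3,
    581.0, 580.5, 580.6, 579.8, 579.6, 578.49, 578.49, 579.6, 580.6, 582.3,
    581.2, 579.1, 578.6, 578.7, 578.0, 578.6, 578.7, 578.6, 579.7, 580.0,
    579.3, 579.0, 580.2, 581.5, 580.8, 581.00, 580.96, 581.1, 580.8, 579.7,
    580.0, 581.6, 582.7, 582.1, 581.7, 581.5,
]

def count_runs(seq):
    return sum(1 for _ in groupby(seq))

n = len(levels)

# Runs above and below the cut: A is 581.3 feet or more.
above = [x >= 581.3 for x in levels]
r_a = sum(above)
r_b = n - r_a
d = count_runs(above)
mean_d = 2 * r_a * r_b / n + 1
sd_d = sqrt(2 * r_a * r_b * (2 * r_a * r_b - n) / (n * n * (n - 1)))
z_d = (d + 0.5 - mean_d) / sd_d

# Runs up and down: sign of each year-to-year change (a tie counts as a fall).
rises = [b > a for a, b in zip(levels, levels[1:])]
d_ud = count_runs(rises)
mean_ud = (2 * n - 1) / 3
sd_ud = sqrt((16 * n - 29) / 90)
z_ud = (d_ud + 0.5 - mean_ud) / sd_ud

print(f"A={r_a}, B={r_b}, d={d}, z={z_d:.2f}")
print(f"d'={d_ud}, E(d')={mean_ud:.2f}, z={z_ud:.2f}")
```

With these data the sketch gives 49 A’s, 47 B’s, d = 15, and d’ = 48, with standardized values of about −6.87 and −3.7.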

Test for two populations

Let x₁, x₂, . . ., x_n₁ be an ordered sample from a population with probability density f(x), and let y₁, y₂, . . ., y_n₂ be a second ordered sample from a second population. Let the two samples be combined and arranged in order of magnitude; thus, for example, one might have

x₁ < y₁ < y₂ < x₂ < y₃ < x₃ < x₄ < x₅ < y₄ < . . .

To test the null hypothesis that the two samples come from the same population, let the two samples be combined as above, define a run as a sequence of letters of the same kind bounded by letters of the other kind, and count the total number of runs, d.

Thus, in the above example the first element forms a run, the second and third a run, the fourth a run, the fifth a run, the sixth to eighth a run, and so on. The first and last elements are bounded only on one side for this purpose so that if only the nine elements given here were involved, there would be six runs.

It is apparent that if the two samples are from the same population, the x’s and y’s will ordinarily be well mixed and d will be large. If the two populations are widely separated so that their ranges do not overlap at all, the value of d will be only two and, in general, differences between the populations will tend to produce low values of d. Thus, the two populations may have the same mean or median, but if the x population is relatively concentrated compared to the y population, there will be a long y run at each end of the combined sample and there will thus tend to be a low value of d.

The test is performed by observing the total number of runs in the combined sample, accepting the null hypothesis if d is greater than some specified number d₀, rejecting the null hypothesis otherwise. To do this, the first two moments of the null distribution of d, given in (1) and (2) above, may be used. This run test, originated by Wald and Wolfowitz (1940), is sensitive to differences in both shape and location between the two populations and is described in more detail in Mood (1950).

Example—two-sample test. This example of the two-sample test is taken from Moore (1959). Two random samples, each of 14 observations, were available, and it was desired to test whether they could be considered to have come from the same population or not.

Sample A:	9.1,	9.5,	10.0,	10.4,	10.7,
	9.0,	9.5,	12.2,	11.8,	12.0,
	10.2,	11.5,	11.6,	11.2
Sample B:	9.4,	11.3,	10.9,	9.4,	9.2,
	8.5,	8.9,	11.4,	11.1,	9.3,
	9.8,	9.6,	8.0,	9.7

Arranging these in numerically ascending order gave the following (where sample B is shown in italics):

*8.0*,	*8.5*,	*8.9*,	9.0,	9.1,	*9.2*,	*9.3*,
*9.4*,	*9.4*,	9.5,	9.5,	*9.6*,	*9.7*,	*9.8*,
10.0,	10.2,	10.4,	10.7,	*10.9*,	*11.1*,	11.2,
*11.3*,	*11.4*,	11.5,	11.6,	11.8,	12.0,	12.2

A straightforward count shows that there are 10 runs; hence, d = 10. From (1) and (2), with r₁ = r₂ = 14 and r = 28, E(d) = 15 and var(d) = 6.74. Testing the null hypothesis that there is no difference between the two samples, the quantity

$$\frac{d + 0.5 - E(d)}{\sqrt{\operatorname{var}(d)}} = \frac{10 + 0.5 - 15}{\sqrt{6.74}} \approx -1.7$$

is referred to the unit normal distribution. The probability of that value, or a smaller value, arising by chance if there were no difference between the two samples is 0.045; that is, the difference is significant at the 5 per cent level but not at the 1 per cent level, and thus some doubt is thrown on the hypothesis that the two samples come from the same population. A standard two-sample t-test for samples from normal populations (see, for example, Moore & Edwards 1965, pp. 9-10) gives a similar indication.
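The two-sample computation can be sketched in a few lines of Python. The sketch is illustrative only: the two lists are the samples given above, the variable names are not from the source, and the normal approximation with the 0.5 continuity correction is assumed.

```python
from itertools import groupby
from math import sqrt
from statistics import NormalDist

sample_a = [9.1, 9.5, 10.0, 10.4, 10.7, 9.0, 9.5,
            12.2, 11.8, 12.0, 10.2, 11.5, 11.6, 11.2]
sample_b = [9.4, 11.3, 10.9, 9.4, 9.2, 8.5, 8.9,
            11.4, 11.1, 9.3, 9.8, 9.6, 8.0, 9.7]

# Pool the observations, sort by value, and keep only the sample labels.
pooled = sorted([(x, "A") for x in sample_a] + [(y, "B") for y in sample_b])
labels = [label for _, label in pooled]
d = sum(1 for _ in groupby(labels))              # number of runs of labels

r1, r2 = len(sample_a), len(sample_b)
r = r1 + r2
mean = 2 * r1 * r2 / r + 1                                   # equation (1)
var = 2 * r1 * r2 * (2 * r1 * r2 - r) / (r * r * (r - 1))    # equation (2)
z = (d + 0.5 - mean) / sqrt(var)                 # with continuity correction
p = NormalDist().cdf(z)                          # few runs count against H0

print(f"d={d}, E(d)={mean:.0f}, var(d)={var:.2f}, z={z:.2f}, p={p:.3f}")
```

The count gives d = 10 and a standardized value of about −1.73. The corresponding lower-tail probability is about 0.042, close to the 0.045 quoted above, which appears to correspond to the value rounded to −1.7. No value is shared between the two samples here, so the ordering of labels is unambiguous.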

Other run tests

Various other procedures based on runs have been proposed, a few of which are mentioned here. Length of runs as a criterion is discussed by Takashima (1955). The use of the longest run up or down (that is, numerically rising or falling) is discussed by Olmstead (1946), who gives a table to facilitate the use of his test. Mosteller (1941) gives a test on the use of long runs of observations above or below the median and discusses its use in the interpretation of quality control charts. Mann (1945) discusses a test for randomness based on the number of runs up and down. In testing goodness of fit, runs of signs of cell deviations from the null hypothesis have been studied (see, for example, Walsh 1962-1965, vol. 1, pp. 450-451, 463-465).

An extension of runlike concepts is used by Campbell, Kruskal, and Wallace (1966) to study the distribution of adjacencies between Negro and white students seating themselves in a classroom.

The power of run tests has been investigated by a number of writers. For example, David (1947) studies the power of grouping tests for randomness. Bateman (1948) investigates the power of the longest run test. Levene (1952) studies the power of tests for randomness based on runs up and down. The power of the two-sample tests above depends upon the kind of alternative that is being considered. In instances where the alternative can be precisely specified, there may be other tests, not primarily based on runs, that have better power.

Ties. Except in one of the above examples, where the course of action was obvious, the possibility of ties, that is, a pair of observations being precisely equal, has been ignored. Hence, observations could always be ordered uniquely without ties and the various statistics uniquely defined. However, in practice, observations are measured only to a few significant figures and ties will therefore sometimes occur. The simplest way to deal with this possibility in the situations discussed here is to assume that the tied observations are ordered at random. The appropriate tests are then carried out as before. Discussions of the problems of ties are given by Kendall (1948) and Kruskal (1952).

Peter G. Moore

BIBLIOGRAPHY

Bateman, G. 1948 On the Power Function of the Longest Run as a Test for Randomness in a Sequence of Alternatives. Biometrika 35:97-112.

Campbell, Donald T.; Kruskal, William H.; and Wallace, William P. 1966 Seating Aggregation as an Index of Attitude. Sociometry 29, no. 1:1-15.

David, F. N. 1947 A Power Function for Tests of Randomness in a Sequence of Alternatives. Biometrika 34:335-339.

David, F. N.; and Barton, D. E. 1957 Multiple Runs. Biometrika 44:168-178.

David, F. N.; and Barton, D. E. 1962 Combinatorial Chance. New York: Hafner.

Fisher, R. A. (1925) 1958 Statistical Methods for Research Workers. 13th ed. New York: Hafner. → Previous editions were also published by Oliver and Boyd.

Kendall, Maurice G. (1948) 1963 Rank Correlation Methods. 3d ed., rev. & enl. London: Griffin; New York: Hafner.

Krishna Iyer, P. V. 1948 The Theory of Probability Distributions of Points on a Line. Journal of the Indian Society of Agricultural Statistics 1:173-195.

Kruskal, William H. 1952 A Non-parametric Test for the Several Sample Problem. Annals of Mathematical Statistics 23:525-540.

Levene, Howard 1952 On the Power Function of Tests of Randomness Based on Runs Up and Down. Annals of Mathematical Statistics 23:34-56.

Ludwig, Otto 1956 Über die stochastische Theorie der Merkmalsiterationen. Mitteilungsblatt für mathematische Statistik und ihre Anwendungsgebiete 8:49-82.

Mann, Henry B. 1945 On a Test for Randomness Based on Signs of Differences. Annals of Mathematical Statistics 16:193-199.

Mood, Alexander M. 1940 The Distribution Theory of Runs. Annals of Mathematical Statistics 11:367-392.

Mood, Alexander M. 1950 Introduction to the Theory of Statistics. New York: McGraw-Hill. → A second edition, by Mood and Franklin A. Graybill, was published in 1963.

Moore, Peter G. 1959 Some Approximate Statistical Tests. Operational Research Quarterly 10:41-48.

Moore, Peter G.; and Edwards, D. E. 1965 Standard Statistical Calculations. London: Pitman.

Mosteller, Frederick 1941 Note on an Application of Runs to Quality Control Charts. Annals of Mathematical Statistics 12:228-232.

Naroll, Raoul; and D’Andrade, Roy G. 1963 Two Further Solutions to Galton’s Problem. American Anthropologist New Series 65:1053-1067.

Olmstead, P. S. 1946 Distribution of Sample Arrangements for Runs Up and Down. Annals of Mathematical Statistics 17:24-33.

Pearson, Karl 1897 The Chances of Death and Other Studies in Evolution. Vol. 1. New York: Arnold.

Solterer, J. 1941 A Sequence of Historical Random Events: Do Jesuits Die in Three’s? Journal of the American Statistical Association 36:477-484.

Stevens, W. L. 1939 Distribution of Groups in a Sequence of Alternatives. Annals of Eugenics 9:10-17.

Takashima, Michio 1955 Tables for Testing Randomness by Means of Lengths of Runs. Bulletin of Mathematical Statistics 6:17-23.

Wald, Abraham; and Wolfowitz, J. (1940) 1955 On a Test Whether Two Samples Are From the Same Population. Pages 120-135 in Abraham Wald, Selected Papers in Statistics and Probability. New York: McGraw-Hill. → First published in Volume 11 of the Annals of Mathematical Statistics.

Wallis, W. Allen; and Moore, Geoffrey H. 1941a A Significance Test for Time Series Analysis. Journal of the American Statistical Association 36:401-409.

Wallis, W. Allen; and Moore, Geoffrey H. 1941b A Significance Test for Time Series. Technical Paper No. 1. New York: National Bureau of Economic Research.

Wallis, W. Allen; and Moore, Geoffrey H. 1943 Time Series Significance Tests Based on Signs of Differences. Journal of the American Statistical Association 38:153-164.

Wallis, W. Allen; and Roberts, Harry V. 1956 Statistics: A New Approach. Glencoe, Ill.: Free Press. → A revised and abridged paperback edition of the first section was published by Collier in 1962 as The Nature of Statistics.

Walsh, John E. 1962-1965 Handbook of Nonparametric Statistics. 2 vols. Princeton, N.J.: Van Nostrand. → Volume 1: Investigation of Randomness, Moments, Percentiles, and Distributions. Volume 2: Results for Two and Several Sample Problems, Symmetry and Extremes.

Whitworth, William Allen (1867) 1959 Choice and Chance, With 1000 Exercises. New York: Hafner. → The first and subsequent editions were published in Cambridge by Bell.

Whitworth, William Allen (1897) 1965 DCC Exercises: Including Hints for the Solution of All the Questions in Choice and Chance. New York: Hafner. → A reprint of the first edition published in Cambridge by Bell.

IV RANKING METHODS

A ranking is an ordering of individuals or objects according to some characteristic of interest. If there are n objects, it is natural to assign to them the ranks 1, 2, . . ., n, with 1 assigned to the object ranked highest. In this article it is, however, more convenient to employ the opposite and completely equivalent convention of assigning rank 1 to the lowest and rank n to the highest object. Ranking methods are concerned with statistics (rank-order statistics) constructed from the ranks, usually in random samples of observations. An ordinal scale of measurement clearly suffices for the calculation of such statistics, but ranking methods are also frequently used even when meaningfully numerical measurements are available. In cases when such measurements are available the ordered observations may be denoted by x_(1) ≤ x_(2) ≤ . . . ≤ x_(n), where x_(i) (i = 1, 2, . . ., n) is the ith order statistic. Only the rank i of x_(i) occurs in rank-order statistics; use of x_(i) itself leads to order statistics [see Nonparametric Statistics, article on Order Statistics].

Objects which are neighbors in the ranking always differ in rank by 1, although it may reasonably be objected that the actual difference between neighboring objects is often greater near the ends of a ranking than in the middle. This difficulty can be overcome by the use of scores, a score being a suitably chosen function of the corresponding rank. However, the objection turns out to have less force than it might appear to have, since in many important cases scoring has been found to be little more efficient than the simpler ranking.

It is clear that an unambiguous ranking is not possible when ties are present. Strictly speaking, ranking methods therefore require the assumption of an underlying continuous distribution so as to ensure zero probability for the occurrence of ties. In applications, ties will nevertheless insist on appearing. Commonly, each of the tied values is given the average rank of the tied group, a technique which leaves the sum of the ranks unchanged but reduces their variability. The effect is negligible if the number of ties is small, but in any case simple corrections can be applied to many of the procedures developed on the assumption of no ties.
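The average-rank convention for ties can be sketched as follows. The Python code is illustrative only: the function name `midranks` and the data are hypothetical, and rank 1 is given to the lowest value, as in this article.

```python
def midranks(values):
    """Ranks 1..n (1 = lowest); tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1      # mean of ranks start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average
        start = end + 1
    return ranks

data = [3.1, 2.4, 3.1, 5.0, 2.4, 3.1]        # hypothetical observations
ranks = midranks(data)
print(ranks)                                 # [4.0, 1.5, 4.0, 6.0, 1.5, 4.0]
print(sum(ranks), sum(r * r for r in ranks)) # 21.0 88.5
```

The ranks still sum to n(n + 1)/2 = 21, but their sum of squares, 88.5, is below the value 91 obtained without ties, which is the reduction in variability mentioned above.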

Ranking methods are potentially useful in all fields of experimentation where measurements can be made on at least an ordinal scale. Much of the motivation for the development of the subject has in fact stemmed from the social sciences. To date, testing of hypotheses (for instance, identity of two or more populations) has been most fully explored. Point and interval estimation are sometimes possible, but usually the order statistics are needed in addition to the ranks. Ranks are also beginning to be used in sequential analysis and multiple-decision procedures. Ranking methods are not as flexible as parametric procedures developed on the assumption of normality, but they have a much greater range of validity, since they do not generally require knowledge of the underlying distribution other than its continuity (but see, for example, “Estimation,” below). Most ranking methods therefore fall under the wider heading of nonparametric statistics and play a major role in that subject.

When numerical measurements are replaced by their ranks, the question arises of how much “information” is thereby lost. Perhaps contrary to intuition, the loss in efficiency or power of the standard rank tests compared to the best corresponding parametric tests is usually quite small. The loss hinges on several factors: the particular test used, the nature of the alternative to the null hypothesis under test, and the sample size. Since the best parametric test in a given situation depends on the often uncertain form of the underlying distribution, one should also compare the performance of the two tests for other distributional forms which may reasonably occur. Whereas the significance level of the rank test remains quite unchanged, that of the parametric test may be seriously upset. Even when the parametric test is not too sensitive in this respect it can easily become inferior in power to the rank test. A more obvious point in favor of ranking methods is that they are much less affected by spurious or wild observations.

The following account begins with two-sample problems and then turns to several-sample situations and rank correlation. [Nonparametric one-sample procedures, including ranking methods, are considered in Nonparametric Statistics, article on The Field.] The emphasis is on standard tests, but some estimation procedures are described, and reference is made to several multiple-decision procedures.

Two-sample problems

Let x₁, x₂, . . ., x_m and y₁, y₂, . . ., y_n be independent observations made respectively on the continuous random variables X with density f(x) and Y with density g(y). Consider the important question: Do the x’s differ significantly from the y’s; that is, do the observations throw serious doubt on the null hypothesis that f = g? The statistical test most appropriate for answering this question will depend on the alternative the experimenter has in mind. Often the y’s are expected, on the whole, to be larger than the x’s; that is, if p = Pr(Y > X), the null hypothesis H₀: f = g, for which p = 1/2, is tested against the alternative, H₁: p > 1/2. A special case of this alternative occurs when g(y) is merely f(x) shifted to the right, so that the two populations differ in location only. Other alternatives of interest are ₁H: p < 1/2 and the two-sided alternative H₂: p ≠ 1/2.

Note that H₀ specifies identity of f and g, not merely that the distributions have the same median (see Pratt 1964) or the same spread.

Wilcoxon’s two-sample test

Wilcoxon’s two-sample test enables one to test H₀ without knowledge of the common (under the null hypothesis) functional form of f and g. Arrange the m + n observations (w_i) in combined ascending order of magnitude, giving w₁ ≤ w₂ ≤ ... ≤ w_{m+n}. The Wilcoxon (1945) statistic is the sum, R_x, of the ranks of those w’s that came from f(x). (For an equivalent counting procedure see Mann & Whitney 1947.) If H₀ is true, each observation has expected rank given by [1 + 2 + ... + (m + n)]/(m + n) = ½(m + n + 1), so that the expected value of R_x is ½m(m + n + 1). When H₁ is the alternative under consideration, values of R_x much smaller than this lead to the rejection of H₀ in favor of H₁. For a two-sided test, H₀ is rejected when R_x differs from E(R_x) by too much in either direction. Actual significance points are given in many sources, for example, by Siegel and Tukey (1960) for m ≤ n ≤ 20. Outside this range a normal approximation, possibly with continuity correction, is adequate unless m and n differ greatly. Simply treat

$$ \frac{R_x - \tfrac{1}{2}m(m+n+1)}{\sqrt{mn(m+n+1)/12}} $$

as approximately a unit normal deviate.

An example. The amount of aggression attributed to characters in a film by nine members of each of two populations resulted in the following scores (Siegel & Tukey 1960):

x: 25 5 14 19 0 17 15 8 8

y: 12 16 6 13 13 3 10 10 11

These are combined into the w-series, with their origin noted as in Table 1. (The modified ranks will be discussed below.) Here R_x = 1 + 3 + 5 + 6 + 13 + 14 + 16 + 17 + 18 = 93 and E(R_x) = ½ · 9 · 19 = 85.5. Because R_x is this close to E(R_x), there is evidently no reason for rejecting H₀. Tables show that for a two-sided test R_x would have to be as small as 62 or, by symmetry, as large as 109 to be statistically significant at the 5 per cent level.

*Table 1 — Aggression scores and rankings*
SCORE	SAMPLE	RANK	MODIFIED RANK
0	x	1	1
3	y	2	4
5	x	3	5
6	y	4	8
8	x	5	9
8	x	6	12
10	y	7	13
10	y	8	16
11	y	9	17
12	y	10	18
13	y	11	15
13	y	12	14
14	x	13	11
15	x	14	10
16	y	15	7
17	x	16	6
19	x	17	3
25	x	18	2
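The rank sum for the aggression scores can be checked with a few lines of code. This is a minimal sketch: the function name `rank_sum` is hypothetical, and the normal deviate uses the standard null variance mn(m + n + 1)/12 without a continuity correction.

```python
import math

x = [25, 5, 14, 19, 0, 17, 15, 8, 8]
y = [12, 16, 6, 13, 13, 3, 10, 10, 11]


def rank_sum(first, second):
    """Sum of the ranks of `first` in the combined ascending ordering.

    Tied values keep their input order (stable sort). In these data ties
    occur only within a sample, so the rank sum is unaffected.
    """
    labelled = sorted(
        [(v, "x") for v in first] + [(v, "y") for v in second],
        key=lambda pair: pair[0],
    )
    return sum(rank for rank, (_, label) in enumerate(labelled, start=1) if label == "x")


m, n = len(x), len(y)
r_x = rank_sum(x, y)
expected = m * (m + n + 1) / 2
variance = m * n * (m + n + 1) / 12
z = (r_x - expected) / math.sqrt(variance)
print(r_x, expected, round(z, 2))  # 93 85.5 0.66
```

A deviate of about 0.66 is well inside the usual two-sided 5 per cent limits of ±1.96, in agreement with the tabled significance points quoted above.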

Test of relative spread. Although the foregoing example was used to illustrate Wilcoxon’s test for differences in location, the alternative of interest was, in fact, that the two populations differ in variability. An appropriate test can be made by assigning in place of the usual ranks the modified ranks shown in the last column of Table 1, which are arranged so that low ranks are applied to extreme observations (both high and low), whereas high ranks are applied to the more central observations. Clearly, sufficiently low or high values of the sum R′_x of modified ranks will again lead to rejection of H₀. As before, there are three alternatives, say H′₁, ₁H′, and H′₂, where H′₁ specifies that the x’s are more spread out than the y’s, etc. Precisely the same significance points apply as for the ordinary Wilcoxon test, since the ranks have merely been reallocated to produce a test statistic sensitive to the alternative under consideration. (Indeed, other alternatives, H″, can be tested by an assignment of ranks which tends to make R″ small or large if H″ is true. As usual, the alternative should be specified prior to the experiment.)

Here R′_x = 1 + 5 + 9 + 12 + 11 + 10 + 6 + 3 + 2 = 59, which is significant at the 5 per cent (1 per cent) level against H′₂ (H′₁).

Unless the two populations have the same median or can be changed so that they do, interpretation of this test may be difficult.
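The modified ranks of Table 1 follow a simple pattern: rank 1 goes to the smallest observation, ranks 2 and 3 to the two largest, ranks 4 and 5 to the next two smallest, and so on. The sketch below reproduces that allocation and the modified rank sum for the x sample; the function name `siegel_tukey_ranks` is hypothetical, and the pattern is inferred from the table itself.

```python
def siegel_tukey_ranks(n_total):
    """Modified ranks for positions 0..n_total-1 of the ascending combined sample."""
    ranks = [0] * n_total
    lo, hi = 0, n_total - 1
    rank = 1
    ranks[lo] = rank  # rank 1 goes to the smallest observation
    lo += 1
    take_high = True
    while lo <= hi:
        for _ in range(2):  # further ranks go out in pairs, alternating ends
            if lo > hi:
                break
            rank += 1
            if take_high:
                ranks[hi] = rank
                hi -= 1
            else:
                ranks[lo] = rank
                lo += 1
        take_high = not take_high
    return ranks


x = [25, 5, 14, 19, 0, 17, 15, 8, 8]
y = [12, 16, 6, 13, 13, 3, 10, 10, 11]
combined = sorted([(v, "x") for v in x] + [(v, "y") for v in y], key=lambda pair: pair[0])
modified = siegel_tukey_ranks(len(combined))
r_x_modified = sum(r for r, (_, label) in zip(modified, combined) if label == "x")

print(modified)      # [1, 4, 5, 8, 9, 12, 13, 16, 17, 18, 15, 14, 11, 10, 7, 6, 3, 2]
print(r_x_modified)  # 59
```

Because the modified ranks are only a reallocation of 1, 2, . . ., 18, the same tables of significance points serve for both statistics.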

Several-sample problems

One-way classification

Wilcoxon’s procedure can be generalized to produce a test of the identity of several populations. Given the N observations x_ij (i = 1, 2, . . ., k; j = 1, 2, . . ., n_i; Σn_i = N), arrange them in common ascending order. Let R_i denote the sum of the ranks of those n_i observations originating from the ith population. If R̄_i = R_i/n_i, it is clear that large variations in these mean ranks will cast suspicion on the null hypothesis of equality of all k populations. A suitably standardized weighted sum-of-squares test statistic suggests itself. Kruskal and Wallis (1952) show that except for very small n_i,

$$ H = \frac{N-1}{N} \sum_{i=1}^{k} \frac{n_i \left[\bar{R}_i - \tfrac{1}{2}(N+1)\right]^2}{(N^2-1)/12} = \frac{12}{N(N+1)} \sum_{i=1}^{k} \frac{R_i^2}{n_i} - 3(N+1) \tag{1} $$

is distributed (under the null hypothesis of identity of the populations) approximately as chi-square with k − 1 degrees of freedom. The last form of (1) is a convenient computing formula. In the central form it may be noted that ½(N + 1) is the mean and (N² − 1)/12 the variance of a randomly chosen rank. In fact, the procedure is essentially a one-way analysis of variance performed on the ranks as variables.
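As an illustration, the Kruskal-Wallis statistic can be computed from the rank sums with the usual computing formula, 12/[N(N + 1)] · ΣR_i²/n_i − 3(N + 1). The sketch below assumes no tied observations; the function name and the three small samples are hypothetical and do not come from the article.

```python
def kruskal_wallis_h(samples):
    """Return (H, rank sums) using 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1).

    Assumes no tied observations, so plain ranks 1..N can be used.
    """
    pooled = sorted(
        (value, group) for group, sample in enumerate(samples) for value in sample
    )
    n_total = len(pooled)
    rank_sums = [0] * len(samples)
    for rank, (_, group) in enumerate(pooled, start=1):
        rank_sums[group] += rank
    weighted = sum(r * r / len(s) for r, s in zip(rank_sums, samples))
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h, rank_sums


# Hypothetical measurements from k = 3 populations.
samples = [[27, 31, 35, 40], [22, 25, 29, 33], [18, 20, 24, 26]]
h, rank_sums = kruskal_wallis_h(samples)
print(rank_sums)    # [39, 26, 13]
print(round(h, 2))  # 6.5
```

The value would be referred to chi-square with k − 1 = 2 degrees of freedom, bearing in mind that with samples this small the approximation is rough and exact tables are preferable.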

Two-way classification

Friedman (1937) was concerned with data giving the standard deviations of expenditures on n = 14 different categories of products at k = 7 income levels. The problem was to determine whether the standard deviations differed significantly over income levels. The “problem of n rankings” considered by Kendall and Babington Smith (1939) can be handled in the same way and the procedure will be illustrated with such data. The n = 4 observers were asked to rank k = 6 objects, with the results shown in Table 2. An approximate test of equality among the objects is obtained by referring

$$ \chi_r^2 = \frac{12n}{k(k+1)} \sum_{i=1}^{k} \left[\bar{r}_i - \tfrac{1}{2}(k+1)\right]^2 = \frac{12}{nk(k+1)} \sum_{i=1}^{k} r_i^2 - 3n(k+1) $$

to tables of chi-square with k − 1 degrees of freedom. A measure of agreement between observers is given by the coefficient of concordance, denoted by

$$ W = \frac{12}{k(k^2-1)} \sum_{i=1}^{k} \left[\bar{r}_i - \tfrac{1}{2}(k+1)\right]^2 = \frac{\chi_r^2}{n(k-1)}, $$

defined so as to have range (0, 1), with W = 1 corresponding to complete agreement. (In these expressions r_i is the total of the ranks given to the ith object and r̄_i = r_i/n.)

In this example χ²_r = 4.57, which is clearly not significant, and W = 0.23.
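The figures for Table 2 can be verified directly. The sketch below assumes the usual forms of Friedman's statistic, 12S/[nk(k + 1)] with S the sum of squared deviations of the rank totals from their mean n(k + 1)/2, and of the coefficient of concordance, W = χ²_r/[n(k − 1)]; the variable names are illustrative.

```python
rankings = {
    "P": [5, 4, 1, 6, 3, 2],
    "Q": [2, 3, 1, 5, 6, 4],
    "R": [4, 1, 6, 3, 2, 5],
    "S": [4, 3, 2, 5, 1, 6],
}

n = len(rankings)                       # number of observers
k = len(next(iter(rankings.values())))  # number of objects
totals = [sum(column) for column in zip(*rankings.values())]

mean_total = n * (k + 1) / 2            # expected rank total under the null hypothesis
s = sum((t - mean_total) ** 2 for t in totals)
chi_r = 12 * s / (n * k * (k + 1))      # Friedman's statistic
w = chi_r / (n * (k - 1))               # coefficient of concordance

print(totals)                           # [15, 11, 10, 19, 12, 17]
print(round(chi_r, 2), round(w, 2))     # 4.57 0.23
```

With k − 1 = 5 degrees of freedom the upper 5 per cent point of chi-square is 11.07, so a value near 4.6 gives no evidence of differences among the objects.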

In cases where the chi-square statistic is significantly large, that may be because one or more of the objects are “outliers,” that is, come from a population different from the bulk of the objects. Tables for the detection of such outliers have been provided by Doornbos and Prins (1958) and by Thompson and Willke (1963) [see Statistical Analysis, Special Problems of, article on Outliers].

*Table 2 — Results of four observers ranking six objects*
OBSERVER	OBJECT A	OBJECT B	OBJECT C	OBJECT D	OBJECT E	OBJECT F
P	5	4	1	6	3	2
Q	2	3	1	5	6	4
R	4	1	6	3	2	5
S	4	3	2	5	1	6
Totals r_i	15	11	10	19	12	17

Incomplete rankings

In a two-way classification, it is often impracticable to have each observer rank all the objects at one time. For example, if the objects are foods to be tasted, the block size is best restricted to 2 or 3, but each judge may be asked to perform several such tastings with intervening rest periods. By means of suitable balanced incomplete block designs it is still possible to treat this case by very similar methods [see Experimental Design; see also Durbin 1951].

The method of paired comparisons

Formally, the method of paired comparisons is a special case of incomplete ranking where the block size is 2. Each of the observers expresses a preference for one of the objects in every pair he judges. Ordinarily all possible pairwise comparisons are made, but fractional designs are also available.

The method has long interested psychologists (for “objects” read “stimuli”) and can be traced back to G. T. Fechner. It received fresh impetus from the work of L. L. Thurstone (1927), who supposed, in effect, that a particular observer’s response to object A_i can be characterized by a normal variable, Y_i, with true mean V_i. The probability that A_i is preferred to A_j (A_i → A_j) in direct comparison is then Pr(Y_i > Y_j). Thurstone distinguished five cases, of which the simplest (Case V) assumes that (a) observer differences may be ignored and that (b) the Y_i are independent (actually, equal correlation suffices) normal variates with common variance. In this situation it is easy to estimate the V_i and to test their equality (see Mosteller 1951; see also Scaling).

Distributions other than normal have also been proposed for the Y_i. Postulating a sech² density for Y_i − Y_j, an assumption close to that of normality, Bradley and Terry (1952) arrive at the model

$$ \Pr(A_i \to A_j) = \frac{\pi_i}{\pi_i + \pi_j}; $$

in other words, the odds on A_i → A_j are π_i to π_j. The π_i can be estimated by maximum likelihood. This approach can be extended to comparisons in triples and to multiple-choice situations (Luce 1959).
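One way to obtain the maximum-likelihood estimates numerically is a simple fixed-point iteration on the likelihood equations, sketched below. This particular iteration is a standard device and not necessarily the computation used by Bradley and Terry; the function name, the iteration count, and the win counts are hypothetical, and the π_i are normalized to sum to 1.

```python
def bradley_terry(wins, iterations=200):
    """Estimate pi_i from wins[i][j] = number of times A_i was preferred to A_j.

    Each pass applies pi_i <- W_i / sum_j n_ij / (pi_i + pi_j), where W_i is the
    total number of wins of A_i and n_ij the number of comparisons of the pair,
    and then rescales so that the pi_i sum to 1.
    """
    k = len(wins)
    pi = [1.0 / k] * k
    for _ in range(iterations):
        updated = []
        for i in range(k):
            total_wins = sum(wins[i])
            denominator = sum(
                (wins[i][j] + wins[j][i]) / (pi[i] + pi[j]) for j in range(k) if j != i
            )
            updated.append(total_wins / denominator)
        scale = sum(updated)
        pi = [value / scale for value in updated]
    return pi


# Hypothetical data: three objects, ten comparisons per pair.
wins = [
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
]
pi = bradley_terry(wins)
print([round(value, 3) for value in pi])
```

At convergence the fitted number of wins for each object, Σ_j n_ij π_i/(π_i + π_j), equals its observed number of wins, which is the defining property of the maximum-likelihood solution in this model.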

Implicit in these various models is the assumption that one-dimensional scaling of responses is appropriate. No such assumption is needed in the combinatorial approach of Kendall and Babington Smith (1940), who count the number of circular triads, that is, the number of times A_i → A_j, A_j → A_k, and yet A_k → A_i. A sufficiently low count leads to rejection of the null hypothesis, H₀, of equality among the objects. The following procedure (David 1963) is completely equivalent for a single observer and provides a very simple general test. Let a_i be the total number of times that A_i is preferred to other objects. Then, under H₀,

$$ T = \frac{4}{nk} \sum_{i=1}^{k} \left[a_i - \tfrac{1}{2}n(k-1)\right]^2 , $$

where n is the number of observers and k the number of objects, is distributed approximately as chi-square with k − 1 degrees of freedom. Large values of T lead to rejection of H₀, in which case it is possible to make more detailed statements about differences between the objects. The necessary multiple-decision procedures are analogous to those developed by Tukey and Scheffé for the separation of treatment means in the analysis of variance [see Linear Hypotheses, article on Multiple Comparisons].
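The link between circular triads and the scores a_i can be made concrete for a single observer. The sketch below counts circular triads by brute force, compares the count with the score-based expression k(k² − 1)/24 − ½Σ[a_i − ½(k − 1)]², and evaluates T on the assumption that it has the form 4Σ[a_i − ½n(k − 1)]²/(nk), with n observers each judging every pair. The preference matrix and function names are hypothetical.

```python
from itertools import combinations


def scores(pref):
    """a_i = total number of times A_i is preferred to other objects."""
    return [sum(row) for row in pref]


def david_t(pref, n_observers):
    """T = 4 * sum (a_i - n(k-1)/2)^2 / (n k); form assumed as stated in the lead-in."""
    k = len(pref)
    centre = n_observers * (k - 1) / 2
    return 4 * sum((a - centre) ** 2 for a in scores(pref)) / (n_observers * k)


def circular_triads(pref):
    """Brute-force count for a single observer (entries are 0 or 1)."""
    count = 0
    for i, j, l in combinations(range(len(pref)), 3):
        forward = pref[i][j] and pref[j][l] and pref[l][i]
        backward = pref[j][i] and pref[l][j] and pref[i][l]
        if forward or backward:
            count += 1
    return count


# Hypothetical single observer, k = 5 objects; pref[i][j] = 1 if A_i -> A_j.
pref = [
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
k = len(pref)
a = scores(pref)
from_scores = k * (k * k - 1) / 24 - 0.5 * sum((v - (k - 1) / 2) ** 2 for v in a)

print(a, circular_triads(pref), from_scores, david_t(pref, 1))
# [3, 3, 2, 1, 1] 3 3.0 3.2
```

Since the triad count is a decreasing linear function of Σ[a_i − ½(k − 1)]², a low count and a large T carry the same information for one observer.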

Rank correlation

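For n pairs of observations (x_i, y_i) the product-moment coefficient of correlation

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

provides a convenient index (−1 ≤ r ≤ 1) of the extent to which, as x increases, y increases (r > 0) or decreases (r < 0) linearly [see Multivariate Analysis, articles on Correlation].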

Spearman’s ρ. If r is applied to the ranks rather than to the observations, the result is Spearman’s coefficient of rank correlation, ρ, more readily computed from

$$ \rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}, $$

where d_i is the difference in rank between y_i and x_i. Here ρ = 1 corresponds to complete agreement and ρ = −1 to complete reversal in the two sets of ranks.

An example. Consider the rankings of observers P and Q in Table 2. There, d_i = 3, 1, 0, 1, −3, −2, so that Σd_i² = 24 and

$$ \rho = 1 - \frac{6 \times 24}{6^3 - 6} = 0.31, $$

indicating only slight agreement between P and Q. In fact, a value of Σd_i² that is 4 or smaller or 66 or larger would be needed for rejection of the null hypothesis, H₀, of independent rankings at a 5 per cent level of significance (Owen 1962). For n ≥ 12 an approximate test of H₀ can be made by treating ρ√(n − 1) as unit normal.
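The rank correlation between observers P and Q can be checked in two ways: from the rank differences, using 1 − 6Σd_i²/(n³ − n), and by applying the product-moment formula directly to the ranks. The following sketch does both; the helper name `pearson` is illustrative.

```python
import math

p_ranks = [5, 4, 1, 6, 3, 2]  # observer P, Table 2
q_ranks = [2, 3, 1, 5, 6, 4]  # observer Q, Table 2


def pearson(u, v):
    """Product-moment correlation of two equal-length sequences."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cross = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_v = sum((b - mean_v) ** 2 for b in v)
    return cross / math.sqrt(ss_u * ss_v)


n = len(p_ranks)
d = [p - q for p, q in zip(p_ranks, q_ranks)]
sum_d2 = sum(value * value for value in d)
rho = 1 - 6 * sum_d2 / (n ** 3 - n)

print(d, sum_d2)                                       # [3, 1, 0, 1, -3, -2] 24
print(round(rho, 2), round(pearson(p_ranks, q_ranks), 2))  # 0.31 0.31
```

The two routes agree exactly when there are no ties, which is why the shortcut based on the d_i is preferred for hand computation.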

Estimation. If the underlying distribution of (x, y) is bivariate normal, with correlation coefficient ρ₀, an estimate of ρ₀ which is unbiased in large samples is given by

$$ r_0 = 2 \sin\left(\tfrac{1}{6}\pi\rho\right). $$

In the example, r₀ = 0.33.
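The quoted value can be reproduced on the assumption that the estimate takes the standard form 2 sin(πρ/6), where ρ is the rank correlation of P and Q computed from Σd_i² = 24 with n = 6.

```python
import math

n = 6
sum_d2 = 24                           # from the rankings of P and Q in Table 2
rho = 1 - 6 * sum_d2 / (n ** 3 - n)   # Spearman's coefficient
r0 = 2 * math.sin(math.pi * rho / 6)  # assumed form of the estimate

print(round(rho, 3), round(r0, 2))    # 0.314 0.33
```

For small values of ρ the adjustment is slight, since 2 sin(πρ/6) is close to (π/3)ρ ≈ 1.05ρ.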

Test of trend. Given the observations y₁, y₂, . . ., y_n, an investigator may wonder whether some positional effect or trend is present. For example, if y_i is the reaction time of a subject to the ith presentation of a stimulus, fatigue could tend to increase the later observations. A possible test of randomness against such an alternative in this one-sample situation is to compare the ranking of the y_i with the natural ordering 1, 2, . . ., n, by means of ρ (with d_i = rank y_i − i). Clearly, a strong upward trend would lead to large values of ρ. Jonckheere (1954) treats the corresponding k-sample problem.
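A brief sketch of the trend test follows. The twelve reaction times are hypothetical, ties are assumed absent, and the large-sample deviate is taken to be ρ√(n − 1), the usual normal approximation for n of about 12 or more.

```python
import math


def ranks_no_ties(values):
    """Ranks 1..n for distinct values (ties are assumed absent)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


# Hypothetical reaction times (milliseconds) in order of presentation.
times = [310, 295, 322, 301, 330, 318, 345, 327, 352, 340, 361, 358]
n = len(times)

# d_i = rank(y_i) - i, comparing the ranking with the natural order 1, 2, ..., n.
d = [rank - i for i, rank in enumerate(ranks_no_ties(times), start=1)]
sum_d2 = sum(value * value for value in d)
rho = 1 - 6 * sum_d2 / (n ** 3 - n)
z = rho * math.sqrt(n - 1)  # approximate unit normal deviate under randomness

print(sum_d2, round(rho, 3))  # 36 0.874
print(z)
```

A large positive deviate points to an upward trend; a one-sided test is appropriate when, as with fatigue, the direction is specified in advance.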

Other rank correlation coefficients

The coefficient ρ is a member of a class of correlation coefficients which includes r, Kendall’s τ, and the Fisher-Yates coefficient, r_F, obtained by replacing x_i and y_i in r by their normal scores. The last three quantities are further discussed by Fieller and his associates (1957). In addition to the above applications, ρ, τ, and r_F may be used to test the equality of the true correlation coefficients in two bivariate normal populations and to investigate by means of partial rank correlation coefficients whether agreement between two rankings might be due to some extraneous factor (for example, Goodman 1959). For interpretations of various rank correlation coefficients see Kruskal (1958).

Herbert A. David

BIBLIOGRAPHY

Of the books in this bibliography, Kendall (1948) 1963 corresponds most closely to the coverage of this article. Owen 1962 includes a very useful section on ranking methods and gives many of the tables needed to supplement in small samples the approximate tests discussed. The subject of paired comparisons is treated in David 1963. Siegel 1956 gives a helpful elementary account of nonparametric statistics for the behavioral sciences.

Bradley, Ralph A.; and Terry, M. E. 1952 Rank Analysis of Incomplete Block Designs. I: The Method of Paired Comparisons. Biometrika 39:324-345.

David, Herbert A. 1963 The Method of Paired Comparisons. London: Griffin; New York: Hafner.

Doornbos, R.; and Prins, H. J. 1958 On Slippage Tests. III: Two Distribution-free Slippage Tests and Two Tables. Indagationes mathematicae 20:438-447.

Durbin, J. 1951 Incomplete Blocks in Ranking Experiments. British Journal of Psychology 4:85-90.

Fieller, E. C.; Hartley, H. O.; and Pearson, E. S. 1957 Tests for Rank Correlation Coefficients. Biometrika 44:470-481.

Friedman, Milton 1937 The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. Journal of the American Statistical Association 32:675-701.

Goodman, Leo A. 1959 Partial Tests for Partial Taus. Biometrika 46:425-432.

Jonckheere, A. R. 1954 A Distribution-free k-sample Test Against Ordered Alternatives. Biometrika 41:133-145.

Kendall, Maurice G. (1948) 1963 Rank Correlation Methods. 3d ed., rev. & enl. New York: Hafner; London: Griffin.

Kendall, Maurice G.; and Smith, B. Babington 1939 The Problem of m Rankings. Annals of Mathematical Statistics 10:275-287.

Kendall, Maurice G.; and Smith, B. Babington 1940 On the Method of Paired Comparisons. Biometrika 31:324-345.

Kruskal, William H. 1958 Ordinal Measures of Association. Journal of the American Statistical Association 53:814-861.

Kruskal, William H.; and Wallis, W. Allen 1952 Use of Ranks in One-criterion Variance Analysis. Journal of the American Statistical Association 47:583-621.

Luce, R. Duncan 1959 Individual Choice Behavior: A Theoretical Analysis. New York: Wiley.

Mann, Henry B.; and Whitney, D. R. 1947 On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other. Annals of Mathematical Statistics 18:50-60.

Mosteller, Frederick 1951 Remarks on the Method of Paired Comparisons. Psychometrika 16:3-9, 203-206, 207-218. → Part 1: The Least Squares Solution Assuming Equal Standard Deviations and Equal Correlations. Part 2: The Effect of an Aberrant Standard Deviation When Equal Standard Deviations and Equal Correlations Are Assumed. Part 3: A Test of Significance for Paired Comparisons When Equal Standard Deviations and Equal Correlations Are Assumed.

Owen, Donald B. 1962 Handbook of Statistical Tables. Reading, Mass.: Addison-Wesley. → A list of addenda and errata is available from the author.

Pratt, John W. 1964 Robustness of Some Procedures for the Two-sample Location Problem. Journal of the American Statistical Association 59:665-680.

Savage, I. Richard 1962 Bibliography of Nonparametric Statistics. Cambridge, Mass.: Harvard Univ. Press.

Siegel, Sidney 1956 Nonparametric Statistics for the Behavioral Sciences. New York: McGraw-Hill.

Siegel, Sidney; and Tukey, John W. 1960 A Nonparametric Sum of Ranks Procedure for Relative Spread in Unpaired Samples. Journal of the American Statistical Association 55:429-445.

Thompson, W. A. Jr.; and Willke, T. A. 1963 On an Extreme Rank Sum Test for Outliers. Biometrika 50:375-383.

Thurstone, L. L. 1927 A Law of Comparative Judgment. Psychological Review 34:273-286.

Wilcoxon, Frank 1945 Individual Comparisons by Ranking Methods. Biometrics 1:80-83.

International Encyclopedia of the Social Sciences