# Goodness of Fit

A goodness of fit procedure is a statistical test of a hypothesis that the sampled population is distributed in a specific way, for example, normally with mean 100 and standard deviation 15. Corresponding confidence procedures for a population distribution also fall under this topic. Related tests are for broader hypotheses, for example, that the sampled population is normal (without further specification). Others test hypotheses that two or more population distributions are the same.

Populations arise because of variability, of which various sources (sometimes acting together) can be distinguished. First, there is inherent variability among experimental units, for example, the heights, IQ’s, or ages of the students in a class each vary among themselves. Then there is measurement error, a more abstract or conceptual notion. The age of a student may have negligible measurement error, but his IQ does not; it depends on a host of accidental factors: how the student slept, the particular questions chosen for the test, and so on. There are also other conceptual populations, not properly thought of in terms of measurement error—the population of subject responses, for example, in the learning experiment below.

The distribution of a numerical population trait is often portrayed by a histogram, a density function, or some other device that shows the proportion of cases for which a particular value of the numerical trait is achieved (or the proportion within a small interval around a particular value). The shape of the histogram or density function is important; it may or may not be symmetrical. If it is not, it is said to be *skew.* If it is symmetrical, it may have a special kind of shape called *normal.* For example, populations of scores on intelligence tests are often assumed normally distributed by psychologists. Indeed, the construction of the test may aim at normality, at least for some group of individuals. Again, lifetimes of machines may be assumed to have negative exponential distributions, meaning that expected remaining life does not vary with age. [*See* Distributions, Statistical, *article on* special continuous distributions; Probability; Statistics, descriptive.]

It is technically often convenient, especially in connection with goodness of fit tests, to deal with the *cumulative distribution function* (c.d.f.) rather than with the density function. The c.d.f. evaluated at *x* is the proportion of cases with numerical values less than or equal to *x;* thus, if *f*(*x*) is a density function, the corresponding c.d.f. is

*F*(*x*) = ∫_{−∞}^{x} *f*(*t*) *dt*.

For explicitness, a subscript will be added to *F,* indicating the population, distribution, or random variable to which it applies. It is a matter of convention that cumulation is from the left and that it is based on “less than or equal to” rather than just “less than.”

The *sample c.d.f.* is the steplike function whose value at *x* is the proportion of observations less than or equal to *x.* Many goodness of fit procedures are based on geometrically suggested measures of discrepancy between sample and hypothetical population c.d.f.’s. Some informal procedures use “probability” graph paper, especially normal paper (on which a normal c.d.f. becomes a straight line).
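The sample c.d.f. just described is simple to compute directly; the following is a minimal sketch (Python, with invented observations) of the step function whose value at *x* is the proportion of observations not exceeding *x*.

```python
def sample_cdf(xs, x):
    """Sample c.d.f. F_n(x): proportion of observations less than or equal to x."""
    return sum(1 for v in xs if v <= x) / len(xs)

data = [2.1, 0.4, 3.3, 1.7, 2.9]   # illustrative observations
print(sample_cdf(data, 2.1))        # 3 of the 5 values are <= 2.1
```

Plotting `sample_cdf` against a hypothesized c.d.f. gives the geometric picture on which many of the discrepancy measures below are based.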

For nominal populations (for example, proportions of people expressing allegiance to different religions or to none) there is no concept corresponding to the c.d.f. The main emphasis of this article is on numerical populations.

Although goodness of fit procedures address themselves principally to the shape of population c.d.f.’s, the term “goodness of fit” is sometimes applied more generally than in this article. In particular, some authors write of goodness of fit of observed regressions to hypothetical forms, for example, to a straight line. [*This topic is dealt with in* Linear hypotheses, *article on* regression.]

**Hypotheses—simple, composite, approximate.** A test of goodness of fit, based on a sample from a population, assesses the plausibility that the population distribution has a specified form; in brief, it *tests* the hypothesis that *F*_{X} has shape *F*_{0}. The specification may be complete, that is, the population distribution may be specified completely, in which case the hypothesis is called *simple.* Alternatively, the form may be specified only up to certain unknown parameters, which often are the parameters of location and scale. In this case the hypothesis is called *composite.* Still another type of hypothesis is an *approximate* one, which is composite in a certain sense. Here one specifies first what one would consider a material departure from a hypothesized shape (Hodges & Lehmann 1954). For example, in the case of a *simple approximate* hypothesis, one might agree that *F*_{X} departs materially from *F*_{0} if the maximum vertical deviation between the actual and hypothesized cumulative distribution functions exceeds .07. The approximate hypothesis then states that the actual and hypothesized distributions do not differ materially in this sense.

Approximate hypotheses specialize to the others, so that a complete theory of testing for the former would be desirable. This is especially true since, as has been pointed out by Karl Pearson (1900) and Joseph Berkson (1938), tests of “exact” hypotheses, being as a rule *consistent,* have problematical logical status: unless the exact hypothesis is exactly correct and all of the sampling assumptions are exactly met, rejection of the hypothesis is assured (for fixed significance level) when sample size is large. Unfortunately, such a complete theory does not now exist, but the strong early interest in “exact” hypotheses was not misspent: the testing and “acceptance” of “exact” hypotheses concerning *F*_{X} seems to have much the same status as the provisional adoption of physical or other “laws.” If the latter has helped the advancement of science, so has no doubt the former; this is true notwithstanding that old hypotheses or theories will almost surely be discarded as additional data become available. This point has been made by Cochran (1952) and Chapman (1958). Cochran also suggests that tests of “exact” hypotheses are “invertible” into confidence sets, in the usual manner, thus providing statistical procedures somewhat similar in intent to tests of approximate hypotheses [*see* Estimation, *article on* confidence intervals and regions].

**Conducting a test of goodness of fit.** Many tests of goodness of fit have been developed; as with statistical tests generally, a test of goodness of fit is conveniently conducted by computing from the sample a *statistic* and its *sample significance level* [*see* Hypothesis testing]. In the case of a test of goodness of fit, the statistic will measure the discrepancy between what the sample in fact *is* and what a sample from a population of hypothesized form *ought to be.* The sample significance level of an observed measure of discrepancy, *d*_{0}, is, at least for all the standard goodness of fit procedures, the probability, Pr{*d* ≥ *d*_{0}}, that *d* exceeds *d*_{0} under random sampling from a population of hypothesized form. In other words, it is the proportion of like discrepancy measures, *d*, exceeding *d*_{0}, computed on the basis of many successive hypothetical random samples of the same size from a population of hypothesized form. For many tests of goodness of fit, there exist tables (for an extensive bibliography see Greenwood & Hartley 1962) that give those values of *d*_{0} corresponding to given significance level and sample size (*n*). Many of these standard tests are *nonparametric,* which means that Pr{*d* ≥ *d*_{0}} is the same for a very large class of hypotheses *F*_{0}, so that only one such tabulation is required [*see* Nonparametric statistics].

If, as is usual, the relevant alternative population distributions (more generally, alternative probabilistic models for the generation of the sample at hand) tend to encourage large values of *d*_{0}, the hypothesized population distribution will be judged implausible if the sample significance level is small (conventionally .05 or less). If the sample significance level is not small, the statistic has a value unsurprising under the null hypothesis, so that the test gives no reason to reject the null hypothesis. If, however, the sample significance level is very large, say .95 or more, one may construe this as a warning of possible trouble, say, that an overzealous proponent of the hypothesis has slanted the data or that the sampling was not random. Note here an awkward usage prevalent in statistics generally: an observed measure of discrepancy *d*_{0} with low probability Pr{*d* ≥ *d*_{0}} usually is described as *highly significant.*

**Choosing a test of goodness of fit.** Choosing a test of goodness of fit amounts to deciding in what sense the discrepancy between the hypothesized population distribution and the sample is to be measured. The sample c.d.f. may be compared directly with the hypothesized population c.d.f., as is done in the case of tests of the Kolmogorov-Smirnov type. For example, the original Kolmogorov-Smirnov test itself, as described below, summarizes the discrepancy by the maximum absolute deviation between the hypothesized population c.d.f., *F*_{0}, and the sample c.d.f. Alternatively, one may compare uncumulated frequencies, as for the χ^{2} test. Again, a standard shape parameter, such as skewness, may be computed for the sample and for the hypothesized population and the two compared.

Any reasonable measure of discrepancy will of course tend to be small if the population yielding the sample conforms to the null hypothesis. A good measure of discrepancy will, in addition, tend to be large under the likely alternative forms of the population distribution, a property designated technically by the term *power.* For example, the sample skewness coefficient might have good power if the hypothesized population distribution were normal (zero population skewness coefficient) and the relevant alternative distributional forms were appreciably skew.

**Two general considerations.** Two general considerations should be kept in mind. First it is important that the particular goodness of fit test used be selected without consideration of the sample at hand, at least if the calculated significance level is to be meaningful. This is because a measure of discrepancy chosen in the light of an observed sample anomaly will tend to be inordinately large. Receiving license plate 437918 hardly warrants the inference that, this year, the first and second digits add to the third, and the fifth and sixth to the fourth. It may of course be true, in special instances, that some adjustment of the test procedure in the light of the data does not affect the significance computations appreciably—as, for example, when choosing category intervals, based on the sample mean and variance, for the χ^{2} test (Watson 1957).

Second, a goodness of fit test, like any other statistical test, leads to an inference from a *sample* to the *population sampled.* Indeed, the usual hypothesis under test is that the sample is in fact a *random sample* from an *infinite population* of hypothesized form, and the tabulated probabilities, Pr{*d* ≥ *d*_{0}}, almost always presuppose this. (In principle, one could obtain goodness of fit tests for more complex kinds of probability samples than random ones, but little seems to be known about such possibilities.) It is therefore essential that the sample to which a standard test is applied can be thought of as a random sample. If it cannot, then one must be prepared either to do one’s own nonstandard significance probability computations or to defend the adequacy of the approximation involved in using the standard tabulations. Consider, for example, starting with a random sample involving considerable repetition, say the sample of response words obtained from a panel of subjects taking a psychological word association test or the sample of nationalities obtained from a survey of the United Nations. Suppose now that one tallies the number of items in the sample (response words, nationalities) appearing exactly once, exactly twice, etc. There results a new set of data, consisting of a certain number of ones, a certain number of twos, etc. This collection of integers has the outward appearance of a random sample, and the literature contains instances of the application of the standard tests of goodness of fit to such observed frequencies. Yet the probability mechanism that generates these integers has no resemblance whatever to random sampling, and the standard probability tabulations cannot be assumed to apply.

Other examples arise when the data are generated by time series; for some of these the requisite nonstandard probability computations have been done (Patankar 1954), while, in other cases, special devices have made the standard computations apply. For example, in the case of the learning experiment by Suppes and his associates (1964), the sample consists of the time series of a subject’s responses to successive stimuli. Certain theories of learning predict a particular bimodal long-run response population distribution; but the goodness of fit test of this hypothesized shape, on the basis of a series of subject responses, is hampered by the statistical dependence of neighboring responses. However, theory suggests, and a test of randomness confirms, that the subsample consisting of every fifth response is effectively random, enabling a standard χ^{2} test of goodness of fit to be carried out on the basis of this subsample. Whether four-fifths of the sample is a reasonable price to pay for validly carrying out a standard procedure is of course a matter of debate.

## Tests of simple hypotheses

### The χ^{2} test

The χ^{2} test was first proposed in 1900 by Karl Pearson. To apply the test, one first divides the possible range of numbers (number pairs in the bivariate case) into *k* regions. For example, if only nonnegative numbers are possible, one might use the categories 0 to .2, .2 to .5, .5 to .7, and .7 and beyond. Next, one computes the probabilities, *p*_{i}, associated with each of these regions (intervals in the example just given) under the hypothesized *F*_{0}. This is often done by subtracting values of *F*_{0} from each other; for example, when *F*_{0} is the exponential cumulative distribution function 1 − *e*^{−x}, the first two category probabilities are *p*_{1} = 1 − *e*^{−.2} and *p*_{2} = *e*^{−.2} − *e*^{−.5}. The expected numbers *E*_{i} of observations in each category are (under the null hypothesis) *E*_{i} = *np*_{i}, where *n* is the size of the random sample.

After the sample has been collected, there also will be observed numbers, *O*_{i}, of sample members in each category. The chi-square measure of discrepancy *d*_{χ²} is then computed by summing squared differences of class frequencies, weighted in such a way as to bring to bear standard distribution theory,

*d*_{χ²,0} = Σ_{i=1}^{k} (*O*_{i} − *E*_{i})²/*E*_{i},

where the subscript 0 indicates the specific sample value of *d*_{χ²}. (Often “*X*^{2}” or “χ^{2}” is used to denote this statistic.)

As is shown, for example, by Cochran (1952), the probability distribution of *d*_{χ²}, when *F*_{X} = *F*_{0}, can be approximated by the chi-square distribution with *k* − 1 degrees of freedom, χ²_{k−1}. This fact, to which the test owes its name, was first demonstrated by Karl Pearson. The larger the expectations *E*_{i}, the better is the approximation; this has been pointed out, for example, by Mann and Wald (1942). Hence, the significance, Pr{*d*_{χ²} ≥ *d*_{χ²,0}}, is evaluated to a good approximation by consulting a tabulation of the χ²_{k−1} distribution. For example, if *k,* as above, equals 4, and *d*_{χ²,0} had happened to be 4.6, then Pr{*d*_{χ²} ≥ *d*_{χ²,0}} ≅ .20. With a sample significance level of .20, most statisticians would not question the plausibility of *F*_{0}. However, were *d*_{χ²,0} larger, and the corresponding significance equal to .05 or less, the consensus would be reversed.

At what point is the distributional approximation endangered by small *E*_{i}? An early study of this problem, performed by Cochran in 1942 (referred to in Cochran 1952), shows that a few *E*_{i} near 1 among several large ones do not materially affect the approximation. Recent studies, by Kempthorne (1966) and by Slakter (1965), show that this is true as well when all *E*_{i} are near 1.
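The worked example above (*k* = 4 categories, observed discrepancy 4.6, significance about .20) can be checked directly. The sketch below (Python; the observed counts are invented for illustration) computes Pearson’s *d*_{χ²} and evaluates the chi-square survival probability using the closed form available for 3 degrees of freedom.

```python
import math

def chi2_stat(obs, exp):
    """Pearson's discrepancy: sum over categories of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(obs, exp))

def chi2_sf_3df(x):
    """Pr{chi-square with 3 df >= x}, via the closed form for 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

obs = [3, 7, 6, 4]           # invented observed counts in k = 4 categories
exp = [5, 5, 5, 5]           # expected counts E_i = n p_i under the null hypothesis
d = chi2_stat(obs, exp)      # (4 + 4 + 1 + 1)/5 = 2.0
print(d, chi2_sf_3df(4.6))   # the text's d = 4.6 gives a significance level near .20
```

For general *k* one would consult a χ² table or library routine rather than a hand-coded survival function; the 3-degree-of-freedom case merely reproduces the example in closed form.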

These and other studies indicate that, although some care must be taken to avoid very small *E*_{i}, much latitude remains for choosing categories. How is this to be done? To begin with, in keeping with the spirit of remarks by Birnbaum (1953), if the relevant alternatives *F** to *F*_{0} are such that the discrepancy *d*_{χ²}(*F*_{0}, *F**) is large for a certain choice of *k* categories, it is these categories that should be selected. Among various sets of *k* categories, those yielding large *d*_{χ²}(*F*_{0}, *F**) are preferred.

In the absence of detailed knowledge of the alternatives, the usual recommendation, at least in the one-dimensional case, is to use intervals of equal *E*_{i}. There remains the question of how many such intervals there should be. The typical statistical criterion for this is power, that is, the likelihood that the value of *d*_{χ²} will be large enough to warrant rejection of the hypothesis *F*_{0} when the population is in fact a relevant alternative one. If large power is desired for *all* alternative population c.d.f.’s departing from *F*_{0} at some *x* by at least a given fixed amount, Mann and Wald (1942) recommend a number of categories of the order of 4*n*^{2/5}. Williams (1950) has shown that this figure can easily be halved.
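As a quick numerical illustration of the Mann–Wald prescription (the sample size here is invented):

```python
n = 200                  # illustrative sample size
k = 4 * n ** 0.4         # number of categories of the order 4 * n^(2/5)
print(round(k))          # about 33; Williams' result suggests roughly half as many
```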

The χ^{2} test is versatile; it is readily adapted to problems involving nominal rather than numerical populations [*see* Counted data]. It can also be adapted to bivariate and multivariate problems, as, for example, by Keats and Lord (1962), where the joint distribution of two types of mental test scores is considered. As opposed to many of its competitors, the χ^{2} test is not biased, in the sense that there are no alternatives *F** to *F*_{0} under which acceptance of *F*_{0} is more likely than it is under *F*_{0} itself. It is readily adapted to composite and approximate testing problems. Also, it seems to be true that the χ^{2} test is in the best position among its competitors with regard to the practical computation of power. As is pointed out by Cochran (1952), such computations are performed by means of the *noncentral* chi-square distribution with *k* − 1 degrees of freedom.

### Modifications of the χ^{2} test

Important modifications of the χ^{2} test, intended to increase its power against specific families *F* of distributions alternative to *F*_{0}, are given by Neyman (1949) and by Fix, Hodges, and Lehmann (1954). Here *F* is assumed to include *F*_{0} and to allow differentiable parametric representation of the category expectations *E*_{i}. Note that the inclusion of *F*_{0} in *F* differs from the point of view adopted, for example, by Mann and Wald (1942). These modifications are essentially likelihood ratio tests of *F*_{0} versus *F* and are similar to procedures used to test composite and approximate hypotheses.

Another modification, Neyman’s ψ^{2} test, capable of orientation against specific “smooth” alternatives, was introduced in 1937. Other important modifications are described in detail in Cochran (1954).

### Other procedures

When (*X*_{1}, …, *X*_{n}) is a random sample from a population distributed according to a *continuous* c.d.f. *F*_{0}, then (*U*_{1}, …, *U*_{n}) = (*F*_{0}(*X*_{1}), …, *F*_{0}(*X*_{n})) has all the probabilistic properties of a random sample from a population distributed uniformly over the numbers between zero and one. (If the population has a density function, the c.d.f. is continuous.) No matter what the hypothesized *F*_{0}, the initial application of this *probability integral* transformation thus reduces all probability computations to the case of this *uniform* population distribution and gives a *nonparametric* character to any procedure based on the transformed sample (*U*_{1}, …, *U*_{n}). Most goodness of fit tests of simple hypotheses are nonparametric in this sense, including the χ^{2} test itself, when categories are chosen so as to assign specified values, for example, the constant value 1/*k*, to the category probabilities *p*_{i}.

Another common test making use of the transformation *U* = *F*_{0}(*X*) is the Kolmogorov-Smirnov test, first suggested by Kolmogorov (1933) and explained in detail by Goodman (1954) and Massey (1951). The test bears Smirnov’s name, as well as Kolmogorov’s, presumably because Smirnov (as Doob and Donsker did later) gave an alternate derivation of its asymptotic null distribution, tabulated this distribution, and also extended the test to the two-sample case discussed below (1939*a*). Denote by *F*_{n}(*x*) the *sample c.d.f.,* that is, *F*_{n}(*x*) is the proportion of sample values less than or equal to *x.* The test is based on the maximum absolute vertical deviation between *F*_{n}(*x*) and *F*_{0}(*x*),

*d*_{K} = max_{x} |*F*_{n}(*x*) − *F*_{0}(*x*)|,

the dependence of *d*_{K} on the quantities *U*_{i} = *F*_{0}(*X*_{i}) being best brought out by the alternate formula

*d*_{K} = max_{i} [max(*i*/*n* − *u*_{i}, *u*_{i} − (*i* − 1)/*n*)],

where *u*_{1} is the smallest *U*_{i}, *u*_{2} is the next to smallest, etc.; the equivalence of the two formulas is made clear by a sketch. As Kolmogorov noted in his original paper, the probabilities tabulated for *d*_{K} are *conservative* when *F*_{0} is not continuous, in the sense that, for discontinuous *F*_{0}, actual probabilities of *d*_{K} ≥ *d*_{K,0} will tend to be less than those obtained from the tabulations, leading to occasional unwarranted acceptance of *F*_{0}.

Computations (Shapiro & Wilk 1965) suggest that this test has low power against alternatives with mean and variance equal to those of the hypothesized distribution. It has, however, been argued, for example, by Birnbaum (1953) and Kac, Kiefer, and Wolfowitz (1955), that the test yields good minimum power over classes of alternatives *F** satisfying *d*_{K}(*F**, *F*_{0}) ≥ δ; these, as the reader will note, are precisely the classes of alternatives envisaged by Mann and Wald (1942) in optimizing the number of categories used in the χ^{2} test. A detrimental feature of the Kolmogorov-Smirnov test is its bias, pointed out in Massey (1951).
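The Kolmogorov-Smirnov discrepancy and its order-statistic form can be sketched as follows (Python; the hypothesized *F*_{0} is taken to be the unit exponential and the data are invented). The two functions implement the direct maximum-deviation definition and the formula in terms of the ordered transforms *u*_{i} = *F*_{0}(*x*_{(i)}); they agree exactly.

```python
import math

def ks_stat_direct(xs, F0):
    """d_K = sup |F_n(x) - F_0(x)|; for a step-function F_n the supremum is
    attained just before or just after a jump, i.e. at a sample value."""
    xs = sorted(xs)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        d = max(d, abs(i / n - F0(x)), abs((i - 1) / n - F0(x)))
    return d

def ks_stat_transformed(xs, F0):
    """Equivalent formula via the ordered transforms u_(i) = F0(x_(i))."""
    us = sorted(F0(x) for x in xs)
    n = len(us)
    return max(max(i / n - u, u - (i - 1) / n) for i, u in enumerate(us, start=1))

F0 = lambda x: 1 - math.exp(-x)       # hypothesized exponential c.d.f.
data = [0.3, 1.2, 0.1, 2.5, 0.8]      # illustrative observations
print(ks_stat_direct(data, F0))       # same value from both formulas
```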

An important feature of the test is that it can be “inverted,” in keeping with the usual method, to provide a confidence band for the true c.d.f., centered on the sample c.d.f. *F*_{n}(*x*), which, except for the narrowing caused by the restriction that a c.d.f. lies between 0 and 1, has constant width [*see* Estimation, *article on* confidence intervals and regions]. The construction of such a band has been suggested by Wald and Wolfowitz and is described by Goodman (1954). Attaching a significance probability to an observed *d*_{K,0} amounts to ascertaining the band width required in order just to include wholly the hypothesized *F*_{0} in the confidence band.

The Kolmogorov-Smirnov test has been modified in several ways; the first of these converts the test into a “one-sided” procedure based on the discrepancy

*d*_{K+} = max_{x} [*F*_{n}(*x*) − *F*_{0}(*x*)].

A useful feature of this modification is the simplicity of the large sample computation of significance probabilities associated with observed discrepancies *d*_{K+,0}; abbreviating the latter to *d*, one has Pr{*d*_{K+} ≥ *d*} ≅ *e*^{−2nd²}. It is verified by Chapman (1958) that *d*_{K+} yields good minimum power over those classes of alternatives *F** that satisfy *d*_{K+}(*F**, *F*_{0}) ≥ δ.
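A sketch of the one-sided computation (Python; the transformed sample is invented, and the significance uses the large-sample approximation Pr{*d*_{K+} ≥ *d*} ≅ *e*^{−2nd²}):

```python
import math

def ks_plus(us):
    """One-sided discrepancy d_K+ = max_i (i/n - u_(i)) over the ordered transforms."""
    us = sorted(us)
    n = len(us)
    return max(i / n - u for i, u in enumerate(us, start=1))

u = [0.1, 0.3, 0.5, 0.7, 0.9]       # transformed sample, u_i = F0(x_i)
d = ks_plus(u)                       # about 0.1 here
p = math.exp(-2 * len(u) * d * d)    # large-sample significance, about 0.90
print(d, p)
```

A significance this large simply reports a sample c.d.f. lying unremarkably close to the hypothesized one.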

Other, more complex modifications provide greater power against special alternatives, as in the weight function modifications (Darling 1957), which provide greater power against discrepancies from *F*_{0} in the tails. Another sort of modification, introduced and tabulated by Kuiper in 1960, calls for a measure of discrepancy *d*_{V} that is especially suited to testing goodness of fit to hypothesized *circular* distributions, being invariant under arbitrary choices of the angular origin. This property could be important, for example, in psychological studies involving responses to the color wheel, or in the learning experiment mentioned above. The measure *d*_{V} also has been singled out by E. S. Pearson (1963) as the generally most attractive in competition with *d*_{K} and the discrepancy measures *d*_{ω²} and *d*_{U} mentioned below.

A second general class of procedures also making use of the transformation *U* = *F*_{0}(*X*) springs from the discrepancy measure

*d*_{ω²} = *n* ∫ [*F*_{n}(*x*) − *F*_{0}(*x*)]² d*F*_{0}(*x*),

first proposed by Cramér in 1928 and also by von Mises in 1931 (see Darling 1957). Marshall (1958) has verified a startling agreement between the asymptotic and small sample distributions of *d*_{ω²} for sample sizes *n* as low as 3. Power considerations for *d*_{ω²} are similar to those expressed for *d*_{K} and are discussed also in the sources cited by Marshall; the test based on *d*_{ω²} can be expected to have good minimum power over classes of alternatives *F** satisfying *d*_{ω²}(*F**, *F*_{0}) ≥ δ. However, the test is biased (as is that based on *d*_{K}).

As in the case of *d*_{K}, *d*_{ω²} has weight function modifications for greater power selectivity, and also a modification *d*_{U}, analogous to the modification *d*_{V} of *d*_{K} and introduced by Watson (1961), which does not depend on the choice of angular origin and is thus also suited for testing the goodness of fit to hypothesized circular distributions.

Other procedures include those based on the Fisher-Pearson measures *d*^{(1)} = −2 Σ log *u*_{i} and *d*^{(2)} = −2 Σ log (1 − *u*_{i}), apparently first suggested in connection with goodness of fit in 1938 by E. S. Pearson. As pointed out by Chapman (1958), the tests based on *d*^{(1)} and *d*^{(2)} are uniformly most powerful against polynomial alternatives to the uniform distribution of the forms *F*(*x*) = *x*^{k} and *F*(*x*) = 1 − (1 − *x*)^{k}, *k* > 1, and hence are “smooth” in the sense of Neyman’s ψ^{2} test. Computations by Chapman suggest that, dually to *d*_{K}, *d*^{(2)} has good maximum power over classes of alternatives *F** satisfying *d*_{K}(*F**, *F*_{0}) ≤ δ.

Another set of procedures, discussed and defended by Pyke (1965) and extensively studied by Weiss (1958), are based on functions of the spacings, *u*_{(i+1)} − *u*_{(i)} or *u*_{(i)} − *i*(*n* + 1)^{−1}, of the *u*’s, from each other or from their expected locations under *F*_{0}. Still another criterion (Smirnov 1939*b*) examines the number of crossings of *F*_{0}(*x*) and *F*_{n}(*x*).

An important modification, applicable to all of the procedures in this section, is suggested in Durbin (1961). This modification is intended to increase the power of any procedure based on the transforms *U*_{i} against a certain class of alternatives described in that paper.

Since there are multivariate probability integral transformations, applying an initial “uniformizing” transformation is possible in the multivariate case as well. However, one of several possible transformations must now be chosen, and, related to this nonuniqueness, the direct analogues of the univariate discrepancy measures are no longer functions of uniformly distributed transforms and do not lead to nonparametric tests (Rosenblatt 1952).
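For instance, in a bivariate case with known conditionals (the particular distributions here are invented for illustration: *X*_{1} uniform on (0, 1) and, given *X*_{1} = *x*_{1}, *X*_{2} uniform on (0, *x*_{1})), one Rosenblatt-type transformation is *U*_{1} = *F*_{1}(*X*_{1}), *U*_{2} = *F*_{2|1}(*X*_{2} | *X*_{1}):

```python
def rosenblatt(x1, x2):
    """One of several possible 'uniformizing' transformations for this bivariate law:
    U1 = F1(x1) = x1 and U2 = F(x2 | x1) = x2 / x1 are independent uniforms."""
    return x1, x2 / x1

print(rosenblatt(0.5, 0.25))
```

Conditioning in the other order, first on *X*_{2} and then on *X*_{1} given *X*_{2}, yields a different transformation, which illustrates the nonuniqueness just mentioned.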

## Tests of composite hypotheses

### The χ^{2} test

In the composite case, the null hypothesis specifies only that *F*_{X}(*x*) is a member of a certain parametric class {*F*_{θ}(*x*)}. Typically, but not necessarily, θ is the pair (μ, σ), with μ a parameter of location and σ a parameter of scale, in which case *F*_{θ}(*x*) may be written *F*_{0}[(*x* − μ)/σ]. In any event, there arises the question of modifying the measure *d*_{χ²} of discrepancy between the sample and a particular cumulative distribution function into a measure *D*_{χ²} of discrepancy between the sample and the class {*F*_{θ}(*x*)}. A natural approach is to set *D*_{χ²} = min_{θ} *d*_{χ²}.

If θ is composed of *m* parameters, it can be shown that, under quite general conditions, *D*_{χ²} is approximately distributed according to the chi-square distribution with *k* − *m* − 1 degrees of freedom when *F*_{X}(*x*) equals any one of the *F*_{θ}(*x*). Hence significance probability computations can once again be referred to tabulations of the chi-square distribution. The requisite minimization with respect to θ can be cumbersome, and several modifications have been proposed, for example, the following by Neyman (1949):

Suppose that one defines *d*_{χ²}(θ) as the discrepancy *d*_{χ²} between the observed sample and the particular distribution *F*_{θ}(*x*). Then *D*_{χ²} is defined also by *D*_{χ²} = *d*_{χ²}(θ͂), with the estimator θ͂ computed from

*d*_{χ²}(θ͂) = min_{θ} *d*_{χ²}(θ);

that is, with θ͂ the *minimum chi-square estimator* of θ. The suggested modifications involve using estimators of θ alternate to θ͂ in this last definition of *D*_{χ²}, that is, estimators that “essentially” minimize *d*_{χ²}(θ); among these are the so-called grouped-data or partial-information maximum likelihood estimators.

Frequently used but *not equivalent* estimators are the ordinary “full-information” maximum likelihood estimators θ̂ of θ, for example, (*x̄*, *s*) for (μ, σ) in the normal case. These do *not* “essentially” minimize *d*_{χ²} and consequently tend to inflate *D*_{χ²} beyond values predicted by the chi-square distribution with *k* − *m* − 1 degrees of freedom, leading to some unwarranted rejections of the composite hypothesis. However, it is indicated by Chernoff and Lehmann (1954), and also by Watson (1957), that no serious distortion will result if the number of categories is ten or more.
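A crude grid search conveys the minimum chi-square idea (Python; the Poisson class, the category scheme, and the observed counts are all invented for illustration, and a real application would use a proper optimizer rather than a grid):

```python
import math

def d_chi2(obs, probs, n):
    """d_chi2(theta): Pearson discrepancy against expectations E_i = n p_i(theta)."""
    return sum((o - n * p) ** 2 / (n * p) for o, p in zip(obs, probs))

def poisson_category_probs(lam):
    """Category probabilities for counts 0, 1, 2, and '3 or more' under Poisson(lam)."""
    p = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(3)]
    return p + [1 - sum(p)]

obs = [5, 10, 8, 2]                      # invented observed counts, n = 25
n = sum(obs)
grid = [l / 100 for l in range(10, 301)]  # candidate values of theta = lambda
# minimum chi-square estimate: the grid point minimizing d_chi2(theta)
lam_hat = min(grid, key=lambda l: d_chi2(obs, poisson_category_probs(l), n))
D_chi2 = d_chi2(obs, poisson_category_probs(lam_hat), n)
print(lam_hat, D_chi2)
```

With *k* = 4 categories and *m* = 1 fitted parameter, *D*_{χ²} would be referred to the chi-square distribution with 4 − 1 − 1 = 2 degrees of freedom.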

### Composite analogues of other tests

Adaptation of the tests based on the probability integral transformation to the composite case proceeds much as in the case of χ^{2}. With definitions of *d*_{ω²}(θ) and *d*_{K}(θ) analogous to that of *d*_{χ²}(θ), Darling (1955) has investigated the large sample probability distribution of *D*_{ω²} = *d*_{ω²}(θ̂) and *D*_{K} = *d*_{K}(θ̂) for efficient estimators θ̂ of θ analogous to the estimators θ͂ for χ^{2}. Note that, in the absence of any χ^{2}-like categories, the ordinary full-information maximum likelihood estimators now *do* qualify as estimators θ̂.

A major problem now is, however, that the modified procedures are no longer nonparametric. Thus a special investigation is required for every composite hypothesis. This is done by Kac, Kiefer, and Wolfowitz (1955) for the normal scale-location family, and the resulting large sample distribution is partly tabulated.

### Tests based on special characteristics

The alternatives of concern sometimes differ from a composite null hypothesis in a manner easily described by a standard shape parameter. Special tests have been proposed for such cases. For example, the sample skewness coefficient has been suggested (Geary 1947) for testing normality against skew alternatives. Again, for testing Poissonness against alternatives with variance unequal to mean, R. A. Fisher has recommended the variance-to-mean ratio (see Cochran 1954). This measure is approximately distributed as χ²_{n−1}/(*n* − 1) when Poissonness in fact obtains, for λ > 1 and *n* > 15 (Sukhatme 1938); this follows from the fact that the denominator is then a high-precision estimate of λ, and the numerator is approximately distributed as λχ²_{n−1}/(*n* − 1). Analogous recommendations apply to testing binomiality. Essentially the same point of view underlies tests of normality based on the ratio of the mean deviation, or of the range, to the standard deviation.
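The variance-to-mean computation is elementary (Python; the counts are invented). The form used below, Σ(*x* − *x̄*)²/*x̄* = (*n* − 1)*s*²/*x̄*, is referred directly to chi-square with *n* − 1 degrees of freedom.

```python
def dispersion_index(xs):
    """Fisher's variance-to-mean criterion in the form sum((x - mean)^2) / mean,
    i.e. (n - 1) s^2 / x-bar, referred to chi-square with n - 1 df."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / m

counts = [2, 4, 3, 5, 1, 3, 2, 4]    # invented Poisson-like counts, n = 8
print(dispersion_index(counts))       # compare with chi-square, 7 df (mean 7)
```

A value far above the degrees of freedom suggests overdispersion relative to the Poisson; far below, underdispersion.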

### Transforming into simple hypotheses

Another interesting approach to the composite problem, advocated by Hogg and also by Sarkadi (1960), is to transform certain composite hypotheses into equivalent simple ones.

Specifically, there are location-scale parametric families {*F*_{0}[(*x* − μ)/σ]} with the following property: a random sample from any particular *F*_{0}[(*x* − μ)/σ] is reducible by a transformation T to a new set of random variables, Y = T(X), constituting in effect a random sample from a distribution *G*_{0}(*y*) involving no unknown parameters at all. Moreover, *only* random samples from distributions *F*_{0}[(*x* − μ)/σ] lead to *G*_{0}(*y*) when operated on by T.

It then follows that testing the composite hypothesis *H* that (X_{1}, …, X_{n}) is a random sample from a distribution *F*_{0}[(*x* − μ)/σ] with some μ and some σ is equivalent to testing the hypothesis *H′* that (Y_{1}, …, Y_{m}) is a random sample from the distribution *G*_{0}(*y*). Any of the tests for simple hypotheses is then available for testing *H′.* An example is provided by a negative exponential *F*_{0} and uniform *G*_{0}, in which case the ordered exponential random sample (X_{(1)}, …, X_{(n)}) is transformed into an ordered uniform random sample (Y_{(1)}, …, Y_{(n−2)}) by the transformation Y_{(i)} = (D_{1} + ⋯ + D_{i})/(D_{1} + ⋯ + D_{n−1}), *i* = 1, …, *n* − 2, where D_{i} = (*n* − *i*)(X_{(i+1)} − X_{(i)}), *i* = 1, …, *n* − 1, are the normalized spacings.
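Assuming T is the normalized-spacings transformation of the exponential case, a minimal Python sketch (names are illustrative) is:

```python
import random

def exponential_to_uniform(sample):
    """Map a random sample from a negative exponential with unknown
    location mu and scale sigma to an ordered sample of size n - 2 that is
    uniform on (0, 1) exactly when the parent is exponential.
    """
    x = sorted(sample)
    n = len(x)
    # Normalized spacings D_i = (n - i)(X_(i+1) - X_(i)) are i.i.d.
    # exponential with mean sigma; the location mu cancels out.
    d = [(n - i) * (x[i] - x[i - 1]) for i in range(1, n)]
    total = sum(d)
    # Partial-sum ratios of i.i.d. exponentials behave like an ordered
    # uniform sample of size n - 2.
    partial, y = 0.0, []
    for i in range(n - 2):
        partial += d[i]
        y.append(partial / total)
    return y

random.seed(2)
mu, sigma = 3.0, 2.0
sample = [mu + random.expovariate(1.0 / sigma) for _ in range(20)]
y = exponential_to_uniform(sample)   # 18 nearly uniform ordered values
```

Applying any test of uniformity (Kolmogorov's d_{K}, say) to the output then tests exponentiality with both parameters unknown.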

### Conditioning

Another way of neatly doing away with the unknown parameter is to consider the conditional distribution of the sample, given a sufficient estimate of it. This method is advocated, at least for testing Poissonness, in Fisher (1950).

### Tests related to probability plots

S. S. Shapiro and M. B. Wilk have quantified in various ways the departure from linearity of the sorts of probability plots mentioned above, in particular of the plot of the ordered sample values against the expected values of the standardized order statistics [*see* Nonparametric statistics, *article on* order statistics]. This new approach bears some similarity to one given in Darling (1955), which is based on the measure *d*_{ω^{2}} modified for the composite case. Both approaches, in a sense, compare adjusted observed order statistics with standardized order statistic expectations. But the approach of Shapiro and Wilk is tailored more explicitly to particular scale-location families, by using their particular order statistic variances and covariances. It is no wonder that preliminary evaluations of this sort of approach (for example, by Shapiro & Wilk 1965) have shown exceptional promise. As an added bonus, the procedure is *similar* over the entire scale-location family; that is, its probability distribution is independent of location and scale.
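A simplified correlation version of this idea can be sketched in Python. This is *not* the Shapiro-Wilk W itself, which weights by the order statistics' variances and covariances; it merely measures the straightness of a normal probability plot, using Blom's approximation to the expected standard-normal order statistics:

```python
from statistics import NormalDist

def normal_plot_correlation(sample):
    """Correlation between the ordered sample and approximate expected
    standard-normal order statistics (Blom's scores (i - 0.375)/(n + 0.25)).

    A nearly linear probability plot gives a correlation near 1; clearly
    smaller values point to non-normality.
    """
    x = sorted(sample)
    n = len(x)
    nd = NormalDist()
    # Approximate expected values of the standardized order statistics.
    m = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    mx, mm = sum(x) / n, sum(m) / n
    sxx = sum((v - mx) ** 2 for v in x)
    smm = sum((w - mm) ** 2 for w in m)
    sxm = sum((v - mx) * (w - mm) for v, w in zip(x, m))
    return sxm / (sxx * smm) ** 0.5
```

Because the statistic is a correlation of the ordered sample with fixed scores, it is unchanged by relocating or rescaling the data, illustrating the similarity property noted above.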

## Approximate hypotheses

The first, and seemingly most practically developed, attempt to provide the requisite tests of approximate hypotheses is found in Hodges and Lehmann (1954). Hodges and Lehmann assume the *k* typical categories of the χ^{2} test and formulate the approximate *simple* hypothesis in terms of the discrepancy *d(p, p*_{0}) between the category probabilities *p*_{i} under F_{X} and the category probabilities *p*_{0,i} under a *simple* hypothesis F_{0}. A very tractable discrepancy measure of this type is ordinary (Euclidean) distance, for which the approximate hypothesis takes the form *d(p, p*_{0}) ≤ δ.

Denoting *O*_{i}/*n* by *o*_{i}, the suggested test reduces, essentially, to the *one-sided* test of the hypothesis *d(p, p*_{0}) = δ, based on the approximately normal statistic [*d(o, p*_{0}) − δ]/σ̂, where σ̂ is the standard deviation, estimated from the sample *o*_{i}, of *d(o, p*_{0}). For example, when *F*_{0} specifies *k* categories with p_{0,i} = 1/*k*, one treats as unit normal (under the null hypothesis) the resulting statistic S = [*d(o, p*_{0}) − δ]/σ̂ and uses an upper-tail test. Thus a value of S of 1.645 leads to a sample significance level of .05. This approach lends itself easily to the computation of power and is extended as well by Hodges and Lehmann to the testing of approximate *composite* hypotheses.
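A hedged Python sketch of this test follows. The distance d(o, p_{0}) matches the text, but σ̂ is obtained here by a simple multinomial bootstrap rather than by Hodges and Lehmann's own estimate, so the function and its details are illustrative only:

```python
import math
import random

def approx_fit_test(counts, p0, delta, n_boot=2000, seed=0):
    """One-sided test of the approximate hypothesis d(p, p0) <= delta,
    where d is Euclidean distance between category-probability vectors.

    The standard deviation of d(o, p0) is estimated by a multinomial
    bootstrap from the observed proportions, a convenient stand-in for
    Hodges and Lehmann's closed-form estimate.
    """
    n = sum(counts)
    o = [c / n for c in counts]
    d_obs = math.dist(o, p0)               # Euclidean distance d(o, p0)
    rng = random.Random(seed)
    dists = []
    for _ in range(n_boot):
        draws = [0] * len(o)
        for _ in range(n):                 # resample n observations from o
            u, acc = rng.random(), 0.0
            for i, pi in enumerate(o):
                acc += pi
                if u <= acc:
                    draws[i] += 1
                    break
            else:
                draws[-1] += 1             # guard against rounding at 1.0
        dists.append(math.dist([c / n for c in draws], p0))
    mean_d = sum(dists) / n_boot
    sigma = (sum((v - mean_d) ** 2 for v in dists) / (n_boot - 1)) ** 0.5
    s = (d_obs - delta) / sigma            # treated as approximately N(0, 1)
    return s                               # reject for s > 1.645 at the .05 level
```

Note that, unlike the usual χ^{2} test, a *small* value of S here is evidence *for* the approximate hypothesis, since the hypothesis tolerates discrepancies up to δ.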

Extension of other tests for simple hypotheses to the testing of approximate hypotheses has been considered by J. Rosenblatt (1962) and by Kac, Kiefer, and Wolfowitz (1955).

## Further topics

That the sample is random may itself be in doubt, and tests have been designed to have power against specific sorts of departure from randomness. For example, tests of the hypothesis of randomness against the alternative hypothesis that the data are subject to a Markov structure are given by Billingsley (1961) and Goodman (1959); the latter work also covers testing that the data have Markov structure of a given order against the alternative that the data have Markov structure of higher order, and the testing of hypothesized values of transition probabilities when a Markov structure of given order is assumed [*see*Markov chains].

Many of the tests described in this article can be extended to *several-sample* procedures for testing the hypothesis that several populations are in fact distributed identically; thus, as first suggested in Smirnov (1939a), if G_{m}(x) denotes the proportion of values less than or equal to x in an independent random sample (Y_{1}, …, Y_{m}) from a second population, *d*_{K}(F_{n}, G_{m}) provides a natural test of the hypothesis that the two continuous population distribution functions *F*_{X} and *G*_{Y} coincide. Many of these extensions are functions only of the *relative ranks* of the two samples and, as such, are nonparametric; that is, their null probability distributions do not depend on the common functional form of F_{X} and G_{Y}. [*Several-sample nonparametric procedures are discussed in* Nonparametric statistics.]
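The two-sample statistic d_{K}(F_{n}, G_{m}) is simply the largest vertical gap between the two empirical distribution functions, and it can be computed in one merge pass over the pooled ordered values; a minimal sketch (the function name is ours):

```python
def ks_two_sample(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical gap
    between the empirical distribution functions F_n and G_m.

    Depends only on the relative ranks of the pooled sample, hence its
    null distribution is the same for all continuous parent distributions.
    """
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] == v:        # step past all ties at v
            i += 1
        while j < m and ys[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))     # gap between the two EDFs at v
    return d
```

Identical samples give d = 0, and completely separated samples give d = 1, the extreme values of the statistic.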

Another topic is that of tests of goodness of fit as preliminary tests of significance, in a sense discussed, for example, by Bancroft (1964). That tests of goodness of fit are typically applied in this sense is recognized by Chapman (1958), and the probabilistic properties of certain “nested” sequences of tests beginning with a test of goodness of fit have been considered by Hogg (1965). The Bayes and information theory approaches to χ^{2} tests of goodness of fit are also important (see Lindley 1965; Kullback 1959).

H. T. David

[*Directly related are the entries*Hypothesis Testing; Significance, Tests of. *Other relevant material may be found in*Counted Data; Estimation; Non-Parametric Statistics.]

## BIBLIOGRAPHY

Bancroft, T. A. 1964 Analysis and Inference for In completely Specified Models Involving the Use of Preliminary Test(s) of Significance. *Biometrics* 20:427–442.

Berkson, Joseph 1938 Some Difficulties of Interpretation Encountered in the Chi-square Test. *Journal of the American Statistical Association* 33:526–536.

Billingsley, Patrick 1961 Statistical Methods in Markov Chains. *Annals of Mathematical Statistics* 32:12–40.

Birnbaum, Z. W. 1953 Distribution-free Tests of Fit for Continuous Distribution Functions. *Annals of Mathematical Statistics* 24:1–8.

Chapman, Douglas G. 1958 A Comparative Study of Several One-sided Goodness-of-fit Tests. *Annals of Mathematical Statistics* 29:655–674.

Chernoff, Herman; and Lehmann, E. L. 1954 The Use of Maximum Likelihood Estimates in Tests for Goodness of Fit. *Annals of Mathematical Statistics* 25:579–586.

Cochran, William G. 1952 The X^{2} Test of Goodness of Fit. *Annals of Mathematical Statistics* 23:315–345.

Cochran, William G. 1954 Some Methods for Strengthening the Common X^{2} Tests. *Biometrics* 10:417–451.

Darling, D. A. 1955 The Cramer-Smirnov Test in the Parametric Case. *Annals of Mathematical Statistics* 26:1–20.

Darling, D. A. 1957 The Kolmogorov-Smirnov, Cramer-Von Mises Tests. *Annals of Mathematical Statistics* 28:823–838.

Durbin, J. 1961 Some Methods of Constructing Exact Tests. *Biometrika* 48:41–55.

Fisher, R. A. 1924 The Conditions Under Which X^{2} Measures the Discrepancy Between Observation and Hypothesis. *Journal of the Royal Statistical Society* 87:442–450.

Fisher, R. A. 1950 The Significance of Deviations From Expectation in a Poisson Series. *Biometrics* 6:17–24.

Fix, Evelyn; Hodges, J. L. Jr.; and Lehmann, E. L. 1954 The Restricted Chi-square Test. Pages 92-107 in Ulf Grenander (editor), *Probability and Statistics.* New York: Wiley.

Geary, R. C. 1947 Testing for Normality. *Biometrika* 34:209–242.

Goodman, Leo A. 1954 Kolmogorov-Smirnov Tests for Psychological Research. *Psychological Bulletin* 51: 160–168.

Goodman, Leo A. 1959 On Some Statistical Tests for mth Order Markov Chains. *Annals of Mathematical Statistics* 30:154–164.

Greenwood, Joseph A.; and Hartley, H. O. 1962 *Guide to Tables in Mathematical Statistics.* Princeton Univ. Press.→ A sequel to the guides to mathematical tables produced by and for the Committee on Mathematical Tables and Aids to Computation of the National Academy of Sciences–National Research Council of the United States.

Hodges, J. L. Jr.; and Lehmann, E. L. 1954 Testing the Approximate Validity of Statistical Hypotheses. *Journal of the Royal Statistical Society* Series B 16: 261–268.

Hogg, Robert V. 1965 On Models and Hypotheses With Restricted Alternatives. *Journal of the American Statistical Association* 60:1153–1162.

Kac, M.; Kiefer, J.; and Wolfowitz, J. 1955 On Tests of Normality and Other Tests of Goodness of Fit Based on Distance Methods. *Annals of Mathematical Statistics* 26:189–211.

Keats, J. A.; and Lord, Frederic M. 1962 A Theoretical Distribution for Mental Test Scores. *Psychometrika* 27:59–72.

Kempthorne, O. 1966 The Classical Problem of Inference: Goodness of Fit. Unpublished manuscript.→ Paper presented at the Berkeley Symposium on Mathematical Statistics and Probability, Fifth, *Proceedings* to be published.

Kolmogorov, A. N. 1933 Sulla determinazione empirica di una legge di distribuzione. Istituto Italiano degli Attuari, *Giornale* 4:83–99.

Kuiper, Nicolaas H. 1960 Tests Concerning Random Points on a Circle. Akademie van Wetenschappen, Amsterdam, *Proceedings* Series A 63:38–47.

Kullback, S. 1959 *Information Theory and Statistics.* New York: Wiley.

Lindley, D. V. 1965 *Introduction to Probability and Statistics From a Bayesian Viewpoint.* Volume 2: Inference. Cambridge Univ. Press.

Mann, H. B.; and Wald, A. 1942 On the Choice of the Number of Class Intervals in the Application of the Chi Square Test. *Annals of Mathematical Statistics* 13:306–317.

Marshall, A. W. 1958 The Small Sample Distribution of *n*ω^{2}. *Annals of Mathematical Statistics* 29:307–309.

Massey, Frank J. Jr. 1951 The Kolmogorov-Smirnov Test for Goodness of Fit. *Journal of the American Statistical Association* 46:68–78.

Neyman, Jerzy 1937 “Smooth Test” for Goodness of Fit. *Skandinavisk aktuarietidskrift* 20:149–199.

Neyman, Jerzy 1949 Contribution to the Theory of the X^{2} Test. Pages 239-273 in Berkeley Symposium on Mathematical Statistics and Probability, First, *Proceedings.* Berkeley: Univ. of California Press.

Patankar, V. N. 1954 The Goodness of Fit of Frequency Distributions Obtained From Stochastic Processes. *Biometrika* 41:450–462.

Pearson, E. S. 1938 The Probability Integral Transformation for Testing Goodness of Fit and Combining Independent Tests of Significance. *Biometrika* 30:134–148.

Pearson, E. S. 1963 Comparison of Tests for Randomness of Points on a Line. *Biometrika* 50:315–325.

Pearson, Karl 1900 On the Criterion That a Given System of Deviations From the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen From Random Sampling. *Philosophical Magazine* 5th Series 50:157–175.

Pyke, Ronald 1965 Spacings. *Journal of the Royal Statistical Society* Series B 27:395–449.

Rosenblatt, Judah 1962 Testing Approximate Hypotheses in the Composite Case. *Annals of Mathematical Statistics* 33:1356–1364.

Rosenblatt, Murray 1952 Remarks on a Multivariate Transformation. *Annals of Mathematical Statistics* 23: 470–472.

Sarkadi, Károly 1960 On Testing for Normality. Magyar Tudományos Akadémia, Matematikai Kutató Intézet, *Közlemények* Series A 5:269–274.

Shapiro, S. S.; and Wilk, M. B. 1965 An Analysis of Variance Test for Normality (Complete Samples). *Biometrika* 52:591–611.

Slakter, Malcolm J. 1965 A Comparison of the Pearson Chi-square and Kolmogorov Goodness-of-fit Tests With Respect to Validity. *Journal of the American Statistical Association* 60:854–858.

Smirnov, N. V. 1939a On the Estimation of the Discrepancy Between Empirical Curves of Distribution for Two Independent Samples. Moscow, Universitet, *Bulletin mathématique* Série Internationale 2, no. 2:3–26.

Smirnov, N. V. 1939b Ob ukloneniiakh empiricheskoi krivoi raspredeleniia (On the Deviations of the Empirical Distribution Curve). *Matematicheskii sbornik* New Series 6, no. 1:1–26. → Includes a French resume.

Sukhatme, P. V. 1938 On the Distribution of X^{2} in Samples of the Poisson Series. *Journal of the Royal Statistical Society* 5 (Supplement):75–79.

Suppes, Patrick et al. 1964 Empirical Comparison of Models for a Continuum of Responses With Non-contingent Bimodal Reinforcement. Pages 358-379 in R. C. Atkinson (editor), *Studies in Mathematical Psychology.* Stanford Univ. Press.

Watson, G. S. 1957 The X^{2} Goodness-of-fit Test for Normal Distributions. *Biometrika* 336–348.

Watson, G. S. 1961 Goodness-of-fit Tests on a Circle. *Biometrika* 48:109–114.

Weiss, Lionel 1958 Limiting Distributions of Homogeneous Functions of Sample Spacings. *Annals of Mathematical Statistics* 29:310–312.

Williams, C. Arthur Jr. 1950 On the Choice of the Number and Width of Classes for the Chi-square Test of Goodness of Fit. *Journal of the American Statistical Association* 45:77–86.
