Goodness of Fit
A goodness of fit procedure is a statistical test of a hypothesis that the sampled population is distributed in a specific way, for example, normally with mean 100 and standard deviation 15. Corresponding confidence procedures for a population distribution also fall under this topic. Related tests are for broader hypotheses, for example, that the sampled population is normal (without further specification). Others test hypotheses that two or more population distributions are the same.
Populations arise because of variability, of which various sources (sometimes acting together) can be distinguished. First, there is inherent variability among experimental units, for example, the heights, IQ’s, or ages of the students in a class each vary among themselves. Then there is measurement error, a more abstract or conceptual notion. The age of a student may have negligible measurement error, but his IQ does not; it depends on a host of accidental factors: how the student slept, the particular questions chosen for the test, and so on. There are also other conceptual populations, not properly thought of in terms of measurement error—the population of subject responses, for example, in the learning experiment below.
The distribution of a numerical population trait is often portrayed by a histogram, a density function, or some other device that shows the proportion of cases for which a particular value of the numerical trait is achieved (or the proportion within a small interval around a particular value). The shape of the histogram or density function is important; it may or may not be symmetrical. If it is not, it is said to be skew. If it is symmetrical, it may have a special kind of shape called normal. For example, populations of scores on intelligence tests are often assumed normally distributed by psychologists. Indeed, the construction of the test may aim at normality, at least for some group of individuals. Again, lifetimes of machines may be assumed to have negative exponential distributions, meaning that expected remaining life does not vary with age. [See Distributions, Statistical, article on special continuous distributions; Probability; Statistics, descriptive.]
It is technically often convenient, especially in connection with goodness of fit tests, to deal with the cumulative distribution function (c.d.f.) rather than with the density function. The c.d.f. evaluated at x is the proportion of cases with numerical values less than or equal to x; thus, if f(x) is a density function, the corresponding c.d.f. is
F(x) = ∫_{−∞}^{x} f(t) dt.
For explicitness, a subscript will be added to F, indicating the population, distribution, or random variable to which it applies. It is a matter of convention that cumulation is from the left and that it is based on “less than or equal to” rather than just “less than.”
The sample c.d.f. is the steplike function whose value at x is the proportion of observations less than or equal to x. Many goodness of fit procedures are based on geometrically suggested measures of discrepancy between sample and hypothetical population c.d.f.’s. Some informal procedures use “probability” graph paper, especially normal paper (on which a normal c.d.f. becomes a straight line).
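As a minimal sketch of the sample c.d.f. just defined, the following Python function returns the proportion of observations less than or equal to x. The function name `sample_cdf` and the toy data are illustrative assumptions, not part of the original exposition.

```python
from bisect import bisect_right


def sample_cdf(sample):
    """Return F_n, the step-like sample c.d.f. of the observations."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        raise ValueError("sample must be non-empty")

    def F_n(x):
        # bisect_right counts the observations that are <= x, matching the
        # "less than or equal to" convention of the text.
        return bisect_right(ordered, x) / n

    return F_n


# Hypothetical data, used only to show the jumps at the observed values.
F_n = sample_cdf([3.1, 1.2, 2.5, 2.5])
```

A tied value (2.5 above) simply produces a jump of twice the usual height 1/n.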
For nominal populations (for example, proportions of people expressing allegiance to different religions or to none) there is no concept corresponding to the c.d.f. The main emphasis of this article is on numerical populations.
Although goodness of fit procedures address themselves principally to the shape of population c.d.f.’s, the term “goodness of fit” is sometimes applied more generally than in this article. In particular, some authors write of goodness of fit of observed regressions to hypothetical forms, for example, to a straight line. [This topic is dealt with in Linear hypotheses, article on regression.]
Hypotheses: simple, composite, approximate. A test of goodness of fit, based on a sample from a population, assesses the plausibility that the population distribution has specified form; in brief, tests the hypothesis that F_{X} has shape F_{0}. The specification may be complete, that is, the population distribution may be specified completely, in which case the hypothesis is called simple. Alternatively, the form may be specified only up to certain unknown parameters, which often are the parameters of location and scale. In this case the hypothesis is called composite. Still another type of hypothesis is an approximate one, which is composite in a certain sense. Here one specifies first what one would consider a material departure from a hypothesized shape (Hodges & Lehmann 1954). For example, in the case of a simple approximate hypothesis, one might agree that F_{X} departs materially from F_{0} if the maximum vertical deviation between the actual and hypothesized cumulative distribution functions exceeds .07. The approximate hypothesis then states that the actual and hypothesized distributions do not differ materially in this sense.
Approximate hypotheses specialize to the others, so that a complete theory of testing for the former would be desirable. This is especially true since, as has been pointed out by Karl Pearson (1900) and Joseph Berkson (1938), tests of “exact” hypotheses, being as a rule consistent, have problematical logical status: unless the exact hypothesis is exactly correct and all of the sampling assumptions are exactly met, rejection of the hypothesis is assured (for fixed significance level) when sample size is large. Unfortunately, such a complete theory does not now exist, but the strong early interest in “exact” hypotheses was not misspent: the testing and “acceptance” of “exact” hypotheses concerning F_{X} seems to have much the same status as the provisional adoption of physical or other “laws.” If the latter has helped the advancement of science, so has no doubt the former; this is true notwithstanding that old hypotheses or theories will almost surely be discarded as additional data become available. This point has been made by Cochran (1952) and Chapman (1958). Cochran also suggests that the tests of “exact” hypotheses are “invertible” into confidence sets, in the usual manner, thus providing statistical procedures somewhat similar in intent to tests of approximate hypotheses [see Estimation, article on confidence intervals and regions].
Conducting a test of goodness of fit. Many tests of goodness of fit have been developed; as with statistical tests generally, a test of goodness of fit is conveniently conducted by computing from the sample a statistic and its sample significance level [see Hypothesis testing]. In the case of a test of goodness of fit, the statistic will measure the discrepancy between what the sample in fact is and what a sample from a population of hypothesized form ought to be. The sample significance level of an observed measure of discrepancy, d_{0}, is, at least for all the standard goodness of fit procedures, the probability, Pr{d ≥ d_{0}}, that d equals or exceeds d_{0} under random sampling from a population of hypothesized form. In other words, it is the proportion of like discrepancy measures, d, equaling or exceeding d_{0}, computed on the basis of many successive hypothetical random samples of the same size from a population of hypothesized form. For many tests of goodness of fit, there exist tables (for extensive bibliography see Greenwood & Hartley 1962) that give those values of d_{0} corresponding to given significance level and sample size (n). Many of these standard tests are nonparametric, which means that Pr{d ≥ d_{0}} is the same for a very large class of hypotheses F_{0}, so that only one such tabulation is required [see Nonparametric statistics].
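The description of the sample significance level as a proportion over many hypothetical random samples can be imitated directly by simulation. The sketch below is only an illustration of that idea, not a substitute for the published tables; the function names, the uniform null population, and the toy discrepancy measure (distance of the sample mean from .5) are all assumptions introduced here.

```python
import random


def simulated_significance_level(discrepancy, draw_sample, d0, n,
                                 repetitions=2000, seed=0):
    """Proportion of simulated discrepancies d with d >= d0 under the null."""
    rng = random.Random(seed)  # fixed seed so that reruns agree
    exceed = 0
    for _ in range(repetitions):
        sample = draw_sample(rng, n)  # random sample of hypothesized form
        if discrepancy(sample) >= d0:
            exceed += 1
    return exceed / repetitions


def draw_uniform(rng, n):
    """Hypothetical null population: uniform between zero and one."""
    return [rng.random() for _ in range(n)]


def mean_discrepancy(sample):
    """Toy discrepancy: distance of the sample mean from the null mean .5."""
    return abs(sum(sample) / len(sample) - 0.5)
```

Calling `simulated_significance_level(mean_discrepancy, draw_uniform, d0, n)` with an observed `d0` returns an estimate of Pr{d ≥ d_{0}} whose precision depends on `repetitions`.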
If, as is usual, the relevant alternative population distributions (more generally, alternative probabilistic models for the generation of the sample at hand) tend to encourage large values of d_{0}, the hypothesized population distribution will be judged implausible if the sample significance level is small (conventionally .05 or less). If the sample significance level is not small, it means that the statistic has a value unsurprising under the null hypothesis, so that the test gives no reason to reject the null hypothesis. If, however, the sample significance level is very large, say .95 or more, one may construe this as a warning of possible trouble, say, that an overzealous proponent of the hypothesis has slanted the data or that the sampling was not random. Note here an awkward usage prevalent in statistics generally: an observed measure of discrepancy d_{0} with low probability Pr{d ≥ d_{0}} usually is described as highly significant.
Choosing a test of goodness of fit. Choosing a test of goodness of fit amounts to deciding in what sense the discrepancy between the hypothesized population distribution and the sample is to be measured: the sample c.d.f. may be compared directly with the hypothesized population c.d.f., as is done in the case of tests of the Kolmogorov-Smirnov type. For example, the original Kolmogorov-Smirnov test itself, as described below, summarizes the discrepancy by the maximum absolute deviation between the hypothesized population c.d.f., F_{0}, and the sample c.d.f. Alternatively, one may compare uncumulated frequencies, as for the χ^{2} test. Again, a standard shape parameter, such as skewness, may be computed for the sample and for the hypothesized population and the two compared.
Any reasonable measure of discrepancy will of course tend to be small if the population yielding the sample conforms to the null hypothesis. A good measure of discrepancy will, in addition, tend to be large under the likely alternative forms of the population distribution, a property designated technically by the term power. For example, the sample skewness coefficient might have good power if the hypothesized population distribution were normal (zero population skewness coefficient) and the relevant alternative distributional forms were appreciably skew.
Two general considerations. Two general considerations should be kept in mind. First it is important that the particular goodness of fit test used be selected without consideration of the sample at hand, at least if the calculated significance level is to be meaningful. This is because a measure of discrepancy chosen in the light of an observed sample anomaly will tend to be inordinately large. Receiving license plate 437918 hardly warrants the inference that, this year, the first and second digits add to the third, and the fifth and sixth to the fourth. It may of course be true, in special instances, that some adjustment of the test procedure in the light of the data does not affect the significance computations appreciably—as, for example, when choosing category intervals, based on the sample mean and variance, for the χ^{2} test (Watson 1957).
Second, a goodness of fit test, like any other statistical test, leads to an inference from a sample to the population sampled. Indeed, the usual hypothesis under test is that the sample is in fact a random sample from an infinite population of hypothesized form, and the tabulated probabilities, Pr{d ≥ d_{0}}, almost always presuppose this. (In principle, one could obtain goodness of fit tests for more complex kinds of probability samples than random ones, but little seems to be known about such possibilities.) It is therefore essential that the sample to which a standard test is applied can be thought of as a random sample. If it cannot, then one must be prepared either to do one’s own nonstandard significance probability computations or to defend the adequacy of the approximation involved in using the standard tabulations. Consider, for example, starting with a random sample involving considerable repetition, say the sample of response words obtained from a panel of subjects taking a psychological word association test or the sample of nationalities obtained from a survey of the United Nations. Suppose now that one tallies the number of items in the sample (response words, nationalities) appearing exactly once, exactly twice, etc. There results a new set of data, consisting of a certain number of one’s, a certain number of two’s, etc. This collection of integers has the outward appearance of a random sample, and the literature contains instances of the application of the standard tests of goodness of fit to such observed frequencies. Yet the probability mechanism that generates these integers has no resemblance whatever to random sampling, and the standard probability tabulations cannot be assumed to apply. Other examples arise when the data are generated by time series; for some of these the requisite nonstandard probability computations have been done (Patankar 1954), while, in other cases, special devices have made the standard computations apply.
For example, in the case of the learning experiment by Suppes and his associates (1964), the sample consists of the time series of a subject’s responses to successive stimuli. Certain theories of learning predict a particular bimodal long-run response population distribution; but the goodness of fit test of this hypothesized shape, on the basis of a series of subject responses, is hampered by the statistical dependence of neighboring responses. However, theory suggests, and a test of randomness confirms, that the subsample consisting of every fifth response is effectively random, enabling a standard χ^{2} test of goodness of fit to be carried out on the basis of this subsample. Whether four-fifths of the sample is a reasonable price to pay for validly carrying out a standard procedure is of course a matter of debate.
Tests of simple hypotheses
The χ^{2} test
The χ^{2} test was first proposed in 1900 by Karl Pearson. To apply the test, one first divides the possible range of numbers (number pairs in the bivariate case) into k regions. For example, if only nonnegative numbers are possible, one might use the categories 0 to .2, .2 to .5, .5 to .7, and .7 and beyond. Next, one computes the probabilities, p_{i}, associated with each of these regions (intervals in the example just given) under the hypothesized F_{0}. This is often done by subtracting values of F_{0} from each other; for example, when F_{0} is the exponential cumulative distribution function 1 − e^{−x}, the category .2 to .5 has probability
p_{2} = F_{0}(.5) − F_{0}(.2) = e^{−.2} − e^{−.5}.
The expected numbers E_{i} of observations in each category are (under the null hypothesis) E_{i} = np_{i} where n is the size of the random sample.
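The computation of the p_{i} and E_{i} can be sketched as follows for the four categories and the exponential F_{0} used as the example in the text. The function names and the sample size n = 100 are illustrative assumptions.

```python
import math


def exponential_cdf(x):
    """Hypothesized F_0(x) = 1 - e^{-x} for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


def category_probabilities(cdf, edges):
    """p_i = F_0(upper edge) - F_0(lower edge) for consecutive edges."""
    return [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]


# Categories 0 to .2, .2 to .5, .5 to .7, and .7 and beyond.
edges = [0.0, 0.2, 0.5, 0.7, math.inf]
p = category_probabilities(exponential_cdf, edges)

n = 100  # assumed sample size, for illustration only
expected = [n * p_i for p_i in p]  # E_i = n p_i
```

Because the last category is open-ended, the probabilities sum to one and the expected numbers sum to n.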
After the sample has been collected, there also will be observed numbers, O_{i}, of sample members in each category. The chi-square measure of discrepancy d_{χ^{2}} is then computed by summing squared differences of class frequencies, weighted in such a way as to bring to bear standard distribution theory,
d_{χ^{2},0} = Σ_{i=1}^{k} (O_{i} − E_{i})^{2} / E_{i},
where the subscript 0 indicates the specific sample value of d_{χ^{2}}. (Often “X^{2}” or “χ^{2}” is used to denote this statistic.)
As is shown, for example, by Cochran (1952), the probability distribution of d_{χ^{2}}, when F_{X} = F_{0}, can be approximated by the chi-square distribution with k − 1 degrees of freedom, χ^{2}_{k−1}. This fact, to which the test owes its name, was first demonstrated by Karl Pearson. The larger the expectations E_{i}, the better is the approximation; this has been pointed out, for example, by Mann and Wald (1942). Hence, the significance, Pr{d_{χ^{2}} ≥ d_{χ^{2},0}}, is evaluated to a good approximation by consulting a tabulation of the χ^{2}_{k−1} distribution. For example, if k, as above, equals 4, and d_{χ^{2},0} had happened to be 4.6, then Pr{d_{χ^{2}} ≥ d_{χ^{2},0}} ≅ .20. With a sample significance level of .20, most statisticians would not question the plausibility of F_{0}. However, were d_{χ^{2},0} larger, and the corresponding significance equal to .05 or less, the consensus would be reversed.
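The worked figure above (k = 4, observed discrepancy 4.6, significance about .20) can be checked numerically. The sketch below computes the usual Pearson sum of (O_{i} − E_{i})^{2}/E_{i} and evaluates the chi-square tail probability from a standard recurrence in the degrees of freedom, in place of a printed table; the function names are assumptions made for illustration.

```python
import math


def chi_square_discrepancy(observed, expected):
    """Sum over categories of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_tail(x, df):
    """Pr{chi-square variable with integer df >= 1 is at least x}."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        tail, j = math.erfc(math.sqrt(half)), 1  # exact tail for 1 d.f.
    else:
        tail, j = math.exp(-half), 2             # exact tail for 2 d.f.
    # Recurrence: Q(x; j + 2) = Q(x; j) + (x/2)^(j/2) e^(-x/2) / Gamma(j/2 + 1)
    while j < df:
        tail += math.exp((j / 2.0) * math.log(half) - half
                         - math.lgamma(j / 2.0 + 1.0))
        j += 2
    return tail


significance = chi_square_tail(4.6, 3)  # k = 4 categories, so k - 1 = 3
```

With k − 1 = 3 degrees of freedom the computed tail probability is close to the .20 quoted in the text.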
At what point is the distributional approximation endangered by small E_{i}? An early study of this problem, performed by Cochran in 1942 (referred to in Cochran 1952), shows that a few E_{i} near 1 among several large ones do not materially affect the approximation. Recent studies, by Kempthorne (1966) and by Slakter (1965), show that this is true as well when all E_{i} are near 1.
These and other studies indicate that, although some care must be taken to avoid very small E_{i}, much latitude remains for choosing categories. How is this to be done? To begin with, in keeping with the spirit of remarks by Birnbaum (1953), if the relevant alternatives F* to F_{0} are such that the corresponding population discrepancy
d_{χ^{2}}(F*, F_{0}) = Σ_{i=1}^{k} (p*_{i} − p_{i})^{2} / p_{i},
where p*_{i} denotes the probability of category i under F*, is large for a certain choice of k categories, it is these categories that should be selected. Among various sets of k categories, those yielding large d_{χ^{2}}(F*, F_{0}) are preferred.
In the absence of detailed knowledge of the alternatives, the usual recommendation, at least in the one-dimensional case, is to use intervals of equal E_{i}. There remains the question of how many such intervals there should be. The typical statistical criterion for this is power, that is, the likelihood that the value of d_{χ^{2}} will be large enough to warrant rejection of the hypothesis F_{0} when the population is in fact a relevant alternative one. If large power is desired for all alternative population c.d.f.’s departing from F_{0} at some x by at least a given fixed amount, Mann and Wald (1942) recommend a number of categories of the order of 4n^{2/5}. Williams (1950) has shown that this figure can easily be halved.
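A rough sketch of the recipe just described: take a number of categories of the order of 4n^{2/5} and place the category boundaries at equally spaced quantiles of F_{0}, so that every E_{i} equals n/k. The rounding rule, the function names, and the use of the exponential F_{0} as the example are assumptions introduced here; the order-of-magnitude figure is used as if it were exact only for illustration.

```python
import math


def mann_wald_category_count(n):
    """Category count of the order 4 n^(2/5), rounded to an integer."""
    return max(2, round(4 * n ** 0.4))


def equal_probability_cut_points(k, inverse_cdf):
    """Interior boundaries giving k categories of probability 1/k each."""
    return [inverse_cdf(i / k) for i in range(1, k)]


def exponential_inverse_cdf(q):
    """Inverse of F_0(x) = 1 - e^{-x}, for 0 <= q < 1."""
    return -math.log(1.0 - q)


k = mann_wald_category_count(100)
cuts = equal_probability_cut_points(4, exponential_inverse_cdf)
```

Halving the count, in line with the remark attributed to Williams, amounts to calling `equal_probability_cut_points` with `k // 2`.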
The χ^{2} test is versatile; it is readily adapted to problems involving nominal rather than numerical populations [see Counted data]. It can also be adapted to bivariate and multivariate problems, as, for example, by Keats and Lord (1962), where the joint distribution of two types of mental test scores is considered. As opposed to many of its competitors, the χ^{2} test is not biased, in the sense that there are no alternatives F* to F_{0} under which acceptance of F_{0} is more likely than it is under F_{0} itself. It is readily adapted to composite and approximate testing problems. Also, it seems to be true that the χ^{2} test is in the best position among its competitors with regard to the practical computation of power. As is pointed out by Cochran (1952), such computations are performed by means of the noncentral chi-square distribution with k − 1 degrees of freedom.
Modifications of the χ^{2} test
Important modifications of the χ^{2} test, intended to increase its power against specific families F of distributions alternative to F_{0}, are given by Neyman (1949) and by Fix, Hodges, and Lehmann (1954). Here F is assumed to include F_{0} and to allow differentiable parametric representation of the category expectations E_{i}. Note that the inclusion of F_{0} in F differs from the point of view adopted, for example, by Mann and Wald (1942). These modifications are essentially likelihood ratio tests of F_{0} versus F and are similar to procedures used to test composite and approximate hypotheses.
Another modification, capable of orientation against specific “smooth” alternatives, Neyman’s ψ^{2} test, was introduced in 1937. Other important modifications are described in detail in Cochran (1954).
Other procedures
When (X_{1}, …, X_{n}) is a random sample from a population distributed according to a continuous c.d.f. F_{0}, then (U_{1}, …, U_{n}) = (F_{0}(X_{1}), …, F_{0}(X_{n})) has all the probabilistic properties of a random sample from a population distributed uniformly over the numbers between zero and one. (If the population has a density function, the c.d.f. is continuous.) No matter what the hypothesized F_{0}, the initial application of this probability integral transformation thus reduces all probability computations to the case of this uniform population distribution and gives a nonparametric character to any procedure based on the transformed sample (U_{1}, …, U_{n}). Most goodness of fit tests of simple hypotheses are nonparametric in this sense, including the χ^{2} test itself, when categories are chosen so as to assign specified values, for example, the constant value 1/k, to the category probabilities p_{i}.
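A brief sketch of the probability integral transformation and of the equal-probability χ^{2} categories it makes possible: once the sample is transformed to U_{i} = F_{0}(X_{i}), categories with p_{i} = 1/k are simply the k equal subintervals of the unit interval, whatever F_{0} may be. The function names and the exponential example are assumptions for illustration.

```python
import math


def probability_integral_transform(sample, cdf):
    """Map each observation X_i to U_i = F_0(X_i)."""
    return [cdf(x) for x in sample]


def equiprobable_counts(u_values, k):
    """Observed counts O_i in the k equal subintervals of [0, 1]."""
    counts = [0] * k
    for u in u_values:
        index = min(int(u * k), k - 1)  # u == 1.0 is put in the last cell
        counts[index] += 1
    return counts


def exponential_cdf(x):
    """Hypothesized F_0(x) = 1 - e^{-x} for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


u = probability_integral_transform([math.log(2.0)], exponential_cdf)
```

The counts returned by `equiprobable_counts` are compared with the common expectation n/k in the usual χ^{2} sum.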
Another common test making use of the transformation U = F_{0}(X) is the Kolmogorov-Smirnov test, first suggested by Kolmogorov (1933) and explained in detail by Goodman (1954) and Massey (1951). The test bears Smirnov’s name, as well as Kolmogorov’s, presumably because Smirnov (as Doob and Donsker did later) gave an alternate derivation of its asymptotic null distribution, tabulated this distribution, and also extended the test to the two-sample case discussed below (1939a). Denote by F_{n}(x) the sample c.d.f., that is, F_{n}(x) is the proportion of sample values less than or equal to x. The test is based on the maximum absolute vertical deviation between F_{n}(x) and F_{0}(x),
d_{K} = sup_{x} |F_{n}(x) − F_{0}(x)|,
the dependence of d_{K} on the quantities U_{i} = F_{0}(X_{i}) being best brought out by the alternate formula
d_{K} = max_{1 ≤ i ≤ n} max{i/n − u_{i}, u_{i} − (i − 1)/n},
where u_{1} is the smallest U_{i}, u_{2} is the next to smallest, etc.; the equivalence of the two formulas is made clear by a sketch. As Kolmogorov noted in his original paper, the probabilities tabulated for d_{K} are conservative when F_{0} is not continuous, in the sense that, for discontinuous F_{0}, actual probabilities of d_{K} ≥ d_{K,0} will tend to be less than those obtained from the tabulations, leading to occasional unwarranted acceptance of F_{0}.
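The equivalence of the two ways of computing the Kolmogorov-Smirnov discrepancy can also be checked numerically. The sketch below assumes the standard forms: the largest absolute gap between F_{n} and F_{0}, examined just before and at each jump of F_{n}, and the formula in terms of the ordered u's, namely the largest of i/n − u_{i} and u_{i} − (i − 1)/n. Function names are illustrative, and a continuous F_{0} is assumed.

```python
from bisect import bisect_left, bisect_right


def ks_discrepancy(sample, cdf):
    """d_K from the ordered transforms u_1 <= ... <= u_n."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    return max(max(i / n - u_i, u_i - (i - 1) / n)
               for i, u_i in enumerate(u, start=1))


def ks_discrepancy_direct(sample, cdf):
    """d_K from |F_n - F_0| just before and at each observed value."""
    ordered = sorted(sample)
    n = len(ordered)
    worst = 0.0
    for x in ordered:
        at_jump = bisect_right(ordered, x) / n      # F_n(x)
        before_jump = bisect_left(ordered, x) / n   # limit of F_n from the left
        f0 = cdf(x)
        worst = max(worst, abs(at_jump - f0), abs(before_jump - f0))
    return worst


def uniform_cdf(x):
    """c.d.f. of the uniform distribution between zero and one."""
    return min(1.0, max(0.0, x))
```

For the hypothetical sample (.1, .4, .7) and the uniform F_{0}, both functions give the deviation at the largest observation, 1 − .7.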
Computations (Shapiro & Wilk 1965) suggest that this test has low power against alternatives with mean and variance equal to those of the hypothesized distribution. It has, however, been argued, for example, by Birnbaum (1953) and Kac, Kiefer, and Wolfowitz (1955), that the test yields good minimum power over classes of alternatives F* satisfying d_{K}(F*, F_{0}) ≥ δ; these, as the reader will note, are precisely the classes of alternatives envisaged by Mann and Wald (1942) in optimizing the number of categories used in the χ^{2} test. A detrimental feature of the Kolmogorov-Smirnov test is its bias, pointed out in Massey (1951).
An important feature of the test is that it can be “inverted” in keeping with the usual method to provide a confidence band for F_{X}(x) centered on F_{n}(x), which, except for the narrowing caused by the restriction 0 ≤ F_{X}(x) ≤ 1, has constant width [see Estimation, article on confidence intervals and regions]. The construction of such a band has been suggested by Wald and Wolfowitz and is described by Goodman (1954). Attaching a significance probability to an observed d_{K,0} amounts to ascertaining the band width required in order just to include wholly the hypothesized F_{0} in the confidence band.
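The inversion into a confidence band can be sketched as follows: the band consists of the sample c.d.f. plus and minus a fixed half-width, truncated at zero and one. The half-width must be taken from a tabulation of d_{K} for the chosen confidence level and sample size; it is passed in as an argument here rather than computed, and the function name is an assumption.

```python
from bisect import bisect_right


def confidence_band(sample, half_width):
    """Return a function giving (lower, upper) band limits at any x."""
    ordered = sorted(sample)
    n = len(ordered)

    def band(x):
        f_n = bisect_right(ordered, x) / n  # sample c.d.f. at x
        # Constant width, except where truncated to stay within [0, 1].
        return max(0.0, f_n - half_width), min(1.0, f_n + half_width)

    return band


# Hypothetical data and half-width, for illustration only.
band = confidence_band([1, 2, 3, 4], 0.3)
```

A hypothesized F_{0} is accepted at the corresponding level exactly when it lies within the band at every x.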
The Kolmogorov-Smirnov test has been modified in several ways; the first of these converts the test into a “one-sided” procedure based on the discrepancy
d_{K+} = sup_{x} [F_{n}(x) − F_{0}(x)].
A useful feature of this modification is the simplicity of the large sample computation of significance probabilities associated with observed discrepancies d_{K+,0}; abbreviating the latter to d, one has Pr{d_{K+} ≥ d} ≅ e^{−2nd^{2}}, where n is the sample size. It is verified by Chapman (1958) that d_{K+} yields good minimum power over those classes of alternatives F* that satisfy d_{K+}(F*, F_{0}) ≥ δ.
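A small sketch of the one-sided procedure, under two assumptions stated plainly: that the one-sided discrepancy is the largest amount by which the sample c.d.f. exceeds F_{0}, and that the large sample approximation takes the standard form exp(−2nd^{2}) with n the sample size. The function names are illustrative.

```python
import math


def one_sided_discrepancy(sample, cdf):
    """Largest value of F_n(x) - F_0(x), attained at an observed value."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    return max(i / n - u_i for i, u_i in enumerate(u, start=1))


def one_sided_large_sample_tail(d, n):
    """Approximate Pr{d_K+ >= d} for large n, assumed form exp(-2 n d^2)."""
    return min(1.0, math.exp(-2.0 * n * d * d))


def uniform_cdf(x):
    """c.d.f. of the uniform distribution between zero and one."""
    return min(1.0, max(0.0, x))
```

The approximation is a large sample one; for small n the exact tabulations remain the safer reference.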
Other, more complex modifications provide greater power against special alternatives, as in the weight function modifications (Darling 1957), which provide greater power against discrepancies from F_{0} in the tails. Another sort of modification, introduced and tabulated by Kuiper in 1960, calls for a measure of discrepancy d_{V} that is especially suited to testing goodness of fit to hypothesized circular distributions, being invariant under arbitrary choices of the angular origin. This property could be important, for example, in psychological studies involving responses to the color wheel, or in the learning experiment mentioned above. The measure d_{V} also has been singled out by E. S. Pearson (1963) as the generally most attractive in competition with d_{K} and the discrepancy measures d_{ω^{2}} and d_{U} mentioned below.
A second general class of procedures also making use of the transformation U = F_{0}(X) springs from the discrepancy measure
d_{ω^{2}} = n ∫_{−∞}^{∞} [F_{n}(x) − F_{0}(x)]^{2} dF_{0}(x),
first proposed by Cramér in 1928 and also by von Mises in 1931 (see Darling 1957). Marshall (1958) has verified a startling agreement between the asymptotic and small sample distributions of d_{ω^{2}} for sample sizes n as low as 3. Power considerations for d_{ω^{2}} are similar to those expressed for d_{K}, and are discussed also in the sources cited by Marshall; the test based on d_{ω^{2}} can be expected to have good minimum power over classes of alternatives F* satisfying conditions of the form d_{ω^{2}}(F*, F_{0}) ≥ δ. However, the test is biased (as is that based on d_{K}).
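For computation, the Cramér-von Mises discrepancy is usually evaluated from the ordered transforms u_{1} ≤ … ≤ u_{n} rather than by integration. The sketch below assumes the discrepancy is defined as n times the integral of [F_{n}(x) − F_{0}(x)]^{2} with respect to F_{0}, for which the standard computing formula is 1/(12n) plus the sum of squared deviations of u_{i} from (2i − 1)/(2n). The function name is illustrative.

```python
def cramer_von_mises_discrepancy(sample, cdf):
    """n * integral of (F_n - F_0)^2 dF_0, via the ordered transforms."""
    u = sorted(cdf(x) for x in sample)
    n = len(u)
    return 1.0 / (12 * n) + sum(
        (u_i - (2 * i - 1) / (2 * n)) ** 2
        for i, u_i in enumerate(u, start=1)
    )


def uniform_cdf(x):
    """c.d.f. of the uniform distribution between zero and one."""
    return min(1.0, max(0.0, x))
```

The smallest attainable value, 1/(12n), occurs when each u_{i} sits exactly at (2i − 1)/(2n).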
As in the case of d_{K}, d_{ω^{2}} has weight function modifications for greater power selectivity, and also a modification d_{U}, analogous to the modification d_{V} of d_{K} and introduced by Watson (1961), which does not depend on the choice of angular origin and is thus also suited for testing the goodness of fit to hypothesized circular distributions.
Other procedures include those based on the Fisher-Pearson measures d^{(1)} = −2 Σ_{i=1}^{n} log u_{i} and d^{(2)} = −2 Σ_{i=1}^{n} log (1 − u_{i}), apparently first suggested in connection with goodness of fit in 1938 by E. S. Pearson. As pointed out by Chapman (1958), the tests based on d^{(1)} and d^{(2)} are uniformly most powerful against polynomial alternatives to F_{U}(x) = x of form x^{k} and (1 − x)^{k}, k > 1, and hence are “smooth” in the sense of Neyman’s ψ^{2} test. Computations by Chapman suggest that, dually to d_{K}, d^{(2)} has good maximum power over classes of alternatives F* satisfying d_{K}(F*, F_{0}) ≤ δ.
Another set of procedures, discussed and defended by Pyke (1965) and extensively studied by Weiss (1958), are based on functions of the spacings, u_{i+1} − u_{i} or u_{i} − i(n + 1)^{−1}, of the u’s, from each other or from their expected locations under F_{0}. Still another criterion (Smirnov 1939b) examines the number of crossings of F_{n}(x) and F_{0}(x).
An important modification, applicable to all of the procedures in this section, is suggested in Durbin (1961). This modification is intended to increase the power of any procedure based on the transforms U_{i}, against a certain class of alternatives described in that paper.
Since there are multivariate probability integral transformations, applying an initial “uniformizing” transformation is possible in the multivariate case as well. However, one of several possible transformations must now be chosen, and, related to this nonuniqueness, the direct analogues of the univariate discrepancy measures are no longer functions of uniformly distributed transforms and do not lead to nonparametric tests (Rosenblatt 1952).
Tests of composite hypotheses
The χ^{2} test
In the composite case, the null hypothesis specifies only that F_{X}(x) is a member of a certain parametric class {F_{θ}(x)}. Typically, but not necessarily, θ is the pair (μ, σ), with μ a parameter of location and σ a parameter of scale, in which case F_{θ}(x) may be written F_{0}[(x − μ)/σ]. In any event, there arises the question of modifying the measure d_{χ^{2}} of discrepancy between the sample and a particular cumulative distribution function into a measure D_{χ^{2}} of discrepancy between the sample and the class {F_{θ}(x)}. A natural approach is to set
D_{χ^{2}} = min_{θ} d_{χ^{2}}.
If θ is composed of m parameters, it can be shown that, under quite general conditions, D_{χ^{2}} is approximately distributed according to the χ^{2}_{k−m−1} distribution when F_{X}(x) equals any one of the F_{θ}(x). Hence significance probability computations can once again be referred to tabulations of the χ^{2} distribution. The requisite minimization with respect to θ can be cumbersome, and several modifications have been proposed, for example, the following by Neyman (1949):
Suppose that one defines d_{χ^{2}}(θ) as the discrepancy d_{χ^{2}} between the observed sample and the particular distribution F_{θ}(x). Then D_{χ^{2}} is defined also by
D_{χ^{2}} = d_{χ^{2}}(θ̃),
with the estimator θ̃ computed from
d_{χ^{2}}(θ̃) = min_{θ} d_{χ^{2}}(θ),
that is, with θ̃ the minimum chi-square estimator of θ. The suggested modifications involve using estimators of θ alternate to θ̃ in this last definition of D_{χ^{2}}, that is, estimators that “essentially” minimize d_{χ^{2}}(θ); among these are the so-called grouped-data or partial-information maximum likelihood estimators.
Frequently used but not equivalent estimators are the ordinary “full-information” maximum likelihood estimators θ̂ of θ, for example, (x̄, s) for (μ, σ) in the normal case. These do not “essentially” minimize d_{χ^{2}}(θ) and consequently tend to inflate D_{χ^{2}} beyond values predicted by the χ^{2}_{k−m−1} distribution, leading to some unwarranted rejections of the composite hypothesis. However, it is indicated by Chernoff and Lehmann (1954), and also by Watson (1957), that no serious distortion will result if the number of categories is ten or more.
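The minimization over θ can be illustrated with a deliberately simple case: a one-parameter exponential scale family F_{θ}(x) = 1 − e^{−x/θ}, fixed categories, and a crude grid search standing in for a proper minimization routine. The family, the grid, and the function names are assumptions chosen for brevity; with m = 1 parameter and k = 4 categories the reference distribution would have k − m − 1 = 2 degrees of freedom.

```python
import math


def chi_square_for_scale(observed, cut_points, theta):
    """Chi-square discrepancy between the counts and F_theta(x) = 1 - e^{-x/theta}."""
    n = sum(observed)
    edges = [0.0] + list(cut_points) + [math.inf]

    def cdf(x):
        return 1.0 - math.exp(-x / theta)

    total = 0.0
    for o, a, b in zip(observed, edges, edges[1:]):
        e = n * (cdf(b) - cdf(a))  # expected number under F_theta
        total += (o - e) ** 2 / e
    return total


def minimum_chi_square(observed, cut_points, theta_grid):
    """Grid-search stand-in for the minimum chi-square estimator and D."""
    best = min(theta_grid,
               key=lambda t: chi_square_for_scale(observed, cut_points, t))
    return best, chi_square_for_scale(observed, cut_points, best)


theta_grid = [i / 100 for i in range(50, 201)]  # hypothetical search range
```

Substituting the sample mean (the full-information maximum likelihood estimate of θ here) for the grid minimizer can only give a value at least as large, which is the inflation referred to above.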
Composite analogues of other tests
Adaptation of the tests based on the probability integral transformation to the composite case proceeds much as in the case of χ^{2}. With definitions of d_{ω^{2}}(θ) and d_{K}(θ) analogous to that of d_{χ^{2}}(θ), Darling (1955) has investigated the large sample probability distribution of D_{ω^{2}} = d_{ω^{2}}(θ̂) and D_{K} = d_{K}(θ̂) for efficient estimators θ̂ of θ analogous to the estimators θ̃ for χ^{2}. Note that in the absence of any χ^{2}-like categories, the ordinary full-information maximum likelihood estimators now do qualify as estimators θ̂.
A major problem now is, however, that the modified procedures are no longer nonparametric. Thus a special investigation is required for every composite hypothesis. This is done by Kac, Kiefer, and Wolfowitz (1955) for the normal scale-location family, and the resulting large sample distribution is partly tabulated.
Tests based on special characteristics
The alternatives of concern sometimes differ from a composite null hypothesis in a manner easily described by a standard shape parameter. Special tests have been proposed for such cases. For example, the sample skewness coefficient has been suggested (Geary 1947) for testing normality against skew alternatives. Again, for testing Poissonness against alternatives with variance unequal to mean, R. A. Fisher has recommended the variance-to-mean ratio (see Cochran 1954). This measure is approximately distributed as χ^{2}_{n−1}/(n − 1), when Poissonness in fact obtains, for λ > 1 and n > 15 (Sukhatme 1938), which follows from the fact that the denominator is then a high-precision estimate of λ, and the numerator is approximately distributed as λχ^{2}_{n−1}/(n − 1). Analogous recommendations apply to testing binomiality. Essentially the same point of view underlies tests of normality based on the ratio of mean deviation, or of the range, to the standard deviation.
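The variance-to-mean ratio is easily computed. The sketch below assumes the usual sample variance with divisor n − 1 and, in keeping with the approximation described above, also returns n − 1 times the ratio for reference to a chi-square tabulation with n − 1 degrees of freedom. The function names and the toy counts are illustrative assumptions.

```python
def variance_to_mean_ratio(counts):
    """Sample variance (divisor n - 1) divided by the sample mean."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two counts")
    mean = sum(counts) / n
    variance = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return variance / mean


def dispersion_chi_square(counts):
    """(n - 1) times the ratio, referred to chi-square with n - 1 d.f."""
    return (len(counts) - 1) * variance_to_mean_ratio(counts)


ratio = variance_to_mean_ratio([2, 4, 6])  # hypothetical counts
```

Ratios well above one point toward alternatives with variance exceeding the mean, and ratios well below one toward the opposite.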
Transforming into simple hypotheses
Another interesting approach to the composite problem, advocated by Hogg and also by Sarkadi (1960), is to transform certain composite hypotheses into equivalent simple ones.
Specifically, there are location-scale parametric families {F_{0}[(x − μ)/σ]} with the following property: A random sample from any particular F_{0}[(x − μ)/σ] is reducible by a transformation T to a new set of random variables, Y = T(X), constituting in effect a random sample from a distribution G(y) involving no unknown parameters at all. Moreover, only random samples from distributions F_{0}[(x − μ)/σ] lead to G(y) when operated on by T.
It then follows that testing the composite hypothesis H that (X_{1}, …, X_{n}) is a random sample from a distribution F_{0}[(x − μ)/σ] with some μ and some σ is equivalent to testing the hypothesis H′ that (Y_{1}, …, Y_{m}) is a random sample from the distribution G(y). Any of the tests for simple hypotheses is then available for testing H′. An example is provided by a negative exponential F_{0} and uniform G, in which case the ordered exponential random sample (X_{(1)}, …, X_{(n)}) is transformed into an ordered uniform random sample (Y_{(1)}, …, Y_{(n−2)}) by the transformation
Y_{(j)} = [Σ_{i=2}^{j+1} (n − i + 1)(X_{(i)} − X_{(i−1)})] / [Σ_{i=2}^{n} (n − i + 1)(X_{(i)} − X_{(i−1)})], j = 1, …, n − 2.
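A sketch of one standard transformation of this kind, offered as an assumption rather than as the exact formula of the authors cited: for an exponential sample with unknown location and scale, the normalized spacings (n − i + 1)(X_{(i)} − X_{(i−1)}), i = 2, …, n, are formed, and their successive partial sums are divided by their total, giving n − 2 ordered values that behave like an ordered uniform sample. The function name is illustrative.

```python
def exponential_to_uniform(sample):
    """Map an exponential sample (unknown location and scale) to n - 2 ordered values in (0, 1)."""
    x = sorted(sample)
    n = len(x)
    if n < 3:
        raise ValueError("need at least three observations")
    # Normalized spacings D_i = (n - i + 1)(X_(i) - X_(i-1)), i = 2, ..., n.
    spacings = [(n - i + 1) * (x[i - 1] - x[i - 2]) for i in range(2, n + 1)]
    total = sum(spacings)  # zero only if all observations coincide
    partial, y = 0.0, []
    for d in spacings[:-1]:  # the last partial sum would always equal 1
        partial += d
        y.append(partial / total)
    return y


y = exponential_to_uniform([7, 1, 4, 2])  # hypothetical data
```

The result is unchanged when every observation is shifted and rescaled, which is how the unknown μ and σ drop out.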
Conditioning
Another way of neatly doing away with the unknown parameter is to consider the conditional distribution of the sample, given a sufficient estimate of it. This method is advocated, at least for testing Poissonness, in Fisher (1950).
Tests related to probability plots
S. S. Shapiro and M. B. Wilk have quantified in various ways the departure from linearity of the sorts of probability plots mentioned above, in particular of the plot of the ordered sample values against the expected values of the standardized order statistics [see Nonparametric statistics, article on order statistics]. This new approach bears some similarity to one given in Darling (1955), which is based on the measure d_{ω^{2}} modified for the composite case. Both approaches, in a sense, compare adjusted observed order statistics with standardized order statistic expectations. But the approach of Shapiro and Wilk is tailored more explicitly to particular scale-location families, by using their particular order statistic variances and covariances. It is no wonder that preliminary evaluations of this sort of approach (for example, by Shapiro & Wilk 1965) have shown exceptional promise. As an added bonus, the procedure is similar over the entire scale-location family; that is, its probability distribution is independent of location and scale.
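The idea of measuring departure from linearity in a normal probability plot can be illustrated with a simplified relative of the Shapiro-Wilk statistic. The sketch below is not the W statistic itself: it ignores the variances and covariances of the order statistics and uses Blom's approximation to the expected normal order statistics, both of which are assumptions made here for brevity. The function names are hypothetical.

```python
from statistics import NormalDist, mean


def blom_scores(n):
    """Approximate expected values of standard normal order statistics
    using Blom's formula (an approximation, not exact expectations)."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def plot_linearity_statistic(sample):
    """Squared correlation between the ordered sample and approximate
    normal scores. Values near 1 indicate a nearly straight probability
    plot. Assumes the sample is not constant."""
    xs = sorted(sample)
    ms = blom_scores(len(xs))
    xbar, mbar = mean(xs), mean(ms)
    sxy = sum((x - xbar) * (m - mbar) for x, m in zip(xs, ms))
    sxx = sum((x - xbar) ** 2 for x in xs)
    smm = sum((m - mbar) ** 2 for m in ms)
    return sxy * sxy / (sxx * smm)


if __name__ == "__main__":
    data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]  # hypothetical values
    print(plot_linearity_statistic(data))
    # Shifting and rescaling the data leaves the statistic unchanged.
    print(plot_linearity_statistic([3 + 7 * x for x in data]))
```

Because a correlation is unaffected by shifts and positive rescalings of the data, the statistic has the same null distribution over the whole scale-location family, which is the property noted at the end of the paragraph above. Critical values would have to come from tables or simulation; none are supplied here.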
Approximate hypotheses
The first, and seemingly most practically developed, attempt to provide the requisite tests of approximate hypotheses is found in Hodges and Lehmann (1954). Hodges and Lehmann assume the k typical categories of the χ^{2} test and formulate the approximate simple hypothesis in terms of the discrepancy d(p, p_{0}) between the category probabilities p_{i} under F_{X} and the category probabilities p_{0,i} under a simple hypothesis F_{0}. A very tractable discrepancy measure of this type is ordinary distance, for which the approximate hypothesis takes the form d(p, p_{0}) = [Σ_{i}(p_{i} − p_{0,i})^{2}]^{1/2} ≤ δ.
Denoting O_{i}/n by o_{i}, the suggested test reduces, essentially, to the one-sided test of the hypothesis d(p, p_{0}) = δ based on the approximately normal statistic [d(o, p_{0}) − δ]/σ̂, where σ̂ is the standard deviation of d(o, p_{0}), estimated from the sample proportions o_{i}. For example, when F_{0} specifies k categories with p_{0,i} = 1/k, one treats the resulting statistic S as unit normal (under the null hypothesis) and uses an upper-tail test. Thus a value of S of 1.645 leads to a sample significance level of .05. This approach lends itself easily to the computation of power and is extended as well by Hodges and Lehmann to the testing of approximate composite hypotheses.
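A test of this general kind can be sketched as follows. The standard deviation here is estimated by a first-order (delta-method) expansion of the Euclidean distance under multinomial sampling; that choice is an assumption made for illustration and is not claimed to reproduce Hodges and Lehmann's exact statistic. The function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approximate_fit_test(counts, p0, delta):
    """One-sided test of d(p, p0) <= delta against d(p, p0) > delta.

    counts : observed category counts O_i
    p0     : hypothesized category probabilities p_{0,i}
    delta  : largest discrepancy still regarded as an adequate fit

    Returns (d, s, p_value), with s referred to the unit normal
    distribution and large s counting against the approximate hypothesis.
    """
    n = sum(counts)
    o = [c / n for c in counts]
    d = sqrt(sum((oi - pi) ** 2 for oi, pi in zip(o, p0)))
    if d == 0:
        raise ValueError("distance is zero; the expansion is undefined")
    # Gradient of the distance with respect to the observed proportions.
    g = [(oi - pi) / d for oi, pi in zip(o, p0)]
    # Delta-method variance under multinomial sampling, estimated at o.
    var = (sum(gi * gi * oi for gi, oi in zip(g, o))
           - sum(gi * oi for gi, oi in zip(g, o)) ** 2) / n
    if var <= 0:
        raise ValueError("estimated variance is not positive")
    s = (d - delta) / sqrt(var)
    return d, s, 1.0 - NormalDist().cdf(s)


if __name__ == "__main__":
    # Hypothetical counts in k = 4 equiprobable categories.
    print(approximate_fit_test([30, 20, 25, 25], [0.25] * 4, delta=0.05))
```

As in the text, a value of s above 1.645 would be significant at the .05 level in the upper tail.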
Extension of other tests for simple hypotheses to the testing of approximate hypotheses has been considered by J. Rosenblatt (1962) and by Kac, Kiefer, and Wolfowitz (1955).
Further topics
That the sample is random may itself be in doubt, and tests have been designed to have power against specific sorts of departure from randomness. For example, tests of the hypothesis of randomness against the alternative hypothesis that the data are subject to a Markov structure are given by Billingsley (1961) and Goodman (1959); the latter work also covers testing that the data have Markov structure of a given order against the alternative that the data have Markov structure of higher order, and the testing of hypothesized values of transition probabilities when a Markov structure of given order is assumed [see Markov chains].
Many of the tests described in this article can be extended to several-sample procedures for testing the hypothesis that several populations are in fact distributed identically; thus, as first suggested in Smirnov (1939a), if G_{m}(x) denotes the proportion of values less than or equal to x, in an independent random sample (Y_{1}, …, Y_{m}) from a second population, d_{K}(F_{n}, G_{m}) provides a natural test of the hypothesis that the two continuous population distribution functions F_{X} and G_{Y} coincide. Many of these extensions are functions only of the relative ranks of the two samples and, as such, are nonparametric, that is, their null probability distributions do not depend on the common functional form of F_{X} and G_{Y}. [Several-sample nonparametric procedures are discussed in Nonparametric statistics.]
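The two-sample statistic d_{K}(F_{n}, G_{m}) is the largest absolute difference between the two empirical distribution functions, and it can be computed directly. The sketch below also includes the usual large-sample approximation to its tail probability; the function names are hypothetical, and the approximation is only a rough guide for small samples.

```python
from bisect import bisect_right
from math import exp, sqrt


def ks_two_sample_statistic(x, y):
    """Maximum absolute difference between the empirical distribution
    functions F_n (from x) and G_m (from y)."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # The difference can change only at observed values.
    for v in sorted(set(xs) | set(ys)):
        fn = bisect_right(xs, v) / n  # proportion of x values <= v
        gm = bisect_right(ys, v) / m  # proportion of y values <= v
        d = max(d, abs(fn - gm))
    return d


def smirnov_asymptotic_pvalue(d, n, m, terms=100):
    """Large-sample approximation to P(D >= d) under the hypothesis that
    the two continuous population distributions coincide."""
    t = d * sqrt(n * m / (n + m))
    if t == 0:
        return 1.0
    p = 2 * sum((-1) ** (k - 1) * exp(-2 * k * k * t * t)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, p))


if __name__ == "__main__":
    a = [1.2, 2.3, 2.9, 3.1, 4.8, 5.0]  # hypothetical first sample
    b = [2.0, 3.5, 4.1, 5.6, 6.2, 7.7]  # hypothetical second sample
    d = ks_two_sample_statistic(a, b)
    print(d, smirnov_asymptotic_pvalue(d, len(a), len(b)))
```

The statistic depends on the data only through the relative ordering of the pooled observations, which is the rank property that makes the test nonparametric.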
Another topic is that of tests of goodness of fit as preliminary tests of significance, in a sense discussed, for example, by Bancroft (1964). That tests of goodness of fit are typically applied in this sense is recognized by Chapman (1958), and the probabilistic properties of certain “nested” sequences of tests beginning with a test of goodness of fit have been considered by Hogg (1965). The Bayes and information theory approaches to χ^{2} tests of goodness of fit are also important (see Lindley 1965; Kullback 1959).
H. T. David
[Directly related are the entries Hypothesis Testing; Significance, Tests of. Other relevant material may be found in Counted Data; Estimation; Nonparametric Statistics.]
BIBLIOGRAPHY
Bancroft, T. A. 1964 Analysis and Inference for Incompletely Specified Models Involving the Use of Preliminary Test(s) of Significance. Biometrics 20:427–442.
Berkson, Joseph 1938 Some Difficulties of Interpretation Encountered in the Chi-square Test. Journal of the American Statistical Association 33:526–536.
Billingsley, Patrick 1961 Statistical Methods in Markov Chains. Annals of Mathematical Statistics 32:12–40.
Birnbaum, Z. W. 1953 Distribution-free Tests of Fit for Continuous Distribution Functions. Annals of Mathematical Statistics 24:1–8.
Chapman, Douglas G. 1958 A Comparative Study of Several One-sided Goodness-of-fit Tests. Annals of Mathematical Statistics 29:655–674.
Chernoff, Herman; and Lehmann, E. L. 1954 The Use of Maximum Likelihood Estimates in Tests for Goodness of Fit. Annals of Mathematical Statistics 25:579–586.
Cochran, William G. 1952 The χ^{2} Test of Goodness of Fit. Annals of Mathematical Statistics 23:315–345.
Cochran, William G. 1954 Some Methods for Strengthening the Common χ^{2} Tests. Biometrics 10:417–451.
Darling, D. A. 1955 The Cramér-Smirnov Test in the Parametric Case. Annals of Mathematical Statistics 26:1–20.
Darling, D. A. 1957 The Kolmogorov-Smirnov, Cramér-von Mises Tests. Annals of Mathematical Statistics 28:823–838.
Durbin, J. 1961 Some Methods of Constructing Exact Tests. Biometrika 48:41–55.
Fisher, R. A. 1924 The Conditions Under Which χ^{2} Measures the Discrepancy Between Observation and Hypothesis. Journal of the Royal Statistical Society 87:442–450.
Fisher, R. A. 1950 The Significance of Deviations From Expectation in a Poisson Series. Biometrics 6:17–24.
Fix, Evelyn; Hodges, J. L. Jr.; and Lehmann, E. L. 1954 The Restricted Chi-square Test. Pages 92–107 in Ulf Grenander (editor), Probability and Statistics. New York: Wiley.
Geary, R. C. 1947 Testing for Normality. Biometrika 34:209–242.
Goodman, Leo A. 1954 Kolmogorov-Smirnov Tests for Psychological Research. Psychological Bulletin 51:160–168.
Goodman, Leo A. 1959 On Some Statistical Tests for mth Order Markov Chains. Annals of Mathematical Statistics 30:154–164.
Greenwood, Joseph A.; and Hartley, H. O. 1962 Guide to Tables in Mathematical Statistics. Princeton Univ. Press. → A sequel to the guides to mathematical tables produced by and for the Committee on Mathematical Tables and Aids to Computation of the National Academy of Sciences-National Research Council of the United States.
Hodges, J. L. Jr.; and Lehmann, E. L. 1954 Testing the Approximate Validity of Statistical Hypotheses. Journal of the Royal Statistical Society Series B 16: 261–268.
Hogg, Robert V. 1965 On Models and Hypotheses With Restricted Alternatives. Journal of the American Statistical Association 60:1153–1162.
Kac, M.; Kiefer, J.; and Wolfowitz, J. 1955 On Tests of Normality and Other Tests of Goodness of Fit Based on Distance Methods. Annals of Mathematical Statistics 26:189–211.
Keats, J. A.; and Lord, Frederic M. 1962 A Theoretical Distribution for Mental Test Scores. Psychometrika 27:59–72.
Kempthorne, O. 1966 The Classical Problem of Inference: Goodness of Fit. Unpublished manuscript. → Paper presented at the Berkeley Symposium on Mathematical Statistics and Probability, Fifth, Proceedings to be published.
Kolmogorov, A. N. 1933 Sulla determinazione empirica di una legge di distribuzione. Istituto Italiano degli Attuari, Giornale 4:83–99.
Kuiper, Nicolaas H. 1960 Tests Concerning Random Points on a Circle. Akademie van Wetenschappen, Amsterdam, Proceedings Series A 63:38–47.
Kullback, S. 1959 Information Theory and Statistics. New York: Wiley.
Lindley, D. V. 1965 Introduction to Probability and Statistics From a Bayesian Viewpoint. Volume 2: Inference. Cambridge Univ. Press.
Mann, H. B.; and Wald, A. 1942 On the Choice of the Number of Class Intervals in the Application of the Chi Square Test. Annals of Mathematical Statistics 13:306–317.
Marshall, A. W. 1958 The Small Sample Distribution of nω^{2}_{n}. Annals of Mathematical Statistics 29:307–309.
Massey, Frank J. Jr. 1951 The Kolmogorov-Smirnov Test for Goodness of Fit. Journal of the American Statistical Association 46:68–78.
Neyman, Jerzy 1937 “Smooth Test” for Goodness of Fit. Skandinavisk aktuarietidskrift 20:149–199.
Neyman, Jerzy 1949 Contribution to the Theory of the χ^{2} Test. Pages 239–273 in Berkeley Symposium on Mathematical Statistics and Probability, First, Proceedings. Berkeley: Univ. of California Press.
Patankar, V. N. 1954 The Goodness of Fit of Frequency Distributions Obtained From Stochastic Processes. Biometrika 41:450–462.
Pearson, E. S. 1938 The Probability Integral Transformation for Testing Goodness of Fit and Combining Independent Tests of Significance. Biometrika 30:134–148.
Pearson, E. S. 1963 Comparison of Tests for Randomness of Points on a Line. Biometrika 50:315–325.
Pearson, Karl 1900 On the Criterion That a Given System of Deviations From the Probable in the Case of a Correlated System of Variables Is Such That It Can Be Reasonably Supposed to Have Arisen From Random Sampling. Philosophical Magazine 5th Series 50:157–175.
Pyke, Ronald 1965 Spacings. Journal of the Royal Statistical Society Series B 27:395–449.
Rosenblatt, Judah 1962 Testing Approximate Hypotheses in the Composite Case. Annals of Mathematical Statistics 33:1356–1364.
Rosenblatt, Murray 1952 Remarks on a Multivariate Transformation. Annals of Mathematical Statistics 23: 470–472.
Sarkadi, Károly 1960 On Testing for Normality. Magyar Tudományos Akadémia, Matematikai Kutató Intézet, Közlemények Series A 5:269–274.
Shapiro, S. S.; and Wilk, M. B. 1965 An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52:591–611.
Slakter, Malcolm J. 1965 A Comparison of the Pearson Chi-square and Kolmogorov Goodness-of-fit Tests With Respect to Validity. Journal of the American Statistical Association 60:854–858.
Smirnov, N. V. 1939a On the Estimation of the Discrepancy Between Empirical Curves of Distribution for Two Independent Samples. Moscow, Universitet, Bulletin mathématique, Série internationale 2, no. 2:3–26.
Smirnov, N. V. 1939b Ob ukloneniiakh empiricheskoi krivoi raspredeleniia (On the Deviations of the Empirical Distribution Curve). Matematicheskii sbornik New Series 6, no. 1:1–26. → Includes a French résumé.
Sukhatme, P. V. 1938 On the Distribution of χ^{2} in Samples of the Poisson Series. Journal of the Royal Statistical Society 5 (Supplement):75–79.
Suppes, Patrick et al. 1964 Empirical Comparison of Models for a Continuum of Responses With Noncontingent Bimodal Reinforcement. Pages 358–379 in R. C. Atkinson (editor), Studies in Mathematical Psychology. Stanford Univ. Press.
Watson, G. S. 1957 The χ^{2} Goodness-of-fit Test for Normal Distributions. Biometrika 336–348.
Watson, G. S. 1961 Goodness-of-fit Tests on a Circle. Biometrika 48:109–114.
Weiss, Lionel 1958 Limiting Distributions of Homogeneous Functions of Sample Spacings. Annals of Mathematical Statistics 29:310–312.
Williams, C. Arthur Jr. 1950 On the Choice of the Number and Width of Classes for the Chi-square Test of Goodness of Fit. Journal of the American Statistical Association 45:77–86.
"Goodness of Fit." International Encyclopedia of the Social Sciences. Encyclopedia.com. 15 Oct. 2017 <http://www.encyclopedia.com>.
goodness of fit
goodness of fit A statistical term used to indicate the correspondence between an observed distribution and a model or hypothetical mathematical distribution. In many statistical tests of significance the hypothetical or expected distribution is a model based upon there being no relationship between the dependent and independent variables. The tests then measure whether any observed deviation from the expected model may reasonably be accounted for by chance sampling variation, or whether it is sufficiently large to indicate a real difference, generalizable to the population from which the sample was taken. See also SIGNIFICANCE TESTS.
"goodness of fit." A Dictionary of Sociology. Encyclopedia.com. 15 Oct. 2017 <http://www.encyclopedia.com>.