Precautions in the analysis of counted data
Counted data subject to sampling variability arise in demographic sampling, in survey research, in learning experiments, and in almost every other branch of social science. The counted data may relate to a relatively simple investigation, for example, estimating sex ratio at birth in some specified human population, or to a complex problem, investigating the interaction among qualitative responses of animals to stimuli in a physiological experiment. Further, a counted data approach is sometimes useful even when the actual data are inherently not counted; for example, a classical approach to so-called goodness of fit uses the counts of numbers of continuous observations in cells or intervals. Again, some nonparametric tests are based on a related device. [See Goodness of fit; Nonparametric statistics.]
Investigations leading to counted data are often described by giving percentages of individuals falling in the various categories. It is essential that the total numbers of individuals also be reported; otherwise reliability and sampling error cannot be estimated.
The structure of this article is as follows. First, simple procedures relating to one or two sample percentages are considered. These procedures exemplify the basic chi-square approach; they may be regarded as methods for treating particular contingency tables in a way falling in the domain of the basic chi-square theorem. Second, special aspects of contingency tables are considered in some detail: power, single degrees of freedom, ordered alternatives, dependent samples, measures of association, multidimensional contingency tables. Under the last topic is considered the important topic of three-factor interactions. Third, some alternatives to chi-square are briefly mentioned.
Consider an experiment in which animals of a group are independently subjected to a stimulus. Assume two, and only two, responses are possible (A and Ā). Of 20 animals exposed independently to the stimulus, responses of type A are exhibited by 16. Such a count, or the corresponding percentage, 80 per cent, may be the basis of an estimate of the probability of an A response in all animals of this kind; or it may be the basis of a test of the hypothesis that responses A and Ā are equally likely. The evaluation of either this estimate or the test is dependent upon the assumptions underlying the data collection.
One of the basic models associated with such experiments is the binomial. The binomial model is associated with a series of independent trials in each of which an event A may or may not occur and for which it is assumed that the probability of occurrence of A, denoted p, is constant from trial to trial. If the number of occurrences of A among n such trials is v, then v/n is the maximum likelihood and also the minimum variance unbiased estimator of p. [For further discussion of the binomial distribution, see Distributions, statistical, article on special discrete distributions.]
Additional insight as to the reliability of the estimator is obtained from a confidence interval for p. Tables and graphs have been prepared to provide such confidence intervals for appropriate levels of confidence. The best known of these is the graph by Clopper and Pearson (1934). This graph or the tables that have been computed for the same purpose (for example, Owen 1962) determine so-called central confidence limits; that is, the intervals that are “false,” in the sense that they do not include the true parameter value, are equally divided between those that are too low and those that are too high. [See Estimation, article on confidence intervals and regions.]
Confidence intervals may also be used to test a null hypothesis that p has the value p0. If p0 is not included in the 1 — α level confidence interval, then the null hypothesis p = p0 is rejected at level α.
Equivalently, a direct test may be made of this hypothesis by utilizing the extensive tables of the binomial distribution. Two of the best known are those of Harvard University (1955) and the U.S. National Bureau of Standards (1950).
More usually, both confidence intervals and test procedures are based upon the normal approximation to the binomial distribution, that is, on the fact that (v − np) · [np(1 − p)]^(−1/2) has a limiting standard normal distribution. [See Distributions, statistical, article on approximations to distributions.]
Denote by Z1−α the 100(1 − α) percentile of the standard normal distribution. The null hypothesis p = p0 tested against the alternative p ≠ p0 is rejected at level α on the basis of an observation of v successes in n trials if
$$|v - np_0| - \tfrac{1}{2} \ge Z_{1-\alpha/2}\,\sqrt{np_0(1-p_0)}.$$
In case the alternatives of interest are limited to one side of p0, say p > p0, the test procedure at level α is to reject H0 if
$$v - np_0 - \tfrac{1}{2} \ge Z_{1-\alpha}\,\sqrt{np_0(1-p_0)}.$$
The subtracted 1/2 is the so-called continuity correction, useful when a discrete distribution is being approximated by a continuous one.
Thus, in the experiment described above, the experimenter might be testing whether the choice is made at random between A and Ā against the possibility that A is the preferred response. This is a test of the hypothesis p = 1/2 against the alternative p > 1/2. Corresponding to the conventional 5 per cent significance level, Z0.95 = 1.64; then if v = 16 (16 A responses are observed), the hypothesis is rejected at the 5 per cent level since
$$\frac{16 - 10 - \tfrac{1}{2}}{\sqrt{20 \cdot \tfrac{1}{2} \cdot \tfrac{1}{2}}} = \frac{5.5}{2.236} \approx 2.46 > 1.64.$$
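The calculation for the 20-animal example can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure; the function names `one_sided_z` and `exact_upper_tail` are invented here, and the exact binomial tail is included only for comparison with the normal approximation.

```python
from math import comb, sqrt
from statistics import NormalDist

def one_sided_z(v, n, p0):
    """Continuity-corrected normal deviate for H0: p = p0 against p > p0."""
    return (v - n * p0 - 0.5) / sqrt(n * p0 * (1 - p0))

def exact_upper_tail(v, n, p0):
    """Exact binomial probability of v or more occurrences of A under H0."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(v, n + 1))

z = one_sided_z(16, 20, 0.5)            # deviate for 16 A responses in 20 trials
z_crit = NormalDist().inv_cdf(0.95)     # one-sided 5 per cent critical value
print(round(z, 2), round(z_crit, 3), z > z_crit)
print(round(exact_upper_tail(16, 20, 0.5), 4))
```

Both the approximate and the exact calculation lead to rejection at the 5 per cent level in this example.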
The normal approximation to the binomial is thought to be quite satisfactory if np0(1 − p0) is 5 or more. However, for many practical situations the normal approximation provides an adequate test (in the sense that the type I error is sufficiently close to the specified level) for values of np0(1 − p0) well below the bound of 5 mentioned above.
The simplest confidence limits for p based on the normal approximation are
$$\frac{v}{n} \pm Z_{1-\alpha/2}\,\sqrt{\frac{(v/n)(1 - v/n)}{n}}.$$
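As a rough sketch, the simplest normal-approximation limits can be computed as below, assuming they take the usual form v/n ± Z1−α/2 · √[(v/n)(1 − v/n)/n]. The function name `normal_limits` is an assumption made for illustration.

```python
from math import sqrt
from statistics import NormalDist

def normal_limits(v, n, alpha=0.05):
    """Two-sided large-sample confidence limits for p, centred on v/n."""
    p_hat = v / n
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# 16 A responses among 20 animals, 95 per cent confidence.
lower, upper = normal_limits(16, 20)
print(round(lower, 4), round(upper, 4))
```

With n as small as 20 these limits are only a crude guide; the Clopper and Pearson limits mentioned earlier do not rely on the normal approximation.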
The binomial model requires independence of the successive trials. Much sampling, especially of human populations, is, however, done without replacement so that successive observations are in fact dependent and the correct model is not the binomial but the hypergeometric. In sampling theory this is taken into account by the finite population correction, which modifies the variance. Thus, where the binomial variance is np(1 − p), the hypergeometric variance for a sample of size n from a population of size N is approximately np(1 − p)(1 − n/N); the exact factor is (N − n)/(N − 1). If n is a small fraction of N, the finite population correction is negligible; thus, the binomial model is often used as an acceptable approximation.
For one or two proportions
The statistic (|v − np0| − 1/2) · [np0(1 − p0)]^(−1/2), which for sufficiently large n may be used to test the hypothesis p = p0 against p ≠ p0, yields, when squared, an equivalent test procedure based on the chi-square distribution with one degree of freedom; this follows from the fact that the square of a standard normal variable has a chi-square distribution with one degree of freedom. [See Distributions, statistical, article on special continuous distributions.]
Following recent practice, “X2” is written for the test statistic, and the symbol “χ2” is reserved for the distributional form.
The algebraic identity
$$\frac{(v - np_0)^2}{np_0(1-p_0)} = \frac{(v - np_0)^2}{np_0} + \frac{[(n - v) - n(1-p_0)]^2}{n(1-p_0)}$$
shows that the statistic X2 may be written (neglecting the continuity correction term, 1/2) as (observed − expected)²/expected, summed over the two categories A and Ā. Such a measure of deviation of observations from their expected values under a null hypothesis is of wide application. For example, consider the counts of individuals with characteristic A that occur in two independent random samples and suppose that the null hypothesis under test is that the probability of occurrence of A is the same in both populations; call the common (but unspecified) probability p. The observations may be tabulated as in Table 1.
|Table 1 — Observations in two samples|
|NUMBERS OBSERVED|
|Sample||A||Ā||Total|
|Sample 1||v11||v12||n1|
|Sample 2||v21||v22||n2|
|Total||v.1||v.2||n|
If p were known, then under the null hypothesis the expectation of the number of A’s in sample 1 would be n1p and in sample 2 the expectation would be n2p, where p is the probability of occurrence of A. Since p is unknown, however, it must be estimated from the data [see Estimation, article on point estimation].
If the hypothesis were true, the two samples could be pooled and the usual (minimum variance unbiased) estimator of p would be v.1/n, where v.1 and v.2 denote the total numbers of A’s and Ā’s in the two samples combined and n = n1 + n2. With this estimator the estimated expected number of A’s in sample 1 is n1(v.1/n) and in sample 2 is n2(v.1/n). Similarly the estimated expected numbers of Ā’s are n1(v.2/n) and n2(v.2/n) in the two samples. These estimated expectations are tabulated in Table 2.
An expression similar to X2 can be calculated for each sample, where now, however, p0 is replaced by the estimator v.1/n. These expressions are
$$\frac{(v_{11} - n_1 v_{\cdot 1}/n)^2}{n_1 (v_{\cdot 1}/n)(v_{\cdot 2}/n)} \quad\text{and}\quad \frac{(v_{21} - n_2 v_{\cdot 1}/n)^2}{n_2 (v_{\cdot 1}/n)(v_{\cdot 2}/n)},$$
where v11 and v21 are the numbers of A’s observed in samples 1 and 2.
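Since the estimator of p will tend to be close to the true value for large sample sizes, it is intuitive to conjecture that each of these is the square of a normal variable (at least approximately for large samples). The sum does have a limiting chi-square distribution but with one degree of freedom, not two. The “loss” of the degree of freedom comes from estimating the unknown parameter, p. The test statistic, which written more formally (with the continuity correction) is
$$X^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{\left(|v_{ij} - n_i v_{\cdot j}/n| - \tfrac{1}{2}\right)^2}{n_i v_{\cdot j}/n},$$
may be simplified to
$$X^2 = \frac{n\left(|v_{11}v_{22} - v_{12}v_{21}| - n/2\right)^2}{n_1 n_2 v_{\cdot 1} v_{\cdot 2}}.$$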
If |v11v22 – v12v21| is less than or equal to n/2, the correction term is inappropriate and possibly misleading. In practice this problem rarely arises.
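The agreement between the cell-by-cell form of the two-sample statistic and its short-cut form can be checked numerically. The sketch below assumes the continuity-corrected forms, with 1/2 subtracted from each absolute deviation and n/2 from |v11v22 − v12v21|; the counts and the function names are hypothetical.

```python
def x2_cellwise(v11, v12, v21, v22):
    """Sum over the four cells of (|observed - expected| - 1/2)^2 / expected."""
    n1, n2 = v11 + v12, v21 + v22          # sample sizes (row totals)
    c1, c2 = v11 + v21, v12 + v22          # totals of A and of A-bar
    n = n1 + n2
    total = 0.0
    for obs, ni, cj in ((v11, n1, c1), (v12, n1, c2), (v21, n2, c1), (v22, n2, c2)):
        expected = ni * cj / n
        total += (abs(obs - expected) - 0.5) ** 2 / expected
    return total

def x2_shortcut(v11, v12, v21, v22):
    """n (|v11 v22 - v12 v21| - n/2)^2 / (n1 n2 v.1 v.2)."""
    n1, n2 = v11 + v12, v21 + v22
    c1, c2 = v11 + v21, v12 + v22
    n = n1 + n2
    return n * (abs(v11 * v22 - v12 * v21) - n / 2) ** 2 / (n1 * n2 * c1 * c2)

# Hypothetical data: 16 of 20 A responses in sample 1, 9 of 20 in sample 2.
print(x2_cellwise(16, 4, 9, 11), x2_shortcut(16, 4, 9, 11))
```

For these counts |v11v22 − v12v21| = 140 exceeds n/2 = 20, so the correction term is appropriate.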
Basic chi-square theorem
The above chi-square test statistics for one or two proportions may, as was seen, be written as sums of terms whose numerators are squared deviations of the observed counts from those “expected” under the null hypothesis. (Expected is placed in quotation marks to emphasize that the “expectations” are often estimated expectations obtained via estimation of unknown parameters.) The denominators may be regarded as weights to standardize the ratios. This pattern may be widely extended.
For example, consider a questionnaire with respondents placing themselves in five categories: strongly favor, mildly favor, neutral, mildly oppose, strongly oppose. The n independent responses might furnish data for a test of the hypothesis that each of the responses is equally likely. If the probabilities of the five responses are denoted p1 through p5, this null hypothesis specifies p1 = p2 = p3 = p4 = p5 = 1/5, and under the null hypothesis the expected number of responses in each category is n/5. The appropriate weights in the denominator of the chi-square test statistic are suggested by the expanded form of X2 given above; each term (observed − expected)² is divided by its expected value. Thus in this example,
$$X^2 = \sum_{j=1}^{5} \frac{(v_j - n/5)^2}{n/5},$$
where vj is the number of responses in category j.
That these weights lead to the usual kind of null distribution can be shown by considering the multinomial distribution, the extension of the binomial distribution to a series of independent trials with several outcomes rather than just two. If the null hypothesis is true, this X2 has approximately a chi-square distribution with four degrees of freedom.
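A small numerical sketch of the five-category test follows. The response counts are hypothetical, the function names are invented for illustration, and the tail probability uses the closed form of the chi-square distribution that holds only for an even number of degrees of freedom (here four).

```python
from math import exp

def equal_probability_x2(counts):
    """X2 for the null hypothesis that all categories are equally likely."""
    n = sum(counts)
    expected = n / len(counts)
    return sum((v - expected) ** 2 / expected for v in counts)

def chi2_upper_tail_even(x, df):
    """P(chi-square > x) for even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total

# Hypothetical counts: strongly favor ... strongly oppose, n = 100.
counts = [30, 25, 15, 12, 18]
x2 = equal_probability_x2(counts)
p_value = chi2_upper_tail_even(x2, df=len(counts) - 1)
print(round(x2, 2), round(p_value, 4))
```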
More generally, suppose that on each of n independent trials of an experiment exactly one of the events E1, ···, EJ occurs. Let pj, depending in a given way (under the null hypothesis under test) on unknown parameters θ1, ···, θm, be the probability that Ej occurs and suppose there are asymptotically efficient estimators of the θ’s, from which are obtained asymptotically efficient estimators of the pj, denoted p̂j; thus np̂j estimates the expected frequency of occurrence of Ej under the null hypothesis. Let the random variable vj be the number of times Ej actually occurs in the n trials. Then
$$X^2 = \sum_{j=1}^{J} \frac{(v_j - n\hat{p}_j)^2}{n\hat{p}_j}$$
has, under the null hypothesis for large n and under mathematical regularity conditions, approximately the chi-square distribution with J − m − 1 degrees of freedom. When the null hypothesis is false, X2 tends to be larger on the average than when it is true, so that a right-hand tail critical region is appropriate, that is, the null hypothesis is rejected for large values of X2.
Note that the above “chi-square” statistic is of the form
$$X^2 = \sum \frac{(\text{observed} - \text{``expected''})^2}{\text{``expected''}}.$$
(The quotation marks around expected indicate that this is actually an asymptotically efficient estimator of the expectation under the null hypothesis.)
The above development can readily be extended to I independent sequences of trials, with ni trials in the ith sequence, pij denoting the probability under the null hypothesis of event j for sequence i, and vij denoting the number of times Ej occurs in sequence i. As before,
$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(v_{ij} - n_i\hat{p}_{ij})^2}{n_i\hat{p}_{ij}}$$
is, for large ni, approximately chi-square with I(J − 1) − m degrees of freedom, under the null hypothesis, and with appropriate regularity conditions. Note that when I = 1, J − m − 1 degrees of freedom are obtained, as before.
The primary problem in such tests is the derivation of asymptotically efficient estimators. For example, such estimators may be maximum likelihood estimators or minimum chi-square estimators. The latter are the θ’s that minimize X2, the test statistic, subject to whatever functional restraints are imposed upon the pij’s. Neyman (1949) has given a method of determining modified minimum chi-square estimators, a method that reduces to solving only linear equations, as many as there are unknown parameters to estimate. A review of the methods of generating such minimum chi-square estimators for this model, and for a more general one, is given by Ferguson (1958).
It is easily seen that the comparison of two percentages is a special case of the general theorem. Here I = J = 2 and the null hypothesis can be put in the form p11 = p12 = θ; p21 = p22 = 1 − θ. Here p11 is the probability of A occurring on a trial in the first series, p12 is the probability of A occurring on a trial in the second series; p21, p22 are defined similarly with respect to Ā. The maximum likelihood estimator of θ is v.1/n and the degrees of freedom are seen to be one from insertion in the general formula.
Proofs of the basic chi-square theorem and statements of the mathematical regularity conditions may be found in Cramér (1946) or Neyman (1949).
Power of the chi-square test
The chi-square test is extensively used as an omnibus test without particular alternatives in view. Frequently such applications are almost useless in the sense that their sensitivity (that is, power) is very low. It is therefore important not only to make such tests but also to specify the alternatives of interest and to determine the power, that is, the probability that the null hypothesis is rejected when in fact such alternatives are true. A fairly complete theory of the power of chi-square tests has been given recently by Mitra (1958) and Diamond (1963).
Because chi-square tests are based upon a limiting distribution theorem, it is necessary to express the alternative in a special form, depending on the sample size, n, in order to obtain meaningful results. Consider first the case where I = 1 and the null hypothesis completely specifies the pj as numerical constants, say pj⁰. (In the questionnaire experiment above, since there are five responses, the null hypothesis that the responses are equally likely specifies pj⁰ = 1/5 for j = 1, ···, 5.) Write an alternative in the form
$$p_j = p_j^0 + \frac{c_j}{\sqrt{n}}, \qquad \sum_{j=1}^{J} c_j = 0.$$
If in fact the pj have this form, then the test statistic X2 has a limiting noncentral chi-square distribution with noncentrality parameter
$$\lambda = \sum_{j=1}^{J} \frac{c_j^2}{p_j^0}$$
and with J − 1 degrees of freedom [see Distributions, statistical, article on special continuous distributions].
The λ required to obtain a specified probability of rejecting the null hypothesis under a given alternative, for tests at significance levels 0.01 and 0.05, has been tabulated; such a table is given, for example, by Owen (1962, pp. 61–62). These tables are useful not only in calculating the power function but also in specifying sample size in advance. For the example where there are five responses and the null hypothesis is pj = 1/5 (j = 1, ···, 5), consider an alternative in which the first four probabilities share a common value and the fifth differs, chosen so that λ = n(.25); one such alternative is p1 = p2 = p3 = p4 = 0.15, p5 = 0.40, for which cj = −0.05√n, j = 1, ···, 4, and c5 = 0.20√n. To achieve a probability of 0.80 of rejecting the null hypothesis for this alternative, it is found from the tables that λ must be 11.94 (four degrees of freedom and 0.05 significance level). This requires a sample size of 11.94/.25 = 47.76 or, to the nearest whole number, 48.
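The tabulated value of λ and the resulting sample size can be checked without tables. The sketch below treats the noncentral chi-square distribution as a Poisson mixture of central chi-square distributions, a standard representation, and relies on the closed-form tail that is available only when the degrees of freedom are even, as they are here. All function names are illustrative assumptions.

```python
from math import exp

def chi2_upper_tail_even(x, df):
    """P(central chi-square > x) for even df."""
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return exp(-half) * total

def critical_value(df, alpha):
    """Bisection for the point whose upper-tail probability equals alpha."""
    lo, hi = 0.0, 200.0
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_upper_tail_even(mid, df) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def power(lam, df, alpha=0.05, terms=200):
    """P(noncentral chi-square(df, lam) > critical value) via a Poisson(lam/2) mixture."""
    crit = critical_value(df, alpha)
    weight = exp(-lam / 2)                 # Poisson probability of k = 0
    total = 0.0
    for k in range(terms):
        total += weight * chi2_upper_tail_even(crit, df + 2 * k)
        weight *= (lam / 2) / (k + 1)      # next Poisson probability
    return total

print(round(power(11.94, df=4), 3))        # power at the tabulated lambda
print(round(11.94 / 0.25))                 # implied sample size when lambda = 0.25 n
```

Setting λ = 0 in `power` returns the significance level itself, which gives a quick check on the routine.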
For the comparison of two samples, a similar power theory is available. Consider two sequences of ni trials (i = 1, 2), each trial resulting in one of the outcomes E1, E2, ···, EJ. Here pij (i = 1, 2; j = 1, ···, J) is the probability of outcome j on sequence i and the null hypothesis of homogeneity is p11 = p21, p12 = p22, ···, p1J = p2J. Now consider a sequence of alternatives which for some pj (j = 1, 2, ···, J) satisfy the equations pij = pj + cij/√n, where n = n1 + n2, and ∑j c1j = ∑j c2j = 0. Then, for the sequence of alternatives, X2 has, in the limit as n → ∞, a noncentral chi-square distribution with J − 1 degrees of freedom and noncentrality parameter λ, where
$$\lambda = \frac{n_1 n_2}{n^2}\sum_{j=1}^{J}\frac{(c_{1j} - c_{2j})^2}{p_j}.$$
In actual practice, when the statistician considers a specified alternative for finite n, pj is not uniquely defined; it is convenient to define pj as a weighted average of p1j and p2j, for example pj = (n1p1j + n2p2j)/n, but whether other choices of pj might improve the goodness of the asymptotic approximation to the actual power appears not to have been investigated. For the case J = 2, and with n1 = n2, a nomogram is available showing the sample size required to obtain a specified level of power for one-sided hypotheses, that is, for comparison of an experimental and a standard group (Columbia University 1947, chapter 7). In the general case the formulation of λ is more difficult.
In the example of comparing two percentages, the observations were conveniently set out in a 2 × 2 array. Similarly, in the more general comparative experiment (the power of which was just discussed), it would be convenient to set out the observations in a 2 × J array. These are special cases of contingency tables, which, in general, have r rows and c columns; counted data that may be so represented arise, for example, in many experiments and surveys.
Such arrays or contingency tables may arise in at least three different situations, which may be illustrated by specific examples:
(1) Double polytomy: A sample of n voters is taken from an electoral list and each voter is classified into one of r party affiliations P1, ···, Pr and into one of c educational levels E1, ···, Ec. Denote by pij the probability that a voter belongs to party i and educational level j, so that ∑i∑j pij = 1. The usual null hypothesis of interest is that the classifications are independent, that is, pij = pi. p.j,
where pi. = probability a voter is in party i (regardless of educational level) and p.j = probability a voter is in educational level j, again regardless of the other classification variable (that is, pi. = ∑j pij and p.j = ∑i pij). In this case both the vertical and horizontal marginal totals of the r × c sample array are random.
(2) Comparative trials: Consider instead of a single sample from the general electoral roll, r samples of sizes ni from the r different party rolls. The voters in each sample are classified as to educational level (levels E1, ···, Ec). Denote by pij the probability that a voter drawn from party i belongs to educational level j (so that ∑j pij = 1 for each i). The hypothesis of homogeneity specifies that p1j = p2j = ··· = prj for each j. In this case the row totals are fixed (n1, ···, nr), while the column totals are random. Into this category falls the two-sample experiment discussed earlier; in that case r = 2.
(3) Independence trials (fixed marginal totals): Consider a group of n manufactured articles, of which fixed proportions are in each of the quality categories C1, ···, Cc. The articles have been divided into r groups of fixed size n1, ···, nr for further processing or for shipment to customers. The question arises whether the partitioning into the r groups can reasonably be considered to have been done randomly, that is, independently of how the articles fall into the quality categories. Since the number of articles in each of the categories C1, ···, Cc as well as the n1, ···, nr are fixed, both marginal totals are fixed in this situation.
For these three cases let vij denote the number of individuals falling into row i, column j, and denote by vi. and v.j the row and column totals, whether fixed or random. While different probability models are associated with the three cases, the approximate or large sample chi-square test is identical. The test statistic is
$$X^2 = \sum_{i=1}^{r}\sum_{j=1}^{c} \frac{(v_{ij} - v_{i\cdot}v_{\cdot j}/n)^2}{v_{i\cdot}v_{\cdot j}/n},$$
which has, if the null hypothesis is true, an approximate chi-square distribution with (r − 1)(c − 1) degrees of freedom.
For the comparative trials case, this is an extension of the comparison of two percentages. The maximum likelihood estimator of the common value of p1j, p2j, ···, prj is v.j/n under the null hypothesis. The comparative trials model consists of r sequences of trials, each of which may result in one of c events; c − 1 parameters are estimated, because as soon as c − 1 of the probabilities are estimated the final one is determined. Hence the degrees of freedom are r(c − 1) − (c − 1) = (r − 1)(c − 1).
In the double polytomy case there are (r − 1) + (c − 1) independent parameters to be estimated under the null hypothesis: p1., p2., ···, pr−1., p.1, p.2, ···, p.c−1, since again the restrictions ∑i pi. = ∑j p.j = 1 provide the last two needed values. The maximum likelihood estimators of the pi. are the vi./n and of the p.j are the v.j/n, so that the estimated expected values are n(vi./n)(v.j/n) or vi.v.j/n. The degrees of freedom in this case are (rc − 1) − (r − 1) − (c − 1) or (r − 1)(c − 1), since there is only one sequence of trials with rc outcomes.
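The computation common to the three models can be sketched as follows: estimated expected counts vi.v.j/n, the sum of (observed − expected)²/expected, and (r − 1)(c − 1) degrees of freedom. The function name and the 2 × 3 table of counts are hypothetical.

```python
def contingency_x2(table):
    """Return (X2, degrees of freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n   # v_i. v_.j / n
            x2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df

# Hypothetical comparative trial: two party rolls, three educational levels.
table = [[20, 30, 50],
         [30, 30, 40]]
x2, df = contingency_x2(table)
print(round(x2, 4), df)
```

The same routine serves for double polytomy and independence trials, since only the interpretation of the margins differs.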
Like all chi-square tests, these are based upon asymptotic distribution theory and are satisfactory in practice for “large” sample sizes. A number of rules of thumb have been established in regard to the acceptable lower limit of sample size so that the actual type I error, or probability of rejecting the null hypothesis when true, does not depart too far from the prescribed significance level. For a careful discussion of this problem, and of procedures to adopt when the samples are too small, see Cochran (1952; 1954).
2 × 2 tables. The special case of contingency tables with r = c = 2 has been extensively studied, and the so-called Fisher exact test is available. Given v1., v2., v.1, v.2, under any of the null hypotheses, v11 has a specific hypergeometric distribution; hence probabilities of deviations as numerically large as, or larger than, the observed deviation can be calculated and a test can be made. The application of the test is now greatly facilitated by use of tables by Finney et al. (1963). For the comparative trials model and the double dichotomy model this exact test is a conditional test, given the marginal counts.
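A minimal sketch of the exact calculation follows, assuming that "deviations as numerically large as, or larger than, the observed deviation" are measured from the expected value of v11 given the margins. The counts are hypothetical and the function name is invented for illustration.

```python
from math import comb

def fisher_exact_two_sided(v11, v12, v21, v22):
    """Conditional probability, given all margins, of a deviation of v11 from its
    expectation at least as large in absolute value as the one observed."""
    r1, c1 = v11 + v12, v11 + v21
    n = v11 + v12 + v21 + v22
    expected = r1 * c1 / n
    observed_dev = abs(v11 - expected)
    lowest, highest = max(0, r1 + c1 - n), min(r1, c1)
    p = 0.0
    for x in range(lowest, highest + 1):
        if abs(x - expected) >= observed_dev - 1e-12:
            # hypergeometric probability that the (1, 1) cell equals x
            p += comb(c1, x) * comb(n - c1, r1 - x) / comb(n, r1)
    return p

# Hypothetical 2 x 2 table with small counts.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
print(round(p_value, 4))
```

Other conventions for the two-sided tail exist (for example, summing all tables no more probable than the one observed), and they can give different values for asymmetric margins.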
While the hypotheses associated with the three different models in r × c tables, in general, and 2 × 2 tables, in particular, can be tested by the same chi-square procedure, the power of the test varies according to the model. For the 2 × 2 case, approximations and tables have been given for each of the three models. The most recent of these are by Bennett and Hsu (1960) for comparative and independence trials and Harkness and Katz (1964) for the double dichotomy model. Earlier approximations are discussed and compared by these authors.
Single degrees of freedom
The statistic X2 used to test the several null hypotheses possible for r × c contingency tables can be partitioned into (r − 1)(c − 1) uncorrelated X2 terms, each of which has a limiting chi-square distribution with one degree of freedom when the null hypothesis is true.
Planned comparisons. Planned subcomparisons, however, can be treated most easily by forming new contingency tables and calculating the approximate X2 statistic. For example, in the comparison of three experimental learning methods with a standard method the observations might be recorded for each pupil as successful or unsuccessful and tabulated in a 4 × 2 table. These are four comparative trials; and X2, the statistic to test homogeneity, has, under the null hypothesis, an approximate chi-square distribution with three degrees of freedom.
In this situation, two subcomparisons might be indicated: the standard method versus the combined experimental groups, and the experimental groups among themselves. Tables 3a and 3b show the two new contingency tables. The X2 statistics calculated from these two subtables may be used to make the indicated secondary tests. The two X2 values (with one and two degrees of freedom respectively) will not sum to the X2 calculated for the whole 4 × 2 array. Short-cut formulas for a partition that is additive and references to other papers on this subject are given by Kimball (1954).
|Table 3a — Comparison between standard method and combined experimental methods|
|Standard (method 1)||v11||v12|
|Experimental methods combined||v21 + v31 + v41||v22 + v32 + v42|
|Table 3b — Comparison among experimental methods|
|Experimental (method 2)||v21||v22|
|Experimental (method 3)||v31||v32|
|Experimental (method 4)||v41||v42|
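The non-additivity of the two subtable statistics can be seen in a small numerical sketch. The counts of successful and unsuccessful pupils are hypothetical, and `homogeneity_x2` is an illustrative name for the ordinary uncorrected statistic.

```python
def homogeneity_x2(table):
    """Uncorrected X2 for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2 / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )

# Hypothetical (successful, unsuccessful) counts for methods 1 to 4.
full = [[20, 30], [30, 20], [25, 25], [35, 15]]
# Table 3a: standard method against the experimental methods combined.
table_3a = [full[0], [sum(row[j] for row in full[1:]) for j in range(2)]]
# Table 3b: the three experimental methods among themselves.
table_3b = full[1:]

x2_full = homogeneity_x2(full)
x2_3a = homogeneity_x2(table_3a)
x2_3b = homogeneity_x2(table_3b)
print(round(x2_full, 3), round(x2_3a, 3), round(x2_3b, 3), round(x2_3a + x2_3b, 3))
```

The discrepancy arises because each subtable estimates its expected counts from its own margins rather than from the margins of the full array.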
Unplanned comparisons. As in the analysis of variance of linear models, distinction should be made between such planned comparisons and unplanned comparisons. Goodman (1964a) has given a procedure to find confidence intervals for a family of “contrasts” among multinomial probabilities for the r × c contingency table in the comparative-trials model. A “contrast” is any linear function of the probabilities pij, with coefficients summing to zero, that is,
$$\theta = \sum_{i=1}^{r}\sum_{j=1}^{c} a_{ij}\,p_{ij}, \qquad \sum_{i=1}^{r} a_{ij} = 0 \quad (j = 1, \dots, c).$$
[See Linear hypotheses, article on multiple comparisons.]
Thus in the comparison of teaching methods experiment referred to above, where pi1 is the probability of a pupil being successful when taught by method i, the unplanned comparisons or contrasts might be p21 − p31, p21 − p41, p31 − p41. These represent pairwise comparisons of the three experimental methods.
Denote a contrast by θ, with coefficients aij; an estimator of pij is p̂ij = vij/vi. and an estimator of θ is
$$\hat{\theta} = \sum_{i=1}^{r}\sum_{j=1}^{c} a_{ij}\,\hat{p}_{ij}.$$
An estimator of the variance of θ̂ is
$$S^2(\hat{\theta}) = \sum_{i=1}^{r}\frac{1}{v_{i\cdot}}\left[\sum_{j=1}^{c} a_{ij}^2\,\hat{p}_{ij} - \Big(\sum_{j=1}^{c} a_{ij}\,\hat{p}_{ij}\Big)^2\right].$$
The large-sample joint confidence intervals for θ with confidence coefficient 1 − α have the form θ̂ − S(θ̂)L, θ̂ + S(θ̂)L, where L is the square root of the 100(1 − α)th percentile of the chi-square distribution with (r − 1)(c − 1) degrees of freedom. An experiment in which one or more of the totality of all such possible intervals fail to include the true θ may be called a violation. The probability of such a violation is α.
If instead of all contrasts, only a few, say G, are of interest, then L in the last formula may be replaced by Z1−α/(2G) (the 100[1 − α/(2G)] percentile of the standard normal distribution), which often will be smaller than L and hence yield shorter confidence intervals while the probability of a violation is still less than or at most equal to α.
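The interval for one of a small number G of contrasts can be sketched as below. The sketch assumes the usual multinomial variance estimator for a linear function of the row proportions, summed over the independent rows, and uses the normal percentile Z1−α/(2G) as multiplier. The counts, the coefficient layout, and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def contrast_interval(counts, coeffs, n_contrasts, alpha=0.05):
    """Interval for theta = sum_ij a_ij p_ij when only n_contrasts contrasts are of interest.

    counts[i][j] are the observed v_ij; coeffs[i][j] are the a_ij.
    Returns (lower limit, point estimate, upper limit).
    """
    estimate, variance = 0.0, 0.0
    for row, a_row in zip(counts, coeffs):
        n_i = sum(row)
        p_hat = [v / n_i for v in row]
        linear = sum(a * p for a, p in zip(a_row, p_hat))
        square = sum(a * a * p for a, p in zip(a_row, p_hat))
        estimate += linear
        variance += (square - linear * linear) / n_i   # variance contribution of row i
    z = NormalDist().inv_cdf(1 - alpha / (2 * n_contrasts))
    half_width = z * sqrt(variance)
    return estimate - half_width, estimate, estimate + half_width

# Hypothetical (successful, unsuccessful) counts for methods 1 to 4.
counts = [[20, 30], [30, 20], [25, 25], [35, 15]]
# Contrast p21 - p31: +1 on method 2 "successful", -1 on method 3 "successful".
coeffs = [[0, 0], [1, 0], [-1, 0], [0, 0]]
low, est, high = contrast_interval(counts, coeffs, n_contrasts=3)
print(round(low, 3), round(est, 3), round(high, 3))
```

Here G = 3 because the three pairwise comparisons among the experimental methods are taken as the contrasts of interest.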
Comparative trials; ordered alternatives
In the comparative trials model, with r × 2 contingency tables, frequently the only alternative of interest is an ordered set of pi1’s. For example in a 2 × 2 comparative trial involving a control and a test group, the question may be to decide whether the groups are the same or whether the test group yields “better” results than the control group. In the 2 × 2 case, this situation is handled simply by working with the signed square root of X2, which has a standard normal distribution if the null hypothesis is true. A one-sided alternative is then treated in the same manner as a test for a percentage referred to earlier.
For the more general r × 2 table, the most complete treatment is that of Bartholomew (1959); his test, however, requires special tables. If the experimenter believes that the pi1 have a functional relationship to a known associated variable, xi, then a specific test can be derived from the basic theorem. Such a test would be a particular example of a planned comparison. Many authors have given short-cut formulas and worked out examples of this type of problem (cf. Cochran 1954; Armitage 1955).
Comparative trials; dependent samples
A sample is taken of n voters who have voted in the last two national elections for one of the major parties. Denote the parties by L and C, and suppose that in the sample 45 per cent voted L in the first election and 55 per cent voted L in the second. Does this indicate a significant change in voter behavior in the subpopulation of which this is a sample? To make such a comparison in matched or dependent samples, it is necessary to obtain information on the actual changes in party preference [see Panel studies]. These can be read from a 2 × 2 table such as Table 4.
|Table 4 — Voter preference in two elections|
|First election||Second election: L||Second election: C|
|L||v11||v12|
|C||v21||v22|
Such a 2 × 2 table with random marginals appears to fall into the double-dichotomy model, but the hypothesis of independence is not of interest here. The changes are indicated by the off-diagonal elements v12, v21, and the hypothesis of no net change is equivalent to the hypothesis that, given v12 + v21, v12 is binomially distributed with probability 1/2. Thus, the test of comparison of two percentages in identical or matched samples reduces to the test for a percentage. If the normal approximation is adequate, the square of the normal deviate, with the continuity correction, is
$$X^2 = \frac{(|v_{12} - v_{21}| - 1)^2}{v_{12} + v_{21}},$$
which has a limiting chi-square distribution with one degree of freedom under the null hypothesis that the probabilities are the same in the two matched groups. Cochran (1950) has extended this test to the r × c case. The test described above does not at all depend on v11 and v22, but of course these quantities would enter into procedures pointed toward issues other than testing the null hypothesis of no net change.
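The reduction to a test for a single percentage can be checked numerically. The sketch assumes the statistic takes the form (|v12 − v21| − 1)²/(v12 + v21), which is what squaring the continuity-corrected binomial deviate with p = 1/2 gives; the counts of changers and the function names are hypothetical.

```python
from math import sqrt

def matched_pairs_x2(v12, v21):
    """X2 for no net change, using only the off-diagonal counts."""
    return (abs(v12 - v21) - 1) ** 2 / (v12 + v21)

def squared_binomial_deviate(v12, v21):
    """Square of the continuity-corrected normal deviate for v12 ~ binomial(v12 + v21, 1/2)."""
    m = v12 + v21
    z = (abs(v12 - m / 2) - 0.5) / sqrt(m / 4)
    return z * z

# Hypothetical changers: 15 voters moved one way, 5 moved the other way.
print(matched_pairs_x2(15, 5), squared_binomial_deviate(15, 5))
```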
Chi-square tests of goodness of fit
Chi-square tests have been used extensively to test whether sample observations might have arisen from a population with a specified form, such as binomial, Poisson, or normal. Such chi-square tests are again special cases of the general theory outlined above, although there are many other types of tests for goodness of fit [see Goodness of fit].
There are some special problems in connection with some nonstandard chi-square tests of goodness of fit for the binomial and Poisson distributions. The standard chi-square test of goodness of fit for these two discrete distributions requires (in most cases) estimation of the mean. The sample mean is an efficient estimator of the population mean; it appears to make little difference whether the sample mean is computed from the raw or grouped data.
There is evidence that two simpler tests are more powerful, at least for some alternatives, for testing whether a set of counts does come from one of the distributions. These test statistics are the so-called indices of dispersion, studied by Lexis and Bortkiewicz, which in fact compare two estimators of the variance: the usual sample estimator and the estimator derivable from the fact that for these distributions the variance is a function of the mean. Alternatively they may be viewed as chi-square tests, conditional on the total count and placed in the framework of the basic theorem. Thus if the observations are v1, ···, vn, which according to the null hypothesis come from a Poisson distribution, the appropriate index of dispersion test statistic is
$$X^2 = \sum_{i=1}^{n} \frac{(v_i - \bar{v})^2}{\bar{v}},$$
where v̄ is the sample mean of the vi. For large n and if the null hypothesis is true, X2 is approximately distributed as chi-square with n − 1 degrees of freedom [see Bortkiewicz; Lexis].
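The Poisson index of dispersion is simple to compute. The sketch below assumes the form ∑(vi − v̄)²/v̄ with n − 1 degrees of freedom; the counts and the function name are hypothetical.

```python
def poisson_dispersion_index(counts):
    """Return (X2, degrees of freedom) for the Poisson index of dispersion."""
    n = len(counts)
    mean = sum(counts) / n
    x2 = sum((v - mean) ** 2 for v in counts) / mean
    return x2, n - 1

# Hypothetical counts of events in eight equal intervals.
counts = [2, 0, 3, 1, 4, 2, 1, 3]
x2, df = poisson_dispersion_index(counts)
print(x2, df)
```

Because the sum of squares is divided by the mean rather than by n − 1, the statistic is n − 1 times the ratio of the sample variance to the sample mean.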
The corresponding test for the binomial can be expressed similarly, but it is also useful to set out the n observations in a 2 × c contingency table such as Table 5.
|Table 5 — Arrangement of data to test for binomial|
|Outcome||Sample 1||Sample 2||···||Sample c|
|Successes||x1||x2||···||xc|
|Failures||n1 − x1||n2 − x2||···||nc − xc|
The variance test is equivalent to the chi-square test of homogeneity in this 2 × c array, which has, of course, c — 1 degrees of freedom.
Whereas in general the chi-square test is a one-tailed test, that is, the null hypothesis is rejected for large values of the statistic, the dispersion tests are often two-tailed tests, not necessarily with equal probability in the tails. The reason for this is that a too small value of X2 reflects a pattern that is more regular than that expected by chance, and such patterns may correspond to important alternatives to the null hypothesis of homogeneous randomness.
Contingency table association measures
If in the double-dichotomy model the hypothesis of independence is rejected, it is logical to seek a measure of association between the classifications. Distinction must be made between purely descriptive measures and sampling estimators of such measures. [See Statistics, descriptive, article on association.]
A large number of such measures have been presented, usually related to the X2 statistic used to test the null hypothesis of independence. Goodman and Kruskal (1954–1963) have emphasized the need to choose measures of association that have contextual meaning in the light of some probability model with predictive or explanatory value. They distinguish between two cases—no ordering among the categories and directed ordering among them.
Multidimensional contingency tables
The analysis of data that have been categorized into three or more classifications not only involves a considerable increase in the variety of possibilities but also introduces some new conceptual problems. The basic test of mutual independence is, however, a straightforward extension of the two-dimensional one and a simple application of the main theorem. The test will be discussed for three classifications.
This is a test of the hypothesis that $p_{ijk}$, the probability of an observation falling in row $i$, column $j$, and layer $k$, can be factored into $p_{i\cdot\cdot}\,p_{\cdot j\cdot}\,p_{\cdot\cdot k}$. Under this null hypothesis the estimated expected value in cell $ijk$ is $n^{-2}\,v_{i\cdot\cdot}\,v_{\cdot j\cdot}\,v_{\cdot\cdot k}$, where the dots indicate summation over the corresponding subscripts of the observed counts, $v_{ijk}$. The $X^2$ statistic has the usual form, the sum of (observed − expected)²/expected, and has $rcl - r - c - l + 2$ degrees of freedom if there are $r$ rows, $c$ columns, and $l$ layers.
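The test of mutual independence can be sketched directly from this description: form the three sets of one-way marginal totals, estimate each expected count as the product of the three margins divided by $n^2$, and sum (observed − expected)²/expected. The nested-list layout `table[i][j][k]`, the function name, and the counts below are assumptions made for illustration.

```python
def mutual_independence_x2(table):
    """Chi-square test of mutual independence in an r x c x l table.

    `table[i][j][k]` is the observed count in row i, column j, layer k.
    Returns (X2, degrees_of_freedom). Assumes every marginal total is positive.
    """
    r, c, l = len(table), len(table[0]), len(table[0][0])
    cells = [(i, j, k) for i in range(r) for j in range(c) for k in range(l)]
    n = sum(table[i][j][k] for i, j, k in cells)

    # One-way marginal totals (summing over the other two subscripts).
    row = [sum(table[i][j][k] for j in range(c) for k in range(l)) for i in range(r)]
    col = [sum(table[i][j][k] for i in range(r) for k in range(l)) for j in range(c)]
    lay = [sum(table[i][j][k] for i in range(r) for j in range(c)) for k in range(l)]

    x2 = 0.0
    for i, j, k in cells:
        expected = row[i] * col[j] * lay[k] / n ** 2
        x2 += (table[i][j][k] - expected) ** 2 / expected
    return x2, r * c * l - r - c - l + 2


# Hypothetical 2 x 2 x 2 counts, invented for illustration.
table = [[[10, 12], [8, 15]],
         [[14, 9], [11, 13]]]
x2, df = mutual_independence_x2(table)
print(f"X2 = {x2:.3f} on {df} degrees of freedom")
```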
Tests for partial independence, for example that $p_{ijk} = p_{i\cdot\cdot}\,p_{\cdot jk}$, or for homogeneity (between layers, for example) may be derived similarly. New concepts and new tests are introduced by the idea of interaction between the different classifications.
In linear models, interactions are measures of nonadditivity of the effects due to different classifications. With contingency models several definitions of interaction have been given; the present treatment follows Goodman (1964b). Consider, for example, samples drawn from rural and urban populations and classified by sex and age, with age treated dichotomously (see Table 6).
|Table 6 — Classification of rural and urban samples by sex and age|
For the urban group there is a sex ratio 20/42 among the “young” and 14/24 among the “old.” The ratio of these may be regarded as a measure of the interaction of age and sex in the urban population. Similarly, the corresponding ratio of sex ratios is a measure of the interaction in the rural population. These are, of course, sample values; and the population interactions must be defined in terms of the probabilities $p_{ijk}$. It is useful to define
$$\Delta_k = \frac{p_{11k}\,p_{22k}}{p_{12k}\,p_{21k}}, \qquad k = 1, 2,$$
and to write the three-factor no interaction hypothesis as $\Delta_1 = \Delta_2$, for this 2 × 2 × 2 contingency table. The maximum likelihood estimator of $\Delta_k$ is $d_k = v_{11k}v_{22k}/(v_{12k}v_{21k})$, and its variance can be estimated consistently by $s_k^2 = d_k^2 u_k$, where $u_k = \sum_i \sum_j 1/v_{ijk}$. A simple statistic to test the hypothesis $\Delta_1 = \Delta_2$ is
$$X^2 = \frac{(d_1 - d_2)^2}{s_1^2 + s_2^2},$$
which, if the null hypothesis is true, has a large sample chi-square distribution with one degree of freedom.
For the data given, $d_1 = 0.816$, $d_2 = 1.631$, and $X^2 = 1.01$, so that the three-factor no interaction hypothesis is not rejected at the usual significance levels. Goodman has extended this test in an obvious way to the 2 × 2 × $l$ contingency table. Here the test statistic is
$$X^2 = \sum_{k=1}^{l} \frac{(d_k - \bar{d})^2}{s_k^2}, \qquad \bar{d} = \frac{\sum_k d_k/s_k^2}{\sum_k 1/s_k^2},$$
where $s_k^2$ denotes the estimated variance of $d_k$; the statistic has $l - 1$ degrees of freedom. The extension to $r \times c \times l$ tables is based on logarithms of frequencies rather than the actual frequencies. Goodman also provides confidence intervals for the interactions $\Delta_k$ and indicates a number of equivalent tests. A bibliography of the very extensive literature on this topic is given in Goodman's paper.
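The interaction test can be sketched as follows, on the assumption that $d_k$ is the cross-product ratio $v_{11k}v_{22k}/(v_{12k}v_{21k})$ of layer $k$, that its estimated variance is $d_k^2 \sum_{i,j} 1/v_{ijk}$, and that the layers are combined by weighting each $d_k$ inversely to its estimated variance. The urban counts are the four quoted in the text; the rural counts are hypothetical and are not those of Table 6, so the resulting statistic should not be compared with the value reported above.

```python
def interaction_test(layers):
    """Three-factor interaction test for a 2 x 2 x l table.

    `layers` is a list of 2 x 2 tables [[v11, v12], [v21, v22]], one per layer.
    Returns (d, s2, X2, degrees_of_freedom). All counts must be positive.
    """
    d, s2 = [], []
    for (v11, v12), (v21, v22) in layers:
        dk = v11 * v22 / (v12 * v21)             # sample cross-product ratio
        uk = 1 / v11 + 1 / v12 + 1 / v21 + 1 / v22
        d.append(dk)
        s2.append(dk ** 2 * uk)                  # estimated variance of dk
    weights = [1 / s for s in s2]
    d_bar = sum(w * dk for w, dk in zip(weights, d)) / sum(weights)
    x2 = sum(w * (dk - d_bar) ** 2 for w, dk in zip(weights, d))
    return d, s2, x2, len(layers) - 1


urban = [[20, 42], [14, 24]]   # counts quoted in the text (young row, old row)
rural = [[30, 25], [18, 24]]   # HYPOTHETICAL counts, invented for illustration
d, s2, x2, df = interaction_test([urban, rural])
print(f"d = {[round(x, 3) for x in d]}, X2 = {x2:.3f}, df = {df}")
```

With two layers the weighted sum reduces algebraically to $(d_1 - d_2)^2/(s_1^2 + s_2^2)$.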
Alternatives to chi-square
While the chi-square tests are classical for the analysis of counted data, with the original simple tests going back to Karl Pearson, they are not the likelihood ratio tests. The latter are based upon statistics of the form
$$\lambda = \prod_{i} \left(\frac{n p_i}{v_i}\right)^{v_i}$$
in the general case with one sequence of trials, where $v_i$ is the observed count in category $i$ and $n p_i$ is the corresponding expected count under the null hypothesis. It is easy to show, by expanding minus twice the logarithm of the likelihood ratio statistic in a power series, that its leading term is $X^2$ and that further terms are of a smaller order than the leading term, so that the tests are equivalent in the limit. However, they are not equivalent for small samples.
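The closeness of the two statistics in moderately large samples can be illustrated numerically. The sketch below assumes a single sequence of trials with fully specified category probabilities and compares the Pearson statistic with minus twice the logarithm of the likelihood ratio, $2\sum_i v_i \log\{v_i/(np_i)\}$; the function names and counts are hypothetical.

```python
import math


def pearson_x2(observed, probs):
    """Pearson chi-square statistic against specified category probabilities."""
    n = sum(observed)
    return sum((v - n * p) ** 2 / (n * p) for v, p in zip(observed, probs))


def minus_two_log_lambda(observed, probs):
    """Minus twice the log likelihood ratio for the same hypothesis.

    Cells with a zero count contribute nothing, since v * log(v) tends to 0.
    """
    n = sum(observed)
    return 2 * sum(v * math.log(v / (n * p))
                   for v, p in zip(observed, probs) if v > 0)


# Hypothetical counts in five equally likely categories.
observed = [18, 22, 20, 25, 15]
probs = [0.2] * 5
print(f"Pearson X2       = {pearson_x2(observed, probs):.3f}")
print(f"-2 log(lambda)   = {minus_two_log_lambda(observed, probs):.3f}")
```

Repeating the comparison with much smaller counts shows the two statistics drifting apart, which is the small-sample non-equivalence noted above.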
Another test for contingency tables (comparative trials model) is that of C. A. B. Smith (1951). Further work appears to be necessary before any of these alternatives is accepted as preferable to the chi-square tests. The most widely used alternative analysis is that indicated in the next section.
ANOVA of transformed counted data
Contingency tables have an obvious analogy to similar arrays of measured data that are often treated by analysis of variance techniques (ANOVA). The analysis of variance models are more satisfactory if the data are such that (1) effects are additive, (2) error variability is constant, (3) the error distribution is symmetrical and nearly normal, and (4) the errors are statistically independent. [See Linear hypotheses, article on analysis of variance.]
Counted data that arise from a binomial or multinomial model fail most obviously on the second property since the variances of such data vary with the mean. However, some function or transformation of the observations may have approximately constant variance. Transformations that have been derived for counted data to make variances nearly constant have often been found empirically to improve the degree of approximation to which properties (1) and (3) hold also. These transformations include:
(a) Arc sine transformation for proportions, $y = \arcsin\sqrt{v/n}$, which is applicable to dichotomous data with an equal number of trials in each sequence. If the number of trials ($n_i$) varies from sequence to sequence the problem is more complicated (see Cochran 1943).
(b) Square root transformation, $y = \sqrt{v}$, for Poisson data.
(c) Logarithmic transformation, $y = \log(v + 1)$, for data such that the standard deviation is proportional to the mean.
Use of the arc sine transformation, and subsequent analysis, is facilitated by the use of binomial probability paper; graphic techniques are simple and usually adequate. The basic reference for such procedures is Mosteller and Tukey (1949).
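The three transformations can be written as short functions. This is a sketch only: it takes the arc sine transformation in its usual form $\arcsin\sqrt{v/n}$ (in radians), the square root transformation as $\sqrt{v}$, and, as an assumption, natural logarithms for $\log(v + 1)$, since the base is not specified. The function names are illustrative.

```python
import math


def arcsine_transform(v, n):
    """Arc sine transformation of a proportion v/n, in radians.

    For binomial counts the transformed value has variance close to 1/(4n),
    roughly independent of the underlying proportion.
    """
    if not 0 <= v <= n or n <= 0:
        raise ValueError("require 0 <= v <= n with n > 0")
    return math.asin(math.sqrt(v / n))


def sqrt_transform(v):
    """Square root transformation for Poisson counts (variance near 1/4)."""
    return math.sqrt(v)


def log_transform(v):
    """Logarithmic transformation log(v + 1); natural logs are an assumption."""
    return math.log(v + 1)


# Hypothetical counts of successes in sequences of 20 trials each.
for v in [2, 10, 18]:
    print(v, round(arcsine_transform(v, 20), 4),
          round(sqrt_transform(v), 4), round(log_transform(v), 4))
```

The transformed values would then be carried into an ordinary analysis of variance, with any estimates transformed back to the original scale where that is wanted.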
Refinements of these transformations and the choice among them are given a thorough treatment by Tukey (1957). If a suitable transformation has been made, the whole battery of tests that have been developed in analysis of variance (including covariance techniques) is applicable. Estimation problems may be more subtle; in some situations estimates may be given in the transformed variable but in others it may be desirable to transform back to the original variable. [See Statistical analysis, special problems of, article on transformations of data.]
Precautions in the analysis of counted data
The transformations discussed above were derived to apply to data that conform to such models as the binomial or Poisson and that could be analyzed by chi-square methods. However, counted data often arise from models that do not conform to the basic assumptions; in particular independence may be lacking, so that the chi-square tests are not valid. Such data are often transformed and treated by analysis of variance procedures; the justification for this is largely empirical. Examples of situations where this is necessary are experimental responses of animals in a group where dependence may be present, eye estimates of the numbers in a group of people, and comparisons of proportions in heterogeneous and unequal-sized groups. In such situations care is necessary both in selecting a transformation that achieves the properties listed above and in interpreting the results of the analysis.
The lack of independence and the presence of extraneous sources of variation are frequent sources of error in the analysis of counted data because the chi-square tests are invalidated by such factors. A discussion of these errors and others is found in Lewis and Burke (1949). The two careful expository papers by Cochran (1952; 1954) represent an excellent source of further reading on this topic. See also the monograph by Maxwell (1961).
Douglas G. Chapman
[See also Quantal response.]
Armitage, P. 1955 Tests for Linear Trends in Proportions and Frequencies. Biometrics 11:375–386.
Bartholomew, D. J. 1959 A Test of Homogeneity for Ordered Alternatives. Parts 1–2. Biometrika 46:36–48, 328–335.
Bennett, B. M.; and Hsu, P. 1960 On the Power Function of the Exact Test for the 2 × 2 Contingency Table. Biometrika 47:393–398.
Clopper, C. J.; and Pearson, E. S. 1934 The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial. Biometrika 26:404–413.
Cochran, William G. 1943 Analysis of Variance for Percentages Based on Unequal Numbers. Journal of the American Statistical Association 38:287–301.
Cochran, William G. 1950 The Comparison of Percentages in Matched Samples. Biometrika 37:256–266.
Cochran, William G. 1952 The χ² Test of Goodness of Fit. Annals of Mathematical Statistics 23:315–345.
Cochran, William G. 1954 Some Methods for Strengthening the Common χ² Tests. Biometrics 10:417–451.
Columbia University, Statistical Research Group 1947 Techniques of Statistical Analysis for Scientific and Industrial Research and Production and Management Engineering. Edited by Churchill Eisenhart, Millard W. Hastay, and W. Allen Wallis. New York: McGraw-Hill.
Cramér, H. 1946 Mathematical Methods of Statistics. Princeton Univ. Press. → See especially Chapter 30.
Diamond, Earl L. 1963 The Limiting Power of Categorical Data Chi-square Tests Analogous to Normal Analysis of Variance. Annals of Mathematical Statistics 34:1432–1441.
Ferguson, Thomas S. 1958 A Method of Generating Best Asymptotically Normal Estimates With Application to the Estimation of Bacterial Densities. Annals of Mathematical Statistics 29:1046–1062.
Finney, David J. et al. 1963 Tables for Testing Significance in a 2 × 2 Contingency Table. Cambridge Univ. Press.
Goodman, Leo A. 1964a Simultaneous Confidence Intervals for Contrasts Among Multinomial Populations. Annals of Mathematical Statistics 35:716–725.
Goodman, Leo A. 1964b Simple Methods for Analyzing Three-factor Interaction in Contingency Tables. Journal of the American Statistical Association 59:319–352.
Goodman, Leo A.; and Kruskal, William H. 1954–1963 Measures of Association for Cross-classifications. Parts 1–3. Journal of the American Statistical Association 49:732–764; 54:123–163; 58:310–364.
Harkness, W. L.; and Katz, Leo 1964 Comparison of the Power Functions for the Test of Independence in 2 × 2 Contingency Tables. Annals of Mathematical Statistics 35:1115–1127.
Harvard University, Computation Laboratory 1955 Tables of the Cumulative Binomial Probability Distribution. Cambridge, Mass.: Harvard Univ. Press.
Kimball, A. W. 1954 Short Cut Formulas for the Exact Partition of χ² in Contingency Tables. Biometrics 10:452–458.
Lewis, D.; and Burke, C. J. 1949 The Use and Misuse of the Chi-square Test. Psychological Bulletin 46:433–489. → Discussion of the article may be found in subsequent issues of this bulletin: 47:331–337, 338–340, 341–346, 347–355; 48:81–82.
Maxwell, Albert E. 1961 Analyzing Qualitative Data. New York: Wiley.
Mitra, Sujit Kumar 1958 On the Limiting Power Function of the Frequency Chi-square Test. Annals of Mathematical Statistics 29:1221–1233.
Mosteller, Frederick; and Tukey, John W. 1949 The Uses and Usefulness of Binomial Probability Paper. Journal of the American Statistical Association 44:174–212.
Neyman, Jerzy 1949 Contribution to the Theory of the χ² Test. Pages 239–273 in Berkeley Symposium on Mathematical Statistics and Probability, Proceedings. Edited by Jerzy Neyman. Berkeley: Univ. of California Press.
Owen, Donald B. 1962 Handbook of Statistical Tables. Reading, Mass.: Addison-Wesley. A list of addenda and errata is available from the author.
Smith, C. A. B. 1951 A Test for Heterogeneity of Proportions. Annals of Eugenics 16:15–25.
Tukey, John W. 1957 On the Comparative Anatomy of Transformations. Annals of Mathematical Statistics 28:602–632.
U.S. National Bureau of Standards 1950 Tables of the Binomial Probability Distribution. Applied Mathematics Series, No. 6. Washington: Government Printing Office.