# Significance, Tests of

*The topic of significance testing is treated in detail in* Hypothesis testing, *a term that is usually regarded as synonymous with “tests of significance,” although some statisticians feel that there are important distinctions. The present article describes the basic ideas of significance testing, outlines the most important elementary tests, and reviews problems related to the philosophy and application of significance testing*.

Significance tests are statistical procedures for examining a hypothesis in the light of observations. Many significance tests are simple and widely used, and have a long history; the idea of significance testing goes back at least to the eighteenth century. There has been much confusion about significance testing, and consequently it has been much misused. Despite their apparent simplicity (and sometimes because of it), significance tests have generated many controversies about meaning and utility.

A significance test starts with observations and with a hypothesis about the chance mechanism generating the observations. From the observations a test statistic is formed. Large values of the test statistic (or small values, or both, depending on circumstances) lead to strong skepticism about the hypothesis, whereas other values of the test statistic are held to be in conformance with the hypothesis. Choice of the test statistic usually depends upon the alternatives to the hypothesis under test.

## Basic ideas

As an example of significance testing, suppose that the hypothesis under test, often called the *null hypothesis*, or H_{0}, is that sleep deprivation has no effect on a certain kind of human skill, say arithmetic ability as measured by a particular type of test. Some population must be specified; suppose that it consists of all students at a particular college who are willing to participate in a sleep-deprivation experiment. One hundred students are chosen at random, tested for arithmetic skill, deprived of a night’s sleep, and then tested again. For each student the initial test result minus the second test result is regarded as the basic observation, datum, or score. Suppose, for the present, that if the null hypothesis is false, the effect of sleep deprivation is either to increase average score or to decrease it; either direction might hold, so the *alternative hypotheses* are two-sided.

The analyst would, of course, be concerned about possible practice and motivational effects; a helpful device might be the use of a control group that would also be tested twice, but with a normal night’s sleep. The effect of practice might be lessened by the presentation of a sequence of tests before the experiment proper begins. Another matter of concern is whether the basic score used, the difference between test results, is more appropriate than the ratio of test results or some other combination of them. For present purposes such issues are not discussed.

From the 100 scores a test statistic is then formed. The choice might well be the Student *t*-statistic, which is the observed average score divided by an estimate of its own variability. Details of the *t*-statistic are given later.

The nub of the significance-testing concept is that if the null hypothesis really is true, then it is unlikely that the Student *t*-statistic would be greater than, say, 2.6 in unsigned numerical value. The probability of that event, under the usual—but possibly dangerous—assumptions that the scores behave like statistically independent random variables with the same normal distribution, is about .01, or 1 out of 100. If, on the other hand, the null hypothesis is false, the probability of the event is greater than .01, much greater if sleep deprivation has a large effect.

Hence, if a Student *t*-statistic greater than 2.6 or less than –2.6 is observed, either the null hypothesis is false, or the null hypothesis is true and an event of small probability has occurred, *or* something is wrong with the underlying assumptions. If those assumptions seem satisfactory, and if the Student *t*-statistic is, say, 2.7, most people would prefer to act as if the null hypothesis were false. The cut-off number, 2.6, and the associated probability, .01, are cited for the sake of specific illustration. If 2.6 is replaced by a larger number, the associated probability is smaller; if it is replaced by a smaller number, the probability is larger.

In formal significance testing one decides beforehand on the cut-off number. If 2.6 is chosen, then there is a .01 probability of erroneously doubting the null hypothesis when it is in fact true.

In the example, 2.6 is called the *critical value* of the (unsigned) Student *t*-statistic, and .01 is called the *level of significance* or (usually synonymously) the size of the test. The level of significance, frequently denoted by α, is also called the probability of Type I error, or of error of the first kind; this is the error of falsely doubting the null hypothesis when it is in fact true. In the present context, if the unsigned Student *t*-statistic turns out to be greater than the critical value, it is often said that the statistic (or the test, or the sample) is *statistically significant*; this terminology can lead to confusion, as will be described below.

From this approach, then, a significance test is a procedure, agreed upon beforehand, that will result in one of two conclusions (or actions, or viewpoints): if the test statistic (here the unsigned Student *t*-statistic) is greater than the critical value, one tends to act as if the null hypothesis is false; otherwise one tends to act in some different way, not necessarily as if the null hypothesis is true but at least as if its falsity is not proved. These two kinds of conclusions or actions are conventionally called, respectively, *rejection* and *acceptance* of the null hypothesis. Such nomenclature, with its emotional overtones and its connotations in everyday speech, has led to confusion and needless controversy. The terms “rejection” and “acceptance” should be thought of simply as labels for two kinds of actions or conclusions, the first more appropriate if the null hypothesis is false and the second more appropriate if it is true or at least not known to be false. In the example, rejection and acceptance might correspond to different attitudes toward a theory of sleep or to different recommendations for desirable amounts of sleep. (In some contexts, such as acceptance sampling in industry, the terms “acceptance” and “rejection” may be used literally.)

There is further room for confusion in the varying ways that significance testing is described. In the example it would be accurate to say that the hypothesis of no sleep-deprivation effect (on average score) is being tested. But it might also be said, more loosely, that one is testing the effect of sleep deprivation, and this might be misinterpreted to mean that the null hypothesis asserts a positive mean score under sleep deprivation. One reason for this ambiguity is that the null hypothesis is often set up in the hope that it will be rejected. In the comparison of a new medical treatment with an old one, equality of effectiveness might be uninteresting, so only rejection of the null hypothesis of equality would excite the experimenter, especially if he had invented the new treatment. On the other hand, accepting the null hypothesis may itself be very important–for example, in an experiment to test the theory of relativity. Usually, however, there is a basic asymmetry between the null hypothesis and the other hypotheses.

The two-decision procedure discussed above is a highly simplified framework for many situations in which significance tests are used. Although there are cases in which the above viewpoint makes direct sense (for example, industrial acceptance sampling, crucial experiments for a scientific theory), most uses of significance tests are different. In these more common uses the analyst computes that level of significance at which the observed test statistic would just lead to rejection of the null hypothesis. In the example, if the observed Student *t*-statistic were 1.8, the resulting sample level of significance, or observed significance level, would be .075, again under the conventional assumptions listed earlier. From this second viewpoint the observed (or sample) significance level is a measure of rarity or surprise; the smaller the level, the more surprising the result if the null hypothesis is true. Naturally, if the result is very surprising under the null hypothesis, one is likely to conclude that the null hypothesis is false, but it is important to know quantitatively the magnitude of the surprise. The sample level of significance is often called the P-value, and the notation P = .075 (or p = .075) may be used. (For other views on measuring surprise, see Weaver 1948; Good 1956.)
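As a rough illustration of the observed significance level, the following Python sketch evaluates a two-sided P-value for a Student *t*-statistic by numerically integrating the *t* density. The function names, the choice of Simpson's rule, and the number of integration steps are illustrative assumptions rather than part of the original discussion; the 99 degrees of freedom correspond to the 100 scores of the example.

```python
import math


def t_density(x, df):
    """Density of the Student t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p_value(t_observed, df, steps=2000):
    """P(|T| >= |t_observed|) under the null hypothesis.

    Integrates the density from 0 to |t_observed| with Simpson's rule
    (steps must be even) and uses the symmetry of the t-distribution.
    """
    upper = abs(t_observed)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    central_mass = 2 * (h / 3) * total  # P(-upper <= T <= upper)
    return 1.0 - central_mass


# 100 scores give 99 degrees of freedom in the sleep-deprivation example.
print(two_sided_p_value(1.8, 99))  # the text quotes .075 for this statistic
print(two_sided_p_value(2.6, 99))  # the text quotes about .01 for this cut-off
```

The smaller the returned value, the more surprising the observed statistic is under the null hypothesis and the stated assumptions.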

Usually a very rough knowledge of the degree of surprise suffices, and various simplifying conventions have come into use. For example, if the observed significance level lies between .05 and .01, some authors say that the sample is statistically significant, while if it is less than .01, they call it *highly statistically significant*. The adverb “statistically” is often omitted, and this is unfortunate, since statistical significance of a sample bears no necessary relationship to possible subject-matter significance of whatever true departure from the null hypothesis might obtain (see Boring 1919).

From one point of view, significance testing is a device for curbing overenthusiasm about apparent effects in the data. For example, if the sleep-deprivation experiment showed a positive effect, an experimenter enthusiastic about a theory of sleep psychology might cry, “Eureka! Sleep deprivation really does have the predicted consequence.” After he himself has had a good night’s sleep, however, he might compute the appropriate test statistic and see, perhaps, that the observed effect–or a greater one—could occur with probability .2 under the null hypothesis of no real underlying effect. This would probably dampen his initial enthusiasm and might prevent one of the many premature announcements of discoveries that clutter the literature of some sciences. (The sleep example is solely for illustration; I have no reason to think that experimenters on sleep are anything but cautious in their statistical analyses.)

On the other hand, it is easy to be overconservative and throw out an interesting baby with the nonsignificant bath water. Lack of statistical significance at a conventional level does not mean that no real effect is present; it means only that no real effect is clearly seen from the data. That is why it is of the highest importance to look at power and to compute confidence intervals, procedures that will be discussed later.

The null hypothesis, despite its name, need not be that the average effect of sleep deprivation is zero. One might test the null hypothesis that the average effect is 10 score units. Or one might test such a null hypothesis as: the average effect of sleep deprivation is not positive; or, again: the average effect lies between –10 and 10 score units. The last two null hypotheses, unlike the previous ones, do not fix the hypothesized average effect at a single value.

## Standard tests

There are a number of popular standard significance tests in the statistical armamentarium. The tests described here include those on means, variances, correlations, proportions, and goodness of fit.

### Mean

The null hypothesis in a test on means, as in the original example, is that the expected value (mean) of a distribution is some given number, $\mu_0$. This hypothesis is to be tested on the basis of a random sample of *n* observations on the distribution; that is, one considers *n* independent random variables, each with that same distribution whose mean is of interest. If the variance, $\sigma^2$, of the distribution is known, and if $\bar{X}$ denotes the sample mean (arithmetic average), a widely useful test statistic is

$$\frac{\sqrt{n}\,(\bar{X} - \mu_0)}{\sigma}, \tag{1}$$

which has mean zero and standard deviation one under the null hypothesis. Often the absolute (unsigned numerical) value of (1) is the test statistic. If the underlying distribution of interest is normal, then (1) is itself normally distributed, with variance one and mean zero (that is, it has a unit normal distribution) under the null hypothesis. Even if the underlying distribution is not normal, (1) has approximately the unit normal distribution under the null hypothesis, provided that *n* is not too small.

In practice $\sigma^2$ is usually unknown, and an estimator of it, $s^2$, is used instead. The most usual estimator is based on the sum of squared deviations of the observations from $\bar{X}$,

$$s^2 = \frac{1}{n - 1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \tag{2}$$

where the $X_i$ are the observations. (The usual name for $s^2$ is “sample variance,” and that for *s* is “sample standard deviation.”) Then (1) becomes the Student (or *t*-) statistic

$$\frac{\sqrt{n}\,(\bar{X} - \mu_0)}{s}. \tag{3}$$

If the $X_i$ are normal, then (3) has as its distribution under the null hypothesis the Student (or *t*-) distribution with *n* – 1 degrees of freedom (*df*). If *n* is large (and often it need not be very large), then in quite general circumstances the distribution of (3) under the null hypothesis is approximately the unit normal distribution. [*See* Distributions, statistical, *articles on* special continuous distributions *and* approximations to distributions.]
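The one-sample Student statistic can be sketched in a few lines of Python. The function name and the data are hypothetical, chosen only to make the arithmetic easy to follow; the sample standard deviation uses the usual *n* – 1 divisor.

```python
import math
import statistics


def student_t_statistic(observations, mu0):
    """Return sqrt(n) * (sample mean - mu0) / s, with s from the n - 1 divisor."""
    n = len(observations)
    sample_mean = statistics.mean(observations)
    sample_sd = statistics.stdev(observations)  # square root of the sample variance
    return math.sqrt(n) * (sample_mean - mu0) / sample_sd


# Hypothetical difference scores (initial test minus second test).
scores = [1, 2, 3, 4, 5]
t_stat = student_t_statistic(scores, mu0=2)
print(t_stat)  # to be referred to the t-distribution with n - 1 = 4 df
```

With known variance one would divide by σ instead of *s* and refer the result to the unit normal distribution.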

*Two samples*. Suppose that the observations form two independent random samples of sizes $n_1$ and $n_2$: $X_{11}, \ldots, X_{1n_1}$ and $X_{21}, \ldots, X_{2n_2}$. Here the first subscript denotes the sample (1 or 2), and the second subscript denotes the observation within the sample (1 through $n_1$ or $n_2$). For example, in the sleep-deprivation experiment the first sample might be of male students and the second of female students. The null hypothesis might be that the average effect of sleep deprivation on arithmetic skill is the same for male and female students. Note that this null hypothesis says nothing about the magnitude of the average effect; it says only that the average effect is the same for the two groups.

More generally, suppose the null hypothesis is that the two underlying means are equal: $\mu_1 = \mu_2$, where $\mu_1$ is the expected value of any of the $X_{1i}$ and $\mu_2$ that of any of the $X_{2i}$. If the variances are known, say $\sigma_1^2$ and $\sigma_2^2$, the usual test statistic is

$$\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}, \tag{4}$$

where $\bar{X}_1$ and $\bar{X}_2$ are the two sample means. If the X’s are normally distributed, (4) has the unit normal as null distribution, that is, as its distribution when the null hypothesis holds. Even if the X’s are not normally distributed, the unit normal distribution is a good approximation to the null distribution of (4) for $n_1$, $n_2$ large. (If, instead of the null hypothesis $\mu_1 = \mu_2$, one wishes to test $\mu_1 = \mu_2 + \delta$, where $\delta$ is some given number, one need only replace $\bar{X}_1 - \bar{X}_2$ in the numerator of (4) with $\bar{X}_1 - \bar{X}_2 - \delta$.)

If the variances are not known, and if $s_1^2$ and $s_2^2$ are estimators of them, a common test statistic is

$$\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}, \tag{5}$$

which, for $n_1$, $n_2$ large, has the unit normal distribution as an approximation to the null distribution. The null distribution is better approximated by a *t*-distribution [*details can be found in* Errors, *article on* effects of errors in statistical assumptions].

If the variances are unknown but may reasonably be assumed to be equal ($\sigma_1^2 = \sigma_2^2$), then a common test statistic is the two-sample Student statistic,

$$\frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}, \tag{6}$$

where $s_1^2$ and $s_2^2$ are defined in terms of (2). If the observations are normally distributed, (6) has as null distribution the *t*-distribution with $n_1 + n_2 - 2$ *df*. This holds approximately, in general, if the X’s are mildly nonnormally distributed and if $\sigma_1^2$ and $\sigma_2^2$ differ somewhat but $n_1$ and $n_2$ are not very different.
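A minimal Python sketch of the two statistics used when the variances are unknown may help fix ideas: one divides the difference of sample means by the square root of the summed per-sample variance estimates over sample sizes, the other uses a pooled variance estimate. The function names and the data are hypothetical.

```python
import math
import statistics


def unpooled_statistic(sample1, sample2):
    """Difference of means over sqrt(s1^2/n1 + s2^2/n2)."""
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    return diff / math.sqrt(var1 / n1 + var2 / n2)


def pooled_statistic(sample1, sample2):
    """Two-sample Student statistic with a pooled variance estimate."""
    n1, n2 = len(sample1), len(sample2)
    var1, var2 = statistics.variance(sample1), statistics.variance(sample2)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    diff = statistics.mean(sample1) - statistics.mean(sample2)
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


# Hypothetical scores for two independent groups of equal size.
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
print(unpooled_statistic(group1, group2))
print(pooled_statistic(group1, group2))  # df = n1 + n2 - 2 = 8 under normality
```

When the two sample sizes are equal the two statistics coincide numerically, although they are referred to different approximating distributions; with unequal sizes and unequal variances they can differ appreciably.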

*Paired samples*. The two-sample procedure described above should be distinguished from the paired two-sample procedure. The latter might be used if the subjects are fraternal twins of opposite sex, with both members of a sampled pair of twins observed. Then the male and female samples would no longer be independent, since presumably the members of a pair of twins are more nearly alike in most respects than are two random subjects. In this case one would still use $\bar{X}_1 - \bar{X}_2$ in the numerator of the test statistic, but the denominator would be different from those given above. Although it is not always possible, pairing (more generally, blocking) is often used to increase the power of a test. The underlying idea is to make comparisons within a relatively homogeneous set of experimental units. [See Experimental design.]
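The text does not spell out the paired denominator. A common textbook choice, assumed here, is to apply the one-sample Student statistic to the within-pair differences, so that the denominator is the sample standard deviation of the differences divided by the square root of the number of pairs. The function name and the scores below are hypothetical.

```python
import math
import statistics


def paired_t_statistic(first, second):
    """One-sample Student statistic applied to within-pair differences."""
    if len(first) != len(second):
        raise ValueError("paired samples must have equal length")
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    return math.sqrt(n) * statistics.mean(differences) / statistics.stdev(differences)


# Hypothetical scores for five twin pairs (brother, sister).
brothers = [6, 8, 10, 12, 14]
sisters = [5, 6, 7, 8, 9]
print(paired_t_statistic(brothers, sisters))  # refer to t with n - 1 = 4 df
```

The numerator is still the difference of the two sample means, since the mean of the differences equals the difference of the means; only the measure of variability changes.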

[These significance tests, and generalizations of them, are further discussed in Hypothesis testing; Linear hypotheses, article on Analysis of variance; and Errors, article on effects of errors in statistical assumptions.]

### Variances

The null hypothesis in a test on variances may be that $\sigma^2$ has the value $\sigma_0^2$; then the usual test statistic is $s^2$ as defined in (2), or, more conveniently,

$$\frac{(n - 1)\,s^2}{\sigma_0^2}. \tag{7}$$

Under the null hypothesis, (7) has the chi-square distribution with *n* – 1 degrees of freedom, provided that the observations are normal, independent, and identically distributed. Unlike the null distributions for the tests on means, the null distribution of (7) is highly sensitive to deviations from normality.

In the two-sample case a common null hypothesis is $\sigma_1^2 = \sigma_2^2$. The usual test statistic is $s_1^2/s_2^2$, which, under normality, has as null distribution the *F*-distribution with $n_1 - 1$ and $n_2 - 1$ *df*.
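The two variance statistics are simple to compute; the sketch below, with hypothetical function names and data, returns only the statistics and leaves the reference to chi-square or *F* tables to the reader.

```python
import statistics


def variance_chi_square(observations, sigma0_squared):
    """(n - 1) * s^2 / sigma0^2, chi-square with n - 1 df under normality."""
    n = len(observations)
    return (n - 1) * statistics.variance(observations) / sigma0_squared


def variance_ratio(sample1, sample2):
    """s1^2 / s2^2, F-distributed with (n1 - 1, n2 - 1) df under normality."""
    return statistics.variance(sample1) / statistics.variance(sample2)


# Hypothetical data; the null value sigma0^2 = 2 is chosen only for illustration.
print(variance_chi_square([1, 2, 3, 4, 5], sigma0_squared=2))
print(variance_ratio([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))
```

Because these null distributions are sensitive to nonnormality, the resulting significance levels deserve more caution than those for tests on means.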

[These tests are discussed further in Linear hypotheses, article on analysis of variance; Variances, statistical study of.]

### Correlation

The most common procedure in simple correlation analysis is to test the null hypothesis that the population correlation coefficient is zero, using as test statistic the sample correlation coefficient. Under the assumptions of bivariate normality and a random sample, the null distribution is closely related to a *t*-distribution. One can also test the null hypothesis that the population correlation coefficient has some nonzero value, say .55. Special tables or approximate methods are required here [see Multivariate analysis, *articles on* correlation].
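The relation to the *t*-distribution can be made concrete: under bivariate normality and zero population correlation, the transformed quantity r√(n – 2)/√(1 – r²) has the *t*-distribution with *n* – 2 *df*. For a nonzero null value, one widely used approximate method, assumed here and not named in the text, is Fisher's *z*-transformation. The following Python sketch uses hypothetical function names and data.

```python
import math


def sample_correlation(xs, ys):
    """Product-moment correlation coefficient of paired observations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def t_from_correlation(r, n):
    """r * sqrt(n - 2) / sqrt(1 - r^2); t with n - 2 df when rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def fisher_z_statistic(r, n, rho0):
    """Approximately unit normal when the population correlation is rho0."""
    return (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = sample_correlation(xs, ys)
print(r, t_from_correlation(r, len(xs)), fisher_z_statistic(r, len(xs), 0.55))
```

With so few observations the normal approximation behind the *z*-transformation is crude; the sample here is kept small only so that the arithmetic can be checked by hand.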

### Proportions

*Single sample*. The simplest case in the analysis of proportions is that of a sample proportion and the null hypothesis that the corresponding population probability is $p_0$, some given number between 0 and 1. The sample proportion and population probability correspond to some dichotomous property: alive-dead, heads-tails, success-failure. It is convenient to summarize the sample in the form of a simple table

| Number having property | Number not having property | Total |
|---|---|---|
| *N* | *n - N* | *n* |

where *n* is the sample size, *N* the number of sample observations having a stated property, *n - N* the number not having this property, and *N/n* the sample proportion. The usual sampling assumptions, which need examination in each application, are that the observations are statistically independent and that each has the same probability of having the stated property. As a result of these assumptions, *N* has a binomial distribution.

The usual test statistic is

$$\frac{N - np_0}{\sqrt{np_0(1 - p_0)}}, \tag{8}$$

which, for *n* not too small, has approximately the unit normal distribution under the null hypothesis. The square of this test statistic, again for *n* not too small and under the null hypothesis, has approximately a chi-square distribution with one *df*.

Notice that (8) is really a special case of (1) if *N* is regarded as the sum of *n* independent observations taking values 1 and 0 with probabilities $p_0$ and $1 - p_0$.
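A short Python sketch of the single-sample test on a proportion follows. The statistic is the observed count minus its null expectation, divided by the null standard deviation of the count; the counts and function names are hypothetical, and the P-value relies on the normal approximation mentioned in the text.

```python
import math
from statistics import NormalDist


def proportion_z_statistic(count, n, p0):
    """(N - n*p0) / sqrt(n*p0*(1 - p0))."""
    return (count - n * p0) / math.sqrt(n * p0 * (1 - p0))


def two_sided_normal_p_value(z):
    """Two-sided P-value from the unit normal approximation."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical sample: 60 of 100 observations have the property; p0 = 0.5.
z = proportion_z_statistic(count=60, n=100, p0=0.5)
print(z, z ** 2, two_sided_normal_p_value(z))
```

The squared statistic printed in the middle is the quantity that is referred to the chi-square distribution with one *df*.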

*Two samples*. The data may consist of two sample proportions, and the null hypothesis may be that the corresponding population proportions are equal. It is convenient to express such data in terms of a 2 × 2 table like Table 1, where the two samples (of sizes $n_{1+}$, $n_{2+}$) correspond to the upper two lines and the bottom and right rims are marginal totals; for example, $N_{11} + N_{12} = n_{1+}$. Capital letters denote random variables and lower-case letters nonrandom quantities. The usual assumptions, which must be examined in each application, are that $N_{11}$ and $N_{21}$ are independently and binomially distributed. A common test statistic is

$$\frac{N_{11}/n_{1+} - N_{21}/n_{2+}}{\sqrt{\dfrac{N_{+1}}{n}\,\dfrac{N_{+2}}{n}\left(\dfrac{1}{n_{1+}} + \dfrac{1}{n_{2+}}\right)}}, \tag{9}$$

which, for n’s not too small, has approximately the unit normal distribution under the null hypothesis. (Note the great similarity of test statistics (6) and (9), which are related in much the same way as (1) and (8).)

If the unsigned value of (9) is the test statistic, one may equivalently consider its square, which is expressible in the simpler form

$$\frac{n\,(N_{11}N_{22} - N_{12}N_{21})^2}{n_{1+}\,n_{2+}\,N_{+1}\,N_{+2}}. \tag{10}$$

Under the null hypothesis, and for n’s not too small, (10) has approximately a chi-square distribution with one df. Large values of (10) are statistically significant.
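The equivalence between the two-sample statistic and its squared form can be checked numerically. In the sketch below the first function divides the difference of the two sample proportions by a standard error built from the pooled proportion, and the second uses the cross-product form based on the four cell counts and the marginal totals. The function names and counts are hypothetical.

```python
import math


def two_proportion_z(n11, n12, n21, n22):
    """Difference of sample proportions over its pooled standard error."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    n = row1 + row2
    se = math.sqrt((col1 / n) * (col2 / n) * (1 / row1 + 1 / row2))
    return (n11 / row1 - n21 / row2) / se


def two_by_two_chi_square(n11, n12, n21, n22):
    """n * (n11*n22 - n12*n21)^2 divided by the product of the four margins."""
    row1, row2 = n11 + n12, n21 + n22
    col1, col2 = n11 + n21, n12 + n22
    n = row1 + row2
    return n * (n11 * n22 - n12 * n21) ** 2 / (row1 * row2 * col1 * col2)


# Hypothetical counts: 30 of 50 in sample 1 and 20 of 50 in sample 2.
z = two_proportion_z(30, 20, 20, 30)
chi_square = two_by_two_chi_square(30, 20, 20, 30)
print(z, z ** 2, chi_square)
```

The last two printed numbers agree up to rounding, which is the algebraic identity the text refers to.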

Table 1 – Summary of two samples for analysis of proportions

| | Number having property | Number not having property | Totals |
|---|---|---|---|
| Sample 1 | $N_{11}$ | $N_{12}$ | $n_{1+}$ |
| Sample 2 | $N_{21}$ | $N_{22}$ | $n_{2+}$ |
| Totals | $N_{+1}$ | $N_{+2}$ | $n_{++} = n$ |

*Chi-square statistics*. The above test statistics, in squared form, are special cases of the chi-square test statistic; their null distributions are approximately chi-square ones. Such test statistics can generally be thought of in terms of data tabulations of the kind shown above; the statistic is the sum of terms given by the mnemonic expression

$$\frac{(\text{observed} - \text{“expected”})^2}{\text{“expected”}},$$

summed over the cells in the body of the table. For example, in the first, very simple table for a sample proportion, there are two cells in the body of the table, with *N* and *n - N* observed frequencies in them. The corresponding frequencies expected under the null hypothesis are $np_0$ and $n(1 - p_0)$. Applying the mnemonic expression gives as the chi-square statistic

$$\frac{(N - np_0)^2}{np_0} + \frac{[(n - N) - n(1 - p_0)]^2}{n(1 - p_0)},$$

which can readily be shown to equal the square of (8). The one df for the approximate chi-square distribution is sometimes described thus: There are two observed frequencies, but they are restricted in that their sum must be n. Hence, there is one df.

Similarly, (10) can be obtained from the table for two proportions by means of the mnemonic expression. In this case the null hypothesis does not completely specify the expected frequencies, so they must be estimated from the data. That is why quotation marks appear around “expected” in the mnemonic. [*Further details, and related procedures, including the useful continuity correction, are discussed in* Counted data.]
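The mnemonic can be sketched directly in Python: sum the squared difference between observed and expected frequency, divided by the expected frequency, over the cells. In the second function the “expected” frequencies are estimated from the marginal totals as row total times column total over the grand total, which is the usual estimate under the null hypothesis of equal proportions. Function names and counts are hypothetical.

```python
def chi_square_from_cells(observed, expected):
    """Sum of (observed - expected)^2 / expected over matching cells."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_independence(table):
    """Chi-square statistic with 'expected' counts estimated from the margins."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    observed, expected = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            observed.append(count)
            expected.append(row_totals[i] * col_totals[j] / n)
    return chi_square_from_cells(observed, expected)


# Single proportion: N = 60 of n = 100, null value p0 = 0.5 (hypothetical counts).
print(chi_square_from_cells([60, 40], [100 * 0.5, 100 * 0.5]))
# Two proportions: 30 of 50 versus 20 of 50 (hypothetical counts).
print(chi_square_independence([[30, 20], [20, 30]]))
```

The second function accepts tables with more than two rows or columns, although the degrees of freedom for the approximating chi-square distribution then change.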

### Association for counted data

Suppose that the individuals in a sample of people are cross-classified by hair color and eye color, with perhaps five categories for each attribute. The null hypothesis is that the two attributes are statistically independent. A standard procedure to test for association in this *contingency table* is another chi-square test, which is based essentially on (10) above. Although the test statistics are the same and have the same approximate null distribution, the power functions are different. [This test and related ones are discussed in Counted data.]

### Nonparametric tests

The tests for means, variances, and correlation, discussed above, are closely tied to normality assumptions, or at least to approximate normality. Analogous tests of significance, without the normality restriction, have been devised [these are discussed in Nonparametric statistics; Hypothesis testing].

### General approximation

In many cases a test statistic, T, is at hand, but its null distribution is difficult to express in usable form. It is often useful to approximate this null distribution in terms of μ_{0}, the expected value of T under the null hypothesis, and S, an estimator of the standard deviation of T under the null hypothesis. The approximation takes (T - μ_{0})/S as roughly unit normal under the null hypothesis. This approximation is, intuitively and historically, very important. The first tests of proportions discussed above are special cases of the approximation.

### Goodness of fit

A number of significance tests are directed to the problem of goodness of fit. In these cases the null hypothesis is that the parent distribution is some specific one, or a member of some specific family. Another problem often classed as one of goodness of fit is that of testing, on the basis of two samples, the hypothesis that two parent populations are the same. [See Goodness of fit.]

## Other aspects of significance testing

### Alternative hypotheses and power

A significance test of a null hypothesis is usually held to make sense only when one has at least a rough idea of the hypotheses that hold if the null hypothesis does not; these are the alternative hypotheses (or alternate hypotheses). In testing that a population mean, μ, is zero, as in the example at the start of the article, the test is different if one is interested only in positive alternatives (μ > 0), if one is interested only in negative alternatives (μ < 0), or if interest extends to both positive and negative alternatives.

When using such a test statistic as the Student *t*-statistic, it is important to keep the alternatives in mind. In the sleep-deprivation example the original discussion is appropriate when both positive and negative alternatives are relevant. The null hypothesis is rejected if the test statistic has either a surprisingly large positive value or a surprisingly large negative value. On the other hand, it might be known that sleep deprivation, if it affects arithmetic skill at all, can only make it poorer. This means that the expected average score (where the score is the initial test result minus the result after sleep deprivation) cannot possibly be negative but must be zero or positive. One would then use the so-called right-tail test, rejecting the null hypothesis only if the Student *t*-statistic is large. For example, deciding to reject H_{0} when the statistic is greater than 2.6 leads to a level of significance of .005. Similarly, right-tail P-values would be computed. If the Student statistic observed is 1.8, the right-tail P-value is .037; that is, the probability, under the null hypothesis, of observing a Student statistic 1.8 or larger is .037. There are also left-tail tests and left-tail P-values.
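Because the *t*-distribution is symmetric about zero, the right-tail P-value of a positive statistic is half the two-sided one. As a quick check on the figures quoted in this article (the symbol $T$ for the Student statistic is introduced here only for illustration):

$$P(T \ge 1.8) = \tfrac{1}{2}\,P(|T| \ge 1.8) \approx \tfrac{1}{2}(.075) \approx .037, \qquad P(T \ge 2.6) = \tfrac{1}{2}\,P(|T| \ge 2.6) \approx \tfrac{1}{2}(.01) = .005.$$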

For a test considered as a two-action procedure, the power is the probability of (properly) rejecting the null hypothesis when it is false. Power, of course, is a function of the particular alternative hypothesis in question, and power functions have been much studied. The probability of error of Type II is one minus power; a Type II error is acceptance of the null hypothesis when it is false. By using formulas, approximations, tables, or graphs for power, one can determine sample sizes so as to control both size and power. [See Experimental design, article on the design of experiments.]
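As an illustration of a power calculation, the sketch below treats the simplest case: a two-sided test on a mean with known standard deviation, where the statistic is unit normal under the null hypothesis and is shifted by √n times the true effect over σ under an alternative. The known-σ simplification, the function names, and the numerical settings are assumptions made for the example.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def two_sided_power(effect, sigma, n, alpha=0.05):
    """Power of the two-sided test based on sqrt(n) * (mean - mu0) / sigma.

    effect is the true mean minus the null value mu0; sigma is assumed known.
    """
    critical = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = math.sqrt(n) * effect / sigma
    upper_tail = 1 - STANDARD_NORMAL.cdf(critical - shift)
    lower_tail = STANDARD_NORMAL.cdf(-critical - shift)
    return upper_tail + lower_tail


def smallest_sample_size(effect, sigma, alpha=0.05, target_power=0.8):
    """Smallest n whose power reaches target_power (simple upward search)."""
    if effect == 0:
        raise ValueError("power cannot exceed alpha when the effect is zero")
    n = 2
    while two_sided_power(effect, sigma, n, alpha) < target_power:
        n += 1
    return n


# Hypothetical alternative: a true effect of half a standard deviation.
print(two_sided_power(effect=0.5, sigma=1.0, n=32))
print(smallest_sample_size(effect=0.5, sigma=1.0))
```

One minus the first printed value is the probability of a Type II error against that particular alternative; at a zero effect the function returns the level of significance itself.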

A common error in using significance tests is to neglect power considerations and to conclude from a sample leading to “acceptance” of the null hypothesis that the null hypothesis holds. If the power of the test is low for relevant alternative hypotheses, then a sample leading to acceptance of H_{0} is also very likely if those alternatives hold, and the conclusion is therefore unwarranted. Conversely, if the sample is very large, power may be high for alternatives “close to” the null hypothesis; it is important that this be recognized and possibly reacted to by decreasing the level of significance. These points are discussed again later.

It is most important to consider the power function of a significance test, even if only crudely and approximately. As Jerzy Neyman wrote, perhaps with a bit of exaggeration, “. . . if experimenters realized how little is the chance of their experiments discovering what they are intended to discover, then a very substantial proportion of the experiments now in progress would have been abandoned in favour of an increase in size of the remaining experiments, judged more important” (1958, p. 15). [References to Neyman’s fundamental and pathbreaking contributions to the theory of testing are given in Hypothesis testing.]

There is, however, another point of view for which significance tests of a null hypothesis may be relevant without consideration of alternative hypotheses and power. An illuminating discussion of this viewpoint is given by Anscombe (1963).

### Combining significance tests

Sometimes it is desirable to combine two or more significance tests into a single one without reanalyzing the detailed data. For example, one may have from published materials only the sample significance levels of two tests on the same hypothesis or closely related ones. (Discussions of how to do this are given in Mosteller & Bush [1954] 1959, pp. 328-331; and Moses 1956. See also Good 1958.) Of course, a combined analysis of all the data is usually desirable, when that is possible.
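The text does not say which combination method it has in mind. One classical possibility, assumed here purely for illustration, is Fisher's method: if the tests are independent and each null distribution is continuous, minus twice the sum of the logarithms of the *k* sample significance levels has a chi-square distribution with 2*k* *df* under the null hypothesis. The function name and the P-values below are hypothetical.

```python
import math


def fisher_combined_p_value(p_values):
    """Combine independent P-values with Fisher's method.

    The statistic -2 * sum(log p) is chi-square with 2k df when every null
    hypothesis is true; for even df the upper tail has a closed form.
    """
    k = len(p_values)
    half_statistic = -sum(math.log(p) for p in p_values)  # statistic / 2
    term, series = 1.0, 1.0
    for j in range(1, k):
        term *= half_statistic / j
        series += term
    return math.exp(-half_statistic) * series


# Two hypothetical published significance levels for the same hypothesis.
print(fisher_combined_p_value([0.075, 0.2]))
```

Neither input is below .05, yet the combined level is smaller than the larger of the two; whether such a combination is appropriate depends on the independence of the underlying studies.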

### Preliminary tests of significance

A desired test procedure is often based on an assumption that may be questionable–for example, equality of the two variances in the two-sample t-test. It has frequently been suggested that a preliminary test of the assumption (as null hypothesis) be made and that the test of primary interest then be carried out only if the assumption is accepted by the preliminary test. If the assumption is rejected, the test of primary interest requires modification.

Such a two-step procedure is difficult to analyze and must be carried out with caution. One relevant issue is the effect on the primary test of a given error in the assumption. Another is that the preliminary test may be much more sensitive to errors in other underlying assumptions than is the main significance test. [See Errors, article on effects of errors in statistical assumptions. *A discussion of preliminary significance tests, with references to prior literature, is given in Bancroft 1964. Related material is given in* Kitagawa 1963.]

### Relation to confidence sets and estimation

It often makes sense, although the procedure is not usually described in these terms, to compute appropriate sample significance levels not only for the null hypothesis but also for alternative hypotheses as if–for the moment–each were a null hypothesis. In this way one obtains a measure of how surprising the sample is for both the null and the alternative hypotheses. If parametric values corresponding to those hypotheses not surprising (at a specified level) for the sample are considered together, a confidence region is obtained. [See Estimation, article on confidence intervals and regions.]
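The idea of collecting the parameter values that are not surprising for the sample can be sketched for a mean, using the large-sample normal approximation to the Student statistic. The resulting set is an interval centered on the sample mean. The function names and summary figures are hypothetical, and the normal critical value is an assumption suited to large *n*.

```python
import math
from statistics import NormalDist


def large_sample_interval(sample_mean, sample_sd, n, level=0.95):
    """Values mu0 for which |sqrt(n) * (mean - mu0) / s| <= critical value."""
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = critical * sample_sd / math.sqrt(n)
    return sample_mean - half_width, sample_mean + half_width


def is_rejected(sample_mean, sample_sd, n, mu0, level=0.95):
    """Two-sided test of the hypothesized mean mu0 at level 1 - level."""
    critical = NormalDist().inv_cdf(1 - (1 - level) / 2)
    statistic = math.sqrt(n) * (sample_mean - mu0) / sample_sd
    return abs(statistic) > critical


# Hypothetical summary: mean score 1.0, standard deviation 5.0, n = 100.
low, high = large_sample_interval(1.0, 5.0, 100)
print(low, high)
print(is_rejected(1.0, 5.0, 100, mu0=0.0), is_rejected(1.0, 5.0, 100, mu0=0.5))
```

A hypothesized value is rejected exactly when it falls outside the interval, which is the duality between tests and confidence regions described above.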

In any case, a significance test is typically only one step in a statistical analysis. A test asks, in effect, Is anything other than random variation appearing beyond what is specified by the null hypothesis? Whatever the answer to that question is–but especially if the answer is Yes–it is almost always important to estimate the magnitudes of underlying effects. [See Estimation.]

### Relation to discriminant analysis

Significance testing, historically and in most presentations, is asymmetrical: control of significance level is more important than control of power. The null hypothesis has a privileged position. This is sometimes reasonable, for example, when the alternative hypotheses are diffuse while the null hypothesis is sharp. In other cases there is no particular reason to call one hypothesis “null” and the other “alternative” and hence no reason for asymmetry in the treatment of the two kinds of error. This symmetric treatment then is much the same as certain parts of the field called discriminant analysis. [See Multivariate analysis, article on classification and discrimination.]

## Dangers, problems, and criticisms

Some dangers and problems of significance testing have already been touched on: failure to consider power, rigid misinterpretation of “accept” and “reject,” serious invalidity of assumptions. Further dangers and problems are now discussed, along with related criticisms of significance testing.

### Nonsignificance is often nonpublic

Negative results are not so likely to reach publication as are positive ones. In most significance-testing situations a negative result is a result that is not statistically significant, and hence one sees in published papers and books many more statistically significant results than might be expected. Many, perhaps most, statistically nonsignificant results never see publication.

The effect of this is to change the interpretation of published significance tests in a way that is hard to analyze quantitatively. Suppose, to take a simple case, that some null hypothesis is investigated independently by a number of experimenters, all testing at the .05 level of significance. Suppose, further, that the null hypothesis is true. Then any one experimenter will have only a 5/100 chance of (misleadingly) finding statistical significance, but the chance that at least one experimenter will find statistical significance is appreciably higher. If, for example, there are six experimenters, a rejection of the null hypothesis by at least one of them will take place with probability .265, that is, more than one time out of four. If papers about experiments are much more likely to be published when a significance test shows a level of .05 (or less) than otherwise, then the nonpublication of nonsignificant results can lead to apparent contradictions and substantive controversy. If the null hypothesis is false, a similar analysis shows that the “power” of published significance tests may be appreciably higher than their nominal power. (Discussions of this problem are given in Sterling 1959; and Tullock 1959.)
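The figure of .265 for six experimenters follows from the complement rule for independent events. A minimal sketch of that arithmetic, assuming the experiments are independent and each is run at exactly the stated level (the function name is illustrative only):

```python
def prob_at_least_one_rejection(n_experimenters, alpha=0.05):
    """Chance that at least one of n independent level-alpha tests
    rejects a null hypothesis that is in fact true."""
    return 1.0 - (1.0 - alpha) ** n_experimenters


# Tabulate the chance of at least one spurious "significant" finding.
for k in (1, 2, 6, 20):
    print(k, round(prob_at_least_one_rejection(k), 3))
```

The probability grows quickly with the number of independent attempts, which is why selective publication of the significant ones is so hard to interpret.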

### Complete populations

Another difficulty arises in the use of significance tests (or any other procedures of probabilistic inference) when the data consist of a complete census for the relevant population. For example, suppose that per capita income and per capita dollars spent on new automobiles are examined for the 50 states of the United States in 1964. The formal correlation coefficient may readily be computed and may have utility as a descriptive summary, but it would be highly questionable to use sampling theory of the kind discussed in this article to test the null hypothesis that the population correlation coefficient is zero, or is some other value. The difficulty is much more fundamental than that of nonnormality; it is hard to see how the 50 pairs of numbers can reasonably be regarded as a sample of any kind. Some statisticians believe that permutation tests may often be used meaningfully in such a context, but there is no consensus. [For a definition of permutation tests, see Nonparametric statistics. Further discussion of this problem and additional references are given in Hirschi & Selvin 1967, chapter 13. An early article is Woofter 1933.]

### Target versus sampled populations

Significance tests also share with all other kinds of inference from samples the difficulty that the population sampled from is usually more limited than the broader population for which an inference is desired. In the sleep-deprivation example the sampled population consists of students at a particular college who are willing to be experimental subjects. Presumably one wants to make inferences about a wider population: all students at the college, willing or not; all people of college age; perhaps all people. [See Statistics; Errors, article on non-sampling errors.]

### Neglect of power by word play

A fallacious argument is that power and error of the second kind (accepting the null hypothesis when it is false) need not be of concern, since the null hypothesis is never really accepted but is just not rejected. This is arrant playing with words, since a significance test is fatuous unless there is a question with at least two possible answers in the background. Hence, both kinds of probabilities of wrong answers are important to consider. Recall that “accept” and “reject” are token words, each corresponding to a conclusion that is relatively more desirable when one or another true state of affairs obtains.

To see in another way why more than Type I error alone must be kept in mind, notice that one can, without any experiment or expense, achieve zero level of significance (no error of the first kind) by never rejecting the null hypothesis. Or one can achieve any desired significance level by using a random device (like dice or a roulette wheel) to decide the issue, without any substantive experiment. Of course, such a procedure is absurd because, in the terminology used here, its power equals its level of significance, whatever alternative hypothesis may hold.

### Difficulties with significance level

If significance tests are regarded as decision procedures, one criticism of them is based on the arbitrariness of level of significance and the lack of guidance in choosing that level. The usual advice is to examine level of significance and power for various possible sample sizes and experimental designs and then to choose the least expensive design with satisfactory characteristics. But how is one to know what is satisfactory? Some say that if satisfaction cannot be described, then the experimenter has not thought deeply enough about his materials. Others oppose such a viewpoint as overmechanical and inappropriate in scientific inference; they might add that the arbitrariness of size, or something analogous to it, is intrinsic in any inferential process. One cannot make an omelet without deciding how many eggs to break.

*Unconventional significance levels*. Probably the most common significance levels are .05 and .01, and tables of critical values are generally for these levels. But special circumstances may dictate tighter or looser levels. In evaluating the safety of a drug to be used on human beings, one might impose a significance level of .001. In exploratory work, it might be quite reasonable to use levels of .10 or .15 in order to increase power. What is of central importance is to know what one is doing and, in particular, to know the properties of the test that is used.

*Nonconstant significance levels*. For many test situations it is impossible to obtain a sensible test with the same level of significance for all distributions described by a so-called composite null hypothesis. For example, in testing the null hypothesis that a population mean is not positive, the level of significance depends generally on the value of the nonpositive mean; the more it departs from zero, the smaller the probability of Type I error. In such cases the usual approach is to think in terms of the maximum probability of Type I error over all distributions described by the null hypothesis. In the nonpositive mean case, the maximum is typically attained when the mean is zero.
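As a worked illustration of the nonpositive-mean case, assume (these are assumptions for the sketch, not part of the discussion above) n independent normal observations with known standard deviation σ and a right-tail test that rejects when √n x̄/σ exceeds the upper α point z_{1-α} of the standard normal distribution, with Φ the standard normal distribution function. The probability of rejection when the true mean is µ is then

$$
P_{\mu}(\text{reject}) \;=\; 1 - \Phi\!\left(z_{1-\alpha} - \frac{\mu\sqrt{n}}{\sigma}\right),
\qquad
\max_{\mu \le 0} P_{\mu}(\text{reject}) \;=\; 1 - \Phi(z_{1-\alpha}) \;=\; \alpha \quad \text{at } \mu = 0 .
$$

Since the expression increases with µ, every strictly negative mean gives a probability of Type I error smaller than α, and the stated level is the maximum over the composite null hypothesis.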

### Necessarily false null hypotheses

Another criticism of standard significance tests is that in most applications it is known beforehand that the null hypothesis cannot be exactly true. For example, it seems most implausible that sleep deprivation should have literally no effect at all on arithmetic ability. Hence, why bother to test a null hypothesis known at the start to be false?

One answer to this criticism can be outlined as follows: A test of the hypothesis that a population mean has a specified value, µ_{0}, is a simplification. What one really wants to test is whether the mean is near µ_{0}, as near as makes no substantive difference. For example, if a new psychological therapy raises the cure rate from 51 per cent to 51.1 per cent, then even if one could discover such a small difference, it might be substantively uninteresting and unimportant. For “reasonable” sample sizes and “reasonable” significance levels, most standard tests have power quite close to the level of significance for alternative hypotheses close to the null hypothesis. When this is so and when, in addition, power is at least moderately large for alternatives interestingly different from the null hypothesis, one is in a satisfactory position, and the criticism of this section is not applicable. The word “reasonable” is in quotation marks above because what is in fact reasonable depends strongly on context. To examine reasonableness it is necessary to inspect, at least roughly, the entire power function. Many misuses of significance testing spring from complete disregard of power.
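A rough inspection of a power function can be sketched as follows. The setting is an assumption made for illustration: a two-sided test of a normal mean with known standard deviation, with hypothetical default values for the sample size and level.

```python
from statistics import NormalDist


def two_sided_power(delta, n=100, sigma=1.0, alpha=0.05):
    """Power of a two-sided z test of 'mean = mu0' when the true mean
    is mu0 + delta (normal data, known sigma: an illustrative setting)."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)   # reject when |z statistic| > crit
    shift = delta * n ** 0.5 / sigma      # mean of the z statistic under the alternative
    return z.cdf(-crit - shift) + 1.0 - z.cdf(crit - shift)


# Power stays near the level for tiny departures and rises for large ones.
for delta in (0.0, 0.01, 0.1, 0.3, 0.5):
    print(delta, round(two_sided_power(delta), 4))
```

With these settings a departure of one hundredth of a standard deviation is detected hardly more often than the level itself, whereas a departure of half a standard deviation is detected almost always; whether that is satisfactory depends on which departures matter substantively.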

A few authors, notably Hodges and Lehmann (1954), have formalized the argument outlined above by investigating tests of null hypotheses such as the following: The population mean is in the interval (µ_{0} - Δ, µ_{0} + Δ), where µ_{0} is given and Δ is a given positive number.

There are, to be sure, null hypotheses that are not regarded beforehand as surely false. One well-known example is the null hypothesis that there is no extrasensory effect in a parapsychological experiment. Other examples are found in crucial tests of well-formulated physical theories. Even in these instances the presence of small measurement biases may make interpretation difficult. A minuscule measurement bias, perhaps stemming from the scoring method, in an extensive parapsychological experiment may give a statistically significant result although no real parapsychological effect is present. In such a case the statistical significance may reflect only the measurement bias, and thus much controversy about parapsychology centers about the nature and magnitude of possible measurement biases, including chicanery, unconscious cues, and biases of scoring.

A sharp attack on significance testing, along the lines of this section and others, is given by L. J. Savage (1954, chapter 16).

### Several tests on the same data

Frequently, two or more hypothesis tests are carried out on the same data. Although each test may be at, say, the .05 level of significance, one may ask about the *joint* behavior of the tests. For example, it may be important to know the probability of Type I error for both of two tests together. When the test statistics are statistically independent, there is no problem; the probability, for example, that at least one of two tests at level .05 will give rise to Type I error is 1 - (.95)^{2} = .0975. But in general the tests are statistically dependent, and analogous computations may be difficult. Yet it is important to know, for example, when two tests are positively associated, in the sense that given a Type I error by one, the other is highly likely to have a Type I error.
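The effect of dependence on the joint Type I error can be explored by simulation. The sketch below assumes two two-sided tests whose statistics are standard normal under their null hypotheses with a chosen correlation `rho`; the function name, seed, and trial count are illustrative choices.

```python
import random
from statistics import NormalDist


def joint_type_one_rate(rho, alpha=0.05, trials=50_000, seed=1):
    """Estimate the chance that at least one of two two-sided z tests
    rejects when both null hypotheses are true and the two test
    statistics have correlation rho."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    hits = 0
    for _ in range(trials):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second statistic with correlation rho to the first.
        z2 = rho * z1 + (1.0 - rho * rho) ** 0.5 * rng.gauss(0.0, 1.0)
        if abs(z1) > crit or abs(z2) > crit:
            hits += 1
    return hits / trials


for rho in (0.0, 0.5, 0.9, 1.0):
    print(rho, joint_type_one_rate(rho))
```

With `rho = 0` the estimate should be near the independent value .0975, and with `rho = 1` the two tests always agree, so the joint rate falls back to about .05; intermediate correlations lie in between.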

When a moderate to large number of tests are carried out on the same data, there is generally a high probability that a few will show statistical significance even when all the null hypotheses are true. The reason is that although the tests are dependent, they are not completely so, and by making many tests one increases the probability that at least some show statistical significance. [*Some ways of mitigating these problems for special cases are described in* Linear hypotheses, *article on* multiple comparisons.]

### One-sided versus two-sided tests

There has been much discussion in the social science literature (especially the psychological literature) of when one-sided or two-sided tests should be used. If a one-sided test is performed, with the choice of sidedness made tendentiously after inspection of the data, then the nominal significance level is grossly distorted; in simple cases the actual significance level is twice the nominal one.

Suppose that in the sleep-deprivation example the data show an average score that is positive. “Aha,” says the investigator, “I will test the null hypothesis with a right-tail test, against positive means as alternatives.” Clearly, if he pursues that policy for observed positive average scores and the opposite policy for observed negative average scores, the investigator is unwittingly doing a two-tail test with double the stated significance level. This is an insidious problem because it is usually easy to rationalize a choice of sidedness post hoc, so that the investigator may be fooling both himself and his audience.
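The doubling can be checked by simulation. A minimal sketch, assuming a test statistic that is standard normal when the null hypothesis is true and an investigator who always tests in the direction the data happen to point (names, seed, and trial count are illustrative):

```python
import random
from statistics import NormalDist


def post_hoc_one_sided_rate(alpha=0.05, trials=100_000, seed=2):
    """Actual Type I error rate when the tail of a nominal level-alpha
    one-sided test is chosen after seeing the sign of the statistic."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha)   # one-tail critical value
    rejections = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)                # statistic under a true null hypothesis
        # Right-tail test if z is positive, left-tail test if it is negative.
        if (z > 0 and z > crit) or (z < 0 and -z > crit):
            rejections += 1
    return rejections / trials


print(post_hoc_one_sided_rate())
```

The estimated rate should be close to .10, twice the nominal .05, because the policy amounts to rejecting whenever the absolute value of the statistic exceeds the one-tail critical value.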

The same problem occurs in the choice of test statistic in general. If six drugs are compared in their effects on arithmetic skill and only the observations on those two drugs with least and greatest observed effects are chosen for a test statistic, with no account taken of the choice in computing critical values, an apparently statistically significant result may well just reflect random error, in the sense that the true significance level is much higher than the nominal one. [Some methods of dealing with this are described in Linear hypotheses, article on multiple comparisons.]

*Three-decision procedures*. Many writers have worried about what to do after a two-tail test shows statistical significance. One might conclude, for example, that sleep deprivation has an effect on average score, but one wishes to go further and say that it has a positive or a negative effect, depending on the sign of the sample average. Yet significance testing, regarded stringently as a two-decision procedure, makes no provision for such conclusions about sign of effect when a two-sided test rejects the null hypothesis.

What is really wanted in such a case may well be a three-decision procedure: one either accepts H_{0}, rejects H_{0} in favor of alternatives on one side, or rejects H_{0} in favor of alternatives on the other side. There is no reason why such a procedure should not be used, and it is, in effect, often used when the user says that he is carrying out a significance test. A major new consideration with this kind of three-decision procedure is that one now has six, rather than two, kinds of error whose probabilities are relevant; for each possible hypothesis there are two erroneous decisions. In some simple cases symmetry reduces the number of different probabilities from six to three; further, some of the probabilities may be negligibly small, although these are probabilities of particularly serious errors (for a discussion, see Kaiser 1960). A three-decision procedure may often be usefully regarded as a composition of two one-sided tests. [See Hypothesis testing.]

A variety of other multiple-decision procedures have been considered, generally with the aim of mitigating oversimplification in the significance-testing approach. A common case is that in which one wants to choose the best of several populations, where “best” refers to largest expected value. [A pioneering investigation along this line is Mosteller 1948. See also Decision theory; Screening and selection.]

### Hypotheses suggested by data

In the course of examining a body of data, the analyst may find that the data themselves suggest one or more kinds of structure, and he may decide to carry out tests of corresponding null hypotheses from the very data that suggested these hypotheses. One difficulty here is statistical dependence (described above) between the tests, but there is a second difficulty, one that appears even if there is only a single significance test. In a sense, this difficulty is just a general form of the post hoc one-sided choice problem.

Almost any set of data, of even moderate size and complexity, will show anomalies of some kind when examined carefully, even if the underlying probabilistic structure is wholly random–that is, even if the observations stem from random variables that are independent and identically distributed. By looking carefully enough at random data, one can generally find some anomaly–for example, clustering, runs, cycles–that gives statistical significance at customary levels although no real effect is present. The explanation is that although any particular kind of anomaly will occur in random data with .05 statistical significance just 5 times out of 100, so many kinds of anomalies are possible that at least one will very frequently appear. Thus, use of the same data both to suggest and to test hypotheses is likely to generate misleading statistical significance.

Most sets of real data, however, are not completely random, and one does want to explore them, to form hypotheses and to test these same hypotheses. One can sometimes use part of the data for forming hypotheses and the remainder of the data to examine the hypotheses. This is, however, not always feasible, especially if the data are sparse. Alternatively, if further data of the same kind are available, then one can use the earlier data to generate hypotheses and the later data to examine them. Again, this approach is not always possible.

As an example, consider one of the earliest instances of significance testing, by Daniel Bernoulli and his father John in 1734, as described by Todhunter ([1865] 1949, secs. 394-397). Astronomers had noticed that the orbital planes of the planets are all close together. Is this closeness more than one might expect if the orbital planes were determined randomly? If so, then presumably there is a physical reason for the closeness. The Bernoullis first attempted to make precise what is meant by randomness; in modern terminology this would be specification of randomness as a null hypothesis. They then computed the probability under randomness that the orbital planes would be as close together as, or closer than, the observed planes. This corresponds to deciding on a test statistic and computing a one-tail P-value. The resulting probability was very small and the existence of a physical cause strongly suggested. [For biographical material on the Bernoullis, see Bernoulli family.]

Todhunter, in his History, described other early instances of significance testing, the earliest in 1710 by John Arbuthnot on the human sex ratio at birth (the relevant sections in Todhunter [1865] 1949 are 343-348, 617-622, 888, 915, and 987). Significance testing very much like that of the Bernoullis continues; a geophysical example, relating to the surprising degree of land-water antipodality on the earth’s surface, has appeared recently (Harrison 1966). The whole topic received a detailed discussion by Keynes ([1921] 1952, chapter 25). Up-to-date descriptions of astronomical theories about the orbital planes of the planets have been given by Struve and Zebergs (1962, chapter 9).

Quite aside from the issue of whether it is reasonable to apply probabilistic models to such unique objects as planetary orbits (or to the 50 states, as in an earlier example), there remains the difficulty that a surprising concatenation was noted in the data, and a null hypothesis and test statistic fashioned around that concatenation were considered. Here there is little or no opportunity to obtain further data (although more planets were discovered after the Bernoullis and, in principle, planetary systems other than our sun’s may someday be observable). Yet, in a way, the procedure seems quite reasonable as a means of measuring the surprisingness of the observed closeness of orbital planes. I know of no satisfactory resolution of the methodological difficulties here, except for the banal moral that when testing hypotheses suggested during the exploration of data, one should be particularly cautious about coming to conclusions. (In the Bayesian approach to statistics, the problem described here does not arise, although many statisticians feel that fresh problems are introduced [see Bayesian inference].)

An honest attempt at answering the questions, What else would have surprised me? Was I bound to be surprised? is well worthwhile. For example, to have observed planetary orbits symmetrically arranged in such a way that their perpendiculars nearly formed some rays of a three-dimensional asterisk would also have been surprising and might well be allowed for.

### Difficulties in determining reference set

Some statisticians have been concerned about difficulties in determining the proper underlying reference probabilities for sensibly calculating a P-value. For example, if sample size is random, even with a known distribution, should one compute the P-value as if the realized sample size were known all along, or should one do something else? Sample size can indeed be quite random in some cases; for example, an experimenter may decide to deal with all cases of a rare mental disorder that come to a particular hospital during a two-month period. [Three papers dealing with this kind of problem are Barnard 1947; Cox 1958; and Anscombe 1963. See also Likelihood.]

Random sample sizes occur naturally in that part of statistics called sequential analysis. Many sequential significance testing procedures have been carefully analyzed, although problems of determining reference sets nonetheless continue to exist. [See Sequential analysis.]

### Optional stopping

Closely related to the discussion of the preceding section is the problem of optional stopping. Suppose that an experimenter with extensive resources and a tendentious cast of mind undertakes a sequence of observations and from time to time carries out a significance test based on the observations at hand. For the usual models and tests, he will, sooner or later, reach statistical significance at any preassigned level, even if the null hypothesis is true. He might then stop taking observations and proclaim the statistical significance as if he had decided the sample size in advance. (The mathematical background of optional stopping is described in Robbins 1952.)
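How quickly optional stopping inflates the error rate can be sketched by simulation. The setting below is an assumption made for illustration: standard normal observations with known unit variance, a two-sided z test recomputed after every observation, and a cap `max_n` on the number of observations (the cap, seed, and trial count are arbitrary choices).

```python
import random
from statistics import NormalDist


def optional_stopping_rate(max_n=200, alpha=0.05, trials=2_000, seed=3):
    """Fraction of simulated experiments, with the null hypothesis true,
    that reach nominal significance at some point up to max_n."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    stopped = 0
    for _ in range(trials):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)       # one more observation, mean 0, sd 1
            if abs(total) / n ** 0.5 > crit:   # z test on all data so far
                stopped += 1                   # experimenter stops and reports
                break
    return stopped / trials


for cap in (1, 10, 50, 200):
    print(cap, optional_stopping_rate(max_n=cap))
```

With a single look the rate is near the nominal level; with repeated looks it climbs well above it, and it continues to grow as the cap is raised.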

For the standard approach to significance testing, such optional stopping is as misleading and reprehensible as the suppression of unwanted observations. Even for an honest experimenter, if the sampling procedure is not firmly established in advance, a desire to have things turn out one way or another may unconsciously influence decisions about when to stop sampling. If the sampling procedure is firmly established in advance, then, at least in principle, characteristics of the significance test can be computed in advance; this is an important part of sequential analysis.

Optional stopping is, of course, relevant to modes of statistical analysis other than significance testing. It poses no problem for approaches to statistics that turn only on the observed likelihood, but many statisticians feel that these approaches are subject to other difficulties that are at least equally serious. [See Bayesian inference; Likelihood.]

### Simplicity and utility of hypotheses

It is usually the case that a set of data will be more nearly in accord with a complicated hypothesis than with a simpler hypothesis that is a special case of the complicated one. For example, if the complicated hypothesis has several unspecified parameters whereas the simpler one specializes by taking some of the parameters at fixed values, a set of data will nearly always be better fit by the more complicated hypothesis than by the simpler one just because there are more parameters available for fitting: a point is usually farther away from a given line than from a given plane that includes the line; in the polynomial regression context, a linear regression function will almost never fit as well as a quadratic, a quadratic as well as a cubic, and so on.
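The polynomial regression remark can be illustrated numerically. The sketch below fits least-squares polynomials of increasing degree to simulated data that are in truth linear plus noise (the data, seed, and helper names are invented for the illustration) and compares residual sums of squares.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def poly_rss(xs, ys, degree):
    """Residual sum of squares of the least-squares polynomial fit."""
    k = degree + 1
    # Normal equations for the coefficients of 1, x, x^2, ...
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    coef = solve(a, b)
    return sum((y - sum(c * x ** p for p, c in enumerate(coef))) ** 2
               for x, y in zip(xs, ys))


rng = random.Random(0)
xs = [i / 10 for i in range(-10, 11)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.5) for x in xs]   # truly linear, plus noise
rss_by_degree = [poly_rss(xs, ys, d) for d in (1, 2, 3)]
print(rss_by_degree)
```

The residual sum of squares cannot increase as the degree goes up, even though the extra terms here correspond to nothing real; better fit alone is therefore weak evidence for the more complicated hypothesis.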

Yet one often prefers a simpler hypothesis to a better-fitting more complicated one. This preference, which undoubtedly has deep psychological roots, poses a perennial problem for the philosophy of science. One way in which the problem is reflected in significance testing is in the traditional use of small significance levels. The null hypothesis is usually simpler than the alternatives, but one may be unwilling to abandon the null hypothesis unless the evidence against it is strong.

Hypotheses may be intrinsically comparable in ways other than simplicity. For example, one hypothesis may be more useful than another because it is more closely related to accepted hypotheses for related, but different, kinds of observations.

The theory of significance testing, however, takes no explicit account of the simplicity of hypotheses or of other aspects of their utility. A few steps have been made toward incorporating such considerations into statistical theory (see Anderson 1962), but the problem remains open.

### Importance of significance testing

Significance testing is an important part of statistical theory and practice, but it is only one part, and there are other important ones. Because of the relative simplicity of its structure, significance testing has been overemphasized in some presentations of statistics, and as a result some students come mistakenly to feel that statistics is little else than significance testing.

### Other approaches to significance testing

This article has been limited to the customary approach to significance testing based on the frequency concept of probability. For other concepts of probability, procedures analogous to significance testing have been considered. [See Bayesian inference. An extensive discussion is given in Edwards et al. 1963.] Anscombe (1963) has argued for a concept of significance testing in which only the null hypothesis, not the alternatives, plays a role.

William H. Kruskal

[*See also* Hypothesis testing.]

## BIBLIOGRAPHY

Anderson, T. W. 1962 The Choice of the Degree of a Polynomial Regression as a Multiple Decision Problem. Annals of Mathematical Statistics 33:255-265.

Anscombe, F. J. 1963 Tests of Goodness of Fit. Journal of the Royal Statistical Society Series B 25:81-94.

Bakan, David 1966 The Test of Significance in Psychological Research. Psychological Bulletin 66:423-437.

Bancroft, T. A. 1964 Analysis and Inference for Incompletely Specified Models Involving the Use of Preliminary Test(s) of Significance. Biometrics 20:427-442.

Barnard, G. A. 1947 The Meaning of a Significance Level. Biometrika 34:179-182.

Boring, Edwin G. 1919 Mathematical vs. Scientific Significance. Psychological Bulletin 16:335-338.

Cox, David R. 1958 Some Problems Connected With Statistical Inference. Annals of Mathematical Statistics 29:357-372.

Edwards, Ward; Lindman, Harold; and Savage, Leonard J. 1963 Bayesian Statistical Inference for Psychological Research. Psychological Review 70:193-242.

Good, I. J. 1956 The Surprise Index for the Multivariate Normal Distribution. Annals of Mathematical Statistics 27:1130-1135.

Good, I. J. 1958 Significance Tests in Parallel and in Series. Journal of the American Statistical Association 53:799-813.

Harrison, Christopher G. A. 1966 Antipodal Location of Continents and Oceans. Science 153:1246-1248.

Hirschi, Travis; and Selvin, Hanan C. 1967 Methods in Delinquency Research. New York: Free Press. → See especially Chapter 13, “Statistical Inference.”

Hodges, J. L. Jr.; and Lehmann, E. L. 1954 Testing the Approximate Validity of Statistical Hypotheses. Journal of the Royal Statistical Society Series B 16:261-268.

Kaiser, Henry F. 1960 Directional Statistical Decisions. Psychological Review 67:160-167.

Keynes, John Maynard (1921) 1952 A Treatise on Probability. London: Macmillan. → A paperback edition was published in 1962 by Harper.

Kitagawa, Tosio 1963 Estimation After Preliminary Tests of Significance. University of California Publications in Statistics 3:147-186.

Moses, Lincoln E. 1956 Statistical Theory and Research Design. Annual Review of Psychology 7:233-258.

Mosteller, Frederick 1948 A k-sample Slippage Test for an Extreme Population. Annals of Mathematical Statistics 19:58-65.

Mosteller, Frederick; and Bush, Robert R. (1954) 1959 Selected Quantitative Techniques. Volume 1, pages 289-334 in Gardner Lindzey (editor), Handbook of Social Psychology. Cambridge, Mass.: Addison-Wesley.

Neyman, Jerzy 1958 The Use of the Concept of Power in Agricultural Experimentation. Journal of the Indian Society of Agricultural Statistics 9, no. 1:9-17.

Robbins, Herbert 1952 Some Aspects of the Sequential Design of Experiments. American Mathematical Society, Bulletin 58:527-535.

Savage, Leonard J. 1954 The Foundations of Statistics. New York: Wiley.

Sterling, Theodore D. 1959 Publication Decisions and Their Possible Effects on Inferences Drawn From Tests of Significance–or Vice Versa. Journal of the American Statistical Association 54:30-34.

Struve, Otto; and Zebergs, Velta 1962 Astronomy of the 20th Century. New York: Macmillan.

Todhunter, Isaac (1865) 1949 A History of the Mathematical Theory of Probability From the Time of Pascal to That of Laplace. New York: Chelsea.

Tullock, Gordon 1959 Publication Decisions and Tests of Significance: A Comment. Journal of the American Statistical Association 54:593 only.

Weaver, Warren 1948 Probability, Rarity, Interest, and Surprise. Scientific Monthly 67:390-392.

Wittenborn, J. R. 1952 Critique of Small Sample Statistical Methods in Clinical Psychology. Journal of Clinical Psychology 8:34-37.

Woofter, T. J. Jr. 1933 Common Errors in Sampling. Social Forces 11:521-525.