Test Statistics


Hypothesis testing or significance testing is undoubtedly one of the most widely used quantitative methodologies in empirical research in the social sciences. It is one viable way to use statistics to examine a hypothesis in light of observations or sample information. The starting point of hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Then a test statistic is chosen to summarize the sample information, and its value is taken as an indication of the strength of sample evidence against the null hypothesis.

Modern hypothesis testing dates to the 1920s and the work of Ronald Aylmer Fisher (1890–1962) on the one hand, and Jerzy Neyman (1894–1981) and Egon Pearson (1895–1980) on the other. Fisher (1925) refers to hypothesis testing as significance testing (this entry does not distinguish between the two terms). In the Fisherian approach, the observed test statistic is converted to the P-value, which is the probability of obtaining the observed or a more extreme value of the test statistic under the null model; the smaller the P-value, the stronger the sample evidence against the null hypothesis. An early example of Fisher's significance testing was conducted in 1735 by the father and son Swiss mathematicians Daniel Bernoulli (1700–1782) and John Bernoulli (1667–1748). They tested for the random/uniform distribution of the inclinations of the planetary orbits. A detailed discussion of their original results and subsequent modifications of their results can be found in Anders Hald (1998).

In the Neyman-Pearson (1928, 1933) approach, an alternative hypothesis is specified and the null hypothesis is tested against this alternative hypothesis. The specification of an alternative hypothesis allows the computation of the probabilities of two types of error: Type I error (the error of falsely rejecting a null hypothesis) and Type II error (the error of incorrectly accepting a null hypothesis). The probability of Type I error is also referred to as the significance level of the test, and one minus the probability of Type II error as the power of the test. Given that the two error probabilities cannot be minimized simultaneously, the common practice is to specify the level of significance (the probability of Type I error) and then use a test that maximizes power subject to the given significance level. In the Fisherian approach, the P-value is reported without necessarily announcing the rejection or nonrejection of the null hypothesis, whereas in the Neyman-Pearson approach, the null hypothesis is either rejected in favor of the alternative hypothesis or not rejected at the given significance level. E. L. Lehmann (1993) provides a more detailed comparison of the two approaches.
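The trade-off between the two error probabilities can be illustrated with a simple case that is not taken from this entry: a one-sided z-test of H0: μ = 0 against H1: μ = μ1 > 0 for a normal population with known standard deviation. The function name and the numerical values below are illustrative assumptions; the sketch only shows how lowering the significance level lowers the power.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(alpha, mu1, sigma, n):
    """Power of the one-sided z-test of H0: mu = 0 against H1: mu = mu1 > 0.

    Assumes a random sample of size n from a normal population with known sigma.
    """
    std_normal = NormalDist()
    # Reject H0 when the z-statistic exceeds this critical value.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Under H1 the z-statistic is normal with mean mu1 * sqrt(n) / sigma and variance 1.
    shift = mu1 * sqrt(n) / sigma
    prob_type_2 = std_normal.cdf(z_crit - shift)  # P(do not reject H0 | H1 true)
    return 1 - prob_type_2


# Hypothetical setting: mu1 = 0.5, sigma = 1, n = 25.
for alpha in (0.10, 0.05, 0.01):
    pw = power_one_sided_z(alpha, mu1=0.5, sigma=1.0, n=25)
    print(f"alpha = {alpha:.2f}  power = {pw:.3f}  P(Type II) = {1 - pw:.3f}")
```

When μ1 is set to 0 the alternative coincides with the null, and the computed power equals the significance level, which is a quick check on the formula.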

In empirical research, a mixture of the two approaches is typically adopted. Consider the linear regression model:

$$Y_i = \beta_1 + \beta_2 X_{i2} + \cdots + \beta_K X_{iK} + \varepsilon_i, \quad i = 1, \ldots, n, \qquad (1)$$

where $\{Y_i, X_{i2}, \ldots, X_{iK}\}_{i=1}^{n}$ is the set of observations on the dependent variable $Y$ and the explanatory variables $X_2, \ldots, X_K$, and $\varepsilon_i$ is the unobserved error term. The parameters $\beta_2, \ldots, \beta_K$ measure the ceteris paribus effects of the explanatory variables on the dependent variable. The significance of these effects is routinely tested by the t-tests and the F-test. The t-test was discovered by William Sealy Gosset (1876–1937) for the mean of a normal population and extended by Fisher in 1925 to other contexts, including regression coefficients. Gosset's result was published in Biometrika under the pseudonym Student in 1908. The F-test was originally developed by Fisher in the context of testing the ratio of two variances. Fisher pointed out many other applications of the F-test, including the significance of the complete regression model.

For a given $j = 2, \ldots, K$, the null hypothesis for the corresponding t-test is $H_{0j}: \beta_j = 0$ and the t-statistic is

$$t_j = \frac{b_j}{\mathrm{se}(b_j)},$$

where $b_j$ denotes the ordinary least squares estimator of $\beta_j$ and $\mathrm{se}(b_j)$ denotes the standard error of $b_j$. Note that if the null $H_{0j}$ is true, the explanatory variable $X_j$ would be absent from the regression model (1) and thus considered to be insignificant in explaining the dependent variable given the presence of the other explanatory variables. This is why t-tests are referred to as tests for the significance of individual variables, as opposed to the F-test, which tests for the significance of the complete regression. The null hypothesis for the F-test is

$$H_0: \beta_2 = \beta_3 = \cdots = \beta_K = 0.$$

There are several equivalent formulas for computing the F-statistic, one of which is

$$F = \frac{R^2 / (K - 1)}{(1 - R^2) / (n - K)},$$

where $R^2$ is the coefficient of determination. Since under $H_0$ all the explanatory variables can be dropped from (1), the F-test is a test for the significance of the complete regression.
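The two statistics can be computed directly from their definitions. The following is a minimal sketch, not a replacement for statistical software: it fits the regression by ordinary least squares using only the Python standard library, and the function names (`invert`, `ols_tests`) and the small data set are hypothetical.

```python
from math import sqrt


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    k = len(a)
    aug = [list(row) + [float(i == j) for j in range(k)] for i, row in enumerate(a)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(k):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[k:] for row in aug]


def ols_tests(y, x_cols):
    """Return (b, se, t, F) for a regression of y on an intercept and x_cols."""
    n = len(y)
    X = [[1.0] + [col[i] for col in x_cols] for i in range(n)]  # n rows, K columns
    K = len(X[0])
    xtx = [[sum(X[i][a] * X[i][c] for i in range(n)) for c in range(K)] for a in range(K)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(K)]
    xtx_inv = invert(xtx)
    b = [sum(xtx_inv[a][c] * xty[c] for c in range(K)) for a in range(K)]
    resid = [y[i] - sum(X[i][a] * b[a] for a in range(K)) for i in range(n)]
    rss = sum(e * e for e in resid)               # residual sum of squares
    y_bar = sum(y) / n
    tss = sum((v - y_bar) ** 2 for v in y)        # total sum of squares
    s2 = rss / (n - K)                            # estimate of the error variance
    se = [sqrt(s2 * xtx_inv[a][a]) for a in range(K)]
    t = [b[a] / se[a] for a in range(K)]          # t_j = b_j / se(b_j)
    r2 = 1 - rss / tss
    F = (r2 / (K - 1)) / ((1 - r2) / (n - K))     # F-statistic from R^2
    return b, se, t, F


# Hypothetical data: n = 8 observations, K = 3 parameters (intercept, X2, X3).
y = [2.3, 2.9, 5.1, 4.8, 7.4, 6.9, 9.2, 9.8]
x2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
x3 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0]
b, se, t, F = ols_tests(y, [x2, x3])
print("t-statistics for beta_2 and beta_3:", [round(v, 3) for v in t[1:]])
print("F-statistic:", round(F, 3))
```

With a single explanatory variable (K = 2) the F-statistic equals the square of the slope's t-statistic, which gives a convenient check on any implementation.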

Most statistical software packages routinely calculate the t-statistics and the F-statistic. For a given sample, the observed value of $t_j$ ($F$) summarizes the sample evidence on the significance of the explanatory variable $X_j$ (the significance of the regression (1)). To either convert the observed value of $t_j$ ($F$) to the P-value or make a binary decision on the rejection or nonrejection of the null hypothesis $H_{0j}$ ($H_0$) at a given significance level, the distribution of $t_j$ ($F$) under the corresponding null hypothesis is required. On the basis of the null hypothesis being true and further assumptions on the nature of the sample and on the normality of the error in (1), the distribution of $t_j$ is known to be Student's t with $(n - K)$ degrees of freedom, denoted as $t[n-K]$, and the distribution of $F$ is the so-called F-distribution with $\{(K-1), (n-K)\}$ degrees of freedom, denoted as $F[K-1, n-K]$ (see Goldberger [1991] for details). The known distribution of $t_j$ ($F$) under the null hypothesis allows the computation of the P-value or the computation of the appropriate critical value at a prespecified significance level with which the observed test statistic can be compared.
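The conversion of an observed t-statistic to a two-sided P-value can be sketched as follows. In the classical normal regression model the reference distribution is Student's t with n − K degrees of freedom. The sketch integrates the t density numerically with Simpson's rule so that it needs only the standard library; the function names and the values of n, K, and the observed statistic are hypothetical, and in practice a statistical library would be used instead.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_two_sided_p(t_obs, df, steps=2000):
    """P(|T| >= |t_obs|) under t[df], using Simpson's rule on [0, |t_obs|].

    steps must be even for Simpson's rule.
    """
    a = abs(t_obs)
    if a == 0:
        return 1.0
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = 2 * total * h / 3  # P(-a <= T <= a), by symmetry of the density
    return max(0.0, 1 - central_mass)


# Hypothetical example: n = 13 observations, K = 3 parameters, observed t_j = 2.5.
n, K, t_obs = 13, 3, 2.5
p_value = t_two_sided_p(t_obs, n - K)
print(f"two-sided P-value: {p_value:.4f}")
# Binary decision at the 5% level: compare the P-value with 0.05, or equivalently
# compare |t_obs| with the tabulated critical value 2.228 for 10 degrees of freedom.
print("reject H0 at the 5% level:", p_value < 0.05)
```

Reporting the P-value follows the Fisherian practice, while the final comparison with 0.05 corresponds to the Neyman-Pearson decision at a prespecified level.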

Like the t-tests and the F-test, most standard tests rely on further assumptions in addition to the truth of the null hypothesis, such as the assumption of a random sample and the normality of the error term. These further assumptions may not be met in typical applications in the social sciences, and tests designed on the basis of these assumptions then require modification. For example, when the error term is not normally distributed, the distributions of the t-statistic and F-statistic are no longer $t[n-K]$ and $F[K-1, n-K]$. Fortunately, their asymptotic distributions are known under general conditions and may be used to perform these tests. Alternatively, resampling techniques, such as the bootstrap and subsampling, may be used to approximate the distributions of the test statistics under the null hypothesis (see Efron and Tibshirani [1993] and Politis et al. [1999] for excellent introductions to these methods).
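The resampling idea can be illustrated in a setting simpler than the regression model: a bootstrap test of H0: μ = μ0 for the mean of a single, possibly non-normal sample. This is a minimal sketch of one common scheme, in which the data are shifted so that the null hypothesis holds before resampling; the function names, the data, and the number of replications are illustrative assumptions rather than a prescription from the references above.

```python
import random
from math import sqrt
from statistics import mean, stdev


def t_stat(sample, mu0):
    """One-sample t-statistic for H0: mu = mu0."""
    s = stdev(sample)
    if s == 0:
        return float("inf")  # degenerate resample; counted as extreme (conservative)
    return (mean(sample) - mu0) / (s / sqrt(len(sample)))


def bootstrap_p_value(sample, mu0, reps=2000, seed=0):
    """Two-sided bootstrap P-value for H0: mu = mu0."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    t_obs = t_stat(sample, mu0)
    # Impose the null hypothesis: shift the data so that its mean equals mu0.
    centre = mean(sample)
    shifted = [v - centre + mu0 for v in sample]
    n = len(sample)
    extreme = 0
    for _ in range(reps):
        draw = rng.choices(shifted, k=n)   # resample with replacement
        if abs(t_stat(draw, mu0)) >= abs(t_obs):
            extreme += 1
    # Share of resampled statistics at least as extreme as the observed one.
    return extreme / reps


# Hypothetical right-skewed sample of size 12.
data = [0.2, 0.5, 0.1, 1.9, 0.4, 3.2, 0.3, 0.8, 0.6, 2.4, 0.1, 1.1]
print("bootstrap P-value for H0: mu = 0.5:", bootstrap_p_value(data, 0.5))
```

The empirical distribution of the resampled statistics stands in for the t distribution, so no normality assumption on the data is needed to obtain the P-value.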

The issue that has generated the most debate in hypothesis testing from the beginning is the choice of significance level (Henkel 1976). Given any value of the test statistic, one can always force nonrejection by specifying a low enough significance level or force rejection by choosing a high enough significance level. Although reporting the P-value partly alleviates this arbitrariness in setting the significance level, it is desirable to report estimates of the parameters of interest and their standard errors or confidence intervals so that the likely values of the unknown parameters and the precision of their estimates can be assessed.
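The dependence of the decision on the chosen significance level, and the additional information carried by a confidence interval, can be seen in a short sketch. It assumes the asymptotic standard normal approximation to the t-statistic; the function name and the estimate and standard error used are hypothetical.

```python
from statistics import NormalDist


def decision_and_interval(b, se, alpha):
    """Two-sided test of H0: beta = 0 and the matching (1 - alpha) confidence interval.

    Uses the standard normal approximation to the distribution of b / se.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    reject = abs(b / se) > z
    return reject, (b - z * se, b + z * se)


# Hypothetical estimate and standard error, giving a t-statistic of 2.2.
b, se = 0.44, 0.20
for alpha in (0.10, 0.05, 0.01):
    reject, (lo, hi) = decision_and_interval(b, se, alpha)
    print(f"alpha = {alpha:.2f}  reject H0: {reject}  "
          f"{100 * (1 - alpha):.0f}% interval: ({lo:.3f}, {hi:.3f})")
```

The same estimate is declared significant at the 10 percent and 5 percent levels but not at the 1 percent level, while the intervals show in each case which parameter values remain plausible and how precisely the parameter is estimated.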

SEE ALSO Hypothesis and Hypothesis Testing; Student's T-Statistic


Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. New York: Chapman and Hall.

Fisher, Ronald Aylmer. 1925. Statistical Methods for Research Workers. Edinburgh, U.K.: Oliver and Boyd.

Goldberger, Arthur S. 1991. A Course in Econometrics. Cambridge, MA: Harvard University Press.

Hald, Anders. 1998. A History of Mathematical Statistics from 1750 to 1930. New York: Wiley.

Henkel, Ramon E. 1976. Tests of Significance. Beverly Hills, CA: Sage.

Lehmann, E. L. 1993. The Fisher, Neyman-Pearson Theories of Testing Hypotheses: One Theory or Two? Journal of the American Statistical Association 88: 1242–1249.

Neyman, Jerzy, and Egon S. Pearson. 1928. On the Use and Interpretation of Certain Test Criteria for Purposes of Statistical Inference. Biometrika 20A: 175–240, 263–294.

Neyman, Jerzy, and Egon S. Pearson. 1933. On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society of London, Ser. A 231: 289–337.

Politis, Dimitris N., Joseph P. Romano, and Michael Wolf. 1999. Subsampling. New York: Springer.

Student (William Sealy Gosset). 1908. The Probable Error of a Mean. Biometrika 6 (1): 1–25.

Yanqin Fan