Statistics: Basic Concepts of Classical Inference
Statistics may be defined as the study and informed application of methods for drawing conclusions about the world from fallible observations. It has three distinct components: (1) It is based on the mathematical theory of probability, (2) as inductive inference it belongs to the philosophy of science, and (3) its subject matter is any of a wide range of empirical disciplines.
Humanity has been counting, measuring, and recording from antiquity, but the formal history of statistics dates to the first systematic analyses of official registries in the seventeenth century. The origin of the name is from the eighteenth century, the German Statistik, meaning "study of the state" or political science (generally qualitative). It was appropriated in the 1780s for use in English as statistics, an unusual new name for the quantitative analysis of conditions in a country (replacing political arithmetic), in order to attract public attention (Pearson 1978). Applied subsequently to measurement error in astronomy, the statistical approach using probability spread in the nineteenth century to social phenomena, to physics, and then to biology. Formal statistical inference came into being around the turn of the twentieth century, motivated in large measure by the study of heredity and evolution.
Intensive developments of theory and methodology, with the enormous impact of the electronic computer, have made statistics the most widely used mathematical discipline, applied to virtually every area of human endeavor. Analysis and interpretation of empirical results is basic to much of modern technology and the controversies surrounding its use. Statistical methodology, readily available in computer software packages, is easy to apply but not so easy to understand. Lack of professional competence, conflicts of interest, and oversimplified reporting by the media pose real dangers of abuse. Yet intelligent participation in the shaping of public policy requires the insights of a thoughtful, well-informed electorate.
There is a vast and constantly growing body of statistical methods, but the most commonly reported results employ the classical, or Neyman-Pearson, theory of statistical inference. Presented herein are the basic concepts of the classical theory in concise form. Further details, with many examples, can be found in textbooks on various levels of mathematical sophistication.
Descriptive versus Inferential Statistics
Statistics can be understood as descriptive or inferential. Descriptive statistics are methods for organizing, summarizing, and communicating data, or the data themselves. The resulting tables and graphs may represent complete information on a subject or only a selected sample. Inferential statistics, the subject here, refers to methods for reaching conclusions extending beyond the observations actually made, to statements about large classes of potential observations. It is inference from a sample, beyond its description.
From Sample to Probability
Statistics begins with data collected to explore a question, expressible in quantitative form, about some large target population (of people or objects). It is often impossible to observe the entire population of interest, and therefore a sample is selected from the best available group, sometimes called the sampled population to distinguish it from the target population.
RANDOM SAMPLE. The sample, on which the inference will be based, should be representative of the population, and thus be selected at random. This means that each member of the population should have an equal chance of being selected—an aim that in real-life situations can at best be approximately met. For example, to determine what proportion of patients with a certain type of cancer would benefit from a new treatment, the outcome of interest could be the proportion surviving for one year after diagnosis, with the study sample drawn from patients being seen in a particular hospital. The representativeness of the sample is always a key question in statistics.
STABLE RELATIVE FREQUENCY. It is known from experience that the observed proportion of a characteristic of a population becomes stable with increasing sample size. For example, the relative frequency of boys among the newborn fluctuates widely when studied in samples of size 10, and less so with samples of size 50. When based on samples of size 250, it is seen to settle just above .5, around the well-established value of .51. It is the observed stability of frequency ratios with increasing sample size that connects statistics with the mathematical concept of probability.
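The stabilizing of relative frequency can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: births are simulated as independent events with probability .51 of a boy, and the function name `relative_frequency` and the seeds are hypothetical choices, not part of any real data set.

```python
import random

def relative_frequency(n, p=0.51, seed=0):
    """Proportion of boys in n simulated births, each a boy with probability p."""
    rng = random.Random(seed)  # seeded so that a run can be repeated exactly
    boys = sum(1 for _ in range(n) if rng.random() < p)
    return boys / n

# Five independent samples at each sample size: the spread narrows as n grows.
for n in (10, 50, 250, 10_000):
    freqs = [relative_frequency(n, seed=s) for s in range(5)]
    print(f"n = {n:>6}: " + ", ".join(f"{f:.3f}" for f in freqs))
```

With small n the five proportions typically differ widely; with large n they cluster near the assumed value of .51.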
FREQUENTIST DEFINITION OF PROBABILITY. Classical statistical inference uses the frequentist definition of probability: The probability of an event denotes the relative frequency of occurrence of that event in the long run. This definition is reflected in a fundamental principle of probability, the law of large numbers: In the long run, the relative frequency of occurrence of an event approaches its probability. The probability may be known from the model, such as obtaining a six with a balanced die, namely 1/6. This is an example of the classical definition of probability, pertaining to a finite number of equally likely outcomes. Otherwise by definition the probability is whatever is obtained as long-run relative frequency. The size of the sample is of central importance in all applications.
The frequentist definition is embedded in the axiomatic approach to probability, which integrates statistics into the framework of modern mathematics. There are three basic axioms, using concepts of the theories of sets and measure. Expressed simply, the axioms state that: (1) the probability of any event (set) in the sample space of events is a number between 0 and 1, (2) the probability of the entire sample space is 1, and (3) if two events are mutually exclusive (only one of them can occur), then the probability that one or the other occurs is the sum of their probabilities.
RANDOM VARIABLES AND THEIR DISTRIBUTIONS. The numerical or coded value of the outcome of interest in a statistical study is called a random variable. The yes/no survival status of a cancer patient one year after diagnosis is a binary random variable. In a sample of size n, the number of patients surviving is some number Sn between 0 and n, called a binomial random variable. Sn/n is the relative frequency of surviving, and 1 – Sn/n the relative frequency of not surviving one year. The distribution of Sn, to be discussed below, is the binomial distribution showing the probabilities of all possible outcomes between 0 and n. An example of a continuous random variable X is the diastolic blood pressure (in millimeters of mercury) of patients treated for hypertension, at a given point of treatment. The relative frequency of different values assumed by X is the observed distribution of the random variable.
The concrete examples of a random variable and its distribution have direct counterparts in the mathematical theory of probability, and these are used in the development of methods of inference. A random sample of size n in statistics is considered a sample of n independent, identically distributed random variables, with independence a well-defined mathematical concept. These are abstract notions, often omitted in elementary presentations that give only the computational formulas. But they are the essential link for going from an observed set of numbers (the starting point of statistics) to mathematical entities that are the building blocks of the theory on which the methods of statistics are based.
PARAMETERS OF A DISTRIBUTION. The probability distribution of a random variable X describes how the probabilities are distributed over the values assumed by X along the real line; the sum of all probabilities is 1. The distribution is defined by parameters, constants that specify the location (central value) and shape of the distribution, often denoted by Greek letters. The most commonly used location parameter is the mean or expected value of X, E(X), denoted by μ ("mu"). E(X) is the weighted average of all possible outcomes of a random variable, weighted by the probabilities of the respective outcomes. A parameter that specifies the spread of the distribution is the variance of the random variable X, Var(X), defined as E[(X − μ)²] and denoted by σ² ("sigma squared"). It is the expected value of the squared deviation of X from the mean of the distribution. The square root of the variance, σ, is called the standard deviation of X.
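As a small worked illustration of these definitions, the sketch below computes the mean and variance of a balanced die as probability-weighted averages. The helper name `mean_and_variance` is hypothetical, and exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

def mean_and_variance(distribution):
    """distribution maps each outcome to its probability (probabilities sum to 1)."""
    mu = sum(x * p for x, p in distribution.items())               # E(X)
    var = sum((x - mu) ** 2 * p for x, p in distribution.items())  # E[(X - mu)^2]
    return mu, var

# Balanced die: six equally likely outcomes, each with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, var = mean_and_variance(die)
print(mu, var)            # 7/2 35/12
print(float(var) ** 0.5)  # standard deviation, the square root of the variance
```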
THE BINOMIAL DISTRIBUTION. An important distribution, often encountered in practice, deals with counting outcomes and computing proportions or percentages. Repeating an experiment with a binary outcome independently n times, with the same probability p of success each time, yields the binomial distribution specified by the parameters n and p. The random variable X, defined as the number of successes in n trials, can have any value r between 0 and n, with probability function
$$P(X = r) = C(n, r)\, p^{r} (1 - p)^{n - r}, \qquad r = 0, 1, \ldots, n,$$
where C(n, r) is the number of combinations of n things taken r at a time and has the form
$$C(n, r) = \frac{n!}{r!\,(n - r)!}.$$
(n!, called "n factorial," is the product of the integers from 1 to n, with 0! = 1. For example, 4! = 1 × 2 × 3 × 4 = 24.) It can be shown that for a binomial random variable, E(X) = np and Var(X) = np(1 − p). As the sum of n outcomes coded 0 or 1, X is also denoted by Sn.
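The binomial probability function and its moments can be checked numerically. The following is a minimal sketch; the values n = 10 and p = .5 are arbitrary illustrative choices, and `binomial_pmf` is a hypothetical helper name.

```python
from math import comb

def binomial_pmf(n, r, p):
    """P(X = r) for a binomial random variable with parameters n and p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 10, 0.5  # illustrative values only
pmf = [binomial_pmf(n, r, p) for r in range(n + 1)]

total = sum(pmf)                                   # should equal 1
mean = sum(r * q for r, q in enumerate(pmf))       # should equal n * p
variance = sum((r - mean) ** 2 * q for r, q in enumerate(pmf))  # n * p * (1 - p)

print(f"sum = {total:.6f}, mean = {mean:.4f}, variance = {variance:.4f}")
```

Summing over all r from 0 to n recovers the total probability of 1, and the weighted averages agree with E(X) = np and Var(X) = np(1 − p).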
THE NORMAL DISTRIBUTION. The most basic distribution in statistics is the normal or Gaussian distribution of a random variable X, defined by the probability density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x - \mu)^2 / (2\sigma^2)}, \qquad -\infty < x < \infty,$$
where μ is the mean and σ is the standard deviation. The formula includes the constants π ≈ 3.142 and e ≈ 2.718, the base of the natural logarithm.
One reason for the importance of this equation is that many variables observed in nature follow an approximate normal distribution. Figure 1 shows frequency histograms of two samples, of height and diastolic blood pressure, with the corresponding normal distribution. The smoother fit in Figure 1a is the result of the far larger sample size as compared with the number of observations used in Figure 1b.
THE STANDARD NORMAL DISTRIBUTION. An important special case of the normal distribution is the standard normal, with mean 0 and standard deviation 1, obtained by the transformation
$$Z = \frac{X - \mu}{\sigma}.$$
Any normal variable can be transformed to the extensively tabled standard form, and the related probabilities remain the same. Figure 2 shows areas under the normal curve in regions defined by the mean and standard deviation, for both the X-scale and Z-scale. It is useful to remember that for a normally distributed random variable, about 95 percent of the observations lie within two standard deviations of the mean.
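The equivalence of probabilities on the X-scale and the Z-scale can be verified directly. The sketch below uses Python's standard-library `statistics.NormalDist`; the mean of 80 and standard deviation of 10 are hypothetical values chosen only for illustration, and `normal_pdf` and `standardize` are illustrative helper names.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Normal density written out from the formula, for comparison with the library."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def standardize(x, mu, sigma):
    """Transform a value on the X-scale to the Z-scale."""
    return (x - mu) / sigma

mu, sigma = 80.0, 10.0  # hypothetical mean and standard deviation
X = NormalDist(mu, sigma)
Z = NormalDist()        # standard normal: mean 0, standard deviation 1

# Probability of falling within two standard deviations of the mean.
lo, hi = mu - 2 * sigma, mu + 2 * sigma
p_x = X.cdf(hi) - X.cdf(lo)
p_z = Z.cdf(standardize(hi, mu, sigma)) - Z.cdf(standardize(lo, mu, sigma))

print(f"X-scale: {p_x:.4f}, Z-scale: {p_z:.4f}")
print(f"density at 85: {normal_pdf(85, mu, sigma):.5f} vs {X.pdf(85):.5f}")
```

Both probabilities come out at about .9545, which is the "about 95 percent" rule stated above.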
THE SAMPLE MEAN. Statistical inference aims to characterize a population from a sample, and interest is often in the sample mean as an estimate of the population mean. Given a sample of n random variables X1, X2, ... , Xn, the sample mean is defined as
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$
If the variables are independently distributed, each with mean μ and variance σ², then the standard error of the mean is
$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}.$$
For simplicity of notation, the symbols M and SE are used below for the sample mean and its standard error.
THE CENTRAL LIMIT THEOREM. The normal distribution plays a special role in statistics also because of the basic principle of probability known as the central limit theorem: In general, for very large values of n, the sample mean has an approximate normal distribution. More specifically, if X1, X2, ... , Xn are n independent, identically distributed random variables with mean μ and variance σ², then the distribution of their standardized mean
$$Z = \frac{M - \mu}{SE} = \frac{M - \mu}{\sigma/\sqrt{n}}$$
tends to the standard normal distribution as n → ∞. Nothing is said here about the shape of the underlying distribution. This principle, observed empirically and proved with increasingly greater precision and generality, is important to much of statistical theory and methodology.
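The theorem can be explored by simulation. The sketch below assumes, purely for illustration, a strongly skewed underlying distribution (exponential with mean 1 and standard deviation 1) and checks how often the standardized mean falls within ±1.96, which would be about 95 percent of the time for a standard normal variable. The function names, seed, and number of repetitions are hypothetical choices.

```python
import random
from math import sqrt

def standardized_means(n, reps, seed=0):
    """Standardized sample means (M - mu) / SE of n exponential observations."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0        # mean and SD of the exponential distribution, rate 1
    se = sigma / sqrt(n)        # standard error of the mean
    results = []
    for _ in range(reps):
        m = sum(rng.expovariate(1.0) for _ in range(n)) / n
        results.append((m - mu) / se)
    return results

def fraction_within(zs, c=1.96):
    """Share of standardized means lying strictly inside (-c, c)."""
    return sum(1 for z in zs if abs(z) < c) / len(zs)

for n in (2, 10, 100):
    zs = standardized_means(n, reps=4000)
    print(f"n = {n:>3}: fraction within ±1.96 = {fraction_within(zs):.3f}")
```

Even though the individual observations are far from normal, the fraction is typically close to .95 once n is moderately large.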
APPLICATION TO THE BINOMIAL DISTRIBUTION. In the case of the binomial distribution, where X = Sn is the sum of n independent random variables with outcomes 0 or 1,
$$M = \frac{S_n}{n}, \qquad E(M) = p, \qquad SE = \sqrt{\frac{p(1 - p)}{n}}.$$
By the central limit theorem, the distribution of the standardized mean
$$Z = \frac{M - p}{\sqrt{p(1 - p)/n}}$$
tends to the standard normal distribution as n → ∞. (The approximation can be used if both np > 30 and n(1 − p) > 30. A so-called continuity correction, which reduces the absolute value of the numerator by 1/(2n), improves the approximation, but is negligible for large n.)
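The quality of the normal approximation, with and without the continuity correction, can be compared against the exact binomial probability. This is a minimal sketch with illustrative values (n = 100, p = .5, and the cumulative probability of at most 55 successes); the correction is applied on the count scale as ±1/2, which is equivalent to 1/(2n) on the proportion scale. The function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_cdf(n, k, p):
    """Exact P(S_n <= k) by summing the binomial probability function."""
    return sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(k + 1))

def normal_approx_cdf(n, k, p, continuity=True):
    """Normal approximation to P(S_n <= k), optionally with continuity correction."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    x = k + 0.5 if continuity else k
    return NormalDist().cdf((x - mean) / sd)

n, p, k = 100, 0.5, 55  # illustrative values only
exact = binomial_cdf(n, k, p)
plain = normal_approx_cdf(n, k, p, continuity=False)
corrected = normal_approx_cdf(n, k, p)

print(f"exact = {exact:.4f}, normal = {plain:.4f}, with correction = {corrected:.4f}")
```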
Inference: Testing Statistical Hypotheses
Performing tests of statistical hypotheses is part of the scientific process, as indicated in Table 1, ideally with the professional statistician as member of the research team. The conceptual framework of subject matter specialists is an essential component, as is their close participation in the study, from its design to the interpretation of results.
FORMAL STRUCTURE. The formal steps of testing, summarized in Table 2, involve defining the null hypothesis, denoted H0, to be tested against the alternative hypothesis H1. The aim is to reject, or "nullify," the null hypothesis, in favor of the alternative, which is typically the hypothesis of real interest. The test may be two-sided or one-sided. For example, if the mean of a distribution is μ0 under the null hypothesis, one may use the two-sided test, usually displayed as follows:
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu \neq \mu_0; \qquad \text{reject } H_0 \text{ if } |z| > c,$$
that is, if the absolute value of the test statistic z, calculated from the observations, is outside the critical value c, determined by the significance level α ("alpha"). The corresponding one-sided test would be one of the following:
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu > \mu_0; \qquad \text{reject } H_0 \text{ if } z > c,$$
$$H_0: \mu = \mu_0 \quad \text{versus} \quad H_1: \mu < \mu_0; \qquad \text{reject } H_0 \text{ if } z < -c.$$
An outcome in the rejection region, the tail(s) of the distribution outside c, is considered unlikely if the null hypothesis is true, leading to its rejection at significance level α. The form of the test used, one- or two-sided, depends on the context of the problem, but the actual test used should always be reported.
Table 1
| 1. | Conceptual framework or paradigm |
| 2. | Formulation of testable (falsifiable) hypothesis |
| 3. | Research design, including selection of sample |
| 6. | Interpretation of results |
| 7. | Generalization to some population: Inference |
| 8. | Follow-up in further studies |
SOURCE: Courtesy of Valerie Miké.

Table 2
| 1. | Set up null hypothesis vs. alternative hypothesis. |
| 2. | Collect data in accordance with research design. |
| 3. | Analyze data for overall patterns, outliers, consistency with theoretical assumptions, etc. |
| 4. | Compute the test statistic, to be compared with the critical value, which divides the distribution of the test statistic under the null hypothesis into "likely" and "unlikely" regions, determined by the significance level α. The conventional division is 95% and 5%, for α = .05. |
|    | a. If the test statistic is in the 95% region, considered a "likely" outcome, do not reject the null hypothesis. |
|    | b. If the test statistic is in the 5% region, considered an "unlikely" outcome, reject the null hypothesis. The result is said to be statistically significant at P = .05. |
| 5. | Review analysis with subject matter specialist, for possible implications and further studies. |
SOURCE: Courtesy of Valerie Miké.

AN EXAMPLE IN TWO PARTS. A senator, running for reelection against a strong opponent, wants to know his standing in popular support. An eager volunteer conducts a survey of 100 likely voters (Case #1) and reports back that 55 plan to vote for the senator. Meanwhile, a professional pollster retained by the campaign manager takes a sample of 1,100 likely voters (Case #2), and also obtains a positive response from 55 percent. What can they conclude?
Each may choose a two-sided test of the null hypothesis that the true proportion p of supporters is .5, at significance level α = .05:
$$H_0: p = .5 \quad \text{versus} \quad H_1: p \neq .5.$$
By the central limit theorem for the binomial distribution each can use the test statistic
$$z = \frac{M - p}{SE}, \qquad SE = \sqrt{\frac{p(1 - p)}{n}},$$
assuming the standard normal distribution, and carry out a z-test for Case #1 (n = 100) and Case #2 (n = 1,100). The sample mean M is .55 for each, but SE involves the sample size. For Case #1:
$$SE = \sqrt{\frac{(.5)(.5)}{100}} = .05, \qquad z = \frac{.55 - .50}{.05} = 1.0.$$
(To distinguish between a random variable and its observed value, the latter is often denoted in lower case, such as Z versus z.) As seen in Figure 3a, this test statistic is just one standard deviation from the mean under the null hypothesis, well within the likely region. Figure 3b shows that even a one-sided test would require a test statistic of at least z = 1.645 to reject H0. The senator cannot be said to be ahead of his opponent.
For Case #2:
$$SE = \sqrt{\frac{(.5)(.5)}{1{,}100}} \approx .015, \qquad z = \frac{.55 - .50}{.015} \approx 3.33.$$
Figure 3a shows that this test statistic is greater than the critical value 1.96, leading to rejection of the null hypothesis. The pollster can report that the senator is statistically in the lead, whereas the volunteer's result is inconclusive.
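The two z-tests can be reproduced in a few lines. The sketch below assumes the standard error is computed under the null hypothesis p = .5, as in a test of a single proportion; the function name `z_statistic` is a hypothetical helper. Without rounding SE, the statistic for Case #2 comes out near 3.32 rather than 3.33.

```python
from math import sqrt

def z_statistic(m, p0, n):
    """z-statistic for an observed proportion m against the null value p0."""
    se = sqrt(p0 * (1 - p0) / n)  # standard error under the null hypothesis
    return (m - p0) / se

CRITICAL_TWO_SIDED = 1.96  # two-sided critical value for alpha = .05

for label, n in (("Case #1", 100), ("Case #2", 1100)):
    z = z_statistic(0.55, 0.5, n)
    decision = "reject H0" if abs(z) > CRITICAL_TWO_SIDED else "do not reject H0"
    print(f"{label}: n = {n}, z = {z:.2f}, {decision}")
```

The same observed proportion leads to opposite conclusions only because the sample sizes, and hence the standard errors, differ.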
ERRORS ASSOCIATED WITH TESTING. Two types of error that may occur in testing a statistical hypothesis are shown in Table 3: Type I, rejecting H0 when it is true, and Type II, not rejecting it when it is false. (The expression "accept" instead of "do not reject" H0 is sometimes used, but strictly speaking the most that can be asserted is that the observed result is consistent with, or is a "likely" outcome under, the null hypothesis; it is always a tentative conclusion.) The Type I error means that when H0 is rejected at P = .05 (or α = .05, the significance level of the test), an outcome in the rejection region would occur by chance 5 percent of the time if H0 were true. The Type II error, its probability denoted by β ("beta"), is not as well known; many users of statistical methods even seem unaware that it is an integral part of the theory. The complement of β, or (1 − β), the probability of rejecting H0 when it is false, is called the power of the test.
Table 3
| Conclusion of test | Null hypothesis true | Null hypothesis false |
| Do not reject H0 ("Not statistically significant") | No error | Type II error (β) |
| Reject H0 ("Statistically significant") | Type I error (α or P), significance level | No error (1 − β), power |
SOURCE: Courtesy of Valerie Miké.

Table 4 (entries not reproduced)
Assume one-year survival rate with current treatment is 50% and with new treatment is one of six alternative values. First column shows n = number of patients in each treatment group. Entries in columns 2–7 represent power of test (1 − β) = probability of rejecting H0 for different values of H1; α = .05, two-sided test (arcsine transformation). For entries marked (*) the power is greater than .995.
SOURCE: Courtesy of Valerie Miké.

THE P-VALUE. In reporting the results of a study, statistical significance is usually indicated in terms of what has become known as the P-value, written as P < .05 or P < .01, referring to the significance level α. In analyses carried out by computer, the software typically also provides the actual value of P corresponding to the observed test statistic (properly doubled for two-sided tests). In Case #1 above, the value corresponding to z = 1.0 can be read off Figure 2 as P = .32. For Case #2, the value for z = 3.33 is seen as P < .003; it can be looked up in a table of the normal distribution as P = .0009. In results reported in the applied literature, at times only the observed P-value may be given, with no discussion of formal testing.
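A two-sided P-value can be obtained from the standard normal distribution without a printed table. The following is a minimal sketch using the standard-library `statistics.NormalDist`; the function name `two_sided_p` is a hypothetical helper.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided P-value: the upper-tail area beyond |z|, doubled."""
    upper_tail = 1.0 - NormalDist().cdf(abs(z))
    return 2.0 * upper_tail

for label, z in (("Case #1", 1.0), ("Case #2", 3.33)):
    print(f"{label}: z = {z}, P = {two_sided_p(z):.4f}")
```

For z = 1.0 this gives a value of about .32, and for z = 3.33 a value well below .003, in line with the areas shown for one and three standard deviations.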
THE POWER OF THE TEST. Tests of the null hypothesis can be carried out without reference to the Type II error, but along with α and the sample size n, consideration of β is crucial in the research design of studies. The level of β, or equivalently, the power of the test, is always defined in terms of a specific value of the alternative hypothesis. The relationship between α and β for fixed n is shown in Figure 4 for a one-sided test of μ0 versus μ1. Changing the critical value c shows that as α increases, β decreases, and vice versa. A shift of μ1 in relation to μ0 indicates that the distance between them affects the power of the test.
Power as a function of sample size and alternative hypothesis is illustrated in Table 4. Assuming that a certain type of cancer has a one-year survival rate of 50 percent with the standard treatment, a randomized clinical trial is planned to evaluate a promising new therapy. The table shows the power of a two-sided test at α = .05 for a range of possible survival rates, with the new treatment and different numbers of patients included in each arm of the study.
For example, if there are 100 patients in each group, a new treatment yielding a one-year survival rate of 75 percent would be detected with probability (power) .96. "Detect" here refers to the probability that the observed difference in survival rates will be statistically significant. But if the improvement is only to 60 percent, the corresponding power is a mere .30. To detect this improvement with high power (.99) would require a sample size of 1,000. In any particular case, investigators have a general idea of what improvement can reasonably be expected. If the survival rate in the study arm is unlikely to be higher than 60 percent, then a clinical trial with just a few hundred patients is not a good research design and may be a waste of precious human and financial resources.
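The power figures quoted above can be approximated with a short calculation. The sketch below assumes the usual variance-stabilizing arcsine transformation for comparing two proportions, in which 2·arcsin(√p̂) has variance of about 1/n in each group; this is an assumption about how such a table would be computed, not a statement of the exact method used, and `power_two_proportions` is a hypothetical name.

```python
from math import asin, sqrt
from statistics import NormalDist

def power_two_proportions(p0, p1, n_per_group, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    z = NormalDist()
    # Effect size on the arcsine scale.
    h = 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p0))
    # The difference of two transformed proportions has SE of about sqrt(2 / n).
    shift = abs(h) * sqrt(n_per_group / 2)
    c = z.inv_cdf(1 - alpha / 2)  # two-sided critical value (1.96 for alpha = .05)
    # Probability that the test statistic falls in either rejection tail.
    return z.cdf(shift - c) + z.cdf(-shift - c)

for p1, n in ((0.75, 100), (0.60, 100), (0.60, 1000)):
    power = power_two_proportions(0.50, p1, n)
    print(f"new rate = {p1:.2f}, n per group = {n:>4}: power = {power:.2f}")
```

Under these assumptions the three cases give approximately .96, .30, and .99, matching the values cited in the text.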
Inference: Estimating Confidence Intervals
An intuitive everyday procedure is point estimation, obtaining a summary figure, such as the sample mean, for some quantity of interest. But it is generally desirable to give an indication of how good—how precise—this estimate is, and this is done with the confidence interval.
THE FORMAL STRUCTURE. It is assumed here that the normal distribution is applicable, so that the terms already introduced can be used, with estimation of the population mean μ by the sample mean M. By definition, the following holds for the standard normal z-statistic:
$$P\left(-c < \frac{M - \mu}{SE} < c\right) = 1 - \alpha.$$
As can be seen from Figure 3a, for α = .05 this becomes
$$P\left(-1.96 < \frac{M - \mu}{SE} < 1.96\right) = .95.$$
Rewriting the expression inside the parentheses yields
$$M - 1.96\,SE < \mu < M + 1.96\,SE,$$
which is called a 95 percent confidence interval for the unknown population mean μ. It means that in a long sequence of identical repeated studies, 95 percent of the confidence intervals calculated from the samples would include the unknown parameter. There is always a 5 percent chance of error, but a larger sample size yields a smaller SE and narrower limits.
TWO-PART EXAMPLE CONTINUED. In the senator's reelection campaign, the point estimate M = .55 was obtained with different samples by both the volunteer and the pollster, and here the unknown parameter estimated by M is the true proportion p. Using the expression above yields
$$\text{Case \#1: } .55 \pm 1.96(.05) = .55 \pm .10, \text{ or } (.45,\ .65);$$
$$\text{Case \#2: } .55 \pm 1.96(.015) = .55 \pm .03, \text{ or } (.52,\ .58).$$
The critical value c = 1.96 for the standard normal (two-sided, α = .05) is close to 2.0, and results are often presented in the form M ± 2SE. The latter expression may be reported by the media as "55 percent with a 3 percent margin of error," putting the senator clearly in the lead. What is omitted is that this is a 95 percent confidence interval, with a 5 percent chance of error on the interval itself.
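The two confidence intervals can be computed as follows. This is a minimal sketch that assumes the standard error is estimated from the observed proportion M itself, which for M = .55 gives nearly the same values as those used in the tests; `confidence_interval` is a hypothetical helper name.

```python
from math import sqrt

def confidence_interval(m, n, c=1.96):
    """Approximate 95 percent confidence interval M ± c·SE for a proportion."""
    se = sqrt(m * (1 - m) / n)  # SE estimated from the sample proportion
    return m - c * se, m + c * se

for label, n in (("Case #1", 100), ("Case #2", 1100)):
    lo, hi = confidence_interval(0.55, n)
    covers = "includes" if lo < 0.5 < hi else "excludes"
    print(f"{label}: ({lo:.3f}, {hi:.3f}) {covers} p = .50")
```

The interval for the smaller sample includes .50 while the interval for the larger sample does not, which mirrors the outcomes of the two tests.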
RELATIONSHIP BETWEEN TESTING AND ESTIMATION. Any value included in a (1 − α) confidence interval would in general be accepted (not rejected) as the null hypothesis in the corresponding test of significance level α, and values outside the interval would be rejected. In this example the null hypothesis of p = .50 was rejected in Case #2, but not in Case #1. The confidence interval is a useful, informative way to report results.
A statistical study may be observational or experimental and may involve one or more samples. The polls and the clinical trial were examples of a one-sample survey and a two-sample experiment, respectively. The methods of inference described here are a simple prototype of the Neyman-Pearson theory, using the binomial and standard normal distributions, but they are valid in a wide range of contexts. Other important probability distributions include two generated by a stable random process: the Poisson, for the number of events occurring at random in a fixed interval, and the exponential, for the length of the interval between the occurrence of random events. Radioactive decay, traffic accidents in a large city, and calls arriving at a telephone exchange are random processes that illustrate both distributions.
If the variance of a normal distribution is unknown and estimated from the sample (using a computational formula involving the observations), the z-test used above is replaced by the t-test for small samples (n < 30), with its own distribution. For larger samples the normal distribution is a close approximation. The chi-square test, perhaps the most widely used method in applied statistics, assesses the relationship between two categorical variables (each taking on a finite number of values, displayed in a two-way table), or the "goodness-of-fit" of observed data to a particular distribution. Multivariate techniques deal with inferences about two or more random variables, including their interaction; basic among these are correlation and regression. Important and central to the design of experiments is the analysis of variance, a method for partitioning the variation in a set of data into components associated with specific causes, in order to assess the effect of any hypothetical causes on the experimental result.
There are specialized techniques for time series and forecasting, for sample surveys and industrial quality control. Sequential analysis refers to procedures for repeated testing of hypotheses along the way, to minimize the sample size needed for a study. The class of nonparametric methods uses tests that do not assume a specific parametric form for the probability distributions, all within the classical theory. Decision theory formulates statistical problems as a choice between possible decisions based on the concept of utility or loss.
The same data can often be analyzed by different techniques, using different assumptions, and these may yield conflicting results. Statistical theory aims to provide the best methods for a given situation, tests that are most powerful across the range of alternatives, and estimates that are unbiased and have the smallest variance. Given an adequate model, statistics can control the uncertainty attributable to sampling error. But it cannot control systematic error, when the data are not even closely representative of the assumed population. Inference is based on an abstract logical structure, and its application to messy reality always requires the mature judgment of experienced investigators.
Anderson, David R.; Dennis J. Sweeney; and Thomas A. Williams. (1994). Introduction to Statistics: Concepts and Applications, 3rd edition. Minneapolis/St. Paul: West Publishing. A textbook of applied statistics with many examples; no calculus required.
Cox, D. R., and D. V. Hinkley. (1974). Theoretical Statistics. London: Chapman and Hall. A graduate level text on the theory of modern statistics.
Cox, D. R., and D. V. Hinkley. (1978). Problems and Solutions in Theoretical Statistics. London: Chapman and Hall. Additional material pertaining to the text above.
Fisher, R. A. (1990). Statistical Methods, Experimental Design, and Scientific Inference, ed. J. H. Bennett. Oxford: Oxford University Press. Three classic works by a founder of modern statistics, published in a single volume.
Freedman, David; Robert Pisani; and Roger Purves. (1998). Statistics, 3rd edition. New York: Norton. An introductory text presenting concepts and methods by means of many examples, with minimal formal mathematics.
Kotz, Samuel; Norman L. Johnson; and Campbell B. Read, eds. (1982–1999). Encyclopedia of Statistical Sciences. 9 vols. plus supp. and 3 update vols. New York: Wiley.
Kruskal, William H., and Judith M. Tanur, eds. (1978). International Encyclopedia of Statistics. 2 vols. New York: Free Press.
Lehmann, E. L. (1959). Testing Statistical Hypotheses. New York: Wiley. A theoretical development of the Neyman-Pearson theory of hypothesis testing.
Miller, Irwin, and Marylees Miller. (2004). John E. Freund's Mathematical Statistics with Applications, 7th edition. Upper Saddle River, NJ: Prentice Hall. An intermediate-level introduction to the theory of statistics.
Pearson, Karl. (1978). The History of Statistics in the 17th and 18th Centuries against the Changing Background of Intellectual, Scientific, and Religious Thought: Lectures by Karl Pearson Given at University College, London, during the Academic Sessions, 1921–1933, ed. E. S. Pearson. London: Griffin. Includes documentation on the origin of the name statistics.
Snedecor, George W., and William G. Cochran. (1989). Statistical Methods, 8th edition. Ames: Iowa State University Press. A classic text of applied statistics.
Strait, Peggy Tang. (1989). A First Course in Probability and Statistics with Applications, 2nd edition. San Diego, CA: Harcourt Brace Jovanovich. Thorough presentation of mathematical concepts and techniques, with hundreds of examples from a wide range of applications.