## Inferential statistics

## Inference, Statistical

To perform inference, in layman’s terms, is to make an educated or informed guess of an unknown quantity of interest given what is known. *Statistical inference*, again in layman’s terms, goes one step further, by making an informed guess about the error in our informed guess of the unknown quantity. To the layman, this may be difficult to grasp—if I don’t know the truth, how could I possibly know the error in my guess? Indeed, the exact error—that is, the difference between the truth and our guess—can never be known when inference is needed. But when our data set, or more generally, quantitative information, is collected through a probabilistic mechanism—or at least can be approximated or perceived as such—probabilistic calculations and statistical methods allow us to compute the probable error, formally known as the “standard error,” of our guess, or more generally, of our guessing method, the so-called “estimator.” Such calculations also allow us to compare different estimators, that is, different ways of making informed guesses, which sometimes can lead to the best possible guess, or the most efficient estimation, given a set of (often untestable) assumptions and optimality criteria.

Consider the following semihypothetical example. Mary, from a prestigious university in Europe, is being recruited as a statistics professor by a private U.S. university. Knowing that salaries at U.S. universities tend to be significantly higher than at European universities, Mary needs to figure out how much she should ask for without aiming too low or too high; either mistake could prevent her from receiving the best possible salary. This is a decision problem, because it depends on how much risk Mary is willing to take and many other factors that may or may not be quantifiable. The inference part comes in because, in order to make an informed decision, Mary needs to know something about the possible salary ranges at her new university.

## FROM SAMPLE TO POPULATION

As with any statistical inference, Mary knows well that the first important step is to collect relevant data or information. There are publicly available data, such as the annual salary surveys conducted by the American Statistical Association. But these results are too broad for Mary’s purposes because the salary setup at Mary’s new university might be quite different from many of the universities surveyed. In other words, what Mary needs is a conditional inference, conditional on the specific characteristics that are most relevant to her goal. In Mary’s case, the most relevant specifics include (1) the salary range at her new university and (2) the salary for someone with experience and credentials similar to hers.

Unlike at public universities, salary figures for senior faculty at many private universities are kept confidential. Therefore, collecting data is not easy, but in this example, through various efforts Mary obtained $140,000, $142,000, and $153,000 as the salary figures for three of the university’s professors with statuses similar to Mary’s. Mary’s interest is not in this particular sample, but in inferring from this sample an underlying population of possible salaries that have been or could be offered to faculty members who can be viewed approximately as exchangeable with Mary in terms of a set of attributes that are (perceived to be) used for deciding salaries (e.g., research achievements, teaching credentials, years since PhD degree, etc.). This population is neither easy to define nor knowable to most individuals, and certainly not to Mary. Nevertheless, the sample Mary has, however small, tells her something about this population. The question is, what does it tell, and how can it be used in the most efficient way? These are among the core questions for statistical inference.

## DEFINING ESTIMAND

But the first and foremost question is what quantity Mary wants to estimate. To put it differently, if Mary knew the entire distribution of the salaries, what features would she be interested in? Formulating such an inference objective, or estimand, is a critical step in any statistical inference, and often it is not as easy as it might first appear. Indeed, in Mary’s case it would depend on how “aggressive” she would want to be. Let’s say that she settles on the 95th percentile of the salary distribution; she believes that her credentials are sufficient for her to be in the top 5 percent of the existing salary range, but it probably would not be an effective strategy to ask for a salary that exceeds everyone else’s.

Mary then needs to estimate the 95th percentile using the sample she has. The highest salary in the sample is $153,000, so it appears that any estimate for the 95th percentile should not exceed that limit if all we have is the data. This would indeed be the case if we adopt a pure nonparametric inference approach. The central goal of this approach is very laudable: Making as few assumptions as possible, let the data speak. Unfortunately, there is no free lunch: the less you pay, the less you get. The problem with this approach is that unless one has a sufficient amount of data, there is just not enough “volume” in the data to speak loudly enough so that one could hear useful messages. In the current case, without any other knowledge or making any assumptions, Mary would have no basis to infer any figure higher than $153,000 to be a possible estimate for the 95th percentile.

## MAKING ASSUMPTIONS

But as a professional statistician, Mary knows better. She knows that she needs to make some distributional assumptions before she can extract nontrivial information out of merely three numbers, and that in general, the log-normal distribution is not a terrible assumption for salary figures. That is, histograms of the logarithm of salary figures tend to be shaped like a “bell curve,” also known as the Gaussian distribution. This is a tremendous amount of information, because it effectively reduces the “infinitely unknowable” distribution of possible salary figures to only two parameters, the mean and the variance of the log of the salary. Mary can estimate these two parameters using her sample of three if the three log salary figures (11.849, 11.864, 11.938) she obtained can be regarded as a probabilistic sample. This is a big “if,” but for now, let us assume this is approximately true. Then the sample mean 11.884 and sample standard deviation 0.048 provide valid estimates of the unknown true mean *μ* and true standard deviation *σ*. Because for the normal distribution *N*(*μ*, *σ*²) the 95th percentile is *z*₉₅ = *μ* + 1.645*σ*, Mary’s estimate for the 95th percentile for the log salary distribution is 11.884 + 1.645 × 0.048 = 11.963. Because the log transformation is strictly monotone, this means that Mary’s estimate for the 95th percentile for the salary distribution is exp(11.963) = $156,843, about 2.5 percent higher than the observed maximal salary of $153,000!
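The arithmetic in this paragraph can be retraced in a few lines. The following is a minimal sketch using only the Python standard library; the variable names are illustrative, and because it carries full precision rather than the three-decimal figures quoted above, the final salary estimate may differ from $156,843 by a small rounding amount.

```python
import math
import statistics

# The three salary figures quoted in the text.
salaries = [140_000, 142_000, 153_000]

# Work on the log scale, where the normal assumption is applied.
log_salaries = [math.log(s) for s in salaries]
mu_hat = statistics.mean(log_salaries)       # estimate of mu
sigma_hat = statistics.stdev(log_salaries)   # estimate of sigma (n - 1 denominator)

Z_95 = 1.645  # 95th percentile of the standard normal, as used in the text

# Estimated 95th percentile on the log scale, then mapped back to dollars.
log_p95 = mu_hat + Z_95 * sigma_hat
p95 = math.exp(log_p95)

print(f"log-scale mean = {mu_hat:.3f}, sd = {sigma_hat:.3f}")
print(f"estimated 95th percentile = {log_p95:.3f} (log scale), ${p95:,.0f}")
```

The back-transformation with `math.exp` is valid for percentiles because the logarithm is strictly monotone; the same shortcut would not give the mean of the salary distribution.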

## ASSESSING UNCERTAINTY

With a sample size of three, Mary knows well that there is large uncertainty in estimating the mean *μ*, as well as in estimating *σ*. But how do we even measure such error without knowing the true value? This is where the probabilistic calculation comes in, if the sample we have can be regarded as a probabilistic sample. By probabilistic sample, we mean that it is generated by a probabilistic mechanism, such as drawing a lottery. In Mary’s case, the sample was clearly not drawn randomly, so we need to make some assumptions. In general, in order for any statistical method to render a meaningful inference conclusion, the sample must be “representative” of the population of interest, or can be perceived as such, or can be corrected as such with the help of additional information. A common assumption to ensure such “representativeness” is that our data form an independently and identically distributed (i.i.d.) sample of the population of interest. This assumption can be invalidated easily if, for instance, faculty members with higher salaries are less likely to disclose their salaries to Mary. This would be an example of selection bias, or more specifically, a nonresponse bias, a problem typical, rather than exceptional, in opinion polls and other surveys that are the backbone of many social science studies. But if Mary knew how a faculty member’s response probability is related to that member’s salary, then methods do exist for her to correct for such a bias.

Mary does not have such information, nor does she worry too much about the potential bias in her sample. To put it differently, she did her best to collect her data to be “representative,” being mindful of the “garbage-in-garbage-out” problem; no statistical analysis method could come to the rescue if the data quality is just poor. So she is willing to accept the i.i.d. assumption, or rather, she does not have strong evidence to invalidate it. This is typical with small samples, where model diagnosis, or more generally, assumption checking is not directly feasible using the data alone. But contrary to common belief, just because one does not have enough data to check assumptions, this does not imply one should shy away from making parametric assumptions. Indeed, it is with small samples that the parametric assumptions become most valuable. What one does need to keep in mind when dealing with a small sample is that the inference will be particularly sensitive to the assumptions made, and therefore a sensitivity analysis (that is, checking how the analysis results vary with different assumptions) is particularly necessary.

Under the i.i.d. assumption, we can imagine many possible samples of three drawn randomly from the underlying salary population, and for each of these samples we can calculate the corresponding sample mean and sample standard deviation of the log salary. These sample means and sample standard deviations themselves will have their own distributions. Take the distribution of the sample mean as an example. Under the i.i.d. assumption, standard probability calculations show that the mean of this distribution retains the original mean *μ*, but its variance is the original variance divided by the sample size *n*, that is, *σ*²/*n*. This makes good intuitive sense because averaging samples should not alter the mean, but should reduce the variability in approximating the true mean, and the degree of reduction should depend on the sample size: The more we average, the closer we are to the true mean, probabilistically speaking. Furthermore, thanks to the central limit theorem, one of the two most celebrated theorems in probability and statistics (the other is the law of large numbers, which justifies the usefulness of sample mean for estimating population mean, among many other things), often we can approximate the distribution of the sample mean by a normal distribution, even if the underlying distribution for the original data is not normal.
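The claim that the sample mean keeps the original mean but has variance *σ*²/*n* can be checked by simulation. The sketch below is illustrative only: it assumes a normal population whose "true" *μ* and *σ* are simply set to the estimates from Mary's example, and the number of repetitions and the random seed are arbitrary choices.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, fixed only for reproducibility

MU, SIGMA = 11.884, 0.048  # assumed "true" values, borrowed from the example
N = 3                      # size of each sample
REPS = 20_000              # number of simulated samples (arbitrary)

# Draw many i.i.d. samples of size N and record each sample's mean.
sample_means = []
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)
theoretical_var = SIGMA**2 / N

print(f"mean of sample means:     {mean_of_means:.4f} (mu = {MU})")
print(f"variance of sample means: {var_of_means:.6f}")
print(f"sigma^2 / n:              {theoretical_var:.6f}")
```

With enough repetitions the two variance figures should agree closely, and a histogram of `sample_means` would be centered on `MU`.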

## CONSTRUCTING CONFIDENCE INTERVALS

Consequently, we can assess the probable error in the sample mean, as an estimate of the true mean, because we can use the sample standard deviation *s* to estimate *σ*, which can then be used to form an obvious estimate of the standard error, *s*/√*n*. For Mary’s data, this comes out to be 0.048/√3 = 0.028, which is our estimate of the probable error in our estimate of *μ*, 11.884. In addition, we can use our distributional knowledge to form an interval estimate for *μ*. Typically, an interval estimator is in an appealing and convenient form of “sample mean ± 2 × standard error,” which is a 95 percent confidence interval when (1) the distribution of the sample mean is approximately normal; and (2) the sample size, *n*, is large enough (how large is large enough depends on the problem at hand; in some simple cases, *n* = 30 could be adequate, and in others, even *n* = 30,000 might not be enough). For Mary’s data, assumption (1) holds under the assumption that the log salary is normal, but assumption (2) clearly does not hold. However, there is an easy remedy, based on a more refined statistical theory. The convenient form still holds as long as one replaces the multiplier 2 by the 97.5th percentile of the *t* distribution with *n* − 1 degrees of freedom. For Mary’s data, *n* = 3, so there are 2 degrees of freedom and the multiplier is 4.303. Consequently, a 95 percent confidence interval for *μ* can be obtained as 11.884 ± 4.303 × 0.028 = (11.766, 12.004). Translating back to the original salary scale, this implies a 95 percent confidence interval ($128,541, $163,407). This interval for the mean is noticeably wider than the original sample range ($140,000, $153,000); this is not a paradox, but rather a reflection that with a sample size of only three, there is a tremendous uncertainty in our estimates, particularly because of the long tail in the lognormal distribution.
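As a rough check of the interval construction, the sketch below recomputes the standard error and the *t*-based interval from the three salaries. It uses only the standard library, so the multiplier 4.303 is hard-coded from the text rather than looked up. Because the code keeps full precision instead of rounding intermediate values, its endpoints can differ from the quoted figures in the third decimal place on the log scale, and by a few hundred dollars on the salary scale.

```python
import math
import statistics

salaries = [140_000, 142_000, 153_000]
log_salaries = [math.log(s) for s in salaries]

n = len(log_salaries)
mean_log = statistics.mean(log_salaries)
std_err = statistics.stdev(log_salaries) / math.sqrt(n)  # s / sqrt(n)

# 97.5th percentile of the t distribution with n - 1 = 2 degrees of freedom,
# taken from the text (a statistics package could compute it instead).
T_MULT = 4.303

lower_log = mean_log - T_MULT * std_err
upper_log = mean_log + T_MULT * std_err

# Map the endpoints back to the salary scale.
lower_salary, upper_salary = math.exp(lower_log), math.exp(upper_log)

print(f"standard error = {std_err:.3f}")
print(f"95% CI for mu  = ({lower_log:.3f}, {upper_log:.3f})")
print(f"salary scale   = (${lower_salary:,.0f}, ${upper_salary:,.0f})")
```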

So what is the meaning of this 95 percent confidence interval? Clearly it does not mean that (11.766, 12.004) includes the unknown value *μ* with 95 percent probability; this interval either covers it or it does not. The 95 percent confidence refers to the fact that among all such intervals computed from all possible samples of the same size, 95 percent of them should cover the true unknown *μ*, if all the assumptions we made to justify our probabilistic calculations are correct. This is much like when a surgeon quotes a 95 percent success chance for a pending operation; he is transferring the overall (past) success rate associated with this type of surgery—either in general, or by him—into confidence of success for the pending operation.
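The repeated-sampling reading of "95 percent confidence" can be illustrated by simulation. This sketch assumes, purely for illustration, that log salaries really are normal with *μ* = 11.884 and *σ* = 0.048; it then draws many samples of size three, builds the *t*-based interval from each, and counts how often the interval covers the assumed true *μ*. The seed and the number of repetitions are arbitrary.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, fixed only for reproducibility

MU, SIGMA = 11.884, 0.048  # assumed "true" values for the simulation
N = 3
T_MULT = 4.303             # 97.5th percentile of t with 2 df, from the text
REPS = 20_000              # number of simulated samples (arbitrary)

covered = 0
for _ in range(REPS):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    center = statistics.fmean(sample)
    half_width = T_MULT * statistics.stdev(sample) / math.sqrt(N)
    # Each individual interval either covers MU or it does not.
    if center - half_width <= MU <= center + half_width:
        covered += 1

coverage = covered / REPS
print(f"fraction of intervals covering mu: {coverage:.3f}")
```

The fraction printed should be close to 0.95, even though any single interval either contains *μ* or misses it.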

By the same token, we can construct a confidence interval for *σ*, and indeed for Mary’s estimand, a confidence interval for the 95th percentile *z*₉₅ = *μ* + 1.645*σ*. These constructions are too involved for the current illustration, but if we ignore the error in estimating *σ* (we shouldn’t if this were a real problem), that is, by pretending *σ* = 0.048, then constructing a 95 percent confidence interval for *z*₉₅ = *μ* + 1.645*σ* would be the same as for *μ* + 1.645 × 0.048 = *μ* + 0.079, which is (11.766 + 0.079, 12.004 + 0.079) = (11.845, 12.083). Translating back to the original salary scale, this implies that a 95 percent confidence interval for *z*₉₅ would be ($139,385, $176,839). The right end point of this interval is about 15 percent higher than the maximal observed salary figure, $153,000. As Mary’s ultimate problem is making a decision, how she should use this knowledge goes beyond the inference analysis. The role of inference, however, is quite clear, because it provides quantitative information that has direct bearing on her decision. For example, Mary’s asking salary could be substantially different knowing that the 95th percentile is below $153,000 or could go above $170,000.
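The shifted interval can be reproduced directly from the quoted endpoints. The sketch below follows the simplification in the text, treating *σ* as known and equal to 0.048, which understates the real uncertainty; the variable names are illustrative.

```python
import math

# Interval for mu quoted in the text (log scale).
mu_lower, mu_upper = 11.766, 12.004

SIGMA_FIXED = 0.048          # sigma treated as known: a simplification
shift = 1.645 * SIGMA_FIXED  # about 0.079

# Adding a constant to mu shifts both endpoints by that constant.
z95_lower = mu_lower + shift
z95_upper = mu_upper + shift

salary_lower = math.exp(z95_lower)
salary_upper = math.exp(z95_upper)

print(f"95% CI for z_95 (log scale): ({z95_lower:.3f}, {z95_upper:.3f})")
print(f"salary scale: (${salary_lower:,.0f}, ${salary_upper:,.0f})")
print(f"upper end vs. $153,000: {salary_upper / 153_000 - 1:.1%} higher")
```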

## LIMITATIONS

One central difficulty with statistical inference, which also makes the statistical profession necessary, is that there simply is no “correct” answer: There are infinitely many incorrect answers, a set of conceivable answers, and a few good answers, depending on how many assumptions one is willing to make. Typically, statistical results are only part of a scientific investigation or of decision making, and they should never be taken as “the answer” without carefully considering the assumptions made and the context to which they would be applied. In our example above, the statistical analysis provides Mary with a range of plausible salary figures, but what she actually asks for will depend on more than this analysis. More importantly, this analysis depends heavily on the assumption that the three salary figures are a random sample from the underlying salary distribution, which is assumed to be log-normal. Furthermore, this analysis completely ignored other information that Mary may have, such as the American Statistical Association’s annual salary survey. Such information is too broad to be used directly for Mary’s purposes (e.g., taking the 95th percentile from the annual survey), but nevertheless it should provide some ballpark figures for Mary to form a general prior impression of what she is going after. This can be done via Bayesian inference, which directly puts a probabilistic distribution on any unknown quantity that is needed for making inference, and then computes the posterior distribution of whatever we are interested in given the data. In Mary’s case, this would lead to a distribution for *z*₉₅, from which she can directly assess the “aggressiveness” of each asking salary figure by measuring how likely it is to exceed the actual 95th percentile. For an illustration of this more flexible method, see Gelman et al. (2004).

**SEE ALSO** *Classical Statistical Analysis; Degrees of Freedom; Distribution, Normal; Errors, Standard; Inference, Bayesian; Selection Bias; Standard Deviation; Statistics; Statistics in the Social Sciences*

## BIBLIOGRAPHY

Casella, George, and Roger L. Berger. 2002. *Statistical Inference.* 2nd ed. Pacific Grove, CA: Thomson Learning.

Cox, D. R. 2006. *Principles of Statistical Inference.* Cambridge, U.K.: Cambridge University Press.

Cox, D. R., and D. V. Hinkley. 1974. *Theoretical Statistics.* London: Chapman and Hall.

Gelman, Andrew, J. B. Carlin, H. S. Stern, and D. B. Rubin. 2004. *Bayesian Data Analysis.* Boca Raton, FL: Chapman and Hall/CRC.

*Xiao-Li Meng*

## Statistical Inference

Making an inference involves drawing a general conclusion from specific observations. People do this every day. Upon arising in the morning, one observes that the sun is shining and infers that the day will be nice. The news reports the arrest of a military veteran for child abuse, and a listener infers that military veterans have special adjustment problems. Statistical inference is a way of formalizing the process of drawing general conclusions from limited information. It is a way of stating the degree of confidence one has in making an inference by using probability theory. Statistically based research allows people to move beyond speculation.

Suppose a sociologist interviews two husbands. Josh, whose wife is employed, does 50 percent of the household chores; Frank, whose wife does not work for pay, does 10 percent. Should the sociologist infer that husbands do more housework when their wives are employed? No. This difference could happen by chance with only two cases. However, what if 500 randomly selected husbands with employed wives average 50 percent of the chores and randomly selected husbands with nonemployed wives average 10 percent? Since this difference is not likely to occur by chance, the sociologist infers that husbands do more housework when their wives are employed for pay.

Researchers perform statistical inferences in three different ways. Assume that 60 percent of the respondents to a survey say they will vote for Marie Chavez. The *traditional hypothesis testing* approach infers that Chavez will win the election if chance processes would account for the result (60 percent support in this survey) with less than some a priori specified statistical significance level. For example, if random chance could account for the result fewer than five times in a hundred, one would say the results are statistically significant. Statistical significance levels are called the *alpha* (e.g., α = .05 for the 5 percent level). If Chavez would get 60 percent support in a sample of the size selected less than 5 percent of the time by chance, one would infer that she will win. The researcher picked the 5 percent level of significance before doing the survey. (The test, including the α level, must be planned *before* one looks at the findings.) If one would get this result 6 percent of the time by chance, there is no inference. Note that not making the inference means just that: One does not infer that Chavez's opponent will win.

A second strategy involves stating the *likelihood of the result occurring by chance* without an a priori level of significance. This strategy reports the result (60 percent of the sample supported Chavez) and the probability of getting that result by chance, say, .042. This gives readers the freedom to make their inferences using whatever level of significance they wish. Sam Jones, using the .01 level (α = .01) in the traditional approach would see that the results do not meet his criterion. He would not conclude that Chavez will win. Mara Jabar, using the .05 level, would conclude that Chavez would win.

The third strategy places a *confidence interval around a result*. For example, a researcher may be 95 percent confident that Chavez will get between 55 percent and 65 percent of the votes. Since the entire interval (55 percent to 65 percent) is enough for a victory, that is, is greater than 50 percent, one infers that Chavez will win.

Each approach has an element of risk attached to the inference. That risk is the probability of getting the result by chance alone. Sociologists tend to pick low probabilities (e.g., .05, .01, and even .001), because they do not want to conclude that something is true when it is at all likely to have occurred by chance.

## TRADITIONAL TESTS OF SIGNIFICANCE

Traditional tests of significance involve six steps. Three examples are used here to illustrate these steps: (1) A candidate will win an election, (2) mothers with at least one daughter will have different views on abortion than will mothers with only sons, and (3) the greater a person's internal political efficacy is, the more likely that person is to vote.

*Step 1*: State a hypothesis (*H*1) in terms of statistical parameters (characteristics such as means, correlations, proportions) of the population:

H1: P(vote for the candidate) > .50. [Read: The proportion of the population voting for the candidate is greater than .50.]

H2: μ mothers with daughters ≠ μ mothers with sons. [Read: The mean for mothers with daughters is not equal to the mean for mothers with sons.]

H3: ρ > 0.0. [Read: The population correlation ρ (rho) between internal political efficacy and voting is greater than zero.]

*H2* says that the means are different but does not specify the direction of the difference. This is a two-tail hypothesis, meaning that it can be significant in either direction. In contrast, *H1* and *H3* signify the direction of the difference and are called one-tail hypotheses.

These three hypotheses are not directly testable because each involves a range of values. *Step 2* states a null hypothesis, which the researcher usually wishes to reject, that has a specific value.

H10: P(vote for the candidate) = .50.

H20: μ mothers with daughters = μ mothers with sons.

H30: ρ = 0.

An important difference between one-tail and two-tail tests may have crossed the reader's mind. Consider *H1*0. If 40 percent of the sample supported the candidate, one fails to reject *H1*0 because the result was in the direction opposite of that of the one-tail hypothesis. In contrast, whether mothers with daughters have a higher or lower mean attitude toward abortion than do mothers with sons, one proceeds to test *H2*0 because a difference in either direction could be significant.

*Step 3* states the a priori level of significance. Sociologists usually use the .05 level. With large samples, they sometimes use the .01 or .001 level. This paper uses the .05 level (α = .05). If the result would occur in fewer than 5 percent (corresponding to the .05 level) of the samples if the null hypothesis were true in the population, the null hypothesis is rejected in favor of the main hypothesis.

Suppose the sample correlation between internal political efficacy and voting is .56 and this would occur in fewer than 5 percent of the samples this size if the population correlation were 0 (as specified in *H3*0). One rejects the null hypothesis, *H3*0, and accepts the main hypothesis, *H3*, that the variables are correlated in the population. What if the sample correlation were .13 and a correlation this large would occur in 25 percent of the samples from a population in which the true correlation were 0? Because 25 percent exceeds the a priori significance level of 5 percent, the null hypothesis is not rejected. One cannot infer that the variables are correlated in the population. Simultaneously, the results do not prove that the population correlation is .00, simply that it could be that value.

*Step 4* selects a test statistic and its critical value. Common test statistics include *z*, *t*, *F*, and χ² (chi-square). The *critical value* is the value the test statistic must exceed to be significant at the level specified in step 3. For example, using a one-tail hypothesis, a *z* must exceed 1.645 to be significant at the .05 level. Using a two-tail hypothesis, the absolute value of *z* must exceed 1.96 to be significant at the .05 level. For *t*, *F*, and χ², determining the critical value is more complicated because one needs to know the degrees of freedom. A formal understanding of degrees of freedom is beyond the scope of this article, but an example will give the reader an intuitive idea. If the mean of five cases is 4 and four of the cases have values of 1, 4, 5, and 2, the last case must have a value of 8 (it is the only value for the fifth case that will give a mean of 4, since 1 + 4 + 5 + 2 + *x* = 20 only if *x* = 8, and 20/5 = 4). Thus, there are *n* − 1 degrees of freedom. Most test statistics have different distributions for each number of degrees of freedom.

Figure 1 illustrates the *z* distribution. Under the *z* distribution, an absolute value of greater than 1.96 will occur by chance only 5 percent of the time. By chance a *z* > 1.96 occurs 2.5 percent of the time and a *z* < − 1.96 occurs 2.5 percent of the time. Thus, 1.96 is the critical *z*-score for a two-tail .05 level test. The critical *z*-score for a one-tail test at the .05 level is 1.645 or − 1.645, depending on the direction specified in the main hypothesis.
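The critical values quoted for the *z* distribution can be recovered from the standard normal's inverse cumulative distribution function. The following minimal sketch uses `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later); the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1
ALPHA = 0.05

# One-tail test: all of alpha sits in a single tail.
one_tail_critical = std_normal.inv_cdf(1 - ALPHA)

# Two-tail test: alpha is split evenly between the two tails.
two_tail_critical = std_normal.inv_cdf(1 - ALPHA / 2)

# Probability of exceeding the two-tail critical value in the upper tail.
upper_tail_prob = 1 - std_normal.cdf(two_tail_critical)

print(f"one-tail critical z: {one_tail_critical:.3f}")
print(f"two-tail critical z: {two_tail_critical:.3f}")
print(f"P(z > two-tail critical value): {upper_tail_prob:.3f}")
```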

*Step 5* computes the test statistic. An example appears below.

*Step 6* decides whether to reject or fail to reject the null hypothesis. If the computed test statistic exceeds the critical value, one rejects the null hypothesis and makes the inference to accept the main hypothesis. If the computed test statistic does not exceed the critical value, one fails to reject the null hypothesis and makes no inference.

**Example of Six Steps Applied to** *H1*. A random sample of 200 voters shows 60 percent of them supporting the candidate. Having stated the main hypothesis (step 1) and the null hypothesis (step 2), step 3 selects an a priori significance level at α = .05, since this is the conventional level. Step 4 selects the test statistic and its critical value. To test a single percentage, a *z* test is used (standard textbooks on social statistics discuss how to select the appropriate test statistics; see Agresti and Finlay 1996; Loether and McTavish 1993; Raymondo 1999; Vaughan 1997). Since the hypothesis is one-tail, the critical value is 1.645 (see Figure 1).

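The fifth step computes the test statistic:

$$z = \frac{p_s - p}{\sqrt{pq/n}}$$

where $p_s$ is the proportion in the sample, $p$ is the proportion in the population under *H*0, $q$ is $1 - p$, and $n$ is the number of people in the sample.

Thus,

$$z = \frac{.60 - .50}{\sqrt{(.50)(.50)/200}} = \frac{.10}{.03536} \approx 2.828$$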

The sixth step makes the decision to reject the null hypothesis, since the difference is in the predicted direction and 2.828 > 1.645. The statistical inference is that the candidate will win the election.
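Steps 5 and 6 for the election example can be sketched as follows. The function name and its parameters are hypothetical, chosen for illustration; the calculation is the one-sample *z* test for a proportion, with the standard error based on the null-hypothesis proportion, as in the example above.

```python
import math

def proportion_z_statistic(p_sample: float, p_null: float, n: int) -> float:
    """z statistic for a single proportion, using the null value in the standard error."""
    q_null = 1 - p_null
    standard_error = math.sqrt(p_null * q_null / n)
    return (p_sample - p_null) / standard_error

# Election example: 60 percent support in a random sample of 200 voters.
z = proportion_z_statistic(p_sample=0.60, p_null=0.50, n=200)

CRITICAL_ONE_TAIL = 1.645  # one-tail critical value at the .05 level

# Step 6: reject the null only if z exceeds the critical value
# in the direction predicted by the one-tail hypothesis.
reject_null = z > CRITICAL_ONE_TAIL

print(f"z = {z:.3f}, reject null hypothesis: {reject_null}")
```

Calling the same function with `p_sample=0.40` gives a negative *z*, so the null hypothesis is not rejected, which matches the earlier remark about results in the direction opposite to a one-tail hypothesis.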

## REPORTING THE PROBABILITY LEVEL

Many sociological researchers do not use the traditional null hypothesis model. Instead, they report the probability of the result. This way, a reader knows the probability (say, .042 or .058) rather than the significant versus not significant status. Reporting the probability level removes the "magic of the level of significance." A result that is significant at the .058 level is not categorically different from one that is significant at the .042 level. Where the traditional null hypothesis approach says that the first of these results is not significant and the second is, reporting the probability tells the reader that there is only a small difference in the degree of confidence attached to the two results. Critics of this strategy argue that the reader may adjust the significance level post hoc; that is, the reader may raise or lower the level of significance after seeing the results. It also is argued that it is the researcher, not the reader, who is the person testing the hypotheses; therefore, the researcher is responsible for selecting an a priori level of significance.

The strategy of reporting the probability is illustrated for *H1*. Using the tabled values or functions in standard statistical packages, the one-tail probability of a *z* = 2.828 is .002. The researcher reports that the candidate had 60 percent of the vote in the sample and that the probability of getting that much support by chance is .002. This provides more information than does simply saying that it is significant at the .05 level. Results that could happen only twice in 1,000 times by chance (.002) are more compelling than are results that could happen five times in 100 (.05).
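Where a statistical package is not at hand, the one-tail probability can be approximated with the standard library. This is a minimal sketch using `statistics.NormalDist` (Python 3.8 and later); the comparison against several significance levels is added only to illustrate how different readers might use the reported probability.

```python
from statistics import NormalDist

z = 2.828  # test statistic from the election example

# One-tail probability: chance of a z at least this large under the null.
p_one_tail = 1 - NormalDist().cdf(z)

print(f"one-tail probability = {p_one_tail:.3f}")

# Readers can compare the reported probability with their own alpha level.
for alpha in (0.05, 0.01, 0.001):
    verdict = "significant" if p_one_tail < alpha else "not significant"
    print(f"alpha = {alpha}: {verdict}")
```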

Since journal editors want to keep papers short and studies often include many tests of significance, reporting probabilities is far more efficient than going through the six-step process outlined above. The researcher must go through these steps, but the paper merely reports the probability for each test and places an asterisk beside those that are significant at the .05 level. Some researchers place a single asterisk for results significant at the .05 level, two asterisks for results significant at the .01 level, and three asterisks for results significant at the .001 level.

## CONFIDENCE INTERVALS

Rather than reporting the significance of a result, this approach puts a confidence interval around the result. This provides additional information in terms of the width of the confidence interval.

Using a confidence interval, a person constructs a range of values such that he or she is 95 percent confident (some use a 99 percent confidence interval) that the range contains the population parameter. The confidence interval uses a two-tail approach on the assumption that the population value can be either above or below the sample value.

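For the election example, *H1*, the confidence interval is

$$p_s \pm z_{\alpha/2}\sqrt{\frac{pq}{n}}$$

where $z_{\alpha/2}$ is the two-tail critical value for the alpha level, $p_s$ is the proportion in the sample, $p$ is the proportion in the population under *H*0, $q$ is $1 - p$, and $n$ is the number of people in the sample. With $z_{\alpha/2} = 1.96$, this gives

$$.60 \pm 1.96\sqrt{\frac{(.50)(.50)}{200}} = .60 \pm .069$$

Upper limit: .669

Lower limit: .531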

The researcher is 95 percent confident that the interval, .531 to .669, contains the true population proportion. The focus is on the confidence level (.95) for a result rather than the low likelihood of the null hypothesis (.05) used in the traditional null hypothesis testing approach.

The confidence interval has more information value than do the first two approaches. Since the value specified in the null hypothesis (*H*0: P = .50) is not in the confidence interval, the result is statistically significant at the .05 level. Note that a 95 percent confidence level corresponds to a .05 level of significance and that a 99 percent confidence interval corresponds to a .01 level of significance. Whenever the value specified by the null hypothesis is not in the confidence interval, the result is statistically significant. More important, the confidence interval provides an estimate of the range of possible values for the population. With 200 cases and 60 percent support, there is confidence that the candidate will win, although it may be a close election with the lower limit indicating 53.1 percent of the vote or a landslide with the upper limit indicating 66.9 percent of the vote. If the sample were four times as large, *n* = 800, the confidence interval would be half as wide (.565–.635) and would give a better fix on the outcome.
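The interval limits, and the effect of quadrupling the sample size, can be verified with a short calculation. The function below is a hypothetical helper written for this illustration. It follows the article's definitions in using the null-hypothesis proportion (.50) in the standard error; many textbooks use the sample proportion there instead, which would give slightly different limits.

```python
import math

def proportion_confidence_interval(p_sample, p_null, n, z_critical=1.96):
    """Interval p_sample +/- z * sqrt(p_null * q_null / n), per the article's definitions."""
    half_width = z_critical * math.sqrt(p_null * (1 - p_null) / n)
    return p_sample - half_width, p_sample + half_width

lower_200, upper_200 = proportion_confidence_interval(0.60, 0.50, 200)
lower_800, upper_800 = proportion_confidence_interval(0.60, 0.50, 800)

print(f"n = 200: ({lower_200:.3f}, {upper_200:.3f})")
print(f"n = 800: ({lower_800:.3f}, {upper_800:.3f})")

# The null value .50 lies outside the n = 200 interval,
# so the result is significant at the .05 level.
null_outside_interval = not (lower_200 <= 0.50 <= upper_200)
print(f"null value outside interval: {null_outside_interval}")
```

Because the half-width is proportional to 1/√*n*, a fourfold increase in sample size halves the width of the interval.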

## COMPUTATION OF TESTS AND CONFIDENCE INTERVALS

Table 1 presents formulas for some common tests of significance and their corresponding confidence intervals where appropriate. These are only a sample of the tests that are commonly used, but they cover means, differences of means, proportions, differences of proportions, contingency tables, and correlations. Not included are a variety of multivariate tests for analysis of variance, regression, path analysis, and structural equation models. The formulas shown in Table 1 are elaborated in most standard statistics textbooks (Agresti and Finlay 1996; Blalock 1979; Bohrnstedt and Knoke 1988; Loether and McTavish 1993; Raymondo 1999; Vaughan 1997).

## LOGIC OF STATISTICAL INFERENCE

A formal treatment of the logic of statistical inference is beyond the scope of this article; the following is a simplified description. Suppose one wants to know whether a telephone survey can be thought of as a random sample. From current census information, suppose the mean, μ, income of the community is $31,800 and the standard deviation, σ, is $12,000. A graph of the complete census enumeration appears in Panel A of Figure 2. The fact that there are a few very wealthy people skews the distribution.

A telephone survey included interviews with 1,000 households. If it is random, its sample mean and standard deviation should be close to the population parameters, μ and σ, respectively. Assume that the sample has a mean of $33,200 and a standard deviation of $10,500. To distinguish these sample statistics from the population parameters, call them *M* and *s*. The sample distribution appears in Panel B of Figure 2. Note that it is similar to the population distribution but is not as smooth.

One cannot decide whether the sample could be random by looking at Panels A and B. The distributions are different, but this difference might have occurred by chance. Statistical inference is accomplished by introducing two theoretical distributions: the sampling distribution of the mean and the *z*-distribution of the normal deviate. A theoretical distribution is different from the population and sample distributions in that a theoretical distribution is mathematically derived; it is not observed directly.

**Sampling Distribution of the Mean.** Suppose that instead of taking a single random sample of 1,000 people, one took two such samples and determined the mean of each one. With 1,000 cases, it is likely that the two samples would have means that were close together but not the same. For instance, the mean of the second sample might be $30,200. These means, $33,200 and $30,200, are pretty close to each other. For a sample to have a mean of, say, $11,000, it would have to include a greatly disproportionate share of poor families; this is not likely by chance with a random sample with *n* = 1,000. For a sample to have a mean of, say, $115,000, it would have to have a greatly disproportionate share of rich families. In contrast, with a sample of just two individuals, one would not be surprised if the first person had an income of $11,000 and the second had an income of $115,000.

The larger the samples are, the more stable the mean is from one sample to the next. With only 20 people in the first and second samples, the means may vary a lot, but with 100,000 people in both samples, the means should be almost identical. Mathematically, it is possible to derive a distribution of the means of all possible samples of a given *n* even though only a single sample is observed. It can be shown that the mean of the sampling distribution of means is the population mean and that the standard deviation of the sampling distribution of the means is the population standard deviation divided by the square root of the sample size. The standard deviation of the mean is called the *standard error of the mean*:

$$\sigma_M = \frac{\sigma}{\sqrt{n}}$$

This is an important derivation in statistical theory. Panel C shows the sampling distribution of the mean when the sample size is *n* = 1,000. It also shows the sampling distribution of the mean for *n* = 100. A remarkable property of the sampling distribution of the mean is that with a large sample size, it will be approximately normally distributed even though the population and sample distributions are skewed.
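A small simulation can make the standard error concrete. The sketch below first computes σ divided by the square root of *n* for the two sample sizes in the example, then draws repeated samples from a skewed population to see whether the spread of the sample means comes close to that value. The lognormal population and its parameters are hypothetical stand-ins for a skewed income distribution; they are not the census data described in the text.

```python
import math
import random
import statistics

def standard_error(sigma, n):
    """Standard error of the mean: population SD divided by sqrt(n)."""
    return sigma / math.sqrt(n)

# Values from the income example (sigma = $12,000)
print(standard_error(12000, 1000))  # roughly 379
print(standard_error(12000, 100))   # 1200

# Hypothetical skewed population (an assumption made for illustration)
rng = random.Random(0)
population = [rng.lognormvariate(10.0, 0.6) for _ in range(50_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

def simulate_sample_means(values, n, draws, generator):
    """Means of `draws` random samples of size n, drawn with replacement."""
    return [statistics.fmean(generator.choices(values, k=n)) for _ in range(draws)]

means = simulate_sample_means(population, 100, 2000, rng)
print("mean of sample means:", statistics.fmean(means), "population mean:", mu)
print("SD of sample means:", statistics.stdev(means),
      "sigma / sqrt(n):", standard_error(sigma, 100))
```

In runs of this kind the mean of the sample means sits near the population mean and their standard deviation sits near σ/√*n*, even though the population itself is skewed.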

One gets a general idea of how the sample did by seeing where the sample mean falls along the sampling distribution of the mean. Using Panel C for *n* = 1,000, the sample *M* = $33,200 is a long way from the population mean. Very few samples with *n* = 1,000 would have means this far away from the population mean. Thus, one infers that the sample mean probably is based on a nonrandom sample.

Using the distribution in Panel C for the smaller sample size, *n* = 100, the sample *M* = $33,200 is not so unusual. With 100 cases, one should not be surprised to get a sample mean this far from the population mean.

Being able to compare the sample mean to the population mean by using the sampling distribution is remarkable, but statistical theory allows more precision. One can transform the values in the sampling distribution of the mean to a distribution of a test statistic. The appropriate test statistic is the distribution of the normal deviate, or *z*-distribution. It can be shown that

$$z = \frac{M - \mu}{\sigma / \sqrt{n}}$$

If the *z*-value were computed for the mean of all possible samples taken at random from the population, it would be distributed as shown in Panel D of Figure 2. It will be normal, have a mean of zero, and have a variance of 1.

Where is *M* = $33,200 under the distribution of the normal deviate using the sample size of *n* = 1,000? Its *z*-score using the above formula is

$$z = \frac{33{,}200 - 31{,}800}{12{,}000 / \sqrt{1{,}000}} = 3.69$$

Using tabled values for the normal deviate, the probability of a random sample of 1,000 cases from a population with a mean of $31,800 having a sample mean at least this far from the population mean is less than .001. Thus, it is extremely unlikely that the sample is purely random.

With the same sample mean but with a sample of only 100 people,

$$z = \frac{33{,}200 - 31{,}800}{12{,}000 / \sqrt{100}} = 1.17$$

Using tabled values for a two-tail test, the probability of getting a sample mean this far from the population mean with a sample of 100 people is about .24. One should not infer that the sample is nonrandom, since these results could happen about 24 percent of the time by chance.
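The two comparisons can be reproduced without a printed table. The sketch below assumes the *z* statistic is the distance between the sample mean and the population mean divided by σ/√*n*, and it obtains the two-tail probability from the complementary error function in Python's standard library rather than from tabled values. The function names are illustrative.

```python
import math

def z_for_sample_mean(sample_mean, pop_mean, pop_sd, n):
    """Normal deviate for a sample mean: (M - mu) / (sigma / sqrt(n))."""
    return (sample_mean - pop_mean) / (pop_sd / math.sqrt(n))

def two_tail_probability(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same sample mean ($33,200), two different sample sizes
for n in (1000, 100):
    z = z_for_sample_mean(33200, 31800, 12000, n)
    print(f"n = {n}: z = {z:.2f}, two-tail p = {two_tail_probability(z):.4f}")
```

With 1,000 cases the probability falls well below .001; with 100 cases it is roughly one in four, which is why the same sample mean leads to opposite conclusions.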

The four distributions can be described for any sample statistic one wants to test (means, differences of means, proportions, differences of proportions, correlations, etc). While many of the calculations will be more complex, their logic is identical.

## MULTIPLE TESTS OF SIGNIFICANCE

The logic of statistical inference applies to testing a single hypothesis. Since most studies include multiple tests, interpreting results can become extremely complex. If a researcher conducts 100 tests and every null hypothesis is true, about 5 of them can be expected to yield results that are statistically significant at the .05 level by chance alone. Therefore, a study that includes many tests may find some "interesting" results that appear statistically significant but that really are an artifact of the number of tests conducted.

Sociologists pay less attention to "adjusting the error rate" than do those in most other scientific fields. A conservative approach is to divide the Type I error rate by the number of tests conducted. This is known as the Dunn multiple comparison test, based on the Bonferroni inequality. For example, instead of doing nine tests at the .05 level, each test is done at the .05/9 ≈ .006 level. For the set of nine tests to be viewed as statistically significant at the .05 level, each specific test must be significant at the .006 level.
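The effect of running many tests, and of dividing α by the number of tests, can be sketched in a few lines. The familywise calculation below assumes the tests are independent, which is a simplification; the Bonferroni inequality itself does not require independence. Function names are illustrative.

```python
def expected_false_positives(num_tests, alpha):
    """Expected number of significant results when every null hypothesis is true."""
    return num_tests * alpha

def familywise_error_rate(num_tests, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests

def bonferroni_alpha(alpha, num_tests):
    """Per-test significance level after dividing alpha by the number of tests."""
    return alpha / num_tests

print(expected_false_positives(100, 0.05))   # about 5 of 100 tests
print(familywise_error_rate(100, 0.05))      # close to 1
per_test = bonferroni_alpha(0.05, 9)
print(round(per_test, 3))                    # the .006 level used for nine tests
print(familywise_error_rate(9, per_test))    # stays at or below .05
```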

There are many specialized multiple comparison procedures, depending on whether the tests are planned before the study starts or after the results are known. Brown and Melamed (1990) describe these procedures.

## POWER AND TYPE I AND TYPE II ERRORS

To this point, only one type of probability has been considered. Sociologists use statistical inference to minimize the chance of accepting a main hypothesis that is false in the population. They reject the null hypothesis only if the chances of its being true in the population are very small, say, α = .05. Still, by minimizing the chances of this error, sociologists increase the chance of failing to reject the null hypothesis when it should be rejected. Table 2 illustrates these two types of error.

Type I, or α, error is the probability of rejecting *H*0 falsely, that is, the error of deciding that *H*1 is right when *H*0 is true in the population. If one were testing whether a new program reduced drug abuse among pregnant women, the *H*1 would be that the program did this and the *H*0 would be that the program was no better than the existing one. Type I error should be minimized because it would be wrong to change programs when the new program was no better than the existing one. Type I error has been described as "the chances of discovering things that aren't so" (Cohen 1990, p. 1304). The focus on Type I error reflects a conservative view among scientists. Type I error guards against doing something new (as specified by *H*1) when it is not going to be helpful.

Table 2. Type I (α) and Type II (β) Errors

| Decision made by the researcher | True situation in the population: *H*0, the null hypothesis, is true | True situation in the population: *H*1, the main hypothesis, is true |
| --- | --- | --- |
| *H*0, the null hypothesis, is true | 1 − α | β |
| *H*1, the main hypothesis, is true | α | 1 − β |

Type II, or β, error is the probability of failing to reject *H*0 when *H*1 is true in the population. If one failed to reject the null hypothesis that the new program was no better (*H*0) when it was truly better (*H*1), one would put newborn children at needless risk. Type II error is the chance of missing something new (as specified by *H*1) when it really would be helpful.

Power is 1 − β. Power measures the likelihood of rejecting the null hypothesis when the alternative hypothesis is true. Thus, if there is a real effect in the population, a study that has a power of .80 can reject the null hypothesis with a likelihood of .80. The power of a statistical test is measured by how likely it is to do what one usually wants to do: demonstrate support for the main hypothesis when the main hypothesis is true. Using the example of a treatment for drug abuse among pregnant women, the power of a test is the ability to demonstrate that the program is effective if this is really true.

Power can be increased. First, get a larger sample. The larger the sample, the more power to find results that exist in the population. Second, increase the α level. Rather than using the .01 level of significance, a researcher can pick the .05 or even the .10. The larger α is, the more powerful the test is in its ability to reject the null hypothesis when the alternative is true.
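Both levers can be seen in a rough calculation for a two-tail one-sample *z* test. This is a sketch under stated assumptions: the population standard deviation is treated as known, and the standardized effect size of 0.2 is a hypothetical value chosen only for illustration. It uses `NormalDist` from Python's standard `statistics` module.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_test_power(effect_size, n, alpha):
    """Approximate power of a two-tail one-sample z test.

    effect_size is (true mean - null mean) / population SD.
    """
    z_crit = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n)  # where the z statistic is centered under H1
    # Probability of landing in either rejection region when H1 is true
    return STANDARD_NORMAL.cdf(shift - z_crit) + STANDARD_NORMAL.cdf(-shift - z_crit)

HYPOTHETICAL_EFFECT = 0.2  # assumed for illustration, not taken from the text
for n in (50, 200):
    for alpha in (0.01, 0.05, 0.10):
        power = z_test_power(HYPOTHETICAL_EFFECT, n, alpha)
        print(f"n = {n}, alpha = {alpha:.2f}: power = {power:.2f}")
```

Reading down the output, power rises with the sample size at every α, and within each sample size it rises as α is relaxed from .01 to .10.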

There are problems with both approaches. Increasing sample size makes the study more costly. If there are risks to the subjects who participate, adding cases exposes additional people to that risk. An example of this would be a study that exposed subjects to a new drug treatment program that might create more problems than it solved. A larger sample will expose more people to these risks.

Since Type I and Type II errors are inversely related, raising α reduces β, thus increasing the power of the test. However, sociologists are hesitant to raise α, since doing so increases the chance of deciding something is important when it is not important. With a small sample, using a small α level such as .001 means there is a great risk of β error. Many small-scale studies have a Type II error of over .50. This is common in research areas that rely on small samples. For example, a review of one volume of the *Journal of Abnormal Psychology* (this journal includes many small-sample studies) found that those studies had an average Type II error of .56 (Cohen 1990). This means the psychologists had inadequate power to reject the null hypothesis when *H*1 was true. When *H*1 was true, the chance of rejecting *H*0 (i.e., power) was worse than that resulting from flipping a coin.

Some areas that rely on small samples because of the cost of gathering data or to minimize the potential risk to subjects require researchers to plan their sample sizes to balance α, power, sample size, and the minimum size of effect that is theoretically important. For example, if a correlation of .1 is substantively significant, a power of .80 is important, and an α = .01 is desired, a very large sample is required. If a correlation is substantively and theoretically important only if it is over .5, a much smaller sample is adequate. Procedures for doing a power analysis are available in Cohen (1988); see also Murphy and Myors (1998).
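The contrast between the two correlations can be approximated with the Fisher *z* transformation. This is one common large-sample approximation, used here as an assumption; it is not necessarily the procedure in the sources cited above, and exact tables may give slightly different numbers. The function name is illustrative.

```python
import math
from statistics import NormalDist

def sample_size_for_correlation(rho, power, alpha):
    """Approximate n needed for a two-tail test to detect a population correlation rho.

    Uses the Fisher z approximation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    normal = NormalDist()
    z_alpha = normal.inv_cdf(1.0 - alpha / 2.0)
    z_power = normal.inv_cdf(power)
    fisher_z = math.atanh(rho)
    return math.ceil(((z_alpha + z_power) / fisher_z) ** 2 + 3)

# Power of .80 at alpha = .01, as in the example
print(sample_size_for_correlation(0.1, 0.80, 0.01))  # more than a thousand cases
print(sample_size_for_correlation(0.5, 0.80, 0.01))  # a few dozen cases
```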

Power analysis is less important for many sociological studies that have large samples. With a large sample, it is possible to use a conservative α error rate and still have sufficient power to reject the null hypothesis when *H*1 is true. Therefore, sociologists pay less attention to β error and power than do researchers in fields such as medicine and psychology. When a sociologist has a sample of 10,000 cases, the power is over .90 that he or she will detect a very small effect as statistically significant. When tests are extremely powerful to detect small effects, researchers must focus on the substantive significance of the effects. A correlation of .07 may be significant at the .05 level with 10,000 cases, but that correlation is substantively trivial.

## STATISTICAL AND SUBSTANTIVE SIGNIFICANCE

Some researchers and many readers confuse statistical significance with substantive significance. Statistical inference does not ensure substantive significance, that is, ensure that the result is important. A correlation of .1 shows a weak relationship between two variables whether it is statistically significant or not. With a sample of 100 cases, this correlation will not be statistically significant; with a sample of 10,000 cases, it will be statistically significant. The smaller sample shows a weak relationship that might be a zero relationship in the population. The larger sample shows a weak relationship that is all but certainly a weak relationship in the population, although it is not zero. In this case, the statistical significance allows one to be confident that the relationship in the population is substantively weak.
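The dependence of statistical significance on sample size can be checked with the usual *t* statistic for a correlation, *t* = *r*√(*n* − 2)/√(1 − *r*²). The sketch below compares it with approximate two-tail critical values at the .05 level (about 1.98 for 98 degrees of freedom and about 1.96 for very large samples); those critical values are standard table figures supplied here, not computed by the code.

```python
import math

def correlation_t(r, n):
    """t statistic for testing that a correlation is zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# (correlation, sample size, approximate two-tail critical t at the .05 level)
cases = [
    (0.10, 100, 1.98),
    (0.10, 10_000, 1.96),
    (0.07, 10_000, 1.96),
]
for r, n, critical in cases:
    t = correlation_t(r, n)
    verdict = "significant" if abs(t) > critical else "not significant"
    print(f"r = {r:.2f}, n = {n}: t = {t:.2f} ({verdict} at .05)")
```

The same weak correlation moves from not significant to highly significant purely because *n* grows; its substantive size does not change.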

Whenever a person reads that a result is statistically significant, he or she can be confident that there is some relationship. The next step is to decide whether it is substantively significant or substantively weak. Power analysis is one way to make this decision. One can illustrate this process by testing the significance of a correlation. A population correlation of .1 is considered weak, a population correlation of .3 is considered moderate, and a population correlation of .5 or more is considered strong. In other words, if a correlation is statistically significant but .1 or lower, one has to recognize that this is a weak relationship—it is statistically significant but substantively weak. It is just as important to explain to the readers that the relationship is substantively weak as it is to report that it is statistically significant. By contrast, if a sample correlation is .5 and is statistically significant, one can say the relationship is both statistically and substantively significant.

Figure 3 shows power curves for testing the significance of a correlation. These curves illustrate the need to be sensitive to both statistical significance and substantive significance. The curve on the extreme left shows the power of a test to show that a sample correlation, *r*, is statistically significant when the population correlation, ρ (rho), is .5. With a sample size of around 100, the power of a test to show statistical significance approaches 1.0, or 100 percent. This means that any correlation that is this strong in the population can be shown to be statistically significant with a small sample.

What happens when the correlation in the population is weak? Suppose the true correlation in the population is .2. A sample with 500 cases almost certainly will produce a sample correlation that is statistically significant, since the power is approaching 1.0. Many sociological studies have 500 or more cases and produce results showing that substantively weak relationships, ρ = .2, are statistically significant. Figure 3 shows that even if the population correlation is just .1, a sample of 1,000 cases has the power to show a sample result that is statistically significant. Thus, any time a sample is 1,000 or larger, one has to be especially careful to avoid confusing statistical and substantive significance.
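Points on power curves of this kind can be approximated numerically. The sketch below uses the Fisher *z* approximation for a two-tail test at the .05 level; that choice is an assumption, and the figure itself may have been drawn with a different method, so the numbers should be read as rough checks rather than as the plotted values.

```python
import math
from statistics import NormalDist

def correlation_power(rho, n, alpha=0.05):
    """Approximate power to find a sample correlation significant when the population value is rho."""
    normal = NormalDist()
    z_crit = normal.inv_cdf(1.0 - alpha / 2.0)
    shift = math.atanh(rho) * math.sqrt(n - 3)  # center of the test statistic under rho
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)

# The three situations discussed in the text
for rho, n in ((0.5, 100), (0.2, 500), (0.1, 1000)):
    print(f"rho = {rho}, n = {n}: power = {correlation_power(rho, n):.3f}")
```

Under this approximation the first two cases have power above .99, and the weakest correlation with 1,000 cases still has power near .9.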

The guidelines for distinguishing between statistical and substantive significance are direct but often are ignored by researchers:

- If a result is not statistically significant, regardless of its size in the sample, one should be reluctant to generalize it to the population.
- If a result is statistically significant in the sample, this means that one can generalize it to the population but does not indicate whether it is a weak or a strong relationship.
- If a result is statistically significant and strong in the sample, one can both generalize it to the population and assert that it is substantively significant.
- If a result is statistically significant and weak in the sample, one can both generalize it to the population and assert that it is substantively weak in the population.

This reasoning applies to any test of significance. If a researcher found that girls have an average score of 100.2 on verbal skills and boys have an average score of 99.8, with girls and boys having a standard deviation of 10, one would regard this as a very weak relationship. If one constructed a histogram for both girls and boys, one would find them almost identical. This difference is not substantively significant. However, if there were a sufficiently large sample of girls and boys, say, *n* = 10,000, it could be shown that the difference is statistically significant. The statistical significance means that there is some difference, that the means for girls and boys are not identical. It is necessary to use judgment, however, to determine that the difference is substantively trivial. An abuse of statistical inference that can be committed by sociologists who do large-scale research is to confuse statistical and substantive significance.
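The verbal-skills example can be worked through numerically. The sketch assumes that *n* = 10,000 refers to each group and that the standard deviation of 10 can be treated as known, so a two-sample *z* test applies; both are assumptions made for illustration. The standardized difference is reported alongside the test because it does not depend on sample size.

```python
import math

def standardized_difference(mean_a, mean_b, sd):
    """Difference between two means expressed in standard deviation units."""
    return (mean_a - mean_b) / sd

def two_sample_z(mean_a, mean_b, sd, n_per_group):
    """z statistic for a difference of means with a common, known SD and equal group sizes."""
    standard_error = sd * math.sqrt(2.0 / n_per_group)
    return (mean_a - mean_b) / standard_error

def two_tail_probability(z):
    """P(|Z| >= |z|) for a standard normal Z."""
    return math.erfc(abs(z) / math.sqrt(2.0))

print("difference in SD units:", standardized_difference(100.2, 99.8, 10))
for n_per_group in (100, 10_000):
    z = two_sample_z(100.2, 99.8, 10, n_per_group)
    print(f"n per group = {n_per_group}: z = {z:.2f}, "
          f"two-tail p = {two_tail_probability(z):.4f}")
```

The difference is about four-hundredths of a standard deviation in both cases; only the large sample makes it statistically significant.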

## NONRANDOM SAMPLES AND STATISTICAL INFERENCE

Very few researchers use true random samples. Sometimes researchers use convenience sampling. An example is a social psychologist who has every student in a class participate in an experiment. The students in this class are not a random sample of the general population or even of students in a university. Should statistical inference be used here?

Other researchers may use the entire population. If one wants to know if male faculty members are paid more than female faculty members at a particular university, one may check the payroll for every faculty member. There is no sample—one has the entire population. What is the role of statistical inference in this instance?

Many researchers would use a test of significance in both cases, although the formal logic of statistical inference is violated. They are taking a "what if" approach. If the results they find could have occurred by a random process, they are less confident in their results than they would be if the results were statistically significant. Economists and demographers often report statistical inference results when they have the entire population. For example, if one examines the unemployment rates of blacks and whites over a ten-year period, one may find that the black rate is about twice the white rate. If one does a test of significance, it is unclear what the population is to which one wants to generalize. A ten-year period is not a random selection of all years. The rationale for doing statistical inference with population data and nonprobability samples is to see if the results could have been attributed to a chance process.

A related problem is that most surveys use complex sample designs rather than strictly random designs. A stratified sample or a clustered sample may be used to increase efficiency or reduce the cost of a survey. For example, a study might take a random sample of 20 high schools from a state and then interview 100 students from each of those schools. This survey will have 2,000 students but will not be a random sample because the 100 students from each school will be more similar to each other than to 100 randomly selected students. For instance, the 100 students from a school in a ghetto may mostly have minority status and mostly be from families that have a low income in a population with a high proportion of single-parent families. By contrast, 100 students from a school in an affluent suburb may be disproportionately white and middle class.

The standard statistical inference procedures discussed here that are used in most introductory statistics texts and in computer programs such as SAS and SPSS assume random sampling. When a different sampling design is used, such as a cluster design, a stratified sample, or a longitudinal design, the test of significance will be biased. In most cases, the test of significance will underestimate the standard errors and thus overestimate the test statistic (*z*, *t*, *F*). The extent to which this occurs is known as the "design effect." The most typical design effect is greater than 1.0, meaning that the computed test statistic is larger than it should be. Specialized programs allow researchers to estimate design effects and incorporate them in the computation of the test statistics. The most widely used of these procedures are WesVar, which is available from SPSS, and SUDAAN, a stand-alone program. Neither program has been widely used by sociologists, but their use should increase in the future.
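For a clustered sample such as the 20 schools with 100 students each, a rough sense of the problem comes from Kish's approximation, in which the design effect for equal-sized clusters is 1 + (*m* − 1) times the intraclass correlation. This formula defines the design effect as a variance ratio, which is one common convention and is supplied here as an assumption; the intraclass correlation of .05 is a hypothetical value, and specialized survey software estimates these quantities from the data rather than from a formula.

```python
import math

def design_effect(cluster_size, intraclass_corr):
    """Kish's approximation for equal-sized clusters: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1) * intraclass_corr

def effective_sample_size(n, deff):
    """Size of a simple random sample carrying the same information."""
    return n / deff

def inflate_standard_error(naive_se, deff):
    """Standard error after allowing for the clustering."""
    return naive_se * math.sqrt(deff)

HYPOTHETICAL_ICC = 0.05  # assumed similarity of students within a school
deff = design_effect(100, HYPOTHETICAL_ICC)
print("design effect:", deff)
print("effective sample size:", effective_sample_size(2000, deff))
print("standard error multiplier:", inflate_standard_error(1.0, deff))
```

Under these assumed values the 2,000 clustered interviews carry roughly the information of a few hundred independent ones, so a test that treats them as a simple random sample would overstate the test statistic.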

## REFERENCES

Agresti, Alan, and Barbara Finlay 1996 *Statistical Methods for the Social Sciences*. Englewood Cliffs, N.J.: Prentice-Hall.

Blalock, Hubert M., Jr. 1979 *Social Statistics*. New York: McGraw-Hill.

Bohrnstedt, George W., and David Knoke 1988 *Statistics for Social Data Analysis*, 2nd ed. Itasca, Ill.: F.E. Peacock.

Brown, Steven R., and Lawrence E. Melamed 1990 *Experimental Design and Analysis*. Newbury Park, Calif.: Sage.

Cohen, Jacob 1988 *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed. Hillsdale, N.J.: Erlbaum.

—— 1990 "Things I Have Learned (So Far)." *American Psychologist* 45:1304–1312.

Loether, Herman J., and Donald G. McTavish 1993 *Descriptive and Inferential Statistics*. New York: Allyn and Bacon.

Murphy, Kevin R., and Brett Myors 1998 *Statistical Power Analysis: A Simple and General Model for Traditional and Modern Hypothesis Tests*. Hillsdale, N.J.: Erlbaum.

Raymondo, James 1999 *Statistical Analysis in the Social Sciences*. New York: McGraw-Hill.

Vaughan, Eva D. 1997 *Statistics: Tools for Understanding Data in Behavioral Sciences*. Englewood Cliffs, N.J.: Prentice-Hall.

Alan C. Acock

## statistical inference

**statistical inference** The process by which results from a sample may be applied more generally to a population. More precisely, how inferences may be drawn about a population, based on results from a sample of that population.

Inferential statistics are generally distinguished as a branch of statistical analysis from descriptive statistics, which describe variables and the strength and nature of relationships between them, but do not allow generalization. The ability to draw inferences about a population from a sample of observations from that population depends upon the sampling technique employed. The importance of a scientific sample is that it permits statistical generalization or inference. For example, if we survey a simple random sample of university students in Britain and establish their average (mean) height, we will be able to infer the range within which the mean height of all university students in Britain is likely to fall. Other types of sample, such as quota samples, do not allow such inferences to be drawn. The accuracy with which we are able to estimate the population mean from the sample will depend on two things (assuming that the sample has been drawn correctly): the size of the sample and the variability of heights within the population. Both these factors are reflected in the calculation of the standard error. The bigger the standard error, the less accurate the sample mean will be as an estimate of the population mean.

Strictly speaking, therefore, inferential statistics is a form of inductive inference in which the characteristics of a population are estimated from data obtained by sampling that population. In practice, however, the methods are called upon for the more ambitious purpose of prediction, explanation, and hypothesis testing.

## inferential statistics

**inferential statistics** Statistics which permit the researcher to demonstrate the probability that the results deriving from a sample are likely to be found in the population from which the sample has been drawn. They therefore allow sociologists to generalize from representative samples, by applying ‘tests of significance’ to patterns found in these samples, in order to determine whether these hold for populations as a whole. The other type of statistics in which sociologists are interested is descriptive statistics, which summarize the patterns in the responses within a data-set, and provide information about averages, correlations, and so forth. See also SIGNIFICANCE TESTS; STATISTICAL INFERENCE.