Statistics in the Social Sciences
Social science research generally relies on statistical analyses for the understanding of behavioral phenomena. Data analysis generally begins with an examination of descriptive statistics, then proceeds to inferential statistics.
The arithmetic mean is the most common measure of central tendency. It is calculated by adding all the scores in a distribution and dividing by the number of scores. A less frequently used average is the median; it is calculated by ordering the numbers by size and identifying the middle-most score. The least used measure of central tendency is the mode, the most frequently occurring score.
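A minimal sketch of the three averages, using Python's standard statistics module and a hypothetical set of scores:

```python
from statistics import mean, median, mode

scores = [2, 3, 3, 4, 5, 5, 5, 6, 7]  # hypothetical distribution

print(mean(scores))    # sum of the scores divided by their count
print(median(scores))  # the middle-most score after ordering by size
print(mode(scores))    # the most frequently occurring score
```

Here the mean (about 4.44) differs from the median and mode (both 5), foreshadowing the point below about which average to report.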
When scores fall in a normal distribution, the mean is the most useful average because it gives a good sense of typical scores and because researchers can use it in subsequent sophisticated statistical tests. However, if a group of numbers is non-normal or has extreme values, the mean may not give a good sense of typicality; extreme scores have an inordinate impact on the mean, elevating or reducing it to give a false sense of typical values. In such a situation, researchers may use the median. Finally, the mode can be useful when an investigator is interested simply in getting a count of the number of observations in a given category.
To get a sense of the degree to which scores disperse, researchers calculate a measure of variability. The most common measures of variability are the variance and the standard deviation (SD); the SD is the square root of the variance. The SD tells, on average, how far any given score is likely to fall from the mean. Researchers prefer these measures with the mean because they can be used with sophisticated data analyses.
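The relation between the two measures can be sketched directly; this example uses the population forms (`pvariance`, `pstdev`), which divide by the number of scores, while the sample forms (`variance`, `stdev`) divide by one less than that number:

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical distribution

var = pvariance(scores)  # population variance: mean squared deviation
sd = pstdev(scores)      # the SD is the square root of the variance

assert isclose(sd, sqrt(var))
```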
When investigators compute the median, they are likely to make use of a different measure of variability, the semi-interquartile range (SIQR). To compute the SIQR, the researcher identifies the scores at the twenty-fifth and the seventy-fifth percentile ranks and takes half the difference between them.
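A sketch of the SIQR, computed as half the distance between the first and third quartiles. Note that quartile estimates depend on the interpolation method; the "inclusive" method used here is one common convention, and other methods give slightly different cut points:

```python
from statistics import quantiles

scores = [1, 3, 4, 5, 6, 7, 8, 9, 12, 15]  # hypothetical distribution

# Cut points at the 25th, 50th, and 75th percentile ranks; the
# "inclusive" method interpolates between adjacent ordered scores.
q1, _, q3 = quantiles(scores, n=4, method="inclusive")

siqr = (q3 - q1) / 2  # half the interquartile range
```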
The least sophisticated, and least employed, measure of variability is the range. It is simply the difference between the highest and lowest score. It is greatly affected by extreme scores, so the range does not provide a useful measure of the degree to which scores cluster.
A final type of descriptive statistic is the standardized score, commonly the z-score. A standardized score indicates how far a given number in a distribution falls from the mean in standard deviation units. Thus, if a score fell one standard deviation above the mean, its standardized z-score would be +1.00. Or if a score fell half a standard deviation below the mean, its z-score would be −0.50. This descriptive z-score is used differently from the inferential statistic by the same name.
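The standardization described above can be sketched with a hypothetical distribution whose mean is 5 and whose population SD is 2:

```python
from statistics import mean, pstdev

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical distribution
mu, sigma = mean(scores), pstdev(scores)

# Each z-score expresses distance from the mean in SD units
z_scores = [(x - mu) / sigma for x in scores]

# A raw score of 9 lies two SDs above the mean (z = +2.00);
# a raw score of 4 lies half an SD below it (z = -0.50).
```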
After an assessment using descriptive statistics, researchers conduct inferential tests designed to determine whether the obtained results are likely to generalize beyond the sample of subjects tested. Most of the statistics are based on certain assumptions regarding the data. In order for the tests to be maximally informative, data should be normally distributed and the variances of different groups should be equal. For some tests, there is an additional assumption of equal sample sizes across groups. Although the tests are robust enough to permit some violation of assumptions, the greater the departure from the assumptions, the less optimal the information provided by the tests.
The most commonly used statistical tests in the social and behavioral sciences are the analysis of variance (ANOVA) and related tests, including the Student’s t-test; Pearson product-moment correlation; regression analysis; and the chi-square test. All of these except the chi-square test fall conceptually within the domain of the general linear model (GLM). ANOVA is a specific case of linear regression that is, in turn, a special case of the GLM. The basic premise of the GLM is that one can express the value of a dependent variable as a linear combination of the effects of a set of independent (or predictor) variables plus an error effect.
For historical reasons, researchers have treated ANOVA and related models and linear regression as different statistical approaches. Theoretically, these various tests are closely related, but in application, they are algebraically different. Hence, many researchers have not known of the close relation between them. As computer-based analyses have become nearly ubiquitous, however, a merging of the different approaches has begun (Howell 2007).
Researchers use the ANOVA and the Student’s t-test to assess whether reliable differences exist across groups. Historically, the t-test is the older of the two approaches, but the ANOVA is used more frequently. The two tests lead to identical conclusions. In fact, for a two-group t-test, there is an identity relation between an obtained t-value and the F-value, namely, t² = F.
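The identity t² = F can be verified from first principles with two hypothetical groups, computing the pooled-variance t statistic and the one-way ANOVA F statistic for the same data:

```python
from math import isclose
from statistics import mean

g1 = [4.0, 5.0, 6.0, 7.0]  # hypothetical group 1
g2 = [6.0, 7.0, 8.0, 9.0]  # hypothetical group 2

n1, n2 = len(g1), len(g2)
m1, m2 = mean(g1), mean(g2)
grand = mean(g1 + g2)

# Pooled-variance two-sample t statistic
ss1 = sum((x - m1) ** 2 for x in g1)
ss2 = sum((x - m2) ** 2 for x in g2)
sp2 = (ss1 + ss2) / (n1 + n2 - 2)            # pooled variance
t = (m1 - m2) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F statistic for the same two groups
ss_between = n1 * (m1 - grand) ** 2 + n2 * (m2 - grand) ** 2
ms_between = ss_between / 1                  # df between = k - 1 = 1
ms_within = (ss1 + ss2) / (n1 + n2 - 2)      # df within = N - k
F = ms_between / ms_within

assert isclose(t ** 2, F)  # the two-group identity holds
```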
The z-test is conceptually similar to the t-test, addressing the same questions, but is much less frequently used because the z-test requires that the investigator know the population mean and variance (and standard deviation). The t-test requires only estimates of the population mean and standard deviation, which the researcher can discern from the data. It is rare for researchers to know the population parameters.
The ANOVA is useful for comparing means of multiple groups in studies with a single independent variable and for comparing means of two or more groups in studies with multiple independent variables. The advantage of ANOVA for single-variable research is that it permits a global assessment of potential differences with a single test. The advantage of ANOVA with multiple independent variables is that the investigator can spot interactions among variables.
The ANOVA lets the researcher know if means of any two groups differ reliably, but it does not specify exactly which of the means differ. In order to determine which of multiple means differ from one another, researchers employ post hoc analyses. When a study has multiple dependent variables, or outcome measures, researchers use an extension of the ANOVA, the multivariate analysis of variance (MANOVA).
Sometimes investigators are interested in whether two factors covary. That is, is there an association between them such that by knowing a value of a measurement on one variable, one can make a reasonable prediction about the value on the second variable? Researchers use the Pearson product-moment correlation to identify the strength of an association between variables and regression analysis for making predictions.
Correlations can be positive, wherein as the score by a subject on one variable increases in magnitude, so does the magnitude on the second variable. Correlations can also be negative, as when increasing scores on one variable are paired with decreasing scores on the other variable. Or there can be no relation between the two variables. The sign, positive or negative, of the correlation does not indicate the strength of a relation, only the relative direction of change of scores on the two variables. The strength of a relation between variables is signaled by the absolute value of the correlation coefficient. These coefficients range from -1.00 to +1.00.
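A sketch of the Pearson product-moment correlation for hypothetical paired observations, computed as the sum of cross-products of deviations divided by the square root of the product of the two sums of squares:

```python
from math import sqrt

# Hypothetical paired observations on two variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

mx, my = sum(x) / len(x), sum(y) / len(y)
cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cross / sqrt(sum((a - mx) ** 2 for a in x) *
                 sum((b - my) ** 2 for b in y))

# r is positive here (scores rise together); its absolute value,
# bounded between 0 and 1.00, indexes the strength of the relation.
```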
In some research, the investigator tries to predict a dichotomous outcome (e.g., success versus failure). In this instance, the appropriate test is known as logistic regression.
The commonly used statistical tests make assumptions about characteristics of data. When the assumptions are not tenable, researchers can use nonparametric statistics, sometimes called distribution-free tests because they do not make any assumptions about the nature of the data in a distribution. Their advantage is that they do not require that data be normally distributed or that different groups have equal variability. Their disadvantage is that they discard some information in the data; thus they are, statistically speaking, less efficient. There are nonparametric statistics that correspond to the commonly used parametric tests.
When a two-sample t -test is inappropriate because of the nature of the distribution of scores, researchers can use either the Wilcoxon rank-sum test or the Mann-Whitney U. Instead of a one-sample paired t -test, one could use Wilcoxon’s matched-pairs signed-ranks test or the sign test.
A replacement for the ANOVA is the Kruskal-Wallis one-way ANOVA for independent samples. When there are repeated measurements of subjects, Friedman’s rank test for k correlated samples can be appropriate to replace the ANOVA.
When a study involves observations of the frequency with which observations fall into various categories, a common nonparametric test is the chi-square test. There are two varieties in common use, depending on whether the research design involves a single variable or two variables. The one-variable chi-square is a goodness of fit test, indicating whether the predicted number of observations in specified categories matches the observed number. The two-variable chi-square test assesses whether the frequencies of observations in different categories are contingent upon values of the second variable.
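The one-variable (goodness-of-fit) case can be sketched with a hypothetical example of 100 coin flips, comparing observed category counts against the counts predicted under the hypothesis of a fair coin:

```python
# Hypothetical goodness-of-fit chi-square: 100 coin flips
observed = [44, 56]   # heads, tails
expected = [50, 50]   # predicted counts if the coin is fair

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# chi2 = 1.44 here; the researcher compares it against a
# critical value with one degree of freedom.
```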
The logic of inferential tests of statistical significance is to establish a null hypothesis, H₀, that specifies no relation between variables or no differences among means, then to see if there is enough evidence to reject H₀ in favor of the experimental (or alternate) hypothesis. Rejection of H₀ is associated with an a priori probability value that the researcher establishes. Most researchers select a probability value of .05, meaning that they will reject H₀ if the probability of having gotten the obtained results is less than 5 percent if H₀ is actually true.
When rejecting H₀, researchers note that their results are statistically significant. Nonscientists often erroneously interpret the word significant to mean important. This technical wording is not intended to convey a sense that the results are important or practical. This wording simply means that the results are reliable, or replicable.
One objection to null hypothesis testing is that the cutoff for statistical significance is traditional and arbitrary. A probability value of .051 conveys much the same effect as a statistically significant value of .05, yet researchers have traditionally claimed that the slightly higher probability value puts their results into a different category of results, one reflecting no interesting findings and insufficient information to reject H₀.
The approach of most researchers is to question the likelihood of obtaining the results if H₀ is true. Thus, any conclusion specifies how rare the results were if H₀ is true, but, logically, this approach says nothing about whether H₀ actually is true. The logic of traditional tests of the null hypothesis is often confused with the question of whether H₀ is true, given the results that were obtained. These two questions are conceptually very different. As a result, the traditional tests also say nothing about whether an experimental hypothesis, presumably the one of interest, is true.
Controversy currently exists regarding the adequacy of such tests of the null hypothesis. Theorists have argued that H₀ is, strictly speaking, never true. That is, there will always be differences across groups or relations among variables, even if the differences are small or the relations weak. As a result, researchers have proposed alternate strategies for answering research questions.
One suggestion is to report effect sizes rather than simple probability values. Generally speaking, there are two families of effect sizes, d-families for questions of difference and r-families for questions of association (Rosenthal 1994). In an experiment, an effect size is a statistic that assesses the degree of variability due to treatments compared to variability caused by unknown factors like measurement error. A common measure of effect size when two groups are compared is Cohen’s d. For more complex designs, investigators assess effect size through measures like eta-squared (η²) and omega-squared (ω²).
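Cohen's d for two groups can be sketched as the mean difference expressed in pooled-standard-deviation units, using hypothetical treatment and control scores:

```python
from statistics import mean

# Hypothetical treatment and control scores
treatment = [5.0, 6.0, 7.0, 8.0]
control = [3.0, 4.0, 5.0, 6.0]

m1, m2 = mean(treatment), mean(control)
ss1 = sum((x - m1) ** 2 for x in treatment)
ss2 = sum((x - m2) ** 2 for x in control)

# Pooled SD: square root of the pooled variance
pooled_sd = ((ss1 + ss2) / (len(treatment) + len(control) - 2)) ** 0.5

d = (m1 - m2) / pooled_sd  # mean difference in pooled-SD units
```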
Effect sizes in correlational analyses include r², which indicates the percentage of common variability among correlated variables. (In linear regression analysis, r² has a different meaning, relating to error in predicting the value of a criterion variable from predictor variables. It is conceptually related to the similarly named statistic in correlations.)
A different supplement to traditional hypothesis testing involves the use of confidence intervals (CI). A CI specifies the range of scores that can be considered plausible estimates of some population parameter. The parameter most often estimated is the mean. Thus a CI for a mean gives the investigator a sense of values of the population mean that seem reasonable, given sample values.
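A CI for the mean can be sketched as the sample mean plus or minus a critical value times the standard error. This example uses the large-sample (normal) critical value of 1.96 for 95 percent coverage as a simplifying assumption; a small sample like this one would properly use a t critical value instead:

```python
from statistics import mean, stdev

scores = [12, 15, 14, 10, 13, 14, 16, 12, 11, 13]  # hypothetical sample

n = len(scores)
m = mean(scores)
se = stdev(scores) / n ** 0.5  # standard error of the mean

# 1.96 is the normal-approximation critical value for 95 percent coverage
ci = (m - 1.96 * se, m + 1.96 * se)
```

The resulting interval brackets the sample mean and gives the investigator a range of plausible values for the population mean.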
Some investigators have suggested that CIs be used to increase the information provided by null hypothesis testing because CIs give information about potential effect sizes, about the statistical power of a study, and about comparability of results across different studies. As with any emerging statistical technique, there are controversies associated with the use of CIs (Cohen 1994; Cumming and Finch 2005), including the fact that researchers in the social and behavioral sciences are still relatively untutored in the use of CIs (Belia et al. 2005).
One recent approach to hypothesis testing has been to replace the traditional probability value (i.e., what is the probability that the result occurred given that H₀ is true?) with the probability that an effect can be reproduced if a study were replicated (Killeen 2005). Theorists have argued that social and behavioral researchers are typically interested in whether their effects are real and will replicate, not whether the researcher can reject a statement of no effect. Thus it makes more sense to use a statistic that allows a prediction about how likely a result is to recur if a study is repeated.
A final, alternate approach to hypothesis testing involves Bayesian statistics. This approach relies on an investigator’s prior knowledge and beliefs about the outcome of research. Given certain a priori assumptions, it is possible to compute the probability that H₀ is true, given the data, instead of merely the likelihood of obtaining the data, assuming that H₀ is true. Unfortunately, accurate probability values that would make the Bayesian approach feasible in the social and behavioral sciences are very difficult to ascertain, so few researchers have adopted this approach.
The noted psychologist S. S. Stevens (1906–1973) identified four scales of measurement, in increasing order of mathematical sophistication. Nominal scales are simply categorical (e.g., female-male). Ordinal scales are ordered in magnitude, but the absolute magnitude of the difference between any two observations is unspecified (e.g., the finishing order of the first-, second-, and third-place candidates in an election). With interval scales, a given difference between any two scores means the same at all points along the scale (e.g., the difference on a seven-point rating scale from one to two is equivalent to the difference between six and seven). A ratio scale involves not only equality of differences between scores but also equality of ratios among scores (e.g., a task taking four seconds requires twice as much time as a task taking two seconds, and ten seconds is twice as long as five).
Behavioral researchers have sometimes claimed that parametric statistics require interval or ratio data. In reality, determining the scale of measurement of data is not always straightforward. Furthermore, statisticians do not agree that such distinctions are important (Howell 2007; Velleman and Wilkinson 1993). In practice, social researchers do not pay a great deal of attention to scales in determining statistical tests. Furthermore, some tests that purport to be useful for one type of scale (e.g., the phi coefficient for categorical data) are actually algebraic variations on formulas for parametric tests (e.g., Pearson’s r ).
SEE ALSO Bayesian Statistics; Central Tendencies, Measures of; Chi-Square; Classical Statistical Analysis; Cliometrics; Data, Longitudinal; Data, Pseudopanel; Econometrics; Hypothesis and Hypothesis Testing; Least Squares, Ordinary; Logistic Regression; Mean, The; Measurement; Methods, Quantitative; Mode, The; Probability; Psychometrics; Regression; Regression Analysis; Social Science; Standard Deviation; Statistics; Student’s T-Statistic; Test Statistics; Variability; Variance; Z-Test
Belia, Sarah, Fiona Fidler, Jennifer Williams, and Geoff Cumming. 2005. Researchers Misunderstand Confidence Intervals and Standard Error Bars. Psychological Methods 10 (4): 389–396.
Cohen, Jacob. 1994. The Earth Is Round (p < .05). American Psychologist 49 (12): 997–1003.
Cumming, Geoff, and Sue Finch. 2005. Inference by Eye: Confidence Intervals and How to Read Pictures of Data. American Psychologist 60 (2): 170–180.
Howell, David C. 2007. Statistical Methods for Psychology. 6th ed. Belmont, CA: Wadsworth.
Killeen, Peter R. 2005. An Alternative to Null-Hypothesis Significance Tests. Psychological Science 16 (5): 345–353.
Rosenthal, Robert. 1994. Parametric Measures of Effect Size. In The Handbook of Research Synthesis, ed. Harris Cooper and Larry V. Hedges, 231–244. New York: Russell Sage Foundation.
Velleman, Paul F., and Leland Wilkinson. 1993. Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading. The American Statistician 47: 65–73.
Wilkinson, Leland, and the Task Force on Statistical Inference. 1999. Statistical Methods in Psychology: Guidelines and Explanations. American Psychologist 54: 594–604.
Bernard C. Beins