Meta-analysis is the practice of statistically summarizing empirical findings from different studies in order to reach generalizations about the obtained results. Thus, "meta-analysis" literally refers to the analysis of analyses. Meta-analysis, a term coined by Glass (1976), is also known as research synthesis and quantitative reviewing. Because progress within any scientific field has always hinged on cumulating empirical evidence about phenomena in an orderly and accurate fashion, reviews of studies have historically proved extremely influential (e.g., Mazela and Malin 1977). With the exponential growth in the number of studies available on any given social scientific topic, the need for such reviews has increased proportionally, making them more important than ever. The empirical evidence, consisting of multiple studies examining a phenomenon, exists as a literature on the topic. Although new studies rarely replicate earlier studies without changing or adding new features, many studies can be described as conceptual replications that use different stimulus materials and dependent measures to test the same hypothesis, and still others might contain exact replications embedded within a larger design that adds new experimental conditions. In other instances, repeated tests of a relation accrue in a less systematic manner because researchers sometimes include in their studies tests of particular hypotheses in auxiliary or subsidiary analyses.
In order to reach conclusions about empirical support for a phenomenon, it is necessary to compare and contrast the findings of relevant studies. Therefore, accurate comparisons of study outcomes—reviews of research—are at the very heart of the scientific enterprise. Until recently these comparisons were nearly always made using informal methods that are now known as narrative reviewing, a practice by which scholars drew overall conclusions from their impressions of the overall trend of the studies' findings, sometimes guided by a count of the number of studies that had either produced or failed to produce statistically significant findings in the hypothesized direction. Narrative reviews have appeared in many different contexts and still serve a useful purpose in writing that does not have a comprehensive literature review as its goal (e.g., textbook summaries, introductions to journal articles reporting primary research). Although narrative reviewing has often proved useful, it is frequently inadequate for reaching definitive conclusions about the degree of empirical support for a phenomenon or for a theory about the phenomenon. One indication of this inadequacy is that independent narrative reviews of the same literature often have reached differing conclusions.
COMMON PROBLEMS WITH NARRATIVE REVIEWS
Critics of the narrative reviewing strategy (e.g., Glass et al. 1981; Rosenthal 1991) have pointed to four general faults that frequently occur in narrative reviewing: (1) Narrative reviewing generally involves the use of a convenience sample of studies, perhaps consisting of only those studies that the reviewer happens to know. Because the parameters of the reviewed literature are typically not explicit, it is difficult to evaluate the adequacy of the definition of the literature or the thoroughness of the search for studies. If the sample of studies was biased, the conclusions reached may also be biased. (2) Narrative reviewers generally do not publicly state the procedures they used for either cataloging studies' characteristics or evaluating the quality of the studies' methods. Therefore, the review's claims about the characteristics of the studies and the quality of their methods are difficult to judge for their accuracy. (3) In cases in which study findings differed, narrative reviewing has difficulty in reaching clear conclusions about whether differences in study methods explain differences in results. Because narrative reviewers usually do not systematically code studies' methods, these reviewing procedures are not well suited to accounting for inconsistencies in findings.
(4) Narrative reviewing typically relies much more heavily on statistical significance to judge studies' findings than on the magnitude of the findings. Statistical significance is a poor basis for comparing studies that have different sample sizes, because effects of identical magnitude can differ widely in statistical significance. Because of this problem, narrative reviewers often reach erroneous conclusions about a pattern in a series of studies, even in literatures as small as ten studies (Cooper and Rosenthal 1980).
As the number of available studies cumulates, the conclusions reached in narrative reviews become increasingly unreliable because of the informality of the methods they use to draw these conclusions. Indeed, some historical scholars have attributed crises of confidence in central social scientific principles to apparent failures to replicate findings across studies (e.g., Johnson and Nichols 1998). Clearly, there will be practical limitations on the abilities of scholars to understand the vagaries of a literature containing dozens if not hundreds of studies (e.g., by 1978, there were at least 345 studies examining interpersonal expectancy effects, according to Rosenthal and Rubin 1978; and by 1983, there were over 1,000 studies evaluating whether birth order is related to personality, as reported by Ernst and Angst 1983). From this perspective, the social sciences might be considered victims of their own success: Although social scientists have been able to collect vast amounts of data about a myriad of phenomena, they were forced to rely on their intuition when it came to assessing the state of knowledge about popular topics. Since the 1980s, however, as scholars have gained increasing expertise in reviewing research literatures, literatures that once appeared haphazard at best and fragile at worst now frequently are shown to have substantial regularities (Johnson and Nichols 1998). For example, although scholars working from the 1950s through the 1970s frequently reached conflicting conclusions about whether men or women (or neither) are more easily influenced by others, reviewers using meta-analytic techniques have found highly reliable tendencies in this same literature. Specifically, Eagly and Carli's meta-analysis (1981) showed that men are more influenced than women when the communication topic is feminine (e.g., sewing), and that women are more influenced than men when the topic is masculine (e.g., automobiles).
Moreover, contemporary meta-analysts now almost routinely move beyond relatively simple questions of whether one variable relates to another to the more sophisticated question of when the relation is larger, smaller, or reverses in sign. Thus, there is, indeed, a great deal of replicability across a wide array of topics, and inconsistencies among study findings can often be explained on the basis of methodological differences among the studies.
META-ANALYTIC REVIEWS OF EVIDENCE
Because of the importance of comparing study findings accurately, scholars have dedicated considerable effort to making the review process as reliable and valid as possible and thereby circumventing the criticisms listed above. These efforts highlight the proposition that research synthesis is a scientific endeavor—there are identifiable and replicable methods involved in producing reliable and valid reviews (Cooper and Hedges 1994). Although scientists have cumulated empirical data from independent studies since the early 1800s (see Stigler 1986), relatively sophisticated techniques for synthesizing study findings emerged only after the development of such standardized indexes as r-, d-, and p-values, around the turn of the twentieth century (see Olkin 1990). Reflecting the field's maturation, Hedges and Olkin (1985) presented a sophisticated version of the statistical bases of meta-analysis, and standards for meta-analysis have grown increasingly rigorous. Meta-analysis is now quite common and well accepted because scholars realize that careful application of these techniques often will yield the clearest conclusions about a research literature (Cooper and Hedges 1994; Hunt 1997).
Conducting a meta-analysis generally involves seven steps: (1) determining the theoretical domain of the literature under consideration, (2) setting boundaries for the sample of studies, (3) locating relevant studies, (4) coding studies for their distinctive characteristics, (5) estimating standardized effect sizes for each study, (6) analyzing the database, and (7) interpreting and presenting the results. The first conceptual step is to specify with great clarity the phenomenon under review by defining the variables whose relation is the focus of the review. Ordinarily a synthesis evaluates evidence relevant to a single hypothesis; the analyst studies the history of the research problem and of typical studies in the literature. Typically, the research problem will be defined as a relation between two variables, such as the influence of an independent variable on a dependent variable (e.g., the influence of silicone breast implants on connective tissue disease, as reported by Perkins et al. 1995). Moreover, a synthesis must take study quality into account at an early point to determine the kinds of operations that constitute acceptable operationalizations of these conceptual variables. Because studies testing a particular hypothesis typically differ in the operations used to establish the variables, it is no surprise that these different operations are often associated with variability in studies' findings. If the differences in studies' operations can be appropriately judged or categorized, it is likely that an analyst can explain some of this variability in effect size magnitude.
The most common way to test competing explanations is to examine how findings pattern across studies. Specifically, a theory might imply that a third variable should influence the relation between the independent and dependent variables: The relation should be larger or smaller with a higher level of this third variable. Treating this third variable as a potential moderator of the effect, the analyst would code the studies for their status on the moderator. This meta-analytic strategy, known as the moderator variable approach, tests whether the moderator affects the examined relation across the studies included in the sample. This moderator variable approach, advancing beyond the simple question of whether the independent variable is related to the dependent variable, addresses the question of when the magnitude or sign of the relationship varies. In addition to this moderator variable approach to synthesizing studies' findings, other strategies have proved to be useful. In particular, a theory might suggest that a third variable serves as a mediator of the critical relation because it conveys the causal impact of the independent variable on the dependent variable. If at least some of the primary studies within a literature have evaluated this mediating process, mediator relations can be tested within a meta-analytic framework by performing correlational analyses that are an extension of path analysis with primary-level data (Shadish 1996).
Clearly, only some studies will be relevant to the conceptual relation that is the focus of the meta-analysis, so analysts must define boundaries for the sample of studies, the second step in conducting a meta-analysis. Decisions about the inclusion of studies are important because the inferential power of any meta-analysis is limited by the methods of the studies that are integrated. To the extent that all (or most) of the reviewed studies share a particular methodological limitation, any synthesis of these studies would be limited in this respect. As a general rule, research syntheses profit by focusing on the studies that used stronger methods to test the meta-analytic hypotheses. Nonetheless, it is important to note that studies that have some strengths (e.g., manipulated independent variables) may have other weaknesses (e.g., deficiencies in ecological validity). In deciding whether some studies may lack sufficient rigor to include in the meta-analysis, it is important to adhere to methodological standards within the area reviewed. Although a large number of potential threats to methodological rigor have been identified (Campbell and Stanley 1963; Cook and Campbell 1979), there are few absolute standards of study quality that can be applied uniformly in every meta-analysis. As a case in point, although published studies are often thought to be of higher quality than unpublished studies, there is little basis for this generalization: Many unpublished studies (e.g., dissertations) have high quality, and many studies published in reputable sources do not. It is incumbent on the analyst to define the features of a high-quality study and to apply this definition to all studies in the literature, regardless of such considerations as the reputation of the journal.
Analysts often set the boundaries of the synthesis so that the methods of included studies differ dramatically only on critical moderator dimensions. If other, extraneous dimensions are thereby held relatively constant across the reviewed studies, moderator variable analyses can be more clearly interpreted. Nonetheless, an analyst should include in the sample all studies or portions of studies that satisfy the selection criteria, or, if an exhaustive sampling is not possible, a representative sample of those studies. Following this principle yields results that can be generalized to the universe of studies on the topic.
Because including a large number of studies generally increases the value of a quantitative synthesis, it is important to locate as many studies as possible that might be suitable for inclusion, the third step of a meta-analysis. To ensure that a sufficient sample of studies is located, reviewers are well advised to err in the direction of being extremely inclusive in their searching procedures. As described elsewhere (e.g., Cooper 1998; White 1994), there are many ways to find relevant studies; ordinarily, analysts should use all these techniques. Because computer searches of publication databases seldom locate all the available studies, it is important to supplement them by (1) examining the reference lists of existing reviews and of studies in the targeted literature, (2) obtaining published sources that have cited seminal articles within the literature, (3) contacting the extant network of researchers who work on a given topic to ask for new studies or unpublished studies, and (4) manually searching important journals to find some reports that might have been overlooked by other techniques.
Once the sample of studies is retrieved, analysts code them for their methodological characteristics, the fourth step in the process. The most important of these characteristics are potential moderator variables, which the analyst expects on an a priori basis to account for variation among the studies' effect sizes, or which can provide useful descriptive information about the usual context of studies in the literature. In some cases, reviewers recruit outside judges to provide ratings of the methods used in studies. Because accurate coding is crucial to the results of a meta-analysis, the coding of study characteristics should be carried out by two or more coders, and an appropriate index of interrater reliability should be calculated. To be included in a meta-analysis, a study must contain some report of a quantitative test of the hypothesis that is under scrutiny, so that summary statistics can be converted into effect sizes, the fifth step of the process. Most studies report the examined relation in one or more inferential statistics (e.g., t-tests, F-tests, r-values), which can be converted into an effect size (see Cooper and Hedges 1994b; Glass et al. 1981; Johnson 1993; Rosenthal 1991). The most commonly used effect size indexes in meta-analysis are the standardized difference and the correlation coefficient (see Rosenthal 1991, 1994). The standardized difference, which expresses the finding in standard deviation units, was first proposed by Cohen (1969) in the following form: g = (Ma - Mb) / SD, where Ma and Mb are the sample means of the two compared groups, and SD is the standard deviation pooled from the two groups. Because this formula overestimates population effect sizes to the extent that sample sizes are small, Hedges (1981) provided a correction for this bias; with the bias corrected, this effect estimate is conventionally known as d. Another common effect size is the correlation coefficient, r, which gauges the association between two variables.
Because the sampling distribution of a sample correlation coefficient tends to be skewed to the extent that the population correlation is large, it is conventional in meta-analysis to use a logarithmic transform of each correlation in statistical operations (Fisher 1921). The positive or negative sign of the effect sizes computed in a meta-analysis is defined so that studies with opposite outcomes have opposing signs. When a study examines the relation of interest within levels of another variable, effect sizes may be calculated within the levels of this variable as well as for the study as a whole. In addition to correcting the raw g and r because they are biased estimators of the population effect size, analysts sometimes correct for many other biases that accrue from the methods used in each study (e.g., unreliability of a measure; see Hunter and Schmidt 1990). Although it is unrealistic for analysts to take into account all potential sources of bias in a meta-analysis, they should remain aware of biases that may be important within the context of their research literature.
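These effect size computations can be sketched in a few lines of code. The following Python functions are an illustrative sketch rather than part of any particular meta-analysis package, and the function names are ours: they compute the standardized difference from group summary statistics, apply Hedges's (1981) small-sample correction, and apply Fisher's (1921) transform to a correlation.

```python
import math

def cohens_g(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference g = (Ma - Mb) / SD, with SD pooled
    from the two groups."""
    sd_pooled = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                          / (n_a + n_b - 2))
    return (mean_a - mean_b) / sd_pooled

def hedges_d(g, n_a, n_b):
    """Hedges's (1981) small-sample bias correction: d = J * g,
    with J = 1 - 3 / (4 * df - 1) and df = n_a + n_b - 2."""
    df = n_a + n_b - 2
    return (1 - 3 / (4 * df - 1)) * g

def fisher_z(r):
    """Fisher's (1921) variance-stabilizing transform of a correlation."""
    return 0.5 * math.log((1 + r) / (1 - r))
```

For example, with group means of 10 and 8, a pooled standard deviation of 2, and twenty participants per group, the uncorrected g is 1.0; the Hedges correction then shrinks it slightly, by the factor 1 - 3/(4 × 38 - 1).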
Once the effect sizes are calculated, they are analyzed, the sixth step of the process, using either fixed- or random-effects models. Fixed-effects models, which are the most commonly used, assume that there is one underlying, but unknown, effect size and that study estimates of this effect size vary only by sampling error. Random-effects models assume that each effect size is unique and that each study is drawn at random from a universe of related but separate effects (see Hedges and Vevea 1998 for a discussion). The general steps involved in the analysis of effect sizes are usually (1) to aggregate effect sizes across the studies to determine the overall strength of the relation between the examined variables; (2) to analyze the consistency of the effect sizes across the studies; (3) to diagnose outliers among the effect sizes; and (4) to test whether study attributes moderate the magnitude of the effect sizes.
Although several frameworks for modeling effect sizes have been developed (for reviews, see Johnson et al. 1995; Sánchez-Meca and Marín-Martínez 1997), the Hedges and Olkin fixed-effects approach (1985) appears to be the most popular and therefore is assumed in the remainder of this discussion. These statistics take advantage of the fact that studies have differing variances by calculating the nonsystematic variance of each effect size analytically (Hedges and Olkin 1985). Because this nonsystematic variance of an effect size is inversely proportional to the sample size of the study, and because sample sizes typically vary widely across studies, the error variances of the effect sizes are ordinarily quite heterogeneous. These meta-analytic statistics also permit an analysis of the consistency (or homogeneity) of the effect sizes across the studies, a highly informative analysis not produced by conventional, primary-level statistics. As the homogeneity calculation illustrates, analyzing effect sizes with specialized meta-analytic statistics rather than the ordinary inferential statistics used in primary research allows a reviewer to use a greater amount of the information available from the studies (Rosenthal 1991, 1995).
As a first step in a quantitative synthesis, the study outcomes are combined by averaging the effect sizes, with each weighted by the reciprocal of its variance; this procedure gives greater weight to the more reliably estimated study outcomes, which in general are those with the larger sample sizes (see Hedges et al. 1992; Johnson et al. 1995). As a test of the significance of this weighted mean effect size, a confidence interval is typically computed around the mean, based on its standard error: d+ ± 1.96 √v+, where v+ is the variance of d+ and 1.96 is the unit-normal value for a 95 percent confidence interval (CI) (assuming a nondirectional hypothesis). If the CI includes zero (0.00), the value indicating exactly no difference, it may be concluded that, aggregated across all studies, there is no significant association between the independent and dependent variables (X and Y). For example, Perkins and colleagues (1995) found no evidence across thirteen studies that silicone breast implants increased the risk of connective tissue disease. In a different literature, He and colleagues (1999) found that, across eighteen studies, nonsmokers exposed to passive smoke had a higher relative risk of coronary heart disease than nonsmokers not exposed to smoke.
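As an illustration of this aggregation step, the following Python sketch computes the large-sample variance of each d (a standard formula from Hedges and Olkin 1985), the inverse-variance weighted mean d+, and its 95 percent confidence interval. The function names are illustrative assumptions, not an established API.

```python
import math

def d_variance(d, n_a, n_b):
    """Large-sample variance of d (Hedges & Olkin 1985):
    v = (n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b))."""
    return (n_a + n_b) / (n_a * n_b) + d * d / (2 * (n_a + n_b))

def weighted_mean_effect(ds, vs):
    """Inverse-variance weighted mean d+ and its variance v+."""
    weights = [1 / v for v in vs]
    d_plus = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    v_plus = 1 / sum(weights)
    return d_plus, v_plus

def confidence_interval(d_plus, v_plus):
    """95 percent CI: d+ plus or minus 1.96 * sqrt(v+)."""
    half = 1.96 * math.sqrt(v_plus)
    return d_plus - half, d_plus + half
```

If the resulting interval excludes zero, the aggregate association is significant at the .05 level under a nondirectional test.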
Once a meta-analysis has derived a weighted mean effect size, this statistic, along with the other meta-analytic results, must be interpreted and presented, the seventh step of conducting a meta-analysis. If the mean effect is nonsignificant and the homogeneity statistic is small and nonsignificant, an analyst might conclude that there is no relation between the variables under consideration. However, in such cases, it is wise to consider the amount of statistical power that was available: If the total number of research participants in the integrated studies was small, it is possible that additional data would support the existence of the effect. Even if the mean effect is significant and the homogeneity statistic is small and nonsignificant, questions about the magnitude of the effect remain. To address this issue, Cohen (1969, 1988) proposed some guidelines for judging effect magnitude, based on his informal analysis of the magnitude of effects commonly yielded by psychological research. Cohen intended "that medium represent an effect of a size likely to be visible to the naked eye of a careful observer" (1992, p. 156). He intended that small effect sizes be "noticeably smaller yet not trivial" (p. 156) and that large effect sizes "be the same distance above medium as small is below it" (p. 156). As Table 1 shows, a "medium" effect turned out to be about d = 0.50 and r = .30, equivalent to the difference in intelligence scores between clerical and semiskilled workers. A "small" effect size was about d = 0.20 and r = .10, equivalent to the difference in height between 15- and 16-year-old girls. Finally, a "large" effect was about d = 0.80 and r = .50, equivalent to the difference in intelligence scores between college professors and college freshmen.
Table 1. Cohen's (1969) Guidelines for Magnitude of d and r
    size        d       r
    small      0.20    .10
    medium     0.50    .30
    large      0.80    .50
Another popular way to interpret mean effect sizes is to derive the equivalent r and square it. This procedure shows how much variability would be explained by an effect of the magnitude of the mean effect size. Thus, a mean d of 0.50 produces an R2 of .09. However, this value must be interpreted carefully because R2, or variance explained, is a directionless effect size. Therefore, if the individual effect sizes that produced the mean effect size varied in their signs (i.e., if the effect sizes were not all negative or all positive), the variance in Y explained by the predictor X, calculated for each study and averaged, would be larger than this simple transformation of the mean effect size. Thus, another possible procedure consists of computing R2 for each individual study and averaging these values.
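For readers who want to reproduce this transformation, a minimal Python sketch follows. It uses the common algebraic conversion r = d / sqrt(d² + 4), which assumes two equal-sized groups and so yields a somewhat more conservative value (r of about .24 for d = 0.50) than Cohen's benchmark pairing of d = 0.50 with r = .30; the function names are illustrative.

```python
import math

def r_from_d(d):
    """Convert d to its r equivalent, assuming two equal-sized groups."""
    return d / math.sqrt(d * d + 4)

def variance_explained(d):
    """Proportion of variance in Y explained by X for a given mean d."""
    return r_from_d(d) ** 2
```

As the text notes, squaring a mean effect size in this way discards sign information, so when individual effect sizes vary in sign, the average of the per-study r-squared values will exceed this single transformed value.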
When the weighted mean effect size and the CI are computed, the homogeneity of the d's is statistically examined in order to determine whether the studies can be adequately described by a single effect size (Hedges and Olkin 1985). If the effect sizes can be so described, then they differ only by unsystematic sampling error. If the fit statistic is significant, the weighted mean effect size may not adequately describe the outcomes of the set of studies, because it is likely that quite different mean effects exist in different groups of studies. Further explanatory work would be merited, even when the composite effect size is significant. The magnitude of individual study outcomes would differ systematically, and these differences may include differences in the direction (or sign) of the relation. In some studies, the independent variable might have had a large positive effect on the dependent variable, and in other studies, it might have had a smaller positive effect or even a negative effect. Even if the homogeneity test is nonsignificant, significant moderators could be present, especially when the fit statistic is relatively large (for further discussions, see Johnson and Turco 1992; Rosenthal 1995). Nonetheless, in a meta-analysis that attempts to determine the relation of one variable to another, rejecting the hypothesis of homogeneity could be troublesome, because it implies that the association between these two variables is likely complicated by the presence of interacting conditions. However, because analysts usually anticipate the presence of one or more moderators of effect size magnitude, establishing that effect sizes are not homogeneous is ordinarily neither surprising nor troublesome.
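The homogeneity test described above can be sketched as follows. This Python fragment is illustrative (the function name is our assumption) and implements the usual Q statistic of Hedges and Olkin (1985), which under the null hypothesis of a single common effect follows a chi-square distribution with k - 1 degrees of freedom for k studies.

```python
def homogeneity_q(ds, vs):
    """Q = sum of w_i * (d_i - d_plus)**2 over studies, with w_i = 1/v_i
    and d_plus the inverse-variance weighted mean effect size."""
    weights = [1 / v for v in vs]
    d_plus = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    return sum(w * (d - d_plus) ** 2 for w, d in zip(weights, ds))
```

For instance, two effect sizes of 0.10 and 0.50, each with variance .02, yield Q of about 4.0; with 1 degree of freedom the .05 chi-square critical value is 3.841, so the hypothesis of homogeneity would be rejected.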
To determine the relation between study characteristics and the magnitude of the effect sizes, both categorical models and continuous models can be tested. In categorical models, analyses may show that weighted mean effect sizes differ in magnitude between the subgroups established by dividing studies into classes based on study characteristics. In such cases, it is as though the meta-analysis is broken into sub-meta-analyses based on the studies' methodological features. For example, He and colleagues (1999) found that risk of coronary heart disease was greater for women than for men among nonsmokers who were exposed to passive smoke. If effect sizes that were found to be heterogeneous become homogeneous within the classes of a categorical model, the relevant study characteristic has accounted for the systematic variability between the effect sizes. Similarly, continuous models, which are analogous to regression models, examine whether study characteristics that are assessed on a continuous scale are related to the effect sizes. As with categorical models, some continuous models may be completely specified in the sense that the systematic variability in the effect sizes is explained by the study characteristic that is used as a predictor. Continuous models are least-squares regressions, calculated with each effect size weighted by the reciprocal of its variance (which reflects its sample size). For example, He and colleagues (1999) found that risk of coronary heart disease was greater to the extent that nonsmokers had greater exposure to passive smoke. Goodness-of-fit statistics enable analysts to determine the extent to which categorical models, continuous models, or mixtures of these models provide correct depictions of study outcomes.
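A categorical moderator model of the kind just described can be sketched in code. This Python fragment is illustrative (the name q_between is ours): it computes the between-class fit statistic, which under the fixed-effects model is chi-square distributed with p - 1 degrees of freedom for p classes of the moderator.

```python
def q_between(classes):
    """Between-class homogeneity statistic for a categorical moderator.
    classes: a list of (ds, vs) pairs, one per class of the moderator.
    Q_B = sum over classes of W_j * (d_plus_j - d_plus)**2, where W_j is
    the total inverse-variance weight in class j and d_plus_j its weighted
    mean effect size."""
    class_means, class_weights = [], []
    for ds, vs in classes:
        w = [1 / v for v in vs]
        class_weights.append(sum(w))
        class_means.append(sum(wi * di for wi, di in zip(w, ds)) / sum(w))
    grand = (sum(W * m for W, m in zip(class_weights, class_means))
             / sum(class_weights))
    return sum(W * (m - grand) ** 2
               for W, m in zip(class_weights, class_means))
```

A Q_B that exceeds its chi-square critical value indicates that the weighted mean effect sizes differ reliably across the classes of the moderator.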
As an alternative analysis to predicting effect sizes using categorical and continuous models, an analyst can attain homogeneity by identifying outlying values among the effect sizes and sequentially removing those effect sizes that reduce the homogeneity statistic by the largest amount (e.g., Hedges 1987). Studies yielding effect sizes identified as outliers can then be examined to determine whether they appear to differ methodologically from the other studies. Also, inspection of the percentage of effect sizes removed to attain homogeneity allows one to determine whether the effect sizes are homogeneous aside from the presence of relatively few aberrant values. Under such circumstances, the mean attained after removal of such outliers may better represent the distribution of effect sizes than the mean based on all the effect sizes. In general, the diagnosis of outliers should occur prior to calculating moderator analyses; this diagnosis may locate a value or two that are so discrepant from the other effect sizes that they would dramatically alter any models fitted to effect sizes. Under such circumstances, these outliers should be removed from subsequent phases of the data analysis. Alternatively, outliers can be examined following categorical or continuous models (e.g., finding those that deviate the most from the values predicted by the models).
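The sequential outlier-removal procedure can be sketched as follows. This Python fragment is an illustrative implementation under stated assumptions: crit maps degrees of freedom (k - 1) to a chi-square critical value (e.g., 3.841 for 1 df, 5.991 for 2, 7.815 for 3), and the routine drops, one at a time, the effect size whose removal most reduces Q, stopping once homogeneity is attained.

```python
def homogeneity_q(ds, vs):
    """Hedges-Olkin homogeneity statistic Q."""
    weights = [1 / v for v in vs]
    d_plus = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    return sum(w * (d - d_plus) ** 2 for w, d in zip(weights, ds))

def trim_outliers(ds, vs, crit):
    """Sequentially remove the effect size whose removal most reduces Q,
    until Q is nonsignificant (Q < crit[df]) or only two effects remain.
    Returns the indices of the surviving effect sizes."""
    keep = list(range(len(ds)))
    while len(keep) > 2:
        q = homogeneity_q([ds[i] for i in keep], [vs[i] for i in keep])
        if q < crit[len(keep) - 1]:
            break
        # evaluate Q after dropping each study; drop the one that helps most
        keep.remove(min(keep, key=lambda j: homogeneity_q(
            [ds[i] for i in keep if i != j],
            [vs[i] for i in keep if i != j])))
    return keep
```

Inspecting the proportion of effect sizes removed before Q falls below its critical value indicates whether the literature is homogeneous apart from a few aberrant values.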
TRENDS IN THE PRACTICE OF META-ANALYSIS
Although the quality of meta-analyses has been quite variable, it is possible to state the features that compose a high-quality meta-analysis, including success in locating studies, explicitness of criteria for selecting studies, thoroughness and accuracy in coding moderator variables and other study characteristics, accuracy in effect size computations, and adherence to the assumptions of meta-analytic statistics. When meta-analyses meet such standards, it is difficult to disagree with Rosenthal's conclusion (1994) that it is "hardly justified to review a quantitative literature in the pre-meta-analytic, prequantitative manner" (p. 131). Yet merely meeting these high standards does not necessarily make a meta-analysis an important scientific contribution. One factor affecting scientific contribution is that the conclusions that a research synthesis is able to reach are limited by the quality of the data that are synthesized. Serious methodological faults that are endemic in a research literature may well handicap a synthesis, unless it is designed to shed light on the influence of these faults. Also, to be regarded as important, the review must address an interesting question. Moreover, unless the paper reporting a meta-analysis "tells a good story," its full value may go unappreciated by readers. Although there are many paths to a good story, Sternberg's recommendations to authors of reviews (1991) are instructive: Pick interesting questions, challenge conventional understandings if at all possible, take a unified perspective on the phenomenon, offer a clear take-home message, and write well. Thus, the practice of meta-analysis should not preclude incorporating aspects of narrative reviewing, but instead should strive to incorporate and document the richness of the literature.
One reason that the quality of published syntheses has been quite variable is that quantitative synthesis is a relatively new tool for many of the scholars who practice it. Yet, as the methods of quantitative synthesis have become more sophisticated and widely disseminated, typical published meta-analyses have improved. At their best, meta-analyses advance knowledge about a phenomenon by explicating its typical patterns and showing when it is larger or smaller, negative or positive, and by testing theories about the phenomenon (see Miller and Pollock 1994). Meta-analysis should foster a healthy interaction between primary research and research synthesis, at once summarizing old research and suggesting promising directions for new research. One misperception that scholars sometimes express is that a meta-analysis represents a dead end for a literature, a point beyond which nothing more needs to be known. In contrast, carefully conducted meta-analyses can often be the best medicine for a literature, by documenting the robustness with which certain associations are attained, resulting in a sturdier foundation on which future theories may rest. In addition, meta-analyses can show where knowledge is at its thinnest, to help plan additional, primary-level research (see Eagly and Wood 1994). As a consequence of a carefully conducted meta-analysis, primary-level studies can be designed with the complete literature in mind and will therefore have a better chance of contributing new knowledge. In this fashion, scientific resources can be directed most efficiently toward gains in knowledge. As time passes and new studies continue to accrue rapidly, it is likely that social scientists will rely more on quantitative syntheses to inform them about the knowledge that has accumulated in their research areas.
Although it is possible that meta-analysis will become the purview of an elite class of researchers who specialize in research integration, as Schmidt (1992) argued, it seems more likely that meta-analysis will become a routine part of graduate training in many fields, enabling students to develop the skills necessary to ply the art and science of meta-analysis and to integrate findings across studies as a normal and routine part of their research activities.
Bond, R., and P. B. Smith 1996 "Culture and Conformity: A Meta-Analysis of Studies Using Asch's (1952b, 1956) Line Judgment Task." Psychological Bulletin 119:111–137.
Campbell, D. T., and J. T. Stanley 1963 Experimental and Quasi-Experimental Designs for Research. Chicago: Rand-McNally.
Cohen, J. 1969 Statistical Power Analysis for the Behavioral Sciences. New York: Academic.
——1988 Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, N.J.: Erlbaum.
——1992 "A Power Prime." Psychological Bulletin 112:155–159.
Cook, T. D., and D. T. Campbell 1979 Quasi-Experimentation: Design and Analysis Issues for Field Settings. Boston: Houghton Mifflin.
Cooper, H. 1998 Synthesizing Research: A Guide for Literature Reviews, 3rd ed. Newbury Park, Calif.: Sage.
——, and L. V. Hedges 1994a "Research Synthesis as a Scientific Enterprise." Pp. 3–14 in H. Cooper and L. V. Hedges, eds., The Handbook of Research Synthesis. New York: Russell Sage.
——, and L. V. Hedges, eds. 1994b The Handbook of Research Synthesis. New York: Russell Sage.
——, and R. Rosenthal 1980 "Statistical versus Traditional Procedures for Summarizing Research Findings." Psychological Bulletin 87:442–449.
Eagly, A. H., and L. Carli 1981 "Sex of Researchers and Sex-Typed Communications as Determinants of Sex Differences in Influenceability: A Meta-Analysis of Social Influence Studies." Psychological Bulletin 90:1–20.
——, and W. Wood 1994 "Using Research Syntheses to Plan Future Research." Pp. 485–500 in H. Cooper and L. V. Hedges, eds., The Handbook of Research Synthesis. New York: Russell Sage.
Ernst, C., and J. Angst 1983 Birth Order: Its Influence on Personality. New York: Springer-Verlag.
Fisher, R. A. 1921 "On the 'Probable Error' of a Coefficient of Correlation Deduced from a Small Sample." Metron 1:1–32.
Glass, G. V. 1976 "Primary, Secondary, and Meta-Analysis of Research." Educational Researcher 5:3–8.
——, B. McGaw, and M. L. Smith 1981 Meta-Analysis in Social Research. Beverly Hills, Calif.: Sage.
He, J., S. Vupputuri, K. Allen, M. R. Prerost, J. Hughes, and P. K. Whelton 1999 "Passive Smoking and the Risk of Coronary Heart Disease—A Meta-Analysis of Epidemiologic Studies." New England Journal of Medicine 340:920–926.
Hedges, L. V. 1981 "Distribution Theory for Glass's Estimator of Effect Size and Related Estimators." Journal of Educational Statistics 6:107–128.
——1987 "How Hard Is Hard Science, How Soft Is Soft Science? The Empirical Cumulativeness of Research." American Psychologist 42:443–455.
——1994 "Statistical Considerations." Pp. 29–38 in H. Cooper and L. V. Hedges, eds., The Handbook of Research Synthesis. New York: Russell Sage.
——, H. Cooper, and B. J. Bushman 1992 "Testing the Null Hypothesis in Meta-Analysis: A Comparison of Combined Probability and Confidence Interval Procedures." Psychological Bulletin 111:188–194.
——, and I. Olkin 1985 Statistical Methods for Meta-Analysis. Orlando, Fla.: Academic.
——, and J. L. Vevea 1998 "Fixed- and Random-Effects Models in Meta-Analysis." Psychological Methods 3:486–504.
Hunt, M. 1997 How Science Takes Stock: The Story of Meta-Analysis. New York: Russell Sage.
Hunter, J. E., and F. L. Schmidt 1990 Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, Calif.: Sage.
Johnson, B. T. 1993 DSTAT 1.10: Software for the Meta-Analytic Review of Research Literatures. Hillsdale, N.J.: Erlbaum.
——, B. Mullen, and E. Salas 1995 "Comparison of Three Major Meta-Analytic Approaches." Journal of Applied Psychology 80:94–106.
——, and D. R. Nichols 1998 "Social Psychologists' Expertise in the Public Interest: Civilian Morale Research during World War II." Journal of Social Issues 54:53–77.
——, and R. Turco 1992 "The Value of Goodness-of-Fit Indices in Meta-Analysis: A Comment on Hall and Rosenthal." Communication Monographs 59:388–396.
Mazela, A., and M. Malin 1977 A Bibliometric Study of the Review Literature. Philadelphia: Institute for Scientific Information.
Miller, N., and V. E. Pollock 1994 "Meta-Analysis and Some Science-Compromising Problems of Social Psychology." Pp. 230–261 in W. R. Shadish and S. Fuller, eds., The Social Psychology of Science. New York: Guilford.
Olkin, I. 1990 "History and Goals." In K. W. Wachter and M. L. Straf, eds., The Future of Meta-Analysis. New York: Russell Sage.
Perkins, L. L., B. D. Clark, P. J. Klein, and R. R. Cook 1995 "A Meta-Analysis of Breast Implants and Connective Tissue Disease." Annals of Plastic Surgery 35:561–570.
Rosenthal, R. 1990 "How Are We Doing in Soft Psychology?" American Psychologist 45:775–777.
——1991 Meta-Analytic Procedures for Social Research, rev. ed. Beverly Hills, Calif.: Sage.
——1994 "Parametric Measures of Effect Size." Pp. 231–244 in H. Cooper and L. V. Hedges, eds., The Handbook of Research Synthesis. New York: Russell Sage.
——1995 "Writing Meta-Analytic Reviews." Psychological Bulletin 118:183–192.
——, and D. Rubin 1978 "Interpersonal Expectancy Effects: The First 345 Studies." Behavioral and Brain Sciences 3:377–415.
Sánchez-Meca, J., and F. Marín-Martínez 1997 "Homogeneity Tests in Meta-Analysis: A Monte-Carlo Comparison of Statistical Power and Type I Error." Quality & Quantity 31:385–399.
Schmidt, F. L. 1992 "What Do Data Really Mean? Research Findings, Meta-Analysis, and Cumulative Knowledge in Psychology." American Psychologist 47:1173–1181.
Shadish, W. R. 1996 "Meta-Analysis and the Exploration of Causal Mediating Processes: A Primer of Examples, Methods, and Issues." Psychological Methods 1:47–65.
Sternberg, R. J. 1991 Editorial. Psychological Bulletin 109:3–4.
Stigler, S. M. 1986 History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, Mass.: Harvard University Press.
White, H. D. 1994 "Scientific Communication and Literature Retrieval." Pp. 41–55 in H. Cooper and L. V. Hedges, eds., The Handbook of Research Synthesis. New York: Russell Sage.
B. T. Johnson
The American educational psychologist Gene V. Glass (1976) coined the term meta-analysis to stand for a method of statistically combining the results of multiple studies in order to arrive at a quantitative conclusion about a body of literature. The English statistician Karl Pearson (1857–1936) conducted what is believed to be one of the first statistical syntheses of results from a collection of studies when he gathered data from eleven studies on the effect of a vaccine against typhoid fever (1904). For each study, Pearson calculated a correlation coefficient, a statistic that was then quite new. He then averaged the correlations across the studies and concluded that other vaccines were more effective than the new one.
Early work on quantitative procedures for integrating results of independent studies was ignored for decades. Instead, scholars typically carried out narrative reviews. These reviews often involved short summaries of studies considered “relevant,” with the operational definition of that term usually based on arbitrary and unspecified criteria. Scholars would then arrive at an impressionistic conclusion regarding the overall findings of those studies, usually based largely on the statistical significance of the findings in the individual studies. This latter aspect of narrative reviews is especially problematic. Given that the findings from individual studies are based on samples, it is expected that they will vary from one another even if the studies are estimating the same underlying population parameter. Often, however, scholars misinterpreted expected sampling variability as evidence that the results of studies were “mixed” and therefore inconclusive. In addition, scholars generally did not think to account for the power of the statistical tests in their studies. In the social sciences, statistical power is often not high, resulting in an unacceptably high rate of Type II errors (i.e., the failure to reject a false null hypothesis). In a collection of studies with typical statistical power characteristics, ignoring power can lead to the appearance that the intervention has no effect even if it really does.
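To make the power point concrete, here is a minimal Python sketch, not drawn from the text itself: the effect size and sample size are invented, and the calculation uses a simple normal approximation to the power of a two-group comparison.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

d, n = 0.3, 30                          # hypothetical true effect; participants per group
delta = d * math.sqrt(n * n / (n + n))  # noncentrality under the normal approximation
power = normal_cdf(delta - 1.96)        # two-sided alpha = .05, upper tail only
print(round(power, 2))                  # ≈ 0.21
```

Under these assumptions, roughly four out of five such studies would fail to reject a false null hypothesis, which is why a tally of significant findings can make a real effect look "mixed."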
Further, narrative reviews were usually not conducted with the same level of explicitness as is required in primary studies. For example, scholars were rarely explicit about either the decision rules invoked in a review or the pattern of results across studies that would be used to accept or reject a hypothesis. In addition, narrative literature reviews typically could not impart a defensible sense of the magnitude of the overall relations they investigated, nor did they adequately account for potential factors that might influence the direction or magnitude of the relations.
The explosion of research in the social and medical sciences that occurred in the 1960s and 1970s created conditions that highlighted another of the difficulties associated with narrative reviews, specifically that it is virtually impossible for humans to make sense out of a body of literature when it is large. Robert Rosenthal and Donald Rubin (1978), for example, were interested in research on the effects of interpersonal expectancies on behavior and found over three hundred studies. Glass and Mary L. Smith (1979) found over seven hundred estimates of the relation between class size and academic achievement. Jack Hunter and his colleagues (1979) uncovered more than eight hundred comparisons of the differential validity of employment tests. It is not reasonable to expect that these scholars could have examined all of the evidence they uncovered and made decisions about what that evidence said in an unbiased and efficient manner.
Meta-analysis addresses the problem of how to combine and weight evidence from independent studies by relying on effect sizes and on statistical procedures for weighting research evidence. Effect sizes are statistics that express how much impact an intervention had (e.g., the average increase on an achievement test), and as such, they give reviewers a way to quantify both the findings from individual studies and the findings aggregated across a body of studies. Effect sizes can also be standardized to allow for comparisons between measures of the same outcome with different scaling properties (e.g., two different math achievement tests).
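As a hedged sketch of one common standardized effect size, the standardized mean difference (Cohen's d), the following Python snippet uses invented summary statistics for two hypothetical achievement tests with different scaling properties.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2)
                          / (n_t + n_c - 2))
    return (mean_t - mean_c) / pooled_sd

# Two hypothetical math tests with different scales:
d1 = cohens_d(105.0, 100.0, 15.0, 15.0, 50, 50)  # test scored on a 0-200 scale
d2 = cohens_d(52.5, 50.0, 7.5, 7.5, 50, 50)      # test scored on a 0-100 scale
print(round(d1, 3), round(d2, 3))                # both 0.333: the same standardized effect
```

Because both studies show a mean difference of one-third of a standard deviation, standardization makes their results directly comparable despite the different raw scales.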
Effect sizes with known distribution properties can be weighted to arrive at a better estimate of the population parameter when they are combined. The most common method involves weighting each effect size by the inverse of its squared standard error. Using this method, larger studies contribute more to the analysis than do smaller studies. As such, a study with one hundred participants contributes more to the overall analysis than a study with ten participants. The rationale for this procedure is that the study with one hundred participants estimates the population effect more precisely (i.e., has less random variability) and is therefore a better estimate of that population effect.
To carry out a meta-analysis, an average of the weighted effect sizes is computed. A confidence interval can be placed around the weighted average effect size; if the confidence interval does not include zero, then the null hypothesis that the population effect size is zero can be rejected at the given level of confidence. In addition, scholars usually conduct a statistical test to assess the plausibility that the observed effect sizes appear to be drawn from the same population. If the null hypothesis of effect size homogeneity is rejected, then the reviewer has reasonable cause to conduct follow-up tests that attempt to localize the sources of variation. Generally, these correlational analyses attempt to relate variability in study outcomes to characteristics of interventions (e.g., intensity), as well as to study design, sampling, and measurement characteristics.
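The combination steps just described can be sketched as follows; the five effect sizes and variances are invented for illustration, and the interval uses a normal-theory critical value.

```python
import math

effects   = [0.20, 0.35, 0.28, 0.15, 0.40]  # hypothetical study effect sizes
variances = [0.02, 0.05, 0.03, 0.04, 0.06]  # their sampling variances

weights = [1 / v for v in variances]
mean = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
se = math.sqrt(1 / sum(weights))
ci = (mean - 1.96 * se, mean + 1.96 * se)

# Q is compared with a chi-square on k - 1 degrees of freedom; a significant
# Q casts doubt on the hypothesis that all studies share one population effect.
Q = sum(w * (e - mean) ** 2 for w, e in zip(weights, effects))

print(round(mean, 3), tuple(round(x, 3) for x in ci), round(Q, 3))
```

In this invented example the confidence interval excludes zero, so the null hypothesis of a zero population effect size would be rejected at the 95 percent level.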
Finally, representing study results as weighted effect sizes allows scholars to conduct tests of the plausibility of publication bias on study results. Publication bias is the tendency for studies lacking a statistically significant effect not to appear in the published literature. These studies are therefore more difficult to uncover during a literature search. All else being equal, studies that do not result in a rejection of the null hypothesis have smaller effects than those that do reject the null hypothesis. As such, failing to locate these studies at a rate similar to that of published studies means that the overall estimate arising from a body of studies might be biased away from zero. Several statistical methods (e.g., the trim-and-fill analysis) are available to help scholars assess the potential impact of publication bias on their conclusions.
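To illustrate the mechanism behind this bias (a simulation sketch, not a method named in the text; the true effect and standard error are invented), one can generate many studies of the same effect and keep only the statistically significant ones.

```python
import random

random.seed(1)                 # fixed seed so the sketch is reproducible
true_effect, se = 0.20, 0.15   # hypothetical population effect and per-study SE

estimates = [random.gauss(true_effect, se) for _ in range(10_000)]
published = [e for e in estimates if abs(e / se) > 1.96]  # "significant" studies only

overall_mean = sum(estimates) / len(estimates)    # close to the true 0.20
published_mean = sum(published) / len(published)  # noticeably larger
print(round(overall_mean, 2), round(published_mean, 2))
```

Averaging only the "published" studies overstates the effect, which is exactly the distortion that methods such as trim-and-fill try to diagnose.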
SEE ALSO Methods, Quantitative
Cooper, Harris, and Larry V. Hedges, eds. 1994. Handbook of Research Synthesis. New York: Russell Sage Foundation.
Glass, Gene V. 1976. Primary, Secondary, and Meta-analysis of Research. Educational Researcher 5: 3–8.
Glass, Gene V., and Mary L. Smith. 1979. Meta-analysis of Research on Class Size and Achievement. Educational Evaluation and Policy Analysis 1: 2–16.
Hunter, Jack E., Frank L. Schmidt, and R. Hunter. 1979. Differential Validity of Employment Tests by Race: A Comprehensive Review and Analysis. Psychological Bulletin 86: 721–735.
Lipsey, Mark W., and David B. Wilson. 2001. Practical Meta-analysis. Thousand Oaks, CA: Sage.
Pearson, Karl. 1904. Report on Certain Enteric Fever Inoculation Statistics. British Medical Journal 3: 1243–1246.
Rosenthal, Robert, and Donald Rubin. 1978. Interpersonal Expectancy Effects: The First 345 Studies. Behavioral and Brain Sciences 3: 377–415.
Jeffrey C. Valentine
Meta-analysis is the quantitative review of the results of a number of individual studies in order to integrate their findings. The term (from the Greek meta meaning after) refers to analysis of the conclusions of the original analyses. The methodology can in principle be applied to quantitative studies in any area of investigation, but it has become a basic tool in healthcare research. It is part of the broader approach of research synthesis, which also includes qualitative aspects.
Evolution of Meta-Analysis
Gaining an overview of the outcomes of different experiments is the constant aim of science, and statisticians have been concerned with the combination of results since the emergence of formal statistical inference in the early twentieth century. The basic principles were established by the 1950s (Cochran 1954), and the need became clear with the subsequent rapid increase in research publications. The procedure was first developed in the social sciences, and the term meta-analysis was introduced in the educational literature in 1976. The 1980s saw mounting interest in the combination of results of clinical trials, and since the early 1990s meta-analysis has experienced explosive growth in medical applications.
Although there seems little doubt that meta-analysis is here to stay, it has been fraught with controversy. There is the problem of the quality of individual studies, each with its own biases: often small clinical trials with poor design and execution. There is the problem of heterogeneity: studies that measured different effects, used different populations, and had different aims. A further problem is that of publication bias, the fact that studies with positive results are more likely to get published than those with negative outcomes, leading to an inflation of the effect estimate. Related to this is Tower of Babel bias, meaning that most meta-analyses identify only reports published in English.
An international conference on meta-analysis was held in Germany in 1994, to review problems and progress (Spitzer 1995). A strong opponent present called the method "statistical alchemy for the 21st century" (Feinstein 1995). But work has continued, with the development of guidelines for doing meta-analyses, emphasizing the need to identify unpublished studies, eliminate incomplete reports and those of flawed research designs, and include only quality studies that appear to address the same well-defined question. The gold standard is that of Individual Patient Data (IPD), where the original data are available for reanalysis in the combined context. Cumulative meta-analysis is the systematic updating of the analysis as new results become available. There is also extensive research on meta-analysis for observational studies.
The Cochrane Collaboration
An important, promising development is the vigorous Cochrane Collaboration, "an international nonprofit and independent organization, dedicated to making up-to-date, accurate information about the effects of health care readily available worldwide. It produces and disseminates systematic reviews of health care interventions and promotes the search for evidence in the form of clinical trials and other studies of interventions" (Cochrane Collaboration). The movement was inspired by Archibald Cochrane (1909–1988), the British epidemiologist best known for his 1972 work Effectiveness and Efficiency: Random Reflections on Health Services. Cochrane urged equitable provision of those modes of healthcare that had been shown effective in properly designed studies, preferably randomized clinical trials. He considered the latter among the most ethical forms of treatment, and he emphasized the need for systematic critical summaries, with periodic update by specialty, of all relevant randomized clinical trials.
The first Cochrane Center opened in the United Kingdom in 1992, followed by the founding of the Cochrane Collaboration in 1993. In November 2004 its web site listed twelve Cochrane centers worldwide (using six languages) that serve as reference centers for 192 nations and coordinate the work of thousands of investigators. The main output of the Cochrane Collaboration is the Cochrane Library (CLIB), published and updated quarterly by Wiley InterScience and available by subscription via the Internet and on CD-ROM. Its contents include the Cochrane Database of Systematic Reviews (CDSRs), over 3,000 reviews prepared by fifty Collaborative Review Groups (CRGs), the Cochrane Central Register of Controlled Trials, bibliographic data on hundreds of thousands of controlled trials, as well as methodologic information on the rapidly developing field of research synthesis, and critical assessment of systematic reviews carried out by others.
The Ethics of Evidence
Meta-analysis, an attempt to integrate the information already on hand from past studies, enhanced by guidelines that it be done on the highest professional level, fits into the framework of the Ethics of Evidence, a multidisciplinary approach proposed for dealing with the uncertainties of medicine (Miké 1999). The Ethics of Evidence calls for the development, dissemination, and use of the best possible evidence for decisions in healthcare. As a complementary precept, it points to the need to accept that there will always be uncertainty.
To explore the quality of evidence from meta-analyses, a 1997 study compared the results of twelve large randomized clinical trials (RCTs) published in four leading medical journals with the conclusions of nineteen previously published meta-analyses addressing the same questions, for a total of forty primary and secondary outcomes (LeLorier et al. 1997). The agreement between the meta-analyses and the subsequent large RCTs was only somewhat better than chance. A third of the meta-analyses failed to correctly predict the outcome of the RCTs, and would have led to adoption of an ineffective treatment or the rejection of a useful one. (The actual differences between effect estimates were not large, but that did not count in this adopt/reject type of analysis.) Then in 2002 the long-held belief that menopausal hormone replacement therapy offered protection against heart disease, a medical consensus supported by meta-analyses, was shockingly reversed by RCT evidence (Wenger 2003).
The Cochrane Collaboration, as a worldwide, integrated movement, has the great potential to promote cooperation on high-quality, controlled clinical trials. Systematic reviews of these, with regular update and dissemination, should help improve the evidence available for the practice of medicine. But it is important to keep in mind that even the best meta-analysis cannot take the place of original research. Evidence-based medicine, which makes heavy use of the results of meta-analyses, cannot apply evidence that does not exist. Scientists need to stay close to the primary literature, with an open mind, to get new ideas, seek new insights, and generate new hypotheses.
The public needs to have a cautious view of meta-analysis, judging each case in its proper context. For example, the meta-analysis showing that more than 100,000 Americans die each year from the side effects of legally prescribed drugs (Lazarou et al. 1998) merits serious concern, even if the estimate is not quite accurate. There is no substitute for being informed, getting involved, and taking personal responsibility.
Bailar, John C., III. (1997). "The Promise and Problems of Meta-Analysis." New England Journal of Medicine 337: 559–560.
Cochran, William G. (1954). "The Combination of Estimates from Different Experiments." Biometrics 10: 101–129.
Cochrane, Archibald L. (1972). Effectiveness and Efficiency: Random Reflections on Health Services. London: Nuffield Provincial Hospitals Trust.
Feinstein, Alvan R. (1995). "Meta-Analysis: Statistical Alchemy for the 21st Century." Journal of Clinical Epidemiology 48(1): 71–86. Commentary by Alessandro Liberati, "A Plea for a More Balanced View of Meta-Analysis and Systematic Overviews of the Effect of Health Care Interventions," is included.
Lazarou, Jason; Bruce H. Pomeranz; and Paul N. Corey. (1998). "Incidence of Adverse Drug Reactions in Hospitalized Patients: A Meta-Analysis of Prospective Studies." Journal of the American Medical Association 279(15): 1200–1205.
LeLorier, Jacques; Geneviève Grégoire; Abdeltif Benhaddad; et al. (1997). "Discrepancies between Meta-Analyses and Subsequent Large Randomized, Controlled Trials." New England Journal of Medicine 337: 536–542.
Miké, Valerie. (1999). "Outcomes Research and the Quality of Health Care: The Beacon of an Ethics of Evidence." Evaluation & the Health Professions 22: 3–32. Commentary by Edmund D. Pellegrino, "The Ethical Use of Evidence in Biomedicine," also appears in this issue, pp. 33–43.
Moher, David. (2001). "QUOROM." [Quality of Reporting of Meta-Analyses] In Biostatistics in Clinical Trials, eds. Carol Redmond and Theodore Colton. New York: John Wiley & Sons.
Spitzer, Walter O., ed. (1995). The Potsdam International Consultation on Meta-Analysis, special issue Journal of Clinical Epidemiology 48(1): 1–172.
Stroup, Donna F., and Stephan B. Thacker. (2000). "Meta-Analysis in Epidemiology." In Encyclopedia of Epidemiologic Methods, eds. Mitchell H. Gail and Jacques Benichou. New York: John Wiley & Sons.
Thompson, Simon G. (2001). "Meta-Analysis." In Biostatistics in Clinical Trials, eds. Carol Redmond, and Theodore Colton. New York: John Wiley & Sons.
Wenger, Nanette K. (2003). "Menopausal Hormone Therapy and Cardiovascular Protection: State of the Data 2003." Journal of the American Medical Women's Association 58: 236–239.
Cochrane Collaboration. Available at www.cochrane.org. Web site of the organization.
Meta-analysis is the statistical synthesis of the data from a set of comparable studies of a problem, and it yields a quantitative summary of the pooled results. It is the process of aggregating the data and results of a set of studies, preferably as many as possible that have used the same or similar methods and procedures; reanalyzing the data from all these combined studies; and thereby generating larger numbers and more stable rates and proportions for statistical analysis and significance testing than can be achieved by any single study. The process is widely used in the biomedical sciences, especially in epidemiology and in clinical trials. In these applications, meta-analysis is defined as the systematic, organized, and structured evaluation of a problem of interest. The essence of the process is the use of statistical tables or similar data from previously published peer-reviewed and independently conducted studies of a particular problem. It is most commonly used to assemble the findings from a series of randomized controlled trials, none of which on its own would necessarily have sufficient statistical power to demonstrate statistically significant findings. The aggregated results, however, are capable of generating meaningful and statistically significant results.
There are some essential prerequisites for meta-analysis to be valid. Qualitatively, all studies included in a meta-analysis must fulfill predetermined criteria. All must have used essentially the same or closely comparable methods and procedures; the populations studied must be comparable; and the data must be complete and free of biases—such as those due to selection or exclusion criteria. Quantitatively, the raw data from all studies are usually reanalyzed, partly to verify the original findings from these studies, and partly to provide a database for summative analysis of the entire set of data. All eligible studies must be included in the meta-analysis. If a conscious decision is made to exclude some, there is always a suspicion that this was done in order to achieve a desired result. If a pharmaceutical or other commercial organization conducts a meta-analysis of studies aimed at showing its product in a favorable light, then the results will be suspect unless evidence is provided of unbiased selection. One criterion for selection is prior publication in a peer-reviewed medical journal, but there are good arguments in favor of including well-conducted unpublished studies under some circumstances.
A variation of the concept is a systematic review, defined as the application of strategies that limit bias in the assembly, critical appraisal, and synthesis of all relevant studies of a specific topic. Meta-analysis may be, but is not necessarily, used as part of this process. Systematic reviews are conducted on peer-reviewed publications dealing with a particular health problem and use rigorous, standardized methods for the selection and assessment of these publications. A systematic review can be conducted on observational (case-control or cohort) studies as well as on randomized controlled trials.
John M. Last
(see also: Epidemiology; Observational Studies; Statistics for Public Health)
Dickersin, K., and Berlin, J. A. (1992). "Meta-Analysis: State of the Science." Epidemiologic Reviews 14:154–176.