There is no uniformly accepted definition of what constitutes evaluation research. At perhaps its narrowest point, the field of evaluation research can be defined as "the use of scientific methods to measure the implementation and outcomes of programs for decision-making purposes" (Rutman 1984, p. 10). A broader, and more widely accepted, definition is "the systematic application of social research procedures for assessing the conceptualization, design, implementation, and utility of social intervention programs" (Rossi and Freeman 1993, p. 5). A much broader definition is offered by Scriven (1991), who suggests that evaluation is "the process of determining the merit, worth and value of things" (p. 1). In this last definition, the notion of what can be evaluated is not limited to a social program or specific type of intervention but encompasses, quite literally, everything.
Any description of the history of evaluation research depends on how the term is defined. Certainly, individuals have been making pronouncements about the relative worth of things since time immemorial. In the case of social programs, proficiency requirements to guide the selection of public officials using formal tests were recorded as early as 2200 B.C. in China (Guba and Lincoln 1981). Most observers, however, date the rise of evaluation research to the twentieth century. For example, programs in the 1930s established by the New Deal were viewed as great opportunities to implement social science methods to aid social planning by providing an accounting of program effects (Stephan 1935). Modern evaluation research, however, underwent explosive growth in the 1960s as a result of several factors (Shadish et al. 1991). First, the total amount of social programming increased tremendously under the administrations of Presidents Kennedy, Johnson, and Nixon. New programs were directed toward social issues such as education, housing, health, crime, and income maintenance. Second, along with these huge financial investments came the concern by Congress about whether these programs were achieving their intended effect. As a result, Congress began mandating evaluations. Third, program managers were concerned whether programs were being implemented in the manner intended, and consequently data were required to monitor program operations. In addition, there were intellectual issues about how best to implement programs and the relative effectiveness of various approaches to offsetting various social ills. Outcome data were needed to compare competing approaches. The result was a burgeoning demand for trained evaluators; and the large number of scientists involved in the common enterprise of evaluation became sufficient to support the development of evaluation research as a scientific specialty area.
The field of evaluation research is no longer expanding at the rate it was in the 1960s and 1970s (Freeman 1992). By the 1980s, there was a substantial decline in the funding for evaluation activities that was motivated, in part, by the budget cuts of the Reagan administration. By then, however, the field of evaluation research had been established. It continues to thrive for several reasons (Desautels 1997). First, difficult decisions are always required by public administrators and, in the face of continuing budget constraints, these decisions are often based on accountability for results. Second, an increasingly important aspect of service provision by both public and private program managers is service quality. Monitoring quality requires information about program practices and outcomes. Third, there is growing public demand for accountability in government, a view increasingly echoed by government representatives. Meeting these demands requires measurement of results and a management system that uses evaluation for strategic planning and tactical decision making.
Early in its history, evaluation was seen primarily as a tool of the political left (Freeman 1992). Clearly, that is no longer the case. Evaluation activities have demonstrated their utility to both conservatives and liberals. Although the programs of today may be different from those launched in the 1960s, evaluation studies are more pervasive than ever. As long as difficult decisions need to be made by administrators serving a public that is demanding ever-increasing levels of quality and accountability, there will be a growing market for evaluation research.
PURPOSES OF EVALUATION RESEARCH
A wide variety of activities are subsumed under the broad rubric of "evaluation research." This diversity proceeds from the multiplicity of purposes underlying evaluation activities. Chelimsky (1997) identifies three different purposes of evaluation: evaluation for accountability, evaluation for development, and evaluation for knowledge.
Accountability. From the perspective of auditors and funding agencies, evaluations are necessary to establish accountability. Evaluations of this type frequently attempt to answer the question of whether the program or policy "worked" or whether anything changed as a result. The conceptual distinction between program and policy evaluations is a subtle but important one (Sonnad and Borgatta 1992). Programs are usually characterized by specific descriptions of what is to be done, how it is to be done, and what is to be accomplished. Policies are broader statements of objectives than programs, with greater latitude in how they are implemented and with potentially more diverse outcomes. Questions addressed by either program or policy evaluations from an accountability standpoint are usually cause-and-effect questions requiring research methodology appropriate to such questions (e.g., experiments or quasi-experiments). Studies of this type are often referred to as summative evaluations (Scriven 1991) or impact assessments (Rossi and Freeman 1993). Although the term "outcome" evaluation is frequently used when the focus of the evaluation is on accountability, this term is less precise, since all evaluations, whether conducted for reasons of accountability, development, or knowledge, yield outcomes of some kind (Scriven 1991).
Development. Evaluation for development is usually conducted to improve institutional performance. Developmental evaluations received heightened importance as a result of public pressure during the 1980s and early 1990s for public management reforms based on notions such as "total quality management" and "reinventing government" (e.g., see Gore 1993). Developmental evaluations often address questions such as: How can management performance or organizational performance be improved? What data systems are necessary to monitor program accomplishment? What are appropriate indicators of program success and what are appropriate organizational goals? Studies designed primarily to improve programs or the delivery of a product or service are sometimes referred to as formative or process evaluations (Scriven 1991). In such studies, the focus is on the treatment rather than its outcomes. Depending on the specific question being addressed, methodology may include experiments, quasi-experiments, or case studies. Data may be quantitative or qualitative. Formative or process evaluations may be sufficient by themselves if a strong relationship is known to exist between the treatment and its outcomes. In other cases, they may be accompanied by summative evaluations as well.
Knowledge. In evaluation for knowledge, the focus of the research is on improving our understanding of the etiology of social problems and on detailing the logic of how specific programs or policies can ameliorate them. Just as evaluation for accountability is of greatest interest to funding or oversight agencies, and evaluation for development is most useful to program administrators, evaluation for knowledge is frequently of greatest interest to researchers, program designers, and evaluators themselves. Questions might include such things as the causes of crime, homelessness, or voter apathy. Since these are largely cause-and-effect questions, rigorous research designs appropriate to such questions are generally required.
CONTEMPORARY ISSUES IN EVALUATION
Utilization of Findings. Implicit in the enterprise of evaluation research is the belief that the findings from evaluation studies will be utilized by policy makers to shape their decisions. Indeed, such a view was espoused explicitly by Campbell (1969), who argued that social reforms should be regarded as social experiments and that the findings concerning program effectiveness should determine which programs to retain and which to discard. This process of rational decision making, however, has not been consistently embraced by policy makers and has been a source of concern and disillusionment for many evaluators. Rossi (1994) sums up the situation by noting:
Although some of us may have entertained hopes that in the "experimenting society" the experimenter was going to be king, that delusion, however grand, did not last for long. It often seemed that programs had robust lives of their own, appearing, continuing, and disappearing following some unknown processes that did not appear responsive to evaluations and their outcomes. (p. 26)
One source of the utilization problem, as Weiss (1975, 1987) has noted, is the fact that evaluations take place in a political context. Although accomplishing its stated objectives is important to program success, it may not be the only—or even the most important—measure of program success. From this perspective, it is not that administrators and policy makers are irrational—they simply use a different model of rationality than do evaluators. Indeed, the view of policy makers and program administrators may be more "rational" than that of evaluators because it has been shown repeatedly that programs can and do survive negative evaluations. Programs are less likely, however, to survive a hostile congressional committee, negative press, or lack of public support. There are generally multiple stakeholders, often with competing interests, associated with any large program. Negative findings are of very little use to individuals whose reputations and jobs are dependent on program success. Thus, rather than bemoaning a lack of utilization of findings, evaluators need to recognize that evaluation findings represent only one piece of a complex political process.
Evaluators concerned with utilization frequently make a distinction between the immediate or instrumental use of findings to make direct policy decisions versus the conceptual use of findings, which serves primarily to enlighten decision makers and perhaps influence later decision making (Leviton and Hughes 1981). In a related vein, Scriven (1993) makes an important distinction between "lack of implementation" and "lack of utilization." Lack of implementation merely refers to a failure to implement recommendations. In contrast, utilization is more ambiguous. It is often not clear what outcomes or actions actually constitute a utilization of findings. Evaluation findings can have great utility but may not necessarily lead to a particular behavior. For example, a consumer can read an evaluation of a product in a publication such as Consumer Reports and then decide not to buy the product. Although the evaluation did not lead to a particular behavior (i.e., purchasing the product), it was nonetheless extremely useful to the consumer, and the information can be said to have been utilized. Some observers have noted that the concern about underutilization of evaluation findings belies what is actually happening in the field of evaluation research. Chelimsky and Shadish (1997) provide numerous examples of how evaluation findings have had substantial impacts on policy and decision making, not only in government but also in the private sector, and not only in the United States but internationally as well.
Quantitative Versus Qualitative Research. The rise of evaluation research in the 1960s began with a decidedly quantitative stance. In an early, influential book, Suchman (1967) unambiguously defined evaluation research as "the utilization of scientific research methods and techniques" (p. 7) and cited a recent book by Campbell and Stanley (1963) on experimental and quasi-experimental designs as providing instruction on the appropriate methodology. It was not long, however, before the dominance of quantitative methods in evaluation research came under attack. Cook (1997) identifies two reasons. First, there has been a longstanding debate, especially in sociology, over the merits of qualitative research and the limits of quantitative methods. Sociologists brought the debate with them when they entered the field of evaluation. Second, evaluation researchers, even those trained primarily in quantitative methods, began to recognize the epistemological limitations of the quantitative approach (e.g., Guba and Lincoln 1981). There were also practical reasons to turn toward qualitative methods. For example, Weiss (1987) noted that quantitative outcome measures are frequently too insensitive to detect program effects. Also, the expected time lag between treatment implementation and any observed outcomes is frequently unknown, with program effects often taking years to emerge. Moreover, due to limited budgets, time constraints, program attrition, multiple outcomes, multiple program sites, and other difficulties associated with applied research, quantitative field studies rarely achieved the potential they exuded on the drawing board. As a result, Weiss recommended supplementing quantitative with qualitative methods.
Focus on the quantitative–qualitative debate in evaluation research was sharpened when successive presidents of the American Evaluation Association expressed differing views on the matter. On the qualitative side, it was suggested that the focus on rigor associated with quantitative evaluations may have blinded evaluators to "artistic aspects" of the evaluation process that have traditionally been unrecognized or simply ignored. The time had come "to move beyond cost benefit analyses and objective achievement measures to interpretive realms" in the conduct of evaluation studies (Lincoln 1991, p. 6). From the quantitative perspective, it was acknowledged that while it is true that evaluations have frequently failed to produce strong empirical support for many attractive programs, to blame that failure on quantitative evaluations is akin to shooting the messenger. Moreover, at a time when research and statistical methods (e.g., regression discontinuity designs, structural equations with latent variables, etc.) were finally catching up to the complexities of contemporary research questions, it would be a shame to abandon the quantitative approach (Sechrest 1992). The ensuing controversy only served to polarize the two camps further.
The debate over which approach is best, quantitative or qualitative, is presently unresolved and, most likely, will remain so. Each paradigm has different strengths and weaknesses. As Cook (1997) points out, quantitative methods are good for generalizing and describing causal relationships. In contrast, qualitative methods are well suited for exploring program processes. Ironically, it is the very differences between the two approaches that may ultimately resolve the issue because, to the extent that their limitations differ, the two methods used jointly will generally be better than either used singly (Reichardt and Rallis 1994).
Research Synthesis. Evaluation research, as it was practiced in the 1960s and 1970s, drew heavily on the experimental model. The work of Donald Campbell was very influential in this regard. Although he is very well known for his explication of quasi-experimental research designs (Campbell and Stanley 1963; Cook and Campbell 1979), much of his work actually de-emphasized quasi-experimentation in favor of experiments (Shadish et al. 1991). Campbell pointed out that quasi-experiments frequently lead to ambiguous causal inferences, sometimes with dire consequences (Campbell and Erlebacher 1970). In addition, he noted that experiments have wide applicability, even in applied settings where random assignment may not initially seem feasible (Campbell and Boruch 1975). Campbell also advocated implementing such rigorous methods in the evaluation of social programs (Campbell 1969). As a result, Campbell is frequently credited with proposing a rational model of social reform in which a program is first evaluated using rigorous social science methods, such as experiments, when possible, and then a report is issued to a decision maker who acts on the findings.
Whatever its source, it was not long before the rational model was criticized as being too narrow to serve as a template for evaluation research. In particular, Cronbach and colleagues (Cronbach et al. 1980) argued that evaluation is as much a political process as a scientific one, that decisions are rarely made but more likely emerge, that there is rarely a single decision maker, and that programs are often amorphous undertakings with no single outcome. From Cronbach's perspective, the notion that the outcome of a single study could influence the existence of a program is inconsistent with the political realities of most programs.
Understanding the ensuing controversy requires an understanding of the notion of validity. Campbell distinguished between two types of validity: internal and external (Campbell 1957; Campbell and Stanley 1963). Internal validity refers to whether the innovation or treatment has an effect. In contrast, external validity addresses the issue of generalizability of effects; specifically, "To what populations, settings, treatment variables, and measurement variables can this effect be generalized" (Campbell and Stanley 1963, p. 5). Campbell clearly assigned greater importance to internal validity than to external validity. Of what use is it, he asked, to generalize experimental outcomes to some population if one has doubts about the very existence of the relationship that one seeks to generalize (Shadish et al. 1991)? Campbell's emphasis on internal validity was clearly consistent with his focus on experiments, since the latter are particularly useful in examining causal relationships.
In contrast, Cronbach (1982) opposed the emphasis on internal validity that had so profoundly shaped the approach to evaluation research throughout the 1960s and 1970s. Although experiments have high internal validity, they tend to be weak in external validity; and, according to Cronbach, it is external validity that is of greatest utility in evaluation studies. That is, decision makers are rarely interested in the impact of a particular treatment on a unique set of subjects in a highly specific experimental setting. Instead, they want to know whether a program or treatment, which may not always be administered in exactly the same way from agency to agency, will have an effect if it is administered on other individuals, and in other settings, from those studied in the experimental situation. From Cronbach's perspective, the rational model of evaluation research based on rigorous social research procedures is a flawed model because there are no reliable methods for generalizing beyond the factors that have been studied in the first place and it is the generalized rather than the specific findings in which evaluators are interested. As a result, Cronbach viewed evaluation as more of an art than a scientific enterprise.
The debate over which has priority in evaluation research, internal or external validity, seems to have been resolved in the increasing popularity of research syntheses. Evaluation syntheses represent a meta-analytic technique in which research results from numerous independent evaluation studies are first converted to a common metric and then aggregated using a variety of statistical techniques. The product is a meaningful summary of the collective results of many individual studies. Research synthesis based on meta-analysis has helped to resolve the debate over the priority of internal versus external validity in that, if studies with rigorous designs are used, results will be internally valid. Moreover, by drawing on findings from many different samples, in many different settings, using many different outcome measures, the robustness of findings and generalizability can be evaluated as well.
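The two steps described above (conversion to a common metric, then statistical aggregation) can be illustrated with a minimal sketch. It assumes one common choice among many: the standardized mean difference (Cohen's d) as the common metric and a fixed-effect, inverse-variance weighted average as the aggregation rule. The study summaries are hypothetical numbers invented for illustration, and the function names are not drawn from any cited source.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference (Cohen's d) using the pooled SD."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def d_variance(d, n_t, n_c):
    """Common large-sample approximation to the sampling variance of d."""
    return (n_t + n_c) / (n_t * n_c) + d**2 / (2 * (n_t + n_c))

def pool_fixed_effect(effects, variances):
    """Inverse-variance weighted mean effect and its standard error."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# Hypothetical study summaries on different outcome scales:
# (treatment mean, control mean, treatment SD, control SD, n_t, n_c)
studies = [
    (52.0, 50.0, 10.0, 10.0, 40, 40),
    (105.0, 100.0, 15.0, 15.0, 100, 100),
    (3.4, 3.1, 1.0, 1.0, 25, 25),
]

# Step 1: convert each study's result to the common metric.
effects = [cohens_d(*s) for s in studies]
variances = [d_variance(d, s[4], s[5]) for d, s in zip(effects, studies)]

# Step 2: aggregate, giving more precise studies more weight.
pooled, se = pool_fixed_effect(effects, variances)
print(f"pooled d = {pooled:.3f}, SE = {se:.3f}")
```

Because the weights are inverse variances, larger studies dominate the pooled estimate, and the pooled standard error is smaller than that of any single study. A random-effects model would be a reasonable alternative when true effects are expected to vary across settings.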
Although meta-analysis has many strengths, including increased power relative to individual studies to detect treatment effects, the results are obviously limited by the quality of the original studies. The major drawback to meta-analysis, then, deals with repeating or failing to compensate for the limitations inherent in the original research on which the syntheses are based (Figueredo 1993). Since many evaluations use nonexperimental designs, these methodological limitations can be considerable, although they potentially exist in experiments as well (e.g., a large proportion of experiments suffer from low external validity).
An emerging theory underlying research syntheses of experimental and nonexperimental studies, referred to as critical multiplism (Shadish 1993) and based on Campbell and Fiske's (1959) notion of multiple operationalism, addresses these issues directly. "Multiplism" refers to the fact that there are multiple ways of proceeding in any research endeavor, with no single way being uniformly superior to all others. That is, every study will involve specific operationalizations of causes and effects that necessarily underrepresent the potential range of relevant components in the presumed causal process while introducing irrelevancies unique to the particular study (Cook 1993). For example, a persuasive communication may be intended to change attitudes about an issue. In a study to evaluate this presumed cause-and-effect relationship, the communication may be presented via television and attitudes may be assessed using a paper-and-pencil inventory. Clearly, the medium used underrepresents the range of potential persuasive techniques (e.g., radio or newspapers might have been used) and the paper-and-pencil task introduces irrelevancies that, from a measurement perspective, constitute sources of error. The term "critical" refers to the attempt to identify biases in the research approach chosen. The logic, then, of critical multiplism is to synthesize the results of studies that are heterogeneous with respect to sources of bias and to avoid any constant biases. In this manner, meta-analytic techniques can be used to implement critical multiplist ideas, thereby increasing our confidence in the generalizability of evaluation findings.
The increasing use of research syntheses represents one of the most important changes in the field of evaluation during the past twenty-five years (Cook 1997). Research synthesis functions in the service of increasing both internal and external validity. Although it may seem that the use of research syntheses is a far cry from Campbell's notion of an experimenting society, in reality Campbell never really suggested that a single study might resolve an important social issue. In "Reforms as Experiments" (1969) Campbell states:
Too many social scientists expect single experiments to settle issues once and for all. . . . Because we social scientists have less ability to achieve "experimental isolation," because we have good reason to expect our treatment effects to interact significantly with a wide variety of social factors many of which we have not yet mapped, we have much greater needs for replication experiments than do the physical sciences. (pp. 427–428)
Ironically, perhaps, the increasing use of research syntheses in evaluation research is perfectly consistent with Campbell's original vision of an experimenting society.
DIRECTIONS FOR THE FUTURE
The field of evaluation research has undergone a professionalization since the early 1970s. Today, the field of evaluation research is characterized by its own national organization (the American Evaluation Association), journals, and professional standards. The field continues to evolve as practitioners continue the debate over exactly what constitutes evaluation research, how it should be conducted, and who should do it. In this regard, Shadish and colleagues (1991) make a compelling argument that the integration of the field will ultimately depend on the continued development of comprehensive theories that are capable of integrating the diverse activities and procedures traditionally subsumed under the broad rubric of evaluation research. In particular, they identify a number of basic issues that any theory of evaluation must address in order to integrate the practice of evaluation research. These remaining issues include knowledge construction, the nature of social programming and knowledge use, the role of values, and the practice of evaluation.
Knowledge Construction. A persisting issue in the field of evaluation concerns the nature of the knowledge that should emerge as a product from program evaluations. Issues of epistemology and research methods are particularly germane in this regard. For example, the controversy over whether quantitative approaches to the generation of knowledge are superior to qualitative methods, or whether any method can be consistently superior to another regardless of the purpose of the evaluation, is really an issue of knowledge construction. Other examples include whether knowledge about program outcomes is more important than knowledge concerning program processes, or whether knowledge about how program effects occur is more important than describing and documenting those effects. Future theories of evaluation must address questions such as which types of knowledge have priority in evaluation research, under what conditions various knowledge-generation strategies (e.g., experiments, quasi-experiments, case studies, or participatory evaluation) might be used, and who should decide (e.g., evaluators or stakeholders). By so doing, the field will become more unified, characterized by common purpose rather than by competing methodologies and philosophies.
Social Programming and Knowledge Use. The ostensible purpose of evaluation lies in the belief that problems can be ameliorated by improving the programs or strategies designed to address those problems. Thus, a social problem might be remediated by improving an existing program or by getting rid of an ineffective program and replacing it with a different one. The history of evaluation research, however, has demonstrated repeatedly how difficult it is to impact social programming. Early evaluators from academia were, perhaps, naive in this regard. Social programs are highly resistant to change processes because there are generally multiple stakeholders, each with a vested interest in the program and with their own constituencies to support. Complicating the matter is the fact that knowledge is used in different ways in different circumstances. Several important distinctions concerning knowledge use can be made: (1) use in the short term versus use in the long term, (2) information for instrumental use in making direct decisions versus information intended for enlightenment or persuasion, and (3) lack of implementation of findings versus lack of utilization of findings. These different types of use progress at different rates and in different ways. Consequently, any resulting program changes are likely to appear slow and sporadic. But the extent to which such change processes should represent a source of disappointment and frustration for evaluators requires further clarification. Specifically, theories of evaluation are needed that take into account the complexities of social programming in modern societies, that delineate appropriate strategies for change in differing contexts, and that elucidate the relevance of evaluation findings for decision makers and change agents.
Values. Some evaluators, especially in the early history of the field, believed that evaluation should be conducted as a value-free process. The value-free doctrine was imported from the social sciences by early evaluators who brought it along as a by-product of their methodological training. This view proved to be problematic because evaluation is an intrinsically value-laden process in which the ultimate goal is to make a pronouncement about the value of something. As Scriven (1993) has cogently argued, the values-free model of evaluation is also wrong. As proof, he notes that statements such as "evaluative conclusions cannot be established by any legitimate scientific process" are clearly self-refuting because they are themselves evaluative statements. If evaluators cling to a values-free philosophy, then the inevitable and necessary application of values in evaluation research can only be done indirectly, by incorporating the values of other persons who might be connected with the programs, such as program administrators, program users, or other stakeholders (Scriven 1991). Obviously, evaluators will do a better job if they are able to consider explicitly values-laden questions such as: On what social values is this intervention based? What values does it foster? What values does it harm? How should merit be judged? Who decides? As Shadish and colleagues (1991) point out, evaluations are often controversial and explosive enterprises in the first place and debates about values only make them more so. Perhaps that is why values theory has gotten short shrift in the past. Clearly, however, future theory needs to address the issue of values, acknowledging and clarifying their central role in evaluation research.
The Practice of Evaluation. Evaluation research is an extremely applied activity. In the end, evaluation theory has relevance only to the extent that it influences the actual practice of evaluation research. Any theory of evaluation practice must necessarily draw on all the aforementioned issues (i.e., knowledge construction, social programming and information use, and values), since they all have direct implications for practice. In addition, there are pragmatic issues that directly affect the conduct of evaluation research. One important contemporary issue examines the relationship between the evaluator and individuals associated with the program. For example, participatory evaluation is a controversial approach to evaluation research that favors collaboration between evaluation researchers and individuals who have some stake in the program under evaluation. The core assumption of participatory evaluation is that, by involving stakeholders, ownership of the evaluation will be shared, the findings will be more relevant to interested parties, and the outcomes are then more likely to be utilized (Cousins and Whitmore 1998). From an opposing perspective, participatory evaluation is inconsistent with the notion that the investigator should remain detached from the object of investigation in order to remain objective and impartial. Not surprisingly, the appropriateness of participatory evaluation is still being debated.
Other aspects of practice are equally controversial and require clarification as well. For example: Who is qualified to conduct an evaluation? How should professional evaluators be trained and by whom? Should evaluators be licensed? Without doubt, the field of evaluation research has reached a level of maturity where such questions warrant serious consideration and their answers will ultimately determine the future course of the field.
Campbell, Donald 1957 "Factors Relevant to the Validity of Experiments in Social Settings." Psychological Bulletin 54:297–312.
——1969 "Reforms as Experiments." American Psychologist 24:409–429.
——, and Robert Boruch 1975 "Making the Case for Randomized Assignment to Treatments by Considering the Alternatives: Six Ways in Which Quasi-Experimental Evaluations in Compensatory Education Tend to Underestimate Effects." In C. Bennett and A. Lumsdaine, eds., Evaluation and Experiments: Some Critical Issues in Assessing Social Programs. New York: Academic Press.
——, and Albert Erlebacher 1970 "How Regression Artifacts Can Mistakenly Make Compensatory Education Programs Look Harmful." In J. Hellmuth, ed., The Disadvantaged Child: vol. 3. Compensatory Education: A National Debate. New York: Brunner/Mazel.
——, and Donald Fiske 1959 "Convergent and Discriminant Validity by the Multitrait–Multimethod Matrix." Psychological Bulletin 56:81–105.
——, and Julian Stanley 1963 Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally.
Chelimsky, Eleanor 1997 "The Coming Transformations in Evaluation." In E. Chelimsky and W. Shadish, eds., Evaluation for the Twenty-first Century. Thousand Oaks, Calif.: Sage.
——, and William Shadish 1997 Evaluation for the Twenty-first Century. Thousand Oaks, Calif.: Sage.
Cook, Thomas 1993 "A Quasi-Sampling Theory of the Generalization of Causal Relationships." In L. Sechrest and A. Scott, eds., Understanding Causes and Generalizing About Them (New Directions for Program Evaluation, No. 57). San Francisco: Jossey-Bass.
——1997 "Lessons Learned in Evaluation over the Past 25 Years." In Eleanor Chelimsky and William Shadish, eds., Evaluation for the Twenty-first Century. Thousand Oaks, Calif.: Sage.
——, and Donald Campbell 1979 Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand McNally.
Cousins, J. Bradley, and Elizabeth Whitmore 1998 "Framing Participatory Evaluation." In Elizabeth Whitmore, ed., Understanding and Practicing Participatory Evaluation (New Directions for Evaluation, No. 80). San Francisco: Jossey-Bass.
Cronbach, Lee 1982 Designing Evaluations of Educational and Social Programs. San Francisco: Jossey-Bass.
——, Sueann Ambron, Sanford Dornbusch, Robert Hess, Robert Hornik, D. C. Phillips, Decker Walker, and Stephen Weiner 1980 Toward Reform of Program Evaluation. San Francisco: Jossey-Bass.
Desautels, L. Denis 1997 "Evaluation as an Essential Component of 'Value-for-Money.'" In Eleanor Chelimsky and William R. Shadish, eds., Evaluation for the Twenty-first Century. Thousand Oaks, Calif.: Sage.
Figueredo, Aurelio 1993 "Critical Multiplism, Meta-Analysis, and Generalization: An Integrative Commentary." In L. Sechrest, ed., Program Evaluation: A Pluralistic Enterprise (New Directions for Program Evaluation, No. 60). San Francisco: Jossey-Bass.
Freeman, Howard 1992 "Evaluation Research." In E. Borgatta and M. Borgatta, eds., Encyclopedia of Sociology. New York: Macmillan.
Gore, Albert 1993 From Red Tape to Results: Creating a Government That Works Better and Costs Less. New York: Plume/Penguin.
Guba, Egon, and Yvonna Lincoln 1981 Effective Evaluation. San Francisco: Jossey-Bass.
Leviton, Laura, and Edward Hughes 1981 "Research on the Utilization of Evaluations: A Review and Synthesis." Evaluation Review 5:525–548.
Lincoln, Yvonna 1991 "The Arts and Sciences of Program Evaluation." Evaluation Practice 12:1–7.
Reichardt, Charles, and Sharon Rallis 1994 "The Relationship Between the Qualitative and Quantitative Research Traditions." In Charles Reichardt and Sharon Rallis, eds., The Qualitative–Quantitative Debate: New Perspectives (New Directions for Program Evaluation, No. 61). San Francisco: Jossey-Bass.
Rossi, Peter 1994 "The War Between the Quals and the Quants: Is a Lasting Peace Possible?" In Charles Reichardt and Sharon Rallis, eds., The Qualitative–Quantitative Debate: New Perspectives (New Directions for Program Evaluation, No. 61). San Francisco: Jossey-Bass.
——, and Howard E. Freeman 1993 Evaluation, fifth ed. Newbury Park, Calif.: Sage.
Rutman, Leonard 1984 Evaluation Research Methods. Newbury Park, Calif.: Sage.
Scriven, Michael 1991 Evaluation Thesaurus, fourth ed. Newbury Park, Calif.: Sage.
——1993 Hard-Won Lessons in Program Evaluation (New Directions for Program Evaluation, No. 58). San Francisco: Jossey-Bass.
Sechrest, Lee 1992 "Roots: Back to Our First Generations." Evaluation Practice 13:1–7.
Shadish, William 1993 "Critical Multiplism: A Research Strategy and Its Attendant Tactics." In L. Sechrest, ed., Program Evaluation: A Pluralistic Enterprise (New Directions for Program Evaluation, No. 60). San Francisco: Jossey-Bass.
——, Thomas Cook, and Laura Leviton 1991 Foundations of Program Evaluation: Theories of Practice. Newbury Park, Calif.: Sage.
Sonnad, Subhash, and Edgar Borgatta 1992 "Evaluation Research and Social Gerontology." Research on Aging 14:267–280.
Stephan, A. Stephen 1935 "Prospects and Possibilities: The New Deal and the New Social Research." Social Forces 13:515–521.
Suchman, Edward 1967 Evaluative Research: Principles and Practice in Public Service and Social Action Programs. New York: Russell Sage Foundation.
Weiss, Carol 1975 "Evaluation Research in the Political Context." In Marcia Guttentag and Elmer Struenning, eds., Handbook of Evaluation Research, vol. 1. Beverly Hills, Calif.: Sage.
——1987 "Evaluating Social Programs: What Have We Learned?" Society 25:40–45.
Ours is an age of social-action programs, where large organization and huge expenditures go into the attempted solution of every conceivable social problem. Such programs include both private and public ventures and small-scale and large-scale projects, ranging in scope from local to national and international efforts at social change. Whenever men spend time, money, and effort to help solve social problems, someone usually questions the effectiveness of their actions. Sponsors, critics, the public, even the actors themselves, seek signs that their program is successful. Much of the assessment of action programs is irregular and, often by necessity, based upon personal judgments of supporters or critics, impressions, anecdotes, testimonials, and miscellaneous information available for the evaluation. In recent years, however, there has been a striking change in attitudes toward evaluation activities and the type and quality of evidence that is acceptable for determining the relative success or failure of social-action programs.
Two trends stand out in the modern attitude toward evaluation. First, evaluation has come to be expected as a regular accompaniment to rational social-action programs. Second, there has been a movement toward demanding more systematic, rigorous, and objective evidence of success. The application of social science techniques to the appraisal of social-action programs has come to be called evaluation research.
Examples of the applications of evaluation research are available from a wide variety of fields. One of the earliest attempts at building evaluation research into an action program was in the field of community action to prevent juvenile delinquency. The 1937 Cambridge-Somerville Youth Study provided for an experimental and a control group of boys, with the former to receive special attention and advice from counselors and other community agencies. The plan called for a ten-year period of work with the experimental group followed by an evaluation that would compare the record of their delinquent conduct during that decade with the record of the control group. The results of the evaluation (see Powers & Witmer 1951) showed no significant differences in conduct favorable to the program. A subsequent long-term evaluation of the same program likewise found no evidence of less criminal activity among persons in the experimental group, but it added a variety of new theoretical analyses to the evaluation (McCord et al. 1959).
Several evaluations of programs in citizenship training for young persons have built upon one another, thus providing continuity in the field. Riecken (1952) conducted an evaluation of summer work camps sponsored by the American Friends Service Committee to determine their impact on the values, attitudes, and opinions of the participants. His work was useful in specifying those areas in which the program was successful or unsuccessful as well as pointing up the importance of measuring unsought by-products of action programs. Subsequently, Hyman, Wright, and Hopkins carried out a series of evaluations of another youth program, the Encampment for Citizenship (1962). Their research design was complex, including a comparison of campers’ values, attitudes, opinions, and behavior before and after a six-week program of training; follow-up surveys six weeks and four years after the group left the program; three independent replications of the original study on new groups of campers in later years; and a sample survey of alumni of the program. These various studies demonstrated the effectiveness of the program in influencing campers’ social attitudes and conduct; they also examined the dynamics of attitudinal change.
Evaluations have been made in such varied fields as intergroup relations, induced technological change, mass communications, adult education, international exchange of persons for training or good will, mental health, and public health. Additional examples of applications of evaluation research, along with discussions of evaluation techniques, are presented by Klineberg and others in a special issue of the International Social Science Bulletin (1955) and in Hyman and Wright (1966).
A scientific approach to the assessment of a program’s achievements is the hallmark of modern evaluation research. In this respect evaluation research resembles other kinds of social research in its concern for objectivity, reliability, and validity in the collection, analysis, and interpretation of data. But it can be distinguished as a special form of social research by its purpose and the conditions under which the research must be conducted. Both of these factors affect such components of the research process as study design and its translation into practice, allocation of research time and other resources, and the value or worth to be put upon the empirical findings.
The primary purpose of evaluation research is “to provide objective, systematic, and comprehensive evidence on the degree to which the program achieves its intended objectives plus the degree to which it produces other unanticipated consequences, which when recognized would also be regarded as relevant to the agency” (Hyman et al. 1962, pp. 5–6). Evaluation research thus differs in its emphasis from such other major types of social research as exploratory studies, which seek to formulate new problems and hypotheses, or explanatory research, which places emphasis on the testing of theoretically significant hypotheses, or descriptive social research, which documents the existence of certain social conditions at a given moment or over time (Selltiz et al. 1959). Since the burden is on the evaluator to provide firm evidence on the effects of the program under study, he favors a study design that will tend toward maximizing such evidence and his confidence in conclusions drawn from it. Although good evaluation research often seeks explanations of a program’s success or failure, the first concern is to obtain basic evidence on effectiveness, and therefore most research resources are allocated to this goal.
The conditions under which evaluation research is conducted also give it a character distinct from other forms of social research. Evaluation research is applied social research, and it differs from other modes of scholarly research in bringing together an outside investigator to guarantee objectivity and a client in need of his services. From the initial formulation of the problem to the final interpretation of findings, the evaluator is duty-bound to keep in mind the very practical problem of assessing the program under study. As a consequence he often has less freedom to select or reject certain independent, dependent, and intervening variables than he would have in studies designed to answer his own theoretically formulated questions, such as might be posed in basic social research. The concepts employed and their translation into measurable variables must be selected imaginatively but within the general framework set by the nature of the program being evaluated and its objectives (a point which will be discussed later). Another feature of evaluation research is that the investigator seldom has freedom to manipulate the program and its components, i.e., the independent variable, as he might in laboratory or field experiments. Usually he wants to evaluate an ongoing or proposed program of social action in its natural setting and is not at liberty, because of practical and theoretical considerations, to change it for research purposes. The nature of the program being evaluated and the time at which his services are called upon also set conditions that affect, among other things, the feasibility of using an experimental design involving before-and-after measurements, the possibility of obtaining control groups, the kinds of research instruments that can be used, and the need to provide for measures of long-term as well as immediate effects.
Social science is increasingly called upon to evaluate action programs that are local, national, and international in scope, a trend that will probably grow in future years. At the same time, the purposes and conditions of evaluation research complicate the application of scientific research procedures to problems of evaluation. Together these developments have stimulated an interest in methodological aspects of evaluation among a variety of social scientists, especially sociologists and psychologists. Methodological and technical problems in evaluation research are discussed, to mention but a few examples, in the writings of Riecken (1952), Klineberg (1955), Hyman et al. (1962), and Hayes (1959).
While it is apparent that the specific translation of social-science techniques into forms suitable for a particular evaluation study involves research decisions based upon the special nature of the program under examination, there are nonetheless certain broad methodological questions common to most evaluation research. Furthermore, certain principles of evaluation research can be extracted from the rapidly growing experience of social scientists in applying their perspectives and methods to the evaluation of social-action programs. Such principles have obvious importance in highlighting and clarifying the methodological features of evaluation research and in providing practical, if limited, guidelines for conducting or appraising such research. The balance of this article will discuss certain, but by no means all, of these compelling methodological problems.
Methodological steps and principles
The process of evaluation has been codified into five major phases, each involving particular methodological problems and guiding principles (see Hyman et al. 1962). They are (1) the conceptualization and measurement of the objectives of the program and other unanticipated relevant outcomes; (2) formulation of a research design and the criteria for proof of effectiveness of the program, including consideration of control groups or alternatives to them; (3) the development and application of research procedures, including provisions for the estimation or reduction of errors in measurement; (4) problems of index construction and the proper evaluation of effectiveness; and (5) procedures for understanding and explaining the findings on effectiveness or ineffectiveness. Such a division of the process of evaluation is artificial, of course, in the sense that in practice the phases overlap and it is necessary for the researcher to give more or less constant consideration to all five steps. Nevertheless it provides a useful framework for examining and understanding the essential components of evaluation research.
Conceptualization. Each social-action program must be evaluated in terms of its particular goals. Therefore, evaluation research must begin with their identification and move toward their specification in terms of concepts that, in turn, can be translated into measurable indicators. All this may sound simple, perhaps routine, compared with the less structured situation facing social researchers engaged in formulating research problems for theoretical, explanatory, descriptive, or other kinds of basic research. But the apparent simplicity is deceptive, and in practice this phase of evaluation research repeatedly has proven to be both critical and difficult for social researchers working in such varied areas as mental health (U.S. Dept. of Health, Education & Welfare 1955), juvenile delinquency (Witmer & Tufts 1954), adult education (Evaluation Techniques 1955), and youth programs for citizenship training (Riecken 1952; Hyman et al. 1962), among others. As an example, Witmer and Tufts raise such questions about the meaning of the concept “delinquency prevention” as: What is to be prevented? Who is to be deterred? Are we talking only about “official” delinquency? Does prevention mean stopping misbehavior before it occurs? Does it mean reducing the frequency of misbehavior? Or does it mean reducing its severity?
Basic concepts and goals are often elusive, vague, unequal in importance to the program, and sometimes difficult to translate into operational terms. What is meant, for example, by such a goal as preparing young persons for “responsible citizenship”? In addition, the evaluator needs to consider possible effects of the program which were unanticipated by the action agency, finding clues from the records of past reactions to the program if it has been in operation prior to the evaluation, studies of similar programs, the social-science literature, and other sources. As an example, Carlson (1952) found that a mass-information campaign against venereal disease failed to increase public knowledge about these diseases; nevertheless, the campaign had the unanticipated effect of improving the morale of public health workers in the area, who in turn did a more effective job of combating the diseases. The anticipation of both planned and unplanned effects requires considerable time, effort, and imagination by the researcher prior to collecting evidence for the evaluation itself.
Research design. The formulation of a research design for evaluation usually involves an attempt to approximate the ideal conditions of a controlled experiment, which measures the changes produced by a program by making comparisons of the dependent variables before and after the program and evaluating them against similar measurements on a control group that is not involved in the program. If the control group is initially similar to the group exposed to the social-action program, a condition achieved through judicious selection, matching, and randomization, then the researcher can use the changes in the control group as a criterion against which to estimate the degree to which changes in the experimental group were probably caused by the program under study. To illustrate, suppose that two equivalent groups of adults are selected for a study on the effects of a training film intended to impart certain information to the audience. The level of relevant information is measured in each group prior to the showing of the film; then one group sees the film while the other does not; finally, after some interval, information is again measured. Changes in the amount of information held by the experimental group cannot simply be attributed to the film; they may also reflect the influence of such factors in the situation as exposure to other sources of information in the interim period, unreliability of the measuring instruments, maturation, and other factors extraneous to the program itself. But the control group presumably also experienced such nonprogrammatic factors, and therefore the researcher can subtract the amount of change in information demonstrated by it from the changes shown by the experimental group, thereby determining how much of the gross change in the latter group is due to the exclusive influence of the program.
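The subtraction at the heart of the training-film illustration can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the information scores are invented, the function names are hypothetical, and a simple mean of individual changes stands in for whatever measure an actual study would use.

```python
from statistics import mean


def mean_change(before, after):
    """Average per-person change between two paired measurement waves."""
    return mean(post - pre for pre, post in zip(before, after))


def net_program_effect(exp_before, exp_after, ctl_before, ctl_after):
    """Gross change in the experimental group minus change in the control group.

    The control group's change serves as the estimate of everything that is
    not the program: outside information, instrument unreliability, maturation.
    """
    gross_change = mean_change(exp_before, exp_after)
    extraneous_change = mean_change(ctl_before, ctl_after)
    return gross_change - extraneous_change


# Hypothetical information scores for four persons in each group.
film_before = [4, 5, 3, 6]
film_after = [8, 9, 6, 9]
control_before = [4, 5, 4, 5]
control_after = [5, 6, 5, 6]

effect = net_program_effect(film_before, film_after, control_before, control_after)
print(effect)  # gross change 3.5 minus control change 1 gives 2.5
```

The estimate is only as good as the initial equivalence of the two groups; if they differ at the outset, the control group's change is no longer a fair measure of the nonprogrammatic factors.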
So it is in the ideal case, such as might be achieved under laboratory conditions. In practice, however, evaluation research seldom permits such ideal conditions. A variety of practical problems requires alterations in the ideal design. As examples, suitable control groups cannot always be found, especially for social-action programs involving efforts at large-scale social change but also for smaller programs designed to influence volunteer participants; also ethical, administrative, or other considerations usually prevent the random assignment of certain persons to a control group that will be denied the treatment offered by the action programs.
In the face of such obstacles, certain methodologists have taken the position that a slavish insistence on the ideal control-group experimental research design is unwise and dysfunctional in evaluation research. Rather, they advocate the ingenious use of practical and reasonable alternatives to the classic design (see Hyman et al. 1962; and Campbell & Stanley 1963). Under certain conditions, for example, it is possible to estimate the amount of change that could have been caused by extraneous events, instability of measurements, and natural growth of participants in a program by examining the amount of change that occurred among participants in programs similar to the one being evaluated. Using such comparative studies as “quasi-control” groups permits an estimate of the relative effectiveness of the program under study, i.e., how much effect it has had over and above that achieved by another program and assorted extraneous factors, even though it is impossible to isolate the specific amount of change caused by the extraneous factors. Another procedure for estimating the influence of nonprogrammatic factors is to study the amount of change which occurs among a sample of the population under study during a period of time prior to the introduction of the action program, using certain of the ultimate participants as a kind of control upon themselves, so to speak. Replications of the evaluation study, when possible, also provide safeguards against attributing too much or too little effect to the program under study. Admittedly, all such practical alternatives to the controlled experimental design have serious limitations and must be used with judgment; the classic experimental design remains preferable whenever possible and serves as an ideal even when impractical. Nevertheless, such expedients have proven useful to evaluators and have permitted relatively rigorous evaluations to be conducted under conditions less perfect than those found in the laboratory.
Error control. Evaluation studies, like all social research, involve difficult problems in the selection of specific research procedures and the provision for estimating and reducing various sources of error, such as sampling bias, bias due to non-response, measurement errors arising in the questions asked or in recording of answers, deliberate deception, and interviewer bias. The practices employed to control such errors in evaluation research are similar to those used in other forms of social research, and no major innovations have been introduced.
Estimating effectiveness. To consider the fourth stage in evaluation, a distinction needs to be made between demonstrating the effects of an action program and estimating its effectiveness. Effectiveness refers to the extent to which the program achieves its goals, but the question of just how much effectiveness constitutes success and justifies the efforts of the program is unanswerable by scientific research. It remains a matter for judgment on the part of the program’s sponsors, administrators, critics, or others, and the benefits, of course, must somehow be balanced against the costs involved. The problem is complicated further by the fact that most action programs have multiple goals, each of which may be achieved with varying degrees of success over time and among different subgroups of participants in the program. To date there is no general calculus for appraising the over-all net worth of a program.
Even if the evaluation limits itself to determining the success of a program in terms of each specific goal, however, it is necessary to introduce some indexes of effectiveness which add together the discrete effects within each of the program’s goal areas. Technical problems of index and scale construction have been given considerable attention by methodologists concerned with various types of social research (see Lazarsfeld & Rosenberg 1955). But as yet there is no theory of index construction specifically appropriate to evaluation research. Steps have been taken in this direction, however, and the utility of several types of indexes has been tentatively explored (see Hyman et al. 1962). One type of difficulty, for example, arises from the fact that the amount of change that an action program produces may vary from subgroup to subgroup and from topic to topic, depending upon how close to perfection each group was before the program began. Thus, an information program can influence relatively fewer persons among a subgroup in which, say, 60 per cent of the people are already informed about the topic than among another target group in which only 30 per cent are initially informed. An “effectiveness index” has been successfully employed to help solve the problem of weighting effectiveness in the light of such restricted ceilings for change (see Hovland et al. 1949; and Hyman et al. 1962). This index, which expresses actual change as a proportion of the maximum change that is possible given the initial position of a group on the variable under study, has proven to be especially useful in evaluating the relative effectiveness of different programs and the relative effectiveness of any particular program for different subgroups or on different variables.
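The index described above, actual change as a proportion of the maximum change possible, can be written out as a short Python sketch. The function name and the "after" percentages are assumptions made for illustration; only the initial levels of 60 and 30 per cent come from the example in the text.

```python
def effectiveness_index(pct_before, pct_after):
    """Actual change as a proportion of the maximum possible change.

    Both arguments are percentages (0 to 100) of a group that is informed.
    A group that starts at 100 per cent has no room to change, so the index
    is undefined there.
    """
    if not 0 <= pct_before < 100:
        raise ValueError("pct_before must be at least 0 and below 100")
    return (pct_after - pct_before) / (100 - pct_before)


# Initial levels from the text; the later levels are hypothetical.
well_informed_group = effectiveness_index(60, 80)    # raw gain of 20 points
poorly_informed_group = effectiveness_index(30, 65)  # raw gain of 35 points

print(well_informed_group, poorly_informed_group)  # 0.5 0.5
```

In this invented case the raw gains differ, 20 points against 35, yet each group moved halfway toward its ceiling, which is the kind of comparison across subgroups that the index is meant to make possible.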
Understanding effectiveness. In its final stage, evaluation research goes beyond the demonstration of a program’s effects to seek information that will help to account for its successes and failures. The reasons for such additional inquiry may be either practical or theoretical.
Sponsors of successful programs may want to duplicate their action program at another time or under other circumstances, or the successful program may be considered as a model for action by others. Such emulation can be misguided and even dangerous without information about which aspects of the program were most important in bringing about the results, for which participants in the program, and under what conditions. Often it is neither possible nor necessary, however, to detect and measure the impact of each component of a social-action program. In this respect, as in others noted above, evaluation research differs from explanatory survey research, where specific stimuli are isolated, and from experimental designs, where isolated stimuli are introduced into the situation being studied. In evaluation research the independent variable, i.e., the program under study, is usually a complex set of activities no one of which can be separated from the others without changing the nature of the program itself. Hence, explanations of effectiveness are often given in terms of the contributions made by certain gross features of the program, for example, the total impact of didactic components versus social participation in a successful educational institution.
Gross as such comparisons must be, they nevertheless provide opportunities for testing specific hypotheses about social and individual change, thereby contributing to the refinement and growth of social science theories. It is important to remember, however, that such gains are of secondary concern to evaluation research, which has as its primary goal the objective measurement of the effectiveness of the program.
Certain forms of research design promise to yield valuable results both for the primary task of evaluation and its complementary goal of enlarging social knowledge. Among the most promising designs are those that allow for comparative evaluations of different social-action programs, replication of evaluations of the same program, and longitudinal studies of the long-range impact of programs. Comparative studies not only demonstrate the differential effectiveness of various forms of programs having similar aims but also provide a continuity in research which permits testing theories of change under a variety of circumstances. Replicative evaluations add to the confidence in the findings from the initial study and give further opportunity for exploring possible causes of change. Longitudinal evaluations permit the detection of effects that require a relatively long time to occur and allow an examination of the stability or loss of certain programmatic effects over time and under various natural conditions outside of the program’s immediate control.
Viewed in this larger perspective, then, evaluation research deserves full recognition as a social science activity which will continue to expand. It provides excellent and ready-made opportunities to examine individuals, groups, and societies in the grip of major and minor forces for change. Its applications contribute not only to a science of social planning and a more rationally planned society but also to the perfection of social and psychological theories of change.
Charles R. Wright
[See also Experimental design; Survey analysis.]
CAMPBELL, DONALD T.; and STANLEY, J. S. 1963 Experimental and Quasi-experimental Designs for Research on Teaching. Pages 171–246 in Nathaniel L. Gage (editor), Handbook of Research on Teaching. Chicago: Rand McNally.
CARLSON, ROBERT O. 1952 The Influence of the Community and the Primary Group on the Reactions of Southern Negroes to Syphilis. Ph.D. dissertation, Columbia Univ.
Evaluation Techniques. 1955 International Social Science Bulletin 7: 343–458.
HAYES, SAMUEL P. 1959 Measuring the Results of Development Projects: A Manual for the Use of Field Workers. Paris: UNESCO.
HOVLAND, CARL I.; LUMSDAINE, ARTHUR A.; and SHEFFIELD, FREDERICK D. 1949 Experiments on Mass Communication. Studies in Social Psychology in World War II, Vol. 3. Princeton Univ. Press.
HYMAN, HERBERT H.; and WRIGHT, CHARLES R. 1966 Evaluating Social Action Programs. Unpublished manuscript.
HYMAN, HERBERT H.; WRIGHT, CHARLES R.; and HOPKINS, TERENCE K. 1962 Applications of Methods of Evaluation: Four Studies of the Encampment for Citizenship. Berkeley: Univ. of California Press.
KLINEBERG, OTTO 1955 Introduction: The Problem of Evaluation. International Social Science Bulletin 7: 346–352.
LAZARSFELD, PAUL F.; and ROSENBERG, MORRIS (editors) 1955 The Language of Social Research: A Reader in the Methodology of Social Research. Glencoe, Ill.: Free Press.
McCORD, WILLIAM; McCORD, JOAN; and ZOLA, IRVING K. 1959 Origins of Crime: A New Evaluation of the Cambridge–Somerville Youth Study. New York: Columbia Univ. Press.
POWERS, EDWIN; and WITMER, HELEN L. 1951 An Experiment in the Prevention of Delinquency. New York: Columbia Univ. Press; Oxford Univ. Press.
RIECKEN, HENRY W. 1952 The Volunteer Work Camp: A Psychological Evaluation. Reading, Mass.: Addison-Wesley.
SELLTIZ, CLAIRE et al. (1959) 1962 Research Methods in Social Relations. New York: Holt.
U.S. DEPT. OF HEALTH, EDUCATION & WELFARE, NATIONAL INSTITUTES OF HEALTH 1955 Evaluation in Mental Health: Review of Problem of Evaluating Mental Health Activities. Washington: Government Printing Office.
WITMER, HELEN L.; and TUFTS, EDITH 1954 The Effectiveness of Delinquency Prevention Programs. Washington: Government Printing Office.