Quasi-Experimental Research Designs

The goal of most social scientific research is to explain the causes of human behavior in its myriad forms. Researchers generally attempt to do this by uncovering causal associations among variables. For example, researchers may be interested in whether a causal relationship exists between income and happiness. One might expect a positive association between these two variables. That is, an increase in income, the independent variable, produces an increase in happiness, the dependent variable. Unfortunately, observing a positive correlation between these two variables does not prove that income causes happiness. In order to make a valid causal inference, three conditions must be present: (1) there must be an association between the variables (e.g., income and happiness); (2) the variable that is the presumed cause (e.g., income) must precede the effect (e.g., happiness) in time; and (3) the association between the two variables cannot be explained by the influence of some other variable (e.g., education) that may be related to both of them. The purpose of any research design is to construct a circumstance within which a researcher can achieve these three conditions and thus make valid causal inferences.

Experimental designs are one of the most efficient ways to accomplish this goal of making valid causal inferences. Four characteristics are especially desirable in designing experiments. First, researchers manipulate the independent variable. That is, they actively modify persons' environment (e.g., provide some people with money they otherwise would not have received)—as contrasted with passively observing the existing, "natural" environment (e.g., simply measuring the amount of income persons normally make). Second, researchers have complete control over when the independent variable is manipulated (e.g., when persons receive supplementary income). Third, researchers have complete control over what they manipulate. That is, they can specify the exact content of the different "treatment conditions" (different levels) of the independent variable to which subjects are exposed (e.g., how much supplementary income persons receive and the manner in which they receive it). Fourth, researchers have complete control over who is assigned to which treatment condition (e.g., who receives the supplementary income and who does not, or who receives higher versus lower amounts of income).

Of these four characteristics important in designing experiments, only manipulation of the independent variable and control over who receives the treatments are essential to classify a study as a true experimental design.

Control over who receives treatment conditions is especially powerful in enhancing valid causal inference when researchers use the technique of random assignment. For example, in evaluating the effect of income on happiness, investigators might randomly assign individuals who are below the poverty level to treatment groups receiving varying levels of supplementary income (e.g., none versus $1,000).

Table 1 (a) illustrates this example. It depicts an experimental design in which subjects are randomly assigned to one of two groups. At time 1, researchers manipulate the independent variable (X): Each subject in group 1 receives $1,000 in supplementary income. Conversely, no subjects in group 2 receive any supplementary income. At time 2, researchers observe (measure) the average level of happiness (O) for group 1 versus group 2. The sequence X O in the diagram indicates the expectation that an increase in supplementary income will produce an increase in happiness. That is, the average happiness score should be higher for group 1 than for group 2.

By assigning each subject to a particular treatment condition based on a coin flip or some other random procedure, experimental designs ensure that each subject has an equal chance of appearing in any one of the treatment conditions (e.g., at any level of supplementary income). Therefore, as a result of random assignment, the different treatment groups depicted in Table 1 (a) should be approximately equivalent in all characteristics (average education, average physical health, average religiosity, etc.) except their exposure to different levels of the independent variable (i.e., different levels of supplementary income). Consequently, even though there is a large number of other variables (e.g., education, physical health, and religiosity) that might affect happiness, none of these variables can serve as a plausible alternative explanation for why the higher-income group has higher average happiness than does the lower-income group.

For example, due to random assignment, physically healthy versus unhealthy persons should be approximately equally distributed between the higher versus lower supplementary income treatment groups. Hence, a critic could not make a plausible argument that the treatment group receiving the higher amount of supplementary income (i.e., group 1) also has better health, and it is the better health and not the greater income that is producing the higher levels of happiness in that treatment group. Indeed, this same logic applies no matter what other causal variables a critic might substitute for physical health as an alternative explanation for why additional income is associated with greater happiness.
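
The logic of random assignment is easy to demonstrate by simulation. The following minimal sketch (in Python; the variable names, distributions, and assumed +5-point treatment effect are invented for illustration, not findings from any actual income study) randomly assigns subjects to the two conditions of Table 1 (a), verifies that covariates such as education and health are approximately balanced across groups, and then compares posttest happiness:

```python
# Minimal simulation of the posttest-only randomized experiment in Table 1 (a).
# All names, distributions, and effect sizes are illustrative assumptions.
import random
import statistics

random.seed(42)

subjects = [{"education": random.gauss(12, 3),   # hypothetical covariates
             "health": random.gauss(50, 10)} for _ in range(1000)]

# Random assignment: each subject has an equal chance of either condition.
for s in subjects:
    s["gift"] = random.random() < 0.5   # True -> $1,000 gift, False -> no gift

treated = [s for s in subjects if s["gift"]]
control = [s for s in subjects if not s["gift"]]

# Balance check: covariate means should be approximately equal across groups.
for var in ("education", "health"):
    print(var,
          round(statistics.mean(s[var] for s in treated), 2),
          round(statistics.mean(s[var] for s in control), 2))

# Posttest happiness: assume a true treatment effect of +5 points plus noise.
for s in subjects:
    s["happiness"] = 60 + (5 if s["gift"] else 0) + random.gauss(0, 10)

print("mean happiness, gift:   ",
      round(statistics.mean(s["happiness"] for s in treated), 2))
print("mean happiness, no gift:",
      round(statistics.mean(s["happiness"] for s in control), 2))
```

Because no covariate differs systematically between the groups, the roughly five-point gap in mean happiness cannot plausibly be attributed to education, health, or any other "third" variable.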

In sum, strong causal inferences are possible where social scientists manipulate the independent variable and retain great control over when treatments occur, what treatments occur, and, especially, who receives the different treatments. But there are times when investigators, typically in "field" (i.e., nonlaboratory, natural, or real-world) settings, are interested in the effects of an intervention but cannot do randomized experiments. More specifically, there are times when researchers in naturally occurring settings can manipulate the independent variable and exercise at least some control over when the manipulation occurs and what it includes. But these same field researchers may have less control over who receives the treatment conditions. In other words, there are many real-world settings in which random assignment is not possible.

Where randomized experiments are not possible, a large number of potential threats to valid causal inference can occur. Under these less-than-optimal field conditions, investigators may resort to a number of alternative research designs that help reduce at least some of the threats to making valid causal inferences. These alternative procedures are collectively referred to as quasi-experimental designs. (See also Campbell and Stanley 1963; Cook and Campbell 1979; Cook and Shadish 1994; Shadish et al. in preparation.)

None of these designs is as powerful as a randomized experiment in establishing causal relationships, but some of the designs are able to overcome the absence of random assignment such that they approximate the power of randomized experiments. Conversely, where the designs are particularly weak in establishing causal relationships, Campbell and Stanley (1963) have described them as preexperimental designs. Furthermore, social scientists describe as nonexperimental designs those studies in which the researcher can only measure (observe) rather than manipulate the independent variable. As we shall see, however, one type of nonexperimental design—the "panel"—may surpass preexperimental designs and approach the power of some quasi-experimental designs in overcoming threats to valid causal inference.

Below we describe common threats to "internal validity" (i.e., the making of valid causal inferences) in field settings, the conditions under which such threats are likely to occur, and representative research designs and strategies used to combat the threats. Later we briefly examine threats to "external validity," "construct validity," and "statistical conclusion validity," and strategies used to reduce these threats. As we shall see, whereas randomized experiments are the exemplary design for enhancing internal validity (causal inference), they often suffer in comparison to other research designs with regard to external validity (generalizability across persons, places, and times) and construct validity (whether one is measuring and manipulating what one intended).


THREATS TO INTERNAL VALIDITY

Where researchers are unable to assign subjects to treatment conditions randomly, a large number of threats to internal validity (causal inference) can occur. These potential threats include effects due to history, maturation, testing, instrumentation, regression to the mean, selection, mortality, and reverse causal order. (See Cook and Campbell 1979 and Shadish et al. in preparation for more elaborate lists.)

Research designs vary greatly in how many and which of these potential threats are likely to occur—that is, are likely to serve as plausible alternative explanations for an apparent causal relationship between an independent and a dependent variable. As an example of a weak (preexperimental) research design in which most of the various threats to internal validity are plausible, consider the "one-group pretest-posttest design" (Campbell and Stanley 1963). Furthermore, assume that researchers have adapted this design to study the effect of income on happiness. As depicted in Table 1 (b), investigators observe the happiness (O) of persons at time 2 following a period (during time 1) in which subjects (all below the poverty line) receive no supplementary income (X = $0). Subsequently, subjects receive a $1,000 "gift" (X = $1,000) at time 3, and their happiness is remeasured (O) at time 4.

The investigators find that posttest happiness (i.e., O at time 4) is indeed substantially higher than pretest happiness (i.e., O at time 2). Accordingly, an increase in supplementary income is associated with an increase in happiness. But is this association due to supplementary income's causing an increase in happiness? Or is the association due to some alternative explanation?

Given this weak, preexperimental research design, there are a number of threats to internal validity that serve as plausible alternative explanations for increases in happiness other than the $1,000 gift. These plausible threats include effects due to history, maturation, testing, instrumentation, and regression to the mean, with less likely or logically impossible threats due to selection, mortality, and reverse causal order.

History effects refer to some specific event that exerts a causal influence on the dependent variable, and that occurs at roughly the same time as the manipulation of the independent variable. For instance, during the period between the pretest (time 2) and posttest (time 4) measure of happiness as outlined in Table 1 (b), Congress may have declared a national holiday. This event could have the effect of elevating everyone's happiness. Consequently, even if the $1,000 gift had no effect on happiness, researchers would observe an increase in happiness from the pretest to posttest measure. In other words, the effects of the $1,000 gift are totally confounded with the effects of the holiday, and both remain reasonable explanations for the change in happiness from time 2 to time 4. That is, a plausible rival explanation for the increase in happiness with an increase in income is that the holiday and not the additional income made people happier.

Maturation effects are changes in subjects that result simply from the passage of time (e.g., growing hungrier, growing older, growing more tired). Simply put, "people change." To continue with our current example using a weak, preexperimental research design, assume that individuals, as they grow older, increase in happiness owing to their improved styles of coping, increasing acceptance of adverse life events, or the like. If such developmental changes appear tenable, then maturation becomes a plausible rival explanation for why subjects' happiness increased after receiving the $1,000 gift. That is, subjects would have displayed an increase in happiness over time even if they had not received the $1,000 gift.

Testing effects are the influences of taking a pretest on subsequent tests. In the current study of income and happiness, pretest measures of happiness allow participants to become familiar with the measures' content in a way that may have "carryover" effects on later measures of happiness. That is, familiarity with the pretest may make salient certain issues that would not be salient had subjects not been exposed to the pretest. Consequently, it is possible that exposure to the pretest could cause participants to ponder these suddenly salient issues and therefore change their opinions of themselves. For example, people may come to see themselves as happier than they otherwise would have perceived themselves. Consequently, posttest happiness scores would be higher than pretest scores, and this difference need not be due to the $1,000 gift.

Instrumentation effects are a validity threat that occurs as the result of changes in the way that a variable is measured. For instance, in evaluating the effect of income on happiness, researchers may make pretest assessments with one type of happiness measure. Then, perhaps to take advantage of a newly released measure of happiness, researchers might use a different happiness measure on the posttest. Unless the two measures have exactly parallel forms, however, scores on the pretests and posttests are likely to differ. Accordingly, any observed increase in happiness may be due to the differing tests and not to the $1,000 gift.

Regression to the mean is especially likely to occur whenever two conditions are present in combination: (1) researchers select subjects who have extreme scores on a pretest measure of the dependent variable, and (2) the dependent variable is less than perfectly measured (i.e., is less than totally reliable owing to random measurement error). It is a principle of statistics that individuals who score either especially high or low on an imperfectly measured pretest are most likely to have more moderate scores (i.e., regress toward their respective mean) on the posttest. In the social sciences, almost all variables (e.g., happiness) are less than perfectly reliable. Hence, whenever social scientists assign subjects to treatment conditions based on high or low pretest scores, regression to the mean is likely to occur. For example, researchers may believe that those persons who are most unhappy will benefit most from a $1,000 gift. Therefore, only persons with low pretest scores are allowed into the study. However, low scorers on the pretest are likely to have higher happiness scores on the posttest simply as a result of remeasurement. Under such circumstances, regression to the mean remains a plausible rival explanation for any observed increase in happiness following the $1,000 gift.
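
A short simulation makes this statistical principle visible. In the sketch below (all numbers invented), happiness is measured with random error, only extreme low scorers on the pretest are admitted to the study, and the selected group's posttest mean rises toward the overall mean even though no treatment of any kind occurs:

```python
# Minimal sketch of regression to the mean under unreliable measurement.
# The true-score distribution and error variances are illustrative assumptions.
import random
import statistics

random.seed(1)

true_scores = [random.gauss(50, 10) for _ in range(10_000)]  # true happiness

# Pretest and posttest = true score + independent random measurement error.
pretest = [t + random.gauss(0, 8) for t in true_scores]
posttest = [t + random.gauss(0, 8) for t in true_scores]

# Select only extreme low scorers on the (unreliable) pretest, as in the text.
low = [(pre, post) for pre, post in zip(pretest, posttest) if pre < 35]

print("selected pretest mean: ", round(statistics.mean(p for p, _ in low), 2))
print("selected posttest mean:", round(statistics.mean(p for _, p in low), 2))
# The posttest mean is higher with no treatment at all: the selected group
# "regresses" toward the overall mean of 50 simply on remeasurement.
```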

Selection effects are processes that result in different kinds of subjects being assigned to one treatment group as compared to another. If these differences (e.g., sex) affect the dependent variable (e.g., happiness), then selection effects serve as a rival explanation for the assumed effect of the hypothesized causal variable (e.g., income). Because there is not a second group in the one-group pretest-posttest design illustrated here, the design is not subject to validity threats due to selection. That is, because the same group receives all treatment conditions (e.g., no gift versus a $1,000 gift), the characteristics of subjects (e.g., the number of females versus the number of males) remain constant across treatment conditions. Thus, even if females tended to be happier than males, this effect could not explain why an increase in happiness occurred after subjects received the $1,000 gift.

Mortality effects refer to the greater loss of participants (e.g., due to death or disinterest) in one treatment group compared to another. For instance, in the study of the effects of income on happiness, the most unhappy people are more likely than other subjects to drop out of the study before its completion. Because these dropouts appear in the pretest but not the posttest, the average level of happiness will increase. That is, an increase in happiness would occur even if the supplementary income had no effect whatsoever. Mortality is not, however, a plausible alternative explanation in the current example of a study using the one-group pretest-posttest design. Researchers can simply exclude from the study any subjects who appear in the pretest but not the posttest measure of happiness.

Reverse causal order effects are validity threats due to ambiguity about the direction of a causal relationship; that is, does X cause O, or does O cause X? The one-group pretest-posttest design is not subject to this internal validity threat. The manipulation of the independent variable (giving the $1,000 gift) clearly precedes observation of the dependent variable (degree of happiness). In general, where research designs manipulate rather than measure the independent variable, they greatly reduce the threat of reverse causal order.

As an overview, the reader should note that the various threats to internal validity, where plausible, violate the last two of three conditions necessary for establishing a valid causal inference. Recall that the three conditions are: (1) an association between two variables is present; (2) the presumed cause must precede the presumed effect in time; and (3) the association between the two variables cannot be explained by the influence of a "third" variable that may be related to both of them.

Only the violation of the first condition is not covered by the list of specific threats to internal validity. (But see the later discussion of threats to statistical conclusion validity.) Reverse causal order is a threat to internal validity that violates the second condition of causal inference. Furthermore, history, maturation, testing, instrumentation, regression to the mean, selection, and mortality are all threats to internal validity that one can broadly describe as the potential influence of a "third" variable—that is, threats that violate the third condition of causal inference. That is, each of these threats represents a specific type of third variable that affects the dependent variable and coincides with the manipulation of the independent variable. In other words, the third variable is related to both the independent and dependent variable. Because the third variable affects the dependent variable at the same time that the independent variable is manipulated, it will appear that the independent variable causes a change in the dependent variable. But in fact this apparent causal relation is a spurious (i.e., noncausal) by-product of the third variable's influence.

As an illustration, recall how validity threats due to history can produce a spurious correlation between income and happiness. In the example used earlier, Congress declared a national holiday that increased subjects' happiness and coincided with subjects receiving a $1,000 gift. Hence, the occurrence of a national holiday represents a "third" variable that is related both to income and happiness, and makes it appear (falsely) that income increases happiness.

Research, in its broadest sense, can be viewed as an investigator's attempt to convince the scientific community that a claimed causal relationship between two variables really exists. Clearly, the presence of one or more threats to internal validity challenges the researcher's claim. That is, the more likely a validity threat seems, the less convincing is the investigator's claim.

When confronted with the specific threats to internal validity in field settings, investigators can attempt to modify their research design to control one or more of these threats. The fact that a specific threat is possible for a given research design, however, does not mean it is plausible. Implausible threats do little to reduce the persuasiveness of researchers' claims. Therefore, the specific design researchers use should be determined in large part by the specific threats to validity that are considered most plausible.

Furthermore, as noted earlier, each research design has a given number of possible threats to internal validity, and some designs have more possible threats than do other designs. But only a certain number of these threats will be plausible for the specific set of variables under study. That is, different sets of independent and dependent variables will carry different threats to internal validity. Thus, researchers may select weaker designs where the plausible threats for a given set of variables are relatively few and not among the possible threats for the given design. Campbell and Stanley (1963) note, for example, that the natural sciences can often use the one-group pretest-posttest design despite its long list of possible threats to internal validity. Given carefully controlled laboratory conditions and a focus on variables measuring nonhuman phenomena, few of these threats are plausible.

The next section examines some common quasi-experimental designs and plausible threats to internal validity created by a given design. The discussion continues to use the concrete example of studying the relationship between income and happiness. Examples using a different set of variables might, of course, either reduce or increase the number of plausible threats for a given design.


QUASI-EXPERIMENTAL DESIGNS

When researchers have the opportunity to make more than a single pretest and posttest, some form of time series design becomes possible. Table 1 (d) illustrates the structure of this design. The O's designate a series of observations (measures of happiness) on the same individuals (group 1) over time. The table shows that subjects receive no supplementary income (X = $0) through the first two (times 2 and 4) observational periods. Then at time 5 subjects receive the $1,000 gift (X = $1,000). Their subsequent level of happiness is then observed at three additional points (times 6, 8, and 10).

This quasi-experimental design has a number of advantages over the single-group pretest-posttest (preexperimental) design. For instance, by examining the trend yielded by multiple observations prior to providing the $1,000 gift, it is possible to rule out validity threats due to maturation, testing, and regression to the mean. In contrast, instrumentation could still be a threat to validity, if researchers changed the way they measured happiness—especially for changes occurring just before or after giving the $1,000 gift. Moreover, artifacts due to history remain uncontrolled in the time series design. For example, it is still possible that some positive event in the broader environment could occur at about the same time as the giving of the $1,000 gift. Such an event would naturally serve as a plausible alternative explanation for why happiness increased after the treatment manipulation.

In addition to eliminating some threats to internal validity found in the one-group pretest-posttest design, the time series design provides measures of how long a treatment effect will last. That is, the multiple observations (O's) following the $1,000 gift allow researchers to assess how long happiness will remain elevated after the treatment manipulation.
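
One common way to analyze data from a time series design is a segmented ("interrupted time series") regression that separates a gradual time trend from an abrupt level shift at the intervention, so that maturation-like trends cannot masquerade as treatment effects. The sketch below is illustrative only; the data, trend, and assumed +4-point level shift are fabricated:

```python
# Minimal interrupted time series sketch for the design in Table 1 (d):
# several happiness observations before and after the $1,000 gift.
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(20)                 # observation periods
post = (t >= 10).astype(float)    # 0 before the gift, 1 after
happiness = 50 + 0.2 * t + 4.0 * post + rng.normal(0, 1.5, size=t.size)

# Segmented regression: intercept, underlying time trend, and a level shift
# at the intervention. The "t" term absorbs any steady maturation trend.
X = np.column_stack([np.ones_like(t, dtype=float), t.astype(float), post])
coef, *_ = np.linalg.lstsq(X, happiness, rcond=None)
print("estimated level shift at the intervention:", round(coef[2], 2))
```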

In some circumstances, the time series design may not be possible, owing to constraints of time or money. In such cases, other quasi-experimental designs may be more appropriate. Consequently, as an alternative strategy for dealing with some of the threats to internal validity posed by the single-group pretest-posttest (preexperimental) design, researchers may add one or more comparison groups.

The simplest multigroup design is the static-group comparison (Campbell and Stanley 1963). Table 1 (c) provides an illustration of this design. Here observations are taken from two different groups (G1 and G2) at the same point in time. The underlying assumption is that the two groups differ only in the treatment condition (a $1,000 gift versus no gift) they receive prior to the measure of happiness. In many instances, this is not a safe assumption to make.

The static-group comparison design does reduce some potential threats to internal validity found in the single-group pretest-posttest design; namely, history, testing, instrumentation, and regression to the mean. That is, each of these threats should have equal effects on the two experimental groups. Thus, these threats cannot explain why experimental groups differ in posttest happiness.

Conversely, the static-group comparison design adds other potential threats—selection, reverse causal order, and mortality effects—not found in the single-group pretest-posttest design. Indeed, these threats are often so serious that Campbell and Stanley (1963) refer to the static-group comparison, like the single-group pretest-posttest, as a "preexperimental" design.

Selection effects are generally the most plausible threats to internal validity in the static-group comparison design. That is, in the absence of random assignment, the treatment groups are likely to differ in the type of people they include. For example, researchers might assign poverty-level subjects to the $1,000 gift versus no gift treatment groups based on physical health criteria. Subjects in poor health would receive the supplementary income; subjects in better health would not. Note, however, that poor health is likely to reduce happiness, and that less healthy—and therefore less happy—people appear in the $1,000 treatment condition. Hence, it is possible that this selection effect based on physical health could obscure the increase in happiness due to the supplementary income. In other words, even if the $1,000 gift does have a positive effect on happiness, researchers might make a false causal inference; namely, that supplementary income has no effect on happiness.

This result illustrates the point that threats to internal validity do not always falsely suggest that a causal effect occurred. Threats to internal validity can also mask a causal effect that did in fact occur. In other words, threats to internal validity concern possible false-negative findings as well as false-positive findings.

The preceding example showed how false-negative findings can result due to selection effects in the static-group comparison. False-positive findings can, of course, also occur due to selection effects in this design. Consider, for instance, a situation in which researchers assign subjects to treatment conditions based on contacts with a particular governmental agency that serves the poor. Say that the first twenty subjects who contact this agency on a specific day receive the $1,000 gift, and the next twenty contacts serve as the no-gift comparison group. Furthermore, assume that the first twenty subjects who call have extroverted personalities that made them call early in the morning. In contrast, the next twenty subjects are less extroverted and thus call later in the day. If extroverted personality also produces higher levels of happiness, then the group receiving the $1,000 gift would be happier than the no-gift comparison group even before the treatment manipulation. Accordingly, even if supplementary income has no effect on happiness, it will appear that the $1,000 gift increased happiness. In other words, extroverted personality is a "third" variable that has a positive causal effect on both level of supplementary income and happiness. That is, the more extroversion, the more supplementary income; and the more extroversion, the more happiness. These causal relationships therefore make it appear that there is a positive, causal relationship between supplementary income and happiness; but in fact this latter correlation is spurious.
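
This chain of reasoning is easy to reproduce in a few lines of simulation. In the hedged sketch below, extroversion alone determines both who receives the gift (the earliest callers) and happiness, while the gift's true effect is zero; a spurious group difference appears nonetheless. All quantities are invented:

```python
# Minimal simulation of the selection-effect story above: extroversion (a
# "third" variable) determines both treatment assignment and happiness,
# while the $1,000 gift itself has no causal effect at all.
import random
import statistics

random.seed(7)

callers = [{"extroversion": random.gauss(0, 1)} for _ in range(40)]

# The first twenty (most extroverted) callers receive the gift.
callers.sort(key=lambda c: c["extroversion"], reverse=True)
for i, c in enumerate(callers):
    c["gift"] = i < 20

# Happiness depends on extroversion only; the gift coefficient is zero.
for c in callers:
    c["happiness"] = 50 + 5 * c["extroversion"] + random.gauss(0, 2)

gift = [c["happiness"] for c in callers if c["gift"]]
no_gift = [c["happiness"] for c in callers if not c["gift"]]
print("gift group mean:   ", round(statistics.mean(gift), 2))
print("no-gift group mean:", round(statistics.mean(no_gift), 2))
# The gift group looks happier even though the gift did nothing: a spurious
# association produced entirely by the selection process.
```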

Reverse causal order effects are another potential threat to internal validity when researchers use the static-group comparison design. Indeed, reverse causal order effects are really just a special case of selection effects. More specifically, reverse causal order effects will occur whenever the dependent variable is also the "third" variable that determines who is assigned to which treatment groups.

By substituting happiness for extroversion as the "third" variable in the preceding example, one can demonstrate how this reverse causal order effect could occur. Recall that subjects who contacted the government agency first were the most extroverted. Assume now, instead, that the earliest callers were happier people than those who called later (because unhappy people are more likely to delay completing tasks). Under these conditions, then, prior levels of happiness comprise a "third" variable that has a positive causal effect on both level of supplementary income and subsequent happiness. That is, those subjects who are initially happier are more likely to receive supplementary income; and those subjects who are initially happier are more likely to experience subsequent (posttest) happiness. These causal relationships hence make it appear that there is a positive, causal association between supplementary income and happiness. In fact, however, this correlation is spurious. Indeed, it is not supplementary income that determines happiness; it is happiness that determines supplementary income.

Mortality is another possible threat to internal validity in the static-group comparison design. Even if the treatment groups have essentially identical characteristics before the manipulation of the independent variable (i.e., no selection effects), differences between the groups can occur as a consequence of people dropping out of the study. That is, by the time researchers take posttest measures of the dependent variable, the treatment groups may no longer be the same.

For example, in the study of income and happiness, perhaps some individuals in the no-gift group hear that others are receiving a $1,000 gift. Assume that among those people, the most likely to drop out are those who have a "sour" disposition, that is, those who are likely to be the most unhappy members of the group in general. Consequently, the no-gift comparison group will display a higher posttest measure of happiness than the group would have if all members had remained in the study. Thus, even if the $1,000 gift increases happiness, the effect may be obscured by the corresponding, "artificial" increase in happiness in the no-gift comparison group. In other words, mortality effects may lead researchers to make a false causal inference; namely, that there isn't a causal relationship between two variables, when in fact there is.

One of the most common quasi-experimental designs is the nonequivalent control group design. This design is an elaboration of the static-group comparison design. The former is a stronger design than the latter, however, because researchers administer pretests on all groups prior to manipulating the independent variable. Table 1 (e) illustrates this design.

A major advantage of the pretests is that they allow researchers to detect the presence of selection effects. Specifically, by comparing pretest scores for the different treatment groups before the manipulation of treatment conditions, it is possible to discern whether the groups are initially different. If the groups differ at the time of the pretest, any observed differences at the posttest may simply be a reflection of these preexisting differences.

For instance, in the income and happiness study, if the group receiving the $1,000 gift is happier than the no-gift comparison group at the time of the pretest, it would not be surprising for this group to be happier at posttest, even if supplementary income had no causal effect. The point is that the nonequivalent control group design, unlike the static-group comparison design, can test whether this difference is present. If there is no difference, then researchers can safely argue that selection effects are not a threat to internal validity in their study.

The inclusion of pretest scores also permits the nonequivalent control group design to detect the presence or absence of other threats to internal validity that cannot be detected using the static-group comparison design—namely, mortality and reverse causal order. Recall that threats due to reverse causal order are a special subset of selection effects. Thus, the ability of the nonequivalent control group design to detect selection effects means it should also detect reverse causal order effects. Reverse causal order effects appear as differences in pretest measures of the dependent variable. Therefore, in the present example, differences between groups in pretest happiness would indicate the possibility of reverse causal order effects. In other words, the amount of pretest happiness determined the amount of supplementary income subjects received ($1,000 gift versus no gift), rather than the converse, that the amount of supplementary income determined the amount of posttest happiness.

Furthermore, the pretest scores of the nonequivalent control group design also allow assessment of mortality effects. Regardless of which subjects drop out of which treatment condition, the researcher can examine the pretest scores for the remaining subjects to ensure that the different treatment groups have equivalent initial scores (e.g., on happiness).
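
One simple way to put the pretest scores to work, sketched below with hypothetical data, is to check that the groups' pretest means are comparable and then to compare average pretest-to-posttest gains rather than raw posttest scores, so that any stable preexisting difference cannot be mistaken for a treatment effect. This is only one of several possible analyses for the design:

```python
# Minimal sketch of pretest checks in the nonequivalent control group design
# of Table 1 (e). All scores below are hypothetical.
import statistics

pre_gift = [48, 52, 50, 47, 53, 49]    # pretest happiness, $1,000-gift group
post_gift = [55, 58, 54, 53, 60, 56]
pre_none = [49, 51, 50, 48, 52, 50]    # pretest happiness, no-gift group
post_none = [50, 52, 49, 49, 53, 51]

# Selection check: similar pretest means suggest the groups started out alike.
print("pretest means:",
      round(statistics.mean(pre_gift), 2),
      round(statistics.mean(pre_none), 2))

# Compare average gains rather than raw posttest means, so that stable
# preexisting differences cannot be mistaken for a treatment effect.
gain_gift = statistics.mean(b - a for a, b in zip(pre_gift, post_gift))
gain_none = statistics.mean(b - a for a, b in zip(pre_none, post_none))
print("gain, gift group:   ", round(gain_gift, 2))
print("gain, no-gift group:", round(gain_none, 2))
```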

In sum, the nonequivalent control group design is able to reduce all the threats to internal validity noted up to this point. Unfortunately, it is unable to detect one threat to internal validity not previously covered—selection-maturation interactions. (For a more complete list of interactions with selection, see Cook and Campbell 1979 and Shadish et al. in preparation.) This threat occurs whenever the various treatment groups are maturing—growing more experienced, tired, bored, and so forth—at different rates.

For example, consider a situation in which the pretest happiness of the group receiving no gift is as high as the group receiving the $1,000 gift. Moreover, the pretest measures occur when both groups are in stimulating environments, in contrast to the boring environments for the posttest measures. Assume now that there is a greater proportion of people who become bored easily in the no-gift group as compared to the $1,000-gift group. That is, there is a selection effect operating that results in different kinds of people in one group compared to the other. But this difference doesn't manifest itself until a nonstimulating environment triggers the maturational process that generates increasingly higher levels of boredom. The differential rates at which boredom occurs in the two groups result in higher levels of boredom and corresponding unhappiness in the no-gift as compared to the $1,000-gift group. In other words, the group receiving the $1,000 gift will display higher levels of posttest happiness than the no-gift group, even if supplementary income has no effect on happiness.

The multiple time series design incorporates aspects of both the nonequivalent control group and the time series designs. Table 1 (f) illustrates the results of this combination. By extending the number of pretest and posttest observations found in the nonequivalent control group design, the multiple time series design can detect selection-maturation interactions. For instance, if differential reactions to boredom explain why the group receiving the $1,000 gift has higher happiness than the no-gift group, then we should expect to see these differences in at least some of the additional pretest measures (assuming that some of these additional group comparisons occur in nonstimulating environments). We would also expect the differential reaction to boredom to manifest itself in the additional posttest measures of the multiple time series design. That is, whenever researchers take posttest measures in stimulating environments, they should observe no group differences. Conversely, whenever researchers take posttest measures in nonstimulating environments, they should observe higher happiness among the group receiving the $1,000 gift.

Furthermore, by adding a second group to the original, single-group time series, the multiple time series reduces the threat of history that is a major problem with the single-group design. Events (e.g., national holidays) that coincide with the manipulation of the independent variable (e.g., $1,000 gift versus no gift) should have equal impacts on each group in the analysis.

By incorporating multiple groups with pretests and posttests, the multiple time series and nonequivalent control group designs can be effective at reducing a long list of internal validity threats; but in order to actually reduce a number of these threats, researchers must assume that the different treatment groups are functionally equivalent prior to manipulating the independent variable. Pretest scores allow researchers to detect, at least in part, whether this condition of equivalence is present; but what if the groups are not initially equivalent? Under these conditions, researchers may attempt to equate the groups through "matching" or other, "statistical" adjustments or controls (e.g., analysis of covariance). However, matching is never an acceptable technique for making groups initially equivalent (see Nunnally 1975; Kidder and Judd 1986). And statistical controls are a better but still less-than-desirable procedure for equating groups at the pretest (see Lord 1967; Dwyer 1983; Rogosa 1985).

In sum, an overview of pre- and quasi-experimental designs using multiple groups indicates the importance of establishing the equivalence of the groups through pretest measures. Further, researchers should try to obtain as many additional observations as possible both before and after manipulating the treatments. When groups are nonequivalent at the outset, it is extremely difficult to discern whether treatment manipulations have a causal effect.

In certain field settings, however, ethical considerations may mandate that groups be nonequivalent at the outset. That is, researchers must assign subjects to certain treatment conditions based on who is most "needy" or "deserving." If the dependent variable (e.g., happiness) is associated with the criterion (e.g., physical health) that determines who is most needy or deserving, then the experimental groups will not display pretest equivalence (e.g., the people with the worst health and hence lowest pretest happiness must be assigned to the group receiving the $1,000 gift).

Fortunately, the regression-discontinuity design (Thistlethwaite and Campbell 1960; Cook and Campbell 1979) often allows researchers to make relatively unambiguous interpretations of treatment effects, even where groups are not initially equivalent. Indeed, evidence indicates that this design, when properly implemented, is equivalent to a randomized experiment in its ability to rule out threats to internal validity (Cook and Shadish 1994; Shadish et al. in preparation).

To continue with the example of income and happiness, researchers may feel compelled to give the $1,000 gift to those individuals with the poorest health. The investigators would therefore begin by developing a scale of "need" in which participants below a certain level of physical health receive the gift and those above this cutting point do not. This physical health scale constitutes the "pseudo"-pretest necessary for the regression-discontinuity design. The usual posttest measures of the dependent variable—happiness—would follow the manipulation of the no-gift versus the $1,000-gift treatment conditions. Researchers would then regress posttest happiness measures on "pretest" measures of physical health. This regression analysis would include the calculation of separate regression lines for (1) those subjects receiving the $1,000 gift and (2) those subjects not receiving it.

Figure 1 provides an illustration of the results using the regression-discontinuity design. (The structure of the design does not appear in Table 1 due to its relative complexity.) If the $1,000 gift has a discernible impact on happiness, a "discontinuity" should appear in the regression lines at the cutting point for "good" versus "poor" health. An essential requirement for the regression-discontinuity design is a clear cutting point that defines an unambiguous criterion (e.g., degree of physical health) by which researchers can assign subjects to the treatment conditions. It is the clarity of the decision criterion, not its content, that is important.
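
The analysis just described is straightforward to sketch. In the illustrative code below (the cutoff, the health-happiness relationship, and the assumed +6-point effect are all fabricated), subjects below a health cutoff receive the gift, separate regression lines are fit on each side of the cutoff, and the treatment effect is estimated as the gap between the lines at the cutting point:

```python
# Minimal regression-discontinuity sketch: assign the $1,000 gift to everyone
# below a health cutoff, fit separate regression lines on each side, and look
# for a "jump" at the cutoff. All numbers are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)

health = rng.uniform(0, 100, 500)      # "pseudo-pretest" assignment score
cutoff = 40.0
gift = health < cutoff                 # poorest health receives the gift

# Posttest happiness: depends on health plus an assumed +6 treatment effect.
happiness = 30 + 0.4 * health + 6.0 * gift + rng.normal(0, 3, 500)

# Separate ordinary least squares lines for the two treatment conditions.
slope1, intercept1 = np.polyfit(health[gift], happiness[gift], 1)
slope0, intercept0 = np.polyfit(health[~gift], happiness[~gift], 1)

# The effect estimate is the discontinuity between the lines at the cutoff.
jump = (slope1 * cutoff + intercept1) - (slope0 * cutoff + intercept0)
print("estimated discontinuity at the cutoff:", round(jump, 2))
```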

An interesting characteristic of the regression-discontinuity design is that it works even if the decision criterion has no effect on the outcome of interest (e.g., happiness). Indeed, as the variable that forms the decision criterion approaches a condition in which it is totally unrelated to the outcome, the decision criterion becomes the functional equivalent of assigning subjects randomly to treatment conditions (Campbell 1975). Even where the criterion is strongly related to the outcome, the regression-discontinuity design, when properly implemented, can still approximate a randomized experiment in reducing threats to internal validity.

There are, however, several threats to internal validity that can occur in using the regression-discontinuity design (hence the use of the above qualifier: "when properly implemented"). One threat emerges when the relationship between the pseudo-pretest measure (e.g., physical health) and the posttest measure (e.g., happiness) does not form a linear pattern. In fact, a curvilinear relationship near the cutting point may be indistinguishable from the discontinuity between the separate regression lines. Moreover, another threat to internal validity arises if the investigators do not strictly adhere to the decision criterion (e.g., if they feel sorry for someone who is close to qualifying for the $1,000 and thus give that person the "benefit of the doubt"). Additionally, if only a few people receive a given treatment condition (e.g., if only a few $1,000 awards can be made, for financial reasons), the location of the regression line may be difficult to estimate with any degree of accuracy for that particular treatment condition. Accordingly, researchers should include relatively large numbers of subjects for all treatment conditions.

In summary, quasi-experimental designs allow researchers to maintain at least some control over how and when they manipulate the independent variable, but researchers lose much control over who receives specific treatment conditions (i.e., the designs do not permit random assignment). Quasi-experimental designs differ in how closely they approximate the power of randomized experiments to make strong causal inferences. As a general rule, the more observations quasi-experimental designs add (i.e., the more O's, as depicted in the diagrams of Table 1), the more the designs are able to reduce threats to internal validity.


NONEXPERIMENTAL DESIGNS

In contrast to quasi-experimental designs, nonexperimental designs do not manipulate the independent variable. Thus, researchers have no control over who falls into which category of the independent variable, over when changes from one category to another occur, or over what content the various levels of the independent variable will contain. Rather than serving as an agent that actively changes (manipulates) the independent variable, the researcher must be content to passively observe (measure) the independent variable as it naturally occurs. Hence, some social scientists also refer to nonexperimental designs as passive-observational designs (Cook and Campbell 1979).

When researchers can only measure rather than manipulate the independent variable, threats to internal validity increase greatly. That is, reverse causal order effects are much more likely to occur. There is also a much greater probability that some "third" variable has a causal effect on both the independent and the dependent variables, therefore producing a spurious relationship between the latter two variables.

To illustrate these points, consider the most widely used research design among sociologists; namely, the static-group comparison design with measured rather than manipulated independent variables—or what we will refer to here as the passive static-group comparison. Table 1 depicts the structure of this nonexperimental, cross-sectional design. Researchers generally combine this design with a "survey" format, in which subjects self-report their scores on the independent and dependent variables (e.g., report their current income and happiness).

Note that the structure of this nonexperimental design is basically the same as the static-group comparison design found in Table 1 (c). To better capture the various levels of naturally occurring income, however, the diagram in Table 1 expands the number of categories for income from two manipulated categories ($1,000 gift versus no gift) to three measured categories (high, medium, and low personal income). Furthermore, whereas the original static-group design manipulated income before measuring happiness, the passive static-group design measures both personal income and happiness at the same time (i.e., when subjects respond to the survey). Consequently, the temporal ordering of the independent and dependent variable is often uncertain.

Indeed, because researchers do not manipulate the independent variable, and because they measure the independent and dependent variables at the same time, the threat of reverse causal order effects is particularly strong in the passive static-group comparison design. In the present example, it is quite possible that the amount of happiness a person displays is a determinant of how much money that person will subsequently make. That is, happiness causes income rather than income causes happiness. What is even more likely is that both causal sequences occur.

Additionally, the passive static-group comparison design is also especially susceptible to the threat that "third" variables will produce a spurious relationship between the independent and dependent variables. For example, it is likely that subjects who fall into different income groupings (high, medium, and low) also differ on a large number of other (selection effect) variables. That is, the different income groups are almost certainly nonequivalent with regard to characteristics of subjects in addition to income. One would expect, for instance, that the higher-income groups have more education, and that education is associated with greater happiness. In other words, there is a causal link between education and income, and between education and happiness. More specifically, higher education should produce greater income, and higher education should also produce greater happiness. Hence, education could produce a spurious association between income and happiness.

As noted earlier, researchers can attempt to equate the various income groups with regard to third variables by making statistical adjustments (i.e., controlling for the effects of the third variables). But this practice is fraught with difficulties (again, see Lord 1967; Dwyer 1983; Rogosa 1985).
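
The next sketch shows what such a statistical adjustment does in the simplest case. The data are generated under the invented assumption that education drives both income and happiness while income itself has no effect: a naive regression of happiness on income then shows a sizable coefficient, and adding education as a control drives that coefficient toward zero:

```python
# Minimal sketch of "controlling for" a third variable (education) by adding
# it as a covariate in a regression. The data-generating model is invented
# to illustrate a spurious income-happiness association.
import numpy as np

rng = np.random.default_rng(5)

education = rng.normal(12, 3, 2000)
income = 10 + 2.5 * education + rng.normal(0, 5, 2000)
happiness = 20 + 1.5 * education + 0.0 * income + rng.normal(0, 4, 2000)

# Naive regression of happiness on income alone: income looks "effective."
slope_naive, _ = np.polyfit(income, happiness, 1)
print("income coefficient, no control:  ", round(slope_naive, 3))

# Adding education as a control exposes the association as spurious.
X = np.column_stack([np.ones_like(income), income, education])
coef, *_ = np.linalg.lstsq(X, happiness, rcond=None)
print("income coefficient, with control:", round(coef[1], 3))
```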

It is especially sobering to realize that a design as weak as the passive static-group comparison is so widely used in sociology. Note, too, that this design is a substantially weaker version of the static-group comparison design that Campbell and Stanley (1963) considered so weak that they labeled it "preexperimental." Fortunately, there are other nonexperimental, longitudinal designs that have more power to make valid causal inferences. Most popular among these designs is the panel design. (For additional examples of passive longitudinal designs, see Rogosa 1988; Eye 1990.)

Table 1 depicts the structure of this longitudinal survey design. It is very similar to the nonequivalent control group design in Table 1 (e). It differs from the quasi-experimental design, however, because the independent variable is measured rather than manipulated, and the independent and dependent variables are measured at the same time.

In its simplest, and weaker, two-wave form (shown in Table 1), the panel design can address at least some threats due to reverse causal order and third variables associated with the independent and dependent variable. (This ability to reduce threats to internal validity is strengthened where investigators include three and preferably four or more waves of measures.) The most powerful versions of the panel design include data analysis using structural equation modeling (SEM) with multiple indicators (see Kessler and Greenberg 1981; Dwyer 1983; and for a more general introduction to SEM, see Kline 1998). Simultaneous equations involve statistical adjustments for reverse causal order and causally linked third variables. Thus, the standard admonishments noted earlier about using statistical control techniques apply here too.
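
As a simple stand-in for the full SEM treatment cited above, the sketch below fits cross-lagged regressions to fabricated two-wave panel data: each time 2 variable is regressed on both time 1 variables, and the two cross-lagged coefficients bear on the direction of causation. This is an illustration under assumed dynamics, not a recommended model:

```python
# Minimal cross-lagged sketch for a two-wave panel design. The dynamics
# (each variable depends on its own past plus the other's past) and all
# coefficients are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(11)
n = 1000

income_t1 = rng.normal(50, 10, n)
happy_t1 = 30 + 0.2 * income_t1 + rng.normal(0, 5, n)

income_t2 = 0.8 * income_t1 + 0.10 * happy_t1 + rng.normal(0, 5, n)
happy_t2 = 0.7 * happy_t1 + 0.15 * income_t1 + rng.normal(0, 5, n)

def ols(y, *xs):
    """Ordinary least squares coefficients: intercept first, then one per x."""
    X = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# The cross-lagged coefficients speak to causal direction: does earlier
# income predict later happiness, earlier happiness predict later income,
# or (as assumed here) both?
print("income_t1 -> happy_t2:", round(ols(happy_t2, happy_t1, income_t1)[2], 3))
print("happy_t1 -> income_t2:", round(ols(income_t2, income_t1, happy_t1)[2], 3))
```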

Finally, Cook and his associates (Cook and Shadish 1994; Shadish et al. in preparation) have noted a real advantage of nonexperimental designs over randomized experiments. Specifically, experimental designs lend themselves to studying causal linkages (i.e., "descriptions" of causal conditions) rather than the processes accounting for these linkages (i.e., explanations of causal effects). In contrast, nonexperimental designs lend themselves to using causal (path) modeling to study "process"—that is, intervening variables that "mediate" the effect of the independent variable on the dependent variable. The field of causal modeling using nonexperimental designs has developed at a tremendous pace, dominated by the very flexible and sophisticated data-analytic procedure of SEM (Kline 1998).

RANDOMIZED EXPERIMENTS REVISITED

The great problems that nonexperimental designs encounter in making causal inferences help illustrate the increased power researchers obtain when they move from these passive-observational to quasi-experimental designs. But no matter how well a quasi-experimental design controls threats to internal validity, there is no quasi-experimental design that can match the ability of randomized experiments to make strong causal inferences. Indeed, threats to internal validity are greatly reduced when researchers are able to randomly assign subjects to treatment groups. Therefore, the value of this simple procedure cannot be overstated.

Consider the simplest and most widely used of all randomized experiments, the posttest-only control group design (Campbell and Stanley 1963), as depicted in Table 1 (a). Note that it is similar in structure to the very weak, preexperimental, static-group comparison design in Table 1 (c). These two designs differ primarily with regard to whether they do or do not use random assignment.

The addition of random assignment buys enormous power to make valid causal inferences. With this procedure, and the manipulation of an independent variable that temporally precedes observation of the dependent variable, reverse causal order effects are impossible. Likewise, with the addition of random assignment, other threats to internal validity present in the static-group comparison design dissipate. Specifically, selection effects are no longer a major threat to internal validity. That is, selection factors—different kinds of people—should appear approximately evenly distributed between categories of the independent variable (e.g., $1,000 gift versus no gift). In other words, the different groups forming the treatment conditions should be roughly equivalent prior to the treatment manipulation. Further, given this equivalence, threats due to selection-maturation interactions are also reduced.

Conversely, given that pretest measures of the dependent variable (e.g., happiness) are absent, mortality effects remain a potential threat to internal validity in the posttest-only control group design. Of course, for mortality effects to occur, different kinds of subjects have to drop out of one experimental group as compared to another. For example, in the study of income on happiness, if certain kinds of subjects (say, those who are unhappy types in general) realize that they are not in the group receiving the $1,000 gift, they may refuse to continue. This situation could make it appear (falsely) that people receiving the $1,000 gift are less happy than those not receiving the gift.

Naturally, the probability that any subjects drop out is an increasing function of how much time passes between manipulation of treatment conditions and posttest measures of the dependent variable. The elapsed time is generally quite short for many studies using the posttest-only control group design. Consequently, in many cases, mortality effects are not a plausible threat.

In sum, one may conclude that random assignment removes, at least in large part, all of the major threats to internal validity. (But see Cook and Campbell 1979, Cook and Shadish 1994, and Shadish et al. in preparation for some additional qualifications.)

Two more points are noteworthy with respect to random assignment. First, it is important to realize that this procedure does not ensure that third variables that might influence the outcome will be evenly distributed between groups in any particular experiment. For instance, random assignment does not ensure that the average level of education will be the same for the group receiving the $1,000 gift as for the group receiving no gift. Rather, random assignment allows researchers to calculate the probability that third variables (such as education) are a plausible alternative explanation for an apparent causal relationship (e.g., between supplementary income and happiness). Researchers are generally willing to accept that a causal relationship between two variables is real if the relationship could have occurred by chance—that is, due to the coincidental operation of third variables—less than one time out of twenty.
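
The "one time out of twenty" logic can be illustrated with a permutation sketch: under the supposition that the treatment does nothing, repeatedly re-randomizing the group labels shows how often a difference as large as the observed one would arise by chance alone. The happiness scores below are hypothetical:

```python
# Minimal permutation sketch of the chance logic behind random assignment.
# The two sets of happiness scores are invented for illustration.
import random
import statistics

random.seed(13)

gift = [63, 58, 71, 66, 60, 69, 64, 67]       # hypothetical posttest scores
no_gift = [55, 61, 52, 58, 54, 57, 60, 53]
observed = statistics.mean(gift) - statistics.mean(no_gift)

pooled = gift + no_gift
extreme = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)                    # re-randomize the group labels
    fake_diff = statistics.mean(pooled[:8]) - statistics.mean(pooled[8:])
    if abs(fake_diff) >= abs(observed):
        extreme += 1

# If this proportion falls below 0.05, chance (the coincidental operation of
# third variables) is implausible by the conventional one-in-twenty standard.
print("proportion of chance differences as large as observed:",
      extreme / trials)
```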

Some researchers add pretests to the posttest-only control group design in order to evaluate the "success" of the random assignment procedure, and to add "statistical power." According to Nunnally (1975), however, the use of a pretest is generally not worth the attendant risks. That is, the pretest may sensitize subjects to the treatment conditions (i.e., create a treatment-testing interaction). In other words, the effect of the independent variable may not occur in other situations where there are no pretest measures. Thus, any gain from pretest information is likely to be offset by this threat to "construct validity." (For an example of a treatment-testing interaction, see the section on threats to construct validity, below).

Second, random assignment of subjects is different from random selection of subjects. Random assignment means that a subject has an equal probability of entry into each treatment condition. Random selection refers to the probability of entry into the study as a whole. The former issue bears on internal validity (i.e., whether observed outcomes are unambiguously due to the treatment manipulation); the latter is an issue of external validity (i.e., the extent to which the results of the study are generalizable).


THREATS TO EXTERNAL VALIDITY

Whereas internal validity refers to whether or not a treatment is effective, external validity refers to the conditions under which the treatment will be effective. That is, to what extent will the (internally valid) causal results of a given study apply to different people and places?

One type of threat to external validity occurs when certain treatments are likely to be most effective on certain kinds of people. For example, researchers might find that a $1,000 gift has a strong effect on happiness among a sample of young adults. Conversely, a study of extremely elderly persons might find that the $1,000 has no effect on happiness (say, because very old people are in general less stimulated by their environment than are younger age groups). Cook and Campbell (1979) label this external validity threat an interaction between selection factors and treatment.

Researchers sometimes mistakenly assume that they can overcome this threat to external validity by randomly selecting persons from the population across which they wish to generalize research findings. Random samples do not, however, provide appropriate tests of whether a given cause-effect finding applies to different kinds of people. Obtaining a random sample of, say, the U.S. population would not, for instance, reproduce the above (hypothetical) finding that a $1,000 gift increases happiness among younger but not older people. Combining younger and older persons in a representative sample would only lead to an averaging or "blending" of the strong effect for youths with the no-effect result for the elderly. In fact, the resulting finding of a "moderate" effect would not be an accurate statement of income's effect on happiness for either the younger or older population.

Including younger and older persons in a random sample would increase external validity only if the researchers knew to provide separate analyses for young and old; age, however, is just one among an almost infinite variety of human characteristics on which researchers might choose to do subsample analyses (e.g., males and females). But if researchers suspected that the treatment might interact with age, then they could simply make sure that their nonrandomly selected, "convenience" sample contained sufficient numbers of both young and elderly persons to permit separate analyses of each age group.
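
A small simulation makes the "blending" problem visible. The numbers below are entirely hypothetical: the gift is assumed to raise happiness by 20 points for young subjects and not at all for elderly subjects.

    import random
    import statistics

    random.seed(7)

    def happiness(age_group, treated):
        """Simulated posttest happiness: the gift helps only the young."""
        base = random.gauss(50, 5)
        return base + (20 if (treated and age_group == "young") else 0)

    data = [(age, treated, happiness(age, treated))
            for age in ("young", "old")
            for treated in (True, False)
            for _ in range(500)]

    def estimated_effect(rows):
        treatment = [h for _, t, h in rows if t]
        control = [h for _, t, h in rows if not t]
        return statistics.mean(treatment) - statistics.mean(control)

    print("pooled sample:", round(estimated_effect(data), 1))   # ~10
    for age in ("young", "old"):
        subgroup = [row for row in data if row[0] == age]
        print(age + " only:", round(estimated_effect(subgroup), 1))
    # Subgroup estimates recover ~20 (young) and ~0 (old).

The pooled estimate of roughly 10 points describes neither subpopulation, whereas the separate subgroup analyses recover both true effects.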

Additionally, threats to external validity occur because certain treatments work best in certain settings. Giving $1,000 to a person at a shopping mall may substantially increase that person's happiness, compared to the same gift given to someone stranded on a desert island with nowhere to spend the money. Cook and Campbell (1979) label this external validity threat an interaction between treatment and setting. Given that quasi-experimental designs are most often located in "real-life," field settings, they are somewhat less susceptible to this threat than are randomized experiments, which most often occur in "artificial," laboratory settings.

Note that threats to external validity are concerned with restricting cause-effect relationships to particular persons or places. Therefore, the best procedure for reducing these restrictions is to replicate the findings on different persons and in different settings—either within a single study or across a series of studies (Cook and Campbell 1979; Cook and Shaddish 1994).

THREATS TO CONSTRUCT VALIDITY

Construct validity refers to the accuracy with which researchers manipulate or measure the intended construct rather than something else (Cook and Campbell 1979; for updates, see Cook and Shaddish 1994 and Shaddish et al. in preparation). Thus, for example, investigators might establish that their manipulation of a variable labeled "income" does indeed have a causal effect on their measure of the outcome labeled "happiness." That is, the researchers have avoided plausible threats to internal validity and, consequently, have presented a convincing claim for a cause-and-effect relationship. Critics might question, however, whether the manipulation labeled "income" and the measure labeled "happiness" do in fact represent the concepts that the investigators claim to have manipulated and measured, respectively.

For instance, in providing supplementary income to selected subjects, researchers might also have manipulated, say, the perception that the researchers really are concerned about the welfare of the subjects. It may be subjects' perceptions of this "caring attitude," and not an increase in "economic well-being," that produced the effect the $1,000 gift had on happiness. In other words, investigators were manipulating a dimension in addition to the economic dimension they intended to manipulate.

Likewise, in asking subjects to answer a questionnaire that purportedly measures "happiness," researchers may not be measuring happiness but rather the degree to which subjects will respond in socially desirable ways (e.g., some subjects will respond honestly to questions asking how depressed they are, and other subjects will hide their depression).

Cook and Campbell (1979) provide an extensive list of threats to construct validity. The description of these threats is rather abstract and complicated. Hence, the following discussion includes only a few concrete examples of potential threats. For a more complete list and discussion, the interested reader should consult the original work by Cook and Campbell, the update to their book (Shaddish et al. in preparation), and other volumes on the construct validity of measures and manipulations (e.g., Costner 1971; Nunnally and Bernstein 1994).

One type of threat to construct validity occurs in research designs that use pretests (e.g., the nonequivalent control group design). Cook and Campbell (1979) label this threat an interaction of testing and treatment. This threat occurs when something about the pretest makes participants more receptive to the treatment manipulation. For example, in the study of income and happiness, the pretest may make salient to participants that "they don't have much to be happy about." This realization may, in turn, make subjects more appreciative and thus especially happy when they later receive a $1,000 gift. In contrast, the $1,000 gift might have had little or no causal impact on happiness if subjects were not so sensitized, that is, were not exposed to a pretest. Accordingly, it is the combination of the pretest and the $1,000 gift that produces an increase in happiness; neither condition alone is sufficient to create the causal effect. Consequently, researchers who use pretests must be cautious in claiming that their findings would apply to situations in which pretests are not present. Because quasi-experimental designs depend on pretest observations to overcome threats to internal validity (i.e., to establish the initial equivalence of the experimental groups), researchers cannot safely eliminate these measures. Thus, to enhance construct validity, researchers should strive to use measures that are as unobtrusive as possible (e.g., have trained observers or other people familiar with a given subject secretly record the subject's level of happiness).
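
The interaction can be expressed as a simulation. In the hypothetical sketch below (all numbers invented), the gift raises happiness only among subjects who were sensitized by a pretest, so an experiment without pretests would estimate a near-zero effect.

    import random
    import statistics

    random.seed(13)

    def posttest_happiness(pretested, treated):
        """Simulated posttest score under a treatment-testing interaction."""
        base = random.gauss(50, 5)
        # The gift helps only when a pretest has made deprivation salient.
        return base + (15 if (pretested and treated) else 0)

    for pretested in (True, False):
        treat = [posttest_happiness(pretested, True) for _ in range(500)]
        ctrl = [posttest_happiness(pretested, False) for _ in range(500)]
        gap = statistics.mean(treat) - statistics.mean(ctrl)
        print(f"pretest={pretested}: estimated treatment effect ~ {gap:.1f}")
    # With a pretest the effect appears (~15 points); without one it
    # vanishes, so the finding would not generalize to unpretested settings.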

Another set of potential threats to construct validity concerns what Campbell and Stanley (1963) describe as reactive arrangements. Cook and Campbell (1979) have subsequently provided more specific labels for these threats: hypothesis-guessing within experimental conditions, evaluation apprehension, and experimenter expectancies (see also Rosenthal and Rosnow 1969). Threats due to reactive arrangements result from participants' knowing they are in a study, and therefore behaving in ways that they might not in more natural circumstances. With regard to this phenomenon, Orne (1962) used the term "demand characteristics" to refer to the totality of cues that affect a subject's response in a research setting, in the sense that certain characteristics "demand" certain behaviors. For instance, subjects receiving the $1,000 gift may guess the hypothesis of the study when they are subsequently asked to respond to questions about their state of happiness. Realizing that the study may be an attempt to show that supplementary income increases happiness, participants may try to be "good subjects" and confirm the experimental hypothesis by providing high scores on the happiness questionnaire. In other words, the treatment manipulation did in fact produce an increase in the assumed measure of "happiness," but the measure was actually capturing participants' willingness to be "good subjects."

A classic example of reactive arrangements is the Hawthorne effect (see Lang 1992 for a more comprehensive review). The Hawthorne effect was named for a series of studies conducted between 1924 and 1932 at the Hawthorne Works of the Western Electric Company near Chicago (Mayo 1933; Roethlisberger and Dickson 1939). Researchers attempted to determine, among other things, the effects of illumination on worker productivity. The results were puzzling. There was no clear relationship between illumination and worker performance. Every change, even seemingly adverse ones in which normal lighting was reduced by 50 to 70 percent, resulted in increased productivity. In addition, productivity often remained high even after workers were returned to their original working conditions. Even more confusing was the fact that not all the studies reported increased productivity. In some studies, depending upon the factors being manipulated, the effect was even reversed, with workers apparently deliberately reducing their output.

The three most common explanations for the Hawthorne effect are: (1) subjects in studies respond to special attention; (2) awareness of being in a study affects subjects' performance; and (3) subjects react to the novelty of certain aspects of the research procedures (Lang 1992). Subsequent research has not supported any of these explanations conclusively (Adair et al. 1989). Nor is there widespread evidence of the Hawthorne effect in either experimental or field settings (Cook and Campbell 1979). What is clear, however, is that employees within organizations are part of social systems that can affect behavior in research settings. Indeed, the Hawthorne studies provided impetus to the development of the new field of "organizational behavior," which has strong links to sociology.

No widely accepted model of the processes involved in subject reactivity presently exists. But to reduce threats to construct validity due to reactive arrangements, researchers may attempt, where feasible, to disguise the experimental hypothesis, use unobtrusive measures and manipulations, and keep both the subject and the person administering the treatments "blind" to who is receiving what treatment conditions. These disguises are generally easier to accomplish in the naturally occurring environments of quasi-experimental designs than in the artificial settings of laboratory experiments. Finally, there are additional, sophisticated structural equation modeling procedures for discerning where reactive arrangements may be present in a study, and for making "statistical adjustments" to correct for the bias that these threats would otherwise introduce (Blalock 1985).


THREATS TO STATISTICAL CONCLUSION VALIDITY

Before researchers can establish whether an independent variable has a causal effect on the dependent variable, they must first establish whether an association between the two variables does or does not exist. Statistical conclusion validity refers to the accuracy with which one makes inferences about an association between two variables—without concern for whether the association is causal or not (Cook and Campbell 1979; Shaddish et al. in preparation). The reader will recall that an association between two variables is the first of three conditions necessary to make a valid causal inference. Thus, statistical conclusion validity is closely linked to internal validity. To put it another way, statistical conclusion validity is a necessary but not sufficient condition for internal validity.
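
As a concrete illustration of this first step, the following sketch uses made-up data to test whether an income-happiness association exists at all, via a simple permutation test on the correlation coefficient; the substantive numbers are arbitrary.

    import random
    import statistics

    random.seed(3)

    income = [random.gauss(40_000, 10_000) for _ in range(100)]
    happiness = [0.0002 * x + random.gauss(0, 4) for x in income]

    def corr(xs, ys):
        """Pearson correlation coefficient."""
        mx, my = statistics.mean(xs), statistics.mean(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = sum((x - mx) ** 2 for x in xs) ** 0.5
        sy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (sx * sy)

    observed = corr(income, happiness)
    shuffled = happiness[:]
    exceed = 0
    for _ in range(2000):
        random.shuffle(shuffled)  # shuffling destroys any real association
        if abs(corr(income, shuffled)) >= abs(observed):
            exceed += 1
    print(f"r = {observed:.2f}, permutation p = {exceed / 2000:.3f}")
    # A small p suggests the association is unlikely to be chance alone,
    # which is necessary, but not sufficient, for a causal claim.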

Threats to statistical conclusion validity concern one of two types of errors: (1) inferring an association where one does not exist (a "Type I error"), or (2) inferring no association where one does exist (a "Type II error"). Researchers' ability to avoid Type II errors depends on the power of a research design to uncover even weak associations, that is, the power to avoid mistakenly dismissing a real association as due to "chance" (as statistically nonsignificant). Type II errors are more likely to occur the lower the probability level that researchers set for accepting an association as statistically significant; the smaller the sample size; the less reliable the measures and manipulations; and the greater the random error introduced by (1) extraneous factors in the research setting that affect the dependent variable and (2) variations among subjects on extraneous factors that affect the dependent variable (Cook and Campbell 1979).

Investigators can reduce Type II errors (false claims of no association) by: (1) setting a higher probability level for accepting an association as being statistically significant (e.g., p < .05 instead of p < .01); (2) increasing the sample size; (3) correcting for unreliability of measures and manipulations (see Costner 1971); (4) selecting measures that have greater reliability (e.g., using a ten-item composite measure of happiness instead of a single-item measure); (5) making treatment manipulations as consistent as possible across occasions of manipulation (e.g., giving each subject the $1,000 gift in precisely the same manner); (6) isolating subjects from extraneous (outside) influences; and (7) controlling for the influence of extraneous subject characteristics (e.g., gender, race, physical health) suspected to impact the dependent variable (Cook and Campbell 1979).
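
Strategy (2), increasing the sample size, is easy to demonstrate by simulation. In this hypothetical sketch, a true 2-point treatment effect on happiness (against a standard deviation of 10) goes undetected most of the time in small experiments but is detected reliably in large ones.

    import random
    import statistics

    random.seed(11)
    TRUE_EFFECT = 2.0   # the gift really raises happiness by 2 points
    SD = 10.0           # person-to-person variability in happiness

    def experiment_detects(n_per_group):
        """Simulate one experiment; return True if a two-sided z-test
        at p < .05 detects the treatment effect."""
        treat = [random.gauss(50 + TRUE_EFFECT, SD) for _ in range(n_per_group)]
        ctrl = [random.gauss(50, SD) for _ in range(n_per_group)]
        se = (statistics.variance(treat) / n_per_group
              + statistics.variance(ctrl) / n_per_group) ** 0.5
        z = (statistics.mean(treat) - statistics.mean(ctrl)) / se
        return abs(z) > 1.96  # critical value for alpha = .05, two-sided

    for n in (25, 100, 400):
        misses = sum(not experiment_detects(n) for _ in range(1000))
        print(f"n per group = {n:3d}: Type II error rate ~ {misses / 1000:.2f}")
    # The miss rate falls from roughly .9 to under .2 as n grows.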

Type I errors (inferring an association where one does not exist) are more likely the higher the probability level that researchers set for accepting an association as being statistically significant, and the more associations a researcher examines in a given study. The latter error occurs because the more associations one includes in a study, the more associations one should find that are statistically significant "just by chance alone." For example, given 100 associations and a probability level of .05, one should on the average find 5 associations that are statistically significant due to chance.
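
A quick simulation illustrates the arithmetic. Under a true null hypothesis, p-values are (approximately) uniformly distributed, so each of 100 tests has a 5 percent chance of appearing significant; the Bonferroni-style correction sketched below (dividing the threshold by the number of tests) is one standard version of the "lower probability level" strategy discussed next, not a procedure from the original text.

    import random

    random.seed(5)
    N_TESTS, ALPHA = 100, 0.05

    # Under the null, each test's p-value is uniform on (0, 1), so a
    # spurious "finding" occurs with probability ALPHA per test.
    p_values = [random.random() for _ in range(N_TESTS)]

    naive = sum(p < ALPHA for p in p_values)
    strict = sum(p < ALPHA / N_TESTS for p in p_values)  # Bonferroni-style

    print(f"{naive} of {N_TESTS} null associations 'significant' at .05")
    print(f"{strict} significant at the stricter {ALPHA / N_TESTS} level")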

Researchers can reduce the threat of Type I errors by setting a lower probability level for statistical significance, particularly when examining many associations between variables. Of course, decreasing Type I errors increases the risk of Type II errors. Hence, one should set lower probability levels in conjunction with obtaining reasonably large samples, the latter strategy offsetting the added risk of Type II errors.

Research designs vary greatly in their ability to implement strategies for reducing threats to statistical conclusion validity. For example, very large sample sizes (say, 500 subjects or more) are generally much easier to obtain for nonexperimental designs than for quasi-experimental or experimental designs. Moreover, experimental designs generally occur in laboratory rather than naturally occurring settings. Thus, it is easier for these designs to control for extraneous factors of the setting (i.e., random influences of the environment). Additionally, experimental designs are generally better able than quasi-experimental designs to standardize the conditions under which treatment manipulations occur.


SUMMARY AND CONCLUSIONS

Quasi-experimental designs offer valuable tools to sociologists conducting field research. This article has reviewed various threats that researchers must overcome when using such designs. In addition, to provide a context in which to evaluate the relative power of quasi-experimental designs to make valid causal inferences, this article also reviewed examples of experimental and nonexperimental designs.

It is important to note that the quasi-experimental designs described here are merely illustrative; they are representative of the types of research designs that sociologists might use in field settings. These designs are not, however, exhaustive of the wide variety of quasi-experimental designs possible. (See Campbell and Stanley 1963, Cook and Campbell 1979, and Shaddish et al. in preparation, for more extensive reviews.) In fact, great flexibility is one of the appealing features of quasi-experimental designs. It is possible literally to combine bits and pieces from different standard designs in order to evaluate validity threats in highly specific or unusual situations. This process highlights the appropriate role of research design as a tool: the specific research topic should dictate the design investigators use. Unfortunately, investigators too often choose a less appropriate design for a given topic simply because it is the design with which they are most familiar. When thoughtfully constructed, however, quasi-experimental designs can provide researchers with the tools they need to explore the wide array of important topics in sociological study.


REFERENCES

Adair, J., D. Sharpe, and C. Huynh 1989 "Hawthorne Control Procedures in Educational Experiments: A Reconsideration of Their Use and Effectiveness." Review of Educational Research 59:215–228.

Blalock, H. M. (ed.) 1985 Causal Models in Panel and Experimental Designs. New York: Aldine.

Campbell, D. T. 1975 "Reforms as Experiments." In M. Guttentag and E. Struening, eds., Handbook of Evaluation Research, vol. 1. Beverly Hills, Calif.: Sage.

——, and J. C. Stanley 1963 "Experimental and Quasi-Experimental Designs for Research on Teaching." In N. L. Gage, ed., Handbook of Research on Teaching. Chicago: Rand McNally.

Cook, T. D., and D. T. Campbell 1979 Quasi-Experimentation: Design and Analysis Issues for Field Settings. Chicago: Rand McNally.

Cook, T. D., and W. R. Shaddish 1994 "Social Experiments: Some Developments over the Past Fifteen Years." In L. W. Porter and M. Rosenzweig, eds., Annual Review of Psychology 45:545–580.

Costner, H. L. 1971 "Utilizing Causal Models to Discover Flaws in Experiments." Sociometry 34:398–410.

Dwyer, J. H. 1983 Statistical Models for the Social and Behavioral Sciences. New York: Oxford University Press.

Eye, A. (ed.) 1990 Statistical Methods in Longitudinal Research, vols. I and II. San Diego, Calif.: Academic.

Kessler, R. C., and D. G. Greenberg 1981 Linear Panel Analysis: Models of Quantitative Change. New York: Academic.

Kidder, L. H., and J. M. Judd 1986 Research Methods in Social Relations, 5th ed. New York: Holt, Rinehart and Winston.

Kline, R. B. 1998 Principles and Practice of Structural Equation Modeling. New York: Guilford.

Lang, E. 1992 "Hawthorne Effect." In E. Borgatta and M. Borgatta, eds., Encyclopedia of Sociology, 1st ed. New York: Macmillan.

Lord, F. M. 1967 "A Paradox in the Interpretation of Group Comparisons." Psychological Bulletin 68:304–305.

Mayo, E. 1933 The Human Problems of an Industrial Civilization. New York: Macmillan.

Nunnally, J. 1975 "The Study of Change in Evaluation Research: Principles Concerning Measurement, Experimental Design, and Analysis." In M. Guttentag and E. Struening, eds., Handbook of Evaluation Research. Beverly Hills, Calif.: Sage.

Nunnally, J. C., and I. H. Bernstein 1994 Psychometric Theory, 3rd ed. New York: McGraw-Hill.

Orne, M. 1962 "On the Social Psychology of the Psychological Experiment: With Particular Reference to Demand Characteristics and Their Implications." American Psychologist 17:776–783.

Roethlisberger, F., and W. Dickson 1939 Management and the Worker. New York: John Wiley.

Rogosa, D. 1985 "Analysis of Reciprocal Effects." In T. Husen and N. Postlethwaite, eds., International Encyclopedia of Education. London: Pergamon Press.

—— 1988 "Myths about Longitudinal Research." In K. W. Schaie, R. T. Campbell, W. Meredith, and S. C. Rawlings, eds., Methodological Issues in Aging Research. New York: Springer.

Rosenthal, R., and R. L. Rosnow 1969 Artifacts in Behavioral Research. New York: Academic.

Shaddish, W. R., T. D. Cook, and D. T. Campbell in preparation Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton-Mifflin.

Thistlethwaite, D. L., and D. T. Campbell 1960 "Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment." Journal of Educational Psychology 51:309–317.


Kyle Kercher

Karl Kosloski