## Experimental Design

I. THE DESIGN OF EXPERIMENTS *William G. Cochran*

II. RESPONSE SURFACES *G. E. P. Box*

III. QUASI-EXPERIMENTAL DESIGN *Donald T. Campbell*

## I. THE DESIGN OF EXPERIMENTS

In scientific research, the word “experiment” often denotes the type of study in which the investigator deliberately introduces certain changes into a process and makes observations or measurements in order to evaluate and compare the effects of different changes. These changes are called the *treatments*. Common examples of treatments are different kinds of stimuli presented to human subjects or animals or different kinds of situations with which the investigator faces them, in order to see how they respond. In exploratory work, the objective may be simply to discover whether the stimuli produce any measurable responses, while at a later stage in research the purpose may be to verify or disprove certain hypotheses that have been put forward about the directions and sizes of the responses to treatments. In applied work, measurement of the size of the response is often important, since this may determine whether a new treatment is practically useful.

A distinction is often made between a controlled experiment and an uncontrolled observational study. In the latter, the investigator does not interfere in the process, except in deciding which phenomena to observe or measure. Suppose that it is desired to assess the effectiveness of a new teaching machine that has been much discussed. An observational study might consist in comparing the achievement of students in those schools that have adopted the new technique with the achievement of students in schools that have not. If the schools that adopt the new technique show higher achievement, the objection may be raised that this increase is not necessarily caused by the machine, as the schools that have tried a new method are likely to be more enterprising and successful and may have students who are more competent and better prepared. Examination of previous records of the schools may support these criticisms. In a proper experiment on the same question, the investigator decides which students are to be taught by the new machine and which by the standard technique. It is his responsibility to ensure that the two techniques are compared on students of equal ability and degree of preparation, so that these criticisms no longer have validity.

The advantage of the proper experiment over the observational study lies in this increased ability to elucidate cause-and-effect relationships. Both types of study can establish associations between a stimulus and a response; but when the investigator is limited to observations, it is hard to find a situation in which there is only one explanation of the association. If the investigator can show by repeated experiments that the same stimulus is always followed by the same response and if he has designed the experiments so that other factors that might produce this response are absent, he is in a much stronger position to claim that the stimulus causes the response. (However, there are many social science fields where true experimentation is not possible and careful observational investigations are the only source of information.) [See, for example, EXPERIMENTAL DESIGN, article on QUASI-EXPERIMENTAL DESIGN; OBSERVATION; SURVEY ANALYSIS.]

Briefly, the principal steps in the planning of a controlled experiment are as follows. The treatments must be selected and defined and must be relevant to the questions originally posed. The *experimental units* to which the treatments are to be applied must be chosen. In the social sciences, the experimental unit is frequently a single animal or human subject. The unit may, however, be a group of subjects, for instance, a class in comparisons of teaching methods. An important point is that the choice of subjects and of the environmental conditions of the experiment determines the range of validity of the results.

The next step is to determine the size of the sample, that is, the number of subjects or of classes. In general, the precision of the experiment increases as the sample size increases, but usually a balance must be struck between the precision desired and the costs involved. The method for allocating treatments to subjects must be specified, as must the detailed conduct of the experiment. Other factors that might influence the outcome must be controlled (by *blocking* or *randomization*, as discussed later) so that they favor each treatment equally. Finally, the responses or criteria by which the treatments will be rated must be defined. These may be simple classifications or measurements on a continuous scale. Like the treatments, the responses must be relevant to the questions originally posed.

**History.** The early history of ideas on the planning of experiments appears to have been but little studied (Boring 1954). Modern concepts of experimental design are due primarily to R. A. Fisher, who developed them from 1919 to 1930 in the planning of agricultural field experiments at the Rothamsted Experimental Station in England. The main features of Fisher’s approach are as follows (*randomization*, *blocking*, and *factorial experimentation* will be discussed later):

- The requirement that an experiment itself furnish a meaningful estimate of the underlying variability to which the measurements of the responses to treatments are subject.
- The use of randomization to provide these estimates of variability.
- The use of blocking in order to balance out known extraneous sources of variation.
- The principle that the statistical analysis of the results is determined by the way in which the experiment was conducted.
- The concept of factorial experimentation, which stresses the advantages of investigating the effects of different factors or variables in a single complex experiment, instead of devoting a separate experiment to each factor.

These ideas were stated very concisely by Fisher in 1925 and 1926 but more completely in 1935.

### Experimental error

**Some sources of experimental error.** A major problem in experimentation is that the responses of the experimental units are influenced by many sources of variation other than the treatments. For example, subjects differ in their ability to perform a task under standard conditions: a treatment that is allotted to an unusually capable group of subjects will appear to do well; the instruments by which the responses are measured may be liable to errors of measurement; both the applied treatment and the environment may lack uniformity from one occasion to another.

In some experiments, the effects of subject-to-subject variation are avoided by giving every treatment to each subject in succession, so that comparisons are made within subjects. Even then, however, learning, fatigue, or delayed consequences of previously applied treatments may influence the response actually measured after a particular treatment.

The primary consequence of extraneous sources of variation, called *experimental errors*, is a masking of the effects of the treatments. The observed difference between the effects of two treatments is the sum of the true difference and a contribution due to these errors. If the errors are large, the experimenter obtains a poor estimate of the true difference; then the experiment is said to be of low precision.

*Bias*. It is useful to distinguish between random error and error due to bias. A bias, or systematic error, affects alike all subjects who receive a specific treatment. Random error varies from subject to subject. In a child growth study in which children were weighed in their clothes, a bias would arise if the final weights of all children receiving one treatment were taken on a cold day, on which heavy clothing was worn, while the children receiving a second treatment were weighed on a mild day, on which lighter clothing was worn. In general, bias cannot be detected in the analysis of the results, so that the conclusions drawn by statistical methods about the true effects of the treatments are misleading.

It follows that constant vigilance against bias is one of the requisites of good experimentation. The devices of randomization and blocking, if used intelligently, do much to guard against bias. Additional precautions are necessary in certain types of experiments. If the measurements are subjective evaluations or clinical judgments, the expectations and prejudices of the judges and subjects may influence the results if it is known which treatment any of the subjects received. Consequently, it is important to ensure, whenever it is feasible, that neither the subject nor the person taking the measurement knows which treatment the subject is receiving; this is called a “double blind” experiment. For example, in experiments that compare different drugs taken as pills, all the pills should look alike and be administered in the same way. If there is a no-drug treatment, it is common practice to administer an inert pill, called a *placebo*, in order to achieve this concealment.

**Methods for reducing experimental error.** Several devices are used to remove or decrease bias and random errors due to extraneous sources of variation that are thought to be substantial. One group of devices may be called refinements of technique. If the response is the skill of the subject in performing an unfamiliar task, a major source of error may be that subjects learn this task at different rates. An obvious precaution is to give each subject enough practice to reach his plateau of skill before starting the experiment. The explanation of the task to the subjects must be clear; otherwise, some subjects may be uncertain what they are supposed to do. Removal from an environment that is noisy and subject to distractions may produce more uniform performance. The tasks assigned to the subjects may be too easy or too hard so that all perform well or poorly under any treatment, making discrimination between the treatments impossible. The reduction of errors in measurement of the response often requires prolonged research. In psychometrics, much of the work on scaling is directed toward finding superior instruments of measurement [*see* SCALING].

*Blocking*. In many experiments involving comparisons between subjects, the investigator knows that the response will vary widely from subject to subject, even under the same treatment. Often it is possible to obtain beforehand a measurement that is a good predictor of the response of the subject. A child’s average score on previous tests in arithmetic may predict well how he will perform on an arithmetic test given at the end of a teaching experiment. Such initial data can be used to increase the precision of the experiment by forming blocks consisting of children of approximately equal ability. If there are three teaching methods, the first block contains the three children with the best initial scores. Each child in this block is assigned to a different teaching method. The second block contains the three next best children, and so on. The purpose of the blocking is to guarantee that each teaching method is tried on an equal number of good, moderate, and poor performers in arithmetic. The resulting gain in precision may be striking.

The term “block” comes from agricultural experimentation in which the block is a compact piece of land. With human subjects, an arrangement of this kind is sometimes called a *matched pairs* design (with two treatments) or a *matched groups* design (with more than two treatments).

A single blocking can help to balance out the effects of several different sources of variation. In a two-treatment experiment on rats, a block comprising littermates of the same sex equalizes the two treatments for age and sex and to some extent for genetic inheritance and weight also. If the conditions of the experiment are subject to uncontrolled time trends, the two rats in a block can be tested at approximately the same time.

*Adjustments in the statistical analysis*. Given an initial predictor, *x*, of the final response, *y*, an alternative to blocking is to make adjustments in the statistical analysis in the hope of removing the influence of variations in *x*. If *x* and *y* represent initial and final scores in a test of some type of skill, the simplest adjustment is to replace *y* by *y* – *x*, the improvement in score, as the measure of response. This change does not always increase precision. The error variance of *y* – *x* for a subject may be written σ^{2}_{y} + σ^{2}_{x} – 2ρσ_{y}σ_{x}, where ρ is the correlation between *y* and *x*. This is less than σ^{2}_{y} only if ρ exceeds σ_{x}/(2σ_{y}).
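The threshold on ρ follows in one step from the variance expression. The short derivation below is a sketch that assumes σ_{x} and σ_{y} are both positive, so that dividing by 2σ_{y}σ_{x} preserves the inequality.

```latex
\begin{aligned}
\sigma_y^2 + \sigma_x^2 - 2\rho\,\sigma_y\sigma_x &< \sigma_y^2 \\
\sigma_x^2 &< 2\rho\,\sigma_y\sigma_x \\
\rho &> \frac{\sigma_x}{2\sigma_y}
\end{aligned}
```

When the initial and final scores are equally variable (σ_{x} = σ_{y}), the improvement score is therefore more precise than *y* only if the correlation exceeds 1/2.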

A more accurate method of adjustment is given by the analysis of covariance. In this approach, the measure of response is *y* – *bx*. The quantity *b*, computed from the results of the experiment, is an estimate of the average change in *y* per unit increase in *x*. The adjustment accords with common sense. If the average *x* value is three units higher for treatment A than for treatment B, and if *b* is found to be 2/3, the adjustment reduces the difference between the average *y* values by two units.
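The arithmetic of the adjustment can be sketched in a few lines. The function names `pooled_slope` and `adjusted_difference` are hypothetical, and estimating *b* as the pooled within-treatment regression slope is the usual textbook choice, assumed here rather than taken from the passage.

```python
def pooled_slope(groups):
    """Pooled within-treatment slope b of y on x.

    `groups` maps a treatment label to a list of (x, y) pairs.
    Sums of squares and products are taken about each treatment's own means.
    """
    sxy = 0.0
    sxx = 0.0
    for pairs in groups.values():
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_y = sum(y for _, y in pairs) / n
        sxy += sum((x - mean_x) * (y - mean_y) for x, y in pairs)
        sxx += sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx


def adjusted_difference(mean_y_a, mean_y_b, mean_x_a, mean_x_b, b):
    """Difference in mean response after removing b times the difference in mean x."""
    return (mean_y_a - mean_y_b) - b * (mean_x_a - mean_x_b)


# The case in the text: x averages three units higher under A, and b = 2/3,
# so the raw difference in y (here an invented 5 units) is reduced by 2.
example = adjusted_difference(20.0, 15.0, 13.0, 10.0, 2 / 3)
```

The means 20, 15, 13, and 10 are invented for illustration; only the three-unit gap in *x* and the slope of 2/3 come from the text.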

If the relation between *y* and *x* is linear, the use of a predictor, *x*, to form blocks gives about the same increase in precision as its use in a covariance analysis. For a more detailed comparison in small experiments, see Cox (1957). Blocking by means of *x* may be superior if the relation between *y* and *x* is not linear. Thus, a covariance adjustment on *x* is helpful mainly when blocking has been used to balance out some other variable or when blocking by means of *x* is, for some reason, not feasible. One disadvantage of the covariance adjustment is that it requires considerable extra computation. A simpler adjustment such as *y* – *x* is sometimes preferred even at some loss of precision.

*Randomization*. Randomization requires the use of a table of random numbers, or an equivalent device, to decide some step in the experiment, most frequently the allotment of treatments to subjects [*see* RANDOM NUMBERS].

Suppose that three treatments, A, B, and C, are to be assigned to 90 subjects without blocking. The subjects are numbered from 1 to 90. In a two-digit column of random numbers, the numbers 01 to 09 represent subjects 1 to 9, respectively; the numbers 10 to 19 represent subjects 10 to 19, respectively, and so on. The numbers from 91 to 99 and the number 00 are ignored. The 30 subjects whose numbers are drawn first from the table are assigned to treatment A, the next 30 to B, and the remaining 30 to C.
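Today the draw from a printed table is usually replaced by a pseudorandom shuffle. The sketch below assumes Python's standard `random` module as the "equivalent device"; the function name `randomize_groups` and the `seed` argument are illustrative choices.

```python
import random


def randomize_groups(n_subjects=90, treatments=("A", "B", "C"), seed=None):
    """Allot equal numbers of subjects to each treatment completely at random."""
    if n_subjects % len(treatments) != 0:
        raise ValueError("number of subjects must be a multiple of the number of treatments")
    rng = random.Random(seed)
    subjects = list(range(1, n_subjects + 1))
    rng.shuffle(subjects)  # the shuffled order stands in for the order of the draw
    per_group = n_subjects // len(treatments)
    return {
        t: sorted(subjects[i * per_group:(i + 1) * per_group])
        for i, t in enumerate(treatments)
    }


allocation = randomize_groups(seed=1)  # first 30 drawn go to A, next 30 to B, rest to C
```

Recording the seed makes the allotment reproducible, which serves the same purpose as noting the page and column of the table that was used.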

In the simplest kind of blocking, the subjects or experimental units are arranged in 30 blocks of three subjects each. One in each block is to receive A, one B, and one C. This decision is made by randomization, numbering the subjects in any block from 1 to 3 and using a single column of random digits for the draw.

Unlike blocking, which attempts to eliminate the effects of an extraneous source of variation, randomization merely ensures that each treatment has an equal chance of being favored or handicapped by the extraneous source. In the blocked experiment above, randomization might assign the best subject in every block to treatment A. The probability that this happens is, however, only 1 in 3^{30}. Whenever possible, blocking should be used for all major sources of variation, randomization being confined to the minor sources. The use of randomization is not limited to the allotment of treatments to subjects. For example, if time trends are suspected at some stage in the experiment, the order in which the subjects within a block are processed may be randomized. Of course, if time trends are likely to be large, blocking should be used for them as well as randomization, as illustrated later in this article by the crossover design.
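Randomization within blocks, and the probability quoted above, can be sketched as follows. The block membership used here (subjects 1 to 3, 4 to 6, and so on) is an assumption made only for illustration; in practice the blocks would be formed from an initial measurement.

```python
import random
from fractions import Fraction


def randomize_within_blocks(blocks, treatments=("A", "B", "C"), seed=None):
    """Within each block, allot every treatment to exactly one subject at random."""
    rng = random.Random(seed)
    plan = []
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # independent random order for each block
        plan.append(dict(zip(block, order)))
    return plan


# 30 blocks of three subjects each (hypothetical numbering).
blocks = [(3 * i + 1, 3 * i + 2, 3 * i + 3) for i in range(30)]
plan = randomize_within_blocks(blocks, seed=0)

# Chance that the best subject in every one of the 30 blocks receives A:
# 1/3 in each block, independently across blocks.
p_best_always_a = Fraction(1, 3) ** 30
```

Because the blocks are randomized independently, the 1-in-3 chance per block multiplies across the 30 blocks to give 1 in 3^{30}.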

In his *Design of Experiments*, Fisher illustrated how the act of randomization often allows the investigator to carry out valid tests for the treatment means without assuming the form of the frequency distribution of the data (1935). The calculations, although tedious in large experiments, enable the experimenter to free himself from the assumptions required in the standard analysis of variance. Indeed, one method of justifying the standard methods for the statistical analysis of experimental results is to show that these methods usually give serviceable approximations to the results of randomization theory [*see* NONPARAMETRIC STATISTICS; *see also* Kempthorne 1952].
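The idea of a randomization test can be sketched for the simplest case of two unblocked groups. This is not Fisher's own example; it is a minimal illustration that enumerates every way the pooled observations could have been split into groups of the observed sizes and asks how often the difference in means is at least as large as the one actually observed. The function name `randomization_test` is hypothetical.

```python
from itertools import combinations


def randomization_test(group_a, group_b):
    """Exact two-sided randomization test for a difference between two means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(indices):
        sum_a = sum(pooled[i] for i in indices)
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(range(n_a)))
    as_extreme = 0
    n_splits = 0
    for indices in combinations(range(len(pooled)), n_a):
        n_splits += 1
        # small tolerance so that ties with the observed value are counted
        if abs(mean_diff(indices)) >= observed - 1e-12:
            as_extreme += 1
    return as_extreme / n_splits


p_value = randomization_test([10, 11, 12], [1, 2, 3])
```

No distributional assumption enters the calculation; its validity rests only on the treatments having been allotted at random. The number of splits grows rapidly with sample size, which is the tedium the text refers to.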

*Size of experiment*. An important practical decision is that affecting the number of subjects or experimental units to be included in an experiment. For comparing a pair of treatments there are two common approaches to this problem. One approach is to specify that the observed difference between the treatment means be correct to within some amount *±d* chosen by the investigator. The other approach is to specify the power of the test of significance of this difference.

Consider first the case in which the response is measured on a continuous scale. If *σ* is the standard deviation per unit of the experimental errors and if each treatment is allotted to *n* units, the standard error of the observed difference between two treatment means is √2 σ/√n for the simpler types of experimental design. Assuming that this difference is approximately normally distributed, the probability that the difference is in error by more than d = 1.96 √2 σ/√n is about 0.05 (from the normal tables). The probability becomes 0.01 if *d* is increased to 2.58 √2 σ/√n. Thus, although there is no finite *n* such that the error is certain to be less than *d*, nevertheless, from the normal tables, a value of *n* can be computed to reduce the probability that the error exceeds *d* to some small quantity α such as 0.05. Taking *α* = 0.05 gives n = 2(1.96)^{2}σ^{2}/d^{2} = 7.7σ^{2}/d^{2} ≈ 8σ^{2}/d^{2}. The value of σ is usually estimated from previous experiments or preliminary work on this experiment.

If the criterion is the proportion of units that fall into some class (for instance, the proportion of subjects who complete a task successfully), the corresponding formula for *n*, with *α* = 0.05, is

$$n = \frac{3.84\,(p_1 q_1 + p_2 q_2)}{d^2}, \qquad q_i = 1 - p_i,$$

where p_{1}, p_{2} are the true proportions of success for the two treatments and *d* is the maximum tolerable error in the observed difference in proportions. Use of this formula requires advance estimates of p_{1} and p_{2}. Fortunately, if these lie between 0.3 and 0.7 the quantity *p*(1 – *p*) varies only between 0.21 and 0.25.
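Both precision-based calculations can be sketched with the standard library. The sketch assumes the normal approximation used in the text: the continuous case sets *d* equal to the normal deviate times √2 σ/√n, and the proportions case replaces 2σ^{2} by p_{1}(1 – p_{1}) + p_{2}(1 – p_{2}), the usual variance of a difference between two independent proportions. The function names are hypothetical.

```python
from statistics import NormalDist


def n_for_precision_continuous(sigma, d, alpha=0.05):
    """Units per treatment so that P(|error in difference| > d) is about alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    return 2 * z ** 2 * sigma ** 2 / d ** 2


def n_for_precision_proportions(p1, p2, d, alpha=0.05):
    """Same criterion when the response is a proportion of successes."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / d ** 2


# With sigma = d the first function returns the multiplier of sigma^2 / d^2,
# about 7.7, which the text rounds up to 8.
multiplier = n_for_precision_continuous(sigma=1.0, d=1.0)
```

In practice the returned value would be rounded up to a whole number of units, and σ or the proportions would come from earlier work, as the text notes.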

The choice of the value of *d* should, of course, depend on the use to be made of the results, but an element of judgment often enters into the decision.

The second approach (specifying the power) is appropriate, for instance, when a new treatment is being compared with a standard treatment and when the investigator intends to discard the new treatment unless the test of significance shows that it is superior to the standard. He does not mind discarding the new treatment if its true superiority is slight. But if the true difference (new – standard) exceeds some amount, Δ, he wants the probability of finding a significant difference to have some high value, β (perhaps 0.95, 0.9, or 0.8).

With continuous data, the required value of *n* is approximately

$$n = \frac{2\,(\xi_{\alpha} + \xi_{1-\beta})^2\,\sigma^2}{\Delta^2},$$

where ξ_{α} is the normal deviate corresponding to the significance level, α, used in the test of significance, and ξ_{1-β} is the normal deviate for a *one-tailed* probability 1 – β.

For instance, if the test of significance is a one-tailed test at the 5% level and β is 0.9, so that ξ_{α} = 1.64 and ξ_{1-β} = 1.28, then *n* = 17σ^{2}/Δ^{2}. The values of Δ, α, and β are chosen by the investigator. With proportions, an approximate formula is

$$n = \frac{2\,(\xi_{\alpha} + \xi_{1-\beta})^2\,\bar{p}\,\bar{q}}{(p_2 - p_1)^2},$$

where p̄ = (p_{1} + p_{2})/2 and q̄ = 1 – p̄ and p_{2} – p_{1} is the size of difference to be detected. One lesson that this formula teaches is that large samples are needed to detect small or moderate differences between two proportions. For instance, with p_{1} = 0.3, p_{2} = 0.4, α = 0.05 (two-tailed), and β = 0.8, the formula gives n = 357 in each sample, or a total of 714 subjects.
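The two worked figures (17σ^{2}/Δ^{2} and 357 per sample) can be reproduced with a short sketch. It assumes the usual normal-theory forms n = 2(ξ_{α} + ξ_{1-β})^{2}σ^{2}/Δ^{2} and n = 2(ξ_{α} + ξ_{1-β})^{2}p̄q̄/(p_{2} – p_{1})^{2}, with β denoting, as in the text, the desired probability of a significant result. Function and argument names are hypothetical.

```python
from statistics import NormalDist


def _deviates(alpha, power, two_tailed):
    """Normal deviates for the significance level and for the desired power."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_tailed else nd.inv_cdf(1 - alpha)
    z_power = nd.inv_cdf(power)  # deviate cutting off an upper tail of 1 - power
    return z_alpha, z_power


def n_for_power_continuous(sigma, delta, alpha=0.05, power=0.9, two_tailed=False):
    """Units per treatment needed to detect a true difference delta."""
    z_alpha, z_power = _deviates(alpha, power, two_tailed)
    return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2


def n_for_power_proportions(p1, p2, alpha=0.05, power=0.8, two_tailed=True):
    """Units per treatment needed to detect the difference p2 - p1."""
    z_alpha, z_power = _deviates(alpha, power, two_tailed)
    p_bar = (p1 + p2) / 2
    return 2 * (z_alpha + z_power) ** 2 * p_bar * (1 - p_bar) / (p2 - p1) ** 2


n_continuous = n_for_power_continuous(sigma=1.0, delta=1.0)   # multiplier of sigma^2 / delta^2
n_proportions = n_for_power_proportions(0.3, 0.4)             # the example in the text
```

Rounded to the nearest whole number, the two results agree with the figures in the text; unrounded deviates give values very slightly above 17 and 357.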

More accurate tables for *n*, with proportions and continuous data, are given in Cochran and Cox (1950) and a fuller discussion of the sample size problem in Cox (1958).

If the investigator is uncertain about the best values to choose for Δ, it is instructive to compute the value of Δ that will be detected, say with probability 80% or 90%, for an experiment of the size that is feasible. Some experiments, especially with proportions, are almost doomed to failure, in the sense that they have little chance of detecting a true difference of the size that a new treatment is likely to produce. It is well to know this before doing the experiment.
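Turning the calculation around gives the smallest difference an experiment of feasible size can be expected to detect. The sketch below assumes the same normal approximation as the sample-size formula for continuous data, solved for Δ; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def detectable_difference(sigma, n, alpha=0.05, power=0.8, two_tailed=True):
    """True difference detected with the given probability using n units per treatment."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_tailed else nd.inv_cdf(1 - alpha)
    z_power = nd.inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(2 / n)


# Example: how the detectable difference (in units of sigma) shrinks as n grows.
for n in (10, 25, 100):
    print(n, round(detectable_difference(sigma=1.0, n=n), 2))
```

If the value returned is much larger than any improvement the new treatment could plausibly produce, the experiment is of the kind the text describes as almost doomed to failure.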

*Controls*. Some experiments require a *control*, or comparison, treatment. For a discussion of the different meanings of the word “control” and an account of the history of this device, see Boring (1954). In a group of families having a prepaid medical care plan, it is proposed to examine the effects of providing, over a period of time, additional free psychiatric consultation. An intensive initial study is made of the mental health and social adjustment of the families who are to receive this extra service, followed by a similar inventory at the end. In order to appraise whether the differences (final – initial) can be attributed to the psychiatric guidance, it is necessary to include a control group of families, measured at the beginning and at the end, who do not receive this service. An argument might also be made for a second control group that does not receive the service and is measured only at the end. The reason is that the initial psychiatric appraisal may cause some families in the first control group to seek psychiatric guidance on their own, thus diluting the treatment effect that is to be studied. Whether such disturbances are important enough to warrant a second control is usually a matter of judgment.

The families in the control groups, like those in the treated group, must be selected by randomization from the total set of families available for the experiment. This type of evaluatory study presents other problems. It is difficult to conceal the treatment group to which a family belongs from the research workers who make the final measurements, so that any preconceptions of these workers may vitiate the results. Second, the exact nature of the extra psychiatric guidance can only be discovered as the experiment proceeds. It is important to keep detailed records of the services rendered and of the persons to whom they were given.

### Factorial experimentation

In many programs of research, the investigator intends to examine the effects of several different types of variables on some response (for example, in an experiment on the accuracy of tracking, the effect of speed of the object, the type of motion of the object, and the type of handle used by the human tracker). In factorial designs, these variables are investigated simultaneously in the same experiment. The advantages of this approach are that it makes economical use of resources and provides convenient data for studying the interrelationships of the effects of different variables.

These points may be illustrated by an experiment with three factors or variables, A, B, and C, each at two levels (that is, two speeds of the object, etc.). Denote the two levels of A by a_{1} and a_{2} and similarly for B and C. The treatments consist of all possible combinations of the levels of the factors. There are eight combinations:

| | | | |
|---|---|---|---|
| (1) a_{1}b_{1}c_{1} | (3) a_{1}b_{2}c_{1} | (5) a_{1}b_{1}c_{2} | (7) a_{1}b_{2}c_{2} |
| (2) a_{2}b_{1}c_{1} | (4) a_{2}b_{2}c_{1} | (6) a_{2}b_{1}c_{2} | (8) a_{2}b_{2}c_{2} |

Suppose that one observation is taken on each of the eight combinations. What information do these give on factor A? The comparison (2) – (1), that is, the difference between the observations for combinations (2) and (1), is clearly an estimate of the difference in response, a_{2} – a_{1}, since the factors B and C are held fixed at their lower levels. Similarly, (4) – (3) gives an estimate of a_{2} – a_{1}, with B held at its higher level and C at its lower level. The differences (6) – (5) and (8) – (7) supply two further estimates of a_{2} – a_{1}. The average of these four differences provides a comparison of a_{2} with a_{1} based on two samples of size four and is called the *main effect* of A.

Turning to B, it may be verified that (3) – (1), (4) – (2), (7) – (5), and (8) – (6) are four comparisons of b_{2} with b_{1}. Their average is the main effect of B. Similarly, (5) – (1), (6) – (2), (7) – (3), and (8) – (4) provide four comparisons of c_{2} with c_{1}.

Thus the testing of eight treatment combinations in the factorial experiment gives estimates of the effects of each of the factors A, B, and C based on samples of size four. If a separate experiment were devoted to each factor, as in the “one variable at a time” approach, 24 combinations would have to be tested (eight in each experiment) in order to furnish estimates based on samples of size four. The economy in the factorial approach is achieved because every observation contributes information on all factors.
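The computation of the three main effects from the same eight observations can be sketched as follows. The data are invented and deliberately additive (level 2 of A adds 3, of B adds 5, of C subtracts 2), so that the expected answers are obvious; the function name `main_effects` is hypothetical.

```python
from itertools import product


def main_effects(y):
    """Main effect of each factor in a 2 x 2 x 2 factorial.

    `y` maps a tuple of levels (a, b, c), each 1 or 2, to the single
    observation on that combination. Each main effect is the mean of the
    four observations at level 2 minus the mean of the four at level 1,
    which equals the average of the four paired differences.
    """
    effects = {}
    for position, name in enumerate("ABC"):
        high = [v for levels, v in y.items() if levels[position] == 2]
        low = [v for levels, v in y.items() if levels[position] == 1]
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects


# Invented, purely additive responses for the eight treatment combinations.
y = {
    (a, b, c): 10 + 3 * (a == 2) + 5 * (b == 2) - 2 * (c == 2)
    for a, b, c in product((1, 2), repeat=3)
}
effects = main_effects(y)
```

Every one of the eight observations enters all three estimates, which is the source of the economy described above.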

In many areas of research, it is important to study the relations between the effects of different factors. Consider the following question: Is the difference in response between a_{2} and a_{1} affected by the level of B? The comparison

(a_{2}b_{2} – a_{1}b_{2}) – (a_{2}b_{1} – a_{1}b_{1})

where each quantity has been averaged over the two levels of C, measures the difference between the response to A when B is at its higher level and the response to A when B is at its lower level. This quantity might be called the effect of B on the response to A. The same expression rearranged as follows,

(a_{2}b_{2} – a_{2}b_{1}) – (a_{1}b_{2} – a_{1}b_{1}),

also measures the effect of A on the response to B. It is called the AB *two-factor interaction*. (Some writers introduce a multiplier, ½, for conventional reasons.) The AC and BC interactions are computed similarly.

The analysis can be carried further. The AB interaction can be estimated separately for the two levels of C. The difference between these quantities is the effect of C on the AB interaction. The same expression is found to measure the effect of A on the BC interaction and the effect of B on the AC interaction. It is called the ABC *three-factor interaction*.
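The two-factor and three-factor interactions can be computed directly from their definitions. The sketch below follows the convention without the multiplier ½; the data are invented, and the function names are hypothetical.

```python
from itertools import product


def ab_interaction(y, c_levels=(1, 2)):
    """(a2b2 - a1b2) - (a2b1 - a1b1), each cell averaged over the given levels of C."""
    def cell(a, b):
        return sum(y[(a, b, c)] for c in c_levels) / len(c_levels)

    return (cell(2, 2) - cell(1, 2)) - (cell(2, 1) - cell(1, 1))


def abc_interaction(y):
    """AB interaction at the higher level of C minus that at the lower level."""
    return ab_interaction(y, c_levels=(2,)) - ab_interaction(y, c_levels=(1,))


def make_data(f):
    """Build the eight observations from a function of the three levels."""
    return {(a, b, c): f(a, b, c) for a, b, c in product((1, 2), repeat=3)}


# Invented responses: A and B reinforce each other by 4 units, C plays no part.
y_ab = make_data(lambda a, b, c: 10 + 3 * (a == 2) + 5 * (b == 2) + 4 * (a == 2 and b == 2))
# Invented responses: a 6-unit response only when all three factors are at level 2.
y_abc = make_data(lambda a, b, c: 6 * (a == 2 and b == 2 and c == 2))
```

With `y_ab` the AB interaction is 4 and the three-factor interaction is zero; with `y_abc` the AB interaction differs between the two levels of C, so the three-factor interaction is not zero.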

The extent to which different factors exhibit interactions depends mostly on the way in which nature behaves. Absence of interaction implies that the effects of the different factors are mutually additive. In some fields of application, main effects are usually large relative to two-factor interactions, and two-factor interactions are large relative to three-factor interactions, which are often negligible. Sometimes a transformation of the scale in which the data are analyzed removes most of the interactions [*see* STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, *article on* TRANSFORMATIONS OF DATA]. There are, however, many experiments in which the nature and the sizes of the interactions are of primary interest.

The factorial experiment is a powerful weapon for investigating responses affected by many stimuli. The number of levels of a factor is not restricted to two and is often three or four. The chief limitation is that the experiment may become too large and unwieldy to be conducted successfully. Fortunately, the supply of rats and university students is large enough so that factorial experiments are widely used in research on learning, motivation, personality, and human engineering [*see, for example*, TRAITS].

Several developments mitigate this problem of expanding size. If most interactions may safely be assumed to be negligible, good estimates of the main effects and of the interactions considered likely to be important can be obtained from an experiment in which a wisely chosen fraction (say 1/2 or 1/3) of the totality of treatment combinations is tested. The device of *confounding* (see Cochran & Cox 1950, chapter 6, esp. pp. 183–186; Cox 1958, sec. 12.3) enables the investigator to use a relatively small block in order to increase precision, at the expense of a sacrifice of information on certain interactions that are expected to be negligible. If all the factors represent continuous variables (x_{1}, x_{2}, …) and the objective is to map the *response surface* that expresses the response, *η*, as a function of x_{1}, x_{2}, …, then one of the designs specially adapted for this purpose may be used. [*For discussion of these topics, see* EXPERIMENTAL DESIGN, *article on* RESPONSE SURFACES; *see also* Cox 1958; Davies 1954.]

In the remainder of this article, some of the commonest types of experimental design are outlined.

*Randomized groups*. The randomized group arrangement, also called the one-way layout, the simple randomized design, and the completely randomized design, is the simplest type of plan. Treatments are allotted to experimental units at random, as described in the discussion of “Randomization,” above. No blocking is used at any stage of the experiment; and, since any number of treatments and any number of units per treatment may be employed, the design has great flexibility. If mishaps cause certain of the responses to be missing, the statistical analysis is only slightly complicated. Since, however, the design takes no advantage of blocking, it is used primarily when no criteria for blocking are available, when criteria previously used for blocking have proved ineffective, or when the response is not highly variable from unit to unit.

*Randomized blocks*. If there are *v* treatments and the units can be grouped into blocks of size *v*, such that units in the same block are expected to give about the same final response under uniform treatment, then a randomized blocks design is appropriate. Each treatment is allotted at random to one of the units in any block. This design is, in general, more precise than randomized groups and is very extensively used.

Sometimes the blocks are formed by assessing or scoring the subjects on an initial variable related to the final response. It may be of interest to examine whether the comparative effects of the treatments are the same for subjects with high scores as for those with low scores. This can be done by an extension of the analysis of variance appropriate to the randomized blocks design. For example, with four treatments, sixty subjects, and fifteen blocks, the blocks might be classified into three levels, *high*, *medium*, or *low*, there being five blocks in each class. A useful partition of the degrees of freedom (*df*) in the analysis of variance of this “treatments × levels” design is as follows:

| Source of variation | df |
|---|---|
| Between levels | 2 |
| Between blocks at the same level | 12 |
| Treatments | 3 |
| Treatments × levels interactions | 6 |
| Treatments × blocks within levels | 36 |
| Total | 59 |

The mean square for interaction is tested, against the mean square for treatments × blocks within levels, by the usual F-test. Methods for constructing the levels and the problem of testing the overall effects of treatments in different experimental situations are discussed in Lindquist (1953). [*See* LINEAR HYPOTHESES, *article on* ANALYSIS OF VARIANCE.]
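The partition of the degrees of freedom generalizes readily. The sketch below assumes, as in the example, that the blocks divide equally among the levels and that each block contains every treatment once; the function name is hypothetical.

```python
def treatments_by_levels_df(n_treatments, n_blocks, n_levels):
    """Degrees of freedom for a 'treatments x levels' randomized blocks design."""
    if n_blocks % n_levels != 0:
        raise ValueError("blocks must divide equally among the levels")
    blocks_per_level = n_blocks // n_levels
    blocks_within = n_levels * (blocks_per_level - 1)
    df = {
        "Between levels": n_levels - 1,
        "Between blocks at the same level": blocks_within,
        "Treatments": n_treatments - 1,
        "Treatments x levels interactions": (n_treatments - 1) * (n_levels - 1),
        "Treatments x blocks within levels": (n_treatments - 1) * blocks_within,
    }
    total = n_treatments * n_blocks - 1
    # The components must account for every degree of freedom in the data.
    assert sum(df.values()) == total
    df["Total"] = total
    return df


# Four treatments, fifteen blocks, three levels, as in the text.
partition = treatments_by_levels_df(4, 15, 3)
```

The error term for the F-test of the interaction is the last component before the total, treatments × blocks within levels.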

*The crossover design*. The crossover design is suitable for within-subject comparisons in which each subject receives all the treatments in succession. With three treatments, for example, a plan in which every subject receives the treatments in the order ABC is liable to bias if there happen to be systematic differences between the first, second, and third positions, due to time trends, learning, or fatigue. One design that mitigates this difficulty is the following: a third of the subjects, selected at random, get the treatments in the order ABC, a third get BCA, and the remaining third get CAB. The analysis of variance resembles that for randomized blocks except that the sum of squares representing the differences between the over-all means for the three positions is subtracted from the error sum of squares.
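The allotment of subjects to the three orders can be sketched as below. The use of Python's `random` module and the function name `assign_crossover_sequences` are assumptions for illustration; the sequences ABC, BCA, and CAB are those named in the text.

```python
import random


def assign_crossover_sequences(subjects, sequences=("ABC", "BCA", "CAB"), seed=None):
    """Allot equal numbers of subjects, chosen at random, to each treatment order."""
    subjects = list(subjects)
    if len(subjects) % len(sequences) != 0:
        raise ValueError("number of subjects must be a multiple of the number of sequences")
    rng = random.Random(seed)
    rng.shuffle(subjects)
    per_sequence = len(subjects) // len(sequences)
    return {s: sequences[i // per_sequence] for i, s in enumerate(subjects)}


plan = assign_crossover_sequences(range(1, 13), seed=3)  # twelve hypothetical subjects
```

Because each treatment occupies each position in exactly one of the three sequences, position effects such as learning or fatigue fall equally on A, B, and C.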

*The Latin square*. A square array of letters (treatments) such that each letter appears once in every row and column is called a Latin square. The following are two 4×4 squares.

This layout permits simultaneous blocking in two directions. The rows and columns often represent extraneous sources of variation to be balanced out. In an experiment that compared the effects of five types of music programs on the output of factory workers doing a monotonous job, a 5 × 5 Latin square was used. The columns denoted days of the week and the rows denoted weeks. When there are numerous subjects, the design used is frequently a group of Latin squares.

For within-subject comparisons, the possibility of a residual or carry-over effect from one period to the next may be suspected. If such effects are present (and if one conventionally lets columns in the above squares correspond to subjects and rows correspond to order of treatment) then square (1) is bad, since each treatment is always preceded by the same treatment (A by C, etc.). By the use of square (2), in which every treatment is preceded once by each of the other treatments, the residual effects can be estimated and unbiased estimates obtained of the direct effects (see Cochran & Cox 1950, sec. 4.6a; Edwards 1950, pp. 274–275). If there is strong interest in the residual effects, a more suitable design is the *extra-period Latin square*. This is a design like square (2), in which the treatments C, D, A, B in the fourth period are given again in a fifth period.
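The carry-over property can be checked mechanically. The sketch below uses two illustrative 4 × 4 squares chosen for this example; they are not necessarily the article's squares (1) and (2). Rows are periods and columns are subjects, following the convention stated above, and all function names are assumptions.

```python
from collections import Counter


def is_latin_square(square):
    """Every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[j] for row in square} == symbols for j in range(n))
    return len(symbols) == n and rows_ok and cols_ok


def predecessor_counts(square):
    """Count (previous, current) treatment pairs within each subject (column)."""
    counts = Counter()
    for earlier, later in zip(square, square[1:]):
        for previous, current in zip(earlier, later):
            counts[(previous, current)] += 1
    return counts


def is_carryover_balanced(square):
    """Every treatment is preceded exactly once by each other treatment."""
    n = len(square)
    counts = predecessor_counts(square)
    return len(counts) == n * (n - 1) and set(counts.values()) == {1}


cyclic = ["ABCD", "BCDA", "CDAB", "DABC"]    # each treatment always follows the same one
balanced = ["ABCD", "BCDA", "DABC", "CDAB"]  # fourth period is C, D, A, B

print(is_carryover_balanced(cyclic), is_carryover_balanced(balanced))
```

In the cyclic square each subject meets the treatments in the same circular order, so residual effects are completely confounded with direct effects; in the second square all twelve ordered pairs occur once.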

*Balanced incomplete blocks*. When the number of treatments, *v*, exceeds the size of block, *k*, that appears suitable, a balanced incomplete blocks design is often appropriate. In examining the taste preferences of adults for seven flavors of ice cream in a within-subject test, it is likely that a subject can make an accurate comparison among only three flavors before his discrimination becomes insensitive. Thus *v = 7, k= 3*. In a comparison of three methods of teaching high school students, the class may be the experimental unit and the school a suitable block. In a school district, it may be possible to find twelve high schools each having two classes at the appropriate level. Thus *v = 3, k = 2.*

Balanced incomplete blocks (BIB) are an extension of randomized blocks that enable differences among blocks to be eliminated from the experimental errors by simple adjustments performed in the statistical analysis. Examples for *v = 7, k = 3* and for *v = 3, k = 2* are as follows (columns are blocks):

The basic property of the design is that each pair of treatments occurs together (in the same block) equally often.

In both plans shown, it happens that each row contains every treatment. This is not generally true of BIB designs, but this extra property can sometimes be used to advantage. With *v = 7*, for instance, if the row specifies the order in which the types of ice cream are tasted, the experiment is also balanced against any consistent order effect. This extension of the BIB is known as an *incomplete Latin square* or a *Youden square*. In the high schools experiment, the plan for *v* = 3 would be repeated four times, since there are twelve schools.
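The defining balance property, and the extra row property of a Youden square, can both be verified by counting. The plans below are standard textbook constructions assumed for illustration (the *v* = 7 plan is built cyclically from the block 1, 2, 4); they may differ from the plans printed with the article, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_concurrences(blocks):
    """How often each unordered pair of treatments shares a block."""
    counts = Counter()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    return counts


def is_balanced_incomplete(blocks, v):
    """Equal block size k < v and every pair of the v treatments concurs equally often."""
    sizes = {len(set(block)) for block in blocks}
    counts = pair_concurrences(blocks)
    return (len(sizes) == 1 and max(sizes) < v
            and len(counts) == v * (v - 1) // 2
            and len(set(counts.values())) == 1)


def rows_are_complete(blocks, v):
    """Youden property: each within-block position contains every treatment."""
    k = len(blocks[0])
    return all({block[i] for block in blocks} == set(range(1, v + 1)) for i in range(k))


# v = 7, k = 3: develop the initial block (1, 2, 4) cyclically modulo 7.
blocks7 = [tuple((t + shift - 1) % 7 + 1 for t in (1, 2, 4)) for shift in range(7)]
# v = 3, k = 2: all pairs, repeated four times for the twelve schools.
blocks3 = [(1, 2), (2, 3), (3, 1)]
schools = blocks3 * 4
```

For `blocks7` each of the 21 pairs concurs once; repeating `blocks3` four times raises the common concurrence from one to four without disturbing the balance.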

**Comparisons between and within subjects.** Certain factorial experiments are conducted so that some comparisons are made within subjects and others are made between subjects. Suppose that the criterion is the performance of the subjects on an easy task, *T*_{1}, and a difficult task, *T*_{2}, each subject attempting both tasks. This part of the experiment is a standard crossover design. Suppose further that these tasks are explained to half the subjects in a discouraging manner, *S*_{1}, and to the other half in a supportive manner, *S*_{2}. It is of interest to discover whether these preliminary suggestions, S, have an effect on performance and whether this effect differs for easy and hard tasks. The basic plan, requiring four subjects, is shown in the first three lines of Table 1, where O denotes the order in which the tasks are performed.

The comparison T_{2} – *T*_{1}, which gives the main effect of T, is shown under the treatments line. This is clearly a within-subject comparison since each subject carries a + and a –. The main effect of suggestion, S_{2} – S_{1} is a between-subject comparison: subjects 3 and 4 carry + signs while subjects 1 and 2 carry – signs. The TS interaction, measured by *T*_{2}*S*_{2} - *T*_{1}*S*_{2} - *T*_{2}*S*_{1} + *T*_{1}*S*_{1}, is seen to be a within-subject comparison.

Since within-subject comparisons are usually more precise than between-subject comparisons, an important property of this design is that it gives relatively high precision on the T and TS effects at the expense of lower precision on S. The design is particularly effective for studying interactions. Sometimes the between-subject factors involve a classification of the subjects. For instance, the subjects might be classified into three levels of anxiety, A, by a preliminary rating, with equal numbers of males and females of each degree included. In this situation, the factorial effects *A, S* (for sex), and AS are between-subject comparisons. Their interactions with T are within-subject comparisons.

The example may present another complication. Subjects who tackle the hard task after doing the easy task may perform better than those who tackle the hard task first. This effect is measured by a TO interaction, shown in the last line in Table 1. Note that the TO interaction turns out to be a between-subject comparison. The same is true of the TSO three-factor interaction.

Table 1

Subject | 1 | 1 | 2 | 2 | 3 | 3 | 4 | 4
---|---|---|---|---|---|---|---|---
Order | O_{1} | O_{2} | O_{1} | O_{2} | O_{1} | O_{2} | O_{1} | O_{2}
Treatment | T_{1}S_{1} | T_{2}S_{1} | T_{2}S_{1} | T_{1}S_{1} | T_{1}S_{2} | T_{2}S_{2} | T_{2}S_{2} | T_{1}S_{2}
T_{2} − T_{1} | − | + | + | − | − | + | + | −
S_{2} − S_{1} | − | − | − | − | + | + | + | +
TS | + | − | − | + | − | + | + | −
TO | + | + | − | − | + | + | − | −

In designs of this type, known in agriculture as *split-plot* designs, separate estimates of error are calculated for between-subject and within-subject comparisons. With 4*n* subjects, the partition of degrees of freedom in the example is shown in Table 2 (if it is also desired to examine the TO and TSO interactions).

Table 2

Source | df
---|---
Between subjects |
S | 1
TO | 1
TSO | 1
Error b | 4(*n* − 1)
Within subjects |
O | 1
T | 1
TS | 1
SO | 1
Error w | 4(*n* − 1)
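Whether a comparison is made within or between subjects can be read off the sign pattern: a within-subject contrast gives each subject one + and one −, while a between-subject contrast gives both of a subject's observations the same sign. The sketch below encodes the basic four-subject plan of Table 1 with the coding O_{1}, T_{1}, S_{1} = −1 and O_{2}, T_{2}, S_{2} = +1; the variable and function names are assumptions introduced for illustration.

```python
import math

# One entry per observation, in the column order of Table 1.
SUBJECT = [1, 1, 2, 2, 3, 3, 4, 4]
ORDER = [-1, +1, -1, +1, -1, +1, -1, +1]        # O1 = -1, O2 = +1
TASK = [-1, +1, +1, -1, -1, +1, +1, -1]         # T1 = -1, T2 = +1
SUGGESTION = [-1, -1, -1, -1, +1, +1, +1, +1]   # S1 = -1, S2 = +1


def interaction(*factors):
    """Interaction contrast: elementwise product of the main-effect signs."""
    return [math.prod(signs) for signs in zip(*factors)]


def comparison_type(contrast):
    """Classify a contrast as a within-subject or between-subject comparison."""
    per_subject = {}
    for subject, sign in zip(SUBJECT, contrast):
        per_subject.setdefault(subject, []).append(sign)
    if all(sum(signs) == 0 for signs in per_subject.values()):
        return "within"
    if all(len(set(signs)) == 1 for signs in per_subject.values()):
        return "between"
    return "mixed"


CONTRASTS = {
    "O": ORDER,
    "T": TASK,
    "S": SUGGESTION,
    "TS": interaction(TASK, SUGGESTION),
    "TO": interaction(TASK, ORDER),
    "SO": interaction(SUGGESTION, ORDER),
    "TSO": interaction(TASK, SUGGESTION, ORDER),
}
classification = {name: comparison_type(signs) for name, signs in CONTRASTS.items()}
print(classification)
```

The classification agrees with the grouping of sources in Table 2: S, TO, and TSO fall between subjects, and O, T, TS, and SO fall within subjects.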

Plans and computing instructions for all the common types of design are given in Cochran and Cox (1950). Lindquist (1953), Edwards (1950), and Winer (1962) are good texts on experimentation in psychology and education.

WILLIAM G. COCHRAN

*[Directly related are the articles under* LINEAR HYPOTHESES.]

## BIBLIOGRAPHY

BORING, EDWIN G. 1954 The Nature and History of Experimental Control. *American Journal of Psychology* 67:573–589.

CAMPBELL, DONALD T.; and STANLEY, J. C. 1963 Experimental and Quasi-experimental Designs for Research on Teaching. Pages 171–246 in Nathaniel L. Gage (editor), *Handbook of Research on Teaching*. Chicago: Rand McNally.

COCHRAN, WILLIAM G.; and COX, GERTRUDE M. (1950) 1957 *Experimental Designs*. 2d ed. New York: Wiley.

COX, D. R. 1957 The Use of a Concomitant Variable in Selecting an Experimental Design. *Biometrika* 44:150–158.

COX, D. R. 1958 *Planning of Experiments*. New York: Wiley.

DAVIES, OWEN L. (editor) (1954) 1956 *The Design and Analysis of Industrial Experiments*. 2d ed., rev. Edinburgh: Oliver & Boyd; New York: Hafner.

EDWARDS, ALLEN (1950) 1960 *Experimental Design in Psychological Research*. Rev. ed. New York: Holt.

FISHER, R. A. (1925) 1958 *Statistical Methods for Research Workers*. 13th ed., rev. New York: Hafner. → Previous editions were also published by Oliver & Boyd.

FISHER, R. A. (1926) 1950 The Arrangement of Field Experiments. Pages 17.502a-17.513 in R. A. Fisher, *Contributions to Mathematical Statistics*. New York: Wiley. → First published in Volume 33 of the *Journal of the Ministry of Agriculture.*

FISHER, R. A. (1935) 1960 *The Design of Experiments*. 7th ed. New York: Hafner; Edinburgh: Oliver & Boyd.

KEMPTHORNE, OSCAR 1952 *The Design and Analysis of Experiments*. New York: Wiley.

LINDQUIST, EVERET F. 1953 *Design and Analysis of Experiments in Psychology and Education*. Boston: Houghton Mifflin.

WINER, B. J. 1962 *Statistical Principles in Experimental Design*. New York: McGraw-Hill.

## II RESPONSE SURFACES

Response surface methodology is a statistical technique for the design and analysis of experiments; it seeks to relate an average response to the values of quantitative variables that affect response. For example, response in a chemical investigation might be yield of sulfuric acid, and the quantitative variables affecting yield might be pressure and temperature of the reaction.

In a psychological experiment, an investigator might want to find out how a test *score* achieved by certain subjects depended upon *duration* of the period during which they studied the relevant material and the *delay* between study and test. In mathematical language, the psychologist is interested in the presumed *functional relationship* η = *f*(ξ_{1}, ξ_{2}) that expresses the *response score*, η, as a function of the two *variables* duration, ξ_{1}, and delay, ξ_{2}. If repeated experiments were made at any fixed set of experimental conditions, the measured response would nevertheless vary because of measurement errors, observational errors, and variability in the experimental material. We regard η therefore as the *mean response* at particular conditions; *y*, the response actually observed in a particular experiment, differs from η because of an (all-inclusive) error *e*. Thus *y* = η + *e*, and a mathematical model relating the observed response to the levels of *k* variables can be written in the form

$$y = f(\xi_1, \xi_2, \ldots, \xi_k) + e. \tag{1}$$

The appropriate investigational strategy depends heavily on the state of ignorance concerning the functional form, *f*. At one extreme the investigator may not know even which variables, ξ, to include and must make a preliminary screening investigation. At the other extreme the true functional form may actually be known or can be deduced from a mechanistic theory.

Response surface methods are appropriate in the intermediate situation; the important variables are known, but the true functional form is neither known nor easily deducible. The general procedure is to approximate *f* locally by a suitable function, such as a polynomial, which acts as a “mathematical French curve.”

**Geometric representation of response relationships.** The three curves of Figure 1A, showing a hypothetical relationship associating test score with study period for three different periods of delay, are shown in Figure 1B as sections of a *response surface*. This surface is represented by its response *contours* in Figure 1C. Figure 1D shows how a third variable may be accommodated by the use of three-dimensional *contour surfaces*.

*Local graduation*. It is usually most convenient to work with coded variables like *x*_{1} = (ξ_{1} − ξ_{1}^{0})/*S*_{1}, *x*_{2} = (ξ_{2} − ξ_{2}^{0})/*S*_{2}, in which ξ_{1}^{0}, ξ_{2}^{0} are the coordinates of the center of a region of current interest and *S*_{1} and *S*_{2} are convenient scale factors.
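As a small illustration of the coding, the sketch below converts between natural and coded units. The center of 70 and scale factor of 10 are the temperature values used in the article's later chemical example; the three temperature levels and the function names are assumptions made for illustration.

```python
def to_coded(natural, center, scale):
    """Coded value x = (xi - xi0) / S."""
    return (natural - center) / scale


def to_natural(coded, center, scale):
    """Inverse of the coding: xi = xi0 + S * x."""
    return center + coded * scale


TEMP_CENTER, TEMP_SCALE = 70.0, 10.0
levels = (60.0, 70.0, 80.0)  # hypothetical design levels in degrees
coded = [to_coded(t, TEMP_CENTER, TEMP_SCALE) for t in levels]
print(coded)
```

With this coding the low, center, and high levels of each variable become −1, 0, and +1, so that fitted coefficients for different variables are directly comparable.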

Let ŷ represent the calculated value of the response obtained by fitting an approximating function by the method of least squares [*see* LINEAR HYPOTHESES, *article on* REGRESSION]. In a region like *R*_{1} in Figure 1C an adequate approximation can be obtained by fitting the first-degree polynomial

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2. \tag{2}$$

The response contours of such a fitted plane are, of course, equally spaced parallel straight lines. In a region like *R*_{2} a fair approximation might be achieved by fitting a second-degree polynomial

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_{11} x_1^2 + b_{22} x_2^2 + b_{12} x_1 x_2. \tag{3}$$

Flexibility of functions like those in (2) and (3) is greatly increased if the possibility is allowed that *y*, *x*_{1}, and *x*_{2} are suitably transformed values of the response and of the variables. For example, it might be appropriate to analyze log score rather than score itself. [*Ways of choosing suitable transformations are described in* STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, *article on* TRANSFORMATIONS OF DATA; *and in* Box & Cox 1964 *and* Box & Tidwell 1962.]

### Uses of response surface methodology

A special pattern of points at which observations are to be made is called an experimental design. In Figure 1C are shown a first-order design in *R*_{1}, suitable for fitting and checking a first-degree polynomial, and a second-order design in *R*_{2}, suitable for fitting and checking a second-degree polynomial. Response surface methodology has been applied (a) to provide a description of how the response is affected by a number of variables over some already chosen region of interest and (b) to study and exploit multiple response relationships and constrained extrema. In drug therapy, for example, the true situation might be as depicted in Figure 2. First-degree approximating functions fitted to *each* of the three responses (η_{1}, therapeutic effect; η_{2}, nausea; and η_{3}, toxicity) could approximately locate the point *P* where maximum therapeutic effect is obtained with nausea and toxicity maintained at the acceptable limits η_{2} = 5, η_{3} = 30. Response surface methodology has also been applied (c) to locate and explore the neighborhood of maximal or minimal response. Because problems in (c) often subsume those in (a) and (b), only this application will be considered in more detail.

**Location and exploration of a maximal region.** Various tactics have been proposed to deal with the problem of finding where the response surface has its maximum or minimum and of describing its shape nearby. Because the appropriateness of a particular tactic usually depends upon factors that are initially unknown, an adaptive strategy of multiple iteration must be employed, that is, the investigator must put himself in a position to learn more about each of a number of uncertainties as he proceeds and to modify tactics accordingly. It is doubtful whether an adaptive strategy could be found that is appropriate to every conceivable response function. One such procedure, which has worked well in chemical applications and which ought to be applicable in some other areas, is as follows: When the initially known experimental conditions are remote from the maximum (a parallel strategy applies in the location of a minimum) rapid progress is often possible by locally fitting a sloping plane and moving in the indicated direction of greatest slope to a region of higher response. This tactic may be repeated until, when the experimental sequence has moved to conditions near the maximizing ones, additional observations are taken and a quadratic (second-order) fit or analysis is made to indicate the approximate shape of the response surface in the region of the maximum.

*An example*. In this example iteration occurs in (A) the amount of replication (to achieve sufficient accuracy), (B) the location of the region of interest, (C) the scaling of the variables, (D) the transformation in which the variables are considered, and (E) the necessary degree of complexity of approximating functions and of the corresponding design. The letters A, B, C, etc., are used parenthetically to indicate the particular type of iteration that is being furthered at any stage. Suppose that, unknown to the experimenter, the true dependence of percentage yield on temperature and concentration is as shown in Figure 3A and the experimental error standard deviation is 1.2 per cent.

*A first-degree approximation*. Suppose that five initial duplicate runs made in random order at points labeled 1, 2, 3, 4, and 5 in Figure 3B yield the results *y*_{1} = 24, *y*_{2} = 38, *y*_{3} = 42, *y*_{4} = 42, *y*_{5} = 50, together with a duplicate observation at each point, and that the average yield is computed at each of the five points. At this stage it is convenient to work with the coded variables *x*_{1} = (temp. − 70)/10 and *x*_{2} = (conc. − 42.5)/2.5. Using standard least squares theory, the coefficients *b*_{0}, *b*_{1}, *b*_{2} of equation (2) are then easily estimated from these averages, and the locally best-fitting plane, equation (4), is obtained.

The differences in the duplicate runs provide an estimate *s* = 1.5, with five degrees of freedom, of σ, the underlying standard deviation. The standard errors of *b*_{0}, *b*_{1}, and *b*_{2} are then estimated as 0.5, and no further replication (A) appears necessary to obtain adequate estimation of *y*.

*Checking the fit*. To check the appropriateness of the first-degree equation it would be sensible to look at the size of second-order effects. For reasons of experimental economy a first-order design usually contains points at too few distinct levels to allow separate estimation of all second-order terms. The design may be chosen, however, so as to allow estimates of “specimen” second-order coefficients or combinations thereof. In the present case estimates can be made of *b*_{12} and of *b*_{11} + *b*_{22}. Some inadequacy of the first-degree equation is indicated by these estimates, but this is tentatively ignored because of the dominant magnitude of *b*_{1} and *b*_{2}.

*Steepest ascent*. It is now logical to explore (B) higher temperatures and concentrations. The points 6, 7, and 8 are along a steepest ascent path obtained by changes proportional to *b*_{1} × *S*_{1} = 5.9 × 10° = 59° in temperature and *b*_{2} × *S*_{2} = 7.1 × 2.5% = 17.75% in concentration. Suppose that *y*_{6} = 59, *y*_{7} = 63, and *y*_{8} = 50. Graphical interpolation indicates that the highest yield on this path is between runs 6 and 7, and this is chosen (B) as the center of the new region to be explored.

The path calculated as above is at right angles to contours of the fitted plane when 10-degree units of temperature and 2.5 per cent units of concentration are represented by the same distances. That the experimenter currently regards these units as appropriate is implied by his choice of levels in the design.
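The arithmetic of the path can be sketched as follows. The coefficients 5.9 and 7.1 and the scale factors 10° and 2.5% are those quoted in the text; the function name and the option of rescaling the step so that the first variable changes by a chosen amount are assumptions added for illustration.

```python
def steepest_ascent_increments(coefficients, scales, step_in_first=None):
    """Changes in natural units along the path of steepest ascent.

    In coded units the path moves in proportion to the fitted slopes b_i,
    so in natural units each variable changes in proportion to b_i * S_i.
    If step_in_first is given, the increments are rescaled so that the
    first variable changes by exactly that amount per step.
    """
    natural = [b * s for b, s in zip(coefficients, scales)]
    if step_in_first is None:
        return natural
    factor = step_in_first / natural[0]
    return [factor * change for change in natural]


# Slopes and scale factors quoted in the text (temperature, concentration).
proportional = steepest_ascent_increments([5.9, 7.1], [10.0, 2.5])
per_ten_degrees = steepest_ascent_increments([5.9, 7.1], [10.0, 2.5], step_in_first=10.0)
print(proportional, per_ten_degrees)
```

Under these figures, each 10° rise in temperature along the path is accompanied by a rise of about 3 per cent in concentration. A different choice of scale factors would give a different path, which is why the scaling is itself one of the things to be iterated.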

*Scaling correction*. To correct unsuitable scaling (C) the investigator can adopt the rule that if a variable produces an effect that is small compared with that produced by the other variables, the center level for that variable is moved away from the calculated path and a larger change is made for this variable in the next set of runs. No change of relative scaling is indicated here, but progress up the surface would normally be accompanied by reduction in the sizes of *b*_{1} and *b*_{2}. Also, the checks have already indicated that second-order effects can scarcely be estimated with adequate accuracy in the present scaling. Thus, wider ranges in both variables should be employed in a second design.

*A second-degree approximation*. A widened first-order design at the new location might give *y*_{9} = 50, *y*_{10} = 66, *y*_{11} = 66, *y*_{12} = 63, and *y*_{13} = 52, as in Figure 3C.

Then ŷ = 59.4 + 1.3*x*_{1} − 0.3*x*_{2} is the best-fitting plane, in which *x*_{1} is given by (temp. − 90)/15, *x*_{2} is given by (conc. − 18.75)/3.75, and the estimated standard error of the coefficients is about 0.8. In the new scaling the check quantities are now *b*_{12} = −6.75 ± 0.8 and *b*_{11} + *b*_{22} = −8.25 ± 1.7. It is clear, without this time duplicating the design, that first-order terms no longer dominate, and no worthwhile further progress can be made by ascent methods. To make possible the fitting and checking of a second-degree polynomial (E), five additional observations might be taken, say *y*_{14} = 54, *y*_{15} = 54, *y*_{16} = 57, *y*_{17} = 65, *y*_{18} = 55. The last ten observations now form a second-order design. A second-degree equation of the form (3), referred to below as equation (5), is then fitted to these observations.
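The plane and the two check quantities can be recomputed from the five observations. The sketch assumes a 2 × 2 factorial with one center point and assumes a particular assignment of runs 9 to 13 to the design positions (Figure 3C is not available here); that assignment was chosen because it reproduces the magnitudes quoted in the text. Function and variable names are hypothetical.

```python
def first_order_fit(corner_y, center_y):
    """Least squares plane and second-order checks for a 2x2 factorial plus center points.

    corner_y maps coded points (x1, x2), each coordinate -1 or +1, to responses;
    center_y lists the responses observed at (0, 0).
    """
    all_y = list(corner_y.values()) + list(center_y)
    n_corner = len(corner_y)
    b0 = sum(all_y) / len(all_y)
    b1 = sum(x1 * y for (x1, _), y in corner_y.items()) / n_corner
    b2 = sum(x2 * y for (_, x2), y in corner_y.items()) / n_corner
    # Specimen second-order checks: the interaction, and the pure quadratic
    # terms estimated jointly as (corner mean) - (center mean).
    b12 = sum(x1 * x2 * y for (x1, x2), y in corner_y.items()) / n_corner
    curvature = sum(corner_y.values()) / n_corner - sum(center_y) / len(center_y)
    return {"b0": b0, "b1": b1, "b2": b2, "b12": b12, "b11+b22": curvature}


# Assumed positions of the runs: one of the two yields of 66 is taken as the center point.
corner_y = {(-1, -1): 50, (+1, -1): 66, (-1, +1): 63, (+1, +1): 52}
center_y = [66]
fit = first_order_fit(corner_y, center_y)
print(fit)
```

Under this assignment the slopes come out as 1.25 and −0.25 (1.3 and −0.3 after rounding), and the curvature check is negative, as would be expected near a maximum.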

The design allows a check on the adequacy (E) of the second-degree equation by providing estimates of certain “specimen” combinations of third-order terms.

The estimated standard errors of the linear coefficients *b*_{1} and *b*_{2}, of the quadratic coefficients *b*_{11} and *b*_{22}, and of the interaction coefficient *b*_{12} are, respectively, 0.52, 0.62, and 0.73.

Before an attempt is made to interpret equation (5) there must be some assurance (A) that the change in response it predicts is large compared with the standard error of that prediction. For a design requiring *N* observations and an approximating equation containing *p* constants, the average variance of the *N* calculated responses ŷ is (*p*/*N*)*s*^{2} = (6/12) × 2.1 = 1.1 for this example. The square root (1.0 for this example) gives an “average” standard error for ŷ. This may be compared with the range of the predicted ŷ’s, which is 17.08, the highest predicted value being ŷ_{11} = ŷ_{17} = 65.50 and the lowest ŷ_{0} = 48.42.
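The expression for the average variance is a standard least squares result, sketched here in matrix notation that is not used in the original: *X* denotes the *N* × *p* matrix of the fitted terms, assumed to be of full rank, and *H* the corresponding projection matrix.

$$
\hat{\mathbf{y}} = H\mathbf{y}, \qquad H = X(X'X)^{-1}X', \qquad \operatorname{Var}(\hat{\mathbf{y}}) = \sigma^{2}H,
$$

$$
\frac{1}{N}\sum_{i=1}^{N}\operatorname{Var}(\hat{y}_i) = \frac{\sigma^{2}}{N}\operatorname{tr}(H) = \frac{\sigma^{2}}{N}\operatorname{tr}\!\left[(X'X)^{-1}X'X\right] = \frac{p\,\sigma^{2}}{N}.
$$

Replacing σ² by its estimate *s*² = 2.1 with *p* = 6 and *N* = 12 gives 1.05, which the text rounds to 1.1, and a square root of about 1.0. The result does not depend on how the design points are arranged, only on the number of constants fitted and the number of observations.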

A more precise indication of adequacy may be obtained by an application of the analysis of variance, but a discussion of this is outside the scope of the present account [*see* LINEAR HYPOTHESES, *article on* ANALYSIS OF VARIANCE]. It is to be noted, however, that bare statistical significance of the regression would *not* ensure that the response surface was *estimated* with sufficient accuracy for the interpretation discussed below.

*Interpretation*. Once adequate fit and precision have been obtained, a contour plot of the equation over the region of the experimental design is helpful in interpretation. Especially where there are more than two variables, interpretation is further facilitated by writing the second-degree equation in canonical form (D). In most cases, this means that the center of the quadratic system is chosen as a new origin, and a rotation of axes is performed to eliminate cross-product terms.
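For two variables the canonical reduction can be carried out in closed form. The sketch below takes the coefficients of a second-degree equation of the form ŷ = *b*_{0} + *b*_{1}*x*_{1} + *b*_{2}*x*_{2} + *b*_{11}*x*_{1}² + *b*_{22}*x*_{2}² + *b*_{12}*x*_{1}*x*_{2} and returns the center of the system, the predicted response there, the two canonical coefficients, and the rotation angle. The numerical coefficients are hypothetical values chosen for illustration, not those of the article's fitted equation, and the function name is an assumption.

```python
import math


def canonical_form(b0, b1, b2, b11, b22, b12):
    """Reduce a two-variable quadratic to y = y_s + l1*X1**2 + l2*X2**2."""
    half = b12 / 2.0                      # off-diagonal element of the quadratic form
    det = b11 * b22 - half * half
    if det == 0:
        raise ValueError("no unique stationary point")
    # The center solves b + 2*B*x = 0, with B = [[b11, half], [half, b22]].
    xs1 = -0.5 * (b22 * b1 - half * b2) / det
    xs2 = -0.5 * (b11 * b2 - half * b1) / det
    y_s = b0 + 0.5 * (b1 * xs1 + b2 * xs2)
    # Eigenvalues of B are the coefficients of the squared canonical variables.
    mean = (b11 + b22) / 2.0
    radius = math.hypot((b11 - b22) / 2.0, half)
    eigenvalues = (mean + radius, mean - radius)
    # Angle of the canonical axis belonging to the larger eigenvalue.
    angle = 0.5 * math.atan2(b12, b11 - b22)
    return {"center": (xs1, xs2), "y_center": y_s,
            "eigenvalues": eigenvalues, "angle_radians": angle}


# Hypothetical coefficients for illustration only.
coefficients = dict(b0=60.0, b1=2.0, b2=-1.0, b11=-3.0, b22=-5.0, b12=-2.0)
result = canonical_form(**coefficients)
print(result)
```

Two negative canonical coefficients indicate a maximum at the center, coefficients of opposite sign indicate a saddle, and a coefficient close to zero indicates a ridge along the corresponding canonical axis.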

In a final group of experiments the new canonical axes and scales could be used to position the design. In Figure 3D the design points are chosen so that they roughly follow a contour and make a rather precise final fitting possible.

It might be asked, Why not simply use the twenty or so experimental points to cover the region shown in Figure 3A with some suitable grid in the first place? The answer is that it is not known initially that the region of interest will be in the area covered by that diagram. The “content” of the space to be explored goes up rapidly as the number of dimensions is increased.

*Suitable designs*. From the foregoing discussion it will be clear that the arrangements of experimental points suitable for response surface study should satisfy a number of requirements. Ideally, a response surface design should (1) allow *y(x)* to be estimated throughout the region of interest, R; (2) ensure that ŷ(x) is as “close” as possible to η(x); (3) give good detectability of lack of fit; (4) allow transformations to be fitted; (5) allow experiments to be performed in blocks; (6) allow designs of increasing order to be built up sequentially; (7) provide an internal estimate of error; (8) be insensitive to wild observations; (9) require a minimum number of experimental points; (10) provide patterning of data allowing ready visual appreciation; (11) ensure simplicity of calculation; and (12) behave well when errors occur in settings of the *x’s.*

A variety of designs have been developed, many of which have remarkably good over-all behavior with respect to these requirements. When maximum economy in experimentation is essential, designs that fail to meet certain of these criteria may have to be used at some increased risk of being misled.

G. E. P. BOX

## BIBLIOGRAPHY

ANDERSEN, S. L. 1959 Statistics in the Strategy of Chemical Experimentation. *Chemical Engineering Progress* 55:61–67.

BOX, GEORGE E. P. 1954 The Exploration and Exploitation of Response Surfaces: Some General Considerations and Examples. *Biometrics* 10:16–60.

BOX, GEORGE E. P. 1957 Integration of Techniques in Process Development. Pages 687–702 in American Society for Quality Control, National Convention, Eleventh, *Transactions*. Detroit, Mich.: The Society.

BOX, GEORGE E. P. 1959 Fitting Empirical Data. New York Academy of Sciences, *Annals* 86:792–816.

BOX, GEORGE E. P.; and COX, D. R. 1964 An Analysis of Transformations. *Journal of the Royal Statistical Society* Series B 26:211–252. → Contains eight pages of discussion.

BOX, GEORGE E. P.; and TIDWELL, PAUL W. 1962 Transformations of the Independent Variables. *Technometrics* 4:531–550.

BOX, GEORGE E. P.; and WILSON, K. B. 1951 On the Experimental Attainment of Optimum Conditions. *Journal of the Royal Statistical Society* Series B 13:1–45. → Contains seven pages of discussion.

DAVIES, OWEN L. (editor) (1954) 1956 *The Design and Analysis of Industrial Experiments*. 2d ed., rev. New York: Hafner; London: Oliver & Boyd.

HILL, WILLIAM G.; and HUNTER, WILLIAM G. 1966 A Review of Response Surface Methodology: A Literature Survey. *Technometrics* 8:571–590.

HOTELLING, HAROLD 1941 Experimental Determination of the Maximum of a Function. *Annals of Mathematical Statistics* 12:20–45.

## III QUASI-EXPERIMENTAL DESIGN

The phrase “quasi-experimental design” refers to the application of an experimental mode of analysis and interpretation to bodies of data not meeting the full requirements of experimental control. The circumstances in which it is appropriate are those of experimentation in social settings–including planned interventions such as specific communications, persuasive efforts, changes in conditions and policies, efforts at social remediation, etc.–where complete experimental control may not be possible. Unplanned conditions and events may also be analyzed in this way where an exogenous variable has such discreteness and abruptness as to make appropriate its consideration as an experimental treatment applied at a specific point in time to a specific population. When properly done, when attention is given to the specific implications of the specific weaknesses of the design in question, quasi-experimental analysis can provide a valuable extension of the experimental method.

**History of quasi-experimental design.** While efforts to interpret field data as if they were actually experiments go back much further, the first prominent methodology of this kind in the social sciences was Chapin’s ex post facto experiment (Chapin & Queen 1937; Chapin 1947; Greenwood 1945), although it should be noted that because of the failure to control regression artifacts, this mode of analysis is no longer regarded as acceptable. *The American Soldier* volumes (Stouffer et al. 1949) provide prominent analyses of the effects of specific military experiences, where it is implausible that differences in selection explain the results. Thorndike’s efforts to demonstrate the effects of specific coursework upon other intellectual achievements provide an excellent early model (for example, Thorndike & Woodworth 1901; Thorndike & Ruger 1923). Extensive analysis and review of this literature are provided elsewhere (Campbell 1957; 1963; Campbell & Stanley 1963) and serve as the basis for the present abbreviated presentation.

**True experimentation.** The core requirement of a true experiment lies in the experimenter’s ability to apply experimental treatments in complete independence of the prior states of the materials (persons, etc.) under study. This independence makes resulting differences interpretable as effects of the differences in treatment. In the social sciences the independence of experimental treatment from prior status is assured by randomization in assignments to treatments. Experiments meeting these requirements, and thus representing true experiments, are much more possible in the social sciences than is generally realized. Wherever, for example, the treatments can be applied to individuals or small units, such as precincts or classrooms, without the respondents being aware of experimentation or that other units are getting different treatments, very elegant experimental control can be achieved. An increased acceptance by administrators of randomization as the democratic method of allocating scarce resources (be these new housing, therapy, or fellowships) will make possible field experimentation in many settings. Where innovations are to be introduced throughout a social system and where the introduction cannot, in any event, be simultaneous, a use of randomization in the staging can provide an experimental comparison of the new and the old, using the groups receiving the delayed introduction as controls.
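Randomized staging of this kind amounts to drawing the order of introduction by lot. A minimal sketch, in which the unit labels, the number of waves, and the function name are all hypothetical:

```python
import random


def randomize_staging(units, n_waves, seed=None):
    """Assign each unit at random to an introduction wave (1 = earliest).

    Units in later waves serve as controls for units in earlier waves
    until their own turn comes.
    """
    order = list(units)
    random.Random(seed).shuffle(order)
    # Deal the shuffled units out to the waves in rotation so wave sizes stay balanced.
    return {unit: position % n_waves + 1 for position, unit in enumerate(order)}


classrooms = [f"classroom-{i}" for i in range(1, 13)]  # hypothetical units
waves = randomize_staging(classrooms, n_waves=3, seed=0)
```

Because assignment to waves is independent of the prior state of the units, differences between early and late waves observed before the late waves are treated can be interpreted as effects of the innovation.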

**Validity of quasi-experimental analyses.** Nothing in this article should be interpreted as minimizing the importance of increasing the use of true experimentation. However, where true experimental design with random assignment of persons to treatments is not possible, because of ethical considerations or lack of power, or infeasibility, application of quasi-experimental analysis has much to offer.

The social sciences must do the best they can with the possibilities open to them. Inferences must frequently be made from data obtained under circumstances that do not permit complete control. Too often a scientist trained in experimental method rejects any research in which complete control is lacking. Yet in practice no experiment is perfectly executed, and the practicing scientist overlooks those imperfections that seem to him to offer no plausible rival explanation of the results. In the light of modern philosophies of science, no experiment ever *proves* a theory, it merely *probes* it. Seeming proof results from that condition in which there is no available plausible rival hypothesis to explain the data. The general program of quasi-experimental analysis is to specify and examine those plausible rival explanations of the results that are provided by the uncontrolled variables. A failure to control that does not in fact lend plausibility to a rival interpretation is not regarded as invalidating.

It is well to remember that we do make assured causal inferences in many settings not involving randomization: the earthquake caused the brick building to crumble; the automobile crashing into the telephone pole caused it to break; the language patterns of the older models and mentors caused this child to speak English rather than Kwakiutl; and so forth. While these are all potentially erroneous inferences, they are of the same type as experimental inferences. We are confident that were we to intrude experimentally, we could confirm the causal laws involved. Yet they have been made assuredly by a nonexperimenting observer. This assurance is due to the effective absence of other plausible causes. Consider the inference about the crashing auto and the telephone pole: we rule out combinations of termites and wind because the other implications of these theories do not occur (there are no termite tunnels and debris in the wood, and nearby weather stations have no records of heavy wind). Spontaneous splintering of the pole by happenstance coincident with the auto’s onset does not impress us as a rival, nor would it explain the damage to the car, etc. Analogously in quasi-experimental analysis, tentative causal interpretation of data may be made where the interpretation in question is consistent with the data and where other rival interpretations have been rendered implausible.

**Dimensions of experimental validity.** A set of twelve dimensions, representing frequent threats to validity, has been developed for the evaluation of data as quasi-experiments. These may be regarded as the important classes of frequently plausible rival hypotheses that good research design seeks to rule out. Each will be presented briefly even though not all are employed in the evaluation of the designs used illustratively here.

Fundamental to this listing is a distinction between *internal validity* and *external validity*. Internal validity is the basic minimum without which any experiment is uninterpretable: Did in fact the experimental treatments make a difference in this specific experimental instance? External validity asks the question of *generalizability:* To what populations, settings, treatment variables, and measurement variables can this effect be generalized? Both types of criteria are obviously important, even though they are frequently at odds in that features increasing one may jeopardize the other. While internal validity is the *sine qua non*, and while the question of external validity, like the question of inductive inference, is never completely answerable, the selection of designs strong in both types of validity is obviously our ideal.

*Threats to internal validity*. Relevant to internal validity are eight different classes of extraneous variables that if not controlled in the experimental design might produce effects mistaken for the effect of the experimental treatment. These are the following. (1) *History:* other specific events in addition to the experimental variable occurring between a first and second measurement. (2) *Maturation:* processes within the respondents that are a function of the passage of time per se (not specific to the particular events), including growing older, growing hungrier, growing tireder, and the like. (3) *Testing:* the effects of taking a test a first time upon subjects’ scores in subsequent testing. (4) *Instrumentation:* the effects of changes in the calibration of a measuring instrument or changes in the observers or scorers upon changes in the obtained measurements. (5) *Statistical regression:* operating where groups of subjects have been selected on the basis of their extreme scores. (6) *Selection:* biases resulting in differential recruitment of respondents for the comparison groups. (7) *Experimental mortality:* the differential loss of respondents from the comparison groups. (8) *Selection-maturation interaction:* in certain of the multiple-group quasi-experimental designs, such as the nonequivalent control group design, an interaction of maturation and differential selection is confounded with, that is, might be mistaken for, the effect of the experimental variable [*see* LINEAR HYPOTHESES, *article on* REGRESSION; SAMPLE SURVEYS].

*Threats to external validity*. Factors jeopardizing external validity or *representativeness* are: (1) The *reactive or interaction effect of testing*, in which a pretest might increase or decrease the respondent’s sensitivity or responsiveness to the experimental variable and thus make the results obtained for a pretested population unrepresentative of the effects of the experimental variable for the unpretested universe from which the experimental respondents were selected. (2) *Interaction* effects between *selection* bias and the *experimental variable*. (3) *Reactive effects of experimental arrangements*, which would preclude generalization about the effect of the experimental variable for persons being exposed to it in nonexperimental settings. (4) *Multiple-treatment interference*, a problem wherever multiple treatments are applied to the same respondents, and a particular problem for one-group designs involving equivalent time samples or equivalent materials samples.

**Types of quasi-experimental design.** Some common types of quasi-experimental design will be outlined here.

*One-group pretest–posttest design*. Perhaps the simplest quasi-experimental design is the one-group pretest-posttest design, O_{1} X O_{2} (O represents measurement or observation, X the experimental treatment). This common design patently leaves uncontrolled the threats to internal validity of history, maturation, testing, instrumentation, and, if subjects were selected on the basis of extreme scores on O_{1}, regression. There may be situations in which the investigator could decide that none of these represented plausible rival hypotheses in his setting: A log of other possible change-agents might provide no plausible ones; the measurement in question might be nonreactive (Campbell 1957), the time span too short for maturation, too spaced for fatigue, etc. However, the sources of invalidity are so numerous that a more powerful quasi-experimental design would be preferred. Several of these can be constructed by adding features to this simple one.

*Interrupted time-series design*. The interrupted time-series experiment utilizes a series of measurements providing multiple pretests and posttests, for example:

O_{1} O_{2} O_{3} O_{4} X O_{5} O_{6} O_{7} O_{8}.

If in this series, O_{4}-O_{5} shows a rise greater than found elsewhere, then maturation, testing, and regression are no longer plausible, in that they would predict equal or greater rises for O_{1}-O_{2}, etc. Instrumentation may well be controlled too, although in institutional settings a change of administration policy is often accompanied by a change in record-keeping standards. Observers and participants may be focused on the occurrence of X and may speciously change rating standards, etc. History remains the major threat, although in many settings it would not offer a plausible rival interpretation.

*Multiple time-series design*. If one had available a parallel time series from a group not receiving the experimental treatment, but exposed to the same extraneous sources of influence, and if this control time series failed to show the exceptional jump from O_{4} to O_{5}, then the plausibility of history as a rival interpretation would be greatly reduced. We may call this the multiple time-series design.
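The reasoning behind the two time-series designs can be sketched numerically. The following is only an illustration under stated assumptions: both series are invented numbers, the treatment X is assumed to fall between the fourth and fifth observations, and the function names are hypothetical. It compares the change spanning X with the changes found elsewhere in the series; it is not a formal significance test.

```python
def successive_changes(series):
    """Return the change between each adjacent pair of observations."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def jump_at_treatment(series, n_pre):
    """Compare the change spanning X with the largest change elsewhere.

    n_pre is the number of observations taken before X (an assumption
    of this sketch: X falls between series[n_pre - 1] and series[n_pre]).
    """
    changes = successive_changes(series)
    at_x = changes[n_pre - 1]
    elsewhere = changes[:n_pre - 1] + changes[n_pre:]
    return at_x, max(elsewhere)


# Invented data: eight observations, treatment after the fourth.
treated = [50, 51, 50, 52, 60, 61, 60, 62]
control = [49, 50, 50, 51, 51, 52, 51, 53]

t_jump, t_other = jump_at_treatment(treated, n_pre=4)
c_jump, c_other = jump_at_treatment(control, n_pre=4)

# Maturation, testing, or regression would predict similar rises elsewhere.
exceptional_in_treated = t_jump > t_other
# A shared historical event would produce a comparable jump in the control.
exceptional_in_control = c_jump > c_other

print(t_jump, t_other, exceptional_in_treated)
print(c_jump, c_other, exceptional_in_control)
```

An exceptional jump in the treated series that is absent from the control series is the pattern that makes history less plausible as a rival interpretation.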

*Nonequivalent control group*. Another way of improving the one-group pretest-posttest design is to add a “nonequivalent control group.” (Were the control group to be randomly assigned from the same population as the experimental group, we would, of course, have a true experimental design, not a quasi-experimental design.) Depending on the similarities of setting and attributes, if the nonequivalent control group fails to show the gain manifest in the experimental group, then history, maturation, testing, and instrumentation are controlled. In this popular design, the frequent effort to “correct” for the lack of perfect equivalence by matching on pretest scores is *absolutely wrong* (e.g., Thorndike 1942; Hovland et al. 1949; Campbell & Clayton 1961), because it introduces a regression artifact. Instead, one should accept any initial pretest differences, using analysis of covariance, gain scores, or graphic presentation. (This, of course, is not to reject blocking on pretest scores in true experiments where groups have been assigned to treatments at random.) Remaining uncontrolled is the selection-maturation interaction, that is, the possibility that the experimental group differed from the control group not only in initial level but also in its autonomous maturation rate. In experiments on psychotherapy and on the effects of specific coursework this is a very serious rival. Note that it can be rendered implausible by use of a time series of pretests for both groups, thus moving again to the multiple time-series design.
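The regression artifact produced by matching on pretest scores can be shown with a small simulation. This is a hedged sketch, not taken from the cited studies: the population means, standard deviations, matching band, and function names are all assumptions chosen for illustration. No treatment effect is simulated, so any posttest gap between the matched groups is an artifact.

```python
import random
from statistics import mean


def simulate_group(rng, true_mean, n, true_sd=10.0, error_sd=10.0):
    """Return (pretest, posttest) pairs with NO treatment effect.

    Each score is a stable true score plus independent measurement error.
    """
    pairs = []
    for _ in range(n):
        true_score = rng.gauss(true_mean, true_sd)
        pre = true_score + rng.gauss(0.0, error_sd)
        post = true_score + rng.gauss(0.0, error_sd)
        pairs.append((pre, post))
    return pairs


def match_on_pretest(pairs, low, high):
    """Keep only subjects whose pretest falls in the matching band."""
    return [(pre, post) for pre, post in pairs if low <= pre <= high]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
experimental = simulate_group(rng, true_mean=40.0, n=20000)
control = simulate_group(rng, true_mean=60.0, n=20000)

# "Correct" for nonequivalence by keeping only subjects with similar pretests.
matched_exp = match_on_pretest(experimental, 48.0, 52.0)
matched_ctl = match_on_pretest(control, 48.0, 52.0)

pre_gap = mean(p for p, _ in matched_ctl) - mean(p for p, _ in matched_exp)
post_gap = mean(q for _, q in matched_ctl) - mean(q for _, q in matched_exp)

print(f"pretest gap after matching:  {pre_gap:.2f}")
print(f"posttest gap, no treatment:  {post_gap:.2f}")
```

The matched groups look equivalent on the pretest, yet on the posttest each regresses toward its own population mean, which produces a pseudo-effect in the absence of any treatment.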

*Other quasi-experimental designs*. There is not space here to present adequately even these four quasi-experimental designs, but perhaps the strategy of adding specific observations and analyses to check on specific threats to validity has been illustrated. This is carried to an extreme in the recurrent institutional cycle design (Campbell & McCormack 1957; Campbell & Stanley 1963), in which longitudinal and cross-sectional measurements are combined with still other analyses to assess the impact of indoctrination procedures, etc. through exploiting the fact that essentially similar treatments are being given to new entrants year after year or cycle after cycle. Other quasi-experimental designs are covered in Campbell and Stanley (1963), Campbell and Clayton (1961), Campbell (1963), and Pelz and Andrews (1964).

*Correlational analyses*. Related to the program of quasi-experimental analysis are those efforts to achieve causal inference from correlational data. Note that while correlation does not prove causation, most causal hypotheses imply specific correlations, and examination of these thus probes, tests, or edits the causal hypothesis. Furthermore, as Blalock (1964) and Simon (1947–1956) have emphasized, certain causal models specify uneven patterns of correlation. Thus the A→B→C model implies that r_{AC} be smaller than r_{AB} or r_{BC}. However, their use of partial correlations or the use of Wright’s path analysis (1920) are rejected as tests of the model because of the requirement that the “cause” be totally represented in the “effect.” In the social sciences it will never be plausible that the cause has been measured without unique error and that it also totally lacks unique systematic variance not shared with the effect. More appropriate would be Lawley’s (1940) test of the hypothesis of *single factoredness*. Only if single factoredness can be rejected would the causal model, as represented by its predicted uneven correlation pattern, be the preferred interpretation [*see* MULTIVARIATE ANALYSIS, *articles on* CORRELATION].
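The uneven correlation pattern implied by the A→B→C chain can be made explicit under a simple set of assumptions that belong to this sketch rather than to the article: A, B, and C are standardized (mean 0, variance 1), the chain is linear, and the disturbance in C is uncorrelated with A. Writing r for the correlation coefficients:

```latex
\begin{align*}
C &= r_{BC}\,B + e_C, \qquad \mathrm{E}[A\,e_C] = 0, \\
r_{AC} &= \mathrm{E}[A\,C]
        = r_{BC}\,\mathrm{E}[A\,B] + \mathrm{E}[A\,e_C]
        = r_{AB}\,r_{BC}, \\
|r_{AC}| &= |r_{AB}|\,|r_{BC}| \le \min\bigl(|r_{AB}|,\,|r_{BC}|\bigr).
\end{align*}
```

When both links are imperfect and nonzero, the end-to-end correlation is therefore strictly smaller in absolute value than either link.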

**Tests of significance.** A word needs to be said about tests of significance for quasi-experimental designs. It has been argued by several competent social scientists that since randomization has not been used, tests of significance assuming randomization are not relevant. On the whole, the writer disagrees. However, some aspects of the protest are endorsed: Good experimental design is needed for any comparison inferring change, whether or not tests of significance are used, even if only photographs, graphs, or essays are being compared. In this sense, experimental design is independent of tests of significance. More importantly, tests of significance have mistakenly come to be taken as thoroughgoing *proof*. In vulgar social science usage, finding a “significant difference” is apt to be taken as *proving* the author’s basis for predicting the difference, forgetting the many other plausible rival hypotheses explaining a significant difference that quasi-experimental designs leave uncontrolled. Certainly the valuation of tests of significance in some quarters needs demoting. Further, the use of tests of significance designed for the evaluation of a single comparison becomes much too lenient when dozens, hundreds, or thousands of comparisons have been sifted. And in a similar manner, an experimenter’s decision as to which of his studies is publishable and the editor’s decision as to which of the manuscripts are acceptable further bias the sampling basis. In all of these ways, reform is needed.

However, when a quasi-experimenter has, for example, compared the results from two intact classrooms employed in a sampling of convenience, a chance difference is certainly *one*, even if only one, of the many plausible rival hypotheses that must be considered. If each class had but 5 students, one would interpret the fact that 20 per cent more in the experimental class showed increases with less interest than if each class had 100 students. In this case there is available an elaborate formal theory for the plausible rival hypothesis of chance fluctuation. This theory involves the assumption of randomness, which is quite appropriate when the null model of random association is rejected in favor of a hypothesis of systematic difference between the two groups. If a “significant difference” is found, the test of significance will not, of course, reveal whether the two classes differed because one saw the experimental movie or for some selection reason associated with class topic, time of day, etc., that might have interacted with rate of autonomous change, pretest-instigated changes, reactions to commonly experienced events, etc. But such a test of significance will help rule out what can be considered as a ninth threat to internal validity; that is, that there is no difference here at all that could not be accounted for as a vagary of sampling in terms of a model of purely chance assignment. Note that the statement of probability level is in this light a statement of the plausibility of this one rival hypothesis, which always has some plausibility, however faint.
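The contrast between classes of 5 and classes of 100 can be illustrated with an exact calculation under a model of purely chance assignment. The sketch below uses a one-sided Fisher exact test, which is a choice made here and is not named in the article. The counts (60 per cent versus 40 per cent showing increases) are invented to give a 20-point difference, and the function name is hypothetical.

```python
from math import comb


def fisher_one_sided(a, n1, b, n2):
    """One-sided Fisher exact p-value.

    a of n1 experimental subjects and b of n2 control subjects improved.
    Returns the probability, under purely chance assignment of the
    a + b improvers to the two classes, that the experimental class
    would contain a or more of them.
    """
    improvers = a + b
    total = n1 + n2
    denom = comb(total, n1)
    upper = min(n1, improvers)
    return sum(
        comb(improvers, k) * comb(total - improvers, n1 - k)
        for k in range(a, upper + 1)
    ) / denom


p_small = fisher_one_sided(3, 5, 2, 5)        # 60% vs 40% of 5 students
p_large = fisher_one_sided(60, 100, 40, 100)  # 60% vs 40% of 100 students

print(f"classes of 5:   p = {p_small:.3f}")
print(f"classes of 100: p = {p_large:.4f}")
```

With 5 students per class the chance hypothesis remains entirely plausible (the probability is one half), whereas with 100 per class the same percentage difference leaves it only faintly plausible.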

DONALD T. CAMPBELL

[*Other relevant material may be found in* HYPOTHESIS TESTING; PERSONALITY MEASUREMENT, *article on* SITUATIONAL TESTS; PSYCHOMETRICS; REASONING AND LOGIC; SURVEY ANALYSIS.]

## BIBLIOGRAPHY

BLALOCK, HUBERT M. JR. 1964 *Causal Inferences in Nonexperimental Research*. Chapel Hill: Univ. of North Carolina Press.

CAMPBELL, DONALD T. 1957 Factors Relevant to the Validity of Experiments in Social Settings. *Psychological Bulletin* 54:297–312.

CAMPBELL, DONALD T. 1963 From Description to Experimentation: Interpreting Trends as Quasi-experiments. Pages 212–242 in Chester W. Harris (editor), *Problems in Measuring Change*. Madison: Univ. of Wisconsin Press.

CAMPBELL, DONALD T.; and CLAYTON, K. N. 1961 Avoiding Regression Effects in Panel Studies of Communication Impact. *Studies in Public Communication* 3:99–118.

CAMPBELL, DONALD T.; and McCORMACK, THELMA H. 1957 Military Experience and Attitudes Toward Authority. *American Journal of Sociology* 62:482–490.

CAMPBELL, DONALD T.; and STANLEY, J. C. 1963 Experimental and Quasi-experimental Designs for Research on Teaching. Pages 171–246 in Nathaniel L. Gage (editor), *Handbook of Research on Teaching*. Chicago: Rand McNally.

CHAPIN, FRANCIS S. (1947) 1955 *Experimental Designs in Sociological Research*. Rev. ed. New York: Harper.

CHAPIN, FRANCIS S.; and QUEEN, S. A. 1937 *Research Memorandum on Social Work in the Depression*. New York: Social Science Research Council.

GREENWOOD, ERNEST 1945 *Experimental Sociology: A Study in Method*. New York: Columbia Univ. Press.

HOVLAND, CARL I.; LUMSDAINE, ARTHUR A.; and SHEFFIELD, FREDERICK D. 1949 *Experiments on Mass Communication*. Studies in Social Psychology in World War II, Vol. 3. Princeton Univ. Press.

LAWLEY, D. N. 1940 The Estimation of Factor Loadings by the Method of Maximum Likelihood. Royal Society of Edinburgh, *Proceedings* 60:64–82.

PELZ, DONALD C.; and ANDREWS, F. M. 1964 Detecting Causal Priorities in Panel Study Data. *American Sociological Review* 29:838–848.

SIMON, HERBERT A. (1947–1956) 1957 *Models of Man: Social and Rational; Mathematical Essays on Rational Human Behavior in a Social Setting*. New York: Wiley.

STOUFFER, SAMUEL A. et al. 1949 *The American Soldier*. Studies in Social Psychology in World War II, Vols. 1 and 2. Princeton Univ. Press. → Volume 1: *Adjustment During Army Life*. Volume 2: *Combat and Its Aftermath*.

THORNDIKE, EDWARD L.; and RUGER, G. J. 1923 The Effect of First-year Latin Upon Knowledge of English Words of Latin Derivation. *School and Society* 18:260–270.

THORNDIKE, EDWARD L.; and WOODWORTH, R. S. 1901 The Influence of Improvement in One Mental Function Upon the Efficiency of Other Functions. *Psychological Review* 8:247–261, 384–395, 553–564.

THORNDIKE, R. L. 1942 Regression Fallacies in the Matched Groups Experiment. *Psychometrika* 7:85–102.

WRIGHT, S. 1920 Correlation and Causation. *Journal of Agricultural Research* 20:557–585.

## Experimental Design

Careful and detailed plan of an experiment.

In simple psychological experiments, one characteristic, the independent variable, is manipulated by the experimenter to enable the study of its effects on another characteristic, the **dependent variable**. In many experiments, the **independent variable** is a characteristic that can either be present or absent. In these cases, one group of subjects represents the experimental group, where the independent variable characteristic exists. The other group of subjects represents the **control group**, where the independent variable is absent.

The validity of psychological research relies on sound procedures in which the experimental manipulation of an independent variable can be seen as the sole reason for the differences in behavior in two groups. Research has shown, however, that an experimenter can unknowingly affect the outcome of a study by influencing the behavior of the research participants.

When the goal of an experiment is more complicated, the experimenter must design a procedure that will test the effects of more than one variable. These are called multivariate experiments, and their design requires a sophisticated understanding of statistics and careful planning of the variable manipulations.

When the actual experiment is conducted, subjects are selected according to specifications of the independent and dependent variables. People who participate as research subjects often want to be as helpful as possible and can be very sensitive to subtle cues on the part of the experimenter. As a result, the person may use a small smile or a frown by the experimenter as a cue for future behavior. The subject may be as unaware of this condition, known as experimenter bias, as the experimenter.

Experimenter bias is not limited to research with people. Studies have shown that animals (e.g., laboratory rats) may act differently depending on the expectations of the experimenter. For example, when experimenters expected rats to learn a maze-running task quickly, the rats tended to do so; on the other hand, animals expected not to learn quickly showed slower learning. This difference in learning resulted even when the animals were actually very similar; the experimenter's expectations seemed to play a causal role in producing the differences.

Some of the studies that have examined experimenter bias have been criticized because those studies may have had methodological flaws. Nonetheless, most researchers agree that they need to control for experimenter bias. Some strategies for reducing such bias include automation of research procedures. In this way, an experimenter cannot provide cues to the participant because the procedure is mechanical. Computer-directed experiments can be very useful in reducing this bias.

Another means of eliminating experimenter bias is to create a double-blind procedure in which neither the subject nor the experimenter knows which condition the subject is in. In this way, the experimenter is not able to influence the subject to act in a particular way because the researcher does not know what to expect from that subject.

The results of experiments can also be influenced by characteristics of an experimenter, such as sex, race, ethnicity, or other personal factors. As such, a subject might act in an unnatural way not because of any behavior on the part of the experimenter, but because of the subject's own biases.

## Further Reading

Christensen, Larry B. *Experimental Methodology.* 5th ed. Boston: Allyn and Bacon, 1991.

Elmes, David G. *Research Methods in Psychology.* 4th ed. St. Paul: West Publishing Company, 1992.

Martin, David W. *Doing Psychology Experiments*. 2nd ed. Monterey, CA: Brooks/Cole, 1985.

## experimental design

**experimental design** A system of allocating treatments to experimental units so that the effects of the treatments may be estimated by statistical methods. The basic principles of experimental design are *replication*, i.e. the application of the same treatment to several units, *randomization*, which ensures that each unit has the same probability of receiving any given treatment, and *blocking*, i.e. grouping of similar units, each one to receive a different treatment. *Factorial designs* are used to allow different types of treatment, or *factors*, to be tested simultaneously. Analysis of variance is used to assess the significance of the treatment effects. See also missing observations, fractional replication.
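As a minimal sketch of how replication, randomization, and blocking fit together, the following allocates treatments in a randomized block layout. The unit labels, block labels, treatment names, and the function name are hypothetical, and this is one simple allocation scheme among many, not a general design tool.

```python
import random


def randomized_block_design(blocks, treatments, seed=None):
    """Assign each treatment once within every block, in random order.

    blocks: dict mapping a block label to a list of unit labels; each
    block must hold exactly as many units as there are treatments.
    Returns a dict mapping unit label -> treatment.
    """
    rng = random.Random(seed)
    allocation = {}
    for block, units in blocks.items():
        if len(units) != len(treatments):
            raise ValueError(f"block {block!r} needs {len(treatments)} units")
        shuffled = list(treatments)
        rng.shuffle(shuffled)  # randomization within the block
        for unit, treatment in zip(units, shuffled):
            allocation[unit] = treatment
    return allocation


# Hypothetical layout: three blocks of similar units, three treatments.
blocks = {
    "block1": ["u1", "u2", "u3"],
    "block2": ["u4", "u5", "u6"],
    "block3": ["u7", "u8", "u9"],
}
treatments = ["A", "B", "C"]
allocation = randomized_block_design(blocks, treatments, seed=1)

for unit in sorted(allocation):
    print(unit, allocation[unit])
```

Each treatment is replicated once per block, every unit within a block is equally likely to receive any given treatment, and comparisons between treatments are made among similar units.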