I. NONSAMPLING ERRORS Frederick Mosteller
II. EFFECTS OF ERRORS IN STATISTICAL ASSUMPTIONS Robert M. Elashoff
The view has sometimes been expressed that statisticians have laid such great emphasis on the study of sampling errors (the differences between the observed values of a variable and the long-run average of the observed values in repetitions of the measurement) that they have neglected or encouraged the neglect of other, frequently more important, kinds of error, called nonsampling errors.
Errors in conception, logic, statistics, and arithmetic, or failures in execution and reporting, can reduce a study’s value below zero. The roster of possible troubles seems only to grow with increasing knowledge. By participating in the work of a specific field, one can, in a few years, work up considerable methodological expertise, much of which has not been and is not likely to be written down. To attempt to discuss every way a study can go wrong would be a hopeless venture. The selection of a kind of error for inclusion in this article was guided by its importance, by the extent of research available, by the ability to make positive recommendations, and by my own preferences.
Although the theory of sampling is generally well developed, both the theory and practice of the control of nonsampling errors are in a less satisfactory state, partly because each subject matter, indeed each study, is likely to face yet uncatalogued difficulties. Empirical results of methodological investigations intended to help research workers control nonsampling errors have accumulated slowly, not only because of myriad variables but also because the variables produce results that lack stability from one study to another.
This article deals mainly with techniques for reducing bias. The portions on variability are not exceptions, for they offer ways to avoid underestimating the amount of variability. The presentation deals, first, with the meaning of bias and with conceptual errors; second, with problems of nonsampling errors especially as they arise in the sample survey field through questionnaires, panel studies, nonresponse, and response errors; and, third, with errors occurring in the analysis of nearly any kind of quantitative investigation, errors arising from variability, from technical problems in analysis, in calculations, and in reporting. Some discussions of nonsampling errors restrict themselves to the field of sample surveys, where problems of bias and blunder have been especially studied, but this article also treats some nonsampling errors in experimental and observational studies.
Bias and conceptual errors
Bias and true values. What is bias? Most definitions of bias, or systematic error, assume that for each characteristic to be measured there exists a true value that the investigation ideally would produce. Imagine repeatedly carrying out the actual proposed, rather than the ideal, investigation, getting a value each time for the characteristic under study, and obtaining an average value from these many repetitions. The difference between that average value and the true value of the characteristic is called the bias. The difference between the outcome of one investigation and the true value is the sum of bias and sampling error. The point of averaging over many repetitions is to reduce the sampling error in the average value to a negligible amount. (It is assumed that for the process under study and for the type of average chosen, this reduction is possible.)
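The decomposition just described can be made concrete with a small simulation. The sketch below is illustrative only: the true value, the size of the systematic offset, the normal error model, and all function names are assumptions invented for the example, not figures or methods from any study.

```python
import random
import statistics

def run_investigation(rng, true_value, systematic_offset, noise_sd):
    """One repetition of the actual (not ideal) investigation under a
    hypothetical model: true value + fixed offset + random sampling error."""
    return true_value + systematic_offset + rng.gauss(0.0, noise_sd)

def estimate_bias(true_value, systematic_offset, noise_sd, repetitions, seed=0):
    """Average many repetitions, then subtract the true value.
    Averaging shrinks the sampling error; the offset remains."""
    rng = random.Random(seed)
    outcomes = [
        run_investigation(rng, true_value, systematic_offset, noise_sd)
        for _ in range(repetitions)
    ]
    return statistics.fmean(outcomes) - true_value

# Assumed values: true value 50, systematic offset 3, error s.d. 5.
bias_estimate = estimate_bias(50.0, 3.0, 5.0, repetitions=20000)
print(round(bias_estimate, 2))
```

With many repetitions the printed figure settles near the assumed offset of 3, while any single repetition can miss the true value by much more because it carries both the bias and its own sampling error.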
Is there a true value? The concept “true value” is most touchy, for it assumes that one can describe an ideal investigation for making the measurement. Ease in doing this depends upon the degree of generality of the question. For example, the measurement of “interventionist attitude” in the United States during World War II is discussed below. For such a broad notion, the concept of a true value seems vague, even admitting the possible use of several numbers in the description. It is easier to believe in a true value for the percentage of adults who would respond “Yes” to “Should we go to war now?” Even here the training of the interviewers, the rapidly changing fraction of the population holding given opinions, and the effect of the social class and opinions of the interviewer upon the responses of those interviewed must raise questions about the existence of a true value. At the very least, we wonder whether a true value could represent a time span and how its conditions of measurement could be specified. In designing an ideal sample survey, what kind of interviewer should be used?
Today some scientists believe that true values do not exist separately from the measuring process to be used, and in much of social science this view can be amply supported. The issue is not limited to social science; in physics, complications arise from the different methods of measuring microscopic and macroscopic quantities such as lengths. On the other hand, because it suggests ways of improving measurement methods, the concept of “true value” is useful; since some methods come much nearer to being ideal than others, the better ones can provide substitutes for true values. (See the discussion on describing response error in the section on “Response error,” below.)
To illustrate further the difficulty of the notion of a true value, consider an example from one of the most quantitative social sciences. When the economist assesses the change in value of domestic product, different choices of weights and of base years yield different results. He has no natural or unique choice for these weights and years. He can only try to avoid extremes and unusual situations. While, as noted above, the belief in a true value independent of the measuring instrument must be especially weak in the area of opinion, similar weaknesses beset measures of unemployment, health, housing, or anything else related to the human condition. [See INDEX NUMBERS.]
Conceptual errors. Since the variety of sources of biases is practically unlimited, this article discusses only a few frequently encountered sources.
Target population—sampled population. Often an investigation is carried out on a sample drawn from a population—the sampled population—quite different from that to which the investigator wants to generalize—the target population. This mismatch makes the inference from sample to target population shaky. To match target and sampled population perfectly is usually impossible, but often the expenditure of time and money or the use of special skills or cooperation can patch what cannot be made whole.
Some examples of this process follow: (1) The psychologist wants to establish general laws of learning for all organisms, and especially for man, but he may choose to study only the college sophomore, usually in his own college and rarely outside his own country. His principal alternatives are the rat and the pigeon. Reallocation of time and money may extend the sampled population and bring him closer to the target he has in mind.
(2) The sociologist may want to study the actual organization of trade unions and yet be hard pressed to study in depth more than a single union. This limitation is impossible for an individual to overcome, but cooperative research may help. (For a remarkable cooperative anthropological study of child rearing, see Six Cultures: Studies of Child Rearing [Whiting 1963].)
(3) The historian or political scientist may want to exposit the whole climate of opinion within which an important decision is made, yet he must pick some facts and omit others, emphasize some and not others. The sampling of historical records offers a compromise between scanning everything, which may be impossible or unsatisfactorily superficial, and the case study of a single document or of a small collection.
(4) The man who generalizes on educational methods on the basis of his studies in one class, or one school subject, or one grade, or one school, or one school system, or even one country, needs to consider whether the bases of his investigations should be broadened.
(5) The investigator, especially in studies where he does not regard his investigation as based on a sample, but on a population or census, would be wise to consider what population he hopes his investigation applies to, whether the full breadth of it has had an appropriate chance to contribute cases to his study and, if not, how he might get at the rest. He may be satisfied with describing the population under study, but often he is not.
(6) More narrowly, in sampling the membership of a professional society, the investigator may find his published membership list out of date by some years. For a fee the society may be willing to provide its current mailing list, which is probably as close as one can get to the target population. Obviously, the target population changes even while the study is being performed.
Incompatibility of meaning. While arguing for statistical thinking in the attempt to generalize one’s results, one must not fall into the pit of statistical nonsense. Both anthropologists and historians call attention to mistakes that can come from regarding seemingly like objects, rituals, or behavior in different cultures as exchangeable commodities for statistical purposes. The notion of “father” without distinction between “pater” and “genitor” offers an example. In the Trobriands, a boy lives with his benign, biological father until he is nine or ten years old, then moves to his mother’s brothers’ village for training and discipline, and there he inherits property. In the United States the biological father theoretically plays the role of disciplinarian, and the uncles frequently play benign, indulgent roles.
Pilot studies. Toward the completion of a study, investigators usually feel that it would have been better done in some other way. But a study can be petted and patted so long that, before completion, its value is past. The huge, never-completed study usually damages the investigator’s reputation, however wise the termination. Much can and must be learned by trying, and therefore nearly any investigation requires pilot work. Pilot work is little written about, perhaps because it is hard to summarize and perhaps because the results usually sound so obvious and often would be were they not hidden among thousands of other possible obvious results that did not occur. The whole spectrum from the tightest laboratory experiment to the loosest observational study requires careful pilot work. Pilot studies pinpoint the special difficulties of an investigation and, by encouraging initial action, overcome doctrines of omniscience that require a complete plan before starting. While it is true that the statistician can often give more valuable aid at the planning stage by preventing errors than by salvaging poor work through analysis, firm plans made in the absence of pilot studies are plans for disaster.
Hawthorne effects. Psychologists sadly say that even under the most carefully controlled conditions, laboratory animals do as they please. Humans do even worse. When Roethlisberger and Dickson (1939) carried out their experiments to find conditions that would maximize productivity of factory teams at the Hawthorne Works of Western Electric, they found that every change—increasing the lighting or reducing it, increasing the wage scale or reducing it—seemed to increase the group productivity. Paying attention to people, which occurs in placing them in an experiment, changes their behavior. This rather unpredictable change is called the Hawthorne effect. Instead of trying to eliminate this effect, it has been suggested that all educational efforts should be carried out as portions of experiments, so as to capitalize on the Hawthorne effect. No doubt boredom, even with experimentation, would eventually set in.
The existence of Hawthorne effects seriously restricts the researcher’s ability to isolate variables that change performance in a consistent manner. Although experimenters, by adjusting conditions, may create substantial changes in behavior, what causes the changes may still be a mystery. Reliable repetition of results by different experimenters using different groups can establish results more firmly.
What treatment was applied? In experimental work with humans, it is especially difficult to know whether the treatment administered is the one that the experimenter had in mind. For example, in an unsuccessful learning experiment on the production of words by individuals, subjects in one group were instructed that every word in the class of words that they were seeking contained the same letters of the alphabet. When no differences in learning rates emerged between these subjects and those told nothing about the class of words being sought, further investigations were made. It turned out that few subjects listened to this particular instruction, and among those who did, several forgot it during the early part of the experiment. If a particular instruction is important, special efforts have to be made to ensure that the subject has received and appreciated it.
One approach to the problem of Hawthorne effects uses, in addition to experimental groups, two kinds of control groups: groups who are informed that they are part of an experiment and other groups who are not so informed. As always, the investigator has to be alert about the actual treatment of control and experimental groups. L. L. Thurstone told me about experimenting for the U.S. Army to measure the value of instruction during sleep for training in telegraphy. Thurstone had control squads who were not informed that they were in the study. The sergeants instructing these control squads felt that the “sleep learning” squads were getting favored treatment, and to keep their squads “even,” they secretly instituted additional hours of wide-awake training for their own squads, thereby ruining the whole investigation.
Randomization. Generally speaking, randomization is a way to protect the study from bias in selecting subjects or in assigning treatments. It aids in getting a broad representation from the population into the sample. Randomization helps to communicate the objectivity of the study. It provides a basis for mathematical distribution theory that has uses in statistical appraisals and in simulations. [See EXPERIMENTAL DESIGN; RANDOM NUMBERS.]
Bad breaks in random sampling. Valuable as randomization is, chance can strike an investigator stunning blows. For example, suppose that a psychological learning experiment is intended to reinforce 5 randomly chosen responses in each burst of 20. If the randomization accidentally gives reinforcement to the first 5 responses in each burst of 20, the psychologist should notice this and realize that he has selected a special kind of periodic reinforcement. The objectivity of the random assignment cannot cure its qualitative failure.
Similarly, suppose that in preparing to study fantasy productions under two carefully controlled conditions the clinical psychologist observes that his randomizing device has put all his scientist subjects into one group and all his humanist subjects into another. In that case, he should reconsider the grouping.
In principle, one should write down, in advance, sets of assignments that one would not accept. Unfortunately, there are usually too many of these, and nobody is yet adept at characterizing them in enough detail to get a computer to list them, even if it could face the size of the task. One solution is to describe a restricted, but acceptable, set of assignments and to choose randomly from these. Omitting some acceptable assignments may help to make the description feasible while keeping the list satisfactorily broad.
If this solution is not possible, then one probably has either to trust oneself (admittedly risky) or else get a more impartial judge to decide whether a particular random assignment should be borne.
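One way to read the suggestion of choosing randomly from a restricted set is as rejection sampling: draw a random assignment, check it against a rule written down in advance, and redraw if it fails. The sketch below applies this to the reinforcement example. The particular rule (reject any draw whose 5 reinforced positions form one consecutive run) and the function names are assumptions chosen for illustration; a real study would need its own rule.

```python
import random

def is_acceptable(positions):
    """Hypothetical rule fixed in advance: reject a draw whose positions
    form a single consecutive run (such as 1-5), since that would amount
    to a special kind of periodic reinforcement."""
    ordered = sorted(positions)
    consecutive = all(b - a == 1 for a, b in zip(ordered, ordered[1:]))
    return not consecutive

def draw_restricted(rng, burst_size=20, n_reinforced=5, max_tries=1000):
    """Redraw until the rule is met, which selects uniformly at random
    among the acceptable assignments."""
    for _ in range(max_tries):
        positions = rng.sample(range(1, burst_size + 1), n_reinforced)
        if is_acceptable(positions):
            return sorted(positions)
    raise RuntimeError("no acceptable assignment found")

rng = random.Random(1966)  # arbitrary seed, used only for reproducibility
schedule = draw_restricted(rng)
print(schedule)
```

Because the rule is stated before any draw is seen, the choice among the remaining assignments stays objective, which is the point of the restricted set.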
If there are many variables, an investigator cannot defend against all the bad assignments. By leaning upon subject matter knowledge and accepting the principle that the variables usually thought to be important are the ones to be especially concerned about, stratification, together with randomization, can still be of some assistance. For example, the stratification might enforce equal numbers of each sex, with individuals still randomly chosen from the potential pools of men and women. In studying bad breaks from randomization, the investigator can afford to consider rejecting only the assignments too closely related to proved first-order or main effects and not second-order effects or boomerang possibilities conceived in, but never observed from, armchairs.
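The sex-stratified example can be sketched in a few lines: the stratification fixes how many individuals come from each pool, and randomization decides which ones. The pool contents, labels, and function name below are hypothetical.

```python
import random

def stratified_sample(pools, per_stratum, rng):
    """Draw the same number of individuals at random, without
    replacement, from each pool (stratum)."""
    return {
        name: sorted(rng.sample(members, per_stratum))
        for name, members in pools.items()
    }

# Hypothetical pools of 40 potential subjects of each sex.
pools = {
    "men": [f"M{i:02d}" for i in range(1, 41)],
    "women": [f"W{i:02d}" for i in range(1, 41)],
}
chosen = stratified_sample(pools, per_stratum=10, rng=random.Random(7))
print(chosen)
```

Stratifying on one variable thought to matter removes the bad breaks on that variable only; imbalance on unstratified variables is still left to chance.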
Random permutations. Although arranging objects in a random order can easily be done by using an ordinary random number table, few people know how to do it. In any case, making these permutations is tedious, and it is worth noting the existence of tables of random permutations in some books on the design of experiments and in the book by Moses and Oakford (1963) that offers many permutations of sets numbering 9, 16, 20, 30, 50, 100, 200, 500, 1,000 elements. A set of any other size less than 1,000 can be ordered by using the permutations for the next larger size. With larger sets, some stratification is almost sure to be valuable.
Example. One (nonstratified) permutation for a set of 30 elements is shown in Table 1 (read left to right).
To arrange at random the letters of the alphabet, we might assign the integers, starting with 1 for a and ending with 26 for z, to the positions of the letters in the alphabet. Then, according to the permutation of Table 1, 11 and 5 correspond to k and e; 29 is omitted because it exceeds 26; and 26 corresponds to z. Continuing gives the permutation:
For a random sample of 5 letters from the alphabet, drawn without replacement, we could just take the first 5 listed.
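Where a printed table of permutations is not at hand, the same procedure can be sketched with a pseudorandom generator. The code below is an illustration, not a reproduction of the published table: it uses only the four opening entries quoted above (11, 5, 29, 26) and then generates a fresh permutation of 30 elements of its own. The function name and the seed are assumptions.

```python
import random
import string

def letters_from_permutation(permutation):
    """Map integers 1..26 to a..z in the order given, skipping any
    integer above 26 (it has no letter)."""
    return [string.ascii_lowercase[i - 1] for i in permutation if 1 <= i <= 26]

# The opening entries quoted in the text: 29 is omitted.
print(letters_from_permutation([11, 5, 29, 26]))  # ['k', 'e', 'z']

rng = random.Random(1963)                    # arbitrary seed
permutation = rng.sample(range(1, 31), 30)   # a random ordering of 1..30
alphabet_order = letters_from_permutation(permutation)
sample_of_five = alphabet_order[:5]          # 5 letters, without replacement
print(alphabet_order)
print(sample_of_five)
```

Using a permutation of the next larger size and discarding the surplus integers leaves every ordering of the 26 letters equally likely, which is why the skipping step is harmless.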
Simulations for new statistical methods. Large-scale simulation of economic, political, and social processes is growing in popularity; social scientists who invent new statistics would often find it profitable to try these out on idealized populations, constructed with the aid of random number tables, to see how well they perform their intended functions under perfectly understood conditions. This sort of exploration should be encouraged as part of the pilot work. To illustrate the lack, many books and hundreds of papers have been written about factor analytic methods, yet in 1966 it is hard to point to more than a single published simulation (Lawley & Swanson 1954; Tucker 1964) of the methods proposed on artificially constructed populations with random error.
Nonsampling errors in sample surveys
Questionnaires. Questionnaires themselves present many sources of bias, of which the wording of questions and the options offered as answers are especially important. Some topics discussed below (“Panel studies,” “Nonresponse,” and “Response error”) also treat questionnaire matters. [See especially SAMPLE SURVEYS; SURVEY ANALYSIS; see also INTERVIEWING.]
Wording. The wording and position of questions on questionnaires used in public opinion polls and other investigations illustrate the difficulties surrounding the notion of true value mentioned earlier. Rugg and Cantril’s survey article (1944) analyzes and illustrates the effects of the manner of questioning on responses. For example, prior to the U.S. entry into World War II, variations on a question about American aid to Great Britain, asked of American citizens within a period of about six weeks, produced the following percentages in favor of the “interventionist” position: 76, 73, 58, 78, 74, 56. Here the interventionist position meant approval of “giving aid even at the risk of war.” At much the same time, unqualified questions about “entering the war immediately” produced the following percentages in favor, 22, 17, 8, numbers substantially different from those in the previous set. Although one would be hard put to choose a number to represent degree of support for intervention, the interval 55 to 80 per cent gives a range; this range was clearly higher than that in support of entering the war immediately.
Pilot studies of the wordings of questions test their meaning and clarity for the intended population. Phillip Rulon recalls interviews with very bright second graders from a geography class to discuss a test item that they had “missed”: “Wind-eroded rocks are most commonly found in the (a) deserts, (b) mountains, (c) valleys.” They chose “valleys” because few people would find wind-eroded rocks in the mountains or the deserts, however many such rocks might be in those places. After a question previously found to be unsatisfactory is reworded, bitter experience advises the testing of the new version.
In single surveys, one needs to employ a variety of questions to get at the stability and meaning of the response. The use of “split ballots” (similar but modified questionnaires administered to equivalent samples of individuals) offers a way to experiment and to control for position and wording.
Changing opinions. To ignore the results of the polls because of the considerable variation in responses would be as big a mistake as to adopt their numbers without healthy skepticism. Since opinion in time of crisis may move rapidly, it is easy to misappraise the tenor of the times without a systematic measuring device. For example, between July 1940 and September 1941, the per cent of U.S. citizens saying that they were willing to risk war with Japan rather than to let it continue its aggression rose from 12 to 65 per cent. Again, although in June 1940 only 35 per cent thought it more important to help England than to keep out of war, by September 1940 the percentage had risen to the 50s (Cantril 1944, p. 222). In September 1940, President Roosevelt made a deal that gave Great Britain 50 destroyers in return for leases of bases (Leuchtenburg 1963, pp. 303-304); in the face of the fluctuations of public opinion a historian considering the destroyer deal might easily believe that Roosevelt acted against, rather than with, the majority. (As I recall from experience at the Office of Public Opinion Research, Roosevelt had his own personal polls taken regularly, with reports submitted directly to him, usually on a single question.)
Seemingly minor variations in questions may change the responses a good deal, and so to study changes over time, one needs to use one well-chosen question (or sequence) again and again. Naturally, such a question may come under attack as not getting at the “true value.” If the question is to be changed, then, to get some parallel figures, it and the new question should be used simultaneously for a while.
Intercultural investigations. Considering the difficulty of getting at opinions and the dependence of responses upon the wording of the questions asked, even within a country, the problem of obtaining comparable cross-cultural or cross-national views looks horrendous. Scholars planning such studies will want to see three novel works. Kluckhohn and Strodtbeck’s (1961) sociological and anthropological Variations in Value Orientations especially exploits ranking methods in the comparison of value orientations in Spanish-American, Mormon, Texas, Zuni, and Navajo communities. Subjects describe the many values of their culture by ordering their preferences, for example, for ways of bringing up children: past (the old ways), present (today’s ways), or future (how to find new ways to replace the old). Cantril’s (1966) social-psychological and internationally oriented Pattern of Human Concerns uses rating methods and sample surveys to compare values and satisfactions in the populations of 15 nations. For example, the respondent’s rating, on a scale of 0 to 10, expresses his view of how nearly he or his society has achieved the goal inquired about, and another rating evaluates how much either might expect to advance in five years. In international economics, measurements may be more easily compared, although the economist may be forced to settle for measuring the measurable as an index of what he would like to evaluate. Harbison and Myers’ study, Education, Manpower, and Economic Growth (1964), illustrates this approach.
Panel studies. Although the single sample survey can be of great value, in some problems it is desirable to study the changes in the same people through time. The set of people chosen for repeated investigation is called a panel. One advantage of the panel study over the single survey is the deeper analysis available. For instance, when a net 5 per cent change takes place, does this mean that only 5 per cent of the people changed, or perhaps that 15 per cent changed one way and 10 per cent another? Second, additional measurement precision comes from matching responses from one interview to another. Third, panel studies offer flexibility that allows later inquiries to help explain earlier findings. [See PANEL STUDIES.]
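The distinction between net and gross change can be shown by cross-tabulating paired answers from the same panel members. The sketch below uses an invented panel of 100 people built to match the figures in the illustration above (15 per cent moving one way, 10 per cent the other); the data and the function name are hypothetical.

```python
from collections import Counter

def turnover(first_wave, second_wave):
    """Cross-tabulate paired yes/no answers from the same panel members
    and report gross and net change as percentages of the panel."""
    counts = Counter(zip(first_wave, second_wave))
    n = len(first_wave)
    gained = 100.0 * counts[("no", "yes")] / n   # switched to "yes"
    lost = 100.0 * counts[("yes", "no")] / n     # switched to "no"
    return {"gained": gained, "lost": lost,
            "net": gained - lost, "gross": gained + lost}

# Invented panel: 40 stay "yes", 35 stay "no",
# 15 switch to "yes", 10 switch to "no".
first_wave = ["yes"] * 40 + ["no"] * 35 + ["no"] * 15 + ["yes"] * 10
second_wave = ["yes"] * 40 + ["no"] * 35 + ["yes"] * 15 + ["no"] * 10
result = turnover(first_wave, second_wave)
print(result)
```

Two independent single surveys would reveal only the net figure; the gained and lost percentages require the same respondents at both interviews.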
Dropouts. Panel studies, even when they start out on an unbiased sample, have the bias that the less informed, the lower-income groups, and those not interested in the subject of the panel tend to drop out. Sobol (1959) suggested sampling these people more heavily to begin with, and she tried to follow movers. According to Seymour Sudman of the National Opinion Research Center, in national consumer panels and television rating panels where a fee is paid to the participant, the lower-income groups do not drop out.
Beginning effects. When new individuals or households first join a panel, their early responses may differ from their later ones. The “first-month” effect has unknown origins. For example, after the first month on the panel, the fraction of unemployed reported in private households decreases about 6 per cent. Over the course of several panel interviews, more houses become vacant and consumer buying decreases (Neter & Waksberg 1964a; Waksberg & Pearl 1964). Household repairs decreased by 9 per cent between the second and third interview. In consumer panels, the reports made during the first six or eight weeks of membership are usually not included in the analysis. The startup differences are not clear-cut and emphatic but unsettling enough that the data are set aside, expensive as that is.
Sample surveys are not alone in these “first-time” effects. Doctors report that patients’ blood pressures are higher when taken by a strange doctor. In the Peirce reaction-time data, presented in Table 4, the first day’s average reaction time was about twice those of the other 23 days.
Long-run effects. A most encouraging finding in consumer panel studies has been the stability of the behavior of the panelists. By taking advantage of the process of enlarging two panels, Ehrenberg (1960) studied the effects of length of panel membership in Great Britain and in Holland. When he compared reports of newly recruited households (after their first few weeks) with those of “old” panel members, he found close agreement for purchasing rates, brand shares of market, and diary entries per week.
Panels do have to be adjusted to reflect changes in the universe, and panel families dissolve and multiply.
Nonresponse. The general problem of nonresponse arises because the properties of the nonrespondents usually differ to some degree from those of respondents. Unfortunately, nonresponse is not confined to studies of human populations. Physical objects can be inaccessible for various reasons: records may be lost, manholes may be paved over, a chosen area may be in dense jungle, or the object may be too small to be detected. One tries to reduce nonresponse, adjust estimates for it, and allow for it in measures of variability.
Mail questionnaires. The following advice, largely drawn from Levine and Gordon (1958-1959) and Scott (1961), is intended to increase response from mail questionnaires:
(1) Respondent should be convinced that the project is important.
(2) Preparatory letter should be on the letterhead of a well-known organization or, where appropriate, should be signed by a well-known person. In the United States and in Great Britain, governmental agencies are more likely to obtain responses than most organizations. Indeed, Scott (1961) reports 90 per cent response! Special populations respond to appeals from their organizations.
In pilot studies preparatory to using mailed census questionnaires in the initial stage of enumeration for the 1970 census (enumerator to follow up nonrespondents), the U.S. Bureau of the Census got the percentages of responses to the mailing shown in Table 2.
[Table 2: Percentage of response, by long form and short form]
(3) Rewards may be used (gifts, trading stamps, sweepstakes). Do not offer a copy of the final report unless you are prepared to give it.
(4) Make questionnaire attractive (printing on good paper is preferred), easy to read, and easy to fill in, remembering that many people have trouble reading fine print. Longer questionnaires usually lower the response rate.
(5) Keep questions simple, clear, as short as possible, and where multiple-choice answers appear, make sure that they do not force respondent to choose answers that do not represent his position.
(6) Try to keep early questions interesting and easy; do not leave important questions to the end; keep related questions together, unless there are strong reasons to act otherwise.
(7) Use a high class of mail (first-class, airmail, and even special delivery), both for sending the questionnaire and on the return envelope. Do not expect respondent to provide postage. In Great Britain, Scott (1961) found that a card to be returned in an envelope drew a higher response than a postcard.
(8) Follow hard-core resistance with repeat questionnaire (the sixth mailing may still be rewarding), telegram, long-distance phone call, or even personal interview, as discussed below. Small response from early mailings may be badly biased; for example, successful hunters respond more readily than unsuccessful ones to questions about their bag (Kish 1965, p. 547).
(9) Do not promise or imply anonymity and then retain the respondent’s identity by subterfuge, however worthy the cause. Views on the effects of anonymity are mixed. If respondent’s identity is needed, get it openly.
The principles set out above for mail questionnaires and those below for personal interviews may well be culture-bound for they are largely gathered from Western, English-speaking experience. For example, where paper is expensive, questionnaires on better paper may be less likely to be returned than those on poorer paper.
Sample surveys using personal interviews. In personal interviews, 80 to 90 per cent response has been attained even on intimate topics. In 1966, 85 per cent was regarded as rather good for predesignated respondents in household surveys. In addition to the relevant maxims given above for mail surveys, to reduce nonresponse in personal interview surveys Sharp and Feldt (1959), among others, suggest some of the following:
(1) Send preview letter; use press to announce survey. In three lengthy surveys on different topics, according to Reuben Cohen of the Opinion Research Corporation, a letter sent in advance led to an average gain of 9 per cent in reaching, after four calls, urban adult respondents randomly drawn from the household list. Cohen also suggests that follow-up letters, after unsuccessful interviewing attempts, can reduce urban nonresponse by about one-third. Some students of polling believe that the actual impact of the preview letter is largely on the interviewer, who thinks that obtaining cooperation will be easier because of it—and so it is.
(2) Use trained interviewers, that is, interviewers trained especially to handle opening remarks, to explain the need for full coverage, and to get information about profitable times to make later calls (“callbacks”) to reach respondents who are initially not at home. Experienced interviewers have had 3 per cent to 13 per cent fewer nonrespondents than inexperienced ones.
(3) Be flexible about calling at convenience of respondent, even at his place of work or recreation, on evenings and on week ends.
(4) Allow interviewer to call back many times to locate assigned respondent.
(5) Employ interpreter when appropriate.
(6) In more esoteric situations, know the culture. Do not plan to interview farmers in the peak periods of farm activity. An anthropologist scheduled a survey of current sexual behavior among South Sea islanders during the season when women were taboo to fishermen—the natives, finding it a great joke, were slow to explain.
Extra effort. When a survey carried out in the usual way produces a surprisingly large nonresponse, an all-out effort may be mounted using many of the devices mentioned earlier. A rule of thumb is that the nonresponse can be reduced by about half.
Oversampling nonrespondents. Repeated callbacks are the traditional method for reducing nonresponse in personal interviews, and careful cost analysis has shown that their cost per completed interview is lower than was at first supposed when quota sampling was popular. Kish and Hess (1959) report a procedure for including in the current sample nonrespondents from similar previous surveys, so as to have in advance an oversupply of persons likely not to respond. Then the sample survey, although getting responses from these people at a lower rate than from others, more nearly fills out its quotas.
Subsampling nonrespondents. To reduce nonresponse in mail surveys, subsampling the nonrespondents and pursuing them with personal interviews has been used frequently (formulas for optimum design are given in Hansen & Hurwitz 1946). In methods thus far developed the assumption is made that nonrespondents can surely be interviewed. When this assumption is unjustified, the method is less valid.
Adjusting for respondents not at home. The next method adjusts for those not at home but does not handle refusals, which often come to about half the nonresponse. Bartholomew (1961) has got accurate results by assuming that most of the bias arises from the composition of the population available at the first call. By finding out when to call back, the interviewer reduces later biases from this source. The interviewer gets information either from others in the house or from neighbors. To illustrate, in empirical investigations of populations of known composition, Bartholomew studied the percentage of men in political wards of a city. In four wards, differences between first-call and second-call samples in percentage of men were 17 per cent, 29 per cent, 36 per cent, and 38 per cent, substantial differences. But the differences between the second-call percentage of men and the actual percentage of men not reached by the first call were only 6 per cent, 2 per cent, 2 per cent, and 2 per cent, supporting Bartholomew’s point.
Suppose that proportion $p$ of the population has the characteristic of interest. It is convenient to regard $p$ as the weighted average $\rho p_1 + (1 - \rho)p_2$, where $\rho$ is the proportion of first-call responders in the population, $p_1$ is the proportion of first-call responders having the characteristic, and $p_2$ is the proportion of others in the population having the characteristic. (It is assumed that $p_2$ is independent of response status after the first call.) Now if $N$, the total sample size, is expressed as $N = N_1 + N_2 + N_3$, where $N_1$ is the number of first-call responders in the sample, $N_2$ is the number of second-call (but not first-call) responders in the sample, and $N_3$ is the number of others, then $\rho$ is naturally estimated by $N_1/N$ (the proportion of first-call responders in the sample), $p_1$ by $n_1/N_1$ (the proportion of first-call responders in the sample who have the characteristic), and $p_2$ by $n_2/N_2$ (the proportion of second-call responders in the sample having the characteristic). Putting these estimators in the weighted average gives, as estimator of $p$,

$$\hat{p} = \frac{N_1}{N}\cdot\frac{n_1}{N_1} + \left(1 - \frac{N_1}{N}\right)\frac{n_2}{N_2} = \frac{1}{N}\left[n_1 + (N - N_1)\frac{n_2}{N_2}\right].$$
For example, if the number of men in the first call is n1 = 40 out of N1 = 200 interviewed, the second-call data are n2 = 200, N2 = 400, and the original sample size is N = 1,000, then the estimate of the proportion of men is (200/1,000)(40/200) + (800/1,000)(200/400) = 0.44. Even if the theory were exactly true, some increase in variance would arise from using such weights instead of obtaining the whole sample (Kish 1965, secs. 11.7B, 11.7C).
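The weighting just described can be sketched in a few lines of Python. The function name and argument names below are illustrative assumptions, not part of the original; the arithmetic is the weighted average of the first-call and second-call proportions, with the first-call response rate N1/N as the weight.

```python
def two_call_estimate(n1, N1, n2, N2, N):
    """Estimate a population proportion from first- and second-call data.

    n1, N1: number with the characteristic and number interviewed at first call
    n2, N2: the same counts for second-call (but not first-call) responders
    N:      original sample size, including those never reached
    """
    first_call_rate = N1 / N      # estimated share of first-call responders
    p1_hat = n1 / N1              # proportion with the characteristic at first call
    p2_hat = n2 / N2              # proportion among everyone else, estimated from second call
    return first_call_rate * p1_hat + (1 - first_call_rate) * p2_hat


# The worked example in the text: 40 of 200 at first call, 200 of 400 at second call, N = 1,000.
estimate = two_call_estimate(n1=40, N1=200, n2=200, N2=400, N=1000)
print(round(estimate, 2))
```

Note that the second-call proportion stands in for all 800 people not reached at the first call, which is exactly the assumption the method rests on.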
Extrapolation. Hendricks (1956) suggests plotting the variable being measured against the percentage of the sample that has responded on successive waves and extrapolating to 100 per cent. This simple, sensible idea could profit from more research, empirical and theoretical.
Effect on confidence interval. In sample surveys, nonresponse increases the lengths of the confidence intervals for final estimates by unknown amounts. For dichotomous types of questions, the suggestion is often made that all the nonresponses be counted first as having the attribute, then as not having it. The effect on the 95 per cent confidence interval is shown in Table 3. When such extreme allowances are required, the result of, say, 20 per cent nonresponse is frequently disastrous. For large random samples, this treatment of nonresponse, as may be seen from Table 3, adds approximately the per cent of nonresponse to the total length of the confidence interval that would have been appropriate with 100 per cent response. For example, with a sample of 2,500 from a large population, a 95 per cent confidence interval from 58 per cent to 62 per cent would be lengthened by 20 per cent nonresponse to 48.5 per cent to 71.5 per cent. This additional length gives motivation enough for wanting to keep nonresponse low.
|Table 3 – Allowance to be added to and subtracted from the observed percentage to give at least 95 per cent confidence of covering the true value*|
|* These numbers are approximately correct for percentages near 0.50; they are likely to be conservative otherwise.|
|Source: Cochran, Mosteller, and Tukey 1954, p. 280.|
|PER CENT NONRESPONSE||700||2,500||Infinite|
No one believes that these “worst possible” limits represent the true state of affairs, nor should anyone believe the optimist who supposes that the nonrespondents are just like the respondents. In large samples, differences as large as 28 per cent in the fraction possessing a characteristic between the first 60 per cent interviewed and the next 25 per cent have been reported. To develop a more sensible set of limits in the spirit of Bayesian inference would be a useful research job for sample survey workers and theoretical statisticians. [See BAYESIAN INFERENCE.] This urgently needed work would require both empirical information (possibly newly gathered) and theoretical development.
The laboratory worker who studies human behavior rarely has a defined target population and he frequently works with a sample of volunteers. Under such circumstances we cannot even guess the extent of nonresponse. Again the hope is that the property being studied is independent of willingness or opportunity to serve as a subject—the position of the optimist mentioned above.
Since a few experimenters do sample defined populations, the argument that such sampling is impossible has lost some of its strength. The argument that such sampling is too expensive has to be appraised along with the value of inferences drawn from the behavior of undefined sampled populations.
Studying differences between groups offers more grounds for hope that bias from nonresponse works in the same direction and in nearly the same amount in both groups and that the difference may still be nearly right. This idea comes partly from physical measurements where sometimes knowledge can make such arguments about compensating errors rigorous. But, as Joseph Berkson warns, no general theorem states “Given any two wrong numbers, their difference is right.”
Response error. When incorrect information about the respondent enters the data, a response error occurs. Among the many causes are misunderstandings, failures of memory, clerical errors, or deliberate falsehoods. The magnitudes of some of these errors and some ways to reduce them are discussed below.
Telescoping events. In reporting such things as amount of broken crockery or expenditures for household repairs, some respondents telescope the events of a considerable period into the shorter one under study. As a possible cure, Neter and Waksberg (1964b) have introduced a device called “bounded recall.” In a study of household repairs, the respondent was interviewed twice, first under unbounded recall, during which the full story of the last month, including the telescoping from previous months, was recorded by the interviewer. Second, in an interview using bounded recall a month later, the respondent was deliberately aided by the record of repairs from the first interview. The magnitudes of the effects of telescoping are considerable, because the “unbounded” interview for household repairs gave 40 per cent more jobs and 55 per cent higher expenditure than did the “bounded” interview. Data from a “bounded” interview produce less bias.
Forgetting. Although telescoping occurs for some activities, chronic illness (Feldman 1960), which had already been clinically diagnosed, was reported only at a 25 per cent rate in household interviews. Others report rates in the 40 per cent to 50 per cent range. Feldman despairs of the household interview for this purpose; but if improvement in reports is to be attempted, he recommends more frequent interviews by competent, trained interviewers and, in a panel study, the use of a morbidity diary to improve self-reporting. One limitation, not attributable to forgetting, is set because physicians choose not to inform their patients of every illness they diagnose.
Sudman (1964) compared consumer panel reports based upon diary records with reports based upon unaided recall. First, he shows that for 72 grocery products (55 being food), the purchases recorded in the diary underestimate the amount shipped by the manufacturer (after adjustment for nonhousehold use) by a median of about 15 per cent. The underreporting was highly predictable, depending on both the properties of the product (frequency of purchase, where most often purchased) and its treatment in the diary (type size, page number, position on page). Second, when recall was compared with diary, the median ratio of purchases (purchases recalled divided by purchases in diary) was 1.05 for nonfood products, 1.83 for perishable food, 1.54 for staple foods. Leading nationally advertised brands have their market shares overstated under recall by 50 per cent compared to diary records, and chain brands are understated.
Use of experts. After respondents had valued their own homes, Kish and Lansing (1954) obtained expert appraisals for a sample of the homes. The comparison of the experts’ appraisal with the homeowners’ appraisal can be used to adjust the total valuation or to adjust valuations for groups of houses. For example, the homeowners may average a few per cent too high.
Editing records. Whenever a comparison of related records can be made, the accuracy of records can probably be improved. For example, Census Bureau editors, experienced in the lumber business, check annual sawmill production reports against those of the previous year, and large changes are rechecked with the sawmill.
Describing response errors. One common measure in the analysis of nonsampling errors puts bias and sampling variability into one index, called the mean square error. The larger the mean square error, the worse the estimate. The mean square error is the expected squared deviation of the observed value from the true value. This quantity can be separated into the sum of two parts, the variance of the observation around its own mean and the square of the bias. Although true values are not available, in the United States, the Bureau of the Census, for instance, tries to find standards more accurate than the census to get an estimate of the response bias to particular questions. For example, using the Current Population Survey as a standard, the Bureau of the Census not only finds out that the census underestimates the percentage in the labor force, but the bureau also gets data on the portions of the population not being satisfactorily measured, either because of variability or bias. Using such information, the bureau can profitably redesign its inquiries because it knows where and how to spend its resources.
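The separation of the mean square error into variance and squared bias can be written out in one line. The symbols here are assumed for illustration and are not used in the original: $X$ is the observed value, $\theta$ the true value, and $E(X)$ the long-run mean of the observation.

$$\operatorname{MSE}(X) = E(X - \theta)^2 = E\bigl(X - E(X)\bigr)^2 + \bigl(E(X) - \theta\bigr)^2 = \operatorname{Var}(X) + (\text{bias})^2 .$$

The cross-product term drops out because $E\bigl(X - E(X)\bigr) = 0$. Under this decomposition, a large sample can shrink the variance term toward zero while leaving the squared bias untouched.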
Response uncertainty. In attitudinal studies (Katz 1946), the investigator must be especially wary of reports obtained by polling the public on a matter where opinion is not crystallized. The No Opinion category offers one symptom of trouble: for example, Katz reports that in 1945 only 4 per cent had No Opinion about universal military training in the United States, 13 per cent had No Opinion about giving the atom bomb secret to the United Nations, and 32 per cent had No Opinion about U.S. Senate approval of the United Nations charter. Even though the vote was 66 per cent to 3 per cent in favor of approving the charter, the 32 per cent No Opinion must suggest that the 69 per cent who offered an opinion contained a large subgroup who also did not hold a well-formed opinion.
Errors in analysis
Troubles with variability. In analyzing data, the presence of variability leads to many unsuspected difficulties and effects. In addition to treating some of the common traps, this section gives two ways to analyze variability in complicated problems where theoretical formulas for variance are either unavailable or should be distrusted.
Inflated sample size. The investigator must frequently decide what unit shall be regarded as independent of what other unit. For example, in analyzing a set of responses made by 10 individuals, each providing 100 responses, it is a common error to use as the sample size 10 x 100 = 1,000 responses and to make calculations, based perhaps on the binomial distribution, as if all these responses were independent. Unless investigation has shown that the situation is one in which independence does hold from response to response both within and between individuals, distrust this procedure. The analysis of variance offers some ways to appraise both the variability of an individual through time and the variation between individuals.
Use of matched individuals. Some investigators fail to take advantage of the matching in their data. Billewicz (1965, p. 623) reports that in 9 of 20 investigations that he examined for which the data were gathered from matched members in experimental and control groups, the analysis was done as if the data were from independent groups. Usually the investigator will have sacrificed considerable precision by not taking advantage in his analysis of the correlation in the data. Usually, but not always, the statistical significance of the results will be conservative. When matched data are analyzed as if independent, the investigator owes the reader an explanation for the decision.
Pooling significance tests. In the same vein, investigators with several small effects naturally wish that they could pool these effects to get more extreme levels of significance than those given by the single effects. Most methods of pooling significance tests depend upon independence between the several measures going into the pool. And that assumption implies, for example, that data from several items on the same sample survey cannot ordinarily be combined into a significance test by the usual pooling methods because independence cannot be assured. Correlation is almost certainly present because the same individuals respond to each item. Sometimes a remedy is to form a battery or scale that includes the several items of interest and to make a new test based upon the battery (Mosteller & Bush 1954, pp. 328-331). Naturally, the items would be chosen in advance for the purpose, not based post hoc upon their results. In the latter case, the investigator faces problems of multiplicity, discussed below.
Outlying observations. Frequently data contain suspicious observations that may be outliers, observations that cannot be rechecked, and yet that may considerably alter the interpretation of the data when taken at their face value. An outlier is an observation that deviates much more from the average of its mates or has a larger residual from a predicted value than seems reasonable on the basis of the pattern of the rest of the measurements. The classic example is given by the income distribution for members of a small college freshman class, exactly one member of which happens to be a multimillionaire. The arithmetic mean is not typical of the average member’s income; but the median almost ignores an amount of income that exceeds the total for all the others in the class. Sometimes the outliers can be set aside for special study.
One current approach tailors the analysis to the type of outlier that is common in the particular kind of investigation by choosing statistics that are both appropriate and not especially sensitive to outliers. For example, as a measure of location, one might systematically use the median rather than the mean, or for more efficiency, the trimmed mean, which is the average of the measurements left after the largest and smallest 5 per cent (or 100α per cent) of the measurements have been removed (Mosteller & Tukey 1966, secs. A5, B5). In normal populations, the median has an efficiency of about 64 per cent, but the trimmed mean has most of the robustness of the median and an efficiency of about 1 − (2/3)α, where α is the proportion trimmed off each end. For α = 0.10, the efficiency is 93 per cent. [See ERRORS, article on the EFFECTS OF ERRORS IN STATISTICAL ASSUMPTIONS; NONPARAMETRIC STATISTICS, article on ORDER STATISTICS; STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, article on OUTLIERS.]
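A minimal sketch of the trimmed mean follows, in Python. The function names and the small data set are made up for illustration; the efficiency formula is the rough approximation 1 − (2/3)α quoted above for normal populations, not an exact result.

```python
def trimmed_mean(values, alpha):
    """Average what is left after dropping the smallest and largest
    100*alpha per cent of the measurements (alpha is trimmed off EACH end)."""
    if not 0 <= alpha < 0.5:
        raise ValueError("alpha must lie in [0, 0.5)")
    ordered = sorted(values)
    cut = int(alpha * len(ordered) + 1e-9)   # number dropped from each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)


def approximate_efficiency(alpha):
    """Rule-of-thumb efficiency of the trimmed mean in a normal population."""
    return 1 - (2 / 3) * alpha


# Hypothetical data with one wild value, in the spirit of the multimillionaire freshman.
incomes = [3, 4, 5, 5, 6, 6, 7, 7, 8, 1000]
plain_mean = sum(incomes) / len(incomes)
robust_mean = trimmed_mean(incomes, 0.10)
print(plain_mean, robust_mean, round(approximate_efficiency(0.10), 2))
```

With one observation trimmed from each end, the wild value no longer dominates the summary, while most of the remaining data still contribute to the average.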
Shifting regression coefficients. When one fits a regression equation to data, this regression equation may not forecast well for a new set of data. Among the reasons are the following:
(1) The fitted regression coefficients are not true values but estimates (sampling error).
(2) If one has selected the best from among many predictive variables, the selected ones may not be as good as they appeared to be on the basis of the sample (regression effect).
(3) Worse, perhaps none of the predictive variables were any good to start with (bad luck or poor planning).
(4) The procedure used to choose the form of the regression curve (linear, quadratic, exponential, …) has leaned too hard on the previously available data, and represents them too well as compared with the total population (wrong form).
(5) The new sample may be drawn from a population different from the old one (shifting population).
What are the effects of (2) and (5)? Consider the regression of height (Y) on weight (X) for a population of boys. Suppose that the true regression equation for this population is
E(Y) = a + b(X-µx),
where a and b are unknown constants, E(Y) is the expected value of Y for a given value of X, and µx is the mean of X. Suppose that an individual’s height has a predictive error e that has mean 0, variance σ², and is unrelated to X, Y, and the true values of a and b.
Suppose that the experimenter chooses fixed values of X, xi, such as 70, 80, 90, 100, 110, 120, 130, 140 pounds, obtaining boys having each of these weights and measuring their heights yi. Then the data are paired observations (xi, yi), i = 1, 2, …, n.
Estimating $a$ and $b$ from the sample by the usual least squares formulas, one gets $\hat{a}$ and $\hat{b}$. Given a new sample with the same values of $X$ from the same population, one can estimate the $Y$'s for the new sample by

$$\hat{Y}_i = \hat{a} + \hat{b}(X_i - \bar{x})$$

(where $\bar{x}$ is the average of the $X_i$), and then the expected mean square error of the estimates for the new sample is

$$\text{expected value of } \frac{1}{n}\sum_{i=1}^{n}\bigl(\hat{Y}_i - Y_i''\bigr)^2 = \sigma^2\left(1 + \frac{2}{n}\right),$$

where $Y_i''$ is the height for an individual in the new sample. Note that $\sigma^2$ is the expected mean square error that would obtain were $a$ and $b$ known exactly instead of having been estimated.
Suppose that in addition to this population, there is a new population with different values of $a$ and $b$, say $a'$ and $b'$. Both populations come from a group of several populations with $a$ and $b$ varying from population to population and having $\bar{a}$ and $\bar{b}$ as the mean values of $a$ and $b$, respectively, and $\sigma_a^2$ and $\sigma_b^2$ as the variances of these sets of regression coefficients. For the example of the boys’ weights, consider the distribution of values of $a$ and $b$ from one city to another.
The regression line fitted on the basis of the sample from one population and then used on another population yields expected mean square error

$$\sigma^2\left(1 + \frac{2}{n}\right) + 2\left(\sigma_a^2 + \sigma_x^2\sigma_b^2\right),$$

where $\sigma_x^2$ is the variance of the chosen set of $x$'s. The first term comes as before from ordinary sampling variation of the $Y$'s around the fitted regression line (the 2 of $2/n$ being the dimension of the parameter space), but the $2(\sigma_a^2 + \sigma_x^2\sigma_b^2)$ comes from drawing two sets of regression coefficients from the population of regression coefficients. This term may be substantial compared to $2\sigma^2/n$ or even $\sigma^2$.
We need extensive empirical results for such experiments to get a notion of the size of $2(\sigma_a^2 + \sigma_x^2\sigma_b^2)$ in various settings of interest to social and natural scientists. These investigations have not yet been carried out. The formulas for mean square error in this realistic situation must cause concern until more empirical studies are done. The existence of the added term should be recognized and an attempt made to assess its contribution numerically.
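One way to get a feel for the size of the added term is a small simulation. The sketch below, in Python, compares a simulated mean square error with the expression σ²(1 + 2/n) + 2(σa² + σx²σb²), where σx² is the variance of the chosen x's with divisor n. All numerical settings (the error standard deviation, the spread of coefficients across populations, the mean coefficients) are hypothetical, and drawing the coefficients and errors from normal distributions is an assumption made only for convenience.

```python
import random


def theoretical_mse(xs, sigma, sigma_a, sigma_b):
    """sigma^2 (1 + 2/n) + 2 (sigma_a^2 + sigma_x^2 sigma_b^2)."""
    n = len(xs)
    x_bar = sum(xs) / n
    var_x = sum((x - x_bar) ** 2 for x in xs) / n
    return sigma ** 2 * (1 + 2 / n) + 2 * (sigma_a ** 2 + var_x * sigma_b ** 2)


def simulated_mse(xs, sigma, sigma_a, sigma_b, a_bar=60.0, b_bar=0.2,
                  reps=20000, seed=1):
    """Fit a line in one population, predict in another, average the squared error."""
    rng = random.Random(seed)
    n = len(xs)
    x_bar = sum(xs) / n
    dx = [x - x_bar for x in xs]
    sxx = sum(d * d for d in dx)
    total = 0.0
    for _ in range(reps):
        # Draw regression coefficients for the old and the new population.
        a_old, b_old = rng.gauss(a_bar, sigma_a), rng.gauss(b_bar, sigma_b)
        a_new, b_new = rng.gauss(a_bar, sigma_a), rng.gauss(b_bar, sigma_b)
        # Least squares fit on a sample from the old population (x centered at its mean).
        ys = [a_old + b_old * d + rng.gauss(0, sigma) for d in dx]
        a_hat = sum(ys) / n
        b_hat = sum(d * y for d, y in zip(dx, ys)) / sxx
        # Squared prediction error on a sample from the new population.
        ys_new = [a_new + b_new * d + rng.gauss(0, sigma) for d in dx]
        total += sum((a_hat + b_hat * d - y) ** 2 for d, y in zip(dx, ys_new)) / n
    return total / reps


weights = [70, 80, 90, 100, 110, 120, 130, 140]   # the fixed x values used in the text
theory = theoretical_mse(weights, sigma=2.0, sigma_a=1.0, sigma_b=0.02)
simulated = simulated_mse(weights, sigma=2.0, sigma_a=1.0, sigma_b=0.02)
print(round(theory, 2), round(simulated, 2))
```

In this hypothetical setting the shifting-population term contributes about a third of the total, even though the assumed spread of coefficients between populations is modest.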
Uncontrolled sources of variation. Although the important formula σx̄ = σ/√n for the standard deviation of a mean X̄, a random variable, is correct when n uncorrelated measurements are drawn from a distribution with standard deviation σ, two difficulties arise. The measurements may not be uncorrelated, and the distribution may change from one set of measurements to another.
Peirce’s data illustrate these difficulties. In an empirical study intended to test the appropriateness of the normal distribution, C. S. Peirce (1873) analyzed the time elapsed between a sharp tone stimulus and the response by an observer, who made about 500 responses each day for 24 days. Wilson and Hilferty (1929) reanalyzed Peirce’s data. Table 4 shows sample means, $\bar{x}$, estimated standard deviations of the mean, $s_{\bar{x}}$, and the ratio of the observed to the estimated interquartile range, Q3 − Q1. The observed interquartile range is based on percentage points of the observed distribution; the estimated interquartile range is based on the assumption of a normal distribution and has the value 2(0.6745s), where s is the sample standard deviation. In passing, note that the ratio is systematically much less than unity, defying the normality assumption. More salient for this discussion is the relation of day-to-day variation to the values of $s_{\bar{x}}$ based on within-day variation. The latter varies from 1.1 to 2.2 (after the first day’s data, whose mean and standard deviation are obviously outliers, are set aside). These limits imply naive standard deviations of the difference between means for pairs of days ranging from 1.6 to 3.1 (that is, √2 times 1.1 and 2.2). If these applied, most differences would have to be less than twice these, 3.2 to 6.2, and practically all less than three times these, 4.8 to 9.3. Table 4 shows that the actual differences −38, +2, −57, +27, … , +11, −4, −8 impolitely pay little attention to such limitations.
|Table 4 – Daily statistics from Wilson and Hilferty’s analysis of C. S. Peirce’s data|
|Source: Wilson & Hilferty 1929.|
|Day||X̄ ± sx̄ (milliseconds)||(Q3 − Q1) / [2(0.6745s)]|
|1||475.6 ± 4.2||0.932|
|2||241.5 ± 2.1||0.842|
|3||203.1 ± 2.0||0.905|
|4||205.6 ± 1.8||0.730|
|7||186.9 ± 2.2||0.753|
|8||194.1 ± 1.4||0.840|
|9||195.8 ± 1.6||0.756|
|10||215.5 ± 1.3||0.850|
|11||216.6 ± 1.7||0.782|
|12||235.6 ± 1.7||0.759|
|13||244.5 ± 1.2||0.922|
|14||236.7 ± 1.8||0.529|
|15||236.0 ± 1.4||0.662|
|16||233.2 ± 1.7||0.612|
|17||265.5 ± 1.7||0.792|
|18||253.0 ± 1.1||0.959|
|19||258.7 ± 1.8||0.502|
|20||255.4 ± 2.0||0.521|
|21||245.0 ± 1.2||0.790|
|22||255.6 ± 1.4||0.688|
|23||251.4 ± 1.6||0.610|
|24||243.4 ± 1.1||0.730|
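The comparison of day-to-day differences with their naive standard deviations can be checked directly from the tabled values. The sketch below, in Python, uses only the rows of Table 4 as reproduced here (day 1 set aside as in the text; days not listed are skipped, so only pairs of adjacent listed days are compared). The variable names and the choice of three naive standard deviations as the cutoff are illustrative assumptions.

```python
import math

# Day: (mean, standard error of the mean), in milliseconds, from Table 4.
peirce = {
    2: (241.5, 2.1), 3: (203.1, 2.0), 4: (205.6, 1.8), 7: (186.9, 2.2),
    8: (194.1, 1.4), 9: (195.8, 1.6), 10: (215.5, 1.3), 11: (216.6, 1.7),
    12: (235.6, 1.7), 13: (244.5, 1.2), 14: (236.7, 1.8), 15: (236.0, 1.4),
    16: (233.2, 1.7), 17: (265.5, 1.7), 18: (253.0, 1.1), 19: (258.7, 1.8),
    20: (255.4, 2.0), 21: (245.0, 1.2), 22: (255.6, 1.4), 23: (251.4, 1.6),
    24: (243.4, 1.1),
}


def naive_sd_of_difference(se_first, se_second):
    """Standard deviation of a difference of two means, IF the within-day
    standard errors told the whole story and the days were independent."""
    return math.sqrt(se_first ** 2 + se_second ** 2)


comparisons = []
for day in sorted(peirce):
    if day + 1 in peirce:                      # only adjacent listed days
        (mean_a, se_a), (mean_b, se_b) = peirce[day], peirce[day + 1]
        difference = mean_b - mean_a
        ratio = difference / naive_sd_of_difference(se_a, se_b)
        comparisons.append((day, day + 1, difference, ratio))

too_large = [c for c in comparisons if abs(c[3]) > 3]
print(f"{len(too_large)} of {len(comparisons)} adjacent-day differences exceed 3 naive SDs")
```

If the within-day standard errors were adequate, differences beyond three standard deviations would be rare; here they are commonplace.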
In the language of analysis of variance, Peirce’s data show considerable day-to-day variation. In the language of Walter Shewhart, such data are “out of control”: the within-day variation does not properly predict the between-days variation [see QUALITY CONTROL, article on PROCESS CONTROL]. Nor is it just a matter of the observer “settling down” in the beginning. Even after the twentieth day he still wobbles.
Need for a plurality of samples. The wavering in these data exemplifies the history of the “personal equation” problem of astronomy. The hope had been that each observer’s systematic errors could be first stabilized and then adjusted for, thus improving accuracy. Unfortunately, attempts in this direction have failed repeatedly, as these data suggest they might. The observer’s daily idiosyncrasies need to be recognized, at least by assigning additional day-to-day variation.
Wilson and Hilferty (1929, p. 125) emphasize that Peirce’s data illustrate “the principle that we must have a plurality of samples if we wish to estimate the variability of some statistical quantity, and that reliance on such formula as σ/√n is not scientifically satisfactory in practice, even for estimating unreliability of means” (see Table 4).
Direct assessment of variability. One way to get a more honest estimate of variability breaks the data into rational subgroups, usually of equal or nearly equal sizes. For each subgroup, compute the statistic (mean, median, correlation coefficient, spectral density, regression equation, or whatever), base the estimate for the whole group on the average of the statistic for the subgroups, and base the estimate of variability on Student’s t with one degree of freedom less than the number of subgroups. That is, treat the k group statistics like a sample of k independent measurements from a normal distribution. [This method, sometimes called the method of interpenetrating samples, generalizes the method for calculating the sampling error for nonprobability samples described in SAMPLE SURVEYS, article on NONPROBABILITY SAMPLING.]
At least five groups (preferably at least ten) are advisable in order to get past the worst part of the t-table. This suggestion encourages using more, not fewer, groups. For two-sided 5 per cent levels, see Table 5.
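The subgroup procedure can be sketched as follows in Python. The function name, the toy data, and the choice of the median as the statistic are illustrative assumptions; the critical value is supplied by the caller, and 2.262 is the usual two-sided 5 per cent point of Student's t for 9 degrees of freedom (ten subgroups).

```python
import statistics


def subgroup_estimate(groups, statistic, t_critical):
    """Average a statistic over subgroups and attach t-based limits.

    groups:     list of subgroups (each a list of observations)
    statistic:  function applied separately to each subgroup
    t_critical: two-sided critical point of Student's t with len(groups) - 1
                degrees of freedom
    """
    values = [statistic(group) for group in groups]
    k = len(values)
    center = statistics.mean(values)
    standard_error = statistics.stdev(values) / k ** 0.5   # treats the k values as a sample
    allowance = t_critical * standard_error
    return center, (center - allowance, center + allowance)


# Hypothetical data: ten subgroups of three observations each.
groups = [[i, i + 1, i + 2] for i in range(10)]
center, (lower, upper) = subgroup_estimate(groups, statistics.median, t_critical=2.262)
print(center, round(lower, 2), round(upper, 2))
```

The same function works for any statistic that can be computed on a single subgroup, which is the main appeal of the method when no variance formula is available.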
Two major difficulties with this direct assessment are (a) that it may not be feasible to calculate meaningful results for such small amounts of data as properly chosen groups would provide, and (b) that, even if the calculations yield sensible results, they may be so severely biased as to make their use unwise.
A method with wide application, intended to ameliorate these problems, is the jackknife, which offers ways to reduce bias in the estimate and to set realistic approximate confidence limits in complex situations.
Assessment by the jackknife. Again the data are divided into groups, but the statistic to be jackknifed is computed repeatedly on all the data except an omitted group. With ten groups, the statistic is computed each time for about 90 per cent of the data.
|Table 5 – Two-sided 5 per cent levels for Student’s t for selected degrees of freedom|
|Degrees of freedom||5 per cent critical point|
More generally, for the jackknife, the desired calculation is made for all the data, and then, after the data are divided into groups, the calculation is made for each of the slightly reduced bodies of data obtained by leaving out just one of the groups.
Let $y_{(j)}$ be the result of making the complex calculation on the portion of the sample that omits the $j$th subgroup, that is, on a pool of $k - 1$ subgroups. Let $y_{\text{all}}$ be the corresponding result for the entire sample, and define pseudo values by

$$y^*_j = k\,y_{\text{all}} - (k - 1)\,y_{(j)}, \qquad j = 1, 2, \ldots, k.$$
These pseudo values now play the role played by the values of the subgroup statistics in the method of interpenetrating samples. For simple means, the jackknife reduces to that method.
As in the method of interpenetrating samples, in a wide variety of problems, the pseudo values can be used to set approximate confidence limits through Student’s t, as if they were the results of applying some complex calculation to each of k independent pieces of data.
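The jackknifed value $y^*$, which is the best single result, and an estimate, $s^{*2}$, of its variance are given by

$$y^* = \frac{1}{k}\sum_{j=1}^{k} y^*_j, \qquad s^{*2} = \frac{\sum_{j=1}^{k}\left(y^*_j - y^*\right)^2}{k(k - 1)}.$$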
If the statistic being computed has a bias that can be expressed as a series in the reciprocal of the sample size, $N$, the jackknife removes the leading term (that in $1/N$) in the bias. Specifically, suppose that $\hat{\mu}$, the biased estimate of $\mu$, has expected value

$$E(\hat{\mu}) = \mu + \frac{a}{N} + \frac{b}{N^2} + \cdots,$$

where $a$, $b$, and so on are constants. If $\hat{\mu}^*$ is the jackknifed estimate, its expected value is

$$E(\hat{\mu}^*) = \mu + \frac{\alpha}{N^2} + \frac{\beta}{N^3} + \cdots,$$

where $\alpha$, $\beta$, and so on are constants. To give a trivial example, $\sum (X_i - \bar{X})^2/N$ is a biased estimate of $\sigma^2$. Its expected value is $\sigma^2 - (\sigma^2/N)$, and so it has the sort of bias that would be removed by jackknifing.
To understand how the first-order bias terms are removed by jackknifing, one might compute the expected value of $y^*$ for the special case where

$$E(y_{\text{all}}) = \mu + \frac{a}{N}.$$

Then with the use of $k$ groups of equal size, $n$, so that $kn = N$,

$$E(y^*_j) = k\left(\mu + \frac{a}{kn}\right) - (k - 1)\left(\mu + \frac{a}{(k - 1)n}\right) = \mu.$$

The leading term in the bias was removed in the construction of the $y^*_j$'s. Even if the sample sizes are not equal, the leading term in the bias is likely to have its coefficient reduced considerably.
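A compact sketch of the jackknife in Python may make the bookkeeping concrete. It forms each pseudo value as k times the all-data result minus (k − 1) times the leave-one-group-out result, averages the pseudo values to get the jackknifed value, and estimates its variance from their spread divided by k(k − 1). The function names and the small data set are illustrative assumptions. The demonstration uses the trivial example from the text, the variance estimate with divisor N, taking each observation as its own group.

```python
import statistics


def jackknife(groups, statistic):
    """Return (jackknifed value, estimated variance, pseudo values)."""
    k = len(groups)
    everything = [x for group in groups for x in group]
    y_all = statistic(everything)
    pseudo_values = []
    for j in range(k):
        # Recompute the statistic with the j-th group left out.
        remaining = [x for i, group in enumerate(groups) if i != j for x in group]
        y_omit_j = statistic(remaining)
        pseudo_values.append(k * y_all - (k - 1) * y_omit_j)
    y_star = sum(pseudo_values) / k
    variance_star = sum((p - y_star) ** 2 for p in pseudo_values) / (k * (k - 1))
    return y_star, variance_star, pseudo_values


def variance_divisor_n(xs):
    """Biased variance estimate: sum of squared deviations divided by N."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 11.0]   # made-up observations
groups = [[x] for x in data]                        # one observation per group
y_star, variance_star, _ = jackknife(groups, variance_divisor_n)
print(variance_divisor_n(data), y_star, statistics.variance(data))
```

For this statistic the bias is exactly of the form a/N, so jackknifing with single-observation groups recovers the familiar unbiased estimate with divisor N − 1.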
Example of the jackknife: ratio estimate. In expounding the use of ratio estimates, Cochran (1963, p. 156) gives 1920 and 1930 sizes (number of inhabitants) for each city in a random sample of 49 drawn from a population of 196 large U.S. cities. He wishes to estimate the total 1930 population for these 196 cities on the basis of the results of the sample of 49, whose 1920 and 1930 populations are both known, and from the total 1920 population. The example randomly groups his 49 cities into 7 sets of 7 each. Table 6 shows their subtotals.
The formula for the ratio estimate of the 1930 population total is

$$\frac{\text{1930 sample total}}{\text{1920 sample total}} \times (\text{1920 population total}),$$

so that the logarithm of the estimated 1930 population total is given by log (1930 sample total) − log (1920 sample total) + log (1920 population total). Consequently the jackknife is applied to z = log (1930 sample total) − log (1920 sample total), since this choice minimizes the number of multiplications and divisions.
|Table 6 – Subtotals in thousands for sets of 7 cities|
|Source: Cochran 1963, p. 156.|
Further computation is shown in Table 7 where in the “all” column the numbers 5,054 and 6,262 come directly from the totals of the previous table, and in the “i = 1” column the numbers 4,303 = 5,054 − 751 and 5,347 = 6,262 − 915 are the results of omitting the first 7 cities, and so on for the other columns. Five-place logarithms have obviously given more than sufficient precision, so that the pseudo values of z are conveniently rounded to three decimals. From these are computed the mean z* and the 95 per cent limits = mean ± allowance. Table 8 gives all the remaining details. The resulting point estimate is 28,300, about 100 lower than the unjackknifed estimate. (Since the correct 1930 total is 29,351, the automatic bias adjustment did not help in this instance. This is a reminder that bias is an “on the average” concept.) The limits on this estimate are ordinarily somewhat wider than would apply if each city had been used as a separate group, since the two-sided 95 per cent level for Student’s t with 6 degrees of freedom is $t_{.05}$ = 2.447, while with 47 degrees of freedom it is $t_{.05}$ = 2.012. The standard error found here was .0125 in logarithmic units, which converts to about 840 in the final total (4.360 + z* + s* = 4.464; antilog 4.464 = 29,140; 29,140 − 28,300 = 840). The conversion from logarithmic units to original units for the confidence interval represents an approximation that may not always be appropriate [see STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, article on TRANSFORMATIONS OF DATA]. (Further material on the jackknife can be found in Mosteller and Tukey 1966, sec. E.)
|Table 8 – Final computations for the ratio estimate*|
|* Base data: 1920 total = 22,919; log (1920 total) = 4.360; log total = log (1920 total) + log ratio.|
|QUANTITY||VALUE OF ESTIMATE||95 PER CENT CONFIDENCE INTERVALS|
|log ratio||z* = 0.092||0.062 to 0.123|
|log total||4.360 + z* = 4.452||4.422 to 4.483|
|total||antilog 4.452 = 28,300||26,000 to 30,400|
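The jackknife steps described above (leave out one group at a time, form pseudo values, then average them and attach a Student's t allowance) can be sketched in a few lines. This is only an illustration under stated assumptions: the seven pairs of group subtotals are hypothetical and are not the Table 6 data, the function name `jackknife_log_ratio` is invented for the sketch, and only the 1920 population total (22,919) and the 6-degrees-of-freedom t value (2.447) are taken from the text.

```python
import math

T_95_6DF = 2.447  # two-sided 95% point of Student's t with 6 d.f. (quoted in the text)


def jackknife_log_ratio(x_groups, y_groups, t_value=T_95_6DF):
    """Jackknife z = log10(sum of y) - log10(sum of x) over k groups.

    Returns (z_star, std_error, (lower, upper)), all in log10 units.
    """
    k = len(x_groups)
    x_all, y_all = sum(x_groups), sum(y_groups)
    z_all = math.log10(y_all) - math.log10(x_all)

    pseudo_values = []
    for x_i, y_i in zip(x_groups, y_groups):
        # z recomputed with group i omitted
        z_minus_i = math.log10(y_all - y_i) - math.log10(x_all - x_i)
        pseudo_values.append(k * z_all - (k - 1) * z_minus_i)

    z_star = sum(pseudo_values) / k
    variance_of_mean = sum((p - z_star) ** 2 for p in pseudo_values) / (k * (k - 1))
    std_error = math.sqrt(variance_of_mean)
    allowance = t_value * std_error
    return z_star, std_error, (z_star - allowance, z_star + allowance)


# Hypothetical subtotals (thousands) for 7 groups of cities; NOT the Table 6 data.
x_1920 = [700, 650, 820, 760, 690, 730, 710]
y_1930 = [860, 790, 1010, 930, 880, 900, 850]
population_1920 = 22_919  # 1920 population total quoted in the text

z_star, se, (lo, hi) = jackknife_log_ratio(x_1920, y_1930)
base = math.log10(population_1920)
print(f"point estimate: {10 ** (base + z_star):,.0f}")
print(f"95% limits:     {10 ** (base + lo):,.0f} to {10 ** (base + hi):,.0f}")
```

As in the text, the limits are formed in logarithmic units and only then converted back, so the converted interval is an approximation.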
Analytical difficulties. In analyzing data or planning for its analysis, the choice of a base for rates is not always obvious; comparing many things leads to biases that need adjustment, selection reduces correlation, and selection for excellence leads to disappointments. This section treats these matters.
Bases for rates. The investigator should think about more than one possible base for a percentage or a rate and consider the value of reporting results using different bases. Examples from accident statistics may suffice. Are young women safer drivers than young men? Yes: in the United States in 1966 insurance rates for young women were ordinarily lower because they caused less expensive damage. On the other hand, these rates were based on total disbursements in a fixed period of time. Young women may well drive much less than young men, and if so, their accident rate per mile may be the higher.
Coppin, Ferdun, and Peck (1965) sent a questionnaire on driving in 1963 to a sample of 10,250 California drivers who were aged 16 to 19½ at the beginning of the period. Based on the information from the 65 per cent of questionnaires returned, where respondents estimated mileage driven per week, and on accidents reported in the respondents’ Motor Vehicle Department files, the accident rates per 100,000 miles shown in Table 9 were found. On accidents, nonrespondents were very similar to respondents, but nonrespondents had considerably more violations. Since the mileage is estimated, the evidence is weak; but it seems to be the best available. Boys had more accidents per mile at 16, girls at 17, and after that their rates were nearly equal.
|Table 9 – Accidents and violations per 100,000 miles|
|Source: Coppin et al. 1965, pp. 27-28.|
How should airplane safety (or danger) be assessed? Deaths per million passenger miles, deaths per trip, and casualties per hour flown suggest themselves, and each can be supported.
In general, different answers may be appropriate for different questions, as was the case in the insurance companies’ view versus the accident-per-mile view of the safety of young drivers given above. Ease and economy may recommend giving several answers as well as the investigator’s judgment about their merits. In some problems, no resolution may be possible, and then the investigator would do well to admit it.
Problems of multiplicity. When methods of appraisal designed for single comparisons are used to compare many things, the multiplicity may mislead. When means of two samples drawn from the same normal population are compared, they differ by more than twice the standard deviation of their difference in less than 5 per cent of the sample pairs. Among ten sample means from the same population, some pair is more likely than not to differ this much (Table 11). Although statistics has come a long way in providing honest methods of making comparisons when there are many to be made, it has largely done this in the framework of a closed system, where the particular items to be compared have already been specified. For example, many workers have offered suitable ways to measure the significance not only of all possible differences but also of all possible linear contrasts (weighted sums, the weights adding to zero) on the same data. [See LINEAR HYPOTHESES, article on MULTIPLE COMPARISONS.]
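The contrast between one comparison and many can be checked by simulation. The sketch below is an illustration only: the function name, group size, trial count, and seed are arbitrary assumptions. It draws sample means from a single normal population and counts how often the largest and smallest differ by more than twice the standard deviation of a difference of two means, which is the same event as "some pair differs this much."

```python
import random


def prob_some_pair_differs(n_means, group_size=10, trials=20_000, seed=1):
    """Estimate P(largest mean - smallest mean > 2 * SD of a difference of two means)."""
    rng = random.Random(seed)
    sd_of_mean = (1 / group_size) ** 0.5        # unit-variance population
    sd_of_difference = (2 / group_size) ** 0.5  # SD of (mean A - mean B)
    hits = 0
    for _ in range(trials):
        # The mean of `group_size` standard normal draws is itself normal,
        # so it is sampled directly to keep the simulation fast.
        means = [rng.gauss(0, sd_of_mean) for _ in range(n_means)]
        if max(means) - min(means) > 2 * sd_of_difference:
            hits += 1
    return hits / trials


for k in (2, 5, 10):
    print(k, "means:", round(prob_some_pair_differs(k), 3))
```

With two means the estimated proportion should fall a little below 5 per cent; with ten it should exceed one half, in line with the statement in the text.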
Statistics has not yet provided a way to test the significance of results obtained by peeking at large bodies of data and developing hypotheses as one goes along. The facility of the human brain for rationalizing almost any observed fact immediately after its realization is something that cannot yet be allowed for. This means that it is rarely possible to validate a hypothesis on the same body of data that suggested it; usually new studies, on completely different data, are necessary to test such hypotheses (Mosteller & Tukey 1966, sec. B6).
Selection effects. Users of tests for purposes of selection (admission to college, personnel selection) often complain that the scores used to make the selection do not correlate well with the in-service performance of the individuals after selection. Possibly the chosen test does not give scores that correlate well with the performance being measured, but one must remember that when a population is truncated on one of its variables, the correlation of that variable with the others is likely to be reduced toward zero. To illustrate, suppose that freshman calculus grades Y and precourse examination grades X are bivariately normally distributed with correlation coefficient ρ. Suppose that only individuals whose pretest scores exceed a certain value X = x are admitted to the calculus course. This means that selection is based on the variable X with the criterion x. The new correlation ρ′ between the grades of those taking the course and their pretest scores would be given (see Cochran 1951, p. 453) as
$$\rho' = \rho\,\sqrt{\frac{1 - \dfrac{z}{p}\left(\dfrac{z}{p} - t\right)}{1 - \rho^{2}\,\dfrac{z}{p}\left(\dfrac{z}{p} - t\right)}},$$
where
p = proportion that the selected group is of the whole population,
t = standard normal deviate having proportion p to the right,
z = height of the standard univariate normal density at the position t.
If the proportion selected is p = 0.05, then t = 1.645 and z = 0.1031. Values of ρ′ for selected values of p and ρ are shown in Table 10. To return to the example, if pretest scores and calculus grades had originally been correlated ρ = 0.8, in the 5 per cent selected the correlation would drop to 0.44.
Table 10 shows that as the percentage truncated increases, the correlation in the remaining population slowly decreases from its initial value. For initial correlations between .1 and .8 the reduction is between a half and a third of the original correlation when 75 per cent of the population has been removed. A very rough approximation for ρ′ is (.7p + .3)ρ for p > .25 and 0 ≤ ρ ≤ .7. The new correlation decreases sharply for the higher initial correlation coefficients when more than 90 per cent of the population is deleted. Unfortunately, these results may be rather sensitive to the detailed shape of the bivariate population studied and so this bivariate normal example can only illustrate the possibilities.
|Table 10 – Values of ρ′ for various values of 100(1 – p), ρ pairs|
|100(1 – p) PER CENT TRUNCATED||VALUES OF ρ||100p PER CENT SELECTED|
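The effect of truncation on a bivariate normal correlation can be computed directly. The sketch below assumes the standard result for selection on X alone, ρ′ = ρ √[(1 − c) / (1 − ρ²c)] with c = (z/p)(z/p − t), where p is the proportion selected, t the standard normal deviate with proportion p to its right, and z the normal density at t. The function name is an invention of the sketch, and the values it prints are computed here, not copied from Table 10.

```python
import math
from statistics import NormalDist


def truncated_correlation(rho, p_selected):
    """Correlation remaining after keeping only the top `p_selected` share on X."""
    std_normal = NormalDist()
    t = std_normal.inv_cdf(1.0 - p_selected)  # cut-off in standard units
    z = std_normal.pdf(t)                     # normal density at the cut-off
    c = (z / p_selected) * (z / p_selected - t)  # 1 - (variance of X after selection)
    return rho * math.sqrt((1.0 - c) / (1.0 - rho ** 2 * c))


for p in (0.50, 0.25, 0.10, 0.05):
    row = ", ".join(f"rho={rho}: {truncated_correlation(rho, p):.2f}"
                    for rho in (0.2, 0.5, 0.8))
    print(f"{100 * p:4.0f}% selected -> {row}")
```

For ρ = 0.8 and 5 per cent selected this gives about 0.44, the figure quoted in the text.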
Regression effect. Suppose that a fallible measure selects from many individuals a few that appear to be best. On a reassessment based on fresh performance data, the selected ones will ordinarily not do as well as they originally appeared to do on the selection test. The reason is that performance varies and on the occasion of the test some individuals accidentally perform much better than their average and are selected. Happily, individuals selected to be worst do not do as badly on reassessment. This phenomenon is known as regression toward the mean; instances are sometimes called regression effects or shrinkage effects. To illustrate, Mosteller and Wallace (1964, p. 209) selected words and obtained weights for their rates of use with intent to discriminate between the writings of Alexander Hamilton and James Madison. Writers differ in their rate of use of such words as of, and, the, to, and upon. On the basis of the writings used for the selection and weighting of the word counts, the two statesmen’s writings were separated by 6.9 standard deviations. When the same words and weights were applied to fresh writings not used in selecting or weighting, the new writings were separated by 4.5 standard deviations—still good discrimination, but a loss of 2.4 standard deviations is substantial and illustrates well the effect. Losses are usually greatest among the poorer discriminants. Usages of the word upon originally separated the writings by 3.3 standard deviations, and did even better, 3.8, in the fresh validating materials; but a less effective set of words giving originally a separation of 1.3 standard deviations dropped to 0.3 on retesting. The lesson is that optimization methods (such as least squares and maximum likelihood) do especially well on just the data used to optimize. Plan for validation, and, where hopes are high for much gain from many small effects, prepare for disappointment.
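Regression toward the mean is easy to reproduce with made-up data. In the sketch below every individual has a fixed ability plus independent noise on each of two occasions; the best and worst 5 per cent are chosen on the first occasion and re-examined on the second. The model, parameter values, and names are assumptions chosen for illustration, not a reconstruction of the Mosteller and Wallace study.

```python
import random
import statistics


def regression_demo(n=20_000, fraction=0.05, noise_sd=1.0, seed=7):
    """Return {"best": (mean at selection, mean on retest), "worst": (...)}."""
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n)]
    test1 = [a + rng.gauss(0, noise_sd) for a in ability]  # selection occasion
    test2 = [a + rng.gauss(0, noise_sd) for a in ability]  # fresh reassessment

    order = sorted(range(n), key=lambda i: test1[i])
    k = int(n * fraction)
    groups = {"worst": order[:k], "best": order[-k:]}

    return {
        label: (statistics.fmean(test1[i] for i in members),
                statistics.fmean(test2[i] for i in members))
        for label, members in groups.items()
    }


for label, (at_selection, on_retest) in regression_demo().items():
    print(f"{label:>5}: selected at {at_selection:+.2f}, retest {on_retest:+.2f}")
```

The selected best fall back toward the overall mean on retest and the selected worst rise toward it, although nothing about the individuals has changed.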
Weights. If individuals are sampled to find out about their families, as in investigations carried out in schools, unless some account is taken of weights, a peculiar distribution may arise. For example, if a sample of girls is asked to report the numbers of sons and of daughters (including themselves) in their families, it turns out that the average number of daughters observed in the sample is approximately one more than the average number of sons. (More precisely, mathematics not given here shows the difference to be: [variance of number of daughters minus covariance of number of sons and daughters] divided by [average number of daughters]. When the distribution of the number of daughters is approximately Poisson and the numbers of sons and of daughters are independent, the ratio is approximately unity.) Essentially, families of three girls report three times as often as families with one girl, and families with no girls do not report at all. If account is taken of the dependence of frequency of reporting upon the number of daughters, this matter can be adjusted, provided information about families with no daughters is available or is not needed.
Similarly, in studying the composition of special groups, unless the analysis is done separately for each family size, one needs to remember that more children are first-born than second, and so on.
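The daughters-and-sons example can be checked by simulation. The sketch below assumes that numbers of sons and of daughters are independent Poisson variables with the same mean (an arbitrary 1.2 here) and that every girl reports, so that a family with d daughters appears d times. The helper names and all parameter values are hypothetical.

```python
import math
import random
import statistics


def poisson(rng, mean):
    """Poisson draw by Knuth's multiplication method; adequate for small means."""
    limit, count, product = math.exp(-mean), 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def averages_reported_by_girls(n_families=100_000, mean_each_sex=1.2, seed=3):
    """Average numbers of (sons, daughters) as reported by the girls themselves."""
    rng = random.Random(seed)
    sons_reported, daughters_reported = [], []
    for _ in range(n_families):
        sons = poisson(rng, mean_each_sex)
        daughters = poisson(rng, mean_each_sex)
        # Each daughter files one report; families with no daughters file none.
        sons_reported.extend([sons] * daughters)
        daughters_reported.extend([daughters] * daughters)
    return statistics.fmean(sons_reported), statistics.fmean(daughters_reported)


avg_sons, avg_daughters = averages_reported_by_girls()
print(f"sons {avg_sons:.2f}, daughters {avg_daughters:.2f}, "
      f"difference {avg_daughters - avg_sons:.2f}")
```

Under these assumptions the reported average number of daughters exceeds the reported average number of sons by about one, although the two sexes are equally common among families as a whole.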
Errors in calculation. A well-planned format for laying out calculations and careful checking aid in getting correct answers. To give a base line, a sample survey by the Internal Revenue Service (Farioletti 1952, pp. 65-78) found arithmetical errors in only 6 per cent of 160,000 personal income tax returns. Considering that the task is sometimes troublesome and often resented, this record appears good.
In scientific work, misreadings of numbers, misplaced decimals, errors in the application of formulas all take their toll. As a first step in the control of error, regard any unchecked calculation as probably wrong.
Overmechanization. Overmechanization of computing puts great pressure on the analyst to make one enormous run of the data and thereby economically get all the analyses he wished. Alas, one great sweep is never the way of good data analysis. Instead, we learn a little from each analysis and return again and again. To illustrate, in deciding whether to transform the data to square roots, logarithms, inverse sines, or reciprocals before launching on the major analysis, tests may be run for each function separately, leading to the choice of one or two transformations for use in the next stage. Otherwise the whole large calculation must be run too many times because there are many branch points in a large calculation with several choices available at each. Furthermore, data analysis requires extensive printout, little of which will be looked at; therefore the data analyst must resist the notion that the good computer user makes the machine do all the work internally and obtains very little printout. He must also resist the idea of having ever speedier programs at the cost of more and more time for programming and less and less for analysis. Fine programs are needed, but the cost of additional machine time from slow programs may be less than the cost of improvements in programming and of the waiting time before analysis can begin.
Possibly with the increase of time-sharing in high-speed computation and the handy packaging of general purpose programs for the analysis of data, the opportunities for making studied choices at each point in the analysis will become easier and less time-consuming.
Preserving data from erasure. After processing, data should usually be preserved in some form other than a single magnetic tape. Contrary to theory and rumor, magnetic tapes containing basic data are occasionally erased or made unusable in the high-speed computing process, and all the explanations in the world about how this could or could not have happened cannot restore a bit of information. One remedy is to have a spare tape with your data or program copied upon it. When disaster strikes, remember that few things seem more likely to recur than a rare event that has just happened, and so copy your spare tape before you submit it to the destroyer.
Hand copying. Since human copying is a major source of error, keep hand copying to a minimum and take advantage where possible of the high-speed computer’s ability to produce tables in immediately publishable form and of mechanical reproduction processes. Editing can be done by cutting, pasting, and painting out. When copying is necessary, checks of both column totals and row totals are believed superior to direct visual comparisons of individual entries with the manuscript.
Checks. Checking the programming and calculations of a high-speed computer presents a major unsolved problem. One might suppose that once a machine began producing correct answers, it always would thereafter. Not at all. It may respond to stimuli not dreamed of by the uninitiated. To find, for example, that it throws a small error into the fifth entry in the fourth column of every panel is disconcerting and scary, partly because small errors are hard to find and partly because one wonders whether undetected errors may still be present. Thorough and systematic checking is advised. Some ways are through sample problems; through fuller printout of the details of a problem already worked by hand; by comparing corresponding parts of several problems, including special cases whose answers are known by outside means; and by solving the problem in more than one way.
In addition to the checks on the final calculations, check the input data. For input punched on cards, for example, some process of verifying the punching is required. Methods of checking will vary with the problem. Partly redundant checking may not be wasteful. Look for impossible codes in columns, look for interchanges of columns. Try to set up checks for inconsistency in cards. (In Western cultures, nursery school children are not married, wives don’t have wives, and families with 42 children need verification.) Consider ways to handle blanks based on internal consistency.
In working with computers, be wary of the way symbols translate from keyboard to card or tape: dashes and minus signs, or zeros, letter O’s, and blanks, are a few sources of confusion. In dealing with numbers using, say, a two-digit field, a number such as 6, unless written 06, may wind up as 60 or as a meaningless character. The possibilities here are endless, but in a given problem it is usually worth organizing systematic procedures to combat these difficulties.
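Input checks of this kind are simple to automate. The sketch below is a hypothetical example, not a general validator: the field names, the marital-status code book, and the thresholds (under 6 years for a nursery school child, more than 20 children as "needs verification") are all assumptions made for illustration. It looks for impossible codes, fields that are not zero-padded two-digit numbers (which also catches a letter O typed for a zero), and inconsistencies between fields.

```python
import re

VALID_MARITAL_CODES = {"S", "M", "W", "D"}  # hypothetical code book
TWO_DIGITS = re.compile(r"[0-9]{2}")


def check_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []

    age_field = record.get("age", "")
    if TWO_DIGITS.fullmatch(age_field):
        age = int(age_field)
    else:
        age = None
        problems.append("age is not a zero-padded two-digit number")

    marital = record.get("marital", "")
    if marital not in VALID_MARITAL_CODES:
        problems.append("impossible marital code")
    elif age is not None and age < 6 and marital != "S":
        problems.append("child under 6 coded as ever married")

    children_field = record.get("children", "")
    if not TWO_DIGITS.fullmatch(children_field):
        problems.append("children is not a zero-padded two-digit number")
    elif int(children_field) > 20:
        problems.append("number of children needs verification")

    return problems


sample = [
    {"age": "06", "marital": "S", "children": "00"},
    {"age": "6 ", "marital": "S", "children": "00"},  # 6 not written as 06
    {"age": "O4", "marital": "S", "children": "00"},  # letter O for zero
    {"age": "04", "marital": "M", "children": "00"},  # inconsistent fields
    {"age": "35", "marital": "M", "children": "42"},  # needs verification
]
for rec in sample:
    print(rec, "->", check_record(rec) or "ok")
```

Partly redundant checks are deliberate here; a record can fail more than one test.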
Order-of-magnitude checking. When calculations are complete, order-of-magnitude checks are always valuable. Are there more people in the state of New York than in the United States? Does leisure plus work plus sleep take much more than 24 hours per day? Exercises in calculations of comparative orders of magnitude can be rewarding in themselves because new connections are sometimes made between the research and the rest of the subject matter.
Significant figures. Both hand and high-speed calculations require numbers to be carried to more places than seem meaningful and to more places than simple rules learned in childhood would suggest. These rules seem dedicated to rounding early so as not to exaggerate the accuracy of one’s result. But they may erase the signal with the noise. About the only reassuring rule for complex calculations is that if the important digits are the same when the calculation is carried to twice as many places, enough accuracy has likely been carried.
The old rules for handling significant figures come from a simplified idea that a number can report both its value and its accuracy at the same time. Under such rules the numbers 3.26 and 0.0326 were thought of as correct to within half a unit in the last place. Sometimes in mathematical tables this approach is satisfactory. For data-based numbers, the uncertainty in a number has to be reported separately.
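The advice to repeat a calculation with twice as many places can be tried with Python's `decimal` module, which lets the number of significant digits carried be set explicitly. The sketch below is an illustration with made-up readings: it computes a sample variance by the one-pass "sum of squares minus correction term" formula, a layout in which early rounding is especially damaging. The function name and the data are assumptions.

```python
from decimal import Context, Decimal


def variance_at_precision(readings, digits):
    """Sample variance by the one-pass formula, carrying `digits` significant digits."""
    ctx = Context(prec=digits)
    n = Decimal(len(readings))
    total = Decimal(0)
    total_of_squares = Decimal(0)
    for text in readings:
        x = ctx.create_decimal(text)
        total = ctx.add(total, x)
        total_of_squares = ctx.add(total_of_squares, ctx.multiply(x, x))
    correction = ctx.divide(ctx.multiply(total, total), n)
    return ctx.divide(ctx.subtract(total_of_squares, correction),
                      ctx.subtract(n, Decimal(1)))


readings = ["1000.1", "1000.2", "1000.3"]  # made-up measurements
for digits in (6, 12, 24):
    print(digits, "digits carried:", variance_at_precision(readings, digits))
```

Carrying six digits rounds the signal away with the noise and returns a variance of zero; twelve and twenty-four digits agree on 0.01, which by the rule in the text suggests that twelve places were enough for this calculation.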
One-of-a-kind calculations. One-of-a-kind calculations, frequent in scientific reports, are especially error prone, both because the investigator may not set up a standard method of calculation, complete with checks, and because he does not have the aid of comparisons with other members of a long sequence. For example, some pollsters believed that their wrong forecast about a vote would have been close had proper weighting for household size been applied, a claim worth checking. Their ultimate error was in thinking that this claim was right. How did they make it? Pages of weightings carefully checked down to, but not including, the final estimate showed no error in their reanalysis. But their one-of-a-kind calculation leading to the final estimate was a ratio composed of an inappropriate numerator and an inappropriate denominator grabbed from the many column totals. By accident this meaningless quotient gave a number nearly identical with that produced by the voters. And who checks further an answer believed to be correct? Actually, the weighting for household size scarcely changed their original forecast. The moral is that the one-of-a-kind calculation offers grave danger.
Consequently, each new calculation can well be preceded by a few applications of the method to simple made-up examples until the user gets the feel of the calculation, of the magnitudes to be expected, and of a convenient way to lay out the procedure. Having someone else check the calculation independently requires that the investigator not teach the verifier the original mistakes. Yates has suggested that, in a large hand calculation, independence could nearly be preserved when different individuals calculate on separate machines in parallel in two different numerical units; for example, one computes in dollars, the other in pounds. At the end, the final answers are converted for comparison.
Gross errors in standard deviations. Since the sample range w (largest measurement minus smallest measurement) is easy to compute, it is often used to check the more complicated calculation of a sample standard deviation s. In the same sample, the ratio w/s must lie between the lower and upper bounds given in Table 11 or else the range, sample standard deviation, or quotient is in error. The table shows the 2.5 per cent and 97.5 per cent points of the distribution of w/s for a normal distribution. When calculations lead to ratios falling outside these limits but inside the bounds, they are not necessarily wrong; but further examination may pay.
Table 11 also shows the median of the distribution of the range of a sample of size n drawn from a standard normal distribution. It gives one an idea of the spread, measured in standard deviations, to be expected of the sample means of n equal-sized groups whose population means are identical. Note that through n = 20 a rough rule is that the median distance between the largest and smallest sample mean is √n · σx̄.
|Table 11 – Bounds on the ratio: range/standard deviation (a)|
|Lower bound||w/s||Upper bound||Median of w/σ|
|a. Lower bound, 2.5% point, 97.5% point, and upper bound for the ratio: range/sample standard deviation (w/s); median of the distribution of the ratio: range/population standard deviation (w/σ); the sample size is n. The upper and lower bounds apply to any distribution and sampling method; the percentage points and the median are computed for random sampling from a normal distribution, but they should be useful for other distributions.|
|b. This is not an error. The median is expressed as the multiplier of the population standard deviation, whereas the bounds relate range and sample standard deviation.|
|c. The mean of the distribution is given as an approximation to the median because the latter is not available.|
|Sources: Pearson & Stephens 1964, p. 486, for lower and upper bounds and for 2.5% and 97.5% points; Harter 1963, pp. 162-164, for medians; Pearson & Hartley 1954, p. 174, for means.|
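The range check on a standard deviation can be automated. The sketch below does not reproduce Table 11; it assumes the usual closed forms for the distribution-free bounds when s is computed with divisor n − 1, namely an upper bound of √(2(n − 1)) and a lower bound of 2√((n − 1)/n) for even n or 2√(n/(n + 1)) for odd n. The function names and the data are hypothetical.

```python
import math
import statistics


def range_sd_bounds(n):
    """Distribution-free (lower, upper) bounds on w/s, with s using divisor n - 1."""
    upper = math.sqrt(2 * (n - 1))
    if n % 2 == 0:
        lower = 2 * math.sqrt((n - 1) / n)
    else:
        lower = 2 * math.sqrt(n / (n + 1))
    return lower, upper


def sd_is_consistent_with_range(values, reported_sd, tol=1e-9):
    """False means the range, the reported standard deviation, or both must be wrong."""
    lower, upper = range_sd_bounds(len(values))
    ratio = (max(values) - min(values)) / reported_sd
    return lower - tol <= ratio <= upper + tol


data = [2, 4, 4, 5, 7, 9]            # made-up measurements
s = statistics.stdev(data)
print(range_sd_bounds(len(data)))
print(sd_is_consistent_with_range(data, s))       # correctly computed s
print(sd_is_consistent_with_range(data, 10 * s))  # misplaced decimal point
```

A reported standard deviation that fails this test is certainly in error; one that passes may still deserve a second look if the ratio falls outside the normal-theory percentage points.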
Reporting. When writing the final report, remember that making clear the frame of reference of a study helps the reader understand the discussion.
Need for full reporting. In reporting on the investigation, be sure to give detailed information about the populations studied, the operational definitions used, and the exceptions to the general rules. Unless the details are carefully reported, they are quickly forgotten and are soon replaced by cloudy fancies. Discussions of accuracy, checks, and controls are needed in the final report.
Full and careful reporting can lead to ample prefaces, numerous appendixes, some jargon, and lengthy discussions. Shrink not from these paraphernalia, so amusing to the layman, for without them the study loses value; it is less interpretable, for it cannot be properly compared with other studies. Jargon may be the price of brevity.
The reader may object that editors will not allow such full reporting. Certainly the amount of detail required does vary with the sort of report to be made. Many studies that are published in short reports turn out to present a long sequence of short articles, and these, in one place or another, can give the relevant details.
Try to go beyond bare-bones reporting by giving readers your views of the sorts of populations, circumstances, or processes to which the findings of the study might apply. Warn the reader about generalizations that you are wary of but that he, on the basis of your findings, might reasonably expect to hold. While such discussions can be criticized as speculation, you owe it to the reader to do your best with them and to be as specific as you can be.
Beyond all this, where appropriate, do write as nontechnical a summary as you can for the interested public.
Suppression of data. In pursuit of a thesis, even the most careful may find it easy to argue themselves into the position that the exceptions to the desired proposition are based upon poorer data, somehow do not apply, would be too few to be worth reporting if one took the trouble to look them up, would mislead the simpleminded if reported, and therefore had best be omitted. Whether or not these views are correct, and some of them may well be, it is preferable to present the whole picture and then to present one’s best appraisal of all the data. The more complete record puts readers in a much better position to consider both the judgments and the proposition.
[Directly related are the entries EXPERIMENTAL DESIGN; FALLACIES, STATISTICAL; SAMPLE SURVEYS.]
BARTHOLOMEW, D. J. 1961 A Method of Allowing for “Not-at-home” Bias in Sample Surveys. Applied Statistics 10:52-59.
BILLEWICZ, W. Z. 1965 The Efficiency of Matched Samples: An Empirical Investigation. Biometrics 21:623-644.
CANTRIL, HADLEY (1944) 1947 The Use of Trends. Pages 220-230 in Hadley Cantril, Gauging Public Opinion. Princeton Univ. Press.
CANTRIL, HADLEY 1966 The Pattern of Human Concerns. New Brunswick, N.J.: Rutgers Univ. Press.
COCHRAN, WILLIAM G. 1951 Improvement by Means of Selection. Pages 449-470 in Berkeley Symposium on Mathematical Statistics and Probability, Second, Proceedings. Edited by Jerzy Neyman. Berkeley: Univ. of California Press.
COCHRAN, WILLIAM G. (1953) 1963 Sampling Techniques. 2d ed. New York: Wiley.
COCHRAN, WILLIAM G.; MOSTELLER, FREDERICK; and TUKEY, JOHN W. 1954 Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. Washington: American Statistical Association.
COPPIN, R. S.; FERDUN, G. S.; and PECK, R. C. 1965 The Teen-aged Driver. California, Department of Motor Vehicles, Division of Administration, Research and Statistics Section, Report 21.
EHRENBERG, A. S. C. 1960 A Study of Some Potential Biases in the Operation of a Consumer Panel. Applied Statistics 9:20-27.
FARIOLETTI, MARIUS 1952 Some Results From the First Year’s Audit Control Program of the Bureau of Internal Revenue. National Tax Journal 5, no. 1:65-78.
FELDMAN, JACOB J. 1960 The Household Interview Survey as a Technique for the Collection of Morbidity Data. Journal of Chronic Diseases 11:535-557.
HANSEN, MORRIS H.; and HURWITZ, WILLIAM N. 1946 The Problem of Non-response in Sample Surveys. Journal of the American Statistical Association 41:517-529.
HARBISON, FREDERICK; and MYERS, CHARLES A. 1964 Education, Manpower, and Economic Growth: Strategies of Human Resource Development. New York: McGraw-Hill.
HARTER, H. LEON 1963 The Use of Sample Ranges and Quasi-ranges in Setting Exact Confidence Bounds for the Population Standard Deviation. II. Quasi-ranges of Samples From a Normal Population—Probability Integral and Percentage Points; Exact Confidence Bounds for σ → ARL 21, Part 2. Wright-Patterson Air Force Base, Ohio: U.S. Air Force, Office of Aerospace Research, Aeronautical Research Laboratories.
HENDRICKS, WALTER A. 1956 The Mathematical Theory of Sampling. New Brunswick, N.J.: Scarecrow Press.
KATZ, DANIEL 1946 The Interpretation of Survey Findings. Journal of Social Issues 2, no. 2:33-44.
KISH, LESLIE 1965 Survey Sampling. New York: Wiley.
KISH, LESLIE; and HESS, IRENE 1959 A “Replacement” Procedure for Reducing the Bias of Nonresponse. American Statistician 13, no. 4:17-19.
KISH, LESLIE; and LANSING, JOHN B. 1954 Response Errors in Estimating the Value of Homes. Journal of the American Statistical Association 49:520-538.
KLUCKHOHN, FLORENCE R.; and STRODTBECK, FRED L. 1961 Variations in Value Orientations. Evanston, Ill.: Row, Peterson.
LAWLEY, D. N.; and SWANSON, Z. 1954 Tests of Significance in a Factor Analysis of Artificial Data. British Journal of Statistical Psychology 7:75-79.
LEUCHTENBURG, WILLIAM E. 1963 Franklin D. Roosevelt and the New Deal: 1932-1940. New York: Harper. → A paperback edition was published in the same year.
LEVINE, SOL; and GORDON, GERALD 1958-1959 Maximizing Returns on Mail Questionnaires. Public Opinion Quarterly 22:568-575.
MOSES, LINCOLN E.; and OAKFORD, ROBERT V. 1963 Tables of Random Permutations. Stanford Univ. Press.
MOSTELLER, FREDERICK; and BUSH, ROBERT R. (1954) 1959 Selected Quantitative Techniques. Volume 1, pages 289-334 in Gardner Lindzey (editor), Handbook of Social Psychology. Cambridge, Mass.: Addison-Wesley.
MOSTELLER, FREDERICK; and TUKEY, JOHN W. 1966 Data Analysis, Including Statistics. Unpublished manuscript. → To be published in the revised edition of the Handbook of Social Psychology, edited by Gardner Lindzey and Elliot Aronson.
MOSTELLER, FREDERICK; and WALLACE, DAVID L. 1964 Inference and Disputed Authorship: The Federalist. Reading, Mass.: Addison-Wesley.
NETER, JOHN; and WAKSBERG, JOSEPH 1964a Conditioning Effects From Repeated Household Interviews. Journal of Marketing 28, no. 2:51-56.
NETER, JOHN; and WAKSBERG, JOSEPH 1964b A Study of Response Errors in Expenditures Data From Household Interviews. Journal of the American Statistical Association 59:18-55.
PEARSON, E. S.; and HARTLEY, H. O. (editors) (1954) 1958 Biometrika Tables for Statisticians. Volume 1, 2d ed. Cambridge Univ. Press.
PEARSON, E. S.; and STEPHENS, M. A. 1964 The Ratio of Range to Standard Deviation in the Same Normal Sample. Biometrika 51:484-487.
PEIRCE, CHARLES S. 1873 On the Theory of Errors of Observations. U.S. Coast and Geodetic Survey, Report of the Superintendent: 200-224.
ROETHLISBERGER, FRITZ J.; and DICKSON, WILLIAM J. (1939) 1961 Management and the Worker: An Account of a Research Program Conducted by the Western Electric Company, Hawthorne Works, Chicago. Cambridge, Mass.: Harvard Univ. Press. → A paperback edition was published in 1964 by Wiley.
RUGG, DONALD; and CANTRIL, HADLEY (1944) 1947 The Wording of Questions. Pages 23-50 in Hadley Cantril, Gauging Public Opinion. Princeton Univ. Press.
SCOTT, CHRISTOPHER 1961 Research on Mail Surveys. Journal of the Royal Statistical Society Series A 124:143-205.
SHARP, HARRY; and FELDT, ALLAN 1959 Some Factors in a Probability Sample Survey of a Metropolitan Community. American Sociological Review 24:650-661.
SOBOL, MARION G. 1959 Panel Mortality and Panel Bias. Journal of the American Statistical Association 54:52-68.
SUDMAN, SEYMOUR 1964 On the Accuracy of Recording Consumer Panels: I and II. Journal of Marketing Research 1, no. 2:14-20; 1, no. 3:69-83.
TUCKER, LEDYARD R. 1964 Recovery of Factors From Simulated Data. Unpublished manuscript.
WAKSBERG, JOSEPH; and PEARL, ROBERT B. 1964 The Effects of Repeated Household Interviews in the Current Population Survey. Unpublished manuscript.
WHITING, BEATRICE B. (editor) 1963 Six Cultures: Studies of Child Rearing. New York: Wiley.
WILSON, EDWIN B.; and HILFERTY, MARGARET M. 1929 Note on C. S. Peirce’s Experimental Discussion of the Law of Errors. National Academy of Sciences, Washington, D.C., Proceedings 15:120-125.
All physical and social laws or models rest ultimately upon assumptions. These laws do not yield exact numerical statements. Even the much admired exactness of the physicists’ laws means only very close approximation—how close depends upon the circumstances. So, too, techniques for statistical analysis require assumptions about the data to justify the use of the techniques in particular situations. When these assumptions are not correct for the data under study, the results of the statistical analysis may be very misleading. This article discusses the effects on statistical analysis of incorrect assumptions and considers some ways of mitigating the problem. The discussion is set in the frameworks of the matched-pairs design, a time-series design, the one-way analysis of variance, and a repeated-measurements design.
The matched-pairs design
In the matched-pairs design two treatments or conditions are studied by assigning one treatment to each member of a pair of matched individuals. For example, a department of Slavic languages is interested in finding out whether one of two different teaching methods for a first-year language course is better than the other. A language aptitude and proficiency examination is given on the first day of class, and the scores on these tests are used to pair students who have approximately the same aptitude and proficiency. Then, for each pair of students, one student is randomly assigned to one teaching method and the other student is assigned to the other method. An examination is given at the end of the term to determine whether differences exist between the teaching methods.
Comparisons between the treatments are based on the difference between the responses to the treatments within each pair. Thus the data consist of the n differences X1, X2, …, Xn, where Xj denotes the difference between the response scores in the jth pair. It is assumed that X1, X2, …, Xn constitute a simple random sample from some population; possible further assumptions will be discussed in the next section.
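The layout of a matched-pairs experiment can be sketched as follows. The pairing rule used here (adjacent students after sorting on the pretest) is one simple way to obtain pairs with approximately equal scores and is an assumption of the sketch, as are the function names, the student labels, and all the scores.

```python
import random


def assign_matched_pairs(pretest_scores, seed=None):
    """Pair students with adjacent pretest scores; randomize the method within each pair."""
    rng = random.Random(seed)
    ranked = sorted(pretest_scores, key=pretest_scores.get)
    pairs = []
    for i in range(0, len(ranked) - 1, 2):  # an odd student out is left unpaired
        first, second = ranked[i], ranked[i + 1]
        if rng.random() < 0.5:              # coin toss decides who gets method 1
            first, second = second, first
        pairs.append({"method_1": first, "method_2": second})
    return pairs


def pair_differences(pairs, final_scores):
    """X_j = (final score under method 1) - (final score under method 2) for pair j."""
    return [final_scores[p["method_1"]] - final_scores[p["method_2"]] for p in pairs]


# Hypothetical pretest and end-of-term scores.
pretest = {"a": 50, "b": 71, "c": 52, "d": 70, "e": 90, "f": 88}
final = {"a": 61, "b": 80, "c": 58, "d": 75, "e": 93, "f": 95}

pairs = assign_matched_pairs(pretest, seed=0)
print(pairs)
print(pair_differences(pairs, final))
```

The list returned by `pair_differences` plays the role of X1, X2, …, Xn in the discussion that follows.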
A social scientist may want to make inferences about several features of the probability distribution underlying his matched-pairs experiment. He may ask which estimators or formulas should be used to estimate the unknown mean or median μ. In addition, he may want to know which significance test to use to find out whether the treatments differ. The following sections indicate how answers to these questions may be obtained.
Criteria for point estimation. An investigator is frequently faced with a dilemma in his choice of an estimator. Suppose in the matched-pairs design he wants to estimate μ, the mean difference between treatments. If he is willing to assume that the observations are randomly drawn from a normal distribution, then the sample mean X̄ = (1/n)ΣXj is the unique “best” unbiased estimator for μ.
Evidence may exist, however, that although the underlying distribution is symmetrical, there are too many extreme values for the normality assumption to be correct. For distributions of this long-tailed kind, the sample mean may have a very large variance. The sample median is generally a reasonable estimator to use for such distributions but has a higher variance than the sample mean for more nearly normal distributions. How can one achieve a reasonable compromise—an estimator that is good under reasonable assumptions?
Here, consideration is restricted to long-tailed symmetrical distributions and, in particular, to compound normal distributions that arise in the following way: (1) An observation is randomly drawn from one of two normal populations. (2) With probability 1 − τ, this observation is randomly drawn from a normal population that has mean μ and variance σ². (3) With probability τ, this observation is randomly drawn from another normal population having mean μ and variance K²σ². The values of K and τ that are considered are K ≥ 2 and 0 ≤ τ ≤ .10. In short, the compound normal distributions considered are mixtures of two normal distributions with a common mean. [See DISTRIBUTIONS, STATISTICAL, article on MIXTURES OF DISTRIBUTIONS.]
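A useful way to compare any two unbiased estimators of the same parameter is to compute their efficiency. The efficiency of estimator 1 relative to estimator 2 is defined as
$$\text{efficiency} = 100 \times \frac{\operatorname{var}(\text{estimator 2})}{\operatorname{var}(\text{estimator 1})}\ \text{per cent}.$$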
An estimator is chosen that compares favorably in terms of efficiency with its competitors over the range of plausible distributions. An estimator has robustness of efficiency relative to another if the above ratio does not dip far below 100 per cent for plausible alternatives.
Estimators of the mean (or median). One possible compromise between the sample mean and the sample median is
$$\frac{X_{(2)} + X_{(3)} + \cdots + X_{(n-1)}}{n-2},$$
where X(j) is the jth smallest observation; that is, the largest and smallest observations are discarded and the mean of the remaining observations computed. [See NONPARAMETRIC STATISTICS, article on ORDER STATISTICS.]
In general, an arbitrary percentage of the observations may be discarded. Define the α per cent trimmed mean, X̄α, as the mean of the observations remaining after the smallest (α/100)n observations and the largest (α/100)n observations are excluded. The 0 per cent and 50 per cent trimmed means are the sample mean and the sample median, respectively; thus, in particular, X̄0 = X̄.
Efficiency comparisons of the 0 and 6 per cent trimmed means for the compound normal distribution in large samples with K = 3 and 0 ≤ τ ≤ .10 are given by Tukey (1960). He shows, for example, that the 0 per cent trimmed mean X̄ is the best possible estimator if τ = 0 (that is, if all observations are from one population), but that, even in this extreme case, the 6 per cent trimmed mean X̄6 has efficiency 97 per cent relative to X̄. On the other hand, if τ = .05, X̄6 has approximately 143 per cent efficiency with respect to X̄. These computations (and many more) indicate that there is more to gain than to lose by discarding some extreme observations. It is important to add that in many problems the study of extreme observations may give important clues to the improvement of the experimental or observational technique. [See STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, article on OUTLIERS.]
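The trimmed mean and its efficiency under a compound normal distribution can be explored with a small simulation. The following is only a sketch: the function names, the sample size n = 100, the number of replications, and the seed are illustrative assumptions, and the trimming count is taken as the whole number of observations in (α/100)n. It is a Monte Carlo approximation, not a reproduction of Tukey's large-sample computations.

```python
import random
import statistics


def trimmed_mean(xs, alpha):
    """Mean after dropping int(alpha * n / 100) observations from each end."""
    n = len(xs)
    g = int(alpha * n / 100)
    if 2 * g >= n:
        # Everything would be trimmed: fall back to the sample median.
        return statistics.median(xs)
    return statistics.fmean(sorted(xs)[g:n - g])


def compound_normal(rng, mu, sigma, K, tau):
    """One draw: N(mu, sigma^2) with prob. 1 - tau, N(mu, K^2 sigma^2) with prob. tau."""
    scale = K * sigma if rng.random() < tau else sigma
    return rng.gauss(mu, scale)


def efficiency(alpha, tau, K=3.0, n=100, reps=2000, seed=1):
    """Monte Carlo estimate of var(sample mean) / var(alpha per cent trimmed mean)."""
    rng = random.Random(seed)
    means, trimmed = [], []
    for _ in range(reps):
        xs = [compound_normal(rng, 0.0, 1.0, K, tau) for _ in range(n)]
        means.append(statistics.fmean(xs))
        trimmed.append(trimmed_mean(xs, alpha))
    return statistics.pvariance(means) / statistics.pvariance(trimmed)


if __name__ == "__main__":
    for tau in (0.0, 0.05, 0.10):
        print(tau, round(efficiency(6, tau), 2))
```

A ratio above 1 means the trimmed mean had the smaller variance in the simulated samples; the exact figures depend on n, the seed, and the number of replications.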
Test and confidence interval criteria. A test is to be chosen to compare the null and alternative hypotheses
$$H_0\colon \mu = 0 \quad\text{and}\quad H_1\colon \mu > 0 \ \ (\text{or } H''_1\colon \mu \neq 0),$$
respectively. From the test, confidence intervals for μ are to be obtained. The one-sided and two-sided t tests are the “best” tests of H0 against H1 or H″1 if the underlying distribution is normal, but the goal is to choose a good test under less stringent assumptions.
Two requirements for a good test are that it must possess robustness of validity and robustness of efficiency over the range of plausible underlying distributions (Box & Tiao 1964; Tukey 1962).
|Table 1 – Values of γ1 and γ2 for some familiar distributions|
|Chi-square (χ²df) distribution*: df = 1, df = 5, df = 10||Compound normal distribution (K = 3): τ = 0, τ = .05, τ = .10|
|* Degrees of freedom denoted by df.|
A statistical test or confidence interval has validity if the basic probability statements asserted for the procedure are correct or nearly so. Thus, from t tables, the one-sided t test with 9 degrees of freedom has probability .05 of exceeding 1.833 under the null hypothesis μ = 0. This statement is valid if the normal assumption holds. But suppose the normality assumption is in error. Can t tables still be used to find the probability that t will be greater than an arbitrary value t0 for plausible underlying distributions? If the answer to this question is “yes,” then the one-sided t test and associated confidence intervals are said to have robustness of validity. Robustness of validity has not been defined rigorously. A quantitatively precise definition of robustness of validity is difficult to give, since it must depend upon the interpretation of the outcome of a significance test in the given experiment.
In addition to possessing robustness of validity, the test should be a good discriminator between the hypotheses. The discriminating ability of a test is measured by its power, which is the probability of rejecting H0 given that H1 (or H″1) is true [see HYPOTHESIS TESTING]. Both the one-sided and two-sided t tests have the strong property that their power is higher than the power of any other reasonable test if the normality assumption obtains. This is no longer true for nonnormal distributions, however, and competitors must be sought.
Thus a way to compare two tests is necessary. It is natural to make such comparisons by defining a concept of relative efficiency for two tests, 1 and 2, representable by a numerical index e(1, 2). Efficiency of tests and estimators are related concepts [see NONPARAMETRIC STATISTICS for a discussion of efficiency]. If e(1, 2) is greater than one, test 1 is more powerful than test 2. A test is said to have robustness of efficiency if its efficiency relative to its competitors is not appreciably below one for credible alternative distributions.
Tests and confidence intervals. The information available on the validity and efficiency properties of the t test, the Wilcoxon signed-rank test, and the sign test under nonnormality is now summarized [see NONPARAMETRIC STATISTICS for these tests]. The trimmed t promises to be a strong competitor to the preceding tests (Tukey & McLaughlin 1963).
The one-sided and two-sided t tests are valid in large samples, although this validity does not extend to very high significance levels, such as .001 and .0001 (see Hotelling 1961).
The validity of the t test will be considered for two different nonnormal distributions. First, assume that the nonnormal distribution has the compound normal form considered above. Second, the nonnormal distribution is assumed to be an Edgeworth distribution with skewness parameter γ1 and kurtosis (or peakedness) parameter γ2.
Langley and Elashoff (1966) have conducted a Monte Carlo investigation of the performance of the one-sided t test. One thousand samples of n (n = 6, 9, 16) observations each were taken from compound normal distributions with K = 3 and τ = 0.0, 0.20, 0.40. In all these situations, the empirical probability of t being greater than the normal theory .05 point was between .04 and .06.
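A sampling experiment of this general kind is easy to sketch. The code below is not the Langley and Elashoff program; it is a minimal illustration that assumes n = 10 (so that the 9 degrees of freedom critical value 1.833 quoted earlier applies), σ = 1, and an arbitrary seed and number of replications.

```python
import math
import random
import statistics


def t_statistic(xs):
    """One-sample t statistic for the null hypothesis that the mean is 0."""
    n = len(xs)
    return statistics.fmean(xs) / (statistics.stdev(xs) / math.sqrt(n))


def exceedance_rate(tau, K=3.0, n=10, crit=1.833, reps=4000, seed=2):
    """Fraction of simulated samples with t >= crit when mu = 0.

    Each observation has standard deviation K with probability tau
    and standard deviation 1 otherwise (compound normal, common mean 0).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0.0, K if rng.random() < tau else 1.0) for _ in range(n)]
        if t_statistic(xs) >= crit:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    for tau in (0.0, 0.20, 0.40):
        print(tau, exceedance_rate(tau))
```

With τ = 0 the rate should settle near the nominal .05; for τ > 0 the departure from .05 gives a rough empirical reading of robustness of validity for this one design.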
The effects on the one-sided t test when sampling is from an Edgeworth population with parameters γ1, γ2 will be studied next. For symmetrical distributions γ1 = 0, while γ1 > 0 for distributions with a long right tail; for normal distributions γ2 = 0, but γ2 > 0 for bell-shaped symmetrical distributions with long tails and γ2 < 0 for similar distributions with short tails (see Scheffé 1959, pp. 331-333). In order to provide some feel for the descriptive meaning of the γ1, γ2 parameters, Table 1 gives values of γ1 and γ2 for several nonnormal distributions. For normal distributions γ1 = γ2 = 0.
Table 2 indicates the performance of the one-sided t test on a sample of 10. Each entry denotes the probability that t ≥ 1.833 if the null hypothesis that μ = 0 is true (under normal theory this probability is .05). The performance of the one-tailed t test as shown in Table 2 may be summarized as follows: (1) The true significance level α is always slightly less than .05 for long-tailed symmetrical distributions and slightly greater than .05 in short-tailed symmetrical distributions. (2) Skewness (γ1) is more important than kurtosis (γ2). In fact, the true significance level is almost constant in each column. A long right tail means that the true significance level is much less than .05; a long left tail leads to a true significance level much greater than .05. The skewness values covered in Table 2 represent very moderate skewness. The two-sided t test is less affected by skewness and kurtosis (see Srivastava 1958).
|Table 2 – Probability that t ≥ 1.833|
|Source: Srivastava 1958, p. 427.|
The Wilcoxon signed-rank test is the most frequently used competitor to the t test. It was designed to have perfect robustness of validity with respect to significance level for symmetrical distributions; its validity in asymmetrical distributions is unknown. The sign test has perfect robustness of validity with respect to significance level for both symmetrical and asymmetrical distributions.
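Both competitors can be computed exactly for small samples without tables. The sketch below is illustrative only: the function names are made up for this example, zero differences are dropped in the sign test, and the signed-rank version assumes no zero differences and no ties among the absolute differences.

```python
from math import comb


def sign_test_p(diffs):
    """Two-sided exact sign test of the hypothesis that the median difference is 0."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    pos = sum(d > 0 for d in nonzero)
    k = max(pos, n - pos)
    # Under the null hypothesis the number of positive signs is Binomial(n, 1/2).
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def signed_rank_p(diffs):
    """Two-sided exact Wilcoxon signed-rank test (no zeros, no tied |d| assumed)."""
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    total = n * (n + 1) // 2
    # counts[s] = number of sign assignments whose positive ranks sum to s.
    counts = [1] + [0] * total
    for r in range(1, n + 1):
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    w = max(w_plus, total - w_plus)
    tail = sum(counts[w:]) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    d = [1.5, -0.5, 2.0, 3.0]
    print(sign_test_p(d), signed_rank_p(d))
```

The null distributions used here depend only on signs and ranks, which is why the significance levels do not depend on the shape of a symmetrical underlying distribution.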
Some efficiency computations are reported here [they use the definition of large sample efficiency given in the article NONPARAMETRIC STATISTICS, which also discusses the properties of these tests]. Table 3 presents some efficiency computations of the Wilcoxon and sign tests relative to the t test when the underlying distribution is a compound normal type. If the underlying distribution is normal, the corresponding efficiencies are e(w, t) = .955 and e(s, t) = .636.
The preceding validity analysis and suggestive efficiency study permit these recommendations: (1) The Wilcoxon signed-rank test should be used for symmetrical distributions with moderately long tails. If confidence intervals are desired, the Walsh procedure should be employed. (2) If samples are large and the underlying symmetrical distribution has long tails, the routine use of the Wilcoxon test and Walsh confidence procedure may require a large computing cost. In these instances, if the distribution has only moderately long tails, the t test and its confidence procedure should be a reasonable compromise. If the distribution has very long tails, the sign test and its associated confidence interval procedure provide a reasonable compromise.
|Table 3 – Efficiency of the Wilcoxon and sign tests relative to the t test|
|a. The efficiency of the Wilcoxon signed-rank test relative to the t test.|
|b. The efficiency of the sign test relative to the t test.|
A simple time-series design
Nonrandomness among observations can occur in two ways: (1) the observations may not be independent or (2) the observations may not have a common distribution. Each type of nonrandomness will be considered below to show the important effects such nonrandomness may have on statistical methods based upon the assumption of randomness.
Dependence among the observations. A psychologist observes an individual’s response at n points in time. Suppose that Xt, t = 1, 2, …, n, denotes the individual’s response at time t. The psychologist assumes that the response has a linear regression over time; that is,
$$X_t = \alpha + \beta\left[\frac{t}{n} - \frac{n+1}{2n}\right] + e_t, \qquad t = 1, 2, \ldots, n. \tag{1}$$
(The use of (t/n) − (n + 1)/(2n) instead of just t represents merely a convenient coding of the t values. In particular, (n + 1)/(2n) is just the average of the (t/n)’s: (n + 1)/(2n) = (1/n)Σ(t/n), the sum running over t = 1, …, n.) The et may represent errors of measurement or errors in the assumption of linear regression and are assumed to be random. The psychologist suspects that the et, and hence the Xt, may be correlated; that the et all have the same distribution is not questioned here. The goal of the experiment is to estimate and test hypotheses about the intercept, α, and the slope, β.
In the estimation problem only two estimators for α and two estimators for β will be studied here. Furthermore, each estimator is a linear combination of the X’s (a1X1 + a2X2 + … + anXn). The reasons for this restriction are that such estimators have been studied most thoroughly and that under normality they have optimal properties. Johnston (1963) studies the estimation problem in some detail.
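Suppose, at first, the psychologist believes that there is no dependence among the observations. Then reasonable estimators for α and β are found by the method of unweighted least squares, which gives
$$\hat\alpha = \bar X = \frac{1}{n}\sum_{t=1}^{n} X_t, \qquad \hat\beta = \frac{\sum_{t=1}^{n}\left[\frac{t}{n} - \frac{n+1}{2n}\right] X_t}{\sum_{t=1}^{n}\left[\frac{t}{n} - \frac{n+1}{2n}\right]^2}.$$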
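If no dependence exists, then these estimators have minimum variance among all linear estimators that are unbiased. The standard errors of α^ and β^ are
$$\text{s.e.}(\hat\alpha) = \frac{\sigma}{\sqrt{n}}, \qquad \text{s.e.}(\hat\beta) = \frac{\sigma}{\sqrt{\sum_{t=1}^{n}\left[\frac{t}{n} - \frac{n+1}{2n}\right]^2}}, \tag{2}$$
where σ² is the variance of Xt.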
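Because the coded time values sum to zero, unweighted least squares takes a particularly simple form here: the intercept estimate is the sample mean and the slope estimate is a ratio of two sums. The sketch below assumes independent observations with common variance σ²; the function names are illustrative, not taken from the article.

```python
def coded_times(n):
    """Coded time values t/n - (n + 1)/(2n) for t = 1, ..., n; they sum to zero."""
    return [t / n - (n + 1) / (2 * n) for t in range(1, n + 1)]


def least_squares(xs):
    """Unweighted least squares estimates (alpha_hat, beta_hat) for the coded model."""
    n = len(xs)
    z = coded_times(n)
    alpha_hat = sum(xs) / n
    szz = sum(v * v for v in z)
    beta_hat = sum(v * x for v, x in zip(z, xs)) / szz
    return alpha_hat, beta_hat


def standard_errors(sigma, n):
    """Standard errors of alpha_hat and beta_hat under independence."""
    szz = sum(v * v for v in coded_times(n))
    return sigma / n ** 0.5, sigma / szz ** 0.5


if __name__ == "__main__":
    z = coded_times(5)
    xs = [2.0 + 3.0 * v for v in z]  # noise-free data with alpha = 2, beta = 3
    print(least_squares(xs))
    print(standard_errors(1.0, 5))
```

For the coded values the sum of squares equals (n² − 1)/(12n), so the standard error of the slope shrinks roughly like σ√(12/n) as n grows.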
Now, suppose that dependence exists among the observations and assume that the correlation between Xt and Xs, denoted by ρ(s, t), is given by
$$\rho(s, t) = \rho^{|s-t|}, \tag{3}$$
where ǀs − tǀ denotes the absolute value of the difference s − t. What are the effects of nonzero values of ρ on the estimators α^ and β^? First, while these estimators are still unbiased, they are in this case no longer the minimum variance linear unbiased estimators (see Johnston 1963 for the way to construct the latter estimators). Efficiency computations comparing β^ and the minimum variance linear unbiased estimator, β*, are given in Table 4, where cell entries are the ratio of the variance of β* to the variance of β^ for a sample of size 5.
|Table 4 – Efficiency comparison of β^ and β*|
|var β*/var β^||.921||.982||1||.990||.972|
The estimator β* used in Table 4 is computed under the assumption that ρ is known. When the sample size is large, β^ has efficiency one compared with β* for the particular pattern of correlation considered.
A second effect of nonzero ρ is that the standard errors of α^ and β^ given in (2) are incorrect. The correct standard errors in large samples for the example are
These standard errors may depart quite radically from (2) as Table 5 shows. The ratios obtained in Table 5 are identical to the corresponding ratios for the standard error of α^. Since the standard errors in (2) may be in serious error, it is clear that the standard error of prediction, that is, the standard error of α^ + β^[(t/n) − (n + 1)/(2n)], may also be very wrong. The third effect of ignoring correlation between the observations concerns s², the conventional estimator of σ², the underlying variance:
$$s^2 = \frac{1}{n-2}\sum_{t=1}^{n}\left(X_t - \hat\alpha - \hat\beta\left[\frac{t}{n} - \frac{n+1}{2n}\right]\right)^2.$$
If ρ ≠ 0, then s² is a biased estimator of σ². Some sampling experiments by Cochrane and Orcutt (1949) suggest that the expected value of s² is less than σ²; when there are a large number of observations, the bias is negligible.
In testing hypotheses about α and β a primary concern is with the behavior of the standard t test,
$$t = \frac{\hat\beta}{\widehat{\text{s.e.}}(\hat\beta)}, \qquad \widehat{\text{s.e.}}(\hat\beta) = \text{the standard error of } \hat\beta \text{ in (2) with } s \text{ in place of } \sigma,$$
to examine H0: β = 0 against the alternative H1: β ≠ 0 in the presence of the correlation model (3). Table 6 gives the probability that ǀtǀ ≥ 1.96 if H0: β = 0 is true when the sample size is large (if ρ = 0 this probability is .05). Table 6, and additional tables when H1 is true, vividly demonstrate the sensitivity of the standard t test to nonzero correlation between the observations when (1) and (3) hold. The nonrobustness of t comes primarily from the use of an incorrect standard error of β^ in the denominator of t. The probability computations in Table 6 also hold when H0: α = 0 is tested against H1: α ≠ 0 using the statistic t = √n α^/√s². In many social science problems ρ > 0; as seen from Table 6 the null hypothesis would be rejected more often than the nominal 5 per cent level in such situations, assuming that the null hypothesis is true.
Hoel (1964) has reported on a sampling experiment to assess the effects of correlation in small sample sizes on an F test for a polynomial trend. His results support the conclusions reached in the preceding paragraph. Readers interested in robust tests, assuming the correlational structure (3), should consult Hannan (1955).
|Table 6 – Large sample probability that ǀtǀ ≥ 1.96 when correlation exists|
It is important to remember that the magnitude of effects on a statistical technique from dependence among the observations is a function of the technique, the model of dependence, and the values of correlational parameters.
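The sensitivity of the slope test to serial correlation can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not the computation behind Table 6: the errors are generated by a stationary first-order autoregressive scheme with unit variance (one process whose correlations are powers of ρ), the sample size is n = 100, and the seed and number of replications are arbitrary.

```python
import math
import random


def ar1_errors(rng, n, rho):
    """Stationary errors with unit variance and corr(e_t, e_s) = rho ** |t - s|."""
    e = [rng.gauss(0.0, 1.0)]
    innov_sd = math.sqrt(1.0 - rho * rho)
    for _ in range(n - 1):
        e.append(rho * e[-1] + innov_sd * rng.gauss(0.0, 1.0))
    return e


def slope_t(xs):
    """Conventional t statistic for beta = 0, computed as if observations were independent."""
    n = len(xs)
    z = [t / n - (n + 1) / (2 * n) for t in range(1, n + 1)]
    szz = sum(v * v for v in z)
    a = sum(xs) / n
    b = sum(v * x for v, x in zip(z, xs)) / szz
    s2 = sum((x - a - b * v) ** 2 for v, x in zip(z, xs)) / (n - 2)
    return b / math.sqrt(s2 / szz)


def rejection_rate(rho, n=100, reps=1000, seed=3):
    """Fraction of samples with |t| >= 1.96 when alpha = beta = 0."""
    rng = random.Random(seed)
    hits = sum(abs(slope_t(ar1_errors(rng, n, rho))) >= 1.96 for _ in range(reps))
    return hits / reps


if __name__ == "__main__":
    for rho in (-0.4, -0.2, 0.0, 0.2, 0.4):
        print(rho, rejection_rate(rho))
```

Positive ρ pushes the rejection rate well above the nominal level and negative ρ pushes it below, in line with the direction described in the text.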
Nonidentically distributed observations. Suppose that the psychologist is principally interested in testing the hypothesis H0: α = 0 against H1: α ≠ 0. Now, however, assume that strong evidence exists that ρ = 0. The psychologist believes β = 0 in (1), but he is not certain about this belief. Thus he asks the question, “What is the effect on the t test of examining H0: α = 0 against H1: α ≠ 0, assuming β = 0, if in fact β ≠ 0?” The t statistic the psychologist wants to employ (assuming β = 0) is t = √n X̄/√s², where now s² is given by Σ(Xt − X̄)²/(n − 1).
Table 7 gives the probability that ǀtǀ ≥ 1.96 if H0: α = 0 is correct for various values of ǀβǀ and σ² when n is large (the nominal significance level is .05).
|Table 7 – Probability that ǀtǀ ≥ 1.96 when a slope, assumed zero, is not zero|
Two important effects of incorrectly assuming that β = 0 are apparent from Table 7: (1) the behavior of the t statistic depends upon the unknown σ² as well as ǀβǀ, and (2) the stated significance level .05 is always at least as large as the true significance level.
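Both effects can be reproduced with a rough large-sample argument, offered here as an assumption-laden sketch rather than as the method behind Table 7. Under model (1) with independent errors and α = 0, the coded times sum to zero, so X̄ still has mean 0 and variance σ²/n; but s² now estimates approximately σ² + β²/12, because the mean square of the coded times tends to 1/12. The statistic t is then approximately normal with variance σ²/(σ² + β²/12).

```python
from math import sqrt
from statistics import NormalDist


def true_level(beta, sigma, crit=1.96):
    """Approximate large-sample P(|t| >= crit) when alpha = 0 but the slope is beta.

    Assumes independent errors with variance sigma**2 and the coded time
    values of model (1); the factor 12 comes from the limiting mean square
    of the coded times.
    """
    inflation = sqrt(1.0 + beta ** 2 / (12.0 * sigma ** 2))
    return 2.0 * (1.0 - NormalDist().cdf(crit * inflation))


if __name__ == "__main__":
    for beta in (0.0, 1.0, 2.0, 4.0):
        print(beta, round(true_level(beta, sigma=1.0), 4))
```

The result depends on β and σ only through ǀβǀ/σ and never exceeds the nominal .05, so under this approximation the test becomes conservative when a nonzero slope is ignored.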
It must be remembered that the effects of nonidentically distributed observations depend on the model underlying the observations and the statistical method being used.
The one-way analysis of variance
The one-way analysis of variance may arise when n individuals are randomly assigned to k treatments. The data consist of the n response scores Xij (i = 1, …, k; j = 1, …, ni), with n1 + n2 + … + nk = n. The observations are assumed to be independent, and the probability distributions of the response variable are assumed identical for individuals receiving the same treatment.
The one-way layout is frequently employed to estimate the means or medians, μi, and the variances, σi², of the treatments, to establish confidence intervals on the differences, μi − μj, and to test hypotheses about the μi and σi². The point estimation problems present no essentially new questions.
Tests and confidence intervals for the median or mean. To discriminate between the null and alternative hypotheses
$$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_k \tag{5}$$
and
$$H_1\colon \text{not all the } \mu_i \text{ are equal}, \tag{6}$$
one ordinarily employs the F test, the Kruskal-Wallis H test, or the k sample median test [see NONPARAMETRIC STATISTICS, article on RANKING METHODS; LINEAR HYPOTHESES, article on ANALYSIS OF VARIANCE]. The F test is usually a very good way to make this discrimination if (1) the observations are drawn from a population in which each treatment has the same underlying normal distribution except for possible differences among the μi and (2) the alternative hypothesis is not further specified.
The following sections discuss the validity of these tests. Note that the discussion is germane to the validity of Scheffé’s method of multiple comparisons [see Scheffé 1959, chapter 3; see also LINEAR HYPOTHESES, article on MULTIPLE COMPARISONS], since that method is equivalent to the confidence set based on the F test.
Validity assuming σi² = σ² for all i. The F test has perfect validity for all sample sizes if the populations are normal with equal variances and the observations are independent; that is, if the k samples are drawn from the same normal distribution. In large samples the standard F statistic
$$F = \frac{\sum_{i=1}^{k} n_i(\bar X_i - \bar X)^2/(k-1)}{\sum_{i=1}^{k}\sum_{j=1}^{n_i}(X_{ij} - \bar X_i)^2/(n-k)},$$
where X̄i is the mean of the ith treatment and X̄ is the mean of all n observations, provides a valid test for the hypothesis (5) except at high significance levels. For small samples it will be assumed that departures from normality can be represented by an Edgeworth distribution with skewness and kurtosis parameters γ1i and γ2i in each population. Table 8 gives the probability that F ≥ 2.87 if (5) is true (under normal theory this probability is .05) for γ1i = γ1 and γ2i = γ2 for all i, k = 5, all ni’s = 5.
|Table 8 - Probability that F ≥ 2.87 for k = 5 and all ni = 5|
|Source: Box & Andersen 1955, p. 14.|
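For readers who want to see the arithmetic behind the standard F statistic, the following sketch computes the ratio of the between-treatment mean square to the within-treatment mean square. The function name and the toy data are illustrative assumptions.

```python
def one_way_f(groups):
    """F statistic for a one-way layout; `groups` is a list of lists of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-treatment mean square on k - 1 degrees of freedom.
    between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means)) / (k - 1)
    # Within-treatment mean square on n - k degrees of freedom.
    within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g) / (n - k)
    return between / within


if __name__ == "__main__":
    print(one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))
```

With k = 5 treatments of 5 observations each, the statistic would be referred to the F distribution with 4 and 20 degrees of freedom, whose upper .05 point is the 2.87 used in Table 8.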
Table 8 and other work indicate that the F test for (5) possesses robustness of validity relative to significance levels (type I error) and power when γ1i = γ1 for all i (Gayen 1949; Pearson 1931). The kurtosis parameter γ2 has practically no effect on the F test for (5). However, when γ11 ≠ γ12, it is known that for k = 2 the one-sided t test does not have robustness of validity.
The Kruskal-Wallis test and the median test are valid under the null hypothesis that the k samples come from the same population. When k = 2, the Kruskal-Wallis test is equivalent to the two-tailed Wilcoxon rank-sum test. In this case of k = 2, if γ11 ≠ γ12 and γ21 ≠ γ22, the one-sided Wilcoxon test appears to be less robust than the one-sided t test relative to significance levels when the null hypothesis is (5) and μ is a median or mean (Wetherill 1960).
Inequality of variance. The validity of the preceding tests and some further tests when the assumption of equal variances is dropped will be studied; the normality assumption is retained unless otherwise indicated.
It is necessary at this point to examine the rationale for carrying out a test. Suppose a random sample of n mental patients is drawn and n1 are assigned to treatment 1 and n2 (= n − n1) are assigned to treatment 2. After a period of treatment, each patient is tested and given a score that is assumed to be normally distributed in each population. The null hypothesis tested is that μ1 = μ2 (μi is the mean for treatment i); suppose the conclusion is that μ1 < μ2. If high scores are indicative of improvement, a decision is made to use the second treatment. Why?
When the variances are equal, the treatment with the higher mean is more likely to give rise to scores greater than or equal to any given score w0. Thus the significance test gives the psychiatrist usable results, especially if a score at w0 or above means release from the psychiatric hospital.
Suppose now that each treatment has a different variance and that μ1 < μ2. Then it is by no means uniformly true that the treatment with the higher mean is more likely to give rise to scores greater than or equal to a release score of w0. For example, suppose that the scores for treatment 1 and treatment 2 follow normal distributions with μ1 = 0, σ1² = 9 and μ2 = 1, σ2² = 1, respectively (σi² = variance of treatment i scores). In this case Table 9 gives the probability that a randomly chosen score on treatment i exceeds w0, i = 1, 2.
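The probabilities in this example follow directly from the normal distribution function, so they are easy to recompute. In the sketch below the release scores w0 = 0, 1, …, 4 are illustrative choices and are not necessarily the values tabulated in Table 9.

```python
from statistics import NormalDist


def exceed_prob(mean, variance, w0):
    """P(score > w0) for a normally distributed score."""
    return 1.0 - NormalDist(mean, variance ** 0.5).cdf(w0)


if __name__ == "__main__":
    for w0 in (0, 1, 2, 3, 4):
        p1 = exceed_prob(0.0, 9.0, w0)  # treatment 1: mean 0, variance 9
        p2 = exceed_prob(1.0, 1.0, w0)  # treatment 2: mean 1, variance 1
        print(f"w0={w0}: treatment 1 {p1:.3f}, treatment 2 {p2:.3f}")
```

For these two distributions the ordering reverses at w0 = 1.5: below that score treatment 2 is more likely to exceed w0, and above it treatment 1 is, despite its lower mean.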
Nonetheless, it is often appropriate to test equality of means even when the variances may be different, and the remainder of this section deals with that case. This problem is often called the Behrens-Fisher problem. Consider, first, the large sample validity of the two-tailed, two-sample t test (equivalent to F when k = 2) based on
$$t = \frac{\bar X_1 - \bar X_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}},$$
where si² is the sample variance for treatment i.
The denominator of t is not a consistent estimator of the standard error of X̄1 − X̄2 unless either σ1² = σ2² or n1 = n2. This fact partly explains the nonrobustness of t clearly shown in Table 10.
Table 10 gives the probability that ǀtǀ ≥ 1.96 for different values of θ = σ1²/σ2² and R = n1/n2 when the null hypothesis of equal means holds. Table 10 indicates the importance of equal sample sizes in controlling the effects of unequal variances: the significance level remains at .05 irrespective of the value of θ if R = 1. Moreover, if θ < 1 and R > 1, so that the more variable population has the smaller sample, the true significance level is always greater than .05 and may be seriously so. On the other hand, if θ > 1 and R > 1, the true significance level is always less than .05. These results are essentially independent of γ1 and γ2 because of the large sample sizes. This lack of robustness of significance-level validity extends to power. The small sample validity of t follows along the lines of the large sample theory (see Scheffé 1959, p. 340).
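The direction of these effects can be checked with a short large-sample calculation. This is a sketch resting on a standard approximation, not a transcription of Scheffé's table: in large samples the pooled-variance t statistic is approximately normal with mean 0 and variance (θ + R)/(Rθ + 1), which is the ratio of the true variance of X̄1 − X̄2 to the limit of its pooled estimate.

```python
from math import sqrt
from statistics import NormalDist


def large_sample_level(theta, R, crit=1.96):
    """Approximate P(|t| >= crit) under equal means.

    theta = sigma1^2 / sigma2^2 and R = n1 / n2; the pooled two-sample
    t statistic is treated as normal with variance (theta + R) / (R * theta + 1).
    """
    variance_ratio = (theta + R) / (R * theta + 1.0)
    return 2.0 * (1.0 - NormalDist().cdf(crit / sqrt(variance_ratio)))


if __name__ == "__main__":
    for R in (1, 2, 5):
        row = [round(large_sample_level(theta, R), 3) for theta in (0.2, 0.5, 1, 2, 5)]
        print(R, row)
```

With R = 1 the level stays at .05 for every θ; with R > 1 it exceeds .05 when θ < 1 and falls below .05 when θ > 1.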
|Table 9 – Probability that a randomly chosen score on treatment i exceeds w0|
|Table 10 – Large sample probability that ǀtǀ ≥ 1.96 when sample sizes and variances differ|
|Source: Scheffé 1959, p. 340.|
Since equal sample sizes are sometimes difficult to obtain, even approximately, considerable research has been focused upon alternative ways to test (5) versus (6). Transformation of the response variable may achieve equality of variance for the transformed variable, so that t may be used. But the user must note that a hypothesis on the means of the transformed variates is being tested. [See STATISTICAL ANALYSIS, SPECIAL PROBLEMS OF, article on TRANSFORMATIONS OF DATA.]
Welch (1938; 1947) investigates the alternative test statistic,
$$v = \frac{\bar X_1 - \bar X_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},$$
and indicates how to obtain significance levels (see also Dixon & Massey 1957, p. 123). Note that the statistics v and t are the same if n1 = n2. Furthermore, Welch (1938) shows that the approximate significance level of v may be obtained from tables of the t distribution with f degrees of freedom, where
The v test is valid for both significance level and power in large samples and is much less sensitive to θ than the usual t in small samples. For example, if θ = 1 and n1 = 5, n2 = 15, the exact probability that ǀvǀ ≥ 5.2 is .05, where 5.2 is the 5 per cent point of the exact distribution of v. The probability that ǀvǀ ≥ 5.2 for any other θ value is always between .035 and .085. In addition, even if θ = 1, so that the t test is valid, the v test is nearly as efficient as t. When both n1 and n2 are small, s1² and s2² will have low precision. In these situations, compute f from the following formula:
Alternative testing methods for k = 2 exist (see Behrens 1963; Cochran 1964).
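A sketch of Welch's statistic alongside the pooled t may help fix ideas. The degrees of freedom computed below use the widely quoted Welch-Satterthwaite approximation; that choice is an assumption made for this illustration and may differ in detail from the formulas intended in the text above.

```python
from math import sqrt
from statistics import fmean, variance


def welch_v(x1, x2):
    """Welch's statistic v and approximate degrees of freedom f.

    f follows the Welch-Satterthwaite approximation (an assumption here).
    """
    n1, n2 = len(x1), len(x2)
    a, b = variance(x1) / n1, variance(x2) / n2
    v = (fmean(x1) - fmean(x2)) / sqrt(a + b)
    f = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
    return v, f


def pooled_t(x1, x2):
    """Conventional two-sample t statistic based on the pooled variance."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (fmean(x1) - fmean(x2)) / sqrt(pooled * (1 / n1 + 1 / n2))


if __name__ == "__main__":
    x1, x2 = [1, 2, 3, 4], [2, 4, 6, 9]
    print(welch_v(x1, x2), pooled_t(x1, x2))
```

With equal sample sizes the two statistics coincide, as noted in the text; they differ only in the degrees of freedom used to look up the significance level.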
Inequality of variance also affects the significance levels of the Wilcoxon rank-sum test w and the median test. H. R. van der Vaart (1961) gives significance levels of the Wilcoxon test for various θ values, assuming normal distributions and large samples. Pratt (1964) extends van der Vaart’s investigation in several ways. Surprisingly, w is sensitive to θ in the case of equal samples, while t remains unaffected. The median test appears to have greater robustness of significance-level validity than t or w near θ = 1 when the three tests are comparable.
The conclusions concerning the effects of inequality of variance on the usual F test for k = 2 generally hold for arbitrary k. Robust tests along the lines of Welch’s v test have been developed for general k (see James 1951).
Efficiency considerations. When the shapes (including variances) of the k distributions are different, it is important to be precise about which null hypothesis is being tested. The null hypothesis of equality of means is tested by F; the null hypothesis that p = ½ is tested by the Kruskal-Wallis method (when k = 2, p is the probability that a random observation under treatment 1 is greater than a random observation under treatment 2). The null hypothesis of equality of medians is tested by the median test. The means may be equal but the medians different, or conversely. Either the means or the medians may be equal, but p ≠ ½, or conversely. These facts imply the noncomparability of the three tests if shape differences exist. They also imply that a satisfactory analysis of the data may require an investigator to assemble evidence from all three significance tests, and possibly additional tests.
If the k distributions have the same shape, the following conclusions are justified: the F test does not have robustness of efficiency for bell-shaped symmetrical distributions with moderately long tails. No adequate study has been made of the robustness of efficiency of the Kruskal-Wallis test, but for distributions such as that described above, no competitor is in sight. The median test should have high efficiency for very long tailed symmetrical distributions.
Tests for equality of variances. In many data analyses an investigator is interested in comparing variability among the k treatments; thus he may carry out a test of
$$H_0\colon \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$$
and find confidence intervals for all ratios σi²/σj², i ≠ j [see the article on VARIANCES, STATISTICAL STUDY OF]. The robustness of validity of Bartlett’s test for homogeneity of variance, and hence the validity of confidence intervals derived from Bartlett’s test, will next be investigated. Bartlett’s test is based upon the statistic M:
$$M = (n - k)\log s^2 - \sum_{i=1}^{k}(n_i - 1)\log s_i^2,$$
where si² is the sample variance for the ith treatment and s² = Σ(ni − 1)si²/(n − k) is the pooled variance.
The significance level of M may be approximated from a χ² table with k − 1 degrees of freedom. The M test requires normality and has almost no robustness of validity, as may be seen from Table 11 where the nonnormality is characterized by the γ2 parameter (γ1 = 0).
The disastrous behavior of M for long-tailed symmetrical distributions may be explained by the following suggestive argument by Box (1953). Let T denote a statistic; for example, X̄ and M are statistics. Then, in large samples, T divided by its estimated standard error is usually normally distributed by the central limit theorem and associated mathematical facts. Thus, even though sampling may be from a nonnormal distribution, X̄/(s/√n) is normally distributed. This result explains the robustness to nonnormality of the t test in the matched-pairs design and the F test in the one-way analysis of variance when the populations have the same shape. But Bartlett’s M test does not have this structure of T divided by its standard error; hence, it does not find protection under the central limit theorem.
The nonrobustness of the M test requires the use of an alternative test. Scheffé (1959) has developed a robust test and a robust multiple comparison method for this problem from a suggestion by Box (1953).
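For reference, Bartlett's statistic can be computed in a few lines. The sketch assumes the uncorrected form M = (n − k) log s² − Σ(ni − 1) log si², with s² the pooled variance and natural logarithms; many textbooks divide this quantity by a correction factor, which is omitted here.

```python
from math import log
from statistics import variance


def bartlett_m(groups):
    """Uncorrected Bartlett statistic for homogeneity of variance (assumed form)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    variances = [variance(g) for g in groups]
    pooled = sum((len(g) - 1) * v for g, v in zip(groups, variances)) / (n - k)
    weighted_logs = sum((len(g) - 1) * log(v) for g, v in zip(groups, variances))
    return (n - k) * log(pooled) - weighted_logs


if __name__ == "__main__":
    print(bartlett_m([[1, 2, 3], [4, 5, 6]]))    # equal sample variances
    print(bartlett_m([[1, 2, 3], [0, 10, 20]]))  # very unequal sample variances
```

The statistic is zero when all sample variances agree and grows as they diverge; it is the reference to the χ² table, not the arithmetic, that breaks down for long-tailed distributions.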
The extreme sensitivity of the M test for equality of variances to the value of γ2 indicates that one may expect trouble with the normal theory analysis of the random effects model, sometimes called model II [see LINEAR HYPOTHESES, article on ANALYSIS OF VARIANCE; see also Dixon & Massey (1951) 1957, p. 174]. Real difficulties do exist with such analyses, even in large samples.
|Table 11 – True probability of exceeding the .05 normal theory point of M in large samples|
|Source: Box 1953, p. 320.|
A repeated-measurements design
An investigator records the response of each individual to a stimulus repeated at each of p different points in time. The data consist of response scores, Xit, where Xit denotes the score of the ith individual at time t, i = 1, 2, …, n and t = 1, 2, …, p. Each individual has p response scores. The investigator assumes that
$$X_{it} = \mu + \alpha_i + \tau_t + e_{it}.$$
The αi and τt denote the individual and time effects, respectively. It is assumed that the random errors eit have a common distribution and that the correlation between eit and eis is given by ρ^ǀt−sǀ. It is assumed then that an individual’s response at time t is correlated with his response at another time s, but that the responses of different individuals are independent. The investigator’s principal interest lies in testing the hypotheses
$$H_0\colon \alpha_1 = \alpha_2 = \cdots = \alpha_n = 0 \tag{8}$$
and
$$H_0\colon \tau_1 = \tau_2 = \cdots = \tau_p = 0. \tag{9}$$
Assume that the eit are normal and study the effects of nonzero ρ on the standard F tests in the two-way analysis of variance.
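The hypotheses (8) and (9) may be examined, respectively, by the statistics
$$F_\alpha = \frac{\text{MS}_\alpha}{\text{MSE}}, \qquad F_\tau = \frac{\text{MS}_\tau}{\text{MSE}},$$
where MSα and MSτ are the mean squares for individuals and for times, and MSE represents the mean square error, that is,
$$\text{MSE} = \frac{\sum_{i=1}^{n}\sum_{t=1}^{p}\left(X_{it} - \bar X_{i\cdot} - \bar X_{\cdot t} + \bar X_{\cdot\cdot}\right)^2}{(n-1)(p-1)}.$$
Table 12 gives the probability that Fα ≥ 3.01 for different values of ρ (the exact probability is .05 for ρ = 0) and the probability that Fτ ≥ 3.01 when n = p = 5.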
Table 12 shows clearly the considerable effect of correlation on the test for individuals and the slight effect such correlation has for the test on times. In the terminology of the two-way analysis of variance, if individuals denote the rows and times denote the columns, then correlation within a row seriously affects the test on rows and only slightly affects the test on columns, with the given model for the correlation. Two explanations for the nonrobustness of the $F_\alpha$ test are (1) the numerator and denominator of $F_\alpha$ are correlated, contrary to the ideal condition, and (2) essentially the wrong standard error of the means for individuals is used.
|Table 12 - Probability that $F_\alpha \ge 3.01$ and probability that $F_\tau \ge 3.01$ in the presence of correlation*|
|Exact probability for the $F_\alpha$ test on individuals|.0003|.0101|.05|.1305|.2470|
|Exact probability for the $F_\tau$ test on different times|.0590|.0527|.05|.0537|.0668|
|* The cell entries were computed assuming $\rho_{|t-s|} = \rho$ if $|t-s| = 1$ and $\rho_{|t-s|} = 0$ if $|t-s| \ge 2$ in order to simplify the computations. This approximation correctly indicates the order of magnitude of the more general model $\rho_{|t-s|}$.|
|Source: Box 1954, p. 497.|
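The pattern reported in Table 12 can be explored by simulation. The following is a minimal sketch, not Box's calculation: it assumes normal errors with unit variance, no individual or time effects, $n = p = 5$, and the simplified correlation model of the table note (correlation $\rho$ between adjacent times, zero otherwise), which is generated here with a first-order moving average. The function names and the values of $\rho$ in the final loop are illustrative choices, not taken from the source.

```python
import math
import random


def ma1_theta(rho):
    """Return theta such that an MA(1) series has lag-1 correlation rho."""
    if abs(rho) > 0.5:
        raise ValueError("an MA(1) series cannot have |lag-1 correlation| > 0.5")
    if rho == 0:
        return 0.0
    # Invertible root of rho * theta**2 - theta + rho = 0.
    return (1 - math.sqrt(1 - 4 * rho * rho)) / (2 * rho)


def correlated_row(p, theta, rng):
    """One individual's p errors: unit variance, lag-1 correlation rho, zero beyond."""
    z = [rng.gauss(0, 1) for _ in range(p + 1)]
    scale = math.sqrt(1 + theta * theta)
    return [(z[t + 1] + theta * z[t]) / scale for t in range(p)]


def f_statistics(x):
    """Return (F_alpha, F_tau) for an n-by-p table x (rows = individuals)."""
    n, p = len(x), len(x[0])
    row_means = [sum(row) / p for row in x]
    col_means = [sum(x[i][t] for i in range(n)) / n for t in range(p)]
    grand = sum(row_means) / n
    ms_rows = p * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (p - 1)
    ss_res = sum(
        (x[i][t] - row_means[i] - col_means[t] + grand) ** 2
        for i in range(n)
        for t in range(p)
    )
    mse = ss_res / ((n - 1) * (p - 1))
    return ms_rows / mse, ms_cols / mse


def rejection_rates(rho, n=5, p=5, crit=3.01, reps=4000, seed=0):
    """Monte Carlo estimates of P(F_alpha >= crit) and P(F_tau >= crit)."""
    rng = random.Random(seed)
    theta = ma1_theta(rho)
    hits_alpha = hits_tau = 0
    for _ in range(reps):
        x = [correlated_row(p, theta, rng) for _ in range(n)]
        f_alpha, f_tau = f_statistics(x)
        hits_alpha += f_alpha >= crit
        hits_tau += f_tau >= crit
    return hits_alpha / reps, hits_tau / reps


if __name__ == "__main__":
    # Illustrative correlations; not the column headings of Table 12.
    for rho in (-0.4, -0.2, 0.0, 0.2, 0.4):
        a, t = rejection_rates(rho)
        print(f"rho={rho:+.1f}  P(F_alpha>=3.01)~{a:.3f}  P(F_tau>=3.01)~{t:.3f}")
```

Because the estimates are Monte Carlo proportions, they can be expected to agree with exact values only up to simulation error, which shrinks as `reps` increases.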
It has been shown that statistical analyses based on assumptions that are incorrect for the data can produce misleading inferences. Furthermore, ways have been indicated to choose good statistical analyses, based on plausible assumptions, so that inferences will not be distorted. The question arises, “How does one decide which assumptions to make?” For example, suppose that there is interest only in making inferences about the mean difference. It then seems preferable to use an inference procedure that is robust against suspected departures from assumptions rather than to make preliminary significance tests of the assumptions of equality of variances, normality and/or symmetry, randomness, and so forth. A procedure that is robust against all failures in assumptions cannot be found, so a procedure must be chosen that is robust against those failures in assumptions that are known to be likely from experience with the problem under study or that would distort the inferences most severely.
Robert M. Elashoff
ASPIN, ALICE A. 1949 Tables for Use in Comparisons Whose Accuracy Involves Two Variances, Separately Estimated. Biometrika 36:290-293.
BEHRENS, W. U. (1963) 1964 The Comparison of Means of Independent Normal Distributions With Different Variances. Biometrics 20:16-27. → First published in German. Discusses alternative tests to t or v based upon Fisher’s fiducial theory of inference and the use of Bayesian methods.
BOX, GEORGE E. P. 1953 Non-normality and Tests on Variance. Biometrika 40:318-335. → Readers will find sections 1, 2, 7, 8, 9 accessible in general. The discussion section is particularly important.
BOX, GEORGE E. P. 1954 Some Theorems on Quadratic Forms Applied in the Study of Analysis of Variance Problems. II: Effects of Inequality of Variance and of Correlation Between Errors in the Two-way Classification. Annals of Mathematical Statistics 25:484-498.
BOX, GEORGE E. P.; and ANDERSEN, S. L. 1955 Permutation Theory in the Derivation of Robust Criteria and the Study of Departures From Assumptions. Journal of the Royal Statistical Society Series B 17:1-34. → The discussion on pages 26-34 presents some of the best thinking on statistical practice and is accessible in general.
BOX, GEORGE E. P.; and TIAO, G. C. 1964 A Note on Criterion vs. Inference Robustness. Biometrika 51:168-173. → The authors discuss robustness of validity and efficiency with a concrete example.
COCHRAN, WILLIAM G. 1964 Approximate Significance Levels of the Behrens-Fisher Test. Biometrics 20:191-195.
COCHRANE, DONALD; and ORCUTT, G. H. 1949 Application of Least Squares Regression to Relationships Containing Auto-correlated Error Terms. Journal of the American Statistical Association 44:32-61.
DIXON, WILFRID J.; and MASSEY, FRANK J. JR. (1951) 1957 Introduction to Statistical Analysis. 2d ed. New York: McGraw-Hill.
GAYEN, A. K. 1949 The Distribution of “Student’s” t in Random Samples of Any Size Drawn From Non-normal Universes. Biometrika 36:353-369. → The method and tables (like Table 2) and discussion are the important features of this and the next reference.
GAYEN, A. K. 1950 The Distribution of the Variance Ratio in Random Samples of Any Size Drawn From Non-normal Universes. Biometrika 37:236-255.
GEARY, R. C. 1966 A Note on Residual Heterovariance and Estimation Efficiency in Regression. American Statistician 20, no. 4:30-31.
HANNAN, E. J. 1955 An Exact Test for Correlation Between Time Series. Biometrika 42:316-326.
HOEL, PAUL G. 1964 Methods for Comparing Growth Type Curves. Biometrics 20:859-872.
HOTELLING, HAROLD 1961 The Behavior of Some Standard Statistical Tests Under Nonstandard Conditions. Volume 1, pages 319-359 in Berkeley Symposium on Mathematical Statistics and Probability, Fourth, University of California, 1960, Proceedings. Berkeley and Los Angeles: Univ. of California Press.
JAMES, G. S. 1951 The Comparison of Several Groups of Observations When the Ratios of the Population Variances Are Unknown. Biometrika 38:324-329. → The author’s method for testing the null hypothesis (eq. 5) is accessible.
JOHNSTON, JOHN 1963 Econometric Methods. New York: McGraw-Hill. → An exposition of regression methods.
LANGLEY, P. A.; and ELASHOFF, R. M. 1966 A Study of the Hodges-Lehmann Two Sample Test. Unpublished manuscript.
PEARSON, EGON S. 1931 The Analysis of Variance in Cases of Non-normal Variation. Biometrika 23:114-133. → The author investigates the validity of the F test by Monte Carlo sampling.
PRATT, JOHN W. 1964 Robustness of Some Procedures for the Two-sample Location Problem. Journal of the American Statistical Association 59:665-680. → Readers with a modest statistical background will find sections 1 and 2 accessible. The author investigates the validity of several tests under inequality of variance.
SCHEFFÉ, HENRY 1959 The Analysis of Variance. New York: Wiley. → Chapter 10 is one of the most comprehensive accounts of the effects of departures from statistical assumptions. Readers with a modest statistical background will find pages 360-368 accessible.
SRIVASTAVA, A. B. L. 1958 Effect of Non-normality on the Power Function of t-Test. Biometrika 45:421-429.
TUKEY, JOHN W. 1960 A Survey of Sampling From Contaminated Distributions. Pages 448-485 in Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling. Edited by Ingram Olkin et al. Stanford Univ. Press. → The author reviews his previous research on robust estimators for μ and σ2 and gives a good bibliography.
TUKEY, JOHN W. 1962 The Future of Data Analysis. Annals of Mathematical Statistics 33:1-67, 812. → The author outlines his views on data analysis and makes several specific suggestions for handling spotty data. The first 21 pages are accessible; thereafter, some parts are accessible, others are not.
TUKEY, JOHN W.; and MCLAUGHLIN, DONALD H. 1963 Less Vulnerable Confidence and Significance Procedures for Location Based Upon a Single Sample: Trimming/Winsorization. Sankhya: The Indian Journal of Statistics Series A 25:331-352. → The trimmed t is discussed. The beginning sections of the paper are accessible.
VAN DER VAART, H. R. 1961 On the Robustness of Wilcoxon’s Two Sample Test. Pages 140-158 in Symposium on Quantitative Methods in Pharmacology, University of Leiden, 1960, Quantitative Methods in Pharmacology: Proceedings. Amsterdam: North-Holland Publishing. → The introduction and conclusion, together with the table and graphs, are accessible.
WELCH, B. L. 1938 The Significance of the Difference Between Two Means When the Population Variances Are Unequal. Biometrika 29:350-362.
WELCH, B. L. 1947 The Generalization of “Student’s” Problem When Several Different Population Variances Are Involved. Biometrika 34:28-35.
WETHERILL, G. B. 1960 The Wilcoxon Test and Non-null Hypotheses. Journal of the Royal Statistical Society Series B 22:402-418.