Statistical Analysis, Special Problems of


I. Outliers: F. J. Anscombe

II. Transformations of Data: Joseph B. Kruskal

III. Grouped Observations: N. F. Gjeddebæk

IV. Truncation and Censorship: Lincoln E. Moses

I. OUTLIERS

In a series of observations or readings, an outlier is a reading that stands unexpectedly far from most of the other readings in the series. More technically, an outlier may be defined to be a reading whose residual (explained below) is excessively large. In statistical analysis it is common practice to treat outliers differently from the other readings; for example, outliers are often omitted altogether from the analysis.

The name “outlier” is perhaps the most frequently used in this connection, but there are other common terms with the same meaning, such as “wild shot,” “straggler,” “sport,” “maverick,” “aberrant reading,” and “discordant value.”

Perspective. It is often found that sets of parallel or similar numerical readings exhibit something close to a “normal” pattern of variation [see Distributions, Statistical, article on Special Continuous Distributions]. From this finding of normal variation follows the interest of statisticians in simple means (that is, averages) of homogeneous sets of readings, and in the method of least squares for more complicated bodies of readings, as seen in regression analysis and in the standard methods of analysis for factorial experiments [see Linear Hypotheses].

Sometimes a set of readings does not conform to the expected pattern but appears anomalous in some way. The most striking kind of anomaly, and the one that has attracted most attention in the literature of the past hundred years, is the phenomenon of outliers. One or more readings are seen to lie so far from the values to be expected from the other readings as to suggest that some special factor has affected them and that therefore they should be treated differently. Outliers have often been thought of simply as “bad” readings, and the problem of treating them as one of separating good from bad readings, so that conclusions can be based only on good readings and the bad observations can be ignored. Numerous rules have been given for making the separation, the earliest being that proposed by B. Peirce (1852). None of these rules has seemed entirely satisfactory, nor has any met with universal acceptance.

Present-day thinking favors a more flexible approach. Outliers are but one of many types of anomaly that can be present in a set of readings; other kinds of anomalies are, for example, heteroscedasticity (nonconstant variability), nonadditivity, and temporal drift. The question of what to do about such anomalies is concerned with finding a satisfactory specification (or model) of the statistical problem at hand. Since no method of statistical analysis of the readings is uniquely best, tolerable compromises must be sought, so that as many as possible of the interesting features of the data can be brought out fairly and clearly. It is important that outliers be noticed, but the problem of dealing with them is not isolated from other problems of statistical analysis.

An illustration. Suppose that a psychologist arranges to have a stimulus administered to a group of 50 subjects and that the time elapsing before each subject gives a certain response is observed. From the resulting set of times he wishes to calculate some sort of mean value or measure of “central tendency,” for eventual comparison with similar values obtained under different conditions. [See Statistics, Descriptive, article on Location and Dispersion.]

Just as he is about to calculate a simple arithmetic mean of the 50 readings, the psychologist notices that 3 of the readings are considerably larger than all the others. Should he include these outliers in his calculation, discard them, or what? Several different answers seem reasonable, according to the circumstances.

It may occur to the psychologist that the outliers have been produced by some abnormal condition, and he therefore inquires into the conduct of the trial. Perhaps he discovers that, indeed, three of the subjects behaved abnormally by going to sleep during the test, these being the ones yielding the longest response times. Now the psychologist needs to consider carefully what he wants to investigate. If he decides that he is interested only in the response times of subjects who do not go to sleep and explicitly defines his objective accordingly, he may feel justified in discarding the outlying readings and averaging the rest. But, on the other hand, he may decide that the test has been incorrectly administered, because no subject should have been allowed to go to sleep; then he may discard not only the outliers but all the rest of the readings and order a repetition to be carried out correctly.

In experimental work it is often not possible to verify directly whether some abnormal condition was associated with an outlier; nothing special is known about the reading except that it is an outlier. In that case, the whole distributional pattern of the readings should be examined. Response times, in particular, are often found to have a skew distribution, with a long upper tail and short lower tail, and this skewness may be nearly removed by taking logarithms of the readings or by making some other simple rescaling transformation [see Statistical Analysis, Special Problems of, article on Transformations of Data]. The psychologist may find that the logarithms of his readings have a satisfactorily normal pattern without noticeable outliers. If he then judges that the arithmetic mean of the log-times, equivalent to the geometric mean of the original times, is a satisfactory measure of central tendency, his outlier problem will have been solved without the rejection or special treatment of any reading.
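The equivalence between the mean of the log-times and the geometric mean of the original times can be checked numerically. The following is a minimal sketch; the response times are hypothetical values chosen for illustration and do not come from any study described here.

```python
import math

# Hypothetical response times in seconds (illustrative values only);
# one reading is much larger than the rest.
times = [1.2, 1.5, 1.1, 1.8, 2.0, 1.4, 9.5, 1.6]

arithmetic_mean = sum(times) / len(times)

# Mean of the log-times, transformed back to the original scale.
mean_log = sum(math.log(t) for t in times) / len(times)
geometric_mean = math.exp(mean_log)

# The geometric mean computed directly as the n-th root of the product.
direct_geometric_mean = math.prod(times) ** (1 / len(times))

print(f"arithmetic mean: {arithmetic_mean:.3f}")
print(f"geometric mean:  {geometric_mean:.3f}")
print(f"direct check:    {direct_geometric_mean:.3f}")
```

The large reading pulls the arithmetic mean upward far more than it pulls the geometric mean, which is one reason the log scale can make an apparent outlier unremarkable.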

Finally, it may happen that even after an appropriate transformation of the readings or some other modification in the specification, there still remain one or more noticeable outliers. In that case, the psychologist may be well advised to assign reduced weight to the outliers when he calculates the sample mean, according to some procedure designed to mitigate the effect of a long-tailed error distribution or of possible gross mistakes in the readings.

Thus, several actions are open to the investigator, and there is no single, simple rule that will always lead him to a good choice.

Terminology. An outlier is an observation that seems discordant with some type of pattern. Although in principle any sort of pattern might be under consideration, as a matter of fact the notion of outlier seems rarely to be invoked, except when the expected pattern of the readings is of the kind to which the method of least squares is applicable and fully justifiable [see Linear Hypotheses, article on Regression]. That is, we have a series of readings y₁, y₂, ..., y_n, and we postulate, in specifying the statistical problem, that

(1) y_i = μ_i + e_i,   i = 1, 2, ..., n,

where the e_i are “errors” drawn independently from a common normal distribution having zero mean and where the expected values, μ_i, are specified in terms of one or several unknowns. For a homogeneous sample of readings (a single, simple sample) it is postulated that all the μ_i are equal to the common mean, μ; for more complicated bodies of readings a “linear hypothesis” is usually postulated, of form

(2) μ_i = μ + x_i1β₁ + x_i2β₂ + ... + x_irβ_r,

where the x_ij are given and μ and the β_j are parameters. The object is to estimate (or otherwise discuss the value of) some or all of the parameters, μ, β₁, β₂, ..., and the variance of the error distribution.

Let Y_i denote the estimate of μ_i obtained by substituting in the right side of (2) the least squares estimates of the parameters, that is, the parameter values that minimize the expression Σ_i(y_i − μ_i)² given the assumed relationship among the μ_i. The residuals, z_i, are defined by

z_i = y_i − Y_i.

Relative to the specification (1) and (2), an outlier is a reading such that the corresponding absolute value of z is judged to be excessively large.
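As a rough illustration of these definitions, the sketch below fits specification (2) with a single x-variable (r = 1) by least squares and lists the residuals. The data are hypothetical, and the variable names (mu_hat, beta_hat) are introduced here only for the example.

```python
# Hypothetical readings: six (x, y) pairs, the last made deliberately discordant.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 20.0]
n = len(y)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least squares estimates for mu_i = mu + x_i * beta,
# i.e. the values minimizing sum_i (y_i - mu_i)^2.
beta_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
            / sum((xi - x_bar) ** 2 for xi in x))
mu_hat = y_bar - beta_hat * x_bar

fitted = [mu_hat + beta_hat * xi for xi in x]       # Y_i
residuals = [yi - Yi for yi, Yi in zip(y, fitted)]  # z_i = y_i - Y_i

for xi, zi in zip(x, residuals):
    print(f"x = {xi:.0f}   residual = {zi:+.2f}")
```

In this made-up example the last reading has the largest absolute residual, but the fitted line has also been pulled toward it, so the neighboring residuals are inflated as well. That is one reason to look at the whole set of residuals rather than at a single value.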

Causes of outliers. It is convenient to distinguish three ways in which an outlier can occur: (a) a mistake has been made in the reading, (b) no mistake has been made, but the specification is wrong, (c) no mistake has been made, the specification is correct, and a rare deviation from expectation has been observed.

In regard to (a), a mistake may be made in reading a scale, in copying an entry, or in some arithmetic calculation by which the original measurements are converted to the reported observations. Apparatus used for making measurements may fail to function as intended—for example, by developing a chemical or electrical leak. A more subtle kind of mistake occurs when the intended plan of the investigation is not carried out correctly even though the act of observation itself is performed perfectly. For example, in a study of ten-year-old children it would be a mistake to include, by accident, as though relating to those children, material that was in fact obtained from other persons, such as teachers, parents, or children of a different age.

In regard to (b), the specification can be wrong in a variety of ways [see Errors, article on Effects of Errors in Statistical Assumptions]. The errors may have a nonnormal distribution or may not be drawn from a common distribution at all. The expression (2) for the expected values may be incorrect, containing too few terms or terms of the wrong sort.

In regard to (c), any value whatsoever is theoretically observable and is consistent with any given normal distribution of errors. But readings differing from the mean of a normal distribution by more than some three or four standard deviations are so exceedingly rare as to be a virtual impossibility. Usually the normal-law linear-hypothesis specification cannot be regarded as more than a rough approximation to the truth. Moreover, one can never entirely rule out the possibility that a mistake of some sort has been made. Thus, when a reading is seen to have a large residual, explanation (a) or (b) usually seems more plausible than (c). However, if one wishes to reach a verdict on the matter, one will do well to examine, not just the outliers, but all the readings for evidence of mistakes and of an incorrect specification.
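To put numbers on how rare such deviations are under an exactly normal error distribution, the two-sided tail probability can be computed from the complementary error function. This is a small sketch; the function name two_sided_tail is introduced here for illustration.

```python
import math

def two_sided_tail(k):
    """P(|Z| > k) for a standard normal variable Z."""
    return math.erfc(k / math.sqrt(2))

for k in (2, 3, 4):
    print(f"k = {k}: P(|error| > k standard deviations) = {two_sided_tail(k):.2e}")
```

These are probabilities for a single reading; in a long series the chance that at least one reading exceeds a given multiple of the standard deviation is correspondingly larger.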

Preferred treatment. Suppose that it could be known for sure whether the cause of a particular outlier was of type (a), (b), or (c) above. What action would then be preferred?

(a) If it were known that the outlying reading had resulted from a mistake, the observer would usually choose, if he could, either to correct the mistake or to discard the reading. One sufficiently gross error in a reading can wreck the whole of a statistical analysis. The danger is particularly great when the data of an investigation are processed by an automatic computer, since it is possible that no one examines the individual readings. It is of great importance that such machine processing yield a display of the residuals in some convenient form, so that gross mistakes will not pass unnoticed and the conformity of the data to the specification can be checked [see Computation].

However, it is not invariably true that if the observer becomes aware that a reading was mistaken, and if he cannot correct the mistake, he will be wise to discard the reading, as though it had never been. For example, suppose that a new educational test is tried on a representative group of students in order to establish norms—that is, to determine the distribution of crude scores to be expected. When the trial is finished and the crude scores have been obtained, some circumstance (perhaps the occurrence of outliers) prompts an investigation of the trial, and then evidence comes to hand that about a quarter of the students cheated, a listing of those involved being available. Should the scores obtained by the cheaters be discarded and norms for the test be based on the scores of the non-cheaters? Surely that would be misleading, for if the possibility of cheating will persist in the future, as in the trial, the scores of the cheaters should obviously not be excluded. If, on the other hand, a change in the administration of the test will prevent such cheating in the future, then for a correct norming the scores that the cheaters would have obtained, had they not been allowed to cheat, must be known. It would be rash to assume that these scores would be similar to the scores of the actual noncheaters. But that is what is implied by merely rejecting the cheaters’ scores. Conceivably, a change in the administration of the test might even have affected the scores of those who did not cheat. Thus, the trial would have to be run afresh under the new system of administration in order to yield a fair distribution of scores. (For further discussion of this sort of situation, see Kruskal 1960.)

(b) Consider now the next imagined case, where it is known that no mistakes in observation were made but that the specification was wrong. What to do with any particular reading would then be a secondary question; attention should first be directed toward improving the specification. The appropriateness of the expression (2) for the expected values, μ_i, may possibly be improved by transformation of the observations, y_i, or the associated values, x_ij, or by a change in the form of the right side so that further terms are introduced or a nonlinear function of the parameters is postulated. Sometimes it is appropriate to postulate unequal variances for the errors, e_i, depending perhaps on the expectations, μ_i, or on some associated x-variable. A consequence of making such changes in the specification will usually be that the standard least squares method of analysis will be applied to modified data.

As for the assumption that the distribution of the errors, e_i, is normal, there are many situations where this does seem to agree roughly with the facts. In no field of observation, however, are there any grounds for thinking that the normality assumption is accurately true, and in cases where it is roughly true, extensive investigation has sometimes revealed a frequency pattern having somewhat longer tails than those of a normal distribution. Jeffreys (1939), Tukey (1962), Huber (1964), and others have considered various systems of assigning reduced weight to readings having large residuals, in order to make the least squares method less sensitive to a possibly long-tailed distribution of errors, while preserving as nearly as possible its effectiveness when the error distribution is normal and not greatly changing the computational procedure. One such modified version of the least squares method is described below.

(c) Finally, suppose that it were known that an outlier was caused neither by a mistake in observation nor by an incorrect specification, but that it was simply one of the rare deviations that must occasionally occur. Then one would wish to perform the usual statistical analysis appropriate to the specification; the outlier would be included with full weight and treated just like the other observations. This is so because the ordinary least squares estimates of the parameters (together with the usual estimate of common variance) constitute a set of sufficient statistics, and no additional information can be extracted from the configuration of the readings [see Sufficiency].

Conclusions. This exposition has considered what action would be preferred if it were known for sure whether an outlier had arisen (a) from a mistake, (b) from an incorrect specification, or (c) by mere chance, the specification and the technique of observation both being correct. It has been shown that a different action would be preferred in each case. Usually, no such knowledge of the cause of an outlier is available in practice. A compromise is therefore necessary.

Obviously, all reasonable efforts should be made to prevent mistakes in observation and to find a specification and a method of statistical analysis consonant with the data. That is so in any case, although outliers may naturally stimulate a closer scrutiny of the specification and a more vigorous search for mistakes. But however careful one is, one can never be certain that undetected mistakes have not occurred and that the specification and plan of statistical analysis are completely appropriate. For the purpose of estimating parameters in a linear hypothesis like (2) above, some modification of the customary least squares procedure giving reduced weight to outliers seems to be advisable. The harmful effects of mistakes in observation and of a long-tailed distribution of errors are thereby mitigated, while negligible damage is done if no mistakes or specification errors have been made. Suitable computational procedures have not yet been fully explored, but it seems likely that considerable attention will be directed to this topic in the near future.

A modified least squares method. One type of modified least squares method for estimating the parameters for the statistical specification indicated at (1) and (2) above is as follows:

Positive numbers, K₁ and K₂, are chosen, with K₂ > K₁. Instead of taking as estimates of the parameters the values that minimize the sum of squares Σ_i(y_i − μ_i)², one takes the values that minimize the sum of values of a function,

Σ_iψ(y_i − μ_i),

where ψ(·) is the square function for small values of the argument but increases less rapidly for larger values and is constant for very large values. Specifically, one minimizes the following composite sum:

(3) Σ₍₁₎(y_i − μ_i)² + Σ₍₂₎[2K₁|y_i − μ_i| − K₁²] + Σ₍₃₎[2K₁K₂ − K₁²],

where Σ₍₁₎ denotes the sum over all values of i such that |y_i − μ_i| ≤ K₁, Σ₍₂₎ denotes the sum over all values of i such that K₁ < |y_i − μ_i| ≤ K₂, and Σ₍₃₎ denotes the sum over all remaining values of i. Minimizing (3) is, roughly speaking, equivalent to the ordinary least squares method modified by giving equal weight to all those readings whose residual does not exceed K₁ in magnitude, reduced weight (inversely proportional to the magnitude of the residual) to those readings whose residual exceeds K₁ but not K₂ in magnitude, and zero weight to those readings whose residual exceeds K₂ in magnitude. The minimization is a problem in quadratic programming [see Programming]. If K₁ is large enough for most readings to come in the sum Σ₍₁₎, there are rapidly converging iterative procedures. The necessity of iterating comes from the fact that until the parameters have been estimated and the residuals calculated, it is not possible to say with certainty which readings contribute to the sum Σ₍₁₎, which to Σ₍₂₎, and which to Σ₍₃₎. As soon as all the readings have been correctly assigned, a single set of linear equations determines the values of the parameters, as in the ordinary least squares method.

The ordinary method results when K₁ is set so large that all readings come in Σ₍₁₎. At another extreme, if K₁ is chosen to be infinitesimally small but K₂ is chosen to be very large, all readings come in Σ₍₂₎, and the result is the method of least absolute deviations. For general use, as a slight modification of the ordinary least squares method, K₁ and K₂ should be chosen so that only a small proportion of the readings come in Σ₍₂₎ and scarcely any in Σ₍₃₎. For example, K₁ might be roughly equal to twice the estimated standard deviation of the error distribution, and K₂ might be three or four times as large as K₁.

Such a procedure leads to the exclusion of any very wild reading from the estimation of the parameters. Less wild readings are retained with less than full weight.
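The weighting scheme just described can be sketched for the simplest case, a homogeneous sample with a single common mean μ. The code below is an illustrative iteratively reweighted average, not a procedure taken from the article: the function names, the starting value (the sample median), the stopping rule, and the readings themselves are all assumptions of this sketch.

```python
import statistics

def weight(z, k1, k2):
    """Full weight up to k1, weight k1/|z| between k1 and k2, zero beyond k2."""
    a = abs(z)
    if a <= k1:
        return 1.0
    if a <= k2:
        return k1 / a
    return 0.0

def modified_mean(y, k1, k2, tol=1e-10, max_iter=100):
    """Estimate a common mean by repeated weighted averaging of the readings."""
    mu = statistics.median(y)  # starting value (an assumption of this sketch)
    for _ in range(max_iter):
        w = [weight(yi - mu, k1, k2) for yi in y]
        total = sum(w)
        if total == 0.0:       # every reading rejected; keep the current value
            return mu
        new_mu = sum(wi * yi for wi, yi in zip(w, y)) / total
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# Hypothetical readings: eight ordinary values, one mild outlier, one wild one.
y = [9.8, 10.1, 10.0, 9.9, 10.2, 10.3, 9.7, 10.0, 11.5, 25.0]

print(f"ordinary mean: {sum(y) / len(y):.3f}")
print(f"modified mean: {modified_mean(y, k1=0.4, k2=1.6):.3f}")
```

With these made-up values the ordinary mean is 11.65, while the modified estimate settles near 10.05: the reading 25.0 receives zero weight and the reading 11.5 receives reduced weight. Setting K₁ very large reproduces the ordinary mean, in line with the remark above about the ordinary method.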

F. J. Anscombe

BIBLIOGRAPHY

Anscombe, F. J. 1967 Topics in the Investigation of Linear Relations Fitted by the Method of Least Squares. Journal of the Royal Statistical Society Series B 29:1-52. → Includes 23 pages of discussion and recent references.

Anscombe, F. J.; and Tukey, John W. 1963 The Examination and Analysis of Residuals. Technometrics 5:141-160.

Chauvenet, William 1863 Manual of Spherical and Practical Astronomy. Philadelphia: Lippincott. → See especially the appendix on the method of least squares, sections 57-60.

Daniel, Cuthbert 1960 Locating Outliers in Factorial Experiments. Technometrics 2:149-156.

[Gosset, W. S.] 1927 Errors of Routine Analysis, by Student [pseud.]. Biometrika 19:151-164.

Huber, Peter J. 1964 Robust Estimation of a Location Parameter. Annals of Mathematical Statistics 35:73-101.

Jeffreys, Harold (1939) 1961 Theory of Probability. 3d ed. Oxford: Clarendon.

Kruskal, William H. 1960 Some Remarks on Wild Observations. Technometrics 2:1-3.

Peirce, Benjamin 1852 Criterion for the Rejection of Doubtful Observations. Astronomical Journal 2:161-163.

Tukey, John W. 1962 The Future of Data Analysis. Annals of Mathematical Statistics 33:1-67.

Wright, Thomas W. 1884 A Treatise on the Adjustment of Observations, With Applications to Geodetic Work and Other Measures of Precision. New York: Van Nostrand. → See especially sections 69-73.

II. TRANSFORMATIONS OF DATA

It is often useful to apply a transformation, such as y = log x or y = 1/x, to data values, x. This change can simplify relationships among the data and improve subsequent analysis.

Many transformations have been used, including y = x^c, where c is a constant, and y = ½ log [x/(1 − x)]. Sometimes a beneficial transformation is constructed empirically from the data themselves, rather than given as a mathematical expression.

The general effect of a transformation depends on the shape of its plotted curve on a graph. It is this curve, rather than the mathematical formula, that has central interest. Transformations with similar curves will have similar effects, even though the formulas look quite different. Graphical similarity, however, must be judged cautiously, as the eye is easily fooled.

The benefits of transforming

The relationship of y to other variables may be simpler than that of x. For example, y may have a straight-line relationship to a variable u although x does not. As another example, y may depend “additively” on u and v even though x does not.

Suppose the variance of x is not constant but changes as other variables change. In many cases it is possible to arrange for the variance of y to be nearly constant.

In some cases the distribution of y may be much more like the normal (Gaussian) distribution or some other desired distribution than is that of x.

Thus, the benefits of transforming are usually said to be (1) simpler relationships, (2) more stable variance, (3) improved normality (or closeness to another standard distribution). Where it is necessary to choose between these, (1) is usually more important than (2), and (2) is usually more important than (3). However, many authors have remarked that frequently (although not invariably) a single transformation achieves two or all three at once.

In some cases analysis of y, however illuminating, is not to the point, because some fact about x itself is needed (usually the expected value) and the corresponding fact about y is not an acceptable substitute. If so, it is often better not to transform, although sometimes it is desirable and feasible to obtain the necessary fact about x from information about y.

The ultimate profit. The benefits listed above are not ultimate profit but merely a means to achieve it. The ultimate profit, however, is difficult to describe, for it occurs during the creative process of interpreting data. A transformation may directly aid interpretation by allowing the central information in the data to be expressed more succinctly. It may permit a subsequent stage of analysis to be simpler, more accurate, or more revealing.

Later in this article an attempt will be made to illustrate these elusive ideas. However, the first examples primarily illustrate the immediate benefits, rather than the ultimate profit.

Some simple examples

Some of the most profitable transformations seem almost as basic as the laws of nature. Three such transformations in psychology are all logarithmic: from sound pressure to the decibel scale of sound volume in the study of hearing; from light intensity to its logarithm in the study of vision; and from tone frequency in cycles per second to tone pitch on the musical scale. In each case many benefits are obtained.

To simplify curves. Transformations may be used to display the relationship between two variables in a simple form, for example, as a straight line or as a family of straight lines or of parallel curves. One or both variables may be transformed. Figure 1 is a simple illustration, using hypothetical data.

One easy way to transform while plotting is to use graph paper with a special scale. Two widely used special scales are the logarithmic scale and the normal probability scale, which correspond, respectively, to the logarithmic and “probit” transformations. [See Graphic Presentation.]

To stabilize variance. The analysis of many experiments is simpler if the variance of response is approximately constant for different conditions.

Crespi (1942, pp. 483-485) described very clearly how he first used the time required by a rat to run down a 20-foot runway as a measure of its eagerness to obtain food but found that the response variance differed greatly for experimental conditions with different average responses. This hampered the intended analyses of variance, which require approximate constancy of variance. However, the transformation to speed = (20 feet)/(time) removed this difficulty entirely. In short, the reciprocal transformation helpfully stabilized the variance.

Furthermore, Crespi indicated reasons why in this context speed should be better than time as a measure of eagerness. It happens quite often that performance times, latency times, reaction times, and so forth can benefit from a reciprocal or a logarithmic transformation.

To improve normality. Many statistical techniques are valid only if applied to data whose distribution is approximately normal. When the data are far from normal, a transformation can often be used to improve normality. For example, the distribution of personal income usually is skewed strongly to the right and is very nonnormal. Applying the logarithmic transformation to income data often yields a distribution which is quite close to normal. Thus, to test for difference of income level between two groups of people on the basis of two samples, it may well be wiser to apply the usual tests to logarithm of income than to income itself.
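The effect of the logarithm on a right-skewed variable can be seen in a small simulation. This sketch is illustrative only: the “incomes” are drawn from a lognormal distribution (an assumption, so their logarithms are normal by construction), and the skewness function is a plain moment-based measure introduced for the example.

```python
import math
import random

def skewness(values):
    """Third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated incomes (not real data): lognormal with arbitrary parameters.
incomes = [random.lognormvariate(10.0, 0.8) for _ in range(5000)]
log_incomes = [math.log(v) for v in incomes]

print(f"skewness of income:      {skewness(incomes):.2f}")
print(f"skewness of log(income): {skewness(log_incomes):.2f}")
```

Real income data will not follow a lognormal law exactly, so with real data the skewness of the logarithms would be reduced rather than removed.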

To aid interpretation. In a letter to Science, Wald (1965) made a strong plea for plotting spectra (of electromagnetic radiation) as a function of frequency rather than of wavelength; frequency plots are now much commoner. Frequency and wavelength are connected by a reciprocal transformation, and most of the reasons cited for preferring the frequency transformation, both in Wald’s letter and in later, supporting letters, have to do with ease of interpretation. For example, frequency is proportional to the very important variable, energy. On a frequency scale, but not on a wavelength scale, the area under an absorption band is proportional to the “transition probability,” the half-width of the band is proportional to the “oscillator strength,” the shape of the band is symmetrical, and the relationship between a frequency and its harmonics is easier to see.

To quantify qualitative data. Qualitative but ordered data are often made quantitative in a rather arbitrary way. For example, if a person orders n items by preference, each item may be given its numerical rank from 1 to n [see Nonparametric Statistics, article on Ranking Methods]. In such cases the possibility of transforming the ranks is especially relevant, as the original space between the ranks deserves no special consideration. Quite often the ranks are transformed into “normal scores,” so as to be approximately normally distributed [see Psychometrics].

To deal with counted data. Sometimes the number of people, objects, or events is of interest. For example, if qualified applicants to a medical school are classified by age, ethnic group, and other characteristics, the number in each group might be under study. [See Counted Data.]

[Figure 2 note: for c = 0, y = a + b log x.]

If the observed values have a wide range (say, the ratio of largest to smallest is at least 2 or 3), then a transformation is especially likely to be beneficial. Figure 2 shows a whole family of transformations which are often used with counted data and in many other situations as well. Each of these is essentially the same as y = x^c for some constant c. (For greater ease of comparison, however, the plotted curves show y = a + bx^c. By a standard convention, y = log x substitutes for y = x⁰ in order to have the curves change smoothly as c passes through the value 0. No other function would serve this purpose.)
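The smooth passage through c = 0 can be illustrated with one particular choice of the constants, a = −1/c and b = 1/c, so that y = (x^c − 1)/c. This choice is an assumption made for the sketch (the constants used in the figure are not stated here); with it, y approaches log x as c approaches 0.

```python
import math

def power_transform(x, c):
    """y = (x**c - 1) / c for c != 0, and y = log(x) for c == 0."""
    if c == 0:
        return math.log(x)
    return (x ** c - 1.0) / c

x = 5.0  # an arbitrary positive value
for c in (1.0, 0.5, 0.1, 0.01, 0.0, -0.01, -0.1, -0.5, -1.0):
    print(f"c = {c:+.2f}   y = {power_transform(x, c):.4f}")
```

The printed values change gradually as c moves from positive to negative, with the logarithm filling the place of c = 0.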

For example, suppose the observed values x come from Poisson distributions, which are quite common for counted data. Then the variance of x equals its expected value, so different x’s may have very different variances. However, the variance of y = √x is almost constant, with value ¼ (unless the expected value of x is very close to 0). For various purposes, such as testing for equality, y is better than x. Often, simpler relationships also result from transforming.
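The near-constancy of the variance after a square root transformation can be checked by direct summation over the Poisson probabilities. The sketch below is illustrative; the function name variance_of and the cutoff used to truncate the sum are assumptions of the example.

```python
import math

def variance_of(transform, mean):
    """Variance of transform(X) when X is Poisson with the given mean."""
    k_max = int(mean + 12 * math.sqrt(mean) + 30)  # upper tail beyond this is negligible
    p = math.exp(-mean)                            # P(X = 0)
    first = second = 0.0
    for k in range(k_max + 1):
        t = transform(k)
        first += p * t
        second += p * t * t
        p *= mean / (k + 1)                        # P(X = k + 1) from P(X = k)
    return second - first ** 2

for mean in (5, 20, 100):
    var_x = variance_of(lambda k: float(k), mean)
    var_sqrt = variance_of(math.sqrt, mean)
    print(f"mean = {mean:>3}   var(x) = {var_x:8.3f}   var(sqrt x) = {var_sqrt:.4f}")
```

The variance of x grows in step with the mean, while the variance of √x stays close to ¼, more closely so for larger means.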

Other values of c are also common. In only four pages Taylor (1961) has displayed 24 sets of counted biological data to which values of c over the very wide range from 0.65 to -0.54 are appropriate!

[Figure 3 note: in the top part of the figure, the scale shown on the Lo axis is used for the Pr, An, and r axes as well.]

If the observations include 0 or cover a very wide range, it is common to use various modifications, such as y = (x + k)^c, with k a small constant, often ½ or 1. There has been considerable investigation of how well various modifications stabilize the variance under certain assumptions, but most of these modifications differ so slightly that interest in them is largely theoretical.

To deal with fractions. A very common form of data is the fraction or percentage, p, of a group who have some characteristic (such as being smokers). If the observed percentages include some extreme values (say, much smaller than 10 per cent or much larger than 90 per cent), a transformation is usually beneficial. Figure 3 shows the three transformations most frequently used with fractions: angular, probit, and logistic. Their formulas are given in Table 8, below. Of course, nothing prevents the use of transformations in the absence of extreme values, or the use of other transformations.

The upper part of Figure 3 displays the three transformations in a different way, showing how they “stretch the ends” of the unit interval relative to the middle.

Suppose a study is designed to compare different groups of men for propensity to smoke. The variance of p (the observed proportion of smokers in a sample from one group) depends on the true proportion, p*, of that group. For extreme values of p*, the variance of p gets very small. The nonconstant variance of p hinders many comparisons, such as tests for equality.

One possible remedy is to transform. Each of the three transformations mentioned is likely to make the variance more nearly stable. Each can be justified theoretically in certain circumstances. (For example, if p has the binomial distribution, the angular transformation is indicated.) However, transforming fractions often has practical value, even in the absence of such theory, and this value may include benefits other than variance stabilization.
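For the binomial case, the stabilizing effect of the angular transformation can be checked by exact summation. The sketch assumes the usual form of the transformation, arcsin √p in radians, and a hypothetical sample size of 100 per group; the function name variances is introduced only for the example.

```python
import math

def variances(n, p_true):
    """Exact variances of p = X/n and of arcsin(sqrt(p)) for X ~ Binomial(n, p_true)."""
    sum_p = sum_p2 = sum_a = sum_a2 = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p_true ** k * (1 - p_true) ** (n - k)
        p = k / n
        a = math.asin(math.sqrt(p))  # angular transformation, in radians
        sum_p += prob * p
        sum_p2 += prob * p * p
        sum_a += prob * a
        sum_a2 += prob * a * a
    return sum_p2 - sum_p ** 2, sum_a2 - sum_a ** 2

n = 100  # hypothetical sample size per group
for p_true in (0.05, 0.1, 0.3, 0.5):
    var_p, var_angular = variances(n, p_true)
    print(f"p* = {p_true:.2f}   var(p) = {var_p:.5f}   var(angular) = {var_angular:.5f}")
```

The variance of p changes severalfold across these true proportions, whereas the variance of the transformed value stays near 1/(4n), which is 0.0025 here.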

There is an important caution to keep in mind when using the angular transformation for proportions, the square root transformation for Poisson data, or other transformations leading to theoretically known variances under ideal conditions. These transformations may achieve stabilization of variance even where ideal conditions do not hold. The stabilized variance, however, is often much larger than the theoretically indicated one. Thus, when using such transformations it is almost always advisable to use a variance estimated from the transformed values themselves rather than the variance expected under theoretical conditions.

To deal with correlation coefficients . Fisher’s z-transformation is most commonly used on correlation coefficients, r, and occasionally on other variables which go from −1 to +1 [see Multivariate Analysis, articles on Correlation]. In Figure 3 the curve of Fisher’s z-transformation coincides with the curve of the logistic transformation, because the two are algebraically identical if r = 2p − 1. Generally speaking, remarks similar to those made about fractions apply to correlation coefficients.
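The stated identity is easy to confirm numerically. The sketch below assumes the logistic transformation is scaled as ½ log[p/(1 − p)]; with the unscaled form log[p/(1 − p)] the two curves differ only by a factor of 2, which is a linear relation. The function names are illustrative.

```python
import math

def fisher_z(r):
    """Fisher's z-transformation, z = (1/2) log[(1 + r)/(1 - r)]."""
    return 0.5 * math.log((1 + r) / (1 - r))

def half_logit(p):
    """Logistic transformation scaled by 1/2 (an assumed scaling)."""
    return 0.5 * math.log(p / (1 - p))

for p in (0.05, 0.25, 0.5, 0.9):
    r = 2 * p - 1  # map a fraction in (0, 1) to a value in (-1, +1)
    print(f"p = {p:.2f}  r = {r:+.2f}  z = {fisher_z(r):+.4f}  half-logit = {half_logit(p):+.4f}")
```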

To improve additivity . Suppose x is influenced by two other variables, u and v; for example, suppose x has the values shown in Table 1, for two unspecified values of u and three unspecified values of v.

*Table 1 — Values of x*
	v₁	v₂	v₃
u₁	27	28	35	a₁ = -10
u₂	47	48	55	a₂ = +10
	b₁ = -3	b₂ = -2	b₃ = +5	m = 40

Examine the values of x. You will note that the difference between corresponding entries in the two rows is always 20, whichever column they are in. Likewise, the difference between corresponding entries in any two columns is independent of the particular row they are in. (The difference is 1 for the first two columns, 7 for the last two columns, and 8 for the first and third columns.) These relations, which greatly simplify the study of how x depends on u and v, are referred to as additivity. (For a discussion of additivity in another sense, see Scheffé 1959, pp. 129-133.)

An alternative definition of additivity, whose equivalence is established by simple algebra, is phrased in terms of addition and accounts for the name. Call x additive in u and v if its values can be reconstructed by adding a row number (shown here as a_i) and a column number (shown here as b_j), perhaps plus an extra constant (shown here as m). For example, considering x₁₁ in the first row and column,

27 = (−10) + (−3) + 40

while, in general,

x_ij = a_i + b_j + m.

The extra constant (which is not essential, because it could be absorbed into the row numbers or the column numbers) is commonly taken to be the “grand mean” of all the entries, as it is here, and the row and column numbers are commonly taken, as here, to be the row and column means less the grand mean. [See Linear Hypotheses, article on ANALYSIS OF VARIANCE.]
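The decomposition just described can be sketched in a few lines. The function name `additive_fit` is illustrative; the data are those of Table 1, and the sketch assumes one value per cell.

```python
from statistics import mean

def additive_fit(table):
    """Return the grand mean m, row numbers a, column numbers b, and the
    residuals x_ij - (a_i + b_j + m) for a table with one value per cell."""
    m = mean(value for row in table for value in row)
    a = [mean(row) - m for row in table]        # row means less the grand mean
    b = [mean(col) - m for col in zip(*table)]  # column means less the grand mean
    residuals = [
        [value - (a_i + b_j + m) for value, b_j in zip(row, b)]
        for row, a_i in zip(table, a)
    ]
    return m, a, b, residuals

table_1 = [
    [27, 28, 35],  # u1
    [47, 48, 55],  # u2
]

m, a, b, residuals = additive_fit(table_1)
print("m =", m)
print("a =", a)
print("b =", b)
print("residuals =", residuals)
```

All residuals are zero exactly when the table is additive, so the size of the residuals gives a direct measure of departure from additivity.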

To understand how additivity can be improved by a transformation, consider tables 2 and 3. Neither is additive. Suppose Table 2 is transformed by y = √x. This yields Table 4, which is clearly additive. The same transformation applied to Table 3 would yield values which are additive to a good approximation. Usually, approximate additivity is the best one can hope for.

*Table 2 — Values of x*
	v₁	v₂	v₃
u₁	1	4	9
u₂	4	9	16
u₃	9	16	25

*Table 3 — Values of x*
	v₁	v₂	v₃
u₁	1.2	4.1	9.2
u₂	4.3	9.4	16.2
u₃	9.1	16.5	25.4

*Table 4 — Values of y*
	v₁	v₂	v₃	a_i
u₁	1	2	3	-1
u₂	2	3	4	0
u₃	3	4	5	+1
b_j	-1	0	+1	3 = m
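A short check of the claim about Table 3, sketched under the assumption that departure from additivity is measured by the largest residual left after fitting a grand mean, row numbers, and column numbers. The helper name is illustrative.

```python
import math
from statistics import mean

def max_abs_residual(table):
    """Largest absolute residual after fitting grand mean + row number + column number."""
    m = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    return max(
        abs(v - r - c + m)
        for row, r in zip(table, row_means)
        for v, c in zip(row, col_means)
    )

table_3 = [
    [1.2, 4.1, 9.2],
    [4.3, 9.4, 16.2],
    [9.1, 16.5, 25.4],
]
root_table_3 = [[math.sqrt(v) for v in row] for row in table_3]

print("largest residual, original scale:   ", round(max_abs_residual(table_3), 3))
print("largest residual, square-root scale:", round(max_abs_residual(root_table_3), 3))
```

On the original scale the largest residual is about 2; after taking square roots it is a few hundredths, although the transformed values themselves run from about 1 to 5.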

Whether or not a transformation can improve additivity depends on the data. Thus, for tables 5 and 6, which are clearly nonadditive, no one-to-one transformation can produce even approximate additivity.

*Table 5 — Values of x*
	v₁	v₂
u₁	1	0
u₂	0	1

*Table 6 — Values of x*
	v₁	v₂	v₃
u₁	0	1	2
u₂	1	2	0
u₃	2	0	1
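For Table 5 the claim can be verified in one line. Let f be any transformation and write y = f(x). Additivity requires the difference between the two columns to be the same in both rows:

```latex
f(x_{11}) - f(x_{12}) = f(x_{21}) - f(x_{22})
\;\Longrightarrow\;
f(1) - f(0) = f(0) - f(1)
\;\Longrightarrow\;
f(1) = f(0).
```

Exact additivity therefore forces f to take the same value at 0 and at 1, which a one-to-one transformation cannot do. Moreover, whatever f is, both rows and both columns of the transformed table have the same mean, so all of the variation in the transformed table is nonadditive.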

The concept of additivity is also meaningful and important when x depends on three variables, u, v, and w, or even more. Transformations are just as relevant to improve additivity in these cases.

Empirical transformations . Sometimes transformations are constructed as tables of numerical values and are not conveniently described by mathematical formulas. In this article such transformations are called empirical.

One use of empirical transformations is to improve additivity in u and v. J. B. Kruskal (1965) has described a method for calculating a monotonic transformation of x, carefully adapted to the given data, that improves additivity as much as possible according to a particular criterion.

Sometimes it is worthwhile to transform quantitative data into numerical ranks 1, 2, ... in order of size [for example, this is a preliminary step to many nonparametric, or distribution-free, statistical procedures; for discussion, see Nonparametric Statistics, article on Ranking Methods]. This assignment of ranks can be thought of as an empirical transformation which leaves the data uniformly distributed. If normal scores are used instead of ranks, the empirical transformation leaves the distribution of the transformed data nearly normal.
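Both empirical transformations can be sketched briefly. The data below are hypothetical, ties are assumed absent, and the normal scores are approximated by standard normal quantiles at (rank − 0.5)/n, which is one common convention among several.

```python
from statistics import NormalDist

def ranks(values):
    """Rank 1 for the smallest value, n for the largest (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def approximate_normal_scores(values):
    """Standard normal quantile at (rank - 0.5)/n for each value.
    This plotting-position formula is an assumed convention."""
    n = len(values)
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks(values)]

data = [3.1, 10.0, 0.2, 57.0, 4.4]  # hypothetical, strongly skewed readings
data_ranks = ranks(data)
data_scores = approximate_normal_scores(data)

print(data_ranks)
print([round(s, 3) for s in data_scores])
```

Whatever the shape of the original data, the ranks are evenly spaced and the scores are symmetric about zero.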

Basic concepts

Linear transformations . A linear transformation y = a + bx, with b not 0, is often convenient, to shift the decimal point (for example, y = 1,000x), or to avoid negative values (for example, y = 5 + x) or for other reasons. Such transformations (often called coding) have no effect whatsoever on the properties of interest here, such as additivity, linearity, variance stability, and normality. Thus, linearly related variables are often considered equivalent or even the “same” in the context of transformations. To study the form of a transformation, a linearly related variable may be plotted instead. Thus, the comparison of power transformations y = x^c in Figure 2 has been simplified by plotting y = a + bx^c, with a and b chosen for each c to make the curve go through two fixed points.

Monotonic transformations . A transformation is called (monotonic) increasing if, as in y = log x, y gets larger as x gets larger. On a graph its curve always goes up as it goes to the right. A decreasing transformation, like y = 1/x (where x is positive), goes the other way.

Data transformations of practical interest (in the sense of this article) are almost always either increasing or decreasing, but not mixed. The term “monotonic” (or, more precisely, “strictly monotonic”) covers both cases. Sometimes the word “monotone” is used for “monotonic.”

Region of interest . The region of interest in using a transformation is the region on the x-axis in which observed data values might reasonably be found or in which they actually lie. The characteristics of a transformation are relevant only in its region of interest. In particular, it need be monotonic only there.

Even though y = x² is not monotonic (it increases for x positive and decreases for x negative), it can be sensible to use y = x² for observations, x, that are necessarily positive.

Mild and strong transformations . If the graph of a transformation is almost a straight line in the region of interest, it is described as mild, or almost linear. If, on the contrary, it is strongly curved, it is called strong. Note, however, that visual impressions are sensitive to the sizes of the relative scales of the x-axis and y-axis.

A very mild transformation is useful, as a preliminary step, only when the subsequent analysis seeks maximum precision. On the other hand, if a strong transformation is appropriate, it may provide major benefits even for very approximate methods, such as visual inspection of graphical display (and for precise analysis also, of course).

The strength of a transformation depends critically on the region of interest. For x from 1 to 10, y = log x is fairly strong, as Figure 2 shows. From 5 to 6 it is quite mild, and from 5 to 5.05 it is virtually straight. From 1 to 1,000,000 it is very strong indeed.
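One way to put a number on these visual impressions is sketched below. The index used, the largest gap between the curve and the straight line joining its end points, taken as a fraction of the total change in y, is an illustrative choice and not a measure defined in this article. It is unaffected by linear rescaling of either axis.

```python
import math

def relative_bend(f, lo, hi, steps=1000):
    """Largest gap between f and the chord joining its end points on [lo, hi],
    expressed as a fraction of the total change in f over the interval."""
    f_lo, f_hi = f(lo), f(hi)
    largest_gap = 0.0
    for i in range(steps + 1):
        t = i / steps
        x = lo + t * (hi - lo)
        chord = f_lo + t * (f_hi - f_lo)
        largest_gap = max(largest_gap, abs(f(x) - chord))
    return largest_gap / abs(f_hi - f_lo)

regions = [(1, 1_000_000), (1, 10), (5, 6), (5, 5.05)]
bends = [relative_bend(math.log, lo, hi) for lo, hi in regions]

for (lo, hi), bend in zip(regions, bends):
    print(f"x from {lo} to {hi}: relative bend = {bend:.4f}")
```

The index falls steadily from the widest region to the narrowest, in line with the ordering described above.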

Effect of transforming on relations among averages. Transforming can change the relationship among average values quite drastically. For example, consider the hypothetical data given in Table 7 for two rats running through a 20-foot channel. Which rat is faster? According to average time, rat 2 is slightly faster, but according to average speed, it is only half as fast! Thus, even the answer to this simple question may be altered by transformation. More delicate questions are naturally much more sensitive to transformation. This shows that use of the correct transformation may be quite important in revealing structure.

*Table 7 — Hypothetical times and speeds*
	Time (seconds)	Corresponding speed (feet per second)
Rat 1
Trial 1	50.0	0.4
Trial 2	10.0	2.0
Trial 3	5.6	3.6
Average	21.9	2.0
Rat 2
Trial 1	20.0	1.0
Trial 2	20.0	1.0
Trial 3	20.0	1.0
Average	20.0	1.0
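The reversal in Table 7 can be reproduced directly from the times. This is a small sketch; the variable names are illustrative, and speeds are computed as 20 feet divided by the time rather than read from the rounded table entries.

```python
from statistics import mean

CHANNEL_LENGTH_FEET = 20.0

times_seconds = {
    "Rat 1": [50.0, 10.0, 5.6],
    "Rat 2": [20.0, 20.0, 20.0],
}

average_time = {rat: mean(times) for rat, times in times_seconds.items()}
average_speed = {
    rat: mean(CHANNEL_LENGTH_FEET / t for t in times)
    for rat, times in times_seconds.items()
}

for rat in times_seconds:
    print(f"{rat}: average time = {average_time[rat]:.1f} s, "
          f"average speed = {average_speed[rat]:.2f} ft/s")
```

Rat 2 has the smaller average time, yet Rat 1 has about twice the average speed, because the reciprocal transformation gives great weight to the short times.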

The effect of transformations on expected values and variances. The transform of the expected value does not equal the expected value of the transform, although they are crude approximations of each other. This is clearer in symbols. Because interest usually lies in estimating the expected value, E(x), from information about y = f(x), rather than vice versa, it is convenient to invert the transformation and write x = g(y), where g is f⁻¹. Then, to a crude approximation,

E(x) = E[g(y)] ≅ g[E(y)].

The milder the transformation is, the better this approximation is likely to be, and for linear transformations it is precisely correct.

In practice one usually uses an estimate for E(y), often the average value, ȳ. Substituting the estimate for the true value is a second step of approximation: E(x) is crudely approximated by g(ȳ).

One simple improvement, selected from many which have been used, is

E(x) ≅ g[E(y)] + ½ g″[E(y)] · var(y),

where g″ denotes the second derivative of g and var(y) denotes the variance of y. (This approximation, since it is based on a Taylor expansion through the quadratic term, is exact if g(y) is a quadratic polynomial, a situation that occurs, for example, for y = √x, where g(y) = y².) Substituting estimates, such as ȳ for E(y) and s²_y, the sample variance, for var(y), is a second step of approximation:

E(x) ≅ g(ȳ) + ½ g″(ȳ) · s²_y.

Advanced work along these lines is well represented by Neyman and Scott (1960).
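The gain from the second-order term can be illustrated numerically. The sketch below compares the crude approximation g[E(y)] with the improved one, g[E(y)] plus one half of g″[E(y)] times var(y). The five y values are hypothetical and are treated as the whole distribution of y, so that expected values are plain averages.

```python
import math
from statistics import mean, pvariance

y_values = [0.0, 0.2, 0.4, 0.6, 0.8]  # hypothetical transformed readings

def compare(g, g_second, y):
    """Return (exact mean of g(y), crude estimate, second-order estimate)."""
    y_bar = mean(y)
    var_y = pvariance(y)  # variance of the list itself, treated as the full distribution
    exact = mean(g(v) for v in y)
    crude = g(y_bar)
    improved = crude + 0.5 * g_second(y_bar) * var_y
    return exact, crude, improved

# y = log x, so x = g(y) = e^y and g'' = e^y
exp_exact, exp_crude, exp_improved = compare(math.exp, math.exp, y_values)

# y = sqrt(x), so x = g(y) = y^2 and g'' = 2: the quadratic case
sq_exact, sq_crude, sq_improved = compare(lambda v: v * v, lambda v: 2.0, y_values)

print(f"exponential: exact {exp_exact:.5f}, crude {exp_crude:.5f}, improved {exp_improved:.5f}")
print(f"square:      exact {sq_exact:.5f}, crude {sq_crude:.5f}, improved {sq_improved:.5f}")
```

For the exponential the correction removes most of the error of the crude value; for the square it removes all of it, as expected for a quadratic g.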

For variances, the simplest approximation is var(x) ≅ var(y) · {g′[E(y)]}², where g′ is the first derivative of g. This is fairly good for mild transformations and perfect for linear ones. After estimates are substituted, with s²_y the sample variance of y, the estimate becomes

var(x) ≅ s²_y · [g′(ȳ)]².

One-bend transformations . Most transformations of practical use have only one bend (as in Figure 2) or two bends (as in Figure 3). A curious fact is that the increasing one-bend transformations most often used bend downward, like y = log x, rather than upward, like y = x². For decreasing transformations, turn the graph upside down and the same thing holds true.

Some important families of one-bend transformations are

y = log(x + k), logarithmic family;

y = √(x + k), square-root family;

y = x^c, power family;

y = e^(cx), exponential family;

y = (x + k)^c, “simple” family.

Here k and c are constants, and e ≅ 2.718 is a familiar constant. The region of interest is generally restricted either to x ≥ 0 or to x ≥ -k. Of course, each family also includes linearly related transformations (for example, y = a + bx^c is in the power family).

The “simple” family (named and discussed in Tukey 1957) obviously includes two of the other families and, by a natural extension to mathematical limits, includes the remaining two as well.

Two-bend transformations . Only those two-bend transformations mentioned above (angular, probit, logistic, and z) are in general use, although the log-log transformation, given by y = log(– log x), is sometimes applied. A whole family of varying strengths would be desirable (such as p^c − (1 − p)^c), but no such family appears to have received more than passing mention in the literature.

Several formulas and names for two-bend transformations are presented in Table 8. Tables of these transformations are generally available. (For references, consult Fletcher et al. 1946 and the index of Greenwood & Hartley 1962.)

*Table 8 — Two-bend transformations*
Forward forms	Backward forms	Names
y = arcsin √p	p = sin²y	angular; arcsine; inverse sine
y = Erf⁻¹(p) = Φ⁻¹(p); y = Φ⁻¹(p) + 5 (often)	p = Erf(y) = Φ(y) = normal probability up to y	probit^a; normit; phi-gamma
y = ½ log[p/(1 − p)]	p = e^(2y)/(1 + e^(2y))	logistic^b; logit
y = ½ log[(1 + r)/(1 − r)]	r = (e^(2y) − 1)/(e^(2y) + 1)	Fisher’s z^b; z; hyperbolic arctangent

a. Under the name “probit” the addition of 5 is usual, to avoid negative values.

b. Natural logarithms are generally used here.

When to use and how to choose a transformation

Many clues suggest the possible value of transforming. Sometimes the same clue not only gives this general indication but also points to the necessary transformation or to its general shape.

When using a transformation, it is essential to visualize its plotted curve over the region of interest. If necessary, actually plot a few points on graph paper. (In several published examples the authors appear to be unaware that the transformations are so mild as to be useless in their context.)

If a quantity, such as the expected value, is needed for x itself, consider whether this quantity is best found by working directly with x or indirectly through the use of some transform. However, need should be judged cautiously; although it is often real, in many cases it may vanish on closer examination.

A word of caution: “outliers” can simulate some of the clues below and falsely suggest the need for transforming when, in fact, techniques for dealing with outliers should be used [see Statistical Analysis, Special Problems of, article on Outliers].

Some simple clues . Very simple and yet very strong clues include counted data covering a wide range, fraction data with some observations near 0 or 1, and correlation coefficient data with some values near –1 or +1. Generally, observations which closely approach an intrinsic boundary may benefit by a transformation which expands the end region, perhaps to infinity.

If several related curves present a complex but systematic appearance, it is often possible to simplify them, as in Figure 1. For example, they might all become straight lines, or the vertical or horizontal spacing between curves might become constant along the curves.

A very nonnormal distribution of data is a clue, although, by itself, a weak one. For this purpose, nonnormality is best judged by plotting the data on “probability paper” [see Graphic Presentation]. The general shape of a normalizing transformation can be read directly from the plot.

Nonconstant variance . Suppose there are many categories (such as the cells in a two-way table) and each contains several observations. Calculate the sample variance, s²_ij, and the average value, a_ij, in each category. If the s²_ij vary substantially, make a scatter plot of the s²_ij against the a_ij. If the s²_ij tend to change systematically along the a-axis, then a variance-stabilizing transformation is possible and often worthwhile. Usually the scatter plot is more revealing if plotted on paper with both scales logarithmic, so that log s²_ij is plotted against log a_ij.

To choose the transformation, the relationship between the s²_ij and the a_ij must be estimated. Using even a crude estimate may be better than not transforming at all. Suppose the estimated relationship is s²_ij = V(a_ij). (Commonly, this is s²_ij = ka^c_ij. This is a straight line on log-log paper. Taylor [1961] contains many such examples.) It can be shown that y = ∫ dx/√V(x) stabilizes the variance (approximately). If s²_ij = ka_ij (for Poisson distributions, k = 1), this leads essentially to y = √x. For s²_ij = ka^c_ij it leads essentially to y = x^d, with d = 1 − c/2 (or y = log x if d = 0).
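The choice of power from a log-log plot can be sketched as follows. The cell means and variances are hypothetical, the straight line is fitted by ordinary least squares (one possible way of making the crude estimate), and the function names are illustrative.

```python
import math

def fit_power_law(cell_means, cell_variances):
    """Least-squares line of log(variance) on log(mean).
    Returns (k, c) for the fitted relationship variance = k * mean**c."""
    xs = [math.log(a) for a in cell_means]
    ys = [math.log(s2) for s2 in cell_variances]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope /= sum((x - x_bar) ** 2 for x in xs)
    return math.exp(y_bar - slope * x_bar), slope

def stabilizing_power(c):
    """Exponent d in y = x**d; d == 0 signals that y = log x should be used."""
    return 1 - c / 2

# Hypothetical cell summaries in which the variance is 1.5 times the mean
cell_means = [2.0, 5.0, 11.0, 24.0]
cell_variances = [3.0, 7.5, 16.5, 36.0]

k, c = fit_power_law(cell_means, cell_variances)
d = stabilizing_power(c)
print(f"fitted k = {k:.3f}, c = {c:.3f}; suggested transformation y = x^{d:.2f}")
```

Here the fitted slope is 1, so the suggested transformation is the square root, as for Poisson-like counts. A slope of 2 (standard deviation proportional to the mean) would give d = 0, that is, the logarithm.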

Removable nonadditivity . When nonadditivity can be removed by a transformation, it is almost always worthwhile to do so. To recognize nonadditivity is often easy, either by direct examination or by the size of the interaction terms in an analysis of variance [see Linear Hypotheses, article on Analysis of Variance]. To decide how much of it is removable may be harder.

With experience, in simple cases one can often recognize removable nonadditivity by direct examination and discover roughly the shape of the required transformation. If Tukey’s “one degree of freedom for non-additivity” (see Moore & Tukey 1954; or Scheffé 1959, pp. 129-133) yields a large value of F, some nonadditivity is removable. Closely related is the scatter-plot analysis of residuals given by Anscombe and Tukey (1963, sec. 10). Kruskal’s method (1965) directly seeks the monotone transformation leaving the data most additive.
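A minimal sketch of Tukey's calculation for a two-way table with one value per cell, using the usual textbook form of the statistic; the function name is illustrative, and the example uses Table 2, whose nonadditivity is known to be removable by a square root.

```python
from statistics import mean

def tukey_one_df(table):
    """Tukey's one degree of freedom for non-additivity, one value per cell.
    Returns (sum of squares for nonadditivity, remaining interaction sum of
    squares, F ratio on 1 and (rows - 1)(columns - 1) - 1 degrees of freedom)."""
    n_rows, n_cols = len(table), len(table[0])
    m = mean(v for row in table for v in row)
    a = [mean(row) - m for row in table]        # row effects
    b = [mean(col) - m for col in zip(*table)]  # column effects
    interaction_ss = sum(
        (v - m - a_i - b_j) ** 2
        for row, a_i in zip(table, a)
        for v, b_j in zip(row, b)
    )
    cross = sum(v * a_i * b_j for row, a_i in zip(table, a) for v, b_j in zip(row, b))
    nonadd_ss = cross**2 / (sum(x * x for x in a) * sum(x * x for x in b))
    remainder_ss = interaction_ss - nonadd_ss
    remainder_df = (n_rows - 1) * (n_cols - 1) - 1
    f_ratio = nonadd_ss / (remainder_ss / remainder_df)
    return nonadd_ss, remainder_ss, f_ratio

table_2 = [
    [1, 4, 9],
    [4, 9, 16],
    [9, 16, 25],
]

nonadd_ss, remainder_ss, f_ratio = tukey_one_df(table_2)
print(f"nonadditivity SS = {nonadd_ss:.3f}, remainder SS = {remainder_ss:.3f}, F = {f_ratio:.1f}")
```

Almost all of the interaction sum of squares (16) falls in the single degree of freedom, giving a very large F. The sketch does not guard against a perfectly additive table, for which the remainder is zero and the ratio is undefined.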

How to choose the transformation . In addition to the methods mentioned above, one or more transformations may be tried quite arbitrarily. If significant benefits result, then the possibility of greater benefits from a stronger, weaker, or modified transformation may be investigated.

A whole family of transformations can, in effect, be tried all at once, with the parameter values chosen to optimize some criterion. The important paper by Box and Cox (1964) gives a good discussion of this method. Another useful approach is provided in Kruskal (1965).
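The idea of trying a whole family at once can be sketched with the power family and the data of Table 3. The criterion used here, the interaction sum of squares as a fraction of the total sum of squares, is a simple stand-in chosen for illustration; it is neither the likelihood criterion of Box and Cox nor Kruskal's criterion. The grid of candidate powers is likewise an assumption.

```python
import math
from statistics import mean

def nonadditive_fraction(table):
    """Interaction sum of squares as a fraction of the total sum of squares."""
    m = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    interaction = sum(
        (v - r - c + m) ** 2
        for row, r in zip(table, row_means)
        for v, c in zip(row, col_means)
    )
    total = sum((v - m) ** 2 for row in table for v in row)
    return interaction / total

def power_transform(table, c):
    """y = x**c, with y = log x standing in for c == 0."""
    if c == 0:
        return [[math.log(v) for v in row] for row in table]
    return [[v**c for v in row] for row in table]

table_3 = [
    [1.2, 4.1, 9.2],
    [4.3, 9.4, 16.2],
    [9.1, 16.5, 25.4],
]

candidate_powers = [0.0, 0.25, 0.5, 0.75, 1.0]  # illustrative grid
scores = {c: nonadditive_fraction(power_transform(table_3, c)) for c in candidate_powers}
best_power = min(scores, key=scores.get)

for c in candidate_powers:
    print(f"c = {c:.2f}: nonadditive fraction = {scores[c]:.5f}")
print("best power on this grid:", best_power)
```

Because the criterion is a ratio of sums of squares, it is unchanged by linear rescaling of y, so powers can be compared fairly even though they put y on very different scales.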

An illustration—galvanic skin response

It has long been known that electrical resistance through the skin changes rapidly in response to psychological stimuli—a phenomenon known as galvanic skin response (GSR) or electrodermal response. Originally, the only scale used for analyzing GSR was that of electrical resistance, R, measured in ohms (or in kilohms). As early as 1934, Darrow (see 1934; 1937) suggested the use of electrical conductance, C = 1/R, measured in mhos (or in micromhos), and later log C, as well as various modifications. More recently, other scales, such as √C, have also been used, and the topic has continued to receive attention up to the present.

Although agreement on “the best scale” has not been reached, many authors who treat this question agree that R is a very poor scale and that both C and logC provide substantial improvement. Most experimenters now use either C or logC, but a few still use R (and some fail to specify which scale they are using), more than thirty years after Darrow’s original paper!

Lacey and Siegel (1949) have discussed various transformations, using their own experimental results. Their final recommendation is C. Using 92 subjects, they measured the resistance of each one twice, first while the subject was sitting quietly, then just after the subject had received an unexpected electric shock. Call the two values R₀ (before shock) and R₁ (after shock). Each subject received only one shock, and presumably all the shocks were of the same strength.

For any scale y (whether R, log C, C, or another), let y₀ and y₁ be the two values (before and after) and let GSR = y₁ − y₀. (A separate question, omitted here, is whether y₁ − y₀ itself should be transformed before use as the GSR.)

The major use of GSR is to measure the strength of a subject’s response to a stimulus. For this use it is desirable that the size of the GSR not depend on extraneous variables, such as the subject’s basal resistance. Thus, in the equation y₁ = y₀ + GSR the value of GSR should be independent of y₀. This is a form of additivity.

Figure 4 shows GSR plotted against y₀ for the two scales R and C. On the R scale it is obvious that the GSR has a strong systematic dependence on R₀ (in addition to its random fluctuation). The C scale is in strong contrast, for there the GSR displays relatively little systematic dependence on C₀. The corresponding plot for the intermediate log C scale (not shown here) displays a distinct but intermediate degree of dependence.

Another desirable property, often specified in papers on this topic, is that y₀ should be approximately normal. Plots on “probability paper” (not shown here) display log C₀ as quite nicely straight, two other scales as definitely curved (skew), and C₀ and R₀ as even more curved. Thus, a conflict appears: the requirement of normality points to the log C scale, while additivity points to the C scale. For the major use of GSR, additivity is more important and must dominate unless some resolution can be found.

An intermediate scale might provide a resolution. Schlosberg and Stanley (1953) chose √C. A significantly different intermediate scale, with a plausible rationale, is log(R − R*), where R* depends only on the electrodes, electrode paste, and so forth. For the data of the Lacey and Siegel study, the value of R* should be a little more than 4 kilohms.

Note to Figure 4: GSR represents postshock level of resistance or conductance minus preshock level. The same 92 observations are plotted in both parts of the figure.

Source: Data from Lacey & Siegel 1949.

Some other topics

Multivariate observations can be transformed. Even linear transformations are significant in this case, and these have been the main focus of interest so far. [See Factor Analysis; Multivariate Analysis.]

Where detailed mathematical assumptions can safely be made, transformations can sometimes be used in a more complex way (based on the maximum likelihood principle) to obtain greater precision. This is often called the Bliss-Fisher method. How widely this method should be used has been the subject of much controversy. An article by Fisher (1954, with discussion) gives a good statement of that author’s views, together with concise statements by five other eminent statisticians.

Similarity and dissimilarity measures of many kinds are sometimes transformed into spatial distances, by the technique of multidimensional scaling. Briefly, if δ_ij, is a measure of dissimilarity between objects i and j, multidimensional scaling seeks points (in r-dimensional space) whose inter-point distances, d_ij, are systematically related to the dissimilarities. The relationship between δ_ij and d_ij may usefully be considered a transformation (for further information on multidimensional scaling, see Kruskal 1964).

An ingenious and appealing application of transformations (to three sets of data) appears in Shepard (1965). In each case the data can be represented by several plotted curves on a single graph. Each curve is essentially unimodal, but the peaks occur at different places along the x-axis. The x-variable is monotonically transformed so as to give the different curves the same shape, that is, to make the curves the same except for location along the x-axis. The transformed variables appear to have subject-matter significance.

Joseph B. Kruskal

See also Errors; Graphic Presentation

BIBLIOGRAPHY

General discussion of transformations accompanied by worthwhile applications to data of intrinsic interest may be found in Bartlett 1947, Box & Cox 1964, Kruskal 1965, Moore & Tukey 1954, and Snedecor 1937. Actual or potential applications of great interest may be found in Lacey & Siegel 1949, Shepard 1965, Taylor 1961, and Wald 1965. For tables, consult Fletcher, Miller & Rosenhead 1946, and Greenwood & Hartley 1962. Large and useful bibliographies may be found in Grimm 1960 and Lienert 1962.

Anscombe, F. J.; and Tukey, John W. 1963 The Examination and Analysis of Residuals. Technometrics 5: 141-160.

Bartlett, M. S. 1947 The Use of Transformations. Biometrics 3:39-52. → Down-to-earth, practical advice. Widely read and still very useful.

Box, G. E. P.; and Cox, D. R. 1964 An Analysis of Transformations. Journal of the Royal Statistical Society Series B 26:211-252. → Starts with a very useful general review. The body of the paper, although important and equally useful, requires some mathematical sophistication.

Crespi, Leo P. 1942 Quantitative Variation of Incentive and Performance in the White Rat. American Journal of Psychology 55:467-517.

Darrow, Chester W. 1934 The Significance of Skin Resistance in the Light of Its Relation to the Amount of Perspiration (Preliminary Note). Journal of General Psychology 11:451-452.

Darrow, Chester W. 1937 The Equation of the Galvanic Skin Reflex Curve: I. The Dynamics of Reaction in Relation to Excitation-background. Journal of General Psychology 16:285-309.

Fisher, R. A. 1954 The Analysis of Variance With Various Binomial Transformations. Biometrics 10: 130-151. → Contains a statement of the Bliss-Fisher method and a controversy over its scope. Do not overlook the important 11-page discussion, especially the remarks by Cochran and Anscombe, whose views are recommended.

Fletcher, Alan; Miller, Jeffery C. P.; and Rosenhead, Louis (1946) 1962 An Index of Mathematical Tables. 2d ed. 2 vols. Reading, Mass.: Addison-Wesley.

Greenwood, J. Arthur; and Hartley, H. O. 1962 Guide to Tables in Mathematical Statistics. Princeton Univ. Press.

Grimm, H. 1960 Transformation von Zufallsvariablen. Biometrische Zeitschrift 2:164-182.

Kruskal, Joseph B. 1964 Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis. Psychometrika 29:1-27.

Kruskal, Joseph B. 1965 Analysis of Factorial Experiments by Estimating Monotone Transformations of the Data. Journal of the Royal Statistical Society Series B 27:251-263.

Lacey, Oliver L.; and Siegel, Paul S. 1949 An Analysis of the Unit of Measurement of the Galvanic Skin Response. Journal of Experimental Psychology 39: 122-127.

Lienert, G. A. 1962 Über die Anwendung von Variablen-Transformationen in der Psychologie. Biometrische Zeitschrift 4:145-181.

Moore, Peter G.; and Tukey, John W. 1954 Answer to Query 112. Biometrics 10:562-568. → Includes Tukey’s “One Degree of Freedom for Non-additivity” and an interesting example of how to choose a transformation, explained clearly and with brevity.

Mosteller, Frederick; and Bush, Robert R. (1954) 1959 Selected Quantitative Techniques. Volume 1, pages 289-334 in Gardner Lindzey (editor), Handbook of Social Psychology. Cambridge, Mass.: Addison-Wesley.

Neyman, J.; and Scott, E. L. 1960 Correction for Bias Introduced by a Transformation of Variables. Annals of Mathematical Statistics 31:643-655.

Scheffé, Henry 1959 The Analysis of Variance. New York: Wiley. → See especially pages 129-133.

Schlosberg, Harold S.; and Stanley, Walter C. S. 1953 A Simple Test of the Normality of Twenty-four Distributions of Electrical Skin Conductance. Science 117:35-37.

Shepard, Roger N. 1965 Approximation to Uniform Gradients of Generalization by Monotone Transformations of Scale. Pages 94-110 in David I. Mostofsky (editor), Stimulus Generalization. Stanford Univ. Press.

Snedecor, George W. S. (1937) 1957 Statistical Methods: Applied to Experiments in Agriculture and Biology. 5th ed. Ames: Iowa State Univ. Press. → See especially pages 314-327.

Taylor, L. R. 1961 Aggregation, Variance and the Mean. Nature 189:732-735.

Tukey, John W. 1957 On the Comparative Anatomy of Transformations. Annals of Mathematical Statistics 28:602-632.

Wald, George 1965 Frequency or Wave Length? Science 150:1239-1240. → See also several follow-up letters, under the heading “Frequency Scale for Spectra,” Science 151:400-404 (1966).

III. GROUPED OBSERVATIONS

When repeated empirical measurements are made of the same quantity, the results typically vary; on the other hand, ties, or repetitions of measurement values, often occur, because measurements are never made with perfect fineness. Length, for example, although theoretically a continuous quantity, must be measured to the closest inch, centimeter, or some such; psychological aptitudes that are conceptually continuous are measured to a degree of fineness consonant with the test or instrument used. In many cases it is desirable and efficient to measure less finely than current techniques permit, because the expense of refined measurement may outweigh the value of the added information. Also, grouping of measurements is often carried out after they are made, in order to enhance convenience of computation or presentation.

No matter how the data are obtained, unless there are only a few it is usually convenient to present them in groups or intervals; that is, the data are organized in a table that shows how many observations are in each of a relatively small number of intervals. The motivations are clarity of description and simplification of subsequent manipulation. Table 1 presents an example of such grouped data. (The data are taken from Wallis & Roberts 1956; this work should be consulted for details of computation and for the useful discussion in secs. 6.2.1, 7.4.5, and 8.5.2.)

*Table 1 — Thirty-two persons grouped according to their body weight (in pounds)*
Weight	Number of persons	Mid-value of class
137.5-147.5	2	142.5
147.5-157.5	5	152.5
157.5-167.5	4	162.5
167.5-177.5	5	172.5
177.5-187.5	7	182.5
187.5-197.5	5	192.5
197.5-207.5	3	202.5
207.5-217.5	0	212.5
217.5-227.5	0	222.5
227.5-237.5	0	232.5
237.5-247.5	1	242.5
Source: Adapted with permission of The Macmillan Company from STATISTICS: A NEW APPROACH by W. Allen Wallis and Harry V. Roberts. Copyright 1956 by The Free Press of Glencoe, A Corporation.

If it is assumed that the weights were originally measured only to the nearest pound, then there is no ambiguity at the class boundaries about which group is appropriate for each measurement. Some information is, of course, lost when the data are recorded in these coarser groups. Such loss of information can, however, be compensated for by making some more observations, and so a purely economic problem results: which is cheaper—to make fewer but finer measurements or to make more but coarser measurements?

A second problem is that most mathematical and statistical tools developed for the treatment of data of this kind presuppose exact measurements, theoretically on an infinitely fine scale. To what extent can use be made of these tools when the data are grouped? How must the theory be modified so that it may legitimately be applied to grouped observations?

Another problem is that bias may arise in cases where an observation is more likely to fall in some part of a relatively wide group than in other parts of that group. Theoretically this holds, of course, for any size of group interval, but when group intervals are less than one-quarter of the standard deviation of the distribution, there will, in practice, be little trouble.

Usually one refers all the observations falling in a group to the midpoint of that group and then works with these new “observations” as if they were exact. What does such a procedure imply? Is it justified? And what should be done if the grouping is very coarse or if there are open intervals to the right and/or left? What should be done in the case where the groups can only be ranked, because their limits have no numerical values?

Estimation of μ and σ in the normal case . Most work on problems of grouped observations has centered about simple normal samples, where one is interested in inferences about mean and variance. Suppose N observations are regarded as a random sample from a normal population with mean μ and standard deviation σ. The observations have fallen in, say, k groups. The maximum likelihood (ML) estimators of μ and σ can be found whether or not the grouping is into equal intervals and whether or not the end intervals are open. The method is a bit troublesome because of its iterative character, but tables have been constructed for facilitation of the work (see Gjeddebaek 1949; advice on finer details is given in Kulldorff 1958a; 1958b). The ML method can be used with assurance in cases where there is some doubt about the admissibility of using the so-called simple estimators of μ and σ.
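With a computer the iteration is no longer troublesome. The sketch below finds the ML estimates by the EM algorithm, a modern technique swapped in here for illustration; it is not the tabular method of the works cited. The function names and starting values are assumptions, and the data are the weights of Table 1.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def z_times_density(z):
    """z * phi(z), taken as 0 at z = +/- infinity (open end intervals)."""
    return 0.0 if math.isinf(z) else z * STANDARD_NORMAL.pdf(z)

def grouped_normal_ml(intervals, counts, mu, sigma, iterations=500):
    """EM iteration for the ML estimates of mu and sigma from grouped normal data.
    intervals: (lower, upper) pairs; mu, sigma: starting values."""
    n = sum(counts)
    for _ in range(iterations):
        sum_shift = 0.0   # sum of n_i * E[x - mu | x in interval i]
        sum_square = 0.0  # sum of n_i * E[(x - mu)^2 | x in interval i]
        for (lower, upper), count in zip(intervals, counts):
            if count == 0:
                continue  # empty groups contribute nothing to the likelihood
            alpha, beta = (lower - mu) / sigma, (upper - mu) / sigma
            mass = STANDARD_NORMAL.cdf(beta) - STANDARD_NORMAL.cdf(alpha)
            density_gap = STANDARD_NORMAL.pdf(alpha) - STANDARD_NORMAL.pdf(beta)
            moment_gap = z_times_density(alpha) - z_times_density(beta)
            sum_shift += count * sigma * density_gap / mass
            sum_square += count * sigma**2 * (1 + moment_gap / mass)
        delta = sum_shift / n  # change in mu for this step
        mu += delta
        sigma = math.sqrt(sum_square / n - delta**2)
    return mu, sigma

intervals = [(137.5 + 10 * i, 147.5 + 10 * i) for i in range(11)]
counts = [2, 5, 4, 5, 7, 5, 3, 0, 0, 0, 1]

mu_hat, sigma_hat = grouped_normal_ml(intervals, counts, mu=175.0, sigma=20.0)
print(f"ML estimates from grouped data: mu = {mu_hat:.2f}, sigma = {sigma_hat:.2f}")
```

An open end interval can be written with `-math.inf` or `math.inf` as its limit. For these data the intervals are narrow relative to σ, so the ML estimates come out close to the simple midpoint estimates.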

The simple estimators are obtained if all observations are referred to the midpoints of their respective groups and standard estimators m and s are calculated for the mean and standard deviation, as outlined in Wallis and Roberts (1956, secs. 7.4.5, 8.5.2). These estimators of μ and σ have almost the same variances and expected mean squares as the ML estimators when the grouping intervals are not wider than 2σ, and N, the sample size, is no more than 100. The simple estimators have biases that do not go to zero as N increases. If, however, interval width is no more than 2σ and N ≤ 100, the bias is negligible.
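For the weights of Table 1 the simple estimators can be computed as follows. This is a small sketch; the divisor N − 1 in the standard deviation is an assumption about what is meant by the standard estimator.

```python
import math

lower_limits = [137.5 + 10 * i for i in range(11)]
counts = [2, 5, 4, 5, 7, 5, 3, 0, 0, 0, 1]
midpoints = [lower + 5.0 for lower in lower_limits]

n = sum(counts)
# Every observation is referred to the midpoint of its group.
m = sum(f * x for f, x in zip(counts, midpoints)) / n
s = math.sqrt(sum(f * (x - m) ** 2 for f, x in zip(counts, midpoints)) / (n - 1))
h = 10.0 / s  # interval width in units of the estimated standard deviation

print(f"N = {n}, m = {m:.2f} pounds, s = {s:.2f} pounds, interval width = {h:.2f} s")
```

The interval width is about half a standard deviation, well inside the limit of 2σ mentioned above.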

If the grouping is equidistant with groups of size hσ, and N < 100, the efficiencies, E, of the ML mean (or of the simple mean, m) relative to the ungrouped mean are as presented in Table 2.

*Table 2 — Efficiency of the ML and simple estimators of the mean from grouped data, relative to the ungrouped mean*
h	E (in per cent)
0.2	99.7
0.4	98.7
0.6	97.1
0.8	94.9
1.0	92.3
1.2	89.3
1.4	86.0
1.6	82.4
1.8	78.7
2.0	75.0

Here E indicates how many ungrouped observations are equivalent to 100 grouped observations made with the group size in question. The above figures can be obtained from a formula given by R. A. Fisher (1922), E = (1 + h²/12)^-1. If the phase relationship between true mean and group limits is taken into account, there are only very small changes (see Gjeddebaek 1956).
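Table 2 can be regenerated from Fisher's formula in a few lines; the names used are illustrative.

```python
def efficiency(h):
    """Fisher's (1922) efficiency of the grouped mean, as a fraction,
    for groups of width h times sigma."""
    return 1.0 / (1.0 + h * h / 12.0)

widths = [round(0.2 * i, 1) for i in range(1, 11)]
table_2 = {h: round(100 * efficiency(h), 1) for h in widths}

for h, e in table_2.items():
    print(f"h = {h:.1f}   E = {e:.1f} per cent")

# Grouped observations needed to match one ungrouped observation at h = 2
grouped_per_ungrouped = 1 / efficiency(2.0)
```

At h = 2 the reciprocal of the efficiency is 4/3, that is, four grouped observations for every three ungrouped ones.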

ML grouped and simple estimators of σ have an efficiency of about 58 per cent for h = 2. For estimating σ, groups should not be wider than 1.6σ, so that “phase relationship” will not play a role. For equidistant grouping with intervals of length 1.6σ, all efficiencies for estimating σ will be about 70 per cent.

Practical conclusions . Because σ depends on both natural variation and measurement variation, it may be seen, for example, that measurements of height in man lose little efficiency when recorded in centimeters rather than in millimeters. It will also be seen that if a considerable reduction of cost per observation can be obtained by measuring with a coarse scale, the information obtained per unit of cost may be increased by such coarse measuring. Even with such large group intervals as 2σ, it is necessary to make only four grouped observations for each three ungrouped observations to obtain the same amount of information about μ.

An instance can be quoted of the advantage of refraining from using a scale to its utmost capacity. Consider the routine weighing of such things as tablets or other doses of substances weighing between 200 and 600 milligrams. Sometimes, in practice, even tenths of milligrams are taken into account here. The last significant figure obtained by such a weighing will, as a rule, require preposterous labor and will divert attention from the foregoing, more important figures. Thus, carrying out the weighing to this point will do more harm than good in the long run. Also, the accuracy of reading ought to be kept in reasonable proportion to the accuracy inherent in the measured pieces; in other words, observations ought to be grouped with due regard to the size of their natural variation. Here the table of efficiencies will be useful, as it immediately reveals at what point the coarseness of a scale essentially influences the accuracy. In this connection it is worthwhile to stress that an impression of great accuracy from a result with many decimal places is misleading if the natural variation of the results makes illusory most of the decimal places. For example, it was an exaggeration of scientific accuracy when it was said that every cigarette smoked by a person cuts off 14 minutes and 24 seconds from his lifetime. Of course, there was some publicity value in the statement. If, however, you are told that whenever you have smoked 100 cigarettes you have lost a day and a night of your lifetime, you will conclude that some scientist has given this opinion with due regard to the obvious uncertainty of such an estimate. From the viewpoint of a statistician, this statement seems a bit more honest, despite the fact that exactly the same thing is expressed as before. (Example 82C of section 3.10 of Wallis & Roberts 1956 gives a similar provocative use of figures.)

Sheppard correction . When the variance, σ², is estimated from data grouped into intervals of width hσ, it has been proposed that the best estimator of σ² is s² - (h²/12)σ², where -(h²/12)σ² is the so-called Sheppard correction. There has been persistent confusion about a statement by R. A. Fisher that the Sheppard correction should be avoided in hypothesis testing. Given N grouped observations, calculate the simple estimators m of the mean and s² of the variance, with the size of group intervals, hσ, the same for the two calculations. From the efficiency, E (see discussion above), it follows that the variance of m is (σ²/N)(1 + h²/12). According to Sheppard, s² - (h²/12)σ² is the best estimator of σ²; thus, it must follow that the best estimator of the quantity σ²(1 + h²/12) is simply s², and hence the best estimator of the variance of m is s²/N. That is the same expression used for ungrouped observations, and so s² should be used without Sheppard’s correction when m and s² are brought together in a testing procedure or in a statement of confidence limits for μ. Obvious modifications may be made when h is different for the grouping used to calculate m and s². Sheppard’s correction should also be avoided in analysis of variance, as the same group intervals are used for “within” and “between” estimators of the variance. In practice, Sheppard’s correction is not very useful, because an isolated estimate of variance is seldom required. For that case, however, the correction is well justified. A serious drawback is that it cannot be used when grouping is not equidistant. Here the maximum likelihood method seems to be the only reasonably precise one.

When the Sheppard-corrected estimator of σ² is available, it has the same efficiency as the ML estimator, at least for N less than 100 (see Gjeddebaek 1959a and the above discussion of the inconsistency of the simple estimators for large N).
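The size of the correction can be checked numerically. In the sketch below the interval width hσ is passed as a single number `width`, so the correction (h²/12)σ² becomes width²/12. The first function computes the exact variance of a normal variable after every value has been referred to the midpoint of its interval, assuming (for illustration only) that one interval is centered on the mean. Both function names are hypothetical.

```python
import math


def _Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def midpoint_variance_of_normal(width, sigma=1.0):
    """Variance of a N(0, sigma^2) variable referred to interval midpoints.

    Intervals have the given width and one of them is centered at 0, so the
    grouped variable has mean 0 by symmetry.
    """
    kmax = int(math.ceil(12.0 * sigma / width))  # cover about +-12 sigma
    var = 0.0
    for k in range(-kmax, kmax + 1):
        mid = k * width
        p = _Phi((mid + width / 2.0) / sigma) - _Phi((mid - width / 2.0) / sigma)
        var += p * mid * mid
    return var


def sheppard_corrected(s2, width):
    """Apply Sheppard's correction to a variance computed from midpoints."""
    return s2 - width * width / 12.0


for width in (0.5, 1.0, 2.0):
    raw = midpoint_variance_of_normal(width)
    print(width, raw, sheppard_corrected(raw, width))
```

For widths up to about one standard deviation the corrected value is very close to the true variance of 1; for wider intervals the agreement depends on the position of the group limits relative to the mean.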

Very coarse groups . When a grouping must be very coarse, the problem arises of where to place the group limits most efficiently. The problem is compounded because the efficiency depends on the site of the unknown true mean. Sometimes (for example, for quality-control purposes) it is known where the true mean ought to be, so the theory is often useful in these instances (see Ogawa 1962).

To investigate the consequences of very wide grouping, consider the weights example once again. Let the groups be very wide—40 pounds. Depending on which group limits are deleted, one of the four situations illustrated in Table 3 results. The ML method presupposes an underlying normal distribution, and therefore it gives rather small s-values in the last two extreme situations, but all in all, Table 3 demonstrates an astonishingly small effect of such coarse grouping. [For a discussion of optimum grouping, see Nonparametric Statistics, article on Order Statistics.]

Grouping ordinal data . If the group limits have no numerical values by nature but are ranked, a different set of problems arises. By use of probits the group limits may be given numerical values, and then the situation may be treated according to the principle of maximum likelihood. In reality a two-step use of that principle is involved, and this gives rise to distribution problems. (The method is discussed in Gjeddebæk 1961; 1963.) From a hypothesis-testing point of view, the grouping of ordinal data may be regarded as the introduction of ties [see Nonparametric Statistics, article on Ranking Methods, for a discussion of such ties].

Other aspects . Extensions of techniques using grouped data to distributions other than the normal, to multivariate data, and to procedures other than estimation of parameters are possible and have been investigated. For example, Tallis and Young (1962) have discussed the multivariate case and have given hints on hypothesis testing. Kulldorff (1961) has investigated the case of the exponential distribution, and Gjeddebaek (1949; 1956; 1957; 1959a; 1959b; 1961; 1963) has worked on methods parallel to t-testing and F-testing. In addition, P. S. Swamy (1960) has done considerable work on these extensions. A comprehensive treatment of rounding errors is given by Eisenhart (1947).

N. F. Gjeddebæk

See also Statistics, Descriptive

BIBLIOGRAPHY

Eisenhart, Churchill 1947 Effects of Rounding or Grouping Data. Pages 185-233 in Columbia University, Statistical Research Group, Selected Techniques of Statistical Analysis for Scientific and Industrial Research, and Production and Management Engineering, by Churchill Eisenhart, Millard W. Hastay, and W. Allen Wallis. New York: McGraw-Hill.

Fisher, R. A. (1922) 1950 On the Mathematical Foundations of Theoretical Statistics. Pages 10.308a-10.368 in R. A. Fisher, Contributions to Mathematical Statistics. New York: Wiley. → First published in the Philosophical Transactions, Series A, Volume 222, of the Royal Society of London.

Fisher, R. A. (1925) 1958 Statistical Methods for Research Workers. 13th ed. New York: Hafner. → Previous editions were also published by Oliver & Boyd. See especially section 19, Appendix D, “Adjustment for Grouping.”

Gjeddebæk, N. F. 1949 Contribution to the Study of Grouped Observations: I. Application of the Method of Maximum Likelihood in Case of Normally Distributed Observations. Skandinavisk aktuarietidskrift 32:135-159.

Gjeddebæk, N. F. 1956 Contribution to the Study of Grouped Observations: II. Loss of Information Caused by Groupings of Normally Distributed Observations. Skandinavisk aktuarietidskrift 39:154-159.

Gjeddebæk, N. F. 1957 Contribution to the Study of Grouped Observations: III. The Distribution of Estimates of the Mean. Skandinavisk aktuarietidskrift 40:20-25.

Gjeddebæk, N. F. 1959a Contribution to the Study of Grouped Observations: IV. Some Comments on Simple Estimates. Biometrics 15:433-439.

Gjeddebæk, N. F. 1959b Contribution to the Study of Grouped Observations: V. Three-class Grouping of Normal Observations. Skandinavisk aktuarietidskrift 42:194-207.

Gjeddebæk, N. F. 1961 Contribution to the Study of Grouped Observations: VI. Skandinavisk aktuarietidskrift 44:55-73.

Gjeddebæk, N. F. 1963 On Grouped Observations and Adjacent Aspects of Statistical Theory. Methods of Information in Medicine 2:116-121.

Kulldorff, Gunnar 1958a Maximum Likelihood Estimation of the Mean of a Normal Random Variable When the Sample Is Grouped. Skandinavisk aktuarietidskrift 41:1-17.

Kulldorff, Gunnar 1958b Maximum Likelihood Estimation of the Standard Deviation of a Normal Random Variable When the Sample Is Grouped. Skandinavisk aktuarietidskrift 41:18-36.

Kulldorff, Gunnar 1961 Contributions to the Theory of Estimation From Grouped and Partially Grouped Samples. Uppsala (Sweden): Almqvist & Wiksell.

Ogawa, Junjiro 1962 Determinations of Optimum Spacings in the Case of Normal Distribution. Pages 272-283 in Ahmed E. Sarhan and Bernard G. Greenberg (editors), Contributions to Order Statistics. New York: Wiley.

Stevens, W. L. 1948 Control by Gauging. Journal of the Royal Statistical Society Series B 10:54-98. → A discussion of Stevens’ paper is presented on pages 98-108.

Swamy, P. S. 1960 Estimating the Mean and Variance of a Normal Distribution From Singly and Doubly Truncated Samples of Grouped Observations. Calcutta Statistical Association Bulletin 9, no. 36.

Tallis, G. M.; and Young, S. S. Y. 1962 Maximum Likelihood Estimation of Parameters of the Normal, Log-normal, Truncated Normal and Bivariate Normal Distributions From Grouped Data. Australian Journal of Statistics 4, no. 2:49-54.

Wallis, W. Allen; and Roberts, Harry V. 1956 Statistics: A New Approach. Glencoe, Ill.: Free Press.

IV. TRUNCATION AND CENSORSHIP

Statistical problems of truncation and censorship arise when a standard statistical model is appropriate for analysis except that values of the random variable falling below—or above—some value are not measured at all (truncation) or are only counted (censorship). For example, in a study of particle size, particles below the resolving power of observational equipment will not be seen at all (truncation), or perhaps small particles will be seen and counted, but will not be measurable because of equipment limitations (censorship). Most of the existing theory for problems of this sort takes the limits at which truncation or censorship occurs to be known constants. There are practical situations in which these limits are not exactly known (indeed, the particle-size censorship example above might involve an inexactly ascertainable limit), but little theory exists for them. Truncation is sometimes usefully regarded as a special case of selection: if the probability that a possible observation having value x will actually be observed depends upon x, and is, say, p(x) (between 0 and 1), selection is occurring. If p(x) = 1 between certain limits and 0 outside them, the selection is of the type called (two-sided) truncation.

More particularly, if values below a certain lower limit, a, are not observed at all, the distribution is said to be truncated on the left. If values larger than an upper limit, b, are not observed, the distribution is said to be truncated on the right. If only values lying between a and b are observed, the distribution is said to be doubly truncated. One also uses the terms “truncated sampling” and “truncated samples” to refer to sampling from a truncated distribution. (This terminology should not be confused with the wholly different concept of truncation in sequential analysis.)

In censored sampling, observations are measured only above a, only below b, or only between a and b; but in addition it is known how many unmeasured observations occur below a and above b. Censorship on the left corresponds to measuring observations only above a; censorship on the right corresponds to measuring only below b; and double censorship corresponds to measuring only between a and b. In the case of double censorship, the total sample consists of l, the number of observations to the left of the lower limit a; r, the number of observations to the right of the upper limit b; and X₁, ..., X_n, the values of the observations occurring between the limits. In one-sided censorship, either l or r is 0. A second kind of censorship arises when, without regard to given limits a or b, the l smallest and/or r largest observations are identified but not measured. This is type II censorship. In consequence, “type I censoring” is a name applied to the case described earlier. (In older literature the word “truncation” may be used for any of the foregoing.)

Some of the definitions mentioned above may be illustrated by the following examples:

Suppose X is the most advanced year of school attained for people born in 1930 where the information is obtained by following up records of those who entered high school; then the distribution is truncated on the left, because every possible observation is forced to be larger than 8. In particular, every observation of a sample from this distribution must be larger than 8. Censorship on the left, in this case, would occur if the sample were drawn from the population of people born in 1930 and if, in addition to noting the most advanced year of school attained by those who entered high school, the number of sample members who did not enter high school were ascertained. Similarly, if two hours is allowed for an examination and the time of submission is recorded only for late papers, the distribution of time taken to write the examination is censored on the left at two hours.

Truncation on the right would apply to a 1967 study of longevity of people born in 1910, as inferred from a comprehensive survey of death certificates. Censorship on the right would occur in this case if the sample were based on birth certificates, rather than death certificates, since then the number of individuals with longevity exceeding the upper observable limit would be known.

The distribution of height of U.S. Navy enlistees in records of naval personnel is a doubly truncated distribution because of minimum and maximum height requirements for enlistment. An example of type II censorship on the right would be given by the dates of receipt of the first 70 responses to a questionnaire that had been sent out to 100 people.
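The distinctions among truncation, type I censorship, and type II censorship can also be stated operationally. The sketch below applies each scheme to the same list of values; the function names and the data are hypothetical, and values exactly equal to a limit are treated as observed, which is a convention chosen here.

```python
def truncate(data, a=None, b=None):
    """Truncation: values outside [a, b] vanish without trace."""
    return [x for x in data
            if (a is None or x >= a) and (b is None or x <= b)]


def censor_type1(data, a=None, b=None):
    """Type I censorship: fixed limits; values outside are only counted."""
    l = sum(1 for x in data if a is not None and x < a)
    r = sum(1 for x in data if b is not None and x > b)
    return l, truncate(data, a, b), r


def censor_type2(data, l=0, r=0):
    """Type II censorship: the l smallest and r largest are not measured."""
    ordered = sorted(data)
    return l, ordered[l:len(ordered) - r], r


data = [1, 5, 3, 8, 10, 2]
print(truncate(data, a=2.5, b=8.5))      # only the measured values survive
print(censor_type1(data, a=2.5, b=8.5))  # counts l and r are retained
print(censor_type2(data, l=1, r=2))      # fixed numbers, not fixed limits
```

In the truncated sample the total sample size N = l + n + r is unknown, whereas both censoring schemes preserve it.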

Goals . In dealing with truncated distributions, a key issue is whether the conclusions that are sought should be applicable to the entire population or only to the truncated population itself. For instance, since the navy, in purchasing uniforms, need consider only those it enlists, the truncated, not the untruncated, population is the one of interest. On the other hand, if an anthropologist wished to use extensive navy records for estimating the height distribution of the entire population or its mean height, his inferences would be directed, not to the truncated population itself, but to the untruncated population. (Perhaps this anthropologist should not use naval enlistees at all, since the sample may not be representative for other reasons, such as educational requirements and socioeconomic factors.) In a study in which treated cancer patients are followed up for five years (or until death, if that comes sooner), the observed survival times would be from a distribution truncated on the right. This is also a natural example of censored sampling. The censored sample would provide the information relevant to setting actuarial rates for five-year (or shorter) term insurance policies. For assessing the value of treatment the censored sample is less adequate, since the remainder of the survival-time distribution is also important.

In cases in which the truncated population itself is of interest, few essentially new problems are posed by the truncation; for example, the sample mean and variance remain unbiased estimators of these parameters of the truncated population, and (at least with large samples) statistical methods that are generally robust may ordinarily be confidently applied to a truncated distribution. On the other hand, if data from the truncated sample are to be used for reaching conclusions about the untruncated distribution, special problems do arise. For instance, the sample mean and variance are not reasonable estimators of those parameters in the untruncated distribution, nor are medians or other percentiles directly interpreted in terms of the untruncated distribution. The situation for means and variances in censored samples, although not identical, is similar.
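To indicate how far the truncated mean and variance can lie from those of the untruncated distribution, the sketch below evaluates the standard formulas for a normal distribution truncated to the interval (a, b). Normality is an assumption introduced here for illustration, and the function name and the numerical limits in the usage lines are hypothetical.

```python
import math


def _phi(z):
    """Standard normal density (0.0 at infinite z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def _zphi(z):
    """z * phi(z), with the limit 0 taken at infinite z."""
    return 0.0 if math.isinf(z) else z * _phi(z)


def truncated_normal_mean_var(mu, sigma, a=-math.inf, b=math.inf):
    """Mean and variance of N(mu, sigma^2) restricted to (a, b)."""
    za, zb = (a - mu) / sigma, (b - mu) / sigma
    p = _Phi(zb) - _Phi(za)                 # probability retained
    shift = (_phi(za) - _phi(zb)) / p       # standardized mean shift
    spread = 1.0 + (_zphi(za) - _zphi(zb)) / p - shift * shift
    return mu + sigma * shift, sigma * sigma * spread


# Made-up example: heights with mean 175 and s.d. 7, accepted from 165 to 195.
print(truncated_normal_mean_var(175.0, 7.0, a=165.0, b=195.0))
```

With asymmetric limits the truncated mean is displaced toward the wider side, and the truncated variance is always smaller than σ².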

Estimation and testing under censorship . Suppose, now, that a sample of observations is censored and that the purpose is to make estimates of parameters in the population. In this case the data consist of l, X₁, ..., X_n, and r. Let N = l + n + r. Usually N is regarded as fixed. For type II censorship, if both l and r are less than N/2, then the sample median has the same distributional properties as it would if no censorship had been imposed. Similarly, such statistics as the interquartile range, certain linear combinations of order statistics, and estimates of particular percentiles may be more or less usable, just as if censoring had not occurred (depending upon the values of l, n, and r). This fact allows censoring to be deliberately imposed with advantage where the investigator knows enough to be sure that the sample median or other interesting quantiles will be among X₁, ..., X_n and where the cost of taking the sample is greatly reduced by avoiding exact measurement of a substantial part of the sample. Generally, the precision obtainable from a sample of size N where censorship has been imposed can never be greater than the precision obtainable from a sample of size N from the same distribution without censoring (Raja Rao 1958), but the censored sample may be cheaper to observe. Sometimes censorship is deliberately imposed for another reason. If the investigator fears that some of the observations in samples are actually errors (coming from a “contaminating” distribution), he may deliberately choose to censor the smallest one or two (or more) and the largest one or two (or more) and use only the intermediate values in the statistical analysis. Censorship of this form is called “trimming”; it is akin to a related technique called Winsorizing. [See Statistical Analysis, Special Problems of, article on Outliers; Nonparametric Statistics, article on Order Statistics.]
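The statement about the median can be made concrete: when l and r are both less than N/2, the middle order statistic (or pair) lies among the measured values and can be located by counting. A minimal sketch, with a hypothetical function name:

```python
def median_from_censored(l, values, r):
    """Median of the full sample of size N = l + n + r.

    l, r   : numbers of unmeasured smallest and largest observations
    values : the n measured observations lying between them
    """
    ordered = sorted(values)
    n_total = l + len(ordered) + r
    lo = (n_total - 1) // 2 - l  # position of the lower middle value
    hi = n_total // 2 - l        # position of the upper middle value
    if lo < 0 or hi >= len(ordered):
        raise ValueError("median falls among the censored observations")
    return 0.5 * (ordered[lo] + ordered[hi])


# Full sample 1, 2, 3, 5, 8, 10 with the smallest one and largest two unmeasured.
print(median_from_censored(1, [2, 3, 5], 2))
```

The result equals the median of the complete sample, whatever the unmeasured values may have been.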

Since censorship will generally cause some off-center part of the distribution to be the one furnishing X₁, ..., X_n, it is clear that the sample mean, X̄, based only on those observations will generally be a seriously biased estimator of the population mean, μ; similarly, s², the sample variance of X₁, ..., X_n, will tend to be too small to be a good estimator of the population variance, σ². Thus, in using a censored sample to reach conclusions about the parameters (other than quantiles), such as the mean and standard deviation of the population, it is necessary to make some assumptions about the underlying distribution in order to arrive at estimators with known properties.

Even with strong assumptions, it is difficult to obtain procedures such as confidence intervals of exact confidence coefficient. (Halperin 1960 gives a method for interval estimation of location and scale parameters with bounded confidence coefficient.) If samples are large, then more satisfactory results are available through asymptotic theory.

With respect to testing hypotheses where two samples singly censored at the same point are available, it has been shown by Halperin (1960) that Wilcoxon’s two-sample test can be applied in an adapted form and that for samples of more than eight observations with less than 75 per cent censoring, the normal approximation to the distribution of the (suitably modified) Wilcoxon statistic holds well.

Normal distributions and censorship. The case which has been most studied is, naturally, that of the normal distribution. For type I censorship, by use of a and b, together with l, r, and X₁, ..., X_n, it is possible to calculate the maximum likelihood estimators in the normal case. The calculation is difficult and requires special tables. Among other methods for the normal distribution are those based on linear combinations of order statistics. [See Nonparametric Statistics, article on Order Statistics.]
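Where a computer is available, the ML estimates under type I censorship can be obtained by iteration rather than from tables. The sketch below is one such route and is not the tabular method referred to above: it uses the EM algorithm, replacing each unmeasured observation by its conditional moments under the current estimates. The function names, default arguments, and data are hypothetical.

```python
import math


def _phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def censored_normal_loglik(mu, sigma, values, a=None, l=0, b=None, r=0):
    """Log likelihood: l values below a, r values above b, the rest measured."""
    ll = sum(math.log(_phi((x - mu) / sigma) / sigma) for x in values)
    if l:
        ll += l * math.log(_Phi((a - mu) / sigma))
    if r:
        ll += r * math.log(1.0 - _Phi((b - mu) / sigma))
    return ll


def censored_normal_ml(values, a=None, l=0, b=None, r=0, iters=2000, tol=1e-12):
    """EM iteration for the ML estimates of mu and sigma."""
    n_total = l + len(values) + r
    mu = sum(values) / len(values)  # start from the measured values alone
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values)) or 1.0
    for _ in range(iters):
        s1 = sum(values)
        s2 = sum(x * x for x in values)
        if l:
            z = (a - mu) / sigma
            ratio = _phi(z) / _Phi(z)
            m1, m2 = -ratio, 1.0 - z * ratio    # E[Z], E[Z^2] given Z < z
            s1 += l * (mu + sigma * m1)
            s2 += l * (mu * mu + 2.0 * mu * sigma * m1 + sigma * sigma * m2)
        if r:
            z = (b - mu) / sigma
            ratio = _phi(z) / (1.0 - _Phi(z))
            m1, m2 = ratio, 1.0 + z * ratio     # E[Z], E[Z^2] given Z > z
            s1 += r * (mu + sigma * m1)
            s2 += r * (mu * mu + 2.0 * mu * sigma * m1 + sigma * sigma * m2)
        new_mu = s1 / n_total
        new_sigma = math.sqrt(s2 / n_total - new_mu * new_mu)
        converged = abs(new_mu - mu) + abs(new_sigma - sigma) < tol
        mu, sigma = new_mu, new_sigma
        if converged:
            break
    return mu, sigma


# Made-up sample: six measured values and two observations known only to lie below 4.
measured = [4.1, 4.8, 5.2, 5.9, 6.3, 7.0]
print(censored_normal_ml(measured, a=4.0, l=2))
```

The estimate of μ falls below the mean of the measured values, reflecting the information carried by l.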

Cohen (1959) gives a useful treatment of these problems, and a rather uncomplicated method with good properties is offered by Saw (1961), who also presents a survey of the more standard estimation methods. It is interesting to observe that a sample of size N which is censored on one side and contains n measured values is more informative about μ, the population mean, than would be an uncensored sample of n observations (Doss 1962). The additional information is obviously furnished by l (or r). The same thing is true in the estimation of σ, providing that censored observations lie in a part of the distribution with probability less than one-half (Doss 1962). Although in uncensored samples from the normal distribution the sample mean, X̄, and standard deviation, s, are statistically independent, they are not so with one-sided censoring; indeed, the correlation between X̄ and s grows as the fraction censored increases (Sampford 1954).

Estimation and testing under truncation . Samples from distributions truncated at points a and b generally give less information relating to that part of the distribution which has been sampled than do samples from the same distribution with censoring at those points. (In the case of type I censorship, l and r do afford some idea of whether the center or the left-hand side or the right-hand side is furnishing the observations X₁, ..., X_n.) It follows that no distribution-free relationships between percentiles, location parameters, or dispersion parameters of the untruncated distribution and the truncated one can hold. To reach any conclusions about the untruncated population, it is essential to have some assumptions about the underlying probability law.

If certain statistics are jointly sufficient for a random sample from the untruncated distribution, then those same statistics remain sufficient for a sample from the truncated distribution (Tukey 1949; Smith 1957).

The amount of information in the truncated sample may be greater than, less than, or equal to that afforded by an untruncated sample of the same size from the same distribution. Which of these alternatives applies depends upon what the underlying distribution is and how the truncation is done (Raja Rao 1958).

For the normal distribution, a truncated sample is always less informative about both μ and σ than an untruncated sample having the same number of observations (Swamy 1962). (However, an inner truncated sample from a normal distribution, that is, one in which only observations outside an interval are observed, may be more informative about σ² than an untruncated sample with the same number of observations.)

Testing two-sample hypotheses from truncated samples can be done on the basis of distributional assumptions. In addition, a little can be said about distribution-free procedures. Lehmann showed that if two continuous distributions, F and G, are being compared by Wilcoxon’s test (or any rank test), where G = F^k, then whatever F is, the distribution of the test statistic depends only on k and the two sample sizes. Truncation on the right at point b gives the truncated cumulative distribution functions F_b(x) = F(x)/F(b) and G_b(x) = F^k(x)/F^k(b), so it is still true that G_b(x) = [F_b(x)]^k. Thus, truncation on the right does not affect the properties of Wilcoxon’s test against Lehmann (1953) alternatives. On the other hand, truncation on the left at point a leads to the relations G_a(x) = [F^k(x) - F^k(a)]/[1 - F^k(a)] and F_a(x) = [F(x) - F(a)]/[1 - F(a)], and it is not true that G_a(x) = [F_a(x)]^k. Further, it can be shown that the noncentrality parameter of the test, P(X < Y), declines as a grows (that is, as more and more of the distribution is truncated). Thus, truncation on the left does affect the test against Lehmann alternatives. By a similar argument, if F and G are related by 1 - G = (1 - F)^k, then truncation on the left leaves the relationship unaltered, while truncation on the right does not.
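The asymmetry between right and left truncation under Lehmann alternatives can be verified numerically. The sketch below takes F to be the uniform distribution on (0, 1) and k = 2, both chosen purely for illustration, and uses the usual form of a left-truncated distribution function, [F(x) - F(a)]/[1 - F(a)] for x above a. All function names are hypothetical.

```python
def uniform_cdf(x):
    """Distribution function of the uniform distribution on (0, 1)."""
    return min(max(x, 0.0), 1.0)


def lehmann(cdf, k):
    """Lehmann alternative G = F^k."""
    return lambda x: cdf(x) ** k


def right_truncate(cdf, b):
    """Distribution function after truncation on the right at b (x <= b)."""
    return lambda x: cdf(x) / cdf(b)


def left_truncate(cdf, a):
    """Distribution function after truncation on the left at a (x >= a)."""
    return lambda x: (cdf(x) - cdf(a)) / (1.0 - cdf(a))


k = 2.0
F = uniform_cdf
G = lehmann(F, k)

# Right truncation at b = 0.5 preserves the relation G_b = (F_b)^k.
x = 0.25
print(right_truncate(G, 0.5)(x), right_truncate(F, 0.5)(x) ** k)

# Left truncation at a = 0.5 destroys it: G_a differs from (F_a)^k.
x = 0.75
print(left_truncate(G, 0.5)(x), left_truncate(F, 0.5)(x) ** k)
```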

In comparing two treatments of some disease where time to recurrence of the disease is of interest, random censorship is sometimes encountered. For example, death by accidental injury may intervene before the disease has recurred. Such an observation has been subjected to censorship by a random event. Problems of this sort are treated by Gehan (1965).

Bivariate cases . The bivariate case occurs when truncation or censorship is imposed on each member of a sample, or possible sample, in terms of one variable, say, X, while another variable, Y, is the one of principal interest. For example, in studying income data, an investigator might take as his sample all tax returns submitted before the delinquency date. He would then have censored the sample on the date of submission of the tax return, but his interest would apply to some other properties of these data, such as the taxable income reported. This kind of bivariate truncation or censoring is common in social science. Estimation of the parameters of the multivariate normal distribution when there is truncation or censorship has been treated by Singh (1960) in the case of mutually independent variables. The estimation equations in the case of truncation are the usual univariate equations, which may be separately solved. But in the presence of censorship, when only the number of unobservable vectors is known and there is no information as to which components led to unobservability, the estimating equations require simultaneous solution.

If X and Y are independent, the truncation on X does not affect the distribution of Y. Otherwise, in general, it will; and it may do so very strongly. When dependence of X and Y exists, then censorship or truncation affords opportunities for large and subtle bias, on the one hand, and for experimental strategies, on the other. For example, the selection of pilots, students, domestic breeding stock, all represent choosing a sample truncated in terms of one variable (an admission score or preliminary record of performance), with a view to ensuring large values of a different variable in the truncated (retained) portion of the population. Generally, the larger the correlation between X and Y, the greater the improvement obtainable in Y by truncation on X. [See Screening and Selection.]

A second example of truncation of one variable with the eye of purpose fixed on a second is afforded by “increased severity testing.” This engineering method may be illustrated by taking a lot of resistors designed to tolerate a low voltage over a long service life and exposing them to a short pulse of very high voltage. Those not failing are assumed satisfactory for their intended use, provided that service life is not shortened as a result of the test. This is seen as an example of bivariate truncation if one attributes to each resistor two values: X, its service life at the high voltage, and Y, its service life at the low voltage. Truncation on X presumably increases the mean value of Y in the retained, or truncated, population. Analogues of this may have relevance for psychology in such areas as stress interviews or endurance under especially difficult experimental tasks.

In the bivariate case, truncation of one variable can greatly affect the correlation between the two variables. For example, although height and weight exhibit a fairly strong correlation in adult males, this correlation virtually disappears if we consider only males with height between 5 feet 6 inches and 5 feet 9 inches. Although there is considerable variation in weight among men of nearly the same height, little of this variation is associated with variation in height. On the other hand, inner truncation—omitting cases with intermediate values of one variable—will produce spuriously high correlation coefficients. Thus, in a sample of males of heights less than 5 feet 4 inches or greater than 6 feet 6 inches, virtually all the variation in weight will be associated with variation in height. In a linear regression situation, where there is inner or outer truncation, the slope of the regression line continues to be unbiasedly estimated. But the correlation coefficient has a value that may depend so strongly on the truncation (or, more generally, selection) that there may be little if any relationship between the correlations in the truncated and untruncated populations. [See Errors, article on Nonsampling Errors.]

To show how truncation can enormously affect the correlation coefficient, consider X and Y, two random variables with a joint distribution such that

Y = α + βX + e,

where α and β are constants and where e is a random variable uncorrelated with X. This kind of simple linear structure frequently arises as a reasonable assumption. Assume that β is not 0.

Let σ²_e and σ²_x be the variances of e and X respectively; immediate computation then shows that the covariance between X and Y is βσ²_x, while the variance of Y is β²σ²_x + σ²_e. It follows that ρ², the square of the correlation coefficient between X and Y, is

ρ² = β²σ²_x / (β²σ²_x + σ²_e).

Hence, if the structure stays otherwise the same but the marginal distribution of X is changed so that σ²_x becomes very large, ρ² becomes nearly unity. In particular, if the marginal distribution of X is changed by truncating inside the interval (-d, d), it is readily shown that as d becomes indefinitely large, so will σ²_x.

Similarly, if σ²_x becomes nearly 0, so will ρ². In particular, if X is truncated outside a small enough interval, σ²_x will indeed become nearly 0.
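The dependence of ρ² on the variance of X follows from the covariance βσ²_x and the variances σ²_x and β²σ²_x + σ²_e given above, which yield ρ² = β²σ²_x/(β²σ²_x + σ²_e). A minimal sketch of the two limiting cases, with a hypothetical function name and made-up values:

```python
def rho_squared(beta, var_x, var_e):
    """Squared correlation of X and Y = alpha + beta * X + e."""
    signal = beta * beta * var_x
    return signal / (signal + var_e)


# Same structure (beta = 2, error variance 1), different spreads of X.
for var_x in (1e-6, 0.01, 1.0, 100.0, 1e6):
    print(var_x, rho_squared(2.0, var_x, 1.0))
```

The slope β is the same in every row; only the spread of X, and hence ρ², has changed.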

Still more difficulties arise if truncation in a bivariate (or multivariate) population is accomplished not by truncation on X or Y alone but on a function of them. If X and Y were utterly independent and a sample were drawn subject to the restraint |X - Y| ≤ a, then all the points (X, Y) in the sample would be required to lie in a diagonal strip of slope 1 and vertical (or horizontal) width 2a units. Obviously, very high “correlation” might be observed! Some follow-up studies embody a bias of just this form. Suppose that survival of husband and wife is to be studied by following up all couples married during a period of 40 years. Suppose further that A_H, the age of the husband at marriage, and A_w, the age of the wife at marriage, are highly correlated (as they are) and that L_H and L_w, their lifetimes, are completely independent of each other statistically. Then the correlation between L_H and L_w observed in a follow-up study may be very high or very low, depending upon how the sample is truncated. Consider several methods:

(1) All couples married in the 40-year period are followed up until all have died—making a study of about a hundred years’ duration; then there is no truncation of L_H or L_w, and since (by assumption) they are statistically independent, the observed correlation will, except for sampling error, be zero.

(2) All couples whose members have both died during the 41st year of the study furnish the sample values of L_H and L_w; now there will be a very high correlation. Write L_H = A_H + T_H and L_w = A_w + T_w, where T represents life length after marriage. The curious sampling scheme just proposed ensures that |T_H - T_w| ≤ 1, so T_H and T_w are highly correlated and A_H and A_w are also correlated; hence L_H and L_w in such a truncated sample will be highly correlated.

(3) All couples whose members have both died by the end of the 40th year (the cases “complete” by then) furnish the data. In this not infrequently used design, a fictitious correlation will be found. Those couples married during the last year and with both members dead will have values of T_H and T_w which are nearly equal, and they will have correlated values of A_H and A_w; such couples will contribute strongly to a positive correlation. Those couples married two years before the end of the study and with both members dead will have values of T_H and T_w differing at most by 2, and correlated values of A_H and A_w; they will also contribute—not quite so strongly—to a positive correlation. By continuation of this reasoning, it is seen that the same kind of bias (diminishing with progress toward the earliest marriage) affects the entire sample. A detailed numerical example of this problem is given by Myers (1963).
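Designs (1) and (3) can be compared in a small simulation. This is a rough sketch under invented assumptions, not a reproduction of the example in Myers (1963): marriages are spread uniformly over the 40-year intake, ages at marriage are correlated normal variables, and lifetimes are independent normal variables. All numerical parameters and the helper names `pearson` and `simulate_couples` are hypothetical. Design (2) is omitted because very few simulated couples satisfy it.

```python
import math
import random


def pearson(xs, ys):
    """Ordinary product-moment correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_couples(n=200_000, seed=7):
    """Return n couples as tuples (m, a_h, a_w, l_h, l_w).

    Assumed, purely illustrative model:
      m   : time of marriage, uniform over the 40-year intake period
      a_h : husband's age at marriage, normal(26, sd 4)
      a_w : wife's age at marriage, about 2 years less (highly correlated)
      l_h, l_w : lifetimes, independent normal(70, sd 12)
    """
    rng = random.Random(seed)
    couples = []
    while len(couples) < n:
        m = rng.uniform(0.0, 40.0)
        a_h = rng.gauss(26.0, 4.0)
        a_w = a_h - 2.0 + rng.gauss(0.0, 2.0)
        l_h = rng.gauss(70.0, 12.0)
        l_w = rng.gauss(70.0, 12.0)
        # Both partners must be alive at marriage; this rarely rejects a draw.
        if l_h <= a_h or l_w <= a_w:
            continue
        couples.append((m, a_h, a_w, l_h, l_w))
    return couples


couples = simulate_couples()

# Design (1): follow every couple until both have died (no truncation).
r_all = pearson([c[3] for c in couples], [c[4] for c in couples])

# Design (3): keep only couples with both deaths by the end of year 40.
# Time of death on the study clock is m + T, where T = L - A.
complete = [
    c for c in couples
    if c[0] + (c[3] - c[1]) <= 40.0 and c[0] + (c[4] - c[2]) <= 40.0
]
r_complete = pearson([c[3] for c in complete], [c[4] for c in complete])

print("design (1): couples =", len(couples), " r =", round(r_all, 3))
print("design (3): couples =", len(complete), " r =", round(r_complete, 3))
```

Under these assumptions the lifetimes are generated independently, so any sizable positive correlation among the "complete" cases is produced by the selection rule alone.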

It is probably wise to view with great caution studies that are multivariate in character (involve several observable random aspects) and at the same time use samples heavily truncated or censored on one or more of the variables or, especially, on combinations of them.
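The danger of truncating on a combination of variables is easy to demonstrate with the diagonal-strip case, in which independent X and Y are retained only when |X - Y| ≤ a. The sketch below is illustrative only: it assumes standard normal X and Y and the arbitrary value a = 0.5, and the helper names `pearson` and `strip_sample` are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Ordinary product-moment correlation coefficient."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def strip_sample(a, n_draws=50_000, seed=3):
    """Draw independent standard normal pairs; keep those with |x - y| <= a."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_draws):
        x = rng.gauss(0.0, 1.0)
        y = rng.gauss(0.0, 1.0)
        if abs(x - y) <= a:
            xs.append(x)
            ys.append(y)
    return xs, ys


# An infinite strip width keeps every pair: the untruncated comparison.
xs_all, ys_all = strip_sample(a=float("inf"))
r_all = pearson(xs_all, ys_all)

# A narrow strip keeps only pairs lying close to the line y = x.
xs_strip, ys_strip = strip_sample(a=0.5)
r_strip = pearson(xs_strip, ys_strip)

print("all pairs     : n =", len(xs_all), " r =", round(r_all, 3))
print("|X - Y| <= 0.5: n =", len(xs_strip), " r =", round(r_strip, 3))
```

Both runs use the same independent draws; the correlation in the second arises entirely from selecting on X - Y.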

Lincoln E. Moses

BIBLIOGRAPHY

Cohen, A. Clifford Jr. 1959 Simplified Estimators for the Normal Distribution When Samples Are Singly Censored or Truncated. Technometrics 2:217-237.

Doss, S. A. D. C. 1962 On the Efficiency of BAN Estimates of the Parameters of Normal Populations Based on Singly Censored Samples. Biometrika 49:570-573.

Gehan, Edmund A. 1965 A Generalized Wilcoxon Test for Comparing Arbitrarily Singly-censored Samples. Biometrika 52:203-223.

Halperin, Max 1960 Extension of the Wilcoxon-Mann-Whitney Test to Samples Censored at the Same Fixed Point. Journal of the American Statistical Association 55:125-138.

Lehmann, E. L. 1953 The Power of Rank Tests. Annals of Mathematical Statistics 24:23-43.

Myers, Robert J. 1963 An Instance of the Pitfalls Prevalent in Graveyard Research. Biometrics 19:638-650.

Raja Rao, B. 1958 On the Relative Efficiencies of BAN Estimates Based on Doubly Truncated and Censored Samples. National Institute of Science, India, Proceedings 24:366-376.

Sampford, M. R. 1954 The Estimation of Response-time Distributions: III. Truncation and Survival. Biometrics 10:531-561.

Saw, J. G. 1961 Estimation of the Normal Population Parameters Given a Type I Censored Sample. Biometrika 48:367-377.

Singh, Naunihal 1960 Estimation of Parameters of a Multivariate Normal Population From Truncated and Censored Samples. Journal of the Royal Statistical Society Series B 22:307-311.

Smith, Walter L. 1957 A Note on Truncation and Sufficient Statistics. Annals of Mathematical Statistics 28:247-252.

Swamy, P. S. 1962 On the Joint Efficiency of the Estimates of the Parameters of Normal Populations Based on Singly and Doubly Truncated Samples. Journal of the American Statistical Association 57:46-53.

Tukey, John W. 1949 Sufficiency, Truncation, and Selection. Annals of Mathematical Statistics 20:309-311.

International Encyclopedia of the Social Sciences