# Sufficiency

*Sufficiency* is a term that was introduced by R. A. Fisher in 1922 to denote a concept in his theory of point estimation [*see* Fisher, R. A.]. As subsequently extended and sharpened, the concept is used to simplify theoretical statistical problems of all kinds. It is also used, sometimes questionably, in applied statistics to justify certain summarizations of the data, for example, reporting only sample means and standard deviations for metric data or reporting only proportions for counted data.

The sufficiency concept may be explained as follows. Suppose that the probabilities of two given samples have a ratio that does not depend on the unknown parameters of the underlying statistical model. Then it will be seen that nothing is gained by distinguishing between the two samples; that is, nothing is lost by agreeing to make the same inference for both of the samples. To put it another way, the two samples may be consolidated for inference purposes without losing information about the unknown parameters. When such consolidation can be carried out for many possible samples, the statistical problem becomes greatly simplified.

The argument can best be given in the context of a simple example. Consider tossing a coin four times, with “heads” or “tails” observed on each toss. There are 2^{4} = 16 possible results of this experiment, so the sample space has 16 points, which are represented in Table 1 [*see* Probability, *article on* formal probability]. For example, the point *THHT* represents the experimental result: tosses 1 and 4 gave tails, while tosses 2 and 3 gave heads. For later convenience, the 16 points are arranged

Table 1 — Sample points in coin-tossing experiment (columns give the number of heads)

| 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| TTTT | HTTT | HHTT | THHH | HHHH |
| | THTT | HTHT | HTHH | |
| | TTHT | HTTH | HHTH | |
| | TTTH | THHT | HHHT | |
| | | THTH | | |
| | | TTHH | | |

in columns according to the number of occurrences of *H*.

If one makes the usual assumptions that the four tosses are independent and that the probability of heads is the same (say, *p*) on each toss, then it is easy to work out the probability of each point as a function of the unknown parameter, *p*: for example, Pr(THHT) = *p*^{2}(1 - *p*)^{2} [*see* Distributions, statistical, *article on* special discrete distributions]. In fact, each of the 6 points in column 3 has this same probability. Therefore, the ratio of the probabilities of any 2 points in column 3 has a fixed value, in fact the value 1, whatever the value of *p* may be, and, as stated earlier, it is not necessary to distinguish between the points in column 3. A similar argument shows that the 4 points in column 2 need not be distinguished from each other; the same is true for the 4 points in column 4. Thus, the sample space may be reduced from the original 16 points to merely the 5 columns, corresponding to the number of H's. No further reductions are justified, since any 2 points in different columns have a probability ratio that depends on *p*.
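The column-by-column claims can be verified by direct enumeration. The following sketch uses exact rational arithmetic so the equalities hold without rounding; the particular value chosen for *p* is arbitrary:

```python
from itertools import product
from fractions import Fraction

def prob(point, p):
    """Probability of a sequence of H/T tosses when P(heads) = p."""
    h = point.count("H")
    return p ** h * (1 - p) ** (len(point) - h)

points = ["".join(t) for t in product("HT", repeat=4)]  # the 16 sample points
p = Fraction(3, 10)  # an arbitrary value of the unknown parameter

# Group the points into the columns of Table 1, by number of heads.
columns = {k: [pt for pt in points if pt.count("H") == k] for k in range(5)}
assert [len(columns[k]) for k in range(5)] == [1, 4, 6, 4, 1]

# Within a column every point has the same probability, so all within-column
# probability ratios equal 1 whatever p may be.
for col in columns.values():
    assert len({prob(pt, p) for pt in col}) == 1

# Across columns the ratio depends on p: Pr(HTTT)/Pr(TTTT) = p/(1 - p).
assert prob("HTTT", p) / prob("TTTT", p) == p / (1 - p)
```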

To see intuitively why the consolidations do not cost any useful information, consider a statistician who knows that the experiment resulted in one of the 6 points in column 3 but who does not know just which of the 6 points occurred. Is it worth his while to inquire? Since the 6 points all have the same probability, *p*^{2}(1 - *p*)^{2}, the *conditional* probability of each of the 6 points, given that the point is one of those in column 3, is the known number ⅙ [*see* Probability, *article on* formal probability]. Once the statistician knows that the sample point is in column 3, for him to ask “which point?” would be like asking for the performance of a random experiment with known probabilities of outcome. Such an experiment can scarcely produce useful information about the value of *p*, or indeed about anything else.

Another argument has been advanced by Halmos and Savage (1949). Our statistician, who knows that the observed sample point is in column 3 but who does not know which one of the 6 points was observed, may try to reconstruct the original data by selecting one of the 6 points at random (for example, by throwing a fair die or by consulting a table of random numbers [*see* Random numbers]). The point he gets in this way is not likely to be the point actually observed, but it is easy to verify that the “reconstructed” point has exactly the same distribution as the original point. If the statistician now uses the reconstructed point for inference about *p* in the same way he would have used the original point, the inference will perform exactly as if the original point had been used. If it is agreed that an inference procedure should be judged by its performance, the statistician who knows only the column, and who has access to a table of random numbers, can do as well as if he knew the actual point. In this sense, the consolidation of the points in each column has cost him nothing.
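The Halmos–Savage reconstruction can be verified exactly by enumeration rather than by simulation: for each sample point, the probability that the randomization scheme reproduces it equals its original probability, whatever *p* may be. A minimal sketch (the value of *p* is arbitrary):

```python
from itertools import product
from fractions import Fraction

def prob(point, p):
    """Probability of a four-toss sequence when P(heads) = p."""
    h = point.count("H")
    return p ** h * (1 - p) ** (4 - h)

points = ["".join(t) for t in product("HT", repeat=4)]
p = Fraction(1, 3)  # an arbitrary value of the unknown parameter

# The reconstruction scheme: observe only the column (number of heads), then
# draw one of its points uniformly at random.  Its distribution matches the
# distribution of the original sample point exactly.
for x in points:
    column = [pt for pt in points if pt.count("H") == x.count("H")]
    pr_column = sum(prob(pt, p) for pt in column)  # Pr(sample falls in column)
    pr_reconstructed = pr_column / len(column)     # times uniform choice within it
    assert pr_reconstructed == prob(x, p)
```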

When a (sample) space is simplified by consolidations restricted to points with fixed probability ratio, the simplified space is called *sufficient*: the term is a natural one, in that the simplified space is “sufficient” for any inference for which the original space could have been used. The original space is itself always sufficient, but one wants to simplify it as much as possible. When all permitted consolidations have been made, the resulting space is called *minimally* sufficient (Lehmann & Scheffé 1950-1955). In the example the 5-point space consisting of the five columns is minimally sufficient; if only the points of column 3 had been consolidated, the resulting 11-point space would be sufficient, but not minimally so.

It is often convenient to define or describe a consolidation by means of a statistic, that is, a function defined on the sample space. For example, let B denote the number of heads obtained in the four tosses. Then B has the value 0, 1, 2, 3, 4 for the points in columns 1, 2, 3, 4, 5, respectively. Knowledge of the value of B is equivalent to knowledge of the column. It is then reasonable to call B a *(minimal) sufficient statistic*. (B + 2 and B^{3}, for example, would also be minimal sufficient statistics.) More generally, a statistic is sufficient if it assigns the same value to 2 points only if they have a fixed probability ratio. In Fisher's expressive phrase, a sufficient statistic “contains all the information” that is in the original data.
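For the coin-tossing model this criterion can be checked mechanically. The sketch below is model-specific (the reasoning in the docstring uses the Bernoulli form of the probability ratio) and tests whether a given statistic ever merges points with different head counts:

```python
from itertools import product

points = ["".join(t) for t in product("HT", repeat=4)]

def is_sufficient(stat):
    """In this model Pr(x)/Pr(y) = (p/(1-p))**(heads(x) - heads(y)), which is
    free of p exactly when x and y have the same number of heads.  So a
    statistic is sufficient here iff each of its level sets contains points
    with a single head count."""
    groups = {}
    for pt in points:
        groups.setdefault(stat(pt), set()).add(pt.count("H"))
    return all(len(head_counts) == 1 for head_counts in groups.values())

assert is_sufficient(lambda pt: pt.count("H"))       # B itself
assert is_sufficient(lambda pt: pt.count("H") + 2)   # any one-to-one function of B
assert not is_sufficient(lambda pt: pt[0])           # the first toss alone is not
```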

The discussion above is, strictly speaking, correct only for discrete sample spaces. The concepts extend to the continuous case, but there are technical difficulties in a rigorous treatment because of the nonuniqueness of conditional distributions in that case. These technical problems will not be discussed here. (For a general treatment, see Volume 2, chapter 17 of Kendall & Stuart [1943-1946] 1958-1966, and chapters 1 and 2 of Lehmann 1959, where further references to the literature may be found.)

The discovery of sufficient statistics is often facilitated by the Fisher-Neyman factorization theorem. If the probability of the sample point (or the probability density) may be written as the product of two factors, one of which does not involve the parameters and the other of which depends on the sample point only through certain statistics, then those statistics are sufficient. This theorem may be used to verify these examples: (i) if B is the number of “successes” in *n* Bernoulli trials (*n* independent trials on each of which the unknown probability, *p*, of success is the same), then B is sufficient; (ii) if X_{1}, X_{2}, ..., X_{n} is a random sample from a normal population of known variance but unknown expectation *μ*, then the sample mean, X̄, is sufficient; (iii) if, instead, the expectation is known but the variance is unknown, then the sample variance (computed around the known mean) is sufficient; (iv) if both parameters are unknown, then the sample mean and variance together are sufficient. (In all four cases, the sufficient statistics are minimal.) In all these examples, the families of distributions are of a kind called *exponential* [*for an outline of the relationship between families of exponential distributions and sufficient statistics, see* Distributions, statistical, *article on* special continuous distributions].
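Example (ii) can be illustrated numerically: with known variance the normal likelihood factors into a μ-free part depending on the full data and a part depending only on (x̄, μ), so two samples with the same mean have a likelihood ratio free of μ. A sketch, taking the variance as 1; the sample values are made up for illustration:

```python
import math

def normal_loglik(xs, mu):
    """Log-likelihood of an i.i.d. N(mu, 1) sample (variance known to be 1)."""
    return sum(-0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi) for x in xs)

# Two samples with the same mean (1.0) but different configurations.
x = [0.0, 2.0, 1.5, 0.5]
y = [1.0, 1.0, -1.0, 3.0]
assert sum(x) / len(x) == sum(y) / len(y) == 1.0

# Their log-likelihood difference is the same for every mu: the mu-dependent
# factor g(x_bar, mu) cancels, leaving only the mu-free factor h(x)/h(y).
d1 = normal_loglik(x, -0.7) - normal_loglik(y, -0.7)
d2 = normal_loglik(x, 2.3) - normal_loglik(y, 2.3)
assert abs(d1 - d2) < 1e-12
```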

In the theory of statistics, sufficiency is useful in reducing the complexity of inference problems and thereby facilitating their solution. Consider, for example, the approach to point estimation in which estimators are judged in terms of bias and variance [*see* Estimation]. For any estimator, T, and any sufficient statistic, S, the estimator E(T|S)—formed by calculating the conditional expectation of T, given S—is a function of S alone, has the same bias as T, and has a variance no larger than that of T (Rao 1945; Blackwell 1947). Hence, nothing is lost if attention is restricted to estimators that are functions of a sufficient statistic. Thus, in example (ii) it is not necessary to consider all functions of all *n* observations but only functions of the sample mean. It can be shown that X̄ itself is the only function of X̄ which is an unbiased estimator of *μ* (Lehmann & Scheffé 1950-1955) and that X̄ has a smaller variance than any other unbiased estimator for *μ*.
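The Rao–Blackwell improvement can be computed exactly in the four-toss example: starting from the crude unbiased estimator “first toss” and conditioning on B gives the sample mean, with the same bias and a quarter of the variance. A sketch in exact arithmetic (the value of *p* is arbitrary):

```python
from itertools import product
from fractions import Fraction

p = Fraction(2, 5)                        # any value; the identities hold for all p
points = list(product((0, 1), repeat=4))  # 1 = heads

def prob(pt):
    return p ** sum(pt) * (1 - p) ** (4 - sum(pt))

def mean(f):
    return sum(prob(pt) * f(pt) for pt in points)

def var(f):
    m = mean(f)
    return mean(lambda pt: (f(pt) - m) ** 2)

def T(pt):   # a crude unbiased estimator of p: the first toss only
    return Fraction(pt[0])

def RB(pt):  # E(T | B) = B/4, the sample mean, by symmetry of the tosses given B
    return Fraction(sum(pt), 4)

assert mean(T) == mean(RB) == p   # both unbiased: conditioning preserves bias
assert var(RB) == var(T) / 4      # and reduces variance: p(1-p)/4 vs p(1-p)
```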

**Sufficiency in applied statistics.** In applied statistical work, the concept of sufficiency is often used to justify the reduction, especially for publication, of large bodies of experimental or observational data to a few numbers, the values of the sufficient statistics of a model devised for the data [*see* Statistics, descriptive]. For example, the full data may be 500 observations of a population. If the population is normal and the observations are independent, example (iv) justifies reducing the record to two numbers, the sample mean and variance; no information is lost thereby.

Although such reductions are very attractive, particularly to editors, the practice is a dangerous one. The sufficiency simplification is only as valid as the model on which it is based, and sufficiency may be quite “nonrobust”: reduction to statistics sufficient according to a certain model may entail drastic loss of information if the model is false, even if the model is in some sense “nearly” correct. A striking instance is provided by the frequently occurring example (iv). Suppose that the population from which the observations are drawn is indeed symmetrically distributed about its expected value *μ* and that the distribution is quite like the normal except that there is a little more probability in the tails of the distribution than normal theory would allow. (This extra weight in the tails is usually a realistic modification of the normal, allowing for the occurrence of an occasional “wild value,” or “outlier.”) [*See* Statistical analysis, special problems of, *article on* outliers.] In this case the reduction to sample mean and variance may involve the loss of much or even most of the information about the value of *μ*: there are estimators for *μ*, computable from the original data but not from the reduced data, that are considerably more precise than X̄ when the altered model holds.
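A small simulation illustrates the loss. The contamination rate, the outlier scale, and the trimmed mean as the competing estimator are illustrative choices, not from the original article:

```python
import random
import statistics

random.seed(0)

def contaminated_normal():
    """N(0,1) with a 5% chance of a 'wild value' drawn from N(0, 10^2)."""
    scale = 10.0 if random.random() < 0.05 else 1.0
    return random.gauss(0.0, scale)

def trimmed_mean(xs, frac=0.1):
    """Mean after discarding the smallest and largest 10% of the sample."""
    xs = sorted(xs)
    k = int(len(xs) * frac)
    return statistics.fmean(xs[k:len(xs) - k])

means, trimmed = [], []
for _ in range(2000):
    sample = [contaminated_normal() for _ in range(20)]
    means.append(statistics.fmean(sample))
    trimmed.append(trimmed_mean(sample))

# Under contamination the trimmed mean, which needs the full ordered sample
# and cannot be computed from (mean, variance) alone, is markedly more precise.
assert statistics.pvariance(trimmed) < statistics.pvariance(means)
```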

Another reason for publication of the original data is that the information suppressed when reducing the data to sufficient statistics is precisely what is required to test the model itself. Thus, the reader of a report whose analysis is based on example (i) may wonder if there were dependences among the *n* trials or if perhaps there was a secular trend in the success probability during the course of the observations. It is possible to investigate such questions if the original record is available, but the statistic B throws no light on them.
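A toy illustration: the two records below share the same value of B, so B alone cannot reveal the obvious trend in the first, while a simple runs count can. The records and the runs diagnostic are illustrative, not from the original article:

```python
def runs(seq):
    """Count maximal runs of identical symbols in a record of H/T outcomes."""
    return 1 + sum(a != b for a, b in zip(seq, seq[1:]))

trended = "HHHHHHHHHHTTTTTTTTTT"  # success probability apparently drifting downward
mixed   = "HTHTHHTTHTHHTTHHTTHT"  # no visible trend or dependence

assert trended.count("H") == mixed.count("H") == 10  # same B: indistinguishable by B
assert runs(trended) == 2 and runs(mixed) == 14      # the full record tells them apart
```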

J. L. Hodges, Jr.

## BIBLIOGRAPHY

Blackwell, David 1947 Conditional Expectation and Unbiased Sequential Estimation. *Annals of Mathematical Statistics* 18:105-110.

Fisher, R. A. (1922) 1950 On the Mathematical Foundations of Theoretical Statistics. Pages 10.308a-10.368 in R. A. Fisher, *Contributions to Mathematical Statistics.* New York: Wiley. → First published in Volume 222 of the *Philosophical Transactions,* Series A, of the Royal Society of London.

Halmos, Paul R.; and Savage, L. J. 1949 Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics. *Annals of Mathematical Statistics* 20:225-241.

Kendall, Maurice G.; and Stuart, Alan (1943-1946) 1958-1966 *The Advanced Theory of Statistics.* 3 vols. 2d ed. New York: Hafner; London: Griffin. → Volume 1: *Distribution Theory,* 1958. Volume 2: *Inference and Relationship,* 1961. Volume 3: *Design and Analysis, and Time-series,* 1966 (1st ed.). The first editions of volumes 1 and 2 were by Kendall alone.

Lehmann, E. L. 1959 *Testing Statistical Hypotheses.* New York: Wiley.

Lehmann, E. L.; and Scheffé, Henry 1950-1955 Completeness, Similar Regions, and Unbiased Estimation. *Sankhyā: The Indian Journal of Statistics* 10:305-340; 15:219-236.

Rao, C. Radhakrishna 1945 Information and the Accuracy Attainable in the Estimation of Statistical Parameters. Calcutta Mathematical Society, *Bulletin* 37:81-91.
