Sufficiency is a term that was introduced by R. A. Fisher in 1922 to denote a concept in his theory of point estimation [see Fisher, R. A.]. As subsequently extended and sharpened, the concept is used to simplify theoretical statistical problems of all kinds. It is also used, sometimes questionably, in applied statistics to justify certain summarizations of the data, for example, reporting only sample means and standard deviations for metric data or reporting only proportions for counted data.
The sufficiency concept may be explained as follows. Suppose that the probabilities of two given samples have a ratio that does not depend on the unknown parameters of the underlying statistical model. Then it will be seen that nothing is gained by distinguishing between the two samples; that is, nothing is lost by agreeing to make the same inference for both of the samples. To put it another way, the two samples may be consolidated for inference purposes without losing information about the unknown parameters. When such consolidation can be carried out for many possible samples, the statistical problem becomes greatly simplified.
The argument can best be given in the context of a simple example. Consider tossing a coin four times, with “heads” or “tails” observed on each toss. There are 2⁴ = 16 possible results of this experiment, so the sample space has 16 points, which are represented in Table 1 [see Probability, article on formal probability]. For example, the point THHT represents the experimental result: tosses 1 and 4 gave tails, while tosses 2 and 3 gave heads. For later convenience, the 16 points are arranged in columns according to the number of occurrences of H.
Table 1. Sample points in coin-tossing experiment
| Column | 1 | 2 | 3 | 4 | 5 |
| Number of heads | 0 | 1 | 2 | 3 | 4 |
| Sample points | TTTT | HTTT | HHTT | HHHT | HHHH |
| | | THTT | HTHT | HHTH | |
| | | TTHT | HTTH | HTHH | |
| | | TTTH | THHT | THHH | |
| | | | THTH | | |
| | | | TTHH | | |
If one makes the usual assumptions that the four tosses are independent and that the probability of heads is the same (say, p) on each toss, then it is easy to work out the probability of each point as a function of the unknown parameter, p: for example, Pr(THHT) = p²(1 - p)² [see Distributions, statistical, article on special discrete distributions]. In fact, each of the 6 points in column 3 has this same probability. Therefore, the ratio of the probabilities of any 2 points in column 3 has a fixed value, in fact the value 1, whatever the value of p may be, and, as stated earlier, it is not necessary to distinguish between the points in column 3. A similar argument shows that the 4 points in column 2 need not be distinguished from each other; the same is true for the 4 points in column 4. Thus, the sample space may be reduced from the original 16 points to merely the 5 columns, corresponding to the number of H's. No further reductions are justified, since any 2 points in different columns have a probability ratio that depends on p.
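The consolidation argument can be checked by direct enumeration. The following is a minimal sketch, assuming exact rational arithmetic and two arbitrarily chosen trial values of p; the names `point_prob` and `columns` are illustrative and do not come from the article.

```python
from fractions import Fraction
from itertools import product

def point_prob(point, p):
    """Probability of one toss sequence, assuming independent tosses with Pr(H) = p."""
    heads = point.count("H")
    return p ** heads * (1 - p) ** (len(point) - heads)

# All 2^4 = 16 sample points, grouped by number of heads (0 through 4).
points = ["".join(t) for t in product("HT", repeat=4)]
columns = {k: [pt for pt in points if pt.count("H") == k] for k in range(5)}

# Two trial values of p (assumptions chosen only for this check).
p1, p2 = Fraction(1, 3), Fraction(3, 4)

# Within a column, every probability ratio equals 1 at both values of p.
same_column_ratios = {
    point_prob(a, p) / point_prob(b, p)
    for col in columns.values() for a in col for b in col
    for p in (p1, p2)
}

# Across columns, the ratio changes with p, so no further merging is justified.
ratio_at_p1 = point_prob("THHT", p1) / point_prob("HHHT", p1)
ratio_at_p2 = point_prob("THHT", p2) / point_prob("HHHT", p2)

print([len(columns[k]) for k in range(5)])
print(same_column_ratios)
print(ratio_at_p1, ratio_at_p2)
```

The cross-column ratio here is (1 - p)/p, which is why it takes different values at the two trial values of p.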
To see intuitively why the consolidations do not cost any useful information, consider a statistician who knows that the experiment resulted in one of the 6 points in column 3 but who does not know just which of the 6 points occurred. Is it worth his while to inquire? Since the 6 points all have the same probability, p²(1 - p)², the conditional probability of each of the 6 points, given that the point is one of those in column 3, is the known number ⅙ [see Probability, article on formal probability]. Once the statistician knows that the sample point is in column 3, for him to ask “which point?” would be like asking for the performance of a random experiment with known probabilities of outcome. Such an experiment can scarcely produce useful information about the value of p, or indeed about anything else.
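The value ⅙ follows directly from the definition of conditional probability. Written out, with x denoting any one of the 6 points in column 3 (a symbol introduced here only for illustration), the parameter cancels:

```latex
\Pr(x \mid \text{column 3})
  = \frac{\Pr(x)}{\Pr(\text{column 3})}
  = \frac{p^{2}(1-p)^{2}}{6\,p^{2}(1-p)^{2}}
  = \frac{1}{6},
  \qquad 0 < p < 1 .
```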
Another argument has been advanced by Halmos and Savage (1949). Our statistician, who knows that the observed sample point is in column 3 but who does not know which one of the 6 points was observed, may try to reconstruct the original data by selecting one of the 6 points at random (for example, by throwing a fair die or by consulting a table of random numbers [see Random Numbers]). The point he gets in this way is not likely to be the point actually observed, but it is easy to verify that the “reconstructed” point has exactly the same distribution as the original point. If the statistician now uses the reconstructed point for inference about p in the same way he would have used the original point, the inference will perform exactly as if the original point had been used. If it is agreed that an inference procedure should be judged by its performance, the statistician who knows only the column, and who has access to a table of random numbers, can do as well as if he knew the actual point. In this sense, the consolidation of the points in each column has cost him nothing.
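The claim that the reconstructed point has the same distribution as the original can be verified exactly for the coin example. The sketch below assumes one arbitrary trial value of p and computes both distributions by enumeration rather than by simulation; the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def point_prob(point, p):
    """Probability of one toss sequence, assuming independent tosses with Pr(H) = p."""
    heads = point.count("H")
    return p ** heads * (1 - p) ** (len(point) - heads)

points = ["".join(t) for t in product("HT", repeat=4)]

def reconstructed_prob(point, p):
    """Probability that the randomly reconstructed point equals `point`.

    The statistician learns only the column (the number of heads) and then
    picks one of the points in that column uniformly at random.
    """
    k = point.count("H")
    column = [pt for pt in points if pt.count("H") == k]
    prob_column = sum(point_prob(pt, p) for pt in column)
    return prob_column * Fraction(1, len(column))

p = Fraction(2, 7)  # trial value, an assumption made only for this check
original = {pt: point_prob(pt, p) for pt in points}
reconstructed = {pt: reconstructed_prob(pt, p) for pt in points}

print(original == reconstructed)
```

Because the two distributions agree point by point, any procedure applied to the reconstructed point has the same performance as when it is applied to the observed one.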
When a (sample) space is simplified by consolidations restricted to points with fixed probability ratio, the simplified space is called sufficient: the term is a natural one, in that the simplified space is “sufficient” for any inference for which the original space could have been used. The original space is itself always sufficient, but one wants to simplify it as much as possible. When all permitted consolidations have been made, the resulting space is called minimally sufficient (Lehmann & Scheffé 1950-1955). In the example the 5-point space consisting of the five columns is minimally sufficient; if only the points of column 3 had been consolidated, the resulting 11-point space would be sufficient, but not minimally so.
It is often convenient to define or describe a consolidation by means of a statistic, that is, a function defined on the sample space. For example, let B denote the number of heads obtained in the four tosses. Then B has the value 0, 1, 2, 3, 4 for the points in columns 1, 2, 3, 4, 5, respectively. Knowledge of the value of B is equivalent to knowledge of the column. It is then reasonable to call B a (minimal) sufficient statistic. (B + 2, B³, and √B, for example, would also be minimal sufficient statistics.) More generally, a statistic is sufficient if it assigns the same value to 2 points only if they have a fixed probability ratio. In Fisher's expressive phrase, a sufficient statistic “contains all the information” that is in the original data.
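A statistic can be viewed as the partition of the sample space that it induces, which makes it easy to see why one-to-one functions of B are equally good. The following sketch uses a hypothetical helper, `partition`, and adds one contrasting statistic, (B - 2)², chosen here as an assumption to show a consolidation that goes too far.

```python
from itertools import product

points = ["".join(t) for t in product("HT", repeat=4)]

def partition(statistic):
    """Group the sample points that the statistic cannot tell apart."""
    groups = {}
    for pt in points:
        groups.setdefault(statistic(pt), []).append(pt)
    return sorted(sorted(g) for g in groups.values())

def B(pt):
    """Number of heads in the four tosses."""
    return pt.count("H")

by_B = partition(B)
by_B_plus_2 = partition(lambda pt: B(pt) + 2)
by_B_cubed = partition(lambda pt: B(pt) ** 3)

# (B - 2)^2 is not one-to-one in B: it merges 0 heads with 4 heads and
# 1 head with 3 heads, so it pools points whose probability ratio depends on p.
by_folded = partition(lambda pt: (B(pt) - 2) ** 2)

print(len(by_B), by_B == by_B_plus_2 == by_B_cubed, len(by_folded))
```

B, B + 2, and B³ all yield the same five groups, the columns of Table 1, whereas the folded statistic yields only three groups and is therefore not sufficient.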
The discussion above is, strictly speaking, correct only for discrete sample spaces. The concepts extend to the continuous case, but there are technical difficulties in a rigorous treatment because of the nonuniqueness of conditional distributions in that case. These technical problems will not be discussed here. (For a general treatment, see Volume 2, chapter 17 of Kendall & Stuart [1943-1946] 1958-1966, and chapters 1 and 2 of Lehmann 1959, where further references to the literature may be found.)
The discovery of sufficient statistics is often facilitated by the Fisher-Neyman factorization theorem. If the probability of the sample point (or the probability density) may be written as the product of two factors, one of which does not involve the parameters and the other of which depends on the sample point only through certain statistics, then those statistics are sufficient. This theorem may be used to verify these examples: (i) If B is the number of “successes” in n Bernoulli trials (n independent trials on each of which the unknown probability, p, of success is the same), then B is sufficient. (ii) If X1, X2, ..., Xn is a random sample from a normal population of known variance but unknown expectation μ, then the sample mean, X̄, is sufficient. (iii) If, instead, the expectation is known but the variance is unknown, then the sample variance (computed around the known mean) is sufficient. (iv) If both parameters are unknown, then the sample mean and variance together are sufficient. (In all four cases, the sufficient statistics are minimal.) In all these examples, the families of distributions are of a kind called exponential. [For an outline of the relationship between families of exponential distributions and sufficient statistics, see Distributions, statistical, article on special continuous distributions.]
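As an illustration, the factorizations for examples (i) and (ii) can be written out explicitly. The notation is an assumption made for this sketch: x_i denotes the observed values (coded 0 or 1 in the Bernoulli case), b their sum, x̄ their mean, and σ² the known variance.

```latex
% (i) n Bernoulli trials, x_i \in \{0, 1\}, b = \sum_{i=1}^{n} x_i
\Pr(x_1, \dots, x_n)
  = \underbrace{p^{\,b}(1-p)^{\,n-b}}_{\text{depends on the sample only through } b}
    \times \underbrace{1}_{\text{free of } p}

% (ii) normal sample with known variance \sigma^2, using
%      \sum_i (x_i - \mu)^2 = \sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2
f(x_1, \dots, x_n)
  = \underbrace{\exp\!\left\{-\frac{n(\bar{x} - \mu)^2}{2\sigma^2}\right\}}_{\text{depends on the sample only through } \bar{x}}
    \times
    \underbrace{(2\pi\sigma^2)^{-n/2}
      \exp\!\left\{-\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{2\sigma^2}\right\}}_{\text{free of } \mu}
```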
In the theory of statistics, sufficiency is useful in reducing the complexity of inference problems and thereby facilitating their solution. Consider, for example, the approach to point estimation in which estimators are judged in terms of bias and variance [see Estimation]. For any estimator, T, and any sufficient statistic, S, the estimator E(T|S), formed by calculating the conditional expectation of T, given S, is a function of S alone, has the same bias as T, and has a variance no larger than that of T (Rao 1945; Blackwell 1947). Hence, nothing is lost if attention is restricted to estimators that are functions of a sufficient statistic. Thus, in example (ii) it is not necessary to consider all functions of all n observations but only functions of the sample mean. It can be shown that X̄ itself is the only function of X̄ which is an unbiased estimator of μ (Lehmann & Scheffé 1950-1955) and that X̄ has a smaller variance than any other unbiased estimator for μ.
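The Rao-Blackwell improvement can be worked through exactly in the four-toss coin example. The sketch below is illustrative only: it assumes the crude unbiased estimator T = 1 if the first toss is heads and 0 otherwise, conditions on the number of heads, and compares means and variances by enumeration. The function names and the trial values of p are hypothetical.

```python
from fractions import Fraction
from itertools import product

def point_prob(point, p):
    """Probability of one toss sequence, assuming independent tosses with Pr(H) = p."""
    heads = point.count("H")
    return p ** heads * (1 - p) ** (len(point) - heads)

points = ["".join(t) for t in product("HT", repeat=4)]

def T(pt):
    """Crude unbiased estimator of p: indicator that the first toss is heads."""
    return 1 if pt[0] == "H" else 0

def conditional_expectation(b, p):
    """E(T | number of heads = b), computed from the definition."""
    col = [pt for pt in points if pt.count("H") == b]
    total = sum(point_prob(pt, p) for pt in col)
    return sum(T(pt) * point_prob(pt, p) for pt in col) / total

def mean_and_variance(estimator, p):
    """Exact mean and variance of an estimator over the 16 sample points."""
    mean = sum(estimator(pt) * point_prob(pt, p) for pt in points)
    var = sum((estimator(pt) - mean) ** 2 * point_prob(pt, p) for pt in points)
    return mean, var

p = Fraction(1, 3)  # trial value, an assumption made only for this check

def improved(pt):
    """The conditioned estimator E(T | B), evaluated at the observed point."""
    return conditional_expectation(pt.count("H"), p)

mean_T, var_T = mean_and_variance(T, p)
mean_S, var_S = mean_and_variance(improved, p)
print(mean_T, var_T)
print(mean_S, var_S)
```

The conditional expectation works out to b/4, the sample proportion of heads, whatever value of p is used to compute it; this is what makes it usable as an estimator.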
Sufficiency in applied statistics. In applied statistical work, the concept of sufficiency is often used to justify the reduction, especially for publication, of large bodies of experimental or observational data to a few numbers, the values of the sufficient statistics of a model devised for the data [see Statistics, descriptive]. For example, the full data may be 500 observations of a population. If the population is normal and the observations are independent, example (iv) justifies reducing the record to two numbers, the sample mean and variance; no information is lost thereby.
Although such reductions are very attractive, particularly to editors, the practice is a dangerous one. The sufficiency simplification is only as valid as the model on which it is based, and sufficiency may be quite “nonrobust”: reduction to statistics sufficient according to a certain model may entail drastic loss of information if the model is false, even if the model is in some sense “nearly” correct. A striking instance is provided by the frequently occurring example (iv). Suppose that the population from which the observations are drawn is indeed symmetrically distributed about its expected value μ and that the distribution is quite like the normal except that there is a little more probability in the tails of the distribution than normal theory would allow. (This extra weight in the tails is usually a realistic modification of the normal, allowing for the occurrence of an occasional “wild value,” or “outlier.”) [See Statistical analysis, special problems of, article on outliers.] In this case the reduction to sample mean and variance may involve the loss of much or even most of the information about the value of μ: there are estimators for μ, computable from the original data but not from the reduced data, that are considerably more precise than X̄ when the altered model holds.
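A small simulation can make the loss of information concrete. Everything specific in this sketch is an assumption chosen for illustration and not taken from the article: the heavy-tailed model is a mixture in which 5 percent of observations come from a normal distribution with standard deviation 10 instead of 1, the alternative estimator is the sample median (which cannot be computed from the sample mean and variance alone), and the sample size, replication count, and seed are arbitrary.

```python
import random
import statistics

def draw(rng, n, contam_prob=0.05, contam_sd=10.0):
    """n observations symmetric about 0: mostly N(0, 1), occasionally N(0, contam_sd^2)."""
    return [
        rng.gauss(0.0, contam_sd if rng.random() < contam_prob else 1.0)
        for _ in range(n)
    ]

def mean_squared_errors(n=50, reps=2000, seed=0):
    """Monte Carlo mean squared error of the sample mean and sample median (true center is 0)."""
    rng = random.Random(seed)
    se_mean = 0.0
    se_median = 0.0
    for _ in range(reps):
        x = draw(rng, n)
        se_mean += statistics.fmean(x) ** 2
        se_median += statistics.median(x) ** 2
    return se_mean / reps, se_median / reps

mse_mean, mse_median = mean_squared_errors()
print(mse_mean, mse_median)
```

Under this particular mixture the median should show a clearly smaller mean squared error than the sample mean, although the size of the gap depends on the assumed contamination.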
Another reason for publication of the original data is that the information suppressed when reducing the data to sufficient statistics is precisely what is required to test the model itself. Thus, the reader of a report whose analysis is based on example (i) may wonder if there were dependences among the n trials or if perhaps there was a secular trend in the success probability during the course of the observations. It is possible to investigate such questions if the original record is available, but the statistic B throws no light on them.
J. L. Hodges, Jr.
Blackwell, David 1947 Conditional Expectation and Unbiased Sequential Estimation. Annals of Mathematical Statistics 18:105-110.
Fisher, R. A. (1922) 1950 On the Mathematical Foundations of Theoretical Statistics. Pages 10.308a-10.368 in R. A. Fisher, Contributions to Mathematical Statistics. New York: Wiley. → First published in Volume 222 of the Philosophical Transactions, Series A, of the Royal Society of London.
Halmos, Paul R.; and Savage, L. J. 1949 Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics. Annals of Mathematical Statistics 20:225-241.
Kendall, Maurice G.; and Stuart, Alan (1943-1946) 1958-1966 The Advanced Theory of Statistics. 3 vols. 2d ed. New York: Hafner; London: Griffin. → Volume 1: Distribution Theory, 1958. Volume 2: Inference and Relationship, 1961. Volume 3: Design and Analysis, and Time-series, 1966 (1st ed.). The first editions of volumes 1 and 2 were by Kendall alone.
Lehmann, E. L. 1959 Testing Statistical Hypotheses. New York: Wiley.
Lehmann, E. L.; and Scheffé, Henry 1950-1955 Completeness, Similar Regions, and Unbiased Estimation. Sankhyā: The Indian Journal of Statistics 10:305-340; 15:219-236.
Rao, C. Radhakrishna 1945 Information and the Accuracy Attainable in the Estimations of Statistical Parameters. Calcutta Mathematical Society, Bulletin 27:81-91.