# Estimation

I. POINT ESTIMATION *D. L. Burkholder*

II. CONFIDENCE INTERVALS AND REGIONS *J. Pfanzagl*

## I POINT ESTIMATION

How many fish are in this lake? What proportion of the voting population favors candidate A? How much paint is needed for this particular room? What fuel capacity should this airplane have if it is to carry passengers safely between New York and Paris? How many items in this shipment have the desired quality? What is the specific gravity of this metal? Questions like these represent problems of point estimation. In present-day statistical methodology, such problems are usually cast in the following form: A mathematical model describing a particular phenomenon is completely specified except for some unknown quantity or quantities. These quantities must be estimated. Galileo’s model for freely falling bodies and many models in learning theory, small group theory, and the like provide examples.

Exact answers are often impossible, difficult, expensive, or merely inconvenient to obtain. However, approximate answers that are quite likely to be close to the exact answer may be fairly easily obtainable. The theory of point estimation provides a guide for obtaining such answers; above all, it makes precise, or provides enough framework so that one could make precise, such phrases as “quite likely to be close” and others such as “this estimator is better than that one.”

As an introduction to some of the problems involved, consider estimating the number *N* of fish in a given lake. Suppose that M fish are taken from the lake, marked, and returned to the lake unharmed. A little later, a random sample of size *n* of fish from the lake is observed to contain *x* marked fish. A little thought suggests that probably the ratio *x/n* is near *M/N* or that the unknown *N* and the ratio *Mn/x* (defined only if *x >* 0) are not too far apart. For example, if *M =* 1,000, *n =* 1,000, and *x =* 20, it might be reasonable to believe that *N* is close to 50,000. [A *similar example, concerning moving populations of workers, is discussed in* SAMPLE SURVEYS.]

Clearly, this procedure *may* lead one badly astray. For example, it is possible, although highly unlikely, that the same value *x* = 20 could be obtained, and hence, using the above procedure, *N* be estimated as 50,000, even if *N* is actually as small as 1,980 or as large as 10,000,000. Clearly, considerations of probability are basic here. If *L(N)* denotes the probability of obtaining 20 marked fish when *N* fish are in the lake, it can be shown that 0 = *L*(1,979) < *L*(1,980) < … < *L*(49,999) = *L*(50,000) and *L*(50,000) > *L*(50,001) > …; that is, *N* = 50,000 maximizes the *likelihood* of obtaining 20 marked fish.
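The ordering of the likelihood values can be checked by direct computation. The following is a minimal sketch, assuming the hypergeometric model for a random sample drawn without replacement (the model this article specifies in its section on steps in solving an estimation problem). The function name `likelihood` and its default arguments are illustrative choices, not part of the original text.

```python
from fractions import Fraction
from math import comb


def likelihood(N, M=1000, n=1000, x=20):
    """Exact probability of seeing x marked fish in a sample of n,
    when M of the N fish in the lake are marked (hypergeometric model)."""
    # A lake must hold the M marked fish plus the n - x unmarked fish observed;
    # smaller values of N are impossible and get probability zero.
    if N < M + (n - x):
        return Fraction(0)
    return Fraction(comb(M, x) * comb(N - M, n - x), comb(N, n))
```

Exact rational arithmetic is used so that the tie between *N* = 49,999 and *N* = 50,000 can be tested with `==` rather than with a floating-point tolerance; for instance, `likelihood(49999) == likelihood(50000)` and `likelihood(50000) > likelihood(50001)` can both be evaluated directly.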

**Design of experiments** . What values of M and *n* are most satisfactory in the above experiment? Clearly, the bigger *n* is, the better it is for estimation purposes, but the more expensive the experiment *[see* EXPERIMENTAL DESIGN]. A balance has to be reached between the conflicting goals of minimizing error and minimizing expense. Also, perhaps another experimental design might give better results. In the above problem, let M = 1,000, but instead of pulling a fixed number of fish out of the lake, pull out fish until exactly *x* marked fish have been obtained, where *x* is fixed in advance. Then *n,* the sample size, is the observation of interest *[see* SEQUENTIAL ANALYSIS]. Which design, of all the possible designs, should be used? This kind of question is basic to any estimation problem.

**Testing hypotheses** . An altogether different problem would arise if one did not really want the value of N for its own sake but only as a means of deciding whether or not the lake should be restocked with small fish. For example, it might be desirable to restock the lake if *N* is small, say less than 100,000, and undesirable otherwise. In this case, the problem of whether or not the lake should be restocked is equivalent to testing the hypothesis that *N* is less than 100,000 *[see* HYPOTHESIS TESTING]. In general, a good estimator does not necessarily lead to a good test.

**Confidence intervals** . The value of an estimator, that is, a point estimate, of *N* for a particular sample is a number, hopefully one close to *N;* the value of a confidence interval, that is, an interval estimate, of *N* for a particular sample is an interval, hopefully one that is not only small but that also contains *N [see* ESTIMATION, *article on* CONFIDENCE INTERVALS AND REGIONS]. The problem of finding a good interval estimate is more closely related to hypothesis testing than it is to point estimation.

Note that certain problems are clearly point estimation problems rather than problems of interval estimation: when deciding what the fuel capacity of an airplane should be, the designers must settle on one particular number.

### Steps in solving an estimation problem

The first step in the solution of an estimation problem, as suggested above, is to design an experiment (or method of taking observations) such that the outcome of the experiment, call it *x*, is affected by the unknown quantity to be estimated, which in the above discussion was *N.* Typically, *x* is related to *N* probabilistically rather than deterministically. This probability relation must be specified. For example, the probability of obtaining *x* marked fish in a sample of size *n* is given by the hypergeometric distribution,

$$P(x) = \frac{\binom{M}{x}\binom{N-M}{n-x}}{\binom{N}{n}},$$

provided the sample has been drawn randomly without replacement [see DISTRIBUTIONS, STATISTICAL, *article on* SPECIAL DISCRETE DISTRIBUTIONS]. (The denominator is the number of combinations of *N* things taken n at a time, and so forth.) If the randomness assumption is not quite satisfied, then the specified probability relation will be only approximately true. Such specification problems and their implications will be discussed later. Next, after the experiment has been designed and the probability model specified, one must choose a function *f* defined for each possible *x* such that if *x* is observed, then *f(x),* the value of the function *f* at *x,* is to be used as a numerical estimate of *N.* Such a function *f* is called an *estimator* of *N.* The problem of the choice of *f* will be discussed later. Finally, after a particular estimator *f* has been tentatively settled on, one might want to calculate additional performance characteristics of *f,* giving further indications of how well *f* will perform on the average. If the results of these calculations show that f will not be satisfactory, then changes in the design of the experiment, for example, an increase in sample size, might be contemplated. Clearly, there is a good deal of interplay among all the steps in the solution of an estimation problem outlined here.

**Terminological note** . Some authors distinguish terminologically between the *estimator,* the function *f,* and its numerical value for a particular sample, the *estimate.* Another distinction is that between a random variable and a generic value of the random variable. (Some authors use X for the former and *x* for the latter.) Such distinctions are sometimes important, but they are not generally made in this article, although special comments appear in a few places. Otherwise it should be clear from context whether reference is made to a function or its value, or whether reference is made to a random variable or its value.

### Choice of estimator

As a means of illustrating the various considerations influencing the choice of an estimator, a few typical examples will be discussed.

*Example 1.* Let $x$ be the number of successes in $n$ independent trials, the probability of a success on an individual trial being $p$. (For example, $x$ might be the number of respondents out of $n$ questioned in a political poll who say they are Democrats, and $p$ is the probability that a randomly chosen individual in the population will say he is a Democrat.) Here $p$ is unknown and may be any number between 0 and 1 inclusive. An estimator $f$ of $p$ ideally should be such that $f(x)$ is close to $p$ no matter what the unknown $p$ is and no matter what the observation $x$ is. That is, the error $f(x) - p$ committed by using $f(x)$ as an approximation to $p$ should always be small. This is too much to expect since $x$ can, by chance, be quite misleading about $p$. However, it is not too much to expect that the error be small in some average sense. For example, the mean squared error,

$$E_p(f - p)^2 = \sum_{x=0}^{n} [f(x) - p]^2 \binom{n}{x} p^x (1 - p)^{n-x},$$

should be small no matter what the unknown $p$ is, or the mean absolute error $E_p|f - p|$ should be small no matter what $p$ is, or the like. For the time being, estimators will be compared only on the basis of their mean squared errors. A more general approach, the underlying ideas of which are well illustrated in this special case, will be mentioned later.

The first question that arises is, Can one find an estimator $f$ such that, for every $p$ satisfying $0 \le p \le 1$, the mean squared error of $f$ at $p$ is smaller than (or at least not greater than) the mean squared error at $p$ of any other estimator? Obviously, such an estimator would be best in this mean squared error sense. Unfortunately, and this is what makes the problem of choosing an estimator a nontrivial problem, a best estimator does not exist. To see this, consider the estimators $f_1$ and $f_2$ defined by $f_1(x) = x/n$ and $f_2(x) = ½$. It is not hard to show that $E_p(f_1 - p)^2 = p(1 - p)/n$, and clearly, $E_p(f_2 - p)^2 = (½ - p)^2$. If a best estimator $f$ existed it would have to satisfy $E_p(f - p)^2 \le E_p(f_2 - p)^2$. But the latter quantity is zero for $p = ½$, implying that $f = f_2$. However, $f_2$ is not best since $E_p(f_1 - p)^2$ is smaller than $E_p(f_2 - p)^2$ for $p$ near 0 or 1.

Although no best estimator exists, many good estimators exist. For example, there are many estimators $f$ satisfying $E_p(f - p)^2 \le 1/(4n)$ for $0 \le p \le 1$. The estimator $f_1$, defined above, is such an estimator. The estimator $f_3$ defined by

$$f_3(x) = \frac{x + \sqrt{n}/2}{n + \sqrt{n}},$$

with mean squared error

$$E_p(f_3 - p)^2 = \frac{1}{4(1 + \sqrt{n})^2},$$

is another. If $n$ is large, the mean squared error of any such estimator is small for each possible value of $p$. In this problem, as is typical, any one of many good available estimators would no doubt be reasonable to use in practice. Only by adding further assumptions, for example, assumptions giving some information about the unknown $p$, can the class of reasonable estimators be narrowed. Note that estimators are still being compared on the basis of their mean squared errors only.

The estimator $f_3$ is *minimax* in the sense that $f_3$ minimizes $\max_{0 \le p \le 1} E_p(f - p)^2$ with respect to $f$. The minimax approach focuses attention on the worst that can happen using $f$ and chooses $f$ accordingly [*see* DECISION THEORY]. Note that the estimator $f_1$ ($f_1(x) = x/n$) does have slightly larger mean squared error than does $f_3$ for values of $p$ near ½; for values of $p$ near 0 or 1 the advantage lies wholly with $f_1$. Other properties of these estimators will be discussed later.
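The mean squared error comparisons in example 1 can be reproduced exactly by summing over the binomial distribution. The sketch below is illustrative only: the helper names `mse` and `make_estimators` are hypothetical, and $f_3$ is taken to be the standard minimax estimator $(x + \sqrt{n}/2)/(n + \sqrt{n})$, which is an assumption wherever the printed formula is not legible.

```python
from math import comb, sqrt


def mse(f, n, p):
    """Exact mean squared error E_p[(f(x) - p)^2] when x ~ Binomial(n, p)."""
    return sum(
        (f(x) - p) ** 2 * comb(n, x) * p**x * (1 - p) ** (n - x)
        for x in range(n + 1)
    )


def make_estimators(n):
    """Return the three estimators of example 1 for sample size n."""
    f1 = lambda x: x / n                              # sample proportion
    f2 = lambda x: 0.5                                # ignores the data
    f3 = lambda x: (x + sqrt(n) / 2) / (n + sqrt(n))  # assumed minimax form
    return f1, f2, f3


if __name__ == "__main__":
    n = 25
    f1, f2, f3 = make_estimators(n)
    for p in (0.01, 0.25, 0.5):
        # Columns: p, then the mean squared errors of f1, f2, f3 at p.
        print(p, mse(f1, n, p), mse(f2, n, p), mse(f3, n, p))
```

Running the loop for a few values of $p$ shows the pattern described in the text: $f_2$ is unbeatable at $p = ½$ and poor elsewhere, $f_3$ has the same mean squared error at every $p$, and $f_1$ does better than $f_3$ near 0 and 1 but slightly worse near ½.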

*Example 2.* Suppose that $x_1, x_2, \ldots, x_n$ are observations on $n$ independent random variables, each having the Poisson distribution with parameter λ, where λ is unknown and may be any non-negative number [*see* DISTRIBUTIONS, STATISTICAL, *article on* SPECIAL DISCRETE DISTRIBUTIONS]. For example, $x_k$ could be the number of occurrences during the $k$th time interval of unit length of any phenomenon occurring “randomly” over time, possibly telephone calls coming into an exchange, customers coming into a store, and so forth [*see* QUEUES]. Knowing that λ is both the mean and variance of the Poisson distribution, it might not be unreasonable to suppose that both the sample mean,

$$m = \frac{1}{n}\sum_{k=1}^{n} x_k,$$

and the sample variance,

$$s^2 = \frac{1}{n-1}\sum_{k=1}^{n} (x_k - m)^2$$

(here, one must assume that $n > 1$), provide good estimators of the unknown λ. It is not hard to show that $m$ is *better* than $s^2$, that is, $E_\lambda(m - \lambda)^2 \le E_\lambda(s^2 - \lambda)^2$ for all $\lambda \ge 0$, with strict inequality for some $\lambda \ge 0$.
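The comparison can be made explicit. The following is a sketch, assuming two standard facts not stated in the text: the variance of a sample variance is $[\mu_4 - (n-3)\sigma^4/(n-1)]/n$, and the Poisson distribution has $\sigma^2 = \lambda$ and fourth central moment $\mu_4 = \lambda + 3\lambda^2$. Both estimators have mean λ, so each mean squared error equals a variance:

```latex
\begin{aligned}
E_\lambda(m - \lambda)^2 &= \frac{\lambda}{n}, \\
E_\lambda(s^2 - \lambda)^2
  &= \frac{1}{n}\left[\lambda + 3\lambda^2 - \frac{n-3}{n-1}\,\lambda^2\right]
   = \frac{\lambda}{n} + \frac{2\lambda^2}{n-1}.
\end{aligned}
```

The two quantities agree only at $\lambda = 0$; for every $\lambda > 0$ the sample variance is strictly worse, by the amount $2\lambda^2/(n-1)$.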

An estimator is *inadmissible* (with respect to a given criterion like mean squared error) if a better one exists; accordingly, the estimator $s^2$ is here inadmissible. An estimator is admissible if it is not inadmissible. Although it is not obvious, the estimator $m$ is admissible. In fact the class of admissible estimators is very large here, as is typically the case. In example 1, all three estimators discussed, $f_1$, $f_2$, and $f_3$, are admissible.

*Example 3.* Let $x_1, x_2, \ldots, x_n$ be observations on $n$ independent random variables each having the normal distribution with mean μ and variance $\sigma^2$, where both μ and $\sigma^2$ are unknown; μ may be any real number and $\sigma^2$ may be any positive number. One might be interested in estimating only μ, only $\sigma^2$, the pair $(\mu, \sigma^2)$, or perhaps some combination such as $\mu/\sigma$.

*Example 4.* Let $x_1, x_2, \ldots, x_n$ be observations on $n$ independent random variables each having the uniform distribution over the set of integers $\{1, 2, \ldots, N\}$, where $N$ may be any positive integer. For example, in a state where automobile license plates are numbered from 1 to $N$, each $x_i$ would be the number of a randomly chosen license plate. What is a good estimator of $N$?

**Sufficient statistics** . A simple and effective way to narrow the class of estimators that one ought to consider when choosing a good estimator is to identify a sufficient statistic for the problem and to consider only those estimators that depend on the sufficient statistic [*see* SUFFICIENCY]. Roughly speaking, if $t$ is a sufficient statistic, knowing $t(x)$ is as useful as knowing $x$. The following result is important. If $t$ is a sufficient statistic and $f$ is an estimator with finite mean squared error and $f$ does not depend on $t$ (that is, $f$ is not essentially expressible as $f = h(t)$ for some function $h$), then there is another estimator $f_0$ that does depend on $t$ and such that $f_0$ is better than $f$ (in the technical sense defined above). One $f_0$ that works is the conditional expectation of $f$ relative to $t$.

In example 2, $m$ is a sufficient statistic, hence only estimators depending on $m$ need be considered. In particular $s^2$, which does not depend on $m$ in the sense defined above, need not be considered. In example 3, the ordered pair $(m, s^2)$, where $m$ and $s^2$ are defined as in example 2, is a sufficient statistic. In example 4, the estimator $2m - 1$ might seem at first to be a plausible estimator of $N$. However, it does not depend on the sufficient statistic $u$ defined by $u(x) =$ the largest of the $x_k$. Much better estimators than $2m - 1$ exist. For example, the rather complicated

$$f_4(x) = \frac{u^{n+1} - (u - 1)^{n+1}}{u^n - (u - 1)^n}$$

is such an estimator. (Note that $f_4$ is approximately equal to $(n + 1)u/n$.)
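The gain from using the sufficient statistic in example 4 can be checked exactly for small cases. The sketch below assumes that $f_4$ has the form $[u^{n+1} - (u-1)^{n+1}]/[u^n - (u-1)^n]$ and that the moment-based competitor is $2m - 1$; both forms, and all function names, are assumptions made for illustration. It uses the fact that the largest of $n$ independent draws from $\{1, \ldots, N\}$ equals $k$ with probability $[k^n - (k-1)^n]/N^n$.

```python
from fractions import Fraction


def f4(u, n):
    """Estimator of N based on the sample maximum u (assumed form of f_4)."""
    return Fraction(u ** (n + 1) - (u - 1) ** (n + 1), u**n - (u - 1) ** n)


def max_pmf(k, N, n):
    """P(largest of n independent draws from {1, ..., N} equals k)."""
    return Fraction(k**n - (k - 1) ** n, N**n)


def mean_and_variance_f4(N, n):
    """Exact mean and variance of f4(u) under the uniform model."""
    mean = sum(f4(k, n) * max_pmf(k, N, n) for k in range(1, N + 1))
    second = sum(f4(k, n) ** 2 * max_pmf(k, N, n) for k in range(1, N + 1))
    return mean, second - mean**2


def variance_2m_minus_1(N, n):
    """Exact variance of 2m - 1, using var(x_k) = (N^2 - 1)/12."""
    return Fraction(N * N - 1, 3 * n)
```

For example, `mean_and_variance_f4(100, 5)` returns a mean of exactly 100, and the variance it returns can be compared with `variance_2m_minus_1(100, 5)` to see how much is lost by ignoring the sample maximum.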

**Further criteria for choice of estimator** . So far estimators have been compared on the basis of their mean squared errors only. Since no best estimator exists, a unique solution to the problem of choosing an estimator is generally not obtainable by this approach. This is not really too regrettable since many good estimators usually exist. Even demanding that an estimator be minimax, not necessarily always a reasonable demand, does not always lead to a unique estimator. In example 2, every estimator of λ has unbounded mean squared error; and in example 3 every estimator of μ has unbounded mean squared error. Hence, in these two examples, all estimators of λ and μ, respectively, are minimax, but the concept loses all interest. In example 1, demanding minimaxity does lead to the unique minimax estimator $f_3$. A unique minimax estimator is clearly admissible.

The strong intellectual and psychological tendency of human beings to be satisfied only with unique answers has often led to further demands being placed on estimators in addition to the one that their mean squared errors be small.

*Unbiasedness.* An estimator is *unbiased* if the mean value of the estimator is equal to the quantity being estimated. In example 1, $f_1$ is unbiased since $E_p f_1 = p$, $0 \le p \le 1$. Both $m$ and $s^2$ are unbiased estimators of λ in example 2. In example 3, $m$ is an unbiased estimator of μ and $s^2$ is an unbiased estimator of $\sigma^2$. In example 4, both $2m - 1$ and $f_4$ are unbiased estimators of $N$. The search for a best unbiased estimator often leads to a unique answer. In example 2, an estimator $f$ would be best unbiased (or minimum variance unbiased) if it is unbiased and satisfies

$$E_\lambda(f - \lambda)^2 \le E_\lambda(f^* - \lambda)^2$$

for every $\lambda \ge 0$ and every unbiased estimator $f^*$. The estimator $m$ is such an estimator for this problem and is the only such estimator. The estimator $f_1$ is the best unbiased estimator of $p$ in example 1; the estimator $m$ is the best unbiased estimator of μ in example 3; and the estimator $f_4$ is the best unbiased estimator of $N$ in example 4. The *relative efficiency* of two unbiased estimators is the ratio of their reciprocal variances. Relative efficiency may well depend on the parameter value.

Unbiased estimators fail to exist in some important problems. Using the first design mentioned in the problem of estimating the number of fish in a lake, *N*, no unbiased estimator of *N* exists. Although in example 3, $s^2$ is a best unbiased estimator of $\sigma^2$, another estimator of $\sigma^2$, $(n - 1)s^2/(n + 1)$, despite being biased, is actually better than $s^2$ in the sense of mean squared error. This shows that placing extra demands on estimators can actually come into conflict with the small mean squared error demand. Of course, the relative importance of the various properties an estimator may have will no doubt be judged slightly differently by different reasonable individuals.
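A short derivation shows why a biased multiple of the sum of squares can win. It is a sketch that assumes a standard fact not stated above: under the normal model of example 3, $S = \sum_k (x_k - m)^2$ satisfies $S/\sigma^2 \sim \chi^2_{n-1}$, so that $E S = (n-1)\sigma^2$ and $\operatorname{var} S = 2(n-1)\sigma^4$. The constant $c$ is an illustrative symbol.

```latex
\begin{aligned}
E(cS - \sigma^2)^2 &= c^2 \operatorname{var} S + (c\,E S - \sigma^2)^2
  = \sigma^4\left[\,2(n-1)c^2 + \bigl((n-1)c - 1\bigr)^2\right], \\
c = \tfrac{1}{n-1}\ (\text{that is, } s^2): \quad
  & E(s^2 - \sigma^2)^2 = \frac{2\sigma^4}{n-1}, \\
c = \tfrac{1}{n+1}: \quad
  & E\!\left(\frac{S}{n+1} - \sigma^2\right)^2
    = \sigma^4\,\frac{2(n-1) + 4}{(n+1)^2} = \frac{2\sigma^4}{n+1}.
\end{aligned}
```

Setting the derivative of the bracketed expression with respect to $c$ equal to zero gives $c = 1/(n+1)$, so $(n-1)s^2/(n+1) = S/(n+1)$ has the smallest mean squared error among all constant multiples of $s^2$.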

*Invariance.* Notions of invariance can sometimes be invoked so that a best invariant estimator exists. For example, if $x = (x_1, \ldots, x_n)$, $b$ is a real number, and $y = (x_1 + b, \ldots, x_n + b)$, the estimator $m$ is invariant in the sense that it satisfies $m(y) = m(x) + b$. It turns out that among all the estimators of μ in example 3 with this property of translation invariance, the estimator $m$ is best in the usual mean squared error sense. The argument for invariance may be stated rather loosely as follows. Irrelevancies in the data (for example, whether time is measured from 12 noon New York time or from 12 noon Greenwich time) should not make a fundamental difference in the results obtained from the analysis of the data.

A different kind of invariance problem can be troublesome in some circumstances. Suppose in example 1 that interest centers not on *p* but on some function of *p,* say 1/p. If f is a satisfactory estimator of *p,* it need not follow that 1/f is a satisfactory estimator of *1/p,* for properties like unbiasedness, mean squared error functions, etc., can change drastically under nonlinear transformations. Fortunately, in many problems the parameter itself, or a single function of it, is of central interest, so that this kind of noninvariance is not serious.

**Specification problems** . So far, estimators have been chosen relative to given probability models. If an estimator seems satisfactory for a given probability model, it may be relevant to ask if this estimator is also good for probability models closely related to the given one. For example, it is too much to expect that a model postulating normal distributions describes exactly the practical situation of interest. Fortunately, in many common problems slight changes in the probability model will not materially affect the goodness of an estimator reasonable for the original model [*see* ERRORS, *article on the* EFFECTS OF ERRORS IN STATISTICAL ASSUMPTIONS]. For example, the estimator $m$ of μ in example 3 is actually a fairly reasonable estimator of the population mean μ in a large variety of cases, particularly if the population variance $\sigma^2$ is finite and the sample size $n$ is not too small, as can be seen from the formula for its mean squared error, $\sigma^2/n$. Circumstances arise, however, in which alternative estimators, for example, the sample median, not so much affected by slight changes in the tails of the distribution, may need to be considered. [*A process for arriving at other such estimators, called Winsorization, is discussed in* NONPARAMETRIC STATISTICS, *article on* ORDER STATISTICS; *the closely related concept of trimming is discussed in* ERRORS, *article on the* EFFECTS OF ERRORS IN STATISTICAL ASSUMPTIONS.]

**More than one parameter** . Most of the material of this article deals with estimation of a single parameter. The multiparameter case is, of course, also important and multiparameter analogues of all the topics in this article exist. They are treated in the references, for example by Kendall and Stuart (1946), Cramér (1945), and Wilks (1962).

### Constructive estimation methods

**Maximum likelihood estimators** . In example 1, the estimator $f_1$ is the maximum likelihood estimator of $p$: For each $x$, $f_1(x)$ is that value of $p$ maximizing $\binom{n}{x} p^x (1 - p)^{n-x}$, the probability of obtaining $x$. In example 2, $m$ is the maximum likelihood estimator of λ. In example 3, $(m, [n - 1]s^2/n)$ is the maximum likelihood estimator of $(\mu, \sigma^2)$. In example 4, $u$ is the maximum likelihood estimator of $N$. In the problem of estimating the number of fish in a lake, $N$, using the first design, no maximum likelihood estimator exists since no such estimate can be defined for $x = 0$, although for $x > 0$ no trouble occurs. In some examples, there is no unique maximum likelihood estimator.

Maximum likelihood estimators are often easy to obtain. A maximum likelihood estimator does not necessarily have small mean squared error nor is it always admissible. So the maximum likelihood principle can sometimes conflict with the small mean squared error principle. Nevertheless, maximum likelihood estimators are often quite good and worth looking at. If the sample size is large, they tend to behave nearly as nicely as the estimator *m* of *μ* in example 3.

Maximum likelihood estimation is often constructive, that is, the method provides machinery that often gives a unique estimating function. There are other constructive methods, three of which are described here: the method of moments, least squares, and Bayes estimation. One or another of these constructive methods may provide a simpler or a better behaved estimator in any particular case.

**The method of moments** . The approach of the method of moments (or of expected values) is to set one or more sample moments equal to the corresponding population moments and to “solve,” if possible, for the parameters, thus obtaining estimators of these parameters. The method is particularly appropriate for simple random sampling. In example 1, if the sample is regarded as made up of *n* observations, the kth being a 1 (success) or 0 (failure), the sample mean is *x/n* and the population mean is *p,* so the resulting method of moments estimator is *x/n.* In example 4, the method of moments, as it would ordinarily be applied, leads to a poor estimator. The method can, nonetheless, be very useful, especially in more complex cases with several parameters.

**Least squares** . The least squares approach is especially useful when the observations are not obtained by simple random sampling. One considers the formal sum of squares $\sum_k (x_k - EX_k)^2$, where $x_k$ is an observation on the random variable $X_k$ with expectation $EX_k$ (depending on the parameters to be estimated). Then one attempts to minimize the sum of squares over possible values of the parameters. If a unique minimum exists, the minimizing values of the parameters are the values of their *least squares estimators.*

The method is particularly appropriate when the $X_k$ are independent and identically distributed except for translational shifts that are given functions of the parameters. If the $X_k$ all have the same expectation, as in examples 1-4, the least squares estimator of that expectation is the sample mean. Least squares estimation, without modification or extension, does not provide estimators of parameters (like $\sigma^2$ in example 3) that do not enter into expectations of observations. [*A fuller treatment of this topic appears in* LINEAR HYPOTHESES, *article on* REGRESSION.]

**Bayes estimation** . Consider example 1 again, this time supposing that the unknown $p$ is itself the outcome of some experiment and that the probability distribution underlying this experiment is known. For example, $x$ could be the number of heads obtained in $n$ tosses of a particular coin, the probability of a head for the particular coin being $p$ where $p$ is unknown, but where the coin has been picked randomly from a population of coins with a known distribution of $p$ values. Then it would be reasonable to choose an estimator $f$ that minimizes the mean value of the squared error $[f(x) - p]^2$ where the averaging is done with respect to the *known* joint distribution of $x$ and $p$. Such a minimizing estimator is called a *Bayes estimator;* of course it depends on the distribution assigned to $p$ [*see* BAYESIAN INFERENCE].

A distribution may be assigned to $p$ merely as a technical device for obtaining an estimator and completely apart from the question of whether $p$ actually is the outcome of an experiment. This is the spirit in which Bayes estimators are often introduced, as a way of obtaining an estimator that may or may not have good properties. On the other hand, one may assign a distribution to $p$ in such a way that those values of $p$ that seem more likely to obtain are given greater weight. Of course, different individuals might assign different distributions, for this is a matter of judgment. However, this approach does provide one possible method for using any previously obtained information about $p$ that may be available. It would be rather rare that the *only* information available about $p$ before the experiment is that $0 < p < 1$.

Examples of Bayes estimators include the estimator $f_3$ of example 1, obtained by assigning a certain beta distribution to $p$, and the estimator $f_5$ of $p$, defined by $f_5(x) = (x + 1)/(n + 2)$, obtained by assigning to $p$ the uniform distribution on the interval between 0 and 1. [*See* DISTRIBUTIONS, STATISTICAL, *article on* SPECIAL CONTINUOUS DISTRIBUTIONS, *for discussions of these specific distributions.*] Even $f_2$ is a Bayes estimator. However, $f_1$ is not a Bayes estimator but is rather the limit of a sequence of Bayes estimators.
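The estimator obtained from the uniform prior can be verified directly. The sketch below relies on a standard fact not spelled out in the text: under squared error, the Bayes estimator is the mean of the posterior distribution of $p$ given $x$. With a uniform prior the posterior density is proportional to $p^x(1-p)^{n-x}$, so its mean is a ratio of two Beta integrals. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def beta_fn(a, b):
    """Beta function B(a, b) for positive integers a and b."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def posterior_mean_uniform_prior(x, n):
    """Posterior mean of p given x successes in n trials, uniform prior on (0, 1).

    Numerator:   integral of p * p^x (1 - p)^(n - x) dp = B(x + 2, n - x + 1)
    Denominator: integral of     p^x (1 - p)^(n - x) dp = B(x + 1, n - x + 1)
    """
    return beta_fn(x + 2, n - x + 1) / beta_fn(x + 1, n - x + 1)
```

For any $x$ and $n$ the result equals $(x + 1)/(n + 2)$ exactly; for instance `posterior_mean_uniform_prior(3, 10)` returns `Fraction(1, 3)`.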

Restricting attention to estimators that are Bayes or the limits (in a certain sense) of sequences of Bayes estimators usually assures one of not overlooking any admissible estimator. Bayes methods frequently prove useful as technical devices in solving for minimax estimators and in many other situations.

### Asymptotic estimation theory

Because it is often difficult to compare estimators for small sample sizes, much research on point estimation is in terms of large sample sizes, working with limits as the sample size goes to infinity. In this context, an estimator itself is not considered, but rather a *sequence* of estimators, each member of which corresponds to a single sample size. For example, consider the sequence of sample means $m_1, m_2, \ldots$, where $m_n = (x_1 + \cdots + x_n)/n$. If a sequence of estimators has desirable properties in a limiting large sample sense, it is often presumed that particular members of the sequence will to some extent partake of these desirable properties.

**Consistency** . An asymptotic condition that is often regarded as essential is that of *consistency,* in the sense that the sequence of estimators is close to the true value of the parameter, with high probability, for large sample sizes. More precisely, if $\{t_n\}$ is the sequence of estimators, and if θ is the parameter being estimated, the sequence $\{t_n\}$ is said to estimate θ consistently if, for every interval $I$ containing θ in its interior, the probability that the value of $t_n$ belongs to $I$ approaches 1 as $n$ approaches infinity, no matter what the value of θ is. (There is also a nonasymptotic concept of consistency, closely related to the above. Both ideas, and their applications, originated with R. A. Fisher.)

**Comparison of estimators** . For simplicity, consider now independent identically distributed random variables with common distribution depending on a single parameter, θ. Let $\phi_\theta$ be the density function (or frequency function) corresponding to that common distribution for the parameter value θ. A large number of regularity conditions are traditionally, and often tacitly, imposed on $\phi_\theta$; for example, distributions like those of example 4 do not come under the standard theory here. In this brief summary, the regularity conditions will not be discussed. With almost no modifications, the discussion applies to qualitative, as well as numerically valued, random quantities.

Two sequences of estimators, competing as estimators of *θ*, are often compared by considering the ratios of their asymptotic variances, that is, the variances of limit distributions as *n* approaches infinity. In particular, one or both sequences may have the lowest possible asymptotic variance. In discussing such matters, the following constructs, invented and named by R. A. Fisher, are important.

*Score, Fisher information, and efficiency.* The *score* of the single observation $x_k$ is a function of both $x_k$ and θ defined by

$$s_\theta(x_k) = \frac{\partial}{\partial\theta} \log \phi_\theta(x_k),$$

and it provides the relative change in $\phi_\theta$ (for each possible value of $x_k$) when θ is slightly changed. Two basic facts about the score are $E_\theta s_\theta = 0$ and $\operatorname{var}_\theta s_\theta = -E_\theta(\partial s_\theta/\partial\theta)$. The quantity $-E_\theta(\partial s_\theta/\partial\theta)$ is often called the *Fisher information* contained in a single observation and is denoted by $I(\theta)$.

For the entire sample, $x = (x_1, x_2, \ldots, x_n)$, the *sample score* is just the sum of the single observation scores,

$$s_{\theta n}(x) = \sum_{k=1}^{n} s_\theta(x_k).$$

The Fisher information $I_n(\theta)$ contained in the entire sample is defined as above with $s_{\theta n}$ replacing $s_\theta$; it is just the sum of the Fisher information values for the $n$ single observations. Under the assumptions, each observation contributes the same amount to total information (that is, $I(\theta)$ is the same for each observation), so that $I_n(\theta) = n I(\theta)$.
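The two basic facts about the score are easy to confirm in a small case. The sketch below uses a single success-or-failure observation with success probability θ, for which $\log \phi_\theta(x) = x \log\theta + (1 - x)\log(1 - \theta)$. The choice of this model and the function names are illustrative assumptions; θ is passed as a `Fraction` so that the identities hold exactly.

```python
from fractions import Fraction


def score(x, theta):
    """Score of one observation: d/dtheta of x*log(theta) + (1-x)*log(1-theta)."""
    return Fraction(x) / theta - Fraction(1 - x) / (1 - theta)


def score_derivative(x, theta):
    """Derivative of the score with respect to theta."""
    return -Fraction(x) / theta**2 - Fraction(1 - x) / (1 - theta) ** 2


def expectation(g, theta):
    """E_theta g(x, theta) when x is 1 with probability theta and 0 otherwise."""
    return (1 - theta) * g(0, theta) + theta * g(1, theta)
```

With `theta = Fraction(3, 10)`, `expectation(score, theta)` is zero, and both the variance of the score and `-expectation(score_derivative, theta)` equal $1/[\theta(1-\theta)]$, the Fisher information in one observation; for $n$ observations the information is $n$ times as large.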

Except for sign, $I_n(\theta)$ is the curvature of the likelihood function near the true value of θ. Roughly speaking, sharp curvature of the likelihood function corresponds to sharper estimation, or lower variance of estimation. The *information inequality* says that, for sequences of estimators $\{t_n\}$ such that $\sqrt{n}(t_n - \theta)$ converges in distribution to a distribution with mean zero and variance $\sigma^2$,

$$\sigma^2 \ge \frac{1}{I(\theta)}.$$

Nonasymptotic variants of this inequality have been explored by Darmois, Dugué, Cramér, Rao, and others. The basic variant, for an unbiased estimator $t_n$, based on a sample of size $n$, is

$$\operatorname{var}_\theta t_n \ge \frac{1}{n I(\theta)}.$$

(This is usually called the Cramér-Rao inequality.) Under the tacit regularity conditions, this inequality becomes an equality just when

$$t_n(x) = \theta + \frac{s_{\theta n}(x)}{n I(\theta)}.$$

This can happen only if the right side is not a function of θ, and this in turn occurs (under regularity) when and only when the distributions given by $\phi_\theta$ form an exponential family [*see* DISTRIBUTIONS, STATISTICAL, *article on* SPECIAL CONTINUOUS DISTRIBUTIONS].
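As a worked illustration, consider the Poisson model of example 2, with $\phi_\lambda(x) = e^{-\lambda}\lambda^x/x!$. The calculation below is a sketch that takes the Cramér-Rao inequality in its standard form, $\operatorname{var}_\theta t_n \ge 1/[n I(\theta)]$, with equality just when $t_n = \theta + s_{\theta n}(x)/[n I(\theta)]$:

```latex
\begin{aligned}
s_\lambda(x_k) &= \frac{\partial}{\partial\lambda}\bigl(-\lambda + x_k\log\lambda - \log x_k!\bigr)
  = \frac{x_k}{\lambda} - 1, \\
I(\lambda) &= -E_\lambda\!\left(\frac{\partial s_\lambda}{\partial\lambda}\right)
  = E_\lambda\!\left(\frac{x_k}{\lambda^2}\right) = \frac{1}{\lambda}, \\
s_{\lambda n}(x) &= \sum_{k=1}^{n}\left(\frac{x_k}{\lambda} - 1\right) = \frac{n(m - \lambda)}{\lambda}, \\
\lambda + \frac{s_{\lambda n}(x)}{n I(\lambda)} &= \lambda + (m - \lambda) = m.
\end{aligned}
```

The right side of the equality condition reduces to the sample mean $m$, which does not involve λ. Accordingly $m$ attains the bound: $\operatorname{var}_\lambda m = \lambda/n = 1/[n I(\lambda)]$.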

The maximum likelihood estimator of θ based on $x = (x_1, \ldots, x_n)$, say $\hat\theta_n$, is (under regularity) the solution of the *likelihood equation* $s_{\theta n}(x) = 0$. Under these circumstances, $\sqrt{n I(\theta)}\,(\hat\theta_n - \theta)$ and $s_{\theta n}(x)/\sqrt{n I(\theta)}$ are both asymptotically normal with zero mean and variance unity. Further, the difference between these two quantities converges to zero in probability as $n$ increases.

Thus the maximum likelihood estimator is *asymptotically efficient,* in the sense that its asymptotic variance is as low as possible, for it satisfies the asymptotic information inequality. In general there exist other (sequences of) estimators also satisfying the information inequality; these are called *regular best asymptotically normal* (RBAN) estimators. The RBAN estimators are those that are indistinguishable from the maximum likelihood estimator in terms of asymptotic distribution, as it is traditionally construed. Often some RBAN estimator distinct from the maximum likelihood estimator is easier to compute and work with.

The word “regular,” used above, refers in part to regularity conditions on the estimators themselves, considered as functions of the sample. Without that restriction, somewhat strange *superefficient* estimators can be constructed.

The concept of asymptotic treatment has been extended recently in other directions than those summarized above, in particular by the work of R. R. Bahadur and C. R. Rao.

### A more general approach to estimation

So far the discussion has been based largely on comparing estimators through their mean squared errors. The mean absolute error could, of course, have been used. More generally, suppose that W(θ,d) is the loss incurred when the numerical estimate *d* is used as if it were the value g(θ). Here θ is the unknown parameter of the probability distribution underlying the outcome *x* of an experiment, and g(θ) is to be estimated. If *f* is an estimator, *x* has been observed, and *f(x)* is used as if it were the value of g(θ), then the loss incurred is *W[θ,f(x)].* The mean loss, E_{θ}W(θ,f), denoted by r(θ,f), a function of both θ and *f,* is of interest. The function *r* is called the *risk function.* Now such terms as *better, admissible, minimax, Bayes,* and so forth could be defined using the risk function, *r,* rather than mean squared error. For example, *f* is better than f* (relative to the loss W) if r(θ,f) ≤ r(θ,f*) for all θ with strict inequality for some θ *[see* DECISION THEORY].

In the earlier discussion, W was taken to be W(θ,d) = [d – g(θ)]^{2} and *r* was therefore mean squared error.
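The risk function can be made concrete with a short sketch. It assumes, purely for illustration, that X is the number of successes in *n* independent trials with success probability *p* and that g(p) = p is to be estimated. The estimator `shrunk`, which pulls the sample proportion toward 1/2, is a hypothetical competitor introduced here for the comparison; the function names are likewise assumptions.

```python
from math import comb, sqrt


def risk(estimator, loss, n, p):
    """Exact risk r(p, f) = E_p W(p, f(X)) for X ~ Binomial(n, p)."""
    return sum(
        comb(n, x) * p**x * (1 - p) ** (n - x) * loss(p, estimator(x, n))
        for x in range(n + 1)
    )


def squared_error(theta, d):
    return (d - theta) ** 2


def absolute_error(theta, d):
    return abs(d - theta)


def proportion(x, n):
    """The usual estimator: the sample proportion."""
    return x / n


def shrunk(x, n):
    """Hypothetical competitor that shrinks the proportion toward 1/2."""
    return (x + sqrt(n) / 2) / (n + sqrt(n))


if __name__ == "__main__":
    for p in (0.05, 0.5):
        for loss in (squared_error, absolute_error):
            print(p, loss.__name__,
                  risk(proportion, loss, 25, p), risk(shrunk, loss, 25, p))
```

Under squared error with n = 25, the sample proportion has the smaller risk at p = .05 and the shrunken estimator has the smaller risk at p = .5, so neither estimator is better than the other in the sense defined above.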

In the more general multiparameter context mentioned earlier, **θ** is a vector of more than one ordinary (scalar) parameter, and so may be **g**(**θ**), the quantity to be estimated. For example, in example 3, **θ** = (μ, σ^{2}), **g**(**θ**) could be **θ**, and W(**θ**,**d**) could be (d_{1} – μ)^{2} + (d_{2} – σ^{2})^{2}, where **d** = (d_{1}, d_{2}) is an ordered pair of real numbers. Or consider the following example in which an infinite number of quantities are simultaneously estimated.

*Example 5.* Let *x_{1}, x_{2}, …, x_{n}* be observations on *n* independent random variables each having the same distribution function F, where F may be any distribution function on the real line. The problem is to estimate the whole function F, that is, to estimate F(a) for each real number a. Here **θ** = F, **g**(**θ**) = F, **d** may be any distribution function, and W(**θ**,**d**) may be given, for example, by sup |d(a) – F(a)|, where the supremum (least upper bound) is taken over all real a. A quite satisfactory estimator, the sample distribution function, exists here. For large *n,* its risk function is near 0. For *x* = (x_{1}, x_{2}, …, x_{n}), the value of the sample distribution function is that distribution that places probability 1/*n* on each of x_{1}, x_{2}, …, x_{n} if these values are distinct, with the obvious differential weighting otherwise.
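The sample distribution function and the supremum loss of example 5 can be sketched as follows. The code assumes that the comparison distribution function F is continuous, in which case the supremum is reached at, or just before, one of the jump points of the sample distribution function. The function names are illustrative assumptions.

```python
from bisect import bisect_right


def sample_distribution_function(xs):
    """Return F_n, where F_n(a) is the fraction of observations <= a.

    Each observation carries weight 1/n, so tied values accumulate weight.
    """
    ordered = sorted(xs)
    n = len(ordered)

    def F_n(a):
        return bisect_right(ordered, a) / n

    return F_n


def sup_loss(xs, F):
    """sup over real a of |F_n(a) - F(a)|, assuming F is continuous.

    At each ordered observation, F is compared with the value of F_n
    just before the jump (i/n) and just after it ((i + 1)/n).
    """
    ordered = sorted(xs)
    n = len(ordered)
    return max(
        max(abs((i + 1) / n - F(x)), abs(i / n - F(x)))
        for i, x in enumerate(ordered)
    )


if __name__ == "__main__":
    def uniform(a):
        # distribution function of the uniform distribution on (0, 1)
        return min(max(a, 0.0), 1.0)

    F_n = sample_distribution_function([0.1, 0.4, 0.4, 0.9])
    print(F_n(0.4))                               # weight 3/4 at or below 0.4
    print(sup_loss([0.25, 0.5, 0.75], uniform))   # loss of the estimate
```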

One difficulty with the more general approach to estimation outlined here is that the loss function W is often hard to define realistically, that is, in such a way that W(**θ**,**d**) approximates the actual loss incurred when **d** is used as if it were the value of **g**(**θ**). Fortunately, an estimator that is good relative to one loss function, say squared error, is often good relative to a wide class of loss functions.

Perhaps the key concept in estimation theory is *better.* Once it has been decided what “this estimator is better than that one” should mean, a large part of the theory follows naturally. Many definitions of *better* are possible. Several others besides the one mentioned here appear in the literature, but none has been so deeply investigated.

### History

The theory of point estimation has a long history and a huge literature. The Bernoullis, Moivre, Bayes, Laplace, and Gauss contributed many important ideas and techniques to the subject during the eighteenth century and the early part of the nineteenth century. Karl Pearson stressed the method of moments and the importance of computing approximate variances of estimators. During the early twentieth century, no one pursued the subject with more vigor than R. A. Fisher. His contributions include the development of the maximum likelihood principle and the introduction of the important notion of sufficiency. Neyman’s systematic study of interval estimation appeared in 1937. Although the possibility of a loss function approach to statistical problems had been mentioned by Neyman and E. S. Pearson in 1933, its extensive development was not initiated until the work of Abraham Wald in 1939 *[see the biographies of* BAYES; BERNOULLI FAMILY; FISHER, R. A.; GAUSS; LAPLACE; MOIVRE; PEARSON; WALD].

New and nonstandard estimation problems requiring new and nonstandard techniques of solution will no doubt continue to arise. Remarkable solutions to two such problems have recently been proposed under the general name of *stochastic approximation [see* SEQUENTIAL ANALYSIS].

Ideally, scientific constructs should possess not only great explanatory power but simplicity as well. The search for both will, no doubt, encourage more and more mathematical model building in the social sciences. Moreover, it is quite likely that these models will have to become more and more probabilistic if they are to achieve these aims. As a consequence, the statistical problems involved, checking the goodness of fit of the model, estimating the unknown parameters, and so forth, will have to be handled with ever-increasing care and knowledge.

D. L. Burkholder

*[See also* STATISTICS, DESCRIPTIVE.]

## BIBLIOGRAPHY

*Many elementary textbooks on statistical theory discuss the rudiments of point estimation, for example,* Hodges & Lehmann 1964. *Fuller treatments will be found in* Cramér 1945, Wilks 1962, *and* Kendall & Stuart 1946. *Large sample theory is treated at length in* LeCam 1953. *Further discussion of estimation from the loss function point of view will be found in Chapter 5 of* Wald 1950. Lehmann 1959 *treats sufficiency and invariance in some detail. Chapter 15 of* Savage 1954 *contains many illuminating comments on the problem of choosing a good estimator.*

CRAMÉR, HARALD (1945) 1951 *Mathematical Methods of Statistics.* Princeton Mathematical Series, No. 9. Princeton Univ. Press.

FISHER, R. A. (1922) 1950 On the Mathematical Foundations of Theoretical Statistics. Pages 10.308a-10.368 in R. A. Fisher, *Contributions to Mathematical Statistics.* New York: Wiley. → First published in Volume 222 of the *Philosophical Transactions,* Series A, of the Royal Society of London.

FISHER, R. A. (1925) 1950 Theory of Statistical Estimation. Pages 11.699a-11.725 in R. A. Fisher, *Contributions to Mathematical Statistics.* New York: Wiley. → First published in Volume 22 of the *Proceedings* of the Cambridge Philosophical Society.

HODGES, JOSEPH L. JR.; and LEHMANN, E. L. 1964 *Basic Concepts of Probability and Statistics.* San Francisco: Holden-Day.

KENDALL, MAURICE G.; and STUART, ALAN (1946) 1961 *The Advanced Theory of Statistics.* Volume 2: Inference and Relationship. New York: Hafner; London: Griffin. → Kendall was the sole author of the 1946 edition.

KIEFER, J.; and WOLFOWITZ, J. 1952 Stochastic Estimation of the Maximum of a Regression Function. *Annals of Mathematical Statistics* 23:462-466.

LECAM, LUCIEN 1953 On Some Asymptotic Properties of Maximum Likelihood Estimates and Related Bayes’ Estimates. California, University of, *Publications in Statistics* 1:277-329.

LEHMANN, ERICH L. 1959 *Testing Statistical Hypotheses.* New York: Wiley.

NEYMAN, JERZY 1937 Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Royal Society of London, *Philosophical Transactions* Series A 236:333-380.

PITMAN, E. J. G. 1939 The Estimation of the Location and Scale Parameters of a Continuous Population of Any Given Form. *Biometrika* 30:391-421.

ROBBINS, HERBERT; and MONRO, SUTTON 1951 A Stochastic Approximation Method. *Annals of Mathematical Statistics* 22:400-407.

SAVAGE, LEONARD J. 1954 *The Foundations of Statistics.* New York: Wiley.

WALD, ABRAHAM 1939 Contributions to the Theory of Statistical Estimation and Testing Hypotheses. *Annals of Mathematical Statistics* 10:299-326.

WALD, ABRAHAM (1950) 1964 *Statistical Decision Functions.* New York: Wiley.

WILKS, SAMUEL S. 1962 *Mathematical Statistics.* New York: Wiley.

## II CONFIDENCE INTERVALS AND REGIONS

Confidence interval procedures—more generally, *confidence region procedures*—form an important class of statistical methods. In these methods, the outcome of the statistical analysis is a subset of the set of possible values of unknown parameters. Confidence procedures are related to other kinds of standard statistical methods, in particular to point estimation and to hypothesis testing. In this article such relationships will be described and contrasts will be drawn between confidence methods and superficially similar methods of other kinds, for example, Bayesian estimation intervals *[see* BAYESIAN INFERENCE; ESTIMATION, *article on* POINT ESTIMATION; HYPOTHESIS TESTING].

As an example of this sort of procedure, suppose the proportion of voters favoring a candidate is to be estimated on the basis of a sample. The simplest possible answer is to give a single figure, say 47 per cent; this is the type of procedure called *point estimation.* Since this estimate of the proportion is derived from a sample, it will usually be different from the true proportion. How far off the true value is this estimate likely to be? This question can be answered by supplementing the estimate with error bounds, say ± .5 per cent. Thus, one might say that the true proportion lies between 46.5 per cent and 47.5 per cent. This statement might be false. One task of the statistician is to develop a procedure for the computation of such intervals, a procedure that guarantees that the statements are true in, say, 99 per cent of all applications of this procedure. Such procedures are called confidence procedures.

**Estimation by confidence intervals.** It is perhaps easiest to begin with a simple example from normal sampling theory.

*Example 1.* Let X_{1}, …, X_{n} be a random sample of size *n* from a normal distribution with unknown mean μ and known variance σ^{2}. Then the sample mean X̄ = Σ_{i}X_{i}/*n* is a reasonable point estimator of μ. Hence ±2.58σ/√n are reasonable error bounds in the following sense: the estimator X̄ lies between μ – 2.58σ/√n and μ + 2.58σ/√n with probability .99. In other words, the interval (μ – 2.58σ/√n, μ + 2.58σ/√n) contains the estimator X̄ with probability .99; that is, whatever the value of μ really is,

$$P_{\mu}\{\mu - 2.58\,\sigma/\sqrt{n} < \bar{X} < \mu + 2.58\,\sigma/\sqrt{n}\} = .99.$$

This probability statement follows directly from the facts that √n(X̄ – μ)/σ has a unit normal distribution and that a unit normal random variable lies in the interval (-2.58, +2.58) with probability .99.

This statement can be given a slightly different but equivalent form: the interval (X̄ – 2.58σ/√n, X̄ + 2.58σ/√n) covers μ with probability .99, or, whatever μ really is,

$$P_{\mu}\{\bar{X} - 2.58\,\sigma/\sqrt{n} < \mu < \bar{X} + 2.58\,\sigma/\sqrt{n}\} = .99.$$

The interval (X̄ – 2.58σ/√n, X̄ + 2.58σ/√n) is called a confidence interval for μ with confidence coefficient (or confidence level) .99. The confidence interval is a random interval containing the true value with probability .99. Note that it would be incorrect to say, after computing the confidence interval for a particular sample, that μ will fall in this interval with probability .99; for μ is an unknown constant rather than a random variable. It is the confidence interval itself that is subject to random variations.
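The statement that the interval, not μ, is random can be illustrated by simulation. The sketch below holds μ fixed, draws many normal samples with known σ, forms the interval with half-width 2.58σ/√n for each, and counts how often the fixed μ is covered. The particular values of μ, σ, *n*, the number of repetitions, and the seed are arbitrary assumptions made for the illustration.

```python
import random
from math import sqrt


def interval_example_1(sample, sigma, z=2.58):
    """Interval (mean - z*sigma/sqrt(n), mean + z*sigma/sqrt(n)), sigma known."""
    n = len(sample)
    mean = sum(sample) / n
    half_width = z * sigma / sqrt(n)
    return mean - half_width, mean + half_width


def coverage(mu, sigma, n, repetitions, seed=0):
    """Fraction of simulated samples whose interval contains the fixed mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = interval_example_1(sample, sigma)
        hits += lower < mu < upper
    return hits / repetitions


if __name__ == "__main__":
    # The printed proportion should be close to .99 whatever mu is chosen.
    print(coverage(mu=5.0, sigma=2.0, n=25, repetitions=20000))
```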

Generally speaking, there is an unknown parameter, say *θ,* to be estimated and an estimator f(X) depending on the sample X = (X_{1}, …, X_{n}). In example 1, *θ* is called *μ* and f(X) is X̄. As this estimator *f* is based on a random sample, it is itself subject to random variations. If *f* is a good estimator, its probability distribution will be concentrated closely around the true value, *θ.* From this probability distribution of *f,* one can often derive an interval, with lower bound c̠(θ) and upper bound c̄(θ), containing the estimator *f(X)* with high probability β (for example, *β* = .99). That is, whatever the actual value of *θ,*

$$P_{\theta}\{\underline{c}(\theta) < f(X) < \bar{c}(\theta)\} = \beta.$$

Often these inequalities can be inverted, that is, two functions θ̠(X) and θ̄(X) can be specified such that θ̠(X) < θ < θ̄(X) if and only if c̠(θ) < f(X) < c̄(θ). Then, whatever θ really is,

$$P_{\theta}\{\underline{\theta}(X) < \theta < \bar{\theta}(X)\} = \beta.$$

This means that the interval (θ̠(X), θ̄(X)) contains the true value θ with probability β. Quantities like θ̠(X) and θ̄(X) are often called confidence limits. In example 1, the bounds c̠(θ), c̄(θ) and θ̠(X), θ̄(X) are given by μ ∓ 2.58σ/√n and X̄ ∓ 2.58σ/√n, respectively.

It is also possible to develop the concept of a confidence region procedure in general, without reference to point estimation. Denote by P_{θ} the assumed probability distribution depending on a parameter θ (which may actually be a vector of several univariate, that is, real valued, parameters). Let Θ be the set of all possible parameter values θ. By a confidence procedure is meant a rule for assigning to each sample X a subset of the parameter space, say Θ(X). If Θ(X) contains the true value θ with probability β, regardless of the true value of θ (that is, if for all θ ∈ Θ, P_{θ}{θ ∈ Θ(X)} = β), then Θ(X) is called a confidence region for θ. The probability β that the true parameter value is covered by Θ(X) is called the confidence coefficient.

In example 1, the interval (X̄ – 2.58σ/√n, X̄ + 2.58σ/√n) is the confidence region for the sample X = (X_{1}, …, X_{n}) with confidence coefficient .99.

The probability specified by the confidence coefficient has the following frequency interpretation: If a large number of confidence regions are computed on different, independent occasions, each with a confidence coefficient β, then, in the long run, a proportion β of these confidence regions will contain the true parameter value. There is some danger of misinterpretation. This occurs if θ itself is erroneously considered as a random variable and the confidence statement is given the following form: the probability is β that θ falls into the computed confidence set Θ(X). It should be clear that Θ(X) is the random quantity and not θ.

In the simplest applications, θ is a real parameter and the confidence region Θ(X) is either a proper interval (θ̠(X), θ̄(X)) or a semi-infinite interval: (-∞, θ̄(X)) or (θ̠(X), +∞). If for all θ, P_{θ}{θ < θ̄(X)} = β, then θ̄(X) is called an upper confidence bound for θ with confidence coefficient β. Similarly, θ̠(X) is a lower confidence bound.

Let θ̠(X) and θ̄(X) be lower and upper confidence bounds with confidence coefficients β_{1} and β_{2}, and suppose that θ̠(X) < θ̄(X) for all samples X. Then the interval (θ̠(X), θ̄(X)) is a confidence interval with confidence coefficient β_{1} + β_{2} – 1. If β_{1} = β_{2}, that is, if P_{θ}{θ̄(X) < θ} = P_{θ}{θ < θ̠(X)}, the confidence interval (θ̠(X), θ̄(X)) is called central.

*Example 2.* As in example 1, let X = (X_{1}, …, X_{n}) be a sample of *n* independent normally distributed random variables with unknown mean μ and known variance σ^{2}. Then X̄ + 2.33σ/√n is an upper confidence bound for μ at confidence level .99. Thus (-∞, X̄ + 2.33σ/√n) is a semi-infinite confidence interval for μ with confidence coefficient .99, as is (X̄ – 2.33σ/√n, +∞). Hence (X̄ – 2.33σ/√n, X̄ + 2.33σ/√n) is a central confidence interval for μ with confidence coefficient .98 = .99 + .99 – 1. This central confidence interval differs from that in example 1 in that the latter has confidence coefficient .99 and is correspondingly wider.
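The constants used in examples 1 and 2 are quantiles of the unit normal distribution, and the rule β_{1} + β_{2} – 1 for combining two one-sided bounds is simple arithmetic. A minimal sketch using Python's standard library follows; the function name `central_interval` is an illustrative assumption.

```python
from statistics import NormalDist


def central_interval(mean, sigma, n, beta_one_sided):
    """Combine lower and upper bounds, each with coefficient beta_one_sided.

    Returns (lower, upper, coefficient), where the coefficient of the
    resulting central interval is beta_1 + beta_2 - 1 with beta_1 = beta_2.
    """
    z = NormalDist().inv_cdf(beta_one_sided)
    half_width = z * sigma / n ** 0.5
    return mean - half_width, mean + half_width, 2 * beta_one_sided - 1


if __name__ == "__main__":
    print(round(NormalDist().inv_cdf(0.99), 2))    # one-sided .99 point
    print(round(NormalDist().inv_cdf(0.995), 2))   # two-sided .99 point
    print(central_interval(0.0, 1.0, 100, 0.99))
```

The first two printed values are the constants 2.33 and 2.58 of the examples, which is why the central interval with coefficient .99 is wider than the one with coefficient .98.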

*Example 3.* Let X = (X_{1}, …, X_{n}) be a random sample from a normal distribution with known mean μ = 0 and unknown variance σ^{2}. In this case S^{2}_{1} = Σ_{i}X_{i}^{2}/*n* is a reasonable estimator of σ^{2}. (A subscript is used in S^{2}_{1} because S^{2} will later denote a more common, related, but different quantity.) Suppose *n* = 10. Then the central confidence interval for σ^{2} with confidence coefficient .98 is given by (10S^{2}_{1}/23.21, 10S^{2}_{1}/2.56). The constants 23.21 and 2.56 are readily obtained from a table of quantiles for the chi-square distribution, for nS^{2}_{1}/σ^{2} has a chi-square distribution with 10 degrees of freedom. This example shows that the endpoints of a confidence interval are generally not symmetric around the usual point estimator.
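The asymmetry noted in example 3, and the coverage of the interval, can be checked with a short sketch. It takes the two chi-square points 23.21 and 2.56 from the text as given rather than computing them, so the defaults apply only to *n* = 10; the value of σ, the number of repetitions, and the seed are arbitrary assumptions.

```python
import random


def variance_interval(sample, lower_point=2.56, upper_point=23.21):
    """Central .98 interval for sigma^2 when the mean is known to be 0.

    The default chi-square points are the tabled values for 10 degrees
    of freedom, so the defaults are valid only for samples of size 10.
    """
    total = sum(x * x for x in sample)  # equals n * S_1^2
    return total / upper_point, total / lower_point


def coverage(sigma, repetitions, seed=0):
    """Fraction of simulated samples of size 10 whose interval covers sigma^2."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(0.0, sigma) for _ in range(10)]
        lower, upper = variance_interval(sample)
        hits += lower < sigma ** 2 < upper
    return hits / repetitions


if __name__ == "__main__":
    # With S_1^2 = 1 the interval extends much farther above 1 than below it.
    print(variance_interval([1.0] * 10))
    print(coverage(sigma=3.0, repetitions=20000))
```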

**Relation to point estimation.** The computation of confidence intervals is often referred to as *interval estimation*, in contrast to *point estimation*. As outlined above, in many practical cases, interval estimation renders information about the accuracy of point estimates. The general definition of confidence intervals is, however, independent of the problem of point estimation.

In many cases, a particular point estimator is related to the set of central confidence intervals. One forms the estimator for a given sample by thinking of the progressively narrowing intervals as the confidence level decreases toward zero. Except in pathological cases, the interval will squeeze down to a point, whose numerical value furnishes the estimator. Such an estimator is, for continuous distributions, median unbiased; that is, it is equally likely to be above and below the parameter under estimation.

**Relation to hypothesis testing.** The theory of confidence intervals is closely related in a formal way to the theory of hypothesis testing *[see* HYPOTHESIS TESTING].

*Example 4*. In example 1, the confidence interval for μ with confidence coefficient .99 was given by (X̄ – 2.58σ/√n, X̄ + 2.58σ/√n). To test the hypothesis μ = μ_{0} against the alternative μ ≠ μ_{0} at significance level .01, accept the hypothesis if

$$\mu_{0} - 2.58\,\sigma/\sqrt{n} < \bar{X} < \mu_{0} + 2.58\,\sigma/\sqrt{n}; \tag{3}$$

reject it otherwise. This is the customary two-sided test.

Observe that, given X̄, the confidence interval consists of all those values μ_{0} for which the hypothesis μ = μ_{0} would be accepted. In other words, the confidence interval consists of all μ_{0} whose acceptance region contains the given X̄.

On the other hand, given the confidence interval with confidence coefficient .99, it is easy to perform a test of a hypothesis μ = μ_{0}: Accept the hypothesis if the hypothetical value μ_{0} belongs to the confidence interval; otherwise reject the hypothesis. Proceeding in this way, the pattern is precisely that of testing the hypothesis μ = μ_{0}, since μ_{0} belongs to the confidence interval if and only if (3) is fulfilled, that is, if the hypothesis μ = μ_{0} would be accepted according to the test procedure.
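The equivalence between the test of example 4 and the confidence interval of example 1 can be verified numerically. The sketch below evaluates both rules on a grid of hypothetical values μ_{0} for one observed sample mean; the numbers used (sample mean 10.3, σ = 2, *n* = 16) are arbitrary assumptions chosen so that no grid point falls exactly on an endpoint.

```python
from math import sqrt


def accepts(mu0, mean, sigma, n, z=2.58):
    """Two-sided test of mu = mu0 at level .01: accept when the sample
    mean lies within z*sigma/sqrt(n) of the hypothetical value mu0."""
    half_width = z * sigma / sqrt(n)
    return mu0 - half_width < mean < mu0 + half_width


def confidence_interval(mean, sigma, n, z=2.58):
    """Confidence interval for mu with confidence coefficient .99."""
    half_width = z * sigma / sqrt(n)
    return mean - half_width, mean + half_width


def in_interval(mu0, mean, sigma, n):
    """True when the hypothetical value mu0 belongs to the interval."""
    lower, upper = confidence_interval(mean, sigma, n)
    return lower < mu0 < upper


if __name__ == "__main__":
    mean, sigma, n = 10.3, 2.0, 16
    grid = [8.0 + 0.25 * k for k in range(21)]
    print(all(accepts(m, mean, sigma, n) == in_interval(m, mean, sigma, n)
              for m in grid))
```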

This duality is illustrated generally in Figure 1. The figure is directly meaningful when there is a single (real) parameter θ and when the sample can be reduced to a single (real) random variable. The latter reduction can frequently be accomplished via a sufficient statistic [*see* SUFFICIENCY]. When the problem is more complex, the figure is still of schematic use.

The figure shows that for each value of θ there is an acceptance region, *A*(θ), illustrated as an interval. The two curves determine the lower and upper bounds of this interval respectively. The set of all those θ for which *A*(θ) contains a given X, Θ(X), is the interval on the vertical through X between the two curves.

If the graphic representation is considered in a horizontal way (in terms of the X axis), the lower curve represents the lower confidence bound θ̠(X) as a function of X, and similarly the upper curve represents the upper confidence bound θ̄(X). If it is considered from the left (in terms of the θ axis), the functions θ̠(X) and θ̄(X) depending on X are inverted into the functions c̄(θ) and c̠(θ) respectively, depending on θ. (For this reason the letters are turned.)

The general duality between the testing of simple hypotheses and confidence procedures may be described as follows: Let Θ be the set of unknown parameter values and assume that to each sample X a confidence set Θ(X) is assigned, such that P_{θ}{θ ∈ Θ(X)} = β for all θ ∈ Θ. On the basis of such a confidence procedure, a test for any hypothesis θ = θ_{0} can easily be defined as follows: Let A(θ) be the set of all X such that θ ∈ Θ(X). Then the events X ∈ A(θ) and θ ∈ Θ(X) are equivalent, whence P_{θ}{X ∈ A(θ)} = P_{θ}{θ ∈ Θ(X)} = β. Therefore, if A(θ_{0}) is taken as the acceptance region for testing the hypothesis θ = θ_{0}, a test with acceptance probability β (or significance level α = 1 - β) is obtained. On the other hand, given a family of acceptance regions (that is, for each hypothesis θ ∈ Θ an acceptance region A(θ) that contains the sample X with probability β when θ is the case), it is possible to define a confidence procedure by assigning to the sample X the set Θ(X) of all θ for which A(θ) contains X (that is, the set of all parameter values θ for which the hypothesis θ would be accepted on the evidence X). Then, again, θ ∈ Θ(X) if and only if X ∈ A(θ), whence P_{θ}{θ ∈ Θ(X)} = P_{θ}{X ∈ A(θ)} = β. These remarks refer only to the case of simple hypotheses. In practice the more important case of composite hypotheses arises if several real parameters are present and the hypothesis consists in specifying the value of one of these. (This case is dealt with in “Nuisance parameters,” below.)

Under exceptional circumstances the confidence set Θ(X) may show an unpleasant property: For some X, Θ(X) might be empty, or it might be identical with the whole parameter space, Θ. Those cases are usually of little practical relevance.

Thus a confidence statement contains much more information than the conclusion of a hypothesis test: The latter tells only whether a specified hypothesis is compatible with the evidence or not, whereas the confidence statement gives compatibility information about *all* relevant hypotheses.

**Optimality.** The duality between confidence procedures and families of tests implies a natural correspondence between the optimum properties of confidence procedures and optimum properties of tests.

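A confidence procedure with confidence region Θ′(X) is called *most accurate* if Θ′(X) covers any value different from the true value with lower probability than any other confidence region Θ(X) with the same confidence coefficient: when θ_{0} denotes the true value,

$$P_{\theta_{0}}\{\theta \in \Theta'(X)\} \le P_{\theta_{0}}\{\theta \in \Theta(X)\} \quad \text{for all } \theta \ne \theta_{0}.$$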

Another expression occasionally used instead of “most accurate” is “most selective.” The term “shortest,” originally introduced by Neyman, is now unusual because of the danger of confusing shortest confidence intervals and confidence intervals of minimum length.

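The family of tests corresponding to most accurate confidence procedures consists of uniformly most powerful tests: Let A′(θ) and A(θ) be the acceptance regions corresponding to the confidence regions Θ′(X) and Θ(X) respectively; then, when θ_{0} denotes the true value,

$$P_{\theta_{0}}\{X \in A'(\theta)\} \le P_{\theta_{0}}\{X \in A(\theta)\} \quad \text{for all } \theta \ne \theta_{0}.$$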

Therefore, by using the acceptance region A′ the false hypothesis θ is accepted with lower probability than by using A.

Uniformly most powerful tests exist only in exceptional cases. Therefore, the same holds true for most accurate confidence procedures. If, however, the class of tests is restricted (to unbiased tests or invariant tests, for example), the restricted class often contains a uniformly most powerful test within that class. Similarly, tests most powerful against a restricted class of alternatives can often be obtained. In the case of a real parameter a test for the hypothesis θ_{0} that is most powerful against all θ > θ_{0} may typically be found. All these restricted optimum properties of tests lead to corresponding restricted optimum properties of confidence procedures.

A confidence procedure is called unbiased if the confidence region covers no parameter value different from the true value with probability higher than its probability of covering the true value. The corresponding property of tests is also called unbiasedness. Therefore, families of uniformly most powerful unbiased tests lead to most accurate unbiased confidence procedures, that is, confidence procedures that are most accurate among the unbiased confidence procedures: No other unbiased confidence procedures exist leading to confidence regions that contain any value different from the true value with lower probability. The confidence interval given in example 1 is unbiased and most accurate among all unbiased confidence procedures with confidence coefficient .99. On the other hand, the confidence interval given in example 3 is not unbiased.

The optimum properties discussed above are related to concepts of optimality derived from the duality to the testing of hypotheses. A completely different concept is that of minimum length. For instance, the confidence interval given in example 1 is of minimum length. In general the length of the confidence interval is itself a random variable, as in example 3. It is therefore natural to consider a confidence procedure as optimal if the expected length of the confidence intervals is minimal. This concept is appropriate for two-sided confidence intervals. For one-sided confidence intervals the concept is not applicable immediately, as in this case the length is infinite. However, the expected value of the boundary value of the one-sided confidence interval can be substituted for expected length.

In general, confidence intervals with minimum expected length are different from, for example, most accurate unbiased confidence intervals (where such intervals exist). Under special circumstances, however (including the assumption that the distributions of the family have the same shape and differ only in location), invariant confidence procedures are of minimum expected length. The confidence procedure given in example 3 is not of minimum expected length.

Two objections that may be raised against the use of expected length as a criterion are (1) when a confidence interval fails to cover the true parameter value, a short interval is undesirable in that it pretends great accuracy when there is none, and (2) expected length depends strongly on the mode of parameterization, for example, there is no sharp relation between the expected length of a confidence interval for θ and that of the induced interval for θ^{3}.

**Discrete distributions.** In the general consideration above it was assumed that there exists a confidence procedure with confidence coefficient β in the sense that, for all θ in Θ, the probability of covering the parameter θ is exactly β when θ is the true parameter. This means that for each θ there exists an acceptance region A(θ) such that P_{θ}{A(θ)} = β. This is, however, in general true only for distributions of the continuous type, not for discrete distributions such as the binomial and Poisson distributions [*see* DISTRIBUTIONS, STATISTICAL]. Thus acceptance regions A(θ) of probability approximately β must be chosen, with the degree of approximation depending on *θ*. In practice the acceptance region is selected such that P_{θ}{A(θ)} approximates β as closely as possible, either with or without the restriction P_{θ}{A(θ)} ≥ β. These acceptance regions A(θ) define the confidence regions Θ(X) with (approximate) confidence coefficient β. When the restriction P_{θ}{A(θ)} ≥ β is made, the term “bounded confidence region” is often used, and the region is said to have bounded confidence level β.

*Example 5*. Let X be the number of successes in *n* independent dichotomous trials with constant probability *p* of success. Then X is binomially distributed, that is,

$$P_{p}\{X = k\} = \binom{n}{k} p^{k} (1 - p)^{n-k}, \quad k = 0, 1, \ldots, n.$$

Choose the confidence coefficient β = .99. Choose for each *p*, 0 < *p* < 1, the smallest integer c(p) such that

$$P_{p}\{X \le c(p)\} \ge .99.$$

Inverting the bound *c(p)* one obtains one-sided confidence intervals of confidence coefficient .99 for *p*.

As an illustration, let *n* = 20 and *p* = .3. Since P{X ≤ 11} = .995 and P{X ≤ 10} = .983, the smallest integer such that P{X ≤ c(p)} ≥ .99 is *c(p)* = 11. Troublesome computations of *c(p)* can be avoided by use of one of the tables or figures provided for this purpose. For references see Kendall and Stuart ([1943–1946] 1961, p. 118).
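Where tables are not at hand, the bound *c(p)* can be computed directly from the binomial distribution. The following sketch reproduces the illustration with *n* = 20 and *p* = .3; the function names are illustrative assumptions.

```python
from math import comb


def binomial_cdf(c, n, p):
    """P{X <= c} for X binomially distributed with parameters n and p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def smallest_c(n, p, beta=0.99):
    """Smallest integer c such that P{X <= c} >= beta."""
    for c in range(n + 1):
        if binomial_cdf(c, n, p) >= beta:
            return c
    return n


if __name__ == "__main__":
    print(round(binomial_cdf(10, 20, 0.3), 3))
    print(round(binomial_cdf(11, 20, 0.3), 3))
    print(smallest_c(20, 0.3))
```

Because the distribution is discrete, the acceptance probability at *c(p)* = 11 is about .995 rather than exactly .99, which is the situation described above as a bounded confidence level.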

**Nuisance parameters.** In many practical problems, more than one parameter is involved. Often the interest is concentrated on one of these parameters, say θ, while the others are regarded as nuisance parameters. The aim is to make a confidence statement about θ that is true with high probability regardless of the values of the nuisance parameters. The corresponding test problem is that of testing a composite hypothesis that specifies the value of θ without making any assertion about the nuisance parameters. The test is required to have significance level less than or equal to a prescribed α regardless of the nuisance parameters. The corresponding confidence procedure will yield confidence intervals that cover the true value at least with probability 1 – α regardless of the nuisance parameters, that is, confidence intervals with bounded confidence level 1 – α. A special role is played by the so-called similar tests, having exactly significance level α for all values of the nuisance parameters. They lead to confidence intervals covering the true value with probability exactly 1 – α regardless of the nuisance parameters.

*Example 6*. Let X_{1}, …, X_{n} be a random sample from a normal distribution with unknown mean μ and unknown variance σ^{2}. The variance σ^{2} is to be considered a nuisance parameter. Let X̄ = Σ_{i}X_{i}/*n* and S^{2} = Σ_{i}(X_{i} – X̄)^{2}/(*n* – 1). For *n* = 10 a (similar) confidence interval for μ with confidence coefficient .99 is given by X̄ – 3.25 S/√10 < μ < X̄ + 3.25 S/√10. For general *n*, the confidence interval with confidence coefficient .99 is given by X̄ – *t*_{.005,n-1} S/√n < μ < X̄ + *t*_{.005,n-1} S/√n, where *t*_{.005,n-1} is the upper .005 point of the tabled *t*-distribution with *n* – 1 degrees of freedom, for example *t*_{.005,9} = 3.25. Hence the above confidence procedure corresponds to the usual t-test. Because *t*_{.005,n-1} is close to 2.58 for large *n*, the confidence interval given here corresponds for large *n* to the confidence interval given in example 1, with S in place of σ. The confidence procedure given here is most accurate among unbiased confidence procedures.
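That the interval of example 6 covers μ with the same probability whatever the nuisance parameter σ may be can be checked by simulation. The sketch below assumes *n* = 10 and supplies the tabled upper .005 point of the *t*-distribution with 9 degrees of freedom (about 3.25) by hand, since Python's standard library has no *t* quantile function; the values of μ and σ, the number of repetitions, and the seed are arbitrary assumptions.

```python
import random
from math import sqrt


def t_interval(sample, t_point):
    """Interval mean -/+ t_point * S / sqrt(n), with S^2 on n - 1 divisor."""
    n = len(sample)
    mean = sum(sample) / n
    s = sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    half_width = t_point * s / sqrt(n)
    return mean - half_width, mean + half_width


def coverage(mu, sigma, n, t_point, repetitions, seed=0):
    """Fraction of simulated samples whose interval contains the fixed mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = t_interval(sample, t_point)
        hits += lower < mu < upper
    return hits / repetitions


if __name__ == "__main__":
    # Coverage should be near .99 for a small and for a large sigma alike.
    for sigma in (1.0, 5.0):
        print(sigma, coverage(0.0, sigma, 10, 3.25, 20000))
```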

*Example 7*. Consider μ in example 6 as the nuisance parameter. Define S^{2} as in example 6 and again take *n* = 10. Then a one-sided confidence interval for σ^{2} of confidence coefficient .99 is given by σ^{2} ≤ 9 S^{2}/2.09. In general, the one-sided confidence interval for σ^{2} with confidence coefficient .99 is given by σ^{2} ≤ (*n* – 1) S^{2}/χ^{2}_{.01,n-1}, where χ^{2}_{.01,n-1} is the lower .01 point of the chi-square distribution with *n* – 1 degrees of freedom. Observe that here the number of degrees of freedom is *n* – 1 while in example 3 it is *n.*

**Confidence coefficient.** The expected length of the confidence interval depends, of course, on the confidence coefficient. If a higher confidence coefficient is chosen, that is, if a statement that is true with higher probability is desired, this statement has to be less precise; the confidence interval has to be wider.

It is difficult to give general rules for the selection of confidence coefficients. Traditional values are .90, .95, and .99 (corresponding to significance levels of .10, .05, and .01, respectively). The considerations to be made in this connection are the same as the considerations for choosing the size of a test [*see* HYPOTHESIS TESTING].

*Nested confidence procedures*. One would expect the wider confidence interval (belonging to the higher confidence level) to enclose the narrower confidence interval (belonging to the lower confidence level). A confidence procedure with this property is called “nested.” All the usual confidence procedures are nested, but this is not a fully general property of confidence procedures.

**Sample size.** Given the confidence coefficient, the expected length of the confidence interval depends, of course, on the sample size. Larger samples contain more information and therefore lead to more precise statements, that is, to narrower confidence intervals.

Given a specific problem, the accuracy that it is reasonable to require can be determined. In order to estimate the number of housewives knowing of the existence of the superactive detergent X, a confidence interval of ±5 per cent will probably be sufficiently accurate. If, on the other hand, the aim is to forecast the outcome of elections and the percentage of voters favoring a specific party was 48 per cent in the last elections, an accuracy of ±5 per cent would be quite insufficient. In this case, a confidence interval of length less than ±1 per cent would probably be required.

Given the accuracy necessary for the problem at hand, the sample size that is necessary to achieve this accuracy can be determined. In general, however, the confidence interval (and therefore the necessary sample size as well) depends on nuisance parameters. Assume that a confidence interval for the unknown mean μ of a normal distribution with unknown variance σ^{2} is needed. Although in example 6 a confidence interval is given for which no information about σ^{2} is needed, such information is needed to compute the expected length of the confidence interval: The length of the confidence interval is 2t_{.005,n-1}S/√n; the expected value for large *n* is therefore nearly equal to 2t_{.005,n-1}σ/√n. Therefore, in order to determine the necessary sample size *n*, some information about σ^{2} is needed. Often everyday experience or information obtained from related studies will be sufficient for this purpose. If no information whatsoever is at hand, a relatively small pilot study will yield a sufficiently accurate estimate for σ^{2}. This idea is treated rigorously in papers on sequential procedures for obtaining confidence intervals of given length (Stein 1945). In the case of the binomial distribution, no prior information at all is needed, for σ^{2} = *p*(1 – *p*) ≤ ¼, whatever *p* might be. Using ¼ instead of σ^{2} can, however, lead to wastefully large samples if *p* is near 0 or 1.
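The binomial case lends itself to a quick calculation. The following sketch is illustrative only: it assumes the normal approximation to the binomial, a two-sided interval of half-width *d*, and a confidence coefficient of .95, none of which is prescribed above, and the function name is hypothetical. It compares the sample size obtained from the bound ¼ with the one obtained when *p* is known to be near 0.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(half_width, confidence=0.95, p=None):
    """Smallest n with z * sqrt(variance / n) <= half_width.

    Assumes the normal approximation to the binomial. If p is None, the
    conservative bound p(1 - p) <= 1/4 is used in place of the variance.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided normal point
    variance = 0.25 if p is None else p * (1 - p)
    return math.ceil(z * z * variance / half_width ** 2)


n_survey = sample_size_for_proportion(0.05)        # +/- 5 per cent, no prior information
n_election = sample_size_for_proportion(0.01)      # +/- 1 per cent, no prior information
n_rare = sample_size_for_proportion(0.01, p=0.05)  # +/- 1 per cent, p believed near .05

print(n_survey, n_election, n_rare)
```

Under these assumptions the ±1 per cent requirement needs roughly 25 times as many observations as the ±5 per cent requirement, and the last line shows how much of that is saved when *p* is known to be small.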

**Robustness–nonparametric procedures.** Any statistical procedure starts from a basic model on the underlying family of distributions. In example 1, for instance, the basic model is that of a number of independent normally distributed random variables. Since it is never certain how closely these basic assumptions are fulfilled in practice, desirable statistical procedures are those that are only slightly influenced if the assumptions are violated. Statistical procedures with this property are called *robust* [*see* HYPOTHESIS TESTING]. Another approach is to abandon, as far as possible, assumptions about the type of distribution, which leads to nonparametric procedures.

As the duality between families of tests and confidence procedures holds true in general, robust or nonparametric tests lead to robust or nonparametric confidence procedures, respectively. [*Examples showing the construction of confidence intervals for the median of a distribution from the sign test and from Wilcoxon’s signed rank test are given in* NONPARAMETRIC STATISTICS.]
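As a rough illustration of how a nonparametric test yields a confidence interval, the sketch below uses the standard order-statistic construction that corresponds to the sign test. It is not necessarily the form given in the article referred to above. For a continuous distribution, the interval from the *k*-th smallest to the *k*-th largest observation covers the median with probability 1 – 2P(B ≤ *k* – 1), where B is binomial with parameters *n* and ½. The data and the function name are hypothetical.

```python
from math import comb


def sign_test_median_interval(sample, k):
    """Interval (k-th smallest, k-th largest) and its coverage for the median.

    Coverage is 1 - 2 * P(B <= k - 1) with B ~ Binomial(n, 1/2); this is
    exact for continuous distributions and requires 1 <= k <= n // 2.
    """
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k <= n // 2:
        raise ValueError("k must satisfy 1 <= k <= n // 2")
    tail = sum(comb(n, j) for j in range(k)) / 2 ** n  # P(B <= k - 1)
    return xs[k - 1], xs[n - k], 1 - 2 * tail


# Hypothetical observations, for illustration only.
data = [4.1, 5.3, 2.8, 6.0, 3.9, 5.1, 4.7, 7.2, 3.3, 4.9]
lower, upper, coverage = sign_test_median_interval(data, k=2)
print(lower, upper, coverage)
```

No assumption about the form of the distribution enters beyond continuity, which is what makes the procedure nonparametric.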

**Relationship to Bayesian inference.** If the parameter is not considered as an unknown constant but as the realization of a random variable with given prior distribution, Bayesian inference can be used to obtain estimating intervals containing the true parameter with prescribed probability [*see* BAYESIAN INFERENCE].

Confidence statements can be made, however, without assuming the existence of a prior distribution, and hence confidence statements are preferred by statisticians who do not like to use “subjective” prior distributions for Bayesian inference. A somewhat different, and perhaps less controversial, application of subjective prior distributions is their use to define so-called subjective accuracy. Subjectively most accurate confidence procedures are defined in analogy to the most accurate ones by averaging the probability of covering the fixed parameter with respect to the subjective prior distribution. It can be shown that a most accurate confidence procedure is subjectively most accurate under any prior distribution with a positive density function (Borges 1962).

**Relation to fiducial inference.** Fiducial inference was introduced by R. A. Fisher (1930). This paper and succeeding publications of Fisher contain a rule for determining the fiducial distribution of the parameter on the basis of the sample X [*see* FIDUCIAL INFERENCE].

As in Bayesian inference, this distribution can be used to compute “fiducial intervals,” giving information about the parameter θ. The fiducial interval is connected with a probability statement, which admits, however, no frequency interpretation (although some advocates of fiducial methods might disagree).

For many elementary problems, fiducial intervals and confidence intervals are identical. But this is not true in general. One of the attractive properties of fiducial inference is that it leads to solutions even in cases where the classical approach failed until now, as in the case of the Behrens-Fisher problem.

Many scholars, however, find it difficult to see a convincing justification for Fisher’s rule of computing fiducial distributions and to find an intuitive interpretation of probability statements connected with fiducial intervals.

A reasonable interpretation of fiducial distributions would be as some sort of posterior distributions for the unknown parameter. It can be shown, however, that fiducial distributions cannot be used as posterior distributions in general; a Bayesian inference, starting from two independent samples and using the fiducial distribution of the first sample as prior distribution to compute a posterior distribution from the second sample, would in general lead to a result different from the fiducial distribution obtained from both samples taken together. For a comparison of fiducial and Bayesian methods, see Richter (1954) and Lindley (1958).

**Prediction intervals, tolerance intervals.** Whereas confidence intervals give information about an unknown parameter, prediction intervals give information about future independent observations. Hence prediction intervals are subsets of the sample space whereas confidence intervals are subsets of the parameter space.

*Example 8*. If X_{1}, … , X_{n} is a random sample from a normal distribution with unknown mean μ and unknown variance σ^{2}, the interval given by

$$\left(\bar{X} - t_{\alpha,n-1}\,S\sqrt{1 + 1/n},\;\; \bar{X} + t_{\alpha,n-1}\,S\sqrt{1 + 1/n}\right)$$

is a prediction interval containing a future independent observation X_{n+1} with probability 1 – 2α, if t_{α,n-1} is the upper α point of the *t*-distribution with *n* – 1 degrees of freedom and X̄ and S^{2} are defined as in example 6. Note that the probability of the event

$$\bar{X} - t_{\alpha,n-1}\,S\sqrt{1 + 1/n} \;<\; X_{n+1} \;<\; \bar{X} + t_{\alpha,n-1}\,S\sqrt{1 + 1/n}$$

is 1 – 2α before the random variables X_{1}, … , X_{n} are observed. For further discussion of this example see Proschan (1953); for discussion of a similar example, see Mood and Graybill (1950, pp. 220–244, 297–299).

The prediction interval computed in example 8 above must not be interpreted in the sense that it covers a proportion 1 – 2α of the population. In a special instance, the interval computed according to this formula might cover more or less than the proportion 1 – 2α. Only on the average will the proportion be 1 – 2α.
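The distinction between average coverage and coverage in a particular instance can be checked by simulation. The sketch below is a rough illustration under stated assumptions: the interval of example 8 is taken to have the usual form X̄ ± t_{α,n-1}S√(1 + 1/*n*), with *n* = 10 and α = .025 (so that the nominal probability is .95), the tabulated value 2.262 is used for the upper .025 point of *t* with 9 degrees of freedom, and the values of μ, σ, and the random seed are arbitrary.

```python
import math
import random
from statistics import NormalDist

T_UPPER_025_DF9 = 2.262  # upper .025 point of t with 9 d.f., from standard tables


def prediction_interval(sample, t_point):
    """Return X-bar -/+ t * S * sqrt(1 + 1/n), with S^2 using divisor n - 1."""
    n = len(sample)
    centre = sum(sample) / n
    s = math.sqrt(sum((x - centre) ** 2 for x in sample) / (n - 1))
    half_width = t_point * s * math.sqrt(1 + 1 / n)
    return centre - half_width, centre + half_width


def simulate(trials=20000, n=10, mu=5.0, sigma=2.0, seed=1):
    """Repeat the experiment and record both kinds of coverage."""
    rng = random.Random(seed)
    population = NormalDist(mu, sigma)
    hits = 0
    proportions = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        low, high = prediction_interval(sample, T_UPPER_025_DF9)
        future = rng.gauss(mu, sigma)
        hits += low < future < high  # did the interval catch the next observation?
        # proportion of the population lying inside this particular interval
        proportions.append(population.cdf(high) - population.cdf(low))
    return hits / trials, min(proportions), max(proportions), sum(proportions) / trials


hit_rate, least, most, average = simulate()
print(hit_rate, least, most, average)
```

The hit rate and the average proportion should both be close to .95, while the proportions covered by individual intervals scatter on either side of that value.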

In many cases, there is a need for intervals covering a proportion γ with high probability, say β. This is, however, not possible. In general, it is possible only to give rules for computing intervals covering *at least* a proportion γ with high probability β. Intervals with this property are called γ-proportion tolerance regions with confidence coefficient β. In the normal case, one might, for example, seek a constant *c*, for given γ and β, such that, whatever the values of μ and σ,

$$P\left\{\int_{\bar{X}-cS}^{\bar{X}+cS} f(u;\mu,\sigma)\,du \;\ge\; \gamma\right\} = \beta,$$

where f(u;μ,σ) is the normal density with mean μ and variance σ^{2}.

The constants *c*, leading to a γ-proportion tolerance interval (X̄ - cS, X̄ + cS) with confidence coefficient β, cannot be expressed by one of the standard distributions (as was the case in the example of the prediction interval dealt with above). Tables of *c* can be found in Owen (1962, p. 127 ff.). For further discussion see Proschan (1953), and for nonparametric tolerance intervals see Wilks (1942). [*See also* NONPARAMETRIC STATISTICS.]

**Confidence regions.** In multivariate problems, confidence procedures yielding intervals are generalized to those yielding confidence regions.

*Example 9*. Let X and Y be two normally distributed random variables with unknown means μ and ν, known variances 2 and 1, and covariance –1. A confidence region for (μ, ν) with confidence coefficient .99 is given by (X - μ)^{2} + 2(X - μ)(Y - ν) + 2(Y - ν)^{2} ≤ 9.21. The figure 9.21 is obtained from a chi-square table, since the quadratic form on the left is distributed as chi-square with two degrees of freedom. The confidence region is an ellipse with center (X, Y). When such a region is described in terms, say, of pairs of parallel tangent lines, the result may usefully be considered in the framework of multiple comparisons. [*See* LINEAR HYPOTHESES, *article on* MULTIPLE COMPARISONS.]
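The coefficients of the quadratic form and the figure 9.21 can be checked numerically. In the sketch below, which is only an illustration with hypothetical function names, the coefficients come from inverting the covariance matrix with rows (2, –1) and (–1, 1), and the critical value uses the fact that a chi-square variable with two degrees of freedom exceeds *x* with probability exp(–*x*/2).

```python
import math


def inverse_2x2(a, b, c, d):
    """Inverse of the matrix [[a, b], [c, d]], returned row by row."""
    det = a * d - b * c
    return (d / det, -b / det, -c / det, a / det)


# Covariance matrix of (X, Y) in example 9: variances 2 and 1, covariance -1.
INV = inverse_2x2(2.0, -1.0, -1.0, 1.0)

# Upper .01 point of chi-square with 2 d.f.: solve exp(-x / 2) = .01 for x.
THRESHOLD = -2.0 * math.log(0.01)


def in_confidence_region(x, y, mu, nu, threshold=THRESHOLD):
    """True if (mu, nu) lies in the .99 confidence ellipse centered at (x, y)."""
    dx, dy = x - mu, y - nu
    a, b, c, d = INV
    q = a * dx * dx + (b + c) * dx * dy + d * dy * dy
    return q <= threshold


print(INV, round(THRESHOLD, 2))
print(in_confidence_region(0.0, 0.0, 1.0, 1.0))
print(in_confidence_region(0.0, 0.0, 0.0, 3.0))
```

With the observed point taken as (0, 0), the pair (1, 1) lies inside the ellipse and the pair (0, 3) lies outside it.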

J. PFANZAGL

## BIBLIOGRAPHY

*The theory of confidence intervals is systematically developed in* Neyman 1937; 1938b. *Prior to Neyman, this concept had been used occasionally in a rather vague manner by a number of authors, for example, by* Laplace 1812, *section 16, although in a few cases the now current meaning was clearly stated, perhaps first by* Cournot 1843, *pp. 185–186. A precise formulation without systematic theory is given in* Hotelling 1931. *A more detailed account of the history is given in* Neyman 1938a.

BORGES, RUDOLPH 1962 Subjektiv trennscharfe Konfidenzbereiche. *Zeitschrift für Wahrscheinlichkeitstheorie* 1:47–69.

COURNOT, ANTOINE AUGUSTIN 1843 *Exposition de la théorie des chances et des probabilités*. Paris: Hachette.

FISHER, R. A. (1930) 1950 Inverse Probability. Pages 22.527a-22.535 in R. A. Fisher, *Contributions to Mathematical Statistics*. New York: Wiley. → First published in Volume 26 of the *Proceedings* of the Cambridge Philosophical Society.

FISHER, R. A. 1933 The Concepts of Inverse Probability and Fiducial Probability Referring to Unknown Parameters. Royal Society of London, *Proceedings* Series A 139:343–348.

HOTELLING, HAROLD 1931 The Generalization of Student’s Ratio. *Annals of Mathematical Statistics* 2:360–378.

KENDALL, MAURICE G.; and STUART, ALAN (1943–1946) 1961 *The Advanced Theory of Statistics*. Volume 2: Inference and Relationship. New York: Hafner; London: Griffin. → See especially pages 98–133 on “Interval Estimation: Confidence Levels” and pages 518–521 on “Distribution-free Tolerance Intervals.” (Kendall was the sole author of the first edition.)

LAPLACE, PIERRE SIMON DE (1812) 1820 *Théorie analytique des probabilités*. 3d ed., rev. Paris: Courcier. → Laplace’s mention of confidence intervals first appeared in the 2d (1814) edition.

LEHMANN, ERICH L. 1959 *Testing Statistical Hypotheses*. New York: Wiley. → See especially pages 78–83, 173–180, and 243–245.

LINDLEY, D. V. 1958 Fiducial Distributions and Bayes’ Theorem. *Journal of the Royal Statistical Society* Series B 20:102–107.

MOOD, ALEXANDER M.; and GRAYBILL, FRANKLIN A. (1950) 1963 *Introduction to the Theory of Statistics*. 2d ed. New York: McGraw-Hill. → See especially pages 220–244 on “Interval Estimation.” (Mood was the sole author of the 1950 edition.)

NEYMAN, JERZY 1937 Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability. Royal Society of London, *Philosophical Transactions* Series A 236:333–380.

NEYMAN, JERZY (1938a) 1952 *Lectures and Conferences on Mathematical Statistics and Probability*. 2d ed. Washington: U.S. Dept. of Agriculture. → See especially Chapter 4, “Statistical Estimation.”

NEYMAN, JERZY 1938b L’estimation statistique traitée comme un problème classique de probabilité. *Actualités scientifiques et industrielles* 739:26–57.

OWEN, DONALD B. 1962 *Handbook of Statistical Tables*. Reading, Mass.: Addison-Wesley. → A list of addenda and errata is available from the author.

PROSCHAN, FRANK 1953 Confidence and Tolerance Intervals for the Normal Distribution. *Journal of the American Statistical Association* 48:550–564.

RICHTER, HANS 1954 Zur Grundlegung der Wahrscheinlichkeitstheorie. *Mathematische Annalen* 128:305–339. → See especially pages 336–339 on “Konfidenzschluss und Fiduzialschluss.”

SCHMETTERER, LEOPOLD 1956 *Einführung in die Mathematische Statistik*. Berlin: Springer. → See especially Chapter 3 on “Konfidenzbereiche.”

STEIN, CHARLES 1945 A Two Sample Test for a Linear Hypothesis Whose Power Is Independent of the Variance. *Annals of Mathematical Statistics* 16:243–258.

WILKS, S. S. 1942 Statistical Prediction With Special Reference to the Problem of Tolerance Limits. *Annals of Mathematical Statistics* 13:400–409.

# Estimation

Adding, multiplying, and performing similar mathematical operations in one's head can be difficult tasks, even for the most skilled mathematics students. By estimating, however, basic operations are easier to calculate mentally. This can make daily calculation tasks, from figuring tips to monthly budgets, quickly attainable and understandable.

## How to Estimate

Although the core of estimation is **rounding**, *place value* (for example, rounding to the nearest hundred) makes estimating flexible and useful. For instance, calculating 2.4 + 13.7 − 10.8 + 8 − 124.2 − 32 to equal −142.9 in one's head may be a daunting task. But if the expression is estimated by 10s, that is, if each number is rounded to the nearest 10, the problem becomes 0 + 10 − 10 + 10 − 120 − 30, and it is easier to calculate its value at −140. Estimating to the 1s makes the expression 2 + 14 − 11 + 8 − 124 − 32 = −143, which is more accurate but more difficult to calculate mentally. Note that the smaller the place value used, the closer the estimate tends to be to the actual sum.
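The same comparison can be reproduced in a few lines of code. This is a minimal sketch with a hypothetical helper name; it rounds each term to the nearest multiple of the chosen place value, with ties (none occur here) going toward the larger value.

```python
import math


def round_to_place(value, place):
    """Round value to the nearest multiple of place (ties go upward)."""
    return math.floor(value / place + 0.5) * place


# The terms of 2.4 + 13.7 - 10.8 + 8 - 124.2 - 32, with their signs.
terms = [2.4, 13.7, -10.8, 8, -124.2, -32]

exact = sum(terms)
by_tens = sum(round_to_place(t, 10) for t in terms)
by_ones = sum(round_to_place(t, 1) for t in terms)

print(round(exact, 1), by_tens, by_ones)
```

Estimating by 10s misses the actual sum by about 2.9, while estimating by 1s misses it by about 0.1.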

## Estimation by Tens

Multiplication and division can be estimated with any place value, but estimating by 10s is usually the quickest method. For example, the product 8 × 1,294 = 10,352 can be estimated by 10s as 10 × 1,290 = 12,900, which is calculated with little effort. Division is similar in that estimating by 10s allows for the quickest calculation, even with decimals. For instance, 1,232.322 ÷ 12.2 = 101.01 is quicker to estimate by 10s as 1,230.0 ÷ 10.0 = 123.0.

Regardless of the ease of estimating by 10s, there is a greater degree of inaccuracy as compared to estimating by 1s. However, this estimation method need not be abandoned in order to gain accuracy; instead, it can be used to obtain estimations that are more accurate, as the following example illustrates.

Suppose a couple on a date enjoys a dinner that costs $24.32. The customary tip is 15 percent, but the couple does not have a calculator, tip table, or pencil to help figure the amount that should be added to the bill. Using the estimating-by-10s method, they figure that 15 percent of $10 is $1.50; if the bill is around $20, then the tip doubles to $3. However, a $3 tip is not enough because they have not included a tip for the $4.32 remaining on the bill. Yet if $1.50 is the tip for $10, then $0.75 would be an appropriate tip for $5, which is near enough to $4.32. The total estimated tip of $3.75 is close to (in fact, slightly more than) 15 percent of $24.32, which is $3.65 (rounded to the nearest cent).
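The couple's reasoning can be written out as a short calculation. The sketch below is one possible reading of their method, with a hypothetical function name: it tips $1.50 for each full $10 and treats the leftover amount as the nearest multiple of $5.

```python
import math


def estimate_tip_by_tens(bill, rate=0.15):
    """Tip for each full $10, plus half of that for each $5 of the leftover."""
    tip_per_ten = rate * 10                  # $1.50 at 15 percent
    tens, remainder = divmod(bill, 10)       # 24.32 -> 2 tens and 4.32 left over
    fives = math.floor(remainder / 5 + 0.5)  # leftover rounded to the nearest $5
    return tens * tip_per_ten + fives * tip_per_ten / 2


estimated = estimate_tip_by_tens(24.32)
exact = round(0.15 * 24.32, 2)

print(estimated, exact)
```

The estimate exceeds the exact tip by ten cents, which is the overestimation mentioned above.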

## Conservative Estimation

As seen in several of the examples, estimations tend to be more (an overestimation) or less (an underestimation) than the actual calculation. Whether this is important depends upon the situation. For example, overestimating the distance for a proposed trip may be a good idea, especially in figuring how much gas money will be needed.

This property of rounding and estimation is the foundation of conservative estimation found in financial planning. When constructing a monthly budget, financial planners will purposely underestimate income and overestimate expenses, usually by hundreds. Although an accurate budget seems ideal, this estimating technique creates a "cushion" for unexpected changes, such as a higher water bill or fewer hours worked. Furthermore, financial planners will round down (regardless of rounding rules) for underestimation and round up for overestimation.

The following table represents a sample budget for an individual. The first column lists the amounts the individual expects to receive or pay; the second is a conservative estimate of the next month's budget; the third is a list of the actual amounts incurred; and the fourth is the difference between actual and budgeted amounts. Note that negative numbers, or amounts that take away from income, are written in parentheses.

The table shows that the individual earned less than expected and in some cases spent more than expected. Nevertheless, because the budget is conservative, there is a surplus (money left over) at the end of the month.

ESTIMATING A MONTHLY BUDGET

| | Expected Amount | Budget | Actual Amount | Difference |
|---|---|---|---|---|
| Income | $3,040 | $3,000 | $2,995 | ($5) |
| Tax | (578) | (600) | (579) | 21 |
| Rent | (575) | (600) | (575) | 25 |
| Utilities | (40) | (100) | (62) | 38 |
| Food | (175) | (200) | (254) | (54) |
| Insurance | (175) | (200) | (175) | 25 |
| Medical | (45) | (100) | (97) | 3 |
| Car Payments | (245) | (300) | (245) | 55 |
| Gas | (85) | (100) | (133) | (33) |
| Student Loans | (325) | (400) | (325) | 75 |
| Savings | (300) | (300) | (300) | 0 |
| Fun Money | (49) | (100) | (175) | (75) |
| Surplus (Deficit) | $28 | $0 | $75 | $75 |
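The Budget and Difference columns follow mechanically from the other two. The sketch below, offered only as an illustration with hypothetical names, rounds income down and expenses up to the next hundred, as described above, and then compares the result with the actual amounts. Expenses are entered as negative numbers instead of in parentheses.

```python
import math

# Expected and actual amounts from the table; expenses are negative.
expected = {
    "Income": 3040, "Tax": -578, "Rent": -575, "Utilities": -40,
    "Food": -175, "Insurance": -175, "Medical": -45, "Car Payments": -245,
    "Gas": -85, "Student Loans": -325, "Savings": -300, "Fun Money": -49,
}
actual = {
    "Income": 2995, "Tax": -579, "Rent": -575, "Utilities": -62,
    "Food": -254, "Insurance": -175, "Medical": -97, "Car Payments": -245,
    "Gas": -133, "Student Loans": -325, "Savings": -300, "Fun Money": -175,
}


def conservative(amount, step=100):
    """Round income down and expenses up in size to a multiple of step."""
    if amount >= 0:
        return math.floor(amount / step) * step
    return -math.ceil(-amount / step) * step


budget = {item: conservative(amount) for item, amount in expected.items()}
difference = {item: actual[item] - budget[item] for item in budget}

budget_surplus = sum(budget.values())
actual_surplus = sum(actual.values())

for item in budget:
    print(f"{item:14} {budget[item]:7} {actual[item]:7} {difference[item]:5}")
print("Surplus", budget_surplus, actual_surplus, sum(difference.values()))
```

The budgeted amounts balance exactly, and the cushion built into them is what produces the month's surplus despite the overspending on food, gas, and fun money.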

## Estimation by Average

Counting the number of words on a page can be a tedious task. Therefore, writers often estimate the total by averaging the number of words on the first few lines and then multiplying that average by the number of lines on the page.

Another application of estimation by average is the classic game of guessing how many jellybeans are in a jar. The trick is to average the number of beans on the top and bottom layers and then to multiply that average by the number of layers in the jar. Because it is customary to declare a winner who guessed the closest but not over the actual count, it is best to estimate conservatively.
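Both applications come down to the same small calculation. In the sketch below every count is hypothetical and chosen only for illustration; the final step rounds the jellybean estimate down to the nearest 10 so that the guess stays conservative.

```python
import math


def estimate_by_average(sample_counts, number_of_groups):
    """Average the sampled counts and scale up by the number of groups."""
    average = sum(sample_counts) / len(sample_counts)
    return average * number_of_groups


# Hypothetical page: words counted on the first four lines, 30 lines in all.
words_on_page = estimate_by_average([11, 9, 12, 10], 30)

# Hypothetical jar: 19 beans on the top layer, 23 on the bottom, 12 layers.
beans_in_jar = estimate_by_average([19, 23], 12)
conservative_guess = math.floor(beans_in_jar / 10) * 10

print(words_on_page, beans_in_jar, conservative_guess)
```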

Estimation is a powerful skill that can be applied to tasks from proofing arithmetic to winning a counting game. However, the use of estimation is not always appropriate to the task. For example, estimating distance and direction of space debris and ships is unwise, since even the smallest decimal difference can mean life or death. In addition, technology makes it possible to add and multiply large groups of numbers faster than it may take to estimate the total. Nevertheless, estimation is an important tool in managing the everyday mathematics of life.

see also Financial Planner; Rounding.

*Michael Ota*

## Bibliography

Pappas, Theoni. *The Joy of Mathematics: Discovering Mathematics All Around You.* San Carlos, CA: World Wide Publishing, 1989.

# estimation

es·ti·ma·tion /ˌestəˈmāshən/ • n. a rough calculation of the value, number, quantity, or extent of something: *estimations of protein concentrations.* ∎ [usu. in sing.] a judgment of the worth or character of someone or something: *the pop star rose in my estimation.*
