# Inference, Bayesian


*Bayesian inference* is a collection of statistical methods that are based on a formula devised by the English mathematician Thomas Bayes (1702-1761). Statistical inference is the procedure of drawing conclusions about a population or process based on a sample. Characteristics of a population are known as *parameters.* The distinctive aspect of Bayesian inference is that both parameters and sample data are treated as random quantities, while other approaches regard the parameters as nonrandom. An advantage of the Bayesian approach is that all inferences can be based on probability calculations, whereas non-Bayesian inference often involves subtleties and complexities. One disadvantage of the Bayesian approach is that it requires both a *likelihood function* that defines the random process that generates the data, and a *prior probability distribution* for the parameters. The prior distribution is usually based on a subjective choice, which has been a source of criticism of the Bayesian methodology. From the likelihood and the prior, Bayes’s formula gives a *posterior distribution* for the parameters, and all inferences are based on this.

## BAYES’S FORMULA

There are two interpretations of the probability of an event $A$, denoted $P(A)$: (1) the long-run proportion of times that the event $A$ occurs upon repeated sampling; and (2) a subjective belief in how likely it is that the event $A$ will occur. If $A$ and $B$ are two events, and $P(B) > 0$, then the *conditional probability of A given B* is $P(A \mid B) = P(AB)/P(B)$, where $AB$ denotes the event that both $A$ and $B$ occur. The frequency interpretation of $P(A \mid B)$ is the long-run proportion of times that $A$ occurs when we restrict attention to outcomes where $B$ has occurred. The subjective probability interpretation is that $P(A \mid B)$ represents the updated belief of how likely it is that $A$ will occur if we know $B$ has occurred. The simplest version of Bayes’s formula is

$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \sim B)\,P(\sim B)},$$

where $\sim B$ denotes the complementary event to $B$, that is, the event that $B$ does not occur. Thus, starting with the conditional probabilities $P(A \mid B)$ and $P(A \mid \sim B)$ and the unconditional probability $P(B)$ (with $P(\sim B) = 1 - P(B)$ by the laws of probability), we can obtain $P(B \mid A)$. Most applications require a more advanced version of Bayes’s formula.
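As a rough illustration of the simplest version of Bayes’s formula, the sketch below evaluates it for one set of inputs. The function name `bayes_posterior` and the three input probabilities are hypothetical values chosen for the example, not figures from this article.

```python
def bayes_posterior(p_a_given_b, p_a_given_not_b, p_b):
    """Return P(B | A) from P(A | B), P(A | not B), and P(B)."""
    p_not_b = 1.0 - p_b                      # law of complements
    numerator = p_a_given_b * p_b            # P(A | B) P(B)
    denominator = numerator + p_a_given_not_b * p_not_b  # total probability of A
    if denominator == 0.0:
        raise ValueError("P(A) is zero, so P(B | A) is undefined")
    return numerator / denominator


# Hypothetical inputs: P(A | B) = 0.9, P(A | not B) = 0.2, P(B) = 0.1.
posterior = bayes_posterior(0.9, 0.2, 0.1)
print(round(posterior, 4))  # 0.09 / (0.09 + 0.18), about 0.3333
```

Although $A$ is much more likely under $B$ than under its complement, the small value of $P(B)$ keeps the updated probability at about one third.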

Consider the “experiment” of flipping a coin. The mathematical model for the coin flip applies to many other problems, such as survey sampling when the subjects are asked to give a “yes” or “no” response. Let *θ* denote the probability of heads on a single flip, which we assume is the same for all flips. If we also assume that the flips are statistically independent given *θ* (i.e., the outcome of one flip is not predictable from other flips), then the probability model for the process is determined by *θ* and the number of flips. Note that *θ* can be any number between 0 and 1. Let the random variable *X* be the number of heads in *n* flips. Then the probability that *X* takes a value *k* is given by

$$P(X = k \mid \theta) = C_{n,k}\,\theta^{k}(1 - \theta)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

$C_{n,k}$ is a binomial coefficient whose exact form is not needed. This probability distribution is called the *binomial distribution.* We will denote $P(X = k \mid \theta)$ by $f(k \mid \theta)$, and when we substitute the observed number of heads for $k$, it gives the *likelihood function.*
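The distinction between the binomial distribution (a function of the count, with the parameter fixed) and the likelihood function (a function of the parameter, with the count fixed) can be sketched as follows. The function names and the values 10 and 3 are hypothetical choices for illustration.

```python
from math import comb


def binomial_pmf(k, n, theta):
    """P(X = k | theta): probability of k heads in n independent flips."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def likelihood(theta, x, n):
    """The same formula read as a function of theta, with x fixed at the observed count."""
    return binomial_pmf(x, n, theta)


n, x = 10, 3  # hypothetical: 3 heads in 10 flips

# As a distribution over k (theta fixed), the probabilities sum to one.
total = sum(binomial_pmf(k, n, 0.4) for k in range(n + 1))

# As a likelihood over theta (x fixed), a coarse grid search finds the maximiser x/n.
grid = [i / 100 for i in range(101)]
best_theta = max(grid, key=lambda t: likelihood(t, x, n))
print(total, best_theta)
```

The grid search is only a convenience for the sketch; the maximising value can also be found analytically.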

To complete the Bayesian model we specify a prior distribution for the unknown parameter *θ.* If we have no belief that one value of *θ* is more likely than another, then a natural choice for the prior is the uniform distribution on the interval of numbers from 0 to 1. This distribution has a probability density function *g(θ)* which is 1 for 0 ≤ *θ* ≤ 1 and otherwise equals 0, which means that *P (a≤θ ≤b* ) = *b* - *a* for 0≤*a* < *b* ≤ 1.

The posterior density of $\theta$ given $X = x$ is given by a version of Bayes’s formula:

$$h(\theta \mid x) = K(x)\,f(x \mid \theta)\,g(\theta), \qquad K(x)^{-1} = \int f(x \mid \theta)\,g(\theta)\,d\theta,$$

where $K(x)^{-1}$ is the area under the curve $f(x \mid \theta)\,g(\theta)$ when $x$ is fixed at the observed value.

A quarter was flipped $n = 25$ times and $x = 12$ heads were observed. The plot of the posterior density $h(\theta \mid 12)$ is shown in Figure 1. This represents our updated beliefs about $\theta$ after observing twelve heads in twenty-five coin flips. For example, there is little chance that $\theta \ge .8$; in fact, $P(\theta \ge .8 \mid X = 12) = 0.000135$, whereas according to the prior distribution, $P(\theta \ge .8) = 0.200000$.
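The tail probabilities quoted for the coin example can be checked with a short sketch. It assumes that the uniform prior is the Beta(1, 1) distribution and that the posterior after 12 heads in 25 flips is Beta(13, 14); for integer parameters the beta distribution function can be written as a binomial tail sum, which avoids any external library. The helper name `beta_cdf_int` is hypothetical.

```python
from math import comb


def beta_cdf_int(t, a, b):
    """P(theta <= t) for Beta(a, b) with positive integer a and b.

    Uses the identity P(Beta(a, b) <= t) = P(Binomial(a + b - 1, t) >= a).
    """
    m = a + b - 1
    return sum(comb(m, j) * t**j * (1 - t) ** (m - j) for j in range(a, m + 1))


n, x = 25, 12
a, b = x + 1, n - x + 1  # posterior Beta(13, 14) under the uniform prior

posterior_tail = 1 - beta_cdf_int(0.8, a, b)  # P(theta >= .8 | X = 12)
prior_tail = 1 - beta_cdf_int(0.8, 1, 1)      # P(theta >= .8) under the uniform prior
print(f"{posterior_tail:.6f} {prior_tail:.6f}")
```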

## STATISTICAL INFERENCE

There are three general problems in statistical inference. The simplest is *point estimation:* What is our best guess for the true value of the unknown parameter $\theta$? One natural approach is to select the highest point of the posterior density, which is the *posterior mode.* In this example, the posterior mode is $\theta_{\text{Mode}} = x/n = 12/25 = 0.48$. The posterior mode here is also the *maximum likelihood estimate*, which is the estimate most non-Bayesian statisticians would use for this problem. The maximum likelihood estimate would not be the same as the posterior mode if we had used a different prior. The generally preferred Bayesian point estimate is the *posterior mean*:

$$\theta_{\text{Mean}} = \int \theta\,h(\theta \mid x)\,d\theta = \frac{x + 1}{n + 2} = \frac{13}{27} = 0.4815,$$

almost the same as $\theta_{\text{Mode}}$ here.

The second general problem in statistical inference is interval estimation. We would like to find two numbers $a < b$ such that $P(a < \theta < b \mid X = 12)$ is large, say 0.95. Using a computer package one finds that $P(0.30 < \theta < 0.67 \mid X = 12) = 0.950$. The interval $0.30 < \theta < 0.67$ is known as a *95 percent credibility interval.* A non-Bayesian 95 percent confidence interval is $0.28 < \theta < 0.68$, which is very similar, but the interpretation depends on the subtle notion of “confidence.”
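The point and interval estimates above can be reproduced approximately with the following sketch. It assumes the posterior is Beta(13, 14) and takes the interval between the 2.5 percent and 97.5 percent posterior quantiles, which is one common construction and may differ from the one used for the figures quoted in the text. The helper names are hypothetical, and the quantiles are found by simple bisection.

```python
from math import comb


def beta_cdf_int(t, a, b):
    """P(theta <= t) for Beta(a, b) with positive integer a and b (binomial tail identity)."""
    m = a + b - 1
    return sum(comb(m, j) * t**j * (1 - t) ** (m - j) for j in range(a, m + 1))


def beta_quantile_int(q, a, b, tol=1e-10):
    """Invert the distribution function by bisection on the interval [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf_int(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


n, x = 25, 12
a, b = x + 1, n - x + 1            # posterior Beta(13, 14)

mode = (a - 1) / (a + b - 2)       # equals x / n
mean = a / (a + b)                 # equals (x + 1) / (n + 2)
lower = beta_quantile_int(0.025, a, b)
upper = beta_quantile_int(0.975, a, b)
print(round(mode, 4), round(mean, 4), round(lower, 2), round(upper, 2))
```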

The third general statistical inference problem is hypothesis testing: We wish to determine if the observed data support or lend doubt to a statement about the parameter. The Bayesian approach is to calculate the posterior probability that the hypothesis is true. Depending on the value of this posterior probability, we may conclude that the hypothesis is likely to be true, likely to be false, or the result is inconclusive. In our example, we may ask if the coin is biased against heads, that is, is *θ* < 0.50? We find *P* (*θ* < 0.50ǀ *X* = 12) = 0.58. This probability is not particularly large or small, so we conclude that there is no evidence for a bias for (or against) heads.
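The posterior probability of the hypothesis can be computed directly, as in the sketch below, which again assumes a Beta(13, 14) posterior and uses a hypothetical helper for the beta distribution function with integer parameters.

```python
from math import comb


def beta_cdf_int(t, a, b):
    """P(theta <= t) for Beta(a, b) with positive integer a and b (binomial tail identity)."""
    m = a + b - 1
    return sum(comb(m, j) * t**j * (1 - t) ** (m - j) for j in range(a, m + 1))


n, x = 25, 12
prob_biased_against_heads = beta_cdf_int(0.5, x + 1, n - x + 1)
print(round(prob_biased_against_heads, 2))  # about 0.58, the value quoted above
```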

Certain problems can arise in Bayesian hypothesis testing. For example, it is natural to ask whether the coin is fair—that is, does *θ* = 0.50? Because *θ* is a continuous random variable, *P* (*θ* = 0.50ǀ *X* = 12) = 0. One can perform an analysis using a prior that allows *P* (*θ* = 0.50)> 0, but the conclusions will depend on the prior. A non-Bayesian approach would not reject the hypothesis *θ* = 0.50 since there is no evidence against it (in fact, *θ* = 0.50 is in the credible interval).

This coin flip example illustrates the fundamental aspects of Bayesian inference, and some of its pros and cons. Leonard J. Savage (1954) posited a simple set of axioms and argued that all statistical inferences should logically be Bayesian. However, most practical applications of statistics tend to be non-Bayesian. There has been more usage of Bayesian statistics since about 1990 because of increasing computing power and the development of algorithms for approximating posterior distributions.

## TECHNICAL NOTES

All computations were performed with the R statistical package, which is available from the Comprehensive R Archive Network. The prior and posterior in the example belong to the family of beta distributions, and the R functions dbeta, pbeta, and qbeta were used in the calculations.

**SEE ALSO** *Bayes’ Theorem; Bayesian Econometrics; Bayesian Statistics; Distribution, Uniform; Inference, Statistical; Maximum Likelihood Regression; Probability Distributions; Randomness; Statistics in the Social Sciences*

## BIBLIOGRAPHY

Berger, James O., and José M. Bernardo. 1992. On the Development of the Reference Prior Method. In *Bayesian Statistics 4: Proceedings of the Fourth Valencia International Meeting*, eds. José M. Bernardo, James O. Berger, A. P. Dawid, and A. F. M. Smith, 35–60. London: Oxford University Press.

Berger, James O., and Thomas Sellke. 1987. Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence. *Journal of the American Statistical Association* 82: 112–122.

Department of Statistics and Mathematics, Wirtschaftsuniversität Wien (Vienna University of Economics and Business Administration). Comprehensive R Archive Network. http://cran.r-project.org/.

Gelman, Andrew, John B. Carlin, Hal S. Stern, and Donald B. Rubin. 2004. *Bayesian Data Analysis.* 2nd ed. Boca Raton, FL: Chapman and Hall/CRC.

O’Hagan, Anthony, and Jonathan Forster. 2004. *Kendall’s Advanced Theory of Statistics*, vol. 2B: *Bayesian Inference.* 2nd ed. London: Arnold.

Savage, Leonard J. [1954] 1972. *The Foundations of Statistics.* 2nd ed. New York: Dover.

*Dennis D. Cox*

# Bayesian Inference


Bayesian inference or Bayesian statistics is an approach to statistical inference based on the theory of subjective probability. A formal Bayesian analysis leads to probabilistic assessments of the object of uncertainty. For example, a Bayesian inference might be, “The probability is .95 that the mean of a normal distribution lies between 12.1 and 23.7.” The number .95 represents a degree of belief, either in the sense of *subjective probability coherent* or *subjective probability rational* [*see* Probability, *article on* interpretations, *which should be read in conjunction with the present article*]; .95 need not correspond to any “objective” long-run relative frequency. Very roughly, a degree of belief of .95 can be interpreted as betting odds of 95 to 5 or 19 to 1. A degree of belief is always potentially a basis for action; for example, it may be combined with utilities by the principle of maximization of expected utility [*see* Decision theory; Utility].

By contrast, the sampling theory or classical approach to inference leads to probabilistic statements about the method by which a particular inference is obtained. Thus a classical inference might be, “A .95 confidence interval for the mean of a normal distribution extends from 12.1 to 23.7” [*see* Estimation, *article on* confidence intervals and regions]. The number .95 here represents a long-run relative frequency, namely the frequency with which intervals obtained by the method that resulted in the present interval would in fact include the unknown mean. (It is not to be inferred from the fact that we used the same numbers, .95, 12.1, and 23.7, in both illustrations that there will necessarily be a numerical coincidence between the two approaches.)

The term Bayesian arises from an elementary theorem of probability theory named after the Rev. Thomas Bayes, an English clergyman of the eighteenth century, who first enunciated a special case of it and proposed its use in inference. Bayes’ theorem is used in the process of making Bayesian inferences, as will be explained below. For a number of historical reasons, however, current interest in Bayesian inference is quite recent, dating, say, from the 1950s. Hence the term “neo-Bayesian” is sometimes used instead of “Bayesian.”

**An illustration of Bayesian inference.** For a simple illustration of the Bayesian approach, consider the problem of making inferences about a Bernoulli process with parameter *p.* A Bernoulli process can be visualized in terms of repeated independent tosses of a not necessarily fair coin. It generates heads and tails in such a way that the probability of heads on a single trial is always equal to a parameter *p* regardless of the previous history of heads and tails. The subjectivistic counterpart of this description of a Bernoulli process is given by de Finetti’s concept of exchangeable events [*see* Probability, *article on* interpretations].

Suppose first that we have no direct sample evidence from the process. Based on experience with similar processes, introspection, general knowledge, etc., we may be willing to translate our judgments about the process into probabilistic terms. For example, we might assess a (subjective) probability distribution for *p̃*. The tilde (~) indicates that we are now thinking of the parameter *p* as a random variable. Such a distribution is called a *prior* (or *a priori*) distribution because it is usually assessed prior to sample evidence. Purely for illustration, suppose that the prior distribution of *p̃* is uniform on the interval from 0 to 1: the probability that *p̃* lies in any subinterval is that subinterval’s length, no matter where the subinterval is located between 0 and 1. Now suppose that on three tosses of a coin we observe heads, heads, and tails. The probability of observing this sample, conditional on *p̃* = *p*, is $p^{2}(1 - p)$. If we regard this expression as a function of *p*, it is called the *likelihood function* of the sample. Bayes’ theorem shows how to use the likelihood function in conjunction with the prior distribution to obtain a revised or *posterior* distribution of *p̃*. Posterior means after the sample evidence, and the posterior distribution represents a reconciliation of sample evidence and prior judgment. In terms of inferences about *p̃*, we may write Bayes’ theorem in words as follows: Posterior probability (density) at *p*, given the observed sample, equals

$$\frac{\text{prior probability (density) at } p \;\times\; \text{likelihood of the observed sample at } p}{\text{sum (integral) over all } p \text{ of prior probability (density)} \times \text{likelihood}}.$$

Expressed mathematically,

$$f''(p \mid r, n) = \frac{f'(p)\,p^{r}(1 - p)^{n-r}}{\int_{0}^{1} f'(p)\,p^{r}(1 - p)^{n-r}\,dp}, \tag{1}$$

where $f'(p)$ denotes the prior density of *p̃*, $p^{r}(1 - p)^{n-r}$ denotes the likelihood if *r* heads are observed in *n* trials, and $f''(p \mid r, n)$ denotes the posterior density of *p̃* given the sample evidence. In our example, $f'(p) = 1$ for $0 \le p \le 1$ and 0 otherwise; $r = 2$; $n = 3$; and

$$\int_{0}^{1} p^{2}(1 - p)\,dp = \frac{1}{12},$$

so that

$$f''(p \mid r = 2, n = 3) = 12\,p^{2}(1 - p) \quad \text{for } 0 \le p \le 1, \text{ and 0 otherwise.}$$

Thus we emerge from the analysis with an explicit posterior probability distribution for *p̃*. This distribution characterizes fully our judgments about *p̃*. It could be applied in a formal decision-theoretic analysis in which utilities of alternative acts are functions of *p*. For example, we might make a Bayesian point estimate of *p* (each possible point estimate is regarded as an act), and the seriousness of an estimation error (loss) might be proportional to the square of the error. The best point estimate can then be shown to be the mean of the posterior distribution; in our example this would be .6. Or we might wish to describe certain aspects of the posterior distribution for summary purposes; it can be shown, for example, that, where *P* refers to the posterior distribution,

*P*(*p̃* < .194) = .025 and *P*(*p̃* > .932) = .025,

so that a .95 *credible interval* for *p̃* extends from .194 to .932. Again, it can easily be shown that *P*(*p̃* > .5) = .688: the posterior probability that the coin is “biased” in favor of heads is a little over ⅔.
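The summaries quoted for the heads, heads, tails example can be verified with a short sketch. It assumes the posterior density $12p^{2}(1 - p)$ that follows from a uniform prior with two heads in three tosses; its distribution function is $4p^{3} - 3p^{4}$ by direct integration. The function names are hypothetical.

```python
def posterior_cdf(p):
    """Distribution function of the density 12 p^2 (1 - p) on [0, 1]: 4p^3 - 3p^4."""
    return 4 * p**3 - 3 * p**4


def posterior_quantile(q, tol=1e-12):
    """Invert the distribution function by bisection."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Posterior mean: integral of p * 12 p^2 (1 - p) over [0, 1] = 12 * (1/4 - 1/5).
posterior_mean = 12 * (1 / 4 - 1 / 5)

lower = posterior_quantile(0.025)
upper = posterior_quantile(0.975)
prob_favors_heads = 1 - posterior_cdf(0.5)
print(round(posterior_mean, 3), round(lower, 3), round(upper, 3), round(prob_favors_heads, 3))
```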

**The likelihood principle.** In our example, the effect of the sample evidence was wholly transmitted by the likelihood function. All we needed to know from the sample was $p^{r}(1 - p)^{n-r}$; the actual sequence of individual observations was irrelevant *so long as we believed the assumption of a Bernoulli process.* In general, a full Bayesian analysis requires as inputs for Bayes’ theorem only the likelihood function and the prior distribution. Thus the import of the sample evidence is fully reflected in the likelihood function, a principle known as the likelihood principle [*see* Likelihood]. Alternatively, given that the sample is drawn from a Bernoulli process, the import of the sample is fully reflected in the numbers *r* and *n*, which are called *sufficient statistics* [*see* Sufficiency]. (If the sample size, *n*, is fixed in advance of sampling, it is said that *r* alone is sufficient.)

The likelihood principle implies certain consequences that do not accord with traditional ideas. Here are examples: (1) Once the data are in, there is no distinction between sequential analysis and analysis for fixed sample size. In the Bernoulli example, successive samples of $n_1$ and $n_2$ with $r_1$ and $r_2$ successes could be analyzed as one pooled sample of $n_1 + n_2$ trials with $r_1 + r_2$ successes. Alternatively, a posterior distribution could be computed after the first sample of $n_1$; this distribution could then serve as a prior distribution for the second sample; finally, a second posterior distribution could be computed after the second sample of $n_2$. By either route the posterior distribution after $n_1 + n_2$ observations would be the same. Under almost any situation that is likely to arise in practice, the “stopping rule” by which sampling is terminated is irrelevant to the analysis of the sample. For example, it would not matter whether *r* successes in *n* trials were obtained by fixing *r* in advance and observing the *r*th success on the *n*th trial, or by fixing *n* in advance and counting *r* successes in the *n* trials. (2) For the purpose of statistical reporting, the likelihood function is the important information to be conveyed. If a reader wants to perform his own Bayesian analysis, he needs the likelihood function, not a posterior distribution based on someone else’s prior nor traditional analyses such as significance tests, from which it may be difficult or impossible to recover the likelihood function.
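Both consequences in item (1) can be illustrated with a small sketch. It assumes a beta prior, for which a Bernoulli sample simply adds the successes and failures to the two beta parameters (the uniform prior is the case where both parameters equal 1). The sample sizes and counts are hypothetical.

```python
from math import comb


def update(prior, r, n):
    """Beta(a, b) prior with r successes in n trials gives Beta(a + r, b + n - r)."""
    a, b = prior
    return (a + r, b + (n - r))


prior = (1, 1)                     # uniform prior
n1, r1 = 10, 4                     # hypothetical first sample
n2, r2 = 15, 9                     # hypothetical second sample

sequential = update(update(prior, r1, n1), r2, n2)
pooled = update(prior, r1 + r2, n1 + n2)
print(sequential, pooled)          # the two routes give the same posterior


def binomial_likelihood(p, r, n):
    """n fixed in advance, r successes counted."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def negative_binomial_likelihood(p, r, n):
    """r fixed in advance, r-th success observed on trial n."""
    return comb(n - 1, r - 1) * p**r * (1 - p) ** (n - r)


# The two stopping rules give likelihoods that differ only by a constant factor,
# so they lead to the same posterior for any prior.
ratios = [
    binomial_likelihood(p, 3, 10) / negative_binomial_likelihood(p, 3, 10)
    for p in (0.2, 0.5, 0.8)
]
print(ratios)
```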

**Vagueness about prior probabilities.** In our example we assessed the prior distribution of *p̃* as a uniform distribution from 0 to 1. It is sometimes thought that such an assessment means that we “know” *p̃* is so distributed and that our claim to knowledge might be verified or refuted in some way. It is indeed possible to imagine situations in which the distribution of *p̃* might be known, as when one coin is to be drawn at random from a number of coins, each of which has a known *p* determined by a very large number of tosses. The frequency distribution of these *p*’s would then serve as a prior distribution, and all statisticians would apply Bayes’ theorem in analyzing sample evidence. But such an example would be unusual. Typically, in making an inference about *p̃* for a *particular* coin, the prior distribution of *p̃* is not a description of some distribution of *p*’s but rather a tool for expressing judgments about *p̃* based on evidence other than the evidence of the particular sample to be analyzed.

Not only do we rarely know the prior distribution of *p̃*, but we are typically more or less vague when we try to assess it. This vagueness is comparable to the vagueness that surrounds many decisions in everyday life. For example, a person may decide to offer $21,250 for a house he wishes to buy, even though he may be quite vague about what amount he “should” offer. Similarly, in statistical inference we may assess a prior distribution in the face of a certain amount of vagueness. If we are not willing to do so, we cannot pursue a *formal* Bayesian analysis and must evaluate sample evidence intuitively, perhaps aided by the tools of descriptive statistics and classical inference.

Vagueness about prior probabilities is not the only kind of vagueness to be faced in statistical analysis, and the other kinds of vagueness are equally troublesome for approaches to statistics that do not use prior probabilities. Vagueness about the likelihood function, that is, the process generating the data, is typically substantial and hard to deal with. Moreover, both classical and Bayesian decision theory bring in the idea of utility, and utilities often are vague.

In assessing prior probabilities, skillful self-interrogation is needed in order to mitigate vagueness. Self-interrogation may be made more systematic and illuminating in several ways. (1) *Direct judgmental assessment.* In assessing the prior distribution of *p̃*, for example, we might ask: For what *p* would we be indifferent to an even money bet that *p̃* is above or below this value? (Answer: the .50-quantile or median.) If we were told that *p* is above the .50-quantile just assessed, but nothing more, for what value of *p* would we now be indifferent in such a bet? (Answer: the .75-quantile.) Similarly we might locate other key quantiles or key relative heights on the density function. (2) *Translation to equivalent but hypothetical prior sample evidence.* For example, we might feel that our prior opinion about *p̃* is roughly what it would have been if we had initially held a uniform prior, and then seen *r* heads in *n* hypothetical trials from the process. The implied posterior distribution would serve as the prior. (3) *Contemplation of possible sample outcomes.* Sometimes we may find it easy to decide directly what our posterior distribution *would be* if a certain hypothetical sample outcome were to materialize. We can then work backward to see the prior distribution thereby implied. Of course, this approach is likely to be helpful only if the hypothetical sample outcomes are easy to assimilate. For example, if we make a certain technical assumption about the general shape of the prior (beta) distribution [*see* Distributions, statistical, *article on* special continuous distributions], the answers to the following two simply stated questions imply a prior distribution of *p*: (1) How do we assess the probability of heads *on a single trial?* (2) If we were to observe a head on a single trial (this is the hypothetical future outcome), how would we assess the probability of heads on a second trial?
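The two-question device at the end of approach (3) can be made concrete under the beta assumption. For a Beta(a, b) prior the probability of heads on a single trial is a/(a + b), and after one observed head it becomes (a + 1)/(a + b + 1); solving these two equations gives the parameters. The sketch below uses hypothetical answers of .5 and .6, and the function name is an illustrative choice.

```python
def beta_from_two_answers(m1, m2):
    """Recover Beta(a, b) from two assessed probabilities.

    m1: probability of heads on a single trial, a / (a + b).
    m2: probability of heads on a second trial after one head, (a + 1) / (a + b + 1).
    """
    if not 0 < m1 < m2 < 1:
        raise ValueError("a beta prior requires 0 < m1 < m2 < 1")
    total = (1 - m2) / (m2 - m1)   # a + b, from eliminating a in the two equations
    return m1 * total, (1 - m1) * total


# Hypothetical answers: .5 before any toss, .6 after seeing one head.
a, b = beta_from_two_answers(0.5, 0.6)
print(round(a, 6), round(b, 6))    # a Beta(2, 2) prior reproduces both answers
```

The requirement that the second answer exceed the first reflects the fact that, under any beta prior, observing a head raises the assessed probability of heads on the next trial.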

These approaches are intended only to be suggestive. If several approaches to self-interrogation lead to substantially different prior distributions, we must either try to remove the internal inconsistency or be content with an intuitive analysis. Actually, from the point of view of subjective probability coherent, the discovery of internal inconsistency in one’s judgments is the only route toward more rational decisions. The danger is not that internal inconsistencies will be revealed but that they will be suppressed by self-deception or glossed over by lethargy.

It may happen that vagueness affects only unimportant aspects of the prior distribution: theoretical or empirical analysis may show that the posterior distribution is insensitive to these aspects of the distribution. For example, we may be vague about many aspects of the prior distribution, yet feel that it is nearly uniform over all values of the parameter for which the likelihood function is not essentially zero. This has been called a diffuse, informationless, or locally uniform prior distribution. These terms are to be interpreted relative to the spread of the likelihood function, which depends on the sample size; a prior that is diffuse relative to a large sample may not be diffuse relative to a small one. If the prior distribution is diffuse, the posterior distribution can be easily approximated from the assumption of a strictly uniform prior distribution. The latter assumption, known historically as Bayes’ postulate (not to be confused with Bayes’ theorem), is regarded mainly as a device that leads to good approximations in certain circumstances, although supporters of subjective probability rational sometimes regard it as more than that in their approach to Bayesian inference. The uniform prior is also useful for statistical reporting, since it leads to posterior distributions from which the likelihood is easily recovered and presents the results in a readily usable form to any reader whose prior distribution is diffuse.

**Probabilistic prediction.** A distribution, prior or posterior, of the parameter *p* of a Bernoulli process implies a probabilistic prediction for any future sample to be drawn from the process, assuming that the stopping rule is given. For example, the denominator in the right-hand side of Bayes’ formula for Bernoulli sampling (equation 1) can be interpreted as the probability of obtaining the particular sample actually observed, given the prior distribution of *p̃*. If Mr. A and Mr. B each has a distribution for *p̃*, and a new sample is then observed, we can calculate the probability of the sample in the light of each prior distribution. The ratio of these probabilities, technically a marginal likelihood ratio, measures the extent to which the data favor Mr. A over Mr. B or vice versa. This idea has important consequences for evaluating judgments, selecting statistical models, and performing Bayesian tests of significance.
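A marginal likelihood ratio of this kind can be sketched for beta priors, for which the probability of a particular sequence with *r* heads in *n* trials is a ratio of beta functions. The choice of a uniform prior for Mr. A and a Beta(10, 10) prior for Mr. B is hypothetical, as are the function names.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def marginal_likelihood(a, b, r, n):
    """Probability of one particular sequence with r heads in n trials under Beta(a, b).

    This is the integral of p^r (1 - p)^(n - r) against the prior density,
    which equals B(a + r, b + n - r) / B(a, b).
    """
    return exp(log_beta(a + r, b + n - r) - log_beta(a, b))


r, n = 2, 3                                   # heads, heads, tails
prob_a = marginal_likelihood(1, 1, r, n)      # Mr. A: uniform prior (hypothetical)
prob_b = marginal_likelihood(10, 10, r, n)    # Mr. B: prior concentrated near .5 (hypothetical)
ratio = prob_a / prob_b
print(round(prob_a, 4), round(prob_b, 4), round(ratio, 3))
```

With these priors the ratio is below 1, so this small sample slightly favors Mr. B.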

In connection with the previous paragraph a separate point is worth making. The posterior distributions of Mr. A and Mr. B are bound to grow closer together as sample evidence piles up, so long as neither of the priors was dogmatic. An example of a dogmatic prior would be the opinion that *p* is exactly .5.

In an important sense the predictive distribution of future observations, which is derived from the posterior distribution, is more fundamental to Bayesian inference than the posterior distribution itself.

**Multivariate inference and nuisance parameters.** Thus far we have used one basic example, inferences about a Bernoulli process. To introduce some additional concepts, we now turn to inferences about the mean *μ* of a normal distribution with unknown variance σ^{2}. In this case we begin with a *joint* prior distribution for *μ* and σ^{2}. The likelihood function is now a function of two variables, *μ* and σ^{2}. An inspection of the likelihood function will show not only that the *sequence* of observations is irrelevant to inference but also that the magnitudes are irrelevant except insofar as they help determine the sample mean *x̄* and variance s^{2}, which, along with the sample size *n*, are the sufficient statistics of this example. The prior distribution combines with the likelihood essentially as before except that a double integration (or double summation) is needed instead of a single integration (or summation). The result is a joint posterior distribution of *μ* and σ^{2}.

If we are interested only in μ, then σ^{2} is said to be a *nuisance parameter.* In principle it is simple to deal with a nuisance parameter: we integrate it out of the posterior distribution. In our example this means that we must find the marginal distribution of *μ* from the joint posterior distribution of *μ* and σ^{2}.
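Integrating out a nuisance parameter can be approximated by a double summation over a grid, as in the following sketch. The data values, the grid limits, and the prior density proportional to $1/\sigma^{2}$ are all assumptions made for illustration; none of them comes from this article.

```python
from math import exp, log

data = [4.1, 5.3, 4.8, 6.0, 5.1]                  # hypothetical observations
n = len(data)
xbar = sum(data) / n                              # sample mean
ss = sum((x - xbar) ** 2 for x in data)           # sum of squared deviations


def log_joint(mu, var):
    """Log of (normal likelihood times an assumed prior 1/var), up to a constant."""
    log_likelihood = -0.5 * n * log(var) - (ss + n * (mu - xbar) ** 2) / (2 * var)
    log_prior = -log(var)                         # assumption: prior density proportional to 1/var
    return log_likelihood + log_prior


mus = [xbar + (i - 100) * 0.03 for i in range(201)]   # grid for mu, centred on xbar
variances = [0.02 * (j + 1) for j in range(500)]      # grid for sigma^2

# Joint posterior weights on the grid, then sum over sigma^2 to marginalise it out.
joint = [[exp(log_joint(m, v)) for v in variances] for m in mus]
marginal = [sum(row) for row in joint]
total = sum(marginal)
marginal = [w / total for w in marginal]              # normalised marginal posterior of mu

posterior_mean_mu = sum(m * w for m, w in zip(mus, marginal))
print(round(xbar, 4), round(posterior_mean_mu, 4))
```

The grid is only a device for the sketch; in practice the integration would be done analytically where possible or with more careful numerical methods.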

Multivariate problems and nuisance parameters can always be dealt with by the approach just described. The integrations required may demand heavy computation, but the task is straightforward. A more difficult problem is that of assessing multivariate prior distributions, especially when the number of parameters is large, and research is needed to find better techniques for avoiding self-contradictions and meeting the problems posed by vagueness in such assessments.

**Design of experiments and surveys.** So far we have talked only about problems of analysis of samples, without saying anything about what kind of sample evidence, and how much, should be sought. This kind of problem is known as a problem of *design.* A formal Bayesian solution of a design problem requires that we look beyond the posterior distribution to the ultimate decisions that will be made in the light of this distribution. What is the best design depends on the purposes to be served by collecting the data. Given the specific purpose and the principle of maximization of expected utility, it is possible to calculate the expected utility of the best act for any particular sample outcome. We can repeat this for each possible sample outcome for a given sample design. Next, we can weight each such utility by the probability of the corresponding outcome in the light of the prior distribution. This gives an over-all expected utility for any proposed design. Finally, we pick the sample design with the highest expected utility. For two-action problems—for example, deciding whether a new medical treatment is better or worse than a standard treatment—this procedure is in no conflict with the traditional approach of selecting designs by comparing operating characteristics, although it formalizes certain things—prior probabilities and utilities—that often are treated intuitively in the traditional approach.
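The design calculation described above can be sketched for the Bernoulli example. The sketch assumes a beta prior, squared-error loss with the posterior mean as the estimate (so the expected loss for a given outcome is the posterior variance), and a linear sampling cost; the loss scale, the cost per trial, and the range of sample sizes are hypothetical.

```python
from math import comb, exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def predictive_prob(r, n, a, b):
    """Probability of r successes in n trials in the light of a Beta(a, b) prior."""
    return comb(n, r) * exp(log_beta(a + r, b + n - r) - log_beta(a, b))


def posterior_variance(a, b):
    """Variance of Beta(a, b): the expected squared error of the posterior mean."""
    s = a + b
    return a * b / (s * s * (s + 1))


def expected_loss(n, a, b):
    """Weight the loss of the best act for each outcome by that outcome's probability."""
    return sum(
        predictive_prob(r, n, a, b) * posterior_variance(a + r, b + n - r)
        for r in range(n + 1)
    )


def expected_utility(n, a, b, loss_scale, cost_per_trial):
    """Overall expected utility of a design with n trials (negative loss minus cost)."""
    return -(loss_scale * expected_loss(n, a, b) + cost_per_trial * n)


# Hypothetical stakes: loss of 100 per unit squared error, cost of 0.05 per trial.
best_n = max(range(0, 101), key=lambda n: expected_utility(n, 1, 1, 100.0, 0.05))
print(best_n)
```

Changing the loss scale or the cost per trial changes the preferred sample size, which is the sense in which the best design depends on the purposes to be served.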

**Comparison of Bayesian and classical inference.** Certain common statistical practices are subject to criticism, either from the point of view of Bayesian or of classical theory: for example, estimation problems are frequently regarded as tests of null hypotheses [*see* Hypothesis Testing], and .05 or .01 significance levels are used inflexibly. Bayesian and classical theory are in many respects closer to each other than either is to everyday practice. In comparing the two approaches, therefore, we shall confine the discussion to the level of underlying theory. In one sense the basic difference is the acceptance of subjective probability judgment as a *formal* component of Bayesian inference. This does not mean that classical theorists would disavow judgment, only that they would apply it informally after the purely statistical analysis is finished: judgment is the “second span in the bridge of inference.” Building on subjective probability, Bayesian theory is a unified theory, whereas classical theory is diverse and *ad hoc.* In this sense Bayesian theory is simpler. In another sense, however, Bayesian theory is more complex, for it incorporates more into the formal analysis. Consider a famous controversy of classical statistics, the problem of comparing the means of two normal distributions with possibly unequal and unknown variances, the so-called Behrens–Fisher problem [*see* Linear hypotheses]. Conceptually this problem poses major difficulties for some classical theories (not Fisher’s fiducial inference; see Fisher 1939) but none for Bayesian theory. In application, however, the Bayesian approach faces the problem of assessing a prior distribution involving four random variables. Moreover, there may be messy computational work after the prior distribution has been assessed.

In many applications, however, a credible interval emerging from the assumption of a diffuse prior distribution is identical, or nearly identical, to the corresponding confidence interval. There is a difference of interpretation, illustrated in the opening two paragraphs of this article, but in practice many people interpret the classical result in the Bayesian way. There often are numerical similarities between the results of Bayesian and classical analyses of the same data; but there can also be substantial differences, for example, when the prior distribution is nondiffuse and when a genuine null hypothesis is to be tested.

Often it may happen that the problem of vagueness, discussed at some length above, makes a formal Bayesian analysis seem unwise. In this event Bayesian theory may still be of some value in selecting a descriptive analysis or a classical technique that conforms well to the general Bayesian approach, and perhaps in modifying the classical technique. For example, many of the classical developments in sample surveys and analysis of experiments can be given rough Bayesian interpretations when vagueness about the likelihood (as opposed to prior probabilities) prevents a full Bayesian analysis. Moreover, even an abortive Bayesian analysis may contribute insight into a problem.

Bayesian inference has as yet received much less theoretical study than has classical inference. Such commonplace and fundamental ideas of classical statistics as randomization and nonparametric methods require re-examination from the Bayesian view, and this re-examination has scarcely begun. It is hard at this writing to predict how far Bayesian theory will lead in modification and reinterpretation of classical theory. Before a fully Bayesian replacement is available, there is certainly no need to discard those classical techniques that seem roughly compatible with the Bayesian approach; indeed, many classical techniques are, under certain conditions, good approximations to fully Bayesian ones, and useful Bayesian interpretations are now known for almost all classical techniques. From the classical viewpoint, the Bayesian approach often leads to procedures with desirable sampling properties and acts as a stimulus to further theoretical development. In the meanwhile, the interaction between the two approaches promises to lead to fruitful developments in statistical inference; and the Bayesian approach promises to illuminate a number of problems—such as allowance for selectivity—that are otherwise hard to handle.

Harry V. Roberts

*[See also the biography of* Bayes.]

## BIBLIOGRAPHY

*The first book-length development of Bayesian inference, which emphasizes heavily the decision-theoretic foundations of the subject, is* Schlaifer 1959. *A more technical development of the subject is given by* Raiffa & Schlaifer 1961. *An excellent short introduction with an extensive bibliography is* Savage 1962. *A somewhat longer introduction is given by Savage and other contributors in* Joint Statistics Seminar 1959. *This volume also discusses advantages and disadvantages of the Bayesian approach. An interesting application of Bayesian inference, along with a penetrating discussion of underlying philosophy and a comparison with the corresponding classical analysis, is given in* Mosteller & Wallace 1964. *This study gives a specific example of how one might cope with vagueness about the likelihood function. Another example is to be found in* Box & Tiao 1962. *A thorough development of Bayesian inference from the viewpoint of “subjective probability rational” is to be found in* Jeffreys 1939. *A basic paper on fiducial inference is* Fisher 1939.

Bayes, Thomas (1764) 1963 *Facsimiles of Two Papers by Bayes.* New York: Hafner. → Contains “An Essay Toward Solving a Problem in the Doctrine of Chances, With Richard Price’s Foreword and Discussion,” with a commentary by Edward C. Molina, and “A Letter on Asymptotic Series From Bayes to John Canton,” with a commentary by W. Edwards Deming. Both essays first appeared in Volume 53 of the Royal Society of London’s *Philosophical Transactions* and retain the original pagination.

Box, George E. P.; and Tiao, George C. 1962 A Further Look at Robustness Via Bayes’s Theorem. *Biometrika* 49:419–432.

Edwards, Ward; Lindman, Harold; and Savage, Leonard J. 1963 Bayesian Statistical Inference for Psychological Research. *Psychological Review* 70:193–242.

Fisher, R. A. (1939) 1950 The Comparison of Samples With Possibly Unequal Variances. Pages 35.173a–35.180 in R. A. Fisher, *Contributions to Mathematical Statistics.* New York: Wiley. → First published in Volume 9 of the *Annals of Eugenics.*

Jeffreys, Harold (1939) 1961 *Theory of Probability.* 3d ed. Oxford: Clarendon.

Joint Statistics Seminar, University of London 1959 *The Foundations of Statistical Inference.* A discussion opened by Leonard J. Savage at a meeting of the Seminar. London: Methuen; New York: Wiley.

Lindley, Dennis V. 1965 *Introduction to Probability and Statistics From a Bayesian Viewpoint.* 2 vols. Cambridge Univ. Press.

Mosteller, Frederick; and Wallace, David L. 1963 Inference in an Authorship Problem: A Comparative Study of Discrimination Methods Applied to the Authorship of the Disputed Federalist Papers. *Journal of the American Statistical Association* 58:275–309.

Mosteller, Frederick; and Wallace, David L. 1964 *Inference and Disputed Authorship: The Federalist.* Reading, Mass.: Addison-Wesley.

Pratt, John W.; Raiffa, Howard; and Schlaifer, Robert 1964 The Foundations of Decision Under Uncertainty: An Elementary Exposition. *Journal of the American Statistical Association* 59:353–375.

Raiffa, Howard; and Schlaifer, Robert 1961 *Applied Statistical Decision Theory.* Graduate School of Business Administration, Studies in Managerial Economics. Boston: Harvard Univ., Division of Research.

Savage, Leonard J. 1962 Bayesian Statistics. Pages 161–194 in Symposium on Information and Decision Processes, Third, Purdue University, 1961, *Recent Developments in Information and Decision Processes.* Edited by Robert E. Machol and Paul Gray. New York: Macmillan.

Schlaifer, Robert 1959 *Probability and Statistics for Business Decisions: An Introduction to Managerial Economics Under Uncertainty.* New York: McGraw-Hill.
