# Probability

views updated May 29 2018

I. Formal Probability, by Gottfried E. Noether

BIBLIOGRAPHY

II. Interpretations, by Bruno de Finetti

BIBLIOGRAPHY

## I FORMAL PROBABILITY

This article deals with the mathematical side of probability, the calculus of probabilities, as it has sometimes been called. While there is sometimes disagreement on the philosophy and interpretation of probability, there is rather general agreement on its formal structure. Of course, when using the calculus of probabilities in connection with probabilistic models or statistical investigations, the social scientist must decide what interpretation to associate with his probabilistic statements.

### The axiomatic approach

#### The sample space

Mathematically the most satisfactory approach to probability is an axiomatic one. Although a rigorous description of such an approach is beyond the scope of this article, the basic ideas are simple. The framework for any probabilistic investigation is a sample space, a set whose elements, or points, represent the possible outcomes of the “experiment” under consideration. An event, sometimes called a chance event, is represented by a subset of the sample space. For example, in a guessing-sequence experiment, test subjects are instructed to write down sequences of numbers, with each number chosen from among the digits 1, 2, 3. For sequences of length 2, an appropriate sample space is given by the nine number pairs (1,1), (1,2), (2,1), (1,3), (2,2), (3,1), (2,3), (3,2), (3,3). In this sample space the subset consisting of the three pairs (1,1), (2,2), (3,3) stands for the event a run, and the subset consisting of the pairs (1,3), (2,2), (3,1) stands for the event sum equals 4. For convenience of reference, these two events will be denoted more briefly by R and F.
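The sample space and the two events can be enumerated directly. The following is only a minimal sketch in Python; the variable names are illustrative and not part of the article.

```python
from itertools import product

# Sample space: all ordered pairs of digits chosen from 1, 2, 3.
sample_space = set(product((1, 2, 3), repeat=2))

# Events are subsets of the sample space.
R = {pt for pt in sample_space if pt[0] == pt[1]}   # the event "a run"
F = {pt for pt in sample_space if sum(pt) == 4}     # the event "sum equals 4"

print(len(sample_space))  # 9
print(sorted(R))          # [(1, 1), (2, 2), (3, 3)]
print(sorted(F))          # [(1, 3), (2, 2), (3, 1)]
```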

#### The algebra of events

It is helpful to introduce some concepts and notation from the algebra of events. Events (or sets) are denoted by letters A, B, …, with or without subscripts. The event consisting of every experimental outcome, that is, the sample space itself, is denoted by S. The event not A is denoted by $A^c$ (alternate notations sometimes used include $\bar{A}$, CA, and A′), the event A and B by AB (or A ∩ B or A ∧ B), the event A or B by A + B (or A ∪ B or A ∨ B), with corresponding definitions for more than two events. The or in A or B is to be interpreted in the inclusive sense, implying either A or B alone or both A and B. Two events A and B are said to be disjoint (or mutually exclusive) if AB = ϕ, where ϕ = $S^c$ is the impossible event, an event consisting of none of the experimental outcomes. The events A + B and AB are often called, respectively, the union and the intersection of A and B, and the event $A^c$ is often called the negation of A. The events A and $A^c$ are said to be complementary events. Figure 1 is a pictorial representation of two events (sets), A and B, by means of a Venn diagram. The sample space, S, is represented by a rectangle; events A and B partition S into four disjoint parts labeled according to their description in terms of A, $A^c$, B, and $B^c$. Thus, for example, $A^cB^c$ is the part of the sample space that belongs both to $A^c$ and $B^c$. By definition,

$$A + B = AB^c + A^cB + AB.$$

#### The axioms

In an axiomatic treatment, the probability, P(A), of the event A is a number that satisfies certain axioms. The following three axioms are basic. They may be heuristically explained in terms of little weights attached to each point, with the total of all weights one pound and with the probability of an event taken as the sum of the weights attached to the points of that event.

(i) P(A) ≥ 0,

(ii) P(S) = 1,

(iii) P(A + B) = P(A) + P(B), if AB = ϕ.

Thus, no probability should be negative, and the maximum value of any probability should be 1. This second requirement represents an arbitrary though useful normalization. Finally, the probability of the union of two disjoint events should be the sum of the individual probabilities. It turns out that for sample spaces containing only a finite number of sample points no additional assumptions need be made to permit a rigorous development of a probability calculus. In sample spaces with infinitely many sample points the above axiom system is incomplete; in particular, (iii) requires extension to denumerably many events. Mathematically more serious is the possibility that there may be events to which no probability can be assigned without violating the axioms; practically, this possibility is unimportant.

It should be noted that the axiomatic approach assumes only that some probability P(A) is associated with an event A. The axioms do not say how the probability is to be determined in a given case. Any probability assignment that does not contradict the axioms is acceptable.

The problem of assigning probabilities is particularly simple in sample spaces containing a finite number of points, and the weight analogy applies particularly neatly in this case. It is then only necessary to assign probabilities to the individual sample points (the elementary events), making sure that the sum of all such probabilities is one. According to axiom (iii) it follows that the probability of an arbitrary event, A, is equal to the sum of the probabilities of the sample points in A. In particular, if there are n points in the sample space and if it is decided to assign the same probability, 1/n, to each one of them, then the probability of an event A is equal to $m_A(1/n) = m_A/n$, where $m_A$ is the number of sample points in A. For example, in the guessing-sequence experiment the same probability, 1/9 = .111, may be assigned to each one of the nine sample points if it is assumed that each number written down by a test subject is the result of a mental selection of one of three well-shuffled cards marked 1, 2, and 3, each successive selection being uninfluenced by earlier selections. This model will be called the dice model, since it is also appropriate for the rolling of two unconnected and unbiased “three-sided” dice.

An actual guessing-sequence experiment in which 200 test subjects participated produced the results given in Table 1. For example, 13 of the 200 subjects produced sequences in which both digits were 1. Table 2 gives the proportion of cases in each category (for example, 13/200 = .065). Inspection reveals important deviations from the theoretical dice model, deviations that are statistically significant [see Counted Data; Hypothesis testing]. For illustrative purposes an empirical model can be set up in which the entries in Table 2 serve as basic probabilities. In the dice model P(R) = P(F) = 3/9 = .333, while in the empirical model $P_e(R)$ = .065 + .065 + .045 = .175 and $P_e(F)$ = .130 + .065 + .160 = .355. The subscript “e” distinguishes probabilities in the empirical model from those in the dice model.

Table 1. Guessed digit pairs

| First digit | Second digit 1 | Second digit 2 | Second digit 3 | Total |
|---|---|---|---|---|
| 1 | 13 | 27 | 32 | 72 |
| 2 | 25 | 13 | 29 | 67 |
| 3 | 26 | 26 | 9 | 61 |
| Total | 64 | 66 | 70 | 200 |

Source: Private communication from Frederick Mosteller.

Table 2. Proportions of subjects guessing each digit pair

| First digit | Second digit 1 | Second digit 2 | Second digit 3 | Total |
|---|---|---|---|---|
| 1 | .065 | .135 | .160 | .360 |
| 2 | .125 | .065 | .145 | .335 |
| 3 | .130 | .130 | .045 | .305 |
| Total | .320 | .330 | .350 | 1.000 |

Source: Private communication from Frederick Mosteller.
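The two probability assignments can be sketched in Python as weights attached to sample points, with the probability of an event computed as a sum of weights. This is an illustration only; the helper `prob` and the dictionary names are assumptions introduced here, and the counts are those of Table 1.

```python
from fractions import Fraction

# Table 1 counts, keyed by (first digit, second digit); 200 subjects in all.
counts = {
    (1, 1): 13, (1, 2): 27, (1, 3): 32,
    (2, 1): 25, (2, 2): 13, (2, 3): 29,
    (3, 1): 26, (3, 2): 26, (3, 3): 9,
}
total = sum(counts.values())

dice = {pt: Fraction(1, 9) for pt in counts}                      # equal weights
empirical = {pt: Fraction(c, total) for pt, c in counts.items()}  # Table 2

def prob(event, weights):
    """Probability of an event as the sum of the weights of its points."""
    return sum(weights[pt] for pt in event)

R = {pt for pt in counts if pt[0] == pt[1]}   # a run
F = {pt for pt in counts if sum(pt) == 4}     # sum equals 4

print(prob(R, dice), prob(F, dice))                          # 1/3 1/3
print(float(prob(R, empirical)), float(prob(F, empirical)))  # 0.175 0.355
```

Exact fractions are used so that the sums match the tabled proportions without floating-point rounding.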

### Some basic theorems and definitions

#### Probabilities of complementary events

The probability of the event not A is 1 minus the probability of the event A, $P(A^c) = 1 - P(A)$; in particular, the probability of the impossible event is zero, P(ϕ) = 0. This result may seem self-evident, but logically it is a consequence of the axioms.

#### Addition theorem

If A and B are any two events, the probability of the event A + B is equal to the sum of the separate probabilities of events A and B minus the probability of the event AB, P(A + B) = P(A) + P(B) - P(AB). For example, in the dice model P(R + F) = 1/3 + 1/3 - 1/9 = 5/9 = .556, and in the empirical model $P_e(R + F)$ = .175 + .355 - .065 = .465. If A and B are disjoint, P(AB) = 0, so the addition theorem is then just a restatement of axiom (iii).
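As a quick check of the addition theorem in the empirical model, the probability of the union can be computed both directly and from the formula. The sketch below is illustrative; the helper `prob` is an assumption of this example, and the counts come from Table 1.

```python
from fractions import Fraction

counts = {
    (1, 1): 13, (1, 2): 27, (1, 3): 32,
    (2, 1): 25, (2, 2): 13, (2, 3): 29,
    (3, 1): 26, (3, 2): 26, (3, 3): 9,
}
p_e = {pt: Fraction(c, 200) for pt, c in counts.items()}

def prob(event):
    """Empirical-model probability of an event (a set of sample points)."""
    return sum(p_e[pt] for pt in event)

R = {pt for pt in counts if pt[0] == pt[1]}   # a run
F = {pt for pt in counts if sum(pt) == 4}     # sum equals 4

direct = prob(R | F)                              # sum over the union
via_theorem = prob(R) + prob(F) - prob(R & F)     # addition theorem

print(float(direct), float(via_theorem))  # 0.465 0.465
```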

#### Conditional probability

If P(A) is not zero, the conditional probability of the event B given the event A is defined as

$$P(B \mid A) = \frac{P(AB)}{P(A)}. \tag{1}$$

The meaning of this quantity is most easily interpreted in sample spaces with equally likely outcomes. Then $P(B \mid A) = (m_{AB}/n)/(m_A/n) = m_{AB}/m_A$, where $m_{AB}$ is the number of sample points in AB. If A is considered a new sample space and new probabilities, $1/m_A$, are assigned to each of the $m_A$ sample points in A, the probability of the event B in this new sample space is $m_{AB}/m_A = P(B \mid A)$, since $m_{AB}$ of the $m_A$ points in A are in B. The interpretation of P(B|A) as the probability of B in the reduced sample space A is valid generally. Figure 2 shows the two ways of looking at conditional probability. The events AB and A can be considered in the sample space S, as in the first part of Figure 2. It is also possible to consider the event B in the sample space A, as in the second part of Figure 2. For example, $P_e(R \mid F) = P_e(RF)/P_e(F)$ = .065/.355 = .183.
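The two ways of looking at conditional probability can be compared numerically. The following sketch (names are illustrative, counts from Table 1) computes $P_e(R \mid F)$ once as a ratio in the full sample space and once by renormalizing the weights inside the reduced sample space F.

```python
from fractions import Fraction

counts = {
    (1, 1): 13, (1, 2): 27, (1, 3): 32,
    (2, 1): 25, (2, 2): 13, (2, 3): 29,
    (3, 1): 26, (3, 2): 26, (3, 3): 9,
}
p_e = {pt: Fraction(c, 200) for pt, c in counts.items()}

def prob(event):
    return sum(p_e[pt] for pt in event)

R = {pt for pt in counts if pt[0] == pt[1]}   # a run
F = {pt for pt in counts if sum(pt) == 4}     # sum equals 4

# View 1: ratio of probabilities in the original sample space.
ratio = prob(R & F) / prob(F)

# View 2: treat F as the new sample space and rescale its weights to sum to 1.
reduced = {pt: p_e[pt] / prob(F) for pt in F}
in_reduced_space = sum(reduced[pt] for pt in R & F)

print(round(float(ratio), 3), round(float(in_reduced_space), 3))  # 0.183 0.183
```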

#### The multiplication theorem

Equation (1) can be rewritten as P(AB) = P(A)P(B|A). This is the multiplication theorem for probabilities: The probability of the event AB is equal to the product of the probability of A and the conditional probability of B given A. The multiplication theorem is easily extended to more than two events. Thus, for three events A, B, C, P(ABC) = P(A)P(B|A)P(C|AB).

#### Independent events

In general, the conditional probability P(B|A) differs from the probability P(B). Thus, in general, knowledge of whether A has occurred will change the evaluation of the probability of B. If the occurrence of A leaves the probability of B unchanged, that is, if P(B|A) = P(B), B is said to be independent of A. It then follows that A is also independent of B, or simply that A and B are independent. A useful way of defining the independence of A and B is

$$P(AB) = P(A)P(B), \tag{2}$$

that is, A and B are independent just when the probability of the joint occurrence of A and B is equal to the product of the individual probabilities. For example, in the dice model P(RF) = 1/9 = P(R)P(F), while in the empirical model $P_e(RF)$ = .065 ≠ $P_e(R)P_e(F)$ = .062. Thus the events R and F are independent in the dice model but dependent in the empirical model.
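The product criterion is easy to test mechanically. The sketch below is a hedged illustration: the function `independent` is a name introduced here, and exact fractions are used so that equality is checked without rounding.

```python
from fractions import Fraction

counts = {
    (1, 1): 13, (1, 2): 27, (1, 3): 32,
    (2, 1): 25, (2, 2): 13, (2, 3): 29,
    (3, 1): 26, (3, 2): 26, (3, 3): 9,
}
dice = {pt: Fraction(1, 9) for pt in counts}
empirical = {pt: Fraction(c, 200) for pt, c in counts.items()}

def independent(A, B, weights):
    """True when P(AB) equals P(A) * P(B) under the given point weights."""
    def p(event):
        return sum(weights[pt] for pt in event)
    return p(A & B) == p(A) * p(B)

R = {pt for pt in counts if pt[0] == pt[1]}   # a run
F = {pt for pt in counts if sum(pt) == 4}     # sum equals 4

print(independent(R, F, dice))       # True
print(independent(R, F, empirical))  # False
```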

The extension of the concept of independence from two to more than two events produces certain complications. For two independent events, equation (2) remains true if either A or B or both are replaced by the corresponding negations $A^c$ and $B^c$. Put differently, if A and B are independent, so are A and $B^c$, $A^c$ and B, $A^c$ and $B^c$. If the concept of independence of three or more events is to imply a corresponding result (and such a requirement seems desirable), the formal definition of independence becomes cumbersome: The events $A_1, \ldots, A_n$ are said to be independent if the probability of the joint occurrence of any number of them equals the product of the corresponding individual probabilities. Thus, for the independence of three events A, B, C, all of the following four conditions must be satisfied: P(ABC) = P(A)P(B)P(C), P(AB) = P(A)P(B), P(AC) = P(A)P(C), P(BC) = P(B)P(C).

#### Bayes formula

Let A be an event that can occur only in conjunction with one of k mutually exclusive and exhaustive events $H_1, \ldots, H_k$. If A is observed, the probability that it occurs in conjunction with $H_i$ (i = 1, …, k) is equal to

$$P(H_i \mid A) = \frac{P(H_i)P(A \mid H_i)}{P(H_1)P(A \mid H_1) + \cdots + P(H_k)P(A \mid H_k)}, \tag{3}$$

where $P(H_i)$ is the prior probability of the event $H_i$, the probability of $H_i$ before it is known whether A occurs, and $P(A \mid H_i)$ is the (conditional) probability of the event A given $H_i$, so that the denominator of (3) is P(A). The probability (3) is called the posterior probability of $H_i$. It is important to note that application of (3) requires knowledge of prior probabilities. For example, suppose that each of the 200 number pairs of the guessing-sequence experiment has been written on a separate card. In addition, two three-sided dice have been rolled 200 times and the results noted on additional cards. One of the 400 cards is selected at random and found to contain a run. The probability that this card comes from the guessing sequence is (.500)(.175)/[(.500)(.175) + (.500)(.333)] = .344. Thus the appearance of a run changes the prior probability .500 to the posterior probability .344, reflecting the much smaller probability of a run in the empirical model than in the dice model.
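The card example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function name `posterior` is introduced here, and the probability of a run on a dice card is taken to be the theoretical dice-model value 1/3, which is what yields the quoted .344.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Bayes formula: P(H_i | A) for each of the exhaustive hypotheses H_i."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    p_a = sum(joint)                      # denominator of the formula, P(A)
    return [j / p_a for j in joint]

# Hypotheses: card from the guessing experiment, card from the dice rolls.
priors = [Fraction(1, 2), Fraction(1, 2)]
likelihoods = [Fraction(175, 1000), Fraction(1, 3)]   # P(run | source)

post = posterior(priors, likelihoods)
print(post[0], round(float(post[0]), 3))  # 21/61 0.344
```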

#### Combinatorial formulas

The task of counting the number of points in the sample space S and in subsets of S can often be carried out more economically with the help of a few formulas from combinatorial analysis. A permutation is an arrangement of some of the objects in a set in which order is relevant. A combination is a selection in which order is irrelevant. Given n different objects, m ≤ n of them can be selected in

$$\binom{n}{m} = \frac{n!}{m!\,(n-m)!}$$

ways, when no attention is paid to the order. The symbol $\binom{n}{m}$ (called a binomial coefficient and sometimes written ${}_nC_m$) is read as “the number of combinations of n things taken m at a time.” For a positive integer r, r! (read “r-factorial”) stands for the product of the first r positive integers, r! = r(r - 1)⋯1. By definition, 0! = 1. It is possible to arrange m out of n different objects in $(n)_m = n(n-1)\cdots(n-m+1) = n!/(n-m)!$ ways. The symbol $(n)_m$ is read as “the number of permutations of n things taken m at a time.” If the n objects are not all different, in particular, if there are $n_1$ of a first kind, $n_2$ of a second kind, and so on up to $n_k$ of a kth kind such that $n = n_1 + \cdots + n_k$, then there are $n!/(n_1! \cdots n_k!)$ ways in which all n objects can be permuted. For k = 2 this expression reduces to $\binom{n}{n_1}$. See Niven (1965) and Riordan (1958) for combinatorial problems.
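These counting formulas are available in Python's standard `math` module (`comb` and `perm`, Python 3.8 or later). The helper `arrangements` below is a hypothetical name introduced for this sketch; it evaluates the formula for permutations of objects that are not all different.

```python
from math import comb, factorial, perm

def arrangements(*kind_counts):
    """Distinct orderings of n objects with n_1, ..., n_k alike of each kind."""
    result = factorial(sum(kind_counts))
    for c in kind_counts:
        result //= factorial(c)   # each partial quotient is still an integer
    return result

print(comb(5, 2))             # 10 combinations of 5 things taken 2 at a time
print(perm(5, 2))             # 20 permutations of 5 things taken 2 at a time
print(arrangements(2, 3))     # 10, equal to comb(5, 2) in the case k = 2
print(arrangements(1, 2, 2))  # 30
```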

### Random variables

#### Definition

In many chance experiments the chief interest relates to numerical information furnished by the experiment. A numerical quantity whose value is determined by the outcome of a chance experiment (mathematically, a single-valued function defined for every point of the sample space) is called a random variable (abbreviated r.v.). The remainder of this article deals with such variables. For reasons of mathematical simplicity, the discussion will be in terms of r.v.’s that take only a finite number of values. Actually, with suitable modifications the results are valid generally. (Some details are given below.) It is customary to denote r.v.’s by capitals such as X, Y, Z. For example, consider the sample space consisting of the n students in a college, each student having the same probability, 1/n, of being selected for an “experiment.” A possible r.v., X, is the IQ score of a student. A second r.v., Y, is a student’s weight to the nearest five pounds. A third r.v., Z, may take only the value 1 or 2, depending on whether a student is male or female. The example shows that many r.v.’s can be defined on the same sample space.

As a further example, consider the r.v. W, equal to the sum of the two numbers in a guessing sequence of length 2. Table 3 shows the relationship between sample points and values of W. To each point in S there corresponds exactly one value of W. Generally, the reverse is not true.

Table 3. Relation between sample points and values of a random variable

| Sample points | Values of random variable W |
|---|---|
| (1,1), (1,2), (1,3) | 2, 3, 4 |
| (2,1), (2,2), (2,3) | 3, 4, 5 |
| (3,1), (3,2), (3,3) | 4, 5, 6 |

While it is possible to study the calculus of probabilities without specifically defining r.v.’s, formal introduction of such a concept greatly clarifies basic ideas and simplifies notation. If the sample space S in Table 3 actually refers to successive rolls of a three-sided die, possibly with unequal probabilities for the three sides, it is natural to consider only the sum of the two rolls as the event of interest [see Sufficiency]. This can be done by defining separately the five events sum equals 2, sum equals 3, …, sum equals 6 or by considering vaguely a “wandering” variable that is able to take the values 2 or 3 or … or 6. The r.v. W defined above and illustrated in Table 3 combines both approaches in a simple and unambiguous way. Note that the event F can be expressed as the event W = 4.

The advantages of a formal concept are even more pronounced when it is desirable to consider two or more measurements jointly. Thus, in the earlier example an investigator might not be interested in considering weight and sex of students each by itself but as they relate to one another. This is accomplished by considering for every point in the sample space the number pair (Y,Z) where Y and Z are the r.v.’s defined earlier. Perhaps most important, the concept of r.v. permits more concise probability statements than the ones associated with the basic sample space.

#### Frequency function of a random variable

Let X be a random variable with possible values $x_1, \ldots, x_k$. Define a function f(x) such that for $x = x_i$ (i = 1, …, k), $f(x_i)$ is the probability of the event that the basic experiment results in an outcome for which the r.v. X takes the value $x_i$, $f(x_i) = P(X = x_i)$. Clearly, $f(x_1) + \cdots + f(x_k) = 1$. The function f(x) defined in this way is called the frequency function of X. For example, in the dice model the frequency function of W is

| w | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|
| f(w) | 1/9 | 2/9 | 3/9 | 2/9 | 1/9 |
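The dice-model frequency function of W can be obtained by counting how many of the nine equally likely sample points give each sum. The following is a minimal illustrative sketch; the variable names are assumptions of the example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sample_space = list(product((1, 2, 3), repeat=2))

# Count the sample points mapped to each value of W = first + second digit.
tally = Counter(a + b for a, b in sample_space)

# Each point has weight 1/9 in the dice model.
f = {w: Fraction(m, len(sample_space)) for w, m in sorted(tally.items())}

for w, p in f.items():
    print(w, p)   # 2 1/9, 3 2/9, 4 1/3, 5 2/9, 6 1/9 (one pair per line)
```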

#### Mean and variance of a random variable

The expected value of the r.v. X, which is denoted by E(X) (or by EX or $\mu_X$ or simply μ if there is no ambiguity), is defined as the weighted average $E(X) = x_1 f(x_1) + \cdots + x_k f(x_k)$. E(X) is also called the mean of X. More generally, if H(x) is a function of x, the expected value of the r.v. H(X) is given by

$$E[H(X)] = H(x_1)f(x_1) + \cdots + H(x_k)f(x_k). \tag{4}$$

The function $H(x) = (x - \mu)^2$ is of particular interest and usefulness. Its expected value is called the variance of X and is denoted by var X (or $\sigma_X^2$ or $\sigma^2$). The positive square root of the variance is called the standard deviation and is denoted by s.d. X (or $\sigma_X$ or σ).

If the r.v. Y is a linear function of the r.v. X, Y = a + bX, where a and b are constants, then E(Y) = a + bE(X), var Y = $b^2$ var X, and s.d. Y = |b| s.d. X, where |b| denotes the numerical value of b without regard to sign.
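As an illustration, the mean and variance of W in the dice model (frequency function 1/9, 2/9, 3/9, 2/9, 1/9 on the values 2 to 6) can be computed as weighted averages, together with those of a linear function of W. The constants a = 10 and b = -3 and the helper `expect` are arbitrary choices made for this sketch.

```python
from fractions import Fraction

# Dice-model frequency function of W, the sum of the two digits.
f = {2: Fraction(1, 9), 3: Fraction(2, 9), 4: Fraction(3, 9),
     5: Fraction(2, 9), 6: Fraction(1, 9)}

def expect(H):
    """E[H(W)] as the weighted average of H over the values of W."""
    return sum(H(x) * p for x, p in f.items())

mean = expect(lambda x: x)
var = expect(lambda x: (x - mean) ** 2)

# A linear function Y = a + b*W (a and b are arbitrary illustrative constants).
a, b = 10, -3
mean_y = expect(lambda x: a + b * x)
var_y = expect(lambda x: (a + b * x - mean_y) ** 2)

print(mean, var)       # 4 4/3
print(mean_y, var_y)   # -2 12
```

The output agrees with the rules for linear functions: the mean of Y is 10 - 3(4) and the variance of Y is 9 times that of W.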

The mean and variance of a r.v. are two important characteristics. Their true theoretical significance emerges in connection with such advanced theorems as the law of large numbers and the central limit theorem (see below). The useful additive property of means and variances is also stated below. On an elementary level, the mean and the variance are useful descriptive or summary measures [see Statistics, descriptive, article on Location and dispersion].

The mean is but one of the “averages” that can be computed from the values $x_1, \ldots, x_k$ of the r.v. Another average is the median, Med, defined by the two inequalities P(X ≤ Med) ≥ 1/2 and P(X ≥ Med) ≥ 1/2. Thus the median, which may not be uniquely defined, is a number that cuts in half, as nearly as possible, the frequency function of X.

The standard deviation is a measure of variability or spread around the mean, a small standard deviation indicating little variability among the possible values of the r.v., a large standard deviation indicating considerable variability. This rather vague statement is made more precise by the Bienaymé-Chebyshev inequality. This inequality establishes a connection between the size of the standard deviation and the concentration of probability in intervals centered at the mean.

#### The Bienaymé-Chebyshev inequality

Let δ be an arbitrary positive constant. The probability that a r.v. X takes a value that deviates from its mean by less than δ standard deviations is at least $1 - 1/\delta^2$, $P(|X - \mu| < \delta\sigma) \ge 1 - 1/\delta^2$. This is one of the forms of the Bienaymé-Chebyshev inequality (often called the Chebyshev inequality). Often the complementary result is more interesting. The probability that a r.v. X takes a value that deviates from its mean by at least δ standard deviations is at most $1/\delta^2$, $P(|X - \mu| \ge \delta\sigma) \le 1/\delta^2$. Although there are r.v.’s for which the inequality becomes an equality, for most r.v.’s occurring in practice, the probability of large deviations from the mean is considerably smaller than the upper limit indicated by the inequality. An example is given below. The Chebyshev inequality illustrates the fact that for probabilistic purposes the standard deviation is the natural unit of measurement.
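The inequality can be checked on a small case. The sketch below uses the dice-model frequency function of W (1/9, 2/9, 3/9, 2/9, 1/9 on the values 2 to 6) and compares exact tail probabilities with the bound; the function name `tail` and the choice of δ values are assumptions of the illustration.

```python
from fractions import Fraction
from math import sqrt

f = {2: Fraction(1, 9), 3: Fraction(2, 9), 4: Fraction(3, 9),
     5: Fraction(2, 9), 6: Fraction(1, 9)}

mean = sum(x * p for x, p in f.items())
sigma = sqrt(sum((x - mean) ** 2 * p for x, p in f.items()))

def tail(delta):
    """Exact P(|W - mean| >= delta * sigma) from the frequency function."""
    return sum((p for x, p in f.items() if abs(x - mean) >= delta * sigma),
               Fraction(0))

for delta in (1.5, 2.0):
    # Columns: delta, exact tail probability, Chebyshev upper limit 1/delta^2.
    print(delta, float(tail(delta)), 1 / delta ** 2)
```

In both cases the exact probability is well below the Chebyshev limit, which is typical of distributions met in practice.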

#### The binomial distribution

Consider an experiment that has only two possible outcomes, called success and failure. The word trial will be used to denote a single performance of such an experiment. A possible sample space for describing the results of n trials consists of all possible sequences of length n of the type FFSSS…F, where S stands for success and F for failure. A natural r.v. defined on this sample space is the total number X of successes [see Sufficiency]. If successive trials are independent and the probability of success in any given trial is a constant p, then

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \ldots, n. \tag{5}$$

The r.v. X is called the binomial r.v. and the frequency function (5) the binomial distribution. [See Distributions, Statistical for a discussion of the binomial distribution and other distributions mentioned below.] The expected or mean value of X is np and the standard deviation is $\sqrt{np(1-p)}$. According to the Chebyshev inequality the upper limit for the probability that a r.v. deviates from its mean by two standard deviations or more is .25. For the binomial variable with n = 50 and p = 1/2, for example, the exact probability is only .033, considerably smaller than the Chebyshev limit.
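The comparison with the Chebyshev limit can be reproduced numerically. In the sketch below the success probability is taken to be p = 1/2; this value is an assumption of the example, chosen because it reproduces the quoted figure of .033 for n = 50. The function name `binom_pmf` is likewise illustrative.

```python
from math import comb, sqrt

def binom_pmf(x, n, p):
    """Binomial frequency function: probability of exactly x successes."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 50, 0.5                      # p = 1/2 is assumed for this illustration
mean = n * p
sd = sqrt(n * p * (1 - p))

# Exact probability of deviating from the mean by two standard deviations or more.
tail = sum(binom_pmf(x, n, p) for x in range(n + 1) if abs(x - mean) >= 2 * sd)

print(round(tail, 3))   # 0.033, against a Chebyshev limit of 0.25
```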

#### Joint frequency functions

Let X and Y be two r.v.’s defined on the same sample space. Let X take the values $x_1, \ldots, x_k$, and let Y take the values $y_1, \ldots, y_h$. What then is the probability of the joint event $X = x_i$ and $Y = y_j$? The function f(x,y) such that $f(x_i, y_j) = P(X = x_i \text{ and } Y = y_j)$ (i = 1, …, k; j = 1, …, h) is called the bivariate frequency function of X and Y. In terms of f(x,y) the marginal frequency function of X is given by $f(x_i) = f(x_i, y_1) + \cdots + f(x_i, y_h)$ and, similarly, the marginal frequency function of Y by $g(y_j) = f(x_1, y_j) + \cdots + f(x_k, y_j)$. Marginal distributions are used to make probability statements that involve only one of the variables without regard to the value of the other variable. A different situation arises if the value of one of the variables becomes known. In that case, probability statements involving the other variable should be conditional on what is known. If in (1), A and B are defined as the events $X = x_i$ and $Y = y_j$, respectively, the conditional frequency function of the r.v. Y, given that X has the value $x_i$, is found to be $g(y_j \mid x_i) = P(Y = y_j \mid X = x_i) = f(x_i, y_j)/f(x_i)$, with a corresponding expression for the conditional frequency function of X given Y.

For example, in the empirical model for the guessing-sequence experiment, associate a r.v. X with the first guess and a r.v. Y with the second guess. Then f(x,y) = $P_e$(the first guess is x and the second guess is y), x, y = 1, 2, 3. In Table 2 the last column on the right represents the marginal distribution of X and the last row on the bottom the marginal distribution of Y. It is noteworthy that these marginal distributions do not differ very much from the marginal distributions for the dice model, in which all probabilities equal 1/3. As an example of a conditional probability, note that $g(1 \mid 1) = P_e(Y = 1 \mid X = 1)$ = .065/.360 = .181, a result that differs considerably from the corresponding probability 1/3 for the dice model.

The two r.v.’s X and Y are said to be independent (or independently distributed) if the joint frequency function of X and Y can be written as the product of the two marginal frequency functions,

$$f(x, y) = f(x)\,g(y). \tag{6}$$

If X and Y are independent r.v.’s, knowledge of the value of one of the variables, provided that f(x,y) is known, furnishes no information about the other variable, since if (6) is true, the conditional frequency functions equal the marginal frequency functions, f(x|y) = f(x) and g(y|x) = g(y), whatever the value of the conditioning variable. The r.v.’s X and Y in the guessing-sequence example are independent in the dice model and dependent in the empirical model.

In addition to the means $\mu_X$ and $\mu_Y$ and the variances $\sigma_X^2$ and $\sigma_Y^2$ of the marginal frequency functions one defines the covariance,

$$\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)] = \sum_{i=1}^{k}\sum_{j=1}^{h} (x_i - \mu_X)(y_j - \mu_Y)\, f(x_i, y_j),$$

and the correlation coefficient, $\rho = \sigma_{XY}/(\sigma_X\sigma_Y)$. If X and Y are independent, their covariance and, consequently, their correlation are zero. The reverse, however, is not necessarily true; two r.v.’s, X and Y, can be uncorrelated (that is, can have correlation coefficient zero) without being independent.

The covariance of two r.v.’s is a measure of the “co-variability” of the two variables about their respective means; a positive covariance indicates a tendency of the two variables to deviate in the same direction, while a negative covariance indicates a tendency to deviate in opposite directions. Although the covariance does not depend on the zero points of the scales in which X and Y are measured, it does depend on the units of measurement. By dividing the covariance by the product of the standard deviations a normalization is introduced making the resulting correlation coefficient independent of the units of measurement as well. (If, for example, X and Y represent temperature measurements, the correlation between X and Y is the same whether temperatures are measured as degrees Fahrenheit or as degrees centigrade.) The concepts of covariance and correlation are closely tied to linear association. One may have very strong (even complete) nonlinear association and very small (even zero) correlation.
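For the guessing-sequence data, the covariance and correlation of the first guess X and the second guess Y can be computed from the joint frequency function. The sketch below is illustrative only; it takes the joint probabilities from the Table 1 counts, and the helper `expect` is a name introduced here.

```python
from fractions import Fraction
from math import sqrt

counts = {
    (1, 1): 13, (1, 2): 27, (1, 3): 32,
    (2, 1): 25, (2, 2): 13, (2, 3): 29,
    (3, 1): 26, (3, 2): 26, (3, 3): 9,
}
f = {pt: Fraction(c, 200) for pt, c in counts.items()}   # joint f(x, y)

def expect(H):
    """E[H(X, Y)] under the joint frequency function f."""
    return sum(H(x, y) * p for (x, y), p in f.items())

mu_x = expect(lambda x, y: x)
mu_y = expect(lambda x, y: y)
var_x = expect(lambda x, y: (x - mu_x) ** 2)
var_y = expect(lambda x, y: (y - mu_y) ** 2)
cov = expect(lambda x, y: (x - mu_x) * (y - mu_y))
rho = float(cov) / sqrt(float(var_x) * float(var_y))

print(float(mu_x), float(mu_y), float(cov))  # 1.945 2.03 -0.17835
print(round(rho, 2))                         # -0.27
```

The negative sign reflects the subjects' tendency to avoid repeating the first digit, so that a high first guess tends to go with a lower second guess.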

The concepts discussed in this section generalize from two to more than two variables.

### Sums of random variables

#### Mean and variance of a sum of random variables

Let $X_1, \ldots, X_n$ be a set of n r.v.’s with means $E(X_i) = \mu_i$, variances $\sigma_i^2$, and covariances $E[(X_i - \mu_i)(X_j - \mu_j)] = \sigma_{ij}$ (i, j = 1, …, n; i ≠ j). Some of the most fruitful studies in the calculus of probabilities are concerned with the properties of sums of the type $Z = c_1X_1 + \cdots + c_nX_n$, where the $c_i$ are given constants. In this article only some of the simpler, although nevertheless highly important, results will be stated.

The mean and variance of Z are

$$E(Z) = \sum_{i=1}^{n} c_i\mu_i, \qquad \mathrm{var}\, Z = \sum_{i=1}^{n} c_i^2\sigma_i^2 + \sum_{i \ne j} c_ic_j\sigma_{ij}.$$

For the remainder assume that the r.v.’s $X_1, \ldots, X_n$ are independently and identically distributed. This is the mathematical model assumed for many statistical investigations. The common mean of the r.v.’s $X_1, \ldots, X_n$ is denoted by μ and the common variance by $\sigma^2$. Of particular interest are the two sums $S_n = X_1 + \cdots + X_n$ and $\bar{X} = S_n/n$. (The binomial r.v. X is of the form $S_n$ if a r.v. $X_i$ is associated with the ith trial, $X_i$ taking the value 1 or 0, depending on whether the ith trial results in success or failure.) For the sum $S_n$, $E(S_n) = n\mu$ and var $S_n = n\sigma^2$; for the mean $\bar{X}$, $E(\bar{X}) = \mu$ and var $\bar{X} = \sigma^2/n$.

#### The law of large numbers

If Chebyshev’s inequality is applied to $\bar{X}$, then for arbitrarily small positive δ

$$P(|\bar{X} - \mu| < \delta) \ge 1 - \frac{\sigma^2}{n\delta^2}.$$

By choosing n sufficiently large, the right side can be made to differ from 1 by as little as desired. It follows that the probability that $\bar{X}$ deviates from μ by more than some arbitrarily small positive quantity δ can be made as small as desired. In particular, in the case of the binomial variable X, the probability that the observed success ratio, $\bar{X} = X/n$, deviates from the probability, p, of success in a single trial by more than δ can be made arbitrarily small by performing a sufficiently large number of trials. This is the simplest version of the celebrated law of large numbers, more commonly known, and misinterpreted, as the “law of averages.” The law of large numbers does not imply that the observed number of successes X necessarily deviates little from the expected number of successes np, only that the relative frequency of success X/n is close to p. Nor does the law of large numbers imply that, given $\bar{X} > p$ after n trials, the probability of success on subsequent trials is small in order to compensate for an excess of successes among the first n trials. Nature “averages out” by swamping, not by fluctuating.
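The shrinking probability of a large deviation of the success ratio X/n from p can be computed exactly for a binomial variable and compared with the Chebyshev bound p(1 - p)/(nδ²). The sketch below is an illustration under assumed values p = 1/2 and δ = 1/10; the function name `prob_far` is introduced here.

```python
from fractions import Fraction
from math import comb

def prob_far(n, p, delta):
    """Exact P(|X/n - p| > delta) for a binomial X with n trials."""
    return sum((comb(n, k) * p ** k * (1 - p) ** (n - k)
                for k in range(n + 1) if abs(Fraction(k, n) - p) > delta),
               Fraction(0))

p, delta = Fraction(1, 2), Fraction(1, 10)   # assumed illustrative values
for n in (10, 100, 1000):
    exact = prob_far(n, p, delta)
    bound = p * (1 - p) / (n * delta ** 2)   # Chebyshev upper limit
    # Columns: n, exact probability, Chebyshev bound.
    print(n, float(exact), float(bound))
```

Both columns decrease toward zero as n grows, with the exact probability falling much faster than the bound.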

In more advanced treatments this law is called the weak law of large numbers, to distinguish it from a stronger form.

#### The central limit theorem

The law of large numbers has more theoretical than practical significance, since it does not furnish precise or even approximate probabilities in any given situation. Such information is, however, provided by the central limit theorem: As n increases indefinitely, the distribution function of the standardized variable $(S_n - n\mu)/(\sigma\sqrt{n})$ converges to the so-called standard normal distribution. (A general discussion of the normal distribution is given below.) For practical purposes the stated result means that for large n the probability that $S_n$ takes a value between two numbers a < b can be obtained approximately as the area under the standard normal curve between the two points $(a - n\mu)/(\sigma\sqrt{n})$ and $(b - n\mu)/(\sigma\sqrt{n})$. In particular, the probability that in a binomial experiment the number of successes is at least $k_1$ and at most $k_2$ (where $k_1$ and $k_2$ are integers) is approximately equal to the area under the normal curve between the limits $(k_1 - \tfrac{1}{2} - np)/\sqrt{np(1-p)}$ and $(k_2 + \tfrac{1}{2} - np)/\sqrt{np(1-p)}$, provided n is sufficiently large. Here a continuity correction of 1/2 has been used in order to improve the approximation. For most practical purposes n may be assumed to be “sufficiently large” if np(1 - p) is at least 3.
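The quality of the normal approximation with continuity correction can be examined by comparing it with the exact binomial sum. The following sketch assumes the illustrative values n = 50, p = 1/2, k1 = 20, k2 = 30 (so that np(1 - p) = 12.5); the function names are hypothetical, and the normal distribution function is built from `math.erf`.

```python
from math import comb, erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def binom_range_exact(k1, k2, n, p):
    """Exact P(k1 <= X <= k2) for a binomial X."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(k1, k2 + 1))

def binom_range_normal(k1, k2, n, p):
    """Normal approximation with a continuity correction of 1/2."""
    mean, sd = n * p, sqrt(n * p * (1 - p))
    return normal_cdf((k2 + 0.5 - mean) / sd) - normal_cdf((k1 - 0.5 - mean) / sd)

n, p, k1, k2 = 50, 0.5, 20, 30   # assumed illustrative values
exact = binom_range_exact(k1, k2, n, p)
approx = binom_range_normal(k1, k2, n, p)
print(round(exact, 3), round(approx, 3))
```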

The central limit theorem occupies a basic position not only in theory but also in application. The sample observations $x_1, \ldots, x_n$ drawn by the statistician may be looked upon as realizations of n jointly distributed random variables $X_1, \ldots, X_n$. It is customary to refer to a function of sample observations as a statistic. From this point of view, a statistic is a r.v. and its distribution function is called the sampling distribution of the statistic. The problem of determining the sampling distributions of statistics of interest to the statistician is one of the important problems of the calculus of probabilities. The central limit theorem states that under very general conditions the sampling distribution of the statistic $S_n$ can, for sufficiently large samples, be approximated in a suitable manner by the normal distribution.

More complicated versions of the law of large numbers and the central limit theorem exist for the case of r.v.’s that are not identically distributed and are even dependent to some extent.

#### A more general view

A more general view of random variables and their distributions will now be presented.

Discrete and continuous random variables. For reasons of mathematical simplicity the discussion so far has been in terms of r.v.’s that take only a finite number of values. Actually this limitation was used explicitly only when giving such definitions as that of an expected value. Theorems like Chebyshev’s inequality, the law of large numbers, and the central limit theorem were formulated without mention of a finite number of values. Indeed, they are true for very general r.v.’s. The remainder of this article will be concerned with such r.v.’s. Of necessity the mathematical tools have to be of a more advanced nature. (For infinite sample spaces, there arises the need for a concept called measurability in the discussion of events and of random variables. For simplicity such discussion is omitted here.)

By definition, a r.v. X is a single-valued function defined on a sample space. For every number x (-∞ < x < ∞), the probability that X takes a value that is smaller than or equal to x can be determined. Let F(x) denote this probability considered as a function of x, F(x) = P(X ≤ x). F(x) is called the (cumulative) distribution function of the r.v. X. The following properties of a distribution function are consequences of the definition: F(-∞) = 0; F(∞) = 1; F(x) is monotonically nondecreasing, that is, $F(x_1) \le F(x_2)$ if $x_1 < x_2$. Furthermore, $F(x_2) - F(x_1) = P(x_1 < X \le x_2)$. Such a function may be continuous or discontinuous. If discontinuous, it has at most a denumerable number of discontinuities, at each of which F(x) has a simple jump, or saltus. The height of this jump is equal to the probability with which the r.v. X takes the value x where the discontinuity occurs. At the same time, if F(x) is continuous at x, the probability of the event X = x is zero.

Let F(x) be discontinuous with discontinuities occurring at the points $x_1, x_2, \ldots, x_n, \ldots$. (If there are only a finite number of discontinuities, denote their number by n.) Let the size of the jump occurring at $x = x_i$ (i = 1, 2, …, n, …) be equal to $f(x_i)$. A particularly simple case occurs if $f(x_1) + f(x_2) + \cdots + f(x_n) + \cdots = 1$. In this case F(x) is a “step function” and $x_1, x_2, \ldots, x_n, \ldots$ are the only values taken by the r.v. X. Such a r.v. is said to be discrete. Clearly the r.v.’s considered earlier are discrete r.v.’s with a finite number of values. As before, call f(x) the frequency function of the r.v. X. In terms of f(x), F(x) is given by $F(x) = \sum f(x_i)$, where -∞ < x < ∞ and the summation extends over all $x_i$ that are smaller than or equal to x.
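For a discrete r.v. the step-function character of F(x) is easy to see in a small computation. The sketch below, an illustration with hypothetical names, builds F from the dice-model frequency function of W (1/9, 2/9, 3/9, 2/9, 1/9 on the values 2 to 6).

```python
from fractions import Fraction

f = {2: Fraction(1, 9), 3: Fraction(2, 9), 4: Fraction(3, 9),
     5: Fraction(2, 9), 6: Fraction(1, 9)}

def F(x):
    """Distribution function: sum of f(x_i) over all x_i <= x."""
    return sum((p for xi, p in f.items() if xi <= x), Fraction(0))

print(F(1), F(2), F(3.5), F(6))   # 0 1/9 1/3 1

# The jump of F at a point of discontinuity equals the probability of that value.
print(F(4) - F(3.999))            # 1/3
```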

If F(x) is continuous for all x, X is said to be a continuous r.v. Consider the case where there exists a function f(x) such that $F(x) = \int_{-\infty}^{x} f(t)\,dt$. (In statistical applications this restriction is of little importance.) The function f(x) = dF(x)/dx is called the density function of the continuous r.v. X. Clearly $\int_{-\infty}^{\infty} f(x)\,dx = 1$. Furthermore, $P(x_1 < X < x_2) = P(x_1 \le X \le x_2) = F(x_2) - F(x_1) = \int_{x_1}^{x_2} f(x)\,dx$, since for a continuous r.v. X, P(X = x) = 0 for every x. It follows that for a continuous r.v. X with density function f(x) the probability that X takes a value between two numbers x1 and x2 is given by the area between the curve representing f(x) and the x-axis and bounded by the ordinates at x1 and x2 (Figure 3). Although r.v.’s that are part continuous and part discrete occur, they will not be considered here.
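
The interpretation of a probability as an area under the density curve can be checked numerically. The following is a minimal sketch that assumes, purely for illustration, the density f(x) = exp(–x) for x ≥ 0; the function names and the midpoint rule are likewise choices made here.

```python
import math

def density(x):
    """Hypothetical density: f(x) = exp(-x) for x >= 0, and 0 otherwise."""
    return math.exp(-x) if x >= 0 else 0.0

def distribution(x):
    """Matching distribution function: F(x) = 1 - exp(-x) for x >= 0."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def area(f, x1, x2, steps=100_000):
    """Midpoint-rule approximation of the area under f between x1 and x2."""
    width = (x2 - x1) / steps
    return sum(f(x1 + (i + 0.5) * width) for i in range(steps)) * width

x1, x2 = 0.5, 2.0
prob_from_area = area(density, x1, x2)
prob_from_F = distribution(x2) - distribution(x1)
print(prob_from_area, prob_from_F)
```

The two printed numbers should agree to many decimal places, one obtained from the area and the other from the difference of two values of the distribution function.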

For two continuous r.v.’s defined on the same sample space, stipulate the existence of a bivariate density function f(x,y) such that

$$P(X \le x,\ Y \le y) = \int_{-\infty}^{x}\int_{-\infty}^{y} f(s,t)\,dt\,ds.$$

In terms of f(x,y), marginal and conditional density functions can be defined as in the case of two discrete variables.

Let H(x) be a function of x. An interesting and important problem is concerned with the distribution of the derived random variable H(X). Here only the expected value of H(X) will be discussed. E[H(X)] can be expressed in terms of the distribution function F(x) by means of the Stieltjes integral

$$E[H(X)] = \int_{-\infty}^{\infty} H(x)\,dF(x). \tag{7}$$

For discrete r.v.’s, (7) reduces to E[H(X)] = ∑H(xi)f(xi). If X has density function f(x), (7) becomes $E[H(X)] = \int_{-\infty}^{\infty} H(x) f(x)\,dx$. One important new factor arises that was not present in (4), the question of existence. In (7) the expected value exists if and only if the corresponding sum or integral converges absolutely. This condition means that in the discrete case, for example, E[H(X)] exists just when ∑|H(xi)|f(xi) converges.

If H(x) = x^k, k = 1, 2, …, the corresponding expected value is called the kth moment about the origin and is denoted by $\mu'_k = E(X^k)$. Of particular interest is the first moment, $\mu'_1$, which was written earlier as μ. The kth central moment, μk, is defined as $\mu_k = E[(X - \mu)^k]$. The first central moment is zero. The second central moment is the variance, denoted by σ². The moments are of interest because of the information that they provide about the distribution function. Thus the Chebyshev inequality shows the kind of information provided by the first two moments. Additional moments provide more and more precise information. Finally, in many circumstances, knowledge of the moments of all orders uniquely determines the distribution function.
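
For a discrete r.v. with finitely many values, the expected value of H(X) and the moments built from it are plain weighted sums. The sketch below again assumes a fair die as the frequency function; `expected_value` is a name introduced here for illustration.

```python
from fractions import Fraction

def expected_value(H, f):
    """E[H(X)]: sum of H(x_i) f(x_i) for a discrete r.v. with finitely many values."""
    return sum(H(x) * p for x, p in f.items())

# Hypothetical frequency function: a fair six-sided die.
f = {x: Fraction(1, 6) for x in range(1, 7)}

mean = expected_value(lambda x: x, f)                         # first moment about the origin
second_moment = expected_value(lambda x: x**2, f)             # second moment about the origin
first_central = expected_value(lambda x: x - mean, f)         # first central moment
variance = expected_value(lambda x: (x - mean) ** 2, f)       # second central moment

print(mean, second_moment, first_central, variance)
```

Exact fractions are used so that the identity variance = second moment minus the square of the mean can be verified without rounding error.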

#### Generating functions

Rather than compute moments from their definitions, it is often simpler to make use of a generating function. The moment-generating function, M(u), is defined as the expected value of the random variable $e^{uX}$, where u is a real variable. The characteristic function is defined as the expected value of $e^{iuX}$, where u again is real and $i = \sqrt{-1}$. The characteristic function has the advantage that it always exists. The moment-generating function exists only for r.v.’s that have moments of all orders. For k = 1, 2, …, the kth moment can be found as the kth derivative of M(u) evaluated at u = 0. A corresponding result holds for characteristic functions.
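
The moment-generating property can be illustrated numerically. This sketch assumes a fair die as the r.v. and approximates the derivatives of M(u) at u = 0 by finite differences; the step size `h` is an arbitrary choice made here.

```python
import math

# Hypothetical frequency function: a fair six-sided die.
f = {x: 1 / 6 for x in range(1, 7)}

def mgf(u):
    """M(u): expected value of exp(u X) for the discrete r.v. above."""
    return sum(math.exp(u * x) * p for x, p in f.items())

h = 1e-4  # finite-difference step
first_derivative = (mgf(h) - mgf(-h)) / (2 * h)
second_derivative = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h**2

# Moments computed directly from their definitions, for comparison.
first_moment = sum(x * p for x, p in f.items())
second_moment = sum(x**2 * p for x, p in f.items())

print(first_derivative, first_moment)
print(second_derivative, second_moment)
```

Up to the error of the finite-difference approximation, each derivative matches the corresponding moment about the origin.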

While the moment-generating property of the moment-generating function is useful, its main significance arises from the uniqueness theorem. A moment-generating function uniquely determines its distribution function, and it is often easier to find the moment-generating function of a r.v. than its distribution function. As an example, consider the r.v. Z = X + Y, where X and Y are two independently distributed r.v.’s. It follows from the definition of a moment-generating function that the moment-generating function of Z is the product of the moment-generating functions of X and Y. No such simple relationship exists between the distribution function of Z and those of X and Y. However, once the moment-generating function of Z is known, it is theoretically possible to determine its distribution function. These results also hold for characteristic functions.
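
The product rule for sums of independent r.v.’s follows in one line from the definition. A sketch, writing $M_X$, $M_Y$, and $M_Z$ for the three moment-generating functions (the subscript notation is assumed here for illustration):

$$M_Z(u) = E\left[e^{u(X+Y)}\right] = E\left[e^{uX}e^{uY}\right] = E\left[e^{uX}\right]E\left[e^{uY}\right] = M_X(u)\,M_Y(u).$$

The third equality is where independence enters: the expected value of a product of functions of independent r.v.’s is the product of their expected values.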

#### The Poisson and normal distributions

In conclusion two examples will be given, one involving a discrete r.v., the other a continuous r.v. When considering random events occurring in time, one is often interested in the total number of occurrences in an interval of given length. An example is the number of suicides occurring in a community in a year’s time. Then a discrete r.v. X with possible values 0, 1, 2, … can be defined. Often an appropriate mathematical model is given by the Poisson distribution, according to which

$$P(X = x) = \frac{e^{-\lambda}\lambda^{x}}{x!}, \qquad x = 0, 1, 2, \ldots,$$

where λ is a characteristic of the type of random event considered. The moment-generating function of X is $M(u) = \exp[\lambda(e^{u} - 1)]$, where exp w stands for $e^{w}$. Then μ = λ = σ². Thus λ is the mean number of occurrences in the given time interval and, at the same time, is also the variance of the number of occurrences. Furthermore, the sum of two independent Poisson variables is again a Poisson variable whose parameter λ is the sum of the parameters of the two independent variables.
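
These properties of the Poisson distribution can be checked numerically. The sketch below assumes the usual Poisson frequency function exp(–λ)λ^x/x!; the parameter values and the truncation point `N` are arbitrary illustrative choices.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) = exp(-lam) * lam**x / x!  for x = 0, 1, 2, ..."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 2.5  # hypothetical parameter
N = 100    # truncation point; the neglected tail is negligible for this lam

mean = sum(x * poisson_pmf(x, lam) for x in range(N))
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in range(N))
print(mean, variance)

# Sum of two independent Poisson variables: combine their frequency functions.
lam1, lam2 = 1.0, 1.5

def pmf_of_sum(z):
    """P(X + Y = z) for independent Poisson X (lam1) and Y (lam2)."""
    return sum(poisson_pmf(x, lam1) * poisson_pmf(z - x, lam2) for x in range(z + 1))

for z in range(5):
    print(z, pmf_of_sum(z), poisson_pmf(z, lam1 + lam2))
```

Both the mean and the variance come out equal to the parameter, and the frequency function of the sum agrees with that of a single Poisson variable whose parameter is the sum of the two.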

The Poisson distribution serves as an excellent approximation for binomial probabilities, if the probability of success, p, is small. More exactly, if np is set equal to λ, the binomial probability in (5) can be approximated by $e^{-\lambda}\lambda^{x}/x!$, provided n is sufficiently large. This approximation is particularly useful when only the product np is known but not the values of n and p separately.
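
A quick numerical comparison shows how close the approximation can be. The values of n and p below are hypothetical, and the binomial probability is assumed to have its usual form, the number of combinations times p^x (1 – p)^(n – x).

```python
import math

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """Poisson probability exp(-lam) * lam**x / x!."""
    return math.exp(-lam) * lam**x / math.factorial(x)

n, p = 1000, 0.003  # hypothetical: many trials, small probability of success
lam = n * p

for x in range(8):
    print(x, round(binomial_pmf(x, n, p), 6), round(poisson_pmf(x, lam), 6))

max_abs_diff = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(21)
)
print(max_abs_diff)
```

Only the product np enters the Poisson column, which is why the approximation remains usable when n and p are not known separately.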

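A continuous r.v. with density function

$$f(x) = \frac{1}{\sqrt{2\pi}\,b}\exp\left[-\frac{(x-a)^{2}}{2b^{2}}\right], \qquad -\infty < x < \infty,$$

is said to have the normal distribution with parameters a and b > 0. The moment-generating function of such a variable is $\exp\left(au + \tfrac{1}{2}b^{2}u^{2}\right)$, from which μ = a, σ² = b². It is therefore customary to write the density as

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right].$$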

A particular normal density function is obtained by setting μ = 0 and σ² = 1. This simple function is called the unit (or standard) normal density function; it appeared above in the discussion of the central limit theorem. The sum of two or more independent normal variables is again normally distributed with mean and variance equal to the sums of means and variances, respectively.
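
The statement about sums of independent normal variables can be probed by simulation. The sketch below uses hypothetical parameters and a fixed seed, and it checks only the mean and variance of the sum, not the normal shape itself.

```python
import random

random.seed(0)  # fixed seed so that the run is repeatable

mu1, sigma1 = 1.0, 2.0    # hypothetical mean and standard deviation of X
mu2, sigma2 = -3.0, 3.0   # hypothetical mean and standard deviation of Y
n = 200_000

# Draw X and Y independently and form Z = X + Y.
z = [random.gauss(mu1, sigma1) + random.gauss(mu2, sigma2) for _ in range(n)]

sample_mean = sum(z) / n
sample_variance = sum((v - sample_mean) ** 2 for v in z) / n

print(sample_mean, mu1 + mu2)
print(sample_variance, sigma1**2 + sigma2**2)
```

With this many draws the sample mean and variance should lie close to the sums of the means and of the variances, respectively.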

The normal distribution is often used as a mathematical model to describe populations such as that of scores on a test. Arguments in support of the normality assumption are customarily based on the central limit theorem. Thus it is argued that the value of a given measurement is determined by a large number of factors. It is less frequently realized that reference to the central limit theorem implies also that factors act in an additive fashion. Nevertheless, experience shows that the degree of nonnormality occurring in practice is often so small that the assumption of actual normality does not lead to erroneous conclusions.

Gottfried E. Noether

[For a discussion of the various distributions mentioned in the text, see also Distributions, Statistical.]

## BIBLIOGRAPHY

### WORKS REQUIRING AN ELEMENTARY MATHEMATICAL BACKGROUND

Cramér, Harald (1951) 1955 The Elements of Probability Theory and Some of Its Applications. New York: Wiley. → First published as Sannolikhetskalkylen och några av dess användningar.

Hodges, Joseph L. Jr.; and Lehmann, E. L. 1964 Basic Concepts of Probability and Statistics. San Francisco: Holden-Day.

Mosteller, Frederick; Rourke, Robert E. K.; and Thomas, George B. Jr. 1961 Probability With Statistical Applications. Reading, Mass.: Addison-Wesley.

Niven, Ivan 1965 Mathematics of Choice. New York: Random House.

Weaver, Warren 1963 Lady Luck: The Theory of Probability. Garden City, N.Y.: Doubleday.

### WORKS OF A MORE ADVANCED NATURE

Feller, William 1950–1966 An Introduction to Probability Theory and Its Applications. 2 vols. New York: Wiley. → The second edition of the first volume was published in 1957.

Gnedenko, Boris V. (1950) 1962 The Theory of Probability. New York: Chelsea. → First published as Kurs teorii veroiatnostei.

Parzen, Emanuel 1960 Modern Probability Theory and Its Applications. New York: Wiley.

Riordan, John 1958 An Introduction to Combinatorial Analysis. New York: Wiley.

## II INTERPRETATIONS

Many disputes–what are they about? There are myriad different views on probability, and disputes about them have been going on and increasing for a long time. Before outlining the principal questions, and the main attitudes toward them, we note a seeming contradiction; it may be said with equal truth that the different interpretations alter in no substantial way the contents and applications of the theory of probability and, yet, that they utterly alter everything. It is important to have in mind precisely what changes and what does not.

Nothing changes for the mathematical theory [see Probability, article on Formal Probability]. Thus a mathematician not conceptually interested in probability can do unanimously acceptable work on its theory, starting from a merely axiomatic basis. And often nothing changes even in practical applications, where the same arguments are likely to be accepted by everyone, if expressed in a sufficiently acritical way (and if the validity of the particular application is not disputed because of preconceptions inherent in one view or another).

For example, suppose someone says that he attaches the probability one sixth to an ace at his next throw of a die. If asked what he means, he may well agree with statements expressed roughly thus: he considers \$1 the fair insurance premium against a risk of \$6 to which he might be exposed by occurrence of the ace; the six faces are equally likely and only one is favorable; it may be expected that every face will appear in about 1/6 of the trials in the long run; he has observed a frequency of 1/6 in the past and adopts this value as the probability for the next trial; and so on. Little background is needed to see that each of these rough statements admits several interpretations (or none at all, if one balks at insufficient specification). Moreover, only one of the statements can express the very idea, or definition, of probability according to this person’s language, while the others would be accepted by him, if at all, as consequences of the definition and of some theorems or special additional assumptions for particular cases. It would be a most harmful misappraisal to conclude that the differences in interpretation are meaningless except for pedantic hairsplitters or, even worse, that they do not matter at all (as when the same geometry is constructed from equivalent sets of axioms beginning with different choices of the primitive notions). The various views not only endow the same formal statement with completely different meanings, but a particular view also usually rejects some statements as meaningless, thereby restricting the validity of the theory to a narrowed domain, where the holders of that view feel more secure. Then, to replace the rejected parts, expedients aiming at suitable reinterpretation are often invented, which, naturally, are only misinterpretations for the adherents of other views.

### A bit of history

#### The beginnings

It is an ambitious task even to make clear the distinctions and connections between the various schools of thought, so we renounce any attempt to enter far into their historical vicissitudes. A sketch of the main lines of evolution should be enough to give perspective. [See Statistics, article on The history of statistical method.]

In an early period (roughly 1650–1800), the mathematical theory of probability had its beginnings and an extraordinarily rapid and fruitful growth. Not only were the fundamental tools and problems acquired, but also the cornerstones of some very modern edifices were laid: among others, the principle of utility maximization, by Daniel Bernoulli in 1738; the probabilistic approach to inductive reasoning and behavior, by Thomas Bayes in 1763; and even the minimax principle of game theory, by de Waldegrave in 1712 (Guilbaud 1961). But interest, in those days, focused on seemingly more concrete problems, such as card games; conceptual questions were merely foreshadowed, not investigated critically; utility theory remained unfruitful; Bayes’ principle was misleadingly linked with Bayes’ postulate of uniform initial distribution; de Waldegrave’s idea went unnoticed [see Bayesian inference; Game theory].

It happened thus that some applications of probability to new fields, such as judicial decisions, were bold and careless, that the Bayesian approach was often misused, and that ambiguity in interpretation became acute in some contexts. Particularly troublesome was the meaning of “equally likely,” when–with Laplace in 1814–this notion came forward as the basis of an ostensible definition of probability.

#### A confining criticism

A bitter criticism arose that was prompt to cut away all possible causes of trouble rather than to analyze and recover sound underlying ideas. This attitude was dominant in the nineteenth century and is still strong. Concerning “equal probabilities,” it has long been debated whether they ought to be based on “perfect knowledge that all the relevant circumstances are the same” or simply on “ignorance of any relevant circumstance that is different,” whatever these expressions themselves may mean. To illustrate, is the probability of heads ½ for a single toss only if we know that the coin is perfect, or even if it may not be but we are not informed which face happens to be favored? The terms objective and subjective–now used to distinguish two fundamentally different natures that probability might be understood to have–first appeared in connection with these particular, not very well specified, meanings of “equally probable.”

It is apparent how narrow the field of probability becomes if restricted to cases arising from symmetric partitions. Sometimes, it was said, applications outside this field could be allowed by “analogy,” but without any real effort to explain how far (and why, and in what sense) such an extension would be valid.

Authors chiefly interested in statistical problems were led to another confining approach that hinges on that property of probability most pertinent to their field, namely, the link with frequency, for example, Venn (1866), or the limit of frequency, for example, Von Mises (1928). In any such theory, the limitation of the field of probability is less severe, in a sense, but is vaguely determined–or altogether undetermined if, as I believe, such theories are unavoidably circular.

#### A liberating criticism

There is today a vigorous revival of the current of thought (mentioned above in the subsection “The beginnings”) that could not find its true development in the eighteenth century for want of full consciousness of its own implications. According to this outlook, an attempt to replace the familiar intuitive notion of probability by any, necessarily unsuccessful, imitation is by no means required, or even admissible. On the contrary, it suffices to make this intuitive notion neat and clear, ready to be used openly for what it is.

The most deliberate contributions to this program were along two convergent lines, one based on what may be called admissibility, by Frank P. Ramsey in 1926 (see 1923–1928) and Leonard J. Savage (1954), and one on coherence, or consistency, by Bruno de Finetti (1930) and B. O. Koopman (1940a; 1940b). In addition, the impact of authors supporting views concordant only in part with this one (like John Maynard Keynes 1921; Émile Borel 1924; and Harold Jeffreys 1939), as well as the impact of many concomitant circumstances, was no less effective. We can but list the following principal ones: the development of somewhat related theories (games, as by John von Neumann and Oskar Morgenstern 1944 [see Game Theory]; decisions, Abraham Wald 1950 [see Decision Theory]; logical investigations, Rudolf Carnap 1950; 1952); the detection of shortcomings in objectivistic, or frequentistic, statistics; the applications of probability to problems in economics and operations research; the survival of the “old-new” ideas in some spheres where common sense and practical needs were not satisfied by other theories, as in engineering (Fry 1934; Molina 1931), in actuarial science, and in credibility theory (Bailey 1950); and so on. Above all, the less rigid conception of scientific thought following the decline of rationalistic and deterministic dogmatism facilitated acceptance of the idea that a theory of uncertainty should find its own way.

### The objective and the subjective

#### In logic

To make the step to probability easier, let us start with logic. The pertinent logic is that of sentences–more precisely, of sentences about the outside world, that is, sentences supposed to have some verifiable meaning in the outside world. We call such a sentence or the fact that it asserts–a harmless formal ambiguity–an event.

(There is a usage current in which event denotes what would here be called a class of events that are somehow homogeneous or similar, and individual event, or trial, denotes what is here called an event. The alternative usage is favored by some who emphasize frequencies in such classes of events. The nomenclature adopted here is the more flexible; while not precluding discussion of classes of events, it raises no questions about just what classes of events, if any, play a special role.)

Some examples of events–that is, of sentences, assertions, or facts–are these: A = Australia wins the Davis Cup in 1960. B = the same for 1961. C = the same for 1960 and 1961.

What can be said of an event, objectively? Objectively it is either true or false (tertium non datur), irrespective of whether its truth or falsity is known to us or of the reason for our possible ignorance (such as that the facts are in the future, that we have not been informed, that we forget, and so on). However, a third term, indeterminate, is sometimes used to denote events that depend on future facts, or, in a narrower sense, on future facts other than those considered to be fully controlled by deterministic physical laws. In this sense, event A was indeterminate until the deciding ball of the 1960 Cup finals, and B and C were indeterminate until 1961, when they became true. Yet, in the strictest logical terminology, they are true in the atemporal sense; or, with reference to time, they were and will be true forever, irrespective of the time-space point of a possible observer.

Subjectively–for some person, at a given moment–an event may be certain, impossible, or dubious. But mistakes are not excluded; certain (or impossible) does not necessarily imply true (or false).

In logic restricted to the subjective aspect, that is, ignoring or disregarding reality, one can say only whether a person’s assertions are consistent or not; including the objective aspect, one may be able to add that they are correct, unmistaken, or mistaken. In the Davis Cup example (where Ā means not-A, and a means that A is dubious), ABC, Abc, aBc, abc, abC̄, ĀbC̄, AB̄C̄, ĀB̄C̄, ĀBC̄, aB̄C̄ are the only consistent assertions. For example, AbC is not, for there is no possible doubt about B if C is certain. Taking reality into account (namely, that A, B, C are actually all true), only the first, ABC, is correct; the next three are unmistaken, for no false event is considered as certain nor any true one as impossible; the remaining assertions are mistaken.
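
The list of consistent assertions can be reproduced mechanically. The sketch below rests on one modeling assumption made here for illustration: a person’s state of knowledge is a nonempty set of truth-value combinations for A and B that he has not excluded, with C always equal to “A and B.” In the output, an upper-case letter means certain, a lower-case letter dubious, and a `~` prefix impossible (standing in for the overbar).

```python
from itertools import combinations, product

# Possible cases: truth values of A and B, with C = A and B by definition.
cases = [(a, b, a and b) for a, b in product((True, False), repeat=2)]

def status(values, name):
    """Upper case = certain, '~' prefix = impossible, lower case = dubious."""
    if all(values):
        return name.upper()
    if not any(values):
        return "~" + name.upper()
    return name.lower()

consistent = set()
for k in range(1, len(cases) + 1):
    for admitted in combinations(cases, k):
        # 'admitted' is one state of knowledge: the cases not yet excluded.
        triple = tuple(
            status([case[i] for case in admitted], name)
            for i, name in enumerate("ABC")
        )
        consistent.add(triple)

for triple in sorted(consistent):
    print(" ".join(triple))
print(len(consistent))
```

Under this assumption exactly ten triples appear, and the triple with A certain, B dubious, and C certain is not among them.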

From logic to probability. Probability is the expedient devised to overcome the insufficiency of the coarse logical classification. To fill the questionable gap between true and false, instead of the generic term indeterminate, a continuous range of objective probabilities may be contemplated. And the unquestionable gap between certain and impossible, the dubious, is split into a continuous range of degrees of doubt, or degrees of belief, which are subjective probabilities.

The existence of subjective probabilities and of some kind of reasoning thereby is a fact of daily psychological experience; it cannot be denied, although it may be considered by some to be probabilistically uninteresting. (In the most radical negation its relation to the true probability theory is compared to that between the energy of a person’s will and energy as a physical notion.) To exploit this common experience, we need only a device to measure, and hence to define effectively, subjective probabilities in numerical terms and criteria to build up a theory. We consider three kinds of theories about subjective probability, SP, SC, and SR, which aim, respectively, to characterize psychological, consistent, and rational behavior under uncertainty.

Since the meaningfulness of objective probabilities is at least as questionable as the notion of indeterminateness, it is essential to ponder the various bases proposed for objective probability and the forces impelling many people to feel it important. Here, too, we shall consider three kinds of theories of objective probability, OS, OF, and OL, based, respectively, on symmetry (or equiprobable cases), on frequency, and on the limit of frequency.

#### Some remarks

The above classification into six kinds of theories is a matter of convenience. Some theories that do not fit the scheme must be mentioned separately. In the axiomatic approach, for example, probability has, by convention, the nature of a measure as defined by Kolmogorov (1956, p. 2) and discussed by Savage (1954, p. 33); applications are made without specifying the meaning of probability or alluding to any of the interpretations discussed here. In other theories, probabilities may be noncomparable, and hence nonnumerical; there are such variants not only of SP but also of SC (Smith 1961) or even SR, as by Keynes.

Different theories are not necessarily incompatible. For instance, SC should be, and sometimes is, accepted by a supporter of SR as a preliminary weaker construction–as projective is preliminary to Euclidean geometry. Even less natural associations are current: Carnap admits two distinct notions, prob1 and prob2, that correspond to SR (or SC?) and OF (or OL?); if, like many psychologists, one stops at SP or at its variant with noncomparability (Ellsberg 1961), judging SC to be unreasonably stringent, one is often inclined to accept much more stringent assumptions, such as OS, in particular fields; and so on.

The order we shall follow is from the least to the most restrictive theory. In a sense, each requires the preceding ones together with further restrictions. Beginning with the subjectivistic interpretation, we have a good tool for investigating the special objectivistic constructions; they can in fact be imbedded into it, not only in form but also in substance. For the opinions a person holds as objectively right will of course also be adopted by him among those he considers subjectively right for himself.

#### Toward a subjectivistic definition

There is no difficulty in finding methods that attach to a subjective probability its numerical value in the usual scale from 0 to 1; what is difficult is weighing the pros and cons of various essentially equivalent measuring devices. Verbalistic comparisons or assertions are simple, but not suitable, at least at first, when we must grasp the real meaning from the measuring device. The device must therefore reveal preferences, by replies or by actions. The first way, by replies, is still somewhat verbalistic; the second is truly behavioristic but may be vitiated insofar as actions are often unpondered or dictated by caprice or by elusive side effects.

Obviously, the latter inconveniences are particularly troublesome in measuring subjective entities. It must be considered, for example, whether some opponent exists who may perhaps have, or be able to obtain, more information or be inclined to cheat. Such trouble is not, however, peculiar to probability. Measuring any physical quantity gives rise to the same sort of difficulties when its definition is based on a sufficiently realistic idealization of a measurement device.

Roughly speaking, the value p, given by a person for the probability P(E) of an event E, means the price he is just willing to pay for a unit amount of money conditional on E’s being true. That is, the preferences on the basis of which he is willing to behave (side effects being eliminated) are determined, with respect to gains or losses depending on E, by assuming that an amount S conditional on E is evaluated at pS. That must be asserted only for sufficiently small amounts; the general approach should deal in the same way with utility, a concept that, for want of space, is not discussed here [see Utility].

To construct a device that obliges a person to reveal his true opinions, it suffices to offer him a choice among a set of decisions, each entailing specified gains or losses according to the outcome of the event (or events) considered; such a set can be so arranged that any choice corresponds to a definite probability evaluation. The device can also be so simply constructed as to permit an easy demonstration that coherence requires the evaluation to obey the usual laws of the calculus of probability.

### The subjectivistic approach

If we accept such a definition, the fundamental question of what should be meant by a theory of probability now arises. There seems to be agreement that it must lead to the usual relations between probabilities, to the rules of the probability calculus, and also to rules of behavior based on any evaluation of probabilities (still excluding decisions about substantial amounts of money, which require utilities). The three kinds of theory announced earlier differ in that they are looking for: a theory of actual behavior (SP); a theory of coherent behavior (SC); a theory of rational behavior (SR)–in other words, they are looking either for a theory of behavior under uncertainty as it is or for a theory of behavior as it ought to be, either in the weaker sense of avoiding contradictions or in the stronger sense of choosing the “correct” probability evaluations.

#### Subjective probability, psychological (SP)

No one can deny the existence of actual behavior or the interest in investigating it in men, children, rats, and so on. Such experimental studies yield only descriptive theories, which cannot be expected to conform to the ordinary mathematics of probability. A descriptive theory may exist whether a corresponding normative one does or not. For example, when studying tastes, there are no questions of which tastes are intrinsically true or false. When studying responses to problems of arithmetic or logic, it is meaningful to distinguish, and important to investigate, whether the answers are correct and what mistakes are made.

One can be content to stop with the study of SP, accepting no normative theory–which SC, SR, OS, OF, and OL are, in a more or less complete sense. But one cannot object to a normative theory on the grounds that it does not conform to the actual behavior of men or rats or children. A normative theory states what behavior is good or bad. We may question whether any normative theory exists or whether a given one has any claims to be accepted. These questions have nothing to do with whether any beings do in fact behave according to the conclusions of the theory. It cannot therefore be confirmed or rejected on the basis of observational data, which can on the contrary say only whether there is more or less urgent need to teach people how to behave consistently or rationally.

#### Subjective probability, consistent (SC)

The probability evaluations over any set of events whatever can be mathematically separated into the classes of those that are coherent and those that are not. Coherence means that no bet resulting in certain loss is considered acceptable; coherence is equivalent to admissibility, according to which no decision is preferable to another that, in every case, gives as good an outcome [see Decision Theory]. This is meant here with reference to maximizing expectations; it should ultimately be transformed into maximizing expected utility, which is the most general case of coherent behavior (Savage 1954).

As may be seen rather easily from the behavioristic devices on which personal probability is founded, coherence is equivalent to the condition that the whole usual calculus of probability be satisfied. For instance, if C = AB (as in the example about the Davis Cup), the necessary and sufficient condition for coherence is

P(A) + P(B) – 1 ≤ P(C) ≤ min [P(A),P(B)].
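
The link between this inequality and certain loss can be made concrete. The sketch below uses hypothetical evaluations and function names introduced here; the lower bound is also clipped at 0, which only reflects that probabilities lie on the scale from 0 to 1. When P(C) falls below P(A) + P(B) – 1, a person who sells a unit bet on C and buys unit bets on A and B at his own prices loses in every case.

```python
from itertools import product

def coherence_bounds(p_a, p_b):
    """Range allowed for P(C) when C = AB, given P(A) and P(B)."""
    lower = max(0.0, p_a + p_b - 1.0)  # clipped at 0: probabilities lie in [0, 1]
    upper = min(p_a, p_b)
    return lower, upper

def is_coherent(p_a, p_b, p_c):
    """True when P(C) lies within the bounds implied by P(A) and P(B)."""
    lower, upper = coherence_bounds(p_a, p_b)
    return lower <= p_c <= upper

def gains_when_too_low(p_a, p_b, p_c):
    """Net gain in each possible case after selling a unit bet on C and
    buying unit bets on A and B, all at the person's own prices."""
    gains = {}
    for a, b in product((True, False), repeat=2):
        c = a and b
        gains[(a, b)] = (p_c - c) + (a - p_a) + (b - p_b)
    return gains

# Hypothetical evaluations: P(C) = 0.4 lies below P(A) + P(B) - 1 = 0.5.
p_a, p_b, p_c = 0.8, 0.7, 0.4
print(is_coherent(p_a, p_b, p_c))
print(gains_when_too_low(p_a, p_b, p_c))
```

Every entry of the printed table of gains is negative, so the combination of bets is a sure loss; raising P(C) into the allowed range removes this particular sure loss.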

Properly subjectivistic authors (like the present one) think coherence is all that theory can prescribe; the choice of any one among the infinitely many coherent probability distributions is then free, in the sense that it is the responsibility of, and depends on the feelings of, whatever person is concerned.

#### Subjective probability, rational (SR)

We denote as rational, or rationalistic, a theory that aims at selecting (at least in a partial field of events) just one of the coherent probability distributions, supposed to be prescribed by some principles of thought. (This is called a “necessary” view. Some writers use “rational” for what we call “coherent.”)

In most cases, such a view amounts to presenting as cogent the feeling of symmetry that is likely to arise in many circumstances and with it the conclusion that some probabilities are equal, or uniformly distributed. But what really are the conditions where such an argument applies? A symmetry in somebody’s opinion is the conclusion itself, not a premise, and the ostensible notion of absolute ignorance seems inappropriate to any real situation. Less strict assumptions, such as symmetry of syntactical structure of the sentences asserting a set of events, are likely to permit arbitrariness and to lead to misuses–in most radical form, to the d’Alembert paradox, according to which any dubious event has p = ½ on the pretext that we are in a symmetric situation, being unable to deny either E or non-E, which means simply stopping at “dubious.”

### The objectivistic approach

The picture changes utterly in passing from the subjectivistic to the objectivistic approach, if due attention is paid to the underlying ideas. There are no longer people intent on weighing doubts. It is Nature herself who is facing the doubts, irresolute toward decisions, committing them to Chance or Fortune. The mythological expressions are but images; yet the expressions commonly used in objectivistic probability are, at bottom, equivalent to them, and the objectivistic language is so widespread that even attentive subjectivists sometimes lapse into it. It is considered meaningful to ask, for example, whether some effect is due to a cause or to chance (that is, is random); whether a fact modifies the probability of an event; whether a random variable is normal, or two are independent, or a process is Poisson; whether chance intervenes in a process (once for all, at one step, or at every instant); whether this or that phenomenon obeys the laws of probability or what their underlying chance mechanism is; and so on.

In fact, an objective probability is regarded as something belonging to Nature herself (like mass, distance, or other physical quantities) and is supposed to “exist” and have a determined value even though it may be unknown to anyone. Quite naturally, therefore, objectivistic theories do actually deal with unknown probabilities, which are of course meaningless in a subjectivistic theory. Furthermore, it can fairly be said that an objective probability is always unknown, although hypothetical estimates of its values are made in a not really specifiable sense. How can one hope to communicate with such a mysterious pseudoworld of objective probabilities and to acquire some insight on it?

#### Objective probability, symmetrical (OS)

The first partial answer comes from the objectivistic interpretation of the “symmetry principle,” or “the principle of cogent reason,” according to which identical experiments repeated under identical conditions have the same probability of success. This applies also to the case of several symmetric possible outcomes of one experiment (such as the six faces of a die) and is often asserted also for combinations (such as the 2^n sequences of possible outcomes of tossing a coin n times). Accepting–perhaps on the basis of SC, admitting that objective probabilities must be consistent with subjective ones, or perhaps by convention–the rule of favorable divided by possible cases, probability is defined in a range and way very similar to those of SR; but, even apart from the change from subjective to objective interpretation, it is not so close as it may seem, because the role of information is now lost. We can no longer content ourselves with asserting something about symmetry of the real world as it is known to us but are compelled to entangle ourselves in asserting perfect symmetry of what is unknown if not indeed unknowable. Or we may switch from this supernatural attitude to the harmless one of regarding such assertions as merely “hypothetical,” but then obtain only hypothetical knowledge of the objective probabilities, since the perfect symmetry is only hypothetical. For instance, what about differences arising from magnetism before magnetism was known? Strictly, the perfect symmetry is contradictory, unless unavoidable differences in time, place, past and current circumstances, etc. are bypassed as irrelevant.

#### Objective probability, frequency (OF)

Another answer is: “Objective probability is revealed by frequency.” It is a property that somehow drives frequency (with respect to a sufficiently large number of “identical” events) toward a fixed number, p, that is the value of the probability of such events. Statistical data are, then, a clue to the ever unknowable p if the events concerned are considered identical or to an average probability if their probabilities differ.

OF actually presupposes OS whenever we intend to justify the necessity of arranging events in groups to get frequencies and to use these frequencies for other, as yet unknown, events of the group. Nonetheless, there may be a conflict if, for instance, the frequencies that occur with a die, accepted as perfect, are not almost equal, as can happen.

#### Objective probability, limit of frequency (OL)

In order to remove the unavoidable indeterminateness of OF, the suggestion has been made to increase the large number to infinity and to define p as the limit of frequency. Whether one does or does not like this idea as a theoretical expedient, clearly no practical observations or practical questions do concern eternity; for dealing with real problems this theory is at best only an elusive analogy.

#### Critique of objectivistic theories

Objectivistic theories are often preferred (especially by some practically oriented people, such as statisticians and physicists), because they seem to join the fundamental notions with practically useful properties by a direct short cut. But the short cut leaps unfathomable gulfs. It admits of no bridge between us, with our actual knowledge, and the imagined objectivistic realm, which can be turned into an innocent allegory for describing some models but has no proper claim to being complete and self-supporting. The needed bridge is supplied by subjective probability, whose role seems necessary and unchanged, whether or not we want also to make some use of the unnecessary notion of objective probability for our descriptions of the world. If this bridge is rejected, only recourse to expedients is left. It will shortly be explained why subjective probabilities are said to provide the natural bridge, while objectivistic criteria (such as the usual methods of estimating quantities, testing hypotheses, or defining inductive behavior) appear to be artificial and inadequate expedients.

### Inductive reasoning and behavior

A feature that has been postponed, to avoid premature distraction, must now be dealt with. A subjective probability, P(E), is of course conditional on the evidence, or state of information, currently possessed by the subject concerned; to make that explicit, we may write P(E│A), where A is the current state of information, which is usually left implicit. Any additional information, real or hypothetical, consisting in learning that an event H is true (and H may be the joint assertion, or logical product, of any number and kind of “simpler” events) leads from the probabilities P(E) to P(E│H), conditional on H (or on AH, if A need be made explicit). The coherence condition of SC suffices for all rules in the whole field of conditional probabilities and hence specifies by implication what it means to reason and behave coherently, not only in a static sense, that is, in a given state of information, but also in a dynamic one, in which new information arises freely or may be had at some cost by experiment or request.

The passage from P(E) to P(E│H) is prescribed simply by the theorem of compound probabilities, or equivalently, in a slightly more elaborate and specific form, by Bayes’ theorem. As for the decision after knowledge of H, it must obviously obey the same rules as before, except with P(E│H) in the place of P(E); the sole essentially new question is how best to spend time, effort, and money for more information and when to stop for final decision, but that too is settled by the same rules.
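The passage from P(E) to P(E│H) can be illustrated with a small numerical sketch. The probabilities used below are invented purely for illustration, and the function names are hypothetical; only the theorem of compound probabilities, P(EH) = P(H) P(E│H), and Bayes’ theorem are taken from the text.

```python
from fractions import Fraction

def conditional(p_joint, p_given):
    """P(E | H) from the theorem of compound probabilities: P(EH) = P(H) * P(E | H)."""
    if p_given == 0:
        raise ValueError("conditioning event has probability zero")
    return p_joint / p_given

def bayes(p_e, p_h_given_e, p_h_given_not_e):
    """P(E | H) by Bayes' theorem, expanding P(H) over E and not-E."""
    p_h = p_e * p_h_given_e + (1 - p_e) * p_h_given_not_e
    return p_e * p_h_given_e / p_h

# Assumed numbers: initial P(E) = 1/4, P(H | E) = 4/5, P(H | not E) = 1/5.
p_e = Fraction(1, 4)
posterior = bayes(p_e, Fraction(4, 5), Fraction(1, 5))

# The same value obtained directly from the compound-probability form.
p_eh = p_e * Fraction(4, 5)
p_h = p_eh + (1 - p_e) * Fraction(1, 5)
posterior_direct = conditional(p_eh, p_h)

print(posterior, posterior_direct)
```

Both routes give the same conditional probability, which is the sense in which Bayes’ theorem is only a more elaborate form of the compound-probability rule.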

Coherence thus gives a complete answer to decision problems, including even induction, that is, the use of new information; no room is left for arbitrariness or additional conventions. Of course, the freedom in choosing P(E), the initial distribution, is still allowed (unless we accept uniqueness from SR), and similarly for utility. The unifying conclusion is this: Coherence obliges one to behave as if he accepted some initial probabilities and utilities, acting then so as to maximize expected utility.

The particular case of statistical induction is simply that in which H expresses the outcome of several “similar” events or trials. The simplest condition, called exchangeability, obtains when only information about the number of successes and failures is relevant for the person, irrespective of just which events, or trials, are successes or failures; and the most important subcase is exchangeability of a (potentially) infinite class of events, such as coin or die tossing, drawings with replacement from an urn, or repetitions of an experiment under sufficiently similar conditions. The model of an infinite class of exchangeable events can be proved equivalent to the model that presents the events as independent conditional on the value of an unknown probability, whether the “unknown probability” is interpreted objectivistically or otherwise. In fact, to deal with the latter model consistently, one must (if not explicitly, at least in effect) start with an initial subjective distribution of the unknown probability; but exactly the same results are obtainable directly from the definition of exchangeability itself, without recourse to any probabilities other than those to which we have direct subjective access (Feller 1950–1966, vol. 2, p. 225). Whatever the approach, Bayes’ rule acts here in a simple way (we cannot go into the details here) that explains how we all come to evaluate probabilities in statistical-induction situations according to the observed frequencies, insofar as our common sense induces approximate coherence.
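A minimal sketch may make the role of exchangeability concrete. It assumes, purely for illustration, an initial Beta(a, b) distribution for the unknown probability (uniform when a = b = 1); that choice, and the function names, are assumptions and not part of the argument above. The sketch checks that every ordering of the same numbers of successes and failures receives the same probability, and that the predictive probability moves toward the observed frequency as trials accumulate.

```python
from fractions import Fraction
from itertools import permutations

def predictive(successes, failures, a=1, b=1):
    """P(next trial is a success | observed counts), under an assumed Beta(a, b) start."""
    return Fraction(a + successes, a + b + successes + failures)

def sequence_probability(outcomes, a=1, b=1):
    """Probability of one specific 0/1 sequence, built up trial by trial."""
    prob = Fraction(1)
    successes = failures = 0
    for x in outcomes:
        p_success = predictive(successes, failures, a, b)
        prob *= p_success if x == 1 else 1 - p_success
        successes += x
        failures += 1 - x
    return prob

# Exchangeability: only the counts matter, not which trials were the successes.
orderings = set(permutations([1, 1, 0]))
probs = {sequence_probability(seq) for seq in orderings}

# Predictive probabilities after 10 and after 1,000 trials with frequency 0.7.
after_10 = predictive(7, 3)
after_1000 = predictive(700, 300)
print(probs, after_10, float(after_1000))
```

With few trials the initial distribution still matters; with many, the evaluation is close to the observed frequency, as the paragraph above describes.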

The inconsistency just noted of using any but the Bayesian approach, even under an objectivistic formulation, should be discussed further. The initial subjective probability distribution of a person must be the same for all decision problems that depend on a given set of events. It cannot be chosen by criteria that, like the minimax rule, depend on the specific problem or on the instrument of observation, because such criteria, although coherent within each problem, are not coherent over-all. Also, there is no justification for calling the Bayesian method unreasonable because the needed initial distribution is “unknown.” More accurately, it is of dubious choice (see, for example, Lindley 1965, vol. 2, pp. 19–21). Actually, any method proposed for avoiding the risk of such a choice is demonstrably even worse than a specific choice. The situation is as though someone were to estimate the center of gravity of some weighted points of a plane by a point outside the plane because he did not know the weights of the points sufficiently well; yet the projection of the estimate back onto the plane must improve it. This is not a mere analogy but a true picture in a suitable mathematical representation. This picture should be emphasized because it shows that, far from opposition, there is necessarily agreement between inductive behavior and inductive reasoning, which is contrary to an opinion current in objectivistic statistics.

### A few nuances of the views

No sketch of some of the representative views can cover the opinions of all authors, if only because some might take a particular idea in earnest and push it to its extreme consequences while others might consider it merely as a suggestive abstraction to be taken with a grain of salt. Our sketches may therefore appear either as insufficient or as caricatures. Still, even mentioning a few of the nuances may create a more realistic impression.

Each notion changes its meaning with the theory; let us take, as an example, independence. Two drawings, with replacement, from an urn of unknown composition are called independent by an objectivist (since for him the probability is the unknown but constant proportion of white balls) but not by a subjectivist (for whom “independent” means “devoid of influence on my opinion”). For him, such events are only exchangeable; the observed outcomes, through their frequency, do alter the conditional probability of those not yet observed, whether in terms of an unknown probability (here, the urn composition) or directly.
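The urn example can be worked through numerically. The sketch below assumes, for illustration only, that the proportion of white balls is either 1/4 or 3/4 and that the two compositions are initially judged equally probable; these numbers and the helper name are hypothetical. It shows that the first draw alters the probability of the second (so the draws are not independent in the subjectivist's sense) while the order of the outcomes remains irrelevant (so they are exchangeable).

```python
from fractions import Fraction

# Assumed initial opinion: proportion of white balls is 1/4 or 3/4, each with weight 1/2.
prior = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}

def prob_sequence(colors):
    """P(sequence of draws with replacement), mixing over the unknown composition."""
    total = Fraction(0)
    for p_white, weight in prior.items():
        likelihood = Fraction(1)
        for c in colors:
            likelihood *= p_white if c == "W" else 1 - p_white
        total += weight * likelihood
    return total

p_w1 = prob_sequence("W")                       # probability of white on the first draw
p_w2_given_w1 = prob_sequence("WW") / p_w1      # probability of white on the second, given the first
p_wb, p_bw = prob_sequence("WB"), prob_sequence("BW")

print(p_w1, p_w2_given_w1, p_wb, p_bw)
```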

In OF, on the contrary, independence and the law of large numbers are almost prerequisites for the definition of probability. To escape confusing circularity, duplicates of notions are invented. Thus, the preliminary form of the law of large numbers is called the “empirical law of chance,” making a distinction based on a similar one between “highly probable” and “almost certain”—itself created expressly to be eliminated by the “principle of Cournot.” This ostensible principle is one that seems to suggest that we can practically forget the “almost” and consider probability as a method of deriving certainties; in a safer version, it might offer a link between subjective and objective probabilities.

But the first prerequisite for OF is the grouping of events into classes, sometimes tacitly slipped in by the terminology that uses the word “event” for a sequence of events rather than for an individual event. What is intended by such a class raises confusing questions. Does maintaining that probability belongs to a class imply that all events, or trials, in the class are equally probable? Or is that question to be rejected and the probability of a single trial held to be meaningless? What information entitles us to assign a given single event to such a class? These questions are aspects of a more general one that has still other aspects. For any objectivist, frequentistic or not, which states of information allow us to regard a probability as known? Which are insufficient and leave it unknown? Is there only one specific state of information concerning a given event to which a probability corresponds? Does the whole information on which a probability is based consist of all relevant circumstances of the present or of the past? Or can nothing be excluded as irrelevant so that it consists of the whole present and past?

Actually, some conventional state of information, far from knowledge of the whole past and present, is usually considered appropriate. For instance, for extractions from an urn, when the proportion of white balls is known (and perhaps when it is also known that the balls are well mixed), objectivists say that the probability is known and equal to the proportion. But why not require fuller information? If, for example, we have noted that the child drawing generally chooses a ball near the top, then additional information about the position of the white balls in the urn would seem to be relevant.

Something between equal probabilities for events in classes and individual probabilities for each trial may seem to be offered by the precaution of speaking of the probability of an event with respect to a given set of trials (see Fréchet 1951, pp. 15–16). An example will illustrate the meaning and the implications not only of this attitude but more generally of all attempts to evade the main question above, namely, whether the probabilities of events in a class are equal, unequal, or nonsense. There is no proper premium for insurance on the life of Mr. Smith, unless we specify that he is to be insured as, for instance, a lawyer, a widower, a man forty years old, a blonde, a diabetic, an ex-serviceman, and so on. How then should an objectivistic insurer evaluate Mr. Smith’s application for insurance?

### Probability and philosophy

The view of a world ruled by Chance has also been opposed for philosophical reasons, not to mention theological and moral ones. Chance is incompatible with determinism, which, it was once said, is a prerequisite to science. By replying, “Chance is but the image of a set of many little causes producing large effects,” faith in perfect determinism in the microcosm was reconciled with probability, but perhaps only with SR rather than with OS, which was the real point of conflict. At any rate, the advent of probabilistic theories in physics and elsewhere later showed that determinism is not the only possible basis for science.

Subjectivistic views have been charged with “idealism” by Soviet writers (for example, Gnedenko 1950), and “priesthood” by objectivistic statisticians (for example, van Dantzig 1957), inasmuch as these views draw their principles only from human understanding. For SR that may be partially justified; it should, however, be ascribed simply to misunderstanding when it is said of SC, which seems exposed to the opposite charge, if any. Namely, SC allows too absolute a freedom for a person’s evaluations, abstaining from any prescription beyond coherence. Is that not assent to arbitrariness? The answer is, Yes, in freedom from prefabricated schemes; but the definition calls on personal responsibility; and mathematical developments based on the coherence conditions show how and why the usual prescriptions—above all, those based on symmetry and frequency—ought to be applied, not as rigid artificial rules, but as patterns open to intelligence and discernment for proper interpretation in each case.

Bruno De Finetti

[See also Bayesian inference; Causation; Science, article on The philosophy of science; Scientific explanation.]

## BIBLIOGRAPHY

Contributions by Venn, Borel (about Keynes), Ramsey, de Finetti, Koopman, and Savage are collected and commented on in Kyburg & Smokler 1964.

Bailey, Arthur L. 1950 Credibility Procedures. Casualty Actuarial Society, Proceedings 37:7–23.

Bayes, Thomas (1764) 1958 An Essay Towards Solving a Problem in the Doctrine of Chances. Biometrika 45:296–315. → Reprinted in 1963 in Bayes’ Facsimiles of Two Papers by Bayes, published by Hafner.

Bernoulli, Daniel (1738) 1954 Exposition of a New Theory on the Measurement of Risk. Econometrica 22:23–36. → First published as “Specimen theoriae novae de mensura sortis.”

Borel, Émile (1924) 1964 Apropos of a Treatise on Probability. Pages 45–60 in Henry E. Kyburg, Jr. and Howard E. Smokler (editors), Studies in Subjective Probability. New York: Wiley. → First published in French in Volume 98 of the Revue philosophique.

Carnap, Rudolf (1950) 1962 Logical Foundations of Probability. 2d ed. Univ. of Chicago Press.

Carnap, Rudolf 1952 The Continuum of Inductive Methods. Univ. of Chicago Press.

De Finetti, Bruno 1930 Fondamenti logici del ragionamento probabilistico. Unione Matematica Italiana. Bollettino Series A 9:258–261.

Ellsberg, Daniel 1961 Risk, Ambiguity, and the Savage Axioms. Quarterly Journal of Economics 75:643–669.

Feller, William 1950–1966 An Introduction to Probability Theory and Its Applications. 2 vols. New York: Wiley. → The second edition of the first volume was published in 1957.

Fréchet, Maurice 1951 Rapport général sur les travaux du colloque de calcul des probabilités. Pages 3–21 in Congrès International de Philosophie des Sciences, Paris, 1949, Actes. Volume 4: Calcul des probabilités. Paris: Hermann.

Fry, Thornton C. 1934 A Mathematical Theory of Rational Inference. Scripta mathematica 2:205–221.

Gnedenko, Boris V. (1950) 1962 The Theory of Probability. New York: Chelsea. → First published as Kurs teorii veroiatnostei.

Guilbaud, Georges 1961 Faut-il jouer au plus fin? Pages 171–182 in Colloque sur la décision, Paris, 25–30 mai, 1960 [Actes]. France, Centre National de la Recherche Scientifique, Colloques Internationaux, Sciences Humaines. Paris: The Center.

Hacking, Ian 1965 Logic of Statistical Inference. Cambridge Univ. Press.

Jeffreys, Harold (1939) 1961 Theory of Probability. 3d ed. Oxford: Clarendon.

Keynes, John Maynard (1921) 1952 A Treatise on Probability. London: Macmillan. → A paperback edition was published in 1962 by Harper.

Kolmogorov, Andrei N. (1933) 1956 Foundations of the Theory of Probability. New York: Chelsea. → First published in German.

Koopman, Bernard O. 1940a The Axioms and Algebra of Intuitive Probability. Annals of Mathematics Second Series 41:269–292.

Koopman, Bernard O. (1940b) 1964 The Bases of Probability. Pages 159–172 in Henry E. Kyburg, Jr. and Howard E. Smokler (editors), Studies in Subjective Probability. New York: Wiley.

Kyburg, Henry E. Jr.; and Smokler, Howard E. (editors) 1964 Studies in Subjective Probability. New York: Wiley.

Laplace, Pierre Simon De (1814) 1951 A Philosophical Study on Probabilities. New York: Dover. → First published as Essai philosophique sur les probabilités.

Lindley, Dennis V. 1965 Introduction to Probability and Statistics From a Bayesian Viewpoint. 2 vols. Cambridge Univ. Press.

Molina, Edward C. 1931 Bayes’ Theorem. Annals of Mathematical Statistics 2:23–37.

Ramsey, Frank P. (1923–1928) 1950 The Foundations of Mathematics, and Other Logical Essays. New York: Humanities.

Savage, Leonard J. 1954 The Foundations of Statistics. New York: Wiley.

Smith, Cedric A. B. 1961 Consistency in Statistical Inference and Decision. Journal of the Royal Statistical Society Series B 23:1–25.

Van Dantzig, David 1957 Statistical Priesthood: Savage on Personal Probabilities. Statistica neerlandica 11:1–16.

Venn, John (1866) 1962 The Logic of Chance: An Essay on the Foundations and Province of the Theory of Probability, With Special Reference to Its Logical Bearings and Its Application to Moral and Social Science. 4th ed. New York: Chelsea.

Von Mises, Richard (1928) 1961 Probability, Statistics and Truth. 2d ed., rev. London: Allen & Unwin.

Von Neumann, John; and Morgenstern, Oskar (1944) 1964 Theory of Games and Economic Behavior. 3d ed. New York: Wiley.

Wald, Abraham (1950) 1964 Statistical Decision Functions. New York: Wiley.

# Probability Distributions

views updated Jun 27 2018


ASSOCIATED FUNCTIONS

DISTRIBUTIONAL PARAMETERS

MOMENTS, MEAN, AND VARIANCE

SKEWNESS AND KURTOSIS

CUMULANTS

GROUPS AND SUBGROUPS, DISTRIBUTIONS AND SUBDISTRIBUTIONS

SOME IMPORTANT DISTRIBUTIONS

BIBLIOGRAPHY

The fundamental notion in statistics is that of a group (aggregate), which is usually called a population. This denotes a collection of objects, whether animate or inanimate, for example, a population of humans, of plants, of mistakes in reading a scale, and so on. The science of statistics deals with properties of populations, or more precisely, with the data obtained by counting or measuring properties of populations of natural phenomena. Natural phenomena include various happenings of the external world, whether human or not.

Consider a population of members each of which bears some numerical value of a variable (variate), for example, a population of men with measured height. Here the variable is the height of men. We thus have a population of variates (which can be discontinuous [discrete] or continuous). In the continuous case, the number of members possessing a variate value that falls into a given interval of the variate values is called the frequency in that interval. Finally, the manner in which the frequencies are distributed over the intervals is called the frequency distribution (or simply distribution).

To everything in social science there is a distribution. From the characteristics of persons and the aggregates they form (health and wealth, city size and resource endowment, happiness and social harmony) to the characteristics of myriad other entities (stock performance, volatility of money, the number of words in a language), distributions both summarize the characteristics and describe their operation.

Moreover, there are distributions for every kind of variable: continuous or discrete; defined on a support that may be any subset of numbers (positive half-line, full line, positive integers, etc.); displaying a great variety of shapes; symmetric or asymmetric; with zero, one, or many peaks; skewed to the left or the right; possessing one or several modes; and so on. As a rule, continuous distributions are specified by mathematical functions.

Functions of variables and combinations (mixtures) of variables generate new variables whose distributions show the imprint of the input distributions and their interrelations. Because approximation of a variable’s distribution unlocks many doors (in theoretical analysis, where distributions are a prime tool for forecasting and prediction, and in empirical analysis, where distributions usually serve to represent the unobservables), detailed compilations of distributions are a necessary part of the social scientist’s toolkit.

For introduction and comprehensive exposition, see Stuart and Ord (1994), Dwass (1970), and Tsokos (1972), and for rather encyclopedic coverage the volumes in Johnson and Kotz’s series, Distributions in Statistics, and their revisions (e.g., Johnson, Kotz, and Balakrishnan 1994). The handbook by Evans, Hastings, and Peacock (2000) is an appropriate and valuable source for initial study.

## ASSOCIATED FUNCTIONS

A variety of mathematical functions are associated with mathematically specified distributions. The most basic is the distribution function, also known as the cumulative distribution function (cdf). The distribution function may be defined as a mathematical expression that describes the probability that a system (consisting of several components) will take a specific numerical value or set of values (such a varying system is designated as a random variable and usually denoted by capital X, Y, or Z). The description of a system may involve other quantities, and the distribution function may take into account some (or all) of them. In the case of a system with one variable, the distribution function (or cumulative distribution function) is defined as the probability α (0 < α < 1) that the variate X assumes a value less than or equal to x and is usually denoted $F_X(x)$, or simply F(x):

$$F_X(x) = P(X \le x) = \alpha \tag{1}$$

(distinguishing, as customary, between the random variable X and its specific numerical value x). The distribution function of any random variable is a nondecreasing (usually increasing) function of x. The range of its values is [0, 1]. (Remember that the cdf is nothing else but a particular probability.)

Besides the cdf, the behavior of a random variable can be described by an associated function. The most common of the associated functions (and by far the most often graphed) is the probability density function (pdf), denoted f(x). In the case of continuous distributions the pdf is simply the first derivative of the distribution function with respect to the value x of the corresponding variate X. (In discrete distributions it is sometimes called the probability mass function, and is obtained by taking the difference of the consecutive values of the distribution function.) As is evident from its relationship to the distribution function, the pdf, represented graphically as a curve, possesses two main properties: (1) it is nonnegative for all values of x, f(x) ≥ 0 (because the cdf is a nondecreasing function of x); (2) the total area under the curve is 1:

$$\int_{-\infty}^{\infty} f(x)\,dx = 1 \tag{2}$$

By far the most familiar and most widely used of all (continuous) distributional shapes is the symmetric bell-shaped curve depicting the pdf of the normal (or Gaussian) distribution. The normal distribution was popularized in the social sciences by the great Belgian scholar Lambert Adolphe Jacques Quételet (1796–1874).
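The relationship between the pdf and the cdf, and the two properties of the pdf, can be checked numerically for the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper names, the evaluation point 0.7, and the integration range are arbitrary choices made for illustration.

```python
from statistics import NormalDist

Z = NormalDist(mu=0.0, sigma=1.0)  # standard normal distribution

def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation to the derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def area_under_pdf(dist, lo=-10.0, hi=10.0, steps=20_000):
    """Trapezoidal approximation to the integral of the pdf over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.5 * (dist.pdf(lo) + dist.pdf(hi))
    for i in range(1, steps):
        total += dist.pdf(lo + i * width)
    return total * width

slope = numerical_derivative(Z.cdf, 0.7)   # derivative of the cdf at x = 0.7
density = Z.pdf(0.7)                       # pdf at the same point
area = area_under_pdf(Z)                   # should be very close to 1

print(slope, density, area)
```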

Other popular associated functions used in social science are (1) the quantile function, which, inter alia, provides the foundation for distributional measures of inequality, such as Pen’s Parade, and (2) the hazard function, formally defined as the ratio of the pdf to 1 minus the cdf: f(x)/[1 − F(x)]. The denominator, 1 − F(x), is also known as the reliability function and as the survival function and denoted by S(x).
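Both functions can be sketched in a few lines. The quantile function is illustrated with the standard normal (via `NormalDist.inv_cdf` from Python's standard library), and the hazard function with an exponential distribution whose rate of 2.0 is an arbitrary value assumed for the example; the helper names are likewise hypothetical. For the exponential, the ratio f(x)/[1 − F(x)] comes out the same at every x.

```python
import math
from statistics import NormalDist

# Quantile function of the standard normal: the inverse of its cdf.
Z = NormalDist()
median = Z.inv_cdf(0.5)                # quantile at probability 0.5
round_trip = Z.cdf(Z.inv_cdf(0.9))     # applying the cdf to a quantile recovers the probability

# Hazard function of an exponential distribution with an assumed rate.
RATE = 2.0

def exp_pdf(x):
    return RATE * math.exp(-RATE * x)

def exp_cdf(x):
    return 1.0 - math.exp(-RATE * x)

def survival(x):
    """S(x) = 1 - F(x), the survival (reliability) function."""
    return 1.0 - exp_cdf(x)

def hazard(x):
    """Ratio of the pdf to the survival function."""
    return exp_pdf(x) / survival(x)

hazards = [hazard(x) for x in (0.1, 1.0, 3.0)]
print(median, round_trip, hazards)
```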

All the associated functions are related to each other. For example, as already noted, among continuous distributions, the probability density function is the first derivative of the distribution function with respect to x. The quantile function, variously denoted G(α) or Q(α) or $F^{-1}(\alpha)$, is the inverse of the distribution function, providing a mapping from the probability α (0 < α < 1) to the quantile value x. Recall that the values of a distribution function lie in [0, 1]. If the distribution function rises steadily from 0 to 1, there is a unique number $x_\alpha$ for each α in the interval (0, 1) such that

$$F(x_\alpha) = P(X \le x_\alpha) = \alpha \tag{3}$$

The number $x_\alpha$ is a number in the set of values of X, and it is the value such that the fraction α (0 < α < 1) of the total probability (which is 1) is assigned to the interval (−∞, $x_\alpha$). This number is the αth quantile of the distribution function defined above, $F_X(x)$. (For a standard normal distribution at probability α = .5, the value of the quantile function is 0, given that the total probability under the normal curve [which ranges from −∞ to +∞] up to value 0 is .5.) Important relations between the quantile and distribution functions are (Eubank 1988; Evans, Hastings, and Peacock 2000):

$$F(Q(\alpha)) = \alpha, \qquad Q(F(x)) = x \tag{4}$$

## DISTRIBUTIONAL PARAMETERS

Quantities which appear explicitly in the expression of the distribution function are called parameters. Distributions usually have associated with them a number of parameters. Of these, three are regarded as basic: the location, scale, and shape parameters. The location parameter is a particular point in the variate’s domain, and the scale and shape parameters govern the scale and the shape, respectively. Variates differ in the number and kind of basic parameters. For example, the normal (Gaussian) distribution has two parameters, a location parameter (the mean) and a scale parameter (the standard deviation); the Pareto distribution has a location parameter and a shape parameter; and the gamma is a three-parameter distribution (with location, scale, and shape parameters). The gamma distribution is sometimes specified as a two-parameter distribution (possessing scale and shape parameters, with the location parameter being 0).

## MOMENTS, MEAN, AND VARIANCE

Borrowing from physics the idea of the moments of a function, the mean (or expected value) of a distribution (also referred to as the mean of a random variable) is defined as the first moment about the origin (0):

$$\mu = E(X) = \int_{-\infty}^{\infty} x\, f(x)\,dx \tag{5}$$

(representing an ideal or a theoretical average and characterizing the central tendency). The variance is defined as the second moment about the mean:

$$\sigma^2 = E\left[(X - \mu)^2\right] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx \tag{6}$$

(representing dispersion of the random variable around the expected value). The square root of the variance is known as the standard deviation and is usually denoted σ.

## SKEWNESS AND KURTOSIS

Two additional quantities are the coefficient of skewness, denoted $\sqrt{\beta_1}$ (its square is denoted $\beta_1$):

$$\sqrt{\beta_1} = \frac{E\left[(X - \mu)^3\right]}{\sigma^3} \tag{7}$$

and the coefficient of kurtosis, denoted $\beta_2$:

$$\beta_2 = \frac{E\left[(X - \mu)^4\right]}{\sigma^4} \tag{8}$$

The numerators of the coefficients of skewness and kurtosis are the third and fourth moments about the mean, respectively, and the denominators are the third and fourth powers, respectively, of the standard deviation. The coefficient of skewness measures the relative asymmetry and the coefficient of kurtosis measures the peakedness (and humpness) of the distribution. For the normal distribution, the kurtosis is

$$\beta_2 = 3 \tag{9}$$

Evidently, for any symmetric distribution the skewness equals 0, given that every odd moment about the mean is zero in this case. Because the normal distribution has kurtosis 3, the coefficient of kurtosis is sometimes defined as the expression in (8) minus 3 (usually denoted by γ2), which may lead to ambiguity. (The Greek word κυρτός means humped.)
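The definitions of the mean, variance, skewness, and kurtosis can be illustrated on a small discrete distribution, where the integrals become sums. The sketch below uses a fair six-sided die, chosen only as a convenient example; the function and variable names are hypothetical. Exact fractions are used so that the symmetry of the distribution shows up as a third central moment of exactly zero.

```python
from fractions import Fraction

def central_moment(values, probs, k):
    """k-th moment about the mean of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum((v - mean) ** k * p for v, p in zip(values, probs))

# A fair die: faces 1..6, each with probability 1/6 (illustrative choice).
faces = [Fraction(i) for i in range(1, 7)]
probs = [Fraction(1, 6)] * 6

mean = sum(v * p for v, p in zip(faces, probs))   # first moment about the origin
variance = central_moment(faces, probs, 2)        # second moment about the mean
mu3 = central_moment(faces, probs, 3)             # third moment about the mean
mu4 = central_moment(faces, probs, 4)             # fourth moment about the mean

sigma = float(variance) ** 0.5
skewness = float(mu3) / sigma ** 3                # third moment over sigma cubed
kurtosis = mu4 / variance ** 2                    # fourth moment over sigma to the fourth
excess_kurtosis = kurtosis - 3                    # the "minus 3" convention

print(mean, variance, skewness, float(kurtosis), float(excess_kurtosis))
```

The negative value of the kurtosis minus 3 reflects that the flat distribution of a die is less peaked than the normal.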

## CUMULANTS

Cumulants (also known as semi-invariants) are simple functions of moments having useful theoretical properties. Unlike the moments, all the cumulants (except the first) are independent of the origin of calculations, so it is unnecessary to specify the origin in giving their values.
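The independence from the origin can be seen in a short sketch. It relies on the standard expressions of the first four cumulants in terms of central moments (the mean, the variance, the third central moment, and the fourth central moment minus three times the squared variance); the three-point distribution and the shift of 10 are arbitrary values assumed for illustration.

```python
from fractions import Fraction

def cumulants(values, probs):
    """First four cumulants of a discrete distribution, via its central moments."""
    mean = sum(v * p for v, p in zip(values, probs))

    def mu(k):
        return sum((v - mean) ** k * p for v, p in zip(values, probs))

    return (mean, mu(2), mu(3), mu(4) - 3 * mu(2) ** 2)

# An arbitrary three-point distribution (assumed for the example).
values = [Fraction(0), Fraction(1), Fraction(4)]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]

original = cumulants(values, probs)
shifted = cumulants([v + 10 for v in values], probs)   # move the origin by 10

print(original)
print(shifted)
```

Only the first cumulant changes under the shift; the raw moments about the origin, by contrast, would all change.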

## GROUPS AND SUBGROUPS, DISTRIBUTIONS AND SUBDISTRIBUTIONS

Groups and subgroups are pivotal in social science, and the distributional operations of censoring and truncation serve to analyze subgroup structures. Using what is by now standard terminology (Gibbons 1988, p. 355), let censoring refer to selection of units by their ranks or percentage (or probability) points; and let truncation refer to selection of units by values of the variate. Thus, the truncation point is the value x separating the subdistributions; the censoring point is the percentage point α separating the subdistributions. For example, the subgroups with incomes less than \$30,000 or greater than \$90,000 each form a truncated subdistribution; the top 2 percent and the bottom 5 percent of the population each form a censored subdistribution.
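The distinction between the two operations can be sketched on a small list of incomes. The figures and function names below are invented for illustration: truncation selects units by the value of the variate, whereas censoring selects them by rank, that is, by percentage point.

```python
# Hypothetical incomes for ten units (illustrative data only).
incomes = [12_000, 25_000, 28_000, 41_000, 56_000,
           73_000, 95_000, 120_000, 64_000, 33_000]

def truncate_below(values, point):
    """Truncation: keep the units whose variate value is less than the truncation point."""
    return sorted(v for v in values if v < point)

def censor_bottom(values, percent):
    """Censoring: keep the bottom `percent` of units by rank."""
    ranked = sorted(values)
    count = len(ranked) * percent // 100
    return ranked[:count]

low_income = truncate_below(incomes, 30_000)   # selected by value
bottom_fifth = censor_bottom(incomes, 20)      # selected by rank

print(low_income)
print(bottom_fifth)
```

The truncated subgroup can contain any number of units, depending on where the values fall, while the size of the censored subgroup is fixed by the percentage.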

There is a special link between these two kinds of subgroup structure and Blau’s (1974) pioneering observation that much of human behavior can be traced to the differential operation of quantitative and qualitative characteristics. Quantitative characteristics, both cardinal characteristics (such as wealth) and ordinal characteristics (such as intelligence and beauty), generate both truncated and censored subdistribution structures. For example, the subgroups rich and poor may be generated by reference to an amount of income or by reference to percentages of the population. However, qualitative characteristics (such as race, ethnicity, language, and religion) may be related so tightly to quantitative characteristics that the subgroups corresponding to the categories of the qualitative characteristic are nonoverlapping and thus provide the basis for generating censored subdistribution structures. For example, in caste, slavery, or segmented societies, the subdistribution structure of a quantitative characteristic may be a censored structure in which the percentages pertain to the subsets formed by a qualitative characteristic, such as slave and free or immigrant and native.

## SOME IMPORTANT DISTRIBUTIONS

The number of probability distributions appearing in the literature is now by a conservative estimate far more than 300, and climbing. The standard compendia highlight the basic forty or fifty distributions, and among these, the most important twenty or so appear in all textbooks and in all languages. Chronologically, the earliest distributions that were scrupulously investigated, starting from the early eighteenth century, were the normal (Gaussian) distribution and, to a lesser extent, the Cauchy distribution. The Pareto distribution was proposed by the Italian-Swiss economist and sociologist Vilfredo Pareto (1848–1923) in the late nineteenth century, and the Weibull is of more recent origin, developed by the Swedish engineer Waloddi Weibull (1887–1979) in the mid-twentieth century. Information about univariate and multivariate discrete distributions and multivariate continuous distributions is by now easily available, and these are not discussed here due to space constraints. The 2000 handbook by Evans, Hastings, and Peacock provides a lucid introduction to some forty widely used distributions. A more extensive source is the six-volume compendium by Kotz, Johnson, Balakrishnan, and Kemp under the overall title Distributions in Statistics, 2nd ed., 1995–2002. The Dictionary and Classified Bibliography of Statistical Distributions in Scientific Work, edited by Patil, Boswell, Joshi, Ratnaparkhi, and Roux, 1984, is a handy reference source.

We now present a brief description of twelve selected basic continuous univariate distributions. Table 1 reports their probability density function, mean, and variance (for a more detailed table, see e.g. Tsokos 1972, which also provides graphical display).

Uniform Distribution. The uniform is a natural conception that has been in use since before printed records. Its applications include corrections for grouping, life testing, traffic-flow analyses, and round-off errors, and it provides a model for the set of relative ranks in a group or population.

Normal Distribution. As indicated above, the normal (also Gaussian, Laplace-Gaussian, and Gaussian-Laplace) is the most important distribution in probability theory and in mathematical as well as applied statistics. The normal's density is symmetric and bell-shaped. Values for the density, cumulative, and inverse of the standard form are extensively tabulated. The importance of the normal distribution is due to the fact that, under general conditions, the suitably standardized sum of many independent variables tends, as the number of variables increases, to the normal. The relevant conditions are provided in the central limit theorem.
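The tendency of sums toward the normal can be checked numerically. The following is a minimal sketch, not a proof: it assumes Uniform(0, 1) summands, 30 terms per sum, and a fixed random seed, all of which are illustrative choices rather than anything prescribed by the text.

```python
import random
import statistics

def standardized_sum(n, rng):
    """Sum n independent Uniform(0, 1) draws, then center and scale the sum."""
    total = sum(rng.random() for _ in range(n))
    mean = n * 0.5                  # mean of the sum of n uniforms
    sd = (n / 12.0) ** 0.5          # standard deviation of that sum
    return (total - mean) / sd

rng = random.Random(0)              # fixed seed so the run is repeatable
samples = [standardized_sum(30, rng) for _ in range(20000)]

sample_mean = statistics.fmean(samples)
sample_sd = statistics.pstdev(samples)
# Share of draws within one standard deviation of zero;
# a standard normal variate gives roughly 0.683.
within_one_sd = sum(abs(z) <= 1.0 for z in samples) / len(samples)

print(sample_mean, sample_sd, within_one_sd)
```

If the approximation is working, the printed mean should be close to 0, the standard deviation close to 1, and the last figure close to the normal value of about 0.683.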

Lognormal Distribution. The lognormal distribution is defined on the positive support; it is unimodal with a long right tail. Its name refers to the fact that the logarithm of a lognormal variate has a normal distribution. The lognormal is widely used in the biological, physical, and sociobehavioral sciences, including economics, and even in philology. It serves as a model for income, physicians' consulting time, sickness absence, number of persons in a census occupational category, height and weight, automobile insurance claims payments, and size of oil and gas deposits.

Exponential Distribution. The exponential (more precisely, the negative exponential) has a mode at the origin (x = 0) and a long right tail. It is widely used in studies of lifetimes, life testing, and life characteristics, producing usable approximate solutions to difficult distributional problems. It also provides the model for social status in the case where status arises from one (or several perfectly positively related) personal characteristic(s). Its mirror image, the positive exponential, has a mode at its upper bound and a long left tail. Justice processes generate both negative exponential and positive exponential distributions. Recent extensions by Jasso and Kotz (2007) are geared towards applications in the social sciences (see below).

Weibull Distribution. The Weibull is an asymmetric single-peaked distribution defined on the positive support. It is widely used in analyses of reliability problems, the theory of sound, health phenomena, human performance, duration of industrial stoppages, and migratory systems.

Table 1. Probability Density Function, Mean, and Variance of Some Important Continuous Univariate Distributions

| Variate | Support and parameters | Probability density function | Mean | Variance |
|---|---|---|---|---|
| Uniform (rectangular) | a < x < b | … | … | … |
| Uniform, standard form | 0 < x < 1 | 1 | … | … |
| Normal (Gaussian) | −∞ < x < ∞ | … | μ | σ² |
| Normal, standard form | −∞ < x < ∞ | … | 0 | 1 |
| Lognormal | x > 0, c > 0 | … | μ | … |
| Weibull | x > 0, λ > 0, k > 0 | … | … | … |
| Weibull, standard form | x > 0, k > 0 | … | … | … |
| Exponential | x > 0, λ > 0 | … | … | … |
| Exponential, standard form | x > 0 | exp(−x) | 1 | 1 |
| Gamma (2-parameter) | x > 0, λ > 0 | … | … | … |
| Gamma, standard form | x > 0, c > 0 | … | c | c |
| Pareto | … | … | μ | … |
| Pareto, standard form | x > 1, c > 1 | … | … | … |
| Laplace (double-exponential) | −∞ < x < ∞, b > 0 | … | μ | 2b² |
| Laplace, standard form | −∞ < x < ∞ | … | 0 | 2 |
| Beta | a < x < b, p > 0, q > 0 | … | … | … |
| Beta, standard form | 0 < x < 1, p > 0, q > 0 | … | … | … |
| Logistic | −∞ < x < ∞, b > 0 | … | μ | … |
| Logistic, standard form | −∞ < x < ∞ | … | 0 | … |
| Cauchy | −∞ < x < ∞ | … | … | … |
| Cauchy, standard form | −∞ < x < ∞ | … | … | … |
| Power-function | … | … | μ | … |
| Power-function, standard form | 0 < x < 1, c > 0 | cx^(c−1) | … | … |
| Equal (Dirac delta) | … | … | μ | 0 |

Notes: In the formula for the beta distribution, B denotes the beta function, and Γ(·) appearing in formulas for the gamma and Weibull distributions denotes the gamma function.

Gamma Distribution. The gamma is defined on the positive support; it is unimodal and asymmetric and has a long right tail. It includes as special cases the exponential, whose mode is at the origin, and the Erlang distribution, whose shape parameter c is an integer. The gamma is used to represent lifetimes and personal income, as well as daily demand for electrical power and the distribution of single species abundances at equilibrium. It arises also in the study of social status, where it provides a model of the case where status is generated by two or more independent characteristics, and in the study of justice, where it provides a model of the case where the justice evaluation is generated by two or more independent ordinal characteristics.
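The two special cases of the gamma mentioned above can be verified directly from the densities. The sketch below assumes a shape-rate parameterization (shape c, rate lam); this is one common convention and may differ from the one used in Table 1. The function names are hypothetical.

```python
import math

def gamma_pdf(x, c, lam=1.0):
    """Gamma density with shape c and rate lam (an assumed parameterization)."""
    return lam ** c * x ** (c - 1) * math.exp(-lam * x) / math.gamma(c)

def exponential_pdf(x, lam=1.0):
    """Negative exponential density with rate lam."""
    return lam * math.exp(-lam * x)

def erlang_pdf(x, k, lam=1.0):
    """Erlang density: gamma with integer shape k, so that Gamma(k) = (k - 1)!."""
    return lam ** k * x ** (k - 1) * math.exp(-lam * x) / math.factorial(k - 1)

xs = [0.1, 0.5, 1.0, 2.5, 7.0]
# Shape c = 1 should reproduce the exponential density.
exp_gap = max(abs(gamma_pdf(x, 1.0, 2.0) - exponential_pdf(x, 2.0)) for x in xs)
# Integer shape c = 3 should reproduce the Erlang density.
erl_gap = max(abs(gamma_pdf(x, 3.0, 2.0) - erlang_pdf(x, 3, 2.0)) for x in xs)

print(exp_gap, erl_gap)
```

Both gaps should be zero up to floating-point rounding, which is all the check is meant to show.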

Pareto Distribution. The Pareto has a mode at its positive origin and a very long right tail. It is used to model personal income, firm size, city size, and occurrence of natural resources. Because the Pareto has a positive infimum, it is ideal for modeling income distributions that have a safety net. Recently it has been used in connection with the random walk hypothesis of common stock prices.

Laplace Distribution (Double Exponential). The Laplace is a symmetric distribution with a sharp point at its mode; it arises, inter alia, as the difference between two independent, identically distributed exponential variates. It is similar to the normal, but the smooth top of the bell is replaced by a needle peak. It serves as a prior distribution in Bayesian statistical analysis. It is used as a substitute for the normal in robust statistical analysis, and provides a model for demand during lead time for slow-moving items. It also arises in the study of justice, in the case where both actual incomes and personal ideas of just incomes are independently and identically Pareto distributed; the asymmetrical Laplace form arises in the case where actual incomes and personal ideas of just incomes are independently and nonidentically Pareto distributed.
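The origin of the Laplace as a difference of exponentials can be illustrated by simulation. This is a rough sketch under stated assumptions: both exponentials have rate 1 (so the difference should follow the standard Laplace form, with mean 0 and variance 2), and the seed and sample size are arbitrary.

```python
import math
import random
import statistics

rng = random.Random(1)              # fixed seed so the run is repeatable
n = 20000
# Difference of two independent standard exponential draws.
diffs = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]

sample_mean = statistics.fmean(diffs)
sample_var = statistics.pvariance(diffs)

def laplace_cdf(x):
    """CDF of the standard Laplace distribution (location 0, scale 1)."""
    return 0.5 * math.exp(x) if x < 0 else 1.0 - 0.5 * math.exp(-x)

# Compare the empirical share of draws at or below 1 with the Laplace CDF.
empirical = sum(d <= 1.0 for d in diffs) / n
theoretical = laplace_cdf(1.0)

print(sample_mean, sample_var, empirical, theoretical)
```

The sample mean and variance should land near 0 and 2, and the empirical and theoretical shares should roughly agree; exact figures depend on the seed.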

Beta Distribution. The beta is a very flexible family, being a generalization of the uniform distribution. It provides the prior distribution for binomial proportions and serves in models and analyses of hydrologic variables, project planning/control systems (such as PERT), tool wear, construction duration, HIV transmission, traffic flow, and risk analysis for strategic planning. Via its special case, the power-function distribution, it also is used to model the income distribution.

Logistic Distribution. The logistic is a symmetric unimodal distribution defined on the real line. It is used in analyses of growth (including the growth of human populations), quantal response data, psychological issues, weight gain, and physiochemical phenomena. It also is sometimes used as a substitute for the normal.

Cauchy Distribution. The Cauchy is a symmetric unimodal distribution defined on the real line. Its density is similar to the normal's but with thicker tails. It has the interesting (and, in applications, restrictive) property that its moments, including the expected value, are not defined. Advances in computational procedures diminish the effect of the absence of moments, and the Cauchy distribution is nowadays often used (in particular in financial applications) as an alternative to the normal distribution.

Power-Function Distribution. The power-function, a special case of the beta, is defined on the positive support; it can have a left tail or a right tail, depending on whether its shape parameter is larger or smaller than 1. When the shape parameter is 1, the power-function becomes the uniform distribution. Because the power-function has a supremum, it is appropriate for modeling situations marked by scarcity.
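Both claims about the power-function (that it is a special case of the beta, and that shape parameter 1 gives the uniform) can be checked from the standard-form densities. The sketch below assumes the standard-form beta density x^(p−1)(1−x)^(q−1)/B(p, q) on 0 < x < 1 and the standard-form power-function density cx^(c−1) listed in Table 1; the function names are hypothetical.

```python
import math

def beta_fn(p, q):
    """Beta function B(p, q) expressed through the gamma function."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

def beta_pdf(x, p, q):
    """Standard-form beta density on 0 < x < 1."""
    return x ** (p - 1) * (1 - x) ** (q - 1) / beta_fn(p, q)

def power_function_pdf(x, c):
    """Standard-form power-function density c * x**(c - 1) on 0 < x < 1."""
    return c * x ** (c - 1)

xs = [0.05, 0.25, 0.5, 0.75, 0.95]
# A beta with q = 1 should reproduce the power-function density with c = p.
gap = max(abs(beta_pdf(x, 2.5, 1.0) - power_function_pdf(x, 2.5)) for x in xs)
# Shape parameter 1 should give the flat uniform density.
uniform_values = [power_function_pdf(x, 1.0) for x in xs]

print(gap, uniform_values)
```

The gap should vanish up to rounding, and every entry of `uniform_values` should equal 1.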

To summarize, the normal, Cauchy, Laplace, and logistic distributions are defined for all real values; the exponential, Weibull, gamma, and lognormal distributions are defined for all positive values; the Pareto distribution is defined for all positive values larger than a specified number; and the beta and continuous uniform distributions are defined on an interval of specified length.

Some of the distributions in the list are rivals for modeling some phenomena. For example, the gamma and lognormal are competitors for modeling size distributions, and the exponential and Weibull are competitors for modeling reliability. Many other distributions are used in social science to model sociobehavioral phenomena, and still more distributions arise from sociobehavioral operations. For example, status processes generate, besides the Erlang, a general Erlang and new variates called ring-exponential and mirror-exponential distributions (Jasso and Kotz 2007). Further, in addition to the pivotal normal, other distributions used in mathematical and applied statistics include the chi-squared, Student's t, and the F distribution (central and noncentral).

Finally, special mention must be made of the equal distribution (sometimes called degenerate when defined as discrete, and Dirac's delta when defined as continuous), which provides a model for a perfectly equal distribution and thus serves as a benchmark in analyses of social inequality (Jasso 1980; Jasso and Kotz 2007).

SEE ALSO Bayesian Econometrics; Bayesian Statistics; Central Limit Theorem; Distribution, Normal; Distribution, Poisson; Distribution, Uniform; Method of Moments; Pareto, Vilfredo; Probabilistic Regression

## BIBLIOGRAPHY

Blau, Peter M. 1974. Presidential Address: Parameters of Social Structure. American Sociological Review 39: 615–635.

Dwass, Meyer. 1970. Probability: Theory and Applications. New York: W. A. Benjamin.

Eubank, Randall L. 1988. Quantiles. In Encyclopedia of Statistical Sciences, Vol. 7, eds. Samuel Kotz, Norman L. Johnson, and Campbell B. Read, 424–432. New York: Wiley.

Evans, Merran, Nicholas Hastings, and Brian Peacock. 2000. Statistical Distributions. 3rd ed. New York: Wiley.

Gibbons, Jean Dickinson. 1988. Truncated Data. In Encyclopedia of Statistical Sciences, Vol. 9, eds. Samuel Kotz, Norman L. Johnson, and Campbell B. Read, 355. New York: Wiley.

Jasso, Guillermina. 1980. A New Theory of Distributive Justice. American Sociological Review 45: 3–32.

Jasso, Guillermina, and Samuel Kotz. 2007. A New Continuous Distribution and Two New Families of Distributions Based on the Exponential. Statistica Neerlandica 61(3): 305–328.

Johnson, Norman L., Samuel Kotz, and N. Balakrishnan. 1994. Continuous Univariate Distributions, Vol. 1. 2nd ed. New York: Wiley.

Patil, Ganapati P. 1984. Dictionary and Classified Bibliography of Statistical Distributions in Scientific Work. Vols. 1–3. Burtonsville, MD: International Cooperative Publishing House.

Stuart, Alan, and J. Keith Ord. 1994. Distribution Theory. Vol. 1 of Kendall's Advanced Theory of Statistics, 6th ed. London: Edward Arnold.

Tsokos, Chris P. 1972. Probability Distributions: An Introduction to Probability Theory with Applications. Belmont, CA: Duxbury Press.

Guillermina Jasso

Samuel Kotz

# Probability Theory

views updated May 29 2018

With the identification of Huygens's 1657 book De ratiociniis in ludo aleae as its first text, Ian Hacking characterizes the decade around 1660 as the decade of the birth of probability. He chooses to bring the story of its emergence to an end with the year of publication of Jacques Bernoulli's Ars Conjectandi in 1713:

In that year probability came before the public with a brilliant portent of all the things we know about it now: its mathematical profundity, its unbounded practical applications, its squirming duality, and its constant invitation for philosophizing (Hacking 1975, p. 143).

Hacking structures his prehistory, a prehistory more important than the history, around the dual notions of the epistemic versus the aleatory: the degree of belief warranted by evidence versus the tendency displayed by some chance devices to produce stable relative frequencies (1975, p. 1).

Augustin Cournot (1843) and Francis Edgeworth (1884, 1922), and following them in the second quarter of the twentieth century, John Maynard Keynes and Frank P. Ramsey, distinguished students of economy and society, all had a deep and abiding interest in probability theory, but struggled with the definition of their (instrumental) subject. Edgeworth, in particular, settled in his 1884 paper on the description of probability as importing partial incomplete belief, but was unsure about how far the gradations of belief are a subject of science (p. 223). Returning to the term, and to the philosophy of chance forty years later, he flatly stated that probability seems not to admit of definition (Edgeworth 1922, p. 257). Already, his 1911 entry on probability and expectation for the Encyclopedia Britannica had opened with the following demurral:

As in other mathematical sciences, so in probabilities, or even more so, the philosophical foundations are less clear than the calculations based thereon. On this obscure and controversial topic, absolute uniformity is not to be expected (p. 376).

In his biographical essay on Ramsey, Keynes responded to Ramsey's (1922, 1926) critique of his objective theory of probability with a wary ambivalence:

The calculus of probabilities belongs to formal logic. But the basis of our degrees of belief – or the a priori probabilities, as they used to be called – is part of our human outfit, perhaps given us merely by natural selection, analogous to our perceptions and our memories rather than to formal logic. So far I yield to Ramsey – I think he is right. But in attempting to distinguish rational degrees of belief from belief in general he was not yet I think quite successful (Keynes 1933, pp. 338–339).

In her discussion of the subjective theory of probability, Maria Galavotti alluded to Ramsey's scepticism concerning a single notion of probability equally applicable in logic and in physics, and quoted Bruno de Finetti's bald antirealist claim that probability does not exist (Galavotti 1991, pp. 241, 246). Colin Howson's (1995) conclusion, then, that the foundations of probability have not yet entered a final stable phase was only fitting.

Whatever the final phase and the precise definition, equipossibility and asymptotics are identified as foundational in any application of the subject. In his endorsement of the equal-treatment property, and of the utilitarian who thinks it fair to treat as equals those between whom no material difference is discerned, [to] treat as equals things which are not known to be unequal, Edgeworth relied on the standard of statistical uniformity to consider John Venn's claim that whereas full belief about an event is either verified or disproved by the event, fractional belief can only be verified or disproved by a series of events (Edgeworth 1884, pp. 234, 225). It is of interest to trace the evolution of this Laplacian idea to the role it plays as a principle of indifference in Keynes's objective theory, and to the assignment of equal initial degrees of confirmation in Rudolf Carnap's system of inductive logic based on logical probability (see Gillies 2000, chapters 2 and 3; Ayer 1963, chapters 7 and 8; and Ayer 1972, chapter 2). Hacking, too, devotes discussion to equally possible cases, before moving on to the first limit theorem (Hacking 1975, chapters 14 and 17). In his history of the subject in the nineteenth century, he observes an attitude, especially among the French:

When there are enough events they display regularities. This law passed beyond a mere fact of experience. It was not something to be checked against experience; it was the way things had to be. The law of large numbers became a metaphysical truth (Hacking 1990, p. 104).

By the middle of the twentieth century, however, the logical and mathematical presuppositions of such a law were well-understood (see Ayer 1972, section 2.D).

It is now conventional to see Andrei Kolmogorov (1933) as having laid the mathematical foundation of the subject in the theory of measure, and as having brought probability into the mathematical mainstream by providing a rigorous framework for the study of an infinite sequence of coin tosses, and even for an uncountably indexed set of trials. Authoritative texts abound for the mathematics of probability, and the tables of contents in Jeffrey Rosenthal's A First Look at Rigorous Probability Theory (2000) or in A. V. Skorokhod's Basic Principles and Applications of Probability Theory (2004), for example, bring out what is now considered to be the standard subject matter. What is important is that, rather than attempting to put probability on the same rigorous basis as the rest of mathematics, after Kolmogorov the question turns to the insights that probability can give to, rather than take from, the rest of mathematics: analysis, dynamical systems, optimization, and even number theory and geometry (see Lasota and Mackey 1994; de Melo and van Strien 1993; Steele 1997; and Dajani and Kraaikamp 2002 and their references). The subject has attained such maturity that it can be studied solely through counterexamples, as in Jordan Stoyanov (1987), but as Joseph Doob (1994a) documents, this mathematical coming-of-age has resulted in some tension between probabilists and measure-theorists. Indeed the issue (a mini-issue, really) reduces to the essential difference between a measure and a probability, a difference encapsulated in the concept of independence and in a bounded rather than an unbounded measure: for the former, Mark Kac's (1964, 1985) elegant emphasis is unparalleled, and for the latter, one can hardly do better than begin with a comparison of Walter Rudin (1987) and Doob (1994b).

A subsidiary question then arises as to the evolution and the autonomy of the subject of statistics as distinct from that of probability. If the notion of independence is a synecdoche for one, is the notion of sufficiency that for the other? If Hacking (1975) is devoted to one, is Hacking (1990) devoted to the other? It is interesting that Ramsey, de Finetti, and Leonard Savage do not make it to the index of Hacking's investigation of statistical fatalism, and how an avalanche of numbers turned rational moral science into empirical moral science (1990, pp. x, viii). However, if the distinction between the theories of probability and measure is a mini-issue, the distinction between probability and statistics is almost surely a nonissue, and it would be a naive anthropology indeed that uses professional societies and journals to refute the fact of one community finding its true identity in the other (see Khan 1993 in this direction). In any case, current conventions see Savage (1954, 1962) as consolidating the earlier insights into a strongly contending, if not dominant, framework for statistical decision theory, one in which utility and (finitely additive as opposed to countably additive) probability are intertwined (see Hacking 1965, chapter 13 for evaluation and possible synthesis).

Savage (1954) is also seen as bringing to culmination the original ideas of Ramsey and de Finetti and providing a founding text for individual decision making. His text thus fulfills Edgeworth's promise of a mixed science of probability and utility: of what Laplace (1814) calls espérance, the product of probability and utility; that quantity which to maximize is the main problem of the Art of Measurement – of the art proper (Edgeworth 1884, p. 235). The modern twist lies in the use, nothing if not dramatic, of the theory of expected utility to depart from the notion of expectation as Laplace defined it, and to construct a theory of nonexpected utility. After an identification of Kreps (1988), Karni and Schmeidler (1991), and Machina and Schmeidler (1992) as the relevant texts, I move on.

With the wholesale importation of the idea of a continuum of agents into more applied areas of economics (macroeconomics certainly, but also the economics of labor and industry, albeit in a framework of identical agents buffeted by independent exogenous shocks), a law of large numbers for a continuum of random variables became an instrumental necessity. The models were geared to exploit the plausible intuition that aggregation removed idiosyncratic uncertainty, and it was difficult to see how the averaging operation of Laplace could not cancel out errors without a tilt or a dependence to them, not only asymptotically but also for a suitably idealized limit. In other words, there was a demand for a framework that was hospitable to the averaging of an independent continuum, and could thereby execute what the Lebesgue probability measure could not, and cannot, do (on this, see Doob 1994a; Khan and Sun 1999; and Sun 2006). In a landmark paper, Peter Loeb (1975) offered such a framework, and Yeneng Sun exploited it to deliver not only a law of large numbers, but also a variety of novel probabilistic patterns and dualities concerning independence and exchangeability (see Sun 1998a, 1998b). R. J. Aumann's limitation on the diversity of agent characteristics through the assumption of a finite number of commodities is now translated into a possibly analogous limitation that requires the order of averaging over names and states of nature not to matter, a limitation that yokes Fubini to an event space richer than the one conventionally constructed for products (see Hammond and Sun 2003a, 2003b, 2006a, 2006b). Thus, in its needs and demands for a limit law of great numbers, probability theory was led to supply new adjectives for a probability measure, and thereby surely to go beyond the original Kolmogorov axioms (in addition to Fajardo's 1999 overview, see Khan and Sun 2002 and Sun 2006).

Stochastic dynamics, and a surge of interest in functional analysis, had already led probability theory to a study of function-valued and vector-valued random variables, as in Joseph Diestel and Jerry Uhl (1977); after Aumann's formalization of Edgeworth's conjecture (see Anderson 1991 for details and references), economic theory led it to a study of set-valued random variables. Laplace's espérance was correspondingly generalized to sets, as was the induced law of a random variable and the law of large numbers; once random sets are made tractable, it is a small step to the consideration of random preferences, random economies, set-valued martingales, and a variety of other set-valued notions. (In addition to Khan and Sun 2002, see Sun 1999, Majumdar and Rotar 2000, and Bhattacharya and Majumdar 2004.)

We conclude this entry by pointing out that its narrower compass has forced a neglect of the economics of information and the affiliated fields of game theory and finance, let alone the manifold applications to sociology, behavioral psychology, quantum mechanics, and evidence-based bio-medical sciences. But even a briefly adequate view of these other applications surely requires another entry.

SEE ALSO Classical Statistical Analysis; Keynes, John Maynard; Probability; Probability Distributions; Risk; Statistics; Statistics in the Social Sciences; Uncertainty

## BIBLIOGRAPHY

Anderson, Robert M. 1991. Non-Standard Analysis with Applications to Economics. In Handbook of Mathematical Economics, Vol. 4, eds. Werner Hildenbrand and Hugo Sonnenschein, 2145-2208. New York: North-Holland.

Artstein, Zvi. 1983. Distributions of Random Sets and Random Selections. Israel Journal of Mathematics 46: 313-324.

Aumann, R. J. 1965. Integrals of Set-Valued Functions. Journal of Mathematical Analysis and Applications 12: 1–12.

Ayer, Alfred J. 1963. The Concept of a Person. New York: St. Martin's Press.

Ayer, Alfred J. 1972. Probability and Evidence. New York: Columbia University Press.

Bhattacharya, Rabi, and Mukul Majumdar. 2004. Dynamical Systems Subject to Random Shocks: An Introduction. Economic Theory 23: 1–12.

Cournot, Augustin A. [1843] 1984. Exposition de la Théorie des Chances et des Probabilités. Oeuvres de Cournot, Vol. 1. Paris: J. Vrin.

Dajani, Karma, and Cor Kraaikamp. 2002. Ergodic Theory of Numbers. Washington, DC: Mathematical Association of America.

Debreu, Gérard. 1967. Integration of Correspondences. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability 1 (2): 351–372. Berkeley: University of California Press.

De Finetti, Bruno. 1964. La Prévision: ses lois logiques, ses sources subjectives (Foresight: Its Logical Laws, Its Subjective Sources). In Studies in Subjective Probability, eds. Henry E. Kyberg and Howard E. Smokler, 93–158. New York: John Wiley.

De Finetti, Bruno. 1985. Probabilisti di Cambridge (Cambridge Probability Theorists). Manchester School of Economic and Social Studies 53: 348–363.

De Melo, Wellington, and Sebastian van Strien. 1993. One Dimensional Dynamics. Berlin: Springer-Verlag.

Diestel, Joseph, and Jerry J. Uhl Jr. 1977. Vector Measures. Providence, RI: American Mathematical Society.

Doob, Joseph L. 1994a. The Development of Rigor in Mathematical Probability (1900–1950). In Development of Mathematics, 1900–1950, ed. Jean-Paul Pier, 157–169. Berlin: Birkhäuser-Verlag.

Doob, Joseph L. 1994b. Measure Theory. Berlin: Springer-Verlag.

Edgeworth, Francis Y. 1884. The Philosophy of Chance. Mind 9: 223–235.

Edgeworth, Francis Y. 1922. The Philosophy of Chance. Mind 31: 257–283.

Fajardo, Sergio. 1999. Nonstandard Analysis and a Classification of Probability Spaces. In Language, Quantum, Music, eds. Maria L. Dalla Chiara, Roberto Giuntini, and Federico Laudisa, 61–71. Dordrecht, Netherlands: Kluwer Academic.

Galavotti, Maria C. 1991. The Notion of Subjective Probability in the Work of Ramsey and de Finetti. Theoria 67: 239–259.

Gillies, Donald. 2000. Philosophical Theories of Probability. London: Routledge.

Hacking, Ian. 1965. Logic of Statistical Inference. Cambridge, U.K.: Cambridge University Press.

Hacking, Ian. 1975. The Emergence of Probability. Cambridge, U.K.: Cambridge University Press.

Hacking, Ian. 1990. The Taming of Chance. Cambridge, U.K.: Cambridge University Press.

Hammond, Peter J., and Yeneng Sun. 2003a. Monte Carlo Simulation of Macroeconomic Risk with a Continuum of Agents: The Symmetric Case. Economic Theory 21: 743–766.

Hammond, Peter J., and Yeneng Sun. 2003b. Monte Carlo Simulation of Macroeconomic Risk with a Continuum of Agents: The Symmetric Case. Economic Theory 21 (2): 743–746.

Hammond, Peter J., and Yeneng Sun. 2006a. The Essential Equivalence of Pairwise and Mutual Conditional Independence. Probability Theory and Related Fields 135: 415–427.

Hammond, Peter J., and Yeneng Sun. 2006b. Joint Measurability and the One-Way Fubini Property for a Continuum of Random Variables. Proceedings of the American Mathematical Society 134: 737–747.

Howson, Colin. 1995. Theories of Probability. British Journal of the Philosophy of Science 46: 1–32.

Kac, Mark. 1964. Statistical Independence in Probability, Analysis, and Number Theory. Washington, DC: Mathematical Association of America.

Kac, Mark. 1985. Enigmas of Chance. New York: Harper and Row.

Karni, Edi, and David Schmeidler. 1991. Utility Theory with Uncertainty. In Handbook of Mathematical Economics, Vol. 4, eds. Werner Hildenbrand and Hugo Sonnenschein, 1763–1831. New York: North-Holland.

Keynes, John M. 1963. A Treatise on Probability. In The Collected Writings of John Maynard Keynes, Vol. 8. London: Macmillan.

Keynes, John M. [1933] 1985. Essays in Biography. Expanded version in The Collected Writings of John Maynard Keynes, Vol. 10. London: Macmillan.

Khan, M. Ali. 1985. On the Integration of Set-Valued Mappings in a Non-Reflexive Banach Space, II. Simon Stevin 59: 257–267.

Khan, M. Ali. 1987. Correspondence. In The New Palgrave: A Dictionary of Economics, Vol. 1, eds. John Eatwell, Peter K. Newman, and Murray Milgate, 679–681. London: Macmillan.

Khan, M. Ali. 1993. On the Irony in/of Economic Theory. Modern Language Notes 108: 759–803.

Khan, M. Ali, and Yeneng N. Sun. 1999. Weak Measurability and Characterizations of Risk. Economic Theory 13: 541–560.

Khan, M. Ali, and Yeneng N. Sun. 2002. Non-Cooperative Games with Many Players. In Handbook of Game Theory with Economic Applications III, eds. Robert J. Aumann and Sergiu Hart, 1761–1808. Amsterdam: Elsevier Science.

Kolmogorov, Andrei N. [1933] 1956. Foundations of the Theory of Probability. Trans. Nathan Morrison. New York: Chelsea.

Kreps, David M. 1988. Notes on the Theory of Choice. Boulder, CO: Westview Press.

Laplace, Pierre-Simon. [1814] 1951. Essai philosophique sur les probabilités (Philosophical Essay on Probabilities). New York: Dover.

Lasota, Andrzej, and Michael C. Mackey. 1994. Chaos, Fractals, and Noise: Stochastic Aspects of Dynamics. New York: Springer-Verlag.

Loeb, Peter A. 1975. Conversion from Nonstandard to Standard Measure Spaces and Applications in Probability Theory. Transactions of the American Mathematical Society 211: 113–122.

Machina, Mark, and David Schmeidler. 1992. A More Robust Definition of Subjective Probability. Econometrica 60: 745–780.

Majumdar, Mukul, and Vladimir Rotar. 2000. Equilibrium Prices in a Random Exchange Economy with Dependent Agents. Economic Theory 15: 531–550.

Ramsey, Frank P. 1922. Mr. Keynes on Probability. Cambridge Magazine 11: 3–5.

Ramsey, Frank P. [1926] 1931. Truth and Probability. In The Foundations of Mathematics and Other Logical Essays, 156–198. London: Routledge and Kegan Paul.

Rosenthal, Jeffrey S. 2000. A First Look at Rigorous Probability Theory. Singapore: World Scientific.

Rudin, Walter. 1987. Real and Complex Analysis. 3rd ed. New York: McGraw-Hill.

Savage, Leonard J. 1954. The Foundations of Statistics. New York: John Wiley.

Savage, Leonard J. 1962. The Foundations of Statistical Inference: A Discussion. Updated ed., 1970. London: Methuen.

Skorokhod, A. V. 2004. Basic Principles and Applications of Probability Theory. Berlin: Springer-Verlag.

Steele, J. Michael. 1997. Probability Theory and Combinatorial Optimization. Philadelphia: Society for Industrial and Applied Mathematics.

Stoyanov, Jordan M. 1987. Counterexamples in Probability. 2nd ed. New York: John Wiley and Sons.

Sun, Yeneng N. 1998a. A Theory of Hyperfinite Processes: The Complete Removal of Individual Uncertainty via Exact LLN. Journal of Mathematical Economics 29: 419–503.

Sun, Yeneng N. 1998b. The Almost Equivalence of Pairwise and Mutual Independence and the Duality with Exchangeability. Probability Theory and Related Fields 112: 425–456.

Sun, Yeneng N. 1999. The Complete Removal of Individual Uncertainty: Multiple Optimal Choices and Random Economies. Economic Theory 14: 507–544.

Sun, Yeneng N. 2006. The Exact Law of Large Numbers via Fubini Extension and the Characterization of Insurable Risks. Journal of Economic Theory 126: 31–69.

M. Ali Khan

# Probability

views updated May 11 2018

"Probability is the very guide of life," Bishop Butler wrote in 1736. Probability judgments of the efficacy and side effects of a pharmaceutical drug determine whether it is approved for release to the public. The outcome of a civil trial hinges on the jurors' opinions about the probabilistic weight of evidence. Geologists calculate the probability that an earthquake of a certain intensity will hit a given city, and engineers accordingly build skyscrapers with specified probabilities of withstanding such earthquakes. Probability undergirds even measurement itself, since the error bounds that accompany measurements are essentially probabilistic confidence intervals. We find probability wherever we find uncertainty, that is, almost everywhere in our lives.

It is surprising, then, that probability arrived comparatively late on the intellectual scene. To be sure, a notion of randomness was known to the ancients. Epicurus, and later Lucretius, believed that atoms occasionally underwent indeterministic swerves. The twelfth-century Arabic philosopher Averroës's notion of "equipotency" might be regarded as a precursor to probabilistic notions. But probability theory was not conceived until the seventeenth century, in the correspondence between Pierre de Fermat and Blaise Pascal and in the Port-Royal Logic. Over the next three centuries, the theory was developed by such authors as Christian Huygens, Jacob Bernoulli, Thomas Bayes, Pierre Simon Laplace, the Marquis de Condorcet, Abraham de Moivre, John Venn, William Johnson, and John Maynard Keynes. Arguably, the crowning achievement was Andrei Kolmogorov's axiomatization in 1933, which put probability on a rigorous mathematical footing.

## The Formal Theory of Probability

In Kolmogorov's theory, probabilities are numerical values that are assigned to "events." The numbers are non-negative; they have a maximum value of 1; and the probability that one of two mutually exclusive events occurs is the sum of their individual probabilities. Stated more formally, given a set Ω and a privileged set F of subsets of Ω, probability is a function P from F to the real numbers that obeys, for all X and Y in F, the following three axioms:
A1. P(X) ≥ 0 (Non-negativity)
A2. P(Ω) = 1 (Normalization)
A3. P(X ∪ Y) = P(X) + P(Y) if X ∩ Y = Ø (Additivity)

Kolmogorov goes on to give an infinite generalization of (A3), so-called countable additivity. He also defines the conditional probability of A given B by the formula:
P(A|B) = P(A ∩ B) / P(B), provided P(B) > 0

Thus, we can say that the probability that the toss of a fair die results in a 6 is 1/6, but the probability that it results in a 6, given that it results in an even number, is 1/6 divided by 1/2, which equals 1/3.
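The die example can be checked by direct enumeration. The following is a minimal sketch, assuming a finite sample space with equally weighted outcomes; the names `OMEGA`, `prob`, and `cond_prob` are illustrative and do not come from Kolmogorov's text.

```python
from fractions import Fraction

# Sample space for one toss of a fair die; each outcome has weight 1/6.
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Probability of an event (a subset of OMEGA) under equal weights."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def cond_prob(a, b):
    """P(A|B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if prob(b) == 0:
        raise ValueError("P(B) must be positive")
    return prob(a & b) / prob(b)

six = frozenset({6})
even = frozenset({2, 4, 6})

print(prob(six))             # 1/6
print(cond_prob(six, even))  # 1/3
```

Exact fractions are used instead of floating-point numbers so that the additivity axiom can be checked with equality.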

Important consequences of these axioms include various forms of Bayes's theorem, notably:
P(H|E) = [P(H)/P(E)] P(E|H) = P(H)P(E|H) / [P(H)P(E|H) + P(¬H)P(E|¬H)]

This theorem provides the basis for Bayesian confirmation theory, which appeals to such probabilities in its account of the evidential support that a piece of evidence E provides a hypothesis H. P(E|H) is called the "likelihood" (the probability that the hypothesis gives to the evidence) and P(H) the "prior probability" of H (the probability of the hypothesis in the absence of any evidence whatsoever).
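As a rough illustration of how the prior and the likelihoods combine, here is a small sketch of the second form of Bayes's theorem. The function name `posterior` and the numbers in the example are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from fractions import Fraction

def posterior(prior_h, lik_e_given_h, lik_e_given_not_h):
    """P(H|E) = P(H)P(E|H) / [P(H)P(E|H) + P(not-H)P(E|not-H)]."""
    evidence = prior_h * lik_e_given_h + (1 - prior_h) * lik_e_given_not_h
    if evidence == 0:
        raise ValueError("P(E) must be positive")
    return prior_h * lik_e_given_h / evidence

# Hypothetical numbers: prior 1/100, likelihoods 9/10 and 1/10.
print(posterior(Fraction(1, 100), Fraction(9, 10), Fraction(1, 10)))  # 1/12
```

Even with a likelihood of 9/10, the low prior keeps the posterior at 1/12, which shows why both ingredients matter in confirmation theory.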

Events A and B are said to be independent if P(A ∩ B) = P(A)P(B). If P(A) > 0 and P(B) > 0, this is equivalent to P(A|B) = P(A) and to P(B|A) = P(B). Intuitively, information about the occurrence of one of the events does not alter the probability of the other. Thus, the outcome of a particular coin toss is presumably independent of the result of the next presidential election. Independence plays a central role in probability theory. For example, it underpins the various important "laws of large numbers," whose content is roughly that certain well-behaved processes are very likely in the long run to yield frequencies that would be expected on the basis of their probabilities.
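The product condition for independence can be tested on a small sample space. The sketch below assumes two fair dice with 36 equally likely outcomes; the events and helper names are illustrative choices, not taken from the source.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of tossing two fair dice.
OUTCOMES = frozenset(product(range(1, 7), repeat=2))

def prob_of(event):
    """Probability of a set of outcomes under equal weights."""
    return Fraction(len(event), len(OUTCOMES))

def independent(a, b):
    """True when P(A ∩ B) = P(A) P(B)."""
    return prob_of(a & b) == prob_of(a) * prob_of(b)

first_even = frozenset(o for o in OUTCOMES if o[0] % 2 == 0)
sum_is_7 = frozenset(o for o in OUTCOMES if sum(o) == 7)
sum_is_8 = frozenset(o for o in OUTCOMES if sum(o) == 8)

print(independent(first_even, sum_is_7))  # True
print(independent(first_even, sum_is_8))  # False
```

Learning that the first die is even leaves the chance of a total of 7 unchanged, but it raises the chance of a total of 8, so only the first pair of events is independent.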

While the mathematics of Kolmogorov's probability theory is well understood and thoroughly developed (a classic text is Feller), its interpretation remains controversial. We now turn to several rival accounts of what probabilities are and how they are to be determined (see Hájek for more detailed discussion).

## Interpretations of Probability

The classical interpretation, historically the first, can be found in the works of Pascal, Huygens, Bernoulli, and Leibniz, and it was famously presented by Laplace (1814). It assigns probabilities in the absence of any evidence and in the presence of symmetrically balanced evidence. In such circumstances, probability is shared equally among all the possible outcomes (the so-called principle of indifference). Thus, according to the classical interpretation, the probability of an event is simply the fraction of the total number of possibilities in which the event occurs. This interpretation was inspired by, and typically applied to, games of chance that by their very design create such circumstances; for example, the classical probability of a fair die landing with an even number showing up is 3/6. Notoriously, the interpretation falters when there are competing sets of possible outcomes. What is the probability that the die lands 6 when tossed? If we list the possible outcomes as {1, 2, 3, 4, 5, 6}, the answer appears to be 1/6. But if we list them as {6, not-6}, the answer appears to be 1/2.

The logical interpretation retains the classical interpretation's idea that probabilities are determined a priori by the space of possibilities. But the logical interpretation is more general in two important ways: the possibilities may be assigned unequal weights, and probabilities can be computed whatever the evidence may be, symmetrically balanced or not. Indeed, the logical interpretation seeks to determine universally the degree of support or confirmation that a piece of evidence E confers upon a given hypothesis H. Rudolf Carnap (1950) thus hoped to offer an "inductive logic" that generalized deductive logic and its relation of "implication" (the strongest relation of support).

A central problem with Carnap's program is that changing the language in which hypotheses and items of evidence are expressed will typically change the confirmation relations between them. Moreover, deductive logic can be characterized purely syntactically: one can determine whether E implies H, or whether H is a tautology, merely by inspecting their symbolic structure and ignoring their content. Nelson Goodman showed, however, that inductive logic must be sensitive to the meanings of words, for syntactically parallel inferences can differ wildly in their inductive strength. So inductive logic is apparently not of a piece with deductive logic after all.

Frequency interpretations date back to Venn (1876). Gamblers, actuaries, and scientists have long understood that relative frequencies are intimately related to probabilities. Frequency interpretations posit the most intimate relationship of all: identity. Thus, the probability of heads on a coin that lands heads in 7 out of 10 tosses is 7/10. In general, the probability of an outcome A in a reference class B is the proportion of occurrences of A within B.

Frequentism still has the ascendancy among scientists who seek to capture an objective notion of probability independent of individuals' beliefs. It is also the philosophical position that lies in the background of the classical approach of Ronald A. Fisher, Jerzy Neyman, and Egon S. Pearson that is used in most statistics textbooks. Frequentism faces some major objections, however. For example, a coin that is tossed exactly once yields a relative frequency of heads of either 0 or 1, whatever its true bias: the infamous problem of the single case. Some frequentists (notably Hans Reichenbach and Richard von Mises) go on to consider infinite reference classes of hypothetical occurrences. Probabilities are then defined as limiting relative frequencies in infinite sequences of trials. If there are in fact only finitely many trials of the relevant type, this requires the actual sequence to be extended to a hypothetical or virtual infinite sequence. This creates new difficulties. For instance, there is apparently no fact of the matter of how the coin in my pocket would have landed if it had been tossed once, let alone an indefinitely large number of times. A well-known problem for any version of frequentism is that relative frequencies must be relativized to a reference class. Suppose that you are interested in the probability that you will live to age eighty. Which reference class should you consult? The class of all people? All people of your gender? All people who share your lifestyle? Only you have all these properties, but then the problem of the single case returns.
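The contrast between the single case and the long run can be seen in a simulation. This is only a sketch: it uses a pseudo-random generator with a fixed seed as a stand-in for real coin tosses, and the function name `relative_frequency` and its parameters are assumptions made for illustration.

```python
import random

def relative_frequency(n_tosses, p_heads=0.5, seed=0):
    """Fraction of heads in n_tosses simulated tosses of a coin with bias p_heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(1 for _ in range(n_tosses) if rng.random() < p_heads)
    return heads / n_tosses

# A single toss can only give a relative frequency of 0 or 1.
print(relative_frequency(1))

# Longer runs tend to settle near the bias used to generate them.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

The simulation presupposes a bias `p_heads` in order to generate the tosses, so it illustrates the law of large numbers rather than settling whether probability simply is relative frequency.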

Propensity interpretations, like frequency interpretations, regard probability as an objective feature of the world. Probability is thought of as a physical propensity or disposition or tendency of a given type of physical situation to yield an outcome of a certain kind, or to yield a long-run (perhaps infinite) relative frequency of such an outcome. This view, which originated with Karl Popper (1959), was motivated by the desire to make sense of single-case probability attributions, particularly those found in quantum mechanics, on which frequentism apparently foundered (see Gillies for a useful survey).

A prevalent objection is that it is not informative to be told that probabilities are propensities. For example, what exactly is the property in virtue of which this coin, when suitably tossed, has a "propensity" of 1/2 to land heads? Indeed, some authors regard it as mysterious whether propensities even obey the axioms of probability in the first place. To the extent that propensity theories are parasitic on long-run frequencies, they also seem to inherit some of the problems of frequentism.

Subjectivist interpretations, pioneered by Frank P. Ramsey (1926) and Bruno de Finetti (1937), regard probabilities as degrees of belief, or credences, of appropriate agents. These agents cannot be actual people, since, as psychologists have repeatedly shown, people typically violate probability theory in various ways, often spectacularly so. Instead, we have to imagine the agents to be ideally rational. Ramsey thus regarded probability theory to be the "logic of partial belief." Underpinning subjectivism are so-called Dutch Book arguments. They begin by identifying agents' degrees of belief with their betting dispositions, and they then prove that anyone whose degrees of belief violate the axioms of probability is "incoherent," that is, susceptible to guaranteed losses at the hands of a cunning bettor. Equally important, but often neglected, is the converse theorem that adhering to the probability axioms protects one from such an ill fate. Subjectivism has proven to be influential, especially among social scientists, Bayesian statisticians, and philosophers.
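A toy Dutch Book makes the argument concrete. The sketch below assumes, as a simplification, that an agent with credence c in a proposition will pay c units for a bet that returns 1 unit if the proposition is true and nothing otherwise. The credences shown are hypothetical.

```python
from fractions import Fraction

def net_payoff(credences, outcome):
    """Agent's net gain after buying one unit-stake bet on each proposition,
    each priced at the agent's own credence, when `outcome` is the true one."""
    total = Fraction(0)
    for proposition, credence in credences.items():
        winnings = 1 if proposition == outcome else 0
        total += winnings - credence
    return total

# Incoherent credences: P(A) + P(not A) = 6/5, violating additivity.
incoherent = {"A": Fraction(3, 5), "not A": Fraction(3, 5)}
# Coherent credences for comparison.
coherent = {"A": Fraction(3, 5), "not A": Fraction(2, 5)}

for outcome in ("A", "not A"):
    print(outcome, net_payoff(incoherent, outcome), net_payoff(coherent, outcome))
```

Whichever way the world turns out, the incoherent agent is down 1/5 of a unit, while the coherent agent breaks even on this particular pair of bets.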

A more general approach, again originating with Ramsey, begins with certain axioms on rational preferences: for example, if you prefer A to B and B to C, then you prefer A to C. It can be shown that if you obey these axioms, then you can be represented by a probability function (encapsulating your credences about various propositions) and a utility function (encapsulating the strengths of your desires that these propositions come about). This means that you will rate the choiceworthiness of an action open to you according to its expected utility, a weighted average of the various utilities of possible outcomes associated with that action, with the corresponding probabilities providing the weights. This is the centerpiece of decision theory.
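Expected utility is simple to compute once probabilities and utilities are given. The following sketch uses an invented umbrella decision; the action names, the credence of 3/10 in rain, and the utility numbers are all hypothetical.

```python
from fractions import Fraction

def expected_utility(outcomes):
    """Probability-weighted average of utilities.
    `outcomes` is a list of (probability, utility) pairs for one action."""
    if sum(p for p, _ in outcomes) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(p * u for p, u in outcomes)

p_rain = Fraction(3, 10)  # hypothetical credence in rain
actions = {
    "take umbrella": [(p_rain, 5), (1 - p_rain, 8)],
    "leave umbrella": [(p_rain, 0), (1 - p_rain, 10)],
}

for name, outcomes in actions.items():
    print(name, expected_utility(outcomes))  # 71/10 and 7

best = max(actions, key=lambda name: expected_utility(actions[name]))
print(best)  # take umbrella
```

On these numbers, taking the umbrella has the higher expected utility (71/10 against 7), so it is the action the represented agent would rate as more choiceworthy.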

Radical subjectivists such as de Finetti recognize no constraints on initial (or "prior") subjective probabilities beyond their conforming to axioms (A1) to (A3). But they typically advocate a learning rule for updating probabilities in the light of new evidence. Suppose that you initially have credences given by a probability function P_initial, and that you become certain of E (where E is the strongest such proposition). What should your new probability function P_new be? The favored updating rule among Bayesians is conditionalization, where P_new is related to P_initial as follows:
(Conditionalization) P_new(X) = P_initial(X|E) (provided P_initial(E) > 0)
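Conditionalization on a finite set of outcomes amounts to discarding the outcomes ruled out by E and renormalizing the rest. Here is a minimal sketch under that assumption; the prior below is a made-up set of credences about a possibly loaded die, and the function name `conditionalize` is illustrative.

```python
from fractions import Fraction

def conditionalize(prior, evidence):
    """Return the new credences after becoming certain of `evidence`.
    `prior` maps outcomes to credences; `evidence` is a set of outcomes."""
    p_e = sum(p for outcome, p in prior.items() if outcome in evidence)
    if p_e == 0:
        raise ValueError("the prior probability of the evidence must be positive")
    return {
        outcome: (p / p_e if outcome in evidence else Fraction(0))
        for outcome, p in prior.items()
    }

# Hypothetical prior credences about a die that is suspected of being loaded.
prior = {1: Fraction(1, 12), 2: Fraction(1, 12), 3: Fraction(1, 6),
         4: Fraction(1, 6), 5: Fraction(1, 4), 6: Fraction(1, 4)}

new = conditionalize(prior, evidence={2, 4, 6})
print(new[2], new[4], new[6])  # 1/6 1/3 1/2
```

The ratios among the surviving outcomes are unchanged by the update; only their total is rescaled to 1.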

Radical subjectivism has faced the charge of being too permissive. It apparently licenses credences that we would ordinarily regard as crazy. For example, you can assign without its censure a probability of 0.999 to your being the only thinking being in the universe, provided that you remain coherent (and update by conditionalization). It also seems to allow fallacious inference rules, such as the gambler's fallacy (believing, for instance, that after a surprisingly long run of heads, a fair coin is more likely to land tails). A standard defense (e.g., Howson and Urbach) appeals to famous convergence-to-truth and merger-of-opinion results. Their upshot is that in the long run, the effect of choosing one prior rather than another is attenuated: successive conditionalizations on the evidence will, with probability 1, make a given agent eventually converge to the truth, and thus initially discrepant agents eventually come to agreement. Some authors object that these theorems tell us nothing about how quickly the convergence occurs; in particular, they do not explain the unanimity that we in fact often reach, and often rather rapidly.

## Some Recent Developments

Since the late twentieth century, some subjectivists have canvased further desiderata on credences. For example, we might evaluate credences according to how closely they match the corresponding relative frequencies, according to how well "calibrated" they are. Also under consideration are "scoring rules" that refine calibration. Various subjectivists believe that rational credences are guided by objective chances (perhaps thought of as propensities), so that if a rational agent knows the objective chance of a given outcome, her degree of belief will be the same as the objective chance. There has been important research on the aggregation of opinions and the preferences of multiple agents. This problem is well known to readers of the risk-assessment literature. Moreover, in light of work in economics and psychology on bounded rationality, there have been various attempts to "humanize" Bayesianism, for example, in the study of "degrees of incoherence," and of vague probability and decision theory (in which credences need not assume precise values).

Since the late twentieth century there have also been attempts to rehabilitate the classical and logical interpretations, and in particular the principle of indifference. Some objective Bayesians appeal to information theory, arguing that prior probabilities should maximize entropy (a measure of how flat a probability distribution is), subject to the constraints of a given problem. Probability theory has also been influenced by advances in theories of randomness and in complexity theory (see Fine; Li and Vitanyi), and by approaches to the "curve-fitting" problem (familiar in the computer science, artificial intelligence, and philosophy of science literature) that attempt to measure the simplicity of theories.
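The sense in which entropy measures flatness can be illustrated numerically. The sketch below assumes Shannon entropy in bits as the measure and compares three hypothetical distributions over six outcomes; it does not implement constrained entropy maximization itself.

```python
import math

def entropy(dist):
    """Shannon entropy in bits of a list of probabilities that sum to 1."""
    if not math.isclose(sum(dist), 1.0):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with probability 0 contribute nothing to the sum.
    return -sum(p * math.log2(p) for p in dist if p > 0)

uniform = [1 / 6] * 6                    # the principle of indifference
skewed = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]  # one outcome favored
certain = [1.0, 0, 0, 0, 0, 0]           # no uncertainty at all

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy(dist))
```

With no constraints beyond the number of outcomes, the uniform distribution has the largest entropy, which is how the maximum-entropy approach recovers the principle of indifference in the simplest case.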

While Kolmogorov's theory remains the orthodoxy, a host of alternative theories of probability have been developed (see Fine; Mückenheim et al.). For instance, there has been increased interest in nonadditive theories, and the status of countable additivity is a subject of lively debate. Some authors have proposed theories of primitive conditional-probability functions, in which conditional probability replaces unconditional probability as the fundamental concept. Fertile connections between probability and logic have been explored under the rubrics of "probabilistic semantics" and "probability logic."

## Some Applications of Probability

Probability theory thus continues to be a vigorous area of research. Moreover, its advances have myriad ramifications. Probability is explicitly used in many of our best scientific theories, for example, quantum mechanics and statistical mechanics. It is also implicit in much of our theorizing. A central notion in evolutionary biology is "fitness," or expected number of offspring. Psychologists publish their conclusions with significance levels attached. Agricultural scientists perform analyses of variance on how effective fertilizers are in increasing crop yields. Economists model currency exchange rates over time as stochastic processes, that is, sequences of random variables. In cognitive science and philosophy, probability functions model states of opinion. Since probability theory is at the heart of decision theory, it has consequences for ethics and political philosophy. And assuming, as many authors do, that decision theory provides a good model of rational decision-making, it apparently has implications for even mundane aspects of our daily lives. In short, probability is ubiquitous. Bishop Butler's dictum is truer today than ever.

See also Game Theory; Logic and Philosophy of Mathematics, Modern; Mathematics.

## Bibliography

Butler, Joseph. Analogy of Religion. 1736. Reprint, New York: Frederick Ungar, 1961.

Carnap, Rudolf. Logical Foundations of Probability. Chicago: University of Chicago Press, 1950.

De Finetti, Bruno. "La Prévision: Ses Lois Logiques, Ses Sources Subjectives." Annales de l'Institut Henri Poincaré 7 (1937): 1–68. Translated as "Foresight: Its Logical Laws, Its Subjective Sources." In Studies in Subjective Probability, edited by H. E. Kyburg Jr. and H. E. Smokler. New York: Robert E. Krieger, 1980.

Feller, William. An Introduction to Probability Theory and Its Applications. 3rd ed. New York: John Wiley and Sons, 1968.

Fine, Terrence. Theories of Probability. New York: Academic Press, 1973.

Gillies, Donald. "Varieties of Propensity." British Journal for the Philosophy of Science 51 (2000): 807–835.

Hájek, Alan. "Probability, Interpretations of." In The Stanford Encyclopedia of Philosophy, edited by E. Zalta. Stanford, Calif.: Stanford University, 2002. Available at http://plato.stanford.edu/entries/probability-interpret/.

Howson, Colin, and Peter Urbach. Scientific Reasoning: The Bayesian Approach. 2nd ed. Chicago: Open Court, 1993.

Kolmogorov, Andrei N. Grundbegriffe der Wahrscheinlichkeitsrechnung, Ergebnisse der Mathematik. 1933. Translated as Foundations of Probability. New York: Chelsea, 1950.

Laplace, Pierre Simon. Essai philosophique sur les probabilités. 1814. Translated as A Philosophical Essay on Probabilities. New York: Dover, 1951.

Li, Ming, and Paul Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. 2nd ed. New York: Springer-Verlag, 1997.

Mückenheim, W., et al. "A Review of Extended Probability." Physics Reports 133 (1986): 337–401.

Popper, Karl. The Logic of Scientific Discovery. London: Hutchinson, 1959.

Ramsey, Frank P. "Truth and Probability." 1926. Reprinted in Philosophical Papers, edited by D. H. Mellor. Cambridge, U.K.: Cambridge University Press, 1990.

Venn, John. The Logic of Chance. 2nd ed. London: Macmillan, 1876. Reprint, New York: Chelsea, 1962.

Von Mises, Richard. Wahrscheinlichkeit, Statistik und Wahrheit. 1939. Translated as Probability, Statistics, and Truth. Rev. English ed. New York: Macmillan, 1957.

Alan Hájek

# Probability

views updated Jun 11 2018


The concern of empirical social statisticians in matters theoretical is usually limited to troubles regarding choices of random sets with repeatably observable frequencies, meant to test hypotheses that suggest some limitations on these values. (An example is the test of claims regarding discrimination.) As many statistical observations are hardly ever repeated, a huge literature discusses the question of how representative observed sets are: Is an observed distribution characteristic of the whole population, or is it due to an accidentally great deviation from it?

Diverse examinations of the likelihood of a freak accident are calculated, all made on the basis of assumptions, often reasonable but at times highly question-begging. For example, social statisticians seldom notice that the calculations take empirical finds as crucial tests between hypotheses. They specify single hypotheses without troubling themselves to consider the hypotheses the calculations take as the default options. Most social statisticians are ignorant of the mathematical intricacies involved and of the fierce debates among top experts in the field as to what exactly these calculations mean. The long and the short of it is this: How representative is the sample? This question translates into a question of the sample's randomness.

Critics often suggest that randomness is wanting because the criterion of the choice of the sample introduces bias. The standard example is in biology. The use of laboratory animals introduces bias and renders some tests worse than useless. In social research the standard example is the use of telephone books for the selection of random telephone numbers, which overlooks the fact that in some cases the choice of people with telephones introduces bias. The best way to handle bias is to repeat the test with a new sample using different criteria. In some cases tests for the randomness of a sample may comprise random selections from the existing sample.

The most popular computer program for social statisticians, SPSS (originally Statistical Package for the Social Sciences), warns users against correlating everything in sight, because such an approach is sure to yield some useless (because unrepeatable) results. Common sense reduces such risks, but no method is foolproof. It is therefore always advisable to be wary of computer-generated numbers. To this end, it may be useful to comprehend the general ideas behind the statistics.

The term probability is open to different readings. Some of these may be subject to the mathematical calculus of probability (see below) and some not. Methodologists are most concerned with two of these readings, only one of which obeys this calculus. (Many insist that this need not be so.) In one sense probability, especially in betting, is a matter of guesses. In another sense probability is the plausibility of a conjecture, as in the common assertion "What you say is probable." Can different people propose competing probable views? To deny this is to dismiss without debate the views of peers who reject what seems plausible. This is inadvisable, at least in a democracy. Suppose, then, that it is possible for competing views to be probable. Since the calculus of probability ascribes to the probable the numerical value of more than one-half and considers the sum of probabilities of all alternatives to be at most unity, the probable in this sense defies this calculus.

Many methodologists deem this argument misleading. It fails for scientific ideas, they say, where such cases are a priori impossible. Not so: When two new theories compete, researchers who find both plausible seek facts that will tip the scale against one of them. In the light of such experiments, theories do gain probability in a sense that defies the mathematical theory of probability. Proof: The mathematical theory will render more probable the theory that comes closer to existing information, yet such theories are implausible. Plausibility goes to an imaginative theory that looks implausible at first and then gains support from new information. The calculus distinctly does not differentiate between information that was known before the theory was invented and the information that the theory reveals. End of proof.

Theoretical learning results from the wish to understand observed regularities. Probability is the study of luck. With no foreknowledge of the fates of individuals, we know that ensembles have a percentage of lucky members, a percentage that improves or declines with the institution of precaution or carelessness, respectively. The simplest ensemble is that of tosses of a coin. Unable to predict single outcomes, we can predict their ratio, a number between 0 and 1. We tend to assume that this ratio is one-half. Not so, because a tossed coin with one side heavier than the other will more often fall on the heavy side: It is biased or unfair. All tossed coins turn out to have a constant ratio of heads turning up; this ratio is the probability for heads. Given a coin that we have not tested, we cannot know whether the probability of heads for it is one-half, but we take it for granted that it is a fixed ratio. Moreover as most coins that we use are fair, we tend naively to assume that one that we have not tested is fair. Some gamblers misuse this naïveté regularly.

Some say that in the absence of prior information about a coin, we have to consider it fair. Of course they say this because they speak of probability in a different sense: subjective (see below). Thus even when limiting our discourse to luck, we understand probability in different ways.

The mathematical theory of probability considers in the abstract a set of items with numerical values in a manner that follows certain intuitive axioms. These are basic equations that assign to every ordered pair a and b of these items a number called p(a, b) (the conditional probability of a given b): p(a, b) = r, where 0 ≤ r ≤ 1.

The axioms of probability relate these numbers and items. Instead of going into the detail of the axioms, most textbooks, including most mathematically powerful ones, use examples, such as a series of outcomes in tosses of a coin, throws of a die, or pulls of a playing card out of a pack. These examples legitimately stand for the major theorems of the calculus, provided that the rule of equal probability that they usually exhibit is not generalized to allow for marked cards or biased or unfair coins and dice.

The troubled discussions about probability that occurred at the turn of the nineteenth century disappeared with the realization that equiprobability is only one possible probability distribution among many. The alternative ways to assign equiprobability seem problematic only because of erroneous methodology. Quantum theory beautifully illustrates the freedom of assigning these. The philosopher Rudolf Carnap (1891–1970), for example, tried to use one quantum probability distribution in his theory of subjective probability (1962). Deviations from traditional rules about probability are hypotheses to be tested like any other. The identification of probability with plausibility plays havoc, as tests may grant such theories plausibility. Giving up this identification dispels such troubles. The axioms of probability, then, concern measures of possibilities, and assuming any distribution is thus conjectural: The more-possible items receive a higher degree of probability, with impossibility as probability 0 and necessity as probability 1. Probability can be a measure of the possibility of success in betting, and it can be the betting success rate; it can also be more than that. This is why the axioms of probability should apply to an unspecified set of uninterpreted items.

The axioms demand, then, that every item a has a complement b, given c: if for some d, p(d, c) ≠ 1, then p(a, c) + p(b, c) = 1. Otherwise, when p(d, c) = 1 for every d, obviously p(a, c) + p(b, c) = 2. Similarly for every a and b, there is an item c that is their conjunction: p(ab, d) = p(c, d) for every d. It turns out, as Karl Popper (1902–1994) has proved, that these rules abide by Boolean algebra (1968). The heart of probability theory is the multiplication theorem, the feel for which is central to the general feel for probability. It is intuitively obvious that probability is a monotone function: for every three items, p(ab, c) ≤ p(a, c); and p(ab, c) = p(a, bc) times p(b, c). For c that may be ignored (it may be understood as the universal condition, as the condition that always holds in the system under study), we may write p(a) for p(a, c). The probabilities depending on it are known as absolute probabilities. The multiplication law for absolute probabilities, then, is p(ab) = p(a, b) times p(b). And if p(b) ≠ 0, then obviously p(a, b) = p(ab) divided by p(b). This makes it obvious that conditional probability is logically prior to absolute probability, since items that have zero probability appear in probability considerations of all sorts (Rényi 1970; Popper 1968).
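The monotonicity and multiplication rules can be checked on a concrete model. The sketch below is only one finite interpretation of the abstract items: it assumes a standard 52-card deck and defines p(a, b) as a ratio of counts, which is a narrower notion than the primitive conditional probability of the axiomatic treatment. All names are illustrative.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = frozenset(product(RANKS, SUITS))

def p(a, b=DECK):
    """p(a, b): share of the cards in b that are also in a.
    With b omitted, this is the absolute probability p(a)."""
    if not b:
        raise ValueError("this ratio model needs a non-empty condition")
    return Fraction(len(a & b), len(b))

kings = frozenset(c for c in DECK if c[0] == "K")
faces = frozenset(c for c in DECK if c[0] in {"J", "Q", "K"})
reds = frozenset(c for c in DECK if c[1] in {"hearts", "diamonds"})

# Multiplication theorem: p(ab, c) = p(a, bc) * p(b, c)
print(p(kings & faces, reds), p(kings, faces & reds) * p(faces, reds))  # 1/13 1/13
# Absolute form: p(ab) = p(a, b) * p(b)
print(p(kings & faces), p(kings, faces) * p(faces))  # 1/13 1/13
```

In this model the conjunction ab is set intersection and the universal condition is the whole deck, so the ratio model cannot condition on an empty set, unlike the primitive treatment of conditional probability.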

One of the most popular reasons for the subjective view of probability is the theory of errors, the assumption that random errors of measurement of some given quantity cancel each other, so the most reliable hypothesis about that quantity is that it is the average of the many measurements of it. This makes probability appear as reliability (of measurements). Consider then the hypothesis that reliability of measurements follows the axioms of the theory of probability. If this reliability is measurable by the reliance that people ascribe to measurements, then this hypothesis is empirically easily refuted: People prefer impressions to averages. This is not the case with researchers, however, as they assume errors to be random. Otherwise they are ready to change their minds. Hence, the errors in question are not errors of reliability. Moreover the same formula applies to acts, such as shooting at a target, that have nothing to do with reliability. Consider the correction of a gun fixed on a gun rest, aimed at a target, and hitting the target on average at a point that is not its center. The most important aspect of this kind of exercise is that it is repeatable. Otherwise it is pointless. (The same holds for the way astronomers eliminate random errors due to atmospheric interferences.) The demand for repeatability clearly eliminates the problem of credence. Those who refuse to work on repeatable experience are invited to test it afresh. Repeatable situations with deviations are particularly important for plotting the graph of the random differences between each hit and the center of the target. These random deviations are errors in the sense of distractions, not in the sense of observers negligently making mistakes. The error graph is in the famous bell shape (achieved when the sample grows infinitely to cover all possible deviations); the smaller the errors, the thinner the bell. 
This is known as dispersion, and it is essential for the study of populations subject to diverse random deviations. In physics the expression of the wish to find sharp spectral lines is the effort to reduce random interferences in the process of radiation (heat).
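The gun-rest example can be mimicked in a short simulation. This is a sketch under stated assumptions: the deviations are drawn from a normal distribution by a seeded pseudo-random generator, positions are one-dimensional, and the offset of 0.5 and spread of 1.0 are invented values.

```python
import random
import statistics

def simulate_hits(n_shots, offset, spread, seed=0):
    """Positions of hits relative to the target center: a constant aiming
    offset plus a random deviation for each shot."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [offset + rng.gauss(0.0, spread) for _ in range(n_shots)]

hits = simulate_hits(n_shots=10_000, offset=0.5, spread=1.0)

# The average hit estimates the aiming offset that the gunner should correct.
estimated_offset = statistics.mean(hits)
# The standard deviation measures the dispersion (the width of the bell).
dispersion = statistics.stdev(hits)
# After the correction, the hits are centered on the target.
corrected = [h - estimated_offset for h in hits]

print(round(estimated_offset, 2), round(dispersion, 2))
```

Averaging removes the random deviations but not the constant offset, which is why the repeated shots are informative about how to re-aim the gun.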

Historically the strongest reason for subjective probability was provided by the Marquis de Laplace (1749–1827) and later was endorsed by Albert Einstein (1879–1955): Facts are predetermined, and probability is due to ignorance. The opposite view is that randomness is objective. The chief argument here is that the assumption of randomness is essential for almost all successful application of the calculus of probability (the exception being its application to number theory). The subjectivist view of randomness as mere ignorance leads to the defunct assumption of equiprobability. Sophisticated subjectivists admit this but take it as a challenge, relying on another, more convincing reason to view probability as subjective and randomness as ignorance. It rests on a theorem named after Thomas Bayes (1702–1761). (This is why subjectivism is often called Bayesianism.)

Bayes's theorem concerns inverse probabilities. It is the formula that enables the move from the value of p(e, h) (of empirical data given a theory) to the value of p(h, e) (of a theory given empirical data), from the likelihood of an effect given one of its causes to the likelihood that a given cause is responsible for the effect at hand. This formula is a theorem that is easily deducible from the multiplication law, provided all relevant probabilities are given.

This of course is an objectivist proviso; viewed subjectively, however, these provisos are convictions. This leads back to the defunct rule of equiprobability. Moreover, the most important theorem of probability theory, the law of large numbers, is not given to the subjectivist interpretation. It says that any option, however improbable, will occur in a sufficiently large collection, although the less probable the option, the less frequent it will be.

SEE ALSO Bayes Theorem; Bayesian Econometrics; Bayesian Statistics; Classical Statistical Analysis; Econometric Decomposition; Methods, Quantitative; Popper, Karl; Random Samples; Sampling; Science; Social Science; Statistics; Test Statistics

## BIBLIOGRAPHY

Carnap, Rudolf. 1962. Logical Foundations of Probability. 2nd ed. Chicago: University of Chicago Press.

Feller, William. 1967. Introduction to Probability Theory and Its Applications. 3rd ed. New York: Wiley.

Fisher, Ronald A. 1922. On the Mathematical Foundations of Theoretical Statistics. Philosophical Transactions of the Royal Society of London, ser. A, 222: 309–368.

Kolmogorov, A. N. 1956. Foundations of the Theory of Probability. 2nd English ed. Trans. and ed. Nathan Morrison. New York: Chelsea.

Landau, Sabine, and Brian S. Everitt. 2004. A Handbook of Statistical Analysis Using SPSS. 3rd ed. Boca Raton, FL: Chapman and Hall.

Leblanc, Hugh. 1989. The Autonomy of Probability Theory (Notes on Kolmogorov, Rényi, and Popper). British Journal for the Philosophy of Science 40: 167–181.

Levi, Isaac. 1967. Gambling with Truth: An Essay on Induction and the Aims of Science. New York: Knopf.

Popper, Karl. 1968. The Logic of Scientific Discovery. 3rd ed. London: Hutchinson.

Rényi, Alfréd. 1970. Foundations of Probability. San Francisco: Holden-Day.

Schrödinger, Erwin. 1989. Statistical Thermodynamics. New York: Dover.

Todhunter, Isaac. 1865. A History of the Mathematical Theory of Probability from the Time of Pascal to That of Laplace. London: Macmillan.

Joseph Agassi

# Probability

views updated May 14 2018


Probability measures the likelihood that something specific will occur. For example, a tossed coin has an equal chance, or probability, of landing with one side up ("heads") or the other ("tails"). If you drive without a seat belt, your probability of being injured in an accident is much higher than if you buckle up. Probability uses numbers to explain chance.

If something is absolutely going to happen, its probability of occurring is 1, or 100 percent. If something absolutely will not happen, its probability of occurring is 0, or 0 percent.

Probability is used as a tool in many areas of genetics. A clinical geneticist uses probability to determine the likelihood that a couple will have a baby with a specific genetic disease. A statistical geneticist uses probability to learn whether a disease is more common in one population than in another. A computational biologist uses probability to learn how a gene causes a disease.

## The Clinical Geneticist and the Punnett Square

A Punnett square uses probability to explain what sorts of children two parents might have. Suppose a couple knows that cystic fibrosis, a debilitating respiratory disease, tends to run in the man's family. The couple would like to know how likely it is that they would pass on the disease to their children.

A clinical geneticist can use a Punnett square to help answer the couple's question. The clinical geneticist might start by explaining how the disease is inherited: Because cystic fibrosis is a recessive disease caused by a single gene, only children who inherit the disease-causing form of the cystic fibrosis gene from both parents display symptoms. On the other hand, because the cystic fibrosis gene is a recessive gene, a child who inherits only one copy of a defective gene, along with one normal version, will not have the disease.

Suppose the recessive, disease-causing form of the gene is referred to as "f" and the normal form of the gene is referred to as "F." Only individuals with two disease-causing genes, ff, would have the disease. Individuals with either two normal copies of the gene (FF) or one normal copy and one mutated copy (Ff) would be healthy.

If the clinical geneticist tests the parents and finds that each carries one copy of the cystic fibrosis gene, f, and one copy of the normal gene, F, what would be the probability that a baby of theirs would be born with the cystic fibrosis disease? To answer this question, we can use the Punnett square shown in the figure above. A Punnett square assumes that there is an equal probability that the parent will pass on either of its two gene forms ("alleles") to each child.

The parents' genes are represented along the edges of the square. A child inherits one gene from its mother and one from its father. The combinations of genes that the child of two Ff parents could inherit are represented by the boxes inside the square.

Of the four combinations possible, three involve the child's inheriting at least one copy of the dominant, healthy gene. In three of the four combinations, therefore, the child would not have cystic fibrosis. In only one of the four combinations would the child inherit the recessive allele from both parents. In that case, the child would have the disease. Based on the Punnett square, the counselor can tell the parents that there is a 25 percent probability, or a one-in-four chance, that their baby will have cystic fibrosis.
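The counting in the Punnett square can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name `punnett` and the use of two-letter strings for genotypes are assumptions made here, and the code relies on the same premise as the square itself, namely that each parent passes on either allele with equal probability.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def punnett(mother: str, father: str) -> dict:
    """Return genotype -> probability for one child.

    Assumes each parent passes on either of its two alleles
    with equal probability (hypothetical helper for illustration).
    """
    # Sort the two inherited alleles so "fF" and "Ff" are counted as the
    # same genotype (uppercase sorts before lowercase, giving "Ff").
    counts = Counter(
        "".join(sorted(m + f)) for m, f in product(mother, father)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


offspring = punnett("Ff", "Ff")
p_affected = offspring["ff"]      # child inherits f from both parents
p_unaffected = 1 - p_affected     # FF or Ff
print(offspring)
print("P(cystic fibrosis) =", p_affected)
```

The four boxes of the square correspond to the four allele pairs generated by `product`; grouping them by genotype gives FF, Ff, and ff in the ratio 1:2:1.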

## The Statistical Geneticist and the Chi-Square Test

Researchers often want to know whether one particular gene occurs in a population more or less frequently than another. This may help them determine, for example, whether the gene in question causes a particular disease. For a dominant gene, such as the one that causes Huntington's disease, the frequency of the disease can be used to determine the frequency of the gene, since everyone who has the gene will eventually develop the disease. However, it would be practically impossible to find every case of Huntington's disease, because it would require knowing the medical condition of every person in a population. Instead, genetic researchers sample a small subset of the population that they believe is representative of the whole. (The same technique is used in political polling.)

Whenever a sample is used, the possibility exists that it is unrepresentative, generating misleading data. Statisticians have a variety of methods to minimize sampling error, including sampling at random and using large samples. But sampling errors cannot be eliminated entirely, so data from the sample must be reported not just as a single number but with a range that conveys the precision and possible error of the data. Instead of saying the prevalence of Huntington's disease in a population is 10 per 100,000 people, a researcher would say the prevalence is 7.8 to 12.1 per 100,000 people.

The potential for errors in sampling also means that statistical tests must be conducted to determine if two numbers are close enough to be considered the same. When we take two samples, even if they are both from exactly the same population, there will always be slight differences in the samples that will make the results differ.

A researcher might want to determine if the prevalence of Huntington's disease is the same in the United States as it is in Japan, for example. The population samples might indicate that the prevalences, ignoring ranges, are 10 per 100,000 in the United States and 11 per 100,000 in Japan. Are these numbers close enough to be considered the same? This is where the Chi-square test is useful.

First we state the "null hypothesis," which is that the two prevalences are the same and that the difference in the numbers is due to sampling error alone. Then we use the Chi-square test, which is a mathematical formula, to test the hypothesis.

The test generates a measure of probability, called a p value, that can range from 0 percent to 100 percent. The p value is the probability of seeing a difference at least as large as the one observed if the null hypothesis were true. If the p value is close to 100 percent, the difference in the two numbers is entirely consistent with sampling error alone. The lower the p value, the harder it is to attribute the difference solely to chance.

Scientists have agreed to use a cutoff value of 5 percent for most purposes. If the p value is less than 5 percent, the two numbers are said to be significantly different, the null hypothesis is rejected, and some other cause for the difference must be sought besides sampling error. There are many statistical tests and measures of significance in addition to the Chi-square test. Each is adapted for special circumstances.
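The comparison of the two prevalences can be sketched as a Pearson chi-square test on a 2 × 2 table. The article does not give sample sizes, so the samples of 100,000 people per country used below are purely hypothetical, as is the function name `chi_square_2x2`. The sketch uses no continuity correction and relies on the fact that, for one degree of freedom, the chi-square tail probability equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int) -> tuple:
    """Pearson chi-square test (no continuity correction) for the table
        [[a, b],
         [c, d]]
    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if the null hypothesis holds.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# ASSUMPTION: 100,000 people sampled in each country (not from the article).
# Rows: United States, Japan. Columns: cases, non-cases.
stat, p = chi_square_2x2(10, 99_990, 11, 99_989)
print(f"chi-square = {stat:.4f}, p = {p:.3f}")
print("reject null hypothesis" if p < 0.05 else "do not reject null hypothesis")
```

With these assumed sample sizes the p value is far above the 5 percent cutoff, so the difference between 10 and 11 per 100,000 would be attributed to sampling error.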

Another application of the Chi-square test in genetics is to test whether a particular genotype is more or less common in a population than would be expected. The expected frequencies can be calculated from population data and the Hardy-Weinberg Equilibrium formula. These expected frequencies can then be compared to observed frequencies, and a p value can be calculated. A significant difference between observed and expected frequencies would indicate that some factor, such as natural selection or migration, is at work in the population, acting on allele frequencies. Population geneticists use this information to plan further studies to find these factors.
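A minimal sketch of the Hardy-Weinberg comparison follows. The genotype counts are invented for illustration, and the function name `hardy_weinberg_test` is an assumption. Allele frequencies are estimated from the observed counts, the expected genotype counts are p², 2pq, and q² times the sample size, and the test has one degree of freedom (three genotype classes, minus one, minus one estimated allele frequency).

```python
import math


def hardy_weinberg_test(n_FF: int, n_Ff: int, n_ff: int) -> tuple:
    """Chi-square test of observed genotype counts against
    Hardy-Weinberg expectations. Returns (statistic, p_value)."""
    n = n_FF + n_Ff + n_ff
    p = (2 * n_FF + n_Ff) / (2 * n)   # estimated frequency of allele F
    q = 1 - p                         # estimated frequency of allele f
    observed = (n_FF, n_Ff, n_ff)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # One degree of freedom, so P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical sample of 1,000 individuals (counts are made up).
hw_stat, hw_p = hardy_weinberg_test(400, 400, 200)
print(f"chi-square = {hw_stat:.2f}, p = {hw_p:.2e}")
```

In this invented sample there are fewer heterozygotes than the formula predicts, and the very small p value would prompt a search for a cause such as selection or migration.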

## The Computational Biologist and BLAST

Genetic counseling lets potential parents make an informed decision before they decide to have a child. Geneticists, however, would like to be able to take this one step further: They would like to be able to cure genetic diseases. To be able to do so, scientists must first understand how a disease-causing gene results in illness. Computational biologists created a computer program called BLAST to help with this task.

To use BLAST, a researcher must know the DNA sequence of the disease-causing gene or the protein sequence that the gene encodes. BLAST compares DNA or protein sequences. The program can be used to search many previously studied sequences to see if there are any that are similar to a newly found sequence. BLAST measures the strength of a match between two sequences with a p value. The smaller the p value, the lower the probability that the similarity is due to chance alone.

If two sequences are alike, their functions may also be alike. For BLAST to be most useful to a researcher, there would be a gene that has already been entered in the library that resembles the disease-causing gene, and some information would be known about the function of the previously entered gene. This would help the researcher begin to hypothesize how the disease-causing gene results in illness.

see also Bioinformatics; Clinical Geneticist; Computational Biologist; Cystic Fibrosis; Hardy-Weinberg Equilibrium; Homology; Internet; Mendelian Genetics; Metabolic Disease; Statistical Geneticist.

Rebecca S. Pearlman

#### Bibliography

Nussbaum, Robert L., Roderick R. McInnes, and Huntington F. Willard. Thompson & Thompson Genetics in Medicine, 6th ed. St. Louis, MO: W. B. Saunders, 2001.

Purves, William K., et al. Life: The Science of Biology, 6th ed. Sunderland, MA: Sinauer Associates, 2001.

Seidman, Lisa, and Cynthia Moore. Basic Laboratory Methods for Biotechnology: Textbook and Laboratory Reference. Upper Saddle River, NJ: Prentice-Hall, 2000.

Tamarin, Robert H. Principles of Genetics, 7th ed. Dubuque, IA: William C. Brown, 2001.

##### Internet Resources

The Dolan DNA Learning Center. Cold Spring Harbor Laboratory. <http://vector.cshl.org>.

The National Center for Biotechnology Information. <http://www.ncbi.nlm.nih.gov>.

# Probability Theory

views updated Jun 11 2018

Probability theory is a branch of mathematics concerned with determining the likelihood that a given event will occur. When all outcomes are equally likely, this likelihood is found by dividing the number of outcomes that make up the event by the total number of possible outcomes. For example, consider a single die (one of a pair of dice) with six faces. Each face contains a different number of dots: 1, 2, 3, 4, 5, or 6. If you roll the die in a completely random way, the probability of getting any particular one of the six faces is one out of six.

Probability theory originally grew out of problems encountered by seventeenth-century gamblers. It has since developed into one of the most respected and useful branches of mathematics, with applications in many different industries. Perhaps what makes probability theory most valuable is that it can be used to determine the expected outcome in a wide range of situations, from the chances that a plane will crash to the probability that a person will win the lottery.

## History of probability theory

Probability theory was originally inspired by gambling problems. The earliest work on the subject was performed by Italian mathematician and physicist Girolamo Cardano (1501–1576). In his manual Liber de Ludo Aleae, Cardano discusses many of the basic concepts of probability, complete with a systematic analysis of gambling problems. Unfortunately, Cardano's work had little effect on the development of probability because his manual did not appear in print until 1663, and even then received little attention.

In 1654, another gambler named Chevalier de Méré invented a system for gambling that he was convinced would make money. He decided to bet even money that he could roll at least one twelve (a double six) in 24 rolls of two dice. However, when the Chevalier began losing money, he asked his mathematician friend Blaise Pascal (1623–1662) to analyze his gambling system. Pascal discovered that the Chevalier's system would lose about 51 percent of the time.
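The 51 percent figure can be checked with a short calculation. A single roll of two fair dice shows a twelve with probability 1/36, so the chance of no twelve in 24 independent rolls is (35/36)^24. The sketch below assumes fair dice and independent rolls; the function name is chosen here for illustration.

```python
from fractions import Fraction


def p_at_least_one_twelve(rolls: int) -> Fraction:
    """Probability of at least one double six in `rolls` throws of
    two fair dice, assuming the throws are independent."""
    p_no_twelve_single = Fraction(35, 36)
    return 1 - p_no_twelve_single ** rolls


p_win = p_at_least_one_twelve(24)
p_lose = 1 - p_win
print(f"P(win)  = {float(p_win):.4f}")
print(f"P(lose) = {float(p_lose):.4f}")
```

The bet wins a little less than half the time. With 25 rolls instead of 24, the same formula gives a winning probability slightly above one half.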

Pascal became so interested in probability that he began studying more problems in this field. He discussed them with another famous mathematician, Pierre de Fermat (1601–1665), and together they laid the foundation of probability theory.

## Methods of studying probability

Probability theory is concerned with determining the relationship between the number of times some specific given event occurs and the number of times any event occurs. For example, consider the flipping of a coin. One might ask how many times a head will appear when a coin is flipped 100 times.

Determining probabilities can be done in two ways: theoretically and empirically. The example of a coin toss helps illustrate the difference between these two approaches. Using a theoretical approach, we reason that in every flip there are two possibilities, a head or a tail. By assuming each event is equally likely, the probability that the coin will end up heads is ½ or 0.5.

The empirical approach does not use assumptions of equal likelihood. Instead, an actual coin flipping experiment is performed, and the number of heads is counted. The probability is then equal to the number of heads actually found divided by the total number of flips.
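The two approaches can be compared with a simulated coin. The sketch below stands in for a real experiment: it assumes a pseudo-random number generator is an acceptable substitute for physical flips, and the function name and seed are arbitrary choices made here.

```python
import random


def empirical_probability_of_heads(n_flips: int, seed: int = 0) -> float:
    """Simulate n_flips of a fair coin and return the observed
    fraction of heads (number of heads / total flips)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips


theoretical = 1 / 2  # two equally likely outcomes, one of them heads
for n in (10, 100, 10_000):
    estimate = empirical_probability_of_heads(n)
    print(f"{n:>6} flips: empirical = {estimate:.4f}, theoretical = {theoretical}")
```

Short runs can stray noticeably from 0.5, while long runs tend to settle close to the theoretical value.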

## Basic concepts

Probability is always represented as a fraction, for example, the number of times a "1 dot" turns up when a die is rolled (such as 1 out of 6, or 1/6) or the number of times a head will turn up when a penny is flipped (such as 1 out of 2, or ½). Thus the probability of any event always lies somewhere between 0 and 1. In this range, a probability of 0 means that there is no likelihood at all of the given event's occurring. A probability of 1 means that the given event is certain to occur.

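Probabilities may or may not be dependent on each other. For example, we might ask what is the probability of picking a red card OR a king from a deck of cards. These events are independent because knowing that the card is red does not change the chance that it is a king: 2 of the 26 red cards are kings, the same 1-in-13 proportion as in the full deck. They are not mutually exclusive, however, since a card can be both red and a king, so the probability of a red card OR a king is 28 out of 52 rather than the simple sum of the two probabilities.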
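The red-card and king events can be checked by direct counting over a standard 52-card deck. The sketch below is illustrative only; the deck representation and the helper `prob` are assumptions introduced here. Two events are independent when the probability of both together equals the product of their separate probabilities.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs


def prob(event) -> Fraction:
    """Fraction of cards in the deck for which event(card) is true."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))


def is_red(card) -> bool:
    return card[1] in ("hearts", "diamonds")


def is_king(card) -> bool:
    return card[0] == "K"


p_red = prob(is_red)
p_king = prob(is_king)
p_both = prob(lambda c: is_red(c) and is_king(c))
p_either = prob(lambda c: is_red(c) or is_king(c))

print("independent:", p_both == p_red * p_king)
print("P(red or king) =", p_either)
```

The "or" probability follows the addition rule P(A or B) = P(A) + P(B) − P(A and B), which avoids counting the two red kings twice.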

As an example of a dependent probability (also called a conditional probability), consider an experiment in which one is allowed to pick any ball at random out of an urn that contains six red balls and six black balls. On the first try, a person would have an equal probability of picking either a red or a black ball. The number of each color is the same. But if the first ball is not returned to the urn, the probability of picking either color is different on the second try, since only five balls of one color remain among the eleven left.
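The urn example can be made concrete with a small calculation. The sketch assumes the first ball is not put back before the second draw; the function name `p_second_red` is hypothetical.

```python
from fractions import Fraction


def p_second_red(first_color: str, red: int = 6, black: int = 6) -> Fraction:
    """Probability that the second ball is red, given the color of the
    first ball, when the first ball is NOT returned to the urn."""
    if first_color == "red":
        red -= 1
    else:
        black -= 1
    return Fraction(red, red + black)


p_first_red = Fraction(6, 12)
after_red = p_second_red("red")      # 5 red balls among 11
after_black = p_second_red("black")  # 6 red balls among 11

# Averaging over the first draw recovers the unconditional probability.
p_second_red_overall = p_first_red * after_red + (1 - p_first_red) * after_black
print(after_red, after_black, p_second_red_overall)
```

The conditional probabilities differ (5/11 versus 6/11), yet when nothing is known about the first draw the second ball is still red with probability one half.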

## Applications of probability theory

Probability theory was originally developed to help gamblers determine the best bet to make in a given situation. Many gamblers still rely on probability theory, either consciously or unconsciously, to make gambling decisions.

Probability theory today has a much broader range of applications than just in gambling, however. For example, one of the great changes that took place in physics during the 1920s was the realization that many events in nature cannot be described with perfect certainty. The best one can do is to say how likely the occurrence of a particular event might be.

When the nuclear model of the atom was first proposed, for example, scientists felt confident that electrons traveled in very specific orbits around the nucleus of the atom. Eventually they found that there was no basis for this level of certainty. Instead, the best they could do was to specify the probability that a given electron would appear in various regions of space in the atom. If you have ever seen a picture of an atom in a science or chemistry book, you know that the cloudlike appearance of the atom is a way of showing the probability that electrons occur in various parts of the atom.

# probability distributions

views updated Jun 27 2018

probability distributions Theoretical formulas for the probability that an observation has a particular value, or lies within a given range of values.

Discrete probability distributions apply to observations that can take only certain distinct values, such as the integers 0, 1, 2,… or the six named faces of a die. A probability, p(r), is assigned to each event such that the total is unity. Important discrete distributions are the binomial distribution and the Poisson distribution.
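As an illustration of the requirement that the probabilities p(r) total unity, the sketch below evaluates the standard binomial and Poisson formulas and sums them. The parameter values (10 trials with success probability 0.3, and a Poisson mean of 2) are arbitrary choices made for this example.

```python
import math


def binomial_pmf(r: int, n: int, p: float) -> float:
    """Probability of exactly r successes in n independent trials,
    each succeeding with probability p."""
    return math.comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson_pmf(r: int, mean: float) -> float:
    """Probability of exactly r events when the expected count is `mean`."""
    return math.exp(-mean) * mean**r / math.factorial(r)


binomial_total = sum(binomial_pmf(r, 10, 0.3) for r in range(11))
# The Poisson distribution has infinitely many values; terms beyond
# r = 49 are negligible for a mean of 2, so the sum is truncated there.
poisson_total = sum(poisson_pmf(r, 2.0) for r in range(50))
print(binomial_total, poisson_total)
```

Both totals come out equal to 1 up to floating-point rounding.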

Continuous probability distributions apply to observations, such as physical measurements, where no two observations are likely to be exactly the same. Since the probability of observing exactly a given value is zero, a mathematical function, the cumulative distribution function, F(x), is used instead. This is defined as the probability that the observation does not exceed x. F(x) increases monotonically with x from 0 to 1, and the probability of observing a value greater than x1 and not exceeding x2 is F(x2) – F(x1).

This definition leads, by differential calculus, to the frequency function, f(x), which is the limiting ratio of F(x + h) – F(x) to h as h becomes small, so that the probability of an observation between x and x + h is approximately h·f(x). The most important continuous distribution is the normal (or Gaussian) distribution.
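The relation between F(x) and f(x) can be checked numerically for the normal distribution. The sketch below uses the standard closed forms for the normal cumulative distribution function (via the error function) and frequency function; the evaluation point 0.5 and the step h are arbitrary choices.

```python
import math


def normal_cdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """F(x): probability that a normal observation does not exceed x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """f(x): the frequency (density) function of the normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


# Probability of an observation between two limits: F(x2) - F(x1).
p_interval = normal_cdf(1.0) - normal_cdf(-1.0)

# f(x) as the limiting ratio of F(x + h) - F(x) to h for small h.
x, h = 0.5, 1e-5
difference_quotient = (normal_cdf(x + h) - normal_cdf(x)) / h

print(f"P(-1 < X <= 1) = {p_interval:.4f}")
print(f"ratio = {difference_quotient:.6f}, f(x) = {normal_pdf(x):.6f}")
```

About 68 percent of observations fall within one standard deviation of the mean, and the difference quotient agrees closely with f(x).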

Probability distributions are defined in terms of parameters, whose values determine the numerical values of the probabilities.

# probability

views updated May 18 2018

prob·a·bil·i·ty / ˌpräbəˈbilətē/ • n. (pl. -ties) the extent to which something is probable; the likelihood of something happening or being the case: the rain will make the probability of their arrival even greater. ∎  a probable event: for a time, revolution was a strong probability. ∎  the most probable thing: the probability is that it will be phased in over a number of years. ∎  Math. the extent to which an event is likely to occur, measured by the ratio of the favorable cases to the whole number of cases possible: the area under the curve represents probability | a probability of 0.5. PHRASES: in all probability used to convey that something is very likely: he would in all probability make himself known.

# probability

views updated Jun 08 2018

probability A number between 0 and 1 associated with an event (see relative frequency) that is one of a set of possible events: an event that is certain to occur has probability 1. The probability of an event is the limiting value approached by the relative frequency of the event as the number of observations is increased indefinitely. Alternatively it is the degree of belief that the event will occur.

The concept of probability is applied to a wide range of events in different contexts. Originally interest was in the study of games of chance, where correct knowledge of probability values allowed profitable wagers to be made. Later the subject was studied by insurance companies anxious to predict probable future claims on the basis of previously observed relative frequencies. Today probability theory is the basis of statistical analysis (see statistical methods).

The probability calculus is the set of rules for combining probabilities for combinations of events, using the methods of symbolic logic applied to sets.

See also probability distributions.

# probability

Updated Aug 18 2018