Probability: Basic Concepts of Mathematical Probability
Widely used in everyday life, the word probability has no simple definition. Probability relates to chance, a notion with deep roots in antiquity, encountered in the works of philosophers and poets, reflected in widespread games of chance and the practice of sortilege, resolving uncertainty by the casting of lots. The mathematical theory of probability, the study of laws that govern random variation, originated in the seventeenth century and has grown into a vigorous branch of modern mathematics. As the foundation of statistical inference it has transformed science and is at the basis of much of modern technology. It has exercised significant influence in ethics and politics, although not always with full appreciation of either its strengths or its limitations.
Thousands of scientists, engineers, economists, and other professionals use the methods of probability and statistics in their work, aided by readily available computer software packages. But there is no strong consensus on the nature of chance in the universe, nor on the best way to make inferences from probability, so the subject continues to be of lively interest to philosophers. It is also part of daily experience—the weather, traffic conditions, sports, the lottery, the stock market, insurance, to name just a few—about which everyone has opinions.
The use of probability in science and technology is often quite technical, involving elaborate models and advanced mathematics that are beyond the understanding of nonspecialists. High-profile controversies may hinge on oversimplification by advocates and the media, unexplored biases, or a lack of appreciation of the extent of uncertainty in scientific results. Yet policy decisions based on such flawed evidence may have far-reaching economic and social consequences. Awareness of the role of probability is thus essential for judging the quality of empirical evidence, and this implies a moral responsibility for citizens of a democratic society.
Although many different techniques of the theory of probability are now in use, they all share a set of basic concepts. It is possible to express these concepts without advanced mathematics, but the concepts themselves are deep, and the results often counterintuitive. Insight may thus require persistent pondering. This entry presents the basic concepts in concise form, using only elementary mathematics. Further details and many applications are found in a wide range of introductory textbooks, written on various levels of mathematical abstraction.
A Simple Example
Consider as a first example the probability that a newborn child is a boy. One approach would be to use the theoretical model shown in Figure 1. According to Mendelian genetics, sex is determined by whether the sperm carries the father's X or Y chromosome; the egg has one of the mother's two X chromosomes. In a cell division called meiosis the twenty-three pairs of human chromosomes segregate to form two haploid (unpaired complement) cells called gametes, each containing twenty-two autosomes and one sex chromosome. In fertilization the male gamete (spermatozoan) fuses with a female gamete (ovum) to form a zygote, a diploid (double complement) cell with one set of chromosomes from each parent, its sex determined by the father. Assuming that the four possible outcomes are equally likely, two of them being XY, the probability that the child is male, written as Probability (male), can be defined as 2/4 = 1/2 = .5.
A second approach would look at the observed relative frequency of boys among the newborn, such as shown in Figure 2. In the ten-year period from 1991 to 2000 there were approximately 39,761,000 registered births in the United States. Of these 20,348,000 were boys, with a relative frequency of .5118. The annual proportions ranged between .5113 and .5123. One could say that the probability of a newborn child being male is .5118, or approximately .51.
Which answer then is correct? Most people would agree that the empirical result, based on such a large sample, has to override the model. In fact, the excess of boys among newborns has been observed throughout the world for centuries. The theoretical model is thus not an entirely correct representation of reality.
But what about the sex ratio in smaller samples? Figure 3 presents an experiment based on actual hospital records. The three graphs show the proportion of boys in 20 sequences each of 10, 50, and 250 consecutive births. Note that there is great variation in the sequences of 10, less for 50, and by 250 the proportions settle just above .5. Any one study yields only a single point, and the result from a small sample could be way off. For example, a researcher seeking to establish the proportion of boys among the newborn from a sequence of 10 could come up with a result of .2 or .9! In this example the approximate answer is already known, but in general this is not the case. The use of sample sizes too small to yield meaningful results is a serious problem in practical applications, as is the employment of inadequate theoretical models.
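The effect of sample size described above can be imitated with a small simulation. This is only a sketch and does not use the hospital records: it assumes independent births with a fixed probability of .51 for a boy, and the function names, the seed, and the choice of 20 sequences per sample size are illustrative.

```python
import random

def proportion_of_boys(n_births, p_boy, rng):
    """Simulate n_births independent births; return the fraction that are boys."""
    boys = sum(1 for _ in range(n_births) if rng.random() < p_boy)
    return boys / n_births

def simulate_sequences(n_births, n_sequences=20, p_boy=0.51, seed=0):
    """Return the proportion of boys in each of n_sequences simulated sequences."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    return [proportion_of_boys(n_births, p_boy, rng) for _ in range(n_sequences)]

# Assumed sample sizes, matching the three panels described in the text.
results = {n: simulate_sequences(n) for n in (10, 50, 250)}
for n, props in results.items():
    print(f"n = {n:3d}: smallest = {min(props):.2f}, largest = {max(props):.2f}")
```

The printed range of proportions would be expected to narrow as n grows, although any single run is itself subject to chance.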
Two Definitions of Probability
Figures 1 and 2 illustrate two ways of defining the notion of probability in a mathematical context.
CLASSICAL DEFINITION. If there is a finite number of possible outcomes of an experiment, all equally likely and mutually exclusive, then the probability of an event is the number of outcomes favorable to the event, divided by the total number of possible outcomes.
This is the case shown in Figure 1, where the probability that a newborn infant is male is given as 2/4 = .5. Customary examples include tossing an unbiased coin or throwing a balanced die. Most situations, however, do not involve equally likely outcomes. Nor does this definition explain what probability is, it just states how to assign a numeric value to this primitive idea in certain simple cases.
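The classical definition amounts to counting. The sketch below lists the four equally likely combinations of parental sex chromosomes and counts those favorable to the event "male". The helper names are illustrative, and the equal-likelihood assumption is the one made in the text.

```python
from fractions import Fraction
from itertools import product

# One sex chromosome from the father (X or Y), one from the mother (X or X).
father = ["X", "Y"]
mother = ["X", "X"]
outcomes = list(product(father, mother))  # four equally likely pairs

def classical_probability(event, outcomes):
    """Number of favorable outcomes divided by the total number of outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

def is_male(outcome):
    """A child is male when the father's contribution is the Y chromosome."""
    return outcome[0] == "Y"

p_male = classical_probability(is_male, outcomes)
print(p_male)  # 1/2
```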
STATISTICAL DEFINITION. The probability of an event denotes the relative frequency of occurrence of that event in the long run.
In Figure 2, the probability of a newborn infant being male is estimated to be about .51. This is also called the frequentist definition and is the one in common use. But it is not a fully satisfactory definition. What does "in the long run" mean? And what about situations in which the experiment cannot be repeated indefinitely under identical conditions, even in principle?
The Axiomatic Approach
A mathematically precise approach is provided by a third definition, the so-called axiomatic definition of probability, which incorporates the other two and is the foundation of the modern theory of probability. It begins with some abstract terms and then defines a few basic axioms on which an elaborate logical structure can be built using the mathematical theories of sets and measure. Probability is a number between zero and one, but nothing is specified about how to assign it. Assignment may be based on a model or on experimental data. Developments are valid if they follow from the axioms, as in other branches of mathematics, independently of any correspondence to phenomena of the physical world.
SAMPLE SPACE AND EVENTS. The framework for any probabilistic study is a sample space, often denoted by the letter S, a set whose elements represent the possible outcomes of an experiment. Subsets of S are called events, denoted by A, B, C, and so on. Consider an example of a finite sample space, and let S be the records of 100 consecutive births in a large urban hospital. Events are subsets of these records, defined by some characteristic of the newborn, such as sex, race, or birthweight. Assume further that this sample space of 100 births includes 51 boys, 9 of the infants were of low birthweight (LBW, defined as ≤ 2,500 grams), and 20 of the mothers smoked (actually, admitted to smoking) during their pregnancy; 3 of these mothers had LBW babies.
Hospital data of this type can be used, for example, to assess the relationship between smoking and low birthweight, important for the development of public health measures to lower the incidence of LBW. In a formal statistical design called a case-control study, a set of LBW babies is closely matched with controls of normal weight, to determine the proportion in each group whose mother smoked. Based on extensive data obtained from hundreds of hospital patients, this was the research method that led to the discovery that smoking is a cause of lung cancer. The case presented here is artificially simple, introduced to illustrate the abstract concepts that form the basis of mathematical probability.
THE ALGEBRA OF EVENTS. The relationships among events in a sample space can be represented by a Venn diagram, such as Figure 4. Let A = LBW babies, and let B = babies whose mother smoked. The event that A does not occur may be denoted by A′ ("A prime" or "not A"), consisting of the 91 babies of normal birthweight; A and A′ are called complementary events. The event that both A and B occur, the intersection of A and B, is denoted by A ∩ B ("A intersection B"), or simply AB, the set of 3 LBW babies whose mother smoked. The event that either A or B occurs (inclusive or), the union of A and B, is denoted by A ∪ B ("A union B"), the set of 26 babies who were LBW or their mother smoked, or both. Two events M and F are mutually exclusive if the occurrence of one precludes the occurrence of the other. Their intersection MF is the null set or impossible event, denoted by φ (the lower case Greek letter phi), where φ = S′, consisting of none of the experimental outcomes. For example, if M and F are the sets of male and female newborns, respectively, then (setting aside the complications of intersexuality) their intersection is an impossible event.
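The event operations above map directly onto set operations. In the sketch below the 100 records are represented by the integers 0 to 99. The particular record numbers assigned to each event are hypothetical and chosen only to reproduce the counts given in the text.

```python
# Sample space: 100 hospital records, labeled 0..99 (hypothetical labels).
S = set(range(100))

A = set(range(0, 9))    # 9 LBW babies
B = set(range(6, 26))   # 20 babies whose mother smoked; records 6, 7, 8 are also LBW

A_complement = S - A    # A': babies of normal birthweight
A_and_B = A & B         # intersection: LBW and mother smoked
A_or_B = A | B          # union (inclusive or)

M = set(range(0, 51))   # 51 boys
F = S - M               # 49 girls
M_and_F = M & F         # mutually exclusive events: empty intersection

print(len(A_complement), len(A_and_B), len(A_or_B), len(M_and_F))  # 91 3 26 0
```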
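THE AXIOMS OF PROBABILITY. The probability of an event A, denoted P(A), is a number that satisfies the following three axioms:
Axiom 1. 0 ≤ P(A) ≤ 1
Axiom 2. P(S) = 1
Axiom 3. P(A ∪ B) = P(A) + P(B), if A and B are mutually exclusive (AB = φ)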
Stating the axioms in words, the probability of any event A in the sample space S is a number between zero and one, and the probability of the entire sample space is one (because by definition S contains all possible outcomes). Furthermore, if two events are mutually exclusive (only one of them can occur), then the probability of their union (one or the other occurs) is the sum of their probabilities. These axioms are sufficient for a theory of finite sample spaces, and Axiom 3 can be generalized to more than two mutually exclusive events. Treatment of infinite sample spaces requires more advanced mathematics.
ELEMENTARY THEOREMS. The following results are immediate consequences of the axioms.
Theorem 1. P(φ) = 0
Theorem 2. P(A′) = 1 − P(A)
Theorem 3. P(A ∪ B) = P(A) + P(B) − P(AB)
The first two theorems state that the probability of the impossible event is zero, and the probability of "not A" is one minus the probability of A. Also called the addition theorem, the third statement means that elements that are in both sets should not be counted twice; the probability of overlapping events must be subtracted. In the hospital example, assuming that individual records are equally likely to be selected, so that the classical definition applies, P(A) = 9/100 = .09, P(B) = 20/100 = .20, and P(AB) = 3/100 = .03. Then the probability that a baby selected at random is either LBW or its mother smoked or both is P(A ∪ B) = .09 + .20 − .03 = .26.
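The arithmetic of the hospital example can be checked with exact fractions. The sketch below applies the complement rule and the addition theorem to the counts given in the text, then cross-checks the union by counting the three non-overlapping groups. The variable names are illustrative.

```python
from fractions import Fraction

n_total = 100
p_a = Fraction(9, n_total)    # P(A): low birthweight
p_b = Fraction(20, n_total)   # P(B): mother smoked
p_ab = Fraction(3, n_total)   # P(AB): both

# Probability of "not A" is one minus the probability of A.
p_not_a = 1 - p_a

# Addition theorem: subtract the overlap so it is not counted twice.
p_a_or_b = p_a + p_b - p_ab

# Cross-check by counting: LBW only + smoker only + both.
count_union = (9 - 3) + (20 - 3) + 3

print(float(p_not_a), float(p_a_or_b), count_union)  # 0.91 0.26 26
```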
Conditional Probability and Independence
The two related concepts of conditional probability and independence are among the most important in probability theory as well as its applications. It is often of great interest to know whether the occurrence of an event affects the probability of some other event.
CONDITIONAL PROBABILITY. If P(B) > 0, the conditional probability of an event A given that an event B has occurred is defined as
P(A∣B) = P(AB) / P(B)   (1)
that is, the probability of A given B is equal to the probability of AB, divided by the probability of B. For example, consider the conditional probability that a baby selected from the sample of 100 is LBW given that its mother smoked. Then P(A∣B) = .03/.20 = .15. For nonsmoking mothers, represented by B′, 6 of the 80 had LBW babies, so the probability of a LBW child is
P(A∣B′) = P(AB′) / P(B′) = .06/.80 = .075
Rearranging equation (1), and also interchanging the events, assuming P(A) > 0, yields the multiplication theorem of probability:
P(AB) = P(A∣B)P(B) = P(B∣A)P(A)
These relationships, obtained from the definition of conditional probability, lead to the definition of independence.
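The conditional probabilities for the hospital example, and the multiplication theorem, can be verified with a few lines of exact arithmetic. This is a sketch using the counts given earlier. The function name is illustrative.

```python
from fractions import Fraction

def conditional(p_joint, p_condition):
    """P(A | B) = P(AB) / P(B), defined only when P(B) > 0."""
    if p_condition <= 0:
        raise ValueError("the conditioning event must have positive probability")
    return p_joint / p_condition

p_a = Fraction(9, 100)    # low birthweight
p_b = Fraction(20, 100)   # mother smoked
p_ab = Fraction(3, 100)   # both

p_a_given_b = conditional(p_ab, p_b)              # LBW given smoker

p_not_b = 1 - p_b                                 # nonsmoking mothers
p_a_and_not_b = p_a - p_ab                        # LBW babies of nonsmokers
p_a_given_not_b = conditional(p_a_and_not_b, p_not_b)

p_b_given_a = conditional(p_ab, p_a)              # smoker given LBW

print(float(p_a_given_b), float(p_a_given_not_b))  # 0.15 0.075
# Multiplication theorem: both products recover P(AB).
print(p_a_given_b * p_b == p_ab, p_b_given_a * p_a == p_ab)  # True True
```

In this artificial sample P(A∣B) = .15 differs from P(A) = .09, so the two events do not behave as if one had no effect on the other.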
INDEPENDENCE. Two events A and B are said to be independent if the occurrence of one has no effect on the probability of occurrence of the other. More precisely, P(A∣B) = P(A) and P(B∣A) = P(B), if P(A) > 0 and P(B) > 0. The events A and B are defined to be independent if
P(AB) = P(A)P(B)
For example, one would expect a mother's smoking status to have no effect on the sex of her child. So selecting a hospital record at random, the probability of obtaining a boy born to a smoker would be the product of the probabilities, or (.51)(.20) = .102, about .10.
Assuming the independence of events is a common situation in applications. A prototype model is that of tossing a fair coin, with probability of heads P(H) = .5. Then the probability of two heads is P(HH) = .5 × .5 = .25, of three heads is P(HHH) = (.5)³ = .125, and the probability of n consecutive heads is (.5)ⁿ. It follows from Theorem 2 that the probability of at least one tail, or equivalently, the probability of not all heads, is one minus the probability of all heads.
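The coin-tossing calculations follow from multiplying the probabilities of independent events. The sketch below is a minimal illustration with hypothetical function names. It assumes independent tosses with the same probability of heads each time.

```python
def prob_all_heads(n_tosses, p_heads=0.5):
    """Probability of n consecutive heads in independent tosses."""
    return p_heads ** n_tosses

def prob_at_least_one_tail(n_tosses, p_heads=0.5):
    """Complement of 'all heads': one minus the probability of all heads."""
    return 1 - prob_all_heads(n_tosses, p_heads)

for n in (2, 3, 10):
    print(n, prob_all_heads(n), prob_at_least_one_tail(n))

# Independence in the hospital example: boy and smoking mother.
p_boy_and_smoker = 0.51 * 0.20
print(round(p_boy_and_smoker, 3))  # 0.102
```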
Taking a more real-life (although still oversimplified) example, consider the safety engineering of a space shuttle consisting of 1,000 parts, each of which can fail independently and cause destruction of the shuttle in flight. If each part has reliability of .99999, that is, its chance of failure is one in 100,000 launches, is that a sufficient safety margin for the shuttle? Application of the results above yields
P(at least one part fails) = 1 − (.99999)¹⁰⁰⁰ = 1 − .99005 = .00995 ≈ .01
that is, on average one in a hundred shuttle missions will fail, a somewhat counterintuitive result and an unacceptably high risk. With a component failure rate of one in 10,000, the chance of shuttle failure would be about one in ten. Achievement of a failure rate of only one in a million per individual part would be needed to lower the probability of a tragic launch to .001, one in a thousand.
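The shuttle figures can be reproduced by combining the independence rule with the complement rule. The sketch below assumes, as the text does, that the 1,000 parts fail independently and that any single failure destroys the shuttle. The function name is illustrative.

```python
def system_failure_probability(n_parts, part_failure_rate):
    """P(at least one of n independent parts fails) = 1 - (1 - q)^n."""
    return 1 - (1 - part_failure_rate) ** n_parts

# Failure rates per part: one in 10,000, one in 100,000, one in a million.
for rate in (1e-4, 1e-5, 1e-6):
    risk = system_failure_probability(1000, rate)
    print(f"part failure rate {rate:.0e}: shuttle failure probability {risk:.5f}")
```

The three results come out close to one in ten, one in a hundred, and one in a thousand, respectively.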
BAYES'S THEOREM. The definition of conditional probability yields formulas that are useful in many applications, and one of these has become known as Bayes's theorem.
Given two events A and B in a sample space S, with P(A) > 0 and P(B) > 0, Bayes's theorem can be written in its simplest form as
P(A∣B) = P(B∣A)P(A) / P(B)   (2)
Here P(A) is called the prior probability of A and P(A∣B) the posterior probability. Using the definition of conditional probability, the equation shows how to go from the known (or assumed) probability of an event A to estimating its probability given that the event B has occurred. Formula (2) can be generalized to n mutually exclusive events Ak that are jointly exhaustive (that is, one of them must occur and their union is S), and P(Ak) > 0, for any k = 1, 2, ..., n,
P(Ak∣B) = P(B∣Ak)P(Ak) / [P(B∣A1)P(A1) + P(B∣A2)P(A2) + ... + P(B∣An)P(An)]
Bayes's theorem is sometimes referred to as a formula for finding the conditional probabilities of causes. As a somewhat oversimplified example in medicine, it may be used to diagnose (by selecting the highest posterior probability) which of n diseases Ak a patient has, given a particular set of symptoms B, when the prior probability of each disease in the general population is known, as is the probability of this set of symptoms for each of the candidate diseases. The use of conditional probabilities in medical diagnosis has been extensively developed in the field of biostatistics.
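The diagnostic use of Bayes's theorem can be sketched as follows. All numbers here are invented for illustration: three hypothetical, mutually exclusive and jointly exhaustive conditions with assumed prior probabilities, and assumed probabilities of the symptom set B under each. Neither the figures nor the function name come from the text.

```python
from fractions import Fraction

def bayes_posteriors(priors, likelihoods):
    """Return P(Ak | B) for each k, given P(Ak) and P(B | Ak)."""
    if sum(priors.values()) != 1:
        raise ValueError("priors must sum to one (jointly exhaustive events)")
    joint = {k: likelihoods[k] * priors[k] for k in priors}  # P(B | Ak) P(Ak)
    p_b = sum(joint.values())                                # total probability of B
    if p_b == 0:
        raise ValueError("the observed event has zero probability")
    return {k: v / p_b for k, v in joint.items()}

# Hypothetical priors P(Ak) and symptom probabilities P(B | Ak).
priors = {"disease 1": Fraction(7, 10), "disease 2": Fraction(2, 10), "disease 3": Fraction(1, 10)}
likelihoods = {"disease 1": Fraction(1, 10), "disease 2": Fraction(5, 10), "disease 3": Fraction(9, 10)}

posteriors = bayes_posteriors(priors, likelihoods)
most_probable = max(posteriors, key=posteriors.get)

for name, value in posteriors.items():
    print(name, value, round(float(value), 3))
print("highest posterior probability:", most_probable)
```

With these assumed figures the most common condition a priori is no longer the most probable once the symptoms are taken into account.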
Bayes's theorem is also referred to as a formula for revising probabilities as new information becomes available. It is basic to a mode of induction called Bayesian inference, where, in contrast to classical or frequentist inference, previous information about a scientific problem is combined with new results to update the evidence. This approach pertains to an alternative, subjective interpretation of probability, in which the prior probability may be a personal assessment of the truth of the hypothesis of interest.
Random Variables and Probability Distributions
Research studies generally seek some quantitative information. In the present mathematical framework, these are numeric values associated with each element of the sample space, and the outcome is determined by the selection of elements in the experiment. The concepts involved are rather abstract. They are needed to connect the intuitive notion of probability with established mathematical entities on which standard operations can be performed to develop a mathematical theory.
RANDOM VARIABLE. The numeric quantity or code associated with each element of a sample space is called a random variable, usually denoted by the capital letters X, Y, and so on. Many different random variables can be assigned to the same sample space, depending on the aims of the study. A random variable may be discrete or continuous. The number of values assumed by a discrete random variable is finite or denumerably infinite (meaning that it can be put in one-to-one correspondence with the positive integers). A special case is the binary random variable, which has two outcomes (coded 1 and 0: heads/tails, success/failure, boy/girl). A continuous random variable assumes values along a continuum (e.g., temperature, height, weight). The random variables associated with each baby in the sample space S of 100 hospital records include sex, race, birthweight, and mother's smoking status.
PROBABILITY DISTRIBUTION. The set of probabilities of the possible values of a random variable is called the probability distribution of the random variable. The sum of the probabilities is one, because it includes the entire sample space, and P(S) = 1. In the simplest case of only two possible outcomes, such as the sex of a newborn child, the distribution consists of P(male) = .51 and P(female) = .49.
PARAMETERS OF A DISTRIBUTION. Parameters are constants that specify the location (central value) and shape of a distribution, often denoted by Greek letters. The most frequently used location parameter is the mean, also called the expected value or expectation of X, E(X), denoted by μ (lower case mu). Others are the median and the mode. E(X) is the weighted average of all possible outcomes of a random variable, weighted by the probabilities of the respective outcomes. An important parameter that specifies the spread of a distribution is the variance of the random variable X, Var(X), defined as E[(X − μ)²] and denoted by σ² (lower case sigma squared). It is the expected value of squared deviations of the outcomes from the mean, and is never negative because the deviations are squared. The square root of the variance, or σ, is called the standard deviation of X, which expresses the spread of the distribution in the same units as the random variable. These concepts are illustrated below for two basic probability distributions, one discrete and the other continuous. When Greek letters are used, it is assumed that the parameters are known. In statistical applications their values are usually estimated from the data. The variance is important as a measure of how widely the observations fluctuate about their mean value, with a small variance providing a more precise estimate of the unknown "true" mean μ.
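For a discrete distribution the mean and variance are weighted sums, which the following sketch computes for the binary variable "sex of a newborn" with P(male) = .51, coded 1 for a boy and 0 for a girl. The function names and the dictionary representation of a distribution are illustrative choices.

```python
import math
from fractions import Fraction

def expected_value(dist):
    """E(X): outcomes weighted by their probabilities."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Var(X): expected squared deviation from the mean."""
    mu = expected_value(dist)
    return sum((x - mu) ** 2 * p for x, p in dist.items())

# Distribution as {value: probability}; 1 = boy, 0 = girl.
sex = {1: Fraction(51, 100), 0: Fraction(49, 100)}

mu = expected_value(sex)      # 51/100
var = variance(sex)           # 2499/10000, equal to p(1 - p)
sd = math.sqrt(var)           # standard deviation, in the units of X

print(float(mu), float(var), round(sd, 4))  # 0.51 0.2499 0.4999
```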
BINOMIAL DISTRIBUTION. Independent repetition of a Bernoulli trial, an experiment with a binary outcome (success/failure) and the same probability p of success, n times yields the binomial distribution, specified by the parameters n and p. The random variable X, defined as the number of successes in n trials, can have any value r between 0 and n, with probability
P(X = r) = C(n, r) pʳ (1 − p)ⁿ⁻ʳ   (3)
where C(n, r), the binomial coefficient, is the combination of n things taken r at a time, given by the formula
C(n, r) = n! / [r! (n − r)!]   (4)
(The symbol n! is called "n factorial," the product of integers from 1 to n; 0! = 1. For example, 3! = 1 × 2 × 3 = 6.) Equation (3) is called the probability function of the binomial random variable. While random variables are generally denoted by capital letters, the values they assume are shown in lower case letters. (Elementary textbooks, however, do not always make this distinction.) For the binomial distribution E(X) = np and Var(X) = np(1 − p).
Returning to the hospital example, assume that 30 of the 100 infants belong to a minority race, and five records are selected at random. Then X, the number of minority babies selected, could be 0, 1, ... , 5. The probability that there is no minority baby among the five is
P(X = 0) = C(5, 0)(.3)⁰(.7)⁵ = (.7)⁵ = .168
where C(5, 0) = 1, because there is only one outcome in which all five babies are white. To obtain the entire distribution, C(5, r) needs to be calculated for the other values of r using formula (4). The binomial coefficients C(n, r) can also be read off Pascal's triangle, shown in Figure 5. C(5, r) is the fifth row, yielding the coefficients 1, 5, 10, 10, 5, 1. Applying these to equation (3) for all values of r, with n = 5 and p = .3, results in the binomial distribution shown in Figure 6, top row, left. The distribution for 20 babies is shown alongside, with expected value
E(X) = np = 20 × .3 = 6
This means that on average one can expect 6 babies of a random sample of 20 to belong to a minority group. The second row in Figure 6 shows the binomial distribution for p = .5 and n = 5 and 20, respectively.
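The binomial probabilities for the hospital example can be computed directly from the probability function, C(n, r) times p to the power r times (1 − p) to the power n − r. The sketch below uses `math.comb` from the Python standard library (version 3.8 or later) for the binomial coefficient. The function names are illustrative.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for the number of successes in n independent trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

def binomial_distribution(n, p):
    """Probabilities of r = 0, 1, ..., n successes."""
    return [binomial_pmf(r, n, p) for r in range(n + 1)]

# Fifth row of Pascal's triangle.
coefficients = [comb(5, r) for r in range(6)]
print(coefficients)  # [1, 5, 10, 10, 5, 1]

# Distribution of the number of minority babies among five records, p = .3.
dist5 = binomial_distribution(5, 0.3)
for r, prob in enumerate(dist5):
    print(r, round(prob, 4))

# Expected value for n = 20, computed as a weighted average; it should equal np.
dist20 = binomial_distribution(20, 0.3)
mean20 = sum(r * prob for r, prob in enumerate(dist20))
print(round(mean20, 6))  # 6.0
```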
NORMAL DISTRIBUTION. It is seen that for n = 20 the distribution looks bell-shaped, and is nearly symmetric even for the case p = .3, which is skewed for n = 5. In fact, it can be shown that the binomial distribution is closely approximated by the normal distribution, shown in Figure 7. The formula for the normal curve is the most famous equation of probability theory:
f(x) = [1 / (σ√(2π))] e^(−(x − μ)² / (2σ²))
To be read as "f of x," the symbol f(x) stands for "function of x," its numerical values obtained by computing the expression on the right for different values x of the random variable X. The distribution is completely determined by the parameters μ and σ, but also involves the mathematical constants π ≈ 3.142 and e ≈ 2.718, the base of the natural logarithm. Curves A and B have different means μ (4 and 8), but the same spread σ (1.0); B and C have the same mean μ (8), but different spreads σ (1.0 and .5). It can be seen that for each of these normal distributions most of the outcomes (actually about 95 percent) are within 2 standard deviations of the mean.
The normal random variable is continuous and can take on any value between minus and plus infinity. For continuous distributions f(x) is called the probability density function of the random variable, which describes the shape of the curve. But for a continuum one can speak of the probability of the random variable X only for an interval of values x between two points; it is given by the corresponding area under the curve, obtained by integral calculus. The total area under the curve is one, by definition, as it includes all possible outcomes. The normal distribution plays a central role in statistics, because many variables in nature are normally distributed and also because it provides an excellent approximation to other distributions.
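The area under a normal curve between two points can be evaluated without explicit integration by using the error function, available as `math.erf` in the Python standard library. The sketch below does this for the three curves described in the text (means 4, 8, 8 and standard deviations 1.0, 1.0, .5). The function names are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density f(x) of the normal distribution with mean mu and sd sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu, sigma):
    """Area under the curve to the left of x, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

def prob_between(a, b, mu, sigma):
    """P(a < X < b): the area under the curve between a and b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)

for label, mu, sigma in (("A", 4, 1.0), ("B", 8, 1.0), ("C", 8, 0.5)):
    within = prob_between(mu - 2 * sigma, mu + 2 * sigma, mu, sigma)
    print(f"curve {label}: peak height {normal_pdf(mu, mu, sigma):.4f}, "
          f"P(within 2 sd) = {within:.4f}")
```

The probability within two standard deviations is the same for all three curves, about .95, because it does not depend on μ or σ.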
Two Basic Principles of Probability Theory
The most fundamental aspect of mathematical probability can be observed empirically as a fact of nature, and also proved with rigor. This phenomenon can be expressed in the form of two principles. They are given here in their simplest versions, to convey the essential result.
LAW OF LARGE NUMBERS. This law holds that, in the long run, the relative frequency of occurrence of an event approaches its probability. It is illustrated by the empirical results of Figure 3. Stated more precisely: As the number of observations increases, the relative frequency of an event is within an arbitrarily small interval around the true probability, with a probability that tends to one. The law of large numbers connects observed relative frequency with the mathematical concept of probability, and has been proved with increasingly refined bounds on the true probability. A more general formulation pertains to the sample mean approaching the true mean, or expected value. If the occurrence of an event is denoted by 1 and its nonoccurrence by 0, then the relative frequency is the mean of the observations, which approaches the expected value p.
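The precise statement of the law can be illustrated by exact calculation rather than simulation. For a binomial count X in n trials, the sketch below computes the probability that the relative frequency X/n falls within a fixed distance of p, for increasing n. The choices p = 1/2, an interval half-width of 1/10, and the sample sizes 10, 50, and 250 are assumptions made for the illustration.

```python
from fractions import Fraction
from math import comb

def prob_within(n, p, eps):
    """Exact P(|X/n - p| <= eps) for a binomial count X with parameters n and p."""
    total = Fraction(0)
    for r in range(n + 1):
        if abs(Fraction(r, n) - p) <= eps:
            total += comb(n, r) * p ** r * (1 - p) ** (n - r)
    return total

p = Fraction(1, 2)      # assumed true probability
eps = Fraction(1, 10)   # assumed half-width of the interval around p

probs = {n: prob_within(n, p, eps) for n in (10, 50, 250)}
for n, value in probs.items():
    print(f"n = {n:3d}: P(|X/n - p| <= {float(eps)}) = {float(value):.4f}")
```

The probability rises toward one as n increases, which is what the law asserts for any fixed interval, however small.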
CENTRAL LIMIT THEOREM. This theorem states that, in general, for very large values of n, the sample mean has an approximate normal distribution. The theorem can be proved with great precision for a variety of conditions, without specifying the shape of the underlying distribution. Figure 6 suggests the result for the binomial distribution. A striking example is given in Figure 8, which shows the distribution of averages of 5 digits, selected at random from the integers between 0 and 9. This discrete random variable has a uniform distribution, where each outcome has the same probability .10. Yet the normal approximation is quite good already for this small sample size. The central limit theorem is a powerful tool for assessing the state of nature in a wide range of circumstances, with measures of uncertainty provided by the normal distribution.
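The distribution of the average of 5 random digits can be obtained exactly by counting, which shows how a flat distribution turns bell-shaped under averaging. The sketch below builds the distribution of the sum of five digits by repeated convolution. It assumes independent digits, each equally likely, and the function name is illustrative.

```python
from fractions import Fraction

def convolve(counts_a, counts_b):
    """Counts for the sum of two independent variables, given counts for each."""
    out = [0] * (len(counts_a) + len(counts_b) - 1)
    for i, a in enumerate(counts_a):
        for j, b in enumerate(counts_b):
            out[i + j] += a * b
    return out

digit_counts = [1] * 10          # digits 0..9, one way each (uniform)
sum_counts = [1]                 # the empty sum is 0 in exactly one way
for _ in range(5):
    sum_counts = convolve(sum_counts, digit_counts)

total = sum(sum_counts)          # 10**5 equally likely sequences of five digits
# Distribution of the average: value s/5 with probability count/total.
dist = {Fraction(s, 5): Fraction(c, total) for s, c in enumerate(sum_counts)}

mean = sum(x * q for x, q in dist.items())
var = sum((x - mean) ** 2 * q for x, q in dist.items())
print(float(mean), float(var))   # 4.5 1.65

# Crude text histogram of the averages, grouped by sum.
for s, c in enumerate(sum_counts):
    print(f"{s / 5:4.1f} {'#' * (c // 100)}")
```

The histogram peaks at averages of 4.4 and 4.6 and falls off symmetrically on both sides, in contrast to the flat distribution of a single digit.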
The concepts discussed here form the basis of the mathematical theory of probability, which, unlike the interpretation of probability, is not subject to controversy. The interested newcomer has a wide choice of textbooks as guides in further pursuit of the subject. The main criterion of selection should be comfort with the level of abstraction and the style of presentation: neither too terse nor too wordy. The purpose of symbols in mathematics is the unambiguous and universal expression of concepts. The use of symbols is an indispensable, welcome shorthand for those who understand; it should never be a hindrance to understanding.
Many ethical issues in science and technology require greater insight on the part of the public and call for better education concerning the extent of related uncertainties. But how does one promote understanding of a deep and complex notion such as chance and its myriad manifestations in everyday life? For the mathematical approach a good way is to start early: Encourage the young to play numbers games, to work on puzzles exploring the different ways things can happen, to confront logical paradoxes, and to savor the joy of insight—the aha! experience. Doing mathematics because it is fun enhances intuition and develops the habit of critical thinking, helping the child to grow into a self-confident adult always in search of understanding. But when is it too late? To play mathematical games the only requirement is to be young at heart.
Anderson, David R., Dennis J. Sweeney, and Thomas A. Williams. (1994). Introduction to Statistics: Concepts and Applications, 3rd edition. Minneapolis/St. Paul: West Publishing. Contains several chapters on probability, with many examples; no calculus required.
Armitage, Peter. (1971). Statistical Methods in Medical Research. New York: Wiley. Includes a concise summary of the basic concepts of probability theory. Fourth edition, cowritten with Geoffrey Berry and J. N. S. Matthews published in 2001, Malden, MA: Blackwell Science.
Edwards, A. W. F. (2002). Pascal's Arithmetical Triangle: The Story of a Mathematical Idea. Baltimore: Johns Hopkins University Press.
Feller, William. (1950, 1966). An Introduction to Probability Theory and Its Applications. 2 vols. New York: Wiley. A classic text of probability; Vol. 1 requires only elementary mathematics.
Gardner, Martin. (1978). Aha! Aha! Insight. New York: Scientific American. A popular collection of mathematical games, one of a series by this author, illustrated with cartoons; includes a chapter on combinatorics, a basic component of probability.
Gnedenko, Boris V. (1962). The Theory of Probability, trans. B. D. Seckler. New York: Chelsea Publishing. Translation of Kurs teorii veroyatnostei. A classic text on an advanced level by a leading Russian mathematician of the twentieth century.
Hacking, Ian. (2001). An Introduction to Probability and Inductive Logic. Cambridge, UK: Cambridge University Press. Introductory textbook for students of philosophy, with many examples from everyday life.
Hodges, J. L., Jr., and E. L. Lehmann. (1964). Basic Concepts of Probability and Statistics, 2nd edition. San Francisco: Holden-Day. Textbook in a more mathematical context, but does not require calculus. Second edition published 1970.
Kolmogorov, Andrei N. (1956). Foundations of the Theory of Probability, 2nd English edition, trans. Nathan Morrison. New York: Chelsea Publishing. Translation of Grundbegriffe der Wahrscheinlichkeitsrechnung (1933). The original work on the axiomatic basis of probability theory.
Kotz, Samuel; Norman L. Johnson; and Campbell B. Read, eds. (1982–1999). Encyclopedia of Statistical Sciences. 9 vols. plus supp. and 3 update vols. New York: Wiley.
Kruskal, William H., and Judith M. Tanur, eds. (1978). International Encyclopedia of Statistics. 2 vols. New York: Free Press.
Laplace, Pierre-Simon de. (1812). Théorie analytique des probabilités [Analytic theory of probability]. Paris: Courcier. First comprehensive treatment of mathematical probability.
Laplace, Pierre-Simon de. (1951). A Philosophical Essay on Probabilities, trans., from the 6th French edition, Frederick Wilson Truscott and Frederick Lincoln Emory. New York: Dover. Translation of Essai philosophique sur les probabilités, 1819. Addressed to the general public, included as the introduction to the third edition (1820) of the work listed above.
Riordan, John. (2002). An Introduction to Combinatorial Analysis. New York: Dover. A classic of combinatorial analysis, a branch of mathematics basic to probability.
Strait, Peggy Tang. (1989). A First Course in Probability and Statistics with Applications, 2nd edition. San Diego, CA: Harcourt Brace Jovanovich. A careful, thorough presentation of mathematical concepts and techniques for the beginner, with hundreds of examples from a wide range of applications.
Weaver, Warren. (1963). Lady Luck: The Theory of Probability. Garden City, NY: Anchor Books. A witty, engaging introduction to probability, addressed to the general reader.