## Markov Chains

## Markov Chains

# Markov Chains

A Markov chain is a chance process having the special property that one can predict its future just as accurately from a knowledge of the present state of affairs as from a knowledge of the present together with the entire past history.

The theory of social mobility illustrates the idea. Considering only eldest sons, note the status of successive generations of a particular male line in a society divided into three classes: upper, middle, and lower. If we assume the movement of a family among the three social classes is a chance process—governed by probabilistic laws rather than deterministic ones—several possibilities present themselves. An independent process represents a perfectly mobile society: the probability that a man is in a particular class depends in no way on the class of his father. At the other extreme is a society in which the probability that a man is in a particular class depends on the class of his father, that of his grandfather, that of his great-grandfather, etc.

A Markov chain is a process of intermediate complexity: the probability that a man is in a given class may depend on the class of his father, but it does not further depend on the classes of his earlier antecedents. For instance, while upper-class and middle-class fathers may have different probabilities of producing sons of a given class, an upper-class father whose own father was also upper class must have the same probability of producing a son of a given class as does an upper-class father whose father was, say, lower class. (This example will be used to illustrate each concept introduced below.)

No chance process encountered in applications is truly independent—in particular, no society is perfectly mobile. While in the same way no natural process exactly satisfies the Markov chain condition, many of them come close enough to make a Markov chain model useful.

**Formal definitions** For an exact formulation of the Markov chain concept, introduced in 1907 by the Russian mathematician A. A. Markov, imagine a system (family, society, person, organism) that passes with each unit of time (minute, hour, generation) from one to another of the s *states* E_{1} E_{2},. .. E_{3} (Upper class, E_{1} middle class, E_{2} and lower class, E_{3}, are the states in the social mobility example.) Assume that a chance process governs the evolution of the system; the chance process is the collection of probability laws describing the way in which the system changes with time. The system is that which undergoes change; a particular analysis may involve many systems of the same kind (many families, or many societies, etc.), all obeying the same process or set of probability laws.

Since the system passes through various states in sequence, time moves in jumps, rather than continuously; hence the integers 1, 2, 3,. .. provide a natural time index. Denote by P(E_{k}ǀE_{i}, E_{j}) the conditional probability that at time *n* + 2 the system is in state E_{k}, given that at times *n* and *n* + 1 it was in states E_{i} and E_{j} in that order; and similarly for longer or shorter conditioning sequences of states. We make the usual assumption that the conditional probabilities just defined do not depend on *n* (that is, the conditional probabilities do not change as time passes). The process is a Markov chain if P(E_{k}ǀE_{i}, E_{j}) = P(E_{k}ǀE_{j}), P(E_{l}ǀE_{i},E_{j},E_{k}) = P(E_{l},E_{k}), P(E_{m}ǀE_{i},E_{j},E_{k},E_{l}) = P(E_{m}ǀE_{l}), etc.

The *transition probabilitie* p_{ij} = P(E_{j}ǀE_{i}) determine the fundamental properties of the Markov chain; they form an s by s *transition matrix***P** = (p_{ij}) the basic datum of the process. (A matrix such as P that has only nonnegative entries and has rows summing to 1 is called a stochastic matrix.)

An independent process can be considered a special sort of Markov chain for which *p _{ij} = P_{j}* (that is, the

*p*do not depend on

_{ij}*i*): the rows of

**P**are all the same in this case. The second-order transition probabilities provide a second example of information contained in

**P**. If the system is presently in state E

_{i}, then the conditional probability that it will pass to E

_{j}and then to E

_{k}in the next two steps is P(E

_{j},ǀE

_{i})P(E

_{k}ǀE

_{i},E

_{j}) = P

_{ij}P

_{jk}hence the conditional probability that the system will occupy E

_{k}two time units later is

(Summing over the index *j* accounts for all the possible intermediate states.) But this second-order transition probability is just the *(j, k)* th entry in **p** ^{2} (the matrix **P** times itself in the sense of matrix multiplication).

Table 1 | |||
---|---|---|---|

E_{1} | E_{2} | E_{3} | |

E_{1} (upper class) | .448 | .484 | .068 |

E_{2} (middle class) | .054 | .699 | .247 |

E_{3} (lower class) | .011 | .503 | .486 |

Social mobility in England and Wales is approximately described by a Markov chain with transition matrix shown in Table 1. (These numbers must, of course, be arrived at empirically. The problem of estimation of transition probabilities is discussed later.) For example, a middle-class man has chance .054 of producing a son (recall that only eldest sons are considered here) who enters the upper class. Now the chance that a man in the upper class has a middle-class grandson is =(.448 x .484) + (.484 x .699) + (.068 x .503)= .589, while the chance that a man in the middle class has a middle-class grandson is = .639.

Notice that these last two probabilities are different, which raises a point often misunderstood. The Markov chain definition prescribes that an upper-class man with upper-class antecedents must have the same chance of producing a middle-class son as an upper-class man with lower-class antecedents has. But influence—stochastic influence, so to speak—of a man on his grandson, which does exist if is entirely consistent with the definition, which requires only that the grandfather exert this influence exclusively through the intermediate generation (the father).

To describe natural phenomena by Markov chains requires idealization; the social mobility example carried through here makes this obvious. Since Markov chains allow for dependence, however, they can with satisfactory accuracy account for the evolution of diverse social, psychological, biological, and physical systems for which an independent chance process would make too crude a model. On the other hand, Markov chains have simple enough structure to be mathematically tractable.

As another example of an approximate Markov chain, consider sociological panel studies. A potential voter is asked his party preference each month for six months preceding an election; his answer places him in state E_{l} (Republican), E_{2} (Democrat), or E_{3} (undecided). The voter’s progress among these states (approximately) obeys the Markov rule if, in predicting his August preference on the basis of his previous ones, one can without (essential) loss disregard them all except the most recent one, namely that for July (and similarly for the other predictions). In learning theory, Markov models have proved fruitful for describing organisms that change state from trial to trial in response to reinforcement in a learning process. (See Bibliography for applications in such areas as industrial inspection, industrial mobility, sickness and accident statistics, and economics.)

**Mathematical analysis** The transition matrix **P** determines many of the characteristics of a Markov chain. We have seen that the *(i, j)* th entry of **P** ^{2} is the conditional probability, given that the system is in E_{i}, that it will be in E_{j} two steps later. In the same way, the nth *order transition probability* the conditional probability that the system will occupy E_{j} after *n* steps if it is now in E_{i} is the *(i,j)* th entry in **P** ^{n}. But **P** alone does not determine the absolute probability that at time *n* the system is in state E_{i}. For this we need **P** and the initial probabilities .

The probability that the system occupies states E_{i} and E_{j} respectively at times *n* and *n* +1 is P_{ij} adding over the states possible at time *n,* we conclude that or, if **a** ^{(n)} denotes the row vector . More generally, the *m* th order transition probabilities link the absolute probabilities for time *n + m* to those for time n: or **a** ^{(n)}**p** ^{m} = **a** ^{(n+m)}. Thus the absolute probabilities are completely determined, via the relation **a** ^{(n)} = **a** ^{(1)}**P** ^{n-1}, by the transition probabilities p_{ij} and the initial probabilities In other words, **P** and **a** ^{(1)} completely specify the probability laws of the Markov chain.

*Chains* with all p_{ij} > 0. The most important mathematical results about Markov chains concern the stability of the and the behavior of for large *n*. Although some of the p_{ij} may be zero, suppose they are all strictly greater than zero. (Later this restriction will be lifted and other chains considered.) One expects that for large *n* should not depend much on i—the effect of the initial state should wear off. This is indeed true: there exist positive numbers p_{1},. . ., p_{8} such that is close to *P _{j}* for large . It follows that approaches as

*n*becomes large, so that the absolute probabilities stabilize near the

*p*, no matter what the initial probabilities are.

_{i}The numbers *p _{i}* have several important mathematical properties, which in turn have probability interpretations. Clearly they sum to one: Σ

*= 1. Moreover, if*

_{i}p_{i}*n*is large, is near Σ

*while is near*

_{i}P_{i}p_{ij}*p*Since , we have Σ

_{j}*, or, with*

_{i}p_{i}p_{ij}= p_{j}**p**=

*(p*,

_{1},. .. , p_{8},)

**p****P**=

*. Therefore the*

**p***p*solve the system

_{i}of s + 1 equations in *s* unknowns. Since it turns out that this system has but one solution, the limits *p _{i}* can be found by solving it, and this is the method used in practice. The vector p for the social mobility example is (.067, .624, .309). Whatever a man’s class, there is probability near .309 that a descendant a great many generations later is in the lower class.

The quantities *p _{j}* connect up with the absolute probabilities . If

**a**

^{(n)}=

**p**, then

**a**

^{(n+1)}=

**a**

^{(n)}

**P**=

**pP**=

**p**, and similarly

**a**

^{(m)}=

**p**for all

*m*beyond

*n*. In other words, if the system has at a given time probability exactly

*p*of being in state

_{j}*E*(for each j) then the same holds ever after. For this reason, the

_{j}*P*are called a set of

_{j}*stationary probabilities;*since the solution to (1) is unique, they form the only such set.

Thus a system evolving according to the laws of the Markov chain has long-range probability approximately *P _{j}* of being in

*E*; and if the probability of being in

_{ij}*E*at a particular time is exactly

_{j}*p*this relationship is preserved in the future. The law of large numbers gives a complementary way of interpreting these facts. Imagine a large number of systems evolving independently of one another, each according to the laws of a common Markov chain. No matter what the initial states of the many systems may be, after a long time the numbers of systems in the various states will become approximately proportional to the

_{j},*p*And once this state of affairs is achieved, it will persist, except for fluctuations that are proportionately small if the number of systems involved is large. In the social mobility example, the proportions in the three classes of a large number of family lines will eventually be nearly 067:624:309.

_{j}A system setting out from state *E _{i}* is certain to return again to E

_{i}after one or more steps; the expected value of the number of steps until first re-turn turns out to be 1/

*p*a result that agrees qualitatively with intuition: the lower the probability of a state the longer the average time between successive passages through it. There exist also formulas for the expected number of steps to reach E

_{i},_{j}for a system starting in E

_{i}. In the example from mobility theory, the mean time to pass from the lower class to the upper class is 26.5 generations, while the mean for the return trip is but 5.6 generations (which should be a lesson to us all).

*Other Markov chains*. In the analysis sketched above, it was assumed that the *p _{ij}* were all strictly greater than zero. More generally, the same results hold even if some of the

*p*equal zero, provided the Markov chain has the property that the states all communicate (for each

_{ij}*i*and

*j*there is an

*n*for which > 0) and provided also that there is no periodicity (no integer

*m*exceeding 1 exists such that a passage from E

_{i}back to E

_{i}is possible in

*n*steps only if

*n*is divisible by

*m*). Such a chain is called ergodic. The results can also be reformulated to cover nonergodic chains.

The theory extends in still other useful ways. (a) A chance process is a Markov chain of second order if the probability distribution of future states depends only on the two most recently visited states: P(E_{l}ǀE_{i},E_{j},E_{k}) =P(E_{i},E_{j},E_{k}), P(E_{m}ǀE_{i},E_{j},E_{k},E_{l}) = P(E_{m}ǀE_{k},E_{l}), etc. These processes contain ordinary Markov chains as a special case and make possible a more precise description of some phenomena. The properties derived above carry over to chains of second (and higher) order, (b) If, for instance, the system is a population and the state is its size, there are then infinitely many possible states. The theory generalizes to cover such examples, but the mathematics becomes more difficult; for example (1) becomes a system of infinitely many equations in infinitely many unknowns, (c) Continuous time, rather than time that goes in jumps, is appropriate for the description of some systems—for example, populations that change state (size) because of births and deaths that can occur at any instant of time. Such processes in principle admit an approximate description within the discrete-time theory: one observes the system periodically (every year or minute or microsecond) and ignores the state occupied at other time points. However, since such an analysis is often unnatural, an extensive theory of continuous-time processes has grown up.

**Statistical analysis** One may believe a given process to be a Markov chain—or approximately a Markov chain—because of his knowledge of the underlying mechanism. Or he may hypothesize that it is a Markov chain, hoping to check this assumption against actual data. Under the Markov assumption, he may want to draw from actual data conclusions about the transition probabilities *P _{ij}* which are the basic parameters of the model. In other words, statistical problems of estimation and hypothesis testing arise for Markov chains just as they do for independent processes [

*see*Estimation; Hypothesis Testing].

Suppose one observes *n* systems (governed by a common Markov chain), of which *n _{i}* start in state E

_{i}, and follows each of them through one transition. If

*n*is the number that step from E

_{ij}n_{i}to E

_{j}so that Σ

_{j}

*n*= n

_{ij}_{i}then the log-likelihood function is Σ

_{l.j}n

_{ij}log

*p*; maximizing this function with the

_{ij}*s*constraints Σ

_{j}

*p*= 1 gives the natural ratios

_{ij}*n*as the maximum likelihood estimators of the

_{ij}/n_{i}*P*The numerical matrix given for the social mobility example was obtained in this way from a sample of some 3,500 father-son pairs in England and Wales. If the

_{ij}*n*, are large, the estimators are approximately normally distributed about their mean values p

_{i}_{ij}. By seeing how far the

*n*, differ from putative transition probabilities

_{ij}/n_{i}*P*, one can test the null hypothesis that these pi; are the true parameter values. The statistic appropriate for this problem is Σ

_{ij}*; it has approximately a chi-square distribution with*

_{i.j}(n_{ij}– n_{i}p_{ij})^{2}/n_{i}p_{ij}*s (s*– 1) degrees of freedom if the

*n*are large.

_{i}Other tests are possible if the systems under observation are traced for more than one step. Suppose *n _{ijk}* is the number of the systems that start in E

_{i}, step to E

_{j}, and then step to

*E*. If the process is a Markov chain, then (a) the chance that the second step carries the system from

_{k}*E*to

_{j}*E*does not depend on the initial state

_{k}*E*and

_{i}*(b)*the transition probabilities for the second step are the same as those for the first.

Let a dot indicate an index that has been summed out; for example, *n _{.ij}* = Σ

*is the number of systems that are in states E*

_{k}n_{kij}_{i}and E

_{j}at times 2 and 3, respectively (with the state at time 1 completely unspecified). Assuming (a), one can test (b) by comparing the estimators

*ni}. /n,.*. of the transition probabilities for the first step with the estimators n

_{ij.}/n

_{i}. . of the transition probabilities for the second step. And one can test (a) itself by comparing the ratios

*n*(the estimated second-step transition probabilities) with the ratios

_{ijk}/n_{.i.}*n*(the estimated second-step transition probabilities, allowing for possible further influence of the initial state); if

_{ijk}/n_{.jk}*(a)*holds,

*n*should, independently of

_{ijk}/n_{ij.}*i,*be near

*n*; thus one can check on the Markov chain assumption itself.

_{.jk}/n_{.j.}This statistical analysis, based on following a large number of independently evolving systems through a small number of steps, really falls under the classical chi-square and maximum likelihood theory. There is also a statistical theory for the opposite case, following a small number of systems (perhaps just one) through many steps. Although the estimates and tests for this case have forms similar to those for the first case, the derivation of their asymptotic properties is more involved.

The statistical analysis of Markov chains may be extended in various directions. For example, a mode of analysis in which *times of stay* in the states are considered, together with transition probabilities, is discussed by Weiss and Zelen (1965).

Patrick Billingsley

*[Other relevant material may be found in* Panelstudies; Social Mobility.]

## BIBLIOGRAPHY

Anderson, T. W.; and Goodman, Leo A. 1957 Statistical Inference About Markov Chains. *Annals of Mathematical Statistics* 28:89-110. → A detailed treatment of the problem of many short samples.

Billingsley, Patrick 1961 Statistical Methods in Markov Chains. *Annals of Mathematical Statistics* 32: 12-40. → A review paper covering the problem of one long sample from a Markov chain. Contains a large bibliography.

Blumen, Isadore; Kogan, Marvin; and Mccarthy, Philip 1955 *The Industrial Mobility of Labor as a Probability Process*. Cornell Studies in Industrial and Labor Relations, Vol. 6. Ithaca, N.Y.: Cornell Univ. Press. → A detailed empirical study of mobility problems.

Feller, William 1950–1966 *An Introduction to Probability Theory and Its Applications*. 2 vols. New York: Wiley. → A classic text covering continuous-time processes as well as chains with infinitely many states. Excellent examples. The second edition of the first volume was published in 1957.

Glass, D. V.; and Hall, J. R. 1954 Social Mobility in Great Britain: A Study in Inter-generation Changes in Status. In D. V. Glass (editor), Social *Mobility in Britain*. London: Routledge. → This is the source of the original data for the social mobility example.

Goodman, Leo A. 1961 Statistical Methods for the Mover–Stayer Model. *Journal of the American Statistical Association* 56:841–868.

Goodman, Leo A. 1962 Statistical Methods for Analyzing Processes of Change. *American Journal of Sociology* 68:57-78. → This and the preceding paper treat modifications of the Markov model for mobility problems.

Jaffe, Joseph; Cassotta, Louis; and Feldstein, Stanley 1964 Markovian Model of Time Patterns of Speech. *Science* 144:884–886.

Kemeny, John G.; and Snell, J. Laurie 1960 *Finite Markov Chains*. Princeton, N.J.: Van Nostrand. → Contains both theory and examples. This is the source of the figures for the social mobility example.

Prais, S. J. 1955 Measuring Social Mobility. *Journal of the Royal Statistical Society* Series A 118:56-66. → This is the source of the analysis of the social mobility example.

Weiss, George H.; and Zelen, Marvin 1965 A Semi-Markov Model for Clinical Trials. *Journal of Applied Probability* 2:269–285.