## information theory

**-**

## Information Theory

# INFORMATION THEORY

Among the more interesting trends of the past half-century has been the consolidation of probability, statistics, combinatorial optimization, information theory, and computer science into a single imposing discipline of *infomatics*.

## Minimal Belief Change

Of special philosophical interest is the enrichment of Bayesian inference by a rule of belief change that goes by minimizing the distance between two probability distributions, *P* = (*p* _{1}, … , *p* _{k } and *Q* = (*q* _{1}, … , *q* _{2}), as measured by the expected log-likelihood ratio:**(1)**

The likelihood ratio, *P* (*e* |*h* ):*P* (*e* |*k* ), is a fundamental index of the support that *e* accords *h* over *k* (see the entry "Foundations of Statistics").

Using the visually transparent Gibbs inequality,**(2)** ln *x* ≤ *x* − 1

with equality if and only if (iff) *x* = 1, in the equivalent form ln *x* ≥ 1 − 1/*x*, it follows that *H* (*P, Q* ) ≥ 0 with equality iff *P* = *Q*. Notice, however, that *H* (*P, Q* ) ≠ *H* (*Q, P* ).

Alan Turing and his wartime assistant, Irving John Good, used *H* (*P, Q* ) in their code-breaking work, but it was not until 1959 that another wartime code breaker, Solomon Kullback, developed its properties systematically in his book *Information Theory and Statistics* (1959), unleashing a floodtide of applications to classification, contingency tables, pattern recognition, and other topics.

A second line of development began with Claude Shannon's creation of information and coding theory (Shannon and Weaver 1949), whose central concept is a measure of uncertainty,**(3)**

that Shannon dubbed the entropy function. One sees that minimizing *H* (*P, U* ) against a uniform distribution, *U* = (*k* ^{−1}, … , *k* ^{−1}), namely,

is equivalent to maximizing the entropy (MAXENT). In view of this relation *H* (*P, Q* ) is often called the cross (or relative) entropy of *P* with respect to *Q*, and the rule of minimizing it MINXENT.

Shannon characterized *H* (*P* ) axiomatically as continuous, strictly increasing in *k* when *P* = (*k* ^{−1}, … , *k* ^{−1}) is uniform, and by the general form of a consistency requirement exemplified by:

where *A* (3) = *H* (⅓, ⅓, ⅓), and so on. Roughly speaking, one's uncertainty about a lottery is the same for an equivalent two-stage form of it, where in the example *H* (^{3}/_{9}, ^{4}/_{9}, ^{2}/_{9}), is one's uncertainty about the first stage.

Exploiting the conceptual link Ludwig Boltzmann established between thermodynamic entropy and Shannon entropy through his celebrated formula, *S* = *k* ln *W*, Edwin T. Jaynes showed how to derive the probability distributions (over microstates) of statistical mechanics by maximizing the (Shannon) entropy subject to mean value constraints given by measured values of macro variables like pressure, volume, or internal energy. Intuitively, the maxent distribution is the most spread out of all those satisfying the constraints, the one that is maximally noncommittal with regard to missing information. Statistical mechanics could thus be seen as a branch of statistical inference operating on physically given constraints. Jaynes quickly realized that MAXENT admits of much

k | 1 | 2 | 3 | 4 | 5 | 6 | entropy |

n_{k} |
3246 | 3449 | 2897 | 2841 | 3635 | 3932 | |

g_{k} |
0.16230 | 0.17245 | 0.14485 | 0.14205 | 0.18175 | 0.19660 | 1.784 990 |

p_{k} |
0.15294 | 0.15818 | 0.16361 | 0.16922 | 0.17502 | 0.18103 | 1.790 103 |

q_{k} |
0.16433 | 0.16963 | 0.14117 | 0.14573 | 0.18656 | 0.19258 | 1.785 225 |

r_{k} |
0.16139 | 0.17361 | 0.14434 | 0.14256 | 0.18215 | 0.19594 | 1.784 993 |

wider application. A collection of his papers (Jaynes 1983) sparked an explosion of applications whose subsequent progress can be followed in the published proceedings of workshops on maximum entropy and Bayesian methods held annually since 1981 (e.g., see Erickson and Smith 1988). Brian Buck and Vincent A. Macaulay's *Maximum Entropy in Action* (1991) is a sampler of physical applications to such fields as spectroscopy, X-ray crystallography, the structure of macromolecules, magnetic resonance, thermodynamics, and plasma physics. The method is universal, however, and applies equally well to image reconstruction or time series.

## Inferring Hidden Causes

Irving John Good (1983) viewed *H* (*P, Q* ) as a natural measure of deviation suitable for testing multinomial hypotheses, and like Karl Pearson's better-known chi-squared measure of deviation

the analogous informational measure

is asymptotically distributed as *χ* ^{2}_{k −1}, the chi-square distribution with *k* − 1 degrees of freedom. By comparing the current model to the ideally best-fitting model, the resulting psi test sets limits to how much room there is for improving support by moving to a new (possibly more complicated) model (Jaynes 2003). When such a move is indicated, MINXENT helps guide one to plausible better-fitting models, as the following example illustrates:

Seen in the first row of Table 1 are the frequencies with which the six faces of a white die turned up in *N* = 20,000 tosses recorded by the Swiss astronomer Rudolf Wolf. One sees at a glance that Wolf's die was biased; the question of interest is how to find physically plausible hypotheses that best account for these data. The distributions shown in rows 3 to 5 of the table afford successively better approximations to the empirical distribution, {*g* _{j}}, of row 2. Jaynes (1983) found them by moving the initial uniform distribution just enough to satisfy mean value constraints given by the data that correspond to physical imperfections of a die. Among the most plausible are:

(1) A shift in the center-of-mass due to the different amounts of ivory removed from opposite faces

(2) Oblateness, due to the difficulty of machining a perfect cube (Jaynes 1983)

The first implies a mean higher than the 3.5 of an "honest" die. Maximizing the entropy subject to the observed mean, 〈*i* 〉 = 3.5983, yields the distribution *P* of row 3. Notice that it assigns higher probability to the 6 spot than the 1 spot, the 5 spot than the 2 spot, and the 4 spot than the 3 spot, with decreasing margins, which is just what the posited physical cause would lead one to expect. But while *P* improves on the uniform distribution, its fit to the data is still poor, with *X* ^{2} = 33.4. This points to the existence of another constraint. The lower than expected frequencies of faces 3 and 4 are best explained by oblateness of the die, the 3-4 dimension being longer than the other two. This is reflected in the nonzero mean value, *f̄* _{2} = 0.1393, of the oblateness function, *f* _{2}(*j* ) = 1 for j = 1, 2, 5, 6 and *f* _{2}(*j* ) = −2 for j = 3, 4. Adding this constraint to the first leads to the maxent distribution *Q* for both constraints (row 4). The fit to Wolf's data is now fairly good, but there is some slight evidence for a third constraint, a slightly chipped 2-3-5 corner. Maximizing entropy subject to all three constraints then yields the distribution of the last row, whose fit to the data is preternaturally good. Note: It makes no difference in which order the constraints are applied, provided each one is retained in applying the next one.

If Wolf's die still exists, one could actually put the posited physical imperfections to the test of careful measurements. In any case, one would expect other dice to exhibit the first two imperfections. As it happens, Wolf also tossed a red die 20,000 times, and this expectation is realized (Jaynes 1983). To permit such inferences to physical causes, one must assume, of course, that Wolf tossed his dice in a manner that precludes skill.

MAXENT first enters, then, as a technique of mathematical modeling, as a means of generating plausible hypotheses for explaining old data or testing against new data (Good 1983, pp. xvi–xvii, 41, 99–100). Indeed, most of the probability distributions that figure prominently in pure and applied probability are maxent distributions for a suitable set of constraints. For example, the exponential distribution is maxent for a positive random variable whose (finite) first moment is given. (This result depends on extending the cross entropy function to continuous distributions, sums giving way to integrals.) Jaynes (2003, §7.6) even makes a case for thinking the ubiquitous use of the normal law of errors is owing to the fact that it is the maxent distribution having specified values of the first two moments.

When predictions based on MAXENT are verified, one's belief in the completeness of the given constraints is borne out, but one learns most when one's predictions fail. For then one infers the existence of a previously unsuspected cause to which the observed deviations point one. However, for this kind of "learning from error" to be effective requires that those inferences be one's best inferences (Jaynes 2003, p. 326). In what sense, then, are MINXENT or MAXENT inferences best possible?

One can look, first, at how MAXENT operates on a mean value constraint:

With **∑** *p* _{i } = 1, the method of Lagrange for constrained extrema yields a solution,**(4)** *p* _{i } = Z(*λ* )^{−1} exp(*λf* (*x* _{i })) *i* = 1, 2, … , *n*

with. More generally, given *m* < *n* such constraints, , a solution is**(5)** *p* _{i } = *Z* (λ_{1}, …, λ_{m })^{−1} exp[λ_{1}*f* _{1}(*x* _{i }) + … + λ_{m }*f* _{m }(*x* _{i })]

for *i* = 1, … , *n*, where the partition function, *Z*, is defined by:**(5a)**

Moreover, the moments of the resulting maxent distribution are given by the derivatives of ln *Z*, so that, in particular, the mean values are:**(5b)**

as one can verify. Thus, *Z* is a kind of (exponential) generating function and is called the partition function for reasons given by Jaynes (2003, pp. 280–282). The parallel minxent distribution is given by:**(6)**

where *P* ^{0 } = (*p* ^{0 z}, …, *p* ^{0 n}) is the initial "pre-distribution." (6) reduces to (5) when *p* ^{0 i} = *n* ^{−1} is constant.

For a simple discrete example, let the mean for a die be 〈*i* 〉 = 4. Here, *x* _{i } = *i* and *f* (*x* ) = *x*, so the partition function becomes (with *x* = *e* ^{λ }):

Then since *dx* /*dλ* = *e* ^{λ } = *x*, the chain rule yields:

whereupon, setting *d* ln *Z* /*dλ* = 4, 3(1 − *x* )(1 − *x* ^{6}) = 5*x* ^{7} − 6*x* ^{6} + *x*, which is solved numerically for *x* to yield, *x* = 1.190804264, or *λ* = ln *x* = 0.1746289309. Hence, the maxent distribution of mean 4 is:

(*p* _{1}, … , *p* _{6}) = (.103, .123, .146, .174, .207, .247)

The general formula for mean *f̄* works out to (6 − *f̄* )*x* ^{7} − (7 − *f̄* )*x* ^{6} + *f̄x* − *f̄* + 1 = 0, and one seeks a root other than *x* = 1. Verify that for mean 3.5, the only root is *x* = 1, so the maxent distribution is uniform. Thus, MINXENT gives back the pre-distribution if it already satisfies the constraint—a redundancy property that should hold generally. Computer programs are available for finding maxent distributions subject to mean value constraints.

The method of Lagrange for constrained extrema is not guaranteed to yield a global maximum, and so to settle this point, one invokes the Gibbs inequality (2) to obtain:

when **∑** *u* _{i } = **∑** *p* _{i } = 1, whence

−**∑** *p* _{i } ln *p* _{i } ≤ −**∑** *p* _{i } ln *u* _{i }

with equality iff *p* _{i } = *u* _{i }. Knowing the form (5) of the maxent distribution, one sets*u* _{i } = *Z* (*λ* _{l }, … , *λ* _{m })^{−1} exp[*λ* _{l }*f* _{l }(*x* _{i }) + … + *λ* _{m }*f* _{m }(*x* _{i })]

and the last inequality then specializes to:

with equality iff the *p* _{i } are given by (5). This not only shows that (5) is the (one and only) distribution of maximum entropy satisfying the given constraints, but that the right side,**(7)**

is the maximum entropy permitted by those constraints. Any distribution of lower entropy must be importing additional information. Or, as in the example of Wolf's die, if the data distribution is of lower entropy than a hypothesized distribution, an additional condition must be constraining those data to lie in a proper subset of those allowed by the hypothesis in question.

## Axiomatic Characterization

Every property of MINXENT (MAXENT) rules out potential rivals. Write *P* = *P* ^{0} ○ *C* for the distribution, *P*, nearest the pre-distribution, *P* ^{0}, among all those satisfying a constraint, *C* (of class *C* ), where for equality or inequality linear constraints the class *C* is closed and convex and there is a nearest *P* to *P* ^{0}. The following properties of MINXENT are then easily proved (Shore and Johnson 1980, §4; Williams 1980):

Uniqueness: The minxent distribution is unique

Redundancy

P○C=PwhenPsatisfies the constraintC[sinceH(Q, P) = 0 only ifQ=P]Chain consistency: The order in which constraints are applied is immaterial, provided each is retained in applying the next one

System independence: The "post-distribution" should not import a dependence between two variates that is neither implied by the constraints nor the pre-distribution

Invariance:

P=P^{0 }○Cshould not depend on one's choice of a coordinate systemSubset independence: It should not matter whether one obtains a conditional distribution for disjoint subsets of outcomes by finding the post-distribution nearest the conditional prior for each subset or by obtaining the post-distribution for the whole outcome space and conditioning it on each subset

To illustrate the last property, the even and odd faces in tossing a die may be reported separately. Let the prior be uniform and the mean for both the odd and even faces be 4. By redundancy, the post-distribution, *Q*, is also uniform on the even faces: *Q* (2) = *Q* (4) = *Q* (6) = ⅓, while for the odd faces, one finds (as above) that *Q* (1) = 0.1162, *Q* (3) = 0.2676, and *Q* (5) = 0.6162. One can also solve the problem by finding the post-distribution for all six states by applying the mean value constraints conjointly obtaining:

Q ′(1) = 0.0524 |
Q ′(2) = 0.1831 |

Q ′(3) = 0.1206 |
Q ′(4) = 0.1831 |

Q ′(5) = 0.2778 |
Q ′(6) = 0.1831 |

To condition on the subsets of odd and even faces, divide each column by its sum and obtain the same conditional distributions found earlier. If these two ways of solving the problem did not agree, MINXENT could be justly deemed inconsistent.

In 1980 John E. Shore and Rodney W. Johnson vindicated Jaynes's conjecture that "deductions made from any other information measure … will eventually lead to contradictions" (1983, p. 9) by deriving MINXENT for mean value constraints from uniqueness and the last three mentioned properties. All these ring changes on the consistency requirement that two ways of doing a calculation must agree, the very condition from which Richard T. Cox derived the basic rules of probability (see the entry "Foundations of Statistics"). Rather than attempt a sketch of their proof, one can illustrate how rival rules violate the Shore-Johnson axioms and lead to contradictions.

## Inconsistency of Other Rules

An alternative measure of the spread of a distribution, Σ*p* ^{2}*i*, dubbed the repeat rate by Turing, is, like entropy, continuous and assumes its extreme values of 1/*n* and 1 at the extremes of uniformity and concentration. The closely related Gini diversity, , has a well-established place in statistics as a measure of the qualitative diversity of a population, with widespread applications to such fields as genetics, ecology, and linguistics. Table 2 shows how closely the distribution of a die of mean 4 obtained by minimizing Σ*p* ^{2}*i*, approximates the maxent distribution for this constraint found earlier.

On a superficial examination, one might easily suppose that the repeat rate (RR) rule performs about as well as MAXENT. Still, one knows it must violate one of Shore and Johnson's (1980) consistency postulates, and inconsistency always brings a degradation of performance in its train.

In casting a red and a white die with mean values, *E* (*R* ) = 4 and *E* (*W* ) = 3, and a uniform prior, the method of Lagrange for the RR-rule leads in this case to three linear equations in three unknowns (the Lagrange multipliers), which are easily solved to yield the joint post-distribution:

from which the marginal distributions for *R* and *W* are computed to be:

and

One easily checks that *p* _{i }*q* _{i } ≠ ^{1}/_{36} = π(*i,i* ). Hence, the RR-rule violates system independence. Its inconsistency shows up even more clearly when one enlarges the problem to the case of symmetric mean values, *E* (*R* ) = 3.5 + *Δ* and *E* (*W* ) = 3.5 − *Δ*. From the joint post-distribution, one can compute a value of *Δ*, namely, *Δ* = ^{7}/_{12} at which the RR-rule makes the smallest of the joint probabilities zero, even though no outcome is excluded by the pre-distribution or the constraints. For values of *Δ* greater than ^{7}/_{12}, the RR-rule breaks down completely, making some of the joint probabilities negative.

Consider the more general family of rules that minimize a Csiszar divergence:*H* _{f }(*P, Q* ) = ∑*q* _{i }*f* (*p* _{i }*q* _{i })

with *f* a convex function. This family includes MINXENT as the special case *f* (*x* ) = *x* ln *x*, as well as the chi-squared rule that minimizes, given by *f* (*x* ) = (*x* − 1)^{2}. Applied to a uniform prior, *q* _{i } = 1/*n*, this is minimized by minimizing Σ*p* ^{2}*i*, hence, it inherits the inconsistencies of the RR-rule.

## Relation to Bayesian Conditioning

The upshot of Shore and Johnson's (1980) axiomatic derivation of MINXENT is to place it on a par with Bayes's rule for revising a probability assignment—the most basic rule of belief change. Hence, it is of interest to see what MINXENT delivers in this case (Williams 1980). For any distribution *P* satisfying *P* (*B* ) = 1:**(8)** *H* (*P, P* ^{0}) = *H* (*P* _{B }, *P* ^{0}_{B }) − ln *P* ^{0}(*B* )

writing *P* _{B }(*A* ) = *P* (*AB* )/*P* (*B* ) for the conditional measure. For then using

since *P* _{B }(*x* _{i }) = *P* (*x* _{i })/*P* (*B* ) = *P* (*x* _{i }), which proves (8). Since (8) is clearly a minimum when *H* (*P* _{B }, *P* ^{0}_{B}) = 0, hence, when (*P* _{B } = *P* ^{0}_{B}), one sees that the distribution *P* nearest *P* ^{0} among those for which *P* (*B* ) = 1 is the Bayesian posterior distribution, *P* ^{0}_{B}.

What if the constraint *P* (*B* ) = 1 is weakened to *P* (*B* ) = *q*, 0 < *q* < 1? For any such *P* one has the following straightforward generalization of (8):**(8a)**

with *q̄* = 1 − *q*. The right side is minimized by making both *H* (*P* _{B }, *P* ^{0}_{B}) and *H* (*P* _{B̄ }, *P* ^{0}_{B}) zero, which means that the nearest *P* to *P* ^{0} is given by:

a *q* -weighted average of the conditional distributions for *B* and its negation. There is an obvious generalization to a partition *B* _{l }, … , *B* _{n } and the constraint *P* (*B* _{i }) = *q* _{i }, with ∑*q* _{i } = 1. This special case of MINXENT is known as Jeffrey conditioning. Indeed, Williams (1980) generalizes

i | 1 | 2 | 3 | 4 | 5 | 6 |

Maxent | .103 | .123 | .146 | .174 | .207 | .247 |

RR-rule | .095 | .124 | .152 | .181 | .209 | .238 |

further to the case where the *B* 's need not be mutually exclusive. The validity of this rule requires that the sole affect of the datum or sensory input is to raise the probability of *B* to a value *q* < 1. For conditions under which Jeffrey conditioning is not valid, see Jaynes (2003, §5.6).

## Frequency Connections

MAXENT also has frequency connections (Jaynes 1983). Of the *k* ^{N } outcomes (i.e., outcome sequences) of *N* trials, the number that yields category counts (*n* _{l}, … , *n* _{k}) with ∑*n* _{i } = *N* is given by the multinomial coefficient:

Using Stirling's approximation to the factorial function, one easily proves that**(9)** *N* ^{−1} ln *W* → *H* (*f* _{l }, … , *f* _{k })

Hence, the maxent distribution is *W* -maximizing, hence realized by the most outcomes. Moreover, given two sets of relative frequencies, {*f* _{i }}and {*f* _{i }′},**(10)**

gives the ratio of the number of ways each can be realized, where *H* = *H* (*f* _{l }, … , *f* _{k }) and *H* ′ = *H* (*f* _{l }′, … , *f* _{k }′), (*f* _{i } / *f* _{i }′) and . For example, for the two distributions of Table 2, *H* = 1.7485056 and *H* ′ = 1.7470082, and so at *N* = 20,000 trials, the maxent distribution is realized by *W* /*W* ′ = 9.86 × 10^{12} more outcomes than the similar looking distribution of the RR-rule. The peak is thus enormously sharp. Just how sharp is quantified by Jaynes's concentration theorem (1983), which allows one to compute the fraction of possible outcome sequences whose category counts, *f* _{i }, have entropy in the range *H* _{max} − *ΔH* ≤ *H* (*f* _{l }, … , *f* _{k }) ≤ *H* _{max}. This gives, in effect, a new kind of significance test for detecting when additional constraints are hidden in one's data.

Not unrelated results concern typical outcomes of a stationary Markov process (Khinchin 1957). Namely, if *H* is the entropy of a regular Markov chain and *s* sufficiently large, then almost all *s* -step outcomes *C* satisfy:

with η arbitrarily small. That is, almost all *s* -step sequences of the process deliver information arbitrarily close to the average, *H* ^{(s)} = *sH*. The entropy is defined in the obvious way as the expectation of*H* _{i } = −∑*p* _{ik } log *p* _{ik }

the one-step uncertainty, so that *H* = σ_{i }*P* _{i }*H* _{i } where *P* = (*P* _{l }, … , *P* _{k }) is the stationary distribution and (*p* _{ik }) is the one-step transition matrix.

Aleksandr Khinchin's prophetic remark, that "the study of entropy will become a permanent part of probability theory" (1957, p. 2), has been borne out not only by the flowering of the maxent method but also by D. S. Ornstein's proof that entropy is a complete invariant of an ergodic Markov chain (for an informal treatment and references to the mathematical literature, see Suppes 2002, §4.5). That is, two ergodic chains (in which any state is reachable from any other) are isomorphic iff they have the same period and the same entropy.

## Information Theory and Statistical Mechanics

The frequency implications of MAXENT play their most important role, however, in statistical mechanics. Thus, Ludwig Boltzmann found the famous Maxwell-Boltzmann energy distribution for molecules in a conservative force field by partitioning the 6n-dimensional phase space of position and velocity (or momentum) coordinates into cells, *R* _{k }, small enough for the energy to be a constant, *E* _{k }, and large enough to contain a sizeable number, *N* _{k }, of molecules. Then the total number, *N*, and the total energy, *E*, are constants of the molecular motion. Boltzmann argues that the most probable distribution, (N_{l }, … , N_{s }), is the one that is realized by the most microstates among those compatible with the constraints:

∑*N* _{i } = *N* and ∑*N* _{i }*E* _{i } = *E*

By virtue of (5), this most probable distribution,**(11)**

is none other than the maxent distribution for the given constraints.

In this derivation, Boltzmann may be said to have launched MAXENT and the information theoretical approach to statistical mechanics. All the canonical distributions J. Willard Gibbs (1902) later derived are simply maxent distributions for other sets of constraints, for example, that for fixed values of the total energy and angular momentum is Gibbs's rotational ensemble. Unfortunately, Gibbs slipped in the logical basis of his derivation so unobtrusively that most readers missed it. Moreover, he provided no clear or compelling rationale, so that in their famous review article of statistical mechanics published a decade later, Paul Ehrenfest and Tatiana Ehrenfest (1912) dismissed Gibbs's method of derivation as "a mere analytic trick" (for the relevant history, see Jaynes 1983, pp. 98ff). Thus began a long siege of confusion and controversy over the justification of Gibbs's formalism that continues to this day, notwithstanding Jaynes's rediscovery of the MAXENT method of Gibbs in 1957 and his clear rationale for using it (1983, chapter 1). In particular, the information theoretical approach dispenses with the ergodic hypothesis.

Jaynes's second great contribution was to extend the Gibbsian (MAXENT) formalism to irreversible processes and nonequilibrium thermodynamics (1983, chapter 10, §D). He writes:

The final breakthrough came in the Christmas vacation period of 1962 when, after all else had failed, I finally had the courage to sit down and work out all the details of the calculations that result from using the Maximum Entropy Principle;

and nothing else. Within three days the new formalism was in hand, masses of the known correct results of Onsager, Wiener, Kirkwood, Callen, Kubo, Mori, MacLennon, were pouring out as special cases, just as fast as I could write them down; and it was clear that this was it. (p. 239)

The unbelievably short derivation of (11) as the equilibrium distribution of the energies has seemed too short to many physicists and philosophers. This initial impression is only reinforced when it is seen that (11) implies the familiar barometric formula,**(11a)** *ρ* (z) = *ρ* (0) exp(−*βmgz* )

for the density of the atmosphere at height *z*, as well as Maxwell's velocity distribution in the form:**(12)** *f* (*v* _{b }) = *B* (β)exp(−β*mv* ^{2}_{b }/2)

where *f* (*v* _{b })*dv* _{b } is now the probability that a molecule has a velocity in a tiny neighborhood of *v* _{b }. Indeed, even more follows, for (12) does not depend on the position of the molecule (an assumption Boltzmann was forced to make in his derivation of the Maxwell distribution from his collision equation). This implies, in turn, the dynamic stability of the distribution. However, the MAXENT derivation seems to ignore the dynamics altogether.

Jaynes responded in several ways. First, the derivation does not ignore the dynamics; it uses conservation of energy as well as the preservation of the volumes of the cells, *R* _{k }, under evolution of the system (Liouville's theorem). Jaynes emphasizes that, in addition, one is trying to predict reproducible macrostates. These are *ipso facto* under the experimenter's control, and so the myriads of details concerning microstates not under his control must needs be irrelevant for prediction. Moreover, reproducible macroscopic properties must be characteristic of the overwhelming preponderance of microstates in the allowed region of the phase space. Given the large number of degrees of freedom entailed, the maxent distribution will be enormously peaked. Hence, predictions of other macro quantities based on their mean values will be correct (within experimental error) with probability close to one (Jaynes 1983). As far as Jaynes is concerned, these considerations fully explain the predictive success of equilibrium thermodynamics as "inferences from the available information."

In particular, there is no need to appeal, as Maxwell and Boltzmann did (but Gibbs did not) to the equality of infinite time averages and averages with respect to the canonical distribution. Even if this could be established from some other easily verified assumption—the program of ergodic theory—it would be nothing to the purpose. For one would have to show, in addition, that the averages over finite time intervals involved in measuring macro quantities closely approximate their infinite time averages, and there are positive reasons to doubt this (Jaynes 1983).

Apart from the clarity, unity, and simplicity the information theoretical approach brings to the foundation of statistical mechanics, David Hestenes (1993) considers Jaynes's greatest merit to lie in his recognition that "in the evolution of statistical mechanics the principles of physics had gotten confused with principles of statistical inference" (1993, p. 153).

Jaynes regards the formalism of quantum mechanics as a similar "nasty omelet" scrambling together properties of physical systems and our information about them in ways that are difficult to unscramble. See the cited article by David Hestenes for one noteworthy attempt to disentangle subjective and objective aspects of the electron wave function, namely, a probability factor and a kinematic factor, using a powerful "universal geometric calculus" based on Hermann Grassmann's Ausdehnumgslehre. Hestenes purports to show, in particular, that the complex "probability amplitudes" of the formalism have nothing to do with probability per se but have, instead, a physical origin in the "Zitterbewegung", or circular dance, of the electrons generating their spin and magnetic moment.

Jaynes's views on quantum mechanics are outlined (with references) in the same volume—a festschrift in his honor—in which the article by Hestenes appears. For other views, see Philosophy of Statistical Mechanics and Quantum Mechanics.

** See also ** Bayes, Bayes's Theorem, Bayesian Approach to Philosophy of Science; Bell, John, and Bell's Theorem; Bohr, Niels; Boltzmann, Ludwig; Copenhagen Interpretation; Einstein, Albert; Gibbs, Josiah; Maxwell, James Clerk; Philosophy of Statistical Mechanics; Probability; Quantum Mechanics; Semantics; Turing, Alan M.

## Bibliography

Buck, Brian, and Vincent A. Macaulay, eds. *Maximum Entropy in Action*. Oxford, U.K.: Clarendon Press, 1991.

Cover, Thomas M., and Joy A. Thomas. *Elements of Information Theory*. New York: Wiley, 1991.

Erickson, Gary J., and C. Ray Smith, eds. *Maximum-Entropy and Bayesian Methods in Science and Engineering*. 2 vols. Dordrecht, Netherlands: Kluwer Academic, 1988.

Gibbs, J. Willard. *Elementary Principles of Statistical Mechanics*. New Haven, CT: Yale University Press, 1902.

Good, Irving John. *Good Thinking*. Minneapolis: University of Minnesota Press, 1983.

Hestenes, David. "The Kinematic Origin of Complex Wave Functions." In *Physics and Probability: Essays in Honor of Edwin T. Jaynes*, edited by W. T. Grandy Jr. and P. W. Milonni, 153–160. New York: Cambridge University Press, 1993.

Jaynes, Edwin T. "A Backward Look to the Future." In *Physics and Probability: Essays in Honor of Edwin T. Jaynes*, edited by W. T. Grandy and P. W. Milonni, 261–275. New York: Cambridge University Press, 1993.

Jaynes, Edwin T. "Clearing up Mysteries—The Original Goal." In *Maximum Entropy and Bayesian Methods*, edited by John Skilling. Dordrecht, Netherlands: Kluwer Academic, 1989.

Jaynes, Edwin T. *Papers on Probability, Statistics, and Statistical Physics*, edited by R. D. Rosenkrantz. Dordrecht, Netherlands: D. Reidel, 1983.

Jaynes, Edwin T. *Probability Theory: The Logic of Science*. New York: Cambridge University Press, 2003.

Khinchin, Aleksandr. *Mathematical Foundations of Information Theory*. New York: Dover, 1957.

Kullback, Solomon. *Information Theory and Statistics*. New York: Wiley, 1959.

Shannon, Claude, and Warren Weaver. *The Mathematical Theory of Communication*. Urbana: University of Illinois Press, 1949.

Shore, John E., and Rodney W. Johnson. "Axiomatic Derivation of Maximum Entropy and Minimum Cross Entropy." *IEEE Transactions on Information* IT-26 (1980): 26–37.

Suppes, Patrick. *Representation and Invariance of Scientific Structures*. Stanford, CA: CSLI Publications, 2002.

Williams, Paul M. "Bayesian Conditionalization and the Principle of Minimum Information." *British Journal for the Philosophy of Science* 31 (1980): 131–144.

*Roger D. Rosenkrantz (2005)*

## Information Theory

# Information Theory

The concepts and measures of the statistical theory of selective information (information theory) have become so thoroughly enmeshed with the whole of behavioral science that delineation of the exact contribution of the theory is nearly impossible. The very verbal descriptive fabric of the behavioral sciences has become thoroughly interlaced with informational concepts: individuals or groups are described as “information sources” or “receivers”; skilled performance is described as “information processing”; memory is described as “information storage”; nerves are described as “communication channels”; the patterning of neural impulses is described as “information coding”; the brain is described as “an informational computer,” etc. Indeed, the molecule, the cell, the organ, the individual, the group, the organization, and the society have all been examined from the point of view of a general systems theory which focuses upon the information-processing, rather than upon the energetic, characteristics of each system (J. G. Miller 1955). Perhaps the closest analogue to the impact of information theory upon psychology is the impact that behaviorism had upon psychology, with the subsequent redefinition of psychology as the science of behavior. In both cases questions of definition have replaced questions of possible relevance. [*See*Systems Analysis, *article on*General Systems Theory.]

Information theory is a formal mathematical theory, a branch of the theory of probability. As such, the theory is self-contained; it does not require verification by experiment (Frick 1959). Yet, formal theories often have profound influence as conceptual models and as models for experiment. The theory is indelibly flavored by the context of electrical communications and control in which it was developed. Cherry ([1957] 1961, pp. 30-65) has charted the development of the theory within the field of communications. The genesis of the modern theory of statistical communications is due primarily to Hartley (1928). Building upon earlier work, by Nyquist and Küpfmüller, Hartley showed that in order to transmit a given amount of information a communication channel must undergo an exchange between its duration and its bandwidth, or frequency range. With a narrower frequency range, the communication channel must be available for a longer duration to transmit a given amount of information. Information was identified with a arbitrary selection of symbols from a set of defined symbols. The measure of information was defined in terms of the logarithm of the number of equally likely symbols available for communication. The essence of the idea is that information is measured in terms of what *could have been* communicated under a defined set of circumstances rather than in terms of what actually *is* communicated at a particular moment. The definition is sufficiently broad to provide a general framework for the specification of a wide class of communication systems. Following Hartley, numerous distinguished contributions were made throughout the world. These included the contribution by R. A. Fisher in characterizing the efficiency and sufficiency of statistics and that of D. Gabor, who introduced the concept of the *logan* as the elementary unit of information. It was the contributions of Shannon (1948) and of Wiener (1948), however, which provided the intellectual synthesis that marked the birth of modern information theory. [*See*Cybernetics*and the biography of*Wiener; *see also*Frick 1959.]

Shannon provides a scheme for a general communication system.

It consists of essentially five parts: 1. An

information sourcewhich produces a message or a sequence of messages to be communicated to the receiving terminal. … 2. Atransmitterwhich operates on the message in some way to produce a signal suitable for transmission over the channel. … 3. Thechannelis merely the medium used to transmit the signal from transmitter to receiver…. During transmission, or at one of the terminals, the signal may be perturbed by noise. … 4. Thereceiverordinarily performs the inverse operation of that done by the transmitter, reconstructing the message from the signal. … 5. Thedestinationis the person (or thing) for whom the message is intended. ([1948] 1962, pp. 4-6)

This description, while initially directed toward electrical communication systems, is sufficiently general to use in the consideration of a wide class of information systems.

**The measure of information.** The essential idea of the Shannon-Wiener mathematical theory of communication is that communication is a statistical process which can be described only in probabilistic terms. When it is possible to predict completely each message out of a source of possible messages, by definition no information will be conveyed by the message. When any one message is as probable as any other possible message, maximum information will be conveyed by the message. From this point of view, the information of any message is associated with the reduction in the range of possible selections by the receiver of any message, i.e., with the reduction of the receiver’s uncertainty. Uncertainty, choice, information, surprise value, and range of selections, therefore, all become intimately related concepts. The meaning, reasonableness, and personal importance of the message, however, are not considered within this approach to communications. The concern of the theory is to provide a measure of the amount of information for any selection or choice from defined sources of information.

The measure of the amount of information associated with a given selection can be arbitrarily defined as the logarithm of the number of equally likely alternatives. The measure can also be rigorously derived by initially defining a set of conditions which it must satisfy. Shannon employed the latter procedure, and the interested reader is referred to the original article (1948) for the statement of the conditions. Luce (1960) has also listed a set of necessary conditions which lead to the same result. The conditions are (*a*) independence of irrelevant alternatives—the amount of information transmitted by the selection of any item from a defined set of items shall be a real number which depends only upon the probability of that item, *p(i),* and not upon the probability distribution of the other items; (*b*) continuity—the measure of information shall be a continuous (in the mathematical sense) function of *p(i);* (c) additivity—if two independent selections, *i* and *j,* with probabilities *p(i)* and *p(j),* are made, the amount of information transmitted in the joint selection of *(i, j)* shall be a simple sum of the information transmitted by each of the selections; and (*d*) scale—one unit of information is associated with a binary selection between two equally likely alternatives; the unit is called the *bit.* The only measure which satisfies all of these conditions for any symbol *i* is the negative logarithm (to the base 2) of the probability of *i, p(i).* And, over the ensemble of possible items, the average information of the ensemble is the average weighted logarithmic measure, *H,* or

The *H* measure has a number of important properties. First, *H* ≥ 0; it is 0 if, and only if, the probability of *a* single *i* equals 1, while the probability of the remaining *(n —1)i* is equal to 0; otherwise *H* is greater than 0. Information is associated with any ensemble of items whenever there is uncertainty about which item will be presented. Second, *H* is maximum when all of the items are equally probable. If there are *n* possible items, the uncertainty associated with the set of items is maximum when *p(i)* = *1/n.* Third, *H* is maximum if all items occur independently of each other. If the occurrence of one item is related to the occurrence of another, the average information is reduced by the extent of that relatedness. This property is extremely important for the behavioral sciences, since the information measure provides a measure of the degree of relatedness between items in a set of possible items. The ratio of the uncertainty of a source to the maximum possible uncertainty with the same set of symbols is a measure of the relative transmitting efficiency of the source; Shannon has termed this the relative entropy of the source. And 1 minus the relative entropy has been defined as the *redundancy* of the source. Fourth, it is possible to encode a long sequence of items so that, on the average, *H* binary units per item are required to specify the sequence, even though more than *H* binary units per item are required to specify a short sequence. This property, called the encoding property, has recently become extremely important for electrical communications but has not been much exploited by the behavioral sciences.

**History.** Although Hartley’s development of the theory provided both the essential definition of information and a measure for its description, it went little noticed in the behavioral sciences. The historian of science will probably note that the behavioral sciences were not ready for the idea, nor, for that matter, was communications engineering fully ready. Shannon’s development, on the other hand, was enthusiastically grasped very early by a handful of psychologists, primarily those associated with the Psycho-Acoustics Laboratory at Harvard University.

George A. Miller, in a personal communication to the author of this article in January 1964, has described the intellectual ferment associated with the early developments. Noteworthy is Miller’s comment that “had the group not been actively interested in other related ideas from the communication engineers, the occurrence of Shannon’s paper probably would have gone unnoticed by the group.” The initial enthusiasm was stirred by the realization that Shannon had provided a tool for describing discrete events that at the same time was compatible with the continuous Fourier analysis of communication systems, with which the group was already acquainted.

Dahling has provided a valuable bibliographic survey of the early spread of concepts of the theory of selective information through a number of fields, ranging from telecommunication to journalism. Information theory provides an interesting case study for the diffusion of ideas in modern science, because of its great impact and its relative recency. From his analysis Dahling concluded that “the idea was drawn from a flurry of current related activity and, as the idea developed, it gained impetus and speed of adoption from the same surrounding activity that gave rise to it” and that “the adoption of the theory was speeded by a clearly apparent need for such a theory” (1962, p. 121). Moreover, “because the idea dealt with matters of common interest, it was able to spread more rapidly between disciplines” (p. 126). The idea “spread to other disciplines in proportion to its congeniality with their methods” and “to its analogic and suggestive value” (p. 132). Experimental psychologists working in communication problems and trained in the mathematics of communication engineering became logical carriers of the theory to the behavioral sciences. The introduction of information concepts to psychology was made by several routes. A number of excellent summaries are available that trace this development within experimental psychology: Attneave (1959), Garner (1962), Luce (1960), G. A. Miller (1956), and Quastler (1955). Here we shall briefly summarize a few of the salient avenues of entry into the field of experimental psychology, although parallel developments can doubtlessly be cited in any of the behavioral sciences. Thus, the balance of this review is a highly selective examination of the role of information theory in the social sciences. It is not a *general* review.

**Organization of behavior sequences.** The information measure was introduced to psychology in a now classic paper by Miller and Frick (1949). Their primary concern was the description of sequences of discrete responses. Their aim was twofold : the development of a stochastic model of behaviorial sequences and the development of a quantitative measure of the organization of the sequences. The Markov model, employed by Shannon, served as their descriptive model for the generation of response sequences; the information measure served as their measure of the degree of organization of the sequences [*see*Markov Chains].

For illustrative purposes response sequences of rats and of young girls, provided earlier in a multiple-choice experiment by Hamilton, were analyzed. An index of response stereotypy was identified as 1.0 minus the ratio of the obtained uncertainty, relative to maximum possible uncertainty. Thus, the measure of the stereotypy of response sequences is formally identical with the measure of relative redundancy of communication sources. For example, two responses, left and right, are defined as the class of responses available for observation of a rat in a given experimental situation. If the sequence of the rat’s responses were perfectly predictable (e.g., were all left responses or a left-right alternation sequence), there would be 0 uncertainty in specifying successive responses. Thus, an index of response stereotypy of 1 would be obtained. Conversely, if the rat responded left and right equally often and if the sequence of responses was unpredictable, there would be maximum uncertainty in specifying successive responses. An index of response stereotypy of 0 would be obtained. In Hamilton’s data identical indexes of response stereotypy were obtained for both girls and rats when the distributions of single-response choices were examined and when the distributions of pairs of successive choices were examined. The responses of girls became differentiated from those of the rats only when sequences of three successive choices were analyzed. In pointing out the importance of the higher-order statistical structure of response sequences and in providing an objective measure of its patterning, Miller and Frick laid the groundwork for the mathematical modeling of complex response sequences. [*See*Rresponse Sets.]

*Language.* The statistical analysis of language represents a special application of the analysis of response sequences. Indeed, interest in cryptographic secrecy systems profoundly shaped the direction of Shannon’s development of information theory. The English alphabet is potentially capable of producing many more different messages than it actually does. In practice certain letters are employed more frequently than others, e.g., *e* relative to *z*; and certain sequences occur more frequently than others, e.g., *th* relative to *ht.* Shannon (1951) measured the relative redundancy of English and obtained a lower bound of about 50 per cent and an upper bound of about 75 per cent redundancy relative to a random distribution of letters. A related observation is that English text may be nearly completely reconstructed when as much as 50 per cent of the text has been deleted (Shannon 1951; Chapanis 1954). Furthermore, in most communications environments the range of possible communications is strongly restricted by situational factors. In tower-pilot communications at air force bases (e.g., Fritz & Grier 1955; Frick & Sumby 1952), it was demonstrated that the over-all redundancy may approach 95 per cent, again relative to a random distribution of letters. As a result of Shannon’s work and, especially, its popularization by Miller, nonlinguists became willing to tackle the intricacies of language as a complex series of response sequences, amenable to measurement and quantitative specification. [*See*Linguistics.]

A related development of information concepts in psychology was the demonstration of the important role of informational factors in the perception of speech (G. A. Miller 1951). For example, the intelligibility of words heard against a noise background is a critical function of the size of the test vocabulary (Miller et al. 1951), i.e., a critical function of stimulus information. A given word might be perceived nearly perfectly when it is embedded within a small vocabulary of possible words but might be perceived correctly only rarely when it is embedded within a large vocabulary of possible words. This result is reasonable if information is associated with what could have been presented, rather than in terms of what actually is presented.

A number of different investigators have found the Miller, Heise, and Lichten data to be a rich source for testing theories of choice behavior. For example, Garner (1962) has demonstrated that these data are consistent with the assumption that under a given signal-to-noise ratio different vocabularies may transmit the same information. Stated differently, a large vocabulary coupled with a high error rate may yield nearly the same amount of information as that transmitted by a small vocabulary and a low error rate. [*See*Perception, *article*speech perception.]

*Identification experiments.* Another way that information concepts have been introduced to psychology is by the quantitative description, in informational terms, of the identification experiment (Garner & Hake 1951). In the identification experiment, one of n possible stimuli is presented to the subject, whose task is to identify which one of the *n* stimuli was presented. For example, the instruments of a symphony orchestra are defined as the class of objects for study, and one of the instruments is sounded at random. The listener is instructed to indicate which instrument of the defined set was sounded. When the stimuli are well ordered and associated with a common unit of measurement—weight, length, duration, frequency, etc.—identification performance may readily be described in terms of conventional statistical measures, e.g., the average error. When the stimuli are not well ordered, as in the case of the symphonic instruments or a series of photographs depicting various emotional moods, identification performance cannot readily be described in terms of such conventional statistical measures. The transmitted-information measure is ideally suited to be an appropriate nonmetric measure of relatedness between stimuli and responses. In addition, a vexing methodological problem is associated with the identification experiment for ordered stimuli. The identification experiment attempts to answer a straightforward question: how many stimuli can be correctly identified? The answer to the question, furnished by a given body of data, depends upon what criterion for errors is adopted. If a small average error is permitted, the same body of data will admit a larger number of distinguishable stimuli than if a large average error is permitted.

A resolution to this problem is suggested by Hake and Garner (1951), who demonstrated that while the proportion of errors is greater in locating a point at one of 50 possible positions than at one of ten positions, the amount of transmitted information for ten possible positions is about equal to that for 50 possible positions. In turn, the amount of transmitted information specifies an equivalent number of perfectly identified categories.

A concentrated flurry of experimental activity demonstrated limited transmitted-informational capabilities with a wide variety of stimulus variables. Although the categorical measure of information was better matched to nonmetric variables, most of the initial studies took place with welldefined stimulus variables upon continuous scales, e.g., length of line, direction, sound frequency, etc. The only apparent advantage of the information measure to these studies was that a single measure of performance could be employed across a wide set of variables. The historian will probably judge that many experimental psychologists had previously steered clear of variables with weak metric properties and, as a result, were unable to appreciate immediately the full potential of the informational technique for nonmetric variables. In any event, the number of identifiable stimulus components associated with any single stimulus variable was found to be disappointingly small—from less than four to about ten stimuli. However, experimenters quickly discovered that a large number of identifiable stimulus components could be achieved by employing a large number of different stimulus variables, as long as the number of components per variable was kept reasonably small. (This story is told in G. A. Miller 1956, by means of the engaging title “The Magical Number Seven, Plus or Minus Two”; and in Garner 1962.)

*Response speed and skilled tasks.* An area of active experimental interest is the relation between the speed of response and the informational characteristics of skilled tasks. Hick (1952) sparked interest in this area with his demonstration that reaction time was linearly related to the logarithm of the number of possible alternatives available to the subject. Further, he suggested that a measure of the *rate* of information transmitted, in terms of the reciprocal of the slope of the empirical function relating reaction time to stimulus information, might be achieved from a discrete-trials reaction-time experiment. This transformation provides an estimate of the rate of information transmission in humans as about five to seven bits per second (Bricker 1955). More recent findings, however, have shown that with highly overlearned tasks, such as the reading of numerals, there is little change in reaction time as a function of the information of the task (Mowbray & Rhoades 1959; Leonard 1959). In this circumstance, identification of the rate of information transmission with the reciprocal of the slope of the reaction-time functions would lead to the unreasonable conclusion that there is an infinitely high rate of information transmission. The rate of information transmitted by the human receiver has been measured directly, in a variety of tasks, by a number of investigators. (This work is summarized in Quastler 1955, pp. 305-359.) In highly overlearned tasks there is an upper limit of about 35 bits per second, which is jointly a function of the highest effective speed, the highest effective range of alternatives, and the highest effective transmission rate (*ibid.,* p. 345). For most tasks, man’s information-transmission rate is far lower than 35 bits per second. Electronic channels of communication, by contrast, have capabilities of millions or billions of bits per second. Clearly, man’s forte is not the rate at which he can process information. When one examines certain structural features of information processing, however, the disparity between man and machine is narrowed. The largest and most elaborate of computers cannot yet perform many pattern-recognition tasks routinely performed by children. However, the rapid development of sophisticated computer programs may radically alter this situation. As Garner suggests, we shall need to devote more emphasis to the structural, as distinguished from the metric, characteristics of information if we are to understand human information processing. [*See*Learning, *article on*Acquisition of skill; Reaction time.]

*Structure of information.* The structural examination of information is based upon a multivariate extension of Shannon’s analysis by McGill (1954; 1955) and by Garner and McGill (1956). This work is summarized by Garner (1962). Garner has demonstrated the power of a multivariate information analysis for dissecting the information-relevant features of complex information sources. In this development, formulas associated with multiple correlation and with the analysis of variance find their direct counterparts within multivariate information analysis. Multivariate information analysis thus achieves the status of a general statistical tool for categorical materials, regardless of the appropriateness of the specific conceptualization of behavior in the terms of source, channel, noise, receiver, designation, etc. Furthermore, the efficiency of experimental design may be evaluated from the point of view of multivariate informational analysis. [*See*Multivariate analysis; *see also*McGill 1955.]

Gainer’s approach to a structural analysis of an information source rests on the distinction between *internal* structure, the relations between variables composing a set of stimuli, and *external* structure, the relations between stimuli and responses. This distinction is perhaps clarified by referring to Figure 1. A total ensemble of 16 possible stimulus patterns results from the binary coding of four variables: figure shape (circle or triangle), dot location (above or below), gap location (right or left), and number of internal lines (one or two). Thus, the 16 possible patterns have a potential

information transmission of four bits per pattern. Let us now arbitrarily select a subset of eight of the possible 16 patterns. Such a subset has a potential information transmission of only three bits per pattern. According to Garner, the one bit lost in terms of external structure can appear in the form of internal structure. In the eight patterns of subset A of Figure 1, internal structure is represented by a simple contingency between figure shape and gap location (right gap with circles; left gap with triangles). In the eight patterns of subset *J,* internal structure is represented by a four-variable interaction among the variables. In these terms, subsets *A* and J represent the same amount of external structure and the same amount of internal structure but differ in the *form* of their internal structure. As a result of the differences in form of internal structure, the identification, from the 16 possible patterns, of the eight patterns of subset *A* is substantially superior to the identification of the eight patterns of subset *J.* The free recall of subsets with simple internal structure is also superior to that of subsets with complex internal structure (Whitman & Garner 1962). In the opinion of the author of this article, an extension of this method of structural analysis might reasonably be expected to provide a tool for the experimental assault upon qualitative differences in information. [*For further discussion of structural analysis, see*Systems analysis, *article on*Psychological systems.]

The close relationship between information theory and psychology can be best summarized by the concluding remarks of the 1954 Conference on Information Theory in Psychology, organized by Henry Quastler. Although more than a decade has intervened, the remarks are nonetheless appropriate today.

There is something frustrating and elusive about information theory. At first glance, it seems to be the answer to one’s problems, whatever these problems may be. At second glance it turns out that it doesn’ t work out as smoothly or as easily as anticipated. Such disappointments, together with some instances of undoubtedly ill-advised use, have caused a certain amount of irritation. So nowadays one is not safe in using information theory without loudly proclaiming that he knows what he is doing and that he is quite aware that this method is not going to alleviate all worries. Even then, he is bound to get his quota of stern warnings against unfounded assumptions he has allegedly made.

It seems that these warnings have reached a point of diminishing returns. Most of us who still use information theory are quite aware of the fact that this method is difficult, full of pitfalls, and definitely limited. We are hopeful, of course—nobody would work in a difficult field without expecting results—but always ready for a sober evaluation of the domain of our method.

It has become very clear that information theory is one thing, information measures another. The two are historically linked, but can very well be disassociated. Information theory is defined by concepts and problems. It deals in a very particular way with amounts of variation, and with operations which have effect on such amounts. Information theory needs some measure of variation—but it doesn’ t have to be H; neither is the applicability of H and related measures restricted to information theory. (Quastler 1955, pp. 2-3)

Although a biophysicist by training, Quastler was acutely sensitive to psychological problems, as witnessed by the perspective of the quotation cited above. His death was a serious setback to the further definition of the role of information theory within psychology.

The historian of psychology will undoubtedly note the evangelistic endeavors in the early 1950s to remake psychology in the image of information theory. He will also note the flickering of that evangelical spirit as the concepts became more and more absorbed into the fabric of psychology. It is this author’s guess that future historians will note that the development of information theory within psychology followed Garner’s lead in highlighting the structural, rather than the metric, features of information measurement.

Irwin Pollack

[*Other relevant material may be found in*Cybernetics; Mathematics; Models, Mathematical; Probability; Simulation; *and in the biography of*Wiener.]

## BIBLIOGRAPHY

Attneave, Fred 1959 *Applications of Information Theory to Psychology: A Summary of Basic Concepts, Methods, and Results.* New York: Holt.

Bricker, Peter D. 1955 The Identification of Redundant Stimulus Patterns. *Journal of Experimental Psychology* 49:73-81.

Broadbent, Donald E. 1958 *Perception and Communication.* Oxford: Pergamon.

Bross, Irwin D. J. 1966 Algebra and Illusion. *Science* 152:1330 only. → An interesting comment on the “fruitfulness” of applying formal models to science.

California, University of, Los Angeles, Western Data Processing Center 1961 *Contributions to Scientific Research in Management: The Proceedings of a Scientific Program.* Los Angeles: The University. → See especially the article by Jacob Marshak, “Remarks on the Economics of Information.”

Chapanis, Alphonse 1954 The Reconstruction of Abbreviated Printed Messages. *Journal of Experimental Psychology* 48:496-510.

Cherry, Colin (1957) 1961 *On Human Communication: A Review, a Survey, and a Criticism.* New York: Wiley. → A paperback edition was published in 1963.

Dahling, Randall L. 1962 Shannon’s Information Theory: The Spread of an Idea. Pages 119-139 in Stanford University, Institute for Communication Research, *Studies of Innovation and of Communication to the Public,* by Elihu Katz et al. Stanford, Calif.: The Institute.

Frick, F. C. 1959 Information Theory. Volume 2, pages 611-636 in Sigmund Koch (editor), *Psychology: A Study of a Science.* New York: McGraw-Hill.

Frick, F. C; and Sumby, W. H. 1952 Control Tower Language. *Journal of the Acoustical Society of America* 24:595-596.

Fritz, L.; and Grier, George W. jr. 1955 Pragmatic Communication: A Study of Information Flow in Air Traffic Control. Pages 232-243 in Henry Quastler (editor), *Information Theory in Psychology.* Glencoe, III.: Free Press.

Garner, Wendell R. 1962 *Uncertainty and Structure as Psychological Concepts.* New York: Wiley.

Garner, Wendell R.; and Hake, Harold W. 1951 The Amount of Information in Absolute Judgments. *Psychological Review* 58:446-459.

Garner, Wendell R.; and Mcgill, William J. 1956 The Relation Between Information and Variance Analyses. *Psychometrika* 21:219-228.

Gilbert, E. N. 1966 Information Theory After 18 Years. *Science* 152:320-326. → An overview from the point of view of the mathematical statistician.

Hake, Harold W.; and Garner, Wendell R. 1951 The Effect of Presenting Various Numbers of Discrete Steps on Scale Reading Accuracy. *Journal of Experimental Psychology* 42:358-366.

Hartley, R. V. L. 1928 Transmission of Information. *Bell System Technical Journal* 7:535-563.

Hick, W. E. 1952 On the Rate of Gain of Information. *Quarterly Journal of Experimental Psychology* 4:11-26.

Kullback, Solomon 1959 *Information Theory and Statistics.* New York: Wiley. → Considers a development of statistical theory along information lines.

Leonard, J. Alfred 1959 Tactical Choice Reactions: I. *Quarterly Journal of Experimental Psychology* 11: 76-83.

Luce, R. Duncan 1960 The Theory of Selective Information and Some of Its Behavioral Applications. Pages 1-119 in R. Duncan Luce (editor), *Developments in Mathematical Psychology.* Glencoe, III.: Free Press.

Mcgill, William J. 1954 Multivariate Information Transmission. *Psychometrika* 19:97-116.

Mcgill, William J. 1955 Isomorphism in Statistical Analysis. Pages 56-62 in Henry Quastler (editor), *Information Theory in Psychology.* Glencoe, III.: Free Press.

Miller, George A. 1951 *Language and Communication.* New York: McGraw-Hill. → A paperback edition was published in 1963.

Miller, George A. 1953 What Is Information Measurement? *American Psychologist* 8:3-li.

Miller, George A. 1956 The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. *Psychological Review* 63: 81-97.

Miller, George A.; and Frick, Frederick C. 1949 Statistical Behavioristics and Sequences of Responses. *Psychological Review* 56:311-324.

Miller, George A.; Heise, George A.; and Lichten, Wwilliam 1951 The Intelligibility of Speech as a Function of the Context of the Test Materials. *Journal of Experimental Psychology* 41:329-335.

Miller, James G. 1955 Toward a General Theory for the Behavioral Sciences. *American Psychologist* 10: 513-531.

Mowbray, G. H.; and Rhoades, M. V. 1959 On the Reduction of Choice Reaction Times With Practice. *Quarterly Journal of Experimental Psychology* 11:16-23.

Quastler, Henry (editor) 1955 *Information Theory in Psychology: Problems and Methods.* Glencoe, III.: Free Press.

Shannon, Claude E. (1948) 1962 The Mathematical Theory of Communication. Pages 3-91 in Claude E. Shannon and Warren Weaver, *The Mathematical Theory of Communication.* Urbana: Univ. of Illinois Press. → First published in Volume 27 of the *Bell System Technical Journal.*

Shannon, Claude E. 1951 Prediction and Entropy of Printed English. *Bell System Technical Journal* 30: 50-64.

Whitman, James R.; and Garner, Wendell R. 1962 Free Recall Learning of Visual Figures as a Function of Form of Internal Structure. *Journal of Experimental Psychology* 64:558-564.

Whitman, James R.; and Garner, Wendell R. 1963 Concept Learning as a Function of Form of Internal Structure. *Journal of Verbal Learning and Verbal Behavior* 2:195-202.

Wiener, Norbert (1948) 1962 *Cybernetics: Or, Control and Communication in the Animal and the Machine.* 2d ed. Cambridge, Mass.: M.I.T. Press.

## Information Theory

# Information Theory

"Information" is a term used universally in fields associated with computing technology. It is often loosely applied when no other term seems to be readily at hand; examples of this are terms such as "information technology," "information systems," and "information retrieval." It surprises most people when they discover that the term "information" actually has a very real meaning in an engineering context. It does not mean the same thing as "knowledge" or "data," but is instead intertwined with elements of communication systems theory.

When computing systems are connected together, it is necessary to consider how they might exchange data and work cooperatively. This introduces the notion that messages can be formulated by computing machines and be dispatched to other machines that receive them and then deal with their contents. All of the issues that are involved with these transmission and reception operations constitute what is known as "information theory."

A communication channel is a connective structure of some sort that supports the exchange of messages. Examples are wired interconnections such as **ethernets** or perhaps **fiber optic** cables, or even wireless communications such as microwave links. These are all paths over which digital information can be transmitted.

## Noise and Errors

Information theory has to do with how messages are sent via communication channels. When this field was first being studied, the common consensus was that it would be impossible to get digital machines to make exchanges in a way that was guaranteed to be error-free. This is because all the components used to construct computing machines are imperfect; they tend to distort the electrical signals they process as a side effect of their operation.

The components add extra electrical signals called "noise." In this instance, the term "noise" does not necessarily refer to something that can be heard. Instead, "noise" is used to describe the corruption of electrical signals, which makes them harder for devices in the computer system to understand correctly. This signal corruption might appear as extra voltage levels in the signal, or some signals may be completely missing.

Because communication channels inherently contain noise, exchanged messages are always being damaged in one way or another. When a particular message is dispatched from one machine to another, there is a chance that it might be distorted by imperfections in the channel and therefore not correctly interpreted by the recipient. Channel noise cannot be entirely eliminated. For this reason, early information theorists believed that it was a reality that messages transmitted digitally would not arrive at their destinations in exactly the way that the senders had sent them.

## Information Defined

This pessimistic outlook all changed in 1947 with the publication of Claude Shannon's seminal study of information theory. He proposed that even in the presence of noise (which it had been agreed was unavoidable), it was possible to ensure error-free transmission. This effectively heralded the era of a new field of computing science and engineering: that of information theory. "Information" was granted a precise definition. It was related to the inverse of the probability of the content of a message. For example, if a person was told in a message that "tomorrow, the sky will be blue," that person would conclude that there was not much in that message that he or she had not already expected. In other words, there was not much information in that message, because it essentially reaffirmed an expectation. There is not much information in that message, because the probability of the outcome is high. Conversely, if one were told in a message that "tomorrow, the sky will be green," then he or she would be greatly surprised. There is more information in this second message purely by virtue of the fact that the probability of this event is so much lower. The information pertaining to a particular event is inversely proportional to the logarithm of the probability of the event actually taking place.

Information log (1/p) where p is the probability of an event within the message.

Shannon's work led to a new field of engineering. Quantities such as the capacity of a channel to transmit information could be evaluated. This provided telecommunications specialists with a way of knowing just how many messages could be simultaneously transmitted over a channel without loss.

## Encoding

In addition to this, ways of representing, or encoding, information during transmission from one place to another were explored; some approaches were better than others. Encoding simply means that some pieces of information that are normally represented by particular symbols are converted to another collection of symbols that might better suit their reliable transfer. For example, text messages are often represented by collections of alphabetic characters when created and read, but they are then converted into another form, such as **ASCII** codes, for transmission over a communication channel. At the receiving end, the codes are converted back into text again.

The advantage these conversions offer is that some ways of representing information are more robust to the effects of noise in information channels than others, and perhaps more efficient, as well. So, the extra expense involved in carrying out these encoding and decoding operations is offset by the reliability they offer.

Information theory has become a mature field of engineering and computer science. It has enhanced the reliability of computer-based networks at all levels, from small **local area networks (LANs)** to the Internet, and it has done so in a way that is unobtrusive, so that users are unaware of its presence. In addition to this, information theory has also assisted in the development of techniques for encoding digital information and sending this over analog communication channels that were not designed for handling computer-based transmissions, such as the public telephone networks. It is important to remember that these contributions of information theory to modern computing began with the ability to define information mathematically, and the work Claude Shannon did to understand communication channels and encoding schemes.

**see also** Cybernetics; Networks; Shannon, Claude E.

*Stephen* *Murray*

### Bibliography

Lathi, Bhagwandas P. *Modern Digital and Analog Communication Systems,* 2nd ed. Orlando, FL: Holt, Rinehart and Winston, 1989.

Proakis, John G. *Digital Communications,* 3rd ed. New York: McGraw-Hill, 1995.

Shanmugam, K. Sam. *Digital and Analog Communication Systems.* New York: John Wiley & Sons, 1985.

Sklar, Bernard. *Digital Communications, Fundamentals and Applications.* Englewood Cliffs, NJ: Prentice Hall, 1988.

## Information Theory

# INFORMATION THEORY

Information theory posits that information is simply data that reduces the level of uncertainty on a given subject or problem, and can be quantified by the extent to which uncertainty is diminished. More importantly for the practical uses of information theory, however, is that it fits the concept of information and communication into mathematical theory. All content, no matter what its form-music, text, video-can be reduced to a simple string of ones and zeros, thereby allowing tremendous flexibility in the mode of interpretation of that information. The application of information theory has had a tremendous impact on telecommunications and information technology and, by implication, the Internet, since it deals expressly with information-carrying capacities.

## CLAUDE SHANNON

Information theory is the product of the renowned scientist Claude Shannon, widely acknowledged as one of the most innovative thinkers of his day. Born in 1916 in Petoskey, Michigan, Shannon grew up in an era when telecommunications were primarily limited to the telegraph and the telephone. From an early age, Shannon displayed an affinity for electronic equipment and radios and a penchant for devising his own inventions, much in the spirit of his hero and distant relative Thomas Edison.

Shannon attended the University of Michigan and later the Massachusetts Institute of Technology (MIT), studying electrical engineering and mathematics, in which he excelled. After college, he went to work for Bell Telephone Laboratories, where he worked on cryptographic systems using early computers. In 1948, on the strength of his work and research, Shannon published his "A Mathematical Theory of Communication," a breakthrough paper that for the first time demonstrated that all information exchanges could be expressed digitally in terms of ones and zeros, based on mathematical reductions.

Shannon redefined the traditional concept of entropy to mean, in the realm of information theory, the amount of uncertainty in a given system. Information was simply anything that reduced the level of uncertainty, and hence the degree of entropy. To measure the amount of information, Shannon devised a mathematical theory in which capacities could be expressed in terms of bits per second. In fact, many historians of science insist Shannon was the first to employ the term "bit," which is shorthand for "binary digit." All information, Shannon claimed, could ultimately be understood as a string of bits, and could therefore be stored and transmitted as such. Shannon also developed theories on the practical transmission of digital information. He surmised that when information is sent over "noisy" or compromised channels, simply adding redundant bits to the message can smooth out and correct the corruption in the information.

## INFORMATION THEORY TODAY

As the foundation upon which modern telecommunications systems, technologies, and theories are built, information theory was of central importance to the Internet era; it was ultimately responsible for most of the revolutionary breakthroughs in digital communication and information storage. Compact discs and digital television, not to mention the Internet, are everyday items that owe their existence to information theory. Information theory holds that all channels of information transmission and storage can also be expressed and analyzed in terms of bits, thereby providing the link that allowed for perfecting physical methods of information transmission, including how to send highly encoded Internet signals over simple telephone wires.

In the Internet world, information theory proved tremendously important not only for the basics of Internet telecommunications but also for cryptography, another field in which Shannon worked. Cryptography, in the contemporary sense, refers to protecting electronic information from compromise by applying mathematical algorithms consisting of a series of bits that scrambles the information and later decodes it when necessary. Cryptography was a key component of the development of e-commerce, since it lay at the heart of privacy and transaction protection.

Shannon's theory proved one of the great intellectual breakthroughs of the 20th century, as it gave scientists a new way to consider information and provided the basic framework within which all digital communications technology would take shape. In addition to its role as the bedrock of modern telecommunications, information theory also washed over fields as disparate as biology, ecology, medicine, mathematics, psychology, linguistics, and even investment theory.

### FURTHER READING:

"Claude Shannon." *The Times* (London), March 12, 2001.

Golomb, Solomon W. "Retrospective: Claude E. Shannon (1916-2001)." *Science,* April 20, 2001.

Robinson Pierce, John. *An Introduction to Information Theory: Symbols, Signals and Noise.* 2nd ed. Mineola, NY: Dover Publications, 1980.

*SEE ALSO:* Cryptography; Encryption

## information theory

information theory or communication theory, mathematical theory formulated principally by the American scientist Claude E. Shannon to explain aspects and problems of information and communication. While the theory is not specific in all respects, it proves the existence of optimum coding schemes without showing how to find them. For example, it succeeds remarkably in outlining the engineering requirements of communication systems and the limitations of such systems.

In information theory, the term *information* is used in a special sense; it is a measure of the freedom of choice with which a message is selected from the set of all possible messages. Information is thus distinct from meaning, since it is entirely possible for a string of nonsense words and a meaningful sentence to be equivalent with respect to information content.

**Measurement of Information Content**

Numerically, information is measured in bits (short for *binary digit*; see binary system). One bit is equivalent to the choice between two equally likely choices. For example, if we know that a coin is to be tossed but are unable to see it as it falls, a message telling whether the coin came up heads or tails gives us one bit of information. When there are several equally likely choices, the number of bits is equal to the logarithm of the number of choices taken to the base two. For example, if a message specifies one of sixteen equally likely choices, it is said to contain four bits of information. When the various choices are not equally probable, the situation is more complex.

Interestingly, the mathematical expression for information content closely resembles the expression for entropy in thermodynamics. The greater the information in a message, the lower its randomness, or
"noisiness,"
and hence the smaller its entropy. Since the information content is, in general, associated with a source that generates messages, it is often called the entropy of the source. Often, because of constraints such as grammar, a source does not use its full range of choice. A source that uses just 70% of its freedom of choice would be said to have a relative entropy of 0.7. The redundancy of such a source is defined as 100% minus the relative entropy, or, in this case, 30%. The redundancy of English is estimated to be about 50%; i.e., about half of the elements used in writing or speaking are freely chosen, and the rest are required by the structure of the language.

**Analysis of the Transfer of Messages through Channels**

A message proceeds along a channel from the source to the receiver; information theory defines for any given channel a limiting capacity or rate at which it can carry information, expressed in bits per second. In general, it is necessary to process, or encode, information from a source before transmitting it through a given channel. For example, a human voice must be encoded before it can be transmitted by telephone. An important theorem of information theory states that if a source with a given entropy feeds information to a channel with a given capacity, and if the source entropy is less than the channel capacity, a code exists for which the frequency of errors may be reduced as low as desired. If the channel capacity is less than the source entropy, no such code exists.

The theory further shows that noise, or random disturbance of the channel, creates uncertainty as to the correspondence between the received signal and the transmitted signal. The average uncertainty in the message when the signal is known is called the equivocation. It is shown that the net effect of noise is to reduce the information capacity of the channel. However, redundancy in a message, as distinguished from redundancy in a source, makes it more likely that the message can be reconstructed at the receiver without error. For example, if something is already known as a certainty, then all messages about it give no information and are 100% redundant, and the information is thus immune to any disturbances of the channel. Using various mathematical means, Shannon was able to define channel capacity for continuous signals, such as music and speech.**Bibliography**

See C. E. Shannon and W. Weaver, *The Mathematical Theory of Communication* (1949); M. Mansuripur, *Introduction to Information Theory* (1987); J. Gleick, *The Information: A History, a Theory, a Flood* (2011).

## Information Theory

# Information Theory

The version of information theory formulated by mathematician and engineer Claude Shannon (1916–2001) addresses the processes involved in the transmission of digitized data down a communication channel. Once a set of data has been encoded into binary strings, these strings are converted into electronic pulses, each of equal length, typically with 0 represented by zero volts and 1 by + 5 volts. Thus, a string such as 0100110 would be transmitted as seven pulses:

It is clear from the example that the lengths of pulses must be fixed in order to distinguish between 1 and 11. In practice, the diagram represents an idealized state. Electronic pulses are not perfectly discrete, and neither are the lengths of pulses absolutely precise. The electronic circuits that generate these signals are based upon analogue processes that do not operate perfectly, and each pulse will consist of millions of electrons emitted and controlled by transistors and other components that only operate within certain tolerances. As a result, in addition to the information sent intentionally down a channel, it is necessary to cater for the presence of error in the signal; such error is called *noise.*

This example illustrates the dangers inherent in the differences between the way one represents a process in a conceptual system and the underlying physical processes that deliver it. To conceive of computers as if they operate with perfectly clear 0 and 1 circuits is to overlook the elaborate and extensive error-checking necessary to ensure that data are not transmitted incorrectly, which is expensive both in time and cost.

In 1948, Shannon published what came to be the defining paper of communication theory. In this paper he investigated how noise imposes a fundamental limit on the rate at which data can be transmitted down a channel. Early in his paper he wrote:

The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have

meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem. (p.379)

The irrelevance of meaning to communication is precisely the point that encoding and the transmission of information are not intrinsically connected. Shannon realized that if one wishes to transmit the binary sequence 0100110 down a channel, it is irrelevant what it means, not least because different encodings can make it mean almost anything. What matters is that what one intends to transmit—as a binary string—should arrive "exactly or approximately" at the other end as that same binary string. The assumption is that the encoding process that produces the binary string and the decoding process that regenerates the original message are known both to the transmitter and the receiver. Communication theory addresses the problems of ensuring that what is received is what was transmitted, to a good approximation.

*See also* Information; Information Technology

*Bibliography*

shannon, claude e. "a mathematical theory of communication." the bell system technical journal 27 (1948): 379–423, 623–656.

john c. puddefoot

## information theory

**information theory** The study of information by mathematical methods. Informally, information can be considered as the extent to which a message conveys what was previously unknown, and so is new or surprising. Mathematically, the rate at which information is conveyed from a source is identified with the entropy of the source (per second or per symbol). Although information theory is sometimes restricted to the entropy formulation of sources and channels, it may include coding theory, in which case the term is used synonymously with communication theory.