Mathematical Learning Theory
Theories of learning were enormously visible and influential in psychology from 1930 to 1950, but dropped precipitously from view during the next two decades while the information-processing approach to cognition based on a computer metaphor gained ascendancy. The groundwork for a resurgence of learning theory in the context of cognitive psychology was laid during the earlier period by the development of a subspecialty that may be termed mathematical learning theory.
In psychology, just as in physical and biological sciences, mathematical theories should be expected to aid in the analysis of complex systems whose behavior depends on many interacting factors and to bring out causal relationships that would not be apparent to unaided empirical observation. Mathematical reasoning was part and parcel of theory construction from the beginnings of experimental psychology in some research areas, notably the measurement of sensation, but came on the scene much later in the study of learning, and then got off to an inauspicious start. By the 1930s, laboratory investigation of conditioning and learning at both the animal and the human level had accumulated ample quantitative data to invite mathematical analysis. However, the early efforts were confined to the routine exercise of finding equations that could provide compact descriptions of the average course of improvement with practice in simple laboratory tasks (Gulliksen, 1934). Unfortunately, these efforts yielded no harvest of new insights into the nature of learning and did not enter into the design of research.
Clark L. Hull and Habit Strength
The agenda for a new approach that might more nearly justify the label mathematical learning theory was set by Clark L. Hull, who distilled much of the learning theory extant at the end of the 1930s into a system of axioms from which theorems might be derived to predict aspects of conditioning and simple learning in a wide variety of experimental situations (Hull, 1943). One of the central ideas was a hypothetical measure of the degree of learning of any stimulus-response relationship. This measure, termed habit strength (H), was assumed to grow during learning of the association between any stimulus and response according to the quantitative law
$$H = M\,(1 - e^{-cN}) \tag{1}$$
where N denotes number of learning experiences, M is the maximum value of H, and c is a constant representing speed of learning. Thus, H should increase from an initial value of zero to its maximum as a smooth "diminishing returns" function of trials. However, knowing the value of H does not enable prediction of what the learner will do at any point in practice, for two reasons. First, a test for learning may be given in a situation different from that in which practice occurred (as when a student studies at home but is tested in a classroom); according to Hull's principle of stimulus generalization, the effective habit strength on the test is assumed to be some reduced value H' that differs from H as a function of the difference between the two situations. Second, motivation must be taken into account. If the student in the example studies in a relaxed mood but arrives at the test situation in a state of anxiety, performance may be affected and fail to mirror the actual level of learning. In Hull's theory, it is assumed that the strength of the tendency to perform a learned act, termed excitatory potential (E), is determined by habit strength and level of motivation or emotion (denoted D) acting jointly according to the multiplicative relation
$$E = H \times D. \tag{2}$$
These conceptions were implemented in vigorous research programs mounted by Kenneth W. Spence, one analyzing the relation between anxiety and learning in human beings and the other producing the first quantitative accounts of the ability of some animals to exhibit transposition (i.e., to generalize from learning experiences in terms of relationships such as brighter than or larger than, as distinguished from absolute values of stimulus magnitude).
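The two relations in this section, the negatively accelerated growth of habit strength and its multiplication by drive, are simple enough to evaluate directly. The following is a minimal sketch rather than a rendering of Hull's full system: the function names and the particular values of M, c, and D are illustrative assumptions, and stimulus generalization (the reduction of H to H') is left out.

```python
import math

def habit_strength(n_trials, max_strength=1.0, rate=0.1):
    """Equation 1: H = M * (1 - exp(-c * N)).

    max_strength (M) and rate (c) are illustrative values, not estimates
    taken from any experiment.
    """
    return max_strength * (1.0 - math.exp(-rate * n_trials))

def excitatory_potential(habit, drive):
    """Equation 2: E = H * D, habit strength multiplied by drive."""
    return habit * drive

# Habit strength rises quickly at first and then levels off near M;
# the same H yields a different E under a different drive level D.
for n in (0, 5, 10, 20, 40):
    h = habit_strength(n)
    print(n, round(h, 3), round(excitatory_potential(h, drive=2.0), 3))
```

Because E is a product, a learner with high habit strength but zero drive is predicted to show no tendency to respond, which is the sense in which performance can fail to mirror learning.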
The Law of Habit Growth
Hull's theory as a whole did not prove viable. It depended on too many assumptions and was too loose-jointed in structure to be readily testable, and its deterministic axioms lacked the flexibility needed to interpret varieties of learning more complex than simple conditioning. However, some components have survived to reappear as constituents of new theories. A notable instance is the law of habit growth. An implication of Equation 1 is that the change in habit strength, δH, on any trial can be expressed as
$$\delta H = c\,(M - H) \tag{3}$$
in which form it is apparent that learning is fastest when the difference between current habit strength and its maximum is large, and slow when the difference is small. According to Hull's assumptions, and indeed nearly all learning theories of his time, the function should be applicable on any trial when the response undergoing learning occurs in temporal contiguity to reinforcement (that is, a positive incentive such as food for a hungry animal or a to-be-feared event such as an electric shock). In the course of research over the next two decades, it became apparent that learning depends also on the degree to which an experience provides the learner with new information. If, for example, an animal has already learned that a particular stimulus, S1, predicts the occurrence of shock, then on a trial when S1 and another stimulus, S2, precede shock in combination, the animal may learn nothing about S2 (Kamin, 1969). This observation and many related ones can be explained by a new theory, which can be viewed as a refinement and extension of one component of Hull's axiom about habit growth.
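The step from the growth law (Equation 1) to the per-trial increment (Equation 3) can be made explicit. The short derivation below is a sketch that introduces the subscript notation H_N for habit strength after N trials, which is not used in the text. It shows that the increment is exactly proportional to M - H, with a constant that is close to c when c is small.

```latex
\begin{align*}
H_{N+1} - H_N &= M\left(1 - e^{-c(N+1)}\right) - M\left(1 - e^{-cN}\right) \\
              &= M e^{-cN}\left(1 - e^{-c}\right) \\
              &= \left(1 - e^{-c}\right)\left(M - H_N\right)
               \;\approx\; c\,\left(M - H_N\right) \quad \text{for small } c .
\end{align*}
```

The third line uses the fact that M - H_N = M e^{-cN}, which follows directly from Equation 1.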
In the new version, the change in habit strength on any learning trial is described by the function
$$\delta H = c\,(M - H_T) \tag{4}$$
where H denotes strength of a particular habit undergoing acquisition (for example, the relation between S2 and shock in the illustration) and H_T denotes the sum of strengths of all habits active in the situation (for example, the habits relating both S1 and S2 to shock in the illustration) (Rescorla and Wagner, 1972). Studying the new learning function, one can see that if habit strength for S1 is large enough so that H_T is equal to M, then there will be no increment in habit strength for S2. The prediction that learning about S2 is blocked by the previous learning about S1 has been borne out by many experiments. This success, and others like it, in dealing with novel and sometimes surprising findings has put the body of theory built by Rescorla and Wagner and others on the groundwork of Hull's system into a commanding position in present-day research on animal learning.
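The blocking prediction can be checked numerically by applying Equation 4 trial by trial. The sketch below assumes a single learning rate c shared by all stimuli, as in Equation 4; the function names, the trial counts, and the parameter values are illustrative assumptions, and the stimulus-specific rate parameters of the published model are omitted.

```python
def rescorla_wagner_trial(strengths, present, asymptote, rate):
    """Apply Equation 4 once: each stimulus present gains rate * (M - H_T)."""
    total = sum(strengths[cue] for cue in present)  # H_T, summed over present cues
    delta = rate * (asymptote - total)
    for cue in present:
        strengths[cue] += delta
    return strengths

def run_blocking(pretrain_trials=50, compound_trials=50, rate=0.2, asymptote=1.0):
    """Phase 1: S1 alone precedes shock. Phase 2: S1 and S2 together precede shock."""
    strengths = {"S1": 0.0, "S2": 0.0}
    for _ in range(pretrain_trials):
        rescorla_wagner_trial(strengths, ["S1"], asymptote, rate)
    for _ in range(compound_trials):
        rescorla_wagner_trial(strengths, ["S1", "S2"], asymptote, rate)
    return strengths

blocked = run_blocking()                   # S1 pretrained before the compound
control = run_blocking(pretrain_trials=0)  # compound training only
print("S2 after pretraining on S1:", round(blocked["S2"], 4))
print("S2 without pretraining:   ", round(control["S2"], 4))
```

In the pretrained condition H_T is already near M when S2 is introduced, so S2 gains almost nothing. In the control condition the two stimuli share the available strength equally.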
At one time it appeared that the basic ingredients of Hull's theory could be adapted in a straightforward and simple way to apply to human learning. In the early 1950s, Robert R. Bush, a physicist converted to mathematical psychology, in collaboration with the eminent statistician Frederick Mosteller, arrived at the idea that the hypothetical notion of habit strength could be dispensed with so that learning functions like Equation 3 would be expressed in terms of changes in response probabilities (Bush and Mosteller, 1955). If, for example, a rat was learning to go to the correct (rewarded) side of a simple T maze, they assumed that when a choice of one side led to reward, the probability, p, of that choice would increase in accord with the function
$$\delta p = c\,(1 - p) \tag{5}$$
and when it led to nonreward, p would decrease in accord with
$$\delta p = -c\,p \tag{6}$$
where c is a constant reflecting speed of learning. By a simple derivation it can be shown that if reward on one side of the maze has some fixed probability (0, 1, or some intermediate value), Equations 5 and 6, taken together, imply a learning function of the same form as Hull's function for growth of habit strength. Also, the prediction can be derived that the final level of the learning function will be such that, on the average, the probability of choosing a given side is equal to the probability of reward on that side, as illustrated in Figure 1. This prediction of probability matching has been confirmed (under some especially simplified experimental conditions) in a number of studies. In an elegant series of elaborations and applications of this simple model, Bush and Mosteller (1955) showed that it could be carried over to human learning in situations requiring repeated choices between alternatives associated with different magnitudes and probabilities of reward, as in some forms of gambling.
However, only a few such applications appeared, and in time it became apparent that, although Bush and Mosteller's methods proved widely useful, neither their approach nor others building on Hull's system could be extended to account for many forms of human learning.
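The probability-matching prediction of the linear model can be illustrated with a small simulation. The sketch below rests on an assumption not spelled out in the text: exactly one side of the maze is correct on each trial, side A with a fixed probability. Under that assumption Equations 5 and 6 move p, the probability of choosing A, upward whenever A is correct and downward otherwise, whichever side was actually chosen. The function names and parameter values are hypothetical.

```python
import random

def update(p, a_is_correct, rate):
    """Equations 5 and 6, written for p = probability of choosing side A.

    If A is correct, choosing A is rewarded (p rises by Equation 5) and
    choosing B is not rewarded (B's probability falls by Equation 6, which
    raises p by the same amount). The reverse holds when B is correct.
    """
    if a_is_correct:
        return p + rate * (1.0 - p)
    return p - rate * p

def simulate_learner(reward_prob_a, rate, n_trials, rng, p0=0.5):
    """Final choice probability for one simulated learner."""
    p = p0
    for _ in range(n_trials):
        p = update(p, rng.random() < reward_prob_a, rate)
    return p

def expected_curve(reward_prob_a, rate, n_trials, p0=0.5):
    """Mean learning curve: the average change per trial is rate * (pi - p)."""
    curve = [p0]
    for _ in range(n_trials):
        curve.append(curve[-1] + rate * (reward_prob_a - curve[-1]))
    return curve

rng = random.Random(0)  # fixed seed so the run is repeatable
finals = [simulate_learner(0.7, 0.1, 200, rng) for _ in range(2000)]
mean_final = sum(finals) / len(finals)
print("mean final choice probability:", round(mean_final, 3))
print("expected asymptote:", round(expected_curve(0.7, 0.1, 200)[-1], 3))
```

The mean curve has the same form as Equation 3 with the reward probability in place of M, which is why the average choice probability settles at the reward probability.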
Statistical Learning Theory
The limitations of models based on conceptions of habit growth or similarly gradual changes in response probabilities were anticipated by the great pioneer of modern neuroscience, D. O. Hebb. In a seminal monograph, Hebb (1949) noted that, whereas slow and gradual learning characterizes animals low on the phylogenetic scale and immature organisms higher on the scale, learning in adults of the higher forms, most conspicuously human beings, is normally much faster and often takes the form of abrupt reorganizations of the products of previous learning (as when a person trying to master a technique of computer programming suddenly grasps a key relationship).
A theoretical approach that would prove more adaptable to the treatment of human learning appeared in the 1950s, contemporaneously with the work of Bush and Mosteller, in the development of statistical learning theory, also known as stimulus sampling theory (Estes, 1950). This approach is based on the ideas that even apparently simple forms of learning entail the formation of associations among representations in memory of many aspects of the situation confronting the learner, and that more complex forms are largely a matter of classifying or reclassifying already existing associations (now more often termed memory elements). In the earliest versions of the theory, the associations simply related stimuli and responses. If, say, a rat was learning the correct turn in a T maze, then on any trial associations would be formed between the response made, together with the ensuing reward or nonreward, and whatever aspects of the choice situation the animal happened to attend to (the notion of sampling).
An interesting mathematical property of this model is that for simple learning situations, if the total population of stimulus aspects available for sampling is large, the basic learning functions take a form identical to those of Bush and Mosteller's model (Equations 5 and 6), with the learning-rate parameter c interpreted as the proportion of stimulus aspects sampled on any trial. When the population is not large, however, the model is better treated as a type of probabilistic process (technically, a Markov chain) in which learning is conceived in terms of discrete transitions between states. A state is defined by the way the total set of memory elements is classified with respect to alternative responses. The model is thus able to handle the fact that learning is under some conditions slow and gradual but under other conditions fast and discontinuous. In a special case of the model, it is assumed that the learner perceives the entire stimulus pattern present on a trial as a unit, so that the system has only two states, one in which the unit is associated with the correct response and one in which it is not. Surprisingly at the time, this ultra-simplified model was found to account in detail for the data of experiments on the learning of elementary associations (as between faces and names) and concept identification, even for human learners (Bower, 1961).
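The two-state special case can be sketched as a simple simulation. The text specifies only that the system has two states; the sketch below adds the usual assumptions for a model of this kind, namely that an unlearned item moves to the learned state with a fixed probability on each trial and that responses in the unlearned state are guesses. The names learn_prob and guess_prob, and their values, are illustrative.

```python
import random

def simulate_item(learn_prob, guess_prob, n_trials, rng):
    """Responses (True = correct) to one item across trials."""
    learned = False
    outcomes = []
    for _ in range(n_trials):
        # In the learned state the response is always correct;
        # in the unlearned state the learner guesses.
        outcomes.append(learned or rng.random() < guess_prob)
        # After the response, an unlearned item may jump to the learned state.
        if not learned and rng.random() < learn_prob:
            learned = True
    return outcomes

def predicted_error_rate(trial, learn_prob, guess_prob):
    """Error probability on a 1-indexed trial: (1 - g) * (1 - c) ** (trial - 1)."""
    return (1.0 - guess_prob) * (1.0 - learn_prob) ** (trial - 1)

rng = random.Random(1)  # fixed seed so the run is repeatable
items = [simulate_item(0.3, 0.5, 10, rng) for _ in range(5000)]
observed = sum(1 for item in items if not item[2]) / len(items)  # errors on trial 3
print("observed error rate, trial 3: ", round(observed, 3))
print("predicted error rate, trial 3:", round(predicted_error_rate(3, 0.3, 0.5), 3))
```

Each simulated item is learned in a single jump, yet the error rate averaged over items declines smoothly, which shows how discontinuous learning in individual items can underlie a gradual group curve.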
The stimulus-sampling model automatically accounts for the phenomenon of generalization, which required a special postulate in Hull's system. If learning has occurred in one situation, the probability that the learned response will transfer to a new situation is equal to the proportion of aspects of the second situation that were sampled in the first. The sampling model has the serious limitation that it cannot explain how perfect discriminations can be learned between situations that have aspects in common. However, this deficiency was remedied in a new version formulated by Douglas L. Medin and his associates especially for the interpretation of discrimination and categorization (Medin and Schaffer, 1978). With a new rule for computing similarity between stimuli or situations, this formulation, generally known as the exemplar-memory model, has been shown to account for many properties of human category learning in simulated medical diagnosis and the like.
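The overlap rule for generalization, and the discrimination problem it creates, can be shown with sets of stimulus aspects. This is a minimal sketch: the aspect labels, the function names, and the assumption that each aspect is connected to exactly one response at a time are illustrative, and the similarity rule of the exemplar-memory model is not implemented here.

```python
def transfer_probability(trained_aspects, test_aspects):
    """Proportion of the test situation's aspects that were sampled in training."""
    test_aspects = set(test_aspects)
    if not test_aspects:
        raise ValueError("test situation must contain at least one aspect")
    shared = test_aspects & set(trained_aspects)
    return len(shared) / len(test_aspects)

def response_probability(situation, connections, response):
    """Proportion of a situation's aspects currently connected to `response`."""
    situation = set(situation)
    hits = sum(1 for aspect in situation if connections.get(aspect) == response)
    return hits / len(situation)

# Generalization: half of the new situation's aspects were present in training.
trained = {"a", "b", "c", "d"}
new_situation = {"c", "d", "e", "f"}
print(transfer_probability(trained, new_situation))  # 0.5

# Discrimination: the shared aspect can be connected to only one response,
# so the two situations cannot both reach probability 1.
situation_x = {"x1", "x2", "shared"}
situation_y = {"y1", "y2", "shared"}
connections = {"x1": "RX", "x2": "RX", "shared": "RX", "y1": "RY", "y2": "RY"}
p_x = response_probability(situation_x, connections, "RX")  # 3 of 3 aspects
p_y = response_probability(situation_y, connections, "RY")  # 2 of 3 aspects
print(p_x, round(p_y, 3))
```

Whichever response the shared aspect is connected to, one of the two situations is left with an imperfect response probability, which is the limitation noted above.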
An interesting aspect of science is the way progress often occurs by seeing older theories in a new light. A striking instance is the observation that Rescorla and Wagner's extension of Hull's learning theory can be viewed as a special case of a type of adaptive network model that has been imported from engineering into the new branch of cognitive science known as connectionism (Rumelhart and McClelland, 1986). A connectionist network is composed of layers of elementary units (nodes) interconnected by pathways. When the network receives an input at the bottom layer, as from the stimulus display in a learning situation, activation is transmitted through the network to output nodes at the top layer that lead to response mechanisms. During learning, memories are stored not in individual units but as distributions of strengths of connections in groups of units. These strengths are modified during learning by a process that is driven by an overall tendency toward error correction. Connectionist models have been developed and applied for the most part in relation to complex cognitive activities such as speech perception and language acquisition (McClelland and Rumelhart, 1986), but it has also been discovered that the descendant of Hull's learning theory embodied in the Rescorla-Wagner model can be viewed as a special case of a connectionist network (Gluck and Bower, 1988). The consequence of this observation has been a wave of new work applying and extending what was originally conceived as a model of animal conditioning to a variety of work on human learning and categorization. Thus, by the recurring process of rediscovery and reinterpretation, mathematical learning theory has become active in a new guise as a component of the new discipline of cognitive science.
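The correspondence between the Rescorla-Wagner function and an error-correcting network can be seen in the smallest possible case: one output node receiving weighted connections from input nodes that are on (1) or off (0). The sketch below uses the delta rule, a standard error-correction rule, as an assumed stand-in for the learning process; the function names and parameter values are illustrative and are not taken from the cited models.

```python
def network_output(weights, inputs):
    """Activation of the single output node: the weighted sum of the inputs."""
    return sum(w * x for w, x in zip(weights, inputs))

def delta_rule_update(weights, inputs, target, rate):
    """Change each weight by rate * (target - output) * input.

    With 0/1 inputs, the output is the summed strength of the stimuli present
    (H_T in Equation 4) and the target plays the role of M, so every active
    connection changes by rate * (M - H_T). Inactive connections are unchanged.
    """
    error = target - network_output(weights, inputs)
    return [w + rate * error * x for w, x in zip(weights, inputs)]

# Two stimuli always presented together and followed by the outcome.
weights = [0.0, 0.0]
for _ in range(100):
    weights = delta_rule_update(weights, [1, 1], target=1.0, rate=0.2)
print([round(w, 3) for w in weights])
```

Learning stops when the summed output matches the target, not when each connection separately reaches the maximum, which is the error-correction property that the network and the conditioning model share.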
References
Bower, G. H. (1961). Application of a model to paired-associate learning. Psychometrika 26, 255-280.
Bush, R. R., and Mosteller, F. (1955). Stochastic models for learning. New York: Wiley.
Estes, W. K. (1950). Toward a statistical theory of learning. Psychological Review 57, 94-107.
Gluck, M. A., and Bower, G. H. (1988). From conditioning to category learning: An adaptive network model. Journal of Experimental Psychology: General 117, 225-244.
Gulliksen, H. A. (1934). A rational equation of the learning curve based on Thorndike's law of effect. Journal of General Psychology 11, 395-434.
Hebb, D. O. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hull, C. L. (1943). Principles of behavior. New York: Appleton.
Kamin, L. J. (1969). Predictability, surprise, attention, and conditioning. In B. A. Campbell and R. M. Church, eds., Punishment and aversive behavior. New York: Appleton-Century-Crofts.
McClelland, J. L., and Rumelhart, D. E. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 2. Cambridge, MA: MIT Press.
Medin, D. L., and Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review 85, 207-238.
Rescorla, R. A., and Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black and W. F. Prokasy, eds., Classical conditioning, Vol. 2: Current research and theory. New York: Appleton-Century-Crofts.
Rumelhart, D. E., and McClelland, J. L. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Cambridge, MA: MIT Press.