Learning algorithms are sets of rules, usually expressed using mathematical equations or computer instructions, that enable a system to improve its performance on the basis of its own experience. Also called learning procedures, methods, or rules, learning algorithms are key components of mathematical models of animal learning and of technological devices engineered to improve their behavior as they operate. Learning algorithms have been studied extensively in artificial intelligence, psychology, computational neuroscience, statistics, and engineering. They are used in many applications in science, industry, and finance that require fitting models to or finding patterns in collections of data or that require learning skills needed to solve specific tasks. Learning algorithms play major roles in artificial neural network systems, where they provide rules for adjusting the strengths, or weights, of connections between network elements. These connections correspond to synapses by which neurons communicate with other neurons and with sensory and motor mechanisms.
In 1949 the psychologist Donald Hebb hypothesized that when a neuron, A, repeatedly and persistently takes part in firing another neuron, B, the synapses by which A stimulates B are strengthened. It then becomes easier for neuron A to fire neuron B. Consequently, any event that stimulates neuron A may also stimulate neuron B and thus become associated with the consequences of B's firing as well as with other events that stimulate B. This is one of the earliest and most influential ideas contributing to neural-network learning algorithms.
Turning this hypothesis into a complete learning algorithm requires precise definitions of the variables involved and how they interact. One of the simplest ways to do this is to consider the situation in which neuron B receives input from other neurons, labeled 1, 2, …, n, any of which can play the role of neuron A in Hebb's hypothesis. Let XB(t) be a positive number representing the activity level (the instantaneous rate of firing) of neuron B at time t. Similarly, for i = 1, 2, …, n, let Xi(t) represent the activity levels at time t of neurons 1, 2, …, n (see Figure l). In representing how the activity of neuron B depends on the activities of the other neurons, the simplest assumption is that XB(t) is a weighted sum of the activities of the other neurons, where each weight represents the current strength of a synapse. If Wi(t) is the strength at time t of the synapse by which neuron i influences neuron B, this means that XB(t) = W1(t)X1(t) + … + Wn(t)Xn(t). (1) Hebb's hypothesis suggests how the synaptic strengths change over time, depending on the other variables. The extent to which neuron i takes part in firing neuron B at a time t is often represented by the product of the activity levels of neuron B and neuron i: XB(t) Xi(t). This product is large when the activity levels of both neurons are high, and it is small when the activity level of one or both neurons is close to zero. Thus, if neuron i persistently takes part in firing neuron B, this product will be large for many times t. This leads to a rule for changing synaptic strengths that can be written for each synapse i as: Wi(t + δ t) = Wi(t) + cXB(t)Xi(t), (2) where δ t is a small time increment and c is a small positive number. Selecting values for δ t and c determines how rapidly the synaptic strengths change. According to Equation 2, at any time, t, when the activity levels of neurons i and B are both greater than zero, XB (t)Xi(t) is greater than zero, and the synaptic strength Wi(t) increases from time t to time t + δ t. According to Equation 1, this larger synaptic weight means that neuron i's activity will contribute more to the activity of neuron B in the future. Equations 1 and 2 constitute a learning algorithm; both are needed to compute the changes in the synaptic strengths when the values Xi (t), i = l, 2, …, n, are given for each time t. This is the simplest of many learning algorithms based on Hebb's hypothesis, and, despite many shortcomings, it has played an important role in theories of learning in neural networks. Brown et al. (1990) discuss this and many other Hebb-inspired learning algorithms and how well they model synaptic properties actually observed in regions of the brain such as the hippocampus.
Other learning algorithms are motivated by a desire to solve problems that occur frequently in a variety of fields. One type of problem that has received much attention, especially in the field of statistics, is that of function fitting. Suppose you observe the inputs and outputs of some system under study and assemble a training set of observations (Xi, Zi), i = 1, 2, …, m, where Zi is the system's output for input Xi. (The superscripts are used to avoid confusion with the neuron-activity levels used above.) On the basis of these data, you would like to predict what the system's output will be for new inputs. For example, the system might be a medical treatment, with inputs giving features of patients (such as the results of clinical tests) and outputs describing treatment outcome (which can be a quantitative measure or a nonnumeric, or categorical, description, such as recovery/no-recovery). You would like to predict the outcomes of treating new patients.
In statistics, this is known as a regression or a classification problem (for the respective cases of quantitative and categorical outcomes). Solving it requires finding a rule (mathematically, a function) that best fits the data in the training set (where "best" is precisely defined in some way). The resulting rule is then used to provide the desired predictions. Many methods for solving these problems are formulated as learning algorithms. The training set represents the experience of the learning system, and the rule being learned improves in its ability to provide valid predictions as the data is processed. Learning researchers think of the training set as a set of examples provided by a "teacher," and they call this process learning by example, learning with a teacher, or, most often, supervised learning. Many learning algorithms have been devised for supervised learning problems, and modern research on the theoretical properties of these algorithms is strongly tied to the field of statistics. Hastie et al. (2001) is a good reference for the statistical view of supervised learning algorithms.
Least Mean Square Algorithm
The least mean square (LMS) algorithm is an influential supervised learning algorithm proposed by the electrical engineers Bernard Widrow and Marcian Hoff in 1960. Like the Hebbian algorithm described above, it can be expressed as a rule for changing the synaptic weights of a neuronlike element, but it requires another variable to provide training input to the element. This signal tells the element what its activity should be; it is said to provide the desired outputs or target outputs. During learning, input signals and target outputs are selected from a training set of example input/output pairs. If Z(t) denotes the target output at time t for an element whose actual output is X(t), then Z(t) - X(t) gives the error in the element's output: the discrepancy between what the element's activity should be and what it actually is. The LMS algorithm is an error-correction algorithm because it changes synaptic weights to reduce the sizes of the errors gradually over time.
If we suppose that X(t) is a weighted sum of inputs as given by Equation 1, then the LMS algorithm says that a synaptic weight i changes as follows:
Wi(t + δ t) = Wi(t) + c [Z(t) - X(t)]Xi(t). (3)
This means that when the element's output, X(t), is too low, the error is positive and the synaptic weight, Wi(t), increases (assuming both c and Xi(t) are positive). On the other hand, when the element's output is too high, the error is negative, and the weight decreases. In either case, this change in synaptic weight tends to make the element's output more nearly equal to the target output when input element i is active in the future.
The LMS algorithm is closely related to several other error-correction algorithms. The perceptron learning algorithm differs from the LMS algorithm only in using an error that is sensitive to the difference in the signs of the target and actual outputs but is not sensitive to the difference in their sizes. The LMS algorithm is also nearly identical in mathematical form to an influential model of Pavlovian, or classical, conditioning proposed by psychologists Robert Rescorla and Allan Wagner (1972). Instead of changes in synaptic strengths, this model defines changes in behavioral variables called associative strengths that link conditioned stimuli with an unconditioned response. The LMS algorithm is the basis of many models of the cerebellum, where climbing fibers may provide the error signals.
Theories of the LMS algorithm and many other learning algorithms depend on viewing the learning process as a search for a set of weights that is best according to some measure of the algorithm's performance. For the LMS algorithm, this measure assigns to each possible set of weights the average, or mean, of the squares of the errors that would result from using that set of weights for all relevant inputs. If we visualize this measure as a surface over "weight space," the error measure for the current weights corresponds to a point on this surface, and the LMS algorithm changes the weights by moving this point down the slope of the surface, thus improving the system's ability to match the target outputs. This is called a gradient-descent algorithm because the direction of steepest slope is given mathematically by the gradient of the error measure. A gradient-descent algorithm will eventually find a local minimum, meaning a point that is better than all points in the immediate neighborhood, although a better point may exist. When the element's output is a linear function of its weights, as given by Equation 1, all local minima are also global minima, meaning that the error surface does not get any lower than at these points. This means that the LMS algorithm finds a set of weights that produces the least mean-square error over the relevant inputs, thus justifying the algorithm's name.
A generalization of the LMS algorithm known as the error back-propagation algorithm has been influential in artificial neural network research. (Haykin  provides details about this and other network learning algorithms.) In its simplest form, this is a gradient-descent algorithm that applies to networks of neuronlike elements that are connected in multiple layers and that are capable of computing nonlinear functions of the network's input. An error for each element is computed by propagating the error of the network as a whole back through the network in the direction opposite to the flow of element activity. Mathematically, this process computes the gradient of the error surface at the point corresponding to all the current weights in the network. Unlike the linear LMS algorithm, the error back-propagation algorithm is not guaranteed to find a best set of weights because the error surface can have a complex topography with many local minima.
Other Learning Algorithms
A learning algorithm is unsupervised, or self-organizing, if its experience comes from a training set containing inputs but no target outputs. Instead of trying to match target outputs, it tries to recode the inputs according to some built-in principle. For example, some unsupervised learning algorithms recode input signals in order to reduce their complexity while trying to preserve their information content. Often this takes the form of learning how to form clusters of input signals so that signals within a cluster are more similar to one another than they are to signals in other clusters. The Hebbian learning algorithm is most often used for unsupervised clustering in networks where elements compete for the opportunity to represent different inputs.
Reinforcement learning algorithms rely on training input called reward signals, which evaluate the quality of the learning system's behavior without explicitly telling the system what its outputs should be. Reinforcement learning algorithms have to discover how to improve their behavior by trying various outputs and comparing the resulting evaluations. This is a form of trial-and-error learning, which should not be confused with error-correction learning, such as that performed by the LMS algorithm, which requires target outputs. Many reinforcement learning systems use temporal difference (TD) algorithms to learn predictions of the amount of reward a reinforcement learning algorithm will accumulate over an extended period. TD algorithms reduce errors in predictions made at different times by adjusting earlier predictions to be closer to later—and therefore more reliable—predictions. The computer scientists Richard Sutton and Andrew Barto (1998) extensively discuss reinforcement learning and TD algorithms. Physiological experiments by the neuroscientist Wolfram Schultz (1998) reveal intriguing parallels between the behavior of TD algorithms and the activity of dompamine-producing neurons in the brain.
Learning algorithms have been devised for very different purposes: to model hypotheses about neural mechanisms of learning, to model the learning behavior of animals, and to solve important statistical and engineering problems. Despite these different purposes, the resulting algorithms show a remarkable degree of similarity. Perhaps this is not so surprising because animal behavior and its neural basis are nature's solutions to problems that have much in common with those studied by statisticians and engineers.
Brown, T. H., Kairiss, E. W., and Keenan, C. L. (1990). Hebbian synapses: Biophysical mechanisms and algorithms. Annual Review of Neuroscience 13, 475-511.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning. New York: Springer-Verlag.
Haykin, S. (1999). Neural networks: A comprehensive foundation. Upper Saddle River, NJ: Prentice Hall.
Hebb, D. O. (1949). The organization of behavior. New York: Wiley.
Rescorla, R. A., and Wagner. A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A. H. Black and W. F. Prokasy, eds., Classical conditioning, Vol. 2: Current research and theory, pp. 64-99. New York: Appleton.
Schultz, W. (1998). Predictive reward signals of dopamine neurons. Journal of Neurophysiology 80, 1-27.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: MIT Press.
Widrow, B., and Hoff, M. E. (1960). Adaptive switching circuits. In 1960 IRE WESCON convention record, pp. 96-104. New York: IRE. Reprinted in J. A. Anderson and E. Rosenfeld, eds., Neurocomputing: Foundations of research. Cambridge, MA: MIT Press, 1988.