Parallel Distributed Processing Models of Memory
This article describes a class of computational models that help us understand some of the most important characteristics of human memory. The computational models are called parallel distributed processing (PDP) models because memories are stored and retrieved in a system consisting of a large number of simple computational elements, all working at the same time and all contributing to the outcome. They are sometimes also called connectionist models because the knowledge that governs retrieval is stored in the strengths of the connections among the elements.
The article begins with a common metaphor for human memory, and shows why it fails to capture several key characteristics of memory that are captured by the PDP approach. Then a brief statement of the general characteristics of PDP systems is given. Following this, two specific models are presented that capture key characteristics of memory in slightly different ways. Strengths and weaknesses of the two approaches are considered, and a synthesis is presented. The article ends with a brief discussion of the techniques that have been developed for adjusting connection strengths in PDP systems.
Characteristics of Memory
A common metaphor for human memory might be called the "computer file" metaphor. According to this metaphor, we store a copy of an idea or experience in a file that we can later retrieve and reexamine. There are several problems with this view.
Memories are accessed by content.
First of all, the natural way of accessing records in a computer is by their address in memory. What actually happens in human memory, however, is that we access memories by their contents. Any description that uniquely identifies a memory is likely to be sufficient for recall. Even more interesting, each individual element of the description may be nearly useless by itself if it applies to many memories; only the combination needs to be unique. Thus
"He bet on sports. He played baseball."
is enough for many people to identify Pete Rose, even though the cues about baseball and betting on sports would not generally be sufficient individually, since each matches too many memories.
Memory fills in gaps.
The computer-file metaphor also misses the fact that when we recall, we often fill in information that could not have been part of the original record. Pieces of information that were not part of the original experience intrude on our recollections. Sometimes these intrusions are misleading, but often enough they are in fact helpful reconstructions based on things we know about similar memories. For example, if we are told that someone has been shot by someone else from a distance of 300 yards, we are likely to recall later that a rifle was used, even though this was not mentioned when we heard about the original event.
Memory generalizes over examples.
A third crucial characteristic of memory is that it allows us to form generalizations. If every apricot we see is orange, we come to treat this as an inherent characteristic of apricots. But if cars come in many different colors, we come to treat the color as a freely varying property. So when we are asked to retrieve the common properties of apricots, the color is a prominent element of our recollection; but no color comes out when we are asked to retrieve the common properties of cars.
Proponents of the computer-file view of memory deal with these issues by adding special processes. Access by content is done by laborious sequential search. Reconstruction is done by applying inferential processes to the retrieved record. Generalization occurs through a process of forming explicit records for the category (e.g., car or apricot).
In PDP systems, these three characteristics of memory are intrinsic to the operation of the memory system.
Characteristics of PDP Systems
A PDP system consists of a large number of neuron-like computing elements called units. Each unit can take on an activation value between some maximum and minimum values, often 1 and 0. In such systems, the representation of something that we are currently thinking about is a pattern of activation over the computing elements. Processing occurs by the propagation of activation from one unit to another via connections among the units. A connection may be excitatory (positive-valued) or inhibitory (negative-valued). If the connection from one unit to another is excitatory, then the activation of the receiving unit tends to increase whenever the sending unit is active. If the connection is inhibitory, then the activation of the receiving unit tends to decrease. But note that each unit may receive connections from many other units. The actual change in activation, then, is based on the net input, aggregated over all of the excitatory and inhibitory connections.
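To make the update rule concrete, the following minimal Python sketch shows a unit's activation being driven up or down by its net input and kept within bounds. The function and parameter names, such as step and rate, are illustrative assumptions rather than part of any particular published model.

import numpy as np

def step(activations, weights, external_input, rate=0.1, a_min=0.0, a_max=1.0):
    """One illustrative update of a PDP network (a sketch, not a specific published model).

    activations    : current activation of each unit
    weights[i, j]  : strength of the connection from unit j to unit i
                     (positive = excitatory, negative = inhibitory)
    external_input : input supplied from outside the network
    """
    # Each unit aggregates a net input over all of its incoming connections.
    net = weights @ activations + external_input
    # Positive net input pushes a unit's activation up; negative pushes it down.
    new_activations = activations + rate * net
    # Activations are bounded between a minimum and a maximum value.
    return np.clip(new_activations, a_min, a_max)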
In a system like this, the knowledge that governs processing is stored in the connections among the units, for it is these connections that determine what pattern will result from the presentation of an input. Learning occurs through adjustments of connection strengths. Memory storage is just a form of learning, and also occurs by connection weight adjustment.
To make these ideas concrete, we now examine two PDP models of memory. The models differ in a crucial way. In the first, each individual computing element (henceforth called a unit) represents a separate cognitive unit, be it a feature (for example, the color of something), or a whole object, or the object's name. When we are remembering events, there is a unit for each event. Such models are called localist models. In the second type of model, cognitive units are not separately assigned to individual computing units. Rather, the representation of each cognitive unit is thought of as a pattern of activation over an ensemble of computing units. Alternative objects of thought are represented by alternative patterns of activation. This type of model is called a distributed model.
A Localist PDP Model of Memory
McClelland (1981) presented a PDP model that illustrates the properties of access by content, filling in of gaps, and generalization. The database for the model is shown in Figure 1. The network is shown in Figure 2.
The database consists of descriptions of a group of people who are members of two gangs, the Jets and the Sharks. Each person has a name, and the list specifies the age, marital status, education, and occupation of each person. Perusal of the list reveals that the Jets are, by and large, younger and less well educated than the Sharks, and tend to be single rather than married. However, these tendencies are not absolute and, furthermore, there is no single Jet who has all of the properties that tend to be typical of Jets.
The goal of the network is to allow retrieval of general and specific information about individuals in the database. The network consists of a unit for each person (in the center of Figure 2) and a unit for each property (name, age, educational level, marital status, occupation, gang) that a person can have. Units are grouped into pools by type as shown, so that all the name units are in one pool, for instance. There is a bidirectional excitatory connection between each person's unit and the units for each of his properties; and there are bidirectional inhibitory connections between units that can be thought of as incompatible alternatives. Thus there is inhibition between the different occupation units, between the different age units, and so on. There is also inhibition between the different name units and between the units for different individuals.
In this network, units take on activation values between a maximum of 1 and a minimum of -0.2. The output of a unit is equal to its activation, unless the activation is less than 0, in which case the output is 0. In the absence of input, the activations of all the units are set to a resting value of -0.1.
Retrieval by Name
Retrieval begins with the presentation of a probe, in the form of externally supplied input to one or more of the property units. To retrieve the properties of Lance, for example, we need only turn on the name unit for Lance. The activation process is gradual and builds up over time, eventually resulting in a stable pattern that in this case represents the properties of Lance. Activation spreads from the name unit to the property units by way of the person unit for Lance (his instance unit). Feedback from the activated properties tends to activate the instance units for other individuals, but because of the mutual inhibition, these activations are kept relatively low.
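The following sketch illustrates this kind of retrieval on a tiny, invented fragment of the Jets and Sharks network. The individuals, property assignments, weight values, and the exact update equation are assumptions for illustration; the actual model uses the full database of Figure 1 and its own interactive activation equations. External input is supplied to the name unit for Lance, the network is allowed to settle, and the units representing Lance and his properties end up with the highest activations.

import numpy as np

# A tiny, made-up fragment of a Jets-and-Sharks style network.
units = ["Lance", "Ken",        # name units
         "_Lance", "_Ken",      # person (instance) units
         "Jets", "Sharks",      # gang units
         "20s", "30s",          # age units
         "burglar", "pusher"]   # occupation units
idx = {u: i for i, u in enumerate(units)}
n = len(units)
W = np.zeros((n, n))

def connect(a, b, w):
    """Bidirectional connection of strength w between units a and b."""
    W[idx[a], idx[b]] = w
    W[idx[b], idx[a]] = w

# Excitatory connections between each person unit and that person's properties
# (the property assignments here are invented for illustration).
for person, props in {"_Lance": ["Lance", "Jets", "20s", "burglar"],
                      "_Ken":   ["Ken", "Sharks", "30s", "pusher"]}.items():
    for p in props:
        connect(person, p, 1.0)

# Inhibitory connections between incompatible alternatives within each pool.
pools = [["Lance", "Ken"], ["_Lance", "_Ken"], ["Jets", "Sharks"],
         ["20s", "30s"], ["burglar", "pusher"]]
for pool in pools:
    for a in pool:
        for b in pool:
            if a != b:
                connect(a, b, -1.0)

# Activation parameters from the text: maximum 1, minimum -0.2, resting level -0.1.
A_MAX, A_MIN, REST, DECAY, RATE = 1.0, -0.2, -0.1, 0.1, 0.1

def settle(clamped, n_steps=200):
    """Let the network settle, given external input to the clamped units."""
    a = np.full(n, REST)
    ext = np.zeros(n)
    for u in clamped:
        ext[idx[u]] = 1.0
    for _ in range(n_steps):
        out = np.maximum(a, 0.0)        # output is zero when activation is negative
        net = RATE * (W @ out + ext)
        # An interactive-activation style update (an assumption, not the article's
        # exact equation): excitation drives a unit toward the maximum, inhibition
        # toward the minimum, and all activations decay back toward rest.
        delta = np.where(net > 0, net * (A_MAX - a), net * (a - A_MIN))
        a = np.clip(a + delta - DECAY * (a - REST), A_MIN, A_MAX)
    return a

# Retrieval by name: supply external input to the name unit for Lance only.
result = settle(["Lance"])
for u in sorted(units, key=lambda u: -result[idx[u]]):
    print(f"{u:8s} {result[idx[u]]:+.2f}")

The mutual inhibition within each pool is what keeps the partially activated units for other individuals from dominating the final pattern.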
Retrieval by Content
It should be clear how we can access an individual by properties, as well as by name. As long as we present a set of properties that uniquely matches a single individual, retrieval of the rest of that individual's known properties is quite good. Other similar individuals may become partially active, but the correct person unit will dominate the person pool, and the correct properties will be activated.
Filling in Gaps
Suppose that we delete the connection between Lance and burglar. This creates a gap in the database. However, the model can fill in this gap, in the following way. As the other properties of Lance become active, they in turn feed back activation to units for other individuals similar to Lance. Because the instance unit for Lance himself is not specifying any activation for an occupation, the instance units for other, similar individuals conspire together to fill in the gap. In this case it turns out that there is a group of individuals who are very similar to Lance and who are all burglars. As a result, the network fills in burglar for Lance as well. One may view this as an example of guilt by association. In this case, it so happens that the model is correct in filling in burglar, but of course this kind of filling in is by no means guaranteed to be correct. Similarly, in human memory, our reconstructions of past events often blend in the contents of other, similar events.
Generalization
The model can be used to retrieve a generalization over a set of individuals who match a particular probe. For example, one can retrieve the typical properties of Jets simply by turning on the Jet unit and allowing the network to settle. The result is that the network activates 20s, junior high, and single strongly. No name is strongly activated, and the three occupations are all activated about equally, reflecting the fact that all three occur with equal frequency among the Jets.
In summary, this simple model shows how retrieval by content, filling in gaps, and generalization are intrinsic to the process of retrieval in the PDP approach to memory.
A Distributed PDP Model of Memory
The second model to be considered is a distributed model. Many authors (e.g., Kohonen, 1977; Anderson et al., 1977) have proposed variants of such models. The one shown in Figure 3 is from McClelland and Rumelhart (1985). The model is called distributed because there are no single units for individuals or for properties. Instead, the representation to be stored is a distributed pattern over the entire set of units. Similar memories are represented by similar patterns, as before; but now each unit need not correspond to a specific feature or property, and there are no separate units for the item as a whole. Again, the knowledge is stored in the connections among the units.
Methods for training such networks will be considered in more detail below. Suffice it to note one simple method, called the Hebbian method. According to this method, we increase the connection strength between two units if they are both active in a particular pattern at the same time.
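As a minimal sketch (the function name and the learning-rate value are assumptions), the Hebbian update for storing one pattern can be written as an outer product of the pattern with itself:

import numpy as np

def hebbian_update(weights, pattern, lr=0.1):
    """Hebbian storage of one pattern: strengthen connections between co-active units.

    weights : square matrix of connection strengths
    pattern : vector of unit activations for the pattern being stored
    lr      : learning rate (an illustrative value)
    """
    # The increment for the connection between units i and j is lr * pattern[i] * pattern[j],
    # so it is positive only when both units are active together.  With +1/-1 activations
    # this same rule also implements the variant discussed below, in which the connection
    # is weakened when one unit is active and the other is not.
    weights = weights + lr * np.outer(pattern, pattern)
    np.fill_diagonal(weights, 0.0)   # units are not connected to themselves
    return weights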
Distributed networks trained with this Hebbian learning rule exhibit many of the properties of localist networks. They perform an operation, called pattern completion, that is similar to retrieval by content. In pattern completion, any part of the pattern can be used as a cue to retrieve the rest of the pattern, although there are limits to this that we will consider below. Because many memories are stored in the same set of connection weights, these networks have a strong tendency to fill in gaps in one pattern with parts of other, similar patterns. These models also generalize. When similar patterns are stored, what is learned about one pattern tends to transfer to the parts it has in common with the others. When a set of similar patterns is stored, what is common to all of them builds up as each example is learned; what differs cancels out.
There is a final important property of distributed memory models, and that is graceful degradation. The knowledge that governs the ability to reconstruct each pattern is distributed throughout the network, so if some of the connections are lost, it will not necessarily be catastrophic. In fact, the network can function quite well even when many of the units are destroyed, especially if it is relatively lightly loaded with memories.
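The following sketch illustrates pattern completion and graceful degradation together. The numbers of units and patterns, the +1/-1 activation coding, the thresholded update, and the lesion fraction are all arbitrary choices for illustration, not details of the McClelland and Rumelhart model. A few random patterns are stored with the Hebbian rule, half of one pattern is used as a cue, and recall is tested before and after deleting a quarter of the connections.

import numpy as np

rng = np.random.default_rng(0)
n_units, n_patterns = 64, 3

# Store a few random +1/-1 patterns with the Hebbian rule (outer-product storage).
patterns = rng.choice([-1.0, 1.0], size=(n_patterns, n_units))
W = np.zeros((n_units, n_units))
for p in patterns:
    W += np.outer(p, p) / n_units
np.fill_diagonal(W, 0.0)

def complete(weights, cue, n_steps=10):
    """Pattern completion: repeatedly drive each unit from its net input and threshold."""
    a = cue.copy()
    for _ in range(n_steps):
        a = np.sign(weights @ a)
        a[a == 0] = 1.0              # break ties arbitrarily
    return a

def overlap(a, b):
    return float(a @ b) / n_units    # +1.0 means a perfect match

# Use half of the first stored pattern as a cue; the rest is typically filled in.
cue = patterns[0].copy()
cue[n_units // 2:] = 0.0
print("completion from a partial cue:", overlap(complete(W, cue), patterns[0]))

# Graceful degradation: delete roughly a quarter of the connections at random.
# Recall typically remains good because the knowledge is spread over many connections.
mask = np.triu(rng.random((n_units, n_units)) < 0.25, 1)
W_damaged = np.where(mask | mask.T, 0.0, W)
print("completion after damage:      ", overlap(complete(W_damaged, cue), patterns[0]))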
Each of the two models described above has some limitations. The localist model requires a special instance unit to be devoted to each memory trace; this is inefficient, especially when there is redundancy across different memories in terms of which properties tend to co-occur in the same memory. On the other hand, the distributed model is limited because only a few distinct patterns can be stored in the direct connections among the members of a set of units.
The best of both worlds can be obtained in a hybrid system, in which the various parts of the representation of a memory are bound together by a set of superordinate units, as in the localist model, but each superordinate unit participates in the representation of many different memories, as in the distributed model.
Learning Rules for PDP Systems
Several of the learning rules for PDP systems are reviewed in Rumelhart, Hinton, and McClelland (1986). Here we consider two main classes, Hebbian learning rules and error-correcting learning rules. We have already mentioned the Hebbian learning rule, which increases the strength of the connection between two units when both units are simultaneously active. In a common variant, the strength of the connection is decreased when one unit is active and the other is inactive.
These Hebbian learning rules are limited in what can be learned with them. Some of these limitations are overcome by what are called error-correcting learning rules. In such learning rules, the idea is that the pattern to be learned is treated not only as input but also as the target for learning. A pattern is presented, and the network is allowed to settle. Once it has done so, the discrepancies between the resulting pattern and the input pattern are used to determine what changes should be made in the connections. For example, if a unit is activated that should not be active, the connection weights coming into that unit from other active units will be reduced. Several very powerful learning procedures for adjusting connection weights that are based on the idea of reducing the discrepancy between output and target have been developed in recent years. The best-known is the back-propagation learning procedure (Rumelhart, Hinton, and Williams, 1986). Another important learning rule for PDP systems is the Boltzmann machine learning rule (Ackley, Hinton, and Sejnowski, 1985). Both work well in training the hybrid systems described above.
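As a sketch of the simplest error-correcting scheme, often called the delta rule, the discrepancy between the pattern presented and the pattern the network reproduces determines the weight changes. The auto-associative setup, the single linear pass in place of a full settling process, and the parameter values are simplifying assumptions; back-propagation and the Boltzmann machine procedure are more elaborate than this.

import numpy as np

def delta_rule_update(weights, pattern, lr=0.1):
    """One error-correcting (delta rule) update for an auto-associative network.

    The pattern serves both as the input and as the target.  A single linear pass
    stands in here for the settling process described above.
    """
    output = weights @ pattern        # what the network currently reproduces
    error = pattern - output          # discrepancy between the target and the output
    # Connections from active sending units into a unit whose activation is too
    # high are weakened; into a unit whose activation is too low, strengthened.
    return weights + lr * np.outer(error, pattern)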
References
Ackley, D. H., Hinton, G. E., and Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science 9, 147-169.
Anderson, J. A., Silverstein, J. W., Ritz, S. A., and Jones, R. S. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review 84, 413-451.
Hertz, J., Krogh, A., and Palmer, R. (1990). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hinton, G. E., and Anderson J. A., eds. (1981). Parallel models of associative memory. Hillsdale, NJ: Erlbaum.
Kohonen, T. (1977). Associative memory: A system theoretical approach. New York: Springer-Verlag.
McClelland, J. L. (1981). Retrieving general and specific information from stored knowledge of specifics. Paper presented at the third annual meeting of the Cognitive Science Society. Berkeley, CA.
McClelland, J. L., and Rumelhart, D. E. (1985). Distributed memory and the representation of general and specific information. Journal of Experimental Psychology: General 114, 159-188.
Rumelhart, D. E., Hinton, G. E., and McClelland, J. L. (1986). A general framework for parallel distributed processing. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Cambridge, MA: MIT Press.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, eds., Parallel distributed processing: Explorations in the microstructure of cognition, Vol. 1. Cambridge, MA: MIT Press.
Rumelhart, D. E., McClelland, J. L., and the PDP Research Group. (1986). Parallel distributed processing: Explorations in the microstructure of cognition, 2 vols. Cambridge, MA: MIT Press.