If the environment is to play a more prominent role in memory theorising it is important to develop models in which the environment and the mechanism interact to produce behaviour. The first model to take a global view of the impact of the environment was Anderson and Milson's (1989) rational analysis of memory. In this model, it is assumed that the memory system has adapted through its evolutionary history to the distributions of items with which it is faced. Anderson and Milson (1989) go on to outline a Bayesian framework to bridge the gap between the environmental statistics and measures of experimental performance and show that to a first degree of approximation the system reproduces the results of manipulating frequency, recency and item spacing and gives insight into priming and fan effects.
Anderson and Milson's (1989) Bayesian approach is not, however, a process account of environmental optimisation. To develop such an account it is necessary to specify a learning mechanism capable of implementing the Bayesian decision formulation. Error-correcting backpropagation networks adapt to the statistics of the training environments in which they are immersed, providing an analogy to the learning processes that occur within a person's lifetime. There are, however, a number of obstacles which discourage the direct application of backpropagation networks in the memory domain. In particular, the degree of interference or unlearning typically found in backpropagation networks (McCloskey & Cohen, 1989; Ratcliff, 1990) far exceeds that found in subjects.
In the modelling work presented in this paper, a hybrid architecture called the Hebbian Recurrent Network (HRN), employing both Hebbian and backpropagation learning rules, is proposed. The architecture is simulated to ensure that it embodies fundamental criteria for a model of human memory such as the ability to form memories rapidly without excessive interference.
An Adaptive Memory
Within the recent literature there has been considerable debate about
the ability of backpropagation models to capture human memory
phenomena. While there is a feeling that backpropagation models have
much to offer, the results thus far have been mixed (Ratcliff, 1990;
Lewandowsky, 1991). Backpropagation models provide mechanisms by which
encodings, decision criteria and control functions can be learned as a
consequence of exposure to the environment, yet, on some very basic
variables like the degree of interference, they have not performed as
well as conventional memory models.
Estes (1991) makes a telling comment in concluding his discussion of the difficulties of connectionist models:
The purpose of this paper is to start to address what forms of bias must be added to a network to make the learning tasks commonly solved by humans tractable and to ensure that the network performs these tasks in a psychologically plausible way.
The discussion begins by outlining some of the contributions that backpropagation architectures can make to the modelling of memory. Next, some of the major aspects of memory phenomena that remain obstacles for backpropagation models are examined. Finally, the Hebbian Recurrent Network (HRN), which integrates learning and memory models, is presented.
Learning Issues
In an interactive (learning) model, it is the interplay of the
environment and the architecture that leads to performance (Bickhard,
1991). For example, in backpropagation models, performance is
determined both by the architecture (i.e. structure of the
interconnections, the transfer function, and the values of the
parameters) and the statistical contingencies embodied by the training
set.
The advantages of such an interplay can be examined in terms of the components of the memory system. In the following subsections, the learning of representation, decision criteria and control are considered in turn.
Backpropagation models are able to construct internal representations (Rumelhart, Hinton, & Williams, 1986), thus offering a way of avoiding many representational assumptions. Typically, the hidden unit representations are formed as a consequence of both input similarity (e.g. the words "been" and "bean" might assume similar representations since they have similar orthographics and identical phonology) and functional similarity (e.g. the words "idea" and "concept" might adopt similar representations because they are used in functionally similar ways). Elman (1989, 1990) has demonstrated how a network in which the input representations have no similarity structure can exploit the functional similarity in the statistics of the training set to create a similarity landscape on the hidden units. Hence, backpropagation networks introduce a principled way in which "abstract features" might be formed as a consequence of the environment with which they are faced.
In the majority of memory paradigms, however, subjects become more accurate as they gain experience. While some of the improvement may be due to the refinement of the representation, a portion is attributable to an improvement in the ability to decide upon a response (Postman, 1969).
Memory Issues
In the previous section, the aspects of the memory system that might be
acquired by a learning system such as a backpropagation network were
outlined. To be serious alternatives to current models of memory,
however, there are a number of criteria on which current memory models
perform well that must be fulfilled. These memory criteria include
maintaining significant capacity without introducing unrealistic
amounts of interference, generalising to unseen lists of items using
small numbers of training examples and the ability to establish memory
traces rapidly.
Within the literature two major strategies have emerged in order to deal with the problem of catastrophic interference. The first involves increasing the orthogonality of items that are to be learned in succession. Lewandowsky (1991), Kruschke (1992) and French (1991) have suggested methods for encouraging orthogonality and, hence, decreasing the amount of interference within feedforward networks.
The alternative approach has been to use recurrent architectures to encode lists of items rather than single items on their hidden units (Nolfi, Parisi, Vallar, & Burani, 1990; Wiles & Phillips, 1991)[1]. The network has the task of learning a single higher order encoding function rather than a sequence of items and, hence, the interference is reduced. Unfortunately, this encoding function becomes much more difficult to acquire as the number of items to be encoded increases. Consequently, existing recurrent architectures have severe capacity restrictions.
For traditional memory models, such generalisation is not a problem since the memory mechanisms are chosen so that they will encode lists independently of which items are to be encoded. In the recurrent network architectures that have been applied to memory phenomena (Nolfi et al., 1990; Wiles & Phillips, 1991), however, there is no such constraint. The network must learn to recognise each new item. In addition, in the serial recall task investigated by Wiles and Phillips (1991), a significant proportion of the possible orderings of the items must also have been encountered. Brousse and Smolensky (1989) highlight this issue and suggest that the process of building the list representation be hard wired by using a tensor product of the item's position and the item vector, effectively concatenating the list items. In order to fulfill the second memory criterion some such approach must be adopted.
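The tensor product scheme can be illustrated with a minimal sketch. It assumes one-hot position vectors and small hand-picked item vectors; the function names (`outer`, `add`, `encode_list`) are hypothetical and are not taken from Brousse and Smolensky (1989).

```python
def outer(u, v):
    """Outer (tensor) product of two vectors, returned as a list of rows."""
    return [[a * b for b in v] for a in u]


def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]


def encode_list(items, n_positions):
    """Superimpose position-by-item outer products (assumed one-hot positions)."""
    dim = len(items[0])
    trace = [[0.0] * dim for _ in range(n_positions)]
    for pos, item in enumerate(items):
        position = [1.0 if k == pos else 0.0 for k in range(n_positions)]
        trace = add(trace, outer(position, item))
    return trace


A = [1.0, -1.0, 1.0]
B = [-1.0, 1.0, 1.0]
trace = encode_list([A, B], n_positions=2)
# With one-hot positions, row k of the trace is simply the k-th item,
# which is the sense in which the scheme concatenates the list items.
```

Because the binding is fixed rather than learned, nothing about the encoding depends on which items or orderings were seen during training.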
The variety of temporal networks that have been developed fall into two classes[2]. The first class includes those networks that buffer input in order to maintain temporal information (e.g. the Time Delay Neural Network (TDNN); Waibel, Hanazawa, Hinton, Shikano, & Lang, 1987; Waibel, 1989). In a TDNN, the input sequence is presented to the backpropagation network through a series of delays. Hence, if i(t) is the input at timestep t, the network receives i(t), i(t-1), i(t-2), i(t-3), i(t-4) and i(t-5) at the same time. In a memory task, the entire study list would be required by the network at the time of decision. While the depth of the network remains constant, the number of inputs grows both with the vocabulary and the length of the sequence. Not only is such an architecture prohibitively large, but it requires that each item be presented at each possible timestep within the training set. Only in this way can the TDNN learn to respond appropriately independently of the position of the item. Hence, a large training set would be required and the generalisation criterion would be violated.
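The buffering scheme can be sketched as follows. This is an illustration only: the helper names and the `pad` value are hypothetical, and the input-size calculation assumes a one-hot coding of items, which the text does not specify.

```python
def delayed_inputs(sequence, t, n_delays=5, pad=None):
    """Return [i(t), i(t-1), ..., i(t-n_delays)], padding before the sequence start."""
    return [sequence[t - d] if t - d >= 0 else pad for d in range(n_delays + 1)]


def tdnn_input_size(vocabulary_size, n_delays=5):
    """Input units needed under an assumed one-hot coding: one slot per buffered step."""
    return vocabulary_size * (n_delays + 1)


# All six buffered items are visible to the network at the same time.
window = delayed_inputs(list("ABCDEFG"), t=6)
```

The input layer grows with the product of vocabulary size and buffer length, which is the scaling problem described above.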
The second class subsumes recurrent networks such as the Jordan network (Jordan, in press), the Simple Recurrent Network (SRN, Elman, 1989, 1990), Back Propagation Through Time (BPTT, Rumelhart et al., 1986) and Real Time Recurrent Learning (RTRL, Williams & Zipser, 1989). In contrast to the TDNN, which assumes that the input is buffered to maintain information, recurrent networks postulate recurrent connections through which information is cycled and, hence, preserved.
The recurrent connections can emanate either from the output units (e.g. Jordan network) or from the hidden units (e.g. SRN, BPTT, RTRL). There are, however, problems that cannot be solved by the Jordan network because the information that must be maintained is not present at the output and is necessarily lost (Cottrell & Tsung, 1993; Dell, Juliano, & Govindjee, 1993). In particular, memory control paradigms such as rehearsal can operate without the requisite information in the outputs. Hence, to avoid restrictions on the control paradigms that can be implemented, hidden unit recurrency is required.
Having established the form of the architecture, it remains to choose the learning algorithm which will be applied. The BPTT algorithm unfolds a network (Rumelhart et al., 1986) creating a level for each timestep with tied weights between the timesteps. Because very deep networks are constructed by this method, training is often very difficult and time consuming. In addition, a great deal of memory is required. The RTRL algorithm follows the same gradient as a BPTT network that is unfolded for the entire length of a sequence, but does not require time or space proportional to sequence length. It is an $O(n^4)$ algorithm [3] (where n is the number of nodes) and is slow for large architectures.
The Simple Recurrent Network (SRN, Elman, 1989, 1990) can be thought of as an approximation to the fully recurrent networks such as BPTT and RTRL. Instead of backpropagating over the entire length of the sequence, it considers only the last timestep. While being faster (i.e. $O(n^2)$) than either BPTT[4] or RTRL, it does not actively seek to maintain relevant information, but makes use of the information that is maintained by chance. Because the current state is a consequence of the preceding states, it is possible that relevant information will be retained until needed. In practice, this information is often lost before such time as the error signal can be generated. Without useful information in the form of hints being required on the outputs, the ability of the SRN to remember is limited to short study lists (Phillips, 1991).
Despite its limited nature the SRN has performed well on the language prediction tasks to which it has been applied (Elman, 1989, 1990). The hidden unit patterns that were chosen by the network formed a similarity landscape that was related to syntactic and semantic structure in the training corpus. In addition, the SRN developed the ability to decide upon a response (or response set) given the prior context. Furthermore, the network was able to retain information that could influence later encoding and predictions, such as pluralisation, over short time spans - a limited form of control. Because the SRN fulfills the learning criteria established earlier it was chosen as the learning mechanism of the HRN.
Current recurrent network models of human memory (Nolfi et al., 1990; Wiles & Phillips, 1991) attempt to learn to build a representation of the entire list. This representation requires training on a significant proportion of all possible lists in order to ensure generalisation (Wiles & Phillips, 1991). People, however, do not receive such extensive training. To perform the sorts of tasks people do routinely, a model must include a form of bias that circumvents this learning problem, that is, the model must learn to encode items not entire lists. While the recurrent activations are the only means by which information can be stored, however, they will continue to be used to store the sequence of items. One solution is to provide a separate mechanism that is capable of storing lists. Fortunately, such a mechanism has been the subject of research into human memory for several decades. In the following section, this literature is examined to find a mechanism suitable for inclusion into the SRN.
With the exception of Search of Associative Memory (SAM; Raaijmakers & Shiffrin, 1981) these models employ distributed representations and are all possible candidates for incorporation into the SRN. Minerva II (Hintzman, 1984) is an exemplar model, meaning that each memory trace is stored separately. Such a model is able to account for situational (Hintzman & Block, 1971; Greene, 1992) and categorical (Greene, 1992) frequency judgements, but incurs the cost of storage proportional to time. The alternative is to employ a blend memory in which images are superimposed upon each other. TODAM (Murdock, 1982), CHARM (Eich, 1982) and the Matrix model (Pike, 1984; Humphreys et al., 1989) are examples of superposition models, and were considered to be more feasible for implementation in the SRN. The final choice was between the correlation/convolution methods of TODAM and CHARM, and the Hebbian scheme of the Matrix model. While the Hebbian rule has well established roots in the connectionist literature (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1982), the correlation/convolution method has also been studied in this context (Plate, 1991, 1992). Due to the arguments put forth by Pike (1984) that the Matrix model is less subject to noise, and the obvious mapping of the cells of a matrix to the weights of a Hebbian network, the Matrix model was chosen. Allowing weights to store information instead of requiring the activation pattern to encode it was expected to improve performance in a plausible fashion (Schmidhuber, 1991; Mozer, 1992).
The architecture of the HRN is similar to that of the SRN, in that the input and context layers are completely connected to the hidden layer and the hidden layer is completely connected to the output layer (see Figure 1). In contrast to the SRN, which copies the contents of the previous hidden layer to the context layer, the hidden layer of the HRN is connected to the context layer by a set of Hebbian weights. It is these weights that form the memory of the system. At any given timestep, the context layer contains the results of the last memory retrieval.
Table 1: The HRN algorithm. Note that the scaling factor was introduced to stop the context units from saturating. When updating the Hebbian weights, retaining 1-β of the existing weights and adding β of the new outer product ensures that the weights are bounded (since the activations are bounded between -1 and 1), and that the mechanism will be stable. The Hebbian weights are updated during both training and testing.
The weights are updated after the hidden and output activation patterns have been updated, but before the context units are changed. The backpropagation weights are updated using the backpropagation rule as in the SRN. The Hebbian weights are updated by autoassociating the hidden unit activation pattern. The outer product of the hidden vector with itself is calculated and added to the existing Hebbian matrix.
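The Hebbian update can be sketched as below. The body of Table 1 is not reproduced in this text, so the code is reconstructed from the caption and the description above and should be read as an assumption rather than the published algorithm; the function names, the `scale` argument and its default value are hypothetical.

```python
def hebbian_update(weights, hidden, beta):
    """Blend the autoassociative outer product of `hidden` into `weights`.

    new_w[i][j] = (1 - beta) * w[i][j] + beta * hidden[i] * hidden[j]
    """
    return [
        [(1.0 - beta) * w_ij + beta * h_i * h_j for w_ij, h_j in zip(row, hidden)]
        for row, h_i in zip(weights, hidden)
    ]


def retrieve(weights, cue, scale=1.0):
    """Probe the Hebbian matrix with `cue`; `scale` is a hypothetical scaling factor."""
    return [scale * sum(w * c for w, c in zip(row, cue)) for row in weights]


n = 3
weights = [[0.0] * n for _ in range(n)]
# Hidden activations lie in [-1, 1], as they would for tanh units.
for hidden in ([0.9, -0.8, 0.1], [-1.0, 1.0, 1.0], [0.5, 0.5, -0.5]):
    weights = hebbian_update(weights, hidden, beta=0.2)

# Each update is a convex combination of the old weight and a product of two
# activations in [-1, 1], so no weight can leave that range.
largest = max(abs(w) for row in weights for w in row)
```

The convex blend is what keeps the weights bounded: a plain additive outer product rule would grow without limit as more patterns were stored.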
At this point, the role of the "context" units may need some clarification. This terminology has been borrowed from Elman (1989, 1990) and does not correspond to the context cues that are often featured in models of human memory (c.f. Gillund & Shiffrin, 1984; Humphreys et al., 1989; Chappell & Humphreys, 1994, 1993; Heathcote, 1993). The context units receive the outputs of memory and, hence, form the "context" of the next input to the memory system. The context cues used in the memory literature are derived from the experimental instructions and define the episode from which a subject is expected to recognise items. In the following simulations, change of context has been implemented by resetting both the context units and the Hebbian weights. This mechanism decreases simulation time, but is insufficient for tasks that involve retrieval from multiple lists (i.e. multiple contexts). To model these tasks, context could be included as part of the input. As is often done in distributed memory systems, each episode could be represented by a vector and the appropriate vector would be reinstated at the time of test to indicate from which context the network is required to retrieve.
Figure 2 outlines where the processes involved in memory tasks are implemented in the HRN. The input-to-hidden and context-to-hidden weights implement the encoding process. The context-to-hidden and hidden-to-output weights implement the retrieval process, and the Hebbian weights from the hidden to the context units are responsible for storage.
Firstly, the representations should be determined by the dynamics of the network as a consequence of learning and not chosen a priori by the experimenter. Hence, the inputs to the memory system come from the hidden activations of the backpropagation network. The representational scheme will be formed as a result of the input and required outputs, that is, as a function of the environment in which the network must operate. Furthermore, it is also necessary that the matrix memory be autoassociative. The problem with using a feedforward Hebbian memory in a system in which the representations are learned is that the Hebbian memory itself is solely responsible for generating the representation at its outputs. Any Hebbian learning could only reinforce the output patterns that already occurred as a consequence of the initial weights. For instance, if the initial weights were all zero the only outputs that would be generated would also be zero. Using a Hebbian weight update scheme, however, the zero outputs would result in no change to the weights and hence nothing could be retained. Hence, the Hebbian system is necessarily autoassociative; it necessarily has a hidden layer as input, and that hidden layer must feed to and from error correcting weights.
The splitting of the Hebbian and backpropagation weights is not intended to suggest that these must represent different sorts of weights in the brain. What is being identified is the functional disparity. The Hebbian units are responsible for storage while the backpropagation weights perform function approximation. It may be the case that both of these processes can be captured by a single weight update rule.
The second point is that the outputs of the Hebbian memory are available at the input to the feedforward backpropagation architecture. Consequently, the result of probing memory can be used to construct the next cue to memory, allowing the control aspects of the task to be acquired. In addition, such a recurrent system makes chains of recollection possible. The ability to perform these chains of recollection is a prominent aspect of everyday recall (Lewandowsky & Murdock, 1989).
Study/Test versus Training/Test Terminology: A confusing aspect of the terminology that has been inherited from the memory and backpropagation literatures is the distinction between Study/Test paradigms (from the memory literature) and Training/Test methodology (from the backpropagation literature). Within the memory literature a Study/Test paradigm is one in which subjects are first given a study list and are then given a test list to assess their memory for the studied items. In the backpropagation literature the term "test" is used to indicate the assessment of how well the network has acquired the function which underlies the data with which it was presented. Typically, it is presented with unseen cases and matched against the desired response. The HRN uses both sets of terminology. The Hebbian weights embody the memory of the subject while the backpropagation weights embody the memory functions (including representations, decision criteria and control functions) that are acquired over the subject's entire past history. In these simulations both the previous history (i.e. training) and the current evaluation (i.e. testing) were characterised as a set of study/test paradigms. During training the backpropagation weights are altered and the example set will reflect the general experience of the subject whereas in the test phase the backpropagation weights are frozen and the example set will reflect the statistics of presentation of items within the experimental setting.
Table 2: An example input/output sequence for the episodic recognition task.
| | First Study Item | Second Study Item | Third Study Item | Probe | Answer |
| Input | A | B | C | B | Pause |
| Targets | Blank | Blank | Blank | Blank | Yes |
The backpropagation weights of the network were trained to perform recognition by presenting study/test sequences. Once training was complete the backpropagation weights (but not the Hebbian weights) were frozen.
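A study/test sequence of the form shown in Table 2 could be generated along the following lines. The sampling scheme is an assumption made for illustration (study items drawn without replacement, probe drawn uniformly from the vocabulary); the text does not state how the original sequences were sampled, and `make_sequence` is a hypothetical name.

```python
import random


def make_sequence(vocabulary, list_length, rng):
    """Build one study/test sequence as a list of (input, target) pairs."""
    study = rng.sample(vocabulary, list_length)   # assumed: no repeated study items
    probe = rng.choice(vocabulary)                # assumed: uniform over the vocabulary
    answer = "Yes" if probe in study else "No"
    inputs = study + [probe, "Pause"]
    # "Blank" is the target on every timestep except the final decision step.
    targets = ["Blank"] * (list_length + 1) + [answer]
    return list(zip(inputs, targets))


sequence = make_sequence(list("ABCDEF"), list_length=3, rng=random.Random(0))
```

Each sequence contains the list length plus two patterns: the study items, the probe, and the "Pause" step on which the decision is required.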
Figure 3 shows the HRN applied to a small recognition task. In this case, the study lists consist of two items which are presented in the first and second timesteps. At the third timestep an item is input to be recognised and at the fourth timestep the recognition decision is made. The diagrams on the left hand side demonstrate the processing of a target item (i.e. an item that was present in the two item study list) and the diagrams on the right hand side show the processing of a distractor item (i.e. one that was not present in the two item list). There are four input patterns, one for each of the three items as well as one "Pause" symbol, which is input when the recognition decision is to be made. At the output there are three patterns. For the first three timesteps the "Blank" pattern is output demonstrating that the network has successfully learned not to "babble" when no output is expected. The other outputs represent "Yes" and "No", the possible responses to the recognition decision. Four hidden units were used and, consequently, there were four context units. In addition, the inputs were repeated at the output (giving four more outputs). These extra targets were added to condition the error space during learning and do not affect post training processing.
Figure 3: The trained HRN processing a target and a distractor.
Consider the processing of the target, as on the left hand side of Figure 3. The items of the list (i.e. "A" and "B") are presented to the inputs and, together with the context (which is zero to begin with), are used to form a cue. Because the Hebbian weights (which will be referred to as the medium term store, MTS) are zero, there is no activation of the context units. If the patterns at the hidden units in response to these two inputs are denoted by $A_H$ and $B_H$ respectively, then, at the end of processing of the study list the Hebbian matrix is $(1-\beta)\beta A_H A_H' + \beta B_H B_H'$, where $\beta$ is the memory decay parameter. Note that up until this stage the target or distractor item has not been presented (only the study list) so the left and right hand sides are identical. At step three, the probe item is input. Apart from some noise, the context units are zero and, hence, the hidden units will contain $A_H$. In the example, the hidden pattern is (-.7,-.3,0,.1). When this pattern is multiplied by the Hebbian weights there is a match:
$$A_H'\left((1-\beta)\beta A_H A_H' + \beta B_H B_H'\right) \approx (1-\beta)\beta\,|A_H|^2 A_H'$$
since $A_H' B_H$ is approximately equal to zero. In this example, the context units have the pattern (-.6,-.3,0,.1). At step four, the network determines if the context units represent a valid item pattern. In this case, they do and the network responds with "Yes".
Now consider the distractor example on the right hand side of Figure 3. The study items are input in the same fashion. When the "C" item is input as the probe item, however, there is no such pattern in the MTS and, hence, the context units are near zero (e.g. 0, .1, .1, -.1) when the recognition decision is to be made. The network has no trouble determining that this item has not been seen and responds "No".
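The target and distractor cases can be checked numerically with a small sketch. The patterns below are idealised, exactly orthogonal vectors chosen for illustration and are not the hidden patterns of Figure 3; the helper names are hypothetical. As in the worked example, the store is probed as it stands at the end of the study list.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]


def blend(weights, hidden, beta):
    """Retain (1 - beta) of the old matrix and add beta of the new outer product."""
    new = outer(hidden, hidden)
    return [
        [(1 - beta) * w + beta * o for w, o in zip(w_row, o_row)]
        for w_row, o_row in zip(weights, new)
    ]


def probe(cue, weights):
    """Row vector times matrix: cue' W."""
    return [
        sum(c * weights[i][j] for i, c in enumerate(cue))
        for j in range(len(weights[0]))
    ]


beta = 0.2
A = [1.0, 1.0, 0.0, 0.0]
B = [0.0, 0.0, 1.0, 1.0]
C = [1.0, -1.0, 0.0, 0.0]   # orthogonal to both A and B

store = [[0.0] * 4 for _ in range(4)]
store = blend(store, A, beta)   # beta * A A'
store = blend(store, B, beta)   # (1 - beta) * beta * A A' + beta * B B'

target_context = probe(A, store)       # (1 - beta) * beta * |A|^2 * A'
distractor_context = probe(C, store)   # no matching trace, so all zeros
```

With learned hidden patterns the cross terms are only approximately zero, which is why the context units in the distractor case are near zero rather than exactly zero.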
While this architecture is capable of performing recognition tasks, it is not in principle restricted to such tasks. In general, the current input and the internal state are used to form a cue for MTS. Such a cue will recall any previously stored pattern that is either similar to or the same as itself. Hence, presenting a concept is likely to recall other related concepts. Once these concepts have been recalled they participate in the formation of the next cue set for MTS and so on.
Simulation Method
The SRN and HRN were applied to the episodic recognition task [5]. The recognition task was mapped onto the
network in the manner described above (see Table 2). The effect of
list length [6] and number of items in the
training set was explored. In addition, the effect of the size of the
vocabulary and the magnitude of the memory decay parameter on the
performance of the HRN was investigated. All results are averaged over 20
trials with a learning rate of 0.05 and 20 hidden units. Except in
those simulations in which the memory decay is explicitly being varied,
it was set to 0.2. Unless otherwise stated the vocabulary was set to
twice the list length and the training sets consisted of 500 sequences
(i.e. 500 * (listlength+2) patterns). The test sets consisted of 200
sequences. Each trial involved restarting the network with a new set of
weights and new training and testing sets. The simulator was written in
C on a SparcStation II platform and is available from the author on
request.
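The default simulation settings can be collected in a small configuration sketch. The class and field names are hypothetical; the values are those stated above, and the two derived quantities follow the stated rules (vocabulary of twice the list length, and list length plus two patterns per sequence).

```python
from dataclasses import dataclass


@dataclass
class SimulationConfig:
    """Default settings for one simulation (names are illustrative only)."""
    list_length: int
    trials: int = 20
    learning_rate: float = 0.05
    hidden_units: int = 20
    memory_decay: float = 0.2
    training_sequences: int = 500
    test_sequences: int = 200

    @property
    def vocabulary_size(self) -> int:
        # Unless otherwise stated, the vocabulary is twice the list length.
        return 2 * self.list_length

    @property
    def training_patterns(self) -> int:
        # Each sequence holds the study items, the probe and the pause step.
        return self.training_sequences * (self.list_length + 2)


config = SimulationConfig(list_length=3)
```

For a three item list this gives a vocabulary of six items and 2500 training patterns.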
Performance was measured at each decision timestep. The largest of the "blank", "yes" and "no" activation levels was taken to be the response of the network.
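The response rule amounts to a winner-take-all choice over the three response units, sketched below with made-up activation values. The function name and the dictionary layout are assumptions for illustration.

```python
def choose_response(activations):
    """Return the label of the most active response unit."""
    return max(activations, key=activations.get)


# Hypothetical activations at a decision timestep.
response = choose_response({"blank": -0.8, "yes": 0.6, "no": -0.3})
```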
Normally, when applying signal detection theory there are only two responses and a negative value of d' indicates bias against correct responding. In these simulations, however, a third response (i.e. "blank") is added and the chance baseline is reduced below 0.5 leading to negative d' values. However, the "blank" response was rarely given during the decision phase after the first few epochs of training. Hence, the problem quickly reduced to a two response paradigm and negative d' values were eliminated.
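For reference, the standard two response form of d' is the difference between the z-transformed hit and false alarm rates. The sketch below uses that textbook definition; the text does not say how extreme rates of 0 or 1 were handled in the simulations, so none is attempted here and such rates raise an error.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate):
    """d' = z(hit rate) - z(false alarm rate); rates must lie strictly in (0, 1)."""
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


unbiased = d_prime(0.5, 0.5)     # equal rates: no discrimination
sensitive = d_prime(0.9, 0.1)    # more hits than false alarms: positive
reversed_ = d_prime(0.3, 0.6)    # fewer hits than false alarms: negative
```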
The learning curves for each of the architectures at each list length are shown in Figures 5 and 6. In contrast to the SRN, performance rose very quickly in the HRN particularly at the lower list lengths. At the longer list lengths the increase in the number of items to be learned slows down the rate of improvement. In addition, the variances for the HRN graphs are much lower than those of the SRN for three item study lists, particularly in the latter stages of training. In the SRN there is no guarantee that the information required to make a recognition decision will be maintained. Because the true gradient through time is not being followed, a set of initial weights that does not maintain the requisite information sufficiently from the outset cannot be trained to do so. As training progresses, however, the performance on those simulations for which sufficient separation was maintained will continue to improve, leading to higher variances. In the HRN, the information required to make the recognition decision is maintained in the Hebbian weights. Furthermore, the retention mechanism is constant and the only thing that jeopardises the asymptotic performance is the orthogonality of the hidden unit patterns. For this reason, the variances in the HRN tend to be smaller.
Another interesting feature is the drop in performance that occurs in the HRN. For instance, on three, four and five item lists, performance reaches a maximum in the first 100 epochs and then decreases to an asymptote. The drop is not simply a consequence of overfitting. The training results show the same decrement in performance as the test results.
As suggested by Lewandowsky (1991) the tanh function was used. Any two vectors whose components are chosen from a uniform distribution in the range -1 to 1 will be statistically independent (i.e. tend to orthogonality in high dimensional spaces). Lewandowsky (1991) demonstrates the decrease in catastrophic interference that ensues. However, unless large numbers of hidden units are used, mutual orthogonality of the pattern set quickly diminishes as the number of patterns increases. Correspondingly the performance of the Hebbian matrix drops.
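The tendency of such vectors towards orthogonality as dimensionality grows can be illustrated by sampling. This is a sketch under stated assumptions: the dimensions, number of pairs and seed are arbitrary choices, and the function names are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Normalised dot product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_abs_cosine(dim, pairs, rng):
    """Average |cosine| between random vectors with components uniform in [-1, 1]."""
    total = 0.0
    for _ in range(pairs):
        u = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / pairs


rng = random.Random(0)
low_dim = mean_abs_cosine(4, 200, rng)      # few units: substantial overlap
high_dim = mean_abs_cosine(400, 200, rng)   # many units: close to orthogonal
```

The average overlap shrinks as the number of units increases, which is why small hidden layers lose mutual orthogonality quickly as patterns are added.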
In addition to the drop incurred by increasing the number of input patterns, the orthogonality of the hidden unit patterns decreases as training progresses (see Figure 7). The measure of orthogonality used is:
where $v_i$ is the average hidden pattern for input symbol i, and n is the number of input symbols. This measure is one minus the average normalised dot product. It ranges between zero and one and is at a maximum when all patterns are orthogonal to one another. The mutual orthogonality of the hidden patterns increases to a maximum after the first 100 epochs and then drops slowly to an asymptote as training progresses. It is this drop in the orthogonality that compromises the Hebbian memory and leads to the drop in performance of the network.
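One way such a measure could be computed is sketched below. The formula itself is not reproduced in this text, so the details are assumptions: the average is taken over all distinct pairs of patterns, and the absolute value of each normalised dot product is used so that the result stays between zero and one. The function name is hypothetical.

```python
import math
from itertools import combinations


def orthogonality(patterns):
    """One minus the mean absolute normalised dot product over distinct pairs."""
    similarities = []
    for u, v in combinations(patterns, 2):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        similarities.append(abs(dot) / norm)
    return 1.0 - sum(similarities) / len(similarities)


orthogonal_set = orthogonality([[1.0, 0.0], [0.0, 1.0]])   # maximally orthogonal
collinear_set = orthogonality([[1.0, 1.0], [2.0, 2.0]])    # same direction
```

In use, each pattern passed to the function would be the hidden activation averaged over all presentations of one input symbol.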
There are methods for encouraging orthogonality of hidden unit patterns (French, 1991; Chauvin, 1989) and it seems likely that these would improve performance.
Although the result is not conclusive, because only small values of the list length have been sampled, the graph suggests that training time (in epochs) grows in an approximately linear fashion with the list length and vocabulary.
Hidden Unit Analysis
In order for the HRN to operate successfully it must form stable
representations of the input items on its hidden units. Figure 13
shows the hidden patterns of the HRN plotted on the first two principal
components[9]. Two hundred sequences each
containing five patterns were presented and, hence, Figure 13 shows
1000 hidden unit patterns. The dominant feature is the separation on
the input pattern. To successfully autoassociate the input patterns the
network maps all input/context patterns that correspond to the same
input onto similar locations in hidden unit space, and attempts to
separate these clusters from each other. Considering that there are on
average 143 of each input pattern the clusters are very tight.
In contrast, position information is only poorly retained. Figure 14 shows the HRN hidden unit patterns plotted on the first two canonical discriminants[10] when grouped on position (i.e. 1, 2, 3, or P for probe). The probe and third position patterns are reasonably well separated, but the first and second position patterns show very little separation. Figure 15 shows the canonical discriminant analysis (CDA) plots of the hidden patterns grouped on position (i.e. 1, 2, 3, or P for probe) and on item (i.e. a, b, c, d, e, or f) to the same scale. Clearly, the separation of the items is much greater than the separation of the positions. Unlike the serial recall task explored by Wiles and Phillips (1991), position information is not required to do the recognition task and, hence, separation on the basis of position is poor[11].
In contrasting the SRN and HRN hidden unit representations, the first point to note is that the HRN is better at recognising patterns from the study list than the SRN. Examining the hidden patterns on the "Answer" timestep shows that the Target and Distractor patterns are not as well separated in the SRN (see Figure 16). It is more difficult to construct a decision boundary that will correctly classify these patterns as "Yes" or "No" responses.
While the SRN must form a representation of the entire list in its hidden unit activation patterns, the HRN can rely on the Hebbian memory to store items. Figures 17 and 18 are Hierarchical Cluster Analysis (HCA; Elman, 1989) diagrams of the hidden patterns after the final study item has been input. HCA groups patterns of activation in a tree structure so that patterns that are close to each other in the hidden unit space occupy adjacent branches of the tree. When the final study item is input, the SRN must have a representation of the entire list in its hidden pattern. In contrast, the HRN need only separate the current items in its hidden activations because items from the first and second positions of the list are stored in the memory. The labels in Figures 17 and 18 indicate whether the corresponding study list contained an "a" and, if so, whether that "a" occurred in the third position. In both the HRN and SRN the "a"s in the third position are well separated from the rest of the patterns. In the HRN, however, first and second position "a"s are mixed with the other patterns. By contrast, the SRN "a"s are well separated regardless of position. The learning task that the HRN must solve is therefore easier than that of the SRN.
Conclusions
The introduction outlined six criteria for a model of
human memory, and in this section the HRN's performance is evaluated
with respect to these criteria.
The first of the criteria was the learning of representation. The analysis of the hidden unit patterns of the HRN demonstrates that the model learns a representational landscape. In the recognition task explored in this paper, all items were equal: the input patterns were orthogonal and all items filled identical functional roles. Consequently, the patterns were evenly distributed throughout the space. In tasks in which either the input patterns vary or the items are required to fill different functional roles (such as Elman's, 1989, prediction task) the hidden unit space would be more structured, and it would be possible to assess how the memory performs on perceptually and semantically related items.
The development of decision boundaries was the second criterion. In the HRN, the decisions are made by the backpropagation connections from the context-to-hidden layer and from the hidden-to-output layer. Hence, the HRN does satisfy the condition of learning its decision criteria. Furthermore, as training progressed the performance increased, indicating that the decision criteria were being altered so as to better separate the target patterns from the distractors.
The third criterion was the development of control regimes. In such a simple task the degree of control required was limited. The only major distinction was between the study phase in which the network was required to output a "Blank" and the decision phase when the network responded with either "Yes" or "No" indicating whether it considered the probe item to have been in the study list. While it often took several hundred epochs of training before the network began to correctly determine whether to output "Yes" or "No", the control question of whether to output a "Blank" or a decision was typically learned very quickly, usually well within one hundred epochs. While this example is very simple it demonstrates that the HRN is capable of acquiring a control regime.
Learning is only useful if it can be done in a feasible time period. The time complexity of learning seems to be approximately linear in the length of the lists (and size of the vocabulary) with which the HRN is tested.
The first of the memory criteria was the degree of interference. The Hebbian memory of the HRN allowed it to retain items without significant interference. The HRN inherits the performance characteristics from the matrix memory and maintains its performance (at least until vocabulary size becomes larger than the number of hidden units) with extra items decrementing performance only marginally. Since human vocabularies are in the order of 10000 items and the number of items that could conceivably be remembered is even greater, it is critical that a model of memory be able to store and retrieve a large number of items. The ability of the HRN to generalise well even as the number of vocabulary items increased is of particular importance, and is one of the major distinguishing factors between it and the SRN.
The last of the criteria was the rapid binding in memory of already established representations. The possible representations are developed by the backpropagation mechanism over the course of training. Specific memories, however, are stored in the Hebbian weights. Hence, memories are laid down in a single timestep, while representations are formed over a much longer timespan. It is the dual memory architecture that avoids catastrophic interference, allows for the significant improvement in generalisation, and accounts for the dramatically different timespans of memory and learning.
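The single-timestep character of Hebbian storage can be illustrated with a generic autoassociative matrix memory. This is a sketch under stated assumptions rather than the HRN's actual update rule: the patterns are six orthogonal unit vectors standing in for items a to f, storage is a plain outer-product update, and the function names are hypothetical.

```python
import numpy as np

def store(memory, pattern):
    """One-shot Hebbian update: add the outer product of the pattern with itself."""
    return memory + np.outer(pattern, pattern)

def familiarity(memory, probe):
    """Match strength between a probe and everything held in the memory."""
    return float(probe @ memory @ probe)

# Assumed vocabulary: six orthogonal items (rows of the identity matrix).
vocabulary = np.eye(6)

# Study list of three items, each stored in a single update step.
memory = np.zeros((6, 6))
for item in (0, 1, 2):
    memory = store(memory, vocabulary[item])

target_match = familiarity(memory, vocabulary[1])      # studied item
distractor_match = familiarity(memory, vocabulary[4])  # unstudied item
```

With orthogonal patterns, each stored item contributes only to its own match strength, so a studied probe yields a match of 1.0 and an unstudied probe a match of 0.0, and later items do not disturb earlier ones.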
While there is still work to be done to establish the extent of the capabilities of the HRN, it has been demonstrated that by integrating insights from backpropagation models with those from the mathematical memory modelling literature it is possible to fulfill both learning and memory criteria.
References
Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S.
(1977). Distinctive features, categorical perception and probability
learning: Some applications of a neural model. Psychological Review,
84, 413-451.
Anderson, J. R. (1990). The Adaptive Character of Thought. Lawrence Erlbaum Associates, Hillsdale, NJ.
Anderson, J. R., & Milson, R. (1989). Human memory: An adaptive perspective. Psychological Review, 96 (4), 703-719.
Anderson, J. R., & Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2 (6), 396-408.
Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In Spence, K. W., & Spence, J. T. (Eds.), The Psychology of Learning and Motivation, pp. 90-191. Academic Press.
Bickhard, M. H. (1991). The import of Fodor's anti-constructivist argument. In Steffe, L. P. (Ed.), Epistemological Foundations of Mathematical Experience, chap. 2, pp. 14-25. Springer Verlag, New York, NY.
Brousse, O., & Smolensky, P. (1989). Virtual memories and massive generalisation in connectionist combinatorial learning. In Program of the Eleventh Annual Conference of the Cognitive Science Society, pp. 380-387 Hillsdale, NJ. Lawrence Erlbaum Associates.
Brunswik, E. (1956). Perception and the Representative Design of Psychological Experiments. University of California Press, Berkeley, CA.
Campbell, D. T. (1974). Evolutionary epistemology. In Schilpp, P. A. (Ed.), The Philosophy of Karl Popper. Open Court, La Salle, IL.
Chappell, M., & Humphreys, M. S. (1994). Autoassociative neural network for sparse representations: Analysis and application to models of recognition and cued recall. Psychological Review, 101, 103-128.
Chauvin, Y. (1989). A backpropagation algorithm with optimal use of hidden units. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems, pp. 519-526. Morgan Kaufmann.
Cottrell, G. W., & Tsung, F. S. (1993). Learning simple arithmetic procedures. Connection Science, 5 (1), 37-58.
Dell, G. S., Juliano, C., & Govindjee, A. (1993). Structure and content in language production: A theory of frame constraints in phonological errors. Cognitive Science, 17, 149-195.
Eich, J. M. (1982). A Composite Holographic Associative Recall Model. Psychological Review, 89 (6), 627-661.
Elman, J. L. (1989). Representation and structure in connectionist models. Technical Report 8903, Center for Research in Language, University of California, San Diego, CA.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Estes, W. K. (1991). Cognitive architectures from the standpoint of an experimental psychologist. Annual Review of Psychology, 42, 1-28.
French, R. (1991). Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. Technical Report 51, Center for Research on Concepts and Cognition, Indiana University, IN.
Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91 (1), 1-67.
Greene, R. L. (1992). Human Memory: Paradigms and Paradoxes. Lawrence Erlbaum Associates, Hillsdale, NJ.
Heathcote, A. (1993). An ART model of human recognition memory. In Leong, P., & Jabri, M. (Eds.), Proceedings of the Fourth Australian Conference on Neural Networks, pp. 212-215.
Hetherington, P., & Seidenberg, M. S. (1989). Is there catastrophic interference in connectionist networks? In Program of the Eleventh Annual Conference of the Cognitive Science Society, pp. 26-33 Hillsdale, NJ. Lawrence Erlbaum Associates.
Hinton, G. (1993). Tutorial on neural networks. Conducted at University of Sydney.
Hintzman, D. L. (1984). Minerva 2: A simulation model of human memory. Behaviour Research Methods, Instruments, and Computers, 16 (2), 96-101.
Hintzman, D. L. (1991). Why are formal models useful in psychology? In Hockley, W. E., & Lewandowsky, S. (Eds.), Relating Theory and Data: Essays on Human Memory in Honour of Bennet B. Murdock, pp. 39-56. Lawrence Erlbaum Associates, Hillsdale, NJ.
Hintzman, D. L., & Block, R. A. (1971). Repetition and memory: Evidence for a multi-trace hypothesis. Journal of Experimental Psychology, 88 (3), 297-306.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Sciences, pp. 2554-2558. National Academy of Sciences.
Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic and procedural tasks. Psychological Review, 96 (2), 208-233.
Humphreys, M. S., Pike, R., Bain, J. D., & Tehan, G. (1989b). Global matching: A comparison of the SAM, Minerva II, Matrix and TODAM models. Journal of Mathematical Psychology, 33 (1), 36-67.
Jordan, M. I. (in press). Serial order: A parallel distributed processing approach. In Elman, J. L., & Rumelhart, D. E. (Eds.), Advances in Connectionist Theory: Speech. Lawrence Erlbaum Associates, Hillsdale, NJ.
Kotz, S., & Johnson, N. L. (1982). Encyclopaedia of Statistical Sciences. John Wiley and Sons, NY.
Kruschke, J. K. (1992). Alcove: An exemplar-based connectionist model of category learning. Psychological Review, 99 (1), 22-44.
Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In Hockley, W. E., & Lewandowsky, S. (Eds.), Relating Theory and Data: Essays on Human Memory in Honour of Bennet B. Murdock, pp. 445-476. Lawrence Erlbaum Associates, Hillsdale, NJ.
Lewandowsky, S., & Murdock, B. B. (1989). Memory for serial order. Psychological Review, 96 (1), 25-57.
Martin, E. (1965). Transfer of verbal paired associates. Psychological Review, 72, 327-343.
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In Bower, G. H. (Ed.), The Psychology of Learning and Motivation, pp. 109-165. Academic Press, NY.
Melton, A. W. (1963). Implications of short-term memory for a general theory of memory. Journal of Verbal Learning and Verbal Behaviour, 2, 1-21.
Mozer, M. C. (1992). Induction of multiscale temporal structure. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4, pp. 325-332 San Mateo: CA. Morgan Kaufmann.
Mozer, M. C. (1993). Neural net architectures for temporal sequence processing. To appear in: A. Weigend & N. Gershenfeld (Eds.) Predicting the future and understanding the past. Redwood City, CA: Addison-Wesley Publishing.
Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89 (6), 609-626.
Murdock, B. B. (1987). Learning in a distributed memory model. In Izawa, C. (Ed.), Current Issues in Cognitive Processes, chap. 4, pp. 69-106. Lawrence Erlbaum Associates. The Tulane Flowerree Symposium on Cognition.
Murdock, B. B., & Lamon, M. (1988). The replacement effect: Repeating some items while replacing others. Memory & Cognition, 16 (2), 91-101.
Nolfi, S., Parisi, D., Vallar, G., & Burani, C. (1990). Recall of sequences of items by a neural network. In Touretzky, D. S., Elman, J. L., Sejnowski, T. J., & Hinton, G. E. (Eds.), Proceedings of the 1990 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, CA.
Osgood, C. E. (1949). The similarity paradox in human learning: A resolution. Psychological Review, 56, 132-143.
Phillips, S. (1991). Serial recall using an Elman net with hints. Unpublished manuscript.
Phillips, S., & Wiles, J. (1993). Exponential generalisations from a polynomial number of examples in a combinatorial domain. Proceedings of the 1993 International Joint Conference on Neural Networks, 505-508.
Pike, R. (1984). Comparison of convolution and matrix distributed memory systems for associative recall and recognition. Psychological Review, 91 (3), 281-293.
Plate, T. (1991). Holographic reduced representations: Convolution algebra for compositional distributed representations. In Mylopoulos, J., & Reiter, R. (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence, pp. 30-35 Sydney, Australia. Morgan Kaufmann Publishers.
Plate, T. (1992). Holographic recurrent networks. In Giles, C. L., Hanson, S. J., & Cowan, J. D. (Eds.), Advances in Information Processing Systems, Vol. 5, pp. 34-41 San Mateo, CA. Morgan Kaufmann.
Postman, L. (1969). Experimental analysis of learning to learn. In Bower, G. H., & Spence, J. T. (Eds.), Psychology of Learning and Motivation, Vol. 3, pp. 241-297. Academic Press.
Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of Associative Memory. Psychological Review, 88, 93-134.
Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97 (2), 285-308.
Ratcliff, R., Clark, S. E., & Shiffrin, R. M. (1990). The list-strength effect: I. Data and discussion. Journal of Experimental Psychology: Learning, Memory and Cognition, 16, 163-178.
Regier, T. (1992). The acquisition of lexical semantics for spatial terms: A connectionist model of perceptual categorisation. Technical Report TR-92-062, International Computer Science Institute, Berkeley, CA.
Reilly, R. (1993). A connectionist attentional shift model of eye-movement control in reading. In Proceedings of the 15th Annual Meeting of the Cognitive Science Society Boulder, CO. To appear.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.
Schmidhuber, J. (1991). Neural sequence chunkers. Technical Report FKI-148-91, Technische Universitaet Muenchen, Institut fuer Informatik, Munich, Germany.
Schmidhuber, J. (1992). A fixed size O(n^3) time complexity learning algorithm for fully recurrent networks. Neural Computation, 4 (2), 243-248.
Shepard, R. N. (1967). Recognition memory for words, sentences and pictures. Journal of Verbal Learning and Verbal Behaviour, 6, 156-163.
Shepard, R. N. (1987). Towards a universal law of generalisation for psychological systems. Science, 237, 1317-1323.
Stornetta, W. S., & Huberman, B. A. (1987). An improved three-layer backpropagation algorithm. In Caudill, M., & Butler, C. (Eds.), IEEE First International Conference on Neural Networks, pp. 637-645 San Diego, CA. IEEE.
Swets, J. A., & Green, D. M. (1961). Sequential observations by human observers of signals in noise. In Cherry, C. (Ed.), Information Theory: Proceedings of the fourth London symposium, pp. 177-195 London. Butterworth.
Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1 (1), 39-46.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1987). Phoneme recognition using time-delay neural networks. Technical Report TR-1-0006, ATR Interpreting Telephony Research Laboratories, Japan.
Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4, pp. 325-332 San Mateo: CA. Morgan Kaufmann.
Wiles, J., & Phillips, S. (1991). Serial recall of binary sequences. Unpublished manuscript.
Williams, R. J., & Zipser, D. (1989). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1 (1), 87-111.
Williams, R. J., & Zipser, D. (1990). Gradient-based learning algorithms for recurrent networks. In Chauvin, Y., & Rumelhart, D. E. (Eds.), Backpropagation: Theory, Architectures and Applications, pp. 1-42. Erlbaum, Hillsdale, NJ.