The Effect of the Environment on Memory: A Connectionist Model

Simon Dennis

Department of Psychology

The University of Queensland

Submitted 18th October 1995

Abstract

The Hebbian Recurrent Network (HRN) is developed to model the interaction of the environment and human memory. The HRN integrates work in the mathematical modelling of memory with that in error correcting connectionist networks by incorporating the matrix model (Pike, 1984; Humphreys, Bain & Pike, 1989) in the Simple Recurrent Network (SRN, Elman, 1989, 1990). The result is an architecture which has the desirable memory characteristics of the matrix model such as low interference and massive generalisation, but which is able to learn appropriate encodings for items, decision criteria and the control functions of memory which have been chosen a priori in the memory literature. Simulations demonstrate that the HRN is well suited to the recognition task. When compared with the SRN, the HRN is able to learn longer lists, generalises from smaller training sets, and is not degraded significantly by increasing the vocabulary size.


Introduction

How people acquire the encoding and retrieval functions of the memory system has important ramifications for the study of human retention. One of the most critical of these ramifications is the nature of the relationship between the environment and the mechanism of memory. The idea that the human cognitive system is adapted to the environment has a long history (Anderson, 1990; Brunswik, 1956; Campbell, 1974; Shepard, 1987). However, the role that the environment might play in models of human memory has, until recently, had a relatively small impact on theory construction.

If the environment is to play a more prominent role in memory theorising it is important to develop models in which the environment and the mechanism interact to produce behaviour. The first model to take a global view of the impact of the environment was Anderson and Milson's (1989) rational analysis of memory. In this model, it is assumed that the memory system has adapted through its evolutionary history to the distributions of items with which it is faced. Anderson and Milson (1989) go on to outline a Bayesian framework to bridge the gap between the environmental statistics and measures of experimental performance and show that to a first degree of approximation the system reproduces the results of manipulating frequency, recency and item spacing and gives insight into priming and fan effects.

Anderson and Milson's (1989) Bayesian approach is not, however, a process account of environmental optimisation. To develop such an account it is necessary to specify a learning mechanism capable of implementing the Bayesian decision formulation. Error-correcting backpropagation networks adapt to the statistics of the training environments in which they are immersed, providing an analogy to the learning processes that occur within a person's lifetime. There are, however, a number of obstacles which discourage the direct application of backpropagation networks in the memory domain. In particular, the degree of interference or unlearning typically found in backpropagation networks (McCloskey & Cohen, 1989; Ratcliff, 1990) far exceeds that found in subjects.

In the modelling work presented in this paper, a hybrid architecture called the Hebbian Recurrent Network (HRN), employing both Hebbian and backpropagation learning rules, is proposed. The architecture is simulated to ensure that it embodies fundamental criteria for a model of human memory such as the ability to form memories rapidly without excessive interference.

An Adaptive Memory

Within the recent literature there has been considerable debate about the ability of backpropagation models to capture human memory phenomena. While there is a feeling that backpropagation models have much to offer, the results thus far have been mixed (Ratcliff, 1990; Lewandowsky, 1991). Backpropagation models provide mechanisms by which encodings, decision criteria and control functions can be learned as a consequence of exposure to the environment, yet, on some very basic variables like the degree of interference, they have not performed as well as conventional memory models.

Estes (1991) makes a telling comment in concluding his discussion of the difficulties of connectionist models:

"A stray thought that comes to mind is that the situation is a bit reminiscent of a segment of the history of learning theory. In the learning theory of the period 1930-1960, it was assumed that the learning processes are basically the same at least for all of the higher animals and that the learning in the individual organism starts from a tabular rasa, the counterpart of a network of homogeneous and mutually connected nodes. During the next decade, however, under the impact of ethology and the beginnings of modern neuroscience, the prevalent view shifted to one that recognised biological constraints on learning ... and it is now quite generally assumed that learning in any organism, human or subhuman, builds on a substrate of species-specific predispositions and products of previous learning. Implementing this more biologically founded orientation in connectionist learning models is a tall order; but the effort will surely have to be made sooner or later, and the results may cast some of the current problems in a new light." (Estes, 1991, p 24)

The purpose of this paper is to start to address what forms of bias must be added to a network to make the learning tasks commonly solved by humans tractable and to ensure that the network performs these tasks in a psychologically plausible way.

The discussion begins by outlining some of the contributions that backpropagation architectures can make to the modelling of memory. Next, some of the major aspects of memory phenomena that remain obstacles for backpropagation models are examined. Finally, the Hebbian Recurrent Network (HRN), which integrates learning and memory models, is presented.

Learning Issues

In an interactive (learning) model, it is the interplay of the environment and the architecture that leads to performance (Bickhard, 1991). For example, in backpropagation models, performance is determined both by the architecture (i.e. structure of the interconnections, the transfer function, and the values of the parameters) and the statistical contingencies embodied by the training set.

The advantages of such an interplay can be examined in terms of the components of the memory system. In the following subsections, the learning of representation, decision criteria and control are considered in turn.

Learning Representation

The formation of appropriate encodings for items that will be entered into memory has been a difficult and largely unexplored area in the memory literature. While it is known that items with similar meanings and perceptual forms interact with each other to a greater extent than unrelated items (Osgood, 1949; Martin, 1965), there has been little progress in determining how the formation of such an encoding landscape may come about. Eich's (1982) presentation of the representations used in CHARM is indicative of the assumptions that are commonly made in the memory literature, at least among those who posit distributed memory systems. Items are represented by vectors of "abstract features", and the similarity of representations is assumed to be a consequence of overlap of these features. Typically, these representations are strongly constrained by the requirements of the mechanism in which they will be used. Eich (1982) assumes statistical independence of the representations of unrelated words, with the dot product being the measure of similarity of items. The dot product is a particularly good choice in the context of the convolution/correlation mechanism of CHARM as it means that stimulus and response generalisation are immediate consequences. However, since the nature of the component features is deliberately left unspecified, there is no way to independently decide upon the values of the inter-item dot products, and typically these are free parameters, which are optimised to fit the empirical data.
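As a minimal illustration of the dot-product similarity measure described above, the following Python sketch compares hypothetical feature vectors. The feature values and item names are invented for illustration and are not taken from Eich (1982). Items that share most features yield a large dot product, while items whose features agree and disagree equally often yield a dot product of zero.

```python
def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical "abstract feature" vectors (illustrative values only).
idea = [1, -1, 1, 1, -1, 1]
concept = [1, -1, 1, 1, 1, 1]     # differs from `idea` on a single feature
table = [1, 1, -1, 1, -1, -1]     # agrees with `idea` on three features, differs on three

sim_related = dot(idea, concept)   # high overlap
sim_unrelated = dot(idea, table)   # no net overlap
```

In a model of this kind the values of such dot products are exactly the quantities that are usually left as free parameters.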

Backpropagation models are able to construct internal representations (Rumelhart, Hinton, & Williams, 1986), thus offering a way of avoiding many representational assumptions. Typically, the hidden unit representations are formed as a consequence of both input similarity (e.g. the words "been" and "bean" might assume similar representations since they have similar orthographics and identical phonology) and functional similarity (e.g. the words "idea" and "concept" might adopt similar representations because they are used in functionally similar ways). Elman (1989, 1990) has demonstrated how a network in which the input representations have no similarity structure can exploit the functional similarity in the statistics of the training set to create a similarity landscape on the hidden units. Hence, backpropagation networks introduce a principled way in which "abstract features" might be formed as a consequence of the environment with which they are faced.

Learning Decision Criteria

Another aspect of the memory task that is usually held constant is the decision mechanism. For example, consider recognition as modelled by SAM, Minerva II, TODAM, CHARM and the matrix model. A signal detection framework is applied. Each model calculates a global matching strength (Humphreys, Pike, Bain & Tehan, 1989b) to be compared against a criterion. In general, the criterion is assumed to be flexible allowing for some adaption, but the form of the matching strength function remains the same. There is no sense in which the matching function could be said to have been acquired.
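The fixed decision mechanism described above can be sketched as follows. This is a simplified illustration rather than the formula of any one of the models named: it assumes an autoassociative matrix memory built from outer products, a matching strength of the form probe x memory x probe', and a hypothetical criterion value. The function names and item vectors are likewise invented.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]

def add(m1, m2):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

def matching_strength(memory, probe):
    """Global match: probe * memory * probe' (a single scalar)."""
    n = len(probe)
    retrieved = [sum(probe[i] * memory[i][j] for i in range(n)) for j in range(n)]
    return sum(r * p for r, p in zip(retrieved, probe))

def recognise(memory, probe, criterion):
    """Respond 'old' when the matching strength exceeds the criterion."""
    return matching_strength(memory, probe) > criterion

# Study phase: superimpose the autoassociation of each (hypothetical) item.
study = [[1, 1, -1, -1], [1, -1, 1, -1]]
memory = [[0] * 4 for _ in range(4)]
for item in study:
    memory = add(memory, outer(item, item))

old_probe = [1, 1, -1, -1]   # appeared in the study list
new_probe = [1, -1, -1, 1]   # orthogonal to both studied items
```

Only the criterion passed to `recognise` can vary here; the form of `matching_strength` is fixed in advance, which is the sense in which the matching function is not acquired.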

In the majority of memory paradigms, however, subjects become more accurate as they gain experience. While some of the improvement may be due to the refinement of the representation, a portion is attributable to an improvement in the ability to decide upon a response (Postman, 1969).

Learning Control

While the representations used in memory and the decision criteria that determine performance are well developed topics within the memory literature, the nature of the control processes is often left unelaborated (see Atkinson & Shiffrin, 1968, for an exception). For instance, how is it that the subject decides to give a response only during the test phase? Current mathematical models of memory assume that the answer to this question is embedded in the program or is part of the metamemory. Models that attempt to account for learning allow parameters on already existing processes to change rather than have these processes acquired through experience (cf. Murdock, 1987). In order to approach questions about the nature of rehearsal, elaborative processing, imaging and other metamemory skills it would be advantageous to have a more exacting theory about how simpler control processes, such as when to output a response, are achieved.

Memory Issues

In the previous section, the aspects of the memory system that might be acquired by a learning system such as a backpropagation network were outlined. To be serious alternatives to current models of memory, however, there are a number of criteria on which current memory models perform well that must be fulfilled. These memory criteria include maintaining significant capacity without introducing unrealistic amounts of interference, generalising to unseen lists of items using small numbers of training examples and the ability to establish memory traces rapidly.

Capacity and Interference

The problem of catastrophic interference has received a great deal of attention in the recent literature (Ratcliff, 1990; McCloskey & Cohen, 1989; Lewandowsky, 1991; Brousse & Smolensky, 1989; Hetherington & Seidenberg, 1989; Wiles & Phillips, 1991; Kruschke, 1992; Chappell & Humphreys, 1994). The difficulty arises when what has been learned is disrupted dramatically by subsequent learning. That is, there is too much retroactive interference. The problem is of particular importance in the modelling of recognition memory where the capacity is very large and the degree of interference is small. Certainly, interference is not nearly as marked as in standard feedforward backpropagation architectures (Ratcliff, 1990).

Within the literature two major strategies have emerged in order to deal with the problem of catastrophic interference. The first involves increasing the orthogonality of items that are to be learned in succession. Lewandowsky (1991), Kruschke (1992) and French (1991) have suggested methods for encouraging orthogonality and, hence, decreasing the amount of interference within feedforward networks.

The alternative approach has been to use recurrent architectures to encode lists of items rather than single items on their hidden units (Nolfi, Parisi, Vallar, & Burani, 1990; Wiles & Phillips, 1991)[1]. The network has the task of learning a single higher order encoding function rather than a sequence of items and, hence, the interference is reduced. Unfortunately, this encoding function becomes much more difficult to acquire as the number of items to be encoded increases. Consequently, existing recurrent architectures have severe capacity restrictions.

Generalisation

Another issue closely related to interference is the degree of generalisation. With what proportion of the entire space of possible input lists must the network be presented to perform well on unseen cases? In the context of memory, the important variable is the size of the vocabulary. Subjects have extensive vocabularies, yet are able to perform memory tasks involving any of the items within that vocabulary despite having limited experience with lists of these items. The second of the memory criteria, then, is that the model be capable of generalising on the basis of a very small proportion of the input set.

For traditional memory models, such generalisation is not a problem since the memory mechanisms are chosen so that they will encode lists independently of which items are to be encoded. In the recurrent network architectures that have been applied to memory phenomena (Nolfi et al., 1990; Wiles & Phillips, 1991), however, there is no such constraint. The network must learn to recognise each new item. In addition, in the serial recall task investigated by Wiles and Phillips (1991), a significant proportion of the possible orderings of the items must also have been encountered. Brousse and Smolensky (1989) highlight this issue and suggest that the process of building the list representation be hard wired by using a tensor product of the item's position and the item vector, effectively concatenating the list items. In order to fulfill the second memory criterion some such approach must be adopted.
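The sense in which a tensor product of position and item "effectively concatenates" the list can be shown with a small Python sketch. It assumes localist (one-hot) position vectors and hypothetical item vectors. It is an illustration of the idea, not Brousse and Smolensky's implementation, and the function names are invented.

```python
def outer(u, v):
    """Outer (tensor) product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]

def tensor_product_list(items):
    """Sum of position (x) item bindings, using one-hot position vectors."""
    n, dim = len(items), len(items[0])
    trace = [[0] * dim for _ in range(n)]
    for pos, item in enumerate(items):
        position = [1 if k == pos else 0 for k in range(n)]
        binding = outer(position, item)
        trace = [[t + b for t, b in zip(t_row, b_row)]
                 for t_row, b_row in zip(trace, binding)]
    return trace

# Hypothetical item vectors for a three-item list.
items = [[1, -1, 1], [-1, -1, 1], [1, 1, -1]]
trace = tensor_product_list(items)

flattened = [x for row in trace for x in row]
concatenated = [x for item in items for x in item]
```

With one-hot positions, each row of the trace holds exactly one item, so flattening the trace reproduces the items placed end to end. No learning is needed to build the list representation, whatever items are presented.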

Rapid Binding

The last of the memory criteria revolves around the distinction between memory and learning (Melton, 1963). Models such as that of Ratcliff (1990) assume that memorisation is best modelled by the learning processes of feedforward networks. Memory, however, seems to involve the rapid binding of already established representations rather than the acquisition of new representations (Wiles & Phillips, 1991). Enduring memories can be laid down with presentation durations of just a few hundred milliseconds - not much time for a learning mechanism to operate. The time course of learning-to-learn effects, in contrast, tends to be in the order of minutes or hours. Furthermore, the development of metamemory skills (e.g. realising that recall is more difficult than recognition) happens over several years. A model of memory should be capable of explaining the difference between memorisation and learning, and be able to account for the difference in the time scales.

The Hebbian Recurrent Network

The Objectives

In the last section, some of the important design criteria for a model that integrates learning and memory were discussed. In summary, an adequate model of human memory should be able to:
  1. Form representations that are sensitive to environmental statistics.
  2. Acquire decision criteria.
  3. Acquire control mechanisms.
  4. Avoid catastrophic interference and capacity restrictions.
  5. Generalise to lists of arbitrary order with large vocabularies.
  6. Form bindings on established representations rapidly.
Current backpropagation models can address the learning issues but fail on the memory criteria on which traditional memory models do very well. By embedding a traditional memory model within a backpropagation learning model, all of these issues can be addressed. In the next sections, a learning model and a memory model are selected and then integrated to form the Hebbian Recurrent Network (HRN).

The Model

Choosing a Learning Model
Memory phenomena involve temporal tasks. Furthermore, the information that is added to memory is affected by the current contents of memory. Memory is a closed-loop system in Murdock and Lamon's (1988) terminology (see also Lewandowsky & Murdock, 1989). While standard feedforward architectures are capable of forming representations and acquiring decision criteria, they cannot embody the temporal relations that characterise control problems.

The variety of temporal networks that have been developed fall into two classes[2]. The first class includes those networks that buffer input in order to maintain temporal information (e.g. Time Delay Neural Network (TDNN), Waibel, Hanazawa, Hinton, Shikano, & Lang, 1987; Waibel, 1989). In a TDNN, the input sequence is presented to the backpropagation network through a series of delays. Hence, if i(t) is the input at timestep t, the network receives i(t), i(t-1), i(t-2), i(t-3), i(t-4) and i(t-5) at the same time. In a memory task, the entire study list would be required by the network at the time of decision. While the depth of the network remains constant, the number of inputs grows both with the vocabulary and the length of the sequence. Not only is such an architecture prohibitively large, but it requires that each item be presented at each possible timestep within the training set. Only in this way can the TDNN learn to respond appropriately independently of the position of the item. Hence, a large training set would be required and the generalisation criterion would be violated.
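The growth in input size can be made concrete with a short Python sketch of a delay-line buffer. It assumes localist (one-hot) item coding, which is an assumption made here for illustration. The function names `one_hot` and `tdnn_inputs` are hypothetical and do not come from the TDNN literature.

```python
from collections import deque

def one_hot(index, vocabulary_size):
    """Localist code for one vocabulary item."""
    return [1 if k == index else 0 for k in range(vocabulary_size)]

def tdnn_inputs(sequence, vocabulary_size, n_taps):
    """Yield the concatenation i(t), i(t-1), ..., i(t-n_taps+1) at each timestep."""
    empty = [0] * vocabulary_size
    buffer = deque([empty] * n_taps, maxlen=n_taps)
    for index in sequence:
        # The newest pattern enters on the left; the oldest falls off the right.
        buffer.appendleft(one_hot(index, vocabulary_size))
        yield [x for pattern in buffer for x in pattern]

# Six taps correspond to i(t) through i(t-5) in the text.
frames = list(tdnn_inputs([2, 0, 1], vocabulary_size=3, n_taps=6))
input_width = len(frames[0])
```

Under this coding the input layer needs vocabulary size times number of taps units (18 here), and the same item occupies a different set of input units at every delay. That is why each item would have to be trained at each position.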

The second class subsumes recurrent networks such as the Jordan network (Jordan, in press), the Simple Recurrent Network (SRN, Elman, 1989, 1990), Back Propagation Through Time (BPTT, Rumelhart et al., 1986) and Real Time Recurrent Learning (RTRL, Williams & Zipser, 1989). In contrast to the TDNN, which assumes that the input is buffered to maintain information, recurrent networks postulate recurrent connections through which information is cycled and, hence, preserved.

The recurrent connections can emanate either from the output units (e.g. Jordan network) or from the hidden units (e.g. SRN, BPTT, RTRL). There are, however, problems that cannot be solved by the Jordan network because the information required to be maintained is not present at the output and is necessarily lost (Cottrell & Tsung, 1993; Dell, Juliano, & Govindjee, 1993). In particular, memory control paradigms such as rehearsal can operate without the requisite information in the outputs. Hence, to avoid restrictions on the control paradigms that can be implemented, hidden unit recurrency is required.

Having established the form of the architecture, it remains to choose the learning algorithm which will be applied. The BPTT algorithm unfolds a network (Rumelhart et al., 1986) creating a level for each timestep with tied weights between the timesteps. Because very deep networks are constructed by this method, training is often very difficult and time consuming. In addition, a great deal of memory is required. The RTRL algorithm follows the same gradient as a BPTT network that is unfolded for the entire length of a sequence, but does not require time or space proportional to sequence length. It is an $O(n^4)$ algorithm [3] (where n is the number of nodes) and is slow for large architectures.

The Simple Recurrent Network (SRN, Elman, 1989, 1990) can be thought of as an approximation to the fully recurrent networks such as BPTT and RTRL. Instead of backpropagating over the entire length of the sequence, it considers only the last timestep. While being faster (i.e. $O(n^2)$) than either BPTT[4] or RTRL, it does not actively seek to maintain relevant information, but makes use of the information that is maintained by chance. Because the current state is a consequence of the preceding states, it is possible that relevant information will be retained until needed. In practice, this information is often lost before such time as the error signal can be generated. Without useful information in the form of hints being required on the outputs, the ability of the SRN to remember is limited to short study lists (Phillips, 1991).

Despite its limited nature the SRN has performed well on the language prediction tasks to which it has been applied (Elman, 1989, 1990). The hidden unit patterns that were chosen by the network formed a similarity landscape that was related to syntactic and semantic structure in the training corpus. In addition, the SRN developed the ability to decide upon a response (or response set) given the prior context. Furthermore, the network was able to retain information that could influence later encoding and predictions, such as pluralisation, over short time spans - a limited form of control. Because the SRN fulfills the learning criteria established earlier it was chosen as the learning mechanism of the HRN.

Current recurrent network models of human memory (Nolfi et al., 1990; Wiles & Phillips, 1991) attempt to learn to build a representation of the entire list. This representation requires training on a significant proportion of all possible lists in order to ensure generalisation (Wiles & Phillips, 1991). People, however, do not receive such extensive training. To perform the sorts of tasks people do routinely, a model must include a form of bias that circumvents this learning problem, that is, the model must learn to encode items, not entire lists. However, while the recurrent activations are the only means by which information can be stored, they will continue to be used to store the sequence of items. One solution is to provide a separate mechanism that is capable of storing lists. Fortunately, such a mechanism has been the subject of research into human memory for several decades. In the following section, this literature is examined to find a mechanism suitable for inclusion into the SRN.

Choosing a Memory Model
From the extensive work in the verbal learning tradition during the 1960s and 1970s emerged a series of mathematical models that were influential during the 1980s (Raaijmakers & Shiffrin, 1981; Murdock, 1982; Eich, 1982; Pike, 1984; Hintzman, 1984). These models have made significant contributions to the precision with which memory models are now specified and have helped to focus research (Hintzman, 1991).

With the exception of Search of Associative Memory (SAM; Raaijmakers & Shiffrin, 1981), these models employ distributed representations and are all possible candidates for incorporation into the SRN. Minerva II (Hintzman, 1984) is an exemplar model, meaning that each memory trace is stored separately. Such a model is able to account for situational (Hintzman & Block, 1971; Greene, 1992) and categorical (Greene, 1992) frequency judgements, but incurs the cost of storage proportional to time. The alternative is to employ a blend memory in which images are superimposed upon each other. TODAM (Murdock, 1982), CHARM (Eich, 1982) and the Matrix model (Pike, 1984; Humphreys et al., 1989) are examples of superposition models, and were considered to be more feasible for implementation in the SRN. The final choice was between the correlation/convolution methods of TODAM and CHARM, and the Hebbian scheme of the Matrix model. While the Hebbian rule has well established roots in the connectionist literature (Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1982), the correlation/convolution method has also been studied in this context (Plate, 1991, 1992). Due to the arguments put forth by Pike (1984) that the matrix model is less subject to noise, and the obvious mapping of the cells of a matrix to the weights of a Hebbian network, the Matrix model was chosen. Allowing weights to store information instead of requiring the activation pattern to encode it was expected to improve performance in a plausible fashion (Schmidhuber, 1991; Mozer, 1992).
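The storage trade-off between exemplar and superposition (blend) memories can be sketched in Python as below. Both classes are deliberately stripped-down illustrations with hypothetical names and item vectors. They omit everything else that distinguishes Minerva II from the Matrix model (for example, context cues and encoding parameters).

```python
class ExemplarStore:
    """Keeps every study trace separately, so storage grows with each item."""

    def __init__(self):
        self.traces = []

    def study(self, item):
        self.traces.append(list(item))

    def stored_values(self):
        return sum(len(trace) for trace in self.traces)


class SuperpositionStore:
    """Adds each item's outer product into one fixed-size matrix."""

    def __init__(self, dim):
        self.matrix = [[0] * dim for _ in range(dim)]

    def study(self, item):
        for i, a in enumerate(item):
            for j, b in enumerate(item):
                self.matrix[i][j] += a * b

    def stored_values(self):
        return sum(len(row) for row in self.matrix)


# Hypothetical four-feature items.
items = [[1, -1, 1, -1], [1, 1, -1, -1], [-1, 1, 1, -1], [1, 1, 1, 1], [-1, -1, 1, 1]]
exemplar, blend = ExemplarStore(), SuperpositionStore(dim=4)
sizes = []
for item in items:
    exemplar.study(item)
    blend.study(item)
    sizes.append((exemplar.stored_values(), blend.stored_values()))
```

The exemplar store grows by one trace per study event, whereas the superposition store stays at a fixed number of cells. That fixed matrix is what maps naturally onto a fixed set of Hebbian weights.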

Description of the Hebbian Recurrent Network
There are two ways in which one may view the HRN: The first is to regard it as a matrix memory to which backpropagation weights have been added to allow the acquisition of the control, decision and representation aspects; the second way is to think of it as an SRN with longer term memory.

Figure 1: The Hebbian Recurrent Network (HRN). The solid arrows are sets of weights that are modified using the backpropagation algorithm. The dashed line represents the feeding of hidden unit activations through a set of Hebbian weights to the context units in preparation for the next timestep. The Hebbian weights are updated after the activations are fed through. In addition, the teacher signal included the input pattern to ensure that the hidden unit states corresponding to different inputs were separated.

The architecture of the HRN is similar to that of the SRN, in that the input and context layers are completely connected to the hidden layer and the hidden layer is completely connected to the output layer (see Figure 1). In contrast to the SRN, which copies the contents of the previous hidden layer to the context layer, the hidden layer of the HRN is connected to the context layer by a set of Hebbian weights. It is these weights that form the memory of the system. At any given timestep, the context layer contains the results of the last memory retrieval.

Table 1: The HRN algorithm. Note that the scaling factor value was introduced to stop the context units from saturating. When updating the Hebbian weights, retaining $1-\beta$ and adding $\beta$ of the new vector ensures that weights are bounded (since the activations are bounded between 1 and -1), and that the mechanism will be stable. The Hebbian weights are updated during both training and testing.

  1. Set inputs. Context activations and Hebbian weights are set to zero before each sequence.
  2. Feed activation through the input and context weights to the hidden layer and from the hidden layer to the outputs using the tanh function (Stornetta & Huberman, 1987).

    $f(x) = \frac{2}{1 + e^{-x}} - 1$
    $A_H(t) = f(A_I(t) W_{IH}(t) + A_C(t) W_{CH}(t))$
    $A_O(t) = f(A_H(t) W_{HO}(t))$

    where $A_I$ are the input activations, $A_H$ are the hidden activations, $A_O$ are the output activations, $W_{IH}$ are the input to hidden weights, $W_{CH}$ are the context to hidden weights and $W_{HO}$ are the hidden to output weights.
  3. Use the usual backpropagation rule to train the input to hidden ($W_{IH}$), context to hidden ($W_{CH}$) and hidden to output ($W_{HO}$) weights.
  4. Calculate the values for the context units to be input at the next time step by multiplying the hidden activation pattern by the Hebbian matrix and dividing by the scaling factor (y).
    $A_C(t+1) = f(A_H(t) W_{HC}(t) / y)$

    The scaling factor was set to be $(\text{number of hidden units})^2$ which ensures that inputs to f are bounded by 1 and -1.
  5. Update the Hebbian network by autoassociating the hidden pattern.
    $W_{HC}(t+1) = (1-\beta) W_{HC}(t) + \beta A_H(t) A'_H(t)$
    where $W_{HC}$ are the hidden to context weights, $\beta$ is the memory decay parameter and the ' character indicates transposition.
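The forward part of the algorithm in Table 1 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the original simulation code: the backpropagation update of step 3 is omitted, the weight values, layer sizes and the decay value `beta=0.2` are hypothetical, and activations are treated as row vectors, with the Hebbian update computed as an outer product as the text describes. The activation function is written with `math.tanh(x / 2)`, which is algebraically equal to 2/(1+e^(-x)) - 1.

```python
import math

def f(x):
    """Symmetric sigmoid of Table 1; 2/(1+exp(-x)) - 1 equals tanh(x/2)."""
    return math.tanh(x / 2.0)

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]

def hrn_step(a_in, a_ctx, w_ih, w_ch, w_ho, w_hc, beta, scale):
    """One forward timestep: steps 2, 4 and 5 of Table 1 (backpropagation omitted)."""
    # Step 2: hidden and output activations.
    net_h = [x + y for x, y in zip(vec_mat(a_in, w_ih), vec_mat(a_ctx, w_ch))]
    a_hid = [f(x) for x in net_h]
    a_out = [f(x) for x in vec_mat(a_hid, w_ho)]
    # Step 4: probe the Hebbian memory to obtain the next context pattern.
    next_ctx = [f(x / scale) for x in vec_mat(a_hid, w_hc)]
    # Step 5: autoassociate the hidden pattern (decayed outer-product update).
    n = len(a_hid)
    next_w_hc = [[(1 - beta) * w_hc[i][j] + beta * a_hid[i] * a_hid[j]
                  for j in range(n)] for i in range(n)]
    return a_hid, a_out, next_ctx, next_w_hc

# Hypothetical network: 2 inputs, 3 hidden units, 1 output.
n_hid = 3
w_ih = [[0.5, -0.3, 0.2], [0.1, 0.4, -0.6]]
w_ch = [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0], [0.0, 0.0, 0.2]]
w_ho = [[0.7], [-0.5], [0.3]]

# Step 1: context activations and Hebbian weights start at zero.
context = [0.0] * n_hid
w_hc = [[0.0] * n_hid for _ in range(n_hid)]
history = []
for a_in in ([1.0, -1.0], [-1.0, 1.0]):
    a_hid, a_out, context, w_hc = hrn_step(
        a_in, context, w_ih, w_ch, w_ho, w_hc, beta=0.2, scale=n_hid ** 2)
    history.append((a_hid, a_out, list(context)))
```

On the first timestep the memory is empty, so the retrieved context is all zeros. From the second timestep on, the context carries whatever the hidden pattern retrieves from the accumulated Hebbian weights.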

The HRN algorithm is also similar to that of the SRN (see Table 1). Before each sequence the context unit activations and the Hebbian weights are set to zero. At each timestep the hidden units receive input from both the input units and the context units. The outputs of the network are calculated on the basis of the hidden patterns. The activation function is tanh. In addition, the hidden patterns are used to probe the matrix memory. The result of this probe is instantiated on the context units in preparation for the next timestep.
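The boundedness claims made for Table 1 can be checked with a short derivation. The sketch below assumes that the memory decay parameter (written $\beta$ here) satisfies $0 \le \beta \le 1$, that the Hebbian weights start at zero, and that every hidden activation $a_i(t)$ lies in $[-1, 1]$. The symbols $n$ (the number of hidden units), $a_i$ and $w_{ij}$ are introduced here for convenience.

```latex
\begin{align*}
|w_{ij}(t+1)|
  &\le (1-\beta)\,|w_{ij}(t)| + \beta\,|a_i(t)|\,|a_j(t)|
   \le (1-\beta) + \beta = 1
  && \text{whenever } |w_{ij}(t)| \le 1, \\
\Bigl|\sum_{i=1}^{n} a_i(t)\,w_{ij}(t)\Bigr|
  &\le \sum_{i=1}^{n} |a_i(t)|\,|w_{ij}(t)| \le n, \\
\Bigl|\frac{1}{n^{2}}\sum_{i=1}^{n} a_i(t)\,w_{ij}(t)\Bigr|
  &\le \frac{1}{n} \le 1 .
\end{align*}
```

By induction from zero initial weights, the first line keeps every Hebbian weight in $[-1, 1]$. The remaining lines show that a scaling factor of $n^2$ keeps the input to $f$ within $[-1, 1]$, and under these assumptions the bound is conservative.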

Figure 2: Memory processes in the HRN. The input to hidden and context to hidden weights implement the encoding process. The hidden to output and context to hidden weights implement the retrieval process, while the Hebbian weights from hidden to context store the memory traces.

The weights are updated after the hidden and output activation patterns have been updated, but before the context units are changed. The backpropagation weights are updated using the backpropagation rule as in the SRN. The Hebbian weights are updated by autoassociating the hidden unit activation pattern. The outer product of the hidden vector with itself is calculated and added to the existing Hebbian matrix.

At this point, the role of the "context" units may need some clarification. This terminology has been borrowed from Elman (1989, 1990) and does not correspond to the context cues that are often featured in models of human memory (cf. Gillund & Shiffrin, 1984; Humphreys et al., 1989; Chappell & Humphreys, 1993, 1994; Heathcote, 1993). The context units receive the outputs of memory and, hence, form the "context" of the next input to the memory system. The context cues used in the memory literature are derived from the experimental instructions and define the episode from which a subject is expected to recognise items. In the following simulations, change of context has been implemented by resetting both the context units and the Hebbian weights. This mechanism decreases simulation time, but is insufficient for tasks that involve retrieval from multiple lists (i.e. multiple contexts). To model these tasks, context could be included as part of the input. As is often done in distributed memory systems, each episode could be represented by a vector and the appropriate vector would be reinstated at the time of test to indicate from which context the network is required to retrieve.

Figure 2 outlines where the processes involved in memory tasks are implemented in the HRN. The input-to-hidden and context-to-hidden weights implement the encoding process. The context-to-hidden and hidden-to-output weights implement the retrieval process, and the Hebbian weights from the hidden units to the context units are responsible for storage.

Rationale of the HRN
The objective in designing the HRN was to satisfy the learning and memory criteria discussed above. The Hebbian weights have been added to accommodate the memory criteria and, provided the representations to be retained are available to matrix memory, it should be possible to satisfy these criteria. The learning criteria, however, impose two important constraints on the structure of the HRN.

Firstly, the representations should be determined by the dynamics of the network as a consequence of learning and not chosen a priori by the experimenter. Hence, the inputs to the memory system come from the hidden activations of the backpropagation network. The representational scheme will be formed as a result of the input and required outputs, that is, as a function of the environment in which the network must operate. Furthermore, it is also necessary that the matrix memory be autoassociative. The problem with using a feedforward Hebbian memory in a system in which the representations are learned is that the Hebbian memory itself is solely responsible for generating the representation at its outputs. Any Hebbian learning could only reinforce the output patterns that already occurred as a consequence of the initial weights. For instance, if the initial weights were all zero the only outputs that would be generated would also be zero. Using a Hebbian weight update scheme, however, the zero outputs would result in no change to the weights and hence nothing could be retained. Hence, the Hebbian system is necessarily autoassociative: it must take a hidden layer as its input, and that hidden layer must feed to and from error correcting weights.
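The zero-weight argument above can be demonstrated with a small Python sketch. It assumes a simple additive Hebbian rule with a hypothetical learning rate, and the function names are invented for illustration. In the feedforward case the output pattern is produced by the Hebbian weights themselves. In the autoassociative case the pattern is supplied from outside, as it would be by a hidden layer.

```python
import math

def f(x):
    """Symmetric sigmoid, equal to 2/(1+exp(-x)) - 1."""
    return math.tanh(x / 2.0)

def feedforward_hebbian_step(weights, x, rate):
    """The output is generated by the Hebbian weights, then associated with the input."""
    n_in, n_out = len(weights), len(weights[0])
    y = [f(sum(x[i] * weights[i][j] for i in range(n_in))) for j in range(n_out)]
    new_weights = [[weights[i][j] + rate * x[i] * y[j] for j in range(n_out)]
                   for i in range(n_in)]
    return new_weights, y

def autoassociative_hebbian_step(weights, pattern, rate):
    """The pattern comes from outside the memory and is associated with itself."""
    n = len(pattern)
    return [[weights[i][j] + rate * pattern[i] * pattern[j] for j in range(n)]
            for i in range(n)]

x = [1.0, -1.0, 1.0]
ff_weights = [[0.0] * 3 for _ in range(3)]
auto_weights = [[0.0] * 3 for _ in range(3)]
for _ in range(5):
    ff_weights, ff_output = feedforward_hebbian_step(ff_weights, x, rate=0.1)
    auto_weights = autoassociative_hebbian_step(auto_weights, x, rate=0.1)
```

Starting from zero, the feedforward memory never leaves zero because its own output is the only thing it can associate with the input. The autoassociative memory stores the externally supplied pattern immediately.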

The splitting of the Hebbian and backpropagation weights is not intended to suggest that these must represent different sorts of weights in the brain. What is being identified is the functional disparity. The Hebbian weights are responsible for storage while the backpropagation weights perform function approximation. It may be the case that both of these processes can be captured by a single weight update rule.

The second point is that the outputs of the Hebbian memory are available at the input to the feedforward backpropagation architecture. Consequently, the result of probing memory can be used to construct the next cue to memory, allowing the control aspects of the task to be acquired. In addition, such a recurrent system makes chains of recollection possible. The ability to perform these chains of recollection is a prominent aspect of everyday recall (Lewandowsky & Murdock, 1989).

The Task: Episodic Recognition

To test the performance of the HRN the episodic recognition paradigm has been chosen. In this paradigm, a list of words is presented to the subjects during a study phase. Subsequently, a series of words is shown, some of which occurred in the study list and some of which did not. Subjects are asked whether they have seen each item within the study context. Human subjects are surprisingly accurate on these sorts of tasks. For instance, Shepard (1967) gave subjects a study list of 540 words and found that they were able to recognise 88% in a subsequent forced choice test. Hence, one of the questions that will be addressed in the following simulations is how the performance of the HRN declines as the list length increases. Furthermore, each of the words used in Shepard's study was taken from a population of 600 English nouns and adjectives - a substantial vocabulary. A second issue of interest that derives from the experimental results is how the performance of the HRN decreases as the vocabulary size increases.

Study/Test versus Training/Test Terminology: A confusing aspect of the terminology that has been inherited from the memory and backpropagation literatures is the distinction between Study/Test paradigms (from the memory literature) and Training/Test methodology (from the backpropagation literature). Within the memory literature a Study/Test paradigm is one in which subjects are first given a study list and are then given a test list to assess their memory for the studied items. In the backpropagation literature the term "test" is used to indicate the assessment of how well the network has acquired the function which underlies the data with which it was presented. Typically, it is presented with unseen cases and matched against the desired response. The HRN uses both sets of terminology. The Hebbian weights embody the memory of the subject while the backpropagation weights embody the memory functions (including representations, decision criteria and control functions) that are acquired over the subject's entire past history. In these simulations both the previous history (i.e. training) and the current evaluation (i.e. testing) were characterised as a set of study/test paradigms. During training the backpropagation weights are altered and the example set will reflect the general experience of the subject, whereas in the test phase the backpropagation weights are frozen and the example set will reflect the statistics of presentation of items within the experimental setting.

Stepping through a Simple Example

Table 2 outlines how the task was mapped onto the network. On the input, each of the study items was presented one at a time. These were followed by a single probe item and then by a "Pause" signal. On the output, the network was required to respond with the "Blank" symbol on all occasions except during the answer phase at which time the network was required to respond with either "Yes" or "No" depending on whether the probe pattern occurred in the study list. Local encodings were used throughout and the non zero entries were set to 0.7 (to avoid the low gradient tails of the sigmoid).

Table 2: An example input/output sequence for the episodic recognition task.

Phase     First Study Item   Second Study Item   Third Study Item   Probe   Answer
Input     A                  B                   C                  B       Pause
Targets   Blank              Blank               Blank              Blank   Yes
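The mapping in Table 2 can be sketched as follows. This is a minimal illustration: the function names and the ordering of symbols are assumptions, and the extra targets that repeat the inputs at the output are omitted.

```python
INPUT_SYMBOLS = ["A", "B", "C", "Pause"]     # ordering is an assumption
OUTPUT_SYMBOLS = ["Blank", "Yes", "No"]
ON = 0.7                                     # nonzero entry of a local encoding

def local_code(symbol, symbols, on=ON):
    """Local (one unit per symbol) encoding with the active unit set to `on`."""
    return [on if s == symbol else 0.0 for s in symbols]

def encode_trial(study, probe):
    """Return (input vectors, target vectors) for one study/test sequence."""
    inputs = list(study) + [probe, "Pause"]
    answer = "Yes" if probe in study else "No"
    # "Blank" is required at every timestep except the final answer step.
    targets = ["Blank"] * (len(study) + 1) + [answer]
    return ([local_code(s, INPUT_SYMBOLS) for s in inputs],
            [local_code(t, OUTPUT_SYMBOLS) for t in targets])

# The sequence shown in Table 2: study A, B, C, then probe B.
x, y = encode_trial(["A", "B", "C"], "B")
```

Here `x` holds five input vectors (three study items, the probe and the "Pause" signal) and `y` holds the matching targets, ending in the "Yes" pattern.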

The backpropagation weights of the network were trained to perform recognition by presenting study/test sequences. Once training was complete the backpropagation weights (but not the Hebbian weights) were frozen.

Figure 3 shows the HRN applied to a small recognition task. In this case, the study lists consist of two items which are presented in the first and second timesteps. At the third timestep an item is input to be recognised and at the fourth timestep the recognition decision is made. The diagrams on the left hand side demonstrate the processing of a target item (i.e. an item that was present in the two item study list) and the diagrams on the right hand side show the processing of a distractor item (i.e. one that was not present in the two item list). There are four input patterns, one for each of the three items as well as one "Pause" symbol, which is input when the recognition decision is to be made. At the output there are three patterns. For the first three timesteps the "Blank" pattern is output demonstrating that the network has successfully learned not to "babble" when no output is expected. The other outputs represent "Yes" and "No", the possible responses to the recognition decision. Four hidden units were used and, consequently, there were four context units. In addition, the inputs were repeated at the output (giving four more outputs). These extra targets were added to condition the error space during learning and do not affect post training processing.

Figure 3: The trained HRN processing a target and a distractor.

Consider the processing of the target, as on the left hand side of Figure 3. The items of the list (i.e. "A" and "B") are presented to the inputs and, together with the context (which is zero to begin with), are used to form a cue. Because the Hebbian weights (which will be referred to as the medium term store, MTS) are zero, there is no activation of the context units. If the patterns at the hidden units in response to these two inputs are denoted by $A_H$ and $B_H$ respectively, then at the end of processing of the study list the Hebbian matrix is $(1-\beta)\beta A_H A_H' + \beta B_H B_H'$, where $\beta$ is the memory decay parameter. Note that up until this stage the target or distractor item has not been presented (only the study list), so the left and right hand sides are identical. At step three, the probe item is input. Apart from some noise, the context units are zero and, hence, the hidden units will contain $A_H$. In the example, the hidden pattern is (-.7,-.3,0,.1). When this pattern is multiplied by the Hebbian weights there is a match:

$$A_H'\left((1-\beta)\beta A_H A_H' + \beta B_H B_H'\right) = (1-\beta)\beta |A_H|^2 A_H'$$

since $A_H' B_H$ is approximately equal to zero. In this example, the context units have the pattern (-.6,-.3,0,.1). At step four, the network determines if the context units represent a valid item pattern. In this case, they do and the network responds with "Yes".
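The storage and probing steps of this example can be reproduced numerically. The sketch below assumes the update W <- (1 - beta) W + beta h h', which yields the Hebbian matrix given above, and uses idealised, exactly orthogonal hidden patterns with hypothetical values rather than the patterns shown in Figure 3.

```python
def store(weights, h, beta):
    """Decay the old weights and add the outer product: W <- (1-beta) W + beta h h'."""
    return [[(1 - beta) * w + beta * hi * hj for w, hj in zip(row, h)]
            for row, hi in zip(weights, h)]

def probe(weights, h):
    """Cue the store with a hidden pattern: returns the row vector h' W."""
    n = len(h)
    return [sum(h[i] * weights[i][j] for i in range(n)) for j in range(n)]

beta = 0.2                       # memory decay parameter
a_h = [0.7, 0.0, 0.0, 0.0]       # hypothetical, mutually orthogonal patterns
b_h = [0.0, 0.7, 0.0, 0.0]
c_h = [0.0, 0.0, 0.7, 0.0]

w = [[0.0] * 4 for _ in range(4)]
w = store(w, a_h, beta)          # study item "A"
w = store(w, b_h, beta)          # study item "B"

target_context = probe(w, a_h)       # probe with a studied item
distractor_context = probe(w, c_h)   # probe with an unstudied item
```

With these values the target probe returns a scaled copy of the "A" pattern, with scale (1 - beta) beta |A_H|^2, while the distractor probe returns the zero vector. With learned patterns that are only approximately orthogonal, the distractor response would be near zero rather than exactly zero.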

Now consider the distractor example on the right hand side of Figure 3. The study items are input in the same fashion. When the "C" item is input as the probe item, however, there is no such pattern in the MTS and, hence, the context units are near zero (here 0, .1, .1, -.1) when the recognition decision is to be made. The network has no trouble determining that this item has not been seen and responds "No".

While this architecture is capable of performing recognition tasks, it is not in principle restricted to such tasks. In general, the current input and the internal state are used to form a cue for MTS. Such a cue will recall any previously stored pattern that is either similar to or the same as itself. Hence, presenting a concept is likely to recall other related concepts. Once these concepts have been recalled they participate in the formation of the next cue set for MTS and so on.

Simulation Method

The SRN and HRN were applied to the episodic recognition task [5]. The recognition task was mapped onto the network in the manner described above (see Table 2). The effect of list length [6] and number of items in the training set was explored. In addition, the effect of the size of the vocabulary and the magnitude of the memory decay parameter on the performance of the HRN was investigated. All results are averaged over 20 trials with a learning rate of 0.05 and 20 hidden units. Except in those simulations in which the memory decay is explicitly being varied, it was set to 0.2. Unless otherwise stated the vocabulary was set to twice the list length and the training sets consisted of 500 sequences (i.e. 500 * (listlength+2) patterns). The test sets consisted of 200 sequences. Each trial involved restarting the network with a new set of weights and new training and testing sets. The simulator was written in C on a SparcStation II platform and is available from the author on request.
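The construction of the training sets can be sketched as follows. The sampling scheme is an assumption (distinct study items drawn without replacement and a probe drawn uniformly from the vocabulary), chosen because it gives the 0.5 probability of a positive test item when the vocabulary is twice the list length. It is not a description of the original simulator.

```python
import random

def sample_trial(list_length, rng):
    """Draw one study list and probe from a vocabulary of 2 * list_length items."""
    vocab = list(range(2 * list_length))
    study = rng.sample(vocab, list_length)   # distinct items (an assumption)
    probe = rng.choice(vocab)                # target with probability 0.5
    return study, probe, probe in study

def make_set(n_sequences, list_length, seed=0):
    """Build a set of study/test sequences with a reproducible generator."""
    rng = random.Random(seed)
    return [sample_trial(list_length, rng) for _ in range(n_sequences)]

list_length = 4
train = make_set(500, list_length)
# Each sequence has list_length study steps plus the probe and answer steps.
n_patterns = len(train) * (list_length + 2)
target_rate = sum(is_target for _, _, is_target in train) / len(train)
```

For a list length of four this gives 3000 training patterns, and `target_rate` should lie close to 0.5.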

Performance was measured at each decision timestep. The largest of the "blank", "yes" and "no" activation levels was taken to be the response of the network.

Measures

In the memory literature it is common to report performance in terms of d' (Swets & Green, 1961). The d' value is a measure of sensitivity derived from signal detection theory and is the distance between the noise and signal distributions divided by the standard deviation of the noise distribution (i.e. $(\mu_{signal} - \mu_{noise})/\sigma_{noise}$). In general these values are not available, however, and it is commonly calculated using the number of times the subject correctly recognises an item from the study list (i.e. the hits) and the number of times they incorrectly recognise an item when it did not occur in the study list (i.e. the false alarms). A hit was scored when the network responded "yes" when the test item did occur in the study list. A false alarm was scored when either a "yes" or a "blank" response was given when the correct answer was "no". The value of d' was found by subtracting the z score for the false alarm rate from the z score for the hit rate (i.e. z(hit rate) - z(false alarm rate)).
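The scoring rule and the d' calculation described above can be written as a short sketch. The function names and the example responses are hypothetical. Note that hit or false alarm rates of exactly 0 or 1 have no finite z score, so `inv_cdf` raises an error in those cases, and how such cases were handled in the simulations is not stated.

```python
from statistics import NormalDist

def response(blank, yes, no):
    """The network's response is the label of the largest output activation."""
    activations = {"blank": blank, "yes": yes, "no": no}
    return max(activations, key=activations.get)

def d_prime(responses, is_target):
    """d' = z(hit rate) - z(false alarm rate)."""
    on_targets = [r for r, t in zip(responses, is_target) if t]
    on_distractors = [r for r, t in zip(responses, is_target) if not t]
    hit_rate = sum(r == "yes" for r in on_targets) / len(on_targets)
    # Both "yes" and "blank" count as false alarms on distractor probes.
    fa_rate = sum(r in ("yes", "blank") for r in on_distractors) / len(on_distractors)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical responses: 3 of 4 targets hit, 1 of 4 distractors falsely accepted.
responses = ["yes", "yes", "yes", "no", "no", "no", "blank", "no"]
is_target = [True, True, True, True, False, False, False, False]
d = d_prime(responses, is_target)
```

In this example the hit rate is 0.75 and the false alarm rate is 0.25, so d' equals z(0.75) - z(0.25), which is roughly 1.35.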

Normally, when applying signal detection theory there are only two responses and a negative value of d' indicates bias against correct responding. In these simulations, however, a third response (i.e. "blank") is added and the chance baseline is reduced below 0.5 leading to negative d' values. However, the "blank" response was rarely given during the decision phase after the first few epochs of training. Hence, the problem quickly reduced to a two response paradigm and negative d' values were eliminated.

Results and Discussion

List Length

The performance of the SRN decreases very quickly as list length is increased and is at chance levels after only five items (see Figure 4). In the HRN, the decrease in performance on lists of length three to eight is gradual. There is a small increase from lists of length eight to lists of length nine, but it is not significant, F(1,38) = 0.563, p = 0.458. When the list length reaches 10 performance drops to chance. In these simulations, the vocabulary size was set to be twice the list length so as to maintain a 0.5 probability of a positive test item. Hence, when list length reaches 10 the vocabulary has reached 20 items. Beyond 20 items it is impossible for a network that has 20 hidden units to maintain orthogonal hidden unit representations. The inability of the network to memorise larger lists is a consequence of the fact that the vocabulary size has reached the number of hidden units. Therefore, in a psychological model, it would be assumed that the number of hidden units would be much greater than both the size of the list and the size of the vocabulary, as is the case in TODAM, CHARM and the matrix model.

Figure 4: Performance (d') as a function of list length on the HRN and SRN. The HRN performed better than the SRN on lists of length nine or less. On lists of length 10 the HRN's performance falls to chance. The bars indicate the 95% confidence intervals.

The learning curves for each of the architectures at each list length are shown in Figures 5 and 6. In contrast to the SRN, performance rose very quickly in the HRN particularly at the lower list lengths. At the longer list lengths the increase in the number of items to be learned slows down the rate of improvement. In addition, the variances for the HRN graphs are much lower than those of the SRN for three item study lists, particularly in the latter stages of training. In the SRN there is no guarantee that the information required to make a recognition decision will be maintained. Because the true gradient through time is not being followed, a set of initial weights that does not maintain the requisite information sufficiently from the outset cannot be trained to do so. As training progresses, however, the performance on those simulations for which sufficient separation was maintained will continue to improve, leading to higher variances. In the HRN, the information required to make the recognition decision is maintained in the Hebbian weights. Furthermore, the retention mechanism is constant and the only thing that jeopardises the asymptotic performance is the orthogonality of the hidden unit patterns. For this reason, the variances in the HRN tend to be smaller.

Another interesting feature is the drop in performance that occurs in the HRN. For instance, on three, four and five item lists, performance reaches a maximum in the first 100 epochs and then decreases to an asymptote. The drop is not simply a consequence of overfitting. The training results show the same decrement in performance as the test results.

Figure 5: Learning curves for the SRN as list length was varied. The above graphs show the average d' (on the test set) as the number of epochs increased to 1500. The bars represent the 95% confidence intervals for the 20 trials. The only condition in which performance was consistently above chance was when three items were presented. The average d' value was significantly above chance by 200 epochs and increased gradually as training progressed. An interesting feature of the three item curve is the increase in the variances as training progresses. Some sets of starting weights will maintain the information necessary for the recognition decision and others will not. The SRN is not following the gradient through time but rather is reliant on the information that is maintained. Hence, if the information is not present from the start the SRN continues to perform at chance levels. On those occasions when the information is preserved, however, its performance improves. As training progresses, the difference between those simulations in which the information is retained and those in which it is not increases, and consequently, the variance increases.

Figure 6: Learning curves for the HRN as list length was varied. The above graphs show the average d' (on the test set) as the number of epochs increased to 1500. The bars represent the 95% confidence intervals for the 20 trials. In contrast to the SRN, performance rose very quickly in the HRN, particularly at the lower list lengths. In all of the simulations except those involving the study lists of length 10, the average d' value was significantly above chance after 100 epochs. In addition, the variances for the HRN graphs were much lower than those of the SRN. Another interesting feature is the drop in performance that occurred in the HRN. In the three, four and five item lists, performance reached a maximum in the first 100 epochs and then decreased to an asymptote. It is important to note that the number of vocabulary items was maintained at twice the list length. Hence, the 8 and 9 item lists have 16 and 18 items to learn to autoassociate respectively, which would account for their slower increases. The dramatic drop in performance for the 10 item lists occurs as a consequence of saturation of the hidden unit representational space. The items in a study list of length 10 were chosen from a vocabulary of 20 items. To use the Hebbian memory efficiently, orthogonal hidden unit representations of the 20 items must be constructed. In these simulations, however, there were only 20 hidden units so that maintaining orthogonality is very difficult.

As suggested by Lewandowsky (1991) the tanh function was used. Any two vectors whose components are chosen from a uniform distribution in the range -1 to 1 will be statistically independent (i.e. tend to orthogonality in high dimensional spaces). Lewandowsky (1991) demonstrates the decrease in catastrophic interference that ensues. However, unless large numbers of hidden units are used, mutual orthogonality of the pattern set quickly diminishes as the number of patterns increases. Correspondingly, the performance of the Hebbian matrix drops.
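The tendency of such random vectors towards orthogonality can be illustrated with a small simulation. The dimensions, number of pairs and seed are arbitrary choices made for illustration.

```python
import math
import random

def cosine(u, v):
    """Normalised dot product of two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_abs_cosine(dim, n_pairs=200, seed=0):
    """Average |cosine| between pairs of vectors with components uniform in [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.uniform(-1, 1) for _ in range(dim)]
        v = [rng.uniform(-1, 1) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / n_pairs

low_dim = mean_abs_cosine(4)      # few units: pairs overlap substantially
high_dim = mean_abs_cosine(400)   # many units: pairs are close to orthogonal
```

The average overlap in the 400 dimensional case should be much smaller than in the 4 dimensional case, which is why a small hidden layer limits the number of patterns that can be kept nearly orthogonal.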

In addition to the drop incurred by increasing the number of input patterns, the orthogonality of the hidden unit patterns decreases as training progresses (see Figure 7). The measure of orthogonality used is:

where $v_i$ is the average hidden pattern for input symbol $i$, and $n$ is the number of input symbols. This measure is one minus the average normalised dot product. It ranges between one and zero and is at a maximum when all patterns are orthogonal to one another. The mutual orthogonality of the hidden patterns increases to a maximum after the first 100 epochs and then drops slowly to an asymptote as training progresses. It is this drop in the orthogonality that compromises the Hebbian memory and leads to the drop in performance of the network.
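The formula itself is not reproduced in the text above. One form that is consistent with the verbal description is $1 - \frac{2}{n(n-1)}\sum_{i<j} \frac{|v_i \cdot v_j|}{|v_i||v_j|}$. The absolute value (assumed here so that the measure stays between zero and one) and the restriction to distinct pairs are assumptions, and the original definition may differ. A minimal sketch of this assumed form:

```python
import math
from itertools import combinations

def orthogonality(patterns):
    """One minus the average absolute normalised dot product over distinct pairs.

    The absolute value and the use of distinct pairs (i < j) are assumptions.
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    pairs = list(combinations(patterns, 2))
    total = sum(abs(sum(a * b for a, b in zip(u, v))) / (norm(u) * norm(v))
                for u, v in pairs)
    return 1.0 - total / len(pairs)

orthogonal_set = orthogonality([[1.0, 0.0], [0.0, 1.0]])   # mutually orthogonal
parallel_set = orthogonality([[1.0, 1.0], [2.0, 2.0]])     # same direction
```

Under this form, a mutually orthogonal pattern set scores one and a set of parallel patterns scores approximately zero.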

Figure 7: Orthogonality of hidden unit patterns of the HRN as a function of training time. See text for a description of the calculation of the orthogonality measure. The graph shows the average over 20 runs. The bars represent the 95% confidence intervals. As training progresses the mutual orthogonality of the hidden unit patterns decreases, jeopardising the Hebbian network's ability to perform accurately.

There are methods for encouraging orthogonality of hidden unit patterns (French, 1991; Chauvin, 1989) and it seems likely that these would improve performance.

The Memory Decay Parameter

The HRN was tested on the 4-item recognition problem with a range of memory decay values. Figure 8 shows the effect on performance as measured by d'. The performance increases to a maximum at a memory decay of 0.3 and then trails off as the decay rate is increased to 1. The optimum value is dependent on the length of the list. Figure 9 shows that as the memory decay parameter is decreased to zero the serial position curve flattens. Hence, longer lists will perform optimally with smaller values of the memory decay parameter.
Figure 8: Performance (d') of the HRN as a function of the memory decay parameter on the 4-item recognition task. Performance peaks at a memory decay of about 0.3 and drops as the decay increases to 1. The bars indicate the 95% confidence intervals.

Figure 9: Performance (d') of the HRN as a function of serial position for three values of the memory decay parameter. At a memory decay of 0.1 the new value of each weight is primarily a consequence of the previous value of that weight. Hence, position makes little difference and a flat position curve is observed. As the memory decay increases from 0.1 to 0.5 and from 0.5 to 1.0 the emphasis is moved towards the end of the list. In the extreme case of a memory decay of 1.0 there is no contribution from earlier timesteps and hence performance on the last item is very good while all other items fall to chance. The bars indicate the 95% confidence intervals.
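The effect of the memory decay parameter on serial position can be made concrete by computing the coefficient that each study item carries in the Hebbian matrix once the whole list has been stored. The sketch assumes the update W <- (1 - beta) W + beta h h', under which the item at position k of an n item list carries a coefficient of beta (1 - beta)^(n - k). It ignores overlap between patterns and the learned decision stage, so it shows storage strength rather than d'.

```python
def position_strengths(list_length, beta):
    """Coefficient on each item's outer product after the full list is stored."""
    return [beta * (1 - beta) ** (list_length - k)
            for k in range(1, list_length + 1)]

# Storage strengths for a four item list at three decay values.
for beta in (0.1, 0.5, 1.0):
    print(beta, [round(s, 4) for s in position_strengths(4, beta)])
```

At a decay of 0.1 the strengths are nearly equal across positions (the first item retains about 73% of the strength of the last), at 0.5 each earlier item has half the strength of its successor, and at 1.0 only the final item is retained.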

Generalisation

One important criterion on which to judge the value of a network is its ability to generalise to unseen cases. In the current context, the HRN is required to respond correctly to test items when presented with novel study/test sequences[7]. As Geman, Bienenstock, and Doursat (1992) point out, there is a tradeoff between the generality of an architecture (degree of bias) and the number of training examples that will be required to train it. Backpropagation nets such as the SRN, BPTT and RTRL are approximations to unbiased systems, the approximation becoming more accurate as the number of hidden units is raised. Hence, they are prone to poor generalisation in the absence of significant numbers of training examples[8]. Figure 10 demonstrates the generalisation of the HRN as compared to the SRN. At 200 training examples the training performance of the SRN is much better than that of the HRN. The test performance, however, is not as good. The SRN is not generalising to unseen cases as well.
Figure 10: Performance of the SRN and HRN as a function of number of training examples. The HRN incorporates more bias than the SRN. Hence, as the number of training sequences decreases from 500 to 200 the change in the disparity between the training and testing performance is much greater for the SRN than for the HRN.

Vocabulary Size

Adding vocabulary items to the SRN tends to make learning much more difficult. Human subjects on the other hand seem to maintain very large vocabularies with surprisingly little interference or unlearning. Figure 11 shows performance as the vocabulary size is manipulated when the HRN is applied to the 4-item recognition task. As was noted earlier, performance is virtually unaffected by vocabulary size until it reaches the number of hidden units. At this point there is a sharp drop. As shown by the learning curves, it takes longer for the HRN to reach asymptote as the number of items increases, but even when the vocabulary contains 16 items it is able to do so well within the 1500 epochs allowed.

Figure 11: Performance (d') of the HRN as a function of the number of vocabulary items on the 4-item recognition task. The size of the vocabulary seems to make little difference until it reaches 20, at which point performance drops to below chance. The bars indicate the 95% confidence intervals.

Time Complexity

As mentioned in the previous section, the HRN takes longer to learn the recognition task as the number of vocabulary items increases. How quickly does the training time increase with the size of the input? Figure 12 shows the number of epochs the HRN required to reach its maximum performance plotted against the list length.

Figure 12: Time complexity of the HRN on the recognition task. The number of epochs required for the HRN to reach maximum performance versus the list length (i.e. size of the input). Over the range sampled, the growth appears to be approximately linear.

Although the result is not conclusive, because only small values of the list length have been sampled, the graph suggests that training time (in epochs) grows in an approximately linear fashion with the list length and vocabulary.

Hidden Unit Analysis

In order for the HRN to operate successfully it must form stable representations of the input items on its hidden units. Figure 13 shows the hidden patterns of the HRN plotted on the first two principal components[9]. Two hundred sequences each containing five patterns were presented and, hence, Figure 13 shows 1000 hidden unit patterns. The dominant feature is the separation on the input pattern. To successfully autoassociate the input patterns the network maps all input/context patterns that correspond to the same input onto similar locations in hidden unit space, and attempts to separate these clusters from each other. Considering that there are on average 143 of each input pattern the clusters are very tight.

In contrast, position information is only poorly retained. Figure 14 shows the HRN hidden unit patterns plotted on the first two canonical discriminants[10] when grouped on position (i.e. 1, 2, 3, or P for probe). The probe and third position patterns are reasonably well separated, but the first and second position patterns show very little separation. Figure 15 shows the canonical discriminant analysis (CDA) plots of the hidden patterns grouped on position (i.e. 1, 2, 3, or P for probe) and on item (i.e. a, b, c, d, e, or f) to the same scale. Clearly, the separation of the items is much greater than the separation of the positions. Unlike the serial recall task explored by Wiles and Phillips (1991), position information is not required to do the recognition task and, hence, separation on the basis of position is poor[11].

Figure 13: Principal components plot of the hidden patterns of the HRN. The HRN was run on 200 sequences and, hence, 1000 patterns are shown here. The hidden unit patterns cluster very tightly around the input pattern as a consequence of the autoassociation of the input.

Figure 14: Canonical discriminants plot of the hidden patterns of the HRN grouped on position. The labels indicate first (i.e. 1), second (i.e. 2), third (i.e. 3) and probe (i.e. P) items. While the probe and third position patterns are reasonably well separated, the first and second positions show very little separation. Furthermore, the range of both axes is very small. Position information is retained, but poorly.

Figure 15: Canonical discriminants plots of the hidden patterns of the HRN grouped on input and on position. Figure A shows the CDA plot when the patterns are grouped on input (i.e. a, b, c, d, e, or f). Figure B shows in the same scale the CDA plot when the patterns are grouped on position (i.e. 1, 2, 3, or P for probe). Separation when grouped on item is much greater than the separation when the hidden patterns are grouped on position. Such a pattern is expected since the recognition task does not require position information. In contrast, the item information is required at every timestep.

In contrasting the SRN and HRN hidden unit representations, the first point to note is that the HRN is better at recognising patterns from the study list than the SRN. Examining the hidden patterns on the "Answer" timestep shows that the Target and Distractor patterns are not as well separated in the SRN (see Figure 16). It is more difficult to construct a decision boundary that will correctly classify these patterns as "Yes" or "No" responses.

Figure 16: Canonical discriminant plots of the hidden unit patterns of the SRN and HRN grouped on output. The separation of target patterns, denoted by "O", and distractor patterns denoted by "X" as determined by the number of overlapping patterns is better for the HRN than for the SRN. This translates into improved performance for the HRN. The points labelled "B" correspond to "Blank" outputs.

While the SRN must form a representation of the entire list in its hidden unit activation patterns, the HRN can rely on the Hebbian memory to store items. Figures 17 and 18 are Hierarchical Cluster Analysis (HCA, Elman, 1989) diagrams of the hidden patterns after the final study item has been input. HCA groups patterns of activation in a tree structure so that patterns which were close to each other in the hidden unit space occupy adjacent branches of the tree. When the final study item is input, the SRN must have a representation of the entire list in its hidden pattern. In contrast, the HRN need only separate the current items in its hidden activations because items from the first and second positions of the list are stored in the memory. The labels in Figures 17 and 18 indicate whether the corresponding study list contained an "a" and, if so, whether that "a" occurred in the third position. In both the HRN and SRN the "a"s in the third position are well separated from the rest of the patterns. In the HRN, however, first and second position "a"s are mixed with the other patterns. By contrast, the SRN "a"s are well separated regardless of position. The learning task that the HRN must solve is easier than that of the SRN.

Figure 17: Hierarchical Cluster Analysis (HCA) of the hidden unit patterns of the SRN after the study list has been input. Note that the sequences that contain an "a" are clustered together. Such an organisation occurs since the SRN must respond with an "a" regardless of which position the "a" occurred in.

Figure 18: Hierarchical Cluster Analysis (HCA) of the hidden unit patterns of the HRN after the study list has been input. In contrast to the cluster analysis of the SRN, the HRN clusters only the third position "a"s well. The first and second position "a"s are mixed in with the non "a" patterns. The HRN does not need to separate these patterns in the hidden unit patterns since they are retained in the Hebbian memory.

Conclusions

The introduction outlined six criteria for a model of human memory, and in this section the HRN's performance is evaluated with respect to these criteria.

The first of the criteria was the learning of representation. The analysis of the hidden unit patterns of the HRN demonstrates that the model learns a representational landscape. In the recognition task explored in this paper, all items were equal. The input patterns were orthogonal and all items filled identical functional roles. Consequently, the patterns were evenly distributed throughout the space. In tasks in which either the input patterns vary or the items are required to fill different functional roles (such as Elman's, 1989, prediction task) the hidden unit space would be more structured, and it would be possible to assess how the memory performs on perceptually and semantically related items.

The development of decision boundaries was the second criterion. In the HRN, the decisions are made by the backpropagation connections from the context-to-hidden layer and from the hidden-to-output layer. Hence, the HRN does satisfy the condition of learning its decision criteria. Furthermore, as training progressed the performance increased, indicating that the decision criteria were being altered so as to better separate the target patterns from the distractors.

The third criterion was the development of control regimes. In such a simple task the degree of control required was limited. The only major distinction was between the study phase in which the network was required to output a "Blank" and the decision phase when the network responded with either "Yes" or "No" indicating whether it considered the probe item to have been in the study list. While it often took several hundred epochs of training before the network began to correctly determine whether to output "Yes" or "No", the control question of whether to output a "Blank" or a decision was typically learned very quickly, usually well within one hundred epochs. While this example is very simple it demonstrates that the HRN is capable of acquiring a control regime.

Learning is only useful if it can be done in a feasible time period. The time complexity of learning seems to be approximately linear in the length of the lists (and size of the vocabulary) with which the HRN is tested.

The first of the memory criteria was the degree of interference. The Hebbian memory of the HRN allowed it to retain items without significant interference. The HRN inherits the performance characteristics from the matrix memory and maintains its performance (at least until vocabulary size becomes larger than the number of hidden units) with extra items decrementing performance only marginally. Since human vocabularies are in the order of 10000 items and the number of items that could conceivably be remembered is even greater, it is critical that a model of memory be able to store and retrieve a large number of items. The ability of the HRN to generalise well even as the number of vocabulary items increased is of particular importance, and is one of the major distinguishing factors between it and the SRN.

The last of the criteria was the rapid binding in memory of already established representations. The possible representations are developed by the backpropagation mechanism over the course of training. Specific memories, however, are stored in the Hebbian weights. Hence, memories are laid down in a single timestep, while representations are formed over a much longer timespan. It is the dual memory architecture that avoids catastrophic interference, allows for the significant improvement in generalisation, and accounts for the dramatically different timespans of memory and learning.

While there is still work to be done to establish the extent of the capabilities of the HRN, it has been demonstrated that by integrating insights from backpropagation models with those from the mathematical memory modelling literature it is possible to fulfill both learning and memory criteria.

References

Anderson, J. A., Silverstein, J. W., Ritz, S. A., & Jones, R. S. (1977). Distinctive features, categorical perception and probability learning: Some applications of a neural model. Psychological Review, 84, 413-451.

Anderson, J. R. (1990). The Adaptive Character of Thought. Lawrence Erlbaum Associates, Hillsdale, NJ.

Anderson, J. R., & Milson, R. (1989). Human memory: An adaptive perspective. Psychological Review, 96 (4), 703-719.

Anderson, J. R., & Schooler, L. J. (1991). Reflections of the environment in memory. Psychological Science, 2 (6), 396-408.

Atkinson, R. C., & Shiffrin, R. M. (1968). Human memory: A proposed system and its control processes. In Spence, K. W., & Spence, J. T. (Eds.), The Psychology of Learning and Motivation, pp. 90-191. Academic Press.

Bickhard, M. H. (1991). The import of Fodor's anti-constructivist argument. In Steffe, L. P. (Ed.), Epistemological Foundations of Mathematical Experience, chap. 2, pp. 14-25. Springer Verlag, New York, NY.

Brousse, O., & Smolensky, P. (1989). Virtual memories and massive generalisation in connectionist combinatorial learning. In Program of the Eleventh Annual Conference of the Cognitive Science Society, pp. 380-387 Hillsdale, NJ. Lawrence Erlbaum Associates.

Brunswik, E. (1956). Perception and the Representative Design of Psychological Experiments. University of California Press, Berkeley, CA.

Campbell, D. T. (1974). Evolutionary epistemology. In Schilp, P. A. (Ed.), The philosophy of Karl Popper. Open Court, La Salle, IL.

Chappell, M., & Humphreys, M. S. (1994). Autoassociative neural network for sparse representations: Analysis and application to models of recognition and cued recall. Psychological Review, 101, 103-128.

Chauvin, Y. (1989). A backpropagation algorithm with optimal use of hidden units. In Touretzky, D. S. (Ed.), Advances in Neural Information Processing Systems, pp. 519-526. Morgan Kaufmann.

Cottrell, G. W., & Tsung, F. S. (1993). Learning simple arithmetic procedures. Connection Science, 5 (1), 37-58.

Dell, G. S., Juliano, C., & Govindjee, A. (1993). Structure and content in language production: A theory of frame constraints in phonological errors. Cognitive Science, 17, 149-195.

Eich, J. M. (1982). A Composite Holographic Associative Recall Model. Psychological Review, 89 (6), 627-661.

Elman, J. L. (1989). Representation and structure in connectionist models. Technical Report 8903, Center for Research in Language, University of California, San Diego, CA.

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.

Estes, W. K. (1991). Cognitive architectures from the standpoint of an experimental psychologist. Annual Review of Psychology, 42, 1-28.

French, R. (1991). Using semi-distributed representations to overcome catastrophic forgetting in connectionist networks. Technical Report 51, Center for Research on Concepts and Cognition, Indiana University, IN.

Gillund, G., & Shiffrin, R. M. (1984). A retrieval model for both recognition and recall. Psychological Review, 91 (1), 1-67.

Greene, R. L. (1992). Human Memory: Paradigms and Paradoxes. Lawrence Erlbaum Associates, Hillsdale, NJ.

Heathcote, A. (1993). An ART model of human recognition memory. In Leong, P., & Jabri, M. (Eds.), Proceedings of the Fourth Australian Conference on Neural Networks, pp. 212-215.

Hetherington, P., & Seidenberg, M. S. (1989). Is there catastrophic interference in connectionist networks? In Program of the Eleventh Annual Conference of the Cognitive Science Society, pp. 26-33 Hillsdale, NJ. Lawrence Erlbaum Associates.

Hinton, G. (1993). Tutorial on neural networks. Conducted at University of Sydney.

Hintzman, D. L. (1984). Minerva 2: A simulation model of human memory. Behaviour Research Methods, Instruments, and Computers, 16 (2), 96-101.

Hintzman, D. L. (1991). Why are formal models useful in psychology? In Hockley, W. E., & Lewandowsky, S. (Eds.), Relating Theory and Data: Essays on Human Memory in Honour of Bennet B. Murdock, pp. 39-56. Lawrence Erlbaum Associates, Hillsdale, NJ.

Hintzman, D. L., & Block, R. A. (1971). Repetition and memory: Evidence for a multi-trace hypothesis. Journal of Experimental Psychology, 88 (3), 297-306.

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. In Proceedings of the National Academy of Sciences, pp. 2554-2558. National Academy of Sciences.

Humphreys, M. S., Bain, J. D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic and procedural tasks. Psychological Review, 96 (2), 208-233.

Humphreys, M. S., Pike, R., Bain, J. D., & Tehan, G. (1989b). Global matching: A comparison of the SAM, Minerva II, Matrix and TODAM models. Journal of Mathematical Psychology, 33 (1), 36-67.

Jordan, M. I. (in press). Serial order: A parallel distributed processing approach. In Elman, J. L., & Rumelhart, D. E. (Eds.), Advances in Connectionist Theory: Speech. Lawrence Erlbaum Associates, Hillsdale, NJ.

Kotz, S., & Johnson, N. L. (1982). Encyclopedia of Statistical Sciences. John Wiley and Sons, NY.

Kruschke, J. K. (1992). Alcove: An exemplar-based connectionist model of category learning. Psychological Review, 99 (1), 22-44.

Lewandowsky, S. (1991). Gradual unlearning and catastrophic interference: A comparison of distributed architectures. In Hockley, W. E., & Lewandowsky, S. (Eds.), Relating Theory and Data: Essays on Human Memory in Honour of Bennet B. Murdock, pp. 445-476. Lawrence Erlbaum Associates, Hillsdale, NJ.

Lewandowsky, S., & Murdock, B. B. (1989). Memory for serial order. Psychological Review, 96 (1), 25-57.

Martin, E. (1965). Transfer of verbal paired associates. Psychological Review, 72, 327-343.

McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. In Bower, G. H. (Ed.), The Psychology of Learning and Motivation, pp. 109-165. Academic Press, NY.

Melton, A. W. (1963). Implications of short-term memory for a general theory of memory. Journal of Verbal Learning and Verbal Behaviour, 2, 1-21.

Mozer, M. C. (1992). Induction of multiscale temporal structure. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4, pp. 325-332 San Mateo, CA. Morgan Kaufmann.

Mozer, M. C. (1993). Neural net architectures for temporal sequence processing. To appear in: A. Weigend & N. Gershenfeld (Eds.) Predicting the future and understanding the past. Redwood City, CA: Addison-Wesley Publishing.

Murdock, B. B. (1982). A theory for the storage and retrieval of item and associative information. Psychological Review, 89 (6), 609-626.

Murdock, B. B. (1987). Learning in a distributed memory model. In Izawa, C. (Ed.), Current Issues in Cognitive Processes, chap. 4, pp. 69-106. Lawrence Erlbaum Associates. The Tulane Flowerree Symposium on Cognition.

Murdock, B. B., & Lamon, M. (1988). The replacement effect: Repeating some items while replacing others. Memory & Cognition, 16 (2), 91-101.

Nolfi, S., Parisi, D., Vallar, G., & Burani, C. (1990). Recall of sequences of items by a neural network. In Touretzky, D. S., Elman, J. L., Sejnowski, T. J., & Hinton, G. E. (Eds.), Proceedings of the 1990 Connectionist Models Summer School. Morgan Kaufmann, San Mateo, CA.

Osgood, C. E. (1949). The similarity paradox in human learning: A resolution. Psychological Review, 56, 132-143.

Phillips, S. (1991). Serial recall using an Elman net with hints. Unpublished manuscript.

Phillips, S., & Wiles, J. (1993). Exponential generalisations from a polynomial number of examples in a combinatorial domain. Proceedings of the 1993 International Joint Conference on Neural Networks, 505-508.

Pike, R. (1984). Comparison of convolution and matrix distributed memory systems for associative recall and recognition. Psychological Review, 91 (3), 281-293.

Plate, T. (1991). Holographic reduced representations: Convolution algebra for compositional distributed representations. In Mylopoulos, J., & Reiter, R. (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence, pp. 30-35 Sydney, Australia. Morgan Kaufmann Publishers.

Plate, T. (1992). Holographic recurrent networks. In Giles, C. L., Hanson, S. J., & Cowan, J. D. (Eds.), Advances in Neural Information Processing Systems, Vol. 5, pp. 34-41 San Mateo, CA. Morgan Kaufmann.

Postman, L. (1969). Experimental analysis of learning to learn. In Bower, G. H., & Spence, J. T. (Eds.), Psychology of Learning and Motivation, Vol. 3, pp. 241-297. Academic Press.

Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of Associative Memory. Psychological Review, 88, 93-134.

Ratcliff, R. (1990). Connectionist models of recognition memory: Constraints imposed by learning and forgetting functions. Psychological Review, 97 (2), 285-308.

Ratcliff, R., Clark, S. E., & Shiffrin, R. M. (1990). The list-strength effect: I. Data and discussion. Journal of Experimental Psychology: Learning, Memory and Cognition, 16, 163-178.

Regier, T. (1992). The acquisition of lexical semantics for spatial terms: A connectionist model of perceptual categorisation. Technical Report TR-92-062, International Computer Science Institute, Berkeley, CA.

Reilly, R. (1993). A connectionist attentional shift model of eye-movement control in reading. In Proceedings of the 15th Annual Meeting of the Cognitive Science Society Boulder, CO. To appear.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In Rumelhart, D. E., & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, pp. 318-362. MIT Press, Cambridge, MA.

Schmidhuber, J. (1991). Neural sequence chunkers. Technical Report FKI-148-91, Technische Universitaet Muenchen, Institut fuer Informatik, Munich, Germany.

Schmidhuber, J. (1992). A fixed size O(n^3) time complexity learning algorithm for fully recurrent networks. Neural Computation, 4 (2), 243-248.

Shepard, R. N. (1967). Recognition memory for words, sentences and pictures. Journal of Verbal Learning and Verbal Behaviour, 6, 156-163.

Shepard, R. N. (1987). Towards a universal law of generalisation for psychological systems. Science, 237, 1317-1323.

Stornetta, W. S., & Huberman, B. A. (1987). An improved three-layer backpropagation algorithm. In Caudill, M., & Butler, C. (Eds.), IEEE First International Conference on Neural Networks, pp. 637-645 San Diego, CA. IEEE.

Swets, J. A., & Green, D. M. (1961). Sequential observations by human observers of signals in noise. In Cherry, C. (Ed.), Information Theory: Proceedings of the fourth London symposium, pp. 177-195 London. Butterworth.

Waibel, A. (1989). Modular construction of time-delay neural networks for speech recognition. Neural Computation, 1 (1), 39-46.

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1987). Phoneme recognition using time-delay neural networks. Technical Report TR-1-0006, ATR Interpreting Telephony Research Laboratories, Japan.

Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In Moody, J. E., Hanson, S. J., & Lippmann, R. P. (Eds.), Advances in Neural Information Processing Systems 4, pp. 325-332 San Mateo, CA. Morgan Kaufmann.

Wiles, J., & Phillips, S. (1991). Serial recall of binary sequences. Unpublished manuscript.

Williams, R. J., & Zipser, D. (1989). Experimental analysis of the real-time recurrent learning algorithm. Connection Science, 1 (1), 87-111.

Williams, R. J., & Zipser, D. (1990). Gradient-based learning algorithms for recurrent networks. In Chauvin, Y., & Rumelhart, D. E. (Eds.), Backpropagation: Theory, Architectures and Applications, pp. 1-42. Erlbaum, Hillsdale, NJ.

Footnotes

[1] Brousse and Smolensky (1989) used a somewhat similar approach, in that lists of items were encoded in patterns of activation rather than in the weights. However, their approach employed a tensor product mechanism to construct the patterns, rather than relying on a recurrent architecture to learn an appropriate encoding scheme.

[2] For an analysis of current temporal learning algorithms and their suitability in cognitive tasks see Regier (1992). In addition, Mozer (1993) provides a detailed classification of temporal network architectures.

[3] At least in its original formulation (Williams & Zipser, 1990), although an O(n^3) algorithm has been developed (Schmidhuber, 1992).

[4] This statement needs some qualification. There are many variations on BPTT. If the SRN is compared against BPTT where error is backpropagated at the end of each sequence rather than after each pattern, the time complexities per sequence are equal. However, the number of weight updates is decreased from l, where l is the length of the sequence, to just one per sequence. Experience with feedforward nets suggests that, if there is redundancy in the training set, online training is often faster than batch training because of the increase in the number of weight updates (Hinton, 1993).

[5] The RTRL algorithm was also attempted, but for the size of the training sets used the simulations proved too slow even on lists of length three. After almost two weeks running in the background on a SparcStation II only 300 epochs were complete. Thanks to Jeff Elman for the tlearn simulator on which the RTRL simulations were run.

[6] List length refers to the length of the study list, not the length of the entire sequence. Hence, if the list length is three there are five patterns in the sequence.

[7] Note that in these simulations all of the items would have been presented during training. What would not have been seen are the sequences of items in the test set.

[8] In the feedforward case, feasible generalisation on combinatorial domains can be achieved by bounding the number of hidden units (Phillips & Wiles, 1993); however, the recurrent case remains an open question.

[9] Kotz and Johnson (1982) provide a description of Principal Components Analysis (PCA) and Elman (1990) demonstrates its application to hidden unit analysis.

[10] Kotz and Johnson (1982) provide a description of Canonical Discriminants Analysis (CDA) and Wiles and Bloesch (1992) demonstrate its application to hidden unit analysis.

[11] If more hidden units are employed then one does see residual position information. The networks from which these plots were derived contained seven hidden units.