Artificial neural network models provide experimental psychologists with testable, mechanistic accounts of processes which may mediate perception and cognition. This paper outlines some techniques for comparing indices of human cognitive processes, such as reaction time, accuracy, and confidence ratings, with neural network performance and patterns of activation in hidden unit space. Finally, the efficacy of neural networks as predictive systems, based on their ability to represent statistical regularities of the environment, is discussed.
Network accuracy can be compared directly with accuracy recorded by subjects on comparable tasks. For example, we have compared accuracy data recorded by musically trained and untrained listeners on music discrimination and classification tasks with output values generated by single-layer networks (Stevens, 1992; Stevens & Latimer, 1992; 1993). The differing performance of musically trained and untrained subjects was mimicked in the network by manipulating systematically the amount of training to which the network was exposed. Specifically, the mean accuracy of trained subjects was simulated in a fully-trained or "expert" network and compared with the mean accuracy of untrained subjects as mimicked in a partially-trained or "novice" network.


The pitch-change direction discrimination task was simulated using a simple recurrent network (SRN) which consisted of 11 input units and seven context and hidden units. Two output units were used to signal the presence of either a standard or a modified composition. The binary activations of the input units corresponded to the 11 different frequencies of the compositions. As there were 32 notes per composition, the input units were activated over 32 time steps with one activation per time step. The recurrent connections, which linked the hidden units to the context units, gave the network memory inasmuch as they allowed the hidden units to be influenced by their previous activations. The model was trained using online learning and backpropagation through time for one time step. The context units were reset to zero between compositions, the learning rate was 0.005, and the tanh activation function was used. The standard and five of the modified compositions were discriminated accurately after 20,000 training epochs. Modified Composition 2 was confused with the standard composition after this period of training. The activations of the output units graphed as a function of the unfolding time steps of the composition are shown in Figure 3. In the model, the pattern of activation mimics the reaction times recorded by musically trained subjects. For example, correct discrimination of the first modified composition is made after the distinguishing second time step has been input. Similarly, Modification 3 was discriminated accurately after activation of time step 28.
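The architecture and training regime described above can be illustrated with a minimal sketch. It is not the original simulation code. The unit counts, learning rate, tanh activation, online updates, and context reset come from the text. The weight initialisation range, the squared-error loss, the target coding of the two output units, and the function names `init_weights` and `train_on_composition` are assumptions made for illustration. Gradients are truncated at the current time step, with the context treated as a fixed extra input, which is one possible reading of training "for one time step".

```python
import numpy as np

N_IN, N_HID, N_OUT = 11, 7, 2  # unit counts given in the text


def init_weights(rng, scale=0.5):
    """Random starting weights; the initialisation range is an assumption."""
    return {
        "in": rng.uniform(-scale, scale, (N_HID, N_IN)),    # input -> hidden
        "ctx": rng.uniform(-scale, scale, (N_HID, N_HID)),  # context -> hidden
        "out": rng.uniform(-scale, scale, (N_OUT, N_HID)),  # hidden -> output
    }


def train_on_composition(w, notes, target, lr=0.005):
    """Present one composition note by note, updating weights after each note.

    notes  : integer frequency indices in 0..10, one per time step
    target : length-2 array of desired output activations (coding is assumed)
    Returns the output activations at every time step, shape (len(notes), 2).
    """
    context = np.zeros(N_HID)  # context units reset at the start of a composition
    outputs = []
    for note in notes:
        x = np.zeros(N_IN)
        x[note] = 1.0  # one binary input unit is active per time step
        hidden = np.tanh(w["in"] @ x + w["ctx"] @ context)
        out = np.tanh(w["out"] @ hidden)
        outputs.append(out)

        # Squared-error gradient, truncated at the current time step:
        # the context vector is treated as a constant input.
        delta_out = (out - target) * (1.0 - out ** 2)
        delta_hid = (w["out"].T @ delta_out) * (1.0 - hidden ** 2)
        w["out"] -= lr * np.outer(delta_out, hidden)
        w["in"] -= lr * np.outer(delta_hid, x)
        w["ctx"] -= lr * np.outer(delta_hid, context)

        context = hidden  # hidden activations become the next step's context
    return np.array(outputs)


rng = np.random.default_rng(0)
weights = init_weights(rng)
toy_notes = rng.integers(0, N_IN, size=32)  # hypothetical 32-note composition
activations = train_on_composition(weights, toy_notes, np.array([1.0, -1.0]))
print(activations.shape)  # (32, 2)
```

Plotting each column of `activations` against the time step index gives the kind of output-unit trace over the unfolding composition that the text compares with reaction times.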

The hidden units of the SRN provide a rich source of information about the network representations which mediate input/output activations. There is a vector of hidden unit activations for each of the 32 time steps of the standard and modified compositions. The 160 hidden unit vectors for all compositions recognized by the trained SRN were subjected to canonical discriminant analysis (CDA) in an effort to reduce the dimensionality and complexity of the data (Wiles & Bloesch, 1992). To explore the differences between the hidden unit representation of standard and modified compositions, the analysis was carried out on hidden unit activation vectors at and beyond the time steps where feature manipulations occurred. CDA takes the set of seven-dimensional hidden unit vectors, each labelled with one of two groups - either a standard or a modified pattern - and, by minimizing the within-group differences relative to the between-group differences, finds a vector (the first canonical component) that best separates the two groups of hidden unit patterns. The original points can be projected onto the canonical component vectors to yield a low-dimensional plot with points clustered according to group membership.
The first canonical component for Modified Composition 5, shown in Figure 4, accounts for the greatest amount of variance in the data, and, in the SRN, partitions off the standard and modified compositions. The values of the first canonical component are plotted over the 32 time steps of the composition. Label s marks the time steps which contain notes identical to those of the standard (i.e. the opening notes of a modified composition before the feature manipulation occurs), whereas d indicates the location of the feature manipulation which distinguishes the modified composition from the standard. Figure 4 also illustrates the way in which the hidden unit activations corresponding to notes of the standard and modified compositions occupy different regions of hidden unit space. If we consider the points in each graph as a trajectory through time, it can be seen that, once a distinguishing feature is input to the network, there is movement toward a region in space away from that area occupied by values corresponding to presentation of notes of the standard composition. Thus, the sameness and difference of compositions are represented in the SRN in the relative location of activations in hidden unit space; the standard composition is represented by clustering in one area and, once a discriminating feature is input, there is movement away from that region to a different location in space. Armed with such an efficient internal representation, it is not surprising that the SRN discriminates modified compositions from the standard as soon as the input unit which represents a discriminating feature is activated.
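For two groups, the first canonical component coincides with Fisher's linear discriminant direction, so the analysis described above can be sketched in a few lines. This is an illustrative reconstruction, not the software used in the original study. The function names, the small ridge term, and the synthetic seven-dimensional data are assumptions; real hidden unit vectors would be taken from the trained SRN.

```python
import numpy as np


def first_canonical_component(vectors, labels):
    """Direction that best separates two labelled groups of vectors.

    vectors : (n, d) array of hidden unit activation vectors
    labels  : (n,) array, 0 for standard and 1 for modified (coding assumed)
    """
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    group0 = vectors[labels == 0]
    group1 = vectors[labels == 1]
    mean0, mean1 = group0.mean(axis=0), group1.mean(axis=0)

    # Pooled within-group scatter matrix.
    centred0 = group0 - mean0
    centred1 = group1 - mean1
    within = centred0.T @ centred0 + centred1.T @ centred1
    # Small ridge term (an assumption) guards against a near-singular matrix.
    within += 1e-8 * np.eye(within.shape[0])

    direction = np.linalg.solve(within, mean1 - mean0)
    return direction / np.linalg.norm(direction)


def project(vectors, direction):
    """Score of each vector on the canonical component."""
    return np.asarray(vectors, dtype=float) @ direction


# Synthetic stand-in for hidden unit vectors, invented for illustration only.
rng = np.random.default_rng(0)
standard = rng.normal(0.0, 0.1, (40, 7))
modified = rng.normal(0.0, 0.1, (40, 7)) + np.array([0.5, 0, 0, 0, 0, 0, 0])
vectors = np.vstack([standard, modified])
labels = np.array([0] * 40 + [1] * 40)

component = first_canonical_component(vectors, labels)
scores = project(vectors, component)
```

Plotting the scores for one composition in time step order gives a trajectory of the kind shown in Figure 4, in which points move away from the region occupied by the standard once a distinguishing feature has been input.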

What has the analysis of the hidden unit activations of the SRN contributed to our understanding of music cognition? In our earlier discussion of the single-layer perceptron, differences in accuracy and reaction time recorded by musically trained and untrained subjects were emulated in fully- and partially-trained networks, respectively. Importantly though, the perceptron algorithm requires that all weighted activations be summed so that the efficient reaction time functions of musically trained subjects are unlikely to be achieved using a single-layer network. A more plausible explanation of the differences between trained and untrained subjects invokes the properties of the SRN as the possible mechanism. It is conceivable that, with training and experience, musically trained subjects acquire knowledge of the higher-order structure of musical compositions and subsequent expectancies of melodic, harmonic, and rhythmic progressions. This knowledge is constructed by the SRN and represented in the hidden unit activations. Additionally, given that the reaction time function of trained subjects can be simulated more closely by the SRN than the perceptron, the influence of recurrent connections must also be acknowledged. Retention of the hidden unit activations from the previous time step in the SRN facilitated speed of recognition. By extrapolation, it may be the case that musically trained listeners, but not untrained listeners, retain or accumulate pitch information from the compositions as they unfold in time.
The distributions of activity across the output vector in the early stages of training demonstrate the statistical learning properties of the network. The statistical regularities extracted and learned by the network are of psychological interest for a number of reasons. First, the mean activations of the tone units across the entire composition are reminiscent of the tonal hierarchy which has been described by both music theorists and psychologists (Krumhansl, 1990; Krumhansl & Shepard, 1979). During the initial epochs of training, the network responds with the most stable and frequently occurring tones, namely tones C, E, and G which comprise the tonic chord. This kind of hierarchical ordering of tones characterises the relative importance of tones in particular keys and has been shown to influence performance on a range of experimental tasks including judged relatedness of tones, judged key membership, judgments of phrase endings, and patterns of memory confusions (Krumhansl, 1991). The network, therefore, provides a mechanistic account of the way in which statistical properties of music are learned and represented; for example, frequently occurring tones are weighted heavily. The statistical probabilities extracted by the network can also be used for predictive purposes: is it the case that listeners, when exposed initially to a novel musical composition and asked to predict the next tone in a sequence, predict the most frequent or probable tone? Given the distribution of activity across the time-based components, duration and accent, it is also possible to use the network to produce a time-based analogue of the tonal hierarchy. For example, the activation of the duration units during the initial training epochs implies that a quarter note is the most frequent and stable note length in the composition and could be compared with predictions of note length made by musically trained and untrained listeners when assigned a prediction task.
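The frequency-based prediction discussed above can be made concrete with a simple baseline. The sketch below does not involve the network; it counts tone occurrences in a sequence and returns the most frequent tone, which is the kind of prediction the early-training output activations are said to resemble. The melody and the function names are hypothetical and were invented purely for illustration.

```python
from collections import Counter


def tone_distribution(tones):
    """Relative frequency of each tone in a sequence."""
    counts = Counter(tones)
    total = sum(counts.values())
    return {tone: count / total for tone, count in counts.items()}


def most_probable_tone(tones):
    """Tone with the highest relative frequency."""
    distribution = tone_distribution(tones)
    return max(distribution, key=distribution.get)


# Hypothetical fragment in C major, not taken from the experimental stimuli.
melody = ["C", "E", "G", "C", "D", "E", "C", "G", "F", "E", "C", "G"]

distribution = tone_distribution(melody)
prediction = most_probable_tone(melody)
```

The same counting could be applied to note durations or accent values to obtain a time-based analogue, and the resulting distributions compared with listeners' predictions.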
Network performance can be measured by calculating the total sum of squared error (TSS) at various stages of training, and the mean contributions of the four components to the TSS error are depicted in Figure 5. In the latter stages of training, the tone, duration and accent TSS errors were close to zero. The relative TSS values can also be used to predict performance. Consider a situation where the TSS value of one of the components, such as accent, decreases rapidly during the initial training epochs. Such a result implies that accent is learned quickly and is salient. We can ask a similar question of human subjects: is it the case that accent information is extracted early when subjects are exposed to novel stimuli? Experimental stimuli could be constructed to compare the relative salience of accent and pitch information in a discrimination task. The comparison would involve two sequences of the same pitch pattern but with differing accent patterns versus two sequences of differing pitch patterns with the same accent pattern. If accent pattern is salient relative to pitch, then we would expect there to be more errors in discriminating between the latter sequences than the former sequences.
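The per-component error measure can be sketched as follows. TSS is taken here to be the sum, over patterns and output units, of the squared difference between target and output, split by the group of output units that encodes each component. The layout of the output vector, the component-to-unit mapping, and the function name are assumptions made for illustration. Only the three components named in the text are shown.

```python
def tss_by_component(targets, outputs, component_slices):
    """Sum of squared error for each group of output units.

    targets, outputs : lists of equal-length output vectors, one per pattern
    component_slices : maps a component name to the slice of units encoding it
    """
    totals = {name: 0.0 for name in component_slices}
    for target, output in zip(targets, outputs):
        for name, unit_slice in component_slices.items():
            totals[name] += sum(
                (t - o) ** 2
                for t, o in zip(target[unit_slice], output[unit_slice])
            )
    return totals


# Hypothetical layout: units 0-2 code tone, 3-4 duration, 5 accent.
layout = {"tone": slice(0, 3), "duration": slice(3, 5), "accent": slice(5, 6)}
targets = [[1.0, 0.0, 0.0, 1.0, 0.0, 1.0]]
outputs = [[0.5, 0.0, 0.0, 1.0, 0.0, 0.0]]

errors = tss_by_component(targets, outputs, layout)
total_tss = sum(errors.values())
```

Recording `errors` at successive training epochs gives one curve per component, so a component whose error falls fastest can be identified and compared with the salience of the corresponding information for listeners.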

Gluck, M. A., & Bower, G. H. (1988). Evaluating an adaptive network model of human learning. Journal of Memory and Language, 27, 166-195.
Krumhansl, C. L. (1990). Cognitive Foundations of Musical Pitch. Oxford: Oxford University Press.
Krumhansl, C. L. (1991). Music psychology: Tonal structures in perception and memory. Annual Review of Psychology, 42, 277-303.
Krumhansl, C. L., & Shepard, R. N. (1979). Quantification of the hierarchy of tonal functions within a diatonic context. Journal of Experimental Psychology: Human Perception & Performance, 5, 579-594.
McClelland, J. L., & Rumelhart, D. E. (1989). Explorations in Parallel Distributed Processing: A Handbook of Models, Programs and Exercises. Cambridge, Mass.: MIT Press.
Stevens, C. (1992). Derivation and Investigation of Features Mediating Musical Pattern Recognition. Unpublished doctoral dissertation, University of Sydney.
Stevens, C., & Latimer, C. (1992). A comparison of connectionist models of music recognition and human performance. Minds and Machines, 2, 379-400.
Stevens, C., & Latimer, C. (1993). Recognition of short tonal compositions by connectionist models and listeners: Effects of feature manipulation and training. Musikometrika-5, in press.
Stevens, C., & Wiles, J. (1994). Representations of tonal music: A case study in the development of temporal relationships. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 228-235). Hillsdale, NJ: Erlbaum Associates.
Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.