The In(put)s and Out(put)s of Comparing Human and Network Performance: Some Ideas on Representations, Activations and Weights

Kate Stevens
Department of Psychology
University of Western Sydney
PO Box 555
Campbelltown NSW 2560
KJ.Stevens@uws.edu.au

Artificial neural network models provide experimental psychologists with testable, mechanistic accounts of processes which may mediate perception and cognition. This paper outlines some techniques for comparing indices of human cognitive processes, such as reaction time, accuracy, and confidence ratings, with neural network performance and patterns of activation in hidden unit space. Finally, the efficacy of neural networks as predictive systems, based on their ability to represent statistical regularities of the environment, is discussed.

Indices of Accuracy and Response Time in Single-Layer Neural Networks

Connectionist models consist of processing units which are organized into input, output, and sometimes, middle layers. The processing units communicate through a set of interconnections with variable weights or connection strengths, and modification of weights or connection strengths allows a network to "learn". As learning systems, networks are exposed to, and trained to recognize, a set of patterns: they learn to associate particular patterns of input activation with particular patterns of output activation via the gradual adjustment of network weights. Ideally, learning generalizes so that trained networks can recognize new patterns which share properties with the training set.

Accuracy

The accuracy of performance of a network can be gauged from the appropriateness of the response on any single trial. In a backpropagation network, for example, accuracy is highest when there is no difference between the actual and target output. The amount of error in the network is frequently quantified as the total sum of squared error (TSS): as network accuracy increases, TSS decreases. In networks such as a perceptron, where the activation function of a single output unit is thresholded, performance is accurate when the output unit responds above threshold for members of one class of patterns and below threshold for members of the other. Additionally, generalization to a set of related novel patterns is measured as percentage correct.
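The two accuracy indices just described can be sketched in a few lines of Python. This is an illustrative sketch only: the function names, the 0.5 threshold, and the four example patterns are assumptions introduced here, not values from the simulations reported in this paper.

```python
def total_sum_squared_error(targets, outputs):
    """TSS: squared target-minus-output differences, summed over units and patterns."""
    return sum(
        (t - o) ** 2
        for target_vec, output_vec in zip(targets, outputs)
        for t, o in zip(target_vec, output_vec)
    )


def percent_correct(targets, outputs, threshold=0.5):
    """Percentage of patterns whose single output unit falls on the correct side
    of the threshold (the 0.5 default is an assumption)."""
    hits = sum(
        (o[0] > threshold) == (t[0] > threshold)
        for t, o in zip(targets, outputs)
    )
    return 100.0 * hits / len(targets)


# Hypothetical data: four patterns, one output unit (1 = modified, 0 = standard).
targets = [[1.0], [1.0], [0.0], [0.0]]
outputs = [[0.9], [0.4], [0.2], [0.1]]

print(total_sum_squared_error(targets, outputs))  # approximately 0.42
print(percent_correct(targets, outputs))          # 75.0
```

The second pattern is the only miss: its output of 0.4 lies below threshold although the target is 1, so it contributes most of the TSS and lowers percentage correct to 75.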

Network accuracy can be compared directly with accuracy recorded by subjects on comparable tasks. For example, we have compared accuracy data recorded by musically trained and untrained listeners on music discrimination and classification tasks with output values generated by single-layer networks (Stevens, 1992; Stevens & Latimer, 1992; 1993). The differing performance of musically trained and untrained subjects was mimicked in the network by manipulating systematically the amount of training to which the network was exposed. Specifically, the mean accuracy of trained subjects was simulated in a fully-trained or "expert" network and compared with the mean accuracy of untrained subjects as mimicked in a partially-trained or "novice" network.

Response Time

The time taken by a network to respond accurately on a given trial can also be deduced from the activation values of output units in a backprop model. The comparison with human data is most easily illustrated using data from a network which has been trained to discriminate between temporal patterns. The output activations of these networks can be mapped against the unfolding input pattern. That is, patterns can be input to the network one timeslice at a time and accuracy of the output units can be checked at each timeslice. Figure 1 illustrates the pattern of activation of the output units of a single-layer network trained to discriminate short modified musical compositions from a standard composition. The output unit activations are correct after the activation of Timeslice 7. The reaction times recorded by subjects in response to presentation of one of the musical compositions have been converted to timeslices to enable direct comparison with the network; the mean and range of these reaction times are superimposed on the graph. The figure shows that the network responds quickly and it can be deduced that the input of Timeslice 7 is sufficient for accurate performance. Subjects are slightly less efficient than the trained network but, on average, respond within one or two timeslices after input of the first highly discriminating feature of a modified composition.

Figure 1. Comparison of output activation of a trained single-layer backprop network and mean response times of subjects (after conversion to timeslices). On average, subjects responded at Timeslice 8 when presented with Modified Composition 1, compared with correct discrimination by the network after activation of Timeslice 7.
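One way to read a response time off the output activations is to find the first timeslice from which the correct output unit remains the most active. The sketch below assumes two output units and made-up activations; it also shows one possible conversion of a reaction time in milliseconds to a timeslice, with a hypothetical timeslice duration. None of these numbers is taken from the experiments, and the exact conversion used for Figure 1 is not specified in the text.

```python
import math


def first_correct_timeslice(activations, target_index):
    """Return the 1-based timeslice from which the target unit stays the most
    active output unit until the pattern ends, or None if that never happens."""
    response_slice = None
    for step, outputs in enumerate(activations, start=1):
        rivals = [a for i, a in enumerate(outputs) if i != target_index]
        correct = outputs[target_index] > max(rivals)
        if correct and response_slice is None:
            response_slice = step      # candidate response point
        elif not correct:
            response_slice = None      # response was not stable; start again
    return response_slice


def rt_to_timeslice(rt_ms, slice_ms):
    """Convert a reaction time to the 1-based timeslice in which it falls."""
    return math.ceil(rt_ms / slice_ms)


# Hypothetical activations for [standard unit, modified unit] over 10 timeslices.
activations = [[0.6, 0.4]] * 6 + [[0.3, 0.7], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]

print(first_correct_timeslice(activations, target_index=1))  # 7
print(rt_to_timeslice(rt_ms=1900, slice_ms=250))             # 8
```

Requiring the response to remain correct to the end of the pattern is a design choice made for this sketch; a criterion based on a fixed activation margin would serve equally well.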

Analysis and Interpretation of Weights and Hidden Unit Activations

Weight Graphs from Single-Layer Networks

Gluck & Bower (1988) have advocated simulation of tasks using simple network models to permit unambiguous interpretation of network weights and activations. Specifically, the strengths on connections which link input to output units in a trained single-layer network illustrate the degree to which features represented by individual input units are differentiating. To take an example, Figure 2 illustrates the final weights on connections from 32 input units to the single output unit of a single-layer perceptron trained to discriminate six modified musical compositions from a standard composition. The large negative weights on connections from input units 2, 13, and 28 correspond to the locations of feature manipulations in modified compositions where falling pitch-changes were replaced with rising pitch-changes. The large positive weights correspond to places in the three modified compositions where falling pitch-changes were substituted for rising pitch-changes. These weight graphs can be used to predict human performance on the same task: if subjects are making their decision on the basis of the above discriminating features, then accurate responses should follow the sounding of the component notes of these distinguishing features.

Figure 2. Final weights or strengths on connections from the 32 input units to the single output unit of a single-layer perceptron. The large positive and negative weights correspond to the location of feature manipulations in the modified musical compositions.
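The link between discriminating features and large weights can be illustrated with a toy single-layer network. The sketch below is not the simulation reported above: it assumes a linear output unit trained with the delta rule, a made-up coding of pitch-change direction (+1 rising, -1 falling), an arbitrary standard pattern, and three hypothetical manipulated positions.

```python
def train_delta_rule(patterns, targets, rate=0.01, epochs=2000):
    """Train one linear output unit with the delta (LMS) rule from zero weights."""
    weights = [0.0] * len(patterns[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(patterns, targets):
            output = bias + sum(w * xi for w, xi in zip(weights, x))
            error = target - output
            weights = [w + rate * error * xi for w, xi in zip(weights, x)]
            bias += rate * error
    return weights, bias


# Hypothetical coding: 32 positions, +1 = rising pitch-change, -1 = falling.
standard = [1 if i % 2 == 0 else -1 for i in range(32)]
manipulated = [4, 15, 22]                 # 0-based positions, chosen arbitrarily

patterns, targets = [standard], [1.0]     # target 1 = standard composition
for pos in manipulated:
    modified = list(standard)
    modified[pos] = -modified[pos]        # reverse the pitch-change direction
    patterns.append(modified)
    targets.append(0.0)                   # target 0 = modified composition

weights, bias = train_delta_rule(patterns, targets)
largest = sorted(range(32), key=lambda i: abs(weights[i]), reverse=True)[:3]
print(sorted(i + 1 for i in largest))     # input units with the largest |weight|
```

In this toy case the three largest weights sit on the manipulated positions, and their signs follow the direction of the change. The sign convention depends entirely on the coding and target values assumed here.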

Analysis of Hidden Unit Activations in a Multi-Layer Network

A more sophisticated neural network architecture which can process temporal stimuli is the simple recurrent network (SRN) (Elman, 1989). The SRN was designed for use with linguistic stimuli and is ideally suited for use with musical patterns in that it combines architectural properties, such as hidden units and sequential processing, characteristic of the more elementary networks discussed earlier. However, with the inclusion of a layer of hidden units the strengths on connections are less easy to interpret than those of the single-layer architectures discussed above as there are now two layers of modifiable weights. Instead, attention is most often paid to the hidden unit activations which intervene between the input and output units. This section will describe the architecture and results of a multi-layer network trained on the pitch-change discrimination task and will outline one technique used to explore the hidden unit activations.

The pitch-change direction discrimination task was simulated using an SRN which consisted of 11 input units, seven hidden units, and seven context units. Two output units were used to signal the presence of either a standard or a modified composition. The binary activations of the input units corresponded to the 11 different frequencies of the compositions. As there were 32 notes per composition, the input units were activated over 32 time steps with one activation per time step. The recurrent connections, which linked the hidden units to the context units, gave the network memory inasmuch as they allowed the hidden units to be influenced by their previous activations. The model was trained using online learning and backpropagation through time for one time step. The context units were reset to zero between compositions, the learning rate was 0.005, and the tanh activation function was used. The standard and five of the modified compositions were discriminated accurately after 20,000 training epochs. Modified Composition 2 was confused with the standard composition after this period of training. The activations of the output units graphed as a function of the unfolding time steps of the composition are shown in Figure 3. In the model, the pattern of activation mimics the reaction times recorded by musically trained subjects. For example, correct discrimination of the first modified composition is made after the distinguishing second time step has been input. Similarly, Modification 3 was discriminated accurately after activation of time step 28.

Figure 3. Activation of the two output units of the simple recurrent network plotted over the 32 time steps of the compositions. Modified compositions are discriminated accurately after input of the distinguishing pitch-change direction feature. The graph illustrates network response when presented with the Standard and Modified Composition 1.
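The flow of activation through an SRN of this size can be sketched as follows. This is a forward pass only, with random untrained weights; bias terms are omitted, and the weight range, random seed, and input sequence are assumptions made for illustration. Training by backpropagation is not shown.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 11, 7, 2


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights (the range is an assumption)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def srn_forward(sequence, w_in, w_context, w_out):
    """Run one composition through the SRN; return outputs and hidden states."""
    context = [0.0] * N_HIDDEN             # context units reset for each composition
    outputs, hidden_states = [], []
    for note in sequence:                  # note = index of the single active input unit
        hidden = []
        for j in range(N_HIDDEN):
            net = w_in[j][note]            # one-hot input: only one weight contributes
            net += sum(w_context[j][k] * context[k] for k in range(N_HIDDEN))
            hidden.append(math.tanh(net))
        output = [
            math.tanh(sum(w_out[o][j] * hidden[j] for j in range(N_HIDDEN)))
            for o in range(N_OUTPUT)
        ]
        outputs.append(output)
        hidden_states.append(hidden)
        context = hidden                   # copy hidden activations to the context units
    return outputs, hidden_states


rng = random.Random(0)
w_in = random_matrix(N_HIDDEN, N_INPUT, rng)
w_context = random_matrix(N_HIDDEN, N_HIDDEN, rng)
w_out = random_matrix(N_OUTPUT, N_HIDDEN, rng)

sequence = [rng.randrange(N_INPUT) for _ in range(32)]   # hypothetical 32-note input
outputs, hidden_states = srn_forward(sequence, w_in, w_context, w_out)
print(len(outputs), len(hidden_states[0]))
```

Because the context units hold the previous hidden activations, the same note presented at two different time steps generally produces different hidden unit vectors; this is the sense in which the recurrent connections give the network memory.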

The hidden units of the SRN provide a rich source of information about the network representations which mediate input/output activations. There is a vector of hidden unit activations for each of the 32 time steps of the standard and modified compositions, and the 160 hidden unit vectors for all compositions recognized by the trained SRN were subjected to canonical discriminant analysis (CDA) in an effort to reduce the dimensionality and complexity of the data (Wiles & Bloesch, 1992). To explore the differences between the hidden unit representation of standard and modified compositions, the analysis was carried out on hidden unit activation vectors at and beyond time steps where feature manipulations occurred. CDA takes the set of seven-dimensional hidden unit vectors, each labelled with one of two groups (either a standard or a modified pattern), and, by maximizing the between-group differences relative to the within-group differences, finds a vector (the first canonical component) that best separates the two groups of hidden unit patterns. The original points can be projected onto the canonical component vectors to yield a low-dimensional plot with points clustered according to group membership.

The first canonical component for Modified Composition 5, shown in Figure 4, accounts for the greatest amount of variance in the data, and, in the SRN, partitions off the standard and modified compositions. The values of the first canonical component are plotted over the 32 time steps of the composition. Label s marks the time steps which contain notes identical to those of the standard (i.e. the opening notes of a modified composition before the feature manipulation occurs), whereas d indicates the location of the feature manipulation which distinguishes the modified composition from the standard. Figure 4 also illustrates the way in which the hidden unit activations corresponding to notes of the standard and modified compositions occupy different regions of hidden unit space. If we consider the points in each graph as a trajectory through time, it can be seen that, once a distinguishing feature is input to the network, there is movement toward a region in space away from that area occupied by values corresponding to presentation of notes of the standard composition. Thus, the sameness and difference of compositions are represented in the SRN in the relative location of activations in hidden unit space; the standard composition is represented by clustering in one area and, once a discriminating feature is input, there is movement away from that region to a different location in space. Armed with such an efficient internal representation, it is not surprising that the SRN discriminates modified compositions from the standard as soon as the input unit which represents a discriminating feature is activated.

Figure 4. The values of the first canonical component from the hidden unit activations of the SRN for Modified Composition 5 shown as a function of time step. The points at and beyond the label s refer to the parts of the modified composition which are identical to the standard and the label d identifies the location of the distinguishing feature of Modification 5. Input of the discriminating feature results in transition from one location in hidden unit space to another.
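With only two groups, canonical discriminant analysis yields a single component which coincides, up to scale, with Fisher's linear discriminant. The sketch below computes that direction for two sets of seven-dimensional vectors and projects the vectors onto it. The data are randomly generated stand-ins for hidden unit activations, and the small ridge term added to the scatter matrix is an assumption made to keep the computation stable; neither comes from the analysis reported above.

```python
import random


def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def within_group_scatter(groups):
    """Within-group scatter matrix, summed over the groups."""
    dim = len(groups[0][0])
    scatter = [[0.0] * dim for _ in range(dim)]
    for vectors in groups:
        centre = mean_vector(vectors)
        for v in vectors:
            dev = [a - b for a, b in zip(v, centre)]
            for i in range(dim):
                for j in range(dim):
                    scatter[i][j] += dev[i] * dev[j]
    return scatter


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def first_canonical_component(group_a, group_b, ridge=1e-6):
    """Direction that maximizes between-group relative to within-group spread."""
    scatter = within_group_scatter([group_a, group_b])
    for i in range(len(scatter)):
        scatter[i][i] += ridge             # assumed ridge term for numerical stability
    diff = [a - b for a, b in zip(mean_vector(group_a), mean_vector(group_b))]
    return solve(scatter, diff)


def project(vectors, direction):
    return [sum(v_i * d_i for v_i, d_i in zip(v, direction)) for v in vectors]


# Hypothetical seven-dimensional "hidden unit" vectors for the two groups.
rng = random.Random(1)
standard = [[rng.gauss(0.0, 0.1) for _ in range(7)] for _ in range(20)]
modified = [[rng.gauss(1.0 if i == 0 else 0.0, 0.1) for i in range(7)]
            for _ in range(20)]

direction = first_canonical_component(modified, standard)
scores_standard = project(standard, direction)
scores_modified = project(modified, direction)
print(min(scores_modified) > max(scores_standard))
```

Plotting such projected scores against time step, one composition at a time, gives a trajectory of the kind shown in Figure 4.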

What has the analysis of the hidden unit activations of the SRN contributed to our understanding of music cognition? In our earlier discussion of the single-layer perceptron, differences in accuracy and reaction time recorded by musically trained and untrained subjects were emulated in fully- and partially-trained networks, respectively. Importantly though, the perceptron algorithm requires that all weighted activations be summed so that the efficient reaction time functions of musically trained subjects are unlikely to be achieved using a single-layer network. A more plausible explanation of the differences between trained and untrained subjects invokes the properties of the SRN as the possible mechanism. It is conceivable that, with training and experience, musically trained subjects acquire knowledge of the higher-order structure of musical compositions and subsequent expectancies of melodic, harmonic, and rhythmic progressions. This knowledge is constructed by the SRN and represented in the hidden unit activations. Additionally, given that the reaction time function of trained subjects can be simulated more closely by the SRN than the perceptron, the influence of recurrent connections must also be acknowledged. Retention of the hidden unit activations from the previous time step in the SRN facilitated speed of recognition. By extrapolation, it may be the case that musically trained listeners, but not untrained listeners, retain or accumulate pitch information from the compositions as they unfold in time.

Extraction of Statistical Regularities and Prediction of Human Performance

The prediction paradigm devised by Elman (1989) is a useful technique for comparing human and network performance. Coupled with a simple recurrent network (SRN), Elman used the paradigm to explore the prediction of linguistic units by an artificial neural network. We have also used an SRN and the prediction paradigm to investigate the way representations of tonal music are constructed during network learning (Stevens & Wiles, 1994). This kind of network, coupled with the prediction paradigm, has some advantages over feedforward architectures such as single- or multi-layer perceptrons. For example, the pattern of activity across the output layer of the network can be interpreted as a probability distribution and used to predict human performance on similar tasks. The network is sensitive to, and learns to represent, the statistical regularities of the training environment. This section details the way in which an SRN, trained to predict the next event in a 153-note melody, sheds light on the regularities inherent in the temporal flow of a musical composition. I also speculate on possibilities for experimentation and prediction of human performance based on the output activations and error values of the network.

Predictive Simple Recurrent Network

The network was a simple recurrent network (SRN) consisting of 25 input units, 25 output units, 20 hidden units, and 20 context units. The SRN was required to predict the next event in the musical sequence; the patterns of activation on the 25-unit output layer were mappings of activation patterns on the 25-unit input layer, one time step later. The model was simulated using a modification of the bp program in the pdp software package (McClelland & Rumelhart, 1989), and learning and momentum rates were set at 0.05 and 0.2, respectively. The training set comprised the 153-note melody of The Blue Danube by Johann Strauss. A record of the developmental performance of the network was made at log steps from Epoch 0 through to Epoch 4096, where an epoch referred to one pass through the melody or 153 training steps. The aim in training the network was to study the way in which effects of temporal transitions between events were incorporated into the representation constructed gradually by the network.
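The prediction paradigm and the recording schedule can be sketched briefly. Two assumptions are made here: that "log steps" means powers of two (4096 is 2 to the power 12), and that the final event, which has no successor, is simply dropped. The text does not state how the target for the final note was defined, so the sketch yields one fewer training pair than there are events. The four-tone sequence is made up.

```python
def prediction_pairs(sequence):
    """Pair each event with its successor: input at step t, target at step t + 1."""
    return list(zip(sequence[:-1], sequence[1:]))


def log_step_epochs(last_epoch):
    """Epoch 0 followed by powers of two up to and including last_epoch."""
    epochs = [0]
    step = 1
    while step <= last_epoch:
        epochs.append(step)
        step *= 2
    return epochs


print(prediction_pairs(["C", "E", "G", "E"]))  # [('C', 'E'), ('E', 'G'), ('G', 'E')]
print(log_step_epochs(4096))
```

Under this reading, network performance would be recorded at 14 points: Epoch 0 and Epochs 1, 2, 4, and so on up to 4096.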

Network Results and Predictions

Output from the SRN at each time step consisted of a distribution of activations across tone, octave, duration and accent units of the output vector. The distribution reflected the likelihood of a component being active given the preceding musical context. During the early stages of training, the network learned to predict the base probabilities of events, and, as expected, it gave evidence of being sensitive to the statistical regularities in the training set by responding initially with the average combination of tone, octave, duration and accent components. With further training, the network learned context-sensitive variations from the mean component activations for each time step.

The distributions of activity across the output vector in the early stages of training demonstrate the statistical learning properties of the network. The statistical regularities extracted and learned by the network are of psychological interest for a number of reasons. First, the mean activations of the tone units across the entire composition are reminiscent of the tonal hierarchy which has been described by both music theorists and psychologists (Krumhansl, 1990; Krumhansl & Shepard, 1979). During the initial epochs of training, the network responds with the most stable and frequently occurring tones, namely tones C, E, and G which comprise the tonic chord. This kind of hierarchical ordering of tones characterizes the relative importance of tones in particular keys and has been shown to influence performance on a range of experimental tasks including judged relatedness of tones, judged key membership, judgments of phrase endings, and patterns of memory confusions (Krumhansl, 1991). The network, therefore, provides a mechanistic account of the way in which statistical properties of music are learned and represented; for example, frequently occurring tones are weighted heavily. The statistical probabilities extracted by the network can also be used for predictive purposes: is it the case that listeners, when exposed initially to a novel musical composition and asked to predict the next tone in a sequence, predict the most frequent or probable tone? Given the distribution of activity across the time-based components, duration and accent, it is also possible to use the network to produce a time-based analogue of the tonal hierarchy. For example, the activation of the duration units during the initial training epochs implies that a quarter note is the most frequent and stable note length in the composition and could be compared with predictions of note length made by musically trained and untrained listeners when assigned a prediction task.
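Why a network that has not yet learned any context should respond with the average pattern can be shown with a small calculation: the single response vector that minimizes squared error over a set of targets is their mean, and for one-of-n coded tones the mean equals the relative frequency of each tone. The sketch below uses four hypothetical tone units and a made-up tone sequence, not the melody used in the simulation.

```python
TONES = ["C", "D", "E", "G"]            # hypothetical output units, one per tone
melody = list("CEGCEGCCDG")             # made-up sequence for illustration only


def one_hot(tone):
    return [1.0 if tone == unit else 0.0 for unit in TONES]


def mean_target(targets):
    n = len(targets)
    return [sum(t[i] for t in targets) / n for i in range(len(targets[0]))]


def tss_for_constant(targets, response):
    """TSS incurred by giving the same response vector at every time step."""
    return sum((t_i - r_i) ** 2 for t in targets for t_i, r_i in zip(t, response))


targets = [one_hot(tone) for tone in melody]
average = mean_target(targets)

print(dict(zip(TONES, average)))        # relative frequency of each tone
print(tss_for_constant(targets, average) < tss_for_constant(targets, [0.25] * 4))
```

In this example the most frequent tone receives the highest mean activation, and responding with the mean yields a lower TSS than responding with equal activation on every unit.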

Network performance can be measured by calculating the total sum of squared error (TSS) at various stages of training, and the mean contribution of each of the four components to the TSS is depicted in Figure 5. In the latter stages of training, the tone, duration and accent TSS errors were close to zero. The relative TSS values can also be used to predict performance. Consider a situation where the TSS value of one of the components, such as accent, decreases rapidly during the initial training epochs. Such a result implies that accent is learned quickly and is salient. We can ask a similar question of human subjects: is it the case that accent information is extracted early when subjects are exposed to novel stimuli? Experimental stimuli could be constructed to compare the relative salience of accent and pitch information in a discrimination task. The comparison would involve two sequences of the same pitch pattern but with differing accent patterns versus two sequences of differing pitch patterns with the same accent pattern. If accent pattern is salient relative to pitch, then we would expect more errors in discriminating between the latter sequences than the former sequences.

Figure 5. Mean sum of squared error shown for each of the four music components as a function of amount of training. The mean was calculated by dividing the raw TSS value for each component by the number of associated output units.
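The per-component error measure described in the caption can be sketched as follows. The layout of units within the output vector is not given in the text, so the eight-unit layout below, along with the target and output values, is hypothetical.

```python
def mean_tss_by_component(targets, outputs, layout):
    """Raw TSS for each component divided by its number of output units."""
    result = {}
    for name, (start, stop) in layout.items():
        raw = sum(
            (t[i] - o[i]) ** 2
            for t, o in zip(targets, outputs)
            for i in range(start, stop)
        )
        result[name] = raw / (stop - start)
    return result


# Hypothetical layout: (first unit, one past last unit) for each component.
layout = {"tone": (0, 3), "octave": (3, 5), "duration": (5, 7), "accent": (7, 8)}

targets = [[1, 0, 0, 1, 0, 0, 1, 1],
           [0, 1, 0, 1, 0, 1, 0, 0]]
outputs = [[0.8, 0.1, 0.1, 0.9, 0.1, 0.5, 0.5, 1.0],
           [0.2, 0.6, 0.2, 1.0, 0.0, 0.5, 0.5, 0.0]]

print(mean_tss_by_component(targets, outputs, layout))
```

Dividing by the number of units makes components of different sizes comparable; in this made-up example duration carries the largest mean error and accent the smallest.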

Conclusions

The present paper has demonstrated the variety of ways in which neural network variables and human experimental data can be compared. One approach involves the development of a model to simulate performance of a human cognitive task such as visual or auditory pattern recognition. Performance values of the network, reflected in unit activations and connection strengths, are equated with dependent variables used as measures of human performance such as accuracy and reaction time. Neural networks can also be used to examine the encoding and interaction of different kinds of sensory, perceptual or cognitive information. For example, we have illustrated the combination of pitch and time-based information in the hidden unit activations and connection strengths of a multi-layer network trained to predict a musical sequence. Finally, the potential for networks to extract and represent the statistical regularities of a particular training environment has been discussed. The latter kind of neural network may not constitute a cognitive mechanism but the acquired statistical regularities may shed light on constraints which underpin human perception and cognition.

Acknowledgments

This research was supported by an Australian Postgraduate Research Award and Australian Research Council Postdoctoral Fellowship. The modification to McClelland & Rumelhart's (1989) bp program was developed by Paul Bakker, Departments of Computer Science and Psychology, University of Queensland.

References

Elman, J. L. (1989). Structured Representations and Connectionist Models (CRL Tech. Rep. No. 8901). San Diego: University of California, Center for Research in Language.

Gluck, M. A., & Bower, G. H. (1988). Evaluating an adaptive network model of human learning. Journal of Memory and Language, 27, 166-195.

Krumhansl, C. L. (1990). Cognitive Foundations of Musical Pitch. Oxford: Oxford University Press.

Krumhansl, C. L. (1991). Music psychology: Tonal structures in perception and memory. Annual Review of Psychology, 42, 277-303.

Krumhansl, C. L., & Shepard, R. N. (1979). Quantification of the hierarchy of tonal functions within a diatonic context. Journal of Experimental Psychology: Human Perception & Performance, 5, 579-594.

McClelland, J. L., & Rumelhart, D. E. (1989). Explorations in Parallel Distributed Processing: A Handbook of Models, Programs and Exercises. Cambridge, Mass.: MIT Press.

Stevens, C. (1992). Derivation and Investigation of Features Mediating Musical Pattern Recognition. Unpublished doctoral dissertation, University of Sydney.

Stevens, C., & Latimer, C. (1992). A comparison of connectionist models of music recognition and human performance. Minds and Machines, 2, 379-400.

Stevens, C., & Latimer, C. (1993). Recognition of short tonal compositions by connectionist models and listeners: Effects of feature manipulation and training. Musikometrika-5, in press.

Stevens, C., & Wiles, J. (1994). Representations of tonal music: A case study in the development of temporal relationships. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, & A. S. Weigend (Eds.), Proceedings of the 1993 Connectionist Models Summer School (pp. 228-235). Hillsdale, NJ: Erlbaum Associates.

Wiles, J., & Bloesch, A. (1992). Operators and curried functions: Training and analysis of simple recurrent networks. In J. E. Moody, S. J. Hanson, & R. P. Lippmann (Eds.), Advances in Neural Information Processing Systems 4. San Mateo, CA: Morgan Kaufmann.