Date Published: November 29, 2018
Publisher: Public Library of Science
Author(s): Meng Dong, Xuhui Huang, Bo Xu, Changsong Zhou.
Speech recognition (SR) has been improved significantly by artificial neural networks (ANNs), but ANNs have the drawbacks of biological implausibility and excessive power consumption because of the nonlocal transfer of real-valued errors and weights. Spiking neural networks (SNNs) have the potential to overcome these drawbacks because of their efficient spike communication and their natural use of the synaptic plasticity rules found in the brain for weight modification. However, existing SNN models for SR either performed poorly or were trained in biologically implausible ways. In this paper, we present a biologically inspired convolutional SNN model for SR. The network adopts the time-to-first-spike coding scheme for fast and efficient information processing. A biological learning rule, spike-timing-dependent plasticity (STDP), is used to adjust the synaptic weights of convolutional neurons to form receptive fields in an unsupervised way. In the convolutional structure, a local weight sharing strategy is introduced, which could lead to better feature extraction from speech signals than global weight sharing. We first evaluated the SNN model with a linear support vector machine (SVM) on the TIDIGITS dataset, where it achieved an accuracy of 97.5%, comparable to the best results of ANNs. In-depth analysis of the network outputs showed that the output data are not only more linearly separable but also lower-dimensional and sparse. To further confirm the validity of our model, we trained it on a more difficult recognition task based on the TIMIT dataset, where it achieved a high accuracy of 93.8%. Moreover, a linear spike-based classifier, the tempotron, can also achieve accuracies very close to those of the SVM on both tasks. These results demonstrate that an STDP-based convolutional SNN model equipped with local weight sharing and temporal coding is capable of solving the SR task accurately and efficiently.
Automatic speech recognition is the ability of a machine to recognize and translate spoken language into text. It is a challenging task because the speech signal is highly variable due to different speaker characteristics, varying speaking speed, and background noise. In recent years, artificial neural networks (ANNs), especially deep neural networks, have outperformed traditional Gaussian mixture models and become the predominant method in speech recognition.
Our network model consists of three layers, as illustrated in Fig 1. The input layer converts the speech signal into spikes using the time-to-first-spike coding scheme, the convolutional layer learns acoustic features from the input using the STDP learning rule, and the pooling layer compresses the information while providing translation invariance. The details of each layer are explained in the following sections.
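The time-to-first-spike idea used by the input layer can be illustrated with a minimal sketch. It assumes a simple linear mapping in which the strongest input fires first and the weakest fires last. The function name `time_to_first_spike`, the parameter `t_max`, and the linear form are illustrative assumptions, not the exact encoding used in the model.

```python
from typing import List


def time_to_first_spike(values: List[float], t_max: float = 1.0) -> List[float]:
    """Map each input intensity to one spike time in [0, t_max].

    Assumption (not taken from the paper): a linear mapping in which the
    largest value spikes at t = 0 and the smallest at t = t_max.
    Each input neuron emits at most one spike.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All inputs are equal, so they all fire together at t = 0.
        return [0.0 for _ in values]
    # Stronger inputs get smaller (earlier) spike times.
    return [t_max * (hi - v) / (hi - lo) for v in values]


# Three hypothetical feature intensities encoded within a 30 ms window.
spike_times = time_to_first_spike([0.0, 5.0, 10.0], t_max=30.0)
print(spike_times)
```

Because each input emits a single spike, the information is carried by spike timing rather than firing rate, which is what makes the scheme fast and sparse.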
Our model was evaluated on the task of speaker-independent recognition of isolated spoken words with the TIDIGITS dataset and the TIMIT dataset. In this section, we first report the performance of our SNN model with an SVM as the classifier and compare it with the performance of other SNN and ANN models. Next, we validate the advantage of the local weight sharing strategy. Then, we analyze the transformation of the receptive fields of the convolutional neurons and the characteristics of the network output to understand why our model works so well. Finally, we show that our SNN model also works well with a spike-based classifier, taking the tempotron as an example.
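The difference between global and local weight sharing can be sketched with a one-dimensional convolution. In this sketch, the output positions are split into equal sections and each section has its own kernel; with a single kernel, the same function reduces to global weight sharing. The function name `conv1d_sectioned`, the equal-section split, and the one-dimensional setting are illustrative assumptions and may differ from the model's actual layout.

```python
from typing import List


def conv1d_sectioned(x: List[float], kernels: List[List[float]]) -> List[float]:
    """Valid 1-D convolution (correlation) with one kernel per section.

    Assumption (illustrative only): output positions are divided into
    len(kernels) contiguous sections of roughly equal size. Positions in
    the same section share a kernel; different sections use different ones.
    """
    k = len(kernels[0])
    n_out = len(x) - k + 1
    n_sections = len(kernels)
    out = []
    for i in range(n_out):
        # Index of the section that output position i belongs to.
        section = min(i * n_sections // n_out, n_sections - 1)
        w = kernels[section]
        out.append(sum(w[j] * x[i + j] for j in range(k)))
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
# Global sharing: one kernel for every position.
global_out = conv1d_sectioned(signal, [[1.0, 0.0]])
# Local sharing: first half and second half use different kernels.
local_out = conv1d_sectioned(signal, [[1.0, 0.0], [0.0, 1.0]])
print(global_out, local_out)
```

With local sharing, each section can specialize on the features that typically occur in its part of the input, while neurons inside a section still share weights.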
Spiking neural networks have been gradually drawing attention due to their potential to solve ANNs’ problems of biological implausibility and computational intensity. However, it is not easy to train an SNN well for typical pattern recognition tasks, and various training methods have been proposed previously. Many studies chose to train a traditional ANN instead and convert it to an SNN by replacing each rate-based neuron with a spiking neuron [15, 16, 55–58]. Although these models showed good performance on pattern recognition tasks, the problem of training an SNN was actually bypassed. Some researchers used differentiable formulations of SNNs, so they could train them with backpropagation directly [14, 59]. With this approach, the training algorithm searches a larger solution space and can achieve better performance. However, these methods are not biologically plausible, since there is no evidence that error backpropagation could happen in the brain. In contrast, our model uses the STDP rule observed in biological synapses to train the SNN. In particular, since STDP is a local and unsupervised learning rule, the training process does not need any label information. Thus, our SNN model is able to utilize large amounts of unlabeled data, which are less expensive and easier to obtain than labeled data. Moreover, a simple linear classifier (linear SVM or spike-based tempotron) is sufficient to classify the STDP-trained data with high accuracy. This reveals the powerful ability of our model to extract input features in a more biologically realistic way.
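To make the "local and unsupervised" property concrete, the following is a minimal sketch of one commonly used simplified STDP update. The update depends only on the spike times of the two neurons connected by the synapse and on the current weight, so no label or global error signal is involved. The rule form, the learning rates `a_plus` and `a_minus`, and the function name `stdp_update` are illustrative assumptions, not values or definitions taken from the paper.

```python
def stdp_update(w: float, t_pre: float, t_post: float,
                a_plus: float = 0.01, a_minus: float = 0.0075) -> float:
    """Return the updated weight of a single synapse.

    Assumed simplified rule (illustrative only):
      - presynaptic spike at or before the postsynaptic spike -> potentiate
      - presynaptic spike after the postsynaptic spike        -> depress
    The factor w * (1 - w) keeps the weight softly bounded in [0, 1].
    """
    if t_pre <= t_post:
        dw = a_plus * w * (1.0 - w)
    else:
        dw = -a_minus * w * (1.0 - w)
    # Hard clip as a safeguard against numerical drift.
    return min(1.0, max(0.0, w + dw))


# The input that fired before the output neuron is strengthened,
# the input that fired after it is weakened.
w_early = stdp_update(0.5, t_pre=1.0, t_post=2.0)
w_late = stdp_update(0.5, t_pre=3.0, t_post=2.0)
print(w_early, w_late)
```

Repeating such updates over many unlabeled inputs drives each weight toward 0 or 1, which is one way receptive fields can emerge without supervision.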
To provide an alternative speech recognition solution to ANNs, which are biologically implausible and energy-intensive, we proposed an STDP-based SNN model with the time-to-first-spike coding scheme and the local weight sharing strategy. It achieves high accuracy on two speech recognition tasks. By adopting the STDP learning rule and the temporal coding scheme, our SNN is able to learn acoustic features quickly and efficiently and to make the speech data low-dimensional, sparse, and more linearly separable. Compared with global weight sharing, the proposed local weight sharing is more suitable for learning the features of speech signals. Moreover, our model achieves performance comparable to traditional ANN approaches when using an SVM as the classifier, and it also works well with a spike-based classifier, the tempotron. Therefore, in practice, owing to its spike-based computation, our model with the tempotron can be implemented easily on neuromorphic chips as a speech recognition solution with ultra-low power consumption. In summary, our study shows that a biologically plausible SNN model equipped with STDP, local weight sharing, and temporal coding is able to solve speech recognition tasks efficiently.