Date Published: October 10, 2018
Publisher: Public Library of Science
Author(s): Doroteo T. Toledano, María Pilar Fernández-Gallego, Alicia Lozano-Diez, Ewan Dunbar.
Speech analysis for Automatic Speech Recognition (ASR) systems typically starts with a Short-Time Fourier Transform (STFT), which implies selecting a fixed point in the time-frequency resolution trade-off. This approach, combined with a Mel-frequency scaled filterbank and a Discrete Cosine Transform, gives rise to the Mel-Frequency Cepstral Coefficients (MFCC), which have been the most common features in speech processing over the last decades. These features were particularly well suited to the previous state of the art in ASR, based on Hidden Markov Models and Gaussian Mixture Models (HMM/GMM). In particular, they produced highly uncorrelated features of small dimensionality (typically 13 coefficients plus deltas and double deltas), which was very convenient for diagonal-covariance GMMs, for dealing with the curse of dimensionality and for the limited computing resources of a decade ago. Currently, most ASR systems use Deep Neural Networks (DNN) instead of GMMs to model the acoustic features, which provides more flexibility in the definition of the features. In particular, acoustic features can be highly correlated and can be much larger in size, because DNNs are very powerful at processing high-dimensional inputs. Also, computing hardware has evolved to the point where computational cost is a less relevant issue in speech processing. In this context, we have decided to revisit the problem of time-frequency resolution in speech analysis, and in particular to check whether multi-resolution speech analysis (both in time and frequency) can help improve acoustic modeling using DNNs. Our experiments start with several Kaldi baseline systems for the well-known TIMIT corpus and modify them to use multi-resolution speech representations, obtained by concatenating spectra computed at different time-frequency resolutions and by concatenating post-processed and speaker-adapted features computed at different time-frequency resolutions.
Our experiments show that using a multi-resolution speech representation tends to improve over results using the baseline single resolution speech representation, which seems to confirm our main hypothesis. However, results combining multi-resolution with the highly post-processed and speaker-adapted features, which provide the best results in Kaldi for TIMIT, yield only very modest improvements.
Automatic speech recognition (ASR) aims at converting speech signals into textual representations. It is an essential part of data analysis applications that process multimedia (audio/video) content, such as keyword spotting and speaker detection, and of applications that use voice in human-machine interfaces, such as intelligent personal assistants, interactive voice response (IVR) systems and voice search, to name a few.
Deep neural networks (DNN) are machine learning tools which allow for the learning of complex non-linear multidimensional functions of a given input in order to minimize an error cost. A graphical example of a standard deep neural network is presented in Fig 4.
This section describes the dataset, the evaluation metrics and the tools used for the experimental part of this research paper.
The baseline systems chosen for our experiments are the standard Kaldi DNN recipes for TIMIT included in the Kaldi distribution. In all cases (both for baseline systems and the systems proposed in this article) the recipe starts with a common training procedure for the HMM/GMM system, which includes:
We are interested in experimenting with DNNs fed with input features that include speech representations at different time-frequency resolutions, to verify our hypothesis that this could improve acoustic-phonetic modeling. Perhaps the easiest and most direct way to do this is to conduct several STFT analyses of the speech signal and combine the resulting spectra into a single feature vector that is used as input to the DNNs. In this way, we only modify the input to the network, and we keep the rest of the parameters of the baseline systems described in the section "Baseline systems".
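As an illustration of this kind of concatenation, the following minimal sketch computes log power spectra with three analysis windows and stacks them frame by frame into one feature vector. The window lengths (10, 25 and 50 ms), the 10 ms frame shift, the Hamming window, the FFT sizes and all function names are assumptions chosen for illustration; they are not taken from the Kaldi recipes or from the configurations used in the experiments. Frames are centered on a common time grid so that every resolution yields the same number of frames.

```python
import numpy as np

def centered_log_spectrum(signal, sample_rate, win_ms, hop_ms):
    """Log power spectrum with frames centered every hop_ms (hypothetical helper)."""
    win_len = int(round(sample_rate * win_ms / 1000.0))
    hop_len = int(round(sample_rate * hop_ms / 1000.0))
    # Zero-pad so that frame i is centered on sample i * hop_len of the input.
    half = win_len // 2
    padded = np.pad(signal, (half, win_len - half))
    n_frames = 1 + len(signal) // hop_len
    frames = np.stack([padded[i * hop_len: i * hop_len + win_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(win_len)
    # FFT size: next power of two that holds the window (an assumption).
    n_fft = 1 << (win_len - 1).bit_length()
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    # Small constant avoids log(0) on silent frames.
    return np.log(np.abs(spectrum) ** 2 + 1e-10)

def multi_resolution_features(signal, sample_rate,
                              windows_ms=(10.0, 25.0, 50.0), hop_ms=10.0):
    """Concatenate spectra from several window lengths along the feature axis."""
    spectra = [centered_log_spectrum(signal, sample_rate, w, hop_ms)
               for w in windows_ms]
    return np.concatenate(spectra, axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(16000)  # 1 s of synthetic noise at 16 kHz
    feats = multi_resolution_features(audio, 16000)
    print(feats.shape)  # (frames, 129 + 257 + 513 spectral bins)
```

Because all resolutions share the same frame shift, each row of the result describes the same instant with a short, a medium and a long window, and only the input dimensionality of the network changes.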
Normally, automatic speech recognition starts with a Short-Time Fourier Transform (STFT), which defines a fixed point in the time-frequency resolution trade-off. This approach, traditionally followed by the calculation of Mel-Frequency Cepstral Coefficients (MFCC), produced reasonably uncorrelated features and was very well suited to the old state of the art in Automatic Speech Recognition (ASR), dominated by the use of Hidden Markov Models (HMMs) to model the speech dynamics and Gaussian Mixture Models (GMMs) with diagonal covariance matrices to model the features extracted from a speech frame. Nowadays, one of the most commonly used frameworks in practical ASR systems consists of adopting Deep Neural Networks (DNN) as a replacement for the GMMs, giving rise to hybrid HMM/DNN systems. In these systems, since the acoustic features are fed directly as input to a DNN, several restrictions on the speech features have vanished. For instance, input features do not need to be uncorrelated, and we have more freedom to enlarge the input feature vector because DNNs cope better with the curse of dimensionality.
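To make the trade-off concrete, the short sketch below relates the analysis window length to the spacing between DFT bins, assuming a 16 kHz sampling rate (the rate of the TIMIT recordings) and no zero padding. The function name and the three window lengths are hypothetical, and the calculation ignores the additional spectral smearing introduced by the window shape.

```python
def stft_resolution(sample_rate_hz, window_ms):
    """Return (window length in samples, DFT bin spacing in Hz).

    Without zero padding, the bin spacing equals the inverse of the
    window duration, so a longer window gives finer frequency detail
    but averages over a longer stretch of time.
    """
    n_samples = round(sample_rate_hz * window_ms / 1000)
    bin_spacing_hz = sample_rate_hz / n_samples
    return n_samples, bin_spacing_hz

if __name__ == "__main__":
    for window_ms in (10, 25, 50):  # illustrative window lengths
        n_samples, bin_hz = stft_resolution(16000, window_ms)
        print(f"{window_ms} ms window: {n_samples} samples, "
              f"{bin_hz:.0f} Hz between bins")
```

Under these assumptions, a 10 ms window separates bins by 100 Hz, a 25 ms window by 40 Hz and a 50 ms window by 20 Hz, so any single choice fixes one point on the trade-off.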