Date Published: May 24, 2019
Publisher: Public Library of Science
Author(s): Sabah Al-Hameed, Mohammed Benaissa, Heidi Christensen, Bahman Mirheidari, Daniel Blackburn, Markus Reuber, Stephen D Ginsberg.
Neurodegenerative diseases causing dementia are known to affect a person’s speech and language. Part of the expert assessment in memory clinics therefore routinely focuses on detecting such features. Current outpatient procedures examining patients’ verbal and interactional abilities mainly focus on verbal recall, word fluency, and comprehension. By capturing neurodegeneration-associated characteristics in a person’s voice, novel methods based on the automatic analysis of speech signals may provide additional information about a person’s ability to interact, which could contribute to the diagnostic process. In this proof-of-principle study, we demonstrate that purely acoustic features, extracted from recordings of patients’ answers to a neurologist’s questions in a specialist memory clinic, can support the initial distinction between patients presenting with cognitive concerns attributable to progressive neurodegenerative disorders (ND) and those with Functional Memory Disorder (FMD, i.e., subjective memory concerns unassociated with objective cognitive deficits or a risk of progression). The study involved 15 FMD and 15 ND patients, and a total of 51 acoustic features were extracted from the recordings. Feature selection was used to identify the most discriminating features, which were then used to train five different machine learning classifiers to differentiate between the FMD and ND classes, achieving a mean classification accuracy of 96.2%. Such purely acoustic approaches could be integrated into diagnostic pathways for patients presenting with memory concerns and are computationally less demanding than methods focusing on linguistic elements of speech and language that require automatic speech recognition and understanding.
Memory complaints are common, increase with age, and are a major reason for primary care consultations. There is an increasing emphasis on earlier diagnosis of neurodegenerative disorders, as evolving treatments are likely to be more effective before irreversible changes have occurred in the brain [1, 2]. The drive to seek early diagnostic clarification has led to an over 600% increase in referrals to secondary care memory clinics in the UK over the last ten years and has generated considerable pressure on diagnostic pathways. Although these dramatic changes have increased the number of patients in whom neurodegenerative disorders have been identified, a large proportion of the patients now referred to specialist memory clinics actually have functional (non-progressive) memory concerns without objective evidence of cognitive deficits. Improvements to stratification and screening procedures would therefore be highly desirable and could enable better targeting of limited health care resources. However, the early identification of patients with neurodegenerative disorders is a challenging task due to a lack of accurate predictive biomarkers suitable for routine screening or stratification. Biomarkers capable of pre-symptomatically identifying patients at high risk of developing the commonest cause of progressive cognitive decline, Alzheimer’s disease (AD), do exist, but they are either expensive and only available in very few centers (e.g. amyloid Positron Emission Tomography) or invasive (e.g. amyloid and tau testing in the cerebrospinal fluid) and therefore not suitable for screening at the interface between primary and specialist care.
The system is intended as an early stratification tool for patients presenting with progressive ND-related cognitive problems, based solely on diagnostic acoustic features in patients’ speech. As illustrated in Fig 1, it consists of three main stages: pre-processing, feature extraction, and machine-learning-based classification.
In the machine learning community, cross-validation is widely used as an effective method of model selection that achieves a robust performance evaluation and prevents over-fitting. We used k-fold cross-validation with k = 5 to partition the data into five equal parts called “folds”. The model was trained using four out of five folds and tested with the remaining fifth fold. This step was repeated k = 5 times until all folds had been used in the training and testing process. This, however, did not generate the validation set directly. Instead, we used the nested k-fold cross-validation method, which uses two k-fold loops, namely an outer and an inner loop. The outer loop generates the testing (1/5 of the data) and training (4/5 of the data) folds, while the inner loop takes all the training folds combined (coming from the outer loop) and generates the validation and training folds. Fig 2 shows the design of the nested 5-fold cross-validation. Feature selection and the models’ hyper-parameter tuning were explored in the inner loop, and the model with the best features and best parameters was tested on the outer test folds. This process runs through all the loops, and the final result is reported as the average of the best model’s scores across the outer test folds. Importantly, each fold had to contain a balanced number of samples from the two classes so as not to skew the output towards one class. We used Scikit-learn libraries to perform this task.
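The nested scheme described above can be sketched with Scikit-learn's cross-validation utilities. The synthetic feature matrix, the SVM estimator, and the hyper-parameter grid below are placeholders for illustration, not the study's actual data or search space.

```python
# Nested 5-fold cross-validation: the inner loop tunes hyper-parameters,
# the outer loop estimates generalization performance on held-out folds.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Placeholder data standing in for 30 recordings x 51 acoustic features.
X, y = make_classification(n_samples=30, n_features=51, random_state=0)

inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # validation folds
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # test folds

# Inner loop: grid search over hyper-parameters on the combined training folds.
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}, cv=inner)

# Outer loop: each outer test fold scores the best model found by the inner loop.
scores = cross_val_score(search, X, y, cv=outer)
print(f"nested CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

`StratifiedKFold` keeps the class proportions equal in every fold, which matches the balanced-folds requirement described above.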
The results of the study suggest that machine learning models based on analyses of acoustic data from patients with cognitive complaints are capable of detecting differences between the two classes, ND and FMD, in keeping with prior research [7–9, 14, 16]. We explored the discriminating potential of acoustic features using five different classification algorithms (SVM, random forest, AdaBoost, multi-layer perceptron, and SGD) and tested our findings using the validation procedure described above. The best models’ results are listed in Tables 5 and 6, obtained under both scenarios: the original dataset of 30 samples and the augmented dataset of 230 samples. With the augmented dataset, the average results of all models improved, regardless of the feature selection method (Tables 5 and 6). All models scored 97% accuracy except AdaBoost, which reached a maximum of 93% when the original dataset of 30 recordings was used. The number of features used by each model is smaller with the wrapper and embedded approaches than with the statistical ranking approach; for example, the SVM wrapper model needed only 9 of the 22 ranked features from Table 4, compared to 11 features when the ranking was based on statistical significance. These differences were expected: the wrapper and embedded methods use the classifier’s scores to identify the feature set that maximizes performance, whereas the analytical approach supplies an un-optimized set of features from which the models must reach their maximum accuracies.
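The wrapper approach can be illustrated with Scikit-learn's `SequentialFeatureSelector`, which greedily keeps the features that most improve the classifier's cross-validated score. The linear SVM, the synthetic data, and the choice of forward selection are assumptions made for this sketch; the study's exact wrapper procedure is not specified here.

```python
# Wrapper-style feature selection: candidate feature subsets are scored by
# the classifier itself, so the retained set is tuned to that classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.svm import SVC

# Placeholder data standing in for the 22 ranked acoustic features.
X, y = make_classification(n_samples=30, n_features=22, random_state=0)

selector = SequentialFeatureSelector(
    SVC(kernel="linear"),    # the classifier whose CV score guides the search
    n_features_to_select=9,  # e.g. the 9 features the SVM wrapper model retained
    direction="forward",
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```

By contrast, a purely statistical ranking (e.g. selecting by significance tests alone) never consults the classifier, which is why it can leave an un-optimized feature set.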
This study has shown that automatic speech analysis technology focusing on acoustic features in patients’ speech could be a valuable complementary method in the diagnostic pathway of patients presenting with cognitive concerns. We aimed to build a machine learning model that learns from our data and is able to predict the cognitive status of patients referred to a specialist memory clinic. In this study we used a binary classification: FMD or ND. The highest classification accuracy reached 97%, achieved by four machine learning diagnostic models: SVM, random forest, multi-layer perceptron, and SGD. The most discriminant features utilized by the models include ratios and statistics of pauses and utterances, which aligns with the literature. ND patients’ speech has previously been found to be characterized by an increased number and duration of pauses as well as a reduction in the number of utterances, which may be caused by difficulty in word finding (lexical retrieval). Similarly, Singh et al. and Roark et al. reported that the mean duration of both pauses and speech is useful in discriminating healthy subjects from MCI and AD patients. Other features of discriminating value in our study included the number and degree of voice breaks, which aligns with findings previously reported by Meilán et al.
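To make the pause-related features concrete, the sketch below derives simple pause statistics from a waveform with a frame-energy threshold. The frame size, threshold, and synthetic signal are invented for illustration and do not reproduce the study's actual feature extraction.

```python
# Illustrative pause-statistics extraction via a simple RMS-energy threshold.
import numpy as np

def pause_stats(signal, sr, frame_ms=25, threshold=0.02):
    """Return (n_pauses, mean_pause_s, pause_ratio) from frame energies."""
    frame = int(sr * frame_ms / 1000)
    n = len(signal) // frame
    energies = np.array([np.sqrt(np.mean(signal[i * frame:(i + 1) * frame] ** 2))
                         for i in range(n)])
    silent = energies < threshold  # frames counted as pause
    # Group consecutive silent frames into pause runs.
    edges = np.diff(np.concatenate(([0], silent.astype(int), [0])))
    starts, ends = np.where(edges == 1)[0], np.where(edges == -1)[0]
    durations = (ends - starts) * frame / sr
    mean_pause = durations.mean() if len(durations) else 0.0
    return len(durations), mean_pause, silent.mean()

# Synthetic "speech": 1 s tone, 0.5 s silence, 1 s tone, at 16 kHz.
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
signal = np.concatenate([tone, np.zeros(sr // 2), tone])
print(pause_stats(signal, sr))  # one pause of 0.5 s; 20% of frames silent
```

From such pause runs one can compute counts, durations, and pause-to-speech ratios of the kind the models found discriminative.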
The results of this study lead to several conclusions. First, a relatively small number of extracted acoustic features are shown to be of great importance in differentiating between ND and FMD. These features are likely related to changes in the neurobiology associated with a given neurodegenerative cognitive disorder, reflected in the acoustic output. Secondly, the proposed approach can be easily deployed in clinics during standard clinical encounters; it requires only minimal effort on the part of the examiner and means a much quicker diagnosis for the examinee. Finally, despite the limitations of this study, our findings show that acoustic-only features offer a potentially low-cost and simple alternative to more complex features requiring automatic speech recognition, part-of-speech parsing, and speech understanding in the automated screening or stratification of patients with cognitive complaints. Hence the approach has great potential for use early in pathways that assess people with cognitive complaints [4, 68].