Research Article: Severity-Based Adaptation with Limited Data for ASR to Aid Dysarthric Speakers

Date Published: January 23, 2014

Publisher: Public Library of Science

Author(s): Mumtaz Begum Mustafa, Siti Salwah Salim, Noraini Mohamed, Bassam Al-Qatab, Chng Eng Siong, Joel Snyder.


Automatic speech recognition (ASR) is currently used in many assistive technologies, for example to help individuals with speech impairment communicate. One challenge in ASR for speech-impaired individuals is the difficulty of obtaining a good speech database of impaired speakers for building an effective acoustic model. Because existing databases of impaired speech are few and limited in size, the obvious way to build an acoustic model of impaired speech is to employ adaptation techniques. However, existing studies on adaptation for speech impairment have not addressed two issues: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates these two issues for dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model with well-known adaptation techniques such as maximum likelihood linear regression (MLLR) and constrained MLLR (C-MLLR). The recognition accuracy of each impaired-speech acoustic model is measured in terms of word error rate (WER), with further assessments including phoneme insertion, substitution and deletion rates. Unimpaired speech, when combined with limited high-quality speech-impaired data, improves the performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR in recognising mildly and moderately impaired speech, based on the statistical analysis of the WER. Phoneme substitution was found to be the biggest contributor to WER in dysarthric speech at all levels of severity.
The results show that the speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data.

Partial Text

Speech is second nature for most of us, to the extent that we cannot imagine what life would be like without it, as speech communication is a vital skill in our society. Inability to communicate verbally is a serious disability that can drastically affect a person’s life. Speech impairment deprives a person of the ability to communicate with others, and severe speech impairment can be frustrating for both sufferers and listeners.

This research experiments with different adaptation techniques and source models that can be suitably applied for the optimum performance of an ASR system in recognising dysarthric speech. An adapted acoustic model for recognising the dysarthric speech of the Nemours database was built by adapting two speaker-independent (SI) models, one based on the unimpaired TIMIT speech [37] and one on the impaired TORGO speech [23]; the aim was to assess the suitability of these two SI models as source models and to identify any emerging differences between them. The performance of the ASR system under each adaptation technique (MLLR and C-MLLR) and each source model (unimpaired and impaired speech) is measured in terms of the word error rate (WER) for each level of severity of impaired speech (mild, moderate and severe). We have performed a statistical analysis to determine any significant difference in the variance of WER. This section describes the databases, research methodology, performance measures and equipment involved in our experiments.
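As an illustration of the performance measure, WER is conventionally the number of substitutions, deletions and insertions in a minimum-edit-distance alignment, divided by the number of tokens in the reference transcript. The following is a minimal sketch of that computation, not the scoring tool used in the study; the names `ErrorCounts` and `align_counts` are hypothetical, and the same function can be applied to phoneme sequences to obtain phoneme-level error counts.

```python
from typing import NamedTuple, Sequence


class ErrorCounts(NamedTuple):
    """Error breakdown of one reference/hypothesis alignment (hypothetical helper)."""
    substitutions: int
    deletions: int
    insertions: int
    reference_length: int

    @property
    def wer(self) -> float:
        # WER = (S + D + I) / N, where N is the reference length
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.reference_length


def align_counts(reference: Sequence[str], hypothesis: Sequence[str]) -> ErrorCounts:
    """Count substitutions, deletions and insertions via Levenshtein alignment."""
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one token")
    # cost[i][j] = (total errors, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[(0, 0, 0, 0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)      # every reference token deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)      # every hypothesis token inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (e, s, d, ins)              # match, no error
            else:
                diag = (e + 1, s + 1, d, ins)      # substitution
            e, s, d, ins = cost[i - 1][j]
            delete = (e + 1, s, d + 1, ins)        # reference token missing
            e, s, d, ins = cost[i][j - 1]
            insert = (e + 1, s, d, ins + 1)        # extra hypothesis token
            cost[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    _, s, d, ins = cost[n][m]
    return ErrorCounts(s, d, ins, n)


# Made-up example sentences, for illustration only
counts = align_counts("the cat sat on the mat".split(),
                      "the cat sit on mat".split())
print(counts, counts.wer)
```

In this made-up example the alignment contains one substitution ("sat" recognised as "sit") and one deletion (the second "the"), so the WER is 2/6. When several alignments share the minimum total cost, the split between error types depends on the tie-breaking order, which is one reason different scoring tools can report slightly different breakdowns.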

The recognition of dysarthric speech using the TIMIT SI model is more accurate for the mildly impaired speech (for both adaptation techniques), while the one based on TORGO performs well in recognising the moderately impaired and the severely impaired speech (for both adaptation techniques) except for experiment MOD-A. The C-MLLR technique shows a lower WER than the MLLR technique for both the TIMIT and TORGO adapted models. Table 4 presents the WERs in recognising the dysarthric speech of Nemours.

In this research, we have determined the performance of the ASR system in recognising impaired speech; the target model was adapted using source models built from both the unimpaired speech of TIMIT and the impaired speech of TORGO. The WERs of the two source models differ, with TORGO being better for recognising severely dysarthric speech while TIMIT is better for recognising mildly dysarthric speech.

The biggest setback to the development of an ASR system for impaired speech is the small amount of speech data that can be acquired from a speaker with speech impairment. As such, it is vital for developers of ASR systems for impaired speech to seek alternative means for such development. Although the acoustic characteristics of unimpaired and impaired speech are indeed very different, the acoustic model of the former can be used as a source model for adapting to the targeted impaired speech. The performance of the unimpaired speech acoustic model can be further improved using an effective adaptation technique. In this research, it was found that the C-MLLR technique performs better than MLLR when using an unimpaired speech model.
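For readers unfamiliar with the two techniques compared above, the difference can be summarised with the standard textbook formulation; this is not taken from the article, which does not state which MLLR variant was used. The symbols are assumptions for illustration: \mu and \Sigma are the mean and covariance of a Gaussian in the source model, o is an acoustic feature vector, and the transform parameters (A, b) or (A_c, b_c) are estimated from the adaptation data by maximum likelihood and shared across the Gaussians of a regression class.

```latex
% MLLR (common mean-only form): an affine transform of the Gaussian means
\hat{\mu} = A\mu + b, \qquad \hat{\Sigma} = \Sigma

% C-MLLR: one matrix constrains both the mean and the covariance
\hat{\mu} = A_c\mu - b_c, \qquad \hat{\Sigma} = A_c \Sigma A_c^{\top}

% Equivalent feature-space view of C-MLLR, with A = A_c^{-1} and b = A_c^{-1} b_c
\log \mathcal{N}(o;\hat{\mu},\hat{\Sigma})
  = \log \mathcal{N}(Ao + b;\mu,\Sigma) + \log\lvert\det A\rvert
```

Because C-MLLR can be applied as a single transform of the features, the source model itself is left unchanged, and the covariances are adapted together with the means without estimating a separate variance transform.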