Research Article: Words from spontaneous conversational speech can be recognized with human-like accuracy by an error-driven learning algorithm that discriminates between meanings straight from smart acoustic features, bypassing the phoneme as recognition unit

Date Published: April 10, 2017

Publisher: Public Library of Science

Author(s): Denis Arnold, Fabian Tomaschek, Konstantin Sering, Florence Lopez, R. Harald Baayen, Hedderik van Rijn.


Sound units play a pivotal role in cognitive models of auditory comprehension. The general consensus is that during perception listeners break down speech into auditory words and subsequently phones. Indeed, cognitive speech recognition is typically taken to be computationally intractable without phones. Here we present a computational model trained on 20 hours of conversational speech that recognizes word meanings within the range of human performance (model 25%, native speakers 20–44%), without making use of phone or word form representations. Our model also generates successfully predictions about the speed and accuracy of human auditory comprehension. At the heart of the model is a ‘wide’ yet sparse two-layer artificial neural network with some hundred thousand input units representing summaries of changes in acoustic frequency bands, and proxies for lexical meanings as output units. We believe that our model holds promise for resolving longstanding theoretical problems surrounding the notion of the phone in linguistic theory.

Partial Text

The invention of alphabetic writing systems has deeply influenced western reflection on language and language processing [1]. Just as letters make up written words, spoken words are assumed to consist of sequences of speech sounds (phones), the universal building blocks of language [2]. However, acoustic realizations of phones and words are known to be extremely variable within and across speakers. Nevertheless, it is generally accepted that the understanding of spoken words hinges on the identification of phones. In linguistics, psycholinguistics, and cognitive science, it is widely assumed that the only way in which the extreme variability in the speech signal can be dealt with is by funneling speech comprehension through abstract phone representations or feature bundles derived thereof [3, 4].

Recognition accuracy ranged from 40.6% to 98.8% (mean 72.6%), dictation accuracy ranged from 20.8% to 44.0% (mean 32.6%). Recognition accuracy and dictation accuracy were not correlated (r = 0.19, t38 = 1.17, p = 0.25). Dictation accuracy provides a more precise approximation of human identification performance than the self-reported recognition accuracy measure, which emerges as overly optimistic.




0 0 vote
Article Rating
Notify of
Inline Feedbacks
View all comments