Research Article: Phonetic acquisition in cortical dynamics, a computational approach

Date Published: June 7, 2019

Publisher: Public Library of Science

Author(s): Dario Dematties, Silvio Rizzi, George K. Thiruvathukal, Alejandro Wainselboim, B. Silvano Zanutto, Wojciech Samek.


Many computational theories have been developed to improve artificial phonetic classification performance from linguistic auditory streams. However, less attention has been paid to psycholinguistic data and to neurophysiological features recently found in cortical tissue. We focus on a context in which basic linguistic units, such as phonemes, are extracted and robustly classified by humans and other animals from complex acoustic streams in speech data. We are especially motivated by the fact that 8-month-old human infants can segment words from fluent audio streams based exclusively on the statistical relationships between neighboring speech sounds, without any kind of supervision. In this paper, we introduce a biologically inspired and fully unsupervised neurocomputational approach that incorporates key neurophysiological and anatomical cortical properties, including columnar organization, spontaneous micro-columnar formation, adaptation to contextual activations, and Sparse Distributed Representations (SDRs) produced by means of partial N-Methyl-D-aspartic acid (NMDA) depolarization. Its feature-abstraction capabilities show promising phonetic invariance and generalization attributes. Our model improves the performance of a Support Vector Machine (SVM) classifier on monosyllabic, disyllabic, and trisyllabic word classification tasks in the presence of environmental disturbances such as white noise, reverberation, and pitch and voice variations. Furthermore, our approach emphasizes potential self-organizing cortical principles, achieving this improvement without any optimization guidance that minimizes a hypothetical loss function by means of, for example, backpropagation. Thus, our computational model outperforms multiresolution spectro-temporal auditory feature representations using only the statistical sequential structure embedded in the phonotactic rules of the input stream.
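The abstract names two algorithmic ingredients, sparse distributed representations and a contextual bias from partially depolarized (predicted) units, that can be illustrated with a minimal sketch. The code below is a hypothetical k-winners-take-all toy, not the authors' implementation: `sparse_code`, the sparsity level `k`, and `bias_weight` are all illustrative assumptions.

```python
import numpy as np

def sparse_code(feedforward, predictive_bias, k=20, bias_weight=0.5):
    """Pick k winner micro-columns from feedforward drive plus a graded
    contextual bias (standing in for partial NMDA depolarization).
    Hypothetical sketch; parameters are illustrative, not from the paper."""
    scores = feedforward + bias_weight * predictive_bias
    winners = np.argsort(scores)[-k:]          # indices of the k largest scores
    sdr = np.zeros_like(scores)
    sdr[winners] = 1.0                         # binary sparse code
    return sdr

rng = np.random.default_rng(0)
ff = rng.random(1024)                          # toy feedforward activations
bias = np.zeros(1024)                          # no contextual prediction yet
sdr = sparse_code(ff, bias, k=20)
print(int(sdr.sum()))                          # 20 active units out of 1024 (~2% sparsity)
```

With a nonzero `predictive_bias`, contextually predicted units gain an advantage in the competition, which is the intuition behind context-dependent SDR formation.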

Partial Text

It is well known that human beings can reliably discriminate phonemes as well as other linguistic units by categorizing them, despite considerable variability across different speakers with different pitches and prosody. Furthermore, this ability extends to noisy and reverberant environments.

Classification performance is shown in Fig 9.

Results obtained in the present work support the computational hypotheses posed in our modeling approach to mimic incidental phonetic invariance and generalization. Some of these hypotheses have already been explained in terms of their properties [32] and, more specifically, in terms of their sequence-learning capabilities [53]. Nevertheless, there is no precedent for testing such neurophysiological features in word classification tasks like the ones carried out here, in which phonotactic rules are acquired without the application of optimization procedures such as backpropagating errors by means of gradient descent. In addition, our approach presents substantial differences in the algorithmic implementation of these features. In the present work, distal synapses make continuous individual contributions, and our anatomical micro-columnar organization acquires its physiological behavior spontaneously through learning. We also tested these features in a realization with hundreds of cortical columns, each combining several micro-columns with stochastic afferent activation, whose future implementations are intended to exploit large-scale simulations on leadership-class supercomputers.
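The contrast drawn here, continuous individual distal contributions versus an all-or-none dendritic (NMDA-spike-like) response, can be sketched with toy numbers. This is an illustrative assumption about the distinction, not the paper's actual update rule; the context vector, weights, and threshold `theta` are made up.

```python
import numpy as np

# Hypothetical micro-column with 5 distal synapses onto a 5-unit context SDR.
context = np.array([1.0, 0.0, 1.0, 0.0, 1.0])   # previous-step activity pattern
weights = np.array([0.2, 0.9, 0.4, 0.1, 0.3])   # toy distal synaptic strengths

# Continuous contribution (as described here): graded sum over matching synapses.
continuous = float(weights @ context)            # 0.2 + 0.4 + 0.3 = 0.9

# All-or-none alternative: a thresholded dendritic event that fires only
# when the summed match exceeds a threshold theta.
theta = 0.8
binary = 1.0 if continuous >= theta else 0.0

print(continuous, binary)                        # 0.9 1.0
```

The graded version lets every distal synapse shift a micro-column's predictive advantage by a small amount, whereas the thresholded version discards all sub-threshold context.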

We show via computational simulation that our cortical model improves performance in word classification tasks under specific environmental conditions (e.g., white noise and reverberation) and for certain acoustic variants applied to the auditory stimuli (e.g., pitch and voice variations). The model acquires the phonotactic rules in the input without any kind of supervised or reinforced optimization procedure, taking advantage only of self-organizing algorithmic properties. We also show effectiveness in classifying multisyllabic words, which suggests that our implementation of neurophysiological predictive dynamics plus stochastic sparse patterns of activation outperforms the MRSTSA algorithm in terms of phonotactic sequential invariance under disturbances applied to the audio signal. Most importantly, the present model, based on current neurophysiological and neuroanatomical data on the human auditory pathway, is able to mimic the incidental phonetic acquisition observed in human infants, which is a key mechanism in early language learning. Increasing the model's complexity (by adding further cortical layers) could allow it to replicate further mechanisms involved in human language acquisition, such as inferential learning or prediction generation. In addition, the neurophysiological and anatomical properties in our model could be relevant to the design of artificial intelligence systems and may achieve higher levels of phonetic invariance and generalization than those achieved by current deep learning architectures.
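The robustness-to-disturbance claim rests on a general property of sparse binary codes: corrupting a few bits degrades the overlap between codes gracefully rather than catastrophically. The toy below demonstrates that property in isolation; the code size, sparsity, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

def overlap(a, b):
    """Number of bits active in both binary codes."""
    return int(np.sum(a * b))

rng = np.random.default_rng(1)
n, k = 1024, 20
idx = rng.choice(n, size=k, replace=False)       # 20 active units out of 1024
sdr = np.zeros(n)
sdr[idx] = 1.0

# Simulate a disturbance: turn off 4 of the 20 active bits and turn on
# 4 previously inactive bits elsewhere.
noisy = sdr.copy()
flipped_off = rng.choice(idx, size=4, replace=False)
noisy[flipped_off] = 0.0
flipped_on = rng.choice(np.setdiff1d(np.arange(n), idx), size=4, replace=False)
noisy[flipped_on] = 1.0

print(overlap(sdr, noisy))                       # 16 of the 20 original bits still match
```

Because the spurious bits land almost surely outside the small active set, a downstream classifier that scores codes by overlap still sees a strong match, which is one intuition for why sparse codes help under white noise or reverberation.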