Date Published: March 21, 2019
Publisher: Public Library of Science
Author(s): Kaylah Lalonde, Lynne A. Werner, Julie Jeannette Gros-Louis.
Causal inference—the process of deciding whether two incoming signals come from the same source—is an important step in audiovisual (AV) speech perception. This research explored causal inference and perception of incongruent AV English consonants. Nine adults were presented auditory, visual, congruent AV, and incongruent AV consonant-vowel syllables. Incongruent AV stimuli included auditory and visual syllables with matched vowels, but mismatched consonants. Open-set responses were collected. For most incongruent syllables, participants were aware of the mismatch between auditory and visual signals (59.04%) or reported the auditory syllable (33.73%). Otherwise, participants reported the visual syllable (1.13%) or some other syllable (6.11%). Statistical analyses were used to assess whether visual distinctiveness and place, voice, and manner features predicted responses. Mismatch responses occurred more when the auditory and visual consonants were visually distinct, when place and manner differed across auditory and visual consonants, and for consonants with high visual accuracy. Auditory responses occurred more when the auditory and visual consonants were visually similar, when place and manner were the same across auditory and visual stimuli, and with consonants produced further back in the mouth. Visual responses occurred more when voicing and manner were the same across auditory and visual stimuli, and for front and middle consonants. Other responses were variable, but typically matched the visual place, auditory voice, and auditory manner of the input. Overall, results indicate that causal inference and incongruent AV consonant perception depend on salience and reliability of auditory and visual inputs and degree of redundancy between auditory and visual inputs. A parameter-free computational model of incongruent AV speech perception based on unimodal confusions, with a causal inference rule, was applied. 
Data from the current study present an opportunity to test and improve the generalizability of current AV speech integration models.
Speech perception is inherently multimodal. In face-to-face communication, we automatically combine speech information from the face and voice. This automaticity is dramatically demonstrated in the popular McGurk illusion. When presented with an auditory /bɑbɑ/ paired with a visual /gɑgɑ/, participants often perceive a fused /dɑdɑ/ that was not present in either modality. The McGurk illusion demonstrates that the brain integrates signals from across modalities into a single perceptual representation.
In all of the analyses and figures below, data represent the mean across the three vowel contexts. We would expect to observe some differences in consonant perception as a function of vowel context because, for example, many visual consonants are easier to discriminate with the open vowel /ɑ/ than with the close, back vowel /u/ (see also Table 1). This could prompt participants to rely more on the auditory stimulus in the /u/ context than in the /ɑ/ context. Therefore, analyses of each individual vowel context are included in the S1 Appendix. In general, visual-only accuracy and the proportion of each response type varied across vowel contexts. However, the relationship between predictive variables and proportions of mismatch, auditory, and Other responses was typically the same across vowels. Some vowel-related differences were found for visual responses, likely due to the relatively small proportion of visual responses in the data.
Most models of AV speech integration have yet to incorporate causal inference [see 8 for an exception]. We expect the data from the current study will be useful when researchers begin to do so. To demonstrate, we modified a fixed (parameter-free) version of the FLMP [10, 15, 25] to incorporate causal inference judgements. We used the modified FLMP to predict responses to AV speech (including mismatch detection) based on confusions among unimodal auditory and unimodal visual consonants.
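To make the idea concrete, the following is a minimal sketch of a fixed FLMP-style prediction built from unimodal confusion rows. The multiplicative combination with normalization is the standard FLMP form. The mismatch rule shown here (one minus the overlap between the auditory and visual response distributions) is an assumption chosen for illustration only and is not necessarily the causal inference rule used in this study. The function name `flmp_with_mismatch`, the category labels, and the example proportions are all hypothetical.

```python
def flmp_with_mismatch(aud_row, vis_row):
    """Predict AV responses from unimodal confusion rows.

    aud_row[k]: proportion of response k to the auditory-only consonant.
    vis_row[k]: proportion of response k to the visual-only consonant.
    Returns (p_mismatch, fused), where fused[k] is the FLMP prediction
    for response k given that the signals are integrated.
    """
    if len(aud_row) != len(vis_row):
        raise ValueError("confusion rows must cover the same response set")

    # Fixed FLMP: multiply the unimodal support for each response category.
    products = [a * v for a, v in zip(aud_row, vis_row)]
    overlap = sum(products)

    # Normalize so the integrated predictions sum to 1.
    if overlap > 0:
        fused = [p / overlap for p in products]
    else:
        fused = [0.0] * len(products)

    # ASSUMED rule (illustration only): the less the two unimodal
    # distributions overlap, the more likely a mismatch is detected.
    p_mismatch = 1.0 - overlap
    return p_mismatch, fused


# Hypothetical confusion rows over the responses /b/, /d/, /g/.
labels = ["b", "d", "g"]
auditory_b = [0.90, 0.08, 0.02]  # auditory-only /b/ (made-up values)
visual_g = [0.05, 0.45, 0.50]    # visual-only /g/ (made-up values)

p_mismatch, fused = flmp_with_mismatch(auditory_b, visual_g)
print(f"P(mismatch) = {p_mismatch:.3f}")
for label, p in zip(labels, fused):
    print(f"P({label} | integrated) = {p:.3f}")
```

Under this assumed rule, no free parameters are fitted: both the mismatch probability and the integrated response distribution follow directly from the unimodal confusions, which is the sense in which such a model is parameter-free.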
Traditional research on AV speech perception typically assumes that participants will integrate incongruent auditory and visual speech signals. However, we are often able to detect when auditory and visual speech do not match. Causal inference, the process of deciding whether incoming auditory and visual signals come from the same source, is an important step in AV speech perception. The purpose of this investigation was to explore the rules governing causal inference in perception of incongruent AV English consonants and the pattern of perceptual responses that occurs after controlling for causal inference. For the majority of incongruent AV consonant pairs (59%), participants were aware of the mismatch between the auditory and visual consonants, highlighting the need to incorporate causal inference as a key step in AV speech perception models.