Motor engagement relates to accurate perception of phonemes and audiovisual words, but not auditory words

Linguistics and Languages

K. Michaelis, M. Miyakoshi, et al.

Discover how the motor system influences speech perception in groundbreaking research by Kelly Michaelis, Makoto Miyakoshi, Gina Norato, Andrei V. Medvedev, and Peter E. Turkeltaub. This study reveals intriguing EEG signatures and adaptive task results, showcasing the nuanced relationship between motor activity and auditory stimuli.

Introduction
The study addresses whether, when, and how motor cortical systems contribute to speech perception beyond temporal lobe pathways. Competing views range from motor regions playing a minor supportive role to being essential for perception. Motor modeling accounts predict that articulatory representations in premotor/motor cortex facilitate decoding, especially for speech and under noisy conditions, whereas alternative accounts attribute motor activity to domain-general processes (attention, decision-making, rehearsal). Within a dual-stream (ventral/dorsal) framework, the authors test the hypotheses that dorsal motor systems are flexibly engaged when (1) stimuli are sublexical (no lexical template), (2) auditory signals are ambiguous or noisy (lexical uncertainty), or (3) audiovisual speech provides visual input that obligatorily engages motor representations. They further ask whether motor engagement is speech-specific or also evident for non-speech environmental sounds, and whether it supports accurate perception under controlled difficulty, while temporally separating perception from decision-making.
Literature Review
Prior work shows motor areas active during speech perception and production, yet their role remains controversial. Motor modeling theories propose articulatory templates in PMC/M1/Broca’s area aid decoding, particularly for degraded speech, predicting speech specificity and dependence on task/stimulus features. Conversely, domain-general accounts suggest motor activations reflect attention, decision processes, or rehearsal. Gaps include limited tests of speech specificity (mixed results), focus on sublexical units rather than words, unclear effects of audiovisual input and noise level, and confounds from small stimulus sets that allow preloading of response templates. The dual-stream model (Rauschecker & Scott; Hickok & Poeppel) suggests ventral pathways support lexical mapping while dorsal auditory-motor pathways contribute when ventral processing is insufficient or visual speech is present.
Methodology
Design: 4-alternative forced choice (4AFC) task with large stimulus sets and a temporal separation between stimulus perception and the appearance of response options, to minimize decision and rehearsal confounds. Four stimulus types: auditory words (AudWords), audiovisual words (AVWords), auditory phonemes (Phonemes), and non-speech environmental sounds (EnvSounds). Difficulty was manipulated with an adaptive staircase that adjusted stimulus intensity against constant pink noise to hold accuracy near 80% (Easy) and 50% (Hard); a staircase sketch follows this section. Key comparisons: speech vs non-speech; lexical (words) vs sublexical (phonemes); auditory-only vs audiovisual; Easy vs Hard.

Participants: 24 healthy right-handed native English-speaking adults (mean age 23.9 years; 16 female), retained from an initially larger cohort after exclusions for equipment errors, insufficient independent components, noisy data, and related issues.

Stimuli and procedure: Mean stimulus duration ~1139 ms. Pink noise played continuously, and a difficulty cue preceded each trial. After a 1000-ms prestimulus period, the visual stimulus (a still frame or morphed image; video for AVWords) faded in, with auditory onset jittered within a 3000-ms window. After stimulus offset, a 1000-ms waiting period preceded the on-screen response options; responses were made with the right hand. Each participant completed 100 trials per stimulus type × difficulty (800 trials total) across two sessions (AVWord/AudWord in one, Phoneme/EnvSound in the other), separated by 1–14 days.

EEG acquisition: 128-channel HydroCel Geodesic Sensor Net; 500 Hz sampling; 0.1–100 Hz hardware bandpass; vertex reference; impedances <70 kΩ.

Preprocessing: Offline low-pass filtering at 57.5 Hz, downsampling to 250 Hz, and high-pass filtering at 2 Hz; artifact cleaning including AMICA; epoching from −2 to 1.5 s relative to stimulus sound onset; equivalent current dipoles estimated for each independent component (IC), with analyses conducted at the IC (source) level.

Time-frequency analysis: Group-level ERSPs computed with a combined Morlet wavelet/STFT approach from 4 to 55 Hz (50 log-spaced bins), over a 1 to 1.46 s window in 10-ms steps; the baseline was 418–582 ms relative to onset of the constant pink noise (identical across conditions) to remove noise-listening effects. K-means clustering of ICs by dipole location identified left and right sensorimotor clusters. A wavelet-based ERSP sketch also follows this section.

Response-related controls: μ/beta (8–30 Hz) activity time-locked to the button press and to answer-choice onset was examined to ensure that response-related motor activity did not contaminate the stimulus-perception window.

Primary EEG window: 8–30 Hz, −200 to 1000 ms relative to stimulus onset (to include preparatory activity), focusing on the left and right sensorimotor clusters.

Statistics: Linear mixed-effects models (R, lmer) with subject and IC as random effects. Model A: cluster (left vs right) across all conditions. Model B (accuracy): left-cluster ERSP ~ stimulus type × accuracy (Correct vs Incorrect; collapsed across difficulty). Model C (difficulty): left-cluster ERSP ~ stimulus type × difficulty (Easy vs Hard; correct trials only). Likelihood ratio tests assessed interactions; outliers were removed via Tukey's rule; one-sample tests assessed deviation from baseline within conditions. Behavioral accuracy was modeled with mixed-effects models, stimulus volume with a repeated-measures ANOVA and post-hoc tests, and lipreading performance was assessed separately.
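The adaptive staircase described under Design can be illustrated with a weighted up-down rule (Kaernbach, 1991), in which the volume decrease after a correct response and the increase after an error are scaled so the track converges on the target accuracy. The sketch below is a minimal illustration with assumed step sizes and starting volumes, not the authors' exact procedure.

```python
# Minimal sketch of a weighted up-down staircase that converges on a target
# accuracy by adjusting stimulus volume (dB) after each trial, while the
# pink-noise masker stays constant. Step size and starting volumes are
# illustrative assumptions, not the study's parameters.

def staircase_update(volume_db, correct, target_accuracy, step_db=1.0):
    """Return the next trial's volume given the last response.

    With a weighted up-down rule, the down-step (after a correct response)
    is proportional to (1 - target) and the up-step (after an error) to
    target, so the procedure converges where p(correct) = target_accuracy.
    """
    if correct:
        # Correct response: make the next trial harder by lowering the volume.
        return volume_db - step_db * (1.0 - target_accuracy)
    else:
        # Error: make the next trial easier by raising the volume.
        return volume_db + step_db * target_accuracy

# Separate tracks for the Easy (~80% correct) and Hard (~50% correct) levels.
easy_volume, hard_volume = 65.0, 65.0
easy_volume = staircase_update(easy_volume, correct=True, target_accuracy=0.80)
hard_volume = staircase_update(hard_volume, correct=False, target_accuracy=0.50)
```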
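The ERSP measure described under Time-frequency analysis is, at its core, trial-averaged spectral power expressed in dB relative to a baseline window. The sketch below computes it with pure Morlet wavelets in NumPy; the authors used a combined Morlet wavelet/STFT approach, and the wavelet width (n_cycles), synthetic data, and baseline indices here are illustrative assumptions.

```python
import numpy as np

def ersp_db(epochs, sfreq, freqs, baseline_idx, n_cycles=7):
    """Event-related spectral perturbation (dB) via Morlet-wavelet convolution.

    epochs       : array (n_trials, n_times), single-IC time series
    sfreq        : sampling rate in Hz
    freqs        : 1-D array of analysis frequencies
    baseline_idx : indices of baseline time samples
    """
    n_trials, n_times = epochs.shape
    power = np.zeros((len(freqs), n_times))
    t = np.arange(-1, 1, 1 / sfreq)                # 2-s wavelet support
    for fi, f in enumerate(freqs):
        sigma_t = n_cycles / (2 * np.pi * f)       # temporal s.d. of the wavelet
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2 * sigma_t**2))
        wavelet /= np.sqrt(np.sum(np.abs(wavelet) ** 2))   # unit energy
        for trial in epochs:
            analytic = np.convolve(trial, wavelet, mode="same")
            power[fi] += np.abs(analytic) ** 2
    power /= n_trials                              # average power across trials
    baseline = power[:, baseline_idx].mean(axis=1, keepdims=True)
    return 10 * np.log10(power / baseline)         # dB change from baseline

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_epochs = rng.standard_normal((20, 875))            # 20 trials, 3.5 s at 250 Hz
    freqs = np.logspace(np.log10(4), np.log10(55), 50)      # 50 log-spaced bins, 4-55 Hz
    baseline = np.arange(0, 100)                            # stand-in baseline samples
    print(ersp_db(fake_epochs, sfreq=250, freqs=freqs, baseline_idx=baseline).shape)
```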
Key Findings
Behavioral:
- Adaptive staircases held performance near 80% (Easy) and 50% (Hard) across stimulus types; the mixed-effects model showed a strong effect of difficulty (estimate = −29.09, SE = 0.63, p < 0.001).
- Lipreading accuracy was slightly above chance (28.7% vs 25%; p = 0.004), confirming that AVWord performance relied on both modalities.
- Volume required to reach threshold showed a significant condition × difficulty interaction, F(3,69) = 3.75, p = 0.015, with main effects of difficulty, F(1,23) = 1295, p < 0.001, and condition, F(3,69) = 60.91, p < 0.001. Post-hoc tests: Easy required lower volumes than Hard within each condition; all between-condition volume differences were significant except AVWords vs EnvSounds; AudWords required the highest and Phonemes the lowest mean volume.

EEG hemispheric effects:
- The left and right sensorimotor clusters differed significantly in the 8–30 Hz window (estimated mean difference = −0.33, SE = 0.04, p < 0.001).
- The left cluster showed μ/beta suppression relative to baseline (estimate = −0.13, SE = 0.06, p = 0.048), beginning before stimulus onset and persisting through the epoch.
- The right cluster showed increased 8–30 Hz power (ERS; estimate = 0.20, SE = 0.06, p = 0.002), consistent with inhibitory processes.

EEG stimulus type × accuracy (left cluster):
- Significant interaction, χ²(3) = 17.6, p < 0.001.
- AVWords: greater μ/beta suppression for Correct than Incorrect trials (diff = 0.29, SE = 0.11, p = 0.01).
- Phonemes: trend toward greater suppression for Correct than Incorrect trials (diff = 0.16, SE = 0.10, p = 0.10, n.s.).
- AudWords: greater suppression for Incorrect than Correct trials (diff = −0.33, SE = 0.11, p = 0.003).
- EnvSounds: no difference (diff = −0.02, SE = 0.10, p = 0.80).
- Pairwise contrasts driving the interaction: AudWord vs AVWord (estimate = 0.62, SE = 0.16, p < 0.001); AudWord vs Phoneme (estimate = 0.50, SE = 0.15, p = 0.006).

EEG stimulus type × difficulty (left cluster; correct trials only):
- Interaction not significant, χ²(3) = 5.0, p = 0.17; main effect of stimulus type, F = 4.27, p = 0.006.
- Greater suppression for AVWords and Phonemes than AudWords (AudWord vs AVWord estimate = −0.29, SE = 0.10, p = 0.02; AudWord vs Phoneme estimate = −0.29, SE = 0.11, p = 0.04).
- AudWords showed almost no suppression for Easy trials and increased μ/beta power (ERS) for Hard trials; EnvSounds showed minimal modulation.

Baseline comparisons (left cluster):
- EnvSounds did not differ from baseline in any condition (e.g., Correct M = −0.049, p = 0.448; Incorrect M = −0.076, p = 0.370; Easy M = −0.079, p = 0.322; Hard M = −0.017, p = 0.870).
- Significant modulations were observed for AudWord Incorrect (M = −0.233, p = 0.039), AudWord Hard (M = 0.238, p = 0.037), AVWord Easy (M = −0.241, p = 0.025), Phoneme Correct (p = 0.019), and Phoneme Hard (p = 0.013) (uncorrected).
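The interaction tests reported above come from linear mixed-effects models fit in R with lmer (ERSP ~ stimulus type × accuracy, with subject and IC as random effects). A rough Python analogue can be sketched with statsmodels; the file name and column names below are hypothetical placeholders, and the random-effect structure (ICs nested within subjects via variance components) is an approximation of the lmer specification.

```python
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical table: one row per IC x condition observation, with columns
#   ersp (mean 8-30 Hz ERSP, -200 to 1000 ms), stim_type, accuracy, subject, ic
df = pd.read_csv("left_cluster_ersp.csv")        # placeholder file name

# Full model: stimulus type x accuracy interaction; random intercepts for
# subjects, plus variance components for ICs nested within subjects.
full = smf.mixedlm(
    "ersp ~ C(stim_type) * C(accuracy)",
    data=df,
    groups="subject",
    vc_formula={"ic": "0 + C(ic)"},
).fit(reml=False)                                # ML fit so nested models are comparable

# Reduced model without the interaction term.
reduced = smf.mixedlm(
    "ersp ~ C(stim_type) + C(accuracy)",
    data=df,
    groups="subject",
    vc_formula={"ic": "0 + C(ic)"},
).fit(reml=False)

# Likelihood ratio test of the interaction: 3 df for 4 stimulus types x 2 accuracy levels.
lr_stat = 2 * (full.llf - reduced.llf)
p_value = chi2.sf(lr_stat, df=3)
print(f"chi2(3) = {lr_stat:.2f}, p = {p_value:.4f}")
```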
Discussion
Findings indicate that motor engagement during perception is left-lateralized and specific to speech, not driven by domain-general factors such as attention or motor preparation. Left sensorimotor μ/beta suppression occurred for speech conditions but not environmental sounds, supporting speech-specific articulatory modeling. Right hemisphere μ activity increased, consistent with inhibitory control of irrelevant information. Critically, motor engagement related differentially to perceptual outcomes by stimulus type: greater suppression predicted correct perception for phonemes and audiovisual words, consistent with sublexical articulatory modeling and obligatory engagement with visual speech; for auditory-only words, greater suppression predicted incorrect perception, suggesting that relying on dorsal motor mechanisms may be ineffective when ventral lexical processing suffices, or reflects compensatory but unsuccessful recruitment when ventral processing fails. Difficulty effects were not significant, potentially due to ceiling of motor engagement in Easy trials, non-linear relationships with ambiguity, or removal of decision-related confounds by temporally isolating perception. Overall, results support a flexible dorsal-ventral interaction in which motor systems aid perception depending on lexicality and modality, with inhibitory processes in the right hemisphere helping suppress noise.
Conclusion
Motor processing in left hemisphere sensorimotor regions is selectively engaged during speech perception, not environmental sound perception, and its engagement predicts accurate perception for phonemes and audiovisual words but not for auditory-only words. This supports an interactive dual-stream model where dorsal motor areas are dynamically recruited depending on stimulus characteristics and context. The work advances understanding by temporally isolating perception from decision-making and controlling difficulty across conditions. Future research should employ causal manipulations (e.g., TMS) to test necessity across stimulus types, combine EEG with fMRI for improved spatial-temporal resolution, use multivariate decoding to differentiate sublexical versus audiovisual motor processes, and systematically probe non-linear relationships between ambiguity and motor engagement.
Limitations
- Causality: EEG results are correlational; causal inferences about motor contributions cannot be made without perturbation (though prior TMS work supports a phoneme–word dissociation).
- Spatial resolution: EEG source localization cannot precisely distinguish nearby motor areas (e.g., ventral premotor cortex vs primary motor cortex).
- Ambiguity manipulation: The Easy level may have already induced motor engagement, potentially obscuring difficulty effects; a non-linear relationship between ambiguity and motor activity is possible.
- Trial distribution: Fewer Easy-Incorrect trials necessitated collapsing across difficulty for accuracy analyses, which may limit condition-specific inferences about accuracy effects.