Acoustic and language-specific sources for phonemic abstraction from speech
A. Mai, S. Riès, et al.
This research from Anna Mai, Stephanie Riès, Sharona Ben-Haim, Jerry J. Shih, and Timothy Q. Gentner examines how we comprehend spoken language, probing the balance between auditory and linguistic processing. By analyzing intracranial neural responses to conversational English, the study shows that phonemes act as causally effective units in speech processing and that acoustic features and phonological categories are jointly integrated in our understanding of speech.
~3 min • Beginner • English
Introduction
The study examines how the brain abstracts phonemic and morphological information from continuous acoustic speech signals. Motivated by the high overlap between acoustic and phonological information and by debates over the psychological reality of phonemes, the authors leverage language-specific divergences where acoustic similarity does not align with phonemic or morphemic identity. They focus on English /d/ vs. /t/ neutralized to [ɾ] in certain contexts, and on morpho-phonological alternations for the English regular past tense and plural allomorphy. The core questions are whether phonemes are causally effective units in neural processing beyond acoustics, how acoustic and phonological information each contribute to neural responses, and whether covariance among acoustic features interacts with categorical phonemic information. By using intracranial recordings during natural speech listening and combining targeted linguistic contrasts with modeling (linear mixed-effects, LME, and maximum noise entropy, MNE, models), the study aims to dissociate surface acoustic from underlying phonemic and morphological representations and to test the role of language-specific knowledge in these abstractions.
Literature Review
Prior work shows spatial tuning to spectrotemporal features in auditory cortex and apparent sensitivity to phonological features that mirrors acoustic organization. Some argue that phonology may be unnecessary for speech processing, since acoustic information can suffice for word recognition and for neural encoding models. Yet linguistics documents divergences between acoustic and phonological similarity, and combined models with both acoustic and categorical phoneme features can outperform models with either alone (Di Liberto et al., 2015), implying that the two carry non-identical information. The paper builds on these findings by explicitly probing language-specific divergences (neutralization, allomorphy) and testing how phonemic labels and spectrographic features, including their covariance, jointly account for neural responses. It also connects to theories of the phoneme as a physical, psychological, or theoretical construct, evaluating evidence for the psychological reality of phonemic categories.
Methodology
Participants: Ten native English-speaking epilepsy patients (SEEG/subdural electrodes) at UC San Diego Health (aged from their 20s to their 50s; 4 women, 6 men). All reported normal hearing and none had prior experience with Catalan.
Stimuli and task: Participants passively listened to 25–76 s passages of conversational American English (Buckeye Corpus). Attention was monitored via two-alternative content questions after each English passage. Catalan passages (6 total, ~49 s each) from a Catalan corpus were interspersed; participants pressed a key upon detecting digitally inserted English nouns to ensure attention.
Neural recording and preprocessing: Intracranial EEG recorded with Natus Quantum; audio captured simultaneously. Data were notch filtered (60 Hz and harmonics), bandpass filtered (0.1–170 Hz), downsampled to 510 Hz, artifact-corrected via ICA, and epoched. Speech-responsive channels were identified via sliding-window t-tests comparing speech vs. silent epochs (p<0.05). Band-limited power was computed for delta (1–3 Hz), theta (4–7 Hz), alpha (8–12 Hz), beta (13–30 Hz), gamma (31–50 Hz), and high-gamma (70–150 Hz). Peri-target epochs (−100 to 500 ms) were z-scored to baseline.
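The band-power step above can be sketched as follows. This is a generic bandpass-plus-Hilbert estimator with a fourth-order Butterworth filter; the paper does not specify its exact filter design or power estimator, so the function names and filter choices here are purely illustrative.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_power(x, fs, lo, hi, order=4):
    """Band-limited power: zero-phase bandpass, then squared Hilbert envelope."""
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return np.abs(hilbert(filtfilt(b, a, x))) ** 2

def zscore_to_baseline(epoch, baseline):
    """Z-score an epoch against its pre-stimulus baseline samples."""
    return (epoch - baseline.mean()) / baseline.std()

# Toy example: 1 s of noise at the paper's 510 Hz rate, high-gamma band (70-150 Hz)
fs = 510
x = np.random.default_rng(0).standard_normal(fs)
hg = band_power(x, fs, 70, 150)
```

In practice the same `band_power` call would be repeated for each of the six bands, and each peri-target epoch z-scored against its own pre-stimulus baseline.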
Linguistic contrasts for site classification: To probe acoustic–phonemic divergence (English coronal stop neutralization), and phonemic–morphemic divergence (regular past tense, plural), the authors used sliding-window one-way ANOVAs (100 ms windows, 50 ms overlap) with Tukey’s post hoc tests to classify electrodes as showing:
- Acoustic surface sites (e.g., taps more similar to each other than to [t]) or phonemic underlying sites (e.g., underlying /t/ taps patterning with other /t/ allophones vs. underlying /d/ taps) for the tap comparison.
- Morphological surface vs. morphological underlying sites for past tense (/t/, /d/ allomorphs vs. non-past [d]) and plural (/s/, /z/ allomorphs vs. non-plural [z]). Criteria required significant ANOVA windows and specific post hoc difference/no-difference patterns per contrast.
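The window-level classification logic can be sketched with SciPy's one-way ANOVA and Tukey HSD. For the tap contrast, `resp_a` would hold responses to the unambiguous allophone (e.g., [t]) and `resp_bx`/`resp_by` the two taps with different underlying phonemes; the difference/no-difference criteria below are illustrative stand-ins for the authors' exact per-contrast patterns, not their published thresholds.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

def classify_window(resp_a, resp_bx, resp_by, alpha=0.05):
    """One analysis window: ANOVA across three phone groups, then Tukey post hoc.
    Returns 'surface', 'underlying', or None (illustrative criteria)."""
    _, p = f_oneway(resp_a, resp_bx, resp_by)
    if p >= alpha:
        return None
    post = tukey_hsd(resp_a, resp_bx, resp_by)
    # Pairwise p-values: (0,1)=A vs Bx, (0,2)=A vs By, (1,2)=Bx vs By
    a_bx, a_by, bx_by = post.pvalue[0, 1], post.pvalue[0, 2], post.pvalue[1, 2]
    # Surface pattern: the two taps group together and both differ from A
    if bx_by >= alpha and a_bx < alpha and a_by < alpha:
        return "surface"
    # Underlying pattern: Bx groups with A while differing from By
    if a_bx >= alpha and bx_by < alpha and a_by < alpha:
        return "underlying"
    return None
```

The full procedure would slide this test across 100 ms windows with 50 ms overlap per electrode and band.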
Null distributions: To assess chance levels, for each band the same analysis was repeated with 1,000 arbitrary phone triplets (A; Bx, By) drawn from the real data to generate null distributions preserving spatial correlations.
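The permutation logic might look like the sketch below: re-run the same site classifier on arbitrary phone triplets drawn from the real data, then compare observed site counts against the resulting null. The helper names and the one-sided empirical p-value convention are assumptions, not the authors' code.

```python
import numpy as np

def null_site_counts(classify, responses_by_phone, n_perm=1000, seed=0):
    """Null distribution of site counts from arbitrary phone triplets (A; Bx, By)
    drawn from the real data, preserving spatial correlations across electrodes."""
    rng = np.random.default_rng(seed)
    phones = list(responses_by_phone)
    counts = []
    for _ in range(n_perm):
        a, bx, by = rng.choice(phones, size=3, replace=False)
        counts.append(classify(responses_by_phone[a],
                               responses_by_phone[bx],
                               responses_by_phone[by]))
    return np.asarray(counts)

def empirical_p(observed, null_counts):
    """One-sided empirical p-value: fraction of null draws >= observed count."""
    return (np.sum(null_counts >= observed) + 1) / (len(null_counts) + 1)
```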
Features for modeling:
- Spectrographic features (LME): 16×16 spectrograms (1024 ms context; 16 frequency bins after log and averaging; time resolution ~64 ms) aligned to neural data; dimensionality reduced to 16×8 (128 features) for main analyses. Alternative features from a GAIA autoencoder latent space (128-d) reported in Supplementary Information.
- Phonemic label features (LME): One-hot phone labels time-aligned, averaged to match spectrogram windows; vectors sum to 1 over window.
LME models: For each participant and each neural response (six bands plus broadband LFP), seven mixed-effects models were fit: s1 (spectrogram), p1 (phoneme labels), and s1p1 (both), plus shuffled controls (s2/s3 and p2/p3: within-trial and within-session shuffles of the spectrographic and phonemic features, respectively). Electrode and excerpt speaker were random effects. Model comparison used AIC and model weights, and correlations between predicted and recorded responses were computed.
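Model weights from AIC presumably follow the standard Akaike-weight formula (relative likelihood normalized across the candidate set); a minimal sketch, assuming that convention:

```python
import numpy as np

def akaike_weights(aics):
    """Akaike weights: w_i = exp(-0.5 * dAIC_i) / sum_j exp(-0.5 * dAIC_j)."""
    aics = np.asarray(aics, dtype=float)
    delta = aics - aics.min()          # dAIC relative to the best model
    rel = np.exp(-0.5 * delta)         # relative likelihoods
    return rel / rel.sum()
```

For example, AICs of 100, 102, and 110 give the best model a weight roughly e times that of the runner-up, making "best by AIC" claims quantitative.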
MNE models: First-order (linear) and second-order (quadratic) Maximum Noise Entropy models were fit per channel and band to map stimulus features to neural responses. Two input conditions: unlabeled spectrograms (16×16) and labeled spectrograms (top row encodes two 8-bit ARPABET labels as values ±1 SD of spectrogram mean). Quadratic models include pairwise feature interactions (J matrix). Training used four-fold jackknife, conjugate gradient minimization with early stopping. Predicted vs. recorded Pearson r was Fisher Z-transformed. Mixed-effects analyses tested effects of model order (linear/quadratic), label availability (labeled/unlabeled), their interaction, and shuffling.
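The second-order MNE response model has the standard logistic form P(y | s) = 1 / (1 + exp(a + h·s + sᵀJs)), where h weights individual stimulus features and J captures pairwise (covariance) interactions; dropping J gives the first-order model. A sketch of the forward model and the Fisher Z step follows (the jackknifed conjugate-gradient training is omitted, and the function names are illustrative):

```python
import numpy as np

def mne_response_probability(s, a, h, J):
    """Second-order MNE model: logistic function of a constant term, a linear
    term h.s, and a quadratic term s^T J s over the stimulus feature vector s."""
    z = a + h @ s + s @ J @ s
    return 1.0 / (1.0 + np.exp(z))

def fisher_z(r):
    """Fisher Z-transform of a Pearson correlation, for averaging/statistics."""
    return np.arctanh(r)
```

With a = 0 and h, J zeroed out, the model is uninformative and returns 0.5, which is a convenient sanity check when implementing the fit.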
Language-knowledge analyses: Additional mixed-effects families assessed whether model fit advantages for phonemic labels depended on language (English vs. Catalan), including model×language interactions for both LME and MNE predicted-recorded correlations.
Electrode localization and inclusion: SEEG contacts localized via registered MRI/CT; significant speech-responsive sites mapped (many in white matter).
Key Findings
- Targeted contrasts revealed extensive phonemic abstraction: per band (~485 speech-responsive electrodes), an average of 31 acoustic surface sites (SD 22.4) and 190 phonemic underlying sites (SD 38.5) for the coronal stop–tap alternation; ~6 electrodes (SD 2.4) overlapped as both surface and underlying.
- Morphology: Past tense showed on average 73 surface sites (SD 23.9), 46 morphological underlying sites (SD 8.8), and ~1 overlap (SD 1.1) per band. Plural showed 47 surface sites (SD 14.2), 45 morphological underlying sites (SD 12.2), and ~1 overlap (SD 0.5).
- These counts exceeded chance expectations from null distributions. Example: in high-gamma, P(≥111 phonemic underlying AND ≥80 acoustic surface sites) < 0.1% for the tap alternation; for past tense, P(≥33 surface AND ≥32 underlying) = 0.1%; for plural, P(≥19 surface AND ≥38 underlying) = 0.2%.
- LME model selection: Lower bands (delta, theta, alpha) and broadband LFP were best fit by models including both spectrographic and phonemic features (s1p1) for most participants (LFP: 9/10 s1p1, 1/10 p1). Beta: s1p1 best in 8 participants, p1 in 2. Higher bands: gamma and high-gamma were best fit by spectrogram-only models (s1) in 8/10 participants, with s1p1 best in the remainder. Phonemic labels thus explain additional variance primarily at lower frequencies; acoustics dominate higher frequencies.
- MNE modeling: The best mixed-effects model for predicted–recorded correlations included model order, label availability, and their interaction; it outperformed alternatives by AIC (ΔAIC = 54.75 vs. next best), indicating that quadratic (covariance) features improve predictions only when phonemic label information is present. Model order alone did not improve fit without labels.
- Language knowledge: Phonemic label benefits required familiarity with the language. For LME, the model including language and language×model interactions was best (AIC improvement 609.59 over next). For MNE, the full interaction model was best (AIC improvement 775.87). Phoneme-only models (p1) correlated better for English than Catalan; spectrogram-only models fit Catalan better than English. Quadratic MNE models improved fits for English only when labels were included, not for Catalan.
- Temporal/band insights: Predictive correlations were larger in lower bands; acoustic and phonemic effects often co-occurred temporally in gamma/high-gamma at distinct sites, supporting simultaneous but separable processing of acoustic detail and phonemic category.
Discussion
Findings demonstrate robust neural sensitivity to underlying phonemic identity even when acoustic realizations are neutralized (e.g., [ɾ] from /t/ vs. /d/), providing strong evidence that phonemes function as psychologically real, language-specific units in prelexical processing. The presence of morphological underlying sites indicates that morphological identity is abstracted over phonologically conditioned alternants, with neural responses tracking morphological exponence (past/plural) beyond surface phonetics. Mixed-effects modeling shows that phonemic labels account for variance not explained by acoustics, particularly in lower frequency bands and LFP, while higher frequency power is primarily acoustic. MNE analyses reveal that stimulus covariance (quadratic features) enhances prediction only when categorical phonemic information is available, indicating a synergistic interaction: phonemic identity bootstraps the use of spectrotemporal relationships. Language-knowledge analyses confirm that these categorical benefits depend on familiarity with the language, ruling out explanations that labels merely recapitulate acoustics. Together, results bridge acoustic processing with phonology and morphosyntax, supporting a structured, multi-level abstraction mechanism in speech comprehension.
Conclusion
This work provides convergent evidence that the brain abstracts phonemic and morphological categories from continuous speech acoustics in a language-specific manner. By leveraging acoustic–phonemic and phonemic–morphological divergences, the study dissociates surface acoustic similarity from underlying linguistic identity and shows that phonemes are causally effective units for explaining neural responses. Modeling reveals that lower-frequency neural activity integrates phonemic category information beyond acoustics, and that stimulus covariance contributes when coupled with phonemic labels, linking sensory feature relationships to linguistic categories. Future work should test generalization to irregular morphology to determine whether underlying sites reflect exponent-specific vs. category-general representations, increase participant numbers and cortical coverage to resolve inter-individual and band-specific variability, compare gray vs. white matter signal contributions, and explore higher-order (beyond quadratic) feature interactions and cross-linguistic generality.
Limitations
- Electrode coverage was dominated by white matter contacts, complicating direct comparison to cortical surface measures and interpretation of source localization; white matter LFP properties are less well understood.
- Sample size (10 participants) and heterogeneous electrode coverage may contribute to variability in best-fit models across participants, especially in higher frequency bands.
- The morphological analyses did not include irregular forms, limiting conclusions about category-general vs. exponent-specific representation and about compositionality vs. analogy.
- Attention/task differences between English (content questions) and Catalan (embedded English-word detection) may have influenced acoustic tracking, potentially enhancing acoustic-model performance for Catalan.
- The site-classification ANOVA approach, while conservative with null distributions, relies on windowed statistics and may miss effects outside the analyzed windows or conflate nearby processes.
- Generalization beyond English and Catalan, and beyond the specific alternations tested, remains to be established.