Phonemic segmentation of narrative speech in human cerebral cortex

Linguistics and Languages

X. L. Gong, A. G. Huth, et al.

This fMRI study reveals that the cortical regions involved in phonemic processing are more extensive than previously believed. Conducted by Xue L. Gong, Alexander G. Huth, Fatma Deniz, Keith Johnson, Jack L. Gallant, and Frédéric E. Theunissen, the research characterizes how diphones and triphones are represented in the brain, challenging existing notions about word retrieval and auditory perception.
Introduction

The study addresses how the human cortex transforms continuous speech acoustics into phonemic units and, ultimately, words and meaning. While the broad cortical regions that map sound to words are known, the neural bases of phonemic segmentation (how phonemic subunits are combined into syllables and words) remain unclear. The authors recorded fMRI while participants listened to natural narrative speech and used voxelwise encoding models to compare acoustic, phonemic (single phonemes, diphones, triphones), and semantic features. The goals were to localize phonemic brain regions, characterize the temporal granularity of phonemic segmentation, and determine where phonemic processing transitions to lexical/semantic representation.

Literature Review

Prior fMRI work distinguishes cortical responses to words versus phonemic stimuli and contrasts spectral versus articulatory features to map primary auditory versus speech-specific regions in STG. EEG/MEG/ECoG research has leveraged cortical entrainment at multiple temporal scales, identifying neural correlates of word segmentation in inferior/middle frontal gyri, phrase/sentence processing in middle/posterior STG and IFG, and syllable-rate entrainment linked to lower-level processing. However, the functional specificity of entrainment to linguistic versus general acoustic processing remains debated. The locus of transition from phonemic processing to lexical retrieval has been proposed in anterior/lateral posterior temporal regions and inferior prefrontal cortex. This work integrates abstraction-level contrasts (acoustic/phonemic/semantic) with segmentation granularity (single phoneme/diphone/triphone) to revisit these questions.

Methodology

Participants: 11 healthy adults (7 male, 4 female; ages 24–36) with normal hearing; handedness assessed with the Edinburgh inventory. The protocol was IRB-approved and all participants gave informed consent.

Stimuli: More than 2 hours of natural spoken stories from The Moth Radio Hour. Training set: ten 10–15 min stories (3737 TRs total). Validation set: one 10-min story repeated twice (291 TRs). Audio was delivered through in-ear headphones; transcripts were annotated at the word and phoneme level.

MRI acquisition: 3T Siemens TIM Trio with a 32-channel coil. fMRI: TR = 2.0045 s, TE = 31 ms, flip angle = 70°, voxel size = 2.24 × 2.24 × 4.1 mm, 32 axial slices. A structural T1-weighted MP-RAGE was also collected.

Preprocessing: Motion correction (FSL FLIRT/MCFLIRT), alignment to a run template, drift removal (Savitzky–Golay filter, 120 s window), z-scoring, and ROI localizers for auditory cortex (AC), Broca’s area, and sPMv.

Feature spaces (downsampled to the TR with Lanczos filtering):

  • Acoustic: time-varying power spectrum (25 Hz–15 kHz, 33.5 Hz bands; 448 features) and phoneme count per TR (1 feature). Combined as Acoustic Baseline (449 features).
  • Phonemic identity: single phonemes (39 features: 24 consonants, 15 vowels), diphones (858), triphones (4841), based on the CMU pronouncing dictionary and IPhOD statistics.
  • Semantic: 985-dimensional word-embedding vectors based on co-occurrence with 985 basis words (drawn from Wikipedia’s list of 1000 basic words).

Voxelwise models (VMs): Regularized linear encoding with banded ridge (Tikhonov) regression, with separate regularization for each feature block and its temporal delays, and a fixed HRF-inspired temporal prior. Predictors included four delays (2, 4, 6, 8 s), yielding the Acoustic Baseline VM (1796 features), the Phonemic VM (22,952 features), and the Phonemic–Semantic VM (26,892 features). Responses and predictors were z-scored (per story for Y). Acoustic Baseline predictions were subtracted from the BOLD signal to obtain residuals (Y_res) for subsequent phonemic/semantic modeling. A schematic sketch of this delayed-predictor, banded-ridge fit appears at the end of this section.

Model selection and validation: Hyperparameter grid search (10 log-spaced values per feature block; ranges up to 10^3–10^5 depending on the model) with 10-fold cross-validation on the training set; final performance was evaluated on the held-out validation story (R and R² per voxel). Statistical significance was assessed with permutation tests (1000 shuffles) under FDR control (q = 0.01).

Variance partitioning: Using the joint Phonemic VM’s optimal hyperparameters, the unique and joint contributions of single phonemes, diphones, triphones, and their pairwise/all combinations were computed as bias-corrected R². Diphone features were further partitioned into three categories: short words (e.g., M.AY = “my”), word beginnings (e.g., AE.N in “and”), and residual diphones, and their normalized contributions were assessed.

ROIs: Functionally defined AC (responses to an auditory localizer), Broca’s area, and sPMv; anatomically defined IPFC (ventral to the IFS), SPFC (dorsal to the SFS), LPC vs MPC, and LTC vs VTC. Mixed-effects models (subjects as random effects) were used for inference and effect-size estimation (Cohen’s d′). Additional spatial analyses projected phonemic vs semantic prediction performance along a medial–lateral axis in LTC (perpendicular to the STS) and an inferior–superior axis in IPFC (perpendicular to the IFS).

Simulations: Assessed the feasibility of recovering phonemic identity given fMRI’s low temporal resolution by modeling voxels with varying sensitivity to specific phonemic units. With ~2 hours of data, identity-sensitive voxels can be distinguished from count-based responses if they are tuned to fewer than ~10 single phonemes or ~100 diphones, and sensitivities to single phonemes, diphones, and triphones can be differentiated. Longer data (5 h for one subject) modestly improved triphone detectability without altering diphone dominance.
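To make the encoding-model pipeline concrete, here is a minimal sketch in Python/NumPy of a delayed-predictor, banded-ridge fit of the kind described above. It is illustrative only: the helper names, array shapes, and the scaling trick used to implement per-band penalties are assumptions, not the authors' implementation (which also includes the HRF-inspired temporal prior, cross-validated hyperparameter search, and bias-corrected R²).

```python
# Minimal sketch (not the authors' code) of a voxelwise encoding model with
# delayed predictors and banded ridge regression. Assumed inputs: feature
# matrices already downsampled to the fMRI TR, and z-scored BOLD responses.
import numpy as np

def make_delayed(X, delays=(1, 2, 3, 4)):
    """Stack copies of X shifted by `delays` TRs (1-4 TRs ~ 2-8 s at TR ~ 2 s)."""
    n_tr, n_feat = X.shape
    out = np.zeros((n_tr, n_feat * len(delays)))
    for i, d in enumerate(delays):
        out[d:, i * n_feat:(i + 1) * n_feat] = X[:n_tr - d]
    return out

def banded_ridge_fit(X_bands, Y, lambdas):
    """Closed-form ridge with one penalty per feature band.

    X_bands : list of (n_tr, n_i) delayed feature matrices (one per band)
    Y       : (n_tr, n_voxels) z-scored BOLD responses
    lambdas : one regularization value per band (chosen by cross-validation)
    """
    # Dividing each band by sqrt(lambda_i) turns a unit-penalty ridge into a
    # per-band-penalty ridge (a standard reparameterization of banded ridge).
    # For very large feature spaces a kernel- or SVD-based solver would be
    # used instead of this dense normal-equation solve.
    Xs = np.hstack([Xb / np.sqrt(lam) for Xb, lam in zip(X_bands, lambdas)])
    B = np.linalg.solve(Xs.T @ Xs + np.eye(Xs.shape[1]), Xs.T @ Y)
    # Undo the scaling so the weights apply to the original delayed features.
    splits = np.cumsum([Xb.shape[1] for Xb in X_bands])[:-1]
    return [Bb / np.sqrt(lam) for Bb, lam in zip(np.split(B, splits), lambdas)]

def predict(X_bands, W_bands):
    """Predicted BOLD time course: sum of each band's contribution."""
    return sum(Xb @ Wb for Xb, Wb in zip(X_bands, W_bands))

def voxelwise_r(Y_true, Y_pred):
    """Pearson correlation per voxel on held-out data."""
    zt = (Y_true - Y_true.mean(0)) / Y_true.std(0)
    zp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    return (zt * zp).mean(0)
```

In a full pipeline of this kind, the per-band lambdas would be selected by a 10-fold cross-validated grid search on the training stories, acoustic-baseline predictions would be subtracted from Y before fitting the phonemic and semantic bands, and voxelwise_r would be evaluated on the held-out validation story.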
Key Findings
  • Extensive phonemic cortical network: Significant phonemic representations (after removing acoustic baseline) in bilateral superior temporal gyrus (STG), superior temporal sulcus (STS), and lateral temporal cortex (LTC), extending into lateral/posterior parietal cortex (LPC/PPC), medial parietal cortex (MPC), and inferior/superior prefrontal cortex (IPFC/SPFC). Primary auditory cortex (PAC) showed low phonemic prediction.
  • Diphone-level segmentation dominates: Across the cortex, the unique contribution of diphone features, computed by variance partitioning (see the sketch after this list), exceeded that of single phonemes and triphones (mixed-effects LRT χ²(2)=10355.60, p<2.2×10⁻¹⁶), with a large effect size (average d′=13.114±0.203, 2 SE). Within temporal cortex ROIs, diphones had higher unique contributions than single phonemes and triphones, with significant effects across ROIs (χ²(9)=625.68, p<2.2×10⁻¹⁶), across features (χ²(8)=4828.82, p<2.2×10⁻¹⁶), and a significant ROI×feature interaction (χ²(6)=522.32, p<2.2×10⁻¹⁶). Effect sizes (d′ for diphone vs single/triphone): PAC 0.440±0.968; STG 12.838±0.838; STS 12.509±0.327; LTC 9.060±0.471. In Broca’s area, diphones also dominated (χ²(2)=443.64, p<2.2×10⁻¹⁶; d′=24.168±0.203).
  • Limited single-phoneme and triphone uniqueness: Unique single-phoneme effects localized to a small number of voxels in bilateral STS; triphone unique or joint contributions were minor and scattered.
  • Role of short words vs diphone identity: Diphone models’ contribution partly, but not solely, reflected short words. Normalized contributions differed among diphone categories (short words > word beginnings > residuals) with significant mixed-effects tests (e.g., F(2,10)=7.44, p=6.6×10⁻⁵; F(2,10)=8.62, p=1.8×10⁻³). Models using only diphone transition probabilities (phonotactics; 8 features from IPhOD) had substantially lower predictive power than identity-based diphone models.
  • Phonemic vs semantic mapping and gradients: Overall cortex showed stronger semantic than phonemic contributions (χ²(1)=712.23, p<2.2×10⁻¹⁶; d′=2.018±0.151). In temporal ROIs, phonemic > semantic in STG (χ²(1)=19.12, p=8.1×10⁻⁵) and STS (χ²(1)=51.75, p=6.3×10⁻¹⁵); semantic > phonemic in LTC (χ²(1)=36.65, p=1.4×10⁻⁸). Spatial gradients: phonemic-to-semantic transition along a medial–lateral gradient in LTC (higher phonemic near STS; higher semantic near ITS), and an inferior–superior gradient in IPFC (phonemic inferior to IFS; semantic superior), supported by paired t-tests along anatomical axes.
  • Feasibility confirmed: Simulations and an extended dataset for one subject showed that with ~2–5 hours of data, fMRI encoding approaches can recover phonemic identity sensitivities and reveal diphone-level dominance despite the slow BOLD signal.
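For reference, the unique and joint contributions reported above follow standard variance partitioning over nested models. Below is a minimal two-band sketch under that assumption; the paper partitions three bands (single phonemes, diphones, triphones), which extends the same inclusion–exclusion logic, and applies a bias correction to R² that is not reproduced here.

```python
# Minimal variance-partitioning sketch (illustrative): unique and shared
# explained variance for two feature bands A and B, given cross-validated
# R^2 values (per-voxel arrays) from the joint model and the two reduced models.
def partition_two_bands(r2_AB, r2_A, r2_B):
    unique_A = r2_AB - r2_B          # variance explained only by A
    unique_B = r2_AB - r2_A          # variance explained only by B
    shared_AB = r2_A + r2_B - r2_AB  # overlap, by inclusion-exclusion
    return unique_A, unique_B, shared_AB
```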
Discussion

The findings delineate a widespread phonemic network beyond classical mid-STG, encompassing bilateral STG/STS/LTC, parietal regions, and inferior prefrontal cortex. Variance partitioning indicates that cortical phonemic segmentation predominantly occurs at the diphone level, with minimal unique triphone representation and limited single-phoneme specificity. This supports phonetic theory: due to coarticulation, single-phoneme acoustics vary with context, while diphones provide a more stable unit for mapping acoustic features to phonemic representations. The overlap between phonemic regions and lexical/semantic areas, and the enhanced representation of diphones that form short words, suggest that phonemic and lexical processes are intertwined. Nonetheless, diphone identity contributes beyond short words and beyond diphone transition statistics, indicating genuine encoding of diphone content rather than merely phonotactic expectations. Cortical gradients in LTC (medial phonemic to lateral semantic) and IPFC (inferior phonemic to superior semantic) outline putative pathways for transitioning from phonemic/lexical stages to semantic representations. These results refine models of speech processing by specifying the granularity of sublexical segmentation and the spatial organization of phonemic-to-semantic transformations.

Conclusion

This study maps extensive cortical regions encoding phonemic information during natural narrative listening and shows that phonemic segmentation is predominantly realized at the diphone level. Phonemic regions extend across lateral temporal and parietal cortices and into inferior prefrontal cortex, overlapping with lexical/semantic areas and revealing systematic gradients from phonemic/lexical to semantic representations. The work demonstrates that fMRI encoding models can recover fast linguistic computations when stimuli are long and feature occurrences are sufficiently sparse. Future research should: (1) incorporate models of adaptive segmentation and linguistic context to test word segmentation expectations; (2) further parse phonemic vs lexical contributions and explore subnetwork specializations within LTC and IPFC; (3) examine higher-order phonemic combinations with longer datasets; (4) investigate bilingual/multilingual processing to dissociate language-specific vs language-general phonemic mechanisms; and (5) evaluate additional speech features (e.g., syllabic/envelope-based) in multimodal datasets.

Limitations
  • Temporal resolution: fMRI’s sluggish BOLD response may obscure rapid phonemic events; conclusions rely on long recordings and model-based recovery of temporally sparse identity features.
  • Specificity evidence: While variance partitioning shows strong diphone uniqueness, the study does not provide direct topographic maps organized by diphone structure; triphone-specific maps were not identified.
  • Feature scope: Other segmentation schemes (e.g., syllable or envelope-based features) were not exhaustively tested; adaptive segmentation models were not implemented.
  • Data duration and sensitivity: With ~2 hours (and one subject at 5 hours), detection of higher-order combinations (triphones and beyond) remains limited; identity-based models require sufficient voxel sensitivity (≈<10 single phonemes or <100 diphones) to outperform count-based models.
  • Short word confound: A substantial fraction of diphone predictive power arises from short words; although identity effects persist beyond this, disentangling phonemic from lexical effects fully requires additional manipulations.
  • Generalizability: Naturalistic narrative stimuli limit control over phoneme/triphone distributions; cross-language generalization was not tested.