Phonemic segmentation of narrative speech in human cerebral cortex

Linguistics and Languages

X. L. Gong, A. G. Huth, et al.

This fMRI study reveals that the cortical regions involved in phonemic processing are more extensive than previously believed. Conducted by Xue L. Gong, Alexander G. Huth, Fatma Deniz, Keith Johnson, Jack L. Gallant, and Frédéric E. Theunissen, the research characterizes how diphones and triphones are represented in the brain, challenging existing notions about word retrieval and auditory perception.
Introduction

The study addresses how the human cortex transforms continuous speech acoustics into phonemic units and, ultimately, words and meaning. While the broad cortical regions that map sound to words are known, the neural bases of phonemic segmentation (how phonemic subunits are combined into syllables and words) remain unclear. The authors recorded fMRI while participants listened to natural narrative speech and used voxelwise encoding models to compare acoustic, phonemic (single phonemes, diphones, triphones), and semantic features. The goals were to localize phonemic brain regions, characterize the temporal granularity of phonemic segmentation, and determine where phonemic processing transitions to lexical/semantic representation.

Literature Review

Prior fMRI work distinguishes cortical responses to words versus phonemic stimuli and contrasts spectral versus articulatory features to map primary auditory versus speech-specific regions in STG. EEG/MEG/ECoG research has leveraged cortical entrainment at multiple temporal scales, identifying neural correlates of word segmentation in inferior/middle frontal gyri, phrase/sentence processing in middle/posterior STG and IFG, and syllable-rate entrainment linked to lower-level processing. However, the functional specificity of entrainment to linguistic versus general acoustic processing remains debated. The locus of transition from phonemic processing to lexical retrieval has been proposed in anterior/lateral posterior temporal regions and inferior prefrontal cortex. This work integrates abstraction-level contrasts (acoustic/phonemic/semantic) with segmentation granularity (single phoneme/diphone/triphone) to revisit these questions.

Methodology

Participants: 11 healthy adults (7 male, 4 female; ages 24–36) with normal hearing; handedness assessed with the Edinburgh inventory. The protocol was IRB-approved and all participants gave informed consent.

Stimuli: More than 2 hours of natural spoken stories from The Moth Radio Hour. Training set: ten 10–15 min stories (3737 TRs total). Validation set: one 10-min story repeated twice (291 TRs). Audio was delivered through in-ear headphones; transcripts were annotated at the word and phoneme level.

MRI acquisition: 3T Siemens TIM Trio with a 32-channel coil. fMRI: TR = 2.0045 s, TE = 31 ms, flip angle = 70°, voxel size = 2.24 × 2.24 × 4.1 mm, 32 axial slices. A structural T1-weighted MP-RAGE was also collected.

Preprocessing: Motion correction (FSL FLIRT/MCFLIRT), alignment to a run template, drift removal (Savitzky–Golay filter, 120 s window), z-scoring, and ROI localizers for auditory cortex (AC), Broca’s area, and sPMv.

Feature spaces (downsampled to the TR with Lanczos filtering):

  • Acoustic: time-varying power spectrum (25 Hz–15 kHz, 33.5 Hz bands; 448 features) and phoneme count per TR (1 feature). Combined as Acoustic Baseline (449 features).
  • Phonemic identity: single phonemes (39 features: 24 consonants, 15 vowels), diphones (858), triphones (4841), based on the CMU pronouncing dictionary and IPhOD statistics.
  • Semantic: 985-dimensional word-embedding vectors based on co-occurrence with 985 basis words (drawn from Wikipedia’s list of 1000 basic words).

Voxelwise models (VMs): Regularized linear encoding with banded ridge (Tikhonov) regression, with separate regularization for each feature block and its temporal delays, and a fixed HRF-inspired temporal prior. Predictors included four delays (2, 4, 6, 8 s), yielding the Acoustic Baseline VM (1796 features), the Phonemic VM (22,952 features), and the Phonemic–Semantic VM (26,892 features). Responses and predictors were z-scored (per story for Y). Acoustic Baseline predictions were subtracted from the BOLD signal to obtain residuals (Y_res) for subsequent phonemic/semantic modeling. A schematic sketch of this delayed-predictor, banded-ridge fit appears at the end of this section.

Model selection and validation: Hyperparameter grid search (10 log-spaced values per feature block; ranges up to 10^3–10^5 depending on the model) with 10-fold cross-validation on the training set; final performance was evaluated on the held-out validation story (R and R² per voxel). Statistical significance was assessed with permutation tests (1000 shuffles) under FDR control (q = 0.01).

Variance partitioning: Using the joint Phonemic VM’s optimal hyperparameters, the unique and joint contributions of single phonemes, diphones, triphones, and their pairwise/all combinations were computed as bias-corrected R². Diphone features were further partitioned into three categories: short words (e.g., M.AY = “my”), word beginnings (e.g., AE.N in “and”), and residual diphones, and their normalized contributions were assessed.

ROIs: Functionally defined AC (responses to an auditory localizer), Broca’s area, and sPMv; anatomically defined IPFC (ventral to the IFS), SPFC (dorsal to the SFS), LPC vs MPC, and LTC vs VTC. Mixed-effects models (subjects as random effects) were used for inference and effect-size estimation (Cohen’s d′). Additional spatial analyses projected phonemic vs semantic prediction performance along a medial–lateral axis in LTC (perpendicular to the STS) and an inferior–superior axis in IPFC (perpendicular to the IFS).

Simulations: Assessed the feasibility of recovering phonemic identity given fMRI’s low temporal resolution by modeling voxels with varying sensitivity to specific phonemic units. With ~2 hours of data, identity-sensitive voxels can be distinguished from count-based responses if they are tuned to fewer than ~10 single phonemes or ~100 diphones, and sensitivities to single phonemes, diphones, and triphones can be differentiated. Longer data (5 h for one subject) modestly improved triphone detectability without altering diphone dominance.
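To make the encoding-model pipeline concrete, here is a minimal sketch in Python/NumPy of a delayed-predictor, banded-ridge fit of the kind described above. It is illustrative only: the helper names, array shapes, and the scaling trick used to implement per-band penalties are assumptions, not the authors' implementation (which also includes the HRF-inspired temporal prior, cross-validated hyperparameter search, and bias-corrected R²).

```python
# Minimal sketch (not the authors' code) of a voxelwise encoding model with
# delayed predictors and banded ridge regression. Assumed inputs: feature
# matrices already downsampled to the fMRI TR, and z-scored BOLD responses.
import numpy as np

def make_delayed(X, delays=(1, 2, 3, 4)):
    """Stack copies of X shifted by `delays` TRs (1-4 TRs ~ 2-8 s at TR ~ 2 s)."""
    n_tr, n_feat = X.shape
    out = np.zeros((n_tr, n_feat * len(delays)))
    for i, d in enumerate(delays):
        out[d:, i * n_feat:(i + 1) * n_feat] = X[:n_tr - d]
    return out

def banded_ridge_fit(X_bands, Y, lambdas):
    """Closed-form ridge with one penalty per feature band.

    X_bands : list of (n_tr, n_i) delayed feature matrices (one per band)
    Y       : (n_tr, n_voxels) z-scored BOLD responses
    lambdas : one regularization value per band (chosen by cross-validation)
    """
    # Dividing each band by sqrt(lambda_i) turns a unit-penalty ridge into a
    # per-band-penalty ridge (a standard reparameterization of banded ridge).
    # For very large feature spaces a kernel- or SVD-based solver would be
    # used instead of this dense normal-equation solve.
    Xs = np.hstack([Xb / np.sqrt(lam) for Xb, lam in zip(X_bands, lambdas)])
    B = np.linalg.solve(Xs.T @ Xs + np.eye(Xs.shape[1]), Xs.T @ Y)
    # Undo the scaling so the weights apply to the original delayed features.
    splits = np.cumsum([Xb.shape[1] for Xb in X_bands])[:-1]
    return [Bb / np.sqrt(lam) for Bb, lam in zip(np.split(B, splits), lambdas)]

def predict(X_bands, W_bands):
    """Predicted BOLD time course: sum of each band's contribution."""
    return sum(Xb @ Wb for Xb, Wb in zip(X_bands, W_bands))

def voxelwise_r(Y_true, Y_pred):
    """Pearson correlation per voxel on held-out data."""
    zt = (Y_true - Y_true.mean(0)) / Y_true.std(0)
    zp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
    return (zt * zp).mean(0)
```

In a full pipeline of this kind, the per-band lambdas would be selected by a 10-fold cross-validated grid search on the training stories, acoustic-baseline predictions would be subtracted from Y before fitting the phonemic and semantic bands, and voxelwise_r would be evaluated on the held-out validation story.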
Key Findings
  • Extensive phonemic cortical network: Significant phonemic representations (after removing acoustic baseline) in bilateral superior temporal gyrus (STG), superior temporal sulcus (STS), and lateral temporal cortex (LTC), extending into lateral/posterior parietal cortex (LPC/PPC), medial parietal cortex (MPC), and inferior/superior prefrontal cortex (IPFC/SPFC). Primary auditory cortex (PAC) showed low phonemic prediction.
  • Diphone-level segmentation dominates: Across the cortex, the unique contribution of diphone features, computed by variance partitioning (see the sketch after this list), exceeded that of single phonemes and triphones (mixed-effects LRT χ²(2)=10355.60, p<2.2×10⁻¹⁶), with a large effect size (average d′=13.114±0.203, 2 SE). Within temporal cortex ROIs, diphones had higher unique contributions than single phonemes and triphones, with significant effects across ROIs (χ²(9)=625.68, p<2.2×10⁻¹⁶), across features (χ²(8)=4828.82, p<2.2×10⁻¹⁶), and a significant ROI×feature interaction (χ²(6)=522.32, p<2.2×10⁻¹⁶). Effect sizes (d′ for diphone vs single/triphone): PAC 0.440±0.968; STG 12.838±0.838; STS 12.509±0.327; LTC 9.060±0.471. In Broca’s area, diphones also dominated (χ²(2)=443.64, p<2.2×10⁻¹⁶; d′=24.168±0.203).
  • Limited single-phoneme and triphone uniqueness: Unique single-phoneme effects localized to a small number of voxels in bilateral STS; triphone unique or joint contributions were minor and scattered.
  • Role of short words vs diphone identity: Diphone models’ contribution partly, but not solely, reflected short words. Normalized contributions differed among diphone categories (short words > word beginnings > residuals) with significant mixed-effects tests (e.g., F(2,10)=7.44, p=6.6×10⁻⁵; F(2,10)=8.62, p=1.8×10⁻³). Models using only diphone transition probabilities (phonotactics; 8 features from IPhOD) had substantially lower predictive power than identity-based diphone models.
  • Phonemic vs semantic mapping and gradients: Overall cortex showed stronger semantic than phonemic contributions (χ²(1)=712.23, p<2.2×10⁻¹⁶; d′=2.018±0.151). In temporal ROIs, phonemic > semantic in STG (χ²(1)=19.12, p=8.1×10⁻⁵) and STS (χ²(1)=51.75, p=6.3×10⁻¹⁵); semantic > phonemic in LTC (χ²(1)=36.65, p=1.4×10⁻⁸). Spatial gradients: phonemic-to-semantic transition along a medial–lateral gradient in LTC (higher phonemic near STS; higher semantic near ITS), and an inferior–superior gradient in IPFC (phonemic inferior to IFS; semantic superior), supported by paired t-tests along anatomical axes.
  • Feasibility confirmed: Simulations and an extended dataset for one subject showed that with ~2–5 hours of data, fMRI encoding approaches can recover phonemic identity sensitivities and reveal diphone-level dominance despite the slow BOLD signal.
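For reference, the unique and joint contributions reported above follow standard variance partitioning over nested models. Below is a minimal two-band sketch under that assumption; the paper partitions three bands (single phonemes, diphones, triphones), which extends the same inclusion–exclusion logic, and applies a bias correction to R² that is not reproduced here.

```python
# Minimal variance-partitioning sketch (illustrative): unique and shared
# explained variance for two feature bands A and B, given cross-validated
# R^2 values (per-voxel arrays) from the joint model and the two reduced models.
def partition_two_bands(r2_AB, r2_A, r2_B):
    unique_A = r2_AB - r2_B          # variance explained only by A
    unique_B = r2_AB - r2_A          # variance explained only by B
    shared_AB = r2_A + r2_B - r2_AB  # overlap, by inclusion-exclusion
    return unique_A, unique_B, shared_AB
```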
Discussion

The findings delineate a widespread phonemic network beyond classical mid-STG, encompassing bilateral STG/STS/LTC, parietal regions, and inferior prefrontal cortex. Variance partitioning indicates that cortical phonemic segmentation predominantly occurs at the diphone level, with minimal unique triphone representation and limited single-phoneme specificity. This supports phonetic theory: due to coarticulation, single-phoneme acoustics vary with context, while diphones provide a more stable unit for mapping acoustic features to phonemic representations. The overlap between phonemic regions and lexical/semantic areas, and the enhanced representation of diphones that form short words, suggest that phonemic and lexical processes are intertwined. Nonetheless, diphone identity contributes beyond short words and beyond diphone transition statistics, indicating genuine encoding of diphone content rather than merely phonotactic expectations. Cortical gradients in LTC (medial phonemic to lateral semantic) and IPFC (inferior phonemic to superior semantic) outline putative pathways for transitioning from phonemic/lexical stages to semantic representations. These results refine models of speech processing by specifying the granularity of sublexical segmentation and the spatial organization of phonemic-to-semantic transformations.

Conclusion

This study maps extensive cortical regions encoding phonemic information during natural narrative listening and shows that phonemic segmentation is predominantly realized at the diphone level. Phonemic regions extend across lateral temporal and parietal cortices and into inferior prefrontal cortex, overlapping with lexical/semantic areas and revealing systematic gradients from phonemic/lexical to semantic representations. The work demonstrates that fMRI encoding models can recover fast linguistic computations when stimuli are long and feature occurrences are sufficiently sparse. Future research should: (1) incorporate models of adaptive segmentation and linguistic context to test word segmentation expectations; (2) further parse phonemic vs lexical contributions and explore subnetwork specializations within LTC and IPFC; (3) examine higher-order phonemic combinations with longer datasets; (4) investigate bilingual/multilingual processing to dissociate language-specific vs language-general phonemic mechanisms; and (5) evaluate additional speech features (e.g., syllabic/envelope-based) in multimodal datasets.

Limitations
  • Temporal resolution: fMRI’s sluggish BOLD response may obscure rapid phonemic events; conclusions rely on long recordings and model-based recovery of temporally sparse identity features.
  • Specificity evidence: While variance partitioning shows strong diphone uniqueness, the study does not provide direct topographic maps organized by diphone structure; triphone-specific maps were not identified.
  • Feature scope: Other segmentation schemes (e.g., syllable or envelope-based features) were not exhaustively tested; adaptive segmentation models were not implemented.
  • Data duration and sensitivity: With ~2 hours (and one subject at 5 hours), detection of higher-order combinations (triphones and beyond) remains limited; identity-based models require sufficient voxel sensitivity (≈<10 single phonemes or <100 diphones) to outperform count-based models.
  • Short word confound: A substantial fraction of diphone predictive power arises from short words; although identity effects persist beyond this, disentangling phonemic from lexical effects fully requires additional manipulations.
  • Generalizability: Naturalistic narrative stimuli limit control over phoneme/triphone distributions; cross-language generalization was not tested.