Psychology

Neural dynamics of phoneme sequences reveal position-invariant code for content and order

L. Gwilliams, J. King, et al.

This groundbreaking research by Laura Gwilliams, Jean-Remi King, Alec Marantz, and David Poeppel delves into the human brain's remarkable ability to sequence speech signals for word recognition. Using magnetoencephalography recordings from participants listening to narrative stories, the study reveals how the brain encodes multiple speech sounds at once, adapting to both predictable and unexpected phonemes. Discover how our brains remain flexible in processing spoken language!
Introduction

The study investigates how the brain parses continuous, variable acoustic input into sequences of discrete phonemes to recognize words during natural speech comprehension. While existing cognitive models address word recognition largely for isolated words, and prior work has shown neural encoding of phonetic features and their interactions with (sub)lexical representations, it remains unclear how the brain assembles sequences of acoustic-phonetic signals into ordered phoneme strings in continuous speech. Parsing is challenging because reliable acoustic boundaries between phonemes are absent, adjacent phonemes blend through co-articulation, and order must be preserved to distinguish words that share the same phonemes (e.g., pets vs. pest). The research question centers on identifying a neural code that simultaneously represents phonetic content and order, without confusion between neighboring speech sounds, during ongoing speech processing.

Literature Review

Prior work has demonstrated neural encoding of phonetic features in superior temporal regions and interactions between phonetic and lexical levels, as well as sensitivity to phoneme probability within words. Traditional models (e.g., TRACE, Shortlist) and information-theoretic approaches have informed understanding of word recognition but provide limited accounts of sequence representation in continuous speech. Hypotheses such as explicit sequence encoding, purely context-dependent encodings (e.g., Wickelphones), or position-specific/edge-based coding predict limited generalization across positions. The present study builds on these findings by testing position invariance, temporal dynamics (temporal generalization), and the influence of predictability (surprisal) and lexical uncertainty (cohort entropy) on phonetic representations, contrasting neural responses with acoustic mel spectrogram correlates.

Methodology

Participants: Twenty-one native English speakers (13 female; mean age 24.8 years, SD 6.4) each participated in two sessions (separated by 1 day to 2 months), with informed consent under NYU Abu Dhabi IRB approval.

Stimuli: Four stories from the Open American National Corpus were synthesized using Mac OS X text-to-speech (voices Ava, Samantha, Allison), with voice and speech rate (145–205 wpm) changing every 5–20 sentences; inter-sentence silences varied from 0 to 1000 ms. Phoneme boundaries and labels were obtained with the Gentle forced aligner.

Procedure: Participants passively listened via tube earphones (~70 dB SPL) to ~1 h of stories per session; comprehension questions (~every 3 min) yielded 98% accuracy, confirming attention.

MEG acquisition: 208-channel axial gradiometer system (KIT), 1000 Hz sampling rate, 0.01–200 Hz online filtering; head position was tracked before and after each session.

Preprocessing: Noise was reduced with CALM; a temporal receptive field (TRF) model regressed out MEG components explained by the acoustic envelope and pitch (ridge regression, lags up to 200 ms), ensuring that residuals were not driven by low-level acoustics. Data were bandpass-filtered 0.1–50 Hz, downsampled to 250 Hz, and epoched from −200 to +600 ms around phoneme onsets without baseline correction.

Acoustic features: Mel spectrograms (librosa) were computed with 2048-sample Hamming windows, a 128-sample hop, and 208 mel bands spanning 1–11250 Hz, then downsampled to 250 Hz and epoched in the same way as the MEG.

Features modeled: 14 binary phonetic features (voicing; manner: fricative, nasal, plosive, approximant, vowel; vowel place: central/low/mid/high; consonant place: coronal, glottal, labial, velar). Nuisance variables included stress, sequence frequency, and morphological and syllabic position features. Subset variables (used to split epochs for condition contrasts) included word onset/offset, surprisal, entropy, and distance from word edges. Surprisal and cohort entropy were computed from cohort-based word probabilities and English Lexicon Project frequencies.

Decoding framework: Back-to-back (B2B) ridge regression disentangled correlated features via a two-stage approach: (1) a decoding step predicting each feature from the multichannel MEG (or mel bands) independently at each of 201 time points (−200 to +600 ms), with regularization selected via RidgeCV across 20 log-spaced alphas; (2) an encoding step mapping the true features to the decoded predictions to estimate feature-specific beta coefficients (diag(H)), which served as the performance metric. Train/test splits (50/50) were repeated 100 times and averaged. Proportion of variance explained was computed by normalizing each feature's beta time course to the maximum summed beta across features (noise ceiling). A minimal sketch of this pipeline is given below.

Temporal generalization (TG): Decoders trained at time t were tested at all other times t′ to assess representational stability versus dynamics; statistics contrasted the diagonal and horizontal axes of the TG matrix.

Position generalization: Decoders trained on one phoneme position (e.g., P1) were tested on other positions to assess position invariance.

History capacity: Simulations used unregularized decoding coefficients to reconstruct synthetic MEG for 4-phoneme anagrams and assessed cosine similarity to the true sequences; additional analyses decoded the current and three preceding phonemes from the same instantaneous responses.

Latency reconstruction: Ridge regression decoded the time elapsed since phoneme onset from samples between 100 and 400 ms; performance was assessed by the correlation between true and reconstructed latency.
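The back-to-back decoding framework described above can be illustrated with a minimal sketch. The snippet below is not the authors' code: the arrays meg and features, the random 50/50 split, and the alpha range are placeholders assumed for illustration, and scikit-learn's RidgeCV stands in for the study's exact estimator.

```python
# Minimal sketch of back-to-back (B2B) ridge regression, assuming hypothetical inputs:
#   meg:      array of shape (n_epochs, n_sensors, n_times)
#   features: array of shape (n_epochs, n_features), e.g. 14 binary phonetic features
# This is an illustrative simplification, not the authors' implementation.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import scale

def b2b_betas(meg, features, n_splits=100, alphas=np.logspace(-5, 5, 20)):
    # alphas: 20 log-spaced regularization strengths (range assumed for illustration)
    n_epochs, n_sensors, n_times = meg.shape
    n_features = features.shape[1]
    betas = np.zeros((n_splits, n_times, n_features))
    rng = np.random.RandomState(0)
    for split in range(n_splits):
        # Random 50/50 split: decoders fit on one half, encoder evaluated on the other.
        order = rng.permutation(n_epochs)
        dec_idx, enc_idx = order[: n_epochs // 2], order[n_epochs // 2:]
        for t in range(n_times):
            X_dec = scale(meg[dec_idx, :, t])   # sensor patterns, decoding set
            X_enc = scale(meg[enc_idx, :, t])   # sensor patterns, held-out set
            Y_dec, Y_enc = features[dec_idx], features[enc_idx]
            # Step 1: decode every feature from the MEG sensors.
            G = RidgeCV(alphas=alphas).fit(X_dec, Y_dec)
            Y_hat = G.predict(X_enc)            # decoded feature estimates
            # Step 2: regress decoded estimates on the true features; the diagonal
            # of H credits each feature with its own unique contribution.
            H = RidgeCV(alphas=alphas).fit(Y_enc, Y_hat)
            betas[split, t] = np.diag(H.coef_)
    return betas.mean(axis=0)                   # (n_times, n_features)
```

Under this scheme, temporal generalization amounts to evaluating the decoders fitted at one time point on data from every other time point, and position generalization to fitting the same pipeline on epochs from one phoneme position and testing on another.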
Spatial analyses: Decoding/encoding coefficients (fit without regularization) were visualized over sensors; trajectories were quantified by projections onto anterior–posterior and left–right sensor masks, and trajectory structure was compared to null distributions.

Statistical inference: Second-order permutation cluster tests (10,000 permutations) assessed above-chance decoding and between-condition contrasts; multiple-comparison corrections included Bonferroni correction across time samples for the overlap analyses.
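As one concrete, hypothetical way to run a second-order (across-subject) temporal cluster test of the kind described above, the sketch below uses MNE-Python's one-sample cluster permutation test on subject-level decoding time courses; the scores array, the chance level, and the one-tailed threshold are assumptions for illustration and may differ from the study's exact settings.

```python
# Hypothetical illustration of an across-subject permutation cluster test on
# decoding time courses, using MNE-Python.
#   scores: array of shape (n_subjects, n_times), e.g. per-subject diag(H) betas
#   chance: expected value under the null (0 for beta coefficients)
import numpy as np
from mne.stats import permutation_cluster_1samp_test

def cluster_test(scores, chance=0.0, n_permutations=10000):
    X = scores - chance                       # deviation from chance per subject
    t_obs, clusters, cluster_pv, _ = permutation_cluster_1samp_test(
        X,
        n_permutations=n_permutations,        # 10,000 sign-flip permutations
        tail=1,                               # test for above-chance decoding
        out_type="mask",
    )
    # Report each temporal cluster whose corrected p-value survives alpha = 0.05.
    for mask, p in zip(clusters, cluster_pv):
        samples = np.where(mask)[0]
        if p < 0.05:
            print(f"cluster from sample {samples[0]} to {samples[-1]}, p = {p:.4f}")
    return t_obs, clusters, cluster_pv
```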

Key Findings
  • Phonetic features are decodable from MEG between ~50–300 ms and account for 46.2% of explainable variance (temporal permutation test p<0.001; critical t≈3.61). Information-theoretic measures (surprisal, entropy, log sequence frequency) account for 14.5% (−10–540 ms; p<0.001; t=2.98). Positional properties account for 31% (−120–600 ms; p<0.001; t=5.61). Boundary onset/offset features explain the remaining variance (0–410 ms; p<0.001; t=3.1).
  • From the mel spectrogram, phonetic features account for 52.1% of explainable variance (0–280 ms; p<0.001; t=9.56). Acoustic and neural phonetic decoding performances correlate across features (Spearman r=0.59; p=0.032), but information-theoretic and positional measures do not (r=0.13; p=0.41) and are stronger in MEG than acoustics (t=2.82; p=0.012), indicating higher-order neural encoding beyond acoustics.
  • Multiple phonemes are represented simultaneously: phonetic features of at least the current and three preceding phonemes can be decoded from the same instantaneous MEG response; simulations show robust encoding of four-phoneme histories and significant reconstruction up to five preceding phonemes under idealized conditions.
  • Phonetic feature representations are position-invariant: a classifier trained on the first position (P1) generalizes to P2, P3, P−1, and P−2 from 20–270 ms (p<0.001; t=3.3), with comparable performance across these test positions, although training and testing on the same position yields the strongest decoding (maximum 71% vs 23–37% variance explained).
  • Temporal generalization reveals dynamic coding: although phonetic information is decodable for ~300 ms along the diagonal, any given spatial pattern is informative for ~80 ms; diagonal vs horizontal decoding differs significantly (df=200; p<0.001; t=7.54), showing evolving neural topographies.
  • Elapsed time since phoneme onset is encoded: reconstructed latency correlates strongly with true latency (Pearson r=0.87; p<0.001), implying implicit order coding (content jointly encoded with timing).
  • Parallel coding avoids overlap: after thresholding, only 7.3% of significant time-points overlap across positions (SD=9%); excluding first/last phonemes, overlap drops to 3.1% (SD=3.2%). Acoustic spectrograms show stationary cues (square TG) with 92.5% overlap (SD=12.3%), significantly greater than neural (t=−21.3; p<0.001), indicating neural dynamics are not inherited from acoustics.
  • Spatial dynamics: phonetic feature coding remains localized around auditory cortex; phoneme position shows a posterior–anterior trajectory toward frontal regions; both trajectories are more structured than null (p<0.001).
  • Dynamics scale with phoneme duration and input rate: long phonemes generalize 56 ms longer than short (p=0.005; t=−2.6); trajectory angle differs (short: 42.3°, long: 47.1°; t=2.56; p=0.013), consistent with adaptive processing speed to maintain a similar number of concurrently encoded phonemes.
  • Linguistic factors modulate dynamics: relative to word-final phonemes, neural phonetic decoding at word onsets is delayed (lag to the acoustic peak: 136 ms at onset vs 4 ms at offset; t=−3.08; p=0.002) and sustained longer (328–396 ms; p<0.001; t=−3.79). More predictable (low-surprisal) phonemes are decoded earlier than less predictable ones (significant at 120–132 ms; p=0.007). Higher cohort entropy leads to longer maintenance and stronger decoding around 300–420 ms (304–328 ms, p=0.002; 200–420 ms, p=0.016), suggesting that phonetic information is maintained until lexical uncertainty resolves; surprisal and cohort entropy are illustrated in the sketch after this list.
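To make the information-theoretic measures concrete, here is a small hypothetical example of how phoneme surprisal and cohort entropy can be computed from a word cohort. The cohort, the frequency counts, and the use of spelling as a stand-in for phonemes are all simplifications for illustration; the study itself used cohort-based phoneme probabilities and English Lexicon Project frequencies.

```python
# Hypothetical worked example: surprisal and cohort entropy after hearing the
# prefix "pet". Frequencies are invented, and letters stand in for phonemes.
import numpy as np

# Cohort of words consistent with the prefix heard so far, with made-up frequencies.
cohort = {"pet": 900, "pets": 400, "petal": 150, "petition": 50}

total = sum(cohort.values())
p_word = {w: f / total for w, f in cohort.items()}

# Cohort entropy: uncertainty over which word is being heard, H = -sum p * log2(p).
entropy = -sum(p * np.log2(p) for p in p_word.values())

# Surprisal of the next phoneme "s": -log2 P(next phoneme | prefix), where the
# probability sums over cohort members whose next segment is "s" (here: "pets").
p_next_s = sum(p for w, p in p_word.items() if len(w) > 3 and w[3] == "s")
surprisal_s = -np.log2(p_next_s)

print(f"cohort entropy = {entropy:.2f} bits, surprisal of 's' = {surprisal_s:.2f} bits")
```
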
Discussion

The findings demonstrate a neural coding scheme that simultaneously represents phonetic content and order during continuous speech. By embedding phonetic features in neural patterns that evolve with elapsed time since phoneme onset, the auditory system maintains parallel representations of multiple recent phonemes while preventing interference: each phoneme occupies a distinct, time-evolving population code. Position-invariant content encoding rules out purely position-specific or strictly context-dependent (e.g., Wickelphone) schemes as sole mechanisms, indicating at least one context-independent representation. The dynamic temporal code explicitly carries timing information, enabling order discrimination for sequences sharing the same phonemes (e.g., pets vs pest). The adaptive timing of initiation and termination of phonetic processing, modulated by phoneme predictability (surprisal) and lexical uncertainty (cohort entropy), supports interactive processing consistent with predictive coding and analysis-by-synthesis accounts. Neural dynamics differ from the largely stationary acoustic cues, implicating active cortical transformations localized primarily to auditory cortices with distinct trajectories for content vs position. Together, these results clarify how the brain constructs sequence representations that can interface with higher-order lexical processing during natural speech.

Conclusion

This work shows that human auditory cortex implements a dynamic, position-invariant population code that jointly encodes phonetic content and elapsed processing time, enabling simultaneous representation of multiple recent phonemes and implicit order coding. The code minimizes overlap between neighboring phonemes, supports flexible adaptation to speech rate, and varies with phonological predictability and lexical uncertainty, indicating close interaction with higher-level processes. These results identify a candidate intermediate representation—a sliding phonetic trigram—linking acoustic-phonetic input to (sub)lexical units. Future research should improve single-trial SNR (e.g., via repetitions), use higher spatial-resolution methods (e.g., ECoG) to map spatiotemporal trajectories, and establish causal links between representational dynamics and behavior during comprehension.

Limitations
  • Low single-trial MEG signal-to-noise ratio led to modest decoding effect sizes (often 1–2% above chance), necessitating large trial counts and limiting analyses of infrequent contexts or features.
  • MEG's spatial resolution and SNR constrain inferences about the precise cortical loci and configuration changes underlying dynamic coding; higher-resolution data (e.g., ECoG) are needed.
  • The passive listening design precluded direct correlations between neural decoding and behavioral performance; future work should relate representational trajectories to comprehension success and error patterns.
  • Potential residual confounds across phoneme positions (e.g., unmatched distributional properties) may contribute to stronger within-position decoding relative to cross-position generalization.