Combining predictive coding and neural oscillations enables online syllable recognition in natural speech

S. Hovsepyan, I. Olasagasti, et al.

This study by Sevada Hovsepyan, Itsaso Olasagasti, and Anne-Lise Giraud investigates how predictive coding and neural oscillations jointly support syllable recognition in natural speech. Its computational model shows how aligning internal predictions in time with acoustic input supports effective sensory processing.
Introduction

The study addresses how the brain segments continuous speech into discrete units (syllables) during online comprehension, a process thought to involve coordination between bottom-up acoustic information and top-down predictions. Prior work implicates cross-frequency coupling, particularly theta (3–8 Hz) and gamma oscillations, in parsing syllables and organizing phonemic encoding, while predictive coding frameworks emphasize the role of internal generative models and contextual priors in perception. The central hypothesis is that incorporating physiologically plausible theta-gamma coupling into a predictive coding architecture will enhance online syllable parsing and recognition by temporally aligning top-down spectrotemporal predictions with bottom-up acoustic input. The study tests whether and how theta oscillations that signal syllable onsets and durations can improve continuous syllable identification, and under what conditions stimulus-driven versus endogenous theta-gamma coupling is advantageous.

Literature Review

Neural oscillations are linked to diverse cognitive functions, and cross-frequency coupling has been proposed to support hierarchical processing in speech, with theta organizing syllabic parsing and gamma encoding within-syllable content. Predictive coding and related Bayesian approaches posit top-down generative models that predict sensory inputs, with bottom-up prediction errors driving model updating; low-beta activity has been associated with top-down predictive mechanisms during speech. Prior models have emphasized either oscillatory preprocessing for decoding or recognition of isolated items, and few integrate oscillatory timing with predictive coding for continuous speech. Earlier neurocomputational work showed that theta-gamma interactions can aid speech decoding and that intrinsic theta oscillators can align to speech rhythms for syllabification. The current work builds on these insights by explicitly combining predictive coding with biologically motivated theta-gamma coupling to handle continuous, natural speech at the syllable level.

Methodology

Design: A neurocomputational model (Precoss) within a predictive coding framework was developed to perform online syllable parsing and recognition in continuous natural speech. The model separates "what" (identity and spectral structure of syllables) from "when" (timing and duration), the latter implemented via oscillatory processes.

Inputs: For each sentence, the model receives (1) a slow amplitude modulation signal derived from the waveform and (2) a 6-channel auditory spectrogram (log-spaced channels, 150 Hz–1 kHz, normalized). Together, these yield 7 input channels.

Dataset: 220 English sentences from the TIMIT corpus (22 speakers; 2888 syllables total). Ten sentences were used for parameter tuning and excluded from evaluation; the remaining 210 sentences were used for performance analyses. Phonemic boundaries were provided by TIMIT and converted to syllable boundaries via a rule-based tool, which provided reference syllable timing and one 6×8 spectrotemporal template matrix per syllable.

Generative model and inference: A two-level predictive coding model with hidden and causal states was inverted using Dynamic Expectation Maximization. The top level includes (i) a theta module, a canonical Ermentrout–Kopell oscillator driven by the tracked slow amplitude modulation, and (ii) a spectrotemporal module. In the theta module, whenever the theta phase reaches a criterion, a Gaussian pulse signals a theta trigger (estimated syllable onset), and the instantaneous theta frequency estimates syllable duration. The spectrotemporal module contains eight sequential gamma units that provide processing windows across the syllable, plus a pool of syllable units, each encoding an 8-step spectrotemporal template. The bottom level uses a Hopfield-like network to generate predicted spectrogram dynamics for comparison with the input; prediction errors update the states.

Theta-gamma coupling and resets: Gamma sequences are reset at detected syllable onsets (theta triggers) to align temporal predictions with the input. The last (8th) gamma unit triggers a reset of the evidence accumulated by the syllable units before the next syllable, to prevent interference. In some variants, the instantaneous theta rate sets or informs the preferred gamma sequence rate (the syllable duration estimate).

Model variants (A–F): Six architectures differed in (i) whether gamma activity is reset by theta triggers and whether a preferred gamma rate is set exogenously by theta (true theta-gamma coupling, variant A) or endogenously (variant B), (ii) the presence or absence of a preferred gamma rate (variants C and D have no preferred rate), and (iii) the presence or absence of the reset of syllable-unit accumulated evidence by the last gamma unit (E and F lack this reset). The variants thus test exogenous versus endogenous theta-gamma coupling and the role of resets.

Compressed speech tests: Variants A and B were additionally tested on time-compressed speech (×2 and ×3) to assess the advantages of stimulus-driven theta under adverse timing conditions.

Performance metric: For each gamma cycle (syllable window), the syllable unit with the highest average activation was taken as the recognized syllable; performance is the percentage of correctly identified syllables per sentence. Chance level was estimated via random simulations at 6.9%.

Statistics: Pairwise comparisons across variants used Wilcoxon signed-rank tests with Bonferroni correction (n=210 sentences). Compressed speech comparisons used paired t-tests. The Bayesian Information Criterion (BIC) was computed from posterior log-likelihoods of the true syllable sequences, penalizing model complexity (free parameters counted per variant).
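
The theta module and the gamma windows described above can be illustrated with a minimal sketch. It assumes the standard theta-neuron form of the Ermentrout–Kopell model, dφ/dt = (1 − cos φ) + (1 + cos φ)·I, with simple Euler integration. The function names, the `rate` scaling factor, and the `drive` input are hypothetical; the Gaussian pulse, the model inversion, and the parameter values used by the authors are not reproduced here.

```python
import math

def theta_triggers(drive, dt, rate=1.0, phase0=0.0):
    """Integrate a theta-neuron phase and return the times of its triggers.

    drive  : input value per time step (hypothetical stand-in for the
             tracked slow amplitude modulation)
    dt     : integration step in seconds
    rate   : time-scale factor (an assumption, not a value from the paper)
    phase0 : initial phase in radians
    """
    phase = phase0
    triggers = []
    for step, inp in enumerate(drive):
        # Standard theta-neuron form: dphi/dt = (1 - cos phi) + (1 + cos phi) * I
        dphi = rate * ((1.0 - math.cos(phase)) + (1.0 + math.cos(phase)) * inp)
        phase += dt * dphi
        if phase >= math.pi:
            # Criterion phase reached: record an estimated syllable onset
            triggers.append((step + 1) * dt)
            phase -= 2.0 * math.pi  # wrap around for the next cycle
    return triggers

def gamma_windows(onset, duration, n_units=8):
    """Split one estimated syllable into n_units consecutive (start, end) windows."""
    width = duration / n_units
    return [(onset + k * width, onset + (k + 1) * width) for k in range(n_units)]

if __name__ == "__main__":
    # Constant drive of 1 makes the phase advance at exactly 2 * rate rad/s,
    # so successive triggers are about pi / rate seconds apart.
    times = theta_triggers([1.0] * 10000, dt=0.001)
    print(times)
    print(gamma_windows(onset=times[0], duration=times[1] - times[0]))
```

In this sketch the interval between two successive triggers plays the role of the syllable duration estimate, and restarting `gamma_windows` at each trigger corresponds to resetting the gamma sequence at a detected onset.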

Key Findings
  • All model variants performed above chance (6.9%), but significant differences emerged across architectures (Wilcoxon signed-rank tests; most pairwise comparisons significant after Bonferroni correction, p < 1e-7).
  • Resetting accumulated evidence in syllable units is crucial: removing the reset (variants E and F) markedly reduced performance. On average, E and F scored about 23% below A and B and about 15% below C and D, indicating the importance of clearing evidence at syllable boundaries.
  • Theta-gamma coupling improves performance beyond resets alone: variants A and B (with coupling and resets) outperformed variants C and D (with resets but without preferred gamma rate) by approximately 8–10%.
  • Exogenous vs endogenous theta at natural rates: For natural (uncompressed) speech, no significant performance difference was found between stimulus-driven theta-gamma coupling (A) and endogenous coupling with a preferred gamma rate (B), likely because natural syllable durations matched model expectations.
  • Under adverse timing (compression): With time-compressed speech, overall performance decreased with increasing compression. At ×2 compression, A and B remained similar; at ×3, stimulus-driven theta (A) was significantly better than endogenous (B) (mean difference = 1.274; 95% CI [1.208, 2.99]; Cohen’s d = 0.1363; p = 0.0248), indicating an advantage for stimulus-driven theta-gamma coupling when speech rates deviate strongly from expectations.
  • BIC model comparison: Variant A achieved the highest BIC, indicating the best accuracy–complexity trade-off and more confident/precise syllable recognition relative to other variants, including B.
  • Theta onset detection achieved about 53% correct detection of true syllable onsets within ±50 ms using the simplified theta oscillator, sufficient to confer benefits when combined with coupling and resets.
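
As an illustration of the onset-detection figure in the last point, the following sketch computes the fraction of reference onsets that have a detected trigger within a tolerance. The matching rule (any trigger within ±50 ms counts as a hit, with no one-to-one pairing) and the function name are assumptions; the summary does not state how the authors matched triggers to onsets.

```python
def onset_hit_rate(true_onsets, detected, tol=0.050):
    """Fraction of true onsets with at least one detected trigger within +/- tol seconds.

    Assumed matching rule: a trigger may be reused for several onsets.
    """
    if not true_onsets:
        return 0.0
    hits = sum(
        1 for t in true_onsets
        if any(abs(t - d) <= tol for d in detected)
    )
    return hits / len(true_onsets)

if __name__ == "__main__":
    # Made-up times in seconds, for illustration only
    reference = [0.10, 0.35, 0.60, 0.90]
    triggers = [0.12, 0.45, 0.58]
    print(onset_hit_rate(reference, triggers))
```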
Discussion

The findings demonstrate that integrating neural oscillations with predictive coding enhances online syllable parsing and recognition. Theta-gamma coupling temporally aligns top-down spectrotemporal predictions with bottom-up input, while resetting syllable-unit evidence at syllable boundaries prevents cross-syllabic interference, jointly improving decoding. Although endogenous gamma-derived rhythms can approximate natural syllable durations, stimulus-driven theta becomes advantageous when speech timing deviates (e.g., heavy compression), highlighting the utility of stimulus-locked temporal tracking for robust parsing. BIC results indicate that exogenous theta-gamma coupling with resets yields more precise and confident syllable inferences for comparable accuracy, suggesting resilience to variability and potential noise. These results support theories positing cross-frequency hierarchical organization (theta organizing syllable-scale windows, gamma encoding sub-syllabic content) operating within a predictive coding architecture where top-down priors interact with bottom-up errors. The work bridges algorithmic predictive processing and biologically plausible oscillatory mechanisms, with implications for neurobiological models of speech and potential applications in ASR.
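
The point about confidence at comparable accuracy can be sketched as follows. The code assumes the larger-is-better sign convention, BIC = ln L − (k/2)·ln n, which is what the "highest BIC" wording implies; the function names and the probabilities are hypothetical and are not taken from the paper.

```python
import math

def sequence_log_likelihood(p_true):
    """Sum of log posterior probabilities assigned to the true syllables."""
    return sum(math.log(p) for p in p_true)

def bic_score(log_likelihood, n_params, n_obs):
    """BIC in the larger-is-better convention: ln L - (k / 2) * ln n."""
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

if __name__ == "__main__":
    # Two hypothetical models that both pick the right syllable every time,
    # but with different posterior confidence in the true syllable.
    confident = [0.9, 0.8, 0.9]
    hesitant = [0.5, 0.4, 0.5]
    for name, probs in (("confident", confident), ("hesitant", hesitant)):
        ll = sequence_log_likelihood(probs)
        print(name, bic_score(ll, n_params=3, n_obs=len(probs)))
```

Both hypothetical models would reach the same winner-take-all accuracy, yet the more confident one obtains the higher score, which is the sense in which a BIC comparison can separate variants whose percent-correct values are similar.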

Conclusion

This study shows that combining predictive coding with theta-gamma oscillatory coupling enables effective online recognition of syllables in continuous, natural speech. Two mechanisms are key: (1) resetting accumulated evidence in syllable units at the end of each syllable and (2) theta-gamma coupling that embeds gamma activity within syllable-duration cycles. Stimulus-driven theta confers advantages under atypical timing conditions and yields a superior accuracy–complexity balance (highest BIC). The approach provides a neurally plausible account of dynamic sensory processing integrating bottom-up and top-down flows. Future work should enhance onset detection mechanisms, incorporate beta-band top-down dynamics (e.g., precision weighting), test robustness under noise and multi-speaker scenarios, and explore implementing oscillatory parsing and coupling in ASR to improve performance and efficiency.

Limitations
  • Simplified oscillatory implementation: a single canonical theta neuron replaced more detailed excitatory–inhibitory networks; theta onset detection was modest (~53% within ±50 ms), potentially limiting gains from exogenous coupling.
  • Scope limited to syllable-level parsing and recognition; higher-level linguistic context and semantics were not modeled.
  • Endogenous vs exogenous theta differences may be underestimated due to natural syllable durations matching model expectations; benefits may be greater with improved onset detection or in noisier conditions.
  • Auditory representation reduced to 6 spectrogram channels; richer features might impact generalization.
  • Evaluations used clean TIMIT sentences; robustness to noise, reverberation, and multi-talker mixtures was not assessed.
  • The model does not explicitly implement low-beta top-down rhythms or precision-weighting mechanisms hypothesized in predictive coding.
  • Performance metrics focus on winner-take-all syllable identification per gamma cycle and do not capture all aspects of uncertainty; some reported percentage differences are approximate from aggregate analyses.
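
The winner-take-all metric mentioned in the last point can be written as a minimal sketch. It assumes activations are sampled on a common time grid and that syllable windows are given as sample-index pairs; the function names, data layout, and example values are hypothetical.

```python
def recognize(activations, windows):
    """Pick, for each window, the syllable unit with the highest mean activation.

    activations : dict mapping syllable label -> list of activation samples
    windows     : list of (start, end) sample-index pairs, end exclusive
    """
    winners = []
    for start, end in windows:
        means = {
            label: sum(trace[start:end]) / (end - start)
            for label, trace in activations.items()
        }
        winners.append(max(means, key=means.get))
    return winners

def percent_correct(winners, reference):
    """Percentage of windows whose winner matches the reference syllable."""
    correct = sum(w == r for w, r in zip(winners, reference))
    return 100.0 * correct / len(reference)

if __name__ == "__main__":
    # Made-up activations for three syllable units over six samples
    acts = {
        "ba": [0.9, 0.8, 0.1, 0.2, 0.3, 0.1],
        "da": [0.1, 0.2, 0.7, 0.9, 0.2, 0.3],
        "ga": [0.0, 0.0, 0.2, 0.1, 0.4, 0.35],
    }
    picked = recognize(acts, [(0, 2), (2, 4), (4, 6)])
    print(picked, percent_correct(picked, ["ba", "da", "ba"]))
```

Because only the winning label per window enters the score, a near tie and a clear win count the same, which is the loss of uncertainty information the point above refers to.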