The role of isochrony in speech perception in noise
V. Aubanel and J. Schwartz
The study investigates whether isochrony (regular timing of speech units) or natural speech timing better supports speech intelligibility in noise. Prior work on cortical oscillations and active sensing suggests that neural excitability cycles could align to rhythmic speech structure, potentially benefiting isochronous input. Classical rhythm theories divided languages into stress-timed (e.g., English) and syllable-timed (e.g., French) classes based on the assumed isochrony of different units, but natural speech is not strictly isochronous and its timing variations carry linguistic information. This yields two competing hypotheses: (1) isochrony enhances processing by maximizing predictability and entrainment; (2) deviations from natural timing impair recognition, making natural timing optimal. The authors aim to disentangle the roles of naturalness and isochrony at two hierarchical levels (accent group and syllable) in English and French, using intelligibility in noise as the outcome.
Background literature links low-frequency cortical oscillations to sensory selection and speech tracking, with the syllable often posited as a key temporal unit. The isochrony and rhythmic class hypotheses historically categorized languages by isochronous feet (stress-timed) or syllables (syllable-timed), though empirical support has been mixed and debated. Entrainment-based accounts propose benefits of regular timing (reducing need for phase resets), yet natural speech displays irregularities crucial for linguistic encoding. Prior work on isochronously retimed English speech in noise indicated that natural timing outperforms retimed forms, motivating cross-language tests and finer-grained metrics of departure from natural rhythm and isochrony at both accent and syllable levels.
Design and materials: Two parallel experiments used sentence-length materials from the Harvard corpus (English) and the Fharvard corpus (French). For each language, 180 sentences (5–7 keywords; French exactly 5) were recorded (English female talker; French male talker). Sentences were annotated at two hierarchical rhythmic levels: the syllable (lowest unit) and the accent group (English: stressed syllable; French: accentual phrase). P-centers (perceptual centers, the perceived moments of occurrence of each unit, typically near vowel onset) were located initially via forced alignment and then manually corrected. Accentual phrase boundaries in French were validated by 3 native annotators; 180 sentences had full agreement. Accent-group P-centers were aligned with their corresponding syllable P-centers.
Temporal manipulations: Five timing conditions per sentence: NAT (unmodified); ISO.acc (accent-group isochrony); ANI.acc (accent-level anisochrony); ISO.syl (syllable isochrony); ANI.syl (syllable-level anisochrony). For ISO conditions, inter-unit durations were set to the mean of the natural durations at the relevant level while keeping sentence endpoints identical. For ANI conditions, the same net temporal distortion magnitude as ISO was applied but with unpredictable timing by time-reversing the sequence of natural inter-unit durations and then equalizing them, thereby breaking regularity while matching distortion. Temporal transformations compressed/expanded segments between annotated events using WSOLA (high-quality, pitch-preserving time-scaling). All manipulated sentences were mixed with speech-shaped noise at −3 dB SNR (noise derived via 200-pole LPC filter from talker’s full corpus) to target ~60% keyword recognition in NAT.
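The ISO retiming step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the example P-center times are hypothetical, and the sketch only computes target event times and per-segment time-scale factors. The actual audio modification (WSOLA in the study) is not reproduced here.

```python
from typing import List


def isochronous_targets(p_centers: List[float]) -> List[float]:
    """Space events evenly between the first and last P-center.

    The common interval is the mean of the natural inter-unit durations,
    so the first and last events keep their original times.
    """
    if len(p_centers) < 2:
        raise ValueError("need at least two events")
    n_intervals = len(p_centers) - 1
    mean_interval = (p_centers[-1] - p_centers[0]) / n_intervals
    return [p_centers[0] + i * mean_interval for i in range(len(p_centers))]


def time_scale_factors(original: List[float], target: List[float]) -> List[float]:
    """Per-segment factor: > 1 stretches the segment, < 1 compresses it."""
    if len(original) != len(target):
        raise ValueError("event lists must have the same length")
    factors = []
    for i in range(len(original) - 1):
        src = original[i + 1] - original[i]
        dst = target[i + 1] - target[i]
        if src <= 0:
            raise ValueError("event times must be strictly increasing")
        factors.append(dst / src)
    return factors


# Hypothetical P-center times in seconds (not taken from the study).
natural = [0.10, 0.35, 0.45, 0.90]
targets = isochronous_targets(natural)
print(targets)
print(time_scale_factors(natural, targets))
```

Each factor would then be handed to a time-scaling routine for the corresponding segment; short natural intervals receive factors above 1 and long ones receive factors below 1.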
Temporal distortion metrics: Net temporal distortion δ was computed as the root mean square (RMS) of the binary-log-transformed time-scale step function across segments, so that compression and dilation are treated symmetrically. Four additional metrics quantified departure from two canonical forms at each level: departure from natural timing (dnat.acc, dnat.syl) and from isochrony (diso.acc, diso.syl). By construction, NAT has dnat.* = 0, ISO has diso.* = 0, and ISO and ANI at a given level share the same δ magnitude.
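A small sketch of the δ computation may help make the symmetry property concrete. It assumes that the RMS is taken over time, so that each segment's squared log2 factor is weighted by its original duration; the summary does not spell out the weighting, so this is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def net_temporal_distortion(factors: Sequence[float],
                            durations: Sequence[float]) -> float:
    """Duration-weighted RMS of log2 time-scale factors.

    factors:   time-scale factor applied to each segment (1.0 = unchanged)
    durations: original duration of each segment (assumed weighting)
    """
    if len(factors) != len(durations) or len(factors) == 0:
        raise ValueError("factors and durations must be non-empty and aligned")
    if any(f <= 0 for f in factors) or any(d <= 0 for d in durations):
        raise ValueError("factors and durations must be positive")
    total = sum(durations)
    mean_square = sum(
        d * math.log2(f) ** 2 for f, d in zip(factors, durations)
    ) / total
    return math.sqrt(mean_square)


# Doubling one segment and halving another of equal length each contribute
# one octave of distortion, so the log transform treats them identically.
print(net_temporal_distortion([2.0, 0.5], [0.2, 0.2]))
# Unmodified speech has every factor equal to 1, hence zero distortion.
print(net_temporal_distortion([1.0, 1.0, 1.0], [0.2, 0.3, 0.1]))
```

Using the binary logarithm means a factor of 2 and a factor of 0.5 yield the same magnitude, which is what "treated symmetrically" refers to.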
Participants and procedure: English: 26 native Australian English speakers (21 female), mean age 20.9 (SD 6.3), ethics approval H9495. French: 27 native French speakers (15 female), mean age 26.7 (SD 8.8), ethics approval CERGA IRB00010290-2017-12-12-33. Binaural presentation over closed headphones at a comfortable fixed level. Task: type what was heard; five blocks (one per condition), 36 sentences each; 5 practice trials per condition (fixed order), with block order counterbalanced and sentence order pseudo-randomized.
Scoring and validation: Automatic keyword matching with dictionaries for homophones, numerals, and common misspellings. Manual validation of 530 responses (~5.5%) showed 98% agreement; minor dictionary updates followed.
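Automatic keyword scoring of this kind might look like the sketch below. The variant dictionary, function names, and example sentence are hypothetical and stand in for the study's homophone, numeral, and misspelling dictionaries, whose contents are not given here.

```python
import re
from typing import Dict, List, Sequence


def normalize(text: str, variants: Dict[str, str]) -> List[str]:
    """Lowercase, tokenize, and map known variants to a canonical form."""
    tokens = re.findall(r"[^\W_]+(?:'[^\W_]+)?", text.lower())
    return [variants.get(token, token) for token in tokens]


def score_keywords(response: str,
                   keywords: Sequence[str],
                   variants: Dict[str, str]) -> int:
    """Count keywords found in the typed response.

    Each response token can satisfy at most one keyword, so a repeated
    keyword must be typed twice to be counted twice.
    """
    remaining = normalize(response, variants)
    correct = 0
    for keyword in keywords:
        canonical = variants.get(keyword.lower(), keyword.lower())
        if canonical in remaining:
            remaining.remove(canonical)
            correct += 1
    return correct


# Hypothetical variant dictionary: homophone, numeral, common misspelling.
VARIANTS = {"fore": "four", "4": "four", "recieve": "receive"}

n_correct = score_keywords(
    "the fore boys recieve gifts",
    ["four", "boys", "receive", "gifts"],
    VARIANTS,
)
print(n_correct)
```

Order of the keywords is ignored in this sketch; whether the original scoring required keywords in sequence is not stated.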
Statistical analysis: For condition effects, generalized linear mixed-effects models (glmer, lme4) were fit per language with random intercepts; simultaneous generalized hypothesis tests (glht, multcomp) adjusted for multiple comparisons. For metric-based analyses across both languages, logistic mixed models predicted intelligibility using subsets where metrics were non-zero: (A) NAT sentences with diso.acc and diso.syl; (B) ISO sentences with dnat.* metrics; (C) ANI sentences with all four metrics. Fixed effects included language and relevant metrics; random intercepts by sentence and participant. Model selection used likelihood-ratio tests to identify minimal equivalent models. Fixed-effect sizes (R²) were computed with r2beta.
- Natural timing superior: Unmodified natural sentences (NAT) were significantly more intelligible than all temporally modified conditions in both languages (all comparisons vs NAT p<0.001; Table 1, rows 1–4). Greater temporal distortion (δ) corresponded to lower intelligibility.
- Isochronous vs anisochronous: In English, accent-level isochrony yielded higher intelligibility than anisochrony (ISO.acc > ANI.acc; Est=0.177, z=4.00, p<0.001). A trend favored ISO over ANI at the syllable level (Est=0.110, z=2.43, p=0.092). Combined across levels, English showed an ISO advantage (Est=0.287, z=4.54, p<0.001). In French, no significant ISO vs ANI advantage was observed (rows 5–7 non-significant).
- Level matters: Syllable-level distortions reduced intelligibility more than accent-level distortions in both languages (Table 1, row 8; p<0.001 for French and English), consistent with larger applied distortion at the syllable level.
- Metric-based analyses (language did not contribute as a fixed effect):
  - A) NAT sentences (departure from isochrony): Minimal model with diso.acc and diso.syl (no interactions; AIC 6721.7). Intelligibility increased with accent-level irregularity (diso.acc Est=1.065, p=0.021) and decreased with departure from syllable isochrony (diso.syl Est=−1.515, p=0.008). Fixed-effect R² total=0.028 (diso.syl=0.018; diso.acc=0.013).
  - B) ISO sentences (departure from natural timing): Minimal model retained only dnat.syl (AIC 13502). Intelligibility was strongly negatively correlated with departure from natural syllabic timing (dnat.syl Est=−2.623, z=−9.541, p<2e-16). Fixed-effect R²=0.045.
  - C) ANI sentences (departure from both natural timing and isochrony): Minimal model included dnat.syl and diso.syl (AIC 13591). dnat.syl was a strong negative predictor (Est=−2.942, z=−7.637, p=2.22e-14), while diso.syl showed a small negative trend (Est=−0.531, p=0.0527). Fixed-effect R²=0.059 (dnat.syl=0.030; diso.syl=0.003). dnat.syl and diso.syl were correlated (French r=0.72; English r=0.56).
- Cross-language similarity: The pattern of results was similar for French and English; language did not significantly modulate metric effects.
- Interpretation: Natural timing statistics (top-down predictive information) dominate intelligibility, with a secondary, smaller benefit of syllabic isochrony (bottom-up regularity), especially apparent when timing is otherwise irregular.
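To get a feel for the size of the condition effects listed above, the estimates can be converted to odds ratios. This sketch assumes the reported Est values are coefficients on the log-odds (logit) scale, which is the default link for binomial mixed models but is not stated explicitly in this summary. The 50% baseline used in the example is purely illustrative and is not a figure from the study.

```python
import math


def odds_ratio(logit_estimate: float) -> float:
    """Convert a log-odds coefficient to an odds ratio."""
    return math.exp(logit_estimate)


def shifted_probability(baseline: float, logit_estimate: float) -> float:
    """Probability after adding a log-odds effect to a baseline probability."""
    if not 0.0 < baseline < 1.0:
        raise ValueError("baseline must lie strictly between 0 and 1")
    logit = math.log(baseline / (1.0 - baseline)) + logit_estimate
    return 1.0 / (1.0 + math.exp(-logit))


# English ISO.acc vs ANI.acc estimate from the list above.
print(round(odds_ratio(0.177), 2))
# Effect of that estimate on a hypothetical 50% keyword-recognition baseline.
print(round(shifted_probability(0.5, 0.177), 3))
```

Under these assumptions, an estimate of 0.177 corresponds to odds of correct keyword recognition roughly 1.19 times higher, a modest shift compared with the drop from natural to modified timing.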
The findings resolve the competing hypotheses by showing that natural timing provides a stronger foundation for speech perception in noise than isochrony. Any temporal distortion away from natural timing reduces intelligibility, indicating that listeners exploit learned timing statistics to guide top-down predictions during decoding. Nonetheless, syllable-level regularity confers a modest benefit, consistent with bottom-up facilitation via cortical entrainment to syllabic rhythms. The lack of language effects suggests a language-general temporal mechanism, positioning the syllable as a core neurolinguistic unit whose temporal scale aligns with theta-range oscillations (the proposed theta-syllable). Within the metric-based analyses, accent-group timing regularity did not improve intelligibility; in natural sentences, greater departure from accent-level isochrony even correlated with higher intelligibility, possibly because sparser, more salient prominences reduce energetic masking at a fixed SNR. Overall, the results support models of speech processing with interdependent bottom-up and top-down mechanisms, in which natural timing exerts a dominant, predictive influence while isochrony acts as a secondary attractor state reflecting neurophysiological constraints.
Isochrony is not a primary requirement for successful speech processing: natural timing of speech units is paramount for intelligibility in noise. Across English and French, any departure from natural timing degrades recognition, and syllable-level timing dominates accent-level timing in predicting intelligibility. Isochrony contributes secondarily, most notably as a small benefit of syllabic regularity when timing is otherwise irregular, consistent with neurobiological oscillatory constraints centered on the syllable (theta range). The work unifies perspectives on speech rhythm by emphasizing a core syllabic unit with linguistically anchored P-centers and neurophysiologically defined duration. Future research should include multiple speakers per language and more languages, assess longer temporal contexts where explicit isochrony may emerge as a perceptual feature, and integrate neurophysiological measures to quantify top-down versus bottom-up contributions.
- One talker per language; language effects may be confounded with talker idiosyncrasies.
- Accentual phrase boundary annotation in French is less straightforward and may introduce variability despite inter-annotator agreement.
- Fixed SNR (−3 dB) and sentence-length materials limit generalizability to other noise levels and longer discourse contexts.
- Necessary correlation between departure-from-natural and departure-from-isochrony metrics in anisochronous conditions complicates separation of their independent contributions.
- Isochrony effects over extended time scales (beyond sentence level) were not tested; participants may adapt to or explicitly perceive rhythmic regularities over longer exposure.

