Auditory-motor synchronization and perception suggest partially distinct time scales in speech and music

Psychology


A. V. Barchet, M. J. Henry, et al.

Discover how different motor actions, like whispering and finger-tapping, shape our ability to perceive and synchronize with speech and music! This research by Alice Vivien Barchet, Molly J. Henry, Claire Pelofi, and Johanna M. Rimmele examines rhythmic timing mechanisms across time scales, offering insight into the cognitive processes involved.

~3 min • Beginner • English
Introduction
The study addresses whether speech and music rely on shared versus domain-specific rhythmic timing mechanisms, particularly across different time scales, and whether these mechanisms depend on the motor effectors typically engaged in each domain. Speech and music are both rhythmic and hierarchical but show distinct dominant temporal modulations: music often emphasizes beat-related rhythms around 1–2 Hz, while speech emphasizes syllabic rhythms around 4–8 Hz. Prior work links temporal processing to cortical oscillations (delta/theta) and suggests motor system contributions to auditory prediction. The authors hypothesized that motor effectors associated with speech (mouth/vocal cords) would support optimal synchronization at faster rates (~4.5 Hz), while those associated with music (fingers) would be optimal at slower rates (~2 Hz). They also considered an alternative hypothesis of a general motor timing advantage at slower rates across domains. The study investigates how auditory-motor synchronization and purely auditory perception vary with rate, stimulus type (syllables vs piano tones), and motor effector (whisper vs finger tap), and whether synchronization predicts perception.
Literature Review
The paper reviews evidence that both speech and music possess quasi-rhythmic structure enabling temporal prediction and segmentation, supported by cortical entrainment. Speech processing is linked to auditory cortex theta rhythms (~4–8 Hz), while motor system delta rhythms (~1–2 Hz) contribute to temporal prediction and beat processing. Produced music typically shows dominant amplitude modulations around 1–2 Hz, aligning with preferred beat perception, whereas produced speech shows dominant modulations around 4–8 Hz (the syllabic rate). Behavioral performance mirrors these profiles: beat detection is best near ~1.4–2 Hz, while speech comprehension peaks around ~4–5 Hz and declines at higher rates. Spontaneous production rates also differ by effector: finger-tapping and locomotion cluster near ~2 Hz, whereas syllable production and whispering tend toward ~4–5 Hz and above. The literature on the spontaneous speech-to-speech synchronization (SSS) test shows that auditory-motor coupling at ~4.5 Hz predicts speech-related perceptual abilities. Evidence suggests overlapping but frequency-specific neural selectivity for speech and music, with hierarchical temporal structure beyond the dominant rates in both domains (e.g., phrasal rhythms at ~1–2 Hz in speech; multi-level rhythmic structure in music).
Methodology
Preregistration and ethics: Preregistered on AsPredicted (https://aspredicted.org/ci7ms.pdf; 9 March 2022). Ethical approval: Ethics Council of the Max Planck Society (Nr. 2017_12).
Participants: 66 recruited. Synchronization task final N=62 (36 women, 23 men, 2 non-binary, 1 undisclosed; age 18–40, M=26.28, SD=4.16); exclusions: 2 for speaking loudly instead of whispering and 2 for inconsistency across trials (outside the 99% CI in a regression-based consistency check). Perception task final N=57 (33 women, 21 men, 2 non-binary, 1 undisclosed; age 19–40, M=26.54, SD=4.12).
Stimuli: Twelve distinct CV syllables (MBROLA, male German diphone database de2, 16 kHz, resampled to 44.1 kHz) and twelve piano tones C3–B3 (MIDI 48–59; generated with MIDIUtil under Python 3.8.8 and rendered with FluidSynth 2.2.4). Sequences were formed by random ordering without consecutive repeats.
Tasks and rates: Two rate ranges were used: slow ~2 Hz (1.92–2.08 Hz) and fast ~4.5 Hz (4.3–4.7 Hz).
Synchronization task (adapted accelerated SSS protocol): Participants synchronized by whispering (repeating "TEH") or finger-tapping (dominant hand) to accelerating sequences of syllables or piano tones. The order of motor effector (whisper vs tap) and stimulus type (syllable vs tone) was randomized. Hardware: ER-3C in-ear headphones; Shure MX418 microphone (~3 cm from the mouth); RME Fireface UC audio interface; Psychtoolbox full-duplex mode for simultaneous playback and recording; a loopback channel captured the stimuli alongside the motor output. Volume was adjusted so that participants could not hear their own tapping or whispering; SPL was capped at 90 dB. Priming: two 10 s primes at the target rate (4.5 Hz for fast; 2 Hz for slow), followed by 10 s of paced production without audio. Synchronization sequences: fast, 4.3→4.7 Hz in steps of 0.1 Hz every 48 events; slow, 1.92→2.08 Hz in steps of 0.04 Hz every 48 events; 240 events each (~50 s fast; ~120 s slow). Two runs per condition (two primes plus one synchronization trial).
Auditory perception task: Temporal deviance detection on isochronous sequences of 10 items (syllables or tones), 80 trials per stimulus type. On 50% of trials the final item was shifted earlier, by 28–34% of the IOI for syllables and 12–18% of the IOI for tones (piloted for comparable difficulty). Rates varied within blocks: fast 4.3–4.7 Hz and slow 1.92–2.08 Hz in 2% steps. SPL ~70 dB. Training with feedback was provided. Design: blocks per stimulus type; randomized block and trial order; rates intermixed.
Data processing: Envelopes were computed with the NSL Auditory Model (cochlear filtering, 180–7246 Hz) and resampled to 100 Hz. Bandpass filtering for PLV extraction: fast 3.5–5.5 Hz; slow 1.56–2.44 Hz. Phases were obtained via the Hilbert transform. PLV was computed between the acoustic and motor envelopes in windows of 5 s with 2 s overlap (fast) and 11 s with 4.5 s overlap (slow). Normalization used a permutation baseline obtained by shuffling acoustic segments (5 s fast; 11 s slow) and subtracting the permutation PLV from the observed PLV; PLVs were averaged across the two trials per condition. (An illustrative sketch of this pipeline follows the statistical analysis description below.)
Statistical analysis (synchronization LMM): Dependent variable: normalized PLV. Fixed effects considered: rate (fast/slow), motor effector (tapping/whispering), stimulus (tones/syllables), motor envelope features (peak amplitude below 10 Hz; peak width, FWHM), and analogous acoustic envelope features. Random structure: participant intercept and a random slope for rate. Model selection via forward stepwise likelihood-ratio tests (α=0.05). Post-hoc pairwise comparisons with emmeans (Kenward-Roger df, Tukey-adjusted). A Bayes factor (JASP; Cauchy prior r=1/√2) quantified the null result at fast rates. R² computed per Nakagawa & Schielzeth.
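To make the envelope-based synchronization measure concrete, here is a minimal Python sketch of the windowed, permutation-normalized PLV computation described under Data processing. It assumes envelopes have already been extracted and resampled to 100 Hz; the study itself used MATLAB and the NSL Auditory Model, so all function names, parameter defaults, and the synthetic demo below are illustrative assumptions rather than the authors' code.

```python
# Illustrative sketch (not the authors' code): windowed, permutation-normalized PLV
# between an acoustic envelope and a motor envelope, both sampled at 100 Hz.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

FS = 100  # envelope sampling rate in Hz (as reported above)

def band_phase(envelope, lo, hi, fs=FS):
    """Bandpass an envelope (e.g., 3.5-5.5 Hz for the fast rate) and return its phase."""
    sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return np.angle(hilbert(sosfiltfilt(sos, envelope)))

def plv(phase_a, phase_b):
    """Phase-locking value between two phase time series."""
    return np.abs(np.mean(np.exp(1j * (phase_a - phase_b))))

def windowed_plv(acoustic, motor, lo=3.5, hi=5.5, win_s=5.0, overlap_s=2.0, fs=FS):
    """Mean PLV over overlapping windows (fast condition: 5 s windows, 2 s overlap)."""
    pa, pm = band_phase(acoustic, lo, hi, fs), band_phase(motor, lo, hi, fs)
    win, step = int(win_s * fs), int((win_s - overlap_s) * fs)
    starts = range(0, len(pa) - win + 1, step)
    return np.mean([plv(pa[s:s + win], pm[s:s + win]) for s in starts])

def normalized_plv(acoustic, motor, n_perm=100, seg_s=5.0, fs=FS, **kwargs):
    """Observed PLV minus a permutation baseline built by shuffling acoustic segments."""
    observed = windowed_plv(acoustic, motor, fs=fs, **kwargs)
    seg, n_seg = int(seg_s * fs), len(acoustic) // int(seg_s * fs)
    rng = np.random.default_rng(0)
    baseline = []
    for _ in range(n_perm):
        order = rng.permutation(n_seg)
        shuffled = np.concatenate([acoustic[i * seg:(i + 1) * seg] for i in order])
        baseline.append(windowed_plv(shuffled, motor[:len(shuffled)], fs=fs, **kwargs))
    return observed - np.mean(baseline)

# Synthetic demo: two noisy envelopes sharing a quasi-periodic ~4.5 Hz modulation.
t = np.arange(0, 50, 1 / FS)
drift = 0.5 * np.sin(2 * np.pi * 0.07 * t)  # slow phase drift makes the rhythm quasi-periodic
acoustic = 0.5 * (1 + np.sin(2 * np.pi * 4.5 * t + drift)) + 0.1 * np.random.randn(t.size)
motor = 0.5 * (1 + np.sin(2 * np.pi * 4.5 * t + drift + 0.3)) + 0.1 * np.random.randn(t.size)
print(f"normalized PLV ~ {normalized_plv(acoustic, motor):.2f}")
```

The permutation subtraction is meant to remove chance-level phase locking that narrow-band, quasi-rhythmic signals produce even in the absence of true auditory-motor coupling.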
PCA: A principal component analysis on the PLVs of the eight synchronization conditions (varimax rotation) derived components of synchronization ability; three components were retained (eigenvalues ≥1), explaining 70% of the variance; multivariate normality was supported (Henze-Zirkler p=0.35).
Perception GLMM: A logistic mixed model predicted trial accuracy from rate, stimulus, their interaction, acoustic envelope width, and the PCA components from the synchronization task (Components 1 and 3 retained; Component 2 not included). Random intercepts for participant and stimulus; random slopes for rate, stimulus, and their interaction on participant. Model selection followed a stepwise procedure; post-hoc comparisons used emmeans; model diagnostics used DHARMa. (An illustrative formula sketch for both mixed models follows the software list below.)
Software: MATLAB R2020b/R2021a; Psychtoolbox 3.0.12; R 4.0.5 with lme4, lmerTest, emmeans, DHARMa, effectsize, performance, psych, car, and MVN; plotting with ggplot2, sjPlot, and introdataviz.
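For readers who want the model structure at a glance, the sketch below spells out lme4-style formulas consistent with the descriptions above and adds a rough Python analogue of the synchronization LMM using statsmodels. Column names (plv, rate, effector, stimulus, motor_peak_width, participant, and so on) are assumptions for illustration; the actual analyses were run in R with lme4/lmerTest, and the exact final model terms may differ after the stepwise selection.

```python
# Illustrative sketch (not the authors' code): mixed-model structure described above,
# written as lme4-style formulas plus a rough statsmodels analogue for the LMM.
# All column names are assumed for illustration.
import pandas as pd
import statsmodels.formula.api as smf

# lme4-style formulas consistent with the reported models:
#   Synchronization LMM (fit with lmer in R):
#     plv ~ rate * effector + stimulus + motor_peak_width + (1 + rate | participant)
#   Perception GLMM (logistic, fit with glmer in R; crossed random effects):
#     correct ~ rate * stimulus + acoustic_env_width + comp_fast + comp_slow_tap
#               + (1 + rate * stimulus | participant) + (1 | stimulus_item)

def fit_sync_lmm(df: pd.DataFrame):
    """Approximate Python analogue of the synchronization LMM:
    participant random intercept plus a random slope for rate."""
    model = smf.mixedlm(
        "plv ~ rate * effector + stimulus + motor_peak_width",
        data=df,
        groups=df["participant"],
        re_formula="~rate",
    )
    return model.fit()

# Usage, assuming one row per participant x rate x effector x stimulus condition:
# result = fit_sync_lmm(df)
# print(result.summary())
```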
Key Findings
Synchronization task (LMM; N=62): Significant main effects of rate and stimulus, and a rate × motor effector interaction.
• Rate: slow > fast (Estimate = 0.22, 95% CI [0.19, 0.25], p<0.001; partial η² = 0.74).
• Stimulus: tones > syllables (Estimate = 0.06, 95% CI [0.04, 0.07], p<0.001; partial η² = 0.14).
• Motor effector: main effect not significant (Estimate = 0.01, p=0.433), but the rate × motor effector interaction was significant (Estimate = -0.09, 95% CI [-0.12, -0.06], p<0.001; partial η² = 0.10).
• Motor envelope peak width was positively related to PLV (Estimate = 0.03, 95% CI [0.02, 0.04], p<0.001).
Post-hoc comparisons:
• Slow rate: tapping > whispering (Estimate = 0.09, 95% CI [0.06, 0.11], p<0.001, Cohen's d = 1.05).
• Fast rate: tapping ≈ whispering (Estimate = -0.01, p=0.432, d = -0.10); a Bayes factor supported the null at fast rates (BF01 = 9.41).
• Synchronization was overall better at the slow rate (~2 Hz) across conditions (fast vs slow: tapping, Estimate = -0.22, p<0.001, d = -2.68; whispering, Estimate = -0.12, p<0.001, d = -1.53).
• Model fit: marginal R² = 46.5%, conditional R² = 71.6%.
PCA of synchronization (8 conditions): Three components explaining 70% of the variance.
• Component 1 (fast component): loads on all fast conditions, i.e., a general factor across motor effectors and stimuli.
• Component 2 (slow whispering component): loads on the slow whispering conditions.
• Component 3 (slow tapping component): loads on the slow tapping conditions.
• This pattern indicates a shared synchronization mechanism at fast rates but effector-specific mechanisms at slow rates.
Perception task (GLMM; N=57): Significant rate × stimulus interaction (Estimate = 1.29, 95% CI [0.86, 1.73], OR = 3.64, p<0.001) and a main effect of stimulus (tones less accurate than syllables at the fast reference rate; Estimate = -0.65, 95% CI [-0.90, -0.40], OR = 0.52, p<0.001).
Post-hoc comparisons:
• Fast (~4.5 Hz): syllables > tones (Estimate = 0.65, p<0.001).
• Slow (~2 Hz): tones > syllables (Estimate = -0.65, p<0.001).
• Acoustic envelope width was negatively related to accuracy (Estimate = -0.20, 95% CI [-0.28, -0.13], OR = 0.82, p<0.001).
• Synchronization components predicted perception: the fast component was a positive predictor (Estimate = 0.36, 95% CI [0.20, 0.51], OR = 1.43, p<0.001), as was the slow tapping component (Estimate = 0.22, 95% CI [0.07, 0.37], OR = 1.24, p=0.005); the slow whispering component was not retained.
• Model R²: marginal 7.7%, conditional 24.2%.
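A small note on reading the perception-model statistics: the reported odds ratios are simply the exponentiated logistic regression estimates, as the short illustration below shows (small discrepancies relative to the reported ORs come from rounding of the estimates).

```python
# Odds ratios for the perception GLMM are exponentiated logistic estimates;
# values agree with the reported ORs up to rounding of the estimates.
import numpy as np

estimates = {
    "rate x stimulus interaction": 1.29,     # reported OR = 3.64
    "stimulus (tones vs syllables)": -0.65,  # reported OR = 0.52
    "acoustic envelope width": -0.20,        # reported OR = 0.82
    "fast synchronization component": 0.36,  # reported OR = 1.43
    "slow tapping component": 0.22,          # reported OR = 1.24
}
for name, beta in estimates.items():
    print(f"{name}: exp({beta:+.2f}) = {np.exp(beta):.2f}")
```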
Discussion
The results address the central question of whether speech and music rely on distinct rhythmic timing mechanisms tied to their characteristic time scales and motor effectors. Perceptually, performance dissociated by domain and rate: syllable sequences were better processed at faster rates (~4.5 Hz), whereas piano tone sequences were better processed at slower rates (~2 Hz), aligning with the known dominant modulation frequencies of speech and music. In synchronization, performance was generally optimal at slow rates, consistent with a motor system predisposition toward ~2 Hz timing. Crucially, an effector-specific advantage emerged at slow rates: finger-tapping outperformed whispering, suggesting that motor effectors associated with music engage rhythmic timing mechanisms optimized for slow time scales. At fast rates, synchronization was similar across effectors and captured by a single PCA component, indicating a shared, effector-general mechanism. The predictive link from synchronization to perception—especially from the fast synchronization component and the slow tapping component—supports the idea that motor timing capacities facilitate auditory temporal judgments. Proposed neural interpretations include delta rhythms in supplementary motor areas supporting slow timing and theta-range auditory/motor rhythms supporting syllabic processing, with partially independent auditory–motor coupling at ~2 Hz and ~4.5 Hz. Control analyses indicated that neither order effects nor simple acoustic features account for the main findings, and the effects persisted when musical sophistication was taken into account.
Conclusion
The study demonstrates that speech and music perception are optimized at different time scales that reflect their characteristic rhythms: syllable sequences at ~4.5 Hz and piano tone sequences at ~2 Hz. Auditory–motor synchronization is overall stronger at slow rates, with effector-specific advantages at ~2 Hz (tapping > whispering), while synchronization converges across effectors at ~4.5 Hz. PCA results suggest shared synchronization mechanisms at fast rates but effector-specific mechanisms at slow rates. Synchronization performance—particularly a general fast-rate component and a slow tapping component—predicts perceptual accuracy. These findings support partially distinct rhythmic timing mechanisms for speech and music, likely reflecting differential recruitment of cortical motor circuitry across time scales. Future work should test neural dissociations of auditory–motor coupling at ~2 vs ~4.5 Hz with electrophysiology, examine singing (shared vocal effectors but music-like acoustics), and assess more naturalistic stimuli and contexts to generalize these timing principles.
Limitations
- Stimulus and effector scope: simplified, matched stimuli (CV syllables and piano tones) and effectors (whispering and finger-tapping) rather than natural speech/music production and performance.
- Whispering instead of natural speaking was used to reduce auditory feedback; although similar articulators are engaged, this may not fully capture speech production dynamics.
- Limited power to detect small effects in some synchronization contrasts at fast rates (post-hoc simulations indicated insufficient sensitivity for small effect sizes).
- The results do not identify the minimal acoustic features driving domain-specific processing; the focus was on timing rather than spectrotemporal distinctions.
- Generalization to richer contexts (e.g., phrase structure, melody, expressive timing) requires further research.