Psychology
Spectro-temporal acoustical markers differentiate speech from song across cultures
P. Albouy, S. A. Mehr, et al.
The study addresses whether speech and song differ in ways that are fundamentally and sufficiently captured by their spectro-temporal modulation (STM) content, as opposed to differences arising mainly from culture-specific, learned features. Prior work shows substantial cross-cultural variability in speech and song, and production constraints suggest potential acoustical markers. Theoretical and neurophysiological frameworks propose that complex sounds can be characterized by STM power and that neural spectro-temporal tuning is aligned with ecologically relevant sounds. Prior findings indicate that speech generally has faster temporal modulations, whereas song relies more on spectral structure; and that temporal and spectral cues differentially affect intelligibility and melody perception. However, most musical evidence comes from Western repertoires. The authors therefore test whether STM signatures can distinguish speech from song across diverse societies. They predict: (i) distinct STM distributions for speech vs. song; (ii) overlap within categories across societies; (iii) accurate machine classification using only STM; (iv) classifier features corresponding to univariate STM contrasts; (v) naïve listeners classifying accurately using similar cues; and (vi) STM features outperforming a broad set of other acoustic variables in classification.
Background includes evidence that language and music share and differ in structural properties, with music exhibiting more discrete pitch variation and hierarchical tonal organization. Spectro-temporal modulation offers a principled acoustical framework, supported by neurophysiology showing separable spectral and temporal modulation processing in humans and nonhuman primates, and by spectro-temporal receptive fields that align with communicative sounds across species. Prior research demonstrated: (1) speech tends to contain faster temporal modulations than music; (2) speech perception can rely on temporal cues even with degraded spectral information; (3) melody perception in song depends on spectral cues; and (4) hemispherically asymmetric neural sensitivities to spectral and temporal modulations. Although speech temporal rates have been characterized across many languages, STM characterizations of music have been biased toward Western repertoires, motivating a cross-cultural test of STM distinctions between speech and song.
Corpus and recording: 738 recordings (369 matched pairs of adult-directed speech and song) from 369 individuals across 21 societies on six continents, speaking 18 languages from 12 language families. Participants produced casual speech on a topic of their choice and a non-infant-directed song of their choice. Only matched pairs from the same individual were analyzed, with durations matched within speaker by truncating the longer sample to the duration of the shorter. Recordings were made with Zoom H2n recorders at 44.1 kHz using a 360° dual x–y microphone pattern; the stereo file in which the participant's voice was loudest was analyzed.
STM extraction: For each of the 738 samples, spectro-temporal modulation (STM) patterns (modulograms) were computed using the ModFilter algorithm. The STM domain was derived from the 2D FFT of the autocorrelation matrix of the spectrogram, yielding modulation energy across spectral and temporal axes.
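To make the modulogram computation concrete, below is a minimal Python sketch of the general modulation-power-spectrum approach (log-magnitude spectrogram followed by a 2D FFT; by the Wiener-Khinchin theorem this is equivalent to taking the 2D FFT of the spectrogram's autocorrelation). The study used the MATLAB ModFilter implementation; the function name, window parameters, and synthetic signal here are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only; the study used the MATLAB ModFilter algorithm.
import numpy as np
from scipy.signal import spectrogram

def modulation_power_spectrum(x, fs):
    """Modulation power over (spectral, temporal) modulation axes."""
    # Log-magnitude spectrogram (window/overlap values are arbitrary here)
    freqs, times, sxx = spectrogram(x, fs=fs, nperseg=1024, noverlap=768)
    log_spec = np.log(sxx + 1e-12)
    log_spec -= log_spec.mean()  # remove the DC component

    # 2D FFT of the spectrogram; its squared magnitude equals the 2D FFT of
    # the spectrogram's autocorrelation (Wiener-Khinchin theorem).
    mps = np.abs(np.fft.fftshift(np.fft.fft2(log_spec))) ** 2

    # Modulation axes: spectral in cycles/kHz, temporal in Hz
    d_freq = freqs[1] - freqs[0]          # Hz per frequency bin
    d_time = times[1] - times[0]          # seconds per time frame
    spec_mod = np.fft.fftshift(np.fft.fftfreq(len(freqs), d=d_freq)) * 1000.0
    temp_mod = np.fft.fftshift(np.fft.fftfreq(len(times), d=d_time))
    return spec_mod, temp_mod, mps

# Example on one second of synthetic noise at the corpus sampling rate
fs = 44100
spec_mod, temp_mod, mps = modulation_power_spectrum(np.random.randn(fs), fs)
```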
Univariate analyses: Nonparametric permutation statistics (FieldTrip functions as implemented in Brainstorm), FDR-corrected across the spectral and temporal modulation dimensions, contrasted the STM patterns of song versus speech (p < 0.001).
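As a conceptual analogue of this contrast, the sketch below runs a paired sign-flip permutation test at every STM coordinate and applies Benjamini-Hochberg FDR correction; the study itself used FieldTrip routines within Brainstorm, and the array shapes and permutation count here are assumptions for illustration.

```python
# Conceptual analogue of the paired permutation contrast; not the FieldTrip code.
import numpy as np
from statsmodels.stats.multitest import multipletests

def paired_permutation_fdr(song, speech, n_perm=1000, alpha=0.001, seed=0):
    """song, speech: arrays of shape (n_pairs, n_spectral, n_temporal)."""
    rng = np.random.default_rng(seed)
    diff = song - speech                          # within-pair differences
    obs = diff.mean(axis=0)                       # observed mean difference map
    exceed = np.zeros_like(obs)
    for _ in range(n_perm):
        signs = rng.choice([-1.0, 1.0], size=(diff.shape[0], 1, 1))
        perm = (diff * signs).mean(axis=0)        # sign-flipped surrogate
        exceed += np.abs(perm) >= np.abs(obs)     # two-sided exceedance count
    pvals = (exceed + 1) / (n_perm + 1)
    reject, _, _, _ = multipletests(pvals.ravel(), alpha=alpha, method='fdr_bh')
    return obs, pvals, reject.reshape(obs.shape)
```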
Clustering and cross-society consistency: A heatmap tallied how many societies showed significant effects within the identified STM hotspots. K-means clustering of the spectro-temporal coordinates of per-society statistical peaks (using absolute temporal modulation values) assessed the consistency of these peaks across societies.
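A minimal sketch of this clustering step, assuming two clusters over (spectral modulation, |temporal modulation|) peak coordinates, is shown below; the coordinates are placeholders rather than the study's per-society peaks.

```python
# Placeholder peak coordinates; the study clustered per-society statistical peaks.
import numpy as np
from sklearn.cluster import KMeans

# Rows: (spectral modulation in cyc/kHz, temporal modulation in Hz)
peaks = np.array([[3.5, 0.7], [3.4, -0.6], [0.3, 6.2], [0.4, -6.3]])
features = np.column_stack([peaks[:, 0], np.abs(peaks[:, 1])])  # fold the sign

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.cluster_centers_)  # song-like (spectral) vs speech-like (temporal)
```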
Machine learning: Linear SVM classifiers (MATLAB fitcecoc) with fieldsite-wise (leave-one-society-out) cross-validation assessed cross-cultural generalization: in each fold, the model was trained on 20 societies and tested on the held-out 21st, repeated across all societies. The regularization parameter was λ = 0.01, selected via validation on an independent infant-directed vocalization dataset. Feature weights (z-scored) were extracted to assess feature importance. Alternative cross-validation schemes held out countries (n=18), language families (n=15/16), world subregions (n=14/15), and world regions (n=6) to test robustness.
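The cross-validation logic can be sketched in Python as leave-one-group-out classification over society labels; the study used MATLAB's fitcecoc with λ = 0.01, whose regularization is parameterized differently from scikit-learn's C, so the data shapes, C value, and pipeline below are illustrative assumptions.

```python
# Illustrative leave-one-society-out decoding sketch (not the MATLAB pipeline).
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_samples, n_features = 738, 500                 # e.g., flattened STM maps
X = rng.standard_normal((n_samples, n_features))
y = rng.integers(0, 2, n_samples)                # 0 = speech, 1 = song
societies = rng.integers(0, 21, n_samples)       # fieldsite label per recording

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
scores = cross_val_score(clf, X, y, groups=societies, cv=LeaveOneGroupOut())
print(scores.mean())                             # mean held-out-society accuracy

# Z-scored feature weights from a fit on all data, for importance maps
clf.fit(X, y)
w = clf.named_steps['linearsvc'].coef_.ravel()
w_z = (w - w.mean()) / w.std()
```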
Behavioral experiment: 80 French-speaking adults (France/Canada; 74 completed; mean age 32.4 years; 33 female, 4 non-binary; 10 musically trained; no neurological/psychiatric history) rated each recording as quickly as possible, within a 9 s response window, on a 5-point scale ranging from singing (+1) to speaking (-1), in a sound-attenuated booth over Sennheiser HD 280 Pro headphones. Stimuli were presented via PsychoPy in three pseudo-randomized blocks (246 trials per block), balanced by society and category. No feedback was given.
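A bare-bones PsychoPy trial loop for this kind of rating task might look like the sketch below; the stimulus paths, key mapping, and on-screen text are assumptions, not the authors' experiment script.

```python
# Hypothetical PsychoPy trial loop; not the authors' experiment script.
from psychopy import visual, sound, core, event

win = visual.Window(fullscr=False, color='black')
prompt = visual.TextStim(win, text='1 = speaking ... 5 = singing')

stimuli = ['stim_001.wav', 'stim_002.wav']       # placeholder file names
ratings = []
for path in stimuli:
    snd = sound.Sound(path)
    prompt.draw()
    win.flip()
    snd.play()
    # Up to 9 s to respond with keys 1-5 (recoded to -1 ... +1 offline)
    keys = event.waitKeys(maxWait=9.0, keyList=['1', '2', '3', '4', '5'])
    snd.stop()
    ratings.append(keys[0] if keys else None)

win.close()
core.quit()
```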
Acoustical feature analyses: A broad set of 98 acoustic variables from prior work (e.g., f0, formants, intensity, vowel rate/space, roughness, rhythmic measures) was related to STM energy via Partial Least Squares (PLS) regression at each STM coordinate. Correlations between PLS-fitted and observed STM responses were FDR-corrected. Variable importance in projection (VIP) scores identified influential acoustic features (VIP > 1) at the spectral and temporal peaks. SVM decoding compared four feature sets: (1) STM only; (2) STM + acoustic features; (3) acoustic features only; (4) acoustic features only, excluding VIP variables. Decoding accuracies were compared using Friedman repeated-measures ANOVA with Bonferroni-corrected post hoc tests. Correlations tested whether decoding accuracies predicted naïve listeners' behavioral differences within societies.
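To illustrate the PLS/VIP step, the sketch below fits a PLS regression from a simulated 98-variable acoustic feature matrix to STM energy at one modulation coordinate and computes standard VIP scores; the data, component count, and helper function are assumptions, not the study's analysis code.

```python
# Simulated PLS + VIP illustration; not the study's analysis code.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def vip_scores(pls, X):
    """Standard VIP formula from PLS weights, scores, and y-loadings."""
    t = pls.transform(X)                   # (n, A) latent scores
    w = pls.x_weights_                     # (p, A) predictor weights
    q = pls.y_loadings_                    # (1, A) response loadings
    p, _ = w.shape
    ssy = (q.ravel() ** 2) * np.sum(t ** 2, axis=0)  # y-variance per component
    w_norm = w / np.linalg.norm(w, axis=0, keepdims=True)
    return np.sqrt(p * (w_norm ** 2 @ ssy) / ssy.sum())

rng = np.random.default_rng(0)
X = rng.standard_normal((738, 98))         # 98 acoustic variables per recording
y = 0.8 * X[:, 0] + 0.2 * rng.standard_normal(738)   # STM energy at one point

pls = PLSRegression(n_components=5).fit(X, y)
vip = vip_scores(pls, X)
print(np.where(vip > 1)[0])                # indices of influential variables
```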
- Univariate STM contrasts revealed two song > speech spectral modulation hotspots (p < 0.001, FDR), with peaks at approximately (spectral, temporal) = (3.53 cyc/kHz, 0.66 Hz) and (7.11 cyc/kHz, -0.66 Hz). Speech > song contrasts showed increased temporal modulations with peaks at approximately (6.16 Hz, 0 cyc/kHz), (-6.33 Hz, 0 cyc/kHz), and (4.83 Hz, 5.07 cyc/kHz).
- Cross-society consistency: 20/21 societies showed increased spectral modulations for song at ~3.71 cyc/kHz, 0.66 Hz; 20/21 showed increased temporal modulations for speech at ~6.33 Hz, 0.09 cyc/kHz. K-means revealed two clusters centered at ~3.48 cyc/kHz, 0.13 Hz (song-typical spectral) and ~0.33 cyc/kHz, 6.33 Hz (speech-typical temporal).
- Machine learning: Fieldsite-wise cross-validated SVM using STM features significantly classified song vs speech in all 21 societies (Wilcoxon W(20)=231, p<0.001; effect size rank biserial = 1.00). Mean accuracy = 84.5% ± 10.4 SD; sensitivity = 83.8% ± 15.9; specificity = 85.2% ± 13.8. Decoding accuracy correlated with average sample duration across societies (r(20)=0.63, p<0.001). Feature-weight maps aligned with univariate STM differences.
- Robustness: Comparable cross-validated performance and feature importance patterns when holding out countries (n=18), language families (n=15/16), world subregions (n=14/15), or world regions (n=6), all p<0.001 with large effects.
- Behavioral: Naïve listeners accurately categorized recordings (Song: W(73)=2775, p<0.001; Speech: W(73)=0, p<0.001; both with rank biserial effect size = 1.00). Accuracy was high within each of the 21 societies. STM-behavior linkage: Differences in STM (Song–Speech) predicted behavioral rating differences, with positive relation for increased spectral modulation for song (e.g., at ~-0.33 Hz, 3.35 cyc/kHz) and negative relation for decreased temporal modulation for song (e.g., at ~4.49 Hz, 0.09 cyc/kHz), FDR p<0.05. Classifier decoding accuracy correlated with behavioral differences across societies (r(20)=0.51, p=0.001).
- STM vs other acoustical variables: PLS showed acoustic features predicted STM at two hotspots overlapping with main findings (spectral peak ~3.49 cyc/kHz, 0.49 Hz; temporal peak ~0.13 cyc/kHz, 7.13 Hz). VIP>1 contributors included Pitch, Amplitude Space, Temporal Modulations (spectral hotspot), and Amplitude Space, Second Formant (temporal hotspot). SVM decoding comparison: All above chance (all W(20)>231, p<0.001), but STM-only outperformed acoustic-only (p<0.001) and acoustic without VIP variables (p<0.001). STM+acoustic outperformed acoustic-only (p<0.001) and acoustic without VIP (p<0.001), but did not differ from STM-only (p=0.91), indicating STM sufficiency. Decoding accuracy from acoustic-only features did not correlate with behavioral differences (r(20)=0.14, p=0.54).
Findings demonstrate cross-culturally consistent spectro-temporal distinctions between speech and song: songs emphasize higher spectral and lower temporal modulation rates, whereas speech emphasizes higher temporal modulation with relatively lower spectral modulation. These STM signatures enabled robust cross-cultural machine classification and aligned closely with the cues human listeners used, despite their unfamiliarity with the source languages and musical systems. The results support the view that spectro-temporal modulation content is a fundamental acoustical basis differentiating speech and song, likely reflecting differences in vocal motor control demands (e.g., faster syllabic rates in speech versus longer, more stable pitch production in singing) and aligning with neurophysiological evidence of partially distinct cortical sensitivities to spectral and temporal modulations and potential hemispheric asymmetries. While culture shapes variability, the shared STM markers appear to be widespread and sufficient for categorization. The work advances a parsimonious account of speech–song categorization grounded in STM, explaining both machine and human performance better than a broad suite of other acoustic variables. It also provides a framework for examining intermediate forms (e.g., rap, infant-directed speech) and perceptual phenomena like the speech-to-song illusion within a unified spectro-temporal space.
Across 21 societies and 369 individuals, speech and song are reliably distinguished by their spectro-temporal modulation profiles. STM features alone achieve high cross-cultural classification accuracy and best predict naïve listener judgments, outperforming a wide set of other acoustic descriptors. These results suggest a fundamental, biologically grounded distinction between speech and song based on how human vocalizations occupy the spectro-temporal continuum. Future work could probe cultural variability in STM deployment across musical traditions, investigate intermediate or blended vocal forms, and further test neural mechanisms linking hemispheric asymmetries and STM tuning to speech and song processing.
- Sample duration influenced STM estimation quality and decoding accuracy, with shorter durations reducing classifier performance (positive correlation between accuracy and average sample duration across societies).
- Although societies were diverse, variability remained across sites, particularly in songs’ exploitation of a wide range of STM values; not all cultures may carve the spectro-temporal space identically.
- Behavioral participants were primarily French-speaking listeners unfamiliar with many source cultures, which is appropriate to test generalization but may limit insights into culture-specific perceptual strategies.
- Convenience sampling of field sites and high proportion of female participants (86%) may affect generalizability of production characteristics.
- Recordings used a 360° microphone pattern that could capture environmental context; however, identical processing was applied across conditions.
- Only adult-directed speech and song were analyzed; other vocalization contexts (e.g., infant-directed, ritualized speech/song) were not primary targets in this analysis.