
Physics

Spontaneous emergence of rudimentary music detectors in deep neural networks

G. Kim, D. Kim, et al.

This exciting study by Gwangsu Kim, Dong-Kyum Kim, and Hawoong Jeong unveils how deep neural networks can spontaneously develop music-selective units when trained on natural sounds, suggesting a fascinating link between artificial systems and human auditory processing!

Introduction
The study investigates how universal aspects of music perception might arise without explicit musical training and what functions music-selective neural circuits serve. In humans, distinct music-selective neural populations exist in non-primary auditory cortex and are observed even in individuals without formal musical training and in infants. Some auditory capabilities, such as harmonicity-based segregation, appear across cultures, raising the question of whether exposure to music is necessary for early music-selectivity. Building on prior work showing that task-optimized deep neural networks (DNNs) capture principles of sensory processing, the authors hypothesize that music-selectivity could emerge as a by-product of optimizing an auditory system to process natural sounds, and that such selectivity could provide a functional basis for generalization in sound recognition.
Literature Review
Prior human neuroimaging identified music-selective populations distinct from speech and other sounds and found such selectivity present without formal training. Infants show early sensitivity to musical features. Cross-cultural evidence (e.g., native Amazonians) suggests basic machinery like harmonicity-based segregation is not dependent on Western musical exposure. Modeling studies show DNNs trained on natural tasks can develop brain-like representations and even higher-level cognitive functions (e.g., Gestalt, numerosity). A task-optimized auditory DNN replicated human cortical responses to music and speech, implying that optimization for natural stimuli can yield brain-like auditory representations. These lines of work motivate testing whether music-selectivity can arise from learning to process non-musical natural sounds and whether such selectivity supports generalization.
Methodology
- Datasets: AudioSet balanced subset of 10 s YouTube audio clips labeled across 527 categories (training set: 17,902 clips; test set: 17,585). Music-related categories were defined as the music hierarchy plus selected singing-related human-voice labels. To create a no-music training condition, all music-labeled clips were removed and residual mislabeled music clips (about 4.5%, N=507) were manually excluded.
- Network: A convolutional neural network for audio. The input waveform (22,050 Hz) is converted to a log-Mel spectrogram (64 mel bands, 0–8 kHz, 25 ms window, 12.5 ms hop). Four convolutional layers (batch normalization, max pooling, ReLU, dropout 0.2) are followed by global average pooling to produce a 256-d feature vector, then two fully connected layers with sigmoid outputs for multi-label detection.
- Training: Binary cross-entropy loss per category; AdamW optimizer (weight decay 0.01); One Cycle learning-rate schedule from 4e-5 up to 1e-3 and down to 4e-9 over 100 epochs (200 for randomized labels); batch size 32. Five runs with different seeds; the best epoch was selected by validation average precision.
- Feature analysis: Responses at the global average pooling layer (256-d) were used as features; t-SNE was used to visualize clustering of music vs non-music.
- Music-selective units: The Music-Selectivity Index (MSI) of a unit is defined as MSI = (μ_music − μ_non-music) / (σ_music + σ_non-music), where μ and σ are the mean and standard deviation of that unit's responses to music and non-music clips. Units in the top 12.5% of MSI were deemed music-selective (MS units); see the sketch after this list. Robustness was tested against input amplitude normalization and attenuation of music RMS.
- Linear baselines: Linear features were extracted from the log-Mel spectrogram using PCA (top 256 PCs; explained variance 0.965) and a spectro-temporal 2D Gabor filter bank (GBFB; 263 filters), then compared with DNN features in clustering (t-SNE) and linear separability (linear classifiers).
- External validation: Responses of MS units were tested on the 165 natural sounds dataset (11 categories) used in prior human fMRI voxel decomposition that revealed a music-selective component.
- Sound quilting: Quilts were generated by segmenting sounds (50–1600 ms, octave steps) and reordering the segments with smooth boundaries, preserving short-term but disrupting long-term structure. Responses of MS units (and other units) to quilts of music and non-music were compared with responses to the original sounds.
- Generalization vs memorization: Networks were trained on AudioSet with randomized labels to enforce memorization (high training AP, chance test AP); MS-unit responses to quilts and clustering were then assessed.
- Ablation: Groups of units were silenced (MSI top 12.5%, middle 43.75–56.25%, bottom 12.5%, and L1-norm top 12.5%) and the drop in average precision was measured (excluding music-related categories).
- Speech removal: Another network was trained with both music and speech labels removed to assess speech's role in the emergence of music-selectivity and temporal structure encoding.
- Statistics: Wilcoxon rank-sum/signed-rank tests with common-language effect sizes; n=5 independent networks for most comparisons.
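The MSI computation and MS-unit selection described above can be sketched in a few lines of NumPy. This is a minimal illustration of the index as defined in the summary, not the authors' code; the array shapes, variable names, the small epsilon, and the toy data are assumptions made for the example.

```python
import numpy as np

def music_selectivity_index(responses, is_music):
    """Per-unit MSI = (mean_music - mean_nonmusic) / (std_music + std_nonmusic).

    responses : array of shape (n_units, n_clips), pooled-layer activations
    is_music  : boolean array of shape (n_clips,), True for music clips
    """
    music = responses[:, is_music]
    non_music = responses[:, ~is_music]
    num = music.mean(axis=1) - non_music.mean(axis=1)
    den = music.std(axis=1) + non_music.std(axis=1)
    return num / (den + 1e-12)  # epsilon (an assumption) guards against silent units

def music_selective_units(responses, is_music, top_frac=0.125):
    """Indices of the top 12.5% of units by MSI (the MS-unit criterion)."""
    msi = music_selectivity_index(responses, is_music)
    n_select = int(np.ceil(top_frac * responses.shape[0]))
    return np.argsort(msi)[::-1][:n_select]

# Toy usage: 256 units, 1,000 clips, ~25% labelled as music.
rng = np.random.default_rng(0)
resp = rng.random((256, 1000))
labels = rng.random(1000) < 0.25
ms_units = music_selective_units(resp, labels)
print(len(ms_units))  # -> 32, i.e. 12.5% of 256 units
```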
Key Findings
- Distinct music representation without music training: Even when trained with all music-related categories removed, the network formed a distinct cluster for music in t-SNE space while maintaining reasonable performance on other audio events.
- Linear features insufficient: PCA and GBFB features did not produce clear music/non-music separation in t-SNE. Linear classifiers using DNN features achieved much higher music/non-music discrimination (network mAP 0.887 ± 0.005; chance 0.266) than PCA (mAP 0.437) or GBFB (mAP 0.828). Adding PCA/GBFB to network features did not improve mAP (Net+PCA 0.887 ± 0.004; Net+GBFB 0.894 ± 0.004). A sketch of this kind of linear-probe evaluation follows the list.
- Stronger average response to music: In networks trained without music, units' average response to music exceeded their response to non-music (one-tailed Wilcoxon rank-sum, U=30,340,954, p≈4.89×10^-276, ES=0.689; music n=3,999, non-music n=11,010).
- Music-selective units (MSI top 12.5%): Responded 2.07× more strongly to music than to non-music on held-out training data. The effect was robust to amplitude normalization and persisted even when music RMS was reduced to 1:8 relative to non-music (power ratio 1:64): responses to music remained stronger (training: 1.70×; test: 1.68×). Using only MS units enabled linear music/non-music classification with AP 0.832 ± 0.007 across 25 genres.
- Brain-like selectivity on an external dataset: MS units trained without music responded more to music in the 165 natural sounds dataset, mirroring the fMRI music component. Music/non-music response ratios: trained without music 1.88 ± 0.13; trained with music 1.61 ± 0.16; randomly initialized 1.03 ± 0.0044; GBFB 0.95. Networks trained without and with music did not differ significantly in the AudioSet music/non-music response ratio; both exceeded the random and GBFB baselines.
- Temporal structure encoding (sound quilts): MS-unit responses increased with segment size for music quilts, indicating sensitivity to long-range temporal structure; the response to the original sound was higher than to 50 ms quilts (original 0.768 ± 0.030; 50 ms 0.599 ± 0.047; Wilcoxon signed-rank, U=15, p=0.031). For non-music quilts, the correlation with segment size was weaker, and non-MS units showed largely constant responses across segment sizes.
- Generalization is critical: Networks trained to memorize randomized labels showed some clustering, but their top-MSI units did not exhibit sensitivity to segment size in music quilts (significantly different from generalization-trained networks at small segments, e.g., 50 ms and 100 ms, p=0.006 for both).
- Functional importance (ablation): Silencing MS units caused a significantly larger drop in average precision than silencing bottom/middle-MSI units or high-L1 units (MSI top 12.5% vs baseline: U=1,560,865, p≈3.03×10^-211, ES=0.917; vs L1-norm top 12.5%: U=1,223,566, p≈9.74×10^-60, ES=0.719).
- Role of speech: Removing speech during training still yielded music clustering and selectivity (1.77× music/non-music response) but reduced sensitivity to long-timescale temporal structure in quilts, suggesting that speech supports learning the longer-range temporal features shared with music.
- Speech selectivity weaker: Putative speech-selective units (SSI top 12.5%) showed only a 1.31× stronger response to speech than to non-speech and lacked temporal-structure sensitivity in speech quilts. Other categories (vehicle, animal, water) exhibited weaker effects than music.
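As a rough illustration of how linear-separability scores of the kind reported above can be obtained, the following sketch fits a linear probe on pooled-layer features and scores it with average precision. The summary does not specify the exact classifier or data split, so the logistic-regression probe, the 70/30 split, and the random stand-in features are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

def linear_separability_ap(features, is_music, seed=0):
    """Fit a linear probe on unit features and report music/non-music test-set AP."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, is_music, test_size=0.3, stratify=is_music, random_state=seed)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = clf.decision_function(X_te)
    return average_precision_score(y_te, scores)

# Toy usage with random stand-ins for the 256-d pooled-layer features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(2000, 256))
labels = rng.random(2000) < 0.25
print(round(linear_separability_ap(feats, labels), 3))  # near chance for random features
```

Restricting `features` to the MS-unit columns would mimic the MS-unit-only classification analysis reported in the findings.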
Discussion
The findings show that music-selective units can emerge spontaneously in a DNN optimized for natural sound detection, even without exposure to music. These units encode multi-timescale temporal structure and reproduce population-level characteristics of human music-selective cortex. The work supports the hypothesis that music-selectivity may arise as a by-product of adapting to natural sound statistics, with the selectivity providing a functional basis for generalization rather than class-specific overfitting. The sensitivity to sound quilts aligns with human auditory cortical responses, and the ablation results indicate that features constituting music are also critical for broader natural sound recognition. The role of speech in enhancing long-timescale structure suggests shared temporal processing resources across music and speech. Collectively, the results argue that ecological adaptation to natural sounds could provide a blueprint for music perception and help explain cross-species and cross-cultural commonalities in basic musical processing.
Conclusion
The study demonstrates that DNNs trained on natural sounds develop rudimentary music detectors without explicit music training. These music-selective units encode temporal structure over multiple timescales, contribute critically to generalization in natural sound detection, and replicate key aspects of human music-selective cortical responses. The results suggest that evolutionary or ecological pressures to process natural sounds can seed an innate basis for music perception. Future work should incorporate experience-dependent development to model maturation and refinement of music selectivity, explore how different auditory environments shape musical styles, and examine more brain-like architectures (e.g., recurrent/top-down circuits) to bridge algorithmic differences with biological learning mechanisms.
Limitations
- Architectural and learning differences: The CNN is feedforward and trained via backpropagation, lacking intracortical and top-down connectivity and biologically plausible learning rules.
- Dataset mismatch: Training used modern audio data that may not reflect evolutionary auditory environments (e.g., mechanical/vehicle sounds), potentially biasing emergent features.
- Scope: The model addresses the initial emergence of music-selectivity, not the full experience-dependent development and refinement seen in humans.
- Species variability: Reports of absent functional organization for harmonic tones/music in some species indicate that higher-order demands and experience may be necessary for mature selectivity.
- External dataset constraints: The 165 natural sounds dataset contains short excerpts with limited music duration, which may bias comparisons.