
Automatic Detection of Reactions to Music via Earable Sensing

C. Min, E. Lee, et al.

Discover GrooveMeter, an innovative system that uses earable sensing to automatically detect vocal and motion reactions to music. This research, conducted by Chulhong Min, Euihyeok Lee, Jaeseung Lee, Jin Yu, and Seungwoo Kang, demonstrates impressive accuracy and offers potential applications ranging from automatic music ratings to music therapy tools.

Introduction
Listening to music is a pervasive activity, with individuals spending significant time each day engaging with music. People often exhibit natural reactions while listening, such as head nodding, foot tapping, humming, or singing along. These reactions reflect engagement with the music and provide valuable insights into listener preferences and experiences. This paper addresses the lack of robust, real-world methods for detecting such reactions, focusing on a mobile system capable of tracking them in everyday listening scenarios. Existing research in music psychology often relies on self-reports or bulky equipment in controlled environments (PET, fMRI, motion capture, ECG, GSR), limiting its applicability to real-life situations. GrooveMeter aims to bridge this gap by using commercially available smart earbuds as unobtrusive sensing devices, employing their IMU and microphone to detect readily observable bodily reactions such as singing along, humming, whistling, and head movements. The system faces several challenges: differentiating reactions from irrelevant events (e.g., talking, drinking), managing background noise and motion artifacts in daily environments, and minimizing computational and energy costs for continuous operation on mobile devices. The study develops techniques to overcome these challenges and enable music engagement-aware applications such as personalized music ratings, recommendations, and enhanced fan-musician interaction in online concerts.
Literature Review
Existing research explores reaction sensing in various contexts, including multimedia consumption (movies, live performances), humor appraisal, and attention-level monitoring in online learning. However, these studies use different sensor modalities and techniques, reflecting the varying characteristics of the reactions they target. The paper also reviews the literature on human sensing with earables, including activity detection, head motion tracking, facial expression analysis, and respiration rate monitoring. While earables have been applied to many purposes, the automatic detection of music-listening reactions using earable sensing is a novel contribution of this work. Previous work on understanding music-listening behavior and contexts focuses on what, when, why, and where people listen to music, neglecting the "how": the listeners' reactions. GrooveMeter directly addresses this research gap by developing novel sensing and processing methods specifically for detecting music-related reactions in real-life settings.
Methodology
GrooveMeter's architecture comprises a server-side music database and a client-side smartphone application that processes data streamed from the earbuds. The system employs distinct pipelines for vocal and motion reaction detection; illustrative code sketches of the main steps follow this section.

**Vocal Reaction Detection:** This pipeline includes four key steps:

1. **Certain Non-reaction Signal Filtering:** A two-step filter based on accelerometer data (movement level) and audio data (decibel level) efficiently discards segments unlikely to contain vocal reactions, significantly reducing processing overhead. The filter targets segments with either too little or too much movement, both of which indicate a high likelihood of non-reaction (see the first sketch below).
2. **Sound Event Classification:** YAMNet, a pre-trained sound classification model, serves as the base classifier, with its outputs mapped to GrooveMeter's label set (singing/humming, whistling, non-reaction). A rank constraint relaxation policy handles cases where YAMNet does not confidently assign one of the target labels, accounting for the ambiguity of real-world vocal reactions (see the second sketch below).
3. **Information-leveraged Correction:** To correct ambiguous labels (speech, music), a prosodic similarity computation compares the pitch pattern of the user's vocal signal with the vocal line of the song, extracted using Spleeter and CREPE. Dynamic Time Warping (DTW) measures the similarity of the two note sequences, and segments with low similarity are relabeled as non-reactions (see the third sketch below).
4. **Post-processing:** A Hidden Markov Model (HMM) smooths the final output to suppress short, intermittent label flips, improving accuracy (see the fourth sketch below).

**Motion Reaction Detection:** This pipeline consists of four steps:

1. **Reaction-irrelevant Movement Filtering:** A threshold-based filter, analogous to the vocal-reaction filter, removes segments with no movement or excessive movement, both of which indicate a low likelihood of head motion (the specific reaction targeted here).
2. **Noise Removal:** A low-pass filter with a 5 Hz cutoff removes high-frequency noise from the IMU data, based on the observation that music-related head movements generally have low frequencies.
3. **Motion Reaction Classification:** The IMU data is segmented into 100 ms units, statistical features are extracted from each unit, and the resulting sequence is fed into a Long Short-Term Memory (LSTM) network for binary classification (head motion vs. non-reaction) over a 7-second window (see the fifth sketch below).
4. **Post-processing:** The final output combines the classifier output with the non-reaction segments identified by the filter.

**Dataset:** The MusicReactionSet dataset was collected from 30 participants in four diverse real-world scenarios: resting in a lounge, working at an office, riding in a car, and relaxing at a cafe. It comprises 926 minutes of audio and IMU data, manually labeled for ground truth. Apple AirPods Pro and eSense earbuds were used for data acquisition.
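A minimal sketch of the two-step non-reaction filter, assuming the movement level is the standard deviation of the accelerometer magnitude and the loudness check is an RMS level in dBFS. The threshold values and the exact feature definitions are hypothetical placeholders, not the paper's calibrated parameters.

```python
import numpy as np

# Hypothetical thresholds; the paper derives its own values empirically.
MOVE_MIN, MOVE_MAX = 0.02, 1.5   # accel magnitude std-dev bounds (g)
DB_MIN = -50.0                   # minimum loudness (dBFS) for a vocal reaction

def movement_level(accel: np.ndarray) -> float:
    """Std-dev of the accelerometer magnitude within a segment (N x 3)."""
    mag = np.linalg.norm(accel, axis=1)
    return float(np.std(mag))

def decibel_level(audio: np.ndarray) -> float:
    """RMS level of an audio segment in dBFS (samples scaled to [-1, 1])."""
    rms = np.sqrt(np.mean(audio ** 2)) + 1e-12
    return 20.0 * np.log10(rms)

def is_certain_non_reaction(accel: np.ndarray, audio: np.ndarray) -> bool:
    """Two-step filter: cheap IMU check first, audio check only if needed."""
    m = movement_level(accel)
    if m < MOVE_MIN or m > MOVE_MAX:      # too still or too vigorous to be a reaction
        return True
    return decibel_level(audio) < DB_MIN  # too quiet to contain singing/humming
```

Checking the IMU first is what makes the filter cheap: the microphone pipeline only runs on segments that survive the accelerometer test.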
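The sound-event classification step could look roughly like the following, using the public YAMNet model from TensorFlow Hub. The label mapping and the top-k reading of the rank constraint relaxation are our assumptions about the policy; the paper defines its own mapping and relaxation rules.

```python
import csv
import numpy as np
import tensorflow_hub as hub

# Load the pre-trained YAMNet model (521 AudioSet classes).
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')
with open(yamnet.class_map_path().numpy()) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]

# Hypothetical mapping from AudioSet display names to GrooveMeter labels.
LABEL_MAP = {
    'Singing': 'singing/humming',
    'Humming': 'singing/humming',
    'Whistling': 'whistling',
}

def classify_segment(waveform: np.ndarray, top_k: int = 5) -> str:
    """Map YAMNet output to GrooveMeter's label set.

    Rank constraint relaxation (as we read it): instead of requiring a
    target label at rank 1, accept one that appears anywhere within the
    top-k ranked classes.
    """
    scores, _, _ = yamnet(waveform)            # waveform: 16 kHz mono float32
    mean_scores = scores.numpy().mean(axis=0)  # average over YAMNet frames
    ranked = np.argsort(mean_scores)[::-1]
    for idx in ranked[:top_k]:
        if class_names[idx] in LABEL_MAP:
            return LABEL_MAP[class_names[idx]]
    top_name = class_names[ranked[0]]
    # Ambiguous classes ('Speech', 'Music') are passed on to the
    # information-leveraged correction step; everything else is non-reaction.
    return top_name if top_name in ('Speech', 'Music') else 'non-reaction'
```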
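For the information-leveraged correction, here is a sketch of the prosodic similarity computation, assuming CREPE for pitch tracking and a plain DTW over MIDI note sequences. The confidence cutoff, the median-based key normalization, and the similarity threshold are illustrative assumptions; the song's vocal line is assumed to have been separated offline with Spleeter.

```python
import numpy as np
import crepe

def note_sequence(audio: np.ndarray, sr: int) -> np.ndarray:
    """Pitch track via CREPE, as key-normalized MIDI note numbers."""
    _, freq, conf, _ = crepe.predict(audio, sr, viterbi=True, verbose=0)
    notes = 69.0 + 12.0 * np.log2(freq[conf > 0.5] / 440.0)  # Hz -> MIDI
    if notes.size == 0:
        return notes
    return notes - np.median(notes)  # tolerate singing in a different key

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Plain dynamic-time-warping distance between two note sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized

SIM_THRESHOLD = 2.0  # hypothetical cutoff, in semitones

def is_vocal_reaction(user_audio, vocal_line_audio, sr) -> bool:
    """Keep the segment only if its pitch pattern tracks the song's vocal line."""
    d = dtw_distance(note_sequence(user_audio, sr),
                     note_sequence(vocal_line_audio, sr))
    return d < SIM_THRESHOLD
```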
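The HMM-based smoothing can be approximated with a short Viterbi pass over per-segment label probabilities. The sticky transition matrix below is a hypothetical stand-in for the paper's parameters; its effect is to penalize short, intermittent label flips.

```python
import numpy as np

def viterbi_smooth(emission: np.ndarray, stay: float = 0.9) -> np.ndarray:
    """Smooth a label sequence with a simple HMM.

    emission: (T, K) per-segment label probabilities (e.g. classifier scores).
    stay: hypothetical self-transition probability; higher values suppress
    shorter spurious label changes.
    """
    T, K = emission.shape
    trans = np.full((K, K), (1.0 - stay) / (K - 1))
    np.fill_diagonal(trans, stay)
    log_e, log_t = np.log(emission + 1e-12), np.log(trans)
    dp = np.zeros((T, K))
    ptr = np.zeros((T, K), dtype=int)
    dp[0] = log_e[0] - np.log(K)             # uniform initial distribution
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_t  # scores[j, k]: from state j to k
        ptr[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_e[t]
    path = np.zeros(T, dtype=int)            # backtrack the best state sequence
    path[-1] = dp[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = ptr[t + 1, path[t + 1]]
    return path
```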
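Finally, a sketch of the motion pipeline's noise removal and LSTM classifier, assuming a 50 Hz IMU stream and mean/std/min/max statistics per 100 ms unit. The sampling rate, feature choice, and network size are our assumptions; only the 5 Hz cutoff, 100 ms units, 7-second window, and binary output come from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt
import tensorflow as tf

IMU_HZ = 50            # assumed earbud IMU sampling rate
UNIT = IMU_HZ // 10    # samples per 100 ms unit
WINDOW_UNITS = 70      # 7-second window = 70 units

def lowpass(imu: np.ndarray, cutoff: float = 5.0) -> np.ndarray:
    """5 Hz low-pass: music-related head motion is low-frequency."""
    b, a = butter(4, cutoff / (IMU_HZ / 2), btype='low')
    return filtfilt(b, a, imu, axis=0)

def unit_features(imu: np.ndarray) -> np.ndarray:
    """Statistical features per 100 ms unit (our choice: mean/std/min/max)."""
    units = imu[: len(imu) // UNIT * UNIT].reshape(-1, UNIT, imu.shape[1])
    feats = [units.mean(1), units.std(1), units.min(1), units.max(1)]
    return np.concatenate(feats, axis=1)     # (num_units, axes * 4)

n_features = 6 * 4     # 6 IMU axes (accel + gyro) x 4 statistics
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(WINDOW_UNITS, n_features)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # head motion vs. non-reaction
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```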
Key Findings
GrooveMeter achieved high accuracy in both vocal and motion reaction detection. Leave-one-subject-out cross-validation on the MusicReactionSet dataset yielded macro F1 scores of 0.89 for vocal reactions and 0.81 for motion reactions. The filtering approach reduced energy overhead by 50% or more, and the system demonstrated superior accuracy and robustness compared to alternative methods, particularly in noisy environments such as cafes. Each pipeline stage contributed: the filtering operation improved performance by reducing false positives, the correction step using music information further improved accuracy in noisy environments, and the HMM-based smoothing significantly enhanced vocal reaction detection. A case study demonstrated GrooveMeter's feasibility in several applications: automatic music rating (MAE 0.22), familiarity detection (F1 scores of 0.78 for unknown songs and 0.81 for known songs), and reaction-based music recommendation. Participants in a small-scale deployment study rated the accuracy of the reaction detection as high (an average of 4 out of 5), and performance was consistent across different environments and activities. Even under significant motion noise, such as during exercise, GrooveMeter accurately distinguished non-reactions from exercise-related movements. Execution time and energy consumption were low enough to support interactive applications.
Discussion
The findings demonstrate that music-listening reactions can be detected accurately and robustly using earable sensing in real-world scenarios. GrooveMeter's high accuracy, especially in noisy environments, shows the effectiveness of the proposed filtering, correction, and smoothing techniques, while the case study highlights the potential of integrating reaction data into various music-related applications. The results establish the practicality of using earable sensors to automatically assess music engagement, paving the way for more personalized and interactive music experiences, and the robust performance across different contexts suggests the system could apply to a wide variety of real-world situations. The relatively low energy and processing overhead indicates that GrooveMeter can be integrated into mobile devices without significant impact on battery life. Further research could explore online adaptation of thresholds and models to enhance robustness further.
Conclusion
This work presents GrooveMeter, a novel system for automatically detecting music-listening reactions using earable sensing, achieving high accuracy and robustness while addressing several key challenges of real-world deployment. Future work should explore the detection of additional reaction types (physiological, emotional), online adaptation techniques for improved robustness, and more extensive user studies to assess the long-term usability and impact of integrating GrooveMeter into various applications. Extending the system beyond music listening, for example to online meetings or lectures, is another promising avenue for future research.
Limitations
The study's dataset, while extensive, may not fully capture the diversity of real-world music-listening scenarios and individual listener variations. The threshold-based operations, while effective, may not be optimal for all situations. The case study on application use cases was limited in scope and requires further investigation with a larger number of participants and longer observation periods. Further research into the optimization of earbud energy consumption is also needed.