Automatic Detection of Reactions to Music via Earable Sensing

C. Min, E. Lee, et al.

Discover GrooveMeter, an innovative system that uses earable sensing to automatically detect vocal and motion reactions to music. The research, conducted by Chulhong Min, Euihyeok Lee, Jaeseung Lee, Jin Yu, and Seungwoo Kang, demonstrates high detection accuracy and points to applications ranging from automatic music ratings to music therapy tools.

Introduction
Listening to music is pervasive, and people naturally exhibit physical reactions such as head nodding, foot tapping, humming, or singing along, which reflect engagement. Prior studies in music psychology have examined such responses using self-reports or bulky laboratory equipment (e.g., PET/fMRI, motion capture, ECG/GSR), limiting real-world applicability. This work proposes GrooveMeter, a mobile system that tracks readily observable bodily reactions to music (specifically singing along, humming, whistling, and head motion) using sensor-equipped earbuds with an IMU and microphone.

Detecting such reactions in the wild is challenging for three reasons: reaction-irrelevant behaviors produce similar signal characteristics; background audio noise and motion artifacts are common because music listening is often a secondary activity; and continuous on-device sensing and classification is costly. GrooveMeter addresses these challenges with early-stage filtering, multi-step pipelines tailored to reaction patterns, and the similarity between listener sensor data and musical structure. The MusicReactionSet dataset (30 participants, 240 sessions across lounge, office, car, and cafe; 926 minutes) supports development and evaluation.

Contributions: (1) MusicReactionSet, the first dataset targeting vocal and motion reactions during daily music listening; (2) GrooveMeter, the first earable sensing system specialized for automatic detection of vocal and motion reactions to music; (3) novel techniques for efficient and robust detection that filter reaction-irrelevant segments and leverage music information, with a comprehensive evaluation demonstrating accuracy, robustness, and efficiency.
Literature Review
Related work spans: (1) Reaction sensing during content consumption, e.g., implicit responses to movies, live performances, humor appraisal, frisson detection, and attention estimation. These rely on sensors suited to their content and do not exploit content characteristics for sensing; GrooveMeter leverages music-content similarity for robustness. (2) Human sensing with earables: prior work uses earbud IMU/microphones for activity and head motion recognition, facial expression and jaw/tongue motion, respiration, gait, authentication, microsleep detection, ear-disease detection, and physiological sensing. GrooveMeter focuses on music-listening reactions and introduces new techniques tailored to these reactions. (3) Understanding music listening behavior and context: studies using mobile tools and surveys examine what/when/why/where people listen; GrooveMeter differs by detecting how people react during listening and enabling large-scale real-life data collection with earbuds.
Methodology
System overview and design: GrooveMeter detects vocal (singing/humming, whistling) and motion (head movement) reactions in real time from earbud microphone and IMU streams. Music information is retrieved from a server; sensing streams from the earbuds are processed on the user's smartphone to preserve privacy. The pipeline emphasizes unobtrusive sensing, accurate and robust detection, and low overhead.

Vocal reaction detection pipeline:
- Early non-reaction filtering: a two-step filter reduces cost and false positives. (1) A motion-based filter thresholds the standard deviation of accelerometer magnitude, discarding segments with too little or too much movement (threshold range 0.0104–0.12 g; measured at 113 mW on a Galaxy S21). (2) A sound-based filter discards segments below 49 dB SPL (134 mW). Filtered segments are labeled non-reaction and bypass heavy processing.
- Preprocessing: earbud audio is resampled to 16 kHz, split into 1 s segments, and passed through a 1st-order low-pass filter at 2 kHz.
- Feature extraction: STFT with a Hann window (25 ms window, 10 ms hop); mapping to 64 mel bins (125–7,500 Hz); log mel spectrogram framed into 96×64 patches (96 frames × 64 mel bands).
- Base classification: YAMNet (a MobileNet-based audio classifier). Because YAMNet lacks fine-grained labels for earbud-captured reactions, its outputs are mapped to GrooveMeter labels: speech and music → ambiguous; humming/singing → singing/humming; whistling/whistle → whistling; all others → non-reaction. YAMNet often mislabels singing/humming as speech or music, especially with background noise.
- Rank-constraint relaxation: output uncertainty is assessed via the least margin (the difference between the top-2 softmax scores). If the margin is below 0.9, the top-5 labels are inspected; if any target label is present (speech, music, humming, singing, whistling, whistle), the segment is marked uncertain and forwarded to correction. This recovers reaction segments that appear in lower-rank outputs.
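The rank-constraint relaxation step can be sketched as follows. The label set and the 0.9 margin threshold come from the description above; the function names and raw-score inputs are illustrative assumptions, not the authors' implementation.

```python
import math

# Reaction-relevant classes among YAMNet's outputs, per the label
# mapping described above.
TARGET_LABELS = {"speech", "music", "humming", "singing", "whistling", "whistle"}

def softmax(scores):
    """Convert raw classifier scores to a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relax_rank_constraint(labels, scores, margin_threshold=0.9):
    """Decide whether a classifier output is 'uncertain' and should be
    forwarded to the music-informed correction stage.

    labels: class names aligned with scores.
    Returns (decision, top_label), decision in {'accept', 'uncertain'}.
    """
    probs = softmax(scores)
    ranked = sorted(zip(probs, labels), reverse=True)
    # Least margin: difference between the top-2 softmax scores.
    margin = ranked[0][0] - ranked[1][0]
    if margin >= margin_threshold:
        return "accept", ranked[0][1]
    # Low margin: inspect the top-5 labels for any reaction-relevant class.
    top5 = {label for _, label in ranked[:5]}
    if top5 & TARGET_LABELS:
        return "uncertain", ranked[0][1]   # forward to correction
    return "accept", ranked[0][1]          # confidently non-reaction
```

A segment whose top label wins by a wide margin is accepted as-is; a low-margin segment is only escalated when a reaction-relevant class appears anywhere in its top 5, which is how lower-rank reaction evidence is recovered.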
- Music-informed correction via prosodic similarity: pitch is extracted every 0.1 s using CREPE, converted to musical notes, and then to the 12-tone chromatic scale (octave-invariant). For the song track, the vocal stem is first separated with Spleeter and the same 12-tone note sequence is extracted. Similarity between the two note sequences is computed with DTW (notes mapped to integers 0–11). If the DTW distance exceeds 130, the segment is labeled non-reaction; otherwise it is mapped to singing/humming (for speech/music/humming/singing) or whistling (for whistling/whistle). This exploits the fact that vocal reactions track the song's melody; DTW distance distributions separate reactions from non-reactions across noise conditions.
- Output smoothing: Hidden Markov Model (HMM)-based smoothing via Viterbi decoding over 6 s windows of classification outputs handles intermittent pauses, alternations, and spurious short events, producing temporally consistent reaction segments with start/end times. Smoothing can be disabled for low-latency applications.

Motion reaction detection pipeline:
- Reaction-irrelevant movement filtering: the standard deviation of 1 s accelerometer magnitude filters out no-movement and too-large-movement segments as non-reaction. Thresholds were chosen from the dataset CDF: 0.0092 g (low) to 0.114 g (high).
- Noise removal: motion reactions follow beat patterns, with dominant frequencies typically below 4 Hz. A 1st-order low-pass filter at 5 Hz applied to the 70 Hz IMU stream suppresses higher-frequency artifacts from other activities.
- Motion abstraction and classification: the filtered IMU stream is segmented into 100 ms intervals. For each segment, per-axis gyroscope features are computed: max, min, mean, range, standard deviation, and RMS (3 axes × 6 = 18 features), yielding a sequence of motion units (a temporal abstraction of user motion). 7 s windows (70×18 matrices) are classified with an LSTM (32 hidden units, dropout 0.5, ReLU, softmax) to distinguish head motion from non-reaction, trained for up to 300 epochs with Adam. The LSTM was chosen over RNN/GRU variants based on empirical performance.
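The prosodic-similarity check above can be sketched as a pitch-to-chroma conversion followed by a classic DTW comparison. The 0.1 s pitch step, the 0–11 note mapping, and the distance threshold of 130 come from the description above; the circular note distance and the function names are assumptions made for illustration (the paper only states that notes are mapped to integers 0–11).

```python
import math

def freq_to_chroma(freq_hz):
    """Map a pitch frequency to a 12-tone chromatic class (0-11,
    octave-invariant), using the MIDI convention (A4 = 440 Hz = note 69)."""
    midi = round(69 + 12 * math.log2(freq_hz / 440.0))
    return midi % 12

def note_distance(a, b):
    """Circular distance on the 12-tone scale (an assumption; any
    per-note cost would fit the same DTW recurrence)."""
    d = abs(a - b)
    return min(d, 12 - d)

def dtw_distance(seq_a, seq_b):
    """Classic dynamic-time-warping distance between two note sequences."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = note_distance(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def is_vocal_reaction(user_freqs, song_freqs, threshold=130.0):
    """Label a segment as a vocal reaction if its note sequence tracks
    the song's melody (DTW distance at or below the paper's threshold)."""
    user_notes = [freq_to_chroma(f) for f in user_freqs]
    song_notes = [freq_to_chroma(f) for f in song_freqs]
    return dtw_distance(user_notes, song_notes) <= threshold
```

In the real pipeline, `user_freqs` would come from CREPE on the earbud audio and `song_freqs` from CREPE on the Spleeter-separated vocal stem; humming that follows the melody yields a small DTW distance even when YAMNet confused it with speech.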
- Post-processing: classifier outputs are combined with filtered segments to produce motion reaction events (type, start/end), which can be aggregated for analytics (e.g., which song parts elicit the most movement).

Dataset (MusicReactionSet): 30 participants (18 male, 12 female), ages 20–26 (mean 22.7), 240 sessions (8 per participant) across four situations: lounge (resting), office (working), car (riding), and cafe (relaxing). Participants freely selected songs from a top-50 chart, half up-tempo/exciting and half slow/soft; the first song per location was omitted to avoid settling effects. The set totals 926 minutes of synchronized earbud microphone (AirPods Pro, 44.1 kHz) and IMU (eSense, 70 Hz) data. Ground truth was obtained via covered camera recordings; the study was IRB approved. An additional controlled lab dataset (10 participants performing instructed continuous reactions and non-reaction motions) supports baseline comparisons, and an activity-robustness dataset (5 participants doing push-ups, squats, and running while listening, with no intentional reactions) tests robustness.

Implementation and system cost: an Android service (Galaxy S21, Android 11; Galaxy S8+, Android 9) built with TensorFlow Lite. End-to-end execution time per segment: ~17 ms (vocal) and ~13 ms (motion) on the S21; under 90 ms total on the S8+. Energy was measured with a Monsoon power monitor: with filtering, the energy overhead is 3.7 mJ/s on the S21 (vs. 7.3 mJ/s without) and 28.5 mJ/s on the S8+ (vs. 70.4 mJ/s without). Filtering stages were also measured individually (e.g., motion filter 27.9 mW, sound filter 247 mW on the S21). On the earbud side, eSense with IMU+audio consumes ~20% more battery over 1 hour than music-only use.

Applications: a prototype music player with a reaction summary and reaction-based recommendation; case studies include automatic music-rating prediction and song-familiarity detection from reaction statistics (normalized durations/counts) with decision trees, evaluated via leave-one-subject-out (LOSO) cross-validation.
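The motion-abstraction step above can be sketched as follows: each 100 ms gyroscope segment (about 7 samples at 70 Hz) is reduced to the 18-dimensional motion unit (max, min, mean, range, standard deviation, RMS per axis) that feeds the LSTM. This is a stdlib-only sketch under those stated assumptions; the function names are illustrative.

```python
import math
from statistics import mean, pstdev

def axis_features(samples):
    """Six summary features for one gyroscope axis within a 100 ms
    segment: max, min, mean, range, standard deviation, and RMS."""
    mx, mn = max(samples), min(samples)
    return [
        mx,
        mn,
        mean(samples),
        mx - mn,                               # range
        pstdev(samples),                       # population std deviation
        math.sqrt(mean(s * s for s in samples)),  # RMS
    ]

def motion_unit(segment):
    """Abstract one segment of 3-axis gyroscope samples (a list of
    (x, y, z) tuples) into an 18-dimensional motion unit
    (3 axes x 6 features)."""
    xs, ys, zs = zip(*segment)
    return axis_features(list(xs)) + axis_features(list(ys)) + axis_features(list(zs))
```

A 7 s classification window then stacks 70 consecutive motion units into the 70×18 matrix consumed by the LSTM, so the network sees a coarse temporal abstraction rather than raw 70 Hz samples.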
Key Findings
- Overall detection accuracy: with LOSO cross-validation on MusicReactionSet, GrooveMeter achieves macro F1 scores of 0.89 (vocal reactions) and 0.81 (motion reactions). For vocal reactions, precision/recall: singing/humming 0.85/0.87; whistling 0.93/0.78; non-reaction 0.99/0.97. In some analyses, macro-averaged vocal F1 is ~0.90.
- Robustness to noise: significant gains in noisy settings (e.g., cafe), where music-informed correction and HMM smoothing improved F1 by up to 0.21 over a YAMNet label-mapping baseline. Correction particularly reduced false positives where background music and nearby speech were present.
- Impact of filtering: early-stage filtering not only reduces computation but also improves F1. Filtering ratios: motion-based ~12%, sound-based ~60%, both ~63%, reducing classified segments to 37% while raising F1 by ~0.1 over no filtering. Filtering cut energy by 50% or more (S21: 7.3 → 3.7 mJ/s; S8+: 70.4 → 28.5 mJ/s).
- Motion reaction detection: head-motion F1 = 0.74 (non-reaction F1 = 0.85), outperforming baselines (RandomForest 0.60, CNN 0.71, ConvLSTM 0.65 for head motion). GrooveMeter's LSTM over motion units better handles varied, non-uniform head motion patterns and is more robust in office/cafe scenarios with motion noise (F1 decrease ≤ 0.03 vs. the CNN's 0.06–0.09).
- Genre effects: up-tempo/exciting songs yield higher precision and recall for motion reactions (clearer periodicity) than slow/soft songs.
- Robustness to other activities: when users performed exercises with no intentional reactions, non-reaction detection accuracy was high: push-ups 0.94, squats 0.92, running 0.99; filtering ratios were high (48–99%) except for push-ups (18%).
- Latency and on-device feasibility: per-segment processing times of ~17 ms (vocal) and ~13 ms (motion) on the Galaxy S21; interactive latency is feasible even on the S8+ (<90 ms total).
- Application case studies: (1) automatic music rating from reaction summaries achieved MAE 0.22 on unknown songs; (2) familiarity detection achieved F1 = 0.78 for known songs (precision 0.85, recall 0.72) and 0.81 for unknown songs (precision 0.76, recall 0.88); (3) a small deployment (3 users, 1 day) indicated positive subjective accuracy (average 4/5) and a preference for reaction-based recommendations, with users reporting similar reactions to recommended songs.
Discussion
GrooveMeter addresses the challenge of detecting real-life music listening reactions by combining lightweight filtering, tailored classification, and music-content-informed correction, producing robust outputs in noisy and activity-rich environments. The results show that reaction-irrelevant segments can be effectively filtered early, reducing both false positives and energy cost, while the prosodic similarity to song melody reliably distinguishes vocal reactions from speech and background audio even in cafes and cars. The motion pipeline’s temporal abstraction plus LSTM captures diverse, non-uniform head motion patterns better than conventional period-statistics or CNN/ConvLSTM baselines, leading to improved generalization across users and contexts. These findings support the research goal of enabling engagement-aware music applications: the system’s accurate, time-localized reaction events enable automatic rating, familiarity inference, recommendation, and analytics on which song parts elicit engagement, with practical on-device latency and power consumption.
Conclusion
GrooveMeter is a novel earable-sensing system that automatically detects vocal (singing/humming, whistling) and motion (head movement) reactions during everyday music listening. By combining early-stage non-reaction filtering, YAMNet-based classification with rank-relaxation, music-informed prosodic correction (CREPE, Spleeter, DTW), and HMM smoothing, along with an LSTM-based motion detector over motion-unit sequences, GrooveMeter achieves robust, efficient detection in the wild. Evaluations on MusicReactionSet (926 minutes, 30 participants, 4 real-life contexts) show high accuracy, robustness to noise and activities, and reduced energy overhead, and application case studies demonstrate feasibility for engagement-aware services. Future work includes extending to physiological/emotional reactions with additional sensors, refining thresholds via online adaptation, enlarging datasets across broader demographics and contexts, and exploring context-dependent analyses and other content domains.
Limitations
- Threshold-based components: several operations use fixed, conservatively chosen thresholds (movement and decibel filters, DTW similarity). While effective and efficient, static thresholds may underperform in unseen environments; future work includes online adaptation or ML-based alternatives.
- Dataset scope: MusicReactionSet, though diverse (four settings, 30 participants), may not cover all real-world variability in environments, demographics, or listening habits; larger, more heterogeneous datasets would help generalization.
- Limited reaction types: the system focuses on detectable physical reactions (vocal, head motion). Other reactions (e.g., foot tapping, finger snapping) and physiological or emotional responses are not modeled; adding sensors (PPG, EDA) could broaden coverage.
- Deployment scale: the small, short-term deployment (3 users, 1 day) limits conclusions about real-world performance and user benefits over time; longer, larger deployments are needed.
- Device constraints: earbud energy costs and limited on-earbud processing constrain the design; the current system processes data on a smartphone, and earbud-side optimization remains future work.