
Computer Science
Emotion-aware music tower blocks (EmoMTB): an intelligent audiovisual interface for music discovery and recommendation
A. B. Melchiorre, M. Schedl, et al.
Discover EmoMTB, an innovative music exploration system designed by researchers Alessandro B. Melchiorre, Markus Schedl, David Penz, Christian Ganhör, Oleg Lesota, Vasco Fragoso, Florian Fritzl, Emilia Parada-Cabaleiro, and Franz Schubert. Navigate a vibrant 'music city' where tracks are visually represented as colored cubes, letting you explore familiar and new genres while receiving emotion-aware feedback.
~3 min • Beginner • English
Introduction
The paper addresses limitations of mainstream music streaming and recommendation platforms that present results as linear lists, which are susceptible to cognitive (e.g., position) and algorithmic (e.g., popularity) biases. The authors propose EmoMTB, an intelligent, audiovisual interface for non-linear exploration of large music catalogs. Using a city metaphor, tracks are arranged spatially based on similarity derived from audio features and fine-grained genres, enabling smooth transitions within and between genres. The system integrates personalized recommendations and emotion awareness, re-ranking starting points based on a user’s self-reported emotion or a crowd emotion inferred from Twitter. The purpose is to enhance discovery, mitigate exposure biases, and offer an engaging exploration experience across nearly half a million tracks.
Literature Review
Related work spans three areas: (1) Intelligent music exploration interfaces that spatialize collections (e.g., Islands of Music, nepTune, deepTune, Music Galaxy, Songrium, MusicLatentVIS, and previous MTB). These commonly leverage audio features and dimensionality reduction (often t-SNE) to create interactive maps for exploration; some rely on list-based search as a complement. A few incorporate emotion information (e.g., Vad et al.; Liang and Willemsen) using valence/energy or audio-derived emotion descriptors. EmoMTB differs by scaling to 436k tracks, integrating Spotify playback, combining audio and fine-grained genre features to support continuous genre transitions, providing personalized and emotion-aware recommendations, and enabling smartphone gamepad-like navigation. (2) Music emotion recognition (MER) typically uses audio, lyrics, symbolic features, or multimodal fusion. Because audio access is restricted and labeled datasets with basic emotions are scarce, user-generated tags (e.g., Last.fm) offer a viable alternative but are less utilized. (3) Emotion-aware recommendation has used signals from microblogs (e.g., Sina Weibo), emotion-annotated points-of-interest matched to music, mood-based interfaces (MoodPlay), and physiological signals from wearables. EmoMTB contributes by predicting track emotions from Last.fm tags and crowd emotion from Twitter, and by re-ranking personalized recommendations accordingly.
Methodology
Data and preprocessing: EmoMTB builds on the LFM-2b dataset (2B listening events, ~120k users, ~51M tracks) with Last.fm tags and weights (0–100). Tracks are matched to Spotify via name similarity (normalized LCS threshold > 0.5) to fetch audio features and popularity, yielding a working catalog of 436,064 tracks.
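To make the matching step concrete, here is a minimal sketch of normalized longest-common-subsequence matching between a Last.fm track title and a Spotify candidate. The normalization by the longer string and the function names are illustrative assumptions; the paper only states that a normalized LCS score above 0.5 is required.

```python
def lcs_length(a: str, b: str) -> int:
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def normalized_lcs(a: str, b: str) -> float:
    """Normalize by the longer string so the score lies in [0, 1] (normalization is an assumption)."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

def is_match(lastfm_title: str, spotify_title: str, threshold: float = 0.5) -> bool:
    # Keep the Spotify candidate only if similarity exceeds the paper's 0.5 threshold.
    return normalized_lcs(lastfm_title, spotify_title) > threshold
```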
Landscape generation: Tracks are represented by a combined feature set: (a) fine-grained genres extracted by matching Last.fm tags to EveryNoise micro-genres (2,374 unique genres), represented as TF-IDF vectors using tag weights as term frequency and document frequency as the number of tracks with the tag; (b) Spotify audio features: Energy, Valence, Acousticness, Instrumentalness, Speechiness. The resulting 2,379-dimensional vectors are reduced via PCA to cover 95% variance (405 components), then projected to 2D using t-SNE (perplexity 45). Coordinates are discretized to a tiled map. Tracks with very similar coordinates are stacked into towers; within towers, blocks are sorted by Spotify popularity (most popular on top). Colors derive from a mapping of 12 macro-genres (from Holm et al.) to hues, using the highest-weighted genre per track.
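A minimal sketch of the projection pipeline using scikit-learn, assuming the 2,379-dimensional track vectors are already assembled. The variance threshold and perplexity follow the figures quoted above; the placeholder data, grid resolution, and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from collections import defaultdict
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# X: (n_tracks, 2379) matrix of TF-IDF genre features + 5 Spotify audio features
X = np.random.rand(1000, 2379)            # placeholder; real data comes from the catalog
popularity = np.random.randint(0, 100, 1000)  # placeholder Spotify popularity scores

# 1. PCA keeping enough components to cover 95% of the variance (405 in the paper)
X_pca = PCA(n_components=0.95).fit_transform(X)

# 2. t-SNE projection to 2D with perplexity 45
xy = TSNE(n_components=2, perplexity=45).fit_transform(X_pca)

# 3. Discretize coordinates onto a tiled map; tracks sharing a tile form a tower
n_tiles = 200  # illustrative grid resolution
mins, maxs = xy.min(axis=0), xy.max(axis=0)
tiles = np.floor((xy - mins) / (maxs - mins + 1e-9) * (n_tiles - 1)).astype(int)

# 4. Group track indices per tile; within a tower, sort blocks by popularity
towers = defaultdict(list)
for idx, (tx, ty) in enumerate(tiles):
    towers[(tx, ty)].append(idx)
for key in towers:
    towers[key].sort(key=lambda i: popularity[i], reverse=True)  # most popular on top
```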
Emotion prediction: EmoMTB models four basic emotions (happiness, sadness, anger, fear) from Ekman’s Big Six. Due to limited song-level datasets with these categories, transfer learning is applied: a multilayer perceptron is trained on aggregated, cleaned, and balanced text datasets (eight corpora totaling 21,480 samples) labeled with the four emotions. OpenXBoW constructs bag-of-words features using ANEW and VADER lexica. Last.fm tags serve as inputs to predict track emotions; tweets mentioning EmoMTB inform a ‘crowd’ emotion.
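The sketch below mimics this transfer-learning setup with scikit-learn components (CountVectorizer in place of openXBOW's lexicon-based bag-of-words, a small MLP with arbitrary hyperparameters, toy training texts); it illustrates the idea of training on emotion-labeled text and applying the classifier to Last.fm tags, rather than reproducing the authors' pipeline.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Toy samples standing in for the aggregated, balanced text corpora (21,480 samples)
texts = ["what a wonderful day", "i miss you so much", "this makes me furious", "i am terrified"]
labels = ["happiness", "sadness", "anger", "fear"]

# Bag-of-words features (the paper uses openXBOW with ANEW and VADER lexica instead)
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(texts)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, labels)

# Transfer step: apply the text-trained classifier to a track's Last.fm tags
track_tags = "melancholic sad piano ballad"
probs = clf.predict_proba(vectorizer.transform([track_tags]))[0]
print(dict(zip(clf.classes_, probs)))  # per-emotion confidence for the track
```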
Recommendation and re-ranking: During onboarding, EmoMTB retrieves a user’s top 5 short-term and top 5 long-term tracks from Spotify as seeds, then requests up to 200 recommendations. Recommendations are filtered to those present in EmoMTB’s catalog and mapped to landscape coordinates. After the user selects an emotion (or the system infers crowd emotion), the list is re-ordered by the classifier’s confidence for that emotion (descending).
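Re-ranking itself amounts to sorting candidates by the classifier's confidence for the selected emotion. A minimal sketch, assuming each recommendation already carries a dict of emotion probabilities; the field names are hypothetical.

```python
def rerank_by_emotion(recommendations, emotion):
    """Order recommendations by the classifier's confidence for the selected emotion (descending)."""
    return sorted(recommendations,
                  key=lambda r: r["emotion_probs"].get(emotion, 0.0),
                  reverse=True)

# Example: the user (or the Twitter crowd signal) selects "happiness"
recs = [
    {"track_id": "a", "emotion_probs": {"happiness": 0.2, "sadness": 0.6}},
    {"track_id": "b", "emotion_probs": {"happiness": 0.7, "sadness": 0.1}},
]
print([r["track_id"] for r in rerank_by_emotion(recs, "happiness")])  # ['b', 'a']
```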
Interaction and visualization: The 3D landscape (three.js) renders a flat floor and emotion-themed sky, with a directional light whose properties vary by selected emotion. A white hovering torus avatar snaps to the grid; motion uses smooth acceleration/deceleration. Selecting a block displays metadata (track, artist, fine-grained genres, predicted emotion) and triggers playback after a 2 s hover; stopping occurs after hovering empty space for 5 s. A smartphone web controller provides joystick navigation, vertical movement within towers (elevator buttons), emotion selection (including crowd option), and a scrollable recommendation list. Selecting a recommended track animates the avatar to its location. System architecture comprises a web server (data storage; Spotify/Twitter integration; relay between phone and visualization), the user’s smartphone, and the visualization display, enabling easy deployment via web clients.
Evaluation: Threefold assessment: (1) clustering homogeneity via local genre entropy on sliding 3×3 windows; (2) emotion classifier performance via fivefold Monte Carlo cross-validation on the aggregated text dataset; (3) qualitative user feedback via a post-experience web survey after an exhibition deployment (Ars Electronica Festival 2021).
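To make the clustering metric concrete, the sketch below computes Shannon entropy of macro-genre labels over sliding 3×3 windows of the tiled map and averages the window values. Using one genre label per tile and averaging into a single score are assumptions; the paper only states that entropy is evaluated on local 3×3 windows.

```python
import numpy as np
from collections import Counter

def local_genre_entropy(genre_grid):
    """Mean Shannon entropy over sliding 3x3 windows of a 2D grid of macro-genre labels (None = empty tile)."""
    h, w = len(genre_grid), len(genre_grid[0])
    entropies = []
    for y in range(h - 2):
        for x in range(w - 2):
            window = [genre_grid[y + dy][x + dx]
                      for dy in range(3) for dx in range(3)
                      if genre_grid[y + dy][x + dx] is not None]
            if not window:
                continue
            counts = np.array(list(Counter(window).values()), dtype=float)
            p = counts / counts.sum()
            entropies.append(float(-(p * np.log(p)).sum()))
    return float(np.mean(entropies)) if entropies else 0.0

# With 12 macro-genres, the maximum per-window entropy is log(12) ≈ 2.485.
```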
Key Findings
- Scale and feasibility: The system integrates 436,064 tracks with Spotify playback and interactive exploration, substantially larger than many prior audiovisual interfaces.
- Clustering quality: Total genre entropy of the landscape is 0.168 (6.7% of the maximum log(12) ≈ 2.485), indicating high local genre homogeneity. Randomly shuffling genres while keeping positions yields 1.241 ± 0.001 (~50% of maximum), confirming the meaningfulness of the projection and clustering.
- Emotion recognition: The transfer-learned classifier on the aggregated text dataset (21,480 samples) achieves mean accuracy 59.0%, recall 59.1%, precision 59.2% across five folds. Performance varies by source dataset size, with larger corpora yielding better results.
- Qualitative user study (n=8): Most participants highlighted discovering new songs as the most relevant feature (6/8); the visual appeal was described as good (6/8) but simple (4/8), with suggestions for additional elements; the interface was perceived as simple and intuitive (6/8), though possibly less straightforward for users unfamiliar with mobile games (2/8). Recommendation quality was rated satisfactory (3/8) or very satisfactory (3/8), with recommendations serving as strong starting points for exploration. The emotional component was considered appropriate (5/8) and interesting (3/8) but seen as improvable (4/8), as track emojis did not always match the perceived emotion.
Discussion
The results support EmoMTB’s premise that a spatial, audiovisual interface can mitigate limitations of list-based recommendation by enabling non-linear exploration and exposing users to diverse yet semantically adjacent music. Low local genre entropy demonstrates coherent neighborhood structures and continuous genre transitions, validating the combined audio-plus-genre feature approach and the projection pipeline. While the emotion classifier’s moderate accuracy limits fine-grained affective control, re-ranking by predicted emotion still provides a meaningful personalization layer that users can leverage. Qualitative feedback indicates the interface is engaging, aids discovery, and that recommendation seeds effectively bootstrap exploration, aligning with the goal of balancing algorithmic guidance and serendipity. Identified shortcomings—particularly the emotional labeling mismatch for some tracks—highlight avenues to refine emotion modeling and its integration into the UI. Overall, EmoMTB offers a viable path to reduce exposure/popularity biases by visualizing the entire catalog and allowing users to traverse beyond their comfort zones along musically meaningful routes.
Conclusion
The paper introduces EmoMTB, a working prototype of an intelligent audiovisual interface for music discovery that combines large-scale spatial visualization, personalized recommendations, and emotion-aware re-ranking. By projecting 436k tracks into a 2D city metaphor using audio and fine-grained genre features, the system enables smooth transitions within and across genres, facilitating exploration beyond traditional ranked lists. Quantitative evaluation shows highly homogeneous genre clusters; transfer-learned emotion recognition achieves moderate accuracy; and a qualitative user study reports positive user experiences and effective discovery.
Future research directions include: (1) removing the dependency on Spotify accounts and simplifying legal/technical constraints; (2) improving emotion recognition and deepening its integration (e.g., emotional clustering and richer affective theming); (3) enhancing interaction to let users modify or build personalized landscapes; (4) enriching visualization metaphors (e.g., tramways as curated playlists); (5) enabling multi-user collaborative exploration; and (6) systematically investigating and mitigating popularity bias using such interfaces.
Limitations
- Requires users to have a Spotify account for technical/legal reasons.
- Emotion recognition performance is limited; the current integration of emotion-awareness is relatively simple and sometimes mismatches perceived emotions.
- Full experience assumes two screens (smartphone controller plus a large display), which may reduce accessibility outside exhibition-like setups.