Abstract
This paper introduces Msanii, a novel diffusion-based model for efficient, long-context, high-fidelity music synthesis. It combines mel spectrograms, diffusion models, and neural vocoders to generate tens of seconds of stereo music at 44.1 kHz without concatenative synthesis, cascading architectures, or compression. The model also demonstrates potential for audio inpainting and style transfer.
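To make the described pipeline concrete, here is a minimal conceptual sketch of the three-stage idea in the abstract: sample a mel spectrogram with a diffusion process, then map it to a stereo waveform with a neural vocoder. All names, shapes, and placeholder modules below are illustrative assumptions for exposition; this is not the authors' actual Msanii code or API.

```python
# Conceptual sketch only: mel-spectrogram domain -> diffusion sampling -> vocoder.
# The "denoiser" and "vocoder" here are placeholders standing in for trained networks.
import torch

SAMPLE_RATE = 44_100   # target sample rate from the abstract
N_MELS = 128           # assumed mel-band count (illustrative)
N_FRAMES = 2_048       # assumed number of spectrogram frames (illustrative)
HOP_LENGTH = 256       # assumed hop length between frames (illustrative)


def sample_mel_with_diffusion(num_steps: int = 50) -> torch.Tensor:
    """Stand-in for the diffusion model: start from Gaussian noise and
    iteratively denoise a mel spectrogram. A real model predicts the noise
    (or the clean signal) with a learned network at every step."""
    mel = torch.randn(1, N_MELS, N_FRAMES)
    for _ in range(num_steps):
        predicted_noise = torch.zeros_like(mel)  # placeholder for a learned denoiser
        mel = mel - predicted_noise / num_steps
    return mel


def vocode(mel: torch.Tensor) -> torch.Tensor:
    """Stand-in for the neural vocoder: convert the mel spectrogram back to a
    stereo waveform at 44.1 kHz. A real vocoder would be a trained network."""
    num_samples = mel.shape[-1] * HOP_LENGTH
    return torch.zeros(2, num_samples)  # (channels=2, time)


if __name__ == "__main__":
    mel = sample_mel_with_diffusion()
    audio = vocode(mel)
    print(audio.shape, "-> stereo waveform at", SAMPLE_RATE, "Hz")
```

Working in the mel-spectrogram domain rather than directly on raw audio is what lets a single (non-cascaded, non-compressive) diffusion model cover tens of seconds of 44.1 kHz audio, with the vocoder handling the final waveform reconstruction.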
Publisher: None (Work in Progress)
Published On: Jan 18, 2023
Authors: Kinyugo Maina
Tags: music synthesis, diffusion model, mel spectrograms, audio inpainting, style transfer, neural vocoders, high fidelity