This paper introduces Msanii, a novel diffusion-based model for efficient, long-context, high-fidelity music synthesis. It combines mel spectrograms, diffusion models, and neural vocoders to generate tens of seconds of stereo music at 44.1 kHz without concatenative synthesis, cascading architectures, or compression. The model also demonstrates potential for audio inpainting and style transfer.
Publisher
None (Work in Progress)
Published On
Jan 18, 2023
Authors
Kinyugo Maina
Tags
music synthesis
diffusion model
mel spectrograms
audio inpainting
style transfer
neural vocoders
high fidelity