The Arts
Msanii: High Fidelity Music Synthesis on a Shoestring Budget
K. Maina
Discover Msanii, a model for high-fidelity, long-duration music synthesis on a modest compute budget. Authored by Kinyugo Maina, this approach pairs mel spectrograms with diffusion models to generate coherent, minutes-long audio without concatenative synthesis, cascading architectures, or compression.
~3 min • Beginner • English
Introduction
The paper addresses the challenge of efficient, high-fidelity, long-form music synthesis, which is difficult due to the high dimensionality of audio and the need to model long-range musical structure. Prior advances in generative modeling (GANs, VAEs, normalizing flows, autoregressive models, and diffusion models) have pushed the state of the art in images, speech, text, and music, yet generating long, coherent high-sample-rate music remains computationally demanding. The authors propose operating in a lower-dimensional time-frequency domain (mel spectrograms) to reduce complexity while preserving salient musical features. Autoregressive models in waveform space are slow at inference, and GANs can suffer from unstable training and low sample diversity. Diffusion models provide simpler training, strong quality, and faster inference than autoregressive approaches, making them appealing for music synthesis. The paper introduces Msanii, a diffusion-based approach operating on mel spectrogram tokens with a novel U-Net that mixes convolutional residual blocks and attention to capture both local and global context. The approach avoids concatenative synthesis, cascading architectures, and compression, and also enables audio tasks like inpainting, outpainting, interpolation, and style transfer without retraining. The contributions include: (1) a diffusion-based model for long-context, high-fidelity music synthesis in the mel spectrogram domain at 44.1 kHz; (2) synthesis of 190-second stereo music; and (3) demonstration of additional audio tasks without retraining.
Literature Review
The work situates itself within generative modeling for audio and music. Prior efforts include autoregressive models, both in waveform space (e.g., Jukebox) and in the TF domain (e.g., MelNet), which capture detail but are slow to sample and struggle with very long contexts, and GAN-based TF-domain models (e.g., GANSynth, adversarial TF synthesis), which can be fast but are prone to training instability and reduced diversity. Diffusion models have shown strong results across modalities (images, speech, and audio) with simpler training and state-of-the-art quality. To manage audio's dimensionality, TF representations like mel spectrograms are common in TTS, voice conversion, singing synthesis, and music generation. Vocoders (MelGAN, ISTFTNet, MCNN, Phase Gradient) reconstruct waveforms from mel spectrograms, but designing lightweight, phase-accurate models is challenging. Prior works have also used Griffin–Lim for phase approximation. The paper builds on these lines by using mel spectrograms as time-wise tokens, coupling a diffusion model with a U-Net+attention architecture, and adding a lightweight neural vocoder that predicts the magnitude STFT while estimating phase via Griffin–Lim.
Methodology
Overview: Msanii synthesizes mel spectrograms using a diffusion model with a U-Net architecture and converts them to waveform with a lightweight neural vocoder that predicts magnitude STFT; phase is reconstructed using Griffin–Lim.
- Representation: Operate in the mel spectrogram domain to reduce dimensionality while preserving salient musical features. Treat mel spectrograms as sequences of time tokens whose dimensionality equals the number of mel bins; process channels independently, enabling arbitrary channel counts (e.g., stereo). A minimal front-end sketch follows this list.
- Diffusion modeling: Use a standard forward noising process with a cosine noise schedule (GLIDE-style) over 1000 training timesteps, and learn the reverse process by predicting the noise with the simplified L2 loss from DDPM; sampling uses DDIM. A training-step sketch follows this list.
- U-Net architecture: Combines convolutional residual blocks (ConvNeXt-style MLP with GELU pre-activation) and attention to balance local detail and global context; a shape-level sketch of the tokenization and resampling layers follows this list.
• Tokenization (input): reshape and linearly project the spectrogram X ∈ R^{c×f×l} along the frequency axis to a latent H ∈ R^{c×d×l}; detokenization (output) inverts this mapping to Y ∈ R^{c×f×l}.
• Residual block: Instance Normalization, 3×3 padded conv (same channels), timestep embedding projected via 1×1 conv and added to features, followed by ConvNeXt-style MLP and residual connection; optional channel projection for residual path.
• Attention: InstanceNorm followed by Linear Attention (linear complexity), applied after the residual blocks in the deeper layers to improve global coherence.
• Down/Up-sampling: Downsample with a 1×3 conv, stride (1, 2), padding (0, 1), to halve the time dimension; upsample with a 1×4 transposed conv, stride (1, 2), padding (0, 1).
• Width guidance: Recommend U-Net width at least 2× the spectrogram frequency dimension for robustness, especially under noise.
- Neural vocoder: Inspired by ISTFTNet. Takes a mel spectrogram X ∈ R^{c×f_m×l} through an input layer, a single residual block (no timestep embedding), and an output layer to produce a magnitude STFT Y ∈ R^{c×f_s×l}; an exponential maps from log to linear magnitude. Phase is reconstructed with Griffin–Lim (including fast variants); a waveform-reconstruction sketch follows this list.
- Losses for vocoder: Spectral Convergence Loss (Frobenius norm on the magnitude STFT) to emphasize large components, and a log-magnitude L1 loss to emphasize small components; training is magnitude-only, with phase handled by Griffin–Lim. Both losses are sketched after this list.
- Data and preprocessing: POP909 MIDI dataset (909 pop songs) rendered to 44.1 kHz stereo audio using FluidSynth. STFT window 2048, hop 1024, 128 mel filterbanks. Propose learned, data-specific preprocessing: moving-average standard scaling (batch vs. running statistics) and min–max scaling with momentum and decay; values are clamped to the target range during inference. A scaling sketch follows this list.
- Training details:
• U-Net (diffusion): width 256; 2 U-Net blocks per resolution; total 14 blocks (encoder+decoder); ~49.8M parameters; timestep embedding dim 128. Train 110k steps with Adam (β1=0.5), LR=2e-4, 500-step linear warmup; maintain EMA weights. Train with 16-bit precision. Max audio length 8,387,584 samples (~190 s) to match downsampling divisibility. Batch size 4 on a single 16 GB GPU.
• Vocoder: width 256; single residual block; ~1.4M parameters. Train 40k steps with Adam (β1=0.5), LR=2e-4, 500-step warmup; 16-bit precision. Max audio length 523,264 samples (~11 s). Batch size 8 on a single 16 GB GPU.
- Sampling and tasks:
• Generation: DDIM sampling with 200 steps; Griffin–Lim with 200 iterations. A sampling sketch follows this list.
• Audio-to-audio (style transfer): Add noise to an input mel spectrogram at a chosen timestep, then reverse diffuse. At low noise levels the original structure is preserved but the output remains noisy; at high noise levels the output matches the training timbres but loses the original structure. Percussive content is more robust to noise.
• Interpolation: Noise two mel spectrograms to a timestep t, linearly interpolate, then reverse diffuse. Percussive sounds dominate the result; low noise preserves structure with noise artifacts, while high noise trends toward training-distribution structure.
• Inpainting/Outpainting: Use RePaint with masks; current results lack rhythmic and harmonic consistency with context and can sound like abrupt novel segments.
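The following sketches make the steps above concrete. First, a minimal version of the mel-spectrogram front end: the STFT/mel parameters (window 2048, hop 1024, 128 mel bins, 44.1 kHz) come from the section above, while the use of torchaudio, the log offset, and the function name are illustrative assumptions rather than the paper's exact implementation.

```python
# Mel-spectrogram "tokens": each time frame is a token of size n_mels, and
# channels are kept separate so stereo audio works unchanged.
import torch
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=44100,   # sample rate used in the paper
    n_fft=2048,          # STFT window
    hop_length=1024,     # STFT hop
    n_mels=128,          # mel filterbanks
)

def waveform_to_tokens(waveform: torch.Tensor) -> torch.Tensor:
    """waveform: (channels, samples) -> log-mel tokens: (channels, n_mels, frames)."""
    mel = mel_transform(waveform)        # (channels, 128, frames)
    return torch.log(mel + 1e-5)         # log scaling; the epsilon is an assumption
```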
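Next, a rough sketch of the data-specific preprocessing: standard scaling with moving-average (running) statistics followed by min–max scaling to a target range, with clamping at inference. The momentum value, target range, and exact update rules are assumptions made for illustration; the paper's scheme may differ in detail.

```python
# Moving-average standard scaling + min-max scaling of log-mel spectrograms.
import torch

class MovingAverageScaler:
    def __init__(self, momentum: float = 0.01, target_range=(-1.0, 1.0)):
        self.momentum = momentum
        self.lo, self.hi = target_range
        self.mean, self.std = torch.tensor(0.0), torch.tensor(1.0)
        self.min, self.max = torch.tensor(-1.0), torch.tensor(1.0)

    def update(self, x: torch.Tensor) -> None:
        """Blend batch statistics into running statistics during training."""
        m = self.momentum
        self.mean = (1 - m) * self.mean + m * x.mean()
        self.std = (1 - m) * self.std + m * x.std()
        z = (x - self.mean) / (self.std + 1e-5)
        self.min = (1 - m) * self.min + m * z.min()
        self.max = (1 - m) * self.max + m * z.max()

    def transform(self, x: torch.Tensor) -> torch.Tensor:
        z = (x - self.mean) / (self.std + 1e-5)
        z = (z - self.min) / (self.max - self.min + 1e-5)   # -> roughly [0, 1]
        z = z * (self.hi - self.lo) + self.lo                # -> target range
        return z.clamp(self.lo, self.hi)                     # clamp (as at inference)
```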
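A sketch of the diffusion training step: the cosine noise schedule over T = 1000 steps and the simplified epsilon-prediction L2 loss from DDPM. The schedule offset s = 0.008 and the model interface `model(x_t, t)` are assumptions, not details taken from the paper.

```python
import math
import torch
import torch.nn.functional as F

T = 1000  # training timesteps, as reported

def cosine_alpha_bar(t: torch.Tensor, s: float = 0.008) -> torch.Tensor:
    """Cumulative signal level alpha_bar(t) under a cosine schedule, t in [0, T]."""
    def f(u: torch.Tensor) -> torch.Tensor:
        return torch.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(torch.zeros_like(t))

def diffusion_training_loss(model, x0: torch.Tensor) -> torch.Tensor:
    """x0: clean (log-)mel spectrograms, shape (batch, channels, mels, frames)."""
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    alpha_bar = cosine_alpha_bar(t.float()).view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * noise  # forward noising
    return F.mse_loss(model(x_t, t), noise)                         # simplified L2 loss
```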
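A shape-level sketch of the U-Net plumbing: the frequency-axis projection used for tokenization/detokenization and the strided conv / transposed conv that halve and restore the time axis. Kernel sizes, strides, paddings, and the width come from the section above; the tensor layout, the use of nn.Linear, and the round-trip demo are assumptions.

```python
import torch
import torch.nn as nn

n_mels, width = 128, 256   # f and d in the text; width is at least 2x the mel bins

tokenize = nn.Linear(n_mels, width)     # project the frequency axis: f -> d
detokenize = nn.Linear(width, n_mels)   # inverse mapping at the output

# 1x3 conv with stride (1, 2) halves the time axis; a 1x4 transposed conv restores it.
downsample = nn.Conv2d(width, width, kernel_size=(1, 3), stride=(1, 2), padding=(0, 1))
upsample = nn.ConvTranspose2d(width, width, kernel_size=(1, 4), stride=(1, 2), padding=(0, 1))

x = torch.randn(2, 936, n_mels)                   # (audio channels, frames l, mel bins f)
h = tokenize(x)                                   # (2, 936, 256)
h = h.transpose(1, 2).unsqueeze(2)                # (2, 256, 1, 936): width as conv channels
h_down = downsample(h)                            # (2, 256, 1, 468): time halved
h_up = upsample(h_down)                           # (2, 256, 1, 936): time restored
y = detokenize(h_up.squeeze(2).transpose(1, 2))   # back to (2, 936, 128)
```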
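The two magnitude-STFT losses used to train the vocoder can be written compactly: spectral convergence (a relative Frobenius-norm error that emphasizes large components) and an L1 loss on log magnitudes (which emphasizes small components). The epsilon and the equal weighting of the two terms are assumptions; the text does not state loss weights.

```python
import torch

def spectral_convergence_loss(pred_mag: torch.Tensor, target_mag: torch.Tensor) -> torch.Tensor:
    """||target - pred||_F / ||target||_F on linear-magnitude STFTs."""
    return torch.norm(target_mag - pred_mag, p="fro") / torch.norm(target_mag, p="fro")

def log_magnitude_l1_loss(pred_mag: torch.Tensor, target_mag: torch.Tensor,
                          eps: float = 1e-5) -> torch.Tensor:
    """L1 distance between log magnitudes."""
    return torch.mean(torch.abs(torch.log(target_mag + eps) - torch.log(pred_mag + eps)))

def vocoder_loss(pred_mag: torch.Tensor, target_mag: torch.Tensor) -> torch.Tensor:
    # Equal weighting is an assumption.
    return spectral_convergence_loss(pred_mag, target_mag) + log_magnitude_l1_loss(pred_mag, target_mag)
```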
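For generation, a deterministic DDIM sampler (eta = 0) can step through an evenly spaced subsequence of 200 of the 1000 training timesteps. The schedule function mirrors the training sketch above; the model interface and the exact timestep spacing are assumptions.

```python
import math
import torch

T, STEPS = 1000, 200  # training timesteps and DDIM sampling steps, as reported

def alpha_bar(t: torch.Tensor, s: float = 0.008) -> torch.Tensor:
    def f(u: torch.Tensor) -> torch.Tensor:
        return torch.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(torch.zeros_like(t))

@torch.no_grad()
def ddim_sample(model, shape, device: str = "cpu") -> torch.Tensor:
    """Deterministic DDIM over an evenly spaced subsequence of timesteps."""
    x = torch.randn(shape, device=device)                      # start from pure noise
    timesteps = torch.linspace(T, 0, STEPS + 1, device=device)
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        ab_t, ab_prev = alpha_bar(t), alpha_bar(t_prev)
        eps = model(x, t.expand(shape[0]))                     # predicted noise
        x0_hat = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # predicted clean sample
        x = ab_prev.sqrt() * x0_hat + (1 - ab_prev).sqrt() * eps
    return x
```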
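Finally, waveform reconstruction: the vocoder's log-magnitude prediction is exponentiated and Griffin–Lim (200 iterations) estimates the phase. torchaudio's GriffinLim transform is used here purely for illustration; the paper's implementation (including fast Griffin–Lim variants) may differ.

```python
import torch
import torchaudio

griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=2048,
    hop_length=1024,
    n_iter=200,     # iterations, as reported
    power=1.0,      # the input below is a magnitude (not power) spectrogram
)

def magnitude_to_waveform(log_mag_stft: torch.Tensor) -> torch.Tensor:
    """log_mag_stft: (channels, n_fft // 2 + 1, frames) -> waveform (channels, samples)."""
    mag = torch.exp(log_mag_stft)   # log magnitude -> linear magnitude
    return griffin_lim(mag)         # estimate phase and invert the STFT
```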
Key Findings
- Msanii can synthesize long, coherent music: approximately 190 seconds of stereo audio at 44.1 kHz without concatenative synthesis, cascading architectures, or compression.
- Subjective listening indicates good long-term coherence (≈3 minutes) and diverse musical structures (e.g., repeating patterns), despite training on a relatively small dataset (POP909 rendered with FluidSynth).
- Sampling used 200 DDIM steps and 200 Griffin–Lim iterations; increasing DDIM steps beyond 200 did not yield noticeable quality gains, suggesting potential for shorter noise schedules and faster sampling.
- Style transfer: At low noise levels, structure is preserved but with noticeable noise; at higher noise, output better matches training domain timbre/instrumentation but loses original structure. Percussive elements (e.g., drums) are less sensitive to noise.
- Interpolation: Percussive components remain prominent even for low interpolation ratios; low-noise retains source structure with noise; high-noise outputs trend toward training distribution structure.
- Inpainting/Outpainting: Current outputs often lack rhythm and fail to capture melody/harmony of the surrounding context, producing abrupt changes and sounding like unrelated segments.
- Model particulars: U-Net ≈49.8M params; vocoder ≈1.4M params; training with 16-bit precision on a single 16 GB GPU; diffusion trained with cosine noise schedule over 1000 steps; sampling with DDIM (200 steps) and Griffin–Lim (200 iterations).
Discussion
Operating in the mel spectrogram domain with a diffusion model and a U-Net that mixes convolution and attention enables efficient generation of long-form, coherent music at high sample rates. The approach addresses the core challenge of high-dimensional waveform modeling by reducing dimensionality while maintaining salient musical features. Subjective evaluations suggest strong long-term structure and diversity, supporting the hypothesis that diffusion models in TF space can capture both local timbre and global form without compression or cascading stages. However, realism and naturalness lag behind human-produced music, likely due to phase reconstruction via Griffin–Lim and limited training data; this affects timbral fidelity and may introduce artifacts. The weak performance of inpainting and outpainting indicates that naïvely applying masked diffusion with RePaint in the TF domain may not preserve rhythmic and harmonic continuity, motivating task-specific conditioning or objective modifications. The lack of quantitative metrics limits rigorous comparison with prior work. Overall, the results demonstrate feasibility and promise for controllable, efficient music synthesis with further improvements in conditioning, phase/vocoding, training scale, and evaluation.
Conclusion
The paper introduces Msanii, a diffusion-based mel-spectrogram synthesis model with a lightweight neural vocoder, capable of generating minutes-long, coherent, high-sample-rate stereo music efficiently. By leveraging a U-Net with residual and attention components and operating in a lower-dimensional TF domain, Msanii avoids concatenative synthesis, cascading pipelines, and compression. Experiments show promising subjective quality and generalization to tasks like style transfer and interpolation, though inpainting/outpainting and overall realism remain areas for improvement. The authors highlight future directions including conditional generation (e.g., lyrics/MIDI/mood), user-in-the-loop control, quantitative evaluations in realistic production settings, broader audio tasks (classification, restoration), scaling, faster sampling for near-real-time use, and multi-instrument/stem generation. Msanii presents a promising foundation for high-fidelity music synthesis and production tools.
Limitations
- Work-in-progress with results subject to change; no quantitative evaluation metrics reported—only manual subjective listening.
- Phase reconstructed via Griffin–Lim can reduce realism/naturalness and introduce artifacts; magnitude-only vocoder may limit fidelity.
- Inpainting and outpainting currently fail to preserve rhythmic, melodic, and harmonic continuity, causing abrupt structural changes.
- Early training struggles with global coherence; improvements appear with longer training but remain sensitive.
- Small dataset (POP909 rendered via FluidSynth) may limit diversity and realism; synthesized training audio may not reflect real-recording variability.
- Unconditional generation limits controllability; no conditioning on lyrics, mood, or MIDI in current experiments.
- Single-GPU training and relatively modest model scale may constrain performance; no benchmarking against baselines with standardized metrics.