BS-Roformer vs Demucs: Which AI Separates Vocals Better?
Both are landmark music-separation models, but they come from different generations and different ideas about how to pull a voice out of a mix. Here's what each one actually does, why the newer transformer approach raised the bar, and why the best results today come from combining models rather than crowning one.
If you've used any modern vocal remover, you've heard the output of one of these architectures or their descendants. Demucs is Meta's hybrid model that defined "good enough to be useful" for years. BS-Roformer (Band-Split Rotary Transformer) is part of the newer wave that pushed separation quality to a level that surprises even audio engineers. Understanding the difference helps you reason about why one tool's instrumental sounds cleaner than another's.
We'll keep this accessible (no equations required) but precise enough to be genuinely informative. If you want the ground-up version of how separation works at all, read how AI vocal separation works first, then come back here for the model-vs-model comparison.
A 30-second primer on the problem
A finished song is a single mixed waveform (one per channel in stereo) in which every instrument and voice is summed together. Separation means undoing that sum: recovering the individual sources from the mixture. The standard way to measure how well a model does this is SDR (Signal-to-Distortion Ratio), in decibels: higher is better, and a couple of dB is an audible difference. Most models work by turning audio into a spectrogram (a picture of frequency over time) and learning a mask that says, for each tiny time-frequency cell, "how much of this belongs to the vocal." Apply the mask, convert back to audio, and you have a stem.
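To make the two ideas in this primer concrete, here is a minimal sketch in plain Python. It assumes the simplest textbook form of SDR (energy of the true source divided by energy of the error, in dB); published benchmarks typically use more elaborate variants of this ratio. The function names `sdr_db` and `apply_mask` are illustrative, not taken from any particular library.

```python
import math


def sdr_db(reference, estimate):
    """Basic SDR in dB: 10 * log10(source energy / error energy)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if signal_energy == 0:
        raise ValueError("SDR is undefined for a silent reference")
    if error_energy == 0:
        return math.inf  # the estimate matches the reference exactly
    return 10.0 * math.log10(signal_energy / error_energy)


def apply_mask(mixture_magnitudes, mask):
    """Scale each time-frequency cell of the mixture by its mask value (0..1)."""
    return [
        [cell * weight for cell, weight in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(mixture_magnitudes, mask)
    ]


# Toy example: an estimate that is off by 10% in every sample.
true_vocal = [1.0, -1.0, 0.5, -0.5]
estimated_vocal = [0.9 * s for s in true_vocal]
print(round(sdr_db(true_vocal, estimated_vocal), 2))
```

An estimate whose error is one tenth the size of the signal scores about 20 dB under this definition, which gives a feel for the scale: every additional dB means a noticeably smaller error relative to the source.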
What Demucs is
Demucs (the v4 line is the well-known release) is a hybrid model: it processes audio in both the time domain (the raw waveform) and the frequency domain (the spectrogram) at once, then fuses the two. The intuition is that some cues, such as sharp transients like drum hits, are easier to catch in the waveform, while others, like the harmonic structure of a sustained note, are clearer in the spectrogram. By learning from both, Demucs captures detail that a frequency-only or time-only model would miss.
Architecturally it's built largely on convolutional networks (CNNs), with later versions adding transformer elements in the middle of the network. It was a major step up from the earlier generation (like Spleeter) and remains a strong, widely used, fully open-source baseline. On common public benchmarks, Demucs v4 lands in roughly the 9–10 dB SDR range for vocals: genuinely usable stems that powered countless tools and projects.
What BS-Roformer is
BS-Roformer takes a different bet. The "BS" is band-split: instead of treating the whole spectrogram uniformly, it slices the frequency axis into bands and processes them with awareness of how each band behaves. Low frequencies (bass, kick) and high frequencies (cymbals, sibilance, breath) have very different characteristics, and giving the model band-specific treatment lets it specialize.
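The band-split idea can be sketched in a few lines. The example below is only an illustration: the band widths are made up for the demo (narrow bands at the bottom, one wide band at the top), and the helper names `band_ranges` and `split_frame` are hypothetical. The actual band layout used by any given BS-Roformer checkpoint is not specified here.

```python
def band_ranges(num_bins, widths):
    """Turn a list of band widths into (start, end) frequency-bin ranges.

    Any bins left over after the listed widths go into one final band.
    """
    ranges = []
    start = 0
    for width in widths:
        if start >= num_bins:
            break
        end = min(start + width, num_bins)
        ranges.append((start, end))
        start = end
    if start < num_bins:
        ranges.append((start, num_bins))
    return ranges


def split_frame(frame, ranges):
    """Slice one spectrogram frame (a list of bins) into per-band chunks."""
    return [frame[start:end] for start, end in ranges]


# Toy frame with 16 frequency bins; illustrative widths only.
frame = list(range(16))
ranges = band_ranges(len(frame), [2, 2, 4])
bands = split_frame(frame, ranges)
print(ranges)
print([len(band) for band in bands])
```

Each chunk can then be handed to its own small projection layer, which is what lets a model treat the low end differently from the high end.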
The "roformer" part is a transformer with rotary positional encoding. Transformers, the same family of architecture behind modern language models, are exceptional at modeling long-range relationships. In music that matters enormously: a vocal phrase, a melodic motif, or a recurring harmony spans seconds, not milliseconds. Where a CNN mostly sees a local window, a transformer can relate a note here to a phrase there, which helps it decide what is "voice" versus "instrument" with far more context.
The payoff is cleaner masks: less instrumental bleeding into the vocal, and fewer vocal "ghosts" left in the instrumental. Modern Roformer-family models (BS-Roformer and the closely related Mel-Band Roformer) push meaningfully higher than the Demucs range on the same benchmarks, which is exactly why the latest tools sound noticeably more surgical.
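For readers curious what "rotary" means in practice, here is a minimal sketch of the general rotary-encoding technique in plain Python. It rotates each adjacent pair of features by an angle proportional to the position, which makes the dot product between two encoded vectors depend only on how far apart their positions are. The `base` value of 10000 is the conventional default and the function names are illustrative; how a specific BS-Roformer implementation wires this into its attention layers is not covered here.

```python
import math


def rotate(vec, position, base=10000.0):
    """Apply rotary encoding: rotate each feature pair by position * frequency."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("rotary encoding needs an even number of features")
    out = []
    for i in range(0, dim, 2):
        angle = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Same relative offset (2 steps) at two different absolute positions.
query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.25]
near_start = dot(rotate(query, 3), rotate(key, 1))
much_later = dot(rotate(query, 10), rotate(key, 8))
print(abs(near_start - much_later) < 1e-9)
```

Because the score depends on relative distance rather than absolute position, the same musical relationship is scored the same way whether it occurs in the first bar or the last.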
Side-by-side
| Dimension | BS-Roformer | Demucs (v4) |
|---|---|---|
| Core idea | Band-split spectrogram + rotary transformer | Hybrid time + frequency, CNN-based |
| Strength | Long-range context, surgical masks | Transients + broad, robust separation |
| Vocal SDR (typical) | Meaningfully above the Demucs range | ~9–10 dB on common benchmarks |
| Generation | Current state of the art | Strong previous-generation baseline |
| Availability | Research + community checkpoints | Fully open source, easy to run |
| Compute cost | Higher (transformer-heavy) | Lower, runs on modest hardware |
Note that we deliberately don't claim an exact SDR number for BS-Roformer here: figures vary by checkpoint, training data, and benchmark. The honest, defensible statement is that current Roformer models clearly outperform the Demucs baseline on vocal separation, while Demucs remains a great, lightweight, open option.
So which is better? It depends, and you shouldn't have to choose
For raw vocal-separation quality on a capable GPU, BS-Roformer-class models win. For running locally on modest hardware, for transient-heavy material, or when you want a fully open, easy-to-install baseline, Demucs is still excellent, and it's free to self-host. Many of the best open-source workflows actually use Demucs as a first stage and a Roformer model for the vocal pass.
The deeper insight is that no single model is best on every song. One model handles breathy, intimate vocals beautifully but smears dense harmonies; another nails harmonies but leaves a touch of cymbal in the vocal. This is why serious pipelines ensemble multiple models: they run several and combine the outputs so each one's weaknesses get covered by the others' strengths.
How AIVoiceSeparator uses this
Rather than pick a winner, our pipeline runs a weighted three-model ensemble:
- BS-Roformer: the long-range-context specialist, weighted most heavily.
- Mel-Band Roformer: a sibling Roformer that splits bands on the perceptual mel scale, strong on the frequencies the ear cares about most.
- MDX23C InstVoc: a complementary architecture that catches things the Roformers occasionally miss, adding robustness.
The outputs are combined with a phase-preserving, mask-based average so the stems stay clean rather than smearing. Measured end to end, this ensemble reaches an SDR of 12.97 dB on vocals, comfortably above what any single open baseline delivers, and the reason our instrumentals come out so clean. We then loudness-normalize each stem to EBU R128 so it drops straight into a mix at a sensible level.
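As a rough illustration of what a weighted, mask-based average can look like, here is a sketch in plain Python. It is not the production pipeline: the weights (0.5, 0.3, 0.2) are hypothetical placeholders chosen only so that the first model counts most, and `ensemble_mask` and `apply_mask` are illustrative names. The sketch assumes each model outputs a real-valued mask between 0 and 1, which is applied to the mixture's complex spectrogram so that every cell keeps the mixture's phase.

```python
import cmath


def ensemble_mask(masks, weights):
    """Weighted average of several models' masks, cell by cell."""
    total = sum(weights)
    n_frames = len(masks[0])
    n_bins = len(masks[0][0])
    return [
        [
            sum(w * m[t][f] for m, w in zip(masks, weights)) / total
            for f in range(n_bins)
        ]
        for t in range(n_frames)
    ]


def apply_mask(mixture_spec, mask):
    """Scale complex cells by a real mask: magnitude changes, phase does not."""
    return [
        [cell * gain for cell, gain in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(mixture_spec, mask)
    ]


# One frame, two frequency bins, three models (hypothetical weights).
model_masks = [
    [[1.0, 0.0]],  # first model, weighted most heavily
    [[0.5, 0.5]],  # second model
    [[0.0, 1.0]],  # third model
]
weights = [0.5, 0.3, 0.2]
mixture = [[2 + 2j, 1 - 1j]]

combined = ensemble_mask(model_masks, weights)
vocal = apply_mask(mixture, combined)
print(combined)
print(cmath.phase(vocal[0][0]), cmath.phase(mixture[0][0]))
```

Averaging masks rather than finished waveforms is one way to avoid the cancellation that can occur when slightly misaligned outputs are summed directly.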
You can hear the difference yourself: pull a backing track with the instrumental extractor, grab a vocal-only stem with the acapella extractor, or just paste a link into the YouTube vocal remover. If you're deciding between tools generally, our roundup of the best free vocal removers compares the hosted and self-hosted options side by side.
The takeaway
- Demucs = the dependable, open, hybrid baseline; still great, especially self-hosted.
- BS-Roformer = newer band-split transformer that raised the quality ceiling.
- Ensembles beat any single model, because different architectures fail on different songs.
- SDR is the yardstick: a few dB is audible, and our 12.97 dB ensemble sits well above the open baseline.
You don't need to memorize the architecture to benefit from it, but now you know why one instrumental sounds hollow and another sounds like the singer simply stepped out of the room.
One last practical note. The reason model quality keeps climbing is that this field moves fast: each year brings new band-split variants, better training data, and smarter ensemble strategies. A tool that felt cutting-edge two years ago can now sound dated, which is why it pays to use a separator that keeps its models current rather than shipping one frozen checkpoint. The numbers in this article (the ~9–10 dB Demucs range and our 12.97 dB ensemble) describe today's landscape, and today's landscape is genuinely good enough that the limiting factor is usually your source recording, not the AI.
Frequently asked questions
Is BS-Roformer always better than Demucs?
On vocal-separation quality with enough compute, generally yes. But Demucs is lighter, fully open source, and excellent for self-hosting and transient-heavy material. The "best" choice depends on your constraints.
What does SDR actually measure?
Signal-to-Distortion Ratio in decibels: how close the separated stem is to the true source. Higher is better, and even a 1–2 dB gap is audible.
Why run three models instead of the best one?
Because different models fail on different songs. Ensembling lets each model's strengths cover the others' weaknesses, which is why our ensemble outperforms any single component.
What is Mel-Band Roformer?
A Roformer variant that splits frequency bands using the perceptual mel scale, concentrating capacity where human hearing is most sensitive. It pairs well with BS-Roformer in an ensemble.
Can I run these models myself?
Yes. Demucs is fully open source, and community checkpoints for Roformer models are available through tools like Ultimate Vocal Remover. Our hosted ensemble just saves you the setup.
Hear the 12.97 dB ensemble for yourself
Open AIVoiceSeparator. Free: 1 song/day, no signup. Patreon Pro: 20 songs/day.