How AI Vocal Separation Works — Spectrograms, Masks & Roformers

Pulling a clean vocal out of a finished song feels like un-baking a cake. Yet modern AI does it routinely. This is an intuitive, jargon-light tour of how that actually works — from turning sound into a picture, to the neural networks that "erase" everything but the voice.

📊 Spectrograms & the STFT 🎭 Soft masks 🤖 Roformer ensemble

The core idea: turn sound into a picture, edit the picture, turn it back

A raw audio file is just a long list of numbers — the position of your speaker cone, tens of thousands of times per second. That waveform is almost impossible to edit surgically, because the vocal and the drums and the bass are all summed into the same wiggle. So the first move in almost every separation system is to convert the audio into a representation where the different instruments become visually distinct: a spectrogram.

A spectrogram is made with a tool called the Short-Time Fourier Transform (STFT). The STFT slides a small window along the audio and, for each window, asks "which frequencies are present right now, and how loud?" Stack those answers side by side and you get a 2D image: time runs left to right, frequency runs bottom (low/bass) to top (high/cymbals), and brightness shows loudness. A bass line is a bright band low down; a hi-hat is a fizz at the top; a sung note shows up as a fundamental line with a ladder of harmonics above it.
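To make the sliding-window idea concrete, here is a minimal sketch in plain Python. It is an illustration only: the frame size, hop, and function names are assumptions chosen for readability, and real systems use an optimized FFT library rather than this slow, naive Fourier sum.

```python
import cmath
import math


def hann(n):
    """Periodic Hann window: fades each frame's edges toward zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return a spectrogram as a list of frames (time); each frame holds
    one magnitude per frequency bin, from 0 Hz up to the Nyquist frequency."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + i] * window[i] for i in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT: correlate the windowed chunk with frequency bin k.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            bins.append(abs(acc))  # magnitude = "how loud" at this frequency
        frames.append(bins)
    return frames


# Toy input: a pure tone that completes exactly 8 cycles per 64-sample frame.
sample_rate = 8000
tone_hz = 8 * sample_rate / 64  # 1000 Hz, which lands exactly on bin 8
signal = [math.sin(2 * math.pi * tone_hz * t / sample_rate) for t in range(512)]

spectrogram = stft_magnitudes(signal)
loudest_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
print(len(spectrogram), "frames x", len(spectrogram[0]), "bins; loudest bin:", loudest_bin)
```

The single tone shows up as one bright row in every frame, which is the "bright band" described above in its simplest form.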

Once sound is a picture, separation becomes an image problem: figure out which pixels belong to the voice. Solve that, then run the STFT in reverse — the inverse STFT — to turn the edited picture back into audio you can listen to. Everything else is detail about how the "figure out which pixels" step is done.
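The round trip can be sketched the same way. The example below is a toy under stated assumptions (a 16-sample frame, 50% overlap, a periodic Hann window, and hypothetical function names); it skips the editing step entirely and only shows that going to the picture and back recovers the original samples.

```python
import cmath
import math

FRAME = 16
HOP = FRAME // 2  # 50% overlap between neighbouring frames


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def stft(samples):
    """Complex spectrogram: one DFT per Hann-windowed frame."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / FRAME) for i in range(FRAME)]
    return [dft([samples[s + i] * window[i] for i in range(FRAME)])
            for s in range(0, len(samples) - FRAME + 1, HOP)]


def istft(frames, length):
    """Inverse STFT by overlap-add: invert each frame, then sum the overlaps.
    At 50% overlap the periodic Hann windows add up to exactly 1."""
    out = [0.0] * length
    for index, spec in enumerate(frames):
        for i, value in enumerate(idft(spec)):
            out[index * HOP + i] += value
    return out


signal = [math.sin(2 * math.pi * 3 * t / 64) + 0.5 * math.cos(2 * math.pi * 7 * t / 64)
          for t in range(64)]
rebuilt = istft(stft(signal), len(signal))

# The first and last HOP samples are covered by only one window, so compare
# the interior; real implementations pad the signal to avoid this edge effect.
max_error = max(abs(signal[i] - rebuilt[i]) for i in range(HOP, len(signal) - HOP))
print("max interior error:", max_error)
```

A separator does its work between `stft` and `istft`: it changes the frame values before they are turned back into samples.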

Soft masks: the model paints a transparency map

The naive approach would be to have the model classify each pixel as "vocal" or "not vocal" — a hard yes/no mask. That sounds clean but it isn't, because at any given moment a vocal and a guitar can share the exact same frequency. A hard cut leaves jagged, robotic-sounding gaps.

Modern systems instead predict a soft mask: for every point in the spectrogram, a value between 0 and 1 saying "what fraction of the energy here belongs to the vocal?" Think of it as a transparency layer painted over the picture — 1.0 means "this is fully voice, keep all of it," 0.0 means "this is fully background, remove it," and 0.6 means "mostly voice, keep most of it." Multiply the original spectrogram by this mask and you get the vocal's spectrogram; multiply by (1 minus the mask) and you get the instrumental. Because the mask is continuous, frequencies that are shared get split proportionally instead of brutally chopped, which is why good modern separation sounds smooth rather than gated.
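As a small sketch of that multiplication, here is a toy spectrogram (rows are time frames, columns are frequency bins) and a hand-written mask. The numbers and the function name are made up for illustration; in a real system the mask comes from the network and is applied to the complex STFT so that phase is carried along too.

```python
def split_with_mask(mix, mask):
    """Apply a soft mask and its complement to a magnitude spectrogram."""
    vocal = [[m * x for m, x in zip(m_row, x_row)]
             for m_row, x_row in zip(mask, mix)]
    instrumental = [[(1.0 - m) * x for m, x in zip(m_row, x_row)]
                    for m_row, x_row in zip(mask, mix)]
    return vocal, instrumental


# Hypothetical 2-frame, 3-bin mixture and mask.
mix = [[4.0, 2.0, 0.5],
       [3.0, 1.0, 0.2]]
mask = [[1.0, 0.6, 0.0],   # fully voice, mostly voice, fully background
        [0.9, 0.5, 0.1]]

vocal, instrumental = split_with_mask(mix, mask)
print("vocal:       ", vocal)
print("instrumental:", instrumental)
```

Because the two masks sum to 1 at every point, the vocal and instrumental spectrograms always add back up to the original mixture.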

The neural network's entire job is to look at the mixed spectrogram and paint that mask accurately. It learns to do this by training on thousands of songs for which the isolated stems are known, gradually getting better at recognizing what a human voice looks like even when it's buried under a full band. When you run the acapella extractor or the instrumental extractor, you're seeing two sides of the same mask: keep the voice, or keep everything but the voice.

The evolution: from phase tricks to Roformers

It's worth understanding how we got here, because each generation fixed a specific weakness of the last.

Phase cancellation (the pre-AI hack)

The oldest "vocal remover" trick needs no AI at all. In many mixes the lead vocal is panned dead center, equally in the left and right channels. If you invert one channel and sum the two, anything identical in both cancels out — taking the centered vocal with it. It's clever and instant, but crude: it also kills the bass and kick (usually centered too), it fails entirely on mono or stereo-spread vocals, and what's left is hollow. This is the technique behind a lot of old free tools, and its limitations are exactly what AI was built to overcome.
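The trick is short enough to sketch with a few toy samples. The signals and the function name here are invented for illustration: a vocal and a bass that are identical in both channels, and a guitar that sits only in the left channel.

```python
def cancel_center(left, right):
    """Invert the right channel and sum: anything identical in both cancels."""
    return [l - r for l, r in zip(left, right)]


# Hypothetical four-sample stems.
vocal = [0.3, -0.1, 0.4, 0.2]     # panned dead center
bass = [0.5, 0.5, -0.5, -0.5]     # also centered
guitar = [0.1, 0.2, -0.1, 0.05]   # left channel only

left = [v + b + g for v, b, g in zip(vocal, bass, guitar)]
right = [v + b for v, b in zip(vocal, bass)]

residue = cancel_center(left, right)
print(residue)  # only the guitar survives (up to rounding)
```

The vocal is gone, but so is the bass, which is the hollowness described above. The output is also mono, since the two channels were collapsed into one.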

Spleeter (deep learning goes mainstream, 2019)

Spleeter, released by Deezer, put spectrogram-mask separation in everyone's hands. A convolutional network predicted masks for vocals, drums, bass and "other." It was a leap over phase tricks and is still widely used because it's fast and lightweight — but it was trained on a limited dataset and tends to leave audible bleed and muffled highs by today's standards.

Demucs (working in the waveform, 2019–2022)

Meta's Demucs took a different angle: instead of editing a spectrogram, it works directly on the raw waveform, which helps it preserve transients (the sharp attack of a drum or a consonant) that pure spectrogram methods can smear. Later versions (Hybrid Demucs) combine a waveform branch with a spectrogram branch to get the strengths of both. Demucs became the strong open-source baseline that newer models are measured against. We compare it head-to-head with the current top model in BS-Roformer vs Demucs.

MDX and the Roformers (transformers take over, 2022–present)

The MDX line (including MDX23C) pushed convolutional spectrogram models further, and then the Roformer family, BS-Roformer (Band-Split Roformer) and Mel-Band Roformer, brought the transformer architecture, the same idea behind large language models, to audio. A transformer's "attention" lets the model relate a moment in the song to every other moment, so it can use musical context: it can tell that a sustained note is part of a vocal phrase rather than a stray synth. Both Roformer variants split the frequency axis into bands, with finer resolution down low and coarser resolution up high, and the Mel-Band variant follows the mel scale that approximates human pitch perception, so the model spends its capacity where the ear is most sensitive. These models set the current state of the art for vocal isolation.
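To see what "finer down low, coarser up high" means in numbers, here is a sketch that spaces band edges evenly on the mel scale. The conversion formula is one common mel-scale convention; the band count, the 22,050 Hz ceiling, and the function names are assumptions for illustration, and the actual band layouts inside BS-Roformer and Mel-Band Roformer differ from this toy.

```python
import math


def hz_to_mel(hz):
    """A widely used mel-scale approximation."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(num_bands, max_hz):
    """Band edges in Hz that are evenly spaced on the mel scale."""
    top = hz_to_mel(max_hz)
    return [mel_to_hz(top * i / num_bands) for i in range(num_bands + 1)]


edges = mel_band_edges(num_bands=8, max_hz=22050.0)
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:8.0f} Hz to {hi:8.0f} Hz  (width {hi - lo:7.0f} Hz)")
```

Each band is wider than the one below it, so the low end, where fundamentals and much of the vocal energy live, gets the most bands per hertz.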

What "SDR" actually measures

You'll see separation quality quoted as SDR, the Signal-to-Distortion Ratio, in decibels. Intuitively, it compares how much of the output is the true target stem versus how much is error: leftover bleed, missing pieces, and artifacts the model invented. Higher is better, and because decibels are logarithmic, small numbers mean real differences: every 3 dB gained roughly halves the error energy relative to the signal, so a few dB of SDR is the gap between "I can hear the drums leaking" and "this sounds like the studio vocal."
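In its simplest form, SDR is the energy of the true stem divided by the energy of the error, expressed in decibels. The sketch below uses that plain energy-ratio form with toy data; benchmark toolkits differ in details (averaging over chunks or channels, for example), so treat it as an illustration rather than the exact procedure behind any published score.

```python
import math


def sdr_db(reference, estimate):
    """10 * log10(energy of the true stem / energy of the error)."""
    signal_energy = sum(r * r for r in reference)
    error_energy = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if error_energy == 0.0:
        return math.inf  # a perfect estimate has no error to measure
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy reference stem, plus two estimates that are off by 10% and by 5%.
reference = [math.sin(2 * math.pi * t / 50) for t in range(200)]
rough = [0.90 * r for r in reference]
better = [0.95 * r for r in reference]

print(round(sdr_db(reference, rough), 2), "dB")
print(round(sdr_db(reference, better), 2), "dB")
```

Halving the error amplitude (from 10% to 5%) quarters the error energy, which is worth about 6 dB of SDR.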

SDR is the standard yardstick used in research benchmarks like the MDX/SDX challenges, which is why it's the fairest way to compare tools. Treat any single number as approximate — it depends on the test set and the exact stem — but as a relative measure it's reliable. AIVoiceSeparator's Studio pipeline measures around 12.97 dB SDR, which sits meaningfully above the classic Demucs baseline. That gain is what you actually hear as cleaner highs and less bleed when you separate a YouTube track or a TikTok clip.

Why an ensemble beats any single model

Here's the key design choice in our pipeline: instead of betting on one network, Studio mode runs three and combines them. Each model has different strengths and different failure modes: one might excel at preserving breathy highs but occasionally leak a snare, another might null the snare perfectly but dull the air. Crucially, their mistakes are largely uncorrelated: they tend to err in different places.

When you average their outputs with sensible weights, the parts they agree on (the real vocal) reinforce each other, while the parts they disagree on (each model's individual artifacts) partially cancel out. It's the same statistical principle behind asking a panel of experts instead of one — the consensus is more reliable than any single opinion. Our weighting is:

Model	Weight	Strength it contributes
BS-Roformer (Band-Split)	40%	State-of-the-art overall isolation; the anchor model
Mel-Band Roformer	35%	Hearing-aligned frequency bands; smooth, natural top end
MDX23C InstVoc	25%	A different architecture for decorrelated errors and robustness
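Here is a minimal sketch of that weighted average, using the weights from the table. Everything else is an assumption for illustration: the dictionary keys, the function names, and the four-sample "outputs", which are built so that each model makes its mistake on a different sample.

```python
# Weights from the table above; the key names are hypothetical labels.
WEIGHTS = {"bs_roformer": 0.40, "mel_band_roformer": 0.35, "mdx23c_instvoc": 0.25}


def blend(outputs, weights=WEIGHTS):
    """Weighted average of per-model outputs, sample by sample."""
    total = sum(weights.values())
    length = len(next(iter(outputs.values())))
    return [
        sum(weights[name] * outputs[name][i] for name in weights) / total
        for i in range(length)
    ]


truth = [0.5, -0.2, 0.8, 0.0]  # the "real" vocal in this toy example
outputs = {
    "bs_roformer":       [0.7, -0.2, 0.8, 0.0],  # off by 0.2 on sample 0
    "mel_band_roformer": [0.5, -0.4, 0.8, 0.0],  # off by 0.2 on sample 1
    "mdx23c_instvoc":    [0.5, -0.2, 1.0, 0.0],  # off by 0.2 on sample 2
}


def worst_error(estimate):
    return max(abs(e - t) for e, t in zip(estimate, truth))


blended = blend(outputs)
print("worst single-model error:", max(worst_error(o) for o in outputs.values()))
print("worst blended error:     ", worst_error(blended))
```

Each model's error is diluted by the two models that got that sample right. When errors overlap or point the same way, as they sometimes do on real audio, the benefit shrinks accordingly.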

The cost is time and compute — running three models takes roughly five to six minutes for a five-minute song on the GPU — but the payoff is the cleaner, more natural result that single-model tools can't match. For the deeper single-model comparison, see BS-Roformer vs Demucs.

The last step: EBU R128 loudness normalization

Separation isn't quite the end. When you pull a stem out of a mix, its loudness is whatever was left after the model removed everything else — which can be surprisingly quiet, or inconsistent between the vocal and the instrumental. If you dropped those raw stems into a project they'd sit at random levels.

So as a final pass, AIVoiceSeparator applies EBU R128 loudness normalization — the same broadcast standard used to keep TV and streaming volume consistent. R128 measures perceived loudness (LUFS), not just peak level, and adjusts the stem to a sensible target. The practical result: your vocal and instrumental come out at predictable, usable levels that drop straight into a DAW or a karaoke setup without you having to ride the gain. Combined with the BPM and key detection that runs on every job, you get stems that are not just clean but immediately ready to work with.
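The adjustment itself is a single gain change once the loudness has been measured. The sketch below covers only that last step and rests on assumptions: measuring LUFS properly requires the filtering and gating defined in ITU-R BS.1770, which is omitted here, and the -23 LUFS default is the EBU R128 programme target, not necessarily the target this pipeline uses. Function names are hypothetical.

```python
def gain_for_target(measured_lufs, target_lufs=-23.0):
    """Linear gain that moves a stem from its measured loudness to the target."""
    gain_db = target_lufs - measured_lufs  # loudness units map 1:1 to dB of gain
    return 10.0 ** (gain_db / 20.0)


def normalize(samples, measured_lufs, target_lufs=-23.0):
    """Scale every sample by the same gain; the measurement is done elsewhere."""
    gain = gain_for_target(measured_lufs, target_lufs)
    return [s * gain for s in samples]


# Toy stem assumed to have been measured at -29 LUFS, i.e. 6 LU too quiet.
quiet_stem = [0.05, -0.10, 0.08, -0.02]
louder = normalize(quiet_stem, measured_lufs=-29.0)

print("gain factor:", round(gain_for_target(-29.0), 3))
print(louder)
```

A 6 dB boost roughly doubles every sample. Production normalizers typically also check the resulting peak level so that a large boost does not clip.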

Putting it all together

So the full journey of your song through the system is: the waveform becomes a spectrogram via the STFT → three neural networks (two Roformers and MDX23C) each predict a soft mask for the vocal → those masks are combined in a weighted ensemble → the masked spectrograms are turned back into audio with the inverse STFT → the stems are loudness-normalized to EBU R128 and tagged with BPM and key. What feels like un-baking a cake is really a pipeline of well-understood steps, each one cleaning up a weakness of the step before. The whole thing runs on a private GPU in Thailand, every job is deleted after 24 hours, and none of your audio is ever used to train models.

Frequently asked questions

What's the difference between a spectrogram and a waveform?

A waveform is the raw amplitude over time — hard to edit per-instrument. A spectrogram (made by the STFT) shows which frequencies are present at each moment, turning separation into an image-editing problem where instruments become visually distinct.

What is a soft mask?

A per-point transparency map between 0 and 1 saying how much of the energy at each time-frequency spot belongs to the vocal. Multiplying the spectrogram by it keeps the voice; using its inverse keeps the instrumental.

What does SDR mean?

Signal-to-Distortion Ratio, in decibels — how much of the output is the true stem versus bleed and artifacts. Higher is better; a few dB is an audible difference. Treat absolute numbers as approximate but trust the relative comparison.

Why use three models instead of one?

Each model's errors land in different places. Averaging them with weights reinforces the real vocal and partially cancels each model's individual artifacts — a more reliable result than any single network.

Why normalize loudness at the end?

Raw separated stems come out at unpredictable levels. EBU R128 normalization sets a consistent perceived loudness so the stems drop straight into a DAW or karaoke app without manual gain-riding.

Does AI separation always work perfectly?

No. Live recordings, brick-wall masters and dense harmony stacks are harder because instruments and vocals overlap heavily. Clean studio sources give the best results.