Source separation for music is the task of isolating contributions, or stems, from different instruments recorded individually and arranged together to form a song. Such components include voice, bass, drums and any other accompaniments.
Source separation models operate in either the spectrogram domain or the waveform domain.
- Spectrogram-based models use frequency-domain representations of the audio signal as input, while waveform-based models work directly on the raw waveform.
- Spectrogram-based models can apply convolutions along the frequency axis, while waveform-based models have no explicit frequency axis and instead rely on layers that are fully connected across their channels.
Unlike audio synthesis tasks that generate waveforms directly, state-of-the-art approaches in music source separation still operate on the spectrograms generated by the short-time Fourier transform (STFT). They produce a mask on the magnitude spectrum for each frame and each source, and the output audio is generated by running an inverse STFT on the masked spectrograms, reusing the phase of the input mixture.
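As a rough illustration, here is a minimal sketch of that masking pipeline in PyTorch; `mask_model` is a hypothetical stand-in for any spectrogram-domain separator, and the STFT parameters are arbitrary:

```python
import torch

def separate_with_masks(mixture, mask_model, n_fft=4096, hop=1024):
    """Predict a magnitude mask per source, then invert with the mixture phase."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft, hop_length=hop, window=window,
                      return_complex=True)               # (freq, frames), complex
    magnitude, phase = spec.abs(), spec.angle()

    # mask_model is assumed to output one mask in [0, 1] per source
    # and per time-frequency bin: (n_sources, freq, frames).
    masks = mask_model(magnitude)

    estimates = []
    for mask in masks:
        masked = mask * magnitude * torch.exp(1j * phase)    # reuse the mixture phase
        estimates.append(torch.istft(masked, n_fft, hop_length=hop, window=window,
                                     length=mixture.shape[-1]))
    return torch.stack(estimates)                        # (n_sources, time)
```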
Waveform Domain Architectures:
- Demucs
- Hybrid Demucs
- Demucs with Transformers
- Band-split RNN
- Conv-Tasnet
1. DEMUCS: (Deep Extractor for Music Sources)
Motivation: Conv-Tasnet, originally designed for monophonic speech separation and audio sampled at 8 kHz, was adapted to stereophonic music source separation at 44.1 kHz. While Conv-Tasnet separates the different sources with high accuracy, artifacts were observed when listening to the generated audio: a constant broadband noise, hollow instrument attacks, or even missing parts. They were especially noticeable on the drums and bass sources.
DEMUCS Key Features:
- Waveform-to-Waveform model
- Similar to Conv-Tasnet, Demucs directly operates on the raw input waveform and generates a waveform for each source. In other words, Demucs takes a stereo mixture as input and outputs a stereo estimate for each source (C = 2).
- It is an encoder/decoder architecture composed of a convolutional encoder, a bidirectional LSTM, and a convolutional decoder (based on wide transposed convolutions with large strides), with the encoder and decoder linked by U-Net skip connections (see the sketch after this list).
- The other critical features of the approach are increasing the number of channels exponentially with depth, gated linear units (GLU) as the activation function, which also allow for masking, and a new initialization scheme.
- Experiments on the MusDB dataset show that, with proper data augmentation, Demucs surpasses all state-of-the-art architectures in the waveform or spectrogram domain by at least 0.3 dB of SDR.
- Inspired by models for music synthesis rather than masking approaches.
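The following is a simplified, hypothetical skeleton of that layout in PyTorch: the GLU-based blocks, BiLSTM bottleneck, U-Net skips, and channel doubling follow the description above, while the exact layer counts, channel widths, kernel sizes, and details such as the rescaling initialization are assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DemucsSketch(nn.Module):
    """Simplified Demucs-style separator: strided conv encoder, BiLSTM bottleneck,
    transposed-conv decoder, U-Net skip connections, GLU activations, and channels
    doubling with depth. Illustrative only; hyperparameters are assumptions."""
    def __init__(self, sources=4, audio_channels=2, channels=64, depth=6,
                 kernel=8, stride=4):
        super().__init__()
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        in_ch = audio_channels
        for i in range(depth):
            out_ch = channels * 2 ** i                        # channels grow with depth
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel, stride), nn.ReLU(),
                nn.Conv1d(out_ch, 2 * out_ch, 1), nn.GLU(dim=1)))   # GLU halves channels
            dec_out = audio_channels * sources if i == 0 else in_ch
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(out_ch, 2 * out_ch, 3, padding=1), nn.GLU(dim=1),
                nn.ConvTranspose1d(out_ch, dec_out, kernel, stride),
                nn.Identity() if i == 0 else nn.ReLU()))
            in_ch = out_ch
        self.lstm = nn.LSTM(in_ch, in_ch, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.lstm_proj = nn.Linear(2 * in_ch, in_ch)

    def forward(self, mix):                                   # mix: (batch, 2, time)
        skips, x = [], mix
        for layer in self.encoder:
            x = layer(x)
            skips.append(x)
        x = self.lstm_proj(self.lstm(x.transpose(1, 2))[0]).transpose(1, 2)
        for layer in self.decoder:
            skip = skips.pop()                                # U-Net skip connection
            length = min(x.shape[-1], skip.shape[-1])
            x = layer(x[..., :length] + skip[..., :length])
        # The output can be slightly shorter than the input; the real model
        # pads the input to a valid length to avoid this.
        return x.view(mix.shape[0], -1, 2, x.shape[-1])       # (batch, sources, 2, time)
```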
Results:
- While Conv-Tasnet outperforms several existing spectrogram-domain methods, it does suffer from large audio artifacts. On the other hand, with proper augmentation, the Demucs architecture surpasses all existing spectrogram or waveform domain architectures in terms of SDR, with 6.3 points of SDR without extra training data (against 6.0 for the best existing method D3Net), and up to 6.8 with extra training data.
- In fact, for the bass source, Demucs is the first model to surpass the IRM oracle, with 7.6 SDR. The pitch/tempo shift augmentation was found to be useful, leading to a gain of 0.4 points of SDR, in particular for a model with a large number of parameters like Demucs, while it can be detrimental to Conv-TasNet.
- There is no clear winner between waveform and spectrogram domain models: the former seem to dominate for the bass and drums sources, while the latter obtain the best performance on the vocals and other sources, as measured both by objective metrics and human evaluations.
- Spectrogram domain models have an advantage when the content is mostly harmonic and fast changing, while for sources without harmonicity (drums) or with strong and emphasized attack regimes (bass), waveform domain models better preserve the structure of the source.
- One of the main drawbacks of the Demucs model compared to other architectures is its large model size: 1014MB, against 42MB for Conv-TasNet. The size can be reduced either by lowering the initial number of channels (to 48 or 32), which shrinks the model and reduces its computational complexity, or by using the DiffQ quantization technique.
- Going down to 32 channels leads to a decrease of 0.2 dB of SDR.
- Quantization reduces the model size down to 120MB without any loss of SDR, which is still more than the 42MB of Conv-Tasnet, but close to a 10x improvement over the uncompressed baseline.
- Batch Normalization was not used as it was found to be detrimental to the model performance.
- During training, only short audio extracts are given, so a quiet part or a loud part gets scaled back to an average volume. However, an entire song used at evaluation will most likely contain both quiet and loud parts, and the normalization will not map both to the same volume, leading to a mismatch between training and evaluation (a toy example follows this list).
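A toy illustration of that train/evaluation mismatch, assuming the model standardizes whatever segment it receives by that segment's own statistics (the signals and scales here are made up):

```python
import torch

def standardize(x):
    # Scale a segment by its own statistics, as done on the model input.
    return (x - x.mean()) / (x.std() + 1e-8)

quiet = 0.01 * torch.randn(44100)     # a quiet passage (1 s at 44.1 kHz)
loud = 0.5 * torch.randn(44100)       # a loud passage
song = torch.cat([quiet, loud])       # the full track contains both

# Training: the quiet excerpt is seen on its own and scaled up to average volume.
train_view = standardize(quiet)

# Evaluation: the same samples are normalized together with the loud part,
# so they stay far quieter than anything seen during training.
eval_view = standardize(song)[:44100]

print(train_view.std().item(), eval_view.std().item())   # roughly 1.0 vs 0.03
```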
2. Hybrid DEMUCS:
Motivation:
- Waveform domain methods surpassed spectrogram ones when considering the overall SDR (6.3 dB on MusDB), although their performance is still inferior on the vocals source. Conv-Tasnet, a model based on masking over a learnt time-frequency representation using dilated convolutions, was also adapted to music source separation, but suffered from more artifacts and a lower SDR.
- To extend the prediction capabilities of DEMUCS. The original Demucs model was designed to separate two or three sources, and its performance degraded when trying to separate more sources in a mixture.
Key Features:
- Performs end-to-end hybrid waveform/spectrogram domain source separation by letting the model decide which domain is best suited for each source, and even combining both.
- Hybrid Demucs was developed to improve separation quality for mixtures with four or more sources by incorporating a hybrid approach that combines the strengths of both the frequency and time domains.
- This architecture comes with additional improvements, such as compressed residual branches comprising dilated convolutions, local attention or singular value regularization, and chunked biLSTM, and most importantly, a novel hybrid spectrogram/temporal domain U-Net structure, with parallel temporal and spectrogram branches, that merge into a common core.
- The original U-Net architecture is extended to provide two parallel branches: one in the time (temporal) and one in the frequency (spectral) domain.
- The temporal branch takes the input waveform and processes it like the standard Demucs. It contains 5 layers, which reduce the number of time steps by a factor of 4^5 = 1024. Compared with the original architecture, all ReLU activations are replaced by Gaussian Error Linear Units (GELU).
- The spectral branch takes the spectrogram obtained from an STFT over 4096 time steps, with a hop length of 1024. To reduce the frequency dimension, the same convolutions as in the temporal branch are applied, but along the frequency dimension. After the spectral encoder, the signal has only one "frequency" left, and the same number of channels and sample rate as the output of the temporal branch.
- The temporal and spectral representations are then summed before going through a shared encoder/decoder layer, which further reduces the number of time steps by 2 (using a kernel size of 4). Its output serves as the input of both the temporal and spectral decoders.
- The output of the spectral branch is mapped back to the time domain with the inverse STFT (ISTFT) and summed with the temporal branch output, giving the final model prediction (a shape sketch follows this list).
- These changes translated into strong improvements of the overall quality and absence of bleeding between sources as measured by human evaluations.
- Won the Music Demixing Challenge (MDX) 2021 organized by Sony when trained only on MusDB, with 7.32 dB of SDR, and ranked 2nd when extra training data was allowed.
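The key bookkeeping is that both branches end up with the same temporal resolution before being summed. Below is a small sanity-check sketch under the hyperparameters quoted above (5 temporal layers of stride 4, STFT with a hop of 1024); it only illustrates the alignment, not the model itself:

```python
# Both branches must produce tensors of shape (batch, channels, steps) with the
# same number of steps so they can simply be added before the shared layers.
length = 1024 * 512          # assume the mixture is padded to a multiple of 1024
hop, stride, depth = 1024, 4, 5

# Temporal branch: each of the 5 strided convolutions divides time by 4,
# so the encoder output has length / 4**5 = length / 1024 steps.
temporal_steps = length // stride ** depth

# Spectral branch: with a hop of 1024, the STFT yields one frame per 1024
# samples; the frequency axis is then collapsed to a single bin by strided
# convolutions applied along frequency, leaving a (channels, frames) signal.
spectral_frames = length // hop

assert temporal_steps == spectral_frames == 512
```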
Results:
- On the MusDB HQ benchmark, an overall improvement of 1.4 dB of Signal-to-Distortion Ratio (SDR) was observed across all sources, with the overall quality rated at 2.83 out of 5 (2.36 for the non-hybrid Demucs), and absence of contamination at 3.04 (against 2.37 for the non-hybrid Demucs).
Advantage(s):
- Due to this overall design, the model is free to use whichever representation is most convenient for different parts of the signal, even within one source, and can freely share information between the two representations.
Limitations:
- For all its gains, one limitation of this approach is the increased complexity of the U-Net encoder/decoder, requiring careful alignment of the temporal and spectral signals through well-shaped convolutions.
3. Demucs with Transformers:
Motivation:
- To check whether long range contextual information is useful, or if local acoustic features are sufficient.
- Attention based Transformers integrate information well over long sequences.
Model: A hybrid temporal/spectral bi-U-Net based on Hybrid Demucs, where the innermost layers are replaced by a cross-domain Transformer Encoder, using self-attention within one domain, and cross-attention across domains.
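A minimal sketch of one such cross-domain layer, assuming PyTorch's nn.MultiheadAttention; positional embeddings, the full layer stack, and the sparse attention kernels mentioned below are omitted:

```python
import torch
import torch.nn as nn

class CrossDomainLayer(nn.Module):
    """Self-attention within each branch, then cross-attention where the
    temporal tokens attend to the spectral tokens and vice versa."""
    def __init__(self, dim=384, heads=8):
        super().__init__()
        self.self_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_z = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_z = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_z = nn.LayerNorm(dim)

    def forward(self, t, z):
        # t: temporal-branch tokens, z: spectral-branch tokens, both (batch, seq, dim)
        t = self.norm_t(t + self.self_t(t, t, t)[0])    # self-attention, time domain
        z = self.norm_z(z + self.self_z(z, z, z)[0])    # self-attention, spectral domain
        t = self.norm_t(t + self.cross_t(t, z, z)[0])   # time attends to spectral
        z = self.norm_z(z + self.cross_z(z, t, t)[0])   # spectral attends to time
        return t, z

# Example: innermost encoder outputs of the two branches, as token sequences.
t_tokens, z_tokens = torch.randn(1, 512, 384), torch.randn(1, 512, 384)
t_tokens, z_tokens = CrossDomainLayer()(t_tokens, z_tokens)
```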
Results:
- It performs poorly when trained only on MUSDB, but it outperforms Hybrid Demucs by 0.45 dB of SDR when using 800 extra training songs.
- Using sparse attention kernels to extend its receptive field, and per-source fine-tuning, state-of-the-art results were achieved on MUSDB with extra training data, with 9.20 dB of SDR.
Conv-Tasnet: This model reuses the masking approach of spectrogram methods but learns the masks jointly with a convolutional front-end, operating directly in the waveform domain for both the inputs and outputs. On its original speech separation task, Conv-Tasnet surpasses both the IRM and IBM oracles.
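A rough sketch of that structure, with a learned 1-D convolutional front-end in place of the STFT and a single 1x1 convolution standing in for Conv-Tasnet's deep temporal convolutional network (all sizes are illustrative, and the input is mono as in the original model):

```python
import torch
import torch.nn as nn

class LearnedMaskingSketch(nn.Module):
    """Masking over a learned representation: encode the waveform with a strided
    convolution, predict one mask per source on that representation, and decode
    each masked representation back to a waveform with a transposed convolution."""
    def __init__(self, sources=4, filters=512, kernel=16, stride=8):
        super().__init__()
        self.sources = sources
        self.encoder = nn.Conv1d(1, filters, kernel, stride, bias=False)
        self.separator = nn.Conv1d(filters, sources * filters, 1)   # stand-in for the TCN
        self.decoder = nn.ConvTranspose1d(filters, 1, kernel, stride, bias=False)

    def forward(self, mix):                            # mix: (batch, 1, time)
        rep = torch.relu(self.encoder(mix))            # learned "spectrogram"
        masks = torch.sigmoid(self.separator(rep))     # one mask per source
        masks = masks.view(mix.shape[0], self.sources, -1, rep.shape[-1])
        estimates = [self.decoder(m * rep) for m in masks.unbind(dim=1)]
        return torch.stack(estimates, dim=1)           # (batch, sources, 1, time)
```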