Adversarial PIT for
Universal Sound Separation


Universal sound separation consists of separating mixes of arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source-agnostic models that do so. In this work, we complement PIT with adversarial losses, but find this challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators. Our experiments show that, by simply improving the loss (keeping the same model and dataset), we obtain a non-negligible improvement of 1.4 dB SI-SNRi on the reverberant FUSS dataset. We also find adversarial PIT to be effective at reducing spectral holes, which are ubiquitous in mask-based separation models, highlighting the potential relevance of adversarial losses for source separation.
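As background, PIT evaluates the training loss under every assignment between estimated and reference sources and keeps only the best one, so the model need not output sources in any fixed order. Below is a minimal NumPy sketch of standard PIT with a generic per-source loss; it illustrates the permutation search only, not the I-replacement adversarial variant introduced in the paper:

```python
import itertools
import numpy as np

def pit_loss(est, ref, loss_fn):
    """Permutation invariant loss for one example.

    est, ref: arrays of shape (n_sources, n_samples).
    loss_fn:  per-source loss, e.g. MSE or negative SI-SNR.
    Returns the loss under the best estimate-to-reference assignment.
    """
    n_src = ref.shape[0]
    best = np.inf
    for perm in itertools.permutations(range(n_src)):
        # Average per-source loss under this assignment of estimates to references.
        loss = np.mean([loss_fn(est[p], ref[i]) for i, p in enumerate(perm)])
        best = min(best, loss)
    return best
```

In practice this runs batched in a deep learning framework, and for larger source counts the factorial scan over permutations is typically replaced by the Hungarian algorithm.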

Here you will find two separation examples for each type of mix we considered: mixes with 2 sources, with 3 sources, with 4 sources, and real-world mixes. Adversarial PIT separations are produced by the model reporting the best SI-SNRi score (13.8 dB) in our paper.


FUSS evaluation set #106 (mix with 2 sources)


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: Ground truth (sources #3 and #4 are silent), DCASE baseline, Adversarial PIT.]

FUSS evaluation set #721 (mix with 2 sources)


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: Ground truth (sources #3 and #4 are silent), DCASE baseline, Adversarial PIT.]

FUSS evaluation set #427 (mix with 3 sources)


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: Ground truth (source #4 is silent), DCASE baseline, Adversarial PIT.]

FUSS evaluation set #430 (mix with 3 sources)


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: Ground truth (source #4 is silent), DCASE baseline, Adversarial PIT.]

FUSS evaluation set #947 (mix with 4 sources)


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: Ground truth, DCASE baseline, Adversarial PIT.]

FUSS evaluation set #38 (mix with 4 sources)


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: Ground truth, DCASE baseline, Adversarial PIT.]

Real-world mix: brass band + applause


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: DCASE baseline, Adversarial PIT (no ground truth is available for real-world mixes).]

Real-world mix: cafeteria


[Audio players — columns: mixture, source #1, source #2, source #3, source #4; rows: DCASE baseline, Adversarial PIT (no ground truth is available for real-world mixes).]

Discussion


Our best model's separations more closely match the ground-truth sources and contain fewer spectral holes than the DCASE baseline. Spectral holes are ubiquitous across mask-based source separation models: they are the unpleasant result of over-suppressing a source in a band where other sources are present. Adversarial training is well suited to tackle this issue, since the discriminators penalize spectral holes (which are not present in the training dataset) and thereby improve the realism of the separations.

We also compare our best model's score (13.8 dB SI-SNRi) against the lower and upper bounds (0 and 25.3 dB SI-SNRi, respectively) and see that there is still room for improvement. This can also be noted in the examples above. One issue arises with mixes of 3 or 4 sources, where the model "over-separates" (splitting one source into two) or "under-separates" (grouping multiple sources into a single output). The quality of the separations also degrades with mixes containing more sources, where there is more time-frequency overlap between them, and with real-world examples.
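For context, the scores above are based on the scale-invariant SNR: SI-SNRi is simply the SI-SNR of the estimate minus the SI-SNR of the unprocessed mixture, both measured against the reference. A minimal NumPy sketch of the metric:

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference signal."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Optimal scaling: project the estimate onto the reference.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, mix, ref):
    """SI-SNRi: how much the separation improves over the raw mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)
```

The projection step makes the metric invariant to the overall gain of the estimate, which is why mask-based models are not rewarded for simply rescaling their outputs.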