What if you could separate sounds based on different, non-mutually exclusive semantic concepts?

Posted by Thymios on December 3rd, 2022

Here we present some audio samples in which we slice the acoustic scene according to several signal characteristics and their discriminative semantic concept values: the gender of the speaker (female/male, as self-identified by the dataset’s speakers), the spatial location (near/far), and the language (English/French/German/Spanish). The following are some randomly sampled examples from our paper:

"Heterogeneous target speech separation." In Proceedings of Interspeech, 2022, pp. 1796-1800. [pdf] [code] [video] [slides]

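To make the idea of a query concrete, here is a minimal sketch of how a single (concept, value) pair could be encoded as a conditioning vector. The one-hot layout, the `CONCEPTS` dictionary, and the `encode_query` helper are illustrative assumptions and are not taken from the released code.

```python
# Hypothetical concept vocabulary mirroring the values listed in this post.
CONCEPTS = {
    "gender": ["female", "male"],
    "location": ["near", "far"],
    "language": ["English", "French", "German", "Spanish"],
}

# Flatten the vocabulary into an ordered list of (concept, value) slots.
QUERY_SLOTS = [(c, v) for c, values in CONCEPTS.items() for v in values]


def encode_query(concept, value):
    """Return a one-hot list that selects a single (concept, value) pair."""
    vec = [0.0] * len(QUERY_SLOTS)
    vec[QUERY_SLOTS.index((concept, value))] = 1.0
    return vec


# Example queries matching two of the demos on this page.
german_query = encode_query("language", "German")
near_query = encode_query("location", "near")
```

Because the concepts are not mutually exclusive, the same mixture can be sliced in different ways simply by switching the query vector.
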
Spatial location of the speaker (aka distance-based separation)

Query: Near-field speaker

Imagine a situation where you want to focus on the person nearest to the microphone (i.e., the person right in front of you).
Type of sound Waveform
Input mixture
Ground-truth target waveform (nearest speaker)
Estimated target waveform (est. nearest speaker)
Ground-truth non-target waveform (optional)
Estimated non-target waveform (optional)

Query: Far-field speaker

Imagine a situation where you want to focus on the person farthest from the microphone.
Type of sound Waveform
Input mixture
Ground-truth target waveform (farthest speaker)
Estimated target waveform (est. farthest speaker)
Ground-truth non-target waveform
Estimated non-target waveform (optional)

A multi-lingual mixture with query: Near-field speaker

This is a slightly different scenario where the model can still retrieve the speaker closest to the microphone but can also exploit semantic information used to train the network, such as language conditioning. Notice that the model does not know exactly where to put the cap sound but disentangles the two speakers really nicely and in the right order :)
Type of sound Waveform
Input mixture
Ground-truth target waveform (nearest speaker)
Estimated target waveform (est. nearest speaker)
Ground-truth non-target waveform
Estimated non-target waveform (optional)

Gender conditioning of the speakers

Query: Female speaker with a cross-gender input-mixture

Imagine a situation where you want to focus on the female speaker.
Type of sound Waveform
Input mixture
Ground-truth target waveform (female speaker)
Estimated target waveform (est. female speaker)
Ground-truth non-target waveform
Estimated non-target waveform (optional)

Query: Female speaker from a two-female speaker input-mixture

In the paper we perform an extensive analysis of how to deal with these degenerate cases, where either the target or the non-target waveform is zero or equal to the input mixture itself. Here the model has been trained using a small sampling probability during heterogeneous training to deal with such cases. Notice that the estimated non-target (aka other) waveform is almost silent.
Type of sound Waveform
Input mixture
Ground-truth target waveform (both female speakers)
Estimated target waveform (est. both female speakers)
Ground-truth non-target waveform
Estimated non-target waveform (optional)
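
The degenerate cases can be illustrated with a small sketch of how the ground-truth target and non-target waveforms come out for a given query. It assumes that every source matching the query is summed into the target and every other source into the non-target; the `build_targets` helper and the toy waveforms are hypothetical and only serve the illustration.

```python
def build_targets(sources, attributes, concept, value):
    """Sum matching sources into the target and the rest into the non-target."""
    num_samples = len(sources[0])
    target = [0.0] * num_samples
    other = [0.0] * num_samples
    for waveform, attrs in zip(sources, attributes):
        # Pick the output that this source contributes to.
        bucket = target if attrs[concept] == value else other
        for i, sample in enumerate(waveform):
            bucket[i] += sample
    return target, other


# Toy three-sample "waveforms" standing in for two speakers.
sources = [[0.5, -0.25, 1.0], [0.25, 0.5, -1.0]]
mixture = [a + b for a, b in zip(*sources)]

# Two female speakers, query "female": the target equals the mixture.
two_female = [{"gender": "female"}, {"gender": "female"}]
target_ff, other_ff = build_targets(sources, two_female, "gender", "female")

# Two male speakers, query "female": the target is all zeros.
two_male = [{"gender": "male"}, {"gender": "male"}]
target_mm, other_mm = build_targets(sources, two_male, "gender", "female")
```

In the first case the non-target waveform is silent, and in the second case the target waveform is silent, which is exactly the behavior to listen for in the estimated waveforms of these demos.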

Query: Female speaker from a two-male input-mixture

In a similar fashion, the estimated target waveform is now almost silent, and all the energy should go to the non-target (other) waveform.
Type of sound Waveform
Input mixture
Ground-truth target waveform (no female speakers)
Estimated target waveform (est. female speakers)
Ground-truth non-target waveform
Estimated non-target waveform (optional)

Language conditioning of the speakers

Query: German speaker from a multi-lingual mixture

Give me the German speaker, I will learn French another time! We noticed that language is one of the most challenging conditions in our experiments.
Type of sound Waveform
Input mixture
Ground-truth target waveform (German speaker)
Estimated target waveform (est. German speaker)
Ground-truth non-target waveform (French speaker)
Estimated non-target waveform (optional output - est. French speaker)

A really hard query with a Spanish and an English speaker

Extract the Spanish speaker from this hard mixture!
Type of sound Waveform
Input mixture
Ground-truth target waveform (Spanish speaker)
Estimated target waveform (est. Spanish speaker)
Ground-truth non-target waveform (English speaker)
Estimated non-target waveform (optional output - est. English speaker)