Here we present audio samples in which we slice the acoustic scene based on several
signal characteristics and the corresponding discriminative semantic concept values,
such as the gender of the speaker (female/male, as self-identified by the dataset’s speakers),
the spatial location (near/far), and
the language (English/French/German/Spanish).
The following are randomly sampled examples from our paper:
"Heterogeneous target speech separation." In Proceedings of Interspeech, 2022,
pp. 1796-1800.
[pdf][code][video][slides]
Spatial location of the speaker (aka distance-based separation)
Query: Near-field speaker
Imagine a situation where you want to focus on the person nearest to
the microphone (i.e., the person right in front of you).
Type of sound
Waveform
Input mixture
Ground-truth target waveform (nearest speaker)
Estimated target waveform (est. nearest speaker)
Ground-truth non-target waveform (optional)
Estimated non-target waveform (optional)
Query: Far-field speaker
Imagine a situation where you want to focus on the person farthest from
the microphone.
Type of sound
Waveform
Input mixture
Ground-truth target waveform (farthest speaker)
Estimated target waveform (est. farthest speaker)
Ground-truth non-target waveform
Estimated non-target waveform (optional)
A multi-lingual mixture with query: Near-field speaker
This is a slightly different scenario: the model can still retrieve the speaker closest to
the microphone, but it can also exploit semantic information that was used to train the network,
such as language conditioning. Notice that the model does not know exactly where to put the cap sound,
but it disentangles the two speakers nicely and in the right order :)
Type of sound
Waveform
Input mixture
Ground-truth target waveform (nearest speaker)
Estimated target waveform (est. nearest speaker)
Ground-truth non-target waveform
Estimated non-target waveform (optional)
Gender conditioning of the speakers
Query: Female speaker with a cross-gender input-mixture
Imagine a situation where you want to focus on the female speaker.
Type of sound
Waveform
Input mixture
Ground-truth target waveform (female speaker)
Estimated target waveform (est. female speaker)
Ground-truth non-target waveform
Estimated non-target waveform (optional)
Query: Female speaker from a two-female speaker input-mixture
In the paper we perform an extensive analysis of how to deal with these degenerate cases, where
either the target or the non-target waveform is zero or equal to the input mixture itself. Here the model
has been trained using a small sampling probability for such cases during heterogeneous training.
Notice that the estimated non-target (aka other) waveform is almost silent.
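To make the degenerate cases concrete, here is a minimal sketch of how a query splits a set of labeled sources into a target and a non-target waveform. It is only an illustration, not the code released with the paper: the `Source` class, its field names, and `split_by_query` are hypothetical, and the waveforms are tiny hand-written lists rather than real audio.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical container for one speaker and its concept values."""
    samples: list   # mono waveform as a list of floats
    gender: str     # "female" or "male"
    distance: str   # "near" or "far"
    language: str   # e.g. "English", "French"


def split_by_query(sources, concept, value):
    """Return (mixture, target, non_target) for the query `concept == value`.

    Sources that match the query are summed into the target; all other
    sources are summed into the non-target (aka other) waveform.
    """
    n = len(sources[0].samples)
    target = [0.0] * n
    non_target = [0.0] * n
    for src in sources:
        bucket = target if getattr(src, concept) == value else non_target
        for i, x in enumerate(src.samples):
            bucket[i] += x
    mixture = [t + o for t, o in zip(target, non_target)]
    return mixture, target, non_target


# A two-female-speaker mixture (toy values, assumed for illustration).
two_females = [
    Source([0.1, -0.2, 0.3], "female", "near", "English"),
    Source([0.05, 0.1, -0.1], "female", "far", "French"),
]

# Query "female": every source matches, so the target is the mixture itself
# and the non-target waveform is all zeros.
mixture, target, non_target = split_by_query(two_females, "gender", "female")

# Query "male": no source matches, so the target is all zeros instead.
_, male_target, male_other = split_by_query(two_females, "gender", "male")

# Query "near": a regular, non-degenerate case on the same mixture.
_, near_target, near_other = split_by_query(two_females, "distance", "near")
```

With the "female" query on this mixture, the ideal non-target output is pure silence, which is why an almost silent estimated non-target waveform is the desired behavior in the example above.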