Determined convolutive mixtures under dynamic conditions

Blind source separation (BSS) in real-world environments is a challenging task even in the simplest well-determined case, where the number of sources is known in advance and equals the number of microphones. For this reason, most algorithms proposed in the literature are evaluated in controlled scenarios: the reverberation is not very high, the length of the mixtures is given, and the sources are observed for a relatively long time and do not change their locations. However, such conditions do not reflect a real-world scenario well: the reverberation cannot be neglected, and many sources can move in the environment and overlap at different time instants.

In this proposal we define a task for evaluating BSS under dynamic conditions, in order to assess the robustness of the proposed algorithms in a more realistic scenario.

Potential candidates for this evaluation are batch on-line, batch off-line and on-line BSS algorithms.

Results

Results for task1: click here
Results for task2: click here

Datasets

Download: task1_mixture.zip (7 MB)
Download: task2_mixture.zip (26 MB)
These files are licensed for research use only by the task proposer (see below).

We consider the case where at most two sources are active at the same time and are recorded by a stereo microphone. The source mixtures are obtained by summing the individual source components recorded at each microphone. The components are generated by convolving random utterances with measured impulse responses and are contaminated with additive white Gaussian noise (AWGN) at an SNR of 40 dB.
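The mixing procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the utterances and impulse responses are synthetic stand-ins (random signals with an exponential decay), the noise is added to the summed mixture, and the SNR is computed over both channels jointly. The function names `spatial_image` and `mix_with_awgn` are hypothetical and are not part of the distributed reference software.

```python
import numpy as np

def spatial_image(source, rirs):
    """Convolve a mono source with one impulse response per microphone."""
    return np.stack([np.convolve(source, h) for h in rirs])

def mix_with_awgn(images, snr_db=40.0, rng=None):
    """Sum the source images and add white Gaussian noise at the given SNR."""
    rng = np.random.default_rng() if rng is None else rng
    clean = np.sum(images, axis=0)                 # shape: (n_mics, n_samples)
    signal_power = np.mean(clean ** 2)             # assumption: joint power of both channels
    noise_power = signal_power / 10.0 ** (snr_db / 10.0)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise, noise

rng = np.random.default_rng(0)
fs = 16000
# Synthetic stand-ins for two utterances and 2x2 measured impulse responses
sources = [rng.standard_normal(fs) for _ in range(2)]
decay = np.exp(-np.arange(2048) / 400.0)
rirs = [[rng.standard_normal(2048) * decay for _ in range(2)] for _ in range(2)]

images = [spatial_image(s, h) for s, h in zip(sources, rirs)]
mixture, noise = mix_with_awgn(images, snr_db=40.0, rng=rng)
```

Each entry of `images` is the stereo spatial image of one source, which is the quantity the `_sim_` submission files are meant to estimate.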

The impulse responses between the microphones and the different source locations (corresponding to different angular directions) were measured in a real room with a high reverberation time (T60 of around 700-800 ms). The distance between the sources and the microphones is about 1.1 m.

The microphone spacings are 2 cm, 6 cm, and 10 cm.

FILE SYNTAX for task1 data: x_<array label>_test<source combination index>.wav
FILE SYNTAX for task2 data: x_<array label>.wav

Tasks

We propose two different tasks:

separation of short audio mixtures (1-2s) obtained by random combinations of source locations and utterances;
separation of a sequence of audio mixtures obtained by random combinations of source locations and utterances. (Explanation can be found here. Example mixtures: examples.zip)

Task 2 is more realistic than task 1 but also more challenging, since the start and end points of each utterance are not known in advance. In this case the BSS algorithm should adapt the separation across the sequence as the conditions change.

In both tasks, neither the direction of arrival (DOA) nor the impulse responses are given.

Each participant is asked to submit the estimated separated sources for task1 and/or task2.

Submission

For task1, the participants should submit only two output files for each mixture.

For task2, the participants should submit only two output files. Each output will contain a sequence of the estimated utterances. The output order of the utterances does not matter.
We do not ask the participants to submit the individual utterances of each mixture because segmenting the outputs is not trivial, especially under such high reverberation.

Both on-line and batch adaptation based algorithms are welcome.

Submission method

[SUBMISSION CLOSED on Apr.21]
Each participant is asked to submit the results of his/her algorithm for tasks 1 or 2 over all the mixtures from either or both test sets.

Each participant should make his/her results available online in the form of an archive called <YourName>_task1.zip or <YourName>_task2.zip.
The included files must be named as follows:

Task1 (store all files to <YourName>_task1.zip)
- x_array<space>cm_test<combi>_src_<j>.wav: estimated source <j>, mono WAV file sampled at 16 kHz
- x_array<space>cm_test<combi>_sim_<j>.wav: estimated spatial image of source <j>, stereo WAV file sampled at 16 kHz
Task2 (store all files to <YourName>_task2.zip)
- x_<space>cm_src_<j>.wav: estimated source <j>, mono WAV file sampled at 16 kHz
- x_<space>cm_sim_<j>.wav: estimated spatial image of source <j>, stereo WAV file sampled at 16 kHz
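As a convenience, the expected file names can be generated programmatically before packing the archive. The sketch below is hypothetical: the helper names are illustrative, the source index `<j>` is assumed to start at 1, and a single `.wav` extension is assumed for every file.

```python
def task1_names(spacing_cm, combi, n_sources=2):
    """File names for one task1 mixture (source index assumed to start at 1)."""
    stem = f"x_array{spacing_cm}cm_test{combi}"
    names = []
    for j in range(1, n_sources + 1):
        names.append(f"{stem}_src_{j}.wav")   # mono estimated source
        names.append(f"{stem}_sim_{j}.wav")   # stereo estimated spatial image
    return names

def task2_names(spacing_cm, n_sources=2):
    """File names for one task2 sequence (source index assumed to start at 1)."""
    stem = f"x_{spacing_cm}cm"
    names = []
    for j in range(1, n_sources + 1):
        names.append(f"{stem}_src_{j}.wav")
        names.append(f"{stem}_sim_{j}.wav")
    return names

example_task1 = task1_names(6, 3)
example_task2 = task2_names(10)
```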

Each participant should then send an email to "nesta (at) fbk.eu" with Cc to "shoko (a) cslab.kecl.ntt.co.jp" providing:

contact information (name, affiliation)
basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
the URL of the archive(s)

Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.

Evaluation criteria

As in the evaluation of source signal estimation in SiSEC2008, we propose to evaluate the performance using the criteria defined in the BSS_EVAL toolbox. The BSS_EVAL toolbox decomposes each estimated source into a sum of components corresponding to: 1) a deformation of the original source; 2) a deformation of the other sources, accounting for the presence of unwanted interference; 3) artifacts introduced by the BSS procedure; 4) a deformation of the perturbing noise. Different performance measures are defined from this decomposition. In this task only the SIR and SDR will be evaluated, using the input/output ordering that yields the best SIR.
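Given the four components of the decomposition, the SDR is the energy ratio between the target component and the sum of the three error components, and the SIR is the energy ratio between the target component and the interference component alone. The sketch below computes both and selects an output ordering. It assumes that the decomposition has already been obtained (for example with BSS_EVAL) and that "best SIR" means the highest SIR averaged over sources, which is an assumption about the selection rule. The function names are illustrative and do not belong to the toolbox.

```python
import itertools
import numpy as np

def energy_ratio_db(numerator, denominator):
    """Energy ratio of two signals in dB."""
    return 10.0 * np.log10(np.sum(numerator ** 2) / np.sum(denominator ** 2))

def sdr_sir(s_target, e_interf, e_noise, e_artif):
    """SDR and SIR from the components of one estimated source."""
    sdr = energy_ratio_db(s_target, e_interf + e_noise + e_artif)
    sir = energy_ratio_db(s_target, e_interf)
    return sdr, sir

def best_ordering(sir_matrix):
    """Pick the permutation p maximizing the mean of sir_matrix[i, p[i]].

    sir_matrix[i, j] is the SIR in dB when output i is matched to source j.
    """
    n = sir_matrix.shape[0]
    return max(
        itertools.permutations(range(n)),
        key=lambda p: np.mean([sir_matrix[i, p[i]] for i in range(n)]),
    )

# Toy components: interference at one tenth of the target amplitude
s_target = np.ones(100)
e_interf = 0.1 * np.ones(100)
zeros = np.zeros(100)
sdr, sir = sdr_sir(s_target, e_interf, zeros, zeros)

# Toy SIR matrix in which the outputs are swapped with respect to the sources
order = best_ordering(np.array([[-3.0, 12.0], [10.0, -5.0]]))
```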

In order to obtain measures that are close to human perception, we suggest evaluating the SIR/SDR after filtering the components estimated by BSS_EVAL with an “A-weighting” filter.
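For illustration, the standard A-weighting curve can be applied to a signal as a zero-phase gain in the frequency domain. This is a minimal sketch and not necessarily how the reference software in examples.zip implements the filter; the function names are hypothetical.

```python
import numpy as np

def a_weighting_db(freq_hz):
    """Standard A-weighting gain in dB, normalized to about 0 dB at 1 kHz."""
    f2 = np.asarray(freq_hz, dtype=float) ** 2
    num = (12194.0 ** 2) * f2 ** 2
    den = ((f2 + 20.6 ** 2)
           * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
           * (f2 + 12194.0 ** 2))
    with np.errstate(divide="ignore"):      # the gain is -inf dB at 0 Hz
        return 20.0 * np.log10(num / den) + 2.0

def a_weight(signal, fs):
    """Apply A-weighting to a mono signal via the real FFT (zero phase)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    gain = 10.0 ** (a_weighting_db(freqs) / 20.0)
    return np.fft.irfft(spectrum * gain, n=len(signal))

fs = 16000
t = np.arange(fs) / fs
tone_1k = np.sin(2 * np.pi * 1000.0 * t)    # passes almost unchanged
tone_100 = np.sin(2 * np.pi * 100.0 * t)    # attenuated by roughly 19 dB
weighted_1k = a_weight(tone_1k, fs)
weighted_100 = a_weight(tone_100, fs)
```

The same weighting would be applied to each component of the decomposition before the energy ratios are computed, so that low-frequency errors contribute less to the SIR/SDR.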

Reference software for evaluation can be found in: examples.zip.

We propose to analyze the average SIR/SDR, the corresponding standard deviation, cumulative histograms of their values and, for task2, a plot showing how the performance varies across time frames.

For task1 the SIR/SDR will be computed for each individual mixture.

For task2, the performance is evaluated by splitting the output signals into segments containing two overlapping sources. To do so, we use the timestamps indicating the start and end points of the utterances in the sequence. The evaluation script uses this information to apply BSS_EVAL to the correct segments of the output sequence. In this way we avoid erroneously applying BSS_EVAL to segments where fewer than two sources are active.
Moreover, by exploiting the original timestamps we avoid evaluating the performance on overly long segments that contain more than two utterances.
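The segment selection can be pictured as a sweep over the utterance boundaries that keeps only the intervals where exactly two utterances are active. The sketch below assumes that timestamps are given as (start, end) sample indices with an exclusive end; the actual timestamp format used by the evaluation script is not specified here, and the function name is hypothetical.

```python
def two_source_segments(utterances):
    """Return the intervals in which exactly two utterances are active.

    utterances: list of (start, end) sample indices, end exclusive.
    """
    events = []
    for start, end in utterances:
        events.append((start, 1))    # an utterance begins
        events.append((end, -1))     # an utterance ends
    events.sort()                    # at equal positions, ends come before starts

    segments = []
    active = 0
    seg_start = None
    for pos, delta in events:
        previous = active
        active += delta
        if previous != 2 and active == 2:
            seg_start = pos
        elif previous == 2 and active != 2:
            if pos > seg_start:      # skip empty intervals
                segments.append((seg_start, pos))
            seg_start = None
    return segments

# Three utterances: the second overlaps the first and the third
segments = two_source_segments([(0, 100), (40, 160), (120, 200)])
```

BSS_EVAL would then be applied separately to each returned interval of the output sequence.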

Note: the input and output files must be exactly synchronized. We therefore expect the output and input audio files to have the same length. Participants should be aware that exact synchronization between input and output is essential for a correct performance evaluation with the BSS_EVAL toolbox.
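Block-based algorithms often produce outputs that are slightly longer or shorter than the input. A hypothetical helper such as the one below trims or zero-pads the end of an output so that its length matches the input. Note that it only fixes the length: it does not compensate for any processing delay, which must be removed separately to keep the signals synchronized.

```python
import numpy as np

def fit_length(output, n_samples):
    """Trim or zero-pad the last axis of `output` to exactly n_samples."""
    output = np.asarray(output)
    current = output.shape[-1]
    if current >= n_samples:
        return output[..., :n_samples]
    # Pad only the time axis (last axis), at the end
    pad = [(0, 0)] * (output.ndim - 1) + [(0, n_samples - current)]
    return np.pad(output, pad)

padded = fit_length(np.ones(5), 8)            # mono, too short
trimmed = fit_length(np.ones((2, 10)), 8)     # stereo, too long
```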

Reference software

Example software with sample wav-files can be downloaded: examples.zip.
Please consult "readme.txt" in the zip-file for more information.

Potential Participants

Francesco Nesta (nesta (a) fbk.eu)
Benedikt Loesch (benedikt.loesch (a) lss.uni-stuttgart.de)
Zbynek Koldovsky (zbynek.koldovsky (a) tul.cz)

Task proposed by Francesco Nesta
