Underdetermined speech and music mixtures
We propose to repeat the underdetermined speech and music mixtures task of the previous campaign, SiSEC 2008, with fresh test data.
Results
- Results for test dataset: click here
- Results for test2 dataset: click here
Test data
We have two datasets:
Download test.zip (22 MB) (former test data of SiSEC 2008) [modified Feb. 25; see the modification announcement]
Download test2.zip (16 MB)
test.zip
test.zip contains three types of stereo mixtures:
- instantaneous mixtures (static sources scaled by positive gains)
- synthetic convolutive mixtures (static sources filtered by synthetic room impulse responses simulating a pair of omnidirectional microphones via the Roomsim toolbox)
- live recordings (static sources played through loudspeakers in a meeting room, recorded one at a time by a pair of omnidirectional microphones and subsequently added together)
The room dimensions are the same for synthetic convolutive mixtures and live recordings (4.45 x 3.55 x 2.5 m). The reverberation time is set to either 130 ms or 250 ms and the distance between the two microphones to either 5 cm or 1 m, resulting in 9 mixing conditions overall.
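As an illustration of the instantaneous case, such a mixture is simply the product of a 2 x N matrix of positive gains with the matrix of source signals. A minimal sketch with synthetic signals (the gains and signals below are placeholders, not the actual dataset values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_src, n_samples = 3, 16000        # e.g. 3 sources, 1 s at 16 kHz (assumed)
sources = rng.standard_normal((n_src, n_samples))  # placeholder source signals

# 2 x 3 instantaneous mixing matrix with positive gains;
# each column sets the spatial direction of one source
A = np.array([[0.8, 0.5, 0.3],
              [0.2, 0.5, 0.7]])

mixture = A @ sources              # stereo mixture, shape (2, n_samples)
print(mixture.shape)               # (2, 16000)
```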
For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:
- 4 male speech sources
- 4 female speech sources
- 3 male speech sources
- 3 female speech sources
- 3 non-percussive music sources
- 3 music sources including drums
The source directions of arrival vary between -60 degrees and +60 degrees with a minimal spacing of 15 degrees and the distances between the sources and the center of the microphone pair vary between 80 cm and 1.20 m.
The data consist of stereo WAV audio files, which can be imported in Matlab using the wavread command. These files are named test_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shortcut for the set of source signals, <mixtype> a shortcut for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.
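The naming scheme can be parsed mechanically; a minimal sketch (the concrete shortcut values in the example filename are illustrative, not the official codes):

```python
import re

# Pattern for the naming scheme described above: each field is a
# non-underscore token between underscores, ending in "_mix.wav".
pattern = re.compile(
    r"test_(?P<srcset>[^_]+)_(?P<mixtype>[^_]+)"
    r"_(?P<reverb>[^_]+)_(?P<spacing>[^_]+)_mix\.wav"
)

# hypothetical filename used only to demonstrate the pattern
m = pattern.match("test_female3_synthconv_130ms_5cm_mix.wav")
print(m.groupdict())
```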
Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Glen Phillips, Mark Engelberg, Psycho Voyager, Nine Inch Nails and Ali Farka Touré for music source signals and Shoko Araki and Emmanuel Vincent for mixture signals.
test2.zip
test2.zip contains two types of stereo mixtures:
- instantaneous mixtures (static sources scaled by positive and negative gains)
- simulated recordings (static sources filtered by impulse responses recorded in a real room situation with loudspeakers and omnidirectional microphones)
The room dimensions for simulated recordings were 4.45 x 3.55 x 2.5 m, and the distance between the sources and the center of the microphone pair was 1.20 m. The reverberation time was set to either 130 ms or 380 ms and the distance between the two microphones to either 4 cm or 20 cm, resulting in 4 reverberant conditions which, together with the instantaneous mixtures, yield 5 mixing conditions overall.
For each mixing condition, 6 mixture signals have been generated from different sets of source signals placed at different spatial positions:
- 4 male speech sources
- 4 female speech sources
- 3 male speech sources
- 3 female speech sources
- 3 non-percussive music sources
- 3 music sources including drums
The data consist of stereo WAV audio files, which can be imported in Matlab using the wavread command. These files are named test2_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav, where <srcset> is a shortcut for the set of source signals, <mixtype> a shortcut for the mixture type, <reverb> the reverberation time and <spacing> the microphone spacing.
Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Shannon Hurley, Nine Inch Nails, AlexQ (Alexander Lozupone), Mokamed, Carl Leth and Jim's Big Ego for music source signals and Hiroshi Sawada for mixture signals.
Development data
Download dev1.zip (91 MB)
Download dev2.zip (47 MB)
(Both are the former development data of SiSEC2008)
[Both zip files were modified Feb. 25; see the modification announcement.]
The data consist of Matlab MAT-files and WAV audio files, which can be imported in Matlab using the commands load and wavread respectively. These files are named as follows:
- dev1_<srcset>_<mixtype>_<reverb>_src_<j>.wav: mono source signal
- dev1_<srcset>_inst_matrix.mat: mixing matrix for instantaneous mixtures
- dev1_<srcset>_<mixtype>_<reverb>_<spacing>_setup.txt: positions of the sources for convolutive mixtures
- dev1_<srcset>_<mixtype>_<reverb>_<spacing>_filt.mat: mixing filter system for convolutive mixtures
- dev1_<srcset>_<mixtype>_<reverb>_<spacing>_sim_<j>.wav: stereo contribution of a source signal to the two mixture channels
- dev1_<srcset>_<mixtype>_<reverb>_<spacing>_mix.wav: stereo mixture signal
where <srcset> is a shortcut for the set of source signals, <mixtype> a shortcut for the mixture type, <reverb> the reverberation time, <spacing> the microphone spacing and <j> the source index.
All mixture signals and source image signals are 10 s long. Music source signals are 11 s long to avoid border effects within convolutive mixtures; the last 10 s are selected once the mixing system has been applied.
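The relationship between these files can be checked numerically: each source image is the mono source convolved with its pair of mixing filters, and each mixture channel is the sum of the corresponding image channels. A minimal sketch with synthetic data (signal and filter lengths below are arbitrary, not the dataset's):

```python
import numpy as np

rng = np.random.default_rng(1)
n_src, n_len, filt_len = 3, 1000, 64
sources = rng.standard_normal((n_src, n_len))          # mono sources
# synthetic 2-channel mixing filters, shape (2, n_src, filt_len)
filters = 0.1 * rng.standard_normal((2, n_src, filt_len))

# source spatial images: each source convolved with its filter pair
images = np.zeros((n_src, 2, n_len + filt_len - 1))
for j in range(n_src):
    for ch in range(2):
        images[j, ch] = np.convolve(sources[j], filters[ch, j])

mixture = images.sum(axis=0)       # stereo mixture = sum of source images
print(mixture.shape)               # (2, 1063)
```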
Licensing issue: These files are made available under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license. The authors are Another Dreamer and Alex Q for music source signals and Hiroshi Sawada, Shoko Araki and Emmanuel Vincent for mixture signals.
Tasks
The source separation problem has been split into four tasks:
- source counting (estimate the number of sources)
- mixing system estimation (estimate the mixing matrix for instantaneous mixtures or the frequency-dependent mixing matrix for convolutive mixtures)
- source signal estimation (estimate the mono source signals)
- source spatial image estimation (estimate the stereo contribution of each source to the two mixture channels)
Each participant is asked to submit the results of his/her algorithm for tasks 3 and/or 4. The results for tasks 1 and 2 may also be submitted if possible.
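For instantaneous mixtures, tasks 3 and 4 are directly related: the stereo spatial image of source j is its mono signal scaled by the j-th column of the mixing matrix. A minimal sketch (the matrix and signal below are synthetic placeholders):

```python
import numpy as np

A = np.array([[0.9, 0.4],
              [0.3, 0.8]])            # illustrative 2 x 2 mixing matrix
s = np.sin(np.linspace(0, 20, 500))   # one estimated mono source (task 3)

# spatial image of source 0 (task 4): its gain on each of the two channels
image0 = np.outer(A[:, 0], s)         # shape (2, 500)
print(image0.shape)
```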
Submissions
Each participant is asked to submit the results of his/her algorithm for tasks 3 and/or 4:
- over all or part of "test2",
- and over all or part of "test", if his/her algorithm was not previously submitted to the Stereo Audio Source Separation Evaluation Campaign or to SiSEC 2008, so as to assess improvements compared to those campaigns.
The results for tasks 1 and 2 may also be submitted if possible.
In addition, each participant is asked to provide basic information about his/her algorithm (e.g. a bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU.
Please submit your results through this link. Please register your system first; you will then receive an ID and password to log in to the system.
[NOTE about test2.zip]: The submission system requires that the processed signals have an exact duration of 10 s. However, some mixture files in test2.zip are longer than 10 s. Therefore, when you submit your files, please cut the end of the files so that the duration becomes exactly 10 s. This cut will be taken into account in the evaluation stage.
We are sorry for this inconvenience.
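Such a truncation can be done with any audio tool; a minimal sketch using Python's standard-library wave module (a dummy in-memory 12 s file stands in for a real result, and the 16 kHz sampling rate is an assumption):

```python
import io
import wave

FS = 16000  # assumed sampling rate; check the actual files

# a dummy 12 s stereo 16-bit file in memory stands in for a real result file
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(FS)
    w.writeframes(b"\x00" * 4 * 12 * FS)   # 4 bytes per stereo frame

# read back only the first 10 s of frames
buf.seek(0)
with wave.open(buf, "rb") as r:
    head = r.readframes(10 * FS)
    nch, width, rate = r.getnchannels(), r.getsampwidth(), r.getframerate()

# rewrite a file that is exactly 10 s long
out = io.BytesIO()
with wave.open(out, "wb") as w:
    w.setnchannels(nch)
    w.setsampwidth(width)
    w.setframerate(rate)
    w.writeframes(head)

out.seek(0)
with wave.open(out, "rb") as r:
    print(r.getnframes() // FS)            # 10
```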
Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.
Reference software
Please refer to the SiSEC 2008 page.
Evaluation criteria
We propose to evaluate the estimated source signals via the criteria defined in the BSS_EVAL toolbox. These criteria allow an arbitrary filtering between the estimated source and the true source, and measure interference and artifacts distortion separately. All source orderings are tested and the ordering leading to the best SIR is selected.
Similarly, we propose to evaluate the estimated spatial source image signals via the criteria used for the Stereo Audio Source Separation Evaluation Campaign. These criteria distinguish spatial (or filtering) distortion, interference and artifacts. All source orderings are tested and the ordering leading to the best SIR is selected.
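The ordering step mentioned above can be sketched as an exhaustive search: score every permutation of the estimates against the references with an SIR-like criterion and keep the best. The toy energy-ratio score below is a simplification, not the full BSS_EVAL decomposition:

```python
import itertools
import numpy as np

def sir_like(est, ref):
    """Toy SIR: energy of the projection on the reference vs the residual."""
    proj = (est @ ref) / (ref @ ref) * ref
    return 10 * np.log10(np.sum(proj**2) / (np.sum((est - proj)**2) + 1e-12))

rng = np.random.default_rng(2)
refs = rng.standard_normal((3, 800))                # reference sources
ests = refs[[2, 0, 1]] + 0.05 * rng.standard_normal((3, 800))  # scrambled

# try every ordering and keep the one with the highest total SIR
best = max(
    itertools.permutations(range(3)),
    key=lambda p: sum(sir_like(ests[p[i]], refs[i]) for i in range(3)),
)
print(best)   # recovers the inverse of the scrambling: (1, 2, 0)
```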
Several tools for evaluation can be found on the SiSEC 2008 page.
In addition, new auditory-motivated objective measures will be used to assess the quality of the estimated spatial source image signals, in the mono and stereo cases. The main features are:
- as in previous evaluations, four performance measures akin to SDR, ISR, SIR and SAR are given: global score, target preservation score, interference rejection score and artifacts absence score
- these auditory-motivated measures were trained on a set of subjective quality ratings obtained from the SiSEC 2009 sound material, and improve the correlation with subjective ratings by more than 20% compared to the classical SDR, ISR, SIR and SAR
- a preliminary version of this toolbox is available here. It provides a new method to decompose the distortion into three components: target distortion eTarget (error signal related to the target source), interference eInterf (error signal related to the other sources) and artifacts eArtif (remaining error signal).
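The three-way split follows the same idea as the time-invariant BSS_EVAL decomposition, which can be sketched with simple least-squares projections (a gain-only variant for illustration, whereas the real criteria allow filtering; all signals below are synthetic):

```python
import numpy as np

def decompose(est, target, others):
    """Split est into a target part, interference and artifacts
    using least-squares projections (gain-only sketch)."""
    # part of the estimate explained by the target source alone
    s_target = (est @ target) / (target @ target) * target
    # part explained by the span of all sources
    S = np.vstack([target, others])
    coeffs, *_ = np.linalg.lstsq(S.T, est, rcond=None)
    p_full = coeffs @ S
    e_interf = p_full - s_target      # explained by the other sources
    e_artif = est - p_full            # remaining error
    return s_target, e_interf, e_artif

rng = np.random.default_rng(3)
target = rng.standard_normal(1000)
others = rng.standard_normal((2, 1000))
est = target + 0.3 * others[0] + 0.1 * rng.standard_normal(1000)

s_t, e_i, e_a = decompose(est, target, others)
# the three parts add back up to the estimate by construction
print(np.allclose(s_t + e_i + e_a, est))   # True
```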
Potential Participants
- Dan Barry (dan.barry (a) dit_ie)
- Pau Bofill (pau (a) ac_upc_edu)
- Andreas Ehmann (aehmann (a) uiuc_edu)
- Vikrham Gowreesunker (gowr0001 (a) umn_edu)
- Matt Kleffner (kleffner (a) uiuc_edu)
- Nikolaos Mitianoudis (n.mitianoudis (a) imperial_ac_uk)
- Hiroshi Sawada (sawada (a) cslab_kecl_ntt_co_jp)
- Emmanuel Vincent (emmanuel.vincent (a) irisa_fr)
- Ming Xiao (xiaoming1968 (a) 163_com)
- Ron Weiss (ronw (a) ee_columbia_edu)
- Michael Mandel (mim (a) ee_columbia_edu)
- Shoko Araki (shoko (a) cslab_kecl_ntt_co_jp)
- Yosuke Izumi (izumi (a) hil_t_u-tokyo_ac_jp)
- Taesu Kim (taesu (a) ucsd_edu)
- Maximo Cobos (mcobos (a) iteam_upv_es)
- John Woodruff (woodruff.95 (a) osu_edu)
- Antonio Rebordao (antonio (a) gavo_t_u-tokyo_ac_jp)
- Alexey Ozerov (alexey.ozerov (a) telecom-paristech_fr)
- Andrew Nesbit (andrew.nesbit (a) elec_qmul_ac_uk)
- Matthieu Puigt (mpuigt (a) ast_obs-mip_fr)
- Simon Arberet (sarberet (a) irisa_fr)
- Zaher Elchami (zaher.elchami (a) orange-ftgroup_com)
- Ngoc Q. K. Duong (qduong (a) irisa_fr)
Task proposed by Audio Committee