Determined and over-determined speech and music mixtures


We propose to repeat the "Determined and over-determined speech and music mixtures" task from the last SiSEC, SiSEC 2008 (based on Lucas Parra's and Kenneth Hild's datasets), adding fresh test data.

Results


Please see this page.

Test data


Download parra.zip (23 MB) (former test data of SiSEC 2008)
Download hild.zip (0.5 MB) (former test data of SiSEC 2008)
Download loesch.zip (6 MB) (fresh data)

Dataset "parra"

The dataset "parra" contains 21 4-channel recordings of 2 to 4 speech sources in 7 different recording conditions. All recordings have been downsampled to the same rate (16 kHz) and cut to the same duration (10 s), so as to ease handling of the datasets and comparison of the results with other datasets. The data consist of 4-channel WAV audio files, that can be imported in Matlab using the wavread command. These files are named room<r>_<J>sources_mix.wav, where <r> is a character identifying the room and <J> is the number of sources. The authors of the dataset are Hiroshi Sawada (rooms 4 and 5), Mads Dyrholm (rooms 1, 2, 3, C and O) and Lucas Parra. The music used for recordings in rooms 1, 2 and 3 was taken from "Germ Germ" by Das Böse Ding and has kindly been approved for public presentation by Jan Klare of Das Böse Ding in the name of research.

Room 1: Chamber with cushioned walls, 1.5 m x 2 m x 2.5 m (W x L x H).
Scenario: The sources were placed randomly in the room, either on the floor or on a table. The microphones were placed randomly approx. 50 cm from the walls, at different heights.
Equipment: Behringer ECM8000 omnidirectional microphones. SM Pro Audio PR4V microphone preamp. Standard desktop mono PC speakers were used for the sources. Audiotrak Maya44USB soundcard for 4-channel recording.

Room 2: Medium-size conference room, 10 m x 8 m x 3 m (W x L x H).
Scenario: The sources were placed randomly in the room, either on the floor or on a table or stool, with an average distance of 2 meters between any two sources.
The microphones were placed along the wall closest to the sources, approx. 1 meter from the wall, at different heights, uniformly spaced approx. 1 meter apart.
Equipment: Behringer ECM8000 omnidirectional microphones. SM Pro Audio PR4V microphone preamp. Standard desktop mono PC speakers were used for the sources. Audiotrak Maya44USB soundcard for 4-channel recording.

Room 3: Medium-size office room, 3 m x 3 m x 2.5 m (W x L x H).
Scenario: The sources were placed randomly in the room, either on the floor or on a table or stool, with an average distance of 1.5 meters between any two sources. The microphones were placed at different heights, uniformly spaced approx. 1 meter apart.
Equipment: Behringer ECM8000 omnidirectional microphones. SM Pro Audio PR4V microphone preamp. Standard desktop mono PC speakers were used for the sources. Audiotrak Maya44USB soundcard for 4-channel recording.

Room 4: Chamber with cushioned walls, 3.55 m x 4.45 m x 2.5 m (W x L x H).
Scenario: All four microphones were placed three-dimensionally around the center of the room, at a height of around 125 cm. The maximum distance between any two microphones was 5.7 cm. The first three sources were placed at the same height as the microphones; the last source was placed at a different height. Source distances from the microphones: around 100 cm.
Equipment: Sony ECM-77B omnidirectional microphones. Yamaha HA8 microphone preamp. Bose 101MM speakers with a 1705II power amplifier were used for the sources. Dasbox model-500 A/D and D/A converters.

Room 5: Same as room 4.
Scenario: All four microphones were placed three-dimensionally around the center of the room, at a height of around 125 cm. The maximum distance between any two microphones was 5.7 cm. The first three sources were placed at the same height as the microphones; the last source was placed at a different height. Source distances from the microphones: around 180 cm.
Equipment: Sony ECM-77B omnidirectional microphones. Yamaha HA8 microphone preamp. Bose 101MM speakers with a 1705II power amplifier were used for the sources. Dasbox model-500 A/D and D/A converters.

Room C: Same as room 2.
Scenario: Similar to room 2; however, the exact placement of the microphones and sources, and their amplitudes, differ.
Equipment: Behringer ECM8000 omnidirectional microphones. SM Pro Audio PR4V microphone preamp. Standard desktop mono PC speakers were used for the sources. Audiotrak Maya44USB soundcard for 4-channel recording.

Room O: Same as room 3.
Scenario: Similar to room 3; however, the exact placement of the microphones and sources, and their amplitudes, differ.
Equipment: Behringer ECM8000 omnidirectional microphones. SM Pro Audio PR4V microphone preamp. Standard desktop mono PC speakers were used for the sources. Audiotrak Maya44USB soundcard for 4-channel recording.

Dataset "hild"

The dataset "hild" contains 1 stereo recording of 2 speech sources. This recording is a stereo WAV audio file named iliad_mix.wav, that can be imported in Matlab using the wavread command. The author of the dataset is Kenneth Hild.

Room info: 3.7 x 4.4 m lab containing several desks and chairs; the instantaneous power of the impulse response decayed by 35 dB relative to the peak power within 270 ms.
Scenario: The signals representing the two speakers are actually the same person quoting from different parts of the epic Iliad, as translated by Samuel Butler. The microphones are placed on either side of a dummy head. The distances from the speakers to the center of the head (microphone array) are 161 and 137 cm.
Equipment: Studio Projects B3 pressure-gradient transducer microphones, with cardioid pickup pattern. Audio Buddy 2-channel preamplifier with phantom power.

Dataset "loesch"

The dataset "loesch" contains two sets of (2 speech)x(2 microphones), (3 speech)x(3 microphones), (4 speech)x(4 microphones), and (2 speech & 2 music) x (4 microphones) scenarios. The WAV audio can be imported in Matlab using the wavread command. These files are named "roomL_set<s>_<J>sources_<I>mics_mix.wav for speech mixtures, where <s> is a setting index, <J> is the number of sources, and <I> is the number of microphones.
The files name "roomL_sm_set<s>_4sources_4mics_mix.wav are the mixtures for (2 speech & 2 music) x (4 microphones). Here "2 music" consist of a stereo music signal.
The author of the dataset is Benedikt Loesch.
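
For example, the file name for a given scenario can be built in Matlab as follows (a minimal sketch, assuming the placeholders <s>, <J> and <I> are plain integers):

    % Build and load the speech mixture for setting s with J sources and I mics.
    s = 1; J = 3; I = 3;
    fname = sprintf('roomL_set%d_%dsources_%dmics_mix.wav', s, J, I);
    [x, fs] = wavread(fname);   % x is an I-channel mixture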

The speech signals come from the CHAINS database (we thank Dr. Fred Cummins), and the music signals are songs from http://www.danosongs.com/.


Room info: An office room with a reverberation time of about 450-500 ms.
Scenario for speech mixtures: The microphones are arranged as a linear array with a spacing of approximately 10 cm between the microphones. The sources were played back using small loudspeakers placed approximately 120-140 cm from the center of the array. The 2x2 scenarios use microphones 2 and 3, the 3x3 scenarios use microphones 1, 2 and 3, and the 4x4 scenarios use all four microphones.
Scenario for 2 speech & 2 music mixtures: The microphones are arranged as a linear array with a spacing of approximately 10 cm between the microphones. Speech is played back at a distance of about 120-140 cm; music is played back at a distance of about 200 cm.

Tasks


All of the mixtures are convolutive.

The source separation problem has been split into the following three tasks:
    1. source counting (estimate the number of sources)
    2. source signal estimation (estimate the mono source signals)
    3. source spatial image estimation (estimate the contribution of each source to all channels)
In practice, reference mono source signals are not available, so the results of task 2 will be evaluated with respect to the contribution of each source to the first mixture channel.
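
To make this convention concrete, here is a minimal Matlab sketch (the array srcImages is an illustrative placeholder for the true source spatial images, which are available to the evaluators only):

    % srcImages: nsampl x nchan x J array of true source spatial images.
    nsampl = 160000; nchan = 4; J = 2;
    srcImages = randn(nsampl, nchan, J);   % placeholder for the true images
    % The task 2 reference for each source is its contribution to channel 1.
    refs = squeeze(srcImages(:, 1, :));    % nsampl x J reference signals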


In addition, just for the (2 speech & 2 music) mixtures in "loesch.zip", there are two extraction tasks:
    1. speech extraction (extract the mono speech source signals and/or the speech source spatial images; that is, the target signals are the speech signals)
    2. speech and music extraction (estimate both the two speech and the two music signals, as mono source signals and/or source spatial images)

Submission


Each participant is asked to submit the results of his/her algorithm for task 2 or 3, as preferred, over all or part of the test data. Algorithms using a limited number of mixture channels (e.g. only the first two channels) are welcome.

The results for the "source counting" task may also be submitted if possible. When available, they will help in diagnosing the performance of the various parts of an algorithm.

Please submit your results through this link. Please register with the system first; you will then receive an ID and password to log in to the system.

In addition, each participant is asked to provide basic information about his/her algorithm (e.g. number of channels used, bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU (for example, an algorithm that takes 30 s per excerpt on a 2.5 GHz CPU runs at 12 s per excerpt per GHz).

Note that the submitted audio files will be made available on a website under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 2.0 license.


Evaluation criteria


We propose to evaluate the estimated source signals via the criteria defined in the BSS_EVAL toolbox. These criteria allow an arbitrary filtering between the estimated source and the true source, and measure interference and artifact distortion separately. All source orderings are tested, and the ordering leading to the best SIR is selected.

Similarly, we propose to evaluate the estimated spatial source image signals via the criteria used for the Stereo Audio Source Separation Evaluation Campaign. These criteria distinguish spatial (or filtering) distortion, interference and artifacts. All source orderings are tested, and the ordering leading to the best SIR is selected.

The above performance criteria are respectively implemented in
* bss_eval_sources.m
* bss_eval_images.m
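
A typical call looks as follows (a minimal sketch; the random signals are placeholders for actual estimates and references, and the array sizes follow the BSS_EVAL conventions):

    nsrc = 2; nsampl = 160000; nchan = 4;
    % Task 2: mono source estimates vs. reference sources (nsrc x nsampl).
    se = randn(nsrc, nsampl); s = randn(nsrc, nsampl);   % placeholders
    [SDR, SIR, SAR, perm] = bss_eval_sources(se, s);
    % Task 3: spatial image estimates vs. true images (nsrc x nsampl x nchan).
    ie = randn(nsrc, nsampl, nchan); im = randn(nsrc, nsampl, nchan);
    [SDRi, ISR, SIRi, SARi, permi] = bss_eval_images(ie, im);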


In addition, new auditory-motivated objective measures will be used to assess the quality of the estimated spatial source image signals, in the mono and stereo cases. The main features are:
  • as in previous evaluations, four performance measures akin to SDR, ISR, SIR and SAR are given: a global score, a target preservation score, an interference rejection score and an artifact absence score
  • these auditory-motivated measures were trained on a set of subjective quality ratings obtained from the SiSEC 2009 sound material, and improve the correlation with subjective ratings by more than 20% compared to the classical SDR, ISR, SIR and SAR
  • the measures rely on a new method to decompose the distortion into three components: target distortion eTarget (the error signal related to the target source), interference eInterf (the error signal related to the other sources) and artifacts eArtif (the remaining error signal)
A preliminary version of the toolbox implementing these auditory-motivated measures is available here.


Potential Participants


  • Lucas Parra (parra (a) ccny_cuny_edu)
  • Kenneth E. Hild II (k.hild (a) ieee_org)
  • Robert Johnson (rjohnson (a) fmrib_ox_ac_uk)
  • Francesco Nesta (nesta (a) fbk_eu)
  • Intae Lee (lititl (a) yahoo_co_kr)
  • Itahashi Takashi (itahashi-takashi (a) edu_brain_kyutech_ac_jp)
  • Scott Douglas (douglas (a) engr_smu_edu)
  • Benedikt Loesch (benedikt.loesch (a) lss.uni-stuttgart.de)

Task proposed by the Audio Committee and Benedikt Loesch
