Robust blind linear/non-linear separation of short two-source two-microphone recordings


Consider the task of targeting a speaker whose voice is interfered with by other acoustic sources (e.g., a competing speaker in an office, or road-traffic noise during a phone call). A practical assumption is that at most two sources are active in short time intervals, or that only one jammer source is dominant. To separate the speaker's voice, we should cope with the blind linear/non-linear separation of the two sources as well as possible, using only two microphones for commercial reasons.

The separating algorithm applied in this task should be robust to the positions of the sources and should be able to work with short recordings. Once its parameters are adjusted, it should perform well in all scenarios.

Results

Please visit this page (external link).

Test data


Download public.zip (external link) (2 MB)
The speech signals come from the CHAINS database (we thank Dr. Fred Cummins), and the mixtures are provided by the authors (see the list of authors below).
These files are licensed for research use only by their authors.


Recordings of different sources at different positions in two environments (see details below). The length of each recording is 1 second, with a sampling frequency of 16 kHz.
  • Two environments:
    • Room 1 – an ordinary living room
    • Room 2 – a study room with a running PC (diffuse noise of fan)
  • Three different positions, i.e., positions of the loudspeakers (sources).
  • Six combinations of a male and female speech with a jammer source
    1. male speech × jammer male speech
    2. female speech × jammer female speech
    3. male speech × sneeze
    4. male speech × laugh
    5. female speech × glass break
    6. female speech × TV sport noise
Altogether, there are 2 rooms × 3 positions × 6 combinations = 36 mixtures.

Data format: stereo wav-files structured in directories, 16 bits, 16 kHz sampling frequency.
The files are named sisec2010/room<r>/<srcset>/room<r>set<srcset>x<combi>.wav, where <r> is the room index (1 or 2), <srcset> is the source-position index (1, 2 or 3), and <combi> is the index of the combination of a target speech and a jammer source (1--6). For example, room 1, position 2, combination 3 corresponds to sisec2010/room1/2/room1set2x3.wav.

Getting the data in Matlab with loadsisec.m: function loadsisec(<r>, <srcset>, <combi>)
Output: a 2×16000 matrix x containing the signals from the two microphones.
Usage: just call, e.g., loadsisec(1, 2, 3).
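
For illustration, a minimal sketch that loads and processes all 36 mixtures in a loop (assuming loadsisec.m is on the Matlab path and returns the matrix x; if it instead assigns x directly in the workspace, drop the output argument):

% loop over all 2 rooms x 3 positions x 6 combinations = 36 mixtures
for r = 1:2
    for srcset = 1:3
        for combi = 1:6
            x = loadsisec(r, srcset, combi);  % 2 x 16000 microphone signals
            % ... run the separation algorithm on x here ...
        end
    end
end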


Task


  • Task 1: Separate each of the 36 mixtures with linear filters. The parameters of your algorithm must be the same for each mixture. The outputs should be the coefficients of the MIMO separating filters in the following Matlab form (let the rows of x be the input signals from the microphones); a sketch of applying H is given after this list:
    • In case the algorithm outputs mono-channel separated sources, the output should be a 3D array H such that filter(H(1,:,i),1,x(1,:))+filter(H(2,:,i),1,x(2,:)) yields the i-th separated source.
    • In case the algorithm outputs the separated microphone responses (images) of the sources, the output should be a 4D array H such that filter(H(1,:,i,j),1,x(1,:))+filter(H(2,:,i,j),1,x(2,:)) yields the separated response (image) of the i-th source at the j-th microphone.
The separated sources may appear in random order (global permutation). In case the separating filters are not FIR, H should contain sufficiently long truncated impulse responses of the filters.

  • Task 2: Separate each of the 36 mixtures with non-linear approaches. The parameters of your algorithm must be the same for each mixture. The outputs are
    • the mono source signals, and/or
    • the spatial source images, i.e., estimates of the stereo contribution of each source to the two mixture channels.
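
As a sketch of the Task 1 output convention (assuming the 3D mono-source case; the names y, m, L and n are illustrative only), the separated sources are reconstructed from H as follows:

% apply the MIMO separating filters (3D case); row i of y is the
% i-th separated source
[m, L, n] = size(H);                % m=2 inputs, filter length L, n sources
y = zeros(n, size(x,2));
for i = 1:n
    y(i,:) = filter(H(1,:,i), 1, x(1,:)) + filter(H(2,:,i), 1, x(2,:));
end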

Submission


  • Task 1: Each participant is asked to submit a mat-file that stores each H in a cell array "Result", Result{i,j,k}=H, where i, j, k are the data indices (i.e., i-th room, j-th setting, k-th mixture). The filename of the mat-file should be "result3D.mat" for a 3D array and "result4D.mat" for a 4D array. The submission of the filters is mandatory for Task 1; a sketch of assembling the file is given after this list.
  • Task 2: Each participant is asked to submit the separated signals of his/her algorithm.
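
For example, the Task 1 mat-file could be assembled as follows; estimate_filters is a hypothetical stand-in for the participant's own algorithm returning the 3D array H:

Result = cell(2, 3, 6);             % rooms x settings x mixtures
for i = 1:2
    for j = 1:3
        for k = 1:6
            x = loadsisec(i, j, k);
            H = estimate_filters(x);   % hypothetical: your separating filters
            Result{i,j,k} = H;
        end
    end
end
save('result3D.mat', 'Result');     % use 'result4D.mat' for a 4D array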

In addition, each participant is asked to provide basic information about his/her algorithm (e.g., the number of channels used and a bibliographical reference) and to declare its average running time, expressed in seconds per test excerpt and per GHz of CPU.

Submission method


[SUBMISSION CLOSED on Apr.21]
Each participant is asked to submit the results of his/her algorithm for Task 1 and/or Task 2 over all the mixtures.

Each participant should make his/her results available online in the form of an archive called <YourName>_task1.zip or <YourName>_task2.zip.
The included files must be named as follows:
  • Task 1 (store all files in <YourName>_task1.zip)
    • The filename of the mat-file should be "result3D.mat" for a 3D array and "result4D.mat" for a 4D array.
  • Task 2 (store all files in <YourName>_task2.zip; a writing sketch follows this list)
    • room<r>set<srcset>x<combi>_src_<j>.wav : estimated source <j>, mono WAV file sampled at 16 kHz
    • room<r>set<srcset>x<combi>_sim_<j>.wav : estimated spatial image of source <j>, stereo WAV file sampled at 16 kHz
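
A sketch of writing the Task 2 files with the required names, using audiowrite (wavwrite(y', fs, 16, name) in older Matlab releases); y_src (a 2×N array of mono sources) and y_img (a 2×N×2 array of stereo images) are assumed outputs of the participant's algorithm:

fs = 16000;
name = sprintf('room%dset%dx%d', r, srcset, combi);
for j = 1:2
    % estimated mono source j (assumed 2 x N array y_src)
    audiowrite(sprintf('%s_src_%d.wav', name, j), y_src(j,:)', fs, 'BitsPerSample', 16);
    % estimated stereo image of source j (assumed 2 x N x 2 array y_img)
    audiowrite(sprintf('%s_sim_%d.wav', name, j), y_img(:,:,j)', fs, 'BitsPerSample', 16);
end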


[SUBMISSION CLOSED on Apr.21]
Each participant should then send an email to "zbynek.koldovsky (at) tul.cz" with Cc to "shoko (a) cslab.kecl.ntt.co.jp" providing:
  • contact information (name, affiliation)
  • basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
  • the URL of the tarball(s)


Hints for Task 1

The filter coefficients can be obtained easily by appending unit pulses to the end of the input signals and forcing the algorithm to use only the original part of the input signals for its computations. The coefficients of the separating filters can then be read off at the end of the separated signals: the pulses are placed at samples N+500 and N+1500, so each recovered filter occupies a known 1000-sample window with room for up to 499 samples of acausal response. A pseudocode of this approach is as follows:

N = length(x);                          % number of original samples
x = [x zeros(2,2000)];                  % append room for two unit pulses
x(1,N+500) = 1;                         % unit pulse on microphone 1
x(2,N+1500) = 1;                        % unit pulse on microphone 2
% the algorithm must use only x(:,1:N) when estimating its parameters
estimated_signals = SeparatingAlgorithm(x);

if responses_are_estimated              % the algorithm outputs source images
    H(1,:,1,1) = estimated_signals(1,N+1:N+1000,1);
    H(1,:,2,1) = estimated_signals(2,N+1:N+1000,1);
    H(1,:,2,2) = estimated_signals(2,N+1:N+1000,2);
    H(1,:,1,2) = estimated_signals(1,N+1:N+1000,2);
    H(2,:,1,1) = estimated_signals(1,N+1001:N+2000,1);
    H(2,:,1,2) = estimated_signals(1,N+1001:N+2000,2);
    H(2,:,2,2) = estimated_signals(2,N+1001:N+2000,2);
    H(2,:,2,1) = estimated_signals(2,N+1001:N+2000,1);
else                                    % mono sources are estimated
    H(1,:,1) = estimated_signals(1,N+1:N+1000);
    H(2,:,1) = estimated_signals(1,N+1001:N+2000);
    H(1,:,2) = estimated_signals(2,N+1:N+1000);
    H(2,:,2) = estimated_signals(2,N+1001:N+2000);
end


Reference Software




Evaluation Criteria

  • Criterion 1: SIR and SDR computed using the known microphone responses of the sources (see here for detailed definitions). That is, for each separating filter, the Signal-to-Interference Ratio (SIR) and the Signal-to-Distortion Ratio (SDR) will be computed [D. Schobben, K. Torkkola and P. Smaragdis, "Evaluation of blind signal separation methods," ICA '99, Aussois, France, pp. 261-266, Jan. 1999]. In case the algorithm computes both microphone responses (images) of the sources, each criterion will be averaged over these responses. The global permutation is resolved according to the maximal achieved SIR (see siseceval.m and the examples (external link) to get a better idea).
  • Criterion 2: SIR and SDR of the separated signals evaluated with the BSS_EVAL (external link) toolbox; a usage sketch is given below.

Task 1 is evaluated with both Criteria 1 and 2; Task 2 with Criterion 2 only.
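
For instance, Criterion 2 can be computed with the bss_eval_sources function of the BSS_EVAL toolbox; here s_est and s_true stand for 2×N matrices of the estimated and the true source signals:

% SDR, SIR and SAR of the estimated sources; perm is the permutation
% that best matches the estimates to the reference sources
[SDR, SIR, SAR, perm] = bss_eval_sources(s_est, s_true);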


Potential participants

Any participant of SiSEC 2008, as well as other authors of BASS algorithms.



Task proposed by Z. Koldovsky and P. Tichavsky

Note: The complete version of the dataset, containing signals of 2 seconds duration (44.1 kHz, 24 bits), the source microphone responses (images), and illustrations of the loudspeaker positions, will be available after the LVA/ICA 2010 conference.
The complete dataset is now available (Oct. 8, 2010)! Download now! (external link)

