These scripts accompany the paper "Components Loss for Neural Networks in Mask-Based Speech Enhancement". In this repository, we provide the source code for training mask-based speech enhancement convolutional neural networks (CNNs) with our proposed components loss (CL), which comes in two variants: the 2 components loss (2CL) and the 3 components loss (3CL). The corresponding test code is also provided.
The code was written by Ziyi Xu with help from Ziyue Zhao and Samy Elshamy.
We propose a novel components loss (CL) for the training of neural networks for mask-based speech enhancement. During the training process, the proposed CL offers separate control over preservation of the speech component quality, suppression of the residual noise component power, and preservation of a naturally sounding residual noise component. We obtain a better and more balanced performance in almost all employed instrumental quality metrics over the baseline losses, the latter comprising the conventional mean squared error (MSE) loss function and also auditory-related loss functions, such as the perceptual evaluation of speech quality (PESQ) loss and the recently proposed perceptual weighting filter loss.
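To make the idea of separate control over the components more concrete, here is a minimal NumPy sketch of a components-style loss. It is not the implementation from the training scripts: the function name, the weights `alpha` and `beta`, and the exact form of each term (in particular the normalized-shape term for the residual noise) are assumptions made for illustration. The paper and the training scripts remain the authoritative definition.

```python
import numpy as np


def components_loss_sketch(mask, speech_mag, noise_mag,
                           alpha=0.5, beta=0.0, eps=1e-12):
    """Illustrative components-style loss for a batch of frames.

    mask, speech_mag, noise_mag: arrays of shape (L, K).
    alpha, beta: hypothetical weights, not values taken from the paper.
    """
    s_tilde = mask * speech_mag  # filtered speech component
    n_tilde = mask * noise_mag   # filtered (residual) noise component

    # Term 1: keep the filtered speech close to the clean speech.
    speech_term = np.mean((s_tilde - speech_mag) ** 2)

    # Term 2: penalize the power of the residual noise.
    noise_power_term = np.mean(n_tilde ** 2)

    # Term 3 (assumed form): compare the per-frame spectral shape of the
    # residual noise with the shape of the original noise.
    n_tilde_shape = n_tilde / (np.linalg.norm(n_tilde, axis=1, keepdims=True) + eps)
    noise_shape = noise_mag / (np.linalg.norm(noise_mag, axis=1, keepdims=True) + eps)
    noise_shape_term = np.mean((n_tilde_shape - noise_shape) ** 2)

    return ((1.0 - alpha - beta) * speech_term
            + alpha * noise_power_term
            + beta * noise_shape_term)
```

With `beta=0.0` only the first two terms are active, which corresponds to a two-component setting in this sketch; a nonzero `beta` adds the third term.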
- Install TensorFlow 1.5.0 and Keras 2.1.4
- Some additional Python packages are required (e.g. numpy, scipy.io, and sklearn); see the Python scripts for details
- Install Matlab
Note that in this project the clean speech signals are taken from the Grid corpus (downsampled to 16 kHz) and the noise signals are taken from the CHiME-3 database.
- We use Matlab to prepare the input and target magnitude spectra for both training and validation sets.
- To run the training script, you need:
  - `training_input_noisy.mat` (normalized noisy speech amplitude spectra, with zero mean and unit variance)
  - `validation_input_noisy.mat` (normalized noisy speech amplitude spectra, with zero mean and unit variance)
  - `training_pure_noise.mat` (amplitude spectra of noise component)
  - `validation_pure_noise.mat` (amplitude spectra of noise component)
  - `training_clean_speech.mat` (amplitude spectra of speech component)
  - `validation_clean_speech.mat` (amplitude spectra of speech component)
- All matrices have dimensions L × K (e.g. 1,000,000 × 132), where L is the number of frames and K is the number of input and output frequency bins, which is set to 132.
- All `.mat` files must be stored in version 7.3, using the Matlab command `save('filename.mat','variable','-v7.3')`, so that very large data matrices can be saved.
- Small examples are placed under the directory `./training_data/`. To start your own training, replace these `.mat` files with your own data. More details are in the Python scripts. You can try the training script using these small examples.
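The zero-mean, unit-variance normalization of the noisy input spectra mentioned above can be sketched as follows. In this repository the data preparation is done in Matlab; the NumPy version below only illustrates the arithmetic. Computing the statistics per frequency bin and reusing the training statistics for the validation set are assumptions, and the helper names are hypothetical.

```python
import numpy as np


def fit_normalization(train_spectra, eps=1e-12):
    """Return per-bin mean and standard deviation of an (L, K) matrix."""
    mean = train_spectra.mean(axis=0, keepdims=True)
    std = train_spectra.std(axis=0, keepdims=True)
    # Guard against division by zero for constant bins.
    return mean, np.maximum(std, eps)


def apply_normalization(spectra, mean, std):
    """Normalize an (L, K) matrix with previously collected statistics."""
    return (spectra - mean) / std


# Random stand-in data with K = 132 frequency bins (not real spectra).
rng = np.random.RandomState(0)
train_noisy = rng.rand(1000, 132) * 5.0 + 3.0
valid_noisy = rng.rand(200, 132) * 5.0 + 3.0

mean, std = fit_normalization(train_noisy)
train_norm = apply_normalization(train_noisy, mean, std)
valid_norm = apply_normalization(valid_noisy, mean, std)  # reuses training statistics
```

Keeping `mean` and `std` around is useful because the test input is normalized with the statistics collected on the training data.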
- Run the Python script to train the CNN model with the proposed 2CL based on the prepared training/validation data:
python Mask-based_CNN_2CL_training.py
- Run the Python script to train the CNN model with the proposed 3CL based on the prepared training/validation data:
python Mask-based_CNN_3CL_training.py
- We also use Matlab to prepare the input magnitude spectra for the test data and to store the phase information needed to recover the time-domain signal.
- To run the test script, you need:
  - `test_input_noisy_speech.mat` (normalized noisy speech amplitude spectra, with zero mean and unit variance using the statistics collected on the training data)
  - `test_pure_noise.mat` (amplitude spectra of noise component, used to generate the filtered noise component, which can be used for white-box based performance measures)
  - `test_clean_speech.mat` (amplitude spectra of speech component, used to generate the filtered speech component, which can be used for white-box based performance measures)
  - `test_noisy_speech_unmorm.mat` (unnormalized noisy speech amplitude spectra, used for predicting enhanced speech)
- All matrices have dimensions L × K (e.g. 1,000 × 132), as explained above.
- All `.mat` files are stored using the Matlab command `save('filename.mat','variable')`, which can save `.mat` files of at most 2 GB. If you have very large test data, you also need to store the `.mat` files in `-v7.3` format and modify the corresponding data loading part in the test script.
- Small examples are placed under the directory `./test_data/`. To start your own test, replace these `.mat` files with your own data. More details are in the Python scripts. You can try the test script using these small examples.
- The output of the scripts includes:
  - `test_n_tilde.mat` (filtered noise amplitude spectra)
  - `test_s_tilde.mat` (filtered speech amplitude spectra)
  - `test_s_hat.mat` (enhanced speech amplitude spectra)
- The filtered noise and speech amplitude spectra are then used to reconstruct the filtered noise and speech time-domain signals, which can be used for white-box based performance measures.
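Regarding the note above on storing large test data with `-v7.3` and adapting the data loading: one possible approach is to detect the file format first and pick the loader accordingly. The sketch below is not taken from the test scripts, and the helper names `detect_mat_version` and `load_mat_variable` are hypothetical. It relies on the text header at the start of a MAT-file, and assumes `h5py` and `scipy` are installed when the loader is actually called.

```python
def detect_mat_version(path):
    """Return '7.3' for HDF5-based MAT-files, otherwise 'legacy'.

    MAT-files begin with a human-readable text header; files saved with
    '-v7.3' start with the text 'MATLAB 7.3 MAT-file'.
    """
    with open(path, "rb") as f:
        header = f.read(128)
    return "7.3" if header.startswith(b"MATLAB 7.3 MAT-file") else "legacy"


def load_mat_variable(path, name):
    """Load one variable as a NumPy array (hypothetical helper)."""
    if detect_mat_version(path) == "7.3":
        import h5py  # v7.3 MAT-files are HDF5 containers

        with h5py.File(path, "r") as f:
            # Matlab stores arrays column-major, so h5py sees the axes
            # in reversed order; transpose to get back to (L, K).
            return f[name][()].T

    import scipy.io as sio  # handles MAT-files older than v7.3

    return sio.loadmat(path)[name]
```

A call such as `load_mat_variable('./test_data/test_clean_speech.mat', 'variable')` would then work for either format; the variable name inside each file is an assumption here and has to match the one used in the Matlab `save` command.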
- Run the Python script to test the trained CNN model with the proposed 2CL using the prepared test data:
python Mask-based_CNN_2CL_test.py
- Run the Python script to test the trained CNN model with the proposed 3CL using the prepared test data:
python Mask-based_CNN_3CL_test.py
- The stored phase information of the test data is used to recover the time-domain signal by IFFT with overlap-add (OLA).
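The recovery of the time-domain signal by IFFT with overlap-add can be sketched in NumPy as follows. This is an illustration rather than the code used in this repository: the frame length, hop size, and window are hypothetical defaults and must match the analysis settings used when the spectra and phases were computed. The sketch also assumes one-sided spectra with `frame_len // 2 + 1` bins, which may differ from the K = 132 bins stored in the `.mat` files.

```python
import numpy as np


def overlap_add_reconstruct(magnitude, phase, frame_len=256, hop=128, window=None):
    """Rebuild a time-domain signal from one-sided magnitude and phase spectra.

    magnitude, phase: arrays of shape (L, frame_len // 2 + 1).
    frame_len, hop, window: hypothetical settings, see the note above.
    """
    if window is None:
        window = np.ones(frame_len)  # no synthesis window by default

    # Recombine magnitude and phase into complex spectra.
    spectrum = magnitude * np.exp(1j * phase)

    # Inverse FFT of every frame back to frame_len time samples.
    frames = np.fft.irfft(spectrum, n=frame_len, axis=1)

    n_frames = frames.shape[0]
    signal = np.zeros(hop * (n_frames - 1) + frame_len)
    for l in range(n_frames):
        start = l * hop
        # Overlap-add: shifted frames are summed into the output buffer.
        signal[start:start + frame_len] += frames[l] * window
    return signal
```

With a periodic Hann analysis window and 50 % overlap, the shifted windows sum to one, so the default rectangular synthesis window in this sketch reproduces the interior of the original signal. For an enhanced signal, `magnitude` would be the enhanced amplitude spectra and `phase` the stored phase of the test data.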
- We provide audio demos using files from the test dataset in the presence of pedestrian (PED) noise at a signal-to-noise ratio (SNR) of 10 dB. The demos include speech from both female and male test speakers.
- We put the audio demos under the directory `./Audio_demo/`.
- We also offer the corresponding audio demos in `.wav` format and put them under the directory `./Audio_demo_wav_file/`.
If you use the losses and/or scripts in your research, please cite
@article{xu2019Comploss,
author = {Z. Xu and S. Elshamy and Z. Zhao and T. Fingscheidt},
title = {{Components Loss for Neural Networks in Mask-Based Speech Enhancement}},
journal = {arXiv preprint arXiv:1908.05087},
year = {2019},
month = {Aug.}
}
- The author would like to thank Ziyue Zhao and Samy Elshamy for their advice on setting up this source code on GitHub.