This is a curated list of awesome Speech Enhancement tutorials, papers, libraries, datasets, tools, scripts and results. The purpose of this repo is to organize the world’s resources for speech enhancement, and make them universally accessible and useful.
This repo is jointly contributed by Nana Hou (Nanyang Technological University), Meng Ge, Hao Shi (Tianjin University), Chenglin Xu (National University of Singapore), Chen Weiguang (Hunan University).
To add items to this page, simply send a pull request.
- A literature survey on single channel speech enhancement, 2020 [paper]
- Research Advances and Perspectives on the Cocktail Party Problem and Related Auditory Models, Bo Xu, 2019 [paper (Chinese)]
- Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments, Zixing Zhang, 2017 [paper]
- Supervised speech separation based on deep learning: An Overview, 2017 [paper]
- A review on speech enhancement techniques, 2015 [paper]
- Nonlinear speech enhancement: an overview, 2007 [paper]
- Speech enhancement using self-adaptation and multi-head attention, ICASSP 2020 [paper]
- PAN: phoneme-aware network for monaural speech enhancement, ICASSP 2020 [paper]
- Noise tokens: learning neural noise templates for environment-aware speech enhancement [paper]
- Speaker-aware deep denoising autoencoder with embedded speaker identity for speech enhancement, Interspeech 2019 [paper]
- Lite Audio-Visual Speech Enhancement, INTERSPEECH 2020 [paper]
- Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks, TETCI, 2018 (first audio-visual SE) [journal]
- Efficient trainable front-ends for neural speech enhancement, ICASSP 2020 [paper]
- Spectrograms fusion with minimum difference masks estimation for monaural speech dereverberation, ICASSP 2020 [paper]
- Masking and inpainting: a two-stage speech enhancement approach for low snr and non-stationary noise, ICASSP 2020 [paper]
- A composite dnn architecture for speech enhancement, ICASSP 2020 [paper]
- An attention-based neural network approach for single channel speech enhancement, ICASSP 2019 [paper]
- Multi-domain processing via hybrid denoising networks for speech enhancement, 2018 [paper]
- Speech enhancement using self-adaptation and multi-head attention, ICASSP 2020 [paper]
- Channel-attention dense u-net for multichannel speech enhancement, ICASSP 2020 [paper]
- T-GSA: transformer with gaussian-weighted self-attention for speech enhancement, ICASSP 2020 [paper]
- PAGAN: a phase-adapted generative adversarial networks for speech enhancement, ICASSP 2020 [paper]
- MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement, ICML 2019 [paper]
- Time-frequency masking-based speech enhancement using generative adversarial network, ICASSP 2018 [paper]
- SEGAN: speech enhancement generative adversarial network, Interspeech 2017 [paper]
- Speech Enhancement Based on Deep Denoising Autoencoder, INTERSPEECH 2013 (first deep learning based SE) [paper]
- Phase reconstruction based on recurrent phase unwrapping with deep neural networks, ICASSP 2020 [paper]
- PAGAN: a phase-adapted generative adversarial networks for speech enhancement, ICASSP 2020 [paper]
- Invertible dnn-based nonlinear time-frequency transform for speech enhancement, ICASSP 2020 [paper]
- Phase-aware speech enhancement with deep complex u-net, ICLR 2019 [paper] [code]
- PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network, AAAI 2020 [paper]
- MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement, ICML 2019 [paper]
- Speech denoising with deep feature losses, Interspeech 2019 [paper]
- End-to-end multi-task denoising for joint sdr and pesq optimization, Arxiv 2019 [paper]
- Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement, Arxiv 2017 [paper]
- Multiple-target deep learning for LSTM-RNN based speech enhancement, HSCMA 2017 [paper]
- Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks, ISCA 2015 [paper]
- SNR-Based Progressive Learning of Deep Neural Network for Speech Enhancement, INTERSPEECH 2016 [paper]
- Cross-language transfer learning for deep neural network based speech enhancement, ISCSLP 2014 [paper]
- Improving robustness of deep learning based monaural speech enhancement against processing artifacts, ICASSP 2020 [paper]
Link | Language | Description |
---|---|---|
SETK | Python & C++ | SETK: Speech Enhancement Tools integrated with Kaldi. |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
Beamformer | Python | Implementation of the mask-based adaptive beamformer (MVDR, GEVD, MCWF). |
Time-frequency Mask | Python | Computation of the time-frequency mask (PSM, IRM, IBM, IAM, ...) as the neural network training labels. |
SSL | Python | Implementation of Sound Source Localization. |
Data format | Python | Format conversion between Kaldi, Numpy and Matlab. |
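
The mask types named in the table above (IBM, IRM, IAM, PSM) can be illustrated with a minimal NumPy sketch. This is not the code of the linked tool: the function name `tf_masks`, the 0 dB threshold for the IBM, and the square-root power form of the IRM are assumptions chosen for illustration, and other definitions are in use. The sketch also assumes an additive mixture (mixture = speech + noise) and that the complex STFTs are already available.

```python
import numpy as np

def tf_masks(speech_stft, noise_stft, eps=1e-8):
    """Compute common training-target masks from complex STFTs.

    Hypothetical helper for illustration; assumes mixture = speech + noise.
    """
    mix_stft = speech_stft + noise_stft
    s_mag = np.abs(speech_stft)
    n_mag = np.abs(noise_stft)
    y_mag = np.abs(mix_stft)

    # Ideal binary mask: 1 where speech magnitude exceeds noise magnitude
    # (assumed 0 dB local-SNR threshold).
    ibm = (s_mag > n_mag).astype(np.float32)
    # Ideal ratio mask: one common definition, bounded in [0, 1].
    irm = np.sqrt(s_mag**2 / (s_mag**2 + n_mag**2 + eps))
    # Ideal amplitude mask: clean over noisy magnitude, not bounded above.
    iam = s_mag / (y_mag + eps)
    # Phase-sensitive mask: IAM scaled by the cosine of the phase difference.
    psm = iam * np.cos(np.angle(speech_stft) - np.angle(mix_stft))
    return {"IBM": ibm, "IRM": irm, "IAM": iam, "PSM": psm}
```

In practice the IAM and PSM are often clipped to a fixed range (for example [0, 1]) before being used as network targets; that step is left out here.
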
Link | Language | Description |
---|---|---|
PESQ etc. | Matlab | Evaluation for PESQ, CSIG, CBAK, COVL, STOI |
SNR, LSD | Python | Evaluation for signal-to-noise-ratio and log-spectral-distortion. |
SDR | Matlab | Evaluation for signal-to-distortion-ratio. |
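
As a rough illustration of the two Python metrics listed above, the sketch below computes a global signal-to-noise ratio and a log-spectral distortion with NumPy. It is not the linked script: the function names, the frame length of 512, the hop of 256, and the Hann window are assumptions, and LSD conventions (natural log vs. dB, power vs. magnitude) differ between papers.

```python
import numpy as np

def snr_db(clean, enhanced, eps=1e-10):
    """Global SNR in dB, treating (clean - enhanced) as the residual noise."""
    residual = clean - enhanced
    return 10.0 * np.log10((np.sum(clean**2) + eps) / (np.sum(residual**2) + eps))

def _log_power_spec_db(x, frame_len, hop, eps):
    # Frame the signal, apply a Hann window, and take the power spectrum in dB.
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return 10.0 * np.log10(np.abs(np.fft.rfft(frames, axis=1))**2 + eps)

def lsd_db(clean, enhanced, frame_len=512, hop=256, eps=1e-10):
    """Log-spectral distortion: RMS dB difference per frame, averaged over frames."""
    diff = (_log_power_spec_db(clean, frame_len, hop, eps)
            - _log_power_spec_db(enhanced, frame_len, hop, eps))
    return float(np.mean(np.sqrt(np.mean(diff**2, axis=1))))
```

Both functions expect time-aligned 1-D arrays of equal length, each at least `frame_len` samples long for `lsd_db`.
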
Link | Language | Description |
---|---|---|
LPS | Python | Extract log-power-spectrum/magnitude spectrum/log-magnitude spectrum/Cepstral mean and variance normalization. |
MFCC | Python | This library provides common speech features for ASR including MFCCs and filterbank energies. |
pyAudioAnalysis | Python | Python Audio Analysis Library: Feature Extraction, Classification, Segmentation and Applications. |
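
The log-power-spectrum features with mean and variance normalization mentioned above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the linked extractor: `log_power_spectrum` and `cmvn` are hypothetical names, the frame settings (512-sample Hann window, hop of 256) are assumptions, and the normalization here is per utterance, whereas real pipelines often use global statistics from the training set.

```python
import numpy as np

def log_power_spectrum(signal, frame_len=512, hop=256, eps=1e-10):
    """Return an (n_frames, frame_len // 2 + 1) matrix of log-power spectra."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    # eps avoids log(0) for silent bins.
    return np.log(np.abs(spec)**2 + eps)

def cmvn(features, eps=1e-8):
    """Per-utterance mean and variance normalization along the time axis."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)
```

For one second of 16 kHz audio with these settings, `log_power_spectrum` yields 61 frames of 257 bins each.
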
Link | Language | Description |
---|---|---|
Data simulation | Python | Add reverberation, noise or mix speaker. |
RIR simulation | Python | Generation of the room impulse response (RIR) using the image method. |
pyroomacoustics | Python | Pyroomacoustics is a package for audio signal processing for indoor applications. |
gpuRIR | Python | Python library for Room Impulse Response (RIR) simulation with GPU acceleration |
rir_simulator_python | Python | Room impulse response simulator using python |
audiomentations | Python | A Python library for audio data augmentation, e.g. time stretch, pitch shift, add noise, add room reverberation |
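
Adding noise to clean speech at a chosen SNR, as several of the tools above do, comes down to scaling the noise before summing. The sketch below shows that step with NumPy under stated assumptions: `mix_at_snr` is a hypothetical helper, not the API of any listed library, the SNR is computed over the whole utterance, and a noise clip that is too short is simply tiled.

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db, eps=1e-10):
    """Mix noise into speech so the utterance-level SNR equals target_snr_db.

    Returns the noisy mixture and the scaled noise that was added.
    """
    # Tile the noise if it is shorter than the speech, then crop to length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]

    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2) + eps
    # Choose the gain so that speech_power / (gain^2 * noise_power) = 10^(SNR/10).
    gain = np.sqrt(speech_power / (noise_power * 10.0**(target_snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise
```

A training set is typically built by calling such a function over many speech and noise pairs with SNRs drawn from a fixed set or range; picking a random noise offset instead of always cropping from the start is a common refinement.
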
Name | Utterances | Speakers | Language | Pricing | Additional information |
---|---|---|---|---|---|
Dataset by University of Edinburgh (2016) | 35K+ | 86 | English | Free | Noisy speech database for training speech enhancement algorithms and TTS models. |
TIMIT (1993) | 6K+ | 630 | English | $250.00 | The TIMIT corpus of read speech is one of the earliest speaker recognition datasets. |
VCTK (2009) | 43K+ | 109 | English | Free | Most were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. |
WSJ0 (1993) | -- | 149 | English | $1500 | The WSJ database was generated from a machine-readable corpus of Wall Street Journal news text. |
REVERB (2014) | - | 8K+ | English | Free | This corpus is from the REVERB 2014 challenge. The challenge assumes the scenario of capturing utterances spoken by a single stationary distant-talking speaker with 1-channel (1ch), 2-channel (2ch) or 8-channel (8ch) microphone arrays in reverberant meeting rooms. It features both real recordings and simulated data, a part of which simulates the real recordings. |
LibriSpeech (2015) | 292K | 2K+ | English | Free | Large-scale (1000 hours) corpus of read English speech. |
CHiME series (~2020) | -- | -- | English | Free | The databases are published by the CHiME Speech Separation and Recognition Challenge. |
Name | Noise types | Pricing | Additional information |
---|---|---|---|
DEMAND (2013) | 18 | Free | Diverse Environments Multichannel Acoustic Noise Database provides a set of recordings that allow testing of algorithms using real-world noise in a variety of settings. |
115 Noise (2015) | 115 | Free | A noise bank for simulating noisy data from clean speech. Noises N1-N100 were collected by Guoning Hu; the other 15 are home-made noise types from USTC. |
NoiseX-92 (1996) | 15 | Free | Database of recordings of various noises, available on 2 CD-ROMs. |
RIR_Noises (2017) | - | Free | A database of simulated and real room impulse responses, isotropic and point-source noises. The audio files are all at a 16 kHz sampling rate with 16-bit precision. The data includes all the room impulse responses (RIRs) and noises used in the paper "A Study on Data Augmentation of Reverberant Speech for Robust Speech Recognition" submitted to ICASSP 2017: the real RIRs and isotropic noises from the RWCP sound scene database, the 2014 REVERB challenge database and the Aachen impulse response database (AIR), the simulated RIRs generated by the paper's authors, and the point-source noises extracted from the MUSAN corpus. |
SOTA results on the dataset by University of Edinburgh. The following methods are all trained on "trainset_28spk" and tested on the common test set. ("F" denotes frequency domain and "T" denotes time domain.)
Methods | Publish | Domain | PESQ | CSIG | CBAK | COVL | SegSNR | STOI |
---|---|---|---|---|---|---|---|---|
Noisy | -- | -- | 1.97 | 3.35 | 2.44 | 2.63 | 1.68 | 0.91 |
Wiener | -- | -- | 2.22 | 3.23 | 2.68 | 2.67 | 5.07 | -- |
SEGAN | INTERSPEECH 2017 | T | 2.16 | 3.48 | 2.94 | 2.80 | 7.73 | 0.93 |
CNN-GAN | APSIPA 2018 | F | 2.34 | 3.55 | 2.95 | 2.92 | -- | 0.93 |
WaveUnet | arxiv 2018 | T | 2.40 | 3.52 | 3.24 | 2.96 | 9.97 | -- |
WaveNet | ICASSP 2018 | T | -- | 3.62 | 3.24 | 2.98 | -- | -- |
U-net | ISMIR 2017 | F | 2.48 | 3.65 | 3.21 | 3.05 | 9.34 | -- |
MSE-GAN | ICASSP 2018 | F | 2.53 | 3.80 | 3.12 | 3.14 | -- | 0.93 |
DFL | INTERSPEECH 2019 | T | -- | 3.86 | 3.33 | 3.22 | -- | -- |
DFL reimplemented | ICLR 2019 | T | 2.51 | 3.79 | 3.27 | 3.14 | 9.86 | -- |
TasNet | TASLP 2019 | T | 2.57 | 3.80 | 3.29 | 3.18 | 9.65 | -- |
MDPhD | arxiv 2018 | T&F | 2.70 | 3.85 | 3.39 | 3.27 | 10.22 | -- |
Complex U-net | ICLR 2019 | F | 3.24 | 4.34 | 4.10 | 3.81 | 16.85 | -- |
Complex U-net reimplemented | arxiv 2019 | F | 2.87 | 4.12 | 3.47 | 3.51 | 9.96 | -- |
SDR-PESQ | arxiv 2019 | F | 3.01 | 4.09 | 3.54 | 3.55 | 10.44 | -- |
T-GSA | ICASSP 2020 | F | 3.06 | 4.18 | 3.59 | 3.62 | 10.78 | -- |
RHRnet | ICASSP 2020 | T | 3.20 | 4.37 | 4.02 | 3.82 | 14.71 | 0.98 |
- A speech enhancement Android app from Prof. Yu Tsao's group [download][video][Github]
- Audio Source Separation and Speech Enhancement, Emmanuel Vincent, 2019 [link]
- A Study on WaveNet, GANs and General CNN-RNN Architectures, 2019 [link]
- Deep learning: method and applications, 2016 [link]
- Deep learning by Ian Goodfellow and Yoshua Bengio and Aaron Courville, 2016 [link]
- Robust automatic speech recognition by Jinyu Li and Li Deng, 2015 [link]
- CCF speech seminar 2020 [link]
- Real-time Single-channel Speech Enhancement with Recurrent Neural Networks by Microsoft Research, 2019 [link]
- Deep learning in speech by Hongyi Li, 2019 [link]
- High-Accuracy Neural-Network Models for Speech Enhancement, 2017 [link]
- DNN-Based Online Speech Enhancement Using Multitask Learning and Suppression Rule Estimation, 2015 [link]
- Microphone array signal processing: beyond the beamformer, 2011 [link]
- Intelligibility Evaluation and Speech Enhancement based on Deep Learning by Yu Tsao, (INTERSPEECH 2020 tutorial) [link] [video]
- Speech Enhancement based on Deep Learning and Intelligibility Evaluation by Yu Tsao, (APSIPA 2019 tutorial) [link]
- Deep learning in speech by Hongyi Li, 2019 [link]
- Learning-based approach to speech enhancement and separation (INTERSPEECH tutorial, 2016) [link]
- Deep learning for speech/language processing (INTERSPEECH tutorial by Li Deng, 2015) [link]
- Speech enhancement algorithms (Stanford University, 2013) [link]