Demo samples for our paper "Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training". If you have any questions, feel free to contact us (zhangpeng2018@ia.ac.cn, 1002434886@qq.com).
Audio-visual speech separation (AVSS) refers to separating an individual voice from an audio mixture of multiple simultaneous talkers by conditioning on visual features. Visual features play an important role in the AVSS task, and we aim to extract more effective visual features to improve performance. In this paper, we propose a novel AVSS model that uses speech-related visual features to isolate the target speaker. Specifically, the extraction of speech-related visual features has two steps. First, we extract visual features that contain speech-related information by learning a joint audio-visual representation. Second, we use adversarial training to further enhance the speech-related information in the visual features. We adopt a time-domain approach and build the audio-visual speech separation network with temporal convolutional network (TCN) blocks. Experiments on audio-visual datasets, including GRID, TCD-TIMIT, AVSpeech, and LRS2, show that our model significantly outperforms previous state-of-the-art AVSS models. We also demonstrate that our model achieves excellent speech separation performance in noisy real-world scenarios. Moreover, to alleviate the performance degradation of AVSS models caused by missing video frames, we propose a training strategy that makes our model robust when video frames are partially missing.
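To make the adversarial enhancement step concrete, below is a generic sketch of one way such an objective can be set up: a discriminator tries to tell audio-derived embeddings from visual-derived ones, while the visual encoder is trained to fool it so that its output carries more speech-related information. The module shapes, the discriminator's task, and the losses are illustrative assumptions, not the exact formulation in the paper.

```python
# Generic adversarial sketch: the discriminator tries to tell audio-derived
# embeddings (label 1) from visual-derived embeddings (label 0), and the
# visual encoder is trained to fool it. Shapes, modules, and losses are
# illustrative assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn

embed_dim = 256                                            # placeholder size
visual_encoder = nn.GRU(512, embed_dim, batch_first=True)  # placeholder front-end
discriminator = nn.Sequential(
    nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1))

bce = nn.BCEWithLogitsLoss()
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
opt_v = torch.optim.Adam(visual_encoder.parameters(), lr=1e-4)

def adversarial_step(visual_frames, audio_embed):
    # visual_frames: (B, T, 512) lip-region features from the visual front-end
    # audio_embed:   (B, T, 256) embeddings from a (frozen) audio encoder
    visual_embed, _ = visual_encoder(visual_frames)

    # 1) Update the discriminator to separate the two embedding domains.
    d_real = discriminator(audio_embed.detach())
    d_fake = discriminator(visual_embed.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Update the visual encoder so its embeddings look "audio-like",
    #    i.e. carry more speech-related information.
    g_out = discriminator(visual_embed)
    g_loss = bce(g_out, torch.ones_like(g_out))
    opt_v.zero_grad()
    g_loss.backward()
    opt_v.step()
    return d_loss.item(), g_loss.item()
```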
The structure of the visual front-end model.
The networks can be easily built based on our paper.
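As a starting point, here is a minimal, illustrative PyTorch sketch of a time-domain audio-visual separator built from TCN blocks: an audio encoder, a simple gated audio-visual fusion, a TCN mask estimator, and a decoder. Layer sizes, the fusion scheme, and module names are placeholders rather than the exact configuration from the paper.

```python
# Minimal, illustrative sketch of a time-domain audio-visual separator
# built from temporal convolutional (TCN) blocks. Layer sizes and names
# are placeholders, not the exact configuration from the paper.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """1-D depthwise-separable conv block with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation, groups=channels),
            nn.PReLU(),
            nn.BatchNorm1d(channels),
            nn.Conv1d(channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.net(x)

class AVSeparator(nn.Module):
    """Encoder -> audio-visual fusion -> TCN mask estimator -> decoder."""
    def __init__(self, enc_dim=256, visual_dim=512, num_blocks=8):
        super().__init__()
        self.encoder = nn.Conv1d(1, enc_dim, kernel_size=16, stride=8)
        self.visual_proj = nn.Conv1d(visual_dim, enc_dim, kernel_size=1)
        self.tcn = nn.Sequential(
            *[TCNBlock(enc_dim, dilation=2 ** i) for i in range(num_blocks)])
        self.mask = nn.Sequential(nn.Conv1d(enc_dim, enc_dim, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(enc_dim, 1, kernel_size=16, stride=8)

    def forward(self, mixture, visual_feats):
        # mixture: (B, 1, T_samples); visual_feats: (B, visual_dim, T_video)
        a = self.encoder(mixture)                              # (B, enc_dim, T_frames)
        v = torch.nn.functional.interpolate(
            self.visual_proj(visual_feats), size=a.shape[-1])  # align in time
        fused = a * torch.sigmoid(v) + a                       # simple gated fusion (placeholder)
        est = self.mask(self.tcn(fused)) * a                   # masked encoding
        return self.decoder(est)                               # (B, 1, ~T_samples)
```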
- GRID [paper] [dataset page]
- TCD-TIMIT [paper] [dataset page]
- AVSpeech [paper] [dataset page]
- LRS2 [paper] [dataset page]
The method of generating training, validation, and test samples is detailed in our paper.
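For reference, the following is a minimal sketch of how two-speaker training mixtures are commonly generated for speech separation (sum two utterances and peak-normalize); the file paths and the mixing scheme are assumptions, not the exact recipe from the paper.

```python
# A minimal sketch of two-speaker mixture generation for speech separation;
# the paths and the mixing scheme are assumptions, not the paper's recipe.
import numpy as np
import soundfile as sf

def make_mixture(target_path, interferer_path, out_path):
    s1, sr = sf.read(target_path)       # target speaker utterance
    s2, _ = sf.read(interferer_path)    # interfering speaker utterance
    n = min(len(s1), len(s2))           # truncate to the shorter utterance
    mixture = s1[:n] + s2[:n]           # sum the two sources
    mixture /= max(1e-8, np.abs(mixture).max())  # peak-normalize to avoid clipping
    sf.write(out_path, mixture, sr)
    return mixture, s1[:n], s2[:n]      # mixture plus reference sources
```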
We provide many samples, both from the standard datasets and recorded in a real-world environment.
- Listen to and watch the samples recorded in a real-world environment at ./samples/samples of real-world environment.
- Listen to the samples from the standard datasets at ./samples/sample of standard dataset.
Spectrogram samples recorded in a real-world environment.
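If you want to reproduce spectrogram figures like these from the audio samples, a small plotting sketch is given below; the file name and STFT parameters are placeholders.

```python
# A small sketch for plotting a log-magnitude spectrogram of a sample;
# the file name and STFT parameters are placeholders.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

wav, sr = librosa.load("mixture.wav", sr=None)               # keep native sample rate
spec = np.abs(librosa.stft(wav, n_fft=512, hop_length=160))  # magnitude spectrogram
plt.figure(figsize=(8, 3))
librosa.display.specshow(librosa.amplitude_to_db(spec, ref=np.max),
                         sr=sr, hop_length=160, x_axis="time", y_axis="hz")
plt.colorbar(format="%+2.0f dB")
plt.title("Log-magnitude spectrogram")
plt.tight_layout()
plt.savefig("spectrogram.png")
```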
If you find this repo helpful, please consider citing:
@inproceedings{zhang2021avss,
title={Audio-Visual Speech Separation with Visual Features Enhanced by Adversarial Training},
author={Zhang, Peng and Xu, Jiaming and Shi, Jing and Hao, Yunzhe and Qin, Lei and Xu, Bo},
booktitle={Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
year={2021},
organization={IEEE}
}