
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

1. Dependencies

Install the required Python packages:

pip install -r requirements.txt
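
For example, in a fresh virtual environment (the environment name below is arbitrary):

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt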

2. Quick Start

Download the pre-trained model from TsinghuaCloud, GoogleDrive or BaiduNetDisk and put it into My_model/my_demo.

Download the SpeechSplit pre-trained models (pitch decoder 640000-P.ckpt and vocoder checkpoint_step001000000_ema.pth) from here.

Then cd My_model and modify the paths in demo.py to your own.

Run python demo.py and you will get the converted audio (.wav) in /my_demo, similar to test_result.

You can also choose the conditions in demo.py.
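
Putting the quick start together (a minimal sketch; the checkpoints can live anywhere, as long as the paths in demo.py point to them):

cd My_model
python demo.py   # converted .wav files are written to my_demo/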

3. Preparation, Training and Inference

Download the VCTK dataset.

Extract spectrograms and F0: make_spect_f0.py.

Modify the paths to your own and split the dataset: data_split.py.

Generate training metadata: make_metadata.py.

Run the training script: main.py.

Generate testing metadata: make_test_metadata.py.

Run the inference script: inference.py.
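
The whole pipeline then looks roughly as follows, assuming the paths inside each script have been edited first (this sketch assumes the scripts are run without command-line arguments):

python make_spect_f0.py       # extract spectrograms and F0 from VCTK
python data_split.py          # split the dataset
python make_metadata.py       # training metadata
python main.py                # training
python make_test_metadata.py  # testing metadata
python inference.py           # inference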

4. Evaluation

You may refer to the following scripts: WER.py (word error rate), mcd.py (mel-cepstral distortion), f0_pcc.py (F0 Pearson correlation coefficient), draw_f0_distributions.py, and draw_speaker_embedding.py.
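
As an illustration of one of these metrics, below is a minimal F0-PCC sketch between a source and a converted utterance, assuming pyworld for F0 extraction and scipy for the correlation; it only indicates what f0_pcc.py measures and is not the repository's implementation.

# Minimal F0-PCC sketch (illustrative; the repository's f0_pcc.py may differ).
import numpy as np
import librosa
import pyworld
from scipy.stats import pearsonr

def extract_f0(wav_path, sr=16000, frame_period_ms=5.0):
    """F0 contour with WORLD (DIO + StoneMask refinement)."""
    wav, _ = librosa.load(wav_path, sr=sr)
    wav = wav.astype(np.float64)
    f0, t = pyworld.dio(wav, sr, frame_period=frame_period_ms)
    return pyworld.stonemask(wav, f0, t, sr)

def f0_pcc(source_wav, converted_wav):
    """Pearson correlation of F0 over frames voiced in both utterances."""
    f0_src, f0_cvt = extract_f0(source_wav), extract_f0(converted_wav)
    n = min(len(f0_src), len(f0_cvt))        # crude length alignment
    f0_src, f0_cvt = f0_src[:n], f0_cvt[:n]
    voiced = (f0_src > 0) & (f0_cvt > 0)     # drop unvoiced frames
    corr, _ = pearsonr(f0_src[voiced], f0_cvt[voiced])
    return corr

For example, f0_pcc('source.wav', 'converted.wav') returns a value in [-1, 1]; higher values indicate that the converted speech better preserves the source F0 variation.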

5. Acknowledgement and References

This work is supported by the National Natural Science Foundation of China (NSFC) (62076144), the National Social Science Foundation of China (NSSF) (13&ZD189), and the Shenzhen Key Laboratory of Next Generation Interactive Media Innovative Technology (ZDSYS20210623092001004).

Our work is mainly inspired by:

(1) SpeechSplit:

K. Qian, Y. Zhang, S. Chang, M. Hasegawa-Johnson, and D. Cox, “Unsupervised speech decomposition via triple information bottleneck,” in International Conference on Machine Learning. PMLR, 2020, pp. 7836–7846.

(2) VQMIVC:

D. Wang, L. Deng, Y. T. Yeung, X. Chen, X. Liu, and H. Meng, “VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion,” in Interspeech, 2021, pp. 1344–1348.

(3) ClsVC:

H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “ClsVC: Learning speech representations with two different classification tasks,” OpenReview, 2021, https://openreview.net/forum?id=xp2D-1PtLc5.

6. Citation

If you find our work useful in your research, please consider citing:

@inproceedings{yang22f_interspeech,
  author={SiCheng Yang and Methawee Tantrawenith and Haolin Zhuang and Zhiyong Wu and Aolan Sun and Jianzong Wang and Ning Cheng and Huaizhen Tang and Xintao Zhao and Jie Wang and Helen Meng},
  title={{Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion}},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={2553--2557},
  doi={10.21437/Interspeech.2022-571}
}

Some results can be found here. Please feel free to contact us (yangsc21@mails.tsinghua.edu.cn) with any questions or concerns.
