Implementation of DCComix TTS: An End-to-End Expressive TTS with Discrete Code Collaborated with Mixer. Accepted to Interspeech 2023. Audio samples/demo for this system are here
Abstract: Despite the huge successes made in neutral TTS, content-leakage remains a challenge. In this paper, we propose a new input representation and simple architecture to achieve improved prosody modeling. Inspired by the recent success in the use of discrete code in TTS, we introduce discrete code to the input of the reference encoder. Specifically, we leverage the vector quantizer from the audio compression model to exploit the diverse acoustic information it has already been trained on. In addition, we apply the modified MLP-Mixer to the reference encoder, making the architecture lighter. As a result, we train the prosody transfer TTS in an end-to-end manner. We prove the effectiveness of our method through both subjective and objective evaluations. We demonstrate that the reference encoder learns better speaker-independent prosody when discrete code is utilized as input in the experiments. In addition, we obtain comparable results even when fewer parameters are inputted.
- This repository leverages NeMo for the VITS and MixerTTS implementations.
- We use Encodec for the discrete codes.
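To give a sense of what the discrete code input looks like, here is a rough sketch of the size of an Encodec code sequence. It relies on the published figures for the 24 kHz Encodec model (a hop of 320 samples, i.e. 75 code frames per second, and 1024-entry codebooks, i.e. 10 bits per code). The function name `code_shape` and the default bandwidth of 6 kbps are illustrative assumptions, not settings taken from this repository.

```python
import math
from typing import Tuple

SAMPLE_RATE = 24_000   # Encodec 24 kHz model
HOP_LENGTH = 320       # samples per code frame, i.e. 75 frames per second
BITS_PER_CODE = 10     # each codebook has 1024 entries


def code_shape(num_samples: int, bandwidth_kbps: float = 6.0) -> Tuple[int, int]:
    """Approximate (num_codebooks, num_frames) for a mono clip (hypothetical helper)."""
    frame_rate = SAMPLE_RATE / HOP_LENGTH                    # 75.0 frames per second
    bits_per_frame = bandwidth_kbps * 1000 / frame_rate      # bit budget per frame
    num_codebooks = int(round(bits_per_frame / BITS_PER_CODE))
    num_frames = math.ceil(num_samples / HOP_LENGTH)         # partial last frame is padded
    return num_codebooks, num_frames


if __name__ == "__main__":
    print(code_shape(24_000))        # one second of audio -> (8, 75)
    print(code_shape(48_000, 1.5))   # two seconds at 1.5 kbps -> (2, 150)
```

Under these assumptions, a reference utterance of a few seconds becomes a small integer matrix (codebooks by frames) rather than a mel-spectrogram, which is the kind of input the reference encoder consumes.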
- python ≥ 3.8
- pytorch 1.11.0+cu113
- nemo_toolkit 1.18.0
See requirements.txt for other libraries
- prepare data (VCTK)
python preprocess/make_manifest.py
- Note that we resample the VCTK audio to 24 kHz to match the sampling rate expected by Encodec
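For orientation, NeMo reads training data from a JSON-lines manifest in which each line describes one utterance, typically with `audio_filepath`, `text`, and `duration` fields. The sketch below shows one way such a file could be produced; it is not the contents of `preprocess/make_manifest.py`. The function name `build_manifest`, the `<speaker>/<utterance>.wav` and `<speaker>/<utterance>.txt` directory layout, the string-valued `speaker` field, and the use of WAV files that have already been resampled are all assumptions made for illustration.

```python
import json
import wave
from pathlib import Path


def build_manifest(wav_root: str, txt_root: str, out_path: str) -> int:
    """Write one JSON object per line; return the number of entries written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        # Assumed layout: <wav_root>/<speaker>/<utt>.wav and <txt_root>/<speaker>/<utt>.txt
        for wav_path in sorted(Path(wav_root).glob("*/*.wav")):
            speaker = wav_path.parent.name
            txt_path = Path(txt_root) / speaker / (wav_path.stem + ".txt")
            if not txt_path.exists():
                continue  # skip clips without a transcript
            with wave.open(str(wav_path), "rb") as wav:
                duration = wav.getnframes() / wav.getframerate()
            entry = {
                "audio_filepath": str(wav_path),
                "text": txt_path.read_text(encoding="utf-8").strip(),
                "duration": round(duration, 3),
                "speaker": speaker,  # assumed field; the repository may use integer ids
            }
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

A call such as `build_manifest("wav24", "txt", "train_manifest.json")` (hypothetical paths) would then yield one line per utterance that has a matching transcript.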
- preprocessing
- text normalization
python torchdata/text_preprocess.py
- run train.py
- for dc-comix-tts: useref_mixer_codec_vits.yaml
- for
@software{Harper_NeMo_a_toolkit,
author = {Harper, Eric and Majumdar, Somshubra and Kuchaiev, Oleksii and Jason, Li and Zhang, Yang and Bakhturina, Evelina and Noroozi, Vahid and Subramanian, Sandeep and Nithin, Koluguri and Jocelyn, Huang and Jia, Fei and Balam, Jagadeesh and Yang, Xuesong and Livne, Micha and Dong, Yi and Naren, Sean and Ginsburg, Boris},
title = {{NeMo: a toolkit for Conversational AI and Large Language Models}},
url = {https://github.com/NVIDIA/NeMo}
}
@article{defossez2022highfi,
title={High Fidelity Neural Audio Compression},
author={Défossez, Alexandre and Copet, Jade and Synnaeve, Gabriel and Adi, Yossi},
journal={arXiv preprint arXiv:2210.13438},
year={2022}
}