New pretrained synthesizer model (tensorflow) #538
This model still has the occasional attention failure. However, this is not caused by Corentin's modifications to Rayhane's taco2. I have studied the differences line by line and concluded that no error was introduced. Rather, I think the attention problems are inherent to the SV2TTS architecture, particularly because the speaker embedding is an input to the attention mechanism. Attention is problematic even in single-speaker Tacotrons, and it gets worse in the multispeaker case due to the speaker embedding concatenation. This highlights the need for a better attention mechanism for SV2TTS.
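To make the point about the concatenation concrete, here is a minimal numpy sketch of how SV2TTS conditions attention on the speaker (shapes are illustrative only; the actual synthesizer does this in TensorFlow inside the Tacotron graph):

```python
# Minimal sketch: the speaker embedding is tiled across encoder timesteps and
# concatenated to every encoder output frame, so the attention memory (keys /
# values) carries speaker information at each step. Sizes below are made up.
import numpy as np

T_enc, enc_dim, spk_dim = 120, 512, 256
encoder_outputs = np.random.randn(T_enc, enc_dim)   # one utterance, no batch dim
speaker_embedding = np.random.randn(spk_dim)        # d-vector from the speaker encoder

tiled = np.tile(speaker_embedding, (T_enc, 1))                         # (T_enc, spk_dim)
attention_memory = np.concatenate([encoder_outputs, tiled], axis=-1)   # (T_enc, enc_dim + spk_dim)

print(attention_memory.shape)  # (120, 768): attention attends over speaker-conditioned memory
```

Because every alignment score now also depends on the speaker embedding, attention has to cope with that extra variation, which is why I think the failures get worse in the multispeaker case.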
Amazing work! Thanks @blue-fish!
Why are your Dropbox links not working?
Trained on LibriSpeech, using the current synthesizer (tensorflow). This performs similarly to the current model, with fewer random gaps appearing in the middle of synthesized utterances. It handles short input texts better too.
Download link: https://www.dropbox.com/s/3kyjgew55c4yxtf/librispeech_270k_tf.zip?dl=0
Unzip the file and move the `logs-pretrained` folder to `synthesizer/saved_models`.

I am not going to provide scripts to reproduce the training. For anyone interested, you will need to curate LibriSpeech to have more consistent prosody. This is what I did when running `synthesizer_preprocess_audio.py`:
1. Use `silence_min_duration_split=0.05`.
2. Run `encoder.preprocess_wav()` on each wav; this uses voice activity detection to trim silences (see Trim silences during synthesizer preprocess #501). Compare the lengths of the "before" and "after" wavs: if they don't match, a silence was detected and the utterance is discarded. I keep the "before" wav when the lengths match. (A rough sketch of this filtering follows the list.)
3. Edit `datasets_root/SV2TTS/synthesizer/train.txt` to include only utterances between 225 and 600 mel frames (2.8 to 7.5 sec). This leaves 48 hours of training data.
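For anyone who wants to attempt this, below is a rough Python sketch of the two filters, not my actual script. The `encoder.audio` import path, the assumption that the mel frame count is the fifth field of `train.txt`, and the 12.5 ms hop behind the 2.8 to 7.5 sec figure are things to verify against your own checkout.

```python
# Rough sketch of the curation described above (not the exact script used).
import librosa
from encoder.audio import preprocess_wav  # VAD-based trimming; import path may differ

def has_no_silence(wav_fpath, sampling_rate=16000):
    """Keep a wav only if VAD trimming does not shorten it (no silence detected)."""
    wav, _ = librosa.load(str(wav_fpath), sr=sampling_rate)
    trimmed = preprocess_wav(wav, source_sr=sampling_rate)
    # If the lengths differ, a silence was detected and trimmed -> discard the utterance.
    return len(trimmed) == len(wav)

def filter_train_txt(in_path, out_path, min_frames=225, max_frames=600):
    """Keep only metadata rows whose mel length falls within [min_frames, max_frames]."""
    with open(in_path, encoding="utf-8") as f:
        rows = [line.strip().split("|") for line in f if line.strip()]
    # rows[i][4] is assumed to be the mel frame count; check your train.txt columns.
    kept = [r for r in rows if min_frames <= int(r[4]) <= max_frames]
    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines("|".join(r) + "\n" for r in kept)
```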