Training a new model based on LibriTTS #449
@mbdash I just noticed this. This would be a really nice contribution if you are up for it! On the pretrained models page it says the synthesizer was trained in a week on 4 GPUs (1080 Ti). If you are not willing to tie up your GPU for a full month, it will still be helpful if you can get to a partially trained model with intelligible speech so others can continue training and fine-tuning. Training instructions for the synthesizer:
You can quit and resume training at any time, though you will lose all progress since the last checkpoint. It will be interesting to see how well it does with default hparams.
From what I understand, LibriTTS offers several advantages over LibriSpeech:
We should consider updating the hparams so we can ultimately generate 24 kHz audio from this: Real-Time-Voice-Cloning/synthesizer/hparams.py Lines 113 to 117 in 054f16e
@CorentinJ also suggests reducing the max allowable utterance duration (these hparams are used in synthesizer/preprocess.py): Real-Time-Voice-Cloning/synthesizer/hparams.py Lines 95 to 103 in 054f16e
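For anyone following along, here is a minimal sketch of what those two sets of overrides might look like for 24 kHz LibriTTS. The parameter names follow synthesizer/hparams.py, but the values below are illustrative assumptions, not the repo's confirmed settings:

```python
# Hypothetical hparam overrides for 24 kHz LibriTTS training.
# Values are illustrative assumptions; check synthesizer/hparams.py for the real ones.
hparams_overrides = {
    # Audio settings for 24 kHz output:
    "sample_rate": 24000,   # LibriTTS is distributed at 24 kHz
    "hop_size": 300,        # ~12.5 ms frame shift at 24 kHz
    "win_size": 1200,       # ~50 ms analysis window at 24 kHz
    "fmax": 12000,          # upper mel frequency, bounded by Nyquist

    # Dataset filtering, as used by synthesizer/preprocess.py:
    "max_mel_frames": 900,          # drop utterances longer than this
    "utterance_min_duration": 1.6,  # seconds; discard very short clips
}
```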
I don't have any solutions for the other suggestions mentioned (switching attention paradigm, removing speakers with bad prosody): #364 (comment)
Ok,
Update 2020-07-25 22h20 EST: step 5 (Generate mel spectrograms for training). For posterity, note a typo in the command in step 5: the flag is missing an "s" and should read "--datasets_name".
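For reference, the corrected command for step 5 presumably looks something like the following. The paths and dataset name are placeholders; check the script's argparse options before running:

```bash
# Corrected flag name: --datasets_name (with the "s").
python synthesizer_preprocess_audio.py <datasets_root> --datasets_name libritts
```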
Thanks for the update and correction. Let's run training with the default hparams. We're already switching from LibriSpeech to LibriTTS and it's best to only change one parameter at a time.
Hi, I have an error because synthesizer_preprocess_embeds.py wants a pretrained model. I fail to understand why we need to provide a pretrained model when trying to train from scratch, but I will stick in the latest pretrained model until told otherwise.
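For anyone hitting the same error, a hedged example of pointing the embedding preprocessing at the pretrained encoder; the -e flag and paths are assumptions based on the script's usual interface:

```bash
# The synthesizer root is the output of the audio preprocessing step;
# -e points at the pretrained encoder checkpoint used to compute the embeddings.
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer \
    -e encoder/saved_models/pretrained.pt
```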
@mbdash Look at the middle part of the image here and hopefully it will make more sense why the pretrained encoder model is needed to generate embeddings for synthesizer training: #30 (comment) Please speak up if it still doesn't make sense. Think of the synthesizer as a black box with 2 inputs: an embedding, and text to synthesize. Different speakers sound different even when speaking the same text. The synthesizer uses the embedding to impart that voice information in the mel spectrogram that it produces as output. The synthesizer gets the embedding from the encoder, which in turn can be thought of as a black box that turns a speaker's wav data into an embedding. So you need to run the encoder model to get the embedding, and you get the error message because it can't find the model.
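As a rough sketch of that data flow, assuming the inference API used in demo_cli.py (this is the inference path, not the training loop, and the Synthesizer constructor argument differs between the TensorFlow version, which takes a checkpoint directory, and the later PyTorch version, which takes a .pt file):

```python
# Sketch of the encoder -> synthesizer data flow described above.
# Paths are placeholders; API details may differ between repo versions.
from pathlib import Path
from encoder import inference as encoder
from synthesizer.inference import Synthesizer

# The pretrained encoder turns a speaker's wav into a fixed-size embedding...
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
wav = encoder.preprocess_wav(Path("some_speaker.wav"))
embed = encoder.embed_utterance(wav)

# ...and the synthesizer consumes that embedding plus text to produce a mel spectrogram.
synthesizer = Synthesizer(Path("synthesizer/saved_models/logs-libritts/"))
mel = synthesizer.synthesize_spectrograms(["Hello world."], [embed])[0]
```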
Ok, great. I opened the image but I need slightly more coffee to really look at it ;-) Thanks for the quick response.
Wow, that is fast. At that rate it will take just over 4 days to reach the 278k steps of the current model. And it will train even faster as the model gets better. Please share some Griffin-Lim wavs when they become intelligible.
Step 2850 @ 13h15 EST
This seems to be a bottleneck; is the data on an external drive? I'm averaging about 14 sec for batch generation on a slow CPU, but the data lives on an SSD.
If that's a typical batch generation time now, 2.3 sec for 64 batches is just 0.036 sec per step, or 1 hour over 100,000 steps. Not worth it to transfer the data over to the SSD, in my opinion.
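The arithmetic behind that estimate, as a quick sanity check:

```python
# Quick sanity check of the estimate above.
batch_gen_time = 2.3                      # seconds to generate 64 batches
per_step = batch_gen_time / 64            # ~0.036 s of data-loading overhead per step
total_hours = per_step * 100_000 / 3600   # ~1 hour of overhead over 100k steps
print(f"{per_step:.3f} s/step, {total_hours:.2f} h per 100k steps")
```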
Check out the training logs area: the files in the …
rtvc_libritts_s_mdl @ 10k steps. Cheers!
Overall, the synthesizer training seems to be progressing nicely! I'll be interested to see as many plots and wavs as you care to share, but otherwise it's a lot of waiting now. It would be nice if you can share in-work checkpoints, say starting at 100k and every 50k steps after that. Or generate some samples using the toolbox. I've never trained from the start and it would be interesting to see the progression.
rtvc_libritts_s_mdl @ 20k steps in ~6h
I used the original pretrained models (hereafter, LibriSpeech_278k) to synthesize the same utterance as the 20k example, also inverting it with Griffin-Lim. The clarity is about the same but there is less harshness with LibriSpeech_278k (not sure what the correct technical term for that is). "When he spoke of the execution he wanted to pass over the horrible details, but Natasha insisted that he should not omit anything." You can definitely hear more of a pause after "details" in the 20k wav so the new model is learning how to deal with punctuation! 4592_22178_000024_000001.zip
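For anyone who wants to invert a saved mel the same way outside the toolbox, here is a generic Griffin-Lim sketch using librosa. This is not the repo's own audio code: the parameters must match the hparams used to generate the mel, and the repo stores normalized mels, so a denormalization step would be needed first.

```python
# Generic Griffin-Lim inversion of a mel spectrogram with librosa.
# NOT the repo's audio pipeline; parameters must match the synthesizer hparams,
# and a normalized/dB-scaled mel must be converted back to linear magnitude first.
import numpy as np
import librosa
import soundfile as sf

mel = np.load("mel-example.npy")  # shape (n_mels, frames), linear magnitude assumed

wav = librosa.feature.inverse.mel_to_audio(
    mel,
    sr=16000, n_fft=800, hop_length=200, win_length=800,
    power=1.0,   # magnitude (not power) mel, an assumption
    n_iter=60,   # Griffin-Lim iterations
)
sf.write("griffin_lim_output.wav", wav, 16000)
```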
rtvc_libritts_s_mdl @ 74k steps in ~21h
@mbdash From that batch I find the 50k sample remarkable. Your LibriTTS-based model is much closer to the ground truth, capturing the effect of the 3 commas and question mark on prosody. For this one clip I say your model performs better than LibriSpeech_278k but it will be interesting to see how well the model generalizes to new voices (embeddings) unseen during training.
Yes, I keep listening to them, paying attention to details, and I can clearly hear the TTS using the punctuation.
How long does it take to run each step now? Clearly it is progressing faster than the 1.3-1.4 sec/step shown in the screenshot from yesterday.
It is a moving average of the last 100 steps: Real-Time-Voice-Cloning/synthesizer/train.py Line 165 in 054f16e
Real-Time-Voice-Cloning/synthesizer/train.py Lines 207 to 216 in 054f16e
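In other words, the number displayed is a running window over recent step losses, roughly like this simplified sketch (not the repo's exact code):

```python
# Simplified sketch of the moving-average loss shown during training:
# keep a window of the last 100 step losses and average it.
from collections import deque

loss_window = deque(maxlen=100)

def log_step(step, loss):
    loss_window.append(loss)
    avg_loss = sum(loss_window) / len(loss_window)
    print(f"Step {step} | Avg. loss: {avg_loss:.4f}")
```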
102k steps reached in approximately 30h, I think.
Can you make a backup of the 100k model checkpoint (or one that is in this range)? Just in case we want to come back to it later. Is the average loss still coming down? Perhaps it converges much faster with LibriTTS. When I did the single-speaker finetuning on LibriSpeech p211 the synthesizer loss started at 0.70, and you are already in the 0.60-0.65 range.
@shoegazerstella You might want to run synthesizer_train.py with …
Hi @blue-fish, thanks a lot for your help! I had another little issue similar to #439 (comment), so it seems it is processing only 24353 samples. Is that correct? Thanks!
I restarted the training from scratch with the correct number of samples; I am now at step 8k.
If I were to start again, I'd either keep punctuation or discard it entirely by switching back to LibriSpeech. Maybe increase the max mel frames (to 600 or 700) so the synth can train on slightly more complex sentences. So disregard the suggestions in #449 (comment). Also, I am not noticing much improvement in voice quality after increasing the tacotron1 layer sizes in #447 to be more in line with what we have in the current tacotron2: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/8110552273afe6eb6093faaf701be5215a8285c9#diff-a20e5738bee4a9f617e9faabe4e7e17e For my next model I will revert those changes and initialize my weights using fatchord's pretrained tacotron1 model in the WaveRNN repo, which uses LJSpeech. My results with fatchord's hparams (#447 (comment)) show that it is sufficient for voice cloning.
@shoegazerstella Please make the change in https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/5ad081ca25f9276fac31417c0bfce54c59c2a98f before testing your trained models with the toolbox. We should be using the postnet output for best results. The mel spectrograms and sample wavs from training already use the correct output so your training outputs are unaffected.
From #364 (comment)
Based on experience here, fixing the attention mechanism needs to be the first step. If we reduce the max utterance duration without fixing attention, the resulting model will have trouble synthesizing long sentences. In other words, reducing the duration of training utterances is the reward for implementing a better attention paradigm. Some alternatives are discussed and evaluated in arXiv:1910.10288. In the meantime we should go back to …
(Removed the pretrained model; it is no longer compatible with the current PyTorch synthesizer.)
@shoegazerstella How is synthesizer training coming along?
Hi @blue-fish, …
Hi @shoegazerstella! The results look and sound great, but we need to put the model to the test and see whether it generalizes well to new text. During training of the synth, at every time step the Tacotron decoder is given the previous frame of the ground-truth mel spectrogram, and predicts the current frame using that info combined with the encoder output. When generating unseen speech, there is no ground-truth spectrogram to rely upon, so the decoder has no choice but to use its previous predicted output. This may cause the synth to behave wildly for long or rarely seen input sequences, so testing is the only way to find out. Would you please upload the current model checkpoint (.pt file) along with a copy of your … Edit: Until a vocoder is trained at 22,050 Hz you will have to use Griffin-Lim for testing. It will sound like garbage if you connect it to the original pretrained vocoder (trained at 16,000 Hz).
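A minimal sketch of the contrast described above between teacher-forced training and free-running inference, using a dummy decoder rather than Tacotron:

```python
# Illustrative contrast between teacher-forced training and free-running
# inference for an autoregressive decoder (dummy decoder, not the real Tacotron).
def decoder_step(prev_frame, context):
    # Stand-in for one decoder step; predicts the next mel frame.
    return [0.9 * x + 0.1 * context for x in prev_frame]

def run_decoder(context, n_frames, ground_truth=None):
    frame = [0.0] * 80  # GO frame (80 mel channels assumed)
    outputs = []
    for t in range(n_frames):
        frame = decoder_step(frame, context)
        outputs.append(frame)
        if ground_truth is not None:
            # Training (teacher forcing): feed back the ground-truth frame.
            frame = ground_truth[t]
        # Inference: otherwise the model's own prediction is fed back,
        # so errors can accumulate on long or rarely seen inputs.
    return outputs
```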
I just started training a synth on VCTK using these hparams and it is training quickly.
Maybe you want to wait for the new encoder I am training? I should be done in a couple of days training a new encoder using LibriSpeech + CommonVoice + VCTK, then adding VoxCeleb 1 & 2 to continue the training.
Hi @blue-fish …
Thank you @shoegazerstella! Do you want feedback on the model now, or wait until August 31st? For anyone else who would like to try the above synthesizer model: here is a synthesizer/hparams.py that is compatible with the latest changes to my …
Please see #501, everyone. Although LibriTTS wavs are trimmed so that there is no leading or trailing silence, sometimes there are huge gaps in the middle of utterances, and we can remove them by preprocessing the wavs. This should help improve the issue we see with gaps when synthesizing.
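A hedged sketch of the kind of mid-utterance silence removal being proposed, using librosa's energy-based splitting. The thresholds are assumptions, and #501 may do it differently:

```python
# Sketch of removing long internal silences from a wav before preprocessing.
# Thresholds are illustrative; the actual change in #501 may use different logic.
import numpy as np
import librosa
import soundfile as sf

wav, sr = librosa.load("utterance.wav", sr=24000)

# Find voiced intervals; anything more than top_db below the peak counts as silence.
intervals = librosa.effects.split(wav, top_db=40)

# Re-join the voiced segments, keeping a short gap so the speech still breathes.
gap = np.zeros(int(0.15 * sr), dtype=wav.dtype)
pieces = []
for start, end in intervals:
    pieces.append(wav[start:end])
    pieces.append(gap)
trimmed = np.concatenate(pieces[:-1]) if pieces else wav

sf.write("utterance_trimmed.wav", trimmed, sr)
```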
Hi @blue-fish, …
Welcome back @shoegazerstella. I tried your model by loading it in the toolbox with random examples. Not surprisingly, it still had many of the same issues as the model I trained at 16,000 Hz. Would you please continue training of the model using the schedule below? You should also increase the batch size to fully utilize the memory of your (dual?) V100 GPUs. Start training, and monitor the GPU memory utilization for a minute with …
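The original command above was truncated; one way to check how close a given batch size comes to filling the GPUs, assuming PyTorch (a rough utility, not from the repo, and an alternative to watching a system monitoring tool):

```python
# Rough check of GPU memory headroom while tuning the batch size (PyTorch).
# Not from the repo; call it after a few training steps at the new batch size.
import torch

def report_gpu_memory():
    for i in range(torch.cuda.device_count()):
        total = torch.cuda.get_device_properties(i).total_memory
        used = torch.cuda.memory_allocated(i)
        print(f"GPU {i}: {used / 1e9:.1f} / {total / 1e9:.1f} GB allocated")

# If the allocated memory stays well below the total, the batch size can
# likely be increased further before running out of GPU memory.
```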
We are still working actively on this, but collaborating elsewhere. If you are interested in contributing time towards the development of better models, please leave a message in #474.
Hi @mbdash, did you train the synthesizer using your 1M-step encoder afterwards? I find your encoder is really good, but this synthesizer is based only on the original encoder and is in TensorFlow form, so I can't use it with the PyTorch code now.
Originally posted by @mbdash in #441 (comment)