Training a new model based on LibriTTS #449

Closed
ghost opened this issue Jul 25, 2020 · 66 comments

ghost commented Jul 25, 2020

@blue-fish, would it be useful if I were to offer a GPU (2080 Ti) to contribute to training a new model based on LibriTTS?
I have yet to train any models and would gladly exchange GPU time for an opportunity to learn.
I wonder how long it would take on a single 2080 ti.

Originally posted by @mbdash in #441 (comment)


ghost commented Jul 25, 2020

@mbdash I just noticed this. This would be a really nice contribution if you are up for it!

On the pretrained models page it says the synthesizer was trained in a week on 4 GPUs (1080ti). If you are not willing to tie up your GPU for a full month, it will still be helpful if you can get to a partially-trained model that has intelligible speech so others can continue training and finetuning.

Training instructions for synthesizer

  1. Pull the latest copy of the repo to get the LibriTTS support added in #441 (Add synthesizer preprocessing support for other datasets).
  2. Download LibriTTS "train-clean-100" and "train-clean-360" from here: https://openslr.org/60/
    • While it is downloading, enable tensorflow GPU support if not already done
  3. Make a datasets folder; it can be on an external drive if you don't have enough storage (this will consume 150-200 GB)
  4. Extract LibriTTS downloads to this path: datasets/LibriTTS
  5. Generate mel spectrograms for training: python synthesizer_preprocess_audio.py path/to/datasets_folder --no_alignments --datasets_name LibriTTS
  6. Generate embeddings for training: python synthesizer_preprocess_embeds.py path/to/datasets_folder/SV2TTS/synthesizer
  7. Start training from scratch: python synthesizer_train.py new_model_name path/to/datasets_folder/SV2TTS/synthesizer
    • You will start seeing wavs when it reaches each checkpoint interval (default: 2,000 steps)

You can quit and resume training at any time, though you will lose all progress since the last checkpoint. It will be interesting to see how well it does with default hparams.


ghost commented Jul 25, 2020

From what I understand, LibriTTS offers several advantages over LibriSpeech:

  1. The transcripts contain punctuation so the model will respond to it instead of ignoring it as it does currently.
  2. Audio has been split into smaller segments making alignments unnecessary
  3. Higher sampling rate of 24 kHz instead of 16 kHz

We should consider updating the hparams so we can ultimately generate 24 kHz audio from this:
* Edit: There are more fundamental problems than sample rate affecting quality, so keeping it at 16,000 Hz is preferable as it speeds up training and retains compatibility with the current vocoder

# Mel spectrogram
n_fft=800, # Extra window size is filled with 0 paddings to match this parameter
hop_size=200, # For 16000Hz, 200 = 12.5 ms (0.0125 * sample_rate)
win_size=800, # For 16000Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
sample_rate=16000, # 16000Hz (corresponding to librispeech) (sox --i <filename>)
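
For reference, keeping the same 12.5 ms hop and 50 ms window at 24 kHz would give values along the lines below. This is only a sketch of the change being discussed (and, per the edit above, one we are not making), not settings that were actually trained:

# Hypothetical 24 kHz variant, following the same formulas as above
n_fft=1200,        # keeps the n_fft = win_size convention used above (a power of two such as 2048 would also work)
hop_size=300,      # 0.0125 * 24000 = 12.5 ms
win_size=1200,     # 0.05 * 24000 = 50 ms
sample_rate=24000, # LibriTTS's native rate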

@CorentinJ also suggests reducing the max allowable utterance duration (these hparams are used in synthesizer/preprocess.py):

# Whether to clip silence in Audio (at beginning and end of audio only, not the middle)
# train samples of lengths between 3sec and 14sec are more than enough to make a model capable
# of good parallelization.
clip_mels_length=True,
# For cases of OOM (Not really recommended, only use if facing unsolvable OOM errors,
# also consider clipping your samples to smaller chunks)
max_mel_frames=900,
# Only relevant when clip_mels_length = True, please only use after trying output_per_steps=3
# and still getting OOM errors.
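
To make the effect concrete: at 16 kHz with hop_size=200, max_mel_frames=900 corresponds to roughly 11.25 seconds of audio (900 * 200 / 16000). A minimal sketch of how these two hparams gate utterances during preprocessing (simplified from the length check in synthesizer/preprocess.py):

def keep_utterance(mel_frames, hparams):
    # Drop utterances whose mel spectrogram exceeds max_mel_frames when
    # clipping is enabled; overly long utterances dominate memory use.
    if hparams.clip_mels_length and mel_frames > hparams.max_mel_frames:
        return False
    return True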

I don't have any solutions for the other suggestions mentioned (switching attention paradigm, removing speakers with bad prosody): #364 (comment)


mbdash commented Jul 25, 2020

Ok,
I will sync LibriTTS overnight, try to set this up over the weekend and get the GPU working on it.


mbdash commented Jul 26, 2020

Update 2020-07-25 22h20 EST:

Step 5 (Generate mel spectrograms for training):
Currently at 25%; all CPUs available to the VM are at full load.

For posterity, note the typo in the original command in step 5: the flag should be "--datasets_name" (the "s" was missing).
python synthesizer_preprocess_audio.py ~/rtvc_LibriTTS/datasets --no_alignments --datasets_name LibriTTS


ghost commented Jul 26, 2020

Thanks for the update and correction.

Let's run training with the default hparams. We're already switching from LibriSpeech to LibriTTS and it's best to only change one parameter at a time.


mbdash commented Jul 26, 2020

Hi, I have an error because synthesizer_preprocess_embeds.py wants a pretrained model?

I fail to understand why we need to provide a pretrained model when trying to train from scratch, but I will point it at the latest pretrained encoder until told otherwise.

(rtvc_py373) username@vm:~/github/Real-Time-Voice-Cloning$ python synthesizer_preprocess_embeds.py /mnt/nfs/a_share/rtvc_LibriTTS/datasets/SV2TTS/synthesizer/
Arguments:
    synthesizer_root:      /mnt/nfs/a_share/rtvc_LibriTTS/datasets/SV2TTS/synthesizer
    encoder_model_fpath:   encoder/saved_models/pretrained.pt
    n_processes:           4

Embedding:   0%|                                                                                                                                                  | 0/111521 [00:02<?, ?utterances/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/username/github/Real-Time-Voice-Cloning/synthesizer/preprocess.py", line 228, in embed_utterance
    encoder.load_model(encoder_model_fpath)
  File "/home/username/github/Real-Time-Voice-Cloning/encoder/inference.py", line 33, in load_model
    checkpoint = torch.load(weights_fpath, _device)
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/site-packages/torch/serialization.py", line 384, in load
    f = f.open('rb')
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/pathlib.py", line 1186, in open
    opener=self._opener)
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/pathlib.py", line 1039, in _opener
    return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "synthesizer_preprocess_embeds.py", line 25, in <module>
    create_embeddings(**vars(args))
  File "/home/username/github/Real-Time-Voice-Cloning/synthesizer/preprocess.py", line 254, in create_embeddings
    list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/site-packages/tqdm/std.py", line 1130, in __iter__
    for obj in iterable:
  File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'


ghost commented Jul 26, 2020

@mbdash Look at the middle part of the image here and hopefully it will make more sense why the pretrained encoder model is needed to generate embeddings for synthesizer training: #30 (comment) Please speak up if it still doesn't make sense.

Think of the synthesizer as a black box with 2 inputs: an embedding, and text to synthesize. Different speakers sound different even when speaking the same text. The synthesizer uses the embedding to impart that voice information in the mel spectrogram that it produces as output. The synthesizer gets the embedding from the encoder, which in turn can be thought of as a black box that turns a speaker's wav data into an embedding.

So you need to run the encoder model to get the embedding, and you get the error message because it can't find the model.
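
To make the data flow concrete, here is a minimal sketch of what synthesizer_preprocess_embeds.py does for each utterance (file names are placeholders; the real script distributes this work over a process pool):

from pathlib import Path
import numpy as np
from encoder import inference as encoder

# Load the pretrained speaker encoder once; this is the model the traceback
# above says it cannot find.
encoder.load_model(Path("encoder/saved_models/pretrained.pt"))

# For each preprocessed utterance wav written in step 5, compute a fixed-size
# speaker embedding and save it alongside the mels for synthesizer training.
wav = np.load("datasets/SV2TTS/synthesizer/audio/audio-example.npy")  # placeholder file name
wav = encoder.preprocess_wav(wav)
embed = encoder.embed_utterance(wav)  # 256-dimensional embedding by default
np.save("datasets/SV2TTS/synthesizer/embeds/embed-example.npy", embed, allow_pickle=False)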


mbdash commented Jul 26, 2020

Ok, great,
If you tell me it is as designed, I will continue. It is currently at 50% embedding.

I opened the image but I need slightly more coffee to really look at it ;-)

thx for the quick response.


mbdash commented Jul 26, 2020

OK, I started synthesizer_train.py and it is at step 250 now @ 2020-07-26 12h24 EST

[screenshot: training console output]


ghost commented Jul 26, 2020

Wow that is fast. At that rate it will take just over 4 days to reach the 278k steps in the current model. And it will train even faster as the model gets better. Please share some griffin-lim wavs when they become intelligible.


mbdash commented Jul 26, 2020

step 2850 @ 13H15 EST
so approx 2500 steps in ~1h


ghost commented Jul 26, 2020

Generated 64 train batches of size 36 in 21.814 sec

This seems to be a bottleneck; is the data on an external drive? I'm averaging about 14 sec for batch generation on a slow CPU, but the data lives on an SSD.


mbdash commented Jul 26, 2020

latest @ 14h25:
[screenshot: training console output]

My setup is not optimal. The data is currently residing on the HDD side of my array; I just added a new SSD but it is not being used atm.
When I stop the training, I will move the data to a share living on the SSD or even a passthrough NVMe drive.


ghost commented Jul 26, 2020

If that's a typical batch generation time now, 2.3 sec for 64 batches is just 0.036 sec per step or 1 hour over 100,000 steps. Not worth it to transfer the data over to the SSD in my opinion.


mbdash commented Jul 26, 2020

step 10k reached @ 15h30
so we can estimate ~10k steps / 3h

Where are the wavs you want me to share located?
When I try to ls datasets/SV2TTS/synthesizer/audio, my terminal hangs.

[screenshot: training console output]


ghost commented Jul 26, 2020

Where are the wavs you want me to share located?

Check out the training logs area: synthesizer/saved_models/logs-new_model_name/wavs

The files in the plots folder are also interesting and show how well the new synthesizer model is working.


mbdash commented Jul 26, 2020

rtvc_libritts_s_mdl @ 10k steps

Cheers!
rtvc_libritts_s_mdl_10k.zip


ghost commented Jul 26, 2020

Overall, the synthesizer training seems to be progressing nicely! I'll be interested to see as many plots and wavs as you care to share, but otherwise it's a lot of waiting now.

It would be nice if you could share in-progress checkpoints, say starting at 100k and every 50k steps after that, or generate some samples using the toolbox. I've never trained from the start and it would be interesting to see the progression.


mbdash commented Jul 26, 2020

rtvc_libritts_s_mdl @ 20k steps in ~6h

rtvc_libritts_s_mdl_20k.zip


ghost commented Jul 26, 2020

I used the original pretrained models (hereafter, LibriSpeech_278k) to synthesize the same utterance as the 20k example, also inverting it with Griffin-Lim. The clarity is about the same but there is less harshness with LibriSpeech_278k (not sure what the correct technical term for that is).

"When he spoke of the execution he wanted to pass over the horrible details, but Natasha insisted that he should not omit anything."

You can definitely hear more of a pause after "details" in the 20k wav so the new model is learning how to deal with punctuation!

4592_22178_000024_000001.zip


mbdash commented Jul 27, 2020

rtvc_libritts_s_mdl @ 74k steps in ~21h

rtvc_libritts_s_mdl_74k.zip


ghost commented Jul 27, 2020

@mbdash From that batch I find the 50k sample remarkable. Your LibriTTS-based model is much closer to the ground truth, capturing the effect of the 3 commas and question mark on prosody.

For this one clip I'd say your model performs better than LibriSpeech_278k, but it will be interesting to see how well the model generalizes to new voices (embeddings) unseen during training.

As they sat thus something brushed against peter as light as a kiss, and stayed there, as if saying timidly, "Can I be of any use?"

step-50000_comparison.zip


mbdash commented Jul 27, 2020

Yes, I keep listening to them, paying attention to details, and I can clearly hear the TTS using the punctuation.


ghost commented Jul 27, 2020

How long does it take to run each step now? Clearly it is progressing faster than the 1.3-1.4 sec/step shown in the screenshot from yesterday.


mbdash commented Jul 27, 2020

I don't think the numbers are very accurate

[screenshot: training console output]

I try counting Mississippis, but the steps print way faster and sometimes in quick bursts.
[screenshot: training console output]


ghost commented Jul 27, 2020

It is a moving average of the last 100 steps:

time_window = ValueWindow(100)

# Training loop
while not coord.should_stop() and step < args.tacotron_train_steps:
    start_time = time.time()
    step, loss, opt = sess.run([global_step, model.loss, model.optimize])
    time_window.append(time.time() - start_time)
    loss_window.append(loss)
    message = "Step {:7d} [{:.3f} sec/step, loss={:.5f}, avg_loss={:.5f}]".format(
        step, time_window.average, loss, loss_window.average)
    log(message, end="\r", slack=(step % args.checkpoint_interval == 0))
    print(message)
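
ValueWindow itself is just a rolling window over the last N appended values; a rough equivalent of the helper (the actual class lives in the repo's utils):

from collections import deque

class ValueWindow:
    def __init__(self, window_size=100):
        self._values = deque(maxlen=window_size)

    def append(self, x):
        self._values.append(x)

    @property
    def average(self):
        return sum(self._values) / max(1, len(self._values))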


mbdash commented Jul 27, 2020

102k steps reached in roughly 30h, I think.

rtvc_libritts_s_mdl_102k.zip


ghost commented Jul 27, 2020

Can you make a backup of the 100k model checkpoint (or one that is in this range)? Just in case we want to come back to it later.

Is the average loss still coming down? Perhaps it converges much faster with LibriTTS. When I did the single-speaker finetuning on LibriSpeech p211 the synthesizer loss started at 0.70, and you are already in the 0.60-0.65 range.


ghost commented Aug 13, 2020

@shoegazerstella You might want to run synthesizer_train.py with -s 500 to save the model every 500 steps (that way you do not lose too much progress when stopping and restarting)
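
For example, combining it with the training command from the instructions above (the model name and path are placeholders):

python synthesizer_train.py new_model_name path/to/datasets_folder/SV2TTS/synthesizer -s 500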

@shoegazerstella

Hi @blue-fish, thanks a lot for your help!
Training is now in progress; the configuration follows the parameters you suggested above.

I had another little issue similar to #439 (comment); because of it, it seems it is processing only 24353 samples. Is that correct? Thanks!

Initialising Tacotron Model...

Trainable Parameters: 24.888M

Starting the training of Tacotron from scratch

Using inputs from:
        DATA/SV2TTS/synthesizer/train.txt
        DATA/SV2TTS/synthesizer/mels
        DATA/SV2TTS/synthesizer/embeds

Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   10k Steps    |     32     |     0.001     |        7         |
+----------------+------------+---------------+------------------+

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py:211: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
  self.dropout, self.training, self.bidirectional, self.batch_first)
| Epoch: 1/14 (762/762) | Loss: 0.8026 | 1.0 steps/s | Step: 0k |
| Epoch: 2/14 (762/762) | Loss: 0.7637 | 1.0 steps/s | Step: 1k |
| Epoch: 3/14 (476/762) | Loss: 0.7511 | 1.0 steps/s | Step: 2k | Input at step 2000: my dear child, i said grandly, do you really suppose i am afraid of that poor wretch?~__________________________
| Epoch: 3/14 (762/762) | Loss: 0.7460 | 1.0 steps/s | Step: 2k |
| Epoch: 4/14 (361/762) | Loss: 0.7274 | 1.0 steps/s | Step: 2k | 

@shoegazerstella

I restarted the training from scratch with the correct number of samples; I am now at step 8k.
I will share later some spectrogram plots + wavs.


ghost commented Aug 14, 2020

If I were to start again, I'd either keep punctuation or discard it entirely by switching back to LibriSpeech. Maybe increase the max mel frames (to 600 or 700) so the synth can train on slightly more complex sentences. So disregard the suggestions in #449 (comment)

Also, I did not notice much improvement in voice quality when I increased the tacotron1 layer sizes in #447 to be more in line with what we have in the current tacotron2: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/8110552273afe6eb6093faaf701be5215a8285c9#diff-a20e5738bee4a9f617e9faabe4e7e17e

For my next model I will revert those changes and initialize my weights using fatchord's pretrained tacotron1 model in the WaveRNN repo, which uses LJSpeech. My results with fatchord's hparams (#447 (comment)) show that it is sufficient for voice cloning.


ghost commented Aug 14, 2020

@shoegazerstella Please make the change in https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/5ad081ca25f9276fac31417c0bfce54c59c2a98f before testing your trained models with the toolbox. We should be using the postnet output for best results. The mel spectrograms and sample wavs from training already use the correct output, so your training outputs are unaffected.


ghost commented Aug 15, 2020

From #364 (comment)

  • You can lower the upper bound I put on utterance duration, which I suspect has the effect of removing long utterances that are more likely to contain long pauses (I formally evaluated that models trained this way generate long pauses less frequently). It also trains faster and does not have drawbacks (with a good attention paradigm, the model can generate sentences longer than seen in training).

Based on experience here, fixing the attention mechanism needs to be the first step. If we reduce the max utterance duration without fixing attention, then the resulting model will have trouble synthesizing long sentences. In other words, reducing duration of training utterances is the reward for implementing a better attention paradigm.

Some alternatives are discussed and evaluated in arXiv:1910.10288. In the meantime we should go back to max_mel_frames = 900 (and accept the gaps that come along with it).


ghost commented Aug 15, 2020

(Removed the pretrained model; it is no longer compatible with the current PyTorch synthesizer.)


ghost commented Aug 20, 2020

@shoegazerstella How is synthesizer training coming along?

@shoegazerstella

@shoegazerstella How is synthesizer training coming along?

Hi @blue-fish,
I am sharing plots and wavs. It seems it is now well past 250k steps.
What do you think of these results?


ghost commented Aug 20, 2020

Hi @shoegazerstella ! The results look and sound great but we need to put the model to the test and see whether it generalizes well to new text.

During training of the synth, at every time step the Tacotron decoder is given the previous frame of the ground-truth mel spectrogram, and predicts the current frame using that info combined with the encoder output. When generating unseen speech, there is no ground truth spectrogram to rely upon, so the decoder has no choice but to use the previous predicted output. This may cause the synth to behave wildly for long or rarely seen input sequences. So testing is the only way to find out.
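
Schematically, the difference looks like this (a toy illustration of teacher forcing vs. free-running inference, not the repo's actual decoder code):

import numpy as np

# Toy stand-in for one Tacotron decoder step: it combines the previous mel frame
# with an "encoder context" to produce the next frame. The real decoder is a
# neural network; this only illustrates the feedback loop.
def decoder_step(prev_frame, context):
    return 0.5 * prev_frame + 0.5 * context

def decode(context, ground_truth=None, n_steps=5, n_mels=80):
    prev = np.zeros(n_mels)  # GO frame
    outputs = []
    for t in range(n_steps):
        frame = decoder_step(prev, context)
        outputs.append(frame)
        if ground_truth is not None:
            prev = ground_truth[t]  # training: teacher forcing with the ground-truth frame
        else:
            prev = frame  # inference: feed back the prediction, so early errors can compound
    return np.stack(outputs)

context = np.ones(80)
teacher_forced = decode(context, ground_truth=np.random.rand(5, 80))
free_running = decode(context)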

Would you please upload the current model checkpoint (.pt file) along with a copy of your synthesizer/hparams.py?

Edit: Until a vocoder is trained at 22,050 Hz you will have to use Griffin-Lim for testing. It will sound like garbage if you connect it to the original pretrained vocoder (trained at 16,000 Hz).
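
For a quick listening test outside the toolbox, librosa's Griffin-Lim mel inversion is one option; a self-contained sketch at 22,050 Hz (the STFT parameters here are illustrative, not the repo's hparams, and the repo's own audio code should be preferred for real comparisons):

import librosa

sr = 22050
y = librosa.tone(220, sr=sr, duration=1.0)  # stand-in audio; replace with a real utterance
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256)
wav = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256, n_iter=60)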


ghost commented Aug 20, 2020

I just started training a synth on VCTK using these hparams and it is training quickly.


mbdash commented Aug 20, 2020

Maybe you want to wait for the new encoder I am training?

I should be done in a couple of days. I am training a new encoder using LibriSpeech + CommonVoice + VCTK for 315k steps (loss < 0.005), then adding VoxCeleb 1 & 2 to continue the training.
Loss is currently at <= 0.1 at step 344k.

@shoegazerstella

Hi @blue-fish
Here is the last synth checkpoint + hparams.py.
I'm OOO till August 31st, so I won't be able to test it or make adjustments for further training before that day.
Thank you!


ghost commented Aug 21, 2020

Thank you @shoegazerstella ! Do you want feedback on the model now, or wait until August 31st?

For anyone else who would like to try the above synthesizer model: here is a synthesizer/hparams.py that is compatible with the latest changes to my 447_pytorch_synthesizer branch.


ghost commented Aug 21, 2020

Please see #501 everyone. Although LibriTTS wavs are trimmed so that there is no leading or trailing silence, sometimes there are huge gaps in the middle of utterances, and we can remove them by preprocessing the wavs. This should help improve the issue we see with gaps when synthesizing.
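
The idea, roughly, is to split each wav on silence and cap the internal gaps before computing mels; a hedged sketch using librosa (the thresholds are illustrative, not the values used in #501):

import numpy as np
import librosa

def cap_internal_silence(wav, top_db=40, max_gap_samples=4000):
    # Keep the voiced intervals and shorten any silent gap between them to at
    # most max_gap_samples, so long mid-utterance pauses don't reach training.
    intervals = librosa.effects.split(wav, top_db=top_db)
    pieces = []
    for i, (start, end) in enumerate(intervals):
        pieces.append(wav[start:end])
        if i < len(intervals) - 1:
            gap = intervals[i + 1][0] - end
            pieces.append(np.zeros(min(gap, max_gap_samples), dtype=wav.dtype))
    return np.concatenate(pieces) if pieces else wav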

@shoegazerstella

Hi @blue-fish,
Did you have time to test the model I sent?
If not, I could do it, but I just wanted to understand whether there is a testing script (one that uses the usual test set and computes cumulative error metrics; if so, could you point me to it?), or whether I should test it on some random examples.
Thanks!


ghost commented Sep 1, 2020

Welcome back @shoegazerstella. I tried your model by loading it in the toolbox with random examples. Not surprisingly, it still had many of the same issues as the model I trained at 16,000 Hz. Would you please continue training the model using the schedule below?

You should also increase the batch size to fully utilize the memory of your (dual?) V100 GPUs. Start training, and monitor the GPU memory utilization for a minute with watch -n 0.5 nvidia-smi. Keep adjusting until you are at 80-90% memory utilization.

        ### Tacotron Training
        tts_schedule = [(7,  1e-3,    20_000,  96),   # Progressive training schedule
                        (5,  3e-4,    50_000,  64),   # (r, lr, step, batch_size)
                        (2,  1e-4,   100_000,  32),   #
                        (2,  1e-5, 2_000_000,  32)],  # r = reduction factor (# of mel frames
                                                      #     synthesized for each decoder iteration)
                                                      # lr = learning rate
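
For reference, a sketch of how a progressive schedule like this is typically consumed, i.e. picking the active settings for the current step (the fork's actual training loop may differ in details):

tts_schedule = [(7, 1e-3,    20_000, 96),
                (5, 3e-4,    50_000, 64),
                (2, 1e-4,   100_000, 32),
                (2, 1e-5, 2_000_000, 32)]

def session_for(current_step):
    # Return (reduction_factor, learning_rate, batch_size) for this step.
    for r, lr, max_step, batch_size in tts_schedule:
        if current_step < max_step:
            return r, lr, batch_size
    r, lr, _, batch_size = tts_schedule[-1]  # past the last milestone: keep the final settings
    return r, lr, batch_size

print(session_for(30_000))  # -> (5, 0.0003, 64)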


ghost commented Oct 12, 2020

We are still working actively on this, but collaborating elsewhere. If you are interested in contributing time towards the development of better models please leave a message in #474 .

@zhuochunli

Synth trained on LibriTTS for 200k steps with the old/original encoder.

https://drive.google.com/drive/folders/1ah6QNyB8jIcFuKusPOVdx0pPIZxeZeul?usp=sharing

Let me know if the link works or not, and if any files are missing.

Hi @mbdash, did you train the synthesizer using your 1M-step encoder afterwards? I find your encoder really good, but this synthesizer is based only on the original encoder and is in TensorFlow form, so I can't use it with the PyTorch code now.

This issue was closed.