Poor vocoder outcome #981

Closed
gabrielrdw20 opened this issue Jan 12, 2022 · 13 comments

@gabrielrdw20

Hello, I am fairly new to this topic. I have two problems that I cannot find any solution for. I read the documentation and scrolled through all the similar issues reported here, but didn't find a solution that would be helpful in my case. I added the same question on padmalcom's GitHub, so maybe somebody will give it a look.

Short description:
The encoder was trained fine, and the synthesizer as well. The only problem is my vocoder. Training of the vocoder is very slow, and it generates unsuitable mel spectrograms in the toolbox (the test wav files it produces, however, sound fine). Instead of human speech, the toolbox generates almost pure noise.

Please take a look at the files:
https://drive.google.com/drive/folders/1-SKYHRP8zy7vETqtMMJpKv1n7XKidBZL?usp=sharing

Long description:

I trained all 3 parts, the encoder, the synthesizer and the vocoder, but the last one is quite problematic. I trained them all from scratch with 244 unique Polish speakers. I used (and adjusted to the Polish language) the code uploaded to GitHub by padmalcom. It looks like my vocoder is trained properly (this opinion is based on the wav files generated by the vocoder). Somehow, when I open them in demo_toolbox.py, the predicted mel spectrogram is not even near the target one. Is there any chance you might know what could cause the problem?

Up to this moment, the vocoder has done only 14k iterations, which might be the issue. This part is going really slowly. Should it be like that? My PC has been working non-stop for 2 days and has reached only 14k iterations. I have an NVIDIA GeForce RTX 3060 Ti and have installed the latest release of CUDA.

Any idea what could have gone wrong? I would be grateful for any suggestions :)

@ireneb612

Hi, I think that 2000 iterations correspond to one epoch! So I think it would be nice to train more!
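For context, steps per epoch is simply the number of training utterances divided by the batch size, so whether 2000 iterations really equals one epoch depends on both. A back-of-the-envelope check with made-up numbers:

```python
# Illustrative only: steps per epoch = training utterances / batch size.
num_utterances = 32_000   # hypothetical dataset size
batch_size = 16           # hypothetical synthesizer batch size
print(num_utterances // batch_size)   # 2000 steps per epoch
```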

@Bebaam

Bebaam commented Jan 19, 2022

It should be possible to use the pretrained vocoder, if you did not change the sampling rate (16k) or the embedding size of 256. Did you try to use it?

Having only 14k iterations will be the problem here. Are you sure the GPU is used for training? 14k after 2 days sounds like the CPU is being used here. Usually you will get proper results after a few hundred thousand steps.
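A quick way to rule out CPU-only training is to check that PyTorch (which this repo uses) actually sees the GPU, and to watch nvidia-smi while a training run is in progress. A minimal check:

```python
import torch

# Verify that CUDA is visible to PyTorch before starting vocoder training.
print(torch.__version__)
print(torch.cuda.is_available())            # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. NVIDIA GeForce RTX 3060 Ti
    print(torch.version.cuda)               # CUDA version PyTorch was built against
```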

@gabrielrdw20
Author

gabrielrdw20 commented Jan 19, 2022

Hi @Bebaam, thanks for your response. Unfortunately, my embedding has a size of 243. My main goal was to train all 3 networks from scratch anyway.

Having only 14k iterations will be the problem here. Are you sure the GPU is used for training?

Yes, my cmd shows that the GPU was involved. I will let the vocoder reach 100k. I am also wondering whether the main problem isn't caused by there being no visible attention line after training the synthesizer.

Encoder (166800):
encoder_umap_166800

Synthesizer (80k):
attention_step_80000_sample_1

@Bebaam

Bebaam commented Jan 19, 2022

Why did you change from 256 to 243? The pretrained vocoder has more than one million steps, so to get at least the same quality, you'll need way more than 100k steps for your vocoder.
The encoder looks fine; in general, if the error is less than 0.01, you can stop training there.
For the synthesizer: if it has not learned attention yet, you will have no success training a vocoder, so you need to fix this first.

@Bebaam

Bebaam commented Jan 19, 2022

With your parameters (if you did not change anything else significantly), in my experience you should see the attention plot after a few thousand steps (5-25k). Which batch size did you use for training the synthesizer? I would use all the VRAM you have, so the higher the batch_size, the faster the training should be (again, at least in my experience).
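One rough way to judge how much headroom there is before raising batch_size is to print PyTorch's memory counters every few hundred steps (or simply watch nvidia-smi). A minimal sketch:

```python
import torch

def report_vram(device=0):
    """Print current and peak GPU memory usage in GiB to judge batch_size headroom."""
    gib = 1024 ** 3
    print(f"allocated: {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / gib:.2f} GiB")
    print(f"peak:      {torch.cuda.max_memory_allocated(device) / gib:.2f} GiB")

# If peak usage stays well below the card's 8 GiB, the batch size can likely be raised.
```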

@gabrielrdw20
Author

gabrielrdw20 commented Jan 19, 2022

@Bebaam I apologize, I confused the number of speakers with the embedding size. In the file encoder/params_model.py it says:

```python
# Model parameters
model_hidden_size = 256
model_embedding_size = 256
model_num_layers = 3

# Training parameters
learning_rate_init = 1e-4
speakers_per_batch = 64
utterances_per_speaker = 10
```

so nothing was changed here. Attention is still not properly generated, and I have no idea whether I should change some parameters or just wait until both the synthesizer and the vocoder pass 100k steps.

@gabrielrdw20
Author

gabrielrdw20 commented Jan 19, 2022

Which batch size did you use for training the synthesizer?

I use the one implemented originally by the code author. Nothing was changed here.

batch

@Bebaam

Bebaam commented Jan 19, 2022

Ok. Sometimes when attention is not achieved, a complete retraining of the synthesizer could be worth it.

Furthermore, did you use the data format as in #437? Not sure whether it is necessary, but it was recommended. Maybe the encoder needs to be retrained then, too.

Otherwise, maybe this could help: fatchord/WaveRNN#154 (comment)
The first one is implemented here afaik; the second one, starting with a higher reduction factor, could be helpful and could make training faster.

I would also monitor used VRAM and increase the batch_size if possible.
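On the reduction factor point: in the upstream synthesizer hparams the training schedule is a list of stages, and starting with a higher r could look roughly like the sketch below. The tuple layout (r, lr, step, batch_size) is assumed from the upstream synthesizer/hparams.py, and the values here are made up:

```python
# Hypothetical schedule starting with a higher reduction factor r (mel frames
# predicted per decoder step) and annealing it down for finer detail.
# Tuple layout assumed: (r, lr, step, batch_size).
tts_schedule = [(7, 1e-3,  20_000, 12),
                (5, 3e-4,  50_000, 12),
                (2, 1e-4, 100_000, 12)]
```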

@gabrielrdw20
Author

Hi @Bebaam, thanks for your support. I've checked the recommended solutions. From the start, the synthesizer file has softmax implemented instead of the sigmoid function. I wrote a script that checks whether a wav file is too noisy to be included in the dataset, and I only selected the ones that are OK. That's the reason I'm a bit surprised that the synthesizer cannot learn attention from a properly trained encoder. I will try two different datasets and do it from scratch once again.
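For reference, a noise filter like the one described can be as simple as comparing the energy inside detected speech regions against the rest of the file. A rough sketch (not the exact script used; the threshold values are arbitrary):

```python
import numpy as np
import librosa

def is_clean_enough(path, min_db_gap=15.0, top_db=30, sr=16000):
    """Return True if speech energy exceeds background energy by at least min_db_gap dB."""
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent regions
    if len(intervals) == 0:
        return False
    speech = np.concatenate([y[s:e] for s, e in intervals])
    mask = np.ones(len(y), dtype=bool)
    for s, e in intervals:
        mask[s:e] = False
    noise = y[mask]
    if noise.size == 0:                                    # no background found at all
        return True
    speech_db = 10 * np.log10(np.mean(speech ** 2) + 1e-10)
    noise_db = 10 * np.log10(np.mean(noise ** 2) + 1e-10)
    return (speech_db - noise_db) >= min_db_gap
```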

@gabrielrdw20
Author

Update

@Bebaam you were right, attention showed up before the synthesizer's 10k iterations. What I don't understand is the audio file selection issue. I was confident that my samples were of good quality and long enough. Nevertheless, I didn't listen to them one by one, and there are probably overly long pauses somewhere. The thread can be closed :)
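If overly long pauses really are the culprit, one option is to strip them before preprocessing. A minimal sketch with librosa and soundfile (thresholds would need tuning per dataset):

```python
import numpy as np
import librosa
import soundfile as sf

def strip_long_pauses(in_path, out_path, top_db=40, gap_s=0.2, sr=16000):
    """Keep only non-silent chunks and re-join them with a short fixed gap."""
    y, _ = librosa.load(in_path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals) == 0:          # nothing detected as speech: write through unchanged
        sf.write(out_path, y, sr)
        return
    gap = np.zeros(int(gap_s * sr), dtype=y.dtype)
    pieces = []
    for s, e in intervals:
        pieces.extend([y[s:e], gap])
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)   # drop the trailing gap
```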

attention_step_29500_sample_1

A new audio sample is in "new outcome" folder:
https://drive.google.com/drive/folders/1-SKYHRP8zy7vETqtMMJpKv1n7XKidBZL?usp=sharing

@Bebaam

Bebaam commented Jan 26, 2022

Yeah, sometimes you never know where the error comes from, so I am glad to hear that it works now. The attention looks good now in my opinion :)

@Bebaam

Bebaam commented Jan 26, 2022

I don't understand a word, but it sounds good 😄

@gabrielrdw20
Author

gabrielrdw20 commented Jan 27, 2022

@Bebaam Thanks. I still wonder why files scraped from YT are not accepted by the model and attention is not achieved. Btw, as I struggled from the start with creating a proper file-path structure, I made a complete .py script that creates the whole structure from scratch and deals with some problems I encountered while training the model (e.g. checking whether the numbers of wav and txt files are equal, converting mp3 to wav, cutting files, scraping videos from YT and converting them to wav, using the basic Google Speech API for speech-to-text (as this step wasn't working for me in the original code), etc.). The only thing left is to add the author's code in the main folder. I will send a link here in a day or two, maybe someone finds it useful.

EDIT:

Here it is :)

https://github.com/gabrielrdw20/Real-Time-Voice-Cloning-Polish/tree/main/start_here
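For anyone following along, two of the checks mentioned above (mp3-to-wav conversion and matching wav/txt counts) are easy to reproduce on their own. A minimal sketch, assuming a flat hypothetical dataset folder and that pydub plus ffmpeg are installed:

```python
from pathlib import Path
from pydub import AudioSegment   # requires ffmpeg on PATH

DATA_DIR = Path("dataset")       # hypothetical flat folder of paired audio/txt files

# 1) Convert any mp3 files to 16 kHz mono wav.
for mp3 in DATA_DIR.glob("*.mp3"):
    audio = AudioSegment.from_mp3(str(mp3)).set_frame_rate(16000).set_channels(1)
    audio.export(str(mp3.with_suffix(".wav")), format="wav")

# 2) Check that every wav has a transcript and vice versa.
wavs = {p.stem for p in DATA_DIR.glob("*.wav")}
txts = {p.stem for p in DATA_DIR.glob("*.txt")}
print("wav without txt:", sorted(wavs - txts))
print("txt without wav:", sorted(txts - wavs))
```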
