Poor vocoder outcome #981

Closed
gabrielrdw20 opened this issue Jan 12, 2022 · 13 comments

@gabrielrdw20

Hello, I am fairly new to this topic. I have two problems that I cannot find any solution for. I read the documentation and scrolled through all the similar issues reported here, but didn't find a solution that would be helpful in my case. I added the same question on padmalcom's GitHub, so maybe somebody will give it a look.

Short description:
The encoder was trained fine, and the synthesizer as well. The only problem is my vocoder. Training of the vocoder is very slow, and it generates unsuitable mel spectrograms in the toolbox (the test wav files it produces, however, sound fine). Instead of human speech, the toolbox generates almost pure noise.

Please take a look at the files:
https://drive.google.com/drive/folders/1-SKYHRP8zy7vETqtMMJpKv1n7XKidBZL?usp=sharing

Long description:

I trained all 3 parts, the encoder, the synthesizer and the vocoder, but the last one is quite problematic. I trained them all from scratch with 244 unique Polish speakers. I used (and adjusted to the Polish language) the code uploaded to GitHub by padmalcom. It looks like my vocoder is trained properly (this opinion is based on the wav files generated by the vocoder). Somehow, when I open them in demo_toolbox.py, the predicted mel spectrogram is not even near the target one. Is there any chance you might know what could cause the problem?

Up to this moment, the vocoder has done only 14k iterations, which might be the issue. This part is going really slowly. Should it be like that? My PC has been working non-stop for 2 days and has reached only 14k iterations. I have an NVIDIA GeForce RTX 3060 Ti and have installed the latest release of CUDA.

Any idea what could have gone wrong? I would be grateful for any suggestions :)

@ireneb612

Hi, I think that 2000 iterations correspond to one epoch! So I think it would be nice to train more!
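For context, steps per epoch is simply the number of training utterances divided by the batch size, so whether 2000 iterations really equals one epoch depends on both. A back-of-the-envelope check with made-up numbers:

```python
# Illustrative only: steps per epoch = training utterances / batch size.
num_utterances = 32_000   # hypothetical dataset size
batch_size = 16           # hypothetical synthesizer batch size
print(num_utterances // batch_size)   # 2000 steps per epoch
```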

@Bebaam

Bebaam commented Jan 19, 2022

It should be possible to use the pretrained vocoder, if you did not change the sampling rate (16k) or the embedding size of 256. Did you try to use it?

Having only 14k iterations will be the problem here. Are you sure the GPU is used for training? 14k after 2 days sounds like the CPU is being used here. Usually you will get proper results after a few hundred thousand steps.
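A quick way to rule out CPU-only training is to check that PyTorch (which this repo uses) actually sees the GPU, and to watch nvidia-smi while a training run is in progress. A minimal check:

```python
import torch

# Verify that CUDA is visible to PyTorch before starting vocoder training.
print(torch.__version__)
print(torch.cuda.is_available())            # should print True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # e.g. NVIDIA GeForce RTX 3060 Ti
    print(torch.version.cuda)               # CUDA version PyTorch was built against
```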

@gabrielrdw20
Author

gabrielrdw20 commented Jan 19, 2022

Hi @Bebaam, thanks for your response. Unfortunately, my embedding has a size of 243. My main goal was to train all 3 networks from scratch anyway.

Having only 14k iterations will be the problem here. Are you sure the GPU is used for training?

Yes, my cmd shows that the GPU was involved. I will let the vocoder reach 100k. I am also wondering whether the main problem isn't caused by there being no visible attention line after training the synthesizer.

Encoder (166800):
encoder_umap_166800

Synthesizer (80k):
attention_step_80000_sample_1

@Bebaam

Bebaam commented Jan 19, 2022

Why did you change from 256 to 243? The pretrained vocoder has more than one million steps, so to get at least the same quality, you'll need way more than 100k steps for your vocoder.
The encoder looks fine; in general, if the error is less than 0.01, you can stop training there.
For the synthesizer: if it has not learned attention yet, you will have no success training a vocoder, so you need to fix this first.

@Bebaam

Bebaam commented Jan 19, 2022

With your parameters (if you did not change anything else significantly), in my experience you should see the attention plot after a few thousand steps (5-25k). Which batch size did you use for training the synthesizer? I would use all the VRAM you have, so the higher the batch_size, the faster the training should be (again, at least in my experience).
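One rough way to judge how much headroom there is before raising batch_size is to print PyTorch's memory counters every few hundred steps (or simply watch nvidia-smi). A minimal sketch:

```python
import torch

def report_vram(device=0):
    """Print current and peak GPU memory usage in GiB to judge batch_size headroom."""
    gib = 1024 ** 3
    print(f"allocated: {torch.cuda.memory_allocated(device) / gib:.2f} GiB")
    print(f"reserved:  {torch.cuda.memory_reserved(device) / gib:.2f} GiB")
    print(f"peak:      {torch.cuda.max_memory_allocated(device) / gib:.2f} GiB")

# If peak usage stays well below the card's 8 GiB, the batch size can likely be raised.
```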

@gabrielrdw20
Author

gabrielrdw20 commented Jan 19, 2022

@Bebaam I apologize, I confused the number of speakers with the embedding size. In the file encoder/params_model.py it says:

```python
# Model parameters
model_hidden_size = 256
model_embedding_size = 256
model_num_layers = 3

# Training parameters
learning_rate_init = 1e-4
speakers_per_batch = 64
utterances_per_speaker = 10
```

so nothing was changed here. Attention is still not properly generated, and I have no idea whether I should change some parameters or just wait until both the synthesizer and the vocoder pass 100k steps.

@gabrielrdw20
Author

gabrielrdw20 commented Jan 19, 2022

Which batch size did you use for training the synthesizer?

I use the one implemented originally by the code author. Nothing was changed here.

batch

@Bebaam

Bebaam commented Jan 19, 2022

Ok. Sometimes when attention is not achieved, a complete retraining of the synthesizer could be worth it.

Furthermore, did you use the data format as in #437? Not sure whether it is necessary, but it was recommended. Maybe the encoder needs to be retrained then, too.

Otherwise, maybe this could help: fatchord/WaveRNN#154 (comment)
The first one is implemented here afaik; the second one, starting with a higher reduction factor, could be helpful and could make training faster.

I would also monitor used VRAM and increase the batch_size if possible.
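On the reduction factor point: in the upstream synthesizer hparams the training schedule is a list of stages, and starting with a higher r could look roughly like the sketch below. The tuple layout (r, lr, step, batch_size) is assumed from the upstream synthesizer/hparams.py, and the values here are made up:

```python
# Hypothetical schedule starting with a higher reduction factor r (mel frames
# predicted per decoder step) and annealing it down for finer detail.
# Tuple layout assumed: (r, lr, step, batch_size).
tts_schedule = [(7, 1e-3,  20_000, 12),
                (5, 3e-4,  50_000, 12),
                (2, 1e-4, 100_000, 12)]
```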

@gabrielrdw20
Author

Hi @Bebaam, thanks for your support. I've checked the recommended solutions. From the start, the synthesizer file has softmax implemented instead of the sigmoid function. I wrote a script that checks whether a wav file is too noisy to be included in the dataset, and I only selected the ones that are OK. That's the reason I'm a bit surprised that the synthesizer cannot learn attention from a properly trained encoder. I will try two different datasets and do it from scratch once again.
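For reference, a noise filter like the one described can be as simple as comparing the energy inside detected speech regions against the rest of the file. A rough sketch (not the exact script used; the threshold values are arbitrary):

```python
import numpy as np
import librosa

def is_clean_enough(path, min_db_gap=15.0, top_db=30, sr=16000):
    """Return True if speech energy exceeds background energy by at least min_db_gap dB."""
    y, _ = librosa.load(path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)   # non-silent regions
    if len(intervals) == 0:
        return False
    speech = np.concatenate([y[s:e] for s, e in intervals])
    mask = np.ones(len(y), dtype=bool)
    for s, e in intervals:
        mask[s:e] = False
    noise = y[mask]
    if noise.size == 0:                                    # no background found at all
        return True
    speech_db = 10 * np.log10(np.mean(speech ** 2) + 1e-10)
    noise_db = 10 * np.log10(np.mean(noise ** 2) + 1e-10)
    return (speech_db - noise_db) >= min_db_gap
```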

@gabrielrdw20
Author

Update

@Bebaam you were right, attention showed up before the synthesizer's 10k iterations. What I don't understand is the audio file selection issue. I was confident that my samples were of good quality and long enough. Nevertheless, I didn't listen to them one by one, and there are probably overly long pauses somewhere. The thread can be closed :)
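If overly long pauses really are the culprit, one option is to strip them before preprocessing. A minimal sketch with librosa and soundfile (thresholds would need tuning per dataset):

```python
import numpy as np
import librosa
import soundfile as sf

def strip_long_pauses(in_path, out_path, top_db=40, gap_s=0.2, sr=16000):
    """Keep only non-silent chunks and re-join them with a short fixed gap."""
    y, _ = librosa.load(in_path, sr=sr)
    intervals = librosa.effects.split(y, top_db=top_db)
    if len(intervals) == 0:          # nothing detected as speech: write through unchanged
        sf.write(out_path, y, sr)
        return
    gap = np.zeros(int(gap_s * sr), dtype=y.dtype)
    pieces = []
    for s, e in intervals:
        pieces.extend([y[s:e], gap])
    sf.write(out_path, np.concatenate(pieces[:-1]), sr)   # drop the trailing gap
```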

attention_step_29500_sample_1

A new audio sample is in "new outcome" folder:
https://drive.google.com/drive/folders/1-SKYHRP8zy7vETqtMMJpKv1n7XKidBZL?usp=sharing

@Bebaam

Bebaam commented Jan 26, 2022

Yeah, sometimes you never know where the error comes from, so I am glad to hear that it works now. The attention looks good now in my opinion :)

@Bebaam

Bebaam commented Jan 26, 2022

I don't understand a word, but it sounds good 😄

@gabrielrdw20
Author

gabrielrdw20 commented Jan 27, 2022

@Bebaam Thanks. I still wonder why files scraped from YT are not accepted by the model and attention is not achieved. Btw, as I struggled from the start with creating a proper file-path structure, I made a complete .py script that creates the whole structure from scratch and deals with some problems I encountered while training the model (e.g. checking whether the numbers of wav and txt files are equal, converting mp3 to wav, cutting files, scraping videos from YT and converting them to wav, using the basic Google Speech API for speech-to-text (as this step wasn't working for me in the original code), etc.). The only thing left is to add the author's code in the main folder. I will send a link here in a day or two, maybe someone finds it useful.

EDIT:

Here it is :)

https://github.com/gabrielrdw20/Real-Time-Voice-Cloning-Polish/tree/main/start_here
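For anyone following along, two of the checks mentioned above (mp3-to-wav conversion and matching wav/txt counts) are easy to reproduce on their own. A minimal sketch, assuming a flat hypothetical dataset folder and that pydub plus ffmpeg are installed:

```python
from pathlib import Path
from pydub import AudioSegment   # requires ffmpeg on PATH

DATA_DIR = Path("dataset")       # hypothetical flat folder of paired audio/txt files

# 1) Convert any mp3 files to 16 kHz mono wav.
for mp3 in DATA_DIR.glob("*.mp3"):
    audio = AudioSegment.from_mp3(str(mp3)).set_frame_rate(16000).set_channels(1)
    audio.export(str(mp3.with_suffix(".wav")), format="wav")

# 2) Check that every wav has a transcript and vice versa.
wavs = {p.stem for p in DATA_DIR.glob("*.wav")}
txts = {p.stem for p in DATA_DIR.glob("*.txt")}
print("wav without txt:", sorted(wavs - txts))
print("txt without wav:", sorted(txts - wavs))
```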
