Train Synthetizer in Spanish #941

AlexSteveChungAlvarez · 2021-12-07T23:03:34Z

I trained the synthesizer with this dataset: http://openslr.org/73/ .
The models obtained up to 50k steps are here: https://drive.google.com/drive/folders/1pYc0YK6YfdikMONkR-29054_uMxTgy_g?usp=sharing . However, the results are nowhere near the target voice to clone. Any suggestions?
It does sound like a human, but not like the target.

Originally posted by @AlexSteveChungAlvarez in #789 (comment)

Bebaam · 2021-12-08T11:02:23Z

How many different speakers are there? At least 300 are suggested. If you train with a batch size of 12, as you mentioned in #940, the model may not have converged yet after 50k steps; it may need 100k+ steps. Moreover, I assume the model did learn attention, since you get fairly good quality but the wrong voice? Then the question is: did you finetune your model? To get good results it is crucial to finetune it for a single speaker; this will vastly improve quality. Have a look at #437.
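One rough way to reason about the step count: with a smaller batch, each step sees fewer utterances, so the same number of steps covers less data. The sketch below only does that arithmetic. The reference batch size of 36 is a hypothetical value chosen for illustration, not a figure from this thread, and seeing the same number of samples does not guarantee the same convergence.

```python
def samples_seen(steps: int, batch_size: int) -> int:
    """Total training utterances processed after `steps` optimizer steps."""
    return steps * batch_size


def equivalent_steps(steps: int, batch_size: int, reference_batch_size: int) -> float:
    """Steps a run at `reference_batch_size` needs to see the same number of samples."""
    return samples_seen(steps, batch_size) / reference_batch_size


# 50k steps at batch size 12, as reported in this thread.
print(samples_seen(50_000, 12))  # 600000

# Hypothetical reference batch size of 36 (an assumption for illustration only).
print(equivalent_steps(50_000, 12, 36))
```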

AlexSteveChungAlvarez · 2021-12-08T18:22:54Z

What I want to achieve is to be able to clone any unseen voice during training, as the English pretrained model does, but in Spanish. That's why I didn't finetune it. Here is the loss at 50k steps:

Unfortunately, there isn't any information about the number of speakers in this dataset.
Before that, I tried with samples of the cv-corpus dataset (https://commonvoice.mozilla.org/es/datasets), which has plenty of voices. I don't know why, but the outputs for the target audios were mostly noise, like whispering, or silence. Even when the target audio was one from the training set, the model didn't speak the text I passed; it output the same audio at much lower quality. I used only samples of that dataset because training on the entire dataset raised an error, which I attached in #789 (comment) . Should I continue training with the Crowdsourced high-quality Peruvian Spanish speech data set until 100k+ steps?
Or maybe you know how to solve the issue with the cv-corpus so that I can train on it?

Bebaam · 2021-12-09T00:20:44Z

Training loss varies with datasets, but it doesn't look wrong.
As mentioned in #30, maybe you need to train your own encoder for a new language.
For commonvoice, which tsv file did you use?
Maybe it is really important to have each speaker in its own folder to distinguish voices, as we discussed in #934.
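A minimal sketch of the "one folder per speaker" idea, assuming the Common Voice TSV has `client_id` and `path` columns. The folder names (`speaker_00000`, ...) and the `dataset` root are hypothetical; the exact layout the training scripts expect is the subject of #934 and is not reproduced here. The function only plans destinations; it does not copy anything.

```python
import csv
import io
from pathlib import PurePosixPath


def plan_speaker_folders(tsv_text: str, out_root: str = "dataset") -> dict:
    """Map each clip to `<out_root>/<speaker>/<clip>` using the `client_id` column.

    Returns {source clip name: destination path}. Nothing is copied here;
    the caller decides how to move or link the files.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    speaker_index = {}  # client_id -> short, stable folder name (hypothetical naming)
    plan = {}
    for row in reader:
        speaker = speaker_index.setdefault(
            row["client_id"], f"speaker_{len(speaker_index):05d}"
        )
        plan[row["path"]] = str(PurePosixPath(out_root) / speaker / row["path"])
    return plan


# Tiny made-up sample in the assumed TSV layout.
sample = (
    "client_id\tpath\tsentence\n"
    "aaa\tclip_1.mp3\tHola mundo\n"
    "bbb\tclip_2.mp3\tBuenos días\n"
    "aaa\tclip_3.mp3\tHasta luego\n"
)
for src, dst in plan_speaker_folders(sample).items():
    print(src, "->", dst)
```

Grouping by `client_id` keeps all clips of one contributor together, which is what lets the model tell voices apart.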

AlexSteveChungAlvarez · 2021-12-09T01:06:33Z

For commonvoice I used the one that comes with it, and generated a .txt for each audio. To work around the problem of training a new encoder for the language, I tried cloning from an audio in English, so the encoder detects it well, and then passing the text in Spanish, since the synthesizer is being trained in Spanish.

Bebaam · 2021-12-09T11:13:11Z

OK, I only know CommonVoice for other languages, where we have multiple .tsv files. Maybe it is different for the Spanish CommonVoice dataset.

AlexSteveChungAlvarez · 2021-12-09T12:38:33Z

Oh, now I get what you were asking. I used validated.tsv to copy all the validated audios from the original directory into a new one, and I also generated the .txt files from this file.
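For anyone repeating this step, here is a minimal sketch of generating one .txt transcript per clip from a TSV. It assumes the TSV has `path` and `sentence` columns; the function name and the output directory are hypothetical and not taken from the repo.

```python
import csv
import tempfile
from pathlib import Path


def write_transcripts(tsv_path: Path, out_dir: Path) -> int:
    """Write one `<clip stem>.txt` per row of the TSV; return how many were written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    with tsv_path.open(newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
            sentence = row["sentence"].strip()
            if not sentence:
                continue  # skip clips without a usable transcript
            txt_path = out_dir / (Path(row["path"]).stem + ".txt")
            txt_path.write_text(sentence, encoding="utf-8")
            written += 1
    return written


# Self-contained demo on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    tsv = Path(tmp) / "validated.tsv"
    tsv.write_text(
        "client_id\tpath\tsentence\naaa\tclip_1.mp3\tHola mundo\n", encoding="utf-8"
    )
    print(write_transcripts(tsv, Path(tmp) / "txt"))  # 1
```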

AlexSteveChungAlvarez · 2021-12-09T12:49:46Z

Did you try to use CommonVoice with the code in this repo? What suggestions can you give me about it? I haven't found another dataset with as many speakers yet.

Bebaam · 2021-12-09T13:31:30Z

CommonVoice should be the best dataset by far; I did not find such a sheer number of speakers anywhere else.
For me, the quality of validated.tsv was not good enough. I assume it contains all speakers whose texts are more or less verified. In contrast, train.tsv fits better; maybe it is the subset of validated with comparatively good quality.
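To compare the two files, it can help to count how many distinct speakers and clips each one contains. This is a small sketch that assumes a `client_id` column identifies the speaker; the 300-speaker threshold is the rough minimum suggested earlier in this thread, and the sample rows are made up.

```python
import csv
import io
from collections import Counter

MIN_SPEAKERS = 300  # rough minimum suggested earlier in this thread


def speaker_stats(tsv_text: str) -> Counter:
    """Count clips per speaker, keyed by the `client_id` column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["client_id"] for row in reader)


# Tiny made-up sample in the assumed TSV layout.
sample = (
    "client_id\tpath\tsentence\n"
    "aaa\tclip_1.mp3\tHola mundo\n"
    "bbb\tclip_2.mp3\tBuenos días\n"
    "aaa\tclip_3.mp3\tHasta luego\n"
)
stats = speaker_stats(sample)
print(f"{len(stats)} speakers, {sum(stats.values())} clips")
if len(stats) < MIN_SPEAKERS:
    print("Fewer speakers than the suggested minimum")
```

Running the same function on the contents of train.tsv and validated.tsv shows how many speakers are lost by choosing the smaller file.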

Bebaam · 2021-12-09T13:45:17Z

But I am afraid that the problem lies in the encoder, as the cloning quality depends mainly on the encoder. I remember this was stated in some issue, but I could not find it. This comment by blue-fish points in that direction:
#162 (comment)

quality may differ from person to person
if the encoder isn't familiar with voices like yours, it can't encode it accordingly.
So if you are really interested in having very good quality, I would think about training an encoder. But keep in mind that this will need much more time than training a synthesizer.

AlexSteveChungAlvarez · 2021-12-09T13:58:22Z

I already tried the pretrained.pt files with the same audio of my voice and it worked. That's why I don't think it is the encoder; if it were, it wouldn't have worked with the pretrained.pt files either. As I said before, the target audio is me speaking in English for about 10-11 seconds. So the problem should be in the model I get from the synthesizer. I am pretty sure the issue is that I don't have enough speakers to train on. Now that you say I should use train.tsv, maybe that was the issue with commonvoice. Did you see my older post here #789 (comment)? All of that was with validated.tsv. I will try right now with train.tsv and see how it works. I will keep going by trial and error until I get this Spanish model! Thank you for being alert!

AlexSteveChungAlvarez · 2021-12-09T14:10:39Z

By the way, which batch size do you recommend for an RTX 2060 and for an RTX 3060?

Bebaam · 2021-12-09T14:30:27Z

Okay, now I understand your idea. If it works with your target voice in English, then it may be fine.
I saw your older post. I would try with train.tsv, and if the error still occurs, I would search for NaN values in the data. Maybe a few files are corrupt.
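A minimal sketch of such a check, written with the standard library only. It assumes the features have already been loaded into a mapping from clip name to a flat sequence of floats; how the features are stored on disk is not covered here, and the clip names and values are made up.

```python
import math
from typing import Dict, Iterable, List


def find_non_finite(features: Dict[str, Iterable[float]]) -> List[str]:
    """Return the names whose values contain NaN or +/-inf."""
    bad = []
    for name, values in features.items():
        if any(not math.isfinite(v) for v in values):
            bad.append(name)
    return bad


# Made-up feature values: clip_2 has a NaN, clip_3 has an infinity.
features = {
    "clip_1": [0.1, -0.3, 0.7],
    "clip_2": [0.2, float("nan"), 0.5],
    "clip_3": [float("inf"), 0.0],
}
print(find_non_finite(features))  # ['clip_2', 'clip_3']
```

With NumPy arrays, the flattened array (for example `array.ravel()`) could be passed as the values, or `numpy.isfinite(array).all()` could replace the inner check.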
The batch_size depends on your GPU VRAM. The more the better, in my opinion, so with your 2060 (6GB, I assume) just try how high you can set the batch_size without getting CUDA memory errors. For the 3060 with 12GB, you can easily double that amount.
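The "raise it until you hit a memory error" approach can be sketched as a doubling search. `try_step` is a hypothetical callable that would run one training step at the given batch size; the sketch assumes an out-of-memory condition surfaces as a `RuntimeError`, which is how PyTorch typically reports CUDA out-of-memory. The fake step below only simulates that.

```python
from typing import Callable


def largest_batch_size(try_step: Callable[[int], None], start: int = 4, limit: int = 256) -> int:
    """Double the batch size until `try_step` raises, then return the last size that worked.

    `try_step(batch_size)` is a hypothetical callable that runs one training
    step and raises RuntimeError when the GPU runs out of memory.
    Returns 0 if even `start` does not fit.
    """
    best = 0
    size = start
    while size <= limit:
        try:
            try_step(size)
        except RuntimeError:
            break
        best = size
        size *= 2
    return best


def fake_step(batch_size: int) -> None:
    # Stand-in for a real training step: pretend anything above 24 runs out of memory.
    if batch_size > 24:
        raise RuntimeError("CUDA out of memory (simulated)")


print(largest_batch_size(fake_step))  # 16
```

In a real run it is probably wise to leave some headroom below the largest size that fits, since memory use can vary with utterance length.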

AlexSteveChungAlvarez · 2021-12-09T14:50:02Z

Thank you Bebaam, I asked for the 3060 because @Andredenise is helping me with his gpu, we will work on it later, right now I will prepare the data. For tomorrow I hope we have good news!

AlexSteveChungAlvarez · 2021-12-26T22:48:54Z

Hello! I want to ask a few things about attention. My synthesizer model is already above 225k steps, but the attention plots seem worse than earlier ones. For example:
step 210500

step 229000

Like these two examples, there are many plots that sometimes look more like the one at step 210500 and other times like the one at step 229000. I am worried that the model may be overfitting. I also want to know whether this metric and the mel-spectrogram are the only ones I can use to compare with Corentin's model, or whether there is another way to compare the two synthesizer models. I also don't know when I should stop training the synthesizer.

ireneb612 · 2022-02-24T10:04:28Z

@AlexSteveChungAlvarez To train the synthesizer, did you preprocess the dataset? Did you get the right accents for the Spanish language? Did you specify different characters in symbols and cleaners? Thank you!

AlexSteveChungAlvarez · 2022-02-24T19:08:58Z

Hi @ireneb612! Yes, the code itself has a script for preprocessing. I found out with the help of the community that Mozilla's CommonVoice dataset was the best because of the variety of accents for the language. Yes, I specified the characters in symbols and cleaners as mentioned in the different issues (and I think in the guide too). Here is the repo with the resulting code of my work: https://github.com/AlexSteveChungAlvarez/Real-Time-Voice-Cloning-Spanish ; it includes the script to prepare the dataset too!
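For readers wondering what "specifying the characters in symbols and cleaners" can look like, here is a minimal, self-contained sketch. It is not the code from the linked repo: the symbol list, `spanish_cleaner` and `text_to_ids` are hypothetical names and choices for illustration. The point is that accented vowels, ñ, ü and the inverted punctuation marks must be in the symbol set, and the cleaner must not strip them (English-oriented cleaners often transliterate text to ASCII).

```python
import re
import unicodedata

# Hypothetical Spanish symbol set: ASCII letters plus accented vowels, ñ, ü and Spanish punctuation.
_pad = "_"
_eos = "~"
_characters = "abcdefghijklmnopqrstuvwxyzáéíóúüñ¡!¿?'\"(),-.:; "
symbols = [_pad, _eos] + list(_characters)
_symbol_to_id = {s: i for i, s in enumerate(symbols)}


def spanish_cleaner(text: str) -> str:
    """Lowercase, compose accents (NFC) and collapse whitespace; keeps á, ñ, ¿, etc."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def text_to_ids(text: str) -> list:
    """Clean the text, map each known character to its index, and append the end marker."""
    cleaned = spanish_cleaner(text)
    ids = [_symbol_to_id[ch] for ch in cleaned if ch in _characters]
    return ids + [_symbol_to_id[_eos]]


print(spanish_cleaner("¿Cómo   estás,  Señor?"))  # ¿cómo estás, señor?
print(text_to_ids("¡Niño!"))
```

The NFC normalization step matters because some transcripts encode "ñ" as "n" plus a combining tilde, which would otherwise be dropped as an unknown character.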

pauortegariera · 2022-09-01T14:45:31Z

Hello, I want to use a dataset in Spanish from Argentina. Can this implementation be adapted for that? Any information is welcome. Thanks a lot!

AlexSteveChungAlvarez · 2022-09-01T15:58:03Z

Of course! You just need to put your dataset in the correct structure.

pauortegariera · 2022-09-01T16:03:51Z

Excellent! One more question: in this issue you mention that the results you obtained do not resemble the target voice. Could you solve this problem? Any suggestions? Thanks for your help, AlexSteveChungAlvarez!

AlexSteveChungAlvarez · 2022-09-01T16:22:51Z

I think we had better discuss this via email, since it's not part of the issue. But yes, in my opinion, the results, even from the most recent models, don't sound like the targets. For now, if you want to achieve this, you need to finetune the model with a dataset of the target voice, which works when you have many audios from the target to clone.

AlexSteveChungAlvarez mentioned this issue Dec 12, 2021

File structure for training (encoder, synthesizer (vocoder)) #934

Open

CorentinJ closed this as completed Dec 28, 2021

neonsecret mentioned this issue Apr 29, 2022

Support for other languages #30

Open
