-
Notifications
You must be signed in to change notification settings - Fork 8.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Training Voice Cloning model for another language #492
Comments
Update However, be careful: in order for someone to run the model you've created they will also need to make the same changes to the file. I spent hours learning this the hard way trying to use the model in #257 because the creator was unavailable to help. |
Thank you very much! Will try :) |
Can you also tell me - can I somehow fine tune pretrained model on some new voice samples without full retraining? |
Yes, you can resume training on a pretrained model using a different dataset. The main use for this is single-speaker finetuning (process and examples in #437) but you could also finetune multi-speaker using the same process. One more thing to add, the speaker encoder is trained on English and may not work well for other languages. If you have a large number of voice samples in your target language, you may wish to train a new encoder or at least finetune an existing one. (Data preprocessing for encoder is not a smooth process so set your expectations accordingly). There are some very good speaker encoders shared in #126 but the model size of 768 is too big to be practical for cloning. You can use this process to import the relevant weights from the model and finetune to a more useful dimension: #458 (comment) |
Hello, i will also try to train a voice cloning model in another language (in Fr for me) and i have some tricks for you if it can help your :
Good luck for training ! |
The English encoder works all right for Swedish. There's info on setting it up and samples in #257 . Since encoder training is very intensive, you should just try it (either jump straight to synth preprocess and training, or do some speaker verification with Ukrainian utterances to see how well it performs). |
Thanks guys! Will try :) |
@rlutsyshyn How is progress on your synthesizer model? |
@blue-fish Just collect a lot of data :) |
Hello, i speak spanish, is there a tutorial for train it on my language? sorry i am a very noob with this but very fun project- |
@afantasialiberal Please see #431 (comment) for a general outline of the process. There is no tutorial available at this time. |
Hey, I have new issue while tried to run vocoder_preprocess. Preprocessing "starts" but it had 0 iterations (without any error) |
@rlutsyshyn Do you still have that issue with vocoder preprocess? |
@blue-fish have issue with synthesizer now :) I mean, that when I use 48kHz audio and calculate parameters in synthesizer/hparams.py - after fine tuning my voice is like in Alvin and the Chipmunks (very very fast) ... mb you have some advices on this case? |
Just to be sure, if you train the synthesizer to create 48khz melspectrogram, you should also train the vocoder to generate 48khz audio (because it’s trained on 16khz audio) Good luck ! |
@Ananas120 For synthesizer in hparams.py I can modify win_size, hop... etc, but in vocoder/hparams.py I don't see something like that, so waht sould I modify to fine tune my vocoder for 48kHz data? |
Honnestly, i don’t know, i think blue-fish can help you better for this |
@rlutsyshyn You need to train a vocoder from scratch, the good news is that it trains relatively fast and you should only need to do it once. Most people choose sampling rates of 22.05 or 24 kHz for faster inference but that's your call. In synthesizer hparams, you should modify
When preprocessing data, the |
Thank for your response @blue-fish , but for training vocoder I need a lot of 48kHz data, what could be a problem. By the way:
|
Hi @rlutsyshyn, you don't need to use the same datasets for synth and vocoder. You can preprocess a different 48khz dataset (even English) and it should generalize to Ukrainian if it has enough voices (several hundred or more). Use The downside to this approach is your trained vocoder will not compensate for any deficiencies of your synthesizer model. It is a missed opportunity to make the final output better.
For proper vocoder inference, you either need to edit The reason this works is because the vocoder just sees a 2d array of shape (num_mels, frames) as input. There is no sample rate information contained in the mel spectrogram. You can even go the other direction, and take a synthesizer trained at 16khz and use the mels on a vocoder trained at 24 khz :)
|
@blue-fish Thanks for your fast response, will try this :) |
@blue-fish Hey! Can you give me an advice? When I used data for fine tuning (16kHz english speaker) and fine tune only sysnthesizer after testing I had similar voice but words are like bla bla bla ... bla bla bla Is that problem with synthesizer or I have to train (fine tune) vocoder for that voice? |
@rlutsyshyn Are you taking the pretrained synthesizer (English) and finetuning on your Ukrainian data? That's not going to work because the mapping of letters to sounds will not match. You need to start the synthesizer training from scratch when working with a new language. |
For the classic Tacotron-2 model, training from En to another language work (in Fr for me) but En and Fr sounds are not as far as that so i suppose mapping slightly differs but not as much |
@blue-fish @Ananas120 I used english synthesizer and try to fine rune on english data but recorded by my self. I collected 400 samples of utterances and try to fine tune synthesizer on them but had bla bla bla . |
When finetuning, use the same embedding for all of your samples for faster convergence. I take the embedding of the first audio file and use it to overwrite all the others. For inference, make sure you load the same audio file used to generate your embeds for finetuning. If it still doesn't work, check your preprocessing and also make sure the transcripts in |
@blue-fish Can you explain this approach with same embedding more accurate, please? |
At the moment i use a « speaker-embedding » (the mean of all utterances embeddings), is it more interesting or is it better to user 1 single « real » utterance embedding for all ? |
@rlutsyshyn You have 400 wav files in your training set for finetuning. When you run @Ananas120 I use the embedding of a real utterance so I can load the audio file in the toolbox to get the desired embedding. The mean or L2-norm is technically better but with a good encoder model it shouldn't make much of a difference. |
what did you mean? |
@rlutsyshyn After #492 (comment) , your entire dataset is using the embedding from file 1. The embedding corresponds to a specific audio file, let's call it file1.wav. When you test your new synthesizer in the toolbox (or demo_cli.py), you must remember to load file1.wav to generate the embedding. |
@blue-fish I tried to do what you said but I have still same results... bla bla bla. I checked datasets/SV2TTS/synthesizer/train.txt file and all is good there e.g.: I used first embedding for fine tuning model, and same embedding for inference in toolbox or demo_cli.py While I fine tuned the model loss was +-0.5 and won't fall more. |
Your train.txt is improperly formatted. Here is an example line:
|
@blue-fish thanks, now it works good. But how can I improve the quality of the output ? |
@rlutsyshyn That's something that I continue to work on now. I am experimenting with different synthesizer models and settings, but I still have not surpassed the pretrained models from Corentin. |
@blue-fish Can the vocoder fine tuning improve output audio quality? |
@rlutsyshyn Yes, though you'll want to make sure you are satisfied with the synthesizer before moving on to vocoder training. |
@blue-fish Yes, I think that I'm satisfied on the synthesizer model quality. But when I try to fine tune vocoder (on 16kHz data) on the output I listen just simple noise... |
Closing this issue due to inactivity. @rlutsyshyn I think you know as much about this repo as I do now. My recommendation is to avoid finetuning the vocoder, since it will not improve the quality that much. If you need a better vocoder train it from scratch. |
how many voice samples of a particular voice are required to train the model ? |
Hi! I am already know how to train syntheiser and vocoder, also know how to create relevant dataset. But if I want to train voice cloning model for another language e.g.ukrainian, what else should I do?
The text was updated successfully, but these errors were encountered: