The pretrained model does not perform well compared to the youtube video #162
Comments
Deprecating this response, since it's replicated below, in the hope of keeping up the signal-to-noise ratio.
Thanks for your notes. But I can't open this link. Could you provide another link, such as Google Drive? By the way, do you mean you trained the entire model in this GitHub repo on your own high-quality English dataset without any model modifications?
I wasn't satisfied with the example I gave (new link here: https://drive.google.com/open?id=1qZAYTfYe0sUobaOVaYkHDz075FcNWJgy), so I spent the evening running tests with paired high- and low-quality samples. Since I try to follow the data, I now want to retract my response above. I still think audio quality is important for getting the best results, but it's not the defining factor.

Frankly, I'm still not sure what that factor is. I can't identify it by any spectral features or by ear, but the practical upshot is that some subsamples of a voice clone better than others. The key to getting a good voice synthesis seems to be testing a range of different samples from the same speaker and discovering the one which clones best. Some samples which sound fine give garbage results, while others are much better.

When I mentioned training, I meant the voice you're trying to clone: the step of training the vocoder (via the alternative WaveRNN model, I believe, but I'm no ML specialist) from the sample you're using, which is the bit that interests me right now.

For reference, here's about a minute of audio of three different voices reading three different quotes; this is about as good as I'm getting at the moment: https://drive.google.com/open?id=1vqWj1XPJ2BcWTNKkAbwGji2344sNxLOd
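The trial-and-error step described above (synthesizing the same fixed sentence from several candidate clips of one speaker, then picking the best by ear) could be scripted along these lines. This is only a rough sketch: it assumes the package layout of this repo (`encoder`, `synthesizer`, `vocoder`) and the pretrained checkpoint paths from the wiki; the exact paths, the `Synthesizer` constructor signature, and the use of `soundfile` for output are my assumptions, not documented API.

```python
# Rough sketch: synthesize one fixed sentence from each candidate
# reference clip of a speaker, so the results can be compared by ear.
# Assumes CorentinJ/Real-Time-Voice-Cloning's package layout and the
# pretrained models from the wiki; all paths below are placeholders.
from pathlib import Path

def output_name(clip_path, out_dir="cloned"):
    """Pure helper: derive an output filename for a candidate clip."""
    return str(Path(out_dir) / (Path(clip_path).stem + "_cloned.wav"))

def batch_clone(candidate_clips,
                sentence="The birch canoe slid on the smooth planks."):
    """For each candidate clip, embed the speaker and synthesize the same
    sentence. Not runnable standalone: requires the repo's modules and
    pretrained models on disk (paths are assumptions)."""
    import soundfile as sf
    from encoder import inference as encoder
    from synthesizer.inference import Synthesizer
    from vocoder import inference as vocoder

    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
    synthesizer = Synthesizer(
        Path("synthesizer/saved_models/pretrained/pretrained.pt"))
    vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

    Path("cloned").mkdir(exist_ok=True)
    for clip in candidate_clips:
        wav = encoder.preprocess_wav(clip)          # load + trim + resample
        embed = encoder.embed_utterance(wav)        # speaker embedding
        spec = synthesizer.synthesize_spectrograms([sentence], [embed])[0]
        generated = vocoder.infer_waveform(spec)    # mel -> waveform
        sf.write(output_name(clip), generated, Synthesizer.sample_rate)
```

Listening to the outputs side by side makes it easy to spot which reference clip clones well and which gives garbage, without re-running the toolbox GUI for each one.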
Thanks for your reply. I think our main problem is how to clone the tone of the reference speech (I mean the few seconds of audio outside the training data) as closely as possible. I don't think audio quality is the main factor leading to the bad tone-cloning performance. My first step is to check whether the evaluation script mentioned above has problems, or whether the pretrained model the author provided leads to bad performance.
@CorentinJ Can you explain whether there are any differences between the models used for your YouTube demo and the pretrained models released on the wiki page? I would like to put any speculation to rest. In #197 it was noted that you used a different vocoder called "gen_s_mel_raw" for the video, but I don't think that's it.
No differences, and the vocoder is the same but with a different name |
Not every reference audio will clone well. The quality depends on whether it is similar to other utterances seen by the encoder and synthesizer during training. |
I wrote a script for model evaluation instead of using your toolbox. In the script, I load the pretrained models you provided and evaluate the whole pipeline on new reference audio. But the synthesis performance is not as good as what you showed in the YouTube demo. I also put the code under this issue. Could you kindly point out my problem or give me some guidance to reproduce your results? I really appreciate your help!
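For an evaluation script like the one described above, one objective check worth adding is to re-embed the synthesized audio with the speaker encoder and compare it to the reference embedding by cosine similarity (a low score suggests the voice wasn't captured). The sketch below assumes this repo's module layout and the wiki's pretrained checkpoint paths; the paths, the `Synthesizer` constructor signature, and `source_sr` handling are my assumptions.

```python
# Hedged sketch: an objective similarity score for voice-cloning evaluation.
# Embed the reference clip, synthesize from it, re-embed the output, and
# compare the two speaker embeddings. Assumes CorentinJ's repo layout;
# model paths below are placeholders.
import numpy as np

def embedding_similarity(embed_a, embed_b):
    """Cosine similarity between two speaker embeddings (1-D arrays)."""
    a = np.asarray(embed_a, dtype=np.float64)
    b = np.asarray(embed_b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def clone_and_score(ref_wav_path, text):
    """End-to-end sketch: embed reference, synthesize, re-embed, compare.
    Not runnable standalone: requires the repo and pretrained models."""
    from pathlib import Path
    from encoder import inference as encoder
    from synthesizer.inference import Synthesizer
    from vocoder import inference as vocoder

    encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
    synthesizer = Synthesizer(
        Path("synthesizer/saved_models/pretrained/pretrained.pt"))
    vocoder.load_model(Path("vocoder/saved_models/pretrained/pretrained.pt"))

    ref_wav = encoder.preprocess_wav(ref_wav_path)
    ref_embed = encoder.embed_utterance(ref_wav)

    spec = synthesizer.synthesize_spectrograms([text], [ref_embed])[0]
    gen_wav = vocoder.infer_waveform(spec)

    gen_embed = encoder.embed_utterance(
        encoder.preprocess_wav(gen_wav, source_sr=Synthesizer.sample_rate))
    return embedding_similarity(ref_embed, gen_embed)
```

This kind of score lets you compare the toolbox and a custom script on identical inputs, which would help separate "my evaluation script has a bug" from "the pretrained model underperforms on this reference audio."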