unable to apply voice conversion to long files using my trained speaker embedding #57
Comments
Hi, what is the reconstruction loss of your converged model? I want to know at what point I can reproduce the in-domain voice conversion results. Thank you~~
Maybe use your own speaker's dataset to fine-tune AutoVC's content encoder and decoder?
He just uses the speaker encoder to get speaker features and applies them to the pretrained AutoVC model, so he probably did not train anything.
Same issue here. I just tried the pretrained model (downloaded from the repo) without any fine-tuning, using some clean speech audio from a female speaker outside the VCTK dataset; let's call her speaker A.
However, the generated speech for A gets garbled after 2-3 seconds. From the generated audio we can tell that the speaker identity is successfully converted to A, but the actual speech content is lost. Yet if I convert p227 to p225 (both VCTK speakers) using exactly the same procedure described above and exactly the same pretrained models, it works fine (although shorter audios perform better than long ones, at least the content is correct). From the paper: "... as long as seen speakers are included in either side of the conversions, the performance is comparable ...". A is an unseen speaker, and I am not sure whether p227 was seen during training. So, here are some guesses:
I haven't tried fine-tuning yet, just some thoughts. One chunk-based workaround I am considering is sketched below. Any ideas, guys?
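Since short clips behave better, one workaround might be to split a long mel-spectrogram into short crops (a couple of seconds each), convert each crop with the same pair of embeddings, and concatenate the results. This is a rough, unverified sketch; `convert_chunk` below is a hypothetical stand-in for whatever wraps the repo's conversion code:

```python
import numpy as np

def convert_in_chunks(mel, convert_chunk, chunk_len=128, overlap=16):
    # mel: (T, 80) mel-spectrogram as produced by make_spect.py
    # convert_chunk: hypothetical callable that takes a mel chunk and
    #                returns a converted chunk of the same length
    out = []
    step = chunk_len - overlap
    for start in range(0, max(1, mel.shape[0] - overlap), step):
        converted = convert_chunk(mel[start:start + chunk_len])
        # keep the overlapping frames only from the first chunk
        out.append(converted if start == 0 else converted[overlap:])
    return np.concatenate(out, axis=0)

# sanity check with an identity "converter": output length matches input length
mel = np.random.rand(1000, 80).astype(np.float32)
assert convert_in_chunks(mel, lambda c: c).shape[0] == mel.shape[0]
```

An overlap-and-drop stitch like this will not hide seams at chunk boundaries, so a crossfade (or vocoding the full concatenated mel in one pass) would probably sound better.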
Hi,
I am attempting zero-shot voice conversion using only a few spoken sentences from a target speaker. I compute a speaker embedding for this speaker with make_spect.py and make_metadata.py (originally intended for training) and extract the embedding from the resulting train.pkl file (roughly as in the snippet below). I do the same for the source speaker.
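For concreteness, this is roughly how I pull the embeddings out of train.pkl. I am assuming each entry is laid out as [speaker_name, mean_embedding, *file_paths], which is how I read make_metadata.py; the speaker folder names below are placeholders:

```python
import pickle
import numpy as np

# train.pkl as written by make_metadata.py (default output directory is ./spmel)
with open('./spmel/train.pkl', 'rb') as f:
    speakers = pickle.load(f)

# each entry: [speaker_name, mean_embedding (256-dim), *relative_file_paths]
emb_by_name = {entry[0]: np.asarray(entry[1], dtype=np.float32) for entry in speakers}

emb_src = emb_by_name['my_source_speaker']   # placeholder folder names
emb_trg = emb_by_name['speaker_A']
```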
As a comparison, I also perform VC between 2 of the speakers provided in the metadata.pkl file.
I then apply these embeddings to perform VC by modifying the conversion.py code (roughly as sketched at the end of this post), and things start to fall apart. Here is what I see:
Has anyone experienced this? The author says the system was trained on short audio clips, but would that explain this behavior, especially given that when I use the embeddings in metadata.pkl it always sounds fine, regardless of length?
Were the embeddings in metadata.pkl computed in exactly the same way as the embeddings I am now computing? (Note that I have tried disabling the random noise added in the make_metadata.py script, with the same results.)
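For reference, the relevant part of my modified conversion looks roughly like this. It is a simplified sketch: the Generator arguments, the 'model' checkpoint key, and the padding to a multiple of 32 frames are copied from the repo's conversion code as I understand it, and the speaker names and file paths are placeholders:

```python
import pickle
from math import ceil

import numpy as np
import torch

from model_vc import Generator  # AutoVC generator from this repo

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

def pad_seq(x, base=32):
    # pad the (T, 80) mel-spectrogram to a multiple of 32 frames
    len_out = int(base * ceil(x.shape[0] / base))
    len_pad = len_out - x.shape[0]
    return np.pad(x, ((0, len_pad), (0, 0)), 'constant'), len_pad

# pretrained AutoVC generator
G = Generator(32, 256, 512, 32).eval().to(device)
g_checkpoint = torch.load('autovc.ckpt', map_location=device)
G.load_state_dict(g_checkpoint['model'])

# speaker embeddings extracted from my own train.pkl (see the earlier snippet)
with open('./spmel/train.pkl', 'rb') as f:
    emb_by_name = {e[0]: np.asarray(e[1], dtype=np.float32) for e in pickle.load(f)}
emb_src = emb_by_name['my_source_speaker']   # placeholder names
emb_trg = emb_by_name['speaker_A']

# (T, 80) mel-spectrogram of the source utterance, produced by make_spect.py
uttr_src = np.load('./spmel/my_source_speaker/utt_001.npy')  # placeholder path
x_org, len_pad = pad_seq(uttr_src)

uttr_org = torch.from_numpy(x_org[np.newaxis, :, :]).float().to(device)
e_src = torch.from_numpy(emb_src[np.newaxis, :]).float().to(device)
e_trg = torch.from_numpy(emb_trg[np.newaxis, :]).float().to(device)

with torch.no_grad():
    _, x_identic_psnt, _ = G(uttr_org, e_src, e_trg)

uttr_trg = x_identic_psnt[0, 0, :, :].cpu().numpy()
if len_pad > 0:
    uttr_trg = uttr_trg[:-len_pad, :]
# uttr_trg then goes through the pretrained WaveNet vocoder to produce a waveform
```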
Thanks!