cloning accuracy #404
I believe the embedding only depends on the last file loaded. In other words, the toolbox has no memory and does not learn as it is used. So you can experiment to see which of the clips results in the best cloned voice. (Will be a lot easier with the toolbox once #402 is merged)
You are correct: if the speaker encoder is good then all the points from a single speaker should form a distinct cluster away from other speakers. However, if it is only plotting data from a single speaker then I think the autoscaling will make those points appear farther apart than they are in reality.
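One way to check cluster tightness independently of the plot's autoscaling is to measure the pairwise cosine similarity of the utterance embeddings directly. This is a minimal sketch, not toolbox API: the helper below assumes you have already extracted the embedding vectors (e.g. as NumPy arrays) by some means.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity between all pairs of embeddings.

    Values close to 1.0 mean the clips form a tight cluster for one
    speaker, even if an autoscaled projection makes them look spread out.
    """
    E = np.asarray(embeddings, dtype=np.float64)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalise rows
    sims = E @ E.T                                    # pairwise cosine matrix
    n = len(E)
    # Average the off-diagonal entries (exclude each vector's self-similarity).
    return (sims.sum() - n) / (n * (n - 1))
```

For clips from the same speaker you would expect a value close to 1; a much lower value suggests the embeddings really are inconsistent, rather than just being plotted at a zoomed-in scale.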
@brcisna Can you try this vocoder model and let me know whether you still get "wind in the microphone" effect? #126 (comment) Because the synthesizer is not deterministic you will need a few attempts to conclude if there is a difference. If it still occurs with the new vocoder it is likely an artifact of the synthesizer.
If you plan on distributing your work, please be mindful of the legal implications of using someone else's voice and make sure you have secured rights if necessary.
Closed due to inactivity. @brcisna please reopen the issue if you have more to discuss.
Hello All,
This is not an issue, but seeing as there are no forums for this software, I just wanted to hear anyone's thoughts on the cloning accuracy they are getting.
I personally use this in kind of an odd manner, in that I use it to clone voices for narration in historic videos (I do old auto racing history), so my routine is:
2- Save this audio and split it into 10-15 second audio clips.
3- Feed four 10-second clips to the toolbox, synthesizing each clip.
4- Synthesize and vocode after typing the narration I want to use into the text box.
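Step 2 above (splitting the saved audio into fixed-length clips) can be done with Python's standard-library `wave` module. This is just an illustrative sketch; the file names, clip length, and output directory are assumptions:

```python
import os
import wave

def split_wav(path, clip_seconds=10, out_dir="clips"):
    """Split a WAV file into fixed-length clips (the last clip may be shorter).

    Returns the number of clips written.
    """
    os.makedirs(out_dir, exist_ok=True)
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_clip = params.framerate * clip_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break
            out_path = os.path.join(out_dir, f"clip_{index:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            index += 1
    return index
```

If you prefer the command line, ffmpeg's segment muxer (`-f segment -segment_time 10`) does the same job.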
The results are surprisingly accurate, other than that no 'emotion' is possible. Of course, I am just using this at a hobby level; it probably wouldn't be acceptable for someone trying to do professional presentations. I am not sure my procedure is really even correct, but it works for me. Sometimes the result ends up with a slight muffling, what I would call a 'wind in the microphone' effect, at either the start or finish of the generated audio.
Also, I am not sure how to interpret the lower-left box, where the projected points generated from the same voice are scattered all over. I am pretty sure these points should be almost directly on top of one another. I am very green at how this is supposed to work.
Anyone, please comment on your routine.
Sorry for long post.
Admin: if this is not acceptable here, delete the post.
Thanks.