cloning accuracy #404
I believe the embedding only depends on the last file loaded. In other words, the toolbox has no memory and does not learn as it is used. So you can experiment to see which of the clips results in the best cloned voice. (Will be a lot easier with the toolbox once #402 is merged)
You are correct: if the speaker encoder is good then all the points from a single speaker should form a distinct cluster away from other speakers. However, if it is only plotting data from a single speaker then I think the autoscaling will make those points appear farther apart than they are in reality.
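One way to check cluster tightness independently of the plot's autoscaling is to measure the pairwise cosine similarity of the utterance embeddings directly. This is a minimal sketch, not toolbox API: the helper below assumes you have already extracted the embedding vectors (e.g. as NumPy arrays) by some means.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Average cosine similarity between all pairs of embeddings.

    Values close to 1.0 mean the clips form a tight cluster for one
    speaker, even if an autoscaled projection makes them look spread out.
    """
    E = np.asarray(embeddings, dtype=np.float64)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalise rows
    sims = E @ E.T                                    # pairwise cosine matrix
    n = len(E)
    # Average the off-diagonal entries (exclude each vector's self-similarity).
    return (sims.sum() - n) / (n * (n - 1))
```

For clips from the same speaker you would expect a value close to 1; a much lower value suggests the embeddings really are inconsistent, rather than just being plotted at a zoomed-in scale.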
@brcisna Can you try this vocoder model and let me know whether you still get "wind in the microphone" effect? #126 (comment) Because the synthesizer is not deterministic you will need a few attempts to conclude if there is a difference. If it still occurs with the new vocoder it is likely an artifact of the synthesizer.
If you plan on distributing your work, please be mindful of the legal implications of using someone else's voice and make sure you have secured rights if necessary.
Closed due to inactivity. @brcisna please reopen the issue if you have more to discuss.
Hello All,
This is not an issue, but seeing as there are no forums for this software, I just wanted to hear anyone's thoughts on the cloning accuracy they are getting.
I personally use this in kind of an odd manner, in that I use it to clone voices for narration in historic videos (I do old auto racing history), so my routine is:
2- Save this audio and split it into 10-15 second audio clips.
3- Feed four 10-second clips to the toolbox, synthesizing each clip.
4- Synthesize and vocode after typing the narration I want to use into the text box.
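Step 2 above (splitting the saved audio into fixed-length clips) can be done with Python's standard-library `wave` module. This is just an illustrative sketch; the file names, clip length, and output directory are assumptions:

```python
import os
import wave

def split_wav(path, clip_seconds=10, out_dir="clips"):
    """Split a WAV file into fixed-length clips (the last clip may be shorter).

    Returns the number of clips written.
    """
    os.makedirs(out_dir, exist_ok=True)
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_clip = params.framerate * clip_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break
            out_path = os.path.join(out_dir, f"clip_{index:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)
                dst.writeframes(frames)
            index += 1
    return index
```

If you prefer the command line, ffmpeg's segment muxer (`-f segment -segment_time 10`) does the same job.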
The results are surprisingly accurate, other than that no 'emotion' is possible. Of course, I am just using this at a hobby level; it probably wouldn't be acceptable for someone trying to do professional presentations. I am not sure my procedure is really even correct, but it works for me. Sometimes the result ends up with a slight muffling, what I would call a 'wind in the microphone' effect, at either the start or finish of the generated audio.
Also, I am not sure how to interpret the lower-left box, where the projected points generated from the same voice are scattered all over. I am pretty sure these points should be almost directly on top of one another. I am very green at how this is supposed to work.
Anyone, please comment on your routine.
Sorry for long post.
Admin: if this is not acceptable here, delete the post.
Thanks.