since 3.10 my models train with a strong english accent #438
Comments
UPDATE: Running infer on the same audio file with the same config and the same model (which should work, since it was trained with 3.9.3 before the issue appeared) still produces a strong English accent, suggesting that the issue is in the "infer" process and not in the training process. The output also seems to have more noise in it.
OK, I just downgraded to 3.9.3 and everything works as expected. If you are using a model in a language DIFFERENT than English, DO NOT USE the newer versions!!!!
@vertexgamer how do I downgrade?
Never mind, I figured it out, in case anyone else is wondering.
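For anyone else who needs the downgrade: assuming the project was installed from PyPI (the package name `so-vits-svc-fork` is an assumption based on this repo, so check how you installed it), pinning the version should be enough:

```shell
# Roll back to the last known-good release mentioned in this thread (3.9.3).
# Assumes the project was installed from PyPI as "so-vits-svc-fork".
pip install "so-vits-svc-fork==3.9.3"
```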
One thing I'm wondering now: What if you train a model in 3.10+ and then infer on 3.9? Do you also get those accent results?
In my experience, no; only the infer process affects the accent. But just to be sure, I'm training a model with 3.9.3 right now. When I'm done I will come back with more info.
@Lordmau5 It seems there is no audible difference when using 3.9.3 models vs 3.10+ ones.
Hmm... okay, that's interesting. I know 3.10 did a switch from the […] Apparently it's not relying on the (correct?) pretrained contentvec model anymore and doesn't utilize it. Maybe it has to do with that? @34j, any thoughts? (Seeing as you made those changes.)
Looking at the code a bit more, it does rely on a contentvec model, but it's relying on the regular contentvec model, not the contentvec LEGACY model, as also offered here: […] And looking at the Hugging Face repository, it seems to actually be the legacy one? I am very confused. I can't help with this as I didn't make these changes...
I would like to suggest the possibility that the contents of final_proj are different, because I remember the non-final_proj version worked for me (probably).
I'm not confident, so anyone who has time, please test it.
I did test one thing, and that was adding […]
(It also mentions the error in realtime instead of inference.) What else should I try in regards to testing it? 🤔
Hmm, I just went with the template it gave me (I started this model around a week ago and couldn't spot any changes to the templates in regards to model input, ssl_dim, or similar). According to the wiki: […]
Doesn't this mean that the config templates should be adjusted going forward? I did try both of your models and they sound fine to me... Also no errors when inferring them.
For now, I would like to suggest changing the default back to contentvec_final_proj=False and dealing with it later.
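For anyone following along, a sketch of where that flag would sit in the config; the exact section and the ssl_dim value are assumptions based on this thread (ssl_dim is mentioned alongside it above), not verified against the actual templates:

```json
{
  "model": {
    "ssl_dim": 256,
    "contentvec_final_proj": false
  }
}
```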
Maybe so-vits-svc's contentvec is uniquely retrained in the final layer to enhance Japanese / Chinese pronunciation? I don't know how it works, though, so I can't say for sure...
Okay, so, what I gathered just now: starting a new model and doing […] However, doing […] I'm giving that training a go with the Kurzgesagt voice for testing, to several thousand steps, and will report back. The thing I see is that we need to figure out if a model is of […]
If you have more free time, you can follow this procedure to convert so-vits-svc's ContentVec and test it again: https://huggingface.co/lengyue233/content-vec-best
Yeah, I converted the […] Trying to convert the non-legacy checkpoint just errors out with that config (which makes sense).
What about with "contentvec_final_proj": true? |
Note that final_proj is one nn.Linear that outputs 256 channels from 768 input channels. |
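As a purely dimensional sketch of what that layer does (numpy standing in for torch.nn.Linear here, with random weights rather than the real checkpoint values):

```python
import numpy as np

# final_proj is described above as a single linear layer: 768 -> 256.
# nn.Linear stores its weight with shape (out_features, in_features).
rng = np.random.default_rng(0)
weight = rng.standard_normal((256, 768))  # stand-in for final_proj.weight
bias = rng.standard_normal(256)           # stand-in for final_proj.bias

def final_proj(x: np.ndarray) -> np.ndarray:
    """Apply y = x @ W.T + b, the same affine map nn.Linear computes."""
    return x @ weight.T + bias

# A batch of 10 ContentVec frames, 768 channels each:
features = rng.standard_normal((10, 768))
projected = final_proj(features)
print(projected.shape)  # (10, 256)
```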
Yup, that works. But seeing as that's the default, I assume we're back to square one with the English accent...
I'm not sure, but if the rebuilt version (for the purpose of replacing final_proj) still doesn't work, I think the only way is to extract and insert final_proj from our ckpt, or ask lengyue233 for help, or my code is wrong.
Modifying the […] I'm unsure how to convert it to a functional PyTorch model...
You had better read our code first before talking. I'm saying that the weights for final_proj are different between the two original non-Hugging Face models, and we need to replace them.
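If I follow the point correctly, the idea is to copy the final_proj parameters out of one checkpoint's state dict and overwrite them in the other. A minimal sketch with plain dicts and numpy; the key names `final_proj.weight` / `final_proj.bias` are an assumption for illustration, and real checkpoints would be loaded with torch.load instead of random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for two loaded state dicts (real ones come from torch.load(...)).
# The "final_proj.*" key names are assumed, not taken from the actual models.
hf_state = {
    "final_proj.weight": rng.standard_normal((256, 768)),
    "final_proj.bias": rng.standard_normal(256),
}
svc_state = {
    "final_proj.weight": rng.standard_normal((256, 768)),
    "final_proj.bias": rng.standard_normal(256),
}

def replace_final_proj(dst: dict, src: dict) -> dict:
    """Return a copy of dst with its final_proj parameters taken from src."""
    out = dict(dst)
    for key in ("final_proj.weight", "final_proj.bias"):
        out[key] = src[key].copy()
    return out

patched = replace_final_proj(hf_state, svc_state)
assert np.array_equal(patched["final_proj.weight"], svc_state["final_proj.weight"])
```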
Aaaaah I see. I still don't understand much about the AI side of things with the project (I'm happy I can contribute with fixes here and there) so I apologize for that |
I would like this to be resolved as soon as possible, do you have time now? |
On second thought, I think I'm the only person who can understand my dirty code, and I guess I should archive this repo.
It is painful to be blamed for wasting the planet's computing resources, because people had to train an incorrect model and the problem went unidentified for two days.
I've tried it and can't tell the difference...... |
3.10.0: 1.out.wav.mp4
3.9.5: 1.out.wav.mp4
The rebuilt one: 1.out.wav.mp4
Still not fixed...
https://huggingface.co/lengyue233/content-vec-best/blob/c0b9ba13db21beaa4053faae94c102ebe326fd68/convert.py#L131-L132 |
So have you guys found the origin of the issue? |
I was unfortunately asleep at that time (7:25 AM, and I was awake until like 5 AM, hah), sorry :(
I mean, you said it yourself before that you're still pretty new to this AI stuff if I remember correctly? Don't be too hard on yourself. There are bugs in other repositories that are way trickier to fix and might even go under the radar for longer 🙏
Well, 34j did push a fix in 3.10.5 for it - would you be able to give that another go and see if it is more comparable to what you got in 3.9? |
@Lordmau5 Right now I'm training a model; when I finish, I will try it. I asked a friend to try it and it seems to be very similar to 3.9.3, but it might be placebo, as the trained iterations are not the same.
Since 3.10 my models train with a strong English accent. I first thought it was an overtraining problem, but when training from scratch the same issue happens.