
since 3.10 my models train with a strong english accent #438

Closed
vertexgamer opened this issue Apr 20, 2023 · 35 comments · Fixed by #455
Labels
bug Something isn't working

Comments

@vertexgamer

Since 3.10 my models train with a strong English accent. I first thought it was an overtraining problem, but when training from scratch the same issue happens.

@vertexgamer vertexgamer added the bug Something isn't working label Apr 20, 2023
@vertexgamer
Author

vertexgamer commented Apr 20, 2023

UPDATE: running infer on the same audio file with the same config and the same model (which should work, as it was trained with 3.9.3, before the issue appeared) still produces a strong English accent, suggesting that the issue is in the "infer" process rather than in training. The output also seems to have more noise in it.

@vertexgamer
Author

OK, I just downgraded to 3.9.3 and everything works as expected.
It's CONFIRMED that newer versions have an accent bias.

If you are using a model in a language DIFFERENT from English, DO NOT USE newer versions!

@liamlenholm

@vertexgamer how do I downgrade?

@liamlenholm

@vertexgamer how do I downgrade?

Never mind, in case anyone else is wondering:

pip install <package>==<version>
pip install -U so-vits-svc-fork==3.9.3

@Lordmau5
Contributor

OK, I just downgraded to 3.9.3 and everything works as expected. It's CONFIRMED that newer versions have an accent bias.

If you are using a model in a language DIFFERENT from English, DO NOT USE newer versions!

One thing I'm wondering now:

What if you train a model in 3.10+ and then infer on 3.9? Do you also get those accent results?

@vertexgamer
Author

OK, I just downgraded to 3.9.3 and everything works as expected. It's CONFIRMED that newer versions have an accent bias.
If you are using a model in a language DIFFERENT from English, DO NOT USE newer versions!

One thing I'm wondering now:

What if you train a model in 3.10+ and then infer on 3.9? Do you also get those accent results?

In my experience, no: only the infer process affects the accent. But just to be sure, I'm training a model with 3.9.3 right now. When I'm done I will come back with more info.

@vertexgamer
Author

@Lordmau5 it seems there is no audible difference between models trained on 3.9.3 and on 3.10+.

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

Hmm... okay that's interesting.

I know 3.10 switched from the fairseq library to transformers, which from what I can tell means one less step when building.
a2fe0f3

Apparently it no longer relies on the (correct?) pretrained ContentVec model and doesn't utilize it.
I saw that another voice changer project that supports so-vits-svc models did require it, though.

Maybe it has to do with that? @34j any thoughts? (Seeing as you made those changes)


Looking at the code a bit more, it does rely on a ContentVec model, but on the regular ContentVec model rather than the ContentVec LEGACY model, as also offered here:
https://github.com/auspicious3000/contentvec


And looking at the Hugging Face repository, it seems to actually be the legacy one?
https://huggingface.co/lengyue233/content-vec-best
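
One way to check which one it is (a hypothetical sketch, not from this thread; it assumes the repo ships a pytorch_model.bin, and that a final_proj layer is what distinguishes the legacy 256-dim ContentVec from the plain 768-dim one):

import torch
from huggingface_hub import hf_hub_download

# Download the weights and look for a final_proj layer in the state dict.
path = hf_hub_download("lengyue233/content-vec-best", "pytorch_model.bin")
state = torch.load(path, map_location="cpu")
for key in state:
    if "final_proj" in key:
        # A final_proj.weight of shape (256, 768) would mean the 768 -> 256
        # projection is shipped with the model.
        print(key, tuple(state[key].shape))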

I am very confused. I can't help with this as I didn't make these changes...

@34j
Collaborator

34j commented Apr 21, 2023

I would like to suggest the possibility that the contents of final_proj are different, because I remember the non-final_proj version worked for me (probably).

@34j
Collaborator

34j commented Apr 21, 2023

I'm not confident, so anyone who has time, please test it.

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

I did test one thing: adding "contentvec_final_proj": false to the config. Unfortunately, that returned errors during inference and didn't output an audio file...
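
For context, the change amounts to something like this in the model's config.json (a sketch; the contentvec_final_proj key is from this thread, while the surrounding structure and the ssl_dim value are assumptions based on a typical legacy template):

{
  "model": {
    "ssl_dim": 256,
    "contentvec_final_proj": false
  }
}

With final_proj disabled, ContentVec emits 768-channel features while a model built with ssl_dim 256 still expects 256, which is exactly the channel mismatch in the traceback below.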

[14:19:09] Starting inference...
[14:19:12] E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\modules\synthesizers.py:81: UserWarning: Unused arguments: {'n_layers_q': 3, 'use_spectral_norm': False}
  warnings.warn(f"Unused arguments: {kwargs}")

[14:19:12] Decoder type: hifi-gan
[14:19:13] Loaded checkpoint 'E:/Development/so-vits-svc-4.0/Kurzgesagt/logs/44k/G_800.pth' (iteration 34)
[14:19:13] Chunk: Chunk(Speech: False, 8820.0)
[14:19:13] Chunk: Chunk(Speech: True, 361620.0)
[14:19:13] F0 inference time:       0.167s, RTF: 0.020
[14:19:17] HuBERT inference time  : 2.987s, RTF: 0.356
[14:19:17] Finished inference for cbt_normal.wav
[14:19:17] Error in realtime: 
[14:19:17] Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 723] to have 256 channels, but got 768 channels instead
pebble.common.RemoteTraceback: Traceback (most recent call last):
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\pebble\common.py", line 174, in process_execute
    return function(*args, **kwargs)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\inference\main.py", line 56, in infer
    audio = svc_model.infer_silence(
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\inference\core.py", line 284, in infer_silence
    audio_chunk_pad_infer_tensor, _ = self.infer(
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\inference\core.py", line 218, in infer
    audio = self.net_g.infer(
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\modules\synthesizers.py", line 213, in infer
    x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1, 2)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\torch\nn\modules\conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\torch\nn\modules\conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 723] to have 256 channels, but got 768 channels instead


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\gui.py", line 667, in main
    future.result()
  File "c:\program files\python310\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "c:\program files\python310\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 723] to have 256 channels, but got 768 channels instead
[14:19:17] Error in inference: (same RuntimeError and traceback as above)
[14:19:18] Error in realtime: (same RuntimeError and traceback as above)

(It also says "Error in realtime" instead of "Error in inference".)

What else should I try in regards to testing it? 🤔

@34j
Collaborator

34j commented Apr 21, 2023

That's not surprising, because your model's input is 256 channels and it expects final_proj'd input. You can download the model I trained from here or here if you don't have one.

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

That's not surprising, because your model's input is 256 channels and it expects final_proj'd input.

Hmm, I just went with the template it gave me (I started this model around a week ago and couldn't spot any changes to the templates regarding model input, ssl_dim, or similar).

According to the wiki:

The ssl_dim is the number of input channels, and the correct number of output channels for the officially trained ContentVec model is 768, but after applying final_proj it is 256.

Doesn't this mean that the config templates should be adjusted going forward?
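
If the wiki is right, the consistent pairings would be roughly (my reading of the quoted text, not verified against the actual templates):

"ssl_dim": 256  with  "contentvec_final_proj": true   (projected features, legacy templates)
"ssl_dim": 768  with  "contentvec_final_proj": false  (raw ContentVec output)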


I did try both of your models and they sound fine to me... Also, no errors when inferring with them.

@34j
Collaborator

34j commented Apr 21, 2023

For now, I would like to suggest changing the default back to contentvec_final_proj=False and dealing with it later.

@34j
Collaborator

34j commented Apr 21, 2023

Maybe so-vits-svc's ContentVec has its final layer uniquely retrained to enhance Japanese/Chinese pronunciation? I don't know how it works, though, so I can't say for sure...

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

Okay so, what I gathered just now:

Starting a new model and doing svc pre-config will select so-vits-svc-4.0v1-legacy by default.
Trying to set "contentvec_final_proj": false in that config file will return errors like the ones above, because it is a different model structure/config.

However, doing svc pre-config -t so-vits-svc-4.0v1 gives the correct structure, in which I can set final_proj to true and then train with that.
Training it does seem to take a bit longer, however; legacy at around 500 steps sounds better than the new one. That's fine, though, as long as it's mentioned.
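
For anyone following along, the two paths look roughly like this (the commands are from this thread; the config path is an assumption based on the project's defaults):

svc pre-config                          # defaults to the so-vits-svc-4.0v1-legacy template
svc pre-config -t so-vits-svc-4.0v1     # non-legacy template
# then edit configs/44k/config.json as needed and run:
svc train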

I'm giving that training a go with the Kurzgesagt voice for testing, up to several thousand steps, and will report back.


The thing I see is that we need to figure out whether a model is of "type_": "hifi-gan" or similar (i.e., not legacy), and in that case have it use contentvec_final_proj=False.

Additionally, I've noticed an n_speakers variable that's set to 200 by default. I remember you saying something along the lines of "do we need 200 speakers?" and asking whether it could make the model smaller?
My bad, it wasn't n_speakers, it was the VITS model in general: #314

@34j
Collaborator

34j commented Apr 21, 2023

https://huggingface.co/lengyue233/content-vec-best

If you have more free time, you can follow this procedure to convert so-vits-svc's ContentVec and test it again.
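
The grafting idea would look roughly like this (a sketch under assumptions: the fairseq-style checkpoint keeps its state dict under "model" with final_proj.weight / final_proj.bias entries, and hubert is a converted Hugging Face model that exposes .final_proj, as in lengyue233's custom class):

import torch

ckpt = torch.load("checkpoint_best_legacy_500.pt", map_location="cpu")
weights = ckpt["model"]  # fairseq checkpoints keep the state dict under "model"

# final_proj is a single Linear(768 -> 256); copy so-vits-svc's own projection
# weights into the converted model before running inference.
hubert.final_proj.weight.data.copy_(weights["final_proj.weight"])
hubert.final_proj.bias.data.copy_(weights["final_proj.bias"])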

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

Yeah, I converted checkpoint_best_legacy_500.pt and loaded it in code instead of getting it from the lengyue233 Hugging Face repository; the results are the same. It's still erroring... (expecting 256 channels but getting 768, on a so-vits-svc-4.0v1-legacy model with "contentvec_final_proj": false)

Trying to convert the non-legacy checkpoint is just erroring with that config (which makes sense)

@34j
Collaborator

34j commented Apr 21, 2023

What about with "contentvec_final_proj": true?

@34j
Collaborator

34j commented Apr 21, 2023

Note that final_proj is one nn.Linear that outputs 256 channels from 768 input channels.
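
In shape terms (a minimal sketch; the frame count is borrowed from the traceback above):

import torch
from torch import nn

final_proj = nn.Linear(768, 256)     # the single projection layer in question

features = torch.randn(1, 723, 768)  # (batch, frames, channels) straight from ContentVec
projected = final_proj(features)     # -> (1, 723, 256), what a 256-channel model expects
print(projected.shape)

Skipping this projection while the model's first convolution still expects 256 input channels reproduces the RuntimeError above.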

@Lordmau5
Contributor

What about with "contentvec_final_proj": true?

Yup, that works. But seeing as that's the default, I assume we're back to square one with the English accent...

@34j
Collaborator

34j commented Apr 21, 2023

I'm not sure, but if the rebuilt version (rebuilt to replace final_proj) still doesn't work, I think the only options are to extract final_proj from our ckpt and insert it, or to ask lengyue233 for help; or my code is wrong.

@Lordmau5
Contributor

After modifying the convert.py script from lengyue233's repository a bit to remove the final_proj-related code, I still get a config/model with a "hidden_size" of 768, but we need one with 256.

I'm unsure how to convert it to a functional PyTorch model...

@34j
Collaborator

34j commented Apr 22, 2023

After modifying the convert.py script from lengyue233's repository a bit to remove the final_proj-related code, I still get a config/model with a "hidden_size" of 768, but we need one with 256.

You had better read our code first before commenting. I'm saying that the final_proj weights differ between the two original non-Hugging Face models, and they need to be replaced.

@34j
Collaborator

34j commented Apr 22, 2023

Note that final_proj is one nn.Linear that outputs 256 channels from 768 input channels.

@Lordmau5
Contributor

You had better read our code first before commenting. I'm saying that the final_proj weights differ between the two original non-Hugging Face models, and they need to be replaced.

Aaaaah, I see. I still don't understand much about the AI side of the project (I'm happy I can contribute fixes here and there), so I apologize for that.

@34j
Collaborator

34j commented Apr 22, 2023

I would like this to be resolved as soon as possible. Do you have time now?

@34j
Collaborator

34j commented Apr 22, 2023

On second thought, I think I'm the only person who can understand my dirty code, and I guess I should archive this repo.

@34j
Collaborator

34j commented Apr 22, 2023

It is painful to be blamed for wasting the planet's computing resources because people had to train with an incorrect model that went unidentified for two days.

@34j 34j assigned 34j and unassigned 34j Apr 22, 2023
@34j
Collaborator

34j commented Apr 22, 2023

I've tried it and can't tell the difference...

@34j
Collaborator

34j commented Apr 22, 2023

3.10.0: 1.out.wav.mp4 (sample attachment)

3.9.5: 1.out.wav.mp4 (sample attachment)

The rebuilt one: 1.out.wav.mp4 (sample attachment)

Still not fixed...

@34j
Collaborator

34j commented Apr 22, 2023

result1 = hubert(new_input, output_hidden_states=True)["hidden_states"][9]  # hidden states after transformer layer 9 (index 0 is the embedding output)
result1 = hubert.final_proj(result1)  # final_proj: the Linear projection from 768 to 256 channels

https://huggingface.co/lengyue233/content-vec-best/blob/c0b9ba13db21beaa4053faae94c102ebe326fd68/convert.py#L131-L132
I didn't understand anything

@vertexgamer
Author

vertexgamer commented Apr 22, 2023

So have you guys found the origin of the issue?

@Lordmau5
Contributor

I would like this to be resolved as soon as possible, do you have time now?

I was unfortunately asleep at that time (7:25 AM, and I was awake until like 5 AM, hah), sorry :(

It is painful to be blamed for wasting the computing costs of the planet by having to train an incorrect model that was not identifiable for two days.

I mean, you said it yourself before that you're still pretty new to this AI stuff, if I remember correctly? Don't be too hard on yourself. There are bugs in other repositories that are way trickier to fix and might even fly under the radar for longer 🙏

So have you guys found the origin of the issue?

Well, 34j did push a fix for it in 3.10.5. Would you be able to give that another go and see if it's more comparable to what you got in 3.9?
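
(That is, something like the following, with the version number taken from this comment:)

pip install -U so-vits-svc-fork==3.10.5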

@vertexgamer
Author

@Lordmau5 Right now I'm training a model; when I finish, I will try it. I asked a friend to try it, and it seems very similar to 3.9.3, but that might be placebo, as the trained iteration counts are not the same.
