
since 3.10 my models train with a strong english accent #438

Closed
vertexgamer opened this issue Apr 20, 2023 · 35 comments · Fixed by #455
Labels
bug Something isn't working

Comments

@vertexgamer

Since 3.10 my models train with a strong English accent. I first thought it was an overtraining problem, but when training from scratch the same issue happens.

@vertexgamer vertexgamer added the bug Something isn't working label Apr 20, 2023
@vertexgamer
Author

vertexgamer commented Apr 20, 2023

UPDATE: running infer on the same audio file with the same config and the same model (which should work, as it was trained with 3.9.3, before the issue appeared) still produces a strong English accent, suggesting that the issue is in the "infer" process rather than in training. The output also seems to have more noise in it.

@vertexgamer
Author

OK, I just downgraded to 3.9.3 and everything works as expected.
It's CONFIRMED that newer versions have an accent bias.

If you are using a model in a language DIFFERENT from English, DO NOT USE newer versions!

@liamlenholm

@vertexgamer how do I downgrade?

@liamlenholm

@vertexgamer how do I downgrade?

Never mind, in case anyone else is wondering:

pip install <package>==<version>
pip install -U so-vits-svc-fork==3.9.3

@Lordmau5
Contributor

OK, I just downgraded to 3.9.3 and everything works as expected. It's CONFIRMED that newer versions have an accent bias.

If you are using a model in a language DIFFERENT from English, DO NOT USE newer versions!

One thing I'm wondering now:

What if you train a model in 3.10+ and then infer on 3.9? Do you also get those accent results?

@vertexgamer
Author

OK, I just downgraded to 3.9.3 and everything works as expected. It's CONFIRMED that newer versions have an accent bias.
If you are using a model in a language DIFFERENT from English, DO NOT USE newer versions!

One thing I'm wondering now:

What if you train a model in 3.10+ and then infer on 3.9? Do you also get those accent results?

In my experience, no: only the infer process affects the accent. But just to be sure, I'm training a model with 3.9.3 right now. When I'm done I will come back with more info.

@vertexgamer
Author

@Lordmau5 it seems there is no audible difference between models trained on 3.9.3 and on 3.10+.

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

Hmm... okay that's interesting.

I know 3.10 switched from the fairseq library to transformers, which from what I can tell means one less step when building.
a2fe0f3

Apparently it no longer relies on the (correct?) pretrained ContentVec model and doesn't utilize it.
I saw that another voice changer project that supports so-vits-svc models did require it, though.

Maybe it has to do with that? @34j any thoughts? (Seeing as you made those changes)


Looking at the code a bit more, it does rely on a ContentVec model, but on the regular ContentVec model rather than the ContentVec LEGACY model, as also offered here:
https://github.com/auspicious3000/contentvec


And looking at the Hugging Face repository, it seems to actually be the legacy one?
https://huggingface.co/lengyue233/content-vec-best
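
One way to check which one it is (a hypothetical sketch, not from this thread; it assumes the repo ships a pytorch_model.bin, and that a final_proj layer is what distinguishes the legacy 256-dim ContentVec from the plain 768-dim one):

import torch
from huggingface_hub import hf_hub_download

# Download the weights and look for a final_proj layer in the state dict.
path = hf_hub_download("lengyue233/content-vec-best", "pytorch_model.bin")
state = torch.load(path, map_location="cpu")
for key in state:
    if "final_proj" in key:
        # A final_proj.weight of shape (256, 768) would mean the 768 -> 256
        # projection is shipped with the model.
        print(key, tuple(state[key].shape))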

I am very confused. I can't help with this as I didn't make these changes...

@34j
Collaborator

34j commented Apr 21, 2023

I would like to suggest the possibility that the contents of final_proj are different, because I remember the non-final_proj version worked for me (probably).

@34j
Collaborator

34j commented Apr 21, 2023

I'm not confident, so anyone who has time, please test it.

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

I did test one thing: adding "contentvec_final_proj": false to the config. Unfortunately, that returned errors during inference and didn't output an audio file...
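
For context, the change amounts to something like this in the model's config.json (a sketch; the contentvec_final_proj key is from this thread, while the surrounding structure and the ssl_dim value are assumptions based on a typical legacy template):

{
  "model": {
    "ssl_dim": 256,
    "contentvec_final_proj": false
  }
}

With final_proj disabled, ContentVec emits 768-channel features while a model built with ssl_dim 256 still expects 256, which is exactly the channel mismatch in the traceback below.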

[14:19:09] Starting inference...
[14:19:12] E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\modules\synthesizers.py:81: UserWarning: Unused arguments: {'n_layers_q': 3, 'use_spectral_norm': False}
  warnings.warn(f"Unused arguments: {kwargs}")

[14:19:12] Decoder type: hifi-gan
[14:19:13] Loaded checkpoint 'E:/Development/so-vits-svc-4.0/Kurzgesagt/logs/44k/G_800.pth' (iteration 34)
[14:19:13] Chunk: Chunk(Speech: False, 8820.0)
[14:19:13] Chunk: Chunk(Speech: True, 361620.0)
[14:19:13] F0 inference time:       0.167s, RTF: 0.020
[14:19:17] HuBERT inference time  : 2.987s, RTF: 0.356
[14:19:17] Finished inference for cbt_normal.wav
[14:19:17] Error in realtime: 
[14:19:17] Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 723] to have 256 channels, but got 768 channels instead
pebble.common.RemoteTraceback: Traceback (most recent call last):
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\pebble\common.py", line 174, in process_execute
    return function(*args, **kwargs)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\inference\main.py", line 56, in infer
    audio = svc_model.infer_silence(
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\inference\core.py", line 284, in infer_silence
    audio_chunk_pad_infer_tensor, _ = self.infer(
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\inference\core.py", line 218, in infer
    audio = self.net_g.infer(
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\modules\synthesizers.py", line 213, in infer
    x = self.pre(c) * x_mask + self.emb_uv(uv.long()).transpose(1, 2)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\torch\nn\modules\conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\torch\nn\modules\conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 723] to have 256 channels, but got 768 channels instead


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\Development\so-vits-svc-4.0\__env\lib\site-packages\so_vits_svc_fork\gui.py", line 667, in main
    future.result()
  File "c:\program files\python310\lib\concurrent\futures\_base.py", line 451, in result
    return self.__get_result()
  File "c:\program files\python310\lib\concurrent\futures\_base.py", line 403, in __get_result
    raise self._exception
RuntimeError: Given groups=1, weight of size [192, 256, 5], expected input[1, 768, 723] to have 256 channels, but got 768 channels instead
[14:19:17] Error in inference: (same RuntimeError and traceback as above)
[14:19:18] Error in realtime: (same RuntimeError and traceback as above)

(It also says "Error in realtime" instead of "Error in inference".)

What else should I try in regards to testing it? 🤔

@34j
Collaborator

34j commented Apr 21, 2023

That's not surprising, because your model's input is 256 channels and it expects final_proj'd input. You can download the model I trained from here or here if you don't have one.

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

That's not surprising, because your model's input is 256 channels and it expects final_proj'd input.

Hmm, I just went with the template it gave me (I started this model around a week ago and couldn't spot any changes to the templates regarding model input, ssl_dim, or similar).

According to the wiki:

The ssl_dim is the number of input channels, and the correct number of output channels for the officially trained ContentVec model is 768, but after applying final_proj it is 256.

Doesn't this mean that the config templates should be adjusted going forward?
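
If the wiki is right, the consistent pairings would be roughly (my reading of the quoted text, not verified against the actual templates):

"ssl_dim": 256  with  "contentvec_final_proj": true   (projected features, legacy templates)
"ssl_dim": 768  with  "contentvec_final_proj": false  (raw ContentVec output)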


I did try both of your models and they sound fine to me... Also, no errors when inferring with them.

@34j
Collaborator

34j commented Apr 21, 2023

For now, I would like to suggest changing the default back to contentvec_final_proj=False and dealing with it later.

@34j
Collaborator

34j commented Apr 21, 2023

Maybe so-vits-svc's ContentVec has its final layer uniquely retrained to enhance Japanese/Chinese pronunciation? I don't know how it works, though, so I can't say for sure...

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

Okay so, what I gathered just now:

Starting a new model and doing svc pre-config will select so-vits-svc-4.0v1-legacy by default.
Trying to set "contentvec_final_proj": false in that config file will return errors like the ones above, because it is a different model structure/config.

However, doing svc pre-config -t so-vits-svc-4.0v1 gives the correct structure, in which I can set final_proj to true and then train with that.
Training it does seem to take a bit longer, however; legacy at around 500 steps sounds better than the new one. That's fine, though, as long as it's mentioned.
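
For anyone following along, the two paths look roughly like this (the commands are from this thread; the config path is an assumption based on the project's defaults):

svc pre-config                          # defaults to the so-vits-svc-4.0v1-legacy template
svc pre-config -t so-vits-svc-4.0v1     # non-legacy template
# then edit configs/44k/config.json as needed and run:
svc train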

I'm giving that training a go with the Kurzgesagt voice for testing, up to several thousand steps, and will report back.


The thing I see is that we need to figure out whether a model is of "type_": "hifi-gan" or similar (i.e., not legacy), and in that case have it use contentvec_final_proj=False.

Additionally, I've noticed an n_speakers variable that's set to 200 by default. I remember you saying something along the lines of "do we need 200 speakers?" and asking whether it could make the model smaller?
My bad, it wasn't n_speakers, it was the VITS model in general: #314

@34j
Collaborator

34j commented Apr 21, 2023

https://huggingface.co/lengyue233/content-vec-best

If you have more free time, you can follow this procedure to convert so-vits-svc's ContentVec and test it again.
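
The grafting idea would look roughly like this (a sketch under assumptions: the fairseq-style checkpoint keeps its state dict under "model" with final_proj.weight / final_proj.bias entries, and hubert is a converted Hugging Face model that exposes .final_proj, as in lengyue233's custom class):

import torch

ckpt = torch.load("checkpoint_best_legacy_500.pt", map_location="cpu")
weights = ckpt["model"]  # fairseq checkpoints keep the state dict under "model"

# final_proj is a single Linear(768 -> 256); copy so-vits-svc's own projection
# weights into the converted model before running inference.
hubert.final_proj.weight.data.copy_(weights["final_proj.weight"])
hubert.final_proj.bias.data.copy_(weights["final_proj.bias"])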

@Lordmau5
Contributor

Lordmau5 commented Apr 21, 2023

Yeah, I converted checkpoint_best_legacy_500.pt and loaded it in code instead of getting it from the lengyue233 Hugging Face repository; the results are the same. It's still erroring... (expecting 256 channels but getting 768, on a so-vits-svc-4.0v1-legacy model with "contentvec_final_proj": false)

Trying to convert the non-legacy checkpoint is just erroring with that config (which makes sense)

@34j
Collaborator

34j commented Apr 21, 2023

What about with "contentvec_final_proj": true?

@34j
Collaborator

34j commented Apr 21, 2023

Note that final_proj is one nn.Linear that outputs 256 channels from 768 input channels.
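
In shape terms (a minimal sketch; the frame count is borrowed from the traceback above):

import torch
from torch import nn

final_proj = nn.Linear(768, 256)     # the single projection layer in question

features = torch.randn(1, 723, 768)  # (batch, frames, channels) straight from ContentVec
projected = final_proj(features)     # -> (1, 723, 256), what a 256-channel model expects
print(projected.shape)

Skipping this projection while the model's first convolution still expects 256 input channels reproduces the RuntimeError above.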

@Lordmau5
Contributor

What about with "contentvec_final_proj": true?

Yup, that works. But seeing as that's the default, I assume we're back to square one with the English accent...

@34j
Collaborator

34j commented Apr 21, 2023

I'm not sure, but if the rebuilt version (rebuilt to replace final_proj) still doesn't work, I think the only options are to extract final_proj from our ckpt and insert it, or to ask lengyue233 for help; or my code is wrong.

@Lordmau5
Contributor

After modifying the convert.py script from lengyue233's repository a bit to remove the final_proj-related code, I still get a config/model with a "hidden_size" of 768, but we need one with 256.

I'm unsure how to convert it to a functional PyTorch model...

@34j
Collaborator

34j commented Apr 22, 2023

After modifying the convert.py script from lengyue233's repository a bit to remove the final_proj-related code, I still get a config/model with a "hidden_size" of 768, but we need one with 256.

You had better read our code first before commenting. I'm saying that the final_proj weights differ between the two original non-Hugging Face models, and they need to be replaced.

@34j
Collaborator

34j commented Apr 22, 2023

Note that final_proj is one nn.Linear that outputs 256 channels from 768 input channels.

@Lordmau5
Contributor

You had better read our code first before commenting. I'm saying that the final_proj weights differ between the two original non-Hugging Face models, and they need to be replaced.

Aaaaah, I see. I still don't understand much about the AI side of the project (I'm happy I can contribute fixes here and there), so I apologize for that.

@34j
Collaborator

34j commented Apr 22, 2023

I would like this to be resolved as soon as possible. Do you have time now?

@34j
Collaborator

34j commented Apr 22, 2023

On second thought, I think I'm the only person who can understand my dirty code, and I guess I should archive this repo.

@34j
Collaborator

34j commented Apr 22, 2023

It is painful to be blamed for wasting the planet's computing resources because people had to train with an incorrect model that went unidentified for two days.

@34j 34j assigned 34j and unassigned 34j Apr 22, 2023
@34j
Collaborator

34j commented Apr 22, 2023

I've tried it and can't tell the difference...

@34j
Collaborator

34j commented Apr 22, 2023

3.10.0: 1.out.wav.mp4 (sample attachment)

3.9.5: 1.out.wav.mp4 (sample attachment)

The rebuilt one: 1.out.wav.mp4 (sample attachment)

Still not fixed...

@34j
Collaborator

34j commented Apr 22, 2023

result1 = hubert(new_input, output_hidden_states=True)["hidden_states"][9]  # hidden states after transformer layer 9 (index 0 is the embedding output)
result1 = hubert.final_proj(result1)  # final_proj: the Linear projection from 768 to 256 channels

https://huggingface.co/lengyue233/content-vec-best/blob/c0b9ba13db21beaa4053faae94c102ebe326fd68/convert.py#L131-L132
I didn't understand anything

@vertexgamer
Author

vertexgamer commented Apr 22, 2023

So have you guys found the origin of the issue?

@Lordmau5
Contributor

I would like this to be resolved as soon as possible, do you have time now?

I was unfortunately asleep at that time (7:25 AM, and I was awake until like 5 AM, hah), sorry :(

It is painful to be blamed for wasting the computing costs of the planet by having to train an incorrect model that was not identifiable for two days.

I mean, you said it yourself before that you're still pretty new to this AI stuff, if I remember correctly? Don't be too hard on yourself. There are bugs in other repositories that are way trickier to fix and might even fly under the radar for longer 🙏

So have you guys found the origin of the issue?

Well, 34j did push a fix for it in 3.10.5. Would you be able to give that another go and see if it's more comparable to what you got in 3.9?
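
(That is, something like the following, with the version number taken from this comment:)

pip install -U so-vits-svc-fork==3.10.5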

@vertexgamer
Author

@Lordmau5 Right now I'm training a model; when I finish, I will try it. I asked a friend to try it, and it seems very similar to 3.9.3, but that might be placebo, as the trained iteration counts are not the same.
