
Implementation of VITS-2 #1508

Closed
ezerhouni opened this issue Feb 19, 2024 · 15 comments

Comments

@ezerhouni
Collaborator

Hello, I am trying to implement VITS2, but I am getting the following error:

  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 38, in piecewise_rational_quadratic_transform
    outputs, logabsdet = spline_fn(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 85, in unconstrained_rational_quadratic_spline
    ) = rational_quadratic_spline(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 118, in rational_quadratic_spline
    if torch.min(inputs) < left or torch.max(inputs) > right:
RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

Do you have an idea where it might come from? I know that without the code it is difficult to tell; I will open a PR with the implementation later this week. Thank you.

@csukuangfj
Collaborator

Can you try torch.min(inputs, dim=None)?

The error says you need to specify the dim argument for torch.min(), though your code looks correct to me.

@nshmyrev
Contributor

Same issue: coqui-ai/TTS#2555

It comes from a bad data file that does not align properly.

@csukuangfj
Collaborator

@ezerhouni

I suggest that you use
https://github.com/rhasspy/piper-phonemize
to convert text to tokens.

Otherwise, it may be difficult, if not impossible, to deploy the trained model with C++.

You can find pre-built wheels for Linux and Windows at
https://github.com/csukuangfj/piper-phonemize/releases/tag/2023.12.5

[Screenshot 2024-02-20 at 09:46:21]

@yaozengwei

Do you have any code to share for using piper-phonemize to convert text to tokens?

@ezerhouni
Collaborator Author

@csukuangfj Let me try torch.min(inputs, dim=None)
I am trying the LJSpeech recipe for the moment with VITS-2

@csukuangfj
Collaborator

> I am trying the LJSpeech recipe for the moment with VITS-2

Ok, but we are switching to piper-phonemize for converting text to tokens.

Hope that @yaozengwei can push the new tokenizer soon.

@yaozengwei
Collaborator

yaozengwei commented Feb 20, 2024

> > I am trying the LJSpeech recipe for the moment with VITS-2
>
> Ok, but we are switching to piper-phonemize for converting text to tokens.
>
> Hope that @yaozengwei can push the new tokenizer soon.

I just uploaded the code here #1511.

@ezerhouni
Collaborator Author

@csukuangfj Now I am getting:

  File "/vits2/egs/ljspeech/TTS/vits2/duration_predictor.py", line 191, in forward
    z = flow(z, x_mask, g=x, inverse=inverse)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/vits2/egs/ljspeech/TTS/vits2/flow.py", line 297, in forward
    xb, logdet_abs = piecewise_rational_quadratic_transform(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 38, in piecewise_rational_quadratic_transform
    outputs, logabsdet = spline_fn(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 85, in unconstrained_rational_quadratic_spline
    ) = rational_quadratic_spline(
  File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 175, in rational_quadratic_spline
    assert (discriminant >= 0).all()
AssertionError

I will try with the new tokenizer to see if it fixes the issue.
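For context, a sketch of why this assert can trip even when the math is sound: the discriminant b² - 4ac in the spline inversion is non-negative analytically, so a failure usually means NaN/Inf crept in upstream (e.g. a diverging loss), because (nan >= 0) evaluates to False. check_discriminant below is a hypothetical diagnostic, not code from the recipe:

```python
import torch

def check_discriminant(discriminant: torch.Tensor) -> None:
    # Separate "non-finite values leaked in" from "inputs left the spline
    # domain", since a bare assert conflates the two.
    bad = ~torch.isfinite(discriminant)
    if bad.any():
        raise RuntimeError(
            f"{int(bad.sum())} non-finite discriminant values; "
            "check for NaN/Inf upstream (e.g. a diverging loss)"
        )
    if (discriminant < 0).any():
        raise RuntimeError("negative discriminant: inputs outside spline domain")

# Demonstrates why the original assert fails on NaN: (nan >= 0) is False.
d = torch.tensor([1.0, float("nan")])
assert not bool((d >= 0).all())
```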

@csukuangfj
Collaborator

@yaozengwei
Could you have a look at the above error?

@yaozengwei
Collaborator

> Hello, I am trying to implement VITS2 but I am getting the following error :
>
>       File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 38, in piecewise_rational_quadratic_transform
>         outputs, logabsdet = spline_fn(
>       File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 85, in unconstrained_rational_quadratic_spline
>         ) = rational_quadratic_spline(
>       File "/vits2/egs/ljspeech/TTS/vits2/transform.py", line 118, in rational_quadratic_spline
>         if torch.min(inputs) < left or torch.max(inputs) > right:
>     RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.
>
> Do you have an idea where it might come from ? I know that without code it is difficult to know, I will do a PR of the implementation later this week. Thank you

It seems the input tensor to torch.min is empty.

@csukuangfj
Collaborator

>>> import torch
>>> a = torch.empty((0,))
>>> torch.min(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

An empty tensor will indeed throw the same error.
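A defensive guard ahead of the range check would surface this case with an actionable message instead of the opaque RuntimeError (a sketch only; safe_range_check is a hypothetical wrapper, not part of the recipe):

```python
import torch

def safe_range_check(inputs: torch.Tensor, left: float, right: float) -> None:
    # torch.min()/torch.max() on a 0-element tensor raise
    # "Expected reduction dim to be specified", so bail out first with a
    # message pointing at the likely cause (an upstream mask/alignment bug).
    if inputs.numel() == 0:
        raise ValueError(
            f"rational_quadratic_spline received an empty tensor "
            f"(shape={tuple(inputs.shape)}); check the upstream mask/alignment."
        )
    if torch.min(inputs) < left or torch.max(inputs) > right:
        raise ValueError("Input to a transform is not within its domain")

# An empty tensor now produces an actionable error message:
try:
    safe_range_check(torch.empty((0,)), -1.0, 1.0)
except ValueError as e:
    print(e)
```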

@ezerhouni
Collaborator Author

@csukuangfj I might have some good news, but it needs a bit more testing. I will let you know next week.

@ezerhouni
Collaborator Author

Unrelated to VITS-2 (please tell me if you would prefer that I open a separate issue): in the VITS recipes, the input spectrogram is computed with Wav2Spec while the loss is computed with Wav2LogFilterBank. Is that on purpose?

@JinZr
Collaborator

JinZr commented Mar 18, 2024

Hmm, I think we didn't choose this setup on purpose. @yaozengwei, am I right?

@yaozengwei
Collaborator

> Unrelated to VITS-2 (please tell me if you prefer that I open a proper issue), it seems that for the VITS recipes, you are using spectrogram which is using Wav2Spec while the loss is computed using Wav2LogFilterBank is on purpose ?

We just follow the VITS paper (https://arxiv.org/pdf/2106.06103.pdf), which uses the linear spectrogram as input to the posterior encoder (Sec. 2.1.3 and Fig. 1) and mel-scale spectrograms to compute the reconstruction loss (Sec. 2.1.2).
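A rough sketch of those two feature paths (the FFT size, hop size, and mel bin count below are illustrative, not the recipe's exact config, and the mel filterbank is stood in by a random matrix rather than a real mel matrix):

```python
import torch

# The posterior encoder consumes the linear-frequency magnitude spectrogram;
# the reconstruction loss compares mel-scale spectrograms of the real and
# generated audio.
wav = torch.randn(16000)          # 1 second of fake audio at 16 kHz
n_fft, hop = 1024, 256            # illustrative values
window = torch.hann_window(n_fft)
stft = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window,
                  return_complex=True)
linear_spec = stft.abs()          # (n_fft // 2 + 1, frames) -> posterior encoder

# For the loss, a mel filterbank (e.g. 80 bins) is applied to linear_spec and
# the result is log-compressed; a random matrix stands in for the mel matrix.
mel_fbank = torch.rand(80, n_fft // 2 + 1)
log_mel = torch.log(mel_fbank @ linear_spec + 1e-6)  # (80, frames) -> loss

print(linear_spec.shape, log_mel.shape)
```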

@ezerhouni
Collaborator Author

@yaozengwei Yes, my bad, I misunderstood part of the code.

@JinZr JinZr closed this as completed Mar 18, 2024