Compatibility with Matcha TTS #39
Comments
Would you mind helping with this? I don't know where to start.
@mush42 Hey, I'm messing around with the same thing currently. I tried synthesising with Vocos as the vocoder for Matcha TTS. Vocos seems to want 100 mel bins, while Matcha currently outputs spectrograms with 80 bins. I'm not sure of the best way to go: either retrain Matcha on 100 bins, or see if zero padding could work. I tried the latter earlier, zero padding from 80 to 100 mel bins and synthesising through the Vocos mel head, but the quality wasn't that great.
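For anyone wanting to reproduce the zero-padding experiment, it could look roughly like the sketch below. It assumes the spectrogram is a plain nested list shaped `[n_mels][n_frames]` holding log-mel values; `pad_mel_bins`, `pad_value`, and `LOG_FLOOR` are hypothetical names for illustration, not part of Matcha TTS or Vocos.

```python
import math


def pad_mel_bins(mel, target_bins, pad_value=0.0):
    """Append constant rows so `mel` ([n_mels][n_frames]) has `target_bins` rows."""
    n_mels = len(mel)
    if n_mels > target_bins:
        raise ValueError(f"spectrogram already has {n_mels} bins (> {target_bins})")
    n_frames = len(mel[0]) if mel else 0
    padding = [[pad_value] * n_frames for _ in range(target_bins - n_mels)]
    # Copy the original rows so the input is left untouched.
    return [list(row) for row in mel] + padding


# Toy 80-bin log-mel spectrogram with 4 frames.
mel_80 = [[-5.0] * 4 for _ in range(80)]

# Plain zero padding, as described in the comment above.
mel_100 = pad_mel_bins(mel_80, 100)
print(len(mel_100), len(mel_100[0]))  # 100 4

# Assumption: if the features are log-compressed, "silence" is closer to
# log(clip_value) than to 0. The clip value 1e-5 is illustrative only.
LOG_FLOOR = math.log(1e-5)
mel_100_floor = pad_mel_bins(mel_80, 100, pad_value=LOG_FLOOR)
```

Padding only fixes the shape. A 100-bin filterbank places every bin at a different centre frequency than an 80-bin one, so the first 80 rows still do not mean what a 100-bin vocoder expects, which may explain the poor quality.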
You can check out my fork with a config for 22050 Hz Vocos: https://github.com/egorsmkv/vocos
@egorsmkv, even after training the model with the vocos.yaml config from your repo, the issue persists: the output is still robotic and low in volume. @hubertsiuzdak @alealv, any help or guidance on this would be really appreciated!
How many steps did you train?
15k steps
@mush42 I took a different approach. I searched for the appropriate parameters of [...]. The difference was in the frequency limits and the mel scaling that [...] uses by default.
I also updated the feature extraction in the reconstruction loss. You can check the changes in this fork: https://github.com/wetdog/vocos/tree/matcha The results sound good after 20 epochs with LibriTTS. We'll publish the checkpoints once the training finishes.
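To illustrate why frequency limits and mel scaling matter, here is a small sketch that computes the centre frequencies of an 80-bin filterbank under two conventions. It assumes the two scales in question are the HTK formula and the Slaney formula, and that the limits are 0 to 11025 Hz versus 0 to 8000 Hz; those choices, and all function names, are assumptions for illustration rather than values taken from either repository.

```python
import math

# HTK mel scale.
def hz_to_mel_htk(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz_htk(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Slaney mel scale: linear below 1 kHz, logarithmic above.
F_SP = 200.0 / 3.0                 # Hz per mel in the linear region
MIN_LOG_HZ = 1000.0                # start of the logarithmic region
MIN_LOG_MEL = MIN_LOG_HZ / F_SP    # the same point in mels (15)
LOG_STEP = math.log(6.4) / 27.0    # step size in the logarithmic region

def hz_to_mel_slaney(hz):
    if hz < MIN_LOG_HZ:
        return hz / F_SP
    return MIN_LOG_MEL + math.log(hz / MIN_LOG_HZ) / LOG_STEP

def mel_to_hz_slaney(mel):
    if mel < MIN_LOG_MEL:
        return mel * F_SP
    return MIN_LOG_HZ * math.exp(LOG_STEP * (mel - MIN_LOG_MEL))

def center_frequencies(n_mels, f_min, f_max, to_mel, to_hz):
    """Centre frequency (Hz) of each triangular filter, evenly spaced in mels."""
    lo, hi = to_mel(f_min), to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [to_hz(lo + step * i) for i in range(1, n_mels + 1)]

# Illustrative settings only: same n_mels, different scale and upper limit.
htk_centers = center_frequencies(80, 0.0, 11025.0, hz_to_mel_htk, mel_to_hz_htk)
slaney_centers = center_frequencies(80, 0.0, 8000.0, hz_to_mel_slaney, mel_to_hz_slaney)

for i in (0, 20, 40, 60, 79):
    print(f"bin {i:2d}: htk={htk_centers[i]:8.1f} Hz  slaney={slaney_centers[i]:8.1f} Hz")
```

With settings like these, the same bin index covers a noticeably different frequency band in each filterbank, so a vocoder trained on one layout receives shifted input from the other even though both have 80 bins.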
Hi
The issue
I trained a model based on Matcha TTS, and I tried to use Vocos with it. Unfortunately, vocoding using a checkpoint trained with the default config of Vocos gives a robotic output with very low volume.
The only config values I changed are sample_rate (=22050) and n_mels (=80).
I assumed that there is a mismatch between the mel spectrogram generated by Matcha TTS and the mel spectrogram Vocos expects, in terms of extraction parameters.
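One quick way to check for such a mismatch is to put both sets of extraction parameters side by side and list what differs. The sketch below does that with plain dictionaries. Only `sample_rate` and `n_mels` come from this issue; every other key and value is a hypothetical placeholder, and `diff_mel_configs` is an illustrative helper, not part of either project.

```python
def diff_mel_configs(a, b):
    """Return {key: (a_value, b_value)} for every setting that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical values, apart from sample_rate and n_mels.
tts_features = {
    "sample_rate": 22050,
    "n_mels": 80,
    "n_fft": 1024,
    "hop_length": 256,
    "f_min": 0,
    "f_max": 8000,
    "mel_scale": "slaney",
    "log_clip": 1e-5,
}
vocoder_features = {
    "sample_rate": 22050,
    "n_mels": 80,
    "n_fft": 1024,
    "hop_length": 256,
    "f_min": 0,
    "f_max": 11025,
    "mel_scale": "htk",
    "log_clip": 1e-7,
}

mismatches = diff_mel_configs(tts_features, vocoder_features)
for key, (tts_value, vocoder_value) in mismatches.items():
    print(f"{key}: TTS={tts_value!r} vocoder={vocoder_value!r}")
```

Matching `sample_rate` and `n_mels` alone would leave every setting in `mismatches` untouched, and any one of them could change the features the vocoder sees.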
A new feature extractor
I wrote a feature extractor class that generates mel spectrograms using the same parameters as Matcha TTS. Most of the code is copied directly from Matcha's source code.
Click to expand: MatchaMelSpectrogramFeatures
And I used it with the following config:
Click to expand config: vocos-matcha.yaml
Results
I trained Vocos using the above feature extractor and config, but this also fails with even worse vocoding quality and even lower volume.
Questions
head expects mel spectrograms generated using certain parameters?
Additional notes
I believe many open-source TTS models use the same code to extract mel spectrograms, so resolving this will help with training Vocos for use with those models.
Best