
[New-Model] HiFi-GAN implementation #661

Closed
erogol opened this issue Feb 19, 2021 · 9 comments

erogol (Contributor) commented Feb 19, 2021

@rishikksh20 has kindly offered to integrate his own work into TTS.

For more details: https://github.com/rishikksh20/HiFi-GAN

thorstenMueller (Contributor) commented
Great, @rishikksh20 👍. Looking forward to it, because I'm interested in training HiFi-GAN for my (Mozilla) Tacotron2 DCA model. If it's helpful, you could use my public German dataset for testing.

rishikksh20 commented Feb 20, 2021

@erogol @thorstenMueller Sure, I'll check on the German dataset.
But I'd like to share some points regarding my implementation of HiFi-GAN. I implemented this repo https://github.com/rishikksh20/HiFi-GAN/tree/d044dbcdf799f0fdfbfc1920e57e95ac6a05f91b just after reading the HiFi-GAN paper, and I never went through the original HiFi-GAN repo while coding my implementation. Now that I compare it with the official HiFi-GAN repo, I've noticed that my implementation is a little different from the official one.

And I guess I did something terribly right, because my implementation trains 30% faster (1.9 steps/sec vs 1.4 steps/sec for the official repo on a V100, batch size 16) and is 3x smaller (approx. 350 MB vs 920 MB). Not only that, it converges really fast: I trained my model for only 12 hours (80k steps) and its quality is better than the official repo's samples after 1 week (1 million steps) of training on a V100, and that too without fine-tuning with GTA and deep feature matching loss. I checked this hypothesis on 3 different datasets, and the results were the same.

You can listen for yourself:
Original: https://soundcloud.com/rishikesh-kumar-1/original
Generated: https://soundcloud.com/rishikesh-kumar-1/generated

I am still training my repo on different datasets. I did modify my code a bit, which made the quality worse; so far the commit tree https://github.com/rishikksh20/HiFi-GAN/tree/d044dbcdf799f0fdfbfc1920e57e95ac6a05f91b gives the best quality, and that is the version I will integrate into the TTS repo.
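For context on the fine-tuning terms above: the deep feature-matching loss compares discriminator feature maps of real and generated audio. A minimal generic PyTorch sketch (illustrative only, not copied from either repo):

```python
import torch

def feature_matching_loss(fmaps_real, fmaps_fake):
    """L1 distance between discriminator feature maps of real and
    generated audio, summed over all layers of all sub-discriminators.
    Generic sketch; not the exact loss code from either repo."""
    loss = 0.0
    for layers_real, layers_fake in zip(fmaps_real, fmaps_fake):
        for fr, ff in zip(layers_real, layers_fake):
            loss = loss + torch.mean(torch.abs(fr - ff))
    return loss
```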

m-toman (Contributor) commented Feb 22, 2021

@rishikksh20 sounds great, did you also try the "V2" version? That is, setting this https://github.com/jik876/hifi-gan/blob/4769534d45265d52a904b850da5a622601885777/config_v1.json#L13 to 128; as far as I can see, that's the only difference.

I've been training the official HiFi-GAN repo for ages on one GPU but never really got close to the official models, and it's definitely worse than my current MelGAN setup. I think on one 11 GB GPU I'd probably have to train it for 2 months :)
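For reference, a sketch of the V1-vs-V2 generator settings being discussed, based on `config_v1.json` / `config_v2.json` in jik876/hifi-gan (values reproduced from those files from memory; double-check against the repo):

```python
# HiFi-GAN generator settings, as in jik876/hifi-gan config_v1.json
# (double-check against the repo).
v1 = {
    "resblock": "1",
    "upsample_rates": [8, 8, 2, 2],
    "upsample_kernel_sizes": [16, 16, 4, 4],
    "upsample_initial_channel": 512,
    "resblock_kernel_sizes": [3, 7, 11],
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
}
# V2 keeps everything else and only shrinks the channel width:
v2 = {**v1, "upsample_initial_channel": 128}
```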

rishikksh20 commented

@m-toman yes, HiFi-GAN is too slow to train. Although I think that after 1.5M steps (12 days on a V100) of training the quality is more or less similar for the V1 version, 12 days on a V100 is still a huge amount of time. I tried the V2 version of the official HiFi-GAN repo: the convergence time is about the same, but the quality is much worse, and it's a similar story for V3, because V1, V2, and V3 all share the same discriminators, and HiFi-GAN's discriminators are too slow to train.
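One quick way to see the point about shared discriminators is to count parameters. A rough sketch, assuming the official jik876/hifi-gan repo is cloned and on the Python path (class and module names as in its `models.py` and `env.py`):

```python
import json

import torch
from env import AttrDict  # helper from jik876/hifi-gan
from models import Generator, MultiPeriodDiscriminator, MultiScaleDiscriminator

# Load a generator config; swap in config_v2.json / config_v3.json to compare.
with open("config_v1.json") as f:
    h = AttrDict(json.load(f))

def n_params(module: torch.nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print("generator:", n_params(Generator(h)))
print("multi-period disc:", n_params(MultiPeriodDiscriminator()))
print("multi-scale disc:", n_params(MultiScaleDiscriminator()))
# Only the generator changes between V1/V2/V3; the two discriminators are
# identical across variants, so they set a floor on training cost.
```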

m-toman (Contributor) commented Feb 22, 2021

Thanks, I meant whether you had tried a V2-style setting with your implementation. V1 seems to be much slower than V2 on CPU.

nukes commented Feb 24, 2021

> [quotes @rishikksh20's comment above in full]

Interesting! Why is your model so much smaller than the official one? Do you have smaller discriminators? The official model is indeed very large.
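A back-of-the-envelope way to read checkpoint sizes (illustrative numbers, not measured from either repo): fp32 weights cost 4 bytes per parameter, and an Adam-style optimizer adds roughly two more tensors of the same shape if its state is saved in the checkpoint.

```python
BYTES_PER_PARAM = 4  # fp32

def checkpoint_mb(n_params: int, with_adam_state: bool = False) -> float:
    """Approximate on-disk checkpoint size in MB: raw weights, plus the
    two Adam moment tensors per parameter if optimizer state is saved.
    Illustrative estimate only."""
    factor = 3 if with_adam_state else 1
    return n_params * BYTES_PER_PARAM * factor / 1e6

# e.g. a ~14M-parameter generator (roughly HiFi-GAN V1's size per the paper):
print(checkpoint_mb(14_000_000))                        # ~56 MB, weights only
print(checkpoint_mb(14_000_000, with_adam_state=True))  # ~168 MB with optimizer
```

So a large size gap between checkpoints can reflect what is bundled (discriminators, optimizer state) as much as the generator architecture itself.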

ghost commented Feb 25, 2021

Which voice corpus can be heard here?
Generated: https://soundcloud.com/rishikesh-kumar-1/generated

rishikksh20 commented

My custom dataset

erogol (Contributor, Author) commented Mar 15, 2021

Continues at coqui-ai/TTS#16.

erogol closed this as completed Mar 15, 2021