
[Poll] Should we include WaveRNN in Mozilla TTS ? #458

Closed
erogol opened this issue Jul 13, 2020 · 40 comments
Labels
help wanted (Extra attention is needed), poll (Poll about things in Mozilla TTS)

Comments

@erogol
Contributor

erogol commented Jul 13, 2020

I see a lot of people still use WaveRNN although we released new, faster vocoders.

I am not willing to invest time in it given the much faster alternatives, but you can let us know if you would like to see WaveRNN as part of the Mozilla TTS repo.

Please give a thumbs up or down to this post to vote in the poll.

You can also comment below with your reason for wanting WaveRNN.

@erogol erogol added the poll (Poll about things in Mozilla TTS) label Jul 13, 2020
@erogol erogol changed the title from "[Pool] Should we inlclude WaveRNN in Mozilla TTS ?" to "[Poll] Should we inlclude WaveRNN in Mozilla TTS ?" Jul 13, 2020
@erogol erogol changed the title from "[Poll] Should we inlclude WaveRNN in Mozilla TTS ?" to "[Poll] Should we include WaveRNN in Mozilla TTS ?" Jul 13, 2020
@erogol erogol pinned this issue Jul 13, 2020
@domcross

So the recommended vocoder as of now is PWGAN, correct?

@erogol
Contributor Author

erogol commented Jul 14, 2020

Right now it is MelGAN or Multiband MelGAN under the vocoder module.

@LucasRotsen

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

@erogol
Contributor Author

erogol commented Jul 14, 2020

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

@cs50victor

I use LJSpeech, but I'm having a hard time adding WaveRNN to my project. Where do you keep the downloaded files, and in which file do you add their paths? I've been struggling for over a week.

@LucasRotsen

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

Custom dataset in Brazilian Portuguese.

@MuruganR96

The output quality is best with WaveRNN, but latency is the problem. Can anyone help me overcome this WaveRNN latency issue? Small tips are enough; I can take it from there.

@LucasRotsen

The output quality is best with WaveRNN, but latency is the problem. Can anyone help me overcome this WaveRNN latency issue? Small tips are enough; I can take it from there.

If generation speed is a requirement for your application, even at the cost of some quality, you should definitely try MelGAN or MultiBand MelGAN. Autoregressive models like WaveRNN are inherently slow, and as far as I know, we can't do much to overcome this. NVIDIA's WaveGlow is a non-autoregressive model that achieved a good MOS compared to WaveRNN, but its training is very resource-intensive. It may be a good idea to take this conversation to Discourse.
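To illustrate the speed gap, here is a minimal sketch (the `step_fn` and `generator` callables are hypothetical placeholders, not the actual TTS classes) of why autoregressive vocoders pay a per-sample cost that feed-forward GAN vocoders avoid:

```python
import torch

def wavernn_style_inference(step_fn, hidden, n_samples):
    """Autoregressive: one network call per output sample."""
    samples, prev = [], torch.zeros(1)
    for _ in range(n_samples):        # e.g. 48,000 steps for 3 s at 16 kHz
        prev, hidden = step_fn(prev, hidden)
        samples.append(prev)
    return torch.cat(samples)

def melgan_style_inference(generator, mel):
    """Feed-forward: the whole waveform from a single parallel pass."""
    return generator(mel)
```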

@erogol
Contributor Author

erogol commented Jul 16, 2020

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

Custom dataset in Brazilian Portuguese.

Can you share a couple of samples somewhere? I just want to check what causes the noise.

@LucasRotsen

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

Custom dataset in Brazilian Portuguese.

Can you share a couple of samples somewhere? I just want to check what causes the noise.

Of course! Thanks for helping. I sent you the samples in a private message on Discourse.

@erogol
Contributor Author

erogol commented Jul 20, 2020

@LucasRotsen with the latest PWGAN model I was able to get rid of any noise. Maybe you should give it a try using the latest dev branch.

@cs50victor

I would really appreciate it if another released model could be added to the wiki.

@erogol
Contributor Author

erogol commented Jul 20, 2020

It is there.

@erogol
Contributor Author

erogol commented Jul 29, 2020

Given that there are people asking for WaveRNN, is there anyone who can help get it into TTS?

@WeberJulian
Contributor

@erogol Another reason why people like WaveRNN, imo, is that you released a universal vocoder that works pretty well even without fine-tuning.
Do you think it would be feasible to do the same with another vocoder?

@LucasRotsen

LucasRotsen commented Jul 30, 2020

Given that there are people asking for WaveRNN, is there anyone who can help get it into TTS?

I'll be glad to help!

@erogol
Contributor Author

erogol commented Jul 31, 2020

@erogol Another reason why people like WaveRNN, imo, is that you released a universal vocoder that works pretty well even without fine-tuning.
Do you think it would be feasible to do the same with another vocoder?

I believe it is possible with the new vocoders, but I need some time before starting on that.

@erogol
Contributor Author

erogol commented Jul 31, 2020

Given that there are people asking for WaveRNN, is there anyone who can help get it into TTS?

I'll be glad to help!

Ohh nice! I think we just need to place the model under the vocoder folder and add it to train_vocoder.py with the necessary changes, or we can create a new train_wavernn.py script. As I write this, I believe writing a new script makes more sense, since train_vocoder.py is for GAN training.

If you could help with that, then I can add it to the inference pipeline so it can be used as seamlessly as the GAN vocoders.
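A rough skeleton of what such a train_wavernn.py could look like; all names here (the config keys, the model's call signature) are placeholders rather than the repo's actual API:

```python
import torch
from torch.utils.data import DataLoader

def train_wavernn(model, dataset, config):
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    # RAW mode treats quantized samples as classes; MOL mode would need a
    # discretized mixture-of-logistics loss here instead.
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(config["epochs"]):
        for mel, wav_labels in loader:
            optimizer.zero_grad()
            # Teacher-forced autoregressive step; the input/target shift
            # is omitted here for brevity.
            logits = model(wav_labels, mel)           # [B, T, n_classes]
            loss = criterion(logits.transpose(1, 2),  # [B, C, T] vs [B, T]
                             wav_labels)
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), f"wavernn_checkpoint_{epoch}.pth")
```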

@LucasRotsen

LucasRotsen commented Jul 31, 2020

Nice, @erogol! I agree that creating a train_wavernn script is a better idea (as the configuration file for GANs is quite different from that of WaveRNN).

@WeberJulian
Contributor

WeberJulian commented Jul 31, 2020

@erogol Another reason why people like WaveRNN, imo, is that you released a universal vocoder that works pretty well even without fine-tuning.
Do you think it would be feasible to do the same with another vocoder?

I believe it is possible with the new vocoders, but I need some time before starting on that.

I have a lot of free AWS credit left from a previous project that I can use until September. With a little guidance from you, I would gladly do the training part.

@erogol
Contributor Author

erogol commented Aug 4, 2020

@WeberJulian maybe then you can cooperate with @LucasRotsen on WaveRNN.

Or we can try our GAN vocoders for multi-speaker training?

@WeberJulian
Contributor

WeberJulian commented Aug 4, 2020

I'm currently familiarising myself with the new vocoders on the dev branch. I think that in most use cases real-time performance makes a huge difference, so I'd rather go with a GAN vocoder.
For the choice of dataset, what do you think is best?
I think you used LibriTTS for the universal WaveRNN, but for multi-language generalisation it might be better to use something like Common Voice.

I spent a little time validating audio samples for Common Voice in French, and the data seems a bit messier (different microphones, noise levels, etc.).
I can't decide if that's good or bad: on the one hand it will be harder for the model to learn, but on the other it might generalise more and work more reliably.
What's your opinion on this?

@WeberJulian
Contributor

WeberJulian commented Aug 4, 2020

I was reading the MelGAN paper and I think it might be even better to implement WaveGlow for the universal vocoder.
WaveGlow inference is relatively fast: 223 kHz on a 1080 Ti, which is roughly 223/16 ≈ 14× real time for a 16 kHz signal.
The huge cost is training, because it has even more weights than WaveRNN, but to me that's a non-issue if we only train once for a universal vocoder.
[Image: MOS comparison table from the MelGAN paper]
I think it's worth it considering the far better MOS score.

What do you think?

@erogol
Contributor Author

erogol commented Aug 4, 2020

@WeberJulian If you are willing to contribute WaveGlow, please feel free to do so. However, I have a couple of concerns.

  • WaveGlow is too big for most people to train, so the improvement might not be worth it.
  • It is way slower on a CPU than MelGAN, which is important for enabling TTS on low-resource devices.
  • We also have a Parallel WaveGAN implementation, which provides better results than MelGAN with a small run-time sacrifice. We could try it instead of MelGAN. It also converged a lot faster, and its paper reports better MOS values. They don't have an exact comparison to WaveGlow, but they do better than WaveNet.

But again, I believe there are also people who would be happy with a WaveGlow implementation.

@erogol
Contributor Author

erogol commented Aug 4, 2020

@WeberJulian wrt the dataset, I'd suggest going with LibriTTS, which I used to train the universal WaveRNN vocoder.

@WeberJulian
Contributor

Ok got it, I'm gonna go with Parallel WaveGAN and LibriTTS.

  • Did you use the full dataset?
  • Do you think 24kHz is worth the performance/quality tradeoff, or should I convert the dataset to 16kHz?

@erogol
Contributor Author

erogol commented Aug 4, 2020

I guess the sampling rate requires another poll. I am not sure what people use for their TTS models. I'd suggest 22050 Hz or 16 kHz.

Or we can try to come up with a way to train a sampling-rate-agnostic vocoder (research project) :)

Yes, I used the whole dataset.

I guess you just need to collect (or symlink) all the wav files in a folder and feed it to the vocoder.
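For example, the symlink collection step might look like this (the paths are placeholders):

```python
from pathlib import Path
import os

src_root = Path("/data/LibriTTS")      # placeholder: extracted LibriTTS root
dst_dir = Path("/data/vocoder_wavs")   # placeholder: flat folder for training
dst_dir.mkdir(parents=True, exist_ok=True)

for wav in src_root.rglob("*.wav"):
    link = dst_dir / wav.name          # LibriTTS file names are unique
    if not link.exists():
        os.symlink(wav.resolve(), link)
```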

@WeberJulian
Contributor

Oh that's interesting. Do you mean that the input is sample-rate agnostic, or that we can tune the output sample rate at inference time?

@erogol
Contributor Author

erogol commented Aug 4, 2020

Yes. Maybe we can provide the sampling rate as an additional feature, like a one-hot embedding vector where each dimension corresponds to a different sampling rate. Then we can replace the upsampling network with a normal interpolation of the spectrograms, or we can train a different upsampling network for each possible sampling rate.
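A minimal sketch of that idea, with a hypothetical conditioning module (not something in the repo): project a one-hot rate vector and add it onto the spectrogram features.

```python
import torch
import torch.nn as nn

SUPPORTED_RATES = [16000, 22050, 24000]   # assumed set of rates

class SampleRateConditioner(nn.Module):
    """Injects a one-hot sampling-rate embedding into the mel features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.proj = nn.Linear(len(SUPPORTED_RATES), feat_dim)

    def forward(self, mel, sample_rate):
        # mel: [B, feat_dim, T]
        one_hot = torch.zeros(mel.size(0), len(SUPPORTED_RATES),
                              device=mel.device)
        one_hot[:, SUPPORTED_RATES.index(sample_rate)] = 1.0
        cond = self.proj(one_hot).unsqueeze(-1)   # [B, feat_dim, 1]
        return mel + cond                         # broadcast over time axis
```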

@lexkoro
Contributor

lexkoro commented Aug 4, 2020

@WeberJulian maybe then you can cooperate with @LucasRotsen on WaveRNN.

Or we can try our GAN vocoders for multi-speaker training?

@erogol Do you think throwing multiple datasets together (different languages), let's say of mostly equal quality, would improve the quality of the multi-speaker vocoder or more likely hurt it?

@WeberJulian
Contributor

Yes. Maybe we can provide the sampling rate as an additional feature, like a one-hot embedding vector where each dimension corresponds to a different sampling rate. Then we can replace the upsampling network with a normal interpolation of the spectrograms, or we can train a different upsampling network for each possible sampling rate.

I'm a bit of a deep-learning noob, but why use a one-hot encoding for a continuous value like sample rate? Is it to force exact values like 16000 Hz but not 16001?

I think the easier way for now is to train with the highest sample rate (24kHz) and then use transfer learning to train the lower-sample-rate models at low compute cost, but I'm not sure how well the knowledge would transfer.
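The warm-start itself could be as simple as the sketch below (the checkpoint path and the stand-in model are placeholders):

```python
import torch
import torch.nn as nn

model: nn.Module = nn.Conv1d(80, 1, 1)   # stand-in for the real vocoder class

state = torch.load("vocoder_24khz.pth", map_location="cpu")  # placeholder
# Keep only parameters whose names and shapes still match; the upsampling
# stack, for example, may change shape for the new hop length.
own = model.state_dict()
state = {k: v for k, v in state.items()
         if k in own and v.shape == own[k].shape}
model.load_state_dict(state, strict=False)

# Then fine-tune on the lower-rate dataset with a small learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```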

@erogol
Contributor Author

erogol commented Aug 4, 2020

That is, I think, better for a start. I think transfer learning should also work well.

@WeberJulian
Contributor

WeberJulian commented Aug 4, 2020

How about other parameters, like:

  • fft_size
  • win_length
  • hop_length
  • max_norm
  • upsample_factors

Are they sample-rate agnostic?
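For what it's worth, win_length and hop_length are defined in samples, so keeping the same durations in milliseconds means scaling them with the rate. A sketch with illustrative defaults (not the repo's actual values):

```python
def stft_params(sample_rate):
    win_length = int(0.050 * sample_rate)   # 50 ms window, in samples
    hop_length = int(0.0125 * sample_rate)  # 12.5 ms hop, in samples
    fft_size = 1 << (win_length - 1).bit_length()  # next power of two
    return fft_size, win_length, hop_length

print(stft_params(16000))  # (1024, 800, 200)
print(stft_params(24000))  # (2048, 1200, 300)
```

The product of upsample_factors typically has to equal hop_length, so it would change with the rate too, while max_norm is a feature-normalization constant that doesn't depend on it.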

@erogol
Contributor Author

erogol commented Aug 4, 2020

I'd first try the default parameters and see how it works. Then we could update them if necessary.

@WeberJulian
Contributor

Alright. And if you have time to answer, what's the intuition behind the one-hot-encoded sample rate?

@erogol
Contributor Author

erogol commented Aug 4, 2020

So that the model can adapt to different sampling rates given this additional feature vector. But I'm just guessing for now.

@erogol erogol unpinned this issue Sep 11, 2020
@stale

stale bot commented Oct 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix (This will not be worked on) label Oct 3, 2020
@erogol
Contributor Author

erogol commented Oct 4, 2020

Anyone willing to help make that happen? I can provide some GPU runtime.

@stale stale bot removed the wontfix (This will not be worked on) label Oct 4, 2020
@erogol erogol added the wontfix (This will not be worked on) and help wanted (Extra attention is needed) labels and removed the wontfix (This will not be worked on) label Oct 4, 2020
@lexkoro
Contributor

lexkoro commented Oct 17, 2020

Hey, I've mostly migrated the old WaveRNN repo into a fork of the new master TTS branch. Here is the fork.

I have only done some coarse testing so far, so I'm not sure how functional it is.
What's missing is better native preprocessing of the audio files, depending on the chosen mode (MOL, RAW).

I'll make a pull request soon.

@stale

stale bot commented Dec 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix (This will not be worked on) label Dec 17, 2020
@erogol erogol removed the wontfix (This will not be worked on) label Dec 17, 2020
@erogol erogol closed this as completed Dec 17, 2020