
[Poll] Should we include WaveRNN in Mozilla TTS ? #458

Closed
erogol opened this issue Jul 13, 2020 · 40 comments
Labels
help wanted (Extra attention is needed), poll (Poll about things in Mozilla TTS)

Comments

@erogol
Contributor

erogol commented Jul 13, 2020

I see a lot of people still use WaveRNN although we released new, faster vocoders.

I am not willing to invest time in it given the much faster alternatives, but you can let us know if you would like to see WaveRNN as part of the Mozilla TTS repo.

Please give a thumbs up or down to this post to vote in the poll.

You can also comment below with your reason for wanting WaveRNN.

@erogol erogol added the poll (Poll about things in Mozilla TTS) label Jul 13, 2020
@erogol erogol changed the title from "[Pool] Should we inlclude WaveRNN in Mozilla TTS ?" to "[Poll] Should we inlclude WaveRNN in Mozilla TTS ?" Jul 13, 2020
@erogol erogol changed the title from "[Poll] Should we inlclude WaveRNN in Mozilla TTS ?" to "[Poll] Should we include WaveRNN in Mozilla TTS ?" Jul 13, 2020
@erogol erogol pinned this issue Jul 13, 2020
@domcross

So the recommended vocoder as of now is PWGAN, correct?

@erogol
Contributor Author

erogol commented Jul 14, 2020

Right now it is MelGAN or Multiband MelGAN under the vocoder module.

@LucasRotsen

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

@erogol
Contributor Author

erogol commented Jul 14, 2020

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

@cs50victor

I use LJSpeech, but I'm having a hard time adding WaveRNN to my project. Where do you keep the downloaded files, and in which file do you add their paths? I've been struggling for over a week.

@LucasRotsen

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

Custom dataset in Brazilian Portuguese.

@MuruganR96

The output quality is best with WaveRNN, but latency is the problem. Can anyone help me overcome this WaveRNN latency issue? Small tips are enough; I can take it from there.

@LucasRotsen

The output quality is best with WaveRNN, but latency is the problem. Can anyone help me overcome this WaveRNN latency issue? Small tips are enough; I can take it from there.

If generation speed is a requirement for your application, even at the cost of some quality, you should definitely try MelGAN or MultiBand MelGAN. Autoregressive models like WaveRNN are inherently slow, and as far as I know, we can't do much to overcome this. NVIDIA's WaveGlow is a non-autoregressive model that achieved a good MOS compared to WaveRNN, but its training is very resource-intensive. It may be a good idea to take this conversation to Discourse.
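To illustrate the speed gap, here is a minimal sketch (the `step_fn` and `generator` callables are hypothetical placeholders, not the actual TTS classes) of why autoregressive vocoders pay a per-sample cost that feed-forward GAN vocoders avoid:

```python
import torch

def wavernn_style_inference(step_fn, hidden, n_samples):
    """Autoregressive: one network call per output sample."""
    samples, prev = [], torch.zeros(1)
    for _ in range(n_samples):        # e.g. 48,000 steps for 3 s at 16 kHz
        prev, hidden = step_fn(prev, hidden)
        samples.append(prev)
    return torch.cat(samples)

def melgan_style_inference(generator, mel):
    """Feed-forward: the whole waveform from a single parallel pass."""
    return generator(mel)
```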

@erogol
Contributor Author

erogol commented Jul 16, 2020

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

Custom dataset in Brazilian Portuguese.

Can you share a couple of samples somewhere? I just want to check what causes the noise.

@LucasRotsen

As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound.

Do you use a custom dataset or LJSpeech?

Custom dataset in Brazilian Portuguese.

Can you share a couple of samples somewhere? I just want to check what causes the noise.

Of course! Thanks for helping. I sent you the samples in a private message on Discourse.

@erogol
Contributor Author

erogol commented Jul 20, 2020

@LucasRotsen with the latest PWGAN model I was able to get rid of any noise. Maybe you should give it a try using the latest dev branch.

@cs50victor

I would really appreciate it if another released model could be added to the wiki.

@erogol
Contributor Author

erogol commented Jul 20, 2020

It is there.

@erogol
Contributor Author

erogol commented Jul 29, 2020

Given that there are people asking for WaveRNN, is there anyone who can help get it into TTS?

@WeberJulian
Contributor

@erogol Another reason why people like WaveRNN, imo, is that you released a universal vocoder that works pretty well even without fine-tuning.
Do you think it would be feasible to do the same with another vocoder?

@LucasRotsen

LucasRotsen commented Jul 30, 2020

Given that there are people asking for WaveRNN, is there anyone who can help get it into TTS?

I'll be glad to help!

@erogol
Contributor Author

erogol commented Jul 31, 2020

@erogol Another reason why people like WaveRNN, imo, is that you released a universal vocoder that works pretty well even without fine-tuning.
Do you think it would be feasible to do the same with another vocoder?

I believe it is possible with the new vocoders, but I need some time before starting on that.

@erogol
Contributor Author

erogol commented Jul 31, 2020

Given that there are people asking for WaveRNN, is there anyone who can help get it into TTS?

I'll be glad to help!

Ohh nice! I think we just need to place the model under the vocoder folder and add it to train_vocoder.py with the necessary changes, or we can create a new train_wavernn.py script. As I write this, I believe writing a new script makes more sense, since train_vocoder.py is for GAN training.

If you could help with that, then I can add it to the inference pipeline so it can be used as seamlessly as the GAN vocoders.
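A rough skeleton of what such a train_wavernn.py could look like; all names here (the config keys, the model's call signature) are placeholders rather than the repo's actual API:

```python
import torch
from torch.utils.data import DataLoader

def train_wavernn(model, dataset, config):
    loader = DataLoader(dataset, batch_size=config["batch_size"], shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    # RAW mode treats quantized samples as classes; MOL mode would need a
    # discretized mixture-of-logistics loss here instead.
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(config["epochs"]):
        for mel, wav_labels in loader:
            optimizer.zero_grad()
            # Teacher-forced autoregressive step; the input/target shift
            # is omitted here for brevity.
            logits = model(wav_labels, mel)           # [B, T, n_classes]
            loss = criterion(logits.transpose(1, 2),  # [B, C, T] vs [B, T]
                             wav_labels)
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), f"wavernn_checkpoint_{epoch}.pth")
```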

@LucasRotsen

LucasRotsen commented Jul 31, 2020

Nice, @erogol! I agree that creating a train_wavernn script is a better idea (as the configuration file for GANs is quite different from that of WaveRNN).

@WeberJulian
Contributor

WeberJulian commented Jul 31, 2020

@erogol Another reason why people like WaveRNN, imo, is that you released a universal vocoder that works pretty well even without fine-tuning.
Do you think it would be feasible to do the same with another vocoder?

I believe it is possible with the new vocoders, but I need some time before starting on that.

I have a lot of free AWS credit left from a previous project that I can use until September. With a little guidance from you, I would gladly do the training part.

@erogol
Contributor Author

erogol commented Aug 4, 2020

@WeberJulian maybe then you can cooperate with @LucasRotsen on WaveRNN.

Or we can try our GAN vocoders for multi-speaker training?

@WeberJulian
Contributor

WeberJulian commented Aug 4, 2020

I'm currently familiarising myself with the new vocoders on the dev branch. I think that in most use cases real-time performance makes a huge difference, so I'd rather go with a GAN vocoder.
For the choice of dataset, what do you think is best?
I think you used LibriTTS for the universal WaveRNN, but for multi-language generalisation it might be better to use something like Common Voice.

I spent a little time validating audio samples for Common Voice in French, and the data seems a bit messier (different microphones, noise levels, etc.).
I can't decide if that's good or bad: on the one hand it will be harder for the model to learn, but on the other it might generalise more and work more reliably.
What's your opinion on this?

@WeberJulian
Contributor

WeberJulian commented Aug 4, 2020

I was reading the MelGAN paper and I think it might be even better to implement WaveGlow for the universal vocoder.
WaveGlow inference is relatively fast: 223 kHz on a 1080 Ti, which is roughly 223/16 ≈ 14× real time for a 16 kHz signal.
The huge cost is training, because it has even more weights than WaveRNN, but to me that's a non-issue if we only train once for a universal vocoder.
[Image: MOS comparison table from the MelGAN paper]
I think it's worth it considering the far better MOS score.

What do you think?

@erogol
Contributor Author

erogol commented Aug 4, 2020

@WeberJulian If you are willing to contribute WaveGlow, please feel free to do so. However, I have a couple of concerns.

  • WaveGlow is too big for most people to train, so the improvement might not be worth it.
  • It is way slower on a CPU than MelGAN, which is important for enabling TTS on low-resource devices.
  • We also have a Parallel WaveGAN implementation, which provides better results than MelGAN with a small run-time sacrifice. We could try it instead of MelGAN. It also converged a lot faster, and its paper reports better MOS values. They don't have an exact comparison to WaveGlow, but they do better than WaveNet.

But again, I believe there are also people who would be happy with a WaveGlow implementation.

@erogol
Contributor Author

erogol commented Aug 4, 2020

@WeberJulian wrt the dataset, I'd suggest going with LibriTTS, which I used to train the universal WaveRNN vocoder.

@WeberJulian
Contributor

Ok got it, I'm gonna go with Parallel WaveGAN and LibriTTS.

  • Did you use the full dataset?
  • Do you think 24kHz is worth the performance/quality tradeoff, or should I convert the dataset to 16kHz?

@erogol
Contributor Author

erogol commented Aug 4, 2020

I guess the sampling rate requires another poll. I am not sure what people use for their TTS models. I'd suggest 22050 Hz or 16 kHz.

Or we can try to come up with a way to train a sampling-rate-agnostic vocoder (research project) :)

Yes, I used the whole dataset.

I guess you just need to collect (or symlink) all the wav files in a folder and feed it to the vocoder.
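For example, the symlink collection step might look like this (the paths are placeholders):

```python
from pathlib import Path
import os

src_root = Path("/data/LibriTTS")      # placeholder: extracted LibriTTS root
dst_dir = Path("/data/vocoder_wavs")   # placeholder: flat folder for training
dst_dir.mkdir(parents=True, exist_ok=True)

for wav in src_root.rglob("*.wav"):
    link = dst_dir / wav.name          # LibriTTS file names are unique
    if not link.exists():
        os.symlink(wav.resolve(), link)
```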

@WeberJulian
Contributor

Oh that's interesting. Do you mean that the input is sample-rate agnostic, or that we can tune the output sample rate at inference time?

@erogol
Contributor Author

erogol commented Aug 4, 2020

Yes. Maybe we can provide the sampling rate as an additional feature, like a one-hot embedding vector where each dimension corresponds to a different sampling rate. Then we can replace the upsampling network with a normal interpolation of the spectrograms, or we can train a different upsampling network for each possible sampling rate.
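A minimal sketch of that idea, with a hypothetical conditioning module (not something in the repo): project a one-hot rate vector and add it onto the spectrogram features.

```python
import torch
import torch.nn as nn

SUPPORTED_RATES = [16000, 22050, 24000]   # assumed set of rates

class SampleRateConditioner(nn.Module):
    """Injects a one-hot sampling-rate embedding into the mel features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.proj = nn.Linear(len(SUPPORTED_RATES), feat_dim)

    def forward(self, mel, sample_rate):
        # mel: [B, feat_dim, T]
        one_hot = torch.zeros(mel.size(0), len(SUPPORTED_RATES),
                              device=mel.device)
        one_hot[:, SUPPORTED_RATES.index(sample_rate)] = 1.0
        cond = self.proj(one_hot).unsqueeze(-1)   # [B, feat_dim, 1]
        return mel + cond                         # broadcast over time axis
```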

@lexkoro
Contributor

lexkoro commented Aug 4, 2020

@WeberJulian maybe then you can cooperate with @LucasRotsen on WaveRNN.

Or we can try our GAN vocoders for multi-speaker training?

@erogol Do you think throwing multiple datasets together (different languages), let's say of mostly equal quality, would improve the quality of the multi-speaker vocoder or more likely hurt it?

@WeberJulian
Contributor

Yes. Maybe we can provide the sampling rate as an additional feature, like a one-hot embedding vector where each dimension corresponds to a different sampling rate. Then we can replace the upsampling network with a normal interpolation of the spectrograms, or we can train a different upsampling network for each possible sampling rate.

I'm a bit of a deep-learning noob, but why use a one-hot encoding for a continuous value like sample rate? Is it to force exact values like 16000 Hz but not 16001?

I think the easier way for now is to train with the highest sample rate (24kHz) and then use transfer learning to train the lower-sample-rate models at low compute cost, but I'm not sure how well the knowledge would transfer.
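The warm-start itself could be as simple as the sketch below (the checkpoint path and the stand-in model are placeholders):

```python
import torch
import torch.nn as nn

model: nn.Module = nn.Conv1d(80, 1, 1)   # stand-in for the real vocoder class

state = torch.load("vocoder_24khz.pth", map_location="cpu")  # placeholder
# Keep only parameters whose names and shapes still match; the upsampling
# stack, for example, may change shape for the new hop length.
own = model.state_dict()
state = {k: v for k, v in state.items()
         if k in own and v.shape == own[k].shape}
model.load_state_dict(state, strict=False)

# Then fine-tune on the lower-rate dataset with a small learning rate.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
```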

@erogol
Contributor Author

erogol commented Aug 4, 2020

That is, I think, better for a start. I think transfer learning should also work well.

@WeberJulian
Contributor

WeberJulian commented Aug 4, 2020

How about other parameters, like:

  • fft_size
  • win_length
  • hop_length
  • max_norm
  • upsample_factors

Are they sample-rate agnostic?
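For what it's worth, win_length and hop_length are defined in samples, so keeping the same durations in milliseconds means scaling them with the rate. A sketch with illustrative defaults (not the repo's actual values):

```python
def stft_params(sample_rate):
    win_length = int(0.050 * sample_rate)   # 50 ms window, in samples
    hop_length = int(0.0125 * sample_rate)  # 12.5 ms hop, in samples
    fft_size = 1 << (win_length - 1).bit_length()  # next power of two
    return fft_size, win_length, hop_length

print(stft_params(16000))  # (1024, 800, 200)
print(stft_params(24000))  # (2048, 1200, 300)
```

The product of upsample_factors typically has to equal hop_length, so it would change with the rate too, while max_norm is a feature-normalization constant that doesn't depend on it.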

@erogol
Contributor Author

erogol commented Aug 4, 2020

I'd first try the default parameters and see how it works. Then we could update them if necessary.

@WeberJulian
Contributor

Alright. And if you have time to answer, what's the intuition behind the one-hot-encoded sample rate?

@erogol
Contributor Author

erogol commented Aug 4, 2020

So that the model can adapt to different sampling rates given this additional feature vector. But I'm just guessing for now.

@erogol erogol unpinned this issue Sep 11, 2020
@stale

stale bot commented Oct 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix (This will not be worked on) label Oct 3, 2020
@erogol
Contributor Author

erogol commented Oct 4, 2020

Anyone willing to help make that happen? I can provide some GPU runtime.

@stale stale bot removed the wontfix (This will not be worked on) label Oct 4, 2020
@erogol erogol added the wontfix (This will not be worked on) and help wanted (Extra attention is needed) labels and removed the wontfix (This will not be worked on) label Oct 4, 2020
@lexkoro
Contributor

lexkoro commented Oct 17, 2020

Hey, I've mostly migrated the old WaveRNN repo into a fork of the new master TTS branch. Here is the fork.

I have only done some coarse testing so far, so I'm not sure how functional it is.
What's missing is better native preprocessing of the audio files, depending on the chosen mode (MOL, RAW).

I'll make a pull request soon.

@stale

stale bot commented Dec 17, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look at our Discourse page for further help: https://discourse.mozilla.org/c/tts

@stale stale bot added the wontfix (This will not be worked on) label Dec 17, 2020
@erogol erogol removed the wontfix (This will not be worked on) label Dec 17, 2020
@erogol erogol closed this as completed Dec 17, 2020