[Poll] Should we include WaveRNN in Mozilla TTS ? #458
So the recommended vocoder as of now is PWGAN, correct? |
right now it is Melgan or Multiband Melgan under the vocoder module. |
As someone who is using WaveRNN - even after training the PWGAN and MelGAN models - I think people still use it due to the quality of the generated audio. For me, PWGAN had a strange noise and MelGAN produced a metallic sound. |
Do you use a custom dataset or LJSpeech? |
I use LJSpeech, but I'm having a hard time adding WaveRNN to my project. Where do you keep the downloaded files, and in what file do you add the path to them? I've been struggling for over a week. |
Custom dataset in Brazilian Portuguese. |
The output quality is best with WaveRNN, but latency is the problem here. Can anyone help me overcome this WaveRNN latency issue? Small tips are enough; I can work through them. |
If generation speed is a requirement for your application - even if it leads to lower quality - you should definitely try MelGAN or MultiBand MelGAN. Auto-regressive models like WaveRNN are inherently slow, and as far as I know, we can't do much to overcome this. NVIDIA's WaveGlow is a non-autoregressive model that achieved a good MOS compared to WaveRNN, but the training is very resource-intensive. It may be a good idea to take this conversation to Discourse. |
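The latency point above comes down to the structure of the generation loop. A toy sketch (not Mozilla TTS code; `step_fn` is a stand-in for the per-sample network) of why autoregressive vocoders like WaveRNN cannot be parallelised across time:

```python
# Toy illustration of autoregressive generation: each audio sample
# depends on the previous one, so samples must be produced one at a
# time, in order. At 22050 Hz that is 22050 sequential model calls
# per second of audio, which is where the latency comes from.
def autoregressive_generate(step_fn, n_samples, state=0.0):
    samples = []
    for _ in range(n_samples):   # cannot run these iterations in parallel
        state = step_fn(state)   # each call must wait for the previous one
        samples.append(state)
    return samples

# Non-autoregressive models (MelGAN, WaveGlow) instead map the whole
# spectrogram to all output samples in one parallel forward pass.
out = autoregressive_generate(lambda s: s * 0.5 + 1.0, 4)
```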
can you share a couple of samples somewhere? I just want to check what causes the noise. |
Of course! Thanks for helping. I sent you the samples in a private message on Discourse. |
@LucasRotsen with the latest PWGAN model I was able to get rid of any noise. Maybe you should give it a try using the latest dev branch. |
I would really appreciate it if another Released model could be added to the Wiki |
it is there |
Given that there are people asking for WaveRNN, is there anyone who can help to get it into TTS? |
@erogol Another reason why people like WaveRNN imo is that you released a universal vocoder that works pretty well even without fine-tuning. |
I'll be glad to help! |
I believe it is possible with the new vocoders, but I need some time before starting on that. |
Ohh nice! I think we just need to place the model under the vocoder folder and add it to the train_vocoder.py with necessary changes or we can create a new train_wavernn.py script. As I write this I believe writing a new script makes more sense since train_vocoder.py is for GAN training. If you could help with that, then I can add it to the inference pipeline to use it seamlessly at inference as we use GAN vocoders. |
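A rough sketch of what the separate `train_wavernn.py` entry point proposed above could look like. The flag names and structure here are assumptions for illustration, not the actual Mozilla TTS CLI:

```python
import argparse

# Hypothetical skeleton for a standalone train_wavernn.py, kept separate
# from the GAN-oriented train_vocoder.py since WaveRNN training has no
# discriminator or adversarial losses.
def build_parser():
    parser = argparse.ArgumentParser(description="Train a WaveRNN vocoder")
    parser.add_argument("--config_path", required=True,
                        help="path to the vocoder config JSON")
    parser.add_argument("--restore_path", default=None,
                        help="checkpoint to resume training from")
    return parser

def main():
    args = build_parser().parse_args()
    # 1. load the config and a dataset of (mel spectrogram, wav) pairs
    # 2. build the WaveRNN model
    # 3. run a plain autoregressive training loop (teacher forcing on
    #    ground-truth samples), with none of the GAN machinery from
    #    train_vocoder.py

if __name__ == "__main__":
    main()
```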
Nice, @erogol! I agree that creating a new `train_wavernn.py` script makes more sense. |
I have a lot of free AWS credit left from a previous project that I can use until September. With a little guidance from you, I would gladly do the training part. |
@WeberJulian maybe then you can cooperate with @LucasRotsen on wavernn or we can try our GAN vocoders for multi-speaker training? |
I'm currently familiarising myself with the new vocoders on the dev branch. I think that in most use-cases, realtime makes a huge difference so I'd rather go with a GAN vocoder.
I spent a little time validating audio samples for Common Voice in French, and it seems that the data is a bit messier (different microphones, noise levels, etc.). |
I was reading the MelGAN paper and I think it might be even better to implement WaveGlow for the Universal vocoder. |
@WeberJulian If you are willing to contribute WaveGlow, please feel free to do so. However, there are a couple of concerns I have.
But again I believe there are also people who would be happy with a WaveGlow implementation. |
@WeberJulian wrt the dataset, I'd suggest going with LibriTTS, on which I trained the universal WaveRNN vocoder. |
Ok, got it. I'm gonna go with Parallel_wavegan and LibriTTS. |
I guess the sampling rate requires another poll. I am not sure what people use for their TTS models. I'd suggest 22050 Hz or 16 kHz. Or we can try to come up with a way to train a sampling-rate-agnostic vocoder (research project) :) Yes, I used the whole dataset. I guess you just need to collect (or symlink) all the wav files in a folder and give it to the vocoder. |
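The "collect or symlink all the wav files into a folder" step above can be sketched as follows; the function name and paths are placeholders, not part of the TTS repo:

```python
from pathlib import Path

# Gather a dataset's scattered wav files into one flat folder via
# symlinks, so the vocoder trainer can be pointed at a single directory.
def symlink_wavs(src_root, dst_dir):
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    count = 0
    for wav in Path(src_root).rglob("*.wav"):
        link = dst / wav.name          # assumes file names are unique
        if not link.exists():
            link.symlink_to(wav.resolve())
            count += 1
    return count
```

For a multi-speaker corpus like LibriTTS, where file names can collide across speakers, you would want to prefix each link with the speaker ID instead of using the bare file name.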
Oh, that's interesting. You mean where the input is sample-rate agnostic, or where we can tune the output sample rate at inference time? |
yes. Maybe we can provide the sampling rate as an additional feature, like a one-hot embedding vector where each dimension corresponds to a different sampling rate. Then we can replace the upsampling network with a normal interpolation of the spectrograms, or we can train different upsampling networks for each possible sampling rate. |
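The one-hot idea above could be sketched like this; the set of supported rates is an assumption for illustration, and the resulting vector would be concatenated to the vocoder's conditioning features:

```python
# One dimension per supported sampling rate; the model learns to adapt
# its output to the rate flagged in this conditioning vector.
SUPPORTED_RATES = [16000, 22050, 24000]  # assumed set, for illustration

def sample_rate_onehot(rate):
    if rate not in SUPPORTED_RATES:
        raise ValueError(f"unsupported sampling rate: {rate}")
    vec = [0.0] * len(SUPPORTED_RATES)
    vec[SUPPORTED_RATES.index(rate)] = 1.0
    return vec
```

A one-hot code restricts the model to the exact rates seen in training, which also answers why one might not feed the rate as a raw continuous value: there is no meaningful audio at "16001 Hz" to interpolate toward.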
@erogol Do you think throwing multiple datasets together (different languages) let's say of mostly equal quality, would improve the quality of the multi-speaker vocoder or more likely hurt it? |
I'm a bit of a deep learning noob, but why use a one-hot encoding for a continuous value like sample rate? Is it to force exact values like 16 kHz but not 16001 Hz? I think the easier way for now is to train with the highest sample rate (24 kHz) and then use transfer learning to train the lower-sample-rate models at low compute cost, but I'm not sure how well the knowledge would be transferred. |
That is, I think, better for a start. I think transfer learning should also work well. |
How about other parameters like:
Are they sample-rate agnostic? |
I'd try first with the default parameters and see how it works. Then, we could update if necessary. |
Alright, and if you have time to answer, what's the intuition behind the one-hot encoded sample rate? |
so that the model can adapt to different sampling rates given this additional feature vector. But just guessing for now. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discourse page for further help. https://discourse.mozilla.org/c/tts |
Anyone willing to help make that happen? I can provide some GPU runtime. |
Hey, I've mostly migrated the old WaveRNN repo into a fork of the new master TTS branch. Here is the fork. I have only done some coarse testing so far. So not sure how functional it is. I'll make a pull request soon. |
I see a lot of people still use WaveRNN although we released new faster vocoders.
I am not willing to invest time in it given the much faster alternatives, but you can let us know if you would like to see WaveRNN as a part of the Mozilla TTS repo.
Please give a thumbs up or down to this post to serve as a poll.
You can also state your comment or reason to have WaveRNN below.