Universal WaveRNN #221

erogol · 2019-06-24T08:26:50Z

I am training a universal WaveRNN with >900 speakers. I aim to release this model to the community, which will hopefully solve the vocoder dependence of TTS solutions in general.

I am using https://github.com/erogol/WaveRNN

By now, it looks promising and is able to generalize well for new speakers. Updates will be shared here...

alexdemartos · 2019-06-24T17:47:50Z

That's great news, thank you!

I have already tried your MoL implementation and I am able to get around 1 kHz (about 1,000 samples per second) on a 2080 Ti. That means a real-time factor (RTF) of around 20 for a 22050 Hz model.
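The real-time factor quoted here follows from simple arithmetic. A minimal sketch, assuming the quoted generation speed means roughly 1,000 generated samples per second of wall-clock time (the function name is hypothetical):

```python
def real_time_factor(sample_rate_hz: float, samples_per_second: float) -> float:
    """Seconds of compute needed per second of generated audio."""
    return sample_rate_hz / samples_per_second

# 22050 Hz model, about 1,000 samples generated per second (assumed reading)
rtf = real_time_factor(22050, 1000)
print(f"RTF = {rtf:.2f}")
```

An RTF above 1 means generation is slower than real time, so any speed-up has to close roughly that factor.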

Do you have any thoughts on how to speed up the generation time?

erogol · 2019-06-25T21:40:56Z

A few things need to be done:

1. Find the minimum network size that gives on-par performance.
2. Prune the network to reduce the number of parameters. This can be done during or after training.
3. Port the model to a C++ backend using Eigen or any comparable library. I am not sure whether the PyTorch C++ frontend is ready for this. Maybe I am wrong.
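As an illustration of the pruning step only, here is a minimal sketch of post-training magnitude pruning on a single weight matrix. It is a generic technique written with NumPy, not code from the WaveRNN repo; the function name, the 512x512 shape, and the 90% sparsity target are assumptions made for the example.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest-magnitude entries zeroed."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of entries to zero
    if k == 0:
        return weights.copy()
    # The k-th smallest absolute value is the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512))  # hypothetical layer weights
w_pruned = magnitude_prune(w, sparsity=0.9)
kept = np.count_nonzero(w_pruned) / w.size
print(f"fraction of weights kept: {kept:.3f}")
```

Zeroed weights only speed up inference if the runtime exploits the sparsity, which is one reason pruning is usually considered together with the choice of backend.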

If anyone has any take on any of these, I am open to any contribution and further discussion.

erogol · 2019-06-27T09:26:03Z

Almost ready!! Here are some samples:

Make sure you listen to the speaking piano :).

https://soundcloud.com/user-565970875/sets/universal-vocoder-with-wavernn

erogol · 2019-07-02T11:27:38Z

So the final model uses a mixture of logistics (MoL) output distribution and a 16 kHz sampling rate.

It works for any speaker I've tried so far, even in different languages, although it is trained only on English.

https://drive.google.com/drive/u/1/folders/15JhAbc91dT-RRZwakh_v4tBVOEuoOikg

bshall · 2019-07-14T08:48:57Z

Hi @erogol, thanks for the awesome work! Just wanted to let you know that I've also had some good results with Amazon's model Robust Universal Neural Vocoding. My implementation is here if it is of any interest to you.

btomtom5 · 2019-08-31T04:59:25Z

Hi @erogol, I am currently playing with your universal WaveRNN and I couldn't find the right TTS model to pair it with. AFAIK, the universal WaveRNN you released requires a hop length of 200 because of the upscale factors of [5, 5, 8] but the pretrained tacotron 2 models all have a hop length of 275 making them incompatible. Is this correct?
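The hop-length constraint described here can be checked in a few lines. A minimal sketch, assuming (as the comment suggests) that the upsampling network expands each mel frame by the product of its factors, so that product has to equal the hop length used for mel extraction; the function name is hypothetical:

```python
import math

def required_hop_length(upsample_factors):
    """Samples produced per mel frame when each stage upsamples by its factor."""
    return math.prod(upsample_factors)

vocoder_hop = required_hop_length([5, 5, 8])
tts_hop = 275  # hop length reported here for the pretrained Tacotron 2 models
print(vocoder_hop, tts_hop, vocoder_hop == tts_hop)
```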

erogol · 2019-09-02T09:20:01Z

@btomtom5 I used a 16 kHz sampling rate. All the other values are computed based on this value. What I do for LJSpeech is to manually resample the generated mel-spec before feeding it to WaveRNN. Or you can fine-tune for a short while.

belevtsoff · 2019-09-19T11:50:49Z

@erogol Great work, thanks! Did you use TTS-generated mel-spectrograms to train this model, or ground-truth ones? We are planning to train a similar WaveRNN using a multi-speaker TTS (or potentially our speech-to-speech system) and raising the sample rate to 22.5 kHz. I would really appreciate it if you could share some training details (i.e. number of GPUs, total training time, number of iterations). Thanks!

btomtom5 · 2019-09-24T20:46:24Z

@erogol Do you have any tips on how to resample the mel spectrogram? I am having trouble figuring out how to 'manually' resample the mel-spectrogram because isn't the mel spectrogram supposed to be sample rate independent? Thanks in advance!

@belevtsoff Not sure about the universal vocoder, but the vocoder for Tacotron 2 uses the TTS mel specs, and the vocoder for Tacotron 1 uses both the ground-truth and the TTS mel specs in conjunction.

PetrochukM · 2019-09-27T03:42:00Z

@belevtsoff How's respeecher doing?

erogol · 2019-09-27T08:42:33Z

@btomtom5 @belevtsoff actually, the vocoder for Tacotron has not seen mel specs from Tacotron. It was trained only on Tacotron2 mel specs, and I reused it.

@belevtsoff as far as I remember, it was 4 GPUs for >1M iterations. Something like 1 week.

belevtsoff · 2019-09-30T20:48:39Z

@btomtom5 @erogol I see, thanks. Have you thought about training the universal vocoder on generated spectrograms from a multispeaker Tacotron[2]?

@PetrochukM Doing well, thanks. Trying to make our vocoders more robust.

erogol · 2019-10-01T08:06:30Z

@belevtsoff the Tacotron2 vocoder is trained with generated mel specs, but the universal one used only ground-truth specs for training.

rmldj · 2019-12-26T15:12:55Z

I wanted to try using the Universal WaveRNN with the Benchmark.ipynb notebook.

I got an error after

    from WaveRNN.models.wavernn import Model
    bits = 10

    wavernn = Model(
            rnn_dims=512,
            fc_dims=512,
            mode="mold",
            pad=2,
            upsample_factors=VOCODER_CONFIG.upsample_factors,  # set this depending on dataset
            feat_dims=VOCODER_CONFIG.audio["num_mels"],
            compute_dims=128,
            res_out_dims=128,
            res_blocks=10,
            hop_length=ap.hop_length,
            sample_rate=ap.sample_rate,
        ).cuda()

which is

TypeError: __init__() missing 3 required positional arguments: 'mulaw', 'use_aux_net', and 'use_upsample_net'

What is the recommendation for these parameters? Or should one make bigger changes in the Benchmark.ipynb notebook? (For TTS I am using the checkpoint_260000.pth.tar from the Tacotron2-iter-260K-824c091 branch.)

Edit: I copied the relevant part from the master branch of TTS and I ran the notebook. By the way, is it possible to make the generated speech slightly slower and/or lower the overall frequency?

rmldj · 2019-12-27T18:06:01Z

@erogol In order to take into account the different hop length (275 versus 200) in TTS/WaveRNN I made an interpolation of the TTS generated mel spectrogram before feeding into WaveRNN:

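    import numpy as np
    import scipy.interpolate

    # Stretch the time axis (axis 0) by 2.75/2.0, the 275 vs. 200 hop-length ratio.
    nt = len(mel_postnet_spec)
    ntnew = int(nt*2.75/2.0)
    x = np.arange(nt, dtype=np.float32)
    f = scipy.interpolate.interp1d(x, mel_postnet_spec, axis=0, kind='cubic')
    mel_postnet_spec = f(np.linspace(0, nt-1, ntnew))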

However, the voice comes out with a much higher pitch than the one generated by GL (sounding a bit childish). The sample rate in TTS is 22050 while in WaveRNN it is 16000. Should one also modify the mel columns in some way? I would be grateful for a hint. I am a complete outsider to the text-to-speech arena, so most probably I do not know even basic stuff.
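One quantity that may be worth checking here is the duration of a mel frame step in seconds, since the two models differ in sample rate as well as in hop length. A minimal sketch using only the numbers quoted in this thread (hop 275 at 22050 Hz for TTS, hop 200 at 16000 Hz for WaveRNN); the function and variable names are hypothetical, and the comparison does not by itself explain the pitch shift:

```python
def frame_duration_ms(hop_length: int, sample_rate: int) -> float:
    """Time between consecutive mel frames, in milliseconds."""
    return 1000.0 * hop_length / sample_rate

tts_ms = frame_duration_ms(275, 22050)
wavernn_ms = frame_duration_ms(200, 16000)
hop_ratio = 275 / 200              # the 2.75/2.0 stretch factor in samples
time_ratio = tts_ms / wavernn_ms   # the same comparison in real time
print(f"TTS: {tts_ms:.2f} ms, WaveRNN: {wavernn_ms:.2f} ms")
print(f"hop ratio: {hop_ratio:.3f}, time ratio: {time_ratio:.3f}")
```

With these numbers the frame steps differ by well under 1% in time, while the hop lengths differ by a factor of 1.375 in samples.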

Shikherneo2 · 2020-01-25T01:07:44Z

@erogol Hi! Great work! I downloaded the tar file from Google Drive, but I can't seem to untar it. It keeps saying it does not look like a tar file. Do you have any idea what might be going wrong?
Thanks

shad94 · 2020-01-25T08:32:02Z

@Shikherneo2, you mean the file with .pth?
It's a PyTorch file type.
By the way, use discourse.mozilla.org for such issues, it's more convenient.

Shikherneo2 · 2020-01-27T02:39:03Z

@shad94 Ahh.. My bad. I have been trying to untar it! I will check out Discourse.
Thanks a ton!

mindmapper15 · 2020-08-31T05:35:10Z

@erogol Hello. I'm trying to create a universal vocoder of my own with the LibriTTS dataset, based on your settings.
Did you train this universal vocoder model with train-clean only, or did you also include train-other-500 from LibriTTS?

erogol · 2020-09-07T09:13:40Z

I only used train-clean for this model AFAIR.

lawrence-laz · 2021-10-10T09:32:35Z

Hi! I tried using this directly from CLI with this command:

$ tts --text "hello, how are you doing?" --vocoder_path ~/tts/best_model_16K.pth.tar --vocoder_config_path ~/tts/config_16K.json && play tts_output.wav

but got the following error:

 > tts_models/en/ljspeech/tacotron2-DDC is already downloaded.
 > Using model: Tacotron2
 > Model's reduction rate `r` is set to: 1
Traceback (most recent call last):
  File "/home/llaz/.local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/llaz/.local/lib/python3.9/site-packages/TTS/bin/synthesize.py", line 226, in main
    synthesizer = Synthesizer(
  File "/home/llaz/.local/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 75, in __init__
    self._load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
  File "/home/llaz/.local/lib/python3.9/site-packages/TTS/utils/synthesizer.py", line 161, in _load_vocoder
    self.vocoder_config = load_config(model_config)
  File "/home/llaz/.local/lib/python3.9/site-packages/TTS/config/__init__.py", line 92, in load_config
    model_name = _process_model_name(config_dict)
  File "/home/llaz/.local/lib/python3.9/site-packages/TTS/config/__init__.py", line 59, in _process_model_name
    model_name = config_dict["model"] if "model" in config_dict else config_dict["generator_model"]
KeyError: 'generator_model'

Am I doing something wrong?
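The last frame of the traceback shows that the loader looks for a "model" key and falls back to "generator_model", so the KeyError suggests the vocoder config contains neither. A minimal sketch of that lookup on an already-parsed config dictionary; the helper name and the sample dictionaries are hypothetical, and it does not call into the TTS package:

```python
def find_model_name(config_dict: dict):
    """Mirror the lookup in the traceback; return None instead of raising KeyError."""
    for key in ("model", "generator_model"):
        if key in config_dict:
            return config_dict[key]
    return None

old_style = {"audio": {"sample_rate": 16000}}  # hypothetical config lacking both keys
new_style = {"model": "wavernn"}               # hypothetical config with a "model" key
print(find_model_name(old_style), find_model_name(new_style))
```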

erogol · 2021-10-26T12:23:38Z

@lawrence-laz This repo is not maintained anymore; try 🐸TTS.

erogol added the labels "improvement" (a new feature) and "experiment" (experimental things) on Jun 24, 2019

erogol closed this as completed Jul 2, 2019

G-Wang mentioned this issue Jul 8, 2019

Vocoder Training CorentinJ/Real-Time-Voice-Cloning#40

Closed

macarbonneau mentioned this issue Jan 6, 2020

To aux or not to aux? #332

Closed

WeberJulian mentioned this issue Jul 30, 2020

[Poll] Should we include WaveRNN in Mozilla TTS ? #458

Closed

ghost mentioned this issue Aug 11, 2020

Training a new encoder model CorentinJ/Real-Time-Voice-Cloning#458

Closed

ghost mentioned this issue Sep 12, 2020

To Much Noise on Mandarin CorentinJ/Real-Time-Voice-Cloning#498

Closed
