Initially, I plan to adapt the original implementation.
I think the encoder can be simplified with a convolutional encoder. I'll try a couple of different architectures.
I finished the first version of the model. Here are some details.
The model produces good-quality results, but my observation is that it is still less natural than our TacotronDDC model. However, it is easier to train because it replaces the attention module with a greedy search mechanism that learns the text-to-spec alignment. The learned alignment is then used to train a duration predictor, which supplies the alignment at inference time.
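To make the duration predictor's role concrete, here is a minimal sketch of how predicted per-token durations turn encoder outputs into a frame-level sequence at inference. The helper name is illustrative and not from the actual Glow-TTS code:

```python
# Hedged sketch: expand each text-level encoder vector by its predicted
# duration (frame count). This stands in for the alignment that attention
# would otherwise produce. `expand_by_durations` is a hypothetical name.

def expand_by_durations(encoder_outputs, durations):
    """Repeat each token-level vector durations[i] times, yielding a
    frame-level sequence whose length equals sum(durations)."""
    frames = []
    for vec, d in zip(encoder_outputs, durations):
        frames.extend([vec] * d)
    return frames

# Example: 3 token embeddings with durations 2, 1, 3 -> 6 output frames.
tokens = [[0.1], [0.2], [0.3]]
frames = expand_by_durations(tokens, [2, 1, 3])
```

Because the durations are predicted in one shot, there is no step-by-step attention loop at inference, which is a big part of why training and synthesis are more stable than with learned attention.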
Glow-TTS lets you set the speed and variation of the speech with certain parameters. It also does not rely on auto-regression, so it computes the output in a single pass, which yields a faster execution time than Tacotron models with a reduction factor smaller than 3. (So our released TacotronDDC model has a very similar real-time factor on both a GPU and a CPU.)
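As a hedged illustration of the speed control, the idea is to scale the predicted per-token durations by a length factor before expansion; the function name and the at-least-one-frame rounding rule below are assumptions for the sketch, not the repo's exact code:

```python
# Hedged sketch: speech-rate control by scaling predicted durations.
# length_scale > 1.0 slows speech down, < 1.0 speeds it up.
# `scale_durations` is a hypothetical helper name.

def scale_durations(durations, length_scale=1.0):
    """Scale per-token frame counts, keeping at least one frame per token
    so no token is dropped entirely."""
    return [max(1, round(d * length_scale)) for d in durations]

base = [2, 4, 3]
slow = scale_durations(base, 2.0)  # roughly doubles utterance length
fast = scale_durations(base, 0.5)  # roughly halves it
```

Variation is controlled separately, by the temperature of the noise sampled for the flow's latent prior; only the duration scaling is shown here.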
In my implementation there are a couple of differences. I tried convolutional encoder models to enable faster execution, and so far the Gated Convolution encoder gives results comparable to the original model. Using Gated Convolution speeds up the model ~1.3x on a CPU.
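For readers unfamiliar with gated convolutions, the core idea can be sketched as a GLU-style layer: one convolution produces features, a second produces gates, and the output is their elementwise product through a sigmoid. This is a single-channel pure-Python illustration under those assumptions, not the repo's actual module:

```python
# Hedged sketch of a gated convolution (GLU-style) layer, single channel.
import math

def conv1d(x, w, b=0.0):
    """'Valid' 1-D cross-correlation of sequence x with kernel w."""
    k = len(w)
    return [sum(w[j] * x[i + j] for j in range(k)) + b
            for i in range(len(x) - k + 1)]

def gated_conv(x, w_feat, w_gate):
    """Gated conv output: features * sigmoid(gates), each branch with
    its own kernel over the same input."""
    feats = conv1d(x, w_feat)
    gates = conv1d(x, w_gate)
    return [f / (1.0 + math.exp(-g)) for f, g in zip(feats, gates)]
```

The gate lets the network learn which positions to pass through, which works well for sequence encoders while staying cheap and fully parallel, hence the CPU speedup over heavier attention-based encoders.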
Tensorboard outputs: (Please ignore the empty loss plots, which are from different runs on the same machine.)
Paper: https://arxiv.org/pdf/2005.11129.pdf
Implementation: https://github.com/jaywalnut310/glow-tts