Model Description

The Tacotron 2 and WaveGlow models form a text-to-speech system that enables users to synthesize natural-sounding speech from raw transcripts without any additional prosody information. The Tacotron 2 model (also available via torch.hub) produces mel spectrograms from input text using an encoder-decoder architecture. WaveGlow is a flow-based model that consumes the mel spectrograms to generate speech.
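As a rough illustration of how the two stages relate, the length of the generated audio follows from the number of mel frames that Tacotron 2 produces. The sketch below assumes a hop length of 256 samples per mel frame, which is an assumption and not stated on this page, and the 22050 Hz sampling rate used as `rate` in the example. The function name is hypothetical.

```python
SAMPLE_RATE = 22050  # Hz, matches `rate` in the example
HOP_LENGTH = 256     # assumed number of audio samples per mel frame


def audio_duration_seconds(n_mel_frames, hop_length=HOP_LENGTH, sample_rate=SAMPLE_RATE):
    """Estimate the audio duration produced from a mel spectrogram with n_mel_frames frames."""
    return n_mel_frames * hop_length / sample_rate


# Under these assumptions, 200 mel frames correspond to a little over two seconds of audio.
print(round(audio_duration_seconds(200), 2))
```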

Example

In the example below:

pretrained Tacotron2 and WaveGlow models are loaded from torch.hub
Tacotron2 generates a mel spectrogram given a tensor representation of an input text ("hello world, I missed you")
WaveGlow generates sound given the mel spectrogram
the output sound is saved in an 'audio.wav' file

To run the example you need some extra Python packages installed. These are needed for preprocessing the text and audio, as well as for display and input/output.

pip install numpy scipy librosa unidecode inflect
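Before running the example, it can help to confirm that the extra packages are importable. A minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check only confirms that a package can be found, not that its version is compatible.

```python
import importlib.util

# Import names of the extra packages from the pip command above.
REQUIRED = ["numpy", "scipy", "librosa", "unidecode", "inflect"]


def missing_packages(names):
    """Return the names that cannot be found in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All extra packages are importable.")
```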

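import numpy as np
import torch
from scipy.io.wavfile import write

Load WaveGlow from PyTorch Hub and prepare it for inference

waveglow = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_waveglow')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()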

Load tacotron2 from PyTorch Hub

tacotron2 = torch.hub.load('nvidia/DeepLearningExamples:torchhub', 'nvidia_tacotron2')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

Now, let's make the model say "hello world, I missed you"

text = "hello world, I missed you"

Now chain pre-processing -> tacotron2 -> waveglow

# preprocessing
sequence = np.array(tacotron2.text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).to(device='cuda', dtype=torch.int64)

# run the models
with torch.no_grad():
    _, mel, _, _ = tacotron2.infer(sequence)
    audio = waveglow.infer(mel)
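# take the first (and only) waveform in the batch and move it to the CPU as a NumPy array
audio_numpy = audio[0].data.cpu().numpy()
rate = 22050  # sampling rate in Hz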
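The preprocessing step above turns the text into a batch of integer symbol IDs before it reaches Tacotron2. The toy sketch below illustrates that idea only. The symbol table and the function `toy_text_to_sequence` are hypothetical, and the pretrained model uses its own symbol set and the 'english_cleaners' pipeline, so the IDs here will not match the real ones.

```python
# Hypothetical symbol table, for illustration only.
SYMBOLS = "_ abcdefghijklmnopqrstuvwxyz,.!?'"
SYMBOL_TO_ID = {ch: i for i, ch in enumerate(SYMBOLS)}


def toy_text_to_sequence(text):
    """Lowercase the text and map each known character to an integer ID."""
    return [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]


sequence = toy_text_to_sequence("hello world, I missed you")
# Wrap the sequence in an outer list to form a batch of one,
# analogous to the [None, :] indexing in the example.
batch = [sequence]
print(len(batch), len(batch[0]))
```

Characters that are missing from the table are dropped here for simplicity; a real cleaner would typically normalize them first, for example by spelling out numbers.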

You can write it to a file and listen to it

write("audio.wav", rate, audio_numpy)
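`scipy.io.wavfile.write` stores the sample type of the array it receives, so a floating-point array produces a floating-point WAV file, which some players do not handle well. The sketch below shows one way to write 16-bit PCM instead, using only the standard library. It assumes the samples are floats roughly in the range [-1, 1]; the function name `write_pcm16` and the demo file name are hypothetical.

```python
import os
import struct
import tempfile
import wave


def write_pcm16(path, samples, sample_rate=22050):
    """Write an iterable of float samples as a mono 16-bit PCM WAV file."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, float(s)))            # clip to [-1, 1]
        frames += struct.pack("<h", int(s * 32767))  # little-endian int16
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# Small round-trip demo on a few hand-picked samples.
demo_path = os.path.join(tempfile.gettempdir(), "pcm16_demo.wav")
write_pcm16(demo_path, [0.0, 0.5, -1.0, 2.0])
with wave.open(demo_path, "rb") as wav:
    demo_params = (wav.getnchannels(), wav.getsampwidth(),
                   wav.getframerate(), wav.getnframes())
    demo_values = struct.unpack("<4h", wav.readframes(4))
os.remove(demo_path)
```

With the variables from the example, a call such as `write_pcm16("audio_pcm16.wav", audio_numpy.tolist(), rate)` would write the generated speech in this format.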

Alternatively, play it right away in a notebook with IPython widgets

from IPython.display import Audio
Audio(audio_numpy, rate=rate)

Details

For detailed information on model input and output, training recipes, inference, and performance, visit GitHub and/or NGC.

References

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
WaveGlow: A Flow-based Generative Network for Speech Synthesis
Tacotron2 and WaveGlow on NGC
Tacotron2 and WaveGlow on GitHub
