Update tacotron2_pipeline_tutorial.py #3759

Merged: 3 commits, Mar 18, 2024

Changes from 2 commits
46 changes: 21 additions & 25 deletions examples/tutorials/tacotron2_pipeline_tutorial.py
@@ -23,13 +23,13 @@
#
# 2. Spectrogram generation
#
-# From the encoded text, a spectrogram is generated. We use ``Tacotron2``
+# From the encoded text, a spectrogram is generated. We use the ``Tacotron2``
# model for this.
#
# 3. Time-domain conversion
#
# The last step is converting the spectrogram into the waveform. The
-# process to generate speech from spectrogram is also called Vocoder.
+# process to generate speech from spectrogram is also called a Vocoder.
# In this tutorial, three different vocoders are used,
# :py:class:`~torchaudio.models.WaveRNN`,
# :py:class:`~torchaudio.transforms.GriffinLim`, and
@@ -90,17 +90,13 @@
# works.
#
# Since the pre-trained Tacotron2 model expects a specific set of symbol
-# tables, the same functionalities available in ``torchaudio``. This
-# section is more for the explanation of the basis of encoding.
+# tables, the same functionality is available in ``torchaudio``. However,
+# we will first manually implement the encoding to aid in understanding.
#
-# Firstly, we define the set of symbols. For example, we can use
+# First, we define the set of symbols
# ``'_-!\'(),.:;? abcdefghijklmnopqrstuvwxyz'``. Then, we will map
# each character of the input text into the index of the corresponding
-# symbol in the table.
-#
-# The following is an example of such processing. In the example, symbols
-# that are not in the table are ignored.
-#
+# symbol in the table. Symbols that are not in the table are ignored.

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
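
The body of ``text_to_sequence`` is collapsed in this diff. For reference, a minimal sketch of such a character-based encoder, built only from the ``symbols`` and ``look_up`` definitions above (the actual body in the tutorial may differ):

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}

def text_to_sequence(text):
    # Lower-case the input and drop characters that are not in the table.
    text = text.lower()
    return [look_up[s] for s in text if s in look_up]

print(text_to_sequence("Hello world! Text to speech!"))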
@@ -118,8 +114,8 @@ def text_to_sequence(text):

######################################################################
# As mentioned above, the symbol table and indices must match
-# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the
-# transform along with the pretrained model. For example, you can
+# what the pretrained Tacotron2 model expects. ``torchaudio`` provides the same
+# transform along with the pretrained model. You can
# instantiate and use such a transform as follows.
#
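
The instantiation code is collapsed in this view. A hedged sketch of using the bundled text processor (the bundle name is an assumption; any of the character-based Tacotron2 bundles in ``torchaudio.pipelines`` exposes the same interface):

import torchaudio

# Assumed bundle; the tutorial uses one of the Tacotron2 pipelines.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)
print(processed)
print(lengths)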

@@ -133,12 +129,12 @@ def text_to_sequence(text):


######################################################################
-# The ``processor`` object takes either a text or list of texts as inputs.
+# Note: the output of our manual encoding matches the output of the
+# ``torchaudio`` ``text_processor`` (meaning we correctly re-implemented what
+# the library does internally). The ``processor`` object takes either a text
+# or a list of texts as inputs.
# When a list of texts is provided, the returned ``lengths`` variable
# represents the valid length of each processed token sequence in the output
#
-# The intermediate representation can be retrieved as follow.
+# The intermediate representation can be retrieved as follows:
#

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
@@ -152,7 +148,7 @@ def text_to_sequence(text):
# uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme)
# model.
#
-# The detail of the G2P model is out of scope of this tutorial, we will
+# The detail of the G2P model is out of the scope of this tutorial, we will
# just look at what the conversion looks like.
#
# Similar to the case of character-based encoding, the encoding process is
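
The rest of this sentence and the accompanying code are collapsed in the diff. A hedged sketch of phoneme-based processing, assuming the phoneme bundle (it requires the ``DeepPhonemizer`` package and downloads a G2P model on first use):

import torch
import torchaudio

# Assumed bundle: the phoneme-based Tacotron2 pipeline.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)

# Inspect the phoneme tokens the processor produced.
print([processor.tokens[i] for i in processed[0, : lengths[0]]])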
@@ -195,7 +191,7 @@ def text_to_sequence(text):
# encoded text. For the detail of the model, please refer to `the
# paper <https://arxiv.org/abs/1712.05884>`__.
#
-# It is easy to instantiate a Tacotron2 model with pretrained weight,
+# It is easy to instantiate a Tacotron2 model with pretrained weights,
# however, note that the input to Tacotron2 models needs to be processed
# by the matching text processor.
#
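
The corresponding code is collapsed below. A hedged sketch of generating a spectrogram with a matching pretrained bundle (the bundle choice is an assumption):

import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()  # pretrained weights are downloaded on first use

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)

print(spec.shape)  # (batch, n_mels, frames)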
Expand Down Expand Up @@ -224,7 +220,7 @@ def text_to_sequence(text):

######################################################################
# Note that the ``Tacotron2.infer`` method performs multinomial sampling,
-# therefor, the process of generating the spectrogram incurs randomness.
+# therefore, the process of generating the spectrogram incurs randomness.
#
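
One way to observe this, reusing ``tacotron2``, ``processed``, and ``lengths`` from the sketch above (each call samples anew, so the frame counts typically differ):

for trial in range(3):
    with torch.inference_mode():
        spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    print(f"trial {trial}: {spec.shape[-1]} spectrogram frames")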


@@ -245,16 +241,16 @@ def plot():
# -------------------
#
# Once the spectrogram is generated, the last process is to recover the
-# waveform from the spectrogram.
+# waveform from the spectrogram using a vocoder.
#
# ``torchaudio`` provides vocoders based on ``GriffinLim`` and
# ``WaveRNN``.
#


######################################################################
-# WaveRNN
-# ~~~~~~~
+# WaveRNN Vocoder
+# ~~~~~~~~~~~~~~~
#
# Continuing from the previous section, we can instantiate the matching
# WaveRNN model from the same bundle.
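
The code is collapsed below; a hedged end-to-end sketch (the bundle name and output file name are assumptions):

import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()  # WaveRNN-based vocoder from the same bundle

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, wave_lengths = vocoder(spec, spec_lengths)

torchaudio.save("output_wavernn.wav", waveforms[0:1].cpu(), sample_rate=vocoder.sample_rate)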
@@ -294,11 +290,11 @@ def plot(waveforms, spec, sample_rate):


######################################################################
-# Griffin-Lim
-# ~~~~~~~~~~~
+# Griffin-Lim Vocoder
+# ~~~~~~~~~~~~~~~~~~~
#
# Using the Griffin-Lim vocoder is the same as WaveRNN. You can instantiate
-# the vocode object with
+# the vocoder object with
# :py:func:`~torchaudio.pipelines.Tacotron2TTSBundle.get_vocoder`
# method and pass the spectrogram.
#
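
A hedged sketch mirroring the WaveRNN flow with the Griffin-Lim bundle (the bundle name is an assumption):

import torch
import torchaudio

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2()
vocoder = bundle.get_vocoder()  # GriffinLim-based vocoder

with torch.inference_mode():
    processed, lengths = processor("Hello world! Text to speech!")
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, wave_lengths = vocoder(spec, spec_lengths)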
@@ -323,8 +319,8 @@ def plot(waveforms, spec, sample_rate):


######################################################################
-# Waveglow
-# ~~~~~~~~
+# Waveglow Vocoder
+# ~~~~~~~~~~~~~~~~
#
# Waveglow is a vocoder published by Nvidia. The pretrained weights are
# published on Torch Hub. One can instantiate the model using ``torch.hub``
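
The remainder of the section is collapsed. A hedged sketch following Nvidia's published Torch Hub usage (the hub tag and keyword arguments follow Nvidia's documentation; the tutorial itself may load the weights differently):

import torch

# Assumed hub entry point, per Nvidia's DeepLearningExamples Torch Hub page.
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)  # ``spec`` from the Tacotron2 step above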