diff --git a/PyTorch/SpeechSynthesis/FastPitch/.dockerignore b/PyTorch/SpeechSynthesis/FastPitch/.dockerignore new file mode 100644 index 000000000..4cfb3fabe --- /dev/null +++ b/PyTorch/SpeechSynthesis/FastPitch/.dockerignore @@ -0,0 +1,9 @@ +*~ +*.pyc +__pycache__ +output +LJSpeech-1.1* +runs* +pretrained_models + +.git diff --git a/PyTorch/SpeechSynthesis/FastPitch/.gitignore b/PyTorch/SpeechSynthesis/FastPitch/.gitignore new file mode 100644 index 000000000..540fab07b --- /dev/null +++ b/PyTorch/SpeechSynthesis/FastPitch/.gitignore @@ -0,0 +1,9 @@ +*.swp +*.swo +*.pyc +__pycache__ +scripts_joc/ +runs*/ +notebooks/ +LJSpeech-1.1/ +output* diff --git a/PyTorch/SpeechSynthesis/FastPitch/Dockerfile b/PyTorch/SpeechSynthesis/FastPitch/Dockerfile new file mode 100644 index 000000000..1d3c40144 --- /dev/null +++ b/PyTorch/SpeechSynthesis/FastPitch/Dockerfile @@ -0,0 +1,7 @@ +ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:20.03-py3 +FROM ${FROM_IMAGE_NAME} + +ADD requirements.txt . +RUN pip install --no-cache-dir -r requirements.txt +WORKDIR /workspace/fastpitch +COPY . . diff --git a/PyTorch/SpeechSynthesis/FastPitch/LICENSE b/PyTorch/SpeechSynthesis/FastPitch/LICENSE new file mode 100644 index 000000000..fe818a49a --- /dev/null +++ b/PyTorch/SpeechSynthesis/FastPitch/LICENSE @@ -0,0 +1,29 @@ +BSD 3-Clause License + +Copyright (c) 2020, NVIDIA Corporation +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions are met: + +* Redistributions of source code must retain the above copyright notice, this + list of conditions and the following disclaimer. + +* Redistributions in binary form must reproduce the above copyright notice, + this list of conditions and the following disclaimer in the documentation + and/or other materials provided with the distribution. + +* Neither the name of the copyright holder nor the names of its + contributors may be used to endorse or promote products derived from + this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" +AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. diff --git a/PyTorch/SpeechSynthesis/FastPitch/README.md b/PyTorch/SpeechSynthesis/FastPitch/README.md new file mode 100644 index 000000000..526f2d83f --- /dev/null +++ b/PyTorch/SpeechSynthesis/FastPitch/README.md @@ -0,0 +1,777 @@ +# FastPitch 1.0 for PyTorch + +This repository provides a script and recipe to train the FastPitch model to achieve state-of-the-art accuracy and is tested and maintained by NVIDIA. 

## Table Of Contents

- [Model overview](#model-overview)
  * [Model architecture](#model-architecture)
  * [Default configuration](#default-configuration)
  * [Feature support matrix](#feature-support-matrix)
    * [Features](#features)
  * [Mixed precision training](#mixed-precision-training)
    * [Enabling mixed precision](#enabling-mixed-precision)
  * [Glossary](#glossary)
- [Setup](#setup)
  * [Requirements](#requirements)
- [Quick Start Guide](#quick-start-guide)
- [Advanced](#advanced)
  * [Scripts and sample code](#scripts-and-sample-code)
  * [Parameters](#parameters)
    * [Training parameters](#training-parameters)
    * [Audio and STFT parameters](#audio-and-stft-parameters)
    * [FastPitch parameters](#fastpitch-parameters)
  * [Command-line options](#command-line-options)
  * [Getting the data](#getting-the-data)
    * [Dataset guidelines](#dataset-guidelines)
    * [Multi-dataset](#multi-dataset)
  * [Training process](#training-process)
  * [Inference process](#inference-process)
  * [Deploying the FastPitch model using Triton Inference Server](#deploying-the-fastpitch-model-using-triton-inference-server)
    * [Performance analysis for Triton Inference Server](#performance-analysis-for-triton-inference-server)
    * [Running the Triton Inference Server and client](#running-the-triton-inference-server-and-client)
- [Performance](#performance)
  * [Benchmarking](#benchmarking)
    * [Training performance benchmark](#training-performance-benchmark)
    * [Inference performance benchmark](#inference-performance-benchmark)
  * [Results](#results)
    * [Training accuracy results](#training-accuracy-results)
      * [Training accuracy: NVIDIA DGX-1 (8x V100 16G)](#training-accuracy-nvidia-dgx-1-8x-v100-16g)
      * [Training stability test](#training-stability-test)
    * [Training performance results](#training-performance-results)
      * [Training performance: NVIDIA DGX-1 (8x V100 16G)](#training-performance-nvidia-dgx-1-8x-v100-16g)
      * [Expected training time](#expected-training-time)
      * [Training performance: NVIDIA DGX-2 (16x V100 32G)](#training-performance-nvidia-dgx-2-16x-v100-32g)
    * [Inference performance results](#inference-performance-results)
      * [Inference performance: NVIDIA DGX-1 (1x V100 16G)](#inference-performance-nvidia-dgx-1-1x-v100-16g)
      * [Inference performance: NVIDIA T4](#inference-performance-nvidia-t4)
- [Release notes](#release-notes)
  * [Changelog](#changelog)
  * [Known issues](#known-issues)

## Model overview

A full text-to-speech (TTS) system is a pipeline of two neural network models:
* a mel-spectrogram generator such as [FastPitch](#) or [Tacotron 2](https://arxiv.org/abs/1712.05884), and
* a waveform synthesizer such as [WaveGlow](https://arxiv.org/abs/1811.00002) (see [NVIDIA example code](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2)).

Such a system enables users to synthesize natural-sounding speech from raw transcripts.

The FastPitch model generates mel-spectrograms from raw input text and allows you to exert additional control over the synthesized utterances, such as:
* supplying pitch cues to control the prosody (see the sketch below),
* altering the pace of speech.

The FastPitch model is based on the [FastSpeech](https://arxiv.org/abs/1905.09263) model. The main differences between FastPitch and FastSpeech are that FastPitch:
* explicitly learns to predict pitch (f0),
* achieves higher quality, trains faster, and no longer needs knowledge distillation from a teacher model,
* extracts character durations with a pre-trained Tacotron 2 model instead.
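
As a rough illustration of what this control amounts to, the snippet below post-processes a per-character pitch contour and duration sequence before they are consumed by the decoder. The tensor names and the scaling factors are invented for illustration and do not correspond to actual flags or variables in this repository.

```python
import torch

# Hypothetical per-character predictions (names and shapes invented for illustration):
# a normalized pitch value and a duration in mel frames for each of 42 input characters.
pitch_pred = torch.randn(1, 42)
durations = torch.rand(1, 42) * 10.0

# Prosody control: damp the pitch contour by 25% to obtain flatter, calmer speech.
pitch_flattened = 0.75 * pitch_pred

# Pace control: dividing durations by a pace factor > 1.0 speeds the utterance up,
# a factor < 1.0 slows it down.
pace = 1.25
durations_paced = torch.clamp(torch.round(durations / pace), min=0).long()

print(pitch_flattened.shape, int(durations_paced.sum()))
```

Because FastPitch predicts pitch and durations explicitly, such edits can be applied per character at synthesis time, without retraining the model.
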
The model is trained on the publicly available [LJ Speech dataset](https://keithito.com/LJ-Speech-Dataset/).

This model is trained with mixed precision using Tensor Cores on NVIDIA Volta and Turing GPUs. Therefore, researchers can get results 2.2x faster than training without Tensor Cores, while experiencing the benefits of mixed precision training. This model is tested against each NGC monthly container release to ensure consistent accuracy and performance over time.

### Model architecture

FastPitch is a fully feedforward Transformer model that predicts mel-spectrograms from raw text. The model is composed of an encoder, a pitch predictor, a duration predictor, and a decoder. After encoding, the signal is augmented with pitch information and discretely upsampled. The goal of the decoder is to smooth out the upsampled signal and construct a mel-spectrogram. The entire process is parallel. A minimal sketch of this data flow follows Figure 1.

[Figure: FastPitch model architecture]

Figure 1. Architecture of FastPitch (source: [FastPitch: Paper title TODO](#))
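
For readers who prefer code to diagrams, here is a minimal, self-contained PyTorch sketch of the data flow described above. It is not the repository's implementation: the feed-forward blocks stand in for the actual Transformer stacks, and all sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class FastPitchSketch(nn.Module):
    """Illustrative skeleton of the FastPitch data flow, not the real implementation."""

    def __init__(self, n_symbols=148, d_model=384, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d_model)
        # Stand-in feed-forward blocks where the real model uses Transformer layers.
        self.encoder = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())
        self.duration_predictor = nn.Linear(d_model, 1)  # frames per input symbol
        self.pitch_predictor = nn.Linear(d_model, 1)     # average f0 per input symbol
        self.pitch_embed = nn.Linear(1, d_model)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, text):                              # text: [T_text] symbol ids
        enc = self.encoder(self.embed(text))              # [T_text, d_model]
        durations = self.duration_predictor(enc).squeeze(-1)
        pitch = self.pitch_predictor(enc)
        enc = enc + self.pitch_embed(pitch)               # augment with pitch cues
        frames = torch.clamp(torch.round(durations), min=1).long()
        # Discrete upsampling: repeat each symbol for its predicted number of frames.
        upsampled = torch.repeat_interleave(enc, frames, dim=0)
        return self.to_mel(self.decoder(upsampled))       # [T_mel, n_mels]

mel = FastPitchSketch()(torch.randint(0, 148, (12,)))
print(mel.shape)  # e.g. torch.Size([T_mel, 80])
```
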
+ +### Default configuration + +The FastPitch model supports multi-GPU and mixed precision training with dynamic loss +scaling (see Apex code +[here](https://github.com/NVIDIA/apex/blob/master/apex/fp16_utils/loss_scaler.py)), +as well as mixed precision inference. + +The following features were implemented in this model: + +* data-parallel multi-GPU training, +* dynamic loss scaling with backoff for Tensor Cores (mixed precision) +training, +* gradient accumulation for reproducible results regardless of the number of GPUs. + +To speed-up FastPitch training, +reference mel-spectrograms, character durations, and pitch cues +are generated during the pre-processing step and read +directly from the disk during training. For more information on data pre-processing refer to [Dataset guidelines +](#dataset-guidelines) and the [paper](#). + +### Feature support matrix + +The following features are supported by this model. + +| Feature | FastPitch | +| :------------------------------------------------------------------|------------:| +|[AMP](https://nvidia.github.io/apex/amp.html) | Yes | +|[Apex DistributedDataParallel](https://nvidia.github.io/apex/parallel.html) | Yes | + +#### Features + +AMP - a tool that enables Tensor Core-accelerated training. For more information, +refer to [Enabling mixed precision](#enabling-mixed-precision). + +Apex DistributedDataParallel - a module wrapper that enables easy multiprocess +distributed data parallel training, similar to `torch.nn.parallel.DistributedDataParallel`. +`DistributedDataParallel` is optimized for use with NCCL. It achieves high +performance by overlapping communication with computation during `backward()` +and bucketing smaller gradient transfers to reduce the total number of transfers +required. + +### Mixed precision training + +Mixed precision is the combined use of different numerical precisions in a computational method. [Mixed precision](https://arxiv.org/abs/1710.03740) training offers significant computational speedup by performing operations in half-precision format while storing minimal information in single-precision to retain as much information as possible in critical parts of the network. Since the introduction of [Tensor Cores](https://developer.nvidia.com/tensor-cores) in the Volta and Turing architecture, significant training speedups are experienced by switching to mixed precision -- up to 3x overall speedup on the most arithmetically intense model architectures. Using mixed precision training requires two steps: +1. Porting the model to use the FP16 data type where appropriate. +2. Adding loss scaling to preserve small gradient values. + +The ability to train deep learning networks with lower precision was introduced in the Pascal architecture and first supported in [CUDA 8](https://devblogs.nvidia.com/parallelforall/tag/fp16/) in the NVIDIA Deep Learning SDK. + +For information about: +- How to train using mixed precision, see the [Mixed Precision Training](https://arxiv.org/abs/1710.03740) paper and [Training With Mixed Precision](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html) documentation. +- Techniques used for mixed precision training, see the [Mixed-Precision Training of Deep Neural Networks](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) blog. +- How to access and enable AMP for TensorFlow, see [Using TF-AMP](https://docs.nvidia.com/deeplearning/dgx/tensorflow-user-guide/index.html#tfamp) from the TensorFlow User Guide. 
- APEX tools for mixed precision training, see the [NVIDIA Apex: Tools for Easy Mixed-Precision Training in PyTorch](https://devblogs.nvidia.com/apex-pytorch-easy-mixed-precision-training/) blog.

#### Enabling mixed precision

Mixed precision is enabled in PyTorch by using the Automatic Mixed Precision (AMP) library from [APEX](https://github.com/NVIDIA/apex), which casts variables to half-precision upon retrieval while storing variables in single-precision format. Furthermore, to preserve small gradient magnitudes in backpropagation, a [loss scaling](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html#lossscaling) step must be included when applying gradients. In PyTorch, loss scaling can be easily applied by using the `scale_loss()` method provided by AMP. The scaling value to be used can be [dynamic](https://nvidia.github.io/apex/fp16_utils.html#apex.fp16_utils.DynamicLossScaler) or fixed.

By default, the `scripts/train.sh` script launches mixed precision training with Tensor Cores. You can change this behaviour by removing the `--amp-run` flag from the `train.py` invocation inside that script.

To enable mixed precision, the following steps were performed:
* Import AMP from APEX:
  ```python
  from apex import amp
  ```

* Initialize AMP:
  ```python
  model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
  ```

* If running on multi-GPU, wrap the model with `DistributedDataParallel`:
  ```python
  from apex.parallel import DistributedDataParallel as DDP
  model = DDP(model)
  ```

* Scale the loss before backpropagation (assuming the loss is stored in a variable called `losses`):

  * Default backpropagation for FP32:
    ```python
    losses.backward()
    ```

  * Scale the loss and backpropagate with AMP:
    ```python
    with amp.scale_loss(losses, optimizer) as scaled_losses:
        scaled_losses.backward()
    ```

### Glossary

**Forced alignment**
Segmentation of a recording into lexical units like characters, words, or phonemes. The segmentation is hard and defines exact starting and ending times for every unit.

**Fundamental frequency**
The lowest vibration frequency of a periodic soundwave, for example, produced by a vibrating instrument. It is perceived as the loudest. In the context of speech, it refers to the frequency of vibration of the vocal cords. Abbreviated as *f0*.

**Pitch**
A perceived frequency of vibration of music or sound.

**Transformer**
The paper [Attention Is All You Need](https://arxiv.org/abs/1706.03762) introduces a novel architecture called Transformer, which repeatedly applies the attention mechanism. It transforms one sequence into another.

## Setup

The following section lists the requirements that you need to meet in order to start training the FastPitch model.

### Requirements

This repository contains a Dockerfile that extends the PyTorch NGC container and encapsulates some dependencies.
Aside from these dependencies, ensure you have the following components: +- [NVIDIA Docker](https://github.com/NVIDIA/nvidia-docker) +- [PyTorch 20.03-py3 NGC container](https://ngc.nvidia.com/registry/nvidia-pytorch) +or newer +- [NVIDIA Volta](https://www.nvidia.com/en-us/data-center/volta-gpu-architecture/) or [Turing](https://www.nvidia.com/en-us/geforce/turing/) based GPU + +For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation: +- [Getting Started Using NVIDIA GPU Cloud](https://docs.nvidia.com/ngc/ngc-getting-started-guide/index.html) +- [Accessing And Pulling From The NGC Container Registry](https://docs.nvidia.com/deeplearning/frameworks/user-guide/index.html#accessing_registry) +- [Running PyTorch](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/running.html#running) + +For those unable to use the PyTorch NGC container, to set up the required environment or create your own container, see the versioned [NVIDIA Container Support Matrix](https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html). + +## Quick Start Guide + +To train your model using mixed precision with Tensor Cores or using FP32, perform the following steps using the default parameters of the FastPitch model on the LJSpeech 1.1 dataset. For the specifics concerning training and inference, see the [Advanced](#advanced) section. + +1. Clone the repository. + ```bash + git clone https://github.com/NVIDIA/DeepLearningExamples.git + cd DeepLearningExamples/PyTorch/SpeechSynthesis/FastPitch + ``` + +2. Build and run the FastPitch PyTorch NGC container. + + By default the container will use the first available GPU. Modify the script to include other available devices. + ```bash + bash scripts/docker/build.sh + bash scripts/docker/interactive.sh + ``` + +3. Download and preprocess the dataset. + + Use the scripts to automatically download and preprocess the training, validation and test datasets: + ```bash + bash scripts/download_dataset.sh + bash scripts/prepare_dataset.sh + ``` + + The data is downloaded to the `./LJSpeech-1.1` directory (on the host). The + `./LJSpeech-1.1` directory is mounted under the `/workspace/fastpitch/LJSpeech-1.1` + location in the NGC container. The complete dataset has the following structure: + ```bash + ./LJSpeech-1.1 + ├── durations # Character durations estimates for forced alignment training + ├── mels # Pre-calculated target mel-spectrograms + ├── metadata.csv # Mapping of waveforms to utterances + ├── pitch_char # Average fundamental frequencies, aligned and averaged for every character + ├── pitch_char_stats__ljs_audio_text_train_filelist.json # Mean and std of pitch for training data + ├── README + └── wavs # Raw waveforms + ``` + +4. Start training. + ```bash + bash scripts/train.sh + ``` + The training will produce a FastPitch model capable of generating mel-spectrograms from raw text. + It will be serialized as a single `.pt` checkpoint file, along with a series of intermediate checkpoints. + +5. Start validation/evaluation. + + Ensure your training loss values are comparable to those listed in the table in the + [Results](#results) section. Note that the validation loss is evaluated with ground truth durations for letters (not the predicted ones). The loss values are stored in the `./output/nvlog.json` log file, `./output/{train,val,test}` as TensorBoard logs, and printed to the standard output (`stdout`) during training. 
   The main reported loss is a weighted sum of losses for the mel-, pitch-, and duration-predicting modules.

   The audio can be generated by following the [Inference process](#inference-process) section below.
   The synthesized audio should be similar to the samples in the `./audio` directory.

6. Start inference/predictions.

   To synthesize audio, you will need to train a WaveGlow model, which generates waveforms based on mel-spectrograms generated with FastPitch. To train WaveGlow, follow the instructions in [NVIDIA/DeepLearningExamples/Tacotron2](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2). A pre-trained WaveGlow checkpoint should be placed in the `./pretrained_models` directory.

   You can perform inference using the respective `.pt` checkpoints that are passed as the `--fastpitch`
   and `--waveglow` arguments:
   ```bash
   python inference.py --cuda --wn-channels 256 --amp-run \
                       --fastpitch output/<FastPitch checkpoint> \
                       --waveglow pretrained_models/waveglow/<WaveGlow checkpoint> \
                       -i phrases/devset10.tsv \
                       -o output/wavs_devset10
   ```

   The speech is generated from the lines of text in the file that is passed with the
   `-i` argument. To run inference in mixed precision, use the `--amp-run` flag. The output audio will
   be stored in the path specified by the `-o` argument. Consult `inference.py` to learn about more options, such as setting the batch size.

## Advanced

The following sections provide greater details of the dataset, running training and inference, and the training results.

### Scripts and sample code

The repository holds code for FastPitch (training and inference) and WaveGlow (inference only).
The code specific to a particular model is located in that model's directory - `./fastpitch` and `./waveglow` - and common functions live in the `./common` directory. The model-specific scripts are as follows:

* `<model_name>/model.py` - the model architecture, definition of forward and inference functions
* `<model_name>/arg_parser.py` - argument parser for parameters specific to a given model
* `<model_name>/data_function.py` - data loading functions
* `<model_name>/loss_function.py` - loss function for the model

The common scripts contain layer definitions common to both models (`common/layers.py`), some utility scripts (`common/utils.py`) and scripts for audio processing (`common/audio_processing.py` and `common/stft.py`).

In the root directory `./` of this repository, the `./train.py` script is used for training, while inference can be executed with the `./inference.py` script. The scripts `./models.py`, `./data_functions.py` and `./loss_functions.py` call the respective scripts in the `<model_name>` directory, depending on which model is trained using the `train.py` script.

The structure of the repository closely follows that of the [NVIDIA Tacotron2 Deep Learning example](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2), which makes it possible to combine both models within a single project in more advanced use cases.

### Parameters

In this section, we list the most important hyperparameters and command-line arguments,
together with their default values that are used to train FastPitch.

#### Training parameters

* `--epochs` - number of epochs (default: 1500)
* `--learning-rate` - learning rate (default: 0.1)
* `--batch-size` - batch size (default: 32)
* `--amp-run` - use mixed precision training

#### Audio and STFT parameters

* `--sampling-rate` - sampling rate in Hz of input and output audio (22050)
* `--filter-length` - size of the FFT used to compute the STFT (1024)
* `--hop-length` - hop length for FFT, i.e., sample stride between consecutive FFTs (256)
* `--win-length` - window size for FFT (1024)
* `--mel-fmin` - lowest frequency in Hz (0.0)
* `--mel-fmax` - highest frequency in Hz (8000)

#### FastPitch parameters

* `--pitch-predictor-loss-scale` - rescale the loss of the pitch predictor module to dampen its influence on the shared encoder
* `--duration-predictor-loss-scale` - rescale the loss of the duration predictor module to dampen its influence on the shared encoder
* `--pitch` - enable pitch conditioning and prediction

### Command-line options

To see the full list of available options and their descriptions, use the `-h` or `--help` command line option, for example:
```bash
python train.py --help
```

The following example output is printed when running the model:

```bash
DLL 2020-03-30 10:41:12.562594 - epoch 1 | iter 1/19 | loss 36.99 | mel loss 35.25 | 142370.52 items/s | elapsed 2.50 s | lrate 1.00E-01 -> 3.16E-06
DLL 2020-03-30 10:41:13.202835 - epoch 1 | iter 2/19 | loss 37.26 | mel loss 35.98 | 561459.27 items/s | elapsed 0.64 s | lrate 3.16E-06 -> 6.32E-06
DLL 2020-03-30 10:41:13.831189 - epoch 1 | iter 3/19 | loss 36.93 | mel loss 35.41 | 583530.16 items/s | elapsed 0.63 s | lrate 6.32E-06 -> 9.49E-06
```

### Getting the data

The FastPitch and WaveGlow models were trained on the LJSpeech-1.1 dataset.
The `./scripts/download_dataset.sh` script will automatically download and extract the dataset to the `./LJSpeech-1.1` directory.

#### Dataset guidelines

The LJSpeech dataset has 13,100 clips that amount to about 24 hours of speech from a single female speaker. Since the original dataset does not define a train/dev/test split of the data, we provide a split in the form of three file lists:
```bash
./filelists
├── ljs_mel_ali_pitch_text_test_filelist.txt
├── ljs_mel_ali_pitch_text_train_filelist.txt
└── ljs_mel_ali_pitch_text_val_filelist.txt
```

***NOTE: When combining FastPitch/WaveGlow with external models trained on LJSpeech-1.1, make sure that your train/dev/test split matches. Different organizations may use custom splits. A mismatch poses a risk of leaking the training data through model weights during validation and testing.***

FastPitch predicts character durations just like [FastSpeech](https://arxiv.org/abs/1905.09263) does.
This calls for training with forced alignments, expressed as the number of output mel-spectrogram frames for every input character.
To this end, a pre-trained [Tacotron 2 model](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) is used. Its attention matrix relates the input characters to the output mel-spectrogram frames.

For every mel-spectrogram frame, its fundamental frequency in Hz is estimated with [Praat](http://praat.org).
Those values are then averaged over every character in order to provide sparse pitch cues for the model (see the sketch after Figure 2). Character boundaries are calculated from the durations extracted previously with Tacotron 2.

[Figure: Pitch estimates extracted with Praat]

Figure 2. Pitch estimates for mel-spectrogram frames of phrase "in being comparatively", averaged over characters. Silent letters have duration 0 and are omitted.
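
The sketch below illustrates this averaging step, assuming that a frame-level f0 track (e.g. from Praat) and per-character durations in frames (e.g. from the Tacotron 2 alignment) are already available. The function name and array layout are made up for this example and do not mirror the repository's preprocessing code.

```python
import numpy as np

def average_pitch_per_character(f0_frames, durations):
    """Average a frame-level f0 track over characters, given per-character durations.

    f0_frames: 1-D array of f0 values in Hz, one per mel-spectrogram frame
               (unvoiced frames marked as 0.0).
    durations: 1-D integer array, number of frames assigned to each character.
    Returns a 1-D array with one averaged f0 value per character
    (0.0 for silent characters with duration 0 or no voiced frames).
    """
    pitch_char = np.zeros(len(durations), dtype=np.float32)
    start = 0
    for i, dur in enumerate(durations):
        segment = f0_frames[start:start + dur]
        voiced = segment[segment > 0.0]            # ignore unvoiced frames
        pitch_char[i] = voiced.mean() if len(voiced) else 0.0
        start += dur
    return pitch_char

# Toy example: 3 characters spanning 2, 0, and 3 frames respectively.
print(average_pitch_per_character(
    np.array([210.0, 214.0, 198.0, 0.0, 202.0]), np.array([2, 0, 3])))
```
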
+ +#### Multi-dataset + +Follow these steps to use datasets different from the default LJSpeech dataset. + +1. Prepare a directory with .wav files. + ```bash + ./my_dataset + └── wavs + ``` + +2. Prepare filelists with transcripts and paths to .wav files. They define training/validation split of the data (test is currently unused): + ``` + ./filelists + ├── my_dataset_mel_ali_pitch_text_train_filelist.txt + └── my_dataset_mel_ali_pitch_text_val_filelist.txt + ``` + +Those filelists should list a single utterance per line as: + ```bash + `