RNN-T reference update for MLPerf Training v1.0 (pytorch#430)
* RNN-T reference update for MLPerf Training v1.0
* switch to stable DALI release
* transcript tensor building - index with np array instead of torch tensor
* fix multi-GPU bucketing
* eval every epoch, logging improvement
* user can adjust optimizer betas
* gradient clipping
* missing config file
* [README] add driver disclaimer
* right path to sentencepieces
* bind all GPUs in docker/launch.sh script
* move speed perturbation out of evaluation
* remove unrelated code; update logging; default arguments with LAMB
* add evaluation when every sample is seen once
* add run_and_time.sh
* update logging
* missing augmentation logs
* revert unwanted dropout removal from first two encoder layers
* scaling weights initialization
* limit number of symbols produced by the greedy decoder
* simplification - rm old eval pipeline
* dev_ema in tb_logginer
* loading from checkpoint restores optimizer state
* Rnnt logging update (pytorch#4)
* logging uses constants instead of raw strings
* missing log entries
* add weights initialization logging according to mlcommons/logging#80
* 0.5 weights initialization scale gives more stable convergence
* fix typo, update logging lib to include new constant
* README update
* apply review suggestions
* [README] fix model diagram: 2x time stacking after 2nd encoder layer, not 3x
* transcript tensor padding comment
* DALI output doesn't need extra zeroing of padding
* Update README.md: links to code sources, fix LSTM weight and bias initialization description
* [README] model diagram fix - adjust to 1023 sentencepieces
Showing 73 changed files with 4,216 additions and 4,417 deletions.
@@ -0,0 +1,6 @@
checkpoints/
tb_*/
results/
__pycache__
_legacy/
lightning_logs/
@@ -1,5 +1,5 @@
-Jasper in PyTorch
+RNN-T in PyTorch

-This repository includes source code (in "parts/") from:
+This repository includes source code (in "rnnt/") from:
 * https://github.com/keithito/tacotron and https://github.com/ryanleary/patter licensed under MIT license.
@@ -1,44 +1,192 @@
# DISCLAIMER
This codebase is a work in progress. There are known and unknown bugs in the implementation, and it has not been optimized in any way.

MLPerf has neither finalized a decision to add a speech recognition benchmark, nor accepted this implementation/architecture as a reference implementation.

# 1. Problem
Speech recognition accepts raw audio samples and produces a corresponding text transcription.

# 2. Directions
See https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/README.md. This implementation shares significant code with that repository.
## Steps to configure machine
### From Docker
1. Clone the repository
```
git clone https://github.com/mlcommons/training.git
```
2. Install CUDA and Docker
```
source training/install_cuda_docker.sh
```
3. Build the Docker image for the speech recognition task
```
# Build from Dockerfile
cd training/rnn_speech_recognition/pytorch/
bash scripts/docker/build.sh
```
#### Requirements
Currently, the reference uses CUDA 11.0 (see [Dockerfile](Dockerfile#L15)).
A table listing compatible drivers is available here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver

## Steps to download data
1. Start an interactive session in the NGC container to run data download/training/inference
```
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULTS_DIR>
```

Within the container, the contents of this repository will be copied to the `/workspace/rnnt` directory. The `/datasets`, `/checkpoints`, and `/results` directories are mounted as volumes and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, and `<RESULTS_DIR>` on the host.
2. Download and preprocess the dataset.

No GPU is required for data download and preprocessing. Therefore, if GPU time is a limited resource, the container for this section can be launched on a CPU-only machine by following the previous steps.

Note: downloading and preprocessing the dataset requires 500 GB of free disk space and can take several hours to complete.

This repository provides scripts to download and extract the following dataset:

* LibriSpeech [http://www.openslr.org/12](http://www.openslr.org/12)

LibriSpeech contains 1000 hours of 16 kHz read English speech derived from public domain audiobooks from the LibriVox project, carefully segmented and aligned. For more information, see the [LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS](http://www.danielpovey.com/files/2015_icassp_librispeech.pdf) paper.

Inside the container, download and extract the datasets into the required format for later training and inference:
```bash
bash scripts/download_librispeech.sh
```
Once the data download is complete, the following folders should exist:
* `/datasets/LibriSpeech/`
   * `train-clean-100/`
   * `train-clean-360/`
   * `train-other-500/`
   * `dev-clean/`
   * `dev-other/`
   * `test-clean/`
   * `test-other/`

Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see [Steps to download data](#steps-to-download-data)), once the dataset is downloaded it will be accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.
Next, convert the data into WAV files:
```bash
bash scripts/preprocess_librispeech.sh
```
Once the data is converted, the following additional files and folders should exist:
* `/datasets/LibriSpeech/`
   * `librispeech-train-clean-100-wav.json`
   * `librispeech-train-clean-360-wav.json`
   * `librispeech-train-other-500-wav.json`
   * `librispeech-dev-clean-wav.json`
   * `librispeech-dev-other-wav.json`
   * `librispeech-test-clean-wav.json`
   * `librispeech-test-other-wav.json`
   * `train-clean-100-wav/`
   * `train-clean-360-wav/`
   * `train-other-500-wav/`
   * `dev-clean-wav/`
   * `dev-other-wav/`
   * `test-clean-wav/`
   * `test-other-wav/`

For training, the following manifest files are used:
* `librispeech-train-clean-100-wav.json`
* `librispeech-train-clean-360-wav.json`
* `librispeech-train-other-500-wav.json`

For evaluation, `librispeech-dev-clean-wav.json` is used.
## Steps to run the benchmark

### Steps to launch training

Inside the container, use the following script to start training.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see [Steps to download data](#steps-to-download-data)), which corresponds to `/datasets/LibriSpeech` inside the container.

```bash
bash scripts/train.sh
```

This script tries to use 8 GPUs by default.
To run single-GPU training, use the following command:

```bash
NUM_GPUS=1 GRAD_ACCUMULATION_STEPS=64 scripts/train.sh
```
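
Setting `GRAD_ACCUMULATION_STEPS=64` keeps the effective global batch size comparable to the 8-GPU default: gradients from 64 consecutive minibatches are accumulated before a single optimizer step. The snippet below is only a minimal sketch of that idea, not the reference training loop (which also handles DALI, mixed precision, and distributed training); the clipping threshold is an illustrative placeholder.

```python
import torch

def train_epoch(model, loss_fn, optimizer, batches, grad_accumulation_steps=64):
    """Minimal gradient-accumulation loop: the optimizer steps once every
    `grad_accumulation_steps` minibatches, so a single GPU can emulate the
    global batch size of a multi-GPU run."""
    model.train()
    optimizer.zero_grad()
    for step, (inputs, targets) in enumerate(batches, start=1):
        loss = loss_fn(model(inputs), targets)
        # Scale so the accumulated gradient equals the mean over the large batch.
        (loss / grad_accumulation_steps).backward()
        if step % grad_accumulation_steps == 0:
            # Illustrative clipping value; the reference exposes its own setting.
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
            optimizer.zero_grad()
```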
# 3. Dataset/Environment
### Publication/Attribution
["OpenSLR LibriSpeech Corpus"](http://www.openslr.org/12/) provides over 1000 hours of speech data in the form of raw audio.

### Data preprocessing
What preprocessing is done to the dataset?
Data preprocessing is described by the scripts mentioned in [Steps to download data](#steps-to-download-data).

### Data pipeline
Transcripts are encoded to sentencepieces using the model produced in [Steps to download data](#steps-to-download-data).
Audio processing consists of the following steps (a hedged sketch of the main transforms follows the list):
1. audio is decoded with a sample rate chosen uniformly between 13800 and 18400 Hz ([code](./common/data/dali/pipeline.py#L91-L97));
2. silence is trimmed with a -60 dB threshold (details in the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/archives/dali_0280/user-guide/docs/supported_ops.html?highlight=nonsilentregion#nvidia.dali.ops.NonsilentRegion)) ([code](./common/data/dali/pipeline.py#L120-L121));
3. random noise with a normal distribution and 0.00001 amplitude is applied to reduce quantization effects (dither) ([code](./common/data/dali/pipeline.py#L197));
4. a pre-emphasis filter is applied (details in the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/archives/dali_0280/user-guide/docs/supported_ops.html?highlight=nonsilentregion#nvidia.dali.ops.PreemphasisFilter)) ([code](./common/data/dali/pipeline.py#L101));
5. spectrograms are calculated with a 512-point FFT, a 20 ms window, and a 10 ms stride ([code](./common/data/dali/pipeline.py#L103-L105));
6. mel filter banks are calculated with 80 features and normalization ([code](./common/data/dali/pipeline.py#L107-L108));
7. features are translated to decibels with a log(10) multiplier, reference magnitude 1, and a 1e-20 cutoff (details in the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/archives/dali_0280/user-guide/docs/supported_ops.html?highlight=nonsilentregion#nvidia.dali.ops.ToDecibels)) ([code](./common/data/dali/pipeline.py#L110-L111));
8. features are normalized along the time dimension using the algorithm described in the [normalize operator documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/normalize.html) ([code](./common/data/dali/pipeline.py#L115));
9. in the training pipeline, adaptive SpecAugment augmentation is applied ([arxiv](https://arxiv.org/abs/1912.05533), [code](https://github.com/mwawrzos/training/blob/rnnt/rnn_speech_recognition/pytorch/common/data/features.py#L44-L117)); in the evaluation pipeline, this step is omitted;
10. to reduce accelerator memory usage, frames are spliced (stacked three times, then subsampled three times) ([code](https://github.com/mwawrzos/training/blob/rnnt/rnn_speech_recognition/pytorch/common/data/features.py#L144-L165)).
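
For orientation, here is a hedged sketch of steps 3-8 and 10 written with torchaudio instead of DALI. The window, stride, FFT size, mel count, and dither amplitude follow the list above; the pre-emphasis coefficient (0.97) and the normalization epsilon are illustrative assumptions, and speed perturbation, silence trimming, and SpecAugment are omitted.

```python
import torch
import torchaudio

# 16 kHz audio, 20 ms window, 10 ms stride, 512-point FFT, 80 mel features.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=512, win_length=320, hop_length=160, n_mels=80)

def extract_features(waveform: torch.Tensor, dither=1e-5, stacking=3):
    """waveform: (batch, samples). Returns (batch, 3 * 80, frames // 3)."""
    # Step 3: low-amplitude Gaussian noise (dither) to mask quantization artifacts.
    waveform = waveform + dither * torch.randn_like(waveform)
    # Step 4: pre-emphasis filter y[t] = x[t] - 0.97 * x[t-1] (0.97 is assumed here).
    waveform = torch.cat(
        [waveform[:, :1], waveform[:, 1:] - 0.97 * waveform[:, :-1]], dim=1)
    # Steps 5-6: power spectrogram projected onto 80 mel filter banks.
    feats = mel(waveform)                                   # (batch, 80, frames)
    # Step 7: log scale with a small cutoff, standing in for DALI's ToDecibels.
    feats = torch.log(feats.clamp(min=1e-20))
    # Step 8: per-feature normalization over the time dimension.
    feats = (feats - feats.mean(dim=-1, keepdim=True)) / \
            (feats.std(dim=-1, keepdim=True) + 1e-5)
    # Step 10: frame splicing. When the stacking and subsampling factors are equal
    # (both 3 in the description above), this is a pure reshape: concatenate the
    # feature vectors of 3 consecutive frames and keep every 3rd stacked frame.
    b, f, t = feats.shape
    t = t - t % stacking
    feats = feats[:, :, :t].reshape(b, f, t // stacking, stacking)
    return feats.permute(0, 3, 1, 2).reshape(b, stacking * f, t // stacking)
```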
### Training and test data separation
How is the test set extracted?
The dataset authors have separated it into training and test subsets. For this benchmark, training is done on the train-clean-100, train-clean-360, and train-other-500 subsets. Evaluation is done on the dev-clean subset.

### Training data order
In what order is the training data traversed?
To reduce data padding in minibatches, data bucketing is applied.
The algorithm is implemented here:
[link](https://github.com/mlcommons/training/blob/2126999a1ffff542064bb3208650a1e673920dcf/rnn_speech_recognition/pytorch/common/data/dali/sampler.py#L65-L105)
and can be described as follows (a short Python sketch follows the list):
1. drop samples longer than a given threshold ([code](./common/data/dali/data_loader.py#L97-L98));
2. sort data by audio length ([code](./common/data/dali/sampler.py#L69));
3. split data into 6 equally sized buckets ([code](./common/data/dali/sampler.py#L70));
4. for every epoch:
    1. shuffle data in each bucket ([code](./common/data/dali/sampler.py#L73-L78));
    2. while the total number of samples is not divisible by the global batch size, remove a random element from a random bucket ([code](./common/data/dali/sampler.py#L82-L86));
    3. concatenate all buckets;
    4. split samples into minibatches ([code](./common/data/dali/sampler.py#L90));
    5. shuffle the minibatches within the epoch ([code](./common/data/dali/sampler.py#L93-L94)).
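
A compact sketch of one epoch of that ordering, assuming NumPy and illustrative argument names (the actual sampler in `common/data/dali/sampler.py` reseeds per epoch, drops over-long samples earlier in the data loader, and removes surplus elements from randomly chosen buckets before concatenation):

```python
import numpy as np

def bucketing_batches(audio_lengths, global_batch_size, num_buckets=6, seed=0):
    """Return one epoch of minibatch indices following the bucketing scheme above."""
    rng = np.random.default_rng(seed)
    order = np.argsort(audio_lengths)                 # step 2: sort by audio length
    buckets = np.array_split(order, num_buckets)      # step 3: equally sized buckets
    buckets = [rng.permutation(b) for b in buckets]   # step 4.1: shuffle each bucket
    epoch = np.concatenate(buckets)                   # step 4.3: concatenate buckets
    # Step 4.2 (simplified): drop random elements until the epoch divides evenly.
    while len(epoch) % global_batch_size != 0:
        epoch = np.delete(epoch, rng.integers(len(epoch)))
    # Steps 4.4 and 4.5: split into minibatches, then shuffle whole minibatches.
    batches = epoch.reshape(-1, global_batch_size)
    return batches[rng.permutation(len(batches))]
```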
### Test data order
In what order is the test data traversed?
Test data order is the same as in the dataset.
# 4. Model
### Publication/Attribution
Cite the paper describing the model plus any additional attribution requested by code authors.
To the best of our knowledge, there is no single publication describing RNN-T training on LibriSpeech,
or any other publicly available dataset of reasonable size. For that reason, the reference is a
collection of solutions from several works. It is based on the following articles:
* Graves 2012 - the invention of the RNN-Transducer: https://arxiv.org/abs/1211.3711
* Rao 2018 - time reduction in the acoustic model, internal dataset: https://arxiv.org/abs/1801.00841
* Zhang 2020 - Transformer-Transducer publication; it includes a bidirectional-LSTM RNN-T result on LibriSpeech: https://arxiv.org/abs/2002.02562
* Park 2019 - adaptive SpecAugment, internal dataset: https://arxiv.org/abs/1912.05533
* Guo 2020 - RNN-T trained with vanilla LSTM, internal dataset: https://arxiv.org/abs/2007.13802

### List of layers
Brief summary of the structure of the model
The model structure is shown in the following picture (a hedged skeleton of the three sub-networks follows below):
![model layers structure](./rnnt_layers.svg "RNN-T model structure")
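
Reading the diagram, the model splits into an encoder (a stack of LSTMs with 2x time stacking after the second layer), a prediction network (an embedding of previously emitted sentencepieces followed by LSTMs), and a joint network (two linear layers with a ReLU). The skeleton below is only a hedged reading of that picture: layer counts and widths are placeholders, not the reference configuration in `rnnt/model.py`; the input width 240 assumes 80 mel features spliced 3x, and the output width assumes 1023 sentencepieces plus a blank.

```python
import torch
import torch.nn as nn

class RNNTSketch(nn.Module):
    """Illustrative skeleton of the three RNN-T sub-networks."""

    def __init__(self, n_feats=240, vocab_size=1024,
                 enc_hid=1024, pred_hid=512, joint_hid=512):
        super().__init__()
        # Encoder: LSTMs, then 2x time stacking, then more LSTMs.
        self.enc_pre = nn.LSTM(n_feats, enc_hid, num_layers=2, batch_first=True)
        self.enc_post = nn.LSTM(2 * enc_hid, enc_hid, num_layers=3, batch_first=True)
        # Prediction network: embedding of previous tokens followed by LSTMs.
        self.embed = nn.Embedding(vocab_size, pred_hid)
        self.pred_rnn = nn.LSTM(pred_hid, pred_hid, num_layers=2, batch_first=True)
        # Joint network: combine encoder and prediction outputs, project to vocab + blank.
        self.joint = nn.Sequential(
            nn.Linear(enc_hid + pred_hid, joint_hid), nn.ReLU(),
            nn.Linear(joint_hid, vocab_size))

    def forward(self, feats, tokens):
        f, _ = self.enc_pre(feats)                        # (B, T, enc_hid)
        b, t, h = f.shape
        f = f[:, :t - t % 2].reshape(b, t // 2, 2 * h)    # stack frame pairs: 2x time reduction
        f, _ = self.enc_post(f)                           # (B, T/2, enc_hid)
        # (A real prediction network also shifts/prepends a start symbol.)
        g, _ = self.pred_rnn(self.embed(tokens))          # (B, U, pred_hid)
        # Broadcast over time and token axes to get (B, T/2, U, vocab) joint logits.
        return self.joint(torch.cat(
            [f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
             g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1))
```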
### Weight and bias initialization
How are weights and biases initialized?
* In all fully connected layers, weights and biases are initialized as defined in the [PyTorch 1.7.0 torch.nn.Linear documentation](https://pytorch.org/docs/1.7.0/generated/torch.nn.Linear.html#torch.nn.Linear) ([code](./rnnt/model.py#L123-L137)).
* In the embedding layer, weights are initialized as defined in the [PyTorch 1.7.0 torch.nn.Embedding documentation](https://pytorch.org/docs/1.7.0/generated/torch.nn.Embedding.html#torch.nn.Embedding) ([code](./rnnt/model.py#L105)).
* In all LSTM layers (see the sketch below):
    * weights and biases are initialized as defined in the [PyTorch 1.7.0 torch.nn.LSTM documentation](https://pytorch.org/docs/1.7.0/generated/torch.nn.LSTM.html#torch.nn.LSTM) ([code](./common/rnn.py#L56-L61)),
    * forget gate biases are set to 1 ([code](./common/rnn.py#L67-L69)),
    * then the weight and bias values are divided by two (as a result, the forget gate biases become 0.5) ([code](./common/rnn.py#L74-L76)).
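
A hedged sketch of that LSTM initialization; the function name, the scale argument, and the exact handling of the two bias tensors per layer are assumptions, while the reference logic lives in `common/rnn.py`:

```python
import torch.nn as nn

def init_lstm_(lstm: nn.LSTM, weights_init_scale: float = 0.5):
    """Keep PyTorch's default uniform init, set the forget-gate slice of each
    bias tensor to 1, then scale every weight and bias by `weights_init_scale`
    (0.5 by default), which leaves the forget-gate biases at 0.5."""
    h = lstm.hidden_size
    for name, param in lstm.named_parameters():
        if "bias" in name:
            # PyTorch packs gates as [input, forget, cell, output], so the
            # forget-gate slice is the second hidden_size chunk.
            param.data[h:2 * h].fill_(1.0)
        param.data.mul_(weights_init_scale)
```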
### Loss function
Transducer Loss
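
The transducer loss marginalizes, via a forward-backward recursion, over all alignments of the T encoder frames to the U target sentencepieces plus blanks. The reference binds its own transducer-loss kernel; the snippet below only illustrates the expected tensor shapes using `torchaudio.functional.rnnt_loss` (available in recent torchaudio releases), with made-up sizes and an assumed blank index of 0.

```python
import torch
import torchaudio

B, T, U, V = 2, 50, 12, 1024   # batch, encoder frames, target length, vocab incl. blank

# Joint-network output: one extra prediction step for blank-only continuations.
logits = torch.randn(B, T, U + 1, V)
targets = torch.randint(1, V, (B, U), dtype=torch.int32)       # no blanks in targets
logit_lengths = torch.full((B,), T, dtype=torch.int32)
target_lengths = torch.full((B,), U, dtype=torch.int32)

loss = torchaudio.functional.rnnt_loss(
    logits, targets, logit_lengths, target_lengths, blank=0, reduction="mean")
```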
### Optimizer
The RNN-T benchmark uses the LAMB optimizer. More details are in the [training policies](https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#appendix-allowed-optimizers).

To decrease the number of epochs needed to reach the target accuracy,
evaluation is done with an exponential moving average of the trained model weights, with the smoothing factor set to 0.999 (a minimal sketch follows below).
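
A minimal sketch of that averaging, assuming a detached copy of the model is kept for evaluation and updated after every optimizer step (function and variable names are illustrative):

```python
import copy
import torch

@torch.no_grad()
def update_ema(model: torch.nn.Module, ema_model: torch.nn.Module, decay: float = 0.999):
    """Blend the current weights into the evaluation copy: ema = decay * ema + (1 - decay) * current."""
    for p, ema_p in zip(model.parameters(), ema_model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

# Typical usage: evaluate WER on `ema_model`, not on `model`.
# ema_model = copy.deepcopy(model)
# ... after each optimizer.step(): update_ema(model, ema_model)
```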
# 5. Quality
### Quality metric
Word Error Rate (WER) across all words in the output text of all samples in the validation set.
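
Concretely, WER is the word-level edit distance (substitutions + insertions + deletions) summed over the validation set and divided by the total number of words in the reference transcripts. A small self-contained sketch, assuming whitespace tokenization:

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming (rolling row)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, start=1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution / match
    return d[len(hyp)]

def word_error_rate(references, hypotheses):
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    words = sum(len(r.split()) for r in references)
    return errors / words
```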
### Quality target
What is the numeric quality target?
Target quality is a Word Error Rate of 0.058 or lower.
### Evaluation frequency
Evaluation is done after each training epoch.
### Evaluation thoroughness
Evaluation is done on each sample from the evaluation set.