RNN-T reference update for MLPerf Training v1.0 (pytorch#430)
* RNN-T reference update for MLPerf Training v1.0

* switch to stable DALI release

* transcript tensor building - index with np array instead of torch tensor

* fix multi-GPU bucketing

* eval every epoch, logging improvement

* user can adjust optimizer betas

* gradient clipping

* missing config file

* [README] add driver disclaimer

* right path to sentencepieces

* bind all gpus in docker/launch.sh script

* move speed perturbation out of evaluation

* remove unrelated code; update logging; default arguments with LAMB

* add evaluation when every sample is seen once

* add run_and_time.sh

* update logging

* missing augmentation logs

* revert unwanted dropout removal from first two encode layers

* scaling weights initialization

* limit number of symbols produced by the greedy decoder

* simplification - rm old eval pipeline

* dev_ema in tb_logginer

* loading from checkpoint restores optimizer state

* Rnnt logging update (pytorch#4)

* logging uses constants instead of raw strings
* missing log entries
* add weights initialization logging according to mlcommons/logging#80

* 0.5 weights initialization scale gives more stable convergence

* fix typo, update logging lib to include new constant

* README update

* apply review suggestions

* [README] fix model diagram

2x time stacking after 2nd encoder layer, not 3x

* transcript tensor padding comment

* DALI output doesn't need extra zeroing of padding

* Update README.md

Links to code sources, fix LSTM weight and bias initialization description

* [README] model diagram fix - adjust to 1023 sentencepieces
mwawrzos authored Apr 7, 2021
1 parent 8e7ad54 commit 55d6266
Showing 73 changed files with 4,216 additions and 4,417 deletions.
6 changes: 6 additions & 0 deletions rnn_speech_recognition/pytorch/.dockerignore
@@ -0,0 +1,6 @@
checkpoints/
tb_*/
results/
__pycache__
_legacy/
lightning_logs/
26 changes: 18 additions & 8 deletions rnn_speech_recognition/pytorch/Dockerfile
@@ -1,4 +1,4 @@
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2019-2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -12,35 +12,45 @@
# See the License for the specific language governing permissions and
# limitations under the License.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/pytorch:19.09-py3
ARG FROM_IMAGE_NAME=pytorch/pytorch:1.7.0-cuda11.0-cudnn8-devel
FROM ${FROM_IMAGE_NAME}

ENV PYTORCH_VERSION=1.7.0a0+7036e91

RUN apt-get update && apt-get install -y libsndfile1 && apt-get install -y sox && rm -rf /var/lib/apt/lists/*
RUN apt-get update && \
apt-get install -y libsndfile1 sox git cmake jq && \
apt-get install -y --no-install-recommends numactl && \
rm -rf /var/lib/apt/lists/*

RUN COMMIT_SHA=c6d12f9e1562833c2b4e7ad84cb22aa4ba31d18c && \
RUN COMMIT_SHA=f546575109111c455354861a0567c8aa794208a2 && \
git clone https://github.com/HawkAaron/warp-transducer deps/warp-transducer && \
cd deps/warp-transducer && \
git checkout $COMMIT_SHA && \
sed -i 's/set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2")/#set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_30,code=sm_30 -O2")/g' CMakeLists.txt && \
sed -i 's/set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_75,code=sm_75")/set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -gencode arch=compute_80,code=sm_80")/g' CMakeLists.txt && \
mkdir build && \
cd build && \
cmake .. && \
make VERBOSE=1 && \
export CUDA_HOME="/usr/local/cuda" && \
export CUDA_HOME="/usr/local/cuda" && \
export WARP_RNNT_PATH=`pwd` && \
export CUDA_TOOLKIT_ROOT_DIR=$CUDA_HOME && \
export LD_LIBRARY_PATH="$CUDA_HOME/extras/CUPTI/lib64:$LD_LIBRARY_PATH" && \
export LIBRARY_PATH=$CUDA_HOME/lib64:$LIBRARY_PATH && \
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH && \
export CFLAGS="-I$CUDA_HOME/include $CFLAGS" && \
cd ../pytorch_binding && \
python3 setup.py install --user && \
python3 setup.py install && \
rm -rf ../tests test ../tensorflow_binding && \
cd ../../..

WORKDIR /workspace/jasper
WORKDIR /workspace/rnnt

RUN pip install --no-cache --extra-index-url https://developer.download.nvidia.com/compute/redist nvidia-dali-cuda110==0.28.0

RUN pip install --global-option="--cpp_ext" --global-option="--cuda_ext" https://github.com/NVIDIA/apex/archive/8a1ed9e8d35dfad26fb973996319965e4224dcdd.zip

COPY requirements.txt .
RUN pip install --disable-pip-version-check -U -r requirements.txt
RUN pip install --no-cache --disable-pip-version-check -U -r requirements.txt

COPY . .
2 changes: 1 addition & 1 deletion rnn_speech_recognition/pytorch/LICENSE
@@ -188,7 +188,7 @@
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2019 NVIDIA Corporation
Copyright 2019-2020 NVIDIA Corporation
Copyright 2019 Myrtle Software Limited, www.myrtle.ai

Licensed under the Apache License, Version 2.0 (the "License");
4 changes: 2 additions & 2 deletions rnn_speech_recognition/pytorch/NOTICE
@@ -1,5 +1,5 @@
Jasper in PyTorch
RNN-T in PyTorch

This repository includes source code (in "parts/") from:
This repository includes source code (in "rnnt/") from:
* https://github.com/keithito/tacotron and https://github.com/ryanleary/patter licensed under MIT license.

188 changes: 168 additions & 20 deletions rnn_speech_recognition/pytorch/README.md
@@ -1,44 +1,192 @@
# DISCLAIMER
This codebase is a work in progress. There are known and unknown bugs in the implementation, and it has not been optimized in any way.

MLPerf has neither finalized a decision to add a speech recognition benchmark, nor selected this implementation/architecture as a reference implementation.

# 1. Problem
Speech recognition accepts raw audio samples and produces a corresponding text transcription.

# 2. Directions
See https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/SpeechRecognition/Jasper/README.md. This implementation shares significant code with that repository.

## Steps to configure machine
### From Docker
1. Clone the repository
```
git clone https://github.com/mlcommons/training.git
```
2. Install CUDA and Docker
```
source training/install_cuda_docker.sh
```
3. Build the docker image for the speech recognition task
```
# Build from Dockerfile
cd training/rnn_speech_recognition/pytorch/
bash scripts/docker/build.sh
```

#### Requirements
Currently, the reference uses CUDA 11.0 (see [Dockerfile](Dockerfile#L15)).
A table listing compatible drivers can be found here: https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver

## Steps to download data
1. Start an interactive session in the container to run data download/training/inference
```
bash scripts/docker/launch.sh <DATA_DIR> <CHECKPOINT_DIR> <RESULTS_DIR>
```

Within the container, the contents of this repository will be copied to the `/workspace/rnnt` directory. The `/datasets`, `/checkpoints`, `/results` directories are mounted as volumes
and mapped to the corresponding directories `<DATA_DIR>`, `<CHECKPOINT_DIR>`, `<RESULTS_DIR>` on the host.

2. Download and preprocess the dataset.

No GPU is required for data download and preprocessing. Therefore, if GPUs are a limited resource, launch the container for this section on a CPU-only machine by following the previous steps.

Note: Downloading and preprocessing the dataset requires 500GB of free disk space and can take several hours to complete.

This repository provides scripts to download and extract the following dataset:

* LibriSpeech [http://www.openslr.org/12](http://www.openslr.org/12)

LibriSpeech contains 1000 hours of 16kHz read English speech derived from public domain audiobooks from the LibriVox project and has been carefully segmented and aligned. For more information, see the [LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS](http://www.danielpovey.com/files/2015_icassp_librispeech.pdf) paper.

Inside the container, download and extract the datasets into the required format for later training and inference:
```bash
bash scripts/download_librispeech.sh
```
Once the data download is complete, the following folders should exist:

* `/datasets/LibriSpeech/`
* `train-clean-100/`
* `train-clean-360/`
* `train-other-500/`
* `dev-clean/`
* `dev-other/`
* `test-clean/`
* `test-other/`

Since `/datasets/` is mounted to `<DATA_DIR>` on the host (see Step 3), once the dataset is downloaded it will be accessible from outside of the container at `<DATA_DIR>/LibriSpeech`.

Next, convert the data into WAV files:
```bash
bash scripts/preprocess_librispeech.sh
```
Once the data is converted, the following additional files and folders should exist:
* `/datasets/LibriSpeech/`
* `librispeech-train-clean-100-wav.json`
* `librispeech-train-clean-360-wav.json`
* `librispeech-train-other-500-wav.json`
* `librispeech-dev-clean-wav.json`
* `librispeech-dev-other-wav.json`
* `librispeech-test-clean-wav.json`
* `librispeech-test-other-wav.json`
* `train-clean-100-wav/`
* `train-clean-360-wav/`
* `train-other-500-wav/`
* `dev-clean-wav/`
* `dev-other-wav/`
* `test-clean-wav/`
* `test-other-wav/`

For training, the following manifest files are used:
* `librispeech-train-clean-100-wav.json`
* `librispeech-train-clean-360-wav.json`
* `librispeech-train-other-500-wav.json`

For evaluation, the `librispeech-dev-clean-wav.json` is used.

## Steps to run the benchmark

### Steps to launch training

Inside the container, use the following script to start training.
Make sure the downloaded and preprocessed dataset is located at `<DATA_DIR>/LibriSpeech` on the host (see Step 3), which corresponds to `/datasets/LibriSpeech` inside the container.

```bash
bash scripts/train.sh
```

This script uses 8 GPUs by default.
To run single-GPU training, use the following command:

```bash
NUM_GPUS=1 GRAD_ACCUMULATION_STEPS=64 scripts/train.sh
```

# 3. Dataset/Environment
### Publication/Attribution
["OpenSLR LibriSpeech Corpus"](http://www.openslr.org/12/) provides over 1000 hours of speech data in the form of raw audio.

### Data preprocessing
What preprocessing is done to the dataset?
Data preprocessing is described by the scripts mentioned in [Steps to download data](#steps-to-download-data).

### Data pipeline
Transcripts are encoded to sentencepieces using the model produced in [Steps to download data](#steps-to-download-data).
Audio processing consists of the following steps:
1. audio is decoded with a sample rate chosen uniformly between 13800 and 18400 ([code](./common/data/dali/pipeline.py#L91-L97));
2. silence is trimmed with a -60 dB threshold (details in the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/archives/dali_0280/user-guide/docs/supported_ops.html?highlight=nonsilentregion#nvidia.dali.ops.NonsilentRegion)) ([code](./common/data/dali/pipeline.py#L120-L121));
3. random noise with a normal distribution and 0.00001 amplitude is applied to reduce quantization effects (dither) ([code](/common/data/dali/pipeline.py#L197));
4. a pre-emphasis filter is applied (details in the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/archives/dali_0280/user-guide/docs/supported_ops.html?highlight=nonsilentregion#nvidia.dali.ops.PreemphasisFilter)) ([code](./common/data/dali/pipeline.py#L101));
5. spectrograms are calculated with 512-point FFTs, a 20 ms window and a 10 ms stride ([code](./common/data/dali/pipeline.py#L103-L105));
6. mel filter banks are calculated with 80 features and normalization ([code](./common/data/dali/pipeline.py#L107-L108));
7. features are converted to decibels with a log(10) multiplier, reference magnitude 1 and a 1e-20 cutoff (details in the [DALI documentation](https://docs.nvidia.com/deeplearning/dali/archives/dali_0280/user-guide/docs/supported_ops.html?highlight=nonsilentregion#nvidia.dali.ops.ToDecibels)) ([code](./common/data/dali/pipeline.py#L110-L111));
8. features are normalized along the time dimension using the algorithm described in the [normalize operator documentation](https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/normalize.html) ([code](common/data/dali/pipeline.py#L115));
9. in the train pipeline, an adaptive SpecAugment augmentation is applied ([arxiv](https://arxiv.org/abs/1912.05533), [code](https://github.com/mwawrzos/training/blob/rnnt/rnn_speech_recognition/pytorch/common/data/features.py#L44-L117)). In the evaluation pipeline, this step is omitted;
10. to reduce accelerator memory usage, frames are spliced (stacked three times, and subsampled three times) ([code](https://github.com/mwawrzos/training/blob/rnnt/rnn_speech_recognition/pytorch/common/data/features.py#L144-L165)); a minimal sketch of this splicing step follows the list.
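
As an illustration of the final splicing step, here is a minimal, self-contained sketch. It assumes equal stacking and subsampling factors; `splice_frames` is a hypothetical helper, not the reference implementation:

```python
import torch


def splice_frames(feats: torch.Tensor, factor: int = 3) -> torch.Tensor:
    """Stack `factor` consecutive frames feature-wise and subsample by the same factor.

    feats: [time, features] log-mel features for one utterance.
    Returns a [ceil(time / factor), features * factor] tensor.
    """
    t, f = feats.shape
    # Pad the time axis so it is divisible by the splicing factor.
    pad = (factor - t % factor) % factor
    if pad:
        feats = torch.nn.functional.pad(feats, (0, 0, 0, pad))
    # Every `factor` consecutive frames are concatenated into one wider frame.
    return feats.reshape(-1, factor * f)


feats = torch.randn(100, 80)
print(splice_frames(feats).shape)  # torch.Size([34, 240])
```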

### Training and test data separation
How is the test set extracted?
The dataset authors separated it into training and test subsets. For this benchmark, training is done on the train-clean-100, train-clean-360 and train-other-500 subsets. Evaluation is done on the dev-clean subset.

### Training data order
In what order is the training data traversed?
To reduce data padding in minibatches, data bucketing is applied.
The algorithm is implemented here:
[link](https://github.com/mlcommons/training/blob/2126999a1ffff542064bb3208650a1e673920dcf/rnn_speech_recognition/pytorch/common/data/dali/sampler.py#L65-L105)
and can be described as follows (a minimal sketch of the procedure is shown after this list):
1. drop samples longer than a given threshold ([code](./common/data/dali/data_loader.py#L97-L98));
2. sort data by audio length ([code](./common/data/dali/sampler.py#L69));
3. split data into 6 equally sized buckets ([code](./common/data/dali/sampler.py#L70));
4. for every epoch:
    1. shuffle data in each bucket ([code](common/data/dali/sampler.py#L73-L78));
    2. while the total number of samples is not divisible by the global batch size, remove a random element from a random bucket ([code](./common/data/dali/sampler.py#L82-L86));
    3. concatenate all buckets;
    4. split samples into minibatches ([code](./common/data/dali/sampler.py#L90));
    5. shuffle minibatches within the epoch ([code](./common/data/dali/sampler.py#L93-L94)).
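
The following sketch restates the bucketing order in plain Python for one epoch. It is illustrative only; `bucketed_batches` and its default arguments are hypothetical and simplified relative to the reference sampler:

```python
import random


def bucketed_batches(sample_lengths, num_buckets=6, global_batch_size=1024, seed=0):
    """One epoch of the bucketing order described above.

    sample_lengths: list mapping sample index -> audio length.
    global_batch_size and seed are placeholders, not the reference defaults.
    """
    rng = random.Random(seed)
    # 1. sort sample indices by audio length
    order = sorted(range(len(sample_lengths)), key=lambda i: sample_lengths[i])
    # 2. split into (roughly) equally sized buckets
    bucket_size = len(order) // num_buckets
    buckets = [order[i * bucket_size:(i + 1) * bucket_size] for i in range(num_buckets)]
    buckets[-1].extend(order[num_buckets * bucket_size:])  # leftovers go to the last bucket
    # 3. shuffle data within each bucket
    for bucket in buckets:
        rng.shuffle(bucket)
    # 4. remove random elements from random buckets until the total
    #    number of samples is divisible by the global batch size
    total = sum(len(b) for b in buckets)
    while total % global_batch_size != 0:
        bucket = rng.choice([b for b in buckets if b])
        bucket.pop(rng.randrange(len(bucket)))
        total -= 1
    # 5. concatenate buckets, split into minibatches, shuffle the minibatch order
    flat = [idx for bucket in buckets for idx in bucket]
    batches = [flat[i:i + global_batch_size] for i in range(0, len(flat), global_batch_size)]
    rng.shuffle(batches)
    return batches


batches = bucketed_batches(sample_lengths=[random.randint(1, 100) for _ in range(4096)])
```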

### Test data order
In what order is the test data traversed?
Test data order is the same as in the dataset.
### Simulation environment (RL models only)
Describe simulation environment briefly, if applicable.
Not applicable.

# 4. Model
### Publication/Attribution
Cite paper describing model plus any additional attribution requested by code authors
To the best of our knowledge, there is no single publication describing RNN-T training on LibriSpeech,
or any other publicly available dataset of reasonable size. For that reason, the reference is a
collection of solutions from several works. It is based on the following articles:
* Graves 2012 - the invention of the RNN-Transducer: https://arxiv.org/abs/1211.3711
* Rao 2018 - time reduction in the acoustic model, internal dataset: https://arxiv.org/abs/1801.00841
* Zhang 2020 - Transformer-transducer publication. It includes bi-directional LSTM RNN-T result on LibriSpeech: https://arxiv.org/abs/2002.02562
* Park 2019 - adaptive spec augment, internal dataset: https://arxiv.org/abs/1912.05533
* Guo 2020 - RNN-T trained with vanilla LSTM, internal dataset: https://arxiv.org/abs/2007.13802

### List of layers
Brief summary of structure of model
The model structure is shown in the following diagram:
![model layers structure](./rnnt_layers.svg "RNN-T model structure")

### Weight and bias initialization
How are weights and biases initialized?
* In all fully connected layers, weights and biases are initialized as defined in the [PyTorch 1.7.0 torch.nn.Linear documentation](https://pytorch.org/docs/1.7.0/generated/torch.nn.Linear.html#torch.nn.Linear) ([code](./rnnt/model.py#L123-L137)).
* In the embedding layer, weights are initialized as defined in the [PyTorch 1.7.0 torch.nn.Embedding documentation](https://pytorch.org/docs/1.7.0/generated/torch.nn.Embedding.html#torch.nn.Embedding) ([code](./rnnt/model.py#L105)).
* In all LSTM layers:
  * weights and biases are initialized as defined in the [PyTorch 1.7.0 torch.nn.LSTM documentation](https://pytorch.org/docs/1.7.0/generated/torch.nn.LSTM.html#torch.nn.LSTM) ([code](./common/rnn.py#L56-L61)),
  * forget gate biases are set to 1 ([code](./common/rnn.py#L67-L69)),
  * then the weight and bias values are divided by two (as a result, the forget gate biases end up at 0.5) ([code](./common/rnn.py#L74-L76)); a minimal sketch of this initialization is shown below.
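
A minimal sketch of the LSTM initialization described above. It is illustrative only; in particular, the split of the forget-gate bias between `bias_ih` and `bias_hh` is an assumption, not taken from the reference code:

```python
import torch


def init_lstm_weights(lstm: torch.nn.LSTM,
                      forget_gate_bias: float = 1.0,
                      weights_init_scale: float = 0.5) -> None:
    """PyTorch packs LSTM gate parameters as [input, forget, cell, output],
    each block of size hidden_size, so the forget-gate slice is [H:2H]."""
    h = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if "bias_ih" in name:
                # Set the forget-gate bias to 1 in the input-hidden bias ...
                param[h:2 * h].fill_(forget_gate_bias)
            elif "bias_hh" in name:
                # ... and zero the hidden-hidden forget-gate bias, so the
                # effective forget-gate bias is 1 (assumed split, see above).
                param[h:2 * h].fill_(0.0)
        # Scale all weights and biases by 0.5, which also halves the
        # effective forget-gate bias to 0.5.
        for param in lstm.parameters():
            param.mul_(weights_init_scale)


lstm = torch.nn.LSTM(input_size=240, hidden_size=1024, num_layers=2)
init_lstm_weights(lstm)
```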

### Loss function
Transducer Loss
### Optimizer
TBD, currently Adam
The RNN-T benchmark uses the LAMB optimizer. More details are in the [training policies](https://github.com/mlcommons/training_policies/blob/master/training_rules.adoc#appendix-allowed-optimizers).
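
As an illustration, one way to instantiate a LAMB optimizer is via NVIDIA Apex, which the Dockerfile above installs. This is a hedged sketch with placeholder hyperparameters; the reference may use a different LAMB implementation and settings:

```python
import torch
from apex.optimizers import FusedLAMB  # Apex is installed in the Dockerfile above

model = torch.nn.Linear(240, 1024)  # stand-in for the RNN-T model
optimizer = FusedLAMB(model.parameters(),
                      lr=4e-3,  # placeholder hyperparameters, not the reference values
                      betas=(0.9, 0.999),
                      weight_decay=1e-3)
```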

To decrease the number of epochs needed to reach the target accuracy,
evaluation is done with an exponential moving average of the trained model weights with a smoothing factor set to 0.999.
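
A minimal sketch of such an EMA update (illustrative only; `update_ema` is a hypothetical helper, not the reference code):

```python
import copy
import torch


def update_ema(model: torch.nn.Module, ema_model: torch.nn.Module, decay: float = 0.999) -> None:
    """Blend the current weights into the EMA copy after each optimizer step."""
    with torch.no_grad():
        for p, ema_p in zip(model.parameters(), ema_model.parameters()):
            ema_p.mul_(decay).add_(p, alpha=1.0 - decay)


model = torch.nn.Linear(4, 4)      # stand-in for the RNN-T model
ema_model = copy.deepcopy(model)   # EMA copy starts equal to the model
update_ema(model, ema_model)       # evaluation then runs on ema_model
```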

# 5. Quality
### Quality metric
Word Error Rate (WER) across all words in the output text of all samples in the validation set.
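For illustration, corpus-level WER can be computed as the total word-level edit distance divided by the total number of reference words. This is a hedged sketch, not the reference scoring code:

```python
def word_error_rate(hypotheses, references):
    """Corpus-level WER: total word edit distance / total reference words."""
    def edit_distance(hyp, ref):
        # Standard Levenshtein distance over word sequences (single-row DP).
        d = list(range(len(ref) + 1))
        for i, h in enumerate(hyp, 1):
            prev, d[0] = d[0], i
            for j, r in enumerate(ref, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (h != r))
        return d[len(ref)]

    errors = sum(edit_distance(h.split(), r.split()) for h, r in zip(hypotheses, references))
    words = sum(len(r.split()) for r in references)
    return errors / words


print(word_error_rate(["the cat sat"], ["the cat sat on the mat"]))  # 0.5
```
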
### Quality target
What is the numeric quality target?
Target quality is 0.058 Word Error Rate or lower.
### Evaluation frequency
TBD
Evaluation is done after each training epoch.
### Evaluation thoroughness
TBD
Evaluation is done on each sample from the evaluation set.
