Gradient filter for training lstm model #564

Merged 20 commits on Sep 29, 2022
2 changes: 1 addition & 1 deletion .flake8
@@ -9,7 +9,7 @@ per-file-ignores =
egs/*/ASR/pruned_transducer_stateless*/*.py: E501,
egs/*/ASR/*/optim.py: E501,
egs/*/ASR/*/scaling.py: E501,
egs/librispeech/ASR/lstm_transducer_stateless/*.py: E501, E203
egs/librispeech/ASR/lstm_transducer_stateless*/*.py: E501, E203
egs/librispeech/ASR/conv_emformer_transducer_stateless*/*.py: E501, E203
egs/librispeech/ASR/conformer_ctc2/*py: E501,
egs/librispeech/ASR/RESULTS.md: E999,
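
The one-character change in this hunk widens the glob so that the new `lstm_transducer_stateless3` directory picks up the same ignores as `lstm_transducer_stateless`. A rough illustration, assuming flake8 matches `per-file-ignores` patterns with `fnmatch`-style globbing; the example path is the `train.py` script that this PR's RESULTS.md invokes:

```python
from fnmatch import fnmatch

# Patterns before and after the change to .flake8.
old_pattern = "egs/librispeech/ASR/lstm_transducer_stateless/*.py"
new_pattern = "egs/librispeech/ASR/lstm_transducer_stateless*/*.py"

# A file in the directory added by this PR.
path = "egs/librispeech/ASR/lstm_transducer_stateless3/train.py"

old_matches = fnmatch(path, old_pattern)  # the old pattern requires "/" right after "stateless"
new_matches = fnmatch(path, new_pattern)  # "*" absorbs the trailing "3"
print(old_matches, new_matches)
```

Only the new pattern matches the `lstm_transducer_stateless3` path, while files under the original `lstm_transducer_stateless` directory still match because `*` can match an empty string.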
134 changes: 112 additions & 22 deletions egs/librispeech/ASR/RESULTS.md
@@ -1,11 +1,99 @@
## Results

#### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + multi-dataset)
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + gradient filter)

[lstm_transducer_stateless2](./lstm_transducer_stateless2)
#### [lstm_transducer_stateless3](./lstm_transducer_stateless3)

See <https://github.com/k2-fsa/icefall/pull/558> for more details.
It implements an LSTM model with mechanisms from the reworked model for streaming ASR.
A gradient filter is applied inside each LSTM module to stabilize training.
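
As a minimal sketch of the idea, and not the code in this PR: one way to filter gradients is to compute a norm per sequence in the batch and drop the gradient of any sequence whose norm is far above a batch-level reference. The use of the median as that reference, the RMS norm, and the helper names below are all assumptions made for illustration; only the threshold value 25.0 comes from the `--grad-norm-threshold` option in the training command.

```python
import math
from statistics import median


def rms(values):
    """Root-mean-square of a flat list of gradient values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def filter_gradients(batch_grads, threshold):
    """Zero the gradient of any sequence whose RMS norm exceeds threshold * median norm.

    batch_grads holds one flat list of gradient values per sequence.
    Hypothetical helper for illustration; the median reference is an assumption.
    """
    norms = [rms(g) for g in batch_grads]
    cutoff = threshold * median(norms)
    return [
        g if n <= cutoff else [0.0] * len(g)
        for g, n in zip(batch_grads, norms)
    ]


# Two well-behaved sequences and one with an exploding gradient.
grads = [[0.1, -0.2], [0.2, 0.1], [50.0, -80.0]]
filtered = filter_gradients(grads, threshold=25.0)
```

Here the third sequence's gradient is zeroed while the other two pass through unchanged, so a single bad utterance cannot dominate the update.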

See <https://github.com/k2-fsa/icefall/pull/564> for more details.

##### training on full librispeech

This model contains 12 encoder layers (LSTM module + Feedforward module). The number of model parameters is 84689496.

The WERs are:

| | test-clean | test-other | comment | decoding mode |
|-------------------------------------|------------|------------|----------------------|----------------------|
| greedy search (max sym per frame 1) | 3.66 | 9.51 | --epoch 40 --avg 15 | simulated streaming |
| greedy search (max sym per frame 1) | 3.66 | 9.48 | --epoch 40 --avg 15 | streaming |
| fast beam search | 3.55 | 9.33 | --epoch 40 --avg 15 | simulated streaming |
| fast beam search | 3.57 | 9.25 | --epoch 40 --avg 15 | streaming |
| modified beam search | 3.55 | 9.28 | --epoch 40 --avg 15 | simulated streaming |
| modified beam search | 3.54 | 9.25 | --epoch 40 --avg 15 | streaming |

Note: `simulated streaming` means feeding the full utterance at once during decoding, while `streaming` means feeding a certain number of frames at a time.
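
The difference between the two decoding modes can be sketched as follows. This is a toy illustration only: `iter_chunks` is a hypothetical helper, the frames are stand-in integers, and the chunk size of 4 is arbitrary rather than taken from the recipe.

```python
def iter_chunks(frames, chunk_size):
    """Yield consecutive slices of at most chunk_size frames."""
    for start in range(0, len(frames), chunk_size):
        yield frames[start:start + chunk_size]


utterance = list(range(10))  # stand-in for 10 feature frames

# Simulated streaming: the whole utterance is fed in one call.
simulated = [utterance]

# Streaming: a fixed number of frames is fed per step; the last chunk may be shorter.
streamed = list(iter_chunks(utterance, chunk_size=4))
```

In the streaming case the model must carry its LSTM states from one chunk to the next, which is why the two modes can give slightly different WERs.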


The training command is:

```bash
./lstm_transducer_stateless3/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --start-epoch 1 \
  --exp-dir lstm_transducer_stateless3/exp \
  --full-libri 1 \
  --max-duration 500 \
  --master-port 12325 \
  --num-encoder-layers 12 \
  --grad-norm-threshold 25.0 \
  --rnn-hidden-size 1024
```

The tensorboard log can be found at
<https://tensorboard.dev/experiment/caNPyr5lT8qAl9qKsXEeEQ/>

The simulated streaming decoding command using greedy search, fast beam search, and modified beam search is:
```bash
for decoding_method in greedy_search fast_beam_search modified_beam_search; do
  ./lstm_transducer_stateless3/decode.py \
    --epoch 40 \
    --avg 15 \
    --exp-dir lstm_transducer_stateless3/exp \
    --max-duration 600 \
    --num-encoder-layers 12 \
    --rnn-hidden-size 1024 \
    --decoding-method $decoding_method \
    --use-averaged-model True \
    --beam 4 \
    --max-contexts 4 \
    --max-states 8 \
    --beam-size 4
done
```

The streaming decoding command using greedy search, fast beam search, and modified beam search is:
```bash
for decoding_method in greedy_search fast_beam_search modified_beam_search; do
  ./lstm_transducer_stateless3/streaming_decode.py \
    --epoch 40 \
    --avg 15 \
    --exp-dir lstm_transducer_stateless3/exp \
    --max-duration 600 \
    --num-encoder-layers 12 \
    --rnn-hidden-size 1024 \
    --decoding-method $decoding_method \
    --use-averaged-model True \
    --beam 4 \
    --max-contexts 4 \
    --max-states 8 \
    --beam-size 4
done
```

Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-lstm-transducer-stateless3-2022-09-28>


### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T + multi-dataset)

#### [lstm_transducer_stateless2](./lstm_transducer_stateless2)

See <https://github.com/k2-fsa/icefall/pull/558> for more details.

The WERs are:

@@ -18,6 +106,7 @@ The WERs are:
| modified_beam_search | 2.75 | 7.08 | --iter 472000 --avg 18 |
| fast_beam_search | 2.77 | 7.29 | --iter 472000 --avg 18 |


The training command is:

```bash
@@ -70,15 +159,16 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-lstm-transducer-stateless2-2022-09-03>

#### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T)

[lstm_transducer_stateless](./lstm_transducer_stateless)
### LibriSpeech BPE training results (Pruned Stateless LSTM RNN-T)

#### [lstm_transducer_stateless](./lstm_transducer_stateless)

It implements an LSTM model with mechanisms from the reworked model for streaming ASR.

See <https://github.com/k2-fsa/icefall/pull/479> for more details.

#### training on full librispeech
##### training on full librispeech

This model contains 12 encoder layers (LSTM module + Feedforward module). The number of model parameters is 84689496.

@@ -165,7 +255,7 @@ It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/440> for more details.

#### With lower latency setup, training on full librispeech
##### With lower latency setup, training on full librispeech

In this model, the lengths of chunk and right context are 32 frames (i.e., 0.32s) and 8 frames (i.e., 0.08s), respectively.
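
The durations quoted above imply a 10 ms frame shift (32 frames correspond to 0.32 s). A quick check of that arithmetic, where the frame shift is an assumption inferred from those numbers and `frames_to_seconds` is a hypothetical helper:

```python
FRAME_SHIFT_SECONDS = 0.01  # assumed: inferred from 32 frames == 0.32 s


def frames_to_seconds(num_frames):
    """Convert a frame count to seconds, rounded to the nearest 10 ms."""
    return round(num_frames * FRAME_SHIFT_SECONDS, 2)


chunk_seconds = frames_to_seconds(32)         # chunk length
right_context_seconds = frames_to_seconds(8)  # right-context length
```

The same conversion gives 0.64 s and 0.16 s for the 64-frame chunk and 16-frame right context of the higher-latency setup.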

@@ -316,7 +406,7 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-conv-emformer-transducer-stateless2-2022-07-05>

#### With higher latency setup, training on full librispeech
##### With higher latency setup, training on full librispeech

In this model, the lengths of chunk and right context are 64 frames (i.e., 0.64s) and 16 frames (i.e., 0.16s), respectively.

@@ -851,14 +941,14 @@ Pre-trained models, training and decoding logs, and decoding results are availab

### LibriSpeech BPE training results (Pruned Stateless Conv-Emformer RNN-T)

[conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)
#### [conv_emformer_transducer_stateless](./conv_emformer_transducer_stateless)

It implements [Emformer](https://arxiv.org/abs/2010.10759) augmented with convolution module for streaming ASR.
It is modified from [torchaudio](https://github.com/pytorch/audio).

See <https://github.com/k2-fsa/icefall/pull/389> for more details.

#### Training on full librispeech
##### Training on full librispeech

In this model, the lengths of chunk and right context are 32 frames (i.e., 0.32s) and 8 frames (i.e., 0.08s), respectively.

@@ -1011,7 +1101,7 @@ are available at

### LibriSpeech BPE training results (Pruned Stateless Emformer RNN-T)

[pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)
#### [pruned_stateless_emformer_rnnt2](./pruned_stateless_emformer_rnnt2)

Use <https://github.com/k2-fsa/icefall/pull/390>.

@@ -1079,7 +1169,7 @@ results at:

### LibriSpeech BPE training results (Pruned Stateless Transducer 5)

[pruned_transducer_stateless5](./pruned_transducer_stateless5)
#### [pruned_transducer_stateless5](./pruned_transducer_stateless5)

Same as `Pruned Stateless Transducer 2` but with more layers.

@@ -1092,7 +1182,7 @@ The notations `large` and `medium` below are from the [Conformer](https://arxiv.
paper, where the large model has about 118 M parameters and the medium model
has 30.8 M parameters.

#### Large
##### Large

Number of model parameters 118129516 (i.e., 118.13 M).

@@ -1152,7 +1242,7 @@ results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-2022-07-07>


#### Medium
##### Medium

Number of model parameters 30896748 (i.e., 30.9 M).

@@ -1212,7 +1302,7 @@ results at:
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless5-M-2022-07-07>


#### Baseline-2
##### Baseline-2

It has 88.98 M parameters. Compared to the model in pruned_transducer_stateless2, it has more
layers (24 vs. 12) but is narrower (1536 feedforward dim and 384 encoder dim vs. 2048 feedforward dim and 512 encoder dim).
@@ -1273,13 +1363,13 @@ results at:

### LibriSpeech BPE training results (Pruned Stateless Transducer 4)

[pruned_transducer_stateless4](./pruned_transducer_stateless4)
#### [pruned_transducer_stateless4](./pruned_transducer_stateless4)

This version saves an averaged model during training and decodes with the averaged model.

See <https://github.com/k2-fsa/icefall/issues/337> for details about the idea of model averaging.
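
As a rough sketch of what keeping an averaged model can look like, assuming the average is a plain running mean over parameter snapshots (the linked issue describes the actual scheme, which may weight or schedule updates differently). `update_running_average` is a hypothetical helper and the parameters are toy two-element lists:

```python
def update_running_average(avg_params, params, num_updates):
    """Fold the current parameters into a running mean.

    num_updates counts the snapshots seen so far, including this one.
    """
    weight = 1.0 / num_updates
    return [(1.0 - weight) * a + weight * p for a, p in zip(avg_params, params)]


# Toy parameter snapshots taken at three points during training.
snapshots = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

avg = snapshots[0]
for n, params in enumerate(snapshots[1:], start=2):
    avg = update_running_average(avg, params, n)
```

After the loop, `avg` holds the element-wise mean of the three snapshots without ever storing more than one extra copy of the parameters.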

#### Training on full librispeech
##### Training on full librispeech

See <https://github.com/k2-fsa/icefall/pull/344>

@@ -1355,7 +1445,7 @@ Pretrained models, training logs, decoding logs, and decoding results
are available at
<https://huggingface.co/Zengwei/icefall-asr-librispeech-pruned-transducer-stateless4-2022-06-03>

#### Training on train-clean-100
##### Training on train-clean-100

See <https://github.com/k2-fsa/icefall/pull/344>

@@ -1392,7 +1482,7 @@ The tensorboard log can be found at

### LibriSpeech BPE training results (Pruned Stateless Transducer 3, 2022-04-29)

[pruned_transducer_stateless3](./pruned_transducer_stateless3)
#### [pruned_transducer_stateless3](./pruned_transducer_stateless3)
Same as `Pruned Stateless Transducer 2` but using the XL subset from
[GigaSpeech](https://github.com/SpeechColab/GigaSpeech) as extra training data.

@@ -1606,10 +1696,10 @@ can be found at

### LibriSpeech BPE training results (Pruned Transducer 2)

[pruned_transducer_stateless2](./pruned_transducer_stateless2)
#### [pruned_transducer_stateless2](./pruned_transducer_stateless2)
This is with a reworked version of the conformer encoder, with many changes.

#### Training on fulll librispeech
##### Training on full librispeech

Using commit `34aad74a2c849542dd5f6359c9e6b527e8782fd6`.
See <https://github.com/k2-fsa/icefall/pull/288>
@@ -1658,7 +1748,7 @@ can be found at
<https://huggingface.co/csukuangfj/icefall-asr-librispeech-pruned-transducer-stateless2-2022-04-29>


#### Training on train-clean-100:
##### Training on train-clean-100:

Trained with 1 job:
```
1 change: 1 addition & 0 deletions egs/librispeech/ASR/lstm_transducer_stateless3/__init__.py