
[recipe] LibriSpeech zipformer_ctc #941

Merged (14 commits) on Oct 27, 2023
1 change: 1 addition & 0 deletions egs/librispeech/ASR/README.md
@@ -47,6 +47,7 @@ We place an additional Conv1d layer right after the input embedding layer.
| `conformer-ctc` | Conformer | Use auxiliary attention head |
| `conformer-ctc2` | Reworked Conformer | Use auxiliary attention head |
| `conformer-ctc3` | Reworked Conformer | Streaming version + delay penalty |
| `zipformer-ctc` | Zipformer | Use auxiliary attention head |
| `zipformer` | Upgraded Zipformer | Use auxiliary transducer head | The latest recipe |

# MMI
51 changes: 50 additions & 1 deletion egs/librispeech/ASR/RESULTS.md
@@ -375,6 +375,55 @@ for m in greedy_search modified_beam_search fast_beam_search; do
done
```

### Zipformer CTC

#### [zipformer_ctc](./zipformer_ctc)

See <https://github.com/k2-fsa/icefall/pull/941> for more details.

You can find a pretrained model, training logs, decoding logs, and decoding
results at:
<https://huggingface.co/desh2608/icefall-asr-librispeech-zipformer-ctc>

Number of model parameters: 86083707, i.e., 86.08 M

| decoding method         | test-clean | test-other | comment             |
|-------------------------|------------|------------|---------------------|
| ctc-decoding            | 2.50       | 5.86       | --epoch 30 --avg 9  |
Collaborator:

Could you also post the result for HLG decoding, i.e., one-best decoding?

Collaborator Author:

I am getting the following WERs for 1best:

| 1best                   | 2.01       | 4.61       | --epoch 30 --avg 9  |

This seems much better than other decoding methods. Is it expected?

Collaborator:

I think it is strange that 1best (HLG) is better than whole-lattice-rescoring (HLG + 4-gram G).
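
For context, a rough sketch of the two decoding paths being compared, written against the helpers in icefall's `icefall/decode.py` (`get_lattice`, `one_best_decoding`, `rescore_with_whole_lattice`); `nnet_output`, `supervision_segments`, `HLG`, and `G` are placeholders, and the beam settings are typical values rather than the recipe's exact ones:

```python
from icefall.decode import (
    get_lattice,
    one_best_decoding,
    rescore_with_whole_lattice,
)

# `nnet_output` is the (N, T, C) log-probability output of the CTC head;
# `HLG` and `G` are k2.Fsa graphs built from data/lang_bpe_500 and data/lm.
lattice = get_lattice(
    nnet_output=nnet_output,
    decoding_graph=HLG,
    supervision_segments=supervision_segments,
    search_beam=20,
    output_beam=8,
    min_active_states=30,
    max_active_states=10000,
    subsampling_factor=4,
)

# 1best: take the shortest path through the HLG lattice; no external LM is used.
best_path = one_best_decoding(lattice=lattice, use_double_scores=True)

# whole-lattice-rescoring: compose the same lattice with the 4-gram G and take
# the shortest path for each LM scale; the result is a dict keyed by LM scale.
rescored_paths = rescore_with_whole_lattice(
    lattice=lattice,
    G_with_epsilon_loops=G,
    lm_scale_list=[0.6, 0.7, 0.8],
)
```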

Collaborator Author:

Yeah, I was thinking the same. I'll verify the numbers again.

Collaborator:

@desh2608 It seems that you don't have a parameter to adjust the scale of the HLG decoding graph. Could you please add this parameter like here:

```python
parser.add_argument(
    "--hlg-scale",
    type=float,
    default=0.8,
    help="""The scale to be applied to `hlg.scores`.""",
)
```
I tested your model and I got 2.46/5.36 with hlg_scale=0.5 for 1best decoding.
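
For reference, a minimal sketch of how such a scale is typically applied before decoding; the `HLG.pt` path and the value 0.5 are illustrative, and `scores` is the `k2.Fsa` attribute holding the arc scores:

```python
import torch
import k2

hlg_scale = 0.5  # the reviewer reports 2.46/5.36 for 1best with this value

# Load the HLG decoding graph and scale its arc scores in place before building
# lattices. A smaller scale down-weights the graph (lexicon + LM) relative to
# the acoustic scores produced by the CTC head.
HLG = k2.Fsa.from_dict(torch.load("data/lang_bpe_500/HLG.pt", map_location="cpu"))
HLG.scores *= hlg_scale
```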

Collaborator:

> Yeah, I was thinking the same. I'll verify the numbers again.

Are you able to reproduce it, i.e., the 2.01 WER on test-clean? @desh2608

Collaborator Author:

Sorry, I did not find time to check it. Let me try to do it this week.
@MarcoYang, thanks for the pointer. I'll add it.

Collaborator Author:

BTW, something else that differs in this recipe compared to other LibriSpeech recipes is that I keep cuts shorter than 25 s (instead of 20 s), so that less data is thrown away. With the quadratic_duration option in DynamicBucketingSampler, this seems to work fine (I could train on a V100 with batch size 800).
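
A rough sketch of that data pipeline, assuming lhotse's `CutSet` and `DynamicBucketingSampler`; the manifest path, `num_buckets`, and the exact `max_duration` / `quadratic_duration` values are illustrative rather than the recipe's actual settings:

```python
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

# LibriSpeech training cuts (illustrative manifest path).
cuts_train = CutSet.from_file("data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz")

# Keep utterances up to 25 s (instead of the usual 20 s cap) so that less
# training data is thrown away.
cuts_train = cuts_train.filter(lambda c: 1.0 <= c.duration <= 25.0)

# quadratic_duration adds a quadratic penalty for long cuts when measuring how
# "full" a batch is, so batches containing 20-25 s utterances still fit in memory.
train_sampler = DynamicBucketingSampler(
    cuts_train,
    max_duration=800.0,  # seconds of audio per batch (cf. --max-duration in train.py)
    shuffle=True,
    num_buckets=30,
    quadratic_duration=25.0,
)
```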

| whole-lattice-rescoring | 2.44       | 5.38       | --epoch 30 --avg 9  |
| attention-rescoring     | 2.35       | 5.16       | --epoch 30 --avg 9  |
| 1best                   | 2.01       | 4.61       | --epoch 30 --avg 9  |

The training commands are:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

./zipformer_ctc/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer_ctc/exp \
  --full-libri 1 \
  --max-duration 1000 \
  --master-port 12345
```

The tensorboard log can be found at:
<https://tensorboard.dev/experiment/IjPSJjHOQFKPYA5Z0Vf8wg>

The decoding command is:

```bash
./zipformer_ctc/decode.py \
  --epoch 30 \
  --avg 9 \
  --use-averaged-model True \
  --exp-dir zipformer_ctc/exp \
  --lang-dir data/lang_bpe_500 \
  --lm-dir data/lm \
  --method ctc-decoding
```

### pruned_transducer_stateless7 (Fine-tune with mux)

See <https://github.com/k2-fsa/icefall/pull/1059> for more details.
@@ -616,7 +665,6 @@ for m in greedy_search modified_beam_search fast_beam_search; do
done
```


#### Smaller model

We also provide a very small version (only 6.1M parameters) of this setup. The training command for the small model is:
@@ -663,6 +711,7 @@ This small model achieves the following WERs on GigaSpeech test and dev sets:

You can find the tensorboard logs at <https://tensorboard.dev/experiment/tAc5iXxTQrCQxky5O5OLyw/#scalars>.


### Streaming Zipformer-Transducer (Pruned Stateless Transducer + Streaming Zipformer)

#### [pruned_transducer_stateless7_streaming](./pruned_transducer_stateless7_streaming)
1 change: 1 addition & 0 deletions egs/librispeech/ASR/zipformer_ctc/asr_datamodule.py