[recipe] LibriSpeech zipformer_ctc #941

Merged: 14 commits merged into k2-fsa:master on Oct 27, 2023

Conversation

desh2608 (Collaborator) commented Mar 9, 2023

I trained a zipformer-based CTC model (with an auxiliary attention head) on LibriSpeech. The following are the results on test-clean/test-other.

| decoding method | test-clean | test-other | comment |
|-------------------------|------------|------------|---------------------|
| ctc-decoding | 2.50 | 5.86 | --epoch 30 --avg 9 |
| whole-lattice-rescoring | 2.44 | 5.38 | --epoch 30 --avg 9 |
| attention-rescoring | 2.35 | 5.16 | --epoch 30 --avg 9 |

Tensorboard: https://tensorboard.dev/experiment/IjPSJjHOQFKPYA5Z0Vf8wg
Pretrained model: https://huggingface.co/desh2608/icefall-asr-librispeech-zipformer-ctc

SOLVED

I am having some trouble with the other decoding methods. I created G.fst.txt by first downloading the 4-gram.arpa.gz file, unzipping it, and then running the following:

python3 -m kaldilm \
  --read-symbol-table="data/lang_bpe_500/tokens.txt" \
  --disambig-symbol='#0' \
  --max-order=4 \
  data/lm/4-gram.arpa > data/lang_bpe_500/G_4_gram.fst.txt

The G.pt should get created inside decode.py. But during decoding, I get the following AssertionError:

  File "zipformer_ctc_att/decode.py", line 556, in decode_dataset
    hyps_dict = decode_one_batch(
  File "zipformer_ctc_att/decode.py", line 440, in decode_one_batch
    best_path_dict = rescore_with_whole_lattice(
  File "/exp/draj/mini_scale_2022/icefall/icefall/decode.py", line 858, in rescore_with_whole_lattice
    assert G_with_epsilon_loops.shape == (1, None, None)

I am guessing I did something wrong when creating G.pt. I would appreciate it if someone could help with this.
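
For anyone hitting the same assertion, here is a minimal, hypothetical way to inspect a cached G and see whether it was built from words.txt or tokens.txt. The cache path and file name below are assumptions based on the directory layout used above, not something taken from the recipe.

```python
# Hypothetical sanity check (not part of the recipe): load a cached G and
# compare its label range against the symbol tables.
import k2
import torch

G = k2.Fsa.from_dict(torch.load("data/lm/G_4_gram.pt", map_location="cpu"))
print("shape:", G.shape)  # a single FSA has shape (num_states, None)

with open("data/lang_bpe_500/words.txt") as f:
    num_words = sum(1 for _ in f)
with open("data/lang_bpe_500/tokens.txt") as f:
    num_tokens = sum(1 for _ in f)

max_label = int(G.labels.max())
print(f"max label: {max_label}, |words.txt|: {num_words}, |tokens.txt|: {num_tokens}")
# A word-level G should use label ids up to roughly |words.txt|; a maximum
# bounded by |tokens.txt| (about 500 here) would suggest it was built from
# tokens.txt instead.
```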

ezerhouni (Collaborator)

@desh2608 I'm not sure, but it looks like your G is a token n-gram while rescore_with_whole_lattice expects a word n-gram. Could that be the case?

desh2608 (Collaborator Author) commented Mar 10, 2023

> @desh2608 I'm not sure, but it looks like your G is a token n-gram while rescore_with_whole_lattice expects a word n-gram. Could that be the case?

Ahh, of course. I should pass words.txt for the symbol table. Thanks!

Update: Actually, looking back at my command history, I see that I did use words.txt (not tokens.txt) to create G.fst.txt.

desh2608 (Collaborator Author)

It turns out that I had the wrong G.pt in my lang directory, so the correct G_4_gram.fst.txt was not being used. Here are the steps in case someone is interested.

1. Download and extract 3-gram.pruned.1e-7.arpa.gz and 4-gram.arpa.gz from https://openslr.org/11/ into data/lm.

2. Prepare G_3_gram.fst.txt and G_4_gram.fst.txt as follows:

   python -m kaldilm \
     --read-symbol-table="data/lang_bpe_500/words.txt" \
     --disambig-symbol="#0" \
     --max-order=3 \
     data/lm/3-gram.pruned.1e-7.arpa > data/lm/G_3_gram.fst.txt

   python -m kaldilm \
     --read-symbol-table="data/lang_bpe_500/words.txt" \
     --disambig-symbol="#0" \
     --max-order=4 \
     data/lm/4-gram.arpa > data/lm/G_4_gram.fst.txt

3. Compile HLG using the pruned 3-gram:

   python local/compile_hlg.py --lm G_3_gram --lang-dir data/lang_bpe_500

Now run decode.py. The G.pt gets created from data/lm/G_4_gram.fst.txt inside the decode script, so it doesn't have to be created in advance.
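
For context, the caching step inside the decode script looks roughly like the following simplified sketch. This assumes the recipe follows the usual icefall LibriSpeech pattern; the actual zipformer_ctc_att/decode.py does more (for example, mapping disambiguation symbols such as #0 to epsilon and moving G to the right device).

```python
# Simplified sketch of the G.pt caching step, assuming the common icefall
# pattern; details in the real decode script may differ.
import os

import k2
import torch

lm_dir = "data/lm"
cache = os.path.join(lm_dir, "G_4_gram.pt")

if not os.path.isfile(cache):
    with open(os.path.join(lm_dir, "G_4_gram.fst.txt")) as f:
        G = k2.Fsa.from_openfst(f.read(), acceptor=False)
    del G.aux_labels  # output labels are not needed for LM rescoring
    G = k2.arc_sort(G)
    torch.save(G.as_dict(), cache)
else:
    G = k2.Fsa.from_dict(torch.load(cache, map_location="cpu"))

# For whole-lattice rescoring, epsilon self-loops are added so that G can be
# composed with the decoding lattice.
G = k2.add_epsilon_self_loops(G)
G = k2.arc_sort(G)
```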

desh2608 added the ready label on Mar 11, 2023
desh2608 (Collaborator Author)

@csukuangfj please review when you have some time.

yfyeung requested a review from pkufool on March 17, 2023

| decoding method | test-clean | test-other | comment |
|-------------------------|------------|------------|---------------------|
| ctc-decoding | 2.50 | 5.86 | --epoch 30 --avg 9 |
Collaborator

Could you also post the result for HLG decoding, i.e., one-best decoding?

Collaborator Author

I am getting the following WERs for 1best:

| decoding method | test-clean | test-other | comment |
|-------------------------|------------|------------|---------------------|
| 1best                   | 2.01       | 4.61       | --epoch 30 --avg 9  |

This seems much better than the other decoding methods. Is that expected?

Collaborator

I think it is strange that 1best (HLG) is better than whole-lattice-rescoring (HLG + 4-gram G).

Collaborator Author

Yeah, I was thinking the same. I'll verify the numbers again.

Collaborator

@desh2608 It seems that you don't have a parameter to adjust the scale of the HLG decoding graph. Could you please add such a parameter, e.g.:

parser.add_argument(
    "--hlg-scale",
    type=float,
    default=0.8,
    help="""The scale to be applied to `hlg.scores`.""",
)

I tested your model and I got 2.46/5.36 with hlg_scale=0.5 for 1best decoding.
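
For context, applying such a flag typically amounts to scaling the graph scores right after HLG is loaded. The sketch below is an assumption based on how other icefall recipes do it, not code quoted from this PR, and the HLG.pt path is the conventional one.

```python
# Minimal sketch (assumed pattern, not taken from this PR): scale HLG.scores
# right after loading the decoding graph.
import k2
import torch

hlg_scale = 0.5  # e.g. the value that gave 2.46/5.36 above
device = torch.device("cpu")

HLG = k2.Fsa.from_dict(
    torch.load("data/lang_bpe_500/HLG.pt", map_location=device)
)
# Scaling HLG.scores down reduces the weight of the LM/graph scores relative
# to the CTC acoustic scores during 1best decoding.
HLG.scores *= hlg_scale
```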

Collaborator

> Yeah, I was thinking the same. I'll verify the numbers again.

@desh2608 Are you able to reproduce it, i.e., the WER of 2.01 on test-clean?

Collaborator Author

Sorry, I did not find time to check it. Let me try to do it this week.
@MarcoYang thanks for the pointer. I'll add it.

Collaborator Author

BTW, something else that is different in this recipe compared to the other LibriSpeech recipes is that I keep cuts shorter than 25 s (instead of 20 s), to avoid throwing away as much data. With the quadratic_duration option in DynamicBucketingSampler, this seems to be working fine (I could train on a V100 with batch size 800).
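
For illustration, here is a rough sketch of the setup described above. The cuts path, the quadratic_duration value, and the interpretation of "batch size 800" as max_duration=800 seconds are assumptions, not the recipe's exact settings.

```python
# Rough sketch of the data pipeline change described above; paths and values
# are illustrative assumptions.
from lhotse import CutSet
from lhotse.dataset import DynamicBucketingSampler

cuts = CutSet.from_file("data/fbank/librispeech_cuts_train-all-shuf.jsonl.gz")
# Keep cuts up to 25 s (instead of the usual 20 s) so less data is discarded.
cuts = cuts.filter(lambda c: 1.0 <= c.duration <= 25.0)

sampler = DynamicBucketingSampler(
    cuts,
    max_duration=800.0,       # seconds of audio per batch
    quadratic_duration=25.0,  # penalize long cuts quadratically when sizing batches
    num_buckets=30,
    shuffle=True,
    drop_last=True,
)
```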

Address comments from @csukuangfj
JinZr merged commit 7d56685 into k2-fsa:master on Oct 27, 2023 (3 checks passed)
armusc (Contributor) commented Oct 27, 2023

Hi,
looking at this conversation after the merge: were the numbers from 1best decoding confirmed in the end?
Thanks

JinZr (Collaborator) commented Oct 27, 2023 via email
