Adding ILM beam search and decoding #1291
base: master
Conversation
@AmirHussein96 if you have some time, you can try out the experiment suggested by @marcoyang1998: #1271 (comment). @marcoyang1998 do you have an RNNLM trained on GigaSpeech? |
Yeah, I have an RNNLM trained on GigaSpeech, but not in icefall style: https://huggingface.co/yfyeung/icefall-asr-gigaspeech-rnn_lm-2023-10-08 |
@AmirHussein96 I note that you modified |
check this: k2-fsa/k2#1244 |
I conducted benchmarking on the following scenario:
Choice of ILM/LODR and RNNLM weights: the configuration for the RNNLM and the training command are as follows:
RNNLM results on dev: |
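For illustration, an icefall-style RNN-LM training invocation might look roughly like the sketch below. This is not the author's actual command: the script path, data path, and flag values are assumptions chosen only to be consistent with the decoding flags used later in this thread (512-dim embedding and hidden state, 2 layers, vocab size 500).

```bash
# Hypothetical sketch of an icefall-style RNN-LM training run; paths and
# hyper-parameter values are assumptions, not the command actually used.
./rnn_lm/train.py \
  --world-size 4 \
  --exp-dir rnn_lm/exp \
  --num-epochs 30 \
  --use-fp16 1 \
  --embedding-dim 512 \
  --hidden-dim 512 \
  --num-layers 2 \
  --vocab-size 500 \
  --lm-data data/lm_training_bpe_500/sorted_lm_data.pt   # assumed data path
```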
@AmirHussein96 I noticed that you are using a positive scale for LODR; it should be negative. You can check the code here: icefall/egs/librispeech/ASR/pruned_transducer_stateless2/beam_search.py, lines 2629 to 2634 in 9af144c.
Would you mind re-running the decoding experiment with LODR? Thanks! |
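As a minimal illustration of why the sign matters (this is not icefall's beam_search.py; the function and variable names below are made up), the per-token score in LODR-style shallow fusion combines the transducer output, the target-domain neural LM, and the source-domain bigram, and the bigram term must carry a negative weight so that the source-domain/internal LM bias is subtracted rather than reinforced:

```python
import math

def combined_token_score(asr_logp: float,
                         nnlm_logp: float,
                         bigram_logp: float,
                         lm_scale: float = 0.45,
                         lodr_scale: float = -0.24) -> float:
    """Illustrative per-token score for LODR-style shallow fusion.

    asr_logp    : log-prob of the token under the transducer
    nnlm_logp   : log-prob under the target-domain neural LM (added)
    bigram_logp : log-prob under the source-domain bigram (subtracted,
                  so lodr_scale must be negative, e.g. -0.24)
    """
    return asr_logp + lm_scale * nnlm_logp + lodr_scale * bigram_logp

# A token strongly favoured by the source-domain bigram is penalised:
print(combined_token_score(asr_logp=math.log(0.6),
                           nnlm_logp=math.log(0.3),
                           bigram_logp=math.log(0.7)))
```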
@marcoyang1998 I used the implementation of
|
@marcoyang1998 I tried the modified_beam_search_LODR with LODR_scale=-0.24 from https://k2-fsa.github.io/icefall/decoding-with-langugage-models/LODR.html and also LODR_scale=-0.16 from my best modified_beam_search_lm_rescore_LODR() results.

| method | beam | LM scale | ILM / LODR scale | giga dev | giga test |
| --- | --- | --- | --- | --- | --- |
| modified_beam_search (baseline) | 4 | 0 | 0 | 20.81 | 19.95 |
| RNNLM SF | 4 | 0.1 | 0 | 20.3 | 19.55 |
| RNNLM SF | 4 | 0.29 | 0 | 19.88 | 19.21 |
| RNNLM SF | 4 | 0.45 | 0 | 20.1 | 19.46 |
| RNNLM SF | 12 | 0.29 | 0 | **19.77** | **19.01** |
| RNNLM lm_rescore_LODR (bigram) | 4 | 0.45 | 0.16 | 20.42 | 19.6 |
| RNNLM LODR (bigram) | 4 | 0.45 | -0.24 | 19.38 | 18.71 |
| RNNLM LODR (bigram) | 4 | 0.45 | -0.16 | 19.47 | 18.85 |
| RNNLM LODR (bigram) | 12 | 0.45 | -0.24 | **19.1** | **18.44** |
| RNNLM SF - ILME | 4 | 0.29 | 0.1 | 19.7 | 18.96 |
| RNNLM SF - ILME | 4 | 0.45 | 0.1 | 19.54 | 18.89 |
| RNNLM SF - ILME | 4 | 0.29 | 0.2 | 19.84 | 18.99 |
| RNNLM SF - ILME | 12 | 0.45 | 0.1 | **19.21** | **18.57** |

The LODR results now are much better, so I think modified_beam_search_lm_rescore_LODR() should be removed from beam_search.py. The decoding command is below:

for method in modified_beam_search_LODR; do
  ./zipformer_hat/decode.py \
    --epoch 40 --avg 16 --use-averaged-model True \
    --beam-size 4 \
    --exp-dir ./zipformer_hat/exp \
    --bpe-model data/lang_bpe_500/bpe.model \
    --max-contexts 4 \
    --max-states 8 \
    --max-duration 800 \
    --decoding-method $method \
    --use-shallow-fusion 1 \
    --lm-type rnn \
    --lm-exp-dir rnn_lm/exp \
    --lm-epoch 25 \
    --lm-scale 0.45 \
    --lm-avg 5 \
    --lm-vocab-size 500 \
    --rnn-lm-embedding-dim 512 \
    --rnn-lm-hidden-dim 512 \
    --rnn-lm-num-layers 2 \
    --tokens-ngram 2 \
    --ngram-lm-scale $LODR_scale
done
|
Please have a look at #1017 and https://icefall.readthedocs.io/en/latest/decoding-with-langugage-models/index.html for a comparison between different decoding methods with language models.
LODR works in both shallow fusion and rescoring. |
Hi, sorry to step into this conversation. I have a question regarding the LM: is there any motivation for preferring an RNNLM over a Transformer-based LM for these experiments? Thanks. |
The primary reason for choosing RNN-LM is its computational efficiency and suitability for streaming applications. Additionally, the improvement from using a Transformer-LM compared to RNN-LM for rescoring is minimal. |
@marcoyang1998, you can check the updated table with beam 12. The results in the updated table show very close performance, with slight improvements in LODR over ILME. These results align with the findings presented in the LODR paper: https://arxiv.org/pdf/2203.16776.pdf. Additionally, I conducted an MPSSWE statistical test, which indicates that there is no statistically significant difference between LODR and ILME. Pairwise p-values:

|  | baseline | RNNLM SF | LODR | ILME |
| --- | --- | --- | --- | --- |
| RNNLM SF | <0.001 | - | <0.001 | <0.001 |
| LODR | <0.001 | <0.001 | - | 1 |
| ILME | <0.001 | <0.001 | 1 | - |
|
Great work!
Perhaps we can put a note saying that the RNNLM rescoring of paths is not normally recommended, and instead direct people to the appropriate method.
Did you see any difference between zipformer with normal RNN-T and zipformer-HAT?
|
Yes, we compared zipformer with zipformer-HAT using greedy and modified beam search, and the performance is almost the same. |
Please let me know if any modifications are needed to finalize the merging of the pull request. |
@AmirHussein96 this needs the k2 PR (k2-fsa/k2#1244) to be merged first. @csukuangfj besides ILM, I am also using HAT for joint speaker diarization (with my SURT model), and Amir is using it for joint language ID in code-switched ASR. We will make PRs for those recipes in the coming months, but it would be great to have these ones checked in first. |
@marcoyang1998 Could you have a look at this PR? |
export CUDA_VISIBLE_DEVICES="0,1,2,3"

# For non-streaming model training:
./zipformer/train.py \
Please update the recipe name.
Done
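For reference, the corrected excerpt would presumably just point at the new recipe directory; the flag values below are illustrative assumptions rather than the PR's exact text:

```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"

# For non-streaming HAT model training (flag values are placeholders):
./zipformer_hat/train.py \
  --world-size 4 \
  --num-epochs 40 \
  --exp-dir zipformer_hat/exp \
  --max-duration 800
```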
Could you please add a section about HAT (WERs, training command, decoding command etc.) in RESULTS.md? |
I had a glance and left a few comments. The rest looked fine, thanks for the work! Would you mind uploading your HAT model to huggingface so that other people can try it? |
@AmirHussein96 if you have some time, can we make a final push to get this checked in? |
Done |
This is a Librispeech zipformer recipe using the HAT loss from k2-fsa/k2#1244. The recipe includes HAT training, greedy decoding, modified beam search decoding, and ILM subtraction with RNN-LM shallow fusion.
So far, @desh2608 and I have tested this on Librispeech, and the results are similar to regular RNN-LM shallow fusion. However, the intended use of this is adaptation to a new domain with an external RNN-LM trained on that domain.
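For readers new to HAT, here is a minimal, self-contained sketch (PyTorch, with made-up tensor names and a toy joiner; it is not the PR's actual code) of the two ideas the recipe combines: the HAT factorization of blank vs. label probabilities, and estimating the internal LM by running the joiner with the acoustic contribution removed so it can be subtracted during shallow fusion with an external RNN-LM:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, dim = 500, 512  # illustrative sizes; blank is index 0

# Toy joiner standing in for the real one: combine acoustic (am) and label
# (lm) representations, then project to the vocabulary.
joiner = torch.nn.Sequential(torch.nn.Tanh(), torch.nn.Linear(dim, vocab_size))

def hat_log_probs(am: torch.Tensor, lm: torch.Tensor) -> torch.Tensor:
    """HAT factorization: blank probability via a sigmoid, label probabilities
    via a softmax over non-blank symbols, scaled by (1 - p_blank)."""
    logits = joiner(am + lm)
    blank_logit = logits[..., :1]
    log_p_blank = F.logsigmoid(blank_logit)
    log_p_labels = F.logsigmoid(-blank_logit) + logits[..., 1:].log_softmax(-1)
    return torch.cat([log_p_blank, log_p_labels], dim=-1)

def ilm_log_probs(lm: torch.Tensor) -> torch.Tensor:
    """Internal LM estimate: run the joiner on the label side only
    (acoustic contribution treated as zero) and renormalise over labels."""
    logits = joiner(lm)
    return logits[..., 1:].log_softmax(-1)

am, lm = torch.randn(1, dim), torch.randn(1, dim)
p_hat = hat_log_probs(am, lm)  # (1, vocab_size), log-probs incl. blank
p_ilm = ilm_log_probs(lm)      # (1, vocab_size - 1), label log-probs only

# During shallow-fusion decoding, a non-blank token y is then scored roughly as
#   log p_HAT(y | x) + lm_scale * log p_RNNLM(y) - ilm_scale * log p_ILM(y)
```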