`am-scale` and `lm-scale` for "Simple RNNT" loss smoothing #1494

YuriiMytiai · 2024-02-07T09:29:40Z

YuriiMytiai
Feb 7, 2024

There're two hyperparameters for loss smoothing used in https://github.com/k2-fsa/icefall/blob/master/egs/librispeech/ASR/pruned_transducer_stateless2/train.py
which are called am-scale and lm-scale
I'm curious if any research has been conducted on how these parameters impact the final model accuracy and the rationale behind selecting those default values.
I wasn't able to find any discussion about this.
Thanks for answer!

Answered by pkufool

Feb 13, 2024

I found some results on our weekly report, I think these results are based on our first version of conformer (not the reworked one). The basic conclusion is, if am_scale greater than 0, the results get worse, lm_scale helps to improve the performance and also helps to make modified_beam_search work better (max_symbol_per_frame=1), see our paper, simple_loss_scale also helps to improve the performance (as some kind of regularization I think). We did not tune these values a lot, here are some previous results:

About simple_loss_scale :

best WER for (greedy search)	baseline test-clean \|\| test-other max-duration = 200 4 GPUs	k2 pruned loss (s_range=8) test-clean \|\| test-other max-duratio…

View full answer

JinZr · 2024-02-08T08:25:34Z

JinZr
Feb 8, 2024
Maintainer

We haven't tuned these two parameters on the latest Zipformer model, but I think @pkufool conducted some experiments on some early versions of the model. 🧐

3 replies

YuriiMytiai Feb 8, 2024
Author

Thanks for the reply
Actually I'm interested in Conformer-based model, so @pkufool please share some information about this if you have.
Thank you in advance!

pkufool Feb 13, 2024
Maintainer

I found some results on our weekly report, I think these results are based on our first version of conformer (not the reworked one). The basic conclusion is, if am_scale greater than 0, the results get worse, lm_scale helps to improve the performance and also helps to make modified_beam_search work better (max_symbol_per_frame=1), see our paper, simple_loss_scale also helps to improve the performance (as some kind of regularization I think). We did not tune these values a lot, here are some previous results:

About simple_loss_scale :

best WER for (greedy search)	baseline test-clean \|\| test-other max-duration = 200 4 GPUs	k2 pruned loss (s_range=8) test-clean \|\| test-other max-duration = 300 4 GPUs loss = simple + pruned	k2 pruned loss (s_range=8) test-clean \|\| test-other max-duration = 300 4 GPUs loss = 0.01 * simple + pruned (first 3 epochs trained with simple + pruned)	k2 pruned loss (s_range=8) test-clean \|\| test-other max-duration = 300 4 GPUs loss = 0.5 * simple + pruned
epoch 15	3.12 \|\| 7.95	3.32 \|\| 8.35		3.0 \|\| 7.96
epoch 16	3.06 \|\| 7.99	3.29 \|\| 8.23
epoch 17	3.07 \|\| 7.86	3.22 \|\| 8.02		2.96 \|\| 7.71
epoch 18	3.02 \|\| 7.76	3.19 \|\| 8.0
epoch 19	3.0 \|\| 7.69	3.15 \|\| 7.82	3.18 \|\| 7.86	2.95 \|\| 7.52
epoch 20	2.99 \|\| 7.61	3.09 \|\| 7.82
epoch 21	2.98 \|\| 7.51		3.13 \|\| 7.79	2.91 \|\| 7.32
epoch 22	2.95 \|\| 7.5	3.0 \|\| 7.67	3.12 \|\| 7.77
epoch 23	2.93 \|\| 7.43	3.0 \|\| 7.66	3.1 \|\| 7.72	2.87 \|\| 7.26
epoch 24	2.92 \|\| 7.38	2.98 \|\| 7.59	3.1 \|\| 7.68
epoch 25	2.91 \|\| 7.37		3.09 \|\| 7.66	2.86 \|\| 7.18
epoch 26	2.9 \|\| 7.34	2.98 \|\| 7.47
epoch 27	2.87 \|\| 7.29	3.0 \|\| 7.55	3.09 \|\| 7.59	2.84 \|\| 7.14
epoch 28	2.87 \|\| 7.27	2.97 \|\| 7.43
epoch 29	2.85 \|\| 7.28	2.98 \|\| 7.39	3.05 \|\| 7.59	2.81 \|\| 7.06
epoch 30	2.85 \|\| 7.29			2.82 \|\| 7.02
epoch 31	2.85 \|\| 7.26	2.94 \|\| 7.33	3.05 \|\| 7.52	2.79 \|\| 7.03
epoch 32	2.85 \|\| 7.27	2.94 \|\| 7.33		2.83 \|\| 7.03
epoch 33	2.85 \|\| 7.3		3.04 \|\| 7.48	2.82 \|\| 7.01
epoch 34		2.94 \|\| 7.3		2.81 \|\| 7.14
epoch 36		2.94 \|\| 7.27

About lm_scale :

best WER for (greedy search)	baseline (torch rnnt loss) test-clean \|\| test-other max-duration = 200 4 GPUs	(k2 pruned loss range 8) test-clean \|\| test-other max-duration = 300 4 GPUs	(k2 pruned loss range 8) lm_only_scale=0.25 am_only_scale=0.0 test-clean \|\| test-other max-duration = 300 4 GPUs	(k2 pruned loss range 8) lm_only_scale=0.1 am_only_scale=0.0 test-clean \|\| test-other max-duration = 300 3 GPUs	(k2 pruned loss range 8) lm_only_scale=0.5 am_only_scale=0.0 test-clean \|\| test-other max-duration = 300 3 GPUs
epoch 19	3.0 \|\| 7.69	3.15 \|\| 7.82	3.01 \|\| 7.79		2.96 \|\| 7.31
epoch 20	2.99 \|\| 7.61	3.09 \|\| 7.82	2.97 \|\| 7.72	3.04 \|\| 7.93
epoch 21	2.98 \|\| 7.51		3.0 \|\| 7.73		3.01 \|\| 7.15
epoch 22	2.95 \|\| 7.5	3.0 \|\| 7.67	2.99 \|\| 7.59	2.98 \|\| 7.73
epoch 23	2.93 \|\| 7.43	3.0 \|\| 7.66	2.95 \|\| 7.59		2.97 \|\| 7.03
epoch 24	2.92 \|\| 7.38	2.98 \|\| 7.59	2.96 \|\| 7.5	2.97 \|\| 7.7
epoch 25	2.91 \|\| 7.37		2.97 \|\| 7.37		2.98 \|\| 6.95
epoch 26	2.9 \|\| 7.34	2.98 \|\| 7.47	2.92 \|\| 7.33	2.94 \|\| 7.67
epoch 27	2.87 \|\| 7.29	3.0 \|\| 7.55	2.92 \|\| 7.26		2.93 \|\| 7.02
epoch 28	2.87 \|\| 7.27	2.97 \|\| 7.43	2.89 \|\| 7.18	2.9 \|\| 7.58
epoch 29	2.85 \|\| 7.28	2.98 \|\| 7.39	2.88 \|\| 7.22		2.9 \|\| 6.99
epoch 30	2.85 \|\| 7.29		2.89 \|\| 7.07	2.87 \|\| 7.52
epoch 31	2.85 \|\| 7.26	2.94 \|\| 7.33	2.89 \|\| 6.97
epoch 32	2.85 \|\| 7.27	2.94 \|\| 7.33	2.86 \|\| 6.94	2.85 \|\| 7.42
epoch 33	2.85 \|\| 7.3
epoch 34		2.94 \|\| 7.3	2.86 \|\| 6.88	2.85 \|\| 7.36
epoch 35			2.84 \|\| 6.87
epoch 36		2.94 \|\| 7.27	2.84 \|\| 6.87	2.85 \|\| 7.35
epoch 37			2.81 \|\| 6.86
epoch 38			2.81 \|\| 6.84	2.85 \|\| 7.32
epoch 39			2.8 \|\| 6.85
epoch 40			2.77 \|\| 6.92	2.85 \|\| 7.3	2.75 \|\| 6.69
epoch 41			2.76 \|\| 6.91		2.74 \|\| 6.69
epoch 42			2.75 \|\| 6.95	2.85 \|\| 7.33	2.75 \|\| 6.7
epoch 43			2.75 \|\| 6.87		2.74 \|\| 6.7
epoch 44		2.9 \|\| 7.2	2.75 \|\| 6.85	2.84 \|\| 7.37	2.74 \|\| 6.68
epoch 45		2.89 \|\| 7.16	2.74 \|\| 6.84		2.72 \|\| 6.67
epoch 46			2.73 \|\| 6.83	2.81 \|\| 7.3	2.71 \|\| 6.65
epoch 47			2.73 \|\| 6.82		2.72 \|\| 6.65
epoch 48			2.73 \|\| 6.84	2.81 \|\| 7.36	2.72 \|\| 6.63
epoch 49			2.73 \|\| 6.78		2.74 \|\| 6.63

Final results of above table:

	(k2 pruned loss range 8) test-clean \|\| test-other max-duration = 300 4 GPUs	(k2 pruned loss range 8) lm_only_scale=0.1 am_only_scale=0.0 test-clean \|\| test-other max-duration = 300 3 GPUs	(k2 pruned loss range 8) lm_only_scale=0.25 am_only_scale=0.0 test-clean \|\| test-other max-duration = 300 4 GPUs	(k2 pruned loss range 8) lm_only_scale=0.5 am_only_scale=0.0 test-clean \|\| test-other max-duration = 300 3 GPUs
best WERs	2.89 \|\| 7.16	2.81 \|\| 7.3	2.73 \|\| 6.78	2.71 \|\| 6.65

About am_scale (libir-100):

this is just one of the results, in my previous exps, positive am_scale always hurts the performance ( I did not try negative values).

	test-clean	test-other
lm_only_scale = 0.25	2.81	6.84	epoch 38
am_only_scale = 0.25	2.98	7.91	epoch 32
without smoothing	2.89	7.16	epoch 45
torch rnnt loss	2.85	7.3	epoch 33

@YuriiMytiai hope these tables will help you. If you tune these value by yourself, it will be great if you can share the results with us, thank you!

Answer selected by JinZr

YuriiMytiai Feb 13, 2024
Author

Cool, thank you!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`am-scale` and `lm-scale` for "Simple RNNT" loss smoothing #1494

{{title}}

Replies: 1 comment 3 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

am-scale and lm-scale for "Simple RNNT" loss smoothing #1494

YuriiMytiai Feb 7, 2024

Replies: 1 comment · 3 replies

JinZr Feb 8, 2024 Maintainer

YuriiMytiai Feb 8, 2024 Author

pkufool Feb 13, 2024 Maintainer

YuriiMytiai Feb 13, 2024 Author

`am-scale` and `lm-scale` for "Simple RNNT" loss smoothing #1494

YuriiMytiai
Feb 7, 2024

Replies: 1 comment 3 replies

JinZr
Feb 8, 2024
Maintainer

YuriiMytiai Feb 8, 2024
Author

pkufool Feb 13, 2024
Maintainer

YuriiMytiai Feb 13, 2024
Author