
Ngram lm fusion for RNNT maes decoding #6118

Merged: 32 commits merged into NVIDIA:main on Mar 14, 2023

Conversation

andrusenkoau (Collaborator)

What does this PR do?

Ngram LM fusion for RNNT modified adaptive expansion search (maes) decoding.

Collection: [ASR]

Changelog

  • Add ngram_lm option to maes decoding algorithm
  • Add EncDecRNNTBPEModel and EncDecRNNTModel to kenlm_utils.py for ngram lm building
  • Add a new file, eval_beamsearch_ngram_transducer.py (based on Vahid's old branch for ngram RNNT beam search), for testing ngram RNNT decoding

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

andrusenkoau and others added 8 commits February 24, 2023 07:21
words_count = 0
chars_count = 0
if preds_output_file:
    out_file = open(preds_output_file, 'w')

Check warning (Code scanning / CodeQL): File is not always closed

File may not be closed if an exception is raised.
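A context manager is the usual fix for this class of warning; a minimal self-contained sketch (the path and the write call are placeholders, not the script's actual logic):

import contextlib

preds_output_file = 'preds.tsv'  # placeholder path; a falsy value disables writing

with contextlib.ExitStack() as stack:
    out_file = None
    if preds_output_file:
        # enter_context guarantees the file is closed when the block exits,
        # even if an exception is raised inside it
        out_file = stack.enter_context(open(preds_output_file, 'w', encoding='utf-8'))
    if out_file is not None:
        out_file.write('example hypothesis\n')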
@titu1994 (Collaborator) left a comment:

Looks overall good, needs minor changes

@@ -1122,6 +1159,8 @@ def modified_adaptive_expansion_search(
                timestep=hyp.timestep[:],
                length=t,
            )
            if self.ngram_lm:
                new_hyp.ngram_lm_state = hyp.ngram_lm_state.__deepcopy__()
Collaborator:

Do you need deepcopy? It's very expensive for large objects.

Collaborator Author:

You are right, deepcopy is redundant here.
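A sketch of the resolved version, assuming the LM state is only read during scoring and never mutated in place (the scoring code below writes into a fresh kenlm.State rather than modifying the old one):

if self.ngram_lm:
    # sharing the reference is enough; scoring produces a new
    # kenlm.State instead of mutating this one, so no deepcopy
    new_hyp.ngram_lm_state = hyp.ngram_lm_state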

@@ -213,6 +227,9 @@ def __init__(
        language_model: Optional[Dict[str, Any]] = None,
        softmax_temperature: float = 1.0,
        preserve_alignments: bool = False,
        ngram_lm_model: Optional[str] = None,
        ngram_lm_alpha: float = 0.0,
        tokens_type: str = "subword",
Collaborator:

Do not require the user to provide the token type. Determine it from the n-gram LM or from the model type if possible.

Collaborator Author:

OK, I redid it as in ctc_beam_decoding.py, via a set_decoding_type function.
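A sketch of that approach, mirroring the snippet from ctc_beam_decoding.py quoted later in this thread (DEFAULT_TOKEN_OFFSET stands in for the hardcoded 100 that the PR later extracts into a constant):

def set_decoding_type(self, decoding_type: str):
    # the token offset is only needed for BPE-based (subword) models;
    # char models score labels without shifting
    if decoding_type == 'subword':
        self.token_offset = DEFAULT_TOKEN_OFFSET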

                hyp_i.score + float(logp[hyp_j.y_sequence[pref_id]]) + self.ngram_lm_alpha * lm_score
            )
        else:
            curr_score = hyp_i.score + float(logp[hyp_j.y_sequence[pref_id]])
Collaborator:

Why not keep the original code and just add an if branch after it to add the LM score?

Collaborator Author:

Sounds reasonable, thanks)
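The suggested shape, sketched with the names from the diffs in this PR (compute_ngram_score and ngram_lm_state both appear in the changes above):

curr_score = hyp_i.score + float(logp[hyp_j.y_sequence[pref_id]])
if self.ngram_lm:
    # add the weighted LM term only when an n-gram LM is attached
    lm_score, next_state = self.compute_ngram_score(
        hyp_i.ngram_lm_state, int(hyp_j.y_sequence[pref_id])
    )
    curr_score += self.ngram_lm_alpha * lm_score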

            lm_score, next_state = self.compute_ngram_score(next_state, int(hyp_j.y_sequence[k + 1]))
            curr_score += float(logp[hyp_j.y_sequence[k + 1]]) + self.ngram_lm_alpha * lm_score
        else:
            curr_score += float(logp[hyp_j.y_sequence[k + 1]])
Collaborator:

Same as above

Collaborator Author:

Agree.

# https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html


import argparse
Collaborator:

Instead of two different scripts for beam search, would it be possible to use the high-level beam search API of RNNT and merge this script with the one we already have?

Collaborator Author:

I will try to check this possibility

Collaborator Author:

@titu1994, I tried to merge RNNT beam search decoding into eval_beamsearch_ngram_transducer.py, but it looks a bit overloaded due to the different decoding parameters and methods for CTC and RNNT. Furthermore, we would need to add an option to return logits from the rnnt_model.transcribe method. We can stop at the current version with two different files for this PR. I can try to merge them in a separate future PR.

Collaborator:

Transducer does not compute full logits; you can get back the alignment matrix with preserve_alignments and return the hypotheses, so no need for an additional logprobs flag.

I suppose we can keep a separate script for now though. Let me know when the PR is finalized.

Collaborator Author:

I meant the RNNT encoder logits, which are used for asr_model.decoding.rnnt_decoder_predictions_tensor, by analogy with model.decoding.ctc_decoder_predictions_tensor as in the eval_beamsearch_ngram_transducer.py script.

Collaborator:

I get that, but adding an argument to RNNT transcribe just for that changes the signatures of RNNT and CTC. You can use the logprobs flag, but you won't get actual logprobs back, only the encoder output. Please use the hypothesis output for the encoder logits, or repurpose logprobs as the encoder output.
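A sketch of the hypothesis route (return_hypotheses is an existing transcribe flag; exactly which fields are populated depends on the decoding config):

# enable preserve_alignments in the decoding config beforehand
hypotheses = asr_model.transcribe(audio_file_paths, return_hypotheses=True)
for hyp in hypotheses:
    print(hyp.text)  # best transcript
    # hyp.alignments is populated when preserve_alignments is enabled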

Collaborator Author:

Ok, I will try to do it in another PR.
Could you check the current PR and approve it if everything is OK? I finalized all the work.

andrusenkoau and others added 7 commits March 1, 2023 04:16
@VahidooX (Collaborator) left a comment:

Have you used the same formula used in my branch to calculate the lm scores?

# Please check train_kenlm.py in scripts/asr_language_modeling/ to find out why we need
# TOKEN_OFFSET for BPE-based models
if decoding_type == 'subword':
    self.token_offset = 100
Collaborator:

Let's define this number in one of the files and then import it here.
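For context, a minimal illustration of what the offset does, assuming the encoding scheme from train_kenlm.py, where each BPE token ID is stored in the LM as a single unicode character shifted past the low code points:

DEFAULT_TOKEN_OFFSET = 100  # the hardcoded value from the snippet above

token_id = 42
word = chr(token_id + DEFAULT_TOKEN_OFFSET)  # token ID encoded as one unicode char
assert ord(word) - DEFAULT_TOKEN_OFFSET == token_id  # decodes back to the same ID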

Collaborator Author:

Good suggestion! I imported the token_offset variable from the train_kenlm.py script.

Collaborator Author:

No, importing from train_kenlm.py was a bad idea. Do you have any suggestion where in NeMo I can add a default token_offset value? I could create nemo/collections/asr/parts/submodules/decoder_constants.py with this parameter.

BTW: CTC beam search also has a hardcoded value for token_offset.

@titu1994 (Collaborator), Mar 6, 2023:

We can keep the offset inside the CTC beam decoding file. The offset is not used for RNNT. Import the constant from the NeMo CTC beam search decoding file inside the train_kenlm file.

Collaborator Author:

Ok, I will add the offset value inside ctc_beam_decoding.py. The idea is to use the same n-gram LM model for CTC_bpe and RNNT_bpe models. That is why I want to add the offset for the RNNT_bpe model too.

@andrusenkoau (Collaborator Author):

Have you used the same formula used in my branch to calculate the lm scores?

@VahidooX -- yes, I used your formula except for the ngram_lm_beta scaling, which I removed. I did not find any mention of lm_beta in papers on external LM fusion for RNNT. Some authors use hypothesis length normalization at the end of the beam search algorithm (final_hyp_score / len(hyp_text)). This technique did not work for me.
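For reference, the fusion in the diffs above is plain shallow fusion: per expanded token, curr_score += log P_RNNT(token) + ngram_lm_alpha * log P_LM(token), with no separate beta length or word-insertion term.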

andrusenkoau and others added 4 commits March 9, 2023 04:55
@titu1994 (Collaborator) left a comment:

Minor changes required - let's keep the signatures of the decoding strategies close to each other.

@@ -202,6 +203,9 @@ def __init__(self, decoding_cfg, decoder, joint, tokenizer: TokenizerSpec):
            decoding_cfg=decoding_cfg, decoder=decoder, joint=joint, blank_id=blank_id + joint.num_extra_outputs
        )

        if isinstance(self.decoding, rnnt_beam_decoding.BeamRNNTInfer):
            self.decoding.set_decoding_type('subword')
Collaborator:

What about adding the Tokenizer here? And also the vocabulary for char models?

Collaborator Author:

Do we need these for RNNT beam decoding? MAES decoding works with the default integer values of the predicted labels. The token offset is needed only for the KenLM score computation inside a separate function, which does not shift the integer label values in the hypothesis results.

"""

next_state = kenlm.State()
lm_score = self.ngram_lm.BaseScore(current_lm_state, chr(label + self.token_offset), next_state)
Collaborator:

The token offset should only be applied for subword LMs, not char.

@andrusenkoau (Collaborator Author), Mar 14, 2023:

Yes, that was a bug. I fixed it.
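A sketch of the fixed helper (kenlm.State and BaseScore are the KenLM Python API; the char-model branch is illustrative, since the thread only establishes that the offset must be skipped for char models):

import kenlm

def compute_ngram_score(self, current_lm_state, label: int):
    # apply the offset only for subword (BPE) models; for char models
    # self.token_offset is unset and the raw label is used for lookup
    if self.token_offset:
        word = chr(label + self.token_offset)
    else:
        word = self.vocab[label]  # hypothetical char-model vocabulary lookup
    next_state = kenlm.State()
    lm_score = self.ngram_lm.BaseScore(current_lm_state, word, next_state)
    return lm_score, next_state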

@@ -0,0 +1,264 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Collaborator:

Needs to get updated to 2022.

Collaborator Author:

Done)

# limitations under the License.
#

# This script would evaluate an N-gram language model trained with KenLM library (https://github.com/kpu/kenlm) in
Collaborator:

The docs here need an update.

Collaborator Author:

I am going to merge eval_beamsearch_ngram_transducer.py with eval_beamsearch_ngram.py in another PR. Could we leave eval_beamsearch_ngram_transducer.py as is for now? I will update the docs according to the new eval_beamsearch_ngram.py.

# --decoding_mode=maes
# ...
#
# You may find more info on how to use this script at:
Collaborator:

Let's add to the main doc page that we now support KenLM for both CTC and Transducer models, and add the link to this script to the main page at the following address.
Please also add it to the main README page of the NeMo docs if missing.

Collaborator Author:

I can do it in the other PR (merging eval_beamsearch_ngram_transducer.py with eval_beamsearch_ngram.py) that I mentioned above.

@@ -44,6 +44,7 @@
import torch

import nemo.collections.asr as nemo_asr
from nemo.collections.asr.parts.submodules.ctc_beam_decoding import DEFAULT_TOKEN_OFFSET
Collaborator:

If you import DEFAULT_TOKEN_OFFSET, then why not use DEFAULT_TOKEN_OFFSET instead of TOKEN_OFFSET in the code? The other way would be to use the name TOKEN_OFFSET instead of DEFAULT_TOKEN_OFFSET.

Collaborator Author:

I kept only DEFAULT_TOKEN_OFFSET.

andrusenkoau and others added 5 commits March 13, 2023 23:28
@VahidooX (Collaborator) left a comment:

We may merge it now and address the rest of the comments in another PR.

@titu1994 (Collaborator) left a comment:

LGTM for now

@titu1994 titu1994 merged commit 3dbc64e into NVIDIA:main Mar 14, 2023
titu1994 pushed a commit to titu1994/NeMo that referenced this pull request Mar 24, 2023
* add parameters for ngram_lm
* add parameters for ngram lm
* add RNNT model types for kenlm training
* add ngram lm fusion to maes decoding mode
* add a script for the rnnt beam search decoding with a ngram lm fusion for maes
* minor fixes
* minor fixes
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* fix autocast
* typing fix
* minor fixes
* add set_decoding_type function
* remove tokens_type
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* remove tokens_type from config
* import token_offset from train_kenlm
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* add DEFAULT_TOKEN_OFFSET variable
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* fix applying token_offset for char models
* fixe copyright year
* leave DEFAULT_TOKEN_OFFSET only

Signed-off-by: andrusenkoau <andrusenkoau@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
hsiehjackson pushed a commit to hsiehjackson/NeMo that referenced this pull request Jun 2, 2023