
Flashlight and Pyctcdecode decoders #8428

Open — wants to merge 95 commits into main
Conversation

@karpnv (Collaborator) commented Feb 15, 2024:

Preserve Flashlight and Pyctcdecode beam search with an N-gram LM:

Support Flashlight and Pyctcdecode decoding with pure KenLM and NeMo KenLM models
Standardize the API of the CLI inference scripts

Collection: ASR

Changelog

  • Fix the install script install_beamsearch_decoders.sh
  • Create the flashlight_lexicon file during scripts/asr_language_modeling/ngram_lm/train_kenlm.py and tar it together with kenlm.bin
  • Unify parameters for eval_beamsearch_ngram_ctc.py, speech_to_text_eval.py, and training
    -- Get logprobs from Hypothesis
    -- Use the "pyctcdecode" strategy, denoted "beam", as the default beam search algorithm
    -- Remove the default seq2seq strategy
    -- Check decoding_type and search_type combinations
    -- Support an empty string in nemo_kenlm_path and word_kenlm_path for beam search without an LM (ZeroLM)
  • Fix a bug with EncDecHybridRNNTCTCModel in examples/asr/transcribe_speech.py
  • Support AggregateTokenizer in scripts/asr_language_modeling/ngram_lm/create_lexicon_from_arpa.py
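For orientation, a Flashlight lexicon maps each word to a space-separated token spelling. The sketch below is a character-level illustration only: the real train_kenlm.py integration handles BPE tokens and offsets, and the `build_flashlight_lexicon` helper and the `|` end-of-word marker here are illustrative assumptions, not the script's actual code.

```python
def build_flashlight_lexicon(words, eow="|"):
    """Map each word to its space-separated character spelling,
    terminated by an end-of-word token, one entry per line."""
    lines = []
    for word in sorted(set(words)):
        spelling = " ".join(list(word) + [eow])
        lines.append(f"{word}\t{spelling}")
    return "\n".join(lines) + "\n"

print(build_flashlight_lexicon(["world", "hello"]))
```

In the real pipeline the entries come from the training vocabulary and the token spellings come from the model's tokenizer rather than raw characters.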
Example: beam search evaluation with the Flashlight strategy:

python3 scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_ctc.py \
model_path=am_model.nemo \
dataset_manifest=manifest.json \
preds_output_folder=/tmp \
ctc_decoding.strategy=flashlight \
ctc_decoding.beam.kenlm_path=am_model.kenlm \
ctc_decoding.beam.beam_size=[4] \
ctc_decoding.beam.beam_alpha=[0.5] \
ctc_decoding.beam.beam_beta=[0.5] \
batch_size=32 \
beam_batch_size=1 \
cuda=1
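The bracketed values (beam_size=[4], beam_alpha=[0.5], beam_beta=[0.5]) are lists; passing several values per parameter lets the script sweep decoding hyperparameters. A minimal sketch of evaluating such a cross-product is below — the exact sweep logic inside eval_beamsearch_ngram_ctc.py may differ, and the loop body here is a placeholder:

```python
import itertools

beam_sizes = [4, 8]
beam_alphas = [0.5, 1.0]
beam_betas = [0.5]

# every (beam_size, alpha, beta) combination gets its own decoding run
grid = list(itertools.product(beam_sizes, beam_alphas, beam_betas))
print(len(grid))
for size, alpha, beta in grid:
    pass  # decode the manifest with these settings and record WER
```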

Example: evaluation via speech_to_text_eval.py (the lexicon here was built with DEFAULT_TOKEN_OFFSET):

python3 examples/asr/speech_to_text_eval.py \
model_path=am_model.nemo \
dataset_manifest=manifest.json \
decoder_type=ctc \
ctc_decoding.strategy=flashlight \
ctc_decoding.beam.nemo_kenlm_path=kenlm_model.bin \
ctc_decoding.beam.beam_size=4 \
ctc_decoding.beam.beam_alpha=0.5 \
ctc_decoding.beam.beam_beta=0.5 \
ctc_decoding.beam.flashlight_cfg.lexicon_path=am_model.flashlight_lexicon \
ctc_decoding.beam.return_best_hypothesis=true \
batch_size=32 \
output_filename=/tmp/manifest_out.json \
cuda=1
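speech_to_text_eval.py scores the decoded hypotheses against the reference manifest with WER. For reference, WER reduces to word-level edit distance divided by the reference length; the implementation below is a minimal, self-contained illustration, not NeMo's actual metric code:

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between ref[:i] and hyp[:j]
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # one insertion over 3 words
```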

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Additional Information

karpnv and others added 25 commits January 24, 2024 00:26
Signed-off-by: Nikolay Karpov <nkarpov@nvidia.com>
Signed-off-by: Nikolay Karpov <karpnv@gmail.com>
@github-actions github-actions bot added the ASR label Feb 15, 2024
github-actions bot commented Mar 1, 2024

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Mar 1, 2024

github-actions bot commented Mar 9, 2024

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Mar 9, 2024
@github-actions github-actions bot added the stale label Sep 3, 2024

This PR was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this Sep 10, 2024
@karpnv karpnv reopened this Sep 16, 2024
@github-actions github-actions bot removed the stale label Sep 17, 2024
if cfg.amp:
    if torch.cuda.is_available() and hasattr(torch.cuda, 'amp') and hasattr(torch.cuda.amp, 'autocast'):
        logging.info("AMP is enabled!\n")
        autocast = torch.cuda.amp.autocast
    else:
        autocast = default_autocast
else:
    autocast = default_autocast

Check notice — Code scanning / CodeQL — Unused local variable (Note): Variable autocast is not used. (Flagged on all three autocast assignments.)
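CodeQL flags these assignments because the selected autocast is never entered afterwards; the fix is to actually use the chosen context manager. A minimal sketch of that pattern follows — `pick_autocast` is a hypothetical helper, and `contextlib.nullcontext` stands in for both default_autocast and (for testability without a GPU) torch.cuda.amp.autocast:

```python
import contextlib

def pick_autocast(use_amp, cuda_amp_available):
    """Return a context-manager factory: AMP autocast when available,
    otherwise a no-op fallback (stand-in for default_autocast)."""
    if use_amp and cuda_amp_available:
        # in the real script this branch would return torch.cuda.amp.autocast
        return lambda: contextlib.nullcontext("amp-enabled")
    return contextlib.nullcontext

# the chosen context manager is actually entered, so the variable is used
autocast = pick_autocast(use_amp=True, cuda_amp_available=False)
with autocast():
    result = "decode under the selected precision context"
```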
@karpnv karpnv added Run CICD and removed Run CICD labels Oct 1, 2024

github-actions bot commented Oct 1, 2024

[🤖]: Hi @karpnv 👋,

I just wanted to let you know that a CICD pipeline for this PR just finished successfully ✨

So it might be time to merge this PR or to get some approvals 🚀

But I'm just a 🤖, so I'll leave it to you what to do next.

Have a great day!

//cc @ko3n1g

artbataev
artbataev previously approved these changes Oct 2, 2024
@artbataev (Collaborator) left a comment:
Looks like it is worth merging now.
@karpov-nick please fix autocast/use_amp in scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_transducer.py

andrusenkoau
andrusenkoau previously approved these changes Oct 2, 2024
@andrusenkoau (Collaborator) left a comment:
LGTM

python eval_beamsearch_ngram_ctc.py model_path=<path to the .nemo file of the model> \
dataset_manifest=<path to the input evaluation JSON manifest file> \
ctc_decoding.beam.word_kenlm_path=<path to the binary KenLM model> \
ctc_decoding.beam.nemo_kenlm_path=<path to the binary KenLM model>

This comment was marked as outdated.


This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Oct 17, 2024
@nithinraok (Collaborator): Can we merge this?


github-actions bot commented Nov 1, 2024

This PR is stale because it has been open for 14 days with no activity. Remove the stale label, comment, or update the PR, or it will be closed in 7 days.

@github-actions github-actions bot added the stale label Nov 1, 2024
@tbartley94 tbartley94 removed the stale label Nov 4, 2024
@tbartley94 (Collaborator):

@karpnv could you fix merge conflicts so this can be merged?

@karpnv karpnv dismissed stale reviews from andrusenkoau and artbataev via 4f4212c November 7, 2024 14:24
@karpnv karpnv added Run CICD and removed Run CICD labels Nov 7, 2024
lexicon_path = os.path.join(tmpdir.name, lexicon[0].name)
SaveRestoreConnector._unpack_nemo_file(path2file=kenlm_path, out_folder=tmpdir.name, members=members)
cfg = OmegaConf.load(config_path)
return tmpdir, cfg.encoding_level, kenlm_model_path, lexicon_path

Check failure — Code scanning / CodeQL — Potentially uninitialized local variable (Error): Local variable 'lexicon_path' may be used before it is initialized.

try:
    self.tmpdir, self.kenlm_encoding_level, self.kenlm_path, lexicon_path = get_nemolm(kenlm_path)
    if not self.flashlight_cfg.lexicon_path:
        self.flashlight_cfg.lexicon_path = lexicon_path

Check failure — Code scanning / CodeQL — Potentially uninitialized local variable (Error): Local variable 'lexicon_path' may be used before it is initialized.
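The CodeQL error arises when the branch that assigns lexicon_path is skipped before the variable is read. The usual fix is to give the variable a default before any conditional assignment; the sketch below illustrates that pattern only — `resolve_lexicon_path` is a hypothetical helper, not the PR's actual function:

```python
import os

def resolve_lexicon_path(tmpdir_name, lexicon_files):
    """Initialize lexicon_path up front so every code path defines it,
    which removes the 'potentially uninitialized' CodeQL finding."""
    lexicon_path = None  # safe default before any conditional assignment
    if lexicon_files:
        lexicon_path = os.path.join(tmpdir_name, lexicon_files[0])
    return lexicon_path
```

Callers can then check for None explicitly instead of relying on a branch having run.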
7 participants