From 1533d66796f71e0eb4df266e04b371d825406e2b Mon Sep 17 00:00:00 2001 From: huvunvidia <86480512+huvunvidia@users.noreply.github.com> Date: Tue, 16 Apr 2024 22:19:17 -0400 Subject: [PATCH] Adding RETRO tests to Action Tests (cicd-main.yml) (#8942) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * update branch Signed-off-by: eharper * Add dist ckpt support for regular optimizers (#7749) * Add dist ckpt support for regular optimizers Signed-off-by: Mikołaj Błaż * [tutorial] fixed missing RIR scripts file. (#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fix imports Signed-off-by: dimapihtar * imports fix Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci imports fix Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert asr notebook Signed-off-by: dimapihtar * revert asr notebook Signed-off-by: dimapihtar --------- Signed-off-by: Mikołaj Błaż Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: dimapihtar Co-authored-by: Eric Harper Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Pin lhotse=1.19.2 in r1.23.0 (#8303) Signed-off-by: Piotr Żelasko * Cache Aware Streaming tutorial notebook (#8296) * add notebook Signed-off-by: Elena Rastorgueva * rename old notebook to Buffered_Streaming Signed-off-by: Elena Rastorgueva * call setup_streaming_params in set_default_att_context_size method Signed-off-by: Elena Rastorgueva * update links in docs Signed-off-by: Elena Rastorgueva * update links to tutorials in docs Signed-off-by: Elena Rastorgueva * remove hard-coding 
Signed-off-by: Elena Rastorgueva * rename var Signed-off-by: Elena Rastorgueva --------- Signed-off-by: Elena Rastorgueva * fix path location and branch (#8304) * fix path location and branch Signed-off-by: Nithin Rao Koluguri * change to a floating point number Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri Co-authored-by: Somshubra Majumdar * add deallocate pipeline output optimization (#8279) * add deallocate pipeline output optimization Signed-off-by: Jimmy Zhang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jimmy Zhang Co-authored-by: Jimmy Zhang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix memory leak caused by context parallelism hanging references by omegaconf (#8299) * save cp_size to self Signed-off-by: Jimmy Zhang * use parallel_state instead of self Signed-off-by: Jimmy Zhang --------- Signed-off-by: Jimmy Zhang Co-authored-by: Jimmy Zhang Co-authored-by: Eric Harper * remove assertion (#8302) Signed-off-by: dimapihtar * Update PEFT Doc (#8262) * update peft doc Signed-off-by: Chen Cui * remove old prompt learning doc and notebook Signed-off-by: Chen Cui * fix table Signed-off-by: Chen Cui * fix table Signed-off-by: Chen Cui * fix table Signed-off-by: Chen Cui * Merge branch 'r1.23.0' into chcui/update_peft_doc Signed-off-by: Chen Cui * revert accidental changes Signed-off-by: Chen Cui * revert accidental changes Signed-off-by: Chen Cui --------- Signed-off-by: Chen Cui * Attention encoder-decoder models for multiple speech-to-text tasks (#8242) (#8324) * Rebasing canary changes at current main Signed-off-by: Piotr Żelasko * Move the changes from asr transformer to nlp transformer as originally intended Signed-off-by: Piotr Żelasko * update eval to strip spaces before punctuations Signed-off-by: stevehuang52 * update pc strip Signed-off-by: stevehuang52 * 
[canary] Refactor: `PromptedAudioToTextLhotseDataset` and `EncDecMultiTaskModel` (#8247) * Create a separate CanaryDataset and use it inside `transformer_bpe_models.py`. Ditches `token_sequence_format`. Signed-off-by: Piotr Żelasko * [canary] Refactor: move changes in transformer_bpe_models.py to Canar… (#8252) * [canary] Refactor: move changes in transformer_bpe_models.py to CanaryModel Signed-off-by: Piotr Żelasko * Rename `CanaryModel` to `EncDecMultiTaskModel` and remove inheritance from `EncDecTransfModelBPE`; add a separate config for this model Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko * Rename `CanaryDataset` to `PromptedAudioToTextLhotseDataset`; add `prompt_format_fn` argument; clean-up the `_canary_prompt_format` function a bit Signed-off-by: Piotr Żelasko * Move tokenization into `prompt_format_fn`, fix usage, add docs Signed-off-by: Piotr Żelasko * Backward-compatible utterance validation Signed-off-by: Piotr Żelasko * Improve type annotations Signed-off-by: Piotr Żelasko * config and prompt_fn registration changes from review Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko * fix transcribe config Signed-off-by: stevehuang52 * Refactor Canary to follow schema of remaining ASR models (#8260) * Initial draft of multi task beam decoding strategy Signed-off-by: smajumdar * Stabilize inference Signed-off-by: smajumdar * Update AED Multi Task model to mostly conform to Archetype-Type format. 
Update config Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add change decoding strategy Signed-off-by: smajumdar * Remove redundant imports Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleanup Signed-off-by: smajumdar * Cleanup Signed-off-by: smajumdar * remove asr transformer dependency on nlp Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * copy token_classifier from nlp to asr Signed-off-by: stevehuang52 * Address comments Signed-off-by: smajumdar * Add typing to beam decoding Signed-off-by: smajumdar * Make prompt format configurable Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * drop asr dependency on nlp Signed-off-by: stevehuang52 --------- Signed-off-by: smajumdar Signed-off-by: stevehuang52 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: stevehuang52 * fix transcribe, update asr evaluator Signed-off-by: stevehuang52 * Extend the docs for the canary prompt_fn Signed-off-by: Piotr Żelasko * Incorporate changes from Nithin's code review Signed-off-by: Piotr Żelasko * training bug fix and adding launch script for speech_multitask (#8270) * bug fix and adding launch script for speech_multitask Signed-off-by: Krishna Puvvada * update launch script example in speech_to_text_aed.py Signed-off-by: Krishna Puvvada --------- Signed-off-by: Krishna Puvvada Co-authored-by: Krishna Puvvada * Fix: drop_last must be true in validation/test otherwise the training will hang Signed-off-by: Piotr Żelasko * revert to current transcribe API Signed-off-by: stevehuang52 * revert changes to NLP, update docs Signed-off-by: stevehuang52 * update eval utils Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * Remove DALI; rename 
compute_audio_loss to compute_loss Signed-off-by: Piotr Żelasko * set default use_model_transcribe=False Signed-off-by: stevehuang52 * change os.path.dirname to pathlib Signed-off-by: stevehuang52 * [canary] Test for CanaryTokenizer + refactoring (#8285) * Test for CanaryTokenizer Signed-off-by: Piotr Żelasko * Attempt at refactor... Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko * Update config for AED models (#8294) Signed-off-by: smajumdar * set default calculate_wer=False in transcribe_speech.py Signed-off-by: stevehuang52 * Attention encoder-decoder models for multiple speech-to-text tasks Signed-off-by: Piotr Żelasko * Apply suggestions from code review, part 1 Co-authored-by: Nithin Rao Signed-off-by: Piotr Żelasko * Apply suggestions from code review, part 2 Signed-off-by: Piotr Żelasko * Document compute_loss Signed-off-by: Piotr Żelasko * update transcribe_speech.py Signed-off-by: stevehuang52 * add docstring Signed-off-by: stevehuang52 * Attention encoder-decoder models for multiple speech-to-text tasks Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko Signed-off-by: stevehuang52 Signed-off-by: smajumdar Signed-off-by: Krishna Puvvada Signed-off-by: Piotr Żelasko Co-authored-by: stevehuang52 Co-authored-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Nithin Rao (cherry picked from commit d10726da72f74eb5a95056843d1f9e2562a5051c) Co-authored-by: Piotr Żelasko * add code for calling mcore_retro in NeMo * add code for calling mcore_retro in NeMo * runnable, training curve match retro mcore and nemo * working on retro inference * working on megatron_retro_eval.py and megatron_retro_inference.yaml * refactoring text_generation_utils code and 
retro inference relevant files * clean PR * resolving quick hacks (reading number of train/valid samples from workdir, discrepancy in total samples and samples with neighbors retrieved, tokenizers) * clean repository * revert changes to inference/eval code to original in main * clean code * runnable training code, with already implemented eval code * [tutorial] fixed missing RIR scripts file. (#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * Add Bert HF checkpoint converter (#8088) * Add Bert HF checkpoint converter Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reformat Signed-off-by: yaoyu-33 * Add BERT ONNX export * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add NeMo BERT to HF BERT script * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean code Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update argument names Signed-off-by: yaoyu-33 * Update build_transformer_config in Bert Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen * revert to original eval code files * revert to original eval code files 2 * revert to original eval code files 3 * revert to original eval code files 4 * clean code * clean code * update my code to support changes from latest main * commit before rebase r1.23.0 * Multimodal r1.23.0 bug fix (#8315) * Rename quick-gelu Signed-off-by: yaoyu-33 * ddpm config guard Signed-off-by: yaoyu-33 * Fix ddpm edit api Signed-off-by: yaoyu-33 * Fix insert_image_token cfg issue Signed-off-by: yaoyu-33 * neva updates Signed-off-by: yaoyu-33 *
reformat Signed-off-by: yaoyu-33 * Add back jenkins Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix jenkins Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bugs Signed-off-by: yaoyu-33 * Update default neva template Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 Co-authored-by: Eric Harper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * copy paste files from r1.23.0 * clean PR * Fixes for MoE parameter passing & use of AutoTokenizer/Model for mistral. (#8272) Signed-off-by: Alexandros Koumparoulis * Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (#8334) Signed-off-by: Sangkug Lym Co-authored-by: Eric Harper * Remove asr webapp (#8347) Signed-off-by: smajumdar * remove _target_ at model level in aed config (#8351) Signed-off-by: Krishna Puvvada Co-authored-by: Krishna Puvvada * revert changes for tts and asr * Add change_vocabulary and save_tokenizers() support to Multitask ASR models (#8357) * Add change_vocabulary and save_tokenizers() support Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update nemo/collections/asr/models/aed_multitask_models.py Co-authored-by: Piotr Żelasko Signed-off-by: Somshubra Majumdar --------- Signed-off-by: smajumdar Signed-off-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Piotr Żelasko * Change default (#8371) Signed-off-by: smajumdar * implement retro's own fwd_bwd_step() and validation_step() to not have argument first_val_step, which the MLM commit doesn't support * adding megatron compile_helpers(), in future can be fixed with correct MLM commit * bug fix in fast-conformer-aed.yaml and adding jenkins test for 
speech_to_text_aed model (#8368) Signed-off-by: Krishna Puvvada Co-authored-by: Krishna Puvvada Co-authored-by: Somshubra Majumdar * Enable megatron core loggers for GPT pretraining (#8354) * Logging changes tested for gpt_pretraining Signed-off-by: Aishwarya Bhandare * Additional args Signed-off-by: Aishwarya Bhandare * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aishwarya Bhandare Co-authored-by: Aishwarya Bhandare Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * mcore ds fix (#8283) * [tutorial] fixed missing RIR scripts file. (#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * mcore ds fix Signed-off-by: Dmytro Pykhtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update mcore Signed-off-by: dimapihtar * revert asr files Signed-off-by: dimapihtar * add comments Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for mcore mock dataset Signed-off-by: dimapihtar * update mcore version Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update gpt cfg Signed-off-by: dimapihtar * update mcore commit Signed-off-by: dimapihtar * fix Bert unit tests Signed-off-by: dimapihtar * update bert tests Signed-off-by: dimapihtar * fix bert mcore test Signed-off-by: dimapihtar * fix gpt jenkins tests Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update apex & TE commits Signed-off-by: dimapihtar * revert apex installation Signed-off-by: dimapihtar * turn off the fusion for jenkins Signed-off-by: dimapihtar 
--------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: Dmytro Pykhtar Signed-off-by: dimapihtar Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay * addressing Eric's reviews * adding existing implementation RETRO files * adding existing implementation RETRO files * Add Finetuning tutorial with HF Datasets (#8356) * Add Finetuning tutorial with HF Datasets Signed-off-by: Nithin Rao Koluguri * update on Som comments Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * release updates (#8378) * [tutorial] fixed missing RIR scripts file. (#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * mcore ds fix Signed-off-by: Dmytro Pykhtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update mcore Signed-off-by: dimapihtar * revert asr files Signed-off-by: dimapihtar * add comments Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for mcore mock dataset Signed-off-by: dimapihtar * update mcore version Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update gpt cfg Signed-off-by: dimapihtar * update mcore commit Signed-off-by: dimapihtar * fix Bert unit tests Signed-off-by: dimapihtar * update bert tests Signed-off-by: dimapihtar * fix bert mcore test Signed-off-by: dimapihtar * fix gpt jenkins tests Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes 
from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for dict data input type Signed-off-by: dimapihtar * add mock ds test Signed-off-by: dimapihtar * add test for dict data input type Signed-off-by: dimapihtar * mcore ds fix Signed-off-by: dimapihtar * data input fix Signed-off-by: dimapihtar --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: Dmytro Pykhtar Signed-off-by: dimapihtar Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay * MCore dataset compatibility for tokenizers (#8390) * Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer Signed-off-by: Valerie Sarge * Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer. Signed-off-by: Valerie Sarge --------- Signed-off-by: Valerie Sarge Co-authored-by: Pablo Garay * Mcore customization doc (#8298) * [tutorial] fixed missing RIR scripts file. 
(#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * Add Bert HF checkpoint converter (#8088) * Add Bert HF checkpoint converter Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reformat Signed-off-by: yaoyu-33 * Add BERT ONNX export * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add NeMo BERT to HF BERT script * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean code Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update argument names Signed-off-by: yaoyu-33 * Update build_transformer_config in Bert Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen * initial placeholder Signed-off-by: Huiying Li * add to intro/index.rst Signed-off-by: Huiying Li * initial content update Signed-off-by: Huiying Li * add diff images Signed-off-by: Huiying Li size Signed-off-by: Huiying Li * minor fixes * minor language change Signed-off-by: Chen Cui * clean changes --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: yaoyu-33 Signed-off-by: Huiying Li Signed-off-by: Huiying Li Signed-off-by: Chen Cui Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen Co-authored-by: Huiying Li Co-authored-by: Chen Cui * wer fix (#8404) 
Signed-off-by: Travis Bartley * updated link to pubmed (#8402) Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * Update NFA video download link (#8406) * update nfa nasa video link Signed-off-by: Elena Rastorgueva * update link in markdown Signed-off-by: Elena Rastorgueva --------- Signed-off-by: Elena Rastorgueva * revert changes (#8410) Signed-off-by: Chen Cui * Fix dreambooth data sampler issue (#8400) * Turn on drop last Signed-off-by: yaoyu-33 * Some neva fixes Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fixed errors in the CTM gen functions (#8416) Signed-off-by: Taejin Park * add ensemble decoding fix (#8427) Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * SDE bugfix log (#8430) Signed-off-by: George * mcore customization doc minor fix (#8421) Signed-off-by: Huiying Li * NeMo-Mistral to HF converter bugfix. 
(#8353) Signed-off-by: Alexandros Koumparoulis * Fixing mcore bert for TP, PP and SP (#8336) * Fixing mcore bert for TP, PP and SP * Fixing mcore bert for TP, PP and SP * Fixing mcore version * Fixing mcore version * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> --------- Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy Co-authored-by: Eric Harper * Add settings to suppress bf16 compile errors in CI on V100 (#8481) * Add settings to suppress bf16 compile errors in CI on V100 Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * MoE parameter passing (#8255) * MoE parameter passing Signed-off-by: Alexandros Koumparoulis * Pass EP/MoE params in consumer scripts. 
Signed-off-by: Alexandros Koumparoulis * PR fixes Signed-off-by: Alexandros Koumparoulis * Use latest commit of mcore-0.5 Signed-off-by: Alexandros Koumparoulis * CI fix Signed-off-by: Alexandros Koumparoulis * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Alexandros Koumparoulis Co-authored-by: Alexandros Koumparoulis Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update k2 version (#8478) (#8492) Signed-off-by: Vladimir Bataev * Add fp8 support for SD/Update notebook paths (#8489) * Add fp8 support for SD/Update notebook paths Signed-off-by: Mingyuan Ma * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * pin to 0.5.0 (#8465) Signed-off-by: eharper * Update NeMo Multimodal Requirements (#8515) * Update requirements_multimodal.txt Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update github raw content link (#8517) Signed-off-by: Chen Cui * Add dep notice for notebooks (#8522) * add dep notice Signed-off-by: eharper * revert Signed-off-by: eharper --------- Signed-off-by: eharper * Revert FP8 integration (#8520) * Revert FP8 integration Signed-off-by: Mingyuan Ma * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update data prep notebook (#8532) Signed-off-by: Mingyuan Ma * before update branch with latest r1.23.0 * update to run with MLM 
ae2817b3dde4efb1515061a5311d01d8f85bd99c (runnable training and saving checkpoint) * remove compile_helpers * reverse changes from main branch to r1.23.0 * adding *_legacy files * update MLM commit in Jenkinsfile to latest * debugging Jenkinstest: test different mcore import in retro_dataset * update Jenkinsfile edit megatron_retro_mutransfer_pretrain_legacy.py * removing all mcore RETRO to pass the Jenkinstest * fixing import legacy problem for tests/collections/nlp/test_indexed_retrieval_dataset.py * update Jenkinsfile file to use TE v0.7 * update NeMo to work with latest mcore RETRO (solving TE problems) * update TE commit Jenkinsfile to be the same with r1.23.0's Jenkinsfile * update commit for MLM * jenkinstest debugging * temporary fix RETRO's __init__ for jenkinstest * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * add model.data.dataloader_type=cyclic to jenkinsfile * update code to work with latest megatron-lm main 81dab6067 * update M-LM commit in Jenkinsfile to latest main M-LM 81dab6067 * fix to bypass the CI test bf16 problem (following this PR https://github.com/NVIDIA/NeMo/pull/8481/files) * isort and black * adjusting model.micro_batch_size to 1 * fix BRANCH = 'r1.23.0' * replace tutorials dir from main branch to huvu/mcore_retro * fix minor merge conflicts * update Jenkinsfile * runnable with a temporary fix from Jacek (unfound -unfinished problem) * runnable with a temporary fix from Jacek (unfound -unfinished problem) * modified nlp_overrides.py back to original * fix checkpoint from Jacek Bieniusiewicz * config Jenkinsfile test * set RETRO Jenkins MBS to 1 * black fix * isort fix * update TE commit * update to latest
Jenkinsfile with latest container and commits * remove new RETRO jenkinstest * merge latest main * put RETRO Jenkinstest to the right place * update code for megatron_retro_pretraining_legacy.py * untrack ipa_cmudict-0.7b_nv23.01.txt * untrack ipa_cmudict-0.7b_nv23.01.txt * set config in megatron_retro_pretraining_legacy.py to megatron_retro_config_legacy * update new RETRO jenkinstest to run faster * merging latest main, and edit Jenkinstest * update Jenkinstest for new RETRO to run faster * fix isort * adding RETRO tests to cicd-main.yml action tests * update ipa_cmudict-0.7b_nv23.01.txt * remove quotes for model.data for legacy RETRO action tests --------- Signed-off-by: eharper Signed-off-by: Mikołaj Błaż Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: dimapihtar Signed-off-by: Piotr Żelasko Signed-off-by: Elena Rastorgueva Signed-off-by: Nithin Rao Koluguri Signed-off-by: Jimmy Zhang Signed-off-by: Chen Cui Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: yaoyu-33 Signed-off-by: Alexandros Koumparoulis Signed-off-by: Sangkug Lym Signed-off-by: smajumdar Signed-off-by: Krishna Puvvada Signed-off-by: Somshubra Majumdar Signed-off-by: Aishwarya Bhandare Signed-off-by: Dmytro Pykhtar Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Signed-off-by: Valerie Sarge Signed-off-by: Huiying Li Signed-off-by: Huiying Li Signed-off-by: Travis Bartley Signed-off-by: Taejin Park Signed-off-by: George Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Abhishree Signed-off-by: Vladimir Bataev Signed-off-by: Mingyuan Ma Co-authored-by: eharper Co-authored-by: mikolajblaz Co-authored-by: Eric Harper Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar Co-authored-by: pre-commit-ci[bot] 
<66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Piotr Żelasko Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Co-authored-by: Nithin Rao Co-authored-by: Somshubra Majumdar Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jimmy Zhang Co-authored-by: Chen Cui Co-authored-by: Huy Vu2 Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Bobby Chen Co-authored-by: akoumpa <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Sangkug Lym Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada Co-authored-by: ashbhandare Co-authored-by: Aishwarya Bhandare Co-authored-by: Dmytro Pykhtar Co-authored-by: Pablo Garay Co-authored-by: Valerie Sarge Co-authored-by: Huiying Co-authored-by: Huiying Li Co-authored-by: tbartley94 <90423858+tbartley94@users.noreply.github.com> Co-authored-by: Taejin Park Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis Co-authored-by: Vladimir Bataev Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com> Co-authored-by: Huy Vu2
---
 .github/workflows/cicd-main.yml | 77 +++++++++++++++++++++++++++++++--
 1 file changed, 73 insertions(+), 4 deletions(-)

diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml
index c4350a42f59b7..a0353a42fb5e8 100644
--- a/.github/workflows/cicd-main.yml
+++ b/.github/workflows/cicd-main.yml
@@ -3690,6 +3690,75 @@ jobs:
           uses: actions/checkout@v2
         - run: |
             python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+            trainer.num_nodes=1 \
+            trainer.devices=2 \
+            trainer.precision=bf16 \
+            trainer.accelerator=gpu \
+            model.data.data_prefix=['none'] \
+            exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
+            model.mcore_gpt=True \
+            model.tensor_model_parallel_size=1 \
+            model.pipeline_model_parallel_size=1 \
+            model.optim.name=distributed_fused_adam \
+            model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
+            model.data.num_workers=4 \
+            model.micro_batch_size=1 \
+            model.data.shuffle_documents=False \
+            trainer.val_check_interval=30 \
+            +trainer.num_sanity_val_steps=0 \
+            model.init_method_std=0.023 \
+            model.optim.lr=6.0e-4 \
+            model.megatron_amp_O2=True \
+            model.data.splits_string=\'\"98,2,0\"\' \
+            model.data.dataloader_type=cyclic \
+            trainer.max_steps=10
+
+            python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+            trainer.num_nodes=1 \
+            trainer.devices=2 \
+            trainer.precision=bf16 \
+            trainer.accelerator=gpu \
+            model.data.data_prefix=['none'] \
+            exp_manager.exp_dir=examples/nlp/language_modeling/mcore_retro_results \
+            model.mcore_gpt=True \
+            model.tensor_model_parallel_size=1 \
+            model.pipeline_model_parallel_size=1 \
+            model.optim.name=distributed_fused_adam \
+            model.retro.retro_project_dir=/home/TestData/nlp/megatron_retro/mcore_retro/micro-wiki-core \
+            model.data.num_workers=4 \
+            model.micro_batch_size=1 \
+            model.data.shuffle_documents=False \
+            trainer.val_check_interval=30 \
+            +trainer.num_sanity_val_steps=0 \
+            model.init_method_std=0.023 \
+            model.optim.lr=6.0e-4 \
+            model.megatron_amp_O2=True \
+            model.data.splits_string=\'\"98,2,0\"\' \
+            model.data.dataloader_type=cyclic \
+            trainer.max_steps=20
+
+            rm -rf examples/nlp/language_modeling/mcore_retro_results
+        - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
+          if: "failure()"
+
+  L2_Legacy_Megatron_RETRO_Pretraining_and_Resume_Training:
+    needs: [cicd-test-container-setup]
+    runs-on: self-hosted-azure
+    container:
+      image: nemoci.azurecr.io/nemo_container_${{ github.run_id }}
+      options:
+        # --user 0:128
+        --device=/dev/nvidia0
+        --gpus all
+        --shm-size=8g
+        --env TRANSFORMERS_OFFLINE=0
+        --env HYDRA_FULL_ERROR=1
+        --volume /mnt/datadrive/TestData:/home/TestData
+    steps:
+        - name: Checkout repository
+          uses: actions/checkout@v2
+        - run: |
+            python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
             trainer.devices=2 \
             trainer.num_nodes=1 \
             trainer.accelerator=gpu \
@@ -3700,7 +3769,7 @@ jobs:
             trainer.precision=16 \
             trainer.gradient_clip_val=1.0 \
             trainer.val_check_interval=10 \
-            exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
+            exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
             model.data.data_prefix= \
             model.data.knn_index= \
             model.data.retrieval_prefix= \
@@ -3720,7 +3789,7 @@ jobs:
             model.dec_cross_attention=[1] \
             +model.data.mock=True

-            python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+            python examples/nlp/language_modeling/megatron_retro_pretraining_legacy.py \
             trainer.devices=2 \
             trainer.num_nodes=1 \
             trainer.accelerator=gpu \
@@ -3731,7 +3800,7 @@ jobs:
             trainer.precision=16 \
             trainer.gradient_clip_val=1.0 \
             trainer.val_check_interval=10 \
-            exp_manager.exp_dir=examples/nlp/language_modeling/retro_results \
+            exp_manager.exp_dir=examples/nlp/language_modeling/retro_legacy_results \
             model.data.data_prefix= \
             model.data.knn_index= \
             model.data.retrieval_prefix= \
@@ -3751,7 +3820,7 @@ jobs:
             model.dec_cross_attention=[1] \
             +model.data.mock=True

-            rm -rf examples/nlp/language_modeling/retro_results
+            rm -rf examples/nlp/language_modeling/retro_legacy_results
         - uses: "NVIDIA/NeMo/.github/actions/cancel-workflow@main"
           if: "failure()"