From 3201553feee8805a492ea267fa3e872573dbce1d Mon Sep 17 00:00:00 2001 From: huvunvidia <86480512+huvunvidia@users.noreply.github.com> Date: Fri, 26 Apr 2024 13:32:37 -0400 Subject: [PATCH] Developer Documents for mcore RETRO (#9026) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit * update branch Signed-off-by: eharper * Add dist ckpt support for regular optimizers (#7749) * Add dist ckpt support for regular optimizers Signed-off-by: Mikołaj Błaż * [tutorial] fixed missing RIR scripts file. (#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * fix imports Signed-off-by: dimapihtar * imports fix Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * ci imports fix Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert asr notebook Signed-off-by: dimapihtar * revert asr notebook Signed-off-by: dimapihtar --------- Signed-off-by: Mikołaj Błaż Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: dimapihtar Co-authored-by: Eric Harper Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Pin lhotse=1.19.2 in r1.23.0 (#8303) Signed-off-by: Piotr Żelasko * Cache Aware Streaming tutorial notebook (#8296) * add notebook Signed-off-by: Elena Rastorgueva * rename old notebook to Buffered_Streaming Signed-off-by: Elena Rastorgueva * call setup_streaming_params in set_default_att_context_size method Signed-off-by: Elena Rastorgueva * update links in docs Signed-off-by: Elena Rastorgueva * update links to tutorials in docs Signed-off-by: Elena Rastorgueva * remove hard-coding Signed-off-by: Elena Rastorgueva * rename var Signed-off-by: Elena Rastorgueva --------- Signed-off-by: Elena Rastorgueva * fix path location and branch (#8304) * fix path location and branch Signed-off-by: Nithin Rao Koluguri * change to a floating point number Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri Co-authored-by: Somshubra Majumdar * add deallocate pipeline output optimization (#8279) * add deallocate pipeline output optimization Signed-off-by: Jimmy Zhang * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jimmy Zhang Co-authored-by: Jimmy Zhang Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fix memory leak caused by context parallelism hanging references by omegaconf (#8299) * save cp_size to self Signed-off-by: Jimmy Zhang * use parallel_state instead of self Signed-off-by: Jimmy Zhang --------- Signed-off-by: Jimmy Zhang Co-authored-by: Jimmy Zhang Co-authored-by: Eric Harper * remove assertion (#8302) Signed-off-by: dimapihtar * Update PEFT Doc (#8262) * update peft doc Signed-off-by: Chen Cui * remove old prompt learning doc and notebook Signed-off-by: Chen Cui * fix table Signed-off-by: Chen Cui * fix table Signed-off-by: Chen Cui * fix table Signed-off-by: Chen Cui * Merge branch 'r1.23.0' into chcui/update_peft_doc Signed-off-by: Chen Cui * revert accidental changes Signed-off-by: Chen Cui * revert accidental changes Signed-off-by: Chen Cui 
--------- Signed-off-by: Chen Cui * Attention encoder-decoder models for multiple speech-to-text tasks (#8242) (#8324) * Rebasing canary changes at current main Signed-off-by: Piotr Żelasko * Move the changes from asr transformer to nlp transformer as originally intended Signed-off-by: Piotr Żelasko * update eval to strip spaces before punctuations Signed-off-by: stevehuang52 * update pc strip Signed-off-by: stevehuang52 * [canary] Refactor: `PromptedAudioToTextLhotseDataset` and `EncDecMultiTaskModel` (#8247) * Create a separate CanaryDataset and use it inside `transformer_bpe_models.py`. Ditches `token_sequence_format`. Signed-off-by: Piotr Żelasko * [canary] Refactor: move changes in transformer_bpe_models.py to Canar… (#8252) * [canary] Refactor: move changes in transformer_bpe_models.py to CanaryModel Signed-off-by: Piotr Żelasko * Rename `CanaryModel` to `EncDecMultiTaskModel` and remove inheritance from `EncDecTransfModelBPE`; add a separate config for this model Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko * Rename `CanaryDataset` to `PromptedAudioToTextLhotseDataset`; add `prompt_format_fn` argument; clean-up the `_canary_prompt_format` function a bit Signed-off-by: Piotr Żelasko * Move tokenization into `prompt_format_fn`, fix usage, add docs Signed-off-by: Piotr Żelasko * Backward-compatible utterance validation Signed-off-by: Piotr Żelasko * Improve type annotations Signed-off-by: Piotr Żelasko * config and prompt_fn registration changes from review Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko * fix transcribe config Signed-off-by: stevehuang52 * Refactor Canary to follow schema of remaining ASR models (#8260) * Initial draft of multi task beam decoding strategy Signed-off-by: smajumdar * Stabilize inference Signed-off-by: smajumdar * Update AED Multi Task model to mostly conform to Archetype-Type format. 
Update config Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add change decoding strategy Signed-off-by: smajumdar * Remove redundant imports Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleanup Signed-off-by: smajumdar * Cleanup Signed-off-by: smajumdar * remove asr transformer dependency on nlp Signed-off-by: stevehuang52 * clean up Signed-off-by: stevehuang52 * copy token_classifier from nlp to asr Signed-off-by: stevehuang52 * Address comments Signed-off-by: smajumdar * Add typing to beam decoding Signed-off-by: smajumdar * Make prompt format configurable Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * drop asr dependency on nlp Signed-off-by: stevehuang52 --------- Signed-off-by: smajumdar Signed-off-by: stevehuang52 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: stevehuang52 * fix transcribe, update asr evaluator Signed-off-by: stevehuang52 * Extend the docs for the canary prompt_fn Signed-off-by: Piotr Żelasko * Incorporate changes from Nithin's code review Signed-off-by: Piotr Żelasko * training bug fix and adding launch script for speech_multitask (#8270) * bug fix and adding launch script for speech_multitask Signed-off-by: Krishna Puvvada * update launch script example in speech_to_text_aed.py Signed-off-by: Krishna Puvvada --------- Signed-off-by: Krishna Puvvada Co-authored-by: Krishna Puvvada * Fix: drop_last must be true in validation/test otherwise the training will hang Signed-off-by: Piotr Żelasko * revert to current transcribe API Signed-off-by: stevehuang52 * revert changes to NLP, update docs Signed-off-by: stevehuang52 * update eval utils Signed-off-by: stevehuang52 * update docs Signed-off-by: stevehuang52 * Remove DALI; rename compute_audio_loss to compute_loss Signed-off-by: Piotr Żelasko * set default use_model_transcribe=False Signed-off-by: stevehuang52 * change os.path.dirname to pathlib Signed-off-by: stevehuang52 * [canary] Test for CanaryTokenizer + refactoring (#8285) * Test for CanaryTokenizer Signed-off-by: Piotr Żelasko * Attempt at refactor... 
Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko * Update config for AED models (#8294) Signed-off-by: smajumdar * set default calculate_wer=False in transcribe_speech.py Signed-off-by: stevehuang52 * Attention encoder-decoder models for multiple speech-to-text tasks Signed-off-by: Piotr Żelasko * Apply suggestions from code review, part 1 Co-authored-by: Nithin Rao Signed-off-by: Piotr Żelasko * Apply suggestions from code review, part 2 Signed-off-by: Piotr Żelasko * Document compute_loss Signed-off-by: Piotr Żelasko * update transcribe_speech.py Signed-off-by: stevehuang52 * add docstring Signed-off-by: stevehuang52 * Attention encoder-decoder models for multiple speech-to-text tasks Signed-off-by: Piotr Żelasko --------- Signed-off-by: Piotr Żelasko Signed-off-by: stevehuang52 Signed-off-by: smajumdar Signed-off-by: Krishna Puvvada Signed-off-by: Piotr Żelasko Co-authored-by: stevehuang52 Co-authored-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com> Co-authored-by: Nithin Rao (cherry picked from commit 86efc4e0ae8d2a2febe8027a1c8b43aeba8e0553) Co-authored-by: Piotr Żelasko * add code for calling mcore_retro in NeMo * add code for calling mcore_retro in NeMo * runnable, training curve match retro mcore and nemo * working on retro inference * working on megatron_retro_eval.py and megatron_retro_inference.yaml * refactoring text_generation_utils code and retro inference relevant files * clean PR * resolving quick hacks (reading number of train/valid samples from workdir, discrepancy in total samples and samples with neighbors retrieved, tokenizers) * clean repository * revert changes to inference/eval code to original in main * clean code * runable training code, with already implemented eval code * [tutorial] fixed missing RIR scripts file. 
(#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * Add Bert HF checkpoint converter (#8088) * Add Bert HF checkpoint converter Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reformat Signed-off-by: yaoyu-33 * Add BERT ONNX export * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add NeMo BERT to HF BERT script * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean code Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update argument names Signed-off-by: yaoyu-33 * Update build_transformer_config in Bert Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen * revert to original eval code files * revert to original eval code files 2 * revert to original eval code files 3 * revert to original eval code files 4 * clean code * clean code * update my code to support changes from lastest main * commit before rebase r1.23.0 * Multimodal r1.23.0 bug fix (#8315) * Rename quick-gelu Signed-off-by: yaoyu-33 * ddpm config guard Signed-off-by: yaoyu-33 * Fix ddpm edit api Signed-off-by: yaoyu-33 * Fix insert_image_token cfg issue Signed-off-by: yaoyu-33 * neva updates Signed-off-by: yaoyu-33 * reformat Signed-off-by: yaoyu-33 * Add back jenkins Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix jenkins Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bugs Signed-off-by: yaoyu-33 * Update default neva template Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 Co-authored-by: Eric Harper Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * copy paste files from r1.23.0 * clean PR * Fixes for MoE parameter passing & use of AutoTokenizer/Model for mistral. 
(#8272) Signed-off-by: Alexandros Koumparoulis * Keep max_seqlen and cu_seqlens_argmin for later micro-batches when PP>1 (#8334) Signed-off-by: Sangkug Lym Co-authored-by: Eric Harper * Remove asr webapp (#8347) Signed-off-by: smajumdar * remove _target_ at model level in aed config (#8351) Signed-off-by: Krishna Puvvada Co-authored-by: Krishna Puvvada * revert changes for tts and asr * Add change_vocabulary and save_tokenizers() support to Multitask ASR models (#8357) * Add change_vocabulary and save_tokenizers() support Signed-off-by: smajumdar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update nemo/collections/asr/models/aed_multitask_models.py Co-authored-by: Piotr Żelasko Signed-off-by: Somshubra Majumdar --------- Signed-off-by: smajumdar Signed-off-by: Somshubra Majumdar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Piotr Żelasko * Change default (#8371) Signed-off-by: smajumdar * implement retro's own fwd_bwd_step() and validation_step() to not have argument first_val_step, which the MLM commit doesn't support * adding megatron compile_helpers(), in future can be fixed with correct MLM commit * bug fix in fast-conformer-aed.yaml and adding jenkins test for speech_to_text_aed model (#8368) Signed-off-by: Krishna Puvvada Co-authored-by: Krishna Puvvada Co-authored-by: Somshubra Majumdar * Enable megatron core loggers for GPT pretraining (#8354) * Logging changes tested for gpt_pretraining Signed-off-by: Aishwarya Bhandare * Additional args Signed-off-by: Aishwarya Bhandare * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Aishwarya Bhandare Co-authored-by: Aishwarya Bhandare Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * mcore ds fix (#8283) * [tutorial] fixed missing RIR scripts file. 
(#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * mcore ds fix Signed-off-by: Dmytro Pykhtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update mcore Signed-off-by: dimapihtar * revert asr files Signed-off-by: dimapihtar * add comments Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for mcore mock dataset Signed-off-by: dimapihtar * update mcore version Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update gpt cfg Signed-off-by: dimapihtar * update mcore commit Signed-off-by: dimapihtar * fix Bert unit tests Signed-off-by: dimapihtar * update bert tests Signed-off-by: dimapihtar * fix bert mcore test Signed-off-by: dimapihtar * fix gpt jenkins tests Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update apex & TE commits Signed-off-by: dimapihtar * revert apex installation Signed-off-by: dimapihtar * turn off the fusion for jenkins Signed-off-by: dimapihtar --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: Dmytro Pykhtar Signed-off-by: dimapihtar Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay * addressing Eric's reviews * adding existing implementation RETRO files * adding existing implementation RETRO files * Add Finetuning tutorial with HF Datasets (#8356) * Add Finetuning tutorial with HF Datasets Signed-off-by: Nithin Rao Koluguri * update on Som comments Signed-off-by: Nithin Rao Koluguri --------- Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * release updates (#8378) * [tutorial] fixed missing RIR scripts file. 
(#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * mcore ds fix Signed-off-by: Dmytro Pykhtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update mcore Signed-off-by: dimapihtar * revert asr files Signed-off-by: dimapihtar * add comments Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for mcore mock dataset Signed-off-by: dimapihtar * update mcore version Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update gpt cfg Signed-off-by: dimapihtar * update mcore commit Signed-off-by: dimapihtar * fix Bert unit tests Signed-off-by: dimapihtar * update bert tests Signed-off-by: dimapihtar * fix bert mcore test Signed-off-by: dimapihtar * fix gpt jenkins tests Signed-off-by: dimapihtar * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add support for dict data input type Signed-off-by: dimapihtar * add mock ds test Signed-off-by: dimapihtar * add test for dict data input type Signed-off-by: dimapihtar * mcore ds fix Signed-off-by: dimapihtar * data input fix Signed-off-by: dimapihtar --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: Dmytro Pykhtar Signed-off-by: dimapihtar Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Pablo Garay * MCore dataset compatibility for tokenizers (#8390) * Add unique_identifiers for all tokenizers and eod for SentencePieceTokenizer Signed-off-by: Valerie Sarge * Add generalized token aliases to TokenizerSpec to conform with MegatronTokenizer's interface. Remove now-redundant individual fixes from AutoTokenizer and SentencePieceTokenizer. Signed-off-by: Valerie Sarge --------- Signed-off-by: Valerie Sarge Co-authored-by: Pablo Garay * Mcore customization doc (#8298) * [tutorial] fixed missing RIR scripts file. 
(#8257) Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> * add values to en tts dict (#7879) Signed-off-by: Mariana Graterol Fuenmayor * Add Bert HF checkpoint converter (#8088) * Add Bert HF checkpoint converter Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Reformat Signed-off-by: yaoyu-33 * Add BERT ONNX export * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add NeMo BERT to HF BERT script * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean code Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update argument names Signed-off-by: yaoyu-33 * Update build_transformer_config in Bert Signed-off-by: yaoyu-33 --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen * initial placeholder Signed-off-by: Huiying Li * add to intro/index.rst Signed-off-by: Huiying Li * initial content update Signed-off-by: Huiying Li * add diff images Signed-off-by: Huiying Li size Signed-off-by: Huiying Li * minor fixes * minor language change Signed-off-by: Chen Cui * clean changes --------- Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: yaoyu-33 Signed-off-by: Huiying Li Signed-off-by: Huiying Li Signed-off-by: Chen Cui Co-authored-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Bobby Chen Co-authored-by: Huiying Li Co-authored-by: Chen Cui * wer fix (#8404) Signed-off-by: Travis Bartley * updated link to pubmed (#8402) Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * Update NFA video download link (#8406) * update nfa nasa video link Signed-off-by: Elena Rastorgueva * update link in markdown Signed-off-by: Elena Rastorgueva --------- Signed-off-by: Elena Rastorgueva * revert changes (#8410) Signed-off-by: Chen Cui * Fix dreambooth data sampler issue (#8400) * Turn on drop last Signed-off-by: yaoyu-33 * Some neva fixes Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Fixed errors in the CTM gen functions (#8416) Signed-off-by: Taejin Park * add ensemble decoding fix (#8427) Signed-off-by: Nithin Rao Koluguri Co-authored-by: Nithin Rao Koluguri * SDE bugfix log (#8430) Signed-off-by: George * mcore customization doc minor fix (#8421) Signed-off-by: Huiying Li * NeMo-Mistral to HF converter bugfix. 
(#8353) Signed-off-by: Alexandros Koumparoulis * Fixing mcore bert for TP, PP and SP (#8336) * Fixing mcore bert for TP, PP and SP * Fixing mcore bert for TP, PP and SP * Fixing mcore version * Fixing mcore version * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> * Update Jenkinsfile Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> --------- Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy Co-authored-by: Eric Harper * Add settings to suppress bf16 compile errors in CI on V100 (#8481) * Add settings to suppress bf16 compile errors in CI on V100 Signed-off-by: Abhishree * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Abhishree Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * MoE parameter passing (#8255) * MoE parameter passing Signed-off-by: Alexandros Koumparoulis * Pass EP/MoE params in consumer scripts. Signed-off-by: Alexandros Koumparoulis * PR fixes Signed-off-by: Alexandros Koumparoulis * Use latest commit of mcore-0.5 Signed-off-by: Alexandros Koumparoulis * CI fix Signed-off-by: Alexandros Koumparoulis * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Alexandros Koumparoulis Co-authored-by: Alexandros Koumparoulis Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update k2 version (#8478) (#8492) Signed-off-by: Vladimir Bataev * Add fp8 support for SD/Update notebook paths (#8489) * Add fp8 support for SD/Update notebook paths Signed-off-by: Mingyuan Ma * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Eric Harper * pin to 0.5.0 (#8465) Signed-off-by: eharper * Update NeMo Multimodal Requirements (#8515) * Update requirements_multimodal.txt Signed-off-by: yaoyu-33 * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: yaoyu-33 Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * update github raw content link (#8517) Signed-off-by: Chen Cui * Add dep notice for notebooks (#8522) * add dep notice Signed-off-by: eharper * revert Signed-off-by: eharper --------- Signed-off-by: eharper * Revert FP8 integration (#8520) * Revert FP8 integration Signed-off-by: Mingyuan Ma * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Mingyuan Ma Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Update data prep notebook (#8532) Signed-off-by: Mingyuan Ma * before update branch with latest r1.23.0 * update to run with MLM ae2817b3dde4efb1515061a5311d01d8f85bd99c (runnable training and saving checkpoint) * remove compile_helpers * reverse changes from main branch to r1.23.0 * adding *_legacy files * update MLM commit in Jenkinsfile to latest * debugging Jenkinstest: test different mcore import in retro_dataset * update Jenkinsfile edit 
megatron_retro_mutransfer_pretrain_legacy.py * removing all mcore RETRO to pass the Jenkinstest * fixing import legacy problem for tests/collections/nlp/test_indexed_retrieval_dataset.py * update Jenkinsfile file to use TE v0.7 * update NeMo to work with latest mcore RETRO (solving TE problems) * update TE commit Jenkinsfile to be the same with r1.23.0's Jenkinsfile * update commit for MLM * jenkinstest debugging * temporary fix RETRO's __init__ for jenkinstest * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * edit splits_string in jenkinsfile to correct format; put RETRO test in front to test faster * add model.data.dataloader_type=cyclic to jenkinsfile * update code to work with latest megatron-lm main 81dab6067 * update M-LM commit in Jenkinsfile to latest main M-LM 81dab6067 * fix to by pass CI test bf16 problem (following this PR https://github.com/NVIDIA/NeMo/pull/8481/files) * isort and black * adjusting model.micro_batch_size to 1 * fix BRANCH = 'r1.23.0' * replace tutorials dir from main branch to huvu/mcore_retro * fix minor merges conflict * update Jenkinsfile * runnable with a temporary fix from Jacek (unfound -unfinished problem) * runnable with a temporary fix from Jacek (unfound -unfinished problem) * modified nlp_overrides.py back to original * fix checkpoint from Jacek Bieniusiewicz * config Jenkinsfile test * set RETRO Jenkins MBS to 1 * black fix * isort fix * update TE commit * update to latest Jenkinsfile with latest container and commits * remove new RETRO jenkinstest * merge latest main * put RETRO Jenkinstest to the right place * update code for megatron_retro_pretraining_legacy.py * untrack ipa_cmudict-0.7b_nv23.01.txt * untrack ipa_cmudict-0.7b_nv23.01.txt * set config in megatron_retro_pretraining_legacy.py to megatron_retro_config_legacy * update new RETRO jenkinstest to run faster * merging latest main, and edit Jenkinstest * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * huvu/mcore_retro_docs first commit * update with main * update RETRO docs * fix scripts/tts_dataset_files/ipa_cmudict-0.7b_nv23.01.txt * update docs * update docs * udpate RETRO docs * update with Jennifer's comments --------- Signed-off-by: eharper Signed-off-by: Mikołaj Błaż Signed-off-by: Xuesong Yang <1646669+XuesongYang@users.noreply.github.com> Signed-off-by: dimapihtar Signed-off-by: Piotr Żelasko Signed-off-by: Elena Rastorgueva Signed-off-by: Nithin Rao Koluguri Signed-off-by: Jimmy Zhang Signed-off-by: Chen Cui Signed-off-by: Mariana Graterol Fuenmayor Signed-off-by: yaoyu-33 Signed-off-by: Alexandros Koumparoulis Signed-off-by: Sangkug Lym Signed-off-by: smajumdar Signed-off-by: Krishna Puvvada Signed-off-by: Somshubra Majumdar Signed-off-by: Aishwarya Bhandare Signed-off-by: Dmytro Pykhtar Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Signed-off-by: Valerie Sarge Signed-off-by: Huiying Li Signed-off-by: Huiying Li Signed-off-by: Travis Bartley Signed-off-by: Taejin Park Signed-off-by: George Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Signed-off-by: Abhishree Signed-off-by: Vladimir Bataev Signed-off-by: Mingyuan Ma Co-authored-by: eharper Co-authored-by: mikolajblaz Co-authored-by: Eric Harper Co-authored-by: Xuesong Yang 
<1646669+XuesongYang@users.noreply.github.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Piotr Żelasko Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com> Co-authored-by: Nithin Rao Co-authored-by: Somshubra Majumdar Co-authored-by: JimmyZhang12 <67203904+JimmyZhang12@users.noreply.github.com> Co-authored-by: Jimmy Zhang Co-authored-by: Chen Cui Co-authored-by: Huy Vu2 Co-authored-by: Mariana <47233618+mgrafu@users.noreply.github.com> Co-authored-by: yaoyu-33 <54727607+yaoyu-33@users.noreply.github.com> Co-authored-by: Bobby Chen Co-authored-by: akoumpa <153118171+akoumpa@users.noreply.github.com> Co-authored-by: Sangkug Lym Co-authored-by: Krishna Puvvada <93558329+krishnacpuvvada@users.noreply.github.com> Co-authored-by: Krishna Puvvada Co-authored-by: ashbhandare Co-authored-by: Aishwarya Bhandare Co-authored-by: Dmytro Pykhtar Co-authored-by: Pablo Garay Co-authored-by: Valerie Sarge Co-authored-by: Huiying Co-authored-by: Huiying Li Co-authored-by: tbartley94 <90423858+tbartley94@users.noreply.github.com> Co-authored-by: Taejin Park Co-authored-by: George <37293288+Jorjeous@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com> Co-authored-by: Shanmugam Ramasamy Co-authored-by: Abhishree Thittenamane <47577437+athitten@users.noreply.github.com> Co-authored-by: Alexandros Koumparoulis Co-authored-by: Vladimir Bataev Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com> Co-authored-by: Huy Vu2 --- .../nlp/nemo_megatron/retro/retro_model.rst | 512 ++++-------------- .../{retro => retro_legacy}/images/arch.png | Bin .../retro_legacy/retro_model_legacy.rst | 469 ++++++++++++++++ 3 files changed, 574 insertions(+), 407 deletions(-) rename docs/source/nlp/nemo_megatron/{retro => retro_legacy}/images/arch.png (100%) create mode 100644 docs/source/nlp/nemo_megatron/retro_legacy/retro_model_legacy.rst diff --git a/docs/source/nlp/nemo_megatron/retro/retro_model.rst b/docs/source/nlp/nemo_megatron/retro/retro_model.rst index e490b70797d42..5bd7f03f77aca 100644 --- a/docs/source/nlp/nemo_megatron/retro/retro_model.rst +++ b/docs/source/nlp/nemo_megatron/retro/retro_model.rst @@ -1,281 +1,92 @@ -NeMo RETRO Model +RETRO Model ================ -The Retrieval-Enhanced Transformer (RETRO) model is an autoregressive language model that takes into account document chunks retrieved from a large -corpus when making predictions. The RETRO model has a similar architecture to the GPT model, but it includes an encoder that encodes the retrieved -context and cross-attention layers that integrate the context to improve the model's output. Below is a simple diagram of the RETRO model architecture. +The Retrieval-Enhanced Transformer (RETRO) `(Borgeaud et al., 2022) `_ is an autoregressive decoder-only language model (LM) +pretrained with retrieval-augmentation. +RETRO features practical scalability to support large-scale pretraining from scratch by retrieving from trillions of +tokens. +Pretraining with retrieval provides a more efficient storage mechanism of factual knowledge, when compared to storing factual knowledge implicitly within the network's parameters. This approach significantly reduces the model's parameter count while achieving lower perplexity than the standard GPT model. 
+RETRO also provides the flexibility to update the +knowledge stored in LMs `(Wang et al., 2023a) `_ +by updating the retrieval database without training LMs again. -.. image:: images/arch.png - :align: center - :width: 800px - :alt: RETRO model architecture +For the legacy native NeMo RETRO model documentation, please see `NeMo RETRO Model (Legacy) `_. -For more detailed information on the model, please refer to the `RETRO paper `_ :cite:`nlp-retro-borgeaud2021improving` by Deepmind. -The NeMo RETRO Model is an open-source implementation of the paper, and it has the following differences/features compared to Deepmind's proposed implementation: - -1. The NeMo RETRO Model is built on top of NeMo Megatron code, allowing for efficient training of large language models in a cluster environment. -2. The NeMo RETRO Model uses `Faiss `_ :cite:`nlp-retro-jegou2022faiss` as the K$N search library, which can be accelerated by GPUs. -3. The NeMo RETRO uses `RoPe relative positional encoding `_ :cite:`nlp-retro-su2021roformer`. -4. The NeMo RETRO uses `SentenceTransformers `_ :cite:`nlp-retro-reimers2019sentence` as the retriever encoder. -5. The NeMo RETRO supports `mu-Transfer `_ :cite:`nlp-retro-yang2022tensor`, allowing for scalable training of the RETRO model via Zero-Shot Hyperparameter Transfer. - -Quick start +Quick Start ************ -Steps below demonstrate training and evaluating a NeMo RETRO model +The following instructions demonstrate how to preprocess the data as well as train and evaluate a RETRO model. -Data pre-processing +Data Preprocessing ------------------- -Step 1: Collect training data -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The RETRO model uses two types of data: training data, which typically consists of 64-token chunks, and retrieval data, which typically consists of 128-token chunks. -The training data is used to train the model, while the retrieval data is used to supplement the language model. -It's possible to use the same data for both training and retrieval, as long as duplicates are removed properly, as described below. -Both types of data are stored in a loose JSON format, with each line containing a single text sample. For example: - -.. code-block:: json - - {"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"} - {"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"} - -The name of the text field of the json can be changed by using the ``--json-key`` flag in ``preprocess_data_for_megatron.py``. The other metadata are optional and are not used in training. +For detailed information on data preprocessing, refer to the `Megatron-LM Github `_ repository. This repository contains scripts and comprehensive instructions for the entire preprocessing procedure, specifically focusing on `RETRO Data Preparation `_. The main stages of the process are summarized below. -Step 2: Convert training data into memory map format -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +The outcome of the preparation step yields a processed RETRO data directory, fully primed for pre-training. Specifically, this directory encompasses the following key files and subdirectories: -The loose json is then processed into a binary format for training and retrieval. To convert the json into mmap, cached index file. -Set the ``--dataset-impl`` flag to `retmmap`, which is the memory map format dedicated for RETRO model. - -An example script to prepare data for RETRO training is: - -.. 
code-block:: bash - - python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \ - --input=/dataset/pubmed_train.jsonl \ - --json-keys=text \ - --tokenizer-library=megatron \ - --apply-ftfy \ - --dataset-impl=retmmap \ - --merge-file=/dataset/gpt2-merges.txt \ - --vocab-file=/dataset/gpt2-vocab.json \ - --tokenizer-type=GPT2BPETokenizer \ - --output-prefix=/result/pubmed_train \ - --need-pad-id \ - --append-eod \ - --retrieval-db \ - --chunk_size=64 \ - --workers=48 - -The RETRO model processes chunked documents using 64 tokens as the default chunk size. The RETRO memory map dataset will add padding -tokens to the end of each document to make it a multiple of 64. The ``--need-pad-id`` argument adds a padding token to the tokenizer -if it doesn't already have one. The ``--append-eod`` argument controls whether to add ``end-of-document`` tokens to the preprocessed -data, and the ``--retrieval-db`` argument indicates whether to create a retrieval database for the preprocessed data. If ``--retrieval-db`` -is used, it will add an additional 64 padding tokens at the end of the document. The ``--chunk_size`` and ``--workers`` arguments -control the size of the data chunks to be processed and the number of worker processes to use, respectively. - -Following is the retro memory map index data format: - -.. list-table:: - :widths: 25 25 25 25 25 25 - - * - 'MMIDRET\x00\x00' (header 9 bytes) - - 1 (version 8 byte) - - dtype code :sup:`1` (1 byte) - - sentence count (8 byte) - - chunk size (8 byte) - - chunk count (8 byte) - * - retrieved db :sup:`2` (1 byte) - - number of tokens for each of sentences ( int32 array) - - start of sentence address in byte (int64 array) - - start of chunk id (int64 array) - - chunk id address in byte (int64 array) - - - -:sup:`1` 1: np.uint8, 2: np.int8, 3: np.int16, 4: np.int32, 5: np.int64, 6: np.float64, 7: np.double, 8: np.uint16 - -:sup:`2` When building the indexed dataset, we pad each sentence to be a multiple of ``chunk_size`` with ``pad_id`` from the tokenizer. -The number of tokens for each sentence includes the padded token ids. For retrieval data, there is an extra ``chunk_size`` padding at -the end of each sentence, and the ``retrieved_db`` flag is set to True. However, the number of tokens for each sentence excludes this extra ``chunk_size`` padding. - -Following is the retro memory map binary data format: - -.. list-table:: - :widths: 65 - - * - token id array for sentence 0,1, 2 ... (dtype :sup:`3` array) - -:sup:`3` np.uint16 vocab_size < 65500 else np.int32 - -Step 3: Create Faiss index for retrieval data -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -After creating the memory map retrieval data binary file and index files, we can build a Faiss index that can quickly find the K-nearest neighbors of a given -chunk ID based on a query embedding vector. Because the retrieval data is typically very large, we break this process down into three steps. - -Step 3.1: Train the Faiss index structure -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In this step, it uses a subset of the retrieval data to train a empty Faiss index. An example script is: - -.. code-block:: bash +* ``config.json``: contains the hyperparameters used in the data preparation step, which will then be retrieved to use in the pre-training step for consistency. For example: sample length, chunk length, data splits, tokenizer files, etc. +* ``data``: contains the original data before any preprocessing. +* ``tokenizer``: contains tokenizer files used in the preparation step. 
+* ``db``: contains the chunk database of processed and chunked text used for retrieving neighbors. +* ``index``: contains the Faiss index of the chunk database for retrieval. +* ``query``: contains the queried neighboring chunks for all training samples. - python scripts/nlp_language_modeling/build_retrieval_index.py \ - --input_file=/result/pubmed_train_text_document \ - --tokenizer-library=megatron \ - --tokenizer-type=GPT2BPETokenizer \ - --merge-file=/dataset/gpt2-merges.txt \ - --vocab-file=/dataset/gpt2-vocab.json \ - --percent=1.0 \ - --sentence_transformer_model=all-mpnet-base-v2 \ - --batch_size=1024 \ - --train_index_size=2000000 \ - --workers=2 \ - --devices=0,1,2,3,4,5,6,7 \ - --stage=0 \ - --output_file=/result/pubmed_faiss_learn.index - -This command is used to build an empty Faiss index using the 2000000 training data in ``pubmed_train_text_document``. -The ``all-mpnet-base-v2`` sentence transformer model is used to encode the chunk tokens into an embedding vector. -The index will be saved in the result directory as ``pubmed_faiss_learn.index``. This command specifies using 8 GPUs to train the Faiss index. - -Step 3.2: Add retrieval data into sharding index -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This step adds all the retrieval data to the empty Faiss index created in the previous step. An example script is: -.. code-block:: bash +The data preparation process contains the following main stages: - python scripts/nlp_language_modeling/build_retrieval_index.py \ - --input_file=/result/pubmed_train_text_document \ - --tokenizer-library=megatron \ - --tokenizer-type=GPT2BPETokenizer \ - --merge-file=/dataset/gpt2-merges.txt \ - --vocab-file=/dataset/gpt2-vocab.json \ - --percent=1.0 \ - --sentence_transformer_model=all-mpnet-base-v2 \ - --batch_size=1024 \ - --shard_id=0 \ - --total_shards=10 \ - --workers=2 \ - --devices=0,1,2,3,4,5,6,7 \ - --stage=1 \ - --learned_index=/result/pubmed_faiss_learn.index \ - --output_file=/result/pubmed_faiss_shard0.save - -This command breaks the retrieval data into ``total_shards`` shards and adds the data in the shard specified by ``shard_id``. -The result is saved to a file specified by ``output_file``. In the example above, 10 sharding indexes are created. - -Step 3.3: Merge the sharding indexes into final Faiss index -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -This step merges all the sharding indexes created in the previous step into the final Faiss index. An example script is: +Build Retrieval Chunk Database +############################## -.. code-block:: bash +This stage involves creating a database of text chunks from a corpus such as Wikipedia to be used for retrievals. The chunks are non-overlapping and extracted from the original GPT token dataset, with each chunk traditionally being 64 tokens in length. The database is stored as a 2-D array and is not a relational database. - python scripts/nlp_language_modeling/build_retrieval_index.py \ - --stage=2 \ - --devices=0,1,2,3,4,5,6,7 \ - --learned_index=/result/pubmed_faiss_learn.index \ - --shard_index_input=/result/pubmed_faiss_shard \ - --output_file=/result/pubmed_faiss_final.index +The main output of this stage is: -Step 4: Build KNN index -^^^^^^^^^^^^^^^^^^^^^^^ +* ``/db/merged/train.hdf5``: the database containing all processed and chunked text. +* ``/db/merged/sampled.hdf5``: the database containing a small portion of all chunks, only used for training the index in the next stage. 
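A quick way to sanity-check the prepared project directory is to read ``config.json`` and peek at the merged chunk database. The snippet below is only a minimal sketch and not part of the official tooling; the paths follow the directory layout listed above, and the HDF5 contents are simply enumerated rather than assumed.

.. code-block:: python

    import json
    from pathlib import Path

    import h5py

    project_dir = Path("/path/to/retro_workdir")  # hypothetical project directory

    # Hyperparameters recorded during data preparation (chunk length, splits, tokenizer, ...).
    with open(project_dir / "config.json") as f:
        config = json.load(f)
    print(json.dumps(config, indent=2))

    # Inspect the merged chunk database produced by this stage.
    with h5py.File(project_dir / "db" / "merged" / "train.hdf5", "r") as db:
        # Print every top-level object; datasets show their shape and dtype in the repr.
        for name in db:
            print(name, db[name])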
-During training, it is inefficient to run a query to find the K-nearest neighbor chunk IDs for each training data point. -This can be pre-calculated by building a KNN index before training. The KNN index maps the training data chunk IDs to the K-nearest neighbor chunk IDs -in the retrieval data. As with building the Faiss index, this process is divided into two steps. +Build Index for Similarity Search +################################# -Following is the KNN index data format: +The second stage is to build a search index using Faiss, a library for efficient similarity search. The index is trained on a subset of the chunks ``sampled.hdf5`` from the database. After training, all chunks are added to the index to enable querying. The index accepts 1-D floating point vectors, so chunks must be embedded using Bert embeddings before they can be added to the index. Particularly, the stage is comprised of two sub-stages: -.. list-table:: - :widths: 25 25 25 25 45 + \- Extract BERT embeddings from the sampled chunk database (``sampled.hdf5``) and use them to train a Faiss index. - * - 'KNNRETM\x00\x00' (header 9 bytes) - - 1 (version 8 byte) - - K number of neighbors (8 byte) - - Number chunks (8 byte) - - Map to K retrieval data chunk IDs, shape (number_chunks, K) ( int64 array) + \- Extract BERT embeddings for each chunk in the all chunks database (``train.hdf5``) and add them to the trained Faiss index. -Step 4.1: Build KNN sharding index -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The main output of this stage is: -The KNN index is built using the memory-mapped training data created by the ``preprocess_data_for_megatron.py`` script and the Faiss index -file for the retrieval data built by the ``build_retrieval_index.py`` script. +* ``/index///added.faissindex``: the trained index, with all chunks in the database added to it -An example script is: +Query Pretraining Neighbors +########################### -.. code-block:: bash +To speed up the RETRO pretraining process, you pre-retrieve neighbors for all training samples instead of retrieving them on-the-fly. In this stage, the pretraining datasets are processed to find and save k-nearest neighbors for each chunk in each sample. The neighbors are saved to disk and labeled with unique properties to ensure they match the pretraining configuration. Query-time hyperparameters can be tuned to improve the quality of the neighbors. - python scripts/nlp_language_modeling/build_knn_map_index.py \ - --input_file=/result/pubmed_eval_text_document \ - --tokenizer-library=megatron \ - --tokenizer-type=GPT2BPETokenizer \ - --merge-file=/dataset/gpt2-merges.txt \ - --vocab-file=/dataset/gpt2-vocab.json \ - --process_chunk_size=10000 \ - --sentence_transformer_model=all-mpnet-base-v2 \ - --batch_size=1024 \ - --K_neighbors=50 \ - --workers=2 \ - --devices=0,1,2,3,4,5,6,7 \ - --remove_duplicate \ - --dedup_margin=70 \ - --nprobe=100 \ - --shard_id=0 \ - --total_shards=10 \ - --stage=1 \ - --output_file=/dataset/pubmed_knn_shard0.save \ - --faiss_index=/result/pubmed_faiss_final.index - -In this example, the training data is broken into ``total_shards`` shards, and the KNN index is calculated for the shard specified by ``shard_id``. -The result is saved to a file specified by ``output_file``. In the example above, 10 KNN sharding indexes are created. - -Use the ``remove_duplicate`` flag if the training data and retrieval data are the same to remove neighbors from the same document. 
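Conceptually, the neighbor query in the last stage above is a standard Faiss k-nearest-neighbor search over Bert-embedded chunks. The sketch below only illustrates that lookup; the index path is a hypothetical placeholder, and random vectors stand in for the Bert chunk embeddings that the real pipeline computes.

.. code-block:: python

    import faiss
    import numpy as np

    # Load the trained index that already has all database chunks added to it.
    index = faiss.read_index("/path/to/retro_workdir/index/added.faissindex")  # hypothetical path

    # In the real pipeline these would be Bert embeddings of the query chunks.
    queries = np.random.rand(4, index.d).astype("float32")

    # Retrieve the 2 nearest database chunk ids for each query vector.
    distances, neighbor_ids = index.search(queries, 2)
    print(neighbor_ids)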
- -Step 4.2: Merge KNN sharding index into final KNN index -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -An example script is: +The main output of this stage is: -.. code-block:: bash +* ``train_``: directory containing retrieved neighbors for all training samples. +* ``valid_``: directory containing retrieved neighbors for all validation samples. - python scripts/nlp_language_modeling/build_knn_map_index.py \ - --stage=2 \ - --output_file=pubmed_knn_final.save \ - --shard_index_input=pubmed_knn_shard -Train NeMo RETRO Model +Train RETRO Model ----------------------- -Once the training data, retrieval data, KNN index, and Faiss index are prepared, we are ready to train the RETRO model. In the NeMo implementation, -the RETRO model can be pre-trained with or without the `mu-Transfer `_ :cite:`nlp-retro-yang2022tensor` feature. We will introduce both ways. - +Once the training samples, pre-retrieved neighbors, and other data are prepared, you are ready to train the RETRO model. The training process will use the output directory from the data preparation step. Set the path to this directory with the ``retro.retro_project_dir`` argument. Many of the data hyperparameters will be retrieved from the ``config.json`` file in this directory, including data splits, sequence length, chunk length, number of training and validation samples, tokenizer, etc. -The table below lists some of the common parameters that can be configured for model pre-training. +The table below lists some of the common architecture and optimizer parameters that can be configured for model pre-training. Many of these values are set in ``examples/nlp/language_modeling/conf/megatron_retro_config.yaml``, which is used during training unless overridden on the command line. Notice that, unlike other NeMo models, the ``model.data.data_prefix`` value is set to None, because all data information will be retrieved from ``model.retro.retro_project_dir``.
+----------------------------------+-------------+----------------------------------------------------------------------------------------+ | **Parameter** | **Default** | **Description** | +==================================+=============+========================================================================================+ -| model.micro_batch_size | 4 | the micro batch size used for training | -+----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.tensor_model_parallel_size | 1 | tensor model parallel size | +| retro_data.retro_chunk_length | 64 | the chunk size used to retrieve | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.encoder_seq_length | 2048 | token sequence length | -+----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.chunk_size | 64 | the chunk size used to retrieve | -+----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.enc_num_layers | 4 | total number of encoder layers | +| retro.retro_num_neighbors | 2 | token sequence length | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.dec_num_layers | 6 | total number of decoder layers | +| retro_encoder_num_layers | 2 | total number of encoder layers | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.enc_cross_attention | [3] | layer numbers for cross attention in encoder | +| model.num_layers | 12 | total number of decoder layers | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.dec_cross_attention | [3,4,5] | layer numbers for chunked cross attention in decoder | -+----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.add_position_embedding | FALSE | whether to add the absolute position encoding | +| model.encoder_seq_length | 2048 | token sequence length | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ | model.hidden_size | 768 | model hidden size | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ @@ -283,187 +94,74 @@ The table below lists some of the common parameters that can be configured for m +----------------------------------+-------------+----------------------------------------------------------------------------------------+ | model.num_attention_heads | 12 | number of attention heads | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.init_method_std | 0.02 | standard deviation of the zero mean normal distribution used for weight initialization | +| model.init_method_std | 0.023 | standard deviation of the zero mean normal distribution used for weight initialization | 
+----------------------------------+-------------+----------------------------------------------------------------------------------------+ | model.hidden_dropout | 0.1 | dropout probability for hidden state transformer | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ | model.attention_dropout | 0.1 | dropout probability in the attention layer | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ -| model.ffn_dropout | 0 | dropout probability in the feed-forward layer | +| model.ffn_dropout | 0.1 | dropout probability in the feed-forward layer | +----------------------------------+-------------+----------------------------------------------------------------------------------------+ - -Option 1: Train the NeMo RETRO model *without* mu-Transfer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -An example RETRO pre-training script is: - -.. code-block:: bash - - python examples/nlp/language_modeling/megatron_retro_pretraining.py \ - trainer.devices=8 \ - trainer.num_nodes=2 \ - trainer.accelerator=gpu \ - trainer.max_steps=800000 \ - trainer.precision=16 \ - exp_manager.exp_dir=/result/retro_model \ - model.apply_query_key_layer_scaling=False \ - model.tensor_model_parallel_size=8 \ - model.optim.name=adamw \ - model.enc_num_layers=2 \ - model.dec_num_layers=32 \ - model.enc_cross_attention=[0] \ - model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \ - model.hidden_size=4096 \ - model.ffn_hidden_size=16384 \ - model.num_attention_heads=32 \ - model.tokenizer.merge_file=/dataset/gpt2-merges.txt \ - model.tokenizer.vocab_file=/dataset/gpt2-vocab.json \ - model.data.data_prefix=[/result/pubmed_eval_text_document] \ - model.data.knn_index=[dataset/pubmed_knn_final.save] \ - model.data.retrieval_prefix=/result/pubmed_eval_text_document \ - model.micro_batch_size=8 - -During the training, launch Tensorboard to monitor training like so: - -.. code-block:: bash - - tensorboard --logdir /result/retro_model --bind_all - -.. note:: Weights and Biases (WandB) is supported too. Add ``exp_manager.create_wandb_logger=True`` to the model training arguments to enable it. - -After the training, the model nemo file can be found at the result checkpoint directory. - -Option 2: Train the NeMo RETRO model *with* mu-Transfer -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -`mu-Transfer `_ :cite:`nlp-retro-yang2022tensor` paper proposed a method to zero-shot transfer hyperparameter to train a larger model. -This can be done in 3 steps in NeMo RETRO implementation. - - -Step 1. find optimal hyper parameter for a small base model -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Use the pre-training code in Option 1, either manually or automatically ind a set of optimal hyperparameter for a small base RETRO -model. This is can be done cheaply ans fast due to the small model size. - -Step 2. calculate the shape file that can be used to run mu-Transfer -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The shape file determines which hyperparameters will be scaled up, allowing the model to adjust the learning rate, weight scaling factor, etc. - -Here is an example shape file calculation script: - +The following example shows a RETRO pre-training script. The rest of the argument values are retrieved from ``examples/nlp/language_modeling/conf/megatron_retro_config.yaml``. .. 
code-block:: bash - python examples/nlp/language_modeling/megatron_retro_cal_shape.py \ - trainer.devices=8 \ - trainer.num_nodes=1 \ - trainer.accelerator=gpu \ - exp_manager.exp_dir=/result/retro_model \ - base_model.enc_num_layers=2 \ - delta_model.enc_num_layers=2 \ - base_model.dec_num_layers=32 \ - delta_model.dec_num_layers=32 \ - base_model.tensor_model_parallel_size=8 \ - delta_model.tensor_model_parallel_size=8 \ - base_model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \ - delta_model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \ - base_model.enc_cross_attention=[0] \ - delta_model.enc_cross_attention=[0] \ - base_model.hidden_size=768 \ - base_model.ffn_hidden_size=3072 \ - delta_model.hidden_size=96 \ - delta_model.ffn_hidden_size=384 \ - base_model.num_attention_heads=16 \ - delta_model.num_attention_heads=16 \ - model.shape_file=tp8_32depth_o1_rel_shape_info.yaml - -In this example, the ``base_model`` refers to the small base model for which an optimal set of hyperparameters has been determined. -The ``delta_model`` refers to a model with certain hyperparameters that have been scaled up or down. In this case, -the ``hidden_size`` and ``ffn_hidden_size`` have been changed in the ``delta_model``, allowing these two parameters to be scaled freely later. - -Step 3. Pretrain mu-Transfer RETRO model -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Once the shape file is created, we can start training a RETRO model. The model training can be scale up freely using the hyperparameters -specified by the delta model and the shape file. - -An example mu-Transfer pre-training script is: - -.. code-block:: bash - - python examples/nlp/language_modeling/megatron_retro_mutransfer_pretrain.py \ - trainer.devices=8 \ - trainer.num_nodes=2 \ - trainer.accelerator=gpu \ - trainer.max_steps=500000 \ - trainer.precision=16 \ - exp_manager.exp_dir=/result/retro_model \ - model.apply_query_key_layer_scaling=False \ - model.tensor_model_parallel_size=8 \ - model.optim.name=muadamw \ - model.enc_num_layers=2 \ - model.dec_num_layers=32 \ - model.enc_cross_attention=[0] \ - model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \ - model.hidden_size=4096 \ - model.ffn_hidden_size=16384 \ - model.num_attention_heads=32 \ - model.tokenizer.merge_file=/dataset/gpt2-merges.txt \ - model.tokenizer.vocab_file=/dataset/gpt2-vocab.json \ - model.data.data_prefix=[/result/pubmed_eval_text_document] \ - model.data.knn_index=[dataset/pubmed_knn_final.save] \ - model.data.retrieval_prefix=/result/pubmed_eval_text_document \ - model.micro_batch_size=8 \ - model.shape_file=tp8_32depth_o1_rel_shape_info.yaml - -.. note:: We have chosen to use ``muadamw`` as the optimizer for use with the mu-transfer method. Currently, only ``muadam`` and ``muadamw`` are supported. - -Similarly to the pre-training in Option 1, the model nemo file can be found at the result checkpoint directory after training is complete. 
-
-Run NeMo RETRO Model Inference
+    python /examples/nlp/language_modeling/megatron_retro_pretraining.py \
+    trainer.num_nodes=1 \
+    trainer.devices=8 \
+    trainer.precision=bf16 \
+    trainer.accelerator=gpu \
+    trainer.max_steps=750000 \
+    trainer.val_check_interval=10 \
+    exp_manager.exp_dir=/path/to/exp_dir \
+    model.mcore_gpt=True \
+    model.tensor_model_parallel_size=1 \
+    model.pipeline_model_parallel_size=1 \
+    model.megatron_amp_O2=True \
+    model.retro.num_layers=12 \
+    model.retro.retro_encoder_num_layers=2 \
+    model.retro.retro_num_retrieved_chunks=2 \
+    model.retro.retro_project_dir=/path/to/retro_workdir \
+    model.micro_batch_size=4 \
+    model.data.num_workers=4 \
+    model.data.data_prefix=["none"] \
+    model.data.shuffle_documents=False \
+    model.data.dataloader_type=single \
+    model.data.splits_string=\'98,2,0\' \
+    model.optim.lr=6.0e-4 \
+    model.optim.weight_decay=0.1 \
+    model.optim.sched.name=CosineAnnealing \
+    model.optim.sched.min_lr=6.0e-5 \
+    model.optim.sched.max_steps=650000 \
+    model.optim.name=distributed_fused_adam
+
+During training, you can monitor the process with Weights and Biases (WandB) by setting ``exp_manager.create_wandb_logger=True`` and the relevant WandB arguments.
+After training, the model's distributed checkpoint can be found in the resulting checkpoint directory.
+
+Run RETRO Model Inference
 -------------------------------

-Once the NeMo RETRO model has been trained, we can put it into inference mode and experiment with it.
-During inference, we are not limited to the static Faiss index that we built earlier for KNN queries.
-We can feed any external data to the model as retrieval context. NeMo RETRO implementation supports dynamic retrieval service,
-allowing users to add, reset, and query new documents on the fly.
-
-We have built a simple web client that makes it easy for users to play around with the model. Here is an example script to launch the server:
+Once the RETRO model has been trained, you can put it into inference mode and experiment with it.
+During inference, you are not limited to the indexed corpus for retrieving relevant chunks; you can directly provide any relevant context to the prompt through the argument ``neighbors``.
+When performing inference, the input for RETRO differs structurally from that used during training. Specifically, the model's input consists of only two chunks: one for the prompt and another for the answer to be generated. Unlike during training, these chunks do not necessarily have a fixed length of 64 tokens; instead, they match the length of the tokenized prompt. When context neighbors are supplied for a prompt, these neighbors correspond to the first chunk and are processed through the RETRO encoder to generate text for the second chunk.
+The following example shows a RETRO inference script. The rest of the argument values are retrieved from ``examples/nlp/language_modeling/conf/megatron_retro_inference.yaml``.

.. code-block:: bash

-    python examples/nlp/language_modeling/megatron_retro_eval.py \
-    trainer.devices=8 \
-    trainer.num_nodes=1 \
-    trainer.accelerator=gpu \
-    trainer.precision=16 \
-    retro_model_file=megatron_retro.nemo \
-    tensor_model_parallel_size=8 \
-    pipeline_model_parallel_size=1 \
-    retrieval_service.sentence_bert.devices=\'0,1,2,3,4,5,6,7\' \
-    retrieval_service.services.0.faiss_devices=\'0,1,2,3,4,5,6,7\' \
-    retrieval_service.services.1.faiss_devices=\'0,1,2,3,4,5,6,7\' \
-    retrieval_service.services.0.faiss_index=/result/pubmed_faiss_final.index \
-    retrieval_service.services.0.retrieval_index=/result/pubmed_eval_text_document \
-    retrieval_service.neighbors=2 \
-    retrieval_service.pad_tokens=True \
-    retrieval_service.store_retrieved=True \
-    server=True \
-    web_server=True \
-    share=True \
-    username=test \
-    password=test123
-
-Set the retro_model_file to use the nemo file generated in the pre-training step. After launching the server, copy-paste the URL from
-the terminal into your browser. Use the specified username and password to log in and have fun experimenting with the RETRO model.
-
-References
-************
-
-.. bibliography:: ../../nlp_all.bib
-    :style: plain
-    :labelprefix: nlp-retro
-    :keyprefix: nlp-retro-
+    python /examples/nlp/language_modeling/megatron_retro_eval.py \
+    checkpoint_dir=/path/to/checkpoints \
+    checkpoint_name=/checkpoint_name \
+    trainer.devices=1 \
+    trainer.num_nodes=1 \
+    trainer.accelerator=gpu \
+    trainer.precision=32 \
+    megatron_amp_O2=False \
+    inference.tokens_to_generate=10 \
+    inference.greedy=False \
+    inference.add_BOS=False \
+    inference.temperature=1.0 \
+    inference.retro_inference.retro_num_neighbors=2 \
+    prompt="sample prompt" \
+    neighbors=["sample neighbor 1","sample neighbor 2"]
diff --git a/docs/source/nlp/nemo_megatron/retro/images/arch.png b/docs/source/nlp/nemo_megatron/retro_legacy/images/arch.png
similarity index 100%
rename from docs/source/nlp/nemo_megatron/retro/images/arch.png
rename to docs/source/nlp/nemo_megatron/retro_legacy/images/arch.png
diff --git a/docs/source/nlp/nemo_megatron/retro_legacy/retro_model_legacy.rst b/docs/source/nlp/nemo_megatron/retro_legacy/retro_model_legacy.rst
new file mode 100644
index 0000000000000..e490b70797d42
--- /dev/null
+++ b/docs/source/nlp/nemo_megatron/retro_legacy/retro_model_legacy.rst
@@ -0,0 +1,469 @@
+NeMo RETRO Model
+================
+
+The Retrieval-Enhanced Transformer (RETRO) model is an autoregressive language model that takes into account document chunks retrieved from a large
+corpus when making predictions. The RETRO model has a similar architecture to the GPT model, but it includes an encoder that encodes the retrieved
+context and cross-attention layers that integrate the context to improve the model's output. Below is a simple diagram of the RETRO model architecture.
+
+.. image:: images/arch.png
+    :align: center
+    :width: 800px
+    :alt: RETRO model architecture
+
+For more detailed information on the model, please refer to the `RETRO paper `_ :cite:`nlp-retro-borgeaud2021improving` by DeepMind.
+The NeMo RETRO Model is an open-source implementation of the paper, and it has the following differences/features compared to DeepMind's proposed implementation:
+
+1. The NeMo RETRO Model is built on top of NeMo Megatron code, allowing for efficient training of large language models in a cluster environment.
+2. The NeMo RETRO Model uses `Faiss `_ :cite:`nlp-retro-jegou2022faiss` as the KNN search library, which can be accelerated by GPUs.
+3.
The NeMo RETRO uses `RoPe relative positional encoding `_ :cite:`nlp-retro-su2021roformer`.
+4. The NeMo RETRO uses `SentenceTransformers `_ :cite:`nlp-retro-reimers2019sentence` as the retriever encoder.
+5. The NeMo RETRO supports `mu-Transfer `_ :cite:`nlp-retro-yang2022tensor`, allowing for scalable training of the RETRO model via Zero-Shot Hyperparameter Transfer.
+
+Quick start
+************
+The steps below demonstrate how to train and evaluate a NeMo RETRO model.
+
+Data pre-processing
+-------------------
+
+Step 1: Collect training data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The RETRO model uses two types of data: training data, which typically consists of 64-token chunks, and retrieval data, which typically consists of 128-token chunks.
+The training data is used to train the model, while the retrieval data is used to supplement the language model.
+It's possible to use the same data for both training and retrieval, as long as duplicates are removed properly, as described below.
+Both types of data are stored in a loose JSON format, with each line containing a single text sample. For example:
+
+.. code-block:: json
+
+    {"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
+    {"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
+
+The name of the text field of the JSON can be changed by using the ``--json-keys`` flag in ``preprocess_data_for_megatron.py``. The other metadata are optional and are not used in training.
+
+Step 2: Convert training data into memory map format
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The loose JSON is then processed into a binary format for training and retrieval. To convert the JSON into the mmap and cached index files,
+set the ``--dataset-impl`` flag to ``retmmap``, which is the memory map format dedicated to the RETRO model.
+
+An example script to prepare data for RETRO training is:
+
+.. code-block:: bash
+
+    python scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
+    --input=/dataset/pubmed_train.jsonl \
+    --json-keys=text \
+    --tokenizer-library=megatron \
+    --apply-ftfy \
+    --dataset-impl=retmmap \
+    --merge-file=/dataset/gpt2-merges.txt \
+    --vocab-file=/dataset/gpt2-vocab.json \
+    --tokenizer-type=GPT2BPETokenizer \
+    --output-prefix=/result/pubmed_train \
+    --need-pad-id \
+    --append-eod \
+    --retrieval-db \
+    --chunk_size=64 \
+    --workers=48
+
+The RETRO model processes chunked documents using 64 tokens as the default chunk size. The RETRO memory map dataset will add padding
+tokens to the end of each document to make it a multiple of 64. The ``--need-pad-id`` argument adds a padding token to the tokenizer
+if it doesn't already have one. The ``--append-eod`` argument controls whether to add ``end-of-document`` tokens to the preprocessed
+data, and the ``--retrieval-db`` argument indicates whether to create a retrieval database for the preprocessed data. If ``--retrieval-db``
+is used, it will add an additional 64 padding tokens at the end of the document. The ``--chunk_size`` and ``--workers`` arguments
+control the size of the data chunks to be processed and the number of worker processes to use, respectively.
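+
+To make the padding rule above concrete, the following minimal sketch (illustrative only; ``padded_length`` is a made-up helper, not part of NeMo) computes how many tokens a document occupies after padding:
+
+.. code-block:: python
+
+    def padded_length(num_tokens: int, chunk_size: int = 64, retrieval_db: bool = False) -> int:
+        # Pad every document up to a multiple of chunk_size, as described above.
+        padded = -(-num_tokens // chunk_size) * chunk_size
+        # Retrieval data gets an additional chunk_size of padding at the end of the document.
+        return padded + (chunk_size if retrieval_db else 0)
+
+    # For example, a 100-token retrieval document occupies
+    # padded_length(100, retrieval_db=True) == 192 tokens (128 rounded up, plus 64 extra padding).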
+
+The following is the RETRO memory map index data format:
+
+.. list-table::
+    :widths: 25 25 25 25 25 25
+
+    * - 'MMIDRET\x00\x00' (header 9 bytes)
+      - 1 (version 8 byte)
+      - dtype code :sup:`1` (1 byte)
+      - sentence count (8 byte)
+      - chunk size (8 byte)
+      - chunk count (8 byte)
+    * - retrieved db :sup:`2` (1 byte)
+      - number of tokens for each sentence (int32 array)
+      - start of sentence address in byte (int64 array)
+      - start of chunk id (int64 array)
+      - chunk id address in byte (int64 array)
+      -
+
+:sup:`1` 1: np.uint8, 2: np.int8, 3: np.int16, 4: np.int32, 5: np.int64, 6: np.float64, 7: np.double, 8: np.uint16
+
+:sup:`2` When building the indexed dataset, we pad each sentence to be a multiple of ``chunk_size`` with ``pad_id`` from the tokenizer.
+The number of tokens for each sentence includes the padded token ids. For retrieval data, there is an extra ``chunk_size`` padding at
+the end of each sentence, and the ``retrieved_db`` flag is set to True. However, the number of tokens for each sentence excludes this extra ``chunk_size`` padding.
+
+The following is the RETRO memory map binary data format:
+
+.. list-table::
+    :widths: 65
+
+    * - token id array for sentences 0, 1, 2, ... (dtype :sup:`3` array)
+
+:sup:`3` np.uint16 if vocab_size < 65500, else np.int32
+
+Step 3: Create Faiss index for retrieval data
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+After creating the memory map retrieval data binary file and index files, we can build a Faiss index that can quickly find the K-nearest neighbors of a given
+chunk ID based on a query embedding vector. Because the retrieval data is typically very large, we break this process down into three steps.
+
+Step 3.1: Train the Faiss index structure
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This step uses a subset of the retrieval data to train an empty Faiss index. An example script is:
+
+..
code-block:: bash + + python scripts/nlp_language_modeling/build_retrieval_index.py \ + --input_file=/result/pubmed_train_text_document \ + --tokenizer-library=megatron \ + --tokenizer-type=GPT2BPETokenizer \ + --merge-file=/dataset/gpt2-merges.txt \ + --vocab-file=/dataset/gpt2-vocab.json \ + --percent=1.0 \ + --sentence_transformer_model=all-mpnet-base-v2 \ + --batch_size=1024 \ + --shard_id=0 \ + --total_shards=10 \ + --workers=2 \ + --devices=0,1,2,3,4,5,6,7 \ + --stage=1 \ + --learned_index=/result/pubmed_faiss_learn.index \ + --output_file=/result/pubmed_faiss_shard0.save + +This command breaks the retrieval data into ``total_shards`` shards and adds the data in the shard specified by ``shard_id``. +The result is saved to a file specified by ``output_file``. In the example above, 10 sharding indexes are created. + +Step 3.3: Merge the sharding indexes into final Faiss index +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +This step merges all the sharding indexes created in the previous step into the final Faiss index. An example script is: + +.. code-block:: bash + + python scripts/nlp_language_modeling/build_retrieval_index.py \ + --stage=2 \ + --devices=0,1,2,3,4,5,6,7 \ + --learned_index=/result/pubmed_faiss_learn.index \ + --shard_index_input=/result/pubmed_faiss_shard \ + --output_file=/result/pubmed_faiss_final.index + +Step 4: Build KNN index +^^^^^^^^^^^^^^^^^^^^^^^ + +During training, it is inefficient to run a query to find the K-nearest neighbor chunk IDs for each training data point. +This can be pre-calculated by building a KNN index before training. The KNN index maps the training data chunk IDs to the K-nearest neighbor chunk IDs +in the retrieval data. As with building the Faiss index, this process is divided into two steps. + +Following is the KNN index data format: + +.. list-table:: + :widths: 25 25 25 25 45 + + * - 'KNNRETM\x00\x00' (header 9 bytes) + - 1 (version 8 byte) + - K number of neighbors (8 byte) + - Number chunks (8 byte) + - Map to K retrieval data chunk IDs, shape (number_chunks, K) ( int64 array) + +Step 4.1: Build KNN sharding index +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The KNN index is built using the memory-mapped training data created by the ``preprocess_data_for_megatron.py`` script and the Faiss index +file for the retrieval data built by the ``build_retrieval_index.py`` script. + +An example script is: + +.. code-block:: bash + + python scripts/nlp_language_modeling/build_knn_map_index.py \ + --input_file=/result/pubmed_eval_text_document \ + --tokenizer-library=megatron \ + --tokenizer-type=GPT2BPETokenizer \ + --merge-file=/dataset/gpt2-merges.txt \ + --vocab-file=/dataset/gpt2-vocab.json \ + --process_chunk_size=10000 \ + --sentence_transformer_model=all-mpnet-base-v2 \ + --batch_size=1024 \ + --K_neighbors=50 \ + --workers=2 \ + --devices=0,1,2,3,4,5,6,7 \ + --remove_duplicate \ + --dedup_margin=70 \ + --nprobe=100 \ + --shard_id=0 \ + --total_shards=10 \ + --stage=1 \ + --output_file=/dataset/pubmed_knn_shard0.save \ + --faiss_index=/result/pubmed_faiss_final.index + +In this example, the training data is broken into ``total_shards`` shards, and the KNN index is calculated for the shard specified by ``shard_id``. +The result is saved to a file specified by ``output_file``. In the example above, 10 KNN sharding indexes are created. + +Use the ``remove_duplicate`` flag if the training data and retrieval data are the same to remove neighbors from the same document. 
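+
+The KNN shard files written in this step (and the final merged file from the next step) follow the KNN index format listed above. As a rough, hypothetical sketch (not a NeMo API; it assumes the byte layout from the table, with the three 8-byte header fields stored as native-endian int64), the header and neighbor map can be read back like this:
+
+.. code-block:: python
+
+    import numpy as np
+
+    def read_knn_map(path):
+        # Hypothetical reader for the KNN map layout documented above.
+        with open(path, "rb") as f:
+            magic = f.read(9)                          # expected: b'KNNRETM\x00\x00'
+            assert magic == b"KNNRETM\x00\x00", "not a KNN map file"
+            version, k, num_chunks = np.frombuffer(f.read(24), dtype=np.int64)
+            # Map from each training chunk ID to its K nearest retrieval chunk IDs.
+            knn_map = np.frombuffer(f.read(8 * num_chunks * k), dtype=np.int64)
+        return int(version), knn_map.reshape(num_chunks, k)
+
+    # e.g. version, knn_map = read_knn_map("/dataset/pubmed_knn_shard0.save")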
+ +Step 4.2: Merge KNN sharding index into final KNN index +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +An example script is: + +.. code-block:: bash + + python scripts/nlp_language_modeling/build_knn_map_index.py \ + --stage=2 \ + --output_file=pubmed_knn_final.save \ + --shard_index_input=pubmed_knn_shard + + +Train NeMo RETRO Model +----------------------- + +Once the training data, retrieval data, KNN index, and Faiss index are prepared, we are ready to train the RETRO model. In the NeMo implementation, +the RETRO model can be pre-trained with or without the `mu-Transfer `_ :cite:`nlp-retro-yang2022tensor` feature. We will introduce both ways. + + +The table below lists some of the common parameters that can be configured for model pre-training. + ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| **Parameter** | **Default** | **Description** | ++==================================+=============+========================================================================================+ +| model.micro_batch_size | 4 | the micro batch size used for training | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.tensor_model_parallel_size | 1 | tensor model parallel size | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.encoder_seq_length | 2048 | token sequence length | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.chunk_size | 64 | the chunk size used to retrieve | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.enc_num_layers | 4 | total number of encoder layers | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.dec_num_layers | 6 | total number of decoder layers | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.enc_cross_attention | [3] | layer numbers for cross attention in encoder | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.dec_cross_attention | [3,4,5] | layer numbers for chunked cross attention in decoder | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.add_position_embedding | FALSE | whether to add the absolute position encoding | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.hidden_size | 768 | model hidden size | ++----------------------------------+-------------+----------------------------------------------------------------------------------------+ +| model.ffn_hidden_size | 3072 | model FFN hidden size. 
Usually 4 * hidden_size |
++----------------------------------+-------------+----------------------------------------------------------------------------------------+
+| model.num_attention_heads | 12 | number of attention heads |
++----------------------------------+-------------+----------------------------------------------------------------------------------------+
+| model.init_method_std | 0.02 | standard deviation of the zero mean normal distribution used for weight initialization |
++----------------------------------+-------------+----------------------------------------------------------------------------------------+
+| model.hidden_dropout | 0.1 | dropout probability for hidden state transformer |
++----------------------------------+-------------+----------------------------------------------------------------------------------------+
+| model.attention_dropout | 0.1 | dropout probability in the attention layer |
++----------------------------------+-------------+----------------------------------------------------------------------------------------+
+| model.ffn_dropout | 0 | dropout probability in the feed-forward layer |
++----------------------------------+-------------+----------------------------------------------------------------------------------------+
+
+
+Option 1: Train the NeMo RETRO model *without* mu-Transfer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+An example RETRO pre-training script is:
+
+.. code-block:: bash
+
+    python examples/nlp/language_modeling/megatron_retro_pretraining.py \
+    trainer.devices=8 \
+    trainer.num_nodes=2 \
+    trainer.accelerator=gpu \
+    trainer.max_steps=800000 \
+    trainer.precision=16 \
+    exp_manager.exp_dir=/result/retro_model \
+    model.apply_query_key_layer_scaling=False \
+    model.tensor_model_parallel_size=8 \
+    model.optim.name=adamw \
+    model.enc_num_layers=2 \
+    model.dec_num_layers=32 \
+    model.enc_cross_attention=[0] \
+    model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \
+    model.hidden_size=4096 \
+    model.ffn_hidden_size=16384 \
+    model.num_attention_heads=32 \
+    model.tokenizer.merge_file=/dataset/gpt2-merges.txt \
+    model.tokenizer.vocab_file=/dataset/gpt2-vocab.json \
+    model.data.data_prefix=[/result/pubmed_eval_text_document] \
+    model.data.knn_index=[dataset/pubmed_knn_final.save] \
+    model.data.retrieval_prefix=/result/pubmed_eval_text_document \
+    model.micro_batch_size=8
+
+During the training, launch TensorBoard to monitor training like so:
+
+.. code-block:: bash
+
+    tensorboard --logdir /result/retro_model --bind_all
+
+.. note:: Weights and Biases (WandB) is supported too. Add ``exp_manager.create_wandb_logger=True`` to the model training arguments to enable it.
+
+After the training, the model nemo file can be found in the result checkpoint directory.
+
+Option 2: Train the NeMo RETRO model *with* mu-Transfer
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The `mu-Transfer `_ :cite:`nlp-retro-yang2022tensor` paper proposed a method to zero-shot transfer hyperparameters to train a larger model.
+This can be done in three steps in the NeMo RETRO implementation.
+
+
+Step 1. Find optimal hyperparameters for a small base model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the pre-training code in Option 1 to find, either manually or automatically, a set of optimal hyperparameters for a small base RETRO
+model. This can be done cheaply and quickly due to the small model size.
+
+Step 2. Calculate the shape file that can be used to run mu-Transfer
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The shape file determines which hyperparameters will be scaled up, allowing the model to adjust the learning rate, weight scaling factor, etc.
+
+Here is an example shape file calculation script:
+
+
+.. code-block:: bash
+
+    python examples/nlp/language_modeling/megatron_retro_cal_shape.py \
+    trainer.devices=8 \
+    trainer.num_nodes=1 \
+    trainer.accelerator=gpu \
+    exp_manager.exp_dir=/result/retro_model \
+    base_model.enc_num_layers=2 \
+    delta_model.enc_num_layers=2 \
+    base_model.dec_num_layers=32 \
+    delta_model.dec_num_layers=32 \
+    base_model.tensor_model_parallel_size=8 \
+    delta_model.tensor_model_parallel_size=8 \
+    base_model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \
+    delta_model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \
+    base_model.enc_cross_attention=[0] \
+    delta_model.enc_cross_attention=[0] \
+    base_model.hidden_size=768 \
+    base_model.ffn_hidden_size=3072 \
+    delta_model.hidden_size=96 \
+    delta_model.ffn_hidden_size=384 \
+    base_model.num_attention_heads=16 \
+    delta_model.num_attention_heads=16 \
+    model.shape_file=tp8_32depth_o1_rel_shape_info.yaml
+
+In this example, the ``base_model`` refers to the small base model for which an optimal set of hyperparameters has been determined.
+The ``delta_model`` refers to a model with certain hyperparameters that have been scaled up or down. In this case,
+the ``hidden_size`` and ``ffn_hidden_size`` have been changed in the ``delta_model``, allowing these two parameters to be scaled freely later.
+
+Step 3. Pretrain mu-Transfer RETRO model
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Once the shape file is created, we can start training a RETRO model. The model training can be scaled up freely using the hyperparameters
+specified by the delta model and the shape file.
+
+An example mu-Transfer pre-training script is:
+
+.. code-block:: bash
+
+    python examples/nlp/language_modeling/megatron_retro_mutransfer_pretrain.py \
+    trainer.devices=8 \
+    trainer.num_nodes=2 \
+    trainer.accelerator=gpu \
+    trainer.max_steps=500000 \
+    trainer.precision=16 \
+    exp_manager.exp_dir=/result/retro_model \
+    model.apply_query_key_layer_scaling=False \
+    model.tensor_model_parallel_size=8 \
+    model.optim.name=muadamw \
+    model.enc_num_layers=2 \
+    model.dec_num_layers=32 \
+    model.enc_cross_attention=[0] \
+    model.dec_cross_attention=[8,11,14,17,20,23,26,29,31] \
+    model.hidden_size=4096 \
+    model.ffn_hidden_size=16384 \
+    model.num_attention_heads=32 \
+    model.tokenizer.merge_file=/dataset/gpt2-merges.txt \
+    model.tokenizer.vocab_file=/dataset/gpt2-vocab.json \
+    model.data.data_prefix=[/result/pubmed_eval_text_document] \
+    model.data.knn_index=[dataset/pubmed_knn_final.save] \
+    model.data.retrieval_prefix=/result/pubmed_eval_text_document \
+    model.micro_batch_size=8 \
+    model.shape_file=tp8_32depth_o1_rel_shape_info.yaml
+
+.. note:: We have chosen to use ``muadamw`` as the optimizer for use with the mu-Transfer method. Currently, only ``muadam`` and ``muadamw`` are supported.
+
+As with the pre-training in Option 1, the model nemo file can be found in the result checkpoint directory after training is complete.
+
+Run NeMo RETRO Model Inference
+-------------------------------
+
+Once the NeMo RETRO model has been trained, we can put it into inference mode and experiment with it.
+During inference, we are not limited to the static Faiss index that we built earlier for KNN queries.
+We can feed any external data to the model as retrieval context. The NeMo RETRO implementation supports a dynamic retrieval service,
+allowing users to add, reset, and query new documents on the fly.
+
+We have built a simple web client that makes it easy for users to play around with the model. Here is an example script to launch the server:
+
+.. code-block:: bash
+
+    python examples/nlp/language_modeling/megatron_retro_eval.py \
+    trainer.devices=8 \
+    trainer.num_nodes=1 \
+    trainer.accelerator=gpu \
+    trainer.precision=16 \
+    retro_model_file=megatron_retro.nemo \
+    tensor_model_parallel_size=8 \
+    pipeline_model_parallel_size=1 \
+    retrieval_service.sentence_bert.devices=\'0,1,2,3,4,5,6,7\' \
+    retrieval_service.services.0.faiss_devices=\'0,1,2,3,4,5,6,7\' \
+    retrieval_service.services.1.faiss_devices=\'0,1,2,3,4,5,6,7\' \
+    retrieval_service.services.0.faiss_index=/result/pubmed_faiss_final.index \
+    retrieval_service.services.0.retrieval_index=/result/pubmed_eval_text_document \
+    retrieval_service.neighbors=2 \
+    retrieval_service.pad_tokens=True \
+    retrieval_service.store_retrieved=True \
+    server=True \
+    web_server=True \
+    share=True \
+    username=test \
+    password=test123
+
+Set ``retro_model_file`` to the nemo file generated in the pre-training step. After launching the server, copy-paste the URL from
+the terminal into your browser. Use the specified username and password to log in and have fun experimenting with the RETRO model.
+
+References
+************
+
+.. bibliography:: ../../nlp_all.bib
+    :style: plain
+    :labelprefix: nlp-retro
+    :keyprefix: nlp-retro-