Akoumparouli/mcore microbatch calculator fix #10780

Merged
akoumpa merged 7 commits into main from akoumparouli/mcore_microbatch_calculator_fix on Oct 7, 2024

Conversation

akoumpa
Member

@akoumpa akoumpa commented Oct 7, 2024

What does this PR do?

Changes:

  1. Use a context manager when reconfiguring the microbatch calculator, so the change is undone afterwards and does not cause side effects across tests.
  2. Move tests/lightning/io to tests/lightning/_io to avoid a name collision with Python's io standard-library module.
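
The PR does not say how the collision in item 2 surfaced, so the following is only a sketch of one way a directory named io can clash with the standard library. The helper `try_import_from_package` and the package name `lightning_io_tests` are hypothetical and exist only for this illustration. Because the stdlib `io` module is already loaded and is not a package, a dotted import through a directory of the same name fails, while a non-colliding name imports normally.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import_from_package(package_name, module_name="helper"):
    """Create <tmp>/<package_name>/<module_name>.py and try to import it.

    Returns "ok" on success, or the ModuleNotFoundError message on failure.
    Hypothetical helper for illustration; not part of NeMo or pytest.
    """
    full_name = f"{package_name}.{module_name}"
    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / package_name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / f"{module_name}.py").write_text("VALUE = 1\n")

        before = set(sys.modules)  # remember what was already imported
        sys.path.insert(0, root)
        try:
            importlib.import_module(full_name)
            return "ok"
        except ModuleNotFoundError as err:
            return str(err)
        finally:
            sys.path.remove(root)
            # Drop only the modules this call added, never the stdlib ones.
            for name in (full_name, package_name):
                if name not in before:
                    sys.modules.pop(name, None)


if __name__ == "__main__":
    print(try_import_from_package("io"))                  # collides with stdlib io
    print(try_import_from_package("lightning_io_tests"))  # hypothetical safe name
```

Whether this is the exact failure seen in the NeMo test suite depends on how the test runner derives module names, which this page does not show.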

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI runs automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
…ger to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/mcore_microbatch_calculator_fix branch from f6a55e7 to 41685ec Compare October 7, 2024 07:08
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/mcore_microbatch_calculator_fix branch from ed557e7 to 28812a9 Compare October 7, 2024 07:10
@akoumpa akoumpa marked this pull request as ready for review October 7, 2024 07:10
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
@akoumpa akoumpa force-pushed the akoumparouli/mcore_microbatch_calculator_fix branch from 1edaa66 to 422df96 Compare October 7, 2024 07:12
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
@akoumpa akoumpa added Run CICD and removed Run CICD labels Oct 7, 2024
Contributor

github-actions bot commented Oct 7, 2024

[🤖]: Hi @akoumpa 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

I'm just a bot, so I'll leave it to you to decide what to do next.

//cc @pablo-garay @ko3n1g

Collaborator

@dimapihtar dimapihtar left a comment

LGTM. Thank you!

@akoumpa akoumpa merged commit 8a238b8 into main Oct 7, 2024
155 of 161 checks passed
@akoumpa akoumpa deleted the akoumparouli/mcore_microbatch_calculator_fix branch October 7, 2024 17:06
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Oct 7, 2024
* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
youngeunkwon0405 pushed a commit to youngeunkwon0405/NeMo that referenced this pull request Oct 8, 2024
akoumpa added a commit that referenced this pull request Oct 10, 2024
akoumpa added a commit that referenced this pull request Oct 11, 2024
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
akoumpa added a commit that referenced this pull request Oct 15, 2024
BoxiangW pushed a commit that referenced this pull request Oct 18, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* examples code is at wrong dir, move them

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_lenght=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
yashaswikarnati pushed a commit that referenced this pull request Oct 20, 2024
pablo-garay added a commit that referenced this pull request Oct 21, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* examples code is at wrong dir, move them

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_lenght=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* hot fix on table style

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
artbataev pushed a commit to artbataev/NeMo that referenced this pull request Oct 22, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (NVIDIA#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (NVIDIA#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (NVIDIA#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (NVIDIA#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* examples code is at wrong dir, move them

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (NVIDIA#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (NVIDIA#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (NVIDIA#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (NVIDIA#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (NVIDIA#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (NVIDIA#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (NVIDIA#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
artbataev pushed a commit to artbataev/NeMo that referenced this pull request Oct 22, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (NVIDIA#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (NVIDIA#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (NVIDIA#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (NVIDIA#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* examples code is at wrong dir, move them

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (NVIDIA#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (NVIDIA#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (NVIDIA#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (NVIDIA#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (NVIDIA#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (NVIDIA#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (NVIDIA#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* hot fix on table style

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
akoumpa added a commit that referenced this pull request Oct 24, 2024
akoumpa added a commit that referenced this pull request Oct 24, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* examples code is at wrong dir, move them

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* hot fix on table style

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
yashaswikarnati pushed a commit that referenced this pull request Oct 24, 2024
yashaswikarnati pushed a commit that referenced this pull request Oct 24, 2024
titu1994 pushed a commit that referenced this pull request Oct 28, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* example code is in the wrong dir; move it

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file with github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
titu1994 pushed a commit that referenced this pull request Oct 28, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* example code is in the wrong dir; move it

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file with github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* hot fix on table style

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (NVIDIA#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (NVIDIA#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (NVIDIA#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (NVIDIA#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* example code is in the wrong dir; move it

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (NVIDIA#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (NVIDIA#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (NVIDIA#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (NVIDIA#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (NVIDIA#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (NVIDIA#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (NVIDIA#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file with github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (NVIDIA#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (NVIDIA#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (NVIDIA#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (NVIDIA#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* example code is in the wrong dir; move it

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (NVIDIA#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (NVIDIA#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (NVIDIA#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (NVIDIA#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (NVIDIA#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (NVIDIA#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (NVIDIA#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file with github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* hot fix on table style

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Hainan Xu <hainanx@nvidia.com>
HuiyingLi pushed a commit to HuiyingLi/NeMo that referenced this pull request Nov 15, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

HuiyingLi pushed a commit to HuiyingLi/NeMo that referenced this pull request Nov 15, 2024
* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

ericharper added a commit that referenced this pull request Nov 19, 2024
* nemo2-sft notebook initial draft

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* remove mixtral info

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* minor fixes

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* minor fixes

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* minor fixes

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* add import_ckpt script and minor changes

Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

* Random read for tar files in lhotse dataloaders (#10536)

* Random read for tar files in lhotse dataloaders

Signed-off-by: Nune <ntadevosyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: nune-tadevosyan <nune-tadevosyan@users.noreply.github.com>

* Solve failed tests

Signed-off-by: Nune <ntadevosyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: nune-tadevosyan <nune-tadevosyan@users.noreply.github.com>

* Adding a testcase

Signed-off-by: Nune <ntadevosyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: nune-tadevosyan <nune-tadevosyan@users.noreply.github.com>

* Some changes in tests

Signed-off-by: Nune <ntadevosyan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: nune-tadevosyan <nune-tadevosyan@users.noreply.github.com>

* removing import

Signed-off-by: Nune <ntadevosyan@nvidia.com>

---------

Signed-off-by: Nune <ntadevosyan@nvidia.com>
Signed-off-by: nune-tadevosyan <nune-tadevosyan@users.noreply.github.com>
Co-authored-by: nune-tadevosyan <nune-tadevosyan@users.noreply.github.com>

* training code for hybrid-autoregressive inference model (#10841)

* training code for hybrid-autoregressive inference model

Signed-off-by: Hainan Xu <hainanx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hainan-xv <hainan-xv@users.noreply.github.com>

---------

Signed-off-by: Hainan Xu <hainanx@nvidia.com>
Signed-off-by: hainan-xv <hainan-xv@users.noreply.github.com>
Co-authored-by: Hainan Xu <hainanx@nvidia.com>
Co-authored-by: hainan-xv <hainan-xv@users.noreply.github.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 772faca ! (#10871)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>

* Use trainer.local_rank/global_rank (#10860)

* fix global_rank calculation

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use trainer's global/local rank

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove stacking operation from batched functions (#10524)

* remove stacking operations

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* fixes im base class

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* clean up

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: lilithgrigoryan <lilithgrigoryan@users.noreply.github.com>

* remove potentially uninitialized local variable

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* restore batch_intilize states funcname

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* fix typo

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* fix potentially uninitialized local variable

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* fix potentially uninitialized local variable
in stateless transducer

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* fix test

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: lilithgrigoryan <lilithgrigoryan@users.noreply.github.com>

* fix docstring, rm comment

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

* fix docstrings

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>

---------

Signed-off-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Signed-off-by: lilithgrigoryan <lilithgrigoryan@users.noreply.github.com>
Co-authored-by: lilithgrigoryan <lgrigoryan@nvidia.com>
Co-authored-by: lilithgrigoryan <lilithgrigoryan@users.noreply.github.com>

* [NeMo-UX] Add llm.generate to nemo.collections.llm (#10471)

* Add llm.generate

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Remove comment

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Fix launching with python

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* PR feedback

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* PR feedback

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Add assert cp

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Add example script

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>

* Adding support for LightningDataModule inside Fabric-API (#10879)

* Make FabricMegatronMixedPrecision match MegatronMixedPrecision

Signed-off-by: Marc Romeijn <mromeijn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

* Supporting DataModule in fabric-API

Signed-off-by: Marc Romeijn <mromeijn@nvidia.com>

* Adding support for LightningDataModule inside Fabric-API

Signed-off-by: Marc Romeijn <mromeijn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

* Remove import in mock.py

Signed-off-by: Marc Romeijn <mromeijn@nvidia.com>

---------

Signed-off-by: Marc Romeijn <mromeijn@nvidia.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>

* initial draft

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial local run

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial local run

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial local run

Signed-off-by: smajumdar <titu1994@gmail.com>

* Initial local run

Signed-off-by: smajumdar <titu1994@gmail.com>

* Save yaml config for model in nemo.lightning.io (#10765)

* Save yaml config for model in nemo.lightning.io

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Fix bug

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Fix bug

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix bug

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Add explicit yaml comparison

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* relax test

Signed-off-by: Hemil Desai <hemild@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>

* Move collections.nlp imports inline for t5 (#10877)

* Move collections.nlp imports inline for t5

Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

---------

Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>

* add world_size/pp_size runtime check (#10842)

* add world_size/pp_size runtime check

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix msg precision

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix test_init_parallel_ranks ws=3 pp=3

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix peft resume (#10887)

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Update engine build step for TRT-LLM 0.13.0 (#10880)

* Setting use_fused_mlp for TRT-LLM >= 0.13.0

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Unused import removal

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Akoumparouli/nemo ux moe loss logging (#10128)

* Move across pipeline loss reduction to a separate function

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Add support for MoE loss logging

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused function

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>

* enable vboost and set LM SM margin (#10853)

* enable vboost

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* env vars

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* add perf plugin

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com>

* revert default executor

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com>

* fix typo

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* fix more typos

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* ln margin knob

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com>

* specify lm margin

Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com>

---------

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>
Signed-off-by: malay-nagda <164242706+malay-nagda@users.noreply.github.com>
Signed-off-by: Jimmy Zhang <jiemingz@nvidia.com>
Signed-off-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com>
Co-authored-by: malay-nagda <malay-nagda@users.noreply.github.com>
Co-authored-by: Jimmy Zhang <jiemingz@nvidia.com>
Co-authored-by: JimmyZhang12 <JimmyZhang12@users.noreply.github.com>

* use _get_extra_te_kwargs_meta in fabric (call mcore's _get_extra_te_k… (#10608)

* use _get_extra_te_kwargs_meta in fabric (call mcore's _get_extra_te_kwargs & overwrite device)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>

* Use torch sdpa implementation in ASR mha (#9590)

* use pytorch sdpa

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* sdpa work

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: titu1994 <titu1994@users.noreply.github.com>

* sdpa flag to false & sdpa_backend arg

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* change arg name

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* fix config args

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* add condition on version

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* update condition on version

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* remove condition on torch version

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* move code to init

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* refactor

Signed-off-by: WoodieDudy <goshagks@gmail.com>

* Apply isort and black reformatting

Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>

* refactor

Signed-off-by: WoodieDudy <goshagks@gmail.com>

---------

Signed-off-by: WoodieDudy <goshagks@gmail.com>
Signed-off-by: titu1994 <titu1994@users.noreply.github.com>
Signed-off-by: WoodieDudy <WoodieDudy@users.noreply.github.com>
Co-authored-by: Somshubra Majumdar <titu1994@gmail.com>
Co-authored-by: titu1994 <titu1994@users.noreply.github.com>
Co-authored-by: WoodieDudy <WoodieDudy@users.noreply.github.com>
Co-authored-by: Nithin Rao <nithinrao.koluguri@gmail.com>

* Add registry to register all needed classes with artifacts in nemo.lightning.io (#10861)

* Add registry to register all needed classes with artifacts in nemo.lightning.io

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Fixes

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* comments

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Remove cyclic import

Signed-off-by: Hemil Desai <hemild@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

* call __post_init__ after altering config values (#10885)

* call __post_init__ after altering config values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* test fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* turn off SP

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Nemo 2.0 ckpt support in TRT-LLM export (#10891)

* fix minor import bug

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* Add registry to register all needed classes with artifacts in nemo.lightning.io

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Fixes

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* nemo 2.0 support in export to trt-llm

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* get mixing from main

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

* fix style

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>

---------

Signed-off-by: Onur Yilmaz <oyilmaz@nvidia.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: oyilmaz-nvidia <oyilmaz-nvidia@users.noreply.github.com>

* [Docs] Fix doc warnings, focus on feature and multimodal sections (#10171)

* various simple docs source fixes

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* fix docstrings and typing with forward reference

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: erastorgueva-nv <erastorgueva-nv@users.noreply.github.com>

* fix typing forward reference for PromptedAudioToTextLhotseDataset

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>

* fix feature warnings

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Try fix some model part errors

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* try add requirements

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* try add requirements

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix indent in docstring

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com>

* update

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* handle duplicate issue

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* handle duplicate issue

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix imagen cite

* fix ratio issues

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix Dreambooth

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* Fix activation recomputation

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix sequence packing

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fix asr_language_modeling_and_customization

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>

* fixes wip

Signed-off-by: Huiying Li <willwin.lee@gmail.com>

---------

Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Signed-off-by: erastorgueva-nv <erastorgueva-nv@users.noreply.github.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
Signed-off-by: yaoyu-33 <yaoyu-33@users.noreply.github.com>
Signed-off-by: Huiying Li <willwin.lee@gmail.com>
Signed-off-by: Yu Yao <54727607+yaoyu-33@users.noreply.github.com>
Co-authored-by: Elena Rastorgueva <erastorgueva@nvidia.com>
Co-authored-by: Elena Rastorgueva <80532067+erastorgueva-nv@users.noreply.github.com>
Co-authored-by: erastorgueva-nv <erastorgueva-nv@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: yaoyu-33 <yaoyu-33@users.noreply.github.com>
Co-authored-by: Ao Tang <aot@nvidia.com>
Co-authored-by: Huiying Li <willwin.lee@gmail.com>

* calculate step time batch end-batch end (#10202)

* log step time at end

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* use nemo logging

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* cleanup

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* check remove

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* delta timing callback

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* comment and name change

Signed-off-by: Malay Nagda <malayn@nvidia.com>

---------

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>
Co-authored-by: malay-nagda <malay-nagda@users.noreply.github.com>

* late import prettytable (#10912)

Signed-off-by: Maanu Grover <maanug@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 0d89fc4 ! (#10919)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Warning for missing FP8 checkpoint support for vLLM deployment (#10906)

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Add lhotse fixes for rnnt model training and WER hanging issue with f… (#10821)

* Add lhotse fixes for rnnt model training and WER hanging issue with f… (#10787)

* Add lhotse fixes for rnnt model training and WER hanging issue with fuse batching

Signed-off-by: Nithin Rao Koluguri <nithinraok>

* Apply isort and black reformatting

Signed-off-by: nithinraok <nithinraok@users.noreply.github.com>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: nithinraok <nithinraok@users.noreply.github.com>
Co-authored-by: Nithin Rao Koluguri <nithinraok>
Co-authored-by: nithinraok <nithinraok@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: nithinraok <nithinraok@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

---------

Signed-off-by: Nithin Rao Koluguri <nithinraok>
Signed-off-by: nithinraok <nithinraok@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: nithinraok <nithinraok@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

* Fix ASR tests (#10794)

* Make tests required

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Debug torch.load issue

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Run only necessary tests

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Try fix loading

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Avoid caching fixture

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Try restore model several times

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Try customize temporary directory

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Reorder tests

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Disable one test

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Avoid xxlarge model

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Disable test

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Revert changes

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Magic fix

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Revert unnecessary changes

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Clean up

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Disable all jobs except L0

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* RNNT alignments - merge with unit tests

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Fix CUDA graph frame-looping decoder to handle non-CUDA inputs

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Fix config

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Log test results

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* Use fewer audio files for tests

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

---------

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

* Integrating mcore export (#10238)

* Integrating mcore export

* Integrating mcore export

* Apply isort and black reformatting

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>

* Move trt imports in nemo.collections.llm inside respective functions (#10234)

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Add tests for LazyNeMoIterator and fix case with metadata_only=True and offsets in manifest (#10198)

* Add tests for LazyNeMoIterator and fix case with manifest_only=True and offsets in manifest

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* Address code review

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* fix tests

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

---------

Signed-off-by: Piotr Żelasko <petezor@gmail.com>

* [NeMo-UX] Fix a serialization bug that prevents users from moving checkpoints (#9939)

* perform serialization using relative paths to allow users to move checkpoints after they're saved

Signed-off-by: ashors1 <ashors@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ashors1 <ashors1@users.noreply.github.com>

* remove unused import

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix artifact load

Signed-off-by: ashors1 <ashors@nvidia.com>

* fix path artifact

Signed-off-by: ashors1 <ashors@nvidia.com>

* remove unused import

Signed-off-by: ashors1 <ashors@nvidia.com>

---------

Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>

* Add MemoryProfileCallback (#10166)

* Add MemoryProfileCallback

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

* Remove reference cycles, save snapshot on specific ranks

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

* Remove unnecessary imports

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

* Update docstring

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>

---------

Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>

* Lower bound transformers to support nemotron (#10240)

Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com>

* [Audio] SSL Pretraining framework for flow-matching model for audio processing (#10052)

Flow matching generative model with SSL pretraining framework

Signed-off-by: Pin-Jui Ku <pku@nvidia.com>
Co-authored-by: Kuray107 <Kuray107@users.noreply.github.com>

* Revert torchrun fix for model import (#10251)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* [NeMo-UX] Move nemotron imports inline (#10255)

* Move nemotron transformers + tokenizer imports inline to reduce number of required deps

Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>

---------

Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>

* Wrap CPU model init with megatron_lazy_init_context (#10219)

* Wrap CPU model init with megatron_lazy_init_context

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Cleanup checkpoint-dir if saving fails

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>

* Bump `Dockerfile.ci` (2024-08-22) (#10227)

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 124bcff !

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* fix bert flags

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

---------

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>

* salm export trtllm (#10245)

Signed-off-by: slyne deng <slyned@nvidia.com>
Co-authored-by: slyne deng <slyned@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to ef85bc9 ! (#10250)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 01ca03f ! (#10266)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>

* Load model in the target export precision by default in PTQ (#10267)

* Load model in the target export precision by default

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>

* Enable megatron_amp_O2=true to actually use half-precision

Signed-off-by: Jan Lasek <jlasek@nvidia.com>

---------

Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Jan Lasek <jlasek@nvidia.com>

* Add WandbPlugin, NsysPlugin and PreemptionPlugin to nemo.lightning.run.plugins (#10223)

* Add WandbPlugin, NsysPlugin and PreemptionPlugin to nemo.lightning.run.plugins

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Remove duplicate

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Add entity to wandb logger

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Add documentation

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Add warning

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* PR feedback

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Add comments

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>

* [NeMo-UX] Handle absolute logger directories in nemo_logger (#10259)

* handle absolute and relative logger directories

Signed-off-by: Anna Shors <ashors@nvidia.com>

* merge lines

Signed-off-by: ashors1 <ashors@nvidia.com>

---------

Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors@nvidia.com>

* Add sdxl notebook (#10139)

* Add sdxl notebook

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>

* Rename

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>

* final Update SDXL notebook

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>

---------

Signed-off-by: mingyuanm <mingyuanm@nvidia.com>

* Updating some comments

* Apply isort and black reformatting

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>

* Updating some comments

* Apply isort and black reformatting

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>

* Updating some comments

* Small change

* Apply isort and black reformatting

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>

* ADD support for layernorm1p

* Apply isort and black reformatting

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>

* Update Dockerfile.ci

Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com>

* Update Dockerfile.ci

Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com>

* Update Dockerfile.ci

Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com>

---------

Signed-off-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: Piotr Żelasko <petezor@gmail.com>
Signed-off-by: ashors1 <ashors@nvidia.com>
Signed-off-by: ashors1 <ashors1@users.noreply.github.com>
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Signed-off-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Signed-off-by: Pin-Jui Ku <pku@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Marc Romeyn <mromeijn@nvidia.com>
Signed-off-by: marcromeyn <marcromeyn@users.noreply.github.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: slyne deng <slyned@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>
Signed-off-by: Jan Lasek <jlasek@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Anna Shors <ashors@nvidia.com>
Signed-off-by: mingyuanm <mingyuanm@nvidia.com>
Signed-off-by: Shanmugam Ramasamy <111910568+shanmugamr1992@users.noreply.github.com>
Co-authored-by: Shanmugam Ramasamy <shanmugamr@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: shanmugamr1992 <shanmugamr1992@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: Piotr Żelasko <petezor@gmail.com>
Co-authored-by: Anna Shors <71393111+ashors1@users.noreply.github.com>
Co-authored-by: ashors1 <ashors1@users.noreply.github.com>
Co-authored-by: Shriya Rishab <69161273+ShriyaPalsamudram@users.noreply.github.com>
Co-authored-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
Co-authored-by: Dong Hyuk Chang <thomaschang26@tutanota.com>
Co-authored-by: Dong Hyuk Chang <donghyukc@nvidia.com>
Co-authored-by: Kuray107 <pku9@gatech.edu>
Co-authored-by: Kuray107 <Kuray107@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: Marc Romeyn <mromeijn@nvidia.com>
Co-authored-by: marcromeyn <marcromeyn@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Slyne Deng <slynedeng@gmail.com>
Co-authored-by: slyne deng <slyned@nvidia.com>
Co-authored-by: Jan Lasek <janek.lasek@gmail.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Ming <111467530+Victor49152@users.noreply.github.com>
Co-authored-by: Shanmugam Ramasamy <shanmugamr@shanmugamr-mlt.client.nvidia.com>

* Fix artifact saving (#10914)

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Lora improvement (#10918)

* pull out freeze model

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add wildcard match to lora target modules

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Huvu/t5 nemo2.0 peft (#10916)

* adding peft test and cicd

* add setting mcore model to train in peft.py

* adding test for T5 lora

* fix follow Chen's fix

* restore cicd-main.yml

---------

Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>

* Add tie_word_embeddings=True (#10710)

Signed-off-by: Yoshi Suhara <ysuhara@nvidia.com>

* Use a context-manager when opening files (#10895)

* Use a context-manager when opening files

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

* long context performance numbers in doc (#10784)

* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm from __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change the figure file name

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Accommodating the reviewer's comment

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the y-axis title

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to 3f90b98 ! (#10789)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Add ModelOpt transformer model pruning example for Llama models, default to llama3.1-8b-base (#10294)

* Add ModelOpt transformer model pruning example for Llama3 model

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* examples code is at wrong dir, move them

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* changes as suggested in comment

remove some logging and unused config code, update example model to
llama3.1

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Add pruning of hidden_size into example

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

* Update examples/nlp/language_modeling/conf/megatron_gpt_prune.yaml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Add pruning test to cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

* Update cicd-main.yml

Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

---------

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Update mamba.rst after dist ckpt addition (#10800)

Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix chunked infer (#10581)

Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* fix state transform (#10728)

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* use ckpt_to_weights_subdir in restore (#10786)

* use ckpt_to_weights_subdir in restore

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* make ckpt_to_{weight,context}_subdir idempotent

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Mixtral set seq_length=4k (#10704)

* enable SP & set seq_length=4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update test expected values

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* 8x22b 4k

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Fix for crashes with tensorboard_logger=false and VP + LoRA (#10792)

* Fix for crashes with tensorboard_logger=false and virtual pipeline parallel + LoRA

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: vysarge <vysarge@users.noreply.github.com>

---------

Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Disable checkpoint conversion inside AutoResume (#10645)

* Disable checkpoint conversion inside AutoResume

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>

* Update resume docstrings

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* fix

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* add default finetuning recipe and refactor llama3 8b recipe

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* address comment

Signed-off-by: Chen Cui <chcui@nvidia.com>

* refactor other recipes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* remove 8x3b finetuning recipe for now because HF version not available

Signed-off-by: Chen Cui <chcui@nvidia.com>

* add copyright header

Signed-off-by: Chen Cui <chcui@nvidia.com>

* adjust unit tests based on recipe fixes

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix failed unit test

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* replace png file to github assets

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* change image url to github release

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

---------

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>
Signed-off-by: shengliangxu <shengliangxu@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Signed-off-by: stevehuang52 <heh@nvidia.com>
Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: Valerie Sarge <vsarge@nvidia.com>
Signed-off-by: vysarge <vysarge@users.noreply.github.com>
Signed-off-by: Hemil Desai <hemild@nvidia.com>
Signed-off-by: hemildesai <hemildesai@users.noreply.github.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: oliver könig <okoenig@nvidia.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>
Co-authored-by: Shengliang Xu <106840466+shengliangxu@users.noreply.github.com>
Co-authored-by: shengliangxu <shengliangxu@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: He Huang (Steve) <105218074+stevehuang52@users.noreply.github.com>
Co-authored-by: Chen Cui <chcui@nvidia.com>
Co-authored-by: Valerie Sarge <vsarge@nvidia.com>
Co-authored-by: vysarge <vysarge@users.noreply.github.com>
Co-authored-by: Hemil Desai <hemild@nvidia.com>
Co-authored-by: hemildesai <hemildesai@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>

* perf recipes and Mcore DistOpt params (#10883)

* 175b gpt3 recipe

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* dist opt params

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* 405b dist opt params

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* perf recipes and dist opt params

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* MoE dist opt params

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* gpt bias fusion params

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* 175b recipe

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* perf params comments

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* MoE perf params comments

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

* perf recipes suffix

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* specific models fusion params

Signed-off-by: Malay Nagda <malayn@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>

---------

Signed-off-by: Malay Nagda <malayn@nvidia.com>
Signed-off-by: malay-nagda <malay-nagda@users.noreply.github.com>
Co-authored-by: malay-nagda <malay-nagda@users.noreply.github.com>

* ci: Fix cherry pick team (#10945)

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>

* Packed sequence bug fixes (#10898)

* save prepared dataset to different folders according to tokenizer name

Signed-off-by: Chen Cui <chcui@nvidia.com>

* fix hang

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* fix hang

Signed-off-by: Chen Cui <chcui@nvidia.com>

* raise mbs>1 error and provide suggestion to user instead of automatically changing config

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* add ci for packed seq

Signed-off-by: Chen Cui <chcui@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>

* fix bug

Signed-off-by: Chen Cui <chcui@nvidia.com>

---------

Signed-off-by: Chen Cui <chcui@nvidia.com>
Signed-off-by: cuichenx <cuichenx@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: cuichenx <cuichenx@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

* Fix requirements for MacOS (#10930)

Signed-off-by: Vladimir Bataev <vbataev@nvidia.com>

* Fix nemo 2.0 recipes (#10915)

* Fix recipe num_nodes and long context docstring

* Fix typo

* Fix PP issue

* Fix unit test

* Change recipes

* fix test

* Fix unit tests

* Fix recipes

* Add general legal test on parallelization settings

* Rename test

* Apply isort and black reformatting

Signed-off-by: BoxiangW <BoxiangW@users.noreply.github.com>

---------

Signed-off-by: BoxiangW <BoxiangW@users.noreply.github.com>
Co-authored-by: BoxiangW <BoxiangW@users.noreply.github.com>

* Akoumparouli/nemo ux fix dir or string artifact (#10936)

* Add __repr__ to Artifact

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* nemo.lightning.io.artifact: represent strings as fdl.Config to avoid path adjustment during restoration

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* t5 test minification

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>

* ckpt convert bug fixes (#10878)

* Mistral-NeMo-12B recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rename mistral to mistral_7b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* include mistral_nemo_12b in __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* add to __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* Remove stale imports

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* TP=2

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove finetune_recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Rename MistralNeMo2407Config12B to MistralNeMoConfig12B per review's suggestion

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update config names in tests

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* mistral-nemo-12b from llama_8b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* TP=2; SP=True

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix overlap value

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* update mistral-nemo-base-12b finetune recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* bug fix

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

* remove extra file

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove extra changes

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* revert changes

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add ckpt_format configurable

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

* Apply isort and black reformatting

Signed-off-by: artbataev <artbataev@users.noreply.github.com>

* revert changes

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Apply isort and black reformatting

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>
Signed-off-by: artbataev <artbataev@users.noreply.github.com>
Co-authored-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com>
Co-authored-by: artbataev <artbataev@users.noreply.github.com>

* fix typo in docstring (#10955)

Signed-off-by: ashors1 <ashors@nvidia.com>

* remove deprecated ci tests (#10922)

* remove deprecated tutorial

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove deprecated ci tests

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add deprecation note

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add deprecation note

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove bart tests

Signed-off-by: dimapihtar <dpihtar@gmail.com>

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* [Nemo CICD] Remove deprecated tests (#10960)

* remove deprecated tutorial

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove deprecated ci tests

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add deprecation note

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* add deprecation note

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* remove bart tests

Signed-off-by: dimapihtar <dpihtar@gmail.com>

* Remove deleted CI tests

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: Pablo Garay <palenq@gmail.com>
Co-authored-by: dimapihtar <dpihtar@gmail.com>

* Adithyare/oai chat completion (#10785)

* updates

Signed-off-by: adithyare <adithyare@nvidia.com>

* open ai chat completion wip

Signed-off-by: adithyare <adithyare@nvidia.com>

* responding with model responses

Signed-off-by: adithyare <adithyare@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: arendu <arendu@users.noreply.github.com>

* also support general completion

Signed-off-by: adithyare <adithyare@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: arendu <arendu@users.noreply.github.com>

---------

Signed-off-by: adithyare <adithyare@nvidia.com>
Signed-off-by: arendu <arendu@users.noreply.github.com>
Co-authored-by: arendu <arendu@users.noreply.github.com>

* Update megatron_t5_pretraining.py (#10952)

Signed-off-by: Huy Vu <86480512+huvunvidia@users.noreply.github.com>

* Convert perf plugin env vars to strings (#10947)

Signed-off-by: Hemil Desai <hemild@nvidia.com>

* disable dynamo for ddp checker (#10961)

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* [🤠]: Howdy folks, let's bump `Dockerfile.ci` to db7d37b ! (#10965)

Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: pablo-garay <7166088+pablo-garay@users.noreply.github.com>

* Mistral-NeMo-12B recipe (#10607)

* Mistral-NeMo-12B recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rename mistral to mistral_7b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* include mistral_nemo_12b in __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* add to __init__

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* Remove stale imports

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* TP=2

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove finetune_recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Rename MistralNeMo2407Config12B to MistralNeMoConfig12B per review's suggestion

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update config names in tests

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* mistral-nemo-12b from llama_8b

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* TP=2; SP=True

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix overlap value

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

* update mistral-nemo-base-12b finetune recipe

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>

* Make nemo text processing optional in TTS (#10584)

* move TN guard to better location; make guard print error message rather than throwing error

Signed-off-by: Jason <jasoli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: blisc <blisc@users.noreply.github.com>

* Forgot to add the actual normalizer

Signed-off-by: Jason <jasoli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: blisc <blisc@users.noreply.github.com>

---------

Signed-off-by: Jason <jasoli@nvidia.com>
Signed-off-by: blisc <blisc@users.noreply.github.com>
Co-authored-by: blisc <blisc@users.noreply.github.com>

* respect warnings' filters (#10953)

* respect warnings' filters

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>

* Update T5 tokenizer (adding additional tokens to tokenizer config) (#10972)

* initial commit

* restore t5_pretraining

* Apply isort and black reformatting

Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com>

---------

Signed-off-by: huvunvidia <huvunvidia@users.noreply.github.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: huvunvidia <huvunvidia@users.noreply.github.com>

* Alit/mamba recipe (#10935)

* add some mamba recipe

* add 130m

* add the rest of the recipes

* add tokenizer

* add tokenizer

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* minor fix

* add fixes to ssm for nemorun recipes

* add hybrid tokenizer

* updating some recipes

* Apply isort and black reformatting

Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com>

* remove comments

* update gbs

* fix ckpt resume

* fix ckpt resume

* fix ckpt resume

* update recipes final

* Apply isort and black reformatting

Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com>

* remove redundant imports

* ckpt convertor dtype fix

* Apply isort and black reformatting

Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com>

---------

Signed-off-by: JRD971000 <JRD971000@users.noreply.github.com>
Signed-off-by: Ali Taghibakhshi <71892896+JRD971000@users.noreply.github.com>
Co-authored-by: JRD971000 <JRD971000@users.noreply.github.com>

* Long context performance doc hot fix (#10946)

* long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* update the long context perf

Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* Akoumparouli/mcore microbatch calculator fix (#10780)

* move tests/lightning/{,_}io

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* use microbatch calculator context manager

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add on_load_checkpoint test to ValidateModelRestoration; use ctx manager to reconfigure microbatch calculator; update save/restore path; add cleanup step at the end

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove unused var

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* Apply isort and black reformatting

Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: akoumpa <akoumpa@users.noreply.github.com>
Co-authored-by: akoumpa <akoumpa@users.noreply.github.com>
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>

* remove 8x3b recipes (#10764)

* remove 8x3b recipes

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* remove 8x3b from test_nemo_run

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm fr…