Allows non-strict load with distributed checkpoints #9715

Merged


github-actions[bot]
Contributor

What does this PR do?

With distributed checkpoints, mismatches between the runtime model and the checkpointed model surface during dist_checkpoint.load, not during model.load_state_dict as they do with regular checkpoints.
This PR adds a flag that controls load strictness (e.g. it can ignore unexpected keys).
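
For illustration, the kind of mismatch described above can be sketched in plain Python. This is a minimal sketch, not NeMo or MCore code: the function `find_key_mismatches` and the example key names are hypothetical, and real distributed checkpoints track sharded tensors rather than flat key lists.

```python
def find_key_mismatches(model_keys, checkpoint_keys):
    """Compare the keys the runtime model requests with the keys stored in a checkpoint.

    Hypothetical helper for illustration only; it is not part of NeMo or MCore.
    Returns two sorted lists: keys only the model has, and keys only the checkpoint has.
    """
    model_set = set(model_keys)
    checkpoint_set = set(checkpoint_keys)
    only_in_model = sorted(model_set - checkpoint_set)
    only_in_checkpoint = sorted(checkpoint_set - model_set)
    return only_in_model, only_in_checkpoint


# Example: the runtime model gained a new head and dropped an old one.
runtime_keys = ["decoder.weight", "decoder.bias", "new_head.weight"]
stored_keys = ["decoder.weight", "decoder.bias", "old_head.weight"]
only_in_model, only_in_checkpoint = find_key_mismatches(runtime_keys, stored_keys)
```

A strict load would treat either non-empty list as an error; a non-strict load can report them and continue.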

This PR relies on an MCore feature that is not merged yet (merge ETA: 5 July): https://gitlab-master.nvidia.com/ADLR/megatron-lm/-/merge_requests/1628

Collection: NLP

Changelog

  • Add the model.dist_ckpt_load_strictness flag to control distributed checkpoint load strictness. The most useful value is log_all, which warns about all mismatches but still loads the subset of the state dict that matches the checkpoint.
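
The behaviour described for log_all can be sketched with plain dictionaries. This is a simplified illustration under stated assumptions, not the NeMo or MCore implementation: `load_with_strictness` is a hypothetical helper, and the `raise_all` and `ignore_all` modes are included only for contrast (only `log_all` is named in this PR).

```python
import logging

logger = logging.getLogger(__name__)

# Modes used in this sketch. Only "log_all" comes from the PR description;
# the other two are assumptions added for contrast.
SKETCH_MODES = ("raise_all", "log_all", "ignore_all")


def load_with_strictness(model_state, checkpoint, strictness="log_all"):
    """Return a copy of model_state updated with checkpoint values for matching keys.

    Hypothetical helper: real distributed checkpoints hold sharded tensors,
    while this sketch uses flat dictionaries.
    """
    if strictness not in SKETCH_MODES:
        raise ValueError(f"Unknown strictness mode: {strictness!r}")

    only_in_model = sorted(set(model_state) - set(checkpoint))
    only_in_checkpoint = sorted(set(checkpoint) - set(model_state))

    if only_in_model or only_in_checkpoint:
        message = (
            f"Checkpoint mismatch: only in model={only_in_model}, "
            f"only in checkpoint={only_in_checkpoint}"
        )
        if strictness == "raise_all":
            raise KeyError(message)
        if strictness == "log_all":
            logger.warning(message)
        # "ignore_all" falls through silently.

    # Load only the keys present on both sides; other model entries keep their values.
    loaded = dict(model_state)
    for key in model_state.keys() & checkpoint.keys():
        loaded[key] = checkpoint[key]
    return loaded
```

With `log_all`, a model state of `{"a": 0, "b": 0}` and a checkpoint of `{"b": 5, "c": 7}` produce a warning about `a` and `c`, and the result keeps `a` at its initial value while `b` takes the checkpoint value.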

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

* Allow non-strict load

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Point to non-strict load MCore branch

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Avoid module level StrictHandling

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Use MCore fork

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update to MCore fix

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Restore backward compatibility

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update flag defaults

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update MCore tag

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update PyT Dist interface

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

* Update to latest core_r0.8.0

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>

---------

Signed-off-by: Mikołaj Błaż <mblaz@nvidia.com>
@mikolajblaz mikolajblaz force-pushed the cherry-pick-main-423951179e73fb29f77e28120e38708535098152 branch from b2b2c2d to 81ea8b1 Compare July 12, 2024 13:28
@mikolajblaz mikolajblaz self-assigned this Jul 12, 2024
@mikolajblaz mikolajblaz merged commit 081a163 into main Jul 12, 2024
209 checks passed
@mikolajblaz mikolajblaz deleted the cherry-pick-main-423951179e73fb29f77e28120e38708535098152 branch July 12, 2024 14:56
nikitaved pushed a commit to nikitaved/NeMo that referenced this pull request Jul 16, 2024
ertkonuk pushed a commit that referenced this pull request Jul 19, 2024
Signed-off-by: Tugrul Konuk <ertkonuk@gmail.com>
malay-nagda pushed a commit to malay-nagda/NeMo that referenced this pull request Jul 26, 2024
Signed-off-by: Malay Nagda <malayn@malayn-mlt.client.nvidia.com>
tonyjie pushed a commit to tonyjie/NeMo that referenced this pull request Aug 6, 2024
Signed-off-by: tonyjie <jl4257@cornell.edu>
monica-sekoyan pushed a commit that referenced this pull request Oct 14, 2024
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 5, 2024
Signed-off-by: Hainan Xu <hainanx@nvidia.com>

1 participant