Fixes for lightning 2.0 upgrade #7176
Conversation
044469b to d8c23ac Compare
Jenkinsfile (Outdated)
@@ -3603,6 +3603,7 @@ assert_frame_equal(training_curve, gt_curve, rtol=1e-3, atol=1e-3)"'''
trainer.precision=16 \
trainer.gradient_clip_val=1.0 \
exp_manager.exp_dir=examples/nlp/language_modeling/gpt_pretrain_results \
exp_manager.resume_if_exits=False \
resume_if_exists*
Thanks @titu1994, corrected the typo. The reason for setting this to False is that this test (Megatron GPT Pretraining and Resume Training PP=2) was picking up the last checkpoint from the previous test (Megatron GPT with KERPLE Pretraining and Resume Training TP=2) and erroring out, because it then tried to find that checkpoint in the current test's exp_dir but couldn't.
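For context, here is a minimal sketch of how this flag could be wired up on the Python side, assuming NeMo's `exp_manager` utility; the paths and trainer settings are illustrative, not the exact CI configuration:

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
from nemo.utils.exp_manager import exp_manager

# NeMo convention: the Trainer is created without its own logger/checkpointing,
# and exp_manager takes over experiment directories, logging and checkpoints.
trainer = pl.Trainer(devices=1, max_steps=10, logger=False, enable_checkpointing=False)

exp_cfg = OmegaConf.create(
    {
        "exp_dir": "examples/nlp/language_modeling/gpt_pretrain_results",
        # With resume_if_exists=True, exp_manager looks for an existing last
        # checkpoint under exp_dir and tries to resume from it; False keeps
        # this CI test independent of checkpoints left behind by earlier runs.
        "resume_if_exists": False,
    }
)
exp_manager(trainer, exp_cfg)
```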
This was happening because I had commented these lines out by mistake in the lightning 2.0 upgrade PR: https://github.com/NVIDIA/NeMo/blob/main/Jenkinsfile#L3583-L3584. Fixing it ASAP in a separate PR.
2cdeb16 to d0140e2 Compare
09c2033 to 3a747c8 Compare
636d022 to 552b3b7 Compare
5d20bc2 to ecb0642 Compare
Signed-off-by: Abhishree <abhishreetm@gmail.com>
…for NMT Signed-off-by: Abhishree <abhishreetm@gmail.com>
1) Add resume_if_exists=False for Megatron GPT Pretraining and Resume Training PP=2 as it can resume from the checkpoint of the previous model test in CI leading to Error Signed-off-by: Abhishree <abhishreetm@gmail.com>
1) Remove arg optimizer_idx in optimizer_step func as the arg is not used by parent func of lightning Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
…en(dataloader) Signed-off-by: Abhishree <abhishreetm@gmail.com>
Signed-off-by: Abhishree <abhishreetm@gmail.com>
86c60d5 to 20a0275 Compare
for more information, see https://pre-commit.ci
LGTM. Thanks!
* Remove trainer._checkpoint_connector = _CheckpointConnector(trainer)
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Remove trainer._checkpoint_connector = _CheckpointConnector(trainer) for NMT
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Add resume_if_exists=False in JenkinsFile
  1) Add resume_if_exists=False for Megatron GPT Pretraining and Resume Training PP=2 as it can resume from the checkpoint of the previous model test in CI leading to Error
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Remove optimizer_idx in optimizer_step and fix typo
  1) Remove arg optimizer_idx in optimizer_step func as the arg is not used by parent func of lightning
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Remove resume_if_exists=False in JenkinsFile
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Make trainer.val_check_interval=1 for few tests in JenkinsFile
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Change val_check_interval in JenkinsFile during resume to less than len(dataloader)
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* Change val_check_interval to 1 for Megatron T5 with ALiBi resume
  Signed-off-by: Abhishree <abhishreetm@gmail.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Signed-off-by: Abhishree <abhishreetm@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
What does this PR do?
Fix the following issues with the previous lightning 2.0 upgrade PR:

1. Set `val_check_interval` to 1 while resuming from a ckpt in some CI tests where `max_steps` is the same as in the original training run, since lightning 2.0 errors out otherwise: fit_loop.py#L259. Lightning 2.0 calls `setup_data` right at the beginning of `fit_loop.run`, unlike 1.9, which did it only when `self.skip` for fit was False, so the error didn't show up with 1.9.
2. Remove `trainer._checkpoint_connector` as it's redundant in lightning 2.0 and use `trainer.ckpt_path` directly. Having `trainer._checkpoint_connector = _CheckpointConnector(trainer)` also overrode `ckpt_path` with None, which meant training started from scratch when resuming from a ckpt (see the first sketch after this list).
3. Remove the `optimizer_idx` arg in `optimizer_step` of the `MegatronHalfPrecisionPlugin` class, as the parent function in lightning does not have it anymore (see the second sketch after this list).

Collection: [Note which collection this PR will affect]
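A minimal sketch of the resume wiring in item 2, assuming a Lightning 2.x `Trainer`; the checkpoint path and trainer arguments here are hypothetical, not the actual NeMo code:

```python
import pytorch_lightning as pl

trainer = pl.Trainer(max_steps=100)
resume_ckpt = "/results/checkpoints/last.ckpt"  # hypothetical checkpoint path

# Old wiring (removed by this PR): re-instantiating the connector reset the
# stored ckpt_path to None, so resumed runs silently started from scratch.
# from pytorch_lightning.trainer.connectors.checkpoint_connector import _CheckpointConnector
# trainer._checkpoint_connector = _CheckpointConnector(trainer)

# Lightning 2.0 wiring: assign the checkpoint path directly on the trainer.
trainer.ckpt_path = resume_ckpt
```

And a schematic sketch of item 3, assuming the lightning 2.0 `MixedPrecisionPlugin.optimizer_step` signature; `MyHalfPrecisionPlugin` is an illustrative subclass, not the actual `MegatronHalfPrecisionPlugin` implementation:

```python
from typing import Any, Callable

import pytorch_lightning as pl
import torch
from pytorch_lightning.plugins.precision import MixedPrecisionPlugin


class MyHalfPrecisionPlugin(MixedPrecisionPlugin):  # hypothetical subclass
    # In lightning 2.0 the parent optimizer_step no longer takes optimizer_idx,
    # so the override drops that argument as well.
    def optimizer_step(
        self,
        optimizer: torch.optim.Optimizer,
        model: "pl.LightningModule",
        closure: Callable[[], Any],
        **kwargs: Any,
    ) -> Any:
        return super().optimizer_step(optimizer, model, closure, **kwargs)
```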
Changelog
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information