Add StragglerDetection and FTlauncher to NeMo2.0 #11117
base: main
from nemo.utils.import_utils import safe_import

res_module, HAVE_RES = safe_import('nvidia_resiliency_ext.ptl_resiliency')
Since this will be in all recipes now, and we don't have 'nvidia_resiliency_ext' as a hard dependency yet, should we return 'None' or some dummy callback if the package isn't installed?
That would suppress the ability to catch issues with 'nvidia_resiliency_ext'. It will also make debugging harder for users.
Some options come to mind:
- Make 'nvidia_resiliency_ext' a hard dependency for all training jobs.
- Disable resiliency features by default, since the package is not a hard requirement, and document how to enable them if needed.
- Silently disable resiliency features if 'nvidia_resiliency_ext' is not present, with a log message stating that the features have been disabled (least favorite option).
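To make the trade-off between these options concrete, here is a minimal sketch, not the code in this PR. It uses a hypothetical `try_import` helper built on the standard library as a stand-in for `safe_import`, and a hypothetical `resiliency_callbacks` function whose `enabled`, `on_missing`, and `make_callback` parameters are assumptions for illustration only. Features are off by default (second option); when enabled and the package is missing, the caller chooses between failing loudly and logging a warning (third option).

```python
import importlib
import logging

logger = logging.getLogger(__name__)


def try_import(module_name):
    """Hypothetical stand-in for safe_import: return (module, True) or (None, False)."""
    try:
        return importlib.import_module(module_name), True
    except ImportError:
        return None, False


def resiliency_callbacks(
    make_callback,
    enabled=False,
    on_missing="raise",
    module_name="nvidia_resiliency_ext.ptl_resiliency",
):
    """Hypothetical helper: return the resiliency callbacks a recipe should use.

    make_callback: callable that builds a callback from the imported module.
    enabled:       resiliency features are opt-in (off by default).
    on_missing:    "raise" to fail loudly, "warn" to log and continue without them.
    """
    if not enabled:
        return []

    module, available = try_import(module_name)
    if available:
        return [make_callback(module)]

    if on_missing == "raise":
        # Surfaces installation problems instead of hiding them.
        raise ImportError(
            f"Resiliency features were enabled but '{module_name}' could not be imported."
        )

    # Degrade gracefully, but leave a trace in the logs.
    logger.warning("'%s' not found; resiliency features are disabled.", module_name)
    return []
```

With `on_missing="raise"`, a user who explicitly enables the feature still sees an import failure immediately, which addresses the debugging concern raised above; `on_missing="warn"` corresponds to the least favorite option.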
[🤖]: Hi @ShriyaPalsamudram 👋, we wanted to let you know that a CICD pipeline for this PR just finished successfully, so it might be time to merge this PR or get some approvals. I'm just a bot, so I'll leave it to you what to do next. //cc @pablo-garay @ko3n1g
6382449 to a14a642
@maanug-nv to introduce an abstraction that reduces the number of files that must change when adding a callback to all recipes.
Signed-off-by: Shriya Palsamudram <spalsamudram@nvidia.com>
Signed-off-by: ShriyaPalsamudram <ShriyaPalsamudram@users.noreply.github.com>
598d275 to eeb9414
What does this PR do?
Add StragglerDetection and FTlauncher to NeMo2.0
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI, remove the label and add it again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
The contributor guidelines list specific people who can review PRs in various areas.
Additional Information