-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add SLURM check in ddp_train() and init_ddp_connection() #1387
Add SLURM check in ddp_train() and init_ddp_connection() #1387
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it looks like refactor the init_ddp_connection
to have the method body in private method
This pull request is now in conflict... :( |
try: | ||
node_id = os.environ['SLURM_NODEID'] | ||
node_id = os.environ['SLURM_NODEID'] if self.is_slurm_managing_tasks else os.environ['RANK'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to make this generic shouldn't we call this NODE_RANK
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@williamFalcon RANK
is a variable name required by pytorch distributed env://
initialization method (see pytorch doc).
The main idea is that: if RANK
is defined (with other required variables) plain pytorch would work but pytorch-lightning would not. And because of that there are environments built for pytorch where you can't use pytorch-lightning. For example, Kubeflow PyTorchJob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@williamFalcon changed to NODE_RANK
, since rank is computed inside Pytorch-Lightning based on node rank and gpu count and RANK
name is misleading.
A section in documentation should also probably be added alongside these changes.
changelog need to be rebased on new release #1419 |
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
3fb6dee
to
80ae594
Compare
Codecov Report
@@ Coverage Diff @@
## master #1387 +/- ##
======================================
- Coverage 91% 90% -0%
======================================
Files 67 67
Lines 3784 3796 +12
======================================
- Hits 3439 3433 -6
- Misses 345 363 +18 |
This pull request is now in conflict... :( |
Before submitting
What does this PR do?
Fixes #1345 .
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃