
Add SLURM check in ddp_train() and init_ddp_connection() #1387

Merged
merged 10 commits into from
Apr 19, 2020

Conversation

areshytko
Contributor

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1345 .

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@mergify mergify bot requested a review from a team April 6, 2020 01:35
@Borda Borda added the feature Is an improvement or enhancement label Apr 6, 2020
Member

@Borda Borda left a comment


It looks like we should refactor init_ddp_connection so its method body lives in a private method.

Resolved review threads (outdated):
  • pytorch_lightning/core/lightning.py
  • pytorch_lightning/core/lightning.py
  • pytorch_lightning/core/lightning.py
  • pytorch_lightning/trainer/distrib_data_parallel.py
@mergify
Contributor

mergify bot commented Apr 6, 2020

This pull request is now in conflict... :(

@Borda Borda requested review from ethanwharris, neggert and a team April 6, 2020 21:14
  try:
-     node_id = os.environ['SLURM_NODEID']
+     node_id = os.environ['SLURM_NODEID'] if self.is_slurm_managing_tasks else os.environ['RANK']
Contributor


to make this generic shouldn't we call this NODE_RANK?

Contributor Author

@areshytko areshytko Apr 7, 2020


@williamFalcon RANK is an environment variable required by the PyTorch distributed env:// initialization method (see the PyTorch docs).

The main idea is this: if RANK is defined (along with the other required variables), plain PyTorch works but PyTorch Lightning does not. Because of that, there are environments built for PyTorch where you can't use PyTorch Lightning, for example Kubeflow PyTorchJob.
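The fallback being discussed can be sketched as a small helper. This is an illustrative sketch only (the helper name `resolve_node_rank` is hypothetical, not Lightning's actual API): under SLURM the node id comes from `SLURM_NODEID`, while a generic env:// setup such as Kubeflow PyTorchJob supplies `RANK` instead.

```python
import os


def resolve_node_rank(is_slurm_managing_tasks: bool) -> int:
    """Pick the node rank from the scheduler's environment.

    Hypothetical helper illustrating the fallback this PR adds:
    SLURM exports SLURM_NODEID, whereas torch.distributed's env://
    init method (used by e.g. Kubeflow PyTorchJob) relies on RANK.
    """
    if is_slurm_managing_tasks:
        return int(os.environ['SLURM_NODEID'])
    return int(os.environ['RANK'])


# Example: a non-SLURM environment such as Kubeflow PyTorchJob
os.environ['RANK'] = '2'
print(resolve_node_rank(is_slurm_managing_tasks=False))  # 2
```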

Contributor Author


@williamFalcon Changed to NODE_RANK, since the rank is computed inside PyTorch Lightning from the node rank and GPU count, and the name RANK is misleading.

A documentation section should probably also be added alongside these changes.
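The rank computation the comment above refers to can be shown with a one-line formula. This is an illustrative sketch of the idea, not Lightning's actual code: the global process rank is derived from the node rank and the per-node GPU count, which is why exporting a raw RANK per node would be misleading.

```python
def compute_global_rank(node_rank: int, gpus_per_node: int, local_rank: int) -> int:
    # Illustrative formula: each node contributes gpus_per_node processes,
    # so the global rank of a process is its node's offset plus its
    # local (per-node) rank.
    return node_rank * gpus_per_node + local_rank


# Node 1 of a 4-GPU-per-node cluster, third process on that node:
print(compute_global_rank(node_rank=1, gpus_per_node=4, local_rank=2))  # 6
```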

@Borda Borda changed the title from "Add SLURM check in ddp_train() and init_ddp_connection()" to "[blocked by #1419] Add SLURM check in ddp_train() and init_ddp_connection()" Apr 8, 2020
@Borda Borda changed the title from "[blocked by #1419] Add SLURM check in ddp_train() and init_ddp_connection()" to "Add SLURM check in ddp_train() and init_ddp_connection()" Apr 8, 2020
@Borda
Member

Borda commented Apr 8, 2020

The changelog needs to be rebased on the new release #1419.

@mergify mergify bot requested a review from a team April 9, 2020 16:44
@areshytko areshytko requested review from williamFalcon and Borda and removed request for a team April 11, 2020 01:51
@mergify mergify bot requested a review from a team April 11, 2020 01:51
Alexander Reshytko and others added 6 commits April 11, 2020 09:42
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
Co-Authored-By: Jirka Borovec <Borda@users.noreply.github.com>
@Borda Borda force-pushed the ddp-without-slurm--1345 branch from 3fb6dee to 80ae594 Compare April 11, 2020 07:45
@codecov

codecov bot commented Apr 11, 2020

Codecov Report

Merging #1387 into master will decrease coverage by 0%.
The diff coverage is 75%.

@@          Coverage Diff           @@
##           master   #1387   +/-   ##
======================================
- Coverage      91%     90%   -0%     
======================================
  Files          67      67           
  Lines        3784    3796   +12     
======================================
- Hits         3439    3433    -6     
- Misses        345     363   +18     

@mergify
Contributor

mergify bot commented Apr 13, 2020

This pull request is now in conflict... :(

@williamFalcon williamFalcon merged commit d0c9472 into Lightning-AI:master Apr 19, 2020
Labels
feature Is an improvement or enhancement
Development

Successfully merging this pull request may close these issues.

Make Pytorch-Lightning DDP work without SLURM