[Feat] Add graceful detection of signal to exit + SignalConnector and merge SlurmConnector. #9566
Conversation
Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>
pytorch_lightning/trainer/connectors/fault_tolerant_connector.py
Codecov Report
@@            Coverage Diff            @@
##           master   #9566     +/-   ##
=========================================
- Coverage      93%     89%        -4%
=========================================
  Files         180     181         +1
  Lines       15094   15341       +247
=========================================
- Hits        14021   13623       -398
- Misses       1073    1718       +645
Does signal handling generalize across all clusters/schedulers?
Our scheduler offers a dedicated mechanism to check for preemption, which we've integrated into the model checkpoint callback on batch end to decide whether we ought to save a checkpoint for that batch. But I also thought this was atypical for cloud-provider workloads, where the job scheduler could kill the job at any point.
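For illustration only (not part of this PR), a minimal sketch of that batch-end preemption check. The is_preemption_imminent helper is a hypothetical stand-in for the scheduler's real preemption query, and Callback hook signatures vary slightly across Lightning versions:

from pytorch_lightning import Callback


def is_preemption_imminent() -> bool:
    """Hypothetical scheduler query; replace with the cluster scheduler's real API."""
    return False


class PreemptionCheckpoint(Callback):
    """Save a checkpoint at a batch boundary when the scheduler signals preemption."""

    def __init__(self, ckpt_path: str = "preempt.ckpt") -> None:
        self.ckpt_path = ckpt_path

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Checking only at batch end keeps the save at a well-defined point
        # in the loop instead of interrupting mid-step.
        if is_preemption_imminent():
            trainer.save_checkpoint(self.ckpt_path)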
I think this looks great! Thanks for adding the mechanism @tchaton!
… merge SlurmConnector. (#9566) Co-authored-by: Sean Naren <sean@grid.ai>
log.info("Set SLURM handle signals.")
sigusr1_handlers.append(self.slurm_sigusr1_handler_fn)

sigterm_handlers.append(self.sigterm_handler_fn)
This line got unindented; it was previously under the SLURM check. It means that Lightning can't be killed by SIGTERM.
As best as I can judge, this is not required for fault-tolerant training.
See #10154 for context.
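For clarity, a standalone sketch of the nesting this comment describes. The on_slurm flag, the handler functions, and the direct signal.signal registration are illustrative simplifications of the connector's handler-list mechanism, not the actual source:

import logging
import signal

log = logging.getLogger(__name__)


def _sigusr1_handler(signum, frame):
    log.info("SIGUSR1 received: requeueing job.")


def _sigterm_handler(signum, frame):
    log.info("SIGTERM received: bypassing, SLURM will requeue.")


def register_signal_handlers(on_slurm: bool) -> None:
    """Register handlers only under SLURM; elsewhere SIGTERM keeps its default
    behaviour so the process can still be killed normally."""
    if on_slurm:
        log.info("Set SLURM handle signals.")
        signal.signal(signal.SIGUSR1, _sigusr1_handler)
        # The review comment's point: this registration used to live inside the
        # SLURM branch. Unindenting it so it always runs means SIGTERM is
        # intercepted on every cluster and a plain `kill` no longer stops Lightning.
        signal.signal(signal.SIGTERM, _sigterm_handler)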
What does this PR do?
Right now, when the script is killed, a fault-tolerant checkpoint is saved.
However, Lightning can't always ensure this happens within a reproducible part of the codebase.
When running in the cloud, there is a 2-3 minute window before the script is killed.
The proposal is to add a mechanism that detects that a kill signal has been sent, so Lightning can terminate at the next reproducible point within this time window (see the sketch after this description).
Part of #9567
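A minimal standalone sketch of the flag-based idea, not the actual SignalConnector implementation: the handler only records that the signal arrived, and the loop acts on it at the next safe boundary. The step loop and checkpoint call below are placeholders:

import signal
import time


class GracefulExitFlag:
    """Record that a termination signal arrived; the training loop checks the flag
    at the next safe boundary instead of being killed mid-step."""

    def __init__(self) -> None:
        self.received = False

    def install(self, sig: int = signal.SIGTERM) -> None:
        signal.signal(sig, self._handler)

    def _handler(self, signum, frame) -> None:
        # Do as little as possible inside the signal handler: just set a flag.
        self.received = True


if __name__ == "__main__":
    flag = GracefulExitFlag()
    flag.install(signal.SIGTERM)

    for step in range(1000):
        time.sleep(0.1)  # stand-in for one training step
        if flag.received:  # checked only at a step boundary
            print(f"Signal received, saving checkpoint at step {step} and exiting.")
            # a fault-tolerant checkpoint save would go here in real code
            break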
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃