[Feat] Add graceful detection of signal to exit + SignalConnector and merge SlurmConnector. #9566
Conversation
Build Error! No Linked Issue found. Please link an issue or mention it in the body using #<issue_id>
pytorch_lightning/trainer/connectors/fault_tolerant_connector.py
Codecov Report
@@            Coverage Diff            @@
##           master   #9566     +/-   ##
=========================================
- Coverage      93%     89%        -4%
=========================================
  Files         180     181         +1
  Lines       15094   15341       +247
=========================================
- Hits        14021   13623       -398
- Misses       1073    1718       +645
Does signal handling generalize across all clusters/schedulers?
Our scheduler offers a dedicated mechanism to check for preemption, which we've integrated into the model checkpoint callback on batch end to decide whether we ought to save a checkpoint for that batch. But I also thought this was atypical for cloud-provider workloads, where the job scheduler could kill the job at any point.
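For illustration only (not part of this PR), a minimal sketch of that batch-end preemption check. The is_preemption_imminent helper is a hypothetical stand-in for the scheduler's real preemption query, and Callback hook signatures vary slightly across Lightning versions:

from pytorch_lightning import Callback


def is_preemption_imminent() -> bool:
    """Hypothetical scheduler query; replace with the cluster scheduler's real API."""
    return False


class PreemptionCheckpoint(Callback):
    """Save a checkpoint at a batch boundary when the scheduler signals preemption."""

    def __init__(self, ckpt_path: str = "preempt.ckpt") -> None:
        self.ckpt_path = ckpt_path

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        # Checking only at batch end keeps the save at a well-defined point
        # in the loop instead of interrupting mid-step.
        if is_preemption_imminent():
            trainer.save_checkpoint(self.ckpt_path)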
I think this looks great! Thanks for adding the mechanism @tchaton!
… merge SlurmConnector. (#9566) Co-authored-by: Sean Naren <sean@grid.ai>
log.info("Set SLURM handle signals.")
sigusr1_handlers.append(self.slurm_sigusr1_handler_fn)

sigterm_handlers.append(self.sigterm_handler_fn)
This line got unindented; it was previously under the SLURM check. It means that Lightning can't be killed by SIGTERM.
As best as I can judge, this is not required for fault-tolerant training.
See #10154 for context.
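For clarity, a standalone sketch of the nesting this comment describes. The on_slurm flag, the handler functions, and the direct signal.signal registration are illustrative simplifications of the connector's handler-list mechanism, not the actual source:

import logging
import signal

log = logging.getLogger(__name__)


def _sigusr1_handler(signum, frame):
    log.info("SIGUSR1 received: requeueing job.")


def _sigterm_handler(signum, frame):
    log.info("SIGTERM received: bypassing, SLURM will requeue.")


def register_signal_handlers(on_slurm: bool) -> None:
    """Register handlers only under SLURM; elsewhere SIGTERM keeps its default
    behaviour so the process can still be killed normally."""
    if on_slurm:
        log.info("Set SLURM handle signals.")
        signal.signal(signal.SIGUSR1, _sigusr1_handler)
        # The review comment's point: this registration used to live inside the
        # SLURM branch. Unindenting it so it always runs means SIGTERM is
        # intercepted on every cluster and a plain `kill` no longer stops Lightning.
        signal.signal(signal.SIGTERM, _sigterm_handler)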
What does this PR do?
Right now, when the script is killed, a fault-tolerant checkpoint is saved.
However, Lightning can't always ensure this happens within a reproducible part of the codebase.
When running in the cloud, there is a 2-3 minute window before the script is killed.
The proposal is to add a mechanism that detects that a kill signal has been sent, so Lightning can terminate at the next reproducible point within this time window (see the sketch after this description).
Part of #9567
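A minimal standalone sketch of the flag-based idea, not the actual SignalConnector implementation: the handler only records that the signal arrived, and the loop acts on it at the next safe boundary. The step loop and checkpoint call below are placeholders:

import signal
import time


class GracefulExitFlag:
    """Record that a termination signal arrived; the training loop checks the flag
    at the next safe boundary instead of being killed mid-step."""

    def __init__(self) -> None:
        self.received = False

    def install(self, sig: int = signal.SIGTERM) -> None:
        signal.signal(sig, self._handler)

    def _handler(self, signum, frame) -> None:
        # Do as little as possible inside the signal handler: just set a flag.
        self.received = True


if __name__ == "__main__":
    flag = GracefulExitFlag()
    flag.install(signal.SIGTERM)

    for step in range(1000):
        time.sleep(0.1)  # stand-in for one training step
        if flag.received:  # checked only at a step boundary
            print(f"Signal received, saving checkpoint at step {step} and exiting.")
            # a fault-tolerant checkpoint save would go here in real code
            break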
Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:
Did you have fun?
Make sure you had fun coding 🙃