Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[blocked by refactor] [WIP] graceful shutdown signal handling #2165

Closed
wants to merge 59 commits into from

Conversation

awaelchli
Copy link
Contributor

@awaelchli awaelchli commented Jun 12, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1999
Fixes #2913
maybe also fixes #2590 (need to check)
maybe also fixes #3275 (need to check)

  • simplifies graceful shutdown (SIGINT is equivalent to KeyboardInterrupt)
  • moved all signal handling to one place
  • restored original signal handlers after training finishes

TODO:

  • docs

@mergify mergify bot requested a review from a team June 12, 2020 22:51
@awaelchli awaelchli added the bug Something isn't working label Jun 12, 2020
Copy link
Contributor

@tullie tullie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing this it'll let me fix the logging finalize bug i'm looking into.

I think long term we need to put more thought into signal handling. Ideally, we'd have a global signal handling teardown function that closes everything nicely whether we're in the training loop, evaluate loop or in between.

pytorch_lightning/trainer/training_loop.py Outdated Show resolved Hide resolved
@mergify mergify bot requested a review from a team June 13, 2020 16:19
@awaelchli
Copy link
Contributor Author

I found out that in the kill signal handler, we need to call sys.exit(), otherwise the process hangs after tests completed. It seems the test suite sends one of these kill signals...
@Borda I merged your branch to check that CI works.

@awaelchli awaelchli marked this pull request as ready for review June 13, 2020 21:37
@awaelchli
Copy link
Contributor Author

@justusschock I noticed that the try-catch block for the KeyboardInterrupt is around almost all the code in the train method. Do you think we could simply wrap the try catch around where we call train()?

@codecov
Copy link

codecov bot commented Jun 13, 2020

Codecov Report

Merging #2165 into master will decrease coverage by 3%.
The diff coverage is 91%.

@@           Coverage Diff           @@
##           master   #2165    +/-   ##
=======================================
- Coverage      90%     87%    -3%     
=======================================
  Files          81      81            
  Lines        7644    7435   -209     
=======================================
- Hits         6878    6433   -445     
- Misses        766    1002   +236     

@Borda Borda changed the title graceful shutdown with atexit handler [blocked by #2176] graceful shutdown with atexit handler Jun 13, 2020
@Borda Borda changed the title [blocked by #2176] graceful shutdown with atexit handler graceful shutdown with atexit handler Jun 14, 2020
@williamFalcon
Copy link
Contributor

@awaelchli i think this is WIP no?
let's make sure 100% that this is correct before merging so we don't add a major bug to 0.8.0? otherwise, we should hold off until 0.9.0

@awaelchli
Copy link
Contributor Author

It's finished except I don't know how to test these signals in CI and slurm.

@pep8speaks
Copy link

pep8speaks commented Jun 14, 2020

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-16 16:15:27 UTC

@mergify
Copy link
Contributor

mergify bot commented Aug 24, 2020

This pull request is now in conflict... :(

@awaelchli awaelchli changed the title [WIP] graceful shutdown signal handling [blocked by refactor] [WIP] graceful shutdown signal handling Aug 29, 2020
Copy link
Member

@Borda Borda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 🐰

@@ -364,7 +354,8 @@ def train(self):
# model hooks
model.on_train_start()

try:
if True: # just here to enable easier merging. TODO: remove last minute
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not forget this one...

@Borda Borda added the ready PRs ready to be merged label Sep 15, 2020
@mergify mergify bot requested a review from a team September 15, 2020 17:52
@Borda
Copy link
Member

Borda commented Sep 15, 2020

is still some extra docs missing?

@awaelchli
Copy link
Contributor Author

@Borda this is not ready to go. this pr is completely destroyed now, there are too many conflicts. I have to start from scratch and send a new PR, I'll keep it open for now so I don't forget. I will work on it very soon, I promise.

@awaelchli awaelchli removed the ready PRs ready to be merged label Sep 15, 2020
@awaelchli awaelchli mentioned this pull request Sep 23, 2020
7 tasks
@awaelchli awaelchli closed this Sep 23, 2020
@Borda Borda deleted the bugfix/atexit branch September 30, 2020 13:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
5 participants