[blocked by refactor] [WIP] graceful shutdown signal handling #2165

awaelchli · 2020-06-12T22:50:59Z

Before submitting

Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
Did you read the contributor guideline, Pull Request section?
Did you make sure to update the docs?
Did you write any new necessary tests?
If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1999
Fixes #2913
maybe also fixes #2590 (need to check)
maybe also fixes #3275 (need to check)

Simplifies graceful shutdown (Python's default SIGINT handler raises KeyboardInterrupt, so the two are treated as the same case)
Moves all signal handling to one place
Restores the original signal handlers after training finishes
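
For readers who want a picture of the last two points, here is a minimal sketch, not the code from this PR. The helper name `graceful_shutdown` and its `teardown` callback are hypothetical; only the standard-library `signal` calls are real. It assumes it is called from the main thread, which is where Python allows handlers to be installed.

```python
import signal
from contextlib import contextmanager


@contextmanager
def graceful_shutdown(teardown, signals=(signal.SIGTERM,)):
    """Hypothetical helper: install handlers in one place, restore them afterwards."""

    def handler(signum, frame):
        teardown()
        # Assumption: route other signals through the same path as Ctrl+C.
        raise KeyboardInterrupt

    # signal.signal() returns the handler that was installed before, so keep it.
    previous = {sig: signal.signal(sig, handler) for sig in signals}
    try:
        yield
    finally:
        # Restore the original handlers whether training finished or was interrupted.
        for sig, old in previous.items():
            if old is not None:  # None means the old handler was not set from Python
                signal.signal(sig, old)
```

Usage would look like `with graceful_shutdown(cleanup): trainer.fit(model)`, where `cleanup` is whatever teardown the caller needs.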

TODO:

docs

tullie

Thanks for writing this; it'll let me fix the logging finalize bug I'm looking into.

I think long term we need to put more thought into signal handling. Ideally, we'd have a global signal handling teardown function that closes everything nicely whether we're in the training loop, evaluate loop or in between.
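
One possible shape for such a global teardown, sketched under assumptions: the `Teardown` class and its method names are hypothetical and are not part of Lightning. The sketch only shows the idea of a single, idempotent cleanup entry point that can be called from any loop and is also registered with `atexit`.

```python
import atexit


class Teardown:
    """Hypothetical registry of cleanup callbacks that runs at most once."""

    def __init__(self):
        self._callbacks = []
        self._done = False

    def register(self, fn):
        self._callbacks.append(fn)
        return fn

    def run(self):
        # Safe to call from a signal handler, a finally block, and atexit alike.
        if self._done:
            return
        self._done = True
        # Close resources in reverse order of registration.
        for fn in reversed(self._callbacks):
            fn()


teardown = Teardown()
# atexit runs this on normal interpreter exit, including after sys.exit().
atexit.register(teardown.run)
```

Note that `atexit` callbacks are not run when the process is killed by a signal that Python does not handle, so this would still need to be combined with explicit signal handlers.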

pytorch_lightning/trainer/training_loop.py

awaelchli · 2020-06-13T21:35:05Z

I found out that the kill signal handler needs to call sys.exit(); otherwise the process hangs after the tests complete. It seems the test suite sends one of these kill signals...
@Borda I merged your branch to check that CI works.
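
A small sketch of what "call sys.exit() in the handler" can look like. The factory name `make_kill_handler` and the `teardown` callback are hypothetical stand-ins, not names from this PR.

```python
import signal
import sys


def make_kill_handler(teardown):
    """Hypothetical factory; `teardown` stands in for whatever cleanup is needed."""

    def handler(signum, frame):
        teardown()
        # sys.exit() raises SystemExit, so the interpreter unwinds and exits
        # instead of returning to the code that was interrupted.
        sys.exit(0)

    return handler
```

It would be installed with something like `signal.signal(signal.SIGTERM, make_kill_handler(cleanup))`.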

awaelchli · 2020-06-13T21:38:40Z

@justusschock I noticed that the try-catch block for the KeyboardInterrupt is around almost all the code in the train method. Do you think we could simply wrap the try catch around where we call train()?
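
To illustrate the question, here is a sketch of handling the interrupt at the call site rather than inside the method. The function `run_training` and its three callables are hypothetical placeholders and do not reflect the actual Trainer API.

```python
def run_training(train, on_interrupt, teardown):
    """Hypothetical wrapper: interrupt handling lives around the call to train()."""
    try:
        train()
    except KeyboardInterrupt:
        # Ctrl+C (SIGINT) surfaces here as KeyboardInterrupt by default.
        on_interrupt()
    finally:
        # Runs on normal completion, on interrupt, and on any other exception.
        teardown()
```

The body of `train()` then needs no extra indentation level for the `try` block.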

codecov · 2020-06-13T22:16:35Z

Codecov Report

Merging #2165 into master will decrease coverage by 3%.
The diff coverage is 91%.

@@           Coverage Diff           @@
##           master   #2165    +/-   ##
=======================================
- Coverage      90%     87%    -3%     
=======================================
  Files          81      81            
  Lines        7644    7435   -209     
=======================================
- Hits         6878    6433   -445     
- Misses        766    1002   +236

williamFalcon · 2020-06-14T15:38:17Z

@awaelchli I think this is still WIP, no?
Let's make 100% sure this is correct before merging so we don't add a major bug to 0.8.0. Otherwise, we should hold off until 0.9.0.

awaelchli · 2020-06-14T15:47:19Z

It's finished, except that I don't know how to test these signals in CI and on SLURM.

pep8speaks · 2020-06-14T22:33:19Z

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-16 16:15:27 UTC

mergify · 2020-08-24T14:52:11Z

This pull request is now in conflict... :(

Borda

LGTM 🐰

Borda · 2020-09-15T17:49:54Z

pytorch_lightning/trainer/training_loop.py

@@ -364,7 +354,8 @@ def train(self):
            # model hooks
            model.on_train_start()

-        try:
+        if True:  # just here to enable easier merging. TODO: remove last minute


do not forget this one...

Borda · 2020-09-15T17:52:17Z

Are some extra docs still missing?

awaelchli · 2020-09-15T17:54:46Z

@Borda This is not ready to go. This PR is completely destroyed now; there are too many conflicts. I have to start from scratch and send a new PR. I'll keep this one open for now so I don't forget. I will work on it very soon, I promise.

mergify bot requested a review from a team June 12, 2020 22:51

awaelchli added the bug Something isn't working label Jun 12, 2020

tullie approved these changes Jun 13, 2020

mergify bot requested a review from a team June 13, 2020 16:19

awaelchli force-pushed the bugfix/atexit branch from 1091837 to 7f218c5 Compare June 13, 2020 17:35

awaelchli marked this pull request as ready for review June 13, 2020 21:37

Borda changed the title ~~graceful shutdown with atexit handler~~ [blocked by #2176] graceful shutdown with atexit handler Jun 13, 2020

Borda changed the title ~~[blocked by #2176] graceful shutdown with atexit handler~~ graceful shutdown with atexit handler Jun 14, 2020

Borda and others added 11 commits June 14, 2020 11:35

8e72ccf math
7d9de6d notes
bcd3877 atexit with closure
75825c0 fix decorator call
292f4dc register teardown method directly outside
1533428 signal handler args
07789d2 arg typo
1a10cb1 pull out method
51cf8fc rename signals list var
d99bd22 fix hanging process
f744285 simplify teardown with finally block

awaelchli force-pushed the bugfix/atexit branch from 34eea82 to f744285 Compare June 14, 2020 09:36

awaelchli and others added 2 commits June 14, 2020 19:16

e8a317d wip test
1024a1a keyboard interrupt and sigint are the same

awaelchli added 2 commits June 15, 2020 00:33

edbbbee simplify if
c3d5dae wip test

awaelchli mentioned this pull request Aug 11, 2020

DistributedDataParallel with nccl backend produces zombie processes #2913

Closed

Adrian Wälchli and others added 15 commits August 11, 2020 16:48

2f7cb7e Merge branch 'master' into bugfix/atexit
6e995a7 clean up process group
a7033c7 shutdown problem
f9cc4ab interrupted
855815e distrib
2ffa230 update test
d0f9cec pytest mark
8057fc2 reason
7a48ea0 trying to fix stall
b2125bc multiprocessing
e0fb7a2 force
8f15ff4 ddp_spawn
5da491c clean up tests
1b8e0e9 cleanup tests
6528fb1 Merge branch 'master' into bugfix/atexit

awaelchli changed the title ~~[WIP] graceful shutdown signal handling~~ [blocked by refactor] [WIP] graceful shutdown signal handling Aug 29, 2020

awaelchli mentioned this pull request Sep 3, 2020

How to free up the CUDA memory #3275

Closed

Borda approved these changes Sep 15, 2020

Borda added the ready PRs ready to be merged label Sep 15, 2020

mergify bot requested a review from a team September 15, 2020 17:52

awaelchli removed the ready PRs ready to be merged label Sep 15, 2020

Borda added the Blocked on ... label Sep 15, 2020

awaelchli mentioned this pull request Sep 16, 2020

Memory leaks: process still remain in the back if even the code is finished. #2590

Closed

awaelchli mentioned this pull request Sep 23, 2020

signal handling and teardown #3632

Closed

7 tasks

awaelchli closed this Sep 23, 2020

Borda deleted the bugfix/atexit branch September 30, 2020 13:22
