Graceful keyboard doesn't work as expected #965

williamFalcon · 2020-02-27T11:56:54Z

@jeremyjordan
Not sure the fix is what you were looking for? Make sure to also log the message only from proc_0.

jeremyjordan · 2020-02-27T14:16:34Z

Thanks I’ll look into this. To reproduce, is this running in a Colab notebook with a TPU enabled?

jeremyjordan · 2020-03-02T04:26:59Z

I spent some time looking into this but haven't yet figured out how to gracefully exit when using DDP. We know that the first KeyboardInterrupt was caught due to the logging.

Reading the docs for torch.multiprocessing I see it says:

In the case an exception was caught in the child process, it is forwarded and its traceback is included in the exception raised in the parent process.

I'm wondering if the KeyboardInterrupt is being sent to all of the processes and the child processes are re-raising to the main process which can only handle the exception once.

Borda · 2020-04-09T11:12:42Z

@jeremyjordan how is it going here?

jeremyjordan · 2020-04-10T03:03:44Z

@Borda I could use some help figuring out how to fail gracefully in the case of DDP

when the user signals a KeyboardInterrupt for the first time, it should signal all of the nodes to finish their work, collect back onto the main process, and then run the teardown routine. if the user signals for a second KeyboardInterrupt, it should stop immediately.

Borda · 2020-04-13T19:20:54Z

Do you mean that first KeyboardInterrupt ends the teardown and the second KeyboardInterrupt ends even the teardown? Maybe we shall really implement the trainer status enum...

jeremyjordan · 2020-04-16T01:57:33Z

yes i believe an exception is being thrown for each worker and raised back to the main process. the main process catches the first exception and begins running the training teardown but then the subsequent exceptions cause it to halt immediately. does that make sense? any ideas how to handle that?

sneiman · 2020-04-25T18:08:07Z

I have not tried to solve this particular problem - but I have wrestled some similar problems - sharing what I learned in case it helps.

You can probably intercept the KeyboardInterrupt in each process by setting a signal handler like this in each ddp process:

def signal_handler(sig, frame):
        # do something here - probably pass
        # I would try to re raise the error from None - to avoid sending all the history ...
        # might even translate it to your own exception ... ExitTraining?
        raise ExitTraining from None
signal.signal(signal.SIGINT, signal_handler)

To make these active in each ddp process, they have to be defined and signal.signal() called in ddp_train().

Catch ExitTraining in the main, cpu based process. At this point, you 'should' have control of the interrupt.

You can create a shared Value (see python/pytorch multiprocessing) in Trainer.__init__(). You then use this as a flag to signal time to teardown. So ... your handler for ExitTraining in the main process would set the flag to True. And somewhere AFTER a barrier call in the training loop you check it, and initiate teardown.
NOTE that for ddp, you MUST create the shared value with

mp.get_context('spawn').Value(...)

This should work on a single node. To work across nodes, I believe you will have to create a Manager() - see mp.

Hope this at leasts gives you a useful direction ...
I do something similar

cmpute · 2020-04-29T00:24:22Z

A short note, I also encountered this problem in DDP. I found that the KeyboardInterrupt is not caught sometimes, and even though it's caught, the self.interrupted is only set in child process, which means the value for the main process is not set. A workaround for me is to try-catch the mp.spawn() sentence and set the flags in the main process.

williamFalcon added bug Something isn't working help wanted Open to be worked on labels Feb 27, 2020

Borda assigned jeremyjordan Mar 1, 2020

jeremyjordan mentioned this issue Apr 28, 2020

Graceful shutdown on python interpreter exit #1631

Merged

5 tasks

williamFalcon mentioned this issue May 31, 2020

Replaces ddp .spawn with subprocess #2029

Merged

williamFalcon closed this as completed in #2029 Jun 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Graceful keyboard doesn't work as expected #965

Graceful keyboard doesn't work as expected #965

williamFalcon commented Feb 27, 2020

jeremyjordan commented Feb 27, 2020 •

edited

Loading

jeremyjordan commented Mar 2, 2020

Borda commented Apr 9, 2020

jeremyjordan commented Apr 10, 2020

Borda commented Apr 13, 2020

jeremyjordan commented Apr 16, 2020

sneiman commented Apr 25, 2020 •

edited

Loading

cmpute commented Apr 29, 2020

Graceful keyboard doesn't work as expected #965

Graceful keyboard doesn't work as expected #965

Comments

williamFalcon commented Feb 27, 2020

jeremyjordan commented Feb 27, 2020 • edited Loading

jeremyjordan commented Mar 2, 2020

Borda commented Apr 9, 2020

jeremyjordan commented Apr 10, 2020

Borda commented Apr 13, 2020

jeremyjordan commented Apr 16, 2020

sneiman commented Apr 25, 2020 • edited Loading

cmpute commented Apr 29, 2020

jeremyjordan commented Feb 27, 2020 •

edited

Loading

sneiman commented Apr 25, 2020 •

edited

Loading