Graceful keyboard doesn't work as expected #965
Comments
Thanks, I’ll look into this. To reproduce, is this running in a Colab notebook with a TPU enabled? |
I spent some time looking into this but haven't yet figured out how to gracefully exit when using DDP. We know that the first Reading the docs for
I'm wondering if the |
@jeremyjordan how is it going here? |
@Borda I could use some help figuring out how to fail gracefully in the case of DDP. When the user signals a KeyboardInterrupt for the first time, it should signal all of the nodes to finish their work, collect back onto the main process, and then run the teardown routine. If the user signals a second KeyboardInterrupt, it should stop immediately. |
Do you mean that the first KeyboardInterrupt ends the training and the second KeyboardInterrupt ends even the teardown? Maybe we should really implement the trainer status enum... |
Yes, I believe an exception is being thrown for each worker and raised back to the main process. The main process catches the first exception and begins running the training teardown, but then the subsequent exceptions cause it to halt immediately. Does that make sense? Any ideas how to handle that? |
I have not tried to solve this particular problem, but I have wrestled with some similar problems, so I'm sharing what I learned in case it helps. You can probably intercept the KeyboardInterrupt in each process by setting a signal handler like this in each ddp process:
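A minimal sketch of such a handler, assuming plain Python `signal` handling inside a single process. The names `InterruptState`, `install_handler`, and `train_loop` are hypothetical and not part of PyTorch Lightning. The first Ctrl+C is only recorded so the loop can finish its current step and fall through to teardown; the second raises `KeyboardInterrupt` right away.

```python
import signal


class InterruptState:
    """Counts the SIGINTs received by this process (hypothetical helper)."""

    def __init__(self):
        self.count = 0

    def handler(self, signum, frame):
        self.count += 1
        if self.count >= 2:
            # Second Ctrl+C: give up on the graceful path and stop immediately.
            raise KeyboardInterrupt

    def stop_requested(self):
        return self.count >= 1


def install_handler(state):
    """Register the handler for SIGINT and return the previous handler.

    signal.signal() may only be called from the main thread of a process,
    so in a DDP setup this would have to run inside each worker process.
    """
    return signal.signal(signal.SIGINT, state.handler)


def train_loop(state, max_steps, on_step=None):
    """Stand-in for a training loop that checks the flag between steps."""
    steps_done = 0
    for step in range(max_steps):
        if state.stop_requested():
            break  # leave the loop so the caller can run its teardown
        if on_step is not None:
            on_step(step)
        steps_done += 1
    return steps_done
```

The flag is only checked between steps, so a long-running step still has to finish before the loop exits. That is the trade-off of the graceful path; the second interrupt is the escape hatch.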
To make these active in each ddp process, they have to be defined and Catch You can create a shared Value (see python/pytorch multiprocessing) in
This should work on a single node. To work across nodes, I believe you will have to create a Manager() - see mp. Hope this at least gives you a useful direction ... |
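The shared-flag idea for a single node could look roughly like this. It is a sketch that uses the standard library's `multiprocessing.Value` rather than `torch.multiprocessing`, and the helper names (`make_stop_flag`, `request_stop`, `should_stop`, `run_steps`, `worker`) are made up for illustration. The workers ignore SIGINT themselves and only poll the shared flag, so the parent process alone decides when everyone stops.

```python
import multiprocessing as mp
import signal
import time


def make_stop_flag():
    # 'b' is a signed char used here as a boolean; it lives in shared memory.
    return mp.Value("b", 0)


def request_stop(flag):
    # In a real setup the parent's SIGINT handler would call this on the first Ctrl+C.
    with flag.get_lock():
        flag.value = 1


def should_stop(flag):
    return bool(flag.value)


def run_steps(flag, max_steps, step_seconds=0.0):
    """Stand-in for a training loop that polls the shared flag between steps."""
    steps_done = 0
    while steps_done < max_steps and not should_stop(flag):
        time.sleep(step_seconds)
        steps_done += 1
    return steps_done


def worker(flag, rank):
    # Ignore Ctrl+C in the worker; it stops only when the shared flag is set.
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    steps = run_steps(flag, max_steps=1000, step_seconds=0.01)
    print(f"rank {rank} stopped cleanly after {steps} steps")


if __name__ == "__main__":
    flag = make_stop_flag()
    procs = [mp.Process(target=worker, args=(flag, rank)) for rank in range(2)]
    for p in procs:
        p.start()
    time.sleep(0.1)
    request_stop(flag)  # simulate the first KeyboardInterrupt
    for p in procs:
        p.join()
```

A `Value` is only visible to processes on the same machine, which matches the single-node caveat above.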
A short note: I also encountered this problem in DDP. I found that the KeyboardInterrupt is sometimes not caught, and even when it's caught, the |
@jeremyjordan
Not sure the fix is what you were looking for? Make sure to also log the message only from proc_0.
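Logging only from proc_0 could be done with a small guard like the one below. This is a sketch: `log_on_rank_zero` and its `proc_rank` argument are hypothetical names, and how a process learns its own rank depends on the launcher.

```python
import logging

log = logging.getLogger("graceful_shutdown_example")


def log_on_rank_zero(message, proc_rank):
    """Emit the message only on process 0; return whether it was logged."""
    if proc_rank != 0:
        return False
    log.warning(message)
    return True
```

Called as `log_on_rank_zero("Detected KeyboardInterrupt, attempting graceful shutdown...", proc_rank)`, every process reaches the same line but only one message is printed.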