[rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown #35394
Conversation
… fix ungraceful shutdown As above Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/) [ghstack-poisoned]
💊 CircleCI build failures summary and remediations — as of commit 76e7890 (more details on the Dr. CI page): ✅ None of the build failures appear to be your fault 💚
🚧 2 upstream failures: these were probably caused by upstream breakages.
// Note: calling threadPool_.waitWorkComplete() after listenerThread.join() so
// that we can finish any possible work enqueued into the thread pool, before
// python RPC handler is shutdown (see shutdown in rpc/api.py).
threadPool_.waitWorkComplete();
This can launch more sends? In that case, do we need to abort those?
Hmm, this is a good point. Though in the current code it looks like we do this after aborting pending sends as well, technically more sends can be launched in `threadPool.waitWorkComplete()` again. Also, by the time this line is called we would have set `rpcRunning_` to false, and the send code checks this, so the send would be stopped immediately. I think this is why I'm not able to get this scenario to show up in the tests.
What about the following order?
- Join the listener thread, so this node stops accepting new requests and just has to flush out its old ones.
- Abort all existing pending sends: end the current pending sends locally, and do not send them over RPC.
- Call `threadPool.waitWorkComplete()`: here, even if we have additional sends, they will be cancelled immediately, since we already check the `rpcRunning_` flag when waiting for a send (see the sketch below).
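For illustration only, here is a minimal, self-contained Python sketch of that ordering, using a plain thread and a `ThreadPoolExecutor` as stand-ins for `listenerThread` and the agent's thread pool. All names here (`rpc_running`, `pending_sends`, `process`, etc.) are invented for the example and are not ProcessGroupAgent's actual members:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Invented stand-ins for the agent's state; names are illustrative only.
rpc_running = threading.Event()            # plays the role of rpcRunning_
rpc_running.set()
pool = ThreadPoolExecutor(max_workers=4)   # plays the role of threadPool_
pending_sends = []                         # plays the role of pending send futures

def send(msg):
    # The send path checks the running flag first, so sends launched during
    # shutdown are dropped immediately instead of going over the wire.
    if not rpc_running.is_set():
        return
    pending_sends.append(msg)

def process(work):
    # Processing a received request may itself trigger a send (e.g. a response).
    send(f"response-to-{work}")

def listen_loop():
    # Stand-in for listenerThread: it only *enqueues* work into the pool.
    for i in range(10):
        if not rpc_running.is_set():
            break
        pool.submit(process, i)
        time.sleep(0.01)

listener = threading.Thread(target=listen_loop)
listener.start()
time.sleep(0.03)

# The proposed shutdown ordering:
rpc_running.clear()        # stop initiating new work
listener.join()            # 1) no new requests can be enqueued after this
pending_sends.clear()      # 2) "abort" outstanding sends locally
pool.shutdown(wait=True)   # 3) ~ threadPool.waitWorkComplete(): drain whatever
                           #    was already enqueued; any late sends see the
                           #    cleared flag and return immediately
```

The key property is that step 3 can only observe sends that bail out at the running-flag check, so draining the pool cannot re-populate the pending-send set.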
The above looks good to me for non-graceful shutdown. For graceful shutdown, do we need to wait for the dist autograd context cleanup? Can the cleanup message still sit in the threadPool while we are doing this? Do we need something similar to `_delete_all_user_rrefs()` for dist autograd?
@mrshenli For graceful shutdown and cleaning up the dist autograd context, I thought we are okay, since the following will happen (see the usage sketch below):
- Exit the dist autograd context, triggering the cleanup messages to be sent.
- Graceful shutdown eventually calls `sync()`, which waits for all messages to be cleanly processed across all nodes, meaning that we will process the dist autograd cleanup while waiting in `sync()`.
- Shutdown, which in the graceful case should not have any existing work when calling `threadPool.waitWorkComplete` (I have a WIP diff to check this, by ensuring that we won't abort any pending sends in graceful shutdown).
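As a usage-level sketch of that graceful sequence — a hypothetical two-worker setup where the worker names, port, and toy computation are invented for the example, not taken from this PR:

```python
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        with dist_autograd.context() as ctx_id:
            t = rpc.rpc_sync(
                "worker1", torch.add, args=(torch.ones(2, requires_grad=True), 1)
            )
            dist_autograd.backward(ctx_id, [t.sum()])
        # Exiting the context triggers the dist autograd cleanup messages.
    # Graceful shutdown waits for all outstanding messages (including that
    # cleanup) to be processed on every worker before tearing anything down.
    rpc.shutdown(graceful=True)

if __name__ == "__main__":
    mp.spawn(run_worker, args=(2,), nprocs=2)
```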
I see, this should be OK for now. One thing is that TensorPipeAgent is expecting us to deprecate the sync/join APIs. In that case, dist_autograd probably needs its own way to clear all messages, so that application messages are cleared by `wait_all_workers`, RRef messages are cleared by `_delete_all_user_rrefs`, and dist autograd can add its own internal message cleanup function in `shutdown()`. This might be sufficient to get rid of join.
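To make the proposed division of responsibilities concrete, here is a hedged Python sketch. `_wait_all_workers` and `_delete_all_user_rrefs` are just the names used in the discussion above, and `_clear_dist_autograd_contexts` is purely hypothetical; the bodies are stubbed out so that only the ordering is the content:

```python
# All functions are stand-ins; only the ordering mirrors the proposal above.

def _wait_all_workers():
    """Stub: block until application messages are flushed on all workers."""

def _delete_all_user_rrefs():
    """Stub: flush RRef control messages."""

def _clear_dist_autograd_contexts():
    """Hypothetical hook: dist autograd's own internal message cleanup."""

def _cleanup_python_rpc_handler():
    """Stub: drop the Python handler (the C++ side sets pyRunFunction_ to None)."""

def shutdown_without_join():
    _wait_all_workers()              # application messages cleared
    _delete_all_user_rrefs()         # RRef messages cleared
    _clear_dist_autograd_contexts()  # dist autograd messages cleared
    _cleanup_python_rpc_handler()    # only now is it safe to drop the handler

shutdown_without_join()
```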
…d.join() to fix ungraceful shutdown" Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/) [ghstack-poisoned]
… fix ungraceful shutdown Pull Request resolved: #35394 As above ghstack-source-id: 101537586 Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/)
…d.join() to fix ungraceful shutdown" Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/) [ghstack-poisoned]
… fix ungraceful shutdown Pull Request resolved: #35394 As above ghstack-source-id: 101592571 Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/)
This pull request has been merged in 2ef1ace.
… fix (pytorch#35394) Summary: Pull Request resolved: pytorch#35394 As above ghstack-source-id: 101592571 Test Plan: Existing CI, no longer flaky Differential Revision: D20632405 fbshipit-source-id: fbfd81470b3361371109af341f0db3ef8b3a415b
Stack from ghstack:
- #35394 [rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown
- #35393 [rpc] create error string in listenLoop outside of lock
This is one of the causes of flakiness seen in `dist_autograd_node_failure` (the other is a `std::terminate` in RPC retries, which is being fixed by @osalpekar).

The root issue is that, because we call `threadPool.waitWorkComplete()` before `listenerThread.join()`, it is possible that in some ungraceful shutdown situations `listenerThread` enqueues more `RecvWork` into the thread pool after the wait has already returned. Since `listenerThread` is only responsible for enqueueing the `RecvWork` and not for waiting on it, it can exit, and shutdown will continue. As part of shutdown we then call `_cleanup_python_rpc_handler`, which sets `pyRunFunction_` to `None`, even though we could still be processing work in the RPC thread pools. This is why we would see errors such as `NoneType not callable` in `request_callback.cpp`.

The fix here is to wait for all locally enqueued work to be completed before shutting down the Python part.

Test plan: run the `dist_autograd_node_failure` tests. However, completely resolving the flakiness also depends on fixing the `std::terminate()` issue mentioned above.

Differential Revision: D20632405
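To make the race described above concrete, here is a small standalone Python illustration (not PyTorch code, just an analogy with invented names): joining the thread that enqueues work does not mean the work has run, so the shared handler must not be torn down until the pool itself has been drained.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

handler = lambda req: f"handled {req}"    # plays the role of pyRunFunction_
pool = ThreadPoolExecutor(max_workers=2)  # plays the role of the RPC thread pool

def process(req):
    time.sleep(0.1)        # the work is still in flight...
    return handler(req)    # ...when the handler may already have been dropped

def listen_loop():
    # Like listenerThread: it only enqueues RecvWork, it never waits on it.
    pool.submit(process, "recv-work")

listener = threading.Thread(target=listen_loop)
listener.start()
listener.join()            # the listener has exited, but its task has not run yet

# Buggy order: dropping the handler right after the join would let process()
# hit the equivalent of "NoneType is not callable":
#   handler = None

# Fixed order: drain the pool first, then drop the Python-side handler.
pool.shutdown(wait=True)   # ~ threadPool.waitWorkComplete()
handler = None             # safe now: no enqueued work can still run
```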