[rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown #35394
Conversation
… fix ungraceful shutdown As above Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/) [ghstack-poisoned]
💊 CircleCI build failures summary and remediations — as of commit 76e7890 (more details on the Dr. CI page): ✅ None of the build failures appear to be your fault 💚
🚧 2 upstream failures: these were probably caused by upstream breakages.
// Note: calling threadPool_.waitWorkComplete() after listenerThread.join() so
// that we can finish any possible work enqueued into the thread pool, before
// python RPC handler is shutdown (see shutdown in rpc/api.py).
threadPool_.waitWorkComplete();
This can launch more sends? In that case, do we need to abort those?
Hmm, this is a good point. Though in the current code it looks like we do this after aborting pending sends as well, technically more sends can be launched in `threadPool.waitWorkComplete()` again. Also, by the time this line is called we would have set `rpcRunning_` to false, and the send code checks this, so the send would be stopped immediately. I think this is why I'm not able to get this scenario to show up in the tests.
What about the following order?
- Join the listener thread, so this node stops accepting new requests and just has to flush out its old ones.
- Abort all existing pending sends: end the current pending sends locally, and do not send them over RPC.
- Call `threadPool.waitWorkComplete()`: here, even if we have additional sends, they will be cancelled immediately, since we already check the `rpcRunning_` flag when waiting for a send (see the sketch below).
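For illustration only, here is a minimal, self-contained Python sketch of that ordering, using a plain thread and a `ThreadPoolExecutor` as stand-ins for `listenerThread` and the agent's thread pool. All names here (`rpc_running`, `pending_sends`, `process`, etc.) are invented for the example and are not ProcessGroupAgent's actual members:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Invented stand-ins for the agent's state; names are illustrative only.
rpc_running = threading.Event()            # plays the role of rpcRunning_
rpc_running.set()
pool = ThreadPoolExecutor(max_workers=4)   # plays the role of threadPool_
pending_sends = []                         # plays the role of pending send futures

def send(msg):
    # The send path checks the running flag first, so sends launched during
    # shutdown are dropped immediately instead of going over the wire.
    if not rpc_running.is_set():
        return
    pending_sends.append(msg)

def process(work):
    # Processing a received request may itself trigger a send (e.g. a response).
    send(f"response-to-{work}")

def listen_loop():
    # Stand-in for listenerThread: it only *enqueues* work into the pool.
    for i in range(10):
        if not rpc_running.is_set():
            break
        pool.submit(process, i)
        time.sleep(0.01)

listener = threading.Thread(target=listen_loop)
listener.start()
time.sleep(0.03)

# The proposed shutdown ordering:
rpc_running.clear()        # stop initiating new work
listener.join()            # 1) no new requests can be enqueued after this
pending_sends.clear()      # 2) "abort" outstanding sends locally
pool.shutdown(wait=True)   # 3) ~ threadPool.waitWorkComplete(): drain whatever
                           #    was already enqueued; any late sends see the
                           #    cleared flag and return immediately
```

The key property is that step 3 can only observe sends that bail out at the running-flag check, so draining the pool cannot re-populate the pending-send set.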
The above looks good to me for non-graceful shutdown. For graceful shutdown, do we need to wait for the dist autograd context cleanup? Can the cleanup message still sit in the threadPool while we are doing this? Do we need something similar to `_delete_all_user_rrefs()` for dist autograd?
@mrshenli For graceful shutdown and cleaning up the dist autograd context, I thought we are okay, since the following will happen (see the usage sketch below):
- Exit the dist autograd context, triggering the cleanup messages to be sent.
- Graceful shutdown eventually calls `sync()`, which waits for all messages to be cleanly processed across all nodes, meaning that we will process the dist autograd cleanup while waiting in `sync()`.
- Shutdown, which in the graceful case should not have any existing work when calling `threadPool.waitWorkComplete` (I have a WIP diff to check this, by ensuring that we won't abort any pending sends in graceful shutdown).
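As a usage-level sketch of that graceful sequence — a hypothetical two-worker setup where the worker names, port, and toy computation are invented for the example, not taken from this PR:

```python
import os
import torch
import torch.distributed.autograd as dist_autograd
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def run_worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        with dist_autograd.context() as ctx_id:
            t = rpc.rpc_sync(
                "worker1", torch.add, args=(torch.ones(2, requires_grad=True), 1)
            )
            dist_autograd.backward(ctx_id, [t.sum()])
        # Exiting the context triggers the dist autograd cleanup messages.
    # Graceful shutdown waits for all outstanding messages (including that
    # cleanup) to be processed on every worker before tearing anything down.
    rpc.shutdown(graceful=True)

if __name__ == "__main__":
    mp.spawn(run_worker, args=(2,), nprocs=2)
```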
I see, this should be OK for now. One thing is that TensorPipeAgent is expecting us to deprecate the sync/join APIs. In that case, dist_autograd probably needs its own way to clear all messages, so that application messages are cleared by `wait_all_workers`, RRef messages are cleared by `_delete_all_user_rrefs`, and dist autograd can add its own internal message cleanup function in `shutdown()`. This might be sufficient to get rid of join.
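To make the proposed division of responsibilities concrete, here is a hedged Python sketch. `_wait_all_workers` and `_delete_all_user_rrefs` are just the names used in the discussion above, and `_clear_dist_autograd_contexts` is purely hypothetical; the bodies are stubbed out so that only the ordering is the content:

```python
# All functions are stand-ins; only the ordering mirrors the proposal above.

def _wait_all_workers():
    """Stub: block until application messages are flushed on all workers."""

def _delete_all_user_rrefs():
    """Stub: flush RRef control messages."""

def _clear_dist_autograd_contexts():
    """Hypothetical hook: dist autograd's own internal message cleanup."""

def _cleanup_python_rpc_handler():
    """Stub: drop the Python handler (the C++ side sets pyRunFunction_ to None)."""

def shutdown_without_join():
    _wait_all_workers()              # application messages cleared
    _delete_all_user_rrefs()         # RRef messages cleared
    _clear_dist_autograd_contexts()  # dist autograd messages cleared
    _cleanup_python_rpc_handler()    # only now is it safe to drop the handler

shutdown_without_join()
```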
…d.join() to fix ungraceful shutdown" Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/) [ghstack-poisoned]
… fix ungraceful shutdown Pull Request resolved: #35394 As above ghstack-source-id: 101537586 Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/)
…d.join() to fix ungraceful shutdown" Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/) [ghstack-poisoned]
… fix ungraceful shutdown Pull Request resolved: #35394 As above ghstack-source-id: 101592571 Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/)
This pull request has been merged in 2ef1ace.
… fix (pytorch#35394) Summary: Pull Request resolved: pytorch#35394 As above ghstack-source-id: 101592571 Test Plan: Existing CI, no longer flaky Differential Revision: D20632405 fbshipit-source-id: fbfd81470b3361371109af341f0db3ef8b3a415b
Stack from ghstack:
- #35394 [rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown
- #35393 [rpc] create error string in listenLoop outside of lock
This is one of the causes of flakiness seen in `dist_autograd_node_failure` (the other is a `std::terminate` in RPC retries, which is being fixed by @osalpekar).

The root issue is that, because we call `threadPool.waitWorkComplete()` before `listenerThread.join()`, it is possible that in some ungraceful shutdown situations `listenerThread` enqueues more `RecvWork` into the thread pool after the wait has already returned. Since `listenerThread` is only responsible for enqueueing the `RecvWork` and not for waiting on it, it can exit, and shutdown will continue. As part of shutdown we then call `_cleanup_python_rpc_handler`, which sets `pyRunFunction_` to `None`, even though we could still be processing work in the RPC thread pools. This is why we would see errors such as `NoneType not callable` in `request_callback.cpp`.

The fix here is to wait for all locally enqueued work to be completed before shutting down the Python part.

Test plan: run the `dist_autograd_node_failure` tests. However, completely resolving the flakiness also depends on fixing the `std::terminate()` issue mentioned above.

Differential Revision: D20632405
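To make the race described above concrete, here is a small standalone Python illustration (not PyTorch code, just an analogy with invented names): joining the thread that enqueues work does not mean the work has run, so the shared handler must not be torn down until the pool itself has been drained.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

handler = lambda req: f"handled {req}"    # plays the role of pyRunFunction_
pool = ThreadPoolExecutor(max_workers=2)  # plays the role of the RPC thread pool

def process(req):
    time.sleep(0.1)        # the work is still in flight...
    return handler(req)    # ...when the handler may already have been dropped

def listen_loop():
    # Like listenerThread: it only enqueues RecvWork, it never waits on it.
    pool.submit(process, "recv-work")

listener = threading.Thread(target=listen_loop)
listener.start()
listener.join()            # the listener has exited, but its task has not run yet

# Buggy order: dropping the handler right after the join would let process()
# hit the equivalent of "NoneType is not callable":
#   handler = None

# Fixed order: drain the pool first, then drop the Python-side handler.
pool.shutdown(wait=True)   # ~ threadPool.waitWorkComplete()
handler = None             # safe now: no enqueued work can still run
```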