
[rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown #35394

Closed
rohan-varma wants to merge 3 commits

Conversation

rohan-varma (Member) commented Mar 25, 2020

Stack from ghstack:

* **#35394 [rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown**
* #35393 [rpc] create error string in listenLoop outside of lock

This is one of the causes of the flakiness seen in `dist_autograd_node_failure` (the other is a `std::terminate` in RPC retries, which is being fixed by @osalpekar).

The root issue is that unless we call `threadPool.waitWorkComplete()` after `listenerThread.join()`, it is possible in some ungraceful shutdown situations for `listenerThread` to enqueue more `RecvWork` into the thread pool that nothing waits for. Since `listenerThread` is only responsible for enqueueing the `RecvWork`, not waiting on it, it can exit and shutdown continues. As part of shutdown we then call `_cleanup_python_rpc_handler`, which sets `pyRunFunction_` to `None`, even though we could still be processing work in the RPC thread pool. This is why we see errors such as `NoneType not callable` in `request_callback.cpp`.

The fix here is to wait for all locally enqueued work to complete before shutting down the Python RPC handler.
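To make the intended ordering concrete, here is a small self-contained C++ toy. It is not the agent's actual code: the `std::async`-backed futures stand in for the thread pool, the spinning thread stands in for `listenerThread`, and `pyRunFunction` stands in for the Python handler that `_cleanup_python_rpc_handler` clears.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <iostream>
#include <mutex>
#include <thread>
#include <vector>

int main() {
  // Stand-in for pyRunFunction_: the Python handler that shutdown eventually clears.
  std::function<void()> pyRunFunction = [] { std::cout << "handled request\n"; };

  std::mutex poolMutex;
  std::vector<std::future<void>> pool;  // stand-in for threadPool_
  std::atomic<bool> rpcRunning{true};

  // Stand-in for listenerThread_: it only *enqueues* work, it never waits on it.
  std::thread listener([&] {
    while (rpcRunning.load()) {
      {
        std::lock_guard<std::mutex> guard(poolMutex);
        pool.push_back(std::async(std::launch::async, [&pyRunFunction] {
          // This is where the "NoneType not callable" failure would surface if
          // the handler had already been cleared by the time this task runs.
          if (pyRunFunction) {
            pyRunFunction();
          }
        }));
      }
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  });

  std::this_thread::sleep_for(std::chrono::milliseconds(10));  // let some work arrive

  // Shutdown, in the order the PR enforces:
  rpcRunning.store(false);
  listener.join();            // 1. after this, nothing new can be enqueued
  for (auto& work : pool) {   // 2. threadPool_.waitWorkComplete(): finish everything
    work.wait();              //    that was enqueued before the listener exited
  }
  pyRunFunction = nullptr;    // 3. only now mimic _cleanup_python_rpc_handler
  return 0;
}
```

If the last two steps were swapped, i.e. the handler cleared before the enqueued work is drained, a late task could observe a cleared handler, which is the `NoneType not callable` failure described above.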

Test plan: run the `dist_autograd_node_failure` tests. Note that completely resolving the flakiness also depends on fixing the `std::terminate()` issue mentioned above.

Differential Revision: D20632405

dr-ci bot commented Mar 25, 2020

💊 CircleCI build failures summary and remediations

As of commit 76e7890 (more details on the Dr. CI page):

None of the build failures appear to be your fault 💚

  • 2/2 broken upstream at merge base 82d58ed since Apr 05

Please rebase on the viable/strict branch. If your commit is newer than viable/strict, you can try basing on an older, stable commit:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase --onto FETCH_HEAD $(git merge-base origin/master HEAD)
```

If your commit is older than viable/strict:

```
git fetch https://github.com/pytorch/pytorch viable/strict
git rebase FETCH_HEAD
```

🚧 2 upstream failures: these were probably caused by upstream breakages.

This comment was automatically generated by Dr. CI.

```cpp
// Note: calling threadPool_.waitWorkComplete() after listenerThread.join() so
// that we can finish any possible work enqueued into the thread pool, before
// python RPC handler is shutdown (see shutdown in rpc/api.py).
threadPool_.waitWorkComplete();
```
Contributor:

This can launch more sends? In that case, do we need to abort those?

rohan-varma (Member, author):

Hmm, this is a good point. In the current code it looks like we do this after aborting pending sends as well, but technically more sends can be launched while we are in `threadPool.waitWorkComplete()`. That said, by the time this line is called we have already set `rpcRunning_` to false, and the send code checks that flag, so such a send would be stopped immediately. I think this is why I'm not able to get this scenario to show up in the tests.

What about the following order? (A rough sketch follows the list.)

  1. Join the listener thread, so this node stops accepting new requests and just has to flush out its existing ones.
  2. Abort all existing pending sends, i.e. end the current pending sends without sending them over RPC.
  3. Call `threadPool.waitWorkComplete()`. Even if additional sends are launched here, they are cancelled immediately, since the send path already checks the `rpcRunning_` flag.
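For illustration, here is a rough C++ sketch of that ordering. Only `listenerThread_`, `threadPool_.waitWorkComplete()`, and `rpcRunning_` come from the discussion above; `AgentSketch`, `ThreadPoolLike`, and `abortPendingSends()` are placeholder names, not the agent's actual members or the proposed diff.

```cpp
#include <atomic>
#include <thread>

// Placeholder shapes for illustration only; the real agent differs in detail.
struct ThreadPoolLike {
  void waitWorkComplete() {}  // block until all enqueued work has run
};

struct AgentSketch {
  std::atomic<bool> rpcRunning_{true};
  std::thread listenerThread_;
  ThreadPoolLike threadPool_;

  void abortPendingSends() {}  // placeholder: end pending sends without sending them

  void shutdown() {
    rpcRunning_.store(false);        // the send path checks this flag
    listenerThread_.join();          // 1. stop accepting new requests
    abortPendingSends();             // 2. cancel sends that are already pending
    threadPool_.waitWorkComplete();  // 3. drain local work; a send attempted here
                                     //    sees rpcRunning_ == false and stops early
  }
};
```

The key property is that step 3 cannot strand work: nothing new is enqueued after step 1, and any send triggered while draining is cut short by the `rpcRunning_` check.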

Contributor:

The above looks good to me for non-graceful shutdown. For graceful shutdown, do we need to wait for the dist autograd context cleanup? Can the cleanup message still sit in the threadPool while we are doing this? Do we need something similar to `_delete_all_user_rrefs()` for dist autograd?

rohan-varma (Member, author):

@mrshenli For graceful shutdown and cleaning up the dist autograd context, I think we are okay, since the following will happen:

  1. Exit the dist autograd context, which triggers cleanup messages to be sent.
  2. Graceful shutdown eventually calls `sync()`, which waits for all messages to be cleanly processed across all nodes, so the dist autograd cleanup is processed while we wait in `sync()`.
  3. Shutdown, which in the graceful case should not have any existing work when calling `threadPool.waitWorkComplete()` (I have a WIP diff to check this by ensuring that we don't abort any pending sends during graceful shutdown); a rough sketch follows this list.
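A condensed sketch of that graceful sequencing, for illustration only; `GracefulAgent`, `hasPendingWork()`, and `gracefulShutdownSketch()` are stand-ins I am introducing (only `sync()` and `waitWorkComplete()` are mentioned above), and this is not the WIP diff.

```cpp
#include <cassert>

// Placeholder interface for illustration; the real RPC agent's API differs.
struct GracefulAgent {
  // Cross-node barrier: returns once all outstanding messages, including the
  // dist autograd cleanup triggered by exiting the context (step 1), are processed.
  virtual void sync() = 0;
  // Whether any locally enqueued work or pending sends remain.
  virtual bool hasPendingWork() const = 0;
  // Join the listener thread, then wait for the thread pool to drain.
  virtual void shutdown() = 0;
  virtual ~GracefulAgent() = default;
};

// Graceful path from the three steps above: after sync(), the local pool should
// already be idle, so shutdown() has nothing to abort and nothing left to drain.
inline void gracefulShutdownSketch(GracefulAgent& agent) {
  agent.sync();                     // step 2: all cleanup messages processed everywhere
  assert(!agent.hasPendingWork());  // the WIP check mentioned in step 3
  agent.shutdown();                 // step 3: waitWorkComplete() is effectively a no-op
}
```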

Contributor:

I see, this should be OK for now. One thing is that with TensorPipeAgent we are expecting to deprecate the sync/join APIs. In that case, dist_autograd probably needs its own way to clear all messages, so that application messages are cleared by `wait_all_workers`, RRef messages by `_delete_all_user_rrefs`, and dist autograd can add its own internal message cleanup in `shutdown()`. That might be sufficient to get rid of `join`.

rohan-varma added a commit that referenced this pull request Apr 4, 2020
[rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown

Pull Request resolved: #35394

As above
ghstack-source-id: 101537586

Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/)
rohan-varma added a commit that referenced this pull request Apr 6, 2020
[rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown

Pull Request resolved: #35394

As above
ghstack-source-id: 101592571

Differential Revision: [D20632405](https://our.internmc.facebook.com/intern/diff/D20632405/)
facebook-github-bot (Contributor) commented: This pull request has been merged in 2ef1ace.

facebook-github-bot deleted the gh/rohan-varma/102/head branch April 10, 2020 14:18
ashishfarmer pushed a commit to ashishfarmer/pytorch that referenced this pull request Apr 13, 2020
[rpc] call threadPool.waitWorkComplete after listenerThread.join() to fix ungraceful shutdown (pytorch#35394)

Summary:
Pull Request resolved: pytorch#35394

As above
ghstack-source-id: 101592571

Test Plan: Existing CI, no longer flaky

Differential Revision: D20632405

fbshipit-source-id: fbfd81470b3361371109af341f0db3ef8b3a415b