-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Fix gcs healthch manager crash when node is removed by node manager. #31917
Conversation
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@@ -30,6 +34,20 @@ using namespace boost; | |||
#include "gtest/gtest.h" | |||
#include "ray/gcs/gcs_server/gcs_health_check_manager.h" | |||
|
|||
int GetFreePort() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fun fact: it's generated by chatgpt
still failed in testing env |
|
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@@ -121,14 +110,14 @@ class GcsHealthCheckManager { | |||
NodeID node_id_; | |||
|
|||
// Whether the health check has stopped. | |||
std::shared_ptr<bool> stopped_; | |||
bool stopped_ = false; | |||
|
|||
/// gRPC related fields | |||
std::unique_ptr<::grpc::health::v1::Health::Stub> stub_; | |||
|
|||
// The context is used in the gRPC callback which is in another |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comments need update
context_ = std::make_shared<grpc::ClientContext>(); | ||
// Reset the context/request/response for the next request. | ||
context_.~ClientContext(); | ||
new (&context_) grpc::ClientContext(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we make context_ an pointer or unique_ptr, would it make it more nature?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It actually doesn't matter any difference I think since at the same time, only one in-flight request will be there per node.
I did this just following what gRPC did (https://sourcegraph.com/github.com/grpc/grpc/-/blob/test/cpp/qps/client_sync.cc?L189:26) and I think this is for performance (allocate on stack vs allocate on heap)(perf is not very important here, so I think no big difference.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mostly just nits!
… node manager. (ray-project#31917)" (ray-project#31995) This reverts commit a32b9b1. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Why are these changes needed?
The root cause is because the data structure is deleted, but call backs is not canceled and got executed. This PR simplify the life model and make it the way gRPC works. We only delete the structure after gRPC OnDone is called.
In the shortcut, according to the doc https://github.com/grpc/proposal/blob/master/L67-cpp-callback-api.md#unary-rpc-shortcuts , OnDone will call the callback function.
A better model is needed here. The code will be changed once we update the threading model in gRPC.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.