[core] Fix gcs healthch manager crash when node is removed by node manager. #31917

Merged on Jan 26, 2023 (14 commits)

@fishbone fishbone commented Jan 25, 2023

Why are these changes needed?

The root cause is that the data structure is deleted while its callbacks are not canceled, so they still get executed. This PR simplifies the lifetime model and makes it match the way gRPC works: we only delete the structure after gRPC's OnDone is called.

For the unary RPC shortcut, according to the doc https://github.com/grpc/proposal/blob/master/L67-cpp-callback-api.md#unary-rpc-shortcuts , OnDone calls the callback function.

A better model is needed here. The code will be changed once we update the threading model in gRPC.
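
To illustrate the lifetime rule described above, here is a minimal single-threaded sketch. It is not the code from this PR: `FakeRpcLayer`, `CheckContext`, and `RunLifetimeDemo` are hypothetical names, and the queue only stands in for gRPC's completion callbacks. The point it shows is that removing a node only sets a flag, and the context deletes itself from inside the completion callback, so no callback can run against freed memory.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an async RPC layer: completion callbacks are
// queued and run later, roughly like a gRPC callback firing afterwards.
class FakeRpcLayer {
 public:
  void CallAsync(std::function<void()> on_done) {
    pending_.push(std::move(on_done));
  }

  // Runs the oldest pending completion callback, if any.
  void RunOne() {
    if (pending_.empty()) return;
    std::function<void()> cb = std::move(pending_.front());
    pending_.pop();
    cb();
  }

 private:
  std::queue<std::function<void()>> pending_;
};

// Hypothetical per-node context. The only place it is destroyed is inside
// its own completion callback.
class CheckContext {
 public:
  CheckContext(FakeRpcLayer* rpc, std::vector<std::string>* log)
      : rpc_(rpc), log_(log) {}

  void Start() {
    rpc_->CallAsync([this] { OnDone(); });
  }

  // Called when the node is removed. Only flips a flag, so the in-flight
  // callback still has a valid object to run against.
  void Stop() { stopped_ = true; }

 private:
  ~CheckContext() { log_->push_back("deleted"); }

  void OnDone() {
    if (stopped_) {
      delete this;  // No callback is outstanding at this point.
      return;
    }
    log_->push_back("checked");
    Start();  // Schedule the next check.
  }

  FakeRpcLayer* rpc_;
  std::vector<std::string>* log_;
  bool stopped_ = false;
};

std::vector<std::string> RunLifetimeDemo() {
  FakeRpcLayer rpc;
  std::vector<std::string> log;
  CheckContext* ctx = new CheckContext(&rpc, &log);
  ctx->Start();
  rpc.RunOne();  // First check completes and schedules a second one.
  ctx->Stop();   // Node removed while the second check is in flight.
  log.push_back("stop requested");
  rpc.RunOne();  // Callback observes the flag and deletes the context.
  return log;
}
```

Calling `RunLifetimeDemo()` records the first check, then the stop request, then the deletion, in that order. A real implementation would also have to deal with the callback arriving on a different thread, which this sketch deliberately leaves out.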

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
@@ -30,6 +34,20 @@ using namespace boost;
#include "gtest/gtest.h"
#include "ray/gcs/gcs_server/gcs_health_check_manager.h"

int GetFreePort() {
Contributor Author

fun fact: it's generated by chatgpt

@fishbone fishbone marked this pull request as ready for review January 25, 2023 03:56
@fishbone fishbone changed the title from "Fix crashing" to "[core] Fix gcs healthch manager crash when node is removed by node manager." Jan 25, 2023
@fishbone fishbone linked an issue Jan 25, 2023 that may be closed by this pull request
@fishbone fishbone marked this pull request as draft January 25, 2023 20:42
@fishbone
Contributor Author

Still failing in the testing environment.

@fishbone
Contributor Author

*** SIGABRT received at time=1674687639 on cpu 0 ***
PC: @     0x7f0fd2f8400b  (unknown)  raise
    @     0x7f0fd4fa47b5        208  absl::lts_20220623::WriteFailureInfo()
    @     0x7f0fd4fa44f8         64  absl::lts_20220623::AbslFailureSignalHandler()
    @     0x7f0fd349f420       4032  (unknown)
    @     0x7f0fd31f838c        224  (unknown)
    @     0x7f0fd578f5ca        400  grpc::GenericDeserialize<>()
    @     0x7f0fd578eb5c        128  grpc::internal::CallOpRecvMessage<>::FinishOp()
    @     0x7f0fd578d63d         64  grpc::internal::CallOpSet<>::FinalizeResult()
    @     0x7f0fd5791ef8        256  grpc::internal::CallbackWithStatusTag::Run()
    @     0x7f0fd49c1bcc         48  grpc::(anonymous namespace)::CallbackAlternativeCQ::Ref()::{lambda()#1}::__invoke()
    @     0x7f0fd3d184ec         96  grpc_core::(anonymous namespace)::ThreadInternalsPosix::ThreadInternalsPosix()::{lambda()#1}::__invoke()
    @     0x7f0fd3493609  (unknown)  start_thread

@fishbone fishbone marked this pull request as ready for review January 26, 2023 00:45
@@ -121,14 +110,14 @@ class GcsHealthCheckManager {
NodeID node_id_;

// Whether the health check has stopped.
-std::shared_ptr<bool> stopped_;
+bool stopped_ = false;

/// gRPC related fields
std::unique_ptr<::grpc::health::v1::Health::Stub> stub_;

// The context is used in the gRPC callback which is in another
Contributor

The comment needs to be updated.

-context_ = std::make_shared<grpc::ClientContext>();
+// Reset the context/request/response for the next request.
+context_.~ClientContext();
+new (&context_) grpc::ClientContext();
Contributor

If we made context_ a pointer or unique_ptr, would that be more natural?

Contributor Author

I don't think it makes any difference, since at any given time there is only one in-flight request per node.
I did it this way just following what gRPC does (https://sourcegraph.com/github.com/grpc/grpc/-/blob/test/cpp/qps/client_sync.cc?L189:26), and I think that is for performance (allocate on stack vs. allocate on heap). Perf is not very important here, so I don't think there is a big difference.
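
For readers comparing the two options discussed in this thread, here is a small sketch of both. It does not use gRPC: `FakeContext` is a hypothetical stand-in for a single-use, non-assignable context type, and `InPlaceChecker` and `HeapChecker` are illustrative names only. The counters exist just to make the behavior observable.

```cpp
#include <cassert>
#include <memory>
#include <new>

// Hypothetical stand-in for grpc::ClientContext: single-use, and neither
// copyable nor assignable, so it cannot be reset with plain assignment.
struct FakeContext {
  static int live;         // Instances currently alive.
  static int constructed;  // Instances constructed so far.
  bool used = false;

  FakeContext() { ++live; ++constructed; }
  ~FakeContext() { --live; }
  FakeContext(const FakeContext&) = delete;
  FakeContext& operator=(const FakeContext&) = delete;
};
int FakeContext::live = 0;
int FakeContext::constructed = 0;

// Variant from the diff: the context is a plain member, rebuilt in place.
class InPlaceChecker {
 public:
  void StartRequest() {
    context_.~FakeContext();        // End the old object's lifetime.
    new (&context_) FakeContext();  // Construct a fresh one in the same storage.
    context_.used = true;
  }

 private:
  FakeContext context_;
};

// Variant suggested in review: hold the context through a unique_ptr.
class HeapChecker {
 public:
  void StartRequest() {
    context_ = std::make_unique<FakeContext>();  // Frees the old one, if any.
    context_->used = true;
  }

 private:
  std::unique_ptr<FakeContext> context_;
};
```

Both variants leave exactly one live context after each `StartRequest()` call. The in-place form avoids a heap allocation per request, but it relies on the constructor not throwing between the destructor call and the placement new, and on there being only one request in flight per node, as noted above.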

Contributor

@scv119 scv119 left a comment

mostly just nits!

@scv119 scv119 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 26, 2023
@fishbone fishbone merged commit a32b9b1 into ray-project:master Jan 26, 2023
krfricke added a commit that referenced this pull request Jan 27, 2023
fishbone pushed a commit that referenced this pull request Jan 28, 2023
fishbone added a commit that referenced this pull request Jan 28, 2023
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
… node manager. (ray-project#31917)" (ray-project#31995)

This reverts commit a32b9b1.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Labels
@author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer.
Development

Successfully merging this pull request may close these issues.

[core] long_running_node_failures is flaky
3 participants