Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experiment with gRPC outlier detection balancer #10

Open
wants to merge 1 commit into
base: client_fail_over_poc_internal
Choose a base branch
from

Conversation

chaochn47
Copy link
Owner

@chaochn47 chaochn47 commented Sep 20, 2023

This is an experiment on top of #9 which shows better etcd availability combined with gRPC health check.

The configured outlier detection config is if within 2s, there are more than 5 requests (RPC) failed, it will at most eject one backend address out of the 3 etcd addresses. It will kick in earlier than current configured 6s latency that costs server push unhealthy status to client.

make gofail-enable
make build

cd tests/e2e && go test -v -run TestFailover
=== RUN   TestFailover
=== RUN   TestFailover/network_partition
failover_test.go:162: request failure rate is 0.13%, traffic volume success 15944 requests, total 15964 requests
=== RUN   TestFailover/stalled_disk_write
failover_test.go:162: request failure rate is 0.10%, traffic volume success 20160 requests, total 20180 requests
=== RUN   TestFailover/defrag
failover_test.go:162: request failure rate is 0.00%, traffic volume success 24481 requests, total 24482 requests

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@chaochn47 chaochn47 force-pushed the client_fail_over_with_outlier_detection_poc_internal branch from 03eea6a to d6401d4 Compare September 20, 2023 05:06
@@ -24,6 +24,7 @@ import (
"github.com/stretchr/testify/require"
"google.golang.org/grpc"
_ "google.golang.org/grpc/health"
_ "google.golang.org/grpc/xds"
Copy link
Owner Author

@chaochn47 chaochn47 Sep 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is used to import https://github.com/grpc/grpc-go/blob/master/xds/internal/balancer/outlierdetection/balancer.go#L57-L61 to register the outlierdetection balancer in the global registry. Unfortunately, it has not yet moved out to public.. See the gRPC-go maintainer comment in grpc/grpc-go#5672 (comment) and grpc/grpc-go#5899 (comment)

/cc @tjungblu since you summarized the use case pretty well in etcd-io#14501

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's really awesome @chaochn47! I'll give this a bash on Friday when I find some time.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works pretty well, I just gave this a quick run with an apiserver.
We also want to tune the retry policies in the future, I'm still a little unsure how to expose this in the etcd client. Exposing the raw service config is probably not a good idea given the potential changes to xds...

Copy link
Owner Author

@chaochn47 chaochn47 Sep 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works pretty well, I just gave this a quick run with an apiserver.

Thanks! That's a good data point.

We also want to tune the retry policies in the future, I'm still a little unsure how to expose this in the etcd client.

Retry policy is on my to-do list to explore. Will ping you when I find a reliable way to do this with etcd client. This might need some changes in v3 etcd client code base.

Exposing the raw service config is probably not a good idea given the potential changes to xds...

Changes to xds should not affect how we configure outlier detection load balancing policy. It is part of the service config proto definition in the main grpc repo. https://github.com/grpc/grpc-proto/blob/master/grpc/service_config/service_config.proto#L241

Signed-off-by: Chao Chen <chaochn@amazon.com>
@chaochn47 chaochn47 force-pushed the client_fail_over_with_outlier_detection_poc_internal branch from 35ea4f1 to 90b5e66 Compare September 20, 2023 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants