-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add GRPC Healthchecks #5581
Add GRPC Healthchecks #5581
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious about this setup, I mean, I see the problem that you want to fix, but this change starting all the servers will include all the instance endpoints into the service, so metrics server will see al of them. How will metrics server choose the correct instance?
Ugh, good point. We'll need to use a load balancing policy (other than default pick_first) if we wanted the client side health checking to work: https://grpc.io/docs/guides/health-checking/#enabling-client-health-checking |
Honestly, I think we might not need this now since I decided to bypass envoy connect to the operator directly. Let me give this a whirl. |
I think this would be a great contribution anyway |
Yea, i think the way to have this work is to change the client config to round-robin lbpolicy, that way the client will always check health and connect to the serving grpc server |
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
ba9f085
to
f400c70
Compare
@JorTurFer @zroubalik Besides local testing, how else do you want me to test this out? |
Actually, I'm going to go forward with this change if possible. Per @zroubalik's comments and some more testing, it would have safer to have this. I'm also working on the change to add metrics for the GRPC requests so we can better observe this. |
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
/run-e2e internal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
@JorTurFer Can't tell if the test failures were caused by my changes. Logs from the servers seem fine, can't tell immediately which e2e tests failed. |
I think that the failure is unrelated (there is other PR in progress reviewing e2e tests). Let me trigger them again |
/run-e2e internal |
This reverts commit 3bf5151.
Signed-off-by: Ali Aqel <aliaqel@stripe.com>
This PR adds a GRPC healthcheck server to the operator and returns
SERVING
status only if the server is the leader elected instance.To do this, I refactored the GRPCServer class to be a non-leader elected runnable (returning
false
forNeedsLeaderElection
) and instead listen to theElected
channel from the manager in the select statement to set the server state to serving.This also needed the addition client-side health checking to ensure that the GRPC client checks the health of the endpoint its connected to and ensure it is connected to the serving client. Additionally, this required changing the load balancing behavior of the client from
pick_first
toround_robin
so that the client internally watches the state of the health checks and selects a server that is in the serving state, per docs in https://grpc.io/docs/guides/health-checking/I also added a graceful shutdown of the GRPC server when
ctx.Done()
is closed as there was none previously.Verified by running in our Kubernetes cluster in 1 and 2 replica modes. Verified that server shuts down cleanly when pod is selected by monitoring logs.
Leader-elected:
Non-leader instance:
Checklist
Fixes #5590