
Add GRPC Healthchecks #5581

Merged

Conversation

aliaqel-stripe
Contributor

@aliaqel-stripe aliaqel-stripe commented Mar 7, 2024

This PR adds a gRPC healthcheck server to the operator that returns SERVING status only if the server is the leader-elected instance.

To do this, I refactored the GRPCServer class to be a non-leader-elected runnable (returning false from NeedsLeaderElection) and instead listen on the manager's Elected channel in the select statement to set the server state to serving.
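For illustration, here is a minimal sketch of that wiring (type and field names are illustrative, not the exact KEDA code; the health service is the stock one from google.golang.org/grpc/health):

```go
package server

import (
	"context"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

type GRPCServer struct {
	server  *grpc.Server    // assumed to be constructed elsewhere
	health  *health.Server
	elected <-chan struct{} // manager.Elected(), closed on winning the election
	address string
}

// NeedsLeaderElection returning false makes the manager run this
// server on every instance, not just on the leader.
func (s *GRPCServer) NeedsLeaderElection() bool {
	return false
}

func (s *GRPCServer) Start(ctx context.Context) error {
	lis, err := net.Listen("tcp", s.address)
	if err != nil {
		return err
	}

	s.health = health.NewServer()
	// Every instance starts as NOT_SERVING; only the elected leader
	// flips to SERVING below.
	s.health.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	healthpb.RegisterHealthServer(s.server, s.health)

	errCh := make(chan error, 1)
	go func() { errCh <- s.server.Serve(lis) }()

	for {
		select {
		case <-s.elected:
			// This instance won the leader election: report SERVING.
			s.health.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
			s.elected = nil // the closed channel would fire forever; a nil channel blocks
		case <-ctx.Done():
			// Graceful shutdown: drain in-flight RPCs before exiting.
			s.server.GracefulStop()
			return nil
		case err := <-errCh:
			return err
		}
	}
}
```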

This also required adding client-side health checking, so that the gRPC client checks the health of the endpoint it is connected to and only talks to a serving instance. Additionally, this required changing the load-balancing behavior of the client from pick_first to round_robin, so that the client internally watches the health-check state of the endpoints and selects a server that is in the serving state, per the docs at https://grpc.io/docs/guides/health-checking/
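A minimal sketch of what the client-side dial looks like with this (the target address is illustrative; the service config follows the gRPC health-checking guide linked above):

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Blank import registers the client-side health-checking function.
	_ "google.golang.org/grpc/health"
)

func dialOperator() (*grpc.ClientConn, error) {
	// round_robin makes the client watch per-endpoint health and pick a
	// SERVING backend; pick_first would ignore the health state of the
	// other endpoints behind the service.
	serviceConfig := `{
		"loadBalancingConfig": [{"round_robin":{}}],
		"healthCheckConfig": {"serviceName": ""}
	}`
	return grpc.Dial(
		"dns:///keda-operator.keda.svc.cluster.local:9666", // illustrative target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
}
```

Note that the blank import of google.golang.org/grpc/health is what actually enables client-side health checking; without it the healthCheckConfig in the service config is ignored.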

I also added a graceful shutdown of the gRPC server when ctx.Done() is closed, as there was none previously.

Verified by running in our Kubernetes cluster with 1 and 2 replicas, and by monitoring logs that the server shuts down cleanly when the pod is terminated.

Leader-elected instance:

2024/03/20 21:37:34 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
...
2024-03-20T21:37:34Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0320 21:37:34.629844      13 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
2024-03-20T21:37:34Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
I0320 21:38:05.572556      13 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-03-20T21:38:05Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
...
2024-03-20T21:41:42Z	INFO	Stopping and waiting for non leader election runnables
2024-03-20T21:41:42Z	INFO	grpc_server	Shutting down gRPC server
2024-03-20T21:41:42Z	INFO	Stopping and waiting for leader election runnables
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-03-20T21:41:42Z	INFO	Stopping and waiting for caches
W0320 21:41:42.719516      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719677      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ClusterTriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719778      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.CloudEventSource ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719869      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledJob ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719968      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledObject ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.720065      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.TriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.720222      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v2.HorizontalPodAutoscaler ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2024-03-20T21:41:42Z	INFO	Stopping and waiting for webhooks
2024-03-20T21:41:42Z	INFO	Stopping and waiting for HTTP servers
2024-03-20T21:41:42Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-03-20T21:41:42Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-03-20T21:41:42Z	INFO	Wait completed, proceeding to shutdown the manager

Non-leader instance:

2024/03/20 21:32:57 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
...
2024-03-20T21:32:57Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0320 21:32:57.488369      14 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
2024-03-20T21:32:57Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
2024-03-20T21:45:18Z	INFO	Stopping and waiting for non leader election runnables
2024-03-20T21:45:18Z	INFO	grpc_server	Shutting down gRPC server
2024-03-20T21:45:18Z	INFO	Stopping and waiting for leader election runnables
...
2024-03-20T21:45:18Z	INFO	Stopping and waiting for webhooks
2024-03-20T21:45:18Z	INFO	Stopping and waiting for HTTP servers
2024-03-20T21:45:18Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-03-20T21:45:18Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-03-20T21:45:18Z	INFO	Wait completed, proceeding to shutdown the manager


Fixes #5590 (Implement GRPC Healthchecks so only leader-elected GRPC server serves requests)

@aliaqel-stripe aliaqel-stripe changed the title Aliaqel/implement grpc healthchecks Implement GRPC Healthchecking Mar 11, 2024
@aliaqel-stripe aliaqel-stripe changed the title Implement GRPC Healthchecking Add GRPC Healthchecks Mar 11, 2024
@aliaqel-stripe aliaqel-stripe marked this pull request as ready for review March 11, 2024 20:19
@aliaqel-stripe aliaqel-stripe requested a review from a team as a code owner March 11, 2024 20:19
Member

@JorTurFer JorTurFer left a comment


I'm curious about this setup. I see the problem that you want to fix, but with this change starting all the servers, the service will include all the instance endpoints, so the metrics server will see all of them. How will the metrics server choose the correct instance?

@aliaqel-stripe
Contributor Author

I'm curious about this setup. I see the problem that you want to fix, but with this change starting all the servers, the service will include all the instance endpoints, so the metrics server will see all of them. How will the metrics server choose the correct instance?

Ugh, good point. We'll need to use a load-balancing policy other than the default pick_first if we want client-side health checking to work: https://grpc.io/docs/guides/health-checking/#enabling-client-health-checking

@aliaqel-stripe
Contributor Author

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

@zroubalik
Member

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

I think this would be a great contribution anyway

@aliaqel-stripe
Contributor Author

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

I think this would be a great contribution anyway

Yeah, I think the way to make this work is to change the client config to the round_robin LB policy; that way the client will always check health and connect to the serving gRPC server.

@aliaqel-stripe aliaqel-stripe force-pushed the aliaqel/implement-grpc-healthchecks branch from ba9f085 to f400c70 Compare March 20, 2024 16:29
@aliaqel-stripe
Contributor Author

@JorTurFer @zroubalik Besides local testing, how else do you want me to test this out?

@aliaqel-stripe
Contributor Author

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

Actually, I'm going to go forward with this change if possible. Per @zroubalik's comments and some more testing, it would be safer to have this.

I'm also working on the change to add metrics for the GRPC requests so we can better observe this.

@JorTurFer
Member

JorTurFer commented Mar 24, 2024

/run-e2e internal
Update: You can check the progress here

Member

@JorTurFer JorTurFer left a comment


Great job!

@aliaqel-stripe
Contributor Author

@JorTurFer I can't tell whether the test failures were caused by my changes. Logs from the servers seem fine, and I can't immediately tell which e2e tests failed.

@JorTurFer
Member

I think the failure is unrelated (there is another PR in progress reviewing the e2e tests). Let me trigger them again.

@JorTurFer
Member

JorTurFer commented Mar 25, 2024

/run-e2e internal
Update: You can check the progress here

@JorTurFer JorTurFer merged commit 3bf5151 into kedacore:main Mar 26, 2024
20 checks passed
aliaqel-stripe added a commit to aliaqel-stripe/keda that referenced this pull request Apr 11, 2024
zroubalik pushed a commit that referenced this pull request Apr 11, 2024