
Add GRPC Healthchecks #5581

Merged

Conversation

aliaqel-stripe
Contributor

@aliaqel-stripe aliaqel-stripe commented Mar 7, 2024

This PR adds a gRPC healthcheck server to the operator that returns SERVING status only if the server is the leader-elected instance.

To do this, I refactored the GRPCServer class to be a non-leader-elected runnable (returning false from NeedsLeaderElection) and instead listen on the manager's Elected channel in the select statement to set the server state to serving.
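For illustration, here is a minimal sketch of that wiring (type and field names are illustrative, not the exact KEDA code; the health service is the stock one from google.golang.org/grpc/health):

```go
package server

import (
	"context"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

type GRPCServer struct {
	server  *grpc.Server    // assumed to be constructed elsewhere
	health  *health.Server
	elected <-chan struct{} // manager.Elected(), closed on winning the election
	address string
}

// NeedsLeaderElection returning false makes the manager run this
// server on every instance, not just on the leader.
func (s *GRPCServer) NeedsLeaderElection() bool {
	return false
}

func (s *GRPCServer) Start(ctx context.Context) error {
	lis, err := net.Listen("tcp", s.address)
	if err != nil {
		return err
	}

	s.health = health.NewServer()
	// Every instance starts as NOT_SERVING; only the elected leader
	// flips to SERVING below.
	s.health.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	healthpb.RegisterHealthServer(s.server, s.health)

	errCh := make(chan error, 1)
	go func() { errCh <- s.server.Serve(lis) }()

	for {
		select {
		case <-s.elected:
			// This instance won the leader election: report SERVING.
			s.health.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
			s.elected = nil // the closed channel would fire forever; a nil channel blocks
		case <-ctx.Done():
			// Graceful shutdown: drain in-flight RPCs before exiting.
			s.server.GracefulStop()
			return nil
		case err := <-errCh:
			return err
		}
	}
}
```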

This also required adding client-side health checking, so that the gRPC client checks the health of the endpoint it is connected to and only talks to a serving instance. Additionally, this required changing the load-balancing behavior of the client from pick_first to round_robin, so that the client internally watches the health-check state of the endpoints and selects a server that is in the serving state, per the docs at https://grpc.io/docs/guides/health-checking/
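A minimal sketch of what the client-side dial looks like with this (the target address is illustrative; the service config follows the gRPC health-checking guide linked above):

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// Blank import registers the client-side health-checking function.
	_ "google.golang.org/grpc/health"
)

func dialOperator() (*grpc.ClientConn, error) {
	// round_robin makes the client watch per-endpoint health and pick a
	// SERVING backend; pick_first would ignore the health state of the
	// other endpoints behind the service.
	serviceConfig := `{
		"loadBalancingConfig": [{"round_robin":{}}],
		"healthCheckConfig": {"serviceName": ""}
	}`
	return grpc.Dial(
		"dns:///keda-operator.keda.svc.cluster.local:9666", // illustrative target
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
}
```

Note that the blank import of google.golang.org/grpc/health is what actually enables client-side health checking; without it the healthCheckConfig in the service config is ignored.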

I also added a graceful shutdown of the gRPC server when ctx.Done() is closed, as there was none previously.

Verified by running in our Kubernetes cluster with 1 and 2 replicas, and by monitoring logs that the server shuts down cleanly when the pod is terminated.

Leader-elected instance:

2024/03/20 21:37:34 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
...
2024-03-20T21:37:34Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0320 21:37:34.629844      13 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
2024-03-20T21:37:34Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
I0320 21:38:05.572556      13 leaderelection.go:260] successfully acquired lease keda/operator.keda.sh
2024-03-20T21:38:05Z	INFO	Starting EventSource	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
...
2024-03-20T21:41:42Z	INFO	Stopping and waiting for non leader election runnables
2024-03-20T21:41:42Z	INFO	grpc_server	Shutting down gRPC server
2024-03-20T21:41:42Z	INFO	Stopping and waiting for leader election runnables
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-03-20T21:41:42Z	INFO	Shutdown signal received, waiting for all workers to finish	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-03-20T21:41:42Z	INFO	All workers finished	{"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-03-20T21:41:42Z	INFO	Stopping and waiting for caches
W0320 21:41:42.719516      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1.Deployment ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719677      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ClusterTriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719778      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.CloudEventSource ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719869      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledJob ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.719968      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.ScaledObject ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.720065      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v1alpha1.TriggerAuthentication ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
W0320 21:41:42.720222      13 reflector.go:458] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers.go:105: watch of *v2.HorizontalPodAutoscaler ended with: an error on the server ("unable to decode an event from the watch stream: context canceled") has prevented the request from succeeding
2024-03-20T21:41:42Z	INFO	Stopping and waiting for webhooks
2024-03-20T21:41:42Z	INFO	Stopping and waiting for HTTP servers
2024-03-20T21:41:42Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-03-20T21:41:42Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-03-20T21:41:42Z	INFO	Wait completed, proceeding to shutdown the manager

Non-leader instance:

2024/03/20 21:32:57 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
...
2024-03-20T21:32:57Z	INFO	starting server	{"kind": "health probe", "addr": "[::]:8081"}
I0320 21:32:57.488369      14 leaderelection.go:250] attempting to acquire leader lease keda/operator.keda.sh...
2024-03-20T21:32:57Z	INFO	grpc_server	Starting Metrics Service gRPC Server	{"address": ":9666"}
2024-03-20T21:45:18Z	INFO	Stopping and waiting for non leader election runnables
2024-03-20T21:45:18Z	INFO	grpc_server	Shutting down gRPC server
2024-03-20T21:45:18Z	INFO	Stopping and waiting for leader election runnables
...
2024-03-20T21:45:18Z	INFO	Stopping and waiting for webhooks
2024-03-20T21:45:18Z	INFO	Stopping and waiting for HTTP servers
2024-03-20T21:45:18Z	INFO	shutting down server	{"kind": "health probe", "addr": "[::]:8081"}
2024-03-20T21:45:18Z	INFO	controller-runtime.metrics	Shutting down metrics server with timeout of 1 minute
2024-03-20T21:45:18Z	INFO	Wait completed, proceeding to shutdown the manager


Fixes #5590 (Implement GRPC Healthchecks so only leader-elected GRPC server serves requests)

@aliaqel-stripe aliaqel-stripe changed the title Aliaqel/implement grpc healthchecks Implement GRPC Healthchecking Mar 11, 2024
@aliaqel-stripe aliaqel-stripe changed the title Implement GRPC Healthchecking Add GRPC Healthchecks Mar 11, 2024
@aliaqel-stripe aliaqel-stripe marked this pull request as ready for review March 11, 2024 20:19
@aliaqel-stripe aliaqel-stripe requested a review from a team as a code owner March 11, 2024 20:19
Member

@JorTurFer JorTurFer left a comment


I'm curious about this setup. I see the problem that you want to fix, but with this change starting all the servers, the service will include all the instance endpoints, so the metrics server will see all of them. How will the metrics server choose the correct instance?

@aliaqel-stripe
Contributor Author

I'm curious about this setup. I see the problem that you want to fix, but with this change starting all the servers, the service will include all the instance endpoints, so the metrics server will see all of them. How will the metrics server choose the correct instance?

Ugh, good point. We'll need to use a load-balancing policy other than the default pick_first if we want client-side health checking to work: https://grpc.io/docs/guides/health-checking/#enabling-client-health-checking

@aliaqel-stripe
Contributor Author

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

@zroubalik
Member

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

I think this would be a great contribution anyway

@aliaqel-stripe
Contributor Author

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

I think this would be a great contribution anyway

Yeah, I think the way to make this work is to change the client config to the round_robin LB policy; that way the client will always check health and connect to the serving gRPC server.

@aliaqel-stripe aliaqel-stripe force-pushed the aliaqel/implement-grpc-healthchecks branch from ba9f085 to f400c70 Compare March 20, 2024 16:29
@aliaqel-stripe
Contributor Author

@JorTurFer @zroubalik Besides local testing, how else do you want me to test this out?

@aliaqel-stripe
Contributor Author

Honestly, I think we might not need this now, since I decided to bypass Envoy and connect to the operator directly. Let me give this a whirl.

Actually, I'm going to go forward with this change if possible. Per @zroubalik's comments and some more testing, it would be safer to have this.

I'm also working on the change to add metrics for the GRPC requests so we can better observe this.

@JorTurFer
Member

JorTurFer commented Mar 24, 2024

/run-e2e internal
Update: You can check the progress here

Member

@JorTurFer JorTurFer left a comment


Great job!

@aliaqel-stripe
Contributor Author

@JorTurFer I can't tell whether the test failures were caused by my changes. Logs from the servers seem fine, and I can't immediately tell which e2e tests failed.

@JorTurFer
Member

I think the failure is unrelated (there is another PR in progress reviewing the e2e tests). Let me trigger them again.

@JorTurFer
Member

JorTurFer commented Mar 25, 2024

/run-e2e internal
Update: You can check the progress here

@JorTurFer JorTurFer merged commit 3bf5151 into kedacore:main Mar 26, 2024
20 checks passed
aliaqel-stripe added a commit to aliaqel-stripe/keda that referenced this pull request Apr 11, 2024
zroubalik pushed a commit that referenced this pull request Apr 11, 2024