-
Notifications
You must be signed in to change notification settings - Fork 331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add health checks #2671
Add health checks #2671
Conversation
Codecov ReportBase: 81.42% // Head: 81.30% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #2671 +/- ##
==========================================
- Coverage 81.42% 81.30% -0.13%
==========================================
Files 162 163 +1
Lines 9849 9900 +51
==========================================
+ Hits 8020 8049 +29
- Misses 1590 1609 +19
- Partials 239 242 +3
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
@dprotaso @evankanderson gentle ping for review. |
We should not have the same problem as the controller-runtime issue, because we always start controllers, even if they don't win the leader election. I agree that the check should start up and run independently of the controller startup, but it seems like we shouldn't return healthy until informer caches have been filled (which, for example, can be blocked by incorrect RBAC permissions). |
I'd also advocate for having health endpoints always enabled. If you don't want to use them, they should be minimal overhead. Ideally, we could provide this same server to the Prometheus metrics implementation and share a port, but I don't want to force this behind refactoring that big ball of mud. |
@evankanderson hi, thanks for reviewing. Some thoughts:
Initially I had it like that and spend some time thinking about options. So not elected instances dont own any bucket right? That technically means they are not ready no? Even with multiple buckets there is bucket migration. That is a less of a problem though it depends what is defined as ready here. Now, if health probes (liveness) fail here it does not help much if you have to re-sync the cache when you restart. Also, if we have an error at this stage I think it is unlikely that a restart could help. Even with the old problem we had wrt OOM and informers, restarting didnt help or if RBAC permissions are wrong there is nothing that is improved with restarts. Health checks (liveness probes) and restarts are usually for deadlocks, fixable ooms etc. I think having the check after cache sync it opens the door to a configuration headache depending on the cluster api server communication and cache size, no? Then the user will have to set initialDelaySeconds accordingly or other settings.
Then it could be the opposite the flag to be a disable flag in case someone wants to disable it and override it with his own. I can easily change it. Not sure though if any project eg. sandbox projects, will be affected by this. A scenario could be that a project uses sharedmain and has a custom probe.
Yes indeed I think there was such a reference in some controller runtime issue but we can keep it simple here for now. |
/assign @evankanderson |
Will update the PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! I'm happy to build this up incrementally.
It feels to me like it might make sense to have a registration mechanism; something like AddReadinessCheck()
and AddLivenessCheck()
.
I'm also wondering if these interfaces should live in sharedmain
instead of network
, but I think we could work that out now or later.
I am going to insist on switching the usage of log
for getting the logger
from the Context, though.
network/health_check.go
Outdated
}() | ||
} | ||
|
||
func newHealthCheck(sigCtx context.Context) func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts about how we might let users customize this later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could refactor this to allow pass a func and propagate it from ServeHealthProbes(ctx context.Context) or use context to register another callback there and get it with some func such as healthCheckFromContext. WDYTH?
network/health_check.go
Outdated
} | ||
|
||
func (h *healthHandler) handle(w http.ResponseWriter, r *http.Request) { | ||
if IsKubeletProbe(r) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking IsKubeletProbe
here keeps this health checker from being used for e.g. proxy or apiserver health checks.
I know we may not need them today, but I'm wondering if we should be passing the request to the health checking function rather than doing this check here. i.e. func(http.Request) error
. This might be a half-baked idea, though.
network/health_check.go
Outdated
func (h *healthHandler) handle(w http.ResponseWriter, r *http.Request) { | ||
if IsKubeletProbe(r) { | ||
if err := h.HealthCheck(); err != nil { | ||
log.Println("Healthcheck failed: ", err.Error()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this extract the logger from the context and use that?
network/health_check.go
Outdated
} | ||
|
||
// UserReadinessProbe checks if user has explicitly requested to set a readiness probe in the related context. | ||
func UserReadinessProbe(ctx context.Context) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can extend with a custom url in the future to change the default path too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to figure out why you'd want to disable these on the server. I understand why you might want to remove them from the manifest, but what's the rationale for not registering them with the HTTPServer?
@evankanderson pls review. Updates: Not sure how queue proxy could utilize this because drainer is used as part of a handler chain. Also I think there are more requirements there like TLS which we dont address at the moment (some user services require this). |
Yeah, I'm not deeply concerned about using this for At the same time, it would be nice if we could use this for e.g. eventing brokers, and have them fail readiness or liveness if they can't reach their storage backends. That probably needs that we need something like an ability to register health checks in the |
(Reviewing now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of enabling health probes -- is there any reason not to have them on all the time (it might simplify the code)?
I'd also sort of like to be able to say:
func NewController(ctx context.Context, cmw configmap.Watcher) *controller.Impl {
...
checker := func (r http.Request) error {
// pretend we have appropriate locking
return impl.connectionError
}
// internally, calls getReadinessRegisterFunc(ctx).Register(checker)
injection.AddReadiness(ctx, checker)
return impl
}
Of course, checker
could also be impl.IsHealthy
or the like.
I'm okay with either a func (http.Request) error
or a func () error
signature -- whichever seems less of a pain in the common case.
/hold
/lgtm
If you like the above interface, I don't know if we want to introduce this one and then remove it.
If you want to just enable health checks, I'd probably remove most of the public interfaces in network
for now except the one needed from sharedmain.go
.
network/health_check.go
Outdated
} | ||
|
||
// UserReadinessProbe checks if user has explicitly requested to set a readiness probe in the related context. | ||
func UserReadinessProbe(ctx context.Context) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm struggling to figure out why you'd want to disable these on the server. I understand why you might want to remove them from the manifest, but what's the rationale for not registering them with the HTTPServer?
network/health_check.go
Outdated
if h := getOverrideHealthCheckHandler(ctx); h != nil { | ||
handler = *h | ||
} else { | ||
defaultHandle := newDefaultProbesHandle(ctx) | ||
handler = HealthCheckHandler{ | ||
ReadinessProbeRequestHandle: defaultHandle, | ||
LivenessProbeRequestHandle: defaultHandle, | ||
} | ||
} | ||
mux := http.NewServeMux() | ||
if UserReadinessProbe(ctx) { | ||
mux.HandleFunc("/readiness", handler.ReadinessProbeRequestHandle) | ||
} | ||
if UserLivenessProbe(ctx) { | ||
mux.HandleFunc("/health", handler.LivenessProbeRequestHandle) | ||
} | ||
// Set both probes if user does not want to explicitly set only one. | ||
if !UserReadinessProbe(ctx) && !UserLivenessProbe(ctx) { | ||
mux.HandleFunc("/readiness", handler.ReadinessProbeRequestHandle) | ||
mux.HandleFunc("/health", handler.LivenessProbeRequestHandle) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels complicated -- could it simplify to something like:
readiness := getReadinessHandlerOrDefault(ctx)
liveness := getLivenessHandlerOrDefault(ctx)
mux := http.NewServeMux()
mux.HandleFunc("/readiness", readiness)
mux.HandleFunc("/liveness", liveness)
return mux
And then the ...OrDefault(ctx)
calls would get the correct handler function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok let me work on the interface and see what I can come up with.
network/health_check.go
Outdated
|
||
func newDefaultProbesHandle(sigCtx context.Context) http.HandlerFunc { | ||
logger := logging.FromContext(sigCtx) | ||
once := sync.Once{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this once
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Activator has the same approach too for its healthcheck. So since the probe handler could be part of a handler chain (we might want to make it generic) we want the msg to be printed once. Of course I can simplify it for now if we expect SIGTERM to be managed once.
network/health_check.go
Outdated
} | ||
} | ||
|
||
if IsKubeletProbe(r) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still wondering if we need this IsKubeletProbe
check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to reject a request that is not coming from kubelet. The validation is weak but it seemed to me it is better than nothing. Also it could used by queue proxy at some point. Anyway I can remove it for now.
injection/sharedmain/main.go
Outdated
} | ||
|
||
// HealthProbesDisabled checks if default health checks are disabled in the related context. | ||
func HealthProbesDisabled(ctx context.Context) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be an exported method, or can it be internal-only?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be private.
The probes are on by default, the default ones are used. |
Now you can say:
Anyway I will take a look and see if it can be simpler or more explicit. |
@evankanderson I simplified the impl. Pls review. |
{"level":"error","ts":1675945698.819293,"logger":"fallback","caller":"consumergroup/reconciler.go:302","msg":"Returned an error","targetMethod":"FinalizeKind","error":"unable to delete the consumer group my.group.id: kafka server: The client is not authorized to send this request type","stacktrace":"knative.dev/eventing-kafka-broker/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup.(*reconcilerImpl).Reconcile\n\t/home/runner/work/pkg/pkg/downstream/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup/reconciler.go:302\nknative.dev/pkg/reconciler/testing.(*TableRow).Test\n\t/home/runner/work/pkg/pkg/downstream/vendor/knative.dev/pkg/reconciler/testing/table.go:160\nknative.dev/pkg/reconciler/testing.TableTest.Test.func1\n\t/home/runner/work/pkg/pkg/downstream/vendor/knative.dev/pkg/reconciler/testing/table.go:389\ntesting.tRunner\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/testing/testing.go:1439"} |
@evankanderson gentle ping. |
@evankanderson pls review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
Looks good for now! Sorry about the delays in reviewing; I had Friday as a day off.
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: evankanderson, skonto The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/unhold |
/hold @evankanderson is it ok to unhold? |
injection/health_check.go
Outdated
port := os.Getenv("KNATIVE_HEALTH_PROBES_PORT") | ||
if port == "" { | ||
port = "8080" | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not make port an arg of the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could just used what Eventing had for alignment. I will update this.
injection/health_check.go
Outdated
logger.Infof("Probes server listening on port %s", port) | ||
|
||
if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) { | ||
logger.Fatal(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should fatal - we should return an error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok we can just emit the error but eventually we will fail since probes are failing.
injection/sharedmain/main.go
Outdated
@@ -313,6 +313,11 @@ func MainWithConfig(ctx context.Context, component string, cfg *rest.Config, cto | |||
return controller.StartAll(ctx, controllers...) | |||
}) | |||
|
|||
// Setup default health checks to catch issues with cache sync etc. | |||
if !healthProbesDisabled(ctx) { | |||
injection.ServeHealthProbes(ctx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make this part of the errgroup
so that if it fails to start it should shut everything down
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok makes sense we can also avoid some nesting of the go routines.
Yes, I hadn't even realized there was still a hold. I'll typically use a hold if I want to give the author a choice between merge now and change based on optional feedback. |
@evankanderson @dprotaso pls take another look. I addressed the latest comments. One scenario not addressed here is what default-domain job uses:
This is a special case as we need to modify the path, have only a common handler and use only "/" as the path. Net-kourier probes that path only, any other path will not work. This is left for a separate PR. Btw we could have "/" as the default path for all probes but I like the cleaner approach of having explicit paths specified for both probes. |
I think it's fine to add the option to change the path (register additional paths?) as a follow-up. /lgtm |
@dprotaso is it ok to unhold? |
/unhold |
Changes
Add missing deployment probes serving#13563