
Add health checks #2671

Merged
merged 7 commits into from
Feb 15, 2023

Conversation

skonto
Contributor

@skonto skonto commented Jan 26, 2023

Changes

@knative-prow knative-prow bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 26, 2023
@skonto skonto changed the title Add health checks [wip] Add health checks Jan 26, 2023
@knative-prow knative-prow bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2023
@codecov

codecov bot commented Jan 26, 2023

Codecov Report

Base: 81.42% // Head: 81.30% // Decreases project coverage by 0.13% ⚠️

Coverage data is based on head (d296136) compared to base (408ad07).
Patch coverage: 62.74% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2671      +/-   ##
==========================================
- Coverage   81.42%   81.30%   -0.13%     
==========================================
  Files         162      163       +1     
  Lines        9849     9900      +51     
==========================================
+ Hits         8020     8049      +29     
- Misses       1590     1609      +19     
- Partials      239      242       +3     
Impacted Files Coverage Δ
injection/health_check.go 62.74% <62.74%> (ø)
apis/duck/cached.go 91.89% <0.00%> (-8.11%) ⬇️


@skonto skonto changed the title [wip] Add health checks Add health checks Jan 26, 2023
@knative-prow knative-prow bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 26, 2023
@skonto
Contributor Author

skonto commented Jan 27, 2023

@dprotaso @evankanderson gentle ping for review.

@evankanderson
Member

We should not have the same problem as the controller-runtime issue, because we always start controllers, even if they don't win the leader election.

I agree that the check should start up and run independently of the controller startup, but it seems like we shouldn't return healthy until informer caches have been filled (which, for example, can be blocked by incorrect RBAC permissions).
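For illustration, a readiness check gated on cache sync could look roughly like the sketch below (the type and method names are hypothetical, not part of this PR):

type cacheSyncReadiness struct {
	once   sync.Once
	synced chan struct{} // closed once informer caches have synced
}

// MarkSynced would be called after WaitForCacheSync / controller.StartInformers succeeds.
func (c *cacheSyncReadiness) MarkSynced() { c.once.Do(func() { close(c.synced) }) }

func (c *cacheSyncReadiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	select {
	case <-c.synced:
		w.WriteHeader(http.StatusOK)
	default:
		// Stays not-ready while caches are blocked, e.g. by missing RBAC permissions.
		http.Error(w, "informer caches not synced yet", http.StatusServiceUnavailable)
	}
}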

@evankanderson
Member

I'd also advocate for having health endpoints always enabled. If you don't want to use them, they should be minimal overhead.

Ideally, we could provide this same server to the Prometheus metrics implementation and share a port, but I don't want to force this behind refactoring that big ball of mud.

@skonto
Contributor Author

skonto commented Jan 30, 2023

@evankanderson hi, thanks for reviewing. Some thoughts:

We should not have the same problem as the controller-runtime issue, because we always start controllers, even if they don't win the leader election. I agree that the check should start up and run independently of the controller startup, but it seems like we shouldn't return healthy until informer caches have been filled (which, for example, can be blocked by incorrect RBAC permissions).

Initially I had it like that and spent some time thinking about options. So non-elected instances don't own any bucket, right? That technically means they are not ready, no? Even with multiple buckets there is bucket migration. That is less of a problem, though; it depends on what is defined as ready here.

Now, if liveness probes fail here, it does not help much, since you have to re-sync the cache when you restart anyway. Also, if we hit an error at this stage, I think it is unlikely that a restart could help. Even with the old problem we had wrt OOM and informers, restarting didn't help, and if RBAC permissions are wrong, nothing is improved by restarting. Health checks (liveness probes) and restarts are usually for deadlocks, fixable OOMs, etc. I think putting the check after cache sync opens the door to a configuration headache depending on the cluster's API server communication and the cache size, no? The user would then have to set initialDelaySeconds or other settings accordingly.
The good thing is that it would clearly signal that the container is failing if it does not finish the controller/informer startup, and the user would know instead of having to dig through the logs in those scenarios. It also enforces a more complete notion of healthiness, instead of just checking whether main is running, etc.
Btw, controller-runtime does not check that part even for the leader pod, so I am a bit puzzled about what level of health check we should offer and whether we are doing too much to cover all cases. I can certainly try moving it there and see how it goes. @dprotaso @psschwei any thoughts?

I'd also advocate for having health endpoints always enabled. If you don't want to use them, they should be minimal overhead.

Then it could be the opposite: the flag becomes a disable flag, in case someone wants to disable the default and override it with their own. I can easily change it. I am not sure, though, whether any project, e.g. the sandbox projects, would be affected by this. A scenario could be that a project uses sharedmain and has a custom probe.

Ideally, we could provide this same server to the Prometheus metrics implementation and share a port, but I don't want to force this behind refactoring that big ball of mud.

Yes indeed, I think there was such a reference in some controller-runtime issue, but we can keep it simple here for now.
Do you mean using the Prometheus server for the check endpoint? If so, metrics could be turned off, or the user may not use Prometheus at all, no?

@skonto
Contributor Author

skonto commented Feb 3, 2023

/assign @evankanderson

@skonto
Contributor Author

skonto commented Feb 3, 2023

Will update the PR.

Member

@evankanderson evankanderson left a comment


Thanks for doing this! I'm happy to build this up incrementally.

It feels to me like it might make sense to have a registration mechanism; something like AddReadinessCheck() and AddLivenessCheck().

I'm also wondering if these interfaces should live in sharedmain instead of network, but I think we could work that out now or later.

I am going to insist on switching from the log package to getting the logger from the Context, though.
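For illustration, a registration mechanism along those lines could be sketched as follows (AddReadinessCheck, AddLivenessCheck, and checkList are hypothetical names, not this PR's API):

type checkList struct {
	mu     sync.Mutex
	checks []func(*http.Request) error
}

func (l *checkList) add(c func(*http.Request) error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.checks = append(l.checks, c)
}

// ServeHTTP runs every registered check and fails the probe on the first error.
func (l *checkList) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for _, check := range l.checks {
		if err := check(r); err != nil {
			http.Error(w, err.Error(), http.StatusServiceUnavailable)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
}

var readiness, liveness checkList

func AddReadinessCheck(c func(*http.Request) error) { readiness.add(c) }
func AddLivenessCheck(c func(*http.Request) error)  { liveness.add(c) }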

}()
}

func newHealthCheck(sigCtx context.Context) func() error {
Member

Any thoughts about how we might let users customize this later?

Contributor Author

@skonto skonto Feb 6, 2023

We could refactor this to allow passing a func and propagating it from ServeHealthProbes(ctx context.Context), or use the context to register another callback there and retrieve it with some func such as healthCheckFromContext. WDYT?
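A minimal sketch of the context-based variant (WithHealthCheck and healthCheckFromContext are illustrative names, not the final API):

type healthCheckKey struct{}

// WithHealthCheck stores a custom check on the context for ServeHealthProbes to pick up.
func WithHealthCheck(ctx context.Context, check func() error) context.Context {
	return context.WithValue(ctx, healthCheckKey{}, check)
}

// healthCheckFromContext returns the registered check, or nil if none was set.
func healthCheckFromContext(ctx context.Context) func() error {
	if c, ok := ctx.Value(healthCheckKey{}).(func() error); ok {
		return c
	}
	return nil
}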

}

func (h *healthHandler) handle(w http.ResponseWriter, r *http.Request) {
if IsKubeletProbe(r) {
Member

Checking IsKubeletProbe here keeps this health checker from being used for e.g. proxy or apiserver health checks.

I know we may not need them today, but I'm wondering if we should be passing the request to the health checking function rather than doing this check here. i.e. func(http.Request) error. This might be a half-baked idea, though.
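Sketched out, the handler would then just delegate (assuming the hypothetical signature HealthCheck func(*http.Request) error):

func (h *healthHandler) handle(w http.ResponseWriter, r *http.Request) {
	// The check itself decides which callers (kubelet, proxies, apiservers) it accepts.
	if err := h.HealthCheck(r); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.WriteHeader(http.StatusOK)
}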

func (h *healthHandler) handle(w http.ResponseWriter, r *http.Request) {
if IsKubeletProbe(r) {
if err := h.HealthCheck(); err != nil {
log.Println("Healthcheck failed: ", err.Error())
Member

Should this extract the logger from the context and use that?
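For example, the handler constructor could capture the logger from the signal context instead of using the global log package (a sketch, assuming the usual knative.dev/pkg/logging and zap imports; newHealthHandler is a hypothetical name):

func newHealthHandler(ctx context.Context, check func() error) http.HandlerFunc {
	logger := logging.FromContext(ctx)
	return func(w http.ResponseWriter, r *http.Request) {
		if err := check(); err != nil {
			logger.Errorw("Healthcheck failed", zap.Error(err))
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}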

}

// UserReadinessProbe checks if user has explicitly requested to set a readiness probe in the related context.
func UserReadinessProbe(ctx context.Context) bool {
Contributor Author

We can extend this with a custom URL in the future to change the default path too.

Member

I'm struggling to figure out why you'd want to disable these on the server. I understand why you might want to remove them from the manifest, but what's the rationale for not registering them with the HTTPServer?

@skonto
Contributor Author

skonto commented Feb 7, 2023

@evankanderson pls review. Updates:
a) Added a mechanism to register only specific probe types.
b) Made the health handler easy to override
c) Added unit tests

Not sure how queue-proxy could utilize this, because the drainer is used as part of a handler chain. Also, I think there are more requirements there, like TLS, which we don't address at the moment (some user services require this).
Thus the scope of this, for now, is regular deployments of the Knative control plane, but we can probably refactor further.

@evankanderson
Member

Yeah, I'm not deeply concerned about using this for queue-proxy or activator, which will both be a little funny, but I just didn't want to over-specialize and then have to walk it back.

At the same time, it would be nice if we could use this for e.g. eventing brokers, and have them fail readiness or liveness if they can't reach their storage backends. That probably means we need something like an ability to register health checks in the injection.ControllerConstructor, which probably means adding the healthcheck registration(s) to the context before calling the constructors.

@evankanderson
Member

(Reviewing now)

Member

@evankanderson evankanderson left a comment


I like the idea of enabling health probes -- is there any reason not to have them on all the time (it might simplify the code)?

I'd also sort of like to be able to say:

func NewController(ctx context.Context, cmw configmap.Watcher) *controller.Impl {
  ...
  checker := func (r http.Request) error {
    // pretend we have appropriate locking
    return impl.connectionError
  }

  // internally, calls getReadinessRegisterFunc(ctx).Register(checker)
  injection.AddReadiness(ctx, checker)

  return impl
}

Of course, checker could also be impl.IsHealthy or the like.

I'm okay with either a func (http.Request) error or a func () error signature -- whichever seems less of a pain in the common case.

/hold
/lgtm

If you like the above interface, I don't know if we want to introduce this one and then remove it.

If you want to just enable health checks, I'd probably remove most of the public interfaces in network for now except the one needed from sharedmain.go.


Comment on lines 58 to 78
if h := getOverrideHealthCheckHandler(ctx); h != nil {
	handler = *h
} else {
	defaultHandle := newDefaultProbesHandle(ctx)
	handler = HealthCheckHandler{
		ReadinessProbeRequestHandle: defaultHandle,
		LivenessProbeRequestHandle:  defaultHandle,
	}
}
mux := http.NewServeMux()
if UserReadinessProbe(ctx) {
	mux.HandleFunc("/readiness", handler.ReadinessProbeRequestHandle)
}
if UserLivenessProbe(ctx) {
	mux.HandleFunc("/health", handler.LivenessProbeRequestHandle)
}
// Set both probes if user does not want to explicitly set only one.
if !UserReadinessProbe(ctx) && !UserLivenessProbe(ctx) {
	mux.HandleFunc("/readiness", handler.ReadinessProbeRequestHandle)
	mux.HandleFunc("/health", handler.LivenessProbeRequestHandle)
}
Member

This feels complicated -- could it simplify to something like:

readiness := getReadinessHandlerOrDefault(ctx)
liveness := getLivenessHandlerOrDefault(ctx)

mux := http.NewServeMux()
mux.HandleFunc("/readiness", readiness)
mux.HandleFunc("/liveness", liveness)
return mux

And then the ...OrDefault(ctx) calls would get the correct handler function.
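For example, a sketch of the ...OrDefault helpers using the names already in this change (the helpers themselves are hypothetical):

func getReadinessHandlerOrDefault(ctx context.Context) http.HandlerFunc {
	if h := getOverrideHealthCheckHandler(ctx); h != nil && h.ReadinessProbeRequestHandle != nil {
		return h.ReadinessProbeRequestHandle
	}
	return newDefaultProbesHandle(ctx)
}

func getLivenessHandlerOrDefault(ctx context.Context) http.HandlerFunc {
	if h := getOverrideHealthCheckHandler(ctx); h != nil && h.LivenessProbeRequestHandle != nil {
		return h.LivenessProbeRequestHandle
	}
	return newDefaultProbesHandle(ctx)
}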

Contributor Author

Ok let me work on the interface and see what I can come up with.


func newDefaultProbesHandle(sigCtx context.Context) http.HandlerFunc {
	logger := logging.FromContext(sigCtx)
	once := sync.Once{}
Member

Why do we need this once?

Contributor Author

@skonto skonto Feb 9, 2023

The activator has the same approach for its health check. Since the probe handler could be part of a handler chain (we might want to make it generic), we want the message to be printed only once. Of course, I can simplify it for now if we expect SIGTERM to be handled once.
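For reference, the pattern being described looks roughly like this (a sketch, not the exact code in the PR):

func newDefaultProbesHandle(sigCtx context.Context) http.HandlerFunc {
	logger := logging.FromContext(sigCtx)
	once := sync.Once{}
	return func(w http.ResponseWriter, _ *http.Request) {
		select {
		case <-sigCtx.Done():
			// Log the shutdown transition only once, even though the kubelet keeps probing.
			once.Do(func() { logger.Info("Signal context canceled; reporting not healthy") })
			http.Error(w, "shutting down", http.StatusInternalServerError)
		default:
			w.WriteHeader(http.StatusOK)
		}
	}
}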

}
}

if IsKubeletProbe(r) {
Member

I'm still wondering if we need this IsKubeletProbe check.

Contributor Author

@skonto skonto Feb 9, 2023

We want to reject requests that are not coming from the kubelet. The validation is weak, but it seemed to me better than nothing. Also, it could be used by queue-proxy at some point. Anyway, I can remove it for now.

}

// HealthProbesDisabled checks if default health checks are disabled in the related context.
func HealthProbesDisabled(ctx context.Context) bool {
Member

Does this need to be an exported method, or can it be internal-only?

Contributor Author

Should be private.

@knative-prow knative-prow bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 9, 2023
@skonto
Contributor Author

skonto commented Feb 9, 2023

I like the idea of enabling health probes -- is there any reason not to have them on all the time (it might simplify the code)?

The probes are on by default; the default handlers are used. HealthProbesDisabled checks whether the user disabled them via the context. However, the user could choose to provide only one probe rather than both (as K8s allows on the spec side).
Otherwise, if we register both handlers in every case, we have extra probe handlers enabled by default. It does not add much, but it seemed redundant to me.

@skonto
Contributor Author

skonto commented Feb 9, 2023

I'd also sort of like to be able to say:
func NewController(ctx context.Context, cmw configmap.Watcher) *controller.Impl {
  ...
  checker := func (r http.Request) error {
    // pretend we have appropriate locking
    return impl.connectionError
  }

  // internally, calls getReadinessRegisterFunc(ctx).Register(checker)
  injection.AddReadiness(ctx, checker)

  return impl
}

Now you can say:

ctx = WithOverrideHealthCheckHandler(ctx, &HealthCheckHandler{
	ReadinessProbeRequestHandle: checker,
})
networkpkg.ServeHealthProbes(ctx)

Anyway I will take a look and see if it can be simpler or more explicit.

@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 9, 2023
@skonto
Contributor Author

skonto commented Feb 9, 2023

@evankanderson I simplified the impl. Pls review.

@skonto
Contributor Author

skonto commented Feb 9, 2023

{"level":"error","ts":1675945698.819293,"logger":"fallback","caller":"consumergroup/reconciler.go:302","msg":"Returned an error","targetMethod":"FinalizeKind","error":"unable to delete the consumer group my.group.id: kafka server: The client is not authorized to send this request type","stacktrace":"knative.dev/eventing-kafka-broker/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup.(*reconcilerImpl).Reconcile\n\t/home/runner/work/pkg/pkg/downstream/control-plane/pkg/client/internals/kafka/injection/reconciler/eventing/v1alpha1/consumergroup/reconciler.go:302\nknative.dev/pkg/reconciler/testing.(*TableRow).Test\n\t/home/runner/work/pkg/pkg/downstream/vendor/knative.dev/pkg/reconciler/testing/table.go:160\nknative.dev/pkg/reconciler/testing.TableTest.Test.func1\n\t/home/runner/work/pkg/pkg/downstream/vendor/knative.dev/pkg/reconciler/testing/table.go:389\ntesting.tRunner\n\t/opt/hostedtoolcache/go/1.18.10/x64/src/testing/testing.go:1439"}

@skonto
Contributor Author

skonto commented Feb 10, 2023

@evankanderson gentle ping.

@skonto
Contributor Author

skonto commented Feb 13, 2023

@evankanderson pls review

Member

@evankanderson evankanderson left a comment

/lgtm
/approve

Looks good for now! Sorry about the delays in reviewing; I had Friday as a day off.

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 13, 2023
@knative-prow

knative-prow bot commented Feb 13, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 13, 2023
@skonto
Contributor Author

skonto commented Feb 14, 2023

/unhold

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2023
@skonto
Contributor Author

skonto commented Feb 14, 2023

/hold @evankanderson is it ok to unhold?

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 14, 2023
Comment on lines 33 to 36
port := os.Getenv("KNATIVE_HEALTH_PROBES_PORT")
if port == "" {
	port = "8080"
}
Member

@dprotaso dprotaso Feb 14, 2023

Why not make port an arg of the function?

Contributor Author

@skonto skonto Feb 14, 2023

I just used what Eventing had, for alignment. I will update this.

logger.Infof("Probes server listening on port %s", port)

if err := server.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
	logger.Fatal(err)
Member

I don't think we should fatal - we should return an error

Contributor Author

OK, we can just return the error, but eventually the pod will fail anyway since the probes will keep failing.

@@ -313,6 +313,11 @@ func MainWithConfig(ctx context.Context, component string, cfg *rest.Config, cto
return controller.StartAll(ctx, controllers...)
})

// Setup default health checks to catch issues with cache sync etc.
if !healthProbesDisabled(ctx) {
	injection.ServeHealthProbes(ctx)
Member

I think we should make this part of the errgroup, so that if it fails to start, everything shuts down.
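A sketch of that wiring in sharedmain (assuming ServeHealthProbes returns an error and takes a port argument; healthProbesPort is a hypothetical variable):

eg.Go(func() error {
	return controller.StartAll(ctx, controllers...)
})

if !healthProbesDisabled(ctx) {
	eg.Go(func() error {
		// If the probes server fails, the errgroup surfaces the error and sharedmain shuts down.
		return injection.ServeHealthProbes(ctx, healthProbesPort)
	})
}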

Contributor Author

@skonto skonto Feb 14, 2023

OK, makes sense; we can also avoid some nesting of the goroutines.

@evankanderson
Member

Yes, I hadn't even realized there was still a hold.

I'll typically use a hold if I want to give the author a choice between merge now and change based on optional feedback.

@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2023
@skonto
Contributor Author

skonto commented Feb 15, 2023

@evankanderson @dprotaso pls take another look. I addressed the latest comments.
For the usage of this PR, check knative/serving#13563.

One scenario not addressed here is what the default-domain job uses:

	// Start an HTTP Server
	h := netprobe.NewHandler(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	//nolint:gosec https://github.com/knative/serving/issues/13439
	server := http.Server{Addr: ":8080", Handler: h, ReadHeaderTimeout: time.Minute}
	go server.ListenAndServe()

This is a special case: we need to modify the path, have only a common handler, and use only "/" as the path. Net-kourier probes that path only; any other path will not work. This is left for a separate PR. Btw, we could have "/" as the default path for all probes, but I prefer the cleaner approach of having explicit paths specified for both probes.

@evankanderson
Member

I think it's fine to add the option to change the path (register additional paths?) as a follow-up.

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Feb 15, 2023
@skonto
Contributor Author

skonto commented Feb 15, 2023

@dprotaso is it ok to unhold?

@dprotaso
Member

/unhold

@knative-prow knative-prow bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 15, 2023
@knative-prow knative-prow bot merged commit 4a80605 into knative:main Feb 15, 2023
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.