Webhook Slow to Restart #863

Closed
ellistarn opened this issue Nov 27, 2021 · 8 comments
Labels
bug, good-first-issue, operational-excellence

Comments

@ellistarn
Contributor

Tell us about your request
The karpenter webhook never shuts down on its own; it waits for the kubelet to force kill it after the termination grace period. When updating karpenter, the controller starts quickly and attempts to patch provisioners after deploying capacity, which can cause the following error:

karpenter-controller-7677cfcdc4-mjqg7 manager 2021-11-27T00:49:19.415Z	ERROR	controller.controller.counter	Reconciler error	{"commit": "3b36965", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "name": "default", "namespace": "", "error": "failed to persist changes to /default, Internal error occurred: failed calling webhook \"defaulting.webhook.provisioners.karpenter.sh\": Post \"https://karpenter-webhook.karpenter.svc:443/default-resource?timeout=10s\": no endpoints available for service \"karpenter-webhook\""}
  • We need to figure out why the webhook isn't shutting down
  • It probably makes sense to combine the webhook+controller in the same deployment, for shared fate on restarts.

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
ellistarn added the feature, good-first-issue, and bug labels and removed the feature label on Nov 27, 2021
ellistarn added the feature label and removed the bug label on Dec 7, 2021
@tzneal
Contributor

tzneal commented Feb 6, 2022

I looked into this, and in my testing it shuts down after 45 seconds. That 45 seconds comes from network.DefaultDrainTimeout. It's set as the gracePeriod in the Webhook, which then uses it in the Drainer.
The purpose is to not shut down the webhook until there have been 45 seconds of inactivity. There appears to be no way to set it, but perhaps a change to add it to webhook.Options upstream would be accepted?

Setting terminationGracePeriodSeconds to 45+ seconds for the webhook will at least stop the force kill.
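As a rough illustration (not the actual chart contents; everything besides terminationGracePeriodSeconds is a placeholder), that workaround would look something like this in the webhook Deployment's pod spec:

    # Sketch only: raise the pod's grace period above knative's 45s drain timeout
    # so the kubelet doesn't SIGKILL the webhook mid-drain.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: karpenter-webhook
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 60   # > network.DefaultDrainTimeout (45s)
          containers:
            - name: webhook
              image: karpenter-webhook   # placeholder image reference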

@ellistarn
Contributor Author

Waiting 45 seconds seems a bit naïve to me -- so does configuring it to wait less. Minimally, if we want to kill the pod quickly, we can set terminationGracePeriodSeconds: 5 without touching knative/pkg.

As for better draining behavior, I'd expect it to stop accepting new requests while allowing some amount of time for open requests to be fulfilled, and eventually to close all of the connections as well. @mattmoor definitely knows what he's doing, so he might have some comment as to why this is implemented this way. Perhaps it's intentionally naïve and just waiting for someone to improve it.

@tzneal
Contributor

tzneal commented Feb 6, 2022

It's not just a fixed 45 seconds; it could be longer in some cases. While draining, it serves 500s to K8s but still services any non-K8s requests, with those non-K8s requests also resetting the drain timer.

So it works the way you would expect, except that the 45 seconds seems a bit long to me. There is a comment, though, about why 45s was chosen:

    // DefaultDrainTimeout is the time that Knative components on the data
    // path will wait before shutting down server, but after starting to fail
    // readiness probes to ensure network layer propagation and so that no requests
    // are routed to this pod.
    // Note that this was bumped from 30s due to intermittent issues where
    // the webhook would get a bad request from the API Server when running
    // under chaos.
    DefaultDrainTimeout = 45 * time.Second

@ellistarn
Contributor Author

ellistarn commented Feb 6, 2022

I'm reading through the history here: knative/pkg#1517, knative/pkg#1509

@mattmoor

mattmoor commented Feb 6, 2022

While draining it serves 500's to K8s

It serves those 500s to the kubelet; the API server requests (for webhooks) are fine.

We run chaos testing during our e2e testing, and prior to this we would intermittently see the webhook returning 500s to the API server because of inadequate drain times. Honestly, any fixed value here is wrong because:

  1. The Endpoints have to be updated,
  2. Every consumer of Endpoints (transitively) has to observe that update.

The latter is a hard problem, and (I believe) a big part of why K8s has terminationGracePeriodSeconds (which defaults to 30s).
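For readers following along, the propagation described above rides on the ordinary readiness-probe path: once the draining webhook starts failing its probe, the kubelet marks the pod not-ready, the Endpoints controller drops it from the karpenter-webhook Service, and traffic to the webhook Service eventually stops being routed to that pod. A sketch of such a probe on the webhook container (the path and port are assumptions, not the actual chart values):

    # Sketch: a readiness probe whose failures during drain eventually remove
    # the pod from the Service endpoints. Path/port are assumed, not from the chart.
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical health endpoint
        port: 8443
        scheme: HTTPS
      periodSeconds: 5
      failureThreshold: 2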

@mattmoor

mattmoor commented Feb 6, 2022

Perhaps it's intentionally naïve and just waiting for someone to improve it.

I'd love something better. I'd guess @dprotaso would too :)

@dprotaso

dprotaso commented Feb 7, 2022

Is there a reason your webhook is set to Recreate instead of the default RollingUpdate?

https://github.com/aws/karpenter/blob/df5789248a247c2188000c3e20a232b913d9b969/charts/karpenter/templates/webhook/deployment.yaml#L24-L25

@ellistarn
Contributor Author

IIRC this was set (and neglected) quite some time ago when we were working in clusters with tiny data planes. Since Karpenter is a node scaler, you need some (ideally tiny) amount of capacity to run it, and the remainder of the capacity can be managed by karpenter itself. You also don't want to run the controller on nodes that Karpenter manages, or you can lock your keys in the car. Further, I expect this was likely copied from our controller deployment, which can tolerate Recreate without side effects like webhook outages.

We almost certainly should update the webhook to RollingUpdate, and at this point, likely should do the same for the controller itself since we're planning on recombining them.
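The change under discussion would look roughly like this in the webhook Deployment template (a sketch, not an actual chart diff; the maxUnavailable/maxSurge values are assumptions):

    # Sketch: switch from Recreate to RollingUpdate so a new replica is Ready
    # before the old one terminates. The values below are illustrative.
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
          maxSurge: 1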

ellistarn added the bug label and removed the feature label on Feb 10, 2022