Webhook Slow to Restart #863
Comments
I looked into this, and in my testing it shuts down after 45 seconds. That 45 seconds comes from network.DefaultDrainTimeout. It's set as the gracePeriod in the Webhook, which then uses it in the Drainer. Setting that gracePeriod to a smaller value would make it shut down faster.
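A minimal sketch of that wiring, assuming the knative.dev/pkg API referenced above (network.DefaultDrainTimeout, and a handlers.Drainer with Inner, QuietPeriod, and Drain); the 15-second value and the :8443 address are illustrative only, not taken from Karpenter:

```go
package main

import (
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"knative.dev/pkg/network/handlers"
)

func main() {
	admit := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // stand-in for the real admission handler
	})

	// The gracePeriod becomes the Drainer's QuietPeriod; a smaller value here
	// is the knob that would shorten the 45-second wait.
	drainer := &handlers.Drainer{
		Inner:       admit,
		QuietPeriod: 15 * time.Second,
	}

	srv := &http.Server{Addr: ":8443", Handler: drainer}
	go srv.ListenAndServe()

	// On SIGTERM, Drain fails health probes but keeps serving real requests,
	// returning only after QuietPeriod elapses with no new non-probe traffic.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop
	drainer.Drain()
	srv.Close()
}
```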
Waiting 45 seconds seems a bit naïve to me -- so does configuring it to wait less. Minimally, if we want to kill the pod quickly, we can set a shorter termination grace period. As for better draining behavior, I'd expect it to stop accepting new requests while allowing some amount of time for open requests to be fulfilled, and eventually closing all of the connections as well. @mattmoor definitely knows what he's doing, so he might have some comment as to why this is implemented this way. Perhaps it's intentionally naïve and just waiting for someone to improve it.
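That draining behavior can be expressed with nothing but the standard library: stop accepting new connections, give in-flight requests a bounded window to finish, then force-close whatever remains. The 10-second window and :8443 address below are arbitrary illustrations, not values from Karpenter:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8443", Handler: http.DefaultServeMux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Wait for the kubelet's SIGTERM.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Shutdown stops accepting new requests and waits for open ones to
	// complete, but only up to the deadline; Close drops any stragglers.
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		srv.Close()
	}
}
```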
It's not just a fixed 45 seconds; it could be longer in some cases. While draining, it serves 500s to K8s but still services any non-K8s requests, with those non-K8s requests also resetting the drain timer. So it works the way you would expect, except the 45 seconds seems a bit long to me. There is a comment, though, about why 45s was chosen.
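For illustration only (this is not the knative.dev/pkg implementation), the semantics described above look roughly like this: probes get a 500 once draining starts, while real requests are still served and push the quiet-period deadline out again.

```go
package drain

import (
	"net/http"
	"strings"
	"sync"
	"time"
)

// Drainer serves 500s to Kubernetes probes once draining starts, keeps
// serving real requests, and lets each real request reset the quiet period.
type Drainer struct {
	mu       sync.Mutex
	Inner    http.Handler
	Quiet    time.Duration
	draining bool
	deadline time.Time
}

func (d *Drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	isProbe := strings.HasPrefix(r.UserAgent(), "kube-probe/")
	d.mu.Lock()
	if d.draining {
		if isProbe {
			d.mu.Unlock()
			http.Error(w, "draining", http.StatusInternalServerError)
			return
		}
		// A real (non-probe) request arrived: push the deadline out again.
		d.deadline = time.Now().Add(d.Quiet)
	}
	d.mu.Unlock()
	d.Inner.ServeHTTP(w, r)
}

// Drain blocks until a full quiet period passes with no non-probe traffic.
func (d *Drainer) Drain() {
	d.mu.Lock()
	d.draining = true
	d.deadline = time.Now().Add(d.Quiet)
	d.mu.Unlock()
	for {
		d.mu.Lock()
		remaining := time.Until(d.deadline)
		d.mu.Unlock()
		if remaining <= 0 {
			return
		}
		time.Sleep(remaining)
	}
}
```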
I'm reading through the history here: knative/pkg#1517, knative/pkg#1509
The kubelet, the API server requests (for webhooks) are fine. We run chaos testing during our e2e testing, and prior to this we would intermittently see the webhook returning 500s to the API server because of inadequate drain times. Honestly any fixed value here is wrong because:
The latter is a hard problem, and (I believe) a big part of why K8s has
I'd love something better. I'd guess @dprotaso would too :)
Is there a reason your webhook is set to Recreate?
IIRC this was set (and neglected) quite some time ago when we were working in clusters with tiny data planes. Since Karpenter is a node scaler, you need some (ideally tiny) amount of capacity to run it, and the remainder of the capacity can be managed by Karpenter itself. You also don't want to run the controller on nodes that Karpenter manages, or you can lock your keys in the car.

Further, I expect this was likely copied from our controller deployment, which can tolerate Recreate. We almost certainly should update the webhook to RollingUpdate, and at this point we likely should do the same for the controller itself, since we're planning on recombining them.
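A sketch of that change using the k8s.io/api types (the specific maxSurge/maxUnavailable values are illustrative, not taken from the Karpenter charts): with RollingUpdate and maxUnavailable: 0, an old webhook replica keeps serving admission requests until its replacement is Ready, rather than the Recreate gap where no webhook endpoint exists at all.

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// maxUnavailable: 0 keeps the old webhook replica serving until a new
	// replica reports Ready; maxSurge: 1 lets the new one start alongside it.
	maxUnavailable := intstr.FromInt(0)
	maxSurge := intstr.FromInt(1)

	strategy := appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &maxUnavailable,
			MaxSurge:       &maxSurge,
		},
	}
	fmt.Printf("webhook deployment strategy: %+v\n", strategy)
}
```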
Tell us about your request
The karpenter webhook never shuts down on its own; it waits for the kubelet to force-kill it after the termination grace period. When updating karpenter, the controller starts quickly and will attempt to patch provisioners after deploying capacity, which can cause the following error: