Webhook Slow to Restart #863

Closed
ellistarn opened this issue Nov 27, 2021 · 8 comments
Labels
bug, good-first-issue, operational-excellence

Comments

@ellistarn
Contributor

Tell us about your request
The karpenter webhook never shuts down on its own; it waits for the kubelet to force kill it after the termination grace period. When updating karpenter, the controller starts quickly and attempts to patch provisioners after deploying capacity, which can cause the following error:

karpenter-controller-7677cfcdc4-mjqg7 manager 2021-11-27T00:49:19.415Z	ERROR	controller.controller.counter	Reconciler error	{"commit": "3b36965", "reconciler group": "karpenter.sh", "reconciler kind": "Provisioner", "name": "default", "namespace": "", "error": "failed to persist changes to /default, Internal error occurred: failed calling webhook \"defaulting.webhook.provisioners.karpenter.sh\": Post \"https://karpenter-webhook.karpenter.svc:443/default-resource?timeout=10s\": no endpoints available for service \"karpenter-webhook\""}
  • We need to figure out why the webhook isn't shutting down
  • It probably makes sense to combine the webhook+controller in the same deployment, for shared fate on restarts.

Attachments
If you think you might have additional information that you'd like to include via an attachment, please do - we'll take a look. (Remember to remove any personally-identifiable information.)

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
ellistarn added the feature, good-first-issue, and bug labels and removed the feature label on Nov 27, 2021
ellistarn added the feature label and removed the bug label on Dec 7, 2021
@tzneal
Contributor

tzneal commented Feb 6, 2022

I looked into this, and in my testing it shuts down after 45 seconds. That 45 seconds comes from network.DefaultDrainTimeout. It's set as the gracePeriod in the Webhook, which then uses it in the Drainer.
The purpose is to not shut down the webhook until there have been 45 seconds of inactivity. There appears to be no way to set it, but perhaps a change to add it to webhook.Options upstream would be accepted?

Setting terminationGracePeriodSeconds to 45+ seconds for the webhook will at least stop the force kill.
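As a rough illustration (not the actual chart contents; everything besides terminationGracePeriodSeconds is a placeholder), that workaround would look something like this in the webhook Deployment's pod spec:

    # Sketch only: raise the pod's grace period above knative's 45s drain timeout
    # so the kubelet doesn't SIGKILL the webhook mid-drain.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: karpenter-webhook
    spec:
      template:
        spec:
          terminationGracePeriodSeconds: 60   # > network.DefaultDrainTimeout (45s)
          containers:
            - name: webhook
              image: karpenter-webhook   # placeholder image reference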

@ellistarn
Contributor Author

Waiting 45 seconds seems a bit naïve to me -- so does configuring it to wait less. Minimally, if we want to kill the pod quickly, we can set terminationGracePeriodSeconds: 5 without touching knative/pkg.

As for better draining behavior, I'd expect it to stop accepting new requests while allowing some amount of time for open requests to be fulfilled, and eventually to close all of the connections as well. @mattmoor definitely knows what he's doing, so he might have some comment as to why this is implemented this way. Perhaps it's intentionally naïve and just waiting for someone to improve it.

@tzneal
Contributor

tzneal commented Feb 6, 2022

It's not just a fixed 45 seconds; it could be longer in some cases. While draining, it serves 500s to K8s but still services any non-K8s requests, with those non-K8s requests also resetting the drain timer.

So it works the way you would expect, except that the 45 seconds seems a bit long to me. There is a comment, though, about why 45s was chosen:

    // DefaultDrainTimeout is the time that Knative components on the data
    // path will wait before shutting down server, but after starting to fail
    // readiness probes to ensure network layer propagation and so that no requests
    // are routed to this pod.
    // Note that this was bumped from 30s due to intermittent issues where
    // the webhook would get a bad request from the API Server when running
    // under chaos.
    DefaultDrainTimeout = 45 * time.Second

@ellistarn
Contributor Author

ellistarn commented Feb 6, 2022

I'm reading through the history here: knative/pkg#1517, knative/pkg#1509

@mattmoor

mattmoor commented Feb 6, 2022

While draining it serves 500's to K8s

It serves those 500s to the kubelet; the API server requests (for webhooks) are fine.

We run chaos testing during our e2e testing, and prior to this we would intermittently see the webhook returning 500s to the API server because of inadequate drain times. Honestly, any fixed value here is wrong because:

  1. The Endpoints have to be updated,
  2. Every consumer of Endpoints (transitively) has to observe that update.

The latter is a hard problem, and (I believe) a big part of why K8s has terminationGracePeriodSeconds (which defaults to 30s).
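For readers following along, the propagation described above rides on the ordinary readiness-probe path: once the draining webhook starts failing its probe, the kubelet marks the pod not-ready, the Endpoints controller drops it from the karpenter-webhook Service, and traffic to the webhook Service eventually stops being routed to that pod. A sketch of such a probe on the webhook container (the path and port are assumptions, not the actual chart values):

    # Sketch: a readiness probe whose failures during drain eventually remove
    # the pod from the Service endpoints. Path/port are assumed, not from the chart.
    readinessProbe:
      httpGet:
        path: /healthz   # hypothetical health endpoint
        port: 8443
        scheme: HTTPS
      periodSeconds: 5
      failureThreshold: 2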

@mattmoor

mattmoor commented Feb 6, 2022

Perhaps it's intentionally naïve and just waiting for someone to improve it.

I'd love something better. I'd guess @dprotaso would too :)

@dprotaso

dprotaso commented Feb 7, 2022

Is there a reason your webhook is set to Recreate instead of the default RollingUpdate?

https://github.com/aws/karpenter/blob/df5789248a247c2188000c3e20a232b913d9b969/charts/karpenter/templates/webhook/deployment.yaml#L24-L25

@ellistarn
Contributor Author

IIRC this was set (and neglected) quite some time ago when we were working in clusters with tiny data planes. Since Karpenter is a node scaler, you need some (ideally tiny) amount of capacity to run it, and the remainder of the capacity can be managed by karpenter itself. You also don't want to run the controller on nodes that Karpenter manages, or you can lock your keys in the car. Further, I expect this was likely copied from our controller deployment, which can tolerate Recreate without side effects like webhook outages.

We almost certainly should update the webhook to RollingUpdate, and at this point, likely should do the same for the controller itself since we're planning on recombining them.
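The change under discussion would look roughly like this in the webhook Deployment template (a sketch, not an actual chart diff; the maxUnavailable/maxSurge values are assumptions):

    # Sketch: switch from Recreate to RollingUpdate so a new replica is Ready
    # before the old one terminates. The values below are illustrative.
    spec:
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxUnavailable: 0
          maxSurge: 1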

ellistarn added the bug label and removed the feature label on Feb 10, 2022