Cert-manager certificate rotation may lead to downtime of webhooks for up to 90s #10522
Comments
This issue is currently awaiting triage. CAPI contributors will take a look as soon as possible and apply the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Note: for this special case where we detected the issue, we are increasing the timeouts for CRD migration in this PR: #10513, so that there are enough retries to not abort when hitting this issue during upgrades using clusterctl.
Maybe the simplest solution is to not use a volume mount to get / propagate the secret. Both the webhook and metrics server allow setting a custom GetCertificate func via their TLS options. Might even be worth implementing a reference implementation for a certificate watcher based on a secret in CR (controller-runtime).
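A minimal sketch of what such a secret-based watcher could look like, assuming controller-runtime's webhook TLS options and a plain client-go clientset; the `secretCertWatcher` type and the polling approach are hypothetical illustrations, not an existing CR API:

```go
// Hypothetical sketch (not existing CAPI/CR code): keep the serving key pair
// in memory and refresh it straight from the cert-manager Secret, so the
// webhook server does not have to wait for the kubelet to sync a volume mount.
package certwatcher

import (
	"context"
	"crypto/tls"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type secretCertWatcher struct {
	client    kubernetes.Interface
	namespace string
	name      string

	mu   sync.RWMutex
	cert *tls.Certificate
}

// refresh reads the Secret and swaps in the new key pair if it parses.
func (w *secretCertWatcher) refresh(ctx context.Context) error {
	secret, err := w.client.CoreV1().Secrets(w.namespace).Get(ctx, w.name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	cert, err := tls.X509KeyPair(secret.Data["tls.crt"], secret.Data["tls.key"])
	if err != nil {
		return err
	}
	w.mu.Lock()
	w.cert = &cert
	w.mu.Unlock()
	return nil
}

// Start polls the Secret; a watch/informer would pick up rotations even faster.
func (w *secretCertWatcher) Start(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = w.refresh(ctx) // keep serving the previous cert if a refresh fails
		}
	}
}

// GetCertificate matches tls.Config.GetCertificate and is called per TLS handshake.
func (w *secretCertWatcher) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return w.cert, nil
}
```

Wiring it up would then roughly be setting `cfg.GetCertificate = watcher.GetCertificate` inside `webhook.Options.TLSOpts` (and similarly for the metrics server options), so each TLS handshake picks up the most recently refreshed certificate.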
Yes, that sounds awesome and would reduce the time down to seconds, which should almost not be noticeable.
Yup. I think we probably end up with a similar delay as the kube-apiserver. With this approach we won't get it down to 0 (like we could with a proper cert rotation), but I think it should be okay and folks should not rely on every single request working.
/priority important-longterm
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What steps did you take and what happened?
What did you expect to happen?
Cluster API version
Probably all existing ones / all controllers using cert-manager to provide a certificate for validating/mutating/conversion webhooks.
Kubernetes version
Irrelevant
Anything else you would like to add?
I noticed this issue when I had a test failure for the clusterctl upgrade test for v0.4 -> v1.6 -> current. During the clusterctl upgrade for v0.4 -> v1.6 (using clusterctl v1.6.4), the CRD migration failed because a validating webhook was not available due to x509 errors. Turns out that the following timeline happened:
So in this case there was a timespan of ~49s in which the webhooks for CAPD were not serving with the updated certificate, while the kube-apiserver almost immediately tried to use the new ones.
The interesting part here is not why cert-manager issues a new certificate during the upgrade; there are several triggers for creating a new certificate, e.g. expiration of the existing one.
The interesting part is that it can take up to 60-90s (source) for the new secret data to be propagated to the pod, while requests are already being validated against the new certificate.
This kind of downtime for the webhooks usually happens every 60 days.
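For context, a simplified version of the volume-mount-based wiring that providers typically use today, assuming controller-runtime; the path and options are illustrative, not CAPD's exact code:

```go
// Illustrative only: the webhook server reads tls.crt/tls.key from a directory
// that a Secret volume mount provides, so a rotated certificate only becomes
// visible to the pod once the kubelet re-syncs the mounted Secret.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		WebhookServer: webhook.NewServer(webhook.Options{
			// Populated from the cert-manager Secret via a volume mount;
			// updates only arrive after the kubelet refreshes the volume.
			CertDir: "/tmp/k8s-webhook-server/serving-certs",
		}),
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

This is the mechanism behind the propagation delay: the serving certificate only changes when the mounted files change, which is gated on the kubelet's secret sync, not on the Secret update itself.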
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.