Cert-manager certificate rotation may lead to downtime of webhooks for up to 90s #10522
Comments
This issue is currently awaiting triage. CAPI contributors will take a look as soon as possible and apply the appropriate triage label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Note: for this special case where we detected the issue, we are increasing the timeouts for CRD migration in this PR: #10513, so that there are enough retries to not abort when hitting this issue during upgrades using clusterctl.
Maybe the simplest solution is to not use a volume mount to get / propagate the secret. Both the webhook and metrics server allow setting a custom GetCertificate func via their TLS options. Might even be worth implementing a reference implementation for a certificate watcher based on a secret in CR (controller-runtime).
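A minimal sketch of what such a secret-based watcher could look like, assuming controller-runtime's webhook TLS options and a plain client-go clientset; the `secretCertWatcher` type and the polling approach are hypothetical illustrations, not an existing CR API:

```go
// Hypothetical sketch (not existing CAPI/CR code): keep the serving key pair
// in memory and refresh it straight from the cert-manager Secret, so the
// webhook server does not have to wait for the kubelet to sync a volume mount.
package certwatcher

import (
	"context"
	"crypto/tls"
	"sync"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

type secretCertWatcher struct {
	client    kubernetes.Interface
	namespace string
	name      string

	mu   sync.RWMutex
	cert *tls.Certificate
}

// refresh reads the Secret and swaps in the new key pair if it parses.
func (w *secretCertWatcher) refresh(ctx context.Context) error {
	secret, err := w.client.CoreV1().Secrets(w.namespace).Get(ctx, w.name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	cert, err := tls.X509KeyPair(secret.Data["tls.crt"], secret.Data["tls.key"])
	if err != nil {
		return err
	}
	w.mu.Lock()
	w.cert = &cert
	w.mu.Unlock()
	return nil
}

// Start polls the Secret; a watch/informer would pick up rotations even faster.
func (w *secretCertWatcher) Start(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			_ = w.refresh(ctx) // keep serving the previous cert if a refresh fails
		}
	}
}

// GetCertificate matches tls.Config.GetCertificate and is called per TLS handshake.
func (w *secretCertWatcher) GetCertificate(_ *tls.ClientHelloInfo) (*tls.Certificate, error) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	return w.cert, nil
}
```

Wiring it up would then roughly be setting `cfg.GetCertificate = watcher.GetCertificate` inside `webhook.Options.TLSOpts` (and similarly for the metrics server options), so each TLS handshake picks up the most recently refreshed certificate.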
Yes, that sounds awesome and would reduce the time down to seconds, which should almost not be noticeable.
Yup. I think we probably end up with a similar delay as the kube-apiserver. With this approach we won't get it down to 0 (like we could with a proper cert rotation), but I think it should be okay and folks should not rely on every single request working.
/priority important-longterm
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
What steps did you take and what happened?
What did you expect to happen?
Cluster API version
Probably all existing ones / all controllers using cert-manager to provide a certificate for validating/mutating/conversion webhooks.
Kubernetes version
Irrelevant
Anything else you would like to add?
I noticed this issue when I had a test failure for the clusterctl upgrade test for v0.4 -> v1.6 -> current. During the clusterctl upgrade for v0.4 -> v1.6 (using clusterctl v1.6.4), the CRD migration failed because a validating webhook was not available due to x509 errors. Turns out that the following timeline happened:
So in this case there was a timespan of ~49s in which the webhooks for CAPD were not serving with the updated certificate, while the kube-apiserver almost immediately tried to use the new ones.
The interesting part here is not why cert-manager issues a new certificate during the upgrade; there are several triggers for creating a new certificate, e.g. expiration of the existing one.
The interesting part is that it can take up to 60-90s (source) for the new secret data to be propagated to the pod, while requests are already being validated against the new certificate.
This kind of downtime for the webhooks usually happens every 60 days.
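For context, a simplified version of the volume-mount-based wiring that providers typically use today, assuming controller-runtime; the path and options are illustrative, not CAPD's exact code:

```go
// Illustrative only: the webhook server reads tls.crt/tls.key from a directory
// that a Secret volume mount provides, so a rotated certificate only becomes
// visible to the pod once the kubelet re-syncs the mounted Secret.
package main

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		WebhookServer: webhook.NewServer(webhook.Options{
			// Populated from the cert-manager Secret via a volume mount;
			// updates only arrive after the kubelet refreshes the volume.
			CertDir: "/tmp/k8s-webhook-server/serving-certs",
		}),
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

This is the mechanism behind the propagation delay: the serving certificate only changes when the mounted files change, which is gated on the kubelet's secret sync, not on the Secret update itself.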
Label(s) to be applied
/kind bug
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.