Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🌱 clusterctl: always run crd migration if possible to reduce conversion webhook usage #10513

Conversation

chrischdi
Copy link
Member

@chrischdi chrischdi commented Apr 25, 2024

What this PR does / why we need it:

If objects are still stored in the old version, every time an object is requested
the kube-apiserver will call the conversion webhook. Instead of lots of conversion
webhook calls we once do the migration if possible.

Also increases the timeout used for the migrations: Before running CRD migration
clusterctl usually upgrades cert-manager which may lead to updating Certificates and
rolling out new certificates to our controllers. There is a chance that this can take
up to 90s in which conversion, validating or mutatingwebhooks may be unavailable.
Because the migration gets run more aggresively during upgrades this also increases
the relevant timeouts.


This also fixes to not run the CRD version migration, if the stored version is equal to the storageVersion (which would result in running migration, but being a no-op because of no conversion happening).

This could e.g. be the case when upgrading from v0.4 to v1.6 or v1.7: What would happen with the old code:

  • old CRD is deployed, having storageVersion=v1alpha4 and storedVersions = [v1alpha4]
  • clusterctl upgrade apply
    • CRD migration starts
      • v1alpha4 is not served anymore so it runs migration
      • migration is no-op, because storageVersion is still v1alpha4
    • CRD's get patched/updated
    • CRD's would have storedVersions = [v1alpha4, v1beta1] (as soon as the first object got applied

It is totally fine that v1alpha4 is not served anymore with the new CRD, important is that it is still in the versions of the new CRD, so kube-apiserver is still able to run the conversion webhook. The next upgrade would cover removing the valpha4 from storedVersions, because the migration would be done.

We have a safeguard to not run in the error case above the changed code:

return false, errors.Errorf("unable to upgrade CRD %q because the new CRD does not contain the storage version %q of the current CRD, thus not allowing CR migration", newCRD.Name, currentStorageVersion)

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

/area clusterctl

@k8s-ci-robot k8s-ci-robot added area/clusterctl Issues or PRs related to clusterctl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 25, 2024
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 25, 2024
@chrischdi chrischdi changed the title 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion [WIP] 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion Apr 25, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 25, 2024
@chrischdi
Copy link
Member Author

chrischdi commented Apr 25, 2024

Nevermind

/hold cancel

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 25, 2024
@chrischdi chrischdi changed the title [WIP] 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion Apr 25, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 25, 2024
@chrischdi chrischdi changed the title 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion [WI{] 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion Apr 25, 2024
@chrischdi chrischdi changed the title [WI{] 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion [WIP] 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion Apr 25, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 25, 2024
… webhook usage

If objects are still stored in the old version, every time an object is requested
the kube-apiserver will call the conversion webhook. Instead of lots of conversion
webhook calls we once do the migration if possible.

Also increases the timeout used for the migrations: Before running CRD migration
clusterctl usually upgrades cert-manager which may lead to updating Certificates and
rolling out new certificates to our controllers. There is a chance that this can take
up to 90s in which conversion, validating or mutatingwebhooks may be unavailable.
Because the migration gets run more aggresively during upgrades this also increases
the relevant timeouts.
@chrischdi chrischdi force-pushed the pr-prevent-unnecessary-crd-migration branch from baa5b6d to d754ad5 Compare April 25, 2024 13:54
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 25, 2024
@chrischdi chrischdi changed the title [WIP] 🌱 clusterctl: don't run crd migration if storedVersions is equal to storageVersion [WIP] 🌱 clusterctl: always run crd migration if possible to reduce conversion webhook usage Apr 25, 2024
@chrischdi chrischdi changed the title [WIP] 🌱 clusterctl: always run crd migration if possible to reduce conversion webhook usage 🌱 clusterctl: always run crd migration if possible to reduce conversion webhook usage Apr 25, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 25, 2024
@chrischdi
Copy link
Member Author

/hold

For getting a good review

@sbueringer
Copy link
Member

/test pull-cluster-api-e2e-main

@sbueringer
Copy link
Member

sbueringer commented Apr 26, 2024

@killianmuldoon PTAL. I think the logic around that we have to go through CR migration if an apiVersion becomes unserved was flawed. A version doesn't have to be served to migrate it: a) the get & update calls can be done against the current storage version b) served is not required for conversion (apiserver will just call the conversion webhook and that's it)

I think the new logic is significantly easier to reason through. Basically trigger migration ASAP to avoid unnecessary conversion webhook calls later at runtime.

@chrischdi chrischdi force-pushed the pr-prevent-unnecessary-crd-migration branch from d3879cf to 4a7acb0 Compare April 26, 2024 09:23
Copy link
Contributor

@killianmuldoon killianmuldoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - pending fixups - this is definitely easier to reason about than the previous migrations

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 30, 2024
@fabriziopandini
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 30, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 3079234d0ebd5b679a4a30871d4fb0d2d29ebf2d

@sbueringer
Copy link
Member

Thx!

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 30, 2024
@sbueringer
Copy link
Member

sbueringer commented Apr 30, 2024

/hold cancel

Assuming a good review (or two or three) has been provided 😂

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 30, 2024
@k8s-ci-robot k8s-ci-robot merged commit e1a701f into kubernetes-sigs:main Apr 30, 2024
21 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.8 milestone Apr 30, 2024
@cahillsf
Copy link
Member

cahillsf commented Dec 1, 2024

/cherry-pick release-1.7

@k8s-infra-cherrypick-robot

@cahillsf: new pull request created: #11513

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/clusterctl Issues or PRs related to clusterctl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants