v1beta1 cluster upgrade tests (using clusterctl upgrade) #1771

Merged
merged 3 commits into from
Oct 28, 2021

@shysank shysank commented Oct 11, 2021

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #1727 #1770

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squashed commits
  • includes documentation
  • adds unit tests

Release note:

Fix default diff issue when upgrading clusters from v1alpha3 to v1beta1

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note-none Denotes a PR that doesn't merit a release note. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 11, 2021
@k8s-ci-robot k8s-ci-robot added area/provider/azure Issues or PRs related to azure provider sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. labels Oct 11, 2021
@@ -63,6 +63,7 @@ func (r *AzureMachineTemplate) ValidateUpdate(oldRaw runtime.Object) error {
var allErrs field.ErrorList
old := oldRaw.(*AzureMachineTemplate)

old.Spec.Template.Spec.OSDisk.CachingType = "None"
Contributor Author

See #1770
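
To illustrate the one-line change above (setting CachingType on the old object before comparing), here is a minimal, self-contained sketch. The types `MachineTemplateSpec` and `OSDisk` and the helpers `setDefaults` and `validateUpdate` are simplified, hypothetical stand-ins, not the CAPZ API; only the default value "None" comes from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// OSDisk and MachineTemplateSpec are hypothetical, simplified stand-ins for
// the real API types; only the fields needed for the comparison are modeled.
type OSDisk struct {
	CachingType string
}

type MachineTemplateSpec struct {
	VMSize string
	OSDisk OSDisk
}

// setDefaults mimics what a defaulting webhook would fill in.
func setDefaults(s *MachineTemplateSpec) {
	if s.OSDisk.CachingType == "" {
		s.OSDisk.CachingType = "None"
	}
}

var errImmutable = errors.New("spec is immutable")

// validateUpdate rejects any difference between the old and new spec.
// When defaultOld is true, a copy of the old spec is defaulted first, so a
// value that only differs because of defaulting is not reported as a change.
func validateUpdate(oldSpec, newSpec MachineTemplateSpec, defaultOld bool) error {
	if defaultOld {
		setDefaults(&oldSpec) // oldSpec is a copy; the caller's value is untouched
	}
	if oldSpec != newSpec {
		return errImmutable
	}
	return nil
}

func main() {
	// Stored object that never went through defaulting.
	old := MachineTemplateSpec{VMSize: "Standard_D2s_v3"}

	// Incoming object: same user intent, but defaulting has been applied.
	updated := old
	setDefaults(&updated)

	fmt.Println(validateUpdate(old, updated, false)) // spec is immutable
	fmt.Println(validateUpdate(old, updated, true))  // <nil>
}
```

The sketch only shows the comparison problem; it says nothing about where the defaulting should live, which is what the review thread further down discusses.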

shysank commented Oct 11, 2021

cc @sonasingh46

test/e2e/azure_test.go Outdated Show resolved Hide resolved
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 13, 2021
shysank commented Oct 14, 2021

/test ls

@k8s-ci-robot
Contributor

@shysank: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-provider-azure-build
  • /test pull-cluster-api-provider-azure-e2e
  • /test pull-cluster-api-provider-azure-e2e-windows
  • /test pull-cluster-api-provider-azure-test
  • /test pull-cluster-api-provider-azure-verify

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-provider-azure-apidiff
  • /test pull-cluster-api-provider-azure-capi-e2e
  • /test pull-cluster-api-provider-azure-conformance-v1alpha4
  • /test pull-cluster-api-provider-azure-conformance-with-ci-artifacts
  • /test pull-cluster-api-provider-azure-coverage
  • /test pull-cluster-api-provider-azure-e2e-exp
  • /test pull-cluster-api-provider-azure-e2e-full
  • /test pull-cluster-api-provider-azure-e2e-workload-upgrade-1-22-latest-main
  • /test pull-cluster-api-provider-azure-upstream-v1alpha4-windows
  • /test pull-cluster-api-provider-azure-windows-upstream-with-ci-artifacts

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-provider-azure-apidiff
  • pull-cluster-api-provider-azure-build
  • pull-cluster-api-provider-azure-coverage
  • pull-cluster-api-provider-azure-e2e
  • pull-cluster-api-provider-azure-e2e-exp
  • pull-cluster-api-provider-azure-e2e-windows
  • pull-cluster-api-provider-azure-test
  • pull-cluster-api-provider-azure-verify

In response to this:

/test ls

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@shysank shysank force-pushed the v1beta1_upgrade_tests branch 2 times, most recently from 7c087f8 to 12667db Compare October 14, 2021 00:27
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Oct 14, 2021
shysank commented Oct 14, 2021

/test pull-cluster-api-provider-azure-capi-e2e

shysank commented Oct 14, 2021

/test pull-cluster-api-provider-azure-capi-e2e

shysank commented Oct 14, 2021

Upgrade tests are working now https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-azure/1771/pull-cluster-api-provider-azure-capi-e2e/1448517546729803776/build-log.txt

STEP: THE MANAGEMENT CLUSTER WAS SUCCESSFULLY UPGRADED!
INFO: Scaling machine deployment clusterctl-upgrade/clusterctl-upgrade-i30nkp-md-0 from 1 to 2 replicas
INFO: Waiting for correct number of replicas to exist
STEP: THE UPGRADED MANAGEMENT CLUSTER WORKS!
STEP: PASSED!

shysank commented Oct 14, 2021

@CecileRobertMichon @devigned what do you think about the changes?

c.Spec.AzureEnvironment, "field is immutable"),
)
// If old spec is in v1alpha3, we need to set defaults before comparison. This is because defaulting webhook was added
// in v1alpha4, so old spec that was created in v1alpha3 would not have run through the defaulting webhook when it was
Contributor

Looking back on #1244, where this was first introduced: this is a breaking change for users who were previously using non-default clouds with v1alpha3, but it looks like that was called out in the release notes. I wonder whether doing this in conversion would be better, though. It feels wrong to run the defaulting in webhook validation every single time, and we also have no way to tell whether old really is v1alpha3.

I would also clarify that it's not just the defaulting webhook that was added in v1alpha4; it's the AzureEnvironment API field itself.

Contributor Author

@shysank shysank Oct 14, 2021

I wonder if doing this in conversion wouldn't be better though, it feels wrong to run the defaulting in the webhook validation every single time, and we also have no way to tell if old really is v1alpha3

Having this in conversion seems more logical. I didn't do it for two reasons:

  1. We'll have to disable fuzz testing for a whole bunch of fields, since we'll be overriding the values.
  2. We'll have to duplicate the defaulting code in conversion.

I'm also a bit skeptical about this fix in general, because other fields could be affected and we won't know which until we test all the flavors. One idea that was floated internally was to not run the defaulting webhook on update. This makes sense in principle, as defaults should only be set during creation. We'll probably still have to set some defaults during conversion, but at least we'll have some kind of base to work from. wdyt?

Contributor

One potential idea that was floated internally was to not have defaulting webhook for update. This makes sense in principle as the defaults should only be set during creation.

The problem I see with this is that if controllers in a certain API version expect a certain field to be set (e.g. the AzureCluster v1alpha4 controller won't work if AzureEnvironment isn't set), reconciliation will fail after upgrade unless we default the field in conversion. But then I don't see how not having a defaulting webhook for update helps?

Contributor Author

Yeah, we'll still have to set some defaults in conversion. Not having a defaulting webhook for update gives us more predictability about what the old and new objects will be during update, as they are run through the same admission webhooks. E.g. consider a scenario where an object was created in version x.y.0, and a new default is added in a patch release x.y.1 (for the same API version). Now when the provider is updated, the object created in x.y.0 may fail during reconciliation if it tries to update, because the defaults have changed.
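
The scenario described here can be sketched in a few lines of Go. Everything in the block is hypothetical and simplified for illustration: `ClusterSpec`, `admit`, `applyDefaults`, and the `defaultOnUpdate` switch are not CAPZ or controller-runtime APIs, and "AzurePublicCloud" is only an assumed default value.

```go
package main

import (
	"errors"
	"fmt"
)

// Operation is a hypothetical stand-in for the admission operation type.
type Operation string

const (
	Create Operation = "CREATE"
	Update Operation = "UPDATE"
)

// ClusterSpec is a simplified, hypothetical spec with one immutable field.
type ClusterSpec struct {
	Location         string
	AzureEnvironment string
}

// assumedDefaultEnvironment is an assumption made for this sketch.
const assumedDefaultEnvironment = "AzurePublicCloud"

// applyDefaults plays the role of the defaulting webhook.
func applyDefaults(s *ClusterSpec) {
	if s.AzureEnvironment == "" {
		s.AzureEnvironment = assumedDefaultEnvironment
	}
}

// admit models defaulting followed by validation. Defaults are always applied
// on create; on update they are applied only when defaultOnUpdate is true.
func admit(op Operation, stored *ClusterSpec, incoming ClusterSpec, defaultOnUpdate bool) (ClusterSpec, error) {
	if op == Create || defaultOnUpdate {
		applyDefaults(&incoming)
	}
	if op == Update && stored != nil && stored.AzureEnvironment != incoming.AzureEnvironment {
		return ClusterSpec{}, errors.New("AzureEnvironment: field is immutable")
	}
	return incoming, nil
}

func main() {
	// Object stored before the default existed, so the field is empty.
	stored := ClusterSpec{Location: "eastus"}

	// Defaulting on update: the incoming copy gains the new default and the
	// immutability check sees a difference the user never asked for.
	_, err := admit(Update, &stored, stored, true)
	fmt.Println(err)

	// No defaulting on update: old and new match, but the field stays empty.
	result, err := admit(Update, &stored, stored, false)
	fmt.Println(result.AzureEnvironment == "", err)
}
```

The second call also shows the concern raised earlier in the thread: skipping defaulting on update avoids the spurious immutability error, but the field remains unset unless something else (such as conversion) fills it in.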

Contributor

The default added in conversion will only need to exist for a little while. v1alpha3 will be going away in a relatively short time. We may need to specialize fuzzing for that conversion, but it won't be around that long.

Conversion, however untidy, seems to be the place.
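
As a rough sketch of what defaulting in conversion could look like, and why it interacts with fuzz round-trip tests, consider the following. The types `OldClusterSpec` and `NewClusterSpec`, the functions `convertToNew` and `convertToOld`, and the default value are all hypothetical; the sketch also ignores any data-preservation mechanism a real conversion might use.

```go
package main

import "fmt"

// OldClusterSpec stands in for an older API version that lacks the field.
type OldClusterSpec struct {
	Location string
}

// NewClusterSpec stands in for the newer version that has the field.
type NewClusterSpec struct {
	Location         string
	AzureEnvironment string
}

// defaultEnvironment is an assumed default value for this sketch.
const defaultEnvironment = "AzurePublicCloud"

// convertToNew up-converts and fills in the default, because the old
// version has no value to carry over.
func convertToNew(in OldClusterSpec) NewClusterSpec {
	return NewClusterSpec{
		Location:         in.Location,
		AzureEnvironment: defaultEnvironment,
	}
}

// convertToOld down-converts and necessarily drops the field.
func convertToOld(in NewClusterSpec) OldClusterSpec {
	return OldClusterSpec{Location: in.Location}
}

func main() {
	// Up-conversion yields an object the newer controller can work with.
	converted := convertToNew(OldClusterSpec{Location: "eastus"})
	fmt.Println(converted.AzureEnvironment) // AzurePublicCloud

	// A fuzzed new-version object with a non-default value does not survive
	// a new -> old -> new round trip, so a naive round-trip check would fail.
	fuzzed := NewClusterSpec{Location: "westus", AzureEnvironment: "AzureUSGovernmentCloud"}
	roundTripped := convertToNew(convertToOld(fuzzed))
	fmt.Println(roundTripped == fuzzed) // false
}
```

This is the kind of mismatch that would require specializing or disabling fuzzing for the affected fields, as mentioned above.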

Contributor Author

@shysank shysank Oct 18, 2021

So, I tried moving it to conversion, but it doesn't work. My assumption that the old object is run through conversion webhook appears to be wrong.

Digging a little deeper: controller-runtime's handle tries to decode the old object using apimachinery's UniversalDeserializer which, as mentioned in the comments, does not perform conversion or defaulting.

The conversion/defaulting happens in the API server. The understanding so far is that the API server does not run conversion and defaulting webhooks for "GET" requests. More discussion here

shysank commented Oct 17, 2021

/test pull-cluster-api-provider-azure-capi-e2e

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 27, 2021
shysank commented Oct 27, 2021

/test pull-cluster-api-provider-azure-capi-e2e

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Oct 27, 2021
shysank commented Oct 27, 2021

/test pull-cluster-api-provider-azure-capi-e2e

shysank commented Oct 28, 2021

The upgrade test is passing, but other tests that are unrelated to this PR are failing. Trying again.
/test pull-cluster-api-provider-azure-capi-e2e

@CecileRobertMichon
Contributor

quick start spec failed 🤔

/retest

@CecileRobertMichon CecileRobertMichon added this to the v1.0 milestone Oct 28, 2021
Contributor

@CecileRobertMichon CecileRobertMichon left a comment

/lgtm
/approve
/hold for passing capi test

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2021
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 28, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: CecileRobertMichon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 28, 2021
shysank commented Oct 28, 2021

@CecileRobertMichon tests are passing now, finally 😄

@CecileRobertMichon
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 28, 2021
k8s-ci-robot commented Oct 28, 2021

@shysank: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-azure-e2e-windows | 5d06a8f | link | true | /test pull-cluster-api-provider-azure-e2e-windows |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

shysank commented Oct 28, 2021

/retest

Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/provider/azure: Issues or PRs related to azure provider
  • cncf-cla: yes: Indicates the PR's author has signed the CNCF CLA.
  • lgtm: "Looks good to me", indicates that a PR is ready to be merged.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • sig/cluster-lifecycle: Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add e2e upgrade tests for v1alpha3 -> v1beta1
6 participants