
🐛 normalize MachineSet version validation #5406

Merged
merged 1 commit into kubernetes-sigs:main on Oct 13, 2021

Conversation

abhinavnagaraj
Contributor

What this PR does / why we need it:
This PR normalizes the MachineSet template version to match the pattern v<major>.<minor>.<patch>.
This prevents the creation of new Machines when upgrading from v1alpha3 to v1alpha4 when there are no changes in the MachineDeployment spec.
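A minimal sketch of the kind of normalization this PR applies in the MachineSet defaulting webhook; the function name and wiring are illustrative, not the exact code in the PR:

```go
package webhooks

import "strings"

// normalizeVersion prepends the "v" prefix to a version string when it is
// missing, so "1.21.2" and "v1.21.2" end up stored identically.
// Illustrative sketch only; the real webhook defaults the MachineSet object
// and applies this to spec.template.spec.version.
func normalizeVersion(version string) string {
	if version != "" && !strings.HasPrefix(version, "v") {
		return "v" + version
	}
	return version
}
```

This mirrors what the MachineDeployment defaulting webhook already enforces, so MachineSets created or updated through the API end up with the same normalized value.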

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #5405

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 7, 2021
@k8s-ci-robot
Contributor

Hi @abhinavnagaraj. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 7, 2021
@sbueringer
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 7, 2021
@CecileRobertMichon left a comment (Contributor)


/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 7, 2021
@fabriziopandini
Member

/hold
For the ongoing discussion on the issue

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 8, 2021
@enxebre
Member

enxebre commented Oct 12, 2021

/lgtm

@vincepri left a comment (Member)


/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vincepri

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 12, 2021
@sbueringer
Member

As stated in #5405 (comment), I would really prefer to understand how this PR can fix the issue.

@vincepri
Member

We should probably remove the Fixes #issue from the PR description; the change on its own seems like a good one?

@sbueringer
Member

sbueringer commented Oct 13, 2021

This change adds the v prefix on MachineSets, which will make sure all new and updated MachineSets have the prefix. But all new MachineSets will have the prefix anyway because we enforce it on the MachineDeployment, so I think the change itself shouldn't be needed except to solve the edge case described in the issue (and imho it doesn't solve it).

So I wonder what happens after an upgrade. Initially (if I understood it correctly) you will have an MD and an MS without the prefix. When you scale up the MD, the v prefix will be added. I wonder who/what will trigger an update on the pre-existing MS to trigger the webhook.

Do status updates also trigger mutating webhooks configured for UPDATE? Or is our controller doing full updates on the MS resource?

If it is our controller which triggers the update, imho we will now have a race condition after this PR. More or less the MD and the MS have to get the prefix at the same time (or at least we should never run into a MD reconcile loop where only one of them has the prefix).

I'm probably missing something, but it sounds to me like the PR will make the issue harder to reproduce, not solve it. I think it could be solved for example by also adding the v prefix in the conversion webhook, because then we would always have the v prefix, which is our goal.

@sbueringer
Member

sbueringer commented Oct 13, 2021

I played around a bit with this PR and I think it only helps when the MS is somehow updated at the same time as the MD (but I couldn't reproduce this case).

Setup

  • Deploy controllers based on the current PR (ideally run the capi controller in the IDE to be able to inspect what the reconciler does)
  • Create an MD and a corresponding MS without the v prefix.

I had the following results

Option 1: Run kubectl scale to scale up the MD

  • MD will not get the v prefix because the defaulting webhook is not run on the scale subresource
  • MD controller tries to scale up the MS => the MS will get the v prefix as the MS defaulting webhook is run
  • Now we have a MD without prefix and a MS with the prefix
  • The MD will not be able to reconcile the MS without prefix as the MS webhook will always add the prefix
  • The MD controller now always runs into "MS already exists" errors as it's not possible to reconcile the MS anymore (this return)

Option 2: Run kubectl edit and change replicas

  • MD will now get the prefix as the webhook is triggered when kubectl patches the MD
  • MD will create a new MS as the MD has the prefix but the existing MS does not (because the MS webhook was never triggered)

Option 3: Trigger updates on MD and MS concurrently

I guess if the MD and MS are both updated before an MD reconcile runs, both will get the prefix and the reconcile will only scale up.

I think it might be a good idea to additionally adjust the conversion webhooks, so that whenever a v1beta1 MD or MS is retrieved it will have the v prefix.
I think this would be safer: whenever an old MD or MS is read it will get the v prefix, and as soon as the MD or MS is written to etcd it will actually get the prefix.
(Note: the only exception is MSs which have already been converted to v1beta1, but I would ignore that case.)
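A rough sketch of what adding the prefix during conversion could look like; the package and function here are stand-ins, not the actual generated conversion code:

```go
package conversion

import "strings"

// convertVersion normalizes a version string while converting an old,
// possibly un-prefixed MachineDeployment or MachineSet to the hub version,
// so every object read through the newer API already carries the "v" prefix.
// Illustrative sketch under the assumption described above.
func convertVersion(in *string) *string {
	if in == nil || *in == "" || strings.HasPrefix(*in, "v") {
		return in
	}
	out := "v" + *in
	return &out
}
```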

@enxebre
Member

enxebre commented Oct 13, 2021

…so I think the change itself shouldn't be needed except to solve the edge case described in the issue (and imho it doesn't solve it).

I think the change is legit orthogonally to the issue, which I don't think it fixes. One, for consistency and to eventually remove the edge cases you describe in #5406 (comment); and two, because the MachineSet is nothing less than a user-facing API whose UX we should care about the same as a MachineDeployment's, e.g. as a user I might choose to run a MachineSet directly because I need more granular control than I have with a MachineDeployment, and from a UX pov I'd be surprised if the fields behaved differently.

I think it might be a good idea to additionally adjust the conversion webhooks, so whenever a v1beta1 MD or MS is retrieved it will have the v prefix.

sounds reasonable to me.

@vincepri
Member

This change adds the v prefix on MachineSets which will make sure all new and updated MachineSets have the prefix. But all new MachineSets will have the prefix anyway because we enforce it on the MachineDeployment, so I think the change itself shouldn't be needed except to solve the edge case described in the issue (and imho it doesn't solve it).

Users might want to use a MachineSet without a MachineDeployment though, which has been a valid use case for quite a while. There are cases where you need strict control over how the MachineTemplate is rolled out, for which you might want to leverage a MachineSet rather than a MachineDeployment resource. The change as-is seems valid from that point of view?

Do status updates also trigger mutating webhooks configured for UPDATE? Or is our controller doing full updates on the MS resource?

We usually don't configure our admission or validation webhooks to trigger on status (or any other subresource) changes:

resources:
- machinesets
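A hedged illustration, using the upstream admissionregistration/v1 Go types, of how such a rule matches only the main resource; the group/version values are assumptions for the example, not the project's generated manifest:

```go
package webhooks

import (
	admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
)

// machineSetRule lists only "machinesets", not "machinesets/status" or
// "machinesets/scale", so UPDATE requests against those subresources never
// reach the mutating webhook. Values are illustrative.
var machineSetRule = admissionregistrationv1.RuleWithOperations{
	Operations: []admissionregistrationv1.OperationType{
		admissionregistrationv1.Create,
		admissionregistrationv1.Update,
	},
	Rule: admissionregistrationv1.Rule{
		APIGroups:   []string{"cluster.x-k8s.io"},
		APIVersions: []string{"v1beta1"},
		Resources:   []string{"machinesets"},
	},
}
```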

I'm probably missing something, but it sounds to me like the PR will make the issue harder to reproduce, not solve it. I think it could be solved for example by also adding the v prefix in the conversion webhook, because then we would always have the v prefix, which is our goal.

Wouldn't this cause a rollout?

To clarify, I don't think that the change in this PR solves the related issue; if it does, it might be by chance or because something else gets triggered. We'd have to dig a little deeper into it.

@sbueringer
Member

sbueringer commented Oct 13, 2021

Agree, the change itself is fine so that the MS itself works correctly.

| Wouldn't this cause a rollout?
I would assume it doesn't, as MD and MS would be in sync.

Ah, I think I missed something. Maybe it fixes the issue because after an upgrade the MachineSetReconciler reconciles every MS at least once (after the list call (?)). During reconcile it reconciles the external references (aka upgrades the API versions of at least the bootstrap template ref), at least in our case, because we've also upgraded the kubeadm types. Those updates should then trigger the defaulting.

Update:
I confirmed it. So post-upgrade the capi-controller is upgrading the template references automatically and, in doing so, also updates the version field through the webhooks. I think this only leaves small edge cases, like when someone upgrades from v1alpha3 to v1beta1 without updating the infra or bootstrap provider at the same time. I guess that's unlikely as most providers will update their code and API version between v1alpha3 and v1beta1.

So I think it could be good enough to just fix it with this PR.
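A minimal sketch of the mechanism described above, assuming a controller-runtime client; the helper name is hypothetical, but any full update the controller issues on the MachineSet goes through the API server and therefore through the mutating webhook:

```go
package controllers

import (
	"context"

	"sigs.k8s.io/controller-runtime/pkg/client"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// updateMachineSet is a hypothetical helper: after the reconciler bumps the
// apiVersion of a template reference (or changes anything else on the object),
// the resulting update passes through the mutating webhook, which adds the
// missing "v" prefix to spec.template.spec.version.
func updateMachineSet(ctx context.Context, c client.Client, ms *clusterv1.MachineSet) error {
	return c.Update(ctx, ms)
}
```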

@sbueringer
Member

/lgtm

@vincepri
Member

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 13, 2021
@k8s-ci-robot k8s-ci-robot merged commit af2feb1 into kubernetes-sigs:main Oct 13, 2021
@k8s-ci-robot k8s-ci-robot added this to the v0.4 milestone Oct 13, 2021
@sbueringer
Member

@vincepri We should make sure this PR gets into the next v1.0 release. Do we need a cherry-pick? I'm not sure if there was a consensus around fast-forward.

@vincepri
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@vincepri: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

I've opened a Slack thread: https://kubernetes.slack.com/archives/CDECRSC5U/p1634797739002500

@sbueringer
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

New thread in #testing-ops: https://kubernetes.slack.com/archives/C7J9RP96G/p1634827485007700

@sbueringer
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: failed to push cherry-picked changes in GitHub: pushing failed, output: "To https://github.com/k8s-infra-cherrypick-robot/cluster-api\n ! [remote rejected] cherry-pick-5406-to-release-1.0 -> cherry-pick-5406-to-release-1.0 (refusing to allow a Personal Access Token to create or update workflow .github/workflows/golangci-lint.yml without workflow scope)\nerror: failed to push some refs to 'https://github.com/k8s-infra-cherrypick-robot/cluster-api'\n", error: exit status 1

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

/cherrypick release-0.4

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #5482

In response to this:

/cherrypick release-0.4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

/cherrypick release-1.0

@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #5560

In response to this:

/cherrypick release-1.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
