Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CP orchestrated in-place upgrade controller #71

Conversation

HomayoonAlimohammadi
Copy link
Contributor

@HomayoonAlimohammadi HomayoonAlimohammadi commented Oct 15, 2024

Overview

This PR adds the orchestrated in-place upgrade controller to the control plane provider.
After annotating the CK8sControlPlane object with inplace-upgrade-to annotation, this controller will make sure that each control plane machine is upgraded to the specified version.
By design, this controller does not allow parallel upgrades to ensure that the control plane stays responsive (prevent breaking HA, quorum, etc.)

Details

  • A lock (which is just a glorified configMap) prevents multiple control plane nodes getting upgraded
  • Some of the common functionality is moved to pkg/upgrade/inplace. We should use these functionalities for the MachineDeploymentReconciler as well (instead of separately implementing them, which is what we're doing right now).
    • Also the MachineDeploymentReconciler should get renamed to resemble its responsibilities more closely (e.g. bootstrapControllers.OrchestratedInPlaceUpgradeController) in the future.

@HomayoonAlimohammadi HomayoonAlimohammadi requested a review from a team as a code owner October 15, 2024 13:37
Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing work! Can you explain in short form how the lock flow works, especially around failures and retries.

}

// Upgrade is locked and a machine is already upgrading
if lockedMachine != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on if upgrade is failed on the machine, do we wait/hold the lock until it succeds?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out, I noticed that I had to move the IsMachineUpgradeFailed check a bit higher.

In short, yes. Lemme explain a bit further:

Scenario 1: Not locked

  • Let's say the upgrade process is not locked. We select a machine to upgrade (let's call it M) and lock the upgrade process while mentioning in the lock object that machine M is upgrading.

Scenario 2: Locked and the machine is upgraded

  • We see that the process is locked, so we get machine M and notice that it is upgraded (release annotation). We unlock and requeue the reconciliation.

Scenario 3: Locked and the machine is not upgraded (still upgrading)

  • We requeue the reconciliation.

Scenario 4: A machine that we tried to upgrade before is not there any more (somehow deleted)

  • The lock config map has the machine information and tries to obtain it but notices that it's not there, so the IsLocked won't return an upgradingMachine, the lock config map is deleted (Unlock is called) and we continue on to select a fresh machine to upgrade.

Scenario 5: Locked and the machine does not have any upgrade instructions

  • This is a weird situation (maybe the admin deleted the annotations accidentally), but can happen nonetheless. We consider that a stale lock and Unlock and requeue.

Scenario 6: Locked and the machine upgrade failed

  • We mark the CK8sControlPlane object with failed status and publish the respective event and retry the process (basically the underlying single machine in-place upgrade reconciler is doing the retry and we just wait).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the list - this was very helpful for the review.

Copy link
Contributor

@bschimke95 bschimke95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM but please create some unittests for the inplace package and an integration test that verifies the happy path of this feature.

@HomayoonAlimohammadi HomayoonAlimohammadi force-pushed the KU-1562/add-orchestrated-in-place-upgrade-to-cp-provider branch from 8e9c788 to b4c17be Compare October 23, 2024 11:29
Copy link
Contributor

@bschimke95 bschimke95 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for adding the integration test!

Copy link
Contributor

@eaudetcobello eaudetcobello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic work, thanks Hue

@HomayoonAlimohammadi HomayoonAlimohammadi merged commit 05b0352 into main Oct 24, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants