-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CP orchestrated in-place upgrade controller #71
Add CP orchestrated in-place upgrade controller #71
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing work! Can you explain in short form how the lock flow works, especially around failures and retries.
} | ||
|
||
// Upgrade is locked and a machine is already upgrading | ||
if lockedMachine != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens on if upgrade is failed on the machine, do we wait/hold the lock until it succeds?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out, I noticed that I had to move the IsMachineUpgradeFailed
check a bit higher.
In short, yes. Lemme explain a bit further:
Scenario 1: Not locked
- Let's say the upgrade process is not locked. We select a machine to upgrade (let's call it
M
) and lock the upgrade process while mentioning in the lock object that machineM
is upgrading.
Scenario 2: Locked and the machine is upgraded
- We see that the process is locked, so we get machine
M
and notice that it is upgraded (release
annotation). We unlock and requeue the reconciliation.
Scenario 3: Locked and the machine is not upgraded (still upgrading)
- We requeue the reconciliation.
Scenario 4: A machine that we tried to upgrade before is not there any more (somehow deleted)
- The lock config map has the machine information and tries to obtain it but notices that it's not there, so the
IsLocked
won't return anupgradingMachine
, the lock config map is deleted (Unlock
is called) and we continue on to select a fresh machine to upgrade.
Scenario 5: Locked and the machine does not have any upgrade instructions
- This is a weird situation (maybe the admin deleted the annotations accidentally), but can happen nonetheless. We consider that a stale lock and Unlock and requeue.
Scenario 6: Locked and the machine upgrade failed
- We mark the
CK8sControlPlane
object withfailed
status and publish the respective event and retry the process (basically the underlyingsingle machine in-place upgrade reconciler
is doing the retry and we just wait).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the list - this was very helpful for the review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM but please create some unittests for the inplace
package and an integration test that verifies the happy path of this feature.
8e9c788
to
b4c17be
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for adding the integration test!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fantastic work, thanks Hue
Overview
This PR adds the orchestrated in-place upgrade controller to the control plane provider.
After annotating the
CK8sControlPlane
object withinplace-upgrade-to
annotation, this controller will make sure that each control plane machine is upgraded to the specified version.By design, this controller does not allow parallel upgrades to ensure that the control plane stays responsive (prevent breaking HA, quorum, etc.)
Details
pkg/upgrade/inplace
. We should use these functionalities for theMachineDeploymentReconciler
as well (instead of separately implementing them, which is what we're doing right now).MachineDeploymentReconciler
should get renamed to resemble its responsibilities more closely (e.g.bootstrapControllers.OrchestratedInPlaceUpgradeController
) in the future.