Add machine deployment reconciler #63
Conversation
Force-pushed from 6bb826e to 931cb7a.
Great work @HomayoonAlimohammadi! Added a couple of comments.
}

if !r.hasUpgradeInstructions(scope) {
    log.Info("MachineDeployment has no upgrade instructions, removing upgrade-to annotation from machines and stopping reconciliation")
This would silently remove any manual upgrade from a single machine, right? Is this by design, and are we fine with that?
To be fair, this was not mentioned in the proposal, but I did it intentionally to prevent unnecessary confusion and complexity later on (maybe I just added more confusion?). The rationale is that I think users are supposed to use the `MachineDeployment` annotation rather than the single-machine ones. I can remove this if you think it's not the correct move.
Users should be able to perform single-machine operations; the orchestration should be optional. We would've implemented the upgrade logic directly in the higher-level reconcilers, such as the `MachineDeployment` reconciler, if the intention was not to perform single-machine operations.
scope.MachineDeployment,
corev1.EventTypeWarning,
bootstrapv1.InPlaceUpgradeFailedEvent,
"MachineDeployment %q marked as in-place upgrade failed. Upgrade failed for machine %q",
We could include the error message here. Would this make sense?
This is a really good idea, but I think it's not easily feasible right now. The single-machine in-place upgrade reconciler (which has access to the original error) publishes an event for it but does not record the error anywhere else: https://github.com/canonical/cluster-api-k8s/blob/main/bootstrap/controllers/upgrade_controller.go#L189-L204
}
}

ownedMachinesM, err := r.machineGetter.GetMachinesForCluster(ctx, client.ObjectKeyFromObject(cluster), collections.OwnedMachines(&ms))
What does the M stand for?
It stands for map, since `collections.Machines` is basically a `map[string]*clusterv1.Machine`. I can change it if you don't like it.
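Since Go map iteration order is randomized, a map-shaped collection like this has to be flattened into a deterministically ordered slice before any order-sensitive processing. A minimal sketch of that idea, using a local stand-in type rather than the real `collections.Machines` from cluster-api:

```go
package main

import (
	"fmt"
	"slices"
)

// Machine is a minimal stand-in for clusterv1.Machine.
type Machine struct{ Name string }

// Machines mirrors the shape of collections.Machines:
// effectively a map from machine name to the machine object.
type Machines map[string]*Machine

// sortedList flattens the map into a slice ordered by name,
// giving a deterministic order despite random map iteration.
func (ms Machines) sortedList() []*Machine {
	names := make([]string, 0, len(ms))
	for name := range ms {
		names = append(names, name)
	}
	slices.Sort(names)
	out := make([]*Machine, 0, len(ms))
	for _, n := range names {
		out = append(out, ms[n])
	}
	return out
}

func main() {
	ms := Machines{"md-0-b": {Name: "md-0-b"}, "md-0-a": {Name: "md-0-a"}}
	for _, m := range ms.sortedList() {
		fmt.Println(m.Name) // prints md-0-a, then md-0-b
	}
}
```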
// This function triggers a retry by removing the `Status` and `ChangeID` annotations,
// but the `LastFailedAttemptAt` annotation lets us discriminate between a retry and a fresh upgrade.
// Overall, we don't remove the `LastFailedAttemptAt` annotation in the `InPlaceUpgradeReconciler`.
// That's the responsibility of the `MachineDeploymentReconciler`.
Note: I'm not super happy about this solution. It introduces an implicit dependency between two controllers, but I don't really have an alternative, so I'm fine with it for now.
I understand. This coupling is indeed not desirable. But given the other solutions that we discussed, I feel like this is an acceptable workaround so far.
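The retry mechanism described above can be sketched as plain map operations. This is an illustrative sketch, not the PR's code: the annotation key strings below are hypothetical placeholders (the real constants live in `bootstrapv1`), but the split of responsibility matches the comment: the retry path clears `Status` and `ChangeID` and deliberately leaves `LastFailedAttemptAt` for the `MachineDeploymentReconciler` to clean up.

```go
package main

import "fmt"

// Hypothetical annotation keys; the real constants are defined in bootstrapv1.
const (
	statusAnnotation            = "example.io/in-place-upgrade-status"
	changeIDAnnotation          = "example.io/in-place-upgrade-change-id"
	lastFailedAttemptAnnotation = "example.io/in-place-upgrade-last-failed-attempt-at"
)

// triggerRetry clears Status and ChangeID so the single-machine reconciler
// re-attempts the upgrade, but keeps LastFailedAttemptAt: its presence is
// what distinguishes a retry from a fresh upgrade. Removing it is the
// MachineDeploymentReconciler's job, per the comment above.
func triggerRetry(annotations map[string]string) {
	delete(annotations, statusAnnotation)
	delete(annotations, changeIDAnnotation)
}

// isRetry reports whether this attempt follows a previous failure.
func isRetry(annotations map[string]string) bool {
	_, ok := annotations[lastFailedAttemptAnnotation]
	return ok
}

func main() {
	ann := map[string]string{
		statusAnnotation:            "failed",
		changeIDAnnotation:          "abc123",
		lastFailedAttemptAnnotation: "2024-01-01T00:00:00Z",
	}
	triggerRetry(ann)
	fmt.Println(isRetry(ann)) // prints true: LastFailedAttemptAt survived
}
```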
Great work! Did an initial pass and left some comments.
@@ -17,4 +18,5 @@ const (
	InPlaceUpgradeInProgressEvent = "InPlaceUpgradeInProgress"
	InPlaceUpgradeDoneEvent       = "InPlaceUpgradeDone"
	InPlaceUpgradeFailedEvent     = "InPlaceUpgradeFailed"
	InPlaceUpgradeCanceledEvent   = "InPlaceUpgradeCanceled"
Suggested change:
-	InPlaceUpgradeCanceledEvent = "InPlaceUpgradeCanceled"
+	InPlaceUpgradeCancelledEvent = "InPlaceUpgradeCancelled"
delete(annotations, bootstrapv1.InPlaceUpgradeToAnnotation)

annotations[bootstrapv1.InPlaceUpgradeStatusAnnotation] = bootstrapv1.InPlaceUpgradeDoneStatus
annotations[bootstrapv1.InPlaceUpgradeReleaseAnnotation] = scope.UpgradeTo
We should also add this under `MachineDeployment.spec.template.metadata.annotations`, which would propagate to `MachineSet.spec.template.metadata.annotations` and finally to `Machine.annotations`. This is important for scale-up/rolling-upgrade compatibility. Also, we should remove this annotation from both sections once an upgrade is requested, and add it back after completion.
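The propagation path suggested here means the release annotation has to be written in two places on the `MachineDeployment`: its own metadata and the machine template metadata. A minimal sketch of that idea, using local stand-in structs for the relevant Cluster API fields and a hypothetical annotation key (the real types and constants live in cluster-api and `bootstrapv1`):

```go
package main

import "fmt"

// Minimal stand-ins for the relevant Cluster API object shapes.
type ObjectMeta struct{ Annotations map[string]string }
type MachineTemplate struct{ ObjectMeta ObjectMeta }
type MachineDeploymentSpec struct{ Template MachineTemplate }
type MachineDeployment struct {
	ObjectMeta ObjectMeta
	Spec       MachineDeploymentSpec
}

// Hypothetical key; the real constant is bootstrapv1.InPlaceUpgradeReleaseAnnotation.
const releaseAnnotation = "example.io/in-place-upgrade-release"

// setReleaseAnnotation records the upgraded release on both the object itself
// and spec.template.metadata, so MachineSets and newly created Machines
// inherit it via the propagation path discussed above.
func setReleaseAnnotation(md *MachineDeployment, release string) {
	for _, meta := range []*ObjectMeta{&md.ObjectMeta, &md.Spec.Template.ObjectMeta} {
		if meta.Annotations == nil {
			meta.Annotations = map[string]string{}
		}
		meta.Annotations[releaseAnnotation] = release
	}
}

func main() {
	md := &MachineDeployment{}
	setReleaseAnnotation(md, "v1.30.0")
	fmt.Println(md.ObjectMeta.Annotations[releaseAnnotation])               // prints v1.30.0
	fmt.Println(md.Spec.Template.ObjectMeta.Annotations[releaseAnnotation]) // prints v1.30.0
}
```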
Right, will make the changes. But as a question: for scale-up, we will trigger the reconciliation and add the `upgrade-to` annotation to the machines (manually in this reconciler); this should be enough, right? I also tested scale-ups and everything works as expected (new machines get annotated and upgraded).
For rolling upgrades, I'm not sure what we should do here. Let's say we previously had an in-place upgrade by annotating the `MachineDeployment`. What should happen when the user triggers a rolling upgrade (by changing `version`) while also keeping the in-place upgrade annotation?
Worth mentioning that while the current implementation does cover scaling up, it's probably not standard (in terms of Kubernetes and how annotation propagation works) nor aligned with how the `Ck8sConfigReconciler` works (it uses `release` to overwrite the snap install instructions). So I'll change this. Thanks for pointing this out.
}

// NOTE(Hue): The reason we are checking the `Release` annotation as well is that we want to make sure
// we upgrade the new machines that joined after the initial upgrade process.
I'm not sure what the intention is here; the `CK8sConfigReconciler` already checks for the release annotation and overrides the install. I believe just propagating the release annotation should be enough.
Oh I was not aware of this: https://github.com/canonical/cluster-api-k8s/blob/main/bootstrap/controllers/ck8sconfig_controller.go#L400
Thanks for pointing this out.
This is really great work overall. I apologize for being a bit insistent about the sorting logic; it's just the one part of the PR that doesn't sit quite right with me. Just wanted to share my 2 cents; feel free to resolve and merge as you see fit.
slices.SortStableFunc(ownedMachines, func(m1, m2 *clusterv1.Machine) int {
	switch {
	case m1.UID < m2.UID:
		return -1
	case m1.UID == m2.UID:
		return 0
	default:
		return 1
	}
})
Are there any guarantees that UIDs increase monotonically? Looks like that's a prerequisite for this sort to work.
Additionally, this sort function runs every time the Reconcile function is called. I'm sure you'll understand my POV when I say sorting 10, 100, or 1000 strings every x seconds is a waste of CPU cycles.
EDIT: Sorting 1 million UIDs takes 300ms, so this is a non-issue.
LGTM, great work!
Force-pushed from fa5dcbf to 4156c15.
Worth noting that the added commits after the LGTM are just because of a rebase (with no conflicting changes).
Summary
This PR adds the `MachineDeploymentReconciler` according to the recent proposal. This controller is responsible for watching for the `upgrade-to` annotation on the `MachineDeployment` object, handing the annotation down to its owned `Machine`s, and waiting for them to finish upgrading.