Add machine deployment reconciler #63

Merged
merged 9 commits into main from KU-1563/add-machine-deployment-reconciler on Oct 10, 2024

Conversation

HomayoonAlimohammadi
Contributor

Summary

This PR adds the MachineDeploymentReconciler described in the recent proposal. The controller watches for the upgrade-to annotation on a MachineDeployment object, hands the annotation down to the Machines it owns, and waits for them to finish upgrading.
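For readers skimming the conversation, here is a minimal sketch of that flow. The annotation key value, the getOwnedMachines helper, and the overall structure are assumptions for illustration, not the PR's actual code; the real controller lives in bootstrap/controllers/machine_deployment_controller.go and uses constants such as bootstrapv1.InPlaceUpgradeToAnnotation.

// Sketch only: placeholder names, not the PR's actual implementation.
package controllers

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// upgradeToAnnotation stands in for bootstrapv1.InPlaceUpgradeToAnnotation.
const upgradeToAnnotation = "example.com/in-place-upgrade-to" // placeholder key

type MachineDeploymentReconciler struct {
	client.Client
}

func (r *MachineDeploymentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	md := &clusterv1.MachineDeployment{}
	if err := r.Get(ctx, req.NamespacedName, md); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	upgradeTo, ok := md.Annotations[upgradeToAnnotation]
	if !ok {
		// No upgrade requested on this MachineDeployment.
		return ctrl.Result{}, nil
	}

	// Hand the annotation down to every Machine owned by this MachineDeployment.
	machines, err := r.getOwnedMachines(ctx, md)
	if err != nil {
		return ctrl.Result{}, err
	}
	for _, m := range machines {
		if m.Annotations == nil {
			m.Annotations = map[string]string{}
		}
		if m.Annotations[upgradeToAnnotation] != upgradeTo {
			m.Annotations[upgradeToAnnotation] = upgradeTo
			if err := r.Update(ctx, m); err != nil {
				return ctrl.Result{}, err
			}
		}
	}

	// Requeue until the machines report the upgrade as finished.
	return ctrl.Result{Requeue: true}, nil
}

// getOwnedMachines is a simplified stand-in for the machineGetter used in the PR.
func (r *MachineDeploymentReconciler) getOwnedMachines(ctx context.Context, md *clusterv1.MachineDeployment) ([]*clusterv1.Machine, error) {
	list := &clusterv1.MachineList{}
	if err := r.List(ctx, list, client.InNamespace(md.Namespace),
		client.MatchingLabels(md.Spec.Selector.MatchLabels)); err != nil {
		return nil, err
	}
	machines := make([]*clusterv1.Machine, 0, len(list.Items))
	for i := range list.Items {
		machines = append(machines, &list.Items[i])
	}
	return machines, nil
}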

@HomayoonAlimohammadi HomayoonAlimohammadi requested a review from a team as a code owner October 2, 2024 15:55
@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as draft October 2, 2024 16:13
@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as ready for review October 3, 2024 09:20
@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as draft October 3, 2024 11:50
@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as ready for review October 4, 2024 07:27
@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as draft October 4, 2024 07:59
@HomayoonAlimohammadi HomayoonAlimohammadi force-pushed the KU-1563/add-machine-deployment-reconciler branch from 6bb826e to 931cb7a Compare October 4, 2024 16:01
@HomayoonAlimohammadi HomayoonAlimohammadi marked this pull request as ready for review October 4, 2024 16:21
Contributor

@bschimke95 bschimke95 left a comment

Great work @HomayoonAlimohammadi
added a couple of comments

bootstrap/controllers/machine_deployment_controller.go (resolved review thread, outdated)
}

if !r.hasUpgradeInstructions(scope) {
log.Info("MachineDeployment has no upgrade instructions, removing upgrade-to annotation from machines and stopping reconciliation")
Contributor

This would silently remove any manual upgrade from a single machine, right? Is this by design, and are we fine with that?

Contributor Author

To be fair, this was not mentioned in the proposal, but I did this intentionally to prevent unnecessary confusion and complexity later on (maybe I just added more confusion?). The rationale is that I think users are supposed to use the MachineDeployment annotation and not the single-machine ones. I can remove this if you think it's not the correct move.

Member

Users should be able to perform single-machine operations; the orchestration should be optional. If the intention was not to support single-machine operations, we would have implemented the upgrade logic directly in the higher-level reconcilers, such as the MachineDeployment reconciler.
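To make the behaviour being debated concrete, here is a sketch (continuing the placeholder names from the summary sketch above, not the PR's actual code) of what removing the upgrade-to annotation from the owned machines looks like, including the side effect the reviewer points out: any annotation a user placed on a single Machine by hand is wiped as well.

// Sketch only, continuing the placeholder names from the earlier sketch.
// When the MachineDeployment carries no upgrade instructions, the reconciler
// strips the upgrade-to annotation from every owned Machine. This also removes
// an annotation that a user set manually on a single Machine.
func (r *MachineDeploymentReconciler) removeUpgradeToFromMachines(ctx context.Context, machines []*clusterv1.Machine) error {
	for _, m := range machines {
		if _, ok := m.Annotations[upgradeToAnnotation]; !ok {
			continue
		}
		delete(m.Annotations, upgradeToAnnotation)
		if err := r.Update(ctx, m); err != nil {
			return err
		}
	}
	return nil
}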

bootstrap/controllers/machine_deployment_controller.go: 4 more resolved review threads (outdated)
scope.MachineDeployment,
corev1.EventTypeWarning,
bootstrapv1.InPlaceUpgradeFailedEvent,
"MachineDeployment %q marked as in-place upgrade failed. Upgrade failed for machine %q",
Contributor

We could include the error message here. Would this make sense?

Contributor Author

This is a really good idea, but I don't think it's easily feasible right now. The single-machine in-place upgrade reconciler (which has access to the original error) publishes an event for it but does not record it anywhere else: https://github.com/canonical/cluster-api-k8s/blob/main/bootstrap/controllers/upgrade_controller.go#L189-L204
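If that ever changes, one hypothetical way to surface it would be for the single-machine reconciler to persist the error (for example in an annotation) so this reconciler can fold it into the event message. Everything below is made up for illustration; no such annotation exists in the codebase today, and it assumes the fmt and clusterv1 imports from the earlier sketch.

// Hypothetical only: the single-machine reconciler currently publishes the
// error as an Event and does not persist it anywhere the MachineDeployment
// reconciler can read. The annotation key below is invented for illustration.
const lastFailureMessageAnnotation = "example.com/in-place-upgrade-last-failure"

func upgradeFailedEventMessage(mdName string, machine *clusterv1.Machine) string {
	msg := fmt.Sprintf("MachineDeployment %q marked as in-place upgrade failed. Upgrade failed for machine %q", mdName, machine.Name)
	if reason, ok := machine.Annotations[lastFailureMessageAnnotation]; ok && reason != "" {
		msg += ": " + reason
	}
	return msg
}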

}
}

ownedMachinesM, err := r.machineGetter.GetMachinesForCluster(ctx, client.ObjectKeyFromObject(cluster), collections.OwnedMachines(&ms))
Contributor

What does the M stand for?

Contributor Author

It stands for "map", since collections.Machines is basically a map[string]*clusterv1.Machine. I can change it if you don't like it.
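For context, a small illustration of that type from sigs.k8s.io/cluster-api/util/collections and of how it relates to a plain slice (same imports as the earlier sketch, plus the collections package).

// collections.Machines is a map keyed by machine name, hence the "M" suffix on
// ownedMachinesM. Converting it to a slice looks like this; the collections
// package also exposes UnsortedList for the same purpose.
func machinesToSlice(ownedMachinesM collections.Machines) []*clusterv1.Machine {
	machines := make([]*clusterv1.Machine, 0, len(ownedMachinesM))
	for _, m := range ownedMachinesM { // map iteration order is not deterministic
		machines = append(machines, m)
	}
	return machines
}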

// This function triggers a retry by removing the `Status` and `ChangeID` annotations,
// but the `LastFailedAttemptAt` annotation lets us discriminate between a retry and a fresh upgrade.
// Note that we don't remove the `LastFailedAttemptAt` annotation in the `InPlaceUpgradeReconciler`;
// that's the responsibility of the `MachineDeploymentReconciler`.
Contributor

Note: I'm not super happy about this solution. It creates an implicit dependency between two controllers, but I don't really have an alternative, so I'm fine with it for now.

Contributor Author

I understand; this coupling is indeed not desirable. But given the other solutions we discussed, I feel this is an acceptable workaround for now.
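For readers following along, a sketch of the hand-off described in the code comment above. The annotation keys are placeholders (the real ones are constants in the bootstrap API package, e.g. bootstrapv1.InPlaceUpgradeStatusAnnotation); the point is only which annotations get removed and which one deliberately stays.

// Sketch only; placeholder annotation keys.
const (
	statusAnnotation            = "example.com/in-place-upgrade-status"         // placeholder
	changeIDAnnotation          = "example.com/in-place-upgrade-change-id"      // placeholder
	lastFailedAttemptAnnotation = "example.com/in-place-upgrade-last-failed-at" // placeholder
)

// markForRetry clears Status and ChangeID so the single-machine reconciler
// starts the upgrade over, but keeps LastFailedAttemptAt: its presence is what
// distinguishes a retry from a fresh upgrade. Only the MachineDeployment
// reconciler ever removes LastFailedAttemptAt.
func markForRetry(m *clusterv1.Machine) {
	delete(m.Annotations, statusAnnotation)
	delete(m.Annotations, changeIDAnnotation)
	// lastFailedAttemptAnnotation is intentionally left in place here.
}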

pkg/trace/trace.go (resolved review thread)
Member

@berkayoz berkayoz left a comment

Great work! Did an initial pass and left some comments.

@@ -17,4 +18,5 @@ const (
InPlaceUpgradeInProgressEvent = "InPlaceUpgradeInProgress"
InPlaceUpgradeDoneEvent = "InPlaceUpgradeDone"
InPlaceUpgradeFailedEvent = "InPlaceUpgradeFailed"
InPlaceUpgradeCanceledEvent = "InPlaceUpgradeCanceled"
Member

Suggested change
InPlaceUpgradeCanceledEvent = "InPlaceUpgradeCanceled"
InPlaceUpgradeCancelledEvent = "InPlaceUpgradeCancelled"

delete(annotations, bootstrapv1.InPlaceUpgradeToAnnotation)

annotations[bootstrapv1.InPlaceUpgradeStatusAnnotation] = bootstrapv1.InPlaceUpgradeDoneStatus
annotations[bootstrapv1.InPlaceUpgradeReleaseAnnotation] = scope.UpgradeTo
Member

We should also add this under MachineDeployment.spec.template.metadata.annotations, which propagates to MachineSet.spec.template.metadata.annotations and finally to Machine.annotations. This is important for scale-up/rolling-upgrade compatibility.

We should also remove this annotation from both places once an upgrade is requested, and add it back after completion.

Contributor Author

Right, I'll make the changes. One question, though:
For scale-up, we trigger the reconciliation and add the upgrade-to annotation to the machines (manually in this reconciler); that should be enough, right? I also tested scale-ups and everything works as expected (new machines get annotated and upgraded).
For rolling upgrades, I'm not sure what we should do here...
Say we previously did an in-place upgrade by annotating the MachineDeployment. What should happen when the user triggers a rolling upgrade (by changing the version) while also keeping the in-place upgrade annotation?

Contributor Author

Worth mentioning that while the current implementation does cover scaling up, it's probably not standard (in terms of Kubernetes and how annotation propagation works), nor is it aligned with how the CK8sConfigReconciler works (which uses the release annotation to overwrite the snap install instructions).
So I'll change this. Thanks for pointing this out.

}

// NOTE(Hue): The reason we are checking the `Release` annotation as well is that we want to make sure
// we upgrade the new machines that joined after the initial upgrade process.
Member

I'm not sure what the intention is here; the CK8sConfigReconciler already checks for the release annotation and overrides the install. I believe just propagating the release annotation should be enough.
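To make the note in the diff concrete, here is a simplified sketch of the kind of check it describes: treating either a pending upgrade-to request or a previously recorded release as upgrade instructions, so machines that join after the initial upgrade are still brought to the same version. The names continue the placeholder sketch; the PR's real method is r.hasUpgradeInstructions on its scope type.

// Sketch only; simplified stand-in for the PR's hasUpgradeInstructions.
func hasUpgradeInstructions(md *clusterv1.MachineDeployment) bool {
	ann := md.Annotations
	// An explicit upgrade-to request or a previously recorded release both
	// count: the recorded release is what lets machines that join after the
	// initial upgrade be upgraded to the same version.
	return ann[upgradeToAnnotation] != "" || ann[releaseAnnotation] != ""
}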

Contributor

@eaudetcobello eaudetcobello left a comment

This is really great work overall. I apologize for being a bit insistent about the sorting logic—it's just the one part of the PR that doesn't sit quite right with me. Just wanted to share my 2 cents; feel free to resolve and merge as you see fit.

Comment on lines +337 to +346
slices.SortStableFunc(ownedMachines, func(m1, m2 *clusterv1.Machine) int {
switch {
case m1.UID < m2.UID:
return -1
case m1.UID == m2.UID:
return 0
default:
return 1
}
})
Contributor

@eaudetcobello eaudetcobello Oct 8, 2024

Are there any guarantees that UIDs increase monotonically? That looks like a prerequisite for this sort to work.

Additionally, this sort function runs every time Reconcile is called. I'm sure you'll understand my POV when I say that sorting 10, 100, or 1000 strings every x seconds is a waste of CPU cycles.

EDIT: Sorting 1 million UIDs takes 300 ms, so this is a non-issue.
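Purely to illustrate the determinism concern (not something the PR adopts): if monotonically increasing UIDs turn out not to be guaranteed, sorting on fields with a defined order, for example creation timestamp with the name as a tie-breaker, gives a stable ordering. This assumes the slices and strings imports alongside the snippet above.

// Illustrative alternative only; the PR keeps the UID-based sort shown above.
slices.SortStableFunc(ownedMachines, func(m1, m2 *clusterv1.Machine) int {
	if c := m1.CreationTimestamp.Time.Compare(m2.CreationTimestamp.Time); c != 0 {
		return c
	}
	return strings.Compare(m1.Name, m2.Name) // tie-breaker for equal timestamps
})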

@eaudetcobello eaudetcobello dismissed their stale review October 8, 2024 15:03

nothing to do

Contributor

@bschimke95 bschimke95 left a comment

LGTM, great work!

@HomayoonAlimohammadi HomayoonAlimohammadi force-pushed the KU-1563/add-machine-deployment-reconciler branch from fa5dcbf to 4156c15 Compare October 10, 2024 07:11
@HomayoonAlimohammadi
Contributor Author

Worth noting that the added commits after the LGTM are just because of a rebase (with no conflicting changes).

@HomayoonAlimohammadi HomayoonAlimohammadi merged commit 8a45adb into main Oct 10, 2024
7 checks passed