Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster occasionally stops reconciling prior to completion #1954

Closed
mdbooth opened this issue Mar 18, 2024 · 0 comments · Fixed by #1955
Closed

Cluster occasionally stops reconciling prior to completion #1954

mdbooth opened this issue Mar 18, 2024 · 0 comments · Fixed by #1955
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@mdbooth
Copy link
Contributor

mdbooth commented Mar 18, 2024

/kind bug

I've seen this several times recently, normally on the upgrade test. An example:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1951/pull-cluster-api-provider-openstack-e2e-full-test/1769057441917440000

In the logs we see:

I0316 18:33:56.607682       1 port.go:599] "Adopted previously created port which was not in status" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="self-hosted-xudrlo/self-hosted-0wwq91" namespace="self-hosted-xudrlo" name="self-hosted-0wwq91" reconcileID="16545edd-8004-4a0d-b4ef-7a73fcf69301" cluster="self-hosted-0wwq91" port index=0 portID="779b61f4-3ce4-4319-9ece-213726d71430"

and then nothing further from the cluster controller. The above log message is always associated with an update to the status, so however the reconcile ended we should have been reconciled again.

The root cause was this predicate, which explicitly filters updates which only change the status:

// Avoid reconciling if the event triggering the reconciliation is related to incremental status updates
UpdateFunc: func(e event.UpdateEvent) bool {
oldCluster := e.ObjectOld.(*infrav1.OpenStackCluster).DeepCopy()
newCluster := e.ObjectNew.(*infrav1.OpenStackCluster).DeepCopy()
oldCluster.Status = infrav1.OpenStackClusterStatus{}
newCluster.Status = infrav1.OpenStackClusterStatus{}
oldCluster.ObjectMeta.ResourceVersion = ""
newCluster.ObjectMeta.ResourceVersion = ""
return !reflect.DeepEqual(oldCluster, newCluster)
},

This predicate was originally added in commit 4b368a6, which notes that it was copying a pattern from CAPA.

FWIW, CAPA also still has this predicate for both AWSMachine and AWSCluster: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/068aba046a2cae6c7a6de6c295604d5cb863d5ce/controllers/awsmachine_controller.go#L260-L277

We should remove this predicate in both OpenStackCluster and OpenStackMachine, as it results in a race when exiting the reconcile after a status update without returning an error.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
Status: Done
2 participants