Cluster occasionally stops reconciling prior to completion #1954

mdbooth · 2024-03-18T13:50:15Z

/kind bug

I've seen this several times recently, normally on the upgrade test. An example:

https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1951/pull-cluster-api-provider-openstack-e2e-full-test/1769057441917440000

In the logs we see:

I0316 18:33:56.607682       1 port.go:599] "Adopted previously created port which was not in status" controller="openstackcluster" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="OpenStackCluster" OpenStackCluster="self-hosted-xudrlo/self-hosted-0wwq91" namespace="self-hosted-xudrlo" name="self-hosted-0wwq91" reconcileID="16545edd-8004-4a0d-b4ef-7a73fcf69301" cluster="self-hosted-0wwq91" port index=0 portID="779b61f4-3ce4-4319-9ece-213726d71430"

and then nothing further from the cluster controller. The above log message is always associated with an update to the status, so however the reconcile ended we should have been reconciled again.

The root cause was this predicate, which explicitly filters updates which only change the status:

cluster-api-provider-openstack/controllers/openstackcluster_controller.go

Lines 829 to 838 in c1fc60d

    
           // Avoid reconciling if the event triggering the reconciliation is related to incremental status updates 
        
           UpdateFunc: func(e event.UpdateEvent) bool { 
        
           	oldCluster := e.ObjectOld.(*infrav1.OpenStackCluster).DeepCopy() 
        
           	newCluster := e.ObjectNew.(*infrav1.OpenStackCluster).DeepCopy() 
        
           	oldCluster.Status = infrav1.OpenStackClusterStatus{} 
        
           	newCluster.Status = infrav1.OpenStackClusterStatus{} 
        
           	oldCluster.ObjectMeta.ResourceVersion = "" 
        
           	newCluster.ObjectMeta.ResourceVersion = "" 
        
           	return !reflect.DeepEqual(oldCluster, newCluster) 
        
           },

This predicate was originally added in commit 4b368a6, which notes that it was copying a pattern from CAPA.

FWIW, CAPA also still has this predicate for both AWSMachine and AWSCluster: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/068aba046a2cae6c7a6de6c295604d5cb863d5ce/controllers/awsmachine_controller.go#L260-L277

We should remove this predicate in both OpenStackCluster and OpenStackMachine, as it results in a race when exiting the reconcile after a status update without returning an error.

The text was updated successfully, but these errors were encountered:

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 18, 2024

k8s-ci-robot closed this as completed in #1955 Mar 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cluster occasionally stops reconciling prior to completion #1954

Cluster occasionally stops reconciling prior to completion #1954

mdbooth commented Mar 18, 2024

Cluster occasionally stops reconciling prior to completion #1954

Cluster occasionally stops reconciling prior to completion #1954

Comments

mdbooth commented Mar 18, 2024