Bug: crash-looping config-policy-controller-uninstaller #131
Some HyperShift hosted clusters have been getting stuck in the uninstalling state (without transitioning to the error state), indicating a problem with ACM's deletion of the ManagedCluster object. For these clusters, we've noticed that the ManagedCluster object on the service/hub cluster has a DeletionTimestamp in the past and a status condition of ManagedClusterLeaseUpdateStopped. On the corresponding management cluster, we've noticed that the klusterlet-$CID namespace contains a pod named config-policy-controller-uninstall that's crash-looping with the following error.
Actual results:
The ManagedCluster object stays in the "Unknown" state with condition ManagedClusterLeaseUpdateStopped. config-policy-controller-uninstall keeps crash-looping. The corresponding HyperShift HostedCluster stays stuck in the uninstalling state indefinitely. Related non-ACM namespaces (e.g., ocm-staging-$CID-$CNAME) are never terminated (i.e., they remain "Active").
Expected results:
ManagedCluster object's finalizers complete without error. config-policy-controller-uninstall doesn't crash but completes as expected. klusterlet-$CID namespace is terminated. The remainder of the HyperShift destruction chain is allowed to continue as normal. HostedCluster is eventually fully uninstalled.
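The symptom described above (a DeletionTimestamp in the past combined with ManagedClusterLeaseUpdateStopped) can be checked mechanically. The following is a minimal sketch, not part of this change: it assumes the ManagedCluster has already been fetched and parsed into a plain dictionary (for example from JSON output), and the function name `is_stuck_deleting` is hypothetical. Because the description does not say whether ManagedClusterLeaseUpdateStopped appears as the condition's type or its reason, the sketch accepts either.

```python
from datetime import datetime, timezone
from typing import Optional

# Value taken from the bug description above.
LEASE_STOPPED = "ManagedClusterLeaseUpdateStopped"


def is_stuck_deleting(managed_cluster: dict, now: Optional[datetime] = None) -> bool:
    """Return True if the object matches the stuck-deletion symptom.

    `managed_cluster` is assumed to be a ManagedCluster parsed into a dict.
    """
    now = now or datetime.now(timezone.utc)

    # An object that is not being deleted has no deletionTimestamp.
    timestamp = managed_cluster.get("metadata", {}).get("deletionTimestamp")
    if not timestamp:
        return False

    # Kubernetes timestamps are RFC 3339 with a trailing "Z" for UTC.
    deleted_at = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
    if deleted_at >= now:
        return False

    # Accept the marker as either the condition type or its reason (assumption).
    conditions = managed_cluster.get("status", {}).get("conditions", [])
    return any(
        LEASE_STOPPED in (c.get("type"), c.get("reason")) for c in conditions
    )
```

A check like this only flags candidates for investigation; it does not confirm that config-policy-controller-uninstall is the cause for a given cluster.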
Ref: https://issues.redhat.com/browse/ACM-4854
Signed-off-by: Yi Rae Kim <yikim@redhat.com>