Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: crash-looping config-policy-controller-uninstaller #131

Conversation

yiraeChristineKim
Copy link
Contributor

@yiraeChristineKim yiraeChristineKim commented May 8, 2023

Some HyperShift hosted clusters have been getting stuck in the uninstalling state (without transitioning to the error state), indicating a problem with ACM's deletion of the ManagedCluster object. For these clusters, we've noticed that the ManagedCluster object on the service/hub cluster has a DeletionTimestamp in the past and a status condition of ManagedClusterLeaseUpdateStopped. On the corresponding management cluster, we've noticed that the klusterlet-$CID namespace contains a pod named config-policy-controller-uninstall that's crash-looping with the following error.

Actual results:
ManagedCluster object stays in "Unknown" state with condition ManagedClusterLeaseUpdateStopped. config-policy-controller-uninstall is crashlooping. Corresponding HyperShift HostedCluster gets stuck in uninstalling state forever. Related non-ACM namespaces (e.g., ocm-staging-$CID-$CNAME) are never terminated (i.e., they remain "Active").

Expected results:
ManagedCluster object's finalizers complete without error. config-policy-controller-uninstall doesn't crash but completes as expected. klusterlet-$CID namespace is terminated. The remainder of the HyperShift destruction chain is allowed to continue as normal. HostedCluster is eventually fully uninstalled.

Ref: https://issues.redhat.com/browse/ACM-4854

Signed-off-by: Yi Rae Kim yikim@redhat.com

skipLoop = true
}

if discoveryErr == nil || cleanupImmediately {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if discoveryErr == nil || cleanupImmediately {
if !skiploop && (discoveryErr == nil || cleanupImmediately) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in cleanupImmediately() we have

	return false, fmt.Errorf(
		"isBeingUninstalled error: '%v', definitionIsDeleting error: '%v'", beingUninstalledErr, defErr,
	)

so... is it ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because the "happy path" returns with:

	if beingUninstalledErr == nil && defErr == nil {
		// if either was deleting, we would've already returned.
		return false, nil
	}

Signed-off-by: Yi Rae Kim <yikim@redhat.com>
@openshift-ci
Copy link

openshift-ci bot commented May 15, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mprahl, yiraeChristineKim

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [mprahl,yiraeChristineKim]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants