Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: attempt to delete processes from instance-managers in unknown state #3127

Merged

Conversation

ejweber
Copy link
Contributor

@ejweber ejweber commented Aug 30, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#6552

What this PR does / why we need it:

See longhorn/longhorn#6552 (comment) for context. It may be possible to delete engine and replica processes from an instance-manager even when the instance-manager's state is unknown. Doing so prevents the processes from becoming orphans if the instance-manager eventually recovers from the unknown state.

This PR is looking good in some local testing, but I want to put it through a few more paces before review.

@ejweber ejweber changed the title Best-effort attempt to delete processes from instance-managers in unknown state fix: attempt to delete processes from instance-managers in unknown state Aug 30, 2024
@derekbit
Copy link
Member

derekbit commented Sep 2, 2024

CI failures:

controller/engine_controller.go:2430:16: S1039: unnecessary use of fmt.Sprintf (gosimple)
		return true, fmt.Sprintf("the RWX volume is delinquent")
		             ^
controller/replica_controller.go:910:16: S1039: unnecessary use of fmt.Sprintf (gosimple)
		return true, fmt.Sprintf("the RWX volume is delinquent")

@ejweber ejweber force-pushed the 6552-try-to-stop-unknown-engines-on-cleanup branch 2 times, most recently from e27dc1b to f75f01e Compare September 3, 2024 18:22
@ejweber
Copy link
Contributor Author

ejweber commented Sep 3, 2024

To test (in contrast with longhorn/longhorn#6552 (comment)):

  1. Deploy Longhorn with these changes.
  2. Create and attach a volume that has the same number of replicas as nodes (e.g. with the UI).
  3. SSH to the node the volume is attached to.
  4. Stop the kubelet.
  5. Wait ~30s for Kubernetes to notice and for the instance-manager state to transition to unknown.
  6. Detach the volume.
  7. Verify that no related engine or replica processes exist in any instance-manager CR.
  8. Check longhorn-manager logs for some of the new messages (e.g. "Communicating with instance manager <>, state unknown, IP <>").

Rerun RWX fast failover tests from longhorn/longhorn#6205 (comment) to ensure no regression.

james-munson
james-munson previously approved these changes Sep 5, 2024
Copy link
Contributor

@james-munson james-munson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. The re-factor makes the logic clearer, too.

@ejweber
Copy link
Contributor Author

ejweber commented Sep 5, 2024

@james-munson raised a concern about the effect this might have on RWX fast failover. I will test it a bit before merging (and modify the test plan so QA also tests it eventually).

@ejweber
Copy link
Contributor Author

ejweber commented Sep 6, 2024

@james-munson, I ran case 1 from longhorn/longhorn#6205 (comment) a few times. The results were:

  • ShareManager up in 10 seconds, workload pods writing in 52 seconds and 57 seconds respectively.
  • ShareManager up in 9 seconds, workload pods writing in 53 and 58 seconds respectively.
  • ShareManager up in 7 seconds, workload pods writing in 52 seconds and 52 seconds respectively.

I think RWX fast failover is still working as expected.

Copy link
Contributor

@PhanLe1010 PhanLe1010 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Leave a styling comment but not strong about it. Feel free to ignore it if you prefer the current implementation

Thank you for the investigation and the fix

engineapi/instance_manager.go Outdated Show resolved Hide resolved
Longhorn 6552

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber force-pushed the 6552-try-to-stop-unknown-engines-on-cleanup branch from f75f01e to 25ca057 Compare September 10, 2024 20:38
Longhorn 6552

Signed-off-by: Eric Weber <eric.weber@suse.com>
Longhorn 6552

Signed-off-by: Eric Weber <eric.weber@suse.com>
@ejweber ejweber force-pushed the 6552-try-to-stop-unknown-engines-on-cleanup branch from 25ca057 to 5b534ef Compare September 10, 2024 22:56
@PhanLe1010 PhanLe1010 merged commit 1a1ab36 into longhorn:master Sep 11, 2024
7 of 8 checks passed
@ejweber
Copy link
Contributor Author

ejweber commented Sep 11, 2024

@mergify backport v1.6.x v1.5.x

Copy link

mergify bot commented Sep 11, 2024

backport v1.6.x v1.5.x

✅ Backports have been created

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants