fix: attempt to delete processes from instance-managers in unknown state #3127

ejweber · 2024-08-30T21:36:44Z

Which issue(s) this PR fixes:

What this PR does / why we need it:

See longhorn/longhorn#6552 (comment) for context. It may be possible to delete engine and replica processes from an instance-manager even when the instance-manager's state is unknown. Doing so prevents the processes from becoming orphans if the instance-manager eventually recovers from the unknown state.
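To illustrate the idea, here is a minimal sketch of a cleanup path that still tries to delete a process when the instance-manager state is unknown, and skips the attempt only when there is nothing to talk to. Every name in it (`InstanceManager`, `ProcessDeleter`, `shouldAttemptDelete`, `cleanupProcess`, and the state constants) is hypothetical and chosen for illustration; it is not the code in this PR or the longhorn-manager API.

```go
package main

import "fmt"

// State is a hypothetical stand-in for an instance-manager state.
type State string

const (
	StateRunning State = "running"
	StateUnknown State = "unknown"
	StateStopped State = "stopped"
)

// InstanceManager is a hypothetical, simplified view of an instance-manager.
type InstanceManager struct {
	Name  string
	State State
	IP    string
}

// ProcessDeleter abstracts the call that removes a process (hypothetical).
type ProcessDeleter func(im InstanceManager, process string) error

// shouldAttemptDelete reports whether a delete is worth trying. The
// assumption here is that an unknown instance-manager may still be
// reachable, so only a missing IP or a non-running, non-unknown state
// rules the attempt out.
func shouldAttemptDelete(im InstanceManager) bool {
	if im.IP == "" {
		return false
	}
	return im.State == StateRunning || im.State == StateUnknown
}

// cleanupProcess tries to delete the process and reports whether an
// attempt was made. Errors from the deleter are wrapped and returned so
// the caller can decide whether to retry.
func cleanupProcess(im InstanceManager, process string, del ProcessDeleter) (bool, error) {
	if !shouldAttemptDelete(im) {
		return false, nil
	}
	if err := del(im, process); err != nil {
		return true, fmt.Errorf("failed to delete process %s from instance manager %s (state %s): %w",
			process, im.Name, im.State, err)
	}
	return true, nil
}

func main() {
	// A fake deleter that only logs; a real one would call the instance-manager.
	del := func(im InstanceManager, process string) error {
		fmt.Printf("deleting %s from %s (state %s, IP %s)\n", process, im.Name, im.State, im.IP)
		return nil
	}
	im := InstanceManager{Name: "im-a", State: StateUnknown, IP: "10.0.0.1"}
	attempted, err := cleanupProcess(im, "vol-e-0", del)
	fmt.Println(attempted, err)
}
```

Returning the `attempted` flag separately from the error lets a caller distinguish "nothing to do" from "tried and failed", which matters if the instance-manager later recovers.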

This PR is looking good in some local testing, but I want to put it through a few more paces before review.

derekbit · 2024-09-02T01:45:52Z

CI failures:

controller/engine_controller.go:2430:16: S1039: unnecessary use of fmt.Sprintf (gosimple)
		return true, fmt.Sprintf("the RWX volume is delinquent")
		             ^
controller/replica_controller.go:910:16: S1039: unnecessary use of fmt.Sprintf (gosimple)
		return true, fmt.Sprintf("the RWX volume is delinquent")
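For reference, gosimple's S1039 fires when `fmt.Sprintf` is called with a constant string and no formatting arguments, so the usual fix is to return the string literal directly. The sketch below shows that change in isolation; the function name `delinquentReason` is hypothetical and only mirrors the shape of the two flagged return statements.

```go
package main

import "fmt"

// delinquentReason mirrors the flagged return statements (hypothetical name).
func delinquentReason() (bool, string) {
	// Before (flagged by S1039):
	//   return true, fmt.Sprintf("the RWX volume is delinquent")
	// After: no format verbs or arguments, so a plain literal is enough.
	return true, "the RWX volume is delinquent"
}

func main() {
	ok, reason := delinquentReason()
	fmt.Println(ok, reason) // prints "true the RWX volume is delinquent"
}
```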

ejweber · 2024-09-03T18:27:34Z

To test (in contrast with longhorn/longhorn#6552 (comment)):

1. Deploy Longhorn with these changes.
2. Create and attach a volume that has the same number of replicas as nodes (e.g. with the UI).
3. SSH to the node the volume is attached to.
4. Stop the kubelet.
5. Wait ~30s for Kubernetes to notice and for the instance-manager state to transition to unknown.
6. Detach the volume.
7. Verify that no related engine or replica processes exist in any instance-manager CR.
8. Check longhorn-manager logs for some of the new messages (e.g. "Communicating with instance manager <>, state unknown, IP <>").
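The verification step above (no related engine or replica processes left in any instance-manager CR) could be scripted roughly as follows. This is only a sketch over a hypothetical, flattened `instanceManager` struct, and it assumes process names of the form `<volume>-e-<suffix>` for engines and `<volume>-r-<suffix>` for replicas; check the actual CR fields and names in your cluster before relying on it.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// instanceManager is a hypothetical, flattened view of an instance-manager
// CR: its name plus the names of the processes it reports.
type instanceManager struct {
	Name      string
	Processes []string
}

// leftoverProcesses returns "<instance-manager>/<process>" for every process
// that looks like an engine or replica of the given volume. The "-e-" and
// "-r-" naming is an assumption made for this sketch.
func leftoverProcesses(ims []instanceManager, volume string) []string {
	var found []string
	enginePrefix := volume + "-e-"
	replicaPrefix := volume + "-r-"
	for _, im := range ims {
		for _, p := range im.Processes {
			if strings.HasPrefix(p, enginePrefix) || strings.HasPrefix(p, replicaPrefix) {
				found = append(found, im.Name+"/"+p)
			}
		}
	}
	sort.Strings(found)
	return found
}

func main() {
	// Example data only; in practice this would come from the cluster.
	ims := []instanceManager{
		{Name: "im-node1", Processes: []string{"vol1-e-0", "other-r-abc"}},
		{Name: "im-node2", Processes: []string{"vol1-r-def"}},
	}
	left := leftoverProcesses(ims, "vol1")
	if len(left) == 0 {
		fmt.Println("no leftover processes")
		return
	}
	fmt.Println("leftover processes:", left)
}
```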

Rerun RWX fast failover tests from longhorn/longhorn#6205 (comment) to ensure no regression.

james-munson

Looks good. The refactor makes the logic clearer, too.

ejweber · 2024-09-05T21:36:53Z

@james-munson raised a concern about the effect this might have on RWX fast failover. I will test it a bit before merging (and modify the test plan so QA also tests it eventually).

ejweber · 2024-09-06T14:53:07Z

@james-munson, I ran case 1 from longhorn/longhorn#6205 (comment) a few times. The results were:

ShareManager up in 10 seconds, workload pods writing in 52 seconds and 57 seconds respectively.
ShareManager up in 9 seconds, workload pods writing in 53 seconds and 58 seconds respectively.
ShareManager up in 7 seconds, workload pods writing in 52 seconds and 52 seconds respectively.

I think RWX fast failover is still working as expected.

PhanLe1010

LGTM. I left a styling comment, but I don't feel strongly about it. Feel free to ignore it if you prefer the current implementation.

Thank you for the investigation and the fix.

engineapi/instance_manager.go

Longhorn 6552 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber · 2024-09-11T15:11:26Z

@mergify backport v1.6.x v1.5.x

mergify · 2024-09-11T15:11:37Z

backport v1.6.x v1.5.x

✅ Backports have been created

#3154 fix: attempt to delete processes from instance-managers in unknown state (backport #3127) has been created for branch v1.6.x but encountered conflicts
#3155 fix: attempt to delete processes from instance-managers in unknown state (backport #3127) has been created for branch v1.5.x but encountered conflicts

ejweber changed the title ~~Best-effort attempt to delete processes from instance-managers in unknown state~~ fix: attempt to delete processes from instance-managers in unknown state Aug 30, 2024

ejweber force-pushed the 6552-try-to-stop-unknown-engines-on-cleanup branch 2 times, most recently from e27dc1b to f75f01e Compare September 3, 2024 18:22

ejweber marked this pull request as ready for review September 3, 2024 18:29

ejweber mentioned this pull request Sep 3, 2024

[BUG] kubectl drain node is blocked by unexpected orphan engine processes longhorn/longhorn#6552

Closed

james-munson previously approved these changes Sep 5, 2024


derekbit requested review from PhanLe1010 and c3y1huang September 9, 2024 12:15

PhanLe1010 reviewed Sep 10, 2024


engineapi/instance_manager.go (outdated, resolved)

fix: attempt to delete unknown engine instances on cleanup

ec3020b

Longhorn 6552 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber dismissed james-munson’s stale review via 25ca057 September 10, 2024 20:38

ejweber force-pushed the 6552-try-to-stop-unknown-engines-on-cleanup branch from f75f01e to 25ca057 Compare September 10, 2024 20:38

PhanLe1010 approved these changes Sep 10, 2024


ejweber added 2 commits September 10, 2024 17:56

fix: allow communication with instance managers in an unknown state

35dee93

Longhorn 6552 Signed-off-by: Eric Weber <eric.weber@suse.com>

fix: attempt to delete unknown replica instances on cleanup

5b534ef

Longhorn 6552 Signed-off-by: Eric Weber <eric.weber@suse.com>

ejweber force-pushed the 6552-try-to-stop-unknown-engines-on-cleanup branch from 25ca057 to 5b534ef Compare September 10, 2024 22:56

PhanLe1010 merged commit 1a1ab36 into longhorn:master Sep 11, 2024
7 of 8 checks passed

This was referenced Sep 11, 2024

fix: attempt to delete processes from instance-managers in unknown state (backport #3127) #3154

Merged

fix: attempt to delete processes from instance-managers in unknown state (backport #3127) #3155

Merged
