CSI: retry claims from client #12113
Conversation
Force-pushed a994c1c to 2b083b5
Force-pushed 2b083b5 to 3e53fa2
eh, some CSI test failures
When the alloc runner claims a volume, an allocation for a previous version of the job may still have the volume claimed because it's still shutting down. In this case we'll receive an error from the server. Retry this error until we succeed or until a very long timeout expires, to give operators a chance to recover broken plugins. Make the alloc runner hook tolerant of temporary RPC failures.
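For illustration, here is a minimal sketch of the retry shape described above. The `claimVolume` wrapper, the intervals, and the 24-hour deadline are assumptions for the sketch, not Nomad's actual code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// claimVolume stands in for the client-to-server CSI volume claim RPC.
// In the failing case, the server rejects the claim because a previous
// allocation still holds it while shutting down.
func claimVolume(ctx context.Context, volID string) error {
	return errors.New("volume is still claimed by a previous allocation")
}

// claimWithRetry retries the claim until it succeeds or a very long
// deadline expires, giving operators time to recover a broken plugin.
func claimWithRetry(ctx context.Context, volID string) error {
	ctx, cancel := context.WithTimeout(ctx, 24*time.Hour)
	defer cancel()

	backoff := time.Second
	for {
		err := claimVolume(ctx, volID)
		if err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("giving up on claim for %s: %w", volID, err)
		case <-time.After(backoff):
		}
		backoff *= 2 // exponential backoff...
		if backoff > time.Minute {
			backoff = time.Minute // ...capped so we keep polling steadily
		}
	}
}

func main() {
	// Short parent deadline just so the demo terminates quickly.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	fmt.Println(claimWithRetry(ctx, "vol-example"))
}
```

The long outer deadline is what gives operators room to repair a broken plugin before the claim is abandoned.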
Force-pushed 3e53fa2 to 4fb2347
LGTM
In #12112 and #12113 we solved the problem of races in releasing volume claims, but there was a case that we missed. During a node drain with a controller attach/detach, we can hit a race where we call controller publish before the unpublish has completed. This is discouraged in the spec but plugins are supposed to handle it safely. But if the storage provider's API is slow enough and the plugin doesn't handle the case safely, the volume can get "locked" into a state where the provider's API won't detach it cleanly. Check the claim before making any external controller publish RPC calls so that Nomad is responsible for the canonical information about whether a volume is currently claimed. This has a couple of side effects that also had to be fixed here:
* Changing the order means that the volume will have a past claim without a valid external node ID because it came from the client, and this uncovered a separate bug where we didn't assert the external node ID was valid before returning it. Fall through to getting the ID from the plugins in the state store in this case. We avoided this originally because of concerns around plugins getting lost during node drain, but now that we've fixed that we may want to revisit it in future work.
* We should make sure we're handling `FailedPrecondition` cases from the controller plugin the same way we handle other retryable cases.
* Several tests had to be updated because they were assuming we fail in a particular order that we no longer follow.
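As a rough sketch of the ordering described above (not Nomad's implementation; `volumeClaim`, `publishIfUnclaimed`, and `controllerPublish` are hypothetical names), the server consults its own claim record before calling out to the controller plugin, and `FailedPrecondition` is classified as retryable alongside other gRPC codes commonly treated as transient, which is an assumption made here for illustration:

```go
package main

import (
	"errors"
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// volumeClaim is a hypothetical record of who currently claims a volume.
type volumeClaim struct {
	AllocID        string
	ExternalNodeID string
}

// isRetryable reports whether a controller-plugin error should be retried.
func isRetryable(err error) bool {
	switch status.Code(err) {
	case codes.FailedPrecondition, codes.Unavailable, codes.Aborted, codes.ResourceExhausted:
		return true
	}
	return false
}

// publishIfUnclaimed checks the server's canonical claim state first, so we
// never call controller publish while a conflicting claim is still being
// released by the unpublish workflow.
func publishIfUnclaimed(existing *volumeClaim, allocID string,
	controllerPublish func() error) error {

	if existing != nil && existing.AllocID != allocID {
		// A previous claim is still present; let the caller retry after the
		// unpublish has finished instead of racing the provider's API.
		return errors.New("volume is claimed by another allocation, retry later")
	}
	if err := controllerPublish(); err != nil {
		if isRetryable(err) {
			return fmt.Errorf("retryable controller publish error: %w", err)
		}
		return err
	}
	return nil
}

func main() {
	err := publishIfUnclaimed(
		&volumeClaim{AllocID: "old-alloc"}, "new-alloc",
		func() error { return nil })
	fmt.Println(err)
}
```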
Fixes #8609 when combined with #12112
Fixes #11477
When the alloc runner claims a volume, an allocation for a previous
version of the job may still have the volume claimed because it's
still shutting down. In this case we'll receive an error from the
server. Retry this error until we succeed or until a very long timeout
expires, to give operators a chance to recover broken plugins.
Make the claim hook tolerate temporary RPC failures in general to
prevent errors during client restart or during leadership elections.
(Tests don't pass until #12112 is merged; I'll rebase this on that PR)
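A small sketch of the kind of error classification the claim hook could use to tolerate temporary RPC failures; the `errNoLeader` sentinel and the `isTemporaryRPCError` helper are hypothetical, not taken from Nomad:

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// errNoLeader is a hypothetical sentinel for "a leadership election is in
// progress", the kind of transient condition the hook should retry through.
var errNoLeader = errors.New("no cluster leader")

// isTemporaryRPCError reports whether a claim RPC error is worth retrying
// rather than failing the allocation outright.
func isTemporaryRPCError(err error) bool {
	var netErr net.Error
	switch {
	case errors.As(err, &netErr) && netErr.Timeout():
		return true // network timeout while the client reconnects after restart
	case errors.Is(err, errNoLeader):
		return true // server-side leadership election in progress
	default:
		return false
	}
}

func main() {
	fmt.Println(isTemporaryRPCError(errNoLeader)) // true
}
```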