alloc_runner: wait when starting suspicious allocs #6216
Conversation
This commit aims to help users running clients susceptible to the destroyed-alloc-being-restarted bug upgrade to the latest version. Without this, such users will have their tasks run unexpectedly on upgrade and only see the bug resolved after a subsequent restart.

If, on restore, the client sees a pending alloc without any other persisted info, it errs on the side of treating it as corrupt persisted alloc state rather than the client happening to be killed right when the alloc was assigned to it.

A few reasons motivate this behavior: Statistically speaking, corruption is the more likely cause. A long-running client has a higher chance of having allocs persisted incorrectly in the pending state, while being killed right when an alloc is about to start is relatively unlikely. Also, delaying the start of an alloc that hasn't started yet (by hopefully seconds) is not as severe as launching too many allocs that may bring the client down. More importantly, this helps customers upgrade their clients without risking taking their clients down and destabilizing their cluster. We don't want existing users to trigger the bug while they upgrade and restart the cluster.
Force-pushed from 52da09c to cbc521e.
The criteria for waiting on the server seem sufficiently narrow that this should never affect allocs that are actually running.

An alternative would be removing the alloc from the local state and returning a `!ok` or `skip` bool from `AR.Restore` to simply drop the alloc from the client. If the client should be running the alloc, it will be re-added once the servers are contacted anyway.

The main functional difference (ignoring code complexity and riskiness of the change) is that with this solution the client alloc APIs will return information about these pending allocs, whereas the client would have no knowledge of them if they're dropped.
Great suggestion; I like it way more. I updated the PR so that suspicious allocs get skipped without creating an alloc runner. This approach also avoids creating goroutines unnecessarily, which, depending on client state, may have their own undesired impact.
I have one question though about client status.
client/client.go (Outdated)

```go
// COMPAT(0.12): remove once upgrading from 0.9.5 is no longer supported
func (c *Client) isPotentiallyCompletedAlloc(alloc *structs.Allocation) bool {
	if alloc.ClientStatus != structs.AllocClientStatusPending {
```
Thinking about this more, I wonder if this should be checking if the client status is dead. It's possible for a completed alloc to have a non-pending client status in the following scenario:
- Server pushes alloc to client
- Client starts alloc
- Server updates alloc in a non-destructive way (e.g. updating reschedule policy, count, etc.)
- Client fetches new alloc with clientStatus=running
- Client completes, GCs, and destroys the alloc
- Client persists the alloc with clientStatus=running in periodic persistence
- Client restarts
Though, I'm a bit worried that this would increase the scope and impact a bit more. @schmichael opinion?
To clarify, the proposal is to make it:

```go
if alloc.ClientStatus == structs.AllocClientStatusComplete || alloc.ClientStatus == structs.AllocClientStatusFailed {
	return true
}
```
This is a good observation, but I think the fix is ignoring `alloc.ClientStatus` altogether in this check.

What we've identified is that a persisted alloc in the absence of other associated client state is useless. There's no client state for it: nothing to reattach to, nothing to GC, no local state whatsoever other than an orphaned alloc struct in the client's db.

So I think we just remove this `alloc.ClientStatus` check altogether and rely on the other state checks. If there's any other client state, there's something we can move forward with. If there's no other client state, we're better off dropping it and waiting on the server to tell us about it again if it should be running.

You could rename this method `hasLocalState` or similar to reflect that logic. I believe the persisted alloc is cached global/remote state, and in the absence of related local state, the cache entry should be ignored and purged.
a persisted alloc ... is useless
To be clear: it could be useful if not for the bug. This fix hurts our "edge compute" story, where clients are more likely to restart while disconnected and a local cache of allocs is useful.

However, correctness far outweighs better support for disconnected nodes, especially because a number of other behaviors are already similarly problematic for disconnected nodes.
If there's any other client state, there's something we can move forward with. If there's no other client state, we're better off dropping it and waiting on the server to tell us about it again if it should be running.
Excellent point. Adding it to the code comment.
Actually, a quick question: should we special-case system jobs here like we did with failed task restores?

I'm inclined not to special-case it. Unless the alloc was GCed, we should expect some task state to be persisted, unless super bad things happened.
Yeah, I think your inclination is correct: do not special case system jobs.
This uses an alternative approach where we avoid restoring the alloc runner in the first place if we suspect that the alloc may have been completed already.
Force-pushed from f6b9e4a to 493945a.
The ClientState being pending isn't a good criterion, as an alloc may have been updated in place before it was completed. Also updated the logic so we only check for task states: if an alloc has deployment state but no persisted tasks at all, restore will still fail.
TestClient_RestoreError is very slow, taking ~81 seconds, and has a few problematic patterns. It's unclear what it tests: it simulates a failure condition where all state DB lookups fail and asserts that the alloc fails. Though starting from #6216, we don't fail allocs in that condition but rather restart them. Also, the drivers used in the second client `c2` are the same singleton instances used in `c1` and are already shut down. We ought to start healthy new driver instances.
This is a follow up to #6207
Fixes #5984 (comment)