alloc_runner: wait when starting suspicious allocs #6216
Conversation
This commit aims to help users running clients susceptible to the destroyed-alloc-being-restarted bug upgrade to the latest version. Without this, such users will have their tasks run unexpectedly on upgrade and only see the bug resolved after a subsequent restart.

If, on restore, the client sees a pending alloc without any other persisted info, it errs on the side of treating it as corrupt persisted alloc state rather than the client happening to be killed right when the alloc was assigned to it.

A few reasons motivate this behavior: Statistically speaking, corruption is the more likely cause. A long-running client has a higher chance of having allocs persisted incorrectly in the pending state, while being killed right when an alloc is about to start is relatively unlikely. Also, delaying the start of an alloc that hasn't started yet (by hopefully seconds) is not as severe as launching too many allocs that may bring the client down. More importantly, this helps customers upgrade their clients without risking taking their clients down and destabilizing their cluster. We don't want existing users to trigger the bug while they upgrade and restart the cluster.
Force-pushed from 52da09c to cbc521e.
The criteria for waiting on the server seem sufficiently narrow that this should never affect allocs that are actually running.

An alternative would be removing the alloc from the local state and returning a `!ok` or `skip` bool from `AR.Restore` to simply drop the alloc from the client. If the client should be running the alloc, it will be re-added once the servers are contacted anyway.

The main functional difference (ignoring code complexity and riskiness of the change) is that with this solution the client alloc APIs will return information about these pending allocs, whereas the client would have no knowledge of them if they're dropped.
Great suggestion; I like it way more. I updated the PR so that suspicious allocs get skipped without creating an alloc runner. This approach also avoids creating goroutines unnecessarily, which, depending on client state, may have their own undesired impact.
I have one question though about client status.
client/client.go (Outdated)

```go
// COMPAT(0.12): remove once upgrading from 0.9.5 is no longer supported
func (c *Client) isPotentiallyCompletedAlloc(alloc *structs.Allocation) bool {
	if alloc.ClientStatus != structs.AllocClientStatusPending {
```
Thinking about this more, I wonder if this should be checking if the client status is dead. It's possible for a completed alloc to have a non-pending client status in the following scenario:
- Server pushes alloc to client
- Client starts alloc
- Server updates alloc in a non-destructive way (e.g. updating reschedule policy, count, etc.)
- Client fetches new alloc with clientStatus=running
- Client completes, GCs, and destroys the alloc
- Client persists the alloc with clientStatus=running in periodic persistence
- Client restarts
Though, I'm a bit worried that this would increase the scope and impact a bit more. @schmichael opinion?
To clarify, the proposal is to make it:

```go
if alloc.ClientStatus == structs.AllocClientStatusComplete || alloc.ClientStatus == structs.AllocClientStatusFailed {
	return true
}
```
This is a good observation, but I think the fix is ignoring `alloc.ClientStatus` altogether in this check.

What we've identified is that a persisted alloc in the absence of other associated client state is useless. There's no client state for it: nothing to reattach to, nothing to GC, no local state whatsoever other than an orphaned alloc struct in the client's db.

So I think we just remove this `alloc.ClientStatus` check altogether and rely on the other state checks. If there's any other client state, there's something we can move forward with. If there's no other client state, we're better off dropping it and waiting on the server to tell us about it again if it should be running.

You could rename this method `hasLocalState` or similar to reflect that logic. I believe the persisted alloc is cached global/remote state, and in the absence of related local state, the cache entry should be ignored and purged.
a persisted alloc ... is useless
To be clear: it could be useful if not for the bug. This fix hurts our "edge compute" story, where clients are more likely to restart while disconnected and a local cache of allocs is useful.

However, correctness far outweighs better support for disconnected nodes, especially because a number of other behaviors are already similarly problematic for disconnected nodes.
If there's any other client state, there's something we can move forward with. If there's no other client state, we're better off dropping it and waiting on the server to tell us about it again if it should be running.
Excellent point. Adding it to the code comment.
Actually, a quick question: should we special-case system jobs here like we did with failed task restores?

I'm inclined not to special-case it. Unless the alloc was GCed, we should expect some task state to be persisted, unless super bad things happened.
Yeah, I think your inclination is correct: do not special case system jobs.
This uses an alternative approach where we avoid restoring the alloc runner in the first place if we suspect that the alloc may have been completed already.
Force-pushed from f6b9e4a to 493945a.
The ClientState being pending isn't a good criterion, as an alloc may have been updated in place before it was completed. Also updated the logic so we only check for task states: if an alloc has deployment state but no persisted tasks at all, restore will still fail.
TestClient_RestoreError is very slow, taking ~81 seconds, and has a few problematic patterns. It's unclear what it tests: it simulates a failure condition where all state DB lookups fail and asserts that the alloc fails. Though starting from #6216, we don't fail allocs in that condition but rather restart them. Also, the drivers used in the second client `c2` are the same singleton instances used in `c1` and are already shut down. We ought to start healthy new driver instances.
This is a follow up to #6207
Fixes #5984 (comment)