Don't persist allocs of destroyed alloc runners #6207

notnoop · 2019-08-25T15:33:17Z

This fixes a bug where allocs that have been GCed get re-run after the client
is restarted. A heavily used client may launch thousands of allocs on startup
and get killed.

The bug is that an alloc runner that gets destroyed due to GC remains in the
client's alloc runner set. Its alloc keeps getting persisted periodically until
the alloc is GCed by the server. During that time, the client DB contains the
alloc but neither its individual task statuses nor its completed state. On
client restart, the client assumes the alloc is in the pending state and
re-runs it.
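
To make the failure mode concrete, here is a minimal sketch of the sequence described above. It is not Nomad's code: `stateDB`, `restoreStatus`, and the status strings are hypothetical stand-ins, and the real client DB is modeled as two in-memory maps.

```go
package main

import "fmt"

// stateDB is a hypothetical in-memory stand-in for the client state DB.
type stateDB struct {
	allocs     map[string]bool   // alloc IDs that have a persisted alloc entry
	taskStates map[string]string // alloc ID -> persisted task state, if any
}

// restoreStatus mimics the restart decision described above: an alloc entry
// with no persisted task state looks like an alloc that never started.
func restoreStatus(db *stateDB, allocID string) string {
	if !db.allocs[allocID] {
		return "unknown"
	}
	if state, ok := db.taskStates[allocID]; ok {
		return state
	}
	return "pending"
}

func main() {
	db := &stateDB{allocs: map[string]bool{}, taskStates: map[string]string{}}

	// The alloc ran to completion and was persisted with its task state.
	db.allocs["a1"] = true
	db.taskStates["a1"] = "dead"

	// Client GC deletes the alloc entry and its task state together.
	delete(db.allocs, "a1")
	delete(db.taskStates, "a1")

	// Bug: the destroyed runner is still in the runner set, so a periodic
	// persist writes the alloc entry back without any task state.
	db.allocs["a1"] = true

	// On restart the alloc now looks pending and would be re-run.
	fmt.Println(restoreStatus(db, "a1"))
}
```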

Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.

This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information non-transactionally and
non-atomically, while the alloc runner is concurrently running and potentially
changing state, is a recipe for bugs.

Fixes #5984
Related to #5890

schmichael

On client restart, client assumes that alloc is pending state and re-runs it.

That line from the description seems like the much bigger bug. Perhaps we should just remove the DeleteTaskBucket calls altogether, so they're only deleted as part of the atomic alloc bucket deletion.

That way if the agent crashes mid-GC, the alloc state should be terminal when the agent restarts. The partially GC'd alloc would not be restarted and again be eligible for GC to finish what the previous run started.

Update: Talked with Mahmood and good news! We already atomically delete the alloc+task buckets. The problem is that before this PR we could resurrect just the alloc and get into this state. This PR prevents an alloc from being stored after being GC'd, which prevents agent restarts from thinking it's pending.

client/allocrunner/alloc_runner.go

Protect against a race between the destroy and persist-state goroutines. The downside is that the database I/O operation runs while holding the lock and may run indefinitely. The risk of holding the lock for a long time is slow destruction, but if I/O is that slow, we have bigger problems.
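
As a rough illustration of that trade-off, the sketch below guards a `destroyed` flag with a mutex and keeps the lock held for the duration of the write. The names `memDB`, `runner`, `persistState`, and `destroy` are hypothetical, and the in-memory map stands in for the real state DB; this is a sketch under those assumptions, not the code in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// memDB is a hypothetical in-memory stand-in for the client state DB.
type memDB struct {
	mu     sync.Mutex
	allocs map[string]struct{}
}

func (d *memDB) put(id string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.allocs[id] = struct{}{}
}

func (d *memDB) remove(id string) {
	d.mu.Lock()
	defer d.mu.Unlock()
	delete(d.allocs, id)
}

func (d *memDB) has(id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	_, ok := d.allocs[id]
	return ok
}

// runner is a simplified, hypothetical alloc runner.
type runner struct {
	allocID string
	db      *memDB

	destroyedLock sync.Mutex
	destroyed     bool
}

// persistState writes the alloc unless the runner has been destroyed.
// The lock is held across the DB write so that a write which has already
// started always finishes before destroy can mark the runner destroyed.
func (r *runner) persistState() bool {
	r.destroyedLock.Lock()
	defer r.destroyedLock.Unlock()
	if r.destroyed {
		return false
	}
	r.db.put(r.allocID)
	return true
}

// destroy marks the runner destroyed, then deletes its persisted state.
// Any persist that won the lock first has completed before the delete runs.
func (r *runner) destroy() {
	r.destroyedLock.Lock()
	r.destroyed = true
	r.destroyedLock.Unlock()
	r.db.remove(r.allocID)
}

func main() {
	db := &memDB{allocs: map[string]struct{}{}}
	r := &runner{allocID: "a1", db: db}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); r.persistState() }()
	go func() { defer wg.Done(); r.destroy() }()
	wg.Wait()

	// Whichever goroutine wins the lock, the alloc is not left in the DB.
	fmt.Println(db.has("a1"))
}
```

Releasing the lock before the write would reopen the window: a persist could pass the `destroyed` check, lose the CPU, and write the alloc back after the delete.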

Don't persist allocs of destroyed alloc runners

github-actions · 2023-02-03T02:18:40Z

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

notnoop requested a review from schmichael August 25, 2019 15:33

notnoop mentioned this pull request Aug 25, 2019

Runaway nomad process after Nomad client reboot #5984

Closed

schmichael reviewed Aug 26, 2019

schmichael approved these changes Aug 26, 2019

notnoop merged commit f616370 into master Aug 26, 2019

notnoop deleted the b-gc-destroyed-allocs-rerun branch August 26, 2019 21:26

This was referenced Aug 27, 2019

alloc_runner: wait when starting suspicious allocs #6216

Merged

Constraint/count is not respected after Nomad cluster restart (previously failed allocs) #5921

Open

notnoop pushed a commit that referenced this pull request Sep 18, 2019

Merge pull request #6207 from hashicorp/b-gc-destroyed-allocs-rerun

6b25c05

Don't persist allocs of destroyed alloc runners

cgbaker mentioned this pull request Oct 8, 2019

[BUG] All stopped batch jobs restart and Docker daemon enter in stuck state after clients is restarted. #6438

Closed

github-actions bot locked as resolved and limited conversation to collaborators Feb 3, 2023
