-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't persist allocs of destroyed alloc runners #6207
Conversation
This fixes a bug where allocs that have been GCed get re-run again after client is restarted. A heavily-used client may launch thousands of allocs on startup and get killed. The bug is that an alloc runner that gets destroyed due to GC remains in client alloc runner set. Periodically, they get persisted until alloc is gced by server. During that time, the client db will contain the alloc but not its individual tasks status nor completed state. On client restart, client assumes that alloc is pending state and re-runs it. Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc to the state DB. This is a short-term fix, as we should consider revamping client state management. Storing alloc and task information in non-transaction non-atomic concurrently while alloc runner is running and potentially changing state is a recipe for bugs. Fixes #5984 Related to #5890
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On client restart, client assumes that alloc is pending state and re-runs it.
That line from the description seems like the much bigger bug. Perhaps we should just remove the DeleteTaskBucket calls altogether, so they're only deleted as part of the atomic alloc bucket deletion.
That way if the agent crashes mid-GC, the alloc state should be terminal when the agent restarts. The partially GC'd alloc would not be restarted and again be eligible for GC to finish what the previous run started.
Update: Talked with Mahmood and good news! We already atomically delete the alloc+task buckets. The problem is that before this PR we could resurrect just the alloc and get into this state. This PR prevents an alloc being stored after being GC'd which prevents agent restarts from thinking its pending.
Protect against a race where destroying and persist state goroutines race. The downside is that the database io operation will run while holding the lock and may run indefinitely. The risk of lock being long held is slow destruction, but slow io has bigger problems.
Don't persist allocs of destroyed alloc runners
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
This fixes a bug where allocs that have been GCed get re-run again after client
is restarted. A heavily-used client may launch thousands of allocs on startup
and get killed.
The bug is that an alloc runner that gets destroyed due to GC remains in
client alloc runner set. Periodically, they get persisted until alloc is
gced by server. During that time, the client db will contain the alloc
but not its individual tasks status nor completed state. On client restart,
client assumes that alloc is pending state and re-runs it.
Here, we fix it by ensuring that destroyed alloc runners don't persist any alloc
to the state DB.
This is a short-term fix, as we should consider revamping client state
management. Storing alloc and task information in non-transaction non-atomic
concurrently while alloc runner is running and potentially changing state is a
recipe for bugs.
Fixes #5984
Related to #5890