fix(v2): avoid starvation of failed jobs in the compaction scheduler #3732
I've been testing various failure modes and discovered that it is still possible for a job that has already been reassigned to starve and remain in the queue (this time, after the first unsuccessful retry). The reason is that we give up too early when traversing the queue.
In the chart, you can see that individual reassignments are handled as expected. However, the scheduler fails to process jobs after a subsequent failure: they get stuck in the `reassigned` status, neither moving to `failed` nor being retried until the error threshold is exceeded.

The fix is quite straightforward: inspect all the jobs eligible for reassignment. This, however, requires protection against the case where all jobs are getting reassigned (e.g., due to a bug or unavailability of the object store).
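
To make the idea concrete, here is a minimal sketch of a full-queue traversal with a per-pass reassignment budget. It is not the code from this PR: `Job`, `Limits`, `reassignExpired`, the status constants, and the integer timestamps are all hypothetical names chosen for illustration, and the real scheduler may enforce the guard differently.

```go
package main

import "fmt"

// Status is a hypothetical job state. The names mirror the statuses
// mentioned above, not the scheduler's actual types.
type Status int

const (
	StatusInProgress Status = iota
	StatusReassigned
	StatusFailed
)

// Job is a hypothetical queue entry.
type Job struct {
	Name           string
	Status         Status
	Failures       int
	LeaseExpiresAt int64 // arbitrary time units; the lease of the current worker
}

// Limits groups the assumed tuning knobs.
type Limits struct {
	MaxFailures   int   // failures after which a job is marked failed
	MaxReassign   int   // reassignments allowed in a single pass
	LeaseDuration int64 // lease granted to the next worker
}

// reassignExpired inspects every job in the queue instead of stopping at
// the first one it cannot handle. A job with an expired lease is marked
// failed once it reaches the failure threshold; otherwise it is reassigned,
// but only while the per-pass budget lasts. The budget is the guard against
// reassigning the whole queue at once.
func reassignExpired(queue []*Job, now int64, lim Limits) int {
	reassigned := 0
	for _, j := range queue {
		if j.Status == StatusFailed || j.LeaseExpiresAt > now {
			continue // not eligible; keep scanning rather than giving up
		}
		if j.Failures+1 >= lim.MaxFailures {
			j.Failures++
			j.Status = StatusFailed
			continue
		}
		if reassigned >= lim.MaxReassign {
			continue // budget spent; the job is picked up on a later pass
		}
		j.Failures++
		j.Status = StatusReassigned
		j.LeaseExpiresAt = now + lim.LeaseDuration
		reassigned++
	}
	return reassigned
}

func main() {
	queue := []*Job{
		{Name: "a", Status: StatusReassigned, Failures: 1, LeaseExpiresAt: 5},
		{Name: "b", Status: StatusInProgress, Failures: 0, LeaseExpiresAt: 100},
		{Name: "c", Status: StatusReassigned, Failures: 2, LeaseExpiresAt: 5},
		{Name: "d", Status: StatusInProgress, Failures: 0, LeaseExpiresAt: 5},
	}
	lim := Limits{MaxFailures: 3, MaxReassign: 2, LeaseDuration: 50}
	n := reassignExpired(queue, 10, lim)
	fmt.Println("reassigned in this pass:", n)
	for _, j := range queue {
		fmt.Println(j.Name, j.Status, j.Failures)
	}
}
```

In this sketch, a job that has already been reassigned once is treated like any other job with an expired lease, so it is either retried again or moved to the failed state. Jobs skipped because the budget ran out keep their expired lease and remain eligible on the next pass.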