fix(v2): avoid starvation of failed jobs in the compaction scheduler #3732

kolesnikovae · 2024-11-30T07:59:50Z

I've been testing various failure modes and discovered that it is still possible for a job that has already been reassigned to starve and remain in the queue (this time, after the first unsuccessful retry). The reason is that we give up too early when traversing the queue.

In the chart, you can see that individual reassignments are handled as expected. After a subsequent failure, however, the scheduler fails to process the jobs: they get stuck in the reassigned status, neither moving to failed nor being retried until the error threshold is exceeded.

The fix is quite straightforward: inspect all the jobs eligible for reassignment. This, however, requires protection against the case where all jobs are getting reassigned (e.g., due to a bug or unavailability of the object store).

simonswine

LGTM.

fix(v2): avoid starvation of failed jobs in the compaction scheduler

0736088

kolesnikovae force-pushed the fix/scheduler-job-assignment branch from a53a3c5 to 0736088 Compare November 30, 2024 09:44

kolesnikovae marked this pull request as ready for review December 2, 2024 03:05

kolesnikovae requested a review from a team as a code owner December 2, 2024 03:05

simonswine approved these changes Dec 2, 2024

kolesnikovae merged commit fba0faa into main Dec 3, 2024
18 checks passed

kolesnikovae deleted the fix/scheduler-job-assignment branch December 3, 2024 03:55
