fix(v2): avoid starvation of failed jobs in the compaction scheduler #3732

Merged on Dec 3, 2024 (1 commit)

@kolesnikovae (Collaborator) commented on Nov 30, 2024

I've been testing various failure modes and discovered that it is still possible for a job that has already been reassigned to starve and remain in the queue (this time, after the first unsuccessful retry). The reason is that we give up too early when traversing the queue:

image

In the chart, you can see that individual reassignments are handled as expected. However, the scheduler fails to process jobs after a subsequent failure: they get stuck in the reassigned status, neither moving to failed nor being retried until the error threshold is exceeded.

The fix is quite straightforward: inspect all the jobs eligible for reassignment. This, however, requires protection against the case where all jobs are getting reassigned (e.g., due to a bug or unavailability of the object store).
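To illustrate the idea, here is a minimal sketch and not the code from this PR. It assumes a hypothetical `Job` type with a lease expiry timestamp and a hypothetical `pickForReassignment` helper. The loop inspects every job instead of stopping at the first one it cannot take, and an assumed `limit` parameter stands in for the protection against reassigning everything at once.

```go
package main

// Job is a hypothetical queue entry; the names here are illustrative
// assumptions, not the scheduler's actual types.
type Job struct {
	Name           string
	LeaseExpiresAt int64 // Unix seconds; the lease counts as expired once now >= this value
}

// pickForReassignment scans the whole queue and returns the jobs whose lease
// has expired, in queue order. It does not assume the queue is sorted, so a
// job with a live lease is skipped rather than ending the traversal.
// The limit caps how many jobs are returned per call, so that a mass failure
// (for example, an unavailable object store) cannot reassign every job at once.
func pickForReassignment(queue []Job, now int64, limit int) []Job {
	var picked []Job
	for _, j := range queue {
		if len(picked) >= limit {
			break // protection: stop once the per-call cap is reached
		}
		if j.LeaseExpiresAt > now {
			continue // lease still held: skip this job, keep scanning
		}
		picked = append(picked, j)
	}
	return picked
}
```

With a queue of four jobs where only the second still holds its lease, a call with `limit` set to 2 returns the first and third jobs, and a later call can pick up the fourth. How the real scheduler orders the queue and sizes the cap is not shown in this conversation.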

image

@kolesnikovae force-pushed the fix/scheduler-job-assignment branch from a53a3c5 to 0736088 on November 30, 2024 09:44
@kolesnikovae marked this pull request as ready for review on December 2, 2024 03:05
@kolesnikovae requested a review from a team as a code owner on December 2, 2024 03:05
@simonswine (Contributor) left a comment

This looks good to me. LGTM.

@kolesnikovae merged commit fba0faa into main on Dec 3, 2024
18 checks passed
@kolesnikovae deleted the fix/scheduler-job-assignment branch on December 3, 2024 03:55