
Use (hopefully) fairer job processing order #617

Merged Jul 14, 2023 · 2 commits merged into main from evansd/job-scheduling
Conversation

@evansd (Contributor) commented Jul 13, 2023

Instead of randomising, we now order jobs by how many jobs are already running in that job's workspace, and then by age.

There are many things other than workspace we could partition on (e.g. user, project, organisation) but workspace is directly available on the job object, is easy to explain, and doesn't have the effect of penalising users who have to work on several different things.

Note that this doesn't prevent a large job request from grabbing all available slots if nothing else is running when it is submitted. But it does mean that the next slot that becomes free will go to a job from a different workspace.

Because the number of running jobs will change as we work our way through the set of active jobs we have to update the counts and re-sort the jobs on each iteration. This is not the most beautiful code, but I think it does the job.

In order for this to work correctly, we need job processing to happen in two phases. First, check up on all the running jobs, see if any have finished, and establish how much capacity we have. Then distribute this capacity fairly among the pending jobs. We achieve this by sorting first on status == RUNNING. This ensures that we handle all running jobs before we handle any pending jobs.
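The ordering described above can be sketched as below. This is a minimal, hypothetical illustration, not the project's actual code: the `Job` and `State` classes, the `created_at` field, and the `job_order` function are all illustrative names.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PENDING = "pending"


@dataclass
class Job:
    workspace: str
    state: State
    created_at: int


def job_order(jobs):
    """Yield active jobs in processing order, re-sorting after each one so
    that the per-workspace running counts stay current."""
    running_for_workspace = defaultdict(int)
    remaining = list(jobs)
    while remaining:
        remaining.sort(
            key=lambda job: (
                job.state != State.RUNNING,            # running jobs first
                running_for_workspace[job.workspace],  # fewest running first
                job.created_at,                        # then oldest first
            )
        )
        job = remaining.pop(0)
        # Build up the counts as we go, so pending jobs from busy
        # workspaces sink down the order on later iterations
        if job.state == State.RUNNING:
            running_for_workspace[job.workspace] += 1
        yield job
```

With a running job in workspace `w1`, an older pending job also in `w1`, and a newer pending job in `w2`, the `w2` job is handled before the older `w1` job, which is the fairness property the PR is after.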

@evansd evansd marked this pull request as draft July 13, 2023 12:11
@evansd (Contributor, Author) commented Jul 13, 2023

Update: I've realised that we don't need two separate loops to deal with this; we just need to sort the jobs so that we handle all the running ones before the pending ones. This actually simplifies the code overall.

Original comment below.


Converting back to draft because I don't think this fixes the issue. The problem is that slots become available as we work our way through the list and finalise completed jobs. So, whenever a job completes, the slot will go to the next ready-to-run job that comes after it – which will almost certainly not be the highest priority job.

I think the only way to deal with this is to do two separate loops, one after the other: in the first we check up on all the running jobs and finalise them if necessary; and in the second we check up on all the pending jobs and start them if necessary. So slots become available during the first loop and are consumed in the second loop. It's only that second loop that needs to be in priority order.
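For comparison, the two-loop approach considered here could be sketched as follows. This is an illustrative sketch only: `Job`, `State`, `MAX_SLOTS`, and `handle_all_jobs` are assumed names, not the project's real API, and the "finished" check stands in for polling the container runtime.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PENDING = "pending"
    SUCCEEDED = "succeeded"


@dataclass
class Job:
    workspace: str
    state: State
    finished: bool = False  # would be reported by the container runtime


MAX_SLOTS = 2  # illustrative concurrency limit


def handle_all_jobs(active_jobs):
    running = [j for j in active_jobs if j.state == State.RUNNING]
    pending = [j for j in active_jobs if j.state == State.PENDING]

    # Loop 1: finalise running jobs that have finished, freeing their slots
    for job in running:
        if job.finished:
            job.state = State.SUCCEEDED

    # Loop 2: consume the freed capacity, starting pending jobs in
    # priority order (here simply list order)
    used = sum(1 for j in active_jobs if j.state == State.RUNNING)
    for job in pending:
        if used >= MAX_SLOTS:
            break
        job.state = State.RUNNING
        used += 1
```

The final version of the PR collapses these two phases into a single loop by sorting on `status == RUNNING` first, as the update above explains.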

Instead of randomising, we now order jobs by how many jobs are already
running in that job's workspace, and then by age.
@evansd evansd marked this pull request as ready for review July 14, 2023 12:09
@bloodearnest (Member)
Nice. It's kinda like hacking the two separate loops idea into the sort function.

Extra +1.

Comment on lines +97 to +98
if job.state == State.RUNNING:
running_for_workspace[job.workspace] += 1
Contributor:
should this be if job.state == State.RUNNING and job_previous_state != State.RUNNING?

@evansd (Contributor, Author):

Good question, but I think the answer's no – and this is one of the simplifications over the previous version. We don't count running jobs at all at the start; we just build up the count as we go along. That means the count isn't accurate during the initial phase while we're processing running jobs, but that's fine. By the time we get to processing the pending jobs it will be accurate.

And because we only process each job once we never risk double-counting. So it makes the logic here much simpler.

Contributor:

yep yep, I still had the previous version of this in mind, this is correct.

@evansd evansd merged commit 9acc49a into main Jul 14, 2023
@evansd evansd deleted the evansd/job-scheduling branch July 14, 2023 13:10