support job priority #1943
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #1943      +/-   ##
==========================================
- Coverage   92.76%   92.66%   -0.10%
==========================================
  Files          41       41
  Lines        6991     7106     +115
  Branches     1129     1158      +29
==========================================
+ Hits         6485     6585     +100
- Misses        370      382      +12
- Partials      136      139       +3
```
Flags with carried forward coverage won't be shown.
```python
# Track suspended jobs; resume those that have been suspended
# longer than the configured maximum.
if job.is_suspended:
    has_suspended = True
    suspended_at = job.status_history.current.transition_time
    if now - suspended_at >= self._config.max_suspended_time:
        ...  # resume logic follows; truncated in the quoted diff
```
Hmm, it looks like you've changed the logic a bit here.
Previously, we resumed suspended jobs only if there were no waiting jobs (so there might be enough resources to fit an additional job). That is sub-optimal: a waiting job may require a GPU while a suspended job needs only CPU, and the cluster may have enough CPU for it. To partially overcome this, we also resume jobs that have been suspended for too long.
So SUSPENDED here is a "long" state: a job can stay in it for hours.
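Roughly, the old resume policy was something like this (a sketch with illustrative names, not the actual platform-api code):

```python
def should_resume(job, waiting_jobs, now, max_suspended_time):
    suspended_at = job.status_history.current.transition_time
    # Resume when nothing is waiting (capacity is likely free), or
    # when the job has been suspended too long (anti-starvation).
    return not waiting_jobs or (now - suspended_at >= max_suspended_time)
```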
But you've changed the max_suspended_time default to 1 minute 30 seconds, and you also no longer schedule new jobs while there is at least a single SUSPENDED job. So I guess you want SUSPENDED to be a "short" state; otherwise it will block new jobs.
My guess is that your idea is to rely on the k8s internal queue instead of our own queue of suspended jobs as much as possible. I think that will not work properly in the following case:
You have a cluster with 0.2 CPU and 4 jobs, 3 × 0.1 CPU and 1 × 0.2 CPU, all with the same priority. If we materialize the suspended jobs almost instantly, the 0.2 CPU job will never run, because the small jobs keep re-consuming the freed capacity before 0.2 CPU is ever free at once.
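To make the starvation concrete, here is a toy simulation of that case (hypothetical names and loop, not the real scheduler): the three 0.1 CPU jobs cycle through the freed capacity, so 0.2 CPU is never free at once and the big job starves.

```python
from collections import deque

CLUSTER_CPU = 0.2
small = deque(["small-1", "small-2", "small-3"])  # 0.1 CPU each
running: list[str] = []
big_started = False

for tick in range(6):
    # One running small job is suspended and re-queued.
    if running:
        small.append(running.pop(0))
    # Suspended jobs are resumed "almost instantly": the small jobs
    # fit into the freed capacity first and immediately re-consume it.
    while small and (len(running) + 1) * 0.1 <= CLUSTER_CPU + 1e-9:
        running.append(small.popleft())
    free = CLUSTER_CPU - len(running) * 0.1
    if free >= 0.2 - 1e-9:
        big_started = True  # the 0.2 CPU job could finally start
    print(f"tick {tick}: running={running} free={free:.1f}")

print("0.2 CPU job ever started:", big_started)  # -> False (starved)
```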
force-pushed from 00093a5 to 5d73c75
force-pushed from 7db4644 to 16d1b87
Looks great, thank you!