support job priority #1943
Codecov Report
```diff
@@            Coverage Diff             @@
##           master    #1943      +/-   ##
==========================================
- Coverage   92.76%   92.66%   -0.10%
==========================================
  Files          41       41
  Lines        6991     7106     +115
  Branches     1129     1158      +29
==========================================
+ Hits         6485     6585     +100
- Misses        370      382      +12
- Partials      136      139       +3
```
Flags with carried forward coverage won't be shown.
```python
# Track suspended jobs; resume those that have been suspended
# longer than the configured maximum.
if job.is_suspended:
    has_suspended = True
    suspended_at = job.status_history.current.transition_time
    if now - suspended_at >= self._config.max_suspended_time:
        ...  # resume logic follows; truncated in the quoted diff
```
Hmm, it looks like you've changed the logic a bit here.
Previously, we resumed suspended jobs only if there were no waiting jobs (so there might be enough resources to fit an additional job). That is sub-optimal: a waiting job may require a GPU while a suspended job needs only CPU, and the cluster may have enough CPU for it. To partially overcome this, we also resume jobs that have been suspended for too long.
So SUSPENDED here is a "long" state: a job can stay in it for hours.
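Roughly, the old resume policy was something like this (a sketch with illustrative names, not the actual platform-api code):

```python
def should_resume(job, waiting_jobs, now, max_suspended_time):
    suspended_at = job.status_history.current.transition_time
    # Resume when nothing is waiting (capacity is likely free), or
    # when the job has been suspended too long (anti-starvation).
    return not waiting_jobs or (now - suspended_at >= max_suspended_time)
```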
But you've changed the max_suspended_time default to 1 minute 30 seconds, and you also no longer schedule new jobs while there is at least a single SUSPENDED job. So I guess you want SUSPENDED to be a "short" state; otherwise it will block new jobs.
My guess is that your idea is to rely on the k8s internal queue instead of our own queue of suspended jobs as much as possible. I think that will not work properly in the following case:
You have a cluster with 0.2 CPU and 4 jobs, 3 × 0.1 CPU and 1 × 0.2 CPU, all with the same priority. If we materialize the suspended jobs almost instantly, the 0.2 CPU job will never run, because the small jobs keep re-consuming the freed capacity before 0.2 CPU is ever free at once.
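To make the starvation concrete, here is a toy simulation of that case (hypothetical names and loop, not the real scheduler): the three 0.1 CPU jobs cycle through the freed capacity, so 0.2 CPU is never free at once and the big job starves.

```python
from collections import deque

CLUSTER_CPU = 0.2
small = deque(["small-1", "small-2", "small-3"])  # 0.1 CPU each
running: list[str] = []
big_started = False

for tick in range(6):
    # One running small job is suspended and re-queued.
    if running:
        small.append(running.pop(0))
    # Suspended jobs are resumed "almost instantly": the small jobs
    # fit into the freed capacity first and immediately re-consume it.
    while small and (len(running) + 1) * 0.1 <= CLUSTER_CPU + 1e-9:
        running.append(small.popleft())
    free = CLUSTER_CPU - len(running) * 0.1
    if free >= 0.2 - 1e-9:
        big_started = True  # the 0.2 CPU job could finally start
    print(f"tick {tick}: running={running} free={free:.1f}")

print("0.2 CPU job ever started:", big_started)  # -> False (starved)
```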
force-pushed from 00093a5 to 5d73c75
force-pushed from 7db4644 to 16d1b87
Looks great, thank you!