Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support job priority #1943

Merged
merged 9 commits into from
May 12, 2022
Merged

support job priority #1943

merged 9 commits into from
May 12, 2022

Conversation

zubenkoivan
Copy link
Contributor

No description provided.

@codecov
Copy link

codecov bot commented Apr 29, 2022

Codecov Report

Merging #1943 (5c1d0ba) into master (9ce01ad) will decrease coverage by 0.09%.
The diff coverage is 94.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1943      +/-   ##
==========================================
- Coverage   92.76%   92.66%   -0.10%     
==========================================
  Files          41       41              
  Lines        6991     7106     +115     
  Branches     1129     1158      +29     
==========================================
+ Hits         6485     6585     +100     
- Misses        370      382      +12     
- Partials      136      139       +3     
Flag Coverage Δ
integration 84.63% <69.73%> (-0.38%) ⬇️
unit 69.64% <90.78%> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
platform_api/cluster_config_factory.py 100.00% <ø> (ø)
platform_api/orchestrator/base.py 71.42% <66.66%> (-0.58%) ⬇️
platform_api/orchestrator/poller_service.py 87.10% <93.13%> (-1.65%) ⬇️
platform_api/cluster_config.py 100.00% <100.00%> (ø)
platform_api/handlers/jobs_handler.py 90.53% <100.00%> (+0.01%) ⬆️
platform_api/orchestrator/job.py 97.14% <100.00%> (+0.10%) ⬆️
platform_api/orchestrator/job_request.py 99.39% <100.00%> (+<0.01%) ⬆️
platform_api/orchestrator/jobs_poller.py 94.55% <100.00%> (-1.37%) ⬇️
platform_api/orchestrator/jobs_service.py 92.57% <100.00%> (+0.45%) ⬆️
platform_api/orchestrator/kube_orchestrator.py 95.27% <100.00%> (+0.06%) ⬆️
... and 6 more

@zubenkoivan zubenkoivan marked this pull request as ready for review May 2, 2022 08:26
if job.is_suspended:
has_suspended = True
suspended_at = job.status_history.current.transition_time
if now - suspended_at >= self._config.max_suspended_time:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looks like here you change logic a bit.

Previously, we resumed suspended jobs only if there is no waiting jobs (so there might be enough resources to fit additional job). It is sub-optimal, as it can happen that a waiting job requires GPU, and a suspended job needs only CPU and the cluster has enough resources. To partially overcome this issue, we resume jobs if they have waited for too long.

So SUSPENDED state here is a "long" state - a job can be in this state for hours.

But you've changed max_suspended_time default to 1:30 minutes, and also do not schedule new jobs while there is at least a single SUSPENDED job. So I guess you want SUSPENDED state to be "short", or it will block new jobs.

My guess is that your idea is to use k8s internal queue instead of our queue of suspended jobs as much as possible. I think it will not work properly for the following case:

You have a cluster with 0.2 CPU, and 4 jobs - 3x 0.1cpu, and 1x0.2cpu, all with the same priority. If we materialize the suspended jobs almost instantly, then 0.2cpu job will never run.

@zubenkoivan zubenkoivan marked this pull request as draft May 4, 2022 07:05
@zubenkoivan zubenkoivan requested a review from romasku May 9, 2022 14:56
@zubenkoivan zubenkoivan marked this pull request as ready for review May 9, 2022 14:56
Copy link
Contributor

@romasku romasku left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you!

@zubenkoivan zubenkoivan merged commit 0d45f62 into master May 12, 2022
@zubenkoivan zubenkoivan deleted the iz/support-job-priority branch May 12, 2022 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants