added priority capability for reclaim action #2262

Open
zhaixigui opened this issue May 29, 2022 · 12 comments · May be fixed by #3340
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@zhaixigui

What would you like to be added:

We need a resource recovery requirement between queues. The scenario is as follows:

We have an offline job management platform, which is responsible for managing offline jobs of different business teams, and all business teams submit jobs on this platform.

The platform has a global job priority. It creates a queue for each business team and sets that queue's maximum available capacity (queue.capacity). An operator is responsible for creating jobs and scheduling them with Volcano.

The platform has two responsibilities:
1) improve the platform's resource utilization rate,
2) allow more high-quality jobs to run.
We hope to achieve these goals through the following two measures:

  1. Oversubscribe the queues: sum(queue.capacity) > the total resources of the cluster. For example, the cluster can allocate 100 CPUs and there are three queues of 50 CPUs each, so sum(queue.capacity) = 150 > 100 (the resources the cluster can allocate).
  2. Enable reclaim: when a queue's usage has not reached queue.deserved and the cluster has no idle resources to allocate, the high-priority jobs in this queue can reclaim resources from low-priority jobs in other queues, but jobs of the same priority level cannot be reclaimed.
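
The two conditions above can be sketched in Go as follows. This is only an illustration of the arithmetic: the `Queue` struct, its fields, and the `Deserved`/`Allocated` values are assumptions made for the example and are not Volcano's API types.

```go
package main

import "fmt"

// Queue is a hypothetical, simplified view of a queue; it is not Volcano's API type.
type Queue struct {
	Name      string
	Capacity  int // maximum CPUs the queue may use (queue.capacity)
	Deserved  int // CPUs the queue is entitled to (queue.deserved)
	Allocated int // CPUs the queue currently uses
}

// Oversubscribed reports whether the queue capacities add up to more than
// the cluster can allocate.
func Oversubscribed(queues []Queue, clusterCPUs int) bool {
	sum := 0
	for _, q := range queues {
		sum += q.Capacity
	}
	return sum > clusterCPUs
}

// ShouldReclaim reports whether q would need to reclaim: it is below its
// deserved share and the cluster has no idle CPUs left.
func ShouldReclaim(q Queue, idleCPUs int) bool {
	return q.Allocated < q.Deserved && idleCPUs == 0
}

func main() {
	// Illustrative numbers only: three queues of 50 CPUs on a 100-CPU cluster.
	queues := []Queue{
		{Name: "team-a", Capacity: 50, Deserved: 33, Allocated: 10},
		{Name: "team-b", Capacity: 50, Deserved: 33, Allocated: 45},
		{Name: "team-c", Capacity: 50, Deserved: 33, Allocated: 45},
	}
	fmt.Println(Oversubscribed(queues, 100)) // 150 > 100
	fmt.Println(ShouldReclaim(queues[0], 0)) // team-a is below deserved, no idle CPUs
}
```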

We see that reclaim does not take priority into account, so resources can be reclaimed between jobs of the same priority level.

So, is it possible to add an AddReclaimableFn to the priority plugin?

E.g:
priority.go

ssn.AddReclaimableFn(pp.Name(), preemptableFn)
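
As a rough sketch of the behavior being requested, the filter below selects only victims whose priority is strictly lower than the reclaimer's. The `Task` type and the `LowerPriorityVictims` function are hypothetical stand-ins written for this example; they do not reproduce the real `preemptableFn` or the callback signature that `AddReclaimableFn` expects.

```go
package main

import "fmt"

// Task is a hypothetical stand-in for a scheduler task; it is not a Volcano type.
type Task struct {
	Name     string
	Queue    string
	Priority int32 // priority of the owning job; higher means more important
}

// LowerPriorityVictims returns the candidates whose priority is strictly
// lower than the reclaimer's. Tasks of the same level are never returned.
func LowerPriorityVictims(reclaimer Task, candidates []Task) []Task {
	var victims []Task
	for _, c := range candidates {
		if c.Priority < reclaimer.Priority {
			victims = append(victims, c)
		}
	}
	return victims
}

func main() {
	reclaimer := Task{Name: "urgent", Queue: "team-a", Priority: 10}
	candidates := []Task{
		{Name: "batch-1", Queue: "team-b", Priority: 1},
		{Name: "peer-1", Queue: "team-b", Priority: 10},
		{Name: "batch-2", Queue: "team-c", Priority: 5},
	}
	for _, v := range LowerPriorityVictims(reclaimer, candidates) {
		fmt.Println(v.Name)
	}
}
```

With these sample values, `peer-1` is skipped because it has the same priority as the reclaimer.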

Why is this needed:

@zhaixigui zhaixigui added the kind/feature Categorizes issue or PR as related to a new feature. label May 29, 2022
@zhaixigui
Author

Any update?

@wpeng102
Member

wpeng102 commented Aug 8, 2022

I think it's a reasonable request. Reclaiming tasks between queues needs to consider the job's priority. @william-wang @Thor-wl

@Thor-wl
Contributor

Thor-wl commented Aug 8, 2022

I think it's a reasonable request. Reclaiming tasks between queues needs to consider the job's priority. @william-wang @Thor-wl

It's OK for me. Looking forward to somebody who is interested in this enhancement.

@Thor-wl Thor-wl added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 8, 2022
@william-wang
Member

@zhaixigui
The following approaches are usually used to achieve the goal you described.

  1. Use proportion to share resources dynamically between different business teams. For example, the cluster can allocate 100 CPUs and there are three queues; the tasks in each queue can use all cluster resources if there are no tasks in the other queues. This helps maximize resource utilization.
  2. Use job priority for allocation and preemption within each queue for urgent tasks.
  3. The queue reservation ability can also be used to handle urgent tasks, ensuring that idle resources are always available for special tasks.
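
The dynamic sharing described in item 1 can be illustrated with a simplified weight-based split. This is a sketch under stated assumptions, not the proportion plugin's actual algorithm: the `QueueDemand` type, its `Active` flag, and the integer division are choices made for the example.

```go
package main

import "fmt"

// QueueDemand is a hypothetical record describing one queue; not a Volcano type.
type QueueDemand struct {
	Name   string
	Weight int
	Active bool // true if the queue currently has tasks
}

// DeservedShares splits totalCPUs among the active queues in proportion to
// their weights. Inactive queues get 0. Integer division is used, so any
// remainder is left unassigned in this simplified sketch.
func DeservedShares(queues []QueueDemand, totalCPUs int) map[string]int {
	shares := make(map[string]int, len(queues))
	totalWeight := 0
	for _, q := range queues {
		if q.Active {
			totalWeight += q.Weight
		}
	}
	for _, q := range queues {
		if q.Active && totalWeight > 0 {
			shares[q.Name] = totalCPUs * q.Weight / totalWeight
		} else {
			shares[q.Name] = 0
		}
	}
	return shares
}

func main() {
	queues := []QueueDemand{
		{Name: "team-a", Weight: 1, Active: true},
		{Name: "team-b", Weight: 1, Active: false},
		{Name: "team-c", Weight: 1, Active: false},
	}
	// Only team-a has tasks, so it may use the whole 100-CPU cluster.
	fmt.Println(DeservedShares(queues, 100)["team-a"])
}
```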

If your legacy system cannot adopt the above approaches, it's fine for me to add AddReclaimableFn to the priority plugin. However, it conflicts with the current proportion plugin: the intention of reclaiming in proportion differs from the intention in priority. We might need to refactor the framework.
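
One way to picture the conflict: a proportion-style rule chooses victims by how much their queue uses relative to its deserved share, while a priority-style rule chooses by job priority, so the two can disagree about the same task. The sketch below assumes that a task is evicted only when both rules approve it. That intersection semantics is an assumption for illustration, and `Candidate`, `proportionApproves`, `priorityApproves`, and `AgreedVictims` are hypothetical names, not Volcano code.

```go
package main

import "fmt"

// Candidate is a hypothetical, simplified task record; it is not a Volcano type.
type Candidate struct {
	Name              string
	Priority          int32
	QueueOverDeserved bool // true if the owning queue uses more than its deserved share
}

// proportionApproves mimics a proportion-style rule: only take resources
// from queues that are above their deserved share.
func proportionApproves(c Candidate) bool {
	return c.QueueOverDeserved
}

// priorityApproves mimics the requested priority rule: only take resources
// from tasks with strictly lower priority than the reclaimer.
func priorityApproves(reclaimerPriority int32, c Candidate) bool {
	return c.Priority < reclaimerPriority
}

// AgreedVictims keeps only the candidates that both rules approve
// (assumed intersection semantics).
func AgreedVictims(reclaimerPriority int32, cands []Candidate) []string {
	var names []string
	for _, c := range cands {
		if proportionApproves(c) && priorityApproves(reclaimerPriority, c) {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	cands := []Candidate{
		{Name: "low-in-overused-queue", Priority: 1, QueueOverDeserved: true},
		{Name: "low-in-underused-queue", Priority: 1, QueueOverDeserved: false},
		{Name: "high-in-overused-queue", Priority: 9, QueueOverDeserved: true},
	}
	fmt.Println(AgreedVictims(5, cands)) // prints [low-in-overused-queue]
}
```

Here the priority rule alone would also accept `low-in-underused-queue`, and the proportion rule alone would also accept `high-in-overused-queue`; only one task satisfies both intentions.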

@william-wang william-wang self-assigned this Aug 9, 2022
@Thor-wl Thor-wl removed the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Aug 9, 2022
@zhaixigui
Author

zhaixigui commented Aug 9, 2022

@william-wang

Resource reservation requires waiting for resources to be released, which is similar to waiting for low-priority tasks to finish. We use reclaim in the hope of recovering resources as soon as possible so that high-priority tasks can run, which better guarantees high-priority tasks when cluster resources are insufficient.

Therefore, we hope that reclaim considers task priority when reclaiming resources and only reclaims resources from low-priority tasks.

@william-wang
Member

So you still want to use a global job priority and reclaim across queues based on priority in your case. I'll add this enhancement to the development pipeline. If you would like to contribute, we can work on it together as well.

@zhaixigui
Author

zhaixigui commented Aug 31, 2022

With pleasure!
Just assign it to me, @william-wang.

@RamezesDong

RamezesDong commented Nov 2, 2022

@william-wang
Hello! I wonder whether there has been any progress on this feature. We need priority between queues, so that a new job in a higher-priority queue can preempt (that is, kill immediately) lower-priority jobs in other queues. I think global job priority is similar to queue priority.
The bug fix for preemption within queues may be helpful: issue 2337.

@stale

stale bot commented Feb 2, 2023

Hello 👋 Looks like there was no activity on this issue for last 90 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗
If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 2, 2023
@wangyang0616
Member

Please keep this issue open.
/remove-lifecycle stale

@volcano-sh-bot volcano-sh-bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 24, 2023
@lowang-bh
Member

/assign

@zhaizhch

zhaizhch commented Mar 7, 2024

Two years have passed. Has this issue been resolved? @zhaixigui @wangyang0616 @william-wang

9 participants