KEP: Add local queue weight #963

KunWuLuan · 2023-07-10T04:10:07Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add a new plugin to enable local queue sorting and filtering

Which issue(s) this PR fixes:

Fixes #752

Special notes for your reviewer:

This pr is also related to #10

Does this PR introduce a user-facing change?

k8s-ci-robot · 2023-07-10T04:10:09Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot · 2023-07-10T04:10:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KunWuLuan
Once this PR has been reviewed and has the lgtm label, please assign denkensk for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify · 2023-07-10T04:10:14Z

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

Name	Link
🔨 Latest commit	`72fe311`
🔍 Latest deploy log	https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/64c0943f6660c500078a8bc4

k8s-ci-robot · 2023-07-10T04:10:16Z

Hi @KunWuLuan. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alculquicondor · 2023-07-10T12:11:35Z

/ok-to-test

alculquicondor · 2023-07-10T12:11:50Z

cc @mwielgus

mwielgus · 2023-07-11T13:20:41Z

keps/752-localqueue-weight/README.md

+[experience reports]: https://github.com/golang/go/wiki/ExperienceReports
+-->
+
+Currently, for queues inside one CQ, we can submit queues as much as we want until the bottleneck of CQ, and one queue can occupy the whole resources with higher priorities. A limitation for a local queue is helpful for resource partitioning.


IMHO, there are two options here:

stay with 2 level hierarchy that Kueue offers today - have each local queue point to a different cluster queue with some resources, and allow some borrowing/eviction between queues via cohorts.

explicitly introduce multi-level hierarchy

This KEP is somewhere between the two - introduces 3 level hierarchy, but with a different functionality set on level 1 (local queues in cluster queue) and level 2 (cluster queues in a cohort).

To Marcin's point, one thing to consider is that LocalQueues are namespaced. As such, it means that administrators could only define weights between users that are in the same namespace.

This could be quite limiting for some organizations that requires a namespace for each end-user.

Sorry maybe I didn't totally understand u. Do you mean maybe we should consider adding weight on cluster queues instead of local queues?

I think weight in local queue is a resource consumption policy to control whose priority is more higher when scheduling and the resources in the queue will not be further segmented. We will still stay with 2 level hierarchy that Kueue offers today.

Today, fair sharing and borrowing semantics happen for CQs within a cohort.

By adding weight to LQs, we are adding fair sharing to that level, implicitly adding one more level of hierarchy. The question is: do we want to do that, or should we rather implement arbitrary hierarchies where CQs can belong to other CQs or Cohort is an CRD and cohorts can belong to cohorts.

I'm not saying that one option or the other is better. I'm saying that we need to ponder all the options. Kueue APIs offer backwards compatibility, meaning that if we add an API, we need to maintain it forever or offer a deprecation process.

To Marcin's point, one thing to consider is that LocalQueues are namespaced. As such, it means that administrators could only define weights between users that are in the same namespace.
This could be quite limiting for some organizations that requires a namespace for each end-user.

Nvm about that comment. I realize it's wrong as multiple namespaces can point to the same CQ and we could respect weight among them.

Thanks for explain!
In my cloud scene, my customers will create a quota node for each department and create a leaf quota node for each user, each user will have their own namespace to submit workloads. Unused resources will be shared by users by min and max like elastic quota in scheduler-plugins.
So I think maybe a 3 level tree is enough for the cases.

I have three ideas.

Add a crd to define weights for local queues in a certain namespace and allow users to create and modify the resource may meet the requirement of Abdullah. But will still add one more level of hierarchy.

Allow admin to define parent cluster queue in cluster queue's spec. Cohort can only be set for the cluster queues who have same parent. Sum nominal resources of Children CQ must less than parent CQ's nominal resources. In this case, the functions in this KEP can be converted into multi-level CQ. But because users can not create and modify cluster queue, users will not be able to manage different workloads' resource consumption of their own.

Combine 1 and 2.

HDYT?

@alculquicondor @ahg-g Hi, if u have time, please take a look and let me know your options, I will continue work on this issue, thank very much. :D

I think the better approach would be something along the lines of 2, which is also similar to #79

And here we have two high level alternatives:
a. Make a cohort a CRD. Then you could make cohorts be part of other cohorts. In this case, cohorts themselves don't have quota (like today). But they could still have weights, that only have semantics around borrowing.
b. Make CQs either belong to a cohort (today's mode) or another CQ. In this case, the child CQ cannot have limits, but can only have weights.

It might be useful to look into how systems like YARN or Slurm do multi-level hierarchy.

Question about your use-case: why not create a cohort per department and one CQ per user?

Also #62 is related.

Ok, thanks, if there is any progress, I will add it on KEP

…queue weight

KunWuLuan · 2023-07-26T03:35:07Z

Hi, I have added some more designs to reach the goals. Please take a look if you have time. Thank you very much! @alculquicondor 👍

KunWuLuan · 2023-09-04T02:26:43Z

I think this can be considered after #1093 .

k8s-triage-robot · 2024-03-20T20:10:02Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

mwielgus · 2024-03-20T20:15:53Z

This will be largely handled by #1531 and #1773.

k8s-ci-robot · 2024-03-28T11:49:08Z

@KunWuLuan: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kueue-verify-main	`72fe311`	link	true	`/test pull-kueue-verify-main`
pull-kueue-test-multikueue-e2e-main-1-29	`72fe311`	link	true	`/test pull-kueue-test-multikueue-e2e-main-1-29`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-triage-robot · 2024-05-18T13:46:20Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot · 2024-06-28T17:51:29Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-06-28T17:51:33Z

@k8s-triage-robot: Closed this PR.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen

Mark this PR as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

add kep

624bcab

k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 10, 2023

k8s-ci-robot requested a review from denkensk July 10, 2023 04:10

k8s-ci-robot requested a review from tenzen-y July 10, 2023 04:10

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 10, 2023

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 10, 2023

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 10, 2023

mwielgus reviewed Jul 11, 2023

View reviewed changes

KunWuLuan mentioned this pull request Jul 12, 2023

☂️ Requirements for v0.5 #974

Closed

add kep

15df90d

KunWuLuan force-pushed the kep/localQueueWeight branch from d2fc26e to 15df90d Compare July 14, 2023 08:14

KunWuLuan added 2 commits July 17, 2023 17:34

add Extension Points in Cluster Queue

45bd44a

add description for preempt

04640b3

KunWuLuan changed the title ~~Kep/local queue weight~~ KEP: add local queue weight Jul 26, 2023

KunWuLuan added 2 commits July 26, 2023 11:27

add design for multi-level hierarchy and allow users to modify local …

c83934b

…queue weight

add design details for local queue weight

72fe311

KunWuLuan changed the title ~~KEP: add local queue weight~~ KEP: Add local queue weight Sep 4, 2023

trasc mentioned this pull request Mar 13, 2024

[multikueue] Don't use reconnecting clients in WL reconnecting. #1827

Merged

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 20, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 18, 2024

k8s-ci-robot closed this Jun 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KEP: Add local queue weight #963

KEP: Add local queue weight #963

KunWuLuan commented Jul 10, 2023

k8s-ci-robot commented Jul 10, 2023

k8s-ci-robot commented Jul 10, 2023

netlify bot commented Jul 10, 2023 •

edited

Loading

k8s-ci-robot commented Jul 10, 2023

alculquicondor commented Jul 10, 2023

alculquicondor commented Jul 10, 2023

mwielgus Jul 11, 2023

alculquicondor Jul 11, 2023

KunWuLuan Jul 17, 2023

KunWuLuan Jul 17, 2023

alculquicondor Jul 17, 2023

KunWuLuan Jul 18, 2023

KunWuLuan Aug 7, 2023

alculquicondor Aug 8, 2023

alculquicondor Aug 9, 2023

KunWuLuan Aug 16, 2023

KunWuLuan commented Jul 26, 2023

KunWuLuan commented Sep 4, 2023

k8s-triage-robot commented Mar 20, 2024

mwielgus commented Mar 20, 2024

k8s-ci-robot commented Mar 28, 2024 •

edited

Loading

k8s-triage-robot commented May 18, 2024

k8s-triage-robot commented Jun 28, 2024

k8s-ci-robot commented Jun 28, 2024

KEP: Add local queue weight #963

KEP: Add local queue weight #963

Conversation

KunWuLuan commented Jul 10, 2023

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

k8s-ci-robot commented Jul 10, 2023

k8s-ci-robot commented Jul 10, 2023

netlify bot commented Jul 10, 2023 • edited Loading

✅ Deploy Preview for kubernetes-sigs-kueue canceled.

k8s-ci-robot commented Jul 10, 2023

alculquicondor commented Jul 10, 2023

alculquicondor commented Jul 10, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

KunWuLuan commented Jul 26, 2023

KunWuLuan commented Sep 4, 2023

k8s-triage-robot commented Mar 20, 2024

mwielgus commented Mar 20, 2024

k8s-ci-robot commented Mar 28, 2024 • edited Loading

k8s-triage-robot commented May 18, 2024

k8s-triage-robot commented Jun 28, 2024

k8s-ci-robot commented Jun 28, 2024

netlify bot commented Jul 10, 2023 •

edited

Loading

k8s-ci-robot commented Mar 28, 2024 •

edited

Loading