Refactor Preemption to interact with Quota/Usage through ClusterQueueSnapshot interface #2595

Merged
k8s-ci-robot merged 2 commits into kubernetes-sigs:main from the refactor_preemption branch on Jul 17, 2024

Conversation

gabesaba
Contributor

@gabesaba gabesaba commented Jul 12, 2024

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Follow up to #2592. In preparation for #79.

Copied from #2592: we will change the Scheduler, FlavorAssigner, and Preemption logic to only interact with the ClusterQueueSnapshot's capacity through high level queries. E.g. "am I borrowing" or "how much capacity do I have left".
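
As a rough illustration of what such high-level queries could look like, here is a minimal, self-contained sketch. The `snapshot` type, its fields, and the `Borrowing` and `Available` method names are hypothetical stand-ins chosen for this example, not the actual ClusterQueueSnapshot API, and the sketch ignores cohorts and borrowing limits entirely.

```go
package main

import "fmt"

// FlavorResource identifies a (flavor, resource) pair. It is a simplified
// stand-in for resources.FlavorResource.
type FlavorResource struct {
	Flavor   string
	Resource string
}

// snapshot is a hypothetical, cohort-free stand-in for ClusterQueueSnapshot.
type snapshot struct {
	nominal map[FlavorResource]int64
	usage   map[FlavorResource]int64
}

// Borrowing answers "am I borrowing": usage is strictly above nominal quota.
func (s *snapshot) Borrowing(fr FlavorResource) bool {
	return s.usage[fr] > s.nominal[fr]
}

// Available answers "how much capacity do I have left"; it is never negative.
func (s *snapshot) Available(fr FlavorResource) int64 {
	remaining := s.nominal[fr] - s.usage[fr]
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	fr := FlavorResource{Flavor: "default", Resource: "cpu"}
	s := &snapshot{
		nominal: map[FlavorResource]int64{fr: 6},
		usage:   map[FlavorResource]int64{fr: 4},
	}
	fmt.Println(s.Borrowing(fr)) // false
	fmt.Println(s.Available(fr)) // 2
}
```

Callers then never read the quota or usage maps directly, which is the property the refactor is after.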

Additionally, we fix the test "preempting locally and borrowing other resources in cohort, without cohort candidates". Since we were only looping over the CQ's resource groups, we didn't find out that the CQ had no capacity for this flavor. We fix this by adding the alpha flavor to that CQ, and adding a CQ which lends resources to those FlavorResources.

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jul 12, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 12, 2024

netlify bot commented Jul 12, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

| Name | Link |
| --- | --- |
| 🔨 Latest commit | 3541f14 |
| 🔍 Latest deploy log | https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6697b4f27d88d500084c6f6f |

	}
	cqResUsage := cq.Usage[fName]
	for rName := range flvReq {
		if cqResUsage[rName] >= cq.QuotaFor(resources.FlavorResource{Flavor: fName, Resource: rName}).Nominal {
Contributor Author

note to reviewers: this >= turns into a >. I think this is correct (as we want to make sure we're within nominal quota), but please scrutinize it @alculquicondor

Contributor

Actually, a more accurate check would be whether the usage plus the resources for the incoming workload would be borrowing.

If it's not borrowing, then this means that this CQ is preempting to reclaim quota, then it is allowed to preempt other workloads in the cohort. Otherwise, it should only be allowed to preempt workloads within its CQ or those below the threshold.

@mimowo wdyt? This was added in #1979 and reused in #2110
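
The check proposed in this comment (current usage plus the incoming workload's request compared against nominal quota) could be sketched as follows. The function name `wouldBorrow` and its flat integer parameters are assumptions made for illustration; the real code works per FlavorResource on the snapshot.

```go
package main

import "fmt"

// wouldBorrow reports whether admitting a workload that requests `request`
// units would push the queue's usage above its nominal quota.
func wouldBorrow(usage, request, nominal int64) bool {
	return usage+request > nominal
}

func main() {
	// Usage 4 + request 2 fits exactly within nominal 6: reclaiming quota.
	fmt.Println(wouldBorrow(4, 2, 6)) // false
	// Usage 4 + request 3 exceeds nominal 6: the queue would be borrowing.
	fmt.Println(wouldBorrow(4, 3, 6)) // true
}
```

Under this reading, a `false` result corresponds to the reclaim case, where preempting other workloads in the cohort is allowed.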

Contributor Author

If there's a bug in the accounting, let's fix it in a follow-up PR, as this PR is intended to clean up existing code without any change in behavior.

Contributor

> note to reviewers: this >= turns into a >. I think this is correct (as we want to make sure we're within nominal quota), but please scrutinize it @alculquicondor

I think !queueUnderNominalInResourcesNeedingPreemption is not equivalent to cqIsBorrowing. For example, if you have one resource and cqResUsage[rName] == Nominal, then queueUnderNominalInResourcesNeedingPreemption returns false, so !queueUnderNominalInResourcesNeedingPreemption is true, while cqIsBorrowing is false.

> @mimowo wdyt? This was added in #1979 and reused in #2110

Ack. I remember this being quite complex, so I would prefer to keep the existing logic in this PR and do a dedicated one for a fix if needed. I will still look into this a bit more.
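
The boundary case described here can be reproduced with a small single-resource sketch. The helpers `underNominal` and `isBorrowing` are simplified, hypothetical reductions of queueUnderNominalInResourcesNeedingPreemption and cqIsBorrowing to one resource; they are not the functions from preemption.go.

```go
package main

import "fmt"

// underNominal mirrors the existing check for a single resource:
// the queue is under nominal only if usage is strictly below it.
func underNominal(usage, nominal int64) bool {
	return usage < nominal
}

// isBorrowing is true only if usage is strictly above nominal.
func isBorrowing(usage, nominal int64) bool {
	return usage > nominal
}

func main() {
	const nominal = 6
	for _, usage := range []int64{5, 6, 7} {
		fmt.Printf("usage=%d !underNominal=%t isBorrowing=%t\n",
			usage, !underNominal(usage, nominal), isBorrowing(usage, nominal))
	}
	// Output:
	// usage=5 !underNominal=false isBorrowing=false
	// usage=6 !underNominal=true isBorrowing=false
	// usage=7 !underNominal=true isBorrowing=true
}
```

The two predicates disagree only at usage == nominal, which is exactly the case the >= versus > change affects.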

Contributor Author

Folded the two functions into one #2595 (comment)

Contributor

I've met with @gabesaba and we agreed that he will try to adjust this refactoring PR so that it does not change any logic. We will have a dedicated follow-up to adjust the logic if needed. We will then consider changing < to <=, or the idea from #2595 (comment).

Contributor Author

Done. Unfolded the functions, and added back in the cohort == nil check.

Comment on lines 539 to 541
if cq.Cohort == nil {
	return false
}
Contributor Author

Can we get rid of this check? If so, we can fold queueUnderNominalInResourcesNeedingPreemption into this function. Unit tests still pass after its removal.

Contributor

Yeah, I think the check can be removed, because this logic is only called for CQs that belong to the same cohort as the preempting one.

Contributor Author

Replaced usages of queueUnderNominalInResourcesNeedingPreemption with !cqIsBorrowing

	return 0
}

func (c *ClusterQueueSnapshot) borrowing(fr resources.FlavorResource) *int64 {
Contributor

Suggested change
- func (c *ClusterQueueSnapshot) borrowing(fr resources.FlavorResource) *int64 {
+ func (c *ClusterQueueSnapshot) borrowingLimit(fr resources.FlavorResource) *int64 {


// if the borrowing limit exists, we cap our available capacity by the borrowing limit.
if borrowingLimit := c.borrowing(fr); borrowingLimit != nil {
	borrowingRemaining := c.nominal(fr) + *borrowingLimit - c.usageFor(fr)
Contributor

Suggested change
- borrowingRemaining := c.nominal(fr) + *borrowingLimit - c.usageFor(fr)
+ withBorrowingRemaining := c.nominal(fr) + *borrowingLimit - c.usageFor(fr)
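
For readers following the capping logic in this hunk, a minimal sketch of how a borrowing limit might bound available capacity is shown below. The `available` function, its flat parameters, and the `cohortAvailable` input are assumptions for illustration; only the expression nominal + borrowingLimit - usage is taken from the diff.

```go
package main

import "fmt"

// available returns the capacity left for one FlavorResource. It starts from
// the capacity left in the cohort and, if a borrowing limit is set, caps it
// at nominal + borrowingLimit - usage. The result is never negative.
func available(cohortAvailable, nominal, usage int64, borrowingLimit *int64) int64 {
	capacity := cohortAvailable
	if borrowingLimit != nil {
		withBorrowingRemaining := nominal + *borrowingLimit - usage
		if withBorrowingRemaining < capacity {
			capacity = withBorrowingRemaining
		}
	}
	if capacity < 0 {
		return 0
	}
	return capacity
}

func main() {
	limit := int64(2)
	fmt.Println(available(10, 6, 5, &limit)) // 3: capped by 6 + 2 - 5
	fmt.Println(available(10, 6, 5, nil))    // 10: no borrowing limit set
}
```

A nil pointer is used for "no limit" to match the *int64 return type of the function under review.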

@@ -91,6 +91,23 @@ func TestPreemption(t *testing.T) {
Resource(corev1.ResourceCPU, "6", "6").
Resource(corev1.ResourceMemory, "3Gi", "3Gi").
Obj(),
*utiltesting.MakeFlavorQuotas("alpha").
Contributor

Instead, change the test to use default for both resources. They are part of the same resource group, so they shouldn't get different flavors.

@gabesaba gabesaba force-pushed the refactor_preemption branch from d4ed6f5 to a89149b Compare July 15, 2024 08:06
@gabesaba gabesaba force-pushed the refactor_preemption branch from a89149b to 3541f14 Compare July 17, 2024 12:11
@mimowo
Contributor

mimowo commented Jul 17, 2024

/lgtm
/approve
I believe all comments are addressed.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 17, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gabesaba, mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 5d12594ef18c940ed7cbc15cda6969d616343427

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 17, 2024
@k8s-ci-robot k8s-ci-robot merged commit 1c559b6 into kubernetes-sigs:main Jul 17, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.8 milestone Jul 17, 2024
@gabesaba gabesaba deleted the refactor_preemption branch August 9, 2024 13:04
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
…Snapshot interface (kubernetes-sigs#2595)

* Fix preemption test

* Refactor preemption.go
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
4 participants