Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix using PrioritySortingWithinCohort=false and borrowWithinCohort #2807

Merged
merged 1 commit into from
Aug 13, 2024

Conversation

mimowo
Copy link
Contributor

@mimowo mimowo commented Aug 9, 2024

What type of PR is this?

What this PR does / why we need it:

To prevent infinite preemption loop when borrowWithinCohort is used with PrioritySortingWithingCohort=false

Which issue(s) this PR fixes:

Fixes #2821

Special notes for your reviewer:

In the scenario described in #2821 when PrioritySortingWithingCohort=false we re-admit repeatedly workload-A (lower priority) which blocks the admission of the Workload-B2 (higher priority).

Ideally, we would like workload-B to be scheduled, which would break the cycle. Workload-B2 would not get preempted by Workload-A.

The code that is important for relative prioritization of the two workloads is here:

  1. [L630-634] both Workload-A and Workload-B2 need to borrow to get admitted
  2. [L637-639] fair sharing is not configured here, so this block is not relevant
  3. [L641-648] PrioritySortingWithingCohort=false so this block is not relevant
  4. [L651-653] Workload-A currently "wins" here because it is older

The idea for the fix in to modify the formula for GetQueueOrderTimestamp at the last step and use the timestamp to discriminate in favor of Workload-B2. In this order we use the EvictionTimestamp rather than creation timestamp. This idea is not entirely new as we do something similar to penalize workload evicted due to PodsReady timeout already.

Does this PR introduce a user-facing change?

Prevent infinite preemption loop when PrioritySortingWithinCohort=false
is used together with borrowWithinCohort.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Aug 9, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from denkensk and trasc August 9, 2024 10:28
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 9, 2024
Copy link

netlify bot commented Aug 9, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit f09db5f
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/66ba3866d863b70008cc8c64

@mimowo
Copy link
Contributor Author

mimowo commented Aug 9, 2024

/test pull-kueue-test-integration-main

@mimowo mimowo force-pushed the fix-disable-priority branch from 093cad0 to d079c97 Compare August 12, 2024 08:21
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Aug 12, 2024
@mimowo mimowo force-pushed the fix-disable-priority branch 4 times, most recently from 42c5add to 9c75489 Compare August 12, 2024 09:27
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Aug 12, 2024
@mimowo mimowo force-pushed the fix-disable-priority branch from 9c75489 to ef58531 Compare August 12, 2024 12:04
@mimowo mimowo changed the title WIP: fix mode when PrioritySortingWithingCohort is disabled [EXPERIMENT] WIP: fix mode when PrioritySortingWithinCohort is disabled [EXPERIMENT] Aug 12, 2024
@mimowo mimowo force-pushed the fix-disable-priority branch from ef58531 to f09db5f Compare August 12, 2024 16:29
@mimowo mimowo changed the title WIP: fix mode when PrioritySortingWithinCohort is disabled [EXPERIMENT] Fix using PrioritySortingWithinCohort=false and borrowWithinCohort Aug 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 12, 2024
@mimowo
Copy link
Contributor Author

mimowo commented Aug 12, 2024

/assign @gabesaba

@mimowo
Copy link
Contributor Author

mimowo commented Aug 13, 2024

/cc @PBundyra

@k8s-ci-robot k8s-ci-robot requested a review from PBundyra August 13, 2024 07:32
@PBundyra
Copy link
Contributor

PBundyra commented Aug 13, 2024

Could you please describe the mechanism in description behind preventing the infinite loop? Especially which workload does Kueue should admit after the changes

@PBundyra
Copy link
Contributor

How do we estimate that 1ms will be enough to put the preemptor in front of the workload being preemtepted?
Other than that, LGTM

@mimowo
Copy link
Contributor Author

mimowo commented Aug 13, 2024

Could you please describe the mechanism in description behind preventing the infinite loop? Especially which workload does Kueue should admit after the changes

Good idea, I have updated the "Special notes for your reviewer:" section. PTAL.

@mimowo
Copy link
Contributor Author

mimowo commented Aug 13, 2024

How do we estimate that 1ms will be enough to put the preemptor in front of the workload being preemtepted?
Other than that, LGTM

Yes, any epsilon would be enough. This is just to discriminate them in scenario when the creation timestamp of the preemptor equals eviction timestamp of the preempted workload. This is a practical possibility which was discovered by the integration test when I was running it in a loop. It would not be end of the world - the preempted workload-A in practice would get admitted and preempted again. Eventually its eviction timestamp would be greater than the creation timestamp of Workload-B2 (but it could take a couple of attempts as the resolution of timestamps is 1s).

@PBundyra
Copy link
Contributor

Thank you, LGTM

@gabesaba
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Aug 13, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 39f12fcb2958cf47566f8609c908a9c6c828a1e7

@mimowo
Copy link
Contributor Author

mimowo commented Aug 13, 2024

/cherry-pick release-0.8

@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: once the present PR merges, I will cherry-pick it on top of release-0.8 in a new PR and assign it to you.

In response to this:

/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot merged commit 2582458 into kubernetes-sigs:main Aug 13, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.9 milestone Aug 13, 2024
@k8s-infra-cherrypick-robot
Copy link
Contributor

@mimowo: new pull request created: #2830

In response to this:

/cherry-pick release-0.8

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot pushed a commit that referenced this pull request Aug 13, 2024
* Move declaration of the Preempted reasons to API pkg

* Fix when PrioritySortingWithingCohort is disabled
mbobrovskyi pushed a commit to epam/kubernetes-kueue that referenced this pull request Aug 20, 2024
@alculquicondor
Copy link
Contributor

/release-note-edit

Prevent infinite preemption loop when PrioritySortingWithinCohort=false
is used together with borrowWithinCohort.

@mimowo
Copy link
Contributor Author

mimowo commented Oct 4, 2024

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 4, 2024
kannon92 pushed a commit to openshift-kannon92/kubernetes-sigs-kueue that referenced this pull request Nov 19, 2024
@mimowo mimowo deleted the fix-disable-priority branch January 21, 2025 11:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Infinite preemption loop is possible when PrioritySortingWithinCohort=false is used with borrowWithinCohort
6 participants