Add enhancement for Workload preemption #410
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: alculquicondor. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Force-pushed from 6e1c5c5 to 0b0c5f0.
Force-pushed from 67a63bc to 59e692f.
be arbitrarily delayed.
- While Pods terminate, a ClusterQueue's quota could be oversubscribed.

Oversubscription could be mitigated by watching the running pods but this will
Will we consider adding a new condition type to the batch Job, like Reclaimed, which indicates that all the pods owned by the Job have been deleted?
Yes, we could do something like that, but I would leave it as a follow up. It's kind of related to #78
Sure.
Force-pushed from 730b787 to fbc47fe.
Sorting uses the following criteria:

1. Flavors that don't borrow first.
2. [New criteria] Highest priority first.
If some workloads have high priority but need to borrow from others or preempt other low priority workloads in the same ClusterQueue, will they need to wait for all the workloads that don't need to borrow to be scheduled first? Will this lead to long-term starvation of high-priority workloads?
Yes, first we prefer not to borrow (it's the semantics of min quota).
If other ClusterQueues are not using all their quota, this high priority workload would be allowed to borrow.
Otherwise, it will start preempting lower priority workloads in its CQ, before allowing borrowing in the cohort.
So no, there wouldn't be starvation within the ClusterQueue.
But it will lead to performance loss. Low priority workloads (that don't need to borrow) will be scheduled first, then the high priority workload will preempt them because the ClusterQueue lacks resources. In the end, the low priority workloads will be scheduled again with resource borrowing.
If the high priority workload fits with borrowing, it can still borrow, but in the next scheduling cycle.
If there is no capacity to borrow, it will preempt low priority workloads, and they wouldn't fit when requeued either.
So no, there wouldn't be starvation within the ClusterQueue.
If we have several lower priority workloads which don't borrow resources and one high priority workload which needs to borrow resources, then in my understanding the high priority workload is starving, no?
In scheduling, we prefer workloads that don't borrow resources, but we support preemption via priority. Don't these conflict with each other? That is, lower priority workloads that don't borrow resources might be scheduled first, but then they will be preempted by higher priority ones?
How about only considering priority in preemption, at least for startup, and leaving the flavor assignment to #312, like MinimizeBorrowing? I think this is clearer.
This is not exactly the same as your design, but it can somewhat mitigate this.
Yes, first we prefer not to borrow (it's the semantics of min quota).
I assume that there are two workloads:
- workload_a has higher priority, but it doesn't have enough free quota in its cluster queue, so it needs to preempt some lower priority workloads to get resources.
- workload_b has lower priority, but it has enough free quota in its cluster queue, so it won't need to borrow quota.
@alculquicondor I think that in your design workload_b will be scheduled first, right? If there are many short-term workloads like workload_b, will workload_a wait for all the workloads like workload_b to be scheduled? This may lead to starvation where the higher priority workload waits for a long time.
If we have several lower priority workloads which don't borrow resources and one high priority workload which needs to borrow resources, then in my understanding the high priority workload is starving, no?
This is somewhat true. What you might be missing is that we are talking about heads of the ClusterQueues here.
So we have two heads of different ClusterQueues competing, but both have the highest priority within their ClusterQueue. Let's say A has lower priority than B. If B fits without borrowing, it should still be scheduled first, to satisfy the min quota of the ClusterQueue. That's the definition of min.
we prefer workloads that don't borrow resources, but we support preemption via priority, don't these conflict with each other
No. Preemption happens in two scenarios: (1) within a cohort, to recover the min quota that was lent to other ClusterQueues, or (2) within a ClusterQueue, purely due to priority.
I think that in your design workload_b will be scheduled first, right? If there are many short-term workloads like workload_b, will workload_a wait for all the workloads like workload_b to be scheduled? This may lead to starvation where the higher priority workload waits for a long time.
Yes, this is working as intended. We should respect the min quota of the ClusterQueue for B first.
But workload_a could still preempt lower priority workloads within its ClusterQueue.
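To make the ordering being discussed here concrete, below is a rough Go sketch of the head-comparison logic; the type and field names (queuedWorkload, Borrows, and so on) are illustrative placeholders, not the actual Kueue implementation.

```go
package sketch

import "time"

// queuedWorkload is a placeholder for the information the scheduler would
// need about each ClusterQueue head when ordering them for admission.
type queuedWorkload struct {
	Borrows      bool      // admitting this head would require borrowing from the cohort
	Priority     int32     // workload priority
	CreationTime time.Time // FIFO tie-breaker
}

// admitBefore returns true if a should be considered for admission before b:
// 1. Heads that fit without borrowing go first (the semantics of min quota).
// 2. [New criteria] Higher priority first.
// 3. Older workloads first.
func admitBefore(a, b queuedWorkload) bool {
	if a.Borrows != b.Borrows {
		return !a.Borrows
	}
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.CreationTime.Before(b.CreationTime)
}
```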
If the workload has `.spec.onPreemption=Requeue`, clear `.spec.admission`.
The Workload will be requeued by the Workload event handler.

The incoming Workload will be admitted once the changes in the victim Workloads
In the time between changing the workload and the cache being updated, where is the incoming workload placed? Will it be re-queued again?
Yes, it will be in the inadmissible stage. Every time a Workload deletion is observed, it will be in the active queue again.
I rewrote this paragraph for clarity.
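A small Go sketch of that requeueing behavior, under the assumption that the queue manager keeps a separate list of inadmissible workloads per ClusterQueue; all names here are illustrative, not the real Kueue types.

```go
package sketch

// workload and clusterQueueState are placeholders for the queue manager's
// internal bookkeeping.
type workload struct{ name string }

type clusterQueueState struct {
	active       []*workload // workloads eligible for the next scheduling cycle
	inadmissible []*workload // workloads parked after a failed admission attempt
}

// onWorkloadDeleted models the behavior described above: observing a Workload
// deletion frees quota, so every parked workload gets another chance in the
// active queue.
func (cq *clusterQueueState) onWorkloadDeleted() {
	cq.active = append(cq.active, cq.inadmissible...)
	cq.inadmissible = nil
}
```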
// since the first time that the ClusterQueue has pending workloads that would
// fit under the min quota. This time is present in the timestamp for the
// PendingWorkloadsUnderMinQuota condition.
// If null, it means to wait indefinitely.
I'm not sure I understand correctly, shouldn't there be no wait here if null?
That can be achieved with an explicit 0.
null basically means disable preemption (I updated the comment). An additional nice thing about this semantics is that it's backwards compatible.
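A minimal Go sketch of the nil-versus-zero semantics described here; the helper name and parameters are assumptions for illustration only.

```go
package sketch

import "time"

// canStartPreemption distinguishes the three cases discussed above:
//   - nil:  preemption disabled (also the backwards-compatible default)
//   - 0:    preempt immediately, with no wait
//   - n>0:  preempt only after the workload has been pending for n seconds
func canStartPreemption(waitSeconds *int64, pendingSince time.Time) bool {
	if waitSeconds == nil {
		return false
	}
	return time.Since(pendingSince) >= time.Duration(*waitSeconds)*time.Second
}
```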
The algorithm is as follows:

For each resource (or set of resources with the same flavors), evaluate
flavors in the order established in the ClusterQueue:
Related to #312
Linked in the paragraph below for readers' convenience.
Workload preemption:
- A field in the ClusterQueue to influence how long to wait before starting preemption.
- A field in the Workload to determine whether to terminate or requeue upon preemption.
What are the use cases in which you think termination will be useful?
+1, I don't think this is necessary, at least not initially; we can start with always re-queueing the jobs.
That's my question to the community.
One case where I'm almost certain that it would be useful is for interactive Jobs that @kannon92 is working on. But since this work is still going on, we can leave it as a follow up and just requeue for now.
Moved to Alternatives
I think I'd like a separation between workloads that went over their allocated time limit and workloads that are over their time limit but allowed to run due to preemption.
I don't think it needs to be addressed here, but that is why I created #405, to maybe discuss it separately from preemption.
I also added some notes about reassigning flavors. And opened kubernetes/kubernetes#113221
workloads with higher priority and that Workloads are sent to ClusterQueues
where most Workloads have the same priority.

Additionally, the preemption is mostly a linear pass over the running workloads
I meant to say "the preemption algorithm". Updated.
workloads when:
- ClusterQueues under their minimum quota need the resources that are currently lent to other ClusterQueues in the cohort.
- Within a Clusterqueue, there are running Workloads with lower priority than
Suggested change (typo fix: Clusterqueue → ClusterQueue):
- Within a ClusterQueue, there are running Workloads with lower priority than
Force-pushed from fbfc7d9 to 8a4786c.
Force-pushed from 8a4786c to 16e500c.
/retest

If this is a ClusterQueue with BestEffortFIFO, the Workload is put in the
inadmissible staging area with a deadline equal to the remaining time to
fulfill `waitBeforePreemptionSeconds`.
If no existing workloads can be preempted to fit the incoming workload, then at the limit all workloads that belong to this CQ will always be in the active queue (pretty much converting it into a strict FIFO).
This only applies while the deadline is not satisfied. Once it's satisfied, we proceed with preemption and requeue as usual, as already explained at the end of this section. Still, I tried to rephrase a little.
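As an illustration of the deadline computation in the quoted paragraph, here is a hedged Go sketch; firstPendingTime stands in for the timestamp of the condition mentioned in the design, and the function names are assumptions, not the actual implementation.

```go
package sketch

import "time"

// requeueDeadline computes when an inadmissible workload should be retried
// with preemption: the end of waitBeforePreemptionSeconds, measured from the
// first time the workload was found pending.
func requeueDeadline(firstPendingTime time.Time, waitBeforePreemptionSeconds int64) time.Time {
	return firstPendingTime.Add(time.Duration(waitBeforePreemptionSeconds) * time.Second)
}

// remainingWait is the time left before preemption may be attempted; zero or
// negative means the deadline has already been satisfied.
func remainingWait(firstPendingTime time.Time, waitBeforePreemptionSeconds int64, now time.Time) time.Duration {
	return requeueDeadline(firstPendingTime, waitBeforePreemptionSeconds).Sub(now)
}
```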
Force-pushed from 16e500c to a6f33f1.
/retest
// ClusterQueues, can preempt Workloads with higher priority in this
// ClusterQueue if it's using more than its min quota.
// Possible values are Allow and Disallow (default).
ByLowerPriorityWhenBorrowing *PreemptionByLowerPriorityWhenBorrowingPolicy
Strange name, any precedent in k8s for this naming format?
I don't think so.
Maybe something like this?
preemption:
  onLowPriorityReclaim: Preempt|Ignore
Aldo and I spent some time talking about the behavior and the name of this field.
Which of the following is this field controlling:
- the policy this queue will use to preempt workloads running in other queues to recover quota borrowed from this queue
- the policy other queues will use to preempt workloads running in this queue to recover quota this queue borrowed from other queues
Option 1 was what I expected, since different queues in a cohort could have different opinions about how important it is to recover quota borrowed from them, and option 2 can't express that.
If this is for option 1, it seems like there are at least three possible preemption policies for recovering/reclaiming quota borrowed from this queue (having the behavior you get with a nil value for the policy be able to be represented as an explicit enum value is good):
Never
LowerPriority
Always
It's also unclear how this interacts with the WaitBeforePreemptionSeconds field. Can a queue be configured to reclaim borrowed quota by preemption (either immediately or by priority) without being forced to enable preemption among workloads in the queue?
Maybe something like this?
preemption:
  onLowPriorityReclaim: Preempt|Ignore
The "LowPriority" aspect seems like one of the possible policies, not something we should bake into the field name
It is indeed option 1; I like the enum and we can perhaps name the field preemptionPolicy or just policy.
I changed it to:
preemption:
  withinCohort: Never|FromLowerPriority|Always
  withinClusterQueue: Never|FromLowerPriority
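A Go sketch of what that API shape could look like; the enum values and field names follow the comment above, but the concrete Go types and JSON tags are assumptions.

```go
package sketch

// PreemptionPolicy enumerates the values mentioned above.
type PreemptionPolicy string

const (
	PreemptionNever             PreemptionPolicy = "Never"
	PreemptionFromLowerPriority PreemptionPolicy = "FromLowerPriority"
	PreemptionAlways            PreemptionPolicy = "Always"
)

// ClusterQueuePreemption groups the two knobs: reclaiming quota within the
// cohort and priority-based preemption within the ClusterQueue.
type ClusterQueuePreemption struct {
	// WithinCohort: Never | FromLowerPriority | Always
	WithinCohort PreemptionPolicy `json:"withinCohort,omitempty"`
	// WithinClusterQueue: Never | FromLowerPriority
	WithinClusterQueue PreemptionPolicy `json:"withinClusterQueue,omitempty"`
}
```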
// in the transitionTimestamp of the Admitted condition.
// If null, it means to wait indefinitely, effectively disabling preemption
// for incoming Workloads in this ClusterQueue.
WaitBeforePreemptionSeconds *int64
Jordan and I spent some time thinking about this field, beyond just the name.
Here is one scenario where things get confusing:
Setup:
- low priority stuff is running and consuming all resources, mode=StrictFIFO, WaitBeforePreemptionSeconds=2 seconds
- add medium priority workload b at t=0
- add high priority workload a at t=1
- one second passes (t=2)
- queue looks like high(workload-a(waiting 1 second)), medium(workload-b(waiting 2 seconds))
What should happen?
In best-effort mode:
- we would look at workload-a, skip preemption (because it hadn't been waiting long enough)
- continue to workload-b, trigger preemption of things matching b's resource requirements
- change in available resources (because of our preemption) would make us reevaluate the whole list
- if workload-a fit within the space made for workload-b by preemption, we would run it, otherwise, we'd run workload-b
In strict mode (currently):
- we would only look at workload-a, and not preempt anything because it had not been waiting long enough
- this means even though workload-b has been waiting longer than WaitBeforePreemptionSeconds, nothing triggers the queue to preempt to make progress until workload-a reaches that threshold
Naively, a user would expect that once something in the queue has waited long enough to consider preemption, we would start making room for the queue to progress to running that workload, even if that transitively means making room for the workload at HEAD in strict mode.
I think we can make that change relatively easily by not just looking at the wait time for the head of the queue, but the longest wait time we have in the queue.
As for the field name, triggerAfterWorkloadWaitingSeconds should do.
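A rough Go sketch of the suggested tweak, checking the longest wait in the queue rather than only the head; all names are illustrative.

```go
package sketch

import "time"

// pendingWorkload records when a workload first became pending in the queue.
type pendingWorkload struct {
	pendingSince time.Time
}

// preemptionTriggered returns true once any pending workload, not just the
// head, has waited past the configured threshold, so that even strict FIFO
// can start making room for the workload at HEAD.
func preemptionTriggered(queue []pendingWorkload, wait time.Duration, now time.Time) bool {
	for _, w := range queue {
		if now.Sub(w.pendingSince) >= wait {
			return true
		}
	}
	return false
}
```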
I am having second thoughts on this parameter. Without informing the victim job of the upcoming preemption, how is shifting the time we preempt going to be useful at the limit?
Going with the interpretation of "at the limit" as when all the ClusterQueues are full, thus we would require preemption in all scenarios.
The time is measured from the first time we attempt to admit a Workload.
In StrictFIFO (assuming no higher priority workloads jump to the head of the queue), this means that we still wait before preemption after each admission.
BestEffortFIFO is more complicated, because the head will keep changing as long as there are no other changes in the cluster (like workloads finishing). In an extreme case, we would go through the entire queue and start the timers for all the workloads. Once the timer is fulfilled, we will do preemptions for all the elements in the queue until they are all admitted. Arguably, this is less useful.
I guess we could park it and instead provide a knob to control whether to preempt within a ClusterQueue or not, alongside the knob to preempt within the cohort.
I think we can park it for now. It feels like this will simply shift the start of the incoming job, which, in most cases, is equivalent to the incoming job being submitted WaitBeforePreemptionSeconds later.
To do "cooperative" preemption, it is more useful to find ways to improve victim selection, probably via an explicit signal from the workloads (similar to preemption cost). For example, in the case of v1.Job with completions > parallelism, we can calculate how much progress it has made by looking at the success count and completions, and avoid preempting the ones that are almost done.
Moved the field into the "Alternatives" section for now.
Force-pushed from b2ec4c6 to e6801eb.
Some highlights:
- A Workload could get flavor assignments at different steps for different resources.
- Assignments that require preemption implicitly do not borrow quota.
I think it is reasonable to support the case where a higher priority workload can preempt lower priority ones that are coming from any CQ with usage above its min. We can leave that as a followup, but I think we should make sure that the initial implementation doesn't block this case.
I think it is reasonable to support the case where a higher priority workload can preempt lower priority ones that are coming from any CQ with usage above its min.
Isn't that what we are already proposing? Or do you mean that the CQ of the pending workload would go above its min by preempting other workloads on CQs that aren't above their min?
The statement "Assignments that require preemption implicitly do not borrow quota." implies that preemption is allowed only if the pending workload can fit under its CQ quota after preemption.
Consider the case where all admitted workloads in CQ1 are of equal or higher priority than the pending workload, and there are lower priority workloads from CQ2 that are admitted via quota borrowed from CQ3. In that case, I am suggesting that the pending workload from CQ1 should preempt the lower priority ones from CQ2 to release the unused resources of CQ3.
Gotcha....
I think the implementation in this proposal can still be extended to support that:
we query what is the minimum priority among the admitted ClusterQueues in the flavor and see if it's higher than or equal to the pending Workload, then choose any flavor, but mark it as borrowing. Then run the preemption algorithm considering all workloads in the cohort.
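A very rough Go sketch of that extension follows; the comparison direction and all names are assumptions based on one reading of the comment above.

```go
package sketch

// lowestAdmittedPriority returns the minimum priority among the workloads
// currently admitted in a flavor across the cohort, and whether any exist.
func lowestAdmittedPriority(admitted []int32) (int32, bool) {
	if len(admitted) == 0 {
		return 0, false
	}
	lowest := admitted[0]
	for _, p := range admitted[1:] {
		if p < lowest {
			lowest = p
		}
	}
	return lowest, true
}

// flavorAssignableWithBorrowing marks the flavor as usable (but borrowing) if
// the pending workload's priority is at least the lowest admitted priority,
// so the preemption algorithm can then consider victims from the whole cohort.
// Whether the comparison should be strict is an open question in the thread.
func flavorAssignableWithBorrowing(pendingPriority int32, admitted []int32) bool {
	lowest, ok := lowestAdmittedPriority(admitted)
	return ok && pendingPriority >= lowest
}
```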
Sounds good, let's document this case then.
Actually, to accommodate for that possible expansion, I renamed the enums a little, to include the fact that we are limiting preemption to situations of "reclaiming quota". Please see update.
Force-pushed from a70e7c0 to f4d43fb.
Looks good to me, I think this is ready to merge, please squash.
Force-pushed from f4d43fb to 06b802b.
squashed
/retest
/retest
/retest
/lgtm
What type of PR is this?
/kind documentation
What this PR does / why we need it:
So far, the changes describe the goals, non-goals, user stories and API.
Which issue(s) this PR fixes:
Part of #83
Special notes for your reviewer: