-
Notifications
You must be signed in to change notification settings - Fork 284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add enhancement for Workload preemption #410
Add enhancement for Workload preemption #410
Conversation
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
6e1c5c5
to
0b0c5f0
Compare
67a63bc
to
59e692f
Compare
be arbitrarily delayed. | ||
- While Pods terminate, a ClusterQueue's quota could be oversubscribed. | ||
|
||
Oversubscription could be mitigated by watching the running pods but this will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will we consider add a new condition type to batch job, like reclaimed, which indicates that all the pods owned by job is deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we could do something like that, but I would leave it as a follow up. It's kind of related to #78
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
730b787
to
fbc47fe
Compare
Sorting uses the following criteria: | ||
|
||
1. Flavors that don't borrow first. | ||
2. [New criteria] Highest priority first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If some workloads have high priority but need to borrow from others or preempt other low priority workload in the same cluster queue , will them need to wait all the workloads that don't need to borrow to be scheduled firstly? Will it lead to long-term starvation of high-priority workloads?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, first we prefer not to borrow (it's the semantics of min
quota).
If other ClusterQueues are not using all their quota, this high priority workload would be allowed to borrow.
Otherwise, it will start preempting lower priority workloads in its CQ, before allowing borrowing in the cohort.
So no, there wouldn't be starvation within the ClusterQueue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it will lead to performance loss. Low priority workloads(no need to borrow) will be scheduled first, then high priority will preempt it for clusterQueue is lack of resources. At the last, low priority workloads will be scheduled again with resource borrowing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the high priority workload fits with borrowing, it can still borrow, but in the next scheduling cycle.
If there is no capacity to borrow, it will preempt low priority workloads, and they wouldn't fit when requeued either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So no, there wouldn't be starvation within the ClusterQueue.
If we have several lower priority workloads which don't borrow resources and one hight priority workloads which needs to borrow resources. In my understanding, the hight priority workloads is starving, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In scheduling, we prefer the workloads wouldn't borrow resources, but we support preemption via priority, aren't they conflict with each other, that means lower workloads doesn't borrow resources might be scheduled at first, but then they will be preempted for higher priority ones?
How about only care about the priority in preemption, at least for startup. And leave the flavor assignment to #312, like MinimizeBorrowing. I think this is more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not exactly the same as your design, but can somehow mitigate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, first we prefer not to borrow (it's the semantics of min quota).
I assume that there are two workloads:
- The workload_a has higher priority, but it don't have enough free quota in its cluster queue, so it need to preempt some lower workloads to get resource.
- The workload_b has lower priority, but it has enough free quota in its cluster queue, so it won't need to borrow some quota.
@alculquicondor I think maybe in you design: workload_b will be scheduled firstly, right? If there are so many short-term workloads like workload_b, the workload_a will wait all the workloads like workload_b scheduled? right? Here may lead to some starvation that higher priority workload will wait for a long time?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have several lower priority workloads which don't borrow resources and one hight priority workloads which needs to borrow resources. In my understanding, the hight priority workloads is starving, no?
This is somewhat true. What you might be missing is that we are talking about heads of the ClusterQueues here.
So we have two heads of different ClusterQueues competing, but both have the highest priority within their ClusterQueue. Let's say A has lower priority than B. If B fits without borrowing, it should still be scheduled first, to satisfy the min quota of the ClusterQueue. That's the definition of min.
we prefer the workloads wouldn't borrow resources, but we support preemption via priority, aren't they conflict with each other
No. Preemption happens in two scenarios: (1) within a cohort, to recover the min quota that was borrowed to other ClusterQueues or (2) within a ClusterQueue, purely due to priority.
I think maybe in you design: workload_b will be scheduled firstly, right? If there are so many short-term workloads like workload_b, the workload_a will wait all the workloads like workload_b scheduled? right? Here may lead to some starvation that higher priority workload will wait for a long time?
Yes, this is working as intended. We should respect the min quota of the ClusterQueue for B first.
But workload_a could still preempt lower priority workloads within its ClusterQueue.
If the workload has `.spec.onPreemption=Requeue`, clear `.spec.admission`. | ||
The Workload will be requeued by the Workload event handler. | ||
|
||
The incoming Workload will be admitted once the changes in the victim Workloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The time between change the workload and the cache updated, where the incoming workload is placed? will it re-queue again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will be in the inadmissible stage. Every time a Workload deletion is observed, it will be in the active queue again.
I rewrote this paragraph for clarity.
// since the first time that the ClusterQueue has pending workloads that would | ||
// fit under the min quota. This time is present in the timestamp for the | ||
// PendingWorkloadsUnderMinQuota condition. | ||
// If null, it means to wait indefinitely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand correctly, shouldn't there be no wait here if null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that can be achieved with an explicit 0.
null basically means disable preemption (I updated the comment). An additional nice thing about this semantics is that it's backwards compatible
be arbitrarily delayed. | ||
- While Pods terminate, a ClusterQueue's quota could be oversubscribed. | ||
|
||
Oversubscription could be mitigated by watching the running pods but this will |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
Sorting uses the following criteria: | ||
|
||
1. Flavors that don't borrow first. | ||
2. [New criteria] Highest priority first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it will lead to performance loss. Low priority workloads(no need to borrow) will be scheduled first, then high priority will preempt it for clusterQueue is lack of resources. At the last, low priority workloads will be scheduled again with resource borrowing.
The algorithm is like follows: | ||
|
||
For each resource (or set of resources with the same flavors), evaluate | ||
flavors in the order established in the ClusterQueue: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to #312
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Linked in the paragraph below for readers convenience.
Workload preemption: | ||
- A field in the ClusterQueue to influence how long to wait before starting | ||
preemption. | ||
- A field in the Workload to determine whether to terminate or requeue upon preemption. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the usecases in which you think termination will be useful ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, I don't think this is necessary, at least not initially, we can start with always re-queueing the jobs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my question to the community.
One case where I'm almost certain that it would be useful is for interactive Jobs that @kannon92 is working on. But since this work is still going on, we can leave it as a follow up and just requeue for now.
Moved to Alternatives
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd like a separation between workloads that went over their allocated timelimit and workloads that are over their timelimit but allowed to run due to preemption.
I don't think it needs to be addressed here but that is why I created #405 to maybe discuss separately from preemption.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added some notes about reassigning flavors. And opened kubernetes/kubernetes#113221
workloads with higher priority and that Workloads are sent to ClusterQueues | ||
where most Workloads have the same priority. | ||
|
||
Additionally, the preemption is mostly a linear pass over the running workloads |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant to say 'the preemption algorithm". Updated.
workloads when: | ||
- ClusterQueues under their minimum quota need the resources that are currently | ||
lent to other ClusterQueues in the cohort. | ||
- Within a Clusterqueue, there are running Workloads with lower priority than |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Within a Clusterqueue, there are running Workloads with lower priority than | |
- Within a ClusterQueue, there are running Workloads with lower priority than |
Sorting uses the following criteria: | ||
|
||
1. Flavors that don't borrow first. | ||
2. [New criteria] Highest priority first. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, first we prefer not to borrow (it's the semantics of min quota).
I assume that there are two workloads:
- The workload_a has higher priority, but it don't have enough free quota in its cluster queue, so it need to preempt some lower workloads to get resource.
- The workload_b has lower priority, but it has enough free quota in its cluster queue, so it won't need to borrow some quota.
@alculquicondor I think maybe in you design: workload_b will be scheduled firstly, right? If there are so many short-term workloads like workload_b, the workload_a will wait all the workloads like workload_b scheduled? right? Here may lead to some starvation that higher priority workload will wait for a long time?
fbfc7d9
to
8a4786c
Compare
8a4786c
to
16e500c
Compare
/retest |
|
||
If this is a ClusterQueue with BestEffortFIFO, the Workload is put in the | ||
inadmissible staging area with a deadline equal to the remaining time to | ||
fullfil `waitBeforePreemptionSeconds`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If no existing workloads can be preempted to fit the incoming workload, then at the limit all workloads that belong to this CQ will always be in the active queue (pretty much converting it into a strict fifo).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is if the deadline is not satisfied. Once it's satisfied, we proceed with preemption and requeue as usual as already explained at the end of this section. Still I tried to rephrase a little.
16e500c
to
a6f33f1
Compare
/retest
|
// ClusterQueues, can preempt Workloads with higher priority in this | ||
// ClusterQueue if it's using more than its min quota. | ||
// Possible values are Allow and Disallow (default). | ||
ByLowerPriorityWhenBorrowing *PreemptionByLowerPriorityWhenBorrowingPolicy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strange name, any precedence in k8s to this naming format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so.
Maybe something like this?
preemption:
onLowPriorityReclaim: Preempt|Ignore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aldo and I spent some time talking about the behavior and the name of this field.
Which of the following is this field controlling:
- the policy this queue will use to preempt workloads running in other queues to recover quota borrowed from this queue
- the policy other queues will use to preempt workloads running in this queue to recover quota this queue borrowed from other queues
Option 1 was what I expected, since different queues in a cohort could have different opinions about how important it is to recover quota borrowed from them, and option 2 can't express that.
If this is for option 1, it seems like there are at least three possible preemption policies for recovering/reclaiming quota borrowed from this queue (having the behavior you get with a nil value for the policy be able to be represented as an explicit enum value is good):
Never
LowerPriority
Always
It's also unclear how this interacts with the WaitBeforePreemptionSeconds
field. Can a queue be configured to reclaim borrowed quota by preemption (either immediately or by priority) without being forced to enable preemption among workloads in the queue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something like this?
preemption: onLowPriorityReclaim: Preempt|Ignore
The "LowPriority" aspect seems like one of the possible policies, not something we should bake into the field name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is indeed option 1; I like the enum and we can perhaps name the field preemptionPolicy
or just policy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to:
preepmtion:
withinCohort: Never|FromLowerPriority|Always
withinClusterQueue: Never|FromLowerPriority
// in the transitionTimestamp of the Admitted condition. | ||
// If null, it means to wait indefinitely, effectively disabling preemption | ||
// for incoming Workloads in this ClusterQueue. | ||
WaitBeforePreemptionSeconds *int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jordan and I spend some time thinking about this field, beyond just the name.
Here is one scenario where things get confusing:
Setup:
- low priority stuff is running and consuming all resources, mode=StrictFIFO, WaitBeforePreemptionSeconds=2 seconds
- add medium priority workload b at t=0
- add high priority workload a at t=1
- one second passes (t=2)
- queue looks like high(workload-a(waiting 1 second)), medium(workload-b(waiting 2 seconds))
What should happen?
In best-effort mode:
- we would look at workload-a, skip preemption (because it hadn't been waiting long enough)
- continue to workload-b, trigger preemption of things matching b's resource requirements
- change in available resources (because of our preemption) would make us reevaluate the whole list
- if workload-a fit within the space made for workload-b by preemption, we would run it, otherwise, we'd run workload-b
In strict mode (currently):
- we would only look at workload-a, and not preempt anything because it had not been waiting long enough
- this means even though workload-b has been waiting longer than WaitBeforePreemptionSeconds, nothing triggers the queue to preempt to make progress until workload-a reaches that threshold
Naively, a user would expect once something in the queue had reached the wait period to consider preemption, we would start making room for the queue to progress to running that workload, even if that transitively meant making room for the workload at HEAD in strict mode.
I think we can make that change relatively easily by not just looking at the wait time for the head of the queue, but the longest wait time we have in the queue.
As for the field name, triggerAfterWorkloadWaitingSeconds
should do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am having second thoughts on this parameter. Without informing the victim job of the upcoming preemption, how is shifting the time we preempt going to be useful at the limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Going with the interpretation of "at the limit" as when all the ClusterQueues are full, thus we would require preemption in all scenarios.
The time is measured from the first time we attempt to admit a Workload.
In StrictFIFO (assuming no higher priority workloads jump to the head of the queue), this means that we still wait before preemption after each admission.
BestEffortFIFO is more complicated, because the head will keep changing as long as there are no other changes in the cluster (like workloads finishing). In an extreme case, we would go through the entire queue and start the timers for all the workloads. Once the timer is fulfilled, we will do preemptions for all the elements in the queue until they are all admitted. Arguably, this is less useful.
I guess we could park it and instead provide a knob to control whether to preempt within a ClusterQueue or not, alongside with the knob to preempt within the cohort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can park it for now. It feels like in most cases this will simply shift the start of the incoming job, which is equivalent to as if the incoming job was submitted WaitBeforePreemptionSeconds
later in most cases.
To do "cooperative" preemption, it is more useful to find ways to improve victim selection, probably via an explicit signal from the workloads (similar to preemption cost). For example, in the case of v1.Job and completions > parallelism, we can calculate how much progress it did by looking at success count and completions and prefer less the ones that are almost done?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved the field into the "Alternatives" section for now.
b2ec4c6
to
e6801eb
Compare
Some highlights: | ||
- A Workload could get flavor assignments at different steps for different | ||
resources. | ||
- Assigments that require preemption implicitly do not borrow quota. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is reasonable to support the case where a higher priority workload can preempt lower priority ones that are coming from any CQ with usage above its min. We can leave that as a followup, but I think we should make sure that the initial implementation doesn't block this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is reasonable to support the case where a higher priority workload can preempt lower priority ones that are coming from any CQ with usage above its min.
Isn't that what we are already proposing? Or do you mean that the CQ of the pending workload would go above its min by preempting other workloads on CQs that aren't above their min?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The statement "Assigments that require preemption implicitly do not borrow quota." implies that preemption is allowed only if the pending workload can fit under its CQ quota after preemption.
Consider the case where all admitted workloads in CQ1 are of equal or higher priority than the pending workload, and there are lower priority workloads from CQ2 that are admitted via borrowed quota from CQ3, then I am suggesting that the pending workload from CQ1 should preempt the lower priority ones from CQ2 to release the unused resources of CQ3.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha....
I think the implementation in this proposal can still be extended to support that:
we query what is the minimum priority among the admitted ClusterQueues in the flavor and see if it's higher than or equal to the pending Workload, then choose any flavor, but mark it as borrowing. Then run the preemption algorithm considering all workloads in the cohort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good, lets document this case then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, to accommodate for that possible expansion, I renamed the enums a little, to include the fact that we are limiting preemption to situations of "reclaiming quota". Please see update.
a70e7c0
to
f4d43fb
Compare
Looks good to me, I think this is ready to merge, please squash. |
f4d43fb
to
06b802b
Compare
squashed |
/retest |
2 similar comments
/retest |
/retest |
/lgtm |
What type of PR is this?
/kind documentation
What this PR does / why we need it:
So far, the changes describe the goals, non-goals, user stories and API.
Which issue(s) this PR fixes:
Part of #83
Special notes for your reviewer: