Propose binding priority and preemption #4993

Open
wants to merge 1 commit into
base: master
290 changes: 290 additions & 0 deletions docs/proposals/scheduling/binding-priority-preemption/README.md
@@ -0,0 +1,290 @@
---
title: Binding Priority and Preemption

authors:
- "@LeonZh0u"
- "@seanlaii"
- "@wengyao04"
- "@whitewindmills"
- "@zclyne"

reviewers:
- "@RainbowMango"
- "@XiShanYongYe-Chang"
- "@chaunceyjiang"

approvers:
- "@RainbowMango"

creation-date: 2024-05-27

notes: This proposal refers to ResourceBinding and ClusterResourceBinding collectively as binding.
---

# Binding Priority and Preemption

## Summary

Currently, the scheduler schedules workloads strictly in order: the workload that is triggered first is scheduled first.
When a new binding has scheduling requirements that make it infeasible on every member cluster,
it stays in the scheduling queue until sufficient resources are freed and it can be scheduled.

This proposal introduces priority and preemption for bindings in Karmada and describes how priority affects the scheduling and preemption of bindings
when member clusters run out of resources.
- The priority of a binding is independent of its workload type and must be one of the valid values; there is a total order on the values.
- Preemption is the action taken when an important binding requires resources or conditions that are not available in any member cluster. In that case, one or more bindings need to be preempted to make room for the important binding.

## Motivation

It is fairly common in large-scale clusters to have more tasks than Karmada can handle at once, so scheduling responses can become slow.
Often the workload is a mix of high priority critical tasks and non-urgent tasks that can wait.
Karmada should be able to distinguish these workloads in order to decide which ones should be scheduled sooner, which ones can wait, and which ones can be preempted.

### Goals

- Define how to specify priority and preemption strategy for bindings.
- Define how the order of these priorities is specified.
- Define how new priority levels are added.
- Define the effect of priority on scheduling and preemption.
- Define the concept of binding preemption in Karmada.
- Define scenarios under which a binding may get preempted.
- Define mechanics of preemption.

### Non-Goals

- How preemption works in spread constraints.

## Proposal

### Terminology

When a new binding has scheduling requirements that make it infeasible on every member cluster,
the scheduler may choose to clear the scheduling result of lower priority bindings to satisfy the scheduling requirements of the new binding.
We call this operation "binding preemption". Binding preemption is distinct from "[policy preemption](../policy-preemption/README.md)", where high-priority policies preempt low-priority policies.

### User Stories (Optional)

#### As a user, I want the task I submitted to be scheduled as soon as possible.

I submit a workload that is prioritized above other workloads, but do not wish to discard existing work by preempting scheduled bindings.
I hope that the high priority workload will be scheduled ahead of other queued bindings, as soon as sufficient cluster resources "naturally" become free.
Comment on lines +69 to +70 (Member), suggested change:

I usually classify my workloads based on priority and want high-priority workloads to be scheduled ahead of others, rather than waiting to be scheduled in the order they were queued. This will allow me to maintain the quality of service for important workloads even during resource contention.


![priority_queued](priority-queued.png)

#### As a user, I hope that GPU resources are preferentially provided to high-priority AI training tasks.

I submit an AI workload that should preferentially occupy the GPU resources of member clusters.
Karmada tries to find feasible clusters that can run the AI workload. If not enough member clusters are found,
I hope Karmada tries to remove bindings with lower priority from some member clusters in order to make room for the pending binding.

### Notes/Constraints/Caveats (Optional)

### Risks and Mitigations

Binding priority and preemption can have unwanted side effects. Here is an example of potential problems and ways to deal with them.

#### Bindings are preempted unnecessarily

Preemption removes existing bindings from some member clusters under resource pressure to make room for higher priority pending bindings.
If you give high priorities to certain bindings by mistake, these unintentionally high priority bindings may cause preemption in your multi-cluster system.

To address the problem, you can disable the preemption behavior for your bindings.

When a binding is preempted, there will be events recorded for the preempted binding.
Preemption should happen only when a member cluster does not have enough resources for a binding.
In such cases, preemption happens only when the priority of the pending binding (preemptor) is higher than the preempted bindings.
Preemption must not happen when there is no pending binding, or when the pending bindings have equal or lower priority than the preempted bindings.
If preemption happens in such scenarios, please feel free to file an issue.

## Design Details

### Effect of priority on scheduling

One can generally expect that a binding with higher priority has a higher chance of getting scheduled than the same binding with lower priority.
However, there are other parameters that affect scheduling decisions, so a high priority binding may or may not be scheduled before lower priority bindings.
For example, if a high priority binding is marked as unschedulable due to insufficient resources in member clusters and preemption is not enabled,
Karmada will go on to schedule other bindings with lower priority.

### Effect of priority on preemption

Generally, lower priority bindings are more likely to get preempted by higher priority bindings when member clusters are under resource pressure.
In such a case, the scheduler may decide to preempt lower priority bindings to release enough resources for higher priority pending bindings.
As mentioned before, there are other parameters that affect scheduling decisions, such as cluster affinity and spread constraints.
If the scheduler determines that a high priority binding cannot be scheduled even if lower priority bindings are preempted, it will not preempt lower priority bindings.

### New API

#### Priority Classes

This proposal will reuse the PriorityClass API of Kubernetes. The priority class defines the mapping between the priority name and its value.
It can have an optional description. It is an arbitrary string and is provided only as a guideline for users.

Similarly, we will follow the semantics of PriorityClass in Karmada. The following example gives a PriorityClass used by the system by default.
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: foo
globalDefault: true
preemptionPolicy: PreemptLowerPriority
value: 2000000000
```

**Important notes**: Only one PriorityClass should be marked as `globalDefault`. However, if more than one PriorityClass exists with its `globalDefault` field set to true, the
smallest value among those global default PriorityClasses will be used as the default priority.

### API change

#### ResourceBinding/ClusterResourceBinding API change

This proposal adds two new fields, `Priority` and `PreemptionBehavior`, that determine how a binding is prioritized and whether it may preempt other bindings.
By default, preemption is disabled.

```go
// ResourceBindingSpec represents the expectation of ResourceBinding.
type ResourceBindingSpec struct {
	// Priority is the priority value. The karmada-scheduler component uses this field to find the
	// priority of the binding.
	// The higher the value, the higher the priority.
	// +optional
	Priority *int32 `json:"priority,omitempty"`

	// PreemptionBehavior is the policy for preempting bindings with lower priority.
	// One of Never, PreemptLowerPriority.
	// Defaults to Never if unset.
	// +optional
	PreemptionBehavior PreemptionBehavior `json:"preemptionBehavior,omitempty"`

	// ... ...
}
```

#### PropagationPolicy/ClusterPropagationPolicy API change

User requests sent to Karmada may specify a PriorityClassName in their `Placement`.
Karmada resolves a PriorityClassName to its corresponding `Priority` and `PreemptionPolicy`, then populates the binding spec with them.
This proposal adds a new field, `PriorityClassName`, for specifying the PriorityClass.

```go
// Placement represents the rule for selecting clusters.
type Placement struct {
	// PriorityClassName indicates that bindings will use the named PriorityClass to resolve their priority and preemptionPolicy.
	// The name must be defined by creating a PriorityClass object with that name.
	// If not specified, the binding will use the global default PriorityClass first.
	// If no global default PriorityClass exists, the binding priority will be zero and the binding preemptionPolicy will be "Never".
	// +optional
	PriorityClassName string `json:"priorityClassName,omitempty"`

	// ... ...
}
```

Comment on lines +170 to +175

Member

1. I guess we can present a complete design in this proposal and note that currently we only implement xxx.
   I mean we can have a field PriorityClassSource; the valid options could be FederatedPriorityClass, KubePriorityClass, PodPriorityClass, defaulting to KubePriorityClass.
   Note that we might not add this field when implementing this feature at alpha maturity.

2. Another thing open for discussion is where the field should be placed: under spec.placement or spec.

Member Author

Sounds good. I know that KubePriorityClass should be priorityclasses.scheduling.k8s.io, but what's PodPriorityClass?

Member

The PriorityClassSource represents where to find the priority class. There will be 3 ways:

  • From the FederatedPriorityClass, which Karmada might introduce in the future.
  • From the Kubernetes PriorityClass
  • From the Pod template

So, PodPriorityClass means parsing the priority from the pod template.

Member Author

So, you mean, for CRDs, we still need a ResourceInterpreter interface to determine priority?

> From the Pod template

And this rule requires that the priority class must exist in the Karmada control plane. So in my opinion, it is not much different from KubePriorityClass, and the API becomes more difficult to understand.
I don't know if I understand correctly. If not, please correct me.

Member

> and this rule requires the priority class must exist in Karmada control plane.

I'm not sure where the existence of PriorityClass is validated in Kubernetes, but karmada-apiserver doesn't have such a restriction. People can apply a Deployment and specify a non-existent PriorityClass.

Note: The idea of FederatedPriorityClass and PodPriorityClass is still not mature; it is just a placeholder in this proposal, for extension purposes.

Member Author

> I'm not sure where validates the existence of PriorityClass in Kubernetes

Not K8s, I mean Karmada should do this. That's weird: Karmada can find the priority class name from the Pod template, but this priority class doesn't exist.

Member

The PriorityClass is a kind of cluster-level configuration. It is usually managed by the infra team, and it might be synced from a legacy metadata system. In other words, that configuration might not be propagated by Karmada.

By the way, Kubernetes doesn't require the PriorityClass to exist. I just ran a test with it, and it shows that everything works fine:

    -bash-5.0# karmadactl get deployments.apps --operation-scope=members --clusters=member1 nginx -o yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: nginx
      namespace: default
      resourceVersion: "1428701"
      uid: 0215af9b-d8e8-44e1-bda4-92f574be2cc4
    spec:
        # ...
        spec:
          containers:
          - image: nginx
          priorityClassName: horen  # non-existent PriorityClass

Member Author (@whitewindmills, Dec 26, 2024)

@RainbowMango first, sorry for the slow response.
I mean, Karmada should understand the priority class in order to apply it to the ResourceBinding.
As your example shows, the deployment nginx did work fine, but it cannot work together with ResourceBinding priority & preemption because the priority class horen doesn't exist in Karmada.
In plain words, when people set the field PriorityClassSource to PodPriorityClass, Karmada will parse the priority class name from the pod template, then find the priority class by name to apply it to the ResourceBinding.

### About priority classes

Priority classes can be added or removed, but we only allow updating `Description`.
While Karmada can work fine if priority classes are changed at run-time,
the change can be confusing to users as bindings with a priority class which were created before the change will have a different priority and preemptionPolicy than those created after the change.
Deletion of priority classes is allowed, despite the fact that there may be existing bindings that have specified such priority class names in their `placement`.
In other words, there will be no referential integrity for priority classes. This is another reason that all system components should only work with priority and preemptionPolicy and not with the PriorityClassName.

One could delete an existing priority class and create another one with the same name and different content.
By doing so, they can achieve the same effect as updating a priority class, but we still do not allow updating priority classes to prevent accidental changes.

### Preemption scenario

In this proposal, the only scenario under which a group of bindings in Karmada may be preempted is when a higher priority binding cannot be scheduled due to various unmet scheduling requirements,
such as lack of resources, spread constraints, etc., and the preemption of the lower priority bindings allows the higher priority binding to be scheduled.
So, if the preemption of the lower priority bindings does not help with scheduling of the higher priority binding, those lower priority bindings will keep running and the higher priority binding will stay pending.

**Important notes**:
- The scheduler will only consider bindings whose resource requirements are not empty as preemption candidates.
  From the scheduler's point of view, a binding with empty resource requirements needs no resources, so its preemption would not release any resources.
- Since the preemption feature is still in the experimental stage, we will not introduce preemption for ClusterResourceBinding at first, because ClusterResourceBindings are usually more sensitive cluster-level resources.

### Preemption order

When scheduling a pending binding, the scheduler tries to place the binding on a member cluster that does not require preemption.
If there is no such member cluster, the scheduler may favor a member cluster where the number and/or priority of victims (preempted bindings) is smallest.
After choosing the member cluster, the scheduler considers the lowest priority bindings for preemption first.
It starts from the lowest priority and considers just enough bindings for preemption to allow the pending binding to be scheduled.
The scheduler only considers bindings that have lower priority than the pending binding.

The scheduler will try to minimize the number of preempted bindings.
As a result, it may preempt a binding while leaving lower priority bindings running, if preempting those lower priority bindings is not enough to schedule the pending binding while preempting the higher priority binding(s) is.
For example, suppose the member cluster capacity is 10, the pending binding has priority 10 and requires 5 units of resource, and the running bindings are:
```json lines
{"priority": 0, "request": 1}
{"priority": 1, "request": 2}
{"priority": 2, "request": 5}
{"priority": 3, "request": 2}
```
In this case the scheduler will preempt only the priority 2 binding and leave the priority 1 and priority 0 bindings running, because preempting those two would free only 3 units.
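
The example above can be reproduced with a small brute-force sketch. It is not the scheduler's algorithm, only an illustration of the stated goal: pick the fewest victims with priority lower than the preemptor, and among equally sized sets prefer the one with the lowest total priority (that tie-break is an assumption). `Binding` and `selectVictims` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Binding is a hypothetical, simplified view of a running binding on one
// member cluster: only its priority and its resource request matter here.
type Binding struct {
	Priority int32
	Request  int64
}

// selectVictims returns the smallest set of bindings with priority lower than
// preemptorPriority whose combined requests cover need. Among sets of equal
// size it prefers the lowest total priority (an assumed tie-break). It returns
// nil if nothing needs to be freed or no such set exists. The search is
// exhaustive, so it only suits small inputs.
func selectVictims(running []Binding, preemptorPriority int32, need int64) []Binding {
	if need <= 0 {
		return nil
	}
	// Only bindings with strictly lower priority may be preempted.
	var candidates []Binding
	for _, b := range running {
		if b.Priority < preemptorPriority {
			candidates = append(candidates, b)
		}
	}

	bestMask, bestSize, bestPrio := -1, 0, int64(0)
	for mask := 1; mask < 1<<len(candidates); mask++ {
		var freed, prio int64
		for i, c := range candidates {
			if mask&(1<<i) != 0 {
				freed += c.Request
				prio += int64(c.Priority)
			}
		}
		if freed < need {
			continue // this subset does not free enough resources
		}
		size := bits.OnesCount(uint(mask))
		if bestMask == -1 || size < bestSize || (size == bestSize && prio < bestPrio) {
			bestMask, bestSize, bestPrio = mask, size, prio
		}
	}
	if bestMask == -1 {
		return nil
	}
	var victims []Binding
	for i, c := range candidates {
		if bestMask&(1<<i) != 0 {
			victims = append(victims, c)
		}
	}
	return victims
}

func main() {
	running := []Binding{{0, 1}, {1, 2}, {2, 5}, {3, 2}}
	// The cluster is full (1+2+5+2 = 10), so the pending binding needs 5 freed units.
	fmt.Println(selectVictims(running, 10, 5)) // prints [{2 5}]
}
```

A real implementation would have to avoid the exponential search and account for the other scheduling constraints mentioned earlier, such as cluster affinity and spread constraints.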

### Preemption in multi-scheduler

Karmada allows multiple schedulers to exist in its control plane. This introduces a race condition in which multiple schedulers may preempt each other's bindings in turn.
That is, scheduler A may schedule binding A, scheduler B preempts binding A to schedule binding B, which is then preempted by scheduler A to schedule binding A, and we go in a loop.

Therefore, we strongly recommend not enabling preemption behavior in multiple schedulers.
If you must enable it, make sure that the schedulers will not have any intersection during preemption.

### Flowchart of the new scheduling algorithm

![binding_priority_preemption](binding-priority-preemption.png)

1. Suppose the pending binding requires 2 units of resource per replica and has 2 replicas, but the scheduling result (`target clusters`) provides only 3 units of resource.
   The resource requirements gap is then `2 * 2 - 3 = 1`.
2. If a single binding cannot meet the resource requirements of the pending binding,
   the scheduler keeps accumulating the resource requirements of candidate bindings until the resource requirements of the pending binding are met.
3. If no single binding can meet the resource requirements of the pending binding, no preemption has been executed by the time the preemption loop ends.
   The scheduler then tries to preempt the accumulated bindings together to meet the resource requirements of the pending binding; if that is still not enough, preemption fails.
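
The arithmetic in these notes can be sketched as follows. This is a minimal illustration under simplifying assumptions (a single scalar resource, candidates already ordered for preemption); `resourceGap` and `accumulateVictims` are hypothetical names, not part of the proposal.

```go
package main

import "fmt"

// resourceGap returns how many resource units are still missing after the
// target clusters have provided `available` units to a binding that needs
// perReplica units for each of its replicas. It never returns a negative value.
func resourceGap(perReplica, replicas, available int64) int64 {
	gap := perReplica*replicas - available
	if gap < 0 {
		return 0
	}
	return gap
}

// accumulateVictims walks the candidate requests in the given order and sums
// them until the gap is covered. It returns how many candidates are needed and
// true on success, or 0 and false if all candidates together are not enough.
func accumulateVictims(requests []int64, gap int64) (int, bool) {
	if gap <= 0 {
		return 0, true // nothing to free
	}
	var freed int64
	for i, r := range requests {
		freed += r
		if freed >= gap {
			return i + 1, true
		}
	}
	return 0, false
}

func main() {
	// 2 units per replica, 2 replicas, 3 units available.
	gap := resourceGap(2, 2, 3)
	fmt.Println("gap:", gap) // prints gap: 1

	n, ok := accumulateVictims([]int64{1, 2}, gap)
	fmt.Println("victims needed:", n, "ok:", ok)
}
```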

### Feature gate

The binding preemption feature is experimental, so we will introduce the following feature gates, both of which are disabled by default.
- ResourceBindingPreemption: indicates whether a high-priority binding can preempt a low-priority binding; this is the global switch.
- CrossNamespaceResourceBindingPreemption: indicates whether a high-priority binding can preempt a low-priority binding across namespaces.
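
How these gates might combine with the priority rules from the "Risks and Mitigations" section can be sketched as a single eligibility check. The types and the `canPreempt` function below are hypothetical, and the sketch assumes that the cross-namespace gate has an effect only when the global gate is also enabled.

```go
package main

import "fmt"

// FeatureGates mirrors the two gates named above as plain booleans.
// This is an illustrative stand-in, not Karmada's feature gate machinery.
type FeatureGates struct {
	ResourceBindingPreemption               bool
	CrossNamespaceResourceBindingPreemption bool
}

// BindingInfo is a hypothetical summary of the fields relevant to preemption.
type BindingInfo struct {
	Namespace            string
	Priority             int32
	PreemptLowerPriority bool // true when the preemption behavior is PreemptLowerPriority
}

// canPreempt reports whether preemptor is allowed to preempt victim.
func canPreempt(g FeatureGates, preemptor, victim BindingInfo) bool {
	// The global gate and the preemptor's own behavior must both allow preemption.
	if !g.ResourceBindingPreemption || !preemptor.PreemptLowerPriority {
		return false
	}
	// Only strictly lower priority bindings may be preempted.
	if victim.Priority >= preemptor.Priority {
		return false
	}
	// Preemption across namespaces needs the second gate (assumption: on top of the first).
	if preemptor.Namespace != victim.Namespace && !g.CrossNamespaceResourceBindingPreemption {
		return false
	}
	return true
}

func main() {
	gates := FeatureGates{ResourceBindingPreemption: true}
	high := BindingInfo{Namespace: "ai", Priority: 10, PreemptLowerPriority: true}
	low := BindingInfo{Namespace: "ai", Priority: 1}
	other := BindingInfo{Namespace: "batch", Priority: 1}

	fmt.Println(canPreempt(gates, high, low))   // same namespace, lower priority
	fmt.Println(canPreempt(gates, high, other)) // different namespace, gate disabled
}
```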

### Components change

#### karmada-controller-manager

Currently, a binding is created or updated when resource templates are matched by a propagation policy.
When a binding is created or updated, the detector tries to find the priority class by `priorityClassName` to populate the binding spec.
The detector will populate the binding spec with the default priority and preemptionPolicy if the priority class is not found.

Changes to priority classes alone (made by deleting a class and adding it again with different content) will not trigger an update of existing bindings.
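
The resolution rules can be sketched as below. `PriorityClass` here is a trimmed-down stand-in for the Kubernetes `scheduling.k8s.io/v1` type rather than the real one, and `resolvePriority` is a hypothetical helper. The sketch assumes that a name that is set but not found falls back to the same defaults as an unset name: the global default class with the smallest value, and otherwise priority zero with policy "Never".

```go
package main

import "fmt"

// PreemptionPolicy mirrors the two policy values used in this proposal.
type PreemptionPolicy string

const (
	PreemptNever         PreemptionPolicy = "Never"
	PreemptLowerPriority PreemptionPolicy = "PreemptLowerPriority"
)

// PriorityClass is a simplified stand-in for the Kubernetes PriorityClass.
type PriorityClass struct {
	Name             string
	Value            int32
	GlobalDefault    bool
	PreemptionPolicy PreemptionPolicy
}

// resolvePriority maps a priority class name to the priority and preemption
// policy that would be written into the binding spec.
func resolvePriority(name string, classes []PriorityClass) (int32, PreemptionPolicy) {
	if name != "" {
		for _, pc := range classes {
			if pc.Name == name {
				return pc.Value, pc.PreemptionPolicy
			}
		}
	}
	// Fall back to the global default; if several exist, the smallest value wins.
	var def *PriorityClass
	for i := range classes {
		pc := &classes[i]
		if pc.GlobalDefault && (def == nil || pc.Value < def.Value) {
			def = pc
		}
	}
	if def != nil {
		return def.Value, def.PreemptionPolicy
	}
	// No global default: priority zero, never preempt.
	return 0, PreemptNever
}

func main() {
	classes := []PriorityClass{
		{Name: "foo", Value: 2000000000, GlobalDefault: true, PreemptionPolicy: PreemptLowerPriority},
		{Name: "batch", Value: 100, PreemptionPolicy: PreemptNever},
	}
	fmt.Println(resolvePriority("batch", classes))
	fmt.Println(resolvePriority("", classes))
	fmt.Println(resolvePriority("", nil))
}
```

Because the resolved values are copied into the binding spec, later changes to a priority class do not affect bindings that were already populated, which matches the lack of referential integrity described in "About priority classes".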

#### karmada-scheduler

Currently, the scheduler runs a single serial scheduling loop fed by a FIFO queue.
With this proposal, a priority scheduling queue will be implemented to replace the current work queue.
It should retain all the functionality of the current work queue and additionally sort items by priority.

In addition, if preemption behavior is enabled, the scheduler should identify the circumstances under which the pending binding should preempt other bindings
and calculate which bindings need to be preempted to meet the current resource requirements.
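
One way to picture the priority scheduling queue is a heap ordered by priority, with insertion order as the tie-break so that bindings of equal priority are still served first in, first out. The sketch below uses Go's standard `container/heap`; `SchedulingQueue` and its methods are hypothetical names, and whatever the current work queue provides beyond ordering is left out.

```go
package main

import (
	"container/heap"
	"fmt"
)

// item is one queued binding key together with its priority and arrival order.
type item struct {
	key      string
	priority int32
	seq      uint64
}

// priorityQueue implements heap.Interface: higher priority first, then FIFO.
type priorityQueue []*item

func (q priorityQueue) Len() int { return len(q) }
func (q priorityQueue) Less(i, j int) bool {
	if q[i].priority != q[j].priority {
		return q[i].priority > q[j].priority
	}
	return q[i].seq < q[j].seq
}
func (q priorityQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q *priorityQueue) Push(x any)   { *q = append(*q, x.(*item)) }
func (q *priorityQueue) Pop() any {
	old := *q
	n := len(old)
	it := old[n-1]
	old[n-1] = nil // drop the reference so it can be garbage collected
	*q = old[:n-1]
	return it
}

// SchedulingQueue is a minimal, non-thread-safe priority queue of binding keys.
type SchedulingQueue struct {
	items   priorityQueue
	nextSeq uint64
}

// Add enqueues a binding key with the given priority.
func (s *SchedulingQueue) Add(key string, priority int32) {
	heap.Push(&s.items, &item{key: key, priority: priority, seq: s.nextSeq})
	s.nextSeq++
}

// Get dequeues the highest priority key, or returns false if the queue is empty.
func (s *SchedulingQueue) Get() (string, bool) {
	if s.items.Len() == 0 {
		return "", false
	}
	return heap.Pop(&s.items).(*item).key, true
}

func main() {
	q := &SchedulingQueue{}
	q.Add("default/low", 0)
	q.Add("default/high-a", 10)
	q.Add("default/high-b", 10)
	for key, ok := q.Get(); ok; key, ok = q.Get() {
		fmt.Println(key)
	}
}
```

In this sketch the two priority 10 keys are dequeued in arrival order before the priority 0 key, so the existing FIFO behavior is preserved within one priority level.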

#### karmada-webhook

Since binding priority and preemption are strictly defined, the webhook should perform extra
validation to prevent misleading configurations.

### Test Plan

- All existing tests should pass; this feature should not introduce any breaking change.
- Add new E2E tests to cover the feature. The scope should include:
  * bindings are scheduled by priority.
  * preemption between high-priority bindings and low-priority bindings.
  * preemption is disabled.

## Alternatives