diff --git a/keps/582-preempt-based-on-flavor-order/README.md b/keps/582-preempt-based-on-flavor-order/README.md
new file mode 100644
index 0000000000..95bb5f00b7
--- /dev/null
+++ b/keps/582-preempt-based-on-flavor-order/README.md
@@ -0,0 +1,380 @@
+# KEP-582: Preempt Based On Flavor Order
+
+
+
+
+
+
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Cluster Queue API](#cluster-queue-api)
+  - [Behavior Changes](#behavior-changes)
+  - [Implementation](#implementation)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit Tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+  - [Graduation Criteria](#graduation-criteria)
+- [Implementation History](#implementation-history)
+
+
+## Summary
+
+
+This proposal introduces an opt-in mechanism to borrow quota or preempt workloads in a flavor
+before trying the next flavors in the ClusterQueue.
+
+## Motivation
+
+
+
+The order of ResourceFlavors within a ClusterQueue represents the preferred order of
+consumption. Jobs with higher priorities sometimes prefer to consume resources
+in preferred ResourceFlavors.
+
+### Goals
+
+
+- a mechanism that lets high-priority jobs preempt low-priority jobs in a flavor, or borrow, before considering the
+  next resource flavor when scheduling
+
+### Non-Goals
+
+- change how Kueue judges whether a podSet can get enough resources in a given resource flavor.
+- change the preemption and admission process.
+
+
+## Proposal
+
+
+
+### User Stories (Optional)
+
+
+
+#### Story 1
+
+As a Kueue administrator I want to ensure that more important jobs run on more
+stable resources.
This can happen when there are both normal and spot instances
+in my cluster. In this case I prefer that my high-priority jobs do not run on spot
+instances. If high-priority jobs can preempt jobs on standard instances before trying spot instances,
+stability can be achieved.
+
+My use case can be supported by setting `.Spec.FlavorFungibility.WhenCanPreempt` to `Preempt` in the ClusterQueue's spec.
+
+### Notes/Constraints/Caveats (Optional)
+
+
+
+### Risks and Mitigations
+
+
+
+## Design Details
+
+
+
+### Cluster Queue API
+
+We extend the ClusterQueue API with a new field, `flavorFungibility`, to opt in to and configure the new behavior.
+
+For each type of resource in each podSet, Kueue will traverse all resource groups and resource flavors to find an available flavor. When there are insufficient resources in a flavor, Kueue will decide whether to preempt or borrow in that flavor, or to try the next flavor, based on the configured policy.
+
+```
+// FlavorFungibilityPolicy names the action to take when a flavor does not fit outright.
+type FlavorFungibilityPolicy string
+
+const (
+	Borrow        FlavorFungibilityPolicy = "Borrow"
+	Preempt       FlavorFungibilityPolicy = "Preempt"
+	TryNextFlavor FlavorFungibilityPolicy = "TryNextFlavor"
+)
+
+type FlavorFungibility struct {
+	// +kubebuilder:validation:Enum={Borrow,TryNextFlavor}
+	WhenCanBorrow FlavorFungibilityPolicy `json:"whenCanBorrow"`
+	// +kubebuilder:validation:Enum={Preempt,TryNextFlavor}
+	WhenCanPreempt FlavorFungibilityPolicy `json:"whenCanPreempt"`
+}
+
+// ClusterQueueSpec defines the desired state of ClusterQueue
+type ClusterQueueSpec struct {
+	...
+	FlavorFungibility FlavorFungibility `json:"flavorFungibility"`
+}
+```
+
+If `flavorFungibility` is not set in the configuration, we will set `WhenCanBorrow` to `Borrow` and `WhenCanPreempt` to `TryNextFlavor` to maintain consistency with the current behavior.
+
+### Behavior Changes
+
+We will not change how Kueue judges whether a podSet can get enough resources in a given resource flavor. Preemption and admission are not affected either. We only change the order in which the flavors are considered.
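The defaulting rule stated above for `flavorFungibility` can be sketched in Go as follows. This is only an illustration under stated assumptions: the types are re-declared locally so the snippet stands alone, `withDefaults` is a hypothetical helper name that does not come from the Kueue code base, and an empty string is assumed to mean that the field was not set.

```go
package main

// FlavorFungibilityPolicy mirrors the policy type proposed in this KEP.
type FlavorFungibilityPolicy string

const (
	Borrow        FlavorFungibilityPolicy = "Borrow"
	Preempt       FlavorFungibilityPolicy = "Preempt"
	TryNextFlavor FlavorFungibilityPolicy = "TryNextFlavor"
)

// FlavorFungibility mirrors the struct proposed in this KEP (JSON tags omitted).
type FlavorFungibility struct {
	WhenCanBorrow  FlavorFungibilityPolicy
	WhenCanPreempt FlavorFungibilityPolicy
}

// withDefaults is a hypothetical helper, not part of Kueue. It fills every
// unset field (assumed here to be the empty string) with the value that keeps
// the behavior from before this KEP, and leaves explicit values untouched.
func withDefaults(ff FlavorFungibility) FlavorFungibility {
	if ff.WhenCanBorrow == "" {
		ff.WhenCanBorrow = Borrow
	}
	if ff.WhenCanPreempt == "" {
		ff.WhenCanPreempt = TryNextFlavor
	}
	return ff
}
```

With these defaults, Kueue borrows in the current flavor when borrowing is enough, and moves on to the next flavor when preemption would be required, which is the behavior the text describes as the current one.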
+
+After we try to schedule a podSet in a resource flavor, we decide whether to move on to the next flavor based on the `flavorFungibility`. If the assignment mode is `NoFit`, we will always try the next flavor until the last one. When the assignment mode is `Preempt`, we can return the current assignment if `WhenCanPreempt` is `Preempt`. Otherwise, if the assignment mode is `Fit`, we try the next flavor only when we need borrowing in the current flavor and `WhenCanBorrow` is `TryNextFlavor`.
+
+We will store the scheduling context in the workload info so that we can resume from where we stopped in previous scheduling attempts. This avoids wasting time in one flavor over and over if we try to preempt in that flavor and fail. The scheduling context will contain the `LastScheduledFlavorIdx`, the `ClusterQueueGeneration` attached to the CQ, and the `CohortGeneration`. Any change to the two generations will cause scheduling to restart from the first flavor.
+
+`ClusterQueueGeneration` and `CohortGeneration` record changes in the resource consumption of the CQs and Cohort. Any time the available resources of the CQs or Cohort increase, we will increase the generation, so if the generation in the scheduling context is lower, we should retry from the first flavor. Note that an increase that follows a decrease in available resources also bumps the generation, but this is acceptable since we save memory by storing only the generation instead of the usage state for each scheduling attempt.
+
+For example, suppose the cluster queue has 2 resource groups and the workload has 1 podSet, as follows:
+
+```
+...
+  - coveredResources: ["cpu", "memory"]
+    flavors:
+    - name: "default-flavor1"
+      resources:
+      - name: "cpu"
+        nominalQuota: 3
+      - name: "memory"
+        nominalQuota: 600Mi
+    - name: "default-flavor2"
+      resources:
+      - name: "cpu"
+        nominalQuota: 3
+      - name: "memory"
+        nominalQuota: 600Mi
+  - coveredResources: ["gpu"]
+    flavors:
+    - name: "vendor1"
+      resources:
+      - name: "gpu"
+        nominalQuota: 9
+    - name: "vendor2"
+      resources:
+      - name: "gpu"
+        nominalQuota: 9
+---
+...
+  podSets:
+  - count: 3
+    spec:
+      containers:
+      - ...
+        resources:
+          requests:
+            cpu: "1"
+            memory: 200Mi
+            gpu: 1
+```
+
+We will first try `default-flavor1` for the cpu and memory resources. If `default-flavor1` doesn't fit and `WhenCanPreempt` is `Preempt`, we try to preempt in `default-flavor1`. If we cannot find enough preemption candidates in `default-flavor1`, the workload will start from `default-flavor2` on the next scheduling attempt.
+
+### Implementation
+
+```
+func assignFlavors(log logr.Logger, requests []workload.PodSetResources, podSets []kueue.PodSet, resourceFlavors map[kueue.ResourceFlavorReference]*kueue.ResourceFlavor, cq *cache.ClusterQueue, lastAssignment *workload.AssigmentClusterQueueState) Assignment {
+	var assignment Assignment
+	if lastAssignment != nil {
+		// Resume from the state recorded by the previous scheduling attempt.
+		assignment = Assignment{
+			TotalBorrow: make(workload.FlavorResourceQuantities),
+			PodSets:     make([]PodSetAssignment, 0, len(requests)),
+			LastState:   *lastAssignment,
+			Usage:       make(workload.FlavorResourceQuantities),
+		}
+	} else {
+		// No previous state: start from the first flavor and record the current generations.
+		assignment = Assignment{
+			TotalBorrow: make(workload.FlavorResourceQuantities),
+			PodSets:     make([]PodSetAssignment, 0, len(requests)),
+			LastState: workload.AssigmentClusterQueueState{
+				LastAssignedFlavorIdx:  make([]map[corev1.ResourceName]int, 0),
+				CohortGeneration:       0,
+				ClusterQueueGeneration: cq.Generation,
+			},
+			Usage: make(workload.FlavorResourceQuantities),
+		}
+		if cq.Cohort != nil {
+			assignment.LastState.CohortGeneration = cq.Cohort.Generation
+		}
+	}
+	...
+}
+
+func shouldTryNextFlavor(representativeMode FlavorAssignmentMode, flavorFungibility v1beta1.FlavorFungibility, whetherNeedBorrowing bool) bool {
+	policyPreempt := flavorFungibility.WhenCanPreempt
+	policyBorrow := flavorFungibility.WhenCanBorrow
+	// Preemption is possible in this flavor and the policy allows it: stay here.
+	if representativeMode == Preempt && policyPreempt == v1beta1.Preempt {
+		return false
+	}
+
+	// The podSet fits with borrowing and the policy allows borrowing: stay here.
+	if representativeMode == Fit && whetherNeedBorrowing && policyBorrow == v1beta1.Borrow {
+		return false
+	}
+
+	// The podSet fits without borrowing: there is no reason to look further.
+	if representativeMode == Fit && !whetherNeedBorrowing {
+		return false
+	}
+
+	// NoFit, or the policy asks for the next flavor.
+	return true
+}
+```
+
+### Test Plan
+
+
+
+[Y] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+
+
+#### Unit Tests
+
+
+
+
+
+- `pkg/cache`: `2023-8-22` - `82.9%`
+- `pkg/scheduler`: `2023-8-22` - `80.7%`
+- `pkg/webhook`: `2023-8-22` - `71.2%`
+- `pkg/workload`: `2023-8-22` - `54.9%`
+
+#### Integration tests
+
+
+Scenarios in which `WhenCanBorrow` is set to `Borrow` and `WhenCanPreempt` is set to `TryNextFlavor` match the current behavior, so the added integration tests will cover these scenarios:
+
+- `WhenCanBorrow` is set to `TryNextFlavor`,
+- `WhenCanPreempt` is set to `Preempt`.
+
+### Graduation Criteria
+
+
+
+## Implementation History
+
+
diff --git a/keps/582-preempt-based-on-flavor-order/kep.yaml b/keps/582-preempt-based-on-flavor-order/kep.yaml
new file mode 100644
index 0000000000..74967e0773
--- /dev/null
+++ b/keps/582-preempt-based-on-flavor-order/kep.yaml
@@ -0,0 +1,31 @@
+title: Preempt Based On Flavor Order
+kep-number: 582
+authors:
+  - "@kunwuluan"
+status: provisional
+creation-date: 2023-05-17
+reviewers:
+  - "@alculquicondor"
+  - "@tenzen-y"
+approvers:
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: beta
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done.
This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: 0.5
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  beta: 0.5
+  stable:
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates: FlavorFungibility
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics: