KEP-79: Hierarchical Cohorts (#1531)

kubernetes-sigs · Feb 9, 2024 · 209c096 · 209c096
1 parent 73de800
commit 209c096
Show file tree

Hide file tree

Showing 2 changed files with 282 additions and 0 deletions.
diff --git a/keps/79-hierarchical-cohorts/README.md b/keps/79-hierarchical-cohorts/README.md
@@ -0,0 +1,249 @@
+# KEP-79: Hierarchical Cohorts
+
+<!-- toc -->
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Story 1](#story-1)
+    - [Story 2](#story-2)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Test Plan](#test-plan)
+      - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit Tests](#unit-tests)
+  - [Graduation Criteria](#graduation-criteria)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+<!-- /toc -->
+
+## Summary
+
+Introduce Cohort top-level object to allow setting multi-level quota 
+hierarchy, with advanced borrowing, lending mechanisms.
+
+## Motivation
+
+The current 2-level hierarchy (ClusterQueues and Cohorts) is not expressive
+enough to handle complex use cases of large organizations with tree-like
+team and quota/budget structures.
+
+### Goals
+
+* Create a multi-level hierarchy for advanced quota management. 
+* Be compatible with the existing ClusterQueue API and mechanics.
+* Allow setting constraints about borrowing and lending at all levels.
+* Provide quota for groups of Queues.
+
+### Non-Goals
+
+* Change the existing API and mechanics in a not backward-compatible way.
+* Introduce an alternative API to ClusterQueue.
+* Introduce new ways of fair sharing, like ratio-based sharing (at least 
+not in this KEP).
+* Introduce additional preemption models (this will be in a separate KEP).
+
+## Proposal
+
+Introduce a new object called Cohort with the similar quota provisioning mechanism
+as ClusterQueue. Cohorts additionally may specify its parent, another Cohort,
+forming together a tree-like organization structure. ClusterQueues will still be able to specify
+the Cohort they belong to. The Cohort mentioned by ClusterQueue doesn't require
+an actual object to be present. If such is not provided, it is understood 
+that the Cohort doesn't provide any quota, has no parent, doesn't belong to any bigger
+structure or has any non-default settings.
+
+The difference between ClusterQueue and Cohort will be that:
+
+* ClusterQueues are leaves in the organization tree, Cohorts are inner nodes. 
+* Cohort doesn't accept any workloads.
+* Nominal quota provided at the Cohort level is to be shared with the entire organization and doesn't 
+  have an owning ClusterQueue.
+* Borrowing limit specified at the Cohort level means that the entire subtree cannot borrow more 
+  from the rest of the organization tree than the given value.
+* Lending limit specified at the Cohort levels means that the rest of the organization tree
+  cannot borrow more from the subtree than the given value.
+
+Preemmptions and resoruce reclamation will happen among the whole cohort structure,
+in the similar fashion as they are executed now.
+
+### User Stories (Optional)
+
+#### Story 1
+
+I have two multi-team organizations in the company. One that does research and one that runs production
+workloads. Both are given some quota that is further distributed among the subteams. I want to grant 
+the production workloads the ability to borrow research quota if needed, but not the other way round.
+
+With this proposal, research org's top Cohort will simply set borrowingLimit to 0. Alternatively, production
+org's top Cohort can set lendingLimit to 0. BorrowingLimitis inside production org's ClusterQueues should
+be generous enough to allow borrowing from the research org.
+
+#### Story 2
+
+I have a couple organizations that have dedicated resources. The organizations should not borrow 
+from each other, however I want to have an additional "special" queue, with low priority jobs, that can 
+borrow unused capacity from any of the organizations.
+
+With this proposal, the cohorts for organizations will set borrowingLimit to 0. Top level Cohort will 
+contain all of these Cohorts, plus the "special" ClusterQueue, with borrowingLimit set to infinity. 
+
+### Risks and Mitigations
+
+* Users may create a cycle in the Cohort hierarchy - Kueue will stop all new admissions within
+the entire tree. The already admitted workloads will be allowed to continue. Appropriate 
+ClusterQueue/Cohort Status Conditions will be set and Events emited.
+
+* Scheduling and preemption may require more computation/resources.
+
+## Design Details
+
+The Cohort API will initially start only with the basic functionality. Additional policies
+regarding sharing Cohort resources can be added later.
+
+```go
+
+type Cohort struct {
+    metav1.TypeMeta   `json:",inline"`
+    metav1.ObjectMeta `json:"metadata,omitempty"`
+
+    Spec   CohortSpec   `json:"spec,omitempty"`
+    Status CohortStatus `json:"status,omitempty"`
+}
+
+type CohortSpec struct {
+    // Cohort parent name. The parent Cohort object doesn't have to exist.
+    // In such case, it is assumed that parent simply doesn't have any
+    // quota and limits and doesn't have any other custom settings.
+    Parent *string `json:"parent,omitempty"`
+
+    // resourceGroups describes groups of resources that the Cohort can
+    // share with ClusterQueues within the same group of Cohorts/ClusterQueues.
+    // Each resource group defines the list of resources and a list of flavors
+    // that provide quotas for these resources.
+    // Each resource and each flavor can only form part of one resource group.
+    // resourceGroups can be up to 16.
+    //
+    // BorrowingLimit specifies how much ClusterQueues under this Cohort can borrow
+    // from ClusterQueues/Cohorts that are NOT under this Cohort. For Cohorts without
+    // a parent (top of the hierarchy) the BorrowingLimit has to be 0.
+    //
+    // LendingLimit specifies how much ClusterQueues that are NOT under this Cohort
+    // can borrow from the ClusterQueues/Cohorts that are under this Cohort.
+    // If any of the Limits is not specified it means that there is no limit
+    // and ClusterQueues can borrow/lend as much as they want/have.
+    // 
+    // +listType=atomic
+    // +kubebuilder:validation:MaxItems=16
+    ResourceGroups []ResourceGroup `json:"resourceGroups,omitempty"`
+}
+
+const (
+    // Condition indicating that a Cohort is correctly configured (for example, there is no cycle).
+    CohortActive = "CohortActive"
+)
+
+// Status of the Cohort. May be empty if Cohort support is not enabled in alpha.
+// Status and stats may not cover the entire subtree, as the number of needed updates
+// per workload admission may be to high.
+type CohortStatus struct {
+    // conditions hold the latest available observations of the Conditions
+    // current state.
+    // +optional
+    // +listType=map
+    // +listMapKey=type
+    // +patchStrategy=merge
+    // +patchMergeKey=type
+    Conditions []metav1.Condition `json:"conditions,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
+
+    // Additional stats may be added in the future, like the number 
+    // of admitted workloads, their usage etc, based on the user feedback.
+}
+```
+
+Currently, with 2-level hierarchy for each Cohort and ClusterQueue, Kueue
+makes sure that the following balances are kept:
+
+* ClusterQueues don't use more resources than they have and could possibly borrow.
+* Within a Cohort, the total amount of requested capacity doesn't exceed the total quota
+from all ClusterQueues, constrained by LendingLimit.
+
+Admission of a new workload can happen if both balances are kept after 
+adding the new workload. Kueue doesn't track who is borrowing/lending from who. It
+is enough that balances are kept and with good balances, there exists such borrower-lending 
+mapping that fulfills all needs. 
+
+With Hierarchical Cohorts, Kueue will be checking the whole Cohort subtree whether the
+correct balances are kept. To be more precise what it means
+let's define a function `T(x,r)` that takes either ClusterQueue x 
+or Cohort x and resource r (from a specific resource flavor). 
+
+`T(x, r)` returns the amount of resource r that is avaialble at the level of x from ClusterQueues
+and Cohorts that are either x or children of x (possibly indirect). In other words, how much of resource r can come from 
+the subtree. The value may be negative, what means that the subtree is borrowing from the outside of the subtree (the rest 
+of the hierarchy)
+
+`T(x,r)` can be relatively easily calculated while traversing the Cohort tree.
+
+* `T(x,r)` when x is a ClusterQueue:
+$$T(x,r) = quota(x,r) - usage(x,r)$$
+
+* `T(x,r)` when x is a Cohort:
+$$T(x,r) = quota(x,r) + \sum_{c \in children(x)} min(lendingLimit(c,r), T(c,r))$$
+
+Obviously, with the correct admission process, for any x and r, `T(x,r)  >=  -borrowingLimit(x,r)`
+Otherwise there would be too big debt at level x - some subtree is requesting more than allowed.
+
+Slightly less obvious, but also true is: **If there is no too big debt at any level then the admission is correct**.
+
+Negative `T(x,r)` presents the total amount of resources that a subtree is borrowing. Positive `T(x,r)` presents the total 
+amount of resources that a subtree can deliver (with respect of the `lendingLimit`). At the very top of hierarchy `T(x,r) >=0` 
+(`borrowingLimit` is there 0 since there is no-one to borrow from). `T(x,r)>=0` can occur also within the hierarchy.
+
+`T(x,r)>=0` means that at the level of x, the negative balance of all subtrees can be evened-out by
+other subtrees that have some extra capacity, with respect to their lendingLimit. Extra capacity can be 
+"passed" to the needing subtrees. Then, after this passing, the previously negative subtree becomes positive, and 
+we can re-apply the logic there. All negative sub-sub-trees can be balanced out by positive subtrees and the capacity
+that coming from "above". And so on and so on, up to reaching individual ClusterQueues.
+
+So a new workload can be admitted to a ClusterQueue if and only if, after admission, `T(x,r)  >=  -borrowingLimit(x,r)` 
+stays true at all elements of the hierarchy.
+
+### Test Plan
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+#### Unit Tests
+
+As the hierarchical cohorts reside entirely inside Kueue, most of the 
+tests will be done as unit and integration tests, checking things like:
+
+* Existing functionality at 2-levels.
+* Long-distance borrowing on multi-level hierarchy.
+* Lending/borrowing limits placed on many levels.
+* Preemptions across hierarchy.
+
+### Graduation Criteria
+
+This is an a core API element and will graduate together with the other core APIs.
+
+
+## Implementation History
+
+* 2023.12.29 - KEP - API and semantics.
+
+## Drawbacks
+
+It makes the scheduling even more complex and computation heavy. With complex limits 
+and quotas it may be hard for users to keep them under control.
+
+## Alternatives
+
+* https://github.com/kubernetes-sigs/kueue/pull/1093 - Hierarchical ClusterQueues. 
diff --git a/keps/79-hierarchical-cohorts/kep.yaml b/keps/79-hierarchical-cohorts/kep.yaml
@@ -0,0 +1,33 @@
+title: 
+kep-number: 79
+authors:
+  - "@mwielgus"
+status: provisional
+creation-date: 2023-12-28
+reviewers:
+  - "@tenzen-y "
+  - "@alculquicondor"
+approvers:
+  - "@alculquicondor"
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: beta
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v0.7"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v0.7"
+  beta: "v0.8"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+disable-supported: true
+
+# The following PRR answers are required at beta release
+# metrics:
+#   - my_feature_metric
+