From df942ce7a5244a109585a31427845dac4bc9733a Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Mon, 11 Aug 2025 14:44:54 -0700 Subject: [PATCH 01/18] Implement Extended Toleration Operators KEP Signed-off-by: Heba Elayoty --- .../README.md | 1125 +++++++++++++++++ .../5471-enable-sla-based-scheduling/kep.yaml | 43 + 2 files changed, 1168 insertions(+) create mode 100644 keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md create mode 100644 keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md new file mode 100644 index 00000000000..ed0dabd4639 --- /dev/null +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -0,0 +1,1125 @@ +# KEP-5471: Extended Toleration Operators for Threshold-Based Placement + + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes) + - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos) + - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability) + - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) + - [Scheduler Performance Regression](#scheduler-performance-regression) + - [User Confusion 
Between String and Numeric Semantics](#user-confusion-between-string-and-numeric-semantics) + - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) + - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) + - [Cross-SIG Impact](#cross-sig-impact) +- [Design Details](#design-details) + - [API Changes](#api-changes) + - [Semantics](#semantics) + - [Implementation](#implementation) + - [Feature Gate Definition](#feature-gate-definition) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. 
+ +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [ ] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +## Summary + +Extend **core/v1 Toleration** to support **numeric comparison operators** when matching **Node Taints**: + +- New operators: `Lt`, `Le`, `Ge`, `Gt` (in addition to existing `Equal`/`Exists`). +- Primary motivation: allow pods to opt‑in to nodes by `SLA/failure‑probability` values published as taints (e.g., `node.kubernetes.io/sla=950`). +- Scheduler impact is limited to the existing TaintToleration Filter; no new stages or algorithms. 
+ +This preserves the well‑understood safety model of taints/tolerations (eviction via `NoExecute`) while enabling threshold‑based placement similar to numeric NodeAffinity, but with better operational semantics. + +## Motivation + +Many clusters blend (**on‑demand/higher‑SLA**) and (**spot-preemptible/lower‑SLA**) nodes. Platform teams want a safe default keeping most workloads away from risky capacity, while allowing specific workloads to opt‑in with explicit thresholds like `SLA ≥ 95%`. + +### Why not NodeAffinity alone? + +For the “node SLA / failure‑probability” use‑case, NodeAffinity can express minimum or exact SLA thresholds via label comparisons, but it’s not sufficient for the operational goals here: + +- **Policy orientation:** NodeAffinity is per‑pod; to keep most pods away from low‑SLA nodes you'd have to edit every workload. +- **Taints invert control**: nodes declare risk; only pods with a matching toleration may land. +- **Eviction semantics:** Affinity has no eviction. Taints support `NoExecute` with `tolerationSeconds`, letting operators drain/evict pods when a node's SLA class drops or a spot reclaim hits. +- **Operational ergonomics:** Centralized, node‑side policy is consistent with other safety taints (e.g., disk-pressure, memory-pressure). Teams opt‑in, reducing config drift. + +From a scheduling perspective, adding numeric operators to tolerations only adjusts match logic. It does not change queueing, scoring, or preemption algorithms. + +### Goals + +- Add comparison operators to Tolerations so pods can match taints like `node.kubernetes.io/sla=` using thresholds. +- Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`). +- Remain backward compatible and opt‑in via a feature gate. + +### Non-Goals + +- Standardizing an SLA key or unit (clusters may choose any integer scale, e.g., 950 for 95.0%). +- Implementing workload‑level "70/30" mix semantics. +- Changing NodeAffinity behavior. 
+ +### Benefits for implementing this feature for DRA and AI Workloads + +In addition to general scheduling improvements, SLA‑aware opt‑in via Tolerations has specific advantages for `Dynamic Resource Allocation` (DRA) and `AI/ML`: + +For DRA, resource claims (e.g., GPUs/accelerators) can be steered by node reliability: critical claims stay on high‑SLA capacity; batch/cheap claims can land on lower‑SLA pools. Taints provide a default drive away from risky pools and `NoExecute` eviction if a pool degrades. + +For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch preprocessing, fine‑tuning, or embedding generation to spot nodes. When spot nodes are reclaimed, `NoExecute` or `NoSchedule` effects plus tolerations allow graceful drain and controlled failover. In multi‑tenant GPU clusters, taints bound access to the reliable pools (fairness), and during autoscaling bursts, extra replicas can safely land on low‑SLA pools with explicit opt‑in. + +| Benefit | Impact on DRA | Impact on AI/ML workloads | +| --------------------------------- | --------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | +| **Cost–reliability optimization** | Bind/keep claims on reliability tiers via taints (+ tolerations to opt-in). | Keep latency-critical inference on high-SLA; shift batch to spot. | +| **Stage-aware placement** | Steer per-stage claims to tiers consistently with node policy. | Different stages tolerate different risk; make that explicit via tolerations. | +| **Resilience after preemption** | Use `NoExecute`/`tolerationSeconds` for graceful drain; re-admit on stable tiers. | Training/services recover faster with predictable eviction semantics. | +| **Multi-tenant fairness** | Avoid monopolization of high-SLA tiers by requiring explicit tolerations. | Fair access to reliable accelerators across teams. 
| +| **Smooth burst handling** | Bursts land on low-SLA pools via opt-in; baseline remains on high-SLA. | HPA can scale to spot with clear safety boundaries. | +| **Operational clarity** | Node-side policy is auditable and centralized. | Platform teams can document and enforce reliability classes cleanly. | + +## Proposal + +### User Stories (Optional) + +#### Story 1 — Cluster operator using mixed on-demand and spot nodes + +As a cluster operator, I want a default repel from spot (low-SLA) nodes so that only workloads that explicitly tolerate them can land there. + +I also want to set numeric SLA thresholds in tolerations (e.g., `Ge 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules. + +**Example Configuration:** + +```yaml +# Spot nodes with 80% SLA get a repelling taint +apiVersion: v1 +kind: Node +metadata: + name: spot-node-1 +spec: + taints: + - key: node.kubernetes.io/sla + value: "800" + effect: NoSchedule +--- +# Cost-optimized workload explicitly tolerates SLA >= 750 +apiVersion: v1 +kind: Pod +spec: + tolerations: + - key: node.kubernetes.io/sla + operator: Ge + value: "750" + effect: NoSchedule +``` + +#### Story 2 — AI inference service with strict SLOs + +As an AI platform engineer, I want to ensure my latency-critical inference pods only run on nodes with SLA ≥ 95%, and I want them to be evicted if the node's SLA rating drops below that threshold. + +Taints and tolerations with numeric comparisons give me this eviction capability, which NodeAffinity cannot provide. 
+ +**Example Configuration:** + +```yaml +# High-SLA on-demand node +apiVersion: v1 +kind: Node +metadata: + name: ondemand-node-1 +spec: + taints: + - key: node.kubernetes.io/sla + value: "950" + effect: NoExecute +--- +# Inference service requires SLA >= 950 with 30s grace period +apiVersion: apps/v1 +kind: Deployment +metadata: + name: inference-service +spec: + template: + spec: + tolerations: + - key: node.kubernetes.io/sla + operator: Ge + value: "950" + effect: NoExecute + tolerationSeconds: 30 +``` + +#### Story 3 — AI training workload balancing cost and reliability + +As an ML engineer running large distributed training, I want to run most worker pods on cheaper spot GPU nodes, but keep certain roles (e.g., parameter servers, checkpoint writers) on SLA ≥ 99.9% on-demand GPUs. + +With numeric tolerations, I can opt-in only the pods that are safe to run on spot, while letting the cluster's default taints repel all others. + +**Example Configuration:** + +```yaml +# Parameter server requires ultra-high reliability +apiVersion: v1 +kind: Pod +metadata: + name: parameter-server +spec: + tolerations: + - key: node.kubernetes.io/sla + operator: Ge + value: "999" # 99.9% SLA + effect: NoSchedule + containers: + - name: ps + resources: + requests: + nvidia.com/gpu: 1 +--- +# Training workers can tolerate spot nodes +apiVersion: v1 +kind: Pod +metadata: + name: training-worker +spec: + tolerations: + - key: node.kubernetes.io/sla + operator: Ge + value: "800" # 80% SLA acceptable + effect: NoSchedule + containers: + - name: worker + resources: + requests: + nvidia.com/gpu: 4 +``` + +#### Story 4 — DRA GPU claim management + +As a DRA driver implementer, I want to combine device resource claims with node SLA constraints so that GPU claims can only bind to nodes meeting a minimum reliability, unless the workload explicitly tolerates lower values. + +This ensures DRA allocations are both resource-correct and reliability-compliant. 
+ +**Example Configuration:** + +```yaml +# DRA claim with SLA constraints +apiVersion: resource.k8s.io/v1alpha4 +kind: ResourceClaim +metadata: + name: gpu-claim-high-sla +spec: + devices: + requests: + - name: gpu + deviceClassName: nvidia-a100 +--- +# Pod using DRA claim with SLA requirements +apiVersion: v1 +kind: Pod +metadata: + name: dra-workload +spec: + resourceClaims: + - name: gpu-claim + resourceClaimName: gpu-claim-high-sla + tolerations: + - key: node.kubernetes.io/sla + operator: Ge + value: "950" # Ensure GPU nodes meet SLA requirements + effect: NoSchedule + containers: + - name: ml-workload + resources: + claims: + - name: gpu-claim +``` + +### Notes/Constraints/Caveats (Optional) + +- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Decimal values (e.g., `"95.5"`) will be rejected by API validation when using numeric operators. + +- **Parsing Requirements**: Both taint value and toleration value must be parseable as integers for numeric operators (`Lt`, `Le`, `Ge`, `Gt`). If either fails parsing, the toleration does not match. + +- **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators. + +- **Strict Validation**: Unlike existing `Equal`/`Exists` operators which accept any string values, numeric operators require valid integer strings. This may catch existing invalid configurations. + +- **No Implicit Conversion**: Values like `"0950"` vs `"950"` are numerically equal but may confuse users expecting string matching behavior. + +- **Parsing Overhead**: Each taint/toleration match with numeric operators requires integer parsing. + +### Risks and Mitigations + +#### Scheduler Performance Regression + +**Risk**: Integer parsing during taint/toleration matching could degrade scheduler performance, especially in clusters with thousands of taints. 
+ +**Mitigation**: + +- Parse integers only when new operators are used (no impact on existing workloads) +- Implement microbenchmarks during development to measure parsing overhead +- Consider caching parsed values in scheduler data structures if performance issues arise +- Feature gate allows disabling if performance problems occur + +#### User Confusion Between String and Numeric Semantics + +**Risk**: Users might expect numeric comparison with `Equal` operator or string comparison with `Ge` operator, leading to mismatched tolerations. + +**Mitigation**: + +- Clear documentation distinguishing string vs. numeric operators +- API validation provides specific error messages for malformed numeric values +- Examples in documentation show proper usage patterns +- Consider adding warnings/events when numeric values are used with string operators + +#### API Compatibility and Version Skew + +**Risk**: Pods using new operators cannot be scheduled if some schedulers don't support the feature, creating deployment failures during upgrades. + +**Mitigation**: + +- Feature gate prevents usage until all components are upgraded +- Clear upgrade documentation specifying component upgrade order +- Backward compatibility testing ensures existing workloads continue functioning +- Gradual rollout recommendations for production clusters + +#### Edge Cases in Numeric Parsing + +**Risk**: Unexpected behavior with edge cases like integer overflow, leading zeros, or malformed input could cause scheduling failures. 
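To make these edge cases concrete, the following is a minimal sketch of how Go's `strconv.ParseInt` (base 10, 64-bit) treats the inputs named above. The `classify` helper and its result labels are hypothetical names used only for illustration; they are not part of the proposed API.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// classify parses s as a base-10 signed 64-bit integer and labels the outcome.
// Hypothetical helper for illustration only.
func classify(s string) string {
	n, err := strconv.ParseInt(s, 10, 64)
	switch {
	case err == nil:
		return fmt.Sprintf("ok:%d", n)
	case errors.Is(err, strconv.ErrRange):
		// The value does not fit in an int64 (overflow or underflow).
		return "out-of-range"
	default:
		// Anything else is malformed input: decimals, whitespace, empty string.
		return "syntax-error"
	}
}

func main() {
	inputs := []string{"950", "0950", "-5", "95.5", " 950", "", "9223372036854775808"}
	for _, in := range inputs {
		fmt.Printf("%q -> %s\n", in, classify(in))
	}
}
```

Leading zeros parse to the same number (`"0950"` and `"950"` are numerically equal), while decimals, surrounding whitespace, and values beyond the int64 range all surface as errors rather than being silently coerced.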
+ +**Mitigation**: + +- Use Go's standard `strconv.ParseInt()` with well-defined error handling +- Comprehensive unit tests covering edge cases (overflow, underflow, malformed strings) +- API validation rejects pods with unparseable values rather than silently failing +- Clear error messages help users identify and fix configuration issues + +#### Cross-SIG Impact + +- SIG-Node +- SIG-Apps +- SIG-Cluster-Lifecycle +- WG-Node-Lifecycle +- WG-Device-Management + +## Design Details + +### API Changes + +**File**: `staging/src/k8s.io/api/core/v1/types.go` + +Extend `core/v1.Toleration.Operator` to accept, in addition to `Equal` and `Exists`: + +- `Lt`: match if toleration.value < taint.value +- `Le`: match if toleration.value <= taint.value +- `Ge`: match if toleration.value >= taint.value +- `Gt`: match if toleration.value > taint.value +- `Equal`/`Exists`: Remain unchanged + +```go +// TolerationOperator is the set of operators that can be used in a toleration. +type TolerationOperator string + +const ( + TolerationOpEqual TolerationOperator = "Equal" + TolerationOpExists TolerationOperator = "Exists" + + // New numeric comparison operators (feature-gated) + TolerationOpLt TolerationOperator = "Lt" // Less than + TolerationOpLe TolerationOperator = "Le" // Less than or equal + TolerationOpGe TolerationOperator = "Ge" // Greater than or equal + TolerationOpGt TolerationOperator = "Gt" // Greater than +) +``` + +### Semantics + +Kubernetes APIs avoid floating-point numbers where possible because of precision and parsing issues, so the new toleration operators compare integer values only (e.g., 950 = 95.0%, 999 = 99.9%, 800 = 80.0%). + +### Implementation + +#### Feature Gate Definition + +**File**: `pkg/features/kube_features.go` + +```go +const ( + // TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Le, Ge, Gt) for tolerations + TaintTolerationComparisonOperators featuregate.Feature = "TaintTolerationComparisonOperators" +) + +var 
defaultKubernetesFeatureGates = map[featuregate.Feature]featuregate.FeatureSpec{ + TaintTolerationComparisonOperators: {Default: false, PreRelease: featuregate.Alpha}, +} +``` + +**1. API Validation** - `pkg/apis/core/validation/validation.go` + +```go +func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) field.ErrorList { + allErrors := field.ErrorList{} + + for i, toleration := range tolerations { + idxPath := fldPath.Index(i) + + // Existing validation... + + // New: Validate numeric operators (feature-gated) + switch toleration.Operator { + case core.TolerationOpLt, core.TolerationOpLe, core.TolerationOpGe, core.TolerationOpGt: + if !utilfeature.DefaultFeatureGate.Enabled(features.TaintTolerationComparisonOperators) { + allErrors = append(allErrors, field.Invalid(idxPath.Child("operator"), + toleration.Operator, "numeric operators require TaintTolerationComparisonOperators feature gate")) + continue + } + + // Validate value is parseable as int64 + if _, err := strconv.ParseInt(toleration.Value, 10, 64); err != nil { + allErrors = append(allErrors, field.Invalid(idxPath.Child("value"), + toleration.Value, "value must be a valid integer for numeric operators")) + } + } + } + return allErrors +} +``` + +**2. Scheduler Logic** - `staging/src/k8s.io/component-helpers/scheduling/corev1/helpers.go` + +```go +// ToleratesTaint checks if the toleration tolerates the taint. +func (t *Toleration) ToleratesTaint(taint *Taint) bool { + // Existing key and effect matching logic... + + switch t.Operator { + // ... 
+ case TolerationOpLt, TolerationOpLe, TolerationOpGe, TolerationOpGt: + // Feature gate check is not needed here as validation already handles it + return compareNumericValues(t.Value, taint.Value, t.Operator) + default: + return false + } +} + +func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator) bool { + tVal, tErr := strconv.ParseInt(tolerationVal, 10, 64) + if tErr != nil { + return false // Invalid toleration value + } + + nVal, nErr := strconv.ParseInt(taintVal, 10, 64) + if nErr != nil { + return false // Invalid taint value + } + + switch op { + case TolerationOpLt: + return tVal < nVal + case TolerationOpLe: + return tVal <= nVal + case TolerationOpGe: + return tVal >= nVal + case TolerationOpGt: + return tVal > nVal + default: + return false + } +} +``` + +### Test Plan + +[x] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. 
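As a starting point for the unit tests, the comparison logic above can be exercised in a standalone program. The sketch below is a minimal re-implementation under stated assumptions: the `Operator` type and the `matches` helper are hypothetical names, not the proposed helpers, and the comparison keeps the direction of the snippet above (toleration value on the left, taint value on the right).

```go
package main

import (
	"fmt"
	"strconv"
)

// Operator mirrors the proposed numeric toleration operators (illustrative type).
type Operator string

const (
	OpLt Operator = "Lt"
	OpLe Operator = "Le"
	OpGe Operator = "Ge"
	OpGt Operator = "Gt"
)

// matches reports whether a toleration value matches a taint value under op.
// Hypothetical helper: any parse failure on either side means "no match".
func matches(tolerationVal, taintVal string, op Operator) bool {
	t, err := strconv.ParseInt(tolerationVal, 10, 64)
	if err != nil {
		return false // invalid toleration value
	}
	n, err := strconv.ParseInt(taintVal, 10, 64)
	if err != nil {
		return false // invalid taint value
	}
	switch op {
	case OpLt:
		return t < n
	case OpLe:
		return t <= n
	case OpGe:
		return t >= n
	case OpGt:
		return t > n
	default:
		return false
	}
}

func main() {
	cases := []struct {
		toleration, taint string
		op                Operator
		want              bool
	}{
		{"950", "800", OpGe, true},   // 950 >= 800
		{"750", "800", OpGe, false},  // 750 >= 800 does not hold
		{"0950", "950", OpLe, true},  // leading zero parses to 950
		{"95.5", "950", OpGt, false}, // decimal toleration value fails to parse
		{"950", "high", OpLt, false}, // non-numeric taint value fails to parse
		{"800", "800", OpGt, false},  // strict comparison excludes equality
	}
	for _, c := range cases {
		got := matches(c.toleration, c.taint, c.op)
		fmt.Printf("toleration=%q %s taint=%q: got=%v want=%v\n", c.toleration, c.op, c.taint, got, c.want)
	}
}
```

With this comparison direction, a toleration of `Ge` with value `750` does not match a taint value of `800`, so the table is also a quick way to check the chosen semantics against the user stories.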
+ +##### Prerequisite testing updates + + +N/A + +##### Unit tests + + + + + +All core changes must be covered by unit tests across the Taint API, validation, and scheduler: + +- **API Validation Tests:** (staging/src/k8s.io/api/core/v1/toleration_test.go) +- **Scheduler Helper Tests:** (staging/src/k8s.io/component-helpers/scheduling/corev1/helpers_test.go) +- **Validation Tests:** (pkg/apis/core/validation/validation_test.go) + +##### Integration tests + + + + + +The following scenarios need to be covered in integration tests: + +- Enabling and disabling the feature gate +- **Scheduler Integration Tests:** will be extended to cover the new taint cases introduced in this KEP: (pkg/scheduler/framework/plugins/tainttoleration/taint_toleration_test.go) + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +##### e2e tests + + +The existing e2e tests will be extended to cover the new taint cases introduced in this KEP: + +- **Taints e2e Tests:** (test/e2e/node/taints.go) + +- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) + +### Graduation Criteria + + + +#### Alpha + +- Feature implemented behind `TaintTolerationComparisonOperators` feature gate (disabled by default) +- API validation for numeric operators in place +- Taint/toleration matching logic supports `Lt`, `Le`, `Ge`, `Gt` operators + +#### Beta + +- Feature enabled by default +- Feedback collected from early adopters in SIG-Scheduling +- Performance testing shows that 
there is no significant increase in scheduler latency or memory usage. +- Stress testing with: + - 1000+ nodes with numeric taints + - 10,000+ pods with numeric tolerations + - Mixed numeric/string operator usage + +#### GA + +- Evidence of real-world adoption. +- Complete scalability validation: + - 5000-node clusters with mixed taint/toleration workloads + - No performance regressions under sustained load + +### Upgrade / Downgrade Strategy + + +- Upgrade + - Enable the feature gate in both API Server and Scheduler. +- Downgrade + - Disable the feature gate in both API Server and Scheduler. + +### Version Skew Strategy + + + +Skew between the kubelet and control-plane components is not affected. The kube-scheduler is expected to match the kube-apiserver minor version, but may be up to one minor version older (to allow live upgrades). + +In the release where it is introduced, the feature is disabled by default and is not recognized by other components. +Anyone who enables the feature manually accepts the risk that an older component, such as kube-scheduler, does not recognize the new operators. + +## Production Readiness Review Questionnaire + + + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- [x] Feature gate (also fill in values in `kep.yaml`) + - Feature gate name: `TaintTolerationComparisonOperators` + - Components depending on the feature gate: + - kube-apiserver + - kube-scheduler + +###### Does enabling the feature change any default behavior? + +No. + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? + + +Yes. + +###### What happens if we reenable the feature if it was previously rolled back? + +Tolerations that use the numeric operators will be respected again. + +###### Are there any tests for feature enablement/disablement? + + +Tests have been added in the integration tests. See [Integration tests](#integration-tests) for more details. 
+ +### Rollout, Upgrade and Rollback Planning + + + +###### How can a rollout or rollback fail? Can it impact already running workloads? + + +It shouldn't impact already running workloads. It's an opt-in feature. + +###### What specific metrics should inform a rollback? + + + +- `scheduler_scheduling_duration_seconds` +- `scheduler_scheduling_attempts_total` +- `apiserver_request_total` + +###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? + + +Will be considered for beta. + +###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? + + +No. + +### Monitoring Requirements + + + +###### How can an operator determine if the feature is in use by workloads? + + + +1. **Metrics**: + + ```promql + # Number of pods using numeric tolerations + scheduler_numeric_tolerations_total > 0 + + # Rate of numeric comparison operations + rate(scheduler_framework_extension_point_duration_seconds{plugin="TaintToleration"}[5m]) + ``` + +2. **API Queries**: + + ```bash + # Check for pods with numeric toleration operators (shown for Ge; repeat for the other operators) + kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Ge")]}{"\n"}{end}' | grep -v "^[^:]*: *$" + + # Count nodes with numeric taints (SLA example); grep -c . skips the empty lines printed for nodes without the taint + kubectl get nodes -o jsonpath='{range .items[*]}{.spec.taints[?(@.key=="node.kubernetes.io/sla")]}{"\n"}{end}' | grep -c . + ``` + +###### How can someone using this feature know that it is working for their instance? + + + +- [x] Events + - Event Reason: FailedScheduling + - Event Message: "node(s) had untolerated taint `node.kubernetes.io/sla`: `950`" +- [x] API .spec.taints + - Other field: `key: node.kubernetes.io/sla` +- [x] API .spec.tolerations + - Other field: `node.kubernetes.io/sla` + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? 
+ + + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? + + + +- [x] Metrics + - Metric name: + - `scheduler_scheduling_attempts_total` + - `scheduler_framework_extension_point_duration_seconds` + - Components exposing the metric: `kube-scheduler` + - Metric name: + - `kube_pod_status_phase` + - `kube_pod_status_scheduled_time` + - Components exposing the metric: `kube-state-metrics` + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? + + +Yes, two new metrics: + +- `scheduler_numeric_tolerations_total`: To measure the number of pods scheduled using numeric toleration operators. +- `scheduler_numeric_taint_mismatches_total`: To measure the scheduling failures due to numeric taint/toleration mismatches. + +### Dependencies + + + +###### Does this feature depend on any specific services running in the cluster? + +N/A + +### Scalability + + + +###### Will enabling / using this feature result in any new API calls? + + +No, the feature is designed to be an enhancement to existing logic without introducing any new API communication patterns. + +###### Will enabling / using this feature result in introducing new API types? + + +No. + +###### Will enabling / using this feature result in any new calls to the cloud provider? + + +No. + +###### Will enabling / using this feature result in increasing size or count of the existing API objects? + + +No. + +###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? + + +Potentially yes, but the impact should be **minimal**. The numeric toleration operators could slightly increase the time taken by operations covered by existing SLIs/SLOs because of integer parsing and validation overhead. + +###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? + + +No. 
+ +###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? + + +No. + +### Troubleshooting + + + +###### How does this feature react if the API server and/or etcd is unavailable? + +The behavior is the same as the existing taint/toleration system: it degrades gracefully. + +###### What are other known failure modes? + + +The main failure mode is an integer parsing error caused by a malformed taint or toleration value. API validation rejects pods whose toleration value is not a valid integer with a clear error message, and a taint value that fails to parse causes the toleration not to match. + +###### What steps should be taken if SLOs are not being met to determine the problem? + +N/A + +## Implementation History + + + +- 2025-08-11: Initial KEP + +## Drawbacks + + + +## Alternatives + +Several alternatives were considered: + +1. **Extend NodeAffinity with Numeric Operators:** Add Lt, Le, Ge, Gt to `NodeSelectorOperator` instead. + - **Pros:** `NodeAffinity` already supports `Gt`/`Lt` operators + - **Cons:** No eviction semantics, per-pod configuration (no cluster defaults), doesn't solve the operational model problem. +2. **New Dedicated SLA API Resource:** Create `SLAPolicy` CRD + - **Pros:** Clean separation, rich policy definitions. + - **Cons:** New API surface, additional complexity, breaks unified taint/toleration model. +3. **Custom Scheduler Plugin:** Use scheduling plugin with SLA-aware logic, [placement-policy-scheduler-plugins](https://github.com/Azure/placement-policy-scheduler-plugins) + - **Pros:** Full scheduling control, rich logic possible + - **Cons:** + - Out-of-tree scheduler plugin to maintain and manage + - Doesn't leverage existing taint/toleration infrastructure. +4. **Node Labels + Enhanced NodeAffinity:** Use labels instead of taints, extend NodeAffinity matching. + - **Pros:** Leverages existing label system. + - **Cons:** + - No default push-back behavior + - No eviction semantics + - Labels aren't meant for operational constraints. 
+ + +## Infrastructure Needed (Optional) + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/website]: https://git.k8s.io/website diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml new file mode 100644 index 00000000000..53362d64449 --- /dev/null +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml @@ -0,0 +1,43 @@ +title: Extended Toleration Operators for Threshold-Based Placement +kep-number: 5471 +authors: + - "@jane.doe" +owning-sig: sig-scheduling +participating-sigs: + - sig-node +status: provisional +creation-date: 2025-08-11 +reviewers: + - TBD +approvers: + - TBD + - "@oscar.doe" + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.35" + +# The milestone at which this feature was, or is targeted to be, at each stage. 
+milestone: + alpha: "v1.35" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: TaintTolerationComparisonOperators + components: + - kube-apiserver + - kube-scheduler +disable-supported: true + +# The following PRR answers are required at beta release +metrics: + - kube_pod_numeric_tolerations_total{operator="Ge|Le|Gt|Lt"} + - scheduler_failed_scheduling_attempts_total{reason="numeric_taint_mismatch"} + - scheduler_framework_extension_point_duration_seconds{plugin="TaintToleration"} From 7e5ac73b6e9714362e202292d1383d5810f47d62 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Fri, 15 Aug 2025 16:13:48 -0700 Subject: [PATCH 02/18] Address feedback by remove Ge/Le and add DRA example Signed-off-by: Heba Elayoty --- .../README.md | 141 +++++++++++------- .../5471-enable-sla-based-scheduling/kep.yaml | 14 +- 2 files changed, 95 insertions(+), 60 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index ed0dabd4639..f9fb802b340 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -14,10 +14,10 @@ - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos) - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability) - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management) + - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management) - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Scheduler Performance Regression](#scheduler-performance-regression) - - [User Confusion Between String and Numeric 
Semantics](#user-confusion-between-string-and-numeric-semantics) - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) - [Cross-SIG Impact](#cross-sig-impact) @@ -73,7 +73,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release* Extend **core/v1 Toleration** to support **numeric comparison operators** when matching **Node Taints**: -- New operators: `Lt`, `Le`, `Ge`, `Gt` (in addition to existing `Equal`/`Exists`). +- New operators: `Lt`, `Gt` (in addition to existing `Equal`/`Exists`). - Primary motivation: allow pods to opt‑in to nodes by `SLA/failure‑probability` values published as taints (e.g., `node.kubernetes.io/sla=950`). - Scheduler impact is limited to the existing TaintToleration Filter; no new stages or algorithms. @@ -96,7 +96,7 @@ From a scheduling perspective, adding numeric operators to tolerations only adju ### Goals -- Add comparison operators to Tolerations so pods can match taints like `node.kubernetes.io/sla=` using thresholds. +- Add comparison operators to tolerations so pods can match taints like `node.kubernetes.io/sla=` using thresholds. - Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`). - Backward compatible and opt‑in via a feature gate. 
@@ -108,20 +108,20 @@ From a scheduling perspective, adding numeric operators to tolerations only adju ### Benefits for implementing this feature for DRA and AI Workloads -In addition to general scheduling improvements, SLA‑aware opt‑in via Tolerations has specific advantages for `Dynamic Resource Allocation` (DRA) and `AI/ML`: +In addition to general scheduling improvements, SLA‑aware opt‑in via tolerations has specific advantages for `Dynamic Resource Allocation (DRA)` and `AI/ML`: -For DRA, resource claims (e.g., GPUs/accelerators) can be steered by node reliability: critical claims stay on high‑SLA capacity; batch/cheap claims can land on lower‑SLA pools. Taints provide a default drive away from risky pools and `NoExecute` eviction if a pool degrades. +- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades. -For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch preprocessing, fine‑tuning, or embedding generation to spot nodes. When spot nodes are reclaimed, `NoExecute` or `NoSchedule` effects plus tolerations allow graceful drain and controlled failover. In multi‑tenant GPU clusters, taints bound access to the reliable pools (fairness), and during autoscaling bursts, extra replicas can safely land on low‑SLA pools with explicit opt‑in. +- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover. 
-| Benefit | Impact on DRA | Impact on AI/ML workloads | -| --------------------------------- | --------------------------------------------------------------------------------- | ----------------------------------------------------------------------------- | -| **Cost–reliability optimization** | Bind/keep claims on reliability tiers via taints (+ tolerations to opt-in). | Keep latency-critical inference on high-SLA; shift batch to spot. | -| **Stage-aware placement** | Steer per-stage claims to tiers consistently with node policy. | Different stages tolerate different risk; make that explicit via tolerations. | -| **Resilience after preemption** | Use `NoExecute`/`tolerationSeconds` for graceful drain; re-admit on stable tiers. | Training/services recover faster with predictable eviction semantics. | -| **Multi-tenant fairness** | Avoid monopolization of high-SLA tiers by requiring explicit tolerations. | Fair access to reliable accelerators across teams. | -| **Smooth burst handling** | Bursts land on low-SLA pools via opt-in; baseline remains on high-SLA. | HPA can scale to spot with clear safety boundaries. | -| **Operational clarity** | Node-side policy is auditable and centralized. | Platform teams can document and enforce reliability classes cleanly. 
| +| Benefit | Impact on DRA | Impact on AI/ML workloads | +| ------------------------------ | --------------------------------------------------------- | ------------------------------------------------------- | +| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; batch uses spot | Inference on reliable nodes; training on cheaper pools | +| **Workload-aware placement** | Different claim types target appropriate node tiers | Pipeline stages match their reliability requirements | +| **Graceful preemption** | `NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads | +| **Resource fairness** | Prevents monopolization of premium capacity | Teams share reliable accelerators fairly | +| **Elastic scaling** | Bursts overflow to lower-SLA pools safely | HPA scales to spot with clear boundaries | +| **Policy transparency** | Node reliability classes are explicit and auditable | Platform teams enforce clear reliability tiers | ## Proposal @@ -131,7 +131,7 @@ For AI/ML, multi‑stage pipelines can place latency‑sensitive inference on hi As a cluster operator, I want a default repel from spot (low-SLA) nodes so that only workloads that explicitly tolerate them can land there. -I also want to set numeric SLA thresholds in tolerations (e.g., `Ge 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules. +I also want to set numeric SLA thresholds in tolerations (e.g., `Gt 950`) so pods can opt-in to reliable nodes or specific SLA bands without having to hardcode every SLA class in NodeAffinity rules. 
**Example Configuration:** @@ -153,7 +153,7 @@ kind: Pod spec: tolerations: - key: node.kubernetes.io/sla - operator: Ge + operator: Gt value: "750" effect: NoSchedule ``` @@ -188,7 +188,7 @@ spec: spec: tolerations: - key: node.kubernetes.io/sla - operator: Ge + operator: Gt value: "950" effect: NoExecute tolerationSeconds: 30 @@ -211,7 +211,7 @@ metadata: spec: tolerations: - key: node.kubernetes.io/sla - operator: Ge + operator: Gt value: "999" # 99.9% SLA effect: NoSchedule containers: @@ -228,7 +228,7 @@ metadata: spec: tolerations: - key: node.kubernetes.io/sla - operator: Ge + operator: Gt value: "800" # 80% SLA acceptable effect: NoSchedule containers: @@ -269,7 +269,7 @@ spec: resourceClaimName: gpu-claim-high-sla tolerations: - key: node.kubernetes.io/sla - operator: Ge + operator: Gt value: "950" # Ensure GPU nodes meet SLA requirements effect: NoSchedule containers: @@ -279,11 +279,68 @@ spec: - name: gpu-claim ``` +#### Story 5 — DRA device-level error budget management + +As a platform engineer managing GPU clusters with varying reliability states, I want to allocate devices based on their remaining error budget using numeric tolerations, so that critical workloads only get devices with sufficient reliability headroom while degraded devices still serve less sensitive workloads. + +This gives critical inference workloads fresh devices (>24h error budget), lets batch training use aging devices (1-24h), and excludes severely degraded devices (<1h) from allocation entirely, enabling graceful device lifecycle management.
+ +**Example Configuration:** + +```yaml +# Driver taints devices with low error budget +kind: ResourceSlice +spec: + driver: device.example.com + devices: + - name: gpu-node-01-device-0 + attributes: + memory: "32Gi" + compute-capability: "8.6" + # Driver applies taint when error budget drops below 10 hours + taints: + - key: device.example.com/error-budget-in-hours + value: "8" # 8 hours remaining + effect: NoSchedule +--- +# Critical inference workload requires high-reliability devices +kind: ResourceClaim +metadata: + name: inference-gpu-claim +spec: + requests: + - name: high-reliability-gpu + deviceClassName: device.example.com + tolerations: + # Only accept devices with >24 hours error budget + - key: device.example.com/error-budget-in-hours + operator: Gt + value: "24" + effect: NoSchedule +--- +# Batch training workload tolerates degraded devices +kind: ResourceClaim +metadata: + name: training-gpu-claim +spec: + requests: + - name: batch-gpu + deviceClassName: device.example.com + tolerations: + # Accept devices with >1 hour error budget + - key: device.example.com/error-budget-in-hours + operator: Gt + value: "1" + effect: NoSchedule +``` + ### Notes/Constraints/Caveats (Optional) -- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Decimal values (e.g., `"95.5"`) will be rejected by API validation when using numeric operators. +- **Integer-Only Support**: The implementation supports signed 64-bit integers only. Pod specs containing toleration values with decimal numbers (e.g., `"95.5"`) will be rejected by the API server during validation when using numeric comparison operators. -- **Parsing Requirements**: Both taint value and toleration value must be parseable as integers for numeric operators (`Lt`, `Le`, `Ge`, `Gt`). If either fails parsing, the toleration does not match. +- **Parsing Requirements**: The toleration value must be parseable as integers for numeric operators (`Lt`, `Gt`). 
If parsing fails, the toleration does not match. + + > Note: A taint like `foo=95.5:NoSchedule` is valid since taint values follow label value syntax, which allows such values. The numeric parsing/validation is enforced on the toleration **only**. - **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators. @@ -302,21 +359,9 @@ spec: **Mitigation**: - Parse integers only when new operators are used (no impact on existing workloads) -- Implement microbenchmarks during development to measure parsing overhead - Consider caching parsed values in scheduler data structures if performance issues arise - Feature gate allows disabling if performance problems occur -#### User Confusion Between String and Numeric Semantics - -**Risk**: Users might expect numeric comparison with `Equal` operator or string comparison with `Ge` operator, leading to mismatched tolerations. - -**Mitigation**: - -- Clear documentation distinguishing string vs. numeric operators -- API validation provides specific error messages for malformed numeric values -- Examples in documentation show proper usage patterns -- Consider adding warnings/events when numeric values are used with string operators - #### API Compatibility and Version Skew **Risk**: Pods using new operators cannot be scheduled if some schedulers don't support the feature, creating deployment failures during upgrades.
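As a rough illustration of the integer-only rule in the Notes/Constraints hunk above, the Go sketch below parses a value as a signed 64-bit base-10 integer and rejects anything else. `parseNumericValue` is a hypothetical helper introduced only for this illustration, and the use of `strconv.ParseInt` is an assumption; the patch does not show which parsing call the implementation uses.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseNumericValue is a hypothetical helper (not taken from the patch).
// It accepts only strings that parse as signed 64-bit base-10 integers,
// so decimals such as "95.5" and out-of-range values return an error.
func parseNumericValue(s string) (int64, error) {
	return strconv.ParseInt(s, 10, 64)
}

func main() {
	// Sample inputs: valid integers, a decimal, an int64 overflow, and an empty string.
	for _, v := range []string{"950", "-10", "95.5", "9223372036854775808", ""} {
		n, err := parseNumericValue(v)
		if err != nil {
			fmt.Printf("%q: rejected (%v)\n", v, err)
			continue
		}
		fmt.Printf("%q: accepted as %d\n", v, n)
	}
}
```

Under this assumption, a toleration value such as `"95.5"` would fail at the parse step, which matches the stated behavior that decimal values are rejected when a numeric operator is used.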
@@ -356,8 +401,6 @@ spec: Extend `core/v1.Toleration.Operator` to accept, in addition to `Equal` and `Exists`: - `Lt`: match if toleration.value < taint.value -- `Le`: match if toleration.value <= taint.value -- `Ge`: match if toleration.value >= taint.value - `Gt`: match if toleration.value > taint.value - `Equal`/`Exists`: Remain unchanged @@ -371,8 +414,6 @@ const ( // New numeric comparison operators (feature-gated) TolerationOpLt TolerationOperator = "Lt" // Less than - TolerationOpLe TolerationOperator = "Le" // Less than or equal - TolerationOpGe TolerationOperator = "Ge" // Greater than or equal TolerationOpGt TolerationOperator = "Gt" // Greater than ) ``` @@ -389,7 +430,7 @@ To honor Kubernetes APIs that avoids floating-point numbers where possible due t ```go const ( - // TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Le, Ge, Gt) for tolerations + // TaintTolerationComparisonOperators enables numeric comparison operators (Lt, Gt) for tolerations TaintTolerationComparisonOperators featuregate.Feature = "TaintTolerationComparisonOperators" ) @@ -411,7 +452,7 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie // New: Validate numeric operators (feature-gated) switch toleration.Operator { - case core.TolerationOpLt, core.TolerationOpLe, core.TolerationOpGe, core.TolerationOpGt: + case core.TolerationOpLt, core.TolerationOpGt: if !utilfeature.DefaultFeatureGate.Enabled(features.TaintTolerationComparisonOperators) { allErrors = append(allErrors, field.Invalid(idxPath.Child("operator"), toleration.Operator, "numeric operators require TaintTolerationComparisonOperators feature gate")) @@ -438,7 +479,7 @@ func (t *Toleration) ToleratesTaint(taint *Taint) bool { switch t.Operator { // ... 
- case TolerationOpLt, TolerationOpLe, TolerationOpGe, TolerationOpGt: + case TolerationOpLt, TolerationOpGt: // Feature gate check is not needed here as validation already handles it return compareNumericValues(t.Value, taint.Value, t.Operator) default: @@ -460,10 +501,6 @@ func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator) switch op { case TolerationOpLt: return tVal < nVal - case TolerationOpLe: - return tVal <= nVal - case TolerationOpGe: - return tVal >= nVal case TolerationOpGt: return tVal > nVal default: @@ -646,13 +683,14 @@ in back-to-back releases. - Feature implemented behind `TaintTolerationComparisonOperators` feature gate (disabled by default) - API validation for numeric operators in place -- Taint/toleration matching logic supports `Lt`, `Le`, `Ge`, `Gt` operators +- Taint/toleration matching logic supports `Lt`, `Gt` operators #### Beta - Feature enabled by default - Feedback collected from early adopters in SIG-Scheduling - Performance testing shows that there is no significant scheduler latency increase nor memory usage increase. +- Implement feature for DRA APIs - Stress testing with: - 1000+ nodes with numeric taints - 10,000+ pods with numeric tolerations @@ -853,7 +891,7 @@ logs or events for this purpose. ```bash # Check for pods with numeric toleration operators - kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Ge")]}{"\n"}{end}' | grep -v "^[^:]*: *$" + kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Gt")]}{"\n"}{end}' | grep -v "^[^:]*: *$" # Count nodes with numeric taints (SLA example) kubectl get nodes -o jsonpath='{range .items[*]}{.spec.taints[?(@.key=="node.kubernetes.io/sla")]}{"\n"}{end}' | wc -l @@ -1089,18 +1127,15 @@ Why should this KEP _not_ be implemented? There are many different alternatives were considered: -1. 
**Extend NodeAffinity with Numeric Operators:** Add Lt, Le, Ge, Gt to `NodeSelectorOperator` instead. - - **Pros:** `NodeAffinity` already supports `Gt`/`Lt` operators - - **Cons:** No eviction semantics, per-pod configuration (no cluster defaults), doesn't solve the operational model problem. -2. **New Dedicated SLA API Resource:** Create `SLAPolicy` CRD +1. **New Dedicated SLA API Resource:** Create `SLAPolicy` CRD - **Pros:** Clean separation, rich policy definitions. - **Cons:** New API surface, additional complexity, breaks unified taint/toleration model. -3. **Custom Scheduler Plugin:** Use scheduling plugin with SLA-aware logic, [placement-policy-scheduler-plugins](https://github.com/Azure/placement-policy-scheduler-plugins) +2. **Custom Scheduler Plugin:** Use scheduling plugin with SLA-aware logic, [placement-policy-scheduler-plugins](https://github.com/Azure/placement-policy-scheduler-plugins) - **Pros:** Full scheduling control, rich logic possible - **Cons:** - Out-of-tree scheduler plugin to maintain and manage - Doesn't leverage existing taint/toleration infrastructure. -4. **Node Labels + Enhanced NodeAffinity:** Use labels instead of taints, extend NodeAffinity matching. +3. **Node Labels + Enhanced NodeAffinity:** Use labels instead of taints, extend NodeAffinity matching. - **Pros:** Leverages existing label system. 
- **Cons:** - No default push-back behavior diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml index 53362d64449..83fb0d23713 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml @@ -6,12 +6,13 @@ owning-sig: sig-scheduling participating-sigs: - sig-node status: provisional -creation-date: 2025-08-11 +creation-date: 2025-08-08 reviewers: - - TBD + - "@SergeyKanzhelev" approvers: - - TBD - - "@oscar.doe" + - "@macsko" + - "@dom4ha" + - "@sanposhiho" # The target maturity stage in the current dev cycle for this KEP. # If the purpose of this KEP is to deprecate a user-visible feature @@ -38,6 +39,5 @@ disable-supported: true # The following PRR answers are required at beta release metrics: - - kube_pod_numeric_tolerations_total{operator="Ge|Le|Gt|Lt"} - - scheduler_failed_scheduling_attempts_total{reason="numeric_taint_mismatch"} - - scheduler_framework_extension_point_duration_seconds{plugin="TaintToleration"} + - scheduler_numeric_tolerations_total{operator="Gt|Lt"} + - scheduler_numeric_taint_mismatches_total From f8144877796654f5514a33a0f0c72218caf334cb Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Fri, 15 Aug 2025 17:03:20 -0700 Subject: [PATCH 03/18] Address PreferNoSchedule feedback Signed-off-by: Heba Elayoty --- .../README.md | 129 +++++++++++------- 1 file changed, 81 insertions(+), 48 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index f9fb802b340..2f0e0dcbfe6 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -1,53 +1,79 @@ # KEP-5471: Extended Toleration Operators for Threshold-Based Placement -- [Release Signoff Checklist](#release-signoff-checklist) -- [Summary](#summary) 
-- [Motivation](#motivation) - - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads) -- [Proposal](#proposal) - - [User Stories (Optional)](#user-stories-optional) - - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes) - - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos) - - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability) - - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management) - - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management) - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - - [Risks and Mitigations](#risks-and-mitigations) - - [Scheduler Performance Regression](#scheduler-performance-regression) - - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) - - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) - - [Cross-SIG Impact](#cross-sig-impact) -- [Design Details](#design-details) - - [API Changes](#api-changes) - - [Semantics](#semantics) - - [Implementation](#implementation) - - [Feature Gate Definition](#feature-gate-definition) - - [Test Plan](#test-plan) - - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Unit tests](#unit-tests) - - [Integration tests](#integration-tests) - - [e2e tests](#e2e-tests) - - [Graduation Criteria](#graduation-criteria) - - [Alpha](#alpha) - - [Beta](#beta) - - [GA](#ga) - - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - - [Version Skew Strategy](#version-skew-strategy) -- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - 
- [Feature Enablement and Rollback](#feature-enablement-and-rollback) - - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) - - [Monitoring Requirements](#monitoring-requirements) - - [Dependencies](#dependencies) - - [Scalability](#scalability) - - [Troubleshooting](#troubleshooting) -- [Implementation History](#implementation-history) -- [Drawbacks](#drawbacks) -- [Alternatives](#alternatives) -- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) +- [KEP-5471: Extended Toleration Operators for Threshold-Based Placement](#kep-5471-extended-toleration-operators-for-threshold-based-placement) + - [Release Signoff Checklist](#release-signoff-checklist) + - [Summary](#summary) + - [Motivation](#motivation) + - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads) + - [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes) + - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos) + - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability) + - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management) + - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management) + - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) + - [Scheduler Performance Regression](#scheduler-performance-regression) + - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) + - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) + - 
[Cross-SIG Impact](#cross-sig-impact) + - [Design Details](#design-details) + - [API Changes](#api-changes) + - [Semantics](#semantics) + - [Implementation](#implementation) + - [Feature Gate Definition](#feature-gate-definition) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) + - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster) + - [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior) + - [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement) + - [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back) + - [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads) + - [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback) + - [Were upgrade and rollback tested? 
Was the upgrade-\>downgrade-\>upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested) + - [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc) + - [Monitoring Requirements](#monitoring-requirements) + - [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads) + - [How can someone using this feature know that it is working for their instance?](#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance) + - [What are the reasonable SLOs (Service Level Objectives) for the enhancement?](#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement) + - [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) + - [Are there any missing metrics that would be useful to have to improve observability of this feature?](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature) + - [Dependencies](#dependencies) + - [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster) + - [Scalability](#scalability) + - [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls) + - [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types) + - [Will enabling / using this feature result in any new calls to the cloud provider?](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider) + - [Will 
enabling / using this feature result in increasing size or count of the existing API objects?](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects) + - [Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos) + - [Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?](#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components) + - [Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?](#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc) + - [Troubleshooting](#troubleshooting) + - [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable) + - [What are other known failure modes?](#what-are-other-known-failure-modes) + - [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem) + - [Implementation History](#implementation-history) + - [Drawbacks](#drawbacks) + - [Alternatives](#alternatives) + - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) ## Release Signoff Checklist @@ -420,7 +446,14 @@ const ( ### Semantics -To honor Kubernetes APIs that avoids floating-point numbers where possible due to precision and parsing issues, The new toleration operators will be introduced as integers (i.e.; 950 = 95.0%, 999 = 99.9%, 800 = 80.0%) +- To honor Kubernetes APIs that avoids floating-point numbers where possible due to precision and parsing issues, The new toleration operators will be 
introduced as integers (i.e.; 950 = 95.0%, 999 = 99.9%, 800 = 80.0%). +- For `PreferNoSchedule` taints, numeric operators only determine whether the taint is considered as tolerated for scoring: + +- **Tolerated taints**: Do not count against the node's score. +- **Intolerated taints**: Count against the node's score. +- **Scoring**: Unchanged - nodes with fewer intolerable `PreferNoSchedule` taints receive higher scores. + +This maintains consistent soft-preference behavior while enabling threshold-based SLA matching. For example, A pod requiring SLA > 95% will prefer nodes with SLA ≥ 950 over nodes with SLA < 950, but won't be blocked from scheduling on lower-SLA nodes if higher-SLA capacity is unavailable. ### Implementation From b2e3f99a64669240f9ca9c009e244ab12f2abd88 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Fri, 15 Aug 2025 17:19:16 -0700 Subject: [PATCH 04/18] format toc Signed-off-by: Heba Elayoty --- .../README.md | 120 +++++++----------- 1 file changed, 47 insertions(+), 73 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 2f0e0dcbfe6..f8c06ccb0b4 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -1,79 +1,53 @@ # KEP-5471: Extended Toleration Operators for Threshold-Based Placement -- [KEP-5471: Extended Toleration Operators for Threshold-Based Placement](#kep-5471-extended-toleration-operators-for-threshold-based-placement) - - [Release Signoff Checklist](#release-signoff-checklist) - - [Summary](#summary) - - [Motivation](#motivation) - - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone) - - [Goals](#goals) - - [Non-Goals](#non-goals) - - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads) - - [Proposal](#proposal) - - [User Stories 
(Optional)](#user-stories-optional) - - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes) - - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos) - - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability) - - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management) - - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management) - - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - - [Risks and Mitigations](#risks-and-mitigations) - - [Scheduler Performance Regression](#scheduler-performance-regression) - - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) - - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) - - [Cross-SIG Impact](#cross-sig-impact) - - [Design Details](#design-details) - - [API Changes](#api-changes) - - [Semantics](#semantics) - - [Implementation](#implementation) - - [Feature Gate Definition](#feature-gate-definition) - - [Test Plan](#test-plan) - - [Prerequisite testing updates](#prerequisite-testing-updates) - - [Unit tests](#unit-tests) - - [Integration tests](#integration-tests) - - [e2e tests](#e2e-tests) - - [Graduation Criteria](#graduation-criteria) - - [Alpha](#alpha) - - [Beta](#beta) - - [GA](#ga) - - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) - - [Version Skew Strategy](#version-skew-strategy) - - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - - [Feature Enablement and Rollback](#feature-enablement-and-rollback) - - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster) - - [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior) 
- - [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement) - - [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back) - - [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement) - - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) - - [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads) - - [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback) - - [Were upgrade and rollback tested? Was the upgrade-\>downgrade-\>upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested) - - [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc) - - [Monitoring Requirements](#monitoring-requirements) - - [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads) - - [How can someone using this feature know that it is working for their instance?](#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance) - - [What are the reasonable SLOs (Service Level Objectives) for the enhancement?](#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement) - - [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service) - - [Are there any missing metrics that 
would be useful to have to improve observability of this feature?](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature) - - [Dependencies](#dependencies) - - [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster) - - [Scalability](#scalability) - - [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls) - - [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types) - - [Will enabling / using this feature result in any new calls to the cloud provider?](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider) - - [Will enabling / using this feature result in increasing size or count of the existing API objects?](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects) - - [Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos) - - [Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) 
in any components?](#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components) - - [Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?](#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc) - - [Troubleshooting](#troubleshooting) - - [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable) - - [What are other known failure modes?](#what-are-other-known-failure-modes) - - [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem) - - [Implementation History](#implementation-history) - - [Drawbacks](#drawbacks) - - [Alternatives](#alternatives) - - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone) + - [Goals](#goals) + - [Non-Goals](#non-goals) + - [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads) +- [Proposal](#proposal) + - [User Stories (Optional)](#user-stories-optional) + - [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes) + - [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos) + - [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability) + - [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management) + - [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management) + 
- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) + - [Risks and Mitigations](#risks-and-mitigations) + - [Scheduler Performance Regression](#scheduler-performance-regression) + - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) + - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) + - [Cross-SIG Impact](#cross-sig-impact) +- [Design Details](#design-details) + - [API Changes](#api-changes) + - [Semantics](#semantics) + - [Implementation](#implementation) + - [Feature Gate Definition](#feature-gate-definition) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Alpha](#alpha) + - [Beta](#beta) + - [GA](#ga) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) ## Release Signoff Checklist From a6b3719b5785283e367147b94903a25a058e134e Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Tue, 26 Aug 2025 03:03:29 -0700 Subject: [PATCH 05/18] Address feedback Signed-off-by: Heba Elayoty --- keps/prod-readiness/sig-scheduling/5471.yaml | 3 + .../README.md | 95 +++++++++++++------ .../5471-enable-sla-based-scheduling/kep.yaml | 4 +- 3 files changed, 72 insertions(+), 30 
deletions(-) create mode 100644 keps/prod-readiness/sig-scheduling/5471.yaml diff --git a/keps/prod-readiness/sig-scheduling/5471.yaml b/keps/prod-readiness/sig-scheduling/5471.yaml new file mode 100644 index 00000000000..7c8d1ae03c6 --- /dev/null +++ b/keps/prod-readiness/sig-scheduling/5471.yaml @@ -0,0 +1,3 @@ +kep-number: 5471 +alpha: + approver: "@soltysh" diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index f8c06ccb0b4..2c3dbc4449c 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -29,6 +29,7 @@ - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) - [Unit tests](#unit-tests) + - [Performance tests](#performance-tests) - [Integration tests](#integration-tests) - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) @@ -99,6 +100,7 @@ From a scheduling perspective, adding numeric operators to tolerations only adju - Add comparison operators to tolerations so pods can match taints like `node.kubernetes.io/sla=` using thresholds. - Keep behavior consistent with existing effects (`NoSchedule`, `PreferNoSchedule`, `NoExecute`). - Backward compatible and opt‑in via a feature gate. +- Zero operational performance impact on existing pod scheduling using `Equal` and `Exists` operators. ### Non-Goals @@ -110,13 +112,13 @@ From a scheduling perspective, adding numeric operators to tolerations only adju In addition to general scheduling improvements, SLA‑aware opt‑in via tolerations has specific advantages for `Dynamic Resource Allocation (DRA)` and `AI/ML`: -- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades. 
+- DRA steers GPUs/accelerators resource claims by node reliability: critical workloads get high‑SLA capacity while interruptible batch workloads use cheaper pools. Taints block risky pools and evict when capacity degrades. -- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing batch to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover. +- AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing checkpoint-able batch workloads to run on spot nodes. When spot nodes are reclaimed, taints trigger graceful drain and controlled failover. | Benefit | Impact on DRA | Impact on AI/ML workloads | | ------------------------------ | --------------------------------------------------------- | ------------------------------------------------------- | -| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; batch uses spot | Inference on reliable nodes; training on cheaper pools | +| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; interruptible batch uses spot | Inference on reliable nodes; checkpoint-able training on cheaper pools | | **Workload-aware placement** | Different claim types target appropriate node tiers | Pipeline stages match their reliability requirements | | **Graceful preemption** | `NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads | | **Resource fairness** | Prevents monopolization of premium capacity | Teams share reliable accelerators fairly | @@ -156,6 +158,18 @@ spec: operator: Gt value: "750" effect: NoSchedule +--- +# Critical workload will not be scheduled until a suitable high reliability node has capacity +apiVersion: v1 +kind: Pod +metadata: + name: critical-workload +spec: + tolerations: + - key: node.kubernetes.io/sla + operator: Gt + value: "950" + effect: NoSchedule ``` #### Story 2 — AI inference service with strict SLOs @@ 
-247,7 +261,31 @@ This ensures DRA allocations are both resource-correct and reliability-compliant **Example Configuration:** ```yaml -# DRA claim with SLA constraints +# High-SLA GPU device published by DRA driver +apiVersion: resource.k8s.io/v1alpha4 +kind: ResourceSlice +metadata: + name: gpu-node-01-slice +spec: + driver: nvidia.com/gpu + pool: + name: gpu-node-01 + generation: 1 + devices: + - name: gpu-node-01-device-0 + basic: + attributes: + memory: "32Gi" + compute-capability: "8.6" + capacity: + count: 1 + # Driver applies SLA taint based on node reliability metrics + taints: + - key: node.kubernetes.io/sla + value: "980" # 98% SLA + effect: NoSchedule +--- +# DRA claim with SLA constraints apiVersion: resource.k8s.io/v1alpha4 kind: ResourceClaim metadata: @@ -257,6 +295,12 @@ spec: requests: - name: gpu deviceClassName: nvidia-a100 + tolerations: + # Only accept GPUs with SLA > 950 (95%) + - key: node.kubernetes.io/sla + operator: Gt + value: "950" + effect: NoSchedule --- # Pod using DRA claim with SLA requirements apiVersion: v1 @@ -318,7 +362,7 @@ spec: value: "24" effect: NoSchedule --- -# Batch training workload tolerates degraded devices +# Short-lived batch training workload tolerates degraded devices kind: ResourceClaim metadata: name: training-gpu-claim @@ -340,7 +384,7 @@ spec: - **Parsing Requirements**: The toleration value must be parseable as integers for numeric operators (`Lt`, `Gt`). If fails parsing, the toleration does not match. - > Note: A taint like `foo=95.5:NoSchedule` is valid since taint values follow label values syntax, which allows. The numeric parsing/validation is enforced on toleration **only**. + > Note: A taint like `foo=95.5:NoSchedule` is valid since taint values follow label values syntax, which allows such values. The numeric parsing/validation is enforced on tolerations *only*. - **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators. 
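The parsing requirement in the hunk above can be illustrated with a minimal Go sketch. It is not code from this patch: `matchesThreshold` is a hypothetical helper, and the operand order is an assumption taken from the YAML examples, where a `Gt` toleration of `950` is meant to accept a taint value of `980`. A taint value such as `95.5` is a valid label value but fails integer parsing, so the toleration does not match.

```go
package main

import (
	"fmt"
	"strconv"
)

// matchesThreshold is a hypothetical helper for illustration only.
// Assumption: "Gt" means the taint value must be greater than the
// toleration value, and "Lt" means it must be smaller.
func matchesThreshold(op, tolerationVal, taintVal string) (bool, error) {
	threshold, err := strconv.ParseInt(tolerationVal, 10, 64)
	if err != nil {
		return false, fmt.Errorf("toleration value %q is not an integer: %w", tolerationVal, err)
	}
	actual, err := strconv.ParseInt(taintVal, 10, 64)
	if err != nil {
		// A taint such as "95.5" is a valid label value but not an integer.
		return false, fmt.Errorf("taint value %q is not an integer: %w", taintVal, err)
	}
	switch op {
	case "Gt":
		return actual > threshold, nil
	case "Lt":
		return actual < threshold, nil
	default:
		return false, fmt.Errorf("unsupported operator %q", op)
	}
}

func main() {
	// Evaluate one Gt toleration against three candidate taint values.
	for _, taint := range []string{"980", "900", "95.5"} {
		ok, err := matchesThreshold("Gt", "950", taint)
		fmt.Printf("taint=%s tolerated=%v err=%v\n", taint, ok, err)
	}
}
```

Returning the parse error alongside the boolean lets a caller report why a toleration did not match, rather than only that it did not.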
@@ -350,6 +394,8 @@ spec: - **Parsing Overhead**: Each taint/toleration match with numeric operators requires integer parsing. +- Invalid taints meant to be used with the new comparison operators (e.g., `node.kubernetes.io/sla=95.5` and `node.kubernetes.io/version=1`) are not detected at admission time. + ### Risks and Mitigations #### Scheduler Performance Regression @@ -358,7 +404,8 @@ spec: **Mitigation**: -- Parse integers only when new operators are used (no impact on existing workloads) +- Parse integers only when new operators are used. +- Existing `Equal`/`Exists` operators execute identical code paths with no additional overhead. - Consider caching parsed values in scheduler data structures if performance issues arise - Feature gate allows disabling if performance problems occur @@ -482,19 +529,21 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie ```go // ToleratesTaint checks if the toleration tolerates the taint. func (t *Toleration) ToleratesTaint(taint *Taint) bool { + switch t.Operator { // Existing key and effect matching logic... - switch t.Operator { - // ... + // Handle existing operators first. This ensures + // zero performance impact for existing Equal/Exists scenarios. 
case TolerationOpLt, TolerationOpGt: // Feature gate check is not needed here as validation already handles it - return compareNumericValues(t.Value, taint.Value, t.Operator) + // Only parse values when comparison operators are actually used + return compareValues(t.Value, taint.Value, t.Operator) default: return false } } -func compareNumericValues(tolerationVal, taintVal string, op TolerationOperator) bool { +func compareValues(tolerationVal, taintVal string, op TolerationOperator) bool { tVal, tErr := strconv.ParseInt(tolerationVal, 10, 64) if tErr != nil { return false // Invalid toleration value @@ -558,6 +607,11 @@ All core changes must be covered by unit tests, in both Taint API, validation, a - **Validation Tests:** ( pkg/apis/core/validation/validation_test.go) - ``: `` - `` +##### Performance tests + +- Establish current scheduling latency for workloads using only `Equal`/`Exists` operators +- Verify that enabling the feature gate with no comparison operators used shows no measurable performance difference. 
+ ##### Integration tests ## Infrastructure Needed (Optional) - - - -[kubernetes.io]: https://kubernetes.io/ -[kubernetes/enhancements]: https://git.k8s.io/enhancements -[kubernetes/website]: https://git.k8s.io/website diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml index 83fb0d23713..47873352ba4 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml @@ -4,8 +4,8 @@ authors: - "@jane.doe" owning-sig: sig-scheduling participating-sigs: - - sig-node -status: provisional + - sig-apps +status: implementable creation-date: 2025-08-08 reviewers: - "@SergeyKanzhelev" From f0442a739409a118225a70fed027c03386425b29 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Tue, 2 Sep 2025 11:32:16 -0700 Subject: [PATCH 06/18] Remove comments and address feedback Signed-off-by: Heba Elayoty --- .../README.md | 404 +----------------- .../5471-enable-sla-based-scheduling/kep.yaml | 2 +- 2 files changed, 16 insertions(+), 390 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 2c3dbc4449c..cbf3365d6d6 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -573,173 +573,44 @@ to implement this enhancement. 
##### Prerequisite testing updates - N/A ##### Unit tests - - - - All core changes must be covered by unit tests, in both Taint API, validation, and scheduler sides: - **API Validation Tests:** (staging/src/k8s.io/api/core/v1/toleration_test.go) - **Scheduler Helper Tests:** (staging/src/k8s.io/component-helpers/scheduling/corev1/helpers_test.go) - **Validation Tests:** ( pkg/apis/core/validation/validation_test.go) +- **ToleratesTaint plugin:** (pkg/scheduler/framework/plugins/tainttoleration/taint_toleration_test.go) - ``: `` - `` ##### Performance tests - Establish current scheduling latency for workloads using only `Equal`/`Exists` operators - Verify that enabling the feature gate with no comparison operators used shows no measurable performance difference. +- **Scheduler Performance Tests:** will be extended to cover the new taints cases introduced in this KEP:(test/integration/scheduler_perf) ##### Integration tests - - - - The following scenarios need to be covered in integration tests: - Feature gate's enabling/disabling -- **Scheduler Integration Tests:** will be extended to cover the new taints cases introduced in this KEP:(pkg/scheduler/framework/plugins/tainttoleration/taint_toleration_test.go) +- **Scheduler Integration Tests:** will be extended to cover the new taints cases introduced in this KEP:(test/integration/scheduler) - [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) ##### e2e tests - The existing e2e tests will be extended to cover the new taints cases introduced in this KEP: -- **Taints e2e Tests:** (test/e2e/node/taints.go) +- **Node Taints e2e Tests:** (test/e2e/node/taints.go) +- **Scheduler Taints e2e Tests:** (test/e2e/scheduling) -- [test 
name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/e2e/...): [SIG ...](https://testgrid.k8s.io/sig-...?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) +- [test name]() ### Graduation Criteria - - #### Alpha - Feature implemented behind `TaintTolerationComparisonOperators` feature gate (disabled by default) @@ -761,17 +632,6 @@ in back-to-back releases. ### Upgrade / Downgrade Strategy - - Upgrade - Enable the feature gate in both API Server and Scheduler. - Downgrade @@ -779,19 +639,6 @@ enhancement: ### Version Skew Strategy - - The skew between kubelet and control-plane components are not impacted. The kube-scheduler is expected to match the kube-apiserver minor version, but may be up to one minor version older (to allow live upgrades). In the release it's been added, the feature is disabled by default and not recognized by other components. @@ -799,28 +646,6 @@ Whoever enabled the feature manually would take the risk of component like kube- ## Production Readiness Review Questionnaire - - ### Feature Enablement and Rollback ###### How can this feature be enabled / disabled in a live cluster? @@ -837,16 +662,6 @@ No ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? - Yes. ###### What happens if we reenable the feature if it was previously rolled back? @@ -855,84 +670,32 @@ SLA toleration will be respected again. ###### Are there any tests for feature enablement/disablement? - Tests have been added in the integration tests. See [Integration tests](#integration-tests) for more details. ### Rollout, Upgrade and Rollback Planning - - ###### How can a rollout or rollback fail? Can it impact already running workloads? - It shouldn't impact already running workloads. It's an opt-in feature. ###### What specific metrics should inform a rollback? 
- - - `scheduler_scheduling_duration_seconds` - `scheduler_scheduling_attempts_total` -- `scheduler_scheduling_attempts_total` - `apiserver_request_total` ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? - Will be considered for beta. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? - No. ### Monitoring Requirements - - ###### How can an operator determine if the feature is in use by workloads? - - 1. **Metrics**: ```promql @@ -955,15 +718,6 @@ logs or events for this purpose. ###### How can someone using this feature know that it is working for their instance? - - - [x] Events - Event Reason: FailedScheduling - Event Message: "node(s) had untolerated taint `node.kubernetes.io/sla`: `950`" @@ -974,27 +728,8 @@ Recall that end users cannot usually observe component logs or access metrics. ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? - - ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - - - [x] Metrics - Metric name: - `scheduler_scheduling_attempts_total` @@ -1007,20 +742,19 @@ Pick one more of these and delete the rest. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? - Yes, a new metrics: -- `scheduler_numeric_tolerations_total`: To measure the number of pods scheduled using numeric toleration operators. -- `scheduler_numeric_taint_mismatches_total`: To measure the scheduling failures due to numeric taint/toleration mismatches. +- `scheduler_numeric_taint_evaluations_total`: tracks each numeric evaluation with its result. +- `scheduler_numeric_tolerations_total`: tracks successful scheduling with numeric tolerations. +These metrics provide visibility into: -### Dependencies +1. How frequently the numeric toleration feature is being used +2. 
The effectiveness of numeric taint/toleration matching +3. Per-profile usage patterns for multi-scheduler setups - +In addition, the scheduler has an existing `scheduler_unschedulable_pods` metric that handles the multiple failure reasons by incrementing for each plugin that rejects a pod. + +### Dependencies ###### Does this feature depend on any specific services running in the cluster? @@ -1028,130 +762,42 @@ N/A ### Scalability - - ###### Will enabling / using this feature result in any new API calls? - No, the feature is designed to be an enhancement to existing logic without introducing any new API communication patterns. ###### Will enabling / using this feature result in introducing new API types? - No. ###### Will enabling / using this feature result in any new calls to the cloud provider? - No. ###### Will enabling / using this feature result in increasing size or count of the existing API objects? - No. ###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? - Potentially yes, but the impact should be **minimal**. The numeric toleration operators feature could slightly increase time for operations covered by existing SLIs/SLOs due to integer parsing overhead and validation overhead. ###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? - No. ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)? - No. ### Troubleshooting - - ###### How does this feature react if the API server and/or etcd is unavailable? Same as existing taint/toleration system which is graceful degradation. ###### What are other known failure modes? - A failure mode due to numeric toleration operators have integer parsing errors from malformed taint/toleration values causing pods to be rejected with clear error messages. 
###### What steps should be taken if SLOs are not being met to determine the problem? @@ -1160,25 +806,10 @@ N/A ## Implementation History - - - 2025-08-11: Initial KEP ## Drawbacks - - ## Alternatives There are many different alternatives were considered: @@ -1197,10 +828,5 @@ There are many different alternatives were considered: - No default push-back behavior - No eviction semantics - Labels aren't meant for operational constraints. - ## Infrastructure Needed (Optional) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml index 47873352ba4..a1d4aeb5b69 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml @@ -39,5 +39,5 @@ disable-supported: true # The following PRR answers are required at beta release metrics: + - scheduler_numeric_taint_evaluations_total - scheduler_numeric_tolerations_total{operator="Gt|Lt"} - - scheduler_numeric_taint_mismatches_total From 3c07edf590e9517ff73daeea99d87f8424b066bb Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Mon, 8 Sep 2025 09:36:18 -0700 Subject: [PATCH 07/18] update TaintToleration and compareValue func to return error Signed-off-by: Heba Elayoty --- .../README.md | 33 ++++++++++--------- 1 file changed, 17 insertions(+), 16 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index cbf3365d6d6..78128d17daa 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -116,14 +116,14 @@ In addition to general scheduling improvements, SLA‑aware opt‑in via tolerat - AI/ML pipelines can place latency‑sensitive inference on high‑SLA nodes while directing checkpoint-able batch workloads to run on spot nodes. 
When spot nodes are reclaimed, taints trigger graceful drain and controlled failover. -| Benefit | Impact on DRA | Impact on AI/ML workloads | -| ------------------------------ | --------------------------------------------------------- | ------------------------------------------------------- | -| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; interruptible batch uses spot | Inference on reliable nodes; checkpoint-able training on cheaper pools | -| **Workload-aware placement** | Different claim types target appropriate node tiers | Pipeline stages match their reliability requirements | -| **Graceful preemption** | `NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads | -| **Resource fairness** | Prevents monopolization of premium capacity | Teams share reliable accelerators fairly | -| **Elastic scaling** | Bursts overflow to lower-SLA pools safely | HPA scales to spot with clear boundaries | -| **Policy transparency** | Node reliability classes are explicit and auditable | Platform teams enforce clear reliability tiers | +| Benefit | Impact on DRA | Impact on AI/ML workloads | +| ------------------------------ | ----------------------------------------------------------------------- | ---------------------------------------------------------------------- | +| **Cost–reliability trade-off** | Critical workloads stay on premium nodes; interruptible batch uses spot | Inference on reliable nodes; checkpoint-able training on cheaper pools | +| **Workload-aware placement** | Different claim types target appropriate node tiers | Pipeline stages match their reliability requirements | +| **Graceful preemption** | `NoExecute` provides controlled eviction timing | Predictable failover for training and serving workloads | +| **Resource fairness** | Prevents monopolization of premium capacity | Teams share reliable accelerators fairly | +| **Elastic scaling** | Bursts overflow to lower-SLA pools 
safely | HPA scales to spot with clear boundaries | +| **Policy transparency** | Node reliability classes are explicit and auditable | Platform teams enforce clear reliability tiers | ## Proposal @@ -528,7 +528,7 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie ```go // ToleratesTaint checks if the toleration tolerates the taint. -func (t *Toleration) ToleratesTaint(taint *Taint) bool { +func (t *Toleration) ToleratesTaint(taint *Taint) (bool, error) { switch t.Operator { // Existing key and effect matching logic... @@ -539,28 +539,29 @@ func (t *Toleration) ToleratesTaint(taint *Taint) bool { // Only parse values when comparison operators are actually used return compareValues(t.Value, taint.Value, t.Operator) default: - return false + return false, errors.New("cannot handle the operator") } } -func compareValues(tolerationVal, taintVal string, op TolerationOperator) bool { +// return error to inform the user what went wrong, not only that the toleration is not matching for any node. 
+func compareValues(tolerationVal, taintVal string, op TolerationOperator) (bool, error) { tVal, tErr := strconv.ParseInt(tolerationVal, 10, 64) if tErr != nil { - return false // Invalid toleration value + return false, tErr // Invalid toleration value } nVal, nErr := strconv.ParseInt(taintVal, 10, 64) if nErr != nil { - return false // Invalid taint value + return false, nErr // Invalid taint value } switch op { case TolerationOpLt: - return tVal < nVal + return tVal < nVal, nil case TolerationOpGt: - return tVal > nVal + return tVal > nVal, nil default: - return false + return false, errors.New("toleration and taints values are equal") } } ``` From 5d464cb22b9be198a3e60b4233118788b1bb99c5 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Tue, 16 Sep 2025 17:37:04 -0700 Subject: [PATCH 08/18] addresss feedback Signed-off-by: Heba Elayoty --- .../README.md | 44 ++++++++++++++----- 1 file changed, 33 insertions(+), 11 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 78128d17daa..45e7c786e9a 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -20,6 +20,7 @@ - [Scheduler Performance Regression](#scheduler-performance-regression) - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) + - [Taint Misconfiguration Detection](#taint-misconfiguration-detection) - [Cross-SIG Impact](#cross-sig-impact) - [Design Details](#design-details) - [API Changes](#api-changes) @@ -159,6 +160,22 @@ spec: value: "750" effect: NoSchedule --- +apiVersion: v1 +kind: Pod +metadata: + name: flexible-sla-workload +spec: + tolerations: + # Accept nodes with SLA >= 900 (SLA = 900 OR SLA > 900) + - key: node.kubernetes.io/sla + operator: Equal + value: "900" + effect: NoSchedule + - key: 
node.kubernetes.io/sla + operator: Gt + value: "900" + effect: NoSchedule +--- # Critical workload will not be scheduled until a suitable high reliability node has capacity apiVersion: v1 kind: Pod @@ -396,6 +413,8 @@ spec: - Invalid taints meant to be used with the new comparison operators (e.g., `node.kubernetes.io/sla=95.5` and `node.kubernetes.io/version=1`) are not detected at admission time. +- **Taint Misconfiguration Risk**: When nodes have taints with non-numeric values (e.g., `node.kubernetes.io/sla=high` instead of `node.kubernetes.io/sla=950`) that are intended for use with numeric operators, the misconfiguration is only detected during pod scheduling attempts, not at taint creation time. This can lead to scheduling failures that are difficult to diagnose. + ### Risks and Mitigations #### Scheduler Performance Regression @@ -431,6 +450,16 @@ spec: - API validation rejects pods with unparseable values rather than silently failing - Clear error messages help users identify and fix configuration issues +#### Taint Misconfiguration Detection + +**Risk**: Node taints intended for numeric comparison may contain non-numeric values (e.g., `node.kubernetes.io/sla=high` instead of `node.kubernetes.io/sla=950`), causing scheduling failures that are only detected during pod placement attempts rather than at taint creation time. 
+ +**Mitigation**: + +- Clear documentation and examples showing proper numeric taint configuration +- Enhanced error messages in scheduling events that clearly indicate parsing failures +- Monitoring and alerting on scheduling failures due to taint parsing errors + #### Cross-SIG Impact - SIG-Node @@ -560,8 +589,6 @@ func compareValues(tolerationVal, taintVal string, op TolerationOperator) (bool, return tVal < nVal, nil case TolerationOpGt: return tVal > nVal, nil - default: - return false, errors.New("toleration and taints values are equal") } } ``` @@ -580,11 +607,10 @@ N/A All core changes must be covered by unit tests, in both Taint API, validation, and scheduler sides: -- **API Validation Tests:** (staging/src/k8s.io/api/core/v1/toleration_test.go) -- **Scheduler Helper Tests:** (staging/src/k8s.io/component-helpers/scheduling/corev1/helpers_test.go) -- **Validation Tests:** ( pkg/apis/core/validation/validation_test.go) -- **ToleratesTaint plugin:** (pkg/scheduler/framework/plugins/tainttoleration/taint_toleration_test.go) -- ``: `` - `` +- `staging/src/k8s.io/api/core/v1/toleration_test.go`: Sep-16-2025 - 66.7% +- `staging/src/k8s.io/component-helpers/scheduling/corev1/helpers_test.go`: Sep-16-2025 - 100% +- `pkg/apis/core/validation/validation_test.go`: Sep-16-2025 - 85.1% +- `pkg/scheduler/framework/plugins/tainttoleration/taint_toleration_test.go`: Sep-16-2025 - 86.9% ##### Performance tests @@ -599,8 +625,6 @@ The following scenarios need to be covered in integration tests: - Feature gate's enabling/disabling - **Scheduler Integration Tests:** will be extended to cover the new taints cases introduced in this KEP:(test/integration/scheduler) -- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage 
search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature) - ##### e2e tests The existing e2e tests will be extended to cover the new taints cases introduced in this KEP: @@ -608,8 +632,6 @@ The existing e2e tests will be extended to cover the new taints cases introduced - **Node Taints e2e Tests:** (test/e2e/node/taints.go) - **Scheduler Taints e2e Tests:** (test/e2e/scheduling) -- [test name]() - ### Graduation Criteria #### Alpha From e86f68e9776835803e1843a6c6e0ace7e487c1d3 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Mon, 22 Sep 2025 14:24:15 -0700 Subject: [PATCH 09/18] update feedback Signed-off-by: Heba Elayoty --- .../README.md | 22 +++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 45e7c786e9a..e7f3cc34256 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -407,7 +407,7 @@ spec: - **Strict Validation**: Unlike existing `Equal`/`Exists` operators which accept any string values, numeric operators require valid integer strings. This may catch existing invalid configurations. -- **No Implicit Conversion**: Values like `"0950"` vs `"950"` are numerically equal but may confuse users expecting string matching behavior. +- **Leading Zeros Validation**: The API validation will reject taint and toleration values that contain leading zeros (e.g., `"0950"`, `"007"`) when used with numeric operators (`Lt`, `Gt`). This ensures consistent behavior and prevents the ambiguity between string and numeric interpretations. Only values without leading zeros are accepted (e.g., `"950"`, `"7"`). - **Parsing Overhead**: Each taint/toleration match with numeric operators requires integer parsing. 
@@ -441,14 +441,17 @@ spec: #### Edge Cases in Numeric Parsing -**Risk**: Unexpected behavior with edge cases like integer overflow, leading zeros, or malformed input could cause scheduling failures. +**Risk**: Unexpected behavior with edge cases like integer overflow, leading zeros, or malformed input could cause scheduling failures. Leading zeros in values (e.g., `"0950"`) could create user confusion about whether values are treated as strings or numbers. **Mitigation**: - Use Go's standard `strconv.ParseInt()` with well-defined error handling -- Comprehensive unit tests covering edge cases (overflow, underflow, malformed strings) +- Comprehensive unit tests covering edge cases (overflow, underflow, malformed strings, leading zeros) - API validation rejects pods with unparseable values rather than silently failing +- **API validation explicitly rejects values with leading zeros** when using numeric operators to eliminate confusion - Clear error messages help users identify and fix configuration issues +- Documentation clearly states that leading zeros are not permitted for numeric operators +- **Performance validation via scheduler-perf tests** to ensure no measurable scheduling latency degradation from integer parsing overhead #### Taint Misconfiguration Detection @@ -458,6 +461,7 @@ spec: - Clear documentation and examples showing proper numeric taint configuration - Enhanced error messages in scheduling events that clearly indicate parsing failures +- Scheduler logging for taint parsing failures to help cluster admins identify misconfigured nodes even when pods successfully schedule on other nodes with valid numeric taints - Monitoring and alerting on scheduling failures due to taint parsing errors #### Cross-SIG Impact @@ -546,6 +550,13 @@ func validateTolerations(tolerations []core.Toleration, fldPath *field.Path) fie if _, err := strconv.ParseInt(toleration.Value, 10, 64); err != nil { allErrors = append(allErrors, field.Invalid(idxPath.Child("value"), 
toleration.Value, "value must be a valid integer for numeric operators")) + continue + } + + // Reject values with leading zeros to prevent confusion + if len(toleration.Value) > 1 && toleration.Value[0] == '0' && toleration.Value != "0" { + allErrors = append(allErrors, field.Invalid(idxPath.Child("value"), + toleration.Value, "leading zeros are not allowed in numeric values (use '950' instead of '0950')")) } } } @@ -581,6 +592,9 @@ func compareValues(tolerationVal, taintVal string, op TolerationOperator) (bool, nVal, nErr := strconv.ParseInt(taintVal, 10, 64) if nErr != nil { + // Log taint parsing failures to help cluster admins identify misconfigured nodes + // even when pods can still schedule on other nodes with valid numeric taints + klog.Warningf("Failed to parse taint value %q as integer for numeric comparison: %v", taintVal, nErr) return false, nErr // Invalid taint value } @@ -605,7 +619,7 @@ N/A ##### Unit tests -All core changes must be covered by unit tests, in both Taint API, validation, and scheduler sides: +All core changes must be covered by unit tests across the Taint API, validation, and scheduler sides.
Tests must specifically cover leading zeros behavior (e.g., `"0950"` vs `"950"`): - `staging/src/k8s.io/api/core/v1/toleration_test.go`: Sep-16-2025 - 66.7% - `staging/src/k8s.io/component-helpers/scheduling/corev1/helpers_test.go`: Sep-16-2025 - 100% From ca32660e41ec1ed245873287e43309df5d901914 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Mon, 29 Sep 2025 13:11:27 -0700 Subject: [PATCH 10/18] update feedback Signed-off-by: Heba Elayoty --- .../README.md | 37 ++++++++++--------- 1 file changed, 20 insertions(+), 17 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index e7f3cc34256..a765319fe12 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -18,7 +18,6 @@ - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional) - [Risks and Mitigations](#risks-and-mitigations) - [Scheduler Performance Regression](#scheduler-performance-regression) - - [API Compatibility and Version Skew](#api-compatibility-and-version-skew) - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) - [Taint Misconfiguration Detection](#taint-misconfiguration-detection) - [Cross-SIG Impact](#cross-sig-impact) @@ -428,17 +427,6 @@ spec: - Consider caching parsed values in scheduler data structures if performance issues arise - Feature gate allows disabling if performance problems occur -#### API Compatibility and Version Skew - -**Risk**: Pods using new operators cannot be scheduled if some schedulers don't support the feature, creating deployment failures during upgrades. 
- -**Mitigation**: - -- Feature gate prevents usage until all components are upgraded -- Clear upgrade documentation specifying component upgrade order -- Backward compatibility testing ensures existing workloads continue functioning -- Gradual rollout recommendations for production clusters - #### Edge Cases in Numeric Parsing **Risk**: Unexpected behavior with edge cases like integer overflow, leading zeros, or malformed input could cause scheduling failures. Leading zeros in values (e.g., `"0950"`) could create user confusion about whether values are treated as strings or numbers. @@ -779,16 +767,18 @@ No. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -Yes, a new metrics: +Yes, a new metric: -- `scheduler_numeric_taint_evaluations_total`: tracks each numeric evaluation with its result. -- `scheduler_numeric_tolerations_total`: tracks successful scheduling with numeric tolerations. -These metrics provide visibility into: +- `scheduler_numeric_tolerations_total`: tracks successful pod scheduling with numeric tolerations (aggregated count, not per-evaluation). + +This metric provides visibility into: 1. How frequently the numeric toleration feature is being used -2. The effectiveness of numeric taint/toleration matching +2. Overall adoption and usage patterns 3. Per-profile usage patterns for multi-scheduler setups +Note: We intentionally avoid tracking each individual numeric evaluation to prevent metric explosion in large clusters. + In addition, the scheduler has an existing `scheduler_unschedulable_pods` metric that handles the multiple failure reasons by incrementing for each plugin that rejects a pod. ### Dependencies @@ -866,4 +856,17 @@ There are many different alternatives were considered: - No eviction semantics - Labels aren't meant for operational constraints. +4. 
**Add Separate `NumValue int64` Field:** Add a dedicated numeric field alongside the existing `Value string` field in Taint/Toleration structs. + - **Pros:** + - Eliminates parsing overhead and errors + - Type-safe integer handling + - No concerns about leading zeros or malformed values + - Better performance for numeric comparisons + - **Cons:** + - Not aesthetically pleasing API design with dual fields + - Users might set wrong field or both fields accidentally + - Complex validation logic for field combinations + - Memory/storage overhead for additional field + - API complexity and documentation burden + ## Infrastructure Needed (Optional) From 064163b00253d2c8040ed06e78184c4160cb6311 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Mon, 29 Sep 2025 13:26:27 -0700 Subject: [PATCH 11/18] update mitigations Signed-off-by: Heba Elayoty --- keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index a765319fe12..ed4ec82b554 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -449,8 +449,7 @@ spec: - Clear documentation and examples showing proper numeric taint configuration - Enhanced error messages in scheduling events that clearly indicate parsing failures -- Scheduler logging for taint parsing failures to help cluster admins identify misconfigured nodes even when pods successfully schedule on other nodes with valid numeric taints -- Monitoring and alerting on scheduling failures due to taint parsing errors +- Users can use the metric to set up alerts and monitoring. 
#### Cross-SIG Impact From 2f80e5c635a74603a4c6292e93ff6bc9434c1c17 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Fri, 3 Oct 2025 16:12:56 +0000 Subject: [PATCH 12/18] feedback comments Signed-off-by: Heba Elayoty --- .../5471-enable-sla-based-scheduling/README.md | 18 +++++++++--------- .../5471-enable-sla-based-scheduling/kep.yaml | 11 +++++------ 2 files changed, 14 insertions(+), 15 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index ed4ec82b554..fe6d4ca68dd 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -76,9 +76,9 @@ Extend **core/v1 Toleration** to support **numeric comparison operators** when m - New operators: `Lt`, `Gt` (in addition to existing `Equal`/`Exists`). - Primary motivation: allow pods to opt‑in to nodes by `SLA/failure‑probability` values published as taints (e.g., `node.kubernetes.io/sla=950`). -- Scheduler impact is limited to the existing TaintToleration Filter; no new stages or algorithms. +- Scheduler impact is limited to the existing TaintToleration plugin; no new stages or algorithms. -This preserves the well‑understood safety model of taints/tolerations (eviction via`NoExecute`) while enabling threshold‑based placement similar to numeric NodeAffinity, but with better operational semantics. +This preserves the well‑understood safety model of taints/tolerations (eviction via `NoExecute`) while enabling threshold‑based placement similar to numeric NodeAffinity, but with better operational semantics. 
## Motivation @@ -149,7 +149,7 @@ spec: value: "800" effect: NoSchedule --- -# Cost-optimized workload explicitly tolerates SLA >= 750 +# Cost-optimized workload explicitly tolerates SLA > 750 apiVersion: v1 kind: Pod spec: @@ -190,7 +190,7 @@ spec: #### Story 2 — AI inference service with strict SLOs -As an AI platform engineer, I want to ensure my latency-critical inference pods only run on nodes with SLA ≥ 95%, and I want them to be evicted if the node's SLA rating drops below that threshold. +As an AI platform engineer, I want to ensure my latency-critical inference pods only run on nodes with SLA > 95%, and I want them to be evicted if the node's SLA rating drops below that threshold. Taints and tolerations with numeric comparisons give me this eviction capability, which NodeAffinity cannot provide. @@ -208,7 +208,7 @@ spec: value: "950" effect: NoExecute --- -# Inference service requires SLA >= 950 with 30s grace period +# Inference service requires SLA > 950 with 30s grace period apiVersion: apps/v1 kind: Deployment metadata: @@ -226,7 +226,7 @@ spec: #### Story 3 — AI training workload balancing cost and reliability -As an ML engineer running large distributed training, I want to run most worker pods on cheaper spot GPU nodes, but keep certain roles (e.g., parameter servers, checkpoint writers) on SLA ≥ 99.9% on-demand GPUs. +As an ML engineer running large distributed training, I want to run most worker pods on cheaper spot GPU nodes, but keep certain roles (e.g., parameter servers, checkpoint writers) on SLA > 99.9% on-demand GPUs. With numeric tolerations, I can opt-in only the pods that are safe to run on spot, while letting the cluster's default taints repel all others. 
@@ -312,7 +312,7 @@ spec: - name: gpu deviceClassName: nvidia-a100 tolerations: - # Only accept GPUs with SLA >= 950 (95%) + # Only accept GPUs with SLA > 950 (95%) - key: node.kubernetes.io/sla operator: Gt value: "950" @@ -406,7 +406,7 @@ spec: - **Strict Validation**: Unlike existing `Equal`/`Exists` operators which accept any string values, numeric operators require valid integer strings. This may catch existing invalid configurations. -- **Leading Zeros Validation**: The API validation will reject taint and toleration values that contain leading zeros (e.g., `"0950"`, `"007"`) when used with numeric operators (`Lt`, `Gt`). This ensures consistent behavior and prevents the ambiguity between string and numeric interpretations. Only values without leading zeros are accepted (e.g., `"950"`, `"7"`). +- **Leading Zeros Validation**: The API validation will reject taint and toleration values that contain leading zeros (e.g., `"0950"`, `"007"`) when used with numeric operators (`Lt`, `Gt`). This ensures consistent behavior and prevents the ambiguity between string and numeric interpretations. Only values without leading zeros are accepted (e.g., `"950"`, `"7"`). Zero `0` as a value is accepted though. - **Parsing Overhead**: Each taint/toleration match with numeric operators requires integer parsing. @@ -494,7 +494,7 @@ const ( - **Intolerated taints**: Count against the node's score. - **Scoring**: Unchanged - nodes with fewer intolerable `PreferNoSchedule` taints receive higher scores. -This maintains consistent soft-preference behavior while enabling threshold-based SLA matching. For example, A pod requiring SLA > 95% will prefer nodes with SLA ≥ 950 over nodes with SLA < 950, but won't be blocked from scheduling on lower-SLA nodes if higher-SLA capacity is unavailable. +This maintains consistent soft-preference behavior while enabling threshold-based SLA matching. 
For example, a pod requiring SLA > 95% will prefer nodes with SLA > 950 over nodes with SLA < 950, but won't be blocked from scheduling on lower-SLA nodes if higher-SLA capacity is unavailable. ### Implementation diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml index a1d4aeb5b69..c5af3b90476 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml @@ -1,18 +1,18 @@ title: Extended Toleration Operators for Threshold-Based Placement kep-number: 5471 authors: - - "@jane.doe" + - "@helayoty" owning-sig: sig-scheduling participating-sigs: - sig-apps status: implementable creation-date: 2025-08-08 reviewers: - - "@SergeyKanzhelev" -approvers: - - "@macsko" - "@dom4ha" - "@sanposhiho" +approvers: + - "@macsko" + # The target maturity stage in the current dev cycle for this KEP. # If the purpose of this KEP is to deprecate a user-visible feature @@ -39,5 +39,4 @@ disable-supported: true # The following PRR answers are required at beta release metrics: - - scheduler_numeric_taint_evaluations_total - - scheduler_numeric_tolerations_total{operator="Gt|Lt"} + - scheduler_numeric_tolerations_total From 30995a2e5dce23ff75a96a3402592833de11a4da Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Tue, 7 Oct 2025 12:20:21 +0000 Subject: [PATCH 13/18] update checklist Signed-off-by: Heba Elayoty --- .../sig-scheduling/5471-enable-sla-based-scheduling/README.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index fe6d4ca68dd..624038c6e59 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -56,8 +56,8 @@ Items marked with (R) are required *prior to targeting to a
milestone / release*. - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) -- [ ] (R) KEP approvers have approved the KEP status as `implementable` -- [ ] (R) Design details are appropriately documented +- [x] (R) KEP approvers have approved the KEP status as `implementable` +- [x] (R) Design details are appropriately documented - [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - [ ] e2e Tests for all Beta API Operations (endpoints) - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) From 0e9419169a608eb8b679d10bd240e0888fecf0a9 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Mon, 13 Oct 2025 17:01:16 +0000 Subject: [PATCH 14/18] address review feedback Signed-off-by: Heba Elayoty --- .../README.md | 100 ++++++++++++------ .../5471-enable-sla-based-scheduling/kep.yaml | 3 +- 2 files changed, 72 insertions(+), 31 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 624038c6e59..ad146218494 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -405,9 +405,6 @@ spec: - **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators. - **Strict Validation**: Unlike existing `Equal`/`Exists` operators which accept any string values, numeric operators require valid integer strings. This may catch existing invalid configurations. - -- **Leading Zeros Validation**: The API validation will reject taint and toleration values that contain leading zeros (e.g., `"0950"`, `"007"`) when used with numeric operators (`Lt`, `Gt`). 
This ensures consistent behavior and prevents the ambiguity between string and numeric interpretations. Only values without leading zeros are accepted (e.g., `"950"`, `"7"`). Zero `0` as a value is accepted though. - - **Parsing Overhead**: Each taint/toleration match with numeric operators requires integer parsing. - Invalid taints meant to be used with the new comparison operators (e.g., `node.kubernetes.io/sla=95.5` and `node.kubernetes.io/version=1`) are not detected at admission time. @@ -621,10 +618,15 @@ All core changes must be covered by unit tests, in both Taint API, validation, a ##### Integration tests -The following scenarios need to be covered in integration tests: +Update the following integration tests to include new operators: -- Feature gate's enabling/disabling -- **Scheduler Integration Tests:** will be extended to cover the new taints cases introduced in this KEP:(test/integration/scheduler) +1. **TestTaintTolerationFilter:** (`filters/filters_test.go`) +2. **TestTaintTolerationScoring:** (`scoring/priorities_test.go`) +3. **TestTaintNodeByCondition:** (`taint/taint_test.go`) +4. **General Scheduler Tests:** (`scheduler_test.go`): + - Dynamic taint addition/removal + - Pod rescheduling after taint changes + - Integration with NodeAffinity ##### e2e tests @@ -660,6 +662,15 @@ The existing e2e tests will be extended to cover the new taints cases introduced - Enable the feature gate in both API Server and Scheduler. - Downgrade - Disable the feature gate in both API Server and Scheduler + +**What happens when the scheduler doesn't recognize Gt/Lt operators:** + +When the feature gate is disabled and the scheduler encounters a pod with `Gt`/`Lt` operator: + +- The toleration filter returns `false` (doesn't match) +- Pod is considered to have untolerated taints +- Filter returns `UnschedulableAndUnresolvable` status +- Pod remains in Pending state. 
### Version Skew Strategy @@ -686,11 +697,31 @@ No ###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? -Yes. +Yes, but with caveats for existing workloads using the new operators. + +Impact on existing pods with Gt/Lt operators when feature is disabled: + +1. **Already-running pods**: Continue running normally. The kubelet doesn't need to re-evaluate tolerations for running pods. + +2. **Unscheduled/pending pods**: + - Remain in the cluster but cannot be scheduled + - The scheduler's TaintToleration plugin won't recognize Gt/Lt operators and will treat them as non-matching + - These pods will remain in Pending state with events indicating untolerated taints + +3. **New pod creation**: + - API server validation will **reject** new pods with Gt/Lt operators + - Error: `spec.tolerations[].operator: Unsupported value: "Gt": supported values: "Equal", "Exists"` + +4. **Pod updates**: + - Cannot update existing pods (even those already in etcd) if they contain Gt/Lt operators + - Validation runs on update and will reject the unsupported operators ###### What happens if we reenable the feature if it was previously rolled back? -SLA toleration will be respected again. +Extended toleration operators will be respected again: +- Existing pods with Gt/Lt operators in etcd become valid and schedulable +- New pods can be created with Gt/Lt operators +- The scheduler will properly evaluate numeric comparisons ###### Are there any tests for feature enablement/disablement? @@ -700,12 +731,22 @@ Tests have been added in the integration tests. See [Integration tests](#integra ###### How can a rollout or rollback fail? Can it impact already running workloads? -It shouldn't impact already running workloads. It's an opt-in feature. +**Rollout**: The feature enablement itself is safe and shouldn't impact existing workloads. It's an opt-in feature that only affects pods explicitly using Gt/Lt operators. 
+ +**Rollback**: Can impact workloads if not done carefully: + +1. **Running pods** with Gt/Lt operators: Will continue running (safe) +2. **Pending pods** with Gt/Lt operators: Will become stuck in Pending state, as: + - They remain in etcd but validation rejects them + - The scheduler won't recognize the operators + - Force deletion may be required: `kubectl delete pod --force --grace-period=0` +3. **Workload controllers** (Deployments, StatefulSets, etc.): + - If the pod template uses Gt/Lt operators, the controller cannot create new pods + - Rolling updates will fail ###### What specific metrics should inform a rollback? - `scheduler_scheduling_duration_seconds` -- `scheduler_scheduling_attempts_total` - `apiserver_request_total` ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? @@ -723,10 +764,16 @@ No. 1. **Metrics**: ```promql - # Number of pods using numeric tolerations - scheduler_numeric_tolerations_total > 0 + # Number of pods evaluated by TaintToleration plugin + scheduler_plugin_evaluation_total{plugin="TaintToleration"} > 0 - # Rate of numeric comparison operations + # Monitor rate of pods rejected by TaintToleration plugin + rate(scheduler_plugin_evaluation_total{plugin="TaintToleration", status=~"Unschedulable.*"}[5m]) + + # Rate of successful evaluations + rate(scheduler_plugin_evaluation_total{plugin="TaintToleration", status="Success"}[5m]) + + # Plugin execution duration rate(scheduler_framework_extension_point_duration_seconds{plugin="TaintToleration"}[5m]) ``` @@ -735,20 +782,19 @@ No. 
```bash # Check for pods with numeric toleration operators kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Gt")]}{"\n"}{end}' | grep -v "^[^:]*: *$" + kubectl get pods -A -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.tolerations[?(@.operator=="Lt")]}{"\n"}{end}' | grep -v "^[^:]*: *$" - # Count nodes with numeric taints (SLA example) - kubectl get nodes -o jsonpath='{range .items[*]}{.spec.taints[?(@.key=="node.kubernetes.io/sla")]}{"\n"}{end}' | wc -l ``` ###### How can someone using this feature know that it is working for their instance? - [x] Events - Event Reason: FailedScheduling - - Event Message: "node(s) had untolerated taint `node.kubernetes.io/sla`: `950`" + - Event Message: "node(s) had untolerated taint {: }" (e.g., with numeric taint) - [x] API .spec.taints - - Other field: `key: node.kubernetes.io/sla` + - Observe taints values on nodes - [x] API .spec.tolerations - - Other field: `node.kubernetes.io/sla` + - Observe tolerations with `operator: Gt` or `operator: Lt` on pods ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? @@ -756,27 +802,21 @@ No. - [x] Metrics - Metric name: - - `scheduler_scheduling_attempts_total` - `scheduler_framework_extension_point_duration_seconds` + - `scheduler_plugin_evaluation_total` - Components exposing the metric: `kube-scheduler` - - Metric name: - - `kube_pod_status_phase` - - `kube_pod_status_scheduled_time` - - Components exposing the metric: `kube-apiserver` ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -Yes, a new metric: +Yes, an extension to an existing metric: -- `scheduler_numeric_tolerations_total`: tracks successful pod scheduling with numeric tolerations (aggregated count, not per-evaluation). 
+**Extend `scheduler_plugin_evaluation_total` with a `status` label** -This metric provides visibility into: +Currently, `scheduler_plugin_evaluation_total` tracks plugin evaluation counts with labels: `plugin`, `extension_point`, `profile`. We propose adding a `status` label (similar to `scheduler_plugin_execution_duration_seconds`) to enable monitoring of plugin outcomes, including errors. -1. How frequently the numeric toleration feature is being used -2. Overall adoption and usage patterns -3. Per-profile usage patterns for multi-scheduler setups +The status label will use framework status codes: `Success`, `Unschedulable`, `UnschedulableAndUnresolvable`, `Error`, etc. -Note: We intentionally avoid tracking each individual numeric evaluation to prevent metric explosion in large clusters. + >Note: Currently, integer parsing failures for Gt/Lt operators result in the toleration not matching (returning `Unschedulable` status), similar to how label selectors behave. This means parsing errors are not distinguished from legitimate mismatches in metrics. Future enhancements could modify the implementation to return `Error` status for parsing failures to improve debuggability. In addition, the scheduler has an existing `scheduler_unschedulable_pods` metric that handles the multiple failure reasons by incrementing for each plugin that rejects a pod. 
diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml index c5af3b90476..a22f3e632c6 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/kep.yaml @@ -39,4 +39,5 @@ disable-supported: true # The following PRR answers are required at beta release metrics: - - scheduler_numeric_tolerations_total + - scheduler_framework_extension_point_duration_seconds + - scheduler_plugin_evaluation_total From 753e147d87df509dda92d1f3a30ce8aa6d812003 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Tue, 14 Oct 2025 16:41:31 +0000 Subject: [PATCH 15/18] address upgrade/roolback feedback Signed-off-by: Heba Elayoty --- .../5471-enable-sla-based-scheduling/README.md | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index ad146218494..443970a11c0 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -37,6 +37,8 @@ - [Beta](#beta) - [GA](#ga) - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Upgrade](#upgrade) + - [Downgrade](#downgrade) - [Version Skew Strategy](#version-skew-strategy) - [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) - [Feature Enablement and Rollback](#feature-enablement-and-rollback) @@ -658,10 +660,11 @@ The existing e2e tests will be extended to cover the new taints cases introduced ### Upgrade / Downgrade Strategy -- Upgrade - - Enable the feature gate in both API Server and Scheduler. -- Downgrade - - Disable the feature gate in both API Server and Scheduler +#### Upgrade + Enable the feature gate in kube-apiserver first then kube-scheduler. 
This ensures the API server can accept and validate pods with the new operators before the kube-scheduler tries to process them. + +#### Downgrade + Disable the feature gate in kube-scheduler first, then kube-apiserver. This stops the kube-scheduler from processing the new operators before the API server stops accepting new pods with those operators, which prevents the scheduler from trying to handle features the API server would reject. **What happens when the scheduler doesn't recognize Gt/Lt operators:** @@ -735,8 +738,8 @@ Tests have been added in the integration tests. See [Integration tests](#integra **Rollback**: Can impact workloads if not done carefully: -1. **Running pods** with Gt/Lt operators: Will continue running (safe) -2. **Pending pods** with Gt/Lt operators: Will become stuck in Pending state, as: +1. **Running pods** with Gt/Lt operators: continue running (safe) +2. **Pending pods** with Gt/Lt operators: become stuck in Pending state, as: - They remain in etcd but validation rejects them - The scheduler won't recognize the operators - Force deletion may be required: `kubectl delete pod --force --grace-period=0` From dac77917c846d90ec5ccd187b3ef8ca8db29f206 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Wed, 15 Oct 2025 14:31:24 +0000 Subject: [PATCH 16/18] Address PRR comments Signed-off-by: Heba Elayoty --- .../README.md | 88 +++++++++++++++---- 1 file changed, 71 insertions(+), 17 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 443970a11c0..e73d6c6e7a1 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -20,6 +20,7 @@ - [Scheduler Performance Regression](#scheduler-performance-regression) - [Edge Cases in Numeric Parsing](#edge-cases-in-numeric-parsing) - [Taint Misconfiguration
Detection](#taint-misconfiguration-detection) + - [Controller Hot-Loop When Feature Gate is Disabled](#controller-hot-loop-when-feature-gate-is-disabled) - [Cross-SIG Impact](#cross-sig-impact) - [Design Details](#design-details) - [API Changes](#api-changes) @@ -76,7 +77,7 @@ Items marked with (R) are required *prior to targeting to a milestone / release* Extend **core/v1 Toleration** to support **numeric comparison operators** when matching **Node Taints**: -- New operators: `Lt`, `Gt` (in addition to existing `Equal`/`Exists`). +- New operators: `Lt`, `Gt` (in addition to existing `Equal`/`Exists`). These operators are already used in NodeAffinity/NodeSelector, so users are familiar with them. - Primary motivation: allow pods to opt‑in to nodes by `SLA/failure‑probability` values published as taints (e.g., `node.kubernetes.io/sla=950`). - Scheduler impact is limited to the existing TaintToleration plugin; no new stages or algorithms. @@ -167,7 +168,7 @@ metadata: name: flexible-sla-workload spec: tolerations: - # Accept nodes with SLA >= 900 (SLA = 900 OR SLA > 900) + # Accept nodes with SLA > 900 - key: node.kubernetes.io/sla operator: Equal value: "900" @@ -400,19 +401,24 @@ spec: - **Integer-Only Support**: The implementation supports signed 64-bit integers only. Pod specs containing toleration values with decimal numbers (e.g., `"95.5"`) will be rejected by the API server during validation when using numeric comparison operators. -- **Parsing Requirements**: The toleration value must be parseable as integers for numeric operators (`Lt`, `Gt`). If fails parsing, the toleration does not match. +- **Parsing Requirements**: The toleration value must be parseable as integers for numeric operators (`Lt`, `Gt`). If parsing fails, the toleration does not match. - > Note: A taint like `foo=95.5:NoSchedule` is valid since taint values follow label values syntax, which allows. The numeric parsing/validation is enforced on toleration *only*. 
+- **Non-Numeric Taint Values**: When a pod toleration uses `Lt` or `Gt` operators, it only matches taints with numeric values. If a node has a taint with a non-numeric value, the toleration will not match, and the pod cannot schedule on that node. + + **Example**: + - Node taint: `node.kubernetes.io/sla=high:NoSchedule` + - Pod toleration: `{key: "node.kubernetes.io/sla", operator: "Gt", value: "900"}` + - **Result**: Toleration does not match and pod cannot schedule on this node + - The pod remains `Pending` and can schedule on other nodes with valid numeric taints + - The pod is not failed or rejected entirely + + > Note: Taint values are not validated at node registration time. A taint like `foo=95.5:NoSchedule` or `foo=high:NoSchedule` is valid since taint values follow label value syntax. Numeric parsing and validation only occurs during scheduling when matching against tolerations with `Lt`/`Gt` operators. - **Alpha Restrictions**: When `TaintTolerationComparisonOperators=false`, the API server rejects pods using the new operators. - **Strict Validation**: Unlike existing `Equal`/`Exists` operators which accept any string values, numeric operators require valid integer strings. This may catch existing invalid configurations. - **Parsing Overhead**: Each taint/toleration match with numeric operators requires integer parsing. -- Invalid taints meant to be used with the new comparison operators (e.g., `node.kubernetes.io/sla=95.5` and `node.kubernetes.io/version=1`) are not detected at admission time. - -- **Taint Misconfiguration Risk**: When nodes have taints with non-numeric values (e.g., `node.kubernetes.io/sla=high` instead of `node.kubernetes.io/sla=950`) that are intended for use with numeric operators, the misconfiguration is only detected during pod scheduling attempts, not at taint creation time. This can lead to scheduling failures that are difficult to diagnose. 
- ### Risks and Mitigations #### Scheduler Performance Regression @@ -442,13 +448,27 @@ spec: #### Taint Misconfiguration Detection -**Risk**: Node taints intended for numeric comparison may contain non-numeric values (e.g., `node.kubernetes.io/sla=high` instead of `node.kubernetes.io/sla=950`), causing scheduling failures that are only detected during pod placement attempts rather than at taint creation time. +**Risk**: Node taints intended for numeric comparison may contain non-numeric values (e.g., `node.kubernetes.io/sla=high` instead of `node.kubernetes.io/sla=950`). Since taint values are not validated at node registration time, these misconfigurations are only detected during scheduling when a pod with `Lt`/`Gt` tolerations attempts to match. This can lead to pods remaining in `Pending` state without clear indication of the root cause. **Mitigation**: -- Clear documentation and examples showing proper numeric taint configuration -- Enhanced error messages in scheduling events that clearly indicate parsing failures -- Users can use the metric to set up alerts and monitoring. +- Pod validation: Current validation strictly enforces that only `Equal` and `Exists` operators are allowed. Users with numeric taint values today must explicitly change the operator to `Lt` or `Gt`, at which point pod-side validation will catch non-numeric toleration values and reject the pod spec before scheduling. + +#### Controller Hot-Loop When Feature Gate is Disabled + +**Risk**: If a workload controller (Deployment, StatefulSet, Job, etc.) has a pod template that uses `Lt` or `Gt` operators, and the feature gate is disabled or was disabled after being enabled, the controller will enter a hot-loop: + +1. Controller attempts to create a pod from the template +2. API server validation rejects the pod with error: `Unsupported value: "Gt": supported values: "Equal", "Exists"` +3. 
Controller immediately retries pod creation, and this cycle repeats indefinitely + +This is particularly problematic during rollback/downgrade scenarios or for multi-cluster deployments where the feature gate state differs across clusters. + +**Mitigation**: + +- Before disabling the feature gate, cluster operators should identify all workloads using `Lt`/`Gt` operators via API discovery or scanning tools +- The upgrade/downgrade documentation should explicitly warn about this scenario and provide steps to identify affected workloads +- The `apiserver_request_total` metric can be used to detect hot-loop conditions #### Cross-SIG Impact @@ -674,6 +694,7 @@ When the feature gate is disabled and the scheduler encounters a pod with `Gt`/` - Pod is considered to have untolerated taints - Filter returns `UnschedulableAndUnresolvable` status - Pod remains in Pending state. + - Feature gate on/off test cases ### Version Skew Strategy @@ -738,19 +759,24 @@ Tests have been added in the integration tests. See [Integration tests](#integra **Rollback**: Can impact workloads if not done carefully: -1. **Running pods** with Gt/Lt operators: continue running (safe) -2. **Pending pods** with Gt/Lt operators: become stuck in Pending state, as: +1. Running pods with Gt/Lt operators: continue running (safe) +2. Pending pods with Gt/Lt operators: become stuck in Pending state, as: - They remain in etcd but validation rejects them - The scheduler won't recognize the operators - Force deletion may be required: `kubectl delete pod --force --grace-period=0` -3. **Workload controllers** (Deployments, StatefulSets, etc.): +3. Workload controllers (Deployments, StatefulSets, etc.): - If the pod template uses Gt/Lt operators, the controller cannot create new pods - Rolling updates will fail + + **Recommended rollback procedure to prevent hot loop**: + 1. Update identified workloads to use `Equal` or remove numeric tolerations + 2. Delete pending pods that use `Lt`/`Gt` operators + 3.
Disable the feature gate in kube-scheduler first, then kube-apiserver ###### What specific metrics should inform a rollback? -- `scheduler_scheduling_duration_seconds` -- `apiserver_request_total` +- `scheduler_scheduling_duration_seconds`: Increased scheduling latency may indicate performance issues with numeric parsing +- `apiserver_request_total`: Spike in validation errors may indicate controller hot-loops ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? @@ -911,4 +937,32 @@ There are many different alternatives were considered: - Memory/storage overhead for additional field - API complexity and documentation burden +5. **Use Existing `Equal` Operator with Numeric Values (No New Operators):** + + Instead of introducing `Lt`/`Gt`, use the existing `Equal` operator with numeric taint values. For example: + - Node: `node.kubernetes.io/sla=950:NoSchedule` + - Pod: `{key: "node.kubernetes.io/sla", operator: "Equal", value: "950"}` + + **Pros:** + - No API changes needed + + **Cons:** + - Pods must specify exact SLA values, not ranges. A pod cannot say "accept any node with SLA > 950" + - Multiple tolerations required: If nodes have varying SLA values (e.g., 950, 960, 970, 980, 990), pods need separate `Equal` tolerations for each value they're willing to accept: + ```yaml + tolerations: + - key: node.kubernetes.io/sla + operator: Equal + value: "950" + - key: node.kubernetes.io/sla + operator: Equal + value: "960" + - key: node.kubernetes.io/sla + operator: Equal + value: "970" + # ...
and so on + ``` + - Poor semantics for "best effort" workloads since you can't easily express "I'll take any spot/preemptible node regardless of SLA" without enumerating all possible low-SLA values + - Changes to node SLA classification schemes require updating all pod manifests + ## Infrastructure Needed (Optional) From 20465c4d375e2387186f880827ee020b1064fb75 Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Wed, 15 Oct 2025 15:14:43 +0000 Subject: [PATCH 17/18] Remove taints from discoverable options Signed-off-by: Heba Elayoty --- keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index e73d6c6e7a1..7fc482e7d45 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -820,8 +820,6 @@ No. - [x] Events - Event Reason: FailedScheduling - Event Message: "node(s) had untolerated taint {: }" (e.g., with numeric taint) -- [x] API .spec.taints - - Observe taints values on nodes - [x] API .spec.tolerations - Observe tolerations with `operator: Gt` or `operator: Lt` on pods From c35017a1406a3131644beeec4a740f880093937e Mon Sep 17 00:00:00 2001 From: Heba Elayoty Date: Wed, 15 Oct 2025 15:17:54 +0000 Subject: [PATCH 18/18] Add fearture gate to integration tests Signed-off-by: Heba Elayoty --- keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md index 7fc482e7d45..44d762cb4cf 100644 --- a/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md +++ b/keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md @@ -649,6 +649,7 @@ Update the following integration tests to include new operators: - Dynamic 
taint addition/removal - Pod rescheduling after taint changes - Integration with NodeAffinity + - Feature gate on/off ##### e2e tests