Commit 280a593

update PRR checklist
Signed-off-by: Heba Elayoty <heelayot@microsoft.com>
1 parent a44c119 commit 280a593

1 file changed: +79 -53 lines changed
  • keps/sig-scheduling/5471-enable-sla-based-scheduling

keps/sig-scheduling/5471-enable-sla-based-scheduling/README.md

@@ -1,56 +1,82 @@
 # KEP-5471: Extended Toleration Operators for Threshold-Based Placement

 <!-- toc -->
-- [Release Signoff Checklist](#release-signoff-checklist)
-- [Summary](#summary)
-- [Motivation](#motivation)
-- [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone)
-- [Goals](#goals)
-- [Non-Goals](#non-goals)
-- [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads)
-- [Proposal](#proposal)
-- [User Stories (Optional)](#user-stories-optional)
-- [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes)
-- [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
-- [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
-- [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
-- [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
-- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
-- [Risks and Mitigations](#risks-and-mitigations)
-- [Scheduler Performance Regression](#scheduler-performance-regression)
-- [Taint Misconfiguration Detection](#taint-misconfiguration-detection)
-- [Controller Hot-Loop When Feature Gate is Disabled](#controller-hot-loop-when-feature-gate-is-disabled)
-- [Cross-SIG Impact](#cross-sig-impact)
-- [Design Details](#design-details)
-- [API Changes](#api-changes)
-- [Semantics](#semantics)
-- [Implementation](#implementation)
-- [Feature Gate Definition](#feature-gate-definition)
-- [Test Plan](#test-plan)
-- [Prerequisite testing updates](#prerequisite-testing-updates)
-- [Unit tests](#unit-tests)
-- [Performance tests](#performance-tests)
-- [Integration tests](#integration-tests)
-- [e2e tests](#e2e-tests)
-- [Graduation Criteria](#graduation-criteria)
-- [Alpha](#alpha)
-- [Beta](#beta)
-- [GA](#ga)
-- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
-- [Upgrade](#upgrade)
-- [Downgrade](#downgrade)
-- [Version Skew Strategy](#version-skew-strategy)
-- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
-- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
-- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
-- [Monitoring Requirements](#monitoring-requirements)
-- [Dependencies](#dependencies)
-- [Scalability](#scalability)
-- [Troubleshooting](#troubleshooting)
-- [Implementation History](#implementation-history)
-- [Drawbacks](#drawbacks)
-- [Alternatives](#alternatives)
-- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+- [KEP-5471: Extended Toleration Operators for Threshold-Based Placement](#kep-5471-extended-toleration-operators-for-threshold-based-placement)
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+- [Why not NodeAffinity alone?](#why-not-nodeaffinity-alone)
+- [Goals](#goals)
+- [Non-Goals](#non-goals)
+- [Benefits for implementing this feature for DRA and AI Workloads](#benefits-for-implementing-this-feature-for-dra-and-ai-workloads)
+- [Proposal](#proposal)
+- [User Stories (Optional)](#user-stories-optional)
+- [Story 1 — Cluster operator using mixed on-demand and spot nodes](#story-1--cluster-operator-using-mixed-on-demand-and-spot-nodes)
+- [Story 2 — AI inference service with strict SLOs](#story-2--ai-inference-service-with-strict-slos)
+- [Story 3 — AI training workload balancing cost and reliability](#story-3--ai-training-workload-balancing-cost-and-reliability)
+- [Story 4 — DRA GPU claim management](#story-4--dra-gpu-claim-management)
+- [Story 5 — DRA device-level error budget management](#story-5--dra-device-level-error-budget-management)
+- [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+- [Risks and Mitigations](#risks-and-mitigations)
+- [Scheduler Performance Regression](#scheduler-performance-regression)
+- [Taint Misconfiguration Detection](#taint-misconfiguration-detection)
+- [Controller Hot-Loop When Feature Gate is Disabled](#controller-hot-loop-when-feature-gate-is-disabled)
+- [Cross-SIG Impact](#cross-sig-impact)
+- [Design Details](#design-details)
+- [API Changes](#api-changes)
+- [Semantics](#semantics)
+- [Implementation](#implementation)
+- [Feature Gate Definition](#feature-gate-definition)
+- [Test Plan](#test-plan)
+- [Prerequisite testing updates](#prerequisite-testing-updates)
+- [Unit tests](#unit-tests)
+- [Performance tests](#performance-tests)
+- [Integration tests](#integration-tests)
+- [e2e tests](#e2e-tests)
+- [Graduation Criteria](#graduation-criteria)
+- [Alpha](#alpha)
+- [Beta](#beta)
+- [GA](#ga)
+- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+- [Upgrade](#upgrade)
+- [Downgrade](#downgrade)
+- [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+- [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
+- [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
+- [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
+- [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
+- [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
+- [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+- [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads)
+- [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
+- [Were upgrade and rollback tested? Was the upgrade-\>downgrade-\>upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
+- [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc)
+- [Monitoring Requirements](#monitoring-requirements)
+- [How can an operator determine if the feature is in use by workloads?](#how-can-an-operator-determine-if-the-feature-is-in-use-by-workloads)
+- [How can someone using this feature know that it is working for their instance?](#how-can-someone-using-this-feature-know-that-it-is-working-for-their-instance)
+- [What are the reasonable SLOs (Service Level Objectives) for the enhancement?](#what-are-the-reasonable-slos-service-level-objectives-for-the-enhancement)
+- [What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?](#what-are-the-slis-service-level-indicators-an-operator-can-use-to-determine-the-health-of-the-service)
+- [Are there any missing metrics that would be useful to have to improve observability of this feature?](#are-there-any-missing-metrics-that-would-be-useful-to-have-to-improve-observability-of-this-feature)
+- [Dependencies](#dependencies)
+- [Does this feature depend on any specific services running in the cluster?](#does-this-feature-depend-on-any-specific-services-running-in-the-cluster)
+- [Scalability](#scalability)
+- [Will enabling / using this feature result in any new API calls?](#will-enabling--using-this-feature-result-in-any-new-api-calls)
+- [Will enabling / using this feature result in introducing new API types?](#will-enabling--using-this-feature-result-in-introducing-new-api-types)
+- [Will enabling / using this feature result in any new calls to the cloud provider?](#will-enabling--using-this-feature-result-in-any-new-calls-to-the-cloud-provider)
+- [Will enabling / using this feature result in increasing size or count of the existing API objects?](#will-enabling--using-this-feature-result-in-increasing-size-or-count-of-the-existing-api-objects)
+- [Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?](#will-enabling--using-this-feature-result-in-increasing-time-taken-by-any-operations-covered-by-existing-slisslos)
+- [Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?](#will-enabling--using-this-feature-result-in-non-negligible-increase-of-resource-usage-cpu-ram-disk-io--in-any-components)
+- [Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?](#can-enabling--using-this-feature-result-in-resource-exhaustion-of-some-node-resources-pids-sockets-inodes-etc)
+- [Troubleshooting](#troubleshooting)
+- [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
+- [What are other known failure modes?](#what-are-other-known-failure-modes)
+- [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
 <!-- /toc -->

 ## Release Signoff Checklist
@@ -60,14 +86,14 @@ Items marked with (R) are required *prior to targeting to a milestone / release*
 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [x] (R) KEP approvers have approved the KEP status as `implementable`
 - [x] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
 - [ ] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
-- [ ] (R) Production readiness review approved
+- [x] (R) Production readiness review completed
+- [x] (R) Production readiness review approved
 - [x] "Implementation History" section is up-to-date for milestone
 - [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
 - [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes