Promote Improved multi-numa alignment in Topology Manager to beta #4079

Merged: 1 commit, merged Jun 15, 2023.

keps/prod-readiness/sig-node/3545.yaml: 4 changes (3 additions, 1 deletion)
@@ -1,3 +1,5 @@
 kep-number: 3545
 alpha:
-  approver: "@johnbelamaric"
+  approver: "@johnbelamaric"
+beta:
+  approver: "@johnbelamaric"

keps/sig-node/3545-improved-multi-numa-alignment/README.md: 38 changes (29 additions, 9 deletions)
@@ -71,18 +71,18 @@ Items marked with (R) are required *prior to targeting to a milestone / release*

 - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
 - [ ] (R) KEP approvers have approved the KEP status as `implementable`
-- [ ] (R) Design details are appropriately documented
-- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+- [x] (R) Design details are appropriately documented
+- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
 - [ ] e2e Tests for all Beta API Operations (endpoints)
 - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
 - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
-- [ ] (R) Graduation criteria is in place
+- [x] (R) Graduation criteria is in place
 - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
-- [ ] (R) Production readiness review completed
+- [x] (R) Production readiness review completed
 - [ ] (R) Production readiness review approved
-- [ ] "Implementation History" section is up-to-date for milestone
-- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
-- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+- [x] "Implementation History" section is up-to-date for milestone
+- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

 <!--
 **Note:** This checklist is iterative and should be reviewed and updated every time this enhancement is being considered for a milestone.
@@ -252,6 +252,7 @@ to implement this enhancement.
 ##### Unit tests

 - `k8s.io/kubernetes/pkg/kubelet/cm/topologymanager`: `09-23-2022` - `92.4`
+- `k8s.io/kubernetes/pkg/kubelet/cm/topologymanager`: `06-12-2023` - `93.2`

 ##### Integration tests

@@ -302,6 +303,12 @@ When an option graduates, its visibility should be moved to be controlled by the
 The introduction of these feature gates gives us the ability to move the option to beta and later stable without implying that all available options are stable.
 This approach is similar to the graduation criteria for `CPUManagerPolicyOptions` introduced [here](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2625-cpumanager-policies-thread-placement#graduation-criteria-of-options).

+In 1.28 this feature is being promoted to Beta. We propose the following changes to the default visibility of TopologyManager policy options:
+
+- The `TopologyManagerPolicyOptions` feature flag, which enables/disables the entire feature, will be enabled by default.
+- The `TopologyManagerPolicyBetaOptions` feature flag, which enables/disables beta options, will be enabled by default.
+- `prefer-closest-numa-nodes` will be moved to the Beta options.
+
 The graduation criteria for options are described below:

 #### Graduation of Options to `Beta-quality` (non-hidden)
@@ -378,7 +385,7 @@ No.

 ###### How can an operator determine if the feature is in use by workloads?

-Inspect the kubelet configuration of the nodes: check feature gate and usage of the new option
+Inspect the kubelet configuration of the nodes: check the feature gate and the usage of the new option.

 ###### How can someone using this feature know that it is working for their instance?

@@ -434,14 +441,26 @@ No.

 No.

+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+No.
+
 ### Troubleshooting

 ###### How does this feature react if the API server and/or etcd is unavailable?

 N/A.

 ###### What are other known failure modes?

-TBD.
+There are two scenarios in which the kubelet may fail to start because of this feature:
+
+- A bad policy option name, or a policy option used without enabling the appropriate feature flag. The kubelet fails to start and prints an error message explaining what happened.
+  To recover, fix the policy option name or enable/disable the relevant feature flags.
+
+- cAdvisor does not expose distances for NUMA domains. In this case the kubelet fails with the `error getting NUMA distances from cadvisor` message.
+  NUMA distances are read only when the `prefer-closest-numa-nodes` option is specified.
+  To recover, either disable the `TopologyManagerPolicyOptions` feature flag or stop using the `prefer-closest-numa-nodes` option.

 ###### What steps should be taken if SLOs are not being met to determine the problem?

@@ -450,3 +469,4 @@ N/A.
 ## Implementation History

 - 2021-09-26: KEP created
+- 2023-06-12: KEP updated for Beta release

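
For illustration only, here is a minimal kubelet configuration sketch for a node that opts into the Beta `prefer-closest-numa-nodes` option. It assumes the `KubeletConfiguration` (`kubelet.config.k8s.io/v1beta1`) fields `featureGates`, `topologyManagerPolicy`, and `topologyManagerPolicyOptions`. The `best-effort` policy is an arbitrary example choice, not a recommendation from this KEP. Under the 1.28 defaults proposed in the README change, both feature gates are already enabled, so listing them explicitly only documents the dependency.

```yaml
# Sketch only: a KubeletConfiguration fragment that enables prefer-closest-numa-nodes.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  # Enabled by default once the feature is Beta (1.28);
  # set explicitly here only to make the dependency visible.
  TopologyManagerPolicyOptions: true
  TopologyManagerPolicyBetaOptions: true
# Example policy chosen for illustration only.
topologyManagerPolicy: best-effort
topologyManagerPolicyOptions:
  # Option values are strings in the kubelet configuration.
  prefer-closest-numa-nodes: "true"
```

An operator can inspect these same fields to tell whether the option is in use on a node. Removing the `prefer-closest-numa-nodes` entry is the configuration-side recovery path for the cAdvisor NUMA-distance failure mode listed above.
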
keps/sig-node/3545-improved-multi-numa-alignment/kep.yaml: 4 changes (2 additions, 2 deletions)
@@ -14,12 +14,12 @@ see-also: []
 replaces: []

 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta

 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.26"
+latest-milestone: "v1.28"

 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone: