Promote Improved multi-numa alignment in Topology Manager to beta #4079
Conversation
PiotrProkop commented on Jun 12, 2023
- One-line PR description: Promote Improved multi-numa alignment in Topology Manager to beta
- Issue link: Improved multi-numa alignment in Topology Manager #3545
- Other comments:
/cc
Minor: you may want to check the Release Signoff Checklist
to see if you can/should check more items.
It seems to me there's a TBD left in the Production Readiness Questionnaire, could you please check?
Thanks for the review. Updated both sections.
👋 Taking a look at this as a PRR shadow, and I had a general question about troubleshooting and about operators understanding that the feature is working correctly.
There is a fair bit of detail on how someone can verify it's working by comparing specific workloads, but there isn't much detail on cluster operators really understanding this (including during upgrade/downgrade). Could you provide a little more thought/insight from that perspective? Specifically on a rollout, if the kubelet fails to start or crashes, how do we determine that those scenarios are due to this enhancement?
```diff
 N/A.

 ###### What are other known failure modes?

-TBD.
+No known failure modes.
```
On line 372, under the rollout / rollback section, this is mentioned:
> Kubelet may fail to start. The kubelet may crash.

Is that statement valid, and if so, could we identify what those failure modes might be? How does someone recover from that failure mode?
Thanks for pointing that out. I think there are only two scenarios where the kubelet can crash due to this feature:
- A bad policy option name. We already log an appropriate error for this; to recover, one just has to provide a correct option name or disable `TopologyManagerPolicyOptions`.
- cadvisor is not exposing distances for NUMA domains. We log this as well; to recover, one has to disable `TopologyManagerPolicyOptions`.

I'll update the KEP with those steps.
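For context, the options discussed here are set through the kubelet configuration file. The following is a sketch of what such a configuration might look like (field names should be verified against your Kubernetes version; `prefer-closest-numa-nodes` is the option introduced by this KEP):

```yaml
# Sketch of a KubeletConfiguration fragment enabling the feature.
# Verify field names and feature-gate names for your Kubernetes version.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  TopologyManagerPolicyOptions: true
topologyManagerPolicy: best-effort
topologyManagerPolicyOptions:
  # A misspelled key here is the "bad policy option name" scenario
  # described above: the kubelet would fail at startup and log an error.
  prefer-closest-numa-nodes: "true"
```

With this option set, the Topology Manager prefers sets of NUMA nodes with shorter inter-node distances when aligning resources, which is where the dependency on cadvisor-reported NUMA distances (the second failure mode above) comes from.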
Section updated.
Signed-off-by: pprokop <pprokop@nvidia.com>
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dchen1107, johnbelamaric, PiotrProkop. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.