Promote Improved multi-numa alignment in Topology Manager to beta #4079

Merged
merged 1 commit into from
Jun 15, 2023

PiotrProkop
Contributor

  • One-line PR description: Promote Improved multi-numa alignment in Topology Manager to beta
  • Other comments:

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 12, 2023
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jun 12, 2023
@ffromani
Contributor

/cc

Contributor

@ffromani ffromani left a comment

Minor: you may want to check the Release Signoff Checklist to see if you can/should check more items.

It seems to me there's a TBD left in the Production Readiness Questionnaire, could you please check?

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 12, 2023
@PiotrProkop
Contributor Author

> Minor: you may want to check the Release Signoff Checklist to see if you can/should check more items.
>
> It seems to me there's a TBD left in the Production Readiness Questionnaire, could you please check?

Thanks for the review. Updated both sections.

@dchen1107 dchen1107 added this to the v1.28 milestone Jun 13, 2023
Contributor

@jeremyrickard jeremyrickard left a comment

👋 Taking a look at this as a PRR shadow and I had a general question about troubleshooting and operators understanding that the feature is working correctly.

There is a fair bit of detail around how someone can know that it's working by comparing specific workloads, but there isn't much detail about how cluster operators can really understand this (including during upgrade/downgrade). Could you provide a little more thought/insight from that perspective? Specifically, on a rollout, if the kubelet may fail to start or may crash, how do we determine that those scenarios are due to this enhancement?

N/A.

###### What are other known failure modes?

TBD.
No known failure modes.
Contributor

On Line 372 under the rollout / rollback section, this is mentioned:

Kubelet may fail to start. The kubelet may crash.

Is that statement valid, and if so could we identify what those failure modes might be? How does someone recover from that failure mode?

Contributor Author

Thanks for pointing that out. I think there are only 2 scenarios where the kubelet can crash due to this feature:

  • A bad policy option name. We already log an appropriate message for this; to recover, one just has to provide the correct policy option name or disable TopologyManagerPolicyOptions.
  • cadvisor is not exposing distances for NUMA domains. We also log this; to recover, one has to disable TopologyManagerPolicyOptions.

I'll update the KEP with those steps.
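
For operators who want to rule out both scenarios before a rollout, a pre-flight check could look roughly like the sketch below. It is not part of the kubelet or the KEP: the function names and the `KNOWN_POLICY_OPTIONS` set are hypothetical, and it assumes the NUMA distances ultimately come from the Linux sysfs files under `/sys/devices/system/node/node<N>/distance`. Only `prefer-closest-numa-nodes` is listed as a known option here; the set would need to match whatever the target kubelet version actually accepts.

```python
from pathlib import Path

# Assumption: option names accepted by the kubelet version being rolled out.
# Only the option promoted by this KEP is listed; extend the set as needed.
KNOWN_POLICY_OPTIONS = {"prefer-closest-numa-nodes"}


def unknown_policy_options(options):
    """Return the configured option names that are not known, sorted."""
    return sorted(name for name in options if name not in KNOWN_POLICY_OPTIONS)


def parse_numa_distances(text):
    """Parse one sysfs distance line such as '10 21' into a list of ints."""
    return [int(field) for field in text.split()]


def nodes_missing_distances(sysfs_root="/sys/devices/system/node"):
    """Return NUMA node directory names whose distance file is absent or empty."""
    missing = []
    for node_dir in sorted(Path(sysfs_root).glob("node[0-9]*")):
        distance_file = node_dir / "distance"
        if not distance_file.is_file():
            missing.append(node_dir.name)
        elif not parse_numa_distances(distance_file.read_text()):
            missing.append(node_dir.name)
    return missing
```

An empty result from both functions suggests neither scenario applies on that node; a non-empty result points at the option name to fix or the node whose distances are not exposed, in which case disabling TopologyManagerPolicyOptions is the recovery path described above.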

Contributor Author

Section updated.

Signed-off-by: pprokop <pprokop@nvidia.com>
@dchen1107
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2023
@johnbelamaric
Member

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dchen1107, johnbelamaric, PiotrProkop

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/node Categorizes an issue or PR as relevant to SIG Node. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
6 participants