KEP-3902: Decouple TaintManager from NodeLifeCycleController #3901
Conversation
Force-pushed ac1a89f to 91ba8c8
@yuanchen8911: GitHub didn't allow me to request PR reviews from the following users: atosatto. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cc @atiratree @Huang-Wei
/sig node scheduling
FYI: might kubernetes/kubernetes#115840 be related? I didn't read the KEP.
@kerthcet we see this as a preliminary step aiming to remove deprecated features in
Force-pushed 18f3967 to f21c474
@soltysh, @kow3ns, @alculquicondor, thanks a lot for the discussions on the proposal in today's
ownership discussion: kubernetes/kubernetes#116771
* Fix all reported bugs if any.
### Upgrade / Downgrade Strategy
If somebody had the controller replaced, will the second one stay after the split? So they will need to disable one more?
For downgrade, disabling the new feature using the feature gate will automatically disable a custom controller and use the current, default combined controller instead.
BTW, Aldo and other reviewers strongly suggested that a custom controller should not be the focus of this proposal.
@SergeyKanzhelev, thanks for the comments! Can you take another look?
Queued - will take a look, but most probably Wed/Thu.
Force-pushed 4b5ff1e to 89f2ac7
The overall proposal makes sense to me (including going directly to Beta), but I have a bunch of comments about the PRR itself - PTAL
*This section must be completed when targeting alpha to a release.*
* **How can this feature be enabled / disabled in a live cluster?**
Please update to the new KEP template (as mentioned above) - the feature-gate is then required explicitly :)
* The feature will work as usual.
* **Are there any tests for feature enablement/disablement?**
* Planned for Beta release
* Appropriate tests have been added in the integration tests. See [Integration tests](#integration-tests) for more details.
These tests haven't been added, rather they are planned, right?
That being said, if I'm reading correctly, those tests rather check whether things work fine when the FG is enabled. They are not trying to test enablement/disablement - i.e. "run k8s with the FG enabled, restart controller-manager with the FG disabled, and check if things still work as expected (and vice versa)".
Which BTW is probably fine here, given this is an in-memory feature (so no state is really stored here).
So I would be ok with saying something like:
"No enablement/disablement tests are needed since this is an in-memory feature and regular tests with the feature enabled/disabled do the job."
Updated as suggested.
*This section must be completed when targeting beta graduation to a release.*
* **How can a rollout or rollback fail? Can it impact already running workloads?**
A feature gate `SeparateTaintManager` controls whether to use the split `TaintManager` or the old combined `NodeLifecycleController`.
Please answer the question. Can it impact already running workloads?
Does this make sense?
"This is an opt-in feature, and it does not change any default behavior. Unless there is a bug in the implementation, a rollout cannot fail. If a rollout does fail, running workloads will not be evicted properly on tainted nodes. We don't expect a rollback to fail."
A feature gate `SeparateTaintManager` controls whether to use the split `TaintManager` or the old combined `NodeLifecycleController`.
* **What specific metrics should inform a rollback?**
N/A
This isn't N/A for sure.
Some huge number of pod evictions? [Please add an appropriate metric]
How about "A significantly changing number of pod evictions and/or a substantial increase in pod eviction latency"?
N/A
* **Were upgrade and rollback tested? Was the upgrade→downgrade→upgrade path tested?**
They will be covered in unit and e2e tests.
Will they? We don't really have downgrade e2e tests running anywhere.
Please run them manually and describe the test case (#3658 is a nice example).
Added to the test plan: "manually verify the rollback and upgrade-downgrade-upgrade path will pass the e2e testing".
The overall proposal makes sense to me (including going directly to Beta), but I have a bunch of comments about the PRR itself - PTAL
Thanks for the review, Wojciech! I've updated the doc to (hopefully) address the comments :-). Can you please take another look when you get a chance? Thanks again!
* **How can an operator determine if the feature is in use by workloads?**
* It can be determined by whether the feature gate `SeparateTaintManager` is enabled.
* **How can someone using this feature know that it is working for their instance?**
* Node taints and taint-based pod eviction work as usual. Admins and users should not see any different behavior if everything works as planned.
How can I determine it as an operator running gazillions of clusters?
Please specify a metric (the number of pod evictions is probably the best one).
Changed it to "Node taints and taint-based pod eviction should work as usual and there is no significant change in the number of pod evictions (`taint_manager_pod_evictions_total`) and/or pod eviction latency (`taint_manager_pod_eviction_latency`)."
* **What are the reasonable SLOs (Service Level Objectives) for the enhancement?**
* The performance of node taint-based eviction should remain at the same level as before.
* **What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?**
* The metrics for both `NodeLifecycleController` and `TaintManager`'s queues should stay at the same levels as before.
What metrics? Please be specific.
How about "the number of pod evictions and pod eviction latency"?
taint_manager_pod_evictions_total
taint_manager_pod_eviction_latency
* **What are other known failure modes?**
* No
* **What steps should be taken if SLOs are not being met to determine the problem?**
* If the default or custom taint-manager is working or not.
I don't understand it - can you please clarify?
Updated to: "If the pod eviction latency increases significantly, validate whether the communication between `NodeLifecycleController` and `TaintManager` works. If the number of pod evictions is abnormal, run tests to verify that the `TaintManager` works properly."
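For illustration, a minimal client-go sketch (assuming cluster-admin access; the node name and taint key are hypothetical) of one way to run such a check by hand: apply a NoExecute taint to a test node and watch the pods on it drain.

```go
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ctx := context.Background()
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Taint a test node with NoExecute; a healthy taint manager should start
	// evicting pods on it that do not tolerate the taint.
	node, err := client.CoreV1().Nodes().Get(ctx, "test-node", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	node.Spec.Taints = append(node.Spec.Taints, v1.Taint{
		Key:    "example.com/eviction-check", // hypothetical taint key
		Effect: v1.TaintEffectNoExecute,
	})
	if _, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{}); err != nil {
		panic(err)
	}

	// Poll the pods scheduled on the node; the count should drop over time.
	for i := 0; i < 10; i++ {
		pods, err := client.CoreV1().Pods("").List(ctx, metav1.ListOptions{
			FieldSelector: "spec.nodeName=test-node",
		})
		if err != nil {
			panic(err)
		}
		fmt.Printf("pods remaining on test-node: %d\n", len(pods.Items))
		time.Sleep(10 * time.Second)
	}
}
```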
Force-pushed 17cd65c to 86dd1f5
/lgtm
lgtm from SIG Node.
This seems fine from a PRR perspective - thanks! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Huang-Wei, SergeyKanzhelev, wojtek-t, yuanchen8911. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
This KEP proposes decoupling `TaintManager` from the `NodeLifecycleController` to enable more flexible extension and enhancement in node taint-based eviction. The proposal was presented to sig-apps and sig-node. An implementation is WIP.