feat(disruption): add node notready controller #1755
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mariuskimmina
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @mariuskimmina!
Hi @mariuskimmina. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
(force-pushed from 1d3f2eb to 01e633c)
I think this does count as a corporate contribution; it's the first time our company has done one, though, so bear with me while I figure out the CLA stuff.
Signed-off-by: Marius Kimmina <marius@adjoe.io> Co-authored-by: Marius Kimmina <marius@adjoe.io> Co-authored-by: Tadeh Alexani Khodavirdian <tadeh@adjoe.io>
(force-pushed from 01e633c to 422151e)
Signed-off-by: Marius Kimmina <marius@adjoe.io>
@@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   annotations:
-    controller-gen.kubebuilder.io/version: v0.16.3
+    controller-gen.kubebuilder.io/version: v0.16.4
Can we (you) split this bump into its own commit / explain more about the context?
This bump was created by running `make verify` and then presumably by `go generate ./...`. As this is my first time working on karpenter, I wasn't sure whether I should commit this change. Happy to remove the bump if it is deemed unnecessary.
(I would make the autogenerated changes a separate commit, then it is easy to omit if appropriate)
log.FromContext(ctx).V(0).Info("Deleted nodeclaim because the node has been unreachable for more than unreachableTimeout", "node", node.Name)
return reconcile.Result{}, nil
Signed-off-by: Marius Kimmina <marius@adjoe.io>
durationSinceTaint := time.Since(taint.TimeAdded.Time)
if durationSinceTaint > *nodeClaim.Spec.UnreachableTimeout.Duration {
	// if node is unreachable for too long, delete the nodeclaim
	if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
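For context, here is a minimal sketch of how the `taint` used in the excerpt above might be located. The helper name is hypothetical, not the PR's actual code; only the taint key constant comes from k8s.io/api:

```go
package notready

import (
	corev1 "k8s.io/api/core/v1"
)

// findUnreachableTaint is a hypothetical helper showing how the excerpt's
// taint might be located. The node lifecycle controller adds the
// node.kubernetes.io/unreachable taint with a TimeAdded timestamp once a
// node stops reporting to the API server.
func findUnreachableTaint(node *corev1.Node) *corev1.Taint {
	for i := range node.Spec.Taints {
		if node.Spec.Taints[i].Key == corev1.TaintNodeUnreachable {
			return &node.Spec.Taints[i]
		}
	}
	return nil
}
```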
- Should something happen to the `.status` of the NodeClaim before deletion?
- Should we record an Event regarding the NodeClaim with the Node as related? (I would)
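For illustration, recording such an Event could look roughly like the sketch below, using client-go's `events.EventRecorder` (which, unlike the older `record.EventRecorder`, accepts a related object). The reason and action strings are made up, not from the PR:

```go
package notready

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/events"

	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// recordUnreachableDeletion is a hypothetical helper: it records a Warning
// Event regarding the NodeClaim, with the Node as the related object.
func recordUnreachableDeletion(rec events.EventRecorder, nodeClaim *v1.NodeClaim, node *corev1.Node) {
	rec.Eventf(nodeClaim, node, corev1.EventTypeWarning,
		"UnreachableTimeoutExceeded", "Deleting",
		"Node %s unreachable longer than unreachableTimeout; deleting NodeClaim", node.Name)
}
```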
@mariuskimmina fyi, in case you haven't seen it, @engedaam opened an RFC that seems to tackle the same set of issues :) #1768
@njtran thanks for the heads up, his approach does seem more thoroughly thought out. I am not sure how I should proceed from here.
Hey @mariuskimmina, I'm currently planning on handling the implementation. This is a problem space we are trying to move quickly on to help solve for users. We can close this PR out. If you have the time, I would appreciate any and all feedback you can provide on both the RFC and the implementation.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Closing in favor of #1793
Fixes #1659
Description
We would like Karpenter to be able to terminate nodes that have been in an unreachable state for too long. This has happened to us in the past, and as far as I can tell spotio, for example, already handles this case. We experienced such a case when the kubelet on the node died.

This PR introduces a new field on the NodePool, `unreachableTimeout`, which can be set to e.g. 10 minutes so that Karpenter actively terminates a node once it has been unreachable for more than 10 minutes. We called it the notready controller because that's the state nodes are in when they become unreachable, but there might be a better name.
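From the review excerpts (`*nodeClaim.Spec.UnreachableTimeout.Duration` dereferences an embedded `*time.Duration`), the field plausibly reuses karpenter's existing NillableDuration type. A minimal sketch of the shape, with assumed names, placement, and doc text; NillableDuration is redefined here only so the snippet compiles on its own:

```go
package v1

import "time"

// NillableDuration mirrors karpenter's existing wrapper whose embedded
// *time.Duration may be nil.
type NillableDuration struct {
	*time.Duration
}

// NodeClaimSpecSketch shows only the proposed field; the real spec's other
// fields are elided, and the comments are assumptions, not the PR's code.
type NodeClaimSpecSketch struct {
	// UnreachableTimeout is how long a node may carry the
	// node.kubernetes.io/unreachable taint before Karpenter deletes the
	// backing NodeClaim; leaving it unset disables the behavior.
	// +optional
	UnreachableTimeout NillableDuration `json:"unreachableTimeout,omitempty"`
}
```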
How was this change tested?
We added a test suite for this case and we also tested it on one of our EKS test clusters where we simulated a node becoming unreachable and had Karpenter mark the nodeclaim for deletion.
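For illustration, simulating the unreachable state in such a test might look like the sketch below. The helper name is assumed, not taken from the PR's suite:

```go
package notready_test

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// simulateUnreachable is a hypothetical test helper: it taints a node the
// way the node lifecycle controller does when a kubelet stops reporting,
// backdating TimeAdded so the configured unreachableTimeout is already
// exceeded by the time the controller reconciles.
func simulateUnreachable(node *corev1.Node, since time.Duration) {
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:       corev1.TaintNodeUnreachable,
		Effect:    corev1.TaintEffectNoExecute,
		TimeAdded: &metav1.Time{Time: time.Now().Add(-since)},
	})
}
```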
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.