feat(disruption): add node notready controller #1755

mariuskimmina

Fixes #1659

Description
We would like Karpenter to be able to terminate nodes that have been in an unreachable state for too long.
This has happened to us in the past, and as far as I can tell spotio, for example, already handles this case.
In our case, the node became unreachable when the kubelet on the node died.

This PR introduces a new NodePool field, unreachableTimeout, which can be set to e.g. 10 minutes so that Karpenter actively terminates a node once it has been unreachable for more than 10 minutes.

We called it the notready controller because NotReady is the state nodes are in when they become unreachable, but there might be a better name.

How was this change tested?

We added a test suite for this case and we also tested it on one of our EKS test clusters where we simulated a node becoming unreachable and had Karpenter mark the nodeclaim for deletion.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.


linux-foundation-easycla bot commented Oct 16, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot requested a review from engedaam October 16, 2024 15:50
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mariuskimmina
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from jmdeal October 16, 2024 15:50
@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Oct 16, 2024
@k8s-ci-robot
Contributor

Welcome @mariuskimmina!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 16, 2024
@k8s-ci-robot
Contributor

Hi @mariuskimmina. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 16, 2024
@mariuskimmina mariuskimmina force-pushed the node-notready-controller branch from 1d3f2eb to 01e633c Compare October 16, 2024 15:51
@mariuskimmina
Author

I think this does count as a corporate contribution. It's the first time our company has made one, though, so bear with me while I figure out the CLA stuff.

Signed-off-by: Marius Kimmina <marius@adjoe.io>
Co-authored-by: Marius Kimmina <marius@adjoe.io>
Co-authored-by: Tadeh Alexani Khodavirdian <tadeh@adjoe.io>
@mariuskimmina mariuskimmina force-pushed the node-notready-controller branch from 01e633c to 422151e Compare October 16, 2024 16:13
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 16, 2024
Signed-off-by: Marius Kimmina <marius@adjoe.io>
Signed-off-by: Marius Kimmina <marius@adjoe.io>
@JacobHenner JacobHenner mentioned this pull request Oct 21, 2024
@@ -3,7 +3,7 @@ apiVersion: apiextensions.k8s.io/v1
 kind: CustomResourceDefinition
 metadata:
   annotations:
-    controller-gen.kubebuilder.io/version: v0.16.3
+    controller-gen.kubebuilder.io/version: v0.16.4

Can we (you) split this bump into its own commit / explain more about the context?

Author

This bump was produced by running make verify, presumably via go generate ./... As this is my first time working on Karpenter, I wasn't sure whether I should commit this change. Happy to remove the bump if it is deemed unnecessary.

(I would make the autogenerated changes a separate commit; then they are easy to omit if appropriate)

Comment on lines 68 to 69
log.FromContext(ctx).V(0).Info("Deleted nodeclaim because the node has been unreachable for more than unreachableTimeout", "node", node.Name)
return reconcile.Result{}, nil

This comment was marked as duplicate.

Signed-off-by: Marius Kimmina <marius@adjoe.io>
Signed-off-by: Marius Kimmina <marius@adjoe.io>
durationSinceTaint := time.Since(taint.TimeAdded.Time)
if durationSinceTaint > *nodeClaim.Spec.UnreachableTimeout.Duration {
	// if node is unreachable for too long, delete the nodeclaim
	if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {

  • Should something happen to the .status of the NodeClaim before deletion?
  • Should we record an Event regarding the NodeClaim with the Node as related? (I would)
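
As a rough illustration of the second suggestion, the following Go sketch pairs the deletion with an event that is about the NodeClaim and names the Node as the related object. Everything here is hypothetical: the event type, the recorder and deleter interfaces, and the "Unreachable" reason are stand-ins chosen for the example, not Karpenter or Kubernetes APIs. The sketch assumes the event should only be published once the delete request has succeeded.

```go
package main

import (
	"errors"
	"fmt"
)

// event is a hypothetical record of what a controller might publish.
type event struct {
	Regarding string // primary object: the NodeClaim
	Related   string // secondary object: the Node
	Reason    string
	Message   string
}

// recorder and deleter are hypothetical stand-ins for an event recorder
// and an API client.
type recorder interface{ Record(e event) }
type deleter interface{ Delete(nodeClaim string) error }

// memRecorder keeps events in memory so the example is self-contained.
type memRecorder struct{ events []event }

func (r *memRecorder) Record(e event) { r.events = append(r.events, e) }

// fakeDeleter returns a preset error (nil means the delete succeeds).
type fakeDeleter struct{ err error }

func (d fakeDeleter) Delete(string) error { return d.err }

// errUnavailable simulates a failed delete request.
var errUnavailable = errors.New("api unavailable")

// terminateUnreachable deletes the NodeClaim and, only if that succeeds,
// records an event about the NodeClaim with the Node as related object.
func terminateUnreachable(d deleter, r recorder, nodeClaim, node string) error {
	if err := d.Delete(nodeClaim); err != nil {
		return fmt.Errorf("deleting nodeclaim %q: %w", nodeClaim, err)
	}
	r.Record(event{
		Regarding: nodeClaim,
		Related:   node,
		Reason:    "Unreachable",
		Message:   "node was unreachable for longer than unreachableTimeout",
	})
	return nil
}

func main() {
	rec := &memRecorder{}
	if err := terminateUnreachable(fakeDeleter{}, rec, "nodeclaim-a", "node-a"); err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println(len(rec.events), rec.events[0].Related) // prints "1 node-a"

	err := terminateUnreachable(fakeDeleter{err: errUnavailable}, rec, "nodeclaim-b", "node-b")
	fmt.Println(err != nil, len(rec.events)) // prints "true 1"
}
```

Recording after the delete avoids announcing a termination that never happened; recording before it would instead guarantee the event exists even if the controller crashes mid-reconcile. Which ordering fits better is a design decision for the real controller.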

@njtran
Contributor

njtran commented Oct 22, 2024

@mariuskimmina FYI, in case you haven't seen it or weren't aware of it, @engedaam opened an RFC that seems to tackle the same set of issues :) #1768

@mariuskimmina
Author

@mariuskimmina FYI, in case you haven't seen it or weren't aware of it, @engedaam opened an RFC that seems to tackle the same set of issues :) #1768

@njtran thanks for the heads-up. His approach does seem better thought out. I am not sure how I should proceed from here:

  • should this PR remain open?
  • will @engedaam also take care of the implementation?
  • anything else I can do to help?

@engedaam
Contributor

Hey @mariuskimmina, I'm currently planning on handling the implementation. This is a problem space we are trying to move quickly on to help solve for users. We can close this PR out. If you have the time, I would appreciate any and all feedback you can provide on both the RFC and the implementation.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 31, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mariuskimmina
Author

Closing in favor of #1793

Labels
  • cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
  • needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
  • needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
  • size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

Node NotReady Disruption Controller
5 participants