
Add support for AWS EKS auto repair feature #7591

Open
Atom-oh opened this issue Jan 14, 2025 · 1 comment
Labels
feature New feature or request

Comments

Atom-oh commented Jan 14, 2025

Description

What problem are you trying to solve?

The EKS Auto Repair feature has been released, but it is only supported for EKS node groups, not for Karpenter. With auto repair, when a node runs into problems its status changes to NotReady. Please make it possible to terminate such nodes after a specified amount of time.

How important is this feature to you?
Sometimes, when a node's disk encounters issues or the kubelet malfunctions and the problem does not resolve on its own, pod operations can be severely disrupted. This situation is very critical.

In EKS node groups, this issue is addressed through the Auto Scaling Group mechanism, which terminates the problematic node and replaces it with a new one. However, this functionality is not currently supported in Karpenter NodePools, which makes it a critical gap.
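To make the ASG mechanism concrete: when an instance is marked Unhealthy, its Auto Scaling Group terminates it and launches a replacement. That same trigger can be reproduced manually; here is a minimal sketch using the AWS SDK for Go v2 (the instance ID is a placeholder):

```go
// Sketch of the ASG repair path used by EKS node groups: marking an
// instance Unhealthy causes its Auto Scaling Group to terminate it and
// launch a replacement.
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
)

func main() {
	ctx := context.Background()

	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatalf("load AWS config: %v", err)
	}
	client := autoscaling.NewFromConfig(cfg)

	// Mark the instance Unhealthy; the ASG health check then replaces it.
	// "i-0123456789abcdef0" is a placeholder instance ID.
	_, err = client.SetInstanceHealth(ctx, &autoscaling.SetInstanceHealthInput{
		InstanceId:   aws.String("i-0123456789abcdef0"),
		HealthStatus: aws.String("Unhealthy"),
	})
	if err != nil {
		log.Fatalf("set instance health: %v", err)
	}
	log.Println("instance marked Unhealthy; the ASG will replace it")
}
```

Managed node groups get this replacement behavior automatically; the point of this issue is that Karpenter-provisioned nodes have no equivalent.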

To elaborate on this:

EKS Auto Repair: This feature is designed to maintain the health of the cluster by automatically addressing node issues.

Node Group vs. Karpenter: While EKS node groups have this auto-repair capability through Auto Scaling Groups (ASGs), Karpenter, which is an alternative node provisioning solution, currently lacks this feature.

Critical Impact: The absence of this feature in Karpenter can lead to prolonged downtime or degraded performance if a node becomes unhealthy, as there's no automatic mechanism to replace the problematic node.

Desired Solution: Implementing a similar auto-repair or node replacement mechanism for Karpenter NodePools would be beneficial. This could involve detecting nodes in a NotReady state and terminating and replacing them after a specified time, similar to how it works with EKS node groups (a rough sketch of such a loop follows this list).

Importance: This feature is crucial for maintaining the reliability and performance of Kubernetes clusters, especially in production environments where downtime can have significant impacts.
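To make the desired behavior concrete, here is a rough sketch of the detect-and-terminate loop described above, using client-go. This is not Karpenter's implementation; the 30-minute threshold and the default kubeconfig location are assumptions:

```go
// Sketch: list nodes whose Ready condition has been false/unknown longer
// than a threshold and delete them, letting the provisioner replace the
// capacity. Not Karpenter's actual repair logic.
package main

import (
	"context"
	"log"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const unhealthyThreshold = 30 * time.Minute // assumed repair delay

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatalf("build kubeconfig: %v", err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatalf("create clientset: %v", err)
	}

	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatalf("list nodes: %v", err)
	}

	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type != corev1.NodeReady {
				continue
			}
			// NotReady (or Unknown) for longer than the threshold?
			if cond.Status != corev1.ConditionTrue &&
				time.Since(cond.LastTransitionTime.Time) > unhealthyThreshold {
				log.Printf("deleting unhealthy node %s (NotReady since %s)",
					node.Name, cond.LastTransitionTime)
				if err := clientset.CoreV1().Nodes().Delete(
					context.TODO(), node.Name, metav1.DeleteOptions{}); err != nil {
					log.Printf("delete node %s: %v", node.Name, err)
				}
			}
		}
	}
}
```

For Karpenter-managed nodes, deleting the Node object should trigger Karpenter's termination finalizer, so the underlying instance is drained and terminated and replacement capacity is provisioned for the displaced pods.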


  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@Atom-oh Atom-oh added feature New feature or request needs-triage Issues that need to be triaged labels Jan 14, 2025
@jigisha620 (Contributor) commented:

I think what you are looking for is Node Repair in Karpenter and this has been implemented as part of this PR - kubernetes-sigs/karpenter#1793. It should be included in our next release.

@jigisha620 jigisha620 removed the needs-triage Issues that need to be triaged label Jan 16, 2025