Avoid draining of NotReady nodes #448

Closed
prashanth26 opened this issue Apr 21, 2020 · 5 comments · Fixed by #450
Labels
area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) effort/2d Effort for issue is around 2 days exp/beginner Issue that requires only basic skills kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) platform/all status/under-investigation Issue is under investigation topology/seed Affects Seed clusters

Comments

@prashanth26
Contributor

What would you like to be added:
Avoid draining of nodes in NotReady state.

Why is this needed:
Draining a node in the NotReady state can end up waiting for the full drain timeout because the kubelet might be stuck. When the node isn't ready, a force delete might be preferable.
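
To make the request concrete, here is a minimal Go sketch of the decision, not the actual machine-controller-manager code. `NodeCondition`, `DeletionStrategy`, `isNodeReady`, and `chooseDeletionStrategy` are hypothetical names introduced only for illustration; the condition type `Ready` and its status values `True`, `False`, and `Unknown` follow the Kubernetes node condition convention.

```go
package main

// NodeCondition is a simplified, hypothetical stand-in for a condition
// reported in a Kubernetes Node's status.
type NodeCondition struct {
	Type   string // e.g. "Ready"
	Status string // "True", "False" or "Unknown"
}

// DeletionStrategy names how the node behind a machine is removed.
// Both values are illustrative assumptions, not MCM identifiers.
type DeletionStrategy string

const (
	DrainThenDelete DeletionStrategy = "drain-then-delete"
	ForceDelete     DeletionStrategy = "force-delete"
)

// isNodeReady reports whether a Ready condition exists with status "True".
// A missing Ready condition is treated as not ready.
func isNodeReady(conds []NodeCondition) bool {
	for _, c := range conds {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

// chooseDeletionStrategy drains only nodes that are Ready; for any other
// node it skips the drain, since a stuck kubelet could block the drain
// until the timeout expires.
func chooseDeletionStrategy(conds []NodeCondition) DeletionStrategy {
	if isNodeReady(conds) {
		return DrainThenDelete
	}
	return ForceDelete
}
```

With this sketch, a node whose `Ready` condition is `Unknown` or `False` (or absent) maps to `ForceDelete`, and only a node reporting `Ready=True` goes through the drain.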

@prashanth26 prashanth26 added kind/enhancement Enhancement, improvement, extension exp/beginner Issue that requires only basic skills status/new Issue is new and unprocessed platform/all area/performance Performance (across all domains, such as control plane, networking, storage, etc.) related size/s Size of pull request is small (see gardener-robot robot/bots/size.py) topology/seed Affects Seed clusters component/mcm Machine Controller Manager (including Node Problem Detector, Cluster Auto Scaler, etc.) labels Apr 21, 2020
@zuzzas
Contributor

zuzzas commented Apr 21, 2020

I'd like to take on this one. If swap is disabled on Linux machines and requests/limits are improperly enforced, some Nodes might get stuck for the drain timeout.

Coding and testing right now.

@rfranzke
Member

Awesome, thanks @zuzzas

@hardikdr hardikdr reopened this Apr 28, 2020
@hardikdr
Member

Re-opening the issue, as there seems to be further scope for improvement on top of #450.

  1. The current implementation applies the complete drain timeout if the node has been NotReady for less than 5 minutes. We should probably follow the suggestion made here.

  2. Changes could potentially be migrated to machine.go.

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Jun 28, 2020
@prashanth26 prashanth26 added status/under-investigation Issue is under investigation and removed lifecycle/stale Nobody worked on this for 6 months (will further age) status/new Issue is new and unprocessed labels Aug 16, 2020
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 16, 2020
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Dec 16, 2020
@prashanth26
Contributor Author

/exp beginner

@gardener-robot gardener-robot added effort/2d Effort for issue is around 2 days and removed size/s Size of pull request is small (see gardener-robot robot/bots/size.py) labels Mar 8, 2021
@prashanth26
Contributor Author

prashanth26 commented Mar 30, 2021

/close in favour of #579 (comment)

5 participants