RemovePodsViolatingNodeTaints policy not working with --descheduling-interval option #245
Comments
Some logs for reference -
The taint |
/kind bug |
This looks like a valid bug. When the descheduler starts up, we first load the list of nodes, which is then passed into the strategy on every loop iteration. I imagine similar bugs can affect the other strategies because of this. Perhaps it would be better to move the section of code that loads the nodes into the |
The list was fetched just once because of the initial design, in which the descheduler was supposed to run only once, as a Job or CronJob. With the introduction of the interval option, the list needs to be fetched inside the loop on each new iteration, so that every iteration starts from fresh node data. We should also make sure that any changes here do not break the scenarios where the interval option is not used. |
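To illustrate the difference between fetching the node list once and fetching it on every iteration, here is a minimal, language-agnostic sketch written in Python. It is not the descheduler's actual code: `run_loop`, `run_strategy`, `make_fake_cluster`, and the simplified node shape (a dict with a list of taint effects) are all hypothetical names invented for this example.

```python
from typing import Callable, Dict, List

# Hypothetical, simplified node shape: {"name": str, "taints": [effect, ...]}.
# Real Kubernetes taints carry a key, a value, and an effect.
Node = Dict[str, object]


def run_strategy(nodes: List[Node]) -> List[str]:
    """Return the names of nodes that carry a NoSchedule taint."""
    return [n["name"] for n in nodes if "NoSchedule" in n["taints"]]


def run_loop(
    list_nodes: Callable[[], List[Node]],
    iterations: int,
    refresh_each_iteration: bool,
) -> List[List[str]]:
    """Run the strategy `iterations` times and collect each result.

    With refresh_each_iteration=False the node list is fetched only once
    (the behaviour reported in this issue). With True it is fetched at
    the start of every iteration.
    """
    results = []
    cached = None
    for _ in range(iterations):
        if refresh_each_iteration or cached is None:
            cached = list_nodes()
        results.append(run_strategy(cached))
        # A real loop would sleep for the descheduling interval here.
    return results


def make_fake_cluster() -> Callable[[], List[Node]]:
    """Fake node source: node-a becomes tainted from the second fetch on."""
    calls = {"count": 0}

    def list_nodes() -> List[Node]:
        calls["count"] += 1
        taints = ["NoSchedule"] if calls["count"] > 1 else []
        return [{"name": "node-a", "taints": taints}]

    return list_nodes


if __name__ == "__main__":
    # Stale list: the taint added after the first fetch is never seen.
    print(run_loop(make_fake_cluster(), 2, refresh_each_iteration=False))
    # Fresh list per iteration: the second run sees the new taint.
    print(run_loop(make_fake_cluster(), 2, refresh_each_iteration=True))
```

With a single iteration (no interval configured), both modes fetch the list exactly once, so the one-shot Job or CronJob scenario behaves the same either way.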
Also running descheduler every |
/assign |
@aveshagarwal I am experimenting with a combined setup of Node Problem Detector and Descheduler, where NPD taints any faulty nodes in the cluster and Descheduler drains pods from the faulty node (via the RemovePodsViolatingNodeTaints policy). Increasing the interval between consecutive Descheduler runs increases the time it takes to detect and remediate a faulty node. |
Just wanted to post an update that we've gotten the NPD+Descheduler Wombo Combo working in production and it seems to work pretty well. Would it be useful to add this use case to any documentation? |
@dharmab Yes, it would be useful to document this real-world use case. It would be great if you could submit a PR to update docs/user-guide.md with the details. Thanks! |
While running Descheduler with the RemovePodsViolatingNodeTaints policy and the --descheduling-interval option set to 5m, we observe that the descheduler caches node status, taints, etc. on its first run, and that this cache is not updated in subsequent runs. As a result, any changes made to node taints after the descheduler's first run are not picked up, and pods are not evicted from that node.