
Add 'namespace' parameter support to LowNodeUtilization strategy #391

Closed
korjek opened this issue Sep 2, 2020 · 36 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@korjek

korjek commented Sep 2, 2020

Hello.

I think it would be great if the 'namespace' parameter were supported by the LowNodeUtilization strategy.

@korjek korjek added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 2, 2020
@seanmalloy
Member

@korjek thanks for opening this issue. I'm assuming you are referring to the namespace filtering feature that was just added in the v0.19.0 release.

When the namespace filtering enhancement was initially discussed, it was determined that it would not make sense to enable it for all descheduler strategies. For example, for the LowNodeUtilization strategy to be effective, it needs to be able to process as many nodes as possible.

Can you provide any additional details on your use case?
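For reference, on the strategies where it is supported, namespace filtering is configured in the policy roughly as in the sketch below (based on my reading of the v0.19 docs; the strategy chosen and the namespace name are only placeholders):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemovePodsViolatingNodeTaints":
    enabled: true
    params:
      namespaces:
        include:
        - "batch-jobs"   # placeholder: only pods in this namespace are considered
```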

@korjek
Author

korjek commented Sep 4, 2020

@seanmalloy thank you for your reply. You are right, I meant namespace filtering.

What I got from the documentation is that namespace filtering is enabled for all strategies except LowNodeUtilization.
Our use case is that we have a separate namespace where pods that are idempotent to restart are running. The idea behind using namespace filtering with the LowNodeUtilization strategy is that the descheduler would evict pods only from that specific namespace.

Another solution I can think of for our use case: specify the lowest priority for all pods in that namespace and use thresholdPriorityClassName filtering. But there is a drawback: if we lose nodes and there is not enough capacity to schedule all pods across the cluster, we would want the pods in this namespace to be scheduled and some other pods to be preempted by kube-scheduler instead. So using the lowest priority is not a good idea in this case.

That's why we think it would be great to have namespace filtering support in the LowNodeUtilization strategy.
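For context, the thresholdPriorityClassName alternative mentioned above would look roughly like the sketch below; the PriorityClass name and the threshold numbers are placeholders, and the exact parameter layout is my assumption from the docs rather than something confirmed in this thread:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      # Hypothetical PriorityClass name; the intent is that only pods below this
      # priority remain evictable.
      thresholdPriorityClassName: "sacrificial"
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
```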

@seanmalloy
Member

@korjek are the pods running in this namespace running on a different set of nodes? Could you maybe filter using a node selector?
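For reference, the node-selector approach would restrict which nodes the descheduler processes at all; a minimal sketch, assuming the top-level nodeSelector policy field and a placeholder node label:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
# Placeholder label: only nodes carrying it are considered by the strategies below.
nodeSelector: "workload-pool=idempotent"
strategies:
  "LowNodeUtilization":
    enabled: true
    # thresholds omitted for brevity
```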

@korjek
Author

korjek commented Sep 5, 2020

@seanmalloy no, they are using the same node pool.

But what's the reason not to have namespace filtering for the LowNodeUtilization strategy?

@damemi
Contributor

damemi commented Sep 10, 2020

LowNodeUtilization needs to consider as many pods as possible to accurately calculate each node's resource consumption. Only considering a subset of these for eviction would hurt its effectiveness at balancing cluster resource usage.

Can you give some more details on what your use case is? Perhaps there is another approach with the scheduler/descheduler that would fit better.

@korjek
Author

korjek commented Sep 11, 2020

@damemi but why does the descheduler need to calculate a node's resource consumption if that info is available from the node's status?

How I see this: when the LowNodeUtilization strategy generates the list of pods for eviction, it should take namespace filtering into account (i.e. when evaluating pods for eviction, skip pods that are filtered out by the namespace setting). The rest of the logic shouldn't change.

Here is my use case: #391 (comment)

@damemi
Contributor

damemi commented Sep 14, 2020

If you limit the pods that are available for eviction with LowNodeUtilization, the strategy won't work very well. The point of it is to balance overall node usage, so unless the namespace you're restricting it to makes up a large portion of your cluster's consumption, there isn't much point in running this strategy against that namespace.

By "your use case" I meant: why do you specifically want to run LowNodeUtilization in one namespace? I understand that you want to restrict evictions to certain idempotent pods, but are you just trying to keep these pods spread out between nodes? Are they actually a large portion of your nodes' usage?

If we know your end goal with this strategy, we can better determine whether this is a real use case or whether something else (like Pod Topology Spreading or RemoveDuplicates) is more appropriate.

@ingvagabund
Contributor

ingvagabund commented Sep 16, 2020

> Our use case is that we have a separate namespace where pods that are idempotent to restart are running. The idea behind using namespace filtering with the LowNodeUtilization strategy is that the descheduler would evict pods only from that specific namespace.

IIUC, the specific namespace is like a bank of pods to be sacrificed when certain nodes start to become overutilized. Assuming what you are asking for is implemented, if no pod from the specific namespace runs on any of the overutilized nodes, the strategy will have no effect. More or less the same holds if only a small fraction of pods from the namespace runs on the overutilized nodes (debatable, and depending on the particular use case).

I can see how it aligns with the LowNodeUtilization strategy, though unless the bank of pods is concentrated on the overutilized nodes, I don't see much benefit. I can see a case where each of a set of nodes runs a main application, and each main application concentrates a bank of workers (in the specific namespace) on the same node. If the worker pods start to eat too much CPU/memory (upper-bounded by resource limits), nodes can get overutilized, in which case it makes sense to reschedule/evict only some of the worker pods from the bank. @korjek Is this use case close enough?

@korjek
Author

korjek commented Sep 23, 2020

@damemi
Yes, these pods are the biggest portion of node usage (~90% of the whole cluster).
Of course I can go and update the PriorityClass for pods in this namespace, or for pods in other namespaces (or both at the same time), and use the PriorityClass filter. But namespace filtering requires less effort in this case.
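For completeness, the PriorityClass-based workaround described above would mean creating something like the manifest below and assigning it to the pods in that namespace (a sketch; the name and value are placeholders, and as noted earlier a very low value interacts badly with scheduler preemption):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: sacrificial          # placeholder name
value: 1000                  # placeholder value, lower than the other workloads' classes
globalDefault: false
description: "Pods that the descheduler may evict first."
```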

@korjek
Author

korjek commented Sep 23, 2020

> IIUC, the specific namespace is like a bank of pods to be sacrificed when certain nodes start to become overutilized. Assuming what you are asking for is implemented, if no pod from the specific namespace runs on any of the overutilized nodes, the strategy will have no effect. More or less the same holds if only a small fraction of pods from the namespace runs on the overutilized nodes (debatable, and depending on the particular use case).

But the same applies to PriorityClass filtering: if all pods except a small number use a higher PriorityClass than the one specified in the PriorityClass filter, the strategy will have no effect either.

> I can see how it aligns with the LowNodeUtilization strategy, though unless the bank of pods is concentrated on the overutilized nodes, I don't see much benefit. I can see a case where each of a set of nodes runs a main application, and each main application concentrates a bank of workers (in the specific namespace) on the same node. If the worker pods start to eat too much CPU/memory (upper-bounded by resource limits), nodes can get overutilized, in which case it makes sense to reschedule/evict only some of the worker pods from the bank. @korjek Is this use case close enough?

Almost. The main load (~90%) is caused by pods running in one namespace, so once AWS reclaims a node, the other nodes become really overloaded.
As I replied to @damemi, I can handle my case with the PriorityClass filter, but it will require more changes on my side.

@damemi
Contributor

damemi commented Sep 24, 2020

> The main load (~90%) is caused by pods running in one namespace, so once AWS reclaims a node, the other nodes become really overloaded.

This is the key here, and I think it's a legitimate use case. Ultimately there isn't much "risk" in enabling namespace filtering for this strategy; the only risk is that, if used improperly, it will hurt the strategy's effectiveness (or maybe cause it to hot-loop, running every iteration with no results, though we could probably get around that by only listing the viable pods when calculating usage). If we do enable this, those outcomes should be clearly noted in the docs, along with the fact that running this strategy with namespace filtering is not normally recommended.

@seanmalloy @ingvagabund wdyt?

@seanmalloy
Member

> > The main load (~90%) is caused by pods running in one namespace, so once AWS reclaims a node, the other nodes become really overloaded.
>
> This is the key here, and I think it's a legitimate use case. Ultimately there isn't much "risk" in enabling namespace filtering for this strategy; the only risk is that, if used improperly, it will hurt the strategy's effectiveness (or maybe cause it to hot-loop, running every iteration with no results, though we could probably get around that by only listing the viable pods when calculating usage). If we do enable this, those outcomes should be clearly noted in the docs, along with the fact that running this strategy with namespace filtering is not normally recommended.
>
> @seanmalloy @ingvagabund wdyt?

@damemi I'm fine with this as long as it is clearly documented.

@ingvagabund
Contributor

ingvagabund commented Sep 27, 2020

The semantics of filtering namespaces through the Namespaces field here is different from the other cases. The expected semantics is: filter pods first, then run the strategy. In this case the strategy runs first and pods are filtered afterwards. I am inclined to introduce a new field, e.g. EvictableNamespaces, to make the difference explicit and to have it only support including namespaces, without excluding.
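To make the difference concrete, a purely illustrative sketch (EvictableNamespaces is only a proposal here, not an existing field; names and numbers are placeholders): for other strategies the namespace list narrows which pods the strategy sees at all, while for LowNodeUtilization node usage would still be computed over all pods and the list would only limit which pods may be evicted.

```yaml
strategies:
  # Existing semantics: pods are filtered by namespace first, then the strategy runs.
  "RemovePodsHavingTooManyRestarts":
    enabled: true
    params:
      podsHavingTooManyRestarts:
        podRestartThreshold: 100
      namespaces:
        include:
        - "team-a"                      # placeholder
  # Proposed semantics: usage is computed from all pods; only eviction is restricted.
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
        targetThresholds:
          "cpu": 50
      evictableNamespaces:              # hypothetical field, as proposed in this comment
        include:
        - "sacrificial-workloads"       # placeholder
```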

@damemi
Contributor

damemi commented Sep 28, 2020

I'm not sure this requires a new field; it would probably be sufficient to document the difference, because it's just for one strategy.

@ingvagabund
Contributor

ingvagabund commented Sep 29, 2020

> because it's just for one strategy.

Two, once the HighNodeUtilization strategy gets implemented.

@lixiang233
Contributor

This strategy is different because it is a strategy for nodes. I think the Namespaces parameter in this strategy can only be understood as EvictableNamespaces (so we don't need a new parameter): a node's utilization is a property of the node, and filtering pods shouldn't change that property, so the Namespaces parameter here can only be used to limit the scope of evictable pods. We can mention this in the README to prevent misunderstanding.

@ingvagabund
Contributor

Nobody reads documentation unless it's really needed. I won't argue against reusing the Namespaces field; I just wanted to voice my concerns.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 28, 2021
@lixiang233
Contributor

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 28, 2021
@lindhe

lindhe commented Mar 2, 2021

I was fooled by this too. I assumed the filter would let me keep some namespaces untouched by the descheduler, but that's clearly not the case.

@damemi
Contributor

damemi commented Mar 2, 2021

> I was fooled by this too. I assumed the filter would let me keep some namespaces untouched by the descheduler, but that's clearly not the case.

I think that adding a top-level setting for excluding namespaces is a valid request. In OpenShift, we currently have our operator manually append system namespaces to each strategy's exclude list, and a cluster-wide setting would be more convenient.

I also think that excluding namespaces from LowNodeUtilization has (a) more utility and (b) less risk than an includeNamespaces option for it would. So I have no problem with excluding namespaces from LNU, with a clear note that using it too broadly can have negative effects on the strategy.
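A sketch of what exclude-only filtering for LowNodeUtilization could look like if it were enabled (illustrative only; whether and how this gets exposed is exactly what this issue is discussing, and the namespace names and thresholds are placeholders):

```yaml
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          "cpu": 20
          "memory": 20
        targetThresholds:
          "cpu": 50
          "memory": 50
      namespaces:
        exclude:
        - "kube-system"               # keep system pods off the eviction candidate list
        - "openshift-monitoring"      # placeholder
```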

@eagle9527

If LowNodeUtilization could filter by namespace, it would not affect applications in the other namespaces. If it is applied across the entire cluster, production applications may become unavailable.

@ingvagabund
Contributor

ingvagabund commented Mar 4, 2021

+1 for allowing only a list of excluded namespaces, while clearly expressing the risks of misusing it and stressing that the setting applies to the eviction phase only.

The strategy evicts pods in order to move them from overutilized nodes to underutilized ones, ordering pods from the lowest priority to the highest. If the list of excluded namespaces is configured improperly, some of the lowest-priority pods might stay while pods with higher priority get evicted instead. That still improves resource usage across nodes, though without always respecting the priority. The description of the LowNodeUtilization strategy does not mention that priority ordering is respected during eviction; it's worth documenting what behavior each strategy actually guarantees. https://github.com/kubernetes-sigs/descheduler#pod-evictions does mention the ordering, though.

It might be worth adding another do-not-respect-priority-ordering-like param to the strategy, so anyone who wants to use the list of excluded namespaces has to set that param and accept that pod priority may not be respected, which would otherwise violate the strategy's guarantees. Unfortunately, we don't have a notion of experimental features here, i.e. marking some params as alpha/experimental.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 2, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 2, 2021
@k8s-triage-robot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@korjek
Author

korjek commented Aug 2, 2021

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 2, 2021
@k8s-ci-robot
Contributor

@korjek: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@seanmalloy
Member

/reopen

@k8s-ci-robot
Contributor

@seanmalloy: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Oct 1, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
