Multiple descheduler strategies of the same type #486
Comments
Hi @irmiller22, thanks for opening this issue. We have discussed similar requests before (I thought there was an open issue, but I couldn't find it), so there is definitely some validity to what you describe. One option we discussed was something similar to scheduler profiles (https://kubernetes.io/docs/reference/scheduling/config/#multiple-profiles). This would take some minor refactoring but is certainly doable. Since I couldn't find another issue for this, maybe we can officially open discussion on this here. @ingvagabund @seanmalloy wdyt?
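For context, the linked kube-scheduler docs illustrate multiple profiles roughly like this (this is only a paraphrase of that example; the API version and plugin details may differ in current releases):

apiVersion: kubescheduler.config.k8s.io/v1beta1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
- schedulerName: no-scoring-scheduler   # same scheduler binary, different behavior per profile
  plugins:
    preScore:
      disabled:
      - name: '*'
    score:
      disabled:
      - name: '*'

A descheduler equivalent would similarly let each profile carry its own strategy configuration.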
There's currently no way to separate pods into two distinct groups and have multiple instances of the same strategy target different groups. It would be better to allow a strategy to target a specific group of pods based on a label selector. Nevertheless, +1 for allowing multiple instances of the same strategy to run wherever it's applicable. We might start with:

apiVersion: "descheduler/v1alpha2"   # <-- notice the version bump
kind: "DeschedulerPolicy"
strategies:
- name: PodLifeTime
  enabled: true
  ...
- name: PodLifeTime
  enabled: true
  ...
This was another concern I had. I wonder if implementing an annotation for this would make sense.
As long as applications are guided to choose which descheduler is more suitable, rather than running the same descheduler just with a different name and a different set of strategies. IMHO, it's orthogonal to this issue.
I would like to pick this up if it's still under consideration. I've got some questions.
I've given my perspective on your questions inline below. Prior to starting implementation we need to have some sort of high-level agreement on what exactly the v1alpha2 DeschedulerPolicy will look like. For starters I recommend adding a well-thought-out comment on this issue with a detailed proposal for v1alpha2. Let's also see what @damemi and @ingvagabund think about this.
The only other option I can think of is using an annotation, which was mentioned in #486 (comment).
I believe it makes sense to have v1alpha1 co-exist with v1alpha2.
I think we would want to allow multiple descheduler profiles for all strategies.
@hanumanthan thanks for picking this up.
All strategies up to LowNodeUtilization can already specify a label selector: #510. As long as only the admin needs to configure the descheduler, this is sufficient.
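For reference, a minimal sketch of the label-selector support added in #510, using the v1alpha1 policy (the lifetime value and labels are illustrative only):

apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "PodLifeTime":
    enabled: true
    params:
      podLifeTime:
        maxPodLifeTimeSeconds: 86400
      labelSelector:            # only pods matching this selector are considered for eviction
        matchLabels:
          component: redis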
Providing multiple instances of the same strategy requires incompatible changes. So we will need to support both versions at the same time for at least 3 releases. Also, I'd like to keep the changes simple so we can simply convert v1alpha1 to v1alpha2. With that, we might also get rid of
In theory, all of them, as long as a strategy allows setting a namespace alongside other parameters (e.g. thresholdPriority). Even LowNodeUtilization can be utilized here if we allow setting a node selector to create multiple (and non-overlapping) node pools. On the other hand, it will be up to the user to make sure multiple instances of the same strategy do not interfere with each other where they are not supposed to.
We can't reasonably check within the descheduler that every configuration is conflict-free. Two instances of the same strategy can have non-overlapping groups of pods based on a label selector, though there's nothing forbidding a group of pods from being targeted by both label selectors, and this can also change dynamically.
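As a purely hypothetical illustration of that overlap concern, reusing the list-style layout sketched earlier in this thread (which does not exist in v1alpha1): a pod labeled both tier: batch and team: payments would be matched by both instances below, and nothing in the policy itself prevents that.

strategies:
- name: PodLifeTime
  enabled: true
  params:
    podLifeTime:
      maxPodLifeTimeSeconds: 3600
    labelSelector:
      matchLabels:
        tier: batch            # instance 1: short lifetime for batch pods
- name: PodLifeTime
  enabled: true
  params:
    podLifeTime:
      maxPodLifeTimeSeconds: 86400
    labelSelector:
      matchLabels:
        team: payments         # instance 2: longer lifetime for the payments team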
As @seanmalloy said, this is a much higher-level discussion which will require a design doc for the new version. A new API version is an opportunity for us to resolve other possible improvements we've uncovered as the project has grown (one example is the change I suggested in #314).
Technically, an alpha API can be removed at any time, especially since this is just a config API and the descheduler is the only consumer of it. However, as good citizens it would be preferable to support both APIs to allow users time to switch over. This will require us to generate conversion functions between the two.

I will start a design document to gather some of the notes that have been mentioned here so far.
I don't recall the exact reason, but I believe we have done it for the sake of being in sync with feature gate flags.
Thanks @damemi, I will wait for the design doc to start working on this feature.
Here is a link to the doc I started: https://docs.google.com/document/d/1S1JCh-0F-QCJvBBG-kbmXiHAJFF8doArhDIAKbOj93I/edit#

Feel free to comment there or propose any alternatives. I think the options are fairly straightforward.
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-contributor-experience at kubernetes/community.
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Just adding a voice of support on this issue. The ability to define different TTLs for pods based on different labels is pretty important to us.
/remove-lifecycle stale
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'm also keen to see this moving. Just like @diranged, at the company I work for we really wanted specific strategies per namespace/label. Without this, it's hard to adopt the descheduler: although it does do what we want, it doesn't do it the way we wanted. I initially thought the descheduler would be a fully-fledged operator where we could provide CRDs in a distributed way and it would merge/compile the strategies/policies into one config and run it. If that were the case, it would be awesome. (A question for the core team): would you be willing to move towards the "operator with CRDs" target?
Has this been superseded by #926? It appears the original design doc went stale in April 2021 and was replaced by a new document. If so, might I suggest closing this issue in favor of #926? I found it a bit confusing to find the new design.
Many of the issues were kept open so we can keep track of the original requests. @damemi, would you be ok with referencing the new design in https://docs.google.com/document/d/1S1JCh-0F-QCJvBBG-kbmXiHAJFF8doArhDIAKbOj93I/edit#?
Yup, done. Also, I agree with @ingvagabund on keeping this open. This issue is a subset of the overall design of the v1alpha2 API in #926.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to a set of lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
This is now possible using profiles in v1alpha2.
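For anyone finding this later, a rough sketch of how that looks with v1alpha2 profiles (profile names, selectors, and lifetimes are illustrative; check the current descheduler docs for the exact schema):

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
- name: cron-pods                       # one PodLifeTime configuration, scoped by a label selector
  pluginConfig:
  - name: "PodLifeTime"
    args:
      maxPodLifeTimeSeconds: 3600
      labelSelector:
        matchLabels:
          tier: cron
  plugins:
    deschedule:
      enabled:
      - "PodLifeTime"
- name: everything-else                 # a second, independent PodLifeTime configuration
  pluginConfig:
  - name: "PodLifeTime"
    args:
      maxPodLifeTimeSeconds: 86400
  plugins:
    deschedule:
      enabled:
      - "PodLifeTime"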
@a7i: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Is your feature request related to a problem? Please describe.
k8s version: 1.18
descheduler version: 1.18
Currently, we have CronJob resources that are getting stuck due to this k8s issue: kubernetes/kubernetes#52172. We are using managed k8s (EKS), and 1.19 is not available yet. The issue is apparently fixed in 1.19, but we're unable to upgrade.

Long story short, a CronJob pod will spin up, will immediately get stuck due to the bug described in the linked issue above, and will ultimately end up in the Waiting state with the reason CreateContainerError. The only way to address this currently is to manually delete the problematic pods and have k8s re-schedule the CronJob objects during the next scheduled run. Currently, these problematic pods are blocking the CronJob resource from scheduling new pods.
Describe the solution you'd like
We'd like to leverage the descheduler to handle this case by allowing for multiple PodLifeTime strategies. As I understand it from the documentation, we're only allowed to implement one definition per strategy. We'd like to have a policy that applies only to our CronJob objects (which have a priority class called service-cron), and another policy that applies to all other pods.

For example, something like the configuration sketched below would be desired:
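The original example configuration was not preserved in this thread. Purely as a hypothetical reconstruction of the request (borrowing the list-style layout sketched elsewhere in this thread, which v1alpha1 does not support, and a made-up app-type label standing in for the service-cron pods), it would express something like:

apiVersion: "descheduler/v1alpha2"      # hypothetical layout, not the author's original config
kind: "DeschedulerPolicy"
strategies:
- name: PodLifeTime                     # evict stuck CronJob pods quickly
  enabled: true
  params:
    podLifeTime:
      maxPodLifeTimeSeconds: 600
    labelSelector:
      matchLabels:
        app-type: cron
- name: PodLifeTime                     # everything else gets a much longer lifetime
  enabled: true
  params:
    podLifeTime:
      maxPodLifeTimeSeconds: 604800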
If there are other ways to accomplish this, please let me know! I'm open to suggestions.
Describe alternatives you've considered
We've considered upgrading our self-managed clusters to 1.19, but our managed clusters are unable to upgrade since 1.19 isn't available yet. We'd prefer not to run into version drift across our clusters.
What version of descheduler are you using?
descheduler version: 1.18
Additional context