Provide support for explicitly pausing autoscaling of workloads. #944
I like the idea of a separate CRD, especially because of the description field and to leave the …
I also support this idea and what @flecno said; however, I would like to see a more generic CRD where you can not only scale to zero, but also enforce any value you like. Maybe call it … This would help us in certain other rare situations, where we have to process more, independent of the triggers.
The goal of this separate CRD is to override the ScaledObject. With the manual approach, what would be the scenario where you cannot just scale, let's say, the deployment itself?
Scaling to zero or scaling to n: in both cases you enforce a fixed number of pods that you require, where the autoscaling is not supposed to interfere. Using … Example ScaledObject:
Current scale is 1, trying to scale to 3 with …
And the spec of the KEDA-generated HPA is not above maxReplicas:
So all I wanted to suggest was that a more generic override with any value would be appreciated (not only zero).
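For context, a minimal ScaledObject of the kind being discussed might look like the sketch below. All names are placeholders and the CPU trigger is an illustrative assumption, not taken from the thread (trigger syntax varies across KEDA versions):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject      # hypothetical name
spec:
  scaleTargetRef:
    name: example-deployment      # hypothetical target Deployment
  minReplicaCount: 1
  maxReplicaCount: 10
  triggers:
    - type: cpu                   # placeholder trigger; requires CPU requests on the target
      metricType: Utilization
      metadata:
        value: "50"
```

Manually scaling the target above `maxReplicaCount` will not stick, because the HPA that KEDA generates caps replicas at that value; that is the interference described above.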
Being able to suspend scaling during operations, without removing or touching object state, would be ideal. With our HPA setup we could scale the target object down to 0 without touching any of the min/max parameters of the HPA, and allow the HPA to kick back in afterwards by scaling the target back up to 1.
Agreed. Would you prefer a new CRD for that, or what would you expect from KEDA?
@tomkerkhove a CRD would work for sure! It feels a bit elaborate compared to just setting an annotation/label on one of the involved objects, but it also leaves room for much more functionality, I guess?
I am more inclined towards an annotation/label-based solution; it is cleaner and we don't introduce a new set of resources. Out of curiosity, what benefits do you see behind the CRD-based solution?
Oh, I was just checking what the expectations were; either is fine for me, but it would be good if this got surfaced somehow if we can do …
Exposing it in the get would be nice for sure. At first glance I thought the …
This is already possible today by aligning the min/max replica counts.
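What "aligning min/max" means concretely, as a fragment of a ScaledObject spec (the value 2 is a placeholder; scale-to-zero is handled separately by KEDA's activation logic, since the generated HPA itself only manages the 1..N range):

```yaml
# Fragment of a ScaledObject spec: pin the workload by aligning min and max.
spec:
  minReplicaCount: 2   # set both fields to the same value N
  maxReplicaCount: 2   # to hold the workload at N replicas
```

This approximates pausing, but it does mutate the ScaledObject spec, which is exactly the "touching object state" that the previous comment wants to avoid.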
This is my current thinking, actually: to have a …
Ideally, it would be something where granular permission could be given, in the event someone on-call needs to temporarily disable it without touching the real object or any git re-run, etc. But yeah, big +1 to this request.
@zroubalik OK for you if we commit to this for our roadmap? |
@tomkerkhove yeah :) |
Guys, would implementing something like this handle this case?
Shall we aim for 2.3 for this @jeffhollan @zroubalik ? |
As prior art, CronJob objects have a boolean field for …
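The prior art referred to here is presumably the CronJob API's `suspend` field (a standard Kubernetes `batch/v1` field, stated as background and not taken from the thread), which pauses creation of new Jobs without deleting the object:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob      # hypothetical name
spec:
  suspend: true              # pauses scheduling of new Jobs; flip to false to resume
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: busybox
              command: ["sh", "-c", "echo hello"]
```

The appeal as a model is that suspending is a single declarative toggle that leaves the rest of the object's configuration untouched.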
That's fine by me to have that or something similar.
Fine by me :) I am still not sure whether we should just explicitly scale to 0 (then I am up for using the same property as CronJob does) or whether we should scale to some specific number of replicas (this would require a different property).
Based on what I've heard and seen, I think we should consider explicitly scaling to X and "disabling/pausing" autoscaling without removing the … Then you can do maintenance jobs or so, and just put it back to autoscaling when done.
One feature that would be nice to have is a duration for the scaling override. It can be all too easy to scale things to 0 when something happens, and then somebody forgets to scale it back up. Being able to say "scale to 0, then return to normal in 4 hours" would solve that problem nicely.
@derekperkins That really doesn't work well in a convergent system :-/ At best you can do "disregard this config after timestamp X", but it's an inherently non-convergent behavior, so the edge cases get gnarly.
@zroubalik how so? Is it because …
Exactly. And a second concern (though minor, and just nitpicking): I don't like the complexity of the annotation. Is it …
Fine by me, it was just an idea.
As we're planning KEDA v2.7 next week, I'll keep this open until we have ScaledJob support.
So to summarize: we will support ScaledJobs through the … Do we support it as well? Keeping in mind that this would be the order of application:
TBH I prefer …
Using this approach we'll be consistent by always using both annotations (to avoid logic problems if only one is set), and we can keep the already-existing behavior without any other change in how we manage the replicas; only the annotation handling itself changes. Also, we will "correct" misconfiguration from our users, giving them the "correct" solution.
The risk you have there is this:
We could, in theory, automatically remove …
You are right, but if we don't add the extra needed annotation, I'd require both for doing anything. I mean, even though we could work using only …
I'm not sure I understand what you mean, can you elaborate please?
Sure. My proposal is just requiring both in the case of … This could simplify the logic in the …
I don't think that really makes sense nor adds value:
The logic around this should be very straightforward and is fairly simple IMO:
Logic is straightforward and easy to document.
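For reference, the pause controls discussed here are expressed as annotations on the ScaledObject; a sketch of how they are applied, with placeholder names (the annotation keys match what later KEDA releases document, but check your version's docs):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject                   # hypothetical name
  annotations:
    autoscaling.keda.sh/paused: "true"         # pause autoscaling at the current count
    autoscaling.keda.sh/paused-replicas: "0"   # pause AND pin to an explicit count
spec:
  scaleTargetRef:
    name: example-deployment                   # hypothetical target
```

Per the logic sketched above, an explicit paused-replicas value wins over a bare pause, and removing the annotations resumes normal autoscaling.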
Okay, we can start with that and iterate if we find problems.
So we'll start with this approach then? If so, I'll create a new issue for support in scaled jobs and one for this new flavor in scaled object. |
@tomkerkhove I have some spare cycles this week and could take on the implementation for ScaledJob if you've not started already. I currently need the ability to pause/prevent execution of cron-triggered ScaledJobs, and working on a pull request seems less kludgy than any other workaround we've discussed.
As far as I know, that's not been implemented yet so feel free to take a stab at it! |
@tomkerkhove @DanInProgress
So you want to basically pause autoscaling without knowing what the current replica count is then, @arnoldyahad?
Thanks for the very quick reply @tomkerkhove. Our use-case right now is with spot.io and EC2 instances, where we can just press "pause autoscaling" in spot.io and it pauses all autoscaling activities immediately. This is important to us because when production goes down, CPU drops to 0% and we begin a massive scale down, so we call the spot.io API to suspend autoscaling. We would like our devs to be able to respond quickly enough to such incidents by just putting an annotation like …
Can't they just get the current replicas and set them? I mean, they can just run …
and get the value from it; if they can annotate the ScaledObject, I assume they can get the current replica count.
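One way to do what's being described, sketched with kubectl (the resource names are placeholders, and the paused-replicas annotation is the mechanism this thread converges on; this requires a live cluster, so treat it as an ops sketch):

```shell
# Read the current replica count of the target Deployment (hypothetical names).
REPLICAS=$(kubectl get deployment example-deployment -o jsonpath='{.spec.replicas}')

# Pin the ScaledObject to that count while pausing autoscaling.
kubectl annotate scaledobject example-scaledobject \
  "autoscaling.keda.sh/paused-replicas=${REPLICAS}" --overwrite

# Later, resume autoscaling by removing the annotation (trailing "-" deletes it).
kubectl annotate scaledobject example-scaledobject \
  "autoscaling.keda.sh/paused-replicas-"
```

This works, but as the next comment notes, the read-then-annotate step is exactly the friction a plain "pause where you are" annotation would remove.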
As I've mentioned before: while specifying a paused replica count is useful in some scenarios, in other scenarios you don't really care what the current instance count is and just want to pause where you are today. Using the command above helps with that, but it is an additional step that we can and should avoid, to make life easier for @arnoldyahad and others. Our tag line is to make app autoscaling simple, but in this case it's friction we can remove easily.
This was what we landed on, AFAIK, but it never got a final "sign off", so I did not create issues yet; I would love to create them.
@JorTurFer Thanks for the comment; meanwhile we will indeed use a workaround. If someone is interested: …
Provide support for explicitly marking workloads to scale to zero, without the option of scaling up.
This can be useful for manually scaling instances to zero because:
Why not delete the deployment? Glad you asked! Because we don't want to touch the applications themselves, but merely remove the instances they are running, from an operational perspective. Once everything is good to go, we can enable them to scale again.
Suggestion
Introduce a new CRD, for example ManualScaleToZero, which targets a given deployment/workload and provides a description of why it is scaled to 0 for now. If scaled objects/jobs are configured, they are ignored in favor of the new CRD.
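A sketch of what such a CRD instance could look like. This is entirely hypothetical: the kind, API group, and fields are illustrative, not an agreed design from the thread:

```yaml
apiVersion: keda.sh/v1alpha1          # hypothetical group/version for the proposal
kind: ManualScaleToZero
metadata:
  name: pause-orders-service          # hypothetical name
spec:
  scaleTargetRef:
    name: orders-service              # deployment/workload to hold at zero
  description: "Paused for database maintenance, ticket OPS-123"  # why it is scaled to 0
```

Deleting the object would restore normal autoscaling, mirroring how the thread later settled on add/remove annotation semantics.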