Add an option to trigger iterations more frequently #6589
Conversation
CC @x13n
I take it we're planning on following this up with a separate PR to make use of this functionality?
I'm a bit hesitant about exposing internal fields of static autoscaler like this when there are plenty of existing hooks in static autoscaler already. Could this instead be implemented as a scale up/down status processor?
@gjtempleton I actually planned to do this in a forked Cluster Autoscaler implementation, assuming that exposing these fields won't hurt. But I can follow up with enabling the same in this repo if the repo owners are OK with it. @x13n can you elaborate more on what you are worried about? I can agree that exposing all of the internal state would not be a good idea, but for these specific fields it's fairly natural to show them to the external world (read-only). Additionally, the "run" function where these values would actually be used [1] can easily access outputs from the cluster autoscaler, while processors don't have any natural way of passing the results there. [1] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/main.go#L574
This PR only exposes two fields, so it is a bit hard to discuss how exactly it will be used. Regarding the [...] But to Guy's point above: let's maybe start with the purpose of the change and then decide how best to fulfill it, instead of starting by exposing internal fields of StaticAutoscaler without extra context? Maybe exposing these fields is the way to go, but it is hard to decide if that's the entire change.
We discussed this change offline with @x13n and decided to implement a more comprehensive (flag-guarded) feature of improving autoscaling throughput and latency in the same PR. |
cluster-autoscaler/main.go
Outdated
    metrics.UpdateDurationFromStart(metrics.Main, loopStart)
}
runAutoscalerOnce := func(loopStart time.Time) {
Any reason not to make it a proper function, perhaps in the new loop module?
Done
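For reference, a minimal sketch of what extracting the closure into the new loop package could look like. Apart from metrics.UpdateDurationFromStart and metrics.Main, which appear in the snippet above, the names and signature here are assumptions rather than the exact upstream API.

    package loop

    import (
        "time"

        "k8s.io/autoscaler/cluster-autoscaler/metrics"
        "k8s.io/klog/v2"
    )

    // Autoscaler is a hypothetical minimal interface for whatever runs one iteration.
    type Autoscaler interface {
        RunOnce(currentTime time.Time) error
    }

    // RunAutoscalerOnce runs a single autoscaling iteration and records its
    // duration, replacing the inline closure from main.go.
    func RunAutoscalerOnce(a Autoscaler, loopStart time.Time) {
        defer metrics.UpdateDurationFromStart(metrics.Main, loopStart)
        if err := a.RunOnce(loopStart); err != nil {
            klog.Errorf("Failed autoscaler iteration: %v", err)
        }
    }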
cluster-autoscaler/loop/trigger.go
Outdated
)

// StaticAutoscaler exposes recent autoscaler activity
type StaticAutoscaler interface {
It doesn't need to be public, this interface is meant to be consumed locally in this module. Also, StaticAutoscaler is the current implementation of the interface, but it doesn't make a lot of sense as the name; with the current set of exposed functions, something along the lines of scalingTimesGetter would be more appropriate.
Done
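For context, a minimal sketch of the rename suggested above, assuming the trigger only needs the two timestamp getters used in the snippets below (to live in cluster-autoscaler/loop/trigger.go, with "time" imported):

    // scalingTimesGetter exposes recent autoscaler activity to the loop trigger.
    // It is unexported because it is only consumed within the loop package.
    type scalingTimesGetter interface {
        // LastScaleUpTime is the time of the most recent scale-up.
        LastScaleUpTime() time.Time
        // LastScaleDownDeleteTime is the time of the most recent scale-down node deletion.
        LastScaleDownDeleteTime() time.Time
    }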
cluster-autoscaler/loop/trigger.go
Outdated
// Wait waits for the next autoscaling iteration
func (t *LoopTrigger) Wait(lastRun time.Time) {
    sleepStart := time.Now()
    defer metrics.UpdateDurationFromStart("loopWait", sleepStart)
"loopWait"
should be a constant in metrics.go
, same as other function labels.
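A possible shape for that constant, assuming metrics.go declares function labels with the same FunctionLabel string type that appears to back the existing metrics.Main label (the placement and doc comment are assumptions):

    // In cluster-autoscaler/metrics/metrics.go (sketch).
    const (
        // LoopWait is the label for time spent waiting for the next loop iteration.
        LoopWait FunctionLabel = "loopWait"
    )

The deferred call in Wait would then become metrics.UpdateDurationFromStart(metrics.LoopWait, sleepStart).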
Done
if !t.autoscaler.LastScaleUpTime().Before(lastRun) ||
    !t.autoscaler.LastScaleDownDeleteTime().Before(lastRun) {
    select {
    case <-t.podObserver.unschedulablePodChan:
So you're clearing the signal about an unschedulable pod appearing here, but not the one about the scan interval passing. Why are you treating some triggers differently than others? This will occasionally waste a loop that isn't needed, which is not terrible, but definitely something we could avoid.
We're clearing just this one channel because it persists between loops, while the time.After() channel is re-created with each iteration. I don't think it will have the effect you're suggesting, but I'll actually test that; for now I'm publishing the other responses.
Ah, you're right, we don't make a call to After() at all in this branch, nvm my comment then :)
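To make the control flow under discussion concrete, here is a sketch of the full Wait path. The type and field names around podObserver/scanInterval follow the snippets above but are assumptions rather than the exact upstream code, and metrics.LoopWait refers to the hypothetical constant sketched earlier.

    package loop

    import (
        "time"

        "k8s.io/autoscaler/cluster-autoscaler/metrics"
        "k8s.io/klog/v2"
    )

    // UnschedulablePodObserver is assumed to send on this channel whenever a
    // new unschedulable pod is observed.
    type UnschedulablePodObserver struct {
        unschedulablePodChan <-chan struct{}
    }

    // LoopTrigger decides when the next iteration should start.
    type LoopTrigger struct {
        autoscaler   scalingTimesGetter
        podObserver  *UnschedulablePodObserver
        scanInterval time.Duration
    }

    // Wait blocks until the next iteration should run. If the previous
    // iteration was productive (a scale-up or a scale-down node deletion
    // happened at or after lastRun), it returns immediately, draining a
    // possibly pending unschedulable-pod signal so that signal does not
    // trigger a redundant extra loop later. Otherwise it waits for a new
    // unschedulable pod or for the scan interval, whichever comes first; the
    // time.After channel is created per call, so it never needs draining.
    func (t *LoopTrigger) Wait(lastRun time.Time) {
        sleepStart := time.Now()
        defer metrics.UpdateDurationFromStart(metrics.LoopWait, sleepStart)

        if !t.autoscaler.LastScaleUpTime().Before(lastRun) ||
            !t.autoscaler.LastScaleDownDeleteTime().Before(lastRun) {
            select {
            case <-t.podObserver.unschedulablePodChan:
                // Drain the pending signal; this run is happening anyway.
                klog.Info("Autoscaler loop triggered by unschedulable pod appearing")
            default:
                klog.Info("Autoscaler loop triggered immediately after a productive iteration")
            }
            return
        }

        select {
        case <-t.podObserver.unschedulablePodChan:
            klog.Info("Autoscaler loop triggered by unschedulable pod appearing")
        case <-time.After(t.scanInterval):
            klog.Infof("Autoscaler loop triggered after %v", t.scanInterval)
        }
    }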
/assign
Thanks, these are great suggestions.
Putting it on hold for now since I'll want to re-test it when approved.
/hold
With this change, the autoscaler triggers iterations more frequently: based on new unschedulable pods and every time a previous iteration was productive.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kawych, x13n
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
I re-tested it, submitting.
/unhold
What type of PR is this?
What this PR does / why we need it:
Trigger new autoscaling iterations based on two additional criteria:
- a new unschedulable pod appearing
- the previous iteration being productive (a scale-up or a scale-down node deletion happened)
This also avoids burning CPU unnecessarily, which would happen if we just reduced scanInterval to some value close to zero.
This functionality is flag-guarded, disabled by default.
Does this PR introduce a user-facing change?
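As a rough illustration only: assuming the loop package sketches above (Autoscaler, RunAutoscalerOnce, LoopTrigger) and an illustrative frequentLoops flag, the flag-guarded wiring in main.go could look roughly like this; none of these names are guaranteed to match the actual implementation.

    package main

    import (
        "time"

        "k8s.io/autoscaler/cluster-autoscaler/loop"
    )

    // runLoop is a sketch of the flag-guarded wiring, reusing the loop package
    // sketches above; every name here is illustrative.
    func runLoop(frequentLoops bool, scanInterval time.Duration, a loop.Autoscaler, trigger *loop.LoopTrigger) {
        lastRun := time.Now()
        for {
            if frequentLoops {
                // Returns early after a productive iteration or when a new
                // unschedulable pod appears; otherwise waits out scanInterval.
                trigger.Wait(lastRun)
            } else {
                // Default behavior: purely time-based iterations.
                time.Sleep(scanInterval)
            }
            lastRun = time.Now()
            loop.RunAutoscalerOnce(a, lastRun)
        }
    }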