
Add an option to trigger iterations more frequently #6589

Merged

k8s-ci-robot merged 1 commit into kubernetes:master on Mar 15, 2024

Conversation

kawych (Contributor) commented on Mar 6, 2024

What type of PR is this?

What this PR does / why we need it:

Trigger new autoscaling iterations based on two additional criteria:

  1. There are new unschedulable pods - this slightly reduces autoscaling latency.
  2. The last iteration was productive (there was a scale-up or scale-down) - this increases autoscaling throughput in cases where there are multiple heterogeneous workloads, each requiring a separate iteration.

This also avoids burning CPU unnecessarily, which would happen if we just reduced scanInterval to some value close to zero.

This functionality is flag-guarded, disabled by default.
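
For illustration, here is a simplified, hedged sketch of the trigger logic described above. The names (LoopTrigger, scalingTimesGetter, unschedulablePodChan) follow the snippets discussed later in this review, but the structure is an approximation rather than the merged code:

```go
// Sketch only: a trigger that starts the next iteration when the previous one
// was productive, when a new unschedulable pod appears, or when the regular
// scan interval elapses. Field layout and drain behavior are assumptions.
package loop

import "time"

type scalingTimesGetter interface {
	LastScaleUpTime() time.Time
	LastScaleDownDeleteTime() time.Time
}

// LoopTrigger decides when the next autoscaling iteration should start.
type LoopTrigger struct {
	scanInterval         time.Duration
	autoscaler           scalingTimesGetter
	unschedulablePodChan <-chan struct{} // signalled when a new unschedulable pod appears
}

// Wait blocks until one of the trigger conditions is met.
func (t *LoopTrigger) Wait(lastRun time.Time) {
	// A productive previous iteration (scale-up or scale-down since lastRun)
	// triggers the next one immediately.
	if !t.autoscaler.LastScaleUpTime().Before(lastRun) ||
		!t.autoscaler.LastScaleDownDeleteTime().Before(lastRun) {
		// Drain any pending pod signal so it doesn't trigger an extra loop later.
		select {
		case <-t.unschedulablePodChan:
		default:
		}
		return
	}
	// Otherwise wait for a new unschedulable pod or for the scan interval.
	select {
	case <-t.unschedulablePodChan:
	case <-time.After(t.scanInterval):
	}
}
```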

Does this PR introduce a user-facing change?

NONE

@k8s-ci-robot added the cncf-cla: yes label (indicates the PR's author has signed the CNCF CLA) on Mar 6, 2024
kawych (Contributor, Author) commented on Mar 6, 2024

CC @x13n

@k8s-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Mar 6, 2024
gjtempleton (Member) commented:

I take it we're planning on following this up with a separate PR to make use of this functionality?

x13n (Member) commented on Mar 11, 2024

I'm a bit hesitant about exposing internal fields of the static autoscaler like this when there are already plenty of existing hooks in the static autoscaler. Could this instead be implemented as a scale up/down status processor?
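
For illustration, a hypothetical sketch of what a status-processor-based approach could look like: a processor that records when scaling happened, so the outer loop can read the timestamps without reaching into StaticAutoscaler. The types below are simplified stand-ins, not the real processors package API:

```go
// Sketch only: ScaleUpStatus and the processor shape are simplified
// assumptions for illustration.
package status

import "time"

// ScaleUpStatus is a simplified stand-in for the result of a scale-up attempt.
type ScaleUpStatus struct {
	ScaledUp bool
}

// scaleUpTimeRecorder remembers the last time a scale-up happened.
type scaleUpTimeRecorder struct {
	lastScaleUp time.Time
}

// Process records the time of a successful scale-up.
func (r *scaleUpTimeRecorder) Process(status *ScaleUpStatus) {
	if status.ScaledUp {
		r.lastScaleUp = time.Now()
	}
}

// LastScaleUpTime lets the loop trigger poll for recent activity.
func (r *scaleUpTimeRecorder) LastScaleUpTime() time.Time {
	return r.lastScaleUp
}
```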

kawych (Contributor, Author) commented on Mar 11, 2024

@gjtempleton I actually planned to do this in a forked Cluster Autoscaler implementation, assuming that exposing these fields won't hurt. But I can follow up with enabling the same in this repo if the repo owners are OK with it.

@x13n can you elaborate on what you are worried about? I agree that exposing all of the internal state would not be a good idea, but for these specific fields it's fairly natural to expose them to the external world (read-only). Additionally, the "run" function where these values would actually be used [1] can easily access outputs from the cluster autoscaler, while processors don't have any natural way of passing their results there.

[1] https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/main.go#L574

x13n (Member) commented on Mar 12, 2024

This PR only exposes two fields, so it is a bit hard to discuss how exactly they will be used. Regarding the run function - we should limit the amount of logic in main, rather than keep adding to it.

But to Guy's point above: let's maybe start with the purpose of the change and then decide how best to fulfill it, instead of starting by exposing internal fields of StaticAutoscaler without extra context. Maybe exposing these fields is the way to go, but it is hard to decide if that's the entire change.

@k8s-ci-robot added the size/L label (denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/S label on Mar 13, 2024
@kawych changed the title from "Expose autoscaler's recent activity times" to "Add an option to trigger iterations more frequently" on Mar 14, 2024
kawych (Contributor, Author) commented on Mar 14, 2024

We discussed this change offline with @x13n and decided to implement a more comprehensive (flag-guarded) feature for improving autoscaling throughput and latency in the same PR.


metrics.UpdateDurationFromStart(metrics.Main, loopStart)
}
runAutoscalerOnce := func(loopStart time.Time) {
x13n (Member) commented:

Any reason not to make it a proper function, perhaps in the new loop module?

kawych (Contributor, Author) replied:

Done
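
For context, a hedged sketch of the kind of refactor being discussed here: the inline closure from main pulled out into a named helper. The Autoscaler interface and parameter list are simplified assumptions, not the exact code that was merged:

```go
// Sketch only: a named helper replacing the runAutoscalerOnce closure in main.
package loop

import "time"

// Autoscaler is the minimal surface needed to run one iteration (assumed).
type Autoscaler interface {
	RunOnce(currentTime time.Time) error
}

// RunAutoscalerOnce runs a single autoscaling iteration and reports its
// duration through the supplied callback (a stand-in for the metrics call).
func RunAutoscalerOnce(a Autoscaler, recordDuration func(start time.Time), loopStart time.Time) {
	defer recordDuration(loopStart)
	if err := a.RunOnce(loopStart); err != nil {
		// Error handling (logging, failure metrics) would go here in the real code.
		return
	}
}
```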

)

// StaticAutoscaler exposes recent autoscaler activity
type StaticAutoscaler interface {
x13n (Member) commented:

It doesn't need to be public; this interface is meant to be consumed locally in this module. Also, StaticAutoscaler is the current implementation of the interface, but it doesn't make a lot of sense as the name - with the current set of exposed functions, something along the lines of scalingTimesGetter would be more appropriate.

kawych (Contributor, Author) replied:

Done

// Wait waits for the next autoscaling iteration
func (t *LoopTrigger) Wait(lastRun time.Time) {
sleepStart := time.Now()
defer metrics.UpdateDurationFromStart("loopWait", sleepStart)
x13n (Member) commented:

"loopWait" should be a constant in metrics.go, same as other function labels.

kawych (Contributor, Author) replied:

Done
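
A hedged sketch of what this suggestion amounts to: define the label as a typed constant next to the other function labels in metrics.go and use it at the call site instead of a string literal. The FunctionLabel type and constant names are assumptions about the metrics package layout, not verbatim code:

```go
// Sketch only: function labels for duration metrics.
package metrics

// FunctionLabel identifies a part of the main loop for duration metrics.
type FunctionLabel string

const (
	// Main covers a whole autoscaling iteration.
	Main FunctionLabel = "main"
	// LoopWait covers time spent waiting for the next iteration trigger.
	LoopWait FunctionLabel = "loopWait"
)
```

The call site would then read `defer metrics.UpdateDurationFromStart(metrics.LoopWait, sleepStart)` rather than passing the raw string.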

if !t.autoscaler.LastScaleUpTime().Before(lastRun) ||
!t.autoscaler.LastScaleDownDeleteTime().Before(lastRun) {
select {
case <-t.podObserver.unschedulablePodChan:
x13n (Member) commented:

So you're clearing the signal about an unschedulable pod appearing here, but not about the scan interval passing. Why are you treating some triggers differently than others? This will occasionally lead to wasting a loop that isn't needed, which is not terrible, but definitely something we could avoid.

kawych (Contributor, Author) replied:

We're clearing just this one channel because it persists between loops, while the time.After() channel is re-created with each iteration. I don't think it will have the effect you're suggesting, but I'll test that; for now I'm publishing the other responses.

x13n (Member) replied:

Ah, you're right, we don't make a call to After() at all in this branch, nvm my comment then :)
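
For readers following this thread, a tiny self-contained illustration of the asymmetry kawych describes: the unschedulable-pod channel persists across iterations and can carry a stale signal, while time.After returns a fresh channel on every call. All names here are illustrative, not taken from the PR:

```go
// Demonstrates why only the persistent channel ever needs draining.
package main

import (
	"fmt"
	"time"
)

func main() {
	unschedulablePodChan := make(chan struct{}, 1) // persists across iterations
	unschedulablePodChan <- struct{}{}             // stale signal from a previous loop

	for i := 0; i < 2; i++ {
		select {
		case <-unschedulablePodChan:
			fmt.Println("iteration", i, "triggered by (possibly stale) pod signal")
		case <-time.After(10 * time.Millisecond): // fresh channel each iteration
			fmt.Println("iteration", i, "triggered by scan interval")
		}
	}
}
```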

x13n (Member) commented on Mar 15, 2024

/assign

kawych (Contributor, Author) left a review comment:

Thanks, these are great suggestions.

Putting it on hold for now since I'll want to re-test it when approved.

/hold


@k8s-ci-robot added the do-not-merge/hold label (indicates that a PR should not merge because someone has issued a /hold command) on Mar 15, 2024
Commit message (truncated): "…more frequently: based on new unschedulable pods and every time a previous iteration was productive."
x13n (Member) commented on Mar 15, 2024

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Mar 15, 2024
k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kawych, x13n

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) on Mar 15, 2024
kawych (Contributor, Author) commented on Mar 15, 2024

I re-tested it, submitting.

/unhold

@k8s-ci-robot removed the do-not-merge/hold label on Mar 15, 2024
@k8s-ci-robot merged commit 109998d into kubernetes:master on Mar 15, 2024
6 checks passed