
Improve autoscaler's observed average calculation. #1194

Merged: 4 commits merged into knative:master on Jun 14, 2018

Conversation

markusthoemmes (Contributor)

Proposed Changes

The autoscaler takes all probes over a certain window (panic or stable, respectively) to calculate the observed concurrency in the system. If you simply take all probes and calculate the average across all of them, newly joined pods get an unfairly low share in the average calculation. That means that even though the system has successfully scaled up to more pods, the "old" and overloaded pods' measurements are given an unnaturally high weight vs. the newly joined pods.

This aims to solve that issue by not aggregating over all probes at once, but first aggregating an average per pod. That way, all observed pods in the system get an equal share in the observed concurrency calculation and the autoscaler scales more precisely.
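
For illustration, here is a minimal, self-contained Go sketch of the per-pod-first averaging described above. The names (averagePerPodFirst, podStats) are hypothetical and do not mirror the actual pkg/autoscaler types; this is a sketch of the idea, not the PR's implementation.

```go
package main

import "fmt"

// averagePerPodFirst averages each pod's samples individually and then averages
// those per-pod means, so every observed pod carries equal weight regardless of
// how many samples it contributed to the window.
func averagePerPodFirst(podStats map[string][]float64) float64 {
	observedPods := 0
	total := 0.0
	for _, samples := range podStats {
		if len(samples) == 0 {
			continue
		}
		sum := 0.0
		for _, s := range samples {
			sum += s
		}
		total += sum / float64(len(samples))
		observedPods++
	}
	if observedPods == 0 {
		return 0
	}
	return total / float64(observedPods)
}

func main() {
	// An "old" pod with many overloaded samples vs. a newly joined pod with a single sample.
	stats := map[string][]float64{
		"pod-old": {10, 10, 10, 10},
		"pod-new": {2},
	}
	// Per-pod-first average: (10 + 2) / 2 = 6.
	// A flat average over all probes would give 42 / 5 = 8.4, overweighting the old pod.
	fmt.Println(averagePerPodFirst(stats))
}
```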

Release Note

Improve autoscaler's observed average calculation.

google-prow-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) on Jun 13, 2018
markusthoemmes (Contributor, Author)

/assign @josephburnett

knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/autoscaler/autoscaler.go | 93.6% | 94.2% | 0.6 |

*TestCoverage feature is being tested, do not rely on any info here yet

	return accumulatedConcurrency / float64(agg.observedPods())
}

// hols an aggregation per pod
Contributor (review comment on the code above):

holds

rootfs (Contributor) commented on Jun 13, 2018

Would it be more precise if the average were a moving average instead of a simple average? That would handle bursty traffic too.

markusthoemmes (Contributor, Author) commented on Jun 14, 2018

@rootfs I thought about something similar. Essentially, I'd think weighting older metrics lower than newer ones would make sense.

I kinda wanted to implement this step first though; we can always improve on that. Want more changes in the current PR?

Edit: Thinking about it again: aren't we already effectively using a moving average, since we only look at metrics within a defined window (a time window rather than a probe-count window, but still)?
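
To illustrate the weighting idea discussed in this comment, here is a minimal, hypothetical Go sketch of an exponentially weighted moving average. It is not part of this PR, and the names are illustrative rather than taken from pkg/autoscaler.

```go
package main

import "fmt"

// ewma computes an exponentially weighted moving average: each new sample is
// blended into the running average, so older samples decay geometrically.
// alpha is in (0, 1]; a higher alpha weights recent samples more heavily.
func ewma(samples []float64, alpha float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	avg := samples[0]
	for _, s := range samples[1:] {
		avg = alpha*s + (1-alpha)*avg
	}
	return avg
}

func main() {
	// A burst of high concurrency followed by a drop: the EWMA (4.0 here) tracks
	// the drop faster than the simple average over the same window (6.8) would.
	samples := []float64{10, 10, 10, 2, 2}
	fmt.Println(ewma(samples, 0.5))
}
```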

josephburnett (Contributor) left a comment:

This looks great!

func (agg *totalAggregation) aggregate(stat Stat) {
	if agg.perPodAggregations == nil {
		agg.perPodAggregations = make(map[string]*perPodAggregation)
	}
Contributor (review comment on the snippet above):

Nit: a more idiomatic way of doing this initialization is to create a method newTotalAggregation that returns a totalAggregation with a non-nil map. Then aggregate can just focus on incremental updates, not initialization.
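
A minimal sketch of the constructor pattern suggested in this nit, using stubbed-out versions of the types from the quoted snippet; the real fields and methods of perPodAggregation and totalAggregation in pkg/autoscaler/autoscaler.go are omitted here.

```go
package autoscaler

// Stub types to keep this sketch self-contained; the real types carry more
// fields and methods than shown here.
type perPodAggregation struct{}

type totalAggregation struct {
	perPodAggregations map[string]*perPodAggregation
}

// newTotalAggregation returns a totalAggregation with a non-nil map, so that
// aggregate can focus on incremental updates instead of lazy initialization.
func newTotalAggregation() *totalAggregation {
	return &totalAggregation{
		perPodAggregations: make(map[string]*perPodAggregation),
	}
}
```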

josephburnett (Contributor) left a comment:

/approve
/lgtm

google-prow-robot added the lgtm label (Indicates that a PR is ready to be merged) on Jun 14, 2018
google-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: josephburnett, markusthoemmes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-prow-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Jun 14, 2018
knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/autoscaler/autoscaler.go | 93.6% | 94.1% | 0.5 |

*TestCoverage feature is being tested, do not rely on any info here yet

google-prow-robot merged commit 1ca8602 into knative:master on Jun 14, 2018
nak3 pushed a commit to nak3/serving that referenced this pull request Aug 3, 2022