
Improve autoscaler's observed average calculation. #1194

Merged: 4 commits merged into knative:master on Jun 14, 2018

Conversation

markusthoemmes (Contributor)

Proposed Changes

The autoscaler takes all probes over a certain window (panic or stable, respectively) to calculate the observed concurrency in the system. If you simply take all probes and calculate the average across all of them, newly joined pods get an unfairly low share in the average calculation. That means that even though the system has successfully scaled up to more pods, the "old" and overloaded pods' measurements are given an unnaturally high weight vs. the newly joined pods.

This aims to solve that issue by not aggregating over all probes at once, but first aggregating an average per pod. That way, all observed pods in the system get an equal share in the observed concurrency calculation and the autoscaler scales more precisely.
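
For illustration, here is a minimal, self-contained Go sketch of the per-pod-first averaging described above. The names (averagePerPodFirst, podStats) are hypothetical and do not mirror the actual pkg/autoscaler types; this is a sketch of the idea, not the PR's implementation.

```go
package main

import "fmt"

// averagePerPodFirst averages each pod's samples individually and then averages
// those per-pod means, so every observed pod carries equal weight regardless of
// how many samples it contributed to the window.
func averagePerPodFirst(podStats map[string][]float64) float64 {
	observedPods := 0
	total := 0.0
	for _, samples := range podStats {
		if len(samples) == 0 {
			continue
		}
		sum := 0.0
		for _, s := range samples {
			sum += s
		}
		total += sum / float64(len(samples))
		observedPods++
	}
	if observedPods == 0 {
		return 0
	}
	return total / float64(observedPods)
}

func main() {
	// An "old" pod with many overloaded samples vs. a newly joined pod with a single sample.
	stats := map[string][]float64{
		"pod-old": {10, 10, 10, 10},
		"pod-new": {2},
	}
	// Per-pod-first average: (10 + 2) / 2 = 6.
	// A flat average over all probes would give 42 / 5 = 8.4, overweighting the old pod.
	fmt.Println(averagePerPodFirst(stats))
}
```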

Release Note

Improve autoscaler's observed average calculation.

google-prow-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) on Jun 13, 2018
markusthoemmes (Contributor, Author)

/assign @josephburnett

knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/autoscaler/autoscaler.go | 93.6% | 94.2% | 0.6 |

*TestCoverage feature is being tested, do not rely on any info here yet

	return accumulatedConcurrency / float64(agg.observedPods())
}

// hols an aggregation per pod
Contributor (review comment on the code above):

holds

rootfs (Contributor) commented on Jun 13, 2018

Would it be more precise if the average were a moving average instead of a simple average? That would handle bursty traffic too.

markusthoemmes (Contributor, Author) commented on Jun 14, 2018

@rootfs I thought about something similar. Essentially, I'd think weighting older metrics lower than newer ones would make sense.

I kinda wanted to implement this step first though; we can always improve on that. Want more changes in the current PR?

Edit: Thinking about it again: aren't we already effectively using a moving average, since we only look at metrics within a defined window (a time window rather than a probe-count window, but still)?
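
To illustrate the weighting idea discussed in this comment, here is a minimal, hypothetical Go sketch of an exponentially weighted moving average. It is not part of this PR, and the names are illustrative rather than taken from pkg/autoscaler.

```go
package main

import "fmt"

// ewma computes an exponentially weighted moving average: each new sample is
// blended into the running average, so older samples decay geometrically.
// alpha is in (0, 1]; a higher alpha weights recent samples more heavily.
func ewma(samples []float64, alpha float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	avg := samples[0]
	for _, s := range samples[1:] {
		avg = alpha*s + (1-alpha)*avg
	}
	return avg
}

func main() {
	// A burst of high concurrency followed by a drop: the EWMA (4.0 here) tracks
	// the drop faster than the simple average over the same window (6.8) would.
	samples := []float64{10, 10, 10, 2, 2}
	fmt.Println(ewma(samples, 0.5))
}
```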

josephburnett (Contributor) left a comment:

This looks great!

func (agg *totalAggregation) aggregate(stat Stat) {
	if agg.perPodAggregations == nil {
		agg.perPodAggregations = make(map[string]*perPodAggregation)
	}
Contributor (review comment on the snippet above):

Nit: a more idiomatic way of doing this initialization is to create a method newTotalAggregation that returns a totalAggregation with a non-nil map. Then aggregate can just focus on incremental updates, not initialization.
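
A minimal sketch of the constructor pattern suggested in this nit, using stubbed-out versions of the types from the quoted snippet; the real fields and methods of perPodAggregation and totalAggregation in pkg/autoscaler/autoscaler.go are omitted here.

```go
package autoscaler

// Stub types to keep this sketch self-contained; the real types carry more
// fields and methods than shown here.
type perPodAggregation struct{}

type totalAggregation struct {
	perPodAggregations map[string]*perPodAggregation
}

// newTotalAggregation returns a totalAggregation with a non-nil map, so that
// aggregate can focus on incremental updates instead of lazy initialization.
func newTotalAggregation() *totalAggregation {
	return &totalAggregation{
		perPodAggregations: make(map[string]*perPodAggregation),
	}
}
```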

josephburnett (Contributor) left a comment:

/approve
/lgtm

google-prow-robot added the lgtm label (Indicates that a PR is ready to be merged) on Jun 14, 2018
google-prow-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: josephburnett, markusthoemmes

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

google-prow-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Jun 14, 2018
knative-metrics-robot

The following is the coverage report on pkg/. Say /test pull-knative-serving-go-coverage to run the coverage report again

| File | Old Coverage | New Coverage | Delta |
|------|--------------|--------------|-------|
| pkg/autoscaler/autoscaler.go | 93.6% | 94.1% | 0.5 |

*TestCoverage feature is being tested, do not rely on any info here yet

google-prow-robot merged commit 1ca8602 into knative:master on Jun 14, 2018
nak3 pushed a commit to nak3/serving that referenced this pull request Aug 3, 2022