Improve concurrent request bucketing in autoscaler #1060

Closed
markusthoemmes opened this issue Jun 5, 2018 · 1 comment
Assignees: markusthoemmes
Labels: area/autoscale, kind/feature (Well-understood/specified features, ready for coding.)

@markusthoemmes (Contributor) commented:

/area autoscale
/kind dev
/assign @markusthoemmes

Expected Behavior

Constant load of a certain CONCURRENCY on an application should eventually result in CONCURRENCY containers being created, regardless of how long the individual requests take.

Actual Behavior

For very short-running HTTP requests, the bucketing and quantization mechanism in the queue_proxy counts all requests that fall into a specific timeframe (100ms by default) as "concurrent".

If, for example, 10 requests of 10ms each arrive strictly one after the other within one of those 100ms buckets, the concurrency reported by the queue for that bucket will be 10, even though the "real" concurrency never exceeded 1.
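
To make the over-counting concrete, here is a minimal Go sketch (not the actual queue-proxy code; all names are illustrative) that simulates 10 strictly sequential 10ms requests inside a single 100ms bucket. The bucket-style count reports 10, while the true in-flight maximum never exceeds 1.

// Minimal sketch (not the actual queue-proxy code; all names are illustrative)
// of how the current bucketing over-reports concurrency: every arrival within
// a 100ms bucket is counted, and the matching decrements are only drained
// when the bucket closes.
package main

import "fmt"

func main() {
	const bucketMs = 100
	const requestMs = 10

	reported := 0  // increments applied on arrival, decrements deferred to bucket end
	inFlight := 0  // what is actually open right now
	actualMax := 0 // the "real" concurrency the bucket ever saw

	// 10 strictly sequential requests, each finishing before the next arrives.
	for t := 0; t < bucketMs; t += requestMs {
		inFlight++
		reported++
		if inFlight > actualMax {
			actualMax = inFlight
		}
		inFlight-- // the request finishes, but its decrement is not seen until bucket end
	}

	fmt.Println("concurrency reported for the bucket:", reported) // 10
	fmt.Println("real maximum concurrency:", actualMax)           // 1
}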

Steps to Reproduce the Problem

  1. Deploy the helloworld sample: knative/serving:sample/helloworld/README.md@master#running
  2. Run a curl loop (one request after the other) against the system for long enough: for i in {0..1000}; do curl -H "Host: $SERVICE_HOST" "http://$SERVICE_IP/"; done
  3. Observe the pod count via kubectl get pods, telemetry or the autoscaler logs.

Additional Info

Even for that simple bash loop (and with a rather bad latency of roughly 30ms per request), I get 3 additional containers created, even though one would easily suffice for the task:

kubectl get pods
NAME                                          READY     STATUS    RESTARTS   AGE
helloworld-00001-deployment-8888987f5-6sx4l   4/4       Running   0          1h
helloworld-00001-deployment-8888987f5-gs8ll   4/4       Running   0          22s
helloworld-00001-deployment-8888987f5-k76tw   4/4       Running   0          18s
helloworld-00001-deployment-8888987f5-vpsk7   4/4       Running   0          12s
google-prow-robot added the area/autoscale and kind/feature (Well-understood/specified features, ready for coding.) labels on Jun 5, 2018
@markusthoemmes (Contributor, Author) commented:

Proposed solution:

A possible solution to this issue is to take the maximum actual concurrency for each of the buckets. What does that mean?

Today, as noted above, all requests that arrive in a certain timeframe (100ms by default) are measured as concurrent. In the code, this is done by draining the queue of decrements (i.e. responses sent) only after such a bucket timeframe has elapsed.

We could instead increment/decrement the concurrency counter as requests arrive and responses are sent. But then the sampling rate becomes an issue: if many of those super-short requests happen to leave the counter at 0 whenever the bucket ends, you wouldn't scale at all.

The proposed solution is therefore to take the maximum value of concurrency per recorded bucket instead of sampling the concurrency value at the end of each bucket. The maximum concurrency is a "realistic" value of how many concurrent requests there actually were at some point in time, and no arbitrary quantization value is needed, which avoids the inaccuracies explained above.

Since everything is still smoothed out by averaging over the buckets, this shouldn't be subject to spiky pathologies.
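
A minimal sketch of that idea, assuming a hypothetical concurrencyReporter type (names and structure are illustrative, not the actual queue-proxy implementation): the counter is incremented when a request starts, decremented when it finishes, and each bucket reports the maximum value observed.

// A minimal sketch of the max-per-bucket idea, assuming a hypothetical
// concurrencyReporter type; names and structure are illustrative, not the
// actual queue-proxy code.
package main

import (
	"sync"
	"time"
)

type concurrencyReporter struct {
	mu      sync.Mutex
	current int32 // requests currently in flight
	maxSeen int32 // maximum in-flight count observed in the current bucket
}

// RequestStarted is called when a request arrives at the proxy.
func (r *concurrencyReporter) RequestStarted() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.current++
	if r.current > r.maxSeen {
		r.maxSeen = r.current
	}
}

// RequestFinished is called when the response has been sent; the counter is
// decremented immediately instead of being drained at bucket boundaries.
func (r *concurrencyReporter) RequestFinished() {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.current--
}

// report returns the bucket's maximum and resets it to the current in-flight
// count, so requests spanning multiple buckets keep being observed.
func (r *concurrencyReporter) report() int32 {
	r.mu.Lock()
	defer r.mu.Unlock()
	m := r.maxSeen
	r.maxSeen = r.current
	return m
}

func main() {
	r := &concurrencyReporter{}

	// Every 100ms (the bucket size) the per-bucket maximum would be handed
	// to the autoscaler instead of a sampled or over-counted value.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			_ = r.report()
		}
	}()

	// RequestStarted/RequestFinished would wrap the proxied HTTP handler.
	r.RequestStarted()
	time.Sleep(250 * time.Millisecond)
	r.RequestFinished()
}

Resetting the maximum to the current in-flight count, rather than to zero, is one way to keep long-running requests that span multiple buckets visible to the autoscaler.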

google-prow-robot pushed a commit that referenced this issue Jun 13, 2018
* Improve concurrent request bucketing in queue-proxy.

Instead of counting all requests that arrived in a certain bucket as concurrent, the queue proxy now reports the actual maximum concurrency that was observed within each bucket.

If, for example, 3 requests arrive at once, the maximum concurrency for that bucket will be 3. If another arrives while those 3 are still open, the maximum concurrency becomes 4. Closing a request decrements the concurrency immediately (versus draining the outgoing request queue on quantization, which results in the behavior described above).

Closes #1060

* Fix edge case and add a specific test.

* Unify channels and add buffer again.

* Update config documentation.

* Adjust channel typing.
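
The arrival/closing scenario from the commit message can be traced with a tiny standalone snippet; the counters and helper closures below are illustrative, not the queue-proxy's actual variables.

// Tiny standalone walkthrough of the commit's example scenario.
package main

import "fmt"

func main() {
	current, maxSeen := 0, 0
	start := func() { // a request arrives
		current++
		if current > maxSeen {
			maxSeen = current
		}
	}
	finish := func() { current-- } // a response is sent; decrement immediately

	start()
	start()
	start()                       // 3 requests arrive at once
	fmt.Println("max:", maxSeen)  // 3
	start()                       // a 4th arrives while the first 3 are still open
	fmt.Println("max:", maxSeen)  // 4
	finish()                      // closing a request decrements the counter right away
	fmt.Println("open:", current) // 3; the bucket's maximum stays 4
}
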
skonto added a commit to skonto/serving that referenced this issue Feb 6, 2025
* fix configuration reconcile loop

* rename var

Co-authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>