-
Notifications
You must be signed in to change notification settings - Fork 809
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Global ratelimiter, part 1: core algorithm for computing weights #5689
Conversation
Pull Request Test Coverage Report for Build 018e43b5-29d4-4581-9b85-2edf6e8f9656Details
💛 - Coveralls |
a63fc02
to
11ec7a8
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the assumption of this algorithm is that the requests may not be evenly distributed among the hosts, but the distribution is somewhat stable and does not fluctuate. Is that correct?
I would add to your comment / update the implicit assumption we're making here is that it's history and the DB which is the thing which is vulnerable, and the frontend hosts themselves rarely need as much protection, so by sharing information around we are attempting to achieve fairness to the customer while protecting the vulnerable thing |
"does not continually fluctuate significantly on a second-by-second basis", yeah. It's assumed to change while our servers are are running (callers come and go and change locations), but more on the order of "minutes or longer" rather than seconds. which matches how load balancers and network planes generally work - they don't generally send a flood to one host one second, then a flood to a second host the next, etc. most attempt to evenly distribute load and are closer to round-robin or random, and at best you might see something like "send to least-requests-in-flight". but even least-requests won't send all long requests to one host and all short requests to a different one, it should still be roughly randomly distributed and so roughly stable over a longish period of time. load balancers dealing with large numbers do tend to only know about a subset of hosts though (forming a tree or incomplete graph between caller/callees, basically) which is essentially what's happening to us internally. to some degree all along, and to a MUCH greater degree recently. and internally, ^ that kind of load balance change really only occurs at around the rate of deploys or datacenter re-balancing, which is very roughly: less than hourly and takes multiple minutes to complete, as minimum bounds (many are much less frequent and much slower). as long as it takes something like a minute or less to restore ideal behavior, that should be fine, and I expect this to be a few times faster in practice (some change in ~3s, near-ideal in ~10s). if someone is using a routing system that doesn't act like ^ this, then yea IMO it's out of scope for this implementation. ideally, handling that would be "come up with a way to handle it, and plug it in as a different algorithm / different communicated data", but no doubt we'll discover we need some small changes. but that'll be entirely outside this PR's scope, as this is just the one set of assumptions. ^ and thanks for asking, clearly I should put something like this in the files, it's definitely relevant for the long term. |
Not sure if you've tried/looked, but I've recently found https://github.com/mennanov/limiters. That comes with some limitations of external db requirements, but it might be easier. |
update cycle, so this is mostly intended as a tool for reducing incorrectly-rejected | ||
requests when a ratelimit's usage is well below its allowed limit. | ||
|
||
# Dealing with expired data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is obvious after discussing in person, but I was initially confused about host-level granularity being persisted because - in the back of my mind - the problem you were solving was zonal imbalances.
Obviously since you don't aggregate at the zone level you need to do it at the host level, and therefore you have this GC problem, but that didn't go into my brain the first instance I looked at it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, the final goal for all this is to let us experiment internally while reusing the overall sharding / request flow / etc, since internally we have information that doesn't exist in other environments.
which is probably true of other environments too. "customized" seems quite likely in general, though I do expect a majority will just have round-robin or random or something and won't need this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM and we will probably iterate further as rest of the layers that depend on this get implemented. Only leftover comment from my side is MaxDelayDuration
vs UpdateRate
which would improve reasonability of the algo implementation.
…s and docs to cover new code / describe new behavior
…story service and workflow histories and whatnot
I did finally check that + the code, and AFAICT they all require a request per limit check / they're synchronous checks, not asynchronous. Might be worth checking in more detail if we decide to toss this and use an external storage and hammer it super hard, but that's not really the goal at the moment. |
Codecov Report
Additional details and impacted files
... and 7 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
|
A high level overview is covered in
common/quotas/global/doc.go
, and this specific piece is further covered in the./algorithm/requestweighted.go
file.At a very high level though, this is an isolated piece of a "deployment-wide load-balance-aware ratelimiter", intended to solve problems with our internal clusters with our current high-level frontend ratelimiters around client-triggered domain actions (start workflow, poll, etc).
This is the "logical core" of the whole system, with as few dependencies and concerns as I can manage. All frontend hosts that will be imposing limits eventually send the update arguments to an aggregating host, and the aggregating host largely just delegates to a single instance of this algorithm. Essentially all the aggregating host needs to do to respond is multiply the weight by each configured ratelimit (from dynamic config).
I've left the concurrency fairly coarse, both for simplicity and for computational speed. I don't believe this will be a bottleneck in even our largest cluster, but even if it is it should be fairly trivial to split each batch's ratelimit keys into N different sharded instances, which can be completely independent.
The problem
All of our current ratelimiters currently fall into two conceptual buckets:
While these are very straightforward and have served us well, they're proving inadequate in our internal clusters due to imbalanced load on our frontend instances. Sometimes wildly imbalanced, e.g. 10% of hosts receiving all traffic for a domain.
What this leads to is a lower effective ratelimit than is intended, e.g.:
This is leading to a nasty scenario where we tell our users they get N rps, but they can't use it all, so we raise their limits to give them an effective N rps (actually like 3N configured)... but if their requests shift in balance we might allow 3N requests through, possibly risking cluster health. This is particularly risky for our largest users, as we lose our protection against significant un-planned load increases.
So we need to do something.
The solution
We can rather clearly divide solutions into either "don't send imbalanced load" or "share information about load", because it's not possible to make correct purely-local decisions. The first is not an option for us and the second has a variety of tradeoffs.
Within "share information" side we can also divide out two options:
For just our frontend requests, we might be willing to go synchronous because the rates are relatively low and the proportional added latency would be very small... but we have imbalanced load in many other places, e.g. due to hot shards causing more requests against the database in a History host, which can reach 100,000 rps. Having a general tool seems useful, especially if it could cover all of these.
So this is part 1 of building option 2. Other parts, like "actually share data" and "configure all this", are coming in later PRs.
The end goal
Ultimately the goal is to have a system which will be ~90% accurate within ~10s regardless of changes, and "idle" at better than 99% for relatively stable / normal usage regardless of distribution. For this implementation, that implies roughly a 3s update cycle at 0.5 weight (each frontend host pushes aggregate data every 3s, and each update moves data to the average of the previous data and the new data).
Brief, large spikes are essentially out of scope - they will quite possibly perform worse with this system, particularly if they jump between hosts every couple seconds. These have caused fairly major load spike related issues in the past though, so we have no real interest in allowing them... and we have some async APIs being built that will queue and handle this more smoothly, which will hopefully be a better solution anyway.
Whether this exact implementation survives our internal use remains to be seen. The plan is to make different algorithms (and different data exchanged) relatively pluggable so we can experiment easily, once the framework is fully in place.
And last but not least, the existing N-rps and N-rps-divided-by-hosts ratelimiters are not going away, and this async-load-balancing limiter will be fully disable-able to go back to using them (and may even forever remain disabled by default). If your clusters do not have imbalanced request load, you might prefer to use the simpler ones, and they might even behave better. They are also useful as fallbacks for the global system, in case communication breaks down for some reason.