Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Global ratelimiter, part 1: core algorithm for computing weights #5689

Merged
merged 15 commits into from
Mar 15, 2024

Conversation

Groxx
Copy link
Member

@Groxx Groxx commented Feb 23, 2024

A high level overview is covered in common/quotas/global/doc.go, and this specific piece is further covered in the ./algorithm/requestweighted.go file.

At a very high level though, this is an isolated piece of a "deployment-wide load-balance-aware ratelimiter", intended to solve problems with our internal clusters with our current high-level frontend ratelimiters around client-triggered domain actions (start workflow, poll, etc).

This is the "logical core" of the whole system, with as few dependencies and concerns as I can manage. All frontend hosts that will be imposing limits eventually send the update arguments to an aggregating host, and the aggregating host largely just delegates to a single instance of this algorithm. Essentially all the aggregating host needs to do to respond is multiply the weight by each configured ratelimit (from dynamic config).

I've left the concurrency fairly coarse, both for simplicity and for computational speed. I don't believe this will be a bottleneck in even our largest cluster, but even if it is it should be fairly trivial to split each batch's ratelimit keys into N different sharded instances, which can be completely independent.

The problem

All of our current ratelimiters currently fall into two conceptual buckets:

  1. they enforce N rps on that in-memory instance
  2. they enforce N rps on that in-memory instance divided by the number of hosts in a ring

While these are very straightforward and have served us well, they're proving inadequate in our internal clusters due to imbalanced load on our frontend instances. Sometimes wildly imbalanced, e.g. 10% of hosts receiving all traffic for a domain.

What this leads to is a lower effective ratelimit than is intended, e.g.:

  • 100 rps is allowed
  • 10 hosts exist (so each host allows 10 rps)
  • 1 host receives 100 rps, the others receive 0
  • the requests are incorrectly limited to 10rps
  • stuff slows down, oncalls get paged, nobody is happy

This is leading to a nasty scenario where we tell our users they get N rps, but they can't use it all, so we raise their limits to give them an effective N rps (actually like 3N configured)... but if their requests shift in balance we might allow 3N requests through, possibly risking cluster health. This is particularly risky for our largest users, as we lose our protection against significant un-planned load increases.

So we need to do something.

The solution

We can rather clearly divide solutions into either "don't send imbalanced load" or "share information about load", because it's not possible to make correct purely-local decisions. The first is not an option for us and the second has a variety of tradeoffs.

Within "share information" side we can also divide out two options:

  1. Synchronously check on every request. This allows "perfect" limiting and likely simpler algorithms, but has a non-trivial latency cost (unsuitable for high frequency) and risk (another point of failure, particularly in edge cases or sensitive locations).
  2. Something asynchronous. This is necessarily imprecise, but allows retries and puts no limits on limiter-usage frequency.

For just our frontend requests, we might be willing to go synchronous because the rates are relatively low and the proportional added latency would be very small... but we have imbalanced load in many other places, e.g. due to hot shards causing more requests against the database in a History host, which can reach 100,000 rps. Having a general tool seems useful, especially if it could cover all of these.

So this is part 1 of building option 2. Other parts, like "actually share data" and "configure all this", are coming in later PRs.

The end goal

Ultimately the goal is to have a system which will be ~90% accurate within ~10s regardless of changes, and "idle" at better than 99% for relatively stable / normal usage regardless of distribution. For this implementation, that implies roughly a 3s update cycle at 0.5 weight (each frontend host pushes aggregate data every 3s, and each update moves data to the average of the previous data and the new data).

Brief, large spikes are essentially out of scope - they will quite possibly perform worse with this system, particularly if they jump between hosts every couple seconds. These have caused fairly major load spike related issues in the past though, so we have no real interest in allowing them... and we have some async APIs being built that will queue and handle this more smoothly, which will hopefully be a better solution anyway.

Whether this exact implementation survives our internal use remains to be seen. The plan is to make different algorithms (and different data exchanged) relatively pluggable so we can experiment easily, once the framework is fully in place.

And last but not least, the existing N-rps and N-rps-divided-by-hosts ratelimiters are not going away, and this async-load-balancing limiter will be fully disable-able to go back to using them (and may even forever remain disabled by default). If your clusters do not have imbalanced request load, you might prefer to use the simpler ones, and they might even behave better. They are also useful as fallbacks for the global system, in case communication breaks down for some reason.

@coveralls
Copy link

coveralls commented Feb 23, 2024

Pull Request Test Coverage Report for Build 018e43b5-29d4-4581-9b85-2edf6e8f9656

Details

  • 191 of 191 (100.0%) changed or added relevant lines in 1 file are covered.
  • 81 unchanged lines in 16 files lost coverage.
  • Overall coverage increased (+0.02%) to 64.914%

Files with Coverage Reduction New Missed Lines %
service/history/task/transfer_standby_task_executor.go 2 87.01%
client/history/client.go 2 35.78%
service/history/handler/handler.go 2 49.59%
common/util.go 2 91.69%
common/membership/hashring.go 2 82.23%
common/task/fifo_task_scheduler.go 3 84.54%
common/persistence/nosql/nosqlplugin/cassandra/workflow.go 3 71.41%
common/persistence/statsComputer.go 3 93.57%
service/frontend/api/handler.go 4 62.11%
service/history/task/redispatcher.go 4 89.53%
Totals Coverage Status
Change from base Build 018e436a-a816-4303-af1c-1f41d4401a76: 0.02%
Covered Lines: 94753
Relevant Lines: 145968

💛 - Coveralls

@Groxx Groxx force-pushed the global_aggregator branch from a63fc02 to 11ec7a8 Compare February 23, 2024 01:31
Copy link
Member

@Shaddoll Shaddoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the assumption of this algorithm is that the requests may not be evenly distributed among the hosts, but the distribution is somewhat stable and does not fluctuate. Is that correct?

@davidporter-id-au
Copy link
Member

I would add to your comment / update the implicit assumption we're making here is that it's history and the DB which is the thing which is vulnerable, and the frontend hosts themselves rarely need as much protection, so by sharing information around we are attempting to achieve fairness to the customer while protecting the vulnerable thing

@Groxx
Copy link
Member Author

Groxx commented Feb 23, 2024

So the assumption of this algorithm is that the requests may not be evenly distributed among the hosts, but the distribution is somewhat stable and does not fluctuate. Is that correct?

"does not continually fluctuate significantly on a second-by-second basis", yeah. It's assumed to change while our servers are are running (callers come and go and change locations), but more on the order of "minutes or longer" rather than seconds.

which matches how load balancers and network planes generally work - they don't generally send a flood to one host one second, then a flood to a second host the next, etc. most attempt to evenly distribute load and are closer to round-robin or random, and at best you might see something like "send to least-requests-in-flight". but even least-requests won't send all long requests to one host and all short requests to a different one, it should still be roughly randomly distributed and so roughly stable over a longish period of time.
(how long depends on a lot, which is part of why these values are configurable)

load balancers dealing with large numbers do tend to only know about a subset of hosts though (forming a tree or incomplete graph between caller/callees, basically) which is essentially what's happening to us internally. to some degree all along, and to a MUCH greater degree recently.

and internally, ^ that kind of load balance change really only occurs at around the rate of deploys or datacenter re-balancing, which is very roughly: less than hourly and takes multiple minutes to complete, as minimum bounds (many are much less frequent and much slower). as long as it takes something like a minute or less to restore ideal behavior, that should be fine, and I expect this to be a few times faster in practice (some change in ~3s, near-ideal in ~10s).

if someone is using a routing system that doesn't act like ^ this, then yea IMO it's out of scope for this implementation. ideally, handling that would be "come up with a way to handle it, and plug it in as a different algorithm / different communicated data", but no doubt we'll discover we need some small changes. but that'll be entirely outside this PR's scope, as this is just the one set of assumptions.


^ and thanks for asking, clearly I should put something like this in the files, it's definitely relevant for the long term.

common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted_test.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
common/quotas/global/algorithm/requestweighted_test.go Outdated Show resolved Hide resolved
common/quotas/global/doc.go Outdated Show resolved Hide resolved
@3vilhamster
Copy link
Contributor

Not sure if you've tried/looked, but I've recently found https://github.com/mennanov/limiters.
It supports multiple storage engines and has different implementations already.

That comes with some limitations of external db requirements, but it might be easier.
But that will support any kind of bursty traffic, given that we don't hit the limit of db itself.

update cycle, so this is mostly intended as a tool for reducing incorrectly-rejected
requests when a ratelimit's usage is well below its allowed limit.

# Dealing with expired data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is obvious after discussing in person, but I was initially confused about host-level granularity being persisted because - in the back of my mind - the problem you were solving was zonal imbalances.

Obviously since you don't aggregate at the zone level you need to do it at the host level, and therefore you have this GC problem, but that didn't go into my brain the first instance I looked at it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the final goal for all this is to let us experiment internally while reusing the overall sharding / request flow / etc, since internally we have information that doesn't exist in other environments.

which is probably true of other environments too. "customized" seems quite likely in general, though I do expect a majority will just have round-robin or random or something and won't need this.

common/quotas/global/algorithm/requestweighted.go Outdated Show resolved Hide resolved
Copy link
Member

@taylanisikdemir taylanisikdemir left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes LGTM and we will probably iterate further as rest of the layers that depend on this get implemented. Only leftover comment from my side is MaxDelayDuration vs UpdateRate which would improve reasonability of the algo implementation.

@Groxx
Copy link
Member Author

Groxx commented Mar 5, 2024

Not sure if you've tried/looked, but I've recently found https://github.com/mennanov/limiters. It supports multiple storage engines and has different implementations already.

That comes with some limitations of external db requirements, but it might be easier. But that will support any kind of bursty traffic, given that we don't hit the limit of db itself.

I did finally check that + the code, and AFAICT they all require a request per limit check / they're synchronous checks, not asynchronous.

Might be worth checking in more detail if we decide to toss this and use an external storage and hammer it super hard, but that's not really the goal at the moment.

Copy link

codecov bot commented Mar 15, 2024

Codecov Report

Merging #5689 (f84e9ff) into master (9cd9109) will increase coverage by 0.03%.
The diff coverage is 100.00%.

❗ Current head f84e9ff differs from pull request most recent head 3fb97e3. Consider uploading reports for the commit 3fb97e3 to get more accurate results

Additional details and impacted files
Files Coverage Δ
common/quotas/global/algorithm/requestweighted.go 100.00% <100.00%> (ø)

... and 7 files with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9cd9109...3fb97e3. Read the comment docs.

@Groxx Groxx enabled auto-merge (squash) March 15, 2024 20:06
@Groxx Groxx merged commit 3091e41 into master Mar 15, 2024
19 checks passed
@Groxx Groxx deleted the global_aggregator branch March 15, 2024 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants