-
Notifications
You must be signed in to change notification settings - Fork 804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Global ratelimiter: improve low-weight low-usage scenarios #6238
Global ratelimiter: improve low-weight low-usage scenarios #6238
Conversation
…it based on fallback limit
A rather minor but wide-touching change, but it will be useful for a later PR.
target := rate.Limit(c.targetRPS(lkey)) | ||
limiter := c.global.Load(lkey) | ||
fallbackTarget := limiter.FallbackLimit() | ||
boosted := boostRPS(target, fallbackTarget, info.Weight, info.UsedRPS) | ||
limiter.Update(boosted) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
everything else is in service of this change here: boost limits before applying them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first glance I can't fault anything here, I'll try and take another look again tomorrow when less tired.
I appreciate the effort you've gone to explaining visually, mathematically what you're thinking and I'm sure future engineers will similarly. This sets a high bar for us.
IDL changes needed to build cadence-workflow/cadence#6238 Basically this is just: return the "used RPS" that was already being tracked. For compatibility purposes: while deploying, this will lead to version-crossing requests failing due to an unknown type, which will eventually put those keys into the "failing" mode. This is currently _intended_, to serve as a full test of that fallback system, but should not be done in general. Due to a different test (broken hashring) we have already seen the fallback system kick in correctly, so I'm fairly confident that this will work as intended. Still, it's probably worth testing, particularly because it's easy. --- This only changes thrift because this type is serialized _outside_ RPC, just like our database types. There is no proto equivalent, at best it'd be these-serialized-thrift-bytes sent in a proto wrapper.
map[shared.GlobalKey]rpc.UpdateEntry{ | ||
"test:domain-user-limit": { | ||
Weight: 1, | ||
// re 10 vs 15: used RPS only tracks accepted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what 15 refers to here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update, err := rpc.TestUpdateToAny(t, "testhost", time.Second, map[shared.GlobalKey]rpc.Calls{
"test:domain-user-limit": {
Allowed: 10,
Rejected: 5,
},
})
^ I inserted 10 allowed and 5 rejected calls before querying for weights
I would suggest putting The PR docs probably need to be thrown into a MD file, describing hwo to think about the rate-limiter's tradeoffs and algorithm |
yea, I think I'll do that once I'm done with the changes and it's fully "on" in production. better than delaying things at the moment :| |
…orkflow#6238) Part 1 of 2 to address some relatively-expected "infrequent usage behaves sub-optimally" issues. In this one, I'm tackling an issue where there are some high-usage limiting-hosts, and some occasional-and-low-usage hosts, and e.g. less than half of the target RPS is being used at any time. In ^ this scenario it's possible for the low-usage hosts to be given a weight so low that they block _most_ of the requests they receive, especially if those requests come in small bursts (which is quite common) rather than evenly distributed. This is partly because their weight is low enough to only allow a couple RPS, and partly because our limiters currently set their burst size == limit, so they also do not allow bursts that would otherwise be within-budget. This _cannot_ be addressed perfectly with an async system like this (we can't perfectly predict caller behavior), but there are some easy partial improvements worth building: 1. (this PR) Allow low-weight hosts to over-commit and accept more than their fair share of RPS, if there is free RPS in the cluster. - For _consistently_ low-RPS hosts, this should be fair and a moderate improvement, and it performs no worse than a non-global algorithm if the limit is not usually fully used. 2. Raise the burst size, so intermittent callers can accumulate a small burst budget. - Currently this matches the RPS (min 1), which means e.g. a totally innocuous 2-request-event every 10 seconds will likely reject 50%. Obviously that's not great. - "Fair" values here are fuzzier, and will be figured out in a later PR. In this PR, I "boost" RPSes that are below the fallback values, in a way that converges towards the target RPS as usage increases. This allows some over-committing, but we already allow that, and the fallback values are assumed to be "safe enough": they'd be used if the caller just waited a bit longer between calls -> the global limiter was garbage-collected. If global usage increases, the boost-amount will shrink, and the over-commit will shrink as well. So while this allows spiky requests through at a somewhat higher rate than desired, heavy usage should still converge near the target RPS rather than e.g. 2x the target. --- A lot of lines have changed, but this PR boils down to: - change the IDL to return more data (which was already tracked) as a new type - this will break the current system, causing the global limiters to use their fallback - ^ this is currently intentional, to serve as a test of that fallback, but should not generally be done. normally it should handle both types, or shift to a plugin system and just have two stacks in use, switched by type. - change the `algorithm` portion to return a struct rather than 2 values, to make this additional data easier to pass around - boost RPS when updating the ratelimiters, in the `collection` - a lot of minor changes to handle the new type, without changing behavior --- Visually, for a hopefully more intuitive grasp of the issue, this is how weights/RPS currently get distributed among a number of hosts: ![ratelimiter - orig](https://github.com/user-attachments/assets/0956a507-cd22-4874-843e-7d987d1a32b9) Low weights get a low ratio of the allowed RPS, so they are also only allowed to _grow_ a little, while high-weight hosts are allowed to grow a lot. If we assume traffic arrives at frontends roughly randomly, this is fine (if they grow, they will all grow by the same %), but in practice we see a lot of "constant high use" to some hosts, and "occasional low use" to others, e.g. due to different services performing different call patterns and being routed differently. Mathematically this makes sense... but it's relatively easy for something that only sends a couple requests every 10 seconds or so to eventually be allocated ~1 RPS because its weight is legitimately that low. But then nearly all of those low-RPS requests are rejected despite there (possibly) being more than enough free-RPS in the cluster. This PR essentially changes the behavior to look like this: ![ratelimiter - changed](https://github.com/user-attachments/assets/0193bfdf-687c-4b6e-805b-78d411c65a6f) If there is unused-RPS in the cluster, low-weight hosts are allowed to increase their RPS up to their fallback-behavior's limit. This is limited by the amount of unused-RPS, so as usage increases it will move towards the un-boosted limit (i.e. weight only), and the overall RPS should converge at the target RPS. Under high usage, it looks more like this (only a small boost added to low weight hosts): ![ratelimiter - changed high usage](https://github.com/user-attachments/assets/d005541f-4f82-4344-924f-a61159805481) _How_ the unused RPS is best-distributed is... fuzzy. Here I've taken the easy and low-data-needed option: just use it all, independently, in each low-weight host. This is more likely to prevent unfairly-blocked requests, but allows more requests in a multi-host burst (until re-balanced). Another option considered (and originally planned) was to distribute it between all *contributing* low-weight hosts, but this needs to either exchange more data or put target-RPS knowledge into both the limiter and the aggregator side of this system. I think the latter will eventually happen and it's basically fine if/when it does, but it's a larger change than I wanted to do right now, and the former just didn't feel worth the extra logical complexity. Simpler is better with this kind of system, because it's more understandable and predictable. The original idea has a simulated implementation here, with a lot of tests to run a simulated traffic pattern and log how weights / individual hosts / etc behave: cadence-workflow#6196 but that will likely not be merged - that PR served its purpose of showing whether or not my intuition about the system's behavior as a whole was correct, and the large chunk of code to do that is probably not worth maintaining.
…r new ratelimiters # Motivation: The global ratelimiter system was exhibiting some weird request-rejection at very low RPS usage. Previously it was thought this was just due to irrationally-low weights, and cadence-workflow#6238 addressed that (and is still desirable). After that was rolled out, behavior improved, but small numbers still occurred... which should not have happened because the "boosting" logic should have meant that the global limits were *at least* identical, and possibly larger. And then I found this PR's issue. # Issue and fix What was happening is that the initial `rate.NewLimiter(0,0)` was "leaking" into limits after the first update, so a request that occurred immediately after would likely be rejected, regardless of the configured limit. This happens because `(0, 0)` creates a zero-burst limit on the "primary" limiter, and the shadowed `.Allow()` calls were advancing the limiter's internal "now" value... ... and then when the limit and burst were increased, the limiter would have to fill from zero. This put it in a worse position than local / fallback limiters, which start from `(local, local)` with a zero "now" value, and then the next `.Allow()` is basically guaranteed to fill the token bucket due to many years "elapsing". The fix has two parts: 1: Avoid advancing the un-initialized limiter's internal time until a reasonable limit/burst has been set. This is done by simply not calling it while in startup mode. 2: Avoid advancing limiters' time when setting limit and burst. This means that after an idle period -> `Update()` -> `Allow()`, tokens will fill as if they were set all along, and the setters can be called in any order. The underlying `rate.Limiter` does *not* do this, it advances time when setting these... but that seems undesirable. It means old values are preferred (which is reasonable - they were set when that time passed), *and* it means that the order you call these has a significant impact on the outcome, even with the same values and the same timing. I can only see that as surprising, and worth avoiding. # Alternative approach 2 seems worth keeping. But 1 has a relatively clear alternative: Don't create the "primary" limiter until the first `Update()`. Because it's currently atomic-oriented, this can't be done safely without adding atomics or locks everywhere... so I didn't do that. If I were to do this, I would just switch to a mutex, the `rate.Limiter` already uses them so it should be near zero cost. I'm happy to build that if someone prefers, I just didn't bother this time.
…r new ratelimiters (#6280) # Motivation: The global ratelimiter system was exhibiting some weird request-rejection at very low RPS usage. On our dashboards it looks like this: <img width="927" alt="Screenshot 2024-09-11 at 18 55 09" src="https://github.com/user-attachments/assets/8236d945-0f8f-45f9-9a9c-5c908f386ae4"> Previously I thought this was just due to undesirably-low weights, and #6238 addressed that (and is still a useful addition). After that was rolled out, behavior improved, but small numbers still occurred... which should not have happened because the "boosting" logic should have meant that the global limits were *at least* identical, and likely larger. Which drove me to re-read the details and think harder. And then I found this PR's issue. # Issue and fix What was happening is that the initial `rate.NewLimiter(0,0)` detail was "leaking" into limits after the first update, so a request that occurred immediately after would likely be rejected, regardless of the configured limit. This happens because `(0, 0)` creates a zero-burst limit on the "primary" limiter, and the shadowed `.Allow()` calls were advancing the limiter's internal "now" value... ... and then when the limit and burst were increased, the limiter would have to fill from zero. This put it in a worse position than local / fallback limiters, which start from `(local, local)` with a zero "now" value, and then the next `.Allow()` is basically guaranteed to fill the token bucket due to many years "elapsing". So the fix has two parts: 1: Avoid advancing the zero-valued limiter's internal time until a reasonable limit/burst has been set. This is done by simply not calling it while in startup mode. 2: Avoid advancing limiters' time when setting limit and burst. This means that after an idle period -> `Update()` -> `Allow()`, tokens will fill as if the new values were set all along, and the setters can be called in any order. The underlying `rate.Limiter` does *not* do the second, it advances time when setting these... but that seems undesirable. It means old values are preferred (which is reasonable, they were set when that time passed), *and* it means that the order you call to set both burst and limit has a significant impact on the outcome, even with the same values and the same timing: time passes only on the first call, the second has basically zero elapsed and has no immediate effect at all (unless lowering burst). I can only see that latter part as surprising, and definitely worth avoiding. # Alternative approach 2 seems worth keeping. But 1 has a relatively clear alternative: Don't create the "primary" limiter until the first `Update()`. Because it's currently atomic-oriented, this can't be done safely without adding atomics or locks everywhere... so I didn't do that. If I were to do this, I would just switch to a mutex, the `rate.Limiter` already uses them so it should be near zero cost. I'm happy to build that if someone prefers, I just didn't bother this time.
Part 1 of 2 to address some relatively-expected "infrequent usage behaves sub-optimally" issues.
In this one, I'm tackling an issue where there are some high-usage limiting-hosts, and some occasional-and-low-usage hosts, and e.g. less than half of the target RPS is being used at any time.
In ^ this scenario it's possible for the low-usage hosts to be given a weight so low that they block most of the requests they receive, especially if those requests come in small bursts (which is quite common) rather than evenly distributed. This is partly because their weight is low enough to only allow a couple RPS, and partly because our limiters currently set their burst size == limit, so they also do not allow bursts that would otherwise be within-budget.
This cannot be addressed perfectly with an async system like this (we can't perfectly predict caller behavior), but there are some easy partial improvements worth building:
In this PR, I "boost" RPSes that are below the fallback values, in a way that converges towards the target RPS as usage increases. This allows some over-committing, but we already allow that, and the fallback values are assumed to be "safe enough": they'd be used if the caller just waited a bit longer between calls -> the global limiter was garbage-collected.
If global usage increases, the boost-amount will shrink, and the over-commit will shrink as well. So while this allows spiky requests through at a somewhat higher rate than desired, heavy usage should still converge near the target RPS rather than e.g. 2x the target.
A lot of lines have changed, but this PR boils down to:
algorithm
portion to return a struct rather than 2 values, to make this additional data easier to pass aroundcollection
Visually, for a hopefully more intuitive grasp of the issue, this is how weights/RPS currently get distributed among a number of hosts:
Low weights get a low ratio of the allowed RPS, so they are also only allowed to grow a little, while high-weight hosts are allowed to grow a lot. If we assume traffic arrives at frontends roughly randomly, this is fine (if they grow, they will all grow by the same %), but in practice we see a lot of "constant high use" to some hosts, and "occasional low use" to others, e.g. due to different services performing different call patterns and being routed differently.
Mathematically this makes sense... but it's relatively easy for something that only sends a couple requests every 10 seconds or so to eventually be allocated ~1 RPS because its weight is legitimately that low. But then nearly all of those low-RPS requests are rejected despite there (possibly) being more than enough free-RPS in the cluster.
This PR essentially changes the behavior to look like this:
If there is unused-RPS in the cluster, low-weight hosts are allowed to increase their RPS up to their fallback-behavior's limit. This is limited by the amount of unused-RPS, so as usage increases it will move towards the un-boosted limit (i.e. weight only), and the overall RPS should converge at the target RPS.
Under high usage, it looks more like this (only a small boost added to low weight hosts):
How the unused RPS is best-distributed is... fuzzy. Here I've taken the easy and low-data-needed option: just use it all, independently, in each low-weight host. This is more likely to prevent unfairly-blocked requests, but allows more requests in a multi-host burst (until re-balanced).
Another option considered (and originally planned) was to distribute it between all contributing low-weight hosts, but this needs to either exchange more data or put target-RPS knowledge into both the limiter and the aggregator side of this system. I think the latter will eventually happen and it's basically fine if/when it does, but it's a larger change than I wanted to do right now, and the former just didn't feel worth the extra logical complexity. Simpler is better with this kind of system, because it's more understandable and predictable.
The original idea has a simulated implementation here, with a lot of tests to run a simulated traffic pattern and log how weights / individual hosts / etc behave: #6196
but that will likely not be merged - that PR served its purpose of showing whether or not my intuition about the system's behavior as a whole was correct, and the large chunk of code to do that is probably not worth maintaining.