Global ratelimiter fix: prevent low-weight hosts from becoming unusable #6196
Conversation
// find out how many would've been accepted and track it (uses previous-round target-RPS because limiters always lag by an update cycle)
track(&results.accepts[0], &results.rejects[0], r[0], results.targetRPS[0])
track(&results.accepts[1], &results.rejects[1], r[1], results.targetRPS[1])
track(&results.accepts[2], &results.rejects[2], r[2], results.targetRPS[2])
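For context, a minimal sketch of what a fake-counting `track` helper along these lines might look like (the actual helper isn't visible in this hunk, so the body below is an assumption rather than the PR's code):

```go
package sketch

import "math"

// track is a hypothetical fake-counting helper: accept up to the whole-token
// part of targetRPS each round and reject the rest. Truncating to whole tokens
// with no carryover is why e.g. a 0.99 RPS target ends up rejecting everything.
func track(accepts, rejects *float64, requested, targetRPS float64) {
	allowed := math.Max(math.Floor(targetRPS), 0)
	accepted := math.Min(requested, allowed)
	*accepts += accepted
	*rejects += requested - accepted
}
```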
I'm considering changing these to use time-mocked real-ratelimiters and spreading requests evenly across each iteration... but I haven't yet tried, and I suspect it'll look a bit more complicated.
It would, however, give this "free" accept/reject tracking (just wrap them in a counted limiter) plus realistic token-bucket-size usage and RPS carryover; right now the fake counting unfairly penalizes e.g. 0.99 RPS as rejecting all requests, because it's never over 1.
Any votes? I'm kinda curious how it'll affect the numbers...
proof of concept anyway: #6197
changed results are commented out and left in place.
tl;dr implementing it in a less-bad way will take a bit more work, but:
- most of the results are the same
- most things perform slightly better
- proposed new algorithm does noticeably better, more ideal results
time-mocked should be a better option since, on CI, time is very relative, and it can take more time to settle RPS, I think.
both use fake-time; that PoC is to use a clock.Ratelimiter to more-accurately keep track of allows/rejects.
it's mostly for fractional values, since my fake-counting is essentially only correct on integers.
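A rough sketch of the "wrap them in a counted limiter" idea, using the standard `golang.org/x/time/rate` limiter with explicit timestamps purely for illustration (the PoC in #6197 wraps the internal `clock.Ratelimiter` instead):

```go
package sketch

import (
	"time"

	"golang.org/x/time/rate"
)

// countedLimiter wraps a real token-bucket limiter and counts accepts/rejects,
// so a simulation gets fractional-RPS handling and burst carryover for free.
type countedLimiter struct {
	limiter *rate.Limiter
	accepts int
	rejects int
}

func newCountedLimiter(rps float64, burst int) *countedLimiter {
	return &countedLimiter{limiter: rate.NewLimiter(rate.Limit(rps), burst)}
}

// allowAt checks one request at a simulated timestamp and records the outcome.
func (c *countedLimiter) allowAt(now time.Time) bool {
	if c.limiter.AllowN(now, 1) {
		c.accepts++
		return true
	}
	c.rejects++
	return false
}
```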
t.Run("v1 nonzero tweak - running average ignoring zeros", func(t *testing.T) { | ||
// an idea to improve on the original: hosts never see their own degraded values, but others do. | ||
// this allows others to act as if the host is idle (when it has been recently), but allows a periodic spike to go through. | ||
// | ||
// this has not been deployed, it was just an idea that IMO did not pan out. |
this is the "failed idea I don't intend to build".
if we don't think this is worth keeping here because of that, I'm entirely game to just delete it. I wrote it mostly for myself anyway.
if it gives anyone a better idea, I'd be curious to hear the details and try it, but I think it needs to be smarter about more than just zeros, since "imbalanced low usage" is part of the problem, not just zero/non-zero toggling.
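For reference, the zero-ignoring piece of this idea boils down to a weighted running average that simply skips zero samples, roughly like this sketch (names and details are assumed, not the test's actual code):

```go
package sketch

// updateIgnoringZeros sketches the zero-skipping half of the "running average
// ignoring zeros" tweak: a plain exponentially-weighted running average, except
// that a zero sample leaves the remembered value untouched, so a recently-idle
// host keeps its older (higher) value and a periodic spike can still get through.
// (The other half of the idea, letting other hosts still see the degraded value,
// is not shown here.)
func updateIgnoringZeros(previous, observed, newDataWeight float64) float64 {
	if observed == 0 {
		return previous // skip zeros instead of averaging them in
	}
	return previous*(1-newDataWeight) + observed*newDataWeight
}
```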
t.Run("zeros", func(t *testing.T) { | ||
// zeros are the worst-case scenario here, greatly favoring the non-zero values. | ||
check(t, data{ | ||
targetRPS: [3]float64{0.63, 3.57, 10.7}, // intermittent allows far too few, constant is allowed over 2x what it uses | ||
callRPS: [3]float64{1.25, 2.50, 5.00}, // all at or below per-host limit at all times, not just in aggregate | ||
accepts: [3]float64{0.00, 1.50, 5.00}, // this is the main problems with this algorithm: bad behavior when imbalanced but low usage. | ||
rejects: [3]float64{1.25, 1.00, 0.00}, | ||
cumulativeRPS: targetRPS, // but this is useful and simple: it always tries to enforce the "real" target, never over or under. | ||
}, run(t, roundSchedule(0, 5))) | ||
}) | ||
t.Run("low non-zeros", func(t *testing.T) { | ||
// even 1 instead of 0 does quite a lot better, though still not great | ||
check(t, data{ | ||
targetRPS: [3]float64{2.21, 4.07, 8.72}, | ||
callRPS: [3]float64{2.00, 3.00, 5.00}, | ||
accepts: [3]float64{1.25, 2.25, 5.00}, // constant use behaves well, others are not great | ||
rejects: [3]float64{0.75, 0.75, 0.00}, | ||
cumulativeRPS: targetRPS, | ||
}, run(t, roundSchedule(1, 5))) | ||
}) |
this is the problem
I kind of understood the problem from the description, but I cannot understand it from the test description.
t.Run("zeros", func(t *testing.T) { | ||
check(t, data{ | ||
targetRPS: [3]float64{3.38, 5.00, 10.7}, | ||
callRPS: [3]float64{1.25, 2.50, 5.00}, | ||
accepts: [3]float64{0.75, 2.50, 5.00}, // very close, a main motivating reason for this approach | ||
rejects: [3]float64{0.50, 0.00, 0.00}, | ||
cumulativeRPS: targetRPS + 4.1, // when very low, allows moderate bit above | ||
}, run(t, roundSchedule(0, 5))) | ||
}) | ||
t.Run("low nonzeros", func(t *testing.T) { | ||
|
||
check(t, data{ | ||
targetRPS: [3]float64{4.34, 5.00, 8.72}, | ||
callRPS: [3]float64{2.00, 3.00, 5.00}, | ||
accepts: [3]float64{1.75, 3.00, 5.00}, // almost perfect, a main motivating reason for this approach | ||
rejects: [3]float64{0.25, 0.00, 0.00}, | ||
cumulativeRPS: targetRPS + 3.1, // less overage but still moderate | ||
}, run(t, roundSchedule(1, 5))) | ||
}) |
and this is my proposed solution
Could you elaborate on how a test scenario is a solution?
fewer rejects, more accepts, without (significantly) exceeding the intended rps. basically. it's a simulation of ~4 processes, per test.
there's not much in here that's worth asserting, so it's just checking "last round summary" values. the logs contain details of how it gets to this state and what the overall behavior is, which is WAY too much to assert against.
if you run any of these and look at the output, you can see the number of calls being performed per "second", how that updates weights in the system, and what the resulting RPS it allows for the next second.
What I mean is that it is not a solution.
As far as I understand, you don't change any code logic; you just clarify the edge case scenario. Or am I missing something?
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files: see 8 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
sorry, all the context fell out of my brain again. Define 'hosts' here: are you referring to frontend, history aggregator, or customer worker hosts?
This would be for frontend hosts / the ones doing the limiting.

We've got a few cases that look like general background activity that's relatively steady, both in frequency and in which frontend hosts it contacts, plus a low-frequency / random ping from other machines that hits the same limit. E.g. workflows doing their thing in one service, plus an occasional third-party service that sends a small burst.

That third party has a completely different routing layout, so when they hit frontends that don't overlap they get small weights that degrade quickly to even tinier weights... until most of their requests get rejected for no apparent reason, because they don't recover enough tokens with their evenly-shared 0.1 RPS to do anything useful.

It's "fair" because those hosts really do receive a vanishingly small amount of load comparatively (plus they're idle for a while), but it's unfair given our goal of "you can send X RPS". This is an attempt to find a middle ground that allows those small fair bursts without allowing giant surprise bursts.

Hopefully my run-on late-night sentences make sense in the morning :]
Since it is experimental, I think it is fine to try out different scenarios, but I'd like to have the end strategy and a list of problems in writing somewhere.
All are bound to a single key across 3 hosts with a simple 15rps target (5 per host when balanced) (3 hosts chosen because there are many easy mistakes with 2).
*/
const targetRPS = 15
max := func(a, b float64) float64 {
nit: math.Max
	}
	return b
}
min := func(a, b float64) float64 {
nit: math.Min
	}
	return b
}
_, _ = max, min
These functions could be outside of the test body, so this is not required.
type data struct {
	// target for the immediate next period of time, as a sample of how target RPSes settle over time
	targetRPS [3]float64 // rps that should be allowed by the limiter hosts at the end of the last round
Could you explain more about the [3]float64 values: what each one means and how to read them?
I suggest using a struct with named fields to make it easier to understand scenarios.
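Something along these lines, for example (purely a sketch of the suggestion, not code from this PR):

```go
package sketch

// perHost groups the values that are currently spread across parallel
// [3]float64 arrays, one element per simulated limiting host.
type perHost struct {
	TargetRPS float64 // RPS the limiter host would allow after the last round
	CallRPS   float64 // RPS the simulated caller actually sent
	Accepts   float64 // accepted RPS in the last round
	Rejects   float64 // rejected RPS in the last round
}

// data is the named-field equivalent of the existing test struct.
type data struct {
	Hosts         [3]perHost
	CumulativeRPS float64 // total RPS allowed across all three hosts
}
```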
gonna pause on this one - it's a lot of work to maintain and tweak, and I've already gotten what I need out of it: confidence that a planned change is reasonable and behaves reasonably well.
Part 1 of 2 to address some relatively-expected "infrequent usage behaves sub-optimally" issues. In this one, I'm tackling an issue where there are some high-usage limiting-hosts, and some occasional-and-low-usage hosts, and e.g. less than half of the target RPS is being used at any time. In ^ this scenario it's possible for the low-usage hosts to be given a weight so low that they block _most_ of the requests they receive, especially if those requests come in small bursts (which is quite common) rather than evenly distributed. This is partly because their weight is low enough to only allow a couple RPS, and partly because our limiters currently set their burst size == limit, so they also do not allow bursts that would otherwise be within-budget.

This _cannot_ be addressed perfectly with an async system like this (we can't perfectly predict caller behavior), but there are some easy partial improvements worth building:

1. (this PR) Allow low-weight hosts to over-commit and accept more than their fair share of RPS, if there is free RPS in the cluster.
   - For _consistently_ low-RPS hosts, this should be fair and a moderate improvement, and it performs no worse than a non-global algorithm if the limit is not usually fully used.
2. Raise the burst size, so intermittent callers can accumulate a small burst budget.
   - Currently this matches the RPS (min 1), which means e.g. a totally innocuous 2-request-event every 10 seconds will likely reject 50%. Obviously that's not great.
   - "Fair" values here are fuzzier, and will be figured out in a later PR.

In this PR, I "boost" RPSes that are below the fallback values, in a way that converges towards the target RPS as usage increases. This allows some over-committing, but we already allow that, and the fallback values are assumed to be "safe enough": they'd be used if the caller just waited a bit longer between calls -> the global limiter was garbage-collected. If global usage increases, the boost-amount will shrink, and the over-commit will shrink as well. So while this allows spiky requests through at a somewhat higher rate than desired, heavy usage should still converge near the target RPS rather than e.g. 2x the target.

---

A lot of lines have changed, but this PR boils down to:

- change the IDL to return more data (which was already tracked) as a new type
  - this will break the current system, causing the global limiters to use their fallback
  - ^ this is currently intentional, to serve as a test of that fallback, but should not generally be done. normally it should handle both types, or shift to a plugin system and just have two stacks in use, switched by type.
- change the `algorithm` portion to return a struct rather than 2 values, to make this additional data easier to pass around
- boost RPS when updating the ratelimiters, in the `collection`
- a lot of minor changes to handle the new type, without changing behavior

---

Visually, for a hopefully more intuitive grasp of the issue, this is how weights/RPS currently get distributed among a number of hosts:

![ratelimiter - orig](https://github.com/user-attachments/assets/0956a507-cd22-4874-843e-7d987d1a32b9)

Low weights get a low ratio of the allowed RPS, so they are also only allowed to _grow_ a little, while high-weight hosts are allowed to grow a lot. If we assume traffic arrives at frontends roughly randomly, this is fine (if they grow, they will all grow by the same %), but in practice we see a lot of "constant high use" to some hosts, and "occasional low use" to others, e.g. due to different services performing different call patterns and being routed differently. Mathematically this makes sense... but it's relatively easy for something that only sends a couple requests every 10 seconds or so to eventually be allocated ~1 RPS because its weight is legitimately that low. But then nearly all of those low-RPS requests are rejected despite there (possibly) being more than enough free-RPS in the cluster.

This PR essentially changes the behavior to look like this:

![ratelimiter - changed](https://github.com/user-attachments/assets/0193bfdf-687c-4b6e-805b-78d411c65a6f)

If there is unused-RPS in the cluster, low-weight hosts are allowed to increase their RPS up to their fallback-behavior's limit. This is limited by the amount of unused-RPS, so as usage increases it will move towards the un-boosted limit (i.e. weight only), and the overall RPS should converge at the target RPS. Under high usage, it looks more like this (only a small boost added to low weight hosts):

![ratelimiter - changed high usage](https://github.com/user-attachments/assets/d005541f-4f82-4344-924f-a61159805481)

_How_ the unused RPS is best-distributed is... fuzzy. Here I've taken the easy and low-data-needed option: just use it all, independently, in each low-weight host. This is more likely to prevent unfairly-blocked requests, but allows more requests in a multi-host burst (until re-balanced). Another option considered (and originally planned) was to distribute it between all *contributing* low-weight hosts, but this needs to either exchange more data or put target-RPS knowledge into both the limiter and the aggregator side of this system. I think the latter will eventually happen and it's basically fine if/when it does, but it's a larger change than I wanted to do right now, and the former just didn't feel worth the extra logical complexity. Simpler is better with this kind of system, because it's more understandable and predictable.

The original idea has a simulated implementation here, with a lot of tests to run a simulated traffic pattern and log how weights / individual hosts / etc behave: #6196 but that will likely not be merged - that PR served its purpose of showing whether or not my intuition about the system's behavior as a whole was correct, and the large chunk of code to do that is probably not worth maintaining.
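As a sketch of the boost described above (the function name and exact formula here are illustrative, not the actual `collection` change):

```go
package sketch

import "math"

// boostRPS sketches the idea in this PR: if a host's weight-derived RPS is
// below its fallback limit and the cluster has unused RPS, let the host spend
// that unused budget, capped at the fallback limit. As cluster usage grows,
// unusedRPS shrinks toward zero and the result converges back to the
// weight-only value, so the cluster total still converges near the target.
func boostRPS(weightedRPS, fallbackRPS, unusedRPS float64) float64 {
	if weightedRPS >= fallbackRPS {
		return weightedRPS // high-weight hosts are left unchanged
	}
	boost := math.Min(fallbackRPS-weightedRPS, math.Max(unusedRPS, 0))
	return weightedRPS + boost
}
```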
From a problem suspected all along but rather conclusively shown in production:
Intermittent or mildly spiking requests are assigned a low weight, leading to rejected requests even when well below intended limits.
As a pathological example, given a 1,000 RPS limit:
- most hosts receive steady background traffic, well within the limit
- one host only sees an occasional small burst, e.g. from a cronjob every ~10 seconds
This leads to the constant-receiving hosts having something like 99.5+% of the weight of the cluster,
because the cronjob host gets <1 RPS inferred by the time it tries to send more requests.
Since its allowance is so low, it rejects like 9/10ths of the requests, despite there being tons of room left in the quota.
We don't have a lot of these, but we do have a few, and they look quite wrong at a glance and are a bit hard to understand.
That seems worth trying to address before widespread use, even if it makes the whole system a bit more complicated.
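To make the arithmetic concrete, with made-up traffic numbers rather than measured ones: weight is roughly "this host's observed RPS divided by the cluster total", and the allowance is that weight times the target.

```go
package main

import "fmt"

// A rough illustration of the pathological case, using invented numbers:
// steady background traffic hits a few frontends constantly, while a cronjob's
// occasional small bursts land on a host that sees almost nothing else.
func main() {
	const targetRPS = 1000.0
	steadyObserved := 500.0 // combined observed RPS on the constantly-hit hosts
	cronObserved := 0.3     // averaged-out RPS observed for the cronjob's requests

	total := steadyObserved + cronObserved
	cronWeight := cronObserved / total
	cronAllowed := cronWeight * targetRPS

	// Prints a weight around 0.06% and an allowance well under 1 RPS, so most
	// of the cronjob's burst is rejected even though only about half of the
	// 1,000 RPS budget is in use.
	fmt.Printf("cron weight=%.4f%% allowed=%.2f RPS\n", cronWeight*100, cronAllowed)
}
```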
Included in this PR are tests of 3 algorithms:
- the current algorithm, to demonstrate the problem
- a "nonzero tweak" variant (a running average that ignores zeros), an idea that did not pan out
- the proposed new algorithm, which gives low-weight hosts some of the unused RPS and behaves noticeably better
I have not yet implemented it because I figured we'd talk about the results first and decide if more time should be spent
trying to find other approaches.
If we do move forward with it, I see a few possible implementations:
behaves reasonably and uses the fallback logic like it should.
we think is better in ~all scenarios.
I do not think it's going to be difficult or risky, just "the most work".
I'm inclined to do 2 for "don't leave the old impl, and test the failure paths" purposes, but honestly I'm up for any of them.
None feel like a bad choice since I've already found all the code-paths where things need to go, when building this out the first time,
and I've been thinking about what I'd need to do for the remaining bits since near the beginning.