Speed up pruning of ratelimiter #19129
Conversation
anoadragon453
left a comment
A few questions below as I try to grok this code.
synapse/api/ratelimiting.py (outdated diff)

```python
self.actions[key] = (action_count, time_start, rate_hz)

prune_time_s += 0.1  # Add a buffer to ensure we don't try and prune too early
```
What does this buffer do? I worry it might cause difficult-to-debug bugs in unit tests.
Should we only add this buffer if prune_time_s is within a certain amount of time from now? i.e. prune_time_s += min(0.1, prune_time_s - time_now_s)?
It is mostly me being paranoid about timings. When the prune function gets called it checks whether the entry can be pruned by doing the comparison again; if it's not ready to be pruned, the function will not schedule another prune (as one should already have been scheduled when we updated the entry). So the only thing that really matters is that the prune function gets called after the entry expires, and we ensure that by scheduling the prune 0.1s after we think the entry should expire.
Will expand comment
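For readers following along, here is a rough sketch of the scheduling being described (a hypothetical helper, not the PR's actual code; the expiry formula is an assumption based on the `(action_count, time_start, rate_hz)` tuple shown above):

```python
def _schedule_prune(self, key, action_count, time_start, rate_hz, time_now_s):
    # Sketch only: estimate when the entry will have fully expired.
    prune_time_s = time_start + action_count / rate_hz
    prune_time_s += 0.1  # Add a buffer to ensure we don't try and prune too early

    # Register the key with the wheel timer. The prune callback re-checks the
    # expiry condition before deleting anything, so firing late is harmless.
    self._timer.insert(int(time_now_s * 1000), key, int(prune_time_s * 1000))
```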
```python
for key in to_prune:
    value = self.actions.get(key)
    if value is None:
```
When would this happen (I don't think we delete keys from actions anywhere other than in this function)? Is this just a sanity check?
For a given entry that has been updated multiple times we'll schedule multiple calls to prune, and I think, given the nature of the updates, only the last prune will actually delete the entry. However, it really isn't obvious that there aren't edge cases where an earlier prune call could remove the entry even though later prune calls are still scheduled.
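To make that concrete, here is a sketch of the prune callback as I understand it from the snippets in this PR (not a verbatim copy of the implementation):

```python
def _prune_message_counts(self) -> None:
    time_now_s = self.clock.time()

    # A key may have been scheduled several times (once per update), so by the
    # time an earlier scheduled prune runs, the entry may already be gone.
    for key in self._timer.fetch(int(time_now_s * 1000)):
        value = self.actions.get(key)
        if value is None:
            continue  # Already removed by an earlier scheduled prune.

        action_count, time_start, rate_hz = value

        # Re-check the expiry condition and only delete genuinely stale
        # entries. If the entry was updated since this prune was scheduled,
        # a later scheduled prune will handle it.
        if time_now_s > time_start + action_count / rate_hz:
            del self.actions[key]
```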
```python
    else:
        del self.actions[key]

del self.actions[key]
```
nit: we use del self.actions[key] here and self.actions.pop(key, None) above.
Well, we know the key is in the map as we just looked the key up in the map.
Oh yes, they do different things, don't they. Carry on!
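For anyone skimming the thread, the distinction in plain Python:

```python
d = {"a": 1}

d.pop("missing", None)  # Tolerates an absent key and returns the default (None).
del d["a"]              # Fine here because the key is known to be present.
# del d["missing"]      # Would raise KeyError, unlike pop(key, None).
```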
```python
self.clock.looping_call(self._prune_message_counts, 60 * 1000)
# Records when actions should potentially be pruned.
self._timer: WheelTimer[Hashable] = WheelTimer()
```
Is the default accuracy of 5s per bucket enough? Will clients be told to "retry in 3s" and then still get an error if the current bucket has yet to expire?
The timer and pruning have no observable effect; it's just a cleanup of stale entries to keep the actions dict from expanding indefinitely.
Ahh, I see. I didn't realise that this was all separate from the actual rate-limiting functionality. This makes the timing a lot less critical, and was the missing bit in my understanding.
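To spell that out, here is a hypothetical sketch (not Synapse's actual check) of why prune timing is unobservable: the rate-limit decision recomputes the number of outstanding actions from the stored tuple, so a stale entry that hasn't been pruned yet behaves just like a missing one.

```python
from typing import Dict, Hashable, Tuple

def would_allow(
    actions: Dict[Hashable, Tuple[float, float, float]],
    key: Hashable,
    time_now_s: float,
    rate_hz: float,
    burst_count: int,
) -> bool:
    # Stored tuple is (action_count, time_start, rate_hz); a missing key is
    # treated as an empty bucket.
    action_count, time_start, _ = actions.get(key, (0.0, time_now_s, rate_hz))

    # Actions "drain" out of the bucket over time, so an entry that has fully
    # drained permits exactly as many new actions as no entry at all, which is
    # why leaving it unpruned a little longer changes nothing.
    outstanding = max(0.0, action_count - (time_now_s - time_start) * rate_hz)
    return outstanding + 1 <= burst_count
```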
anoadragon453
left a comment
LGTM
I noticed this in some profiling. Basically, we prune the ratelimiters by copying and iterating over every entry every 60 seconds. Instead, let's use a wheel timer to track when we should potentially prune a given key; then we (a) check fewer keys and (b) can run more frequently. Hopefully this means we won't have a large pause every time we prune a ratelimiter with lots of keys.
Also fixes a bug where we didn't prune entries that were added via record_action and never subsequently updated. This affected the media and joins-per-room ratelimiters.
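For illustration, here is a toy stand-in for the wheel timer (not Synapse's actual WheelTimer, whose insert/fetch API this only approximates): bucketing keys by deadline means each prune run only fetches the keys whose bucket has expired, instead of copying and scanning the whole actions dict.

```python
from typing import Dict, Hashable, List

class ToyWheelTimer:
    """Toy stand-in for a wheel timer: objects are bucketed by their deadline,
    so fetch() only returns (and forgets) entries whose bucket has expired."""

    def __init__(self, bucket_size_ms: int = 5000) -> None:
        self._bucket_size_ms = bucket_size_ms
        self._buckets: Dict[int, List[Hashable]] = {}

    def insert(self, now_ms: int, obj: Hashable, then_ms: int) -> None:
        # Never schedule in the past; clamp to the current bucket at minimum.
        bucket = max(then_ms, now_ms) // self._bucket_size_ms
        self._buckets.setdefault(bucket, []).append(obj)

    def fetch(self, now_ms: int) -> List[Hashable]:
        current = now_ms // self._bucket_size_ms
        due: List[Hashable] = []
        for bucket in sorted(b for b in self._buckets if b < current):
            due.extend(self._buckets.pop(bucket))
        return due


# Only keys whose scheduled prune time has passed are returned, so a frequent
# prune loop stays cheap even when the ratelimiter holds many keys.
timer = ToyWheelTimer()
timer.insert(now_ms=0, obj="user_a", then_ms=4_000)
timer.insert(now_ms=0, obj="user_b", then_ms=60_000)
assert timer.fetch(now_ms=10_000) == ["user_a"]
assert timer.fetch(now_ms=10_000) == []
```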