
Conversation

@erikjohnston (Member) commented Nov 3, 2025

I noticed this in some profiling. Currently we prune the ratelimiters by copying and iterating over every entry every 60 seconds. Instead, let's use a wheel timer to track when we should potentially prune a given key, so that we a) check fewer keys, and b) can run the prune more frequently. Hopefully this means we don't get a large pause every time we prune a ratelimiter with lots of keys.

Also fixes a bug where we didn't prune entries that were added via record_action and never subsequently updated. This affected the media and joins-per-room ratelimiters.
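To make the shape of the change concrete, here is a minimal, self-contained sketch of the approach (illustrative only, not the PR's actual code: the `BucketedTimer` class is a toy stand-in for Synapse's `WheelTimer`, and the `record_action` signature, expiry rule and 0.1s buffer handling are simplified assumptions):

```python
import time
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


class BucketedTimer:
    """Toy stand-in for a wheel timer: keys fall into coarse time buckets,
    and fetch() returns every key whose bucket has already passed."""

    def __init__(self, bucket_size_s: float = 5.0) -> None:
        self.bucket_size_s = bucket_size_s
        self.buckets: Dict[int, List[Hashable]] = defaultdict(list)

    def insert(self, key: Hashable, then_s: float) -> None:
        self.buckets[int(then_s // self.bucket_size_s)].append(key)

    def fetch(self, now_s: float) -> List[Hashable]:
        current = int(now_s // self.bucket_size_s)
        due = [b for b in list(self.buckets) if b < current]
        keys: List[Hashable] = []
        for b in due:
            keys.extend(self.buckets.pop(b))
        return keys


class SketchRatelimiter:
    """Schedules a per-key prune time instead of scanning every entry."""

    def __init__(self, rate_hz: float, burst_count: int) -> None:
        self.rate_hz = rate_hz
        self.burst_count = burst_count
        # key -> (action_count, time_start, rate_hz)
        self.actions: Dict[Hashable, Tuple[float, float, float]] = {}
        self._timer = BucketedTimer()

    def record_action(self, key: Hashable, n_actions: int = 1) -> None:
        now = time.monotonic()
        action_count, time_start, rate_hz = self.actions.get(
            key, (0.0, now, self.rate_hz)
        )
        action_count += n_actions
        self.actions[key] = (action_count, time_start, rate_hz)

        # Schedule a prune for when the accumulated count will have decayed
        # to zero, plus a small buffer so the check runs strictly afterwards.
        prune_time_s = time_start + action_count / rate_hz + 0.1
        self._timer.insert(key, prune_time_s)

    def _prune(self) -> None:
        """Runs frequently, but only inspects keys whose timers have fired."""
        now = time.monotonic()
        for key in self._timer.fetch(now):
            value = self.actions.get(key)
            if value is None:
                # An earlier scheduled prune already removed this key.
                continue
            action_count, time_start, rate_hz = value
            # Re-check before deleting: if the entry was updated after this
            # prune was scheduled, a later prune is already queued for it.
            if action_count - (now - time_start) * rate_hz <= 0:
                del self.actions[key]
```

The point is that `_prune` only ever touches keys whose timers have fired, rather than copying and scanning the whole `actions` dict every 60 seconds.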

@erikjohnston marked this pull request as ready for review November 3, 2025 19:20
@erikjohnston requested a review from a team as a code owner November 3, 2025 19:20
@anoadragon453 (Member) left a comment

A few questions below as I try to grok this code.


```python
self.actions[key] = (action_count, time_start, rate_hz)

prune_time_s += 0.1  # Add a buffer to ensure we don't try and prune too early
```
Member:

What does this buffer do? I worry it might cause difficult-to-debug bugs in unit tests.

Should we only add this buffer if prune_time_s is within a certain amount of time from now? i.e. prune_time_s += min(0.1, prune_time_s - time_now_s)?

Member Author:

It is mostly me being paranoid about timings. When the prune function gets called it checks whether the entry can be pruned by doing the comparison again; if it's not ready to be pruned, the function will not schedule another prune (one should already have been scheduled when we updated the entry). So the only thing that really matters is that the prune function gets called after the entry expires, and we ensure that by scheduling the prune 0.1s after we think the entry should expire.

Member Author:

Will expand comment
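Isolated as a pair of hypothetical helper functions (the names and the decay rule here are assumptions, mirroring the sketch above rather than the PR's exact code), the argument is:

```python
def scheduled_prune_time(action_count: float, time_start: float, rate_hz: float) -> float:
    """When we expect the entry to have fully decayed, plus a small buffer so
    the scheduled check runs strictly after expiry instead of racing it."""
    return time_start + action_count / rate_hz + 0.1


def should_prune(action_count: float, time_start: float, rate_hz: float, now_s: float) -> bool:
    """The comparison is redone when the prune actually fires. If the entry
    has not yet expired (e.g. it was updated since this prune was scheduled),
    it is left alone and no new prune is scheduled here: the update that
    pushed the expiry forward already queued its own prune."""
    return action_count - (now_s - time_start) * rate_hz <= 0
```

Because only "strictly after expiry" matters, the exact size of the buffer is not significant.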


```python
for key in to_prune:
    value = self.actions.get(key)
    if value is None:
```
Member:

When would this happen (I don't think we delete keys from actions anywhere other than in this function)? Is this just a sanity check?

Member Author:

For a given entry that has been updated multiple times we'll schedule multiple calls to prune; I think, given the nature of the updates, only the last prune will actually delete the entry. However, it really isn't obvious that there aren't edge cases where an earlier prune call removes the entry even though later prune calls are still scheduled.
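A hypothetical timeline (numbers invented) showing how several prunes end up queued for one key, and why the lookup has to tolerate the key already being gone:

```python
rate_hz = 0.1     # one action per 10 seconds
time_start = 0.0  # first action for key "k"

# record_action("k") at t=0 -> count 1 -> prune queued for ~t=10.1
# record_action("k") at t=5 -> count 2 -> prune queued for ~t=20.1
for action_count in (1, 2):
    print(time_start + action_count / rate_hz + 0.1)  # 10.1, then 20.1

# Both prunes stay queued. Normally only the later one finds the entry fully
# decayed and deletes it, but since it is hard to rule out an earlier prune
# removing the entry while later prunes are still pending, the prune loop
# uses self.actions.get(key) and skips keys that are already gone.
```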

```python
else:
    del self.actions[key]

del self.actions[key]
```
Member:

nit: we use `del self.actions[key]` here and `self.actions.pop(key, None)` above.

Member Author:

Well, we know the key is in the map as we just looked the key up in the map.

Member:

Oh yes, they do different things, don't they. Carry on!
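For reference, the difference in plain Python terms:

```python
d = {"key": 1}

d.pop("key", None)  # removes "key" if present; a no-op if it is absent
d.pop("key", None)  # safe to call again

d["key"] = 1
del d["key"]        # removes "key", but raises KeyError if it is absent
```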


```python
self.clock.looping_call(self._prune_message_counts, 60 * 1000)
# Records when actions should potentially be pruned.
self._timer: WheelTimer[Hashable] = WheelTimer()
```
Member:

Is the default accuracy of 5s per bucket enough? Will clients be told to "retry in 3s" and then still get an error if the current bucket has yet to expire?

Member Author:

The timer and pruning have no observable effect; it's just a cleanup of stale entries to keep the actions dict from expanding indefinitely.

Member:

Ahh, I see. I didn't realise that this was all separate from the actual rate-limiting functionality. This makes the timing a lot less critical, and was the missing bit in my understanding.
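A sketch of why the prune timing is invisible to clients, assuming (as in Synapse, roughly) that the rate-limit check decays the stored count at read time; `can_do_action` here is a simplified, synchronous stand-in for the real method:

```python
import time
from typing import Dict, Hashable, Tuple


def can_do_action(
    actions: Dict[Hashable, Tuple[float, float, float]],
    key: Hashable,
    rate_hz: float,
    burst_count: int,
) -> bool:
    """The decision depends only on the stored (action_count, time_start,
    rate_hz) tuple and the current time. A stale, unpruned entry whose count
    has fully decayed behaves exactly like a missing one, so whether the
    prune fires a few seconds early or late (the wheel timer's bucket
    granularity) affects memory usage, not the answer clients see."""
    now = time.monotonic()
    action_count, time_start, stored_rate_hz = actions.get(key, (0.0, now, rate_hz))
    performed = max(0.0, action_count - (now - time_start) * stored_rate_hz)
    return performed + 1 <= burst_count
```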

@anoadragon453 (Member) left a comment

LGTM


@erikjohnston merged commit 5408101 into develop Nov 4, 2025
41 of 43 checks passed
@erikjohnston deleted the erikj/faster_ratelimit branch November 4, 2025 12:44