This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Track number of hosts affected by the rate limiter #13541

Merged
1 change: 1 addition & 0 deletions changelog.d/13541.misc
@@ -0,0 +1 @@
Add metrics to track how the rate limiter is affecting requests (sleep/reject).
Contributor Author

Same changelog as #13534 so they merge together.

43 changes: 39 additions & 4 deletions synapse/util/ratelimitutils.py
@@ -30,7 +30,7 @@
run_in_background,
)
from synapse.logging.opentracing import start_active_span
-from synapse.metrics import Histogram
+from synapse.metrics import Histogram, LaterGauge
from synapse.util import Clock

if typing.TYPE_CHECKING:
@@ -74,6 +74,27 @@ def new_limiter() -> "_PerHostRatelimiter":
str, "_PerHostRatelimiter"
] = collections.defaultdict(new_limiter)

# We track the number of affected hosts per time-period so we can
# differentiate one really noisy homeserver from a general
# ratelimit tuning problem across the federation.
LaterGauge(
"synapse_rate_limit_sleep_affected_hosts",
"Number of hosts that had requests put to sleep",
[],
lambda: sum(
ratelimiter.should_sleep() for ratelimiter in self.ratelimiters.values()
),
)
LaterGauge(
"synapse_rate_limit_reject_affected_hosts",
"Number of hosts that had requests rejected",
[],
lambda: sum(
ratelimiter.should_reject()
for ratelimiter in self.ratelimiters.values()
),
Comment on lines +92 to +95
Contributor Author

As a note, when I log should_reject() in Complement, I see should_reject queue_size=0 self.reject_limit=99999 (the high reject_limit is from docker/complement/conf/workers-shared-extra.yaml.j2#L36) for all of the calls. Not sure I am exercising this enough in parallel to see it though.
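
Plugging the values from that log line into the new predicate is purely illustrative arithmetic (the variable names below just restate the log output above):

```python
# Illustrative only: the values reported in the Complement log line above.
queue_size = 0        # len(ready_request_queue) + len(sleeping_requests)
reject_limit = 99999  # high limit set in workers-shared-extra.yaml.j2
print(queue_size > reject_limit)  # False, so should_reject() stays False
```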

Contributor Author

This is probably caused by #13641 which is being addressed in #13649

Contributor @DMRobertson (Aug 30, 2022)

These numbers look consistent with data I saw when investigating in the manhole (sum(x.should_reject() for x in hs.get_federation_ratelimiter().ratelimiters.values())). It might be that we need to wait some time for queues to grow before we start rejecting requests. I'd like to leave confirming/investigating this on your plate @MadLittleMods if that's ok?
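
For reference, the same manhole check spelled out for both of the new predicates (`hs` is the HomeServer object the manhole exposes; the `should_sleep` variant is the obvious analogue, not quoted from the thread):

```python
# Count hosts whose requests would currently be rejected or put to sleep.
# True counts as 1, so summing the booleans gives the number of affected hosts.
sum(x.should_reject() for x in hs.get_federation_ratelimiter().ratelimiters.values())
sum(x.should_sleep() for x in hs.get_federation_ratelimiter().ratelimiters.values())
```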

Contributor Author

@DMRobertson Looks like #13649 hasn't fixed it, assuming it has already been rolled out to matrix.org 😩 (still all 0)

Here is the rate limiter graph section I have:
https://grafana.matrix.org/d/dYoRgTgVz/messages-timing?orgId=1&from=1661854473240&to=1661876073240&viewPanel=220

I'll create an actual issue to track this more long-term now ⏩

Contributor Author

#13670

)

def ratelimit(self, host: str) -> "_GeneratorContextManager[defer.Deferred[None]]":
"""Used to ratelimit an incoming request from a given host

@@ -139,6 +160,21 @@ def ratelimit(self, host: str) -> "Iterator[defer.Deferred[None]]":
finally:
self._on_exit(request_id)

def should_reject(self) -> bool:
"""
Whether to reject the request if we already have too many queued up
(either sleeping or in the ready queue).
"""
queue_size = len(self.ready_request_queue) + len(self.sleeping_requests)
return queue_size > self.reject_limit

def should_sleep(self) -> bool:
"""
Whether to sleep the request if we already have too many requests coming
through within the window.
"""
return len(self.request_times) > self.sleep_limit

def _on_enter(self, request_id: object) -> "defer.Deferred[None]":
time_now = self.clock.time_msec()

@@ -149,8 +185,7 @@ def _on_enter(self, request_id: object) -> "defer.Deferred[None]":

# reject the request if we already have too many queued up (either
# sleeping or in the ready queue).
-queue_size = len(self.ready_request_queue) + len(self.sleeping_requests)
-if queue_size > self.reject_limit:
+if self.should_reject():
logger.debug("Ratelimiter(%s): rejecting request", self.host)
rate_limit_reject_counter.inc()
raise LimitExceededError(
@@ -180,7 +215,7 @@ def queue_request() -> "defer.Deferred[None]":
len(self.request_times),
)

-if len(self.request_times) > self.sleep_limit:
+if self.should_sleep():
logger.debug(
"Ratelimiter(%s) [%s]: sleeping request for %f sec",
self.host,
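
To see the new pieces together outside of the diff, here is a rough, self-contained sketch of the pattern this PR adds: each remote host gets a limiter exposing boolean `should_sleep()` / `should_reject()` predicates, and a gauge callback sums those booleans across hosts at metrics-collection time. It uses `prometheus_client`'s `Gauge.set_function` as a stand-in for Synapse's `LaterGauge`, and `ToyPerHostLimiter` is a hypothetical simplification of `_PerHostRatelimiter`, not the real class.

```python
import collections

from prometheus_client import Gauge


class ToyPerHostLimiter:
    """Hypothetical, simplified stand-in for _PerHostRatelimiter."""

    def __init__(self, sleep_limit: int = 10, reject_limit: int = 50) -> None:
        self.request_times: list = []        # timestamps seen in the window
        self.ready_request_queue: dict = {}  # queued-but-ready requests
        self.sleeping_requests: set = set()  # requests currently delayed
        self.sleep_limit = sleep_limit
        self.reject_limit = reject_limit

    def should_sleep(self) -> bool:
        # Too many requests within the window -> new ones get delayed.
        return len(self.request_times) > self.sleep_limit

    def should_reject(self) -> bool:
        # Too many requests already queued (sleeping or ready) -> reject.
        queue_size = len(self.ready_request_queue) + len(self.sleeping_requests)
        return queue_size > self.reject_limit


ratelimiters = collections.defaultdict(ToyPerHostLimiter)

sleep_affected_hosts = Gauge(
    "toy_rate_limit_sleep_affected_hosts",
    "Number of hosts that currently have requests put to sleep",
)
reject_affected_hosts = Gauge(
    "toy_rate_limit_reject_affected_hosts",
    "Number of hosts that currently have requests rejected",
)

# set_function() registers a callback evaluated at scrape time, much like the
# lambda passed to LaterGauge above; True counts as 1 when summed.
sleep_affected_hosts.set_function(
    lambda: sum(rl.should_sleep() for rl in ratelimiters.values())
)
reject_affected_hosts.set_function(
    lambda: sum(rl.should_reject() for rl in ratelimiters.values())
)

if __name__ == "__main__":
    # One noisy host pushes its gauge contribution from 0 to 1.
    noisy = ratelimiters["noisy.example.com"]
    noisy.request_times.extend(float(i) for i in range(20))  # 20 > sleep_limit
    print(sum(rl.should_sleep() for rl in ratelimiters.values()))  # -> 1
```

Because the callback is evaluated lazily at each scrape, the gauges report how many hosts satisfy the predicate right now, complementing the cumulative sleep/reject counters referenced in the changelog entry.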