federation catchup is slow and network-heavy #9492
Comments
What does the spec say? Shouldn't it try to send all events it can queue up (within reason)?
It looks like we fetch the oldest events that the remote has not yet received (see synapse/synapse/federation/sender/per_destination_queue.py, lines 461 to 465 in 0a00b7f, and synapse/synapse/storage/databases/main/transactions.py, lines 424 to 442 in 0a00b7f), rather than the most recent event per room?
destination_rooms stores the stream_ordering of the most recent event in each room that we should have sent to each destination, rather than the most recent event that we actually did send. So when we join it to events, we get the event ID of that single most recent event per room.
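To make the behaviour described above concrete, here is a minimal sketch using an in-memory SQLite database. The two-table schema is a simplified assumption based on this comment, not Synapse's real schema, and `catchup_event_ids` and `build_demo_db` are hypothetical helper names.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Build a toy database. The schema is a simplified assumption."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_id TEXT, room_id TEXT, stream_ordering INTEGER)"
    )
    conn.execute(
        "CREATE TABLE destination_rooms "
        "(destination TEXT, room_id TEXT, stream_ordering INTEGER)"
    )
    conn.executemany(
        "INSERT INTO events VALUES (?, ?, ?)",
        [
            ("$a1", "!a", 1),
            ("$a2", "!a", 2),
            ("$b1", "!b", 3),
            ("$a3", "!a", 4),
            ("$b2", "!b", 5),
        ],
    )
    # One row per (destination, room): only the latest stream_ordering that
    # should have been sent is kept, so earlier events leave no trace here.
    conn.executemany(
        "INSERT INTO destination_rooms VALUES (?, ?, ?)",
        [("remote.example", "!a", 4), ("remote.example", "!b", 5)],
    )
    return conn


def catchup_event_ids(conn, destination, last_successful_stream_ordering, limit=50):
    """Return at most one event ID per room, oldest room position first."""
    rows = conn.execute(
        """
        SELECT e.event_id
        FROM destination_rooms AS dr
        JOIN events AS e ON e.stream_ordering = dr.stream_ordering
        WHERE dr.destination = ? AND dr.stream_ordering > ?
        ORDER BY dr.stream_ordering
        LIMIT ?
        """,
        (destination, last_successful_stream_ordering, limit),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(catchup_event_ids(conn, "remote.example", 0))
```

In this toy data the remote missed five events, but the join can only ever surface two of them (one per room), because the intermediate events have no row in destination_rooms.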
Federation catch-up mode is very inefficient if the number of events that the remote server has missed is small, since handling gaps can be very expensive, cf. #9492. Instead of going into catch-up mode whenever we see an error, we do so only if we've backed off from trying the remote for more than an hour (the assumption being that in such a case it is more than a transient failure).
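The rule described above can be sketched as a single comparison. This is an illustration only, not the code from the change itself: `should_enter_catchup` and `ONE_HOUR_MS` are hypothetical names, and measuring the backoff in milliseconds is an assumption.

```python
# Hypothetical constant: the one-hour threshold mentioned in the description.
ONE_HOUR_MS = 60 * 60 * 1000


def should_enter_catchup(backoff_interval_ms: int) -> bool:
    """Enter catch-up mode only after backing off for more than an hour.

    Shorter backoffs are treated as transient failures, so the sender keeps
    its normal queue instead of collapsing to one event per room.
    """
    return backoff_interval_ms > ONE_HOUR_MS


print(should_enter_catchup(5 * 60 * 1000))       # five minutes of backoff
print(should_enter_catchup(2 * 60 * 60 * 1000))  # two hours of backoff
```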
I wonder if another thing at play here is the interaction of all the different servers in the room. Take a busy room like Matrix HQ, now after a few hours of down time a server will receive the latest event from all servers that sent an event during that time, for each one the receiving server will first do |
Maybe rooms need something similar to |
I wonder if Synapse should only send out events if they're a forward extremity, in the hope that the server that has sent subsequent events in the room will send the latest event? It's possible that the other server won't, but that feels like a rare case. Doing so should mean that the receiving server only receives the current extremities it has missed, rather than an additional smattering of events at different points in the gap.
Note that on March 18th a PR was merged that implements Erik's suggestion from above: #9640
Yes, I've updated the summary of this issue, since it's no longer a simple matter of optimisation.
If a remote HS takes a long time to process a transaction (perhaps because of #9490), then we will enter "catchup mode" for that homeserver. In this mode, we only ever send the most recent event per room, which means that the remote homeserver has to backfill any intermediate events, which is comparatively very slow (and may require calling /state_ids, see #7893 and #6597).
So, we have a homeserver which is struggling to keep up, and we are making its life even harder by only sending it a subset of the traffic in the room.
I'm not at all sure why we do this. Shouldn't we send all the events that the remote server hasn't received (up to a limit)?
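As a rough illustration of the two strategies being contrasted, the sketch below compares "latest event per room" with "oldest missed events up to a limit" over a plain list. The `Event` type, both function names, and the default limit of 50 are assumptions made for the example and do not describe Synapse's implementation.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    event_id: str
    room_id: str
    stream_ordering: int


def latest_per_room(events: List[Event], last_sent: int) -> List[Event]:
    """Catch-up style: keep only the newest missed event in each room."""
    latest = {}
    for ev in events:
        if ev.stream_ordering <= last_sent:
            continue
        current = latest.get(ev.room_id)
        if current is None or ev.stream_ordering > current.stream_ordering:
            latest[ev.room_id] = ev
    return sorted(latest.values(), key=lambda e: e.stream_ordering)


def oldest_missed(events: List[Event], last_sent: int, limit: int = 50) -> List[Event]:
    """Alternative: send every missed event in order, up to a limit."""
    missed = [ev for ev in events if ev.stream_ordering > last_sent]
    missed.sort(key=lambda e: e.stream_ordering)
    return missed[:limit]


events = [
    Event("$a1", "!a", 1),
    Event("$a2", "!a", 2),
    Event("$b1", "!b", 3),
    Event("$a3", "!a", 4),
    Event("$b2", "!b", 5),
]
print([e.event_id for e in latest_per_room(events, 0)])
print([e.event_id for e in oldest_missed(events, 0, limit=3)])
```

With the first strategy the remote must backfill $a1, $a2 and $b1 itself; with the second it receives them directly and in order, at the cost of a larger transaction.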