federation catchup is slow and network-heavy #9492

richvdh · 2021-02-24T18:43:41Z

If a remote HS takes a long time to process a transaction (perhaps because of #9490), then we will enter "catchup mode" for that homeserver. In this mode, we only ever send the most recent event per room, which means that the remote homeserver has to backfill any intermediate events, which is comparatively very slow (and may require calling /state_ids, see #7893 and #6597).

So, we have a homeserver which is struggling to keep up, and we are making its life even harder by only sending it a subset of the traffic in the room.

I'm not at all sure why we do this. Shouldn't we send all the events that the remote server hasn't received (up to a limit)?


ShadowJonathan · 2021-02-24T18:45:20Z

What does the spec say? Shouldn't it try to send all events it can queue up (within reason)?

anoadragon453 · 2021-02-25T13:23:46Z

It looks like we fetch the oldest events that the remote has not yet received:

synapse/synapse/federation/sender/per_destination_queue.py

Lines 461 to 465 in 0a00b7f

    
    while True:
        event_ids = await self._store.get_catch_up_room_event_ids(
            self._destination,
            self._last_successful_stream_ordering,
        )

synapse/synapse/storage/databases/main/transactions.py

Lines 424 to 442 in 0a00b7f

    
    def _get_catch_up_room_event_ids_txn(
        txn: LoggingTransaction,
        destination: str,
        last_successful_stream_ordering: int,
    ) -> List[str]:
        q = """
                SELECT event_id FROM destination_rooms
                 JOIN events USING (stream_ordering)
                WHERE destination = ?
                  AND stream_ordering > ?
                ORDER BY stream_ordering
                LIMIT 50
            """
        txn.execute(
            q,
            (destination, last_successful_stream_ordering),
        )
        event_ids = [row[0] for row in txn]
        return event_ids

rather than the most recent event per room?

richvdh · 2021-02-25T15:31:18Z

destination_rooms stores the stream_ordering of the most recent event in each room that we should have sent to each destination, rather than the most recent event that we actually did send. So when we join to events, we get the event id of that single most recent event per room.
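To make that concrete, here is a minimal sketch using an in-memory SQLite database. The two-table schema, the event IDs and the `remote.example` destination are simplified assumptions for illustration, not Synapse's real schema; only the query text is taken from the snippet above. Because `destination_rooms` holds one row per (destination, room), the join can only ever match one event per room.

```python
import sqlite3

# Simplified, assumed schema: the real Synapse tables have more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (event_id TEXT, room_id TEXT, stream_ordering INTEGER);
    CREATE TABLE destination_rooms (
        destination TEXT, room_id TEXT, stream_ordering INTEGER
    );
    """
)

# Five events across two rooms, none of which the remote has received.
events = [
    ("$a1", "!a", 1),
    ("$a2", "!a", 2),
    ("$b1", "!b", 3),
    ("$a3", "!a", 4),
    ("$b2", "!b", 5),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", events)

# destination_rooms keeps only the latest stream_ordering per room,
# so each later event in a room overwrites the earlier entry.
latest_per_room = {}
for _event_id, room_id, stream_ordering in events:
    latest_per_room[room_id] = stream_ordering
conn.executemany(
    "INSERT INTO destination_rooms VALUES (?, ?, ?)",
    [("remote.example", room, so) for room, so in latest_per_room.items()],
)

# Same query as in the snippet above.
rows = conn.execute(
    """
    SELECT event_id FROM destination_rooms
     JOIN events USING (stream_ordering)
    WHERE destination = ?
      AND stream_ordering > ?
    ORDER BY stream_ordering
    LIMIT 50
    """,
    ("remote.example", 0),
).fetchall()
catch_up_event_ids = [row[0] for row in rows]
print(catch_up_event_ids)
```

Under these assumptions the intermediate events `$a1`, `$a2` and `$b1` never appear in the result, so the remote would have to backfill them itself.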

Federation catch up mode is very inefficient if the number of events that the remote server has missed is small, since handling gaps can be very expensive, cf. #9492. Instead of going into catch up mode whenever we see an error, we do so only if we've backed off from trying the remote for more than an hour (the assumption being that in such a case it is more than a transient failure).

erikjohnston · 2021-03-09T16:51:50Z

I wonder if another thing at play here is the interaction of all the different servers in the room. Take a busy room like Matrix HQ: after a few hours of downtime, a server will receive the latest event from every server that sent an event during that time. For each one, the receiving server will first do /get_missing_events to fetch up to ten missing events, then do a /state_ids request if there is still a gap. This means that for a single room the server can end up processing a lot more than just the latest event, and end up with many small chunks of DAG in the gap between it going offline and coming back online.

ShadowJonathan · 2021-03-09T18:28:24Z

Maybe rooms need something similar to ResponseCache.wrap(), in the sense that getting the backlog and state_ids is handled by one worker, and every subsequent request is queued behind it until it is done. This might require closer and more complex coordination (even more so across workers), but it would solve that problem.
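The deduplication idea could look roughly like the sketch below. It uses plain asyncio rather than Synapse's actual ResponseCache, and the class name `PerRoomSingleFlight`, the room ID and `fetch_gap` are all hypothetical names made up for illustration. It only covers a single process; coordinating across workers is out of scope here.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class PerRoomSingleFlight:
    """Hypothetical helper: share one in-flight fetch per room."""

    def __init__(self) -> None:
        self._in_flight: Dict[str, "asyncio.Future[List[str]]"] = {}

    async def wrap(
        self, room_id: str, fetch: Callable[[], Awaitable[List[str]]]
    ) -> List[str]:
        existing = self._in_flight.get(room_id)
        if existing is not None:
            # Someone is already filling the gap for this room: wait for them.
            return await existing
        task = asyncio.ensure_future(fetch())
        self._in_flight[room_id] = task
        try:
            return await task
        finally:
            # Forget the entry once done so later gaps trigger a fresh fetch.
            self._in_flight.pop(room_id, None)


async def demo():
    calls = 0

    async def fetch_gap() -> List[str]:
        # Stand-in for the expensive backfill / state lookup.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return ["$missing1", "$missing2"]

    cache = PerRoomSingleFlight()
    # Five incoming events for the same room all hit the same gap.
    results = await asyncio.gather(
        *(cache.wrap("!room:example.org", fetch_gap) for _ in range(5))
    )
    return calls, results


calls, results = asyncio.run(demo())
print(calls, len(results))
```

In this sketch the five concurrent callers share a single fetch. Cancellation handling is deliberately left out, so a cancelled waiter would also cancel the shared task.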

erikjohnston · 2021-03-11T16:15:58Z

I wonder if another thing at play here is the interaction of all the different servers in the room. Take a busy room like Matrix HQ: after a few hours of downtime, a server will receive the latest event from every server that sent an event during that time. For each one, the receiving server will first do /get_missing_events to fetch up to ten missing events, then do a /state_ids request if there is still a gap. This means that for a single room the server can end up processing a lot more than just the latest event, and end up with many small chunks of DAG in the gap between it going offline and coming back online.

I wonder if Synapse should only send out events if they're a forward extremity? In the hopes that the server that has sent subsequent events in the room will send the latest event? It's possible that the other server won't, but that feels like a rare case. By doing so it should mean that the receiving server only receives the current extremities they've missed, rather than an additional smattering of events at different points in the gap.
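As a rough illustration of the forward-extremity check, the sketch below treats a room as a mapping from event ID to the IDs of its prev_events. The function names and the toy DAG are hypothetical, and this is not how Synapse stores or computes extremities; it only shows the definition, namely events that no other known event points back to.

```python
from typing import Dict, List, Set


def forward_extremities(prev_events: Dict[str, List[str]]) -> Set[str]:
    """Return the events that no other known event lists as a prev_event."""
    referenced = {prev for prevs in prev_events.values() for prev in prevs}
    return set(prev_events) - referenced


def should_send_during_catch_up(
    event_id: str, prev_events: Dict[str, List[str]]
) -> bool:
    """Hypothetical filter: only send an event if it is still an extremity."""
    return event_id in forward_extremities(prev_events)


# Toy DAG: $1 <- $2, and then the room forks into $3 and $4.
room_dag = {
    "$1": [],
    "$2": ["$1"],
    "$3": ["$2"],
    "$4": ["$2"],
}

extremities = forward_extremities(room_dag)
print(sorted(extremities))
print(should_send_during_catch_up("$2", room_dag))
```

Here `$2` would be skipped because `$3` and `$4` already build on it, so the receiving server would only be handed the two current tips of the room.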


reivilibre · 2022-08-05T15:49:45Z

Note that on March 18th a PR was merged that implements Erik's suggestion from above: #9640

richvdh · 2022-08-08T09:28:20Z

yes, I've updated the summary of this issue, since it's no longer a simple matter of optimisation.

clokep added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Feb 24, 2021

erikjohnston mentioned this issue Mar 8, 2021

Don't go into federation catch up mode so easily #9561

Merged

erikjohnston mentioned this issue Mar 15, 2021

Only send catch up events if they're the latest in the room #9621

Closed

clokep assigned erikjohnston Mar 16, 2021

erikjohnston removed their assignment Feb 28, 2022

richvdh changed the title ~~federation catchup is very inefficient~~ federation catchup is slow and network-heavy Aug 8, 2022

MadLittleMods added the A-Federation label Aug 25, 2022

matrixbot mentioned this issue Dec 21, 2023

federation catchup is slow and network-heavy element-hq/synapse#9492

Open
