backfill slow and expensive #2504
Labels
- A-Federation
- A-Messages-Endpoint: /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill)
- A-Performance: performance, both client-facing and admin-facing
- T-Task: refactoring, removal, replacement, enabling or disabling functionality, and other engineering tasks
Today we saw a pattern of requests from a client which appeared to have got stuck in a loop calling /messages on Matrix HQ (but passing the same pagination parameters over and over again).
There's probably a bug in the client, but that isn't the point here. The main problem was that each request ended up burning 300ms of CPU time, which, combined with the rate of requests, DoSed matrix.org.
Some example requests:
Each request prompted a backfill request to jki.re, for example:

It appears that there are five events in HQ in this timeframe which have prev_event links to events we do not have, so /messages requests in the area prompt a backfill attempt. jki.re is chosen somewhat arbitrarily, but it also doesn't have the missing events, so we don't make any forward progress. We then repeat this process for each subsequent client request.
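For illustration only, here is a minimal sketch of the trigger described above. It is not Synapse's actual implementation; every name in it (`maybe_backfill_sketch`, `known_event_ids`, `candidate_servers`, the event dict shape) is a hypothetical stand-in. The point is the shape of the logic: a /messages request covering events whose prev_events are missing locally causes a more-or-less arbitrary server to be asked to backfill.

```python
# Illustration only: NOT Synapse's actual maybe_backfill. All names here
# (maybe_backfill_sketch, known_event_ids, candidate_servers, the event
# dict shape) are hypothetical stand-ins.
import random

def maybe_backfill_sketch(events, known_event_ids, candidate_servers):
    """If any event in the requested window points at a prev_event we do
    not have locally, pick a server (somewhat arbitrarily) and ask it to
    backfill the gap."""
    missing = {
        prev_id
        for event in events
        for prev_id in event["prev_events"]
        if prev_id not in known_event_ids
    }
    if not missing:
        return None  # no gap: /messages can be served locally

    # Arbitrary choice: if this server doesn't have the events either, we
    # make no forward progress, and the next /messages request over the
    # same window will do all of this again.
    server = random.choice(candidate_servers)
    return server, missing
```

As the sketch makes obvious, nothing in this flow stops the same fruitless server choice being repeated on every request.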
There appear to be four separate problems here:
1. Where is all that CPU time going? (/messages requests which do not trigger a backfill attempt do not seem to cause this CPU time usage, so it kinda has to be in the `maybe_backfill` codepath... but I can't see where.)
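The repeat-per-request behaviour suggests one obvious mitigation. The sketch below is mine, not something quoted from this issue, and all of its names are hypothetical: remember that a recent backfill attempt for a given set of missing events was fruitless, and back off rather than re-issuing the same federation request on every /messages call.

```python
# Illustration only: a possible mitigation, not code from Synapse or this
# issue. All names (BackfillBackoff, should_attempt, record_failure) are
# hypothetical.
import time

class BackfillBackoff:
    """Remember fruitless backfill attempts so that repeated /messages
    requests over the same window don't re-issue identical federation
    requests."""

    def __init__(self, backoff_seconds=600):
        self.backoff_seconds = backoff_seconds
        # (room_id, frozenset of missing event_ids) -> monotonic timestamp
        self._last_failure = {}

    def should_attempt(self, room_id, missing_event_ids):
        key = (room_id, frozenset(missing_event_ids))
        last = self._last_failure.get(key)
        return last is None or time.monotonic() - last > self.backoff_seconds

    def record_failure(self, room_id, missing_event_ids):
        key = (room_id, frozenset(missing_event_ids))
        self._last_failure[key] = time.monotonic()
```

With something along these lines, the looping client in this incident would have triggered at most one backfill attempt per backoff window instead of one per request.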