Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

backfill slow and expensive #2504

Open
richvdh opened this issue Oct 5, 2017 · 3 comments
Open

backfill slow and expensive #2504

richvdh opened this issue Oct 5, 2017 · 3 comments
Labels
A-Federation A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.

Comments

@richvdh
Copy link
Member

richvdh commented Oct 5, 2017

Today we saw a pattern of requests from a client which appeared to have got stuck in a loop calling /messages on matrix HQ (but passing the same pagination parameters over and over again).

There's probably a bug in the client, but that isn't the point here. The main problem was that each request ended up burning 300ms of CPU time, which combined with the rate of requests, DoSed matrix.org.

Some example requests:

2017-10-05 14:58:49,106 - synapse.access.http.8080 - 91 - INFO - GET-4039099 - 8080 - {@user:matrix.org} Processed request: 1813ms (271ms, 0ms) (4ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,520 - synapse.access.http.8080 - 91 - INFO - GET-4039004 - 8080 - {@user:matrix.org} Processed request: 4056ms (316ms, 0ms) (19ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,522 - synapse.access.http.8080 - 91 - INFO - GET-4039030 - 8080 - {@user:matrix.org} Processed request: 3279ms (276ms, 0ms) (15ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:49,572 - synapse.access.http.8080 - 91 - INFO - GET-4039061 - 8080 - {@user:matrix.org} Processed request: 2853ms (291ms, 0ms) (16ms/3) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:50,089 - synapse.access.http.8080 - 91 - INFO - GET-4039231 - 8080 - {@user:matrix.org} Processed request: 1835ms (280ms, 0ms) (3ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:51,605 - synapse.access.http.8080 - 91 - INFO - GET-4039382 - 8080 - {@user:matrix.org} Processed request: 2212ms (259ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,000 - synapse.access.http.8080 - 91 - INFO - GET-4039452 - 8080 - {@user:matrix.org} Processed request: 1888ms (283ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,003 - synapse.access.http.8080 - 91 - INFO - GET-4039453 - 8080 - {@user:matrix.org} Processed request: 1881ms (287ms, 0ms) (4ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,241 - synapse.access.http.8080 - 91 - INFO - GET-4039454 - 8080 - {@user:matrix.org} Processed request: 2110ms (268ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"
2017-10-05 14:58:52,481 - synapse.access.http.8080 - 91 - INFO - GET-4039483 - 8080 - {@user:matrix.org} Processed request: 1352ms (279ms, 0ms) (2ms/2) 8807B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353323-282576586_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"

Each request prompted a backfill request to jki.re, for example:

2017-10-05 14:58:45,474 - synapse.access.http.8080 - 59 - INFO - GET-4039005 - 8080 - Received request: GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20
2017-10-05 14:58:46,189 - synapse.http.outbound - 154 - INFO - GET-4039005 - {GET-O-6237} [jki.re] Sending request: GET matrix://jki.re/_matrix/federation/v1/backfill/!cURbafjkfsMDVwdRDQ:matrix.org/?limit=100&v=%241504371387441TzMaJ%3Ahveem.no&v=%2414998105322MIGsD%3Aarquilla.net&v=%2415030841510raCcl%3Ahveem.no&v=%2415068188390cXeVs%3Atapenet.org&v=%2414847676930qrrMY%3Adeuxfleurs.fr
2017-10-05 14:58:47,382 - synapse.http.outbound - 239 - INFO - GET-4039005 - {GET-O-6237} [jki.re] Result: 200 OK
2017-10-05 14:58:47,801 - synapse.access.http.8080 - 91 - INFO - GET-4039005 - 8080 - {@user:matrix.org} Processed request: 2327ms (283ms, 0ms) (3ms/2) 9809B 200 "GET /_matrix/client/r0/rooms/!cURbafjkfsMDVwdRDQ:matrix.org/messages?access_token=<redacted>&from=t353343-282585688_300957161_39496_54465289_15147448_147235_1853980_2348789&dir=b&limit=20 HTTP/1.1" "Mozilla/5.0"

It appears that there are five events in HQ in this timeframe which have prev_event links to events we do not have, so /messages requests in the area prompt a backfill attempt; jki.re is chosen somewhat arbitrarily, but also doesn't have the missing events - so we don't make any forward progress. We then repeat this process for each subsequent client request.

There appear to be four separate problems here:

  • The 300ms of CPU time is somewhat inexplicable. I haven't figured out what it's doing, or whether it happens before or after the callout to jki.re. (/messages requests which do not trigger a backfill attempt do not seem to cause this CPU time usage, so it kinda has to be in the maybe_backfill codepath... but I can't see where)
  • When jki.re returns the empty list rather than any of the events we have asked for, we are happy with that and don't ask any of the other servers in the room.
  • We don't record failures when backfilling, so are doomed to repeat historical mistakes.
  • There is no attempt to ratelimit how often we do this.
@richvdh
Copy link
Member Author

richvdh commented Feb 21, 2019

related: #2726

@MadLittleMods MadLittleMods added the A-Performance Performance, both client-facing and admin-facing label Jul 7, 2021
@erikjohnston erikjohnston added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jul 23, 2021
@MadLittleMods
Copy link
Contributor

Seems related to #10742

@MadLittleMods MadLittleMods added the A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) label Jul 19, 2022
@MadLittleMods
Copy link
Contributor

We don't record failures when backfilling, so are doomed to repeat historical mistakes.

Keeping track of failed pull attempts with backfill was added in #13621

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation A-Messages-Endpoint /messages client API endpoint (`RoomMessageListRestServlet`) (which also triggers /backfill) A-Performance Performance, both client-facing and admin-facing T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks.
Projects
None yet
Development

No branches or pull requests

3 participants