Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

toDevice messages on matrix.org can take 14.1s to propagate. #9564

Closed
ara4n opened this issue Mar 8, 2021 · 7 comments
Closed

toDevice messages on matrix.org can take 14.1s to propagate. #9564

ara4n opened this issue Mar 8, 2021 · 7 comments
Assignees
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@ara4n
Copy link
Member

ara4n commented Mar 8, 2021

element-hq/element-web#8376

This has been around for years, but bites me surprisingly frequently (even on test accounts!) making me assume it might be quite a widespread problem (and perhaps contribute to UISIs and encryption delays).

@anoadragon453
Copy link
Member

Note that the linked Element Web issue has Synapse logs as well. At the moment it looks to have been tracked down to the toDevice message not waking up the synchrotron for ~14s. We do at least try to do so:

self.notifier.on_new_event(
"to_device_key", stream_id, users=local_messages.keys()
)

We also talked about increasing Jaeger's monitoring on matrix.org to try and catch the full path of the request as it goes through the system (it's currently only at 10%). We also don't have the encryption worker in Jaeger's services at the moment, which would be a good initial fix to begin with.

@anoadragon453 anoadragon453 added the X-Needs-Info This issue is blocked awaiting information from the reporter label Mar 9, 2021
@callahad
Copy link
Contributor

From the team sync today, we think getting our Jaeger deployment into a place where it's useful for solving this could turn into a pretty significant rabbit hole, and we're at capacity for rabbit holes at the moment.

The hope is that we'll be able to come up for air in the next ~1-2 weeks and consider digging in more deeply, but in the meantime ara4n is going to do some manual digging and see if he turns up anything useful.

@richvdh richvdh removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Apr 14, 2021
@richvdh
Copy link
Member

richvdh commented May 10, 2021

related: #9533

@ara4n
Copy link
Member Author

ara4n commented May 10, 2021

So i've been trying to repro this today with:

bucephalus:~ matthew$ gdate; d=`gdate +%s.%N`; echo $d; curl 'https://matrix-client.matrix.org/_matrix/client/r0/sendToDevice/netrasphere.test/'$d -H 'authorization: Bearer secret' -X PUT --data '{"messages": { "@matthewtest7:matrix.org": { "*": { "foo": "bar" }}}}'
1620663200.138457000
Mon May 10 17:13:20 BST 2021
{}

Most of the messages are <1s; about 1 in 30 takes ~5 seconds, but (after waiting until later in the day for things to get a busier) i just had the one above take 8 seconds:

17:13:28.070 rageshake.js:65 received to_device net.arasphere.test from: @matthewtest7:matrix.org id: undefined

I'll try again 9-10pm tonight UK which is when i was seeing the worst behaviour before. However, it may be worth chasing this particular request given 8s feels surprisingly long, to see if it has the same symptoms as the ones I was tracking in element-hq/element-web#8376

@richvdh
Copy link
Member

richvdh commented May 10, 2021

On the encryption worker:

2021-05-10 16:13:20,223 - synapse.access.http.8134 - 408 - INFO - PUT-5517595 - 80.250.101.234 - 8134 - {@matthewtest7:matrix.org} Processed request: 0.010sec/-0.000sec (0.004sec, 0.000sec) (0.001sec/0.006sec/3) 2B 200 "PUT /_matrix/client/r0/sendToDevice/net.arasphere.test/1620663200.138457000 HTTP/1.1" "curl/7.64.1" [0 dbevts]
2021-05-10 16:13:20,225 - synapse.replication.tcp.resource - 188 - INFO - replication_notifier-3169787 - Streaming: to_device -> 259526353

On synchrotron5:

2021-05-10 16:13:20,878 - synapse.access.http.8096 - 408 - INFO - GET-10844832 - 80.250.101.234 - 8096 - {@matthewtest7:matrix.org} Processed request: 0.450sec/-0.000sec (0.006sec, 0.000sec) (0.004sec/0.009sec/1) 1235B 200 "GET /_matrix/client/r0/sync?filter=4&timeout=30000&since=m2005091140~1.2005091149_757284961_11949499_861387742_755485781_2301469_259526327_1004328024_194867 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" [0 dbevts]
2021-05-10 16:13:24,664 - synapse.access.http.8096 - 408 - INFO - GET-10844895 - 80.250.101.234 - 8096 - {@matthewtest7:matrix.org} Processed request: 2.941sec/-0.000sec (0.034sec, 0.003sec) (0.006sec/0.067sec/4) 1244B 200 "GET /_matrix/client/r0/sync?filter=4&timeout=30000&since=m2005091175~1.2005091223_757284961_11949588_861387792_755485835_2301469_259526336_1004328026_194867 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" [1 dbevts]
2021-05-10 16:13:27,774 - synapse.access.http.8096 - 408 - INFO - GET-10845049 - 80.250.101.234 - 8096 - {@matthewtest7:matrix.org} Processed request: 1.669sec/-0.000sec (0.010sec, 0.002sec) (0.076sec/0.436sec/3) 1366B 200 "GET /_matrix/client/r0/sync?filter=4&timeout=30000&since=s2005091231_757284961_11949672_861387822_755485883_2301469_259526346_1004328028_194867 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" [0 dbevts]
2021-05-10 16:13:28,706 - synapse.access.http.8096 - 408 - INFO - GET-10845136 - 80.250.101.234 - 8096 - {@matthewtest7:matrix.org} Processed request: 0.252sec/-0.000sec (0.005sec, 0.000sec) (0.008sec/0.013sec/2) 1118B 200 "GET /_matrix/client/r0/sync?filter=4&timeout=30000&since=s2005091473_757284961_11949914_861387961_755486034_2301469_259526368_1004328036_194867 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" [0 dbevts]

We can see from the since params (7th component) that the to-device message, with stream id 259526353, wasn't returned until the third of those four sync requests, which is consistent with Matthew's timings. So why wasn't it returned by either of the earlier /syncs ?

No idea, in short. It implies the synchrotron didn't know about the new to-device event, but it's unclear why. There's no particular evidence of GC, or delays in the events replication stream at least (which is a bit more verbose than the to_device stream, so easier to trace).

So this is consistent with element-hq/element-web#8376. #9959 might help diagnose it a bit.

@richvdh richvdh self-assigned this May 13, 2021
@erikjohnston erikjohnston added S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels May 13, 2021
@richvdh
Copy link
Member

richvdh commented May 13, 2021

@ara4n reported another delayed message:

Wed May 12 16:43:29 BST 2021
1620834209.501334000
{}
->

16:43:35.609 rageshake.js:65 received to_device net.arasphere.test from: @matthewtest7:matrix.org id: undefined

On the encryption worker:

2021-05-12 15:43:29,579 - synapse.9533_debug - 549 - DEBUG - PUT-316121-- - Stored to-device messages with stream_id 260733698 for [('@matthewtest7:matrix.org', '29yr+WWkOOlgiOJH+aVIzlejMCeGBkUBLSH4VolH2o8'), ('@matthewtest7:matrix.org', 'AGMPOJNUXK'), ('@matthewtest7:matrix.org', 'BvrEPEYN5xqarqzvw15TawHRGopBlCZB4BNt+wTcX20'), ('@matthewtest7:matrix.org', 'DHKBGEETBN'), ('@matthewtest7:matrix.org', 'FBR5CDAZUUmdxU0w6kkLkARm+OGN5mOQq46D/fHHNfE'), ('@matthewtest7:matrix.org', 'FIGEQEAQRK'), ('@matthewtest7:matrix.org', 'FVRJKEZQBW'), ('@matthewtest7:matrix.org', 'GAWSPZDHZY'), ('@matthewtest7:matrix.org', 'GCEZUQZUQO'), ('@matthewtest7:matrix.org', 'GEFBWQHMEO'), ('@matthewtest7:matrix.org', 'HNGIOKUKWZ'), ('@matthewtest7:matrix.org', 'IETEMEMWSB'), ('@matthewtest7:matrix.org', 'KAYDNSRDZZ'), ('@matthewtest7:matrix.org', 'LGFPVHDGHM'), ('@matthewtest7:matrix.org', 'MROFBYOBJN'), ('@matthewtest7:matrix.org', 'NBi2sj9rh/xflRIjIdUdp8iRpOSHjeiXyPHoWofDcnw'), ('@matthewtest7:matrix.org', 'PTSANZNJNZ'), ('@matthewtest7:matrix.org', 'QDTBLOQDCO'), ('@matthewtest7:matrix.org', 'RJEEKTTNAD'), ('@matthewtest7:matrix.org', 'RSACOJTTMA'), ('@matthewtest7:matrix.org', 'TBOTWOIPUO'), ('@matthewtest7:matrix.org', 'TMEBWPPFVP'), ('@matthewtest7:matrix.org', 'UlPr+z2c2tzBAVO1NJED+Cdz9Ib86Oax2c2uDCLqtdA'), ('@matthewtest7:matrix.org', 'VEVLFOCPIL'), ('@matthewtest7:matrix.org', 'WN4TmF8uxs7uqdOCv8RxqQFqBzg4c9UaelAhd4cgXTU'), ('@matthewtest7:matrix.org', 'WQNXLQTROL'), ('@matthewtest7:matrix.org', 'YKFVSQHZTO'), ('@matthewtest7:matrix.org', 'YXDAYDVKRF'), ('@matthewtest7:matrix.org', 'Z6Ws6GEVaHzgN5lC3aoK6MAwPXY/sHRF/F74fw8X8Yk'), ('@matthewtest7:matrix.org', 'cesMHMyNR5kLVE/4+18pfbdn+arMLmLXds9heiPn/jw'), ('@matthewtest7:matrix.org', 'gmNxf99ZtaMc11G66UM8J3bhDMBk93GFtAhGgfNEHTI'), ('@matthewtest7:matrix.org', 'iW2mIVnEkmy/P6TT6Cyq9s42JBfAFSARUQzLp3TIG7s'), ('@matthewtest7:matrix.org', 'j/BtV3WT8au+CKcsYXgLPYtKf4cLUgFyLkzv2qffOiE'), ('@matthewtest7:matrix.org', 'mRLJfUT2SoChNJ8ZLqMOwiAwjR4yKzSKoe6AM4U6l9A')]
2021-05-12 15:43:29,582 - synapse.9533_debug - 434 - DEBUG - PUT-316121- - to-device messages stream id 260733698, awaking streams for dict_keys(['@matthewtest7:matrix.org'])
2021-05-12 15:43:29,583 - synapse.access.http.8134 - 408 - INFO - PUT-316121 - 80.250.101.234 - 8134 - {@matthewtest7:matrix.org} Processed request: 0.014sec/-0.000sec (0.003sec, 0.001sec) (0.001sec/0.010sec/3) 2B 200 "PUT /_matrix/client/r0/sendToDevice/net.arasphere.test/1620834209.501334000 HTTP/1.1" "curl/7.64.1" [0 dbevts]

On synchrotron8:

2021-05-12 15:43:29,610 - synapse.9533_debug - 434 - DEBUG - process-replication-data-2994089 - to-device messages stream id 260733698, awaking streams for ['@matthewtest7:matrix.org']
2021-05-12 15:43:32,845 - synapse.9533_debug - 434 - DEBUG - process-replication-data-2994483 - to-device messages stream id 260733716, awaking streams for ['@matthewtest7:matrix.org']
2021-05-12 15:43:32,875 - synapse.9533_debug - 434 - DEBUG - process-replication-data-2994530 - to-device messages stream id 260733718, awaking streams for ['@matthewtest7:matrix.org']
2021-05-12 15:43:32,896 - synapse.9533_debug - 434 - DEBUG - process-replication-data-2994577 - to-device messages stream id 260733719, awaking streams for ['@matthewtest7:matrix.org']
2021-05-12 15:43:35,123 - synapse.access.http.8300 - 408 - INFO - GET-581195 - 80.250.101.234 - 8300 - {@matthewtest7:matrix.org} Processed request: 6.068sec/-0.000sec (0.117sec, 0.002sec) (0.986sec/0.479sec/7) 1458B 200 "GET /_matrix/client/r0/sync?filter=4&timeout=30000&since=s2009948967_757284961_681796_864628704_759364233_2306590_260733689_1011208698_195525 HTTP/1.1" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36" [1 dbevts]

so we can tell that the replication notification is making it as far as the synchrotron, but the /sync request isn't waking up for some reason (or it took 6 seconds between wakeup and return). Better add some more debug...

@richvdh
Copy link
Member

richvdh commented Jun 7, 2021

I think this is basically fixed by #10124, though we'll leave some of the debugging in place to help trace it if it turns up again.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
S-Major Major functionality / product severely impaired, no satisfactory workaround. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
Development

No branches or pull requests

5 participants