Events that take >60s to send have a high chance of being duplicated when using an event_creator #3365
Comments
Thanks @turt2live, agreed that we should be sending an identifier to dedupe on.
Bump, the same issue occurred again today on Riot.
@neilisfragile Any chance of working on this in the near future...? It does make us all look a bit silly when messages get duplicated because of this, so it would be nice to fix it and reassure everyone that we do have proper message transactions.
Related (in that it shares a similar root cause): #6316
Mozilla just got bitten embarrassingly by this thanks to the DB temporarily disappearing from under the Synapse; it makes an outage look even worse than it already was.
Some initial questions which might help point towards a solution here:
One potential solution to this is to store a map in the … We should be careful that any solution here does not conflict with plans to support multiple event persisters.
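For illustration, a minimal sketch of the kind of map that comment seems to describe — all names here are hypothetical, and where the map would actually live in Synapse is exactly the part elided above:

```python
import time

# Hypothetical in-memory map from (user_id, txn_id) to the event ID that
# was actually persisted, with a time-bounded retention window so it
# doesn't grow without bound. Nothing here is real Synapse code.
class TransactionMap:
    def __init__(self, ttl_seconds: float = 30 * 60):
        self._ttl = ttl_seconds
        self._entries: dict[tuple[str, str], tuple[str, float]] = {}

    def get(self, user_id: str, txn_id: str) -> str | None:
        entry = self._entries.get((user_id, txn_id))
        if entry is None:
            return None
        event_id, stored_at = entry
        if time.monotonic() - stored_at > self._ttl:
            # Too old to be a retry of the same request; forget it.
            del self._entries[(user_id, txn_id)]
            return None
        return event_id

    def put(self, user_id: str, txn_id: str, event_id: str) -> None:
        self._entries[(user_id, txn_id)] = (event_id, time.monotonic())
```

A retry carrying the same transaction ID would then be answered from the map instead of persisting a second event.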
I'm a little worried that if this doesn't get fixed soonish then it'll never get fixed. The scope keeps expanding as time passes because the issue hasn't been addressed, meaning any fix has to consider things it wouldn't have had to before.
On the contrary: I think that as we increase parallelism it's going to be more important than ever to fix this. I've bumped it right up the backend team's priority list.
Erik has said that a given room will be handled by the same event persister (EP) each time, so hopefully there's an opening to do something there. I'm going to pick this up and see where I can take it :)
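To illustrate why that property helps (a toy sketch, not Synapse's actual routing code): if the persister for a room is picked deterministically from the room ID, every request for that room lands on the same worker, so per-worker state such as a transaction-ID map stays coherent.

```python
import binascii

def persister_for_room(room_id: str, persisters: list[str]) -> str:
    # Stable hash of the room ID (crc32 is deterministic across
    # processes, unlike Python's built-in hash() for strings).
    index = binascii.crc32(room_id.encode("utf-8")) % len(persisters)
    return persisters[index]

workers = ["persister-1", "persister-2", "persister-3"]
# Repeated sends to the same room always route to the same worker.
assert persister_for_room("!room:example.org", workers) == persister_for_room(
    "!room:example.org", workers
)
```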
With writing a workerised test seeming too non-trivial, I have taken to reproducing this manually with curl, by adding a random sleep (I used a 50% chance) to … Then I used this curl command:

```
curl -i --request PUT \
  -H "Content-Type: application/json" \
  -H 'Authorization: Bearer MDAxMmxvY2F0aW9uIHN5blcKMDAxM2lkZW50aWZpZXIga2V5CjAwMTBjaWQgZ2VuID0gMQowMDFiY2lkIHVzZXJfaWQgPSBAdTE6c3luVwowMDE2Y2lkIHR5cGUgPSBhY2Nlc3MKMDAyMWNpZCBub25jZSA9IFE7RGtTLHgjJjFFcU5jQ2UKMDAyZnNpZ25hdHVyZSANYDw3YTyeXtytKUJPWC3t948Au1evgKuN48VCyP6xDQo' \
  -A 'A/1.0' -d '@-' \
  'https://synw:8448/_matrix/client/r0/rooms/!dfVRmuOniuCbXmCbSq:synW/send/m.room.message/m1598345463534.2' <<EOF
{"msgtype":"m.text","body":"try it now"}
EOF
```

(substituting in a fresh transaction ID)

Once you hit the sleep and the request finally returns with a 500, retrying the same curl command causes a message duplication. Further repeats following a non-slowed send are deduplicated by transaction ID appropriately. So something is doing something somewhere. Will look into this further tomorrow.
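For anyone re-running this: the comment doesn't say exactly where the sleep was patched in, but the injection was presumably something along these lines (purely a hypothetical sketch, not the actual patch):

```python
import asyncio
import random

def make_flaky(persist_event):
    """Wrap a (hypothetical) event-persistence coroutine so roughly
    half of all sends stall past the 60s replication timeout,
    producing the 500 that provokes the client retry."""
    async def flaky_persist(*args, **kwargs):
        if random.random() < 0.5:
            await asyncio.sleep(90)  # comfortably over the 60s timeout
        return await persist_event(*args, **kwargs)
    return flaky_persist
```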
It's looking likely that I'll have to drop this, as I don't know whether I'll get the chance to take it on properly before I evaporate for the summer. Nevertheless, I'll save someone else the digging and offer some notes:
The solution thus seems to be that we need to use … (I notice there is an …)
PS (N.B.): Each room is only handled by one event persister, so it is 'safe' to do in-process caching of transaction IDs.
This is likely to be changed, if not improved, by #8400.
One hope is that this might provide some insights into #3365.
Overview
When clients make a request to send an event, the event creator forwards that to the main synapse process. The request to the main process has a 60s timeout, which can be reached under extremely high load. The event creator then returns an error to the client, which then happily retries the request. When the client retries, the steps start all over again. While this is going on, the main process is still trying to send the first event.
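As a minimal model of that failure mode (not Synapse code; all names are illustrative):

```python
import asyncio

async def send_event_via_replication(persist_on_main, event, timeout=60.0):
    # shield() models the main process carrying on with persistence
    # even after the event_creator has stopped waiting for the answer.
    try:
        return await asyncio.wait_for(
            asyncio.shield(persist_on_main(event)), timeout
        )
    except asyncio.TimeoutError:
        # The event_creator surfaces an error to the client here; the
        # client retries with the same transaction ID while the first
        # persistence attempt is still running, so the event lands twice.
        raise
```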
Logs
In this scenario, an event was repeated 4 times due to high system load (not uncommon, sadly). The 'client' in this case is the Discord bridge (which uses the js-sdk under the hood, similar to Riot).
Here's what the event_creator sees:
Most importantly, the same transaction ID is used to send all 4 messages (`m1528285583650.0`). The main process sees the following (requests correlated by event ID):
As demonstrated, synapse did eventually finish sending each of the 4 events (with the last one completing fast enough that the client didn't retry).
Possible solution
The timeout between the event creator and main process could be removed, although that would mean other network failures could cause duplicate events. Regardless of whether the timeout is removed, the event creator could send the transaction ID alongside the send_event call to the main process. The main process can then dedupe requests it has already seen and return the 'real' event ID it ended up sending. The event creator may need to be made aware that the event ID can change as part of the request.
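Sketched out (hypothetical names, and glossing over persistence details), the main-process side of that dedup might look like:

```python
# Map from (user_id, txn_id) to the event ID the main process actually
# assigned. A real implementation would bound and expire this.
seen_transactions: dict[tuple[str, str], str] = {}

async def send_event(user_id: str, txn_id: str, event, persist) -> str:
    key = (user_id, txn_id)
    if key in seen_transactions:
        # This transaction was already persisted: hand back the 'real'
        # event ID, which may differ from the one the event_creator
        # originally minted.
        return seen_transactions[key]
    event_id = await persist(event)
    seen_transactions[key] = event_id
    return event_id
```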
Retries from the client should already be deduplicated, although the event creator may need to be smarter about which requests it considers 'in flight'. For instance, a timeout from the main process should not fail the whole request; instead, it should prompt the event creator to retry on behalf of the client, even though the client may time out and retry again independently.
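One way to make that 'in flight' tracking concrete (again a sketch under assumed names): keep a map of running persistence attempts, and have every retry for the same transaction await the attempt already underway rather than failing or starting a new one.

```python
import asyncio

# Running persistence attempts, keyed by (user_id, txn_id). Completed
# transactions would additionally need to be remembered (as in the
# sketch above) to dedupe retries that arrive after completion.
in_flight: dict[tuple[str, str], asyncio.Task] = {}

async def send_or_join(user_id: str, txn_id: str, event, persist) -> str:
    key = (user_id, txn_id)
    task = in_flight.get(key)
    if task is None:
        task = asyncio.ensure_future(persist(event))
        in_flight[key] = task
        task.add_done_callback(lambda _: in_flight.pop(key, None))
    # Every caller with this (user, txn) pair awaits the same attempt
    # and receives the same event ID, even if the client gave up and
    # retried independently.
    return await task
```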