
Events that take >60s to send have a high chance of being duplicated when using an event_creator #3365

Closed
turt2live opened this issue Jun 7, 2018 · 13 comments · Fixed by #8476
Labels: z-minor (Deprecated Label)

@turt2live (Member)

Overview

When a client makes a request to send an event, the event creator forwards it to the main synapse process. That request to the main process has a 60s timeout, which can be hit under extremely high load. When it is, the event creator returns an error to the client, which happily retries the request, and the whole cycle starts again. Meanwhile, the main process is still working through the original event, so each retry eventually produces another copy of the same message.

Logs

In this scenario, an event was repeated 4 times due to high system load (not uncommon, sadly). The 'client' in this case is the Discord bridge (which uses the js-sdk under the hood, similar to Riot).

Here's what the event_creator sees:

event_creator - 2018-06-06 11:46:23,718 - synapse.access.http.8123 - 97 - INFO - PUT-12029- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:46:23,719 - synapse.util.async - 291 - INFO - PUT-12029- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:46:23,918 - synapse.http.client - 100 - INFO - PUT-12029- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...dle:t2bot.io
event_creator - 2018-06-06 11:47:24,326 - synapse.http.client - 122 - INFO - PUT-12029- Error sending request to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...dle:t2bot.io: ResponseNeverReceived [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
event_creator - 2018-06-06 11:47:24,327 - synapse.util.async - 300 - INFO - PUT-12029- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:47:24,328 - synapse.http.server - 102 - ERROR - PUT-12029- Failed handle request via <function _async_render at 0x7f70d91c36e0>: <XForwardedForRequest at 0x7f70ada379e0 method=PUT uri=/_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io clientproto=HTTP/1.0 site=8123>: Traceback (most recent call last):
event_creator - 2018-06-06 11:47:24,328 - synapse.access.http.8123 - 131 - INFO - PUT-12029- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 60610ms (3ms, 0ms) (1ms/193ms/3) 55B 500 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

event_creator - 2018-06-06 11:47:26,416 - synapse.access.http.8123 - 97 - INFO - PUT-12044- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:47:26,417 - synapse.util.async - 291 - INFO - PUT-12044- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:47:26,615 - synapse.http.client - 100 - INFO - PUT-12044- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...MLc:t2bot.io
event_creator - 2018-06-06 11:48:26,616 - synapse.http.client - 122 - INFO - PUT-12044- Error sending request to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...MLc:t2bot.io: ResponseNeverReceived [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
event_creator - 2018-06-06 11:48:26,619 - synapse.util.async - 300 - INFO - PUT-12044- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:48:26,620 - synapse.http.server - 102 - ERROR - PUT-12044- Failed handle request via <function _async_render at 0x7f70d91c36e0>: <XForwardedForRequest at 0x7f70ab91b638 method=PUT uri=/_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io clientproto=HTTP/1.0 site=8123>: Traceback (most recent call last):
event_creator - 2018-06-06 11:48:26,622 - synapse.access.http.8123 - 131 - INFO - PUT-12044- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 60206ms (8ms, 0ms) (1ms/193ms/2) 55B 500 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

event_creator - 2018-06-06 11:48:30,710 - synapse.access.http.8123 - 97 - INFO - PUT-12060- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:48:30,711 - synapse.util.async - 291 - INFO - PUT-12060- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:48:30,972 - synapse.http.client - 100 - INFO - PUT-12060- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...yFb:t2bot.io
event_creator - 2018-06-06 11:49:30,974 - synapse.http.client - 122 - INFO - PUT-12060- Error sending request to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...yFb:t2bot.io: ResponseNeverReceived [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
event_creator - 2018-06-06 11:49:30,977 - synapse.util.async - 300 - INFO - PUT-12060- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:49:30,981 - synapse.http.server - 102 - ERROR - PUT-12060- Failed handle request via <function _async_render at 0x7f70d91c36e0>: <XForwardedForRequest at 0x7f70a9f67cf8 method=PUT uri=/_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io clientproto=HTTP/1.0 site=8123>: Traceback (most recent call last):
event_creator - 2018-06-06 11:49:30,982 - synapse.access.http.8123 - 131 - INFO - PUT-12060- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 60272ms (12ms, 0ms) (2ms/255ms/2) 55B 500 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

event_creator - 2018-06-06 11:49:39,072 - synapse.access.http.8123 - 97 - INFO - PUT-12070- 111.222.333.85 - 8123 - Received request: PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io
event_creator - 2018-06-06 11:49:39,073 - synapse.util.async - 291 - INFO - PUT-12070- Acquired uncontended limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:49:39,131 - synapse.http.client - 100 - INFO - PUT-12070- Sending request PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...mFi:t2bot.io
event_creator - 2018-06-06 11:50:38,135 - synapse.http.client - 115 - INFO - PUT-12070- Received response to  PUT http://127.0.0.1:9091/_synapse/replication/send_event/$152...mFi:t2bot.io: 200
event_creator - 2018-06-06 11:50:38,135 - synapse.util.async - 300 - INFO - PUT-12070- Releasing limiter lock for key '!pXkVSJ...:server.com'
event_creator - 2018-06-06 11:50:38,136 - synapse.access.http.8123 - 131 - INFO - PUT-12070- 111.222.333.85 - 8123 - {@_discord_324...619:t2bot.io} Processed request: 59064ms (4ms, 3ms) (1ms/49ms/2) 46B 200 "PUT /_matrix/client/r0/rooms/!pXkVSJ...%3Aserver.com/send/m.room.message/m1528285583650.0?access_token=<redacted>&user_id=%40_discord_324...619%3At2bot.io HTTP/1.0" "None"

Most importantly, the same transaction ID is used to send all 4 messages (m1528285583650.0).

The main process sees the following (requests correlated by event ID):

homeserver - 2018-06-06 11:46:23,921 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...dle:t2bot.io
homeserver - 2018-06-06 11:46:23,921 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...dle:t2bot.io], calculating new one
homeserver - 2018-06-06 11:46:23,921 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...dle:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:37,687 - synapse.http.server - 404 - WARNING - Not sending response to request <SynapseRequest at 0x7f5d86c94200 method=PUT uri=/_synapse/replication/send_event/$152...dle:t2bot.io clientproto=HTTP/1.1 site=9091>, already disconnected.
homeserver - 2018-06-06 11:50:37,687 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 253767ms (0ms, 0ms) (0ms/0ms/0) 0B 200 "PUT /_synapse/replication/send_event/$152...dle:t2bot.io HTTP/1.1" "Synapse/0.30.0"

homeserver - 2018-06-06 11:47:26,618 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...MLc:t2bot.io
homeserver - 2018-06-06 11:47:26,618 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...MLc:t2bot.io], calculating new one
homeserver - 2018-06-06 11:47:26,619 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...MLc:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:38,086 - synapse.http.server - 404 - WARNING - Not sending response to request <SynapseRequest at 0x7f5e7a806950 method=PUT uri=/_synapse/replication/send_event/$152...MLc:t2bot.io clientproto=HTTP/1.1 site=9091>, already disconnected.
homeserver - 2018-06-06 11:50:38,086 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 191468ms (0ms, 0ms) (0ms/0ms/0) 0B 200 "PUT /_synapse/replication/send_event/$152...MLc:t2bot.io HTTP/1.1" "Synapse/0.30.0"

homeserver - 2018-06-06 11:48:31,021 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...yFb:t2bot.io
homeserver - 2018-06-06 11:48:31,022 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...yFb:t2bot.io], calculating new one
homeserver - 2018-06-06 11:48:31,023 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...yFb:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:38,087 - synapse.http.server - 404 - WARNING - Not sending response to request <SynapseRequest at 0x7f5e6d868cb0 method=PUT uri=/_synapse/replication/send_event/$152...yFb:t2bot.io clientproto=HTTP/1.1 site=9091>, already disconnected.
homeserver - 2018-06-06 11:50:38,087 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 127066ms (0ms, 0ms) (0ms/0ms/0) 0B 200 "PUT /_synapse/replication/send_event/$152...yFb:t2bot.io HTTP/1.1" "Synapse/0.30.0"

homeserver - 2018-06-06 11:49:39,136 - synapse.access.http.9091 - 97 - INFO - 127.0.0.1 - 9091 - Received request: PUT /_synapse/replication/send_event/$152...mFi:t2bot.io
homeserver - 2018-06-06 11:49:39,136 - synapse.util.caches.response_cache - 150 - INFO - [send_event]: no cached result for [$152...mFi:t2bot.io], calculating new one
homeserver - 2018-06-06 11:49:39,138 - synapse.replication.http.send_event - 147 - INFO - Got event to send with ID: $152...mFi:t2bot.io into room: !pXkVSJ...:server.com
homeserver - 2018-06-06 11:50:38,088 - synapse.access.http.9091 - 131 - INFO - 127.0.0.1 - 9091 - {@_discord_324...619:t2bot.io} Processed request: 58952ms (0ms, 0ms) (0ms/0ms/0) 2B 200 "PUT /_synapse/replication/send_event/$152...mFi:t2bot.io HTTP/1.1" "Synapse/0.30.0"

As demonstrated, synapse did eventually finish sending each of the 4 events (with the last one completing fast enough that the client didn't retry). Note that the main process's response cache is keyed on the event ID, and the event creator builds a fresh event (with a new event ID) for each retry, so the existing cache never gets a chance to deduplicate them.

Possible solution

The timeout between the event creator and the main process could be removed, although that would mean other network failures could still cause duplicate events. Whether or not the timeout is removed, the event creator could send the transaction ID alongside the send_event call to the main process. The main process could then dedupe requests it has already seen and return the 'real' event ID it ended up sending. The event creator may need to be made aware that the event ID can change as part of the request.

Retries from the client should already be deduplicated, although the event creator may need to be smarter about which requests it considers 'in flight'. For instance, a timeout from the main process should not fail the whole request; instead, the event creator should retry on the client's behalf, even though the client may time out and try again independently.
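
As a minimal sketch of that dedupe idea (the names below are illustrative, not Synapse's actual replication API), the main process would remember which (user, transaction ID) pairs it has already seen, including ones still in flight, and hand back the event ID it actually persisted:

import asyncio
from typing import Awaitable, Callable, Dict, Tuple

# (user_id, txn_id) -> task resolving to the event ID that was persisted
_inflight: Dict[Tuple[str, str], "asyncio.Task[str]"] = {}

async def handle_send_event(
    user_id: str,
    txn_id: str,
    persist: Callable[[], Awaitable[str]],
) -> str:
    # Dedupe replicated send_event calls on (user, transaction ID): retries
    # attach to the in-flight persist (or its cached result) instead of
    # persisting a second copy of the event.
    key = (user_id, txn_id)
    task = _inflight.get(key)
    if task is None:
        task = asyncio.ensure_future(persist())
        _inflight[key] = task
    # The 'real' event ID is returned even to retries, so the event creator
    # has to tolerate getting back a different event ID than the one it built.
    return await task

A real implementation would also need to expire old entries and cope with the map being per-worker rather than shared.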

@turt2live (Member, Author)

The "high system load" is associated with spikes like this (same timeframe as the logs above):
(image: graph of the load spikes over this period)

What ends up causing this particular failure mode is that inserts into the events table can suddenly take a very long time, so queries stack up. This causes the entire server to grind to a halt until the database lets queries through again. I'm not totally sure why postgres freezes up like this, but I suspect it's related to rebuilding indexes?

In any case, the duplication really shouldn't be happening.

@neilisfragile added the p1 and z-minor (Deprecated Label) labels on Jun 22, 2018
@neilisfragile (Contributor)

Thanks @turt2live, agreed that we should be sending an identifier to dedupe on.

@hugecheese

Bump, the same issue occurred again today on Riot.

@jryans (Contributor) commented Nov 15, 2019

@neilisfragile Any chance of working on this in the near future...? It does make us all look a bit silly when messages get duplicated because of this, so it would be nice to fix it and reassure everyone that we do have proper message transactions.

@richvdh (Member) commented Nov 18, 2019

related (in that it shares a similar root cause): #6316

@ara4n (Member) commented Feb 19, 2020

Mozilla just got bitten embarrassingly by this thanks to the DB temporarily disappearing from under Synapse - it makes an outage look even worse than it already was.

@richvdh (Member) commented Aug 13, 2020

some initial questions which might help point towards a solution here:

  • how does the existing deduplication (based on the txn_id in a /rooms/<room_id>/send/<type>/<txn_id> request) work? I couldn't immediately find any code for this.
  • Assuming such code exists, does it actually work at all in the face of multiple event_creator instances?

One potential solution to this is to store a map in the event_creator from txn_id to the created event object. Subsequent attempts to send with the same txn_id could use the same event object, resulting in the same event id, which the event_persister can then deduplicate? However, I'm not sure how that would fit with multiple event_creators, hence the second question.
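
A minimal sketch of that map, with purely illustrative names (this is not Synapse's actual event-creation code):

from typing import Awaitable, Callable, Dict, Tuple

# (requester, txn_id) -> the event object already built for this transaction
_txn_events: Dict[Tuple[str, str], object] = {}

async def create_event_for_txn(
    requester: str,
    txn_id: str,
    build_event: Callable[[], Awaitable[object]],
) -> object:
    # Return the same event object (and hence the same event ID) for every
    # retry of a given client transaction.
    key = (requester, txn_id)
    event = _txn_events.get(key)
    if event is None:
        event = await build_event()
        _txn_events[key] = event
    return event

With the event ID stable across retries, the main process's existing response cache (keyed on event ID, per the logs above) would take care of the rest, though as noted this only helps if all retries land on the same event_creator or the map is shared between instances.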

We should be careful that any solution here does not conflict with plans to support multiple event persisters.

@turt2live (Member, Author) commented Aug 13, 2020

I'm a little worried that if this doesn't get fixed soonish then it'll never get fixed. The scope keeps expanding as time passes because the issue hasn't been addressed, meaning any fix now has to consider things it wouldn't have had to before.

@richvdh (Member) commented Aug 14, 2020

On the contrary: I think that as we increase parallelism it's going to be more important than ever that this gets fixed. I've bumped it right up the backend team's priority list.

@reivilibre self-assigned this on Aug 24, 2020
@reivilibre (Contributor)

Quoting @richvdh: "We should be careful that any solution here does not conflict with plans to support multiple event persisters."

Erik has said that a given room will always be handled by the same event persister, so hopefully there's scope to do something there.

I'm going to pick this up and see where I can take it :).

@reivilibre (Contributor)

Writing a workerised test seemed too non-trivial, so I have taken to reproducing this manually with curl.

By adding a random (I used 50% chance) sleep to send_event._handle_request, I can simulate a slow event persister.
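
Something along these lines, as a sketch of the idea only (the helper name and the exact placement inside _handle_request are assumptions, not the actual patch):

import random

SLOW_CHANCE = 0.5        # roughly half of replication requests get delayed
SLOW_DELAY_SECONDS = 90  # comfortably longer than the 60s replication timeout

async def maybe_simulate_slow_persist(sleep):
    # Called at the top of the replication send_event handler; `sleep` is any
    # async sleep (e.g. the homeserver clock's sleep, or a deferLater wrapper).
    if random.random() < SLOW_CHANCE:
        await sleep(SLOW_DELAY_SECONDS)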

Then I used this curl command:

curl -i --request PUT -H "Content-Type: application/json" -H 'Authorization: Bearer MDAxMmxvY2F0aW9uIHN5blcKMDAxM2lkZW50aWZpZXIga2V5CjAwMTBjaWQgZ2VuID0gMQowMDFiY2lkIHVzZXJfaWQgPSBAdTE6c3luVwowMDE2Y2lkIHR5cGUgPSBhY2Nlc3MKMDAyMWNpZCBub25jZSA9IFE7RGtTLHgjJjFFcU5jQ2UKMDAyZnNpZ25hdHVyZSANYDw3YTyeXtytKUJPWC3t948Au1evgKuN48VCyP6xDQo' -A 'A/1.0' -d '@-' 'https://synw:8448/_matrix/client/r0/rooms/!dfVRmuOniuCbXmCbSq:synW/send/m.room.message/m1598345463534.2' <<EOF
{"msgtype":"m.text","body":"try it now"}
EOF

(substituting in a fresh transaction ID)

Once you hit the sleep and it finally returns with a 500, retrying the same curl command causes a message duplication.

Further repeats following a non-slowed send will be deduplicated by transaction ID appropriately. So something is doing something somewhere.

Will look into this further tomorrow.

@reivilibre (Contributor) commented Sep 7, 2020

It's looking likely that I'll have to drop this, because I don't know if I will get a chance to properly take it on before I evaporate for the summer.

Nevertheless, I'll save someone else the digging and offer some notes:

  • Event objects have an event.internal_metadata.txn_id field which appears to correspond to the transaction ID used when the event is created via the client-server API.
    • This field doesn't appear to be used for anything itself (apart from some serialisation stuff).
  • The current deduplication is handled at the client-server API REST layer — see HttpTransactionCache and fetch_or_execute_request.
    • When the replication request to the event persister times out, the CS API request errors, so the client tries again.

The solution thus seems to be that we need to use event.internal_metadata.txn_id in the event persister (re-use/cache the result of ReplicationSendEventRestServlet._handle_request).
Note that the key to deduplicate on is not only the transaction ID but the pair of (transaction ID, user access token) — this is crucial because otherwise someone unfriendly (or just plain bad luck) could consume transaction IDs and cause other users' event sends to fail for no apparent reason.

(I notice there is an event.internal_metadata.token_id — wonder if this is the ID of the user's access token. If so, you could use that as part of the deduplication key!!).
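
A hedged sketch of a dedupe keyed on those two fields (the internal_metadata attribute names come from the notes above; the handler shape and cache are illustrative, not Synapse's real code):

# (token_id, txn_id) -> event ID already persisted for that client transaction
_recent_sends: dict = {}

async def persist_with_dedupe(event, persist):
    token_id = getattr(event.internal_metadata, "token_id", None)
    txn_id = getattr(event.internal_metadata, "txn_id", None)
    if token_id is None or txn_id is None:
        # No client transaction to dedupe on (e.g. not a client-initiated send).
        return await persist(event)

    key = (token_id, txn_id)
    if key in _recent_sends:
        return _recent_sends[key]

    event_id = await persist(event)
    _recent_sends[key] = event_id
    return event_id

Entries would need sweeping out after some retention period, which is where machinery like the caches mentioned below could come in.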

ResponseCache seemed like an interesting piece of machinery to use (the scheduling of timeouts might not be right for the job, though — I would imagine you rather want to sweep old transactions in batches rather than schedule a call later for each individual one…). I also later discovered HttpTransactionCache — it may be possible to reuse this machinery.

PS N.B. Each room is only handled by one event persister, so it is 'safe' to do in-process caching of transaction IDs.

@reivilibre removed their assignment on Sep 7, 2020
@richvdh (Member) commented Sep 25, 2020

This is likely to be changed, if not improved, by #8400.

richvdh added a commit that referenced this issue Sep 29, 2020
One hope is that this might provide some insights into #3365.