-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Drop federation transmission queues during a significant remote outage. #7864
Drop federation transmission queues during a significant remote outage. #7864
Conversation
Fixes #7828. Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
This sounds like you would drop transactions on a intermittent failure too though? E.g. if s server is upgraded, or network troubles, or you are matrix.org. Emptying a transaction too soon might cause even more netsplits? |
I agree with @Half-Shot here, this sounds a bit too overzealous. Another concern: While servers can easily backfill missing PDUs in a room, is the same the case for missed EDUs? If we randomly stop sending EDUs to other servers, this is going to result in messages which are unable to decrypt, because the to_device messages get lost, right? |
Well, we only do this after trying to send a given transaction 11 times (those retries happen at a lower level, in the
In the general case, no. However, the EDUs that are being dropped here are those that we can afford to drop (specifically, only typing notifications are being dropped here). Other EDUs such as to_device messages are queued with a different mechanism. (Indeed, if the only record we had of to_device messages was these in-memory queues, we'd have big problems whenever the federation sender process got restarted.) |
Ah, that's what I was missing then. I thought this was a drop immediately after a failure. 20 minutes sounds very reasonable. |
FTR I don't think we can assume that intermittent failures will lead to timeouts, if synapse dies or is down for maintenance then we'll likely get a connection refused response very quickly, which I think takes the total time down to sub 10 minutes. That is easily plausible for a deploy-gone-wrong or down-for-brief maintenance; jki.re certainly sometimes goes down for that long semi regularly. I'm a little bit hesitant with being this aggressive in general, and certainly wouldn't want to merge this until we have landed #2528 as this will just make that issue so much worse. Personally I'd probably look at the |
What if the queues were just trimmed down to the latest 10 or so events? |
Signed-off-by: Olivier Wilkinson (reivilibre) <olivier@librepush.net>
I think I agree that it seems like this should only be put in after a solution to #2528 (which will also solve the case of losing touch with other servers after a 'natural' loss of transmission queues because of an outage, upgrade or other restart). However, does this seem reasonable now, assuming that mechanism arrives? |
changelog.d/7864.bugfix
Outdated
@@ -0,0 +1 @@ | |||
Fix in-memory federation transmission queues building up indefinitely. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this mean to an end-user? Is there a behavior that is fixed with this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about
Stop leaking memory when remote servers have been unreachable for a long time.
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe:
Fix a memory leak by limiting the number of messages queued for a remote server that has been unreachable for a long time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the code lgtm, but as @clokep says the changelog could do with clearer words.
I'd kinda like us to get this merged rather than have it hanging around for weeks in purgatory. @erikjohnston are you strongly opposed to it? |
This could likely use a merge of develop at this point too, just to ensure churn hasn't caused issues. |
This seems fine to me |
…_offline_txn_queues_fix
…ault_templates * 'develop' of github.com:matrix-org/synapse: (57 commits) Remove a space at the start of a changelog entry. More changelog tweaks More changelog tweaks Remove unwanted changelog line 1.19.0rc1 Run `remove_push_actions_from_staging` in foreground (#8081) Drop federation transmission queues during a significant remote outage. (#7864) Fix unawaited coroutine error in tests. (#8072) Convert the roommember database to async/await. (#8070) Convert devices database to async/await. (#8069) Add type hints to handlers.message and events.builder (#8067) Convert account data, device inbox, and censor events databases to async/await (#8063) Convert appservice, group server, profile and more databases to async (#8066) Fix typing for notifier (#8064) Convert tags and metrics databases to async/await (#8062) Converts event_federation and registration databases to async/await (#8061) Add comment explaining cast Update changelog.d/8051.misc Add typing info to Notifier (#8058) Handle optional dependencies for Oidc and Saml ...
…sword_reset_confirmation * 'develop' of github.com:matrix-org/synapse: (85 commits) Use the default templates when a custom template file cannot be found (#8037) Changelog changes 1.19.0 Convert stream database to async/await. (#8074) Add a shadow-banned flag to users. (#8092) Convert pusher databases to async/await. (#8075) Convert receipts and events databases to async/await. (#8076) Add type hints to synapse.handlers.room (#8090) Remove some unused database functions. (#8085) Convert misc database code to async (#8087) Remove a space at the start of a changelog entry. More changelog tweaks More changelog tweaks Remove unwanted changelog line 1.19.0rc1 Run `remove_push_actions_from_staging` in foreground (#8081) Drop federation transmission queues during a significant remote outage. (#7864) Fix unawaited coroutine error in tests. (#8072) Convert the roommember database to async/await. (#8070) Convert devices database to async/await. (#8069) ...
Synapse 1.20.0rc1 (2020-09-08) ============================== Removal warning --------------- Some older clients used a [disallowed character](https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-register-email-requesttoken) (`:`) in the `client_secret` parameter of various endpoints. The incorrect behaviour was allowed for backwards compatibility, but is now being removed from Synapse as most users have updated their client. Further context can be found at [\#6766](#6766). Features -------- - Add an endpoint to query your shared rooms with another user as an implementation of [MSC2666](matrix-org/matrix-spec-proposals#2666). ([\#7785](#7785)) - Iteratively encode JSON to avoid blocking the reactor. ([\#8013](#8013), [\#8116](#8116)) - Add support for shadow-banning users (ignoring any message send requests). ([\#8034](#8034), [\#8092](#8092), [\#8095](#8095), [\#8142](#8142), [\#8152](#8152), [\#8157](#8157), [\#8158](#8158), [\#8176](#8176)) - Use the default template file when its equivalent is not found in a custom template directory. ([\#8037](#8037), [\#8107](#8107), [\#8252](#8252)) - Add unread messages count to sync responses, as specified in [MSC2654](matrix-org/matrix-spec-proposals#2654). ([\#8059](#8059), [\#8254](#8254), [\#8270](#8270), [\#8274](#8274)) - Optimise `/federation/v1/user/devices/` API by only returning devices with encryption keys. ([\#8198](#8198)) Bugfixes -------- - Fix a memory leak by limiting the length of time that messages will be queued for a remote server that has been unreachable. ([\#7864](#7864)) - Fix `Re-starting finished log context PUT-nnnn` warning when event persistence failed. ([\#8081](#8081)) - Synapse now correctly enforces the valid characters in the `client_secret` parameter used in various endpoints. ([\#8101](#8101)) - Fix a bug introduced in v1.7.2 impacting message retention policies that would allow federated homeservers to dictate a retention period that's lower than the configured minimum allowed duration in the configuration file. ([\#8104](#8104)) - Fix a long-standing bug where invalid JSON would be accepted by Synapse. ([\#8106](#8106)) - Fix a bug introduced in Synapse v1.12.0 which could cause `/sync` requests to fail with a 404 if you had a very old outstanding room invite. ([\#8110](#8110)) - Return a proper error code when the rooms of an invalid group are requested. ([\#8129](#8129)) - Fix a bug which could cause a leaked postgres connection if synapse was set to daemonize. ([\#8131](#8131)) - Clarify the error code if a user tries to register with a numeric ID. This bug was introduced in v1.15.0. ([\#8135](#8135)) - Fix a bug where appservices with ratelimiting disabled would still be ratelimited when joining rooms. This bug was introduced in v1.19.0. ([\#8139](#8139)) - Fix logging in via OpenID Connect with a provider that uses integer user IDs. ([\#8190](#8190)) - Fix a longstanding bug where user directory updates could break when unexpected profile data was included in events. ([\#8223](#8223)) - Fix a longstanding bug where stats updates could break when unexpected profile data was included in events. ([\#8226](#8226)) - Fix slow start times for large servers by removing a table scan of the `users` table from startup code. ([\#8271](#8271)) Updates to the Docker image --------------------------- - Fix builds of the Docker image on non-x86 platforms. ([\#8144](#8144)) - Added curl for healthcheck support and readme updates for the change. Contributed by @maquis196. ([\#8147](#8147)) Improved Documentation ---------------------- - Link to matrix-synapse-rest-password-provider in the password provider documentation. ([\#8111](#8111)) - Updated documentation to note that Synapse does not follow `HTTP 308` redirects due to an upstream library not supporting them. Contributed by Ryan Cole. ([\#8120](#8120)) - Explain better what GDPR-erased means when deactivating a user. ([\#8189](#8189)) Internal Changes ---------------- - Add filter `name` to the `/users` admin API, which filters by user ID or displayname. Contributed by Awesome Technologies Innovationslabor GmbH. ([\#7377](#7377), [\#8163](#8163)) - Reduce run times of some unit tests by advancing the reactor a fewer number of times. ([\#7757](#7757)) - Don't fail `/submit_token` requests on incorrect session ID if `request_token_inhibit_3pid_errors` is turned on. ([\#7991](#7991)) - Convert various parts of the codebase to async/await. ([\#8071](#8071), [\#8072](#8072), [\#8074](#8074), [\#8075](#8075), [\#8076](#8076), [\#8087](#8087), [\#8100](#8100), [\#8119](#8119), [\#8121](#8121), [\#8133](#8133), [\#8156](#8156), [\#8162](#8162), [\#8166](#8166), [\#8168](#8168), [\#8173](#8173), [\#8191](#8191), [\#8192](#8192), [\#8193](#8193), [\#8194](#8194), [\#8195](#8195), [\#8197](#8197), [\#8199](#8199), [\#8200](#8200), [\#8201](#8201), [\#8202](#8202), [\#8207](#8207), [\#8213](#8213), [\#8214](#8214)) - Remove some unused database functions. ([\#8085](#8085)) - Add type hints to various parts of the codebase. ([\#8090](#8090), [\#8127](#8127), [\#8187](#8187), [\#8241](#8241), [\#8140](#8140), [\#8183](#8183), [\#8232](#8232), [\#8235](#8235), [\#8237](#8237), [\#8244](#8244)) - Return the previous stream token if a non-member event is a duplicate. ([\#8093](#8093), [\#8112](#8112)) - Separate `get_current_token` into two since there are two different use cases for it. ([\#8113](#8113)) - Remove `ChainedIdGenerator`. ([\#8123](#8123)) - Reduce the amount of whitespace in JSON stored and sent in responses. ([\#8124](#8124)) - Update the test federation client to handle streaming responses. ([\#8130](#8130)) - Micro-optimisations to `get_auth_chain_ids`. ([\#8132](#8132)) - Refactor `StreamIdGenerator` and `MultiWriterIdGenerator` to have the same interface. ([\#8161](#8161)) - Add functions to `MultiWriterIdGen` used by events stream. ([\#8164](#8164), [\#8179](#8179)) - Fix tests that were broken due to the merge of 1.19.1. ([\#8167](#8167)) - Make `SlavedIdTracker.advance` have the same interface as `MultiWriterIDGenerator`. ([\#8171](#8171)) - Remove unused `is_guest` parameter from, and add safeguard to, `MessageHandler.get_room_data`. ([\#8174](#8174), [\#8181](#8181)) - Standardize the mypy configuration. ([\#8175](#8175)) - Refactor some of `LoginRestServlet`'s helper methods, and move them to `AuthHandler` for easier reuse. ([\#8182](#8182)) - Fix `wait_for_stream_position` to allow multiple waiters on same stream ID. ([\#8196](#8196)) - Make `MultiWriterIDGenerator` work for streams that use negative values. ([\#8203](#8203)) - Refactor queries for device keys and cross-signatures. ([\#8204](#8204), [\#8205](#8205), [\#8222](#8222), [\#8224](#8224), [\#8225](#8225), [\#8231](#8231), [\#8233](#8233), [\#8234](#8234)) - Fix type hints for functions decorated with `@cached`. ([\#8240](#8240)) - Remove obsolete `order` field from federation send queues. ([\#8245](#8245)) - Stop sub-classing from object. ([\#8249](#8249)) - Add more logging to debug slow startup. ([\#8264](#8264)) - Do not attempt to upgrade database schema on worker processes. ([\#8266](#8266), [\#8276](#8276))
Synapse 1.20.0 (2020-09-22) =========================== No significant changes since v1.20.0rc5. Removal warning --------------- Historically, the [Synapse Admin API](https://github.com/matrix-org/synapse/tree/master/docs) has been accessible under the `/_matrix/client/api/v1/admin`, `/_matrix/client/unstable/admin`, `/_matrix/client/r0/admin` and `/_synapse/admin` prefixes. In a future release, we will be dropping support for accessing Synapse's Admin API using the `/_matrix/client/*` prefixes. This makes it easier for homeserver admins to lock down external access to the Admin API endpoints. Synapse 1.20.0rc5 (2020-09-18) ============================== In addition to the below, Synapse 1.20.0rc5 also includes the bug fix that was included in 1.19.3. Features -------- - Add flags to the `/versions` endpoint for whether new rooms default to using E2EE. ([\#8343](matrix-org/synapse#8343)) Bugfixes -------- - Fix rate limiting of federation `/send` requests. ([\#8342](matrix-org/synapse#8342)) - Fix a longstanding bug where back pagination over federation could get stuck if it failed to handle a received event. ([\#8349](matrix-org/synapse#8349)) Internal Changes ---------------- - Blacklist [MSC2753](matrix-org/matrix-spec-proposals#2753) SyTests until it is implemented. ([\#8285](matrix-org/synapse#8285)) Synapse 1.20.0rc4 (2020-09-16) ============================== Synapse 1.20.0rc4 is identical to 1.20.0rc3, with the addition of the security fix that was included in 1.19.2. Synapse 1.20.0rc3 (2020-09-11) ============================== Bugfixes -------- - Fix a bug introduced in v1.20.0rc1 where the wrong exception was raised when invalid JSON data is encountered. ([\#8291](matrix-org/synapse#8291)) Synapse 1.20.0rc2 (2020-09-09) ============================== Bugfixes -------- - Fix a bug introduced in v1.20.0rc1 causing some features related to notifications to misbehave following the implementation of unread counts. ([\#8280](matrix-org/synapse#8280)) Synapse 1.20.0rc1 (2020-09-08) ============================== Removal warning --------------- Some older clients used a [disallowed character](https://matrix.org/docs/spec/client_server/r0.6.1#post-matrix-client-r0-register-email-requesttoken) (`:`) in the `client_secret` parameter of various endpoints. The incorrect behaviour was allowed for backwards compatibility, but is now being removed from Synapse as most users have updated their client. Further context can be found at [\#6766](matrix-org/synapse#6766). Features -------- - Add an endpoint to query your shared rooms with another user as an implementation of [MSC2666](matrix-org/matrix-spec-proposals#2666). ([\#7785](matrix-org/synapse#7785)) - Iteratively encode JSON to avoid blocking the reactor. ([\#8013](matrix-org/synapse#8013), [\#8116](matrix-org/synapse#8116)) - Add support for shadow-banning users (ignoring any message send requests). ([\#8034](matrix-org/synapse#8034), [\#8092](matrix-org/synapse#8092), [\#8095](matrix-org/synapse#8095), [\#8142](matrix-org/synapse#8142), [\#8152](matrix-org/synapse#8152), [\#8157](matrix-org/synapse#8157), [\#8158](matrix-org/synapse#8158), [\#8176](matrix-org/synapse#8176)) - Use the default template file when its equivalent is not found in a custom template directory. ([\#8037](matrix-org/synapse#8037), [\#8107](matrix-org/synapse#8107), [\#8252](matrix-org/synapse#8252)) - Add unread messages count to sync responses, as specified in [MSC2654](matrix-org/matrix-spec-proposals#2654). ([\#8059](matrix-org/synapse#8059), [\#8254](matrix-org/synapse#8254), [\#8270](matrix-org/synapse#8270), [\#8274](matrix-org/synapse#8274)) - Optimise `/federation/v1/user/devices/` API by only returning devices with encryption keys. ([\#8198](matrix-org/synapse#8198)) Bugfixes -------- - Fix a memory leak by limiting the length of time that messages will be queued for a remote server that has been unreachable. ([\#7864](matrix-org/synapse#7864)) - Fix `Re-starting finished log context PUT-nnnn` warning when event persistence failed. ([\#8081](matrix-org/synapse#8081)) - Synapse now correctly enforces the valid characters in the `client_secret` parameter used in various endpoints. ([\#8101](matrix-org/synapse#8101)) - Fix a bug introduced in v1.7.2 impacting message retention policies that would allow federated homeservers to dictate a retention period that's lower than the configured minimum allowed duration in the configuration file. ([\#8104](matrix-org/synapse#8104)) - Fix a long-standing bug where invalid JSON would be accepted by Synapse. ([\#8106](matrix-org/synapse#8106)) - Fix a bug introduced in Synapse v1.12.0 which could cause `/sync` requests to fail with a 404 if you had a very old outstanding room invite. ([\#8110](matrix-org/synapse#8110)) - Return a proper error code when the rooms of an invalid group are requested. ([\#8129](matrix-org/synapse#8129)) - Fix a bug which could cause a leaked postgres connection if synapse was set to daemonize. ([\#8131](matrix-org/synapse#8131)) - Clarify the error code if a user tries to register with a numeric ID. This bug was introduced in v1.15.0. ([\#8135](matrix-org/synapse#8135)) - Fix a bug where appservices with ratelimiting disabled would still be ratelimited when joining rooms. This bug was introduced in v1.19.0. ([\#8139](matrix-org/synapse#8139)) - Fix logging in via OpenID Connect with a provider that uses integer user IDs. ([\#8190](matrix-org/synapse#8190)) - Fix a longstanding bug where user directory updates could break when unexpected profile data was included in events. ([\#8223](matrix-org/synapse#8223)) - Fix a longstanding bug where stats updates could break when unexpected profile data was included in events. ([\#8226](matrix-org/synapse#8226)) - Fix slow start times for large servers by removing a table scan of the `users` table from startup code. ([\#8271](matrix-org/synapse#8271)) Updates to the Docker image --------------------------- - Fix builds of the Docker image on non-x86 platforms. ([\#8144](matrix-org/synapse#8144)) - Added curl for healthcheck support and readme updates for the change. Contributed by @maquis196. ([\#8147](matrix-org/synapse#8147)) Improved Documentation ---------------------- - Link to matrix-synapse-rest-password-provider in the password provider documentation. ([\#8111](matrix-org/synapse#8111)) - Updated documentation to note that Synapse does not follow `HTTP 308` redirects due to an upstream library not supporting them. Contributed by Ryan Cole. ([\#8120](matrix-org/synapse#8120)) - Explain better what GDPR-erased means when deactivating a user. ([\#8189](matrix-org/synapse#8189)) Internal Changes ---------------- - Add filter `name` to the `/users` admin API, which filters by user ID or displayname. Contributed by Awesome Technologies Innovationslabor GmbH. ([\#7377](matrix-org/synapse#7377), [\#8163](matrix-org/synapse#8163)) - Reduce run times of some unit tests by advancing the reactor a fewer number of times. ([\#7757](matrix-org/synapse#7757)) - Don't fail `/submit_token` requests on incorrect session ID if `request_token_inhibit_3pid_errors` is turned on. ([\#7991](matrix-org/synapse#7991)) - Convert various parts of the codebase to async/await. ([\#8071](matrix-org/synapse#8071), [\#8072](matrix-org/synapse#8072), [\#8074](matrix-org/synapse#8074), [\#8075](matrix-org/synapse#8075), [\#8076](matrix-org/synapse#8076), [\#8087](matrix-org/synapse#8087), [\#8100](matrix-org/synapse#8100), [\#8119](matrix-org/synapse#8119), [\#8121](matrix-org/synapse#8121), [\#8133](matrix-org/synapse#8133), [\#8156](matrix-org/synapse#8156), [\#8162](matrix-org/synapse#8162), [\#8166](matrix-org/synapse#8166), [\#8168](matrix-org/synapse#8168), [\#8173](matrix-org/synapse#8173), [\#8191](matrix-org/synapse#8191), [\#8192](matrix-org/synapse#8192), [\#8193](matrix-org/synapse#8193), [\#8194](matrix-org/synapse#8194), [\#8195](matrix-org/synapse#8195), [\#8197](matrix-org/synapse#8197), [\#8199](matrix-org/synapse#8199), [\#8200](matrix-org/synapse#8200), [\#8201](matrix-org/synapse#8201), [\#8202](matrix-org/synapse#8202), [\#8207](matrix-org/synapse#8207), [\#8213](matrix-org/synapse#8213), [\#8214](matrix-org/synapse#8214)) - Remove some unused database functions. ([\#8085](matrix-org/synapse#8085)) - Add type hints to various parts of the codebase. ([\#8090](matrix-org/synapse#8090), [\#8127](matrix-org/synapse#8127), [\#8187](matrix-org/synapse#8187), [\#8241](matrix-org/synapse#8241), [\#8140](matrix-org/synapse#8140), [\#8183](matrix-org/synapse#8183), [\#8232](matrix-org/synapse#8232), [\#8235](matrix-org/synapse#8235), [\#8237](matrix-org/synapse#8237), [\#8244](matrix-org/synapse#8244)) - Return the previous stream token if a non-member event is a duplicate. ([\#8093](matrix-org/synapse#8093), [\#8112](matrix-org/synapse#8112)) - Separate `get_current_token` into two since there are two different use cases for it. ([\#8113](matrix-org/synapse#8113)) - Remove `ChainedIdGenerator`. ([\#8123](matrix-org/synapse#8123)) - Reduce the amount of whitespace in JSON stored and sent in responses. ([\#8124](matrix-org/synapse#8124)) - Update the test federation client to handle streaming responses. ([\#8130](matrix-org/synapse#8130)) - Micro-optimisations to `get_auth_chain_ids`. ([\#8132](matrix-org/synapse#8132)) - Refactor `StreamIdGenerator` and `MultiWriterIdGenerator` to have the same interface. ([\#8161](matrix-org/synapse#8161)) - Add functions to `MultiWriterIdGen` used by events stream. ([\#8164](matrix-org/synapse#8164), [\#8179](matrix-org/synapse#8179)) - Fix tests that were broken due to the merge of 1.19.1. ([\#8167](matrix-org/synapse#8167)) - Make `SlavedIdTracker.advance` have the same interface as `MultiWriterIDGenerator`. ([\#8171](matrix-org/synapse#8171)) - Remove unused `is_guest` parameter from, and add safeguard to, `MessageHandler.get_room_data`. ([\#8174](matrix-org/synapse#8174), [\#8181](matrix-org/synapse#8181)) - Standardize the mypy configuration. ([\#8175](matrix-org/synapse#8175)) - Refactor some of `LoginRestServlet`'s helper methods, and move them to `AuthHandler` for easier reuse. ([\#8182](matrix-org/synapse#8182)) - Fix `wait_for_stream_position` to allow multiple waiters on same stream ID. ([\#8196](matrix-org/synapse#8196)) - Make `MultiWriterIDGenerator` work for streams that use negative values. ([\#8203](matrix-org/synapse#8203)) - Refactor queries for device keys and cross-signatures. ([\#8204](matrix-org/synapse#8204), [\#8205](matrix-org/synapse#8205), [\#8222](matrix-org/synapse#8222), [\#8224](matrix-org/synapse#8224), [\#8225](matrix-org/synapse#8225), [\#8231](matrix-org/synapse#8231), [\#8233](matrix-org/synapse#8233), [\#8234](matrix-org/synapse#8234)) - Fix type hints for functions decorated with `@cached`. ([\#8240](matrix-org/synapse#8240)) - Remove obsolete `order` field from federation send queues. ([\#8245](matrix-org/synapse#8245)) - Stop sub-classing from object. ([\#8249](matrix-org/synapse#8249)) - Add more logging to debug slow startup. ([\#8264](matrix-org/synapse#8264)) - Do not attempt to upgrade database schema on worker processes. ([\#8266](matrix-org/synapse#8266), [\#8276](matrix-org/synapse#8276))
* commit 'ff0e89465': Drop federation transmission queues during a significant remote outage. (#7864)
Fixes #7828. Depends on #2528.
Signed-off-by: Olivier Wilkinson (reivilibre) olivier@librepush.net
Pull Request Checklist
EventStore
toEventWorkerStore
.".code blocks
.