Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Refactor MatrixFederationAgent to retry SRV. #5864

Merged
merged 10 commits into from
Aug 27, 2019

Conversation

erikjohnston
Copy link
Member

@erikjohnston erikjohnston commented Aug 15, 2019

This refactors MatrixFederationAgent to move the SRV lookup into the
endpoint code, this has two benefits:

  1. Its easier to retry different host/ports in the same way as HostnameEndpoint.
  2. We avoid SRV lookups if we have a free connection in the pool

@erikjohnston erikjohnston force-pushed the erikj/reliable_lookups branch from ed3b927 to 193beb5 Compare August 15, 2019 14:33
if tls_options is not None:
ep = wrapClientTLS(tls_options, ep)
return ep
with PreserveLoggingContext():
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I use make_deferred_yieldable here I get lots of failures due to log contexts having been changed, even when removing all log context logic from the custom endpoint and the SRV resolver

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so this comes down to a mixup between the two types of function: (A) "follows synapse logcontext rules" vs (B) "ignores logcontext". Currently MatrixHostnameEndpoint.connect is a type-A function, whereas the thing that calls it (the depths of Twisted) is a type-B function. The TL;DR is that this doesn't work (in more detail: MatrixHostnameEndpoint.connect() returns a Deferred which reinstates the logcontext when it completes, but the Twisted code which calls it isn't expecting that, so the logcontext ends up ending up in places it shouldn't, and everything falls apart).

So one solution is to make MatrixHostnameEndpoint.connect a type-B function instead, which would work at runtime, but you would get errors from the patched inlineCallbacks which is installed by the unit tests to check that all the synapse functions are law-abiding type-A functions.

So a solution to this would be to make connect a thin (type-B) shim which calls a type-A _do_connect (or whatever) method which uses inlineCallbacks. Thus:

class MatrixHostnameEndpoint(object):

    def connect(self, protocol_factory):
        return run_in_background(self._do_connect, protocol_factory)

    @defer.inlineCallbacks
    def _do_connect(self, protocol_factory):
        # ... etc

You should then be able to use make_deferred_yieldable here instead of dropping the logcontext. The nice thing about that is that the logcontext will be then in place for the whole of _do_connect, so everything that endpoint.connect and the srv resolver log will end up with the right request id.

It'll probably still fail, but try it and let's have a look.

This refactors MatrixFederationAgent to move the SRV lookup into the
endpoint code, this has two benefits:
	1. Its easier to retry different host/ports in the same way as
	   HostnameEndpoint.
	2. We avoid SRV lookups if we have a free connection in the pool
@erikjohnston erikjohnston force-pushed the erikj/reliable_lookups branch from 193beb5 to c03e3e8 Compare August 15, 2019 14:43
@erikjohnston erikjohnston requested a review from a team August 15, 2019 14:58
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial thoughts

changelog.d/5864.misc Outdated Show resolved Hide resolved

while servers:
total_weight = sum(s.weight for s in servers)
target_weight = random.randint(0, total_weight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's an off-by-one error here which favours the first entry in the list. Given two servers with weights (10, 10), we should choose the first server for randoms in the range [1,10] and the second for randoms in the range [11, 20]. Currently we additionally choose the first server for random 0.

Suggested change
target_weight = random.randint(0, total_weight)
target_weight = random.randint(1, total_weight)

(or switch to a 0-based calculation with randrange and change line 106)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out this algorithm is the same as the one in the RFC2782, so I've documented it as such.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the off-by-one error is still present?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately its complicated to change the algorithm to work with zero weights, as eventually you end up trying to do randint(1, 0) and everything blows up. I don't know whether we want to use a different algorithm which is correctly weighted, or just accept the off by one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, the RFC2782 algorithm has the off-by-one error? that's... bonkers.

To be clear: if you have two servers, each with a weight of 1, we'll currently use one twice as often as the other.

can we just special-case total_weight == 0? (always return the first server, or something)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quite. So the RFC says that its valid to have multiple servers that have a weight of zero, which means:

In the presence of records containing weights greater than 0, records with weight 0 should have a very small chance of being selected.

We could just always order weights of zero last, and shuffle them randomly if there are multiple of them? That I think should work, though ignores that section of RFC a bit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could just always order weights of zero last, and shuffle them randomly if there are multiple of them? That I think should work, though ignores that section of RFC a bit

it doesn't matter if we put them at the start or the end: a server with a weight of 0 has a 0% chance of being used in a sane algorithm. Perhaps we should do randint(0, total_weight) if there are any servers with a zero weight, and randint(1, total_weight) otherwise?

I guess it doesn't really matter, since I doubt anyone is really using SRV records to load-balance servers...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that synapse will sequentially try each server on the list until it finds one it can connect to, so putting the zero weights at the end of the returned list means that they will get tried last (modulo priority).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ohhh, right. Fair enough. sgtm then.

synapse/http/federation/srv_resolver.py Outdated Show resolved Hide resolved
synapse/http/federation/srv_resolver.py Outdated Show resolved Hide resolved
@erikjohnston erikjohnston requested a review from richvdh August 20, 2019 11:39
synapse/http/federation/matrix_federation_agent.py Outdated Show resolved Hide resolved
if tls_options is not None:
ep = wrapClientTLS(tls_options, ep)
return ep
with PreserveLoggingContext():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so this comes down to a mixup between the two types of function: (A) "follows synapse logcontext rules" vs (B) "ignores logcontext". Currently MatrixHostnameEndpoint.connect is a type-A function, whereas the thing that calls it (the depths of Twisted) is a type-B function. The TL;DR is that this doesn't work (in more detail: MatrixHostnameEndpoint.connect() returns a Deferred which reinstates the logcontext when it completes, but the Twisted code which calls it isn't expecting that, so the logcontext ends up ending up in places it shouldn't, and everything falls apart).

So one solution is to make MatrixHostnameEndpoint.connect a type-B function instead, which would work at runtime, but you would get errors from the patched inlineCallbacks which is installed by the unit tests to check that all the synapse functions are law-abiding type-A functions.

So a solution to this would be to make connect a thin (type-B) shim which calls a type-A _do_connect (or whatever) method which uses inlineCallbacks. Thus:

class MatrixHostnameEndpoint(object):

    def connect(self, protocol_factory):
        return run_in_background(self._do_connect, protocol_factory)

    @defer.inlineCallbacks
    def _do_connect(self, protocol_factory):
        # ... etc

You should then be able to use make_deferred_yieldable here instead of dropping the logcontext. The nice thing about that is that the logcontext will be then in place for the whole of _do_connect, so everything that endpoint.connect and the srv resolver log will end up with the right request id.

It'll probably still fail, but try it and let's have a look.


while servers:
total_weight = sum(s.weight for s in servers)
target_weight = random.randint(0, total_weight)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the off-by-one error is still present?

@erikjohnston erikjohnston requested a review from richvdh August 23, 2019 14:35
@erikjohnston erikjohnston force-pushed the erikj/reliable_lookups branch from 4dfdb55 to fbb758a Compare August 23, 2019 14:37
Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! 🎉

@erikjohnston erikjohnston merged commit dfd10f5 into develop Aug 27, 2019
anoadragon453 added a commit that referenced this pull request Oct 3, 2019
Synapse 1.4.0 (2019-10-03)
==========================

Bugfixes
--------

- Redact `client_secret` in server logs. ([\#6158](#6158))

Synapse 1.4.0rc2 (2019-10-02)
=============================

Bugfixes
--------

- Fix bug in background update that adds last seen information to the `devices` table, and improve its performance on Postgres. ([\#6135](#6135))
- Fix bad performance of censoring redactions background task. ([\#6141](#6141))
- Fix fetching censored redactions from DB, which caused APIs like initial sync to fail if it tried to include the censored redaction. ([\#6145](#6145))
- Fix exceptions when storing large retry intervals for down remote servers. ([\#6146](#6146))

Internal Changes
----------------

- Fix up sample config entry for `redaction_retention_period` option. ([\#6117](#6117))

Synapse 1.4.0rc1 (2019-09-26)
=============================

Note that this release includes significant changes around 3pid
verification. Administrators are reminded to review the [upgrade notes](UPGRADE.rst#upgrading-to-v140).

Features
--------

- Changes to 3pid verification:
  - Add the ability to send registration emails from the homeserver rather than delegating to an identity server. ([\#5835](#5835), [\#5940](#5940), [\#5993](#5993), [\#5994](#5994), [\#5868](#5868))
  - Replace `trust_identity_server_for_password_resets` config option with `account_threepid_delegates`, and make the `id_server` parameteter optional on `*/requestToken` endpoints, as per [MSC2263](matrix-org/matrix-spec-proposals#2263). ([\#5876](#5876), [\#5969](#5969), [\#6028](#6028))
  - Switch to using the v2 Identity Service `/lookup` API where available, with fallback to v1. (Implements [MSC2134](matrix-org/matrix-spec-proposals#2134) plus `id_access_token authentication` for v2 Identity Service APIs from [MSC2140](matrix-org/matrix-spec-proposals#2140)). ([\#5897](#5897))
  - Remove `bind_email` and `bind_msisdn` parameters from `/register` ala [MSC2140](matrix-org/matrix-spec-proposals#2140). ([\#5964](#5964))
  - Add `m.id_access_token` to `unstable_features` in `/versions` as per [MSC2264](matrix-org/matrix-spec-proposals#2264). ([\#5974](#5974))
  - Use the v2 Identity Service API for 3PID invites. ([\#5979](#5979))
  - Add `POST /_matrix/client/unstable/account/3pid/unbind` endpoint from [MSC2140](matrix-org/matrix-spec-proposals#2140) for unbinding a 3PID from an identity server without removing it from the homeserver user account. ([\#5980](#5980), [\#6062](#6062))
  - Use `account_threepid_delegate.email` and `account_threepid_delegate.msisdn` for validating threepid sessions. ([\#6011](#6011))
  - Allow homeserver to handle or delegate email validation when adding an email to a user's account. ([\#6042](#6042))
  - Implement new Client Server API endpoints `/account/3pid/add` and `/account/3pid/bind` as per [MSC2290](matrix-org/matrix-spec-proposals#2290). ([\#6043](#6043))
  - Add an unstable feature flag for separate add/bind 3pid APIs. ([\#6044](#6044))
  - Remove `bind` parameter from Client Server POST `/account` endpoint as per [MSC2290](matrix-org/matrix-spec-proposals#2290). ([\#6067](#6067))
  - Add `POST /add_threepid/msisdn/submit_token` endpoint for proxying submitToken on an `account_threepid_handler`. ([\#6078](#6078))
  - Add `submit_url` response parameter to `*/msisdn/requestToken` endpoints. ([\#6079](#6079))
  - Add `m.require_identity_server` flag to /version's unstable_features. ([\#5972](#5972))
- Enhancements to OpenTracing support:
  - Make OpenTracing work in worker mode. ([\#5771](#5771))
  - Pass OpenTracing contexts between servers when transmitting EDUs. ([\#5852](#5852))
  - OpenTracing for device list updates. ([\#5853](#5853))
  - Add a tag recording a request's authenticated entity and corresponding servlet in OpenTracing. ([\#5856](#5856))
  - Add minimum OpenTracing for client servlets. ([\#5983](#5983))
  - Check at setup that OpenTracing is installed if it's enabled in the config. ([\#5985](#5985))
  - Trace replication send times. ([\#5986](#5986))
  - Include missing OpenTracing contexts in outbout replication requests. ([\#5982](#5982))
  - Fix sending of EDUs when OpenTracing is enabled with an empty whitelist. ([\#5984](#5984))
  - Fix invalid references to None while OpenTracing if the log context slips. ([\#5988](#5988), [\#5991](#5991))
  - OpenTracing for room and e2e keys. ([\#5855](#5855))
  - Add OpenTracing span over HTTP push processing. ([\#6003](#6003))
- Add an admin API to purge old rooms from the database. ([\#5845](#5845))
- Retry well-known lookups if we have recently seen a valid well-known record for the server. ([\#5850](#5850))
- Add support for filtered room-directory search requests over federation ([MSC2197](matrix-org/matrix-spec-proposals#2197), in order to allow upcoming room directory query performance improvements. ([\#5859](#5859))
- Correctly retry all hosts returned from SRV when we fail to connect. ([\#5864](#5864))
- Add admin API endpoint for setting whether or not a user is a server administrator. ([\#5878](#5878))
- Enable cleaning up extremities with dummy events by default to prevent undue build up of forward extremities. ([\#5884](#5884))
- Add config option to sign remote key query responses with a separate key. ([\#5895](#5895))
- Add support for config templating. ([\#5900](#5900))
- Users with the type of "support" or "bot" are no longer required to consent. ([\#5902](#5902))
- Let synctl accept a directory of config files. ([\#5904](#5904))
- Increase max display name size to 256. ([\#5906](#5906))
- Add admin API endpoint for getting whether or not a user is a server administrator. ([\#5914](#5914))
- Redact events in the database that have been redacted for a week. ([\#5934](#5934))
- New prometheus metrics:
  - `synapse_federation_known_servers`: represents the total number of servers your server knows about (i.e. is in rooms with), including itself. Enable by setting `metrics_flags.known_servers` to True in the configuration.([\#5981](#5981))
  - `synapse_build_info`: exposes the Python version, OS version, and Synapse version of the running server. ([\#6005](#6005))
- Give appropriate exit codes when synctl fails. ([\#5992](#5992))
- Apply the federation blacklist to requests to identity servers. ([\#6000](#6000))
- Add `report_stats_endpoint` option to configure where stats are reported to, if enabled. Contributed by @Sorunome. ([\#6012](#6012))
- Add config option to increase ratelimits for room admins redacting messages. ([\#6015](#6015))
- Stop sending federation transactions to servers which have been down for a long time. ([\#6026](#6026))
- Make the process for mapping SAML2 users to matrix IDs more flexible. ([\#6037](#6037))
- Return a clearer error message when a timeout occurs when attempting to contact an identity server. ([\#6073](#6073))
- Prevent password reset's submit_token endpoint from accepting trailing slashes. ([\#6074](#6074))
- Return 403 on `/register/available` if registration has been disabled. ([\#6082](#6082))
- Explicitly log when a homeserver does not have the `trusted_key_servers` config field configured. ([\#6090](#6090))
- Add support for pruning old rows in `user_ips` table. ([\#6098](#6098))

Bugfixes
--------

- Don't create broken room when `power_level_content_override.users` does not contain `creator_id`. ([\#5633](#5633))
- Fix database index so that different backup versions can have the same sessions. ([\#5857](#5857))
- Fix Synapse looking for config options `password_reset_failure_template` and `password_reset_success_template`, when they are actually `password_reset_template_failure_html`, `password_reset_template_success_html`. ([\#5863](#5863))
- Fix stack overflow when recovering an appservice which had an outage. ([\#5885](#5885))
- Fix error message which referred to `public_base_url` instead of `public_baseurl`. Thanks to @aaronraimist for the fix! ([\#5909](#5909))
- Fix 404 for thumbnail download when `dynamic_thumbnails` is `false` and the thumbnail was dynamically generated. Fix reported by rkfg. ([\#5915](#5915))
- Fix a cache-invalidation bug for worker-based deployments. ([\#5920](#5920))
- Fix admin API for listing media in a room not being available with an external media repo. ([\#5966](#5966))
- Fix list media admin API always returning an error. ([\#5967](#5967))
- Fix room and user stats tracking. ([\#5971](#5971), [\#5998](#5998), [\#6029](#6029))
- Return a `M_MISSING_PARAM` if `sid` is not provided to `/account/3pid`. ([\#5995](#5995))
- `federation_certificate_verification_whitelist` now will not cause `TypeErrors` to be raised (a regression in 1.3). Additionally, it now supports internationalised domain names in their non-canonical representation. ([\#5996](#5996))
- Only count real users when checking for auto-creation of auto-join room. ([\#6004](#6004))
- Ensure support users can be registered even if MAU limit is reached. ([\#6020](#6020))
- Fix bug where login error was shown incorrectly on SSO fallback login. ([\#6024](#6024))
- Fix bug in calculating the federation retry backoff period. ([\#6025](#6025))
- Prevent exceptions being logged when extremity-cleanup events fail due to lack of user consent to the terms of service. ([\#6053](#6053))
- Remove POST method from password-reset `submit_token` endpoint until we implement `submit_url` functionality. ([\#6056](#6056))
- Fix logcontext spam on non-Linux platforms. ([\#6059](#6059))
- Ensure query parameters in email validation links are URL-encoded. ([\#6063](#6063))
- Fix a bug which caused SAML attribute maps to be overridden by defaults. ([\#6069](#6069))
- Fix the logged number of updated items for the `users_set_deactivated_flag` background update. ([\#6092](#6092))
- Add `sid` to `next_link` for email validation. ([\#6097](#6097))
- Threepid validity checks on msisdns should not be dependent on `threepid_behaviour_email`. ([\#6104](#6104))
- Ensure that servers which are not configured to support email address verification do not offer it in the registration flows. ([\#6107](#6107))

Updates to the Docker image
---------------------------

- Avoid changing `UID/GID` if they are already correct. ([\#5970](#5970))
- Provide `SYNAPSE_WORKER` envvar to specify python module. ([\#6058](#6058))

Improved Documentation
----------------------

- Convert documentation to markdown (from rst) ([\#5849](#5849))
- Update `INSTALL.md` to say that Python 2 is no longer supported. ([\#5953](#5953))
- Add developer documentation for using SAML2. ([\#6032](#6032))
- Add some notes on rolling back to v1.3.1. ([\#6049](#6049))
- Update the upgrade notes. ([\#6050](#6050))

Deprecations and Removals
-------------------------

- Remove shared-secret registration from `/_matrix/client/r0/register` endpoint. Contributed by Awesome Technologies Innovationslabor GmbH. ([\#5877](#5877))
- Deprecate the `trusted_third_party_id_servers` option. ([\#5875](#5875))

Internal Changes
----------------

- Lay the groundwork for structured logging output. ([\#5680](#5680))
- Retry well-known lookup before the cache expires, giving a grace period where the remote well-known can be down but we still use the old result. ([\#5844](#5844))
- Remove log line for debugging issue #5407. ([\#5860](#5860))
- Refactor the Appservice scheduler code. ([\#5886](#5886))
- Compatibility with v2 Identity Service APIs other than /lookup. ([\#5892](#5892), [\#6013](#6013))
- Stop populating some unused tables. ([\#5893](#5893), [\#6047](#6047))
- Add missing index on `users_in_public_rooms` to improve the performance of directory queries. ([\#5894](#5894))
- Improve the logging when we have an error when fetching signing keys. ([\#5896](#5896))
- Add support for database engine-specific schema deltas, based on file extension. ([\#5911](#5911))
- Update Buildkite pipeline to use plugins instead of buildkite-agent commands. ([\#5922](#5922))
- Add link in sample config to the logging config schema. ([\#5926](#5926))
- Remove unnecessary parentheses in return statements. ([\#5931](#5931))
- Remove unused `jenkins/prepare_sytest.sh` file. ([\#5938](#5938))
- Move Buildkite pipeline config to the pipelines repo. ([\#5943](#5943))
- Remove unnecessary return statements in the codebase which were the result of a regex run. ([\#5962](#5962))
- Remove left-over methods from v1 registration API. ([\#5963](#5963))
- Cleanup event auth type initialisation. ([\#5975](#5975))
- Clean up dependency checking at setup. ([\#5989](#5989))
- Update OpenTracing docs to use the unified `trace` method. ([\#5776](#5776))
- Small refactor of function arguments and docstrings in` RoomMemberHandler`. ([\#6009](#6009))
- Remove unused `origin` argument on `FederationHandler.add_display_name_to_third_party_invite`. ([\#6010](#6010))
- Add a `failure_ts` column to the `destinations` database table. ([\#6016](#6016), [\#6072](#6072))
- Clean up some code in the retry logic. ([\#6017](#6017))
- Fix the structured logging tests stomping on the global log configuration for subsequent tests. ([\#6023](#6023))
- Clean up the sample config for SAML authentication. ([\#6064](#6064))
- Change mailer logging to reflect Synapse doesn't just do chat notifications by email now. ([\#6075](#6075))
- Move last-seen info into devices table. ([\#6089](#6089))
- Remove unused parameter to `get_user_id_by_threepid`. ([\#6099](#6099))
- Refactor the user-interactive auth handling. ([\#6105](#6105))
- Refactor code for calculating registration flows. ([\#6106](#6106))
@erikjohnston erikjohnston deleted the erikj/reliable_lookups branch January 9, 2020 15:48
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants