This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Consider increasing default federation timeout #8338

Open
MayeulC opened this issue Sep 17, 2020 · 11 comments
Labels
A-Config Configuration, or the documentation thereof A-Federation O-Uncommon Most users are unlikely to come across this or unexpected workflow S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements.

Comments

@MayeulC

MayeulC commented Sep 17, 2020

Alternatively, expose it in a configuration file.

This is the timeout in question:

self.default_timeout = 60
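A minimal sketch of what "expose it in a configuration file" could look like. Note the `federation_client_timeout` option name is hypothetical, not an existing Synapse setting:

```python
# Hypothetical sketch: read the federation client timeout from the
# homeserver config, falling back to the current hard-coded default.
# The "federation_client_timeout" key is an assumed name, not a real
# Synapse option.
DEFAULT_FEDERATION_TIMEOUT = 60

def federation_timeout(config: dict) -> int:
    """Return the federation request timeout, in seconds."""
    return int(config.get("federation_client_timeout", DEFAULT_FEDERATION_TIMEOUT))
```

The client would then do something like `self.default_timeout = federation_timeout(hs_config)` instead of hard-coding 60.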

Rationale

Related: #7113 #6699 #5445 #8118

Also related to #8080 and #5604, as the telltale sign is this log pattern:

synapse.http.federation.well_known_resolver - 246 - INFO - federation_transaction_transmission_loop-2612- Fetching https://haenerconsulting.com/.well-known/matrix/server
synapse.http.federation.well_known_resolver - 264 - INFO - federation_transaction_transmission_loop-2612- Error fetching https://haenerconsulting.com/.well-known/matrix/server: DNS lookup failed: Couldn't find the hostname 'haenerconsulting.com'.
(note: the above doesn't necessarily fail on .well-known or DNS)
synapse.logging.context - 351 - WARNING - federation_transaction_transmission_loop-2612- Expected logging context federation_transaction_transmission_loop-2612 was lost
synapse.http.matrixfederationclient - 546 - INFO - federation_transaction_transmission_loop-2612- {PUT-O-1163} [haenerconsulting.com] Request failed: PUT matrix://haenerconsulting.com/_matrix/federation/v1/send/1600285938965: TimeoutError(60, 'Deferred')
synapse.logging.context - 396 - WARNING - federation_transaction_transmission_loop-2612- Re-starting finished log context federation_transaction_transmission_loop-2612
(multiple times)

Note: the timeout error can also be something like

synapse.http.matrixfederationclient - 546 - INFO - federation_transaction_transmission_loop-2812- {PUT-O-1251} [sigsegv.cc] Request failed: PUT matrix://sigsegv.cc/_matrix/federation/v1/send/1600285939053: AttributeError("'NoneType' object has no attribute 'stopListening'")

That indicates, if I am not mistaken, that the transaction actually returned something after the timeout expired.

High level symptoms

Federation was working bi-directionally with some (smaller?) homeservers. It wasn't working with matrix.org, though (my homeserver could receive events -- PDU and EDU -- but matrix.org wouldn't receive events from my server, except when backfilling from others, of course).

This started happening after a ~1-day federation outage (synapse was down for postgresql maintenance & upgrade, and I hit another synapse bug on start-up).

Probable causes

This is likely due to my having a slow internet connection (ADSL), subject to bufferbloat. Trying to transmit a lot of data at once will dramatically increase latency, even causing DNS lookups to fail.

The origin is likely events piling up while the server wasn't federating, and the new async I/O architecture being too performant (meaning it tries to send everything at once on start-up). (I have successfully recovered from multiple-week federation outages in the past.)

I got around #8118 by increasing dnsmasq's max simultaneous connections to 300 from the default of 100, and its number of cache entries to 4096 up from 256, then restarting synapse twice. It is possible that changing the federation timeout could have helped with that as well.

Considerations

This likely only happens in a limited number of scenarios:

  • Slow internet connection, or high latency
  • A lot of events to send to some servers, and a lot of catch-up to do (hence why federation was only breaking on some of the biggest servers of the network).

As such, it might be better to progressively increase the timeout (though that makes a large default pointless), or alternatively, use a large timeout on startup and progressively reduce it.

Rate-limiting requests (including DNS), especially on start-up, might also help quite a bit.
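The "large timeout on startup, progressively reduced" idea could be sketched like this. All names and the decay parameters are hypothetical, not anything Synapse implements:

```python
# Hypothetical sketch: linearly decay from a generous startup timeout
# back to the normal default over a fixed window after process start.
# startup_timeout, default_timeout and decay_window are illustrative
# values, not real Synapse settings.
def scaled_timeout(seconds_since_start: float,
                   startup_timeout: float = 240.0,
                   default_timeout: float = 60.0,
                   decay_window: float = 600.0) -> float:
    """Timeout (seconds) to use for a federation request made this long
    after startup."""
    if seconds_since_start >= decay_window:
        return default_timeout
    frac = seconds_since_start / decay_window
    return startup_timeout - (startup_timeout - default_timeout) * frac
```

Right after a restart (when the catch-up backlog is largest) requests would get 240s; ten minutes in, everything is back to the usual 60s.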

@anoadragon453
Member

Hey, thanks for your detailed issue.

Making federation timeout configurable generally sounds like a good idea. This would allow admins to play around with the value and see what works best for them given their resource limitations. We are a bit wary about changing the default value of 60 for now though, given that this can be a double-edged sword: while you'd end up being able to complete some federation operations that you otherwise wouldn't have before, you also end up spending more time waiting for servers that will never respond, which also decreases overall performance.

If it's configurable and people find that generally raising the timeout gives a better experience, then we can take a look at changing the defaults, or potentially working out a way to scale the timeout when needed.

The origin is likely events piling up while the server wasn't federating, and the new async I/O architecture being too performant (meaning it tries to send everything at once on start-up). (I have successfully recovered from multiple-week federation outages in the past.)

Currently requests are indeed sent out in parallel as needed. There's been work in the past to queue batches of these requests, under the name "Federation Side-Bus", but it's been put on hiatus. This is a known problem though, and there are other solutions than just increasing the timeout, but it's a hard problem I'm afraid.

@richvdh
Member

richvdh commented Sep 23, 2020

under the name "Federation Side-Bus"

federation-side bus.

@MayeulC
Author

MayeulC commented Sep 23, 2020

Thank you for the detailed answer, that makes sense. I've indeed seen a few PR/issues mentioning batching requests, and wondered if it was related. I think it could actually help quite a bit.

Nit: a bus has a different meaning to me, as a high-bandwidth link that is connected to multiple receivers at once (think public transit). Federation queues might be more explicit, unless it's called like that for internal reasons, of course, or that I am mistaken about buses :)

Edit: ah, yes, #5883 looks pretty stale.

@rubo77
Contributor

rubo77 commented Sep 29, 2020

you also end up spending more time waiting for servers that will never respond

Maybe we could solve this by adding a configuration option that increases the timeout per hostname, i.e. only for communication with known problematic servers, while all other servers still use the default timeout.
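A per-hostname override as suggested above could be as simple as a mapping consulted before each request. The `federation_timeout_overrides` config shape and all names here are hypothetical:

```python
# Hypothetical sketch of a per-hostname timeout override, e.g. loaded
# from a config mapping like:
#   federation_timeout_overrides:
#     matrix.org: 180
#     libera.chat: 240
# Neither the option nor the helper exists in Synapse today.
DEFAULT_TIMEOUT = 60

def timeout_for(host: str, overrides: dict) -> int:
    """Timeout (seconds) for a given destination, falling back to the
    default for hosts with no override."""
    return overrides.get(host, DEFAULT_TIMEOUT)
```

Only the known-slow destinations pay the cost of longer waits; unreachable servers elsewhere still fail after the default 60s.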

@a-0-dev

a-0-dev commented Jan 22, 2022

Following a recent discussion in #synapse:matrix.org, this still seems to be an issue. I also suspect it is causing trouble with my homeserver (the logs match, and I am experiencing irregular downtimes of my Synapse instance without any other log indication).

I would be very thankful if an experienced Synapse dev would implement the fix suggested above (exposing the variable to the config)... It doesn't seem to be that much work to me; I'm just not familiar with the codebase at all.

Anyhow, the issue is from Sep 2020 and this is a bump/reminder :)

@anoadragon453 anoadragon453 added the T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. label Jan 24, 2022
@fuomag9

fuomag9 commented Mar 1, 2022

Same here: I get lots of DNSLookupError('no results for hostname lookup') and HttpResponseException('500: Internal Server Error') when joining big rooms, but DNS works fine with dig :(

@strufkin

I had to increase the timeout to 120 seconds so I could join the room:

tail -f access.log|grep join
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:31:49 +0000] "OPTIONS /_matrix/client/r0/join/!TyHsKBJMsnkHlauFel%3Alibera.chat HTTP/2.0" 204 0 72 0.012 0.012 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:32:52 +0000] "POST /_matrix/client/r0/join/!TyHsKBJMsnkHlauFel%3Alibera.chat HTTP/2.0" 502 68 63 62.802 62.801 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"


matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:37:27 +0000] "OPTIONS /_matrix/client/r0/join/!TyHsKBJMsnkHlauFel%3Alibera.chat HTTP/2.0" 204 0 66 0.002 0.003 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:39:14 +0000] "POST /_matrix/client/r0/join/!TyHsKBJMsnkHlauFel%3Alibera.chat HTTP/2.0" 200 65 61 107.203 107.203 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:43:57 +0000] "OPTIONS /_matrix/client/r0/join/%23LCNC%3Amatrix.org?server_name=matrix.org HTTP/2.0" 204 0 72 0.006 0.006 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:44:09 +0000] "OPTIONS /_matrix/client/r0/join/%23LCNC%3Amatrix.org?server_name=matrix.org HTTP/2.0" 204 0 72 0.002 0.002 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:44:20 +0000] "OPTIONS /_matrix/client/r0/join/%23LCNC%3Amatrix.org?server_name=matrix.org HTTP/2.0" 204 0 72 0.972 0.971 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:45:52 +0000] "POST /_matrix/client/r0/join/%23LCNC%3Amatrix.org?server_name=matrix.org HTTP/2.0" 200 64 69 114.815 114.816 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:45:53 +0000] "POST /_matrix/client/r0/join/%23LCNC%3Amatrix.org?server_name=matrix.org HTTP/2.0" 200 64 67 104.365 104.365 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"
matrix.*.com 4443 *.*.*.* - - [24/Mar/2022:17:45:53 +0000] "POST /_matrix/client/r0/join/%23LCNC%3Amatrix.org?server_name=matrix.org HTTP/2.0" 200 64 67 93.416 93.416 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.10.7 Chrome/98.0.4758.74 Electron/17.0.0 Safari/537.36"

@logic

logic commented May 27, 2022

As an additional datapoint, I was seeing similar issues trying to re-join the #fedora-devel channel on libera.chat after getting idle-booted. When I increased the timeout to 120s, Element started throwing new errors (unable to obtain room_version for the room, etc), and still wouldn't connect. Increased again to 240s, and was finally able to join the channel (after a couple of failed attempts).

I suspect, based on the errors Element was throwing, that I'm bumping into multiple problematic timeouts, but this is the one that successfully solved the problem for me this time around.

@DMRobertson DMRobertson added A-Config Configuration, or the documentation thereof S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. O-Uncommon Most users are unlikely to come across this or unexpected workflow and removed T-Task Refactoring, removal, replacement, enabling or disabling functionality, other engineering tasks. z-p2 (Deprecated Label) z-enhancement labels Aug 25, 2022
@jivanpal

3–4 minutes is not enough for me; my /send_join requests to certain large rooms (e.g. on libera.chat) time out consistently. I have no trouble joining smaller rooms on the same homeservers. Increasing the timeout would allow me to join these rooms.

Here are logs from an attempt to join #ansible:libera.chat, for example:

2022-10-26 19:29:13,148 - synapse.http.matrixfederationclient - 299 - INFO - POST-34469 - {GET-O-92082} [libera.chat] Completed request: 200 OK in 0.35 secs, got 71 bytes - GET matrix://libera.chat/_matrix/federation/v1/query/directory?room_alias=%23ansible%3Alibera.chat
2022-10-26 19:29:13,609 - synapse.http.matrixfederationclient - 299 - INFO - POST-34469 - {GET-O-92083} [libera.chat] Completed request: 200 OK in 0.41 secs, got 528 bytes - GET matrix://libera.chat/_matrix/federation/v1/make_join/%21czpAhMQgXXiRMvhOef%3Alibera.chat/%40jivan%3Aopaline.uk?ver=1&ver=2&ver=3&ver=4&ver=5&ver=6&ver=org.matrix.msc2176&ver=7&ver=8&ver=9&ver=org.matrix.msc3787&ver=10&ver=org.matrix.msc2716v4
2022-10-26 19:30:14,219 - synapse.http.matrixfederationclient - 672 - INFO - POST-34469 - {PUT-O-92085} [libera.chat] Request failed: PUT matrix://libera.chat/_matrix/federation/v2/send_join/%21czpAhMQgXXiRMvhOef%3Alibera.chat/%24mj5R7RnPgtZOnQvT0wKHgGef0J97YtdKIUw7xYGIgGQ: ResponseNeverReceived:[CancelledError()]
2022-10-26 19:31:14,887 - synapse.http.matrixfederationclient - 672 - INFO - POST-34469 - {PUT-O-92085} [libera.chat] Request failed: PUT matrix://libera.chat/_matrix/federation/v2/send_join/%21czpAhMQgXXiRMvhOef%3Alibera.chat/%24mj5R7RnPgtZOnQvT0wKHgGef0J97YtdKIUw7xYGIgGQ: ResponseNeverReceived:[CancelledError()]
2022-10-26 19:32:19,751 - synapse.http.matrixfederationclient - 672 - INFO - POST-34469 - {PUT-O-92085} [libera.chat] Request failed: PUT matrix://libera.chat/_matrix/federation/v2/send_join/%21czpAhMQgXXiRMvhOef%3Alibera.chat/%24mj5R7RnPgtZOnQvT0wKHgGef0J97YtdKIUw7xYGIgGQ: ResponseNeverReceived:[CancelledError()]
2022-10-26 19:33:21,582 - synapse.http.matrixfederationclient - 672 - INFO - POST-34469 - {PUT-O-92085} [libera.chat] Request failed: PUT matrix://libera.chat/_matrix/federation/v2/send_join/%21czpAhMQgXXiRMvhOef%3Alibera.chat/%24mj5R7RnPgtZOnQvT0wKHgGef0J97YtdKIUw7xYGIgGQ: ResponseNeverReceived:[CancelledError()]
2022-10-26 19:33:21,582 - synapse.util.retryutils - 231 - INFO - POST-34469 - Connection to libera.chat was unsuccessful (<class 'synapse.api.errors.RequestSendFailed'>(Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >])); backoff now 600000
2022-10-26 19:33:21,583 - synapse.federation.federation_client - 819 - WARNING - POST-34469 - Failed to send_join via libera.chat: Failed to send request: ResponseNeverReceived: [<twisted.python.failure.Failure twisted.internet.defer.CancelledError: >]
2022-10-26 19:33:21,590 - synapse.http.server - 107 - INFO - POST-34469 - <XForwardedForRequest at 0x7f169172bd30 method='POST' uri='/_matrix/client/r0/join/%23ansible%3Alibera.chat' clientproto='HTTP/1.0' site='8008'> SynapseError: 502 - Failed to send_join via any server
2022-10-26 19:33:21,626 - synapse.access.http.8008 - 459 - INFO - POST-34469 - 10.162.39.129 - 8008 - {@jivan:opaline.uk} Processed request: 248.788sec/0.036sec (0.037sec, 0.003sec) (0.018sec/0.022sec/8) 68B 502 "POST /_matrix/client/r0/join/%23ansible%3Alibera.chat HTTP/1.0" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.12 Chrome/104.0.5112.124 Electron/20.3.0 Safari/537.36" [0 dbevts]

@anoadragon453
Member

The faster remote room joins project will likely resolve this problem in most cases.

@jivanpal

jivanpal commented Nov 1, 2022

@anoadragon453 Is there anything that Synapse admins can do in the meantime? My issue is not that the server I'm trying to federate with is being slow, it's that my server isn't being patient enough. The server I'm trying to federate with isn't even on Synapse 1.69, let alone whether it has this new feature enabled, so whilst I appreciate that it will fix this in the long term, what can we do now?
