Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

server-key fetching logic is slow and queue-bound #3825

Closed
richvdh opened this issue Sep 7, 2018 · 2 comments · Fixed by #10035
Closed

server-key fetching logic is slow and queue-bound #3825

richvdh opened this issue Sep 7, 2018 · 2 comments · Fixed by #10035
Labels
A-Federation A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)

Comments

@richvdh
Copy link
Member

richvdh commented Sep 7, 2018

There are a number of problems with the logic in keyring.py for fetching server keys:

  • each request that needs a key for a given server gets queued up, and it's possible to end up with quite a long queue for a given server. If the lookup is successful, that's ok. However, if it fails (which may take many minutes while we wait for timeouts), then we try again for each request in the queue - so we can rapidly end up getting very badly behind. When we want key X for server Y, if there is already a request in the queue for that key, then we should just use the results from it, even if it fails.

  • relatedly, the queueing logic might never complete. If a given request wants keys from server A and server B, and a lookup is already in progress for A, it waits for that to complete. By that time, another request might be doing a lookup for B, so it waits for that to complete. Then we might be waiting for A again. etc. We should immediately start lookups for those servers which aren't already in progress, rather than waiting for the complete set.

  • see also store_server_verify_keys shouldn't need to lock the table #3819 and get_keys_from_store should do one big lookup, not hundreds of tiny ones #3818

@neilisfragile neilisfragile added the z-p2 (Deprecated Label) label Oct 5, 2018
@richvdh richvdh added the A-Performance Performance, both client-facing and admin-facing label Feb 27, 2019
@richvdh
Copy link
Member Author

richvdh commented May 24, 2019

relatedly, the queueing logic might never complete.

This is a huge problem while we are joining a room, and is a huge contributor to #1211. In particular:

  1. we do a send_join
  2. servers start sending us federation transactions, which means we need to fetch their keys, so we take out key-fetch locks for those servers
  3. we try to verify the results of the send join, so have to fetch hundreds of keys. Some of those servers are already locked due to the above, so we wait
  4. more transactions arrive from other servers (cf massive storm of presence EDUs when joining a large room #3120) so we lock those key lookups.
  5. The first key lookups complete; go back to 3.

@hawkowl hawkowl added the z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers label Jul 11, 2019
@benbz
Copy link
Contributor

benbz commented Dec 7, 2020

This (really the tight looping of Waiting for existing lookups logging in #5435) has come up several times in the past couple of months. When a HS hits this it is effectively unresponsive until it gets restarted

@DMRobertson DMRobertson added A-Federation and removed z-federation-meltdown z-outbound-federation-meltdown (Deprecated Label) Synapse melting down by trying to talk to too many servers labels Aug 25, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Federation A-Performance Performance, both client-facing and admin-facing z-p2 (Deprecated Label)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants