
Viewing room list in a Space / Searching for rooms in a Space is very slow #11694

Open
reivilibre opened this issue Jan 6, 2022 · 7 comments
Labels
  • A-Spaces: Hierarchical organization of rooms
  • O-Frequent: Affects or can be seen by most users regularly or impacts most users' first experience
  • S-Major: Major functionality / product severely impaired, no satisfactory workaround.
  • T-Defect: Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@reivilibre
Contributor

Still need to find out what's going on here, or even what the related requests are, but I can't see an existing issue about it, and this is surely something we ought to resolve if we want people to enjoy using Spaces.

Description

The 'Explore Space' screen in Element is very slow, even when typing in a search term.

Steps to reproduce

  • Go to a (decent-size?) space: my example is Element's space.
    • I suspect this issue may not be visible from the matrix.org homeserver, since I expect someone would have said something by now!
  • Open the 'Explore rooms' view in Element.
  • Type in a keyword.
  • Watch the spinner for 2 minutes (in my case; it probably varies depending on how far down the tree the room is?).
  • A result appears (the spinner is still going, so it's probably not a complete result set; however, this was the room I wanted in my situation, so I could finish here).

I would expect searching to be much faster than that; let's say 10 seconds or less, to be generous, given that I can understand it may be making a few round trips to remote homeservers.

Version information

  • Homeserver: librepush.net
  • Version: 1.49.0, but this issue has been here for as long as I can remember the feature existing
  • Install method: pip from PyPI
  • Platform: amd64 VPS
@reivilibre reivilibre added the X-Needs-Info This issue is blocked awaiting information from the reporter label Jan 6, 2022
@clokep
Member

clokep commented Jan 6, 2022

I would expect searching to be much faster than that; let's say 10 seconds or less, to be generous, given that I can understand it may be making a few round trips to remote homeservers.

Searching is done client-side, so your client is paginating through the entire space to find all results. I don't know whether matches are shown as they're found or only at the end, though.

See #10523 also for performance with this API.

@clokep clokep added A-Spaces Hierarchical organization of rooms T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jan 14, 2022
@reivilibre reivilibre self-assigned this Feb 23, 2022
@reivilibre reivilibre removed the X-Needs-Info This issue is blocked awaiting information from the reporter label Feb 23, 2022
@reivilibre
Contributor Author

I got some time to look into this... Here are some notes

Client-side view of what's going on

The slow request seems to be this endpoint (with some example parameters):

GET https://matrix.librepush.net/_matrix/client/unstable/org.matrix.msc2946/rooms/!OJBlkJuUrsKnqtNnTi%3Amatrix.org/hierarchy?suggested_only=false&from=GiSyKxltKfZBfrhSVEXQVjUb&limit=20

Waiting: 9.43 sec.

Each request only includes a few results, which makes scrolling through the list quite slow and frustrating: you get a drip of a few new rooms every 10 seconds. It's not great, but at least it feels less dead?
Search is performed client-side as a filter, firing back-to-back requests as needed, which really exacerbates the problem. This would probably be quite poor even if the requests weren't so slow: it's the equivalent of a table scan over the space, but with high latency and tiny (20-room) block sizes...!
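For reference, here is a minimal sketch of the pattern the client is effectively following (this is not Element's actual code): page through the hierarchy endpoint shown above, 20 rooms at a time, and filter each block locally. The endpoint path and query parameters are taken from the request above; the keyword matching, token handling, and error handling are simplified assumptions.

from urllib.parse import quote

import requests


def search_space(homeserver: str, space_id: str, keyword: str, access_token: str):
    """Client-side space search: paginate the (unstable MSC2946) hierarchy
    endpoint 20 rooms at a time and filter the results locally."""
    url = (
        f"{homeserver}/_matrix/client/unstable/org.matrix.msc2946"
        f"/rooms/{quote(space_id, safe='')}/hierarchy"
    )
    params = {"suggested_only": "false", "limit": 20}
    headers = {"Authorization": f"Bearer {access_token}"}

    while True:
        resp = requests.get(url, params=params, headers=headers, timeout=60)
        resp.raise_for_status()
        body = resp.json()

        for room in body.get("rooms", []):
            haystack = f"{room.get('name') or ''} {room.get('topic') or ''}"
            if keyword.lower() in haystack.lower():
                yield room  # each block of matches may arrive ~10 seconds apart

        next_batch = body.get("next_batch")
        if not next_batch:
            break  # reached the end of the space tree
        params["from"] = next_batch  # back-to-back request for the next block

Every page costs a full client-server round trip (and, behind it, however many federation requests the homeserver has to make), which is where the ~10 seconds per block above comes from.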

Synapse-side view of what's going on

Let's have a look in Jaeger (N.B. This is not the exact same request as the one mentioned in the first section, but it feels fairly reproducible so it probably serves well as a guide).

First off I notice that it says '40 Errors'.

It seems to start off with 2 fairly slow requests to t2l.io:

That adds up to 2.5 sec straight away.

Then there are many small database queries, followed by a 120 ms request to vector.modular.im:

Now I start noticing all the errors that the summary talked about...

Looks like it's trying loads of different hosts (from t=3.06s to t=10.61s) before finally getting an answer (from t2l.io, which takes 3.8s to respond)*:

It also takes a wee while in get_current_state_ids (250 ms) at the end.

Then we get our 20 rooms and the cycle starts again :D with the next client-server request.

Logs from matrix.org for one of these requests:

2022-02-23 16:24:25,834 - synapse.handlers.room_summary - 926 - INFO - GET-1763521- - room !uUTpgiQEFiCNjzmgEe:matrix.org is unpeekable and requester librepush.net is not a member / not allowed to join, omitting from summary
2022-02-23 16:24:25,834 - synapse.http.server - 95 - INFO - GET-1763521 - <XForwardedForRequest at 0x7f3fcfa44658 method='GET' uri='/_matrix/federation/v1/hierarchy/%21uUTpgiQEFiCNjzmgEe%3Amatrix.org?suggested_only=false' clientproto='HTTP/1.1' site='16102'> SynapseError: 404 - Unknown room: !uUTpgiQEFiCNjzmgEe:matrix.org
2022-02-23 16:24:25,836 - synapse.access.http.16102 - 448 - INFO - GET-1763521 - 2a02:c205:2022:1137::1 - 16102 - {librepush.net} Processed request: 0.010sec/0.001sec (0.001sec, 0.001sec) (0.001sec/0.005sec/3) 78B 404 "GET /_matrix/federation/v1/hierarchy/%21uUTpgiQEFiCNjzmgEe%3Amatrix.org?suggested_only=false HTTP/1.1" "Synapse/1.53.0" [0 dbevts]

It might be intended that we get a 404 in this case (I guess?), but it doesn't help that we then seem to try the fallback unstable endpoint after being served that 404!

*I'm not convinced this is exactly right. I think it tries servers for each room it wants, then moves on to the next, but I haven't studied the code.
In any case, sequentially trying many servers and many rooms seems to be the main bottleneck here.

Some solution ideas:

  • shut off requesting from the unstable endpoint, or be smarter about falling back to it (if we know the server supports the stable version, don't fall back?)
    • should we even be falling back on a 404? (See the sketch after this list.) If we have no choice, it'd be good to learn from this mistake and make future proposals able to distinguish between an error and the feature not being supported.
  • race servers against each other (otherwise we're always vulnerable to picking a poor server), or otherwise adaptively change preferred servers
    • I don't know how unreasonable the response times from the other servers were in this situation.
  • can we combine these requests into one somehow? It feels a pity that we need to do so many round trips to the same servers.
    • alternatively, I wonder whether HTTP request pipelining is something we could be doing?
  • read-ahead: have the homeserver predictively 'read' future blocks, so that they're ready to be received by the client with much lower latency
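To illustrate the first bullet, here is a rough sketch of how a federation client could decide whether a 404 means "the server knows the endpoint but not the room" (in which case retrying the unstable endpoint can't help) or "the endpoint itself is unsupported" (in which case falling back is reasonable). This is not Synapse's actual fallback logic: the unstable path, the reliance on the M_NOT_FOUND errcode, and the omission of federation request signing are all assumptions made for illustration.

from typing import Optional
from urllib.parse import quote

import requests

STABLE_PATH = "/_matrix/federation/v1/hierarchy/{room_id}"
# Assumed unstable path per MSC2946; check the MSC for the exact prefix.
UNSTABLE_PATH = "/_matrix/federation/unstable/org.matrix.msc2946/hierarchy/{room_id}"


def fetch_remote_hierarchy(server: str, room_id: str) -> Optional[dict]:
    """Try the stable hierarchy endpoint first; only fall back to the
    unstable one if the 404 looks like 'endpoint not implemented' rather
    than 'room not known'. (Federation request signing omitted for brevity.)"""
    for path in (STABLE_PATH, UNSTABLE_PATH):
        url = f"https://{server}{path.format(room_id=quote(room_id, safe=''))}"
        resp = requests.get(url, params={"suggested_only": "false"}, timeout=10)

        if resp.status_code == 200:
            return resp.json()

        if resp.status_code == 404:
            try:
                errcode = resp.json().get("errcode")
            except ValueError:
                errcode = None
            if errcode == "M_NOT_FOUND":
                # The server understood the request but doesn't know (or
                # won't share) the room; trying another path won't help.
                return None
            # Otherwise treat the 404 as "endpoint unsupported" and fall back.
            continue

        resp.raise_for_status()
    return None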

@reivilibre reivilibre removed their assignment Feb 23, 2022
@clokep
Member

clokep commented Feb 23, 2022

This seems to be exacerbated by a few things:

  1. The fallback from /hierarchy -> unstable /hierarchy -> /spaces can be slow if you try all three (and still don't get an answer).
  2. The above, but made worse by matrix-org/matrix-doc#1492, and maybe we're treating unknown data as an unknown endpoint (so trying fallbacks we don't need to).
  3. Search over /hierarchy is client-side (this is just a thing we haven't gotten to yet, FTR).

We could also be a bit more aggressive about caching results over federation; right now they're only cached for a few minutes.
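On the caching point, a minimal sketch of a time-based cache keyed by (origin server, room ID) is below. The class name, the 30-minute TTL, and the shape of the cached value are arbitrary choices for illustration, not what Synapse actually uses.

import time
from typing import Any, Dict, Optional, Tuple


class HierarchyCache:
    """Simple TTL cache for remote hierarchy responses, keyed by
    (origin server, room_id)."""

    def __init__(self, ttl_seconds: float = 30 * 60):
        self._ttl = ttl_seconds
        self._entries: Dict[Tuple[str, str], Tuple[float, Any]] = {}

    def get(self, server: str, room_id: str) -> Optional[Any]:
        entry = self._entries.get((server, room_id))
        if entry is None:
            return None
        inserted_at, value = entry
        if time.monotonic() - inserted_at > self._ttl:
            # Expired: drop it so the next caller refetches over federation.
            del self._entries[(server, room_id)]
            return None
        return value

    def put(self, server: str, room_id: str, value: Any) -> None:
        self._entries[(server, room_id)] = (time.monotonic(), value)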

@clokep
Member

clokep commented Feb 24, 2022

Note that #12073 should help with the first part (as there's no longer a /spaces endpoint to try).

I think the clokep/hierarchy-404s branch might help a bit with the 404s issue, but I'm not sure if that's the problem.

@reivilibre
Contributor Author

@clokep kindly slipped me a sneaky branch to try (the one above), and empirically it seems to have brought down most request times from 10s to 3–4s, with some outliers. :)

@reivilibre
Contributor Author

I noticed when looking at the code that if we're in a space/room, we use our local copy and recurse into the children. If we're not in a space/room, we request that room from a remote homeserver, and the response includes its whole subtree of spaces.

A knock-on effect is that joining a space or subspace makes the hierarchy slower, because we then send one HTTP request per child rather than just requesting the parent and having the children included in the response.

The code seems to have already been well written and set up to cache remote results, so perhaps a simple(-seeming) way to make things faster would be to request the root space (or any parent space where we need to look up ≥2 or 3 of its children on a remote) from a remote homeserver. In the happy case, this pre-caches the children and prevents us from sending one request per child. If the remote doesn't answer properly, then it's only a cache and we can fall back to making individual requests as before.
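A rough sketch of that idea, using entirely hypothetical names rather than Synapse's actual code: when two or more children of a space would each need a remote lookup, fetch the parent's subtree once and seed the cache from the response, falling back to per-room requests only for children the remote left out.

from typing import Callable, Dict, List, Optional


def resolve_children(
    parent_id: str,
    child_ids: List[str],
    cache: Dict[str, dict],
    fetch_subtree: Callable[[str], Optional[dict]],
    fetch_room: Callable[[str], Optional[dict]],
) -> List[dict]:
    """Hypothetical helper: resolve a space's children, preferring one
    subtree request over many per-child requests."""
    missing = [c for c in child_ids if c not in cache]

    if len(missing) >= 2:
        # One request for the parent returns (some of) its descendants too;
        # cache everything that came back.
        subtree = fetch_subtree(parent_id) or {}
        for room in subtree.get("children", []):
            cache[room["room_id"]] = room
        missing = [c for c in missing if c not in cache]

    # Fall back to individual requests for anything the remote omitted.
    for child_id in missing:
        room = fetch_room(child_id)
        if room is not None:
            cache[child_id] = room

    return [cache[c] for c in child_ids if c in cache]

In the happy case this turns N child requests into one subtree request; in the unhappy case it costs one extra request and degrades to the current per-child behaviour.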

@MadLittleMods MadLittleMods added S-Major Major functionality / product severely impaired, no satisfactory workaround. O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience labels Nov 29, 2022
@dkasak
Member

dkasak commented Dec 16, 2022

Search over /hierarchy is client-side (this is just a thing we haven't gotten to yet, FTR).

I would just like to add that, all things being equal, we should prefer exploring solutions which don't rely on server-side search, so that we don't dig ourselves into a hole which prevents us from encrypting room state in the future.
