Placeholder: e2ee is slow in massive rooms #16043

Closed

turt2live opened this issue Dec 28, 2020 · 7 comments
Labels
A-E2EE, A-Performance, O-Uncommon (Most users are unlikely to come across this or unexpected workflow), S-Major (Severely degrades major functionality or product features, with no satisfactory workaround), T-Defect, Z-Chronic

Comments

@turt2live
Member

No description provided.

@turt2live self-assigned this Dec 28, 2020
@turt2live
Member Author

Sending test messages in the megolm test room.

Initial observations:

Claiming the keys takes forever:
[screenshot]

There's a bunch of half-second to_device message calls. Might be able to stack these?
[screenshot]

There are also slightly-above-half-second calls to broadcast that keys are being withheld, which feels excessive given there shouldn't be any withheld keys?
[screenshot]

The actual message send took 600ms:
[screenshot]

ensureOlmSessionsForDevices is a bit slow (the red bar on the Task indicates overtime):
[screenshot]

There are seemingly thousands of calls into WASM (presumably Olm), all of which take 1-2ms each:
[screenshot] (this is about 1.76ms)

The WASM calls (and the long ensureOlmSessionsForDevices):
[screenshot]

It takes 2ms for some devices to be encrypted:
[screenshot]

... while others (most) take about 0.2ms:
[screenshot]

There is a lot of unpickling going on:
[screenshot]

Rough timeline for the web requests:
[screenshot]

The CPU spikes a bit with each encryption, but nothing terribly dangerous (same timeframe as above):
[screenshot]

The hill of ~50% CPU appears to be ensuring the devices exist and the initial encryption. We might not be stacking the encryption up high enough?
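
To make "stacking the encryption up higher" concrete, here is a minimal TypeScript sketch of running the per-device Olm encryption in bounded-concurrency chunks. The `DeviceInfo` shape, the `encryptForDevice` helper, and the chunk size of 20 are illustrative assumptions, not the actual matrix-js-sdk code:

```ts
// Illustrative only: encryptForDevice stands in for the real per-device Olm
// encryption step; it is not the matrix-js-sdk API.
type DeviceInfo = { userId: string; deviceId: string };

async function encryptForDevice(device: DeviceInfo, payload: object): Promise<void> {
    // ... per-device Olm encryption would happen here ...
}

async function encryptForAllDevices(
    devices: DeviceInfo[],
    payload: object,
    concurrency = 20,
): Promise<void> {
    // Walk the device list in fixed-size chunks so several encryptions are
    // in flight at once without overwhelming the event loop.
    for (let i = 0; i < devices.length; i += concurrency) {
        const chunk = devices.slice(i, i + concurrency);
        await Promise.all(chunk.map((device) => encryptForDevice(device, payload)));
    }
}
```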

@turt2live
Member Author

We can probably make a bunch of this more concurrent and save session keys somewhere safe, leading to faster messages after reloading the app. We can also probably increase the number of devices per to_device message when sending out keys, and defer the withheld notifications to after the message is actually sent.
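
As a rough sketch of the "more devices per to_device message" part, the per-device payloads could be packed into larger `messages` maps, each of which becomes one `PUT /_matrix/client/v3/sendToDevice/{eventType}/{txnId}` body. The batch size of 100 and the `ToDeviceTarget`/`buildToDeviceBatches` names are assumptions for illustration, not the real client code:

```ts
// Illustrative only: pack many target devices into each to_device request
// body instead of sending lots of small requests.
interface ToDeviceTarget {
    userId: string;
    deviceId: string;
    content: Record<string, unknown>;
}

type ToDeviceMessages = Record<string, Record<string, Record<string, unknown>>>;

function buildToDeviceBatches(targets: ToDeviceTarget[], batchSize = 100): ToDeviceMessages[] {
    const batches: ToDeviceMessages[] = [];
    for (let i = 0; i < targets.length; i += batchSize) {
        const messages: ToDeviceMessages = {};
        for (const { userId, deviceId, content } of targets.slice(i, i + batchSize)) {
            (messages[userId] ??= {})[deviceId] = content;
        }
        // Each batch would be sent as the `messages` field of one
        // PUT /sendToDevice/{eventType}/{txnId} request.
        batches.push(messages);
    }
    return batches;
}
```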

@uhoreg
Member

uhoreg commented Dec 29, 2020

> Claiming the keys takes forever:

Yup, that's the long timeout from fetching keys over federation, and basically means that some other homeserver is timing out with its reply. It's supposed to also do a keys claim with a shorter (2s) timeout first, and the long-timeout one isn't supposed to hold up the message sending. But maybe it only does the longer timeout one when it's automatically creating sessions when you start typing a message? It's been a while since I've looked at the code. (#11836)
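
For illustration, a rough sketch of that two-phase claim, assuming a stand-in `claimOneTimeKeys` wrapper around `POST /_matrix/client/v3/keys/claim` (whose body accepts a `timeout` in milliseconds); the real matrix-js-sdk call and its retry handling differ:

```ts
// userId -> deviceId -> key algorithm (e.g. "signed_curve25519")
type KeyClaims = Record<string, Record<string, string>>;

// Stand-in for POSTing /_matrix/client/v3/keys/claim with
// { one_time_keys: claims, timeout: timeoutMs }.
async function claimOneTimeKeys(claims: KeyClaims, timeoutMs: number): Promise<KeyClaims> {
    return {};
}

async function claimWithFallback(claims: KeyClaims): Promise<KeyClaims> {
    // First pass: short timeout so slow federated homeservers don't stall the send.
    const quick = await claimOneTimeKeys(claims, 2000);

    // Second pass: claim again with the default (longer) timeout in the
    // background, so it never holds up the message being sent.
    void claimOneTimeKeys(claims, 10000).catch(() => {
        // Failures here only delay sessions for devices on slow servers.
    });

    return quick;
}
```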

> There are also slightly-above-half-second calls to broadcast that keys are being withheld, which feels excessive given there shouldn't be any withheld keys?

"Withheld" could also be due to failures in creating olm sessions, which could be due to some devices having run out of one-time keys (which is likely if you sent a message large public room), or due to some servers failing to send keys (which seems likely given that the keys claim call took >10s, which means that some servers have timed out).

Another thing that I had been thinking about was batching up some of the IndexedDB operations so that, for example, it fetches the olm sessions for multiple devices at a time, rather than fetching them individually. My suspicion is that it won't produce enough speedup to be worth the effort (since it will probably involve lots of refactoring and increasing code complexity), but maybe your magic graphs can give some more insight.
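
For illustration, a minimal sketch of that batching idea: issue all the session reads inside one IndexedDB transaction instead of one transaction per device. The `sessions` store name and per-device key scheme are assumptions about the schema, not the real crypto store layout:

```ts
// Illustrative only: fetch olm sessions for many devices with a single
// readonly IndexedDB transaction.
function getSessionsBatch(db: IDBDatabase, deviceKeys: string[]): Promise<unknown[]> {
    return new Promise((resolve, reject) => {
        const txn = db.transaction("sessions", "readonly");
        const store = txn.objectStore("sessions");
        const results: unknown[] = [];

        for (const key of deviceKeys) {
            const req = store.get(key);
            req.onsuccess = () => results.push(req.result);
        }

        // The transaction completes once every get() has finished.
        txn.oncomplete = () => resolve(results);
        txn.onerror = () => reject(txn.error);
    });
}
```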

@jryans removed the defect label Mar 4, 2021
@novocaine added the S-Critical (Prevents work, causes data loss and/or has no workaround) and O-Uncommon (Most users are unlikely to come across this or unexpected workflow) labels Aug 5, 2021
@novocaine added the S-Major (Severely degrades major functionality or product features, with no satisfactory workaround) label and removed the S-Critical (Prevents work, causes data loss and/or has no workaround) label Aug 25, 2021
@novocaine
Contributor

Downgrading, as this does not prevent work; it merely makes things slower.

@novocaine
Contributor

novocaine commented Aug 25, 2021

Not clear to me from this issue:

  • How much of the speed problem could be resolved by changes to EW?
  • What is a 'massive room'?
  • What is the ultimate impact on message delivery latency?

@turt2live don't suppose you have any of these details?

@turt2live
Member Author

This issue is a bit of PS-sponsored work and has context that GitHub can't surface (nor can I share it here). To answer as best I can:

> How much of the speed problem could be resolved by changes to EW?

There are significant gains that can be achieved by changing element-web, though some are down to chosen technologies, design patterns, etc. The biggest one is storing outbound sessions to reduce the impact of having to send out device messages, though the device-message approach may need spec changes.

> What is a 'massive room'?

In the context of this issue, 1000+ devices (roughly 300 users). The PS-sensitive side of this is focused on even larger rooms; however, we needed a line in the sand to measure reliably against. At the time, this was the megolm test room, with thousands of devices available.

> What is the ultimate impact on message delivery latency?

Minutes, with a feeling of it taking days. This is also seen in large rooms like Element Internal, where sending a message can take 2 minutes on a strong CPU/server, or 5 minutes on a weak one.

@richvdh
Member

richvdh commented Feb 14, 2023

Duplicate of #15476.

@richvdh closed this as completed Feb 14, 2023