Placeholder: e2ee is slow in massive rooms #16043

Closed

turt2live opened this issue Dec 28, 2020 · 7 comments
Labels
A-E2EE, A-Performance, O-Uncommon (Most users are unlikely to come across this or unexpected workflow), S-Major (Severely degrades major functionality or product features, with no satisfactory workaround), T-Defect, Z-Chronic

Comments

@turt2live
Member

No description provided.

@turt2live self-assigned this Dec 28, 2020
@turt2live
Member Author

Sending test messages in the megolm test room.

Initial observations:

Claiming the keys takes forever:
[screenshot]

There's a bunch of half-second to_device message calls. Might be able to stack these?
[screenshot]

There are also slightly-above-half-second calls to broadcast that keys are being withheld, which feels excessive given there shouldn't be any withheld keys?
[screenshot]

The actual message send took 600ms:
[screenshot]

ensureOlmSessionsForDevices is a bit slow (the red bar on the Task indicates overtime):
[screenshot]

There are seemingly thousands of calls into WASM (presumably Olm), all of which take 1-2ms each:
[screenshot] (this is about 1.76ms)

The WASM calls (and the long ensureOlmSessionsForDevices):
[screenshot]

It takes 2ms for some devices to be encrypted:
[screenshot]

... while others (most) take about 0.2ms:
[screenshot]

There is a lot of unpickling going on:
[screenshot]

Rough timeline for the web requests:
[screenshot]

The CPU spikes a bit with each encryption, but nothing terribly dangerous (same timeframe as above):
[screenshot]

The hill of ~50% CPU appears to be ensuring the devices exist and the initial encryption. We might not be stacking the encryption up high enough?
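
To make "stacking the encryption up higher" concrete, here is a minimal TypeScript sketch of running the per-device Olm encryption in bounded-concurrency chunks. The `DeviceInfo` shape, the `encryptForDevice` helper, and the chunk size of 20 are illustrative assumptions, not the actual matrix-js-sdk code:

```ts
// Illustrative only: encryptForDevice stands in for the real per-device Olm
// encryption step; it is not the matrix-js-sdk API.
type DeviceInfo = { userId: string; deviceId: string };

async function encryptForDevice(device: DeviceInfo, payload: object): Promise<void> {
    // ... per-device Olm encryption would happen here ...
}

async function encryptForAllDevices(
    devices: DeviceInfo[],
    payload: object,
    concurrency = 20,
): Promise<void> {
    // Walk the device list in fixed-size chunks so several encryptions are
    // in flight at once without overwhelming the event loop.
    for (let i = 0; i < devices.length; i += concurrency) {
        const chunk = devices.slice(i, i + concurrency);
        await Promise.all(chunk.map((device) => encryptForDevice(device, payload)));
    }
}
```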

@turt2live
Member Author

We can probably make a bunch of this more concurrent and save session keys somewhere safe, leading to faster messages after reloading the app. We can also probably increase the number of devices per to_device message when sending out keys, and defer the withheld notifications to after the message is actually sent.
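
As a rough sketch of the "more devices per to_device message" part, the per-device payloads could be packed into larger `messages` maps, each of which becomes one `PUT /_matrix/client/v3/sendToDevice/{eventType}/{txnId}` body. The batch size of 100 and the `ToDeviceTarget`/`buildToDeviceBatches` names are assumptions for illustration, not the real client code:

```ts
// Illustrative only: pack many target devices into each to_device request
// body instead of sending lots of small requests.
interface ToDeviceTarget {
    userId: string;
    deviceId: string;
    content: Record<string, unknown>;
}

type ToDeviceMessages = Record<string, Record<string, Record<string, unknown>>>;

function buildToDeviceBatches(targets: ToDeviceTarget[], batchSize = 100): ToDeviceMessages[] {
    const batches: ToDeviceMessages[] = [];
    for (let i = 0; i < targets.length; i += batchSize) {
        const messages: ToDeviceMessages = {};
        for (const { userId, deviceId, content } of targets.slice(i, i + batchSize)) {
            (messages[userId] ??= {})[deviceId] = content;
        }
        // Each batch would be sent as the `messages` field of one
        // PUT /sendToDevice/{eventType}/{txnId} request.
        batches.push(messages);
    }
    return batches;
}
```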

@uhoreg
Member

uhoreg commented Dec 29, 2020

> Claiming the keys takes forever:

Yup, that's the long timeout from fetching keys over federation, and basically means that some other homeserver is timing out with its reply. It's supposed to also do a keys claim with a shorter (2s) timeout first, and the long-timeout one isn't supposed to hold up the message sending. But maybe it only does the longer timeout one when it's automatically creating sessions when you start typing a message? It's been a while since I've looked at the code. (#11836)
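
For illustration, a rough sketch of that two-phase claim, assuming a stand-in `claimOneTimeKeys` wrapper around `POST /_matrix/client/v3/keys/claim` (whose body accepts a `timeout` in milliseconds); the real matrix-js-sdk call and its retry handling differ:

```ts
// userId -> deviceId -> key algorithm (e.g. "signed_curve25519")
type KeyClaims = Record<string, Record<string, string>>;

// Stand-in for POSTing /_matrix/client/v3/keys/claim with
// { one_time_keys: claims, timeout: timeoutMs }.
async function claimOneTimeKeys(claims: KeyClaims, timeoutMs: number): Promise<KeyClaims> {
    return {};
}

async function claimWithFallback(claims: KeyClaims): Promise<KeyClaims> {
    // First pass: short timeout so slow federated homeservers don't stall the send.
    const quick = await claimOneTimeKeys(claims, 2000);

    // Second pass: claim again with the default (longer) timeout in the
    // background, so it never holds up the message being sent.
    void claimOneTimeKeys(claims, 10000).catch(() => {
        // Failures here only delay sessions for devices on slow servers.
    });

    return quick;
}
```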

> There are also slightly-above-half-second calls to broadcast that keys are being withheld, which feels excessive given there shouldn't be any withheld keys?

"Withheld" could also be due to failures in creating olm sessions, which could be due to some devices having run out of one-time keys (which is likely if you sent a message large public room), or due to some servers failing to send keys (which seems likely given that the keys claim call took >10s, which means that some servers have timed out).

Another thing that I had been thinking about was batching up some of the IndexedDB operations so that, for example, it fetches the olm sessions for multiple devices at a time, rather than fetching them individually. My suspicion is that it won't produce enough speedup to be worth the effort (since it will probably involve lots of refactoring and increasing code complexity), but maybe your magic graphs can give some more insight.
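
For illustration, a minimal sketch of that batching idea: issue all the session reads inside one IndexedDB transaction instead of one transaction per device. The `sessions` store name and per-device key scheme are assumptions about the schema, not the real crypto store layout:

```ts
// Illustrative only: fetch olm sessions for many devices with a single
// readonly IndexedDB transaction.
function getSessionsBatch(db: IDBDatabase, deviceKeys: string[]): Promise<unknown[]> {
    return new Promise((resolve, reject) => {
        const txn = db.transaction("sessions", "readonly");
        const store = txn.objectStore("sessions");
        const results: unknown[] = [];

        for (const key of deviceKeys) {
            const req = store.get(key);
            req.onsuccess = () => results.push(req.result);
        }

        // The transaction completes once every get() has finished.
        txn.oncomplete = () => resolve(results);
        txn.onerror = () => reject(txn.error);
    });
}
```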

@jryans removed the defect label Mar 4, 2021
@novocaine added the S-Critical (Prevents work, causes data loss and/or has no workaround) and O-Uncommon (Most users are unlikely to come across this or unexpected workflow) labels Aug 5, 2021
@novocaine added the S-Major (Severely degrades major functionality or product features, with no satisfactory workaround) label and removed the S-Critical (Prevents work, causes data loss and/or has no workaround) label Aug 25, 2021
@novocaine
Contributor

Downgrading, as this does not prevent work; it merely makes things slower.

@novocaine
Contributor

novocaine commented Aug 25, 2021

Not clear to me from this issue:

  • How much of the speed problem could be resolved by changes to EW?
  • What is a 'massive room'?
  • What is the ultimate impact on message delivery latency?

@turt2live don't suppose you have any of these details?

@turt2live
Member Author

This issue is a bit of PS-sponsored work and has context that GitHub can't surface (nor can I share it here). To answer as best I can:

> How much of the speed problem could be resolved by changes to EW?

There are significant gains that can be achieved by changing element-web, though some are down to chosen technologies, design patterns, etc. The biggest one is storing outbound sessions to reduce the impact of having to send out device messages, though the device-message approach may need spec changes.

> What is a 'massive room'?

In the context of this issue, 1000+ devices (roughly 300 users). The PS-sensitive side of this is focused on even larger rooms; however, we needed a line in the sand to measure reliably against. At the time, this was the megolm test room, with thousands of devices available.

> What is the ultimate impact on message delivery latency?

Minutes, with a feeling of it taking days. This is also seen in large rooms like Element Internal, where sending a message can take 2 minutes on a strong CPU/server, or 5 minutes on a weak one.

@richvdh
Member

richvdh commented Feb 14, 2023

Duplicate of #15476.

@richvdh closed this as completed Feb 14, 2023