Faster joins: fixes to device list tracking #12993

richvdh · 2022-06-09T06:28:28Z

The device-list-tracking code makes lots of requests to methods like get_users_in_room and get_rooms_for_user, which are likely to return completely bogus results during a faster join. This brings danger of bugs where we think we have up-to-date device lists for particular users, but actually don't, leading to E2EE failures.

So, the work needed here is to (a) understand how device list tracking works; (b) ensure that it will end up correct (even if it's not entirely correct during a faster join).

The text was updated successfully, but these errors were encountered:

erikjohnston · 2022-07-29T13:00:41Z

I had a quick look at the code, and I think there are broadly three flows we care about here.

First, when we receive a device list update for a remote user we check if any of local users share a room with them, dropping the update if they don't. If we're doing a partial join with a room for that user, either:

we already share a room with that user and the update will be handled properly anyway; or
we don't share a room and so drop the update. This is fine as the database won't have marked the fact we're tracking the remote user's devices.

Secondly, when a local user queries a remote user's devices if the remote user is in the room we're partially joining then one of:

the server already has the remote user's devices cached, which we know must be up to date as to have cached those devices the server must already share a room with those users, and so we can just return the cached devices; or
the server already shares a room with the remote user, but has no cached devices, at which point it fetches the remote user's devices, caches them in the DB and returns them; or
the server doesn't already share a room with the remote user, and so always fetches the remote user's devices, doesn't cache them, and returns them to the user. Note that the local user's device also won't think it shares a room with the given user, so won't think its tracking the remote user.

In all three of those cases I think the current handling results in the correct behaviour.

Finally, when a local user adds a new device we need to send out a device update to all remote servers that think they share a room with that local user. For servers that are in the partially joined room we need to ensure that they get the device list update. If the user already share a room with the remote server then we'll already correctly share the device list update. Otherwise, we need to do either:

if we know the set of servers in the room then use that to send out device list updates; or
once the server has fully joined the room send out any device list updates for the local users in the room that happened while we were joining (we have a handy table for tracking local device list updates in a room: device_lists_changes_in_room).

TL;DR: I think the only thing we need to change is the last point, where we need to correctly send out any device list updates for the local user to servers in the partially joined room. The place where we currently do the logic is:

synapse/synapse/handlers/device.py

Lines 694 to 698 in 24ef146

    
           # Ignore any users that aren't ours 
        
           if self.hs.is_mine_id(user_id): 
        
               joined_user_ids = await self.store.get_users_in_room(room_id) 
        
               hosts = {get_domain_from_id(u) for u in joined_user_ids} 
        
               hosts.discard(self.server_name)

richvdh · 2022-07-29T14:47:51Z

I think when you say "we don't share a room with a user" you mean "we don't share a non-partial-state room with a user"?

we don't share a room and so drop the update. This is fine as the database won't have marked the fact we're tracking the remote user's devices.

This is ok provided the logic for deciding if we're tracking the remote user's devices is actually consistent. Are you sure there's no way for the database to show we are tracking devices while only sharing partial-state rooms with the user?

Also: are there not flows that happen when a local user joins or leaves a room? What happens if a local user joins a partial-state room? Sounds like a similar situation to the "local user adds new device" flow?

erikjohnston · 2022-08-01T15:52:27Z

I think when you say "we don't share a room with a user" you mean "we don't share a non-partial-state room with a user"?

~~Yup~~. Actually no, I think partial rooms still count if we have persisted any remote joins, as those will appear in get_rooms_for_user? Don't think that changes anything though.

we don't share a room and so drop the update. This is fine as the database won't have marked the fact we're tracking the remote user's devices.

This is ok provided the logic for deciding if we're tracking the remote user's devices is actually consistent. Are you sure there's no way for the database to show we are tracking devices while only sharing partial-state rooms with the user?

We only store cached devices if the server "shares a room with the remote user", which can include partial joins (as we determine this via calling get_rooms_for_user). We then check if we need to clear that table every time the remote user leaves a room (or the server leaves a room that the remote user is in). So, I think that logic is all fine, except we need to be careful of...

Also: are there not flows that happen when a local user joins or leaves a room? What happens if a local user joins a partial-state room? Sounds like a similar situation to the "local user adds new device" flow?

Yeah, I forgot about these flows. The crucial thing here is to ensure that if the servers stops sharing a room with a remote user (i.e. get_rooms_for_user used to return a non-empty set of rooms and now returns the empty set) we correctly clear out any cached devices in the DB.

In particular: if the server fails to join the room and so sends a leave for local users, we must ensure that we check if we need to clear the cached devices of any remote users.

richvdh · 2022-08-01T16:28:04Z

We then check if we need to clear that table every time the remote user leaves a room

... or, presumably, that we realise that we incorrectly accepted a join event for a remote user which should have been rejected. I think it should be fine, but we'll want test cases for this stuff.

squahtx · 2022-09-14T20:47:40Z

Writing this up so it stops falling out of my head. It's largely a recap of the notes above.

There are 3 4 bits of API that need to work:

/keys/query requests on a remote homeserver must not return stale data.
We must send out device list updates to remote homeservers in partially joined rooms, either during the partial join phase or the transition to fully joined.
We can only estimate the remote homeservers in a partially joined room, based on the servers_in_room from the /send_join response and new member events in the room.
- As a first pass, it's easy enough to use the estimate of participating remote homeservers, so that encryption mostly works during the partial join phase.
- However, we may incorrectly accept or reject new member events, which implies we must do something during the transition to a fully joined room.
  As Erik says, we can use the device_lists_changes_in_room table for this.
  But we can still neglect to send a device list update to a user if they join a different room we are in and leave the partially joined room.
  When we transition to fully joined, the user will no longer be in the room and we'll think we have no device list updates to send to that user.
/keys/query requests on the local homeserver must not return stale data.
We must not cache device lists of users whose homeserver is not sending us device list updates.

As Erik says, this is largely fine, since we won't cache the device lists for users on homeservers only in a partially joined room.
- There is a case where we can erroneously think a homeserver is in a room during the partial join phase, then decide they aren't when transitioning to fully joined.
  We have to ensure that we clear any cached device lists when that happens. It's probably a matter of calling _handle_potentially_left_users when transitioning to a fully joined room.
  - Note to self: What happens if we crash after updating the current state but before calling _handle_potentially_left_users?
/sync's device_lists.changed/left must correctly signal to clients when a remote user may have a device list update.
During the partial join phase, this is going to be broken, since we'll think most users aren't in the room.
- When we transition to fully joined, we may have to ignore lazy_load_members and signal that all members in the room have a device list update.
/keys/changes must also work. Annoyingly, this uses different code to /sync and will need testing too.

erikjohnston · 2022-09-20T12:17:50Z

When we transition to fully joined, the user will no longer be in the room and we'll think we have no device list updates to send to that user.

Why won't we see the join/leave in the partially joined room? I'd expect us to get the join in the initial state, and then the leave in the subsequent timelines?

erikjohnston · 2022-09-21T09:34:48Z

When we transition to fully joined, we may have to ignore lazy_load_members and signal that all members in the room have a device list update.

Do we have a good way of detecting when this happens in a /sync stream?

Also, I don't think any of the current device list tracking takes into account lazy members?

erikjohnston · 2022-09-21T14:21:45Z

After chatting with @squahtx the plan to fix these issues is:

To fix "/keys/query requests on a remote homeserver must not return stale data."
- Make outgoing device list tracking work after we partial-join #13891
- Tested in Faster room joins: Add edge case tests for outgoing device list updates complement#474
- Store the current device list stream ID when we start a partial join
- When we transition from partial to fully joined, send all device list updates in device_lists_changes_in_room for the room to all servers that are either:
  1. are in the room; or
  2. have had a membership change between the state at the initial join and the new current state.
To fix "/keys/query requests on the local homeserver must not return stale data."
- When transitioning from a partial to full join, call _handle_potentially_left_users (in a transaction) for all users that have gone from joined to no longer joined in the delta between the partial current state and the new complete current state.
- Implemented in Handle the case of remote users leaving a partial join room for device lists #13885
- Tested in Faster room joins: Add edge case tests for device list tracking complement#477
To fix sending changes down /sync and /keys/changes:
- During a partial join store all inbound device list updates even if we don't share a room with the remote user
- When we transition from partial to fully joined, we re-persist (with a new stream ID) any device list updates for users in the room that had a device list update between when the server started the partial join (see first bullet where we store this) and when it completed the partial join
- Tested in Faster room joins: Add happy path tests for device list tracking complement#475
- Implemented in Handle remote device list updates during partial join #13913

squahtx · 2022-09-23T12:28:09Z

Turns out there's a bug where we don't emit device_lists.left entries when we complete a partial state join and discover that a user is actually not in the room: #13886

I'm minded to ignore #13886 for now under the assumption that left entries are less important to get right than changed, since the latter will come down once users share a room again.

squahtx · 2022-09-23T12:56:52Z

Also filed #13887 to capture an edge case where /keys/query could return stale data, which we have decided to ignore for now.

Part of the work for #12993. Signed-off-by: Sean Quah <seanq@matrix.org>

Part of the work for #12993. Once #12993 is fully resolved, we expect `/keys/changes` to behave sensibly when joined to a room with partial state. Signed-off-by: Sean Quah <seanq@matrix.org>

c.f. #12993 (comment), point 3 This stores all device list updates that we receive while partial joins are ongoing, and processes them once we have the full state. Note: We don't actually process the device lists in the same ways as if we weren't partially joined. Instead of updating the device list remote cache, we simply notify local users that a change in the remote user's devices has happened. I think this is safe as if the local user requests the keys for the remote user and we don't have them we'll simply fetch them as normal.

squahtx · 2022-09-28T18:57:06Z

richvdh added A-Federated-Join joins over federation generally suck T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Jun 9, 2022

richvdh added this to the Q3 2022: Faster joins (further work) milestone Jun 9, 2022

richvdh mentioned this issue Jul 15, 2022

Faster joins: audit code paths that do state queries #12643

Closed

richvdh changed the title ~~Faster joins: support E2E rooms~~ Faster joins: fix device list tracking Jul 20, 2022

richvdh added T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. and removed T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. labels Jul 22, 2022

richvdh changed the title ~~Faster joins: fix device list tracking~~ Faster joins: test device list tracking Aug 15, 2022

squahtx self-assigned this Aug 15, 2022

squahtx mentioned this issue Sep 22, 2022

Faster room joins: Send device list updates out to servers in partially joined rooms #13874

Merged

erikjohnston mentioned this issue Sep 23, 2022

Handle the case of remote users leaving a partial join room for device lists #13885

Merged

squahtx pushed a commit that referenced this issue Sep 23, 2022

Faster room joins: Avoid blocking /keys/changes

2fc9cc6

Part of the work for #12993. Signed-off-by: Sean Quah <seanq@matrix.org>

squahtx mentioned this issue Sep 23, 2022

Faster room joins: Avoid blocking /keys/changes #13888

Merged

DMRobertson mentioned this issue Sep 23, 2022

Make outgoing device list tracking work after we partial-join #13891

Closed

DMRobertson mentioned this issue Sep 23, 2022

Add new columns tracking when we partial-joined #13892

Merged

erikjohnston mentioned this issue Sep 26, 2022

Handle remote device list updates during partial join #13913

Merged

2 tasks

squahtx assigned erikjohnston and DMRobertson Sep 27, 2022

richvdh changed the title ~~Faster joins: test device list tracking~~ Faster joins: fixes to device list tracking Sep 28, 2022

kittykat mentioned this issue Oct 3, 2022

Improve time to join a remote room #14030

Closed

16 tasks

erikjohnston closed this as completed Oct 13, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster joins: fixes to device list tracking #12993

Faster joins: fixes to device list tracking #12993

richvdh commented Jun 9, 2022 •

edited

Loading

erikjohnston commented Jul 29, 2022 •

edited

Loading

richvdh commented Jul 29, 2022

erikjohnston commented Aug 1, 2022 •

edited

Loading

richvdh commented Aug 1, 2022

squahtx commented Sep 14, 2022

erikjohnston commented Sep 20, 2022

erikjohnston commented Sep 21, 2022

erikjohnston commented Sep 21, 2022 •

edited by squahtx

Loading

squahtx commented Sep 23, 2022

squahtx commented Sep 23, 2022

squahtx commented Sep 28, 2022 •

edited by erikjohnston

Loading

Faster joins: fixes to device list tracking #12993

Faster joins: fixes to device list tracking #12993

Comments

richvdh commented Jun 9, 2022 • edited Loading

erikjohnston commented Jul 29, 2022 • edited Loading

richvdh commented Jul 29, 2022

erikjohnston commented Aug 1, 2022 • edited Loading

richvdh commented Aug 1, 2022

squahtx commented Sep 14, 2022

erikjohnston commented Sep 20, 2022

erikjohnston commented Sep 21, 2022

erikjohnston commented Sep 21, 2022 • edited by squahtx Loading

squahtx commented Sep 23, 2022

squahtx commented Sep 23, 2022

squahtx commented Sep 28, 2022 • edited by erikjohnston Loading

What we've done

What we are ignoring for now

richvdh commented Jun 9, 2022 •

edited

Loading

erikjohnston commented Jul 29, 2022 •

edited

Loading

erikjohnston commented Aug 1, 2022 •

edited

Loading

erikjohnston commented Sep 21, 2022 •

edited by squahtx

Loading

squahtx commented Sep 28, 2022 •

edited by erikjohnston

Loading