-
-
Notifications
You must be signed in to change notification settings - Fork 2.1k
Faster joins: fixes to device list tracking #12993
Comments
I had a quick look at the code, and I think there are broadly three flows we care about here. First, when we receive a device list update for a remote user we check if any of local users share a room with them, dropping the update if they don't. If we're doing a partial join with a room for that user, either:
Secondly, when a local user queries a remote user's devices if the remote user is in the room we're partially joining then one of:
In all three of those cases I think the current handling results in the correct behaviour. Finally, when a local user adds a new device we need to send out a device update to all remote servers that think they share a room with that local user. For servers that are in the partially joined room we need to ensure that they get the device list update. If the user already share a room with the remote server then we'll already correctly share the device list update. Otherwise, we need to do either:
TL;DR: I think the only thing we need to change is the last point, where we need to correctly send out any device list updates for the local user to servers in the partially joined room. The place where we currently do the logic is: synapse/synapse/handlers/device.py Lines 694 to 698 in 24ef146
|
I think when you say "we don't share a room with a user" you mean "we don't share a non-partial-state room with a user"?
This is ok provided the logic for deciding if we're tracking the remote user's devices is actually consistent. Are you sure there's no way for the database to show we are tracking devices while only sharing partial-state rooms with the user? Also: are there not flows that happen when a local user joins or leaves a room? What happens if a local user joins a partial-state room? Sounds like a similar situation to the "local user adds new device" flow? |
We only store cached devices if the server "shares a room with the remote user", which can include partial joins (as we determine this via calling
Yeah, I forgot about these flows. The crucial thing here is to ensure that if the servers stops sharing a room with a remote user (i.e. In particular: if the server fails to join the room and so sends a leave for local users, we must ensure that we check if we need to clear the cached devices of any remote users. |
... or, presumably, that we realise that we incorrectly accepted a join event for a remote user which should have been rejected. I think it should be fine, but we'll want test cases for this stuff. |
Writing this up so it stops falling out of my head. It's largely a recap of the notes above. There are
|
Why won't we see the join/leave in the partially joined room? I'd expect us to get the join in the initial state, and then the leave in the subsequent timelines? |
Do we have a good way of detecting when this happens in a Also, I don't think any of the current device list tracking takes into account lazy members? |
After chatting with @squahtx the plan to fix these issues is:
|
Turns out there's a bug where we don't emit I'm minded to ignore #13886 for now under the assumption that |
Also filed #13887 to capture an edge case where |
Part of the work for #12993. Signed-off-by: Sean Quah <seanq@matrix.org>
c.f. #12993 (comment), point 3 This stores all device list updates that we receive while partial joins are ongoing, and processes them once we have the full state. Note: We don't actually process the device lists in the same ways as if we weren't partially joined. Instead of updating the device list remote cache, we simply notify local users that a change in the remote user's devices has happened. I think this is safe as if the local user requests the keys for the remote user and we don't have them we'll simply fetch them as normal.
The device-list-tracking code makes lots of requests to methods like
get_users_in_room
andget_rooms_for_user
, which are likely to return completely bogus results during a faster join. This brings danger of bugs where we think we have up-to-date device lists for particular users, but actually don't, leading to E2EE failures.So, the work needed here is to (a) understand how device list tracking works; (b) ensure that it will end up correct (even if it's not entirely correct during a faster join).
The text was updated successfully, but these errors were encountered: