This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Presence should send less presence when joining a room #15818

Closed
wants to merge 8 commits

Conversation

realtyem
Contributor

@realtyem realtyem commented Jun 22, 2023

Currently, when a local user joins a room, presence is sent to all other servers in that room.

From my understanding, if this local user already shares another room with a remote user (and therefore with that user's server), then that server is already receiving presence from this local user. If that is correct, then removing such a server from the list of servers that should be sent new presence data is as simple as determining whether the rooms this local user shares with the other users in the new room are already known to the local server.

A function already exists to do this very thing (with the caveat that it will return this new room as well as the other existing rooms): get_mutual_rooms_between_users() was added as part of MSC2666 (which is still marked as unstable, I believe) and uses get_rooms_for_user() under the hood. That turned out to be a hugely inefficient way to get rooms for two users at a time. Let's instead use the batching version, get_rooms_for_users(), to collect the room data for all the currently relevant users in one go, which turned out to be faster by almost two orders of magnitude (probably because of the internal caching).
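As a rough illustration of the two approaches (a sketch only; the store methods mirror Synapse's storage layer, but treat the exact signatures as assumptions rather than code from this PR):

```python
from typing import Dict, FrozenSet, Iterable


async def shares_a_room_pairwise(store, user_a: str, user_b: str) -> bool:
    # Per-pair style: two lookups for every pair of users compared.
    rooms_a = await store.get_rooms_for_user(user_a)
    rooms_b = await store.get_rooms_for_user(user_b)
    return bool(set(rooms_a) & set(rooms_b))


async def rooms_by_user(store, user_ids: Iterable[str]) -> Dict[str, FrozenSet[str]]:
    # Batched style: one call up front, after which every "do these two users
    # share a room?" question is a cheap in-memory set intersection.
    return dict(await store.get_rooms_for_users(user_ids))
```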

As part of the existing process for determining whether a host should be sent presence:

  1. The list of users in the room is broken into two sections:
    a. previous occupants
    b. newly joined users
  2. Each of these lists is then iterated with a for loop (so both of the following are done twice):
    a. local users - added to a list to be dealt with separately, as this is 'in-house' (and not part of this work)
    b. remote users - the host (taken from the user_id) is added to a list of servers to send presence to (this is where this work focuses)

Items under 2b can be checked against the Dict[user_id, Set[room_ids]] returned from get_rooms_for_users() to verify whether the host should be in this list at all. This feels like it could be a lot of processing, as it would mean checking every single user against every other user. That inefficiency has been reduced by skipping the check when the host being evaluated is already in the list of remote servers to send presence to (this is much better now that we do the processing on the spot instead of leaning on the database).

For example, say we have 10 users, 7 of whom are on m.org and the other 3 on different remote servers. Instead of checking every single m.org user in the database to see if they share another room with one of these users, once m.org is in the host list we just bypass it. This is not a perfect solution, as the one user who actually shares another room might be the last one checked, but they could just as easily be the first; such is the randomness of unordered sets. This optimization is still in place, but since we are no longer doing individual database hits for pairs of users, that particular situation is no longer relevant.
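Roughly, the filtering described above could look like the sketch below. This is illustrative only (the helper name and arguments are mine, not the diff in this PR); it assumes rooms_by_user is the Dict[user_id, Set[room_ids]] returned by a single batched get_rooms_for_users() call covering the remote users in the room.

```python
from typing import AbstractSet, Dict, Set


def hosts_needing_presence(
    new_room_id: str,
    local_user_rooms: Set[str],
    prev_remote_users: AbstractSet[str],
    rooms_by_user: Dict[str, Set[str]],
) -> Set[str]:
    """Return the remote hosts that are not already receiving our presence."""
    hosts_to_send: Set[str] = set()
    for remote_user_id in prev_remote_users:
        host = remote_user_id.split(":", 1)[1]
        # Shortcut: once a host is already going to receive presence, there is
        # no point checking any more of its users (the m.org example above).
        if host in hosts_to_send:
            continue
        # Rooms (other than the one just joined) that this remote user already
        # shares with the newly joined local user, as far as we know locally.
        shared_elsewhere = (
            rooms_by_user.get(remote_user_id, set()) & local_user_rooms
        ) - {new_room_id}
        # No other shared room means this host is not yet receiving the local
        # user's presence, so it stays in the send list.
        if not shared_elsewhere:
            hosts_to_send.add(host)
    return hosts_to_send
```

As noted above, whether a host with a mix of sharing and non-sharing users ends up in the send list still depends on set iteration order; the worst case is sending presence to a host that was already receiving it.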

As part of this, I added 3 tests to PresenceJoinTestCase:
One to test the behavior I sought to change

  • test_remote_joins_with_previously_shared_rooms_do_not_send_presence

And 2 more to verify I didn't break the 'join-leave-join' presence behavior (plus I didn't see these modeled, so why not?)

  • test_remote_gets_presence_after_local_join_leave_join
  • test_remote_gets_presence_after_remote_join_leave_join

I based this summary of how to handle this on information gleaned from several related issues (and possibly others), as well as things linked inside them.

My gut tells me there is still the potential for a leak here. Suppose that, while verifying a remote user, we find another remote user who doesn't share a room and therefore requires sending presence; a third remote user on that second user's server, checked later, might have shown that the host didn't need presence after all, but because we bailed out early the host never gets removed. Without checking every single user (which means a lot more database hits), I'm not sure there is a way around this. Again, this is no longer the case.

UPDATES:

  • I noticed something interesting while testing this in my scratch branch with metrics attached: the handler does not always have all of the state data at once, and instead receives batches of data in the case of a local user joining a room. This is actually a really nice realization, as it means we aren't necessarily processing the whole room at once.
  • This may not be working as intended or planned (originally or after this PR). I suspect partial state joins to be the culprit, but that is an arcane science I'm not completely familiar with, especially when it comes to calling any notifiers about new state. In any case, it seems to work in our favor. See the pictures below for context.
Pictures of metrics (not included in the PR):

To get these, I left, purged, and then rejoined Matrix HQ. Rejoining took approximately 11 hours (I think; my event persister still seems to be processing what it received 5 hours after that, but it could be something else).

Here is where my join to the room got processed:
room_join_self_process
As you can see, it is only processing me. I am a newly joined user, so I am processed... but then nothing is actually sent out to any remote hosts (because 'previous host after filtered' is empty). I think this is because I've received the state of my own join, but the other users' join state has not been determined/received yet.

While waiting for the Matrix HQ join to continue, in a different room... you can see here how some of the hosts were filtered out:
room_join_morg_mid_process
The interesting thing here is that we are only processing this at all because someone else joined the room. None of those hosts will receive anything but our own PresenceState, as the newly joined user isn't local to this server. But it's nice that the filtering appears to work.

(Different room again.) Worst-case scenario: no remote hosts are filtered out.
room_join_morg_mid_process_worse_case
Again, it's not a local user that just joined, so they will only receive my PresenceState.

Some time later (a different room again, I believe), the local homeserver processes a remote user joining. Again, only my own PresenceState is sent.
room_join_newly_joined_remote_example

Here is what looks to be the final processing of the join to Matrix HQ. This looks correct: lots of newly joined users and hosts, and the filtering is working.
room_join_morg_final_join_process

Pull Request Checklist

Signed-off-by: Jason Little realtyem@gmail.com

…hen joining a room.

This is based on the premise that a user sharing another room with a different user is already receiving presence and can therefore be excluded from this list. Care was taken to maintain the split between previous users and newly joined users, as this split is crucial to a filter used further down to remove 'old presence' data from what is sent.
@realtyem realtyem marked this pull request as ready for review June 22, 2023 21:25
@realtyem realtyem requested a review from a team as a code owner June 22, 2023 21:25
@realtyem realtyem requested a review from H-Shay June 30, 2023 22:25
@H-Shay H-Shay requested review from a team and removed request for H-Shay July 5, 2023 18:40
synapse/handlers/presence.py (Outdated)
for other_user_id in prev_users_excluding_current_user:
# If this host has been seen in a list already, skip it.
if host not in prev_remote_hosts:
if await is_only_one_room_shared(user_id, other_user_id):
Member


I'm not following this, why are we comparing all users previously in the room to see if they share a room? Notably, I think both user_id and other_user_id can be remote users?

I'd have thought that at least one iteration here should be across the newly joined users or hosts?

Contributor Author


I'm not following this, why are we comparing all users previously in the room to see if they share a room?

Excellent question. We want to build a full list of all the remote servers in the room, not the actual users (we shortcut out so as not to hit the database more than necessary). But we want the list to exclude some servers, hence the comparison. The list needs to be both as full as possible and as empty as possible at the same time.

Notably, I think both user_id and other_user_id can be remote users?

Not just can be, they are. It means that we know the host from user_id is in this room, but if other_user_id's host (which we know is in this room, obviously) is in yet another room that we know about because it's federated to this server, then user_id's host can be excluded, since they are already receiving presence. For example:
@me:myserver, @alice:server1 and @bob:server2 are already in Synapse Admins.
@me:myserver, @bob:server2 and @charlie:server2 are in Matrix HQ.
@alice:server1 is joining Matrix HQ.
Because we know server1 is already receiving presence info about @bob:server2, it can be excluded from the list of prev_remote_hosts.
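To make that scenario concrete, here is a tiny self-contained illustration of the exclusion reasoning (the room IDs and helper are made up for the example; this is not code from the PR):

```python
# Rooms the local server (myserver) already knows about before the join.
rooms_known_locally = {
    "!admins:example.org": {"@me:myserver", "@alice:server1", "@bob:server2"},  # Synapse Admins
    "!hq:example.org": {"@me:myserver", "@bob:server2", "@charlie:server2"},    # Matrix HQ
}

joining_user = "@alice:server1"
new_room = "!hq:example.org"


def host(user_id: str) -> str:
    return user_id.split(":", 1)[1]


# Does server1 already appear in a room (other than the one being joined)
# that is federated to this server?
already_federated_elsewhere = any(
    host(member) == host(joining_user)
    for room_id, members in rooms_known_locally.items()
    if room_id != new_room
    for member in members
)

# Prints True: server1 is already in Synapse Admins, so (per the reasoning
# above) it is already receiving presence and can be dropped from
# prev_remote_hosts.
print(already_federated_elsewhere)
```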

In retrospect, perhaps maintaining an additional list of all the hosts we know can be excluded would also be a good idea, to cut down on database hits? However, I see a potential gotcha here: what happens to @charlie:server2? If we maintain an excluded list as well, he would never be checked and therefore (at least at first) would not receive the new presence data (depending, of course, on whether @bob or @charlie is checked first).

I'd have thought that at least one iteration here should be across the newly joined users or hosts?

Newly joined users (either local or remote) are dealt with in the next block below, starting at about line 1496.

It could actually be argued that presence would be retrieved on the first (next?) sync that occurs after the room is fully joined anyway, so why are we bothering with this at all?

Contributor Author


I had an epiphany while at work tonight; the prev_remote_hosts only needs to be built if one of the newly joined users is local. If that's what you were getting at with your question, mission accomplished 😉. I'll do a refactor in a bit.

Contributor Author

@realtyem realtyem Jul 22, 2023


I had an epiphany while at work tonight; the prev_remote_hosts only needs to be built if one of the newly joined users is local. If that's what you were getting at with your question, mission accomplished 😉. I'll do a refactor in a bit.

I still want to make this optimization a thing, but perhaps it's best left to another PR. If that's ok? Nope, never mind. Still need it after all. Testing is great. Can't whittle down the list if we don't have the data. I'm still gonna ponder on this, but right now it's not easily actionable.

…the batching version of get_rooms_for_users() instead and do the processing by hand.
Member

@erikjohnston erikjohnston left a comment


@realtyem is this something that you still think will help after the recent refactoring?

@realtyem
Contributor Author

@realtyem is this something that you still think will help after the recent refactoring?

TL;DR: in the category of reducing federation spam caused by presence, this is a 3 on a priority scale of 1-10 (10 being highest). You may close it for a redo, because of bitrot.

A fair question. In regard to the other work done recently, this has no bearing either way. However, I know a lot more about the internals of the presence system now compared to when this was written, and it stands to reason there may be a better way, especially after the data showed that the entire batch of state for a room doesn't come in at once but in stages. Since the whole point of this bit of work was deduplicating the presence sent to a server for the same join, it may not hold relevance in its current form (truthfully, I had rather forgotten about this entirely). Either way, age has not treated it well; it's fair to close, and I'll try to revisit it later.

@realtyem
Contributor Author

Duh, it's my own PR. I'll close it myself 🤦‍♂️

@realtyem realtyem closed this Sep 29, 2023