Performance of device list updates could be improved (aka /login is unusably slow) #7721
Comments
I think the next steps here are to add a cache 👍
In #12048 I had an issue where logging in with a new device pretty much kills Synapse for a while because of the device notify on login. Just batching up and slowing down sending the notifies would help a lot.
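As a rough illustration of that batching idea (a sketch only, not Synapse's actual federation sender; `send_device_list_edu`, the flush interval, and the class itself are assumed names), the notifier below collects device changes for a short window and sends one notification per destination instead of one per change:

```python
# Illustrative sketch only: batch device-change notifications and flush them
# on a timer, rather than notifying every destination on every login.
import asyncio
from collections import defaultdict


class BatchedDeviceNotifier:
    def __init__(self, send_device_list_edu, flush_interval: float = 5.0):
        # send_device_list_edu(destination, user_ids) is an assumed callable
        # that sends a single device-list notification to one remote server.
        self._send = send_device_list_edu
        self._flush_interval = flush_interval
        self._pending: dict[str, set[str]] = defaultdict(set)  # destination -> user_ids
        self._lock = asyncio.Lock()

    async def notify_device_change(self, user_id: str, destinations: list[str]) -> None:
        # Just record the change; the actual sending happens in the flush loop.
        async with self._lock:
            for destination in destinations:
                self._pending[destination].add(user_id)

    async def run(self) -> None:
        # Periodically flush everything that has accumulated since the last pass.
        while True:
            await asyncio.sleep(self._flush_interval)
            async with self._lock:
                pending, self._pending = self._pending, defaultdict(set)
            for destination, user_ids in pending.items():
                await self._send(destination, sorted(user_ids))
```

The point of the sketch is the shape of the change: a login only appends to an in-memory queue, and the expensive fan-out happens at a bounded rate in the background.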
It's worth mentioning that this effectively prevents logging in with Hydrogen (cf element-hq/hydrogen-web#699)
Oh, and FluffyChat, which seems to do the same thing
Note that there seem to be two related but distinct problems: the cost of sending out device list updates to every interested remote server when a new device is added, and the cost of answering the resulting `GET /_matrix/federation/v1/user/devices/{userId}` requests from those servers.
@richvdh ooi what were you seeing? Or was it a combination of the two?
I was seeing all requests to my HS responding with a 502, due to my reverse-proxy timing out. The mean reactor tick time is about 15 seconds, which means Synapse is essentially unable to respond to anything. I think it's very busy sending out federation updates.
related: #5373 |
@erikjohnston In #12048 I disabled
#12132 will hopefully help a bit here, as it smears out sending device list updates a little. However, it won't help with the other issues we're seeing. My current thinking is that we should get rid of calculating the full set of users and servers that need to be told about a device list change up front, and instead record the rooms the user was in at the time of the change. This does lead to the same amount of work being done overall (as the set of remote servers still has to be worked out eventually), but it lets that work be deferred off the critical path and batched up.
We can also do a vaguely similar thing for presence, i.e. get rid of the equivalent up-front calculation there. This can be done in a backwards compatible way by recording in the DB both device list updates by server and by room at the same time, then in a future release stopping the former and bumping the minimum DB schema version.
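A minimal sketch of that room-based bookkeeping, under an illustrative schema and an assumed `get_servers_in_room` helper (the real Synapse schema, storage layer, and workers differ): writes on the login path just record (user, device, room) rows, and a deferred pass resolves rooms to destination servers later.

```python
# Sketch of deferring "who needs to hear about this device change" by storing
# the rooms the user was in, and resolving rooms -> servers later.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE device_lists_changes_in_room "
    "(user_id TEXT, device_id TEXT, room_id TEXT, converted_to_destinations BOOLEAN)"
)


def record_device_change(user_id: str, device_id: str, room_ids: list[str]) -> None:
    # Cheap write on the login path: one row per room, no server fan-out yet.
    conn.executemany(
        "INSERT INTO device_lists_changes_in_room VALUES (?, ?, ?, 0)",
        [(user_id, device_id, room_id) for room_id in room_ids],
    )


def convert_changes_to_destinations(get_servers_in_room) -> dict[str, set[tuple[str, str]]]:
    # Deferred/background step: turn unconverted room rows into per-server pokes.
    # get_servers_in_room(room_id) -> set of server names is an assumed helper.
    pokes: dict[str, set[tuple[str, str]]] = {}
    rows = conn.execute(
        "SELECT user_id, device_id, room_id FROM device_lists_changes_in_room "
        "WHERE converted_to_destinations = 0"
    ).fetchall()
    for user_id, device_id, room_id in rows:
        for server in get_servers_in_room(room_id):
            pokes.setdefault(server, set()).add((user_id, device_id))
    conn.execute("UPDATE device_lists_changes_in_room SET converted_to_destinations = 1")
    return pokes
```

The same amount of room-to-server resolution still happens, but it can run in batches in the background rather than blocking the login request.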
I literally can't log in any more thanks to this. It would be very much appreciated if this could be prioritised...
This is a first step in dealing with #7721. The idea is basically that rather than calculating the full set of users a device list update needs to be sent to up front, we instead simply record the rooms the user was in at the time of the change. This will allow a few things:

1. we can defer calculating the set of remote servers that need to be poked about the change; and
2. during `/sync` and `/keys/changes` we can avoid calculating the users who share rooms with other users, and instead just look at the rooms that have changed.

However, care needs to be taken to correctly handle server downgrades. As such this PR writes to both the `device_lists_changes_in_room` and `device_lists_outbound_pokes` tables synchronously. In a future release we can then bump the database schema compat version to `69`, at which point we can assume that the new `device_lists_changes_in_room` table exists and is handled.

There is a temporary option to disable writing to `device_lists_outbound_pokes` synchronously, allowing us to test that the new code path works (and, by implication, that upgrading to a future release and then downgrading to this one will work correctly).

Note: Ideally we'd do the calculation of rooms to servers on a worker (e.g. the background worker), but currently only the master can write to the `device_lists_outbound_pokes` table.
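A hedged illustration of the dual-write strategy the PR describes (only the two table names come from the PR text; the column layout, the `txn` cursor, and the `WRITE_OUTBOUND_POKES_SYNCHRONOUSLY` toggle are made-up stand-ins for the temporary option mentioned above):

```python
# Sketch of the transition logic: dual-write until the schema compat version
# guarantees that all supported releases understand the new per-room table.
SCHEMA_COMPAT_VERSION = 68  # below 69: a downgraded server may not read the new table

# Hypothetical toggle mirroring the temporary option mentioned in the PR.
WRITE_OUTBOUND_POKES_SYNCHRONOUSLY = True


def store_device_list_change(txn, user_id, device_id, room_ids, destinations):
    # txn is assumed to be a DB-API cursor running inside a transaction.

    # New-style bookkeeping: record the change against each room the user is in.
    for room_id in room_ids:
        txn.execute(
            "INSERT INTO device_lists_changes_in_room (user_id, device_id, room_id) "
            "VALUES (?, ?, ?)",
            (user_id, device_id, room_id),
        )

    # Old-style bookkeeping: also write per-destination pokes so that a
    # downgraded server (which only reads this table) keeps working.
    if SCHEMA_COMPAT_VERSION < 69 and WRITE_OUTBOUND_POKES_SYNCHRONOUSLY:
        for destination in destinations:
            txn.execute(
                "INSERT INTO device_lists_outbound_pokes (destination, user_id, device_id) "
                "VALUES (?, ?, ?)",
                (destination, user_id, device_id),
            )
```

Turning the flag off exercises the new room-based path on its own, which is what gives confidence that a later release can stop the dual write entirely.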
1.58.0rc1 should now include the fix described in #7721 (comment), so I think this is fixed from a user perspective. Logging in on a large account still takes a lot of CPU and DB though, so we may want to investigate that, but for now I think we can close this issue.
I still cannot log in with an account that is in many rooms. I have 8 GB of RAM and 8 cores; is there anything in the logs that I can keep an eye out for? I tried playing around with
I was investigating why logging in with a new device was leading to poor performance on my account. It seems to be due to both sending the device list update to other homeservers and other homeservers in turn hitting `GET /_matrix/federation/v1/user/devices/{userId}`.

Looking a bit further, it looks like both code paths end up calling `_get_e2e_device_keys_txn` in the database code, which seems to be pretty expensive, and is called multiple times in a row, yet isn't cached. This results in quite bad performance.

On the bottom-right graph, the `_get_e2e_device_keys_txn` transaction (in blue) originates from sending out the `m.device_list_update` EDUs, while the `get_devices_with_keys_by_user` one (in purple) originates from responding to `GET /_matrix/federation/v1/user/devices/{userId}`. Both transactions involve calling the `_get_e2e_device_keys_txn` function (in fact, the first one does only that).

On the bottom-left graph these two transactions are coloured respectively in green and purple.