This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse is unusably slow after presence enabled. #14090

Open
DMRobertson opened this issue Oct 6, 2022 · 4 comments
Labels
A-Memory-Usage · A-Presence · O-Occasional (affects or can be seen by some users regularly or most users rarely) · S-Major (major functionality / product severely impaired, no satisfactory workaround) · T-Defect (bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

@DMRobertson
Contributor

DMRobertson commented Oct 6, 2022

This is an issue to track the problems happening on @tomsisk's deployment of Synapse.

Previously discussed on #12160 (comment), #13955 and #13901.

We suspect a regression between 1.60 and 1.64; see #12160 (comment).


I updated this last night and used it for about an hour with a couple of users. There was nothing of note: presence was working and the service was responsive. So I let the service run overnight with presence on and we never saw a huge memory spike, but not many people are online during off hours. By the morning the service was still running, albeit extremely slowly. Nothing weird in CPU or memory usage, but it was taking up to 60 seconds to send a message, log in, etc. It was running but unusable, so I had to turn presence off. The slowness is what we normally see after just a couple of minutes if we turn presence on during the day. In those cases I've never been able to see a huge memory spike like we have in the past, but it seems those may have been connected to specific users.

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.

Back to the topic at hand: even with presence off, there still appears to be a small leak somewhere. I'm hoping these graphs help.
Here's a spike in memory.
[graph: spike in process memory usage]

This coincides with a spike in _handle_new_device_update_async calls:
[graph: spike in _handle_new_device_update_async calls]

Now if we look at the cache memory, we see an increase in getEvent, but after 30 minutes the metrics say this memory was released. There is no similar drop in the first graph.
[graph: cache memory by cache name, showing the getEvent increase and later release]

Originally posted by @tomsisk in #13955 (comment)

@squahtx added the S-Major, T-Defect, O-Occasional, A-Presence and A-Memory-Usage labels on Oct 6, 2022
@DMRobertson
Contributor Author

Some random thoughts I had:

  • I think I've heard it said that CPython's allocator doesn't release memory back to the OS, which might explain the discrepancy between the two graphs. You might find they match up more closely if you use jemalloc. There are some very brief notes about it here.

  • It would be interesting to see what Synapse is doing when it's unusually slow, e.g. by taking a flamegraph sample using py-spy. (A rough sketch of both of these suggestions is below.)
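
A rough sketch of both suggestions, assuming py-spy is installed next to Synapse (e.g. via pip) and that the main process can be found with pgrep; the process name, jemalloc path and config path below are illustrative, not taken from this deployment:

    # One-off dump of what every thread is doing right now.
    sudo py-spy dump --pid "$(pgrep -of 'synapse.app.homeserver')"

    # Sample for 60 seconds and write a flamegraph SVG to inspect or attach here.
    sudo py-spy record --pid "$(pgrep -of 'synapse.app.homeserver')" \
        --duration 60 --output synapse-flamegraph.svg

    # To try jemalloc, preload it before starting Synapse. The library path
    # varies by distro; this one is typical for Debian/Ubuntu on amd64.
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
    python -m synapse.app.homeserver --config-path /etc/matrix-synapse/homeserver.yaml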

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.

Not sure. Would need to ask for input from other Synapse authors.

@richvdh
Member

richvdh commented Oct 6, 2022

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.

Not sure. Would need to ask for input from other Synapse authors.

Assuming you're using postgres, this should be ok: if you clear it out, then a few users might appear with the wrong presence for a while, but nothing worse. Worth keeping a backup though.
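
For example (just a sketch, assuming the database is named synapse and you can connect as the Synapse database user; stopping Synapse first avoids writes landing mid-truncate):

    # Keep a copy of just this table before touching it.
    pg_dump --dbname=synapse --table=presence_stream --file=presence_stream.backup.sql

    # Then clear it out.
    psql --dbname=synapse -c 'TRUNCATE TABLE presence_stream;'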

@tomsisk

tomsisk commented Oct 7, 2022

Truncating the presence_stream table unsurprisingly did not help.

Here is a py-spy flamegraph with presence on:
[py-spy flamegraph: synapse_no_logging]

It could be a red herring, but the do_users_share_a_room transaction sticks out here. I turned DEBUG-level logging on and it appears to be running this transaction a lot, with about a thousand users included in each one. Commit 43adf25 also landed in the release where we started having issues.

I also see a lot of this in the logs, only when presence is on, but maybe this is just a symptom:

2022-10-07 17:31:07,614 - synapse.util.wheel_timer - 64 - WARNING - GET-8347 - Inserting into a wheel timer that hasn't been read from recently. Item: <redacted>
2022-10-07 17:31:07,614 - synapse.util.wheel_timer - 64 - WARNING - GET-8347 - Inserting into a wheel timer that hasn't been read from recently. Item: <redacted>
2022-10-07 17:31:07,614 - synapse.util.wheel_timer - 64 - WARNING - GET-8347 - Inserting into a wheel timer that hasn't been read from recently. Item: <redacted>
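
For what it's worth, a rough way to see whether the rate of these warnings lines up with the slowdowns (the log path below is the Debian package default and is an assumption here):

    # Count wheel-timer warnings per minute from the homeserver log.
    grep 'Inserting into a wheel timer' /var/log/matrix-synapse/homeserver.log \
        | cut -c1-16 | sort | uniq -c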

@rubo77
Contributor

rubo77 commented Feb 13, 2023

I got those same error messages after running pg_upgradecluster to go from PostgreSQL 14 to 15:

 "Inserting into a wheel timer that hasn't been read from recently. Item: %s",

This error is thrown here: https://mau.dev/maunium/synapse/-/blob/bb23055e7d63440de6a9e75c641502037cf17073/synapse/util/wheel_timer.py#L65

I left Synapse running for a few hours and those errors stopped. So I guess they came from the long downtime during the upgrade (several hours); with a few hundred users on my server, there is a lot to catch up on after the downtime.
