This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Synapse is unusably slow after presence enabled. #14090

Open
DMRobertson opened this issue Oct 6, 2022 · 4 comments
Labels
A-Memory-Usage · A-Presence · O-Occasional (affects or can be seen by some users regularly or most users rarely) · S-Major (major functionality / product severely impaired, no satisfactory workaround) · T-Defect (bugs, crashes, hangs, security vulnerabilities, or other reported issues)

Comments

@DMRobertson
Contributor

DMRobertson commented Oct 6, 2022

This is an issue to track the problems happening on @tomsisk's deployment of Synapse.

Previously discussed on #12160 (comment), #13955 and #13901.

We suspect a regression between 1.60 and 1.64; see #12160 (comment).


I updated this last night and used it for about an hour with a couple of users. There was nothing of note: presence was working and the service was responsive. So I let the service run overnight with presence on and we never saw a huge memory spike, but not many people are online during off hours. By the morning the service was still running, albeit extremely slowly. Nothing weird in CPU or memory usage, but it was taking up to 60 seconds to send a message, log in, etc. It was running but unusable, so I had to turn presence off. The slowness is what we normally see after just a couple of minutes if we turn presence on during the day. In those cases I've never been able to see a huge memory spike like we have in the past, but it seems those may have been connected to specific users.

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.

Back to the topic at hand: even with presence off, there still appears to be a small leak somewhere. I'm hoping these graphs help.
Here's a spike in memory.
[graph: spike in process memory usage]

This coincides with a spike in _handle_new_device_update_async calls:
[graph: spike in _handle_new_device_update_async calls]

Now if we look at the cache memory, we see an increase in getEvent, but after 30 minutes the metrics say this memory was released. There is no similar drop in the first graph.
[graph: cache memory by cache name, showing the getEvent increase and later release]

Originally posted by @tomsisk in #13955 (comment)

@squahtx added the S-Major, T-Defect, O-Occasional, A-Presence and A-Memory-Usage labels on Oct 6, 2022
@DMRobertson
Contributor Author

Some random thoughts I had:

  • I think I've heard it said that CPython's allocator doesn't release memory back to the OS, which might explain the discrepancy between the two graphs. You might find they match up more closely if you use jemalloc. There are some very brief notes about it here.

  • It would be interesting to see what Synapse is doing when it's unusually slow, e.g. by taking a flamegraph sample using py-spy. (A rough sketch of both of these suggestions is below.)
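
A rough sketch of both suggestions, assuming py-spy is installed next to Synapse (e.g. via pip) and that the main process can be found with pgrep; the process name, jemalloc path and config path below are illustrative, not taken from this deployment:

    # One-off dump of what every thread is doing right now.
    sudo py-spy dump --pid "$(pgrep -of 'synapse.app.homeserver')"

    # Sample for 60 seconds and write a flamegraph SVG to inspect or attach here.
    sudo py-spy record --pid "$(pgrep -of 'synapse.app.homeserver')" \
        --duration 60 --output synapse-flamegraph.svg

    # To try jemalloc, preload it before starting Synapse. The library path
    # varies by distro; this one is typical for Debian/Ubuntu on amd64.
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
    python -m synapse.app.homeserver --config-path /etc/matrix-synapse/homeserver.yaml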

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.

Not sure. Would need to ask for input from other Synapse authors.

@richvdh
Member

richvdh commented Oct 6, 2022

Question: would it hurt anything if I truncated the presence_stream table? I don't know whether it would help anything either, but I noticed there is a lot of data going back a few years that hasn't been updated.

Not sure. Would need to ask for input from other Synapse authors.

Assuming you're using postgres, this should be ok: if you clear it out, then a few users might appear with the wrong presence for a while, but nothing worse. Worth keeping a backup though.
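
For example (just a sketch, assuming the database is named synapse and you can connect as the Synapse database user; stopping Synapse first avoids writes landing mid-truncate):

    # Keep a copy of just this table before touching it.
    pg_dump --dbname=synapse --table=presence_stream --file=presence_stream.backup.sql

    # Then clear it out.
    psql --dbname=synapse -c 'TRUNCATE TABLE presence_stream;'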

@tomsisk

tomsisk commented Oct 7, 2022

Truncating the presence_stream table unsurprisingly did not help.

Here is a py-spy flamegraph with presence on:
[py-spy flamegraph: synapse_no_logging]

It could be a red herring, but the do_users_share_a_room transaction sticks out here. I turned DEBUG-level logging on and it appears to be running this transaction a lot, with about a thousand users included in each one. Commit 43adf25 also landed in the release where we started having issues.

I also see a lot of this in the logs, only when presence is on, but maybe this is just a symptom:

2022-10-07 17:31:07,614 - synapse.util.wheel_timer - 64 - WARNING - GET-8347 - Inserting into a wheel timer that hasn't been read from recently. Item: <redacted>
2022-10-07 17:31:07,614 - synapse.util.wheel_timer - 64 - WARNING - GET-8347 - Inserting into a wheel timer that hasn't been read from recently. Item: <redacted>
2022-10-07 17:31:07,614 - synapse.util.wheel_timer - 64 - WARNING - GET-8347 - Inserting into a wheel timer that hasn't been read from recently. Item: <redacted>
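
For what it's worth, a rough way to see whether the rate of these warnings lines up with the slowdowns (the log path below is the Debian package default and is an assumption here):

    # Count wheel-timer warnings per minute from the homeserver log.
    grep 'Inserting into a wheel timer' /var/log/matrix-synapse/homeserver.log \
        | cut -c1-16 | sort | uniq -c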

@rubo77
Contributor

rubo77 commented Feb 13, 2023

I got those same error messages after running pg_upgradecluster to go from PostgreSQL 14 to 15:

 "Inserting into a wheel timer that hasn't been read from recently. Item: %s",

This error is thrown here: https://mau.dev/maunium/synapse/-/blob/bb23055e7d63440de6a9e75c641502037cf17073/synapse/util/wheel_timer.py#L65

I left Synapse running for a few hours and those errors stopped. So I guess they came from the long downtime during the upgrade (several hours); with a few hundred users on my server, there is a lot to catch up on after the downtime.
