State resolution on big rooms can get very CPU-hungry #6943
Comments
Does #6936 fix it?
I've just updated my Synapse to the latest develop and the graphs still show the same issue, so I guess it doesn't fix it.
So it looks like this issue is caused by v1 state res; I've left a big v1 room that was showing up in the logs and the CPU usage went down to under 10%, except for some spikes that match activity in other v1 rooms. I think we might want to keep this issue open though, because it doesn't sound right to me that v1 state res gets so CPU-hungry; we should at least investigate it.
You might need to rejoin the offending rooms to be able to investigate it?
Possibly, yeah. For reference, the two rooms I had the most issues with are …
It looks like that's actually not only an issue with v1 rooms (it's just that state res v1 logs more than v2). A flame graph I generated yesterday shows that most of the CPU time is spent in a single function, with ~41% of it spent on one line.
@clokep suggested that I try this change:

```diff
diff --git a/synapse/storage/data_stores/main/event_federation.py b/synapse/storage/data_stores/main/event_federation.py
index 49a7b8b43..340eeb63b 100644
--- a/synapse/storage/data_stores/main/event_federation.py
+++ b/synapse/storage/data_stores/main/event_federation.py
@@ -93,10 +93,9 @@ class EventFederationWorkerStore(EventsWorkerStore, SignatureWorkerStore, SQLBas
                     txn.database_engine, "event_id", chunk
                 )
                 txn.execute(base_sql + clause, args)
-                new_front.update(r[0] for r in txn)
+                new_front.update(r[0] for r in txn if r[0] not in ignore_events and r[0] not in results)
 
-            new_front -= ignore_events
-            new_front -= results
 
             front = new_front
             results.update(front)
```

This seems to make things slightly better: the %CPU used by the state res went from over 100 to slightly over 60, but it's still high enough to sometimes make my homeserver unresponsive for over 10s. Though as Patrick pointed out, it still might not be the most optimised solution since it's still …
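For context on why that one-line change can help, here is a minimal standalone sketch, not the actual Synapse code: the toy graph, sizes, and function names below are made up. It mimics the shape of the loop in the diff, where in CPython `new_front -= results` walks the whole right-hand set on every pass, so the cost grows with the total number of events already seen, while filtering IDs as they are produced only does a couple of O(1) membership checks per new ID.

```python
# Toy illustration only -- not Synapse code. It mimics the shape of the loop in
# the diff above: a breadth-first walk over a parent graph where each round's
# newly fetched IDs must be pruned against "already seen" and "ignored" sets.
import random
import time

random.seed(0)
N = 20_000
# Hypothetical graph: each "event" has up to two parents a short distance back.
GRAPH = {i: {max(0, i - random.randint(1, 50)) for _ in range(2)} for i in range(1, N)}


def walk_subtract(start, ignore_events):
    """Original shape: collect everything, then subtract whole sets afterwards."""
    results = set()
    front = set(start)
    while front:
        new_front = set()
        for event_id in front:
            new_front.update(GRAPH.get(event_id, ()))
        # In CPython these subtractions iterate over the right-hand sets, so the
        # second line re-scans the ever-growing results set on every pass.
        new_front -= ignore_events
        new_front -= results
        front = new_front
        results.update(front)
    return results


def walk_filter(start, ignore_events):
    """Patched shape: drop known IDs as they are produced (two O(1) lookups each)."""
    results = set()
    front = set(start)
    while front:
        new_front = set()
        for event_id in front:
            new_front.update(
                parent
                for parent in GRAPH.get(event_id, ())
                if parent not in ignore_events and parent not in results
            )
        front = new_front
        results.update(front)
    return results


if __name__ == "__main__":
    start, ignore = [N - 1], set(range(0, N, 10))
    assert walk_subtract(start, ignore) == walk_filter(start, ignore)
    for fn in (walk_subtract, walk_filter):
        t0 = time.perf_counter()
        for _ in range(3):
            fn(start, ignore)
        print(f"{fn.__name__}: {time.perf_counter() - t0:.3f}s")
```

On a toy graph like this the filtered walk should come out far faster than the subtraction-based one, which is the same effect the `if r[0] not in ignore_events and r[0] not in results` filter in the diff is going for on rooms with very large auth chains.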
Hopefully #7095 will help, though more work might be needed if it turns out not to be enough.
Grafana seems to show that it didn't do anything :/
Hmmm, never mind that. Not sure why, but my homeserver went crazy again for a few hours after I updated it, hence my previous comment; it's apparently been behaving fine for a bit more than a day now, so it looks like the fix helped in the end.
Sometime between 1.9 and 1.10 (I think), abolivier.bzh started becoming very hungry on the box's CPU. I used to host it on a VPS that had 1 vCore and it was perfectly happy with it, but lately it became so unresponsive so often (clearly due to being CPU-bound) that last week I had to move it to a new box with 2 vCores.

However, even though Synapse now responds in a more timely manner and doesn't seem as CPU-bound as it used to be, it still spends around half of the time using up 80% of the CPU resources, and messages stay gray in Riot for maybe 5-15s (despite Grafana saying that the event took less than a second to send).

From the graphs it looks like these spikes are due to the `persist_events` background job, and the `persist_events` and `state._resolve_events` indexes. It looks like these have matching spikes in the "DB time" graphs, though I'm not sure what could have happened, since I don't remember updating PostgreSQL or its config on that box around that time.

I'm happy to pair with another member of the backend team to look at what might be going wrong here, if that can help.