
State resolution on big rooms can get very CPU-hungry #6943

Closed
babolivier opened this issue Feb 18, 2020 · 9 comments

Labels: A-Performance (Performance, both client-facing and admin-facing)

@babolivier (Contributor) commented Feb 18, 2020

Sometime between 1.9 and 1.10 (I think), abolivier.bzh started becoming very hungry on the box's CPU.

I used to host it on a VPS that had 1 vCore and it was perfectly happy with it, but lately it became so unresponsive so often (clearly due to being CPU-bound) that last week I had to move it to a new box with 2 vCores.

However, even though Synapse now responds in a more timely manner and doesn't seem as CPU-bound as it used to be, it still spends around half of its time using up 80% of the CPU, and messages stay gray in Riot for maybe 5-15s (despite Grafana saying that the event took less than a second to send).

From the graphs it looks like these spikes are due to the persist_events background job, and the persist_events and state._resolve_events indexes.

[Grafana graphs]

It looks like these have matching spikes in the "DB time" graphs, though I'm not sure what could have happened since I don't remember updating PostgreSQL or its config on that box around that time.

I'm happy to pair with another member of the backend team to look at what might be going wrong here, if that can help.

@ara4n (Member) commented Feb 19, 2020

Does #6936 fix it?
This looks to be the same CPU perf issue that Erik has been chasing down.

@babolivier (Contributor, Author)

> Does #6936 fix it?

I've just updated my Synapse to the latest develop and the graphs still show the same issue, so I guess it doesn't fix it.

@babolivier (Contributor, Author)

So it looks like this issue is caused by v1 state res; I've left a big v1 room that was showing up in the logs, and the CPU usage went down to under 10%, except for some spikes that match activity in other v1 rooms.

I think we might want to keep this issue open though, because it doesn't sound right to me that v1 state res gets so CPU-hungry; we should at least investigate it.

@richvdh (Member) commented Mar 6, 2020

you might need to rejoin the offending rooms to be able to investigate it?

@babolivier (Contributor, Author)

Possibly, yeah. For reference, the two rooms I had the most issues with are !SfoVXiERkTtNJcmoRl:matrix.org and !HwocBmCtBcHQhILtYQ:matrix.org, and yeah, a rejoin will probably help figure out what's happening.

@babolivier changed the title from "Synapse has become very hungry on CPU lately" to "State resolution on big v1 rooms can get very CPU-hungry" Mar 6, 2020
@neilisfragile added the p1 and A-Performance (Performance, both client-facing and admin-facing) labels Mar 9, 2020
@babolivier self-assigned this Mar 12, 2020
@babolivier changed the title from "State resolution on big v1 rooms can get very CPU-hungry" to "State resolution on big rooms can get very CPU-hungry" Mar 13, 2020
@babolivier (Contributor, Author) commented Mar 13, 2020

It looks like this is actually not only an issue with v1 rooms (it's just that state res v1 logs more than v2).

A flame graph I generated yesterday shows that most of the CPU time is spent in _get_auth_chain_ids_txn; when Synapse gets very CPU-hungry, about 75% of the whole CPU time is spent in that function:

[flame graph]

With ~41% spent on this line:

@clokep suggested that I try this change:

diff --git a/synapse/storage/data_stores/main/event_federation.py b/synapse/storage/data_stores/main/event_federation.py
index 49a7b8b43..340eeb63b 100644
--- a/synapse/storage/data_stores/main/event_federation.py
+++ b/synapse/storage/data_stores/main/event_federation.py
@@ -93,10 +93,9 @@ class EventFederationWorkerStore(EventsWorkerStore, SignatureWorkerStore, SQLBas
                     txn.database_engine, "event_id", chunk
                 )
                 txn.execute(base_sql + clause, args)
-                new_front.update(r[0] for r in txn)
+                new_front.update(r[0] for r in txn if r[0] not in ignore_events and r[0] not in results)
 
-            new_front -= ignore_events
-            new_front -= results
 
             front = new_front
             results.update(front)

which seems to make things slightly better, as the %CPU used by state res went from over 100 to slightly over 60, but it's still high enough to sometimes make my homeserver unresponsive for over 10s. Though, as Patrick pointed out, it still might not be the most optimised solution, since it's still O(len(new_front)) according to https://wiki.python.org/moin/TimeComplexity#set.
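
For context, here's a minimal, self-contained sketch of the kind of breadth-first auth-chain walk that diff touches. This is not Synapse's actual code: the `auth_graph` dict and the `get_auth_chain_ids` helper are stand-ins for the event_auth table and the real _get_auth_chain_ids_txn, which queries the database in chunks. It just shows the two filtering strategies side by side:

```python
# Sketch only: an in-memory stand-in for the event_auth table that the real
# transaction queries in chunks.
auth_graph = {
    "$D": ["$B", "$C"],
    "$B": ["$A"],
    "$C": ["$A"],
    "$A": [],
}


def get_auth_chain_ids(event_ids, ignore_events=frozenset(), filter_in_loop=True):
    """Return the IDs of all auth-chain ancestors of `event_ids`."""
    results = set()
    front = set(event_ids)

    while front:
        new_front = set()
        for event_id in front:
            parents = auth_graph.get(event_id, [])
            if filter_in_loop:
                # The suggested change: skip already-seen events as we iterate,
                # so new_front only ever contains genuinely new event IDs.
                new_front.update(
                    p for p in parents
                    if p not in ignore_events and p not in results
                )
            else:
                new_front.update(parents)

        if not filter_in_loop:
            # The original approach: build the full set first, then subtract.
            new_front -= ignore_events
            new_front -= results

        front = new_front
        results.update(front)

    return results


print(get_auth_chain_ids({"$D"}))  # -> {'$A', '$B', '$C'} (order varies)
```

Filtering inside the generator avoids re-subtracting the ever-growing results set on every pass, but each row fetched still costs a membership check, so the work per iteration stays proportional to the number of rows returned.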

@erikjohnston (Member)

Hopefully #7095 will help, though more work might be needed if it turns out to not be enough

@richvdh closed this as completed Mar 19, 2020
@babolivier (Contributor, Author) commented Mar 21, 2020

> Hopefully #7095 will help, though more work might be needed if it turns out to not be enough

Grafana seems to show that it didn't do anything :/
I'll try to do more profiling on Monday.

@babolivier reopened this Mar 21, 2020
@babolivier (Contributor, Author)

Hmmm, never mind that. Not sure why, but my homeserver went crazy again for a few hours after I updated it, hence my previous comment; apparently it's been behaving fine for a bit more than a day now, though, so it looks like the fix helped in the end.
