
State resolution on big rooms can get very CPU-hungry #6943

Closed
babolivier opened this issue Feb 18, 2020 · 9 comments

Labels: A-Performance (Performance, both client-facing and admin-facing)

@babolivier (Contributor) commented Feb 18, 2020

Sometime between 1.9 and 1.10 (I think), abolivier.bzh started becoming very hungry on the box's CPU.

I used to host it on a VPS that had 1 vCore and it was perfectly happy with it, but lately it became so unresponsive so often (clearly due to being CPU-bound) that last week I had to move it to a new box with 2 vCores.

However, even though Synapse now responds in a more timely manner and doesn't seem as CPU-bound as it used to be, it still spends around half of its time using up 80% of the CPU, and messages stay gray in Riot for maybe 5-15s (despite Grafana saying that the event took less than a second to send).

From the graphs it looks like these spikes are due to the persist_events background job, and the persist_events and state._resolve_events indexes.

[Grafana graphs]

It looks like these have matching spikes in the "DB time" graphs, though I'm not sure what could have happened since I don't remember updating PostgreSQL or its config on that box around that time.

I'm happy to pair with another member of the backend team to look at what might be going wrong here, if that can help.

@ara4n (Member) commented Feb 19, 2020

Does #6936 fix it?
This looks to be the same CPU perf issue that Erik has been chasing down.

@babolivier (Contributor, Author)

> Does #6936 fix it?

I've just updated my Synapse to the latest develop and the graphs still show the same issue, so I guess it doesn't fix it.

@babolivier (Contributor, Author)

So it looks like this issue is caused by v1 state res; I've left a big v1 room that was showing up in the logs, and the CPU usage went down to under 10%, except for some spikes that match activity in other v1 rooms.

I think we might want to keep this issue open though, because it doesn't sound right to me that v1 state res gets so CPU-hungry; we should at least investigate it.

@richvdh (Member) commented Mar 6, 2020

you might need to rejoin the offending rooms to be able to investigate it?

@babolivier (Contributor, Author)

Possibly, yeah. For reference, the two rooms I had the most issues with are !SfoVXiERkTtNJcmoRl:matrix.org and !HwocBmCtBcHQhILtYQ:matrix.org, and yeah, a rejoin will probably help figure out what's happening.

@babolivier changed the title from "Synapse has become very hungry on CPU lately" to "State resolution on big v1 rooms can get very CPU-hungry" Mar 6, 2020
@neilisfragile added the p1 and A-Performance (Performance, both client-facing and admin-facing) labels Mar 9, 2020
@babolivier self-assigned this Mar 12, 2020
@babolivier changed the title from "State resolution on big v1 rooms can get very CPU-hungry" to "State resolution on big rooms can get very CPU-hungry" Mar 13, 2020
@babolivier (Contributor, Author) commented Mar 13, 2020

It looks like this is actually not only an issue with v1 rooms (it's just that state res v1 logs more than v2).

A flame graph I generated yesterday shows that most of the CPU time is spent in _get_auth_chain_ids_txn; when Synapse gets very CPU-hungry, about 75% of the whole CPU time is spent in that function:

[flame graph]

With ~41% spent on this line:

@clokep suggested that I try this change:

diff --git a/synapse/storage/data_stores/main/event_federation.py b/synapse/storage/data_stores/main/event_federation.py
index 49a7b8b43..340eeb63b 100644
--- a/synapse/storage/data_stores/main/event_federation.py
+++ b/synapse/storage/data_stores/main/event_federation.py
@@ -93,10 +93,9 @@ class EventFederationWorkerStore(EventsWorkerStore, SignatureWorkerStore, SQLBas
                     txn.database_engine, "event_id", chunk
                 )
                 txn.execute(base_sql + clause, args)
-                new_front.update(r[0] for r in txn)
+                new_front.update(r[0] for r in txn if r[0] not in ignore_events and r[0] not in results)
 
-            new_front -= ignore_events
-            new_front -= results
 
             front = new_front
             results.update(front)

which seems to make things slightly better, as the %CPU used by state res went from over 100 to slightly over 60, but it's still high enough to sometimes make my homeserver unresponsive for over 10s. Though, as Patrick pointed out, it still might not be the most optimised solution, since it's still O(len(new_front)) according to https://wiki.python.org/moin/TimeComplexity#set.
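
For context, here's a minimal, self-contained sketch of the kind of breadth-first auth-chain walk that diff touches. This is not Synapse's actual code: the `auth_graph` dict and the `get_auth_chain_ids` helper are stand-ins for the event_auth table and the real _get_auth_chain_ids_txn, which queries the database in chunks. It just shows the two filtering strategies side by side:

```python
# Sketch only: an in-memory stand-in for the event_auth table that the real
# transaction queries in chunks.
auth_graph = {
    "$D": ["$B", "$C"],
    "$B": ["$A"],
    "$C": ["$A"],
    "$A": [],
}


def get_auth_chain_ids(event_ids, ignore_events=frozenset(), filter_in_loop=True):
    """Return the IDs of all auth-chain ancestors of `event_ids`."""
    results = set()
    front = set(event_ids)

    while front:
        new_front = set()
        for event_id in front:
            parents = auth_graph.get(event_id, [])
            if filter_in_loop:
                # The suggested change: skip already-seen events as we iterate,
                # so new_front only ever contains genuinely new event IDs.
                new_front.update(
                    p for p in parents
                    if p not in ignore_events and p not in results
                )
            else:
                new_front.update(parents)

        if not filter_in_loop:
            # The original approach: build the full set first, then subtract.
            new_front -= ignore_events
            new_front -= results

        front = new_front
        results.update(front)

    return results


print(get_auth_chain_ids({"$D"}))  # -> {'$A', '$B', '$C'} (order varies)
```

Filtering inside the generator avoids re-subtracting the ever-growing results set on every pass, but each row fetched still costs a membership check, so the work per iteration stays proportional to the number of rows returned.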

@erikjohnston (Member)

Hopefully #7095 will help, though more work might be needed if it turns out to not be enough

@richvdh closed this as completed Mar 19, 2020
@babolivier (Contributor, Author) commented Mar 21, 2020

> Hopefully #7095 will help, though more work might be needed if it turns out to not be enough

Grafana seems to show that it didn't do anything :/
I'll try to do more profiling on Monday.

@babolivier reopened this Mar 21, 2020
@babolivier (Contributor, Author)

Hmmm, never mind that. Not sure why, but my homeserver went crazy again for a few hours after I updated it, hence my previous comment; apparently it's been behaving fine for a bit more than a day now, though, so it looks like the fix helped in the end.
