Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

PeterCxy · 2021-07-01T11:35:37Z

Description

When using the room deletion api to remove a large room (such as Matrix HQ) from the server, the purging process, if it needs more than a few seconds to finish, can sometimes enter a constant fail-retry loop due to unable to serialize access (because the tables are concurrently accessed and modified by other transactions constantly on a running server).

Steps to reproduce

On a moderately busy server (e.g. being in multiple moderately-sized federated rooms), try to purge a large federated room using the delete room api
Observe that the process gets stuck with unable to serialize access being reported in the logs. You can also observe the behavior using PgHero, in which one exact long-running query will appear again and again on a regular interval, indicating Synapse has been retrying it again and again.

Version information

Version: 1.37.1
Install method: pip

Platform: Debian 10 "buster", in a LXC container

Notes

The error will go away if the isolation level is changed to the lowest READ COMMITTED for the room-purging transaction, though I am not sure if this is correct or not, but I assume it should be fine given that we are just deleting everything related to a room.

diff --git a/synapse/storage/databases/main/purge_events.py b/synapse/storage/databases/main/purge_events.py
index 7fb7780d0..2619a6602 100644
--- a/synapse/storage/databases/main/purge_events.py
+++ b/synapse/storage/databases/main/purge_events.py
@@ -313,6 +313,7 @@ class PurgeEventsStore(StateGroupWorkerStore, CacheInvalidationWorkerStore):
         )

     def _purge_room_txn(self, txn, room_id: str) -> List[int]:
+        txn.execute("SET TRANSACTION ISOLATION LEVEL READ COMMITTED")
         # First we fetch all the state groups that should be deleted, before
         # we delete that information.
         txn.execute(

On a second note, is there a reason why the isolation level is set to REPEATABLE READ by default globally? Does Synapse really need REPEATABLE READ on every transaction?

The text was updated successfully, but these errors were encountered:

maranda · 2021-07-26T14:36:28Z

I think I replicated what should be the reported symptoms even on small empty rooms, just that the master process stops responding completely and makes Synapse unusable, there's also deadlocks reported by postgres:

2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse ERROR:  deadlock detected
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse DETAIL:  Process 88361 waits for ShareLock on transaction 199189797; blocked by process 88362.
        Process 88362 waits for ShareLock on transaction 199188675; blocked by process 88361.
        Process 88361: DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!ceAlNsZbKnKwkXshUJ:services.aria-net.org' )
        Process 88362: UPDATE event_json SET json = '{"auth_events":["$uV3PiiusEFaYkHJD9XGuuy3IoOI_35oRXD_KWuBBNBA","$VkJP1G-5_r7H0phbLMD21sExNzilqpkAaDIJ0dPQ4H4","$HkDKzsAiqZ0UjSrY4P30Mpcg6S2B64_W_pUNKE347eQ"],"content":{},"dep>
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse HINT:  See server log for query details.
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse CONTEXT:  while deleting tuple (1356100,2) in relation "event_json"
2021-07-25 16:03:27.599 UTC [88361] synapse_user@synapse STATEMENT:  DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!ceAlNsZbKnKwkXshUJ:services.aria-net.org' )
2021-07-25 16:05:10.719 UTC [88357] synapse_user@synapse ERROR:  could not serialize access due to concurrent update
2021-07-25 16:05:10.719 UTC [88357] synapse_user@synapse STATEMENT:  DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!yhuhlVYNMYpjlzRXhW:services.aria-net.org' )
2021-07-25 16:05:45.430 UTC [88361] synapse_user@synapse ERROR:  could not serialize access due to concurrent update
2021-07-25 16:05:45.430 UTC [88361] synapse_user@synapse STATEMENT:  DELETE FROM event_json WHERE event_id IN ( SELECT event_id FROM events WHERE room_id='!ceAlNsZbKnKwkXshUJ:services.aria-net.org' )
2021-07-25 16:06:44.606 UTC [88363] synapse_user@synapse ERROR:  could not serialize access due to concurrent update

This is quite a showstopper atm.

PeterCxy · 2021-07-26T15:00:20Z

@maranda Does the patch I posted in the issue fix it for you?

maranda · 2021-07-26T15:02:21Z

I'm patching Synapse as we speak, I'll be back with results asap.

maranda · 2021-07-26T16:42:46Z

It fixes the deadlocks @PeterCxy but not the Synapse master process from staggering and then becoming unresponsive. Requests in flight start accumulating and then it dies.

bradtgmurray · 2021-12-09T19:28:04Z

Seeing this happen to us as well. We've got some fairly large rooms that we're looking to purge (1 million+ events) and after grinding through the tables for over an hour we'll get one table fail to get cleaned up and then the transaction has to start over again.

Some thoughts:

Can we somehow avoid concurrent updates? We've only deleted two of these large rooms so far and both of them hit a concurrent update in room_stats_state, which presumably is some period stat tracking. Can we disable stat tracking for the room before purging it?

Does this all need to be in one transaction? Can we somehow kill the room in one transaction and then do the subsequent cleanup in subsequent transactions?

babolivier · 2021-12-13T15:54:53Z

We had a quick chat about this earlier in #synapse-dev today.

It's preferable to have the whole operation run in a single transaction, since otherwise a failure halfway through the process would likely lead to inconsistencies in the database. The code is written with the assumption that, since all local users have left the room before we start purging it, so concurrent access shouldn't be a thing, but it fails to account for the fact that other background processes of Synapse (such as updating room stats) might cause them.

We should try to pinpoint all of the processes that try to access/update data for a room after all of its local users have left, and try to make them not do that (or at least be more mindful that the room they're trying to work on might be in the process of being purged from the database), which isn't trivial. Or for a few of them (such as room stats) we could potentially move them to a new transaction, though this would need to be approached on a case by case basis.

On why Synapse defaults to REPEATABLE READ (it does), the reason is an abundance of caution because some transactions Synapse runs do require reads to be repeatable. Though this is very likely not the case for most of them, and as we can see it can cause some issues, so in the future we'll probably want to stop using it everywhere and default to READ COMMITTED. #11567 tracks this side of the issue.

richvdh · 2022-01-27T17:13:27Z

one possible cause of this: rows are removed from event_push_actions once they have been processed (or once they are more than a month old); this will conflict with the purge operation.

richvdh · 2022-01-27T18:23:23Z

another: we purge rows from stream_ordering_to_exterm when they are a month old. Again, this can conflict with purges.

An easy solution to both stream_ordering_to_exterm and event_push_actions is to do them at the start of the purge operation rather than the end, so that there is less chance of a concurrent update happening between the start of the purge transaction and the delete on stream_ordering_to_exterm/event_push_actions. That said, now that #11799 has landed (thanks to @Fizzadar), we can probably just set the isolation level for the purge transaction to READ COMMITTED.

richvdh · 2022-01-27T19:12:14Z

In the meantime, a workaround is to do the following directly on the database before running the purge:

delete from stream_ordering_to_exterm where room_id='<room id>';
delete from event_push_actions where room_id='<room id>';

This should be safe provided all local users have left the room.

bradtgmurray · 2022-04-26T20:52:43Z

Just bumping this up with some more data:

Trying to purge a room with 800k~ events. Deleting the room from each event related table is taking 20~ minutes (events, events_json, event_edges, etc) and the whole cycle is taking about 2 hours before we hit could not serialize access due to concurrent delete 4/5 shortly after removing <room id> from stream_ordering_to_exterm and then the 2 hours of database hammering starts over again.

Fizzadar · 2022-06-06T09:19:17Z

Happy to confirm that the PR above has fixed the issue on our synapse instance 🎉

To close: #10294. Signed off by Nick @ Beeper.

erikjohnston added S-Minor Blocks non-critical functionality, workarounds exist. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 23, 2021

dklimpel mentioned this issue Sep 10, 2021

Prevent multiple instances of shutting down the same room simultaneously #10797

Closed

dklimpel mentioned this issue Oct 31, 2021

Convert delete room admin API to async endpoint #11223

Merged

4 tasks

DMRobertson mentioned this issue Jan 28, 2022

investigate the impact of Twisted 22.1 #11848

Closed

Fizzadar mentioned this issue Jun 1, 2022

Use READ COMMITTED isolation level when purging rooms #12942

Merged

4 tasks

erikjohnston closed this as completed in #12942 Jul 18, 2022

erikjohnston pushed a commit that referenced this issue Jul 18, 2022

Use READ COMMITTED isolation level when purging rooms (#12942)

6785b0f

To close: #10294. Signed off by Nick @ Beeper.

realtyem mentioned this issue Jun 25, 2023

Fix could not serialize access due to concurrent DELETE from presence_stream #15826

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

PeterCxy commented Jul 1, 2021

maranda commented Jul 26, 2021

PeterCxy commented Jul 26, 2021

maranda commented Jul 26, 2021

maranda commented Jul 26, 2021

bradtgmurray commented Dec 9, 2021

babolivier commented Dec 13, 2021 •

edited

Loading

richvdh commented Jan 27, 2022

richvdh commented Jan 27, 2022

richvdh commented Jan 27, 2022

bradtgmurray commented Apr 26, 2022 •

edited

Loading

Fizzadar commented Jun 6, 2022

Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294

Comments

PeterCxy commented Jul 1, 2021

Description

Steps to reproduce

Version information

Notes

maranda commented Jul 26, 2021

PeterCxy commented Jul 26, 2021

maranda commented Jul 26, 2021

maranda commented Jul 26, 2021

bradtgmurray commented Dec 9, 2021

babolivier commented Dec 13, 2021 • edited Loading

richvdh commented Jan 27, 2022

richvdh commented Jan 27, 2022

richvdh commented Jan 27, 2022

bradtgmurray commented Apr 26, 2022 • edited Loading

Fizzadar commented Jun 6, 2022

babolivier commented Dec 13, 2021 •

edited

Loading

bradtgmurray commented Apr 26, 2022 •

edited

Loading