Room deletion (shutdown) fail in a constant loop due to non-serializable access caused by PostgreSQL isolation levels #10294
Comments
I think I replicated the reported symptoms even on small empty rooms; the master process stops responding completely and makes Synapse unusable. There are also deadlocks reported by Postgres.
This is quite a showstopper atm.
@maranda Does the patch I posted in the issue fix it for you?
I'm patching Synapse as we speak; I'll be back with results asap.
It fixes the deadlocks @PeterCxy, but the Synapse master process still staggers and then becomes unresponsive. Requests in flight start accumulating and then it dies.
Seeing this happen to us as well. We've got some fairly large rooms that we're looking to purge (1 million+ events), and after grinding through the tables for over an hour we'll get one table that fails to get cleaned up, and then the whole transaction has to start over again. Some thoughts: Can we somehow avoid concurrent updates? We've only deleted two of these large rooms so far, and both of them hit a concurrent update. Does this all need to be in one transaction? Can we somehow kill the room in one transaction and then do the subsequent cleanup in subsequent transactions?
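A minimal sketch of the "subsequent cleanup in subsequent transactions" idea, assuming a plain psycopg2 connection and an illustrative (not exhaustive) list of per-room tables — this is not how Synapse's purge code is actually structured:

```python
import psycopg2
from psycopg2 import errors

# Illustrative subset of per-room tables; Synapse's real purge covers many more.
ROOM_TABLES = ["event_push_actions", "stream_ordering_to_exterm", "room_memberships"]

def purge_room_per_table(dsn: str, room_id: str, max_retries: int = 5) -> None:
    """Delete a room's rows table by table, one transaction per table,
    retrying only the table that hit a serialization failure."""
    conn = psycopg2.connect(dsn)
    try:
        for table in ROOM_TABLES:
            for attempt in range(max_retries):
                try:
                    with conn:  # one transaction per table; commits on success
                        with conn.cursor() as cur:
                            cur.execute(
                                f"DELETE FROM {table} WHERE room_id = %s", (room_id,)
                            )
                    break  # this table is done
                except errors.SerializationFailure:
                    # `with conn` rolled the transaction back; retry only this table.
                    continue
            else:
                raise RuntimeError(f"gave up purging {table} after {max_retries} attempts")
    finally:
        conn.close()
```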
We had a quick chat about this earlier in #synapse-dev today. It's preferable to have the whole operation run in a single transaction, since otherwise a failure halfway through the process would likely lead to inconsistencies in the database. The code is written with the assumption that, since all local users have left the room before we start purging it, concurrent access shouldn't be a thing, but it fails to account for the fact that other background processes in Synapse (such as updating room stats) might cause concurrent access anyway. We should try to pinpoint all of the processes that try to access/update data for a room after all of its local users have left, and make them not do that (or at least be more mindful that the room they're working on might be in the process of being purged from the database), which isn't trivial. Or for a few of them (such as room stats) we could potentially move their updates to a new transaction, though this would need to be approached on a case-by-case basis. On why Synapse defaults to `REPEATABLE READ`…
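For illustration only, here is a small two-connection psycopg2 sketch of that failure mode: a long `REPEATABLE READ` purge transaction collides with a concurrent writer (standing in for one of those background processes) and gets a serialization error, forcing the whole purge to be retried. The room ID and statements are placeholders, not Synapse's actual queries.

```python
import psycopg2
from psycopg2 import errors
from psycopg2.extensions import ISOLATION_LEVEL_REPEATABLE_READ

ROOM_ID = "!placeholder:example.org"

purger = psycopg2.connect("dbname=synapse")  # stands in for the purge transaction
writer = psycopg2.connect("dbname=synapse")  # stands in for a background job

purger.set_session(isolation_level=ISOLATION_LEVEL_REPEATABLE_READ)

with purger.cursor() as p, writer.cursor() as w:
    # The purge transaction takes its snapshot and starts deleting rows.
    p.execute("DELETE FROM stream_ordering_to_exterm WHERE room_id = %s", (ROOM_ID,))

    # Meanwhile the background job updates rows the purge will touch later, and commits.
    w.execute("UPDATE event_push_actions SET notif = 0 WHERE room_id = %s", (ROOM_ID,))
    writer.commit()

    try:
        # Under REPEATABLE READ, deleting rows that were modified by a transaction
        # which committed after our snapshot raises a serialization failure...
        p.execute("DELETE FROM event_push_actions WHERE room_id = %s", (ROOM_ID,))
        purger.commit()
    except errors.SerializationFailure as exc:
        # ...so the whole purge transaction has to be rolled back and retried.
        purger.rollback()
        print("purge must be retried:", exc)
```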
one possible cause of this: rows are removed from `stream_ordering_to_exterm` by a background process while the purge is running.
another: we purge rows from `event_push_actions` in the background. An easy solution to both…
In the meantime, a workaround is to do the following directly on the database before running the purge:

```sql
delete from stream_ordering_to_exterm where room_id='<room id>';
delete from event_push_actions where room_id='<room id>';
```

This should be safe provided all local users have left the room.
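If it helps, the workaround can be scripted; a rough sketch assuming psycopg2, the requests library, and the v1 delete-room admin endpoint (adjust the DSN, token, and homeserver URL for your deployment):

```python
from urllib.parse import quote

import psycopg2
import requests

ROOM_ID = "!bigroom:example.org"           # placeholder room ID
HOMESERVER = "https://matrix.example.org"  # placeholder homeserver
ADMIN_TOKEN = "<admin access token>"       # placeholder admin token

# Step 1: pre-delete the rows that background jobs keep touching, so the
# purge transaction is less likely to hit "unable to serialize access".
with psycopg2.connect("dbname=synapse user=synapse") as conn:
    with conn.cursor() as cur:
        cur.execute("DELETE FROM stream_ordering_to_exterm WHERE room_id = %s", (ROOM_ID,))
        cur.execute("DELETE FROM event_push_actions WHERE room_id = %s", (ROOM_ID,))

# Step 2: ask Synapse to shut down and purge the room via the admin API.
resp = requests.post(
    f"{HOMESERVER}/_synapse/admin/v1/rooms/{quote(ROOM_ID)}/delete",
    headers={"Authorization": f"Bearer {ADMIN_TOKEN}"},
    json={"purge": True},
)
resp.raise_for_status()
print(resp.json())
```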
Just bumping this up with some more data: trying to purge a room with ~800k events, deleting the room from each event-related table is taking ~20 minutes.
Happy to confirm that the PR above has fixed the issue on our Synapse instance 🎉
To close: #10294. Signed off by Nick @ Beeper.
Description
When using the room deletion API to remove a large room (such as Matrix HQ) from the server, the purging process, if it needs more than a few seconds to finish, can sometimes enter a constant fail-retry loop due to `unable to serialize access` errors (because the tables are constantly being accessed and modified by other transactions on a running server).

Steps to reproduce

Delete a large room using the room deletion API; if the purge takes more than a few seconds, you will see `unable to serialize access` being reported in the logs. You can also observe the behavior using PgHero, in which the same long-running query will appear again and again at a regular interval, indicating that Synapse has been retrying it over and over.

Version information
Version: 1.37.1
Install method: pip
Notes
The error will go away if the isolation level is changed to the lowest level, `READ COMMITTED`, for the room-purging transaction. I am not sure whether this is correct, but I assume it should be fine given that we are just deleting everything related to a room.

On a second note, is there a reason why the isolation level is set to `REPEATABLE READ` by default globally? Does Synapse really need `REPEATABLE READ` on every transaction?
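To make the isolation-level note concrete, here is a standalone psycopg2 sketch (not the actual Synapse patch) of running just the room-purging statements at `READ COMMITTED` while leaving the connection's default at `REPEATABLE READ`:

```python
import psycopg2
from psycopg2.extensions import (
    ISOLATION_LEVEL_READ_COMMITTED,
    ISOLATION_LEVEL_REPEATABLE_READ,
)

def purge_room(conn, room_id: str) -> None:
    """Run the purge statements at READ COMMITTED, then restore the default.

    At READ COMMITTED each statement sees the latest committed data, so a row
    concurrently updated by a background job no longer aborts the whole
    transaction with a serialization error.
    """
    conn.set_session(isolation_level=ISOLATION_LEVEL_READ_COMMITTED)
    try:
        with conn, conn.cursor() as cur:
            # Illustrative subset; the real purge touches many more tables.
            cur.execute("DELETE FROM event_push_actions WHERE room_id = %s", (room_id,))
            cur.execute("DELETE FROM stream_ordering_to_exterm WHERE room_id = %s", (room_id,))
            cur.execute("DELETE FROM events WHERE room_id = %s", (room_id,))
    finally:
        # Put the connection back on the stricter default used elsewhere.
        conn.set_session(isolation_level=ISOLATION_LEVEL_REPEATABLE_READ)

if __name__ == "__main__":
    connection = psycopg2.connect("dbname=synapse user=synapse")
    connection.set_session(isolation_level=ISOLATION_LEVEL_REPEATABLE_READ)
    purge_room(connection, "!bigroom:example.org")
    connection.close()
```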