Update to 1.9.0 triggers sanity check error in room #6779
Comments
Yes, I've seen this in the wild too. First, let me say that the problem here is definitely that room bbb is messed up in your database: it contains events which should never have been allowed into that room, so anything we do from here is going to be a hack, with the danger of making things worse rather than better. With that said, you might have some success with a query along the lines of:

DELETE FROM state_groups_state sgs USING events e
WHERE e.event_id=sgs.event_id AND e.room_id != sgs.room_id
AND sgs.room_id= '<room id bbb>';
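Before running a destructive statement like the one above, it may be worth listing the rows it would match. The following is only a minimal sketch: it uses Python's `sqlite3` with toy tables instead of a real Synapse database, the `state_group` column and all IDs are assumptions made up for illustration, and `mismatched_state_rows` is a hypothetical helper name. It rewrites the join as a `SELECT` so nothing is deleted.

```python
import sqlite3

# Toy stand-ins for the two tables named in the DELETE query. The real
# Synapse tables have more columns; the state_group column and every ID
# below are assumptions for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (event_id TEXT PRIMARY KEY, room_id TEXT);
    CREATE TABLE state_groups_state (state_group INTEGER, room_id TEXT, event_id TEXT);
    INSERT INTO events VALUES ('$e1', '!aaa'), ('$e2', '!bbb');
    INSERT INTO state_groups_state VALUES (1, '!bbb', '$e1'), (1, '!bbb', '$e2');
""")


def mismatched_state_rows(conn, room_id):
    """Return state rows recorded for room_id whose event lives in another room."""
    return conn.execute(
        """
        SELECT sgs.state_group, sgs.event_id, e.room_id
        FROM state_groups_state sgs
        JOIN events e ON e.event_id = sgs.event_id
        WHERE e.room_id != sgs.room_id AND sgs.room_id = ?
        ORDER BY sgs.state_group, sgs.event_id
        """,
        (room_id,),
    ).fetchall()


rows = mismatched_state_rows(conn, "!bbb")
print(rows)
```

An empty result would mean the `DELETE` has nothing to remove, in which case the problem probably lies elsewhere.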
Sorry for the delay, I was still on SQLite (it is a very small homeserver...) and I had to move to PostgreSQL before continuing further...
So, strangely, there is no occurrence where e.room_id != sgs.room_id. Thanks
I guess something else must be wrong. Can you contact me via Matrix?
After an extensive debugging session with @richvdh, the root cause turned out to be that the room somehow had two candidates for a previous state group.
So the "solution" was to delete one of them:
For those interested in the debugging, here are the steps which led to finding the issue:
And then we repeat the last statement until we find the case with two predecessors.
For the record: as @saintger says, the problem here was that state group 32 had two predecessors, 22 and 29. I think this was due to a bug in the way we used to allocate state group IDs, which caused state group 32 to be used twice. I think (hope) that bug has long since been fixed. For anyone else looking at this, I think this query would have raised a red flag much quicker than iterating through the state group chain:

select * from state_groups sg1 join state_group_edges sge on sge.state_group=sg1.id join state_groups sg2 on sg2.id=sge.prev_state_group where sg2.room_id != sg1.room_id;

If that returns any rows, there's a problem.
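To make the two checks concrete, here is a rough sketch using Python's `sqlite3` and toy tables. The state group numbers 22, 29 and 32 come from this thread; the extra group 28, the room assignments, and the helper names `cross_room_edges` and `multi_predecessor_groups` are assumptions invented for the example. The first helper runs the cross-room query quoted above (with explicit columns). The second is not from the thread: it simply counts edges per state group to surface any group with more than one predecessor.

```python
import sqlite3

# Toy versions of the two tables used in the query above. Room IDs and
# group 28 are made up; only groups 22, 29 and 32 are taken from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE state_groups (id INTEGER PRIMARY KEY, room_id TEXT);
    CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER);
    INSERT INTO state_groups VALUES (22, '!aaa'), (28, '!bbb'), (29, '!bbb'), (32, '!bbb');
    -- 29 -> 28 is a healthy edge; 32 has two predecessors, one in another room.
    INSERT INTO state_group_edges VALUES (29, 28), (32, 22), (32, 29);
""")


def cross_room_edges(conn):
    """Edges whose predecessor state group belongs to a different room."""
    return conn.execute(
        """
        SELECT sg1.id, sg1.room_id, sg2.id, sg2.room_id
        FROM state_groups sg1
        JOIN state_group_edges sge ON sge.state_group = sg1.id
        JOIN state_groups sg2 ON sg2.id = sge.prev_state_group
        WHERE sg2.room_id != sg1.room_id
        ORDER BY sg1.id, sg2.id
        """
    ).fetchall()


def multi_predecessor_groups(conn):
    """State groups that have more than one predecessor edge."""
    return conn.execute(
        """
        SELECT state_group, COUNT(*) AS n_prev
        FROM state_group_edges
        GROUP BY state_group
        HAVING COUNT(*) > 1
        ORDER BY state_group
        """
    ).fetchall()


bad_edges = cross_room_edges(conn)
multi = multi_predecessor_groups(conn)
print(bad_edges)
print(multi)
```

Both checks are read-only, so they should be safe to adapt and run against a copy of the database before deciding which edge to delete.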
The same thing happened to one of the rooms on my homeserver today. Currently I run Synapse 1.11.0 (avhost/docker-matrix:v1.11.0). I got a wrong entry in state_group_edges for a room that was created 3 days ago, already on Synapse 1.11.0. The room version is 5. Deleting the bad entry fixed the room for me. I'm not sure whether that means this bug is still present or it is a leftover from what older Synapse versions did (I've been running and constantly upgrading Synapse since 0.26 or 0.27). Either way, it may be something to take a look at.
This doesn't sound good. Is this the same room as the discussion in #6975?
No, it's a completely new room created 3 days ago. My Synapse was already on 1.11.0 when the affected room was created (and so was the federated server where the other member has an account).
OK, can you create a new issue with more information please?
Done.
Description
I had a room aaa in the past which cannot be used anymore:
#2534
I kept room aaa in this frozen state until recently, when it became accessible again after a Synapse update.
Today I upgraded my homeserver to Synapse v1.9.0 (running on Debian Buster), and another room, bbb, can no longer be used because of this error:
Commenting out the code that raises the exception makes the room usable again:
https://github.com/matrix-org/synapse/pull/6530/files/02137bef51779880d1b16194e2d2de9a693dc512
It is probably related to the following issues:
#3285
with a comment from @richvdh saying that existing rooms can be impacted.
I suppose that one (or both) of the rooms is somehow messed up.
What would be the fix or workaround to avoid losing all the history? Could we fix the database?
I can probably delete or wipe the room aaa, but I would really like to keep room bbb.
Thanks in advance,