This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Update to 1.9.0 triggers sanity check error in room #6779

Closed
saintger opened this issue Jan 24, 2020 · 10 comments

Comments

@saintger

Description

I had a room aaa in the past which could no longer be used:
#2534
I kept room aaa in this frozen state until recently, when it became accessible again after a Synapse update.

Today I upgraded my homeserver to Synapse v1.9.0 (running on Debian Buster) and another room bbb cannot be used anymore because of this error:

Exception: During auth for event xxx in room bbb, found event yyy in the state which is in room aaa

Commenting out the code that raises the exception makes the room usable again:
https://github.com/matrix-org/synapse/pull/6530/files/02137bef51779880d1b16194e2d2de9a693dc512

It is probably related to the following issue:
#3285
which has a comment from @richvdh saying that existing rooms can be affected.

I suppose that one (or both) of the rooms is somehow messed up.
What would be the fix or workaround to avoid losing all the history? Could we fix the database?

I can probably delete or wipe the room aaa, but I would really like to keep room bbb.

Thanks in advance,

@richvdh
Member

richvdh commented Jan 27, 2020

yes, I've seen this in the wild too.

First let me say that the problem here is definitely that room bbb is messed up in your database: it contains events which should never have been allowed into that room, so anything we do from here is going to be a hack with the danger of making things worse rather than better.

With that said, you might have some success with a query along the lines of:

DELETE FROM state_groups_state sgs USING events e 
WHERE e.event_id=sgs.event_id AND e.room_id != sgs.room_id 
AND sgs.room_id= '<room id bbb>';

@saintger
Author

saintger commented Feb 2, 2020

Sorry for the delay, I was still on sqlite (it is a very small homeserver...) and I had to move to postgresql before continuing further...

synapse=# SELECT COUNT(*) FROM state_groups_state sgs, events e WHERE e.event_id=sgs.event_id;
 count 
-------
   182
(1 row)

synapse=# SELECT COUNT(*) FROM state_groups_state sgs, events e WHERE e.event_id=sgs.event_id AND e.room_id != sgs.room_id;
 count 
-------
     0
(1 row)

So, strangely, there is no row where e.room_id != sgs.room_id.
However, I still get the same error as before.
Did I misunderstand the query?

Thanks

@richvdh
Member

richvdh commented Feb 3, 2020

I guess something else must be wrong. Can you contact me via matrix? @richvdh:sw1v.org

@saintger
Author

saintger commented Feb 5, 2020

After an extensive debugging session with @richvdh, the root cause turned out to be that one state group in the room somehow had 2 candidates for its previous state group.

sqlite> select * from state_group_edges where state_group=32;
32|29
32|22

So the "solution" was to delete one of them:

delete from state_group_edges where state_group=32 and prev_state_group=29;
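
A manual fix like the one above can be wrapped in a small guard so that it only commits when exactly one row matches. The following is a minimal sketch against a toy in-memory SQLite table, not Synapse's real schema; the helper name `delete_edge` and the sample rows are assumptions made for illustration.

```python
import sqlite3


def delete_edge(conn, state_group, prev_state_group):
    """Delete one edge; roll back unless exactly one row matched.

    Hypothetical helper for illustration only, not part of Synapse.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        cur = conn.execute(
            "DELETE FROM state_group_edges "
            "WHERE state_group = ? AND prev_state_group = ?",
            (state_group, prev_state_group),
        )
        if cur.rowcount != 1:
            raise RuntimeError(f"expected to delete 1 row, matched {cur.rowcount}")


# Toy table holding only the two edges shown in this thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER)"
)
conn.executemany("INSERT INTO state_group_edges VALUES (?, ?)", [(32, 29), (32, 22)])
conn.commit()

delete_edge(conn, 32, 29)
remaining = conn.execute(
    "SELECT state_group, prev_state_group FROM state_group_edges"
).fetchall()
print(remaining)
```

Which of the two edges is the wrong one depends on the database in question; the values 32 and 29 are simply the ones from this report.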

For those interested in the debugging, here are the steps which led to finding the issue:

select * from event_forward_extremities where room_id='<room id>';
select * from event_to_state_groups where event_id='<previous event id>';
select * from state_group_edges where state_group=<previous state group>;

And then we repeat the last statement until we find the case with 2 predecessors.
Thanks a lot to @richvdh for helping me debug this.
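
The chain walk described in the steps above (look up the predecessor in state_group_edges, then repeat until a state group shows two predecessors) can be sketched as a loop. This is an illustration on a toy in-memory SQLite table: only the edges 32|29 and 32|22 come from this thread, while the other rows, the starting group 40, and the function name `find_multi_predecessor_group` are made up.

```python
import sqlite3


def find_multi_predecessor_group(conn, start_group):
    """Follow prev_state_group links from start_group.

    Returns (state_group, [predecessors]) for the first group that has more
    than one predecessor, or None if the chain ends without finding one.
    """
    seen = set()
    group = start_group
    while group is not None and group not in seen:
        seen.add(group)  # guard against cycles in a corrupted table
        rows = conn.execute(
            "SELECT prev_state_group FROM state_group_edges "
            "WHERE state_group = ? ORDER BY prev_state_group",
            (group,),
        ).fetchall()
        preds = [row[0] for row in rows]
        if len(preds) > 1:
            return group, preds
        group = preds[0] if preds else None
    return None


# Toy data: 40 -> 36 -> 32, and 32 has two predecessors (22 and 29).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER)"
)
conn.executemany(
    "INSERT INTO state_group_edges VALUES (?, ?)",
    [(40, 36), (36, 32), (32, 29), (32, 22), (29, 25)],
)

result = find_multi_predecessor_group(conn, 40)
print(result)
```

In a real investigation the starting group would come from event_to_state_groups for one of the room's forward extremities, as in the queries listed above.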

@saintger closed this as completed Feb 5, 2020
@richvdh
Member

richvdh commented Feb 6, 2020

For the record: as @saintger says, the problem here was that state group 32 had two predecessors, 22 and 29. I think this was due to a bug in the way that we used to allocate state group ids, which caused state group 32 to be used twice. I think (hope) that bug has long since been fixed.

For anyone else looking at this, I think this query would have raised a red flag much quicker than iterating through the state group chain:

select *
from state_groups sg1
  join state_group_edges sge on sge.state_group = sg1.id
  join state_groups sg2 on sg2.id = sge.prev_state_group
where sg2.room_id != sg1.room_id;

if that returns any rows, there's a problem.
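
As a rough illustration of what that check reports, here is the same join run against a toy in-memory SQLite database. The two-table schema is a simplified assumption rather than Synapse's real one, and the room ids and the assignment of groups to rooms are invented for the example.

```python
import sqlite3

# Same join as the query above, with explicit columns instead of "select *".
CROSS_ROOM_EDGES = """
SELECT sg1.id, sg1.room_id, sg2.id, sg2.room_id
FROM state_groups sg1
JOIN state_group_edges sge ON sge.state_group = sg1.id
JOIN state_groups sg2 ON sg2.id = sge.prev_state_group
WHERE sg2.room_id != sg1.room_id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE state_groups (id INTEGER PRIMARY KEY, room_id TEXT);
    CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER);
    -- Invented sample rows: group 29 belongs to a different room than 32.
    INSERT INTO state_groups VALUES
        (22, '!bbb:example.org'),
        (29, '!aaa:example.org'),
        (32, '!bbb:example.org');
    INSERT INTO state_group_edges VALUES (32, 29), (32, 22);
    """
)

bad_edges = conn.execute(CROSS_ROOM_EDGES).fetchall()
for group, room, prev_group, prev_room in bad_edges:
    print(f"state group {group} ({room}) points at {prev_group} ({prev_room})")
```

Each row returned is an edge whose two ends live in different rooms, which is the red flag mentioned above.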

@worldowner

The same thing happened to one of the rooms on my homeserver today. I currently run synapse 1.11.0 (avhost/docker-matrix:v1.11.0). I got a wrong entry in state_group_edges for a room that was created 3 days ago, already on synapse 1.11.0. The room version is 5. Deleting the bad entry fixed the room for me.

I'm not sure whether that means this bug is still present or whether it is a leftover from what older synapse versions did (I've been running and constantly upgrading synapse since 0.26 or 0.27). Either way, it may be something to take a look at.

@richvdh
Member

richvdh commented Feb 28, 2020

I got wrong entry in state_group_edges for a room that was created 3 days ago already on synapse 1.11.0.

This doesn't sound good. Is this the same room as the discussion in #6975?

@worldowner

No, it's a completely new room created 3 days ago. My synapse was already on 1.11.0 when the affected room was created (and so was the federated server where the other member has an account).
The issue happened only on my homeserver.

@richvdh
Member

richvdh commented Feb 28, 2020

ok can you create a new issue with more information please?

@worldowner

Done.

3 participants