Update to 1.9.0 triggers sanity check error in room #6779

saintger · 2020-01-24T20:58:49Z

Description

I had a room aaa in the past which could no longer be used:
#2534
I kept room aaa in this frozen state until recently, when it became accessible again after a Synapse update.

Today I upgraded my homeserver to Synapse v1.9.0 (running on Debian Buster) and another room bbb cannot be used anymore because of this error:

Exception: During auth for event xxx in room bbb, found event yyy in the state which is in room aaa

Commenting out the code that raises the exception makes the room usable again:
https://github.com/matrix-org/synapse/pull/6530/files/02137bef51779880d1b16194e2d2de9a693dc512

It is probably related to the following issue:
#3285
which has a comment from @richvdh saying that existing rooms can be impacted.

I suppose that one (or both) of the rooms is somehow messed up.
What would be the fix or workaround to avoid losing all the history? Could we fix the database?

I can probably delete or wipe the room aaa, but I would really like to keep room bbb.

Thanks in advance,

richvdh · 2020-01-27T11:36:57Z

yes, I've seen this in the wild too.

First let me say that the problem here is definitely that room bbb is messed up in your database: it contains events which should never have been allowed into that room, so anything we do from here is going to be a hack with the danger of making things worse rather than better.

With that said, you might have some success with a query along the lines of:

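-- PostgreSQL syntax (DELETE ... USING); removes state entries in room bbb
-- that point at events belonging to a different room.
DELETE FROM state_groups_state sgs USING events e
WHERE e.event_id = sgs.event_id AND e.room_id != sgs.room_id
AND sgs.room_id = '<room id bbb>';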

saintger · 2020-02-02T22:56:08Z

Sorry for the delay, I was still on sqlite (it is a very small homeserver...) and I had to move to postgresql before continuing further...

synapse=# SELECT COUNT(*) FROM state_groups_state sgs, events e WHERE e.event_id=sgs.event_id;
 count 
-------
   182
(1 row)

synapse=# SELECT COUNT(*) FROM state_groups_state sgs, events e WHERE e.event_id=sgs.event_id AND e.room_id != sgs.room_id;
 count 
-------
     0
(1 row)

So strangely there is no occurrence where e.room_id != sgs.room_id.
However, I still get the same error as before.
Did I misunderstand the query?

Thanks

richvdh · 2020-02-03T12:47:30Z

I guess something else must be wrong. Can you contact me via matrix? @richvdh:sw1v.org

saintger · 2020-02-05T21:59:52Z

After an extensive debugging session with @richvdh, the root cause turned out to be that a state group in the room somehow had 2 candidates for its previous state group.

sqlite> select * from state_group_edges where state_group=32;
32|29
32|22

So the "solution" was to delete one of them:

delete from state_group_edges where state_group=32 and prev_state_group=29;

For those interested in the debugging, here are the steps which led to finding the issue:

select * from event_forward_extremities where room_id='<room id>';
select * from event_to_state_groups where event_id='<previous event id>';
select * from state_group_edges where state_group=<previous state group>;

And then we repeat the last statement until we find the case with 2 predecessors.
Thanks a lot to @richvdh for helping me debug this.
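
The repeated lookup described above can be automated. The following is only a minimal sketch, assuming a SQLite database and a state_group_edges table with the two columns shown in the output above. The function name find_multi_predecessor and the toy rows for groups 40, 35 and 20 are invented for illustration; only groups 32, 29 and 22 come from this thread.

```python
import sqlite3


def find_multi_predecessor(conn, start_group):
    """Follow prev_state_group links from start_group.

    Returns (state_group, [predecessors]) for the first state group that
    has more than one predecessor, or None if the chain ends cleanly.
    """
    seen = set()
    group = start_group
    while group is not None and group not in seen:
        seen.add(group)  # guard against cycles in a corrupted table
        rows = conn.execute(
            "SELECT prev_state_group FROM state_group_edges "
            "WHERE state_group = ? ORDER BY prev_state_group",
            (group,),
        ).fetchall()
        preds = [row[0] for row in rows]
        if len(preds) > 1:
            return group, preds
        group = preds[0] if preds else None
    return None


# Toy in-memory data (hypothetical, apart from the 32 -> 29 / 32 -> 22 edges).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER)"
)
conn.executemany(
    "INSERT INTO state_group_edges VALUES (?, ?)",
    [(40, 35), (35, 32), (32, 29), (32, 22), (29, 20)],
)

result = find_multi_predecessor(conn, 40)
print(result)
```

On a real homeserver the starting state group would come from the event_to_state_groups lookup above, and the database should be backed up before deleting anything.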

richvdh · 2020-02-06T11:02:41Z

For the record: as @saintger says, the problem here was that state group 32 had two predecessors, 22 and 29. I think this was due to a bug in the way that we used to allocate state group ids, which caused state group 32 to be used twice. I think (hope) that bug has long since been fixed.

For anyone else looking at this, I think this query would have raised a red flag much quicker than iterating through the state group chain:

select *
from state_groups sg1
join state_group_edges sge on sge.state_group = sg1.id
join state_groups sg2 on sg2.id = sge.prev_state_group
where sg2.room_id != sg1.room_id;

if that returns any rows, there's a problem.
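
As a rough illustration of what a "red flag" result looks like, here is a small sketch that runs the same join against a toy in-memory SQLite database. The table definitions are simplified, the room ids are made up, and which room state group 29 really belonged to is an assumption for the example, not something established in this thread.

```python
import sqlite3

# Same join as the query above, with explicit columns so the rows are easy to read.
CROSS_ROOM_EDGES = """
SELECT sg1.id, sg1.room_id, sg2.id, sg2.room_id
FROM state_groups sg1
JOIN state_group_edges sge ON sge.state_group = sg1.id
JOIN state_groups sg2 ON sg2.id = sge.prev_state_group
WHERE sg2.room_id != sg1.room_id
ORDER BY sg1.id, sg2.id
"""

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, hypothetical schema: only the columns the query needs.
    CREATE TABLE state_groups (id INTEGER PRIMARY KEY, room_id TEXT NOT NULL);
    CREATE TABLE state_group_edges (state_group INTEGER, prev_state_group INTEGER);

    -- Assumed layout: group 29 lives in another room but is linked from group 32.
    INSERT INTO state_groups VALUES
        (22, '!bbb:example.org'),
        (29, '!aaa:example.org'),
        (32, '!bbb:example.org');
    INSERT INTO state_group_edges VALUES (32, 22), (32, 29);
    """
)

bad_edges = conn.execute(CROSS_ROOM_EDGES).fetchall()
for state_group, room, prev_group, prev_room in bad_edges:
    print(f"state group {state_group} ({room}) -> {prev_group} ({prev_room})")
```

Only the edge from 32 to 29 is reported here, because it is the one that crosses rooms; the edge from 32 to 22 stays within the same room and is not flagged.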

worldowner · 2020-02-28T09:54:55Z

The same thing happened to one of the rooms on my homeserver today. Currently I run synapse 1.11.0 (avhost/docker-matrix:v1.11.0). I got a wrong entry in state_group_edges for a room that was created 3 days ago, already on synapse 1.11.0. The room version is 5. Deleting the bad entry fixed the room for me.

I'm not sure if that means that this bug is still present or that it is a leftover from what older synapse versions did (I've been running and constantly upgrading synapse since 0.26 or 0.27). Either way it may be something to take a look at.

richvdh · 2020-02-28T10:02:05Z

> I got a wrong entry in state_group_edges for a room that was created 3 days ago, already on synapse 1.11.0.

This doesn't sound good. Is this the same room as the discussion in #6975?

worldowner · 2020-02-28T10:39:52Z

No, it's a completely new room created 3 days ago. My synapse was already on 1.11.0 when the affected room was created (and so was the federated server where the other member has an account).
The issue happened only on my homeserver.

richvdh · 2020-02-28T11:11:03Z

ok can you create a new issue with more information please?

worldowner · 2020-02-28T11:36:59Z

Done.

saintger closed this as completed Feb 5, 2020

worldowner mentioned this issue Feb 24, 2020

Can't leave room (During auth for event xxx in room AAA, found event yyy in the state which is in room DDD #6975

Closed

worldowner mentioned this issue Feb 28, 2020

state_groups_state contains an event which is in a different room #7012

Closed

clokep mentioned this issue Mar 3, 2021

Many unread and unreadable rooms existing since compression matrix-org/rust-synapse-compress-state#27

Open
