A room lost all its forward extremities, leading to "no create event in auth events" #7845

Closed
richvdh opened this issue Jul 14, 2020 · 6 comments · Fixed by #7848

richvdh commented Jul 14, 2020

This actually goes back to March; sadly, this is the first time I've had a chance to write it up in a GitHub issue.

Somehow a room (on the mozilla.org instance) ended up with no event_forward_extremities in the database at all, despite still having a number of local members.

The room was manually "fixed" by inserting a new row into event_forward_extremities and restarting synapse.
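
For illustration, a fix along those lines might look something like this (a rough sketch only: the event_forward_extremities columns match the schema, but the connection string and event ID are placeholders, not what was actually run):

import psycopg2

# Sketch only: re-add a recent event in the room as a forward extremity.
# The DSN and event ID below are placeholders; as noted above, synapse
# was also restarted afterwards.
conn = psycopg2.connect("dbname=synapse user=synapse")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO event_forward_extremities (event_id, room_id) "
        "VALUES (%s, %s)",
        ("$someRecentEventInTheRoom", "!<room>:mozilla.org"),
    )
conn.close()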

The only thing that looks odd is this entry in the postgres error log:

2020-03-04 15:14:33 UTC:172.20.37.63(58809):8e71e000-3607-11ea-8fb7-3d4a5344b740-live@8e71e000-3607-11ea-8fb7-3d4a5344b740-live:[14889]:ERROR: duplicate key value violates unique constraint "events_pkey"
2020-03-04 15:14:33 UTC:172.20.37.63(58809):8e71e000-3607-11ea-8fb7-3d4a5344b740-live@8e71e000-3607-11ea-8fb7-3d4a5344b740-live:[14889]:DETAIL: Key (stream_ordering)=(1691477) already exists.
2020-03-04 15:14:33 UTC:172.20.37.63(58809):8e71e000-3607-11ea-8fb7-3d4a5344b740-live@8e71e000-3607-11ea-8fb7-3d4a5344b740-live:[14889]:STATEMENT: INSERT INTO events (contains_url, depth, event_id, origin_server_ts, outlier, processed, received_ts, room_id, sender, stream_ordering, topological_ordering, type) VALUES(false, 5557, '$iVFN6H4LReXpOG_N0ygHoatqCO61LmyzNj5B9D6x-8M', 1583334868334, false, true, 1583334873818, '!<room>:mozilla.org', '@<sender>:mozilla.org', 1691477, 5557, 'm.room.message')

richvdh commented Jul 14, 2020

After the above error, synapse will have removed all traces of the failed event from the database, and retried.
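
In outline, that retry behaviour looks something like this (a toy sketch with made-up names; store.persist_event stands in for Synapse's internals):

from psycopg2 import IntegrityError

def persist_with_retry(store, event):
    # Toy sketch, not Synapse's real code path.
    try:
        store.persist_event(event)
    except IntegrityError:
        # The duplicate stream_ordering aborts the first attempt; retry,
        # asking the store to first delete any rows left over from it.
        # A fresh stream_ordering is allocated on the retry, hence the
        # later received_ts/stream_ordering on the event as persisted.
        store.persist_event(event, delete_existing=True)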

The eventual event, as persisted, was:

 stream_ordering      | 1691479
 topological_ordering | 5557
 event_id             | $iVFN6H4LReXpOG_N0ygHoatqCO61LmyzNj5B9D6x-8M
 type                 | m.room.message
 room_id              | !<room>:mozilla.org
 content              |
 unrecognized_keys    |
 processed            | t
 outlier              | f
 depth                | 5557
 origin_server_ts     | 1583334868334
 received_ts          | 1583334875837
 sender               | @<sender>:mozilla.org
 contains_url         | f

The slightly later received_ts and stream_ordering are consistent with this being the result of a retry. The original stream_ordering (1691477) is assigned to the previous event in the same room (which happens to be a redaction, not that that should matter).

So, questions:

  • why was 1691477 used for two events?
  • why did we end up with an empty extremity set?

richvdh commented Jul 14, 2020

why did we end up with an empty extremity set?

Well, this part is easily answered. When we get an integrity error and retry, _persist_events_txn is called with delete_existing=True, which makes it delete all traces of the previous attempt at https://github.com/matrix-org/synapse/blob/v1.17.0/synapse/storage/data_stores/main/events.py#L396. However, that deletion happens after the forward extremities have been updated at https://github.com/matrix-org/synapse/blob/v1.17.0/synapse/storage/data_stores/main/events.py#L372, so the extremity rows just written for the retried event are deleted again. The net effect is that we end up with no forward extremities at all.
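
For illustration, a self-contained toy model of that ordering (made-up structure, not the actual _persist_events_txn code):

# The room starts with one forward extremity: the new event's prev_event.
forward_extremities = {"$prev_event"}

def persist_events_txn(events, prev_events, delete_existing=False):
    # 1. Update the forward extremities: drop the events' prev_events and
    #    add the new events as the room's extremities.
    forward_extremities.difference_update(prev_events)
    forward_extremities.update(events)

    # 2. On a retry, delete all traces of the earlier attempt to persist
    #    these events -- including the extremity entries step 1 just added.
    if delete_existing:
        forward_extremities.difference_update(events)

    # 3. (insert the event rows themselves, etc. -- omitted)

# The first attempt failed on the duplicate stream_ordering, so we retry:
persist_events_txn({"$new_event"}, {"$prev_event"}, delete_existing=True)
print(forward_extremities)   # set() -- the room has no forward extremities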

richvdh commented Jul 14, 2020

ok, I think this was caused by running two master instances at once. Don't do that, people.
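
For illustration, a toy model of why two masters can collide on stream_ordering (a simplifying assumption about the mechanics, loosely modelled on Synapse's in-memory stream ID allocation, not its real code):

class ToyStreamIdGenerator:
    # Each master process keeps its own in-memory counter, seeded from
    # max(stream_ordering) in the database at startup.
    def __init__(self, current_max: int):
        self._current = current_max

    def get_next(self) -> int:
        self._current += 1
        return self._current

# Two master processes seed from the same database value...
master_a = ToyStreamIdGenerator(1691476)
master_b = ToyStreamIdGenerator(1691476)

# ...and each hands out 1691477 independently. Whichever INSERT lands
# second violates the unique constraint seen in the postgres log above.
print(master_a.get_next(), master_b.get_next())   # 1691477 1691477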

richvdh added a commit that referenced this issue Jul 14, 2020
As far as I can tell from the sentry logs, the only time this has actually done
anything in the last two years is when we had two master workers running at
once, and even then, it made a bit of a mess of it (see
#7845 (comment)).

Generally I feel like this code is doing more harm than good.

babolivier (Contributor) commented:
ok, I think this was caused by running two master instances at once. Don't do that, people.

Was the Mozilla side of the issue also caused by this? I don't recall them running two master instances at once.

richvdh commented Jul 14, 2020

yes, I think there were two master processes running at once on the mozilla instance (by accident) at this point. There were lots of exceptions in sentry around the time this happened.

babolivier (Contributor) commented:
Oh right, that would probably explain it then.

richvdh added a commit that referenced this issue Jul 15, 2020