A room lost all its forward extremities, leading to "no create event in auth events" #7845

Closed
richvdh opened this issue Jul 14, 2020 · 6 comments · Fixed by #7848

richvdh commented Jul 14, 2020

This actually goes back to March; sadly, this is the first time I've had a chance to write it up in a GitHub issue.

Somehow a room (on the mozilla.org instance) ended up with no event_forward_extremities in the database at all, despite still having a number of local members.

The room was manually "fixed" by inserting a new row into event_forward_extremities and restarting synapse.
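
For illustration, a fix along those lines might look something like this (a rough sketch only: the event_forward_extremities columns match the schema, but the connection string and event ID are placeholders, not what was actually run):

import psycopg2

# Sketch only: re-add a recent event in the room as a forward extremity.
# The DSN and event ID below are placeholders; as noted above, synapse
# was also restarted afterwards.
conn = psycopg2.connect("dbname=synapse user=synapse")
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO event_forward_extremities (event_id, room_id) "
        "VALUES (%s, %s)",
        ("$someRecentEventInTheRoom", "!<room>:mozilla.org"),
    )
conn.close()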

The only thing that looks odd is this entry in the postgres error log:

2020-03-04 15:14:33 UTC:172.20.37.63(58809):8e71e000-3607-11ea-8fb7-3d4a5344b740-live@8e71e000-3607-11ea-8fb7-3d4a5344b740-live:[14889]:ERROR: duplicate key value violates unique constraint "events_pkey"
2020-03-04 15:14:33 UTC:172.20.37.63(58809):8e71e000-3607-11ea-8fb7-3d4a5344b740-live@8e71e000-3607-11ea-8fb7-3d4a5344b740-live:[14889]:DETAIL: Key (stream_ordering)=(1691477) already exists.
2020-03-04 15:14:33 UTC:172.20.37.63(58809):8e71e000-3607-11ea-8fb7-3d4a5344b740-live@8e71e000-3607-11ea-8fb7-3d4a5344b740-live:[14889]:STATEMENT: INSERT INTO events (contains_url, depth, event_id, origin_server_ts, outlier, processed, received_ts, room_id, sender, stream_ordering, topological_ordering, type) VALUES(false, 5557, '$iVFN6H4LReXpOG_N0ygHoatqCO61LmyzNj5B9D6x-8M', 1583334868334, false, true, 1583334873818, '!<room>:mozilla.org', '@<sender>:mozilla.org', 1691477, 5557, 'm.room.message')

richvdh commented Jul 14, 2020

After the above error, synapse will have removed all traces of the failed event from the database, and retried.
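
In outline, that retry behaviour looks something like this (a toy sketch with made-up names; store.persist_event stands in for Synapse's internals):

from psycopg2 import IntegrityError

def persist_with_retry(store, event):
    # Toy sketch, not Synapse's real code path.
    try:
        store.persist_event(event)
    except IntegrityError:
        # The duplicate stream_ordering aborts the first attempt; retry,
        # asking the store to first delete any rows left over from it.
        # A fresh stream_ordering is allocated on the retry, hence the
        # later received_ts/stream_ordering on the event as persisted.
        store.persist_event(event, delete_existing=True)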

The eventual event, as persisted, was:

 stream_ordering      | 1691479
 topological_ordering | 5557
 event_id             | $iVFN6H4LReXpOG_N0ygHoatqCO61LmyzNj5B9D6x-8M
 type                 | m.room.message
 room_id              | !<room>:mozilla.org
 content              |
 unrecognized_keys    |
 processed            | t
 outlier              | f
 depth                | 5557
 origin_server_ts     | 1583334868334
 received_ts          | 1583334875837
 sender               | @<sender>:mozilla.org
 contains_url         | f

The slightly later received_ts and stream_ordering are consistent with this being the result of a retry. The original stream_ordering (1691477) is assigned to the previous event in the same room (which happens to be a redaction, not that that should matter).

So, questions:

  • why was 1691477 used for two events?
  • why did we end up with an empty extremity set?

richvdh commented Jul 14, 2020

why did we end up with an empty extremity set?

Well, this part is easily answered. When we get an integrity error and retry, _persist_events_txn is called with delete_existing=True, which makes it delete all traces of the previous attempt at https://github.com/matrix-org/synapse/blob/v1.17.0/synapse/storage/data_stores/main/events.py#L396. However, that deletion happens after the forward extremities have been updated at https://github.com/matrix-org/synapse/blob/v1.17.0/synapse/storage/data_stores/main/events.py#L372, so the extremity rows just written for the retried event are deleted again. The net effect is that we end up with no forward extremities at all.
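
For illustration, a self-contained toy model of that ordering (made-up structure, not the actual _persist_events_txn code):

# The room starts with one forward extremity: the new event's prev_event.
forward_extremities = {"$prev_event"}

def persist_events_txn(events, prev_events, delete_existing=False):
    # 1. Update the forward extremities: drop the events' prev_events and
    #    add the new events as the room's extremities.
    forward_extremities.difference_update(prev_events)
    forward_extremities.update(events)

    # 2. On a retry, delete all traces of the earlier attempt to persist
    #    these events -- including the extremity entries step 1 just added.
    if delete_existing:
        forward_extremities.difference_update(events)

    # 3. (insert the event rows themselves, etc. -- omitted)

# The first attempt failed on the duplicate stream_ordering, so we retry:
persist_events_txn({"$new_event"}, {"$prev_event"}, delete_existing=True)
print(forward_extremities)   # set() -- the room has no forward extremities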

richvdh commented Jul 14, 2020

ok, I think this was caused by running two master instances at once. Don't do that, people.
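
For illustration, a toy model of why two masters can collide on stream_ordering (a simplifying assumption about the mechanics, loosely modelled on Synapse's in-memory stream ID allocation, not its real code):

class ToyStreamIdGenerator:
    # Each master process keeps its own in-memory counter, seeded from
    # max(stream_ordering) in the database at startup.
    def __init__(self, current_max: int):
        self._current = current_max

    def get_next(self) -> int:
        self._current += 1
        return self._current

# Two master processes seed from the same database value...
master_a = ToyStreamIdGenerator(1691476)
master_b = ToyStreamIdGenerator(1691476)

# ...and each hands out 1691477 independently. Whichever INSERT lands
# second violates the unique constraint seen in the postgres log above.
print(master_a.get_next(), master_b.get_next())   # 1691477 1691477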

richvdh added a commit that referenced this issue Jul 14, 2020
As far as I can tell from the sentry logs, the only time this has actually done
anything in the last two years is when we had two master workers running at
once, and even then, it made a bit of a mess of it (see
#7845 (comment)).

Generally I feel like this code is doing more harm than good.

babolivier (Contributor) commented:
ok, I think this was caused by running two master instances at once. Don't do that, people.

Was the Mozilla side of the issue also caused by this? I don't recall them running two master instances at once.

richvdh commented Jul 14, 2020

yes, I think there were two master processes running at once on the mozilla instance (by accident) at this point. There were lots of exceptions in sentry around the time this happened.

babolivier (Contributor) commented:
Oh right, that would probably explain it then.

richvdh added a commit that referenced this issue Jul 15, 2020