Error loops when database consistency is in question. #12147
Comments
This seems odd, and it does seem like some kind of database corruption, though goodness only knows how that happened. Have you done anything manual to the database that could be at fault? The first concern is the two positions tables; they should each have a single row and are just used to track the latest position of some stream processing.
I imagine you could get away with adding a row to each of those tables. As for your third error, you could check whether the room still has any forward extremities:

    SELECT * FROM event_forward_extremities WHERE room_id = '!your.room.id:here';

Essentially, if there are none, new events won't know any events to refer to as events that precede them (if I understand correctly).
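For anyone following along, here is a minimal sketch of that first check, assuming the two position tables in question are stats_incremental_position and user_directory_stream_pos (the two tables that end up being repaired later in this thread):

```sql
-- Each of these tables should contain exactly one row; zero rows matches
-- the "No row found" errors reported in this issue.
SELECT 'stats_incremental_position' AS tbl, count(*) AS n FROM stats_incremental_position
UNION ALL
SELECT 'user_directory_stream_pos', count(*) FROM user_directory_stream_pos;
```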
Thank you very much for the reply! I may have damaged the database at one point when I manually removed entries from one of the tables. The tables holding the stream positions are indeed empty:

synapse=# table user_directory_stream_pos;
 lock | stream_id
------+-----------
(0 rows)

synapse=# SELECT * FROM user_directory_stream_pos;
 lock | stream_id
------+-----------
(0 rows)

I don't fully understand forward extremities, though I keep seeing references to them. The remaining room that is still having issues doesn't seem to have any forward extremities associated with it:

synapse=# SELECT * FROM event_forward_extremities WHERE room_id = '!<removed>:chat.geekforbes.com';
 event_id | room_id
----------+---------
(0 rows)

I really need to get automated backups working so I can restore if this happens again >,<
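As a side note, here is a hedged sketch for checking whether any other rooms are in the same state (no forward extremities at all); it assumes the standard Synapse rooms table, which is not mentioned in this thread:

```sql
-- Rooms with no forward extremities left; such rooms will refuse new events,
-- since there is nothing for a new event to point at as its predecessor.
SELECT r.room_id
  FROM rooms AS r
  LEFT JOIN event_forward_extremities AS f ON f.room_id = r.room_id
 WHERE f.event_id IS NULL;
```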
Added the missing rows:

synapse=# INSERT INTO stats_incremental_position("lock", "stream_id") VALUES ('X', 0);
INSERT 0 1
synapse=# INSERT INTO user_directory_stream_pos("lock", "stream_id") VALUES ('X', 0);
INSERT 0 1

The stack traces are no longer flooding the homeserver log, so I triggered the background task to regenerate the directory (regenerate_directory).
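If you also want to watch that rebuild from the database side, a rough sketch, assuming Synapse's background_updates table and its update_name/progress_json columns (an assumption, not something shown in this thread):

```sql
-- Background updates still queued; the regenerate_directory job works through
-- entries in this table, and they disappear as they complete.
SELECT update_name, progress_json FROM background_updates;
```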
This is starting to feel suuuuper dodgy, but if you're desperate, you might be able to use this query to figure them out:

    SELECT e.event_id
      FROM events AS e
      LEFT JOIN event_edges AS edg ON edg.prev_event_id = e.event_id
     WHERE edg.event_id IS NULL
       AND e.room_id = '!RRSDPUGrUYjUEUAtPk:librepush.net';

and then use this query to insert them into the table:

    INSERT INTO event_forward_extremities (event_id, room_id)
    SELECT e.event_id, e.room_id
      FROM events AS e
      LEFT JOIN event_edges AS edg ON edg.prev_event_id = e.event_id
     WHERE edg.event_id IS NULL
       AND e.room_id = '!RRSDPUGrUYjUEUAtPk:librepush.net';

(You may want to do this whilst Synapse is offline to make sure no caches are working against you.)

However, if your database is screwed up to this extent, it's hard to imagine that there isn't a lot more breakage hiding somewhere. It might come and bite you in the bum at some point in the future, and it will be tricky to support.
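For anyone repeating this later, a more cautious variant of the same repair, wrapped in a transaction so the result can be inspected before it is committed (same schema assumptions as the queries above; substitute your own room ID for the placeholder):

```sql
BEGIN;

-- Recreate forward extremities for one room from the events that nothing
-- else references as a previous event.
INSERT INTO event_forward_extremities (event_id, room_id)
SELECT e.event_id, e.room_id
  FROM events AS e
  LEFT JOIN event_edges AS edg ON edg.prev_event_id = e.event_id
 WHERE edg.event_id IS NULL
   AND e.room_id = '!your.room.id:here';

-- Inspect what was inserted, then COMMIT if it looks sane, or ROLLBACK if not.
SELECT * FROM event_forward_extremities WHERE room_id = '!your.room.id:here';

-- COMMIT;   -- or: ROLLBACK;
```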
(Is it worth looking at your Postgres logs to see if they mention anything about database corruption or the like? I have no idea what that would look like, but better to try to figure out what's gone on whilst you still can.)
@reivilibre Thank you very much! The SQL commands you suggested allowed me to access the failing room again, and I was able to export the data. I had already removed all the federated rooms from my server via the API, and I destroyed and recreated the failing rooms I wasn't too concerned about. I hear your concerns about the integrity of the remaining database and can understand the potential for future bum-biting. The main reason I hadn't destroyed and recreated everything was a note in one of the docs that said "don't, there are usually ways to recover". Starting from a fresh database sounds like a good plan; at this point, the database has been running steadily for more than 4 years. I'm hoping I can transfer user data so my users won't have to sign up again.
Ouch, I hope this won't be too much of a sting for you then.
I think you could probably transfer the contents of your users table. I will close this issue though, since I think Postgres corruption isn't something we can (or even should?) really work around that well. I don't see any code paths that delete the rows that went missing for you, so I don't think it's a bug in Synapse itself. Sorry that you ran into this! If you ever find the reason why, or any log lines in Postgres that might suggest why, it would be very nice to hear about it; it might help others in the future if they somehow end up in the same place.
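One rough sketch of how that transfer could be done with plain PostgreSQL COPY; the file path here is made up, and it assumes the database server can read and write that path and that both databases are on the same Synapse schema version:

```sql
-- On the old database: dump the users table to a CSV file on the DB server.
COPY users TO '/tmp/synapse_users.csv' WITH (FORMAT csv, HEADER);

-- On the new, freshly initialised database: load the accounts back in.
COPY users FROM '/tmp/synapse_users.csv' WITH (FORMAT csv, HEADER);
```

Password hashes live in the users table itself, so logins should survive, but devices and access tokens are stored in other tables, which presumably explains users being signed out as described below. psql's \copy does the same thing from the client side if server-side file access is awkward.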
Just as a follow-up: I don't have great logs for PostgreSQL (<shame.gif>), so I can't really look through them to see what or when the corruption happened. That aside, I dumped the corrupt database and spun up a new PostgreSQL container to run it. I added another synapse and element-web instance to the docker-compose config and reconfigured the frontend proxy to forward an alternative hostname to this "corrupt" cluster, rewriting the hostname in the proxy to match the original hostname.
This works and gives my users access to the old system for data-retrieval purposes. The primary stack started with a new, fresh database. I was able to copy over the user table as @reivilibre suggested, though I omitted certain old and testing accounts. I then rebuilt my bots and services to point at new rooms, invited users as applicable, and forced acceptance of the invites via the Admin API. Users were confused by being kicked out of the app and having everything start fresh; goes to show that server notices and @room messages are easy to ignore. My next big step is to set up proper automated backups.
Description
My PostgreSQL database, with several years of accumulated history, has hit an issue; I'm not sure if it's corruption or some other type of damage. Certain rooms, especially older ones, locked up and refuse to retrieve history or post new messages. Interactions with the user or room directories fail with a store error:
raise StoreError(404, "No row found (%s)" % (table,))
Steps to reproduce
My current theory is that I'm suffering from database corruption, so I don't think this can be easily reproduced.

With the API and force_purge I was able to destroy and recreate the rooms, sacrificing history for functionality. I've got a few rooms left that I'm hoping to retrieve history from, but at this point I'd take a smoothly working instance over pulling archival data.

As per https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/background_updates.html#run, I triggered the background task regenerate_directory. At first it seemed to be doing something, as the status page showed positive numbers for the fields total_item_count, total_duration_ms, and average_items_per_ms; however, those now show 0 consistently, and an error, similar to the others being thrown, is looping in the logs.
When I try to post to an afflicted room, the client eventually reports that the message couldn't be sent and offers to resend or delete it. Meanwhile, the errors in the logs seem to point towards missing rows.
Version information
Homeserver:
chat.geekforbes.com
Version:
synapse: 1.53.0
python_version: 3.8.12
element-web: version: fe88030bc9ce-react-fe88030bc9ce-js-fe88030bc9ce
Olm version: 3.2.8
Postgres: 14
Install method:
Docker-compose
Platform:
Docker-compose services: synapse, element-web, and postgresql; running within docker-compose behind an Nginx reverse proxy on a virtual instance in the cloud.