Error loops when database consistency is in question. #12147
Comments
This seems odd, and it does seem like some kind of database corruption, though goodness only knows how that happened. Have you done anything manual to the database that could be at fault? The first concern is the two positions tables; they should each have a single row and are just used to track the latest position of some stream processing.
I imagine you could get away with adding a row to each of those tables. As for your third error, you could check whether the room still has any forward extremities:

    SELECT * FROM event_forward_extremities WHERE room_id = '!your.room.id:here';

Essentially, if there are none, new events won't know any events to refer to as events that precede them (if I understand correctly).
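For anyone following along, here is a minimal sketch of that first check, assuming the two position tables in question are stats_incremental_position and user_directory_stream_pos (the two tables that end up being repaired later in this thread):

```sql
-- Each of these tables should contain exactly one row; zero rows matches
-- the "No row found" errors reported in this issue.
SELECT 'stats_incremental_position' AS tbl, count(*) AS n FROM stats_incremental_position
UNION ALL
SELECT 'user_directory_stream_pos', count(*) FROM user_directory_stream_pos;
```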
Thank you very much for the reply! I may have damaged the database at one point when I manually removed entries from one of the tables. The tables holding the stream positions are indeed empty:

synapse=# table user_directory_stream_pos;
 lock | stream_id
------+-----------
(0 rows)

synapse=# SELECT * FROM user_directory_stream_pos;
 lock | stream_id
------+-----------
(0 rows)

I don't fully understand forward extremities, though I keep seeing references to them. The remaining room that is still having issues doesn't seem to have any forward extremities associated with it:

synapse=# SELECT * FROM event_forward_extremities WHERE room_id = '!<removed>:chat.geekforbes.com';
 event_id | room_id
----------+---------
(0 rows)

I really need to get automated backups working so I can restore if this happens again >,<
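As a side note, here is a hedged sketch for checking whether any other rooms are in the same state (no forward extremities at all); it assumes the standard Synapse rooms table, which is not mentioned in this thread:

```sql
-- Rooms with no forward extremities left; such rooms will refuse new events,
-- since there is nothing for a new event to point at as its predecessor.
SELECT r.room_id
  FROM rooms AS r
  LEFT JOIN event_forward_extremities AS f ON f.room_id = r.room_id
 WHERE f.event_id IS NULL;
```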
Added the missing rows:

synapse=# INSERT INTO stats_incremental_position("lock", "stream_id") VALUES ('X', 0);
INSERT 0 1
synapse=# INSERT INTO user_directory_stream_pos("lock", "stream_id") VALUES ('X', 0);
INSERT 0 1

The stack traces are no longer flooding the homeserver log, so I triggered the background task to regenerate the directory (regenerate_directory).
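If you also want to watch that rebuild from the database side, a rough sketch, assuming Synapse's background_updates table and its update_name/progress_json columns (an assumption, not something shown in this thread):

```sql
-- Background updates still queued; the regenerate_directory job works through
-- entries in this table, and they disappear as they complete.
SELECT update_name, progress_json FROM background_updates;
```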
This is starting to feel suuuuper dodgy, but if you're desperate, you might be able to use this query to figure them out:

    SELECT e.event_id
      FROM events AS e
      LEFT JOIN event_edges AS edg ON edg.prev_event_id = e.event_id
     WHERE edg.event_id IS NULL
       AND e.room_id = '!RRSDPUGrUYjUEUAtPk:librepush.net';

and then use this query to insert them into the table:

    INSERT INTO event_forward_extremities (event_id, room_id)
    SELECT e.event_id, e.room_id
      FROM events AS e
      LEFT JOIN event_edges AS edg ON edg.prev_event_id = e.event_id
     WHERE edg.event_id IS NULL
       AND e.room_id = '!RRSDPUGrUYjUEUAtPk:librepush.net';

(You may want to do this whilst Synapse is offline to make sure no caches are working against you.)

However, if your database is screwed up to this extent, it's hard to imagine that there isn't a lot more breakage hiding somewhere. It might come and bite you in the bum at some point in the future, and it will be tricky to support.
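For anyone repeating this later, a more cautious variant of the same repair, wrapped in a transaction so the result can be inspected before it is committed (same schema assumptions as the queries above; substitute your own room ID for the placeholder):

```sql
BEGIN;

-- Recreate forward extremities for one room from the events that nothing
-- else references as a previous event.
INSERT INTO event_forward_extremities (event_id, room_id)
SELECT e.event_id, e.room_id
  FROM events AS e
  LEFT JOIN event_edges AS edg ON edg.prev_event_id = e.event_id
 WHERE edg.event_id IS NULL
   AND e.room_id = '!your.room.id:here';

-- Inspect what was inserted, then COMMIT if it looks sane, or ROLLBACK if not.
SELECT * FROM event_forward_extremities WHERE room_id = '!your.room.id:here';

-- COMMIT;   -- or: ROLLBACK;
```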
(Is it worth looking at your Postgres logs to see if they mention anything about database corruption or the like? I have no idea what that would look like, but better to try to figure out what's gone on whilst you still can.)
@reivilibre Thank you very much! The SQL commands you suggested allowed me to access the failing room again, and I was able to export the data. I had already removed all the federated rooms from my server via the API, and I destroyed and recreated the failing rooms I wasn't too concerned about. I hear your concerns about the integrity of the remaining database and can understand the potential for future bum-biting. The main reason I hadn't destroyed and recreated everything was a note in one of the docs that said "don't, there are usually ways to recover". Starting from a fresh database sounds like a good plan; at this point, the database has been running steadily for more than 4 years. I'm hoping I can transfer user data so my users won't have to sign up again.
Ouch, I hope this won't be too much of a sting for you then.
I think you could probably transfer the contents of your users table. I will close this issue though, since I think Postgres corruption isn't something we can (or even should?) really work around that well. I don't see any code paths that delete the rows that went missing for you, so I don't think it's a bug in Synapse itself. Sorry that you ran into this! If you ever find the reason why, or any log lines in Postgres that might suggest why, it would be very nice to hear about it; it might help others in the future if they somehow end up in the same place.
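One rough sketch of how that transfer could be done with plain PostgreSQL COPY; the file path here is made up, and it assumes the database server can read and write that path and that both databases are on the same Synapse schema version:

```sql
-- On the old database: dump the users table to a CSV file on the DB server.
COPY users TO '/tmp/synapse_users.csv' WITH (FORMAT csv, HEADER);

-- On the new, freshly initialised database: load the accounts back in.
COPY users FROM '/tmp/synapse_users.csv' WITH (FORMAT csv, HEADER);
```

Password hashes live in the users table itself, so logins should survive, but devices and access tokens are stored in other tables, which presumably explains users being signed out as described below. psql's \copy does the same thing from the client side if server-side file access is awkward.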
Just as a follow-up: I don't have great logs for PostgreSQL (<shame.gif>), so I can't really look through them to see what or when the corruption happened. That aside, I dumped the corrupt database and spun up a new PostgreSQL container to run it. I added another synapse and element-web instance to the docker-compose config and reconfigured the frontend proxy to forward an alternative hostname to this "corrupt" cluster, rewriting the hostname in the proxy to match the original hostname.
This works and gives my users access to the old system for data-retrieval purposes. The primary stack started with a new, fresh database. I was able to copy over the user table as @reivilibre suggested, though I omitted certain old and testing accounts. I then rebuilt my bots and services to point at new rooms, invited users as applicable, and forced acceptance of the invites via the Admin API. Users were confused by being kicked out of the app and having everything start fresh; goes to show that server notices and @room messages are easy to ignore. My next big step is to set up proper automated backups.
Description
My PostgreSQL database, with several years of accumulated history, has hit an issue; I'm not sure if it's corruption or some other type of damage. Certain rooms, especially older ones, locked up and refuse to retrieve history or post new messages. Interactions with the user or room directories fail with a store error:
raise StoreError(404, "No row found (%s)" % (table,))
Steps to reproduce
My current theory is that I'm suffering from database corruption, so I don't think this can be easily reproduced.

With the API and force_purge I was able to destroy and recreate the rooms, sacrificing history for functionality. I've got a few rooms left that I'm hoping to retrieve history from, but at this point I'd take a smoothly working instance over pulling archival data.

As per https://matrix-org.github.io/synapse/latest/usage/administration/admin_api/background_updates.html#run, I triggered the background task regenerate_directory. At first it seemed to be doing something, as the status page showed positive numbers for the fields total_item_count, total_duration_ms, and average_items_per_ms; however, those now show 0 consistently, and an error, similar to the others being thrown, is looping in the logs.
When I try to post to an afflicted room, the client eventually reports that the message couldn't be sent and offers to resend or delete it. Meanwhile, the errors in the logs seem to point towards missing rows.
Version information
Homeserver:
chat.geekforbes.com
Version:
synapse: 1.53.0
python_version: 3.8.12
element-web: version: fe88030bc9ce-react-fe88030bc9ce-js-fe88030bc9ce
Olm version: 3.2.8
Postgres: 14
Install method:
Docker-compose
Platform:
Docker-compose services: synapse, element-web, and postgresql; running within docker-compose behind an Nginx reverse proxy on a virtual instance in the cloud.