PG_LOGICAL_SLOT_PEEK_CHANGES query crashes #546

Open
carlos-pereira-costoya opened this issue Jun 21, 2024 · 14 comments

@carlos-pereira-costoya

carlos-pereira-costoya commented Jun 21, 2024

PGSync version: 3.0.0
Postgres version: PostgreSQL 15.5 (AWS Aurora db.r7g.xlarge, with writer and reader nodes)
Elasticsearch/OpenSearch version: 2.1.1 (OpenSearch)
Redis version: 5.0.6 (ElastiCache)
Python version: 3.9

Error Message (if any):

[screenshot]

Problem description:

Hi @toluaina,

We are experiencing issues in the following context:

  • A migration from PostgreSQL to Elastic is performed from scratch for all records (5 million) without any problems, starting with no pre-existing replication slots or OpenSearch indexes.
  • Another block of 5 million records is added to the relational database in a single commit.
  • The pgsync process is started and fails when trying to retrieve the changes from the replication slot.

We tried executing the query manually and encountered the same behavior:

SELECT count(*) AS count_1 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 10319409 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 10725529

The query never completes because PostgreSQL cuts off the connection after 15-20 minutes. During execution it uses up to 40 GB of disk space in our case.

[screenshot]

The only way I've seen it finish is by passing the upto_nchanges limit as a parameter (I tried with 100).
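
For reference, the bounded variant that did finish looked roughly like this (the third argument is upto_nchanges; I haven't kept the exact statement, so this is just the shape of it):

-- Hypothetical bounded version of the same query: logical decoding stops
-- after about 100 changes instead of decoding the whole slot.
SELECT count(*) AS count_1 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, 100) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 10319409 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 10725529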

In PostgreSQL there are two replication slots with 68 GB of data each.

From what I've researched, this happens because the function PG_LOGICAL_SLOT_PEEK_CHANGES needs to decode the changes stored in files on disk, and that decoding spills into the storage of the database instance.

https://pgpedia.info/p/pg_logical_slot_peek_changes.html

Also, in the past, I've seen work being done to solve this in pgsync:

#267

However, I have not found the PG_LOGICAL_SLOT_UPTO_NCHANGES parameter in either the version I use (3.0.0) or the most recent version of pgsync, so I have not been able to test it. In other versions there is a PG_LOGICAL_SLOT_UPTO_NCHANGES parameter, but it doesn't work either.

  1. Has this functionality been incorporated to obtain the changes from the slots in batch mode?
  2. Is this behavior common, or does pgsync not allow commits with a large number of records (> 5M)?
  3. Could running PostgreSQL as a cluster (with writer and reader nodes) affect the retrieval and processing of data in the temporary space?

Thank you very much!

@carlos-pereira-costoya
Author

Hi!

With a smaller volume of data, when the replication slots contain about 1 GB, the same behavior occurs. There comes a point where, when trying to retrieve the changes, the connection is cut off.

Cheers

@toluaina
Owner

Hi and sorry about the delay on this.
Have you tried reducing LOGICAL_SLOT_CHUNK_SIZE?
This is the parameter you want. The default is 5000.
Unfortunately, this was not documented, but you can see how it is used here.

@carlos-pereira-costoya
Author

carlos-pereira-costoya commented Jul 1, 2024

Hi @toluaina

I just tested with LOGICAL_SLOT_CHUNK_SIZE=100 with similar results:

[screenshots]

In one of the tests, the replication slot was consumed by pgsync (it takes about 45 min).
But in other tests we got the same results as in the previous posts of this discussion thread.

The connection to the database was active; among the active connections, the running query was:

SELECT count(*) AS count_1 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 11560368 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12072880

There isn't any limit to filter the input in the query. The limit in PG_LOGICAL_SLOT_PEEK_CHANGES is the third parameter, which is NULL (upto_nchanges).

Inside the code there are the following notes:

            upto_nchanges (Optional[int], optional): The maximum number of changes to read. Defaults to None.
            limit (Optional[int], optional): The maximum number of rows to return. Defaults to None.

I think that in this case we also need to set the number of changes to read, i.e. the upto_nchanges parameter.
The LOGICAL_SLOT_CHUNK_SIZE parameter changes the default value for the limit variable, but that is the number of rows to return after retrieving the changes from the replication slot.
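
To illustrate the difference as I understand it (the values are just examples):

-- LIMIT only trims the rows returned to the client; the whole slot is still
-- decoded before the LIMIT applies.
SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL) 
LIMIT 100

-- upto_nchanges (the third argument) stops the decoding itself once roughly
-- 100 changes have been produced.
SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, 100)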

@toluaina
Owner

toluaina commented Jul 1, 2024

The idea was to filter on txmin and txmax and then paginate the resultset with limit/offset.
In your case, it seems most of the rows might have the same value for xmin.
I will need to re-work the logic to use upto_nchanges specifically.
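
For clarity, the pattern I had in mind looks roughly like this (5000 is the default LOGICAL_SLOT_CHUNK_SIZE; the xid bounds and the offset are illustrative):

-- One page of changes for a txmin/txmax window; OFFSET advances by the chunk
-- size on each iteration, but the slot is still decoded in full every time.
SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 11560368 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12072880 
LIMIT 5000 OFFSET 0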

@carlos-pereira-costoya
Author

carlos-pereira-costoya commented Jul 2, 2024

Yes, I haven't checked it, but it's possible that the xmin value of the records is the same, because we have a process with several steps involving transactions that perform large commits. For example, in one of the steps, files are loaded from S3 into PostgreSQL tables, and the files contain millions of records. I understand this already generates a large transaction log, although I'm not sure whether all of the records have the same xmin.

OK, thanks. Could you publish this in a new release or a separate commit?

@toluaina
Owner

toluaina commented Jul 2, 2024

  • Can you please try with the version on the main branch?
  • Please note: this does not have a lot of tests yet but I was able to reproduce the scenario.
  • This approach is a bit simpler.
  • There is still one expensive query, which is when we do an aggregate for the MAX(lsn) (I cannot figure out how to avoid this).

@carlos-pereira-costoya
Author

Hi!

Thanks for the fix, but yes, the MAX(lsn) is a very expensive query.
I have tested with the latest version of the main branch, but I got the same results: the query to peek the changes uses a lot of memory and disk:

[screenshot]

SELECT count(*) AS count_1
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 12096230 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12370229

Current replication slots:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replicationSlotLag,
       active
FROM pg_replication_slots;

[screenshot]

To solve this, wouldn't it be possible to create a loop with upto_nchanges, to avoid decoding all the database files in the same interval of time? I suppose that would be slow, but I think it would work if pgsync computes the max in its own variable.
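
Something along these lines per iteration (just a sketch; pgsync would keep the highest lsn seen in a variable and repeat until a batch comes back empty; 1000 is an arbitrary batch size):

-- One iteration of the hypothetical batched peek: decode at most ~1000 changes
-- instead of the whole slot, and let the caller remember the last lsn.
SELECT lsn, xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, 1000) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 12096230 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12370229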

@toluaina
Owner

toluaina commented Jul 3, 2024

Are you saying that you are still seeing this query below:

SELECT count(*) AS count_1
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 12096230 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12370229

That query was removed and should no longer be present if you are running off the main branch.
Can you please confirm?

@carlos-pereira-costoya
Author

Sorry, I mixed up the source folder of the code. But I checked again, and the query with the max is still very expensive:

SELECT max(lsn) AS max_1 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', NULL, NULL) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 12096230 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12504771

[screenshot]

@toluaina
Owner

toluaina commented Jul 4, 2024

  • I figured the MAX(lsn) would be an issue from the original report.
  • The latest changes in main remove the MAX(lsn) completely.

@carlos-pereira-costoya
Author

Yes, you are right, the new version works with the SLOT_SIZE:

SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('coverage_street', '41/A75A0A58', 50) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 12096230 AND CAST(CAST(xid AS TEXT) AS BIGINT) < 12513642

@carlos-pereira-costoya
Author

But eventually I got the same error:

File "/usr/local/lib/python3.9/site-packages/pgsync/sync.py", line 1228, in pull
    self.logical_slot_changes(
  File "/usr/local/lib/python3.9/site-packages/pgsync/sync.py", line 382, in logical_slot_changes
    changes: int = self.logical_slot_peek_changes(
  File "/usr/local/lib/python3.9/site-packages/pgsync/base.py", line 594, in logical_slot_peek_changes
    return self.fetchall(statement)
  File "/usr/local/lib/python3.9/site-packages/pgsync/base.py", line 894, in fetchall
    return conn.execute(statement).fetchall()
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1418, in execute
    return meth(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 515, in _execute_on_connection
    return connection._execute_clauseelement(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1640, in _execute_clauseelement
    ret = self._execute_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1846, in _execute_context
    return self._exec_single_context(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1986, in _exec_single_context
    self._handle_dbapi_exception(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 2353, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1967, in _exec_single_context
    self.dialect.do_execute(
  File "/usr/local/lib/python3.9/site-packages/sqlalchemy/engine/default.py", line 924, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected

[SQL: SELECT xid, data 
FROM PG_LOGICAL_SLOT_PEEK_CHANGES(%(PG_LOGICAL_SLOT_PEEK_CHANGES_1)s, %(PG_LOGICAL_SLOT_PEEK_CHANGES_2)s, %(PG_LOGICAL_SLOT_PEEK_CHANGES_3)s) 
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= %(param_1)s AND CAST(CAST(xid AS TEXT) AS BIGINT) < %(param_2)s]
[parameters: {'PG_LOGICAL_SLOT_PEEK_CHANGES_1': 'coverage_street', 'PG_LOGICAL_SLOT_PEEK_CHANGES_2': '41/A75A0A58', 'PG_LOGICAL_SLOT_PEEK_CHANGES_3': 50, 'param_1': 12096230, 'param_2': 12513642}]
(Background on this error at: https://sqlalche.me/e/20/e3q8)

[screenshot]

@carlos-pereira-costoya
Author

carlos-pereira-costoya commented Jul 4, 2024

I just tested with LOGICAL_SLOT_CHUNK_SIZE=1, but it also fails with the same error.

@toluaina
Owner

I feel there are multiple things going on here.
Is anyone able to reliably reproduce this?

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) SSL SYSCALL error: EOF detected
