
Bump vendor/postgres #1573

Merged
knizhnik merged 1 commit into main from bump-vendor-postgres on May 3, 2022

Conversation

hlinnaka
Contributor

This brings us the performance improvements to WAL redo from
neondatabase/postgres#144

@hlinnaka requested a review from knizhnik on April 25, 2022 15:19
@hlinnaka force-pushed the bump-vendor-postgres branch from 3cb2b1a to a5d9e23 on April 25, 2022 18:25
@hlinnaka force-pushed the bump-vendor-postgres branch 2 times, most recently from c6094e6 to 2458372 on April 26, 2022 06:47
@hlinnaka
Contributor Author

The pg_regress test is failing. In the debug build, it fails with an assertion failure:

2022-04-26T07:12:19.265374Z ERROR pagestream{timeline=2fdd02eb95f315e821dac04ea3278617 tenant=b6e17e6cc23e4ed293994374f5d57c75}:get_page{rel=1663/16384/2674 blkno=29 req_lsn=0/4E53660}: wal-redo-postgres: TRAP: FailedAssertion("relpersistence == 0 || reln->smgr_relpersistence == relpersistence", File: "/home/circleci/project/tmp_install/build/../../vendor/postgres/src/backend/storage/smgr/smgr.c", Line: 197, PID: 18984)

In the release build, it fails with an error:

2022-04-26T06:58:52.163195Z ERROR pagestream{timeline=57a11af6ff1e464511940c3f5b84498f tenant=aa327d76e07f49b7bb4e4529d2cdf7e9}:get_page{rel=1663/16384/24201 blkno=289 req_lsn=0/83DD1F8}: wal-redo-postgres: 2022-04-26 06:58:52.163 UTC [643] FATAL:  Inmem storage overflow

I'm not seeing this on my laptop.

@knizhnik
Contributor

> In the release build, it fails with an error:

I also cannot reproduce it on my laptop.
Should we try increasing MAX_PAGES to 256?

@hlinnaka
Contributor Author

> In the release build, it fails with an error:
>
> I also cannot reproduce it on my laptop. Should we try increasing MAX_PAGES to 256?

I'd rather decrease MAX_PAGES to make the problem appear more readily, so that we can hunt it down and fix it. I don't think there is any WAL record that legitimately requires that many buffers.

@hlinnaka force-pushed the bump-vendor-postgres branch from 52da6a9 to c8327ae on April 26, 2022 14:07
@hlinnaka
Contributor Author

I lowered MAX_PAGES to 32, and now I can reproduce the "Inmem storage overflow" error:

2022-04-26T14:23:28.546042Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: wal-redo-postgres: 2022-04-26 17:23:28.546 EEST [1596627] FATAL:  Inmem storage overflow
2022-04-26 17:23:28.546 EEST [1596627] CONTEXT:  WAL redo at 0/6A73D30 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket 29, meta_page_masks_updated F, issplitpoint_changed F

2022-04-26T14:23:28.546092Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: wal-redo-postgres: ---------------------------------------
seccomp: bad syscall 11
---------------------------------------

2022-04-26T14:23:28.547059Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: error applying 30650 WAL records to reconstruct page image at LSN 0/7EBDBF0
2022-04-26T14:23:28.548297Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: wal-redo-postgres exited with code exit status: 1

It seems to happen with many different WAL record types. I think the important thing here is that you have a lot of WAL records to replay in one batch, 30650 records in the above example. Even if each WAL record needs to evict just one page from the local buffer cache into the "inmem" area, that adds up when you have a lot of records to replay.

It occurs to me that we could clear the "inmem" area after every record. Records are supposed to be independent of each other; it shouldn't be necessary to carry over any buffers other than the target page we're applying the records for.
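
To make the reasoning concrete, here is a minimal, self-contained sketch. The counter, helper names, and batch-size constant are hypothetical stand-ins, not the actual vendor/postgres wal-redo code; it just models a batch the size of the one in the log above and shows how a per-record reset keeps spill usage bounded by a single record's needs instead of the whole batch.

```c
/* Illustrative sketch only: the names below are stand-ins, not vendor/postgres symbols. */
#include <stdio.h>

typedef struct WalRecord { int id; } WalRecord;

#define BATCH_SIZE 30650            /* size of the batch from the log above */

static int spill_pages_in_use = 0;  /* pages currently parked in the "inmem" area */
static int spill_high_water = 0;    /* worst case seen during the batch */

/* Hypothetical: replaying one record may evict a page into the spill area. */
static void
redo_one_record(const WalRecord *rec)
{
    (void) rec;
    spill_pages_in_use++;
    if (spill_pages_in_use > spill_high_water)
        spill_high_water = spill_pages_in_use;
}

/* Hypothetical: drop the spill area; the target page is held elsewhere,
 * so nothing here needs to survive from one record to the next. */
static void
inmem_reset(void)
{
    spill_pages_in_use = 0;
}

int
main(void)
{
    for (int i = 0; i < BATCH_SIZE; i++)
    {
        WalRecord rec = { i };

        redo_one_record(&rec);
        inmem_reset();              /* per-record reset keeps usage bounded */
    }

    /* Without the per-record reset the high-water mark would approach
     * BATCH_SIZE and blow past any fixed MAX_PAGES limit. */
    printf("spill high-water mark: %d page(s)\n", spill_high_water);
    return 0;
}
```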

@knizhnik
Contributor

Oh, sorry.
I was sure that the smgr is reset after applying each record, but that is not true. Previously there was a reason not to do so, because reconstructing the hash table was not a cheap operation. Now inmem_init just does one assignment, so it really can be performed for each record.
I have prepared a PR for it:
neondatabase/postgres#154

With 32 pages the regression tests now pass.
With 4 or 8 pages the vacuum test still fails; I have not yet investigated what goes wrong there.
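
For illustration, a rough sketch of why the per-record reset is affordable, assuming the spill area is a flat array guarded by a use counter. The layout, constants, and names here are hypothetical; the real inmem smgr in vendor/postgres may differ.

```c
#define BLCKSZ    8192   /* Postgres block size */
#define MAX_PAGES 32     /* spill-area capacity used in the test above */

/* Hypothetical spill-area layout: a flat array of page slots plus a counter. */
typedef struct InmemPage
{
    char data[BLCKSZ];
} InmemPage;

static InmemPage inmem_pages[MAX_PAGES];
static int       inmem_used;     /* number of occupied slots */

/*
 * Resetting is a single assignment: the old contents are simply treated as
 * free and will be overwritten, so calling this after every WAL record costs
 * essentially nothing, unlike rebuilding a hash table.
 */
static void
inmem_init(void)
{
    inmem_used = 0;
}

/*
 * Hypothetical: grab the next free slot; once a single batch spills more
 * than MAX_PAGES pages this fails, which is where the "Inmem storage
 * overflow" error above would come from.
 */
static InmemPage *
inmem_alloc_page(void)
{
    if (inmem_used >= MAX_PAGES)
        return NULL;
    return &inmem_pages[inmem_used++];
}
```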

@hlinnaka
Contributor Author

CI failed:

asyncpg.exceptions.PostgresIOError: could not read block 159041 in rel 1663/12974/16384.0 from page server at lsn 0/40025640
DETAIL:  page server returned error: could not find data for key 000000067F000032AE000040000000026D41 at LSN 0/16965B0, for request at LSN 0/4002BD28
test_runner/batch_others/test_parallel_copy.py:53: in test_parallel_copy

I cannot reproduce that locally, and I believe this doesn't happen consistently on the CI either.

  1. Is this related to this PR, or is the test flaky on 'main'?
  2. Have we seen this bug before?

@hlinnaka
Contributor Author

@knizhnik, can you finish this, please? It's a confusing situation that the tip of 'main' on vendor/postgres is not what's actually used in the builds.

@knizhnik
Contributor

> @knizhnik, can you finish this, please? It's a confusing situation that the tip of 'main' on vendor/postgres is not what's actually used in the builds.

Sorry, but what should I do?
parallel_copy_test failed, but it is marked as "flaky".
The error "could not find data for key" can be caused by the race condition in GC that we have discussed several times, but patch #1447 is still not committed.
I am not sure whether this failure is actually caused by that bug. I can try to investigate it, or we can just rerun the tests.
I do not think that the WAL redo optimizations on the Postgres side can directly influence parallel copy behaviour, but the increased WAL redo speed may make that race condition more frequent.

@hlinnaka
Contributor Author

> Sorry, but what should I do? parallel_copy_test failed, but it is marked as "flaky". The error "could not find data for key" can be caused by the race condition in GC that we have discussed several times, but patch #1447 is still not committed. I am not sure whether this failure is actually caused by that bug. I can try to investigate it, or we can just rerun the tests. I do not think that the WAL redo optimizations on the Postgres side can directly influence parallel copy behaviour, but the increased WAL redo speed may make that race condition more frequent.

Ok, if you're convinced it's not caused by this bug, then let's ignore it and push this.

We really need to hunt down that GC bug...

This brings us the performance improvements to WAL redo from
neondatabase/postgres#144
@knizhnik force-pushed the bump-vendor-postgres branch from 81b5c6b to 7223ed6 on May 3, 2022 05:13
@knizhnik merged commit 62449d6 into main on May 3, 2022
@knizhnik deleted the bump-vendor-postgres branch on May 3, 2022 06:25