
Bump vendor/postgres #1573

Merged
knizhnik merged 1 commit into main from bump-vendor-postgres on May 3, 2022

Conversation

hlinnaka
Contributor

This brings us the performance improvements to WAL redo from
neondatabase/postgres#144

@hlinnaka requested a review from knizhnik on April 25, 2022 15:19
@hlinnaka force-pushed the bump-vendor-postgres branch from 3cb2b1a to a5d9e23 on April 25, 2022 18:25
@hlinnaka force-pushed the bump-vendor-postgres branch 2 times, most recently from c6094e6 to 2458372 on April 26, 2022 06:47
@hlinnaka
Contributor Author

The pg_regress test is failing. In the debug build, it fails with an assertion failure:

2022-04-26T07:12:19.265374Z ERROR pagestream{timeline=2fdd02eb95f315e821dac04ea3278617 tenant=b6e17e6cc23e4ed293994374f5d57c75}:get_page{rel=1663/16384/2674 blkno=29 req_lsn=0/4E53660}: wal-redo-postgres: TRAP: FailedAssertion("relpersistence == 0 || reln->smgr_relpersistence == relpersistence", File: "/home/circleci/project/tmp_install/build/../../vendor/postgres/src/backend/storage/smgr/smgr.c", Line: 197, PID: 18984)

In the release build, it fails with an error:

2022-04-26T06:58:52.163195Z ERROR pagestream{timeline=57a11af6ff1e464511940c3f5b84498f tenant=aa327d76e07f49b7bb4e4529d2cdf7e9}:get_page{rel=1663/16384/24201 blkno=289 req_lsn=0/83DD1F8}: wal-redo-postgres: 2022-04-26 06:58:52.163 UTC [643] FATAL:  Inmem storage overflow

I'm not seeing this on my laptop.

@knizhnik
Contributor

> In the release build, it fails with an error:

I also cannot reproduce it on my laptop.
Should we try increasing MAX_PAGES to 256?

@hlinnaka
Contributor Author

> In the release build, it fails with an error:
>
> I also cannot reproduce it on my laptop. Should we try increasing MAX_PAGES to 256?

I'd rather decrease MAX_PAGES to make the problem appear more readily, so that we can hunt it down and fix it. I don't think there is any WAL record that legitimately requires that many buffers.

@hlinnaka force-pushed the bump-vendor-postgres branch from 52da6a9 to c8327ae on April 26, 2022 14:07
@hlinnaka
Contributor Author

I lowered MAX_PAGES to 32, and now I can reproduce the "Inmem storage overflow" error:

2022-04-26T14:23:28.546042Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: wal-redo-postgres: 2022-04-26 17:23:28.546 EEST [1596627] FATAL:  Inmem storage overflow
2022-04-26 17:23:28.546 EEST [1596627] CONTEXT:  WAL redo at 0/6A73D30 for Hash/SPLIT_ALLOCATE_PAGE: new_bucket 29, meta_page_masks_updated F, issplitpoint_changed F

2022-04-26T14:23:28.546092Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: wal-redo-postgres: ---------------------------------------
seccomp: bad syscall 11
---------------------------------------

2022-04-26T14:23:28.547059Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: error applying 30650 WAL records to reconstruct page image at LSN 0/7EBDBF0
2022-04-26T14:23:28.548297Z ERROR pagestream{timeline=a734c593f879cca1b8a4fa7e28f68c00 tenant=cb0ec521677f4aa09759c4dc85e0e4ee}:get_page{rel=1663/16384/23688 blkno=0 req_lsn=0/7EBD7B0}: wal-redo-postgres exited with code exit status: 1

It seems to happen with many different WAL record types. I think the important thing here is that you have a lot of WAL records to replay in one batch, 30650 records in the above example. Even if each WAL record needs to evict just one page from the local buffer cache into the "inmem" area, that adds up when you have a lot of records to replay.

It occurs to me that we could clear the "inmem" area after every record. Records are supposed to be independent of each other; it shouldn't be necessary to carry over any buffers other than the target page we're applying the records for.
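
To make the reasoning concrete, here is a minimal, self-contained sketch. The counter, helper names, and batch-size constant are hypothetical stand-ins, not the actual vendor/postgres wal-redo code; it just models a batch the size of the one in the log above and shows how a per-record reset keeps spill usage bounded by a single record's needs instead of the whole batch.

```c
/* Illustrative sketch only: the names below are stand-ins, not vendor/postgres symbols. */
#include <stdio.h>

typedef struct WalRecord { int id; } WalRecord;

#define BATCH_SIZE 30650            /* size of the batch from the log above */

static int spill_pages_in_use = 0;  /* pages currently parked in the "inmem" area */
static int spill_high_water = 0;    /* worst case seen during the batch */

/* Hypothetical: replaying one record may evict a page into the spill area. */
static void
redo_one_record(const WalRecord *rec)
{
    (void) rec;
    spill_pages_in_use++;
    if (spill_pages_in_use > spill_high_water)
        spill_high_water = spill_pages_in_use;
}

/* Hypothetical: drop the spill area; the target page is held elsewhere,
 * so nothing here needs to survive from one record to the next. */
static void
inmem_reset(void)
{
    spill_pages_in_use = 0;
}

int
main(void)
{
    for (int i = 0; i < BATCH_SIZE; i++)
    {
        WalRecord rec = { i };

        redo_one_record(&rec);
        inmem_reset();              /* per-record reset keeps usage bounded */
    }

    /* Without the per-record reset the high-water mark would approach
     * BATCH_SIZE and blow past any fixed MAX_PAGES limit. */
    printf("spill high-water mark: %d page(s)\n", spill_high_water);
    return 0;
}
```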

@knizhnik
Contributor

Oh, sorry.
I was sure that the smgr is reset after applying each record, but that is not true. Previously there was a reason not to do so, because reconstructing the hash table was not a cheap operation. Now inmem_init just does one assignment, so it really can be performed for each record.
I have prepared a PR for it:
neondatabase/postgres#154

With 32 pages the regression tests now pass.
With 4 or 8 pages the vacuum test still fails; I have not yet investigated what goes wrong there.
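
For illustration, a rough sketch of why the per-record reset is affordable, assuming the spill area is a flat array guarded by a use counter. The layout, constants, and names here are hypothetical; the real inmem smgr in vendor/postgres may differ.

```c
#define BLCKSZ    8192   /* Postgres block size */
#define MAX_PAGES 32     /* spill-area capacity used in the test above */

/* Hypothetical spill-area layout: a flat array of page slots plus a counter. */
typedef struct InmemPage
{
    char data[BLCKSZ];
} InmemPage;

static InmemPage inmem_pages[MAX_PAGES];
static int       inmem_used;     /* number of occupied slots */

/*
 * Resetting is a single assignment: the old contents are simply treated as
 * free and will be overwritten, so calling this after every WAL record costs
 * essentially nothing, unlike rebuilding a hash table.
 */
static void
inmem_init(void)
{
    inmem_used = 0;
}

/*
 * Hypothetical: grab the next free slot; once a single batch spills more
 * than MAX_PAGES pages this fails, which is where the "Inmem storage
 * overflow" error above would come from.
 */
static InmemPage *
inmem_alloc_page(void)
{
    if (inmem_used >= MAX_PAGES)
        return NULL;
    return &inmem_pages[inmem_used++];
}
```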

@hlinnaka
Contributor Author

CI failed:

asyncpg.exceptions.PostgresIOError: could not read block 159041 in rel 1663/12974/16384.0 from page server at lsn 0/40025640
DETAIL:  page server returned error: could not find data for key 000000067F000032AE000040000000026D41 at LSN 0/16965B0, for request at LSN 0/4002BD28
test_runner/batch_others/test_parallel_copy.py:53: in test_parallel_copy

I cannot reproduce that locally, and I believe this doesn't happen consistently on the CI either.

  1. Is this related to this PR, or is the test flaky on 'main'?
  2. Have we seen this bug before?

@hlinnaka
Contributor Author

@knizhnik, can you finish this, please? It's a confusing situation that the tip of 'main' on vendor/postgres is not what's actually used in the builds.

@knizhnik
Contributor

> @knizhnik, can you finish this, please? It's a confusing situation that the tip of 'main' on vendor/postgres is not what's actually used in the builds.

Sorry, but what should I do?
parallel_copy_test failed, but it is marked as "flaky".
The error "could not find data for key" can be caused by the race condition in GC that we have discussed several times, but patch #1447 is still not committed.
I am not sure whether this failure is actually caused by that bug. I can try to investigate it, or we can just rerun the tests.
I do not think that the WAL redo optimizations on the Postgres side can directly influence parallel copy behaviour, but the increased WAL redo speed may make that race condition more frequent.

@hlinnaka
Contributor Author

> Sorry, but what should I do? parallel_copy_test failed, but it is marked as "flaky". The error "could not find data for key" can be caused by the race condition in GC that we have discussed several times, but patch #1447 is still not committed. I am not sure whether this failure is actually caused by that bug. I can try to investigate it, or we can just rerun the tests. I do not think that the WAL redo optimizations on the Postgres side can directly influence parallel copy behaviour, but the increased WAL redo speed may make that race condition more frequent.

Ok, if you're convinced it's not caused by this bug, then let's ignore it and push this.

We really need to hunt down that GC bug...

This brings us the performance improvements to WAL redo from
neondatabase/postgres#144
@knizhnik force-pushed the bump-vendor-postgres branch from 81b5c6b to 7223ed6 on May 3, 2022 05:13
@knizhnik merged commit 62449d6 into main on May 3, 2022
@knizhnik deleted the bump-vendor-postgres branch on May 3, 2022 06:25