Don't hold walproposer WAL in memory #141

petuhovskiy · 2022-03-04T23:36:54Z

No description provided.

petuhovskiy · 2022-03-24T15:29:12Z

This PR is generally ready for review. CI tests are passing, except that test_pageserver_catchup_while_compute_down timed out in debug; investigating that.

petuhovskiy · 2022-03-28T06:45:21Z

Looked at the test and the logs; the failure is probably not related to this PR. The test_pageserver_catchup_while_compute_down test runs queries on compute while the pageserver is down. It seems that this is sometimes not possible, and the test hangs.

@petuhovskiy

We intentionally write while pageserver is down, so we shouldn't query it. Noticed by @petuhovskiy at neondatabase/postgres#141 (comment)

src/backend/replication/walproposer_utils.c

arssher

Nice. So this commit removes the in-memory queue, puts WAL fetched during recovery on disk in its usual place, and runs a separate XLogReader for each safekeeper. This prevents OOM when one safekeeper lags far behind. Let's put that into the commit message.

To preserve sending of empty AppendRequests, which deliver commit_lsn and other metadata to safekeepers, SendAppendRequests now always sends one message even if there is no WAL data.
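The behavior described above can be modeled with a small sketch. This is a simplified Python model, not the C code from walproposer: the `Safekeeper` class, the `inbox` list, and the `MAX_SEND_SIZE` cap are hypothetical names introduced only for illustration. It shows each safekeeper keeping its own read position into the on-disk WAL, and the send routine emitting a message even when there is nothing new to send.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

MAX_SEND_SIZE = 16  # hypothetical per-message cap in bytes, tiny for illustration


@dataclass
class Safekeeper:
    """Hypothetical per-safekeeper state with its own read position."""
    name: str
    sent_lsn: int = 0  # next WAL byte to read for this safekeeper
    # Each entry models one AppendRequest: (begin_lsn, wal_bytes, commit_lsn).
    inbox: List[Tuple[int, bytes, int]] = field(default_factory=list)


def send_append_requests(wal_on_disk: bytes, sk: Safekeeper, commit_lsn: int) -> int:
    """Send everything from sk.sent_lsn to the end of the WAL.

    At least one message always goes out, so commit_lsn reaches the
    safekeeper even when there is no new WAL. Returns the message count.
    """
    messages = 0
    while True:
        chunk = wal_on_disk[sk.sent_lsn:sk.sent_lsn + MAX_SEND_SIZE]
        if not chunk and messages > 0:
            break  # nothing left, and something was already sent
        sk.inbox.append((sk.sent_lsn, chunk, commit_lsn))
        sk.sent_lsn += len(chunk)
        messages += 1
        if not chunk:
            break  # that was the empty, metadata-only message
    return messages
```

Because the position lives in each `Safekeeper`, a lagging one simply reads older bytes from the same on-disk WAL, and nothing has to be buffered in memory on its behalf.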

I added minor comments to the branch, consider merging them.

BTW, I did a quick performance comparison under normal operation, and the result is strange.
With the patch: test_walproposer_pgbench.tps_pgbench: 12,308.0612
Without the patch: test_walproposer_pgbench.tps_pgbench: 5,390.6744
Not sure why yet.

petuhovskiy · 2022-04-04T12:36:12Z

The performance change for a local single-safekeeper setup is quite strange, but I expect some difference in a cloud setup. I'll probably try to run my cloud tests after everything is merged to main.

petuhovskiy · 2022-04-14T07:17:45Z

Here are the results of running the cloud benchmark after this patch: ~20k TPS, which seems similar to the previous results (~20.5k TPS).

WAL is no longer held in memory, to prevent OOM in the compute. The in-memory queue is removed because it is no longer needed. When streaming, WAL is now read directly from disk, and every safekeeper has a separate XLogReader. walproposer now reads as much WAL as it can for a single AppendRequest message, which can help lagging safekeepers recover. Because recovery needs to save WAL for streaming, walproposer can now write WAL to disk, and `--sync-safekeepers` mode creates the pg_wal directory if needed. The replication slot's `restart_lsn` is now synced with `truncate_lsn` to prevent truncation of on-disk WAL until it is no longer needed.
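To illustrate why tying `restart_lsn` to `truncate_lsn` keeps the needed WAL on disk, here is a minimal Python sketch. It rests on assumptions not stated in this thread: that `truncate_lsn` is the lowest flush position among the safekeepers, and that a WAL segment may be recycled only once it lies entirely below `restart_lsn`. The function names and the tiny `SEGMENT_SIZE` are hypothetical.

```python
from typing import Dict, List

SEGMENT_SIZE = 16  # hypothetical segment size in bytes, tiny for illustration


def truncate_lsn(flush_lsns: Dict[str, int]) -> int:
    """Assumed definition: the oldest position any safekeeper may still need."""
    return min(flush_lsns.values())


def removable_segments(segment_starts: List[int], restart_lsn: int) -> List[int]:
    """Return start offsets of segments lying entirely below restart_lsn."""
    return [s for s in segment_starts if s + SEGMENT_SIZE <= restart_lsn]


if __name__ == "__main__":
    segments = [0, 16, 32, 48, 64]
    flush = {"sk1": 70, "sk2": 20, "sk3": 64}  # sk2 is lagging
    restart_lsn = truncate_lsn(flush)          # slot follows truncate_lsn
    print(removable_segments(segments, restart_lsn))
```

In this model, one lagging safekeeper pins almost all segments on disk. Once it catches up, the minimum advances and the older segments become removable.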

petuhovskiy requested a review from arssher March 4, 2022 23:37

petuhovskiy force-pushed the walproposer-wal-on-disk branch from 78e0afb to 50c7fe6 Compare March 22, 2022 14:33

petuhovskiy mentioned this pull request Mar 24, 2022

Bump vendor/postgres to store WAL on disk only neondatabase/neon#1342

Merged

arssher mentioned this pull request Mar 28, 2022

Make shared_buffers large in test_pageserver_catchup. neondatabase/neon#1422

Merged

arssher added a commit to neondatabase/neon that referenced this pull request Mar 28, 2022

Make shared_buffers large in test_pageserver_catchup.

8bdea32

We intentionally write while pageserver is down, so we shouldn't query it. Noticed by @petuhovskiy at neondatabase/postgres#141 (comment)

arssher added a commit to neondatabase/neon that referenced this pull request Mar 28, 2022

Make shared_buffers large in test_pageserver_catchup.

75002ad

We intentionally write while pageserver is down, so we shouldn't query it. Noticed by @petuhovskiy at neondatabase/postgres#141 (comment)

petuhovskiy added 3 commits March 29, 2022 07:01

Store walproposer WAL on disk

ea1ba01

Sync slot restart_lsn with truncate_lsn

c36b68f

Remove walproposer queue

6c7e336

petuhovskiy force-pushed the walproposer-wal-on-disk branch from c1a1d25 to 6c7e336 Compare March 29, 2022 07:02

arssher reviewed Apr 1, 2022

src/backend/replication/walproposer_utils.c (outdated)

arssher approved these changes Apr 1, 2022

Rename recv prefix to walprop

3cf57fe

petuhovskiy merged commit 8481459 into main Apr 4, 2022

petuhovskiy deleted the walproposer-wal-on-disk branch April 4, 2022 12:52
