Epic: recovery of lagging safekeepers #1012

Closed
1 of 3 tasks
stepashka opened this issue Dec 16, 2021 · 0 comments
Assignees
Labels
c/storage/safekeeper (Component: storage: safekeeper), H12023, t/Epic (Issue type: Epic)

Comments

@stepashka
Member

stepashka commented Dec 16, 2021

The main issue here is that a lagging safekeeper makes the compute hold old WAL in memory, which can lead to the compute getting killed by the OOM killer, especially when WAL is generated at a fast rate.
We'll mitigate the OOM issue with #1337, but without a peer-to-peer recovery solution in place we expect compute node startup to slow down significantly when one of the safekeepers is lagging.
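
For context on why this is an OOM risk, here is a minimal sketch (hypothetical names, not the actual walproposer code) of the constraint described above: the compute can only release buffered WAL up to the position acknowledged by every safekeeper, so a single lagging member pins the whole tail in memory.

```rust
/// Hypothetical illustration: WAL buffered in the compute can only be
/// released up to the minimum flush position acknowledged by all
/// safekeepers, so one lagging safekeeper keeps the whole tail resident.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

struct SafekeeperState {
    /// Last WAL position this safekeeper has durably flushed.
    flush_lsn: Lsn,
}

/// WAL below this LSN no longer needs to be kept by the compute.
fn release_lsn(safekeepers: &[SafekeeperState]) -> Option<Lsn> {
    safekeepers.iter().map(|sk| sk.flush_lsn).min()
}

fn main() {
    let sks = [
        SafekeeperState { flush_lsn: Lsn(0x5000_0000) },
        SafekeeperState { flush_lsn: Lsn(0x5000_0000) },
        // Lagging safekeeper: everything after 0x1000_0000 stays buffered.
        SafekeeperState { flush_lsn: Lsn(0x1000_0000) },
    ];
    let keep_from = release_lsn(&sks).unwrap();
    println!("WAL from {:#x} onward must stay buffered in the compute", keep_from.0);
}
```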

@stepashka stepashka added the c/storage/safekeeper Component: storage: safekeeper label Dec 16, 2021
@stepashka stepashka added this to the Limited Preview milestone Dec 16, 2021
@stepashka stepashka added the t/Epic Issue type: Epic label Dec 16, 2021
@stepashka stepashka added the p/cloud Product: Neon Cloud label Dec 27, 2021
@stepashka stepashka changed the title Epic: recovery of straggler safekeepers Epic: recovery of lagging safekeepers Jan 14, 2022
@stepashka stepashka removed the p/cloud Product: Neon Cloud label Jan 26, 2022
@stepashka stepashka removed this from the Technical preview milestone Mar 15, 2022
@stepashka stepashka added this to the 1.0 Technical preview milestone Apr 21, 2022
@stepashka stepashka removed this from the 1.0 Technical preview milestone May 25, 2022
arssher added a commit that referenced this issue Dec 9, 2023
It is similar to XLogReader, but when either the requested segment is missing
locally or the requested LSN is before basebackup_lsn, NeonWALReader
asynchronously fetches the WAL from one of the safekeepers.

The patch includes switching the walproposer to NeonWALReader; splitting that
out wouldn't make much sense, as it would be hard to test otherwise. This
finally removes the risk of pg_wal explosion (as well as slow start times)
when one safekeeper is lagging, while at the same time allowing it to recover.

In the future the reader should also be used by the logical walsender for
similar reasons (currently we download the tail synchronously on compute
start).

The main test is test_lagging_sk. However, I also ran it manually a lot,
varying MAX_SEND_SIZE on both sides (safekeeper and walproposer) to test
various fragmentations (a small buffer on one side, on the other, or on both),
which brought up #6055.

closes #1012
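
As an aside, here is a minimal sketch (hypothetical names, not the real NeonWALReader API) of the read-path decision this commit message describes: serve the read from local pg_wal when the segment is present and the LSN is at or past basebackup_lsn, otherwise fetch the WAL from one of the safekeepers.

```rust
/// Hypothetical sketch of the per-request decision described in the commit
/// message: read locally when possible, otherwise fall back to fetching
/// WAL over the network from a safekeeper.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

enum WalSource {
    /// The requested range is available in local pg_wal.
    Local,
    /// The range must be fetched from a safekeeper.
    RemoteSafekeeper,
}

struct NeonWalReaderSketch {
    /// WAL before this position is not guaranteed to exist locally,
    /// because the compute started from a basebackup taken at this LSN.
    basebackup_lsn: Lsn,
}

impl NeonWalReaderSketch {
    fn choose_source(&self, request_lsn: Lsn, segment_exists_locally: bool) -> WalSource {
        if request_lsn < self.basebackup_lsn || !segment_exists_locally {
            // Either the compute never had this WAL (pre-basebackup) or the
            // segment is missing on disk: ask a safekeeper for it.
            WalSource::RemoteSafekeeper
        } else {
            WalSource::Local
        }
    }
}

fn main() {
    let reader = NeonWalReaderSketch { basebackup_lsn: Lsn(0x2000_0000) };
    // WAL from before the basebackup must come from a safekeeper.
    assert!(matches!(
        reader.choose_source(Lsn(0x1000_0000), true),
        WalSource::RemoteSafekeeper
    ));
    // WAL that exists locally past basebackup_lsn is read from pg_wal.
    assert!(matches!(
        reader.choose_source(Lsn(0x3000_0000), true),
        WalSource::Local
    ));
    println!("read-path decision sketch holds");
}
```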
arssher added a commit that referenced this issue Dec 15, 2023
arssher added a commit that referenced this issue Dec 26, 2023