Epic: recovery of lagging safekeepers #1012

Closed
1 of 3 tasks
stepashka opened this issue Dec 16, 2021 · 0 comments
Assignees
Labels
c/storage/safekeeper (Component: storage: safekeeper), H12023, t/Epic (Issue type: Epic)

Comments

@stepashka
Member

stepashka commented Dec 16, 2021

The main issue here is that a lagging safekeeper makes the compute hold old WAL in memory, which can lead to the compute getting killed by the OOM killer, especially when WAL is generated at a fast rate.
We'll mitigate the OOM issue with #1337, but without a peer-to-peer recovery solution in place we expect compute node startup to slow down significantly when one of the safekeepers is lagging.
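
For context on why this is an OOM risk, here is a minimal sketch (hypothetical names, not the actual walproposer code) of the constraint described above: the compute can only release buffered WAL up to the position acknowledged by every safekeeper, so a single lagging member pins the whole tail in memory.

```rust
/// Hypothetical illustration: WAL buffered in the compute can only be
/// released up to the minimum flush position acknowledged by all
/// safekeepers, so one lagging safekeeper keeps the whole tail resident.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

struct SafekeeperState {
    /// Last WAL position this safekeeper has durably flushed.
    flush_lsn: Lsn,
}

/// WAL below this LSN no longer needs to be kept by the compute.
fn release_lsn(safekeepers: &[SafekeeperState]) -> Option<Lsn> {
    safekeepers.iter().map(|sk| sk.flush_lsn).min()
}

fn main() {
    let sks = [
        SafekeeperState { flush_lsn: Lsn(0x5000_0000) },
        SafekeeperState { flush_lsn: Lsn(0x5000_0000) },
        // Lagging safekeeper: everything after 0x1000_0000 stays buffered.
        SafekeeperState { flush_lsn: Lsn(0x1000_0000) },
    ];
    let keep_from = release_lsn(&sks).unwrap();
    println!("WAL from {:#x} onward must stay buffered in the compute", keep_from.0);
}
```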

@stepashka stepashka added the c/storage/safekeeper Component: storage: safekeeper label Dec 16, 2021
@stepashka stepashka added this to the Limited Preview milestone Dec 16, 2021
@stepashka stepashka added the t/Epic Issue type: Epic label Dec 16, 2021
@stepashka stepashka added the p/cloud Product: Neon Cloud label Dec 27, 2021
@stepashka stepashka changed the title Epic: recovery of straggler safekeepers Epic: recovery of lagging safekeepers Jan 14, 2022
@stepashka stepashka removed the p/cloud Product: Neon Cloud label Jan 26, 2022
@stepashka stepashka removed this from the Technical preview milestone Mar 15, 2022
@stepashka stepashka added this to the 1.0 Technical preview milestone Apr 21, 2022
@stepashka stepashka removed this from the 1.0 Technical preview milestone May 25, 2022
arssher added a commit that referenced this issue Dec 9, 2023
It is similar to XLogReader, but when either the requested segment is missing
locally or the requested LSN is before basebackup_lsn, NeonWALReader
asynchronously fetches the WAL from one of the safekeepers.

The patch includes switching the walproposer to NeonWALReader; splitting that
out wouldn't make much sense, as it would be hard to test otherwise. This
finally removes the risk of pg_wal explosion (as well as slow start times)
when one safekeeper is lagging, while at the same time allowing it to recover.

In the future the reader should also be used by the logical walsender for
similar reasons (currently we download the tail synchronously on compute
start).

The main test is test_lagging_sk. However, I also ran it manually a lot,
varying MAX_SEND_SIZE on both sides (safekeeper and walproposer) to test
various fragmentations (a small buffer on one side, on the other, or on both),
which brought up #6055.

closes #1012
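
As an aside, here is a minimal sketch (hypothetical names, not the real NeonWALReader API) of the read-path decision this commit message describes: serve the read from local pg_wal when the segment is present and the LSN is at or past basebackup_lsn, otherwise fetch the WAL from one of the safekeepers.

```rust
/// Hypothetical sketch of the per-request decision described in the commit
/// message: read locally when possible, otherwise fall back to fetching
/// WAL over the network from a safekeeper.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct Lsn(u64);

enum WalSource {
    /// The requested range is available in local pg_wal.
    Local,
    /// The range must be fetched from a safekeeper.
    RemoteSafekeeper,
}

struct NeonWalReaderSketch {
    /// WAL before this position is not guaranteed to exist locally,
    /// because the compute started from a basebackup taken at this LSN.
    basebackup_lsn: Lsn,
}

impl NeonWalReaderSketch {
    fn choose_source(&self, request_lsn: Lsn, segment_exists_locally: bool) -> WalSource {
        if request_lsn < self.basebackup_lsn || !segment_exists_locally {
            // Either the compute never had this WAL (pre-basebackup) or the
            // segment is missing on disk: ask a safekeeper for it.
            WalSource::RemoteSafekeeper
        } else {
            WalSource::Local
        }
    }
}

fn main() {
    let reader = NeonWalReaderSketch { basebackup_lsn: Lsn(0x2000_0000) };
    // WAL from before the basebackup must come from a safekeeper.
    assert!(matches!(
        reader.choose_source(Lsn(0x1000_0000), true),
        WalSource::RemoteSafekeeper
    ));
    // WAL that exists locally past basebackup_lsn is read from pg_wal.
    assert!(matches!(
        reader.choose_source(Lsn(0x3000_0000), true),
        WalSource::Local
    ));
    println!("read-path decision sketch holds");
}
```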
arssher added a commit that referenced this issue Dec 15, 2023
arssher added a commit that referenced this issue Dec 26, 2023