
Verify WAL-disabled crash-recovery consistency globally #11613

Open
fuatbasik opened this issue Jul 13, 2023 · 5 comments
@fuatbasik
Contributor
Expected behavior

Crash-recovery consistency needs to be verified globally (whitebox and blackbox tests, and transactions) when RocksDB is used with the WAL disabled.

Actual behavior

According to PR #9338, disabling the WAL in whitebox tests is currently not possible. We should be able to verify consistency of the recovered data in WAL-disabled use cases, where unflushed writes are expected to be lost.
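
For illustration, here is a toy sketch of the invariant such verification would check (names and structure are illustrative, not db_stress's actual verification code): writes that completed before the last successful flush must survive, later writes may be lost, and no value the test never wrote may appear.

    # Toy sketch of the WAL-disabled recovery invariant. `recovered` is the state
    # read back after reopening the DB, `all_writes` is every write the test
    # issued, and `flushed_writes` is the subset known to have been flushed
    # before the crash. For simplicity, assume each key is written at most once.
    def verify_wal_disabled_recovery(recovered, all_writes, flushed_writes):
        violations = []
        # 1. Everything flushed before the crash must survive.
        for key, value in flushed_writes.items():
            if recovered.get(key) != value:
                violations.append(f"flushed key {key!r} was lost or corrupted")
        # 2. Unflushed writes may be lost, but nothing may appear that was
        #    never written by the test.
        for key, value in recovered.items():
            if all_writes.get(key) != value:
                violations.append(f"key {key!r} holds a value the test never wrote")
        return violations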

Steps to reproduce the behavior

There is a TODO in the code that disables running whitebox tests with the WAL-disabled setting:

see tools/db_crashtest.py

    # TODO: enable this once we figure out how to adjust kill odds for WAL-
    # disabled runs, and either (1) separate full `db_stress` runs out of
    # whitebox crash or (2) support verification at end of `db_stress` runs
    # that ran with WAL disabled.
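
For context, a rough sketch of the knob the TODO refers to (modeled on how tools/db_crashtest.py drives db_stress, but not its actual code): the whitebox loop passes --kill_random_test, which gives db_stress roughly a 1-in-N chance of killing itself at each instrumented kill point. With the WAL disabled, far fewer kill points are hit, so the odds would need to be rebalanced. All numbers below are made up.

    import subprocess

    def launch_whitebox_db_stress(dbname, disable_wal):
        # --disable_wal and --kill_random_test are existing db_stress flags; the
        # odds values here are purely illustrative. Choosing odds that still
        # produce timely crashes when the WAL is disabled is the open question.
        kill_odds = 3000 if disable_wal else 100000
        cmd = [
            "./db_stress",
            f"--db={dbname}",
            f"--disable_wal={1 if disable_wal else 0}",
            f"--kill_random_test={kill_odds}",
        ]
        return subprocess.Popen(cmd)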
@fuatbasik
Contributor Author

fuatbasik commented Jul 13, 2023

I would like to work on this if you could give me some pointers on how to fix this and where to start.

@ajkr
Contributor

ajkr commented Aug 2, 2023

enable this once we figure out how to adjust kill odds

I think this was necessary because some of the db_stress invocations produced by the whitebox crash test mainly/only crash in write functions. When the WAL is disabled there are very few calls to write functions, since writes to SST files are buffered and other writes (e.g., to the MANIFEST) are infrequent. So db_stress might end up never crashing itself. We do have a 15-minute timeout for it to be killed externally, though consistently hitting that would make it essentially a blackbox crash test with a longer-than-usual interval:

# If the running time is 15 minutes over the run time, explicit kill and

and either ...

The two options listed imply full db_stress runs do not pass post-run validation when WAL is disabled. I don't remember what the issue was but we can try it and see what happens.
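
As a rough illustration of that fallback (assumed structure, not db_crashtest.py's actual code): if db_stress never kills itself, the driver waits out the interval plus the 15-minute grace period quoted above and then kills the process from outside, which behaves like a blackbox-style crash.

    import subprocess

    def wait_or_kill(proc, interval_sec):
        """Return True if db_stress had to be killed externally."""
        try:
            # Give db_stress the configured interval plus a 15-minute grace
            # period to crash on its own (via a kill point) or exit cleanly.
            proc.wait(timeout=interval_sec + 15 * 60)
            return False
        except subprocess.TimeoutExpired:
            proc.kill()  # external kill: effectively a blackbox-style crash
            proc.wait()
            return True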

@ajkr
Contributor

ajkr commented Aug 2, 2023

I don't remember what the issue was but we can try it and see what happens.

Couldn't find any issue after specifying -reopen=0. I probably didn't want to do that for our full db_stress runs because then there is no recovery, which makes the WAL vs. no WAL cases indistinguishable. So let's say the issue is to make -reopen > 0 work with -disable_wal=1. Will update the TODO.
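
For concreteness, the two configurations being discussed (flag names come from this thread; everything else is illustrative):

    # Works per the comment above: a full db_stress run with the WAL disabled
    # and no mid-run reopen, verified at the end of the run.
    works_today = ["./db_stress", "--disable_wal=1", "--reopen=0"]

    # The open item tracked by the updated TODO: the same run with periodic
    # graceful close/reopen cycles.
    goal = ["./db_stress", "--disable_wal=1", "--reopen=10"]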

@fuatbasik
Contributor Author

fuatbasik commented Aug 4, 2023

Thanks a lot @ajkr. I also read the updated TODO.

To solve the first problem -- crash tests not crashing within 15 minutes -- can we add more kill points to the code? Would it make sense to add kill points to interesting code paths, like the middle of a flush, during compaction, or in the middle of atomic operations? If you think this is a good approach, are there any other places you would recommend adding kill points? Alternatively, we could set a really small memtable to trigger more flushes and hit kill points more frequently (sketched below).
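
A sketch of the small-memtable idea (write_buffer_size is an existing db_stress/RocksDB option as far as I can tell; the value is illustrative):

    # Shrink the memtable so flushes -- and any flush-path kill points -- are
    # hit far more often during a WAL-disabled run.
    extra_args = [
        "--disable_wal=1",
        "--write_buffer_size={}".format(64 * 1024),  # 64 KB memtable
    ]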

For the second one, my understanding is that there is another issue: -disable_wal=1 combined with -reopen > 0 is not supported. Why is that the case, and what needs to be done to support it? A related question: it looks like reopen gracefully closes the DB. Which crash scenarios does it help to cover?

Thanks a lot for your help.

@ajkr
Contributor

ajkr commented Aug 6, 2023

can we add more kill points to the code? Would it make sense to add kill points to interesting code paths, like the middle of a flush, during compaction, or in the middle of atomic operations?

Yes that sounds good to me.

my understanding is that there is another issue: -disable_wal=1 combined with -reopen > 0 is not supported. Why is that the case, and what needs to be done to support it? A related question: it looks like reopen gracefully closes the DB. Which crash scenarios does it help to cover?

The restriction was added 10 years ago (3827403) and might no longer be relevant. Feel free to remove it. I suppose the scenario covered by -reopen > 0 is the graceful close/open process. You are right that this isn't supposed to involve the WAL, unless it's broken or we add coverage for avoid_flush_during_shutdown=true.

facebook-github-bot pushed a commit that referenced this issue Aug 16, 2023
Summary:
See #11613

Pull Request resolved: #11665

Reviewed By: hx235

Differential Revision: D48010507

Pulled By: ajkr

fbshipit-source-id: 65c6d87d2c6ffc9d25f1d17106eae467ec528082
@ajkr added the up-for-grabs label on Jan 8, 2024