
Verify WAL-disabled crash-recovery consistency globally #11613

Open
fuatbasik opened this issue Jul 13, 2023 · 5 comments
@fuatbasik
Contributor
Expected behavior

Crash-recovery consistency needs to be verified globally (whitebox and blackbox tests, and transactions) when RocksDB is used with the WAL disabled.

Actual behavior

According to PR #9338, disabling the WAL in whitebox tests is currently not possible. We should be able to verify consistency of the recovered data in WAL-disabled use cases, where unflushed writes are expected to be lost.
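
For illustration, here is a toy sketch of the invariant such verification would check (names and structure are illustrative, not db_stress's actual verification code): writes that completed before the last successful flush must survive, later writes may be lost, and no value the test never wrote may appear.

    # Toy sketch of the WAL-disabled recovery invariant. `recovered` is the state
    # read back after reopening the DB, `all_writes` is every write the test
    # issued, and `flushed_writes` is the subset known to have been flushed
    # before the crash. For simplicity, assume each key is written at most once.
    def verify_wal_disabled_recovery(recovered, all_writes, flushed_writes):
        violations = []
        # 1. Everything flushed before the crash must survive.
        for key, value in flushed_writes.items():
            if recovered.get(key) != value:
                violations.append(f"flushed key {key!r} was lost or corrupted")
        # 2. Unflushed writes may be lost, but nothing may appear that was
        #    never written by the test.
        for key, value in recovered.items():
            if all_writes.get(key) != value:
                violations.append(f"key {key!r} holds a value the test never wrote")
        return violations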

Steps to reproduce the behavior

There is a TODO in the code that disables running whitebox tests with the WAL-disabled setting:

see tools/db_crashtest.py

    # TODO: enable this once we figure out how to adjust kill odds for WAL-
    # disabled runs, and either (1) separate full `db_stress` runs out of
    # whitebox crash or (2) support verification at end of `db_stress` runs
    # that ran with WAL disabled.
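
For context, a rough sketch of the knob the TODO refers to (modeled on how tools/db_crashtest.py drives db_stress, but not its actual code): the whitebox loop passes --kill_random_test, which gives db_stress roughly a 1-in-N chance of killing itself at each instrumented kill point. With the WAL disabled, far fewer kill points are hit, so the odds would need to be rebalanced. All numbers below are made up.

    import subprocess

    def launch_whitebox_db_stress(dbname, disable_wal):
        # --disable_wal and --kill_random_test are existing db_stress flags; the
        # odds values here are purely illustrative. Choosing odds that still
        # produce timely crashes when the WAL is disabled is the open question.
        kill_odds = 3000 if disable_wal else 100000
        cmd = [
            "./db_stress",
            f"--db={dbname}",
            f"--disable_wal={1 if disable_wal else 0}",
            f"--kill_random_test={kill_odds}",
        ]
        return subprocess.Popen(cmd)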
@fuatbasik
Contributor Author

fuatbasik commented Jul 13, 2023

I would like to work on this if you could give me some pointers on how to fix this and where to start.

@ajkr
Contributor

ajkr commented Aug 2, 2023

enable this once we figure out how to adjust kill odds

I think this was necessary because some of the db_stress invocations produced by the whitebox crash test mainly/only crash in write functions. When the WAL is disabled there are very few calls to write functions, since writes to SST files are buffered and other writes (e.g., to the MANIFEST) are infrequent. So db_stress might end up never crashing itself. We do have a 15-minute timeout for it to be killed externally, though consistently hitting that would make it essentially a blackbox crash test with a longer-than-usual interval:

# If the running time is 15 minutes over the run time, explicit kill and

and either ...

The two options listed imply full db_stress runs do not pass post-run validation when WAL is disabled. I don't remember what the issue was but we can try it and see what happens.
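
As a rough illustration of that fallback (assumed structure, not db_crashtest.py's actual code): if db_stress never kills itself, the driver waits out the interval plus the 15-minute grace period quoted above and then kills the process from outside, which behaves like a blackbox-style crash.

    import subprocess

    def wait_or_kill(proc, interval_sec):
        """Return True if db_stress had to be killed externally."""
        try:
            # Give db_stress the configured interval plus a 15-minute grace
            # period to crash on its own (via a kill point) or exit cleanly.
            proc.wait(timeout=interval_sec + 15 * 60)
            return False
        except subprocess.TimeoutExpired:
            proc.kill()  # external kill: effectively a blackbox-style crash
            proc.wait()
            return True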

@ajkr
Contributor

ajkr commented Aug 2, 2023

I don't remember what the issue was but we can try it and see what happens.

Couldn't find any issue after specifying -reopen=0. I probably didn't want to do that for our full db_stress runs because then there is no recovery, which makes the WAL vs. no WAL cases indistinguishable. So let's say the issue is to make -reopen > 0 work with -disable_wal=1. Will update the TODO.
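
For concreteness, the two configurations being discussed (flag names come from this thread; everything else is illustrative):

    # Works per the comment above: a full db_stress run with the WAL disabled
    # and no mid-run reopen, verified at the end of the run.
    works_today = ["./db_stress", "--disable_wal=1", "--reopen=0"]

    # The open item tracked by the updated TODO: the same run with periodic
    # graceful close/reopen cycles.
    goal = ["./db_stress", "--disable_wal=1", "--reopen=10"]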

@fuatbasik
Contributor Author

fuatbasik commented Aug 4, 2023

Thanks a lot @ajkr. I also read the updated TODO.

To solve the first problem -- crash tests not crashing within 15 minutes -- can we add more kill points to the code? Would it make sense to add kill points to interesting code paths, like the middle of a flush, during compaction, or in the middle of atomic operations? If you think this is a good approach, are there any other places you would recommend adding kill points? Alternatively, we could set a really small memtable to trigger more flushes and hit kill points more frequently (sketched below).
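
A sketch of the small-memtable idea (write_buffer_size is an existing db_stress/RocksDB option as far as I can tell; the value is illustrative):

    # Shrink the memtable so flushes -- and any flush-path kill points -- are
    # hit far more often during a WAL-disabled run.
    extra_args = [
        "--disable_wal=1",
        "--write_buffer_size={}".format(64 * 1024),  # 64 KB memtable
    ]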

For the second one, my understanding is that there is another issue: -disable_wal=1 combined with -reopen > 0 is not supported. Why is that the case, and what needs to be done to support it? A related question: it looks like reopen gracefully closes the DB. Which crash scenarios does it help to cover?

Thanks a lot for your help.

@ajkr
Contributor

ajkr commented Aug 6, 2023

can we add more kill points to the code? Would it make sense to add kill points to interesting code paths, like the middle of a flush, during compaction, or in the middle of atomic operations?

Yes that sounds good to me.

my understanding is that there is another issue: -disable_wal=1 combined with -reopen > 0 is not supported. Why is that the case, and what needs to be done to support it? A related question: it looks like reopen gracefully closes the DB. Which crash scenarios does it help to cover?

The restriction was added 10 years ago (3827403) and might no longer be relevant. Feel free to remove it. I suppose the scenario covered by -reopen > 0 is the graceful close/open process. You are right that this isn't supposed to involve the WAL, unless it's broken or we add coverage for avoid_flush_during_shutdown=true.

facebook-github-bot pushed a commit that referenced this issue Aug 16, 2023
Summary:
See #11613

Pull Request resolved: #11665

Reviewed By: hx235

Differential Revision: D48010507

Pulled By: ajkr

fbshipit-source-id: 65c6d87d2c6ffc9d25f1d17106eae467ec528082
@ajkr added the up-for-grabs label on Jan 8, 2024