Rollback other pending memtable flushes when a flush fails #11865

Closed
cbi42 wants to merge 7 commits

Conversation

@cbi42 (Member) commented Sep 20, 2023

Summary: when atomic_flush=false, there are certain cases where we try to install memtable results with already deleted SST files. This can happen when the following sequence of events occurs:

Start Flush0 for memtable M0 to SST0
Start Flush1 for memtable M1 to SST1
Flush1 returns OK, but does not install to MANIFEST and lets whoever flushes M0 take care of it
Flush0 finishes with a retryable IOError; it rolls back M0, (incorrectly) does not roll back M1, and deletes SST0 and SST1
Flush2 starts for M0; it does not pick up M1 since M1 is considered already flushed
Flush2 writes SST2 and finishes OK, then tries to install SST2 and SST1
Opening SST1 fails since it has already been deleted, with an error message like the following:

IO error: No such file or directory: While open a file for random read: /tmp/rocksdbtest-501/db_flush_test_3577_4230653031040984171/000011.sst: No such file or directory

This happens because:

  1. We currently only roll back the memtables being flushed by the failing flush job when atomic_flush=false.
  2. Pending output SSTs from previous flushes are deleted, since a pending file number is released whenever a flush job finishes, regardless of the flush status:
    ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);

This PR fixes the issue by rolling back these pending flushes.
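To make the failure sequence concrete, here is a minimal, self-contained C++ model of the state involved (plain standard C++, not RocksDB source; names such as `flush_completed`, `output_sst`, and `rollback_pending_flushes` are illustrative). As written it reproduces the missing-file install; flipping `rollback_pending_flushes` to true models the fix, so a later flush would pick M1 up again instead of installing a deleted file.

```
// Minimal model of the sequence above. M1 was flushed to SST1 but delegates
// installation to whoever flushes the older memtable M0.
#include <cstdio>
#include <set>
#include <string>
#include <vector>

struct MemTable {
  std::string name;
  bool flush_completed = false;  // output written, waiting for delegated install
  std::string output_sst;        // SST that holds this memtable's data, if any
};

int main() {
  MemTable m0{"M0"}, m1{"M1"};
  std::set<std::string> live_files;

  // Flush1: writes SST1, marks M1 flush-completed, delegates installation.
  live_files.insert("SST1");
  m1.flush_completed = true;
  m1.output_sst = "SST1";

  // Flush0: writes SST0, then hits a retryable IOError. It rolls back M0 and
  // releases its pending output file numbers, so SST0 *and* SST1 are deleted
  // as obsolete. Without the fix, M1 is NOT rolled back.
  live_files.insert("SST0");
  const bool rollback_pending_flushes = false;  // flip to true to model the fix
  m0.flush_completed = false;   // rollback of M0 only
  live_files.erase("SST0");
  live_files.erase("SST1");     // pending outputs released -> deleted
  if (rollback_pending_flushes) {
    m1.flush_completed = false;
    m1.output_sst.clear();
  }

  // Flush2: re-flushes M0 only (M1 still looks flushed), writes SST2, then
  // tries to install its own output plus M1's delegated output.
  live_files.insert("SST2");
  m0.flush_completed = true;
  m0.output_sst = "SST2";

  std::vector<std::string> to_install = {m0.output_sst};
  if (m1.flush_completed) to_install.push_back(m1.output_sst);

  for (const auto& f : to_install) {
    if (live_files.count(f) == 0) {
      std::printf("IO error: No such file or directory: %s\n", f.c_str());
      return 1;
    }
    std::printf("installed %s\n", f.c_str());
  }
  return 0;
}
```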

There is another issue: if a new flush for a new memtable starts and finishes after Flush0 finishes, its output may also be deleted (see more in the unit test). It is fixed by checking the bg error status before installing a memtable result, and rolling back if there is an error.

There is a more efficient fix where we simply do not release the pending output file number for flushes that delegate installation. It is more efficient since it does not have to rewrite the flush output file; with the fix in this PR, we can end up with a giant file if many memtables are flushed together. However, the more efficient fix is more complicated to implement (it requires associating such pending file numbers with flush jobs/memtables) and is riskier since it changes the normal flush code path.

Test plan:

  • Added repro unit tests.

@cbi42 cbi42 marked this pull request as draft September 20, 2023 20:26
@cbi42 cbi42 force-pushed the non-atomic-flush-fix branch 2 times, most recently from c1c9906 to f2ee4f7 on September 21, 2023 00:09
@facebook-github-bot (Contributor): @cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@cbi42 cbi42 marked this pull request as ready for review September 21, 2023 00:39
@facebook-github-bot (Contributor): @cbi42 has updated the pull request. You must reimport the pull request before landing.
@anand1976 (Contributor) left a comment:

LGTM. Thanks!

db/flush_job.cc Outdated
log_buffer_, &committed_flush_jobs_info_,
!(mempurge_s.ok()) /* write_edit : true if no mempurge happened (or if aborted),
assert(!db_options_.atomic_flush);
if (!db_options_.atomic_flush && skip_since_bg_error &&
Contributor: Do we need to check for skip_since_bg_error being non-null here? Regardless, we might want to roll back if there's a bg error.

@cbi42 (Member, Author): Yes, removed the check.
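For illustration, here is a tiny standalone sketch of the decision discussed in this thread, assuming `skip_since_bg_error` is an optional out-parameter as the null check in the excerpt suggests; the helper functions are hypothetical, not the actual FlushJob code:

```
#include <iostream>

// Before the change: a background error only triggers rollback when the
// caller passed the out-parameter.
bool ShouldRollbackBefore(bool flush_ok, bool bg_error_set, bool* skip_out) {
  return !flush_ok || (skip_out != nullptr && bg_error_set);
}

// After the change: roll back whenever the flush failed or a bg error is set.
bool ShouldRollbackAfter(bool flush_ok, bool bg_error_set) {
  return !flush_ok || bg_error_set;
}

int main() {
  // The flush itself succeeded, but another job already raised a bg error
  // and no out-parameter was passed.
  std::cout << ShouldRollbackBefore(true, true, nullptr) << "\n";  // 0: rollback missed
  std::cout << ShouldRollbackAfter(true, true) << "\n";            // 1: rolls back
}
```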


TEST_SYNC_POINT("Wait for error recover");
ASSERT_EQ(1, NumTableFilesAtLevel(0));
}
Contributor: Disable SyncPoint processing and callbacks?
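For reference, the cleanup being suggested uses RocksDB's sync-point test utility; a short sketch, assuming it is placed at the end of the test and that the RocksDB source tree provides the `test_util/sync_point.h` header:

```
#include "test_util/sync_point.h"

// Typical end-of-test cleanup so later tests are not affected by the
// callbacks and dependencies registered earlier in this test.
void DisableSyncPointsAfterTest() {
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->ClearAllCallBacks();
}
```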

@@ -296,6 +297,9 @@ Status DBImpl::FlushMemTableToOutputFile(
flush_reason);

bool switched_to_mempurge = false;
// If we exit a flush job since there is already a bg error,
// we should not set the bg error again.
bool skip_since_bg_error = false;
Contributor: Maybe call it skip_set_bg_error to make the purpose clearer.

@cbi42 (Member, Author): Yes, updated the name in FlushMemTableToOutputFile and FlushJob.

@facebook-github-bot (Contributor): @cbi42 has updated the pull request. You must reimport the pull request before landing.

@cbi42 (Member, Author) commented Sep 21, 2023

Thanks for the review! I made one additional change: checking the bg error before PickMemtable(). This is to prevent a non-recovery flush from picking all memtables, which would mean a concurrent recovery flush cannot pick any memtable; the non-recovery flush would then roll back all of them due to the new change in FlushJob::Run(). I think the recovery thread can get stuck waiting for the flush to finish in this case.

Edit: we may drop flushes that do not have error recovery as their FlushReason when we fix the issue with atomic_flush=true later. But I'm including this change here so that this PR hopefully does not introduce a new stuck scenario.
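A standalone sketch of that gating, assuming the check happens before memtables are picked; the enum, struct, and function names are illustrative, not RocksDB's:

```
#include <iostream>

enum class FlushReason { kWriteBufferFull, kErrorRecovery };

struct BgState {
  bool bg_error_stops_work = false;  // a background error halted bg work
};

// Returns true if this flush should go on to pick memtables.
bool MayPickMemtables(const BgState& bg, FlushReason reason) {
  if (bg.bg_error_stops_work && reason != FlushReason::kErrorRecovery) {
    // Bail out early: picking (and later rolling back) all memtables here
    // would leave nothing for a concurrent recovery flush to pick and could
    // leave the recovery thread waiting on this flush.
    return false;
  }
  return true;
}

int main() {
  BgState bg{/*bg_error_stops_work=*/true};
  std::cout << MayPickMemtables(bg, FlushReason::kWriteBufferFull) << "\n";  // 0
  std::cout << MayPickMemtables(bg, FlushReason::kErrorRecovery) << "\n";    // 1
}
```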

@facebook-github-bot (Contributor): @cbi42 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor): @cbi42 has updated the pull request. You must reimport the pull request before landing.

@anand1976 (Contributor) left a comment:

LGTM

@facebook-github-bot (Contributor): @cbi42 merged this pull request in b927ba5.

cbi42 added a commit to cbi42/rocksdb that referenced this pull request Sep 25, 2023
Rollback other pending memtable flushes when a flush fails (#11865)

Summary: see the PR description above.

Pull Request resolved: facebook#11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d
@cbi42 cbi42 mentioned this pull request Sep 25, 2023
cbi42 added a commit that referenced this pull request Sep 25, 2023
* Rollback other pending memtable flushes when a flush fails (#11865)

Summary: see the PR description above.

Pull Request resolved: #11865

Test Plan: * Added repro unit tests.

Reviewed By: anand1976

Differential Revision: D49484922

Pulled By: cbi42

fbshipit-source-id: 25b536c08f4e02e7f1d0f86571663737d2b5d53d

* Fix a bug with atomic_flush that causes DB to get stuck after a flush failure (#11872)

Summary:
With atomic_flush=true, a flush job with younger memtables waits for older memtables to be installed before installing its own memtables. If the flush for the older memtables failed, auto-recovery starts a resume thread which can become stuck waiting for all background work to finish (including the flush for the younger memtables). If a non-recovery flush starts now and tries to flush, it can make the situation worse since it will fail due to the background error but never roll back its memtables: https://github.com/facebook/rocksdb/blob/269478ee4618283cd6d710fdfea9687157a259c1/db/db_impl/db_impl_compaction_flush.cc#L725 This prevents any future flush from picking up the old memtables.

A more detailed repro is in unit test.

This PR fixes this issue by:
1. Ensuring we roll back memtables if an atomic flush fails due to a background error
2. Aborting atomic flushes that are waiting for older memtables to be installed when there is a background error
3. Not scheduling non-recovery flushes when there is a background error that stops background work

There was another issue with atomic_flush=true where the DB can hang during DB close; see more in #11867. The fix in this PR, specifically fix 2 above, should be enough to resolve it too.
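As a rough illustration of fix 2, here is a condensed standalone model (not the RocksDB implementation) of a younger atomic flush that waits for older memtables but wakes up and aborts once a background error is set:

```
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex mu;
std::condition_variable cv;
bool older_installed = false;
bool bg_error = false;

// Returns true if installation may proceed, false if the flush must abort and
// roll back its memtables instead of waiting forever.
bool WaitForOlderOrAbort() {
  std::unique_lock<std::mutex> lk(mu);
  cv.wait(lk, [] { return older_installed || bg_error; });
  return older_installed && !bg_error;
}

int main() {
  std::thread younger_flush([] {
    std::cout << (WaitForOlderOrAbort() ? "install\n" : "abort and roll back\n");
  });
  {
    std::lock_guard<std::mutex> lk(mu);
    bg_error = true;  // the flush of the older memtables failed
  }
  cv.notify_all();
  younger_flush.join();
  return 0;
}
```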

Pull Request resolved: #11872

Test Plan: new unit test.

Reviewed By: jowlyzhang

Differential Revision: D49556867

Pulled By: cbi42

fbshipit-source-id: 4a0210ff28a8552a99ece7fbb0f574fd24b4da3f

* Only flush after recovery for retryable IOError (#11880)

Summary:
#11872 causes a unit test to start failing with the error message below. The cause is that the additional call to `FlushAllColumnFamilies()` in `DBImpl::ResumeImpl()` can run while the DB is closing. More detailed explanation: there are two places where we call `ResumeImpl()`:

1. in `ErrorHandler::RecoverFromBGError`, for manual resume or recovery from errors like OutOfSpace through sst file manager, and
2. in `ErrorHandler::RecoverFromRetryableBGIOError`, for error recovery from errors like a flush failure due to a retryable IOError. This is tracked by `ErrorHandler::recovery_thread_`.

Here is how DB close waits for error recovery: https://github.com/facebook/rocksdb/blob/49da91ec097b4efcd8a8e4dc1b287e9f81eb4093/db/db_impl/db_impl.cc#L540-L543

`CancelErrorRecovery()` waits until `recovery_thread_` finishes, and `IsRecoveryInProgress()` checks the `recovery_in_prog_` flag. The additional call to `FlushAllColumnFamilies()` in `ResumeImpl()` happens after it clears the bg error and the `recovery_in_prog_` flag: https://github.com/facebook/rocksdb/blob/49da91ec097b4efcd8a8e4dc1b287e9f81eb4093/db/db_impl/db_impl.cc#L436-L463. So if `ResumeImpl()` is called in `RecoverFromBGError()`, we can have a thread running `FlushAllColumnFamilies()` while the DB is closing, because DB close thinks recovery is already done.

The fix is to only make the additional call to `FlushAllColumnFamilies()` when doing error recovery through `ErrorHandler::RecoverFromRetryableBGIOError`, by setting flags in `DBRecoverContext`.
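A minimal standalone sketch of that gating, with `RecoverContext` and `flush_after_recovery` as illustrative stand-ins for `DBRecoverContext` and whatever the actual flag is named:

```
#include <iostream>

struct RecoverContext {
  // In this model, only the retryable-IOError recovery path sets this flag.
  bool flush_after_recovery = false;
};

void ResumeImpl(const RecoverContext& ctx) {
  // ... clear the bg error, mark recovery as done ...
  if (ctx.flush_after_recovery) {
    std::cout << "FlushAllColumnFamilies()\n";  // only on the tracked recovery thread
  } else {
    std::cout << "skip the extra flush\n";      // manual resume / SstFileManager recovery
  }
}

int main() {
  ResumeImpl(RecoverContext{/*flush_after_recovery=*/false});  // RecoverFromBGError
  ResumeImpl(RecoverContext{/*flush_after_recovery=*/true});   // RecoverFromRetryableBGIOError
}
```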

Pull Request resolved: #11880

Test Plan:
`gtest-parallel --repeat=100 --workers=4 ./error_handler_fs_test --gtest_filter="*AutoRecoverFlushError*"` reproduces the error pretty reliably.

```
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBErrorHandlingFSTest
[ RUN      ] DBErrorHandlingFSTest.AutoRecoverFlushError
error_handler_fs_test: db/column_family.cc:1618: rocksdb::ColumnFamilySet::~ColumnFamilySet(): Assertion `last_ref' failed.
Received signal 6 (Aborted)
...
#10 0x00007fac4409efd6 in __GI___assert_fail (assertion=0x7fac452c0afa "last_ref", file=0x7fac452c9fb5 "db/column_family.cc", line=1618, function=0x7fac452cb950 "rocksdb::ColumnFamilySet::~ColumnFamilySet()") at assert.c:101
101     in assert.c
#11 0x00007fac44b5324f in rocksdb::ColumnFamilySet::~ColumnFamilySet (this=0x7b5400000000) at db/column_family.cc:1618
1618        assert(last_ref);
#12 0x00007fac44e0f047 in std::default_delete<rocksdb::ColumnFamilySet>::operator() (this=0x7b5800000940, __ptr=0x7b5400000000) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85              delete __ptr;
#13 std::__uniq_ptr_impl<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182               _M_deleter()(__old_p);
#14 std::unique_ptr<rocksdb::ColumnFamilySet, std::default_delete<rocksdb::ColumnFamilySet> >::reset (this=0x7b5800000940, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456             _M_t.reset(std::move(__p));
#15 rocksdb::VersionSet::~VersionSet (this=this@entry=0x7b5800000900) at db/version_set.cc:5081
5081      column_family_set_.reset();
#16 0x00007fac44e0f97a in rocksdb::VersionSet::~VersionSet (this=0x7b5800000900) at db/version_set.cc:5078
5078    VersionSet::~VersionSet() {
#17 0x00007fac44bf0b2f in std::default_delete<rocksdb::VersionSet>::operator() (this=0x7b8c00000068, __ptr=0x7b5800000900) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
85              delete __ptr;
#18 std::__uniq_ptr_impl<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
182               _M_deleter()(__old_p);
#19 std::unique_ptr<rocksdb::VersionSet, std::default_delete<rocksdb::VersionSet> >::reset (this=0x7b8c00000068, __p=0x0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
456             _M_t.reset(std::move(__p));
#20 rocksdb::DBImpl::CloseHelper (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:676
676       versions_.reset();
#21 0x00007fac44bf1346 in rocksdb::DBImpl::CloseImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:720
720     Status DBImpl::CloseImpl() { return CloseHelper(); }
#22 rocksdb::DBImpl::~DBImpl (this=this@entry=0x7b8c00000000) at db/db_impl/db_impl.cc:738
738       closing_status_ = CloseImpl();
#23 0x00007fac44bf2bba in rocksdb::DBImpl::~DBImpl (this=0x7b8c00000000) at db/db_impl/db_impl.cc:722
722     DBImpl::~DBImpl() {
#24 0x00007fac455444d4 in rocksdb::DBTestBase::Close (this=this@entry=0x7b6c00000000) at db/db_test_util.cc:678
678       delete db_;
#25 0x00007fac455455fb in rocksdb::DBTestBase::TryReopen (this=this@entry=0x7b6c00000000, options=...) at db/db_test_util.cc:707
707       Close();
#26 0x00007fac45543459 in rocksdb::DBTestBase::Reopen (this=0x7ffed74b79a0, options=...) at db/db_test_util.cc:670
670       ASSERT_OK(TryReopen(options));
#27 0x00000000004f2522 in rocksdb::DBErrorHandlingFSTest_AutoRecoverFlushError_Test::TestBody (this=this@entry=0x7b6c00000000) at db/error_handler_fs_test.cc:1224
1224      Reopen(options);
```

Reviewed By: jowlyzhang

Differential Revision: D49579701

Pulled By: cbi42

fbshipit-source-id: 3fc8325e6dde7e7faa8bcad95060cb4e26eda638

* Update HISTORY.md and version.h for 8.6.6
rockeet pushed a commit to topling/toplingdb that referenced this pull request Dec 19, 2023
rockeet pushed a commit to topling/toplingdb that referenced this pull request Sep 1, 2024