
Fix a race condition between recovery and backup #11955

Closed

Conversation


@jowlyzhang jowlyzhang commented Oct 13, 2023

A race condition between recovery and backup can happen with error messages like this:
Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory

PR #6949 introduced disabling file deletion during error handling of manifest IO errors. The aforementioned race condition is caused by this chain of events:

[Backup engine] disable file deletion
[Recovery] disable file deletion <= optional for the race condition; it may or may not get called
[Backup engine] get the list of files to copy/link
[Recovery] force enable file deletion
.... some files referred to by the backup engine get deleted
[Backup engine] copy/link files <= error: no file found

This PR fixes this with:

  1. The recovery thread currently force-enables file deletion whenever file deletion is disabled, regardless of whether the preceding error handling was for a manifest IO error and disabled it in the first place. This means it could incorrectly re-enable file deletions that other threads, like backup threads or file snapshotting threads, intended to keep disabled. This PR makes that check explicit before making the call.

  2. disable_delete_obsolete_files_ is designed as a counter so that different threads can disable and re-enable file deletion independently. The recovery thread currently does a force enable because ErrorHandler::SetBGError() can be called multiple times by different threads when they receive a manifest IO error (details in PR "First step towards handling MANIFEST write error" #6949), which causes DBImpl::DisableFileDeletions to be called multiple times too. Making a force enable call that resets the counter disable_delete_obsolete_files_ to zero is a workaround for this. However, as the race condition shows, it can incorrectly suppress another thread's (e.g. a backup thread's) intention to keep file deletion disabled. An earlier revision of this PR added a std::atomic<int> disable_file_deletion_count_ to the error handler to track the needed counter decrease precisely; the final version instead tracks and caps file deletion enabling/disabling inside the error handler (see the sketch after this list).

  3. For recovery, the section that finds obsolete files and purges them was moved to after the attempt to enable file deletion, since the actual finding and purging is more likely to be needed if file deletion was previously disabled and is re-enabled now. An internal function DBImpl::EnableFileDeletionsWithLock was added to support changes 2) and 3). Some logging was added explicitly to keep those useful log messages around.
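
To make the capped bookkeeping in item 2 concrete, here is a minimal, self-contained C++ sketch. It is not RocksDB's actual code: the real logic lives in the error handler and runs under the DB mutex, and every class and member name below is invented for illustration. The point it shows is that disabling at most once and re-enabling exactly once can never cancel out another thread's disable.

```cpp
#include <cassert>
#include <iostream>

// Invented stand-in for the DB-side reference count guarding file deletion.
class DBSketch {
 public:
  void DisableFileDeletions() { ++disable_count_; }
  // Decrement by one; deletions resume only when every disabler re-enabled.
  void EnableFileDeletions() {
    assert(disable_count_ > 0);
    --disable_count_;
  }
  bool FileDeletionAllowed() const { return disable_count_ == 0; }

 private:
  int disable_count_ = 0;
};

// Invented stand-in for the error handler's capped bookkeeping.
class ErrorHandlerSketch {
 public:
  explicit ErrorHandlerSketch(DBSketch* db) : db_(db) {}

  // SetBGError-style entry point: may be reached by several threads for the
  // same manifest IO error, but only the first call disables file deletions.
  void OnManifestIOError() {
    if (!disabled_by_error_handler_) {
      db_->DisableFileDeletions();
      disabled_by_error_handler_ = true;
    }
  }

  // ClearBGError-style entry point: undo exactly the one disable this handler
  // issued, leaving other threads' disables untouched.
  void OnErrorCleared() {
    if (disabled_by_error_handler_) {
      db_->EnableFileDeletions();
      disabled_by_error_handler_ = false;
    }
  }

 private:
  DBSketch* db_;
  bool disabled_by_error_handler_ = false;
};

int main() {
  DBSketch db;
  ErrorHandlerSketch handler(&db);

  db.DisableFileDeletions();    // backup engine wants files preserved
  handler.OnManifestIOError();  // thread A reports the manifest error
  handler.OnManifestIOError();  // thread B reports the same error: no-op
  handler.OnErrorCleared();     // recovery succeeds

  // The backup engine's disable is still in effect, unlike with a forced
  // reset of the counter to zero.
  std::cout << std::boolalpha << db.FileDeletionAllowed() << "\n";  // false
  return 0;
}
```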

Test plan:
existing unit tests

@facebook-github-bot

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang jowlyzhang requested a review from anand1976 October 13, 2023 23:40

@ajkr ajkr left a comment


Nice finding! I think ideally we wouldn't rely on disable_delete_obsolete_files_ for preserving SST files whose MANIFEST append/sync returned an error. In case of an error, we don't know the state of the MANIFEST - it might refer to the newly created files, in which case they're not really obsolete, or it might not.

Is there a way that file deletion can check with the ErrorHandler whether seemingly obsolete files are needed due to manifest recovery?

db/db_impl/db_impl_files.cc

```cpp
std::optional<int> remain_counter;
if (s.ok()) {
  assert(versions_->io_status().ok());
  int disable_file_deletion_count =
```
Contributor

Is there a need to track the number of times file deletion is disabled by the error handler? I wonder if it would be simpler to check and do it once in SetBGError if not done already, and reenable it in ClearBGError?

Contributor Author

That's a good point. The main motivation is for the recovery thread and other threads, like the backup thread, to express their need to disable file deletion on equal footing, so that the recovery thread cannot re-enable file deletion against the will of the backup engine thread.

Quoting the PR that added this file deletion disabling in the first place, #6949:
multiple threads can call LogAndApply() at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call ErrorHandler::SetBGError() in which file deletion will be disabled

So SetBGError can potentially be called multiple times, while ClearBGError should only be called once by the recovery thread. The original implementation works around this with an EnableFileDeletions(/*force=*/true) call. However, as this race condition shows, that can override another thread's attempt to keep file deletion disabled. So here we track this number more precisely to avoid that.

Contributor

Sure, I understand the problem. I'm just suggesting that EnableFileDeletions be called from ClearBGError, so the details are not exposed here to DBImpl.

Contributor Author

@jowlyzhang jowlyzhang Oct 16, 2023

Sorry, I missed an important detail in your original comment, which is to "do it once in SetBGError if not done already". A boolean flag to cap the counter at 1 per recovery makes sense to me and is more efficient. Moving EnableFileDeletions into the error handler so that the details are contained there is an improvement too. EnableFileDeletions will acquire the mutex while ClearBGError already holds the mutex, so we still need an EnableFileDeletionsWithLock version of the API; plus there is some useful DB info logging related to enabling file deletion to keep around, which ideally should be done outside of the mutex. For these reasons, I wonder if it's still worth keeping this in DBImpl. WDYT?
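
A hedged sketch of the "WithLock" split being discussed; the class and locking details below are simplified stand-ins, not RocksDB's actual DBImpl. The public API takes the mutex itself, while callers that already hold it (like the error handler inside ClearBGError) use the WithLock variant directly.

```cpp
#include <mutex>

class DBImplSketch {
 public:
  // Public API: acquires the DB mutex, then delegates.
  void EnableFileDeletions() {
    std::lock_guard<std::mutex> lock(mutex_);
    EnableFileDeletionsWithLock();
  }

  // Internal variant: caller must already hold mutex_ (for example, the
  // error handler while clearing a background error).
  void EnableFileDeletionsWithLock() {
    if (disable_count_ > 0) {
      --disable_count_;
    }
    // Info logging about the new state would ideally happen after the mutex
    // is released, which is part of the trade-off discussed above.
  }

 private:
  std::mutex mutex_;
  int disable_count_ = 0;
};
```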

Contributor

We do log info in SetBGError and other places in error_handler.cc while holding the mutex. A retryable error and recovery from it is relatively rare, so should be ok I think. I agree with keeping EnableFileDeletionsWithLock.

Contributor Author

Thank you for the thorough check. This makes sense. I have updated the PR with these improvement suggestions.

@jowlyzhang
Contributor Author

Nice finding! I think ideally we wouldn't rely on disable_delete_obsolete_files_ for preserving SST files whose MANIFEST append/sync returned an error. In case of an error, we don't know the state of the MANIFEST - it might refer to the newly created files, in which case they're not really obsolete, or it might not.

Is there a way that file deletion can check with the ErrorHandler whether seemingly obsolete files are needed due to manifest recovery?

Thank you for the proposal. This is a more efficient and targeted way to handle the original issue. You are right that disabling file deletion could be overkill. I will look into this and some other things in the obsolete file deletion procedure. I also saw a few "Deleting non-existing..." type warning logs while checking this; I will try to fix them too.

@jowlyzhang jowlyzhang force-pushed the race_backup_recovery branch from 2f3c261 to 750f61c Compare October 16, 2023 17:48
@facebook-github-bot

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@jowlyzhang jowlyzhang force-pushed the race_backup_recovery branch from 750f61c to cb74fa0 Compare October 16, 2023 22:40
@facebook-github-bot

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@jowlyzhang jowlyzhang force-pushed the race_backup_recovery branch from cb74fa0 to 89569b0 Compare October 16, 2023 22:45
@facebook-github-bot

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.


@anand1976 anand1976 left a comment


LGTM. Thanks for addressing all the comments!

@facebook-github-bot

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang jowlyzhang force-pushed the race_backup_recovery branch from 89569b0 to 5e45142 Compare October 17, 2023 18:46
@facebook-github-bot

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jowlyzhang
Contributor Author

LGTM. Thanks for addressing all the comments!

Thank you for the detailed review and improvement suggestions!

@facebook-github-bot

@jowlyzhang merged this pull request in 933ee29.

@jowlyzhang jowlyzhang deleted the race_backup_recovery branch October 17, 2023 21:45
@ajkr
Contributor

ajkr commented Oct 18, 2023

Nice finding! I think ideally we wouldn't rely on disable_delete_obsolete_files_ for preserving SST files whose MANIFEST append/sync returned an error. In case of an error, we don't know the state of the MANIFEST - it might refer to the newly created files, in which case they're not really obsolete, or it might not.

Is there a way that file deletion can check with the ErrorHandler whether seemingly obsolete files are needed due to manifest recovery?

By the way, I didn't fully explain the thought process here. There is one more step needed to show the potential problem: a user may call EnableFileDeletions(), which by default sets force=true, and interfere with recovery. It's actually a pretty commonly used API; for example, a physical shard transfer involves disabling file deletions, getting the live file listing, copying the files, and then re-enabling file deletions. If people do that without thinking about potential auto-recovery happening in the background (because how would they know :) ), the file deletion re-enable step can cause necessary files to be deleted, which in turn causes crash recovery to fail.
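
As a rough illustration of the shard-transfer flow mentioned above, here is a sketch written against the public RocksDB C++ API as of roughly this PR's era; exact signatures (notably EnableFileDeletions' force parameter) have changed across versions, and the copy step is left as a placeholder, so treat this as a sketch rather than reference code.

```cpp
#include <cstdint>
#include <string>
#include <vector>
#include "rocksdb/db.h"

rocksdb::Status CopyShardFiles(rocksdb::DB* db) {
  // 1) Pin the current set of files by disabling deletions.
  rocksdb::Status s = db->DisableFileDeletions();
  if (!s.ok()) return s;

  // 2) List the live files (paths are relative to the DB directory).
  std::vector<std::string> files;
  uint64_t manifest_size = 0;
  s = db->GetLiveFiles(files, &manifest_size, /*flush_memtable=*/true);

  if (s.ok()) {
    for (const std::string& f : files) {
      // ... copy or hard-link `f` (relative to db->GetName()) to the
      // destination; omitted here ...
      (void)f;
    }
  }

  // 3) Re-enable deletions. Passing force=true here is exactly what can
  //    stomp on a concurrent auto-recovery's (or another caller's) disable.
  db->EnableFileDeletions(/*force=*/false);
  return s;
}
```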

@jowlyzhang
Contributor Author

jowlyzhang commented Oct 18, 2023

Nice finding! I think ideally we wouldn't rely on disable_delete_obsolete_files_ for preserving SST files whose MANIFEST append/sync returned an error. In case of an error, we don't know the state of the MANIFEST - it might refer to the newly created files, in which case they're not really obsolete, or it might not.
Is there a way that file deletion can check with the ErrorHandler whether seemingly obsolete files are needed due to manifest recovery?

By the way, I didn't fully explain the thought process here. There is one more step needed to show the potential problem: a user may call EnableFileDeletions(), which by default sets force=true, and interfere with recovery. It's actually a pretty commonly used API; for example, a physical shard transfer involves disabling file deletions, getting the live file listing, copying the files, and then re-enabling file deletions. If people do that without thinking about potential auto-recovery happening in the background (because how would they know :) ), the file deletion re-enable step can cause necessary files to be deleted, which in turn causes crash recovery to fail.

Thank you for the example of how this can lead to a failure mode, not just inefficiency. This made me wonder, though: is there a strong reason for the EnableFileDeletions() API to have a force mode, let alone for it to default to true? Assuming everyone has a good reason to keep file deletion disabled, the rationale would be: keep file deletion disabled as long as someone wants it disabled, and only enable it when everyone agrees to re-enable it. Not being able to re-enable file deletion immediately (i.e., without the force mode) would only result in inefficiency, not a correctness issue, right? And assuming every caller handles the re-enable in time, the inefficiency won't be a concern either. What do you think about dropping the force mode from the API, or at least not defaulting to force?
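
To spell out the difference being debated, here is a small standalone sketch using an invented counter type (not RocksDB's DBImpl): with force, one caller's enable wipes out every other caller's pending disable; without it, deletions resume only after the last disabler re-enables.

```cpp
#include <algorithm>

struct FileDeletionGuardSketch {
  int disable_count = 0;

  void Disable() { ++disable_count; }

  // Mirrors the EnableFileDeletions(force) shape discussed above.
  void Enable(bool force) {
    if (force) {
      disable_count = 0;  // everyone else's disable is wiped out
    } else {
      disable_count = std::max(0, disable_count - 1);
    }
  }

  bool DeletionAllowed() const { return disable_count == 0; }
};
```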

@ajkr
Contributor

ajkr commented Oct 18, 2023

Sure. Ideally we would change it in a way that breaks noticeably for force==true users and works as usual for force==false users. Also good would be to break it noticeably for both.

Still, I wonder if FindObsoleteFiles() is considering these files obsolete too soon.

ajkr pushed a commit that referenced this pull request Oct 19, 2023
Summary:
A race condition between recovery and backup can happen with error messages like this:
```Failure in BackupEngine::CreateNewBackup with: IO error: No such file or directory: While opening a file for sequentially reading: /dev/shm/rocksdb_test/rocksdb_crashtest_whitebox/002653.log: No such file or directory```

PR #6949 introduced disabling file deletion during error handling of manifest IO errors. The aforementioned race condition is caused by this chain of events:

[Backup engine] disable file deletion
[Recovery] disable file deletion <= optional for the race condition; it may or may not get called
[Backup engine] get the list of files to copy/link
[Recovery] force enable file deletion
.... some files referred to by the backup engine get deleted
[Backup engine] copy/link files <= error: no file found

This PR fixes this with:
1) The recovery thread currently force-enables file deletion whenever file deletion is disabled, regardless of whether the preceding error handling was for a manifest IO error and disabled it in the first place. This means it could incorrectly re-enable file deletions that other threads, like backup threads or file snapshotting threads, intended to keep disabled. This PR makes that check explicit before making the call.

2) `disable_delete_obsolete_files_` is designed as a counter so that different threads can disable and re-enable file deletion independently. The recovery thread currently does a force enable because `ErrorHandler::SetBGError()` can be called multiple times by different threads when they receive a manifest IO error (details per PR #6949), which causes `DBImpl::DisableFileDeletions` to be called multiple times too. Making a force enable call that resets the counter `disable_delete_obsolete_files_` to zero is a workaround for this. However, as the race condition shows, it can incorrectly suppress another thread's (e.g. a backup thread's) intention to keep file deletion disabled. ~~This PR adds a `std::atomic<int> disable_file_deletion_count_` to the error handler to track the needed counter decrease more precisely.~~ This PR tracks and caps file deletion enabling/disabling in the error handler.

3) For recovery, the section that finds obsolete files and purges them was moved to after the attempt to enable file deletion, since the actual finding and purging is more likely to be needed if file deletion was previously disabled and is re-enabled now. An internal function `DBImpl::EnableFileDeletionsWithLock` was added to support changes 2) and 3). Some logging was added explicitly to keep those useful log messages around.

Pull Request resolved: #11955

Test Plan: existing unit tests

Reviewed By: anand1976

Differential Revision: D50290592

Pulled By: jowlyzhang

fbshipit-source-id: 73aa8331ca4d636955a5b0324b1e104a26e00c9b
@jowlyzhang
Contributor Author

Sure. Ideally we would change it in a way that breaks noticeably for force==true users and works as usual for force==false users. Also good would be to break it noticeably for both.

Still, I wonder if FindObsoleteFiles() is considering these files obsolete too soon.

Thank you for helping check this. It's a good idea to augment FindObsoleteFiles to account for this scenario. I'm thinking of building some context in the error handler for FindObsoleteFiles to consult about these to-be-quarantined files, on top of the minimum pending output file number mechanism. You may already be familiar with what this file number does; let me describe what I mean by walking through it. Currently, when a background flush/compaction job starts and ends, it reserves and releases a minimum file number for its pending output. FindObsoleteFiles respects the registered live files in the VersionSet and refrains from deleting any files with a file number greater than or equal to this minimum file number. At the end of the background job, releasing this file number happens regardless of its success. The idea is that if the job succeeds, the temp files will now be registered in the VersionSet's live files, so they won't get deleted; if it fails, the temp files are garbage to be collected and should be deleted now.

In this manifest IO error case, the job fails, but we want to postpone the deletion of the temp files until after recovery succeeds. One idea would be: as we release this job's file number from the pending output file pool, we also add this file number to the error handler as the minimum file number of files to quarantine. FindObsoleteFiles would consult this file number too when finding files to delete, and the error handler would be responsible for clearing this file number when recovery succeeds. When multiple background jobs all try to set this quarantine file number, the error handler just keeps the minimum of them all. Let me know if this idea sounds feasible or if you have any concerns. Thanks for helping me check this!
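
A rough, self-contained sketch of the quarantine idea described above; the type and member names below (min_quarantined_file_number and so on) are hypothetical, invented for illustration, and not part of RocksDB.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

constexpr uint64_t kNoQuarantine = std::numeric_limits<uint64_t>::max();

struct ErrorHandlerQuarantineSketch {
  uint64_t min_quarantined_file_number = kNoQuarantine;

  // Called by a background job that failed with a manifest IO error, as it
  // releases its pending-output file number.
  void QuarantineFrom(uint64_t file_number) {
    min_quarantined_file_number =
        std::min(min_quarantined_file_number, file_number);
  }

  // Called once recovery succeeds.
  void ClearQuarantine() { min_quarantined_file_number = kNoQuarantine; }
};

// Sketch of the extra check FindObsoleteFiles would apply on top of the
// existing min-pending-output / live-file checks.
std::vector<uint64_t> FilterDeletable(
    const std::vector<uint64_t>& candidates,
    const ErrorHandlerQuarantineSketch& handler) {
  std::vector<uint64_t> deletable;
  for (uint64_t fn : candidates) {
    if (fn < handler.min_quarantined_file_number) {
      deletable.push_back(fn);
    }
  }
  return deletable;
}
```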

@ajkr
Contributor

ajkr commented Oct 19, 2023

In this manifest IO error case, the job fails, but we want to postpone the deletion of the temp files until after recovery succeeds. One idea would be: as we release this job's file number from the pending output file pool, we also add this file number to the error handler as the minimum file number of files to quarantine. FindObsoleteFiles would consult this file number too when finding files to delete, and the error handler would be responsible for clearing this file number when recovery succeeds. When multiple background jobs all try to set this quarantine file number, the error handler just keeps the minimum of them all. Let me know if this idea sounds feasible or if you have any concerns. Thanks for helping me check this!

Yes this sounds great. My only concern is space usage growing if flush/compaction happens while error recovery is ongoing. IIRC we disable background work during a hard error like MANIFEST failure. Is that right, and do we also avoid triggering auto-recovery flushes?

@jowlyzhang
Contributor Author

jowlyzhang commented Oct 19, 2023

IIRC we disable background work during a hard error like MANIFEST failure. Is that right, and do we also avoid triggering auto-recovery flushes?

These are great questions, thank you for this perspective! I checked the error handling logic more. It's true that all MANIFEST errors are initially marked as either hard or fatal errors that will stop non-recovery background work. There is a special case for handling the no-WAL scenario that will mark it as a soft error (edited to add this important detail: marking it as a soft error will also set the flag soft_error_no_bg_work_ to true, so non-recovery background work still cannot run) and attempt to trigger a recovery flush:

rocksdb/db/error_handler.cc, lines 477 to 492 in 2e514e4:

```cpp
  } else if (BackgroundErrorReason::kFlushNoWAL == reason ||
             BackgroundErrorReason::kManifestWriteNoWAL == reason) {
    // When the BG Retryable IO error reason is flush without WAL,
    // We map it to a soft error. At the same time, all the background work
    // should be stopped except the BG work from recovery. Therefore, we
    // set the soft_error_no_bg_work_ to true. At the same time, since DB
    // continues to receive writes when BG error is soft error, to avoid
    // to many small memtable being generated during auto resume, the flush
    // reason is set to kErrorRecoveryRetryFlush.
    Status bg_err(new_bg_io_err, Status::Severity::kSoftError);
    CheckAndSetRecoveryAndBGError(bg_err);
    soft_error_no_bg_work_ = true;
    context.flush_reason = FlushReason::kErrorRecoveryRetryFlush;
    recover_context_ = context;
    return StartRecoverFromRetryableBGIOError(bg_io_err);
  } else {
```

But it seems to me that any recovery flush effort, either from a manual DB::Resume or from auto recovery, only actually triggers a flush when recovery succeeds:

```cpp
if (s.ok()) {
  if (context.flush_reason == FlushReason::kErrorRecoveryRetryFlush) {
    s = RetryFlushesForErrorRecovery(FlushReason::kErrorRecoveryRetryFlush,
                                     true /* wait */);
  } else {
    // We cannot guarantee consistency of the WAL. So force flush Memtables of
    // all the column families
    FlushOptions flush_opts;
    // We allow flush to stall write since we are trying to resume from error.
    flush_opts.allow_write_stall = true;
    s = FlushAllColumnFamilies(flush_opts, context.flush_reason);
  }
  if (!s.ok()) {
    ROCKS_LOG_INFO(immutable_db_options_.info_log,
                   "DB resume requested but failed due to Flush failure [%s]",
                   s.ToString().c_str());
  }
}
```

At that time, the error handler should have cleaned up the quarantined file set, or we need to make sure it does. So recovery flush shouldn't be a concern?

@ajkr
Contributor

ajkr commented Oct 23, 2023

Sure, a quarantined file set makes sense, and in that case recovery flush is not a concern. I was thinking that if we prevented file deletion more broadly during recovery, then failed recovery flushes could cause dead SST files to accumulate. But I guess your implementation approach rules that out.

facebook-github-bot pushed a commit that referenced this pull request Oct 25, 2023
…es (#11955) (#11979)

Summary:
With a fragmented record spanning multiple blocks, if any following block is corrupted with arbitrary data and the interpreted log number is less than the current log number, the program will fall into an infinite loop due to not skipping the buffer's leading bytes.

Pull Request resolved: #11979

Test Plan: existing unit tests

Reviewed By: ajkr

Differential Revision: D50604408

Pulled By: jowlyzhang

fbshipit-source-id: e50a0c7e7c3d293fb9d5afec0a3eb4a1835b7a3b
facebook-github-bot pushed a commit that referenced this pull request Dec 20, 2024
…3234)

Summary:
`DBErrorHandlingFSTest.AtomicFlushNoSpaceError` is flaky due to seg fault during error recovery:
```
...
frame #5: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>*, std::vector<rocksdb::ObsoleteBlobFileInfo, std::allocator<rocksdb::ObsoleteBlobFileInfo>>*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>>>*, unsigned long) [inlined] std::vector<rocksdb::ObsoleteFileInfo, std::allocator<rocksdb::ObsoleteFileInfo>>::begin(this=<unavailable>) at stl_vector.h:812:16
frame #6: 0x00007f0b3ea0a9d6 librocksdb.so.9.10`rocksdb::VersionSet::GetObsoleteFiles(this=0x0000000000000000, files=size=0, blob_files=size=0, manifest_filenames=size=0, min_pending_output=18446744073709551615) at version_set.cc:7258:18
frame #7: 0x00007f0b3e8ccbc0 librocksdb.so.9.10`rocksdb::DBImpl::FindObsoleteFiles(this=<unavailable>, job_context=<unavailable>, force=<unavailable>, no_full_scan=<unavailable>) at db_impl_files.cc:162:30
frame #8: 0x00007f0b3e85e698 librocksdb.so.9.10`rocksdb::DBImpl::ResumeImpl(this=<unavailable>, context=<unavailable>) at db_impl.cc:434:20
frame #9: 0x00007f0b3e921516 librocksdb.so.9.10`rocksdb::ErrorHandler::RecoverFromBGError(this=<unavailable>, is_manual=<unavailable>) at error_handler.cc:632:46
```

I suspect this is due to DB being destructed and reopened during recovery. Specifically, the [ClearBGError() call](https://github.com/facebook/rocksdb/blob/c72e79a262bf696faf5f8becabf92374fc14b464/db/db_impl/db_impl.cc#L425) can release and reacquire mutex, and DB can be closed during this time. So it's not safe to access DB state after ClearBGError(). There was a similar story in #9496. [Moving the obsolete files logic after ClearBGError()](#11955) probably makes the seg fault more easily triggered.

This PR updates `ClearBGError()` to guarantee that db close cannot finish until the method is returned and the mutex is released. So that we can safely access DB state after calling it.

Pull Request resolved: #13234

Test Plan: I could not trigger the seg fault locally, will just monitor future test failures.

Reviewed By: jowlyzhang

Differential Revision: D67476836

Pulled By: cbi42

fbshipit-source-id: dfb3e9ccd4eb3d43fc596ec10e4052861eeec002