-
Notifications
You must be signed in to change notification settings - Fork 6.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false #11991
Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false #11991
Conversation
@jaykorean has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
@jaykorean has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
db34caf
to
76b9502
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
26e887e
to
87fe776
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
87fe776
to
2de7d57
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
2de7d57
to
dc65221
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
dc65221
to
eb852ca
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
eb852ca
to
ec02b59
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
ec02b59
to
07ca6d6
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
2 similar comments
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
2ccd98c
to
e0da375
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
e0da375
to
5ab03c2
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
5ab03c2
to
7b5ea4c
Compare
@@ -761,22 +758,12 @@ void ErrorHandler::RecoverFromRetryableBGIOError() { | |||
// recover from the retryable IO error and no other BG errors. Clean | |||
// the bg_error and notify user. | |||
TEST_SYNC_POINT("RecoverFromRetryableBGIOError:RecoverSuccess"); | |||
Status old_bg_error = bg_error_; | |||
is_db_stopped_.store(false, std::memory_order_release); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May need to move is_db_stopped_
to ClearBGError() too since it's not set in ClearBGError().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I guess this wasn't supposed to be missing in ClearBGError()
in the first place. Thanks for catching this.
@@ -689,6 +692,7 @@ const Status& ErrorHandler::StartRecoverFromRetryableBGIOError( | |||
// Automatic recover from Retryable BG IO error. Must be called after db | |||
// mutex is released. | |||
void ErrorHandler::RecoverFromRetryableBGIOError() { | |||
assert(recovery_in_prog_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cbi42 do you think we want to assert other fields here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, it's hard to figure out all the invariants :)
df81ab4
to
81e265f
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through the unit test and just have some minor comments.
fault_fs_->SetFilesystemActive(true); | ||
|
||
// Set up sync point so that we can wait for the recovery thread to finish | ||
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the second recovery thread finish before this dependency is set? Maybe we don't even have to wait until it's finished in this unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't wait and recovery thread is still alive, Close()
throws. (#12002 won't fix this case)
std::terminate()
std::default_delete<std::thread>::operator()(std::thread*) const
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
@@ -689,6 +692,7 @@ const Status& ErrorHandler::StartRecoverFromRetryableBGIOError( | |||
// Automatic recover from Retryable BG IO error. Must be called after db | |||
// mutex is released. | |||
void ErrorHandler::RecoverFromRetryableBGIOError() { | |||
assert(recovery_in_prog_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure, it's hard to figure out all the invariants :)
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
d8a3c49
to
3ec955e
Compare
@jaykorean has updated the pull request. You must reimport the pull request before landing. |
@jaykorean has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator. |
@jaykorean merged this pull request in 04225a2. |
Summary
@cbi42 helped investigation and found a potential scenario where
RecoverFromRetryableBGIOError()
may start withrecovery_in_prog_
set as false. (and other booleans likebg_error_
andsoft_error_no_bg_work_
)Thread 1
StartRecoverFromRetryableBGIOError()
): (mutex held) setsrecovery_in_prog_ = true
Thread 1's
recovery_thread_
RecoverFromRetryableBGIOError()
->ResumeImpl()
->ClearBGError()
: setsrecovery_in_prog_ = false
ClearBGError()
->NotifyOnErrorRecoveryEnd()
: releasesmutex
Thread 2
StartRecoverFromRetryableBGIOError()
): (mutex held) setsrecovery_in_prog_ = true
recovery_thread_
) to finishThread 1's
recovery_thread_
NotifyOnErrorRecoveryEnd()
RecoverFromRetryableBGIOError()
: setsrecovery_in_prog_ = false
Thread 2's
recovery_thread_
recovery_in_prog_
set asfalse
Fix
bg_error_
,recovery_in_prog_
and other fields afterResumeImpl()
already returnedOK()
.DBErrorHandlingFSTest
Test Plan
DBErrorHandlingFSTest::MultipleRecoveryThreads
added to reproduce the scenario.assert(recovery_in_prog_);
at the start ofErrorHandler::RecoverFromRetryableBGIOError()
fails the test without the fix and succeeds with the fix as expected.