Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false #11991

jaykorean
Contributor

@jaykorean jaykorean commented Oct 20, 2023

Summary

@cbi42 helped with the investigation and found a potential scenario where RecoverFromRetryableBGIOError() may start with recovery_in_prog_ set to false (the same applies to other fields such as bg_error_ and soft_error_no_bg_work_).

Thread 1

  • StartRecoverFromRetryableBGIOError(): (mutex held) sets recovery_in_prog_ = true

Thread 1's recovery_thread_

  • (waits for mutex and acquires it)
  • RecoverFromRetryableBGIOError() -> ResumeImpl() -> ClearBGError(): sets recovery_in_prog_ = false
  • ClearBGError() -> NotifyOnErrorRecoveryEnd(): releases mutex

Thread 2

  • StartRecoverFromRetryableBGIOError(): (mutex held) sets recovery_in_prog_ = true
  • Waits for Thread 1 (recovery_thread_) to finish

Thread 1's recovery_thread_

  • Re-locks the mutex in NotifyOnErrorRecoveryEnd()
  • Still inside RecoverFromRetryableBGIOError(): sets recovery_in_prog_ = false
  • Done

Thread 2's recovery_thread_

  • The recovery thread starts with recovery_in_prog_ set to false
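
The interleaving above can be replayed deterministically in a single thread. The sketch below is a minimal illustration, not the RocksDB code: `MiniErrorHandler`, `FinishRecovery()`, `clear_twice` and `SecondRecoveryStartsInProgress()` are hypothetical names, and only the `recovery_in_prog_` flag is modeled.

```cpp
#include <cassert>

// Hypothetical, heavily simplified stand-in for the ErrorHandler state
// described in the summary. Only the flag involved in the race is modeled.
struct MiniErrorHandler {
  bool recovery_in_prog = false;
  bool clear_twice;  // true models the behavior before the fix

  explicit MiniErrorHandler(bool clear_twice_in) : clear_twice(clear_twice_in) {}

  // Models StartRecoverFromRetryableBGIOError(): runs with the mutex held.
  void StartRecovery() { recovery_in_prog = true; }

  // Models ClearBGError(): resets the flag; the mutex is then released
  // while listeners are notified.
  void ClearBGError() { recovery_in_prog = false; }

  // Models the tail of RecoverFromRetryableBGIOError() after ResumeImpl()
  // has returned OK. Before the fix, the flag was cleared a second time.
  void FinishRecovery() {
    if (clear_twice) {
      recovery_in_prog = false;
    }
  }
};

// Replays the steps from the summary in order and returns the value of
// recovery_in_prog that the second recovery thread would observe on start.
bool SecondRecoveryStartsInProgress(bool clear_twice) {
  MiniErrorHandler h(clear_twice);
  h.StartRecovery();   // Thread 1
  h.ClearBGError();    // Thread 1's recovery thread, mutex released afterwards
  h.StartRecovery();   // Thread 2 sets the flag, then waits for the old thread
  h.FinishRecovery();  // Thread 1's recovery thread re-locks and finishes
  return h.recovery_in_prog;
}
```

With `clear_twice` set to true the second recovery begins with the flag already cleared, which is the state the new assertion is meant to catch. With it set to false the flag set by Thread 2 survives.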

Fix

  • Remove the second clearing of bg_error_, recovery_in_prog_ and other fields in RecoverFromRetryableBGIOError() after ResumeImpl() has already returned OK(), since ClearBGError() has reset them by then.
  • Minor typo and linter fixes in DBErrorHandlingFSTest.

Test Plan

  • DBErrorHandlingFSTest::MultipleRecoveryThreads added to reproduce the scenario.
  • With assert(recovery_in_prog_); added at the start of ErrorHandler::RecoverFromRetryableBGIOError(), the test fails without the fix and passes with it, as expected.

@facebook-github-bot
Contributor

@jaykorean has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@jaykorean has updated the pull request. You must reimport the pull request before landing.

@jaykorean jaykorean changed the title Add recovery_in_prog_ Assertion in RecoverFromRetryableBGIOError Set recovery_in_prog_ after waiting for the other recovery thread Oct 20, 2023
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from db34caf to 76b9502 Compare October 20, 2023 18:05
@jaykorean jaykorean marked this pull request as draft October 20, 2023 18:05
@jaykorean jaykorean changed the title Set recovery_in_prog_ after waiting for the other recovery thread Fix for RecoverFromRetryableBGIOError starting with recovery_in_prog_ false Oct 20, 2023
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from 26e887e to 87fe776 Compare October 23, 2023 05:45
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from 87fe776 to 2de7d57 Compare October 23, 2023 06:01
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from 2de7d57 to dc65221 Compare October 23, 2023 16:46
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from dc65221 to eb852ca Compare October 23, 2023 17:41
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from eb852ca to ec02b59 Compare October 23, 2023 17:44
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from ec02b59 to 07ca6d6 Compare October 23, 2023 17:44
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from 2ccd98c to e0da375 Compare October 24, 2023 17:57
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from e0da375 to 5ab03c2 Compare October 24, 2023 18:38
@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from 5ab03c2 to 7b5ea4c Compare October 24, 2023 21:58
@jaykorean jaykorean requested review from hx235 and cbi42 October 30, 2023 18:05
@@ -761,22 +758,12 @@ void ErrorHandler::RecoverFromRetryableBGIOError() {
// recover from the retryable IO error and no other BG errors. Clean
// the bg_error and notify user.
TEST_SYNC_POINT("RecoverFromRetryableBGIOError:RecoverSuccess");
Status old_bg_error = bg_error_;
is_db_stopped_.store(false, std::memory_order_release);
Member

@cbi42 cbi42 Oct 31, 2023

May need to move the is_db_stopped_ reset into ClearBGError() too, since it isn't set there.

Contributor Author

Yeah, I guess this wasn't supposed to be missing in ClearBGError() in the first place. Thanks for catching this.
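
One way to read the suggestion above is that every field describing "the DB is in an error state" gets reset in a single place. The sketch below illustrates that idea under assumptions: `MiniState`, `SetError()` and `IsClean()` are hypothetical, an empty string stands in for an OK status, and the real ClearBGError() does more than this.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical, simplified model. Field names follow the discussion above,
// but this is not the RocksDB ErrorHandler class.
struct MiniState {
  std::string bg_error;  // empty string stands in for an OK status
  bool recovery_in_prog = false;
  bool soft_error_no_bg_work = false;
  std::atomic<bool> is_db_stopped{false};

  // Puts the model into an error state with recovery under way.
  void SetError(const std::string& msg) {
    bg_error = msg;
    recovery_in_prog = true;
    soft_error_no_bg_work = true;
    is_db_stopped.store(true, std::memory_order_release);
  }

  // Single place that resets all error-related fields, including
  // is_db_stopped, so callers do not need to clear anything afterwards.
  void ClearBGError() {
    bg_error.clear();
    recovery_in_prog = false;
    soft_error_no_bg_work = false;
    is_db_stopped.store(false, std::memory_order_release);
  }

  bool IsClean() const {
    return bg_error.empty() && !recovery_in_prog && !soft_error_no_bg_work &&
           !is_db_stopped.load(std::memory_order_acquire);
  }
};
```

Keeping the resets together means a caller never needs to clear individual fields after the fact, which is the pattern that made the double-clearing possible.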

@@ -689,6 +692,7 @@ const Status& ErrorHandler::StartRecoverFromRetryableBGIOError(
// Automatic recover from Retryable BG IO error. Must be called after db
// mutex is released.
void ErrorHandler::RecoverFromRetryableBGIOError() {
assert(recovery_in_prog_);
Contributor Author

@cbi42 do you think we want to assert other fields here as well?

Member

Not sure, it's hard to figure out all the invariants :)

@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from df81ab4 to 81e265f Compare October 31, 2023 16:29
Member

@cbi42 cbi42 left a comment

Went through the unit test and just have some minor comments.

fault_fs_->SetFilesystemActive(true);

// Set up sync point so that we can wait for the recovery thread to finish
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
Member

Can the second recovery thread finish before this dependency is set? Maybe we don't even have to wait until it's finished in this unit test.

Contributor Author

If we don't wait and the recovery thread is still alive, Close() crashes in std::terminate() (#12002 won't fix this case):

std::terminate() 
std::default_delete<std::thread>::operator()(std::thread*) const 
std::unique_ptr<std::thread, std::default_delete<std::thread>>::~unique_ptr()
rocksdb::ErrorHandler::~ErrorHandler() (rocksdb/db/error_handler.h:31)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBImpl::~DBImpl() (rocksdb/db/db_impl/db_impl.cc:725)
rocksdb::DBTestBase::Close() (rocksdb/db/db_test_util.cc:678)
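
The trace is consistent with a std::thread being destroyed while still joinable, which the C++ standard defines as a call to std::terminate(). The sketch below shows the general pattern of joining before the owning std::unique_ptr is reset. `RecoveryOwner` and its methods are hypothetical and only loosely mirror the ownership visible in the trace.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <thread>

// Hypothetical owner of a background thread held in a
// std::unique_ptr<std::thread>, as the stack trace suggests ErrorHandler does.
class RecoveryOwner {
 public:
  // Joins any previous thread first, then starts a new one.
  void Start() {
    Join();
    done_.store(false);
    thread_.reset(new std::thread([this] { done_.store(true); }));
  }

  // Waits for the thread to finish before releasing it. Destroying a
  // joinable std::thread would call std::terminate().
  void Join() {
    if (thread_ && thread_->joinable()) {
      thread_->join();
    }
    thread_.reset();
  }

  bool HasThread() const { return thread_ != nullptr; }
  bool Done() const { return done_.load(); }

  ~RecoveryOwner() { Join(); }

 private:
  std::atomic<bool> done_{false};
  std::unique_ptr<std::thread> thread_;
};
```

Waiting for the recovery thread in the test plays the same role as `Join()` here: the thread has finished before its owner is torn down.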

@jaykorean jaykorean requested a review from cbi42 October 31, 2023 18:10
Member

@cbi42 cbi42 left a comment

LGTM, thanks!

@jaykorean jaykorean force-pushed the add_assertion_to_RecoverFromRetryableBGIOError branch from d8a3c49 to 3ec955e Compare October 31, 2023 21:34
@facebook-github-bot
Contributor

@jaykorean merged this pull request in 04225a2.

@jaykorean jaykorean deleted the add_assertion_to_RecoverFromRetryableBGIOError branch November 1, 2023 03:16
4 participants