Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove corrupted WAL files in kPointRecoveryMode with avoid_flush_duing_recovery set true #9634

Closed

Conversation

akankshamahajan15
Copy link
Contributor

@akankshamahajan15 akankshamahajan15 commented Feb 24, 2022

Summary:

  1. In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
    flush the data from WAL to L0 for all column families if possible. As a
    result, not all column families can increase their log_numbers, and
    min_log_number_to_keep won't change.
  2. For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.

If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.

As a solution,

  1. the corrupted WALs whose numbers are larger than the
    corrupted wal and smaller than the new WAL will be moved to archive folder.
  2. Currently, RocksDB DB::Open() may creates and writes to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful

Test Plan: 1. Added new unit tests
2. make crast_test -j

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@riversand963
Copy link
Contributor

Thanks @akankshamahajan15 for the PR. Hopefully we can improve the recovery process after this series of PRs.

One thing that I should have emphasized: currently, RocksDB will write and persist new MANIFEST files even before recovery succeeds. See https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L611, https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1280, https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L523, https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L521, etc. (I didn't do a thorough search).

This is behavior is problematic. I think one important thing we want to achieve in this PR is to write new MANIFEST (call LogAndApply) when we are sure the db is consistent.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@akankshamahajan15
Copy link
Contributor Author

One thing that I should have emphasized: currently, RocksDB will write and persist new MANIFEST files even before recovery succeeds. See https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L611, https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L1280, https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L523, https://github.com/facebook/rocksdb/blob/main/db/db_impl/db_impl_open.cc#L521, etc. (I didn't do a thorough search).

Yes, you mentioned in the task, but that step wasn't clear to me at that time. I will update the code now to persist new MANIFEST only after the DB is consistent.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

db/corruption_test.cc Outdated Show resolved Hide resolved
db/corruption_test.cc Outdated Show resolved Hide resolved
db/corruption_test.cc Outdated Show resolved Hide resolved
db/corruption_test.cc Outdated Show resolved Hide resolved
db/corruption_test.cc Outdated Show resolved Hide resolved
db/corruption_test.cc Outdated Show resolved Hide resolved
db/db_impl/db_impl.h Outdated Show resolved Hide resolved
db/db_impl/db_impl.h Outdated Show resolved Hide resolved
db/db_impl/db_impl.h Outdated Show resolved Hide resolved
db/db_impl/db_impl.h Outdated Show resolved Hide resolved
db/db_impl/db_impl.h Outdated Show resolved Hide resolved
@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@akankshamahajan15
Copy link
Contributor Author

akankshamahajan15 commented Apr 11, 2022

Also, can you update HISTORY.md?

I updated the HISTORY.md. I added two separate fixes. One is moving corruption wals. Second is to create only one MANIFEST after db is recovered succesfully.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Copy link
Contributor

@akankshamahajan15 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@ajkr
Copy link
Contributor

ajkr commented Apr 26, 2022

@akankshamahajan15 Is this right?

the corrupted WALs whose numbers are larger than the
corrupted wal and smaller than the new WAL will be moved to archive folder.

In the following recovery we archived (126119, 126216].

2022/04/20-20:12:59.858251 556 [db/db_impl/db_impl_open.cc:1163] Point in time recovered to log #126119 seq #1362971066
2022/04/20-20:12:59.858951 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126125 mode 2
2022/04/20-20:13:02.307344 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126128 mode 2
2022/04/20-20:13:05.299616 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126137 mode 2
2022/04/20-20:13:06.112663 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126142 mode 2
2022/04/20-20:13:06.807948 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126145 mode 2
2022/04/20-20:13:06.922859 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126151 mode 2
2022/04/20-20:13:08.407391 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126154 mode 2
2022/04/20-20:13:12.479291 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126160 mode 2
2022/04/20-20:13:13.141196 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126165 mode 2
2022/04/20-20:13:13.689726 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126168 mode 2
2022/04/20-20:13:14.429917 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126170 mode 2
2022/04/20-20:13:14.791170 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126172 mode 2
2022/04/20-20:13:15.354656 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126178 mode 2
2022/04/20-20:13:17.160268 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126183 mode 2
2022/04/20-20:13:19.541990 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126193 mode 2
2022/04/20-20:13:20.461264 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126196 mode 2
2022/04/20-20:13:21.247930 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126202 mode 2
2022/04/20-20:13:23.510047 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126205 mode 2
2022/04/20-20:13:24.541666 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126211 mode 2
2022/04/20-20:13:26.070769 556 [db/db_impl/db_impl_open.cc:913] Recovering log #126216 mode 2

Then the following recovery processes 126119 followed by 126219, which doesn't have consecutive seqnos so recovery terminates at 126119:

2022/04/20-23:38:04.973200 548 [db/db_impl/db_impl_open.cc:1163] Point in time recovered to log #126119 seq #1362971066
2022/04/20-23:38:04.973920 548 [db/db_impl/db_impl_open.cc:913] Recovering log #126219 mode 2
2022/04/20-23:38:05.117620 548 [WARN] [db/db_impl/db_impl_open.cc:919] 126219.log: dropping 47392297 bytes
2022/04/20-23:38:05.118405 548 [ERROR] [db/db_impl/db_impl_open.cc:1217] Column family inconsistency: SST file contains data beyond the point of corruption.

126119 is too early -- I'd expect it to be earlier than the previous run, and in this case it is earlier than a flushed file too.

@ajkr
Copy link
Contributor

ajkr commented Apr 26, 2022

BTW, I still haven't figured out why we need to remove the corrupt WALs rather than skip over them during recovery, so any clarification on that would be appreciated.

@riversand963
Copy link
Contributor

BTW, I still haven't figured out why we need to remove the corrupt WALs rather than skip over them during recovery, so any clarification on that would be appreciated.

We can but we do not have to.

Thinking more about this, I think we do not have to remove the corrupted WALs as long as we persist the new MANIFEST after successfully syncing the new WAL.

If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point.
If future recovery starts from the old MANIFEST, it means the writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.

@riversand963
Copy link
Contributor

Wait, I remember something, though.
It seems we only sync the new WAL with empty sentinel batch during recovery. It seems possible for an unsynced WAL before the new WAL to become corrupted, without violating the contract between RocksDB and the FS.
Assume we have 10.log, 11.log and 12.log. None of them has been sync'ed, and all of them are higher than "min_log_number_to_keep".
In the first recovery, we stop at 11.log, but in second attempt, is it possible to find a corruption in 10.log which causes the sequence numbers of 10.log and 11.log not consecutive? The WALs are not sync'ed.

riversand963 added a commit to riversand963/rocksdb that referenced this pull request Apr 26, 2022
ajkr added a commit to ajkr/rocksdb that referenced this pull request Apr 26, 2022
facebook-github-bot pushed a commit that referenced this pull request Apr 26, 2022
Summary:
Left HISTORY.md and unit tests.
Added a new unit test to repro the corruption scenario that this PR fixes, and HISTORY.md line for that.

Pull Request resolved: #9906

Reviewed By: riversand963

Differential Revision: D35940093

Pulled By: ajkr

fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b
akankshamahajan15 pushed a commit that referenced this pull request Apr 26, 2022
Summary:
Left HISTORY.md and unit tests.
Added a new unit test to repro the corruption scenario that this PR fixes, and HISTORY.md line for that.

Pull Request resolved: #9906

Reviewed By: riversand963

Differential Revision: D35940093

Pulled By: ajkr

fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants