
[WIP] Background error handling #573

Open · rohansuri wants to merge 4 commits into master from the background-error-handling branch
Conversation

@rohansuri (Contributor) commented Mar 16, 2020

I have started making some progress. I'll keep updating the plan in file 270.md and also leave TODO markers in the code where changes are required. Will start implementing once we've addressed all the comments and the plan looks good.


@petermattis (Collaborator) left a comment

Looks like a good start.

You can also grep for EventListener.BackgroundError which is called during failed compactions and flushes. Another thing to grep for is [Ll]ogger\.Fatalf(. The latter often indicate internal invariants being violated, though it looks like some of the Fatalfs are due to IO errors.



270.md, line 90 at r1 (raw file):

    #### Questions

    Q. How could the batch get corrupted? RocksDB encodes the batch itself. The

We've seen bugs that have caused either real or apparent batch corruption.


270.md, line 103 at r1 (raw file):

    Since the batch is incorrectly encoded and simply re-reading the WAL and
    constructing the memtable would again fail, we probably need to reject the
    write by removing the batch from the WAL manually.

This sounds difficult to do automatically and may violate application expectations. This might be what RocksDB refers to as an Unrecoverable error, requiring user intervention to fix.


270.md, line 266 at r1 (raw file):

# TODOs
Q. Only no space errors are recoverable. How is the recovery done?

I just looked at this. See util/sst_file_manager_impl.cc. When an out of space error is encountered, a thread is kicked off to run SstFileManagerImpl::ClearError. Looks like that thread loops checking free space every 5 seconds and tries to recover when the free space reaches some threshold.

@rohansuri (Contributor, Author) left a comment



270.md, line 90 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We've seen bugs that have caused either real or apparent batch corruption.

Interesting. Could you link any bugs you remember? Also, if there is a possibility of a mismatch between the encoding and decoding logic, should we also do bounds checking for each varint byte access and check for overflows in batchDecodeStr?

@petermattis (Collaborator) left a comment



270.md, line 90 at r1 (raw file):

Could you link any bugs you remember?

Here are two different issues that resulted in the same symptom:

#540
#36

Also, if there is a possibility of a mismatch between the encoding and decoding logic, should we also do bounds checking for each varint byte access and check for overflows in batchDecodeStr?

There is a trade-off here between safety and performance. I'd be worried that doing bounds checking for each varint byte access will negatively impact performance. The argument against doing so is that in-memory corruptions should be rare, and on-disk corruptions should be protected by the WAL and sstable checksums.
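To make the cost concrete, a bounds-checked decode might look roughly like this (illustrative only: this is not Pebble's actual batchDecodeStr, and the name batchDecodeStrChecked is made up):

```go
package sketch

import "encoding/binary"

// batchDecodeStrChecked is a hypothetical, bounds-checked variant of a
// varint-prefixed string decoder (not Pebble's actual batchDecodeStr). It
// validates the length prefix before slicing, trading two extra branches
// for protection against a corrupted length.
func batchDecodeStrChecked(data []byte) (remaining []byte, s []byte, ok bool) {
	v, n := binary.Uvarint(data)
	if n <= 0 {
		return nil, nil, false // malformed or truncated varint
	}
	data = data[n:]
	if v > uint64(len(data)) {
		return nil, nil, false // length prefix points past the end of the batch
	}
	return data[v:], data[:v], true
}
```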

@rohansuri (Contributor, Author) left a comment



270.md, line 90 at r1 (raw file):

Here are two different issues that resulted in the same symptom

Thanks for sharing them. I didn't think about the recovery code path that might apply incompletely written batches from WAL to memtable.

I'd be worried that doing bounds checking for each varint byte access will negatively impact performance.

Agreed; reading batches lies on the hot code path of user-issued writes.

@rohansuri (Contributor, Author) commented

I've added sections for kFlush and kWriteCallback. I'll finish the final section on kCompaction and then answer all TODOs and explore the SstFileManager you mentioned.

I've understood the crux of the task: we need to return contextual errors (capturing a code and subcode) from the deepest functions in the call stack (for example, the vfs layer, or any constraint checking done before the final steps of applying a version edit) and bubble these errors up to the main operation whose completion they affect. Finally, in the main operation a background error is set depending on the error's severity.
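Roughly, I'm imagining an error shape along these lines (a sketch only; the type and constant names are hypothetical, not what this branch implements):

```go
package sketch

// Severity mirrors RocksDB's notion of how bad a background error is.
type Severity int

const (
	SeverityNoError Severity = iota
	SeveritySoft    // e.g. out of space during a compaction
	SeverityHard    // e.g. out of space during a flush or WAL write
	SeverityFatal
	SeverityUnrecoverable
)

// BackgroundError is a hypothetical contextual error that records the
// operation it arose from and its severity, wrapping the root cause.
type BackgroundError struct {
	Op       string // e.g. "wal-write", "memtable-apply", "flush", "compaction"
	Severity Severity
	Err      error // the underlying error, e.g. from the vfs layer
}

func (e *BackgroundError) Error() string { return e.Op + ": " + e.Err.Error() }
func (e *BackgroundError) Unwrap() error { return e.Err }
```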

@rohansuri (Contributor, Author) commented

This is still WIP and not ready for review.

@rohansuri (Contributor, Author) commented

@petermattis I think it'd be easier to do this over multiple PRs, just like it was done in RocksDB, since the features are incremental in nature. Is that fine?

The current PR will simply place the DB in read-only mode upon an error in any user write or background write (flush, compaction). Doing this will not allow any further flushes, compactions, or user writes. There will be no auto recovery from out of space errors. The only recovery path in code is DB::Resume, which is callable by the user and allows resumption only from soft and fatal errors.
facebook/rocksdb#3997
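For this first PR, the shape I have in mind is roughly the following (a sketch only; the setBGError/isDBStopped names echo RocksDB's ErrorHandler, the rest is hypothetical):

```go
package sketch

import "sync"

// errorHandler sketches the state this PR would add to the DB: the first
// hard or fatal background error places the DB in read-only mode, and every
// subsequent user write, flush, or compaction checks it before proceeding.
type errorHandler struct {
	mu    sync.Mutex
	bgErr error // nil while the DB is healthy
}

// setBGError records err as the background error if none is set yet.
func (h *errorHandler) setBGError(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.bgErr == nil {
		h.bgErr = err
	}
}

// isDBStopped returns the background error that forces read-only mode, or
// nil if writes may proceed.
func (h *errorHandler) isDBStopped() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.bgErr
}
```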

In the next PR, we add auto recovery from out of space errors by bringing in the relevant parts of SstFileManager. This includes polling the filesystem for free space and checking whether it is over a particular threshold, upon which we unset the error. The threshold depends on the severity of the error.
facebook/rocksdb#4164
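A sketch of what that recovery loop could look like (the freeSpace and clearBGError hooks are placeholders rather than a real Pebble or vfs API; only the 5-second polling cadence comes from RocksDB):

```go
package sketch

import "time"

// recoverFromOutOfSpace sketches the auto-recovery loop: poll the
// filesystem's free space and clear the background error once it exceeds a
// severity-dependent threshold. freeSpace and clearBGError are placeholders
// for whatever the real implementation plugs in.
func recoverFromOutOfSpace(freeSpace func() (uint64, error), threshold uint64, clearBGError func()) {
	ticker := time.NewTicker(5 * time.Second) // RocksDB polls every 5 seconds
	defer ticker.Stop()
	for range ticker.C {
		avail, err := freeSpace()
		if err != nil {
			continue // couldn't stat the filesystem; try again on the next tick
		}
		if avail >= threshold {
			clearBGError()
			return
		}
	}
}
```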

Out of space errors while flushing or writing the WAL are considered hard errors. These place the DB in read-only mode. In this case the threshold for auto recovery is reserved_disk_buffer_, which is defined as ColumnFamilyOptions::write_buffer_size, i.e. the size of a memtable. For Pebble this would be DB.mu.mem.nextSize. The rationale is explained here. Although I wonder if we should choose a higher threshold, since the recovery involves a forced flush itself, and flushes take more disk space than just the memtable size because they also include meta blocks. Wdyt?

Out of space errors during compaction are considered soft errors. A soft error doesn't place the DB in read-only mode, i.e. user writes, flushes, and compactions are still allowed. I think we allow them because user-facing writes and flushes have higher priority than compactions: if you don't have enough space to run compactions (larger space requirements) but do have enough space to accept user writes and do flushes (smaller space requirements, like 64MB), then we at least do those rather than stopping everything. Note the subtlety here: if a compaction fails due to an out of space error, the immediate user writes and flushes will of course also fail, and we'll stop everything because those are hard errors. So it might seem there's no benefit in not stopping everything in the case of compactions. However, consider the case where we run out of space during a compaction and then get back some free space before any user writes or flushes happen. We could get back some space because the user deletes some files, or maybe our own deleteObsoleteFile clears up the intermediate files of the failed compaction. The bg error will still stay soft, and subsequent user writes and flushes will now be accepted.

The other important aspect in the case of a soft out of space error is to still allow compactions but throttle them. That is, only allow a compaction if it will not result in going out of space. To do this, we estimate the compaction output size as the input size (i.e. the worst case where no keys have overwrites or deletes) and check with the filesystem whether we have enough space. This is important to do; otherwise we'd start large compactions that fail partway through and then have to clean up. Left unchecked, this cycle might disrupt concurrent user writes and flushes, so throttling is a good idea.
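A minimal sketch of that check (the function and its inputs are hypothetical stand-ins):

```go
package sketch

// mayRunCompaction sketches the throttling check while a soft out of space
// error is set: estimate the compaction's output size as the sum of its
// input sizes (the worst case, where no keys have overwrites or deletes)
// and only start it if the filesystem reports at least that much free space.
func mayRunCompaction(inputSizes []uint64, freeSpace uint64) bool {
	var estimatedOutput uint64
	for _, sz := range inputSizes {
		estimatedOutput += sz
	}
	return freeSpace >= estimatedOutput
}
```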

What's left for me to understand is the threshold for when this soft error will be cleared so that we'll go back to full throttle compactions. I'll get back to understanding this when doing the second PR.

@petermattis (Collaborator) commented

@petermattis I think it'd be easier to do this over multiple PRs, just like it was done in RocksDB, since the features are incremental in nature. Is that fine?

Yes! I prefer multiple incremental PRs over one massive PR.

The current PR will simply place the DB in read-only mode upon an error in any user write or background write (flush, compaction). Doing this will not allow any further flushes, compactions, or user writes. There will be no auto recovery from out of space errors. The only recovery path in code is DB::Resume, which is callable by the user and allows resumption only from soft and fatal errors.
facebook/rocksdb#3997

SGTM

In the next PR, we add auto recovery from out of space errors by bringing in the relevant parts of SstFileManager. This includes polling the filesystem for free space and checking whether it is over a particular threshold, upon which we unset the error. The threshold depends on the severity of the error.
facebook/rocksdb#4164

Sounds good, though I'm not imagining exposing something like SstFileManager at this point. As long as you're only talking about bringing in portions of its functionality, but keeping it private to the implementation, then I'm good with this.

Out of space errors while flushing or writing the WAL are considered hard errors. These place the DB in read-only mode. In this case the threshold for auto recovery is reserved_disk_buffer_, which is defined as ColumnFamilyOptions::write_buffer_size, i.e. the size of a memtable. For Pebble this would be DB.mu.mem.nextSize. The rationale is explained here. Although I wonder if we should choose a higher threshold, since the recovery involves a forced flush itself, and flushes take more disk space than just the memtable size because they also include meta blocks. Wdyt?

I can see using a higher threshold. A memtable is typically 64MB or 128MB in size. That's a very small amount of disk. A DB will make very little progress if only that amount of disk space is freed.

The other important aspect in the case of a soft out of space error is to still allow compactions but throttle them. That is, only allow a compaction if it will not result in going out of space. To do this, we estimate the compaction output size as the input size (i.e. the worst case where no keys have overwrites or deletes) and check with the filesystem whether we have enough space. This is important to do; otherwise we'd start large compactions that fail partway through and then have to clean up. Left unchecked, this cycle might disrupt concurrent user writes and flushes, so throttling is a good idea.

Does RocksDB always check that the filesystem has sufficient room for the compaction, or does it only do so when it has encountered a soft out of space error? Estimating the compaction output size as the input size sounds good to start. It isn't always accurate and can possibly be improved in the future with knowledge of the point and range tombstones in the inputs.

@rohansuri (Contributor, Author) commented May 27, 2020

Sounds good, though I'm not imagining exposing something like SstFileManager at this point. As long as you're only talking about bringing in portions of its functionality, but keeping it private to the implementation, then I'm good with this.

Yup, I'll just bring in the required parts. Those don't require any settings from the user anyway, so we're good without exposing it. There is one feature that would require exposing it, if we want it now: RocksDB allows sharing the same SstFileManager instance across DB instances so that their auto recovery is done one by one. This removes the possibility of all the DBs individually checking for free space without knowing about each other, finding enough space and hence clearing their errors, and then erroring out again because they all begin writing together and none of them progresses. I think it is fine to leave that out for now and have the auto recovery logic run separately inside every DB?

I can see using a higher threshold. A memtable is typically 64MB or 128MB in size. That's a very small amount of disk. A DB will make very little progress if only that amount of disk space is freed.

Agreed. What would be a good heuristic for this?

Does RocksDB always check that the filesystem has sufficient room for the compaction, or does it only do so when it has encountered a soft out of space error?

Only when the DB has encountered a soft error specifically due to out of space.

@rohansuri (Contributor, Author) commented

You can also grep for EventListener.BackgroundError which is called during failed compactions and flushes. Another thing to grep for is [Ll]ogger.Fatalf(. The latter often indicate internal invariants being violated, though it looks like some of the Fatalfs are due to IO errors

Some of the fatal ones are due to internal programming errors/incorrect usage, for example not acquiring the lock on versionSet before calling logAndApply. Do we continue to panic on such errors, or do we place the DB in read-only mode? I think we should continue to panic for these kinds of errors, since their scope is more like that of C++ assert statements.

@rohansuri force-pushed the background-error-handling branch 4 times, most recently from 94444f7 to dc64e39 on June 30, 2020.
@rohansuri (Contributor, Author) commented Jul 1, 2020

@petermattis Do we want to mark invariant errors everywhere, or only where they occur in the code paths of the background operations (WAL write, memtable write, flush, compaction)? For example, versionSet.load is not called in any of the background operations, but could return errors due to violated invariants. We could mark them everywhere for consistency in the code, but error type checking would only be done on errors from background operations.
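For the marking itself, something along these lines could work (purely illustrative; the sentinel and helper are made up, not what the branch currently does):

```go
package sketch

import (
	"errors"
	"fmt"
)

// ErrInvariantViolation is a hypothetical sentinel marking errors that
// indicate a violated internal invariant rather than an I/O failure.
var ErrInvariantViolation = errors.New("pebble: internal invariant violated")

// markInvariant wraps err so that errors.Is(err, ErrInvariantViolation)
// reports true while preserving the original message.
func markInvariant(err error) error {
	return fmt.Errorf("%w: %v", ErrInvariantViolation, err)
}
```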

@rohansuri (Contributor, Author) commented

@petermattis I've been thinking about the error propagation semantics for the commit pipeline, since we no longer want to panic on error. This needs some thought because RocksDB's group commit approach differs from our commit pipeline, which will cause our error propagation to differ from RocksDB's, and hence we need to decide what the right behaviour is.

Let's take some examples where no batch requests a sync.

Example 1:
Five batches have their WAL writes succeed. All five are placed into the pipeline. The memtable apply for batch 2 fails but succeeds for all the others. Batch 1 is OK to be published. However, batch 2 and the subsequent batches must not be published, and batch 2's apply error must be propagated to those subsequent batches.

Example 2:
The WAL writes for batches 1, 2, and 3 succeed, but the WAL write for batch 4 fails, and batch 2's apply also fails. Batch 5 wouldn't be placed in the pipeline, since we'll check db.error_handler.IsDBStopped before placing it while holding the commit mutex. Batch 1 is OK to be published. Batches 2 and 3 will return with batch 2's apply error. Batch 4 will return with the WAL write error.

Example 3:
All five WAL writes succeed, but the applies for batches 2 and 4 fail. We propagate batch 2's apply error to batches 2, 3, 4, and 5.

Hence a memtable apply failure must be propagated to the subsequent batches in the pipeline so that none of them gets published. This error propagation must be done with the commit mutex held, to ensure no new batches are accepted into the pipeline.
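A rough sketch of that propagation step (the writer and pipeline types are hypothetical stand-ins, not Pebble's actual commitPipeline):

```go
package sketch

import "sync"

// writer is a hypothetical stand-in for a batch waiting in the commit pipeline.
type writer struct {
	applyErr error
	done     chan struct{} // closed once the batch's outcome is decided
}

// pipeline is a hypothetical stand-in for the commit pipeline's publish queue.
type pipeline struct {
	commitMu sync.Mutex
	pending  []*writer // batches in publish order
}

// failFrom propagates an apply error from the batch at index i to it and to
// every later batch in the pipeline, so that none of them is published. The
// commit mutex is held so that no new batch can enter behind the failure.
func (p *pipeline) failFrom(i int, applyErr error) {
	p.commitMu.Lock()
	defer p.commitMu.Unlock()
	for _, w := range p.pending[i:] {
		w.applyErr = applyErr
		close(w.done) // wake the waiting writer; it will return applyErr
	}
	p.pending = p.pending[:i]
}
```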

In cases where a batch requests a sync, it isn't possible to propagate the sync error to the subsequent non-sync batches in the pipeline, because publish doesn't wait for sync. By the time a batch gets back its sync error, the subsequent batches might already have been published and returned with no errors. But this is OK, since we allow it even in the current code, where a reader could read a subsequently published batch and only later would Pebble panic due to a previous batch's sync error. However, sync errors must be propagated to the subsequent sync batches in the pipeline, or else they'll keep waiting for the sync. This is already a TODO in flushLoop; I'll raise a PR for it.

In cases where a batch gets both a sync error and a memtable apply error (either its own apply fails or a previous batch's error is propagated to it), the writer will return the memtable apply error.

Do these semantics sound right? Am I missing any cases?

Just a use-case question: why does DB.Set take write options? Why isn't a DB-wide sync=true/false setting sufficient?

@rohansuri (Contributor, Author) commented

@petermattis any thoughts on the above comment?
