
[WIP] Background error handling #573

Open · rohansuri wants to merge 4 commits into master from the background-error-handling branch
Conversation

@rohansuri (Contributor) commented Mar 16, 2020

I have started making some progress. I'll keep updating the plan in file 270.md and also leave TODO markers in the code where changes are required. Will start implementing once we've addressed all the comments and the plan looks good.


@petermattis (Collaborator) left a comment

Looks like a good start.

You can also grep for EventListener.BackgroundError which is called during failed compactions and flushes. Another thing to grep for is [Ll]ogger\.Fatalf(. The latter often indicate internal invariants being violated, though it looks like some of the Fatalfs are due to IO errors.



270.md, line 90 at r1 (raw file):

    #### Questions

    Q. How could the batch get corrupted? RocksDB encodes the batch itself. The

We've seen bugs that have caused either real or apparent batch corruption.


270.md, line 103 at r1 (raw file):

    Since the batch is incorrectly encoded and simply re-reading the WAL and
    constructing the memtable would again fail, we probably need to reject the
    write by removing the batch from the WAL manually.

This sounds difficult to do automatically and may violate application expectations. This might be what RocksDB refers to as an Unrecoverable error, requiring user intervention to fix.


270.md, line 266 at r1 (raw file):

# TODOs
Q. Only no space errors are recoverable. How is the recovery done?

I just looked at this. See util/sst_file_manager_impl.cc. When an out of space error is encountered, a thread is kicked off to run SstFileManagerImpl::ClearError. Looks like that thread loops checking free space every 5 seconds and tries to recover when the free space reaches some threshold.

@rohansuri (Contributor, Author) left a comment



270.md, line 90 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

We've seen bugs that have caused either real or apparent batch corruption.

Interesting. Could you link any bugs you remember? Also, if there is a possibility of a mismatch between the encoding and decoding logic, should we also do bounds checking for each varint byte access and check for overflows in batchDecodeStr?

@petermattis (Collaborator) left a comment



270.md, line 90 at r1 (raw file):

Could you link any bugs you remember?

Here are two different issues that resulted in the same symptom:

#540
#36

Also, if there is a possibility of a mismatch between the encoding and decoding logic, should we also do bounds checking for each varint byte access and check for overflows in batchDecodeStr?

There is a trade-off here between safety and performance. I'd be worried that doing bounds checking for each varint byte access will negatively impact performance. The argument against doing so is that in-memory corruptions should be rare, and on-disk corruptions should be protected by the WAL and sstable checksums.
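To make the cost concrete, a bounds-checked decode might look roughly like this (illustrative only: this is not Pebble's actual batchDecodeStr, and the name batchDecodeStrChecked is made up):

```go
package sketch

import "encoding/binary"

// batchDecodeStrChecked is a hypothetical, bounds-checked variant of a
// varint-prefixed string decoder (not Pebble's actual batchDecodeStr). It
// validates the length prefix before slicing, trading two extra branches
// for protection against a corrupted length.
func batchDecodeStrChecked(data []byte) (remaining []byte, s []byte, ok bool) {
	v, n := binary.Uvarint(data)
	if n <= 0 {
		return nil, nil, false // malformed or truncated varint
	}
	data = data[n:]
	if v > uint64(len(data)) {
		return nil, nil, false // length prefix points past the end of the batch
	}
	return data[v:], data[:v], true
}
```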

@rohansuri (Contributor, Author) left a comment



270.md, line 90 at r1 (raw file):

Here are two different issues that resulted in the same symptom

Thanks for sharing them. I didn't think about the recovery code path that might apply incompletely written batches from WAL to memtable.

I'd be worried that doing bounds checking for each varint byte access will negatively impact performance.

Agreed; reading batches lies on the hot code path of user-issued writes.

@rohansuri (Contributor, Author) commented

I've added sections for kFlush and kWriteCallback. I'll finish the final section on kCompaction and then answer all TODOs and explore the SstFileManager you mentioned.

I've understood the crux of the task: we need to return contextual errors (capturing a code and subcode) from the deepest functions in the call stack (for example, the vfs layer, or any constraint checking done before the final steps of applying a version edit) and bubble these errors up to the main operation whose completion they affect. Finally, in the main operation a background error is set depending on the error's severity.
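Roughly, I'm imagining an error shape along these lines (a sketch only; the type and constant names are hypothetical, not what this branch implements):

```go
package sketch

// Severity mirrors RocksDB's notion of how bad a background error is.
type Severity int

const (
	SeverityNoError Severity = iota
	SeveritySoft    // e.g. out of space during a compaction
	SeverityHard    // e.g. out of space during a flush or WAL write
	SeverityFatal
	SeverityUnrecoverable
)

// BackgroundError is a hypothetical contextual error that records the
// operation it arose from and its severity, wrapping the root cause.
type BackgroundError struct {
	Op       string // e.g. "wal-write", "memtable-apply", "flush", "compaction"
	Severity Severity
	Err      error // the underlying error, e.g. from the vfs layer
}

func (e *BackgroundError) Error() string { return e.Op + ": " + e.Err.Error() }
func (e *BackgroundError) Unwrap() error { return e.Err }
```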

@rohansuri (Contributor, Author) commented

This is still WIP and not ready for review.

@rohansuri (Contributor, Author) commented

@petermattis I think it'd be easier to do this over multiple PRs, just like it was done in RocksDB, since the features are incremental in nature. Is that fine?

The current PR will simply place the DB in read-only mode upon an error in any user write or background write (flush, compaction). Doing this will not allow any further flushes, compactions, or user writes. There will be no auto recovery from out of space errors. The only recovery path in code is DB::Resume, which is callable by the user and allows resumption only from soft and fatal errors.
facebook/rocksdb#3997
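For this first PR, the shape I have in mind is roughly the following (a sketch only; the setBGError/isDBStopped names echo RocksDB's ErrorHandler, the rest is hypothetical):

```go
package sketch

import "sync"

// errorHandler sketches the state this PR would add to the DB: the first
// hard or fatal background error places the DB in read-only mode, and every
// subsequent user write, flush, or compaction checks it before proceeding.
type errorHandler struct {
	mu    sync.Mutex
	bgErr error // nil while the DB is healthy
}

// setBGError records err as the background error if none is set yet.
func (h *errorHandler) setBGError(err error) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.bgErr == nil {
		h.bgErr = err
	}
}

// isDBStopped returns the background error that forces read-only mode, or
// nil if writes may proceed.
func (h *errorHandler) isDBStopped() error {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.bgErr
}
```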

In the next PR, we add auto recovery from out of space errors by bringing in the relevant parts of SstFileManager. This includes polling the filesystem for free space and checking whether it is over a particular threshold, upon which we unset the error. The threshold depends on the severity of the error.
facebook/rocksdb#4164
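A sketch of what that recovery loop could look like (the freeSpace and clearBGError hooks are placeholders rather than a real Pebble or vfs API; only the 5-second polling cadence comes from RocksDB):

```go
package sketch

import "time"

// recoverFromOutOfSpace sketches the auto-recovery loop: poll the
// filesystem's free space and clear the background error once it exceeds a
// severity-dependent threshold. freeSpace and clearBGError are placeholders
// for whatever the real implementation plugs in.
func recoverFromOutOfSpace(freeSpace func() (uint64, error), threshold uint64, clearBGError func()) {
	ticker := time.NewTicker(5 * time.Second) // RocksDB polls every 5 seconds
	defer ticker.Stop()
	for range ticker.C {
		avail, err := freeSpace()
		if err != nil {
			continue // couldn't stat the filesystem; try again on the next tick
		}
		if avail >= threshold {
			clearBGError()
			return
		}
	}
}
```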

Out of space errors while flushing or writing the WAL are considered hard errors. These place the DB in read-only mode. In this case the threshold for auto recovery is reserved_disk_buffer_, which is defined as ColumnFamilyOptions::write_buffer_size, i.e. the size of a memtable. For Pebble this would be DB.mu.mem.nextSize. The rationale is explained here. Although I wonder if we should choose a higher threshold, since the recovery involves a forced flush itself, and flushes take more disk space than just the memtable size because they also include meta blocks. Wdyt?

Out of space errors during compaction are considered soft errors. A soft error doesn't place the DB in read-only mode, i.e. user writes, flushes, and compactions are still allowed. I think we allow them because user-facing writes and flushes have higher priority than compactions: if you don't have enough space to run compactions (larger space requirements) but do have enough space to accept user writes and do flushes (smaller space requirements, like 64MB), then we at least do those rather than stopping everything. Note the subtlety here: if a compaction fails due to an out of space error, the immediate user writes and flushes will of course also fail, and we'll stop everything because those are hard errors. So it might seem there's no benefit in not stopping everything in the case of compactions. However, consider the case where we run out of space during a compaction and then get back some free space before any user writes or flushes happen. We could get back some space because the user deletes some files, or maybe our own deleteObsoleteFile clears up the intermediate files of the failed compaction. The bg error will still stay soft, and subsequent user writes and flushes will now be accepted.

The other important aspect in the case of a soft out of space error is to still allow compactions but throttle them. That is, only allow a compaction if it will not result in going out of space. To do this, we estimate the compaction output size as the input size (i.e. the worst case where no keys have overwrites or deletes) and check with the filesystem whether we have enough space. This is important to do; otherwise we'd start large compactions that fail partway through and then have to clean up. Left unchecked, this cycle might disrupt concurrent user writes and flushes, so throttling is a good idea.
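A minimal sketch of that check (the function and its inputs are hypothetical stand-ins):

```go
package sketch

// mayRunCompaction sketches the throttling check while a soft out of space
// error is set: estimate the compaction's output size as the sum of its
// input sizes (the worst case, where no keys have overwrites or deletes)
// and only start it if the filesystem reports at least that much free space.
func mayRunCompaction(inputSizes []uint64, freeSpace uint64) bool {
	var estimatedOutput uint64
	for _, sz := range inputSizes {
		estimatedOutput += sz
	}
	return freeSpace >= estimatedOutput
}
```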

What's left for me to understand is the threshold for when this soft error will be cleared so that we'll go back to full throttle compactions. I'll get back to understanding this when doing the second PR.

@petermattis (Collaborator) commented

@petermattis I think it'd be easier to do this over multiple PRs, just like it was done in RocksDB, since the features are incremental in nature. Is that fine?

Yes! I prefer multiple incremental PRs over one massive PR.

The current PR will simply place the DB in read-only mode upon an error in any user write or background write (flush, compaction). Doing this will not allow any further flushes, compactions, or user writes. There will be no auto recovery from out of space errors. The only recovery path in code is DB::Resume, which is callable by the user and allows resumption only from soft and fatal errors.
facebook/rocksdb#3997

SGTM

In the next PR, we add auto recovery from out of space errors by bringing in the relevant parts of SstFileManager. This includes polling the filesystem for free space and checking whether it is over a particular threshold, upon which we unset the error. The threshold depends on the severity of the error.
facebook/rocksdb#4164

Sounds good, though I'm not imagining exposing something like SstFileManager at this point. As long as you're only talking about bringing in portions of its functionality, but keeping it private to the implementation, then I'm good with this.

Out of space errors while flushing or writing the WAL are considered hard errors. These place the DB in read-only mode. In this case the threshold for auto recovery is reserved_disk_buffer_, which is defined as ColumnFamilyOptions::write_buffer_size, i.e. the size of a memtable. For Pebble this would be DB.mu.mem.nextSize. The rationale is explained here. Although I wonder if we should choose a higher threshold, since the recovery involves a forced flush itself, and flushes take more disk space than just the memtable size because they also include meta blocks. Wdyt?

I can see using a higher threshold. A memtable is typically 64MB or 128MB in size. That's a very small amount of disk. A DB will make very little progress if only that amount of disk space is freed.

The other important aspect in the case of a soft out of space error is to still allow compactions but throttle them. That is, only allow a compaction if it will not result in going out of space. To do this, we estimate the compaction output size as the input size (i.e. the worst case where no keys have overwrites or deletes) and check with the filesystem whether we have enough space. This is important to do; otherwise we'd start large compactions that fail partway through and then have to clean up. Left unchecked, this cycle might disrupt concurrent user writes and flushes, so throttling is a good idea.

Does RocksDB always check that the filesystem has sufficient room for the compaction, or does it only do so when it has encountered a soft out of space error? Estimating the compaction output size as the input size sounds good to start. It isn't always accurate and can possibly be improved in the future with knowledge of the point and range tombstones in the inputs.

@rohansuri (Contributor, Author) commented May 27, 2020

Sounds good, though I'm not imagining exposing something like SstFileManager at this point. As long as you're only talking about bringing in portions of its functionality, but keeping it private to the implementation, then I'm good with this.

Yup, I'll just bring in the required parts. Those don't require any settings from the user anyway, so we're good without exposing it. There is one feature that would require exposing it, if we want it now: RocksDB allows sharing the same SstFileManager instance across DB instances so that their auto recovery is done one by one. This removes the possibility of all the DBs individually checking for free space without knowing about each other, finding enough space and hence clearing their errors, and then erroring out again because they all begin writing together and none of them progresses. I think it is fine to leave that out for now and have the auto recovery logic run separately inside every DB?

I can see using a higher threshold. A memtable is typically 64MB or 128MB in size. That's a very small amount of disk. A DB will make very little progress if only that amount of disk space is freed.

Agreed. What would be a good heuristic for this?

Does RocksDB always check that the filesystem has sufficient room for the compaction, or does it only do so when it has encountered a soft out of space error?

Only when the DB has encountered a soft error specifically due to out of space.

@rohansuri (Contributor, Author) commented

You can also grep for EventListener.BackgroundError which is called during failed compactions and flushes. Another thing to grep for is [Ll]ogger.Fatalf(. The latter often indicate internal invariants being violated, though it looks like some of the Fatalfs are due to IO errors

Some of the fatal ones are due to internal programming errors/incorrect usage, for example not acquiring the lock on versionSet before calling logAndApply. Do we continue to panic on such errors, or do we place the DB in read-only mode? I think we should continue to panic for these kinds of errors, since their scope is more like that of C++ assert statements.

@rohansuri force-pushed the background-error-handling branch 4 times, most recently from 94444f7 to dc64e39 on June 30, 2020.
@rohansuri (Contributor, Author) commented Jul 1, 2020

@petermattis Do we want to mark invariant errors everywhere, or only where they occur in the code paths of the background operations (WAL write, memtable write, flush, compaction)? For example, versionSet.load is not called in any of the background operations, but could return errors due to violated invariants. We could mark them everywhere for consistency in the code, but error type checking would only be done on errors from background operations.
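For the marking itself, something along these lines could work (purely illustrative; the sentinel and helper are made up, not what the branch currently does):

```go
package sketch

import (
	"errors"
	"fmt"
)

// ErrInvariantViolation is a hypothetical sentinel marking errors that
// indicate a violated internal invariant rather than an I/O failure.
var ErrInvariantViolation = errors.New("pebble: internal invariant violated")

// markInvariant wraps err so that errors.Is(err, ErrInvariantViolation)
// reports true while preserving the original message.
func markInvariant(err error) error {
	return fmt.Errorf("%w: %v", ErrInvariantViolation, err)
}
```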

@rohansuri (Contributor, Author) commented

@petermattis I've been thinking about the error propagation semantics for the commit pipeline, since we no longer want to panic on error. This needs some thought because RocksDB's group commit approach differs from our commit pipeline, which will cause our error propagation to differ from RocksDB's, and hence we need to decide what the right behaviour is.

Let's take some examples where no batch requests a sync.

Example 1:
Five batches have their WAL writes succeed. All five are placed into the pipeline. The memtable apply for batch 2 fails but succeeds for all the others. Batch 1 is OK to be published. However, batch 2 and the subsequent batches must not be published, and batch 2's apply error must be propagated to those subsequent batches.

Example 2:
The WAL writes for batches 1, 2, and 3 succeed, but the WAL write for batch 4 fails, and batch 2's apply also fails. Batch 5 wouldn't be placed in the pipeline, since we'll check db.error_handler.IsDBStopped before placing it while holding the commit mutex. Batch 1 is OK to be published. Batches 2 and 3 will return with batch 2's apply error. Batch 4 will return with the WAL write error.

Example 3:
All five WAL writes succeed, but the applies for batches 2 and 4 fail. We propagate batch 2's apply error to batches 2, 3, 4, and 5.

Hence a memtable apply failure must be propagated to the subsequent batches in the pipeline so that none of them gets published. This error propagation must be done with the commit mutex held, to ensure no new batches are accepted into the pipeline.
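A rough sketch of that propagation step (the writer and pipeline types are hypothetical stand-ins, not Pebble's actual commitPipeline):

```go
package sketch

import "sync"

// writer is a hypothetical stand-in for a batch waiting in the commit pipeline.
type writer struct {
	applyErr error
	done     chan struct{} // closed once the batch's outcome is decided
}

// pipeline is a hypothetical stand-in for the commit pipeline's publish queue.
type pipeline struct {
	commitMu sync.Mutex
	pending  []*writer // batches in publish order
}

// failFrom propagates an apply error from the batch at index i to it and to
// every later batch in the pipeline, so that none of them is published. The
// commit mutex is held so that no new batch can enter behind the failure.
func (p *pipeline) failFrom(i int, applyErr error) {
	p.commitMu.Lock()
	defer p.commitMu.Unlock()
	for _, w := range p.pending[i:] {
		w.applyErr = applyErr
		close(w.done) // wake the waiting writer; it will return applyErr
	}
	p.pending = p.pending[:i]
}
```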

In cases where a batch requests a sync, it isn't possible to propagate the sync error to the subsequent non-sync batches in the pipeline, because publish doesn't wait for sync. By the time a batch gets back its sync error, the subsequent batches might already have been published and returned with no errors. But this is OK, since we allow it even in the current code, where a reader could read a subsequently published batch and only later would Pebble panic due to a previous batch's sync error. However, sync errors must be propagated to the subsequent sync batches in the pipeline, or else they'll keep waiting for the sync. This is already a TODO in flushLoop; I'll raise a PR for it.

In cases where a batch gets both a sync error and a memtable apply error (either its own apply fails or a previous batch's error is propagated to it), the writer will return the memtable apply error.

Do these semantics sound right? Am I missing any cases?

Just a use-case question: why does DB.Set take write options? Why isn't a DB-wide sync=true/false setting sufficient?

@rohansuri (Contributor, Author) commented

@petermattis any thoughts on the above comment?
