[WIP] Background error handling #573
base: master
Conversation
Looks like a good start.

You can also grep for `EventListener.BackgroundError`, which is called during failed compactions and flushes. Another thing to grep for is `[Ll]ogger\.Fatalf(`. The latter often indicate internal invariants being violated, though it looks like some of the `Fatalf`s are due to IO errors.
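For context, this is roughly how that hook looks from the user's side; a minimal sketch assuming the `EventListener.BackgroundError` callback mentioned above, with the wiring into `pebble.Options` omitted since the field's shape may differ between Pebble versions:

```go
package main

import (
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Sketch: an EventListener whose BackgroundError hook fires when a
	// background operation such as a flush or compaction fails. Attaching it
	// to pebble.Options is left out here because the exact field layout may
	// differ between Pebble versions.
	el := pebble.EventListener{
		BackgroundError: func(err error) {
			log.Printf("background error: %v", err)
		},
	}
	_ = el
}
```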
270.md, line 90 at r1 (raw file):
#### Questions Q. How could the batch get corrupted? RocksDB encodes the batch itself. The
We've seen bugs that have caused either real or apparent batch corruption.
270.md, line 103 at r1 (raw file):
Since the batch is incorrectly encoded and simply re-reading the WAL and constructing the memtable would again fail, we probably need to reject the write by removing the batch from the WAL manually.
This sounds difficult to do automatically and may violate application expectations. This might be what RocksDB refers to as an Unrecoverable error, requiring user intervention to fix.
270.md, line 266 at r1 (raw file):
# TODOs Q. Only no space errors are recoverable. How is the recovery done?
I just looked at this. See `util/sst_file_manager_impl.cc`. When an out-of-space error is encountered, a thread is kicked off to run `SstFileManagerImpl::ClearError`. Looks like that thread loops checking free space every 5 seconds and tries to recover when the free space reaches some threshold.
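To make the mechanism concrete, here is a rough Go sketch of that recovery pattern (RocksDB's actual implementation is C++ in `SstFileManagerImpl::ClearError`; the function names, hooks, and threshold below are placeholders):

```go
package main

import (
	"fmt"
	"time"
)

// clearErrorLoop polls free disk space periodically and clears the background
// error once it crosses a threshold. freeBytes, threshold, and clearBGError
// are placeholders standing in for the real integration points.
func clearErrorLoop(stop <-chan struct{}, freeBytes func() (uint64, error),
	threshold uint64, clearBGError func()) {
	ticker := time.NewTicker(5 * time.Second) // RocksDB polls every 5 seconds.
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			free, err := freeBytes()
			if err != nil {
				continue // couldn't stat the filesystem; try again next tick
			}
			if free >= threshold {
				clearBGError() // enough space recovered; resume writes
				return
			}
		}
	}
}

func main() {
	stop := make(chan struct{})
	defer close(stop)
	// Fake free-space query that immediately reports enough space, so the
	// loop clears the error on its first tick and returns.
	clearErrorLoop(stop,
		func() (uint64, error) { return 20 << 30, nil }, // fake: 20 GiB free
		1<<30, // placeholder threshold: recover once 1 GiB is free
		func() { fmt.Println("background error cleared") })
}
```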
270.md, line 90 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
We've seen bugs that have caused either real or apparent batch corruption.
Interesting. Could you link any bugs you remember? Also, if there is a possibility of a mismatch between the encoding and decoding logic, should we also do bounds checking for each varint byte access and check for overflows in `batchDecodeStr`?
270.md, line 90 at r1 (raw file):
Could you link any bugs you remember?
Here are two different issues that resulted in the same symptom:
Also if there is a possibility of mismatch between encoding/decoding logic then should we also do bounds checking for each varint byte access and check for overflows as well in batchDecodeStr?
There is a trade-off here between safety and performance. I'd be worried that doing bounds checking for each varint byte access will negatively impact performance. The argument against doing so is that in-memory corruptions should be rare, and on-disk corruptions should be protected by the WAL and sstable checksums.
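For what it's worth, a bounds-checked version of a length-prefixed string decode would look roughly like this; this is a sketch only, the real `batchDecodeStr` hand-rolls the varint for speed and the names here are illustrative:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// decodeStr reads a uvarint length prefix followed by that many bytes,
// checking bounds at each step instead of trusting the encoded length.
func decodeStr(data []byte) (s []byte, rest []byte, err error) {
	n, w := binary.Uvarint(data) // w <= 0 signals a truncated or overlong varint
	if w <= 0 {
		return nil, nil, errors.New("batch corrupted: invalid varint length")
	}
	data = data[w:]
	if n > uint64(len(data)) {
		return nil, nil, errors.New("batch corrupted: string length exceeds remaining bytes")
	}
	return data[:n], data[n:], nil
}

func main() {
	buf := binary.AppendUvarint(nil, 5)
	buf = append(buf, []byte("hello")...)
	s, rest, err := decodeStr(buf)
	fmt.Println(string(s), len(rest), err) // hello 0 <nil>
}
```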
270.md, line 90 at r1 (raw file):
Here are two different issues that resulted in the same symptom
Thanks for sharing them. I hadn't thought about the recovery code path that might apply incompletely written batches from the WAL to the memtable.
I'd be worried that doing bounds checking for each varint byte access will negatively impact performance.
Agreed, reading batches lies on the hot code path of user-issued writes.
28fcd26 to 9ccc23a (Compare)
I've added sections for … I've understood the crux of the task. We need to return contextual errors (capturing code and subcode) from the deepest functions in the call stack, for example the vfs layer or any constraint checking done before the final steps of applying a version edit, and bubble these errors up to the main operation whose completion they affect. Finally, in the main operation a background error is set depending on the error's severity.
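As an illustration of what "contextual errors" could look like, here is a minimal sketch; the type, severity levels, and `classify` helper are hypothetical names for this discussion, not existing Pebble APIs:

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
)

// Severity classifies how a background error should be handled, loosely
// mirroring RocksDB's soft/hard/fatal levels.
type Severity int

const (
	SeveritySoft Severity = iota
	SeverityHard
	SeverityFatal
)

// bgError wraps a low-level error (e.g. from the vfs layer) with the context
// needed to decide severity at the top of the call stack.
type bgError struct {
	op       string // e.g. "flush", "compaction", "wal-write"
	severity Severity
	err      error
}

func (e *bgError) Error() string { return fmt.Sprintf("%s: %v", e.op, e.err) }
func (e *bgError) Unwrap() error { return e.err }

// classify tags an error bubbling up from a deep function; out of space during
// a flush or WAL write would be hard, during a compaction soft, and so on.
func classify(op string, err error, sev Severity) error {
	return &bgError{op: op, severity: sev, err: err}
}

func main() {
	err := classify("flush", fs.ErrPermission, SeverityHard)
	var be *bgError
	if errors.As(err, &be) && be.severity >= SeverityHard {
		fmt.Println("placing DB in read-only mode:", be)
	}
}
```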
9ccc23a to 959f6c2 (Compare)
This is still WIP and not ready for review.
@petermattis I think it'd be easier to do this over multiple PRs, just like it was done in RocksDB, since the features are incremental in nature. Is that fine?

The current PR will simply place the DB in read-only mode upon an error in any user write or background write (flush, compaction). Doing this will not allow any further flushes, compactions, or user writes. There will be no auto recovery from out-of-space errors. The only recovery in code is through …

In the next PR, we add auto recovery from out-of-space errors by bringing in the relevant parts of … Out-of-space errors while flushing or doing WAL writes are considered hard errors. These place the DB in read-only mode. In this case the threshold for auto recovery is reserved_disk_buffer_, which is defined as …

Out-of-space errors during compaction are considered soft errors. These don't place the DB in read-only mode, i.e. user writes, flushes, and compactions are still allowed. I think we allow them because user-facing writes and flushes have higher priority than compactions: if you don't have enough space to run compactions (larger space requirements), but you do have enough space to accept user writes and do flushes (smaller space requirements, like 64MB), then we at least do those rather than stopping everything. Note the subtlety here: if a compaction fails due to an out-of-space error, of course the immediate user writes and flushes will also fail and we'll stop everything, because those are hard errors. So it might seem there's no benefit to not stopping everything in the case of compactions. However, consider the case where we run out of space during a compaction and then get back some free space before any user writes or flushes happen. We could get back some space because the user deletes some files or maybe our own …

The other important aspect in the case of a soft out-of-space error is to still allow compactions but throttle them. That is, only allow a compaction if it will not result in going out of space. To do this, we estimate the compaction output size as the input size (i.e. the worst case where no keys have overwrites or deletes) and check with the filesystem whether we have enough space (see the sketch after this comment). This is important to do, otherwise we'd allow large compactions that fail partway through and then clean up; this cycle going unchecked might disrupt concurrent user writes and flushes, so throttling is a good idea. What's left for me to understand is the threshold at which this soft error will be cleared so that we go back to full-throttle compactions. I'll get back to understanding this when doing the second PR.
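A rough sketch of the throttling check described above; `diskAvailBytes` is a placeholder for however we end up querying free space from the filesystem, and the worst-case estimate is simply the total input size:

```go
package main

import "fmt"

// compactionFits reports whether a compaction whose inputs total inputBytes
// should be allowed to run while the DB is under a soft out-of-space error.
// The output size is estimated pessimistically as the input size (the worst
// case where no keys have overwrites or deletes).
func compactionFits(inputBytes uint64, diskAvailBytes func() (uint64, error)) (bool, error) {
	avail, err := diskAvailBytes()
	if err != nil {
		return false, err
	}
	return inputBytes <= avail, nil
}

func main() {
	// Fake free-space query standing in for a real filesystem call.
	avail := func() (uint64, error) { return 10 << 30, nil } // 10 GiB free
	ok, _ := compactionFits(64<<20, avail)                   // 64 MiB of inputs
	fmt.Println("allow compaction:", ok)
}
```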
Yes! I prefer multiple incremental PRs over one massive PR.
SGTM
Sounds good, though I'm not imagining exposing something like …
I can see using a higher threshold. A memtable is typically 64MB or 128MB in size. That's a very small amount of disk. A DB will make very little progress if only that amount of disk space is freed.
Does RocksDB always check that the filesystem has sufficient room for the compaction, or does it only do so when it has encountered a soft out-of-space error? Estimating the compaction output size as the input size sounds good to start. It isn't always true and can possibly be improved in the future with knowledge of the point and range tombstones in the inputs.
Yup, I'll just bring in the required parts. Those don't require any setting from the user anyway, so we're good without exposing it. Although there's one feature that will require exposing it, if we want it now: RocksDB allows sharing the same …
Agreed. What would be a good heuristic for this?
Only when the DB has encountered a soft error specifically due to out of space.
Some of the fatal ones are due to internal programmatic errors/incorrect usage, for example not acquiring the lock on …
94444f7 to dc64e39 (Compare)
@petermattis Do we want to mark invariant errors everywhere, or only where they occur in the code path of the background operations (WAL write, memtable write, flush, compaction)? For example, …
@petermattis I've been thinking about the error propagation semantics for the commit pipeline, since we no longer want to panic on error. This needs some thought because RocksDB's group commit approach differs from our commit pipeline. This will cause our error propagation to differ from RocksDB's, and hence we need to decide what the right behaviour is.

Let's take some examples where no batch requests a sync.

Example 1: …
Example 2: …
Example 3: …

Hence a memtable apply failure must be propagated to subsequent batches in the pipeline, so that none of them get published. This error propagation must be done with the commit mutex held, to ensure no new batches are accepted into the pipeline (see the sketch after this comment).

In cases where a batch requests a sync, it isn't possible to propagate the sync error to its subsequent non-sync batches in the pipeline. This is because publish doesn't wait for sync. Hence by the time a batch gets back its sync error, subsequent batches might have already published and returned with no errors. But this is OK, since we allow this even in the current code, where a reader could read a subsequently published batch and only later Pebble panics due to a previous batch's sync error. However, sync errors must be propagated to subsequent sync batches in the pipeline, else they'll keep waiting for the sync. This is already a TODO in …

In cases where a batch gets both a sync error and a memtable apply error (its own apply fails or a previous batch's error is propagated), the writer will return the propagated memtable error.

Do these semantics sound right? Am I missing any cases?

Just a use-case question: why does …
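To make the first point concrete, here is a toy sketch of propagating a memtable apply error to the batches queued behind the failing one while holding the commit mutex; this is not Pebble's actual `commitPipeline`, just an illustration of the intended semantics with made-up types:

```go
package main

import (
	"fmt"
	"sync"
)

// batch is a toy stand-in for a queued write: it records the error (if any)
// it must return to its writer instead of being published.
type batch struct {
	id  int
	err error
}

// pipeline is a toy commit pipeline: pending holds batches that have been
// applied (or are waiting to apply) but not yet published.
type pipeline struct {
	mu      sync.Mutex
	pending []*batch
}

// failFrom is called when the batch at index i fails its memtable apply.
// Holding mu ensures no new batches enter the queue while the error is
// propagated, so neither the failing batch nor any batch behind it is
// published.
func (p *pipeline) failFrom(i int, err error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for _, b := range p.pending[i:] {
		if b.err == nil {
			b.err = err
		}
	}
	p.pending = p.pending[:i] // nothing from index i onward is published
}

func main() {
	p := &pipeline{pending: []*batch{{id: 1}, {id: 2}, {id: 3}}}
	all := append([]*batch(nil), p.pending...)
	p.failFrom(1, fmt.Errorf("memtable apply failed"))
	for _, b := range all {
		fmt.Printf("batch %d err=%v\n", b.id, b.err)
	}
}
```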
dc64e39 to ea10c7d (Compare)
ea10c7d to 0196327 (Compare)
@petermattis any thoughts on the above comment?
I have started making some progress. I'll keep updating the plan in file 270.md and also leave TODO markers in the code where changes are required. Will start implementing once we've addressed all the comments and the plan looks good.