Fix behaviour on write failures #32
Conversation
Great findings @banks 👏 👏 👏 ! I left a few comments.
I think once this fix is merged we need to cut a new release of raft-wal and bump it in Consul. While it's not a very common case and won't impact the v2 experiment that much, it's better to have it as part of the GA.
beforeCRC := w.writer.crc
beforeIndexStart := w.writer.indexStart
beforeWriteOffset := w.writer.writeOffset
beforeOffsets := w.offsets.Load()
With offsets being atomic here, I'm not sure this is still safe: we could be overwriting other updates to offsets between the start of the append and when the defer runs.
This library relies heavily on the fact that Append may only be called serially, and the caller must ensure that; that's why we don't have locks around the writer state, for example. I think with that documented constraint there is no way for this to be racy, but I might still be wrong, so please correct me if you see anything!
Looking closer I think it's safe too 😅.
The reason is that we rely on commitIdx from the reader's perspective to know whether it's safe to read, so even if the offset is incremented it's still safe for readers.
Also, we only increment commitIdx after the flush to disk has succeeded.
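To make that invariant concrete, here is a hedged sketch; the type, fields and helper below are invented for illustration and are not the real raft-wal code. The writer extends its in-memory offsets speculatively, but only publishes commitIdx after a successful flush, and readers consult commitIdx before trusting an entry.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// sketchWriter is a toy stand-in for the segment writer state.
type sketchWriter struct {
	commitIdx uint64   // highest index readers may safely read
	offsets   []uint32 // in-memory frame offsets, updated before the flush
}

// append records the offset speculatively, flushes, and only then publishes
// the new commitIdx. On a flush error commitIdx is untouched, so readers
// never observe the partial write even though offsets was already extended.
func (w *sketchWriter) append(idx uint64, offset uint32, flush func() error) error {
	w.offsets = append(w.offsets, offset)
	if err := flush(); err != nil {
		return err
	}
	atomic.StoreUint64(&w.commitIdx, idx)
	return nil
}

// safeToRead is what a concurrent reader would consult.
func (w *sketchWriter) safeToRead(idx uint64) bool {
	return idx <= atomic.LoadUint64(&w.commitIdx)
}

func main() {
	w := &sketchWriter{}
	_ = w.append(1, 0, func() error { return nil })
	fmt.Println(w.safeToRead(1)) // true: published only after the successful flush
}
```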
@@ -117,3 +118,116 @@ func TestConcurrentReadersAndWriter(t *testing.T) {
	require.Greater(t, int(atomic.LoadUint64(&numReads)), 1000)
	require.Greater(t, int(atomic.LoadUint64(&sealedMaxIndex)), 1000)
}

func TestWriterRecoversFromWriteFailure(t *testing.T) {
My understanding is that the following test cases would fail the first WriteAt or Sync. We could maybe add a test case where some writes pass before a failure happens, to simulate a partial write.
I actually wrote the tests initially, on that same thought, to fail a WriteAt other than the first one, and they did nothing... It turns out (and I'd forgotten) that we buffer the whole batch in memory and only actually write to the file once per Append. So there is never more than one WriteTo, at least at the VFS/file level! See the flush method.
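For anyone following along, the buffering behaviour described above looks roughly like the sketch below; the type and field names are invented for illustration and are not raft-wal's actual flush implementation.

```go
package main

import (
	"fmt"
	"os"
)

// batchWriter buffers every entry of a batch in memory and only touches the
// file once, in flush. This is why failing "the second WriteAt of a batch"
// never triggers: there is only ever one WriteAt (plus one Sync) per Append.
type batchWriter struct {
	f      *os.File
	buf    []byte // the whole encoded batch
	offset int64  // file offset for the next flush
}

func (w *batchWriter) appendEntry(entry []byte) {
	w.buf = append(w.buf, entry...) // in-memory only
}

func (w *batchWriter) flush() error {
	if _, err := w.f.WriteAt(w.buf, w.offset); err != nil {
		return err
	}
	if err := w.f.Sync(); err != nil {
		return err
	}
	w.offset += int64(len(w.buf))
	w.buf = w.buf[:0]
	return nil
}

func main() {
	f, err := os.CreateTemp("", "wal-sketch-*")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	w := &batchWriter{f: f}
	w.appendEntry([]byte("entry-1"))
	w.appendEntry([]byte("entry-2"))
	fmt.Println(w.flush()) // <nil>: both entries hit the file in a single WriteAt
}
```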
Yes, I understand that only one flush is done per batch. The test case that I'm referring to is:
1- Call Append, no error
2- Call Append, error
3- Call Append, no error
Verify that calls 1 and 3 are on disk and 2 is not. That would prevent a regression where we rewind a successful write (sketched below).
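Expressed as a Go test, the scenario would look something like this fragment, meant to sit alongside the existing tests in this file; testWriter, appendOne, failNextFlush and readValues are hypothetical helpers standing in for the real fixtures, and the real Append signature may differ.

```go
// Hypothetical sketch of the append / fail / append scenario above.
func TestAppendRecoversAfterFailedBatch(t *testing.T) {
	w := testWriter(t) // hypothetical: a writer backed by a fault-injecting filesystem

	// 1- Call Append, no error.
	require.NoError(t, appendOne(w, 1, "one"))

	// 2- Call Append, error: fail the single WriteAt/Sync done for this batch.
	failNextFlush(w)
	require.Error(t, appendOne(w, 2, "two"))

	// 3- Call Append, no error: after the rollback the same index can be written again.
	require.NoError(t, appendOne(w, 2, "two-retry"))

	// Calls 1 and 3 are on disk; the failed call 2 is not visible.
	require.Equal(t, []string{"one", "two-retry"}, readValues(t, w))
}
```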
Maybe I'm still missing something, but I think that is what the non-empty variants test here: they first append a batch successfully and then fail the next batch. I pulled that out of the table because I wanted to assert both versions for every other case, so it made more sense to just call t.Run twice per case. Is that what you mean?
🤦 I missed the double t.Run call at the end. Yes that's exactly what I had in mind!
🚀 🚀 🚀
While looking at some other changes I noticed that we don't handle write failures gracefully.
If an error is returned from Append due to a file write, an fsync, or even an internal validation error (but after some state has been modified), then the SegmentWriter is left in an invalid internal state with partial writes still reflected.
Originally I thought this would be catastrophic: in this case Raft will fail the write RPC (follower) or step down (leader), but will eventually try to continue writing more logs without closing and re-opening any files. I hypothesized that this could cause arbitrary (possibly silent) corruption, as subsequent writes would start at some random offset in the file after a load of junk...
So I wrote a test to explore all these cases. It did fail, but at least in a "safe" way: the partially written logs would remain in memory in the writer's offsets index, which means that subsequent attempts to write the next raft index would always error with a non-monotonic error. This at least means no corruption would happen, although the node would basically be unable to make any progress until it was either restarted or perhaps had the last records truncated (I didn't work out what happens in that case actually). But we can do better!
This PR fixes the behavior to be equivalent to BoltDB again: we roll back the in-memory changes on an error in Append, which means the writes may or may not have made it to disk, but a subsequent write will continue to be correct and will overwrite the same part of the log, resulting in a correct log regardless. If the error is due to an unwritable sector and that failure can't be recovered by the hardware, then writes will continue to fail until the hardware is replaced, but at least nothing is left in a potentially corrupt state. More likely, a bad sector will be fixed by the hardware remapping it, so a subsequent write will succeed and we'll now allow that and do the right thing.
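For readers skimming the diff, the shape of the fix is roughly the sketch below: a minimal, self-contained illustration with invented field names. The real change snapshots the writer's crc, indexStart, writeOffset and offsets (as shown in the diff above) and restores them in a defer if the append fails.

```go
package main

import (
	"errors"
	"fmt"
)

// writerState is a toy stand-in for the segment writer's in-memory state.
type writerState struct {
	writeOffset int
	offsets     []int
	buf         []byte
}

// Append mutates the in-memory state, attempts one flush, and rolls the
// state back in a defer if anything failed, so the next Append starts from
// the same valid state and overwrites the same region of the file.
func (w *writerState) Append(entries [][]byte, flush func([]byte) error) (err error) {
	before := *w // snapshot the scalar fields and slice headers
	beforeOffsets := append([]int(nil), w.offsets...)

	defer func() {
		if err != nil {
			*w = before
			w.offsets = beforeOffsets
		}
	}()

	for _, e := range entries {
		w.offsets = append(w.offsets, w.writeOffset+len(w.buf))
		w.buf = append(w.buf, e...)
	}
	if err = flush(w.buf); err != nil {
		return err // the defer restores offsets and writeOffset
	}
	w.writeOffset += len(w.buf)
	w.buf = nil
	return nil
}

func main() {
	w := &writerState{}
	fail := func([]byte) error { return errors.New("bad sector") }
	ok := func([]byte) error { return nil }

	fmt.Println(w.Append([][]byte{[]byte("a")}, fail)) // bad sector; state rolled back
	fmt.Println(w.Append([][]byte{[]byte("a")}, ok))   // <nil>: retried from the same state
	fmt.Println(w.offsets)                             // [0]: only the successful write is recorded
}
```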