Fix behaviour on write failures #32

Merged

banks merged 1 commit into main from fix/write-fail on Oct 16, 2023
Conversation

@banks banks (Member) commented Oct 10, 2023

While looking at some other changes I noticed that we don't handle write failures gracefully.

If Append returns an error due to a file write, an fsync, or even an internal validation failure that occurs after some state has already been modified, then the SegmentWriter is left in an invalid internal state with the partial writes still reflected.

Originally I thought this would be catastrophic: in this case Raft will fail the write RPC (follower) or step down (leader) but will eventually try to continue writing more logs without closing and re-opening any files. I hypothesized that this could cause arbitrary (possibly silent) corruption, as subsequent writes would start at some random offset in the file after a load of junk...

So I wrote a test to explore all these cases. It did fail, but at least in a "safe" way: the partially written logs would remain in memory in the writer's offsets index, which means that subsequent attempts to write the next raft index would always fail with a non-monotonic error. This at least means no corruption would happen, although the node would basically be unable to make any progress until it was either restarted or perhaps had its last records truncated (I didn't work out what happens in that case).

But we can do better!

This PR fixes the behaviour to be equivalent to BoltDB again: we roll back the in-memory changes on an error in Append. The writes may or may not have made it to disk, but a subsequent write will still be correct and will overwrite the same part of the log, so the log ends up correct either way. If the error is due to an unwritable sector and that failure can't be recovered by the hardware, then writes will continue to fail until the hardware is replaced, but at least nothing is left in a potentially corrupt state. More likely, a bad sector will be fixed by the hardware remapping it, in which case a subsequent write will succeed and we'll now allow that and do the right thing.
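
Roughly, the shape of the fix is to snapshot the writer's mutable in-memory state at the start of Append and restore it in a deferred function on error. The sketch below is illustrative only: the types are simplified stand-ins, flushFn is a placeholder for the real encode/write/fsync step, and the field names are taken from the snippet quoted in the review thread below.

```go
package sketch

import "sync/atomic"

// Simplified, illustrative stand-in for the real segment writer; the field
// names mirror the snippet quoted in the review thread below, but this is
// not the library's actual code.
type segmentWriter struct {
	writer struct {
		crc         uint32
		indexStart  uint64
		writeOffset uint32
	}
	offsets atomic.Value // offsets index published to readers

	// flushFn stands in for the real "encode, write to file, fsync" step so
	// that the sketch is self-contained.
	flushFn func() error
}

// Append snapshots the mutable in-memory state up front and restores it in a
// deferred function if anything fails, so a retried Append overwrites the
// same region of the file rather than continuing after a partial write.
func (w *segmentWriter) Append(entries [][]byte) (err error) {
	beforeCRC := w.writer.crc
	beforeIndexStart := w.writer.indexStart
	beforeWriteOffset := w.writer.writeOffset
	beforeOffsets := w.offsets.Load()

	defer func() {
		if err != nil {
			// Leave the writer exactly as it was before this call; whatever
			// partially reached disk will be overwritten by the next
			// successful Append.
			w.writer.crc = beforeCRC
			w.writer.indexStart = beforeIndexStart
			w.writer.writeOffset = beforeWriteOffset
			if beforeOffsets != nil {
				w.offsets.Store(beforeOffsets)
			}
		}
	}()

	// ... encode entries, update crc/offsets/writeOffset, then do the single
	// write + fsync; an error from any of that triggers the rollback above.
	return w.flushFn()
}
```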

@rboyer rboyer requested review from analogue and dhiaayachi October 10, 2023 22:00

@dhiaayachi dhiaayachi left a comment

Great findings @banks 👏 👏 👏 ! I left a few comments.

I think once this fix is merged we need to create a new release of raft-wal and bump the Consul version. While it's not a very common use case and won't impact the v2 experiment that much, it's better to have it as part of the GA.

segment/writer.go
beforeCRC := w.writer.crc
beforeIndexStart := w.writer.indexStart
beforeWriteOffset := w.writer.writeOffset
beforeOffsets := w.offsets.Load()

@dhiaayachi: With offsets being atomic here, I'm not sure this is still safe, as we could be overwriting other updates of offsets between the start of the Append and when the defer runs.

@banks (Member, Author): This library relies heavily on the fact that Append may only be called serially, and the caller must ensure that; that's why we don't have locks around the writer state, for example. I think with that documented constraint there is no way for this to be racey, but I might still be wrong, so please correct me if you see anything!

@dhiaayachi: Looking closer I think it's safe too 😅.

The reason is that we rely on commitIdx from the reader's perspective to know whether it's safe to read, so even if the offset is incremented it's safe from the reader's perspective. Also, we only increment the commitIdx after the flush to disk is successful.
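
In other words, a minimal sketch of that reader-side argument (commitIdx is the only name taken from the discussion above; everything else is invented for illustration):

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

// Illustrative only: readers never read past commitIdx, and the writer only
// advances commitIdx after a successful flush, so uncommitted (or rolled
// back) entries are never observable by readers.
type segment struct {
	commitIdx uint64 // highest index known to be durable on disk
}

// commit is only called after flush succeeds; on a flush error it is never
// called, so commitIdx never points at a partial write.
func (s *segment) commit(lastIdx uint64) {
	atomic.StoreUint64(&s.commitIdx, lastIdx)
}

// checkReadable is the reader-side guard described above.
func (s *segment) checkReadable(idx uint64) error {
	if idx > atomic.LoadUint64(&s.commitIdx) {
		return errors.New("index not committed yet")
	}
	return nil
}
```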

@@ -117,3 +118,116 @@ func TestConcurrentReadersAndWriter(t *testing.T) {
require.Greater(t, int(atomic.LoadUint64(&numReads)), 1000)
require.Greater(t, int(atomic.LoadUint64(&sealedMaxIndex)), 1000)
}

func TestWriterRecoversFromWriteFailure(t *testing.T) {

@dhiaayachi: My understanding is that the following test cases would fail on the first WriteAt or Sync. We could maybe add a test case where some writes succeed before a failure happens, to simulate a partial write.

@banks banks (Member, Author) commented Oct 11, 2023

I actually wrote the tests initially to fail a later WriteAt rather than the first one, on the same thought, and they did nothing...

It turns out (and I'd forgotten) that we buffer the whole batch in memory and only actually write to the file once per Append, so there is never more than one WriteAt, at least at the VFS/file level! See the flush method.
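
Roughly, the buffering behaviour being described looks like this (illustrative field names, not the real writer's):

```go
package sketch

import "os"

// Illustrative only: the whole batch is encoded into an in-memory buffer
// during Append, and flush issues exactly one WriteAt against the file
// followed by a single Sync, so there is only one file-level write per
// Append for a test to fail.
type bufferedWriter struct {
	f           *os.File
	batchBuf    []byte // frames encoded for the current Append call
	writeOffset int64  // where this batch starts in the segment file
}

func (w *bufferedWriter) flush() error {
	if _, err := w.f.WriteAt(w.batchBuf, w.writeOffset); err != nil {
		return err // the single WriteAt a test can make fail
	}
	return w.f.Sync() // the single fsync per Append
}
```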

@dhiaayachi: Yes, I understand that only one flush is done per batch. The test case that I'm referring to is:

1. Call Append, no error
2. Call Append, error
3. Call Append, no error

Verify that calls 1 and 3 are on disk and 2 is not. That would prevent a regression where we rewind a successful write.
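
Purely for illustration, that sequence could be sketched like this, with a toy in-memory writer standing in for the real segment writer (every name here is invented for the sketch; it is not the test added in this PR):

```go
package sketch

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

// failableWriter is a toy in-memory stand-in for the real segment writer,
// here only so the "good, bad, good" sequence below actually runs.
type failableWriter struct {
	log      [][]byte
	failNext bool
}

func (w *failableWriter) Append(entries [][]byte) error {
	before := len(w.log)
	w.log = append(w.log, entries...)
	if w.failNext {
		w.failNext = false
		w.log = w.log[:before] // roll back the in-memory state, as the fix does
		return errors.New("simulated write failure")
	}
	return nil
}

func batch(n int) [][]byte { return make([][]byte, n) }

// TestAppendRecoversAfterFailedBatch: a successful batch, a failed batch,
// then a successful retry; only the successful batches are visible.
func TestAppendRecoversAfterFailedBatch(t *testing.T) {
	w := &failableWriter{}

	require.NoError(t, w.Append(batch(5))) // 1: succeeds
	w.failNext = true
	require.Error(t, w.Append(batch(5)))   // 2: fails and is rolled back
	require.NoError(t, w.Append(batch(5))) // 3: retry succeeds

	// Only the entries from calls 1 and 3 remain.
	require.Len(t, w.log, 10)
}
```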

@banks (Member, Author): Maybe I'm still missing something, but I think that is what the non-empty variants test here: they first append a batch successfully and then fail the next batch.

I pulled that out of the table because I wanted to assert both versions for every other case, so it made more sense to just call t.Run twice per case. Is that what you mean?

@dhiaayachi: 🤦 I missed the double t.Run call at the end. Yes, that's exactly what I had in mind!


@dhiaayachi dhiaayachi left a comment


🚀 🚀 🚀

@banks banks merged commit 3aaaab8 into main Oct 16, 2023
1 check passed
@banks banks deleted the fix/write-fail branch October 16, 2023 12:58