Fix behaviour on write failures #32

Merged

banks merged 1 commit into main from fix/write-fail on Oct 16, 2023
Conversation

@banks banks (Member) commented Oct 10, 2023

While looking at some other changes I noticed that we don't handle write failures gracefully.

If Append returns an error due to a file write, an fsync, or even an internal validation failure that occurs after some state has already been modified, then the SegmentWriter is left in an invalid internal state with the partial writes still reflected.

Originally I thought this would be catastrophic: in this case Raft will fail the write RPC (follower) or step down (leader) but will eventually try to continue writing more logs without closing and re-opening any files. I hypothesized that this could cause arbitrary (possibly silent) corruption, as subsequent writes would start at some random offset in the file after a load of junk...

So I wrote a test to explore all these cases. It did fail, but at least in a "safe" way: the partially written logs would remain in memory in the writer's offsets index, which means that subsequent attempts to write the next raft index would always fail with a non-monotonic error. This at least means no corruption would happen, although the node would basically be unable to make any progress until it was either restarted or perhaps had its last records truncated (I didn't work out what happens in that case).

But we can do better!

This PR fixes the behaviour to be equivalent to BoltDB again: we roll back the in-memory changes on an error in Append. The writes may or may not have made it to disk, but a subsequent write will still be correct and will overwrite the same part of the log, so the log ends up correct either way. If the error is due to an unwritable sector and that failure can't be recovered by the hardware, then writes will continue to fail until the hardware is replaced, but at least nothing is left in a potentially corrupt state. More likely, a bad sector will be fixed by the hardware remapping it, in which case a subsequent write will succeed and we'll now allow that and do the right thing.
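
Roughly, the shape of the fix is to snapshot the writer's mutable in-memory state at the start of Append and restore it in a deferred function on error. The sketch below is illustrative only: the types are simplified stand-ins, flushFn is a placeholder for the real encode/write/fsync step, and the field names are taken from the snippet quoted in the review thread below.

```go
package sketch

import "sync/atomic"

// Simplified, illustrative stand-in for the real segment writer; the field
// names mirror the snippet quoted in the review thread below, but this is
// not the library's actual code.
type segmentWriter struct {
	writer struct {
		crc         uint32
		indexStart  uint64
		writeOffset uint32
	}
	offsets atomic.Value // offsets index published to readers

	// flushFn stands in for the real "encode, write to file, fsync" step so
	// that the sketch is self-contained.
	flushFn func() error
}

// Append snapshots the mutable in-memory state up front and restores it in a
// deferred function if anything fails, so a retried Append overwrites the
// same region of the file rather than continuing after a partial write.
func (w *segmentWriter) Append(entries [][]byte) (err error) {
	beforeCRC := w.writer.crc
	beforeIndexStart := w.writer.indexStart
	beforeWriteOffset := w.writer.writeOffset
	beforeOffsets := w.offsets.Load()

	defer func() {
		if err != nil {
			// Leave the writer exactly as it was before this call; whatever
			// partially reached disk will be overwritten by the next
			// successful Append.
			w.writer.crc = beforeCRC
			w.writer.indexStart = beforeIndexStart
			w.writer.writeOffset = beforeWriteOffset
			if beforeOffsets != nil {
				w.offsets.Store(beforeOffsets)
			}
		}
	}()

	// ... encode entries, update crc/offsets/writeOffset, then do the single
	// write + fsync; an error from any of that triggers the rollback above.
	return w.flushFn()
}
```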

@rboyer rboyer requested review from analogue and dhiaayachi October 10, 2023 22:00

@dhiaayachi dhiaayachi left a comment

Great findings @banks 👏 👏 👏 ! I left a few comments.

I think once this fix is merged we need to create a new release of raft-wal and bump the Consul version. While it's not a very common use case and won't impact the v2 experiment that much, it's better to have it as part of the GA.

segment/writer.go
beforeCRC := w.writer.crc
beforeIndexStart := w.writer.indexStart
beforeWriteOffset := w.writer.writeOffset
beforeOffsets := w.offsets.Load()

@dhiaayachi: With offsets being atomic here, I'm not sure this is still safe, as we could be overwriting other updates of offsets between the start of the Append and when the defer runs.

@banks (Member, Author): This library relies heavily on the fact that Append may only be called serially, and the caller must ensure that; that's why we don't have locks around the writer state, for example. I think with that documented constraint there is no way for this to be racey, but I might still be wrong, so please correct me if you see anything!

@dhiaayachi: Looking closer I think it's safe too 😅.

The reason is that we rely on commitIdx from the reader's perspective to know whether it's safe to read, so even if the offset is incremented it's safe from the reader's perspective. Also, we only increment the commitIdx after the flush to disk is successful.
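
In other words, a minimal sketch of that reader-side argument (commitIdx is the only name taken from the discussion above; everything else is invented for illustration):

```go
package sketch

import (
	"errors"
	"sync/atomic"
)

// Illustrative only: readers never read past commitIdx, and the writer only
// advances commitIdx after a successful flush, so uncommitted (or rolled
// back) entries are never observable by readers.
type segment struct {
	commitIdx uint64 // highest index known to be durable on disk
}

// commit is only called after flush succeeds; on a flush error it is never
// called, so commitIdx never points at a partial write.
func (s *segment) commit(lastIdx uint64) {
	atomic.StoreUint64(&s.commitIdx, lastIdx)
}

// checkReadable is the reader-side guard described above.
func (s *segment) checkReadable(idx uint64) error {
	if idx > atomic.LoadUint64(&s.commitIdx) {
		return errors.New("index not committed yet")
	}
	return nil
}
```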

@@ -117,3 +118,116 @@ func TestConcurrentReadersAndWriter(t *testing.T) {
require.Greater(t, int(atomic.LoadUint64(&numReads)), 1000)
require.Greater(t, int(atomic.LoadUint64(&sealedMaxIndex)), 1000)
}

func TestWriterRecoversFromWriteFailure(t *testing.T) {

@dhiaayachi: My understanding is that the following test cases would fail on the first WriteAt or Sync. We could maybe add a test case where some writes succeed before a failure happens, to simulate a partial write.

@banks banks (Member, Author) commented Oct 11, 2023

I actually wrote the tests initially to fail a later WriteAt rather than the first one, on the same thought, and they did nothing...

It turns out (and I'd forgotten) that we buffer the whole batch in memory and only actually write to the file once per Append, so there is never more than one WriteAt, at least at the VFS/file level! See the flush method.
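
Roughly, the buffering behaviour being described looks like this (illustrative field names, not the real writer's):

```go
package sketch

import "os"

// Illustrative only: the whole batch is encoded into an in-memory buffer
// during Append, and flush issues exactly one WriteAt against the file
// followed by a single Sync, so there is only one file-level write per
// Append for a test to fail.
type bufferedWriter struct {
	f           *os.File
	batchBuf    []byte // frames encoded for the current Append call
	writeOffset int64  // where this batch starts in the segment file
}

func (w *bufferedWriter) flush() error {
	if _, err := w.f.WriteAt(w.batchBuf, w.writeOffset); err != nil {
		return err // the single WriteAt a test can make fail
	}
	return w.f.Sync() // the single fsync per Append
}
```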

@dhiaayachi: Yes, I understand that only one flush is done per batch. The test case that I'm referring to is:

1. Call Append, no error
2. Call Append, error
3. Call Append, no error

Verify that calls 1 and 3 are on disk and 2 is not. That would prevent a regression where we rewind a successful write.
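
Purely for illustration, that sequence could be sketched like this, with a toy in-memory writer standing in for the real segment writer (every name here is invented for the sketch; it is not the test added in this PR):

```go
package sketch

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

// failableWriter is a toy in-memory stand-in for the real segment writer,
// here only so the "good, bad, good" sequence below actually runs.
type failableWriter struct {
	log      [][]byte
	failNext bool
}

func (w *failableWriter) Append(entries [][]byte) error {
	before := len(w.log)
	w.log = append(w.log, entries...)
	if w.failNext {
		w.failNext = false
		w.log = w.log[:before] // roll back the in-memory state, as the fix does
		return errors.New("simulated write failure")
	}
	return nil
}

func batch(n int) [][]byte { return make([][]byte, n) }

// TestAppendRecoversAfterFailedBatch: a successful batch, a failed batch,
// then a successful retry; only the successful batches are visible.
func TestAppendRecoversAfterFailedBatch(t *testing.T) {
	w := &failableWriter{}

	require.NoError(t, w.Append(batch(5))) // 1: succeeds
	w.failNext = true
	require.Error(t, w.Append(batch(5)))   // 2: fails and is rolled back
	require.NoError(t, w.Append(batch(5))) // 3: retry succeeds

	// Only the entries from calls 1 and 3 remain.
	require.Len(t, w.log, 10)
}
```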

@banks (Member, Author): Maybe I'm still missing something, but I think that is what the non-empty variants test here: they first append a batch successfully and then fail the next batch.

I pulled that out of the table because I wanted to assert both versions for every other case, so it made more sense to just call t.Run twice per case. Is that what you mean?

@dhiaayachi: 🤦 I missed the double t.Run call at the end. Yes, that's exactly what I had in mind!


@dhiaayachi dhiaayachi left a comment


🚀 🚀 🚀

@banks banks merged commit 3aaaab8 into main Oct 16, 2023
1 check passed
@banks banks deleted the fix/write-fail branch October 16, 2023 12:58