Sealed segment without IndexStart #31

Closed

rledisez opened this issue Sep 27, 2023 · 4 comments · Fixed by #33
Labels
bug Something isn't working

Comments

@rledisez

[INFO]  raft: initial configuration: index=7 servers="[{Suffrage:Voter ID:00000001-0001-0001-0001-000000000001 Address:192.168.121.27:4419} {Suffrage:Voter ID:00000003-0003-0003-0003-000000000003 Address:192.168.121.217:4419} {Suffrage:Voter ID:00000002-0002-0002-0002-000000000002 Address:192.168.121.112:4419}]"
[INFO]  raft: entering follower state: follower="Node at 00000001-0001-0001-0001-000000000001 [Follower]" leader-address= leader-id=
[WARN]  raft: heartbeat timeout reached, not part of a stable configuration or a non-voter or not allowed to candidate, not triggering a leader election
[DEBUG] raft: lost leadership because received a requestVote with a newer term
[WARN]  raft: rejecting vote request since our last index is greater: candidate=00000003-0003-0003-0003-000000000003 last-index=424 last-candidate-index=423
[WARN]  raft: clearing log suffix: from=424 to=424

# Process is restarted

[ERROR] raft: failed to get log: index=1 error="sealed segment has no index block"
panic: sealed segment has no index block
goroutine 1 [running]:
github.com/hashicorp/raft.NewRaft(0xc0003423c0, {0x32b3ec0?, 0xc000355f50}, {0x32b7b90?, 0xc000342460}, {0x32b5b48?, 0xc000342460}, {0x32b3470?, 0xc000355ec0}, {0x32bb698, ...})
[...]

In this trace, raft is clearing the last index (424) from the WAL. It calls (w *WAL) DeleteRange(), which calls (w *WAL) truncateTailLocked().

truncateTailLocked() seals the segment but does not set the IndexStart, which, from my understanding, is wrong. So at the next start, the segment is sealed without any IndexStart set, which triggers the panic in (r *Reader) findFrameOffset.
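
To illustrate the failure mode, here is a minimal, self-contained sketch using hypothetical names (not raft-wal's actual types or its real findFrameOffset): a reader treats a sealed segment as required to carry an index block, so a segment that was marked sealed without one panics on the next open.

```go
package main

// segmentMeta is a simplified stand-in for a segment's metadata;
// the field names are illustrative only.
type segmentMeta struct {
	Sealed     bool
	IndexStart uint64 // file offset of the index block; 0 means "never written"
}

// findFrameOffset mirrors the check described above: a sealed segment is
// expected to have an index block, so a zero IndexStart is fatal.
func findFrameOffset(m segmentMeta, idx uint64) uint64 {
	if m.Sealed {
		if m.IndexStart == 0 {
			panic("sealed segment has no index block")
		}
		// ...otherwise: read the index block at m.IndexStart and look up idx...
	}
	// ...for an unsealed tail segment: consult the in-memory offset table...
	return 0
}

func main() {
	// The state left behind by truncateTailLocked before the fix:
	// sealed in metadata, but no index block ever written.
	m := segmentMeta{Sealed: true, IndexStart: 0}
	findFrameOffset(m, 1) // panics, matching the trace above
}
```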

@banks
Member

banks commented Oct 16, 2023

@rledisez Thanks for reporting this and describing how to reproduce!

Sorry I only just saw this, I will work on reproducing that issue in a test.

Would you be able to share more context around what was happening in your cluster? Is this in Consul or another bit of software? Are you running in production or a testing environment? Thanks!

@banks banks added the bug Something isn't working label Oct 16, 2023
@banks
Member

banks commented Oct 16, 2023

@dhiaayachi and I looked at this and you're completely right - this is a case that fell through the cracks!

In typical operation it's the segment writer that decides when it's full and writes out the segment index; the WAL just appends to it until that happens, then updates the metadata with that index and creates a new segment.

When I designed tail truncation, I decided it was simpler to seal the tail segment (when we keep some but not all of its records) and create a new one, rather than try to reason about overwriting only the records we want to keep (which may have been committed in the same batch as records being truncated, etc.). This is still sensible, as it means we don't have to reason about that case in the storage format and crash recovery.

BUT I totally missed that just marking the segment as sealed in metadata isn't enough. The tests cover this case, but they use mocks at each layer, which has many advantages but meant we never caught the mismatch between the WAL "sealing" a segment in metadata and the segment writer only actually writing out an index when the segment is full.
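
To make that mismatch concrete, here's a minimal sketch (hypothetical names, not the real raft-wal code) of the two sealing paths: the normal path where the segment writer fills the segment and hands back an index offset, and the truncation path that, before the fix, only flipped the sealed flag in metadata.

```go
package main

import "fmt"

// segmentMeta is hypothetical metadata for one segment.
type segmentMeta struct {
	Sealed     bool
	IndexStart uint64 // 0 until an index block has been written to the file
}

type wal struct {
	segments []*segmentMeta
	tail     *segmentMeta
}

// sealFullTail models the normal path: the segment writer filled the segment,
// wrote its index block, and reported its offset, so the WAL records it and
// rotates to a fresh tail segment.
func (w *wal) sealFullTail(indexStart uint64) {
	w.tail.Sealed = true
	w.tail.IndexStart = indexStart
	w.newTail()
}

// truncateTail models the pre-fix truncation path: the tail is marked sealed
// in metadata even though the segment writer never wrote an index block,
// which is exactly the state that panics on the next open.
func (w *wal) truncateTail() {
	w.tail.Sealed = true // BUG: IndexStart is still 0
	w.newTail()
}

func (w *wal) newTail() {
	t := &segmentMeta{}
	w.segments = append(w.segments, t)
	w.tail = t
}

func main() {
	w := &wal{}
	w.newTail()
	w.sealFullTail(4096) // normal path: sealed with an index offset recorded
	w.truncateTail()     // truncation path: sealed, but IndexStart left at 0
	for i, s := range w.segments[:2] {
		fmt.Printf("segment %d: sealed=%v indexStart=%d\n", i, s.Sealed, s.IndexStart)
	}
}
```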

I propose to fix it with a ForceSeal method on the Writer interface that forces the segment to have its index written out. I'll be able to assert this is called correctly in the existing tests that exercise this case. But this also highlights that it would be valuable to have a similar set of "integration" tests that run the same scenarios end-to-end using the actual Segment/Meta/VFS layers with the WAL - that would add assurance that we aren't missing other cases like this.
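
As a rough sketch of what that could look like (the method names and signatures here are illustrative guesses, not the actual change in #33): the segment writer gains a way to write its index out early and report the offset, so tail truncation can record a real IndexStart instead of only flipping the sealed flag.

```go
package wal

// LogEntry is a placeholder for whatever entry type the writer appends.
type LogEntry struct {
	Index uint64
	Data  []byte
}

// SegmentWriter is an illustrative interface, not the library's real one.
type SegmentWriter interface {
	// Append writes entries to the open tail segment.
	Append(entries []LogEntry) error

	// Sealed reports whether the writer filled the segment and wrote its
	// index block, and if so at which file offset the index starts.
	Sealed() (sealed bool, indexStart uint64, err error)

	// ForceSeal writes the index block immediately, even though the segment
	// isn't full, and returns the index block's offset so the caller can
	// record it in the segment metadata.
	ForceSeal() (indexStart uint64, err error)
}
```

With something along these lines, truncateTailLocked could call ForceSeal and persist the returned offset as the segment's IndexStart rather than sealing in metadata alone.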

@banks banks mentioned this issue Oct 16, 2023
@rledisez
Author

Thanks for the detailed feedback. To give a bit of context, I'm working on software that uses hashicorp/raft. This issue appeared when I was running a test that kills the leader. It was sometimes (I guess) killed right before the log was propagated to the followers, so when it came back into the cluster, the new follower had a lesser index.

@banks
Member

banks commented Oct 16, 2023

Thanks @rledisez! I have a PR up for review that should fix it.

@banks banks closed this as completed in #33 Oct 20, 2023