Fix RAFT WAL repair. #2549
Conversation
When we stored a message in the raft layer at a wrong position (state corrupt), we would panic, leaving the message there.
On restart we would truncate the WAL and try to repair, but we truncated to the wrong index for the bad entry.
This change also includes additional fixes to truncateWAL and reduces the conditional for panic on storeMsg.

Signed-off-by: Derek Collison <derek@nats.io>

/cc @nats-io/core
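To make the repair direction concrete, here is a minimal, self-contained Go sketch of the idea described above. This is not the nats-server code; the `wal`, `loadEntry`, `truncateTo`, and `repair` names are illustrative stand-ins for the raft/WAL layer. The point is that the log has to be truncated to the entry *before* the bad index, not to the bad index itself, otherwise the corrupt entry survives the restart.

```go
// Hypothetical sketch of the repair idea; names and types are illustrative,
// not the actual nats-server implementation.
package main

import (
	"errors"
	"fmt"
)

// wal is a stand-in for the raft write-ahead log.
type wal struct {
	entries map[uint64][]byte
	first   uint64
	last    uint64
}

var errCorrupt = errors.New("corrupt entry")

func (w *wal) loadEntry(index uint64) ([]byte, error) {
	e, ok := w.entries[index]
	if !ok || e == nil {
		return nil, errCorrupt
	}
	return e, nil
}

// truncateTo drops every entry after index, making index the new last entry.
func (w *wal) truncateTo(index uint64) {
	for i := index + 1; i <= w.last; i++ {
		delete(w.entries, i)
	}
	w.last = index
}

// repair walks the log and, on the first bad entry, truncates back to the
// entry before it (the last good one), which is the behavior this PR restores.
func (w *wal) repair() {
	for index := w.first; index <= w.last; index++ {
		if _, err := w.loadEntry(index); err != nil {
			fmt.Printf("could not load %d from WAL: %v\n", index, err)
			w.truncateTo(index - 1) // last good entry, not the bad one
			return
		}
	}
}

func main() {
	w := &wal{
		entries: map[uint64][]byte{1: {0x1}, 2: {0x2}, 3: nil, 4: {0x4}},
		first:   1,
		last:    4,
	}
	w.repair()
	fmt.Println("last index after repair:", w.last) // prints 2
}
```

Truncating to the bad index itself would leave the corrupt entry as the new tail of the log, which is the failure mode this change fixes.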
LGTM
@@ -418,12 +420,17 @@ func (s *Server) startRaftNode(cfg *RaftConfig) (RaftNode, error) {
 	for index := state.FirstSeq; index <= state.LastSeq; index++ {
 		ae, err := n.loadEntry(index)
 		if err != nil {
-			n.warn("Could not load %d from WAL [%+v] with error: %v", index, state, err)
-			continue
+			n.warn("Could not load %d from WAL [%+v]: %v", index, state, err)
These are pretty scary warnings to users; can we add some indication of what they can do about them, or mention whether the server will recover on its own?
Seeing just these in the server log with no indication of what action to take is not a good experience. Even for myself, I have no idea what to do about these log lines.
This is especially true for the panic later.
TBH at this point we do not know why, aside from the error, so that is what we report. If the system keeps working you will ignore it, but if not you will report what the log line says to us, which may help.
In this case we do not continue processing the WAL on startup. If there are others that are more up to date, they will get elected and catch us up.
We used to just skip over it, which was wrong in hindsight.
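To illustrate why stopping is safer than skipping, here is a tiny hypothetical Go example (not nats-server code): raft assumes a contiguous log, so applying entries around a hole silently diverges state, whereas stopping at the hole leaves a clean prefix that a more up-to-date peer can extend.

```go
// Illustrative only: contrasts "skip the bad entry" with "stop at the bad entry".
package main

import "fmt"

func main() {
	entries := map[uint64]string{1: "a", 2: "b", 4: "d"} // index 3 is corrupt/missing

	// Old behavior: skip the bad index and keep applying.
	applied := []string{}
	for i := uint64(1); i <= 4; i++ {
		e, ok := entries[i]
		if !ok {
			continue // hole in the log goes unnoticed
		}
		applied = append(applied, e)
	}
	fmt.Println("skip-on-error applies:", applied) // [a b d]: state silently diverged

	// New behavior: stop at the first bad index; everything after it is
	// re-replicated from a peer whose log is intact.
	applied = applied[:0]
	for i := uint64(1); i <= 4; i++ {
		e, ok := entries[i]
		if !ok {
			break
		}
		applied = append(applied, e)
	}
	fmt.Println("stop-on-error applies:", applied) // [a b]: a peer catches us up from index 3
}
```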
OK, but this log basically says "ask Derek" :) If that's the intended outcome, LGTM :)
If we were leader, step down as well.

Signed-off-by: Derek Collison <derek@nats.io>
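A hedged sketch of what this commit describes, with made-up names (`node`, `stepDown`, `onWALCorruption`) rather than the actual nats-server API: on detecting WAL corruption while leading, step down first so that a replica with an intact log can be elected and drive the group forward.

```go
// Hypothetical illustration only; not the nats-server raft code.
package main

import "fmt"

type node struct {
	leader bool
}

func (n *node) stepDown() {
	n.leader = false
	fmt.Println("stepping down as leader")
}

// onWALCorruption shows what a corruption handler might do per this change.
func (n *node) onWALCorruption(index uint64) {
	fmt.Printf("WAL corruption detected at index %d\n", index)
	if n.leader {
		n.stepDown() // do not keep leading from a damaged log
	}
	// ...then truncate/repair the local WAL and wait to be caught up by a peer.
}

func main() {
	n := &node{leader: true}
	n.onWALCorruption(42)
}
```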
LGTM
LGTM
Signed-off-by: Derek Collison <derek@nats.io>
LGTM