Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix RAFT WAL repair. #2549

Merged
merged 3 commits into from
Sep 21, 2021
Merged

Fix RAFT WAL repair. #2549

merged 3 commits into from
Sep 21, 2021

Conversation

derekcollison
Copy link
Member

When we stored a message in the raft layer in a wrong position (state corrupt), we would panic, leaving the message there.
On restart we would truncate the WAL and try to repair, but we truncated to the wrong index of the bad entry.

This change also includes additional changes to truncateWAL and also reduces the conditional for panic on storeMsg.

Signed-off-by: Derek Collison derek@nats.io

/cc @nats-io/core

When we stored a message in the raft layer in a wrong position (state corrupt), we would panic, leaving the message there.
On restart we would truncate the WAL and try to repair, but we truncated to the wrong index of the bad entry.

This change also includes additional changes to truncateWAL and also reduces the conditional for panic on storeMsg.

Signed-off-by: Derek Collison <derek@nats.io>
Copy link
Member

@wallyqs wallyqs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -418,12 +420,17 @@ func (s *Server) startRaftNode(cfg *RaftConfig) (RaftNode, error) {
for index := state.FirstSeq; index <= state.LastSeq; index++ {
ae, err := n.loadEntry(index)
if err != nil {
n.warn("Could not load %d from WAL [%+v] with error: %v", index, state, err)
continue
n.warn("Could not load %d from WAL [%+v]: %v", index, state, err)
Copy link
Contributor

@ripienaar ripienaar Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are pretty scary warnings to users, can we add some indication about what they can do about them? Or mention if the server will recover on its own?

Seeing just these in the server log and no indication of what action to take is not a good experience. For myself even, I have no idea what to do about these log lines.

Especially true for the panic later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH at this point we do not know why sans the error, so that is what we report. If system keeps working you will ignore but if not will report to us what the log line says which may help us.

In this case we do not continue processing the WAL on startup. If there are others more up to date they will get elected and catch us up.

We used to just skip over which was wrong in hindsight.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but like this log says "ask Derek" :) if that's the intended outcome, LGTM :)

If we were leader stepdown as well.

Signed-off-by: Derek Collison <derek@nats.io>
Copy link
Member

@wallyqs wallyqs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Signed-off-by: Derek Collison <derek@nats.io>
Copy link
Member

@kozlovic kozlovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@derekcollison derekcollison merged commit 0dd4e9f into main Sep 21, 2021
@derekcollison derekcollison deleted the raft-panic branch September 21, 2021 15:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants