Fix RAFT WAL repair. #2549
Conversation
When we stored a message in the raft layer at a wrong position (state corrupt), we would panic, leaving the message there.
On restart we would truncate the WAL and try to repair, but we truncated to the wrong index for the bad entry.
This change also includes additional fixes to truncateWAL and reduces the conditional for panic on storeMsg.

Signed-off-by: Derek Collison <derek@nats.io>

/cc @nats-io/core
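To make the repair direction concrete, here is a minimal, self-contained Go sketch of the idea described above. This is not the nats-server code; the `wal`, `loadEntry`, `truncateTo`, and `repair` names are illustrative stand-ins for the raft/WAL layer. The point is that the log has to be truncated to the entry *before* the bad index, not to the bad index itself, otherwise the corrupt entry survives the restart.

```go
// Hypothetical sketch of the repair idea; names and types are illustrative,
// not the actual nats-server implementation.
package main

import (
	"errors"
	"fmt"
)

// wal is a stand-in for the raft write-ahead log.
type wal struct {
	entries map[uint64][]byte
	first   uint64
	last    uint64
}

var errCorrupt = errors.New("corrupt entry")

func (w *wal) loadEntry(index uint64) ([]byte, error) {
	e, ok := w.entries[index]
	if !ok || e == nil {
		return nil, errCorrupt
	}
	return e, nil
}

// truncateTo drops every entry after index, making index the new last entry.
func (w *wal) truncateTo(index uint64) {
	for i := index + 1; i <= w.last; i++ {
		delete(w.entries, i)
	}
	w.last = index
}

// repair walks the log and, on the first bad entry, truncates back to the
// entry before it (the last good one), which is the behavior this PR restores.
func (w *wal) repair() {
	for index := w.first; index <= w.last; index++ {
		if _, err := w.loadEntry(index); err != nil {
			fmt.Printf("could not load %d from WAL: %v\n", index, err)
			w.truncateTo(index - 1) // last good entry, not the bad one
			return
		}
	}
}

func main() {
	w := &wal{
		entries: map[uint64][]byte{1: {0x1}, 2: {0x2}, 3: nil, 4: {0x4}},
		first:   1,
		last:    4,
	}
	w.repair()
	fmt.Println("last index after repair:", w.last) // prints 2
}
```

Truncating to the bad index itself would leave the corrupt entry as the new tail of the log, which is the failure mode this change fixes.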
LGTM
@@ -418,12 +420,17 @@ func (s *Server) startRaftNode(cfg *RaftConfig) (RaftNode, error) {
 	for index := state.FirstSeq; index <= state.LastSeq; index++ {
 		ae, err := n.loadEntry(index)
 		if err != nil {
-			n.warn("Could not load %d from WAL [%+v] with error: %v", index, state, err)
-			continue
+			n.warn("Could not load %d from WAL [%+v]: %v", index, state, err)
These are pretty scary warnings to users; can we add some indication of what they can do about them, or mention whether the server will recover on its own?
Seeing just these in the server log with no indication of what action to take is not a good experience. Even for myself, I have no idea what to do about these log lines.
This is especially true for the panic later.
TBH at this point we do not know why, aside from the error, so that is what we report. If the system keeps working you will ignore it, but if not you will report what the log line says to us, which may help.
In this case we do not continue processing the WAL on startup. If there are others that are more up to date, they will get elected and catch us up.
We used to just skip over it, which was wrong in hindsight.
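To illustrate why stopping is safer than skipping, here is a tiny hypothetical Go example (not nats-server code): raft assumes a contiguous log, so applying entries around a hole silently diverges state, whereas stopping at the hole leaves a clean prefix that a more up-to-date peer can extend.

```go
// Illustrative only: contrasts "skip the bad entry" with "stop at the bad entry".
package main

import "fmt"

func main() {
	entries := map[uint64]string{1: "a", 2: "b", 4: "d"} // index 3 is corrupt/missing

	// Old behavior: skip the bad index and keep applying.
	applied := []string{}
	for i := uint64(1); i <= 4; i++ {
		e, ok := entries[i]
		if !ok {
			continue // hole in the log goes unnoticed
		}
		applied = append(applied, e)
	}
	fmt.Println("skip-on-error applies:", applied) // [a b d]: state silently diverged

	// New behavior: stop at the first bad index; everything after it is
	// re-replicated from a peer whose log is intact.
	applied = applied[:0]
	for i := uint64(1); i <= 4; i++ {
		e, ok := entries[i]
		if !ok {
			break
		}
		applied = append(applied, e)
	}
	fmt.Println("stop-on-error applies:", applied) // [a b]: a peer catches us up from index 3
}
```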
OK, but this log basically says "ask Derek" :) If that's the intended outcome, LGTM :)
If we were leader, step down as well.

Signed-off-by: Derek Collison <derek@nats.io>
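A hedged sketch of what this commit describes, with made-up names (`node`, `stepDown`, `onWALCorruption`) rather than the actual nats-server API: on detecting WAL corruption while leading, step down first so that a replica with an intact log can be elected and drive the group forward.

```go
// Hypothetical illustration only; not the nats-server raft code.
package main

import "fmt"

type node struct {
	leader bool
}

func (n *node) stepDown() {
	n.leader = false
	fmt.Println("stepping down as leader")
}

// onWALCorruption shows what a corruption handler might do per this change.
func (n *node) onWALCorruption(index uint64) {
	fmt.Printf("WAL corruption detected at index %d\n", index)
	if n.leader {
		n.stepDown() // do not keep leading from a damaged log
	}
	// ...then truncate/repair the local WAL and wait to be caught up by a peer.
}

func main() {
	n := &node{leader: true}
	n.onWALCorruption(42)
}
```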
LGTM
LGTM
Signed-off-by: Derek Collison <derek@nats.io>
LGTM