-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bumps default Raft protocol to version 3. #3477
Conversation
r.Check(wantPeers(s2, 4)) | ||
r.Check(wantPeers(s3, 4)) | ||
r.Check(wantPeers(s4, 4)) | ||
r.Check(wantRaft([]*Server{s1, s2, s3, s4})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there more places in the tests where it is more correct to call wantRaft instead of wantPeers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I got them all but I'll make one more pass through and look over the remaining wantPeers
uses before I merge. Maybe we can get rid of wantPeers
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took a look through the rest and the other uses of wantPeers
are justified to actually look for voters, given what they are testing.
This should have been there before, but it's more obvious that this is incorrect now that we default the Raft protocol to 3, which puts new servers in a read-only state while Autopilot waits for them to become healthy.
Relaxed the check for a given number of voter peers and instead do a thorough check that all servers see each other in their Raft configurations.
These now just check for Raft replication to be set up, and don't care about the number of voter peers.
sockets when it sees io.EOF.
530a51f
to
a6f19c0
Compare
When we defaulted the Raft protocol version to 3 in #3477 we made the numPeers() routine more strict to only count voters (this is more conservative and more correct). This had the side effect of breaking rolling updates becuase it's at odds with the Autopilot non-voter promotion logic. That logic used to wait to only promote to maintain an odd quorum of servers. During a rolling update (add one new server, wait, and then kill an old server) the dead server cleanup would still count the old server as a peer, which is conservative and the right thing to do, and no longer count the non-voter. This would wait to promote, so you could get into a stalemate. It is safer to promote early than remove early, so by promoting as soon as possible we have chosen that as the solution here. Fixes #3611
When we defaulted the Raft protocol version to 3 in #3477 we made the numPeers() routine more strict to only count voters (this is more conservative and more correct). This had the side effect of breaking rolling updates because it's at odds with the Autopilot non-voter promotion logic. That logic used to wait to only promote to maintain an odd quorum of servers. During a rolling update (add one new server, wait, and then kill an old server) the dead server cleanup would still count the old server as a peer, which is conservative and the right thing to do, and no longer count the non-voter. This would wait to promote, so you could get into a stalemate. It is safer to promote early than remove early, so by promoting as soon as possible we have chosen that as the solution here. Fixes #3611
* Relaxes Autopilot promotion logic. When we defaulted the Raft protocol version to 3 in #3477 we made the numPeers() routine more strict to only count voters (this is more conservative and more correct). This had the side effect of breaking rolling updates because it's at odds with the Autopilot non-voter promotion logic. That logic used to wait to only promote to maintain an odd quorum of servers. During a rolling update (add one new server, wait, and then kill an old server) the dead server cleanup would still count the old server as a peer, which is conservative and the right thing to do, and no longer count the non-voter. This would wait to promote, so you could get into a stalemate. It is safer to promote early than remove early, so by promoting as soon as possible we have chosen that as the solution here. Fixes #3611 * Gets rid of unnecessary extra not-a-voter check.
Closes #3327.
There were a few changes and cleanups that were exposed by this change, so commits have been broken out for clarity.