
Relaxes Autopilot promotion logic. #3623

Merged: 2 commits on Oct 31, 2017


Contributor

@slackpad slackpad commented Oct 26, 2017

When we defaulted the Raft protocol version to 3 in #3477, we made the numPeers() routine stricter so that it only counts voters (this is more conservative and more correct). This had the side effect of breaking rolling updates, because it is at odds with the Autopilot non-voter promotion logic.

That logic used to delay promotion so that it would only promote when doing so maintained an odd-sized quorum of servers. During a rolling update (add one new server, wait, and then kill an old server), the dead server cleanup would still count the old server as a peer, which is conservative and the right thing to do, but would no longer count the new non-voter. Promotion would keep waiting in turn, so you could get into a stalemate. It is safer to promote early than to remove early, so this change resolves the stalemate by promoting as soon as possible.

Fixes #3611
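
To make the difference between the two promotion policies concrete, here is a minimal sketch. It is a simplified model for illustration, not the code in this PR: the `server` type and the functions `promoteOddQuorum` and `promoteEarly` are hypothetical names, and the real Autopilot logic works against Raft configuration and server health data rather than a plain slice.

```go
package main

import "fmt"

// server is a hypothetical, simplified view of a Raft server for this sketch.
type server struct {
	ID     string
	Voter  bool
	Stable bool // true once the server has passed its stabilization period
}

// promoteOddQuorum models the old policy: stable non-voters are promoted
// only to the extent that the resulting number of voters stays odd.
func promoteOddQuorum(servers []server) []string {
	voters := 0
	var candidates []string
	for _, s := range servers {
		if s.Voter {
			voters++
		} else if s.Stable {
			candidates = append(candidates, s.ID)
		}
	}
	// Hold one candidate back if promoting all of them would give an even count.
	if len(candidates) > 0 && (voters+len(candidates))%2 == 0 {
		candidates = candidates[:len(candidates)-1]
	}
	return candidates
}

// promoteEarly models the relaxed policy: every stable non-voter is promoted.
func promoteEarly(servers []server) []string {
	var candidates []string
	for _, s := range servers {
		if !s.Voter && s.Stable {
			candidates = append(candidates, s.ID)
		}
	}
	return candidates
}

func main() {
	// Three existing voters plus one new server added during a rolling update.
	cluster := []server{
		{ID: "s1", Voter: true, Stable: true},
		{ID: "s2", Voter: true, Stable: true},
		{ID: "s3", Voter: true, Stable: true},
		{ID: "s4", Voter: false, Stable: true},
	}
	fmt.Println("odd-quorum policy promotes:", promoteOddQuorum(cluster))
	fmt.Println("relaxed policy promotes:", promoteEarly(cluster))
}
```

In this model, the odd-quorum policy holds `s4` back because promoting it would give four voters, while the relaxed policy promotes it as soon as it is stable, which is what lets the rolling update proceed.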


// fmtServer prints info about a server in a standard way for logging.
Contributor Author

Sorry I didn't split this out - made some small logging fixes while I was in here.

// be an even quorum.
if isVoter() {
	t.Fatalf("should not be a voter")
}
Contributor

Are you sure this test is validating the right thing? With the new implementation, isVoter should return true once the server stabilization period has passed, even before you kill one of the old nodes on line 205. I added a sleep before line 200 and this test failed at the isVoter call.

Contributor

I think this block can be removed entirely.

Contributor Author

I see what you mean - I was just trying to make sure we saw at least one time where it wasn't a voter and was stabilizing. I'll move up the check to make sure it gets promoted to voter so that's more explicit.

Contributor Author

Actually deleting it is better - we don't need to re-verify stabilization here.
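
For illustration only, the timing issue raised in this thread (a check that passes or fails depending on whether the stabilization period has elapsed) is usually avoided by polling with a timeout instead of asserting at one fixed moment. The sketch below assumes a hypothetical `waitFor` helper and a simulated `isVoter` closure; neither is taken from this PR or from Consul's test utilities.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor is a hypothetical helper: it polls cond every interval until cond
// returns true or the timeout elapses, and reports whether cond succeeded.
func waitFor(timeout, interval time.Duration, cond func() bool) bool {
	deadline := time.Now().Add(timeout)
	for {
		if cond() {
			return true
		}
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
}

func main() {
	// Stand-in for the test's isVoter check: the server counts as a voter
	// once a simulated stabilization period has passed.
	promotedAt := time.Now().Add(50 * time.Millisecond)
	isVoter := func() bool { return time.Now().After(promotedAt) }

	if !waitFor(time.Second, 10*time.Millisecond, isVoter) {
		fmt.Println("server was never promoted to voter")
		return
	}
	fmt.Println("server was promoted to voter")
}
```

A check written this way does not depend on whether the stabilization period has already elapsed when the assertion runs.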

@slackpad slackpad merged commit 6bf55d1 into master Oct 31, 2017
@slackpad slackpad deleted the relax-autopilot branch October 31, 2017 20:17