Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix dead server removal condition to use correct failure tolerance #4017

Merged
merged 6 commits into from
Dec 16, 2019

Conversation

preetapan
Copy link
Contributor

The dead server removal code in autopilot was more conservative than necessary. This PR changes it to use the correct failure tolerance so that as long as there is quorum, failed servers are cleaned up correctly.

This is an example scenario where the current logic failed to remove dead servers:

  • Add two new replacement servers to a cluster of 3 servers for an upgrade.
  • Stop two old servers

@pearkes pearkes added this to the 1.0.7 milestone Apr 9, 2018
@pearkes pearkes removed this from the 1.0.7 milestone Apr 11, 2018
Copy link
Contributor

@pierresouchay pierresouchay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We went thru this during an outage very recently in this exact scenario

@banks @mkeeler Possible to merge this?

@dadgar dadgar removed their request for review August 29, 2019 17:56
@hanshasselberg hanshasselberg force-pushed the b-autopilot-cleanupdeadservers-quorom-bugfix branch from 18e0726 to aef8a8f Compare December 10, 2019 11:12
@hanshasselberg
Copy link
Member

hanshasselberg commented Dec 10, 2019

I rebased this PR and fixed the tests. This PR changes how many dead servers autopilot can delete which depends on the number of raft peers (which is the number of raft voters, possibly including itself).

@banks
Copy link
Member

banks commented Dec 10, 2019

@i0rek your table seems dangerous to me - if there are 3 peers, allowing removal of 2 of them breaks quorum and shouldn't really be allowed.

Also since this PR we discussed adding a way to configure a minimum quorum size since AutoPilot will happily reduce the quorum size currently. I forget if that was WIP or merged (I think it was in the last release though). I also see MinQuorum in the diff now which I presume was from that change.

How does that interact with this change now? It would be handy to see in the table the MinQuorum setting as well as the actual number of peers so we can work out what is safe/expected?

@banks
Copy link
Member

banks commented Dec 10, 2019

Ah yeah the PR for the MinQuorum fix was #6654 and AFAICT it actually implemented the same change this PR intended to as well already since we moved to allowing any number of servers to be removed provided there was still a minQuorum size left up.

I think this PR should be closed since the one above already implemented what was intended here. The changes I see seem to take it too far and allow unsafe removal of more servers than necessary to leave you with the configured min quorum size?

@banks
Copy link
Member

banks commented Dec 10, 2019

Oh wait that PR didn't change - we always attempted to remove only a minority we just had a bug in the integer rounding.

Old code was just removalCount < peers/2 as the condition to go ahead. An example where that is wrong as Preetha provided was 3 servers intended (quorum or 2) but another 2 have been started as replacements and then old 2 removed. That should be safe to clean both since there will still be 3 remaining however while 5/2 is 2.5 which is great than 2, due to integer truncation 5/2 is 2 and 2 > 2 is false.

So I see the value in the fix for that, but your table still puzzles me a bit. Here is my version:

Peers MinQuorum Dead removalCount <= (peers-1)/2 Removal Allowed
1 1 1 1 <= 0 is false
3 3 1 1 <= 1 is true ❌ because of MinQuorum check
4 3 3 3 <= 1 is true ❌ because of MinQuorum check
5 3 2 2 <= 2 is true ✅(Preetha's example)

So I think I've convinced myself the new logic is OK when combined with the MinQuorum check.

I also wondered why we make this a binary decision?

If there are 3 dead servers and we can only safely remove 2, why don't we just remove 2 of them instead of refusing to remove any? I think the answer is that we can't know which two are best to remove in this case - consider the upgrade case where there are meant to be 3 servers but two new ones are started to replace two old. Now if one of the healthy new servers happens to fail just at the same time the two old servers are taken down, autopilot could choose to delete one new server and one old and be left in a broken state.

So overall I think this is correct behaviour now! The key for me to understanding this PR was noting that it's about correcting an integer division bug not changing the intent of what Consul does.

@hanshasselberg
Copy link
Member

Thank you for reviewing @banks! My table doesn't have the MinQuorum, which I agree is confusing. I didn't think of adding it.

So overall I think this is correct behaviour now! The key for me to understanding this PR was noting that it's about correcting an integer division bug not changing the intent of what Consul does.

Thank you for your explanation! I was not sure myself if this PR is correct, but since we had it sitting for so long and it was already approved I went ahead and prepared it for merging. Thinking that while working on it, things would become more clear :).

@hanshasselberg
Copy link
Member

hanshasselberg commented Dec 12, 2019

I thought about this again and my table bothered me too much. I extracted a function which decides if autopilot should move on removing servers or not. The output of the tests is the following:

{peers:1 minQuorum:1 deadServers:1 ok:false}: denied, because removing 1/1 servers would leave less then minimal allowed quorum of 1 servers
{peers:3 minQuorum:3 deadServers:1 ok:false}: denied, because removing 1/3 servers would leave less then minimal allowed quorum of 3 servers
{peers:4 minQuorum:3 deadServers:3 ok:false}: denied, because removing 3/4 servers would leave less then minimal allowed quorum of 3 servers
{peers:5 minQuorum:3 deadServers:3 ok:false}: denied, because removing 3/5 servers would leave less then minimal allowed quorum of 3 servers
{peers:5 minQuorum:3 deadServers:2 ok:true}: allowed, because removing 2/5 servers leaves a majority of servers above the minimal allowed quorum 3
{peers:5 minQuorum:3 deadServers:1 ok:true}: allowed, because removing 1/5 servers leaves a majority of servers above the minimal allowed quorum 3
{peers:9 minQuorum:3 deadServers:5 ok:false}: denied, because removing the majority of servers 5/9 is not safe

@hanshasselberg hanshasselberg force-pushed the b-autopilot-cleanupdeadservers-quorom-bugfix branch from ca83496 to 86220e8 Compare December 12, 2019 13:12
@hanshasselberg hanshasselberg self-assigned this Dec 12, 2019
@codecov
Copy link

codecov bot commented Dec 16, 2019

Codecov Report

Merging #4017 into master will increase coverage by 0.16%.
The diff coverage is 21.42%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4017      +/-   ##
==========================================
+ Coverage   65.65%   65.81%   +0.16%     
==========================================
  Files         443      439       -4     
  Lines       53292    52746     -546     
==========================================
- Hits        34987    34714     -273     
+ Misses      14085    13869     -216     
+ Partials     4220     4163      -57
Impacted Files Coverage Δ
agent/consul/autopilot/autopilot.go 8.86% <21.42%> (+2%) ⬆️
agent/http_oss.go 38.88% <0%> (-11.12%) ⬇️
agent/consul/state/session_oss.go 68.88% <0%> (-6.5%) ⬇️
agent/consul/raft_rpc.go 78.72% <0%> (-6.39%) ⬇️
agent/session_endpoint.go 59.48% <0%> (-6.36%) ⬇️
api/session.go 71.42% <0%> (-4.44%) ⬇️
agent/consul/session_ttl.go 85.48% <0%> (-3.23%) ⬇️
api/lock.go 80.72% <0%> (-2.41%) ⬇️
agent/structs/service_definition.go 37.93% <0%> (-1.06%) ⬇️
agent/acl.go 88.73% <0%> (-0.95%) ⬇️
... and 54 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7ac47d6...775f6ac. Read the comment docs.

@hanshasselberg hanshasselberg merged commit c47dbff into master Dec 16, 2019
@hanshasselberg hanshasselberg deleted the b-autopilot-cleanupdeadservers-quorom-bugfix branch December 16, 2019 22:35
@ghost
Copy link

ghost commented Jan 25, 2020

Hey there,

This issue has been automatically locked because it is closed and there hasn't been any activity for at least 30 days.

If you are still experiencing problems, or still have questions, feel free to open a new one 👍.

@ghost ghost locked and limited conversation to collaborators Jan 25, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants