Fix dead server removal condition to use correct failure tolerance #4017
Force-pushed from 18e0726 to aef8a8f.
I rebased this PR and fixed the tests. This PR changes how many dead servers autopilot can delete, which depends on the number of raft peers (that is, the number of raft voters, possibly including the server itself).
@i0rek your table seems dangerous to me - if there are 3 peers, allowing removal of 2 of them breaks quorum and shouldn't really be allowed. Also, since this PR was opened we discussed adding a way to configure a minimum quorum size, since AutoPilot will currently happily reduce the quorum size. I forget if that was WIP or merged (I think it was in the last release though). I also see [...]. How does that interact with this change now? It would be handy to see the MinQuorum setting in the table as well as the actual number of peers, so we can work out what is safe/expected.
Ah yeah, the PR for the MinQuorum fix was #6654, and AFAICT it already implemented the same change this PR intended, since we moved to allowing any number of servers to be removed provided a minQuorum-sized set of servers was still left up. I think this PR should be closed since the one above already implemented what was intended here. The changes I see seem to take it too far and allow unsafe removal of more servers than necessary to leave you with the configured min quorum size?
Oh wait, that PR didn't change that - we always attempted to remove only a minority, we just had a bug in the integer rounding. Old code was just [...]. So I see the value in the fix for that, but your table still puzzles me a bit. Here is my version:
So I think I've convinced myself the new logic is OK when combined with the MinQuorum check.

I also wondered why we make this a binary decision. If there are 3 dead servers and we can only safely remove 2, why don't we just remove 2 of them instead of refusing to remove any? I think the answer is that we can't know which two are best to remove in this case. Consider the upgrade case where there are meant to be 3 servers but two new ones are started to replace two old ones. If one of the healthy new servers happens to fail at the same time the two old servers are taken down, autopilot could choose to delete one new server and one old one and be left in a broken state.

So overall I think this is correct behaviour now! The key for me to understanding this PR was noting that it's about correcting an integer division bug, not changing the intent of what Consul does.
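To make the all-or-nothing decision described above concrete, here is a minimal sketch of such a gate. The function name `canRemove` and its parameters are hypothetical and chosen for illustration; this is not Consul's actual code. It assumes removal is allowed only if the dead servers are within the failure tolerance of the current peer set and the remaining peers still meet the configured minimum quorum.

```go
package main

import "fmt"

// canRemove is a hypothetical helper (not Consul's actual function).
// It answers an all-or-nothing question: may ALL dead servers be removed?
// Assumptions: peers counts raft voters (including dead ones), minQuorum is
// the configured minimum number of voters, dead is the number of dead voters.
func canRemove(peers, minQuorum, dead int) bool {
	// Never shrink the cluster below the configured minimum quorum size.
	if peers-dead < minQuorum {
		return false
	}
	// With n = 2F+1 voters the cluster tolerates F failures,
	// so at most (n-1)/2 servers may be removed at once.
	if dead > (peers-1)/2 {
		return false
	}
	return true
}

func main() {
	fmt.Println(canRemove(3, 0, 1)) // one dead of three: within tolerance
	fmt.Println(canRemove(3, 0, 2)) // two dead of three: majority lost
	fmt.Println(canRemove(5, 4, 2)) // within tolerance, but below minQuorum
}
```

Note that the sketch never returns a partial answer such as "remove 2 of 3": as argued above, the caller cannot know which subset of dead servers is safe to pick.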
Thank you for reviewing @banks! My table doesn't have the MinQuorum, which I agree is confusing. I didn't think of adding it.
Thank you for your explanation! I was not sure myself if this PR is correct, but since it had been sitting for so long and was already approved, I went ahead and prepared it for merging, thinking that things would become clearer while working on it :).
I thought about this again and my table bothered me too much. I extracted a function which decides whether autopilot should go ahead and remove servers or not. The output of the tests is the following:
Force-pushed from ca83496 to 86220e8.
Codecov Report
@@            Coverage Diff             @@
##           master    #4017      +/-   ##
==========================================
+ Coverage   65.65%   65.81%   +0.16%
==========================================
  Files         443      439       -4
  Lines       53292    52746     -546
==========================================
- Hits        34987    34714     -273
+ Misses      14085    13869     -216
+ Partials     4220     4163      -57
The dead server removal code in autopilot was more conservative than necessary. This PR changes it to use the correct failure tolerance so that as long as there is quorum, failed servers are cleaned up correctly.
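As a rough illustration of what "correct failure tolerance" means here: a raft cluster of n = 2F+1 voters tolerates F failures, so F = (n-1)/2 under integer division. The sketch below compares that with a stricter bound, "strictly fewer than n/2", which is only an assumed stand-in for an overly conservative check and is not quoted from the Consul source. Both function names are hypothetical.

```go
package main

import "fmt"

// failureTolerance returns how many of n voters can fail while a majority
// of the original n remains: floor((n-1)/2).
func failureTolerance(peers int) int {
	if peers < 1 {
		return 0
	}
	return (peers - 1) / 2
}

// conservativeLimit is a hypothetical, overly strict bound used only for
// comparison: the largest count strictly below peers/2 (integer division).
func conservativeLimit(peers int) int {
	limit := peers/2 - 1
	if limit < 0 {
		return 0
	}
	return limit
}

func main() {
	for peers := 1; peers <= 7; peers++ {
		fmt.Printf("peers=%d tolerance=%d conservative=%d\n",
			peers, failureTolerance(peers), conservativeLimit(peers))
	}
}
```

Under these assumptions the two bounds agree for even cluster sizes and differ by one for odd sizes of 3 or more; for example, with 3 peers the tolerance is 1 while the stricter bound allows no removals.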
This is an example scenario where the current logic failed to remove dead servers: