-
Notifications
You must be signed in to change notification settings - Fork 617
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No automatic manager shutdown on demotion/removal #1829
Conversation
I think I understand why this wasn't working.
There is a comment in the code explaining that this is not done atomically and that can be a problem:
Basically, this change triggers the problem very often. The demoted manager will shut down as soon as it sees its role change to "worker". This is likely to be in between Reordering the calls would fix the problem in some cases, but create other problems. It wouldn't work when demoting the leader, because leader would remove itself from raft before it has a chance to update the node object. The most correct way I can think of to fix this is to reconcile the member list as the My only concern about this approach is that "observed role" ( |
3220c44
to
fa998a9
Compare
Current coverage is 54.60% (diff: 44.06%)@@ master #1829 diff @@
==========================================
Files 102 103 +1
Lines 17050 17150 +100
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
+ Hits 9336 9364 +28
- Misses 6580 6648 +68
- Partials 1134 1138 +4
|
I've added a commit that implements the role reconcilation I described above. Now CI is passing now. PTAL. I can split this PR into two if people want me to. |
fa998a9
to
a2c96b9
Compare
a2c96b9
to
a1cba5b
Compare
I made two more fixes that seem to help with lingering occasional integration test issues. First, I had to remove this code from // switch role to agent immediately to shutdown manager early
if role == ca.WorkerRole {
n.role = role
n.roleCond.Broadcast()
} In rare cases, this could cause problems with rapid promotion/demotion cycles. If a certificate renewal got kicked off before the demotion (say, by an earlier promotion), the role could flip from worker back to manager, until another renewal happens. This would cause the manager to start and join as a new member, which caused problems in the integration tests. I also added a commit that adds a This PR is split into 3 commits, and I can open separate PRs for them if necessary. This seems very solid now. I ran the integration tests 32 times without any failures. PTAL |
// updated after the Raft member list has been reconciled with the | ||
// desired role from the spec. Note that this doesn't show whether the | ||
// node has obtained a certificate that reflects its current role. | ||
NodeRole role = 9; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that your suggestion of reconciled_role
would be more clear here, since you're right that "observed" implies (to me at least) the role of the certificate the node is currently using.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My hesitation about this is that the worker is going to act on this field to trigger certificate renewals. "Reconciled role" has meaning to the manager, but not the worker. From the worker's perspective, "Get a new certificate if your role changes" makes sense, but "Get a new certificate if your reconciled role changes" doesn't make as much sense.
What about changing the field under Spec
to DesiredRole
and leaving this one as Role
? I didn't think of that before, but as long as the field number doesn't change, it's fine to rename protobuf fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That'd make sense too. :) Thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed the Spec
field to DesiredRole
, thanks.
@@ -394,33 +401,22 @@ func (m *Manager) Run(parent context.Context) error { | |||
if err != nil { | |||
errCh <- err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since errCh
doesn't seem to be used any more, should it be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since
errCh
doesn't seem to be used any more, should it be removed?
Fixed, thanks
a1cba5b
to
3af077c
Compare
Race revealed by CI fixed in #1846 |
// Role defines the role the node should have. | ||
NodeRole role = 2; | ||
// DesiredRole defines the role the node should have. | ||
NodeRole desired_role = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isn't this breaking api change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not in gRPC. If we rename the field in the JSON API, it would be. Not sure what we should do about that. We could either keep the current name in JSON, or back out this part of the change and keep Role
for gRPC as well. Not sure what's best.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, field number and type stay the same, so this won't break anything.
// Role is the *observed* role for this node. It differs from the | ||
// desired role set in Node.Spec.Role because the role here is only | ||
// updated after the Raft member list has been reconciled with the | ||
// desired role from the spec. Note that this doesn't show whether the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add that if the role is being used to tell whether or not an action may be performed, the certificate should be verified. This field is mostly informational.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add that if the role is being used to tell whether or not an action may be performed, the certificate should be verified. This field is mostly informational.
Thanks, added this.
LGTM |
167b493
to
f333714
Compare
This passed Docker integration tests. |
for _, node := range nodes { | ||
rm.reconcileRole(node) | ||
} | ||
if len(rm.pending) != 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point, is rm.pending
always empty? So ticker will never be set?
I'm also wondering if the nodes in nodes
should be added to pending? If any of them fail to be reconciled, they will not be reconciled again until that particular node is updated, or until some other updated node fails to be reconciled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, the nodes should indeed be added to pending
. Fixed.
tickerCh = ticker.C | ||
} | ||
case <-tickerCh: | ||
for _, node := range nodes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be iterating over rm.pending
? nodes
doesn't ever get updated, so will this be iterating over the list of nodes in existence when the watch first started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are correct. I've fixed this. Thanks for finding these problems.
7536335
to
cee1cc0
Compare
Since promoting/demoting nodes is now eventually consistent, does any logic in the control API need to change with regards to the quorum safeguard? In a 3-node cluster, if you demote 2 nodes in rapid succession, the demotions may both succeed because the raft membership has not changed yet. Not sure if this is a valid test, but it fails (I put it in
|
Thanks for bringing this up - it's an interesting issue and I want to make sure we handle that case right. I took a look, and I think the code in this PR is doing the right thing. The quorum safeguard in I think what we're doing here is a big improvement from the old code, which did have this issue because calls to the The reason your test is failing is that |
Also, in theory once it's possible to finish the demotion without losing quorum, |
But maybe you have a valid point that we shouldn't allow the |
I think my numbers are also wrong in the test. It also fails in master, so it's just an invalid test I think. Maybe 5, 1 down, 2 demotions? :) Also, yes you're right, the role manager isn't running, so that'd also be a problem. But yes, I was mainly wondering if there'd be problem if the control api says all is well, but the demotion can't actually happen (until something else comes back up). Apologies I wasn't clear - I wasn't trying to imply that this PR wasn't eventually doing the demotion. I don't think I have any other feedback other than the UI thing. :) It LGTM. |
Also regarding the delay in satisfying the demotion, I guess demotion takes a little time already, so I don't think this PR actually introduces any new issues w.r.t. immediate feedback, so 👍 for thinking about it later on |
6b60061
to
ce45628
Compare
@aaronlehmann LGTM, feel free to merge after rebase. |
When a node is demoted, two things need to happen: the node object itself needs to be updated, and the raft member list needs to be updated to remove that node. Previously, it was possible to get into a bad state where the node had been updated but not removed from the member list. This changes the approach so that controlapi only updates the node object, and there is a goroutine that watches for node changes and updates the member list accordingly. This means that demotion will work correctly even if there is a node failure or leader change in between the two steps. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
This is an attempt to fix the demotion process. Right now, the agent finds out about a role change, and independently, the raft node finds out it's no longer in the cluster, and shuts itself down. This causes the manager to also shut itself down. This is very error-prone and has led to a lot of problems. I believe there are corner cases that are not properly addressed. This changes things so that raft only signals to the higher level that the node has been removed. The manager supervision code will shut down the manager, and wait a certain amount of time for a role change (which should come through the agent reconnecting to a different manager now that the local manager is shut down). Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Shutting down the manager can deadlock if a component is waiting for store updates to go through. This Cancel method allows current and future proposals to be interrupted to avoid those deadlocks. Once everything using raft has shut down, the Stop method can be used to complete raft shutdown. Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
ce45628
to
be580ec
Compare
This is an attempt to fix the demotion process. Right now, the agent finds out about a role change, and independently, the raft node finds out it's no longer in the cluster, and shuts itself down. This causes the manager to also shut itself down. This is very error-prone and has led to a lot of problems. I believe there are corner cases that are not properly addressed.
This changes things so that raft only signals to the higher level that the node has been removed. The manager supervision code will shut down the manager, and wait a certain amount of time for a role change (which should come through the agent reconnecting to a different manager now that the local manager is shut down).
In addition, I had to fix a longstanding problem with demotion. Demoting a node involves updating both the node object itself and the raft member list. This can't be done atomically. If the leader shuts itself down when the object is updated but before the raft member list is updated, we end up in an inconsistent state. I fixed this by adding code that reconciles the raft member list against the node objects.
cc @LK4D4
Signed-off-by: Aaron Lehmann aaron.lehmann@docker.com