No automatic manager shutdown on demotion/removal #1829

Merged
merged 5 commits into from
Jan 7, 2017

aaronlehmann
Collaborator

@aaronlehmann aaronlehmann commented Dec 22, 2016

This is an attempt to fix the demotion process. Right now, the agent finds out about a role change, and independently, the raft node finds out it's no longer in the cluster, and shuts itself down. This causes the manager to also shut itself down. This is very error-prone and has led to a lot of problems. I believe there are corner cases that are not properly addressed.

This changes things so that raft only signals to the higher level that the node has been removed. The manager supervision code will shut down the manager, and wait a certain amount of time for a role change (which should come through the agent reconnecting to a different manager now that the local manager is shut down).
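The supervision flow described above could be sketched roughly like this. It is a minimal illustration, not the code in this PR: `superviseManager`, `errRoleChangeTimeout`, and the channel names are hypothetical, and the real manager and agent are replaced with plain channels and a callback.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRoleChangeTimeout is a hypothetical error used only in this sketch.
var errRoleChangeTimeout = errors.New("timed out waiting for role change")

// superviseManager waits for raft to report that the node was removed,
// stops the manager, and then waits up to timeout for a role change.
func superviseManager(removed <-chan struct{}, roleChanged <-chan string, stop func(), timeout time.Duration) (string, error) {
	<-removed // raft only signals removal; it does not shut anything down
	stop()    // the supervisor owns the shutdown decision

	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case role := <-roleChanged:
		// The agent reconnected to another manager and learned its new role.
		return role, nil
	case <-timer.C:
		return "", errRoleChangeTimeout
	}
}

func main() {
	removed := make(chan struct{})
	roleChanged := make(chan string, 1)
	stopped := false

	close(removed)          // raft: "this node was removed"
	roleChanged <- "worker" // agent: role update from a different manager

	role, err := superviseManager(removed, roleChanged, func() { stopped = true }, time.Second)
	fmt.Println(role, err, stopped)
}
```

The point of the shape is that raft never calls `stop` itself; it only closes `removed`, and everything after that is decided in one place.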

In addition, I had to fix a longstanding problem with demotion. Demoting a node involves updating both the node object itself and the raft member list. This can't be done atomically. If the leader shuts itself down when the object is updated but before the raft member list is updated, we end up in an inconsistent state. I fixed this by adding code that reconciles the raft member list against the node objects.

cc @LK4D4

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>

@aaronlehmann
Collaborator Author

I think I understand why this wasn't working. controlapi's UpdateNode handler demotes a node in two steps:

  • Update the node object in the store (store.UpdateNode)
  • Remove the raft member (raft.RemoveMember)

There is a comment in the code explaining that this is not done atomically and that can be a problem:

// TODO(abronan): the remove can potentially fail and leave the node with
// an incorrect role (worker rather than manager), we need to reconcile the
// memberlist with the desired state rather than attempting to remove the
// member once.

Basically, this change triggers the problem very often. The demoted manager will shut down as soon as it sees its role change to "worker". This is likely to be in between store.UpdateNode and raft.RemoveMember. This can cause quorum to be lost, because the quorum doesn't change to reflect the demotion until RemoveMember succeeds.

Reordering the calls would fix the problem in some cases, but create other problems. It wouldn't work when demoting the leader, because the leader would remove itself from raft before it has a chance to update the node object.

The most correct way I can think of to fix this is to reconcile the member list as the TODO suggests. node.Spec.Role would control whether the node should be part of the raft member list or not. The raft member list would be reconciled against this. Then we could have node.Role, an observed property derived from the raft member list. Certificate renewals would be triggered based on changes to the observed role, not the desired role (and the CA would issue certificates based on observed role). By the time the observed role changes, the member list has been updated, and it's safe to do things like shut down the manager.

My only concern about this approach is that "observed role" (node.Role) could be confusing. In the paragraph above, observed role is based on whether the raft member list has been reconciled. It doesn't show whether the node has renewed its certificate to one that reflects the updated role. But I think this is still worth trying. If necessary, we can call the field something more specific like node.ReconciledRole.
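A rough sketch of the reconciliation idea described above, under simplifying assumptions: `Node`, `reconcileRole`, and the `remove` callback are stand-ins invented for illustration (the member list is just a map), and promotion is assumed to need no member-list change up front. The key property is that the observed role only flips to worker after the member has actually been removed.

```go
package main

import (
	"errors"
	"fmt"
)

type Role int

const (
	Worker Role = iota
	Manager
)

// Node is a simplified stand-in for the real node object.
type Node struct {
	ID          string
	DesiredRole Role // what the operator asked for (the spec)
	Role        Role // observed role, updated only after reconciliation
}

// errWouldLoseQuorum is a hypothetical failure used in the example.
var errWouldLoseQuorum = errors.New("removal would lose quorum")

// reconcileRole returns true once the node needs no further work.
func reconcileRole(n *Node, members map[string]bool, remove func(id string) error) bool {
	if n.Role == n.DesiredRole {
		return true
	}
	if n.DesiredRole == Manager {
		// Assumption for this sketch: promotion just updates the observed role.
		n.Role = Manager
		return true
	}
	// Demotion: remove the raft member first.
	if members[n.ID] {
		if err := remove(n.ID); err != nil {
			return false // observed role untouched; retry later
		}
		delete(members, n.ID)
	}
	n.Role = Worker // now safe to renew certificates and shut down the manager
	return true
}

func main() {
	members := map[string]bool{"n1": true, "n2": true, "n3": true}
	n := &Node{ID: "n2", DesiredRole: Worker, Role: Manager}

	// First attempt fails, so the node still looks like a manager.
	ok := reconcileRole(n, members, func(string) error { return errWouldLoseQuorum })
	fmt.Println(ok, n.Role == Manager)

	// The retry succeeds, and only then does the observed role change.
	ok = reconcileRole(n, members, func(string) error { return nil })
	fmt.Println(ok, n.Role == Worker, members["n2"])
}
```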

@codecov-io

codecov-io commented Jan 4, 2017

Current coverage is 54.60% (diff: 44.06%)

Merging #1829 into master will decrease coverage by 0.15%

@@             master      #1829   diff @@
==========================================
  Files           102        103     +1   
  Lines         17050      17150   +100   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           9336       9364    +28   
- Misses         6580       6648    +68   
- Partials       1134       1138     +4   

Powered by Codecov. Last update 42baa0d...be580ec

@aaronlehmann
Collaborator Author

I've added a commit that implements the role reconciliation I described above. Now Node has a Role field outside of spec as well, which reflects the actual role that the CA will respect. On demotion, this field isn't updated until the node is removed from the member list.

CI is passing now. PTAL.

I can split this PR into two if people want me to.

@aaronlehmann aaronlehmann changed the title from "[WIP] No automatic manager shutdown on demotion/removal" to "No automatic manager shutdown on demotion/removal" Jan 4, 2017
@aaronlehmann
Collaborator Author

I made two more fixes that seem to help with lingering occasional integration test issues.

First, I had to remove this code from node.go which sets the role to WorkerRole before obtaining a worker certificate:

	// switch role to agent immediately to shutdown manager early
	if role == ca.WorkerRole {
		n.role = role
		n.roleCond.Broadcast()
	}

In rare cases, this could cause problems with rapid promotion/demotion cycles. If a certificate renewal got kicked off before the demotion (say, by an earlier promotion), the role could flip from worker back to manager, until another renewal happens. This would cause the manager to start and join as a new member, which caused problems in the integration tests.

I also added a commit that adds a Cancel method to raft.Node. This interrupts current and future proposals, which deals with possible deadlocks on shutdown. We shut down services such as the dispatcher, CA, and so on before shutting down raft, because those other services depend on raft. But if one of those services is trying to write to raft and the quorum is not met, its Stop method might block waiting for the write to complete. The Cancel method gives us a way to prevent writes to raft from blocking things during the manager shutdown process. This probably was an issue before this PR, although somehow the change to node.go discussed above makes it much easier to trigger.
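To make the Cancel idea concrete, here is a minimal sketch built on a cancellable context. It is not the raft.Node implementation: the `node` type, `Propose`, and `errCancelled` are hypothetical, and the `committed` channel is never signalled so that it models a cluster without quorum.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// errCancelled is a hypothetical error returned to interrupted proposals.
var errCancelled = errors.New("proposal cancelled")

type node struct {
	ctx       context.Context
	cancel    context.CancelFunc
	committed chan struct{} // never signalled here: models lost quorum
}

func newNode() *node {
	ctx, cancel := context.WithCancel(context.Background())
	return &node{ctx: ctx, cancel: cancel, committed: make(chan struct{})}
}

// Propose blocks until the write commits or the node is cancelled.
func (n *node) Propose() error {
	select {
	case <-n.committed:
		return nil
	case <-n.ctx.Done():
		return errCancelled
	}
}

// Cancel interrupts current and future proposals without stopping raft itself.
func (n *node) Cancel() { n.cancel() }

func main() {
	n := newNode()
	done := make(chan error, 1)
	go func() { done <- n.Propose() }() // a service stuck writing without quorum

	n.Cancel()
	fmt.Println(<-done)      // the in-flight proposal returns
	fmt.Println(n.Propose()) // later proposals fail immediately
}
```

With something like this in place, a service's Stop method that is waiting on a write gets unblocked by Cancel, and the full raft shutdown can happen afterwards.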

This PR is split into 3 commits, and I can open separate PRs for them if necessary.

This seems very solid now. I ran the integration tests 32 times without any failures.

PTAL

// updated after the Raft member list has been reconciled with the
// desired role from the spec. Note that this doesn't show whether the
// node has obtained a certificate that reflects its current role.
NodeRole role = 9;
Contributor

I agree that your suggestion of reconciled_role would be more clear here, since you're right that "observed" implies (to me at least) the role of the certificate the node is currently using.

Collaborator Author

My hesitation about this is that the worker is going to act on this field to trigger certificate renewals. "Reconciled role" has meaning to the manager, but not the worker. From the worker's perspective, "Get a new certificate if your role changes" makes sense, but "Get a new certificate if your reconciled role changes" doesn't make as much sense.

What about changing the field under Spec to DesiredRole and leaving this one as Role? I didn't think of that before, but as long as the field number doesn't change, it's fine to rename protobuf fields.

Contributor

That'd make sense too. :) Thanks

Collaborator Author

Renamed the Spec field to DesiredRole, thanks.

@@ -394,33 +401,22 @@ func (m *Manager) Run(parent context.Context) error {
if err != nil {
errCh <- err
Contributor

Since errCh doesn't seem to be used any more, should it be removed?

Collaborator Author

Fixed, thanks

@aaronlehmann
Collaborator Author

Race revealed by CI fixed in #1846

- // Role defines the role the node should have.
- NodeRole role = 2;
+ // DesiredRole defines the role the node should have.
+ NodeRole desired_role = 2;
Contributor

Isn't this a breaking API change?

Collaborator Author

Not in gRPC. If we rename the field in the JSON API, it would be. Not sure what we should do about that. We could either keep the current name in JSON, or back out this part of the change and keep Role for gRPC as well. Not sure what's best.

Contributor

No, field number and type stay the same, so this won't break anything.
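To illustrate why the rename is wire-compatible: in the protobuf binary encoding, a field is identified by a key computed as `(field_number << 3) | wire_type`, and the field name is not part of the encoded bytes. The sketch below hand-rolls that for a small enum value. `fieldKey` and `encodeSmallEnum` are hypothetical helpers written for this example, not protobuf library calls, and the value 1 is arbitrary.

```go
package main

import "fmt"

// wireVarint is the wire type used for enums and integer scalars.
const wireVarint = 0

// fieldKey computes the protobuf wire-format key for a field.
func fieldKey(fieldNumber, wireType uint64) uint64 {
	return fieldNumber<<3 | wireType
}

// encodeSmallEnum encodes an enum field whose key and value each fit in a
// single varint byte (both below 128), which is all this sketch supports.
func encodeSmallEnum(fieldNumber, value uint64) []byte {
	return []byte{byte(fieldKey(fieldNumber, wireVarint)), byte(value)}
}

func main() {
	// The field name ("role" or "desired_role") is not an input at all:
	// only the field number (2) and the value end up in the encoding.
	fmt.Printf("% x\n", encodeSmallEnum(2, 1))
}
```

Formats that serialize field names, such as JSON, are a different story, which is the concern raised above about the JSON API.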

// Role is the *observed* role for this node. It differs from the
// desired role set in Node.Spec.Role because the role here is only
// updated after the Raft member list has been reconciled with the
// desired role from the spec. Note that this doesn't show whether the
Contributor

Maybe add that if the role is being used to tell whether or not an action may be performed, the certificate should be verified. This field is mostly informational.

Collaborator Author

Thanks, added this.

@stevvooe
Contributor

stevvooe commented Jan 5, 2017

LGTM

@aaronlehmann aaronlehmann force-pushed the correct-demotion branch 2 times, most recently from 167b493 to f333714 Compare January 5, 2017 22:02
@aaronlehmann
Collaborator Author

This passed Docker integration tests.

for _, node := range nodes {
	rm.reconcileRole(node)
}
if len(rm.pending) != 0 {
Contributor

At this point, is rm.pending always empty? So ticker will never be set?
I'm also wondering if the nodes in nodes should be added to pending? If any of them fail to be reconciled, they will not be reconciled again until that particular node is updated, or until some other updated node fails to be reconciled.

Collaborator Author

Oops, the nodes should indeed be added to pending. Fixed.

tickerCh = ticker.C
}
case <-tickerCh:
for _, node := range nodes {
Contributor

Should this be iterating over rm.pending? nodes doesn't ever get updated, so will this be iterating over the list of nodes in existence when the watch first started?

Collaborator Author

You are correct. I've fixed this. Thanks for finding these problems.
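The two fixes discussed in this review thread (adding nodes to the pending set, and retrying over the pending set rather than the initial node list) can be sketched as follows. This is an illustration only: the `roleManager` here is a hypothetical cut-down type, reconciliation is a callback, and the ticker is left out so the example stays deterministic; `retryPending` returns whether a retry ticker would need to stay armed.

```go
package main

import "fmt"

// roleManager is a simplified stand-in that tracks nodes awaiting reconciliation.
type roleManager struct {
	pending   map[string]struct{}
	reconcile func(id string) bool // reports whether reconciliation succeeded
}

// enqueue records nodes that still need reconciliation, so that a failed
// attempt is retried even if the node object is never updated again.
func (rm *roleManager) enqueue(ids ...string) {
	for _, id := range ids {
		rm.pending[id] = struct{}{}
	}
}

// retryPending retries every pending node (not the initial node list) and
// drops the ones that succeed. It returns true if any node is still pending.
func (rm *roleManager) retryPending() bool {
	for id := range rm.pending {
		if rm.reconcile(id) {
			delete(rm.pending, id) // deleting during range is safe in Go
		}
	}
	return len(rm.pending) != 0
}

func main() {
	healthy := false
	rm := &roleManager{
		pending:   map[string]struct{}{},
		reconcile: func(string) bool { return healthy },
	}
	rm.enqueue("n2", "n3")

	fmt.Println(rm.retryPending()) // reconciliation fails, ticker stays armed
	healthy = true
	fmt.Println(rm.retryPending()) // everything reconciled, ticker can stop
}
```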

@aaronlehmann aaronlehmann force-pushed the correct-demotion branch 2 times, most recently from 7536335 to cee1cc0 Compare January 6, 2017 00:31
@cyli
Contributor

cyli commented Jan 6, 2017

Since promoting/demoting nodes is now eventually consistent, does any logic in the control API need to change with regards to the quorum safeguard? In a 3-node cluster, if you demote 2 nodes in rapid succession, the demotions may both succeed because the raft membership has not changed yet.

Not sure if this is a valid test, but it fails (I put it in manager/controlapi/node_test.go) because the second demote succeeds.

func TestUpdateNodeDemoteTwice(t *testing.T) {
	t.Parallel()
	tc := cautils.NewTestCA(nil)
	defer tc.Stop()
	ts := newTestServer(t)
	defer ts.Stop()

	nodes, _ := raftutils.NewRaftCluster(t, tc)
	defer raftutils.TeardownCluster(t, nodes)

	// Assign one of the raft nodes to the test server
	ts.Server.raft = nodes[1].Node
	ts.Server.store = nodes[1].MemoryStore()

	// Create a node object for each of the managers
	assert.NoError(t, nodes[1].MemoryStore().Update(func(tx store.Tx) error {
		assert.NoError(t, store.CreateNode(tx, &api.Node{
			ID: nodes[1].SecurityConfig.ClientTLSCreds.NodeID(),
			Spec: api.NodeSpec{
				DesiredRole: api.NodeRoleManager,
				Membership:  api.NodeMembershipAccepted,
			},
			Role: api.NodeRoleManager,
		}))
		assert.NoError(t, store.CreateNode(tx, &api.Node{
			ID: nodes[2].SecurityConfig.ClientTLSCreds.NodeID(),
			Spec: api.NodeSpec{
				DesiredRole: api.NodeRoleManager,
				Membership:  api.NodeMembershipAccepted,
			},
			Role: api.NodeRoleManager,
		}))
		assert.NoError(t, store.CreateNode(tx, &api.Node{
			ID: nodes[3].SecurityConfig.ClientTLSCreds.NodeID(),
			Spec: api.NodeSpec{
				DesiredRole: api.NodeRoleManager,
				Membership:  api.NodeMembershipAccepted,
			},
			Role: api.NodeRoleManager,
		}))
		return nil
	}))

	// Try to demote nodes 2 and 3 in quick succession; the second demotion should fail because of the quorum safeguard
	r2, err := ts.Client.GetNode(context.Background(), &api.GetNodeRequest{NodeID: nodes[2].SecurityConfig.ClientTLSCreds.NodeID()})
	assert.NoError(t, err)
	r3, err := ts.Client.GetNode(context.Background(), &api.GetNodeRequest{NodeID: nodes[3].SecurityConfig.ClientTLSCreds.NodeID()})
	assert.NoError(t, err)

	_, err = ts.Client.UpdateNode(context.Background(), &api.UpdateNodeRequest{
		NodeID: nodes[2].SecurityConfig.ClientTLSCreds.NodeID(),
		Spec: &api.NodeSpec{
			DesiredRole: api.NodeRoleWorker,
			Membership:  api.NodeMembershipAccepted,
		},
		NodeVersion: &r2.Node.Meta.Version,
	})
	assert.NoError(t, err)

	_, err = ts.Client.UpdateNode(context.Background(), &api.UpdateNodeRequest{
		NodeID: nodes[3].SecurityConfig.ClientTLSCreds.NodeID(),
		Spec: &api.NodeSpec{
			DesiredRole: api.NodeRoleWorker,
			Membership:  api.NodeMembershipAccepted,
		},
		NodeVersion: &r3.Node.Meta.Version,
	})
	assert.Error(t, err)
	assert.Equal(t, codes.FailedPrecondition, grpc.Code(err))
}

@aaronlehmann
Collaborator Author

Thanks for bringing this up - it's an interesting issue and I want to make sure we handle that case right. I took a look, and I think the code in this PR is doing the right thing.

The quorum safeguard in controlapi only applies to changing the desired role, but actually completing the demotion is guarded by another quorum safeguard in roleManager. This code is not concurrent, so there shouldn't be any consistency issues.
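For reference, the kind of check a quorum safeguard performs can be sketched as below. This is an assumption about the general shape of such a check, not a copy of the code in this PR: `member` and `canRemoveMember` are hypothetical, and the function assumes the ID being removed is present in the list. A removal is allowed only if the members that would remain reachable still form a majority of the shrunken cluster.

```go
package main

import "fmt"

// member is a hypothetical view of one raft member.
type member struct {
	ID        string
	Reachable bool
}

// canRemoveMember reports whether removing id keeps a reachable majority.
// It assumes id is present in members.
func canRemoveMember(members []member, id string) bool {
	reachable := 0 // reachable members after the removal
	for _, m := range members {
		if m.ID != id && m.Reachable {
			reachable++
		}
	}
	quorum := (len(members)-1)/2 + 1 // majority of the cluster after removal
	return reachable >= quorum
}

func main() {
	members := []member{
		{ID: "n1", Reachable: true},
		{ID: "n2", Reachable: true},
		{ID: "n3", Reachable: false},
	}
	// Removing a healthy member would leave 1 reachable out of 2.
	fmt.Println(canRemoveMember(members, "n2"))
	// Removing the unreachable member leaves 2 reachable out of 2.
	fmt.Println(canRemoveMember(members, "n3"))
}
```

Because a check like this runs in a single, non-concurrent place, two demotions requested back to back are evaluated one after another against the current member list.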

I think what we're doing here is a big improvement from the old code, which did have this issue because calls to the controlapi could happen concurrently, and that was the only place the quorum check happened.

The reason your test is failing is that controlapi no longer actually removes the node from the raft member list. You need to have a roleManager running for that to happen. For the purpose of the test, it might be better to explicitly call RemoveMember, so you don't have to rely on a race to see possible issues.

@aaronlehmann
Collaborator Author

Also, in theory once it's possible to finish the demotion without losing quorum, roleManager would finish it automatically. Which is pretty cool, I think.

@aaronlehmann
Collaborator Author

But maybe you have a valid point that we shouldn't allow the DesiredRole change to go through if it's something that can't be satisfied immediately, just for the sake of usability. Maybe that's good followup material?

@cyli
Contributor

cyli commented Jan 6, 2017

I think my numbers are also wrong in the test. It also fails in master, so it's just an invalid test I think. Maybe 5, 1 down, 2 demotions? :) Also, yes you're right, the role manager isn't running, so that'd also be a problem.

But yes, I was mainly wondering if there'd be a problem if the control API says all is well, but the demotion can't actually happen (until something else comes back up). Apologies if I wasn't clear - I wasn't trying to imply that this PR wasn't eventually doing the demotion.

I don't think I have any other feedback other than the UI thing. :) It LGTM.

@cyli
Contributor

cyli commented Jan 6, 2017

Also regarding the delay in satisfying the demotion, I guess demotion takes a little time already, so I don't think this PR actually introduces any new issues w.r.t. immediate feedback, so 👍 for thinking about it later on

@aaronlehmann aaronlehmann force-pushed the correct-demotion branch 2 times, most recently from 6b60061 to ce45628 Compare January 6, 2017 22:29
@LK4D4
Contributor

LK4D4 commented Jan 7, 2017

@aaronlehmann LGTM, feel free to merge after rebase.

When a node is demoted, two things need to happen: the node object
itself needs to be updated, and the raft member list needs to be updated
to remove that node. Previously, it was possible to get into a bad state
where the node had been updated but not removed from the member list.

This changes the approach so that controlapi only updates the node
object, and there is a goroutine that watches for node changes and
updates the member list accordingly. This means that demotion will work
correctly even if there is a node failure or leader change in between
the two steps.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
This is an attempt to fix the demotion process. Right now, the agent
finds out about a role change, and independently, the raft node finds
out it's no longer in the cluster, and shuts itself down. This causes
the manager to also shut itself down. This is very error-prone and has
led to a lot of problems. I believe there are corner cases that are not
properly addressed.

This changes things so that raft only signals to the higher level that
the node has been removed. The manager supervision code will shut down
the manager, and wait a certain amount of time for a role change (which
should come through the agent reconnecting to a different manager now
that the local manager is shut down).

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Shutting down the manager can deadlock if a component is waiting for
store updates to go through. This Cancel method allows current and
future proposals to be interrupted to avoid those deadlocks. Once
everything using raft has shut down, the Stop method can be used to
complete raft shutdown.

Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
Signed-off-by: Aaron Lehmann <aaron.lehmann@docker.com>
@aaronlehmann aaronlehmann merged commit 0bc3921 into moby:master Jan 7, 2017
@aaronlehmann aaronlehmann deleted the correct-demotion branch January 7, 2017 01:23
5 participants