Only allocate attachments needed by nodes #2725

Merged 1 commit into moby:master on Aug 20, 2018

Conversation

dperny (Collaborator) commented Aug 6, 2018

- What I did

Instead of allocating an attachment for every network on every node, allocate and deallocate attachments on nodes based on whether the node needs the attachment because a task using that network is running on the node.

A quick fix for the problem raised in #2721.

Essentially, there are load balancer optimizations that rely on nodes having network attachments (IP addresses) for every network in use on the node. Because we can't know which tasks a node will run until after they've passed through the scheduler, we previously took the naive approach of allocating a network attachment for every network on every node. This uses far more IP addresses than necessary, which can quickly exhaust the subnet space on a large cluster.

In this PR, we watch task updates and allocate new network attachments as soon as we know which node a task will land on. As a result, only the nodes running tasks that require a network have attachments for that network.
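As a rough sketch of the mechanism (the types and helper names below are illustrative stand-ins, not swarmkit's actual API):

```go
package main

import "fmt"

type Task struct {
	NodeID     string   // empty until the scheduler assigns a node
	NetworkIDs []string // networks the task's service uses
}

type Node struct {
	ID          string
	Attachments map[string]string // network ID -> allocated LB IP
}

// onTaskUpdate allocates a per-node attachment for each network the task
// uses, but only once the scheduler has decided where the task will run.
func onTaskUpdate(nodes map[string]*Node, task Task, allocIP func(networkID string) string) {
	if task.NodeID == "" {
		return // not scheduled yet; we don't know which node needs the IP
	}
	node := nodes[task.NodeID]
	for _, nw := range task.NetworkIDs {
		if _, ok := node.Attachments[nw]; !ok {
			node.Attachments[nw] = allocIP(nw)
		}
	}
}

func main() {
	nodes := map[string]*Node{"node1": {ID: "node1", Attachments: map[string]string{}}}
	alloc := func(nw string) string { return "10.0.0.2/24" } // stub IPAM allocator
	onTaskUpdate(nodes, Task{NodeID: "node1", NetworkIDs: []string{"overlay1"}}, alloc)
	fmt.Println(nodes["node1"].Attachments) // map[overlay1:10.0.0.2/24]
}
```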

NOTE: This PR is WIP.
Because tasks will have already passed through the scheduler and be headed down to the node by the time we allocate, the node update containing the new network attachment may not have arrived when the task is ready to start. There may be a race condition in which a task cannot start because the load balancer attachment it needs hasn't yet arrived. A second PR may be needed, either to Docker or to Swarmkit, to block task creation until all the requisite node attachments have arrived.

- How I did it

When task updates come in, attempt to re-allocate the node.

- How to test it

Includes an automated test.

To manually test, create services with tasks attached to networks on a cluster with several nodes, and add and remove services so that tasks get created and destroyed. Inspect the nodes and verify that each node has attachments only for the networks used by tasks on that node (plus ingress).

- Description for the changelog

Fixed an issue where every node was allocated a network attachment for every network, even if the node didn't need one.

anshulpundir (Contributor) commented:

Can we please capture the different design options here (#1477)? @dperny

codecov bot commented Aug 6, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@496d19b).
The diff coverage is 70.83%.

@@            Coverage Diff            @@
##             master    #2725   +/-   ##
=========================================
  Coverage          ?   61.71%           
=========================================
  Files             ?      134           
  Lines             ?    21843           
  Branches          ?        0           
=========================================
  Hits              ?    13480           
  Misses            ?     6898           
  Partials          ?     1465

dperny commented Aug 6, 2018

@anshulpundir this is not a fix for #1477. It's a fix for one issue that fixing #1477 would also resolve.

attach.Network.ID, node.ID,
)
}
node.Attachments = append(node.Attachments[0:i], node.Attachments[i+1:]...)
A reviewer commented:

May be being stupid here, but this seems like it would not work correctly. We are in a for-loop over the attachments. If we are on attachment i and we remove it from the slice, then attachment i+1 now occupies position i. But the loop is next going to inspect index i+1 which was previously i+2. In other words we just skipped the attachment originally at i+1 before i was removed. Did I get that right?

dperny (Collaborator Author) replied:

huh... maybe. i'm unsure how this works actually. good catch.

dperny (Collaborator Author) replied:

Pushed up a change to fix this.
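For reference, a common Go idiom that avoids the skipped-element problem is filtering the slice in place rather than deleting by index while iterating; a minimal sketch with simplified stand-in types (not swarmkit's api types):

```go
package main

import "fmt"

type NetworkAttachment struct{ NetworkID string }

func main() {
	attachments := []NetworkAttachment{{"ingress"}, {"overlay1"}, {"unused"}}
	stillNeeded := map[string]bool{"ingress": true, "overlay1": true}

	// Filter in place: appending to a zero-length reslice of the original
	// backing array never skips elements, unlike removing index i while a
	// loop is still advancing over the shrinking slice.
	kept := attachments[:0]
	for _, a := range attachments {
		if stillNeeded[a.NetworkID] {
			kept = append(kept, a)
		}
	}
	attachments = kept

	fmt.Println(attachments) // [{ingress} {overlay1}]
}
```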

// Validate node has 2 LB IP address (1 for each network).
watchNetwork(t, netWatch, false, isValidNetwork) // ingress
watchNetwork(t, netWatch, false, isValidNetwork) // overlayID1
watchNetwork(t, netWatch, false, isValidNetwork) // overlayID1
A reviewer commented:

Is this last one validating that the overlayIDunused gets deleted from the network when task reconciliation occurs? If so, maybe update the comment. if not, then (pardon my confusion) what is the other network event here?

dperny (Collaborator Author) replied:

yeah sorry, i copypasted that line.

ctelfer commented Aug 6, 2018

Re: the race condition -- What currently prevents Docker engine from creating and/or starting a task before the task's network attachments "arrive"? In other words, how does the fact that the LB attachment is associated with a node, but not a task affect the way that swarm propagates said attachments to the node. Sorry for the dumb question -- just want to understand what might be involved from the engine side.


dperny commented Aug 7, 2018

The network attachments for a task are embedded in the task object. The network attachments for a node are embedded in the node object, not in the task, but may be depended on by the task. Does that make sense?
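Sketched with fields abbreviated from the api objects (the exact shapes are simplified for illustration):

```go
package main

import "fmt"

// Each object embeds its own attachments, so a task can depend on a
// node-level (LB) attachment that is delivered to the engine in a
// separate node update.
type Network struct{ ID string }

type NetworkAttachment struct {
	Network   *Network
	Addresses []string
}

type Task struct {
	Networks []*NetworkAttachment // per-task attachments, inside the task object
}

type Node struct {
	Attachments []*NetworkAttachment // per-node LB attachments, inside the node object
}

func main() {
	nw := &Network{ID: "overlay1"}
	task := Task{Networks: []*NetworkAttachment{{Network: nw, Addresses: []string{"10.0.0.3/24"}}}}
	node := Node{Attachments: []*NetworkAttachment{{Network: nw, Addresses: []string{"10.0.0.2/24"}}}}
	// The task's container IP and the node's LB IP travel in different
	// objects, so they can arrive at the engine at different times.
	fmt.Println(task.Networks[0].Addresses, node.Attachments[0].Addresses)
}
```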


dperny commented Aug 7, 2018

We should "forward-port" this behavior change to the new-allocator branch. Once we settle on merging this, I will do so and open a new PR against that branch.

ctelfer commented Aug 7, 2018

Ok, got it re: the attachments ... makes sense. I presume there is, then, no sort of way to ensure that the node update arrives before the task update to docker engine. Hrm... even if there were, engine probably handles those in separate goroutines.

If the task arrived before the node update, I suspect that the task allocation would just fail and then retry until it succeeds in the current engine code. But that is certainly noisy and suboptimal. Will look around to see how a "stalling" type behavior could be inserted for awaiting the swarm per-node IP. The good news is that such a change could easily be backwards-compatible with older swarm nodes and so upgrade wouldn't be an issue there.

ctelfer commented Aug 7, 2018

Pretty sure the answer is "yes", but just to be sure -- will this scheme still work for containers (not tasks) connecting to an attachable overlay network via, say, docker run...?


dperny commented Aug 7, 2018

@ctelfer see the corresponding PR at moby/moby#37604 for what has to happen there.

The actual case, where a task arrives before its node has the attachment, is much worse. The task will attempt to start and then fail. The failure moves the task into a terminal state, so it gets deallocated. In the process, the node, now having no task requiring that network, has its attachment deallocated too. A new task then gets allocated and scheduled, the node attachment is allocated again, and the race starts over.

ctelfer commented Aug 7, 2018

It looks like swarmkit's error library already provides a nice mechanism for reporting temporary errors and telling swarm to retry later: docker/swarmkit/agent/exec/errors.go:MakeTemporary. It also looks like the path between the swarmkit controller.Do() function and the docker engine's daemon/network.go:daemon.createNetwork() call is mostly clear. That is, the various calls in between just flag the error and return it without interpretation.

So, if I'm reading this right, it might be possible to address the race condition by changing the code of createNetwork() so it invokes MakeTemporary() on the error it would return when it can't find the network attachment. I.e. on line 350 of daemon/network.go:

        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)
                }

becomes something like

        // Top of file (MakeTemporary lives in swarmkit's agent/exec package)
        import swarmErrors "github.com/docker/swarmkit/agent/exec"
...
        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, swarmErrors.MakeTemporary(fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id))
                }

Does this seem plausible?


dperny commented Aug 7, 2018

@ctelfer let's bring that conversation into the other PR.

ctelfer commented Aug 7, 2018

Ah, didn't see that you had started that PR. Will do so.

talex5 (Contributor) left a comment:

Don't know much about this code, but it seems OK as a temporary workaround.

// skip any tasks that are no longer relevant

// we only want tasks that are in desired state running. anything
// else is on its way out and irrelevant
talex5 (Contributor):

If a task is e.g. running but desired to be shut down, then presumably the load balancer IP is still in use, so I don't think it's safe to deallocate it yet.

e.g. if a node has two tasks

  • Task1: state=running, desired=shutdown
  • Task2: state=running, desired=shutdown

When Task2 terminates, we'll come here to decide whether we still need the load balancer IP. This function will say we can release it, I guess, even though we can't.

dperny (Collaborator Author) replied:

I think my logic here was that this would only allocate new attachments, not free old ones, but you've described the actual behavior, and my logic was wrong. I will change it.
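A sketch of the safer check, judging by a task's actual state rather than its desired state (types and state names are simplified stand-ins, not swarmkit's api):

```go
package main

import "fmt"

type State int

const (
	StateRunning State = iota
	StateShutdown
)

type Task struct {
	NetworkIDs   []string
	ActualState  State
	DesiredState State
}

// networkStillInUse judges by actual state, not desired state: a task
// that is desired to shut down but is still running keeps using the
// node's load balancer IP until it actually terminates.
func networkStillInUse(tasks []Task, networkID string) bool {
	for _, t := range tasks {
		for _, id := range t.NetworkIDs {
			if id == networkID && t.ActualState == StateRunning {
				return true
			}
		}
	}
	return false
}

func main() {
	tasks := []Task{
		{NetworkIDs: []string{"overlay1"}, ActualState: StateRunning, DesiredState: StateShutdown},
	}
	// The task is desired down but hasn't terminated, so the IP must stay.
	fmt.Println(networkStillInUse(tasks, "overlay1")) // true
}
```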

for _, attach := range node.Attachments {
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches
talex5 (Contributor):

Is the rest of this comment missing, or are we supposed to continue reading a few lines down?

ctelfer commented Aug 16, 2018

I've tested this code end-to-end, for what it's worth. It avoids allocating when not needed and reclaims addresses once they're no longer in use. For example, I was able to saturate a /27 network (32 total addresses) with 27 tasks on a 3-node cluster when forcing all tasks to live on one node. This breaks down as:

  • 1 network address (all 0s)
  • 1 broadcast address (all 1s)
  • 1 default router (.1)
  • 1 VIP
  • 1 per-node load balancing IP
  • 27 tasks

When I run without node constraints, the number of tasks I can deploy is 25, which is expected because 2 more IP addresses are needed (a per-node load balancing IP for each of the 3 nodes instead of just one).
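That is, 1 + 1 + 1 + 1 + 1 + 27 = 32 addresses when all tasks are constrained to one node, and 1 + 1 + 1 + 1 + 3 + 25 = 32 when all 3 nodes need a load balancing IP.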

ctelfer left a comment:

LGTM FWIW

@dperny dperny changed the title [WIP] Only allocate attachments needed by nodes Only allocate attachments needed by nodes Aug 17, 2018
Instead of allocating an attachment for every network on every node,
allocate and deallocate attachments on nodes based off whether or not
the node needs the attachment because a task using that network is on
the node.

Signed-off-by: Drew Erny <drew.erny@docker.com>
dperny commented Aug 17, 2018

The test that's failing is the same one that's failing on master. I'd recommend moving forward with this PR in spite of the test failure.

@anshulpundir anshulpundir merged commit cfa742c into moby:master Aug 20, 2018
@dperny dperny mentioned this pull request Aug 21, 2018
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Oct 17, 2018
When the network allocator starts, it performs two passes of allocation.
The first, with existingAddressesOnly set to "true", simply re-allocates
any already reserved addresses, which makes the local driver state
consistent with the state in swarmkit's object store. The second pass
then performs any new allocations that were outstanding when the
allocator last stopped.

Since moby#2725, nodes only have attachments allocated for them if they have
a task currently scheduled which requires those networks. This happens
after a task is allocated and scheduled.

Before this change, it was possible that, if a Task was successfully
allocated but the allocator stopped before the Node was also allocated,
the restore phase would add an empty api.NetworkAttachment object to
the Node's attachments. Then, in the new allocations phase, when
processing all attachments, we unconditionally dereferenced the
NetworkAttachment object's Network field, which was nil. This caused a
segfault and crash.

With this change, we no longer add these errant NetworkAttachment
objects to nodes.

Signed-off-by: Drew Erny <drew.erny@docker.com>
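A minimal sketch of the kind of guard described, with hypothetical simplified types (not the actual swarmkit code):

```go
package main

import "fmt"

type Network struct{ ID string }

type NetworkAttachment struct {
	Network   *Network
	Addresses []string
}

type Node struct{ Attachments []*NetworkAttachment }

func main() {
	node := &Node{}
	candidates := []*NetworkAttachment{
		{Network: &Network{ID: "overlay1"}, Addresses: []string{"10.0.0.5/24"}},
		{Network: nil}, // leftover from an interrupted allocation run
	}
	for _, attach := range candidates {
		// Guard against partially-allocated attachments: appending one
		// with a nil Network would later be dereferenced unconditionally
		// and crash the allocator.
		if attach.Network == nil {
			continue
		}
		node.Attachments = append(node.Attachments, attach)
	}
	fmt.Println(len(node.Attachments)) // 1
}
```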
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Oct 18, 2018
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Oct 18, 2018
(cherry picked from commit 2d71271)
Signed-off-by: Drew Erny <drew.erny@docker.com>
anshulpundir pushed a commit to anshulpundir/swarmkit that referenced this pull request Oct 19, 2018
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Nov 1, 2018
Forward ports the changes in moby#2725 to the new allocator.

Signed-off-by: Drew Erny <drew.erny@docker.com>