Only allocate attachments needed by nodes #2725

Merged 1 commit into moby:master on Aug 20, 2018

Conversation

dperny (Collaborator) commented Aug 6, 2018

- What I did

Instead of allocating an attachment for every network on every node, allocate and deallocate attachments on nodes based on whether the node needs the attachment because a task using that network is running on the node.

A quick fix for the problem raised in #2721.

Essentially, there are load balancer optimizations that rely on nodes having network attachments (IP addresses) for every network in use on the node. Because we can't know which tasks a node will run until after they've passed through the scheduler, we previously took the naive approach of allocating a network attachment for every network on every node. This uses far more IP addresses than necessary, which can quickly exhaust the subnet space on a large cluster.

In this PR, we watch task updates and allocate new network attachments as soon as we know which node a task will land on. As a result, only the nodes running tasks that require a network have attachments for that network.
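As a rough sketch of the mechanism (the types and helper names below are illustrative stand-ins, not swarmkit's actual API):

```go
package main

import "fmt"

type Task struct {
	NodeID     string   // empty until the scheduler assigns a node
	NetworkIDs []string // networks the task's service uses
}

type Node struct {
	ID          string
	Attachments map[string]string // network ID -> allocated LB IP
}

// onTaskUpdate allocates a per-node attachment for each network the task
// uses, but only once the scheduler has decided where the task will run.
func onTaskUpdate(nodes map[string]*Node, task Task, allocIP func(networkID string) string) {
	if task.NodeID == "" {
		return // not scheduled yet; we don't know which node needs the IP
	}
	node := nodes[task.NodeID]
	for _, nw := range task.NetworkIDs {
		if _, ok := node.Attachments[nw]; !ok {
			node.Attachments[nw] = allocIP(nw)
		}
	}
}

func main() {
	nodes := map[string]*Node{"node1": {ID: "node1", Attachments: map[string]string{}}}
	alloc := func(nw string) string { return "10.0.0.2/24" } // stub IPAM allocator
	onTaskUpdate(nodes, Task{NodeID: "node1", NetworkIDs: []string{"overlay1"}}, alloc)
	fmt.Println(nodes["node1"].Attachments) // map[overlay1:10.0.0.2/24]
}
```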

NOTE: This PR is WIP.
Because tasks will have already passed through the scheduler and be headed down to the node by the time we allocate, the node update containing the new network attachment may not have arrived when the task is ready to start. There may be a race condition in which a task cannot start because the load balancer attachment it needs hasn't yet arrived. A second PR may be needed, either to Docker or to Swarmkit, to block task creation until all the requisite node attachments have arrived.

- How I did it

When task updates come in, attempt to re-allocate the node.

- How to test it

Includes an automated test.

To manually test, create services with tasks attached to networks on a cluster with several nodes, and add and remove services so that tasks get created and destroyed. Inspect the nodes and verify that each node has attachments only for the networks used by tasks on that node (plus ingress).

- Description for the changelog

Fixed an issue where every node was allocated a network attachment for every network, even if the node didn't need one.

anshulpundir (Contributor) commented:

Can we please capture the different design options here (#1477)? @dperny

codecov bot commented Aug 6, 2018

Codecov Report

❗ No coverage uploaded for pull request base (master@496d19b).
The diff coverage is 70.83%.

@@            Coverage Diff            @@
##             master    #2725   +/-   ##
=========================================
  Coverage          ?   61.71%           
=========================================
  Files             ?      134           
  Lines             ?    21843           
  Branches          ?        0           
=========================================
  Hits              ?    13480           
  Misses            ?     6898           
  Partials          ?     1465

dperny commented Aug 6, 2018

@anshulpundir this is not a fix for #1477. It's a fix for one issue that fixing #1477 would also resolve.

attach.Network.ID, node.ID,
)
}
node.Attachments = append(node.Attachments[0:i], node.Attachments[i+1:]...)
A reviewer commented:

May be being stupid here, but this seems like it would not work correctly. We are in a for-loop over the attachments. If we are on attachment i and we remove it from the slice, then attachment i+1 now occupies position i. But the loop is next going to inspect index i+1 which was previously i+2. In other words we just skipped the attachment originally at i+1 before i was removed. Did I get that right?

dperny (Collaborator Author) replied:

huh... maybe. i'm unsure how this works actually. good catch.

dperny (Collaborator Author) replied:

Pushed up a change to fix this.
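For reference, a common Go idiom that avoids the skipped-element problem is filtering the slice in place rather than deleting by index while iterating; a minimal sketch with simplified stand-in types (not swarmkit's api types):

```go
package main

import "fmt"

type NetworkAttachment struct{ NetworkID string }

func main() {
	attachments := []NetworkAttachment{{"ingress"}, {"overlay1"}, {"unused"}}
	stillNeeded := map[string]bool{"ingress": true, "overlay1": true}

	// Filter in place: appending to a zero-length reslice of the original
	// backing array never skips elements, unlike removing index i while a
	// loop is still advancing over the shrinking slice.
	kept := attachments[:0]
	for _, a := range attachments {
		if stillNeeded[a.NetworkID] {
			kept = append(kept, a)
		}
	}
	attachments = kept

	fmt.Println(attachments) // [{ingress} {overlay1}]
}
```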

// Validate node has 2 LB IP address (1 for each network).
watchNetwork(t, netWatch, false, isValidNetwork) // ingress
watchNetwork(t, netWatch, false, isValidNetwork) // overlayID1
watchNetwork(t, netWatch, false, isValidNetwork) // overlayID1
A reviewer commented:

Is this last one validating that the overlayIDunused gets deleted from the network when task reconciliation occurs? If so, maybe update the comment. if not, then (pardon my confusion) what is the other network event here?

dperny (Collaborator Author) replied:

yeah sorry, i copypasted that line.

ctelfer commented Aug 6, 2018

Re: the race condition -- What currently prevents Docker engine from creating and/or starting a task before the task's network attachments "arrive"? In other words, how does the fact that the LB attachment is associated with a node, but not a task affect the way that swarm propagates said attachments to the node. Sorry for the dumb question -- just want to understand what might be involved from the engine side.


dperny commented Aug 7, 2018

The network attachments for a task are embedded in the task object. The network attachments for a node are embedded in the node object, not in the task, but may be depended on by the task. Does that make sense?
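Sketched with fields abbreviated from the api objects (the exact shapes are simplified for illustration):

```go
package main

import "fmt"

// Each object embeds its own attachments, so a task can depend on a
// node-level (LB) attachment that is delivered to the engine in a
// separate node update.
type Network struct{ ID string }

type NetworkAttachment struct {
	Network   *Network
	Addresses []string
}

type Task struct {
	Networks []*NetworkAttachment // per-task attachments, inside the task object
}

type Node struct {
	Attachments []*NetworkAttachment // per-node LB attachments, inside the node object
}

func main() {
	nw := &Network{ID: "overlay1"}
	task := Task{Networks: []*NetworkAttachment{{Network: nw, Addresses: []string{"10.0.0.3/24"}}}}
	node := Node{Attachments: []*NetworkAttachment{{Network: nw, Addresses: []string{"10.0.0.2/24"}}}}
	// The task's container IP and the node's LB IP travel in different
	// objects, so they can arrive at the engine at different times.
	fmt.Println(task.Networks[0].Addresses, node.Attachments[0].Addresses)
}
```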


dperny commented Aug 7, 2018

We should "forward-port" this behavior change to the new-allocator branch. Once we settle on merging this, I will do so and open a new PR against that branch.

ctelfer commented Aug 7, 2018

Ok, got it re: the attachments ... makes sense. I presume there is, then, no sort of way to ensure that the node update arrives before the task update to docker engine. Hrm... even if there were, engine probably handles those in separate goroutines.

If the task arrived before the node update, I suspect that the task allocation would just fail and then retry until it succeeds in the current engine code. But that is certainly noisy and suboptimal. Will look around to see how a "stalling" type behavior could be inserted for awaiting the swarm per-node IP. The good news is that such a change could easily be backwards-compatible with older swarm nodes and so upgrade wouldn't be an issue there.

ctelfer commented Aug 7, 2018

Pretty sure the answer is "yes", but just to be sure -- will this scheme still work for containers (not tasks) connecting to an attachable overlay network via, say, docker run...?


dperny commented Aug 7, 2018

@ctelfer see the corresponding PR at moby/moby#37604 for what has to happen there.

The actual case, where a task arrives before its node has the attachment, is much worse. The task will attempt to start and then fail. The failure moves the task into a terminal state, so it gets deallocated. In the process, the node, now having no task requiring that network, has its attachment deallocated too. A new task then gets allocated and scheduled, the node attachment is allocated again, and the race starts over.

ctelfer commented Aug 7, 2018

It looks like swarmkit's error library already provides a nice mechanism for reporting temporary errors and telling swarm to retry later: docker/swarmkit/agent/exec/errors.go:MakeTemporary. It also looks like the path between the swarmkit controller.Do() function and the docker engine's daemon/network.go:daemon.createNetwork() call is mostly clear. That is, the various calls in between just flag the error and return it without interpretation.

So, if I'm reading this right, it might be possible to address the race condition by changing the code of createNetwork() so it invokes MakeTemporary() on the error it would return when it can't find the network attachment. I.e. on line 350 of daemon/network.go:

        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)
                }

becomes something like

        // Top of file (MakeTemporary lives in swarmkit's agent/exec package)
        import swarmErrors "github.com/docker/swarmkit/agent/exec"
...
        if agent && driver == "overlay" {
                nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
                if !exists {
                        return nil, swarmErrors.MakeTemporary(fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id))
                }

Does this seem plausible?


dperny commented Aug 7, 2018

@ctelfer let's bring that conversation into the other PR.

ctelfer commented Aug 7, 2018

Ah, didn't see that you had started that PR. Will do so.

talex5 (Contributor) left a comment:

Don't know much about this code, but it seems OK as a temporary workaround.

// skip any tasks that are no longer relevant

// we only want tasks that are in desired state running. anything
// else is on its way out and irrelevant
talex5 (Contributor):

If a task is e.g. running but desired to be shut down, then presumably the load balancer IP is still in use, so I don't think it's safe to deallocate it yet.

e.g. if a node has two tasks

  • Task1: state=running, desired=shutdown
  • Task2: state=running, desired=shutdown

When Task2 terminates, we'll come here to decide whether we still need the load balancer IP. This function will say we can release it, I guess, even though we can't.

dperny (Collaborator Author) replied:

I think my logic here was that this would only allocate new attachments, not free old ones, but you've described the actual behavior, and my logic was wrong. I will change it.
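A sketch of the safer check, judging by a task's actual state rather than its desired state (types and state names are simplified stand-ins, not swarmkit's api):

```go
package main

import "fmt"

type State int

const (
	StateRunning State = iota
	StateShutdown
)

type Task struct {
	NetworkIDs   []string
	ActualState  State
	DesiredState State
}

// networkStillInUse judges by actual state, not desired state: a task
// that is desired to shut down but is still running keeps using the
// node's load balancer IP until it actually terminates.
func networkStillInUse(tasks []Task, networkID string) bool {
	for _, t := range tasks {
		for _, id := range t.NetworkIDs {
			if id == networkID && t.ActualState == StateRunning {
				return true
			}
		}
	}
	return false
}

func main() {
	tasks := []Task{
		{NetworkIDs: []string{"overlay1"}, ActualState: StateRunning, DesiredState: StateShutdown},
	}
	// The task is desired down but hasn't terminated, so the IP must stay.
	fmt.Println(networkStillInUse(tasks, "overlay1")) // true
}
```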

for _, attach := range node.Attachments {
// for every attachment, go through every network. if the attachment
// belongs to one of the networks, then go to the next attachment. if
// no network matches
talex5 (Contributor):

Is the rest of this comment missing, or are we supposed to continue reading a few lines down?

ctelfer commented Aug 16, 2018

I've tested this code end-to-end, for what it's worth. It avoids allocating when not needed and reclaims addresses once they're no longer in use. For example, I was able to saturate a /27 network (32 total addresses) with 27 tasks on a 3-node cluster when forcing all tasks to live on one node. This breaks down as:

  • 1 network address (all 0s)
  • 1 broadcast address (all 1s)
  • 1 default router (.1)
  • 1 VIP
  • 1 per-node load balancing IP
  • 27 tasks

When I run without node constraints, the number of tasks I can deploy is 25, which is expected because 2 more IP addresses are needed (a per-node load balancing IP for each of the 3 nodes instead of just one).
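That is, 1 + 1 + 1 + 1 + 1 + 27 = 32 addresses when all tasks are constrained to one node, and 1 + 1 + 1 + 1 + 3 + 25 = 32 when all 3 nodes need a load balancing IP.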

ctelfer left a comment:

LGTM FWIW

@dperny dperny changed the title [WIP] Only allocate attachments needed by nodes Only allocate attachments needed by nodes Aug 17, 2018
Instead of allocating an attachment for every network on every node,
allocate and deallocate attachments on nodes based off whether or not
the node needs the attachment because a task using that network is on
the node.

Signed-off-by: Drew Erny <drew.erny@docker.com>
dperny commented Aug 17, 2018

The test that's failing is the same one that's failing on master. I'd recommend moving forward with this PR in spite of the test failure.

@anshulpundir anshulpundir merged commit cfa742c into moby:master Aug 20, 2018
@dperny dperny mentioned this pull request Aug 21, 2018
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Oct 17, 2018
When the network allocator starts, it performs two passes of allocation.
The first, with existingAddressesOnly set to "true", simply re-allocates
any already reserved addresses, which makes the local driver state
consistent with the state in swarmkit's object store. The second pass
then performs any new allocations that were outstanding when the
allocator last stopped.

Since moby#2725, nodes only have attachments allocated for them if they have
a task currently scheduled which requires those networks. This happens
after a task is allocated and scheduled.

Before this change, it was possible that, if a Task was successfully
allocated but the allocator stopped before the Node was also allocated,
the restore phase would add an empty api.NetworkAttachment object to
the Node's attachments. Then, in the new allocations phase, when
processing all attachments, we unconditionally dereferenced the
NetworkAttachment object's Network field, which was nil. This caused a
segfault and crash.

With this change, we no longer add these errant NetworkAttachment
objects to nodes.

Signed-off-by: Drew Erny <drew.erny@docker.com>
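A minimal sketch of the kind of guard described, with hypothetical simplified types (not the actual swarmkit code):

```go
package main

import "fmt"

type Network struct{ ID string }

type NetworkAttachment struct {
	Network   *Network
	Addresses []string
}

type Node struct{ Attachments []*NetworkAttachment }

func main() {
	node := &Node{}
	candidates := []*NetworkAttachment{
		{Network: &Network{ID: "overlay1"}, Addresses: []string{"10.0.0.5/24"}},
		{Network: nil}, // leftover from an interrupted allocation run
	}
	for _, attach := range candidates {
		// Guard against partially-allocated attachments: appending one
		// with a nil Network would later be dereferenced unconditionally
		// and crash the allocator.
		if attach.Network == nil {
			continue
		}
		node.Attachments = append(node.Attachments, attach)
	}
	fmt.Println(len(node.Attachments)) // 1
}
```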
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Oct 18, 2018
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Oct 18, 2018
(cherry picked from commit 2d71271)
Signed-off-by: Drew Erny <drew.erny@docker.com>
anshulpundir pushed a commit to anshulpundir/swarmkit that referenced this pull request Oct 19, 2018
dperny added a commit to dperny/swarmkit-1 that referenced this pull request Nov 1, 2018
Forward ports the changes in moby#2725 to the new allocator.

Signed-off-by: Drew Erny <drew.erny@docker.com>