Only allocate attachments needed by nodes #2725
Conversation
Codecov Report

@@           Coverage Diff           @@
##             master   #2725   +/- ##
=========================================
  Coverage          ?   61.71%
=========================================
  Files             ?      134
  Lines             ?    21843
  Branches          ?        0
=========================================
  Hits              ?    13480
  Misses            ?     6898
  Partials          ?     1465
@anshulpundir this is not a fix for #1477. This is a fix for an issue that fixing #1477 would fix.
manager/allocator/network.go
Outdated
attach.Network.ID, node.ID,
)
}
node.Attachments = append(node.Attachments[0:i], node.Attachments[i+1:]...)
May be being stupid here, but this seems like it would not work correctly. We are in a for-loop over the attachments. If we are on attachment i and we remove it from the slice, then attachment i+1 now occupies position i. But the loop is next going to inspect index i+1, which was previously i+2. In other words, we just skipped the attachment originally at i+1 before i was removed. Did I get that right?
huh... maybe. i'm unsure how this works actually. good catch.
Pushed up a change to fix this.
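For reference, and not necessarily the shape of the pushed fix: the usual way to avoid the remove-while-iterating skip described above is to filter the slice with a write index rather than deleting elements inside the range loop. The attachment type and the stillNeeded predicate below are stand-ins for illustration, not swarmkit's types.

```go
package main

import "fmt"

// attachment stands in for api.NetworkAttachment; only the field that
// matters for the illustration is included.
type attachment struct{ networkID string }

// pruneAttachments filters a node's attachments in place with a write
// index, so removing an element never causes the loop to skip the one
// that follows it.
func pruneAttachments(attachments []*attachment, stillNeeded map[string]bool) []*attachment {
	kept := attachments[:0]
	for _, a := range attachments {
		if stillNeeded[a.networkID] {
			kept = append(kept, a)
		}
	}
	return kept
}

func main() {
	atts := []*attachment{{"ingress"}, {"overlay1"}, {"overlay2"}}
	needed := map[string]bool{"ingress": true, "overlay2": true}
	for _, a := range pruneAttachments(atts, needed) {
		fmt.Println(a.networkID) // prints "ingress" then "overlay2"
	}
}
```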
Force-pushed from 287dea3 to 480c388.
manager/allocator/allocator_test.go
Outdated
// Validate node has 2 LB IP address (1 for each network).
watchNetwork(t, netWatch, false, isValidNetwork) // ingress
watchNetwork(t, netWatch, false, isValidNetwork) // overlayID1
watchNetwork(t, netWatch, false, isValidNetwork) // overlayID1
Is this last one validating that the overlayIDunused gets deleted from the network when task reconciliation occurs? If so, maybe update the comment. If not, then (pardon my confusion) what is the other network event here?
yeah sorry, i copypasted that line.
Re: the race condition -- what currently prevents Docker engine from creating and/or starting a task before the task's network attachments "arrive"? In other words, how does the fact that the LB attachment is associated with a node, but not a task, affect the way that swarm propagates said attachments to the node? Sorry for the dumb question -- just want to understand what might be involved from the engine side.
The network attachments for a task are embedded in the task object. The network attachments for a node are embedded in the node object, not in the task, but may be depended on by the task. Does that make sense?
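To make that distinction concrete, here is a small illustrative helper (my sketch, not swarmkit code; it only relies on the Node.Attachments, NetworkAttachment.Network, and NetworkAttachment.Addresses fields that appear elsewhere in this PR): the per-task attachments ride along in the task object, while the LB address the task depends on has to be looked up on the node object.

```go
package allocator

import "github.com/docker/swarmkit/api"

// lbIPForTaskNetwork is an illustrative helper, not swarmkit code: the
// task carries its own per-task attachments in Task.Networks, but the
// load balancer address for the same network lives on the node, in
// Node.Attachments, so the engine has to join the two objects.
func lbIPForTaskNetwork(node *api.Node, networkID string) (string, bool) {
	for _, attach := range node.Attachments {
		if attach.Network != nil && attach.Network.ID == networkID && len(attach.Addresses) > 0 {
			return attach.Addresses[0], true
		}
	}
	return "", false
}
```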
Force-pushed from 480c388 to 0c29da1.
We should "forward-port" this behavior change to the new allocator.
Ok, got it re: the attachments ... makes sense. I presume there is, then, no way to ensure that the node update arrives before the task update at the Docker engine. Hrm... even if there were, the engine probably handles those in separate goroutines. If the task arrived before the node update, I suspect that the task allocation would just fail and then retry until it succeeds in the current engine code. But that is certainly noisy and suboptimal. Will look around to see how a "stalling" type behavior could be inserted for awaiting the swarm per-node IP. The good news is that such a change could easily be backwards-compatible with older swarm nodes, so upgrade wouldn't be an issue there.
Pretty sure the answer is "yes", but just to be sure -- will this scheme still work for containers (not tasks) connecting to an attachable overlay network via, say,
@ctelfer see the corresponding PR at moby/moby#37604 for what has to happen there. The actual case, if a task arrives before a node has an attachment, is much worse. The task will attempt to start and then fail. The task's failure will cause it to move into a terminal state and be deallocated. In the process, the node, now having no task requiring that network, will then have its attachment deallocated. A new task will get allocated and scheduled, and then the node attachment will be allocated, and the race starts again.
It looks like swarmkit's error library already provides a nice mechanism for reporting temporary errors and telling swarm to retry later. So, if I'm reading this right, it might be possible to address the race condition by changing this code:

```go
if agent && driver == "overlay" {
	nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
	if !exists {
		return nil, fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id)
	}
```

into something like:

```go
// Top of file
import swarmErrors "github.com/docker/swarmkit/agent/exec"

...

if agent && driver == "overlay" {
	nodeIP, exists := daemon.GetAttachmentStore().GetIPForNetwork(id)
	if !exists {
		return nil, swarmErrors.MakeTemporary(fmt.Errorf("Failed to find a load balancer IP to use for network: %v", id))
	}
```

Does this seem plausible?
@ctelfer let's bring that conversation into the other PR.
Ah, didn't see that you had started that PR. Will do so.
Don't know much about this code, but it seems OK as a temporary workaround.
manager/allocator/network.go
Outdated
// skip any tasks that are no longer relevant

// we only want tasks that are in desired state running. anything
// else is on its way out and irrelevant
If a task is e.g. running but desired to be shut down, then presumably the load balancer IP is still in use, so I don't think it's safe to deallocate it yet.
E.g., if a node has two tasks:
- Task1: state=running, desired=shutdown
- Task2: state=running, desired=shutdown
When Task2 terminates, we'll come here to decide whether we still need the load balancer IP. This function will say we can release it, I guess, even though we can't.
I think my logic here was that this would only allocate new attachments, not free old ones, but you describe the actual case, and my logic was wrong. I will change it.
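A sketch of the predicate being argued for here (illustrative only, using swarmkit's api types; not the code that ended up in the PR):

```go
package allocator

import "github.com/docker/swarmkit/api"

// taskStillUsesNodeAttachments is a sketch of the point above, not the
// PR's actual code: a task's networks should count toward the node's
// LB attachments until the task has actually left the running state,
// even if its desired state is already Shutdown.
func taskStillUsesNodeAttachments(t *api.Task) bool {
	// Observed terminal states in swarmkit are ordered after
	// TaskStateRunning, so anything at or before Running may still be
	// holding the load balancer IP.
	return t.Status.State <= api.TaskStateRunning
}
```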
manager/allocator/network.go
Outdated
for _, attach := range node.Attachments {
	// for every attachment, go through every network. if the attachment
	// belongs to one of the networks, then go to the next attachment. if
	// no network matches
Is the rest of this comment missing, or are we supposed to continue reading a few lines down?
I've tested this code end-to-end, for what it's worth. It avoids allocating when not needed and reclaims the addresses when they become available. I was able, for example, to saturate a /27 (32 total addresses) network with tasks on a 3-node cluster with 27 tasks when forcing all tasks to live on one node. This breaks down as:
When I run without node constraints, the number of tasks I can deploy is 25, which is expected because 2 more IP addresses are needed (i.e., an LB IP for each of the 3 nodes rather than just the one constrained node).
LGTM FWIW
Instead of allocating an attachment for every network on every node, allocate and deallocate attachments on nodes based off whether or not the node needs the attachment because a task using that network is on the node. Signed-off-by: Drew Erny <drew.erny@docker.com>
Force-pushed from 0c29da1 to 710aef8.
The test that's failing is the same test that's failing on master. I'd recommend moving forward with this PR in spite of the test failure.
When the network allocator starts, it performs two passes of allocation. The first, with existingAddressesOnly set to "true", simply re-allocates any already reserved addresses, which makes the local driver state consistent with the state in swarmkit's object store. The second pass then performs any outstanding new allocations from when the allocator last stopped.

Since moby#2725, nodes only have attachments allocated for them if they have a task currently scheduled which requires those networks. This happens after a task is allocated and scheduled. Before this change, it was possible that, if a Task was correctly allocated but the allocator stopped before the Node was also allocated, an empty api.NetworkAttachment object was added to the Node's attachments during the restore phase. Then, in the new-allocations phase, when trying to process all attachments, we unconditionally looked at the NetworkAttachment object's Network field, which was nil. This caused a segfault and crash.

With this change, we no longer add these errant NetworkAttachment objects to nodes.

Signed-off-by: Drew Erny <drew.erny@docker.com>
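For illustration only (the referenced commit's real fix is to stop creating the empty attachments, as its message says), the kind of guard that avoids the nil dereference during the restore passes looks like the sketch below; reserve is a hypothetical callback standing in for the real driver re-allocation.

```go
package allocator

import "github.com/docker/swarmkit/api"

// restoreNodeAttachments sketches the guard implied by the commit
// message above: when re-reserving already allocated addresses, skip
// any attachment whose Network is nil instead of dereferencing it.
func restoreNodeAttachments(node *api.Node, reserve func(*api.NetworkAttachment) error) error {
	for _, attach := range node.Attachments {
		if attach == nil || attach.Network == nil {
			// Nothing was actually allocated here; touching
			// attach.Network.ID would panic, which is the crash the
			// commit describes.
			continue
		}
		if err := reserve(attach); err != nil {
			return err
		}
	}
	return nil
}
```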
Forward ports the changes in moby#2725 to the new allocator. Signed-off-by: Drew Erny <drew.erny@docker.com>
- What I did
Instead of allocating an attachment for every network on every node, allocate and deallocate attachments on nodes based on whether the node needs the attachment because a task using that network is on the node.
A quick fix for the problem raised in #2721.
Essentially, there are load balancer optimizations that rely on nodes having network attachments (IP addresses) for every network being used on the node. Because we can't know which tasks will land on a node until after they've passed through the scheduler, we were previously taking the naive approach of allocating a network attachment for every network on every node. This results in far more IP addresses being used than necessary, which can quickly exhaust the subnet space on a large cluster.
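To put rough, illustrative numbers on that (not figures from this PR): an overlay network on a /24 subnet has about 254 usable addresses, so with the old behavior a 200-node cluster would consume 200 of them for per-node LB attachments alone, before any task or service VIP gets an address, even if only a handful of those nodes ever run a task on that network.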
In this PR, we simply watch task updates and allocate new network attachments as soon as we know the node that a task will land on. In this case, only the nodes running tasks which require a network actually have attachments for that network.
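A very rough sketch of that idea (illustrative names, not the PR's actual code; allocateNodeAttachment is a hypothetical stand-in for the real allocation path):

```go
package allocator

import "github.com/docker/swarmkit/api"

// ensureNodeAttachments is an illustrative sketch, not this PR's actual
// code: once a task update tells us which node a task landed on, make
// sure that node carries an attachment for each network the task uses.
func ensureNodeAttachments(t *api.Task, node *api.Node, allocateNodeAttachment func(*api.Node, *api.Network)) {
	if t.NodeID == "" {
		return // task not scheduled yet; no node to allocate for
	}
	have := make(map[string]struct{}, len(node.Attachments))
	for _, attach := range node.Attachments {
		if attach.Network != nil {
			have[attach.Network.ID] = struct{}{}
		}
	}
	for _, nw := range t.Networks {
		if nw.Network == nil {
			continue
		}
		if _, ok := have[nw.Network.ID]; !ok {
			allocateNodeAttachment(node, nw.Network)
		}
	}
}
```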
NOTE: This PR is WIP.
This approach means that, by the time the node attachment is allocated, the task has already passed through the scheduler and is headed down to the node, so the node update containing the new network attachment may not have arrived by the time the task is ready to start. There may or may not exist a race condition where a task cannot start because it needs a load balancer attachment that hasn't yet arrived. A second PR, either to Docker or to Swarmkit, may be needed to block task creation until all the requisite node attachments have arrived.
- How I did it
When task updates come in, attempt re-allocating the node.
- How to test it
Includes an automated test.
To manually test, create services with tasks attached to networks on a cluster with several nodes, and add and remove services so that tasks get created and destroyed. Inspect nodes and verify that they have attachments only for the networks being used by tasks on that node (and for ingress).
- Description for the changelog
Fix every node having a network attachment for every network, even if the node didn't need one.