Use direct server return in east-west overlay load balancing #2270
Lint Failure
Thanks. I'd patched it but forgot to push. Have updated now. In general, am not sure how I feel about the CamelCase rule for constants that are meant to be one-to-one mappings with constants from the OS (as in this case). It would seem like those should keep the same form in order to be easy to discover / line up and verify.
@ctelfer, which release are we planning to pick up this new change?
I don't know. I just wanted to put this out there to make sure it was captured. If we can't find a good way to match the behavior in Windows we may have to introduce swarm changes or a new type of network (e.g. "overlay2"). My one thought for Windows integration so far is that we could do an ingress NAT for traffic coming in on the VIPs to direct the traffic to a container on the node. Ideal would be to match on the MAC to direct to a specific container, but we could do a second layer of load balancing where the incoming node load balancer knows the identity of all containers on the local node. This is actually something we don't currently track, so it's a bit of an overhaul to add it. It also seems like a Bad Idea to have two layers of load balancing where the second doesn't serve any real purpose except NAT.
@@ -51,6 +51,10 @@ type Sandbox interface {
// RemoveAliasIP removes the passed IP address from the named interface
RemoveAliasIP(ifName string, ip *net.IPNet) error

// DisableARPForVIP disables ARP replies and requests for VIP addresses
// on a particular interface
DisableARPForVIP(ifName string) error
better DisableARPForIfc?
Hrmm.. Don't object to renaming, but the suggestion is less accurate. The function doesn't disable ARP for all addresses. It just disables it for addresses assigned to a different interface. We really only care that the VIP addresses have ARP suppressed for purposes of this API. But the operation does change behavior slightly. Say you had:
+---------------+
eth0(10.1.0.2) ------+ container +------ eth1(10.2.0.7)
+---------------+
The way containers are currently configured, if you ARPed 10.2.0.7 on eth0, you'd get a response back indicating that that address was on that interface. The ARP configuration that this function performs would prevent that (the container would ignore incoming ARPs for 10.2.0.7 on eth0). Ideally, we would be configuring this behavior ONLY for the VIP addresses (which get assigned to lo). Unfortunately, there isn't a nice way to do that without using arptables as far as I know. And requiring arptables and the like on docker installations is not feasible.
As far as ARPing across interfaces goes, my personal opinion is that the restriction this function puts in place should be the default behavior. But the Linux devs disagree. In any case, I don't think the more relaxed behavior is something an application developer should rely on (see https://lwn.net/Articles/45373/), so the restriction should be OK. It would also be a property of the network in any case (i.e. it only applies to endpoints that are on a "dsr"-option-enabled "overlay" network).
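For readers following along, here is a rough sketch of the kind of per-interface ARP restriction described above. It assumes the restriction is expressed through the kernel's `arp_ignore` and `arp_announce` sysctls with values 1 and 2; those values come from the kernel's ip-sysctl documentation and are an assumption here, not a copy of the code in this PR. The helper names `arpSettings` and `applyARPSettings` are hypothetical.

```go
package main

import (
	"os"
	"path/filepath"
)

// arpSettings returns the per-interface sysctl entries (relative to
// /proc/sys) that restrict ARP on ifName. Assumed values:
//   arp_ignore=1   reply only if the target IP is configured on the
//                  interface the request arrived on
//   arp_announce=2 always pick the best local source address for the
//                  target when sending ARP requests
func arpSettings(ifName string) map[string]string {
	base := "net/ipv4/conf/" + ifName
	return map[string]string{
		base + "/arp_ignore":   "1",
		base + "/arp_announce": "2",
	}
}

// applyARPSettings writes the settings under root. On a real host root
// would be "/proc/sys"; it is a parameter so the sketch can be tried
// against a scratch directory.
func applyARPSettings(root, ifName string) error {
	for rel, val := range arpSettings(ifName) {
		path := filepath.Join(root, filepath.FromSlash(rel))
		if err := os.WriteFile(path, []byte(val), 0o644); err != nil {
			return err
		}
	}
	return nil
}
```

Because these sysctls apply to the whole interface, they affect every address that lives on another interface, not just the VIPs on lo, which is the side effect discussed in the comment above.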
I see the point, it's difficult to express with a concise name. nvm for the moment then
I have pushed a change that makes the DSR behavior an overlay-network-specific property that one must enable by adding |
sandbox.go
Outdated
@@ -767,7 +775,10 @@ func (sb *sandbox) releaseOSSbox() {
}

for _, ep := range sb.getConnectedEndpoints() {
releaseOSSboxResources(osSbox, ep)
ep.Lock()
releaseOSSboxResources is already taking ep.Lock; can we do this check inside the function itself?
facepalm Yep, that's much more sensible...
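A minimal sketch of the pattern agreed on here: the function that already takes the endpoint lock also reads the state it needs, so the caller never touches the lock. The `endpoint` type, its `vips` field, and `releaseResources` are hypothetical stand-ins for illustration, not the libnetwork types.

```go
package main

import "sync"

// endpoint is a hypothetical stand-in for the real endpoint type.
type endpoint struct {
	sync.Mutex
	name string
	vips []string // VIP aliases programmed for this endpoint (assumed field)
}

// releaseResources takes the endpoint lock itself, snapshots and clears
// the state it needs, and unlocks before returning. Callers do not lock
// ep, so there is no double locking and no lock held across slow work.
func releaseResources(ep *endpoint) []string {
	ep.Lock()
	vips := append([]string(nil), ep.vips...)
	ep.vips = nil
	ep.Unlock()

	// The caller (or this function) can now remove the aliases from the
	// OS sandbox without holding the endpoint lock.
	return vips
}
```

Keeping the locking in one place also makes a second call harmless: it simply finds nothing left to release.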
Modify the load balancing for east-west traffic to use direct routing rather than NAT, and update tasks to use direct server return under Linux. This avoids hiding the source address of the sender and improves performance in single-client/single-server tests.

Signed-off-by: Chris Telfer <ctelfer@docker.com>
Allow DSR to be a configurable option through a generic option to the overlay driver. On the one hand this approach makes sense insofar as only overlay networks can currently perform load balancing. On the other hand, this approach has several issues. First, should we create another type of swarm scope network, this will prevent it working. Second, the service core code is separate from the driver code and the driver code can't influence the core data structures. So the driver code can't set this option itself. Therefore, implementing in this way requires some hack code to test for this option in controller.NewNetwork.

A more correct approach would be to make this a generic option for any network. Then the driver could ignore, reject or be unaware of the option depending on the chosen model. This would require changes to:

* libnetwork - naturally
* the docker API - to carry the option
* swarmkit - to propagate the option
* the docker CLI - to support the option
* moby - to translate the API option into a libnetwork option

Given the urgency of requests to address this issue, this approach will be saved for a future iteration.

Signed-off-by: Chris Telfer <ctelfer@docker.com>
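As an illustration of the kind of check the commit message describes, here is a small sketch of testing for the option at network-creation time. The helper name `wantsDSR` and the shape of the options map are assumptions made for this example; only the "overlay" network type and the "dsr" option name come from the discussion in this PR.

```go
package main

// wantsDSR reports whether a network being created should use direct
// server return. Hypothetical helper: it assumes driver options arrive
// as a string map and that the mere presence of a "dsr" key enables
// the behavior.
func wantsDSR(networkType string, driverOpts map[string]string) bool {
	// Only overlay networks perform load balancing today, so any other
	// network type ignores the option.
	if networkType != "overlay" {
		return false
	}
	// Indexing a nil map is safe in Go, so callers may pass nil.
	_, ok := driverOpts["dsr"]
	return ok
}
```

The hard-coded network type is exactly the limitation the commit message calls out: a second swarm-scope driver would need its own case here, which is why a generic network option would be the cleaner long-term design.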
LGTM
This is a WIP to update the load balancing to support direct server return (DSR). The notion, at a high level, is that instead of the IPVS load balancer modifying the destination IP address of a packet, it only modifies the destination MAC address and leaves the VIP in place. Furthermore, it avoids performing SNAT on the outgoing packet. The tasks in a service would each be programmed with the VIP(s) as an IP alias on the loopback interface so that they can receive and accept said packet. This requires that the stack enable IP forwarding, but libnetwork already sets this by default (in Linux) for other reasons. To round out the picture, the server must also have ARP configured so that it neither responds to ARP queries for the VIP nor issues ARP queries with the VIP as the source protocol address.
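To make the difference concrete, here is a toy model of the two forwarding modes described above. It is only a sketch: the `frame` and `backend` types and both functions are hypothetical, and real IPVS operates on packets in the kernel rather than on Go structs.

```go
package main

// frame models only the header fields that matter for this comparison.
type frame struct {
	SrcIP  string
	DstIP  string
	DstMAC string
}

// backend is a service task chosen by the load balancer.
type backend struct {
	IP  string
	MAC string
}

// forwardNAT models the existing behavior: the destination IP is
// rewritten to the task's address and the source is replaced (SNAT),
// so the task never sees the original client address.
func forwardNAT(f frame, b backend, lbIP string) frame {
	f.SrcIP = lbIP
	f.DstIP = b.IP
	f.DstMAC = b.MAC
	return f
}

// forwardDSR models direct server return: only the destination MAC
// changes. The VIP stays in DstIP, which is why each task needs the
// VIP as an alias on its loopback interface, and the client address
// stays in SrcIP so the task can reply to the client directly.
func forwardDSR(f frame, b backend) frame {
	f.DstMAC = b.MAC
	return f
}
```

For a client at 10.0.0.9 sending to a VIP of 10.0.0.2, the DSR path delivers a frame whose IP header is unchanged, while the NAT path delivers one where neither address is the original.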
This approach will not easily work for ingress processing because libnetwork uses SNAT to ensure the routability of traffic from the outside network. However, it does address a long-standing concern for L7 load balancers running in the host network and balancing traffic to internal Docker networks (similar to moby/moby#35082). In this case, the L7 load balancer in the host network would be limited in the number of unique 5-tuples that it can open to the service task(s), leading to the connection tracking recycling issues mentioned. This change would also address the concern that some folks have voiced about NAT hiding the original address of the client in east-west traffic.
The PR so far does not have any support for Windows networking, which is why it is a WIP. I have tested it with Linux clusters and it has passed my tests thus far.