
Several seconds delay after attaching with re-cycled IP address #441

Closed
magnars opened this issue Mar 6, 2015 · 9 comments

@magnars

magnars commented Mar 6, 2015

Hi!

I'm experimenting with attaching/detaching docker containers with weave. It seems pretty consistent that when attaching to an IP that has previously been used (and detached) - even though the attachment is immediately visible in weave ps - it takes quite a few seconds to actually connect. Attaching to new IPs is instant.

I was asked to open an issue here by @rade, who said:

re-cycling IPs should be fine, and any update delays should be quite small. If you have an easily reproducible example for when there is a long delay, please file an issue.

Here is my reproduction. The luisbebop/echo-server image just echoes back what is sent to it on port 8800. The tutum/curl image is an Ubuntu image with curl installed.

docker run --name echo -d luisbebop/echo-server
sudo weave launch
sudo weave attach 10.10.0.7/16 echo
docker attach $(sudo weave run 10.10.0.6/16 -tid tutum/curl)
curl 10.10.0.7:8800

Now, outside of the curl container, detach then attach:

sudo weave detach 10.10.0.7/16 echo
sudo weave attach 10.10.0.7/16 echo
sudo weave ps

The weave ps output shows that the attachment is in effect, but back inside the curl container:

curl 10.10.0.7:8800

This takes quite a while. It eventually works, though. With a new IP there is no delay.

I am running this on Ubuntu 14.04 LTS inside VirtualBox, if that makes a difference.

The use case I'm looking at is seamless deploys: spinning up a new application container and switching over to it, without other services needing to be restarted/linked. It seems everything is going to be fine until I run out of new IPs to assign to containers :)

My hope at this point is that it won't be a problem in practice when I cycle through 255 IP addresses. Otherwise I might be able to wait for the attachment to go fully into effect before detaching the old app.

@dpw
Contributor

dpw commented Mar 6, 2015

Thanks, I can reproduce this.

It's due to the ARP (aka neighbour) cache. In your example, when you do weave detach then weave attach, a fresh MAC address is allocated to the container. On Linux, with the default value of /proc/sys/net/ipv4/neigh/*/base_reachable_time_ms, ARP mappings can take up to 45 seconds to be marked stale. So it can take a while for containers to realize they need to ARP for the new MAC address. You can see this by doing ip -s neigh show in the curl container.
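
For example, a quick way to watch those entries from the host is to reuse the nsenter pattern that comes up later in this thread (where $CONTAINER is assumed to be the curl container's ID):

# Show the neighbour cache inside the container's network namespace;
# the entry for 10.10.0.7 will show up as STALE once reachable_time expires.
nsenter -n -t $(docker inspect --format='{{.State.Pid}}' $CONTAINER) ip -s neigh show dev ethwe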

You can reduce that time by setting /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms (and then restarting the containers on the weave network). That sysctl is a time in milliseconds; the kernel multiplies it by a random number in the range 0.5 to 1.5 to get the actual expiry time. Its default value is 30000. When I set it to 5000, recovery becomes much quicker (i.e. below 10 seconds).

Reducing the reachable time will increase ARP traffic, although not necessarily by a huge amount, because receiving data on a TCP connection prevents the associated ARP mapping from going stale.

Changing /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms might be undesirable for other reasons, because it could affect network devices not associated with weave. It should be straightforward to make the weave script adjust the setting only for the container ethwe devices it creates. Would that work for you? If it does, we can consider whether that kind of solution would be a good idea in general.

The ideal solution would be for weave to actively invalidate ARP mappings across the network when a container is detached, but that would be a much more substantial change.

@rade rade added the chore label Mar 7, 2015
@rade
Member

rade commented Mar 7, 2015

Reducing base_reachable_time_ms on the ethwe interface in the container does sound like a good idea.

Btw, when I looked at ARP behaviour a while ago, I found this stackoverflow entry most informative. Based on that, it would seem that there is little risk in reducing base_reachable_time_ms; it should only increase ARP traffic in pathological (though far from impossible) cases, e.g. when the interaction with the remote host is mostly unidirectional, i.e. traffic only comes back sporadically or after some significant delay.

@dpw
Contributor

dpw commented Mar 9, 2015

Btw, when I looked at ARP behaviour a while ago, I found this stackoverflow entry most informative.

I saw that, but I'm not sure it is entirely accurate, or at least not up to date. In the current kernel source code, I haven't found a mechanism by which route cache entries pin ARP/neighbour cache entries. And the statement that "the kernel will sometimes change timeout values based on positive feedback" is fairly vague, given that this seems to be the key mechanism preventing ARP entries from going stale (search for dst_confirm to see where this occurs; there are quite a few call sites, but the one on incoming TCP acks looks like the most significant).

So I think we'd need to confirm experimentally what effect changing base_reachable_time_ms actually has. Though the actual impact would be heavily workload-dependent, of course.
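
For reference, a quick way to locate those call sites in a kernel source tree (assuming you have one checked out):

# List the dst_confirm call sites; the incoming-TCP-ack one mentioned above
# lives in the TCP input path under net/ipv4/
grep -rn dst_confirm net/ipv4/ net/core/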

@dpw dpw self-assigned this Mar 9, 2015
@binocarlos
Contributor

I'm having this exact issue trying to migrate a container and its IP to another host - anything that has spoken to it in the original location suffers the delay.

Can confirm that the moment I echo 5000 > /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms, the delay becomes negligible.

A couple of points/questions:

  • Is there a way of lowering the base_reachable_time_ms for a certain interface/container?
  • I suppose this would need to be done on every host that might have routed to the moved IP.
  • Could it be useful to lower the value - wait n * 2 then up it again - as a kind of cache buster?

Thanks for this thread - gave me a huge boost to remove the delay and get on with the demo I'm doing :-)

@dpw
Contributor

dpw commented Mar 11, 2015

Can confirm that the moment I echo 5000 > /proc/sys/net/ipv4/neigh/default/base_reachable_time_ms, the delay becomes negligible.

Thanks. We're very likely to make the weave script set that for the ethwe interface on all containers started with weave run.

A couple of points/questions:

  • Is there a way of lowering the base_reachable_time_ms for a certain interface/container?
nsenter -n -t $(docker inspect --format='{{.State.Pid}}' $CONTAINER) -- sh -c 'echo 5 >/proc/sys/net/ipv4/neigh/ethwe/base_reachable_time'

where $CONTAINER is the container ID.

  • I suppose this would need to be done on every host that might have routed to the moved IP.

Yes.

  • Could it be useful to lower the value - wait n * 2 then up it again - as a kind of cache buster?

No, Linux doesn't actually evict entries from the ARP cache due to that value, it just marks them as stale. So dropping the value then increasing it again won't have much of an effect.

@dpw
Contributor

dpw commented Mar 11, 2015

After a lot of poking at containers and looking at the impact on the ARP cache, and comparing the observed behaviour with the kernel code, I've convinced myself that setting base_reachable_time to 5 seconds is a safe and reasonable thing to do, as long as you also set delay_first_probe_time to 2 seconds (so that delay_first_probe_time is always less than the effective reachable_time, which can be as low as half of base_reachable_time).
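
A sketch of applying both settings to a container's ethwe interface, using the same nsenter pattern as above (where $CONTAINER is the container ID; the values are the ones discussed here, not necessarily what the weave script will end up using):

# Set the per-interface neighbour timings inside the container's network namespace
nsenter -n -t $(docker inspect --format='{{.State.Pid}}' $CONTAINER) -- sh -c '
  echo 5 >/proc/sys/net/ipv4/neigh/ethwe/base_reachable_time     # in seconds
  echo 2 >/proc/sys/net/ipv4/neigh/ethwe/delay_first_probe_time  # stays below the minimum effective reachable_time (2.5s)
'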

I have almost finished writing a long explanation of why this is so, but I'm not going to get that done before the end of the day, and I want to get the PR out today.

The "tl;dr" is that the kernel has an intricate mechanism so that, if two IP hosts are communicating using TCP, they usually won't send ARP requests to each other, even if there are periods much longer than reachable_time when they don't communicate.
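
A rough way to see that mechanism in action (assuming tcpdump is available in the curl container from the original report):

# Watch for ARP traffic on the weave interface while a long-lived TCP connection
# to 10.10.0.7 is active; with traffic flowing both ways you should see few or no
# requests, even after reachable_time has passed.
tcpdump -ni ethwe arp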

@dpw
Contributor

dpw commented Mar 12, 2015

Could it be useful to lower the value - wait n * 2 then up it again - as a kind of cache buster?

BTW, if you really want to flush the ARP cache, you can do it with

nsenter -n -t $(docker inspect --format='{{.State.Pid}}' $CONTAINER) ip neigh flush dev ethwe

or you can remove specific entries with

nsenter -n -t $(docker inspect --format='{{.State.Pid}}' $CONTAINER) ip neigh flush to $IP_ADDRESS

But you might need to do this for every container on the weave network. Ideally, I think weave would do this for all containers on the network whenever a container is detached, but weave doesn't currently distribute such notifications, so this is not a small change.
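
Purely as an illustration, a naive per-host version of that invalidation could look like the following (it flushes the detached $IP_ADDRESS from every running container on the host, whether or not that container is on the weave network, and it would still have to be run on every host):

# Flush the stale neighbour entry for $IP_ADDRESS in every running container on this host
for c in $(docker ps -q); do
  nsenter -n -t $(docker inspect --format='{{.State.Pid}}' $c) ip neigh flush to $IP_ADDRESS
done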

@binocarlos
Contributor

@dpw thank you so much, this is really helpful.

From my observations getting a demo of a migrating IP to work, setting a value of 5000 on the ethwe interface would be a great change to make. I can't speak for performance under load, but in terms of observable latency it works great.

Would this remove the need for busting the cache for specific containers? Doing that across a cluster feels like a really hard thing to do.

Thanks again :-)

@dpw
Contributor

dpw commented Mar 13, 2015

Oh great: "fix base_reachable_time(_ms) not effective immediatly when changed" torvalds/linux@4bf6980dd032853
