This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

441 slow ARP fixes #457

Merged
merged 8 commits into from
Mar 16, 2015

@dpw dpw commented Mar 16, 2015

Addresses #441.

dpw added 7 commits March 16, 2015 15:12
When an IP address was reused by a new container, other containers
could experience delays of many seconds the first time they tried to
talk to that new container before successful communication would
occur.  This delay came from three sources related to ARP cache
behaviour:

- The kernel would take a while to decide that an ARP mapping was
  stale due to lack of positive evidence (due to the
  base_reachable_time default of 30s)

- Even once an ARP mapping was considered stale, the kernel would wait
  a few seconds before sending an ARP request, in the hope that a
  higher layer (e.g. TCP) would confirm the ARP entry (due to the
  delay_first_probe_time default of 5s)

- When the kernel sent out an ARP request to update a stale entry,
  initial attempts would unicast the request to the old MAC address
  before resorting to broadcast (due to the ucast_solicit default of
  3)

This change reduces all three values, so that any such delays last
only a few seconds.

For most workloads, this should not greatly increase the amount
of ARP traffic, because communication via TCP is often sufficient
to confirm ARP entries.  And one unicast ARP request is still sent
just in case.
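
As a rough illustration of the kind of change described above, here is a minimal sketch that writes the three per-interface neighbour-table settings under `/proc/sys/net/ipv4/neigh/<interface>/`. The interface name `ethwe`, the helper name `configure_arp_cache`, and the values for `base_reachable_time` and `delay_first_probe_time` are assumptions made for the example; only `ucast_solicit = 1` follows from the text ("one unicast ARP request is still sent"). Note that the kernel exposes the probe delay as `delay_first_probe_time`.

```python
import os

# Illustrative values. Only ucast_solicit=1 is implied by the commit message;
# the other two are assumed for this sketch.
ARP_SETTINGS = {
    "base_reachable_time": 5,     # seconds (kernel default: 30); assumed value
    "delay_first_probe_time": 2,  # seconds (kernel default: 5); assumed value
    "ucast_solicit": 1,           # unicast probes (kernel default: 3)
}


def configure_arp_cache(ifname, settings=ARP_SETTINGS, proc_root="/proc"):
    """Write neighbour-table settings for one interface.

    proc_root is a parameter so the function can be pointed at a scratch
    directory instead of the real /proc. Returns the list of paths written.
    """
    neigh_dir = os.path.join(proc_root, "sys", "net", "ipv4", "neigh", ifname)
    written = []
    for name, value in settings.items():
        path = os.path.join(neigh_dir, name)
        with open(path, "w") as f:
            f.write(f"{value}\n")
        written.append(path)
    return written
```

Against the real `/proc` this would need root privileges and would have to run inside the container's network namespace, e.g. `configure_arp_cache("ethwe")`, since the settings are per interface and per namespace.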

Unfortunately, a kernel bug (fixed only recently) means that a change
to base_reachable_time can take a while to take effect:
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4bf6980dd0328530783fd657c776e3719b421d30>.
This means that a container will continue to use a long reachable_time
delay for the first few minutes of its life.

Not quite a pure refactoring because it adds a missing
'validate_cidr'.

Using arping, which is part of the iputils package and so seems to
be ubiquitous.

The most likely cause of failure is the container going away, in which
case we don't need to clean up.  And there are other points of possible
failure in the same category in attach(), so making sure we clean up the
container interface under all conditions is a broader issue.

@dpw dpw force-pushed the 441_slow_arp_fixes branch from 2ff5f2e to 4dcf77c Compare March 16, 2015 15:12
@rade rade merged commit 4dcf77c into weaveworks:master Mar 16, 2015
@rade rade removed the in progress label Mar 16, 2015
@rade rade modified the milestone: 0.10.0 Apr 18, 2015
@rade rade mentioned this pull request Apr 22, 2015