DNS load balancing with Weave is biased by getaddrinfo's sorting #1245
Comments
I reckon we should just document the behaviour. It's not actually completely unreasonable, i.e. it favours local servers over others, right? |
I had a bit-more-than-a-quick look through the algorithm and AFAICT we can't influence the sort order - it seems to be based purely on the IP addresses returned. I considered removing the shuffle code from the DNS server (as it's not useful), but in the case of no IPs being considered 'local', it might actually do what we expect, as the behaviour of the algorithm is to preserve the order as a last resort. This needs testing. Assuming the above is true, and no one finds any new information, we will document this behaviour and close this issue. |
Yes, for non 'local' IPs, the shuffling is useful - 3 containers on host2, and from host1 you get random:
From host2 you always get 10.40.0.2:
|
Add note to explain DNS shuffle effects. Fixes #1245.
Also relevant: http://www.zytrax.com/books/dns/ch9/rr.html#services .
So, it seems that if glibc respected RFC 6724 we wouldn't have this problem |
It occurs to me that we could get our DNS to resolve special names "blah.single.weave.works" (or s.t. like that) to a single random record chosen from "blah.weave.works". |
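A minimal sketch of that idea, written in Python against a toy in-memory record table rather than the real WeaveDNS code. The suffix constants, the `resolve` function and the `records` dict are all hypothetical names used for illustration only:

```python
import random

# Hypothetical suffixes, taken from the naming idea in the comment above.
SINGLE_SUFFIX = ".single.weave.works"
BASE_SUFFIX = ".weave.works"


def resolve(name, records, rng=random):
    """Return the A records to answer with for `name`.

    `records` is a toy dict mapping a hostname to its list of IPs.
    A name under SINGLE_SUFFIX is answered with exactly one record,
    picked at random from the matching name under BASE_SUFFIX.
    """
    if name.endswith(SINGLE_SUFFIX):
        base = name[: -len(SINGLE_SUFFIX)] + BASE_SUFFIX
        addrs = records.get(base, [])
        # One record only, so the client has nothing left to re-sort.
        return [rng.choice(addrs)] if addrs else []
    # Ordinary names keep the existing behaviour: all records, shuffled.
    addrs = list(records.get(name, []))
    rng.shuffle(addrs)
    return addrs
```

With a single record in the answer, `getaddrinfo` has no ordering to impose, so the choice made by the server is the one the client uses.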
@rade similar: hashicorp/consul#1481 |
Here's an example of how this behaves... the behavior depends on the exact layout of the bits in the client/server container addresses: https://gist.github.com/SpComb/c509bd064bc75151e6b41e8bc949d13f For multiple server addresses in the same subnet as the client, glibc seems to use a longest-prefix match to sort the destination addresses: |
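The longest-prefix comparison described above can be modelled roughly as follows. This is a simplified Python sketch, not glibc's actual code: it assumes IPv4 only, assumes the client's own address is the source address for every destination, and ignores the other sorting rules. The function names are made up for illustration:

```python
import ipaddress


def common_prefix_len(a, b):
    """Number of leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    # bit_length() is the position of the highest differing bit.
    return 32 - diff.bit_length()


def sort_by_longest_prefix(client, servers):
    """Order server addresses by how many leading bits they share with the client.

    sorted() is stable, so addresses with equal prefix length keep the
    order in which the DNS server returned them.
    """
    return sorted(servers, key=lambda s: -common_prefix_len(client, s))
```

Under this model, a server that shares more leading bits with the client always sorts first, and only ties preserve the DNS server's shuffle.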
WTF github, I did not unassign anyone... must be some accidental keyboard hotkey. Anyways, the best workaround I can figure out for this would indeed be to only return a single randomly chosen A record for such round-robin service names... in the case of servers going down, that would rely on the client applications to retry resolving and connect to a different IP (in addition to healthchecks to remove the broken servers from the DNS pool). |
Weave probably cannot do much about this one, but it's a good example of how unreliable it is to do load balancing based on plain A-records, following on #1213.
If hostname `foo` maps to three IPs in WeaveDNS, the user would expect to randomly and uniformly load balance across those IPs when connecting to hostname `foo`. However, due to the A-record sorting of `getaddrinfo` (see `gai.conf`'s manpage for further details) that's not the case.

For instance, `getaddrinfo` always prioritizes the local IP. Here's an example:

I launch 3 containers with the same A-record (`foo`) in WeaveDNS.

Now, if I connect to `foo` from an external container, access seems to be random:

However, if I connect to `foo` inside the containers named `foo`, the local container is always favored:

This is because Ubuntu uses OpenBSD's variant of `nc`, which uses `getaddrinfo` for name resolution:

When using `netcat-traditional` (which uses `gethostbyname`), access to `foo` is random, even when the local container is associated with a `foo` A-record:
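One way to check for this bias from inside a container is to resolve the name repeatedly and count which address comes back first. The following is a rough Python sketch; `system_resolver` and `first_address_counts` are hypothetical helper names, and it relies on the assumption that Python's `socket.getaddrinfo` delegates to the C library's `getaddrinfo`, so it should see the same sorting:

```python
import socket
from collections import Counter


def system_resolver(name, port=80):
    """Resolve `name` to IPv4 addresses in the order getaddrinfo returns them."""
    infos = socket.getaddrinfo(name, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port).
    return [info[4][0] for info in infos]


def first_address_counts(name, trials=100, resolve=system_resolver):
    """Count how often each address is listed first across `trials` lookups."""
    counts = Counter()
    for _ in range(trials):
        addrs = resolve(name)
        if addrs:
            counts[addrs[0]] += 1
    return counts
```

Calling `first_address_counts("foo")` from an external container and from one of the `foo` containers should make the difference visible: a roughly even spread in the first case, and a single dominant address in the second. The `resolve` parameter exists so a different lookup function can be swapped in for comparison.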