net: retry DNS lookups before failure? #16865

bradfitz · 2016-08-24T19:58:32Z

I've noticed that our net DNS tests are often flaky when run on the builders.

For example:

https://build.golang.org/log/ce5a87135d1a5ed4f17bd998ace2e0060b9ad597
https://build.golang.org/log/b3e762fc83d463acba21987ff558c8018b33c7cb
https://build.golang.org/log/250fc567590d125f1c8fd27740105eb7288ab16c

--- FAIL: TestLookupDotsWithRemoteSource (5.05s)
    lookup_test.go:566: LookupSRV(xmpp-server, tcp, google.com): lookup _xmpp-server._tcp.google.com on 8.8.8.8:53: no such host (mode=go)

--- FAIL: TestLookupDotsWithRemoteSource (5.46s)
    lookup_test.go:540: LookupMX(google.com): lookup google.com on 8.8.8.8:53: no such host (mode=cgo)
FAIL
FAIL    net 7.838s

--- FAIL: TestLookupGmailNS (5.01s)
    lookup_test.go:142: lookup gmail.com. on 8.8.8.8:53: dial udp 8.8.8.8:53: i/o timeout
FAIL
FAIL    net 7.336s

etc.

Notice they all fail after about 5 seconds (our default DNS timeout).

Did a UDP request get lost?

Did a UDP response get lost?

Does NAT make some builders worse?

Should we make builders re-try all DNS tests N times?

But this is also flaky, though to a much lesser degree, on my desktop on wired ethernet. With 500 runs, I still see occasional failures.

Maybe we should make our net package's DNS code automatically resend the UDP request after half the timeout? (i.e. after 2.5 seconds by default)

/cc @mdempsky @josharian @minux @ianlancetaylor @mikioh

mdempsky · 2016-08-24T20:13:54Z

I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.

A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.

bradfitz · 2016-08-24T21:02:01Z

> I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.

Flaky tests is how I started down this path, but then I realized our DNS client just might need work too.

But looking at the cited test failures, I see one is pure Go, one is cgo, and one is dial udp 8.8.8.8:53: i/o timeout ... how does a UDP dial time out!? Isn't it instant?

> A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.

@adg and I started working on that once (can't find the bug) but never finished, apparently.

bradfitz · 2016-08-25T02:18:16Z

It seems our code already tries to make a configured number of attempts (the loop for i := 0; i < cfg.attempts; i++ { in func tryOneName in dnsclient_unix.go), but I don't think it's necessarily working.

It looks like one deadline is set up before the loop, so if the first attempt fails due to timeout, all the remaining attempts will necessarily fail too, because the deadline has already passed.

What do other DNS implementations do?

mdempsky · 2016-08-25T04:12:19Z

Yeah, that appears to be part of the problem at least. libresolv in glibc uses cfg.timeout to compute individual UDP round-trip timeouts, not as a global timeout.

It has a kind of goofy timeout calculation logic though. For the first server in the nameserver list, it uses cfg.timeout directly. But for the rest, it uses (cfg.timeout << ns) / len(cfg.servers), where ns is the server's (0-based) index in cfg.servers.

Checking glibc commit history, it looks like that logic came from BIND 8.2.3 in 2000 (see bminor/glibc@e685e07). Prior to that, there was a seemingly saner approach: for the first attempt to each server, use cfg.timeout directly; but for retries use (cfg.timeout << try) / len(cfg.servers).

I want to say this is just an accident because of how they split out a function similar to our exchange, but renaming the variable from try to ns seems like it was intentional.

djbdns's client library doesn't respect the timeout/retry settings in resolv.conf. In stub resolver mode, it simply always uses 3 retries, and uses timeouts of 3s, 11s, and 45s per UDP query.

bradfitz · 2016-08-26T01:47:18Z

More good comments in #16885 (comment)

bradfitz · 2016-08-26T01:47:40Z

And #16885 (comment) suggests it might be a Go 1.7 regression worth fixing in a point release.

mdempsky · 2016-08-26T01:53:40Z

Related to integrating context support?

bradfitz · 2016-08-26T01:54:46Z

I think so.

Nitecon · 2016-08-26T05:14:56Z

This would definitely be good to fix in a 1.7 point release, as it currently causes us production impact and does not handle max timeouts the way other resolver implementations do, as mentioned by Alex in #16885.

mdempsky · 2016-08-26T05:32:50Z

Behavior changed in f60fcca.

mdempsky · 2016-08-29T20:26:28Z

@bradfitz Have you already started on this? If not, I can prepare a CL.

bradfitz · 2016-08-29T20:56:37Z

I haven't. Please do!

gopherbot · 2016-08-29T23:00:35Z

CL https://golang.org/cl/28057 mentions this issue.

bradfitz · 2016-08-29T23:42:53Z

@mdempsky, think we should cherry-pick this back to Go 1.7.1?

/cc @ianlancetaylor @broady

mdempsky · 2016-08-29T23:47:47Z

@bradfitz It should definitely be cherry-picked to Go 1.7.x. If 1.7.1 isn't final yet, I think this makes sense to include.

mlafeldt · 2016-09-01T18:28:59Z

This might be related: One of our core components actually experienced many dial udp 10.8.0.2:53: i/o timeout errors in production after updating from Go 1.6 to 1.7. We had to switch to the cgo DNS implementation to still be able to move forward with Go 1.7.

We were actually wondering why the Go DNS resolver does not try TCP when UDP fails here:

go/src/net/dnsclient_unix.go, lines 161 to 164 in adb1e67:

    c, err := d.dialDNS(ctx, network, server)
    if err != nil {
        return nil, err
    }

/cc @Luzifer @denderello

mikioh · 2016-09-02T01:11:01Z

Dropping the Testing label because this looks like a functionality issue.

The handling of "options timeout:n" is supposed to be per individual DNS server exchange, not per Lookup call.

Fixes #16865.

Change-Id: I2304579b9169c1515292f142cb372af9d37ff7c1
Reviewed-on: https://go-review.googlesource.com/28057
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-on: https://go-review.googlesource.com/28640

mikioh · 2016-10-05T01:04:15Z

This is a regression in Go 1.7. Go 1.6.3 and earlier, and Go 1.7.1 and later, work fine.

jbrook · 2016-11-17T09:00:24Z

Maybe I shouldn't be commenting on closed tickets but we have been experiencing DNS UDP dial timeouts in our production environment using Go 1.7.3.

dial tcp: lookup xxx.xxxxxx.com on 10.129.0.2:53: dial udp 10.129.0.2:53: i/o timeout"}

Go 1.7.3 using official Docker image on AWS EC2.
AWS EC2 instance has DNS resolving to a Route53 private hosted zone.

bradfitz · 2016-11-17T15:40:43Z

You can comment on closed tickets if you'd like, but it's just not very effectual, as we don't track closed tickets.

bradfitz added the Testing label Aug 24, 2016

bradfitz added this to the Go1.8 milestone Aug 24, 2016

bradfitz self-assigned this Aug 24, 2016

bradfitz mentioned this issue Aug 26, 2016

net: DNS timeout not implemented per spec #16885

Closed

mdempsky assigned mdempsky and unassigned bradfitz Aug 29, 2016

gopherbot closed this as completed in 11e3955 Aug 29, 2016

bradfitz modified the milestones: Go1.7.1, Go1.8 Aug 29, 2016

mikioh removed the Testing label Sep 2, 2016

ncw mentioned this issue Sep 8, 2016

tcp lookup www.googleapi.com no such host rclone/rclone#683

Closed

alext mentioned this issue Sep 8, 2016

Upgrade to Go 1.7.1 alphagov/router#114

Merged

n4wei mentioned this issue Sep 8, 2016

"cf api" fails if the 1st nameserver fails to resolve the endpoint cloudfoundry/cli#763

Closed

zcahana mentioned this issue Sep 19, 2016

Sidecar fails to update Nginx with Go 1.7.1 amalgam8/amalgam8#266

Closed

mikioh mentioned this issue Oct 5, 2016

net: Dial timeout reports incorrect problem DNS entry #17329

Closed

XenoPhex mentioned this issue Dec 29, 2016

cli 6.23.0 doesn't build with e.g. go-1.6.2 anymore cloudfoundry/cli#1035

Closed

scottjab-stripe mentioned this issue Mar 21, 2017

Bump golang version to 1.7.5 stripe-archive/sequins#70

Merged

gdetrez mentioned this issue Aug 19, 2017

Regression in Go 1.7 affecting DNS resolution hlandau/acmetool#271

Closed

golang locked and limited conversation to collaborators Nov 17, 2017

gopherbot added the FrozenDueToAge label Nov 17, 2017

rsc unassigned mdempsky Jun 23, 2022
