Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: retry DNS lookups before failure? #16865

Closed
bradfitz opened this issue Aug 24, 2016 · 20 comments
Closed

net: retry DNS lookups before failure? #16865

bradfitz opened this issue Aug 24, 2016 · 20 comments

Comments

@bradfitz
Copy link
Contributor

I've frequently noticed that our net DNS tests running on builders are often flaky.

For example:

https://build.golang.org/log/ce5a87135d1a5ed4f17bd998ace2e0060b9ad597
https://build.golang.org/log/b3e762fc83d463acba21987ff558c8018b33c7cb
https://build.golang.org/log/250fc567590d125f1c8fd27740105eb7288ab16c

--- FAIL: TestLookupDotsWithRemoteSource (5.05s)
    lookup_test.go:566: LookupSRV(xmpp-server, tcp, google.com): lookup _xmpp-server._tcp.google.com on 8.8.8.8:53: no such host (mode=go)

--- FAIL: TestLookupDotsWithRemoteSource (5.46s)
    lookup_test.go:540: LookupMX(google.com): lookup google.com on 8.8.8.8:53: no such host (mode=cgo)
FAIL
FAIL    net 7.838s

--- FAIL: TestLookupGmailNS (5.01s)
    lookup_test.go:142: lookup gmail.com. on 8.8.8.8:53: dial udp 8.8.8.8:53: i/o timeout
FAIL
FAIL    net 7.336s

etc.

Notice they're all after 5 seconds. (our default DNS timeout)

Did a UDP request get lost?

Did a UDP response get lost?

Does NAT make some builders worse?

Should we make builders re-try all DNS tests N times?

But this is also flaky (but to a much lesser degree) on my desktop on wired ethernet. With 500 runs, I still see occasional failures.

Maybe we should make our net package's DNS code automatically resend the UDP request after half the timeout? (i.e. after 2.5 seconds by default)

/cc @mdempsky @josharian @minux @ianlancetaylor @mikioh

@bradfitz bradfitz added the Testing An issue that has been verified to require only test changes, not just a test failure. label Aug 24, 2016
@bradfitz bradfitz added this to the Go1.8 milestone Aug 24, 2016
@bradfitz bradfitz self-assigned this Aug 24, 2016
@mdempsky
Copy link
Contributor

I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.

A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.

@bradfitz
Copy link
Contributor Author

I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests.

Flaky tests is how I started down this path, but then I realized our DNS client just might need work too.

But looking at the cited test failures, I see one is pure Go, one is cgo, and one is dial udp 8.8.8.8:53: i/o timeout ... how does a UDP dial time out!? Isn't it instant?

A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response.

@adg and I started working on that once (can't find the bug) but never finished, apparently.

@bradfitz
Copy link
Contributor Author

It seems our code already does try to do a certain number of attempts (for i := 0; i < cfg.attempts; i++ { in dnsclient_unix.go:func tryOneName), but I don't think it's necessarily working.

It looks like one deadline is set up before the loop, then the first one will fail due to timeout, and all the rest will all necessarily fail because the timeout is already dead.

What do other DNS implementations do?

@mdempsky
Copy link
Contributor

Yeah, that appears to be part of the problem at least. libresolv in glibc uses cfg.timeout to compute individual UDP round-trip timeouts, not as a global timeout.

It has a kind of goofy timeout calculation logic though. For the first server in the nameserver list, it uses cfg.timeout directly. But for the rest, it uses (cfg.timeout << ns) / len(cfg.servers), where ns is the server's (0-based) index in cfg.servers.

Checking glibc commit history, it looks like that logic came from BIND 8.2.3 in 2000 (see bminor/glibc@e685e07). Prior to that, there was a somewhat seemingly more sane approach: for the first attempt to each server, use cfg.timeout directly; but for retries use (cfg.timeout << try) / len(cfg.servers).

I want to say this is just an accident because of how they split out a function similar to our exchange, but renaming the variable from try to ns seems like it was intentional.

djbdns's client library doesn't respect the timeout/retry settings in resolv.conf. In stub resolver mode, it simply always uses 3 retries, and uses timeouts of 3s, 11s, and 45s per UDP query.

@bradfitz
Copy link
Contributor Author

More good comments in #16885 (comment)

@bradfitz
Copy link
Contributor Author

And #16885 (comment) suggests it might be a Go 1.7 regression worth fixing in a point release.

@mdempsky
Copy link
Contributor

Related to integrating context support?

@bradfitz
Copy link
Contributor Author

I think so.

@Nitecon
Copy link

Nitecon commented Aug 26, 2016

This would definitely be good for fixing in a 1.7 point release as it does currently cause us production impact and does not follow the resolver implementations for max timeouts as mentioned by Alex in #16885

@mdempsky
Copy link
Contributor

Behavior changed in f60fcca.

@mdempsky
Copy link
Contributor

@bradfitz Have you already started on this? If not, I can prepare a CL.

@bradfitz
Copy link
Contributor Author

I haven't. Please do!

@mdempsky mdempsky assigned mdempsky and unassigned bradfitz Aug 29, 2016
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/28057 mentions this issue.

@bradfitz
Copy link
Contributor Author

@mdempsky, think we should cherry-pick this back to Go 1.7.1?

/cc @ianlancetaylor @broady

@mdempsky
Copy link
Contributor

@bradfitz It should definitely be cherry-picked to Go 1.7.x. If 1.7.1 isn't final yet, I think this makes sense to include.

@bradfitz bradfitz modified the milestones: Go1.7.1, Go1.8 Aug 29, 2016
@mlafeldt
Copy link

mlafeldt commented Sep 1, 2016

This might be related: One of our core components actually experienced many dial udp 10.8.0.2:53: i/o timeout errors in production after updating from Go 1.6 to 1.7. We had to switch to the cgo DNS implementation to still be able to move forward with Go 1.7.

We were actually wondering why the Go DNS resolver does not try TCP when UDP fails here:

c, err := d.dialDNS(ctx, network, server)
if err != nil {
return nil, err
}

/cc @Luzifer @denderello

@mikioh
Copy link
Contributor

mikioh commented Sep 2, 2016

Drop testing label because this looks a functionality issue.

@mikioh mikioh removed the Testing An issue that has been verified to require only test changes, not just a test failure. label Sep 2, 2016
gopherbot pushed a commit that referenced this issue Sep 7, 2016
The handling of "options timeout:n" is supposed to be per individual
DNS server exchange, not per Lookup call.

Fixes #16865.

Change-Id: I2304579b9169c1515292f142cb372af9d37ff7c1
Reviewed-on: https://go-review.googlesource.com/28057
Run-TryBot: Matthew Dempsky <mdempsky@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-on: https://go-review.googlesource.com/28640
@mikioh
Copy link
Contributor

mikioh commented Oct 5, 2016

This is a regression of Go 1.7. Go 1.6.3 or below, or Go 1.7.1 or above works fine.

@jbrook
Copy link

jbrook commented Nov 17, 2016

Maybe I shouldn't be commenting on closed tickets but we have been experiencing DNS UDP dial timeouts in our production environment using Go 1.7.3.

dial tcp: lookup xxx.xxxxxx.com on 10.129.0.2:53: dial udp 10.129.0.2:53: i/o timeout"}

Go 1.7.3 using official Docker image on AWS EC2.
AWS EC2 instance has DNS resolving to a Route53 private hosted zone.

@bradfitz
Copy link
Contributor Author

You can comment on closed tickets if you'd like, but it's just not very effectual, as we don't track closed tickets.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

7 participants