-
Notifications
You must be signed in to change notification settings - Fork 18k
net: retry DNS lookups before failure? #16865
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm okay with us changing the DNS resolver logic to more closely match other DNS client libraries if that helps the flakiness, but I'm hesitant to do things like change default timeouts / retry logic just to appease flaky tests. A possible testing-side fix: we could run a simple local DNS server that just knows how to respond to certain fixed DNS queries. It doesn't even need to implement proper DNS packet decoding. It just needs to copy the 16-bit query ID at the start of the packet, and then do an exact byte-string lookup on the rest to decide on a response. |
Flaky tests is how I started down this path, but then I realized our DNS client just might need work too. But looking at the cited test failures, I see one is pure Go, one is cgo, and one is
@adg and I started working on that once (can't find the bug) but never finished, apparently. |
It seems our code already does try to do a certain number of attempts ( It looks like one deadline is set up before the loop, then the first one will fail due to timeout, and all the rest will all necessarily fail because the timeout is already dead. What do other DNS implementations do? |
Yeah, that appears to be part of the problem at least. libresolv in glibc uses cfg.timeout to compute individual UDP round-trip timeouts, not as a global timeout. It has a kind of goofy timeout calculation logic though. For the first server in the nameserver list, it uses cfg.timeout directly. But for the rest, it uses Checking glibc commit history, it looks like that logic came from BIND 8.2.3 in 2000 (see bminor/glibc@e685e07). Prior to that, there was a somewhat seemingly more sane approach: for the first attempt to each server, use I want to say this is just an accident because of how they split out a function similar to our djbdns's client library doesn't respect the timeout/retry settings in resolv.conf. In stub resolver mode, it simply always uses 3 retries, and uses timeouts of 3s, 11s, and 45s per UDP query. |
More good comments in #16885 (comment) |
And #16885 (comment) suggests it might be a Go 1.7 regression worth fixing in a point release. |
Related to integrating context support? |
I think so. |
This would definitely be good for fixing in a 1.7 point release as it does currently cause us production impact and does not follow the resolver implementations for max timeouts as mentioned by Alex in #16885 |
Behavior changed in f60fcca. |
@bradfitz Have you already started on this? If not, I can prepare a CL. |
I haven't. Please do! |
CL https://golang.org/cl/28057 mentions this issue. |
@mdempsky, think we should cherry-pick this back to Go 1.7.1? |
@bradfitz It should definitely be cherry-picked to Go 1.7.x. If 1.7.1 isn't final yet, I think this makes sense to include. |
This might be related: One of our core components actually experienced many We were actually wondering why the Go DNS resolver does not try TCP when UDP fails here: Lines 161 to 164 in adb1e67
/cc @Luzifer @denderello |
Drop testing label because this looks a functionality issue. |
The handling of "options timeout:n" is supposed to be per individual DNS server exchange, not per Lookup call. Fixes #16865. Change-Id: I2304579b9169c1515292f142cb372af9d37ff7c1 Reviewed-on: https://go-review.googlesource.com/28057 Run-TryBot: Matthew Dempsky <mdempsky@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org> Reviewed-on: https://go-review.googlesource.com/28640
This is a regression of Go 1.7. Go 1.6.3 or below, or Go 1.7.1 or above works fine. |
Maybe I shouldn't be commenting on closed tickets but we have been experiencing DNS UDP dial timeouts in our production environment using Go 1.7.3. dial tcp: lookup xxx.xxxxxx.com on 10.129.0.2:53: dial udp 10.129.0.2:53: i/o timeout"} Go 1.7.3 using official Docker image on AWS EC2. |
You can comment on closed tickets if you'd like, but it's just not very effectual, as we don't track closed tickets. |
I've frequently noticed that our net DNS tests running on builders are often flaky.
For example:
https://build.golang.org/log/ce5a87135d1a5ed4f17bd998ace2e0060b9ad597
https://build.golang.org/log/b3e762fc83d463acba21987ff558c8018b33c7cb
https://build.golang.org/log/250fc567590d125f1c8fd27740105eb7288ab16c
etc.
Notice they're all after 5 seconds. (our default DNS timeout)
Did a UDP request get lost?
Did a UDP response get lost?
Does NAT make some builders worse?
Should we make builders re-try all DNS tests N times?
But this is also flaky (but to a much lesser degree) on my desktop on wired ethernet. With 500 runs, I still see occasional failures.
Maybe we should make our net package's DNS code automatically resend the UDP request after half the timeout? (i.e. after 2.5 seconds by default)
/cc @mdempsky @josharian @minux @ianlancetaylor @mikioh
The text was updated successfully, but these errors were encountered: