Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: resolver should call res_init when resolv.conf changes #21083

Open
oconnor663 opened this issue Jul 19, 2017 · 11 comments
Open

net: resolver should call res_init when resolv.conf changes #21083

oconnor663 opened this issue Jul 19, 2017 · 11 comments
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. help wanted NeedsFix The path to resolution is known, but the work has not been done. Proposal Proposal-Accepted
Milestone

Comments

@oconnor663
Copy link

oconnor663 commented Jul 19, 2017

If you force the cgo resolver on non-Debian-based Linux with glibc <=2.25 (that is, any stable glibc as of this writing) and Go 1.8.3 (probably any released version), it's very easy to get into a state where all network connections fail even though the machine's network is up:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	i := 1
	for {
		ip, err := net.LookupIP("google.com")
		if err != nil {
			fmt.Printf("%3d error: %s\n", i, err)
		} else {
			fmt.Printf("%3d success: %s\n", i, ip)
		}
		time.Sleep(1 * time.Second)
		i++
	}
}
  1. Turn off / unplug the network connection on your machine.
  2. Observe that NetworkManager or whatever daemon on your machine updates /etc/resolv.conf and removes the nameservers. (This does depend on what Linux distro / local network configuration you have, so if /etc/resolv.conf never changes you might need to repro this on a different machine.)
  3. GODEBUG=netdns=cgo go run test.go (or whatever you named it) on the file above. The GODEBUG setting forces the net library to use the cgo resolver. Note that the pure-Go resolver does not exhibit this bug.
  4. Watch it for a few seconds and observe a couple lookup errors (no such host). This is expected, because the network is down.
  5. Without stopping the test program above, turn the network back on.
  6. Confirm again that /etc/resolv.conf changes, and that your nameservers are back.
  7. Keep watching the output of the test program.
    BUG: The test program's DNS requests keep failing. Sometimes they fix themselves eventually, I'm not sure under exactly what conditions, but sometimes I never see them recover. It seems like if the test program ran with the network off for a minute, recovery is less likely than if you turned the network back on after a couple seconds.

This seems to be related to a bug in glibc, where /etc/resolv.conf isn't checked for changes as often as it should be. It looks like glibc is going to ship a fix for this in their next release, though affected versions will still be around for years. Debian has patched this fix themselves for a long time (this patch?), which is why I don't think the bug will repro on Debian-based distros. I've recently had to work around this issue in application code, by calling libc::res_init manually. The Rust standard library has started calling res_init after DNS failures by default, and that issue links to similar workarounds in other languages and applications.

Seems like a lot of big programs have gone through the pain of re-discovering this issue. Here's Mozilla Firefox from 14 years ago. And more recently, Chef (and Ruby).

I think the Go standard library should consider adding a workaround similar to what Rust has done. It looks like this was considered in #10850, but it might not have been clear how common this bug is? The most painful case is a long-running Go process on a laptop where the WiFi comes and goes.

[Apologies if the "proposal" label was incorrect. This could be called a bugfix, depending on how you feel about it :) ]

@gopherbot gopherbot added this to the Proposal milestone Jul 19, 2017
@ianlancetaylor
Copy link
Contributor

CC @mdempsky @iangudger

@rsc
Copy link
Contributor

rsc commented Jul 31, 2017

Links:

Both patches simply call res_init and then return the error they already hold. They do not retry the call that failed. I assume calling code is expected to do that eventually (this is to break a loop that would let everything keep getting the bad result). I guess if it's been 14+ years the distros are never going to fix this themselves / glibc isn't?

At least in glibc, res_init has a lock so it's OK to call from multiple goroutines simultaneously. It would be pretty easy to call res_init on the error result from getaddrinfo, as in those two CLs.

Rust seems to do this only for lookup_host, not other network calls. It's unclear if there are other entry points we should be considering.

Overall this seems unfortunate but fairly low risk to try to fix on the Go side. Not for a point release but maybe for Go 1.10.

Update: Don't do this on OS X (see rust-lang/rust#43592), and maybe evaluate whether BSD libc more generally should do it.

ping @mdempsky @iangudger ?

@oconnor663
Copy link
Author

oconnor663 commented Jul 31, 2017

I guess if it's been 14+ years the distros are never going to fix this themselves / glibc isn't?

If their timeline is still right, it looks like glibc is actually going to ship a fix for this tomorrow. But I assume some supported systems (most current non-Debian distros?) will be running versions of glibc with this bug for many years.

Rust seems to do this only for lookup_host, not other network calls

I think the private function lookup_host ends up getting called by everything network related in the Rust standard library. For example, TcpStream::connect -> each_addr -> to_socket_addrs -> resolve_socket_addr -> lookup_host.

Overall this seems unfortunate but fairly low risk to try to fix on the Go side.

Two big benefits to fixing this on the Go side:

  1. If App A depends on Library B, and Library B makes network calls, I don't know of a way for A's authors to reliably fix B's calls without making code changes in B.
  2. Fixing this bug in application code requires cgo, which is a bummer if your build didn't need cgo before.

@iangudger
Copy link
Contributor

I have no opinion on this.

@oconnor663
Copy link
Author

oconnor663 commented Aug 1, 2017

Possible risk: we'll have to be careful about calling res_init from multiple threads. See rust-lang/rust#43592

@rsc
Copy link
Contributor

rsc commented Aug 14, 2017

Ping @mdempsky.

@mdempsky
Copy link
Contributor

Calling res_init after a lookup failure seems like a kludge to me. It wouldn't necessarily detect when we're just getting a different response (e.g., if search paths change). Also, I saw that this is what Mozilla and Rust both do, but I couldn't immediately track down how Ruby/Chef decide when to call res_init.

I would be inclined to just periodically call res_init (e.g., before every getaddrinfo call, unless it's been less than 5 seconds since the previous one?), so the behavior more closely matches the Go resolver behavior.

@mikioh mikioh changed the title proposal: the net/cgo_unix DNS resolver should call res_init on lookup failure proposal: net: cgo DNS stub resolver should call res_init on lookup failure Sep 20, 2017
@rsc
Copy link
Contributor

rsc commented Sep 25, 2017

My concern with the "before every getaddrinfo call except not more than once in 5 seconds" is that every 5 seconds there will be a latency hiccup where calls back up on the res_init finishing, a res_init that 99% of the time is not necessary. Using the failure to trigger the res_init seems less costly. It seems to be working for Keybase at least (written in Go, linked above).

@rsc
Copy link
Contributor

rsc commented Sep 25, 2017

Superficial reading of the glibc fix looks like they are stat'ing resolv.conf to see if it changes. If we did that instead of blindly calling res_init, then @mdempsky's suggestion seems a bit better. Also, we're already watching resolv.conf for changes, because we decide whether to use cgo at all based on the resolv.conf being "complicated" or not.

If we are already stat'ing the resolv.conf, maybe we should just call res_init each time we see it change. But that check is on the non-cgo path only and would have to be moved up earlier, to be common to all lookups. Still, that would make the "time to notice a change" the same in both Go and cgo modes.

If you move the tryUpdate calls into the common code paths instead of the go-only ones, then it might also have the effect that if resolv.conf changes from "complicated" to "simple" then we might start using the Go native resolver again, which might be a win.

If we're going to do this for Go 1.10, we should do it fairly soon to get it soaking.

Anyone interested in trying this? @mikioh? @mdempsky?

@rsc rsc changed the title proposal: net: cgo DNS stub resolver should call res_init on lookup failure net: resolver should call res_init when resolv.conf changes Sep 25, 2017
@rsc rsc modified the milestones: Proposal, Go1.10 Sep 25, 2017
@oconnor663
Copy link
Author

That bug linked above (rust-lang/rust#43592) is a real crasher. We think that res_init is threadsafe under glibc, but we know it's not on macOS at least. The Rust stdlib ended up working around this with a runtime call to gnu_get_libc_version, rust-lang/rust#44965.

@bradfitz bradfitz added help wanted NeedsFix The path to resolution is known, but the work has not been done. labels Nov 17, 2017
@bradfitz bradfitz modified the milestones: Go1.10, Go1.11 Nov 17, 2017
@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jun 27, 2018
@ianlancetaylor ianlancetaylor added the early-in-cycle A change that should be done early in the 3 month dev cycle. label Dec 19, 2018
@ianlancetaylor ianlancetaylor modified the milestones: Go1.12, Go1.13 Dec 19, 2018
@andybons andybons modified the milestones: Go1.13, Go1.14 Jul 8, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@XANi
Copy link

XANi commented Aug 24, 2022

Still happens on Debian GLIBC 2.34-4 /Go 1.18.3

I get that when openvpn disconnects after long session, with resolv.conf updated by resolvconf. No other app has that problem

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
early-in-cycle A change that should be done early in the 3 month dev cycle. help wanted NeedsFix The path to resolution is known, but the work has not been done. Proposal Proposal-Accepted
Projects
None yet
Development

No branches or pull requests

9 participants