net: resolver should call res_init when resolv.conf changes #21083
Comments
Links:
Both patches simply call res_init and then return the error they already hold. They do not retry the call that failed. I assume calling code is expected to do that eventually (this is to break a loop that would let everything keep getting the bad result).

I guess if it's been 14+ years, neither the distros nor glibc are ever going to fix this themselves? At least in glibc, res_init has a lock so it's OK to call from multiple goroutines simultaneously. It would be pretty easy to call res_init on the error result from getaddrinfo, as in those two CLs. Rust seems to do this only for lookup_host, not other network calls. It's unclear if there are other entry points we should be considering.

Overall this seems unfortunate but fairly low risk to try to fix on the Go side. Not for a point release but maybe for Go 1.10.

Update: Don't do this on OS X (see rust-lang/rust#43592), and maybe evaluate whether BSD libc more generally should do it.

ping @mdempsky @iangudger ?
If their timeline is still right, it looks like glibc is actually going to ship a fix for this tomorrow. But I assume some supported systems (most current non-Debian distros?) will be running versions of glibc with this bug for many years.
I think the private function
Two big benefits to fixing this on the Go side:
I have no opinion on this.
Possible risk: we'll have to be careful about calling
Ping @mdempsky.
Calling res_init after a lookup failure seems like a kludge to me. It wouldn't necessarily detect when we're just getting a different response (e.g., if search paths change). Also, I saw that this is what Mozilla and Rust both do, but I couldn't immediately track down how Ruby/Chef decide when to call res_init.

I would be inclined to just periodically call res_init (e.g., before every getaddrinfo call, unless it's been less than 5 seconds since the previous one?), so the behavior more closely matches the Go resolver behavior.
My concern with the "before every getaddrinfo call, but not more than once in 5 seconds" approach is that every 5 seconds there will be a latency hiccup where calls back up behind res_init, a res_init that 99% of the time is not necessary. Using the failure to trigger the res_init seems less costly. It seems to be working for Keybase at least (written in Go, linked above).
A superficial reading of the glibc fix suggests they are stat'ing resolv.conf to see if it has changed. If we did that instead of blindly calling res_init, then @mdempsky's suggestion seems a bit better.

Also, we're already watching resolv.conf for changes, because we decide whether to use cgo at all based on the resolv.conf being "complicated" or not. If we are already stat'ing the resolv.conf, maybe we should just call res_init each time we see it change. But that check is on the non-cgo path only and would have to be moved up earlier, to be common to all lookups. Still, that would make the "time to notice a change" the same in both Go and cgo modes.

If you move the tryUpdate calls into the common code paths instead of the go-only ones, then it might also have the effect that if resolv.conf changes from "complicated" to "simple" then we might start using the Go native resolver again, which might be a win.

If we're going to do this for Go 1.10, we should do it fairly soon to get it soaking.
That bug linked above (rust-lang/rust#43592) is a real crasher. We think that |
Still happens on Debian with glibc 2.34-4 and Go 1.18.3. I get it when openvpn disconnects after a long session, with resolv.conf updated by resolvconf. No other app has that problem.
If you force the cgo resolver on non-Debian-based Linux with glibc <=2.25 (that is, any stable glibc as of this writing) and Go 1.8.3 (probably any released version), it's very easy to get into a state where all network connections fail even though the machine's network is up:
/etc/resolv.conf and removes the nameservers. (This does depend on what Linux distro / local network configuration you have, so if /etc/resolv.conf never changes you might need to repro this on a different machine.)
GODEBUG=netdns=cgo go run test.go (or whatever you named it) on the file above. The GODEBUG setting forces the net library to use the cgo resolver. Note that the pure-Go resolver does not exhibit this bug.
no such host). This is expected, because the network is down.
/etc/resolv.conf changes, and that your nameservers are back.

BUG: The test program's DNS requests keep failing. Sometimes they fix themselves eventually, I'm not sure under exactly what conditions, but sometimes I never see them recover. It seems like if the test program ran with the network off for a minute, recovery is less likely than if you turned the network back on after a couple seconds.
This seems to be related to a bug in glibc, where /etc/resolv.conf isn't checked for changes as often as it should be. It looks like glibc is going to ship a fix for this in their next release, though affected versions will still be around for years. Debian has patched this fix themselves for a long time (this patch?), which is why I don't think the bug will repro on Debian-based distros. I've recently had to work around this issue in application code, by calling libc::res_init manually. The Rust standard library has started calling res_init after DNS failures by default, and that issue links to similar workarounds in other languages and applications.

I think the Go standard library should consider adding a workaround similar to what Rust has done. It looks like this was considered in #10850, but it might not have been clear how common this bug is? The most painful case is a long-running Go process on a laptop where the WiFi comes and goes.
[Apologies if the "proposal" label was incorrect. This could be called a bugfix, depending on how you feel about it :) ]