net: forceCloseSockets in test is not safe for finalized fds #15525
Comments
Repro? |
Repro is check out patchset 6 or earlier from the Gerrit review above ( |
I suppose that this is the same as #14910. Once we break net.{Conn,Listener,PacketConn} loose from our control it can be garbage collected. Keep your belongings with you safely. |
@mikioh I don't see that Brad's tests call .Fd() |
Yup, I mistook the code path. |
@randall77, I believe this is a possible dup of #15277. If SSA thinks the arg slots are dead and is reusing them for other things, that might end up smashing them too early, causing premature garbage collection. |
I can't repro. Following Brad's instructions I get a smattering of i/o timeouts, bind: address already in use, unreachable networks, etc, but no EBADF or anything similar. I tried on both linux and darwin amd64. Anyone who can repro, please provide more detailed instructions. Or try patching in https://go-review.googlesource.com/c/22365/ for me and see if it helps. |
I tried refs/changes/31/22031/6 on both Linux and Darwin and they both just hang. After 10 minutes the test runner sends them a SIGQUIT. So that seems not to be a working repro. |
I'll try to repro this again. (I was on Darwin at the time, sitting next to @ianlancetaylor who I showed this to also) I don't think it required a specific |
I reproduced with "go test -short -count=20 net" on Linux. I get lots of failures, some of them probably not real problems, but I also do get many EBADFs.
Next step is probably to make the os.File close finalizer print the stacks of all current goroutines before closing an fd. |
(Since Russ can reproduce this, I'm not spending time on an easier repro.) |
To reproduce this on Linux it suffices to start at tip and comment out the
I can't figure out how, even in the presence of early finalization, ln might get closed more than once, but that seems to be what is happening. Closing the fd twice could, if the fd were reused between the two closes, step on someone else. But then having stepped on someone else, when that someone else closes their fd, that could step on a third person, and so on, creating a chain of errors like in the failure output. I can't come up with another explanation for how this one async close could cause so many different test failures otherwise.
But I also can't explain how the fd could possibly be closed twice. We're so careful about that - even if the program were racy or there were a normal Close racing with the finalizer close (and there doesn't appear to be), the uses of the fd are interlocked and reference counted, and we only actually close the fd when the ref count drops to zero.
Note: I commented last night that it looked like the liveness bug I just fixed, but that makes no sense since that was specific to SSA and this is not. I deleted that comment. |
It is in fact a double-close of an fd. The problem is that if a test does not close its own *netFD (what's inside a net.Conn or net.Listener), then a later test that makes use of forceCloseSockets will close the fd itself (using syscall.Close, not the *netFD which has all the right checks). Then when the finalizer on the *netFD eventually runs, it too will close that same numeric fd, and by then it may have been reused, leading to the kind of chain reaction I described in my previous message.
Skipping the two tests that make use of forceCloseSockets makes the repro pass.
So either we just make sure to close all the *netFDs we open in the net test, or we change forceCloseSockets to call close on *netFDs not sockets, or we delete forceCloseSockets entirely (it seems like a bad idea as implemented today), or we change it to use syscall.Shutdown instead of syscall.Close, or we live with the fact that forgetting to close a *netFD blows up badly.
Since it's just a bad test, it's OK to fix for Go 1.7 or to leave for Go 1.8. Leaving for @mikioh. |
Oops, I don't remember the reason why I left forceCloseSockets in non-main test functions but it's clearly wrong, thanks. |
CL https://golang.org/cl/23505 mentions this issue. |
I was just debugging mysterious test failures (EBADFs) in random places in the net package from a CL of mine.
I noticed the test failures went away when I disabled my new (unrelated) tests.
It turned out that I was forgetting to close some net.Conns in my test (fixed: https://go-review.googlesource.com/#/c/22031/6..7/src/net/net_test.go) and the corruption in the net package state was coming from finalizers.
So somehow the finalizers are messing up the event loop's state.
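For reference, a sketch of the pattern that avoids leaving a net.Conn or net.Listener for the finalizer: every value the test opens is closed explicitly on every path. This is not the code from the CL; `roundTrip` is a hypothetical helper, and it assumes the loopback interface is available.

```go
package main

import (
	"fmt"
	"net"
)

// roundTrip opens a listener and one connection pair, and closes
// everything it opened before returning, on success and on error.
func roundTrip() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	// Accept in the background and close the server side of the connection.
	accepted := make(chan error, 1)
	go func() {
		c, err := ln.Accept()
		if err != nil {
			accepted <- err
			return
		}
		accepted <- c.Close()
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		// The deferred ln.Close unblocks Accept, so the goroutine exits.
		return err
	}
	defer c.Close()

	return <-accepted
}

func main() {
	if err := roundTrip(); err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println("all sockets closed explicitly")
}
```

With nothing left open, there is no *netFD for a finalizer to close later, so a raw close elsewhere in the test binary cannot be followed by a second close of the same fd number.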
/cc @aclements @ianlancetaylor @dvyukov