net: forceCloseSockets in test is not safe for finalized fds #15525

bradfitz · 2016-05-03T18:26:48Z

I was just debugging mysterious test failures (EBADFs) in random places in the net package from a CL of mine.

I notice the test failures went away when I disabled my new (unrelated) tests.

It turned out that I was forgetting to close some net.Conns in my test (fixed: https://go-review.googlesource.com/#/c/22031/6..7/src/net/net_test.go) and the corruption in the net package state was coming from finalizers.

So somehow the finalizers are messing up the event loop's state.

/cc @aclements @ianlancetaylor @dvyukov

dvyukov · 2016-05-03T18:40:40Z

Repro?
Were EBADF coming from your calls or from runtime netpoll?

bradfitz · 2016-05-03T18:44:23Z

Repro is check out patchset 6 or earlier from the Gerrit review above (git fetch https://go.googlesource.com/go refs/changes/31/22031/6 && git checkout FETCH_HEAD) and then run the net tests and watch it fail. Use go test -count=20 net to make it fail reliably.

mikioh · 2016-05-03T23:03:32Z

I suppose that this is the same as #14910. Once we break net.{Conn,Listener,PacketConn} loose from our control it can be garbage collected. Keep your belongings with you safely.

dvyukov · 2016-05-04T06:15:29Z

@mikioh I don't see that Brad's tests call .Fd()

mikioh · 2016-05-04T08:14:09Z

@dvyukov,

Yup, I mistook the code path.

rsc · 2016-05-17T14:06:24Z

@randall77, I believe this is a possible dup of #15277. If SSA thinks the arg slots are dead and is reusing them for other things, that might end up smashing them too early, causing premature garbage collection.

randall77 · 2016-05-17T15:59:25Z

I can't repro. Following Brad's instructions I get a smattering of i/o timeouts, bind: address already in use, unreachable networks, etc, but no EBADF or anything similar. I tried on both linux and darwin amd64.

Anyone who can repro, please provide more detailed instructions. Or try patching in https://go-review.googlesource.com/c/22365/ for me and see if it helps.

rsc · 2016-05-18T01:08:23Z

I tried refs/changes/31/22031/6 on both Linux and Darwin and they both just hang. After 10 minutes the test runner sends them a SIGQUIT. So that seems not to be a working repro.

bradfitz · 2016-05-18T01:11:38Z

I'll try to repro this again. (I was on Darwin at the time, sitting next to @ianlancetaylor who I showed this to also)

I don't think it required a specific -testing.run=.... regexp, but maybe it did.

rsc · 2016-05-18T01:26:48Z

I reproduced with "go test -short -count=20 net" on Linux. I get lots of failures, some of them probably not real problems, but I also do get many EBADFs.

Cherry-picking Keith's CL did not help.
"go test -a -gcflags=-ssa=0 -short -count=20" still fails.

Next step is probably to make the os.File close finalizer print the stacks of all current goroutines before closing an fd.

bradfitz · 2016-05-20T22:05:48Z

(Since Russ can reproduce this, I'm not spending time on an easier repro.)

rsc · 2016-05-27T14:10:19Z

To reproduce this on Linux it suffices to start at tip and comment out the defer ln.Close() in the goroutine inside net_test.go's TestZeroByteRead and then run go test -short -count=20 net. I am completely baffled. The test is installing various hooks that might be causing problems, so maybe this is just a buggy test. Or maybe it's a real finalization problem. It's difficult to tell. I asked @aclements to look at it when he finishes the bug he is currently working on.

I can't figure out how, even in the presence of early finalization, ln might get closed more than once, but that seems to be what is happening. Closing the fd twice could, if the fd were reused between the two, step on someone else. But then having stepped on someone else, when that someone else closes their fd, that could step on a third person, and so on, creating a chain of errors like in the failure output. I can't come up with another explanation for how this one async close could cause so many different test failures otherwise. But I also can't explain how the fd could possibly be closed twice. We're so careful about that - even if the program were racy or there were a normal Close racing with the finalizer close (and there doesn't appear to be), the uses of the fd are interlocked and reference counted, and we only actually close the fd when the ref count drops to zero.

Note: I commented last night that it looked like the liveness bug I just fixed, but that makes no sense since that was specific to SSA and this is not. I deleted that comment.

rsc · 2016-05-27T15:51:43Z

It is in fact a double-close of an fd. The problem is that if a test does not close its own *netFD (what's inside a net.Conn or net.Listener), then a later test that makes use of forceCloseSockets will close the fd itself (using syscall.Close, not the *netFD which has all the right checks). Then when the finalizer on the *netFD eventually runs, it too will close that same numeric fd, and by then it may have been reused, leading to the kind of chain reaction I described in my previous message. Skipping the two tests that make use of forceCloseSockets makes the repro pass.

So either we just make sure to close all the *netFDs we open in the net test, or we change forceCloseSockets to call close on *netFDs not sockets, or we delete forceCloseSockets entirely (it seems like a bad idea as implemented today), or we change it to use syscall.Shutdown instead of syscall.Close, or we live with the fact that forgetting to close a *netFD blows up badly.

Since it's just a bad test, its OK to fix for Go 1.7 or to leave for Go 1.8. Leaving for @mikioh.

mikioh · 2016-05-27T18:40:15Z

Oops, I don't remember the reason why I left forceCloseSockets in non-main test functions but it's clearly wrong, thanks.

gopherbot · 2016-05-27T19:00:23Z

CL https://golang.org/cl/23505 mentions this issue.

bradfitz added this to the Go1.7Maybe milestone May 3, 2016

rsc modified the milestones: Go1.7, Go1.7Maybe May 17, 2016

bradfitz added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label May 20, 2016

rsc modified the milestones: Go1.7Beta, Go1.7 May 27, 2016

rsc modified the milestones: Go1.7Maybe, Go1.7Beta May 27, 2016

rsc changed the title ~~net: Conn finalizers can corrupt network state~~ net: forceCloseSockets in test is not safe for finalized fds May 27, 2016

rsc assigned mikioh May 27, 2016

gopherbot closed this as completed in dc5b523 May 30, 2016

mikioh removed their assignment May 31, 2016

golang locked and limited conversation to collaborators May 31, 2017

gopherbot added the FrozenDueToAge label May 31, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

net: forceCloseSockets in test is not safe for finalized fds #15525

net: forceCloseSockets in test is not safe for finalized fds #15525

bradfitz commented May 3, 2016 •

edited

Loading

dvyukov commented May 3, 2016

bradfitz commented May 3, 2016 •

edited

Loading

mikioh commented May 3, 2016

dvyukov commented May 4, 2016

mikioh commented May 4, 2016

rsc commented May 17, 2016

randall77 commented May 17, 2016

rsc commented May 18, 2016

bradfitz commented May 18, 2016 •

edited

Loading

rsc commented May 18, 2016

bradfitz commented May 20, 2016

rsc commented May 27, 2016

rsc commented May 27, 2016

mikioh commented May 27, 2016

gopherbot commented May 27, 2016

net: forceCloseSockets in test is not safe for finalized fds #15525

net: forceCloseSockets in test is not safe for finalized fds #15525

Comments

bradfitz commented May 3, 2016 • edited Loading

dvyukov commented May 3, 2016

bradfitz commented May 3, 2016 • edited Loading

mikioh commented May 3, 2016

dvyukov commented May 4, 2016

mikioh commented May 4, 2016

rsc commented May 17, 2016

randall77 commented May 17, 2016

rsc commented May 18, 2016

bradfitz commented May 18, 2016 • edited Loading

rsc commented May 18, 2016

bradfitz commented May 20, 2016

rsc commented May 27, 2016

rsc commented May 27, 2016

mikioh commented May 27, 2016

gopherbot commented May 27, 2016

bradfitz commented May 3, 2016 •

edited

Loading

bradfitz commented May 3, 2016 •

edited

Loading

bradfitz commented May 18, 2016 •

edited

Loading