Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netlink: Close does not unblock concurrent Receive operation #162

Closed
libesz opened this issue Feb 12, 2020 · 8 comments · Fixed by #169
Closed

netlink: Close does not unblock concurrent Receive operation #162

libesz opened this issue Feb 12, 2020 · 8 comments · Fixed by #169

Comments

@libesz
Copy link

libesz commented Feb 12, 2020

Similar to #136, though the use-case is different.
I would want to listen for any IP Route changes so I use blocking Receive() for multicast messages. I cannot use timeout on it because than there is a chance to miss events while restarting the Receive cycle.
In order to break the Receive (i.e. on graceful exit), I close the Conn on another goroutine, which appears to be the only way to do it.
The problem is that Receive in it's core read function acquires the R side of the RWMutex:

s.mu.RLock()

Close, however wants to acquire the W side here:
s.mu.Lock()

which is not possible until the R(eader) side is freed.
This results in a deadlock, until something is received on the socket, which is not deterministic.
I don't have a solution in mind, but maybe close function should lock the W side of the mutex only after calling the close syscall, ensuring all blocking operations are woke up (and so releasing the lock). If other parallel operations are handling the errors appropriately, it might not cause problems if s.closed temporarily appears to be false, while the socket is already closed.

@mdlayher
Copy link
Owner

I cannot use timeout on it because than there is a chance to miss events while restarting the Receive cycle.

I'm not sure I understand why you're restarting the Receive cycle. Why not keep the goroutine running to receive messages until application exit?

You can set a deadline in the past to immediately time out a read, as I'm doing in: https://github.com/mdlayher/netstate/blob/master/watcher_linux.go (rtnetlink used here, but same applies for netlink)

I don't have a solution in mind, but maybe close function should lock the W side of the mutex only after calling the close syscall, ensuring all blocking operations are woke up (and so releasing the lock).

This is probably a reasonable solution. I'd have to review the code again to see why exactly I set it up the way I did though.

@libesz
Copy link
Author

libesz commented Feb 12, 2020

I don't want to restart the Receive cycle, just mentioned that I thought it would be an opportunity to periodically unblock the read if I use some sort of timeout.
But honestly I didn't consider to modify the timeout parallel when Receive is already running to unblock it. Am I right this is the approach on the linked package?

@mdlayher
Copy link
Owner

mdlayher commented Feb 12, 2020

Yep that's correct. If you set a deadline that has already passed (I like a very small UNIX timestamp) it'll immediately cause an I/O timeout error on any blocking syscalls like that.

Since this has tripped a few people up, I think it's probably still worth looking into. But that solution should help with your immediate problem.

@mdlayher
Copy link
Owner

mdlayher commented Jun 9, 2020

Looking at this again, I would expect that moving the mutex locking below closing the file descriptor would solve this problem, so something like:

func (s *sysSocket) Close() error {
	// Close the socket and indicate to other goroutines that the file
	// descriptor has been closed, so further calls return EBADF.
	err := s.fd.Close()

	s.mu.Lock()
	s.closed = true
	s.mu.Unlock()

	// Stop the associated goroutine and wait for it to return.
	s.g.stop()

	return err
}

However it's been a long time since I've looked at this code and I'd have to remember why exactly we did it that way in the first place. The doc comment does indicate that a concurrent Close unblocked Receive at one point, so this is a regression.

I am slightly afraid that making a small change like this could result in a send on closed channel elsewhere, so this will take some more consideration.

@mdlayher mdlayher changed the title Blocking Receive() prevents Close() netlink: Close does not unblock concurrent Receive operation Jun 9, 2020
@mdlayher
Copy link
Owner

mdlayher commented Jun 9, 2020

/cc @acln0 in case you are still working with this package more frequently than I am. :)

mdlayher added a commit that referenced this issue Jun 11, 2020
Signed-off-by: Matt Layher <mdlayher@gmail.com>
@shotahino
Copy link

@mdlayher I am observing the same issue. Close does not unblock Receive.

The doc comment does indicate that a concurrent Close unblocked Receive at one point, so this is a regression

Do you know when this might have regressed?

@mdlayher
Copy link
Owner

See c5f8ab7. It won't be fixed until v1.2.0.

@mdlayher
Copy link
Owner

I believe I have finally fixed this problem in #169, and the tests I've written seem to agree with that assertion. I will tag a v1.2.0 (which is Go 1.12+ only) and have folks give it a try in their own applications. I will update mine as well.

mdlayher added a commit that referenced this issue Dec 20, 2020
Signed-off-by: Matt Layher <mdlayher@gmail.com>
mdlayher added a commit that referenced this issue Dec 20, 2020
Signed-off-by: Matt Layher <mdlayher@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants