
netlink: blocking Receive prevents Send #136

Closed

acln0 opened this issue May 4, 2019 · 5 comments

@acln0
Collaborator

acln0 commented May 4, 2019

I encountered this while using the nfqueue subsystem. Given a var nlconn *netlink.Conn:

One goroutine (henceforth "the producer") does approximately this:

for {
	msgs, err := nlconn.Receive()
	if err != nil {
		// handle	
	}
	ch <- msgs
}

It sends the messages to a channel to fan them out to workers. The workers do approximately this:

for msgs := range ch {
	for _, msg := range msgs {
		reply := process(msg)
		nlconn.Send(reply)
	}
}

To set the stage for the pseudo-deadlock, assume the producer calls Receive. The request travels down to the lockedNetNSGoroutine associated with the connection and is sent on g.funcC.

So far so good. The lockedNetNSGoroutine receives the request, runs it, and blocks in the recvmsg call.

Meanwhile, a worker has finished processing a message it was handed previously and wants to send a reply. It calls Send, which blocks on the send to g.funcC, because the lockedNetNSGoroutine is busy waiting for the blocking recvmsg call to return and is not servicing the channel.
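
To make the mechanism easier to picture, here is a rough model of that dispatch pattern: a single goroutine locked to its OS thread, serving requests off a function channel. The names (lockedGoroutine, funcC, run) only mirror the description above; this is an illustration, not the library's actual code.

// Simplified model of one goroutine, locked to its OS thread, that executes
// every socket operation submitted on funcC. While it sits in a blocking
// recvmsg on behalf of Receive, it cannot read funcC again, so a concurrent
// Send blocks on the channel send until recvmsg returns.
// (Uses runtime.LockOSThread; imports omitted, as in the snippets above.)
type lockedGoroutine struct {
	funcC chan func()
}

func newLockedGoroutine() *lockedGoroutine {
	g := &lockedGoroutine{funcC: make(chan func())}
	go func() {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
		for f := range g.funcC {
			f() // a blocking recvmsg here stalls every later request
		}
	}()
	return g
}

// run submits an operation and waits for it to complete.
func (g *lockedGoroutine) run(f func()) {
	done := make(chan struct{})
	g.funcC <- func() { // blocks while the goroutine is inside a syscall
		f()
		close(done)
	}
	<-done
}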

I have refactored my code such that all the processing happens synchronously for now, so instead of a producer and a number of workers, I have only this:

for {
	msgs, err := nlconn.Receive()
	if err != nil {
		// handle
	}

	for _, msg := range msgs {
		reply := process(msg)
		nlconn.Send(reply)
	}
}

This works, and I don't think it is prohibitively expensive in terms of performance. It may even be faster than fanning out. I haven't measured anything yet, since my project is in a prototype stage. There may also be the possibility for me to use two different *netlink.Conns bound to the same netfilter queue, one for receiving and one for sending. I don't know if this works: I have not tried it yet.
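
For completeness, here is a rough sketch of that two-connection idea. It is untested, as I said, and the names (rx, tx, ch, process) are placeholders; whether the kernel accepts verdicts sent on a second socket bound to the same queue is exactly the open question.

// Untested sketch: one Conn dedicated to receiving, another to sending, so a
// blocking Receive on rx can never stall a Send on tx. The nfqueue-specific
// setup (binding each connection to the queue) is omitted here.
// (Assumes golang.org/x/sys/unix for the NETLINK_NETFILTER constant.)
rx, err := netlink.Dial(unix.NETLINK_NETFILTER, nil)
if err != nil {
	// handle
}
tx, err := netlink.Dial(unix.NETLINK_NETFILTER, nil)
if err != nil {
	// handle
}

go func() {
	for {
		msgs, err := rx.Receive()
		if err != nil {
			// handle
		}
		ch <- msgs
	}
}()

for msgs := range ch {
	for _, msg := range msgs {
		reply := process(msg)
		if _, err := tx.Send(reply); err != nil {
			// handle
		}
	}
}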

I thought about potential solutions for a little while, but I couldn't figure out any good ones. Therefore, I have filed this bug for discussion. Can something be done here? Should we at least document this? While debugging the issue, I was stumped for a while, since Receive and Send both acquire a read lock (!), and it seemed absurd to me that they'd be mutually exclusive in a deeper sense, until I saw the stack traces of the goroutines involved.

Thanks,
Andrei

@mdlayher
Owner

This is a tricky one, but I don't have an immediate solution in mind. The goroutine is locked to its OS thread to ensure messages get routed to the right place. I suppose I haven't hit this yet because I've never had a system that needed to wait on a multicast group and then send again on the same socket.

The locks you mention at the higher layer are for serializing Execute usage, but it looks like you worked through that already.
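
For reference, Execute is the higher-level request/response helper: it pairs a Send with the matching Receive and validates the replies against the request, which is why concurrent callers of it need to be serialized. A rough usage sketch (conn and payload are placeholders; the header type and flags depend on the subsystem you're talking to):

// Execute sends one request, receives the replies, and validates them against
// the request (sequence number and PID).
msgs, err := conn.Execute(netlink.Message{
	Header: netlink.Header{
		// Type and Flags set per the subsystem's protocol.
	},
	Data: payload,
})
if err != nil {
	// handle
}
for _, m := range msgs {
	// process reply
	_ = m
}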

I'm not sure it'd be feasible to somehow relinquish the lock and let the locked goroutine do something else if epoll wait is triggered.

Perhaps for the time being, the best bet is to document this.

@acln0
Collaborator Author

acln0 commented May 11, 2019

I read the netlink man page a little, and came across this interesting tidbit:

Both nlmsg_seq and nlmsg_pid are opaque to netlink core.

which makes me wonder about being able to set the PID for a socket from outside (and storing it in the Conn struct to be used for every send), thus lifting the restriction of the one goroutine locked to the OS thread per socket. I initially thought that the pid business (and the need for LockOSThread) was an unfortunate consequence of package netlink passing a zero pid in the address when setting up the connection. But then I came across some e-mails you sent to the netlink mailing lists which suggest that you did indeed try setting PIDs yourself, and it didn't work.
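
For the record, this is roughly what setting the PID yourself would mean at the socket level: binding the netlink socket with an explicit nl_pid instead of passing zero and letting the kernel assign one. Purely illustrative (using golang.org/x/sys/unix), and, as noted, this approach apparently did not pan out in practice.

// Illustration only: bind an AF_NETLINK socket with an explicit nl_pid rather
// than passing zero and letting the kernel pick one.
// (Assumes "os" and "golang.org/x/sys/unix" are imported.)
fd, err := unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW, unix.NETLINK_NETFILTER)
if err != nil {
	// handle
}
sa := &unix.SockaddrNetlink{
	Family: unix.AF_NETLINK,
	Pid:    uint32(os.Getpid()), // explicit nl_pid, stored for every subsequent send
}
if err := unix.Bind(fd, sa); err != nil {
	// handle
}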

I'm going to investigate this more at some point, just not right now. Thanks!

@dhaavi

dhaavi commented Nov 27, 2020

It seems that this has been resolved by #171. Thanks @ti-mo!

I just investigated a performance issue in safing/portmaster, which uses florianl/go-nfqueue, which in turn uses this library.

We implemented a costly workaround, but I can confirm we can now Receive and Send concurrently.
(As #171 is not yet released, I had to override the dependency to master though)

@mdlayher
Owner

Glad to hear it. We are due for another release, but I've been dragging my feet on it. Hopefully I can do it soon.

@mdlayher
Owner

With v1.2.0 out, I believe this is solved!
