
netlink: performance bottleneck due to single goroutine in high performance applications #154

Closed
smarterclayton opened this issue Oct 9, 2019 · 22 comments


@smarterclayton

In scenarios where a locked OS thread isn't necessary, is the goroutine in sysSocket.write/read still required? Perhaps I'm missing a subtlety of M:N scheduling with Recvmsg and netlink, but my expectation was that, given the proper descriptor, any goroutine could read correctly from that socket as long as the netns is consistent. Is that not the case?

I ask because in high-traffic systems I'm seeing a fair amount of Go runtime scheduler time in profiles when GOMAXPROCS > 1 on Go 1.12 due to the forced goroutine switch, and in testing on lightly loaded setups with 100-200 netlink multicast events per second I saw a 5-10% CPU improvement when the goroutine is skipped (some of it from not needing to close the channel or defer). I'm not a socket expert, so I figured it was easier to ask.

@mdlayher
Owner

mdlayher commented Oct 9, 2019

Are you setting DisableNSLockThread? If so I would be interested in seeing what optimizations can be made on that path and would welcome PRs. I haven't had a need for higher performance myself yet.

If you truly need maximum performance, you could also use the SyscallConn method and deal with the socket directly.
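
For illustration, a minimal sketch of that suggestion, assuming Conn.SyscallConn returns a syscall.RawConn: dial with DisableNSLockThread set, then drop down to the raw file descriptor. The SO_RCVBUF tweak inside Control is only a placeholder showing direct fd access, not something recommended in this thread.

```go
package main

import (
	"log"

	"github.com/mdlayher/netlink"
	"golang.org/x/sys/unix"
)

func main() {
	// Skip the netns thread-locking machinery; only safe when the process
	// never leaves its original network namespace.
	c, err := netlink.Dial(unix.NETLINK_ROUTE, &netlink.Config{
		DisableNSLockThread: true,
	})
	if err != nil {
		log.Fatalf("failed to dial netlink: %v", err)
	}
	defer c.Close()

	// SyscallConn exposes the underlying socket for direct manipulation.
	rc, err := c.SyscallConn()
	if err != nil {
		log.Fatalf("failed to get raw conn: %v", err)
	}

	// Control runs the closure with the raw fd; here it simply enlarges the
	// receive buffer as an example of working with the socket directly.
	err = rc.Control(func(fd uintptr) {
		if serr := unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_RCVBUF, 1<<20); serr != nil {
			log.Printf("setsockopt: %v", serr)
		}
	})
	if err != nil {
		log.Fatalf("control: %v", err)
	}
}
```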

@smarterclayton
Author

smarterclayton commented Oct 9, 2019 via email

@mdlayher
Owner

mdlayher commented Oct 9, 2019

I'm back on my computer today and had a chance to think about this more.

First, is your code open source? That'd help me understand what all it is that you're doing.

Second, the original reasons for the goroutine locking are here:

netlink/conn_linux.go, lines 655 to 674 in 99e7bba:

// It is important to lock this goroutine to its OS thread for the duration
// of the netlink socket being used, or else the kernel may end up routing
// messages to the wrong places.
// See: http://lists.infradead.org/pipermail/libnl/2017-February/002293.html.
//
// In addition, the OS thread must also remain locked because we attempt
// to manipulate the network namespace of the thread within this goroutine.
//
// The intent is to never unlock the OS thread, so that the thread
// will terminate when the goroutine exits starting in Go 1.10:
// https://go-review.googlesource.com/c/go/+/46038.
//
// However, due to recent instability and a potential bad interaction
// with the Go runtime for threads which are not unlocked, we have
// elected to temporarily unlock the thread when the goroutine terminates:
// https://github.com/golang/go/issues/25128#issuecomment-410764489.
//
// Locking the thread is not implemented if the caller explicitly asks
// for an unlocked thread.

I suspect that at the time, I may have had another bug that caused issues with routing messages between sockets. If the caller disables thread locking via the config, perhaps we can allow multiple goroutines to directly access the socket simultaneously without any issues. That said, I haven't bothered to challenge the status quo as my needs are a lot simpler than many of the nfnetlink use cases that have popped up with folks like yourself, @ti-mo, and @florianl.

If we can prove that it's safe to do so when network namespace manipulation is not required, I'd love to see a free performance boost for folks who don't need that functionality.

@mdlayher mdlayher changed the title from "netlink: Bypass goroutine in sysSocket.read?" to "netlink: performance bottleneck due to single goroutine in high performance applications" Oct 9, 2019
@ti-mo
Contributor

ti-mo commented Oct 11, 2019

I've never personally encountered the issue @mdlayher describes about messages being routed to the wrong sockets, and I've always wondered what the exact symptoms would be. I suspect it would look like the situation I describe below if network namespaces were involved.

The reason behind locking a worker thread to a network namespace is simple, but requires some context (which I assume @smarterclayton has 🙂). For a deep-dive, read https://www.weave.works/blog/linux-namespaces-golang-followup and related articles.


tl;dr: When Setns() is called (at least the one in the most popular netns library out there), syscall.Syscall(SYS_SETNS, ...) is called under the hood, and the setns(2) manpage says:

Given a file descriptor referring to a namespace, reassociate the calling thread with that namespace.

Because of the way Go multiplexes goroutines onto OS threads, there is absolutely no guarantee that a Setns() followed by a Dial() will execute on the same OS thread, which is why this library spawns a worker goroutine, locks it to an OS thread, and then moves that OS thread into the netns. This prevents closures delegated to the worker from being executed in the host netns and vice versa -- it prevents the runtime from scheduling an unrelated goroutine on an OS thread that has been moved into another netns.

Imagine your program made the runtime create 10 OS threads and your code contains a Setns() somewhere. The (random) OS thread the Setns() gets scheduled on is moved to another netns. From that point on, the runtime's thread pool contains 1 thread that is in a different netns. Any Dial() calls that happen to get scheduled on that thread, whether they're inet or netlink, will create sockets in a different netns than you'd expect. This could lead to security and functionality bugs, and just all-round weird behaviour.
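
To make the hazard concrete, here is a minimal illustrative sketch, not code from this library, of the lock-then-setns pattern: the thread is pinned, moved into the target namespace just long enough to create the socket, and moved back before it is unlocked. The helper name openNetlinkInNS and the use of golang.org/x/sys/unix are assumptions for the example.

```go
package nsdemo

import (
	"runtime"

	"golang.org/x/sys/unix"
)

// openNetlinkInNS creates a raw netlink socket inside the network namespace
// referred to by nsFD and returns its file descriptor. setns(2) reassociates
// the calling *thread*, so the goroutine must stay pinned to a single OS
// thread between entering the namespace and creating the socket.
func openNetlinkInNS(nsFD, proto int) (int, error) {
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// Remember the current (host) namespace so the thread can be restored.
	origFD, err := unix.Open("/proc/self/ns/net", unix.O_RDONLY|unix.O_CLOEXEC, 0)
	if err != nil {
		return -1, err
	}
	defer unix.Close(origFD)

	// Move this OS thread into the target namespace.
	if err := unix.Setns(nsFD, unix.CLONE_NEWNET); err != nil {
		return -1, err
	}
	// Always move the thread back before unlocking it, otherwise the runtime
	// may schedule unrelated goroutines onto a thread stuck in the wrong netns.
	defer unix.Setns(origFD, unix.CLONE_NEWNET)

	// The new socket inherits the namespace of the creating thread.
	return unix.Socket(unix.AF_NETLINK, unix.SOCK_RAW|unix.SOCK_CLOEXEC, proto)
}
```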


That said, it's no surprise to me that shoving a closure down a channel and waiting for results carries CPU overhead, because it simply needs to chase more pointers. Perhaps we could extend DisableNSLockThread to skip the worker model completely; I think that would be an improvement for users that don't need network namespaces.

@mdlayher Could you somehow dig up or reproduce your findings from back in the day about sockets getting mixed up?

@sbezverk

@mdlayher Instead of opening a new issue as you suggested over the Slack channel, I think my issue is in the same area as the one reported by @smarterclayton.
I have developed a kube proxy which uses nftables instead of iptables. While running performance profiling, I noticed a significant time consumer.

The two heaviest netlink nodes in the profile graph:

netlink.newLockedNetNSGoroutine.func1: 0.03s flat (0.13%), 9.29s cumulative (39.40%)
netlink.(*lockedNetNSGoroutine).run.func1: 0s flat, 8.34s cumulative (35.37%)

Here is the repo with my code: https://github.com/sbezverk/nfproxy
Happy to try any improvements you might suggest. I have a PDF file with pprof data; let me know if you want to see it.

@sbezverk

profile001.pdf

Here is the collected pprof data.

@mdlayher
Owner

I have a couple of ideas, but ultimately I really need a program that can reproduce these types of profiles so I can verify whether any of them actually pan out.

  1. We could work out a smarter approach that avoids the extra peek syscall on reads, so that a typical read doesn't require so many syscalls (a sketch of the current peek-then-read pattern follows below).
  2. We could issue system calls on multiple goroutines when network namespace/thread locking support is disabled.

Ideally I need an example program which runs on a typical Linux machine without something like Kubernetes running.
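
For reference, a rough sketch of the peek-then-read pattern idea 1 refers to, approximated with golang.org/x/sys/unix rather than the library's actual code: one MSG_PEEK recvmsg sizes the buffer, and a second recvmsg consumes the message.

```go
package peekdemo

import (
	"os"

	"golang.org/x/sys/unix"
)

// readOne reads one pending netlink datagram from fd. The first recvmsg uses
// MSG_PEEK so the data stays queued while we learn whether our buffer is big
// enough; the second recvmsg actually consumes it. The peek is the "extra
// syscall" a smarter approach would try to avoid on the common path.
func readOne(fd int) ([]byte, error) {
	b := make([]byte, os.Getpagesize())
	for {
		// Peek without consuming; n tells us how much of b was filled.
		n, _, _, _, err := unix.Recvmsg(fd, b, nil, unix.MSG_PEEK)
		if err != nil {
			return nil, err
		}
		// If the peek did not fill the buffer, the datagram fits.
		if n < len(b) {
			break
		}
		// Otherwise grow the buffer and peek again.
		b = make([]byte, len(b)*2)
	}

	// Consume the datagram for real.
	n, _, _, _, err := unix.Recvmsg(fd, b, nil, 0)
	if err != nil {
		return nil, err
	}
	return b[:n], nil
}
```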

@sbezverk

@mdlayher Here is the e2e test I use to test nftableslib functionality:

https://github.com/sbezverk/nftableslib/blob/master/cmd/e2e/e2e.go

It is basic but it does test connectivity as well. Please let me know if it works for you.

@ti-mo
Contributor

ti-mo commented Jan 22, 2020

@mdlayher Reminder that you implemented this thread locking because of https://lists.infradead.org/pipermail/libnl/2017-February/002293.html. The behaviour you reported in this email thread is what we need a reproducer for. If we have a reproducer, we can simply try removing the single thread-locked executor goroutine and see what breaks. I wouldn't touch the actual syscall logic for now.

@mdlayher
Owner

The thread locking logic remains necessary due to network namespace support. When that is disabled via config, we can explore our options.

@sbezverk

@mdlayher +1 for the ability to disable it via a parameter. Some applications, for example nfproxy, operate in the host namespace, so hopefully disabling "network namespace support" might give an extra performance boost.

@mdlayher
Owner

@sbezverk that config option exists today in the netlink.Config struct. Have you tried it? There's probably still more work to be done on optimizing that path though.

@sbezverk

@mdlayher I was not aware; I will give it a try and let you know. Thank you.

@sbezverk

@mdlayher Tested and I see roughly a 2x improvement on average. See the latest
profile001.pdf
Futex time was almost eliminated.

@sbezverk

Here is what I did in the nftables library to activate it:


	disableNSLockThread := false
	if cc.NetNS == 0 {
		// Since the process is operating in the main network namespace, there is no
		// reason to take the NSLockThread lock, as it causes some performance penalties.
		disableNSLockThread = true
	}
	return netlink.Dial(unix.NETLINK_NETFILTER, &netlink.Config{NetNS: cc.NetNS, DisableNSLockThread: disableNSLockThread})

What do you think: if NetNS == 0, is it a safe assumption that NSLockThread can be disabled?

@mdlayher
Owner

Dial will return an error if NetNS and DisableNSLockThread are both set. See the docs: https://godoc.org/github.com/mdlayher/netlink#Config.

@ti-mo
Contributor

ti-mo commented Jan 23, 2020

@mdlayher For network namespaces, only socket creation needs to take place on a locked thread in another namespace. Once the socket has been created, the thread can be safely reassigned to the main namespace. As long as you hold onto the fd, the socket can be written to from the main namespace. You only implemented thread locking to work around https://lists.infradead.org/pipermail/libnl/2017-February/002293.html.

@mdlayher
Owner

For network namespaces, only socket creation needs to take place on a locked thread in another namespace. Once the socket has been created, the thread can be safely reassigned to the main namespace.

This is news to me. Do you have documentation I can read?

@ti-mo
Contributor

ti-mo commented Jan 24, 2020

I don't think this is explicitly documented anywhere, but I remember talking about this with you a couple of months ago after I implemented these e2e integration tests in conntracct: https://github.com/ti-mo/conntracct/blob/286c127/pkg/bpf/integration_test.go#L397-L401.

This test locks the OS thread, creates and enters a new netns with the thread, brings up interfaces, sets up some nft rules, opens two UDP sockets in the netns, returns the thread to the main namespace, unlocks the thread, and returns. If you read the udpecho package, you'll see it's a simple package that echoes UDP packets; it's completely unaware of the netns it operates in.

That makes me conclude that it's possible to read from and write to sockets in other namespaces as long as the fd is held onto. Otherwise, any socket operations would fail as soon as the creating thread returns to the main namespace. This should be no different for netlink sockets; someone just needs to dive in and confirm it.
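
A small illustrative sketch of that conclusion, not code from this library: once a netlink fd has been created inside another namespace, for example by a helper like the hypothetical openNetlinkInNS sketched earlier in this thread, reading it requires no thread locking or setns at all.

```go
package nsdemo

import (
	"os"

	"golang.org/x/sys/unix"
)

// consumeFromHostNS reads one message from a netlink socket whose fd was
// created while a thread was temporarily inside another network namespace.
// No runtime.LockOSThread or setns(2) is needed here: the namespace is a
// property of the socket, fixed at creation time, so any goroutine on any
// OS thread can use the fd.
func consumeFromHostNS(fd int) ([]byte, error) {
	buf := make([]byte, os.Getpagesize())
	n, _, _, _, err := unix.Recvmsg(fd, buf, nil, 0)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}
```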

So: https://lists.infradead.org/pipermail/libnl/2017-February/002293.html should be the only reason the thread locking exists, and I'm sure it was already there when I contributed netns support. 🙂 We should try to reproduce the phenomenon you described in the thread, and try to narrow down the root cause, since it might mean that we could remove a lot of code and improve efficiency by a ton.

@ti-mo
Contributor

ti-mo commented Jul 28, 2020

FWIW I'm working on this in #171. With the release of a possibly-breaking v1.2.0 on the horizon, this might be a good time to revisit and bury this one. I'll attempt some spelunking to reproduce the message misrouting bug so we can safely move forward with this. As a bonus, this removes some thorny code.

@mdlayher
Owner

I'm pretty sure #171 and the newly released v1.2.0 will make a significant difference for this issue.

@mdlayher
Owner

Thinking about it more, this issue is really old and so much has changed, so I'm just going to close it out. We can always open a new one if there are more issues to be handled.
