-
Notifications
You must be signed in to change notification settings - Fork 97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce thread locking behaviour to dial phase only #171
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I like it, thanks for doing the work.
@mdlayher are you positive the TestIntegrationConnConcurrent* family of tests is able to reproduce this? Can you remember which Go and kernel version you were using at the time when this was written?
As far as I can remember yes, and no I don't have the Go/kernel version at the time. This occurred when the library was quite immature and it's possible that there were other problems in the package which exacerbated the issue. I would expect the concurrent tests to catch any problems if they still occur.
We don't need to lock OS threads permanently to operate in multiple namespaces.
Cool, sounds good to me.
Still looking for a way to inject a mock func getNS somewhere into the dial() code path to emulate systems with network namespaces disabled.
I'm good leaving it out for now so we can think about it more.
To clarify, would you intend for this to go into a v1.2.0 (Go 1.12+ only) and not v1.1.1 (Go 1.11+)?
With the goal of dropping support for Go 1.11 and lower, this might be a good time to revisit.
I'm still planning to proceed with the plan in #170 and with Go 1.15 RC1 available it'll be sooner rather than later. I suppose for safety reasons I would prefer to merge this into v1.2.0 since v1.1.1 will be the last "stable" release that supports Go 1.11, even with some of the existing bugs/misfeatures such as Close not unblocking receive and this thread locking behavior.
c48ae9b
to
0d7ef2d
Compare
Sorry, I haven't been around to work on this. Initial results are promising! The integration test suite completes about twice as fast on my machine, reflected in the profiles by an absence of lock contention around the worker closure channel. (since it's gone, obviously) I'll do some virtme spelunking against older kernels and see if I can reproduce this thing. Take care! |
Alright, I went as far back as kernel Edit: Also went back to Go 1.10 (1.9 doesn't have deadline support on net fds), also confirmed working on kernels as early as 3.2. So, the oldest combo I've tested is test binaries from Go 1.10 running on kernel |
Timing the execution time of the test suite on an 8-core Ryzen 3700X running kernel `5.7.19-1-ck`. Test binary built with go1.15.2. Before: `sudo ./netlink.test -test.v 29.23s user 21.35s system 1299% cpu 3.892 total` After: `sudo ./netlink.test -test.v 4.29s user 0.79s system 532% cpu 0.955 total` These are not one-off results, the execution time sat consistently around the values shown. Compatibility-wise, the integration test suite passes on systems as early as Linux `3.2.0-5-amd64` with test binaries built with Go 1.10.
0d7ef2d
to
6189c36
Compare
Timing the execution of the test suite on an 8-core Ryzen 3700X running kernel Before: These are not one-off results, the execution time sat consistently around the values shown. Compatibility-wise, the integration test suite passes on systems as early as Linux |
According to the test suite, this works with Go 1.10 just fine, so not sure if we should worry too much about this. Re: mocking a test for disabled namespaces: |
@mdlayher I've added a short |
Thanks again for this. |
👋 Back again with another controversial change. 😄 This aims to eliminate the bottleneck described in #154. This is the approach I described in #154 (comment). With the goal of dropping support for Go 1.11 and lower, this might be a good time to revisit.
To Do:
newLockedNetNSGoroutine
was removedThere are 2 elements at play here:
The message misrouting discussed in http://lists.infradead.org/pipermail/libnl/2017-February/002293.html
With this change, I'd like to start spelunking older versions of the kernel and the Go runtime for this bug. @mdlayher are you positive the
TestIntegrationConnConcurrent*
family of tests is able to reproduce this? Can you remember which Go and kernel version you were using at the time when this was written?Socket creation inside network namespaces
As far as namespaces are concerned, OS threads only briefly need to enter a netns to create the netlink socket and can safely be returned to the host namespace once the socket fd has been allocated. We don't need to lock OS threads permanently to operate in multiple namespaces.
Some feedback please. 🙏 Still looking for a way to inject a mock
func getNS
somewhere into thedial()
code path to emulate systems with network namespaces disabled.