Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce thread locking behaviour to dial phase only #171

Merged
merged 2 commits into from
Oct 12, 2020

Conversation

ti-mo
Copy link
Contributor

@ti-mo ti-mo commented Jul 28, 2020

👋 Back again with another controversial change. 😄 This aims to eliminate the bottleneck described in #154. This is the approach I described in #154 (comment). With the goal of dropping support for Go 1.11 and lower, this might be a good time to revisit.

To Do:

  • Reintroduce a test for machines with network namespaces disabled, since newLockedNetNSGoroutine was removed

There are 2 elements at play here:

  • The message misrouting discussed in http://lists.infradead.org/pipermail/libnl/2017-February/002293.html
    With this change, I'd like to start spelunking older versions of the kernel and the Go runtime for this bug. @mdlayher are you positive the TestIntegrationConnConcurrent* family of tests is able to reproduce this? Can you remember which Go and kernel version you were using at the time when this was written?

  • Socket creation inside network namespaces
    As far as namespaces are concerned, OS threads only briefly need to enter a netns to create the netlink socket and can safely be returned to the host namespace once the socket fd has been allocated. We don't need to lock OS threads permanently to operate in multiple namespaces.


Some feedback please. 🙏 Still looking for a way to inject a mock func getNS somewhere into the dial() code path to emulate systems with network namespaces disabled.

Copy link
Owner

@mdlayher mdlayher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I like it, thanks for doing the work.

@mdlayher are you positive the TestIntegrationConnConcurrent* family of tests is able to reproduce this? Can you remember which Go and kernel version you were using at the time when this was written?

As far as I can remember yes, and no I don't have the Go/kernel version at the time. This occurred when the library was quite immature and it's possible that there were other problems in the package which exacerbated the issue. I would expect the concurrent tests to catch any problems if they still occur.

We don't need to lock OS threads permanently to operate in multiple namespaces.

Cool, sounds good to me.

Still looking for a way to inject a mock func getNS somewhere into the dial() code path to emulate systems with network namespaces disabled.

I'm good leaving it out for now so we can think about it more.


To clarify, would you intend for this to go into a v1.2.0 (Go 1.12+ only) and not v1.1.1 (Go 1.11+)?

With the goal of dropping support for Go 1.11 and lower, this might be a good time to revisit.

I'm still planning to proceed with the plan in #170 and with Go 1.15 RC1 available it'll be sooner rather than later. I suppose for safety reasons I would prefer to merge this into v1.2.0 since v1.1.1 will be the last "stable" release that supports Go 1.11, even with some of the existing bugs/misfeatures such as Close not unblocking receive and this thread locking behavior.

conn.go Outdated Show resolved Hide resolved
@ti-mo
Copy link
Contributor Author

ti-mo commented Sep 11, 2020

Sorry, I haven't been around to work on this. Initial results are promising! The integration test suite completes about twice as fast on my machine, reflected in the profiles by an absence of lock contention around the worker closure channel. (since it's gone, obviously)

I'll do some virtme spelunking against older kernels and see if I can reproduce this thing. Take care!

@ti-mo
Copy link
Contributor Author

ti-mo commented Sep 12, 2020

Alright, I went as far back as kernel 3.2.0-5-amd64 that shipped with Debian Wheezy, integration test suite ran without any hiccups. Test binaries built with go1.15.2. All is well!

Edit: Also went back to Go 1.10 (1.9 doesn't have deadline support on net fds), also confirmed working on kernels as early as 3.2. So, the oldest combo I've tested is test binaries from Go 1.10 running on kernel 3.2.0-5-amd64. 👍

Timing the execution time of the test suite on an 8-core Ryzen 3700X running kernel `5.7.19-1-ck`.
Test binary built with go1.15.2.

Before: `sudo ./netlink.test -test.v  29.23s user 21.35s system 1299% cpu 3.892 total`
After: `sudo ./netlink.test -test.v  4.29s user 0.79s system 532% cpu 0.955 total`

These are not one-off results, the execution time sat consistently around the values shown.

Compatibility-wise, the integration test suite passes on systems as early as Linux `3.2.0-5-amd64`
with test binaries built with Go 1.10.
@ti-mo
Copy link
Contributor Author

ti-mo commented Sep 12, 2020

Timing the execution of the test suite on an 8-core Ryzen 3700X running kernel 5.7.19-1-ck. Test binary built with go1.15.2.

Before: sudo ./netlink.test -test.v 29.23s user 21.35s system 1299% cpu 3.892 total
After: sudo ./netlink.test -test.v 4.29s user 0.79s system 532% cpu 0.955 total

These are not one-off results, the execution time sat consistently around the values shown.

Compatibility-wise, the integration test suite passes on systems as early as Linux 3.2.0-5-amd64
with test binaries built with Go 1.10.

@ti-mo
Copy link
Contributor Author

ti-mo commented Sep 12, 2020

To clarify, would you intend for this to go into a v1.2.0 (Go 1.12+ only) and not v1.1.1 (Go 1.11+)?

According to the test suite, this works with Go 1.10 just fine, so not sure if we should worry too much about this.

Re: mocking a test for disabled namespaces: fileNetNS() doesn't have tests right now, which we should probably address. Still thinking about how to test Dial(). Any ideas?

@ti-mo ti-mo changed the title [WIP] Reduce thread locking behaviour to dial phase only Reduce thread locking behaviour to dial phase only Sep 17, 2020
@ti-mo
Copy link
Contributor Author

ti-mo commented Sep 17, 2020

@mdlayher I've added a short netns_linux_test.go to validate the behaviour of netns.disabled. Along with the existing integration test suite, that brings coverage of netns_linux.go to 100%. Feel free to merge if no more feedback from your side!

@mdlayher mdlayher merged commit 26242dd into mdlayher:master Oct 12, 2020
@mdlayher
Copy link
Owner

Thanks again for this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants