conn, device, tun: implement vectorized I/O on Linux #64
Conversation
Thanks for this very cool approach! Before I jump in, some surfacy questions:
You marked this as a draft. What's missing? What's broken? Should I wait to look at it until it's ready, or is this still worth a read through now?
Why would somebody want to do this? Can we eliminate whatever that reason is, so that there's one choice that works for everyone, and then not need the env var?
It's ready and has had extensive testing already, but since we had some new rebasing to do there were some likely changes (which Jordan has pushed already). The code is very stable and ready for review; we're not expecting any major changes from further testing. We plan to continue testing with this code integrated into tailscaled.
We added this flag so that if a user discovers that they're on a buggy kernel, or hits an edge case we have not been able to anticipate, there is a knob they can use to restore functional behavior. Offloads downstream of TUN are extremely lightly used outside of VMMs; we're not aware of any public code that exercises these kernel code paths outside of VMMs. We could remove the flag, since most of the behavior can be disabled by changing the option using ethtool, but the environment variable also reduces memory allocated for buffers.
…ard-go This is temporary while we work to upstream performance work in WireGuard/wireguard-go#64. A replace directive is less ideal as it breaks dependent code without duplication of the directive. Signed-off-by: Jordan Whited <jordan@tailscale.com>
…ard-go (#6692) This is temporary while we work to upstream performance work in WireGuard/wireguard-go#64. A replace directive is less ideal as it breaks dependent code without duplication of the directive. Signed-off-by: Jordan Whited <jordan@tailscale.com>
It may be worth mentioning that the in-kernel implementation isn't optimized, and the Go version is the preferred method on Linux. I tried sending a note to the mailing list a couple months back before the article came out, but it was moderated.
I don't think that's strictly true in all cases. We managed to achieve excellent performance with these patches, and that's a great outcome, but for some shapes of CPU the kernel is still faster and uses fewer resources. There are also additional considerations around packet pacing, such as in congestion control, that work better with the kernel implementation in many scenarios. Both are solid implementations, but I don't think one should be preferred without context or condition. Despite being written in C, the kernel version has also probably had more verification work done on it overall, so some may consider it more secure (though there'd be plenty to debate in general). In short, throughput is important, but it's not the only consideration.
conn/bind_windows.go
@@ -321,6 +321,11 @@ func (bind *WinRingBind) Close() error {
	return nil
}

func (bind *WinRingBind) BatchSize() int {
	// TODO: implement batching in and out of the ring
Presumably all those '!fixup' commits are intended to be squashed down, and this whole PR made into one big commit. That's fine, so long as it's complete. But it looks like the Windows part still isn't done. So I'm wondering what the intention is with regards to that.
Yeah, the fixups are essentially from a `git commit --fixup` workflow. There are many ways to cut this pie, but one of the slightly more helpful patterns here is that subsequent reviews can easily look at a diff of the changes made in response to comments, rather than seeking them out of a new singular whole. `git rebase --autosquash` will flatten these down, and this is the default behavior of "rebase and merge" on GitHub. It is our intent to have the fixups squashed.
Regarding Windows, we're not planning to include full support in this change. As you can see, we moved the API over for consistency and in anticipation of doing that. We have identified additional changes that need to be made and tested independently in order to get there, and we've started looking more deeply into those details already, but it's worthwhile getting these changes reviewed and landed in the meantime. We certainly don't want to leave Windows behind, but it has its own ABI and API differences for offloads, and we likely need to make some changes in upstream repos as well (e.g. to implement ReadBatch/WriteBatch support in x/* (for bindstd)), and for full effect also some changes in Wintun we don't know the full shape of yet. There may be interesting discussions to be had as to whether rio is worthwhile once batching and segment offloading are implemented, but we can have that discussion once we have more concrete change proposals and benchmark results.
Regarding Windows, we're not planning to include full support in this change.
Okay. In that case, please separate Linux into its own commit, since Windows (and other platforms) will be a standalone commit too.
That requires going back on the prior discussion we had about unifying the I/O interface: we'd have to reintroduce two different I/O APIs (one for a batch of bufs, the other for a single set of bufs). I'm not sure the split would be worthwhile; it makes the interfaces substantially more complex to read and use, and may even reduce performance due to the need for more dynamic interface usage.
It doesn't require that. Rather, in the initial commit, you do for Linux what you did for Windows: batch size = 1 with the new API. Then in the subsequent one, you wire up sendmmsg(). That will make bisecting easier too.
The first commit in the series excludes mmsg and TSO/GRO, batchSize=1 everywhere. d115be4
The second commit in the series implements mmsg and TSO/GRO e6bfbdf
// silently clamp the value to other maximums, such as linux clamping to
// net.core.{r,w}mem_max (see _linux.go for additional implementation that works
// around this limitation)
const socketBufferSize = 7 << 20
Should this be platform-defined? It seems odd to justify the use of 7 MB on the grounds of macOS and then apply it everywhere. Or is it the case that every other platform clamps below 7 MB and never above, so 7 MB represents a useful upper bound?
It could be made platform-defined any time we decide that's a good idea, if we demonstrate a scenario where it is worthwhile.
This value would ideally be tunable; it is essentially the main tuning knob around which you can address BDP. What we observed in testing is that we need a value substantially larger than the rather conservative default of hundreds of kilobytes in scenarios where there is either significant latency between peers or significant bandwidth available.
The 7 MB value, while a bit strange courtesy of the macOS limit, is not strongly worse than, say, splitting this out and picking 8 MB for Linux. We'd like to see more feedback from users regarding use cases here. In our testing this makes 100 ms latency links perform much better (once whatever is inside also warms up, such as window expansion), and removes a significant bottleneck in reaching speeds in the 1-10 Gbps range. It is possible it is not sufficient for some 25 Gbps+ use cases, but other bottlenecks in the system need to be resolved first, so adding potentially substantially more memory usage now would be premature.
I just worry that we'll want to change it at some point and then inadvertently break macOS.
I just worry that we'll want to change it at some point and then inadvertently break macOS.
Break because we make the value too large and miss the comment? Would adding a darwin test checking the value be sufficient in this case? Happy to break it out per-platform, but maybe one test file is simpler? Or we could do both.
controlFns = append(controlFns,

	// Attempt to set the socket buffer size beyond net.core.{r,w}mem_max by
	// using SO_*BUFFORCE. This requires CAP_NET_ADMIN, and is allowed here to
Where do you envision tun being available but not SO_*BUFFORCE? Android where it's passed over a unix domain socket?
Yeah, for better and worse, it's becoming a bit chaotic out there as to which feature sets overlap, largely due to security policy constraints. The two big cases are Android and container runtimes, with much larger variation in configurations in the latter. On Android this call fails due to lack of CAP_NET_ADMIN, though I would have expected it to also be caught by the SELinux configuration (as the offload sadly is) 🤷🏼
We might consider reporting the outcome, but to some extent I'm also waiting to see what happens here with the defaults once userspace QUIC deployments become more common. This will come up there once those folks also get through optimizations, perhaps it's time to change the defaults or some of the constraints.
I'd be happy to write some notes about this in a doc somewhere, if we have a suitable doc in a suitable place.
I'd be happy to write some notes about this in a doc somewhere, if we have a suitable doc in a suitable place.
The right place for those types of implementation notes is commit messages. It seems like this snippet can be separated out into a standalone thing. If you think there's interesting context and such, just put it in its own commit and stick that in a commit message. (But if it's not that interesting and doesn't matter, don't bother; up to you.)
The socket options for UDP snd/rcv buffer size have been broken out into their own commit, with these details included in the commit message aeed2ba.
conn/controlfns_linux.go
switch network {
case "udp4":
	c.Control(func(fd uintptr) {
		err = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_IP, syscall.IP_PKTINFO, 1)
Applies everywhere, but I just noticed -- sometimes syscall is used and other times unix is used. Should we be centralizing around using the unix package instead? At least for Windows code, syscall is obsolete and x/sys/windows is where it's at.
I'll move a bunch over. While it's true the syscall package is mostly frozen, it is in that regard also explicitly more stable; that said, I don't imagine any of these packages are likely to cause us regressions. Since we are going to use things that likely won't arrive in the syscall package, it may be more consistent to use the new packages, except where we end up with API/symbol conflicts.
Correction: I spoke too soon. There are a bunch of places where we're using syscall because it defines symbols that are portable across unix and windows. For example, `syscall.EAFNOSUPPORT` in a windows build is equivalent to the `windows.WSAEAFNOSUPPORT` symbol, but removes the need to bifurcate large volumes of code, such as about 50% of bind_std.go.
I could bind the value to a local symbol from three files, but that's just repeating what the syscall package is doing. I think it would be a net loss to cost of maintenance, and it definitely adds substantial build time compared with not doing so.
So let's use `unix.` where possible, and only use `syscall.` when absolutely necessary? Or, perhaps a better way of thinking of it: use `unix.` in files that are _unix-tagged, `windows.` in files that are _windows-tagged, and then `syscall.` is left over for the shared files. Does that seem reasonable?
James applied this in f54aa5a.
Looks like `syscall.` is still used in the Windows portion of aeed2ba
Updated in a6db822
// Set SO_RCVBUF/SO_SNDBUF - this could be common with the _unix code except
// for the unfortunate type specificity of syscall.Handle.
func(network, address string, c syscall.RawConn) error {
	return c.Control(func(fd uintptr) {
Does this actually do anything on Windows, given that Windows uses WinRIO? It might, I just don't know myself.
Yes, at least for API users who use StdNetBind, such as Tailscale.
`SO_{RCV,SND}BUF` seems like a new thing. Can you separate its use out into a different commit with its own justification?
(It can stay in this PR, but just have a separate commit with a message I can read to understand this.)
The socket options for UDP snd/rcv buffer size have been broken out into their own commit, with these details included in the commit message aeed2ba.
tun/tun_linux.go
}

const (
	// TODO: support TSO with ECN bits
What's the status of this TODO? How important is it? Any plans to do it? Or not really?
I think it should be done, but my current thinking is that it's low priority. Operating over a WireGuard tunnel, I think it would be pretty rare for ECN to be marked, echoed, and responded to with the CWR bit set. The middleboxes don't have visibility into the headers for marking to occur, and WireGuard doesn't leak them to the outer header. In the case of forwarding beyond the tunnel, maybe this becomes more commonplace.
The Linux Kernel netdev features doc says:
NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit
set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6).
I don't currently know what "properly" means here. Probably just setting it on the first/last packet in the batch? I'd have to confirm. The Windows docs hint:
If the CWR bit in the TCP header of the large TCP packet is set, the miniport driver must set this bit in the TCP header of the first packet that it creates from the large TCP packet. The miniport driver may choose to set this bit in the TCP header of the last packet that it creates from the large TCP packet, although this is less desirable.
It's worth noting that the kernel implementation does leak the ECN bits in accordance with some RFC, which helps CAKE do its thing. This has long been an overdue feature of wireguard-go.
Noted. In that case I imagine TSO_ECN support should be added before ECN leak. We can certainly add TSO_ECN support in a followup series. I'm not yet familiar with everything required for the ECN leak.
Take a look at `PACKET_CB(skb)->ds = ip_tunnel_ecn_encap(0, ip_hdr(skb), skb);` in send.c, and then at `INET_ECN_decapsulate(skb, PACKET_CB(skb)->ds, ip_hdr(skb)->tos);` and `INET_ECN_decapsulate(skb, PACKET_CB(skb)->ds, ipv6_get_dsfield(ipv6_hdr(skb)));` in receive.c. The `ip_tunnel_ecn_encap`, `INET_ECN_decapsulate`, and `ipv6_get_dsfield` functions are all worth looking at in their own right.
Again, awesome work on this patchset. Can't wait to merge it. That GRO/TSO code is really cool. I just took a last pass at it. After addressing those comments, please do proceed to the squashing/resplitting of commits.
👍 I'm going to start squashing/resplitting/rebasing. There is one thread outstanding with an open question: #64 (comment). Will include any changes needed there if/when they arise.
Force-pushed from cdac47d to aeed2ba.
@zx2c4 I've squashed, rebased, and split the commits. Ready for final review.
Alright, I've merged those first two commits. So congratulations! That's the bulk of the work! The last one still has the overlooked syscall->windows thing to fix. But more importantly, I'd like to actually see some justification and reasoning in the commit message for why those sockopts are necessary, what they accomplish, and some measurements on the speedup achieved. Right now it just looks like fiddling some knobs.
The conn.Bind UDP sockets' send and receive buffers are now being sized to 7MB, whereas they were previously inheriting the system defaults. The system defaults are considerably small and can result in dropped packets on high speed links. By increasing the size of these buffers we are able to achieve higher throughput in the aforementioned case.

The iperf3 results below demonstrate the effect of this commit between two Linux computers with 32-core Xeon Platinum CPUs @ 2.9Ghz. There is roughly ~125us of round trip latency between them.

The first result is from commit 792b49c which uses the system defaults, e.g. net.core.{r,w}mem_max = 212992. The TCP retransmits are correlated with buffer full drops on both sides.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  4.74 GBytes  4.08 Gbits/sec  2742    285 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  4.74 GBytes  4.08 Gbits/sec  2742  sender
[  5]   0.00-10.04  sec  4.74 GBytes  4.06 Gbits/sec        receiver

The second result is after increasing SO_{SND,RCV}BUF to 7MB, i.e. applying this commit.

Starting Test: protocol: TCP, 1 streams, 131072 byte blocks
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-10.00  sec  6.14 GBytes  5.27 Gbits/sec  0     3.15 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
Test Complete. Summary Results:
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  6.14 GBytes  5.27 Gbits/sec  0     sender
[  5]   0.00-10.04  sec  6.14 GBytes  5.25 Gbits/sec        receiver

The specific value of 7MB is chosen as it is the max supported by a default configuration of macOS. A value greater than 7MB may further benefit throughput for environments with higher network latency and lower CPU clocks, but will also increase latency under load (bufferbloat). Some platforms will silently clamp the value to other maximums.

On Linux, we use SO_{SND,RCV}BUFFORCE in case 7MB is beyond net.core.{r,w}mem_max.

Co-authored-by: James Tucker <james@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Force-pushed from aeed2ba to a6db822.
I've added more details to the commit message in a6db822 along with some measurements.
That looks really good (and you noticed the ordering change of the S-o-b :-P). Thanks for the measurements. Looks like that makes a huge difference. Hopefully a RRUL test doesn't suffer too much from it in terms of latency. Committed! So I'll close this PR now, with one outstanding thing I don't yet know what to do about: unfortunately, it looks like this isn't actually used on Windows, because listenConfig() is only used by bind_std.go. So the control functions framework is _std-specific. Should we remove the Windows code because it's not used? Or keep it because it could be used if for some reason you use bind_std on Windows instead of the winrio stuff? Or maybe the Windows code should use the control functions too?
Also, I made a small cleanup here: https://git.zx2c4.com/wireguard-go/commit/?id=b6a68cf211aa7b585c8e178e401060a4764ce5db If you guys have a moment to finish off bind_windows.go, that would be super great. |
This commit changes the tun.Device and conn.Bind interfaces to accept packet vectors for reading and writing. Internal plumbing between these interfaces now passes a vector of packets. Vectors move untouched between these interfaces, i.e. if 128 packets are received from conn.Bind.Read(), 128 packets are passed to tun.Device.Write(). There is no internal buffering.
Platform-specific implementations of tun.Device have been updated to the new tun.Device interface, but only Linux supports passing more than one packet for now. The Linux tun.Device implementation accomplishes this via TSO and GRO, which is made possible by virtio extensions in the TUN driver. Linux TUN offloading can be disabled by setting the WG_DISABLE_OFFLOAD environment variable to 1.
conn.LinuxSocketEndpoint has been deleted in favor of a collapsed conn.StdNetBind. conn.StdNetBind makes use of recvmmsg() and sendmmsg() on Linux. All platforms fall under conn.StdNetBind, except for Windows, which remains in conn.WinRingBind. Sticky sockets support has been refactored as part of this work to eventually be applicable on platforms other than just Linux, however Linux remains the sole platform that fully implements it. The conn.Bind UDP socket buffer is now being sized to 7MB, whereas it was previously inheriting the system default.
device.Keypairs locking has been simplified from a RWMutex + atomic to just a Mutex.
Signed-off-by: Jordan Whited <jordan@tailscale.com>
Signed-off-by: James Tucker <james@tailscale.com>
Co-authored-by: James Tucker <james@tailscale.com>