Consider adding Scatter/Gather IO on Stream #25344
Comments
cc @stephentoub |
cc @davidfowl -- I can imagine this might be interesting for you as well. |
Oh nice! Yes I like this. Should we also consider adding overloads for ReadOnlySequence<byte>? |
This could work for Sockets cross-plat, and on Linux/Mac for file I/O. On Windows we could emulate it trivially for files. There would be a long period of writing optimized methods for SslStream, compression, etc., but that could be done piecemeal. Other considerations:
|
We may need these for efficient HTTP/3 if the QUIC implementation doesn't implement one of Nagle/cork. We should coordinate. |
Following up re: HTTP/3. We ended up implementing this in our HTTP/3 request stream; it allows us to gather headers, framing, and user data in a single I/O: runtime/src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/Http3RequestStream.cs, lines 372 to 422 in a0c51bf
We would use this in HTTP/1 and HTTP/2 in a similar way. |
Gathered writes would allow optimizing the TLS record size in SslStream. Currently, if you have several buffers and use separate Write calls, it creates separate records, adding more overhead. See https://hpbn.co/transport-layer-security-tls/#optimize-tls-record-size |
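To illustrate the record overhead (a hedged sketch; the SslStream.Write calls are real API, while the gathered overload is the hypothetical one proposed in this issue):

```csharp
using System;
using System.Net.Security;

static void WriteFramed(SslStream sslStream, byte[] header, byte[] payload, byte[] footer)
{
    // Today: three Write calls produce three TLS records, each paying
    // its own record framing and MAC overhead on the wire.
    sslStream.Write(header);   // record 1
    sslStream.Write(payload);  // record 2
    sslStream.Write(footer);   // record 3

    // With the gathered write proposed in this issue, SslStream could
    // encrypt all three buffers into a single record (hypothetical):
    // sslStream.Write(new ReadOnlyMemory<byte>[] { header, payload, footer });
}
```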
We have defined some gather write operations for QUIC on the QuicStream class. Ideally, these would not be QUIC-specific; i.e. we'd add them to Stream and then implement them in QuicStream. |
@carlossanlop any questions here or can we move it to API review? |
I would like to see a prototype of how these will be consumed (both read and write) and perf numbers to demonstrate it's meaningful enough to warrant adding more virtuals to Stream. Where are we going to use the scattered read methods? How much does the gathered write support improve HTTP/2 throughput? How are callers going to avoid allocating IReadOnlyList instances? I'm not against adding these. But we should take any new virtuals on Stream seriously and do our due diligence. |
This in particular is something I was worried about. Also, Kestrel would use this for writes but not reads. @tmds's custom transport also wanted to use these. Why not something directly on Socket rather than virtuals on Stream? |
The idea is you use the same List/array for every send/receive. So you only need one allocation per socket, which doesn't seem like an issue.
Socket already supports this. https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketTaskExtensions.cs#L38 The point of this would be to support it in non-Socket cases, like SslStream or QuicStream. |
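For reference, a minimal sketch of that existing Socket multi-buffer send (buffer contents here are made up for illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Net.Sockets;
using System.Text;
using System.Threading.Tasks;

static async Task<int> GatheredSendAsync(Socket socket)
{
    byte[] header  = Encoding.ASCII.GetBytes("5\r\n");
    byte[] payload = Encoding.ASCII.GetBytes("hello");
    byte[] trailer = Encoding.ASCII.GetBytes("\r\n");

    // One send call covering all three segments; the list can be reused
    // across sends, so it's a one-time allocation per socket.
    var buffers = new List<ArraySegment<byte>>(3)
    {
        new ArraySegment<byte>(header),
        new ArraySegment<byte>(payload),
        new ArraySegment<byte>(trailer),
    };

    return await socket.SendAsync(buffers, SocketFlags.None);
}
```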
This is a hard question to answer authoritatively, because there's always a way to avoid using scatter/gather by simply coalescing buffers. That said, coalescing has some cost itself, which can vary based on a variety of factors; and also it's a bit of a pain to do. @scalablecory said above:
And that's a nice convenience, but it's not clear to me that gathered writes on QuicStream are necessary to achieve this. We already achieve this in HTTP1.1/HTTP2 by managing our buffers to specifically enable this. |
I prototyped this in HTTP/1 and I saw a measurable but very small impact. I don't remember the exact numbers but it was less than 1-2%. With HTTP/3, I use it to avoid buffering every data frame when content length isn't known, just to prepend a handful of bytes. I haven't bothered to measure in this case because it was such an obvious and easy win. |
For chunked encoding, I'm guessing? It does seem like chunked encoding and HTTP3 framing when length is unknown are pretty much ideal use cases for this. It avoids a potentially large buffer copy. |
I believe I was also deferring sending headers buffer until sending content when possible, to send them out in one go. There were a few opportunities. (With nodelay on, this can result in 1 packet instead of 2 for some requests) |
The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs. When implementing a protocol using |
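A minimal sketch of that userspace coalescing, assuming the caller supplies the buffers (names hypothetical):

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Threading.Tasks;

// Copy several small buffers into one rented array so the underlying
// stream sees a single write (one syscall) instead of one per buffer.
static async Task CoalescedWriteAsync(Stream stream, ReadOnlyMemory<byte>[] buffers)
{
    int total = 0;
    foreach (ReadOnlyMemory<byte> b in buffers)
        total += b.Length;

    byte[] rented = ArrayPool<byte>.Shared.Rent(total);
    try
    {
        int offset = 0;
        foreach (ReadOnlyMemory<byte> b in buffers)
        {
            b.Span.CopyTo(rented.AsSpan(offset));
            offset += b.Length;
        }

        await stream.WriteAsync(rented.AsMemory(0, total));
    }
    finally
    {
        ArrayPool<byte>.Shared.Return(rented);
    }
}
```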
This is exactly why I was looking for concrete perf data.
To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types), so either it would need to access new internal methods, or Socket would need new public ones. |
True. It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way. |
The Windows path is actually fairly easy (I did a minimal version of this for my prototyping), but the Linux PAL is scary huge. |
There's a lot of code in the Linux PAL stuff, but it should be relatively straightforward. |
Confirmed this morning that we can implement this in our SChannel code. This would give us more efficient TLS, as each SslStream.Write() needs to write not just data but an envelope. |
Numbers? |
I'm trying to figure out what conclusions to draw from this data. An obvious one is, fewer IOs are better. But when performing the same # of IOs, buffering seems faster than gathering (either buffered + gathered or unbuffered + gathered). Doesn't this imply we should just do buffering? |
BTW, I ran the "two HttpContent Writes" scenario with an 8K write buffer (instead of 4K). Here's what I got:
Again, # of IOs is most important; again, Buffered beats both Gathered strategies. What's also interesting here, though, is that Unbuffered + Gathered suffers hugely because it now has to perform two separate writes. And that's because gathering does not allow coalescing across user writes, whereas buffering does. |
You don't want to buffer at all of the layers because that's wasted CPU cycles copying. Especially for larger buffers, it's wasteful. The APIs we expose today require the caller to provide the buffer, so we need to do something with it, and the options are to copy or to write directly (when there's no transform of the data itself required). I can see 2 approaches:
As an example of this, consider a chunked encoding write implementation. WriteAsync wants to wrap the user provided data with a header and a footer:
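A hedged sketch of what that looks like (WriteChunkAsync is a hypothetical helper; the gathered overload at the end is the one proposed in this issue):

```csharp
using System;
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

static async ValueTask WriteChunkAsync(Stream inner, ReadOnlyMemory<byte> userData, CancellationToken ct)
{
    // Chunk header: payload length in hex plus CRLF; footer: CRLF.
    byte[] header = Encoding.ASCII.GetBytes($"{userData.Length:X}\r\n");
    byte[] footer = { (byte)'\r', (byte)'\n' };

    // Today's option 1: three separate writes -- no copy, three I/Os.
    await inner.WriteAsync(header, ct);
    await inner.WriteAsync(userData, ct);
    await inner.WriteAsync(footer, ct);

    // Today's option 2 (better for small payloads): coalesce all three
    // into one rented buffer -- one I/O, one copy.
    // The proposed gathered overload would give one I/O with no copy:
    // await inner.WriteAsync(new ReadOnlyMemory<byte>[] { header, userData, footer }, ct);
}
```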
If data is small enough (say < 2K), then it makes sense to copy that into an internal buffer before sending. But let's say the user wrote 65K (large file transfer or something). Then doing multiple writes, or a single write with multiple buffers, likely makes more sense. BTW, multi-buffer writes are just a way of pushing that eventual copy further down the stack. It doesn't stop the eventual copy, just the intermediate ones. |
@geoffkizer with those numbers did you max out the CPU? You should also try it with multiple buffer sizes to see where the tipping point is. |
No, I'm simply taking Cory's test from above and modifying it slightly. I do agree it would be better to do a real throughput test here and max out CPU, but that's not what the code does currently. |
Based on what I said above, I think there are some scenarios where this is clearly going to be a win. We need to find the right heuristic though, because doing it all the time definitely isn't. Do we want to invest the effort here exploring this? Measuring something real means making changes to the stack and trying them out. The scenarios that stick out to me so far are websockets and chunked encoding of large buffers. These are write scenarios. I'm not convinced it'll help for TLS as much because the data needs to be transformed... |
I agree
I agree
It would certainly be good to get more data here. Edit: In particular, we've called out that nonencrypted, chunked responses with large chunks are a scenario that could benefit. My suggestion would be to do a real WRK (or similar) based throughput test across machines (no loopback) for this scenario. The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways: (1) three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk. Then run variations with different chunk sizes. |
Lots to unpack here 😄. Just to keep me from getting too myopic and only thinking of HTTP, I'll start with a goal and a non-goal for this API:
And the non-goal: gathering is not an alternative to buffering. It would only replace buffering on a per-Write basis, not as a general strategy. As demonstrated, it actually works in conjunction with buffering by further reducing I/O count in the scenario where you'd need to flush immediately after overflowing your buffer. It won't stop a user's [...] And specifically for [...]

We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce. Either way, gathering would be able to help: if we decide we don't want additional memory usage and don't buffer, it'll help; if we decide we do want to buffer, it will help us extract better perf out of that buffer, or may lead to us using a smaller buffer size than we otherwise would.

At least for SChannel, when considering gathered vs. perfectly buffered, I don't think gathering will be a major win (read: reduced I/O counts) beyond the generic benefits described above. The little test app I wrote doesn't require localhost; I'll run it on my network to see how it affects things. |
These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at? |
I'm spending lots of my 6.0 dev time reducing this 😄. Just don't stash stuff on the types, rent them just in time. Though for reads that's difficult. I suggest you employ the zero-byte read tactic as well. I haven't looked at YARP memory usage as yet. |
I agree; I basically stole them from your code :) In general I think the scenarios you implemented are fine. My concerns are more about how the code runs. Specifically: We have a bunch of server perf tests that run in crank using WRK, and we have a bunch of infrastructure around running these automatically and getting various measurements from them etc. That's why I suggested doing something similar here. |
Oh, I completely agree here. You have a good list of problems. I'd like to do better here, but I'm not sure how.
This is where I start to disagree. Gathered writes help, but I'm not sure they really help all that much. I think it makes a few specific scenarios somewhat better. Scenarios like the one discussed above -- chunked encoding with large chunks and no SSL. It doesn't help with Content-Length encoding, it doesn't help with SSL, it doesn't help with HTTP2 or HTTP3, etc. That said, it would be good to get better data about all of these scenarios so we can quantify the potential benefits here. Ultimately this comes down to a prioritization exercise. How much benefit vs how much cost? |
Hrm that's strange; it maxes out a core 100% on mine. Also it can run over network just fine; you can call it with IP to bind to. |
Sure it does; you have one potential I/O for headers and one for content? |
Yeah, you are right, of course. |
It maxes out two cores on mine (out of 16) -- one is hard pegged at 100%, the other bounces off it. Other CPUs show significant usage but aren't maxed out. Overall CPU usage is only about 33%. The client and server processes seem to account for something like 10-12%, and are presumably what's causing those two CPUs to be maxed out or close. That means something like 2/3rds of the CPU usage in this test is actually in kernel mode -- which is good, since the test code isn't doing that much work. (Data is just a hot read of Task Manager so take it with a grain of salt.) But this points out another reason why we should be maxing total CPU and measuring throughput -- that will better account for all the work happening on other threads in the kernel. |
Yea, we need to run this with crank. |
I'm hacking something together to run in crank -- been meaning to learn more about crank anyway... |
FYI, my ScatterGatherServer code is here: https://github.com/geoffkizer/netperf/tree/master/ScatterGatherServer Feedback welcome. I'm working with @sebastienros to get perf results using this. This implements three variations of sending a Content-Length encoded response of a configurable length:
|
FWIW, on Windows there is https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfilescatter. We are most likely going to implement that as part of #24847 as
So, #24847 is defining a version of scatter/gather for offset-based file IO. We should follow the same pattern for Stream (and Socket as well). |
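For context, a sketch of how that offset-based scatter read is consumed (assuming the shape that shipped as System.IO.RandomAccess; file name illustrative):

```csharp
using System;
using System.IO;
using Microsoft.Win32.SafeHandles;

using SafeFileHandle handle = File.OpenHandle("data.bin");

Memory<byte> first = new byte[512];
Memory<byte> second = new byte[4096];

// One call fills both buffers in order, starting at the given offset.
long bytesRead = RandomAccess.Read(handle, new[] { first, second }, fileOffset: 0);
Console.WriteLine($"Read {bytesRead} bytes across 2 buffers.");
```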
To match the pattern approved in #24847, I think these APIs would look like this:

```csharp
public virtual long Read(IReadOnlyList<Memory<byte>> buffers);
public virtual void Write(IReadOnlyList<ReadOnlyMemory<byte>> buffers);
public virtual ValueTask<long> ReadAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default);
public virtual ValueTask WriteAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default);
```

The only change here from above is returning long from the read methods. |
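A minimal sketch (not the actual implementation) of the base-class fallbacks this implies, written as members inside Stream: loop over the existing single-buffer overloads so derived streams work without overriding anything.

```csharp
public virtual async ValueTask WriteAsync(
    IReadOnlyList<ReadOnlyMemory<byte>> buffers,
    CancellationToken cancellationToken = default)
{
    for (int i = 0; i < buffers.Count; i++)
    {
        // One write per buffer; a stream like NetworkStream would
        // override this to issue a single gathered I/O instead.
        await WriteAsync(buffers[i], cancellationToken).ConfigureAwait(false);
    }
}

public virtual async ValueTask<long> ReadAsync(
    IReadOnlyList<Memory<byte>> buffers,
    CancellationToken cancellationToken = default)
{
    long total = 0;
    for (int i = 0; i < buffers.Count; i++)
    {
        int read = await ReadAsync(buffers[i], cancellationToken).ConfigureAwait(false);
        total += read;
        if (read < buffers[i].Length)
            break; // Short read: return what we have rather than block again.
    }
    return total;
}
```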
@geoffkizer how about using the Gather and Scatter prefixes in method names? |
I don't have a strong opinion either way... Socket has some existing scatter/gather support on Send and Receive; it doesn't use the terms Scatter/Gather as part of the API there. But we don't need to be consistent with that. |
I'm very much against that. These should be overloads of Read/Write{Async}. These are just different representations of data, just as are span, array, and memory. If it makes sense for a given stream, it's perfectly valid for a stream to implement the existing APIs using OS scatter/gather functionality, e.g. splitting a span into multiple pieces. Conversely, it's valid for these new overloads to be implemented with a single buffer, e.g. copying all write buffers to a single rented array and writing that out. These really are just overloads with different representations of the data. On top of that, there's no precedent for the scatter/gather terminology in our libraries, and in fact there's precedent for not using it (e.g. Socket). I don't want to force all developers using Stream to learn this terminology, which adds no benefit, just extra mental complexity. |
The Socket class has this already, in the form of Send/Receive operations that take an IList<ArraySegment<byte>>. This isn't exposed up through Stream classes, however.

One specific scenario where this would be useful is writing chunked-encoded request bodies in SocketsHttpHandler. When the user calls the request Stream's Write[Async] method, we need to first write a chunk header with the chunk length followed by \r\n, then write the chunk contents from the user's buffer to the underlying stream (which is either SslStream or NetworkStream). This requires us to either do two writes (which means two kernel calls and means the chunk header will likely be sent in a separate packet), or to copy into a local buffer so we can do a single write. It would be more performant (and simpler) to do a gather write where we pass two buffers, the first containing the chunk header, the second containing the chunk data, passed directly through.
There are various examples in HTTP2 where scatter/gather IO would be useful as well. (Mostly gather, since gather helps deal with the "framing" scenario above. Scatter reads are less useful since you don't control the framing yourself.)
Proposed API:
The base Stream class would provide default implementations on top of existing Read/Write APIs.