Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding Scatter/Gather IO on Stream #25344

Open
geoffkizer opened this issue Mar 7, 2018 · 62 comments
Open

Consider adding Scatter/Gather IO on Stream #25344

geoffkizer opened this issue Mar 7, 2018 · 62 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO
Milestone

Comments

@geoffkizer
Copy link
Contributor

The Socket class has this already, in the form of Send/Receive operations that take an IList<ArraySegment<byte>>. This isn't exposed up through Stream classes, however.

One specific scenario where this would be useful is writing chunked-encoded request bodies in SocketsHttpHandler. When the user calls the request Stream's Write[Async] method, we need to first write a chunk header with the chunk length followed by \r\n, then write the chunk contents from the user's buffer to the underlying stream (which is either SslStream or NetworkStream). This requires us to either do two writes (which means two kernel calls and means the chunk header will likely be sent in a separate packet), or to copy into a local buffer so we can do a single write. It would be more performant (and simpler) to do a gather write where we pass two buffers, the first containing the chunk header, the second containing the chunk data, passed directly through.

There are various examples in HTTP2 where scatter/gather IO would be useful as well. (Mostly gather, since gather helps deal with the "framing" scenario above. Scatter reads are less useful since you don't control the framing yourself.)

Proposed API:

        public virtual int Read(IReadOnlyList<Memory<byte>> buffers);
        public virtual void Write(IReadOnlyList<ReadOnlyMemory<byte>> buffers);
        public virtual ValueTask<int> ReadAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default);
        public virtual ValueTask WriteAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default);

The base Stream class would provide default implementations on top of existing Read/Write APIs.

@geoffkizer
Copy link
Contributor Author

cc @stephentoub

@geoffkizer
Copy link
Contributor Author

cc @davidfowl -- I can imagine this might be interesting for you as well.

@davidfowl
Copy link
Member

Oh nice! Yes I like this. Should we also consider adding overloads for ReadOnlySequence<byte>?

@scalablecory
Copy link
Contributor

scalablecory commented Sep 19, 2019

This could work for Sockets cross-plat, and on Linux/Mac for file I/O. On Windows we could emulate it trivially for files. There would be a long period of writing optimized methods for SslStream, compression, etc., but that could be done piecemeal.

Other considerations:

  • We want to use IReadOnlyList<ReadOnlyMemory> for efficiency, but might consider adding convenience methods for ReadOnlySequence?
  • Worth adding a non-readonly Sequence for symmetry?

@scalablecory
Copy link
Contributor

We may need these for efficient HTTP/3 if QUIC implementation doesn't implement one of nagle/cork. We should coordinate QuicStream additions with this proposal. CC @jkotalik

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@scalablecory
Copy link
Contributor

scalablecory commented Feb 19, 2020

Following up re: HTTP/3. We ended up implementing this in our QuicStream, as it does not support nagle/cork. We used ROM<ROM<byte>> but IReadOnlyList<ROM<byte>> would work just as well.

It allows us to gather headers, framing, and user data in a single I/O:

private async ValueTask WriteRequestContentAsync(ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken)
{
if (buffer.Length == 0)
{
return;
}
long remaining = _requestContentLengthRemaining;
if (remaining != -1)
{
// This HttpContent had a precomputed length, and a DATA frame was written as part of the headers. We can write directly without framing.
if (buffer.Length > _requestContentLengthRemaining)
{
string msg = SR.net_http_content_write_larger_than_content_length;
throw new IOException(msg, new HttpRequestException(msg));
}
_requestContentLengthRemaining -= buffer.Length;
if (_sendBuffer.ActiveLength != 0)
{
// We haven't sent out headers yet, so write them together with the user's content buffer.
// Because we have a Content-Length, we can write it in a single DATA frame.
BufferFrameEnvelope(Http3FrameType.Data, remaining);
_gatheredSendBuffer[0] = _sendBuffer.ActiveMemory;
_gatheredSendBuffer[1] = buffer;
await _stream.WriteAsync(_gatheredSendBuffer, cancellationToken).ConfigureAwait(false);
_sendBuffer.Discard(_sendBuffer.ActiveLength);
}
else
{
// Headers already sent, send just the content buffer directly.
await _stream.WriteAsync(buffer, cancellationToken).ConfigureAwait(false);
}
}
else
{
// Variable-length content: write both a DATA frame and buffer. (and headers, which will still be in _sendBuffer if this is the first content write)
// It's up to the HttpContent to give us sufficiently large writes to avoid excessively small DATA frames.
BufferFrameEnvelope(Http3FrameType.Data, buffer.Length);
_gatheredSendBuffer[0] = _sendBuffer.ActiveMemory;
_gatheredSendBuffer[1] = buffer;
await _stream.WriteAsync(_gatheredSendBuffer, cancellationToken).ConfigureAwait(false);
_sendBuffer.Discard(_sendBuffer.ActiveLength);
}
}

We would use this in HTTP/1 and HTTP/2 in a similar way.

@maryamariyan maryamariyan added the untriaged New issue has not been triaged by the area owner label Feb 23, 2020
@JeremyKuhne JeremyKuhne removed the untriaged New issue has not been triaged by the area owner label Mar 3, 2020
@pepone
Copy link
Contributor

pepone commented Apr 13, 2020

Having gathered write will allow to optimize the SSL Record size in SslStream, currently if you have several buffers and use separate Write calls it will create separate records adding more overhead

See https://hpbn.co/transport-layer-security-tls/#optimize-tls-record-size

@geoffkizer
Copy link
Contributor Author

We have defined some gather write operations for QUIC on the QuicStream class. Ideally, these would not be QUIC-specific; i.e. we'd add them to Stream and then implement them in QuicStream.

@scalablecory
Copy link
Contributor

@carlossanlop any questions here or can we move it to API review?

@stephentoub
Copy link
Member

I would like to see a prototype of how these will be consumed (both read and write) and perf numbers to demonstrate its meaningful enough to warrant adding more virtuals to Sream. Where are we going to use the scattered read methods? How much does the gathered write support improve http/2 throughput? How are callers going to avoid allocating IReadOnlyList instances?

I'm not against adding these. But we should take any new virtuals on Stream seriously and do our due diligence.

@davidfowl
Copy link
Member

How are callers going to avoid allocating IReadOnlyList instances?

This in particular is something I was worried about. Also kestrel would use this for writes but not reads.

@tmds 's custom transport also wanted to use these.

Why not something directly on socket rather than virtuals ok stream?

@geoffkizer
Copy link
Contributor Author

How are callers going to avoid allocating IReadOnlyList instances?

The idea is you use the same List/array for every send/receive. So you only need one allocation per socket, which doesn't seem like an issue.

Why not something directly on socket rather than virtuals ok stream?

Socket already supports this. https://github.com/dotnet/runtime/blob/master/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketTaskExtensions.cs#L38

The point of this would be to support it in non-Socket cases, like SslStream or QuicStream.

@geoffkizer
Copy link
Contributor Author

How much does the gathered write support improve http/2 throughput?

This is a hard question to answer authoritatively, because there's always a way to avoid using scatter/gather by simply coalescing buffers. That said, coalescing has some cost itself, which can vary based on a variety of factors; and also it's a bit of a pain to do.

@scalablecory said above:

It allows us to gather headers, framing, and user data in a single I/O [for HTTP3]:

And that's a nice convenience, but it's not clear to me that gathered writes on QuicStream are necessary to achieve this. We already achieve this in HTTP1.1/HTTP2 by managing our buffers to specifically enable this.

@scalablecory
Copy link
Contributor

scalablecory commented Oct 14, 2020

I prototyped this in HTTP/1 and I saw a measurable but very small impact. I don't remember the exact numbers but it was less than 1-2%.

With HTTP/3, I use it to avoid buffering every data frame when content length isn't known, just to prepend a handful of bytes. I haven't bothered to measure in this case because it was such an obvious and easy win.

@geoffkizer
Copy link
Contributor Author

I prototyped this in HTTP/1

For chunked encoding, I'm guessing?

It does seem like chunked encoding and HTTP3 framing when length is unknown are pretty much ideal use cases for this. It avoids a potentially large buffer copy.

@scalablecory
Copy link
Contributor

scalablecory commented Oct 14, 2020

I prototyped this in HTTP/1

For chunked encoding, I'm guessing?

I believe I was also deferring sending headers buffer until sending content when possible, to send them out in one go. There were a few opportunities.

(With nodelay on, this can result in 1 packet instead of 2 for some requests)

@tmds
Copy link
Member

tmds commented Oct 15, 2020

The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs.

When implementing a protocol using System.IO.Pipelines, you anyway end up copying because there is no way to pass Memory from one Pipe to another.

@stephentoub
Copy link
Member

The biggest cost is in the number of syscalls. Copying data is cheap compared to syscalls. So coalescing in userspace is probably going to give you most of the gain, without introducing dedicated APIs.

This is exactly why I was looking for concrete perf data.

Socket already supports this

To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types), so either it would need to access new internal methods, or Socket would need new public ones.

@geoffkizer
Copy link
Contributor Author

To do this well, NetworkStream wouldn't be able to use those existing APIs (mismatch of input and output types)

True. It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way.

@scalablecory
Copy link
Contributor

It would be nice to add Memory overloads to Socket for these cases anyway. But it does mean more work to wire this up all the way.

The Windows path is actually fairly easy (I did a minimal version of this for my prototyping), but the Linux PAL is scary huge.

@geoffkizer
Copy link
Contributor Author

There's a lot of code in the Linux PAL stuff, but it should be relatively straightforward.

@scalablecory
Copy link
Contributor

Confirmed this morning that we can implement this in our SChannel code. This would give us more efficient TLS, as each SslStream.Write() needs to write not just data but an envelope.

@stephentoub
Copy link
Member

Numbers?

@geoffkizer
Copy link
Contributor Author

I'm trying to figure out what conclusions to draw from this data.

An obvious one is, fewer IOs are better.

But when performing the same # of IOs, buffering seems faster than gathering (either buffered + gathered or unbuffered + gathered).

Doesn't this imply we should just do buffering?

@geoffkizer
Copy link
Contributor Author

BTW, I ran the "two HttpContent Writes" scenario with an 8K write buffer (instead of 4K). Here's what I got:

I/O strategy Number of I/Os Speed
Buffered 1 114,048
Buffered + Gathered 1 104,672
Unbuffered + Gathered 2 57,353
Unbuffered 7 18,398

Again, # of IOs is most important; again, Buffered beats both Gathered strategies.

What's also interesting here, though, is that Unbuffered + Gathered suffers hugely because it now has to perform two separate writes. And that's because gathering does not allow coalescing across user writes, whereas buffering does.

@davidfowl
Copy link
Member

davidfowl commented Mar 29, 2021

You don't want to buffer at all of the layers because that's wasted CPU cycles copying. Especially for larger buffers, it's wasteful. The APIs we expose today require the caller to provide the buffer, so we need to do something with it, and the options are to copy or to write directly (when there's no transform of the data itself required). I can see 2 approaches:

  • Copy when the provided data is small (below some threshold)
  • Do multiple writes (or a single multi buffer write) when the data is big (above some threshold)

As an example of this, consider a chunked encoding write implementation. WriteAsync wants to wrap the user provided data with a header and a footer:

  • hex length
  • data
  • \r\n

If data is small enough (say < 2K), then it makes sense to copy that into an internal buffer before sending. But lets say the user wrote 65K (large file transfer or something). Then doing multiple writes or a single write with multiple buffers likely make more sense.

BTW, the multi-buffer writes is just a way of pushing that eventual copy further down the stack. It doesn't stop the eventual copy, just the intermediate ones.

@davidfowl
Copy link
Member

davidfowl commented Mar 29, 2021

@geoffkizer with those numbers did you max out the CPU? You should also try it to with multiple buffer sizes to see where the tipping point is.

@geoffkizer
Copy link
Contributor Author

@geoffkizer with those numbers did you max out the CPU?

No, I'm simply taking Cory's test from above and modifying it slightly.

I do agree it would be better to do a real throughput test here and max out CPU, but that's not what the code does currently.

@davidfowl
Copy link
Member

Based on what I said above, I think there are some scenarios where this is clearly going to be a win. We need to find the right heuristic though, because doing it all the time definitely isn't.

Do we want to invest the effort here exploring this? Measuring something real means making changes to the stack and trying them out.

The scenarios that stick out to me so far is websockets and chunked encoding of large buffers. These are write scenarios.

I'm not convinced it'll help for TLS as much because the data needs to be transformed...

@geoffkizer
Copy link
Contributor Author

geoffkizer commented Mar 29, 2021

The scenarios that stick out to me so far is websockets and chunked encoding of large buffers. These are write scenarios.

I agree

I'm not convinced it'll help for TLS as much because the data needs to be transformed...

I agree

Do we want to invest the effort here exploring this?

It would certainly be good to get more data here.

Edit: In particular, we've called out that nonencrypted, chunked responses with large chunks are a scenario that could benefit.

My suggestion would be to do a real WRK (or similar) based throughput test across machines (no loopback) for this scenario.

The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways:

(1) Three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk
(2) Single buffered write, where the buffer is constructed each time (i.e. copy (a)+(b)+(c) into a single buffer and send)
(3) Single gathered write where we pass (a)/(b)/(c) as the list of gathered buffers

Then run variations with different chunk sizes.

@scalablecory
Copy link
Contributor

scalablecory commented Mar 29, 2021

Lots to unpack here 😄.

Just to keep me from getting too myopic and only thinking of HTTP, I'll start with a goal and a non-goal for this API:

Stream's lack of guarantees around efficiency make it difficult to use when you have multiple buffers you want to write.

  • To be efficient you either need intimate knowledge of the streams being used, or you are forced to do defensive buffering yourself.
  • If you do end up with your own buffering, it can be troublesome to understand what buffer size is correct. TCP MSS, SslStream's 16KB SChannel buffer, HTTP/2 framing, TLS framing, etc. -- it is very easy to increase buffer sizes until you find one that works "well", but it's very challenging to find an actually optimal size.
  • This can be compounded in layered code -- especially abstract layering -- where you might end up with a different buffer at each layer. Our own mini bufferbloat problem.
  • Gathered writes avoids all of this by giving every layer full control over the entire write rather than just a piece of it.

And the non-goal: gathering is not an alternative to buffering. It would only replace buffering on a per-Write basis, not as a general strategy. As demonstrated, it actually works in conjunction with buffering by further reducing I/O count in the scenario where you'd need to flush immediately after overflowing your buffer. It won't stop a user's HttpContent from inefficiently making a bunch of small writes to a Stream, for instance.

And specifically for HttpClient/others:

We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce.

Either way, gathering would be able to help. If we decide we don't want additional memory usage and don't buffer, it'll help. If we decide we do want to buffer, it will help us extract better perf out of that buffer, or may lead to us using a smaller buffer size than we otherwise would.

At least for SChannel, when considering gathered VS perfectly buffered, I don't think gathering will be a major win (read: reduced I/O counts) beyond the generic benefits described above.

The little test app I wrote doesn't require localhost; I'll run it on my network to see how it affects things.

@scalablecory
Copy link
Contributor

The test server would be a simple socket-based server that reads the request up to \r\n\r\n, then sends the response, in each of the following ways:

(1) Three separate writes for (a) response header + chunk prefix; (b) chunk contents; (c) chunk suffix + final chunk
(2) Single buffered write, where the buffer is constructed each time (i.e. copy (a)+(b)+(c) into a single buffer and send)
(3) Single gathered write where we pass (a)/(b)/(c) as the list of gathered buffers

These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at?

@davidfowl
Copy link
Member

We should decide if we want to introduce buffering. Buffering has some costs, of course -- the one I'm most interested in understanding is the extra per-connection memory usage we'd introduce.

I'm spending lots of my 6.0 dev time reducing this 😄. Just don't stash stuff on the types, rent them just in time. Though for reads that's difficult. I suggest you employ the zero-byte read tactic as well. I haven't looked at YARP memory usage as yet.

@geoffkizer
Copy link
Contributor Author

These three scenarios are tested by my code: (1) is unbuffered, (2) is buffered or buffered+gathered, (3) is unbuffered gathered -- what is it you're trying to look at?

I agree; I basically stole them from your code :) In general I think the scenarios you implemented are fine. My concerns are more about how the code runs.

Specifically:
(1) It's basically measuring linear elapsed time instead of overall throughput (i.e. maxing CPU with many concurrent requests). The latter is a better measure of what we really care about. And it can sometimes hit issues like lock contention or cache problems that you wouldn't see in a simple linear test.
(2) It reports the best run of a set of runs. I'd rather see an average reported. I saw a fair amount of variance in my runs, and I'm guessing it may have something to do with this.
(3) It would be good to run remotely instead of over loopback. The amount of work happening in these tests is pretty small overall and I'd expect the actual cost of networking IO to play an important role in overall throughput. Loopback is highly optimized to just shuffle bits around and not actually construct IP packets etc.

We have a bunch of server perf tests that run in crank using WRK, and we have a bunch of infrastructure around running these automatically and getting various measurements from them etc. That's why I suggested doing something similar here.

@geoffkizer
Copy link
Contributor Author

Stream's lack of guarantees around efficiency make it difficult to use when you have multiple buffers you want to write.

Oh, I completely agree here. You have a good list of problems. I'd like to do better here, but I'm not sure how.

Gathered writes avoids all of this by giving every layer full control over the entire write rather than just a piece of it.

This is where I start to disagree. Gathered writes help, but I'm not sure they really help all that much. I think it makes a few specific scenarios somewhat better. Scenarios like the one discussed above -- chunked encoding with large chunks and no SSL. It doesn't help with Content-Length encoding, it doesn't help with SSL, it doesn't help with HTTP2 or HTTP3, etc.

That said, it would be good to get better data about all of these scenarios so we can quantify the potential benefits here.

Ultimately this comes down to a prioritization exercise. How much benefit vs how much cost?

@scalablecory
Copy link
Contributor

(1) It's basically measuring linear elapsed time instead of overall throughput (i.e. maxing CPU with many concurrent requests). The latter is a better measure of what we really care about. And it can sometimes hit issues like lock contention or cache problems that you wouldn't see in a simple linear test.
(2) It reports the best run of a set of runs. I'd rather see an average reported. I saw a fair amount of variance in my runs, and I'm guessing it may have something to do with this.
(3) It would be good to run remotely instead of over loopback. The amount of work happening in these tests is pretty small overall and I'd expect the actual cost of networking IO to play an important role in overall throughput. Loopback is highly optimized to just shuffle bits around and not actually construct IP packets etc.

Hrm that's strange; it maxes out a core 100% on mine. Also it can run over network just fine; you can call it with IP to bind to.

@scalablecory
Copy link
Contributor

It doesn't help with Content-Length encoding

Sure it does; you have one potential I/O for headers and one for content?

@geoffkizer
Copy link
Contributor Author

Sure it does; you have one potential I/O for headers and one for content?

Yeah, you are right, of course.

@geoffkizer
Copy link
Contributor Author

Hrm that's strange; it maxes out a core 100% on mine.

It maxes out two cores on mine (out of 16) -- one is hard pegged at 100%, the other bounces off it. Other CPUs show significant usage but aren't maxed out.

Overall CPU usage is only about 33%. The client and server processes seem to account for something like 10-12%, and are presumably what's causing those two CPUs to be maxed out or close. That means something like 2/3rds of the CPU usage in this test is actually in kernel mode -- which is good, since the test code isn't doing that much work.

(Data is just a hot read of Task Manager so take it with a grain of salt.)

But this points out another reason why we should be maxing total CPU and measuring throughput -- that will better accoutn for all the work happening on other threads in the kernel.

@davidfowl
Copy link
Member

Yea, we need to run this with crank.

cc @sebastienros

@geoffkizer
Copy link
Contributor Author

I'm hacking something together to run in crank -- been meaning to learn more about crank anyway...

@geoffkizer
Copy link
Contributor Author

FYI, my ScatterGatherServer code is here: https://github.com/geoffkizer/netperf/tree/master/ScatterGatherServer

Feedback welcome. I'm working with @sebastienros to get perf results using this.

This implements three variations of sending a Content-Length encoded response of a configurable length:

  • Mode SendMultiple sends response headers and response body in separate Send calls.
  • Mode BufferSends copies the headers and body into a single buffer and does a single Send.
  • Mode GatherSends sends headers and body in a single gathered Send.

@adamsitnik
Copy link
Member

On Windows we could emulate it trivially for files.

FWIW On Windows there is ReadFileScatter and WriteFileGather. The problem is that it requires FILE_FLAG_OVERLAPPED (async file handle) and FILE_FLAG_NO_BUFFERING (not exposed by .NET yet, requires inputs to be aligned and of a certain size).

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-readfilescatter
https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefilegather

We are most likely going to implement that as part of #24847 as SafeFileHandle extension methods

@geoffkizer
Copy link
Contributor Author

So, #24847 is defining a version of scatter/gather for offset-based file IO. We should follow the same pattern for Stream (and Socket as well).

@geoffkizer
Copy link
Contributor Author

To match the pattern approved in #24847, I think these APIs would look like this:

  public virtual long Read(IReadOnlyList<Memory<byte>> buffers);
  public virtual void Write(IReadOnlyList<ReadOnlyMemory<byte>> buffers);
  public virtual ValueTask<long> ReadAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default);
  public virtual ValueTask WriteAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default);

The only change here from above is returning long instead of int for the read ops.

@adamsitnik
Copy link
Member

@geoffkizer how about using the Gather and Scatter prefixes in method names?

@geoffkizer
Copy link
Contributor Author

how about using the Gather and Scatter prefixes in method names?

I don't have a strong opinion either way...

Socket has some existing scatter/gather support on Send and Receive; it doesn't use the terms Scatter/Gather as part of the API there. But we don't need to be consistent with that.

@stephentoub
Copy link
Member

stephentoub commented May 22, 2021

how about using the Gather and Scatter prefixes in method names?

I'm very much against that. These should be overloads of Read/Write{Async}. These are just different representations of data, just as are span, array, and memory. If it makes sense for a given stream, it's perfectly valid for a stream to implement the existing apis using OS scatter/gather functionality, e.g. splitting a span into multiple pieces. Conversely, it's valid for these new overloads to be implemented with a single buffer, e.g. copying all write buffers to a single rented array and writing that out. These really are just overloads with different representations of the data. On top of that, there's no precedent for the scatter/gather terminology in our libraries, and in fact there's precedent for not using it (e.g. Socket). I don't want to force all developers using Stream to learn this terminology, which adds no benefit, just extra mental complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.IO
Projects
None yet
Development

No branches or pull requests

10 participants