Support IList<ReadOnlyMemory<byte>> SocketAsyncEventArgs #49941
Tagging subscribers to this area: @GrabYourPitchforks
|
Tagging subscribers to this area: @dotnet/ncl
|
I agree, this would be a good thing to do. Main difficulty is what you highlighted above:
The other consideration here is, we want this to work seamlessly with NetworkStream, so whatever way we choose to expose this in SAEA and other socket APIs will effectively commit us to the same approach in Stream etc. So we need to design this in concert with #25344. Also note, ReadOnlyMemory only covers send cases; presumably we want to support receive here as well. Which complicates things further. |
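(To make that coupling concrete: a purely illustrative sketch of what a matching Stream surface could look like, written in the same API-shape style as the proposals later in this thread; none of these members exist or are proposed here.)

namespace System.IO
{
    public class Stream
    {
        // Hypothetical gathered-write counterpart; the element type and list interface would
        // have to match whatever SAEA/Socket end up exposing.
        public ValueTask WriteAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default);

        // A scattered-read counterpart needs writable memory, which is why ReadOnlyMemory<byte>
        // alone does not cover the receive side.
        public ValueTask<int> ReadAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default);
    }
}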
All that said: Can we improve the cost of MemoryMarshal.TryGetArray? That seems really high for what should be a very common operation. |
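(For context, this is roughly the conversion being measured; a minimal sketch assuming a Kestrel-style ReadOnlySequence<byte> payload, with a made-up helper name.)

using System;
using System.Buffers;
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class BufferListConversion
{
    public static List<ArraySegment<byte>> ToBufferList(in ReadOnlySequence<byte> buffer)
    {
        var segments = new List<ArraySegment<byte>>();
        foreach (ReadOnlyMemory<byte> memory in buffer)
        {
            // SocketAsyncEventArgs.BufferList only accepts ArraySegment<byte>, so every
            // Memory<byte> has to be unwrapped back to its underlying array, even when the
            // memory is already pinned.
            if (!MemoryMarshal.TryGetArray(memory, out ArraySegment<byte> segment))
            {
                throw new InvalidOperationException("Buffer is not backed by an array.");
            }
            segments.Add(segment);
        }
        return segments;
    }
}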
BTW this all came up because I'm looking at this dotnet/aspnetcore#31110
Are you saying we should use the same types on both? I agree. I wouldn't gate one on the other, but they can be designed together.
Good point. I think multi-buffer sends are 90% and multi-buffer receives are 10% (I pulled those numbers out of thin air, btw).
No idea. @GrabYourPitchforks? The other issue is that Kestrel uses 4K buffers, so we end up doing this lots of times on individual 4K chunks. |
BTW there's a bunch of other overhead I hit as well. I filed the issue for the easy one so far. |
Yes. E.g. is it IList or IReadOnlyList? We need to use exactly the same types here across Socket/Stream or we could accidentally cause ourselves a lot of pain.
Yeah, we could choose to implement this in parts in different ways, we just need to feel confident that the design works in all cases. |
Please file it all :) |
Or ReadOnlySequence<byte>? |
The problem with ReadOnlySequence<byte> is that we'd end up forcing creation of a linked list. |
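(A minimal sketch of what "forcing creation of a linked list" means in practice: building a multi-segment ReadOnlySequence<byte> requires allocating and chaining ReadOnlySequenceSegment<byte> nodes. Helper names are made up.)

using System;
using System.Buffers;
using System.Collections.Generic;

static class SequenceBuilder
{
    // Assumes at least one buffer; each buffer becomes one ReadOnlySequenceSegment<byte> node.
    public static ReadOnlySequence<byte> ToSequence(IReadOnlyList<ReadOnlyMemory<byte>> buffers)
    {
        var first = new BufferSegment(buffers[0]);
        var last = first;
        for (int i = 1; i < buffers.Count; i++)
        {
            last = last.Append(buffers[i]); // one node allocation per buffer
        }
        return new ReadOnlySequence<byte>(first, 0, last, last.Memory.Length);
    }

    private sealed class BufferSegment : ReadOnlySequenceSegment<byte>
    {
        public BufferSegment(ReadOnlyMemory<byte> memory) => Memory = memory;

        public BufferSegment Append(ReadOnlyMemory<byte> memory)
        {
            var next = new BufferSegment(memory) { RunningIndex = RunningIndex + Memory.Length };
            Next = next;
            return next;
        }
    }
}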
Not sure why? Change runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEventArgs.cs, lines 160 to 168 (at 1ee59da). Can then enumerate the ReadOnlySequence<byte> there directly. |
I meant to create the ReadOnlySequence in the first place. |
True; but might already have one, like if you were forwarding upstream? (like your usage)

public async Task InvokeAsync(HttpContext context)
{
    var reader = context.Request.BodyReader;
    var writer = _upstream.Output;
    while (true)
    {
        var readResult = await reader.ReadAsync(default);
        var buffer = readResult.Buffer;
        var isCompleted = readResult.IsCompleted;
        if (buffer.IsEmpty && isCompleted)
        {
            return;
        }
        await writer.WriteAsync(buffer);
        reader.AdvanceTo(buffer.End);
    }
}

More meaning your usage example could also be an API; then you don't need to create two lists, since the first thing the BufferList setter does is copy into the internal list anyway:

private void SetBufferList(in ReadOnlySequence<byte> buffer)
{
if (_bufferListInternal == null)
{
_bufferListInternal = new List<ReadOnlyMemory<byte>>();
}
else
{
_bufferListInternal.Clear();
}
foreach (ReadOnlyMemory<byte> b in buffer)
{
_bufferListInternal.Add(b);
}
} |
I wouldn't mind adding directly to the buffer list. The setting API is janky and requires me to manage my own list. That would be a decent middle ground. |
We need a receive variant too, as well as
@antonfirsov has ideas to make |
This is interesting, because I'd expect working with pre-pinned memory to be cheap here. What is the overhead of creating a GCHandle to a pre-pinned array? I'd expect it to be close to a no-op. Maybe this is something that can be optimized. |
Allocating a GCHandle isn't a no-op, and we shouldn't be doing it. The stack should be rewritten on top of |
I'm not proposing we use |
The memory we create doesn't have a GC handle since it's using the pinned object heap. This code is allocating a GC handle for the underlying array when it doesn't need to. On top of that, we need to unpack the Memory back into an ArraySegment for each buffer in the linked list chain before setting the result on the SAEA. There are all sorts of inefficiencies around multi-buffer sends; this is one of the lower-hanging fruit 😄. |
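(A sketch of the pattern being described, assuming Kestrel-style pinned-heap buffers; the variable names are illustrative.)

using System;
using System.Runtime.InteropServices;

// The array lives on the pinned object heap, so it never moves and should not need a GCHandle.
byte[] pinnedArray = GC.AllocateArray<byte>(4096, pinned: true);
Memory<byte> memory = MemoryMarshal.CreateFromPinnedArray(pinnedArray, 0, pinnedArray.Length);

// Handing this Memory to BufferList still round-trips through ArraySegment<byte>, and (per the
// comment above) the multi-buffer path then pins the underlying array again via GCHandle.
bool isArrayBacked = MemoryMarshal.TryGetArray<byte>(memory, out ArraySegment<byte> segment);
Console.WriteLine($"Array backed: {isArrayBacked}, offset {segment.Offset}, count {segment.Count}");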
Yeah, it's not free, and it'll be more expensive the more contended it is. Creating a GCHandle entails synchronization (best case an interlocked operation, worst case a lock). It doesn't really matter what the type of the handle is, nor whether the array is already pinned or not; it's approximately the same operation and cost. Below is an example, all serialized so no contention.
using System;
using System.Runtime.InteropServices;
using BenchmarkDotNet.Attributes;

public class GCHandleBenchmarks
{
    private byte[] _normal = new byte[1024];
    private byte[] _pinned = GC.AllocateArray<byte>(1024, pinned: true);

    [Benchmark]
    [Arguments(GCHandleType.Normal)]
    [Arguments(GCHandleType.Pinned)]
    [Arguments(GCHandleType.Weak)]
    [Arguments(GCHandleType.WeakTrackResurrection)]
    public void Normal(GCHandleType type) => GCHandle.Alloc(_normal, type).Free();

    [Benchmark]
    [Arguments(GCHandleType.Normal)]
    [Arguments(GCHandleType.Pinned)]
    [Arguments(GCHandleType.Weak)]
    [Arguments(GCHandleType.WeakTrackResurrection)]
    public void Pinned(GCHandleType type) => GCHandle.Alloc(_pinned, type).Free();
}
|
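(For completeness, the snippet above can be run with BenchmarkDotNet's standard runner, assuming it is wrapped in a class named GCHandleBenchmarks as shown.)

using BenchmarkDotNet.Running;

BenchmarkRunner.Run<GCHandleBenchmarks>();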
Right, we create the Memory with MemoryMarshal.CreateFromPinnedArray to avoid GCHandle overhead and, as it turns out, we only avoid it for single-buffer writes. PS: I'm discovering these inefficiencies because I've been looking at dotnet/aspnetcore#31110 and some other scenarios around reducing memory. |
I didn't mean to wrap the ... I proposed those overloads in #48477 (comment), copying them here for better visibility:

public class Socket
{
    public ValueTask SendAsync(IReadOnlyList<ReadOnlyMemory<byte>> buffers, CancellationToken cancellationToken = default);
    public ValueTask<int> ReceiveAsync(IReadOnlyList<Memory<byte>> buffers, CancellationToken cancellationToken = default);
}

@geoffkizer had a valid concern in a comment above (whether it is IList or IReadOnlyList, and that exactly the same types need to be used across Socket/Stream).

To provide a better experience for callers with a ROS, we may also add a convenience overload for ROS, which wraps the list-based overload:

public ValueTask SendAsync(ReadOnlySequence<byte> buffers, CancellationToken cancellationToken = default);

@davidfowl would you be able to consume this stuff instead of SAEA? This would have some (but hopefully low) overhead, I think definitely much lower than the ... |
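(Purely hypothetical usage of the proposed overload above; Socket.SendAsync(IReadOnlyList<ReadOnlyMemory<byte>>, CancellationToken) does not exist today, this only illustrates how a caller holding a ReadOnlySequence<byte> might consume it.)

using System;
using System.Buffers;
using System.Collections.Generic;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

static class ProposedOverloadUsage
{
    public static async ValueTask SendSequenceAsync(
        Socket socket,
        ReadOnlySequence<byte> payload,
        List<ReadOnlyMemory<byte>> reusableBuffers,
        CancellationToken cancellationToken)
    {
        reusableBuffers.Clear();
        foreach (ReadOnlyMemory<byte> segment in payload)
        {
            reusableBuffers.Add(segment); // no MemoryMarshal.TryGetArray / ArraySegment round-trip
        }

        // Proposed overload, not part of the current Socket API surface.
        await socket.SendAsync(reusableBuffers, cancellationToken);
    }
}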
We can stack alloc a decent amount; I think we stack alloc up to 8 today -- @tmds would know for sure. And we could increase this (somewhat) if it makes a difference for real scenarios. But regardless, we would still be using a cached SAEA instance under the covers, and that would reuse cached WSABUFs etc. It's just that we wouldn't expose Memory support on SAEA publicly, it would be an internal-only thing. In other words I don't think there's an issue here. |
OK then I prefer this approach as well. I thought the goal was to avoid SAEA (which is something I want to do for reads BTW). |
Why is that? |
BTW, how many buffers would you be using here typically? I would be careful about approaches that often require using a lot of buffers for scatter/gather -- e.g. >4 to be conservative, or maybe >8. I'm not sure the kernels here do a great job in this case. |
SAEA is an implementation detail for the socket methods, a detail we can change over time. Right now it stores the kitchen sink: if we cache one for use in single-buffer ReadAsync calls, we're still paying for all the state associated with every other operation it can support. My ideal over time is that we evolve to have some smaller internal data structure(s) per operation on top of which we can build both the socket operations and SAEA, which at that point would be purely legacy / for compat. Waving my hands. |
We use 4K buffers today so 8 buffers would be 32K. Large responses are sometimes more than this so it can get up there in the number of buffers (256 for 1MB). I'm playing around with different buffer sizes and dynamic buffer sizes but that's not likely to be something that we commit to anytime soon. Right now, I'm trying to reduce the overhead at all of the layers as much as possible.
Yes, I know this, but we don't use the Task methods; they add overhead we don't need (and no, I haven't measured 😄), and I don't see the value in adding overhead, especially at the networking layer. We'll kinda do insane things here to avoid it. Right now, for ReadAsync we end up holding our derived SAEA for the lifetime of the connection. |
While it's true that the Task APIs are basically just wrappers over public SAEA today, that will likely change over time, as @stephentoub pointed out. The goal would be to reduce memory usage etc in the Task APIs in a way that is difficult to do via public SAEA. In other words, in the future you'll probably want to use Task APIs anyway. |
It's fine to change them but right now the Task APIs don't let us control the scheduling which we care about (actually I'd prefer that directly in the socket APIs themselves). But I'm less concerned about sends than receives. We've already changed the send side to pool SAEA so we have a very small number of those now, but receives mean we need one per connection. I'd like to see how we achieve lower memory usage with the Task APIs directly, it'd need to pool something across instances to get the same low footprint. |
What scheduling does SAEA support that the Socket APIs don't? |
On Unix we stackalloc up to 8 for this reason: runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs, lines 21 to 23 (at 400311b).
|
I misspoke, it isn't the scheduling, it's just the unnecessary overhead in our scenario:
|
To get this issue back on topic, because I'd like to see a proposal that we all agree with:
The problem I have with the SendAsync API is that it will inevitably have to heap-allocate WSABUFs/IOVectors for buffer counts over a certain size (8?). If those heap-allocated arrays are rented from the array pool, I think that would be a big improvement. The next thing would be to pool PreAllocatedOverlapped on Windows across different socket instances. I think if we did those, I would feel better about using the SendAsync overloads. |
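(A sketch of the stackalloc-or-rent pattern being suggested; NativeBufferDescriptor is an illustrative stand-in for WSABUF/iovec, and the threshold mirrors the Unix IovStackThreshold mentioned earlier.)

using System;
using System.Buffers;

struct NativeBufferDescriptor
{
    public IntPtr Pointer;
    public int Length;
}

static class DescriptorArrays
{
    private const int StackThreshold = 8;

    public static void Send(ReadOnlySpan<ReadOnlyMemory<byte>> buffers)
    {
        NativeBufferDescriptor[]? rented = null;
        // Small buffer counts use the stack; larger counts rent from the shared pool instead of
        // allocating a fresh descriptor array per send.
        Span<NativeBufferDescriptor> descriptors = buffers.Length <= StackThreshold
            ? stackalloc NativeBufferDescriptor[StackThreshold]
            : (rented = ArrayPool<NativeBufferDescriptor>.Shared.Rent(buffers.Length));

        try
        {
            descriptors = descriptors.Slice(0, buffers.Length);
            // ... fill each descriptor from the corresponding (pinned) buffer and make the syscall ...
        }
        finally
        {
            if (rented != null)
            {
                ArrayPool<NativeBufferDescriptor>.Shared.Return(rented);
            }
        }
    }
}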
I wonder if it would make more sense to just allow you to reuse Socket instances, which would mean you'd reuse the cached SAEAs and PreallocatedOverlapped and other things too. |
I guess this would mean designing new |
Triage: We should keep discussing final design for post-6.0. It is reasonable for .NET 7. |
Would you consider using the same API already used for QUIC?
This would make it simpler to support TCP and QUIC with the same code.
Now that we have inline arrays, I wonder if these should be span-based APIs.
Which specifically? Async and span don't mix well. |
It wouldn't work for SocketAsyncEventArgs, but IIRC the underlying methods don't need to pin the array passed in; they are usually copied. I know the lowest API we have is SAEA, so this approach doesn't flow through Socket.SendAsync in a clean way…
I'm still not understanding. Can you sketch what it is you're envisioning? If the operation can't complete immediately and fully, such that the operation needs to pend and try again later, the relevant data or buffers still need to be available. Do you just mean an API like |
I was thinking about how we end up calling these APIs:
The buffers themselves are pinned for the lifetime of the async operation but the holder array just needs to be pinned for the length of the synchronous call.
Yes, exactly. We would flow this span all the way down to the OS API calls, then translate it into the relevant Span<WSABuffer/IOVector>. This doesn't gel with the current SAEA BufferList property, and I haven't thought through how to flow it from the user code down to the OS API. Contrived example:

using System.Net;
using System.Net.Sockets;
using System.Runtime.CompilerServices;
var socket = new Socket(SocketType.Stream, ProtocolType.Tcp);
socket.Bind(new IPEndPoint(IPAddress.Loopback, 5000));
socket.Listen();
var client = socket.Accept();
var buffers = new Buffers();
buffers[0] = GetHeaders();
buffers[1] = GetBody();
await client.SendAsync(buffers);
ReadOnlyMemory<byte> GetHeaders() => default;
ReadOnlyMemory<byte> GetBody() => default;
[InlineArray(2)]
struct Buffers
{
private ReadOnlyMemory<byte> _buffers;
}
static class Extensions
{
public static ValueTask<int> SendAsync(this Socket socket, ReadOnlySpan<ReadOnlyMemory<byte>> buffers) => default;
} |
Just to be extra clear, though, it's not just about pinning. The list of buffers needs to be available asynchronously from the current stack if the operation doesn't complete synchronously, which means copying it somewhere. |
Are you talking about the Linux case where we queue the operation? Or do you mean in both cases? |
I'm talking about the non-Windows case where if the native call on the non-blocking socket returns EAGAIN / EWOULDBLOCK, we wait on an epoll / kqueue and try again when notified that we can make forward progress with the operation. |
I see, yes that makes sense. It would force us to copy the array in that case. |
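(A sketch of that implication, under the assumption of a span-based SendAsync like the contrived example above; TryNativeSend and QueueRetryAsync are placeholders, not real runtime APIs.)

using System;
using System.Buffers;
using System.Threading.Tasks;

static class PendingSendSketch
{
    public static ValueTask<int> SendAsync(ReadOnlySpan<ReadOnlyMemory<byte>> buffers)
    {
        // Fast path: the non-blocking send completed immediately, nothing outlives this call,
        // so no copy of the buffer list is needed.
        if (TryNativeSend(buffers, out int bytesSent))
        {
            return new ValueTask<int>(bytesSent);
        }

        // Slow path (EAGAIN / EWOULDBLOCK): the stack is about to unwind, so the buffer list
        // must be copied somewhere that survives until the epoll/kqueue notification fires.
        ReadOnlyMemory<byte>[] copy = ArrayPool<ReadOnlyMemory<byte>>.Shared.Rent(buffers.Length);
        buffers.CopyTo(copy);
        return QueueRetryAsync(copy, buffers.Length);
    }

    private static bool TryNativeSend(ReadOnlySpan<ReadOnlyMemory<byte>> buffers, out int bytesSent)
    {
        bytesSent = 0;
        return true; // placeholder
    }

    private static ValueTask<int> QueueRetryAsync(ReadOnlyMemory<byte>[] buffers, int count) => default; // placeholder
}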
Background and Motivation
Doing some profiling writing large payloads via Kestrel, about 1.2% of the overall time is spent converting from ReadOnlySequence<byte> to a List<ArraySegment<byte>> for the eventual call to the underlying socket API that takes an array of pointers to buffers. Most of the time is spent calling into MemoryMarshal.TryGetArray to get an ArraySegment<byte> from the Memory<byte>, most of which is a pointless conversion (which mostly defeats the purpose of us passing a pre-pinned memory handle to the networking stack).

Proposed API
namespace System.Net.Sockets
{
    public class SocketAsyncEventArgs
    {
+       public IList<ReadOnlyMemory<byte>>? MemoryList { get; set; }
    }
}
Usage Examples

private void SetBufferList(in ReadOnlySequence<byte> buffer)
{
    if (_bufferList == null)
    {
        _bufferList = new List<ReadOnlyMemory<byte>>();
    }

    foreach (ReadOnlyMemory<byte> b in buffer)
    {
        _bufferList.Add(b);
    }

    // The act of setting this list sets the buffers in the internal buffer list
    MemoryList = _bufferList;
}
Risks
More ways of doing the same thing. When to choose MemoryList vs BufferList. Do they overwrite each other? etc.