win32threadpool.cpp should use GetQueuedCompletionStatusEx rather than GetQueuedCompletionStatus #11314
cc @geoffkizer, who did some experimentation here and didn't see much difference in Plaintext. On Windows all of the IO queues pull from the same port (we have a single global IO completion port), so it's sort of a single-producer, multiple-consumer model, vs libuv, where there's a unique port per loop thread polling for multiple items. I'm not sure that explains the differences we saw, though.
This also means that either a) some events may have been able to run on other threads but are now waiting to be run by the thread that dequeued them while it processes other items, or b) additional queueing costs will be incurred to queue the execution of those additional items to the worker pool.
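For concreteness, here is a minimal P/Invoke sketch (not the coreclr code) of what batched dequeueing with GetQueuedCompletionStatusEx looks like; the struct layout and signature follow the Win32 definitions, while the port handle and dispatch delegate are placeholders supplied by the caller:

using System;
using System.Runtime.InteropServices;

static class IocpBatchSketch
{
    // Layout matches the Win32 OVERLAPPED_ENTRY structure used by GetQueuedCompletionStatusEx.
    [StructLayout(LayoutKind.Sequential)]
    struct OVERLAPPED_ENTRY
    {
        public UIntPtr lpCompletionKey;
        public IntPtr lpOverlapped;
        public UIntPtr Internal;
        public uint dwNumberOfBytesTransferred;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetQueuedCompletionStatusEx(
        IntPtr completionPort,
        [Out] OVERLAPPED_ENTRY[] completionPortEntries,
        uint count,
        out uint numEntriesRemoved,
        uint milliseconds,
        bool alertable);

    const uint Infinite = 0xFFFFFFFF;

    // Drain up to entries.Length completions per kernel call instead of one per call.
    internal static void PollLoop(IntPtr port, Action<IntPtr, uint> dispatch)
    {
        var entries = new OVERLAPPED_ENTRY[64];
        while (GetQueuedCompletionStatusEx(port, entries, (uint)entries.Length,
                                           out uint removed, Infinite, alertable: false))
        {
            for (uint i = 0; i < removed; i++)
            {
                // Each entry carries the OVERLAPPED* and byte count of one completed operation.
                dispatch(entries[i].lpOverlapped, entries[i].dwNumberOfBytesTransferred);
            }
        }
        // A false return means the port was closed or an error occurred; real code would
        // check Marshal.GetLastWin32Error() here.
    }
}

Everything dequeued beyond the first entry runs into exactly the tradeoff described above: either process it serially on the dequeueing thread, or pay to queue it somewhere else.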
Are you investigating this?
I'm thinking about it; not sure of a good way to objectively evaluate the approach, or of the right approach atm. Multiple completion ports would probably be good (though they risk becoming imbalanced), but then offloading to the main threadpool would reintroduce the contention it would otherwise reduce.
Though there is easier low-hanging fruit to pick first: https://github.com/dotnet/coreclr/issues/19672
Could it work by implementing a work-stealing queue for IO pool threads, similar to the worker pool threads? So if a batch bigger than 1 is fetched from the IO completion port, it goes to a thread-local queue that other idle threads can consume.
That's one form of "additional queueing costs will be incurred to queue the execution of those additional items" (I shouldn't have been specific about "worker pool"). But that adds non-trivial costs, makes it likely that certain work items will be starved when they would otherwise have been processed sooner, etc., and I'm skeptical it would really help. The work-stealing benefit for the worker pool is primarily based on the notion that a given piece of work is being split into multiple pieces, so it's the work item being processed that's generating subwork, preferred by this thread but made available to others just in case. With the suggested approach, the work items would likely be unrelated. Someone could experiment with it, of course.
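As a rough illustration of that queueing cost (this is not how the pool dispatches today, just a naive sketch; the Completion type and handler are stand-ins):

using System;
using System.Threading;

static class BatchDispatchSketch
{
    // Stand-in for whatever per-operation state a real dispatcher would carry.
    internal readonly struct Completion
    {
        public Completion(IntPtr overlapped, uint bytes) { Overlapped = overlapped; Bytes = bytes; }
        public readonly IntPtr Overlapped;
        public readonly uint Bytes;
    }

    internal static void Dispatch(Completion[] batch, int count, Action<Completion> handler)
    {
        if (count == 0)
            return;

        // Items 1..count-1 get re-queued: that's extra allocation and queue traffic, and they
        // may now sit behind unrelated work even though another thread could have pulled them
        // straight off the completion port.
        for (int i = 1; i < count; i++)
        {
            Completion c = batch[i];
            ThreadPool.UnsafeQueueUserWorkItem(_ => handler(c), null);
        }

        // Only item 0 runs inline on the thread that dequeued the batch.
        handler(batch[0]);
    }
}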
Are you working through some trace/profile? What app?
Completion ports are not meant to be sharded. The single completion port design is what's recommended.
@benaadams (or anyone else interested): I have some code that implements this. It's Socket-specific, but that's likely good enough for testing. https://github.com/geoffkizer/corefx/tree/socketcustomiocp

There are env vars for controlling different parameters like # of IOCPs and # of completions to retrieve per GQCSEx call.

As @davidfowl mentioned above, I tried playing around with this on Plaintext but the wins were tiny at best (basically noise). In particular, sharding the IOCP doesn't seem to help; presumably the kernel is already doing a good job here internally. Batching with GQCSEx doesn't really seem to help either; the overhead specifically in GQCS doesn't seem to be that high. That said, I do think this is worth additional investigation.

Be aware that this code replaces the existing IOCP infrastructure with a custom implementation. I compared the current coreclr implementation vs this custom implementation with equivalent configuration (batch size = 1, completion ports = 1) and didn't really see much difference, from which I conclude that the current IOCP implementation is reasonably efficient.

If you do find scenarios/configurations where you see wins with this code, it would be good to understand whether the wins are due to the specific configuration, the IOCP implementation, or both.
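The environment-variable names below are invented for illustration (the knobs in that branch may be named differently); the point is simply the shape of the configuration being varied:

using System;

// Hypothetical knobs mirroring the kind of parameters described above; the actual
// environment variable names in the linked branch may well be different.
readonly struct IocpExperimentConfig
{
    public readonly int CompletionPortCount;    // how many IOCPs to shard handles across
    public readonly int MaxCompletionsPerCall;  // the ulCount passed to GetQueuedCompletionStatusEx

    IocpExperimentConfig(int ports, int batch)
    {
        CompletionPortCount = ports;
        MaxCompletionsPerCall = batch;
    }

    public static IocpExperimentConfig FromEnvironment() =>
        new IocpExperimentConfig(
            ReadPositiveInt("SOCKET_IOCP_COUNT", defaultValue: 1),
            ReadPositiveInt("SOCKET_IOCP_BATCH_SIZE", defaultValue: 1));

    static int ReadPositiveInt(string name, int defaultValue) =>
        int.TryParse(Environment.GetEnvironmentVariable(name), out int value) && value > 0
            ? value
            : defaultValue;
}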
Mostly was wondering if…

internal static unsafe void PerformIOCompletionCallback(uint errorCode, uint numBytes, NativeOverlapped* pNativeOverlapped)
{
    do
    {
        OverlappedData overlapped = OverlappedData.GetOverlappedFromNative(pNativeOverlapped);
        if (overlapped._callback is IOCompletionCallback iocb)
        {
            // We got here because of UnsafePack (or) Pack with EC flow suppressed
            iocb(errorCode, numBytes, pNativeOverlapped);
        }
        else
        {
            // We got here because of Pack
            var helper = (_IOCompletionCallback)overlapped._callback;
            helper._errorCode = errorCode;
            helper._numBytes = numBytes;
            helper._pNativeOverlapped = pNativeOverlapped;
            ExecutionContext.RunInternal(helper._executionContext, _ccb, helper);
        }

        // Quickly check the VM again (a zero-timeout dequeue) to see if another packet
        // has already arrived that this loop can dispatch without another round trip.
        OverlappedData.CheckVMForIOPacket(out pNativeOverlapped, out errorCode, out numBytes);
    } while (pNativeOverlapped != null);
}
This code, as I understand it, is a legacy of when we supported multiple AppDomains per process. There's a single thread pool per process and a single IOCP. Thus at a low level (below this code), we're doing the usual GQCS logic, and when a completion arrives, we "enter" the appropriate AppDomain and execute the callback. But "entering" the AppDomain used to have some nontrivial cost, so to optimize the common case of a single AppDomain, we also have the logic above, which does a 0-timeout GQCS and checks whether the received completion (if any) is in the same AppDomain. If so, we keep looping without having to switch AppDomains.

This isn't ideal, because if there's no completion to retrieve immediately, then the GQCS is pointless and just adds extra overhead. However, in practice it's not clear that this matters. In benchmark scenarios, there's usually a completion available. I played around with trying to change this at one point, but didn't see any useful results.

The complexity here is part of why I've tried to prototype other IOCP approaches by simply replacing the whole IOCP/thread pool implementation instead of modifying it in place.
We have a scenario where our app processes very large quantities of packets, using IOCP to retrieve and process batches of packets on particular threads. This app is currently written in C++ to ensure we have the fine-grained control we need. Making greater use of .NET for more developer-friendly business logic and higher-level feature implementation is something that keeps popping up now and then. The app is quite performance-critical, however; it is basically a packet switch.

A quick experiment with a naive .NET Core sockets implementation shows less than 50% of the performance we get from our C++ implementation. That is better than I would have expected a few years ago, but not enough. I suspect that if we could have greater control over IOCP handling (in particular, but not exclusively, to take more items at once from the completion port), we could eliminate one obstacle to greater use of .NET in our suite of services.

Our sockets are each also associated with specific CPU cores, which reduces chatter across cores; we do not want our completion handling distributed across different cores. I have only done a cursory examination of how this is implemented in .NET Core, but from what I understand, completions are simply handled by the thread pool, triggered by single-item retrievals from the completion port on whatever thread gets the task. To better meet our needs, we would be looking for functionality that, at a minimum, lets us take multiple completions per kernel call and keep completion handling on the cores we choose.
Of course, we could always simply implement this ourselves via PInvoke, but given that we already have a C++ implementation, this does not seem like a worthwhile investment. I hope this feedback can help bring .NET Core closer to our needs one day. (Perhaps this is not the ideal thread to post this in; I will see if I can split it into more individual requests.)
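For what it's worth, the per-core part of this can already be approximated from managed code with a couple of P/Invokes. A minimal sketch, assuming a single processor group (fewer than 64 logical processors) and a caller-supplied drain loop for that core's completion port:

using System;
using System.Runtime.InteropServices;
using System.Threading;

static class PinnedCompletionThreads
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern UIntPtr SetThreadAffinityMask(IntPtr thread, UIntPtr affinityMask);

    // Start one dedicated thread per core; drainPort is expected to contain the
    // GetQueuedCompletionStatusEx loop for that core's own completion port.
    public static Thread StartForCore(int core, Action drainPort)
    {
        var thread = new Thread(() =>
        {
            // Pin this thread to one core so completion handling stays local to it.
            SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
            drainPort();
        })
        {
            IsBackground = true,
            Name = "iocp-core-" + core
        };
        thread.Start();
        return thread;
    }
}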
@sandersaares Thanks for posting this. As you can see in the discussion above, we've explored approaches like this in the past, but we haven't landed on a solution. I would love to get your benchmark code and play with it a bit. Can you post it somewhere or email it to me?
The app I used for benchmarking the C# Socket performance is https://github.com/sandersaares/Shubar/. The attached grinch.zip can generate the test input streams; in my scenario I used 5 machines running the script in the zip to generate load for 1 benchmark app.
Ah, this is UDP; there are some low-hanging-fruit optimizations that can be made (#33418), though that's likely only part of the story here.
The specific issue for UDP optimization is here: #30797. Feedback welcome.
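Independent of what those issues end up covering, one mitigation available to UDP callers today is to drive receives through a single reused SocketAsyncEventArgs instead of per-call async overloads, so steady-state receives don't allocate. A rough sketch (buffer size and callback shape are arbitrary choices here):

using System;
using System.Net;
using System.Net.Sockets;

sealed class UdpReceiveLoop
{
    readonly Socket _socket;
    readonly SocketAsyncEventArgs _args = new SocketAsyncEventArgs();
    readonly Action<byte[], int, EndPoint> _onDatagram;

    public UdpReceiveLoop(Socket socket, Action<byte[], int, EndPoint> onDatagram)
    {
        _socket = socket;
        _onDatagram = onDatagram;
        _args.SetBuffer(new byte[64 * 1024], 0, 64 * 1024);
        _args.RemoteEndPoint = new IPEndPoint(IPAddress.Any, 0);
        _args.Completed += (_, e) => Pump(e);
    }

    public void Start()
    {
        // ReceiveFromAsync returns false when it completed synchronously, in which case
        // no Completed event is raised, so pump inline.
        if (!_socket.ReceiveFromAsync(_args))
            Pump(_args);
    }

    void Pump(SocketAsyncEventArgs e)
    {
        do
        {
            if (e.SocketError != SocketError.Success)
                return;
            _onDatagram(e.Buffer, e.BytesTransferred, e.RemoteEndPoint);
        }
        while (!_socket.ReceiveFromAsync(e)); // keep looping while completions are synchronous
    }
}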
It might be interesting to try using …
Pulling back multiple events (when available) per kernel call to dispatch rather than a single one per kernel call.
Also maybe use a completion port per core rather than a single completion port (to reduce contention)? (Not sure what strategy to use for attaching handles though; probably something better than round robin)
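A minimal sketch of what round-robin handle attachment across several ports could look like (one port per "shard", e.g. per core; a smarter policy such as least-loaded would replace the index selection, and all names here are illustrative):

using System;
using System.Runtime.InteropServices;
using System.Threading;

sealed class ShardedIocp
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateIoCompletionPort(
        IntPtr fileHandle, IntPtr existingCompletionPort, UIntPtr completionKey, uint numberOfConcurrentThreads);

    static readonly IntPtr InvalidHandleValue = new IntPtr(-1);

    readonly IntPtr[] _ports;
    int _next = -1;

    public ShardedIocp(int portCount)
    {
        _ports = new IntPtr[portCount];
        for (int i = 0; i < portCount; i++)
        {
            // INVALID_HANDLE_VALUE with no existing port creates a fresh completion port.
            _ports[i] = CreateIoCompletionPort(InvalidHandleValue, IntPtr.Zero, UIntPtr.Zero, 0);
            if (_ports[i] == IntPtr.Zero)
                throw new InvalidOperationException("CreateIoCompletionPort failed: " + Marshal.GetLastWin32Error());
        }
    }

    // Associate a socket/file handle with one of the ports, round robin.
    public IntPtr Attach(IntPtr handle, UIntPtr completionKey)
    {
        int index = (int)((uint)Interlocked.Increment(ref _next) % (uint)_ports.Length);
        IntPtr port = _ports[index];
        if (CreateIoCompletionPort(handle, port, completionKey, 0) == IntPtr.Zero)
            throw new InvalidOperationException("Associating the handle failed: " + Marshal.GetLastWin32Error());
        return port;
    }
}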
/cc @stephentoub @kouvel @vancem