win32threadpool.cpp should use GetQueuedCompletionStatusEx rather than GetQueuedCompletionStatus #11314
cc @geoffkizer, who did some experimentation here and didn't see much difference in Plaintext. On Windows all of the IO queues pull from the same port (we have a single global IO completion port), so it's sort of a single-producer, multiple-consumer model, vs libuv, where there's a unique port per loop thread polling for multiple items. I'm not sure that explains the differences we saw, though.
This also means that either a) some events may have been able to run on other threads but are now waiting to be run by the thread that dequeued them while it processes other items, or b) additional queueing costs will be incurred to queue the execution of those additional items to the worker pool.
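For concreteness, here is a minimal P/Invoke sketch (not the coreclr code) of what batched dequeueing with GetQueuedCompletionStatusEx looks like; the struct layout and signature follow the Win32 definitions, while the port handle and dispatch delegate are placeholders supplied by the caller:

using System;
using System.Runtime.InteropServices;

static class IocpBatchSketch
{
    // Layout matches the Win32 OVERLAPPED_ENTRY structure used by GetQueuedCompletionStatusEx.
    [StructLayout(LayoutKind.Sequential)]
    struct OVERLAPPED_ENTRY
    {
        public UIntPtr lpCompletionKey;
        public IntPtr lpOverlapped;
        public UIntPtr Internal;
        public uint dwNumberOfBytesTransferred;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetQueuedCompletionStatusEx(
        IntPtr completionPort,
        [Out] OVERLAPPED_ENTRY[] completionPortEntries,
        uint count,
        out uint numEntriesRemoved,
        uint milliseconds,
        bool alertable);

    const uint Infinite = 0xFFFFFFFF;

    // Drain up to entries.Length completions per kernel call instead of one per call.
    internal static void PollLoop(IntPtr port, Action<IntPtr, uint> dispatch)
    {
        var entries = new OVERLAPPED_ENTRY[64];
        while (GetQueuedCompletionStatusEx(port, entries, (uint)entries.Length,
                                           out uint removed, Infinite, alertable: false))
        {
            for (uint i = 0; i < removed; i++)
            {
                // Each entry carries the OVERLAPPED* and byte count of one completed operation.
                dispatch(entries[i].lpOverlapped, entries[i].dwNumberOfBytesTransferred);
            }
        }
        // A false return means the port was closed or an error occurred; real code would
        // check Marshal.GetLastWin32Error() here.
    }
}

Everything dequeued beyond the first entry runs into exactly the tradeoff described above: either process it serially on the dequeueing thread, or pay to queue it somewhere else.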
Are you investigating this?
I'm thinking about it; not sure of a good way to objectively evaluate the approach, or of the right approach atm. Multiple completion ports would probably be good (though they risk becoming imbalanced), but then offloading to the main threadpool would reintroduce the contention it would otherwise reduce.
Though there is easier low-hanging fruit to pick first: https://github.com/dotnet/coreclr/issues/19672
Could it work by implementing a work-stealing queue for IO pool threads, similar to the worker pool threads? So if a batch bigger than 1 is fetched from the IO completion port, it goes to a thread-local queue that other idle threads can consume.
That's one form of "additional queueing costs will be incurred to queue the execution of those additional items" (I shouldn't have been specific about "worker pool"). But that adds non-trivial costs, makes it likely that certain work items will be starved when they would otherwise have been processed sooner, etc., and I'm skeptical it would really help. The work-stealing benefit for the worker pool is primarily based on the notion that a given piece of work is being split into multiple pieces, so it's the work item being processed that's generating subwork, preferred by this thread but made available to others just in case. With the suggested approach, the work items would likely be unrelated. Someone could experiment with it, of course.
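As a rough illustration of that queueing cost (this is not how the pool dispatches today, just a naive sketch; the Completion type and handler are stand-ins):

using System;
using System.Threading;

static class BatchDispatchSketch
{
    // Stand-in for whatever per-operation state a real dispatcher would carry.
    internal readonly struct Completion
    {
        public Completion(IntPtr overlapped, uint bytes) { Overlapped = overlapped; Bytes = bytes; }
        public readonly IntPtr Overlapped;
        public readonly uint Bytes;
    }

    internal static void Dispatch(Completion[] batch, int count, Action<Completion> handler)
    {
        if (count == 0)
            return;

        // Items 1..count-1 get re-queued: that's extra allocation and queue traffic, and they
        // may now sit behind unrelated work even though another thread could have pulled them
        // straight off the completion port.
        for (int i = 1; i < count; i++)
        {
            Completion c = batch[i];
            ThreadPool.UnsafeQueueUserWorkItem(_ => handler(c), null);
        }

        // Only item 0 runs inline on the thread that dequeued the batch.
        handler(batch[0]);
    }
}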
Are you working through some trace/profile? What app?
Completion ports are not meant to be sharded. The single completion port design is what's recommended.
@benaadams (or anyone else interested): I have some code that implements this. It's Socket-specific, but that's likely good enough for testing. https://github.com/geoffkizer/corefx/tree/socketcustomiocp

There are env vars for controlling different parameters like # of IOCPs and # of completions to retrieve per GQCSEx call.

As @davidfowl mentioned above, I tried playing around with this on Plaintext but the wins were tiny at best (basically noise). In particular, sharding the IOCP doesn't seem to help; presumably the kernel is already doing a good job here internally. Batching with GQCSEx doesn't really seem to help either; the overhead specifically in GQCS doesn't seem to be that high. That said, I do think this is worth additional investigation.

Be aware that this code replaces the existing IOCP infrastructure with a custom implementation. I compared the current coreclr implementation vs this custom implementation with equivalent configuration (batch size = 1, completion ports = 1) and didn't really see much difference, from which I conclude that the current IOCP implementation is reasonably efficient.

If you do find scenarios/configurations where you see wins with this code, it would be good to understand whether the wins are due to the specific configuration, the IOCP implementation, or both.
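The environment-variable names below are invented for illustration (the knobs in that branch may be named differently); the point is simply the shape of the configuration being varied:

using System;

// Hypothetical knobs mirroring the kind of parameters described above; the actual
// environment variable names in the linked branch may well be different.
readonly struct IocpExperimentConfig
{
    public readonly int CompletionPortCount;    // how many IOCPs to shard handles across
    public readonly int MaxCompletionsPerCall;  // the ulCount passed to GetQueuedCompletionStatusEx

    IocpExperimentConfig(int ports, int batch)
    {
        CompletionPortCount = ports;
        MaxCompletionsPerCall = batch;
    }

    public static IocpExperimentConfig FromEnvironment() =>
        new IocpExperimentConfig(
            ReadPositiveInt("SOCKET_IOCP_COUNT", defaultValue: 1),
            ReadPositiveInt("SOCKET_IOCP_BATCH_SIZE", defaultValue: 1));

    static int ReadPositiveInt(string name, int defaultValue) =>
        int.TryParse(Environment.GetEnvironmentVariable(name), out int value) && value > 0
            ? value
            : defaultValue;
}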
Mostly was wondering if…

internal static unsafe void PerformIOCompletionCallback(uint errorCode, uint numBytes, NativeOverlapped* pNativeOverlapped)
{
    do
    {
        OverlappedData overlapped = OverlappedData.GetOverlappedFromNative(pNativeOverlapped);
        if (overlapped._callback is IOCompletionCallback iocb)
        {
            // We got here because of UnsafePack (or) Pack with EC flow suppressed
            iocb(errorCode, numBytes, pNativeOverlapped);
        }
        else
        {
            // We got here because of Pack
            var helper = (_IOCompletionCallback)overlapped._callback;
            helper._errorCode = errorCode;
            helper._numBytes = numBytes;
            helper._pNativeOverlapped = pNativeOverlapped;
            ExecutionContext.RunInternal(helper._executionContext, _ccb, helper);
        }

        // Quickly check the VM again (a zero-timeout dequeue) to see if another packet
        // has already arrived that this loop can dispatch without another round trip.
        OverlappedData.CheckVMForIOPacket(out pNativeOverlapped, out errorCode, out numBytes);
    } while (pNativeOverlapped != null);
}
This code, as I understand it, is a legacy of when we supported multiple AppDomains per process. There's a single thread pool per process and a single IOCP. Thus at a low level (below this code), we're doing the usual GQCS logic, and when a completion arrives, we "enter" the appropriate AppDomain and execute the callback. But "entering" the AppDomain used to have some nontrivial cost, so to optimize the common case of a single AppDomain, we also have the logic above, which does a 0-timeout GQCS and checks whether the received completion (if any) is in the same AppDomain. If so, we keep looping without having to switch AppDomains.

This isn't ideal, because if there's no completion to retrieve immediately, then the GQCS is pointless and just adds extra overhead. However, in practice it's not clear that this matters. In benchmark scenarios, there's usually a completion available. I played around with trying to change this at one point, but didn't see any useful results.

The complexity here is part of why I've tried to prototype other IOCP approaches by simply replacing the whole IOCP/thread pool implementation instead of modifying it in place.
We have a scenario where our app processes very large quantities of packets, using IOCP to retrieve and process batches of packets on particular threads. This app is currently written in C++ to ensure we have the fine-grained control we need. Making greater use of .NET for more developer-friendly business logic and higher-level feature implementation is something that keeps popping up now and then. The app is quite performance-critical, however; it is basically a packet switch.

A quick experiment with a naive .NET Core sockets implementation shows less than 50% of the performance we get from our C++ implementation. That is better than I would have expected a few years ago, but not enough. I suspect that if we could have greater control over IOCP handling (in particular, but not exclusively, to take more items at once from the completion port), we could eliminate one obstacle to greater use of .NET in our suite of services.

Our sockets are each also associated with specific CPU cores, which reduces chatter across cores; we do not want our completion handling distributed across different cores. I have only done a cursory examination of how this is implemented in .NET Core, but from what I understand, completions are simply handled by the thread pool, triggered by single-item retrievals from the completion port on whatever thread gets the task. To better meet our needs, we would be looking for functionality that, at a minimum, lets us take multiple completions per kernel call and keep completion handling on the cores we choose.
Of course, we could always simply implement this ourselves via PInvoke, but given that we already have a C++ implementation, this does not seem like a worthwhile investment. I hope this feedback can help bring .NET Core closer to our needs one day. (Perhaps this is not the ideal thread to post this in; I will see if I can split it into more individual requests.)
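For what it's worth, the per-core part of this can already be approximated from managed code with a couple of P/Invokes. A minimal sketch, assuming a single processor group (fewer than 64 logical processors) and a caller-supplied drain loop for that core's completion port:

using System;
using System.Runtime.InteropServices;
using System.Threading;

static class PinnedCompletionThreads
{
    [DllImport("kernel32.dll")]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern UIntPtr SetThreadAffinityMask(IntPtr thread, UIntPtr affinityMask);

    // Start one dedicated thread per core; drainPort is expected to contain the
    // GetQueuedCompletionStatusEx loop for that core's own completion port.
    public static Thread StartForCore(int core, Action drainPort)
    {
        var thread = new Thread(() =>
        {
            // Pin this thread to one core so completion handling stays local to it.
            SetThreadAffinityMask(GetCurrentThread(), (UIntPtr)(1UL << core));
            drainPort();
        })
        {
            IsBackground = true,
            Name = "iocp-core-" + core
        };
        thread.Start();
        return thread;
    }
}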
@sandersaares Thanks for posting this. As you can see in the discussion above, we've explored approaches like this in the past, but we haven't landed on a solution. I would love to get your benchmark code and play with it a bit. Can you post it somewhere or email it to me?
The app I used for benchmarking the C# Socket performance is https://github.com/sandersaares/Shubar/. The attached grinch.zip can generate the test input streams; in my scenario I used 5 machines running the script in the zip to generate load for 1 benchmark app.
Ah, this is UDP; there are some low-hanging-fruit optimizations that can be made (#33418), though that's likely only part of the story here.
The specific issue for UDP optimization is here: #30797. Feedback welcome.
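Independent of what those issues end up covering, one mitigation available to UDP callers today is to drive receives through a single reused SocketAsyncEventArgs instead of per-call async overloads, so steady-state receives don't allocate. A rough sketch (buffer size and callback shape are arbitrary choices here):

using System;
using System.Net;
using System.Net.Sockets;

sealed class UdpReceiveLoop
{
    readonly Socket _socket;
    readonly SocketAsyncEventArgs _args = new SocketAsyncEventArgs();
    readonly Action<byte[], int, EndPoint> _onDatagram;

    public UdpReceiveLoop(Socket socket, Action<byte[], int, EndPoint> onDatagram)
    {
        _socket = socket;
        _onDatagram = onDatagram;
        _args.SetBuffer(new byte[64 * 1024], 0, 64 * 1024);
        _args.RemoteEndPoint = new IPEndPoint(IPAddress.Any, 0);
        _args.Completed += (_, e) => Pump(e);
    }

    public void Start()
    {
        // ReceiveFromAsync returns false when it completed synchronously, in which case
        // no Completed event is raised, so pump inline.
        if (!_socket.ReceiveFromAsync(_args))
            Pump(_args);
    }

    void Pump(SocketAsyncEventArgs e)
    {
        do
        {
            if (e.SocketError != SocketError.Success)
                return;
            _onDatagram(e.Buffer, e.BytesTransferred, e.RemoteEndPoint);
        }
        while (!_socket.ReceiveFromAsync(e)); // keep looping while completions are synchronous
    }
}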
It might be interesting to try using …
Pulling back multiple events (when available) per kernel call to dispatch rather than a single one per kernel call.
Also maybe use a completion port per core rather than a single completion port (to reduce contention)? (Not sure what strategy to use for attaching handles though; probably something better than round robin)
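A minimal sketch of what round-robin handle attachment across several ports could look like (one port per "shard", e.g. per core; a smarter policy such as least-loaded would replace the index selection, and all names here are illustrative):

using System;
using System.Runtime.InteropServices;
using System.Threading;

sealed class ShardedIocp
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateIoCompletionPort(
        IntPtr fileHandle, IntPtr existingCompletionPort, UIntPtr completionKey, uint numberOfConcurrentThreads);

    static readonly IntPtr InvalidHandleValue = new IntPtr(-1);

    readonly IntPtr[] _ports;
    int _next = -1;

    public ShardedIocp(int portCount)
    {
        _ports = new IntPtr[portCount];
        for (int i = 0; i < portCount; i++)
        {
            // INVALID_HANDLE_VALUE with no existing port creates a fresh completion port.
            _ports[i] = CreateIoCompletionPort(InvalidHandleValue, IntPtr.Zero, UIntPtr.Zero, 0);
            if (_ports[i] == IntPtr.Zero)
                throw new InvalidOperationException("CreateIoCompletionPort failed: " + Marshal.GetLastWin32Error());
        }
    }

    // Associate a socket/file handle with one of the ports, round robin.
    public IntPtr Attach(IntPtr handle, UIntPtr completionKey)
    {
        int index = (int)((uint)Interlocked.Increment(ref _next) % (uint)_ports.Length);
        IntPtr port = _ports[index];
        if (CreateIoCompletionPort(handle, port, completionKey, 0) == IntPtr.Zero)
            throw new InvalidOperationException("Associating the handle failed: " + Marshal.GetLastWin32Error());
        return port;
    }
}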
/cc @stephentoub @kouvel @vancem