Allow using an app-provided thread #1923
Comments
We've discussed possibly supporting this, but never came to any hard conclusion. How would you actually use this, if we added support? There is significant work involved and we wouldn't want to do this unless something would definitely use it. |
I don’t have any particular plans myself, and am not in a position where I am likely to use MsQuic in the near future. That said, I would not be surprised if some people have ruled out MsQuic as an option because of this without filing an issue. |
Perhaps. There is a reason we went with this model of owning the threads in MsQuic though. There is a lot of complexity involved in implementing a performant parallelized networking layer, and by owning the threads in MsQuic we can do all the hard work internally and the apps get it for free. If we add support for this Issue, I do expect apps that use this model to have a significant performance decrease from those that do not. |
Is the current model compatible with Node.js, for example, or would that require marshalling? |
I have no experience with Node.js so I cannot answer that. |
I would actually be slightly scared to put the QUIC workers on the UI thread in Node.js. Unlike the current TCP and UDP implementations, which do a TINY amount of work at the user mode level, QUIC is very computationally expensive, including encryption, synchronous DNS lookup, and all the timing requirements for the protocol. I highly suspect running QUIC directly on the UI thread would start to cause the UI to lag. And because you'd be limited to a single thread, you'd lose a lot of perf there as well. Very little in Node currently is computationally expensive, and anything that is usually is marshalled to a separate thread in some way. |
How does that compare to the current TLS client and server? Also, my understanding is that Node.js usually scales by having multiple instances of the server running, or by using multiple contexts. So using 2x the CPU for less than 2x performance is not guaranteed to be a win. |
Again, I don't know how that's currently done in Node, but TLS is very expensive, so I'd assume it's never done on a blocking thread.
MsQuic scales its threads with processor count. Additionally, RSS (receive side scaling) uses dedicated threads per processor to match the NIC's processor receive indications, so it does scale very well, especially on multi-NUMA node machines. |
At a minimum, I would like to be able to integrate my own code into MsQuic’s event loop somehow. I might need to handle HTTP/1.1 and HTTP/2 traffic as well, for instance. |
@DemiMarie we're doing work on refactoring how scheduling works, and would be happy to take inputs and suggestions. We refactored the QUIC worker thread so that it can be run by another thread:
//
// General purpose execution context abstraction layer. Used for driving worker
// loops.
//
typedef struct CXPLAT_EXECUTION_CONTEXT CXPLAT_EXECUTION_CONTEXT;
//
// Returns FALSE when it's time to cleanup.
//
typedef
_IRQL_requires_max_(PASSIVE_LEVEL)
BOOLEAN
(*CXPLAT_EXECUTION_FN)(
    _Inout_ CXPLAT_EXECUTION_CONTEXT* Context,
    _Inout_ uint64_t* TimeNowUs,    // The current time, in microseconds.
    _In_ CXPLAT_THREAD_ID ThreadID  // The current thread ID.
    );
typedef struct CXPLAT_EXECUTION_CONTEXT {
    void* Context;
    CXPLAT_EXECUTION_FN Callback;
    uint64_t NextTimeUs;
    BOOLEAN Ready;
} CXPLAT_EXECUTION_CONTEXT;
And usage:
// TODO - Add synchronization around this stuff.
uint32_t ExecutionContextCount = 0;
CXPLAT_EXECUTION_CONTEXT* ExecutionContexts[8];
void CxPlatAddExecutionContext(CXPLAT_EXECUTION_CONTEXT* Context)
{
    CXPLAT_FRE_ASSERT(ExecutionContextCount < ARRAYSIZE(ExecutionContexts));
    ExecutionContexts[ExecutionContextCount] = Context;
    ExecutionContextCount++;
}
BOOLEAN CxPlatRunExecutionContexts(_In_ CXPLAT_THREAD_ID ThreadID)
{
    if (ExecutionContextCount == 0) {
        return FALSE;
    }
    uint64_t TimeNow = CxPlatTimeUs64();
    uint32_t i = 0;
    while (i < ExecutionContextCount) {
        CXPLAT_EXECUTION_CONTEXT* Context = ExecutionContexts[i];
        if ((Context->Ready || Context->NextTimeUs <= TimeNow) &&
            !Context->Callback(Context->Context, &TimeNow, ThreadID)) {
            // Remove the context by moving the last entry into this slot, then
            // revisit the slot so the moved entry is not skipped this pass.
            ExecutionContexts[i] = ExecutionContexts[--ExecutionContextCount];
            continue;
        }
        i++;
    }
    return TRUE;
}
With this model exposed to the API, it would allow the app's thread to drive the execution contexts. The complexity comes in trying to continue to have things like RSS and CID-based routing still work effectively. |
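To make the "app thread drives the execution contexts" idea concrete, here is a minimal, self-contained sketch of how a single-threaded application might pump such contexts from its own loop. It is an illustration only, not MsQuic API: `EXEC_CONTEXT`, `AddContext`, `RunContexts`, `NextWakeUs`, and `CountdownFn` are hypothetical stand-ins loosely modeled on the snippet above, and the caller passes the current time in rather than reading a platform clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical stand-ins for the CXPLAT_* types above; not the MsQuic API.
typedef bool (*EXEC_FN)(void* State, uint64_t* TimeNowUs);

typedef struct EXEC_CONTEXT {
    void* State;          // Opaque per-context state passed to Callback.
    EXEC_FN Callback;     // Returns false when the context is finished.
    uint64_t NextTimeUs;  // Earliest absolute time the context wants to run.
    bool Ready;           // Set when the context has work to do right now.
} EXEC_CONTEXT;

#define MAX_CONTEXTS 8
static EXEC_CONTEXT* Contexts[MAX_CONTEXTS];
static uint32_t ContextCount = 0;

void AddContext(EXEC_CONTEXT* Context)
{
    assert(ContextCount < MAX_CONTEXTS);
    Contexts[ContextCount++] = Context;
}

// Runs every context that is ready or due. Returns false once none remain.
bool RunContexts(uint64_t TimeNowUs)
{
    uint32_t i = 0;
    while (i < ContextCount) {
        EXEC_CONTEXT* Context = Contexts[i];
        if ((Context->Ready || Context->NextTimeUs <= TimeNowUs) &&
            !Context->Callback(Context->State, &TimeNowUs)) {
            // Swap the last entry into this slot and revisit the slot.
            Contexts[i] = Contexts[--ContextCount];
            continue;
        }
        i++;
    }
    return ContextCount != 0;
}

// Earliest absolute deadline across all contexts (0 if any is ready now).
// The app could use this to bound the timeout of its own poll/epoll call.
uint64_t NextWakeUs(void)
{
    uint64_t Next = UINT64_MAX;
    for (uint32_t i = 0; i < ContextCount; i++) {
        if (Contexts[i]->Ready) {
            return 0;
        }
        if (Contexts[i]->NextTimeUs < Next) {
            Next = Contexts[i]->NextTimeUs;
        }
    }
    return Next;
}

// Example callback: counts down, then asks to be removed.
bool CountdownFn(void* State, uint64_t* TimeNowUs)
{
    (void)TimeNowUs;
    int* Remaining = (int*)State;
    return --(*Remaining) > 0;
}
```

An application loop would call `NextWakeUs()` to decide how long its own I/O wait may block, then call `RunContexts()` with the current time once the wait returns. This sketch deliberately ignores the synchronization, RSS, and CID-routing concerns raised in the thread.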
@nibanks so one thought I had is to allow the MsQuic event loop to handle other things as well, such as pollable file descriptors on Unix and I/O completion ports and waitable events on Windows. The latter will require using undocumented NT kernel APIs, but I imagine it would not be too hard for you to work around that problem. As far as RSS and CID-based routing, what are the tricky parts? Would it be possible to decouple the networking code from the state machine, as Quinn does? Would there be a performance penalty in doing so? |
@DemiMarie yes we've thought about designs both where msquic handles everything and where we expose interfaces such that the app can handle everything. Both have complexities, mostly originating from the fact that there is no single, easy pattern that works cross-platform. Just for the datapath layer, epoll, kqueue, iocp, etc. all have slight differences that complicate things.
Anything is possible, but we have to balance complexity and performance. Unlike any other QUIC stack that I know, MsQuic is designed to align RSS all the way up from the NIC even into the application thread; all on the same CPU (if everything is used properly). This is very complicated and difficult to achieve, and providing for a generic interface that other threads could control will make it more difficult. That isn't to say we don't want to go there. We want to figure out a good way to do this, but still haven't quite achieved it yet. |
Just stumbled across this issue, might be worth adding my 2 cents: However, I'm at a loss as to how I would integrate it with PHP (via FFI). The PHP model generally requires PHP code to be invoked only from a single thread (and then do multi-processing if needed for scaling horizontally). I would like it if MsQuic did not fully decouple the networking, as I definitely appreciate it trying to optimize the networking, setting socket options etc. I would just need the small task of socket I/O readiness to be abstracted away. |
It's definitely a goal to be able to allow the app thread to drive the execution. It's still a work in progress though. Thanks for the feedback! |
What is being described/requested here is the sans-io model:
By externalising all IO and timers, the library effectively becomes just a state machine. One benefit this brings is ease of porting the library to other platforms: it simply doesn't contain any platform-specific code anymore. We have currently ruled out msquic for one of our projects because plugging in IO for console platforms requires maintaining a fork. |
As you can see from the recently linked draft PR, we're actively working on exposing a way for external control of the execution. |
I had a look; it is very early work and it is hard to see how it will shape up. It might allow better control over threading, but it looks like it retains a lot of responsibility for IO in msquic, making porting it still a hassle. Ideal sans-io interface should accept time delta since last poll and vector of Realistic payloads passed in/out from sans-io library are likely to be more complicated with enums for connection created, closed, reporting IO errors, supporting multiple buffers per socket for scatter/gather IO, etc. |
That model assumes you have socket handles, which is not always correct in terms of XDP and DPDK. |
Indeed so. RSS alignment (which really helps performance) is another factor. |
I don't follow. The socket handle doesn't have to point to an actual kernel socket. Call it io_handle: just an identifier which the IO layer outside of msquic can use to understand how to send bytes there, and which msquic uses to identify the QUIC endpoint the bytes belong to. I am not familiar with XDP or DPDK, but surely they have a notion of source:port, dst:port even if they craft raw packets including all of the IP headers; io_handle can be mapped to these network tuples. Same for RSS: because all IO is externalised, msquic gives up control of it and it is up to the IO layer to choose the CPU to run IO on. If required, there can be an msquic instance per CPU for a shared-nothing architecture. With msquic acting just as a state machine, the app has full flexibility in how and when to drive it. |
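As a rough illustration of the sans-io shape discussed in these comments, the sketch below shows a state machine that never touches sockets, clocks, or threads: the caller hands in the elapsed time plus any received datagrams tagged with an `io_handle`, and collects the datagrams to send. Everything here is hypothetical (`DATAGRAM`, `STATE_MACHINE`, `Poll`, and the echo-plus-idle-timeout logic are placeholders, not MsQuic API and not QUIC protocol behavior).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// All names here are hypothetical; none of this is MsQuic API.
typedef uint32_t io_handle;   // Opaque ID the app maps to a socket, raw queue, etc.

typedef struct DATAGRAM {
    io_handle Handle;         // Which endpoint the bytes belong to.
    const uint8_t* Data;      // Payload, owned by the caller.
    size_t Length;            // Payload length in bytes.
} DATAGRAM;

typedef struct STATE_MACHINE {
    uint64_t IdleUs;          // Time accumulated since the last received datagram.
    uint64_t IdleTimeoutUs;   // Close after this much inactivity.
    int Closed;               // Nonzero once the idle timeout has fired.
} STATE_MACHINE;

// Advances the state machine by ElapsedUs and feeds it InCount datagrams.
// Writes datagrams to send into Out and returns how many were written.
size_t Poll(STATE_MACHINE* Sm, uint64_t ElapsedUs,
            const DATAGRAM* In, size_t InCount,
            DATAGRAM* Out, size_t OutCapacity)
{
    assert(Sm != NULL);
    size_t OutCount = 0;
    if (Sm->Closed) {
        return 0;
    }
    Sm->IdleUs += ElapsedUs;
    if (InCount > 0) {
        Sm->IdleUs = 0;  // Any received traffic resets the idle timer.
    }
    for (size_t i = 0; i < InCount && OutCount < OutCapacity; i++) {
        // Placeholder protocol logic: echo each datagram on the same handle.
        Out[OutCount++] = In[i];
    }
    if (Sm->IdleUs >= Sm->IdleTimeoutUs) {
        Sm->Closed = 1;
    }
    return OutCount;
}
```

Because `Poll` is a pure function of its inputs and the state struct, the IO layer is free to choose which CPU runs it and how `io_handle` values map to network tuples. A realistic interface would also need events for connection creation and closure, IO error reporting, and scatter/gather buffers, as noted above.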
Describe the feature you'd like supported
It would be nice if MsQuic allowed apps to provide their own threads, and perform event polling themselves.
Proposed solution
See above.
Additional context
In some environments, such as Lua and Node.js, all callbacks must eventually be run on a single thread. This currently requires marshaling them back to the main thread, which is less efficient than if MsQuic could integrate into the built-in event loop. Other environments, such as Rust with Tokio, already provide their own high-performance event loops, and having to use a separate thread for QUIC would require additional locking.