feature requests: submit requests from any thread #109
Comments
I can add some color here as I've made my uring interface multi-threaded. Maintaining a thread-safe free-list of SQE indexes is no problem. The imperfect part comes when submitting the SQEs. A client always submits the number of SQEs it prepared, but the actual SQEs that it submits may be different from the ones it prepared if another thread has published its SQE indexes but not yet called submit. This seems to work fine, though, because the threads will eventually submit each other's SQEs. I don't know if there are any implications to submitting a subset of SQEs that are linked. For example, thread (1) wants to submit SQEs A and B, which are linked, and thread (2) wants to submit C. If (1) publishes first, but (2) publishes and then submits, only A might be submitted before (1) gets around to submitting B and C. Some improvement here would be nice, but it may also be a non-issue if preparing SQEs is always immediately followed by submitting them. (I don't hold a lock across these two steps, to avoid holding the lock across the syscall.)
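A minimal sketch of the pattern described above, assuming a single shared liburing ring initialized elsewhere and a plain mutex instead of a lock-free free-list (this is not the poster's actual code): preparation and submission are two separate critical sections, so a submit may also flush SQEs that another thread prepared in between, which is exactly the interleaving described in the comment.

    #include <errno.h>
    #include <liburing.h>
    #include <pthread.h>

    static struct io_uring ring;   /* set up elsewhere with io_uring_queue_init() */
    static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Prepare a read under the lock, then submit in a second critical
     * section so the lock is never held across the syscall. */
    static int submit_read(int fd, void *buf, unsigned len, __u64 offset)
    {
        struct io_uring_sqe *sqe;
        int ret;

        pthread_mutex_lock(&ring_lock);
        sqe = io_uring_get_sqe(&ring);        /* grab ("publish") an SQE slot */
        if (!sqe) {
            pthread_mutex_unlock(&ring_lock);
            return -EBUSY;                    /* SQ is full */
        }
        io_uring_prep_read(sqe, fd, buf, len, offset);
        pthread_mutex_unlock(&ring_lock);

        /* Another thread may prepare -- or submit -- SQEs at this point. */

        pthread_mutex_lock(&ring_lock);
        ret = io_uring_submit(&ring);         /* may flush other threads' SQEs too */
        pthread_mutex_unlock(&ring_lock);
        return ret;
    }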
@paxostv as you point out, it's possible to do, but there are some challenges: ensuring different threads don't try to use the same sqe, ensuring no partial sequences get submitted, and maybe dealing with the sqe list being full as well. Being able to pass the sqe list as an argument doesn't require this synchronization. Ideally, the application is written so only a single thread is submitting sqes to an io_uring. That is also what
Not sure I like the idea of a separate system call, as I don't think it'll solve anything. You're still going to need the ring to actually submit them, and the ring is locked internally. This means that even if you're not using the SQ ring at all, two threads could still contend and block if they tried to do that at the same time. So let me ask, why aren't you just using a ring per thread in your thread pool? Most users of io_uring won't be using thread pools at all, since one of the goals was not to need them, as blocking processing will be done internally anyway. You can share the async backend between rings for all your threads, so the overhead should be fairly minimal. And then you don't need to share a ring at all. Maybe I'm missing something and there are specific reasons why you are sharing a ring (or multiple rings?) between threads in a thread pool?
Good question! There are a number of reasons; this is the most important one: it's allowed to queue work to the ThreadPool that blocks. The ThreadPool will add additional threads to handle work when it detects that. You can also block until an async operation completes. So if that async operation relies on the thread being able to reach io_uring handling, you're in a deadlock. The challenge comes from trying to fit io_uring into the .NET model, which is different from the preferred io_uring model.
It's definitely possible. Why not? A couple of reasons come to mind.
There are alternative models, but honestly keeping a free list of SQE indexes, where each CPU can cache a subset, and then calling into a single ring isn't hard to implement. There's just an "iffy" (but correct) factor for the reason I stated above. If there are better ways to go about this, I'd be happy to explore them.
The general reason is that we are looking at adopting io_uring in the framework so it works with existing code and applications. If we were building a framework from scratch, it could be centered on the async model of io_uring, but that is not what we're doing.
We just had some discussions around this topic in the tokio discord chatroom. The tokio runtime works similarly to the .NET runtime: user code can be executed on a multithreaded threadpool and is generally not concerned about which thread it runs on. This means IO handles (like sockets/files) could be used from more than one thread during their lifetime. In addition, they could be migrated between threads even while IO calls are active, due to the use of a work-stealing user-space scheduler. We are now asking ourselves how to integrate io_uring into such a model.
Would it be allowed to perform each IO operation from a different ring - e.g. always use the thread-local submission queue and ring to submit the request? That could be an option. If, however, the first IO operation somehow "binds" the IO handle to that ring, it might not work out. If we can submit IO operations on different rings, are there any restrictions around it? One thing I could think of is whether all operations submitted on the previous ring must have completed. We might run into situations where the handle is migrated even though an operation is still in flight, and where the next operation is meant to be submitted from a new thread.
@Matthias247 .NET behaves similarly. Adding some more specifics, which may be the same in tokio: in .NET, all on-going operations can be cancelled by closing the handle. Does tokio allow users to block the threadpool threads? And can users wait on operations that are handled by io_uring? Blocking threadpool threads is allowed in .NET; this includes waiting for operations that would be handled by io_uring.
Yeah, there will be a similar challenge for this in tokio too. Destructors for IO operations could be called at any time - there are no guaranteed run-to-completion semantics. And we need to at least make them safe. There are 2 strategies for that:
It has a single-threaded and a multi-threaded mode. In the multi-threaded mode we could theoretically block the thread pool (e.g. in order to wait for a cancelled operation to complete - which was probably the background of your question). However it comes with a variety of gotchas. One is that in applications with lots of cancellations - e.g. a webserver which terminates lots of keep-alive connections through timeouts - we will block the threadpool very often and ruin performance. Blocking for an operation to complete will, however, require that the ring runs outside of the current thread - or at least that it is "migrated away" before entering the blocking mode - otherwise the thread would deadlock by waiting on itself. So this mode favors running "uring threads" outside of the normal threadpool executor threads.
They would be able to asynchronously await the operations. Theoretically users could also do

    // Block the current thread until a Future resolves
    let result = block_on(async {
        io_future.await;
    });

but that is not necessarily something that needs to be supported.
For context, I'm another maintainer of Tokio. As mentioned, we have a work-stealing runtime where sockets (and other resources) may migrate across threads. To be explicit, I'm trying to discuss the general "work-stealing" pattern and not design something specific for Tokio. Thinking more about the problem, the goal we have isn't so much to be able to submit from multiple threads; we could have a ring per thread. However, there is a bunch of functionality that is per ring: for example, buffer pools, registering FDs, etc. Also, we don't really care whether the same thread that submits an event gets the completion. If, for example, there were a way to create one ring per thread but then associate these rings in a "cluster" of some kind, this would be sufficient. In this case, if an FD, buffer, ... is registered with any ring in the cluster, it is registered with all rings of the cluster. Additionally, completion events could be made available in any of the rings in the cluster. @axboe would something like that be a plausible direction?
@axboe How do you do this? I can find nothing in the man pages or tests explaining or showing how to do this.
@yxhuvud I need to get that documented... But basically, you set up the first ring as usual. Then for subsequent rings where you want to share the async backend, you set io_uring_params->wq_fd to the first ring's file descriptor, and ensure that you set IORING_SETUP_ATTACH_WQ in the setup flags as well. That's it.
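A minimal sketch of that setup with liburing (the helper name and structure here are illustrative, error handling trimmed): the first ring is created normally, and the second attaches to its async backend via wq_fd plus IORING_SETUP_ATTACH_WQ.

    #include <liburing.h>
    #include <string.h>

    /* Create two rings that share one async backend. */
    int setup_shared_backend(struct io_uring *first, struct io_uring *second,
                             unsigned entries)
    {
        struct io_uring_params p;
        int ret;

        /* First ring: plain setup. */
        memset(&p, 0, sizeof(p));
        ret = io_uring_queue_init_params(entries, first, &p);
        if (ret < 0)
            return ret;

        /* Subsequent ring: attach to the first ring's async backend. */
        memset(&p, 0, sizeof(p));
        p.flags = IORING_SETUP_ATTACH_WQ;
        p.wq_fd = first->ring_fd;
        return io_uring_queue_init_params(entries, second, &p);
    }

Each ring keeps its own SQ and CQ; only the backend is shared, as confirmed in the next replies.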
@axboe, just to make sure I understand correctly, when you use IORING_SETUP_ATTACH_WQ, the rings share the async backend, but the completions are still sent to their respective rings, correct? That is, the interface that the user-space application sees is the same, though the backend is shared?
Correct, it behaves the exact same way.
I don't know if there have been developments in this space. Supporting starting operations from arbitrary threads would be a nice way to adopt io_uring in existing applications and frameworks. Consider:
These two system calls could be a single call against io_uring.
That's exactly why most old threading-based programs need to be restructured, rather than waiting for io_uring to compromise. For example, we started writing our project in the last few years with modern C++, and it is based on coroutines per thread. We deem it the sanest thing to create a ring per thread and to submit to it concurrently from the coroutines on that thread.
Thanks for all the contributions! I found this discussion very enlightening. In a single-threaded application, io_uring is pretty easy and straightforward to use, which is a great achievement. However, in the multi-threaded case, one thing is not entirely clear to me: if the application makes sure to prevent races, is it OK for different threads to access the same io_uring? Is it okay for one thread to poll for completions while another thread submits operations?
It's fine, the problem is when apps modify/access SQ and/or CQ (including tail/head pointers) in parallel.
Yes. In both cases there might be minor performance implications because of internal implementation details.
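As a sketch of the split these answers allow (a toy example using a no-op request, not taken from the liburing tests): one thread owns the submission side and another owns the completion side, so the two never touch the same head/tail pointers.

    #include <liburing.h>
    #include <pthread.h>
    #include <stdio.h>

    static struct io_uring ring;

    /* This thread only touches the submission side of the ring. */
    static void *submitter(void *arg)
    {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_nop(sqe);                  /* placeholder request */
        io_uring_submit(&ring);
        return NULL;
    }

    /* This thread only touches the completion side of the ring. */
    static void *completer(void *arg)
    {
        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {
            printf("completion: res=%d\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t s, c;
        if (io_uring_queue_init(8, &ring, 0) < 0)
            return 1;
        pthread_create(&c, NULL, completer, NULL);
        pthread_create(&s, NULL, submitter, NULL);
        pthread_join(s, NULL);
        pthread_join(c, NULL);
        io_uring_queue_exit(&ring);
        return 0;
    }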
I'm closing this one; one of the core tenets of io_uring is that you should not share a ring if you can avoid it. Either implement the sharing so that you have a single thread submitting, or have a ring per thread.
When working with io_uring from a stack that has a user-space thread pool (like .NET), it would be interesting to be able to submit requests from any thread. The shared memory for requests makes it challenging to synchronize access between multiple threads that want to use the same io_uring instance.
Consider adding a system call that accepts a list of requests and a list of completions.
When a request can be completed without blocking, its completion is filled in. If a request would block, it is added to the io_uring and will complete asynchronously; the completion can then be retrieved from the shared-memory completion buffer.
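To make the proposal concrete, here is a hypothetical sketch of what such a call's interface might look like. None of the names, types, or parameters below exist in io_uring; they are invented purely for illustration.

    /* Hypothetical interface only -- nothing below exists in io_uring. */
    struct io_uring_req;          /* a self-contained request description, SQE-like */
    struct io_uring_completion;   /* a completion record, CQE-like */

    /*
     * Submit nr_reqs requests in one call. Requests that can finish without
     * blocking have their results written to completions[] and are counted
     * in *nr_completed; the rest are queued on the ring identified by
     * ring_fd and surface later in the shared-memory completion buffer.
     */
    int io_uring_submit_batch(int ring_fd,
                              const struct io_uring_req *reqs, unsigned nr_reqs,
                              struct io_uring_completion *completions,
                              unsigned *nr_completed);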