
feature requests: submit requests from any thread #109

Closed
tmds opened this issue Apr 14, 2020 · 22 comments
@tmds

tmds commented Apr 14, 2020

When working with io_uring from a stack that has a user-space thread pool (like .NET) it would be interesting to be able to submit requests from any thread.

The shared memory for requests makes it challenging to synchronize access between multiple threads that want to use the same io_uring instance.

Consider adding a system call that accepts a list of requests, and completions.
When the requests can be completed without blocking, the completions are filled in. If the requests would block, the requests are added to the io_uring, and will complete asynchronously and can be retrieved from the shared memory completion buffer.

@tmds
Author

tmds commented Apr 14, 2020

cc @axboe @benaadams @davidfowl

@tmds
Author

tmds commented Apr 14, 2020

cc @antonfirsov @karelz

@paxostv

paxostv commented Apr 14, 2020

I can add some color here as I've made my uring interface multi-threaded.

Maintaining a thread-safe free-list of SQE indexes is no problem. The imperfect part comes when submitting the SQEs. A client always submits the number of SQEs it prepared, but the actual SQEs it submits may be different from the ones it prepared if another thread has published its SQE indexes but not yet called submit. In practice this seems to work fine, because the threads eventually submit each other's SQEs.

I don't know if there are any implications to submitting a subset of SQEs that are linked. For example, thread (1) wants to submit SQEs A and B, which are linked, and thread (2) wants to submit C. If (1) publishes first, but (2) publishes and then submits, only A might be submitted before (1) gets around to submitting B and C.

Some improvement here would be nice, but it may also be a non-issue if preparing SQEs is always immediately followed by submitting them. (I don't hold a lock around these two steps, to avoid holding the lock across a syscall.)

@tmds
Author

tmds commented Apr 16, 2020

@paxostv as you point out, it's possible to do, but there are some challenges: ensuring different threads don't try to use the same sqe, ensuring no partial sqe sequences get submitted, and perhaps also dealing with the sqe list being full?

Being able to pass sqe list as an argument doesn't require this synchronization.
And if the call can also return completions when they happen synchronously, the thread making the call can consume that information and continue its business.

Ideally, the application is written so only a single thread is submitting sqes to an io_uring. That is also what liburing assumes. We're not in that ideal situation. For our multi-threaded use-case I think this syscall would be useful.

@axboe
Owner

axboe commented Apr 16, 2020

Not sure I like the idea of a separate system call, as I don't think it'll solve anything. You're still going to need the ring to actually submit them, and the ring is locked internally. This means that even if you're not using the SQ ring at all, two threads could still contend and block if they tried to do that at the same time.

So let me ask, why aren't you just using a ring per thread in your thread pool? Most users of io_uring won't be using thread pools at all, since one of the goals was to not need them, as blocking processing is done internally anyway. You can share the async backend between rings for all your threads, so the overhead should be fairly minimal. And then you don't need to share a ring at all.

Maybe I'm missing something and there are specific reasons why you are sharing a ring (or multiple rings?) between threads in a thread pool?

@tmds
Author

tmds commented Apr 16, 2020

So let me ask, why aren't you just using a ring per thread in your thread pool?

Good question!

There are a number of reasons, this is the most important one:

It's allowed to queue work to the ThreadPool that blocks. The ThreadPool will add additional threads to handle work when it detects that.

You can also block until an async operation completes. So if that async operation relies on the thread being able to reach io_uring handling, you're in a deadlock.

The challenge comes from trying to fit io_uring into the .NET model, which is different from the preferred io_uring model.

@paxostv

paxostv commented Apr 16, 2020

It's definitely possible. Why not? A couple reasons come to mind.

  1. The assumption against using a thread-pool is that the application can control all I/O, which isn't necessarily true. In my case that holds for network sockets and local file access, but it's not uncommon to use a third-party library for other forms of storage (e.g. rocksdb).
  2. There'd be many instances. For cases where blocking is unexpected (small critical sections, no blocking I/O), I keep two request processing threads per HW thread, which can be 64+ for larger server instances. Even maintaining per-CPU I/O rings would still require multi-threading.
  3. Some indirection is necessary when collecting CQEs from multiple rings. Either a thread needs to be dedicated to waiting for completions on every io_uring instance, or a single thread needs an epoll mechanism to determine which ring to process.

There are alternative models, but honestly, keeping a free list of SQE indexes, where each CPU can cache a subset, and then calling into a single ring isn't hard to implement. There's just an "iffy" (but correct) factor for the reason I stated above.

If there's better ways to go about this, I'd be happy to explore.

@tmds
Author

tmds commented Apr 17, 2020

There are a number of reasons, this is the most important one:

The general reason is that we look at adopting io_uring in the framework so it works with existing code and applications.

If we were building a framework from scratch, it could be centered on the async model of io_uring, but that is not what we're doing.

@Matthias247

We just had some discussions around this topic in the tokio discord chatroom. The tokio runtime works similar to the .NET runtime: User-code can be executed on a multithreaded threadpool, and is generally not concerned about on which thread it runs. This means IO handles (like sockets/files) could be used from more than 1 thread during their lifetime. In addition to this they could be migrated between threads even while IO calls are active, due to the utilization of a work-stealing user-space scheduler. We are now asking ourselves how to integrate io_uring into such a model.

So let me ask, why aren't you just using a ring per thread in your thread pool? Most users of io_uring won't be using thread pools at all, since one of the goal was not to need them as blocking processing will be done internally anyway.

Would it be allowed to perform each IO operation from a different ring - and e.g. always use the thread-local submission queue and ring to submit the request? That could be an option. If the first IO operation however somehow "binds" the IO handle to that ring it might not work out.

If we can submit IO operations on different rings, are there any restrictions around it? One thing I could think of is whether all operations submitted on the previous ring must have been completed. We might run into situations where the handle is migrated even though an operation is still in-flight, and where the next operation is aimed to be submitted from a new thread.

@tmds
Author

tmds commented Apr 28, 2020

@Matthias247 .NET behaves similarly. Adding some more specifics, which may be the same in tokio:

In .NET all on-going operations can be cancelled by closing the handle (for example when calling Socket.Dispose). This requires tracking all io_urings where the handle is used.

Does tokio allow users to block the threadpool threads? And can users wait on operations that are handled by io_uring?

Blocking threadpool threads is allowed in .NET. This includes waiting for operations that would be handled by io_uring.

@Matthias247

In .NET all on-going operations can be cancelled by closing the handle (for example when calling Socket.Dispose). This requires tracking all io_urings where the handle is used.

Yeah, there will be a similar challenge for this in tokio too. Destructors for IO operations could be called at any time - there are no guaranteed run-to-completion semantics. And we need to at least make them safe. There are 2 strategies for that:

  1. Block the current thread inside the destructor until the IO operation finishes
  2. Mark the operation as cancelled but still let it continue to run. This requires users to pass the ownership of buffers to the runtime, so that those can still be safely utilized while the background operation continues to run.

Does tokio allow users to block the threadpool threads?

It has a single-threaded and a multi-threaded mode. In the multi-threaded mode we could theoretically block the thread-pool (e.g. in order to wait for a cancelled operation to complete - which was probably your background question). However it comes with a variety of gotchas. One is that in applications with lots of cancellations - e.g. a webserver which terminates lots of keep-alive connections through timeouts - we would block the threadpool very often and ruin performance.

Blocking for an operation to complete would however require that the ring runs outside of the current thread - or at least that it is "migrated away" before entering the blocking mode - otherwise the thread would deadlock by waiting on itself.

So this mode favors running "uring threads" outside of the normal threadpool executor threads.

And can users wait on operations that are handled by io_uring?

They would be able to asynchronously await the operations. Theoretically users could also

// Block the current thread until a Future resolves
let result = block_on(async {
    io_future.await;
});

but that is not necessarily something that needs to be supported.

@carllerche

carllerche commented May 9, 2020

For context, I'm another maintainer of Tokio. As mentioned, we have a work-stealing based runtime where sockets (and other resources) may migrate across threads. To be explicit, I'm trying to discuss the general "work-stealing" pattern and not design something specific for Tokio.

Thinking more about the problem, the goal we have isn't so much to be able to submit from multiple threads. We could have a ring per thread. However, there is a bunch of functionality that is per ring. For example, buffer pools, registering FDs, etc... Also, we don't really care if the same thread that submits an event gets the completion.

If, for example, there were a way to create one ring per thread but then associate these rings in a "cluster" of some kind, this would be sufficient. In that case, if an FD, buffer, etc. is registered with any ring in the cluster, it is registered with all rings of the cluster. Additionally, completion events could be made available in any of the rings in the cluster.

@axboe would something like that be a plausible direction?

@yxhuvud

yxhuvud commented Jul 24, 2020

You can share the async backend between rings for all your threads, so the overhead should be fairly minimal.

@axboe How do you do this? I can find nothing in the man pages or tests explaining or showing how to do this.

@axboe
Owner

axboe commented Jul 24, 2020

@yxhuvud I need to get that documented...

But basically, you set up the first ring as usual. Then for subsequent rings where you want to share the async backend, you set io_uring_params->wq_fd to the first ring's file descriptor, and ensure that you set IORING_SETUP_ATTACH_WQ in the setup flags as well.

That's it.

@axboe
Owner

axboe commented Jul 24, 2020

@lukehsiao

@axboe, just to make sure I understand correctly, when you use IORING_SETUP_ATTACH_WQ, the rings share the async backend, but the completions are still sent to their respective rings, correct? That is, the interface that the user-space application sees is the same, though the backend is shared?

@axboe
Owner

axboe commented Jul 30, 2020

That is, the interface that the user-space application sees is the same, though the backend is shared?

Correct, it behaves the exact same way.

@tmds
Author

tmds commented Nov 6, 2020

I don't know if there have been developments in this space.

Supporting submission of operations from arbitrary threads would be a nice way to adopt io_uring in existing applications and frameworks.

Consider:

int rv = recvmsg(...);
if (rv == -1 && errno == EAGAIN)
{
   epoll_ctl(...);
}

These two system calls could be a single call against io_uring.
Though it doesn't leverage the full potential of io_uring, it will improve performance and can be applied to existing frameworks which were not designed for the io_uring model.
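As a sketch of that collapse, assuming liburing is available (kernel 5.1+) and using a socketpair purely for demonstration, the recvmsg-or-EAGAIN-then-epoll_ctl dance becomes a single SQE whose result arrives as a CQE:

```c
#include <liburing.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return 1;
    send(sv[1], "hi", 2, 0);              /* make data available */

    struct io_uring ring;
    if (io_uring_queue_init(8, &ring, 0) < 0)
        return 1;                          /* kernel without io_uring */

    char buf[16];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1 };

    /* One SQE replaces the recvmsg() attempt plus the epoll_ctl() rearm. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_recvmsg(sqe, sv[0], &msg, 0);
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("recvmsg res=%d\n", cqe->res);  /* bytes received, or -errno */
    io_uring_cqe_seen(&ring, cqe);
    io_uring_queue_exit(&ring);
    return 0;
}
```

If the socket had no data yet, nothing extra is needed: the operation simply completes later, and the CQE shows up whenever the completion loop next reaps the ring.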

@beef9999

beef9999 commented Dec 28, 2021

That's just why most old threading-based programs need to be restructured, rather than waiting for io_uring to compromise.

For example, we started writing our project in the last few years with modern C++, and it is based on coroutines per thread. We deem it the sanest approach to create a ring per thread and concurrently submit to it from its coroutines.

@maierlars

Thanks for all the contributions! I found this discussion very enlightening. In a single-threaded application the io-ring is pretty easy and straightforward to use, which is a great achievement.

However, in the multi-threaded case, one thing is not entirely clear to me:
Suppose I use one io-ring per thread. If one of the threads now starts a computationally expensive task or is blocked for other reasons (third-party library), its io operations will starve in completion, although there are other threads that could easily handle them. I don't see a nice solution to that right now.

If the application makes sure to prevent races, is it ok for different threads to access the same io-ring? Is it okay for one thread to poll for completions while another thread submits operations?

@isilence
Collaborator

isilence commented May 1, 2022

If the application makes sure to prevent races, is it ok for different threads to access the same io-ring?

It's fine; the problem is when apps modify/access the SQ and/or CQ (including tail/head pointers) in parallel.

Is it okay for one thread to poll for completions, while another thread submits operations?

Yes.

In both cases there might be minor performance implications because of internal implementation details.

@axboe
Owner

axboe commented Jun 25, 2022

I'm closing this one. One of the core tenets of io_uring is that you should not share a ring if you can avoid it. Either implement the sharing so that you have a single thread submitting, or have a ring per thread.
