asyncio: Added async i/o APIs. #10605
Conversation
Any specific reason why a callback-based system was chosen? An alternative would be a poll-based interface. With a callback-based system, this gets complex (maintain a multithread-safe hash table for all operations, with the callback setting the value).
This is one of those "you can't please everyone" things. I started with a poll interface and was asked to change it to callbacks. You could just have an array of bools and have the callback set each one to true as pieces finish loading, and each frame check for true instead of DONE...or something like that.
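For illustration, that workaround might look something like this (a sketch only; the callback signature here is hypothetical and not the draft API's actual one):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_ASSETS 16
static atomic_bool finished[NUM_ASSETS];   /* one flag per asset being loaded */

/* Hypothetical callback shape: runs on an i/o thread when a load completes. */
static void io_done_callback(void *userdata)
{
    atomic_store(&finished[(int) (intptr_t) userdata], true);
}

/* Each frame, instead of polling individual tasks for DONE: */
static void check_assets_each_frame(void)
{
    for (int i = 0; i < NUM_ASSETS; i++) {
        if (atomic_load(&finished[i])) {
            /* asset i has finished loading and is safe to use */
        }
    }
}
```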
This generally looks good. Is there going to be WriteFileAsync()? What's the interaction with storage?
Looks pretty good. I do have some input:
While this is probably fine heuristically, I would probably add a hard maximum of 64. There's a lot of people gaming on old workstations that would exceed that cap and be unlikely to benefit from more in the typical case of I/O on a single drive.
this is fine. a note in the documentation that callbacks should try to be brief and forward heavy processing to other threads may be warranted, because many assets do need reasonably heavy processing.
don't disagree, but it might be more straightforward to start with an implementation that can portably share logic. A relatively thin shim over pread/pwrite on the Unix-likes and ReadFile/WriteFile on Win32 can enable a single implementation with better characteristics than the generic implementation. Emscripten would benefit most from its own implementation; IIRC all the libc synchronous I/O actually gets proxied to the main runtime thread in order to simplify handling the promises from the underlying JS interfaces.
Android doesn't have a thing for this. Mac and iOS have dispatch_io, which just doesn't mesh well with this interface (in particular there is no way to supply a read buffer), and at least when it was new it wasn't any better for regular files than a simpler threadpool anyway.
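The shim in question might look roughly like this (a sketch, assuming a positional read is the primitive we want; error handling trimmed):

```c
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
/* Positional read: a non-NULL OVERLAPPED supplies the offset even on a handle
   opened for synchronous access. Reading past EOF reports failure with
   ERROR_HANDLE_EOF, which a fuller shim would treat as a short read. */
static int64_t read_at(HANDLE file, void *buf, uint32_t len, uint64_t offset)
{
    OVERLAPPED ov;
    ZeroMemory(&ov, sizeof (ov));
    ov.Offset = (DWORD) (offset & 0xFFFFFFFFu);
    ov.OffsetHigh = (DWORD) (offset >> 32);
    DWORD got = 0;
    if (!ReadFile(file, buf, (DWORD) len, &got, &ov)) {
        return -1;
    }
    return (int64_t) got;
}
#else
#include <unistd.h>
/* Positional read via pread(); it does not move the file offset, so many
   threads can safely read the same file descriptor at different offsets. */
static int64_t read_at(int fd, void *buf, uint32_t len, uint64_t offset)
{
    return (int64_t) pread(fd, buf, (size_t) len, (off_t) offset);
}
#endif
```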
Crazy thought... do you actually want that many I/O threads? Maybe we should start with one and add a hint that lets applications tune that?
Having a fixed number of not-enough threads makes the system unusable for many real projects, and having a few too many idle threads is always cheaper than creating them on-demand. This is what POSIX AIO implementations generally got wrong and why nobody really uses them for heavy I/O (except for FreeBSD's implementation, I guess). Tunability as a range, with a heuristic selection within that range, might be appropriate. All of the factors that would determine the optimal number of threads for an I/O-busy program are not really statically knowable, but a user could profile their particular I/O pattern across a good variety of machines and come up with a range that might be better than just relying on the heuristic alone.
LoadFileAsync is useful because it has to open the file, figure out how large the file is, allocate memory for it, and read into it. WriteFileAsync would only need to open and write an existing buffer, so it's only saving one step. But we could definitely add it! Storage just needs an equivalent of LoadFileAsync. On the built-in implementation, it literally will just call SDL_LoadFileAsync. Current thinking was not to implement positional reads/writes at the storage level, but I'm still trying to understand what an SDL_Storage interface does when you have a file like Spider-Man 2's New York map, which is presumably dozens of gigabytes.
Right now it spins threads on-demand if there aren't any idle ones, up to a maximum count, and ones that are idle for more than 30 seconds will start terminating until only one is left (we always keep one around so there isn't a failure case when no threads can start). Until you use async i/o, it won't initialize anything, so it doesn't have a resource cost at all until you touch it. The idea being that you likely have a flood of startup loading and then there's not a lot of value in keeping idle threads around. There's a FIXME to add a hint, and we definitely should tune the defaults in any case.
Okay, single-threaded Emscripten support is in, so this is Good Enough to ship and do improvements with system-specific APIs later...but do we really want this callback API? It forces people to deal with locking and worries that they'll clog the thread pool if they don't move quickly enough. Maybe it would be better if we give them something like:

SDL_GetAsyncResults(&details)

Internally we keep a queue of completed tasks, and each call to this returns the next completed one with all the details. You don't poll specific tasks, you just get the next one in the completed list. The call doesn't block, and returns false if nothing is ready. This dramatically simplifies SDL's code and the mental load an app needs to manage. The app can still toss work to a background thread to process the loaded data if they want...but if they just didn't want to block on i/o, they can otherwise still benefit in a single-threaded app.
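In an app's main loop, that proposal would look roughly like this (a sketch; SDL_GetAsyncResults is only the name proposed above, and the details struct's name and fields are purely illustrative):

```c
/* Hypothetical names throughout: this API was only a proposal at this point. */
SDL_AsyncResultDetails details;

while (SDL_GetAsyncResults(&details)) {   /* non-blocking; false == nothing finished yet */
    if (details.error) {
        SDL_Log("async load for %p failed", details.userdata);
    } else {
        SDL_Log("async load for %p finished: %d bytes",
                details.userdata, (int) details.bytes_transferred);
    }
}
```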
Sure, I buy it, let's do that instead.
The one thing that would be nice is a way to wait for multiple I/O operations to complete: WaitForCompletion(array of operations)
You probably want a separate handle for each request. For example if a library does async loading it wants to get its own data separate from the app.
Allowing for separate completion queues (probably specified when creating an async file object, with NULL referring to some default completion queue) would probably be the most useful form of this approach, since different threads would have different interests. That also conveniently translates very directly to how IOCP and io_uring both deal with completions. Should also be convenient for tying I/O ops directly to different asset pipelines.
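In other words, something along these lines (hypothetical signatures sketching the suggestion; in particular, the queue argument to the open call is not part of this PR):

```c
/* Hypothetical: the completion queue is chosen when the async file object is
   created; NULL would mean "use a shared default queue". */
SDL_AsyncIOQueue *pipeline_queue = SDL_CreateAsyncIOQueue();
SDL_AsyncIO *file = SDL_AsyncIOFromFile("models.bin", "r", pipeline_queue);

/* A middleware library doing its own async loading keeps its completions
   separate from the app's simply by using its own queue. */
SDL_AsyncIOQueue *library_queue = SDL_CreateAsyncIOQueue();
SDL_AsyncIO *libfile = SDL_AsyncIOFromFile("libdata.bin", "r", library_queue);
```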
I think allowing separate queues of I/O operations, and then enumerating within each queue with a non-blocking "next completed result" call, would work well.
d04bd1b to 5e312e7
Okay, this now has queryable tasks and "task groups" and the background thread callbacks are gone. I'm super happy with how this came out. Here's the category documentation from the new version, which gives the basic idea:

CategoryIOAsync

SDL offers a way to perform i/o asynchronously. This allows an app to read or write files without waiting for data to actually transfer; the functions that request i/o never block while the request is fulfilled. Instead, the data moves in the background and the app can check for results at their leisure. This is more complicated than just reading and writing files in a synchronous way, but it can allow for more efficiency, and never having framerate drops as the hard drive catches up, etc. The general usage pattern for async i/o is:
If you want to track several tasks at once, regardless of the order in which they complete:
This all works, without blocking, in a single thread, but one can also use a task group in a background thread and sleep until new results have arrived:
And, of course, to match the synchronous SDL_LoadFile, we offer SDL_LoadFileAsync as a convenience function. This will handle allocating a buffer, slurping in the file data, and null-terminating it; you still get a task handle to check later or add to a task group.
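As a rough illustration of the pattern being described, here is a sketch using names close to the SDL_AsyncIO API that eventually shipped in SDL3 (SDL_CreateAsyncIOQueue, SDL_GetAsyncIOResult, and SDL_AsyncIOOutcome are approximations; the draft at this point spoke of "tasks" and "task groups" rather than queues):

```c
/* Sketch only; names approximate the shipped SDL3 API, not this exact draft. */
#include <SDL3/SDL.h>

static void load_level_assets(void)
{
    SDL_AsyncIOQueue *queue = SDL_CreateAsyncIOQueue();   /* the "task group" */
    if (!queue) {
        return;
    }

    /* Neither of these calls blocks; the reads happen in the background. */
    SDL_LoadFileAsync("terrain.bin", queue, (void *) "terrain");
    SDL_LoadFileAsync("models.bin", queue, (void *) "models");

    /* Each frame (or from a loader thread), drain whatever has completed. */
    SDL_AsyncIOOutcome outcome;
    while (SDL_GetAsyncIOResult(queue, &outcome)) {
        if (outcome.result == SDL_ASYNCIO_COMPLETE) {
            SDL_Log("%s: %llu bytes loaded", (const char *) outcome.userdata,
                    (unsigned long long) outcome.bytes_transferred);
            SDL_free(outcome.buffer);   /* SDL_LoadFileAsync allocated this */
        }
    }

    /* A real loader would keep polling (or wait on the queue) until every
       request has reported a result before destroying the queue. */
    SDL_DestroyAsyncIOQueue(queue);
}
```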
5e312e7 to 320a93f
Nice!
320a93f to 78bc0d2
Hi - I was sent this direction because I've been peripherally involved in SDL_Gpu (doing reviews focused on performance and answering questions about PCI and DMA). I happen to build storage APIs professionally though, so looking at this pull request is far more up my alley. I had a really long review I was working on, but the recent changes you've made addressed about 80% of what I was going to say, so this will now be much shorter (but still probably long). I think for now I'll stick to just public API feedback, and maybe I'll post in a few separate posts.

First, I do not think it is possible to implement the API calls that do not require a task group efficiently on Linux. On Linux, your asynchronous I/O choices are either Linux AIO (not to be confused with POSIX AIO) or io_uring. Both of these APIs require you to first create some per-thread context (e.g. a task group) in order to submit I/O. Creating these on demand per-file is expensive and may run into OS-imposed limits on the number of io_urings or aio_contexts available. I strongly suggest that you only expose the APIs that require a task group - it's the only way to get true asynchronous I/O. I've personally made a similar mistake with APIs released 10+ years ago and it was really painful to find my way out of the corner I painted myself into.

Second, opening and closing the file should be asynchronous operations too (and they should be submitted to the task group). These can actually secretly block a long time in some cases. Of course, on many operating systems there isn't an asynchronous open/close option at all, and the implementations there will be blocking. But at least on Linux with io_uring you can asynchronously open and close files, so the API should allow for the best behavior rather than forcing the worst.

More review to follow.
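For context, "create some per-thread context" means something like the following with liburing (a sketch, not code from this PR); the setup cost and the kernel-imposed limits are why doing this per-file or per-request is a problem:

```c
#include <liburing.h>
#include <stdbool.h>

/* Hypothetical struct standing in for this PR's "task group". */
struct TaskGroup {
    struct io_uring ring;
    bool have_ring;
};

static bool TaskGroup_Init(struct TaskGroup *group)
{
    /* One ring per group: the allocation and syscalls happen once here,
       not once per file or per i/o request. */
    group->have_ring = (io_uring_queue_init(64 /* queue depth */, &group->ring, 0) == 0);
    return group->have_ring;   /* false -> fall back to a threadpool backend */
}

static void TaskGroup_Quit(struct TaskGroup *group)
{
    if (group->have_ring) {
        io_uring_queue_exit(&group->ring);
    }
}
```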
(Sorry, I just squashed this down to a single commit right before you posted that; it's all the same stuff, just in one commit.) I'm skeptical that async opens are worth the API complexity, but I can totally see a close operation blocking while cached writes make their way to disk. I'm nervous about adding the complexity, but let me think on this a little tonight.

As for tasks vs task groups: presumably with io_uring (and Win32 i/o completion ports), we'll have a single internal thread, mostly asleep, that lives just to move data from the OS's single context/completion port into our own task groups. If that's not a terrible idea, then it doesn't matter if the app wants to query a single task or a task group; it's just a question of whether they look at a struct directly or shift the head of a linked list of structs before looking at it.
(Also, more review is totally welcome at any length you're willing to write, because I'm 100% winging it over here like I know what I'm doing.)
The shift to support "task groups" is an excellent and necessary change. However, I do still have a few pieces of feedback on the task group design.
(More coming)
Using a thread is, unequivocally, a terrible idea (sorry, that sounds really harsh and I promise I don't mean it to be, but I've been at this a long time and everyone wants to use a thread). The hardware (and I've helped design NVMe from close to the beginning) wants you to create a separate "queue" per thread. The hardware accepts command submissions in parallel (and they're all asynchronous too!). And the operating systems have spent the last decade re-designing themselves to do parallel, asynchronous I/O internally to match the hardware. You are supposed to create a per-thread "context" (io_uring, aio_context, ioRing, iocp, whatever) at thread start-up time, and use that exclusively for the I/O operations you perform on that thread. This will allow for parallel, lockless I/O submission and provide the lowest latency.
I have noticed file open latency before. I've never seen it be significant for a single file, and the many-file use case is probably covered adequately enough by SDL_LoadFileAsync and perhaps similar composite operations. I'd agree it's probably not worth the complexity.
An async sync-and-close would be useful for save files. System APIs typically will not wait for cached writes to hit the disk before closing a file; having this behavior be explicit as a composite operation is useful.
Windows will block inside an Overlapped ReadFile when the underlying disk driver's queue is full. It's not that hard to hit this with random reads, because Windows disk drivers have shallow queues compared to, e.g., Linux. This is unfortunate because other pending reads to the same device might still succeed without blocking by hitting the filesystem cache, and when combined with other uses of Overlapped I/O (e.g. sockets) that should never block, this is even worse. It will also block on an Overlapped WriteFile for writes that extend a file, which is a pretty normal thing to happen. Because of this behavior, the typical procedure is to still use a threadpool to issue overlapped I/O. Probably a lesser concern, but Wine also doesn't attempt to provide asynchronous I/O to regular files at all.

As far as io_uring goes, I definitely agree with jaedan that this would be a suboptimal use of it, but I also think that the typical use of a per-thread ring is both needlessly resource-expensive (they are not lightweight objects) and a poor fit for games' logic. A ring wrapped in a mutex for each task group is probably the least bad approach here, although I definitely lack jaedan's level of experience with io_uring. Extending the API with batching operations would probably bring that kind of implementation closer to optimal (also an improvement for other implementations), but that can be a later enhancement IMO.
We could create a heavyweight async I/O context when a thread first uses the async API. This could be stored in TLS and be automatically cleaned up, so users get the benefit of that without having to explicitly track things themselves. It might be worth adding io_uring and IoRing implementations to validate that this API works efficiently with those interfaces.

The only thing that affects the ABI is the question of how asynchronous I/O interacts with the storage API, and it's becoming clear that it doesn't. This seems like it's intended as a low-level, high-performance interface to disk files. If that's true, then let's bump this out of the ABI milestone.
On Linux for many, many years (via Linux AIO) and on Windows for the last three, this is no longer the standard procedure. We shouldn't make the API inefficient on modern operating systems by design, but rather accept the inefficiency only for operating systems with poor asynchronous I/O support.

Let's imagine we have a system with a background thread that does the I/O processing. This thread may spin while there's I/O outstanding for efficiency, but it certainly can't just spin all the time. I/O in games is sporadic, so most commonly this thread will be asleep when we next start a burst of I/O. So we queue up the operation into the shared queue or ring and then signal the background thread to wake. On most operating systems, thread scheduling latency is only at about 1ms granularity, and even if it reacts instantly, just the mechanics of restoring thread state to wake it takes a significant chunk of time. All of this is added to our I/O request's latency. But the underlying SSD will most likely complete the request in 20-100 microseconds. Why would we 10x our latency because of software design? It makes no sense.
I really can't disagree more on both points. These objects are not cheap to create, because they usually do memory allocations and system calls, but actual resource use is very minimal. Usually it's 64 bytes * number of queue entries, and you make one of these per thread that wants to perform I/O. If you have true asynchronous I/O, then you only need a very limited number of threads in your game that are doing asset loading. We're talking kilobytes of total resource use here.

And regarding a poor fit for game logic, a ring buffer for command submission is a perfect fit for game logic. On each tick you enqueue all your I/O operations. At end of tick you submit and check for completions. It cannot be more ideal. It also happens to be exactly how the GPU API works too, and it would be how an optimal networking API would function as well.
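That per-tick pattern, sketched with liburing (illustrative only; nothing here is code from the PR):

```c
#include <liburing.h>

/* During the tick: queue any reads the game has decided it needs. */
static void queue_read(struct io_uring *ring, int fd, void *buf,
                       unsigned len, unsigned long long offset, void *userdata)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    if (sqe) {                                  /* NULL if the submission queue is full */
        io_uring_prep_read(sqe, fd, buf, len, offset);
        io_uring_sqe_set_data(sqe, userdata);
    }
}

/* At end of tick: submit the whole batch, then reap whatever already finished. */
static void end_of_tick(struct io_uring *ring)
{
    io_uring_submit(ring);                      /* one syscall for all queued operations */

    struct io_uring_cqe *cqe;
    while (io_uring_peek_cqe(ring, &cqe) == 0) {    /* non-blocking */
        void *userdata = io_uring_cqe_get_data(cqe);
        int result = cqe->res;                  /* bytes transferred, or -errno */
        (void) userdata; (void) result;         /* hand off to the asset pipeline here */
        io_uring_cqe_seen(ring, cqe);
    }
}
```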
Bumping this to 3.2.0, since the ABI lock is about to happen (we can add new stuff after ABI lock).
yeah, basically the entire liburing API is thread-unsafe except for the actual io_uring_enter syscalls. I'm not even 100% sure that it's safe to submit from one thread and reap from another, but it seems to work fine in practice.
that's the main advantage. But yeah the syscall overhead probably isn't significant enough that it would turn up when profiling.
I think the io_uring thread issues are settled in 7d4fdab. I put a separate mutex on the submission and completion queues, and the completion side copies the data out and unlocks as fast as possible...so any serious contention probably won't be an issue in any reasonable case. The wait() call just uses io_uring's API to block until there are new results, but then it grabs a mutex and uses the peek codepath to make sure that only one thread gets a valid event. If the kernel wakes up multiple waiters when there's only one new event, the existing API allows for this, so I think this part is solid now.
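A simplified sketch of the locking scheme described there (SDL's mutex functions are real; the rest is illustrative and not the PR's actual code):

```c
#include <SDL3/SDL.h>
#include <liburing.h>
#include <stdbool.h>

/* Assume these were created when the queue/"task group" was created. */
static struct io_uring ring;
static SDL_Mutex *cq_lock;   /* guards the completion-queue side */

/* Block until something completes, then consume it under the lock via the
   non-blocking peek path, so only one waiter gets each event even if the
   kernel wakes several of them. */
static bool wait_one(struct io_uring_cqe *out)
{
    struct io_uring_cqe *cqe = NULL;
    if (io_uring_wait_cqe(&ring, &cqe) < 0) {
        return false;
    }

    bool got = false;
    SDL_LockMutex(cq_lock);
    if (io_uring_peek_cqe(&ring, &cqe) == 0) {
        *out = *cqe;                   /* copy out and unlock as fast as possible */
        io_uring_cqe_seen(&ring, cqe);
        got = true;
    }
    SDL_UnlockMutex(cq_lock);
    return got;   /* false: another waiter won the race; callers can retry */
}
```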
I'm wondering if we should just change...

extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio, SDL_AsyncIOQueue *queue, void *userdata);

...to have a flag that flushes pending writes to disk before closing. We could also (or instead) add this to SDL_WriteAsyncIO. Other options are: always implicitly flush on close (easy for everyone, but less efficient for use cases that could skip it), add an explicit flush operation that the app is responsible for queueing correctly, or add a way to specify that some operations need to serialize behind the scenes.
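Concretely, the first option would amount to something like this (a sketch of the declaration change only; the parameter name is illustrative):

```c
/* Current shape, per the comment above: */
extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio,
                                                               SDL_AsyncIOQueue *queue,
                                                               void *userdata);

/* With an optional flush (fsync/fdatasync-style) before the close: */
extern SDL_DECLSPEC SDL_AsyncIOTask * SDLCALL SDL_CloseAsyncIO(SDL_AsyncIO *asyncio,
                                                               bool flush,
                                                               SDL_AsyncIOQueue *queue,
                                                               void *userdata);
```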
d8adf13 to 269502a
This seems like a good idea.
that is perfectly adequate.
this would be a little more questionable. I suppose it's possible some user may want to sync a rarely written file to disk but also keep it open for the purposes of reading, but this is not a viable strategy for keeping the disk coherent if doing parallel writes to different portions of the file, because fsync() will sync incomplete writes and also because a later write may complete before an earlier write. Getting this right basically requires the implementation to serialize writes whenever a synchronizing write is queued, which is why databases etc. tend to basically implement their own disk cache in-process and do direct I/O (which is inherently synchronized with the disk) instead.
probably harmless, but maybe someone would object.
hairy for the same reasons as synchronizing writes.
I've never seen an async framework provide chained operations without some sort of hairiness either.
Okay, optional flush/fdatasync on close is in, 52c58b9.

Unrelated: a pending TODO item is to remove SDL_AsyncIOTask from the public headers and just return bool instead of a task handle, and this makes sense with the current API. But code is already implemented in both backends to attempt to cancel in-flight tasks...it's not currently used because the public API doesn't offer it. If we want to add it, we'll need the task handle in the public API. If we don't, we can lose the task handles and delete some code from the backends. But if we don't offer it now, we can't add it later without adding replacement APIs that return a handle. So which way do we want to go on this?

Canceling a task isn't guaranteed; if it hasn't started, both our threadpool and io_uring can cancel it, but otherwise they'll just have to wait for results once the work has gotten underway. For files, this usually isn't a big deal, because presumably everything will finish in finite time. But if we ever open this up for something that can block indefinitely (like socket i/o), cancellation might become super-important. I'm 100% okay with this interface only ever being used on files, though.
my super-opinionated and possibly unpopular take is that async I/O cancellability still doesn't really make sense for sockets except for automatically at program shutdown and maybe when closing the socket. If the connection is broken, the request in progress should error out. UDP sendto is essentially bounded time, but UDP recvfrom might warrant cancellability.

It's also worth pointing out that while async I/O for sockets is the gold standard in scalability, if you're writing netcode for an FPS or something you definitely want kqueue/epoll/etc and non-blocking I/O over a threadpool, so the implementation should special-case sockets if it wants to be broadly useful for games. There are a few benchmarks that strongly suggest this is true for io_uring too (epoll methods being dramatically faster than io_uring for sockets at modest scales), but I do know some work has gone into improving this over the past couple years. A simple single-threaded reactor model is just generally superior to the model we need for file I/O, especially in terms of latency, at small connection counts. (Note that while it seems to have been largely ignored, Windows 10 got a more epoll-like socket readiness notification interface near the end of its cycle too.)
I have mixed feelings on this. On the one hand it would be really nice to simplify the API. On the other hand I can totally see someone trying to cancel all pending I/O on shutdown or out of memory or something.
Also, is it time for an IORing implementation? :D
Windows IoRing support is in, as of 26b4bf9. If you aren't on Windows 11, it falls back to the threadpool, but it's noticeably faster on this Win11 VirtualBox VM than the threadpool version!
Removing SDL_AsyncIOTask cleans it up nicely, but does it make sense to add a function to cancel all the I/O on a queue?
I was thinking about that as an option. One still needs to wait for the completion, though, to free buffers that are in-flight, so all it gains you is that maybe some tasks will finish faster, or maybe a little less disk i/o gets started. We could have it as a flag to SDL_DestroyAsyncIOQueue, to say "I'm going down and I don't care if this stuff finishes or not, get me out of here asap."
Hey, @madebr: I'm getting test failures like this:
But running |
Actually, this doesn't need a flag. Since you can't get the results without the queue, it's reasonable to cancel everything by default to make the queue destruction finish more quickly; the app has to assume all of it failed anyway, since any of it could fail without warning. |
...of course, we don't actually have a list of pending tasks for the queue to cancel them, so screw it, I'm leaving it alone. I'll probably remove the cancel stuff from the implementation once this merges, so we can reasonably find it in revision control if we want to restore it later.
This is still in progress, but I've been heads-down on this for a while, so I'm putting it in a draft PR.
This implements an async i/o API, and a "generic" backend that implements it with a pool of SDL threads that can block on synchronous SDL_IOStreams. This is likely "good enough" for most things, and it's usable on everything but single-threaded Emscripten builds.
The downsides to the generic backend:

- It only ever spins up a limited number of threads ((SDL_GetCPUCount * 2) + 1 at most; see the sketch at the end of this description), and it uses the same thread pool that is used for notifying apps that i/o is complete, which is to say if the app uses that callback for heavy processing, until the callback returns, it won't be available for doing more i/o. In practice, you'd have to have lots of reads/writes that also have very heavy callbacks before you'd see bottlenecks, I think, but system-specific APIs can just get all the i/o cooking separately regardless of callback weight. (EDIT: this isn't true now. The generic backend uses up to 8 threads for i/o, but they are only limited by disk bandwidth, and no longer have to worry about the app hogging them in a callback.)
- It can't ever work for single-threaded Emscripten, because it needs threads. (EDIT: it works single-threaded in Emscripten now...it's just synchronous, because Emscripten's filesystem API is synchronous and almost always just a memcpy from MEMFS anyhow.)

This would benefit from a Windows backend that uses Win32 I/O Completion Ports to solve these limitations. Linux can use io_uring if that's generally available at this point. Emscripten could use actual async file APIs and run stuff between mainloop iterations if single-threaded (EDIT: see above), or hand off from there to the usual thread pool if not. I'm sure Mac/iOS/Android have a thing, too, but the ABI wouldn't change, so we can deal with those later.

Fixes #1374.
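As a rough sketch of how that thread-count heuristic could combine with the hard cap and tuning hint suggested earlier in the thread (the hint name here is hypothetical, not one SDL defines):

```c
#include <SDL3/SDL.h>

/* Sketch only: "SDL_ASYNC_IO_MAX_THREADS" is a made-up hint name. */
static int GetMaxAsyncIOThreads(void)
{
    int count = (SDL_GetCPUCount() * 2) + 1;    /* heuristic from the description above */

    const char *hint = SDL_GetHint("SDL_ASYNC_IO_MAX_THREADS");
    if (hint && *hint) {
        count = SDL_atoi(hint);                 /* app override */
    }

    if (count < 1) {
        count = 1;      /* always keep at least one thread available */
    }
    if (count > 64) {
        count = 64;     /* hard cap suggested in review for huge core counts */
    }
    return count;
}
```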