Implement an ntci::Proactor using io_uring #24

Merged (17 commits) on Aug 18, 2023

Conversation

@mattrm456 (Contributor):

This PR adds initial support for an ntci::Proactor implemented using io_uring on Linux.

io_uring is an alternative system call interface that allows user space to initiate certain system calls in batches, to have them executed asynchronously, and to learn about completed system calls in batches, reducing the system call overhead that would otherwise be required (e.g. one system call per sendmsg, recvmsg, etc.). Userspace may learn about completed system calls without making a system call at all, or it can make a system call to block until a minimum number of previously submitted system calls complete. A more thorough description of io_uring can be found in its Introduction.

In the design of NTC, an ntci::Proactor is an abstraction of an interface to an operating system whereby a user "proactively" initiates socket operations (e.g. connect, accept, send, receive) and is asynchronously notified when those operations complete. This PR implements the ntci::Proactor abstract class using io_uring.

The io_uring interface to the kernel consists of three system calls: one to create a ring, one to configure/control a ring, and one to submit operations and reap completions. Userspace writes system call submissions to a ring buffer and then performs a system call to notify the kernel to drain the submission queue. The kernel writes the status of completed system calls to a separate ring buffer, which userspace then drains. Userspace may submit and reap completions simultaneously. The submission queue and completion queue ring buffers are memory mapped into userspace according to the ring buffer configuration specified by the kernel when the ring is created. Given this design, it is possible for userspace to utilize io_uring functionality with nothing but support for syscall and mmap from libc. This implementation leverages these characteristics so that it may be built on any Linux machine, even machines whose kernels do not support io_uring (such as older kernels). To achieve this, the implementation eschews both the linux/io_uring.h header and any support library, such as liburing.
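To make the "nothing but syscall and mmap" point concrete, here is a minimal sketch (not the PR's code) of creating a ring and mapping its submission queue. It includes <linux/io_uring.h> purely for brevity; the PR instead defines the equivalent structures manually so that it builds even where that header is absent.

#include <linux/io_uring.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main()
{
    const long k_SYSTEM_CALL_SETUP = 425;  // __NR_io_uring_setup

    struct io_uring_params params;
    std::memset(&params, 0, sizeof params);

    // Ask the kernel to create a ring with 128 submission queue entries.
    int ring = static_cast<int>(::syscall(k_SYSTEM_CALL_SETUP, 128, &params));
    if (ring < 0) {
        std::perror("io_uring_setup");  // e.g. ENOSYS on kernels without io_uring
        return 1;
    }

    // Map the submission queue ring buffer using the offsets reported by the
    // kernel in 'params'; the completion queue ring is mapped the same way.
    std::size_t sqSize =
        params.sq_off.array + params.sq_entries * sizeof(unsigned);

    void* sq = ::mmap(0,
                      sqSize,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE,
                      ring,
                      IORING_OFF_SQ_RING);
    if (sq == MAP_FAILED) {
        std::perror("mmap");
        ::close(ring);
        return 1;
    }

    ::munmap(sq, sqSize);
    ::close(ring);
    return 0;
}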

io_uring is conceptually simple but has many aspects that make it challenging to use correctly and efficiently:

  1. io_uring has evolved significantly since its first release, gaining many features over time. Not all kernel versions support all features. Some features can be detected with io_uring's capabilities system, but not all (e.g. support for certain flags when submitting operations). This implementation will probably work on Linux kernel versions >= 5.11 (it has only been tested on 6.2), but future work is necessary to support and test all functionality on older kernels. That is left for a future PR.

  2. The principal argument for io_uring seems to be the efficiency gained by submitting recvmsg and sendmsg in batches, avoiding the per-operation system call overhead. While batching recvmsg may make some sense, it is unclear why batching sendmsg calls makes sense: it trades increased latency and reduced throughput for greater system call efficiency. Moreover, an ntci::Proactor is designed to be safely used by multiple threads concurrently calling ntci::Proactor::send, ntci::Proactor::receive, and ntci::Proactor::wait. If a thread is blocked in wait for at least one event to complete, another thread calling send or receive must submit and enter the ring immediately; otherwise the kernel never learns about the submission and never initiates the operation, which then never completes, and the thread blocked in wait remains blocked forever. This implementation chooses to always submit sendmsg operations and immediately enter the ring, and only defers entering the ring for other operations when it can be safely determined that only one thread will be calling ntci::Proactor::wait and that thread is the same one calling ntci::Proactor::send, ntci::Proactor::receive, etc. (see the sketch after this list).

  3. io_uring has evolved support for a number of ways of timing out while waiting for operations to complete. Each is a bit clumsy to use and inconsistent in what it supports: userspace may submit a "timeout" operation with a relative timeout or (on some kernels) an absolute timeout, with (on some kernels) a specified clock type. Such "timers" need to be constantly removed or at least re-submitted. Newer kernels support specifying a timeout directly when entering the ring, but it is not clear whether absolute timeouts are or will be supported, or any clock type other than CLOCK_MONOTONIC.

  4. Very recent kernels support canceling all pending operations for a file descriptor, but older kernels do not. On older kernels, this implementation must maintain a data structure of all pending operations for a socket, so they can be cancelled when the user wishes to close the socket.

  5. It is unclear whether two threads can enter the ring simultaneously, where both are instructing the kernel to drain the submission queue. This implementation relies on that being safe, which should be reviewed or revisited in later improvements. If two threads cannot enter a ring simultaneously to submit, it is unclear how to use io_uring from multiple threads at all: the obvious (but perhaps naive) solution would be to maintain a thread-safe copy of the submission queue that sits in front of the ring's submission queue, where threads initiating operations perform a thread-safe push onto this queue, then "kick" the ring with a "no-op" operation, waking up a thread blocked entering the ring, so that thread can immediately re-enter the ring to actually submit the operations to the kernel. This seems contradictory to the design and goals of io_uring.
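The sketch below illustrates the submission policy described in item 2. The names are illustrative only (the PR's actual type is ntco::IoRingSubmissionMode, and the real decision involves more state than shown):

#include <iostream>

// Hypothetical illustration of the policy from item 2 above. 'e_IMMEDIATE'
// means the operation is written to the submission queue and the ring is
// entered right away; 'e_DEFERRED' means the submission stays queued until
// the calling thread itself next enters the ring (e.g. in wait()).
enum Mode { e_DEFERRED, e_IMMEDIATE };

Mode chooseSubmissionMode(bool isSendOperation, bool callerIsTheOnlyWaiter)
{
    // sendmsg is always flushed to the kernel immediately: another thread may
    // already be blocked in wait() and would otherwise never learn about this
    // submission.
    if (isSendOperation) {
        return e_IMMEDIATE;
    }

    // Other operations may be deferred only when the caller is known to be
    // the single thread that will call wait() on this proactor.
    return callerIsTheOnlyWaiter ? e_DEFERRED : e_IMMEDIATE;
}

int main()
{
    std::cout << chooseSubmissionMode(true,  true)  << '\n';  // 1: immediate
    std::cout << chooseSubmissionMode(false, true)  << '\n';  // 0: deferred
    std::cout << chooseSubmissionMode(false, false) << '\n';  // 1: immediate
    return 0;
}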

The implementation of ntco_ioring is lightly tested in its component test driver, but is folded into the NTC proactor framework and more thoroughly tested in the ntcf_system test driver.

For now, I consider support for io_uring in NTC to be experimental. As such, it is disabled by default in the build configuration. Builders may enable it by specifying configure --with-ioring during the configuration step of the build process. Further support is detected at runtime; if a machine does not support the io_uring system calls, the "ioring" proactor will be automatically disabled.
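One plausible way to express that runtime detection (an illustrative sketch, not necessarily how this PR implements it) is to probe the io_uring_setup system call and treat ENOSYS as "not supported":

#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>

// Return true if the running kernel exposes the io_uring system calls. A
// deliberately invalid call is made: a kernel that has io_uring fails it with
// an error other than ENOSYS (e.g. EFAULT), while a kernel without io_uring
// fails with ENOSYS.
bool systemSupportsIoRing()
{
    const long k_SYSTEM_CALL_SETUP = 425;  // __NR_io_uring_setup

    errno = 0;
    long rc = ::syscall(k_SYSTEM_CALL_SETUP, 0, static_cast<void*>(0));
    if (rc >= 0) {
        ::close(static_cast<int>(rc));  // unexpected success; clean up
        return true;
    }
    return errno != ENOSYS;
}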

Initial performance tests are underwhelming. The existing epoll backend out-performs this io_uring backend for most workloads, though a few do show io_uring winning on latency and throughput. Future work should be planned to optimize this implementation.

{
const long k_SYSTEM_CALL_SETUP = 425;

return static_cast<int>(::syscall(k_SYSTEM_CALL_SETUP,
@smtrfnv (Contributor), Jun 21, 2023:

I think it would be a good idea to write a comment with the actual system call function signature, like int io_uring_setup(u32 entries, struct io_uring_params *p) in this case.
Also applicable to other places.

@smtrfnv (Contributor), Jun 23, 2023:

Update: I see now that there are no glibc wrappers yet for the io_uring syscalls, so the comment is not really applicable.

Contributor (Author):

OK

ring,
k_SYSTEM_CALL_REGISTER_PROBE,
probe,
256));
Contributor:

Better to replace this with IoRingProbe::k_ENTRY_COUNT or pass it as a parameter to the function.

Contributor (Author):

OK


NTCO_IORING_LOG_CREATED(d_ring);

rc = ntco::IoRingUtil::probe(d_ring, &d_probe);
Contributor:

Maybe it would be a good idea to validate that operations like SENDMSG (and any others that are strictly required) are supported.

Contributor (Author):

This is a good idea but will require a non-trivial amount of work. I've created #67 for this.
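For reference, a sketch of what such a check might look like (illustrative only, and assuming the structure definitions from <linux/io_uring.h> rather than the PR's manual definitions):

#include <linux/io_uring.h>

// Return true if the probe results previously obtained from the kernel
// indicate that the operation 'op' (e.g. IORING_OP_SENDMSG) is supported.
bool probeSupports(const struct io_uring_probe* probe, unsigned op)
{
    if (op > probe->last_op) {
        return false;
    }
    return (probe->ops[op].flags & IO_URING_OP_SUPPORTED) != 0;
}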

@smtrfnv (Contributor) commented Jun 22, 2023:

There are multiple structures which are defined manually so that this code can be built on machines with old Linux kernels. But there are no tests that validate that these structures are defined properly. I think this could be done at compile time (perhaps in the test driver) when the library is built on a system with a modern kernel.
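A sketch of how such a compile-time check might look in the test driver (the header name ntco_ioring.h is an assumption, and the assertions shown are illustrative):

#if defined(__has_include)
#if __has_include(<linux/io_uring.h>)

#include <linux/io_uring.h>  // modern kernel headers available on this machine
#include <ntco_ioring.h>     // assumed component header for this PR
#include <bslmf_assert.h>

// Verify that the manually defined structures have the same size as the
// kernel's own definitions; offset checks for individual members could be
// added in the same way.
BSLMF_ASSERT(sizeof(ntco::IoRingSubmission) == sizeof(struct io_uring_sqe));
BSLMF_ASSERT(sizeof(ntco::IoRingCompletion) == sizeof(struct io_uring_cqe));

#endif
#endif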


bool IoRingCompletion::isValid() const
{
return true;
Contributor:

  1. This method seems to be unused
  2. Maybe you wanted to do something like return d_userData != 0?

Contributor (Author):

I have removed this function.

d_operation = static_cast<bsl::uint8_t>(ntco::IoRingOperation::e_SENDMSG);
d_handle = handle;
d_event = reinterpret_cast<__u64>(event);
d_flags = 0;
Contributor:

The approach of explicitly zeroing d_flags here and in other places does not look consistent: the IoRingSubmission members are already zeroed in its constructor, and yet d_flags is zeroed here one more time while the other members aren't. I think it's better to remove the explicit zeroing of d_flags.

Contributor (Author):

OK

@smtrfnv (Contributor) commented Jun 23, 2023:

A comment probably mainly for myself: I am reading https://kernel.dk/io_uring.pdf, which contains this statement:

However, since the sqe lifetime is only that of the actual submission of it, it's possible for the application to drive a higher pending request count than the SQ ring size would indicate. The application must take care not to do so, or it could risk overflowing the CQ ring.

I haven't yet found whether there is some protection against this implemented here.

// has previously registered the 'waiter'.
void wait(ntci::Waiter waiter);

// Acquire usage of the most suitable proactor selected according to
Contributor:

Don't we use /// to write contracts?

Contributor (Author):

The '///' is only supposed to be for comments on public declarations. Any '///' that occurs in an implementation is a (harmless) mistake, most of the time caused by the tool previously used to move comments around to conform to the latest documentation strategy.

NTCO_IORING_LOG_EVENT_STARTING(event);

ntco::IoRingSubmissionMode::Value mode;
if (NTCCFG_LIKELY(isWaiter())) {
Contributor:

Could you please elaborate why you consider this condition likely to be true?

Contributor (Author):

This is probably an unjustified usage of the branch prediction annotation. But the observation is that this function is most likely to be called by an I/O thread processing a socket, a timer, or a previously deferred function, and that thread then wants to defer another function or schedule another timer.


bool IoRingDevice::supportsCancelByHandle() const
{
return true;
Contributor:

Why does it always return true?

Contributor (Author):

I wanted to wait until we merged standard support for run-time detection of the operating system kernel version into ntscfg_platform. I have now properly implemented detection of this type of operation.


// Cancel all outstanding operations initiated for the specified
// 'socket'. Return the error.
ntsa::Error cancel(const bsl::shared_ptr<ntci::ProactorSocket>& socket)
Contributor:

It's not the subject of this review, but it looks like cancellation should be treated as an asynchronous operation, right? I mean that ideally ntci::ProactorSocket would have a method like processCancelled(...) to be notified when all operations are cancelled.

Contributor (Author):

I think you are correct. I've created a new issue for this work: #68.

@mattrm456 merged commit b3a640b into bloomberg:main on Aug 18, 2023