Implement an ntci::Proactor using io_uring #24

Merged (17 commits) on Aug 18, 2023

Conversation

@mattrm456 (Contributor):

This PR adds initial support for an ntci::Proactor implemented using io_uring on Linux.

io_uring is an alternative system call interface that allows user space to initiate certain system calls in batches, to have them executed asynchronously, and to learn about completed system calls in batches, reducing the system call overhead that would otherwise be required (e.g. one system call per sendmsg, recvmsg, etc.). Userspace may learn about completed system calls without making a system call at all, or it can make a system call to block until a minimum number of previously submitted system calls complete. A more thorough description of io_uring can be found in its Introduction.

In the design of NTC, an ntci::Proactor is an abstraction of an interface to an operating system whereby a user "proactively" initiates socket operations (e.g. connect, accept, send, receive) and is asynchronously notified when those operations complete. This PR implements the ntci::Proactor abstract class using io_uring.

The io_uring interface to the kernel consists of three system calls: one to create a ring, one to configure/control a ring, and one to submit operations and reap completions. Userspace writes system call submissions to a ring buffer and then performs a system call to notify the kernel to drain the submission queue. The kernel writes the status of completed system calls to a separate ring buffer, which userspace then drains. Userspace may submit and reap completions simultaneously. The submission queue and completion queue ring buffers are memory mapped into userspace according to the ring buffer configuration specified by the kernel when the ring is created. Given this design, it is possible for userspace to utilize io_uring functionality with nothing but support for syscall and mmap from libc. This implementation leverages these characteristics so that it may be built on any Linux machine, even machines whose kernels do not support io_uring (such as older kernels). To achieve this, the implementation eschews both the linux/io_uring.h header and any support library, such as liburing.
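To make the "nothing but syscall and mmap" point concrete, here is a minimal sketch (not the PR's code) of creating a ring and mapping its submission queue. It includes <linux/io_uring.h> purely for brevity; the PR instead defines the equivalent structures manually so that it builds even where that header is absent.

#include <linux/io_uring.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main()
{
    const long k_SYSTEM_CALL_SETUP = 425;  // __NR_io_uring_setup

    struct io_uring_params params;
    std::memset(&params, 0, sizeof params);

    // Ask the kernel to create a ring with 128 submission queue entries.
    int ring = static_cast<int>(::syscall(k_SYSTEM_CALL_SETUP, 128, &params));
    if (ring < 0) {
        std::perror("io_uring_setup");  // e.g. ENOSYS on kernels without io_uring
        return 1;
    }

    // Map the submission queue ring buffer using the offsets reported by the
    // kernel in 'params'; the completion queue ring is mapped the same way.
    std::size_t sqSize =
        params.sq_off.array + params.sq_entries * sizeof(unsigned);

    void* sq = ::mmap(0,
                      sqSize,
                      PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_POPULATE,
                      ring,
                      IORING_OFF_SQ_RING);
    if (sq == MAP_FAILED) {
        std::perror("mmap");
        ::close(ring);
        return 1;
    }

    ::munmap(sq, sqSize);
    ::close(ring);
    return 0;
}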

io_uring is conceptually simple but has many aspects that make it challenging to use correctly and efficiently:

  1. io_uring has evolved significantly since its first release, gaining many features over time. Not all kernel versions support all features. Some features can be detected with io_uring's capabilities system, but not all (e.g. support for certain flags when submitting operations). This implementation will probably work on Linux kernel versions >= 5.11 (it has only been tested on 6.2), but future work is necessary to support and test all functionality on older kernels. That is left for a future PR.

  2. The principal argument for io_uring seems to be the efficiency gained by submitting recvmsg and sendmsg in batches, avoiding the per-operation system call overhead. While batching recvmsg may make some sense, it is unclear why batching sendmsg calls makes sense: it trades increased latency and reduced throughput for greater system call efficiency. Moreover, an ntci::Proactor is designed to be safely used by multiple threads concurrently calling ntci::Proactor::send, ntci::Proactor::receive, and ntci::Proactor::wait. If a thread is blocked in wait for at least one event to complete, another thread calling send or receive must submit and enter the ring immediately; otherwise the kernel never learns about the submission and never initiates the operation, which then never completes, and the thread blocked in wait remains blocked forever. This implementation chooses to always submit sendmsg operations and immediately enter the ring, and only defers entering the ring for other operations when it can be safely determined that only one thread will be calling ntci::Proactor::wait and that thread is the same one calling ntci::Proactor::send, ntci::Proactor::receive, etc. (see the sketch after this list).

  3. io_uring has evolved support for a number of ways of timing out while waiting for operations to complete. Each is a bit clumsy to use and inconsistent in what it supports: userspace may submit a "timeout" operation with a relative timeout or (on some kernels) an absolute timeout, with (on some kernels) a specified clock type. Such "timers" need to be constantly removed or at least re-submitted. Newer kernels support specifying a timeout directly when entering the ring, but it is not clear whether absolute timeouts are or will be supported, or any clock type other than CLOCK_MONOTONIC.

  4. Very recent kernels support canceling all pending operations for a file descriptor, but older kernels do not. On older kernels, this implementation must maintain a data structure of all pending operations for a socket, so they can be cancelled when the user wishes to close the socket.

  5. It is unclear whether two threads can enter the ring simultaneously, where both are instructing the kernel to drain the submission queue. This implementation relies on that being safe, which should be reviewed or revisited in later improvements. If two threads cannot enter a ring simultaneously to submit, it is unclear how to use io_uring from multiple threads at all: the obvious (but perhaps naive) solution would be to maintain a thread-safe copy of the submission queue that sits in front of the ring's submission queue, where threads initiating operations perform a thread-safe push onto this queue, then "kick" the ring with a "no-op" operation, waking up a thread blocked entering the ring, so that thread can immediately re-enter the ring to actually submit the operations to the kernel. This seems contradictory to the design and goals of io_uring.
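The sketch below illustrates the submission policy described in item 2. The names are illustrative only (the PR's actual type is ntco::IoRingSubmissionMode, and the real decision involves more state than shown):

#include <iostream>

// Hypothetical illustration of the policy from item 2 above. 'e_IMMEDIATE'
// means the operation is written to the submission queue and the ring is
// entered right away; 'e_DEFERRED' means the submission stays queued until
// the calling thread itself next enters the ring (e.g. in wait()).
enum Mode { e_DEFERRED, e_IMMEDIATE };

Mode chooseSubmissionMode(bool isSendOperation, bool callerIsTheOnlyWaiter)
{
    // sendmsg is always flushed to the kernel immediately: another thread may
    // already be blocked in wait() and would otherwise never learn about this
    // submission.
    if (isSendOperation) {
        return e_IMMEDIATE;
    }

    // Other operations may be deferred only when the caller is known to be
    // the single thread that will call wait() on this proactor.
    return callerIsTheOnlyWaiter ? e_DEFERRED : e_IMMEDIATE;
}

int main()
{
    std::cout << chooseSubmissionMode(true,  true)  << '\n';  // 1: immediate
    std::cout << chooseSubmissionMode(false, true)  << '\n';  // 0: deferred
    std::cout << chooseSubmissionMode(false, false) << '\n';  // 1: immediate
    return 0;
}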

The implementation of ntco_ioring is lightly tested in its component test driver, but is folded into the NTC proactor framework and more thoroughly tested in the ntcf_system test driver.

For now, I consider support for io_uring in NTC to be experimental. As such, it is disabled by default in the build configuration. Builders may enable it by specifying configure --with-ioring during the configuration step of the build process. Further support is detected at runtime; if a machine does not support the io_uring system calls, the "ioring" proactor will be automatically disabled.
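One plausible way to express that runtime detection (an illustrative sketch, not necessarily how this PR implements it) is to probe the io_uring_setup system call and treat ENOSYS as "not supported":

#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>

// Return true if the running kernel exposes the io_uring system calls. A
// deliberately invalid call is made: a kernel that has io_uring fails it with
// an error other than ENOSYS (e.g. EFAULT), while a kernel without io_uring
// fails with ENOSYS.
bool systemSupportsIoRing()
{
    const long k_SYSTEM_CALL_SETUP = 425;  // __NR_io_uring_setup

    errno = 0;
    long rc = ::syscall(k_SYSTEM_CALL_SETUP, 0, static_cast<void*>(0));
    if (rc >= 0) {
        ::close(static_cast<int>(rc));  // unexpected success; clean up
        return true;
    }
    return errno != ENOSYS;
}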

Initial performance tests are underwhelming. The existing epoll backend out-performs this io_uring backend for most workloads, though a few do show io_uring winning on latency and throughput. Future work should be planned to optimize this implementation.

{
const long k_SYSTEM_CALL_SETUP = 425;

return static_cast<int>(::syscall(k_SYSTEM_CALL_SETUP,
@smtrfnv (Contributor), Jun 21, 2023:

I think it would be a good idea to write a comment with the actual system call function signature, like int io_uring_setup(u32 entries, struct io_uring_params *p) in this case.
Also applicable to other places.

@smtrfnv (Contributor), Jun 23, 2023:

Update: I see now that there are no glibc wrappers yet for the io_uring syscalls, so the comment is not really applicable.

Contributor (Author):

OK

ring,
k_SYSTEM_CALL_REGISTER_PROBE,
probe,
256));
Contributor:

Better to replace this with IoRingProbe::k_ENTRY_COUNT or pass it as a parameter to the function.

Contributor (Author):

OK


NTCO_IORING_LOG_CREATED(d_ring);

rc = ntco::IoRingUtil::probe(d_ring, &d_probe);
Contributor:

Maybe it would be a good idea to validate that operations like SENDMSG (and any others that are strictly required) are supported.

Contributor (Author):

This is a good idea but will require a non-trivial amount of work. I've created #67 for this.
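For reference, a sketch of what such a check might look like (illustrative only, and assuming the structure definitions from <linux/io_uring.h> rather than the PR's manual definitions):

#include <linux/io_uring.h>

// Return true if the probe results previously obtained from the kernel
// indicate that the operation 'op' (e.g. IORING_OP_SENDMSG) is supported.
bool probeSupports(const struct io_uring_probe* probe, unsigned op)
{
    if (op > probe->last_op) {
        return false;
    }
    return (probe->ops[op].flags & IO_URING_OP_SUPPORTED) != 0;
}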

@smtrfnv (Contributor) commented Jun 22, 2023:

There are multiple structures which are defined manually so that this code can be built on machines with old Linux kernels. But there are no tests that validate that these structures are defined properly. I think this could be done at compile time (perhaps in the test driver) when the library is built on a system with a modern kernel.
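A sketch of how such a compile-time check might look in the test driver (the header name ntco_ioring.h is an assumption, and the assertions shown are illustrative):

#if defined(__has_include)
#if __has_include(<linux/io_uring.h>)

#include <linux/io_uring.h>  // modern kernel headers available on this machine
#include <ntco_ioring.h>     // assumed component header for this PR
#include <bslmf_assert.h>

// Verify that the manually defined structures have the same size as the
// kernel's own definitions; offset checks for individual members could be
// added in the same way.
BSLMF_ASSERT(sizeof(ntco::IoRingSubmission) == sizeof(struct io_uring_sqe));
BSLMF_ASSERT(sizeof(ntco::IoRingCompletion) == sizeof(struct io_uring_cqe));

#endif
#endif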


bool IoRingCompletion::isValid() const
{
return true;
Contributor:

  1. This method seems to be unused
  2. Maybe you wanted to do something like return d_userData != 0?

Contributor (Author):

I have removed this function.

d_operation = static_cast<bsl::uint8_t>(ntco::IoRingOperation::e_SENDMSG);
d_handle = handle;
d_event = reinterpret_cast<__u64>(event);
d_flags = 0;
Contributor:

The approach of explicitly zeroing d_flags here and in other places does not look consistent: the IoRingSubmission members are already zeroed in its constructor, and yet d_flags is zeroed here one more time while the other members aren't. I think it's better to remove the explicit zeroing of d_flags.

Contributor (Author):

OK

@smtrfnv (Contributor) commented Jun 23, 2023:

A comment probably mainly for myself: I am reading https://kernel.dk/io_uring.pdf, which contains this statement:

However, since the sqe lifetime is only that of the actual submission of it, it's possible for the application to drive a higher pending request count than the SQ ring size would indicate. The application must take care not to do so, or it could risk overflowing the CQ ring.

I haven't yet found whether there is some protection against this implemented here.

// has previously registered the 'waiter'.
void wait(ntci::Waiter waiter);

// Acquire usage of the most suitable proactor selected according to
Contributor:

Don't we use /// to write contracts?

Contributor (Author):

The '///' is only supposed to be for comments on public declarations. Any '///' that occurs in an implementation is a (harmless) mistake, most of the time caused by the tool previously used to move comments around to conform to the latest documentation strategy.

NTCO_IORING_LOG_EVENT_STARTING(event);

ntco::IoRingSubmissionMode::Value mode;
if (NTCCFG_LIKELY(isWaiter())) {
Contributor:

Could you please elaborate why you consider this condition likely to be true?

Contributor (Author):

This is probably an unjustified usage of the branch prediction annotation. But the observation is that this function is most likely to be called by an I/O thread processing a socket, a timer, or a previously deferred function, and that thread then wants to defer another function or schedule another timer.


bool IoRingDevice::supportsCancelByHandle() const
{
return true;
Contributor:

Why does it always return true?

Contributor (Author):

I wanted to wait until we merged standard support for run-time detection of the operating system kernel version into ntscfg_platform. I have now properly implemented detection of this type of operation.


// Cancel all outstanding operations initiated for the specified
// 'socket'. Return the error.
ntsa::Error cancel(const bsl::shared_ptr<ntci::ProactorSocket>& socket)
Contributor:

It's not the subject of this review, but it looks like cancellation should be treated as an asynchronous operation, right? I mean that ideally ntci::ProactorSocket would have a method like processCancelled(...) to be notified when all operations are cancelled.

Contributor (Author):

I think you are correct. I've created a new issue for this work: #68.

@mattrm456 merged commit b3a640b into bloomberg:main on Aug 18, 2023