Use io_uring for sockets on Linux #124374
Pull request overview
This PR implements an experimental opt-in io_uring-backed socket event engine for Linux as an alternative to epoll. The implementation is comprehensive, including both readiness-based polling (Phase 1) and completion-based I/O operations (Phase 2), along with extensive testing infrastructure and evidence collection tooling.
Changes:
- Native layer: cmake configuration, PAL networking headers, and io_uring system call integration with graceful epoll fallback
- Managed layer: socket async engine extensions for io_uring completion handling, operation lifecycle tracking, buffer pinning, and telemetry
- Testing: comprehensive functional tests, layout contract validation, stress tests, and CI infrastructure for dual-mode test execution
- Tooling: evidence collection and validation scripts for performance comparison and envelope testing
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/native/libs/configure.cmake | Adds CMake configuration checks for io_uring header and poll32_events struct member |
| src/native/libs/System.Native/pal_networking.h | Defines new io_uring interop structures (IoUringCompletion, IoUringSocketEventPortDiagnostics) and function signatures |
| src/native/libs/System.Native/entrypoints.c | Registers new io_uring-related PAL export entry points |
| src/native/libs/Common/pal_config.h.in | Adds CMake defines for io_uring feature detection |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/TelemetryTest.cs | Adds layout contract tests for io_uring interop structures and telemetry counter verification |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/System.Net.Sockets.Tests.csproj | Implements MSBuild infrastructure for creating io_uring test archive variants (enabled/disabled/default) |
| src/libraries/System.Net.Sockets/tests/FunctionalTests/IoUring.Unix.cs | Adds comprehensive functional and stress tests for io_uring socket workflows |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketsTelemetry.cs | Adds 12 new PollingCounters for io_uring observability metrics |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs | Implements managed wrappers for io_uring prepare operations with error handling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs | Core io_uring integration: submission batching, completion handling, operation tracking, and diagnostics polling |
| src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs | Operation-level io_uring support: buffer pinning, user_data allocation, completion processing, and state machine |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs | Defines managed interop structures matching native layout for io_uring operations |
| eng/testing/io-uring/validate-collect-sockets-io-uring-evidence-smoke.sh | Smoke validation script for evidence collection tooling |
| eng/testing/io-uring/collect-sockets-io-uring-evidence.sh | Comprehensive evidence collection script for functional/perf validation and envelope testing |
| docs/workflow/testing/libraries/testing.md | Adds references to io_uring-specific documentation |
| docs/workflow/testing/libraries/testing-linux-sockets-io-uring.md | Detailed validation guide for io_uring backend testing |
| docs/workflow/testing/libraries/io-uring-pr-evidence-template.md | PR evidence template for documenting io_uring validation results |
…tion on single-core
…etry batching, packed capabilities, and native shim hardening

Merge-blocking:
- Set SOCK_CLOEXEC | SOCK_NONBLOCK on accept SQEs via AcceptFlags constant
- Add EPERM to IsIgnoredIoUringSubmitError and drain rejected SQEs as failed completions instead of re-queuing them (breaks infinite-retry spin)
- Replace SEND_ZC reattach FailFast with Debug.Fail, slot cleanup, and error completion; add generation check in HandleZeroCopyNotification

Performance:
- Batch per-CQE telemetry (depletion/recycle/early-data) into drain-batch accumulators flushed once per DrainCqeRingBatch
- Replace Interlocked.CompareExchange pair in TryTrackPreparedIoUringOperation with Volatile.Read/Write (event-loop-only path)
- Move 5 permanently-true SQE ring invariant checks to one-time init validation
- Convert CQE tag dispatch from switch to if-chain for branch prediction
- Clear only NativeMsghdr header instead of full native storage stride
- Copy buffered multishot recv data outside lock

Capabilities and telemetry:
- Convert LinuxIoUringCapabilities from 8-bool positional struct to packed uint flags with fluent With* mutators
- Add slot high-water-mark and cancellation-queue-overflow production counters
- Add capacity planning comments near SlotIndexBits
- Add Debug.Assert on non-EXT_ARG fallback path

Security and resilience:
- Add BitOperations.IsPow2 asserts on kernel-reported ring sizes in TryMmapRings
- Add c_static_assert(sizeof(size_t) >= 8) in native shim
- Add ringFd < 0 validation at all native shim entry points
- Wrap DangerousRelease in try/finally in FreeCompletionSlot
- Block provided-buffer resize during CQ overflow recovery
- Fix FreeIoUringProvidedBufferRing transient inconsistent capability state
- Guard Dispose against freeing registered ring memory
- Replace _persistentMultishotRecvDataQueueCount with computed property
- Use Volatile.Write for teardown TrackedOperationGeneration clear
- Silently ignore TagNone CQEs from ASYNC_CANCEL completions
- Add EINTR comment on native shim CloseFd

Tests:
- Add accepted-socket FD_CLOEXEC and O_NONBLOCK verification test
- Add forced-submit-EPERM graceful degradation test
…buffer ring group ID hardening, sweep re-arm cap, wake circuit-breaker, and test coverage

MpscQueue:
- Co-locate Items and States into single SegmentEntry[] array for cache locality
- Add TryEnqueue with bounded retry (MaxEnqueueSlowAttempts=2048) and SpinWait backoff; catch OOM in RentUnlinkedSegment
- Handle TryEnqueue failure at prepare-queue and cancel-queue call sites
- Remove AggressiveInlining from lock-containing Rent/ReturnUnlinkedSegment
- Promote ARM64 concurrent stress test from OuterLoop to regular CI

Code quality:
- Collapse redundant WriteSendSqe/WriteSendZcSqe and WriteSendMsgSqe/WriteSendMsgZcSqe wrappers; call WriteSendLikeSqe/WriteSendMsgLikeSqe directly
- Replace string-based telemetry test hook dispatch with IoUringCounterFieldForTest enum for compile-time safety
- Centralize 11 debug test env vars into IoUringTestEnvironmentVariables class
- Move s_ioUringResolvedConfigurationLogged to per-engine instance field
- Add SQE zeroing socket-only assumption comment

Resilience:
- Replace fragile group ID toggle (1/2) with sequential allocation starting at 0x8000 to avoid collision with other io_uring users
- Cap CQ overflow stale-tracked sweep re-arms at 8 with diagnostic log
- Add eventfd wake failure circuit-breaker: after 8 consecutive failures, reduce completion wait timeout from 50ms to 1ms; reset on successful wake
- Null out multishot accept sockaddr pointer to eliminate shared-buffer race

Performance:
- Add _nextPreparedReceivePostedWordHint for O(1) common-case bitset search in TryAcquireBufferForPreparedReceive; update hint on recycle and acquisition
- Remove AggressiveInlining from IsProvidedBufferResizeQuiescent
- Bound TryAcquireBufferForPreparedReceive retry by word count instead of ring size

Tests:
- CQ overflow recovery with zero tracked operations
- Wakeup eventfd FD_CLOEXEC verification
- SQPOLL + DEFER_TASKRUN mutual exclusivity assertion
- NativeMsghdr 32-bit rejection path
- UDP oversized datagram with zero-length ReceiveFrom buffer
- CounterDelta monotonicity assertion (replaces silent underflow)
- Clarify zero-copy small-buffer test name and forced-error intent
…oA split, cancellation batching, configuration centralization, registered ring fd EINVAL fallback, and test coverage

- Convert static io_uring counters to per-engine instance fields with aggregation
- Group 20+ managed ring mmap fields into ManagedRingState struct with property accessors
- Split TrackedOperation/TrackedOperationGeneration into separate IoUringTrackedOperationState array for cache locality; shrink IoUringCompletionSlot from 32 to 24 bytes
- Batch ProcessCancellation ThreadPool callbacks via static ConcurrentQueue with cooperative worker drain
- Replace ConcurrentQueue<SocketIOEvent> with MpscQueue on Linux via SocketIOEventQueue wrapper
- Centralize configuration resolution into IoUringConfigurationInputs with contradiction validation warnings
- Collapse CounterPair struct into static TryPublishManagedCounterDelta method
- Add registered ring fd EINVAL fallback on all four io_uring_enter call sites (submit, SQPOLL wakeup, EXT_ARG wait, non-EXT_ARG wait)
- Treat kernel EINVAL from submit as drainable error; convert internal invariant violations to ThrowInternalException(string) to bypass drain
- Add MpscQueue drained-segment recycling with slow-path-only producer quiescence tracking
- Add provided-buffer ring OOM test hook and EINTR retry limit test hook in native shim
- Replace ThrowInternalException with Debug.Fail at unreachable/defensive sites in slots and dispatch
- Add tests for generation wrap-around dispatch, fork/exec close-on-exec, queue saturation, slot capacity stress, kernel version fallback, cancellation routing, and MpscQueue OOM recovery
…egate wrapper allocation
… Debug-only compilation and Release stubs
```c
// Layout assertions for managed interop structs (kernel struct mirrors).
c_static_assert(sizeof(size_t) >= 8);
c_static_assert(sizeof(size_t) == sizeof(void*));
```
c_static_assert(sizeof(size_t) >= 8) (and the following pointer-size asserts) will fail compilation on 32-bit Linux, even if io_uring is meant to be disabled there. Consider gating SHIM_HAVE_IO_URING (or just these layout asserts) on 64-bit (e.g., __SIZEOF_POINTER__ == 8) so System.Native still builds for 32-bit targets and the shim can fall back to the stub implementations.
Suggested change:

```diff
-// Layout assertions for managed interop structs (kernel struct mirrors).
-c_static_assert(sizeof(size_t) >= 8);
-c_static_assert(sizeof(size_t) == sizeof(void*));
+// Layout assertions for managed interop structs (kernel struct mirrors).
+#if defined(__SIZEOF_POINTER__) && __SIZEOF_POINTER__ == 8
+c_static_assert(sizeof(size_t) >= 8);
+c_static_assert(sizeof(size_t) == sizeof(void*));
+#endif
```
…submitter_task, drain all non-EFAULT submit errors, and test coverage
…AULT submit errors, EINVAL registered-ring-fd fallback, and source-specific error context
```csharp
if (next is not null)
{
    Interlocked.CompareExchange(ref _tail.Value, next, tail);
}
```
Are we assuming that _tail.Value is eventually consistent? Otherwise, I believe this scenario could end up with the tail holding an invalid value.

At the top you grab the current tail:

`Segment tail = Volatile.Read(ref _tail.Value)!;`

Then, if the entry array is full, you continue on to create a new tail. If that fails, you "refresh" the next variable to the current next:

`next = Volatile.Read(ref tail.Next);`

Now, assume that your thread context-switches out at this point and some other thread(s) enqueue a bunch of items that cause a new tail to be added. Then we context-switch back in, and since tail and _tail.Value are no longer the same, you will set _tail.Value to next, but next points to the previous tail.
Maybe this is covered by this part of the description:
Segment recycling limited to segments that lost the tail-link CAS race (never previously published), avoiding need for producer quiescence tracking
```csharp
while (true)
{
    Segment tail = Volatile.Read(ref _tail.Value)!;
    int index = Interlocked.Increment(ref tail.EnqueueIndex.Value) - 1;
```
Based on my comment below (or above, if reading from the discussion page): could we hit a race condition that keeps resetting _tail.Value to a previous value? And if we keep incrementing this index on a stale segment, could we hit an integer overflow?
```csharp
{
    get
    {
        Segment head = Volatile.Read(ref _head.Value)!;
```
This seems pretty computationally heavy; is there a reason you can't just keep a single _count variable that you atomically increment/decrement and check for 0 here?
```csharp
    fixedRecvBufferId,
    ref completionAuxiliaryData))
{
    completionResultCode = -Interop.Sys.ConvertErrorPalToPlatform(Interop.Error.ENOBUFS);
```
Why the negation? I see you do it below as well. A quick search around the repo found this referenced in only one other place, and they did not negate; the folks referencing that code don't appear to be negating either.
```c
int32_t state = atomic_load_explicit(&s_forceEnterEintrRetryLimitOnce, memory_order_relaxed);
if (state < 0)
{
    const char* configuredValue = getenv(SHIM_TEST_FORCE_ENTER_EINTR_RETRY_LIMIT_ONCE_ENV);
```
Should this be behind an #ifdef DEBUG?
```csharp
private const string ConnectActivityName = ActivitySourceName + ".Connect";
private static readonly ActivitySource s_connectActivitySource = new ActivitySource(ActivitySourceName);

internal static class Keywords
```
Maybe IoUringKeywords would be a better name.
```csharp
#if DEBUG
// Test-only knob to make wait-buffer saturation deterministic for io_uring diagnostics coverage.
// Only available in DEBUG builds so production code never reads test env vars.
if (OperatingSystem.IsLinux())
```
Should you also check DOTNET_SYSTEM_NET_SOCKETS_IO_URING or do we assume that DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_EVENT_BUFFER_COUNT is only set when the feature flag is enabled?
```csharp
try
{
    RecordAndAssertEventLoopThreadIdentity();
    LinuxEventLoopEnableRings();
```
Wonder if these could be more generic, e.g.
LinuxEventLoopEnableRings -> EventLoopInit
LinuxEventLoopBeforeWait -> EventLoopBeforeWait
LinuxEventLoopTryCompletionWait -> EventLoopTryCompleteWait
etc.
Hmm, I guess it would be an issue if someone wanted to add their own "EventLoopInit" or equivalent for the other methods :)
```csharp
}
else
{
    Debug.Assert(
```
Does this mean we have not tested this on kernels before 6.1?
```csharp
{
    // Snapshot the wakeup generation counter before entering the blocking syscall.
    // After waking, we compare to detect wakeups that arrived during the syscall.
    uint wakeGenBefore = Volatile.Read(ref _ioUringWakeupGeneration);
```
You're going to need to define this outside the if statement so you can reference it after the if/else.
Contributes to #753
1. Summary
This document describes the complete, production-grade io_uring socket I/O engine in .NET's `System.Net.Sockets` layer. When enabled via `DOTNET_SYSTEM_NET_SOCKETS_IO_URING=1` on Linux kernel 6.1+, the engine replaces epoll with a managed io_uring completion-mode backend.

The native shim is intentionally minimal - 433 lines of C wrapping the three io_uring syscalls (setup, enter, register) plus eventfd and mmap helpers. All ring management, SQE construction, CQE dispatch, operation lifecycle, feature negotiation, overflow recovery, and SQPOLL wakeup detection lives in managed code.
The engine proper is organized as eight partial class files extending `SocketAsyncEngine`: the main file (`SocketAsyncEngine.Linux.cs`, 3848 lines) holds ring setup, flag negotiation, CQE drain, SQE prep orchestration, completion slot layout, and the event loop; the remaining seven partials handle ring mmap lifecycle (IoUringRings, 343 lines), completion slot pool management (IoUringSlots, 437 lines), SQE writing (IoUringSqeWriters, 327 lines), completion dispatch (IoUringCompletionDispatch, 668 lines), diagnostics logging (IoUringDiagnostics, 324 lines), configuration resolution (IoUringConfiguration, 128 lines), and debug test hooks (IoUringTestHooks, 214 lines). A separate `IoUringTestAccessors.Linux.cs` file (938 lines) exposes all test-observable state through strongly-typed accessors. Tests access this surface through `InternalTestShims.Linux.cs` (644 lines), a centralized reflection shim with `[DynamicDependency]` annotations for trimmer/AOT safety.

Key metrics:
2. Architecture
Ring Ownership and Event Loop
The architecture follows the SINGLE_ISSUER contract: exactly one thread - the event loop thread - owns the io_uring instance. All ring mutations (SQE writes, CQ head advances, io_uring_enter calls) happen on this thread. Other threads communicate via two MPSC queues.
```mermaid
graph TD
    WT[Worker Threads] -->|"MpscQueue<IoUringPrepareWorkItem>"| EL[Event Loop Thread]
    WT -->|"MpscQueue<ulong> (cancel)"| EL
    WT -->|"eventfd write (wake)"| EL
    EL -->|"Writes SQEs / Drains CQEs / io_uring_enter"| K[Kernel - io_uring]
    K -->|"CQE completions"| EL
    EL -->|"ThreadPool.QueueUserWorkItem"| TP[ThreadPool]
```

The Thin Native Shim Approach
The native shim (`pal_io_uring_shim.c`, 433 lines) wraps exactly:

- `io_uring_setup` (via `syscall(__NR_io_uring_setup, ...)` with `SYS_io_uring_setup` fallback)
- `io_uring_enter` (with and without EXT_ARG)
- `io_uring_register`
- `mmap`/`munmap` (for ring mapping)
- `eventfd`/`read`/`write` (for cross-thread wakeup; EINTR-looped)
- `uname` (for kernel version detection)

All ring pointer arithmetic, SQE field population, CQE parsing, SQPOLL wakeup detection (via `Volatile.Read` on the mmap'd SQ flags word), overflow recovery, and operation lifecycle management happens in managed C#. This is deliberate:

- The shim depends only on `<linux/io_uring.h>` - no liburing dependency
- ABI constants are validated (e.g. `_Static_assert(IORING_SETUP_CLOEXEC == (1U << 19), ...)` in the shim and layout contract tests in C#)

Threading Model
The event loop thread owns:
- The `_completionSlots[]`/`_completionSlotStorage[]` arrays
- `SQ_NEED_WAKEUP` on the mmap'd SQ flags pointer

Worker threads interact solely through:

- `TryEnqueueIoUringPreparation()` -> MPSC prepare queue -> eventfd write
- `TryRequestIoUringCancellation()` -> MPSC cancel queue -> eventfd write
- `Volatile.Read` on `_ioUringTeardownInitiated` to avoid publishing work after shutdown

Partial Class File Organization
- `SocketAsyncEngine.Linux.cs`
- `SocketAsyncEngine.IoUringSlots.Linux.cs`
- `SocketAsyncEngine.IoUringRings.Linux.cs` - `TryMmapRings`: maps SQ/CQ/SQE regions, validates mmap offset bounds, derives all ring pointers. `CleanupManagedRings`: multi-step teardown. `LinuxFreeIoUringResources`: full teardown orchestration
- `SocketAsyncEngine.IoUringSqeWriters.Linux.cs` - `Write*Sqe` methods: send, sendZc, recv, readFixed, providedBufferRecv, multishotRecv, accept, multishotAccept, sendMsg, sendMsgZc, recvMsg, connect, asyncCancel. Deduplicated via `WriteSendLikeSqe` and `WriteSendMsgLikeSqe`
- `SocketAsyncEngine.IoUringCompletionDispatch.Linux.cs` - `SocketEventHandler` partial: `DispatchSingleIoUringCompletion`, `DispatchMultishotIoUringCompletion`, `DispatchZeroCopyIoUringNotification`, multishot accept/recv dispatch, buffer materialization, completion result routing
- `SocketAsyncEngine.IoUringDiagnostics.Linux.cs` - `NetEventSource.Info`/`Error` log helpers for all io_uring events: async-cancel failures, queue overflows, CQ overflow entry/completion with branch discriminator, deferred rearm nudge, teardown summary, advanced feature state
- `SocketAsyncEngine.IoUringConfiguration.Linux.cs` - `IsIoUringEnabled`, `IsSqPollRequested`, `IsZeroCopySendOptedIn`, `IsIoUringDirectSqeDisabled` with `[FeatureSwitchDefinition]` annotations for JIT-eliminable code paths
- `SocketAsyncEngine.IoUringTestHooks.Linux.cs` - `#if DEBUG`-gated EAGAIN/ECANCELED forced result injection, per-opcode mask parsing from environment, result application/resolution/restoration
- `SocketAsyncEngine.IoUringTestAccessors.Linux.cs`

Submission Path: Standard vs. SQPOLL
In standard mode, `io_uring_enter` submits pending SQEs and optionally waits for CQEs. In SQPOLL mode, a kernel thread continuously polls the SQ ring. Managed code detects kernel-thread idle via `Volatile.Read` on the mmap'd `_managedSqFlagsPtr`, checking for `IORING_SQ_NEED_WAKEUP`. When the kernel thread is awake, no `io_uring_enter` is needed for submission.

Flag Negotiation (Peel Loop)
Setup builds an initial flag set: `CQSIZE | SUBMIT_ALL | COOP_TASKRUN | SINGLE_ISSUER | NO_SQARRAY | CLOEXEC`. SQPOLL (mutually exclusive with DEFER_TASKRUN) or DEFER_TASKRUN is added based on configuration. On `EINVAL`, flags are peeled in order: `NO_SQARRAY` first, then `CLOEXEC`. `EPERM` is never retried (respecting seccomp/kernel policy). After setup, `FD_CLOEXEC` is set as a fallback via `fcntl` for kernels where `IORING_SETUP_CLOEXEC` was peeled.

CQ Overflow Recovery State Machine
CQ overflow is detected on every `DrainCqeRingBatch` entry via `ObserveManagedCqOverflowCounter`, which compares the mmap'd overflow counter against the last-observed value using wrapping uint32 delta arithmetic. When a delta is seen, the engine enters a three-branch recovery state machine:

- One branch applies when `_liveAcceptCompletionSlotCount > 0` and not in teardown; it defers multishot accept re-arm nudges until post-drain.
- Another branch applies when `_ioUringTeardownInitiated` is set; teardown owns recovery completion.

During overflow recovery, CQ head advances happen per-CQE (not batched) to relieve kernel pressure immediately. Recovery completes when the CQ ring is fully drained and no new overflow delta is observed. On completion:
`AssertCompletionSlotPoolConsistency` validates free-list integrity, telemetry is incremented, and for the MultishotAcceptArming branch, `TryQueueDeferredMultishotAcceptRearmAfterRecovery` nudges accept contexts.

After recovery completes, a delayed sweep (
`TrySweepStaleTrackedIoUringOperationsAfterCqOverflowRecovery`) fires 250ms later to retire tracked operations whose CQEs were dropped. The sweep skips intentionally long-lived multishot accept and persistent multishot recv slots. Operations still in the waiting state are canceled; already-transitioned operations are detached and their slots freed.

3. Key Data Structures
Completion Slot Pool
Three parallel SoA arrays, all indexed by slot index:
- `IoUringCompletionSlot[]` (hot, 32 bytes each, `[StructLayout(LayoutKind.Explicit, Size = 32)]`):
  - `Generation` (ulong) - 43-bit generation field
  - `FreeListNext` (int) - intrusive free list, -1 = end
  - `_packedState` (uint) - `IoUringCompletionOperationKind` in low 8 bits; boolean flags `IsZeroCopySend`/`ZeroCopyNotificationPending`/`UsesFixedRecvBuffer` in bits 8-10
  - `FixedRecvBufferId` (ushort)
  - (`#if DEBUG` only): `TestForcedResult` (int)
- `IoUringCompletionSlotStorage[]` (cold): per-slot tracked operation reference (`TrackedOperation`, `TrackedOperationGeneration`), `DangerousRefSocketHandle` for fd lifetime, pre-allocated native inline storage slab (NativeMsghdr + 4 IOVectors + 128B socket addr + 128B control + socklen_t), message writeback pointers for recvmsg.
- `MemoryHandle[]` (zero-copy pin holds): one `System.Buffers.MemoryHandle` per slot index, holding the pin for SEND_ZC payloads until the NOTIF CQE arrives.
IoUringCompletionSlotfield offsets and the 32-byte total size via reflection on every test run. ADebug.AssertinInitializeCompletionSlotPoolfires if the size drifts.Generation Encoding
13-bit slot index (
SlotIndexBits = 13, capacity 8192) and 43-bit generation (GenerationBits = 56 - 13 = 43,GenerationMask = (1UL << 43) - 1UL) packed into the 56-bituser_datapayload. The upper 8 bits of user_data carry a tag byte (2 = reserved completion, 3 = wakeup signal). Generation is initialized to 1 (not 0) so stale CQEs referencing generation 0 are rejected. On wrap, generation remaps from2^43-1back to 1, skipping zero.IoUringCompletionOperationKind
A 3-variant enum (`None`, `Accept`, `Message`) stored in the packed state of each `IoUringCompletionSlot`. This determines per-completion post-processing behavior: accept completions read the sockaddr length from the native slab; message completions copy writeback data from the native msghdr.

IoUringCompletionDispatchKind
A 10-variant enum (`Default`, `ReadOperation`, `WriteOperation`, `SendOperation`, `BufferListSendOperation`, `BufferMemoryReceiveOperation`, `BufferListReceiveOperation`, `ReceiveMessageFromOperation`, `AcceptOperation`, `ConnectOperation`) stored as a packed integer inside each `AsyncOperation`, set at operation creation time and consumed at CQE dispatch to route completions without virtual dispatch. Defined in the shared Unix partial class (`SocketAsyncContext.Unix.cs`) so it compiles on all Unix TFMs.

MPSC Queue
`MpscQueue<T>` is a lock-free segmented queue with cache-line-padded head/tail pointers and an `EnqueueIndex` counter per segment. Features:

- Segment recycling (guarded by a `Lock`) to reduce allocation pressure during burst enqueue patterns
- Fast paths (`TryEnqueueFast`/`TryDequeueFast`) inlined for the common non-full/non-empty case
- The `IsEmpty` property is snapshot-based, not linearizable - a return of true can mean an enqueue is mid-flight

Provided Buffer Ring
`IoUringProvidedBufferRing` (1,013 lines): kernel-registered buffer pool for recv operations. Features:

- Registered with the kernel via `IORING_REGISTER_PBUF_RING`
- `Debug.Assert(IsCurrentThreadEventLoopThread())` on resize evaluation
- `BeginDeferredRecyclePublish`/`EndDeferredRecyclePublish` bracket the CQE drain loop to batch `PublishTail` calls
- Adaptive resizing via `EvaluateProvidedBufferRingResize`, gated by the `System.Net.Sockets.IoUringAdaptiveBufferSizing` AppContext switch
- Resizes require `InUseCount == 0` and `_trackedIoUringOperationCount == 0` before swap
- `IORING_REGISTER_BUFFERS` for fixed-buffer recv via the `READ_FIXED` opcode

LinuxIoUringCapabilities
An immutable `readonly struct` snapshot captured after ring setup and stored as `_ioUringCapabilities`. Exposes `IsIoUringPort`, `Mode`, `SupportsMultishotRecv`, `SupportsMultishotAccept`, `SupportsZeroCopySend`, `SqPollEnabled`, `SupportsProvidedBufferRings`, and `HasRegisteredBuffers`. Eliminates scattered per-capability flag reads; the entire capability set is decided once at initialization and updated only for provided-buffer state changes.

IoUringResolvedConfiguration
An immutable `readonly struct` capturing all resolved configuration inputs at startup: `IoUringEnabled`, `SqPollRequested`, `DirectSqeDisabled`, `ZeroCopySendOptedIn`, `RegisterBuffersEnabled`, `AdaptiveProvidedBufferSizingEnabled`, `ProvidedBufferSize`, `PrepareQueueCapacity`, `CancellationQueueCapacity`. Logged once via `SocketsTelemetry.Log.ReportIoUringResolvedConfiguration` and `NetEventSource.Info`.

4. Feature Inventory
Complete Feature Stack
- Direct SQE writes through `IoUringSqe*` pointers via the mmap'd ring
- `_multishotAcceptState` (0=disarmed, 1=arming, otherwise encoded user_data)
- `_persistentMultishotRecvDataQueue`
- Feature switches (`[FeatureSwitchDefinition]` + env var); JIT-eliminable when the switch is false
- `IoUringCompletionDispatchKind` eliminates virtual dispatch on the CQE hot path
- `IORING_SETUP_CLOEXEC` flag with static assert in shim; fcntl fallback; dedicated test
- Forced-result test hooks (`#if DEBUG`), per-opcode mask
- `[Conditional("DEBUG")]` `AssertSingleThreadAccess` at CQE dispatch entry points; mmap offset bounds validation

5. Configuration Surface
Production Environment Variables
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING` - set to `"1"` to enable
- `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_SQPOLL` - set to `"1"` to enable

Production AppContext Switches
- `System.Net.Sockets.UseIoUring` (default false, `[FeatureSwitchDefinition]`)
- `System.Net.Sockets.UseIoUringSqPoll` (default false; `[FeatureSwitchDefinition]` enables JIT elimination)
- `System.Net.Sockets.IoUringAdaptiveBufferSizing` (default false)

Precedence: Environment variable wins over AppContext switch for the master gate. SQPOLL requires both surfaces enabled (dual opt-in).
SQPOLL dual opt-in: both the AppContext switch AND the environment variable must be enabled. The AppContext switch is the outer gate - if false, `IsSqPollRequested()` returns immediately without checking the env var, and the JIT can statically eliminate all SQPOLL branches.

Debug-Only Test Controls
All `DOTNET_SYSTEM_NET_SOCKETS_IO_URING_TEST_*` environment variables are gated behind `#if DEBUG`:

- `TEST_DIRECT_SQE` (0/1): disable/enable direct SQE submission
- `TEST_ZERO_COPY_SEND` (0/1): disable/enable zero-copy send
- `TEST_REGISTER_BUFFERS`: control registered buffer behavior
- `TEST_PROVIDED_BUFFER_SIZE`: override provided buffer size
- `TEST_ADAPTIVE_BUFFER_SIZING` (1): force adaptive sizing on
- `TEST_PREPARE_QUEUE_CAPACITY`: override prepare queue capacity
- `TEST_QUEUE_ENTRIES`: override SQ ring size (must be a power of 2, 2-1024)
- `TEST_FORCE_EAGAIN_ONCE_MASK`: comma-separated opcode names for forced EAGAIN
- `TEST_FORCE_ECANCELED_ONCE_MASK`: comma-separated opcode names for forced ECANCELED

6. Safety and Correctness Measures
Fd Lifetime Management
Every direct SQE preparation takes a `DangerousAddRef` on the socket's `SafeSocketHandle`, stored in `_completionSlotStorage[slotIndex].DangerousRefSocketHandle`. This keeps the fd alive from SQE prep through CQE retirement, preventing fd-reuse races after close. The ref is released in `FreeCompletionSlot`.

Stale CQE Protection
Generation-based ABA protection. Each completion slot starts at generation 1. On free, the generation increments (wrapping from `2^43 - 1` to 1, skipping 0). CQE dispatch compares the CQE's encoded generation against the slot's current generation; mismatches are silently dropped as stale.

Zero-Copy Send Lifecycle
SEND_ZC produces two CQEs: a data completion and a NOTIF. The slot's `IsZeroCopySend` and `ZeroCopyNotificationPending` flags track this two-phase lifecycle. After the first CQE, the slot is kept alive and the tracked operation is reattached via `TryReattachTrackedIoUringOperation` (generation CAS from 0 to the new generation, then operation CAS from null to the operation). The NOTIF CQE triggers `HandleZeroCopyNotification`, which frees the slot and releases the pin hold.

Multishot Accept Arming
The `_multishotAcceptState` field uses a three-state protocol: `0` (disarmed), `1` (arming - SQE being written but user_data not yet published), or the encoded user_data value itself (armed). `GetArmedMultishotAcceptUserDataForCancellation` spins briefly if the arming transition is in flight.

Teardown Ordering
`LinuxFreeIoUringResources` follows a strict multi-phase teardown:

- `CleanupManagedRings` (also closes the ring fd, terminating the SQPOLL thread)
- `DrainQueuedIoUringOperationsForTeardown` (runs twice - once before and once after native port closure to catch late-arriving items)
- `DrainTrackedIoUringOperationsForTeardown`
- `NativeMemory.Free`

`CleanupManagedRings` nulls all mmap-derived pointers before unmapping to prevent use-after-unmap.

Nullable Avoidance
The SQE retry drain path avoids wrapping `SocketEventHandler` (a struct) in a `Nullable<T>` wrapper. Presence is tracked via a separate `drainHandlerInitialized` boolean, avoiding boxing pressure on the hot path.

SQE Size Validation
`TryGetNextManagedSqe` checks `ringInfo.SqeSize != (uint)sizeof(IoUringSqe)` at runtime, catching 128-byte SQE kernels that would corrupt the ring. `TryMmapRings` additionally rejects `SetupSqe128` negotiations.

7. Performance Optimizations
CQ Head Advance Batching
Outside of overflow recovery, CQ head advances are deferred: `_managedCachedCqHead` is incremented locally and the single `Volatile.Write` to `*_managedCqHeadPtr` happens once at the end of the drain batch (in the `finally` block). During overflow recovery, advances happen per-CQE to relieve kernel pressure.

SQE Zeroing
### SQE Zeroing

Each `TryGetNextManagedSqe` call writes `Unsafe.WriteUnaligned(sqe, default(IoUringSqe))` for JIT-vectorized 64-byte zeroing before returning the SQE. This eliminates stale-field concerns and lets each `Write*Sqe` method write only the fields it needs.

### SQE Writer Deduplication
Send-like operations share `WriteSendLikeSqe` (differing only by opcode: `Send` vs `SendZc`). Sendmsg-like operations share `WriteSendMsgLikeSqe` (`SendMsg` vs `SendMsgZc`). This reduces copy-paste without sacrificing readability.

### SQE Acquire With Retry
`TryAcquireManagedSqeWithRetry` attempts up to `MaxIoUringSqeAcquireSubmitAttempts` (16) rounds. Between retries, it runs `DrainCqeRingBatch` to free CQ slots, then submits pending SQEs. The drain handler is lazily initialized to avoid struct construction on the fast path.
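The retry loop can be sketched as follows (illustrative Python; the three callables stand in for the engine's internal methods, and the function name is invented):

```python
MAX_ATTEMPTS = 16  # mirrors MaxIoUringSqeAcquireSubmitAttempts

def try_acquire_sqe_with_retry(try_get_sqe, drain_cqes, submit_pending):
    """Illustrative bounded acquire/retry loop: when the SQ ring is full,
    drain the CQ to free slots, submit pending SQEs, and retry up to 16 rounds."""
    for _ in range(MAX_ATTEMPTS):
        sqe = try_get_sqe()
        if sqe is not None:
            return sqe
        drain_cqes()      # completing CQEs lets the kernel make forward progress
        submit_pending()  # push queued SQEs so ring slots can be reclaimed
    return None  # caller falls back (e.g. queues or fails the operation)
```

Bounding the loop matters: an unbounded retry against a wedged ring would spin forever, whereas 16 rounds converts persistent pressure into a visible fallback.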
### Completion Slot Drain Recovery

When `AllocateCompletionSlot` returns -1 (pool exhausted), the engine drains CQEs inline (guarded by `_completionSlotDrainInProgress` to prevent recursion) and retries allocation.
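A sketch of the guard-protected inline drain (illustrative Python; the pool class and the drain callback are hypothetical stand-ins, since draining CQEs is what releases slots back to the pool):

```python
class CompletionSlotPoolSketch:
    """Illustrative model: on exhaustion, drain CQEs inline to recycle slots,
    with a re-entrancy guard so a drain can never trigger a nested drain."""

    def __init__(self, capacity: int, drain) -> None:
        self.free = list(range(capacity))
        self._drain = drain               # may call release() as CQEs complete
        self._drain_in_progress = False   # _completionSlotDrainInProgress analogue

    def release(self, slot: int) -> None:
        self.free.append(slot)

    def allocate(self) -> int:
        if self.free:
            return self.free.pop()
        if self._drain_in_progress:
            return -1  # guard: refuse to drain recursively from within a drain
        self._drain_in_progress = True
        try:
            self._drain(self)  # inline CQE drain may free slots
        finally:
            self._drain_in_progress = False
        return self.free.pop() if self.free else -1
```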
### Provided Buffer Deferred Recycle

`BeginDeferredRecyclePublish`/`EndDeferredRecyclePublish` bracket the CQE drain loop. Buffer descriptor writes accumulate without individual `Volatile.Write` tail publishes; a single tail publish happens at `EndDeferredRecyclePublish`.

### Diagnostics Polling
Diagnostic counters are polled every `IoUringDiagnosticsPollInterval` (64) event loop iterations, not on every CQE. Managed deltas are accumulated in per-engine fields and published in batch to `SocketsTelemetry`.
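A sketch of the interval-batched publishing (illustrative Python; the class and field names are invented, with `published_total` playing the role of the shared `SocketsTelemetry` counter):

```python
POLL_INTERVAL = 64  # mirrors IoUringDiagnosticsPollInterval

class DiagnosticsPublisherSketch:
    """Illustrative model: accumulate counter deltas in a per-engine field and
    flush to shared telemetry every 64 event-loop iterations, not per CQE."""

    def __init__(self) -> None:
        self.pending_delta = 0    # per-engine accumulator (cheap, uncontended)
        self.published_total = 0  # shared telemetry counter (contended)
        self.iteration = 0

    def on_event_loop_iteration(self, delta: int) -> None:
        self.pending_delta += delta
        self.iteration += 1
        if self.iteration % POLL_INTERVAL == 0:
            # One batched publish amortizes the cost of touching shared state.
            self.published_total += self.pending_delta
            self.pending_delta = 0
```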
### Lazy Lock Allocation

`_multishotAcceptQueueGate` and `_persistentMultishotRecvDataGate` on `SocketAsyncContext` are lazy-initialized via `EnsureLockInitialized` (CAS from null). Most sockets never use these paths, so the `Lock` objects are only allocated when needed.
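The CAS-from-null pattern can be sketched as below (illustrative Python; a plain lock stands in for `Interlocked.CompareExchange`, and the class and method names are invented):

```python
import threading

class SocketContextSketch:
    """Illustrative lazy lock allocation: the gate field starts as None and is
    only allocated on first use; racing initializers agree on one winner."""

    def __init__(self) -> None:
        self._gate = None
        self._init_lock = threading.Lock()  # stand-in for a hardware CAS

    def ensure_gate_initialized(self):
        gate = self._gate
        if gate is not None:
            return gate  # fast path: already allocated, no synchronization
        candidate = threading.Lock()
        # CAS(null -> candidate): a racing loser simply discards its candidate.
        with self._init_lock:
            if self._gate is None:
                self._gate = candidate
        return self._gate
```

Since most sockets never hit these paths, the common case pays only a null check rather than an allocation per socket.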
### Event Loop Wait

The event loop first tries a non-blocking `DrainCqeRingBatch`. If no CQEs are available, it issues `io_uring_enter` with `GETEVENTS` and a 50ms `EXT_ARG` timeout (bounded wait). This trades worst-case 50ms latency for starvation resilience when eventfd wakes are missed or deferred.
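One iteration of that wait strategy, sketched with stand-in callables (illustrative Python; `wait_for_cqes` plays the role of `io_uring_enter` with `GETEVENTS` plus the `EXT_ARG` timeout, and the function name is invented):

```python
WAIT_TIMEOUT_MS = 50  # bounded wait described above

def event_loop_step(drain_nonblocking, wait_for_cqes):
    """Illustrative single event-loop iteration: drain without blocking first,
    and only fall back to a bounded kernel wait when nothing was available."""
    drained = drain_nonblocking()
    if drained:
        return drained  # CQEs were ready; no syscall needed
    # Bounded 50ms wait: even if an eventfd wake is missed or deferred,
    # the loop resumes within the timeout instead of stalling indefinitely.
    return wait_for_cqes(timeout_ms=WAIT_TIMEOUT_MS)
```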
## 8. Telemetry and Observability

### Stable PollingCounters (10)

Published when the EventSource is enabled on Linux. Counter names are centralized in `IoUringCounterNames`:

- `io-uring-prepare-nonpinnable-fallbacks`
- `io-uring-socket-event-buffer-full`
- `io-uring-cq-overflows`
- `io-uring-cq-overflow-recoveries`
- `io-uring-prepare-queue-overflows`
- `io-uring-prepare-queue-overflow-fallbacks`
- `io-uring-completion-slot-exhaustions`
- `io-uring-provided-buffer-depletions`
- `io-uring-sqpoll-wakeups`
- `io-uring-sqpoll-submissions-skipped`

### Diagnostic Backing Fields (17)
Written internally for structured logging and test access. Not published as PollingCounters. Include:

### Startup Events

- `ReportIoUringResolvedConfiguration`: logged once with all resolved config inputs
- `ReportSocketEngineBackendSelected` (event ID 7): reports io_uring vs. epoll selection and SQPOLL status
- `ReportIoUringSqPollNegotiatedWarning`: WARNING-level when SQPOLL is negotiated

### Structured Logging
`IoUringDiagnostics.Linux.cs` centralizes all log helpers with `NetEventSource.Info`/`Error`.

Counters are collectible via `dotnet-counters`, `dotnet-trace`, or any OpenTelemetry-compatible collector.

## 9. Test Coverage
### Test Access Architecture
The test project does not use `InternalsVisibleTo`. Instead:

- `IoUringTestAccessors.Linux.cs` (938 lines) defines all test-visible snapshot types and accessor methods inside `SocketAsyncEngine` (production assembly)
- `InternalTestShims.Linux.cs` (644 lines) in the test project mirrors these types and resolves them via reflection
- A `[DynamicDependency(DynamicallyAccessedMemberTypes.All, "System.Net.Sockets.SocketAsyncEngine", "System.Net.Sockets")]` attribute preserves all targets under trimming and AOT

### Test Suite (132 test methods across 6,665 lines)
Coverage areas:

- (`#if DEBUG`)
- `NativeMsghdrLayoutContract_IsStable` and `CompletionSlotLayoutContract_IsStable` verify ABI alignment via reflection
- `CqOverflow_ReflectionTargets_Stable` ensures field names are documented and stable
- `RingFd_HasCloexecFlag_Set` verifies the `FD_CLOEXEC` bit via `fcntl`

### Hard to Test In-Process
## 10. Graceful Degradation

## 11. Path to Default-On
SQPOLL will likely remain opt-in permanently due to its CPU cost trade-off.
### Future Kernel Features

## 12. Distribution Readiness

### Kernel Version Matrix
The minimum kernel cutoff is a single requirement: Linux 6.1. All sub-features beyond that are detected at runtime via opcode probing.
### Memory Overhead

## 13. Conclusion
This implementation delivers a complete io_uring integration with:
- `#if DEBUG`-gated test hooks for deterministic failure injection
- `[FeatureSwitchDefinition]` annotations for JIT elimination of SQPOLL branches
- `DangerousAddRef`/`DangerousRelease`

The managed-ring architecture (minimal native shim + C# ring management) trades a small initial complexity cost for long-term maintainability: standard .NET breakpoints, managed stack traces, EventSource telemetry, and xUnit tests in the same language as the implementation.
The code is production-ready with the current opt-in gate. The environment variable requirement is appropriate for the initial release. Graceful degradation means unexpected issues fall back to the proven epoll path.