Parallelize epoll events on thread pool and process events in the same thread #35330
Conversation
…work and process write work in same thread
Tagging subscribers to this area: @dotnet/ncl
Perf on JsonPlatform benchmark, measured on:
- 12-proc x64 machine
- 28-proc x64 machine
- 56-proc x64 machine (2-socket), limited to 1 socket
- 56-proc x64 machine (2-socket), not limited
- 32-proc arm64 machine
The event-processing work items can theoretically become long-running, probably should fix
Thanks for working on this.
```csharp
{
    ManualResetEventSlim? e = Event;
    if (e != null)
    {
        // Sync operation. Signal waiting thread to continue processing.
        e.Set();
    }
    else if (processAsyncOperationSynchronously)
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to change slightly how this is structured, in particular for the `e != null` case. We get there if a synchronous operation is performed on a Socket on which an asynchronous operation was ever performed. In such cases, we've permanently moved the socket to be non-blocking, which means we need to simulate the blocking behavior of all subsequent sync operations, and we do that by using an MRES instead of a callback. We don't want to require a thread pool thread just to set that event, as doing so could lead to thread pool starvation, with a sync operation waiting on a thread pool thread for a work item to be processed by the thread pool that will unblock it. So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.
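For illustration, here's a minimal sketch of that shape; `PendingOperation` and `OnSocketEvent` are invented stand-ins for the runtime's internal types, not the actual implementation:

```csharp
using System;
using System.Threading;

// Invented stand-in for the runtime's pending-operation state.
sealed class PendingOperation
{
    // Non-null when a synchronous caller is blocked waiting for readiness.
    public ManualResetEventSlim? Event;

    // Non-null for an async operation; runs the user continuation.
    public Action? Callback;
}

static class EpollThread
{
    public static void OnSocketEvent(PendingOperation op)
    {
        ManualResetEventSlim? e = op.Event;
        if (e != null)
        {
            // Sync-over-nonblocking: the waiter may itself occupy a thread
            // pool thread, so signal it inline from the epoll thread rather
            // than via a work item that starvation could delay indefinitely.
            e.Set();
        }
        else if (op.Callback is Action cb)
        {
            // Async completion: defer the continuation to the thread pool.
            ThreadPool.UnsafeQueueUserWorkItem(static s => ((Action)s!)(), cb);
        }
    }
}
```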
> So, we want the epoll thread to set such an MRES rather than queuing a work item to do it.
I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.
To set the MRES on an epoll thread we would need to check two queues (send and receive), which would require taking two locks:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs, line 832 in 6ef538d: `using (Lock())`
and then perform zero to two casts (depending on whether the queues are empty or not):
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs, line 137 in 723e7f2: `get { return CallbackOrEvent as ManualResetEventSlim; }`
Of course we would have to do that for every epoll event returned by epoll_wait.
With @kouvel's proposal, after we receive an epoll notification we just add a bunch of simple events to a queue (which is very fast) and schedule a work item to the thread pool.
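A rough sketch of that batching flow; all types and names here are invented for illustration, and the actual implementation differs:

```csharp
using System.Collections.Concurrent;
using System.Threading;

// Invented lightweight record for a socket readiness notification.
readonly struct SocketIOEvent
{
    public SocketIOEvent(object context, int events)
    {
        Context = context;
        Events = events;
    }

    public object Context { get; }
    public int Events { get; }
}

sealed class EventDispatcher
{
    private readonly ConcurrentQueue<SocketIOEvent> _eventQueue = new();

    // Called on the epoll thread after epoll_wait returns a batch of events.
    public void OnEpollWaitCompleted(SocketIOEvent[] batch, int count)
    {
        for (int i = 0; i < count; i++)
        {
            // Cheap enqueue; no per-socket locks are taken on this thread.
            _eventQueue.Enqueue(batch[i]);
        }

        // One thread pool dispatch for the whole batch; the work item drains
        // the queue rather than paying the dispatch cost per event.
        ThreadPool.UnsafeQueueUserWorkItem(static d => ((EventDispatcher)d!).ProcessEvents(), this);
    }

    private void ProcessEvents()
    {
        while (_eventQueue.TryDequeue(out SocketIOEvent e))
        {
            // Translate e.Context to the socket's async context and complete
            // or continue the ready operations (elided in this sketch).
            _ = e;
        }
    }
}
```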
> I am afraid that it would require us to make the enqueueing more complex and decrease throughput in a noticeable way.
We only need to do it if the event is for a sync operation, which can be checked cheaply.
The alternative is potential deadlock / long delays while waiting for the thread pool's starvation detection to introduce more threads.
> We only need to do it if the event is for a sync operation, which can be checked cheaply.
But we don't know this when we receive the epoll notification. To check it we need to translate the socket handle to a socket context:
runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs, line 333 in 6ef538d: `_handleToContextMap.TryGetValue(handle, out SocketAsyncContext? context);`
and then take the two locks that I've described above.
Then we could consider having a dedicated epoll thread for sync-over-nonblocking sockets. EDIT: I realize this won't work well, as the socket may already be associated with a particular epoll.
I understand your pushback, but I think this is a big deal. Convince me it's not if you disagree :)
The operation may be cancelled by the time this gets dequeued for execution.
Looks like `ProcessQueuedOperation` accounts for that.
So the suggested change is something like:
- `processAsyncOperationSynchronously` becomes `processAsyncOperationOnConcurrentQueue`, and enqueues to the `ConcurrentQueue<AsyncOperation>`.
- `HandleEvents` gets called directly on the epoll thread; `ConcurrentQueue` processing is deferred to the ThreadPool after calling `HandleEvents`.
Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K. In the new commits I made changes to track whether the first operation in each queue is synchronous and to check that flag speculatively from the epoll thread. The speculative check, along with the concurrent dictionary dequeue, is not making a noticeable difference to RPS. Considering that acting on a stale result is still correct, it seems like a speculative check would be enough to avoid the starvation issue.
I don't think there's a need to enqueue the `AsyncOperation` to the queue, but in any case it involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K on top of taking the locks; I'm not fully sure why, though there is a bit more work involved in extracting those.
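A minimal sketch of that speculative-flag pattern, with invented names that mirror the discussion; the real queue logic is more involved:

```csharp
using System.Threading;

// Illustrative only: the queue maintains a flag, updated under its lock,
// indicating whether the operation at the head is synchronous. The epoll
// thread reads the flag without taking the lock; a stale read is tolerable
// because both resulting code paths remain correct.
sealed class OperationQueue
{
    private readonly object _lock = new();
    private bool _isNextOperationSynchronous; // written only under _lock

    // Lock-free, possibly stale read intended for the epoll thread.
    public bool IsNextOperationSynchronous_Speculative =>
        Volatile.Read(ref _isNextOperationSynchronous);

    // Called whenever the head of the queue changes, under the queue's lock.
    public void OnHeadChanged(bool headIsSynchronous)
    {
        lock (_lock)
        {
            _isNextOperationSynchronous = headIsSynchronous;
        }
    }
}
```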
The code would be simpler using `ConcurrentQueue<AsyncOperation>`.
> Taking the lock for each operation queue on the epoll thread seems to reduce RPS by ~20-25 K.
I plan to take a shot at replacing these locks with `Interlocked` operations.
> it involves taking the lock, and doing so seems to reduce RPS by an additional 15-20 K on top of taking the locks; I'm not fully sure why, though there is a bit more work involved in extracting those.
What lock does this refer to?
It's the lock taken in `SocketAsyncContext.HandleEvent` before this change (linked above by @adamsitnik); in this change, the lock taken in `SocketAsyncContext.ProcessSyncEventOrGetAsyncEvent`.
> The code would be simpler using `ConcurrentQueue<AsyncOperation>`.
Agreed. I wanted to get that to work and tried it first, but there currently seem to be obstacles to doing that.
```
@@ -387,5 +445,17 @@ private bool TryRegister(SafeSocketHandle socket, IntPtr handle, out Interop.Err
         Interop.Sys.SocketEvents.Read | Interop.Sys.SocketEvents.Write, handle);
     return error == Interop.Error.SUCCESS;
 }

 private struct Event
```
`readonly`
+1 for `readonly`. I think we could also give it a less generic name, `EpollEvent` for example.

```diff
- private struct Event
+ private readonly struct EpollEvent
```
BTW, do you think that changing the `SocketEvents` size from 4 bytes to 1 could improve the perf in any way?
runtime/src/libraries/Common/src/Interop/Unix/System.Native/Interop.SocketEvent.cs, line 13 in 6ef538d: `internal enum SocketEvents : int`
You're suggesting that to help reduce the size of EpollEvent? It wouldn't change its size; it would just result in more of the same space being padding. In fact, I think it could actually be slightly worse in terms of codegen, e.g.
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8ABAJgEYBYAKGIGYACMhgYQYG8aHunHjykDAKIA3GADsM5VrGwYY5ABQBJSQAUMUBgAts4gCYAbGGgYAhAJ7yGI7IYCuMAJQMAvAD4G4mAHdhYySVdA2NTWwdnAG4uHnomAX8JDFIZGDkYUhV1TR09IxMGAEtJGztHFw8vX0TJTOD8sLKomO4W3gZcTXswDBqpNs5qHmH21QwNLQAJPONooZHuOOArGAYANSa5hdb5hbjRJKUxidyQguXrcPK24cHtkemzt1P8rfueDYjnq5g3hYBfNqA3YMNpxTpQbq9A61AY3WKMY45R6veGLRjFXqfRx/EZo9ow5JZcY5eqhIolH5OfF3d7cFHGZ5k3744bY1auUoRXHDYG8mj/IA===
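A quick, self-contained way to see the padding point; the field layouts below are illustrative, not the actual runtime structs:

```csharp
using System;
using System.Runtime.CompilerServices;

enum EventsInt : int { None = 0 }
enum EventsByte : byte { None = 0 }

// A handle-sized field forces 8-byte alignment of the whole struct on a
// 64-bit runtime, so shrinking the enum just turns data into padding.
struct EventWithInt  { public IntPtr Handle; public EventsInt Events; }
struct EventWithByte { public IntPtr Handle; public EventsByte Events; }

class Program
{
    static void Main()
    {
        // On a 64-bit runtime both typically print 16.
        Console.WriteLine(Unsafe.SizeOf<EventWithInt>());
        Console.WriteLine(Unsafe.SizeOf<EventWithByte>());
    }
}
```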
Done
By only executing at most N work items and then scheduling a replica and exiting? Seems reasonable.
Yea, that's what I was thinking. Maybe a batch of …
A smaller threshold would be better for that issue but probably worse for perf. For instance, if processing a request takes 10 ms, it wouldn't take running many of those for the work item to appear long-running. Maybe also a time-based thing like the thread pool does, or something else.
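A hedged sketch of what such a bound could look like; the batch size, time slice, and type names are invented placeholders, not what the PR actually does:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

sealed class EventProcessor
{
    private const int MaxEventsPerWorkItem = 32;               // assumed batch size
    private static readonly TimeSpan TimeSlice = TimeSpan.FromMilliseconds(30); // assumed slice

    private readonly ConcurrentQueue<Action> _events = new();

    public void Enqueue(Action handleEvent) => _events.Enqueue(handleEvent);

    public void ProcessBatch()
    {
        int start = Environment.TickCount;

        for (int i = 0; i < MaxEventsPerWorkItem; i++)
        {
            if (!_events.TryDequeue(out Action? handle))
                return; // queue drained; no replica needed

            handle();

            if (Environment.TickCount - start >= TimeSlice.TotalMilliseconds)
                break; // time slice used up; hand off the rest
        }

        // More events may remain: schedule a replica and exit, so this work
        // item never looks long-running to the thread pool.
        ThreadPool.UnsafeQueueUserWorkItem(static p => ((EventProcessor)p!).ProcessBatch(), this);
    }
}
```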
Co-Authored-By: Stephen Toub <stoub@microsoft.com>
```csharp
// An event was successfully dequeued, and as there may be more events to process, speculatively schedule a work
// item to parallelize processing of events. Since this is only for additional parallelization, doing so
// speculatively is ok.
```
Maybe worth mentioning explicitly that the parallelization makes it impossible for continuations to block one another.
I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here. If a continuation blocks on a synchronous socket operation the next epoll event that serves that blocking operation would schedule another work item. The parallelization done here is only for making use of more procs.
> I don't think it's possible for a continuation to block on another continuation for the same socket even without the parallelization done here.
It may be on different sockets. And we shouldn't rely on the next epoll_wait with events to get things moving.
If parallelization is not done here, and if a blocking socket operation is performed on this thread as a result of running a user continuation (on any socket), then the set of IO events that are already queued would not release that blocking operation anyway, since it's a new operation that got added. Only another epoll event would get that blocking operation moving (that was true before this change too).
However, if it's some other kind of blocking related to already-queued events, for example if a user continuation blocks waiting for another already-in-progress socket operation to complete, then the parallelization done here would help unblock that more quickly, though it's not guaranteed and can still lead to thread pool starvation issues; those kinds of blocking could already cause potential thread pool starvation before this change.
I'm trying to understand whether there would be a correctness issue from not parallelizing here. If there is a possibility of a correctness issue, then it may be necessary to queue up a replica before the loop instead.
To alleviate potential issues from other kinds of blocking in user callbacks, maybe it would be safer to ensure there is a queued replica (non-speculatively) if one has not yet been scheduled.
Made the change I mentioned above in the latest update
Maybe something that relates to how many events we get from epoll?
No significant change to perf on the 28-proc x64 machine with the latest changes
…irst dequeue, delegating scheduling of more work items to other threads
Addressed feedback above and added a few more comments.
Updated numbers below with the preview 5 SDK. These are with hill climbing disabled.

JsonPlatform, measured on:
- 28-proc x64 machine
- 12-proc x64 machine
- 32-proc arm64 machine
- 12-proc x64 machine with cpuset 0-3, to sort of simulate a smaller VM

I'm not seeing an issue with epoll thread count > 1. Decreasing latency between getting an epoll notification and scheduling a thread didn't seem to help.

FortunesPlatform: this benchmark seems to be affected by the number of connections and epoll threads. On the x64 machines, in some cases with 512 connections and 1 epoll thread the change seems to perform slightly worse than the baseline, while with 256 connections and 1 epoll thread it seems to perform slightly better. Measured on:
- 28-proc x64 machine
- 12-proc x64 machine
- 32-proc arm64 machine
- 12-proc x64 machine with cpuset 0-3, to sort of simulate a smaller VM
For the sync operation case I tried having a server do synchronous reads after an async operation on 256 sockets, while a client writes to those sockets using async operations. The starvation issue appeared very quickly before the fixes (within a couple of seconds). With the baseline, and after the fixes, it did not hit a noticeable starvation issue after minutes. I don't think it would be easy to repro the races, or at least it seems like they would not be frequent enough to trigger a starvation sequence.
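A rough, hypothetical version of that repro, shown with a single connection (the test described above used 256 sockets); the structure is what matters: one async receive forces the socket into nonblocking mode, so later sync reads go through the simulated-blocking path.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

class Repro
{
    static async Task Main()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        int port = ((IPEndPoint)listener.LocalEndpoint).Port;

        using var client = new TcpClient();
        await client.ConnectAsync(IPAddress.Loopback, port);
        using TcpClient accepted = await listener.AcceptTcpClientAsync();

        Socket server = accepted.Client;
        byte[] buffer = new byte[1];

        // One async operation permanently makes the socket nonblocking.
        await client.GetStream().WriteAsync(new byte[] { 1 });
        await server.ReceiveAsync(new ArraySegment<byte>(buffer), SocketFlags.None);

        // Subsequent synchronous reads now rely on the event-based
        // (sync-over-nonblocking) path that the starvation concern is about.
        for (int i = 0; i < 1000; i++)
        {
            await client.GetStream().WriteAsync(new byte[] { 1 });
            server.Receive(buffer); // sync read on a now-nonblocking socket
        }
    }
}
```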
Hopefully this PR is close to check-in now. I probably won't be able to spend much time on it this/next week, but let me know if I can help to unblock.
Looks great to me! Thank you @kouvel!!
The PR looks ready to me. @stephentoub @tmds, is there anything that should be addressed? If not, I would like to merge this PR.
It does have 2 types of socket connection, one for the DB and another for HTTP, so the dual load might be a factor.
@sebastienros kindly collected some numbers on an x64 VM with 4 procs in the same modes as above; here are the results for JsonPlatform and FortunesPlatform.

Numbers are pretty close; there doesn't appear to be a noticeable regression. On this VM, FortunesPlatform seems to perform better with more epoll threads both before and after the change. The extra load may have something to do with it; I'm not seeing a clear pattern yet, but the diff between 8 and 1 epoll threads seems to be higher than on the other machines.
Ok, thanks for fixing and confirming.
First updated numbers should be available tomorrow morning.
```csharp
if ((events & Interop.Sys.SocketEvents.Read) != 0 &&
    _receiveQueue.IsNextOperationSynchronous_Speculative &&
    _receiveQueue.ProcessSyncEventOrGetAsyncEvent(this) == null)
```
@kouvel @stephentoub @adamsitnik I think there is an issue when `IsNextOperationSynchronous_Speculative` is `true` but the operation is not really a SyncEvent operation. The queue moves to `Processing` but no one dispatches the operation.
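An illustrative, self-contained model of that race and the shape of a fix; every name here is invented, it models only a single pending head operation, and the real queue logic is more involved:

```csharp
using System;
using System.Threading;

sealed class SimpleOperationQueue
{
    public sealed record QueuedOp(bool IsSynchronous, ManualResetEventSlim? Waiter, Action? Continuation);

    private readonly object _lock = new();
    private QueuedOp? _head;
    private bool _nextIsSync; // speculative hint; may be stale when read lock-free

    public bool IsNextOperationSynchronous_Speculative => Volatile.Read(ref _nextIsSync);

    public void Enqueue(QueuedOp op)
    {
        lock (_lock)
        {
            _head = op;
            _nextIsSync = op.IsSynchronous;
        }
    }

    // Returns null if a sync waiter was signaled; otherwise returns the async
    // operation so the caller dispatches it. When the speculative hint was
    // stale (the head is actually async), no queue state is changed here;
    // transitioning to a Processing-like state while dispatching nothing is
    // exactly the stall described above.
    public QueuedOp? ProcessSyncEventOrGetAsyncEvent()
    {
        lock (_lock)
        {
            QueuedOp? op = _head;
            if (op == null)
                return null;

            if (op.IsSynchronous)
            {
                _head = null;
                op.Waiter?.Set(); // unblock the sync caller from the epoll thread
                return null;
            }

            return op; // stale hint: hand the async op back for dispatch
        }
    }
}
```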
Nice catch, will fix
Fixes dotnet#35330 (comment) by skipping state transitions when an async operation needs to be processed.
Fixes #35330 (comment) by skipping state transitions when an async operation needs to be processed.
Applies the technique from dotnet/runtime#35330 to IOQueue.
It was seen that on larger machines, when fewer epoll threads are used, throughput drops significantly. One issue was that the epoll threads were not able to queue work items to the thread pool quickly enough to keep thread pool threads fully occupied. When the thread pool is not fully occupied, thread pool threads end up waiting for work, and enqueues are much slower because they need to release a thread; this creates a positive feedback loop, with lots of thread pool threads being released to look for work items and not finding any. It also doesn't help that the thread pool requests many more threads than necessary for the number of work items enqueued, and that the enqueue overhead is repeated for each epoll socket event.
Following @adamsitnik's idea of batching the enqueues to the thread pool and requesting only one thread to limit the overhead, this change tries to reduce the overhead on the epoll threads by delegating it to the thread pool and automatically parallelizing that work, and to decrease the number of redundant thread pool work items a bit.