[release/5.0] Fix regression in SocketAsyncContext.Unix (sequential processing of reads & writes) #46745
Conversation
…tnet#45683)

* SocketAsyncContext.Unix: fix double processing of AsyncOperations

  When a socket event occurs, the pending operations get triggered to continue their work by calling the Process method. The changes in dotnet#37974 cause Process to be called twice on the same AsyncOperation. When Process is called, the operation can complete, and the AsyncOperation instance may be reused for a different operation.

* Remove processAsyncEvents
Tagging subscribers to this area: @dotnet/ncl

Issue Details: Backport of #45683 (see the full PR description below). /cc @geoffkizer @tmds @karelz
Although it is very unlikely (if not impossible) that this will result in a perf regression, we need to run TechEmpower with these changes. @adamsitnik, can you help with this? I'm fine doing it myself after some quick offline guidance, but I'm out of practice.
https://github.com/aspnet/Benchmarks/blob/master/scenarios/README.md

```
dotnet tool install Microsoft.Crank.Controller --version "0.1.0-*" --global
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin
```

To profile and get a trace file:
This shouldn't impact TechEmpower, and if it does, we'd still want the change.
Did 3 runs each of the plaintext and json scenarios. As expected, the differences do not seem to be statistically significant.
plaintext (RPS)
json (RPS)
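For reference, the json numbers above can presumably be collected with the same crank invocation by swapping the scenario name; this assumes the scenario is defined as `json` in platform.benchmarks.yml (check the README linked above):

```
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml --scenario json --profile aspnet-citrine-lin
```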
@antonfirsov @geoffkizer thanks!
LGTM, thank you for verifying perf before merging the changes, @antonfirsov!
@danmosemsft this is ready for 5.0 servicing ...
Tactics is sympathetic but is looking for more clarity around how likely this is to happen, given that no customers have reported it. Maybe @geoffkizer could come to a future Tactics meeting.
It's pretty unlikely, because it requires a combination of a couple of factors, plus lucky timing where both notifications occur at the same time. It's a bit more likely to happen under load, because load stresses the timing factors. Even when it does happen, the impact is largely benign unless you happen to do something complicated in the first callback, which is unusual. The problem is that if you do hit this, the impact could be very, very bad (deadlock) and extremely hard to diagnose.
I wonder how we can help Tactics make a decision (about whether to wait for customer evidence of impact or not). @geoffkizer, any concerns about the risk here? It looks like the code was refactored since 3.1, so I can't simply "eyeball it and see it's the same", and I don't know how to evaluate the risk of something in this code. I wonder if we could/should do more to reduce the risk somehow; that would reduce the numerator in risk/benefit. @stephentoub, any other perspective on the risk here?
Here's the change that introduced the bug: antonfirsov@205f997#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176
The only code that matters here is the change to ProcessSyncEventOrGetAsyncEvent. The rest of the code only kicks in when you have a particular app context switch set to enable "inline completions" (which is what that PR was intended to introduce).
Here's the fix we want to backport: antonfirsov@6e12567#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176
Again, just look at ProcessSyncEventOrGetAsyncEvent and ignore the changes in HandleInlineCompletions. You'll see that the fix simply reverts the code in ProcessSyncEventOrGetAsyncEvent to what it did previously.
It's true that all this code has been refactored a bit since 3.1, but at least we are reverting to a known state in 5.0 here.
(BTW if anyone knows how to make GitHub show that diff directly, please let me know :))
All that said, this code did change a fair amount during 5.0, so I'm not sure how much stock to put into reverting to this previous 5.0 behavior. It looks like ProcessSyncEventOrGetAsyncEvent and the related logic were heavily refactored in this change and some previous ones: antonfirsov@8157271#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176 That change landed something like 6 weeks prior to the bad PR above, so the "known 5.0 state" was only such for about 6 weeks in the middle of 5.0.
Adding @kouvel, who made the change I mentioned in the last comment. The only thing that I think is potentially scary here is that with this fix, we now go through the "two operations to process, so schedule one" path in HandleEvents. This path didn't exist in 3.1, and was only really exercised for that 6-week period during 5.0 before the bad PR went in. Kount, any thoughts here?
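To make the "two operations to process, so schedule one" idea concrete, here is a simplified, hypothetical sketch. It is not the actual SocketAsyncContext/HandleEvents code; the type and member names (OperationQueue, ProcessPendingOperations, SocketContextSketch) are made up for illustration. The point is only the dispatch decision: when both a read and a write are ready, one queue is handed to the thread pool so the two user callbacks do not run sequentially.

```csharp
using System.Threading;

// Hypothetical illustration only; not the runtime's SocketAsyncContext.Unix code.
sealed class OperationQueue
{
    public void ProcessPendingOperations()
    {
        // Perform the ready IO for this queue and invoke the user callback(s).
    }
}

sealed class SocketContextSketch
{
    private readonly OperationQueue _receiveQueue = new OperationQueue();
    private readonly OperationQueue _sendQueue = new OperationQueue();

    public void HandleEvents(bool readReady, bool writeReady)
    {
        if (readReady && writeReady)
        {
            // Both a read and a write are ready: schedule one queue to the
            // thread pool so its callback can run concurrently with the other,
            // instead of waiting for the first callback to return.
            ThreadPool.UnsafeQueueUserWorkItem(
                static q => q.ProcessPendingOperations(), _receiveQueue, preferLocal: false);
            _sendQueue.ProcessPendingOperations();
        }
        else if (readReady)
        {
            _receiveQueue.ProcessPendingOperations();
        }
        else if (writeReady)
        {
            _sendQueue.ProcessPendingOperations();
        }
    }
}
```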
The fix is related to this comment: runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs, lines 2097 to 2109 in 6e12567.
The bug causes code like this to deadlock:

```csharp
Task sendOp = socket.SendAsync(...);
Task receiveOp = socket.ReceiveAsync(...);
await receiveOp;
sendOp.Wait();
```

I don't think this is a common pattern when using a Socket. Independent send and receive loops would be something like:

Receive loop:

```csharp
while (true)
{
    await socket.ReceiveAsync(...);
    ProcessData();
}
```

Send loop:

```csharp
while (true)
{
    await socket.SendAsync(...);
}
```

When both events happen simultaneously, the send completion isn't processed until ProcessData() in the receive loop returns. It is not likely to matter, but it would be hard to debug that the send loop is being stalled by the receive loop.
A couple of questions regarding:
When runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs Lines 906 to 910 in 6072e4d
Ah, I missed that the default for the inline-completions setting is off.
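For context, here is a sketch of how an application would opt in to inline completions via AppContext. This is a hypothetical illustration: the switch name used below is an assumption and should be verified against the 5.0 sources; the comments above only say that a particular app context switch enables the feature and that it is off by default.

```csharp
using System;

class InlineCompletionsOptIn
{
    static void Main()
    {
        // Assumed switch name, not confirmed from the runtime sources.
        const string SwitchName = "System.Net.Sockets.InlineCompletions";

        // Opt in before any sockets are created; when the switch is unset,
        // the default leaves the inline-completions path disabled.
        AppContext.SetSwitch(SwitchName, true);

        bool enabled = AppContext.TryGetSwitch(SwitchName, out bool value) && value;
        Console.WriteLine($"Inline completions requested: {enabled}");
    }
}
```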
I'm gonna mark no-merge for a bit until we're ready to go back to Tactics with our updated thoughts and reasoning.
Removing servicing-consider for now. Let's apply it again when you believe we are ready to go to Tactics to discuss risk/benefit again.
@geoffkizer are we ready to go back to Tactics?
@danmosemsft we will bring it to Tactics on Thu if it is ok -- @geoffkizer should be there to talk about it (just sent him an invite).
Sounds good to me
(Bar check, as no customer reports at this stage.)
Backport of #45683
Fixes #45673
/cc @geoffkizer @tmds @karelz
Customer Impact
Found by code inspection by @tmds (RedHat)
When we get epoll notifications for a particular socket, we can get either a read notification, a write notification, or both. Each notification will cause us to perform the IO and then invoke the user callback.
Prior to PR #37974, when we got both notifications, we would dispatch one notification to the thread pool so that both user callbacks can be processed concurrently.
Unfortunately, #37974 inadvertently broke this behavior, and instead resulted in the notifications being processed in sequence. This means that the second IO operation and callback won't be invoked until the first callback completes, which could in theory take arbitrarily long.
This can lead to unexpected behavior and, at worst, deadlocks. It's probably not that common in practice, but it would be extremely hard to diagnose if it was hit.
Regression?
Yes, caused by #37974 in 5.0.
Testing
It's not really possible to write a test for this because it requires epoll signalling both the read and the write notification in the same event, which is timing dependent and very difficult to make happen consistently.
Performance
Since the change alters a performance-critical path, we decided to run some of the TechEmpower benchmarks. As expected, there is no regression. For results see #46745 (comment).
Risk
Low. The fix basically just reverts to the previous correct behavior here (the same behavior as 3.1) and isolates the changes from #37974 so they do not affect the common path. The fix has been in master for a month. We have run TechEmpower as a perf check, which also served as a stress test (although TechEmpower didn't hit the problem before it was fixed either).