[release/5.0] Fix regression in SocketAsyncContext.Unix (sequential processing of reads & writes) #46745

Merged
merged 1 commit into dotnet:release/5.0 on Feb 11, 2021

Conversation

antonfirsov
Member

@antonfirsov antonfirsov commented Jan 8, 2021

(Servicing bar check, as there are no customer reports at this stage.)

Backport of #45683
Fixes #45673

/cc @geoffkizer @tmds @karelz

Customer Impact

Found by code inspection by @tmds (Red Hat).

When we get epoll notifications for a particular socket, we can get either a read notification, a write notification, or both. Each notification will cause us to perform the IO and then invoke the user callback.

Prior to PR #37974, when we got both notifications, we would dispatch one notification to the thread pool so that both user callbacks can be processed concurrently.

Unfortunately, #37974 inadvertently broke this behavior, and instead resulted in the notifications being processed in sequence. This means that the second IO operation and callback won't be invoked until the first callback completes, which could in theory take arbitrarily long.

This can lead to unexpected behavior and, at worst, deadlocks. It's probably not that common in practice, but it would be extremely hard to diagnose if it was hit.
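
For illustration only (this snippet is not from the PR or the linked issues), the kind of user code that could be affected looks roughly like the sketch below; the loopback setup, buffer sizes, and method name are assumptions made to keep it self-contained, and actually hitting the bad timing is not deterministic (see Testing):

// Minimal sketch, assuming a connected TCP socket pair over loopback. If the read and
// write notifications arrive in the same epoll event and are processed sequentially,
// the send task's completion is queued behind the receive continuation, so blocking on
// it there can stall or, at worst, deadlock.
// Requires: using System.Net; using System.Net.Sockets; using System.Threading.Tasks;
static async Task IllustrateAsync()
{
    using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));
    listener.Listen(1);

    using var client = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    await client.ConnectAsync(listener.LocalEndPoint!);
    using Socket server = await listener.AcceptAsync();

    // A send large enough that it may pend until the peer drains it, plus a pending receive.
    byte[] payload = new byte[4 * 1024 * 1024];
    Task<int> send = client.SendAsync(new ArraySegment<byte>(payload), SocketFlags.None);
    Task<int> recv = client.ReceiveAsync(new ArraySegment<byte>(new byte[1]), SocketFlags.None);

    // The peer drains the payload (making 'client' writable again) and then sends one
    // byte (making 'client' readable), so both notifications can arrive close together.
    Task peer = Task.Run(async () =>
    {
        byte[] sink = new byte[64 * 1024];
        for (int drained = 0; drained < payload.Length;)
            drained += await server.ReceiveAsync(new ArraySegment<byte>(sink), SocketFlags.None);
        await server.SendAsync(new ArraySegment<byte>(new byte[1]), SocketFlags.None);
    });

    await recv;   // the receive callback runs when the epoll notification is processed
    send.Wait();  // only safe if the send completion is dispatched concurrently (the pre-#37974 behavior)
    await peer;
}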

Regression?

Yes, caused by #37974 in 5.0.

Testing

It's not really possible to write a test for this, because it requires epoll to signal both the read and the write notification in the same event, which is timing dependent and very difficult to make happen consistently.

Performance

Since the change alters a performance-critical path, we decided to run some of the TechEmpower benchmarks. As expected, there is no regression. For results see #46745 (comment).

Risk

Low. The fix basically reverts to the previous correct behavior here (the same behavior as 3.1) and isolates the changes from #37974 so they do not affect the common path. The fix has been in master for a month. We have run TechEmpower as a perf check, which also provided some stress coverage (although TechEmpower didn't hit the problem before it was fixed either).

…tnet#45683)

* SocketAsyncContext.Unix: fix double processing of AsyncOperations

When a socket event occurs, the pending operations get triggered
to continue their work by calling the Process method.

The changes in dotnet#37974
cause Process to be called twice on the same AsyncOperation.

When Process is called, the operation can complete, and the
AsyncOperation instance may be reused for a different operation.

* Remove processAsyncEvents
@ghost

ghost commented Jan 8, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details


Author: antonfirsov
Assignees: -
Labels:

area-System.Net.Sockets

Milestone: -

@antonfirsov
Member Author

antonfirsov commented Jan 8, 2021

Although it is very unlikely (if not impossible) that this will result in a perf regression, we need to run TechEmpower with these changes.

@adamsitnik can you help with this? I'm fine doing it myself after a quick offline guidance, but I'm out of practice.

@antonfirsov antonfirsov added the Servicing-consider Issue for next servicing release review label Jan 8, 2021
@antonfirsov antonfirsov requested a review from a team January 8, 2021 14:05

@adamsitnik
Member

can you help with this? I'm fine doing it myself after a quick offline guidance, but I'm out of practice

https://github.com/aspnet/Benchmarks/blob/master/scenarios/README.md

dotnet tool install Microsoft.Crank.Controller --version "0.1.0-*" --global
crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/master/scenarios/platform.benchmarks.yml \
      --scenario plaintext --profile aspnet-citrine-lin

To profile and get a trace file: --application.collect true
To use a given .dll in the published app: --application.options.outputFiles $pathToFile.dll

@tmds
Member

tmds commented Jan 8, 2021

Although it is very unlikely (if not impossible) that this will result in a perf regression, we need to run TechEmpower with these changes.

This shouldn't impact TechEmpower, and if it does, we'd still want the change.
So this is mostly to check the impact.

@antonfirsov
Member Author

antonfirsov commented Jan 8, 2021

I did 3 runs each of the plaintext and json scenarios. As expected, the differences do not appear to be statistically significant.

plaintext (RPS)

            release/5.0    PR            delta
            10,764,605     10,781,788
            10,835,379     10,909,136
            10,798,054     10,806,531
  average   10,799,346     10,832,485    0.31%

json (RPS)

            release/5.0    PR             delta
            1,212,258      1,199,259
            1,209,587      1,197,367
            1,199,528      1,212,231
  average   1,207,124.333  1,202,952.333  -0.35%

@tmds
Member

tmds commented Jan 12, 2021

@antonfirsov @geoffkizer thanks!


Member

@adamsitnik adamsitnik left a comment


LGTM, thank you for verifying perf before merging the changes, @antonfirsov!

@karelz
Member

karelz commented Jan 12, 2021

@danmosemsft this is ready for 5.0 servicing ...

@antonfirsov
Member Author

Test failures are unrelated:

#15376, #41530, #41531 and a hang in System.Threading.Channels.Tests.BoundedChannelTests.ManyProducerConsumer_ConcurrentReadWrite_WithBufferedCapacity_Success.

@danmoseley
Member

Tactics is sympathetic but is looking for more clarity around how likely this is to happen, given that no customers have reported it. Maybe @geoffkizer could come to a future Tactics meeting.

@geoffkizer
Contributor

It's pretty unlikely, because it requires a combination of a couple factors, plus lucky timing where both notifications occur at the same time. It's a bit more likely to happen under load because this will stress the timing factors.

Even when it does happen, the impact is largely benign unless you happen to do something complicated in the first callback, which is unusual.

The problem is that if you do hit this, the impact could be very very bad (deadlock), and extremely hard to diagnose.

@leecow leecow modified the milestones: 5.0.x, 5.0.3 Jan 12, 2021
@danmoseley
Member

I wonder how we can help Tactics make a decision (about whether to wait for customer evidence of impact or not)

@geoffkizer any concerns about the risk here -- it looks like the code was refactored since 3.1, so I can't simply "eyeball and see it's the same". I don't know how to evaluate the risk of something in this code. That would reduce the numerator in risk/benefit. I wonder if we could/should do more to reduce the risk somehow.

@stephentoub any other perspective about the risk here?

@geoffkizer
Contributor

Here's the change that introduced the bug: antonfirsov@205f997#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176

The only code that matters here is the change to ProcessSyncEventOrGetAsyncEvent. The rest of the code only kicks in when a particular AppContext switch is set to enable "inline completions" (which is what #37974 was intended to introduce).

Here's the fix we want to backport: antonfirsov@6e12567#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176

Again, just look at ProcessSyncEventOrGetAsyncEvent and ignore the changes in HandleInlineCompletions.

You'll see that the fix simply reverts the code in ProcessSyncEventOrGetAsyncEvent to what it did previously.

It's true that all this code has been refactored a bit since 3.1, but at least we are reverting to a known state in 5.0 here.

@geoffkizer
Contributor

(BTW if anyone knows how to make github show that diff directly, please let me know :))

@geoffkizer
Contributor

All that said, this code did change a fair amount during 5.0, so I'm not sure how much stock to put into reverting to this previous 5.0 behavior. It looks like ProcessSyncEventOrGetAsyncEvent and the related logic were heavily refactored in this change and some previous ones: antonfirsov@8157271#diff-0784c9867dba825371f6c43638e11610bbc7992527e51b6f25b31971641f5176

That change was something like 6 weeks prior to the bad PR above, so the "known 5.0 state" was only such for about 6 weeks or so in the middle of 5.0.

@geoffkizer
Contributor

Adding @kouvel who made the change I mentioned in the last comment.

The only thing that I think is potentially scary here is that with this fix, we now go through the "two operations to process, so schedule one" path in HandleEvents. This path didn't exist in 3.1, and was only really exercised for that 6 week period during 5.0 before the bad PR went in.

Kount, any thoughts here?

@tmds
Member

tmds commented Jan 13, 2021

The fix is related to this comment:

// This method is called from a thread pool thread. When we have only one operation to process, process it
// synchronously to avoid an extra thread pool work item. When we have two operations to process, processing both
// synchronously may delay the second operation, so schedule one onto the thread pool and process the other
// synchronously. There might be better ways of doing this.
if (sendOperation == null)
{
    receiveOperation?.Process();
}
else
{
    receiveOperation?.Schedule();
    sendOperation.Process();
}

The bug causes Process to be called for both operations, while we want to Schedule one and Process the other. That is to avoid a deadlock between the operations. Something like:

// Sketch; 'sendBuffer' and 'receiveBuffer' are placeholder buffers.
Task<int> sendOp = socket.SendAsync(sendBuffer, SocketFlags.None);
Task<int> receiveOp = socket.ReceiveAsync(receiveBuffer, SocketFlags.None);
await receiveOp;
sendOp.Wait();

I don't think this is a common pattern when using a Socket. Typically the Send and Receive operations are interleaved, or independent. Interleaved doesn't have an issue.
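
For contrast, a minimal sketch of the interleaved case (placeholder buffers, not from the original comment): only one operation is pending at a time, so there is no second callback for the notification handling to delay.

// Interleaved request/response loop
while (true)
{
    int read = await socket.ReceiveAsync(receiveBuffer, SocketFlags.None);
    if (read == 0) break;                                   // peer closed
    await socket.SendAsync(sendBuffer, SocketFlags.None);   // respond, then go back to receiving
}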

Independent would be something like:

// Receive loop
while (true)
{
    await socket.ReceiveAsync(receiveBuffer, SocketFlags.None);
    ProcessData();
}

// Send loop
while (true)
{
    await socket.SendAsync(sendBuffer, SocketFlags.None);
}

When both events happen simultaneously, the SendAsync continuation will be delayed by ProcessData.
In the regular case, SendAsync completes immediately. It becomes async when the peer doesn't acknowledge receiving the data and the local send buffer is full. Then we wait for the peer acknowledgements. ProcessData is adding a (potentially unbounded) amount of delay on top of that.

It is not likely to happen, but when it does, it would be hard to debug that the ReceiveAsync continuation is what keeps the SendAsync from completing.

@kouvel
Member

kouvel commented Jan 13, 2021

A couple of questions regarding:

The bug causes Process to be called for both operations, while we want to Schedule one and Process the other. That is to avoid a deadlock between the operations.

  • From what I can tell between the change that caused the issue and the change that fixed it, the call to Process is moved from ProcessSyncEventOrGetAsyncEvent to HandleEventsInline
  • I don't see any change when going through HandleEvents, so it seems like the issue is only relevant when going through HandleEventsInline. Is there an issue in the default path?
  • When going through HandleEventsInline, even after the fix it looks like Process would be called on both operations (just in a different place). How does the fix solve the problem quoted above?

@tmds
Member

tmds commented Jan 13, 2021

I don't see any change when going through HandleEvents, so it seems like the issue is only relevant when going through HandleEventsInline. Is there an issue in the default path?

When HandleEvents calls ProcessSyncEventOrGetAsyncEvent, that method calls Process on the operation and returns null here (instead of handing the operation back to HandleEvents):

if (processAsyncEvents)
{
    op.Process();
    return null;
}
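
Putting the two quoted snippets together, a simplified sketch of HandleEvents (a reconstruction for illustration, not the actual runtime source; the queue field names and event-flag checks are approximations) shows why returning null defeats the scheduling logic:

// With the buggy branch above, ProcessSyncEventOrGetAsyncEvent has already called
// Process() and returned null, so receiveOperation and sendOperation are both null here:
// the two operations get processed sequentially inside the two calls just below, and the
// Schedule-one/Process-the-other logic never runs.
AsyncOperation? receiveOperation = (events & Interop.Sys.SocketEvents.Read) != 0
    ? _receiveQueue.ProcessSyncEventOrGetAsyncEvent(this)
    : null;
AsyncOperation? sendOperation = (events & Interop.Sys.SocketEvents.Write) != 0
    ? _sendQueue.ProcessSyncEventOrGetAsyncEvent(this)
    : null;

if (sendOperation == null)
{
    receiveOperation?.Process();
}
else
{
    receiveOperation?.Schedule();
    sendOperation.Process();
}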

@kouvel
Member

kouvel commented Jan 13, 2021

Ah, I missed that the default for the processAsyncEvents parameter to ProcessSyncEventOrGetAsyncEvent is true. So then the issue exists in default mode, probably worth fixing to avoid the rare and bad cases.

@danmoseley danmoseley added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jan 14, 2021
@danmoseley
Member

I'm gonna mark no merge for a bit until we're ready to go back to tactics with our updated thoughts and reasoning.

@danmoseley danmoseley removed the Servicing-consider Issue for next servicing release review label Jan 19, 2021
@danmoseley
Member

Removing servicing-consider for now. Let's apply it again when you believe we are ready to go to tactics to discuss risk/benefit again.

@karelz karelz modified the milestones: 5.0.3, 5.0.x Feb 4, 2021
@karelz
Member

karelz commented Feb 4, 2021

@geoffkizer are we ready to go back to Tactics?

@karelz
Member

karelz commented Feb 9, 2021

@danmosemsft we will bring it to Tactics on Thursday if that is OK -- @geoffkizer should be there to talk about it (just sent him an invite).

@danmoseley
Member

Sounds good to me

@danmoseley danmoseley added Servicing-consider Issue for next servicing release review and removed NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) labels Feb 9, 2021
@leecow leecow added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Feb 11, 2021
@leecow leecow modified the milestones: 5.0.x, 5.0.4 Feb 11, 2021
@Anipik Anipik merged commit 2a430b4 into dotnet:release/5.0 Feb 11, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 13, 2021