Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not call into MsQuic inside a lock #67037

Merged
merged 1 commit into from
Mar 29, 2022

Conversation

rzikm
Copy link
Member

@rzikm rzikm commented Mar 23, 2022

Also added Debug.Assert calls to ensure that any calls to MsQuic APIs inside a lock are caught in time.

Fixes #59345

@ghost
Copy link

ghost commented Mar 23, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Also added Debug.Assert calls to ensure that any calls to MsQuic APIs inside a lock are caught in time.

Fixes #59345

Author: rzikm
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@@ -845,6 +863,7 @@ private void Dispose(bool disposing)

private void EnableReceive()
{
Debug.Assert(!Monitor.IsEntered(_state), "!Monitor.IsEntered(_state)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of repeating all this, it might be a little more readable to just call a method?

[Conditional("DEBUG")]
internal void AssertMonitorNotEntered(object obj)
{
     Debug.Assert(!Monitor.IsEntered(obj), "Monitor was unexpectedly held");
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're using asserts like this all over the place, including S.N.Http. So I'm fine with this as-is, it's still single line. What I do think is that the message is unnecessary here and we rarely include something like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(If we do #65965, it'll also be redundant in this case, as with or without the message you'd get the same assert.)

return new ValueTask<int>(taken);
if (reenableReceive)
{
EnableReceive();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're calling this outside of the lock, is there something else that ensures we're not racing with another thread to disable receives?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stephen's right. Just a hint here (I haven't thought this fully through), we're using the ReadState as a guard, we always change it only within the lock and than we use it to determine other actions that need to take place outside of the lock.
Possible races: ReadAsync may race with with msquic callback HandleEventRecv. And I don't remember if we have any guards to prevent parallel reads on the stream, I think we have an open issue for this somewhere...

cc @CarnaViire

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: RECV event is the only thing disabling receives. It seems to me that it should be ok due to additional guard from msquic side (no new RECV event will come until ReceiveComplete+EnableReceive is called). But I don't like how fragile it looks 😢

Note that the change is only in a branch for "IndividualReadComplete" state. It means it happens AFTER some data already arrived in RECV event.

We do have a guard for parallel reads which looks exactly at ReadState. So it will throw if ReadAsync is called while state is PendingRead (no data available and there's already a waiting read).

By moving ReceiveComplete+EnableReceive out of the lock, we allow a time where state is already changed from "IndividualReadComplete" to "None", but ongoing "first" ReceiveAsync function is not exited yet and ReceiveComplete+EnableReceive are not called yet (but all data is already copied). So a second ReceiveAsync might enter. It would see state "None" which would mean data is not available. If it grabs the lock before ReceiveComplete+EnableReceive are called (i.e. RECV event is still not possible), it would change the state to PendingRead (waiting for data), store the destination buffer reference, register cancellation and return a task to wait. All the things "None" branch touches are not touched in the remainder of the first ReceiveAsync (between exiting from the lock and returning new value task).

The bottom line is I don't think there's any problem in ReadAsync vs HandleEventRecv race. But we might want to rethink guard against parallel reads.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, the only think that can go wrong with this change is parallel ReadAsync operations, which user code is not supposed to do anyway...

Should I add the guards against parallel operations in this PR, or should we do that separately? @ManickaP @stephentoub , thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say leave it as is for now. We have a separate issue for proper guards against parallel reads and writes #52627

@CarnaViire
Copy link
Member

/azp run runtime-libraries stress-http

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rzikm rzikm requested a review from ManickaP March 28, 2022 08:51
@ManickaP
Copy link
Member

There's a segmentation fault from the H/3 stress run, in artifacts is dump. Could you please investigate the cause? If it's not related to this change, we can merge, but we need to file an issue for it at least.

@ManickaP
Copy link
Member

@rzikm #67230 seems to be getting closer to stable, so if you're not successful with local run, you might opt to wait for this to get in main.

@rzikm
Copy link
Member Author

rzikm commented Mar 29, 2022

/azp run stress-http

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@rzikm
Copy link
Member Author

rzikm commented Mar 29, 2022

/azp run runtime-libraries stress-http

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rzikm
Copy link
Member Author

rzikm commented Mar 29, 2022

The second run of the http-stress did not crash, are we good to move forward, @ManickaP?

@ManickaP
Copy link
Member

We can. We should observe the H/3 stress for a while after this change. We still must have some nefarious bug in our code 😢

@@ -149,7 +149,7 @@ public MsQuicConnection(IPEndPoint localEndPoint, IPEndPoint remoteEndPoint, MsQ

try
{
Debug.Assert(!Monitor.IsEntered(_state));
Debug.Assert(!Monitor.IsEntered(_state), "!Monitor.IsEntered(_state)");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What led you to even expand more the messages in asserts after Stephen's comment?
It'll become redundant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it would be more useful to have a message handy when something crashes, I thought we generally tend to include the message with the assert in this repo

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's exactly the other way around, just grep for Debug.Assert in the repo. And if it contains the message, I mostly see some additional info and not just the copy of the condition.

}

// methods below need to be called outside of the lock
if (bytesRead > -1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this ever be <= -1? We didn't have check like this before if I understand it correctly.

Copy link
Member Author

@rzikm rzikm Mar 29, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it will be -1 if initialReadState != ReadState.IndividualReadComplete, I added it in order for EnableReceive and ReceiveComplete to be called (outside of the lock) if and only if they would've been called in the previous version (inside of the lock)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see now, it's its initial value set before the lock.

Copy link
Member

@ManickaP ManickaP left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just nits, otherwise good to merge, thanks!

@rzikm rzikm merged commit 4bd27a6 into dotnet:main Mar 29, 2022
@nibanks
Copy link

nibanks commented Apr 1, 2022

I know this is already merged, but you're totally allowed to call async MsQuic functions under lock. For instance, most (all?) datapath functions would be fine. Every function annotated with _IRQL_requires_max_(DISPATCH_LEVEL) should be Ok to call, because it's already designed to be called at DISPATCH level in kernel, which means it cannot be blocked by anything else inside MsQuic that would cause a deadlock.

@wfurt
Copy link
Member

wfurt commented Apr 1, 2022

The problem we run into while back (and triggered) this issue was not inside MsQuic. There are operations where we hold lock and we could try to hold it again when the event shows on different thread.

@karelz karelz added this to the 7.0.0 milestone Apr 8, 2022
@ghost ghost locked as resolved and limited conversation to collaborators May 8, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[QUIC] Do not call msquic in a lock
8 participants