-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Do not call into MsQuic inside a lock #67037
Do not call into MsQuic inside a lock #67037
Conversation
Tagging subscribers to this area: @dotnet/ncl Issue DetailsAlso added Fixes #59345
|
@@ -845,6 +863,7 @@ private void Dispose(bool disposing) | |||
|
|||
private void EnableReceive() | |||
{ | |||
Debug.Assert(!Monitor.IsEntered(_state), "!Monitor.IsEntered(_state)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of repeating all this, it might be a little more readable to just call a method?
[Conditional("DEBUG")]
internal void AssertMonitorNotEntered(object obj)
{
Debug.Assert(!Monitor.IsEntered(obj), "Monitor was unexpectedly held");
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're using asserts like this all over the place, including S.N.Http. So I'm fine with this as-is, it's still single line. What I do think is that the message is unnecessary here and we rarely include something like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(If we do #65965, it'll also be redundant in this case, as with or without the message you'd get the same assert.)
return new ValueTask<int>(taken); | ||
if (reenableReceive) | ||
{ | ||
EnableReceive(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're calling this outside of the lock, is there something else that ensures we're not racing with another thread to disable receives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stephen's right. Just a hint here (I haven't thought this fully through), we're using the ReadState
as a guard, we always change it only within the lock and than we use it to determine other actions that need to take place outside of the lock.
Possible races: ReadAsync
may race with with msquic callback HandleEventRecv
. And I don't remember if we have any guards to prevent parallel reads on the stream, I think we have an open issue for this somewhere...
cc @CarnaViire
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR: RECV event is the only thing disabling receives. It seems to me that it should be ok due to additional guard from msquic side (no new RECV event will come until ReceiveComplete+EnableReceive is called). But I don't like how fragile it looks 😢
Note that the change is only in a branch for "IndividualReadComplete" state. It means it happens AFTER some data already arrived in RECV event.
We do have a guard for parallel reads which looks exactly at ReadState. So it will throw if ReadAsync is called while state is PendingRead (no data available and there's already a waiting read).
By moving ReceiveComplete+EnableReceive out of the lock, we allow a time where state is already changed from "IndividualReadComplete" to "None", but ongoing "first" ReceiveAsync function is not exited yet and ReceiveComplete+EnableReceive are not called yet (but all data is already copied). So a second ReceiveAsync might enter. It would see state "None" which would mean data is not available. If it grabs the lock before ReceiveComplete+EnableReceive are called (i.e. RECV event is still not possible), it would change the state to PendingRead (waiting for data), store the destination buffer reference, register cancellation and return a task to wait. All the things "None" branch touches are not touched in the remainder of the first ReceiveAsync (between exiting from the lock and returning new value task).
The bottom line is I don't think there's any problem in ReadAsync vs HandleEventRecv race. But we might want to rethink guard against parallel reads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So if I understand correctly, the only think that can go wrong with this change is parallel ReadAsync operations, which user code is not supposed to do anyway...
Should I add the guards against parallel operations in this PR, or should we do that separately? @ManickaP @stephentoub , thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say leave it as is for now. We have a separate issue for proper guards against parallel reads and writes #52627
/azp run runtime-libraries stress-http |
Azure Pipelines successfully started running 1 pipeline(s). |
There's a segmentation fault from the H/3 stress run, in artifacts is dump. Could you please investigate the cause? If it's not related to this change, we can merge, but we need to file an issue for it at least. |
/azp run stress-http |
No pipelines are associated with this pull request. |
/azp run runtime-libraries stress-http |
Azure Pipelines successfully started running 1 pipeline(s). |
The second run of the http-stress did not crash, are we good to move forward, @ManickaP? |
We can. We should observe the H/3 stress for a while after this change. We still must have some nefarious bug in our code 😢 |
@@ -149,7 +149,7 @@ public MsQuicConnection(IPEndPoint localEndPoint, IPEndPoint remoteEndPoint, MsQ | |||
|
|||
try | |||
{ | |||
Debug.Assert(!Monitor.IsEntered(_state)); | |||
Debug.Assert(!Monitor.IsEntered(_state), "!Monitor.IsEntered(_state)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What led you to even expand more the messages in asserts after Stephen's comment?
It'll become redundant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought it would be more useful to have a message handy when something crashes, I thought we generally tend to include the message with the assert in this repo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's exactly the other way around, just grep for Debug.Assert
in the repo. And if it contains the message, I mostly see some additional info and not just the copy of the condition.
} | ||
|
||
// methods below need to be called outside of the lock | ||
if (bytesRead > -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this ever be <= -1? We didn't have check like this before if I understand it correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it will be -1 if initialReadState != ReadState.IndividualReadComplete
, I added it in order for EnableReceive
and ReceiveComplete
to be called (outside of the lock) if and only if they would've been called in the previous version (inside of the lock)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see now, it's its initial value set before the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just nits, otherwise good to merge, thanks!
I know this is already merged, but you're totally allowed to call async MsQuic functions under lock. For instance, most (all?) datapath functions would be fine. Every function annotated with |
The problem we run into while back (and triggered) this issue was not inside MsQuic. There are operations where we hold lock and we could try to hold it again when the event shows on different thread. |
Also added
Debug.Assert
calls to ensure that any calls to MsQuic APIs inside a lock are caught in time.Fixes #59345