Add workaround for AsyncSemaphore disposal problems #19043

sharwell · 2017-04-27T16:34:02Z

This is a workaround for #18563. See #19042 for a follow-up task to remove
this workaround.

Ask Mode

Customer scenario

Rare but definitely observable crash in OOP scenarios (Navigate To, etc.).

Bugs this fixes:

This works around #18563 prior to the proper fix being incorporated from microsoft/vs-threading#105.

Workarounds, if any

None.

Risk

Low. This change is intended to be functionally equivalent to microsoft/vs-threading#105, both before and after that change is incorporated.

Performance impact

Aside from a small additional type existing, none.

Is this a regression from a previous update?

No.

Root cause analysis:

Failure to strictly adhere to MSDN documentation for a concurrency primitive (SemaphoreSlim). This is an exceptionally hard to reproduce race condition bug in a dependency.

How was the bug found?

Internal customer reported.

CyrusNajmabadi · 2017-04-27T17:35:45Z

src/VisualStudio/Core/Next/Remote/JsonRpcMessageHandler.cs

+
+        protected override void Dispose(bool disposing)
+        {
+            // Do not call base.Dispose. We do not want the AsyncSemaphore instances to be disposed due to a race


Can you add a link here to either the GitHub or VSO bug that documents the issue, which we can give to the platform guys to look into fixing?

📝 The link is a few lines above, in the comment for the type.

Is it okay to not dispose the async semaphore? Tagging @AArnott

@heejaechang Disposing AsyncSemaphore is a NOP. It just transitions the object to either an unusable or an invalid state, but doesn't actually release any unmanaged resources of any kind.
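The pattern under review in this thread can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: `MessageHandlerBase`, `NonDisposingMessageHandler`, and `SendingSemaphore` are hypothetical names, and a plain `SemaphoreSlim` stands in for the `AsyncSemaphore` instances owned by the real base type. The derived class overrides `Dispose(bool)` and deliberately skips the base call, so a late caller never observes a disposed semaphore.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a library base class that owns a semaphore.
public class MessageHandlerBase : IDisposable
{
    protected readonly SemaphoreSlim SendingSemaphore = new SemaphoreSlim(1, 1);

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    protected virtual void Dispose(bool disposing)
    {
        if (disposing)
        {
            // Using the semaphore after this point is an error.
            SendingSemaphore.Dispose();
        }
    }
}

// Hypothetical derived handler that applies the workaround.
public sealed class NonDisposingMessageHandler : MessageHandlerBase
{
    public bool DisposeCalled { get; private set; }

    protected override void Dispose(bool disposing)
    {
        // Intentionally do not call base.Dispose(disposing): a racing
        // caller may still be about to use the semaphore.
        DisposeCalled = true;
    }

    // Shows that the semaphore remains usable after Dispose().
    public bool CanStillUseSemaphore()
    {
        SendingSemaphore.Wait();
        SendingSemaphore.Release();
        return true;
    }
}
```

The trade-off is that the semaphore is left for the garbage collector, which is only acceptable when, as stated above, disposing it releases no unmanaged resources.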

CyrusNajmabadi · 2017-04-27T17:36:40Z

What's up with the compile errors?

Overall I'm fine with this change. I'd like the code to point to the relevant bug that is tracking getting the underlying issue fixed.

CyrusNajmabadi · 2017-04-27T19:24:00Z

My overall concern is this:

Something seems broken to me with how cancellation is being provided (either on our end, or on the underlying StreamJsonRpc side). It seems fundamentally racy.

For example, it sounds like the following set of steps are possible:

1. We create a session (with a cancellation token).
2. We start an operation on that session.
3. The operation is now executing on the remote side.
4. The remote side calls back to the host side.
5. The host side cancels the token.
6. The underlying stream immediately terminates the connection with the remote side and throws OperationCanceledException.
7. We dispose the session in our using block.
8. The remote call that is back on the host side now ends up crashing because the session/streams are disposed.

The part that seems wonky to me is step 6. If we have outstanding work that we're awaiting, then we shouldn't immediately transition to the canceled state, since the OOP work is still executing and may still call into us.

This approach seems to violate the cooperative cancellation model. We're cancelling to tell everything "hey you should stop executing", but if execution hasn't actually stopped, we should not be tearing down things.

Can someone verify whether my set of steps above is possible? If so, then the design here seems suspect to me, as it seems fundamentally racy. If, however, the above is not possible, then it's unclear to me what set of steps is causing the issue today. If it's another set of steps, please let me know what's happening so I can have an accurate mental model of how RPC + OOP is actually working.

Thanks!

heejaechang · 2017-04-27T19:24:28Z

Looked at your change in StreamJsonRpc in VSO. Now I understand. The change looks good. Thank you for fixing this issue!

heejaechang · 2017-04-27T19:25:53Z

Looks like you need to use an extern alias or something similar to fix the test utility compiler errors caused by the type being defined in two places.

heejaechang · 2017-04-27T19:34:22Z

Our cancellation is disconnecting the stream. There is no way to tell the other side we are cancelling. We are prepared to deal with the ObjectDisposedException thrown due to the stream being disposed.

The problem here is that the SemaphoreSlim used in StreamJsonRpc throws a null exception rather than ObjectDisposedException when it is used after being disposed.

Our desire to use the cancellation token at the session level rather than per invocation makes the cancellation model you described not possible.

We could harden our side so that, whatever exception we get from StreamJsonRpc, we ignore it if our cancellation has been raised. I believe right now we only translate ObjectDisposedException to a cancellation exception.

CyrusNajmabadi · 2017-04-27T19:43:41Z

> Our cancellation is disconnecting the stream. There is no way to tell the other side we are cancelling.

Oh ick. That's hugely unfortunate. That's not how we did things for the TypeScript out-of-proc work. Cancellation was intended to remote over to the OOP side to tell the OOP side to stop what it was doing.

> Our desire to use the cancellation token at the session level rather than per invocation makes the cancellation model you described not possible.

I don't see how that's the case. We could certainly have cancellation be session level rather than invocation level. The important part is that cancelling that token isn't a "disconnect the stream" operation, it's a "let me let the remote side know it should stop working" operation.
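A session-level token that notifies rather than disconnects could look something like the sketch below. Everything here is hypothetical: `CooperativeSession`, the `notifyRemote` callback, and the `"cancelSession"` message name are invented for illustration and are not part of Roslyn or StreamJsonRpc. The point is only that cancelling the token sends a "please stop" message and leaves the stream open, so callbacks already in flight from the remote side can still complete.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical session wrapper; not an existing Roslyn type.
public sealed class CooperativeSession : IDisposable
{
    private readonly CancellationTokenRegistration _registration;

    // notifyRemote is an assumed callback that sends a one-way message
    // to the remote process over the still-open connection.
    public CooperativeSession(Func<string, Task> notifyRemote, CancellationToken sessionToken)
    {
        // Cancelling the session token asks the remote side to stop working.
        // It does not close the stream. The returned task is not awaited
        // here (fire and forget), which a real implementation would need to handle.
        _registration = sessionToken.Register(() => { notifyRemote("cancelSession"); });
    }

    // Disposal only unhooks the callback; tearing down the connection would
    // happen separately, once outstanding remote work has finished.
    public void Dispose() => _registration.Dispose();
}
```

Under this model the host would dispose the stream only after the remote side acknowledges the request or the outstanding call returns, which is what keeps cancellation cooperative.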

--

Ok. So the way i see it so far is this:

It is Roslyn that is doing some funky stuff (prematurely disposing). We can either:

1. Change that, and then not have to do anything else. Pro is that this should address the crash. Con is that we would need a way to tell the remote side to stop working in order for cancellation to have any meaning.
2. Continue disposing early. But, in that case, have the underlying system throw ObjectDisposedException instead of NullRefException. Roslyn is already hardened to this, and knows to translate this on our end to OperationCanceledException.

Given what part of the release cycle we're in, it sounds like option 2 is the better choice for now, though I would prefer a long-term plan that doesn't aggressively disconnect and pray that it doesn't lead to problems further down the line (this sort of bug being a perfect example of that :)).

heejaechang · 2017-04-27T19:51:25Z

The remote side cancels things as soon as the connection is closed, so cancellation works as expected.

Sure, we could build our own mechanism to work around what StreamJsonRpc provides, but I would rather use what StreamJsonRpc uses or keep doing what we are doing. Both of those are simpler.

heejaechang · 2017-04-27T19:54:27Z

Right now, I only translate ObjectDisposedException to cancellation, on the assumption that it is the only exception that can be thrown when an object is used after being disposed.

But we can change that to any exception. We would not be swallowing all exceptions, since we only translate to a cancellation exception if cancellation is in the raised state.

If a remote call fails without us having cancelled it, the exception will still propagate and be handled (in Roslyn's case, with a fail-fast).
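The hardening described in this comment could be sketched along these lines. It is an illustration under assumptions, not Roslyn's actual code: `RemoteInvocation` and its `InvokeAsync` helper are hypothetical names. An exception filter translates any failure into `OperationCanceledException` only when the session token has already been cancelled. Otherwise the original exception propagates untouched.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper; not the actual Roslyn implementation.
public static class RemoteInvocation
{
    public static async Task<T> InvokeAsync<T>(Func<Task<T>> remoteCall, CancellationToken sessionToken)
    {
        try
        {
            return await remoteCall().ConfigureAwait(false);
        }
        catch (Exception ex) when (sessionToken.IsCancellationRequested)
        {
            // The session was cancelled, so whatever the disposed stream threw
            // (ObjectDisposedException, ArgumentNullException, ...) is reported
            // as cancellation. The original exception is kept as InnerException.
            throw new OperationCanceledException("Remote call was cancelled.", ex, sessionToken);
        }
        // If the token is not cancelled, the filter does not match and the
        // failure propagates to the caller unchanged.
    }
}
```

Because the filter checks the token rather than the exception type, a genuine remote failure with no cancellation pending still surfaces to the caller.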

sharwell · 2017-04-27T20:06:23Z

> What's up with the compile errors?

Got called into a meeting and forgot to push the fix. I included the file in one project, not realizing it could see another one via IVT.

AArnott · 2017-04-27T21:57:57Z

> have the underlying system throw ObjectDisposedException instead of NullRefException. Roslyn is already hardened to this, and knows to translate this on our end to OperationCanceledException.

The exception thrown is ArgumentNullException, to be clear. And I just filed microsoft/vs-streamjsonrpc#33 to track seeing how we can "fix" that to throw ObjectDisposedException instead.

sharwell · 2017-04-27T22:30:19Z

> The exception thrown is ArgumentNullException, to be clear.

In that stack trace, yes. A NullReferenceException is also possible.

sharwell · 2017-04-28T02:02:44Z

@MattGertz for approval. This is a workaround for a crasher @CyrusNajmabadi has hit a few times just this week. We're not sure where the final bug fix will live, but I've filed an issue to remove this change sometime in the future after we adopt the final approach.

sharwell added the Tenet-Reliability label (Customer telemetry indicates that the product is failing in a crash/hang/dataloss manner.) Apr 27, 2017

sharwell added this to the 15.3 milestone Apr 27, 2017

sharwell requested review from heejaechang and CyrusNajmabadi April 27, 2017 16:34

dnfclas added the cla-already-signed label Apr 27, 2017

sharwell mentioned this pull request Apr 27, 2017

Revert workaround for AsyncSemaphore disposal problems #19042

Closed

CyrusNajmabadi reviewed Apr 27, 2017

heejaechang approved these changes Apr 27, 2017

Add workaround for AsyncSemaphore disposal problems

d4a6431

This is a workaround for dotnet#18563. See dotnet#19042 for a follow-up task to remove this workaround.

sharwell force-pushed the asyncsemaphore-workaround branch from b78792c to d4a6431 Compare April 27, 2017 20:04

AArnott mentioned this pull request Apr 27, 2017

InvokeAsync throws ArgumentNullException after stream is disposed microsoft/vs-streamjsonrpc#33

Closed

MattGertz added the Approved to merge label Apr 28, 2017

sharwell merged commit 4bb26bd into dotnet:master Apr 28, 2017

sharwell deleted the asyncsemaphore-workaround branch April 28, 2017 17:10
