Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workaround for AsyncSemaphore disposal problems #19043

Merged
merged 1 commit into from
Apr 28, 2017

Conversation

sharwell
Copy link
Member

@sharwell sharwell commented Apr 27, 2017

This is a workaround for #18563. See #19042 for a follow-up task to remove
this workaround.

Ask Mode

Customer scenario

Rare but definitely observable crash in OOP scenarios (Navigate To, etc.).

Bugs this fixes:

This works around #18563 prior to the proper fix being incorporated from microsoft/vs-threading#105.

Workarounds, if any

None.

Risk

Low. This change is intended to be functionally equivalent to microsoft/vs-threading#105, both before and after that change is incorporated.

Performance impact

Aside from a small additional type existing, none.

Is this a regression from a previous update?

No.

Root cause analysis:

Failure to strictly adhere to MSDN documentation for a concurrency primitive (SemaphoreSlim). This is an exceptionally hard to reproduce race condition bug in a dependency.

How was the bug found?

Internal customer reported.

@sharwell sharwell added the Tenet-Reliability Customer telemetry indicates that the product is failing in a crash/hang/dataloss manner. label Apr 27, 2017
@sharwell sharwell added this to the 15.3 milestone Apr 27, 2017

protected override void Dispose(bool disposing)
{
// Do not call base.Dispose. We do not want the AsyncSemaphore instances to be disposed due to a race
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you have link here to either gitub or VSO bug that documents the issue, which we can give to the platform guys to look into fixing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📝 The link is a few lines above, in the comment for the type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it okay to not dispose the async semaphore? tagging @AArnott

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heejaechang Disposing AsyncSemaphore is a NOP. It just transitions the object to either an unusable or an invalid state, but doesn't actually release any unmanaged resources of any kind.

@CyrusNajmabadi
Copy link
Member

What's up with the compile errors?

Overall i'm fine with this change. I'd like like the code to point to the relevant bug that is tracking getting the underlying issue fixed.

@CyrusNajmabadi
Copy link
Member

My overall concern is this:

Somethign seems broken to me with how cancellation is being provided (either on our end, or in the underlying StreamRPC side). It seems fundamentally racey.

For example, it sounds like the following set of steps are possible:

  1. we create a session (with a cancellation token.)
  2. We start an operation on that session.
  3. operation is now executing on remote side.
  4. remote side calls back to host side.
  5. host side cancels token.
  6. Underlying stream immediately terminates connection with remote and throw OperationCanceled
  7. We dispose the session in our using block.
  8. Remote-call-that-is-back-on-host-side now ends up crashing because session/streams are disposed.

The part that seems wonky to me is '6'. If we have outstanding work that we're awaiting, then we shouldn't immediatley transition to the canceled state since the OOP work is still executing and may still call into us.

This approach seems to violate the cooperative cancellation model. We're cancelling to tell everything "hey you should stop executing", but if execution hasn't actually stopped, we should not be tearing down things.

Can someone verify if my set of steps above is possible? If so, then the design here seems suspect to me as it seems fundamentally 'racey'. If, however, the above is not possible, then it's unclear to me what the set of steps are that are causing the issue today. If it's another set of steps, please let me know what's happening so i can have an accurate mental model of how rpc+oop is actually working.

Thanks!

@heejaechang
Copy link
Contributor

looked your change in StreamJson in vso. now I understand. change looks good. thank you for fixing this issue!

@heejaechang
Copy link
Contributor

looks like you need to use external alias or something to fix the test utility compiler errors due to type defined in 2 places.

@heejaechang
Copy link
Contributor

our cancellation is the disconnecting the stream. there is no tell the other side we are cancelling. we are prepared to deal with object disposed exception thrown due to stream being disposed.

problem here is semaphoreSlim used in StreamJson throws a null exception rather than object disposed exception when it is used after disposed.

our desire to use cancellation token at the session level rather than each invocation makes the cancellation model you described not possible.

we could harden our side that whatever exception we get from StreamJson, if our cancellation is raised, we ignore the exception. I believe right now we only translate object disposed exception to cancellation exception.

@CyrusNajmabadi
Copy link
Member

our cancellation is the disconnecting the stream. there is no tell the other side we are cancelling.

Oh ick. That's hugely unfortunate. That's not how we did things for teh TypeScript out of proc work. Cancellation was intended to remote over to the OOP side to tell the OOP side to stop what it was doin.

our desire to use cancellation token at the session level rather than each invocation makes the cancellation model you described not possible.

I don't see how that's the case. We could certainly have cancellation be session level rather than invocation level. The important part is that cancelling that token isn't a "disconnect the stream" operation, it's a "let me let the remote side know it should stop working" operation.

--

Ok. So the way i see it so far is this:

It is Roslyn that is doing some funky stuff. (Prematurely disposing). We can either:

  1. change that, and then not have to do anything else. Pro is that this should address the crash. Con is that we would need a way to tell the remote side to stop working in order for cancellation to have any meaning.
  2. continue just diposing early. But, in that case, have the underlying system throw ObjectDisposedException instead of NullRefException. Roslyn is already hardened to this, and knows to translate this on our end to OperationCanceledException.

Given what part of the release cycle we're in, it sounds like '2' is the better choice for now. Though i would prefer a long term plan of not overly aggressively just disconnecting, and praying that doesn't lead to problems further down the line (this sort of bug being a perfect example of that :)).

@heejaechang
Copy link
Contributor

remote side cancel things as soon as connection is closed. so, cancellation works as expected.

sure, we could make some kind of our own mechanism workarouding what StreamJson provide. but I would rather use what StreamJson uses or keep do what we are doing. both of them is simpler.

@heejaechang
Copy link
Contributor

right now, I only translate object disposed exception to cancellation assumed that will be only exception that can be thrown when object is used after disposed.

but we can change that to any exception. it is not we swallowing all exception since we only translate it to cancellation exception if the cancellation is in raised state.

if remote call failed without us cancelled the call, it will still propagate and handled (in roslyn case, failfast)

This is a workaround for dotnet#18563. See dotnet#19042 for a follow-up task to remove
this workaround.
@sharwell sharwell force-pushed the asyncsemaphore-workaround branch from b78792c to d4a6431 Compare April 27, 2017 20:04
@sharwell
Copy link
Member Author

What's up with the compile errors?

Got called into a meeting forgot to push the fix. I included the file in one project not realizing it could see another one via IVT.

@AArnott
Copy link
Contributor

AArnott commented Apr 27, 2017

have the underlying system throw ObjectDisposedException instead of NullRefException. Roslyn is already hardened to this, and knows to translate this on our end to OperationCanceledException.

The exception thrown is ArgumentNullException, to be clear. And I just filed microsoft/vs-streamjsonrpc#33 to track seeing how we can "fix" that to throw ObjectDisposedException instead.

@sharwell
Copy link
Member Author

The exception thrown is ArgumentNullException, to be clear.

In that stack trace yes. A NullReferenceException is also possible.

@sharwell
Copy link
Member Author

@MattGertz for approval. This is a workaround for a crasher @CyrusNajmabadi has hit a few times just this week. We're not sure where the final bug fix will live, but I've filed an issue to remove this change sometime in the future after we adopt the final approach.

@sharwell sharwell merged commit 4bb26bd into dotnet:master Apr 28, 2017
@sharwell sharwell deleted the asyncsemaphore-workaround branch April 28, 2017 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Approved to merge cla-already-signed Tenet-Reliability Customer telemetry indicates that the product is failing in a crash/hang/dataloss manner.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants