
Fix where diagnostic listener SqlDiagnosticsListener.OnAfterCommand runs to avoid implicit concurrent execution #1634

Closed
davidfowl opened this issue Jun 4, 2022 · 8 comments · Fixed by #1637


@davidfowl
Member

Describe the bug

See dotnet/aspnetcore#41924 for more background and dotnet/aspnetcore#41924 (comment) for the context.

When you run an async SqlCommand (via ExecuteReaderAsync), the implementation fires the diagnostics listener callback after running the user's continuation. This allows the listener callback to run concurrently with user code, which can introduce accidental parallelism and break in various ways. The linked issues explain how this breaks in a very common Application Insights scenario; in the stack trace below, the listener is still reading the request's cookies from a thread-pool thread after the query has completed.

Microsoft.AspNetCore.Http.DefaultHttpRequest.get_Cookies()+8f 
Microsoft.ApplicationInsights.AspNetCore.TelemetryInitializers.WebSessionTelemetryInitializer.UpdateRequestTelemetryFromPlatformContext(Microsoft.ApplicationInsights.DataContracts.RequestTelemetry, Microsoft.AspNetCore.Http.HttpContext)+3c 
Microsoft.ApplicationInsights.AspNetCore.TelemetryInitializers.WebSessionTelemetryInitializer.OnInitializeTelemetry(Microsoft.AspNetCore.Http.HttpContext, Microsoft.ApplicationInsights.DataContracts.RequestTelemetry, Microsoft.ApplicationInsights.Channel.ITelemetry)+c4 
Microsoft.ApplicationInsights.AspNetCore.TelemetryInitializers.TelemetryInitializerBase.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+fa 
Microsoft.ApplicationInsights.TelemetryClient.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+2f9 
Microsoft.ApplicationInsights.TelemetryClient.Track(Microsoft.ApplicationInsights.Channel.ITelemetry)+35 
Microsoft.ApplicationInsights.DependencyCollector.Implementation.SqlClientDiagnostics.SqlClientDiagnosticSourceListener.AfterExecuteHelper(System.Collections.Generic.KeyValuePair`2, Microsoft.ApplicationInsights.Common.PropertyFetcher, Microsoft.ApplicationInsights.Common.PropertyFetcher, Microsoft.ApplicationInsights.Common.PropertyFetcher)+18b 
Microsoft.ApplicationInsights.DependencyCollector.Implementation.SqlClientDiagnostics.SqlClientDiagnosticSourceListener.System.IObserver>.OnNext(System.Collections.Generic.KeyValuePair`2)+48f 
System.Diagnostics.DiagnosticListener.Write(System.String, System.Object)+45 
System.Data.SqlClient.SqlClientDiagnosticListenerExtensions.WriteCommandAfter(System.Diagnostics.DiagnosticListener, System.Guid, System.Data.SqlClient.SqlCommand, System.String)+1fc 
System.Data.SqlClient.SqlCommand+<>c__DisplayClass130_0.b__2(System.Threading.Tasks.Task`1)+2fc 
System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+4b 
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)+b6 
System.Threading.ThreadPoolWorkQueue.Dispatch()+1fa 
System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()+1f2 
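
To make the failure mode concrete, here is a rough, self-contained sketch of the shape of the problem (illustrative only, not taken from the linked issues): an observer in the style of the App Insights dependency collector subscribes to the SqlClient diagnostic source and reads caller-owned state when handling the "after" event, while the calling code tears that state down as soon as its await resumes. The listener name, event key suffix and LocalDB connection string are assumptions for illustration.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Plays the role of the App Insights dependency collector: handles the
// "command after" event and reads state owned by the caller.
class AfterCommandObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnNext(KeyValuePair<string, object> evt)
    {
        if (evt.Key.EndsWith("WriteCommandAfter"))
        {
            // If this runs after (or concurrently with) the code that follows the
            // await in Program.Main, RequestState may already be gone.
            Console.WriteLine(Program.RequestState ?? "request state already torn down");
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

// Hooks the observer up to the SqlClient diagnostic source when it appears.
class ListenerObserver : IObserver<DiagnosticListener>
{
    public void OnNext(DiagnosticListener listener)
    {
        if (listener.Name == "SqlClientDiagnosticListener") // listener name assumed here
        {
            listener.Subscribe(new AfterCommandObserver());
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

class Program
{
    // Stands in for HttpContext: state the caller owns and recycles when it is "done".
    public static string RequestState = "per-request state";

    static async Task Main()
    {
        DiagnosticListener.AllListeners.Subscribe(new ListenerObserver());

        // Hypothetical LocalDB connection string matching the environment in this report.
        using var connection = new SqlConnection(@"Server=(localdb)\MSSQLLocalDB;Integrated Security=true");
        await connection.OpenAsync();
        using var command = new SqlCommand("SELECT 1", connection);
        await command.ExecuteScalarAsync();

        // The awaiting code believes the command is finished and tears down its state,
        // but the "after" diagnostic callback may still be running against it.
        RequestState = null;
    }
}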

To reproduce

TBD (this is a race condition)

Expected behavior

The diagnostics listener should not run concurrently with code executing after the query is completed.

Further technical details

Microsoft.Data.SqlClient version: 4.8.3
.NET target: 6.0
SQL Server version: LocalDb
Operating system: Windows 11

@ErikEJ
Contributor

ErikEJ commented Jun 4, 2022

System.Data.SqlClient is in maintenance mode. Microsoft.Data.SqlClient should be used instead (unsure if it makes any difference for the repro)

@Wraith2
Contributor

Wraith2 commented Jun 5, 2022

If I understand this correctly then the equivalent code for netcore in this library is:

Task<int> returnedTask = source.Task;
try
{
    returnedTask = RegisterForConnectionCloseNotification(returnedTask);
    Task<int>.Factory.FromAsync(BeginExecuteNonQueryAsync, EndExecuteNonQueryAsync, null).ContinueWith((t) =>
    {
        registration.Dispose();
        if (t.IsFaulted)
        {
            Exception e = t.Exception.InnerException;
            _diagnosticListener.WriteCommandError(operationId, this, _transaction, e);
            source.SetException(e);
        }
        else
        {
            if (t.IsCanceled)
            {
                source.SetCanceled();
            }
            else
            {
                source.SetResult(t.Result); // completes the task the caller is awaiting
            }
            _diagnosticListener.WriteCommandAfter(operationId, this, _transaction); // fires after the awaiter may already be running
        }
    }, TaskScheduler.Default);
}
catch (Exception e)
{
    _diagnosticListener.WriteCommandError(operationId, this, _transaction, e);
    source.SetException(e);
}
return returnedTask;

It creates a TCS and then passes its task to RegisterForConnectionCloseNotification, which adds a ContinueWith and unwraps it; this is the returnedTask that we give back to the user. The Task.Factory call then sets up the async call, and we add a ContinueWith on the result to do our own cleanup. We don't keep the task returned from the factory and the ContinueWith, so the task returned to the user and our worker task only interact through the TCS. When we set the TCS result it's possible for the user's task to be woken and finish its work before the delegate that sets the result has finished; essentially, between lines 2564 and 2566 multithreaded stuff can happen. It'll take some pretty close timing/delays to make it happen but I think it's a hole.
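
The window can be shown with a plain TaskCompletionSource, no SqlClient involved; a minimal sketch (not library code) of the shape of the delegate above, where the result is set first and the "after" work happens second, so the awaiter is free to resume before, or concurrently with, that work:

using System;
using System.Threading;
using System.Threading.Tasks;

class TcsOrderingSketch
{
    static async Task Main()
    {
        var source = new TaskCompletionSource<int>();

        // Stands in for the ContinueWith delegate: complete the TCS first,
        // then do the "after" work (the diagnostics write in SqlCommand).
        var worker = Task.Run(() =>
        {
            Thread.Sleep(100); // give Main time to reach the await below
            source.SetResult(42);
            Console.WriteLine($"'after' work on thread {Environment.CurrentManagedThreadId}");
        });

        int result = await source.Task;
        // Depending on how the awaiter is scheduled, this line runs either before the
        // "'after' work" line (inlined on the completing thread) or concurrently with it
        // on another thread; either way the 'after' work is not guaranteed to finish first.
        Console.WriteLine($"user continuation on thread {Environment.CurrentManagedThreadId}, result {result}");

        await worker;
    }
}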

If that is the problem, then the obvious, and probably breaking/wrong, fix would be to keep and unwrap the internal task from the async invocation and return that to the caller instead; then they'd be awaiting our cleanup task and there couldn't be an ordering problem.

Without a reliable replication I'd be really worried about making any changes to this. The async in this library is older than language support for it and has really complicated interactions, with any mistakes capable of breaking or regressing customers easily. @okolvik-avento have you observed this bug since you moved to Microsoft.Data.SqlClient at all?

Is there a contractual obligation for event ordering here? The issue it's causing seems to be that the telemetry client, reacting to the query diagnostic being sent, is trying to access an HttpContext object which has already been cleaned up; but is that ordering actually defined, or is it just hopeful and works because of close timing? Does SqlClient have a defined obligation to raise the event before returning to the customer?

@okolvik-avento

@Wraith2 I'm still seeing the get_Cookies exception after transitioning to Microsoft.Data.SqlClient

@davidfowl
Member Author

davidfowl commented Jun 5, 2022

@Wraith2 There are 2 fixes I can think of:

  1. Do a bigger refactoring and await the FromAsync task. This feels bigger and a bit riskier.
  2. Call WriteCommandAfter before setting the TaskCompletionSource. This will make sure diagnostics listeners run before the awaiting continuation executes, and we don't need to await the fire-and-forget task. I.e. moving the
    _diagnosticListener.WriteCommandAfter(operationId, this, _transaction);
    line so it runs before the TaskCompletionSource result is set should be enough to resolve the problem (see the sketch below). In the other cases we already set the TCS after invoking the diagnostic listener.
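
For illustration, option 2 applied to the continuation quoted earlier would look roughly like this (a sketch against the code in this thread, not necessarily the exact change in the fix PR); hoisting the call ahead of the cancel/success branches keeps the event firing on both paths, just before the TCS is completed:

Task<int>.Factory.FromAsync(BeginExecuteNonQueryAsync, EndExecuteNonQueryAsync, null).ContinueWith((t) =>
{
    registration.Dispose();
    if (t.IsFaulted)
    {
        Exception e = t.Exception.InnerException;
        _diagnosticListener.WriteCommandError(operationId, this, _transaction, e);
        source.SetException(e);
    }
    else
    {
        // Raise the diagnostics event first, then complete the task the caller is awaiting,
        // so the listener can no longer run after, or concurrently with, the caller's continuation.
        _diagnosticListener.WriteCommandAfter(operationId, this, _transaction);
        if (t.IsCanceled)
        {
            source.SetCanceled();
        }
        else
        {
            source.SetResult(t.Result);
        }
    }
}, TaskScheduler.Default);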

> Without a reliable replication I'd be really worried about making any changes to this. The async in this library is older than language support for it and has really complicated interactions, with any mistakes capable of breaking or regressing customers easily. @okolvik-avento have you observed this bug since you moved to Microsoft.Data.SqlClient at all?

I'm confident I can make a reliable repro now that the problem is understood. If there's fear this is too breaking we can always add an app context switch. This ends up breaking ASP.NET Core + AppInsights (which is a very common combination) so it's important we figure out how to solve this.
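
If an opt-out were added, it could use the standard AppContext switch mechanism; a minimal sketch, with a made-up switch name that is not a real Microsoft.Data.SqlClient setting:

using System;

class AppContextSwitchSketch
{
    // Hypothetical switch name, shown only to illustrate the opt-out mechanism.
    const string LegacyOrderingSwitch = "Switch.Microsoft.Data.SqlClient.LegacyWriteCommandAfterOrdering";

    static bool UseLegacyOrdering =>
        AppContext.TryGetSwitch(LegacyOrderingSwitch, out bool enabled) && enabled;

    static void Main()
    {
        Console.WriteLine(UseLegacyOrdering);             // False: fixed ordering by default
        AppContext.SetSwitch(LegacyOrderingSwitch, true); // the app opts back into the old behavior
        Console.WriteLine(UseLegacyOrdering);             // True
    }
}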

cc @noahfalk for his take on the reordering of events here.

@Wraith2
Contributor

Wraith2 commented Jun 5, 2022

Moving the diagnostics call before setting result/exception is easy and relatively safe in terms of changes in this library. I'm not going to worry about a replication for that one since we're not logically changing the async composition. I did eventually think of this but only after I'd turned my computer off.

My only worry is whether calling too early would expose problems similar to calling too late if people are relying on the current ordering: what happens if someone relies on knowing a result has been user-handled and now it hasn't been? The error path in the continuation already calls the listener before setting the exception, so users have already been exposed to that situation in a limited way.

I'll try and audit the library later and move all diagnostics calls before tcs outcome setting methods are called.

@JRahnama
Contributor

JRahnama commented Jun 6, 2022

@davidfowl thanks for opening the issue here. @Wraith2 thank you for the support. We will review the PR.

@Wraith2
Contributor

Wraith2 commented Jun 7, 2022

How's that for a quick fix :)

@davidfowl
Member Author

(Steve Harvey happy GIF)
