
Fix where diagnostic listener SqlDiagnosticsListener.OnAfterCommand runs to avoid implicit concurrent execution #1634

Closed
davidfowl opened this issue Jun 4, 2022 · 8 comments · Fixed by #1637


@davidfowl
Member

Describe the bug

See dotnet/aspnetcore#41924 for more background and dotnet/aspnetcore#41924 (comment) for the context.

When you run an async SqlCommand (via ExecuteReaderAsync), the implementation fires the diagnostics listener callback after running the user's continuation. This allows the listener callback to run concurrently with user code, which can introduce accidental parallelism and break in various ways. The linked issues explain how this breaks in a very common Application Insights scenario; in the stack trace below, the listener is still reading the request's cookies from a thread-pool thread after the query has completed.

Microsoft.AspNetCore.Http.DefaultHttpRequest.get_Cookies()+8f 
Microsoft.ApplicationInsights.AspNetCore.TelemetryInitializers.WebSessionTelemetryInitializer.UpdateRequestTelemetryFromPlatformContext(Microsoft.ApplicationInsights.DataContracts.RequestTelemetry, Microsoft.AspNetCore.Http.HttpContext)+3c 
Microsoft.ApplicationInsights.AspNetCore.TelemetryInitializers.WebSessionTelemetryInitializer.OnInitializeTelemetry(Microsoft.AspNetCore.Http.HttpContext, Microsoft.ApplicationInsights.DataContracts.RequestTelemetry, Microsoft.ApplicationInsights.Channel.ITelemetry)+c4 
Microsoft.ApplicationInsights.AspNetCore.TelemetryInitializers.TelemetryInitializerBase.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+fa 
Microsoft.ApplicationInsights.TelemetryClient.Initialize(Microsoft.ApplicationInsights.Channel.ITelemetry)+2f9 
Microsoft.ApplicationInsights.TelemetryClient.Track(Microsoft.ApplicationInsights.Channel.ITelemetry)+35 
Microsoft.ApplicationInsights.DependencyCollector.Implementation.SqlClientDiagnostics.SqlClientDiagnosticSourceListener.AfterExecuteHelper(System.Collections.Generic.KeyValuePair`2, Microsoft.ApplicationInsights.Common.PropertyFetcher, Microsoft.ApplicationInsights.Common.PropertyFetcher, Microsoft.ApplicationInsights.Common.PropertyFetcher)+18b 
Microsoft.ApplicationInsights.DependencyCollector.Implementation.SqlClientDiagnostics.SqlClientDiagnosticSourceListener.System.IObserver>.OnNext(System.Collections.Generic.KeyValuePair`2)+48f 
System.Diagnostics.DiagnosticListener.Write(System.String, System.Object)+45 
System.Data.SqlClient.SqlClientDiagnosticListenerExtensions.WriteCommandAfter(System.Diagnostics.DiagnosticListener, System.Guid, System.Data.SqlClient.SqlCommand, System.String)+1fc 
System.Data.SqlClient.SqlCommand+<>c__DisplayClass130_0.b__2(System.Threading.Tasks.Task`1)+2fc 
System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(System.Threading.Thread, System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)+4b 
System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)+b6 
System.Threading.ThreadPoolWorkQueue.Dispatch()+1fa 
System.Threading.PortableThreadPool+WorkerThread.WorkerThreadStart()+1f2 
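
To make the failure mode concrete, here is a rough, self-contained sketch of the shape of the problem (illustrative only, not taken from the linked issues): an observer in the style of the App Insights dependency collector subscribes to the SqlClient diagnostic source and reads caller-owned state when handling the "after" event, while the calling code tears that state down as soon as its await resumes. The listener name, event key suffix and LocalDB connection string are assumptions for illustration.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Plays the role of the App Insights dependency collector: handles the
// "command after" event and reads state owned by the caller.
class AfterCommandObserver : IObserver<KeyValuePair<string, object>>
{
    public void OnNext(KeyValuePair<string, object> evt)
    {
        if (evt.Key.EndsWith("WriteCommandAfter"))
        {
            // If this runs after (or concurrently with) the code that follows the
            // await in Program.Main, RequestState may already be gone.
            Console.WriteLine(Program.RequestState ?? "request state already torn down");
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

// Hooks the observer up to the SqlClient diagnostic source when it appears.
class ListenerObserver : IObserver<DiagnosticListener>
{
    public void OnNext(DiagnosticListener listener)
    {
        if (listener.Name == "SqlClientDiagnosticListener") // listener name assumed here
        {
            listener.Subscribe(new AfterCommandObserver());
        }
    }

    public void OnCompleted() { }
    public void OnError(Exception error) { }
}

class Program
{
    // Stands in for HttpContext: state the caller owns and recycles when it is "done".
    public static string RequestState = "per-request state";

    static async Task Main()
    {
        DiagnosticListener.AllListeners.Subscribe(new ListenerObserver());

        // Hypothetical LocalDB connection string matching the environment in this report.
        using var connection = new SqlConnection(@"Server=(localdb)\MSSQLLocalDB;Integrated Security=true");
        await connection.OpenAsync();
        using var command = new SqlCommand("SELECT 1", connection);
        await command.ExecuteScalarAsync();

        // The awaiting code believes the command is finished and tears down its state,
        // but the "after" diagnostic callback may still be running against it.
        RequestState = null;
    }
}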

To reproduce

TBD (this is a race condition)

Expected behavior

The diagnostics listener should not run concurrently with code executing after the query is completed.

Further technical details

Microsoft.Data.SqlClient version: 4.8.3
.NET target: 6.0
SQL Server version: LocalDb
Operating system: Windows 11

@ErikEJ
Contributor

ErikEJ commented Jun 4, 2022

System.Data.SqlClient is in maintenance mode. Microsoft.Data.SqlClient should be used instead (unsure if it makes any difference for the repro)

@Wraith2
Contributor

Wraith2 commented Jun 5, 2022

If I understand this correctly then the equivalent code for netcore in this library is:

Task<int> returnedTask = source.Task;
try
{
    returnedTask = RegisterForConnectionCloseNotification(returnedTask);
    Task<int>.Factory.FromAsync(BeginExecuteNonQueryAsync, EndExecuteNonQueryAsync, null).ContinueWith((t) =>
    {
        registration.Dispose();
        if (t.IsFaulted)
        {
            Exception e = t.Exception.InnerException;
            _diagnosticListener.WriteCommandError(operationId, this, _transaction, e);
            source.SetException(e);
        }
        else
        {
            if (t.IsCanceled)
            {
                source.SetCanceled();
            }
            else
            {
                source.SetResult(t.Result); // completes the task the caller is awaiting
            }
            _diagnosticListener.WriteCommandAfter(operationId, this, _transaction); // fires after the awaiter may already be running
        }
    }, TaskScheduler.Default);
}
catch (Exception e)
{
    _diagnosticListener.WriteCommandError(operationId, this, _transaction, e);
    source.SetException(e);
}
return returnedTask;

It creates a TCS and then passes its task to RegisterForConnectionCloseNotification, which adds a ContinueWith and unwraps it; this is the returnedTask that we give back to the user. The Task.Factory call then sets up the async call, and we add a ContinueWith on the result to do our own cleanup. We don't keep the task returned from the factory and the ContinueWith, so the task returned to the user and our worker task only interact through the TCS. When we set the TCS result it's possible for the user's task to be woken and finish its work before the delegate that sets the result has finished; essentially, between lines 2564 and 2566 multithreaded stuff can happen. It'll take some pretty close timing/delays to make it happen but I think it's a hole.
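
The window can be shown with a plain TaskCompletionSource, no SqlClient involved; a minimal sketch (not library code) of the shape of the delegate above, where the result is set first and the "after" work happens second, so the awaiter is free to resume before, or concurrently with, that work:

using System;
using System.Threading;
using System.Threading.Tasks;

class TcsOrderingSketch
{
    static async Task Main()
    {
        var source = new TaskCompletionSource<int>();

        // Stands in for the ContinueWith delegate: complete the TCS first,
        // then do the "after" work (the diagnostics write in SqlCommand).
        var worker = Task.Run(() =>
        {
            Thread.Sleep(100); // give Main time to reach the await below
            source.SetResult(42);
            Console.WriteLine($"'after' work on thread {Environment.CurrentManagedThreadId}");
        });

        int result = await source.Task;
        // Depending on how the awaiter is scheduled, this line runs either before the
        // "'after' work" line (inlined on the completing thread) or concurrently with it
        // on another thread; either way the 'after' work is not guaranteed to finish first.
        Console.WriteLine($"user continuation on thread {Environment.CurrentManagedThreadId}, result {result}");

        await worker;
    }
}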

If that is the problem, then the obvious, and probably breaking/wrong, fix would be to keep and unwrap the internal task from the async invocation and return that to the caller instead; then they'd be awaiting our cleanup task and there couldn't be an ordering problem.

Without a reliable replication I'd be really worried about making any changes to this. The async in this library is older than language support for it and has really complicated interactions, with any mistakes capable of breaking or regressing customers easily. @okolvik-avento have you observed this bug since you moved to Microsoft.Data.SqlClient at all?

Is there a contractual obligation for event ordering here? The issue it's causing seems to be that the telemetry client, reacting to the query diagnostic being sent, is trying to access an HttpContext object which has already been cleaned up; but is that ordering actually defined, or is it just hopeful and works because of close timing? Does SqlClient have a defined obligation to raise the event before returning to the customer?

@okolvik-avento

@Wraith2 I'm still seeing the get_Cookies exception after transitioning to Microsoft.Data.SqlClient

@davidfowl
Member Author

davidfowl commented Jun 5, 2022

@Wraith2 There are 2 fixes I can think of:

  1. Do a bigger refactoring and await the FromAsync task. This feels bigger and a bit riskier.
  2. Call WriteCommandAfter before setting the TaskCompletionSource. This will make sure diagnostics listeners run before the awaiting continuation executes, and we don't need to await the fire-and-forget task. I.e. moving the
    _diagnosticListener.WriteCommandAfter(operationId, this, _transaction);
    line so it runs before the TaskCompletionSource result is set should be enough to resolve the problem (see the sketch below). In the other cases we already set the TCS after invoking the diagnostic listener.
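
For illustration, option 2 applied to the continuation quoted earlier would look roughly like this (a sketch against the code in this thread, not necessarily the exact change in the fix PR); hoisting the call ahead of the cancel/success branches keeps the event firing on both paths, just before the TCS is completed:

Task<int>.Factory.FromAsync(BeginExecuteNonQueryAsync, EndExecuteNonQueryAsync, null).ContinueWith((t) =>
{
    registration.Dispose();
    if (t.IsFaulted)
    {
        Exception e = t.Exception.InnerException;
        _diagnosticListener.WriteCommandError(operationId, this, _transaction, e);
        source.SetException(e);
    }
    else
    {
        // Raise the diagnostics event first, then complete the task the caller is awaiting,
        // so the listener can no longer run after, or concurrently with, the caller's continuation.
        _diagnosticListener.WriteCommandAfter(operationId, this, _transaction);
        if (t.IsCanceled)
        {
            source.SetCanceled();
        }
        else
        {
            source.SetResult(t.Result);
        }
    }
}, TaskScheduler.Default);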

> Without a reliable replication I'd be really worried about making any changes to this. The async in this library is older than language support for it and has really complicated interactions, with any mistakes capable of breaking or regressing customers easily. @okolvik-avento have you observed this bug since you moved to Microsoft.Data.SqlClient at all?

I'm confident I can make a reliable repro now that the problem is understood. If there's fear this is too breaking we can always add an app context switch. This ends up breaking ASP.NET Core + AppInsights (which is a very common combination) so it's important we figure out how to solve this.
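
If an opt-out were added, it could use the standard AppContext switch mechanism; a minimal sketch, with a made-up switch name that is not a real Microsoft.Data.SqlClient setting:

using System;

class AppContextSwitchSketch
{
    // Hypothetical switch name, shown only to illustrate the opt-out mechanism.
    const string LegacyOrderingSwitch = "Switch.Microsoft.Data.SqlClient.LegacyWriteCommandAfterOrdering";

    static bool UseLegacyOrdering =>
        AppContext.TryGetSwitch(LegacyOrderingSwitch, out bool enabled) && enabled;

    static void Main()
    {
        Console.WriteLine(UseLegacyOrdering);             // False: fixed ordering by default
        AppContext.SetSwitch(LegacyOrderingSwitch, true); // the app opts back into the old behavior
        Console.WriteLine(UseLegacyOrdering);             // True
    }
}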

cc @noahfalk for his take on the reordering of events here.

@Wraith2
Contributor

Wraith2 commented Jun 5, 2022

Moving the diagnostics call before setting result/exception is easy and relatively safe in terms of changes in this library. I'm not going to worry about a replication for that one since we're not logically changing the async composition. I did eventually think of this but only after I'd turned my computer off.

My only worry is whether calling too early would expose problems similar to calling too late if people are relying on the current ordering: what happens if someone relies on knowing a result has been user-handled and now it hasn't been? The error path in the continuation already calls the listener before setting the exception, so users have already been exposed to that situation in a limited way.

I'll try and audit the library later and move all diagnostics calls before tcs outcome setting methods are called.

@JRahnama
Contributor

JRahnama commented Jun 6, 2022

@davidfowl thanks for opening the issue here. @Wraith2 thank you for the support. We will review the PR.

@Wraith2
Contributor

Wraith2 commented Jun 7, 2022

How's that for a quick fix :)

@davidfowl
Member Author

(Steve Harvey happy GIF)
