TelemetryClient.Flush deadlocks #1186

mlugosan · 2019-08-01T16:07:18Z

Repro Steps

1. netcoreapp2.2 console app
2. Telemetry configuration: ServerTelemetryChannel, with DependencyTrackingTelemetryModule and UnobservedExceptionTelemetryModule
3. Call TelemetryClient.Flush on workload completion (async Task Main)

Actual Behavior

The app deadlocks during Flush operation.
Occurrence is reliably about 1 in 8000 runs in a consistent hardware and deployment environment.

Expected Behavior

Flush completes without deadlocking.

Version Info

SDK Version : appinsights 2.10.0
.NET Version : netcore 2.2
How Application was onboarded with SDK(VisualStudio/StatusMonitor/Azure Extension) :
OS : win10-x64
Hosting Info (IIS/Azure WebApps/ etc) : console

Two stacks seem relevant for this:
Thread1

ntdll.dll!NtWaitForMultipleObjects()	Unknown
KERNELBASE.dll!WaitForMultipleObjectsEx()	Unknown
[Managed to Native Transition]	
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricManager.Flush(bool flushDownstreamPipeline)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.TelemetryClient.Flush()	Unknown

Thread 2

ntdll.dll!NtDelayExecution()	Unknown
KERNELBASE.dll!SleepEx()	Unknown
[Managed to Native Transition]	
System.Private.CoreLib.dll!System.Threading.SpinWait.SpinOnce(int sleep1Threshold) Line 169	C#
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double>.GetAndResetValue(int index)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MeasurementAggregator.UpdateAggregate_Stage1(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer, int minFlushIndex, int maxFlushIndex)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.UpdateAggregate(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricValuesBufferBase<double> buffer)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.Extensibility.MetricSeriesAggregatorBase<double>.CompleteAggregation(System.DateTimeOffset periodEnd)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.GetNonpersistentAggregations(System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.CycleAggregators(ref Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.AggregatorCollection aggregators, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter, bool stopAggregators)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.MetricAggregationManager.StartOrCycleAggregators(Microsoft.ApplicationInsights.Metrics.Extensibility.MetricAggregationCycleKind aggregationCycleKind, System.DateTimeOffset tactTimestamp, Microsoft.ApplicationInsights.Metrics.Extensibility.IMetricSeriesFilter futureFilter)	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.FetchAndTrackMetrics()	Unknown
Microsoft.ApplicationInsights.dll!Microsoft.ApplicationInsights.Metrics.DefaultAggregationPeriodCycle.Run()	Unknown
System.Threading.Thread.dll!System.Threading.Thread.ThreadMain_ThreadStart() Line 93	C#
System.Private.CoreLib.dll!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext executionContext, System.Threading.ContextCallback callback, object state) Line 167	C#
[Native to Managed Transition]	
kernel32.dll!BaseThreadInitThunk()	Unknown
ntdll.dll!RtlUserThreadStart()	Unknown

mlugosan · 2019-08-12T15:00:35Z

I'm encountering this issue on a consistent basis.
Can anyone advise on a solution or workaround?

TimothyMothra · 2019-08-14T19:50:19Z

@macrogreg Can you take a look at this?

I think it's related to the lock in MetricSeriesAggregatorBase.UpdateAggregate():

ApplicationInsights-dotnet/src/Microsoft.ApplicationInsights/Metrics/Extensibility/MetricSeriesAggregatorBase.cs

Lines 392 to 394 in f181abb

    
    // This lock is only contended if a user called CreateAggregateUnsafe or CompleteAggregation.
    // This is very unlikely to be the case in a tight loop.
    lock (buffer)

TimothyMothra · 2019-08-15T18:51:51Z

As a workaround, if you're not using the GetMetric().Track() API, you can remove the metrics pre-aggregator processor from your configuration.

mlugosan · 2019-08-15T19:38:51Z

As per recommendation in API documentation, which discourages old metric patterns, we're using GetMetric() and TrackValue(). Hence, unfortunately, the above workaround would not help.

lmolkova · 2019-08-26T18:02:35Z

This is one of the problems with the Flush API. This particular issue is pretty bad, and we want to fix it along with Flush, but we won't have the bandwidth for it this cycle, so I'm moving it to the next milestone.

rafist · 2019-10-11T21:24:51Z

I think I see the same issue with AI 2.4.
Since I am not using GetMetric().Track(), how can I remove the Metrics pre aggregator Processor from my configuration as suggested above?

cijothomas · 2019-12-06T03:44:32Z

@rafist Are you on ASP.NET Core? If so, do the following to remove the metric pre-aggregator:

    public void ConfigureServices(IServiceCollection services)
    {
        Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions aiOptions
            = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions();
        // Disable the auto-collected metric extractor.
        aiOptions.AddAutoCollectedMetricExtractor = false;
        services.AddApplicationInsightsTelemetry(aiOptions);
    }

If you are on ASP.NET, then comment out the following processor from AI.Config

cijothomas · 2019-12-06T03:44:50Z

moving to next milestone, as there are no cycles to investigate this.

rafist · 2019-12-31T13:48:30Z

Thanks @cijothomas, my issue was not related to the Flush after all.

anjanchidige · 2020-01-16T10:07:13Z

@rafist, could you please share more details on the actual issue, and whether it is resolved for you?

anjanchidige · 2020-01-16T13:02:32Z

@cijothomas / @lmolkova / @TimothyMothra, we are using Log4Net with the Application Insights Appender to log to App Insights. TelemetryClient.Flush is called after every TelemetryClient.TrackTrace call. After a few hours, and sometimes immediately, all other API calls in my application fail with Task Cancelled exceptions. When we disable the Application Insights Appender in the Log4net.config file, all my API calls succeed. We are using .NET Framework 4.6.2 and Application Insights Appender v2.11.0.0. It seems that Application Insights is blocking the other API calls. My application is an Azure Worker Role.

Could you please let us know if this is a known issue in the Flush method? If yes, please share any workaround and possible fix timelines.

Thanks
Anjan

cijothomas · 2020-09-01T17:00:04Z

Moving further from current milestones.

mladedav · 2020-11-23T11:46:52Z

This is consistently happening to us in production as well. We are calling Flush from two places, and in some scenarios that seems to cause the deadlock (it may actually be a livelock). I am attaching stack traces from the debugger; they start at the bottom, after the calls from our code into the SDK. When I resume and pause the debugger again, the code doesn't seem to move (that doesn't tell us much, however, since it may just be improbable for me to hit anything but the spinwait).

This manifests when the application should shut down as both calls to flush are started only at that point. In most cases this doesn't happen and shutdown completes normally. When the deadlock occurs, metrics about cpu and memory usage are still sent.

We are using only services.AddApplicationInsightsTelemetryWorkerService() with no default overrides.

(sorry for these being screens but vscode ignores me when I ask it to "copy the stack trace")

cijothomas · 2020-11-23T15:13:23Z

+1 A similar/probably same issue was reported outside of this repo.

The issue is already tagged p1, but not yet assigned a milestone. Will mark for the next release.

cijothomas · 2020-11-23T15:14:18Z

2.17 release date is not yet announced. Will update here, once a firm date is set.

ninlar · 2020-12-02T19:23:21Z

@cijothomas are you sure this will be resolved in 2.17? I've noticed the milestone for this issue keeps getting pushed out. As a potential workaround, do you think we could try invoking Flush() from a background thread and, if it hasn't completed in a reasonable time, attempt to Interrupt() the thread? This occurs for us when a service is shutting down. The deadlock or livelock is actually preventing the service from shutting down. So while dispatching on a background thread and interrupting if needed is hacky, it would potentially be a solution until there is a more concrete fix. The reason we call Flush() on shutdown is to ensure that any buffered telemetry is sent to AI before the process exits.
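
A bounded flush along the lines proposed above might look like the following minimal sketch. `BoundedFlush` and `TryFlush` are hypothetical names, not part of the Application Insights SDK, and the flush operation is passed in as a plain `Action` so the helper has no SDK dependency. Unlike the proposal, this sketch does not call Interrupt(); it simply stops waiting and abandons the worker, which is only reasonable when the process is about to exit anyway.

```csharp
using System;
using System.Threading;

// Hypothetical helper: runs a flush delegate with an upper bound on the wait.
public static class BoundedFlush
{
    // Runs `flush` on a dedicated background thread and waits at most `timeout`.
    // Returns true if the delegate returned (or threw) in time, false if the
    // wait timed out and the thread was abandoned.
    public static bool TryFlush(Action flush, TimeSpan timeout)
    {
        if (flush == null) throw new ArgumentNullException(nameof(flush));

        var worker = new Thread(() =>
        {
            try
            {
                flush();
            }
            catch (Exception)
            {
                // Swallowed on purpose: an unhandled exception on this thread
                // would otherwise terminate the process during shutdown.
            }
        })
        {
            IsBackground = true, // a stuck flush must not keep the process alive
            Name = "bounded-flush"
        };

        worker.Start();
        return worker.Join(timeout);
    }
}
```

Assuming a `TelemetryClient` instance named `telemetryClient`, a shutdown path could call `BoundedFlush.TryFlush(telemetryClient.Flush, TimeSpan.FromSeconds(10))` and log a warning when it returns false. The ten-second budget is an arbitrary illustration.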

shoguns6 · 2021-02-22T13:07:38Z

Something similar is happening in my code as well. I have used the settings below:

    var aiOptions = new Microsoft.ApplicationInsights.AspNetCore.Extensions.ApplicationInsightsServiceOptions
    {
        InstrumentationKey = instrumentationKey,
        EnableAdaptiveSampling = false,
        EnableQuickPulseMetricStream = false,
        EnablePerformanceCounterCollectionModule = false,
    };

but I am still getting a high number of threads, which results in 100% CPU. See below for thread details.

Any updates would be helpful. In the meantime, I will try setting 'AddAutoCollectedMetricExtractor' to false and monitor.

cijothomas · 2021-02-25T18:08:21Z

@shoguns6 It looks like you are reporting a different issue than the Flush deadlocks. The metrics aggregator (DefaultAggregationPeriodCycle) is only expected to create one thread. If you are seeing more than one (461 in your screenshot), it likely means that TelemetryConfig/Client is being recreated instead of reused. Could you check if this is possible in your application?

cijothomas · 2021-07-09T17:35:50Z

One more +1 as this was reported by another user.

cijothomas · 2021-07-14T14:10:43Z

A tactical fix can be made to prevent Flush from waiting (Spinning) indefinitely. It should avoid the currently reported deadlock situations.
As 2.18 is nearing stable release, adding this to the next milestone 2.19. The ETA for the stable release of 2.19 is not finalized, but it'd be before Nov 2021.
The fix is expected in the very first beta of 2.19.
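
To illustrate what "preventing Flush from spinning indefinitely" can mean in general, here is a minimal sketch of a bounded spin. It is not the SDK's code: `SlotBuffer`, its NaN-as-empty convention, and the `maxAttempts` parameter are all assumptions made for the example. The reader gives up after a fixed number of attempts instead of looping until a value appears.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a metric value buffer; not the SDK's implementation.
public sealed class SlotBuffer
{
    private readonly double[] slots;

    public SlotBuffer(int capacity)
    {
        slots = new double[capacity];
        for (int i = 0; i < capacity; i++)
        {
            slots[i] = double.NaN; // NaN marks an empty slot
        }
    }

    // Publishes a value into the slot at `index`.
    public void Write(int index, double value) =>
        Interlocked.Exchange(ref slots[index], value);

    // Takes the value at `index` and resets the slot to empty.
    // Spins while the slot is empty, but stops after `maxAttempts` tries
    // and returns NaN instead of waiting forever.
    public double GetAndResetValue(int index, int maxAttempts = 100)
    {
        var spinner = new SpinWait();
        for (int attempt = 0; attempt < maxAttempts; attempt++)
        {
            double value = Interlocked.Exchange(ref slots[index], double.NaN);
            if (!double.IsNaN(value))
            {
                return value;
            }

            spinner.SpinOnce();
        }

        return double.NaN; // caller must treat NaN as "nothing to aggregate"
    }
}
```

The trade-off in a design like this is that a value written after the reader gives up is left for a later cycle rather than blocking the current one.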

viblo · 2022-03-15T08:01:20Z

Just another data point, we also have this problem. We have experienced it randomly since at least early 2021. Quite frustrating that it has not been possible to fix it even after 20 point releases of AI.

toddfoust · 2022-03-16T20:03:50Z

Customer request to add this item to 2.21 release if possible.

BertscheS · 2022-03-23T08:18:04Z

Same problem here. We had it two weeks ago for the first time using 2.18. We had it consistently multiple times during the day on different servers, so we upgraded to 2.20, and yesterday this issue occurred again. We have to leave Application Insights disabled for now, which is not ideal.
Please include this in 2.21

TimothyMothra · 2022-06-21T16:16:59Z

Hello All,
We released v2.21-beta2 today which contains a fix for this issue.
If anyone can test this new beta, please write back and share if the issue is resolved for you.

TimothyMothra · 2022-07-20T17:50:56Z

Hello All,
We released v2.21.0 today, which contains a fix for this issue.
We're going to leave this issue open until we get confirmation that this issue has been resolved.

ninlar · 2022-07-26T20:59:13Z

We have created a backlog item to upgrade the SDK and test the fix.

cijothomas · 2022-09-16T21:44:45Z

If anyone who experienced this issue can confirm that the latest version fixes the issue, it would be greatly appreciated.

ninlar · 2022-09-16T21:46:34Z

Apologies for the delay. We have this scheduled for the upcoming sprint, since we saw the deadlock again in integration. So we will upgrade to the new library.

BertscheS · 2023-01-11T07:36:54Z

We have the latest version now running for 5 months, and the issue did not occur anymore. Thanks for looking into it.

TimothyMothra added metrics bug P1 labels Aug 14, 2019

TimothyMothra added this to the 2.11 milestone Aug 15, 2019

lmolkova modified the milestones: 2.11, 2.12 Aug 26, 2019

cijothomas modified the milestones: 2.12, 2.13 Dec 6, 2019

TimothyMothra modified the milestones: 2.13, 2.15 Mar 20, 2020

TimothyMothra assigned rajkumar-rangaraj Apr 21, 2020

cijothomas modified the milestones: 2.15, Future Sep 1, 2020

cijothomas modified the milestones: Future, 2.17 Nov 23, 2020

TimothyMothra modified the milestones: 2.17, 2.18 Mar 4, 2021

cijothomas modified the milestones: 2.18, 2.19 Jul 14, 2021

cijothomas assigned cijothomas and unassigned rajkumar-rangaraj Jul 22, 2021

TimothyMothra modified the milestones: 2.19, 2.20 Oct 13, 2021

TimothyMothra modified the milestones: 2.20, 2.21 Dec 14, 2021

TimothyMothra mentioned this issue May 20, 2022

fix Flush deadlock by implementing SpinWait pattern using Interlocked.CompareExchange #2595

Closed

4 tasks

TimothyMothra mentioned this issue Jun 16, 2022

fix Metric livelock by replacing potential infinate loop in MetricValuesBuffer.GetAndResetValue #2612

Merged

4 tasks

cijothomas mentioned this issue Jul 18, 2022

bump version 2.21.0 #2627

Merged
