-
Notifications
You must be signed in to change notification settings - Fork 290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory leak with Application Insights + asp .net core 3.1 + linux containers #1678
Comments
@SidShetye For the purposes of elimination (or confirmation), please can you repeat the test, but without the "Profiler.AspNetCore" package? |
We've spent a ton of time on this to date (see case 119121121001290 and it's master case) and can't really sponsor any additional time. Since we've have identified the product causing this and even have a repro repo, the idea is you (the vendor) debugs this this point on. The leak could be from anywhere in Application Insights but what I can add from the top of my head is that we did see an almost 4x memory footprint with profiler enabled within just 30 seconds of launch (not in containers or in linux) |
@pharring I should mention, the support case 119121121001290's secure file exchange areas has memory dumps of this situation. There is a reference dump a few minutes after starting and then 2 days when the memory consumption is 1.4 GB. The container eventually died after a total of 3 days of uptime where even a simple ps wouldn't fork.
|
@pharring I'm uploading some of the prior work from our support ticket here for easy access. It's off the VS2019 template reproduce repo above (no proprietary bits), so safe for public distribution. Memory dumps: Uploaded to Azure Storage here. It should be accessible for the next 7 days for you to download. Video of low memory situation: Not a terribly exciting video but it's just 90 seconds. https://www.youtube.com/watch?v=4JIufEk2hw8 |
@xiaomi7732 can you take a look? |
@SidShetye Any chance we can get these dumps back with a longer expiration time? Folks on the AppInsights team asked if I could take a look but at the moment I have access to neither the dumps nor a valid ikey to make progress. If possible the dumps are likely to be a much faster way to get useful info than waiting to get the repro set up and leaking again. |
Traveling, on mobile but this should work |
Thanks @SidShetye ! A dev on my team started looking at the dumps today and ruled a few things out but didn't get to a root cause yet. We'll update once we learn more. |
@noahfalk Any update since the last 3 weeks? |
Hi @SidShetye, one of my coworkers has been working on this for considerable time these last weeks and we are continuing to do so - I apologize it is taking a long time to solve. We are seeing that native memory is leaking which is much more time consuming to track down than managed memory leaks. We attempted to use several different memory leak analysis tools on Linux but thus far they've either failed to run or the data they produced has made no sense. We are now at the point where we are making custom modifications to tool called libleak and we've identified potentially problematic allocation stacks, but the callstack unwind/symbolication from the tool is unusable. We are trying a next set of tool modifications that will let us stop the app under a debugger on the apparently problematic allocations so that we can gain access to higher quality stack unwinds. |
thanks @noahfalk On a side note, it wouldn't hurt to talk to the .net diagnostics tool team so any custom tool enhancements at your end can flow back to the broader .net community. It's only a matter of time before other folks also have to debug managed + unmanaged leaks on linux. Perhaps you can save them that frustration. |
I think the team you are referring to is my team : ) Myself and immediate coworkers own tools such as dotnet-dump, dotnet-trace, dotnet-counters as well as all the low level diagnostic features implemented inside the .Net runtime. @pharring above is on the AppInsights team and he asked if we could help them out investigating this.
The modifications we are hacking together would be pretty rough for someone else to try repurposing as-is but I definitely agree on the broader sentiment. I want to stay focused on getting this bug solved for you first but after that we can take stock of what code/knowledge we gained in the exercise and how can we leverage it for broader community benefit. |
@noahfalk Ok, got it. Thanks for those tools, hopefully we can open dotnet-dumps within VS/VSCode's GUI some day! A topic for another day though. Back to this, I wanted to add that the leaks we observed were seen on 2+ core systems with server GC. On a single core server or workstation GC, we didn't observe those leaks. We had mentioned this on the internal support ticket but just noticed that the github issue doesn't mention that. |
Hi @SidShetye a quick update and thanks for your continued patience on this long bug. The custom tooling work we did is bearing dividends and we isolated two callstacks with high suspicion of leaking memory. As of a few days ago we switched modes to root causing, creating tentative fixes, and doing measurements to assess the impact of the changes. Those tasks are still in progress but it is very satisfying to have some leads now. |
Hi @SidShetye, just wanted to give another update on progress. Due to the tireless efforts of my coworker @sywhang we found a few issues that had varying contributions, but the one that appears most problematic plays out like this: The yellow is a baseline scenario that we modified slightly from the one you provided to make ApplicationInsights service profiler capture traces much more frequently. The blue line is after we applied a tentative fix in the coreclr runtime. x-axis is seconds, y-axis is commited memory in MB and we ran both test cases long enough that we thought the trend was apparent. Interestingly we thought the native CRT heap was growing because something was calling malloc() and forgetting to call free() but that wasn't the culprit. Instead the malloc/free calls were correctly balanced but there was a bad interaction between the runtime allocation pattern and the glibc heap policies that decide when unused CRT heap virtual memory should be unmapped and where new allocations should be placed. We didn't determine why specifically our pattern was causing so much trouble for glibc, but the tentative fix is using a new custom allocator that gets a pool of memory directly from the OS with mmap(). This particular issue is in the coreclr runtime (Service profiler is using a runtime feature to create and buffer the trace) so we'd need to address it via a runtime fix. Our current tentative fix still has worse throughput than the original so we are profiling it and refining it in hopes that we don't fix one problem only to create a new one. We'll also need to meet with our servicing team to figure out what options we have for distributing a fix once we have it. I'll continue to follow up here as we get more info. In the meantime we found that adjusting the environment variable MALLOC_ARENA_MAX to a low value appeared to substantially mitigate the memory usage so that is a potential option for you to experiment with if you are trying to get this working as soon as possible. Here are a few links to other software projects that were dealing with similar issues to add a little context: |
Thanks for the update and - wow, this runs deep! Perhaps a write up on Microsoft devblogs would be great, especially as an advanced followup to this piece. The folks root-causing this deserve the credit/recognition. |
Hi @SidShetye, Sorry it took quite a while for me to try out various approaches to fix the underlying issue and decide on a reasonable fix in terms of complexity and performance impact, but the fix finally got merged to the master branch of dotnet/runtime repository in dotnet/runtime#35924. Would it be possible for you to try out using the latest branch and confirm that we've fixed your issue? I'm now trying to backport the fix to our servicing release of .NET Core 3.1 and having validation from you as the first customer who reported this issue would let me go through the backport process for .NET servicing release. Thank you! |
@sywhang I really appreciate the efforts in fixing this but we have already stripped out Application Insights from our stack. Is there a write-up on this? Seems like it was a very tricky issue, so a post-analysis write up could help others too. |
Thanks @SidShetye for the update! I'll try to get the backport approved regardless so that others won't hit the same issue. Regarding writeup, I don't have anything written up on this yet because we're so close to .NET 5 release snap date and I have a huge stack of work items to go through, but I am definitely planning to share some of the tools I wrote as part of this investigation, as well as a writeup on the investigation as soon as I have some extra cycles : ) |
@forik I'd recommend opening a new issue with the information on your bug so it can be better tracked. The new issue template will ask for info specific to your setup and from what you posted so far it appears unlikely that your issue is a duplicate of Sid's issue. |
The fix to this issue has been merged to both master branch and .NET Core 3.1 servicing branch (targetting 3.1.5 servicing release) of the runtime repo. I don't have write access to this repository but I believe the issue can be closed now. |
Can’t test for confirmation but I’ll go ahead and close it. |
I can confirm that it's still happening in .net core 6. Is the MALLOC_ARENA_MAX workaround still works well ? |
If you are seeing unbounded memory growth in App Insights profiler running on .NET 6+ then it is probably going to be a bug in a different code path, and potentially a different root cause. I'd recommend creating a new issue so the App Insights team can investigate it.
I don't know whether MALLOC_ARENA_MAX will affect the behavior you are seeing. Its possible that it will help if your leak was coming from some other code path that continues to use malloc(). The allocations that were problematic in the scenario we investigated for this bug years ago were converted to use mmap(). I wouldn't expect those specific allocations to be leaking or to be impacted by MALLOC_ARENA_MAX any longer. |
FYI, this is also being tracked by Azure support as case 119121121001290
Repro Steps
Please see via my personal repo here at for both the sample repo as well as the pre-built containers. Also note that the restricted way azure app service spawns containers prevents you from running
dotnet dump
on azure (issue with diagnostics team) so this is reproduced outside azure with the standard VS2019 asp.net core 3.1 template.Actual Behavior
The user visible behavior is that Azure App Service recycles the container with the log entry
Stoping site <sitename> because it exceeded swap usage limits.
What happens under the hoods is that it takes between 1 to 4 days for the memory leak to exhaust about 1.5 GB memory. This is seen as high used and low available memory as see by the
free -w -h
linux tool (which uses the /proc/meminfo from container host).Eventually the container is killed (exit code 137, low memory).
Disabling Application Insights by not setting the
APPINSIGHTS_INSTRUMENTATIONKEY
environment variable prevents this leak.Expected Behavior
No memory leak
Version Info
SDK Version :
.NET Version :
.net core 3.1
How Application was onboarded with SDK: csproj integration
Hosting Info (IIS/Azure WebApps/ etc): Azure App Service Linux Containers as well as locally as simple linux containers
The text was updated successfully, but these errors were encountered: