Large unmanaged memory growth (leak?) when upgrading from .NET 6 to 8 #95922
Tagging subscribers to this area: @dotnet/gc (the issue details are reproduced in full below).
Does setting
It could also be related to this issue if it's continuous memory growth: #95362. Are you able to collect some GCCollectOnly traces so we can diagnose further?
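For reference, a minimal sketch of collecting such a trace with dotnet-trace inside the container (assuming the global tool can be installed there and <pid> is the application's process id):
dotnet tool install --global dotnet-trace
dotnet-trace ps                                                       # list .NET processes to find <pid>
dotnet-trace collect -p <pid> --profile gc-collect -o gc.nettrace     # low-overhead, GC events only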
Sorry for the slow response here. Good to hear we're not the only ones seeing this @taylorjonl! @MichalPetryka we tried the old GC setting but unfortunately no dice, the memory graphs look the same as before 😢. We'll try out more of the suggestions here after the new year, but I'm on holiday until then, so this issue might go quiet for a bit. Thanks everyone for the help so far!
Maybe it's W^X or #95362, like mentioned before. Can you try
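(If the truncated suggestion above refers to disabling W^X, a minimal sketch would presumably be the following environment variable; this is an assumption, not part of the original comment.)
export DOTNET_EnableWriteXorExecute=0    # assumed knob for turning W^X off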
W^X should not cause unbound native memory growth.
It is also possible that the native memory leak is caused by a tiny GC memory leak - a case where a tiny managed object holds a large block of native memory alive. You would not see such a leak on the GC memory graph. That is also a possible cause related to OpenSSL, where the runtime uses SafeHandle-derived types to reference possibly large data structures - like client certificates - allocated by OpenSSL. I've seen cases where a certificate chain was up to 1GB large. To try to figure out the culprit, it would be helpful to take a dump of the running process at a point when it has already consumed a large amount of memory and then investigate it using a debugger with the SOS plugin or the
Also, if you'd be able to create a repro that you could share with me, I'd be happy to look into it myself.
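A rough sketch of that workflow with dotnet-dump (the commands exist as written; the SafeHandle type filter is only an illustrative guess at what to look for):
dotnet tool install --global dotnet-dump
dotnet-dump collect -p <pid> --type Full -o core.full   # full memory dump of the running process
dotnet-dump analyze core.full
# inside the analyze prompt (SOS commands):
#   dumpheap -stat                # managed heap summary; look for unexpected object counts
#   dumpheap -type SafeHandle     # list SafeHandle-derived objects (illustrative filter)
#   gcroot <address>              # find what keeps a suspicious object alive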
@mangod9 just to double check, since this is happening on Linux, would
@MichalPetryka we gave disabling W^X a go today and unfortunately no difference. Is there a nightly docker image we could use to try out the TLS fix? 🙂 (I tried looking at the docs but got a bit mixed up with how backporting works in this repo, sorry!)
For .NET 9 daily build testing, the install script can be used as follows:
VERSION=9
DEST="$HOME/.dotnet$VERSION"
# recreate destination directory
rm -rf "$DEST"
mkdir "$DEST"
# download and install
curl -sSL https://dot.net/v1/dotnet-install.sh | bash /dev/stdin --quality daily --channel "$VERSION.0" --install-dir "$DEST"
# add nuget feed
cat > "$HOME/.nuget/NuGet/NuGet.Config" <<EOF
<?xml version="1.0" encoding="utf-8"?>
<configuration>
<packageSources>
<add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
<add key="dotnet$VERSION" value="https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet$VERSION/nuget/v3/index.json" />
</packageSources>
</configuration>
EOF
PATH="$DEST":$PATH
DOTNET_ROOT="$DEST"
export PATH DOTNET_ROOT
# dotnet --info
# dotnet publish ..
# etc.
After changing net8.0 to net9.0 in *.csproj files, rebuild your docker image. After the testing, revert these changes and rebuild the docker image again (to bring back net8.0).
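One possible way to script that switch (a sketch with placeholder names; the image tags and project layout are not from this thread):
# point the project at net9.0, rebuild the image, then revert when done
sed -i 's/net8.0/net9.0/g' ./*.csproj
docker build -t myapp:net9-daily .     # "myapp" is a placeholder image name
# ... run the repro / tests ...
git checkout -- ./*.csproj             # restore net8.0
docker build -t myapp:net8 .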
We're also seeing the exact same issue. @janvorli do you still need any memory dumps? Happy to share privately.
@jtsalva that would be great! My email address is my github username at microsoft.com
@am11 I wonder whether that docker file might be useful to have in the docs in this repo.
Thank you very much for the detailed response @janvorli it's really appreciated 🙂 We have a profile from JetBrains' dotMemory and a few snapshots taken with dotTrace. Would those be helpful to you in lieu of a dotnet-dump? (We can probably also get the latter if that's better to have, will need to get clearance to send to you regardless). I've had difficulty getting a clean repro for this since it only seems to happen when it's hosted on kubernetes but will keep working on it 😄
@SamWilliamsGS I need a dump that contains the whole process memory. I am not sure whether the dotMemory profile contains that or not. In case you have sensitive data in the dump, I can just explain how to look at the interesting stuff in the dump and you can do it yourself.
The backport PR (#95439) has a milestone of 8.0.2, so I don't think you'll see it in non-daily Docker images for .NET 8 until February's servicing release.
@martincostello I'm curious, is there any place where information about the next planned servicing release date and its fixes is published? So far I have only found this page referring to "Patch Tuesday" every month: https://dotnet.microsoft.com/en-us/platform/support/policy/dotnet-core#servicing We've updated some of our services to .NET 8 and faced similar unmanaged memory leaks.
I'm afraid I don't know the definitive answer (I'm not a member of the .NET team), I just know from prior experience that there's typically a release on the second Tuesday of every month to coincide with Patch Tuesday. What makes it into those releases is often just detective work from looking at what's going on in GitHub PRs/issues, as releases with security fixes don't get worked on in public.
Has anyone gained further insight into this? I'm also experiencing very similar issues and am working on getting some dumps and traces now to analyze further. It seems to be impacting some of our kubernetes (k3s) services deployed to edge locations with tight memory constraints. Previously these services were on .NET 6 and would be fairly stable with limited memory (ranges of 128 MiB - 256 MiB). Since uplifting them to .NET 8 we are experiencing higher base memory usage plus OOMKilling going on quite frequently, as memory seems to consistently grow over time with just k8s probes/health-checks running. Enabling DATAS and GCConserve = 9 does seem to greatly improve things, but I still have tests that fail that used to pass on .NET 6. The tests in question all do some batch operations that require more memory than normal load, and with the higher usage in .NET 8 they just cause the pod to get OOMKilled.
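For anyone wanting to try the same mitigation, a sketch of the runtime knobs involved, assuming they map to what is called DATAS and GCConserve = 9 above (values are illustrative, not recommendations):
export DOTNET_gcServer=1                    # server GC (the default for ASP.NET Core apps)
export DOTNET_GCDynamicAdaptationMode=1     # enables DATAS on .NET 8
export DOTNET_GCConserveMemory=9            # 0-9; higher values trade CPU for a smaller heap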
Can you quantify the amount of growth? It could be related to #95439, as suggested above.
With DATAS enabled it doesn't seem to grow (or possibly it just has more aggressive GC to account for the unmanaged leak?), but memory usage is simply higher. I'll try to collect some metrics next week. I'll likely have to revert some pods to .NET 6 to get some baselines and compare, as we weren't watching it as closely until the OOMs. Wouldn't the nightly aspnet images have the TLS leak fix in them?
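One low-friction way to gather those baselines (a sketch; dotnet-counters is a real tool, the output file name is a placeholder):
dotnet tool install --global dotnet-counters
dotnet-counters monitor -p <pid> --counters System.Runtime                               # live GC heap size, working set, etc.
dotnet-counters collect -p <pid> --counters System.Runtime --format csv -o baseline.csv  # record over time for comparison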
I actually had the same problem and I found out that it was related to the Azure EventHub SDK... one of the guys was instantiating an EventHubProducerClient, sending 1 event, and disposing it. Nevertheless, a leak is there. When we started reusing the client, the problem resolved itself.
@Leonardo-Ferreira Thank you for your attention and time, but as far as I know we do not use the Azure EventHub SDK.
@denislohachev1991 it is hard to glean any useful information from the screenshot you shared; can you share the trace/report file and say which tool it can be opened in? Also, knowing more about the application (e.g. how long the data collection was running, how much traffic it served) would be helpful when examining the trace.
@Leandropintogit, regarding HTTP/3, do you know what the target server was? We are running HTTP/3 benchmarks and are aware of some performance gaps compared to HTTP/2, but it should still be very usable. Since we wanted to focus a bit on HTTP/QUIC perf for .NET 9, we might want to investigate.
What do you mean by target server? My setup: RPS +/- 300.
@rzikm How can I share the trace file with you? I used the Heaptrack GUI.
@Leandropintogit I mean, do you know which HTTP/3 implementation the other server is using (specifically, whether it is running .NET as well)? Basically, enough information that I can attempt to replicate your observations and investigate them.
Looking at the second screenshot, I am not 100% sure we're looking at a memory leak. The heaptrack tool works by tracking
To identify something as a leak with greater confidence, you need to either
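For context, a sketch of how heaptrack is typically run against an already-running process and then inspected (the pid and output file name are placeholders):
heaptrack -p <pid>                             # attach and record allocations until interrupted
# ... let it run while memory grows, then stop it with Ctrl+C ...
heaptrack --analyze <heaptrack-output-file>    # or open the recorded file in heaptrack_gui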
@rzikm I'm not sure if this is due to a leak or if this is normal behavior. We have several instances of an application that, over time (it takes about a month or more), consume all the server memory. I was recently looking through the code and found several places where resources were not freed. After that, I started monitoring the test application: from the start of the launch it is at ~200 MB, and after one day of work it gains up to ~450 MB. But even these numbers are very different from running the application on a Windows server. On a Windows server, the application consumes ~200 MB. That's why I assumed that the issue was related to a memory leak.
Yep, that is a good indication of a leak. It would be good to run it with heaptrack for long enough for these 100+ MB to show in the report; it will be easier to isolate the leak from the rest of the live memory. I suggest using dotnet-symbol on all .so files in the application directory (assuming a self-contained publish of the app) to download symbols (this will show better call stacks in heaptrack). Another possible issue you are hitting is #101552 (comment), see the linked comment for diagnosis steps and a possible workaround.
@denislohachev1991 could you please get symbols for the .NET shared libraries, like libcoreclr.so etc.? Without the symbols, we cannot see where the allocations were coming from. You can fetch the symbol files using the dotnet-symbol tool - just call it on the related .so file and it will fetch its .so.dbg file to the same directory where the library is located. heaptrack should then be able to see them. You can use wildcards to fetch symbols for all the libxxxx.so in the dotnet runtime location.
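A sketch of that, assuming a self-contained publish directory at /app (the path is a placeholder):
dotnet tool install --global dotnet-symbol
dotnet-symbol --symbols /app/lib*.so       # downloads the matching .so.dbg files next to each library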
@janvorli Hello. I did as advised; here are my settings for self-contained publishing of the application.
This shows the same thing as your earlier report: those 37 MB "leaked" can very well be live memory. To be able to see anything useful, we need a report where we can see the 200 MB increase you mentioned in your previous message.
Can you try running the collection for one day or more?
@denislohachev1991 on glibc-based Linux distros, each thread consumes 8MB of memory for its stack by default. It looks like most of the memory in your log comes from that. You can try to lower that size, e.g. to 1.5MB, by setting the DOTNET_DefaultStackSize environment variable.
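For example (a sketch; 0x180000 bytes is the 1.5MB value used in the follow-up below):
export DOTNET_DefaultStackSize=180000      # parsed as hex, i.e. 0x180000 = 1.5MB; the 0x prefix is also accepted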
@janvorli As you advised, I set DOTNET_DefaultStackSize=0x180000 and started the application.
Hi. Before and after screenshots attached; tested using Chrome and Edge.
@denislohachev1991 could you please share the heaptrack log with me so that I can drill into it in more detail? It is strange that the env variable didn't have any effect.
Slight OT, isn't the DOTNET_DefaultStackSize value supposed to be specified without the 0x prefix?
@darthShadow it doesn't matter, both ways work. We use strtoul to perform the conversion of the env var contents to a number, and it can take the 0x prefix optionally. See https://en.cppreference.com/w/cpp/string/byte/strtoul.
@denislohachev1991 I've looked at the dump you've shared with me. It seems there was no permanent growth of the memory consumption over time; there are a few spikes, but the memory consumption stays about the same. Looking at the bottom-up tab in the heaptrack GUI, around 25MB is coming from OpenSSL and about 14.5MB from the coreclr ClrMalloc, which is used by the C++ new and C malloc implementations. On Windows, HTTPS communication doesn't use OpenSSL and, IIRC, the memory consumed by that is not attributed to a specific process, so you won't see it in the working set of the process.
@janvorli Thank you for your work and time spent.
Thanks for this great thread. We had the same issue, which is memory growth of our Kubernetes pods after migrating to .NET 8 from .NET 6. We tried adding the DOTNET_DefaultStackSize=0x180000 configuration to our Debian images, but it didn't work. Can you explain in some detail the root cause: why does lowering the default stack size, or using Alpine Linux (which has a lower default stack size, from my understanding), help fix the issue?
If you are on the latest patch of .NET 8, which contains this fix: #100502, then lowering MALLOC_ARENA_MAX or MALLOC_TRIM_THRESHOLD_ can likely get you to memory utilization similar to Alpine.
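A sketch of those glibc malloc knobs as environment variables (the specific values are common starting points, not recommendations from this thread):
export MALLOC_ARENA_MAX=2               # cap the number of malloc arenas (glibc can otherwise create up to 8 per core)
export MALLOC_TRIM_THRESHOLD_=131072    # trim free memory at the top of the heap back to the OS once it exceeds 128KiB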
I am facing the same issue in a .NET 6 API as well. Has any solution been identified?
Hey @krishnap80, most of the discussions on this issue were around .NET 8. Since this issue has been closed, I would suggest creating a new issue with details about your specific scenario. Ideally, please try to move to .NET 8 too, since 6 will soon be out of support. Thanks.
Thanks all for the reply. We will update to .NET 8 in a few weeks. I will check after that and, if needed, will get back to you.
Description
We have a few different services hosted on kubernetes running on .NET. When we try to upgrade from .NET 6 to .NET 8, we see a steep but constant increase in memory usage, almost all in unmanaged memory. It seems to level off at around four times the memory usage in .NET 6, ignoring imposed memory limits, then continues to creep up more slowly depending on workload. So far we haven't seen an upper bound on the amount of unmanaged memory being leaked(?) here. Reproducing the problem in a minimal way has not been possible so far but we do have lots of data gathered about it. 🙂
Configuration
.NET 8, from the docker image mcr.microsoft.com/dotnet/aspnet:8.0, running on x86-64 machines on AWS EC2.
Regression?
Yes, see data below. This issue does not occur on .NET 6, only on 8. We think it might be part of the GC changes from .NET 6 to 7. Give us a shout and we can try to narrow this down by running it on .NET 7.
Data
Initially we switched from .NET 6 to .NET 8 and monitored memory usage using prometheus metrics. This is what the memory usage graphs looked like: both pods actually reached the 512m limit we'd imposed and were restarted. After that we reverted to .NET 6, and things went back to normal. On .NET 6, memory usage remained consistently around ~160MB, but as soon as we deployed the upgrade to .NET 8 the memory increased without limit and the pods were restarted once at 15:30 after hitting 512MB; once we returned to .NET 6, things went back to normal.
We then tried increasing the available memory from 512MB to 1GB and re-deployed .NET 8. It increased rapidly as before, then levelled off at about 650MB and stayed that way until midnight. Service load increases drastically around that time, and the memory grew again to about 950MB, where it stayed relatively level again until the service was unwittingly redeployed by a coworker. At that point we reverted to .NET 6, where it went back to the lower memory level. I think it would have passed the 1GB memory limit after another midnight workload, but we haven't tested that again (yet).
After trying and failing to reproduce the issue using local containers, we re-deployed .NET 8 and attached the JetBrains dotMemory profiler to work out what was happening. This is the profile we collected, showing the unmanaged memory increases. Interestingly, the amount of managed memory actually goes down over time, with GCs becoming more frequent; presumably .NET knows the available memory is running low as the total approaches 1GB. There also seem to be some circumstances where .NET will not allocate from unmanaged memory, since the spikes near the left-hand side mirror each other for managed and unmanaged. We had to stop the profile before reaching the memory limit, since kubernetes would have restarted the pod and the profile would have been lost.
And the prometheus memory usage graph, for completeness (one pod is higher than the other because it was running the dotMemory profiler this time, and drops because of detaching the profiler):
Analysis
The only issue we could find that looked similar was this one, which also affects aspnet services running in kubernetes moving to .NET 7: #92490. As it's memory related, we suspect this might have to do with the GC changes going from .NET 6 to 7. We haven't been able to get a clean repro (or any repro outside our hosted environments) yet, but please let us know if there's anything we can do to help narrow this down. 🙂