Random malloc failures on current Debian Bookworm slim image #101722
Hi @KLazarov, please try the latest image. The issue might also be coming from one of the .NET dependencies that are installed from the Debian package repos. If you can find the last image digest that is known to be working, then we can compare the package versions between that and the 8.0.3 Bookworm image to see what changed.
You might also try
Hi @lbussell, we found that 8.0.2 worked fine, as I couldn't see the failure in the logs for that one. It was only after we did a rebuild for a new release, together with updating to 8.0.3, that it started failing. The build that started failing was on the 8th of April, around 19:00 UTC.

Hi @richlander, we have already replaced it with the Alpine image as it was affecting our production environment. Currently I don't have the time to do more tests with other Debian images, sorry.
Glad to hear that the Alpine image resolved the issue. We haven't heard any other reports of this issue, so it is hard for us to know what is causing it. FYI:
Thanks for the info! After running the Alpine image for 20 hours, we haven't encountered the issue.
I don't think there is enough to go on yet to make a report actionable. If there were a regression, it would presumably be much more pervasive. Or we're about to see a deluge of issues, which (on one hand) would help. Perhaps the eclipse caused some weird cosmic rays to affect the hardware.
[Triage] Closing as not planned, since Alpine is working for @KLazarov and there's no clear direction for the investigation to go in.
I have a similar issue where, after 50k requests on average, my aspnet image crashes with a malloc() issue.
@Ekwav did you get the same error messages as me in the original post? @richlander @lbussell, might it be worth investigating this with @Ekwav's example repo?
I built and ran the image. I see this:

```
$ docker run --rm -it sky
started sampler with 0.001 9223372036854776
Unhandled exception. System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
... more stacktrace
```
@richlander @MichaelSimons should we transfer to dotnet/aspnetcore?
Seems like a dupe of #93205.
If I had to guess, there might be an issue here. This line:

```csharp
byte[] encoded = sha.ComputeHash(key.ToArray().TakeWhile(c => c != '-').Select(c => (byte)c).ToArray());
```

should probably just be:

```csharp
byte[] encoded = SHA256.HashData(key.ToArray().TakeWhile(c => c != '-').Select(c => (byte)c).ToArray());
```

Using the static method is the "right" thing in this case. The static is safe to use from multiple threads and will give better performance. There may be other instances of a hash algorithm being used from multiple threads; I did not examine every usage of the hash algorithm.

In .NET 9 this has been "fixed" to throw a managed exception if a hash object is incorrectly used concurrently, instead of crashing the process.
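To make the failure mode concrete, here is a minimal sketch (the class, method, and data names are invented for illustration, not taken from the repo under discussion) contrasting a shared `SHA256` instance with the static one-shot API:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

class HashingSketch
{
    // UNSAFE (illustrative): one SHA256 instance shared across threads. Instance
    // hash objects carry internal state, so concurrent ComputeHash calls can
    // corrupt that state and, with OpenSSL 3.0 on Linux, can surface as
    // malloc()/heap-corruption aborts rather than a managed exception.
    static readonly SHA256 sharedSha = SHA256.Create();

    static byte[] UnsafeHash(string key) =>
        sharedSha.ComputeHash(Encoding.UTF8.GetBytes(key));

    // SAFE: the static one-shot API holds no shared state and is safe to call
    // from multiple threads; it also avoids allocating a hash object per call.
    static byte[] SafeHash(string key) =>
        SHA256.HashData(Encoding.UTF8.GetBytes(key));

    static void Main()
    {
        // Hammering UnsafeHash from many threads is the crash-prone pattern;
        // the same loop over SafeHash is fine.
        Parallel.For(0, 100_000, i =>
        {
            byte[] digest = SafeHash($"item-{i}");
            _ = digest; // ... use the digest ...
        });
        Console.WriteLine("done");
    }
}
```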
Should this change be back-ported to .NET 8, or is that considered too breaking? Getting tons of hard-to-diagnose reports isn't going to be great.
It's not a new behavior in .NET 8. I think it is easier to hit in .NET 8, though, because the .NET 8 docker images switched from OpenSSL 1.1 to OpenSSL 3.0, and OpenSSL 3.0 seems more prone to the issue occurring. I also don't know for a fact that this is the issue at hand here; I just noted that in the code base there is at least one place where it appears a hash algorithm instance is used concurrently, and that is a known source of these hard-to-diagnose problems.

My 2c: back-porting the fix as-is would be too breaking because we applied it to all platforms. Windows, for example, we know is more tolerant of concurrent hash algorithm use in some (but not all!) circumstances, so the bad scenario just seems to "work". Whereas the backport would stop a "not correct but not broken" scenario from "working". If we want to do something for 8.0 then the fix would need to be more targeted. @bartonjs what are your thoughts?
I don't think that the servicing team will like "we want to make this change in case it happens to be useful". I don't know that "this doesn't fix a problem per se, it just turns an app termination into an exception" would go over astonishingly well... so it would need to be coupled with someone who is hitting this problem in the wild saying it went away when they changed from instance hashes to static hashes (or shared instances to individual instances, or locking, or whatever).
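For reference, a hedged sketch of the three application-level mitigations mentioned above (class and method names are invented for illustration); any of them avoids using a single hash instance concurrently:

```csharp
using System.Security.Cryptography;

static class HashMitigations
{
    // Option 1: the static one-shot API; no shared state, thread-safe.
    public static byte[] StaticHash(byte[] data) => SHA256.HashData(data);

    // Option 2: an individual instance per call; nothing is shared, so nothing
    // can be used concurrently. Slightly more allocation than option 1.
    public static byte[] PerCallInstance(byte[] data)
    {
        using var sha = SHA256.Create();
        return sha.ComputeHash(data);
    }

    // Option 3: keep a shared instance but serialize access with a lock.
    private static readonly SHA256 shared = SHA256.Create();
    private static readonly object gate = new();

    public static byte[] LockedSharedInstance(byte[] data)
    {
        lock (gate)
        {
            return shared.ComputeHash(data);
        }
    }
}
```

The static one-shot API requires .NET 5 or later; the per-call instance and the lock work on older targets as well.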
Thanks for the explanation. Sounds like the new behavior coming in a new major version is the best choice.
@vcsjones is right; switching to the static `SHA256.HashData` call resolved it.
This change was merged 3 weeks ago (at the time of writing). It is not in any released .NET 9 previews. It looks like it will be present in .NET 9 Preview 4.
So now a backport seems more justified. Probably just the OpenSSL-based impl... and I don't know if "just 8" or "may as well offer up 6". But 7 has already had final servicing, so that's easier at least.
Tagging subscribers to this area: @dotnet/area-system-security, @bartonjs, @vcsjones
Okay. I will make a bespoke backport for 8 and 6, scoped to something more reasonable.
We merged a change for .NET 8 that was approved for 8.0.6. After some consideration I did not do a backport for 6.0; we have received no reports of this issue with .NET 6, since the 6.0 image base uses OpenSSL 1.1. Since all expected actions have been taken, I am going to close this out. Please feel free to re-open this issue if there are any unaddressed concerns or issues. Thanks all!
Describe the Bug
We are running the `mcr.microsoft.com/dotnet/aspnet:8.0.3-bookworm-slim` image for our Kubernetes ASP.NET services. A couple of days ago we started experiencing random pod restarts, and after investigating we came to the conclusion that something has changed in the image that breaks malloc. The code has not changed at all between builds; only the image has been rebuilt. The previous working image was on Bookworm 8.0.2; all 8.0.3 images fail after a while. The errors that we get are `malloc()`, `malloc_consolidate()`, and `tcache_thread_shutdown` failures.

The only extra libs that we add to the image are `tzdata` and `libicu72`.

Steps to Reproduce

1. Run the service on the `mcr.microsoft.com/dotnet/aspnet:8.0.3-bookworm-slim` image.
2. After a while, a `malloc()`, `malloc_consolidate()`, or `tcache_thread_shutdown` error occurs and the pod restarts.

Other Information
The issue is difficult to reproduce; we had to set up a bombardment service to trigger it (see the sketch below).

We have tried different RAM limits, but we never get to the max limit.

The workaround for now is to use the Alpine image instead. We have it in another service and it works fine on the same Kubernetes cluster.
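A hypothetical sketch of the kind of bombardment service described above (the target URL, batch size, and names are placeholders, not from the actual setup):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

class Bombardier
{
    static async Task Main()
    {
        using var client = new HttpClient();
        // Placeholder endpoint; point this at the service under test.
        var target = new Uri("http://my-service.default.svc.cluster.local/health");

        // Fire batches of concurrent requests until the pod crashes and restarts.
        for (int batch = 1; ; batch++)
        {
            var tasks = new Task<HttpResponseMessage>[100];
            for (int i = 0; i < tasks.Length; i++)
                tasks[i] = client.GetAsync(target);

            await Task.WhenAll(tasks);
            foreach (var t in tasks)
                t.Result.Dispose();

            if (batch % 100 == 0)
                Console.WriteLine($"sent {batch * tasks.Length} requests so far");
        }
    }
}
```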
Output of `docker version`

We are running Kubernetes 1.28.8.

Output of `docker info`

We are running Kubernetes 1.28.8.