Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Random malloc failures on current Debian Bookworm slim image #101722

Closed
KLazarov opened this issue Apr 11, 2024 · 25 comments
Closed

Random malloc failures on current Debian Bookworm slim image #101722

KLazarov opened this issue Apr 11, 2024 · 25 comments
Assignees

Comments

@KLazarov
Copy link

KLazarov commented Apr 11, 2024

Describe the Bug

We are running the mcr.microsoft.com/dotnet/aspnet:8.0.3-bookworm-slim image for our Kubernetes Asp.Net services. A couple of days ago we started experiencing random pod restarts, and after investigating we came to the conclusion that something has changed in the image that breaks something with malloc. The code has not changed at all between builds, only the image has been rebuild. The previous working image on the bookworm 8.0.2, all 8.0.3 images fail after a while.

The errors that we get are:

Apr 8 22:43:16 pod1: tcache_thread_shutdown(): unaligned tcache chunk detected
Apr 9 09:32:53 pod2: malloc(): unaligned tcache chunk detected
Apr 9 10:15:15 pod3: malloc(): unaligned fastbin chunk detected
Apr 9 10:54:14 pod2: malloc_consolidate(): unaligned fastbin chunk detected
Apr 9 12:47:26 pod2: malloc(): unaligned tcache chunk detected
Apr 9 15:03:42 pod2: malloc_consolidate(): unaligned fastbin chunk detected
Apr 9 21:32:10 pod3: malloc(): unaligned tcache chunk detected
Apr 10 09:06:59 pod3: malloc(): unaligned tcache chunk detected
Apr 10 09:31:50 pod2: malloc(): unaligned tcache chunk detected
Apr 10 10:28:11 pod3: malloc(): unaligned tcache chunk detected
Apr 10 16:32:00 pod3: malloc(): unaligned tcache chunk detected
Apr 10 19:08:25 pod2: malloc(): unaligned tcache chunk detected
Apr 11 14:36:42 pod3: malloc_consolidate(): unaligned fastbin chunk detected

The only extra libs that we add to the image are tzdata and libicu72.

Steps to Reproduce

  1. Use an Asp.Net service using the mcr.microsoft.com/dotnet/aspnet:8.0.3-bookworm-slim image.
  2. Deploy to Kubernetes.
  3. Do a lot of requests (we have around 100k or so before it fails, but sometimes it fails sooner)
  4. You get malloc() or malloc_consolidate() or tcache_thread_shutdown error and the pod restarts.

Other Information

The issue is difficult to reproduce, we had to setup a bombardment service to get it.

We have tried with different RAM limits, but we never get to the max limit.

Workaround for now is to use the Alpine image instead. We have it in another service and it works fine on the same Kubernetes cluster.

Output of docker version

We are running Kubernetes 1.28.8.

Output of docker info

We are running Kubernetes 1.28.8.

@KLazarov KLazarov added bug untriaged New issue has not been triaged by the area owner labels Apr 11, 2024
@lbussell
Copy link
Contributor

Hi @KLazarov, please try the latest image mcr.microsoft.com/dotnet/aspnet:8.0.4-bookworm-slim that was released on 2024-04-09 and see if that works.

The issue also might be coming from one of the .NET dependencies that are installed from the Debian package repos. If you can find the last image digest that is known working, then we can check the versions of packages between that and the 8.0.3 Bookworm image to see what changed.

@richlander
Copy link
Member

You might also try 8.0-jammy to see if has better behavior. It should be a turnkey replacement.

@KLazarov
Copy link
Author

Hi @lbussell, we had that 8.0.2 worked fine, as I couldn't see the failure in the logs for that one. It was only after we made a rebuild for a new release together with updating to 8.0.3 that it started failing. The build that started failing was on the 8th of April around 19 UTC.
After running with 8.0-alpine, we haven't had the issue any more.

Hi @richlander, we have already replaced it with the alpine image as it was affecting our production environment, currently I don't have the time to do more tests with other debian images, sorry.

@richlander
Copy link
Member

Glad to hear that the Alpine image resolved the issue. We haven't heard any other reports of this issue so it is hard for us to know what is causing it.

FYI: 8.0-jammy is Ubuntu. Also, 8.0-jammy-chisled is a chiseled image about the size of Alpine and 8.0-jammy-chiseled-extra is that same image + tzdata and libicu.

@KLazarov
Copy link
Author

Thanks for the info!

After running the alpine image for 20 hours, we haven't encountered the issue.
It's entirely possible that only we had this issue because we install tzdata and libicu72 manually instead of using the extra image. Might be worth checking with Ubuntu/Debian for lib regression?
Sadly I haven't been able to pin down the culprit for the failures, and the logs that I showed were the only thing that we got from the pods.

@richlander
Copy link
Member

Might be worth checking with Ubuntu/Debian for lib regression?

I don't think there is enough to go on yet to make a report actionable. Is there was a regression, it would be presumably be much more pervasive. Or, we're about to see a deluge of issues, which (on one hand) would help.

Perhaps the eclipse caused some weird cosmic rays to affect the hardware.

@lbussell
Copy link
Contributor

[Triage] Closing as not planned since Alpine is working for @KLazarov and there's no clear direction for investigation to go.

@lbussell lbussell closed this as not planned Won't fix, can't repro, duplicate, stale Apr 18, 2024
@Ekwav
Copy link

Ekwav commented Apr 24, 2024

I have a similar issue where after a on average 50k requests my aspnet image crashes with some malloc() issue.
Doesn't matter what 8.0 version I use, tried on release with 8.0 yesterday with 8.0.4 and alpine.
Sadly I have not been able to narrow it down more than it needs a lot of requests to trigger and started at .net 8
Since .net 7 reaches EOL soon I am currently downgrading to .net 6. The micro service this happens in functions as our api-gateway and is public here: https://github.com/Coflnet/SkyApi/blob/main/Dockerfile

@KLazarov
Copy link
Author

@Ekwav did you get the same error messages as me in the original post?

@richlander @lbussell might be worth considering investigating this with @Ekwav's example repo?

@richlander richlander reopened this Apr 29, 2024
@richlander
Copy link
Member

I built and ran the image. I see this:

$ docker run --rm -it sky
started sampler with 0.001 9223372036854776
Unhandled exception. System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
... more stacktrace

@lbussell
Copy link
Contributor

@richlander @MichaelSimons should we transfer to dotnet/aspnetcore?

@richlander
Copy link
Member

Seems like dupe of #93205

@richlander richlander transferred this issue from dotnet/dotnet-docker Apr 30, 2024
@richlander
Copy link
Member

@janvorli @bartonjs @vcsjones

@vcsjones
Copy link
Member

If I had to guess, there might be an issue here:

https://github.com/Coflnet/SkyBackendForFrontend/blob/2bc2185916fba1e492d3eb2aad10f52b71e81831/Services/StateUpdateService.cs#L28-L34

This does a var sha = SHA256.Create(); and captures it in a lambda used by Kafka. If that lambda is executing concurrently, then the hash object is being used from multiple threads. Instances of SHA256 are not safe to use from multiple threads.

This line:

byte[] encoded = sha.ComputeHash(key.ToArray().TakeWhile(c => c != '-').Select(c => (byte)c).ToArray());

Should probably just be:

byte[] encoded = SHA256.HashData(key.ToArray().TakeWhile(c => c != '-').Select(c => (byte)c).ToArray());

Using the static is the "right" thing in this case. The static is safe to use from multiple threads, and will give better performance. There may be other instances of a hash algorithm being used from multiple threads. I did not examine every usage of the hash algorithm.

In .NET 9 this has been "fixed" to throw a managed exception if a hash object is incorrectly used concurrently instead of crashing the process.

@richlander
Copy link
Member

Should this change be back-ported to .NET 8 or is that considered too breaking?

Getting tons of hard-to-diagnose reports isn't going to be great.

@vcsjones
Copy link
Member

Should this change be back-ported to .NET 8 or is that considered too breaking?

It's not a new behavior in .NET 8, I think it is easier to hit in .NET 8 though because .NET 8 docker images switched from OpenSSL 1.1 to OpenSSL 3.0 and OpenSSL 3.0 seems more prone to the issue occurring.

I also don't know for a fact that is the issue at hand here - I just noted in the code base there is at least one place where it appears a hash algorithm instance is used concurrently and that is a known source of these hard to diagnose problems.

My 2c: back porting the fix as-is would be too breaking because we applied it to all platforms. Windows for example, we know is more tolerant to concurrent hash algorithm use in some (but not all!) circumstances, so the bad scenario just seems to "work". Whereas the backport would stop a "not correct but not broken" scenario from "working".

If we want to do something for 8.0 then the fix would need to be more targeted.

@bartonjs what are your thoughts?

@bartonjs
Copy link
Member

I don't think that the servicing team will like "we want to make this change in case it happens to be useful". I don't know that "this doesn't fix a problem per se, it just turns an app termination into an exception" would go over astonishingly well... so it would need to be coupled with someone who is hitting this problem in the wild saying it went away when they changed from instance hashes to static hashes (or shared instances to individual instances, or locking, or whatever).

@richlander
Copy link
Member

Thanks for the explanation. Sounds like the new behavior coming in a new major version is the best choice.

@Ekwav
Copy link

Ekwav commented Apr 30, 2024

@vcsjones is right switching to SHA256.HashData seemingly resolved my issue. No crashes since I changed.
I am not sure that this is already changed in .NET 9, I also tried the asp.net preview image last week and it had the same crashes.
An exception on .NET 8 would have saved me about a week of work.

@vcsjones
Copy link
Member

vcsjones commented Apr 30, 2024

I am not sure that this is already changed in .NET 9, I also tried the asp.net preview image last week and it had the same crashes

This change was merged 3 weeks ago (at the time of writing). It is not in any released .NET 9 previews. It looks like it will be present in .NET 9 Preview 4.

@bartonjs
Copy link
Member

@vcsjones is right switching to SHA256.HashData seemingly resolved my issue. No crashes since I changed.

So now a backport seems more justified. Probably just the OpenSSL-based impl... and I don't know if "just 8" or "may as well offer up 6". But 7 has already had final servicing, so that's easier at least.

@vcsjones vcsjones removed bug area-Infrastructure untriaged New issue has not been triaged by the area owner labels Apr 30, 2024
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-security, @bartonjs, @vcsjones
See info in area-owners.md if you want to be subscribed.

@vcsjones
Copy link
Member

Okay. I will make a bespoke backport for 8 and 6 to something more reasonable.

@vcsjones
Copy link
Member

vcsjones commented May 2, 2024

We merged a change for .NET 8 that was approved for 8.0.6.

After some consideration I did not do a backport for 6.0. We have received no reports for this issue with .NET 6 since the 6.0 image base uses OpenSSL 1.1.

Since all expected actions have been taken, I am going to close this out. Please feel free to re-open this issue if there are any unaddressed concerns or issues.

Thanks all!

@vcsjones vcsjones closed this as completed May 2, 2024
@github-project-automation github-project-automation bot moved this from Tracking to Done in .NET Docker May 2, 2024
@richlander
Copy link
Member

Thanks much @vcsjones and @bartonjs!

@github-actions github-actions bot locked and limited conversation to collaborators Jun 2, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
Status: Done
Development

No branches or pull requests

6 participants