Random malloc failures on current Debian Bookworm slim image #101722

KLazarov · 2024-04-11T19:48:46Z

Describe the Bug

We are running the mcr.microsoft.com/dotnet/aspnet:8.0.3-bookworm-slim image for our Kubernetes Asp.Net services. A couple of days ago we started experiencing random pod restarts, and after investigating we came to the conclusion that something has changed in the image that breaks something with malloc. The code has not changed at all between builds, only the image has been rebuild. The previous working image on the bookworm 8.0.2, all 8.0.3 images fail after a while.

The errors that we get are:

Apr 8 22:43:16 pod1: tcache_thread_shutdown(): unaligned tcache chunk detected
Apr 9 09:32:53 pod2: malloc(): unaligned tcache chunk detected
Apr 9 10:15:15 pod3: malloc(): unaligned fastbin chunk detected
Apr 9 10:54:14 pod2: malloc_consolidate(): unaligned fastbin chunk detected
Apr 9 12:47:26 pod2: malloc(): unaligned tcache chunk detected
Apr 9 15:03:42 pod2: malloc_consolidate(): unaligned fastbin chunk detected
Apr 9 21:32:10 pod3: malloc(): unaligned tcache chunk detected
Apr 10 09:06:59 pod3: malloc(): unaligned tcache chunk detected
Apr 10 09:31:50 pod2: malloc(): unaligned tcache chunk detected
Apr 10 10:28:11 pod3: malloc(): unaligned tcache chunk detected
Apr 10 16:32:00 pod3: malloc(): unaligned tcache chunk detected
Apr 10 19:08:25 pod2: malloc(): unaligned tcache chunk detected
Apr 11 14:36:42 pod3: malloc_consolidate(): unaligned fastbin chunk detected

The only extra libs that we add to the image are tzdata and libicu72.

Steps to Reproduce

Use an Asp.Net service using the mcr.microsoft.com/dotnet/aspnet:8.0.3-bookworm-slim image.
Deploy to Kubernetes.
Do a lot of requests (we have around 100k or so before it fails, but sometimes it fails sooner)
You get malloc() or malloc_consolidate() or tcache_thread_shutdown error and the pod restarts.

Other Information

The issue is difficult to reproduce, we had to setup a bombardment service to get it.

We have tried with different RAM limits, but we never get to the max limit.

Workaround for now is to use the Alpine image instead. We have it in another service and it works fine on the same Kubernetes cluster.

Output of `docker version`

We are running Kubernetes 1.28.8.

Output of `docker info`

We are running Kubernetes 1.28.8.

The text was updated successfully, but these errors were encountered:

lbussell · 2024-04-11T22:08:51Z

Hi @KLazarov, please try the latest image mcr.microsoft.com/dotnet/aspnet:8.0.4-bookworm-slim that was released on 2024-04-09 and see if that works.

The issue also might be coming from one of the .NET dependencies that are installed from the Debian package repos. If you can find the last image digest that is known working, then we can check the versions of packages between that and the 8.0.3 Bookworm image to see what changed.

richlander · 2024-04-11T23:30:15Z

You might also try 8.0-jammy to see if has better behavior. It should be a turnkey replacement.

KLazarov · 2024-04-12T09:23:42Z

Hi @lbussell, we had that 8.0.2 worked fine, as I couldn't see the failure in the logs for that one. It was only after we made a rebuild for a new release together with updating to 8.0.3 that it started failing. The build that started failing was on the 8th of April around 19 UTC.
After running with 8.0-alpine, we haven't had the issue any more.

Hi @richlander, we have already replaced it with the alpine image as it was affecting our production environment, currently I don't have the time to do more tests with other debian images, sorry.

richlander · 2024-04-12T13:57:36Z

Glad to hear that the Alpine image resolved the issue. We haven't heard any other reports of this issue so it is hard for us to know what is causing it.

FYI: 8.0-jammy is Ubuntu. Also, 8.0-jammy-chisled is a chiseled image about the size of Alpine and 8.0-jammy-chiseled-extra is that same image + tzdata and libicu.

KLazarov · 2024-04-12T15:07:54Z

Thanks for the info!

After running the alpine image for 20 hours, we haven't encountered the issue.
It's entirely possible that only we had this issue because we install tzdata and libicu72 manually instead of using the extra image. Might be worth checking with Ubuntu/Debian for lib regression?
Sadly I haven't been able to pin down the culprit for the failures, and the logs that I showed were the only thing that we got from the pods.

richlander · 2024-04-12T15:13:59Z

Might be worth checking with Ubuntu/Debian for lib regression?

I don't think there is enough to go on yet to make a report actionable. Is there was a regression, it would be presumably be much more pervasive. Or, we're about to see a deluge of issues, which (on one hand) would help.

Perhaps the eclipse caused some weird cosmic rays to affect the hardware.

lbussell · 2024-04-18T18:04:38Z

[Triage] Closing as not planned since Alpine is working for @KLazarov and there's no clear direction for investigation to go.

Ekwav · 2024-04-24T12:18:02Z

I have a similar issue where after a on average 50k requests my aspnet image crashes with some malloc() issue.
Doesn't matter what 8.0 version I use, tried on release with 8.0 yesterday with 8.0.4 and alpine.
Sadly I have not been able to narrow it down more than it needs a lot of requests to trigger and started at .net 8
Since .net 7 reaches EOL soon I am currently downgrading to .net 6. The micro service this happens in functions as our api-gateway and is public here: https://github.com/Coflnet/SkyApi/blob/main/Dockerfile

KLazarov · 2024-04-29T10:47:42Z

@Ekwav did you get the same error messages as me in the original post?

@richlander @lbussell might be worth considering investigating this with @Ekwav's example repo?

richlander · 2024-04-29T16:02:06Z

I built and ran the image. I see this:

$ docker run --rm -it sky
started sampler with 0.001 9223372036854776
Unhandled exception. System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 0xFFFDFFFF): Name or service not known
... more stacktrace

lbussell · 2024-04-30T15:21:56Z

@richlander @MichaelSimons should we transfer to dotnet/aspnetcore?

richlander · 2024-04-30T15:51:54Z

Seems like dupe of #93205

richlander · 2024-04-30T15:56:06Z

@janvorli @bartonjs @vcsjones

vcsjones · 2024-04-30T16:14:56Z

If I had to guess, there might be an issue here:

https://github.com/Coflnet/SkyBackendForFrontend/blob/2bc2185916fba1e492d3eb2aad10f52b71e81831/Services/StateUpdateService.cs#L28-L34

This does a var sha = SHA256.Create(); and captures it in a lambda used by Kafka. If that lambda is executing concurrently, then the hash object is being used from multiple threads. Instances of SHA256 are not safe to use from multiple threads.

This line:

byte[] encoded = sha.ComputeHash(key.ToArray().TakeWhile(c => c != '-').Select(c => (byte)c).ToArray());

Should probably just be:

byte[] encoded = SHA256.HashData(key.ToArray().TakeWhile(c => c != '-').Select(c => (byte)c).ToArray());

Using the static is the "right" thing in this case. The static is safe to use from multiple threads, and will give better performance. There may be other instances of a hash algorithm being used from multiple threads. I did not examine every usage of the hash algorithm.

In .NET 9 this has been "fixed" to throw a managed exception if a hash object is incorrectly used concurrently instead of crashing the process.

richlander · 2024-04-30T16:18:34Z

Should this change be back-ported to .NET 8 or is that considered too breaking?

Getting tons of hard-to-diagnose reports isn't going to be great.

vcsjones · 2024-04-30T16:28:41Z

Should this change be back-ported to .NET 8 or is that considered too breaking?

It's not a new behavior in .NET 8, I think it is easier to hit in .NET 8 though because .NET 8 docker images switched from OpenSSL 1.1 to OpenSSL 3.0 and OpenSSL 3.0 seems more prone to the issue occurring.

I also don't know for a fact that is the issue at hand here - I just noted in the code base there is at least one place where it appears a hash algorithm instance is used concurrently and that is a known source of these hard to diagnose problems.

My 2c: back porting the fix as-is would be too breaking because we applied it to all platforms. Windows for example, we know is more tolerant to concurrent hash algorithm use in some (but not all!) circumstances, so the bad scenario just seems to "work". Whereas the backport would stop a "not correct but not broken" scenario from "working".

If we want to do something for 8.0 then the fix would need to be more targeted.

@bartonjs what are your thoughts?

bartonjs · 2024-04-30T16:38:18Z

I don't think that the servicing team will like "we want to make this change in case it happens to be useful". I don't know that "this doesn't fix a problem per se, it just turns an app termination into an exception" would go over astonishingly well... so it would need to be coupled with someone who is hitting this problem in the wild saying it went away when they changed from instance hashes to static hashes (or shared instances to individual instances, or locking, or whatever).

richlander · 2024-04-30T16:43:36Z

Thanks for the explanation. Sounds like the new behavior coming in a new major version is the best choice.

Ekwav · 2024-04-30T18:07:17Z

@vcsjones is right switching to SHA256.HashData seemingly resolved my issue. No crashes since I changed.
I am not sure that this is already changed in .NET 9, I also tried the asp.net preview image last week and it had the same crashes.
An exception on .NET 8 would have saved me about a week of work.

vcsjones · 2024-04-30T18:15:43Z

I am not sure that this is already changed in .NET 9, I also tried the asp.net preview image last week and it had the same crashes

This change was merged 3 weeks ago (at the time of writing). It is not in any released .NET 9 previews. It looks like it will be present in .NET 9 Preview 4.

bartonjs · 2024-04-30T19:30:38Z

@vcsjones is right switching to SHA256.HashData seemingly resolved my issue. No crashes since I changed.

So now a backport seems more justified. Probably just the OpenSSL-based impl... and I don't know if "just 8" or "may as well offer up 6". But 7 has already had final servicing, so that's easier at least.

dotnet-policy-service · 2024-04-30T19:39:41Z

Tagging subscribers to this area: @dotnet/area-system-security, @bartonjs, @vcsjones
See info in area-owners.md if you want to be subscribed.

vcsjones · 2024-04-30T19:40:13Z

Okay. I will make a bespoke backport for 8 and 6 to something more reasonable.

vcsjones · 2024-05-02T21:44:47Z

We merged a change for .NET 8 that was approved for 8.0.6.

After some consideration I did not do a backport for 6.0. We have received no reports for this issue with .NET 6 since the 6.0 image base uses OpenSSL 1.1.

Since all expected actions have been taken, I am going to close this out. Please feel free to re-open this issue if there are any unaddressed concerns or issues.

Thanks all!

richlander · 2024-05-02T21:46:55Z

Thanks much @vcsjones and @bartonjs!

KLazarov added bug untriaged New issue has not been triaged by the area owner labels Apr 11, 2024

dotnet-issue-labeler bot added the area-Infrastructure label Apr 11, 2024

lbussell closed this as not planned Won't fix, can't repro, duplicate, stale Apr 18, 2024

richlander reopened this Apr 29, 2024

richlander transferred this issue from dotnet/dotnet-docker Apr 30, 2024

vcsjones removed bug area-Infrastructure untriaged New issue has not been triaged by the area owner labels Apr 30, 2024

vcsjones added the area-System.Security label Apr 30, 2024

vcsjones self-assigned this Apr 30, 2024

vcsjones mentioned this issue Apr 30, 2024

[release/8.0-staging]: Add cryptographic operation counts to prevent process crashes #101737

Merged

5 tasks

MichaelSimons moved this from Sprint to Tracking in .NET Docker May 2, 2024

vcsjones closed this as completed May 2, 2024

github-project-automation bot moved this from Tracking to Done in .NET Docker May 2, 2024

github-actions bot locked and limited conversation to collaborators Jun 2, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Random malloc failures on current Debian Bookworm slim image #101722

Random malloc failures on current Debian Bookworm slim image #101722

KLazarov commented Apr 11, 2024 •

edited

Loading

lbussell commented Apr 11, 2024

richlander commented Apr 11, 2024

KLazarov commented Apr 12, 2024

richlander commented Apr 12, 2024

KLazarov commented Apr 12, 2024

richlander commented Apr 12, 2024

lbussell commented Apr 18, 2024

Ekwav commented Apr 24, 2024

KLazarov commented Apr 29, 2024

richlander commented Apr 29, 2024

lbussell commented Apr 30, 2024

richlander commented Apr 30, 2024

richlander commented Apr 30, 2024

vcsjones commented Apr 30, 2024

richlander commented Apr 30, 2024

vcsjones commented Apr 30, 2024

bartonjs commented Apr 30, 2024

richlander commented Apr 30, 2024

Ekwav commented Apr 30, 2024

vcsjones commented Apr 30, 2024 •

edited

Loading

bartonjs commented Apr 30, 2024

dotnet-policy-service bot commented Apr 30, 2024

vcsjones commented Apr 30, 2024

vcsjones commented May 2, 2024

richlander commented May 2, 2024

Random malloc failures on current Debian Bookworm slim image #101722

Random malloc failures on current Debian Bookworm slim image #101722

Comments

KLazarov commented Apr 11, 2024 • edited Loading

Describe the Bug

Steps to Reproduce

Other Information

Output of docker version

Output of docker info

lbussell commented Apr 11, 2024

richlander commented Apr 11, 2024

KLazarov commented Apr 12, 2024

richlander commented Apr 12, 2024

KLazarov commented Apr 12, 2024

richlander commented Apr 12, 2024

lbussell commented Apr 18, 2024

Ekwav commented Apr 24, 2024

KLazarov commented Apr 29, 2024

richlander commented Apr 29, 2024

lbussell commented Apr 30, 2024

richlander commented Apr 30, 2024

richlander commented Apr 30, 2024

vcsjones commented Apr 30, 2024

richlander commented Apr 30, 2024

vcsjones commented Apr 30, 2024

bartonjs commented Apr 30, 2024

richlander commented Apr 30, 2024

Ekwav commented Apr 30, 2024

vcsjones commented Apr 30, 2024 • edited Loading

bartonjs commented Apr 30, 2024

dotnet-policy-service bot commented Apr 30, 2024

vcsjones commented Apr 30, 2024

vcsjones commented May 2, 2024

richlander commented May 2, 2024

KLazarov commented Apr 11, 2024 •

edited

Loading

Output of `docker version`

Output of `docker info`

vcsjones commented Apr 30, 2024 •

edited

Loading