ASP.NET app crashes with core dump when using 7.0.11-bookworm-slim docker image (Debian 12) #93205

Closed
agjini opened this issue Oct 9, 2023 · 18 comments


agjini commented Oct 9, 2023

Description

We recently switched our base docker image from 7.0.11-bullseye-slim to 7.0.11-bookworm-slim.
One of our containers based on this image (the auth service) crashes after a while with the output below. We encountered three different error messages, all causing a core dump.

Reproduction Steps

We build our docker image based on 7.0.11-bookworm-slim.

Expected behavior

We expect the application to run normally without crashing (of course), as it does with the Debian 11 image.

Actual behavior

The app crashes with one of the following outputs:

Segmentation fault (core dumped)
Aborted                 (core dumped)
malloc(): unaligned tcache chunk detected
Segmentation fault      (core dumped)
double free or corruption (out)
Aborted                 (core dumped)
corrupted size vs. prev_size in fastbins

Regression?

It seems to be a bug introduced with the new Debian 12 based image. When we switch back to the previous docker image (Debian 11), the problem disappears.

Known Workarounds

Rolling back to the Debian 11 based docker image fixes the problem.

Configuration

.NET 7.0.11
Host OS: Ubuntu 22.04.2 LTS
Docker info: Server Version: 23.0.1
x64 architecture

Other information

No response

Member

janvorli commented Oct 9, 2023

@agjini can you please try setting the MALLOC_CHECK_ env variable to 1 to see if it gives more details on the failure?

Author

agjini commented Oct 9, 2023

I've run two tests, with
MALLOC_CHECK_=1 and MALLOC_CHECK_=3

I set the environment variable in the environment before running the dotnet program (inside the docker container).

But it doesn't give me any better output: only Segmentation fault (core dumped).

Do I have to do anything specific to get the malloc output?

@gokhansengun

Same here: upgrading from the bullseye-based docker image to bookworm for runtime 6.0.28, the same issue occurs and is 100% reproducible. Using Ubuntu jammy also produces the same error.

It seems to be related to crypto dependencies (possibly OpenSSL).

The diff below (removing SHA256 hashing) fixes the issue:

[screenshot of the diff removing the SHA256 hashing]
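
The screenshot itself is not reproduced here. Based on the later discussion (a shared _hashAlgorithm field in an injected service, hashing text to Base64), a hypothetical reconstruction of the kind of code involved might look like this (class and method names are illustrative):

using System;
using System.Security.Cryptography;
using System.Text;

public class TextHasher // illustrative name, not from the issue
{
    // Shared across all callers of the (likely singleton/injected) service.
    private readonly HashAlgorithm _hashAlgorithm = SHA256.Create();

    public string Hash(string text)
    {
        // Not thread safe: concurrent callers mutate the same underlying digest context.
        return Convert.ToBase64String(
            _hashAlgorithm.ComputeHash(Encoding.UTF8.GetBytes(text)));
    }
}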


dreaminhex commented Mar 18, 2024

I recently investigated a similar issue that resulted in the exact same error message, on the same image as you, when load testing my .NET 7.0 WebAPI endpoint.

I was able to resolve it by replacing the unmanaged memory operations in my hashing code with managed ones.

  1. My hashing service now exists as a singleton, and I reference the HashAlgorithm as a readonly field that is instantiated once at construction (using SHA256.Create()).
  2. To make this work, I lock the HashAlgorithm object before computing a hash, because SHA256 is not thread safe (a minimal sketch of this pattern follows below).
  3. I didn't notice any significant thread waits, but if you do, you could drop the HashAlgorithm instance field and instead create one in your hashing method inside a using statement, though I'd expect some other potential performance hits there.
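
For illustration, a minimal sketch of that singleton-plus-lock approach, assuming a simple service that hashes a string to Base64 (the class and method names are illustrative, not taken from the original code):

using System;
using System.Security.Cryptography;
using System.Text;

public sealed class HashingService : IDisposable // illustrative name
{
    // Single HashAlgorithm instance, created once at construction.
    private readonly HashAlgorithm _hashAlgorithm = SHA256.Create();
    private readonly object _hashLock = new();

    public string HashText(string text)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        // HashAlgorithm instances are not thread safe, so serialize access to the shared one.
        lock (_hashLock)
        {
            return Convert.ToBase64String(_hashAlgorithm.ComputeHash(bytes));
        }
    }

    public void Dispose() => _hashAlgorithm.Dispose();
}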

One other change that worked: we're now using Ubuntu jammy and have not seen this come back up.

Hope it helps.


gokhansengun commented Mar 19, 2024

Thanks @Usualdosage for the comment. I understand that when using interop-style operations we as app developers are responsible for thread safety; however, when using a system library from a managed language (C#), I would expect thread-safety issues to be handled either by the native DLL (OpenSSL in this case) or by the library (System.Security.Cryptography).

In my case with SHA256, however, I think it is related to the native OpenSSL DLL, as the same code works fine with the older OpenSSL version in the buster image. So I think Microsoft needs to address this; otherwise, having to wrap system libraries for thread safety ourselves does not sound suitable for a managed language/runtime.

@janvorli do you have any comments on this issue?

@janvorli
Member

@gokhansengun let me cc @karelz, who should be able to pull in the right folks who work on this part of the OpenSSL-related code.

Member

karelz commented Mar 26, 2024

@wfurt @rzikm can you please take a look?

Member

rzikm commented Mar 26, 2024

Without memory dumps or a repro it is very hard to investigate.

@gokhansengun, @agjini would it be possible to get either of those? A small self-contained repro project would be preferred.

Member

wfurt commented Mar 27, 2024

cc @bartonjs @vcsjones for the hashing. I know some of the APIs changed in OpenSSL 3.0.

Member

vcsjones commented Mar 27, 2024

It seems to be related to crypto dependencies (possibly openssl).

Given that you have an instance of HashAlgorithm as a field in what looks like an injected service, I suspect you are using a hash algorithm from multiple threads.

Are you using an instance of HashAlgorithm like SHA256 from multiple threads? Sharing a hash algorithm instance between threads is not safe and can lead to incorrect hash results or crashes. This is not new behavior; similar issues have been reported in the past, for example #61417.

If it worked on previous versions of OpenSSL or containers, it worked by "accident": there is no thread-safety guarantee for hash algorithm instances, either from managed code or within OpenSSL itself. OpenSSL's internal structures are not thread safe in this circumstance.

You can use the static methods like SHA256.HashData, which offer better performance, are thread safe, and do not allocate internally.

Based on your screenshot, you should use:

var hashedText = Convert.ToBase64String(SHA256.HashData(Encoding.UTF8.GetBytes(text)));

Then the _hashAlgorithm field is unnecessary.

@MichalPetryka
Contributor

OpenSSL's internal structures are not thread safe in this circumstance.

Didn't they introduce excessive locking in OpenSSL 3.0 to make everything thread safe (which also supposedly caused the performance regressions there)?

@vcsjones
Member

to make everything thread safe?

OpenSSL 3 does have some additional locking in place, but as far as I understand it, most of that is around how providers and algorithms get resolved, not primitives themselves.

OpenSSL does not make any guarantees about thread safety in this circumstance.

To emphasize: most objects are not safe for simultaneous use. Exceptions to this should be documented on the specific manual pages, and some general high-level guidance is given here.
...
In all cases, however, it is generally not safe for one thread to mutate an object, such as setting elements of a private or public key, while another thread is using that object, such as verifying a signature.

In this case, EVP_DigestUpdate and EVP_MD_CTX are not documented as thread safe. I can pretty easily reproduce a crash by mutating an EVP_MD_CTX from multiple threads using EVP_DigestUpdate.

This will crash almost instantly for me using OpenSSL 3.2.

#include <assert.h>
#include <openssl/evp.h>
#include <pthread.h>

static void* updater(void *data)
{
    EVP_MD_CTX* ctx = (EVP_MD_CTX*)data;
    unsigned char foo[8] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    
    for (int i = 0; i < 0x7FFFFFFF; i++)
    {
        EVP_DigestUpdate(ctx, foo, sizeof(foo));
    }
    
    return NULL;
}

int main(int argc, char *argv[])
{
    EVP_MD* sha256 = EVP_MD_fetch(NULL, "SHA256", NULL);
    assert(sha256);
    
    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    assert(ctx);
    
    if (!EVP_DigestInit_ex2(ctx, sha256, NULL))
    {
        assert(0 && "init failed");
        return 1;
    }
    
    pthread_t thread1;
    pthread_t thread2;
    pthread_t thread3;
    pthread_create(&thread1, NULL, updater, ctx);
    pthread_create(&thread2, NULL, updater, ctx);
    pthread_create(&thread3, NULL, updater, ctx);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    pthread_join(thread3, NULL);
}

@vcsjones
Member

To emphasize though: even if OpenSSL 3 did not crash, using a hash object from multiple threads can result in incorrect hash results, even if the method "looks" like it does everything at once, like ComputeHash. The only thread-safe hashing APIs are the static ones.

You should either use the statics, or create a new HashAlgorithm instance as needed.
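
For illustration, a minimal sketch of the "new instance as needed" option (the class and method names are illustrative, not from this thread):

using System;
using System.Security.Cryptography;
using System.Text;

public static class OneOffHasher // illustrative name
{
    public static string HashText(string text)
    {
        // A fresh instance per call is never shared between threads.
        using SHA256 sha = SHA256.Create();
        return Convert.ToBase64String(sha.ComputeHash(Encoding.UTF8.GetBytes(text)));
    }
}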

@vcsjones
Member

After chatting with @bartonjs we are going to make some changes so we can at least throw a managed exception instead of crashing the process. It won't make anything thread safe, but it will prevent a hard crash of the whole process.

@gokhansengun

Thanks for the detailed analysis @vcsjones; preventing a hard crash with a managed exception will be a great improvement.

I cannot explain how, but I am confident it has something to do with the base image change to bookworm, which in turn changes the OpenSSL version from 1.1.1n-0 to 3.0.11-1. With the old Debian release, the same test with the same aspnet runtime version never reproduces the crash; with the new bookworm Debian release it is 100% reproducible. I will try to slim down the app and come up with a clean repro by next week.

@vcsjones
Member

I will try to slim down the app and come up with a clean repro by next week.

Reproducing this is as simple as

using System.Security.Cryptography;
using System.Threading;

HashAlgorithm hash = SHA256.Create();

Thread one = new(ThreadWork);
Thread two = new(ThreadWork);
one.Start(hash);
two.Start(hash);
one.Join();
two.Join();

static void ThreadWork(object obj)
{
    HashAlgorithm hash = (HashAlgorithm)obj;
    byte[] data = "potato"u8.ToArray();

    for (int i = 0; i < 1_000_000; i++)
    {
        try
        {
            hash.ComputeHash(data);
        }
        catch // do not care about managed exceptions.
        {

        }
    }
}

it has something to do with the base image change to bookworm, which in turn changes the OpenSSL version from 1.1.1n-0 to 3.0.11-1

It is very likely that something changed in OpenSSL 3 to make it more stateful, because it has to juggle legacy engines and providers.

Never reproduces the crash; with the new bookworm Debian release it is 100% reproducible

I do want to caution one more time though: just because it did not crash does not mean it worked correctly. It could have been producing incorrect hashes, even on OpenSSL 1.1.1.

@vcsjones
Member

I think we can close this out, as the root cause appears to be using an instance of HashAlgorithm concurrently. There are a few ways to address this problem.

  1. Create a new instance of a HashAlgorithm object as needed, or keep one per thread, e.g. via ThreadLocal<T> (a rough sketch follows below)
  2. Use static one-shots like SHA256.HashData
  3. Put a lock around uses of the HashAlgorithm
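
As a rough sketch of the ThreadLocal<T> variant of option 1 (the class and member names are illustrative, not from this thread):

using System.Security.Cryptography;
using System.Threading;

public static class PerThreadHasher // illustrative name
{
    // Each thread lazily creates and reuses its own SHA256 instance, so nothing is shared.
    private static readonly ThreadLocal<SHA256> s_sha256 = new(SHA256.Create);

    public static byte[] Hash(byte[] data) => s_sha256.Value!.ComputeHash(data);
}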

In .NET 9, we introduced a concurrency-misuse check in #100371 that prevents the process from crashing and instead raises an exception with appropriate error text. This was backported to .NET 8.0.6 in #101737 for Linux platforms.

@gokhansengun

Thanks a ton, the solution looks great
