Pal thread handling crashes on latest Arch linux #6613

Closed
rhuanjl opened this issue Feb 16, 2021 · 30 comments
Comments

@rhuanjl
Collaborator

rhuanjl commented Feb 16, 2021

On Arch Linux, CC crashes when starting a helper thread, with the stack trace below. The crash was confirmed with both 1.11 and master, and was likely introduced by changes in Arch Linux rather than changes in CC:

#0  0x00007ffff510aef5 in raise () at /usr/lib/libc.so.6
#1  0x00007ffff50f4862 in abort () at /usr/lib/libc.so.6
#2  0x00007ffff514cf78 in __libc_message () at /usr/lib/libc.so.6
#3  0x00007ffff514cfaa in __libc_fatal () at /usr/lib/libc.so.6
#4  0x00007ffff4eca70c in __lll_lock_wait () at /usr/lib/libpthread.so.0
#5  0x00007ffff4ec35f0 in pthread_mutex_lock () at /usr/lib/libpthread.so.0
#6  0x00007ffff589847b in JsUtil::BackgroundJobProcessor::Run(JsUtil::ParallelThreadData*) () at /usr/lib/libChakraCore.so
#7  0x00007ffff589826a in JsUtil::BackgroundJobProcessor::StaticThreadProc(void*) () at /usr/lib/libChakraCore.so
#8  0x00007ffff56328ce in CorUnix::CPalThread::ThreadEntry(void*) () at /usr/lib/libChakraCore.so
#9  0x00007ffff4ec1299 in start_thread () at /usr/lib/libpthread.so.0
#10 0x00007ffff51cd153 in clone () at /usr/lib/libc.so.6

(Issue reported by a friend - I don't have a Linux setup to verify it myself)

@ppenzin
Member

ppenzin commented Feb 17, 2021

What JS code does this reproduce on? Even if it started to happen after some changes in Arch, we might still be on the hook if something wrong is passed down to the system library.

@rhuanjl
Collaborator Author

rhuanjl commented Feb 17, 2021

What JS code does this reproduce on? Even if it started to happen after some changes in Arch, we might still be on the hook if something wrong is passed down to the system library.

It's not specific to any particular JS - running the test suite it happened with over half the dynopogo tests.

Running a simple console.log("hello") js file it happened intermittently.

@ppenzin
Member

ppenzin commented Feb 17, 2021

Arch seems to be available in WSL, I will give it a try.

@Eggbertx

Eggbertx commented Mar 6, 2021

I'm the friend mentioned in the OP, the one who reported the issue to @rhuanjl. I ran into the issue in Arch on my desktop, my laptop, and a virtual machine. Just to be absolutely sure that I hadn't overlooked something, I also tried it in a fresh Arch WSL installation and got an almost identical stack trace.
It even crashed when running this simple code, though it didn't crash every single time.

for(let i = 0; i < 32; i++) {
	console.log(`#${i}`);
}

@rhuanjl
Collaborator Author

rhuanjl commented Mar 9, 2021

The crash happens when it tries to launch a helper thread. I'm guessing, but the trigger is likely the first attempt to run the JIT or the garbage collector (both of which normally operate off-thread).

@nic11

nic11 commented Jan 9, 2022

This is reproducible for me on Manjaro, and the stack trace looks the same. When building inside an Ubuntu 20.04 container, everything works fine. I'll post an update if I'm able to reproduce this on some later version of Ubuntu.

@Pospelove

@nic11 Use Windows

@nic11

nic11 commented Jan 14, 2022

Fails on Ubuntu 21.10 (with a different stack trace, though). It's probably caused by libicu, as it appears to be the only required library with a different major version in each of these three environments:

# Ubuntu 20.04
$ ldd vcpkg_installed/x64-linux/bin/libChakraCore.so 
        linux-vdso.so.1 (0x00007ffcce371000)
        libicuuc.so.66 => /lib/x86_64-linux-gnu/libicuuc.so.66 (0x00007f9a94108000)
        libicui18n.so.66 => /lib/x86_64-linux-gnu/libicui18n.so.66 (0x00007f9a93e09000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f9a93de6000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f9a93de0000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f9a93c91000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f9a93c76000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f9a93a82000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f9a94fda000)
        libicudata.so.66 => /lib/x86_64-linux-gnu/libicudata.so.66 (0x00007f9a91fc1000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f9a91ddf000)

# Ubuntu 21.10
$ ldd vcpkg_installed/x64-linux/bin/libChakraCore.so
        linux-vdso.so.1 (0x00007ffd525dd000)
        libicuuc.so.67 => /lib/x86_64-linux-gnu/libicuuc.so.67 (0x00007fa712a66000)
        libicui18n.so.67 => /lib/x86_64-linux-gnu/libicui18n.so.67 (0x00007fa71275f000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fa71267b000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fa712661000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa712439000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa713910000)
        libicudata.so.67 => /lib/x86_64-linux-gnu/libicudata.so.67 (0x00007fa710920000)
        libstdc++.so.6 => /lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007fa710705000)

# Manjaro
$ ldd vcpkg_installed/x64-linux/bin/libChakraCore.so
        linux-vdso.so.1 (0x00007ffd518fb000)
        libicuuc.so.70 => /usr/lib/libicuuc.so.70 (0x00007f5dadefa000)
        libicui18n.so.70 => /usr/lib/libicui18n.so.70 (0x00007f5dadbd4000)
        libpthread.so.0 => /usr/lib/libpthread.so.0 (0x00007f5dadbb3000)
        libdl.so.2 => /usr/lib/libdl.so.2 (0x00007f5dadbac000)
        libm.so.6 => /usr/lib/libm.so.6 (0x00007f5dada68000)
        libgcc_s.so.1 => /usr/lib/libgcc_s.so.1 (0x00007f5dada4d000)
        libc.so.6 => /usr/lib/libc.so.6 (0x00007f5dad87f000)
        /usr/lib64/ld-linux-x86-64.so.2 (0x00007f5daee1e000)
        libicudata.so.70 => /usr/lib/libicudata.so.70 (0x00007f5dabc63000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x00007f5daba4d000)

@Pospelove

@nic11 can we fall back to an older ICU and check?

@Eggbertx

Eggbertx commented Jan 15, 2022

I had a feeling this would eventually happen. To my knowledge, the Arch devs only make changes when absolutely necessary, so if the issue isn't resolved soon, it'll likely start breaking with other distros as well.

@rhuanjl
Collaborator Author

rhuanjl commented Jan 22, 2022

I'm 99% confident that the failure on Arch was to do with PAL.

Relevant info

  • PAL is a component taken from .NET CoreCLR and embedded in the ChakraCore source.
  • PAL is a shim of much of the Windows C runtime on top of the Linux or macOS C runtimes.
  • PAL within ChakraCore is modified quite a bit from the original, so updating to the latest PAL as a fix isn't an (easy) option.
  • As a long-term goal I'd like to remove PAL, but I honestly don't know if I'm ever going to have enough time to do that.
  • The Arch crash specifically appeared to be to do with the thread manager that PAL provides (ChakraCore uses Windows-style thread handling, which PAL shims on top of the relevant Linux APIs).
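
To make the last point concrete, here is a minimal sketch of what shimming a Windows-style "create a thread, get a handle, wait on it for an exit code" call on top of pthreads can look like. This is not PAL's actual code: the names `ThreadHandle_sketch`, `CreateThread_sketch` and `WaitAndClose_sketch` are hypothetical, and the real PAL tracks far more state (synchronization objects, suspension, TLS and so on). Only `pthread_create` and `pthread_join` are real APIs here.

```cpp
#include <cassert>
#include <pthread.h>

// Hypothetical Windows-style thread procedure: takes one argument,
// returns an exit code.
using ThreadProc_sketch = unsigned long (*)(void*);

// Hypothetical "handle" holding everything the shim needs to remember.
struct ThreadHandle_sketch {
    pthread_t thread;
    ThreadProc_sketch proc;
    void* arg;
    unsigned long exitCode;
};

// pthread entry point: adapts the pthread signature (void* -> void*)
// to the Windows-style procedure and stores its exit code.
static void* ThreadEntry_sketch(void* raw) {
    auto* handle = static_cast<ThreadHandle_sketch*>(raw);
    handle->exitCode = handle->proc(handle->arg);
    return nullptr;
}

// Rough analogue of creating a thread: returns nullptr on failure.
inline ThreadHandle_sketch* CreateThread_sketch(ThreadProc_sketch proc, void* arg) {
    auto* handle = new ThreadHandle_sketch{pthread_t{}, proc, arg, 0};
    if (pthread_create(&handle->thread, nullptr, ThreadEntry_sketch, handle) != 0) {
        delete handle;
        return nullptr;
    }
    return handle;
}

// Rough analogue of waiting on the handle, reading the exit code,
// and closing the handle.
inline unsigned long WaitAndClose_sketch(ThreadHandle_sketch* handle) {
    pthread_join(handle->thread, nullptr);  // join makes exitCode visible here
    unsigned long code = handle->exitCode;
    delete handle;
    return code;
}
```

A caller would pass a procedure and an argument to `CreateThread_sketch` and later call `WaitAndClose_sketch` on the returned handle. The trampoline function plays the same role as the `CorUnix::CPalThread::ThreadEntry` frame in the stack trace at the top of this issue, though the real one does much more.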

ICU
An ICU related failure is almost certainly a different issue - and hopefully a much easier fix as we do just link to ICU rather than distributing it with the CC source - I'm not aware of anything in the CC source that locks us to a specific ICU version other than a default link target which we could update.

@Eggbertx

Would it be possible at all to replace PAL with a more natural system without having to rewrite the majority of the codebase for CC as a whole?

@nic11

nic11 commented Jan 24, 2022

I've tried building with --embed-icu, but had no luck either, though it links with an older version. It's worth trying to specify the same one Ubuntu 20.04 ships (66). I think I'll try that sometime later.

@rhuanjl
Collaborator Author

rhuanjl commented Jan 26, 2022

Would it be possible at all to replace PAL with a more natural system without having to rewrite the majority of the codebase for CC as a whole?

I'd like to investigate this. The big piece to look into is thread handling: exactly which calls are being used to create and manage threads, and how spread out across the codebase they are. It shouldn't actually be that invasive a change, as there are only a few things that can spin up a thread.

@rhuanjl
Collaborator Author

rhuanjl commented Jan 26, 2022

I've tried building with --embed-icu, but had no luck either, though it links with an older version. It's worth trying to specify the same one Ubuntu 20.04 ships (66). I think I'll try that sometime later.

Could you use a debug or test build and post a stack trace with symbols in so we can see where it's going wrong?

@nic11

nic11 commented Jan 27, 2022

Actually on my main system it's just the same as in the issue description. Or what exactly do you mean?

Actually, tbh, I didn't do a standalone build. I build ChakraCore as a vcpkg dependency, but the stack trace matches the description in this issue.

And yeah, sorry, but it may take me a bit of time, I'm busy currently. I guess I'll try libicu 66 build, maybe it'll help. Btw if I create an Arch or Ubuntu 21.10 based Docker image or Dockerfile where it can be reproduced, would it be simpler for you to debug?

@Wedmer

Wedmer commented Mar 7, 2022

@Wedmer

Wedmer commented Mar 7, 2022

@nic11

nic11 commented Mar 8, 2022

Also I've found some more info https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg2161469.html

Isn't it about SIGBUS? I thought it's a regular segfault here.

@Wedmer

Wedmer commented Mar 9, 2022

Isn't it about SIGBUS? I thought it's a regular segfault here.

Just read the whole thread to get some info on why the futex error is thrown. This info is needed to understand why the glibc commit mentioned above can lead to these terminations.

@ppenzin
Member

ppenzin commented Jun 2, 2022

@Wedmer this looks like a potential culprit: PAL might be trying to lock on an unaligned address. It has its own approach to things at times.
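
As a rough illustration of how a mutex can end up at an unaligned address, here is a minimal sketch. The struct names and `IsAlignedFor` are hypothetical and are not taken from PAL. The sketch assumes the misalignment would come from a packed structure, which is only one way it could happen. It never locks the misaligned mutex, it only inspects where the compiler places it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <pthread.h>

// Hypothetical packed layout: with 1-byte packing the mutex is placed
// directly after the one-byte tag, at offset 1.
#pragma pack(push, 1)
struct PackedSection_sketch {
    char tag;
    pthread_mutex_t mutex;
};
#pragma pack(pop)

// Same members with default layout: the compiler inserts padding so the
// mutex sits at an offset that satisfies its natural alignment.
struct NaturalSection_sketch {
    char tag;
    pthread_mutex_t mutex;
};

// Returns true when the address is a multiple of the given alignment.
inline bool IsAlignedFor(const void* address, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(address) % alignment == 0;
}
```

Comparing `offsetof(PackedSection_sketch, mutex)` with `offsetof(NaturalSection_sketch, mutex)` shows the difference, and a debug assertion such as `IsAlignedFor(&section.mutex, alignof(pthread_mutex_t))` before initializing a lock is one cheap way to catch this kind of layout problem early.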

@Wedmer

Wedmer commented Jun 2, 2022

I've seen. Part of PAL is truly cross-platform and has implementation for different synchronization objects and APIs, but another part is heavily built around pthread.

@rhuanjl
Collaborator Author

rhuanjl commented Apr 18, 2024

I think this may finally be fixed in latest master - I'm hoping it was the same issue as #6932. Following that, we've got our full CI running on Ubuntu 22, which seems promising (see #6980).

@Eggbertx could you confirm?

@Eggbertx

Using the latest commit as of this message (2af598f), I'm unable to build it with build.sh:

In file included from /home/eggbertx/src/ChakraCore/pal/src/include/pal/palinternal.h:323:
/home/eggbertx/src/ChakraCore/pal/inc/pal.h:2682:31: error: size of array element of type 'PM128A' (aka '_M128U *') (8 bytes) isn't a multiple of its alignment (16 bytes)
 2682 |         PM128A FloatingContext[16];
      |                               ^
1 error generated.
make[2]: *** [pal/src/CMakeFiles/Chakra.Pal.dir/build.make:76: pal/src/CMakeFiles/Chakra.Pal.dir/cruntime/file.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:648: pal/src/CMakeFiles/Chakra.Pal.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

@ShortDevelopment
Contributor

ShortDevelopment commented Apr 18, 2024

@Eggbertx It works using clang14

Install

sudo apt install clang-14

Build

Following the instructions from the build pipeline

cmake -GNinja -DCMAKE_BUILD_TYPE=$BUILD_TYPE $LIBTYPE -DCMAKE_CXX_COMPILER=clang++ -DCMAKE_C_COMPILER=clang ..

mkdir build
cd build

# prepare
cmake -GNinja -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_COMPILER=clang++-14 -DCMAKE_C_COMPILER=clang-14 ..

# build
ninja

@Eggbertx

I was able to build it with clang14, and it doesn't appear to be crashing now.

@rhuanjl
Collaborator Author

rhuanjl commented Apr 18, 2024

Using the latest commit as of this message (2af598f), I'm unable to build it with build.sh:

In file included from /home/eggbertx/src/ChakraCore/pal/src/include/pal/palinternal.h:323:
/home/eggbertx/src/ChakraCore/pal/inc/pal.h:2682:31: error: size of array element of type 'PM128A' (aka '_M128U *') (8 bytes) isn't a multiple of its alignment (16 bytes)
 2682 |         PM128A FloatingContext[16];
      |                               ^
1 error generated.
make[2]: *** [pal/src/CMakeFiles/Chakra.Pal.dir/build.make:76: pal/src/CMakeFiles/Chakra.Pal.dir/cruntime/file.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:648: pal/src/CMakeFiles/Chakra.Pal.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Do you know what compiler you were using?

@rhuanjl
Collaborator Author

rhuanjl commented Apr 18, 2024

I was able to build it with clang14, and it doesn't appear to be crashing now.

🎉

Thank you for checking; sorry we took far too long on this - we had several failed attempts at finding a fix and had pretty much parked working on CC - hoping we may be resuming again but will see...

@Eggbertx

Eggbertx commented Apr 20, 2024

Do you know what compiler you were using?

The version of clang in my PATH is 17.0.6, assuming running build.sh with no command line arguments uses clang and not gcc (which I'm guessing CMake would use by default).

@rhuanjl
Collaborator Author

rhuanjl commented Apr 20, 2024

Do you know what compiler you were using?

The version of clang in my PATH is 17.0.6, assuming running build.sh with no command line arguments uses clang and not gcc (which I'm guessing CMake would use by default).

Hmm, I think I'll close this for now as the issue it was raised for is resolved. Thanks for the help.

We can explore problems with Clang 17 if it affects anyone on an ongoing basis - though the snippet you gave suggests there may be a one line fix there if it really was a Clang 17 problem.
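
For anyone picking this up, here is a sketch of the kind of one-line change that error message points at. It assumes, without having checked pal.h, that the 16-byte alignment attribute ended up on the pointer alias as well as on the struct alias, which is what "8 bytes ... alignment (16 bytes)" suggests. All `_sketch` names are hypothetical stand-ins, not the real PAL declarations.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the unaligned 16-byte struct named in the error.
struct M128U_sketch {
    std::uint64_t Low;
    std::int64_t High;
};

// Presumed problem shape (left as a comment because newer Clang rejects
// arrays of the resulting pointer alias): an alignment attribute written in
// the specifier position applies to every declarator, pointer included.
//   typedef __attribute__((aligned(16))) M128U_sketch M128A_bad, *PM128A_bad;

// Possible fix: attach the attribute to the struct alias only, then declare
// the pointer alias separately so it keeps ordinary pointer alignment.
typedef M128U_sketch M128A_sketch __attribute__((aligned(16)));
typedef M128A_sketch* PM128A_sketch;

static_assert(sizeof(PM128A_sketch) == sizeof(void*),
              "pointer alias has plain pointer size");
static_assert(alignof(PM128A_sketch) == alignof(void*),
              "pointer alias has plain pointer alignment");

// An array of the pointer alias is now well-formed.
struct Context_sketch {
    PM128A_sketch FloatingContext[16];
};
```

Whether this matches the actual declaration at pal.h:2682 would need checking against the source before treating it as the fix.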

(I was wondering if the script had failed to find Clang and somehow defaulted to gcc - which has never been supported)
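
One quick way to settle which compiler a build actually used is to compile a tiny translation unit with the same toolchain and print the predefined compiler macros. A minimal sketch (the function name `CompilerId_sketch` is hypothetical; the macros are the standard ones Clang and GCC define):

```cpp
#include <cassert>
#include <string>

// Reports which compiler built this translation unit, based on predefined
// macros. Clang also defines __GNUC__, so __clang__ must be checked first.
inline std::string CompilerId_sketch() {
#if defined(__clang__)
    return "clang " + std::to_string(__clang_major__) + "." +
           std::to_string(__clang_minor__);
#elif defined(__GNUC__)
    return "gcc " + std::to_string(__GNUC__) + "." +
           std::to_string(__GNUC_MINOR__);
#else
    return "unknown";
#endif
}
```

Printing the result from a small `main` built through the same CMake configuration would show whether the script picked up Clang or fell back to something else.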

@rhuanjl rhuanjl closed this as completed Apr 20, 2024