cargo test with musl hangs occasionally #2144
Comments
This is the strace for a run that doesn't hang:
Note that 1344477 is the test process waiting on 1344478 to exit. 1344477 then receives a SIGCHLD signal and wakes up from the wait.
Reproduced one more time, but this time it looks a bit different. I used gdb to see which futex is invoked. This time, two fork test cases are stuck.
Looks like
Removed the
Looks like os.exit caused a futex wait.
I think I have a better understanding of this now... There are a number of issues in play. It does have to do with the difference between At current main before #2121, we have a number of issues that I highly suspect have to do with using the kernel syscall clone3 directly without a proper libc wrapper. As the comments above mention,
Now, with #2121 and forcing the same test to use the clone fallback, we get past all these issues and hang in
Now, after some more research, it seems there is a whole list of things that are unsafe in the child process after forking from a multi-threaded parent. Based on my understanding, which may not be 100% correct or complete, POSIX in theory only guarantees that the exec family of calls is safe after a fork/clone. Calling anything else is considered undefined behavior. In practice, there is a list of calls that are safe (the async-signal-safe ones). And these calls do not include The free comes from the function we pass into the clone call:
The
We use Note, in real
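The constraint described in this comment can be illustrated with a minimal sketch. It is not the project's code: `fork_and_exit` is a hypothetical helper, and the `extern "C"` declarations are written by hand (assuming a Unix target where `pid_t` is a 32-bit integer and the Linux wait-status encoding) so the example needs no extra crate. The parent is deliberately multi-threaded and allocating while it forks, and the child calls nothing but `_exit`, which is on the async-signal-safe list: no allocation, no `free`, no destructors, no exit handlers.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Hand-written declarations (an assumption of this sketch). On Unix targets
// the Rust standard library already links against libc, which provides them.
unsafe extern "C" {
    fn fork() -> i32;
    fn _exit(status: i32) -> !;
    fn waitpid(pid: i32, status: *mut i32, options: i32) -> i32;
}

/// Forks, lets the child terminate immediately with `code`, and returns the
/// exit code seen by the parent (`None` if the child did not exit normally).
fn fork_and_exit(code: i32) -> Option<i32> {
    let mut status: i32 = 0;
    let pid = unsafe { fork() };
    assert!(pid >= 0, "fork failed");
    if pid == 0 {
        // Child: only an async-signal-safe call. Nothing here allocates,
        // frees, takes a lock, or runs atexit handlers.
        unsafe { _exit(code) };
    }
    // Parent: block until the child has terminated.
    let waited = unsafe { waitpid(pid, &mut status, 0) };
    assert_eq!(waited, pid, "waitpid failed");
    // Same decoding as WIFEXITED / WEXITSTATUS on Linux.
    if (status & 0x7f) == 0 {
        Some((status >> 8) & 0xff)
    } else {
        None
    }
}

fn main() {
    // A background thread keeps allocating, so a fork can land while the
    // allocator is in use by another thread.
    let stop = Arc::new(AtomicBool::new(false));
    let flag = Arc::clone(&stop);
    let worker = thread::spawn(move || {
        while !flag.load(Ordering::Relaxed) {
            black_box(vec![0u8; 64]);
        }
    });

    for _ in 0..100 {
        assert_eq!(fork_and_exit(7), Some(7));
    }

    stop.store(true, Ordering::Relaxed);
    worker.join().unwrap();
    println!("all children exited cleanly");
}
```

If the child instead dropped a `Box` or returned through code that frees memory, it could touch allocator state copied from the parent at an arbitrary moment, which is the kind of hazard this comment describes.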
Temporarily disable musl tests until #2144 is resolved
We are hitting this issue in runwasi with glibc.
Closing for now, as #2685 seems to be a good solution. Will reopen if the issue persists.
We observed in the field that unit tests with `musl` hang from time to time. This happens fairly often, but not every time. Several CI runs now show this behavior, and I am able to reproduce it locally with:

The test that hangs is a simple test where we fork/clone3 a process with a function that returns an error right away. The test itself runs correctly in the normal unit test with glibc. This seems to happen only with musl.
I was able to reproduce this locally; it happens once in a while. I see 3 processes related to cargo test at the time of the hang:
Specifically, 1325757 seems to be the actual unit test process. The process is correctly calling `waitpid`:

1325762 is the forked process and, interestingly, it is stuck in a futex:
I don't have definitive proof, but I highly suspect that this is related to how we use the `clone3` kernel syscall directly to create processes. As mentioned in #2030 (comment), libc does extra bookkeeping around the fork and clone syscalls when creating processes. Using the kernel syscall directly bypasses this bookkeeping and may result in undefined behavior.

I will do a bit more digging and see if I can find more proof.
In the meantime, restarting the `musl` test seems to let it pass. If it happens too often, we can turn the test off until this is resolved. If `clone3` turns out to be the issue, once #2121 merges, I can force musl into the fallback path with a feature flag.
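One way such a switch could look is sketched here. This is only an illustration of the idea, not what #2121 implements: the `CloneStrategy` enum, the `pick_strategy` and `default_strategy` functions, and the `force_clone_fallback` feature name are all hypothetical. Only `cfg!(target_env = "musl")` and `cfg!(feature = "...")` are standard Rust.

```rust
/// Hypothetical choice between the two process-creation paths.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CloneStrategy {
    /// Call the `clone3` syscall directly.
    Clone3,
    /// Use the older clone-based fallback path.
    CloneFallback,
}

/// Pure decision logic, kept separate so it can be tested on any target.
fn pick_strategy(is_musl: bool, force_fallback: bool) -> CloneStrategy {
    if is_musl || force_fallback {
        CloneStrategy::CloneFallback
    } else {
        CloneStrategy::Clone3
    }
}

/// Compile-time defaults: musl targets, or builds with the (hypothetical)
/// `force_clone_fallback` Cargo feature enabled, take the fallback path.
#[allow(unexpected_cfgs)]
fn default_strategy() -> CloneStrategy {
    pick_strategy(
        cfg!(target_env = "musl"),
        cfg!(feature = "force_clone_fallback"),
    )
}

fn main() {
    println!("strategy for this build: {:?}", default_strategy());
}
```

Keeping the decision in a plain function means the selection can be unit-tested without actually building for musl.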