Allocation after libc::fork on Android #85261
Comments
@rustbot modify labels +O-android +T-libs-impl |
|
My first guesses were right. In #81858 (comment) where I changed the stunt allocator to use
|
I found a memory allocation with a patched Miri:
With this code:
#![feature(panic_always_abort)]
use std::panic;

fn run(do_panic: &dyn Fn()) {
    panic::always_abort();
    do_panic();
}

fn main() {
    run(&|| panic!());
}
|
Apparently, the thread locals aren't working as expected. I think the child after fork is being treated as a new thread, and then the thread local storage has to be re-created. I think the surviving thread in the fork should inherit all the thread-locals from the forking thread in the parent. |
But I just tested the code after |
Oh! Yes. You are right. Evidently the first use of this thread-local allocates. Presumably that would happen on a new thread as well as on the initial thread. Having panic rely on thread-locals, and thread-locals rely on the allocator, seems unfortunate. Having panic allocate is not good, especially when actually it's just trying to abort precisely to avoid this. |
Since |
It can't do that and also print the payload, because printing the payload might itself panic, and the recursive panic needs to be detected. |
Maybe instead there should be a |
I think it might be possible to use a |
Do |
The rust/library/std/src/thread/local.rs Line 703 in b439be0
Maybe the panic count could be stored directly in OsStaticKey rather than on the heap.
But is |
Should this only occur on Android? It happened on QNX as well (not yet upstreamed, work in progress so I don't trust it yet), but I can also reproduce it on
diff --git a/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs b/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs
index 956be0353fa..51eec0a3a10 100644
--- a/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs
+++ b/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs
@@ -13,6 +13,7 @@ pub fn target() -> Target {
             | SanitizerSet::LEAK
             | SanitizerSet::MEMORY
             | SanitizerSet::THREAD;
+    base.has_thread_local = false;
 
     Target {
         llvm_target: "x86_64-unknown-linux-gnu".into(),
Would this be a target spec where we would expect |
Interesting. To be honest, I'm not sure. I do know that modern libc doctrine is that allocation after fork is UB. The test case is checking that we can safely panic after fork, and abort, without UB. This ought to work due to I think that every failure of this test case is a real bug. Solving those bugs is not so easy because the machinery inside panic uses thread-local state which apparently is really super hard to make work without allocating or taking locks. |
Yes, allocating memory is UB, same as working with mutexes. In case our setups are okay (i.e. having Maybe we don't need to fix the issue with the thread-local state (if at all possible):
==> In case we have two panics going on in parallel by two threads there is the chance that we don't output
A change could look like that (high probability I've missed something!):
diff --git a/library/std/src/panicking.rs b/library/std/src/panicking.rs
index 4b07b393a2f..c0323fb4d07 100644
--- a/library/std/src/panicking.rs
+++ b/library/std/src/panicking.rs
@@ -319,14 +319,18 @@ pub mod panic_count {
     static GLOBAL_PANIC_COUNT: AtomicUsize = AtomicUsize::new(0);
 
     pub fn increase() -> (bool, usize) {
-        (
-            GLOBAL_PANIC_COUNT.fetch_add(1, Ordering::Relaxed) & ALWAYS_ABORT_FLAG != 0,
+        let global_count = GLOBAL_PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
+        let must_abort = global_count & ALWAYS_ABORT_FLAG != 0;
+        let panics = if must_abort {
+            global_count & !ALWAYS_ABORT_FLAG
+        } else {
             LOCAL_PANIC_COUNT.with(|c| {
                 let next = c.get() + 1;
                 c.set(next);
                 next
-            }),
-        )
+            })
+        };
+        (must_abort, panics)
     }
 
     pub fn decrease() {
This change passes all tests of |
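For readers who want to experiment with the idea in the diff above, here is a minimal, self-contained sketch of the same counting logic. It is not the actual `std` code: the global counter and the thread-local counter are passed in as parameters (a plain `Cell` stands in for the thread-local) so the behaviour can be checked deterministically, and `set_always_abort` is a hypothetical helper standing in for whatever `panic::always_abort()` does internally. The position of `ALWAYS_ABORT_FLAG` in the top bit is also an assumption made for this sketch.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Assumed layout for this sketch: the top bit of the global counter marks
/// "always abort"; the remaining bits count panics.
const ALWAYS_ABORT_FLAG: usize = 1 << (usize::BITS - 1);

/// Hypothetical helper: marks the process as "always abort" by setting the flag.
fn set_always_abort(global: &AtomicUsize) {
    global.fetch_or(ALWAYS_ABORT_FLAG, Ordering::Relaxed);
}

/// Mirrors the shape of the proposed `increase()`: once the flag is set, the
/// per-thread counter (`local`) is never touched, so no thread-local access
/// and therefore no lazy thread-local allocation can happen on this path.
fn increase(global: &AtomicUsize, local: &Cell<usize>) -> (bool, usize) {
    // `fetch_add` returns the value *before* the increment.
    let global_count = global.fetch_add(1, Ordering::Relaxed);
    let must_abort = global_count & ALWAYS_ABORT_FLAG != 0;
    let panics = if must_abort {
        // Process-wide count with the flag bit masked off. It cannot tell a
        // recursive panic from a concurrent panic on another thread.
        global_count & !ALWAYS_ABORT_FLAG
    } else {
        let next = local.get() + 1;
        local.set(next);
        next
    };
    (must_abort, panics)
}

fn main() {
    let global = AtomicUsize::new(0);
    let local = Cell::new(0);

    // Normal path: the per-thread counter is used.
    assert_eq!(increase(&global, &local), (false, 1));

    // After the flag is set, only the global atomic is consulted.
    set_always_abort(&global);
    assert_eq!(increase(&global, &local), (true, 1));
    assert_eq!(local.get(), 1); // the per-thread counter was left alone
}
```

Note that on the abort path the returned count is the pre-increment global value, which is what the diff computes; whether that off-by-one relative to the thread-local path matters depends on how the caller interprets the count.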
@ijackson Would you please try to test the fix above to check if it solves the issue on Android as well? |
On most Unix it's UB to touch TLS after fork. Even if we have static TLS, accessing it might cause
|
…ter_fork, r=thomcc
Prevent UB in child process after calling libc::fork

After calling libc::fork, the child process tried to access a TLS variable when processing a panic. This caused a memory allocation which is UB in the child. To prevent this from happening, the panic handler will not access the TLS variable in case `panic::always_abort` was called before.

Fixes rust-lang#85261 (not only on Android systems, but also on Linux/QNX with TLS disabled, see issue for more details)

Main drawbacks of this fix:
* Panic messages can incorrectly omit `core::panic::PanicInfo` struct in case several panics (of multiple threads) occur at the same time. The handler cannot distinguish between multiple panics in different threads or recursive ones in the same thread, but the message will contain a hint about the uncertainty.
* `panic_count::increase()` will be a bit slower as it has an additional `if`, but this should be irrelevant as it is only called in case of a panic.
I am trying to fix it so that Rust's stdlib prevents unwinding, or allocating, in the child, after a fork on Unix (including in `Command`). That is #81858. (Allocation after fork of a multithreaded program is UB in several libcs.)
I added a new test case, https://github.com/rust-lang/rust/blob/8220f2f2127b9aec972163ded97be7d8cff6b9a8/src/test/ui/process/process-panic-after-fork.rs https://github.com/rust-lang/rust/blob/6369637a192bbd0a2fbf8084345ddb7c099aa460/src/test/ui/process/process-panic-after-fork.rs
Unfortunately this test fails, but just on Android: #81858 (comment)
I have few good theories as to why. I wrote some speculations: #81858 (comment)
I think this probably needs attention from an Android expert to try to repro and fix this issue. I suspect it's a problem with the library rather than the tests. The worst case is that it might be a general UB bug in Android Rust programs using `libc::fork` or `Command`.
I'm filing this issue here to try to ask for help again, since writing in #81858 doesn't seem like a particularly good way of getting the attention of Android folks.
If we can't get a resolution, reluctantly, I guess I will disable that test on Android so that my MR can go through. The current situation is quite a hazard (see eg #79740 "panic! in Command child forked from non-main thread results in exit status 0")
Technical discussion
I will try to explain what the test does and what the symptoms seem to mean:
The test file has a custom global allocator, whose purpose is to spot allocations in the child after fork. That global allocator has an atomic variable which is supposed to contain either zero (initially, meaning it's not engaged yet) or the process's pid. Whenever an allocator method is called, we read the atomic and, if it is not zero, we check it against `process::id()`. If it doesn't match we `libc::raise(libc::SIGUSR1)` (previously `SIGTRAP`).
The test enters `main`, and engages the stunt allocator, recording the program's pid. Each call to `expect_aborted` (which is called from `run` and therefore from `one`) produces output from `dbg!(status)`. We see only one of these, so this must be the first test, `one(&|| panic!())`.
The test uses `libc::fork` to fork. In the child, it calls `panic::always_abort()` (my new function to disable panic unwinding). It then panics (using the provided closure). This ought to result in the program dying with `SIGABRT` (or maybe `SIGILL` or `SIGTRAP`).
The parent collects the child's exit status. For the first test case, we run `expect_aborted`. This extracts the signal number from it and checks that it is as expected. On other systems this works.
In the failing run, the assertion on `signal` fails. Meaning, the child did die of a signal but the signal number wasn't the one expected. The previous debug print shows that the raw wait status (confusingly described by Rust stdlib as an "exit status") is `10` (previously `5`). Usually, a bare number like that in a wait status is a signal number, and indeed that seems to be the case here since `status.signal()` is `Some(...)`. On Linux (and most Unices), `5` is `SIGTRAP` and `10` is `SIGUSR1`.
I.e., it seems that the child tried to allocate memory, despite my efforts to make sure that panicking does not involve allocation. Weirdly, a more-portable test case which uses `Command` and does not insist on specific signal numbers passes.
It's definitely my stunt allocator which is tripping here, because when I changed it to use `SIGUSR` instead of `SIGTRAP`, the failing test case signal number changed too.
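
To make the mechanism described above concrete, here is a rough sketch of a pid-checking "stunt" allocator, together with how a bare wait status decodes to a signal number. This is not the code from the test file: the names `PidChecking`, `is_foreign`, `engage` and `signal_of` are made up for illustration, `raise` is declared directly rather than taken from the `libc` crate, and the `SIGUSR1 = 10` constant assumes Linux on x86 or ARM. It also assumes a reasonably recent toolchain (for the `unsafe extern` block syntax) and a Unix target.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::ffi::c_int;
use std::os::unix::process::ExitStatusExt;
use std::process::ExitStatus;
use std::sync::atomic::{AtomicU32, Ordering};

unsafe extern "C" {
    // Standard C `raise`, declared here so the sketch needs no external crate.
    fn raise(sig: c_int) -> c_int;
}

/// Assumption: the value of SIGUSR1 on Linux x86/ARM. Other platforms differ.
const SIGUSR1: c_int = 10;

/// Hypothetical allocator wrapper: 0 means "not engaged", otherwise the pid
/// of the process that engaged it.
struct PidChecking {
    engaged_pid: AtomicU32,
}

#[global_allocator]
static ALLOCATOR: PidChecking = PidChecking { engaged_pid: AtomicU32::new(0) };

/// True if the allocator is engaged and is being used by a different process
/// (for example, a forked child).
fn is_foreign(engaged: u32, current: u32) -> bool {
    engaged != 0 && engaged != current
}

impl PidChecking {
    fn engage(&self) {
        self.engaged_pid.store(std::process::id(), Ordering::SeqCst);
    }

    fn check(&self) {
        let engaged = self.engaged_pid.load(Ordering::SeqCst);
        if is_foreign(engaged, std::process::id()) {
            // Die with a distinctive signal so the parent can spot it.
            unsafe { raise(SIGUSR1) };
        }
    }
}

unsafe impl GlobalAlloc for PidChecking {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.check();
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        self.check();
        unsafe { System.dealloc(ptr, layout) }
    }
}

/// Decodes a raw wait status the way `status.signal()` does in the test.
fn signal_of(raw_wait_status: i32) -> Option<i32> {
    ExitStatus::from_raw(raw_wait_status).signal()
}

fn main() {
    ALLOCATOR.engage();
    // Allocating in the process that engaged the allocator is fine.
    let v = vec![1u8, 2, 3];
    assert_eq!(v.len(), 3);

    // A bare small number in a wait status is a terminating signal number.
    assert_eq!(signal_of(10), Some(10)); // SIGUSR1 on Linux
    assert_eq!(signal_of(5), Some(5)); // SIGTRAP on Linux
    assert_eq!(signal_of(0), None); // exited normally with status 0
}
```

Using a signal that a panic-abort would not normally produce is what lets the parent tell "the child allocated" apart from "the child aborted as intended".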