
Allocation after libc::fork on Android #85261

Closed
ijackson opened this issue May 13, 2021 · 18 comments · Fixed by #102460
Labels
A-process Area: `std::process` and `std::env` A-thread-locals Area: Thread local storage (TLS) C-bug Category: This is a bug. O-android Operating system: Android T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@ijackson
Contributor

ijackson commented May 13, 2021

I am trying to fix it so that Rust's stdlib prevents unwinding, or allocating, in the child, after a fork on Unix (including in Command). That is #81858. (Allocation after fork of a multithreaded program is UB in several libcs.)

I added a new test case: https://github.com/rust-lang/rust/blob/6369637a192bbd0a2fbf8084345ddb7c099aa460/src/test/ui/process/process-panic-after-fork.rs (an earlier revision is at https://github.com/rust-lang/rust/blob/8220f2f2127b9aec972163ded97be7d8cff6b9a8/src/test/ui/process/process-panic-after-fork.rs). Unfortunately this test fails, but only on Android: #81858 (comment)

I have few good theories as to why. I wrote some speculations: #81858 (comment)

I think this probably needs attention from an Android expert to try to repro and fix this issue. I suspect it's a problem with the library rather than the tests. The worst case is that it might be a general UB bug in Android Rust programs using libc::fork or Command.

I'm filing this issue here to try to ask for help again, since writing in #81858 doesn't seem like a particularly good way of getting the attention of Android folks.

If we can't get a resolution, reluctantly, I guess I will disable that test on Android so that my MR can go through. The current situation is quite a hazard (see eg #79740 "panic! in Command child forked from non-main thread results in exit status 0")

Technical discussion

I will try to explain what the test does and what the symptoms seem to mean:

The test file has a custom global allocator, whose purpose is to spot allocations in the child after fork. That global allocator has an atomic variable which is supposed to contain either zero (initially, meaning it's not engaged yet) or the process's pid. Whenever an allocator method is called, we read the atomic and, if it is not zero, we check it against process::id(). If it doesn't match, we libc::raise(libc::SIGTRAP) (later changed to SIGUSR1).

The test enters main, and engages the stunt allocator, recording the program's pid. Each call to expect_aborted (which is called from run and therefore from one) produces output from dbg!(status). We see only one of these, so this must be the first test, one(&|| panic!()).

The test uses libc::fork to fork. In the child, it calls panic::always_abort() (my new function to disable panic unwinding). It then panics (using the provided closure). This ought to result in the program dying with SIGABRT (or maybe SIGILL or SIGTRAP).

The parent collects the child's exit status. For the first test case, we run expect_aborted. This extracts the signal number from it and checks that it is as expected. On other systems this works.

In the failing test, this assertion on the signal fails. Meaning, the child did die of a signal, but the signal number wasn't the one expected. The preceding debug print shows that the raw wait status (confusingly described by Rust stdlib as an "exit status") is 5 (and 10 after the allocator was changed to raise SIGUSR1; see below). Usually, a bare number like that in a wait status is a signal number, and indeed that seems to be the case here since status.signal() is Some(...). On Linux (and most Unices), 5 is SIGTRAP and 10 is SIGUSR1.

Ie, it seems that the child tried to allocate memory, despite my efforts to make sure that panicking does not involve allocation. Weirdly, a more-portable test case which uses Command and does not insist on specific signal numbers passes.

It's definitely my stunt allocator which is tripping here, because when I changed it to use SIGUSR1 instead of SIGTRAP, the signal number in the failing test case changed too.

@ijackson ijackson added the C-bug Category: This is a bug. label May 13, 2021
@ijackson
Contributor Author

@rustbot modify labels +O-android +T-libs-impl

@rustbot rustbot added O-android Operating system: Android T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 13, 2021
@ijackson ijackson changed the title Alocation after libc::fork on Android Allocation after libc::fork on Android May 13, 2021
@ijackson
Contributor Author

ijackson commented May 13, 2021

@hyd-dev provided an explanation here #81858 (comment)

I still think there is a bug here but it is not Android-specific. I have a WIP branch to try to address it, so I think it's probably best if I close this issue and submit that as an MR when it's ready.

@ijackson
Contributor Author

My first guesses were right. In #81858 (comment) where I changed the stunt allocator to use SIGUSR1, we see this:

2021-05-14T04:09:46.9170083Z [/checkout/src/test/ui/process/process-panic-after-fork.rs:79] status = ExitStatus(
2021-05-14T04:09:46.9171632Z     ExitStatus(
2021-05-14T04:09:46.9172360Z         10,
2021-05-14T04:09:46.9172996Z     ),
2021-05-14T04:09:46.9173607Z )

@ijackson ijackson reopened this May 14, 2021
@ghost

ghost commented May 14, 2021

I found a memory allocation with a patched Miri:

note: tracking was triggered
  --> /home/hyd-dev/Code/rust-lang/rust/library/alloc/src/alloc.rs:86:14
   |
86 |     unsafe { __rust_alloc(layout.size(), layout.align()) }
   |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ rust memory allocated
   |
   = note: inside `std::alloc::alloc` at /home/hyd-dev/Code/rust-lang/rust/library/alloc/src/alloc.rs:86:14
   = note: inside `std::alloc::Global::alloc_impl` at /home/hyd-dev/Code/rust-lang/rust/library/alloc/src/alloc.rs:166:73
   = note: inside `<std::alloc::Global as std::alloc::Allocator>::allocate` at /home/hyd-dev/Code/rust-lang/rust/library/alloc/src/alloc.rs:226:9
   = note: inside `alloc::alloc::exchange_malloc` at /home/hyd-dev/Code/rust-lang/rust/library/alloc/src/alloc.rs:316:11
   = note: inside `std::thread::__OsLocalKeyInner::<std::cell::Cell<usize>>::try_initialize` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/thread/local.rs:703:42
   = note: inside `std::thread::__OsLocalKeyInner::<std::cell::Cell<usize>>::get` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/thread/local.rs:685:22
   = note: inside `std::rt::panic_count::LOCAL_PANIC_COUNT::__getit` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/thread/local.rs:287:26
   = note: inside `std::thread::LocalKey::<std::cell::Cell<usize>>::try_with::<[closure@std::rt::panic_count::increase::{closure#0}], usize>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/thread/local.rs:375:32
   = note: inside `std::thread::LocalKey::<std::cell::Cell<usize>>::with::<[closure@std::rt::panic_count::increase::{closure#0}], usize>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/thread/local.rs:352:9
   = note: inside `std::rt::panic_count::increase` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panicking.rs:263:13
   = note: inside `std::panicking::rust_panic_with_hook` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panicking.rs:588:32
   = note: inside closure at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panicking.rs:542:9
   = note: inside `std::sys_common::backtrace::__rust_end_short_backtrace::<[closure@std::rt::begin_panic<&str>::{closure#0}], !>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/sys_common/backtrace.rs:141:18
   = note: inside `std::rt::begin_panic::<&str>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panicking.rs:541:12
note: inside closure at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panic.rs:28:9
  --> src/main.rs:11:13
   |
11 |     run(&|| panic!());
   |             ^^^^^^^^
note: inside `run` at src/main.rs:7:5
  --> src/main.rs:7:5
   |
7  |     do_panic();
   |     ^^^^^^^^^^
note: inside `main` at src/main.rs:11:5
  --> src/main.rs:11:5
   |
11 |     run(&|| panic!());
   |     ^^^^^^^^^^^^^^^^^
   = note: inside `<fn() as std::ops::FnOnce<()>>::call_once - shim(fn())` at /home/hyd-dev/Code/rust-lang/rust/library/core/src/ops/function.rs:227:5
   = note: inside `std::sys_common::backtrace::__rust_begin_short_backtrace::<fn(), ()>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/sys_common/backtrace.rs:125:18
   = note: inside closure at /home/hyd-dev/Code/rust-lang/rust/library/std/src/rt.rs:49:18
   = note: inside `std::ops::function::impls::<impl std::ops::FnOnce<()> for &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>::call_once` at /home/hyd-dev/Code/rust-lang/rust/library/core/src/ops/function.rs:259:13
   = note: inside `std::panicking::r#try::do_call::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panicking.rs:401:40
   = note: inside `std::panicking::r#try::<i32, &dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panicking.rs:365:19
   = note: inside `std::panic::catch_unwind::<&dyn std::ops::Fn() -> i32 + std::marker::Sync + std::panic::RefUnwindSafe, i32>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/panic.rs:433:14
   = note: inside `std::rt::lang_start_internal` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/rt.rs:34:21
   = note: inside `std::rt::lang_start::<()>` at /home/hyd-dev/Code/rust-lang/rust/library/std/src/rt.rs:48:5
   = note: this note originates in the macro `$crate::panic::panic_2015` (in Nightly builds, run with -Z macro-backtrace for more info)

With this code:

#![feature(panic_always_abort)]

use std::panic;

fn run(do_panic: &dyn Fn()) {
    panic::always_abort();
    do_panic();
}

fn main() {
    run(&|| panic!());
}

@ijackson
Contributor Author

Apparently, the thread locals aren't working as expected. I think the child after fork is being treated as a new thread, and then the thread local storage has to be re-created. I think the surviving thread in the fork should inherit all the thread-locals from the forking thread in the parent.

@ghost

ghost commented May 14, 2021

Apparently, the thread locals aren't working as expected. I think the child after fork is being treated as a new thread, and then the thread local storage has to be re-created. I think the surviving thread in the fork should inherit all the thread-locals from the forking thread in the parent.

But I just tested the code after fork(), so there's no "new thread" or "parent" here...

@ijackson
Contributor Author

Oh! Yes. You are right. Evidently the first use of this thread-local allocates. Presumably that would happen on a new thread as well as on the initial thread.

Having panic rely on thread-locals, and thread-locals rely on the allocator, seems unfortunate. Having panic allocate is not good, especially when actually it's just trying to abort precisely to avoid this.

@ghost

ghost commented May 14, 2021

Since ALWAYS_ABORT_FLAG is stored in GLOBAL_PANIC_COUNT which is not a thread-local, maybe the program could abort before using any thread-local.

@ijackson
Contributor Author

It can't do that and also print the payload, because printing the payload might itself panic, and the recursive panic needs to be detected.

@ijackson
Contributor Author

Maybe instead there should be a thread::LocalKey::maybe_try_with which can return None to mean "this thread-local variable has not been initialised".

@Aaron1011
Member

I think it might be possible to use a const thread local to avoid the allocation.

@RalfJung
Member

Do const thread-locals always avoid the allocation? I think they fall back to regular thread-locals on platforms where avoiding allocations is not (easily) possible.

@ghost

ghost commented May 15, 2021

The const version of thread_local! looks basically the same as the non-const version when using __OsLocalKeyInner, which will likely allocate here:

let ptr: Box<Value<T>> = box Value { inner: LazyKeyInner::new(), key: self };

Maybe the panic count could be stored directly in OsStaticKey rather than on the heap.

But is OsStaticKey itself safe after fork()? If it tries to hold a lock or something it's possible to deadlock.

@the8472 the8472 added A-process Area: `std::process` and `std::env` A-thread-locals Area: Thread local storage (TLS) labels Feb 8, 2022
@flba-eb
Contributor

flba-eb commented Sep 28, 2022

Should this only occur on Android? It happened on QNX as well (not yet upstreamed; still work in progress, so I don't fully trust it yet), but I can also reproduce it on x86_64-unknown-linux-gnu with a simple patch (which disables native thread-local storage):

diff --git a/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs b/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs
index 956be0353fa..51eec0a3a10 100644
--- a/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs
+++ b/compiler/rustc_target/src/spec/x86_64_unknown_linux_gnu.rs
@@ -13,6 +13,7 @@ pub fn target() -> Target {
         | SanitizerSet::LEAK
         | SanitizerSet::MEMORY
         | SanitizerSet::THREAD;
+    base.has_thread_local = false;
 
     Target {
         llvm_target: "x86_64-unknown-linux-gnu".into(),

Would this be a target spec where we would expect src/test/ui/process/process-panic-after-fork.rs to succeed?

@ijackson
Contributor Author

Interesting. To be honest, I'm not sure.

I do know that modern libc doctrine is that allocation after fork is UB. The test case is checking that we can safely panic after fork, and abort, without UB. This ought to work due to panic::always_abort().

I think that every failure of this test case is a real bug. Solving those bugs is not so easy because the machinery inside panic uses thread-local state which apparently is really super hard to make work without allocating or taking locks.

@flba-eb
Contributor

flba-eb commented Sep 28, 2022

Yes, allocating memory is UB, same as working with mutexes. In case our setups are okay (i.e. having has_thread_local=false and using fork etc.), I also see this is a bug.

Maybe we don't need to fix the issue with the thread-local state (if at all possible):

  1. panicking::rust_panic_with_hook calls panic_count::increase() to get the number of recursive panics and whether we must_abort.
  2. panic_count::increase():
    • determines must_abort using GLOBAL_PANIC_COUNT: AtomicUsize --> works
    • determines the number of recursive panics using TLS --> allocates after fork --> UB
  3. panicking::rust_panic_with_hook checks must_abort first:
    • If it is true, it will always call libc::abort -- the process will be killed very soon by the OS
    • In case of a recursion, a hardcoded string is printed; otherwise additionally the PanicInfo
    • Do we really need to correctly determine a recursion, or can we allow "false positives"? Instead of having UB, I think it would be fine to only check the overall number of current panic calls (as stored in GLOBAL_PANIC_COUNT, ignoring ALWAYS_ABORT_FLAG). If it is higher than 2, assume we have a recursion (this yields false positives when two or more threads panic at the same time).

==> If two or more threads panic in parallel, there is a chance that we don't output the PanicInfo, but it prevents UB. Also, when one thread calls abort, the second thread will often be killed soon after and output nothing at all. In other cases (only one thread panics), there should be no difference.

A change could look like that (high probability I've missed something!):

diff --git a/library/std/src/panicking.rs b/library/std/src/panicking.rs
index 4b07b393a2f..c0323fb4d07 100644
--- a/library/std/src/panicking.rs
+++ b/library/std/src/panicking.rs
@@ -319,14 +319,18 @@ pub mod panic_count {
     static GLOBAL_PANIC_COUNT: AtomicUsize = AtomicUsize::new(0);
 
     pub fn increase() -> (bool, usize) {
-        (
-            GLOBAL_PANIC_COUNT.fetch_add(1, Ordering::Relaxed) & ALWAYS_ABORT_FLAG != 0,
+        let global_count = GLOBAL_PANIC_COUNT.fetch_add(1, Ordering::Relaxed);
+        let must_abort = global_count & ALWAYS_ABORT_FLAG != 0;
+        let panics = if must_abort {
+            global_count & !ALWAYS_ABORT_FLAG
+        } else {
             LOCAL_PANIC_COUNT.with(|c| {
                 let next = c.get() + 1;
                 c.set(next);
                 next
-            }),
-        )
+            })
+        };
+        (must_abort, panics)
     }
 
     pub fn decrease() {

This change passes all tests of src/test/ui for targets x86_64-unknown-linux-gnu and QNX on AArch64.

@flba-eb
Contributor

flba-eb commented Sep 29, 2022

@ijackson Would you please try to test the fix above to check if it solves the issue on Android as well?

@thomcc
Member

thomcc commented Sep 29, 2022

On most Unixes it's UB to touch TLS after fork. Even if we have static TLS, accessing it might cause __tls_get_addr [1] to be called, and that function has paths where it allocates memory, takes locks, etc. That is, this is probably not an issue specific to QNX/Android.

Footnotes

  1. At least on ELF, although I believe this is broadly true for Mach-O as well.

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Sep 29, 2022
…ter_fork, r=thomcc

Prevent UB in child process after calling libc::fork

After calling libc::fork, the child process tried to access a TLS variable when processing a panic. This caused a memory allocation which is UB in the child.
To prevent this from happening, the panic handler will not access the TLS variable in case `panic::always_abort` was called before.

Fixes rust-lang#85261 (not only on Android systems, but also on Linux/QNX with TLS disabled, see issue for more details)

Main drawbacks of this fix:
* Panic messages can incorrectly omit `core::panic::PanicInfo` struct in case several panics (of multiple threads) occur at the same time. The handler cannot distinguish between multiple panics in different threads or recursive ones in the same thread, but the message will contain a hint about the uncertainty.
* `panic_count::increase()` will be a bit slower as it has an additional `if`, but this should be irrelevant as it is only called in case of a panic.
@bors bors closed this as completed in 50f6d33 Oct 12, 2022
Aaron1011 pushed a commit to Aaron1011/rust that referenced this issue Jan 6, 2023 (same commit message as above)