Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock: Command::spawn is trying to dlsym after fork, but linker mutex is locked #88585

Closed
hacker-volodya opened this issue Sep 2, 2021 · 3 comments · Fixed by #88828
Closed
Labels
C-bug Category: This is a bug.

Comments

@hacker-volodya
Copy link

There is a problem with Command::spawn on Android. When another thread interacts with system linker, Command::spawn causes a deadlock in the child process between fork and exec. It happens because under the hood Android linker uses global mutex g_dl_mutex (link to android source) and not doing pthread_atfork() to unlock it in child processes.

What exactly happens:

  1. Command::spawn is called.
  2. Command::spawn actually invokes do_fork, then do_exec is invoked in the child process.
  3. There is a comment in do_exec, which says that a deadlock may occur, so no allocations must appear in do_exec.
  4. Well, at this point we know that any code between fork and exec must not lock mutexes from the parent process, because it may be "forever locked" bacause of fork.
  5. do_exec calls sys::signal. Fine.
  6. Seems interesting that sys::signal is linking actual signal function from libc using dlsym.
  7. At this point linker is trying to lock parent mutex g_dl_mutex, which may be in "forever locked" state, so the child process hangs.

main.rs:

#![no_main]

#[link(name = "dl")]
extern {
    fn dlsym(handle: i32, symbol: i32);
}

/// Implementing here raw main function, because Rust initializers calls `signal` and caches its address, 
/// preventing deadlock. In real life Rust not always compiles to standalone executables, so this case
/// is also applicable to shared libraries. 
#[no_mangle]
pub extern "C" fn main(_argc: isize, _argv: *const *const u8) -> isize {
    let t = std::thread::spawn(|| {
        println!("Locking g_dl_mutex...");
        for _ in 0..10000000 {
            unsafe { dlsym(0, 0); }
        }
        println!("dlsym'ing done");
    });
    std::thread::sleep(std::time::Duration::from_secs(1));
    println!("Spawning ps...");
    let _child = std::process::Command::new("ps").spawn().unwrap();
    println!("ps spawned!");
    t.join().unwrap();
    println!("Done");
    0
}

How to build and run (can run on real device or emulator, to run on emulator replace aarch64 with x86_64):

cargo build --release --target aarch64-linux-android
adb push .\target\aarch64-linux-android\release\linker-deadlock /data/local/tmp/
adb shell chmod +x /data/local/tmp/linker-deadlock
adb shell /data/local/tmp/linker-deadlock

be sure to specify Android NDK toolchain binaries in ~/.cargo/config.toml:

[target.aarch64-linux-android]
ar = "C:\\Users\\user\\AppData\\Local\\Android\\Sdk\\ndk\\20.0.5594570\\toolchains\\llvm\\prebuilt\\windows-x86_64\\bin\\llvm-ar.exe"
linker = "C:\\Users\\user\\AppData\\Local\\Android\\Sdk\\ndk\\20.0.5594570\\toolchains\\llvm\\prebuilt\\windows-x86_64\\bin\\aarch64-linux-android23-clang.cmd"

[target.x86_64-linux-android]
ar = "C:\\Users\\user\\AppData\\Local\\Android\\Sdk\\ndk\\20.0.5594570\\toolchains\\llvm\\prebuilt\\windows-x86_64\\bin\\llvm-ar.exe"
linker = "C:\\Users\\user\\AppData\\Local\\Android\\Sdk\\ndk\\20.0.5594570\\toolchains\\llvm\\prebuilt\\windows-x86_64\\bin\\x86_64-linux-android23-clang.cmd"

Meta

rustc --version --verbose:

rustc 1.54.0 (a178d0322 2021-07-26)
binary: rustc
commit-hash: a178d0322ce20e33eac124758e837cbd80a6f633
commit-date: 2021-07-26
host: x86_64-pc-windows-msvc
release: 1.54.0
LLVM version: 12.0.1
rustc 1.56.0-nightly (50171c310 2021-09-01)
binary: rustc
commit-hash: 50171c310cd15e1b2d3723766ce64e2e4d6696fc
commit-date: 2021-09-01
host: x86_64-pc-windows-msvc
release: 1.56.0-nightly
LLVM version: 13.0.0

@hacker-volodya hacker-volodya added the C-bug Category: This is a bug. label Sep 2, 2021
@hacker-volodya
Copy link
Author

Also, the output of main.rs:

>adb shell /data/local/tmp/linker-deadlock 
Locking g_dl_mutex...
Spawning ps...
dlsym'ing done
[at this moment program hangs]

@hellow554
Copy link
Contributor

Since you seem to have a good understanding on what's going on: do you have an idea how to prevent this issue? Do you know how other languages e.g. Java (JVM in general) deal with this issue?

@hacker-volodya
Copy link
Author

I think we should fetch signal function from linker before the fork and pass it to do_exec, but I don't know how to implement it correctly (without pollution of platform-independent code in process_unix.rs).

Manishearth added a commit to Manishearth/rust that referenced this issue Sep 30, 2021
Use `libc::sigaction()` instead of `sys::signal()` to prevent a deadlock

Fixes rust-lang#88585. POSIX [specifies](https://man7.org/linux/man-pages/man3/fork.3p.html) that after forking,
> to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

Rust's standard library does not currently adhere to this, as evidenced by rust-lang#88585. The child process calls [`sys::signal()`](https://github.com/rust-lang/rust/blob/7bf0736e130e2203c58654f7353dbf9575e49d5c/library/std/src/sys/unix/android.rs#L76), which on Android calls [`libc::dlsym()`](https://github.com/rust-lang/rust/blob/7bf0736e130e2203c58654f7353dbf9575e49d5c/library/std/src/sys/unix/weak.rs#L101), which is [**not**](https://man7.org/linux/man-pages/man7/signal-safety.7.html) async-signal-safe, and in fact causes a deadlock in the example in rust-lang#88585.

I think the easiest solution here would be to just call `libc::sigaction()` instead, which [is](https://man7.org/linux/man-pages/man7/signal-safety.7.html) async-signal-safe, provides the functionality we need, and is apparently available on all Android versions because it is also used e.g. [here](https://github.com/rust-lang/rust/blob/7bf0736e130e2203c58654f7353dbf9575e49d5c/library/std/src/sys/unix/stack_overflow.rs#L112-L114).
Manishearth added a commit to Manishearth/rust that referenced this issue Oct 5, 2021
Use `libc::sigaction()` instead of `sys::signal()` to prevent a deadlock

Fixes rust-lang#88585. POSIX [specifies](https://man7.org/linux/man-pages/man3/fork.3p.html) that after forking,
> to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called.

Rust's standard library does not currently adhere to this, as evidenced by rust-lang#88585. The child process calls [`sys::signal()`](https://github.com/rust-lang/rust/blob/7bf0736e130e2203c58654f7353dbf9575e49d5c/library/std/src/sys/unix/android.rs#L76), which on Android calls [`libc::dlsym()`](https://github.com/rust-lang/rust/blob/7bf0736e130e2203c58654f7353dbf9575e49d5c/library/std/src/sys/unix/weak.rs#L101), which is [**not**](https://man7.org/linux/man-pages/man7/signal-safety.7.html) async-signal-safe, and in fact causes a deadlock in the example in rust-lang#88585.

I think the easiest solution here would be to just call `libc::sigaction()` instead, which [is](https://man7.org/linux/man-pages/man7/signal-safety.7.html) async-signal-safe, provides the functionality we need, and is apparently available on all Android versions because it is also used e.g. [here](https://github.com/rust-lang/rust/blob/7bf0736e130e2203c58654f7353dbf9575e49d5c/library/std/src/sys/unix/stack_overflow.rs#L112-L114).
@bors bors closed this as completed in eb86098 Oct 6, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants