Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"died due to signal 11" in collectionstests on arm-android #55861

Closed
kennytm opened this issue Nov 11, 2018 · 29 comments · Fixed by #70958
Closed

"died due to signal 11" in collectionstests on arm-android #55861

kennytm opened this issue Nov 11, 2018 · 29 comments · Fixed by #70958
Labels
A-github-actions Area: GitHub Actions (GHA) A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC E-help-wanted Call for participation: Help is requested to fix this issue. O-android Operating system: Android P-medium Medium priority T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.

Comments

@kennytm
Copy link
Member

kennytm commented Nov 11, 2018

Symptom: The arm-android test failed with the following messages:

[01:44:38] died due to signal 11
[01:44:38] error: test failed, to rerun pass '--test collectionstests'
[01:44:38] 
[01:44:38] 
[01:44:38] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "test" "--target" "arm-linux-androideabi" "-j" "4" "--release" "--locked" "--color" "always" "--features" "panic-unwind backtrace" "--manifest-path" "/checkout/src/libstd/Cargo.toml" "-p" "alloc" "--"
[01:44:38] expected success, got: exit code: 3
[01:44:38] 
[01:44:38] 
[01:44:38] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test --target arm-linux-androideabi
[01:44:38] Build completed unsuccessfully in 1:32:50

Previous instances:

This might be caused by a mis-optimization like #49775.

@kennytm kennytm added O-android Operating system: Android A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC labels Nov 11, 2018
@alexcrichton
Copy link
Member

Added #55052 (comment) to the list

@kennytm
Copy link
Member Author

kennytm commented Feb 5, 2019

I suspect #56869 (comment) on i686-musl is the same as this issue. Perhaps we should run the collectiontest in miri.

Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x08166877 in unbin (i=8, c=0xa048008) at src/malloc/malloc.c:195
#1  malloc (n=<optimized out>) at src/malloc/malloc.c:322
#2  0x0814ddd9 in alloc () at src/libstd/sys/unix/alloc.rs:11
#3  __rdl_alloc () at src/libstd/alloc.rs:233
#4  0x08127c2d in alloc () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/liballoc/alloc.rs:72
#5  exchange_malloc () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/liballoc/alloc.rs:182
#6  new<core::cell::UnsafeCell<core::option::Option<core::result::Result<(), alloc::boxed::Box<Any>>>>> () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/liballoc/sync.rs:288
#7  spawn_unchecked<closure,()> () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/libstd/thread/mod.rs:458
#8  spawn<closure,()> () at /rustc/b6fdcffc3d9d0b28d3a1fc34c49221ff13617b43/src/libstd/thread/mod.rs:382
#9  test::run_test::run_test_inner::hb0857806220adfa0 () at src/libtest/lib.rs:1450
#10 0x0812645e in test::run_test::h5d2da69c3af16b8e () at src/libtest/lib.rs:1471
#11 0x08120a84 in run_tests<closure> () at src/libtest/lib.rs:1161
#12 test::run_tests_console::hd61c1544d577a32b () at src/libtest/lib.rs:957
#13 0x08119522 in test::test_main::h2355f0b379f819ba () at src/libtest/lib.rs:290
#14 0x08119c07 in test::test_main_static::h90a75711843ac1f7 () at src/libtest/lib.rs:324
#15 0x08106785 in collectionstests::main::h89799a166ae8ba23 ()

The stacktrace doesn't seem meaningful.

@kennytm kennytm added the E-help-wanted Call for participation: Help is requested to fix this issue. label Feb 12, 2019
@jethrogb
Copy link
Contributor

@Mark-Simulacrum
Copy link
Member

We've asked for help on internals: https://internals.rust-lang.org/t/help-wanted-to-debug-a-segfault-in-a-standard-library-test-on-android/9428

@Aaron1011
Copy link
Member

I've managed to reproduce this locally, using the same Android emulator used by CI. I'm working on creating a self-contained script to make it easy to test this locally.

@Aaron1011
Copy link
Member

Aaron1011 commented Feb 14, 2019

I believe I've obtained a backtrace from the emulator:

#0  0x0000662c in ?? ()
#1  0xb6d2c61c in __rust_maybe_catch_panic ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#2  0xb6dc76d0 in test::run_test::run_test_inner::h59a42c89d1b0ae7a ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#3  0xb6dc7368 in test::run_test::hd576cc177e253177 ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#4  0xb6dc49fc in test::run_tests_console::h8730cd4311bcbbcb ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#5  0xb6dc1284 in test::test_main::h343b975aae7c6b90 ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#6  0xb6dc148c in test::test_main_static::hfd242c8df31b861a ()
   from /data/tmp/work/libtest-0c5e281140db6463.so
#7  0xb6e6779c in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hf8fbc2e6d4c74fb5 ()
#8  0xb6d22244 in std::panicking::try::do_call::h49216886d8653049 ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#9  0xb6d2c61c in __rust_maybe_catch_panic ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#10 0xb6d206d4 in std::panic::catch_unwind::h9c8f0a4ad65b2a9d ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#11 0xb6cf8a10 in std::rt::lang_start_internal::hc01076141fd50efb ()
   from /data/tmp/work/libstd-a3b039e1022c4e23.so
#12 0xb6e589e4 in main ()

Interestingly enough, it appears that the process actually hangs in the emulator, rather than dying. This allowed me to install gdb and connect the process after the test had failed.

@Aaron1011
Copy link
Member

I strongly suspect that this is related to the panic_safe test in src/liballoc/tests/slice.rs. However, I'm not certain, as I'm having difficulty narrowing things down to the offending test.

@Aaron1011
Copy link
Member

Aaron1011 commented Feb 14, 2019

Some additional results:

I added the following script as a test (it contains some logic from panic_safe):

use std::cell::Cell;
use std::thread;
use std::panic;

thread_local!(static SILENCE_PANIC: Cell<bool> = Cell::new(false));

#[test]
fn test_panic_hook() {
    let prev = panic::take_hook();
    panic::set_hook(Box::new(move |info| {
        if !SILENCE_PANIC.with(|s| s.get()) {
            prev(info);
        }
    }));

    for i in 0..1000 {
        let _ = thread::spawn(move || {
            SILENCE_PANIC.with(|s| s.set(true));
            panic!("Panicked from thread: {}", i);
        }).join();
    }
    println!("All done!");
}

I then manually uploaded it to the emulator, and invoked it in a loop with the following Bash script (run from /data/tmp/work):

export LD_LIBRARY_PATH='.'
while ./collectionstests-b33d04f51899f1c2 catch_panic; do :; done

After running this for about 15-20 minutes, I stopped the loop. For reasons I don't yet understand, several of the spawned processes has segfaulted, wthout stopping the loop.

I then uploaded a static GDB binary from here to the emulator (this was the only way I could manage to get GDB to work).

I managed to obtain the following backtrace from my test program:

#0  0xb6c855a4 in __futex_syscall3 () from /system/lib/libc.so
#1  0xb6c7768c in __pthread_cond_timedwait_relative ()
   from /system/lib/libc.so
#2  0xb6c776ec in __pthread_cond_timedwait () from /system/lib/libc.so
#3  0xb6c7b726 in pthread_join () from /system/lib/libc.so
#4  0xb6d03810 in std::sys::unix::thread::Thread::join::h07b837055b9f571f ()
   from libstd-a3b039e1022c4e23.so
#5  0xb6e57ccc in _$LT$std..thread..JoinHandle$LT$T$GT$$GT$::join::hd81a7f2f5464521c ()
#6  0xb6e5e944 in collectionstests::catch_panic::test_panic_hook::h9f08377776546cb3 ()
#7  0xb6dc7ca8 in _$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$::call_box::h3970079b2a2e8325 () from libtest-0c5e281140db6463.so
#8  0xb6d3361c in __rust_maybe_catch_panic () from libstd-a3b039e1022c4e23.so
#9  0xb6dce6d0 in test::run_test::run_test_inner::h59a42c89d1b0ae7a ()
   from libtest-0c5e281140db6463.so
#10 0xb6dce368 in test::run_test::hd576cc177e253177 ()
   from libtest-0c5e281140db6463.so
#11 0xb6dcb9fc in test::run_tests_console::h8730cd4311bcbbcb ()
   from libtest-0c5e281140db6463.so
#12 0xb6dc8284 in test::test_main::h343b975aae7c6b90 ()
   from libtest-0c5e281140db6463.so
#13 0xb6dc848c in test::test_main_static::hfd242c8df31b861a ()
   from libtest-0c5e281140db6463.so
#14 0xb6e57424 in std::rt::lang_start::_$u7b$$u7b$closure$u7d$$u7d$::hf8fbc2e6d4c74fb5 ()
#15 0xb6d29244 in std::panicking::try::do_call::h49216886d8653049 ()
   from libstd-a3b039e1022c4e23.so
#16 0xb6d3361c in __rust_maybe_catch_panic () from libstd-a3b039e1022c4e23.so
#17 0xb6d276d4 in std::panic::catch_unwind::h9c8f0a4ad65b2a9d ()
   from libstd-a3b039e1022c4e23.so
#18 0xb6cffa10 in std::rt::lang_start_internal::hc01076141fd50efb ()
   from libstd-a3b039e1022c4e23.so
#19 0xb6e75078 in main ()

From what I can see, the panic hook appears to be executing, and then somehow jumping back into the body of the loop. EDIT: I believe that this is actually showing the normal invocation of my test function.

Note that this was the only thread running. I'm not sure if this is related to the original issue, but it suggets that something weird is going on with the panicking threads.

My current hypothesis is that there's some sort of concurrency and/or codegen issue with the the std::panic::catch_unwind logic (maybe the try compiler intrinsic?).

@pietroalbini pietroalbini added the P-medium Medium priority label Apr 1, 2020
@Amanieu
Copy link
Member

Amanieu commented Apr 1, 2020

@Aaron1011 Do you still have instructions for reproducing this bug locally? I will look into this issue.

@Aaron1011
Copy link
Member

Aaron1011 commented Apr 4, 2020

@Amanieu: Unfortunately, I don't think that I do (I really should have documented what I previously did...)

I'll see if I can reconstruct what I did, and post instructions.

@Aaron1011
Copy link
Member

I ran the liballoc tests in the arm-android container in a loop overnight, and couldn't reproduce the issue.

@pietroalbini: Could you link to some of the failed Github Actions jobs?

@Aaron1011
Copy link
Member

It turns out that I was accidentally running the liballoc test with my local config.toml, which had debug assertions enabled. With the config.toml that gets built by the CI scripts, I was able to reproduce the issue.

@Amanieu: Run crash.sh from this branch. This script runs the liballoc tests in a loop until a crash occurs.

@Aaron1011
Copy link
Member

I was finally able to obtain an android tombstone file: https://gist.github.com/Aaron1011/94e97e3b5a7ba62f024ce4fda8211bdc

This was quite difficult to obtain - the emulator appears to get force-killed when a test fails (the outer Docker contain shuts down with the emulator still running), which caused the tombstone written to /data/tombstones/ to get lost. I needed to modify the remote-test-server tool to not exit when a binary failed, which allowed me to attach to the running Docker container: Aaron1011@ae403a6

@Aaron1011
Copy link
Member

I came across the following code:

#[cfg(any(target_os = "android", target_os = "redox", target_os = "solaris"))]
#[inline]
unsafe fn aligned_malloc(layout: &Layout) -> *mut u8 {
// On android we currently target API level 9 which unfortunately
// doesn't have the `posix_memalign` API used below. Instead we use
// `memalign`, but this unfortunately has the property on some systems
// where the memory returned cannot be deallocated by `free`!
//
// Upon closer inspection, however, this appears to work just fine with
// Android, so for this platform we should be fine to call `memalign`
// (which is present in API level 9). Some helpful references could
// possibly be chromium using memalign [1], attempts at documenting that
// memalign + free is ok [2] [3], or the current source of chromium
// which still uses memalign on android [4].
//
// [1]: https://codereview.chromium.org/10796020/
// [2]: https://code.google.com/p/android/issues/detail?id=35391
// [3]: https://bugs.chromium.org/p/chromium/issues/detail?id=138579
// [4]: https://chromium.googlesource.com/chromium/src/base/+/master/
// /memory/aligned_memory.cc
libc::memalign(layout.align(), layout.size()) as *mut u8
}

I don't know if it's related, but it's something to keep in mind.

@mati865
Copy link
Contributor

mati865 commented Apr 6, 2020

API level 9 means Android 2.3-2.3.3 released 9 years ago which seems overly conservative.
From what I could find posix_memalign was added by API level 16 which is Android 4.1.x released almost 8 years ago.
Bumping API level locally and enabling posix_memalign seems worth giving a try. Rust should probably consider defaulting to some newer API as well.

@Amanieu
Copy link
Member

Amanieu commented Apr 6, 2020

It might be worth switching it to plain malloc temporarily to see if this fixes the crash (even though it is technically wrong to ignore the alignment).

@bjorn3
Copy link
Member

bjorn3 commented Apr 6, 2020

Unlike x86 on ARM alignment is very important. Normal load and store instructions can't access unaligned memory.

@Amanieu
Copy link
Member

Amanieu commented Apr 6, 2020

Unlike x86 on ARM alignment is very important. Normal load and store instructions can't access unaligned memory.

Sure but the default malloc alignment is enough for that.

@Aaron1011
Copy link
Member

Aaron1011 commented Apr 8, 2020

I think it's going to be extremely difficult to debug this further in the android vm. I'm currently trying to reproduce this without android, using a combination of dlmalloc and collectionstest built for the armv7-unknown-linux-gnueabihf target. However, it's possible that there's something Android-specific going on here (e.g. the emulator, the QEMU arm backend, or some #[cfg(android)] code.

@Amanieu
Copy link
Member

Amanieu commented Apr 9, 2020

So, here's what I got so far. I've narrowed it down to this command line, which runs only 35 tests:

$ export RUST_TEST_THREADS=44
$ export RUSTC_BOOTSTRAP=1
$ test1/collectionstests-3293dcd521e26794 -Zunstable-options --exclude-should-panic string::test_
running 35 tests
test string::test_push ... ok
test string::test_push_str ... ok
test string::test_push_bytes ... ok
test string::test_replace_range ... ok
test string::test_pop ... ok
test string::test_into_boxed_str ... ok
test string::test_from_utf8_lossy ... ok
test string::test_from_utf8 ... ok
test string::test_from_utf16_lossy ... ok
test string::test_from_str ... ok
test string::test_from_iterator ... ok
test string::test_from_cow_str ... ok
test string::test_extend_ref ... ok
test string::test_drain ... ok
test string::test_add_assign ... ok
test string::test_replace_range_empty ... ok
test string::test_replace_range_inclusive_range ... ok
test string::test_replace_range_unbounded ... ok
test string::test_retain ... ok
test string::test_simple_types ... ok
test string::test_slicing ... ok
test string::test_split_off_ascii ... ok
test string::test_split_off_empty ... ok
test string::test_split_off_unicode ... ok
test string::test_str_add ... ok
test string::test_str_clear ... ok
test string::test_str_truncate ... ok
test string::test_str_truncate_invalid_len ... ok
test string::test_unsized_to_string ... ok
test string::test_utf16_invalid ... ok
test string::test_vectors ... ok
test string::test_from_utf16 ... ok
test string::test_reserve_exact ... ok
test string::test_try_reserve ... ok
test string::test_try_reserve_exact ... ok

test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 616 filtered out

Crashes are rare, but when it does crash it seems to always be in test_try_reserve_exact (confirmed by disassembling the address in the tombstone):

running 35 tests
test string::test_add_assign ... ok
test string::test_drain ... ok
test string::test_extend_ref ... ok
test string::test_from_cow_str ... ok
test string::test_from_iterator ... ok
test string::test_from_str ... ok
test string::test_from_utf16 ... ok
test string::test_from_utf16_lossy ... ok
test string::test_from_utf8 ... ok
test string::test_from_utf8_lossy ... ok
test string::test_into_boxed_str ... ok
test string::test_pop ... ok
test string::test_push_bytes ... ok
test string::test_push_str ... ok
test string::test_replace_range ... ok
test string::test_replace_range_empty ... ok
test string::test_replace_range_inclusive_range ... ok
test string::test_replace_range_unbounded ... ok
test string::test_reserve_exact ... ok
test string::test_retain ... ok
test string::test_simple_types ... ok
test string::test_slicing ... ok
test string::test_split_off_ascii ... ok
test string::test_split_off_empty ... ok
test string::test_split_off_unicode ... ok
test string::test_str_add ... ok
test string::test_str_clear ... ok
test string::test_str_truncate ... ok
test string::test_str_truncate_invalid_len ... ok
test string::test_unsized_to_string ... ok
test string::test_utf16_invalid ... ok
test string::test_vectors ... ok
Segmentation fault 

The segmentation fault is misleading, that's just what Android's libc uses for abort() instead of SIGABRT. The abort message is:

@@@ ABORTING: invalid address or address of corrupt block 0xb7bda648 passed to dlfree

From the looks of it, there is no problem in our code: this is purely a bug in Android's dlmalloc. My guess is that there is some integer overflow issue in dlmalloc's internal size calculations, but dependent on some internal malloc state (hence the non-determinism).

@Amanieu
Copy link
Member

Amanieu commented Apr 9, 2020

Looking at the memory dumps from the tombstone, we can see that the bug is somewhere in realloc:

Abort message: '@@@ ABORTING: invalid address or address of corrupt block 0xb8caf510 passed to dlfree'

memory near r7:
    b8caf4f8 00000000 00000000 00000000 00000000  
    b8caf508 00000000 00000000 00000000 8000000b  
    b8caf518 33323130 37363534 b8ca3938 00000001  
    b8caf528 00000018 00000006 00000000 00000000  
    b8caf538 000069d7 00000001 00000000 00000000  
    b8caf548 00000000 00000000 00000000 b2cf0f00  
    b8caf558 00000000 00000000 00000000 00000000  
    b8caf568 00000000 00000000 00000000 00000000  
    b8caf578 00000000 00000000 00000000 00000000  
    b8caf588 00000000 00000000 00000000 00000000  
    b8caf598 00000000 00000000 00000000 00000000  
    b8caf5a8 00000000 00000000 00000000 00000000  
    b8caf5b8 00000000 00000000 00000000 00000000  
    b8caf5c8 00000000 00000000 00000000 00000000  
    b8caf5d8 00000000 00000000 00000000 00000000  
    b8caf5e8 00000000 00000000 00000000 00000000  

Note the chunk prefix of 8000000b which says "this chunk has a length of 2GB + 8". This is clearly impossible since the address of the block (0xb8caf518) is in the second half of the address space.

Note that this bug is not an issue for modern Android versions since those use jemalloc instead of dlmalloc for the default libc allocator. In light of this I think we should just disable the try_reserve tests on Android.

@pietroalbini
Copy link
Member

Thanks for investigating this ❤️

@Aaron1011
Copy link
Member

Aaron1011 commented Apr 9, 2020

Looking at the memory dumps from the tombstone, we can see that the bug is somewhere in realloc:

EDIT: I misunderstood how realloc_fallback is being used. The realloc function is in fact called by libstd in certain cases, so @Amanieu's analysis seems correct.

@Amanieu: How do you know it's caused by realloc? The tombstone file I captured contained the following backtrace for one of the threads (not the crashing one):

    #00  pc 0001c5a4  /system/lib/libc.so (__futex_syscall3+8)
    #01  pc 0000de88  /system/lib/libc.so
    #02  pc 0000f56d  /system/lib/libc.so (dlmalloc+48)
    #03  pc 000108a7  /system/lib/libc.so
    #04  pc 0000d9fb  /system/lib/libc.so (memalign+10)
    #05  pc 0006dcb0  /data/tmp/work/libstd-7d8e6950da20b76f.so (std::sys_common::alloc::realloc_fallback::h05f7cc89deafdb57+36)
    #06  pc 00073a50  /data/tmp/work/libstd-7d8e6950da20b76f.so (__rdl_realloc+40)
    #07  pc 0012a390  /data/tmp/work/test1/collectionstests-8c9066a5b2aa9a17 (alloc::raw_vec::RawVec$LT$T$C$A$GT$::reserve_exact::h5da8ae41586c5215+104)
    #08  pc 0014c50c  /data/tmp/work/test1/collectionstests-8c9066a5b2aa9a17 (collectionstests::vec::overaligned_allocations::h298719e516d2db4a+112)

It looks like Rust isn't actually calling into realloc from dlmalloc - instead, it's falling back to freeing and re-allocating memory. If realloc is getting called, then it's either being done by the standard library making an explicit call to an extern C function (which I think never happens for realloc), or it's being done by bionic libc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-github-actions Area: GitHub Actions (GHA) A-spurious Area: Spurious failures in builds (spuriously == for no apparent reason) C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC E-help-wanted Call for participation: Help is requested to fix this issue. O-android Operating system: Android P-medium Medium priority T-infra Relevant to the infrastructure team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants