s390x regression: failing io::tests::try_oom_error #133806

Open
uweigand opened this issue Dec 3, 2024 · 11 comments
Labels
A-ABI Area: Concerning the application binary interface (ABI) C-bug Category: This is a bug. E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example I-miscompile Issue: Correct Rust code lowers to incorrect machine code O-SystemZ Target: SystemZ processors (s390x) T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@uweigand
Contributor

uweigand commented Dec 3, 2024

As of this merge commit:

commit d53f0b1d8e261f2f3535f1cd165c714fc0b0b298
Merge: a2545fd6fc6 4a216a25d14
Author: bors <bors@rust-lang.org>
Date:   Thu Nov 28 21:44:34 2024 +0000

    Auto merge of #123244 - Mark-Simulacrum:share-inline-never-generics, r=saethlin

I'm seeing the following test case failure. Note that the test passes in both parents (a2545fd and 4a216a2) of the merge commit.

thread 'io::tests::try_oom_error' panicked at std/src/io/tests.rs:822:62:
called `Result::unwrap_err()` on an `Ok` value: ()
stack backtrace:
   0:      0x3fff7dd6702 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h0eec3d9053c23c0f
   1:      0x3fff7e37506 - core::fmt::write::h66866b531685abe5
   2:      0x3fff7dc575e - std::io::Write::write_fmt::h89ced3ac9904279e
   3:      0x3fff7dd6570 - std::sys::backtrace::BacktraceLock::print::h363d5b9cad1f5c19
   4:      0x3fff7df62e4 - std::panicking::default_hook::{{closure}}::ha4b8eaf1f6a37f57
   5:      0x3fff7df60da - std::panicking::default_hook::hda41cc1e1c3b4efa
   6:      0x2aa00430d78 - test::test_main::{{closure}}::h4d9e2859f981c511
   7:      0x3fff7df6aa0 - std::panicking::rust_panic_with_hook::heff88192ef2a89fb
   8:      0x3fff7dd6d52 - std::panicking::begin_panic_handler::{{closure}}::hff5589d5c45993a6
   9:      0x3fff7dd69b4 - std::sys::backtrace::__rust_end_short_backtrace::h165daf71d9abcca8
  10:      0x3fff7df63ca - rust_begin_unwind
  11:      0x3fff7d4aa6a - core::panicking::panic_fmt::hec8c29ccd1751d1e
  12:      0x3fff7d4b948 - core::result::unwrap_failed::h47cf11019e236d96
  13:      0x2aa001c5a9a - core::ops::function::FnOnce::call_once::h5453841f675c42ec
  14:      0x2aa00436d74 - test::__rust_begin_short_backtrace::h31f93d45aa944e21
  15:      0x2aa00436f62 - test::run_test_in_process::h617ed5302028c350
  16:      0x2aa0042a67e - std::sys::backtrace::__rust_begin_short_backtrace::hbc434a15ea7a090f
  17:      0x2aa00425e14 - core::ops::function::FnOnce::call_once{{vtable.shim}}::h2f86d2c09a8a35d2
  18:      0x3fff7df33a8 - std::sys::pal::unix::thread::Thread::new::thread_start::hce74d4c3b42eec78
  19:      0x3fff7bac3fa - start_thread
                               at /usr/src/debug/glibc-2.39-17.1.ibm.fc40.s390x/nptl/pthread_create.c:447:8
  20:      0x3fff7c2bde0 - thread_start
                               at /usr/src/debug/glibc-2.39-17.1.ibm.fc40.s390x/misc/../sysdeps/unix/sysv/linux/s390/s390-64/clone3.S:71
  21:                0x0 - <unknown>

I've tried debugging the test, but if I'm reading this correctly, the test function was already completely optimized out and replaced by a failed assertion at compile time:

Dump of assembler code for function _ZN4core3ops8function6FnOnce9call_once17h5453841f675c42ecE:
   0x000002aa001c5a60 <+0>:     stmg    %r6,%r15,48(%r15)
   0x000002aa001c5a66 <+6>:     aghi    %r15,-168
   0x000002aa001c5a6a <+10>:    lgr     %r11,%r15
   0x000002aa001c5a6e <+14>:    lgrl    %r1,0x2aa00568f28
   0x000002aa001c5a74 <+20>:    lb      %r0,0(%r1)
   0x000002aa001c5a7a <+26>:    la      %r4,167(%r11)
   0x000002aa001c5a7e <+30>:    larl    %r2,0x2aa00481e7c <anon.6846cc147164699b42462cc8b979de03.18.llvm.3644326088524771271>
   0x000002aa001c5a84 <+36>:    lghi    %r3,46
   0x000002aa001c5a88 <+40>:    larl    %r5,0x2aa00545d08 <anon.6846cc147164699b42462cc8b979de03.17.llvm.3644326088524771271>
   0x000002aa001c5a8e <+46>:    larl    %r6,0x2aa00546f78 <anon.6846cc147164699b42462cc8b979de03.473.llvm.3644326088524771271>
   0x000002aa001c5a94 <+52>:    brasl   %r14,0x2aa0005c0e0 <_ZN4core6result13unwrap_failed17h47cf11019e236d96E@plt>

Note the unconditional call to unwrap_failed.
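
For readers unfamiliar with the panic message above, here is a minimal sketch of the pattern involved. It is not the actual test source: `reserve_bytes` is a hypothetical helper, and the sketch requests `usize::MAX` bytes so that the failure is deterministic (the request is rejected as a capacity overflow before any allocator call), whereas the real test apparently reaches the allocator. The point is only that `unwrap_err()` panics with exactly this message when the reservation unexpectedly returns `Ok(())`.

```rust
use std::collections::TryReserveError;

/// Hypothetical helper: tries to reserve room for `additional` bytes
/// in a fresh, empty vector and reports the outcome.
fn reserve_bytes(additional: usize) -> Result<(), TryReserveError> {
    let mut v: Vec<u8> = Vec::new();
    v.try_reserve(additional)
}

fn main() {
    // A small request succeeds. Calling `unwrap_err()` on this `Ok(())`
    // would panic with "called `Result::unwrap_err()` on an `Ok` value: ()".
    assert!(reserve_bytes(16).is_ok());

    // usize::MAX bytes exceeds the isize::MAX capacity limit, so the
    // request fails without ever calling the allocator.
    let err = reserve_bytes(usize::MAX).unwrap_err();
    println!("reservation failed as expected: {err}");
}
```

In the failing build, the equivalent of the second call was compiled as if it always returned `Ok(())`, which is why the disassembly shows an unconditional call to `unwrap_failed`.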

@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 3, 2024
@bjorn3 bjorn3 added O-SystemZ Target: SystemZ processors (s390x) I-miscompile Issue: Correct Rust code lowers to incorrect machine code labels Dec 3, 2024
@saethlin saethlin added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Dec 3, 2024
@uweigand
Contributor Author

uweigand commented Dec 3, 2024

As requested by @saethlin , I tried the previous commit a2545fd using

RUSTFLAGS_NOT_BOOTSTRAP=-Zshare-generics ./x.py test

Interestingly enough, the io::tests::try_oom_error test still succeeds. However, another test is now failing:

[uweigand@a35lp68 rust]$ LD_LIBRARY_PATH=./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib \
    ./build/s390x-unknown-linux-gnu/stage1-std/s390x-unknown-linux-gnu/release/deps/alloctests-b4087bb360d3d1cf \
    sort::tests::stable::panic_retain_orig_set_cell_i32_random_d2

running 2 tests
memory allocation of 13192931584848 bytes failed
memory allocation of 13193334237648 bytes failed
Aborted (core dumped)

Backtrace shows:

#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x000003fff7aae406 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
#2  0x000003fff7a54460 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x000003fff7a3449c in __GI_abort () at abort.c:79
#4  0x000003fff7d97b94 in std::sys::pal::unix::abort_internal ()
   from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#5  0x000003fff7d72fb4 in std::process::abort () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#6  0x000003fff7d97c0c in std::alloc::rust_oom () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#7  0x000003fff7d97c30 in __rg_oom () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#8  0x000003fff7d752f4 in alloc::alloc::handle_alloc_error () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#9  0x000003fff7d752d4 in alloc::raw_vec::handle_error () from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#10 0x000002aa0009b882 in alloctests::sort::tests::panic_retain_orig_set_cell_i32_random_d2_impl ()
#11 0x000002aa0018e238 in core::ops::function::FnOnce::call_once ()
#12 0x000002aa00266e04 in test::__rust_begin_short_backtrace ()
#13 0x000002aa00267024 in test::run_test_in_process ()
#14 0x000002aa00290e8e in std::sys::backtrace::__rust_begin_short_backtrace ()
#15 0x000002aa00268ee4 in core::ops::function::FnOnce::call_once{{vtable.shim}} ()
#16 0x000003fff7de4e88 in std::sys::pal::unix::thread::Thread::new::thread_start ()
   from ./build/s390x-unknown-linux-gnu/stage1/lib/rustlib/s390x-unknown-linux-gnu/lib/libstd-890c298c4d76dcf1.so
#17 0x000003fff7aac3fa in start_thread (arg=0x3fff79008c0) at pthread_create.c:447
#18 0x000003fff7b2bde0 in thread_start () at ../sysdeps/unix/sysv/linux/s390/s390-64/clone3.S:71

Not sure if this is a related problem (at least it's also somewhere around OOM handling ...).

@saethlin
Member

saethlin commented Dec 3, 2024

So it sounds to me like the increased use of "share-generics codegen" has exposed a pre-existing miscompile. -Zshare-generics is of course an unstable flag, but it is on by default in unoptimized builds so it is not really a niche option.

What is in your config.toml and exactly what command are you running to hit these crashes? I just want to make extra sure that this can or can't be reproduced on x86_64.

@jieyouxu jieyouxu added C-bug Category: This is a bug. and removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Dec 3, 2024
@uweigand
Contributor Author

uweigand commented Dec 3, 2024

What is in your config.toml and exactly what command are you running to hit these crashes?

Nothing special as far as I can see ... config.toml is:

profile = "user"
[build]
extended = true
sanitizers = true
profiler = true
[rust]
lld = true

and then I'm running the following to trigger the failure:

./x.py build 
RUSTFLAGS_NOT_BOOTSTRAP=-Zshare-generics ./x.py test

All this is running natively on an s390x Linux system (Fedora 40 if it matters).

@saethlin
Member

saethlin commented Dec 4, 2024

Confirming that I ran exactly that on my x86_64 Arch Linux dev machine and none of the library tests fail. (A ui test and an assembly test fail, but what those tests are doing is just incompatible with adding the flag).

So I think either the bad code or bad IR is gated behind a set of cfgs that are only toggled by s390x, or this is an LLVM backend issue.

In either case, minimizing a reproducer would be ideal. I don't know if the standard library build is going to be relevant here; you should be able to extract a simple test case that misbehaves on nightlies before that PR with RUSTFLAGS=-Zshare-generics cargo run -Zbuild-std --release. If you get different behavior with and without build-std, then the standard library is relevant.

The thorny part about share-generics is that it changes how codegen works, depending on how your dependencies were compiled. And the standard library is always a dependency. So isolating this could be difficult.

@uweigand
Contributor Author

uweigand commented Dec 4, 2024

There have been some interesting developments: as of today, current mainline no longer shows the test case failure. I've been able to track the change down to this PR: #133701, and specifically the single changed line in library/std/src/sys/pal/unix/process/process_common.rs in that diff. Why this change should fix the problem is quite unclear. I'll try to track down differences in compiled code between the two source trees differing only in that one line.

@jieyouxu jieyouxu added the E-needs-mcve Call for participation: This issue has a repro, but needs a Minimal Complete and Verifiable Example label Dec 7, 2024
@uweigand
Contributor Author

I've been able to track the change down to this PR: #133701, and specifically the single changed line in library/std/src/sys/pal/unix/process/process_common.rs in that diff. Why this change should fix the problem is quite unclear.

This was mostly a red herring. It turns out that whether or not the bug is seen depends on the partitioning of code between different codegen units, which can be affected in various ways by random source code changes. Most of these random effects go away when forcing -Ccodegen-units=1. This also explains why I had been unable to create assembler or IR files showing the problem: --emit=asm or --emit=llvm-ir implicitly enforces a single codegen unit, which often changes the behavior significantly.

Using both -Zshare-generics and -Ccodegen-units=1 I was able to bisect the actual commit that introduces those bugs: #131586. While I still don't fully understand why this introduced the problem, at least it makes sense as it is an actual codegen change for s390x. I'll investigate further.

@saethlin saethlin added the A-ABI Area: Concerning the application binary interface (ABI) label Dec 11, 2024
@uweigand
Contributor Author

While I still don't fully understand why this introduced the problem, at least it makes sense as it is an actual codegen change for s390x.

Turns out this still isn't quite the real problem. In fact, when building the compiler and test suite, adding or removing the -vector target feature should not cause any codegen change, as they're built with the default CPU model which does not support that feature anyway.

The reason why we do see a difference in generated code is more tricky: the test case contains a catch_unwind intrinsic, which will be implemented via a __rust_try function synthesized by the LLVM codegen backend. Now, if the target specifies -vector as a required feature, that feature will be added as an attribute to all (normal) functions compiled by the Rust front end. However, the synthetic __rust_try function does not get this attribute added (is this deliberate or an oversight?). Therefore, the LLVM middle-end thinks this function uses a different feature set and will not inline __rust_try into its caller.

When the target feature is removed via the above patch, that obstacle no longer exists and __rust_try is inlined.
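
As a point of reference, here is a minimal sketch of the source-level construct under discussion. The helper name `panicked` is hypothetical; the sketch only shows what correct behavior looks like (the panic is absorbed by `catch_unwind` and the loop continues), not how the compiler lowers it or how `__rust_try` is generated.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Hypothetical helper: runs `f` under `catch_unwind` and reports
/// whether it panicked.
fn panicked<F: FnOnce()>(f: F) -> bool {
    panic::catch_unwind(AssertUnwindSafe(f)).is_err()
}

fn main() {
    for len in [2usize, 3] {
        let data: Vec<u32> = (0..len as u32).collect();

        // The panic unwinds out of the closure and is caught here.
        let caught = panicked(|| panic!("explicit panic"));

        // With correct unwinding, control resumes right after the
        // catch and local state is still intact.
        assert!(caught);
        assert_eq!(data.len(), len);
    }

    // A closure that does not panic yields Ok, so `panicked` is false.
    assert!(!panicked(|| {}));
    println!("all panics were caught");
}
```

The default panic hook still prints a message to stderr for each caught panic; that is expected and does not indicate a failure. The sketch assumes the default `panic=unwind` strategy.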

That still doesn't explain the crash. I've now at least been able to create a somewhat reduced test case (still pulls in bits of the library e.g. for the unwind handling, however). The following test, compiled with /home/uweigand/rust/build/s390x-unknown-linux-gnu/stage1/bin/rustc alloctest-orig.rs -Zshare-generics -Ccodegen-units=1 -Copt-level=2, crashes with this output:

thread 'main' panicked at alloctest-orig.rs:45:13:
explicit panic
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
memory allocation of 8787504631536 bytes failed
./makealloctests-orig: line 5: 1764556 Aborted                 (core dumped) ./alloctest-orig

As best as I can determine, the problem seems to be that the exception raised by the explicit panic!() call ends up at the wrong landing pad in the caller, when all the inlining through __rust_try has happened. So instead of the exception being ignored by the catch_unwind, we end up at a landing pad intended to handle some memory allocation failure. What this has to do with -Zshare-generics is still a mystery to me.

Here's the test program (there's still quite a bit of what looks like redundant code in there, but removing any of that makes the bug disappear):

#[inline(never)]
fn get_test_data(len: usize) -> Vec<u32>
{
    (0..len).map(|x| x as u32).collect()
}

trait Sort {
    fn sort_by<T, F>(v: &mut [T], compare: F)
    where
        F: FnMut(&T, &T) -> std::cmp::Ordering;
}

struct UnstableSort {}
impl Sort for UnstableSort {
    fn sort_by<T, F>(v: &mut [T], mut compare: F)
    where
        F: FnMut(&T, &T) -> std::cmp::Ordering,
    {
        v.sort_by(|a, b| compare(a, b));
    }
}

struct StableSort {}
impl Sort for StableSort {
    fn sort_by<T, F>(v: &mut [T], mut compare: F)
    where
        F: FnMut(&T, &T) -> std::cmp::Ordering,
    {
        v.sort_unstable_by(|a, b| compare(a, b));
    }
}

fn alloc_test<S: Sort>(
    len: usize,
) {
    let mut test_data = get_test_data(len);

    <S as Sort>::sort_by(&mut test_data.clone(), |a, b| {
        a.cmp(b)
    });

    let _ = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
        <S as Sort>::sort_by(&mut test_data, |_a, _b| {
            panic!();
        });
    }));
}

fn main() {
    for test_len in [2, 3] {
        alloc_test::<UnstableSort>(test_len);
    }
    for test_len in [2, 3] {
        alloc_test::<StableSort>(test_len);
    }
}

@saethlin
Member

saethlin commented Dec 18, 2024

What this has to do with -Zshare-generics is still a mystery to me.

Are there different symbols in the executable with and without -Zshare-generics?

Also I see you are using -Copt-level=2. What happens if you do -Copt-level=0 -Zmir-opt-level=2? From the look of your minimization, I feel like the MIR inliner is relevant here, and it's enabled by -Zmir-enable-passes=+Inline, but also automatically by -Copt-level=2, which itself implies -Zmir-opt-level=2 (all opt-level values over 0 imply mir-opt-level 2). It might be easier to look at the disassembly without LLVM optimizations enabled, which -Copt-level=0 -Zmir-opt-level=2 or -Copt-level=0 -Zmir-enable-passes=+Inline should get you.

@uweigand
Contributor Author

What this has to do with -Zshare-generics is still a mystery to me.

Are there different symbols in the executable with and without -Zshare-generics?

This is the diff of symbols listed by nm ("old" is without -Zshare-generics, "new" is with -Zshare-generics):

> r GCC_except_table11
63d63
< r GCC_except_table13
94a95
> r GCC_except_table15
110d110
< r GCC_except_table17
142c142
< r GCC_except_table26
---
> r GCC_except_table25
323d322
< t _ZN36_$LT$T$u20$as$u20$core..any..Any$GT$7type_id17h53726e90651e22ddE
785a785
> T _ZN82_$LT$core..array..iter..IntoIter$LT$T$C$_$GT$$u20$as$u20$core..ops..drop..Drop$GT$4drop17hcb59e8dfb3cbb960E
800a801
> T _ZN99_$LT$core..array..iter..IntoIter$LT$T$C$_$GT$$u20$as$u20$core..iter..traits..iterator..Iterator$GT$4next17h299013ae472134beE

Also I see you are using -Copt-level=2. What happens if you do -Copt-level=0 -Zmir-opt-level=2?

The bug disappears.

From the look of your minimization, I feel like the MIR inliner is relevant here, and it's enabled by -Zmir-enable-passes=+Inline, but also automatically by -Copt-level=2, which itself implies -Zmir-opt-level=2 (all opt-level values over 0 imply mir-opt-level 2). It might be easier to look at the disassembly without LLVM optimizations enabled, which -Copt-level=0 -Zmir-opt-level=2 or -Copt-level=0 -Zmir-enable-passes=+Inline should get you.

From what I've seen before, it seems necessary for the bug to manifest that the __rust_try synthetic function is inlined. This function doesn't even exist in MIR; it can only be inlined by the LLVM inliner. So switching off LLVM optimizations must make the bug disappear.

@uweigand
Contributor Author

uweigand commented Jan 9, 2025

I've finally managed to discover the actual root cause of the problem. It's a codegen bug in LLVM that is related to register allocation, which explains why just about any change in generated code may cause the symptom to appear or disappear. For a more detailed description of the LLVM bug, see llvm/llvm-project#122315.

With the patch listed in the above issue applied to Rust's in-tree copy of LLVM, the test suite passes on a commit where it would previously fail.

Note that the following is an even more reduced Rust test case:

#[inline(never)]
fn get_test_data(len: usize) -> Vec<u32>
{
    (0..len).map(|x| x as u32).collect()
}

fn main() {
    for len in [2, 3] {
        let test_data = get_test_data(len);
        let _ = test_data.clone();

        let _ = std::panic::catch_unwind(std::panic::AssertUnwindSafe(|| {
            panic!();
        }));
    }
}

This makes it clear the test case involves a loop whose code flow includes a thrown-and-caught exception, which triggers the LLVM post-RA loop invariant code motion bug listed above. If that bug hits, it may move initialization of a register (indirectly) holding the size of the data to be allocated during the inlined clone operation to before the loop. As that register is clobbered during exception dispatch, we really are trying to allocate an impossibly large chunk of memory.
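
The effect described above can be illustrated with a toy model. This is purely a sketch: `Machine`, `dispatch_exception`, `ELEM_SIZE`, and the clobber value are all invented for illustration and do not correspond to real s390x registers or to LLVM's actual output. It only shows why hoisting the initialization of a register out of a loop is wrong when something inside the loop clobbers that register.

```rust
/// Toy model of a single machine register (illustrative only).
struct Machine {
    reg: u64,
}

impl Machine {
    /// Stands in for exception dispatch, which clobbers the register.
    fn dispatch_exception(&mut self) {
        self.reg = 0xDEAD_BEEF_0000; // arbitrary garbage value
    }
}

/// The loop-invariant value, here the size of one `u32` element.
const ELEM_SIZE: u64 = 4;

/// Correct schedule: the register is initialized inside the loop.
fn sizes_reloaded(lens: &[u64]) -> Vec<u64> {
    let mut m = Machine { reg: 0 };
    let mut out = Vec::new();
    for &len in lens {
        m.reg = ELEM_SIZE; // re-initialized on every iteration
        out.push(len.wrapping_mul(m.reg)); // allocation size for the clone
        m.dispatch_exception(); // panic thrown and caught
    }
    out
}

/// Buggy schedule: the initialization was hoisted before the loop,
/// ignoring the clobber during exception dispatch.
fn sizes_hoisted(lens: &[u64]) -> Vec<u64> {
    let mut m = Machine { reg: ELEM_SIZE }; // hoisted out of the loop
    let mut out = Vec::new();
    for &len in lens {
        out.push(len.wrapping_mul(m.reg)); // stale after the first iteration
        m.dispatch_exception();
    }
    out
}

fn main() {
    println!("reloaded: {:?}", sizes_reloaded(&[2, 3]));
    println!("hoisted:  {:?}", sizes_hoisted(&[2, 3]));
}
```

In this model the first iteration is fine under both schedules, while the second iteration of the hoisted version computes an absurdly large size, which matches the shape of the "memory allocation of ... bytes failed" messages earlier in the thread.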

I'll be working to get this fixed on the LLVM side.

@cuviper
Member

cuviper commented Mar 11, 2025

We ran into this test failure independently, and I just submitted #138370 to avoid any actual allocation in that test. I was under the impression that LLVM was being very clever to see that the allocation wasn't really used, though I didn't know why that would only happen for s390x.

It's interesting that there's an actual codegen bug, but still, I think the test is better without attempting a real OOM.
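
To make the idea of testing the conversion without a real OOM concrete, here is a rough sketch. It is not the code from #138370: `simulated_oom_error` is a hypothetical helper, and it obtains a `TryReserveError` through a capacity overflow (which never calls the allocator) rather than constructing the error directly as the PR description suggests. It also assumes the conversion under test is `From<TryReserveError> for io::Error`, which needs a reasonably recent stable toolchain.

```rust
use std::io;

/// Hypothetical helper: produces an `io::Error` from a reservation
/// failure without asking the allocator for any memory.
fn simulated_oom_error() -> io::Error {
    let mut v: Vec<u8> = Vec::new();
    // usize::MAX bytes exceeds the isize::MAX capacity limit, so this
    // fails deterministically before any allocation is attempted.
    let reserve_err = v
        .try_reserve(usize::MAX)
        .expect_err("usize::MAX bytes cannot be reserved");
    io::Error::from(reserve_err)
}

fn main() {
    let err = simulated_oom_error();
    // The conversion maps the reservation failure to OutOfMemory.
    assert_eq!(err.kind(), io::ErrorKind::OutOfMemory);
    println!("converted error kind: {:?}", err.kind());
}
```

Because nothing here depends on the allocator actually failing, a check of this shape behaves the same on every target, which is the property the comment above is after.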

matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Mar 13, 2025
Simulate OOM for the `try_oom_error` test

We can create the expected error manually, rather than trying to produce
a real one, so the error conversion test can run on all targets. Before,
it was only running on 64-bit and not miri.

In Fedora, we also found that s390x was not getting the expected error,
"successfully" allocating the huge size because it was optimizing the
real `malloc` call away. It's possible to counter that by looking at the
pointer in any way, like a debug print, but it's more robust to just
deal with errors directly, since this test is only about conversion.

Related: rust-lang#133806
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Mar 13, 2025
Rollup merge of rust-lang#138370 - cuviper:try_oom_error, r=jhpratt

Simulate OOM for the `try_oom_error` test
github-actions bot pushed a commit to model-checking/verify-rust-std that referenced this issue Mar 14, 2025
Simulate OOM for the `try_oom_error` test