s390x regression: failing io::tests::try_oom_error #133806
As requested by @saethlin, I tried the previous commit a2545fd using
Interestingly enough, the
Backtrace shows:
I'm not sure if this is a related problem (at least it's also somewhere around OOM handling...).
So it sounds to me like the increased use of "share-generics codegen" has exposed a pre-existing miscompile. What is in your
Nothing special as far as I can see ...
and then I'm running the following to trigger the failure:
All this is running natively on an s390x Linux system (Fedora 40, if it matters).
Confirming that I ran exactly that on my x86_64 Arch Linux dev machine, and none of the library tests fail. (A ui test and an assembly test fail, but what those tests are doing is just incompatible with adding the flag.) So I think either the bad code or bad IR is gated behind a set of cfgs that are only toggled by s390x, or this is an LLVM backend issue. In either case, minimizing a reproducer would be ideal.

I don't know if the standard library build is going to be relevant here; you should be able to extract a simple test case that misbehaves on nightlies before that PR with

The thorny part about share-generics is that it changes how codegen works, depending on how your dependencies were compiled, and the standard library is always a dependency. So isolating this could be difficult.
There have been some interesting developments: as of today, current mainline no longer shows the test case failure. I've been able to track the change down to this PR: #133701, and specifically to the single changed line in library/std/src/sys/pal/unix/process/process_common.rs in that diff. Why this change should fix the problem is quite unclear. I'll try to track down differences in compiled code between the two source trees differing only in that one line.
This was mostly a red herring. It turns out that whether or not the bug is seen depends on the partitioning of code between different codegen units, which can be affected in various ways by random source code changes. Most of these random effects go away when forcing

Using both
It turns out this still isn't quite the real problem. In fact, when building the compiler and test suite, adding or removing the

The reason why we do see a difference in generated code is trickier: the test case contains a

When the target feature is removed via the above patch, that obstacle no longer exists and

That still doesn't explain the crash. I've now at least been able to create a somewhat reduced test case (it still pulls in bits of the library, e.g. for the unwind handling, however). The following test, compiled with
As best I can determine, the problem seems to be that the exception raised by the explicit

Here's the test program (there's still quite a bit of what looks like redundant code in there, but removing any of that makes the bug disappear):
Are there different symbols in the executable with and without

Also, I see you are using
This is the diff of symbols listed by nm ("old" is without
The bug disappears.
From what I've seen before, it seems necessary for the bug to manifest that the
I've finally managed to discover the actual root cause of the problem. It's a codegen bug in LLVM that is related to register allocation, which explains why just about any change in generated code may cause the symptom to appear or disappear. For a more detailed description of the LLVM bug, see llvm/llvm-project#122315. With the patch listed in the above issue applied to Rust's in-tree copy of LLVM, the test suite passes on a commit where it would previously fail. Note that the following is an even more reduced Rust test case:
This makes it clear the test case involves a loop whose code flow includes a thrown-and-caught exception, which triggers the LLVM post-RA loop-invariant code motion bug listed above. If that bug hits, it may move the initialization of a register (indirectly) holding the size of the data to be allocated during the inlined clone operation to before the loop. As that register is clobbered during exception dispatch, we really are trying to allocate an impossibly large chunk of memory. I'll be working to get this fixed on the LLVM side.
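The reduced reproducer itself isn't reproduced above. Purely as an illustration of the shape being described (a loop whose control flow includes a caught panic followed by an allocating clone), a minimal sketch might look like the following; the data, sizes, and loop count here are invented and this is not the original test case:

```rust
use std::panic;

fn main() {
    // Keep the deliberate panics from spamming stderr.
    panic::set_hook(Box::new(|_| {}));

    let data: Vec<u8> = vec![0u8; 64];
    let mut total = 0usize;

    for _ in 0..4 {
        // A panic raised and caught on every iteration makes unwinding part of
        // the loop's control flow.
        let caught = panic::catch_unwind(|| {
            panic!("boom");
        });
        assert!(caught.is_err());

        // An allocating clone after the catch: the register (indirectly) holding
        // the clone's size is the kind of value that post-RA loop-invariant code
        // motion may hoist out of the loop, and which exception dispatch then
        // clobbers.
        let copy = data.clone();
        total += copy.len();
    }

    println!("cloned {total} bytes in total");
}
```

Whether a given build actually hits the register-allocation bug depends on inlining and codegen-unit layout, which is consistent with the earlier observation that unrelated source changes made the symptom come and go.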
We ran into this test failure independently, and I just submitted #138370 to avoid any actual allocation in that test. I was under the impression that LLVM was being clever enough to see that the allocation wasn't really used, though I didn't know why that would only happen for s390x. It's interesting that there's an actual codegen bug, but still, I think the test is better without attempting a real OOM.
Simulate OOM for the `try_oom_error` test

We can create the expected error manually, rather than trying to produce a real one, so the error conversion test can run on all targets. Before, it was only running on 64-bit and not miri.

In Fedora, we also found that s390x was not getting the expected error, "successfully" allocating the huge size because it was optimizing the real `malloc` call away. It's possible to counter that by looking at the pointer in any way, like a debug print, but it's more robust to just deal with errors directly, since this test is only about conversion.

Related: rust-lang#133806
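As a rough sketch of the conversion being tested (this is not the actual std test or the change in rust-lang#138370, and the exact reservation size used by the real test may differ), the fragile part is producing a genuine allocation failure before the `TryReserveError` to `io::Error` conversion:

```rust
use std::io;

fn main() {
    // On a 64-bit target, request an absurd reservation and expect the allocator
    // to refuse it. If the optimizer proves the allocation is never really used
    // and drops the malloc call, the reservation "succeeds" and unwrap_err()
    // panics instead; that fragility is what the fix removes by constructing the
    // error manually.
    let mut v: Vec<u8> = Vec::new();
    let reserve_err = v.try_reserve(isize::MAX as usize).unwrap_err();

    // The part the test actually cares about: the allocation failure converts
    // into an io::Error with ErrorKind::OutOfMemory.
    let io_err = io::Error::from(reserve_err);
    assert_eq!(io_err.kind(), io::ErrorKind::OutOfMemory);
    println!("conversion checked: {io_err}");
}
```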
As of this merge commit:
I'm seeing the following test case failure. Note that the test passes in both parents (a2545fd and 4a216a2) of the merge commit.
I've tried debugging the test, but if I'm reading this correctly, the test function was already completely optimized out and replaced by a failed assertion at compile time:
Note the unconditional call to `unwrap_failed`.
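To illustrate why an unconditional `unwrap_failed` call would appear (a hypothetical sketch, not the actual test): once the optimizer concludes the huge reservation always succeeds, calling `unwrap_err()` on the resulting `Ok` value can only panic, and that panic goes through core's internal `unwrap_failed` helper.

```rust
fn main() {
    // Stand-in for a try_reserve call the optimizer has decided always succeeds.
    let always_ok: Result<(), &str> = Ok(());

    // unwrap_err() on a value known to be Ok(..) can only panic, so the whole
    // test body collapses into an unconditional call to core's unwrap_failed.
    let _ = always_ok.unwrap_err(); // panics: called `Result::unwrap_err()` on an `Ok` value
}
```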