intermittent SIGSEGV or SIGBUS in rustc #1790
Affected job: test-fast on ubuntu-24.04-arm
Comments
Thanks for reporting! I hadn't noticed this yet and hope it will be a rare occurrence. If not, it could be made 'non-blocking' as proposed.
Actually, it just failed on main: https://github.com/GitoxideLabs/gitoxide/actions/runs/12903033259/job/35977612391 Maybe it's best to just make it non-blocking right away.
The error messages suggest forcing a minimum stack size for `rustc`, which can be done by setting `RUST_MIN_STACK`. I am about to look into whether setting that helps. Then I'll open a PR to improve the situation one way or another. (Making it non-blocking for PRs by splitting the ARM64 run out of `test-fast` is another option.) Rerunning the check may make it pass, since the failure is intermittent. But I am not recommending that as a substitute for a change that would make the check fail less often (or not at all), or that would change how we treat failures of that check.
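For context, `RUST_MIN_STACK` is an environment variable the Rust standard library (and `rustc` itself) consults for the minimum thread stack size in bytes. A minimal sketch of how it could be forced for a CI job; the job name, value, and steps here are illustrative assumptions, not gitoxide's actual workflow:

```yaml
# Sketch only: one way to raise the minimum stack size for a Rust CI job.
jobs:
  test-fast-arm:
    runs-on: ubuntu-24.04-arm
    env:
      # Minimum thread stack size in bytes; 16 MiB is an arbitrary example value.
      RUST_MIN_STACK: 16777216
    steps:
      - uses: actions/checkout@v4
      - run: cargo test --workspace
```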
This is to investigate the problem on the `test-fast` job with the new ARM64 runner described in GitoxideLabs#1790.

This experiment does not produce useful results yet, because it has no way to distinguish happenstance from correlation. To do that, I need either to rerun each job repeatedly or to further parameterize the matrix. I'll be doing the latter, but right now this dimension has size 1 (i.e., the only value of `number` is `0`) so that I don't start a large number of jobs when something is broken due to a mistake in the workflows.
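To make that concrete, here is a rough sketch of a matrix with such a size-1 repetition dimension; apart from `number`, the variable names and values are assumptions for illustration, not the experiment's actual workflow:

```yaml
# Sketch: `number` exists only to repeat each combination; with a single
# value it adds no jobs, but the list can grow once the workflow is known to work.
strategy:
  fail-fast: false
  matrix:
    image: [ubuntu-22.04-arm, ubuntu-24.04-arm]   # assumed dimension
    channel: [stable, beta]                       # assumed dimension
    number: [0]
```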
The previous experiment [1][2] didn't produce enough memory-related errors to clearly show which values of the variables have an effect, though it *looked* like the memory-related errors in `rustc` happened only on Ubuntu 24.04 (not 22.04) and only on the stable channel (not beta). That's one reason to increase the total number of jobs in the experiment.

Another reason is that the memory-related errors are more varied than before. Not all of them were true memory errors involving SIGSEGV or SIGBUS. Some were, the same as reported in [3]. But others were panics, looking like this (the index and slice length vary but, in each, the start index is much larger than the length):

    thread 'rustc' panicked at /rustc/9fc6b43126469e3858e2fe86cafb4f0fd5068869/compiler/rustc_serialize/src/opaque.rs:269:45:
    range start index 159846347648097871 out of range for slice of length 39963722

Since the distribution of errors across jobs might also be related to the order and times at which jobs started, for example if there are inadvertent differences between hosts (the ARM64 Linux runners are in preview, so this seems plausible, though fairly unlikely), the repetition is now expressed with two variables: a high-order one, listed first in the matrix, and a low-order one, listed last in the matrix.

Besides allowing more reps with the same values of the meaningful variables, another reason to stop testing with `RUST_MIN_STACK` is that it didn't seem to make a difference other than to change the message shown, which then suggests setting it to an even higher value.

[1]: e71b0cf
[2]: https://github.com/EliahKagan/gitoxide/actions/runs/12903958398
[3]: GitoxideLabs#1790
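A rough sketch of that refined design, with the repetition split into a high-order variable listed first and a low-order variable listed last; all names and list sizes are illustrative, and `RUST_MIN_STACK` is no longer varied:

```yaml
# Sketch: `outer` varies slowest and `inner` varies fastest, so repeats of the
# same (image, channel) combination land both adjacent and far apart in the
# matrix ordering, helping distinguish ordering effects from real correlation.
strategy:
  fail-fast: false
  matrix:
    outer: [0, 1]                                 # high-order repetition
    image: [ubuntu-22.04-arm, ubuntu-24.04-arm]
    channel: [stable, beta]
    inner: [0, 1, 2]                              # low-order repetition
```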
As suggested in GitoxideLabs#1790 (comment). It likely won't have to be kept this way, but making it non-required for now means that investigating what triggers the SIGSEGV (and SIGBUS) errors -- as well as other errors that were found while investigating that (d9e7fdb, e71b0cf, 5a71963) -- doesn't have to be rushed.
I've gone ahead and done this in #1792. I suspect it can be adjusted and made blocking again, but I'm not done with the research to figure out how, so I think it does make sense to make it non-blocking temporarily.
In both e71b0cf (results) and 5a71963 (results), a pattern emerges: the memory errors on the ARM64 Linux CI runners seem only ever to happen with the stable toolchain on Ubuntu 24.04. This remains the case even if we count, as memory errors, panics that suggest but do not prove them, such as range start indices that are much bigger than the length of the range; I observed these in 5a71963. A third kind of error, which is a memory error even under a narrow definition, also happened only in 5a71963, and only once. Another kind of error, which I had originally assumed was unrelated, also appeared.

So it looks like it would be sufficient to change either the runner image or the toolchain channel. But why do (22.04, stable), (22.04, beta), and (24.04, beta) all work, while only (24.04, stable) does not? I think #1792 may still be the best first move, until I've investigated that question more. So far I've looked at issues related to the runner images, and the only issue that seems maybe related, indirectly, is actions/partner-runner-images#36. I have not yet looked at what has changed between the stable and beta channels of the toolchain the ARM64 jobs are using.
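If that pattern holds, one hedged option (not something decided or merged) would be to keep the stable toolchain but move the ARM64 job to the 22.04 image, which did not show these errors in the experiments. A minimal sketch, with placeholder job name and steps:

```yaml
# Sketch: run the stable-toolchain ARM64 job on the 22.04 image instead of 24.04.
test-fast-arm:
  runs-on: ubuntu-22.04-arm   # instead of ubuntu-24.04-arm
  steps:
    - uses: actions/checkout@v4
    - run: cargo test --workspace
```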
The SIGSEGV and "range start index" errors are rust-lang/rust#135867.
Current behavior 😯
Since #1777, a `test-fast` CI job runs on `ubuntu-24.04-arm`, which is one of the newly more available 64-bit ARM (AArch64/ARM64) Linux runners. This job initially worked with no problems: the ARM failures mentioned in #1777 and #1778 and tracked in #1780 apply to a `test-32bit` job, occur only with Docker, and do not affect `test-fast`.

However, the ARM64 `test-fast` job now intermittently fails with `SIGBUS` or `SIGSEGV` in `rustc`. This is probably a bug in `rustc` or another component of the Rust toolchain for ARM64, but I have not reproduced it locally or otherwise ruled out a problem with the runner image. The job, like the other `test-fast` jobs, uses a stable toolchain.

Expected behavior 🤔
The compilation should complete, or give an error, but not crash. `SIGSEGV` and `SIGBUS` should not occur.

The underlying bug is not in gitoxide, but I'm opening this issue in gitoxide to track the problem with the affected CI job here, which may need to be removed, skipped, or made `continue-on-error`.
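For reference, a minimal sketch of what marking the job `continue-on-error` could look like at the job level; the job name and steps are placeholders, not the actual workflow:

```yaml
# Sketch: the job still runs and reports its result, but a failure here
# no longer causes the overall workflow run to fail.
test-fast-arm64:
  runs-on: ubuntu-24.04-arm
  continue-on-error: true
  steps:
    - uses: actions/checkout@v4
    - run: cargo test --workspace
```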
Git behavior
Not applicable.
Steps to reproduce 🕹
Run or rerun the `test-fast (ubuntu-24.04-arm)` job on any commit. It seems less likely to happen if `rust-cache` is able to retrieve cached dependencies, since there is then less to build, but it happens even when caching retrieves everything except what is built from this repository's workspace. It may be necessary to rerun a job multiple times to observe the problem.

A few runs that show this are:
The first link is to a run on the `main` branch in my fork. The following are relevant pieces of the output of that one run (not separate runs):