Compiler crashes/ICEs on new aarch64 GHA runners (and/or Azure's Cobalt 100 VMs) #135867
Comments
Hit another different error: https://github.com/sgrif/pq-sys/actions/runs/12903663719/job/35979328082#step:15:166
|
Hmm, this might be something about incremental
|
Similar issue here: https://github.com/PyO3/maturin-action/actions/runs/12907444142/job/35991179531
|
I'm quite sure it isn't. We print the compiler flags at the bottom of the ICE message and there is no |
In addition, the slice index here is telling. Very often we get these ICEs because the slice that's being indexed into is being decoded from an artifact that should have been invalidated but wasn't, or because the slice was truncated. But here the index is just completely bogus. Successfully accessing a (non-ZST) slice index of 503566387083609839 would imply at least a 447 PB allocation. |
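The size estimate above can be checked with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and it assumes one-byte elements (the smallest non-ZST case) and that "PB" means 2^50 bytes, which is the reading under which the 447 figure comes out.

```rust
/// Smallest allocation (in bytes) that could contain element `index`
/// of a slice whose elements are `elem_size` bytes each.
/// (Hypothetical helper, not part of rustc.)
fn min_allocation_bytes(index: u128, elem_size: u128) -> u128 {
    (index + 1) * elem_size
}

/// Whole pebibytes (2^50 bytes) in `bytes`, rounded down.
fn whole_pebibytes(bytes: u128) -> u128 {
    bytes >> 50
}

fn main() {
    // The bogus index from the ICE message quoted in this thread.
    let index: u128 = 503_566_387_083_609_839;
    // Assumption: 1-byte elements, the lower bound for a non-ZST type.
    let bytes = min_allocation_bytes(index, 1);
    println!("at least {} bytes (about {} PiB)", bytes, whole_pebibytes(bytes));
}
```

Any larger element type only scales the bound up, so the index cannot plausibly have come from a real slice length.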
Correction: I used the wrong beta here, which was much older than intended. Oddly, that seems only to have made a small difference. See #135867 (comment) below, and GitoxideLabs/gitoxide#1790 (comment), for details. I'm not sure how useful this will be, since based on #135867 (comment) it looks like the problem might already be understood, but I figured I'd report this anyway, in case it provides useful information about the environments in which the problem does and does not occur. This happened in a There was also one occurrence of "free(): invalid next size (fast)":
I was not able to reproduce any of the errors on an Azure cloud instance, also running rustc 1.84.0 from the In addition, in GitHub Actions, as described along with some more details in GitoxideLabs/gitoxide#1790 (comment), the errors never seem to occur on the
Experiment 1 tested However, while both showed the problem only ever to occur on Experiment 1 also included Therefore, I don't know what's going on.
(The |
Compiler bugs often result in programs with absurd execution behavior, SIGILL just means it tried to execute at an offset that doesn't form a valid instruction or was a
I don't understand what's going on here, but something is very broken in a rather novel way. Thank you for your report, it was very informative. I think there are basically two possibilities, either the stable toolchain for linux-aarch64 is incredibly broken and somehow people are noticing at exactly the same time as GitHub is making free linux-aarch64 runners available... or the newly-available runners are subtly buggy. At the moment, I think it's more likely the runners are buggy. If that's the case, GitHub people are probably scrambling to do something about these reports, so for now I'd just wait. The runners are a public beta, finding bugs is expected. |
I wonder if something changed in how we build the aarch64 compiler... |
Yes, we started optimizing it with LTO and PGO. But that's not on stable yet. |
When using `dtolnay/rust-toolchain` with the `toolchain` key to specify a channel, the action version should be given as `@master`. But I accidentally kept it at `@stable`! This caused `beta` and `nightly` to refer to the most recent beta and nightly builds *prior* to the current stable version. That made the conclusions about beta and nightly builds inaccurate. This rectifies that error and repeats the experiment. See e71b0cf (1f3f6b5), GitoxideLabs#1790, and rust-lang/rust#135867 for context. (I made this mistake in both experiment 1 and experiment 2, having wrongly thought I'd changed `@stable` to `@master` for experiment 1. This commit just repeats experiment 1, but experiment 2 should also be repeated for the same reason.)
As noted in the preceding commit, when I ran experiments 1 and 2 the first time, I accidentally used `dtolnay/rust-toolchain@stable` instead of `dtolnay/rust-toolchain@master`, even though the latter is needed to use current values of the `toolchain` key rather than the builds they referred to at the time the most recent stable build was updated. The preceding commit redid experiment 1 with that fixed. This commit redoes experiment 2 with the same fix. See 5a71963 (1b3e2cd), GitoxideLabs#1790, and rust-lang/rust#135867 for context.
This varies:
- `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm` GHA runner.
- Installing Rust via the `rust-toolchain` action vs. with curl.sh.
- Installing the stable vs. beta Rust toolchain.
- Installing nextest via `install-action` quickinstall/binstall.
*If* this also confirms that the only fully consistent factor in whether errors happen is `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm`, then that will make it clearer that the problem is likely specific to the `ubuntu-24.04-arm` runner. See GitoxideLabs#1790 and rust-lang/rust#135867 for context.
We've been using ARM on GHA successfully for several months using their "larger runners" feature (where GitHub still manages the runners for you, unlike self-hosted, but allows you to customise the specs/architecture/... etc). Today I switched to the new And shortly after a colleague encountered this crash after they switched the ARM job of their repo from the larger ARM runner to the public ARM runner:
This would suggest that there's a difference in image or machine type between GitHub's larger runners ARM offering and the new public runner ARM offering, that's the cause of the ICE/crash. In particular, I've just found out from actions/partner-runner-images#36 (comment) that the CPU type has changed:
...so it seems that this could be an issue specific to the ARM Cobalt 100 / Neoverse N2 CPU? |
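For anyone wanting to record which core a given runner actually has, here is a minimal sketch that reads the `CPU part` field from `/proc/cpuinfo`. The helper names are hypothetical. The part IDs `0xd0c` (Neoverse N1) and `0xd49` (Neoverse N2) are the values used in the Linux kernel's arm64 CPU type table; other cores are deliberately left unmapped.

```rust
/// Map an Arm "CPU part" ID to a core name.
/// Only the two cores discussed in this thread are covered.
fn core_name(part: u32) -> Option<&'static str> {
    match part {
        0xd0c => Some("Neoverse N1"),
        0xd49 => Some("Neoverse N2"),
        _ => None,
    }
}

/// Extract the first "CPU part" value from /proc/cpuinfo-style text.
fn first_cpu_part(cpuinfo: &str) -> Option<u32> {
    cpuinfo.lines().find_map(|line| {
        let (key, value) = line.split_once(':')?;
        if key.trim() != "CPU part" {
            return None;
        }
        // Values look like "0xd49"; strip the prefix before parsing as hex.
        let hex = value.trim().trim_start_matches("0x");
        u32::from_str_radix(hex, 16).ok()
    })
}

fn main() {
    // On non-Linux or non-Arm hosts the file is missing or lacks the field.
    let text = std::fs::read_to_string("/proc/cpuinfo").unwrap_or_default();
    match first_cpu_part(&text) {
        Some(part) => println!(
            "CPU part {:#x}: {}",
            part,
            core_name(part).unwrap_or("unknown core")
        ),
        None => println!("no \"CPU part\" field found"),
    }
}
```

Running this as an early CI step would let failures be correlated with the core type rather than only with the runner label.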
I'm not sure, because when I test now, it seems to happen less often overall, though that may very well just be due to chance. Testing 1.81, 1.82, 1.83, and 1.84, I saw it once on 1.83 and not on versions earlier than that. (The other failure on a separate 1.83 run was in the runner software itself.) This was at EliahKagan/gitoxide@cca8f00 (workflow run details). A subsequent experiment at EliahKagan/gitoxide@844c6bd (workflow run details) is likewise inconclusive. |
The N1 runner's CPU info is:
The N2 runner is:
|
Rust on aarch64-unknown-linux-gnu has a bug that intermittently causes SIGSEGV (rust-lang/rust#135867) with 1.83.0 or later. Rust 1.82.0 will be used until the above issue is resolved. Signed-off-by: Seunguk Shin <seunguk.shin@arm.com>
I tried opening a GitHub support ticket to raise awareness of this issue, but was directed back to the public discussion group (where we've not yet had a reply or acknowledgement of the issue from GitHub) because the ARM runners are in preview. Could everyone use the discussion group upvote arrow on this thread to raise its visibility? |
Upvoted in the discussion group, as my issue was closed as a duplicate of this one. Also, I can confirm that for me, as a workaround, switching to |
We are also getting this issue on non-arm runners but when Docker is emulating Running Rust 1.84.1 Note:
|
@DenuxPlays Let me clarify: building financrr-app on x86-64 GitHub runners while simulating arm64 with Docker QEMU occasionally leads to SIGSEGVs, but doing the same thing on local conventional x86-64 hardware doesn't seem to trigger SIGSEGV under realistic conditions. Did I get that right? |
Yes. It has happened since updating to Rust 1.84.1 (1.84.0 works). |
Or maybe since Ubuntu 24 has been in use. I am not sure which one caused it. |
I provisioned myself a D8plsv6 VM on Azure (so that's a Cobalt 100 CPU, the same that these new GHA runners are using) running Ubuntu 24.04 and I got one of these crashes by running the compiler's test suite.
|
Oh I see, I'm a fool and I didn't look at the numerous experiments that @EliahKagan has already done and basically pointed us at already. It looks to me like you provoked >20 crashes, mostly on 1.84 but one or two on 1.83, and 100% of the crashes are on Ubuntu 24, even though you did an identical number of runs on Ubuntu 22. Is that right? I do see some CI failures of yours on Ubuntu 22, but they all look like
as opposed to the smattering of segfaults, ICEs, and heap corruption that is happening on Ubuntu 24. |
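Sorting CI logs into these failure kinds can be sketched roughly as below. The substrings are assumptions about what typical logs contain: the `free(): invalid` text is taken from the glibc message quoted earlier in this thread, while the other markers are common spellings that may not match every log. The type and function names are hypothetical.

```rust
/// Coarse failure categories mentioned in this thread.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum FailureKind {
    Ice,
    HeapCorruption,
    Segfault,
    IllegalInstruction,
    Other,
}

/// Classify a log excerpt by the first marker that matches.
/// The order is a judgment call: an ICE banner is checked first because
/// it is the most specific of the markers.
fn classify(log: &str) -> FailureKind {
    if log.contains("internal compiler error") {
        FailureKind::Ice
    } else if log.contains("free(): invalid") {
        FailureKind::HeapCorruption
    } else if log.contains("SIGSEGV") {
        FailureKind::Segfault
    } else if log.contains("SIGILL") {
        FailureKind::IllegalInstruction
    } else {
        FailureKind::Other
    }
}

fn main() {
    // Made-up excerpts for illustration, not real logs from the experiments.
    let samples = [
        "free(): invalid next size (fast)",
        "process didn't exit successfully (signal: 11, SIGSEGV)",
        "test result: FAILED. 1 failed",
    ];
    for s in samples {
        println!("{:?} <- {}", classify(s), s);
    }
}
```

Tallying the results per runner label would give the kind of Ubuntu 22 versus Ubuntu 24 comparison described above, with ordinary test failures landing in `Other`.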
Code
This happened once on a GitHub CI runner. I am nevertheless filing this as a report, as the output asked for it. It might be a hardware issue, as it went away with a rebuild.
CI LOG: https://github.com/sgrif/pq-sys/actions/runs/12903477183/job/35978791282?pr=73#step:14:26
Meta
`rustc --version --verbose`:
Error output