Compiler crashes/ICEs on new aarch64 GHA runners/Azure Cobalt 100 (Neoverse N2) CPUs #135867
Comments
Hit another different error: https://github.com/sgrif/pq-sys/actions/runs/12903663719/job/35979328082#step:15:166
Hmm, this might be something about incremental
Similar issue here: https://github.com/PyO3/maturin-action/actions/runs/12907444142/job/35991179531
I'm quite sure it isn't. We print the compiler flags at the bottom of the ICE message and there is no |
In addition, the slice index here is telling. Very often we get these ICEs because the slice that's being indexed into is being decoded from an artifact that should have been invalidated but wasn't, or because the slice was truncated. But here the index is just completely bogus. Successfully accessing a (non-ZST) slice index of 503566387083609839 would imply at least a 447 PB allocation.
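The size estimate above can be checked with a few lines of arithmetic. This is only a sketch: the helper name `min_allocation_bytes` is made up for illustration, it assumes the smallest possible non-ZST element (1 byte), and it reads "PB" as binary petabytes (2^50 bytes).

```rust
/// Bytes in one binary petabyte (PiB).
const PIB: u128 = 1 << 50;

/// Smallest allocation, in bytes, for which `slice[index]` would be in
/// bounds when each element occupies `elem_size` bytes.
/// (Hypothetical helper, not part of rustc.)
fn min_allocation_bytes(index: u64, elem_size: u64) -> u128 {
    // The slice needs at least `index + 1` elements; u128 avoids overflow.
    (index as u128 + 1) * elem_size as u128
}

fn main() {
    // The bogus index quoted in the comment above.
    let index: u64 = 503_566_387_083_609_839;
    let bytes = min_allocation_bytes(index, 1);
    println!("{} bytes, roughly {} PiB", bytes, bytes / PIB);
}
```

Larger element types only increase the figure, which is why the estimate is a lower bound.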
Correction: I used the wrong beta here, which was much older than intended. Oddly, that seems only to have made a small difference. See #135867 (comment) below, and GitoxideLabs/gitoxide#1790 (comment), for details. I'm not sure how useful this will be, since based on #135867 (comment) it looks like the problem might already be understood, but I figured I'd report this anyway, in case it provides useful information about the environments in which the problem does and does not occur. This happened in a There was also one occurrence of "free(): invalid next size (fast)":
I was not able to reproduce any of the errors on an Azure cloud instance, also running rustc 1.84.0 from the In addition, in GitHub Actions, as described along with some more details in GitoxideLabs/gitoxide#1790 (comment), the errors never seem to occur on the
Experiment 1 tested However, while both showed the problem only ever to occur on Experiment 1 also included Therefore, I don't know what's going on.
Compiler bugs often result in programs with absurd execution behavior, SIGILL just means it tried to execute at an offset that doesn't form a valid instruction or was a
I don't understand what's going on here, but something is very broken in a rather novel way. Thank you for your report, it was very informative. I think there are basically two possibilities: either the stable toolchain for linux-aarch64 is incredibly broken and somehow people are noticing at exactly the same time as GitHub is making free linux-aarch64 runners available... or the newly-available runners are subtly buggy. At the moment, I think it's more likely the runners are buggy. If that's the case, GitHub people are probably scrambling to do something about these reports, so for now I'd just wait. The runners are a public beta, so finding bugs is expected.
I wonder if something changed in how we build the aarch64 compiler...
Yes, we started optimizing it with LTO and PGO. But that's not on stable yet.
When using `dtolnay/rust-toolchain` with the `toolchain` key to specify a channel, the action version should be given as `@master`. But I accidentally kept it at `@stable`! This caused `beta` and `nightly` to refer to the most recent beta and nightly builds *prior* to the current stable version. That made the conclusions about beta and nightly builds inaccurate. This rectifies that error and repeats the experiment. See e71b0cf (1f3f6b5), GitoxideLabs#1790, and rust-lang/rust#135867 for context. (I made this mistake in both experiment 1 and experiment 2, having wrongly thought I'd changed `@stable` to `@master` for experiment 1. This commit just repeats experiment 1, but experiment 2 should also be repeated for the same reason.)
As noted in the preceding commit, when I ran experiments 1 and 2 the first time, I accidentally used `dtolnay/rust-toolchain@stable` instead of `dtolnay/rust-toolchain@master`, even though the latter is needed to use current values of the `toolchain` key rather than the builds they referred to at the time the most recent stable build was updated. The preceding commit redid experiment 1 with that fixed. This commit redoes experiment 2 with the same fix. See 5a71963 (1b3e2cd), GitoxideLabs#1790, and rust-lang/rust#135867 for context.
This varies:
- `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm` GHA runner.
- Installing Rust via the `rust-toolchain` action vs. with curl.sh.
- Installing the stable vs. beta Rust toolchain.
- Installing nextest via `install-action` quickinstall/binstall.

*If* this also confirms that the only fully consistent factor in whether errors happen is `ubuntu-22.04-arm` vs. `ubuntu-24.04-arm`, then that will make it clearer that the problem is likely specific to the `ubuntu-24.04-arm` runner. See GitoxideLabs#1790 and rust-lang/rust#135867 for context.
We've been using ARM on GHA successfully for several months using their "larger runners" feature (where GitHub still manages the runners for you, unlike self-hosted, but allows you to customise the specs/architecture/... etc). Today I switched to the new And shortly after a colleague encountered this crash after they switched the ARM job of their repo from the larger ARM runner to the public ARM runner:
This would suggest that a difference in image or machine type between GitHub's larger runners ARM offering and the new public runner ARM offering is the cause of the ICE/crash. In particular, I've just found out from actions/partner-runner-images#36 (comment) that the CPU type has changed:
...so it seems that this could be an issue specific to the ARM Cobalt 100 / Neoverse N2 CPU?
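To check which core a given runner actually has, one option is to read the `CPU implementer` and `CPU part` fields that arm64 Linux exposes in `/proc/cpuinfo`. The sketch below is not from this thread: the function names are hypothetical, and the part numbers (`0xd0c` for Neoverse N1, `0xd49` for Neoverse N2, implementer `0x41` for Arm) are taken from Arm's published MIDR values, so treat them as assumptions to verify against your own runner.

```rust
/// Parse a hex field such as "0xd49" from /proc/cpuinfo.
fn parse_hex(value: &str) -> Option<u32> {
    u32::from_str_radix(value.trim().trim_start_matches("0x"), 16).ok()
}

/// Classify the first core described in arm64 `/proc/cpuinfo` text.
/// Returns None when the Arm-specific fields are absent (e.g. on x86).
/// (Hypothetical helper; part numbers are assumptions, see above.)
fn classify_cpuinfo(cpuinfo: &str) -> Option<&'static str> {
    let mut implementer = None;
    let mut part = None;
    for line in cpuinfo.lines() {
        if let Some((key, value)) = line.split_once(':') {
            match key.trim() {
                // Only the first occurrence of each field is used.
                "CPU implementer" if implementer.is_none() => implementer = parse_hex(value),
                "CPU part" if part.is_none() => part = parse_hex(value),
                _ => {}
            }
        }
    }
    match (implementer?, part?) {
        (0x41, 0xd0c) => Some("Neoverse N1"),
        (0x41, 0xd49) => Some("Neoverse N2"),
        _ => Some("other"),
    }
}

fn main() {
    match std::fs::read_to_string("/proc/cpuinfo") {
        Ok(text) => println!("{}", classify_cpuinfo(&text).unwrap_or("no Arm CPU fields found")),
        Err(err) => eprintln!("could not read /proc/cpuinfo: {err}"),
    }
}
```

Running something like this as an early CI step would make it easier to correlate crashes with the CPU type rather than with the image label alone.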
Rust on aarch64-unknown-linux-gnu intermittently hits SIGSEGV (rust-lang/rust#135867) with 1.83.0 or later. Rust 1.82.0 will be used for arm64 only until the above issue is resolved. Signed-off-by: Seunguk Shin <seunguk.shin@arm.com>
There are numerous reports of the 24.04-arm host being unstable: rust-lang/rust#135867. It turns out they are running on different hardware compared to 22.04-arm: actions/partner-runner-images#36 (comment). cc davidlattimore#365
I don't know if it's related, but some CI python/shell scripts have been segfaulting on
Possibly a runner HW(?) issue, e.g. also reported in non-Rust stuff actions/partner-runner-images#46 and other issues.
GitHub has changed the large aarch64 runners to Arm Neoverse N1 CPUs as a mitigation for this problem, so I'm unpinning the issue but leaving it open for now. Also I'm adjusting the labels because the best guess right now is that this is a kernel bug that only causes trouble on the Neoverse N2 hardware.
Update (2025-02-13): GitHub has changed the new large runners to Neoverse N1 CPUs. The underlying issue is not fixed, but GHA large aarch64 runners should now work fine with Ubuntu 24.
If you are reading this issue because you are seeing mysterious crashes on the new aarch64 GitHub Actions runners, try using the image `ubuntu-22.04-arm`. All software seems to be unstable to some extent on the `ubuntu-24.04-arm` image, though crashes in rustc seem more frequent.
I (@saethlin, the compiler maintainer editing this) will try to keep this message up-to-date as this problem is debugged. Original issue text is below.
Code
This happened once on a GitHub CI runner. I am nevertheless filing this as a report, as the output asked for it. It might be a hardware issue, as it went away with a rebuild.
CI LOG: https://github.com/sgrif/pq-sys/actions/runs/12903477183/job/35978791282?pr=73#step:14:26
Meta
`rustc --version --verbose`:
Error output