AArch runner fails frequently #365

Closed
davidlattimore opened this issue Feb 2, 2025 · 5 comments

Comments

@davidlattimore
Owner

I don't necessarily have any solution for this, but wanted to raise an issue in case others had ideas.

We regularly see CI failures on the aarch64 jobs. At the moment we're running 4 aarch64 jobs (3 versions of Ubuntu + openSUSE). Generally when there's a failure, only one or two of the jobs fail and the others succeed.

The failure causes are varied, but mostly seem to be related to the network or Docker:

  • Value cannot be null. (Parameter 'network')
  • Error response from daemon: failed to create task for container
  • Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
  • Error message: Failed to connect to cdn.opensuse.org port 80 after 4001 ms: Could not connect to server
@mati865
Collaborator

mati865 commented Feb 2, 2025

Sounds like AArch64 runners aren't stable yet (they are quite a new feature). I don't know how to recover from the Docker issue, but network issues should be solvable with a simple retry. Alternatively, for a more sophisticated solution, we could pre-build all images with up-to-date dependencies baked in and store them on GitHub Packages.
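As a rough illustration of the "simple retry" idea for the network failures, here is a minimal sketch of a helper that re-runs a command with exponential backoff. The name `run_with_retry` and its parameters are hypothetical and not part of this project's CI; the actual workflow might just as well use a shell loop or an existing retry action.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff on a non-zero exit code.

    Hypothetical helper for illustration only; not taken from the project.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return result
        if attempt < attempts:
            # Wait longer after each failure: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")


# Possible usage inside a container setup step (assumed command):
# run_with_retry(["zypper", "--non-interactive", "refresh"])
```

A retry like this only papers over transient network errors; it would not help with the Docker daemon failures listed above.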

mati865 added a commit to mati865/wild that referenced this issue Feb 4, 2025
There are numerous reports of 24.04-arm host being unstable: rust-lang/rust#135867
Turns out they are running on different hardware compared to 22.04-arm: actions/partner-runner-images#36 (comment)

cc davidlattimore#365
mati865 added a commit to mati865/wild that referenced this issue Feb 4, 2025
There are numerous reports of 24.04-arm host being unstable: rust-lang/rust#135867

cc davidlattimore#365
davidlattimore pushed a commit that referenced this issue Feb 4, 2025
There are numerous reports of 24.04-arm host being unstable: rust-lang/rust#135867

cc #365
@mati865
Collaborator

mati865 commented Feb 8, 2025

This issue stems from the combination of kernel version and hardware used by GH runners: https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/crashes.20on.20new.20aarch64.20GHA.20runners/near/498394368
So the image version downgrade I applied is an effective workaround. Since then, there has been only one crash, in zypper, but those also happen on x86_64.

@davidlattimore
Owner Author

Thanks for looking into this! Sounds like we can probably close this.

@mati865
Collaborator

mati865 commented Feb 16, 2025

GitHub has rolled back the problematic hardware. Should we revert the workaround in #377?

@davidlattimore
Owner Author

Sure. Happy to see how it goes.
