The fpm binary gets interrupted at the CI (macOS) #16
Comments
And here it failed in master (https://github.com/fortran-lang/fpm/runs/391281163) after merging #15, which had passed. So we have to figure this out.
Is there anybody with a Mac who can reproduce this locally?
I tried it locally on macOS yesterday, and I can't reproduce it. So we have to debug it on the CI itself (as I have done in #14), which is quite time-consuming.
As reported in #18 (comment), the process gets interrupted with signal 9 (
This might be a bug at GitHub, so I reported it at https://github.community/t5/GitHub-Actions/GitHub-Actions-on-macOS-randomly-kill-my-test-program/m-p/43526#M5414.
I am trying to reliably reproduce this bug in certik#1.
One approach to fix this is to do
This is precisely the kind of issue I am a bit worried about with Rust: that the toolchain is not mature enough and will produce binaries that do not work (but only sometimes!). The closest error that I was able to find is:
But this likely has more to do with GitHub Actions than Rust itself. Neither you nor the person from the GitHub Forum could reproduce it locally, correct? I agree it's very important to have fpm work on all 3 major platforms, but I wonder if it's a bit too early to focus on making macOS work 100% of the time? Why not let it fail sometimes? We can manually trigger the build when it does fail.
This is a workaround for fortran-lang#16. It turns out that if the executable fails to run, just rerunning all tests typically fixes it.
I implemented a workaround in #20. The problem with GitHub Actions is that you cannot restart a single build; you have to restart all the builds. The other issue is that it was already failing a lot with just 3 tests. Once we have hundreds of tests, it might end up failing all the time. The workaround implemented in #20 should make the CI robust enough for now, but we'll have to get back to this and fix it. The reason I spent so much time on this is that I thought I could use "bisection" to create a minimal example that fails, and figure it out. But I failed. Essentially, triggering the bug by submitting 20 commits is extremely unreliable: sometimes they all pass, and sometimes there are 4 failures. I haven't found a minimal example that reliably triggers it within 20 runs. As I start removing things, the failures become less and less frequent. If #20 doesn't give us reliable CI, then we can switch the macOS build to Azure Pipelines or Travis CI. But having a reliable CI is a must.
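To make the "rerun when the executable gets killed" idea concrete, here is a minimal Rust sketch. It is not the actual change made in #20; the helper `run_with_retry`, its `max_attempts` parameter, and the `true` stand-in command are all hypothetical, and it assumes a Unix-like system where a missing exit code means the process was terminated by a signal.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// Runs `program` with `args`, rerunning it only when the process ended
/// without an exit code (on Unix: it was terminated by a signal).
/// `max_attempts` is a hypothetical knob, not something taken from #20.
fn run_with_retry(program: &str, args: &[&str], max_attempts: u32) -> io::Result<ExitStatus> {
    let mut attempt = 1;
    loop {
        let status = Command::new(program).args(args).status()?;
        // `code()` returns `None` when the process was terminated by a signal.
        // A normal exit (success or failure) is returned as-is, never retried.
        if status.code().is_some() || attempt >= max_attempts {
            return Ok(status);
        }
        eprintln!("attempt {attempt}: terminated by a signal, retrying");
        attempt += 1;
    }
}

fn main() -> io::Result<()> {
    // `true` is a stand-in for the real test binary.
    let status = run_with_retry("true", &[], 3)?;
    println!("final status: {status}");
    Ok(())
}
```

Retrying only on a missing exit code keeps genuine test failures visible: a test that exits with a non-zero code is reported immediately rather than being rerun until it happens to pass.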
I reported it in the Rust community also: https://users.rust-lang.org/t/github-actions-randomly-kill-a-test-program/37255
This seems to actually fix the issue fortran-lang#16 as documented at: https://users.rust-lang.org/t/github-actions-randomly-kill-a-test-program/37255/6
I think #29 fixed this issue.
So our mac builder is getting killed with sigkill (9). This is super mysterious:
* rust-lang/cargo#7821
* fortran-lang/fpm#16
* https://github.community/t/github-actions-on-macos-randomly-kill-my-test-program/17387
There is a bug at our CI that I haven't been able to figure out yet. Here is an example of it: https://github.com/fortran-lang/fpm/runs/390475601.
Here is what I know:
* It only happens on macOS, never on Linux or Windows.
* Restarting the build typically fixes it (sometimes it fails 2x or 3x in a row, but eventually it always passes).
* `cargo test` runs in parallel by default, so I set `-j1` to run in serial. That seemed to reduce how often it fails (although I could be wrong on that). It still fails sometimes, however, so the actual bug is still there.

The error is:
This error comes from the `assert_cmd` package, which just uses Rust's `std::process::Command`. It is reported when the command did not succeed and its exit code could not be retrieved either, which according to the documentation means the process was interrupted by the system (with a signal such as SIGKILL).
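The distinction between "exited with a code" and "terminated by a signal" can be checked directly with the standard library. The sketch below is only an illustration, not how `assert_cmd` itself is implemented: the `describe` helper is hypothetical, the code is Unix-only (it relies on `std::os::unix::process::ExitStatusExt`), and the shell that sends SIGKILL to itself is a stand-in for the test binary being killed on CI.

```rust
use std::os::unix::process::ExitStatusExt; // Unix-only: adds `signal()` to ExitStatus
use std::process::{Command, ExitStatus};

/// Hypothetical helper: turns an exit status into a human-readable summary.
fn describe(status: ExitStatus) -> String {
    match (status.code(), status.signal()) {
        // Normal termination: the process returned an exit code.
        (Some(code), _) => format!("exited with code {code}"),
        // No exit code: the process was terminated by a signal.
        (None, Some(sig)) => format!("terminated by signal {sig}"),
        (None, None) => "ended without a code or signal".to_string(),
    }
}

fn main() -> std::io::Result<()> {
    // Stand-in for the killed test binary: a shell that sends SIGKILL (9) to itself.
    let status = Command::new("sh").args(&["-c", "kill -9 $$"]).status()?;
    println!("{}", describe(status));
    Ok(())
}
```

Logging the signal number this way in a test harness would at least confirm whether every failure is a SIGKILL or whether other signals are involved.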