The fpm binary gets interrupted at the CI (macOS) #16

Closed
certik opened this issue Jan 15, 2020 · 12 comments · Fixed by #18

Comments

@certik
Member

certik commented Jan 15, 2020

There is a bug in our CI that I haven't been able to figure out yet. Here is an example of it: https://github.com/fortran-lang/fpm/runs/390475601.

Here is what I know:

  1. It only happens on macOS, never on Linux or Windows

  2. Restarting the build typically fixes it (sometimes it fails 2x or 3x in a row, but eventually it always passes)

  3. cargo test runs in parallel by default, so I set -j1 to run in serial. That seemed to reduce how often it fails (although I could be wrong about that). It still fails sometimes, however, so the actual bug is still there.

  4. The error is:

thread 'test_2' panicked at 'Unexpected failure.
code=<interrupted>
stderr=``````
command=`"/Users/runner/runners/2.163.1/work/fpm/fpm/target/x86_64-apple-darwin/debug/fpm" "build"`
code=<interrupted>
stdout=``````
stderr=``````
', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/assert_cmd-0.10.2/src/assert.rs:148:17

This panic comes from the assert_cmd package, which just uses Rust's std::process::Command. It is raised when the command did not succeed and no exit code could be retrieved, which according to the documentation means the process was terminated by a signal (such as SIGKILL).
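To illustrate what "no exit code" means here, below is a minimal sketch (Unix-only, not taken from fpm or assert_cmd) of how std::process::ExitStatus distinguishes a normal exit from termination by a signal. The helper names `describe` and `run_sh` are made up for this example, and it assumes a POSIX `sh` is available on the PATH.

```rust
use std::os::unix::process::ExitStatusExt; // Unix-only: adds ExitStatus::signal()
use std::process::{Command, ExitStatus};

/// Describe how a child process ended: by exit code or by signal.
/// (Hypothetical helper, for illustration only.)
fn describe(status: ExitStatus) -> String {
    match (status.code(), status.signal()) {
        // code() is Some(..) only when the process exited on its own.
        (Some(code), _) => format!("exited with code {}", code),
        // code() is None when the process was terminated by a signal.
        (None, Some(sig)) => format!("killed by signal {}", sig),
        (None, None) => String::from("ended without a code or signal"),
    }
}

/// Run a shell snippet through `sh -c` and return its exit status.
/// (Hypothetical helper; assumes `sh` is on the PATH.)
fn run_sh(script: &str) -> ExitStatus {
    Command::new("sh")
        .arg("-c")
        .arg(script)
        .status()
        .expect("failed to spawn sh")
}

fn main() {
    // A normal, non-zero exit: the exit code is available.
    println!("{}", describe(run_sh("exit 3")));
    // The shell sends SIGKILL to itself: no exit code, only a signal number.
    println!("{}", describe(run_sh("kill -9 $$")));
}
```

The second case is the situation reported as code=<interrupted> in the panic message above: the status carries a signal number instead of an exit code.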

@certik
Member Author

certik commented Jan 15, 2020

And here it failed in master (https://github.com/fortran-lang/fpm/runs/391281163) after merging #15, which had passed. So we have to figure this out.

@milancurcic
Member

Is there anybody with a Mac who can reproduce this locally?

@certik
Member Author

certik commented Jan 15, 2020

I tried it locally on macOS yesterday, and I couldn't reproduce it. So we have to debug it on the CI itself (as I did in #14), which is quite time-consuming.

@certik
Member Author

certik commented Jan 16, 2020

As reported in #18 (comment), the process gets interrupted with signal 9 (SIGKILL). I suspected that above, but it's nice to have proof. For some reason macOS kills the fpm process.

@certik certik reopened this Jan 16, 2020
@certik
Member Author

certik commented Jan 16, 2020

@certik
Member Author

certik commented Jan 16, 2020

I am trying to reliably reproduce this bug in certik#1.

@certik
Member Author

certik commented Jan 17, 2020

One way to work around this is to run cargo test || cargo test. An example where the first run fails but the second one succeeds is https://github.com/certik/fpm/runs/396031403.
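The same retry idea can be sketched as a small Rust helper, for cases where a shell `||` is not convenient. This is only an illustration, not what the fpm CI actually runs: `run_with_retry` and the `retry` program name are hypothetical, and two attempts are chosen to mirror `cargo test || cargo test`.

```rust
use std::process::Command;

/// Run `program` with `args` up to `max_attempts` times.
/// Returns the 1-based number of the first attempt that exited successfully,
/// or `None` if every attempt failed (non-zero exit, killed by a signal,
/// or could not be started). Hypothetical helper, for illustration only.
fn run_with_retry(program: &str, args: &[&str], max_attempts: u32) -> Option<u32> {
    for attempt in 1..=max_attempts {
        match Command::new(program).args(args).status() {
            Ok(status) if status.success() => return Some(attempt),
            Ok(status) => eprintln!("attempt {} failed: {}", attempt, status),
            Err(err) => eprintln!("attempt {} could not start: {}", attempt, err),
        }
    }
    None
}

fn main() {
    // Intended usage: `retry <program> [args...]`, e.g. `retry cargo test`.
    let argv: Vec<String> = std::env::args().skip(1).collect();
    let (program, rest) = match argv.split_first() {
        Some(parts) => parts,
        None => {
            eprintln!("usage: retry <program> [args...]");
            return;
        }
    };
    let args: Vec<&str> = rest.iter().map(|s| s.as_str()).collect();

    // Two attempts mirrors `cmd || cmd`.
    if run_with_retry(program, &args, 2).is_none() {
        std::process::exit(1);
    }
}
```

Like the shell version, this only hides the intermittent SIGKILL; it does not explain or fix it.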

@certik
Member Author

certik commented Jan 17, 2020

This is precisely the kind of issue I am a bit worried about with Rust: that the toolchain is not mature enough and will produce binaries that do not work (but only sometimes!). The closest error that I was able to find is:

golang/go#19734

@milancurcic
Member

But this likely has more to do with GitHub Actions than Rust itself. Neither you nor the person from the GitHub Forum could reproduce it locally, correct?

I agree it's very important to have fpm work on all 3 major platforms, but I wonder if it's a bit too early to focus on making macOS work 100% of the time. Why not let it fail sometimes? We can manually trigger the build when it does fail.

certik added a commit to certik/fpm that referenced this issue Jan 17, 2020
This is a workaround for fortran-lang#16. It turns out that if the executable fails
to run, just rerunning all tests typically fixes it.
@certik
Member Author

certik commented Jan 18, 2020

I implemented a workaround in #20. The problem with GitHub Actions is that you cannot restart just a single build; you have to restart all the builds. The other issue is that it was already failing a lot with just 3 tests. Once we have hundreds of tests, it might end up failing all the time. The workaround implemented in #20 should make the CI robust enough for now. But we'll have to get back to this and fix it.
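To make the scaling concern concrete, here is a back-of-the-envelope estimate. It assumes (this is a modeling assumption, not something measured in this thread) that each launch of the fpm binary is killed independently with some small probability p, and that one CI run launches it n times:

```latex
P(\text{run fails}) = 1 - (1 - p)^{n},
\qquad
P(\text{run fails even with one retry}) = \left(1 - (1 - p)^{n}\right)^{2}
```

Under that model the failure probability approaches 1 as n grows for any fixed p > 0, and a single retry only squares it. So a retry helps a lot while failures are rare, but helps less and less as the number of tests grows.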

The reason I spent so much time on this is that I thought I could use "bisection" to create a minimal example that fails, and figure it out. But I failed. Essentially, triggering the bug by submitting 20 commits is extremely unreliable. Sometimes they all pass, and sometimes they get 4 failures. And I haven't figured out a minimal example that can reliably trigger it in 20 runs. As I start removing things, the failures become less and less frequent.

If #20 doesn't give us reliable CI, then we can switch the macOS build to Azure Pipelines or Travis CI.

But having a reliable CI is a must.

@certik
Member Author

certik commented Jan 21, 2020

I also reported it to the Rust community: https://users.rust-lang.org/t/github-actions-randomly-kill-a-test-program/37255

@certik
Member Author

certik commented Jan 28, 2020

I think #29 fixed this issue.

@certik certik closed this as completed Jan 28, 2020
matklad added a commit to matklad/xshell that referenced this issue Mar 13, 2022