The fpm binary gets interrupted at the CI (macOS) #16

certik · 2020-01-15T15:02:32Z

There is a bug at our CI that I haven't been able to figure out yet. Here is an example of it: https://github.com/fortran-lang/fpm/runs/390475601.

Here is what I know:

It only happens on macOS, never on Linux or Windows
Restarting the build typically fixes it (sometimes it fails 2x or 3x in a row, but eventually it always passes)
The cargo test runs in parallel by default, so I set -j1 to run in serial. That seemed to improve how often it fails (although I could be wrong on that). It still fails sometimes however, so the actual bug is still there.
The error is:

thread 'test_2' panicked at 'Unexpected failure.
code=<interrupted>
stderr=``````
command=`"/Users/runner/runners/2.163.1/work/fpm/fpm/target/x86_64-apple-darwin/debug/fpm" "build"`
code=<interrupted>
stdout=``````
stderr=``````
', /Users/runner/.cargo/registry/src/github.com-1ecc6299db9ec823/assert_cmd-0.10.2/src/assert.rs:148:17

This panic comes from the assert_cmd package, which just uses Rust's std::process::Command. It is raised when the command did not succeed and no exit code could be retrieved, which according to the documentation means the process was terminated by a signal (such as SIGKILL).
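To make the "no exit code means a signal" distinction concrete, here is a minimal sketch of how a test harness could report it. It assumes a Unix target (macOS or Linux) and uses only the standard library; the `Outcome` enum and the `classify` and `run_and_classify` helpers are hypothetical names for illustration, not part of fpm or assert_cmd.

```rust
use std::io;
use std::os::unix::process::ExitStatusExt; // Unix-only: provides `signal()`
use std::process::{Command, ExitStatus};

/// How a child process ended (hypothetical helper type for this sketch).
#[derive(Debug, PartialEq)]
enum Outcome {
    /// The process exited on its own with this code.
    Exited(i32),
    /// The process was terminated by this signal number (9 is SIGKILL).
    Signaled(i32),
    /// Neither an exit code nor a signal was reported.
    Unknown,
}

/// On Unix, `code()` is `None` when the process was terminated by a signal;
/// in that case `signal()` reports which one.
fn classify(status: ExitStatus) -> Outcome {
    if let Some(code) = status.code() {
        Outcome::Exited(code)
    } else if let Some(sig) = status.signal() {
        Outcome::Signaled(sig)
    } else {
        Outcome::Unknown
    }
}

/// Runs `program` with `args`, waits for it, and classifies how it ended.
fn run_and_classify(program: &str, args: &[&str]) -> io::Result<Outcome> {
    let status = Command::new(program).args(args).status()?;
    Ok(classify(status))
}
```

Under these assumptions, calling `run_and_classify` on the fpm binary with `&["build"]` would yield `Outcome::Signaled(9)` for a run that was killed with SIGKILL, instead of the opaque `code=<interrupted>` shown in the panic message.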


certik · 2020-01-15T15:31:20Z

And here it failed in master (https://github.com/fortran-lang/fpm/runs/391281163) after merging #15 that passed. So we have to figure this out.

milancurcic · 2020-01-15T15:35:41Z

Is there anybody with a Mac that can reproduce this locally?

certik · 2020-01-15T15:39:25Z

I tried it locally on macOS yesterday, and I can't reproduce it. So we have to debug it on the CI itself (as I have done in #14), which is quite time-consuming.

certik · 2020-01-16T06:48:29Z

As reported in #18 (comment), the process gets interrupted with signal 9 (SIGKILL). I suspected that above, but it's nice to have proof. For some reason macOS kills the fpm process.

certik · 2020-01-16T07:03:41Z

This might be a bug at GitHub, so I reported it at https://github.community/t5/GitHub-Actions/GitHub-Actions-on-macOS-randomly-kill-my-test-program/m-p/43526#M5414.

certik · 2020-01-16T18:28:12Z

I am trying to reliably reproduce this bug in certik#1.

certik · 2020-01-17T21:44:30Z

One approach to fix this is to do cargo test || cargo test. An example where the first one fails, but the second one succeeds is https://github.com/certik/fpm/runs/396031403.
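The shell idiom `cargo test || cargo test` runs the second command only if the first one fails. As a rough illustration of the same retry logic written in Rust, here is a minimal sketch using only the standard library; `run_with_retry` is a hypothetical helper and is not how the workaround was actually implemented in the CI configuration.

```rust
use std::io;
use std::process::Command;

/// Runs `program` with `args` up to `max_attempts` times, stopping at the
/// first successful exit. Returns the 1-based number of the attempt that
/// succeeded, or `None` if every attempt failed.
fn run_with_retry(
    program: &str,
    args: &[&str],
    max_attempts: u32,
) -> io::Result<Option<u32>> {
    for attempt in 1..=max_attempts {
        // `status()` waits for the child to finish and inherits stdout/stderr.
        let status = Command::new(program).args(args).status()?;
        if status.success() {
            return Ok(Some(attempt));
        }
        eprintln!("attempt {} of {} failed: {}", attempt, max_attempts, status);
    }
    Ok(None)
}
```

With this helper, `run_with_retry("cargo", &["test"], 2)` would behave like `cargo test || cargo test`. Note that a retry only hides the intermittent kill; it does not explain why the process receives the signal.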

certik · 2020-01-17T21:56:03Z

This is precisely the kind of issue I am a bit worried about with Rust: that the toolchain is not mature enough and will produce binaries that do not work (but only sometimes!). The closest error that I was able to find is:

golang/go#19734

milancurcic · 2020-01-17T22:05:49Z

But this likely has more to do with GitHub Actions than Rust itself. Neither you nor the person from the GitHub Forum could reproduce it locally, correct?

I agree it's very important to have fpm work on all 3 major platforms, but I wonder if it's a bit too early to focus on making macOS work 100% of the time? Why not let it fail sometimes? We can manually trigger the build when it does fail.


certik · 2020-01-18T00:20:48Z

I implemented a workaround in #20. The problem with GitHub Actions is that you cannot restart just a single build; you have to restart all the builds. The other issue is that it was already failing a lot with just 3 tests. Once we have hundreds of tests, it might end up failing all the time. For now, the workaround implemented in #20 should make the CI robust enough. But we'll have to get back to this and fix it.

The reason I spent so much time on this is that I thought I could use "bisection" to create a minimal example that fails, and figure it out. But I failed. Essentially, triggering the bug by submitting 20 commits is extremely unreliable. Sometimes they all pass, and sometimes they get 4 failures. And I haven't figured out a minimal example that can reliably trigger it in 20 runs. As I start removing things, the failures become less and less frequent.

If #20 doesn't give us reliable CI, then we can switch the macOS build to Azure pipelines or Travis-CI.

But having a reliable CI is a must.

certik · 2020-01-21T18:31:06Z

I reported it in the Rust community also: https://users.rust-lang.org/t/github-actions-randomly-kill-a-test-program/37255


certik · 2020-01-28T05:35:21Z

I think #29 fixed this issue.

So our mac builder is getting killed with sigkill (9). This is super mysterious:
* rust-lang/cargo#7821
* fortran-lang/fpm#16
* https://github.community/t/github-actions-on-macos-randomly-kill-my-test-program/17387

certik mentioned this issue Jan 16, 2020

Better reporting on terminated processes #18

Merged

certik closed this as completed in #18 Jan 16, 2020

certik reopened this Jan 16, 2020

certik added a commit to certik/fpm that referenced this issue Jan 17, 2020

macOS: repeat the test up to 2x if it fails

a601a6d

This is a workaround for fortran-lang#16. It turns out that if the executable fails to run, just rerunning all tests typically fixes it.

certik mentioned this issue Jan 17, 2020

macOS: repeat the test up to 2x if it fails #20

Merged

certik added a commit to certik/fpm that referenced this issue Jan 28, 2020

Use Command::new() instead of Command::cargo_new()

d8dcf8b

This seems to actually fix the issue fortran-lang#16 as documented at: https://users.rust-lang.org/t/github-actions-randomly-kill-a-test-program/37255/6

certik added a commit to certik/fpm that referenced this issue Jan 28, 2020

Remove the workaround for fortran-lang#16

b8f2216

certik mentioned this issue Jan 28, 2020

Use Command::new() instead of Command::cargo_new() #29

Merged

certik closed this as completed Jan 28, 2020
