Test rules are unable to handle signals #7119

clintharrison · 2019-01-14T21:36:25Z

Description of the problem / feature request:

Tests are currently unable to respond to signals unless --experimental_split_xml_generation is used.
To ensure all test.xml logs are written, test-setup.sh traps all signals to write this file with the contents of stdout, exit code, etc. This is similar to #6338 in that we do not properly propagate signals to the child test process.
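The propagation problem can be illustrated with a minimal sketch of a wrapper that forwards SIGTERM to its child and then collects the child's real exit status. This is not the actual test-setup.sh; the function name `run_with_forwarding` and the variable `forwarded_signal` are hypothetical and exist only for this example.

```shell
#!/bin/bash
# Hypothetical wrapper sketch; NOT the real test-setup.sh.

# Run the given command in the background and forward SIGTERM to it.
run_with_forwarding() {
    local child status
    forwarded_signal=0

    "$@" &
    child=$!

    # The child's PID is expanded now, so the handler does not depend on scope.
    trap "forwarded_signal=1; kill -TERM $child 2>/dev/null || true" TERM

    # `wait` returns early (status > 128) when a trapped signal arrives.
    status=0
    wait "$child" || status=$?

    # If we forwarded a signal, the child may still be running its own
    # cleanup, so wait again to pick up its real exit status.
    if [ "$forwarded_signal" -eq 1 ]; then
        status=0
        wait "$child" || status=$?
    fi

    trap - TERM
    return "$status"
}
```

The key design choice is running the child in the background and blocking in `wait`: bash runs a trap handler promptly while it sits in `wait`, whereas it defers the handler until a foreground command has finished.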

Feature requests: what underlying problem are you trying to solve with this feature?

I have a test rule that starts some services prior to running tests. These services run in Docker containers under Docker for Mac, which actually runs them in a separate Linux VM. As a result, the containers need to be explicitly shut down, which is only possible if the test can handle the termination signal.

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Here's a simple test rule to repro:

TEST_SCRIPT = """#!/bin/bash
echo "Running tests"

function cleanup() {
    echo "Cleaning up"
    exit 1
}

trap "cleanup" SIGTERM

sleep 10
exit 1
"""

def impl(ctx):
    ctx.actions.write(output = ctx.outputs.executable, content = TEST_SCRIPT)

my_test = rule(implementation = impl, test = True)

When run as bazel test //:test --test_output=streamed --test_timeout=1 --local_sigkill_grace_seconds=3, the cleanup handler never runs; with --experimental_split_xml_generation it does 🙂
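As a side note on the test script itself: bash defers a trap handler until the current foreground command finishes, so a script that sits in a plain `sleep 10` only reacts to SIGTERM once the sleep has ended or been killed as well. A hedged variant of the repro body that reacts as soon as the shell itself is signalled could look like the sketch below; the function name `run_test_body` is hypothetical and this is not part of the original repro.

```shell
#!/bin/bash
# Hypothetical variant of the repro script's body, for illustration only.
run_test_body() {
    echo "Running tests"

    # Run the long-running step in the background so the shell can
    # service the trap while it blocks in `wait`.
    sleep 10 &
    local sleeper=$!

    # On SIGTERM: report, stop the background step, and fail the test.
    trap "echo 'Cleaning up'; kill $sleeper 2>/dev/null; exit 1" TERM

    wait "$sleeper" || true
    exit 1
}
```

Because the function calls `exit`, it is meant to be the last thing a test script runs (or to be run in a subshell).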

What operating system are you running Bazel on?

macOS 10.14

What's the output of `bazel info release`?

Build target: bazel-out/darwin-opt/bin/src/main/java/com/google/devtools/build/lib/bazel/BazelServer_deploy.jar
Build time: Wed Dec 19 12:57:09 2018 (1545224229)
Build timestamp: 1545224229
Build timestamp as int: 1545224229

Any other information, logs, or outputs that you want to share?

I asked about this in #general on the Bazel Slack, and it was suggested that I file this issue, as I'm somewhat relying on an implementation detail :)

jmmv · 2019-01-18T18:25:11Z

I agree that a test should be able to clean up after itself if it abruptly terminates, though there is no way to guarantee that your test will have a chance to do that from within itself.

(Imagine: if the test is stuck and Bazel sends a SIGKILL to it, the cleanup will not work. Mind you, that's why I added separate cleanup routines invoked by the runtime engine to atf.) But better grant the test some chances than none.
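The difference between the two signals can be shown with a small sketch: a child that installs a SIGTERM handler gets to record its cleanup when it receives SIGTERM, but never when it receives SIGKILL, which cannot be trapped. The helper name `signal_child` and the marker-file convention are assumptions made for this example only.

```shell
#!/bin/bash
# Hypothetical helper: start a child that writes "cleaned" to the marker
# file ($2) from its SIGTERM handler, then send it the signal named in $1.
signal_child() {
    local sig=$1 marker=$2 child
    bash -c 'sleep 5 >/dev/null 2>&1 & s=$!
             trap "kill $s; echo cleaned > \"$0\"; exit 0" TERM
             wait $s' "$marker" &
    child=$!
    sleep 1                      # give the child time to install its trap
    kill "-$sig" "$child"
    wait "$child" 2>/dev/null || true
}
```

Calling `signal_child TERM /tmp/marker` leaves `cleaned` in the marker file, while `signal_child KILL /tmp/marker` leaves the file untouched because the handler never runs.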

Looks like @ulfjack implemented the --experimental_split_xml_generation feature in 0858ae1. Ulf, what was the plan there? What would it take to take this feature out of experimental, enable it by default, and remove the test.xml generation from the setup script?

ittaiz · 2019-01-18T19:21:43Z

Julio, that commit says: “At this time, this is only implemented for the StandaloneTestStrategy.” Has this changed? Is it not relevant in non-standalone strategies?

jmmv · 2019-01-18T19:25:54Z

I saw that... hence why I'm wondering what the plan is. Can this be implemented for other strategies? Is anyone working on it? Does it even make sense to implement it in those cases?

ulfjack · 2019-01-21T10:04:35Z

Bazel only has a single test strategy, so the work applies to all Bazel users. The primary reason behind the flag is to solve a race condition with the generation of the test.xml files (see #4608).

Blaze currently has two test strategies, and I have been working over the past year or so on merging them into one as well as matching Bazel's implementation. The flag is technically a step in the other direction, but I don't think it's a problem.

The race condition is not a problem for Blaze, because missing test.xml files aren't a problem for Google's CI system, which has dedicated Blaze support - unlike Bazel, where most CI systems don't have dedicated support / rely on the presence of the test.xml files. That said, there's also no reason for Blaze to do this differently, even if it's not strictly required.

The plan is to enable the flag by default for Bazel, and I have been fixing (newly-discovered) issues over the past two weeks. I really want to close this out, so I hope this will land in the next few days.

ulfjack · 2019-01-25T15:15:36Z

I'm afraid it looks like I have to break this behavior, because it breaks signal forwarding in other cases. No, I don't completely understand it myself, yet.

clintharrison · 2019-01-26T00:15:20Z

@ulfjack which behavior do you mean by "this behavior"? Is there an issue I can follow with the "other cases" detailed?

ulfjack · 2019-01-28T18:31:12Z

I have a pending change (https://bazel-review.googlesource.com/c/bazel/+/88094) that adds a trap to the test-setup.sh.

I am not sure which version of test-setup.sh you were testing with above (the original code has a trap, then I removed the trap in the case of experimental_split_xml_generation, and now I'm adding it back in), but in my testing, adding the trap is necessary for the test subprocess to receive a signal at all. If test-setup.sh doesn't have a trap, then the shell completely ignores the SIGTERM, and it's also not forwarded to the subprocess.

Now, it's possible that the subprocess is also a shell, and that one's eating the signal, but that wouldn't be affected by the flag.

ulfjack · 2019-01-28T18:31:44Z

It's also possible that there are differences in shell behavior based on platform. The behavior of trap that I saw seems to be completely undocumented.

If we don't set a trap here, then bash ignores the signal, and the test process also does not receive the signal, so the test runner has no chance of writing a test.xml output. However, the behavior of trap forwarding the signal to the subprocess is not at all documented in the bash documentation, and also inconsistent with the behavior reported in #7119.

There is a similar problem in the Java stub template reported in #6338.

This may or may not be progress on #4608.

PiperOrigin-RevId: 232035930

ulfjack · 2019-03-12T15:07:25Z

I tried this on my Mac today with bazel 0.23.2 and it seems to work:

$ bazel test //foo:test --test_output=streamed --test_timeout=1 --local_sigkill_grace_seconds=3
...
Running tests
Terminated: 15
Cleaning up
-- Test timed out at 2019-03-12 15:05:04 UTC --

TIMEOUT: //foo:test (Summary)
      bazel-out/darwin-fastbuild/testlogs/foo/test/test.log
...

Let me know if you're seeing otherwise.

alexshtin · 2019-05-14T00:11:17Z

It doesn't seem to respect the --local_sigkill_grace_seconds parameter, though. Bazel sends SIGTERM and then sends SIGKILL about 1 second later.
Also, what about Ctrl+C? It seems that Bazel sends SIGKILL to the test right away.
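For reference, the sequence one would expect from a grace period is SIGTERM first, then SIGKILL only if the process is still alive after the configured number of seconds. The sketch below shows that escalation in plain bash; it is not Bazel's implementation, and the helper name `terminate_with_grace` and its return convention are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: send SIGTERM to PID $1, then SIGKILL if the
# process is still alive after $2 seconds.
# Returns 0 if the process went away on its own, 1 if SIGKILL was needed.
terminate_with_grace() {
    local pid=$1 grace=$2 waited=0

    # Nothing to do if the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0

    # Poll once per second until the process exits or the grace period ends.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$grace" ]; then
            kill -KILL "$pid" 2>/dev/null || true
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}
```

A process that ignores SIGTERM is killed only after the full grace period, while one that exits on SIGTERM is never sent SIGKILL.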

jin added the untriaged and team-Local-Exec labels Jan 14, 2019

jmmv added the P2 label and removed the untriaged label Jan 18, 2019

jmmv assigned ulfjack Feb 11, 2019

jmmv added the type: bug label Feb 11, 2019

ulfjack closed this as completed Mar 12, 2019

dejan-lokar mentioned this issue Nov 19, 2021

Test rule is SIGKILL-ed after timeout instead of waiting for the termination grace period #14298

Open
