Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use a more intricate timeout strategy upon ctrl-c to address pantsd interactivity issues #7014

Closed
3 tasks
cosmicexplorer opened this issue Jan 1, 2019 · 0 comments · Fixed by #6574
Closed
3 tasks

Comments

@cosmicexplorer
Copy link
Contributor

cosmicexplorer commented Jan 1, 2019

In #6574 we introduce a GracefulSignalHandler to decouple signal handling logic from the rest of pants execution to great effect. As noted in a comment in that PR, when we receive a ctrl-c from the terminal in remote_pants_runner.py, we fire off a SIGINT to the pantsd-runner process running in daemon_pants_runner.py and then wait for it to exit. This directly leads to interactivity issues, as it is not always clear to the user on the terminal why the pantsd-runner process is taking so long to close, and sometimes it doesn't close, requiring the user to manually send a SIGKILL to get their terminal session back. I think this can be effectively resolved by doing the following:

  • have some extremely simple retry/backoff logic for sending a SIGINT to the pantsd-runner process, escalating to a SIGTERM, then a SIGKILL, and waiting for a set timeout on the pid.
  • after sending a SIGKILL and waiting a bit, don't continue to wait for the pantsd-runner process to exit, because there's clearly some issue, so invoke the currently commented-out super(PailgunClientSignalHandler, self).handle_sigint(signum, frame) call from the client.
  • if sending SIGINT and SIGTERM failed to cause the pantsd-runner process to exit, print an error message to the terminal -- if SIGKILL fails (if waiting on the pantsd-runner pid to exit continues to time out after sending the signal), print a message with the pantsd-runner's pid/pgrp so the user can send signals or do any analysis themselves (this may be useful for logging in CI runs as well).

We would almost definitely need to address the TODOs regarding pantsd testing in #7008 before we can merge a PR fixing this.

cosmicexplorer added a commit that referenced this issue Mar 22, 2019
### Problem

#6552 has some changes around signal handling. This was an attempt to move those out, and to gain greater visibility into and control over when pants errors out due to a signal (helping to address difficulties in flaky tests around exiting, such as #6708).

Resolves #7014, resolves #6708, resolves #6847, resolves #7199.

### Solution

- Create a very simple `SignalHandler` class which can be subclassed to set pants's behavior upon receiving different signals in different execution contexts.
- Add a `_signal_handler` field to `ExceptionSink`.
- Apply `SignalHandler` everywhere that tries to modify pants's exception handling.
- Add the `--pantsd-pailgun-quit-timeout` bootstrap option and implement a timeout strategy to continue to forward output from the remote `pantsd-runner` process for a brief period of time upon receiving a signal, then killing the remote process if it's still alive at the end.
- Unskip some flaky tests that we hopefully won't have to reskip.

### Result

It's much more clear when and where signal handlers are set, and the flow of control is made more clear. It is easier to extend signal handlers in a hygienic way, allowing for the resolution of multiple flaky tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant