Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Feb 4, 2019

Summary:
Currently, TensorBoard does not change the default signal dispositions
for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit
without running its atexit handlers, such as the one registered by the
DB import mode. As of this commit, we handle SIGTERM by exiting
gracefully. We leave SIGQUIT at the default disposition (temporary files
will not be cleaned up), in accordance with the GNU libc guidelines:
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT

The implementation is not perfect. Ideally, we would perform our
graceful cleanup and then kill ourselves with the same signal to
properly inform our parent of the source of the exit: for details, see
https://www.cons.org/cracauer/sigint.html. But atexit doesn’t
provide a function like “please run all registered handlers now but
don’t actually quit”; we might be able to implement this in Python 2.7
using sys.exitfunc, but that’s deprecated in Python 2.7 and removed in
Python 3. If we want to do this right, we could implement our own
version of atexit (which would not be hard: the module is tiny). For
now, I’m comfortable with submitting this mostly-correct patch.

Supersedes part of #1795.

Test Plan:
Run bazel build //tensorboard and add the built binary to your PATH.
Then, run tensorboard --logdir whatever && echo after, wait for it to
print the “serving” message, and send SIGTERM via kill(1). Note that
TensorBoard prints a message and exits cleanly, and that “after
is printed to the console.

Patch program.py to add a pre-existing signal handler and atexit
cleanup from the top of main:

diff --git a/tensorboard/program.py b/tensorboard/program.py
index da59b4d1..c07cb855 100644
--- a/tensorboard/program.py
+++ b/tensorboard/program.py
@@ -201,6 +201,14 @@ class TensorBoard(object):

     :rtype: int
     """
+    def fake_handler(signum, frame):
+      print("Handling some signals...")
+    assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL
+    def fake_cleanup():
+      print("Cleaning everything up...")
+    import atexit
+    atexit.register(fake_cleanup)
+
     self._install_signal_handler(signal.SIGTERM, "SIGTERM")
     if self.flags.inspect:
       logger.info('Not bringing up TensorBoard, but inspecting event files.')

Then, re-run the above steps, and note that the signal handler and
atexit handler are both executed prior to cleanup:

$ tensorboard --logdir whatever && echo after
TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit)
TensorBoard caught SIGTERM; exiting...
Handling some signals...
Cleaning everything up...
after

Ideally, after should not be printed; that it is is a consequence of
the fact that we don’t properly propagate the WIFSIGNALED flag as
described above.

wchargin-branch: handle-signals

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to have the caller do this by hand, since it's a simple enough check and that way it's clear at the callsite that it won't always run. That said, if you don't handle SIGQUIT then there's no need anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep; I’ll remove SIGQUIT, which removes the need for this wart.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave out SIGQUIT. At least to me SIGQUIT is like "hard quit" and I would expect the program to exit as promptly as possible and not really handle the signal unless it's really necessary, and our atexits are mostly for tidiness.

See e.g. here where it suggests that SIGQUIT handling should actually avoid cleaning up temporary files (which is what our atexits do):
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Can do. The downside, of course, is that it produces a more
plausible vector by which .tensorboard-info files may become stale. But
I suppose that that’s okay; it just means that there’s more impetus to
add liveness checking where appropriate.

TIL; I guess I should hit Ctrl-Backslash less often. :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh. I usually at least give the program a chance to handle Ctrl+C first :)

Summary:
Currently, TensorBoard does not change the default signal dispositions
for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit
without running its `atexit` handlers, such as the one registered by the
DB import mode. As of this commit, we handle SIGTERM by exiting
gracefully. We leave SIGQUIT at the default disposition (temporary files
will not be cleaned up), in accordance with the GNU libc guidelines:
<https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT>

The implementation is not perfect. Ideally, we would perform our
graceful cleanup and then kill ourselves with the same signal to
properly inform our parent of the source of the exit: for details, see
<https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t
provide a function like “please run all registered handlers now but
don’t actually quit”; we might be able to implement this in Python 2.7
using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in
Python 3. If we want to do this right, we could implement our own
version of `atexit` (which would not be hard: the module is tiny). For
now, I’m comfortable with submitting this mostly-correct patch.

Supersedes part of #1795.

Test Plan:
Run `bazel build //tensorboard` and add the built binary to your PATH.
Then, run `tensorboard --logdir whatever && echo after`, wait for it to
print the “serving” message, and send SIGTERM via `kill(1)`. Note that
TensorBoard prints a message and exits cleanly, and that “`after`”
is printed to the console.

Patch `program.py` to add a pre-existing signal handler and `atexit`
cleanup from the top of `main`:

```diff
diff --git a/tensorboard/program.py b/tensorboard/program.py
index da59b4d1..c07cb855 100644
--- a/tensorboard/program.py
+++ b/tensorboard/program.py
@@ -201,6 +201,14 @@ class TensorBoard(object):

     :rtype: int
     """
+    def fake_handler(signum, frame):
+      print("Handling some signals...")
+    assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL
+    def fake_cleanup():
+      print("Cleaning everything up...")
+    import atexit
+    atexit.register(fake_cleanup)
+
     self._install_signal_handler(signal.SIGTERM, "SIGTERM")
     if self.flags.inspect:
       logger.info('Not bringing up TensorBoard, but inspecting event files.')
```

Then, re-run the above steps, and note that the signal handler and
`atexit` handler are both executed prior to cleanup:

```
$ tensorboard --logdir whatever && echo after
TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit)
TensorBoard caught SIGTERM; exiting...
Handling some signals...
Cleaning everything up...
after
```

Ideally, `after` should _not_ be printed; that it is is a consequence of
the fact that we don’t properly propagate the WIFSIGNALED flag as
described above.

wchargin-branch: handle-signals
@wchargin wchargin force-pushed the wchargin-handle-signals branch from 89c9cc5 to 84ee3fd Compare February 4, 2019 20:01
@wchargin wchargin changed the title main: handle SIGTERM and SIGQUIT, running atexits main: handle SIGTERM, running atexit handlers Feb 4, 2019
@wchargin
Copy link
Contributor Author

wchargin commented Feb 4, 2019

Updated; PTAL at your convenience.

@wchargin wchargin merged commit ebfd8e5 into master Feb 4, 2019
@wchargin wchargin deleted the wchargin-handle-signals branch February 4, 2019 20:32
@victusfate
Copy link

victusfate commented Mar 10, 2020

heyo, not sure this is the best place to ask but I've got some /tmp/tmp<>.py files left after restarting an http + tensorflow prediction service, which build up over time. I tracked this back to mkdtemp and tempfile.NamedTemporaryFile, called in tensorflow from ast_to_object , and source_to_entity which may be leaving temp files around at process restart.

Do I need to add special handlers to allow tensorflow to cleanup any tempfiles it uses for logging (it sounds like that was the solution here)? I tried tf.autograph.set_verbosity(0) but the files are still created/abandoned. I'd prefer not to hard clear out /tmp/tmp*.py files on restart (as my service may not own all those tmp files in general)

Not sure how to set cleanup hooks in tensorflow source to gracefully handle receiving systemd kill signals. Currently looking for other cleanup signals which atexit does work with

Resolved with additional signal handling

def log_atexit(sig):
  print('received atexit signal',sig)
  sys.exit()
  
atexit.register(lambda: log_atexit(None))
signal.signal(signal.SIGTERM, lambda: log_atexit(signal.SIGTERM))
signal.signal(signal.SIGINT, lambda: log_atexit(signal.SIGINT))

@wchargin
Copy link
Contributor Author

wchargin commented Mar 12, 2020

Hi @victusfate! Hmm; TensorBoard doesn’t write any *.py files, nor any
temp files except under /tmp/.tensorboard-info. I’m not familiar with
what temp files Autograph writes, how it cleans them up, or when it’s
safe for you to do so. The autograph code is part of the main
TensorFlow repository; could you maybe ask over there, instead?
https://github.com/tensorflow/tensorflow/issues

@victusfate
Copy link

apologies, I got it squared away with signal handlers to allow tf's atexits to clean up.
please disregard

@wchargin
Copy link
Contributor Author

No problem; hope you figure it out. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants