main: handle SIGTERM, running `atexit` handlers #1802

wchargin · 2019-02-04T19:08:16Z

Summary:
Currently, TensorBoard does not change the default signal dispositions
for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit
without running its atexit handlers, such as the one registered by the
DB import mode. As of this commit, we handle SIGTERM by exiting
gracefully. We leave SIGQUIT at the default disposition (temporary files
will not be cleaned up), in accordance with the GNU libc guidelines:
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT

The implementation is not perfect. Ideally, we would perform our
graceful cleanup and then kill ourselves with the same signal to
properly inform our parent of the source of the exit: for details, see
https://www.cons.org/cracauer/sigint.html. But atexit doesn’t
provide a function like “please run all registered handlers now but
don’t actually quit”; we might be able to implement this in Python 2.7
using sys.exitfunc, but that’s deprecated in Python 2.7 and removed in
Python 3. If we want to do this right, we could implement our own
version of atexit (which would not be hard: the module is tiny). For
now, I’m comfortable with submitting this mostly-correct patch.

Supersedes part of #1795.

Test Plan:
Run bazel build //tensorboard and add the built binary to your PATH.
Then, run tensorboard --logdir whatever && echo after, wait for it to
print the “serving” message, and send SIGTERM via kill(1). Note that
TensorBoard prints a message and exits cleanly, and that “after”
is printed to the console.

Patch program.py to add a pre-existing signal handler and atexit
cleanup from the top of main:

diff --git a/tensorboard/program.py b/tensorboard/program.py
index da59b4d1..c07cb855 100644
--- a/tensorboard/program.py
+++ b/tensorboard/program.py
@@ -201,6 +201,14 @@ class TensorBoard(object):

     :rtype: int
     """
+    def fake_handler(signum, frame):
+      print("Handling some signals...")
+    assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL
+    def fake_cleanup():
+      print("Cleaning everything up...")
+    import atexit
+    atexit.register(fake_cleanup)
+
     self._install_signal_handler(signal.SIGTERM, "SIGTERM")
     if self.flags.inspect:
       logger.info('Not bringing up TensorBoard, but inspecting event files.')

Then, re-run the above steps, and note that the signal handler and
atexit handler are both executed prior to cleanup:

$ tensorboard --logdir whatever && echo after
TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit)
TensorBoard caught SIGTERM; exiting...
Handling some signals...
Cleaning everything up...
after

Ideally, after should not be printed; that it is is a consequence of
the fact that we don’t properly propagate the WIFSIGNALED flag as
described above.

wchargin-branch: handle-signals

nfelt · 2019-02-04T19:34:29Z

tensorboard/program.py

I'd prefer to have the caller do this by hand, since it's a simple enough check and that way it's clear at the callsite that it won't always run. That said, if you don't handle SIGQUIT then there's no need anyway.

Yep; I’ll remove SIGQUIT, which removes the need for this wart.

nfelt · 2019-02-04T19:37:51Z

tensorboard/program.py

I would leave out SIGQUIT. At least to me SIGQUIT is like "hard quit" and I would expect the program to exit as promptly as possible and not really handle the signal unless it's really necessary, and our atexits are mostly for tidiness.

See e.g. here where it suggests that SIGQUIT handling should actually avoid cleaning up temporary files (which is what our atexits do):
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

Interesting. Can do. The downside, of course, is that it produces a more
plausible vector by which .tensorboard-info files may become stale. But
I suppose that that’s okay; it just means that there’s more impetus to
add liveness checking where appropriate.

TIL; I guess I should hit Ctrl-Backslash less often. :-)

Heh. I usually at least give the program a chance to handle Ctrl+C first :)

Summary: Currently, TensorBoard does not change the default signal dispositions for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit without running its `atexit` handlers, such as the one registered by the DB import mode. As of this commit, we handle SIGTERM by exiting gracefully. We leave SIGQUIT at the default disposition (temporary files will not be cleaned up), in accordance with the GNU libc guidelines: <https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT> The implementation is not perfect. Ideally, we would perform our graceful cleanup and then kill ourselves with the same signal to properly inform our parent of the source of the exit: for details, see <https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t provide a function like “please run all registered handlers now but don’t actually quit”; we might be able to implement this in Python 2.7 using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in Python 3. If we want to do this right, we could implement our own version of `atexit` (which would not be hard: the module is tiny). For now, I’m comfortable with submitting this mostly-correct patch. Supersedes part of #1795. Test Plan: Run `bazel build //tensorboard` and add the built binary to your PATH. Then, run `tensorboard --logdir whatever && echo after`, wait for it to print the “serving” message, and send SIGTERM via `kill(1)`. Note that TensorBoard prints a message and exits cleanly, and that “`after`” is printed to the console. Patch `program.py` to add a pre-existing signal handler and `atexit` cleanup from the top of `main`: ```diff diff --git a/tensorboard/program.py b/tensorboard/program.py index da59b4d1..c07cb855 100644 --- a/tensorboard/program.py +++ b/tensorboard/program.py @@ -201,6 +201,14 @@ class TensorBoard(object): :rtype: int """ + def fake_handler(signum, frame): + print("Handling some signals...") + assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL + def fake_cleanup(): + print("Cleaning everything up...") + import atexit + atexit.register(fake_cleanup) + self._install_signal_handler(signal.SIGTERM, "SIGTERM") if self.flags.inspect: logger.info('Not bringing up TensorBoard, but inspecting event files.') ``` Then, re-run the above steps, and note that the signal handler and `atexit` handler are both executed prior to cleanup: ``` $ tensorboard --logdir whatever && echo after TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit) TensorBoard caught SIGTERM; exiting... Handling some signals... Cleaning everything up... after ``` Ideally, `after` should _not_ be printed; that it is is a consequence of the fact that we don’t properly propagate the WIFSIGNALED flag as described above. wchargin-branch: handle-signals

wchargin · 2019-02-04T20:02:02Z

Updated; PTAL at your convenience.

victusfate · 2020-03-10T16:17:51Z

heyo, not sure this is the best place to ask but I've got some /tmp/tmp<>.py files left after restarting an http + tensorflow prediction service, which build up over time. I tracked this back to mkdtemp and tempfile.NamedTemporaryFile, called in tensorflow from ast_to_object , and source_to_entity which may be leaving temp files around at process restart.

Do I need to add special handlers to allow tensorflow to cleanup any tempfiles it uses for logging (it sounds like that was the solution here)? I tried tf.autograph.set_verbosity(0) but the files are still created/abandoned. I'd prefer not to hard clear out /tmp/tmp*.py files on restart (as my service may not own all those tmp files in general)

Not sure how to set cleanup hooks in tensorflow source to gracefully handle receiving systemd kill signals. Currently looking for other cleanup signals which atexit does work with

Resolved with additional signal handling

def log_atexit(sig):
  print('received atexit signal',sig)
  sys.exit()
  
atexit.register(lambda: log_atexit(None))
signal.signal(signal.SIGTERM, lambda: log_atexit(signal.SIGTERM))
signal.signal(signal.SIGINT, lambda: log_atexit(signal.SIGINT))

wchargin · 2020-03-12T00:03:41Z

Hi @victusfate! Hmm; TensorBoard doesn’t write any *.py files, nor any
temp files except under /tmp/.tensorboard-info. I’m not familiar with
what temp files Autograph writes, how it cleans them up, or when it’s
safe for you to do so. The autograph code is part of the main
TensorFlow repository; could you maybe ask over there, instead?
https://github.com/tensorflow/tensorflow/issues

victusfate · 2020-03-13T14:07:19Z

apologies, I got it squared away with signal handlers to allow tf's atexits to clean up.
please disregard

wchargin · 2020-03-13T15:04:59Z

No problem; hope you figure it out. :-)

wchargin added the core:backend label Feb 4, 2019

wchargin requested a review from nfelt February 4, 2019 19:08

nfelt reviewed Feb 4, 2019

View reviewed changes

wchargin force-pushed the wchargin-handle-signals branch from 89c9cc5 to 84ee3fd Compare February 4, 2019 20:01

wchargin changed the title ~~main: handle SIGTERM and SIGQUIT, running atexits~~ main: handle SIGTERM, running atexit handlers Feb 4, 2019

nfelt approved these changes Feb 4, 2019

View reviewed changes

wchargin merged commit ebfd8e5 into master Feb 4, 2019

wchargin deleted the wchargin-handle-signals branch February 4, 2019 20:32

wchargin mentioned this pull request Feb 5, 2019

manager: add draft of TensorboardInfo writing #1795

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

main: handle SIGTERM, running `atexit` handlers #1802

main: handle SIGTERM, running `atexit` handlers #1802

Uh oh!

wchargin commented Feb 4, 2019 •

edited

Loading

Uh oh!

nfelt Feb 4, 2019

Uh oh!

wchargin Feb 4, 2019

Uh oh!

nfelt Feb 4, 2019

Uh oh!

wchargin Feb 4, 2019

Uh oh!

nfelt Feb 4, 2019

Uh oh!

wchargin commented Feb 4, 2019

Uh oh!

victusfate commented Mar 10, 2020 •

edited

Loading

Uh oh!

wchargin commented Mar 12, 2020 •

edited

Loading

Uh oh!

victusfate commented Mar 13, 2020

Uh oh!

wchargin commented Mar 13, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

main: handle SIGTERM, running atexit handlers #1802

main: handle SIGTERM, running atexit handlers #1802

Uh oh!

Conversation

wchargin commented Feb 4, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

nfelt Feb 4, 2019

Choose a reason for hiding this comment

Uh oh!

wchargin Feb 4, 2019

Choose a reason for hiding this comment

Uh oh!

nfelt Feb 4, 2019

Choose a reason for hiding this comment

Uh oh!

wchargin Feb 4, 2019

Choose a reason for hiding this comment

Uh oh!

nfelt Feb 4, 2019

Choose a reason for hiding this comment

Uh oh!

wchargin commented Feb 4, 2019

Uh oh!

victusfate commented Mar 10, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

wchargin commented Mar 12, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

victusfate commented Mar 13, 2020

Uh oh!

wchargin commented Mar 13, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

main: handle SIGTERM, running `atexit` handlers #1802

main: handle SIGTERM, running `atexit` handlers #1802

wchargin commented Feb 4, 2019 •

edited

Loading

victusfate commented Mar 10, 2020 •

edited

Loading

wchargin commented Mar 12, 2020 •

edited

Loading