main: handle SIGTERM, running atexit handlers
#1802
tensorboard/program.py
I'd prefer to have the caller do this by hand, since it's a simple enough check and that way it's clear at the callsite that it won't always run. That said, if you don't handle SIGQUIT then there's no need anyway.
Yep; I’ll remove SIGQUIT, which removes the need for this wart.
tensorboard/program.py
I would leave out SIGQUIT. At least to me SIGQUIT is like "hard quit" and I would expect the program to exit as promptly as possible and not really handle the signal unless it's really necessary, and our atexits are mostly for tidiness.
See e.g. here where it suggests that SIGQUIT handling should actually avoid cleaning up temporary files (which is what our atexits do):
https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
Interesting. Can do. The downside, of course, is that it produces a more
plausible vector by which .tensorboard-info files may become stale. But
I suppose that that’s okay; it just means that there’s more impetus to
add liveness checking where appropriate.
TIL; I guess I should hit Ctrl-Backslash less often. :-)
Heh. I usually at least give the program a chance to handle Ctrl+C first :)
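On the liveness checking mentioned in this thread: a minimal sketch of one way to detect a stale info file, assuming (this is an assumption; the file format is not described here) that each `.tensorboard-info` file records the PID of the process that wrote it. The helper name `pid_is_alive` is hypothetical, and the check is POSIX-only.

```python
import os


def pid_is_alive(pid):
    """Best-effort check that some process with this PID exists (POSIX only).

    Hypothetical helper for illustration. PIDs can be reused, so a True
    result does not prove that the process is the one that wrote the file.
    """
    try:
        # Signal 0 performs error checking only; no signal is delivered.
        os.kill(pid, 0)
    except ProcessLookupError:
        # No such process: the recorded PID is stale.
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

A reader of the info files could skip (or delete) entries whose recorded PID fails this check. Because of PID reuse, pairing the PID with a start time would make the check more robust.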
Summary:
Currently, TensorBoard does not change the default signal dispositions for any signals, so if killed by SIGTERM or SIGQUIT it will exit without running its `atexit` handlers, such as the one registered by the DB import mode. As of this commit, we handle SIGTERM by exiting gracefully. We leave SIGQUIT at the default disposition (temporary files will not be cleaned up), in accordance with the GNU libc guidelines:
<https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT>

The implementation is not perfect. Ideally, we would perform our graceful cleanup and then kill ourselves with the same signal to properly inform our parent of the source of the exit: for details, see <https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t provide a function like “please run all registered handlers now but don’t actually quit”; we might be able to implement this in Python 2.7 using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in Python 3. If we want to do this right, we could implement our own version of `atexit` (which would not be hard: the module is tiny). For now, I’m comfortable with submitting this mostly-correct patch.

Supersedes part of #1795.

Test Plan:
Run `bazel build //tensorboard` and add the built binary to your PATH. Then, run `tensorboard --logdir whatever && echo after`, wait for it to print the “serving” message, and send SIGTERM via `kill(1)`. Note that TensorBoard prints a message and exits cleanly, and that “`after`” is printed to the console.
Patch `program.py` to add a pre-existing signal handler and `atexit` cleanup from the top of `main`:

```diff
diff --git a/tensorboard/program.py b/tensorboard/program.py
index da59b4d1..c07cb855 100644
--- a/tensorboard/program.py
+++ b/tensorboard/program.py
@@ -201,6 +201,14 @@ class TensorBoard(object):
     :rtype: int
     """
+    def fake_handler(signum, frame):
+      print("Handling some signals...")
+    assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL
+    def fake_cleanup():
+      print("Cleaning everything up...")
+    import atexit
+    atexit.register(fake_cleanup)
+
     self._install_signal_handler(signal.SIGTERM, "SIGTERM")
     if self.flags.inspect:
       logger.info('Not bringing up TensorBoard, but inspecting event files.')
```

Then, re-run the above steps, and note that the signal handler and `atexit` handler are both executed prior to cleanup:

```
$ tensorboard --logdir whatever && echo after
TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit)
TensorBoard caught SIGTERM; exiting...
Handling some signals...
Cleaning everything up...
after
```

Ideally, `after` should _not_ be printed; that it is is a consequence of the fact that we don’t properly propagate the WIFSIGNALED flag as described above.

wchargin-branch: handle-signals
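For reference, the "ideal" behavior described in the summary (clean up, then die from the same signal so that the parent sees WIFSIGNALED) might look roughly like the sketch below. This is not what this patch does. It assumes a small hand-rolled registry in place of `atexit`, and the names `register_cleanup`, `run_cleanups`, and `terminate_with_signal` are all hypothetical.

```python
import atexit
import os
import signal
import traceback

# Hypothetical stand-in for atexit: callbacks that we can run on demand
# without exiting the interpreter.
_cleanup_callbacks = []


def register_cleanup(callback):
    """Register `callback` to run before the process terminates."""
    _cleanup_callbacks.append(callback)
    return callback


def run_cleanups():
    """Run registered callbacks once, last registered first (like atexit)."""
    while _cleanup_callbacks:
        callback = _cleanup_callbacks.pop()
        try:
            callback()
        except Exception:
            # Keep going so that one failing callback does not block the rest.
            traceback.print_exc()


def terminate_with_signal(signal_number, frame):
    """Signal handler: clean up, then die from the same signal."""
    run_cleanups()
    # Restore the default disposition and re-send the signal to ourselves,
    # so that the parent observes death-by-signal, not a normal exit.
    signal.signal(signal_number, signal.SIG_DFL)
    os.kill(os.getpid(), signal_number)


# Normal interpreter exits still run the same callbacks.
atexit.register(run_cleanups)
```

Installing it would be `signal.signal(signal.SIGTERM, terminate_with_signal)`. With this in place, `tensorboard --logdir whatever && echo after` would no longer be expected to print `after`, because the shell would see a process killed by SIGTERM.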
89c9cc5 to
84ee3fd
Updated; PTAL at your convenience.

heyo, not sure this is the best place to ask, but I've got some /tmp/tmp<>.py files left over after restarting an HTTP + TensorFlow prediction service, and they build up over time. I tracked this back to mkdtemp and tempfile.NamedTemporaryFile, called in TensorFlow from ast_to_object and source_to_entity, which may be leaving temp files around at process restart. Do I need to add special handlers to allow TensorFlow to clean up any temp files it uses for logging (it sounds like that was the solution here)? I tried tf.autograph.set_verbosity(0), but the files are still created and abandoned. I'd prefer not to hard-clear /tmp/tmp*.py files on restart, as my service may not own all of those files in general. I'm not sure how to set cleanup hooks in the TensorFlow source to gracefully handle systemd kill signals. Currently looking for other cleanup signals that atexit does work with.
Resolved with additional signal handling.

Hi @victusfate! Hmm; TensorBoard doesn’t write any

apologies, I got it squared away with signal handlers to allow tf's atexits to clean up.

No problem; hope you figure it out. :-)
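
For anyone landing here with the same problem, the general shape of that fix is to turn SIGTERM into a normal interpreter exit, so that `atexit` handlers (including ones registered by libraries) get a chance to run. A minimal sketch, assuming Python 3 on a POSIX system; `install_exit_on_signal` is a hypothetical name, and this is not necessarily how TensorBoard's own `_install_signal_handler` is written.

```python
import signal
import sys


def install_exit_on_signal(signal_number, signal_name):
    """Exit via sys.exit on `signal_number` so that atexit handlers run.

    Hypothetical helper for illustration. Must be called from the main
    thread, because signal.signal only works there.
    """
    old_handler = None  # assigned below; the closure reads it at call time

    def handler(handled_signal_number, frame):
        sys.stderr.write("Caught %s; exiting...\n" % signal_name)
        # Chain to any handler that was installed before ours. The default
        # and "ignore" dispositions are not callable, so they are skipped.
        if callable(old_handler):
            old_handler(handled_signal_number, frame)
        # Raises SystemExit; atexit handlers run during interpreter shutdown.
        sys.exit(0)

    old_handler = signal.signal(signal_number, handler)
    return old_handler
```

Calling `install_exit_on_signal(signal.SIGTERM, "SIGTERM")` early in the service's startup is enough for a plain `systemctl stop`, which sends SIGTERM by default. SIGKILL cannot be caught, so files can still be left behind if the service is killed that way.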