manager: add draft of TensorboardInfo writing #1795

wchargin · 2019-01-31T20:57:04Z

Summary:
As of this commit, TensorBoard instances will write “info files” to
/tmp/.tensorboard-info/pid-${PID}.info (where /tmp is a
platform-specific tempdir). The info files include the port number on
which the server is running and a description of the server’s
configuration, so that clients can check whether a suitable TensorBoard
instance exists (and, if so, on which port) before starting a new one.

A typical info file looks like this:

$ cat /tmp/.tensorboard-info/pid-92239.info
{
    "cache_key": "<long base64 string; omitted>",
    "db": "",
    "logdir": "/usr/local/google/home/wchargin/tensorboard_data/",
    "path_prefix": "",
    "pid": 92239,
    "port": 7012,
    "start_time": 1549064188,
    "version": "1.13.0a0"
}

Test Plan:
Launch two TensorBoard instances. Then, comment out the tb_logging
import in tensorboard/manager.py, and open a Python shell and run

from tensorboard import manager
infos = manager.get_all()

Note that len(infos) == 2 and the info objects are correct. Then, send
SIGINT to one of the TensorBoards and SIGTERM to the other, and note
that each of them has its info file removed and that get_all() returns
an empty list.

wchargin-branch: write-tensorboardinfo

tensorboard/manager.py

tensorboard/program.py

tensorboard/manager_test.py

wchargin

Thanks for the review! Will revise and resubmit.

tensorboard/manager.py

tensorboard/manager_test.py

tensorboard/program.py

wchargin · 2019-02-01T22:03:40Z

I was planning to convert the TensorboardInfo namedtuple to an actual
proto, so that the proto libraries would take care of validation and
date encoding for me. I got it working, but the Python bindings for
protobuf are so frustrating and unidiomatic that I’d rather handle it
myself. Just for posterity, the proto appears below.

Details

/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

syntax = "proto3";

package tensorboard;

import "google/protobuf/timestamp.proto";

message TensorboardInfo {
  // Value of `tensorboard.version.VERSION` from which this info was
  // created.
  string version = 1;

  string cache_key = 2;

  // Time at which the TensorBoard server started serving.
  google.protobuf.Timestamp start_time = 3;

  // PID of the process hosting the TensorBoard server.
  uint64 pid = 4;

  // TCP port on which the TensorBoard server is listening.
  uint64 port = 5;

  // Path to the main TensorBoard page within the web server (i.e.,
  // effective value of the `--path_prefix` argument to `tensorboard`).
  string path_prefix = 6;

  oneof data_location {
    // Path to the logdir from which TensorBoard is reading event files.
    string logdir = 7;

    // URI to the database backing the TensorBoard instance.
    string db = 8;
  }
}

Summary: As of this commit, TensorBoard instances will write “info files” to `/tmp/.tensorboard-info/pid-${PID}.info` (where `/tmp` is a platform-specific tempdir). The info files include the port number on which the server is running and a description of the server’s configuration, so that clients can check whether a suitable TensorBoard instance exists (and, if so, on which port) before starting a new one. A typical info file looks like this: $ cat /tmp/.tensorboard-info/pid-92239.info { "cache_key": "<long base64 string; omitted>", "db": "", "logdir": "/usr/local/google/home/wchargin/tensorboard_data/", "path_prefix": "", "pid": 92239, "port": 7012, "start_time": 1549064188, "version": "1.13.0a0" } Test Plan: Launch two TensorBoard instances. Then, comment out the `tb_logging` import in `tensorboard/manager.py`, and open a Python shell and run from tensorboard import manager infos = manager.get_all() Note that `len(infos) == 2` and the info objects are correct. Then, send SIGINT to one of the TensorBoards and SIGTERM to the other, and note that each of them has its info file removed and that `get_all()` returns an empty list. wchargin-branch: write-tensorboardinfo

wchargin · 2019-02-01T23:46:28Z

Updated; PTAL.

manivaradarajan · 2019-02-01T23:52:29Z

tensorboard/manager.py

+  """
+  path = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
+  if not os.path.exists(path):
+    os.mkdir(path)


Handle the case where mkdir, open, etc. fails. You can do this in a follow on. Let's not subject the user to the indignity of a stack trace.

Sure. As discussed offline, this problem exists throughout TensorBoard;
the minimal repro is just to invoke tensorboard at the command line,
which gives a stack trace. It’s true that users shouldn’t see stack
traces, but it’s also true that these particular functions should
throw.

Created #1798 to tackle the root cause (i.e., that there is no top-level
exception handler).

nfelt · 2019-02-02T00:50:21Z

tensorboard/manager.py

+  Returns:
+    A string representation of the provided `TensorboardInfo`.
+  """
+  field_types = {


It seems better to instead _TENSORBOARDINFO_FIELDS as a list of string, type tuples, then define TensorboardInfo = collections.namedtuple("TensorboardInfo", [k for (k, v) in_TENSORBOARDINFO_FIELDS]) so that they are forced to stay in sync and you don't need the asserts and duplication of field names.

On decode you can just tweak the python 2 special casing so that if the specified type is str then we expect to get unicode instead and convert it then.

Sure, can do. The datetime/int conversion will require a bit of
special-casing, too, but I can handle that.

nfelt · 2019-02-02T00:51:27Z

tensorboard/BUILD

+    srcs_version = "PY2AND3",
+    visibility = ["//tensorboard:internal"],
+    deps = [
+        ":manager",


add :version and @org_pythonhosted_six

nfelt · 2019-02-02T00:52:51Z

tensorboard/manager_test.py

+      manager.write_info_file(info)
+    self.assertEqual(os.listdir(self.info_dir), [])
+
+  def test_tensorboardinfo_serde_roundtrip(self):


TIL serde is a term. Just roundtrip would suffice I think

nfelt · 2019-02-02T01:22:16Z

tensorboard/program.py

+        db=self.flags.db,
+        cache_key=manager.cache_key(
+            working_directory=os.getcwd(),
+            arguments=sys.argv[1:],


It would be more correct to store the argv passed into configure() as a property of the server and then use that, since the whole point of configure() is to let you control the arguments explicitly vs using sys.argv. That raises the question of what to do if configure() is passed with explicit flags instead... I guess we could attempt to serialize them out to an argv-equivalent; we'd have to pick whether to be compatible with "--foo=bar" or "--foo bar".

Hmm, good point. I’ll have to think about the best way to do this,
particularly given the argparse/abseil duality that already exists in
configure (which may or may not actually be a problem; it’s just not
obvious to me what it does in all different environments).

What would you think about using kwargs to change the input argv
before parsing it, rather than using fixing up the flags object
afterward? The implementation is a bit messier, but the API is the same,
and it guarantees that there’s no skew in the “actual” vs.
“fake-serialized” forms.

It’s worth noting also that it’d be sound to key the cache against the
argv and the kwargs separately. So configure(["tb", "--logdir=x"])
and configure(["tb"], logdir="x") would map to different cache keys
(again, sound, but incomplete).

Keying the cache against them separately seems fine, given that the goal is just to detect when the exact same execution is performed twice, not to dedupe logically equivalent ones (e.g. we don't handle argument re-orders, etc.).

Not a big fan of feeding in the kwargs to argument parsing as extra "argv" entries; it's messier and then you have to ensure all the kwarg values can be be roundtripped through a string conversion correctly, which would require some effort if we have e.g. list or dict-valued arguments in the future.

Oh, just saw this. Glad that we’re on the same page. :-)

wchargin

Thanks for your suggestions. Will take a look at the configure
business on Monday.

wchargin · 2019-02-02T01:37:37Z

tensorboard/BUILD

+    srcs_version = "PY2AND3",
+    visibility = ["//tensorboard:internal"],
+    deps = [
+        ":manager",


wchargin · 2019-02-02T01:37:41Z

tensorboard/manager.py

+  Returns:
+    A string representation of the provided `TensorboardInfo`.
+  """
+  field_types = {


Sure, can do. The datetime/int conversion will require a bit of
special-casing, too, but I can handle that.

wchargin · 2019-02-02T01:38:03Z

tensorboard/manager_test.py

+      manager.write_info_file(info)
+    self.assertEqual(os.listdir(self.info_dir), [])
+
+  def test_tensorboardinfo_serde_roundtrip(self):


wchargin · 2019-02-02T01:50:14Z

tensorboard/program.py

+        db=self.flags.db,
+        cache_key=manager.cache_key(
+            working_directory=os.getcwd(),
+            arguments=sys.argv[1:],


Hmm, good point. I’ll have to think about the best way to do this,
particularly given the argparse/abseil duality that already exists in
configure (which may or may not actually be a problem; it’s just not
obvious to me what it does in all different environments).

What would you think about using kwargs to change the input argv
before parsing it, rather than using fixing up the flags object
afterward? The implementation is a bit messier, but the API is the same,
and it guarantees that there’s no skew in the “actual” vs.
“fake-serialized” forms.

wchargin · 2019-02-04T18:55:45Z

Given that the high-level design for the purposes of Colab integration
seems to be unobjectionable, I’ll split this up into smaller, more
easily reviewed PRs.

Summary: Currently, TensorBoard does not change the default signal dispositions for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit without running its `atexit` handlers, such as the one registered by the DB import mode. As of this commit, we handle SIGTERM by exiting gracefully. We leave SIGQUIT at the default disposition (temporary files will not be cleaned up), in accordance with the GNU libc guidelines: <https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT> The implementation is not perfect. Ideally, we would perform our graceful cleanup and then kill ourselves with the same signal to properly inform our parent of the source of the exit: for details, see <https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t provide a function like “please run all registered handlers now but don’t actually quit”; we might be able to implement this in Python 2.7 using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in Python 3. If we want to do this right, we could implement our own version of `atexit` (which would not be hard: the module is tiny). For now, I’m comfortable with submitting this mostly-correct patch. Supersedes part of #1795. Test Plan: Run `bazel build //tensorboard` and add the built binary to your PATH. Then, run `tensorboard --logdir whatever && echo after`, wait for it to print the “serving” message, and send SIGTERM via `kill(1)`. Note that TensorBoard prints a message and exits cleanly, and that “`after`” is printed to the console. Patch `program.py` to add a pre-existing signal handler and `atexit` cleanup from the top of `main`: ```diff diff --git a/tensorboard/program.py b/tensorboard/program.py index da59b4d1..c07cb855 100644 --- a/tensorboard/program.py +++ b/tensorboard/program.py @@ -201,6 +201,14 @@ class TensorBoard(object): :rtype: int """ + def fake_handler(signum, frame): + print("Handling some signals...") + assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL + def fake_cleanup(): + print("Cleaning everything up...") + import atexit + atexit.register(fake_cleanup) + self._install_signal_handler(signal.SIGTERM, "SIGTERM") if self.flags.inspect: logger.info('Not bringing up TensorBoard, but inspecting event files.') ``` Then, re-run the above steps, and note that the signal handler and `atexit` handler are both executed prior to cleanup: ``` $ tensorboard --logdir whatever && echo after TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit) TensorBoard caught SIGTERM; exiting... Handling some signals... Cleaning everything up... after ``` Ideally, `after` should _not_ be printed; that it is is a consequence of the fact that we don’t properly propagate the WIFSIGNALED flag as described above. wchargin-branch: handle-signals

Summary: This type describes information about a running TensorBoard instance. In a future commit, TensorBoard’s main function will be outfitted to write an “info file” containing this structure to a well-known location, which other processes can inspect to find all running TensorBoard instances, the ports on which they’re listening, their arguments and context, etc. The wire format is JSON as opposed to pbtxt for two reasons: it’s more open-source friendly (tools like `jq` have no publicly available protobuf equivalents), and the Python protobuf APIs are so frustrating that it doesn’t feel right to force callers to use them, even if this means that we have to implement our own validation and serialization. Supersedes part of #1795. Test Plan: Unit tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo

Summary: This function computes the (opaque) value of the `cache_key` field of `TensorboardInfo` objects, which is used to determine whether it is safe to reuse a TensorBoard instance. See docs for more details. Test Plan: Unit tests included; run `bazel test //tensorboard:manager_test`. For a sanity check at the REPL: ``` >>> from tensorboard import manager >>> import base64, json, os, pprint >>> ck = manager.cache_key( ... working_directory=os.getcwd(), ... arguments=["--logdir", "foo"], ... configure_kwargs={}, ... ) >>> ck 'eyJhcmd1bWVudHMiOlsiLS1sb2dkaXIiLCJmb28iXSwiY29uZmlndXJlX2t3YXJncyI6e30sIndvcmtpbmdfZGlyZWN0b3J5IjoiL3Vzci9sb2NhbC9nb29nbGUv aG9tZS93Y2hhcmdpbi9naXQvdGVuc29yYm9hcmQifQ==' >>> pprint.pprint(json.loads(base64.b64decode(ck))) {u'arguments': [u'--logdir', u'foo'], u'configure_kwargs': {}, u'working_directory': u'/usr/local/google/home/wchargin/git/tensorboard'} ``` Supersedes part of #1795. wchargin-branch: manager-cache-key

wchargin · 2019-02-05T01:31:06Z

Closing: improved and subsumed in #1802, #1804, #1805, #1806, and #1807.

Summary: This commit implements functions `write_info_file`, `remove_info_file`, and `get_all` on the `tensorboard.manager` module. See docs for details. Supersedes part of #1795. Test Plan: Integration tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo-io

wchargin requested a review from nfelt January 31, 2019 20:57

wchargin force-pushed the wchargin-write-tensorboardinfo branch from 96c6a49 to fa9aa93 Compare January 31, 2019 23:49

manivaradarajan self-requested a review February 1, 2019 02:47

manivaradarajan self-assigned this Feb 1, 2019

manivaradarajan removed their request for review February 1, 2019 02:47

manivaradarajan reviewed Feb 1, 2019

View reviewed changes

tensorboard/manager.py Outdated Show resolved Hide resolved

tensorboard/manager.py Outdated Show resolved Hide resolved

nfelt approved these changes Feb 1, 2019

View reviewed changes

wchargin commented Feb 1, 2019

View reviewed changes

wchargin force-pushed the wchargin-write-tensorboardinfo branch from 2d549c4 to dfe0a84 Compare February 1, 2019 23:46

manivaradarajan approved these changes Feb 2, 2019

View reviewed changes

nfelt reviewed Feb 2, 2019

View reviewed changes

wchargin commented Feb 2, 2019

View reviewed changes

wchargin mentioned this pull request Feb 4, 2019

main: handle SIGTERM, running atexit handlers #1802

Merged

wchargin mentioned this pull request Feb 4, 2019

manager: add TensorboardInfo type definition #1804

Merged

wchargin mentioned this pull request Feb 4, 2019

manager: add a cache_key function #1805

Merged

wchargin unassigned manivaradarajan Feb 5, 2019

wchargin mentioned this pull request Feb 5, 2019

manager: add I/O for TensorboardInfo files #1806

Merged

wchargin closed this Feb 5, 2019

wchargin deleted the wchargin-write-tensorboardinfo branch February 5, 2019 01:31

manager: add draft of TensorboardInfo writing #1795

manager: add draft of TensorboardInfo writing #1795

Uh oh!

Conversation

wchargin commented Jan 31, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wchargin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wchargin commented Feb 1, 2019

Uh oh!

wchargin commented Feb 1, 2019

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin Feb 2, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin commented Feb 4, 2019

Uh oh!

wchargin commented Feb 5, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

wchargin commented Jan 31, 2019 •

edited

Loading

wchargin Feb 2, 2019 •

edited

Loading

wchargin commented Feb 5, 2019 •

edited

Loading