Skip to content

Conversation

@wchargin
Copy link
Contributor

@wchargin wchargin commented Jan 31, 2019

Summary:
As of this commit, TensorBoard instances will write “info files” to
/tmp/.tensorboard-info/pid-${PID}.info (where /tmp is a
platform-specific tempdir). The info files include the port number on
which the server is running and a description of the server’s
configuration, so that clients can check whether a suitable TensorBoard
instance exists (and, if so, on which port) before starting a new one.

A typical info file looks like this:

$ cat /tmp/.tensorboard-info/pid-92239.info
{
    "cache_key": "<long base64 string; omitted>",
    "db": "",
    "logdir": "/usr/local/google/home/wchargin/tensorboard_data/",
    "path_prefix": "",
    "pid": 92239,
    "port": 7012,
    "start_time": 1549064188,
    "version": "1.13.0a0"
}

Test Plan:
Launch two TensorBoard instances. Then, comment out the tb_logging
import in tensorboard/manager.py, and open a Python shell and run

from tensorboard import manager
infos = manager.get_all()

Note that len(infos) == 2 and the info objects are correct. Then, send
SIGINT to one of the TensorBoards and SIGTERM to the other, and note
that each of them has its info file removed and that get_all() returns
an empty list.

wchargin-branch: write-tensorboardinfo

@wchargin wchargin requested a review from nfelt January 31, 2019 20:57
@wchargin wchargin force-pushed the wchargin-write-tensorboardinfo branch from 96c6a49 to fa9aa93 Compare January 31, 2019 23:49
@manivaradarajan manivaradarajan self-requested a review February 1, 2019 02:47
@manivaradarajan manivaradarajan self-assigned this Feb 1, 2019
@manivaradarajan manivaradarajan removed their request for review February 1, 2019 02:47
Copy link
Contributor Author

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Will revise and resubmit.

@wchargin
Copy link
Contributor Author

wchargin commented Feb 1, 2019

I was planning to convert the TensorboardInfo namedtuple to an actual
proto, so that the proto libraries would take care of validation and
date encoding for me. I got it working, but the Python bindings for
protobuf are so frustrating and unidiomatic that I’d rather handle it
myself. Just for posterity, the proto appears below.

Details
/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/

syntax = "proto3";

package tensorboard;

import "google/protobuf/timestamp.proto";

message TensorboardInfo {
  // Value of `tensorboard.version.VERSION` from which this info was
  // created.
  string version = 1;

  string cache_key = 2;

  // Time at which the TensorBoard server started serving.
  google.protobuf.Timestamp start_time = 3;

  // PID of the process hosting the TensorBoard server.
  uint64 pid = 4;

  // TCP port on which the TensorBoard server is listening.
  uint64 port = 5;

  // Path to the main TensorBoard page within the web server (i.e.,
  // effective value of the `--path_prefix` argument to `tensorboard`).
  string path_prefix = 6;

  oneof data_location {
    // Path to the logdir from which TensorBoard is reading event files.
    string logdir = 7;

    // URI to the database backing the TensorBoard instance.
    string db = 8;
  }
}

Summary:
As of this commit, TensorBoard instances will write “info files” to
`/tmp/.tensorboard-info/pid-${PID}.info` (where `/tmp` is a
platform-specific tempdir). The info files include the port number on
which the server is running and a description of the server’s
configuration, so that clients can check whether a suitable TensorBoard
instance exists (and, if so, on which port) before starting a new one.

A typical info file looks like this:

    $ cat /tmp/.tensorboard-info/pid-92239.info
    {
        "cache_key": "<long base64 string; omitted>",
        "db": "",
        "logdir": "/usr/local/google/home/wchargin/tensorboard_data/",
        "path_prefix": "",
        "pid": 92239,
        "port": 7012,
        "start_time": 1549064188,
        "version": "1.13.0a0"
    }

Test Plan:
Launch two TensorBoard instances. Then, comment out the `tb_logging`
import in `tensorboard/manager.py`, and open a Python shell and run

    from tensorboard import manager
    infos = manager.get_all()

Note that `len(infos) == 2` and the info objects are correct. Then, send
SIGINT to one of the TensorBoards and SIGTERM to the other, and note
that each of them has its info file removed and that `get_all()` returns
an empty list.

wchargin-branch: write-tensorboardinfo
@wchargin wchargin force-pushed the wchargin-write-tensorboardinfo branch from 2d549c4 to dfe0a84 Compare February 1, 2019 23:46
@wchargin
Copy link
Contributor Author

wchargin commented Feb 1, 2019

Updated; PTAL.

"""
path = os.path.join(tempfile.gettempdir(), ".tensorboard-info")
if not os.path.exists(path):
os.mkdir(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handle the case where mkdir, open, etc. fails. You can do this in a follow on. Let's not subject the user to the indignity of a stack trace.

Copy link
Contributor Author

@wchargin wchargin Feb 2, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. As discussed offline, this problem exists throughout TensorBoard;
the minimal repro is just to invoke tensorboard at the command line,
which gives a stack trace. It’s true that users shouldn’t see stack
traces, but it’s also true that these particular functions should
throw.

Created #1798 to tackle the root cause (i.e., that there is no top-level
exception handler).

Returns:
A string representation of the provided `TensorboardInfo`.
"""
field_types = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems better to instead _TENSORBOARDINFO_FIELDS as a list of string, type tuples, then define TensorboardInfo = collections.namedtuple("TensorboardInfo", [k for (k, v) in_TENSORBOARDINFO_FIELDS]) so that they are forced to stay in sync and you don't need the asserts and duplication of field names.

On decode you can just tweak the python 2 special casing so that if the specified type is str then we expect to get unicode instead and convert it then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do. The datetime/int conversion will require a bit of
special-casing, too, but I can handle that.

srcs_version = "PY2AND3",
visibility = ["//tensorboard:internal"],
deps = [
":manager",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add :version and @org_pythonhosted_six

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

manager.write_info_file(info)
self.assertEqual(os.listdir(self.info_dir), [])

def test_tensorboardinfo_serde_roundtrip(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL serde is a term. Just roundtrip would suffice I think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

db=self.flags.db,
cache_key=manager.cache_key(
working_directory=os.getcwd(),
arguments=sys.argv[1:],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more correct to store the argv passed into configure() as a property of the server and then use that, since the whole point of configure() is to let you control the arguments explicitly vs using sys.argv. That raises the question of what to do if configure() is passed with explicit flags instead... I guess we could attempt to serialize them out to an argv-equivalent; we'd have to pick whether to be compatible with "--foo=bar" or "--foo bar".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. I’ll have to think about the best way to do this,
particularly given the argparse/abseil duality that already exists in
configure (which may or may not actually be a problem; it’s just not
obvious to me what it does in all different environments).

What would you think about using kwargs to change the input argv
before parsing it, rather than using fixing up the flags object
afterward? The implementation is a bit messier, but the API is the same,
and it guarantees that there’s no skew in the “actual” vs.
“fake-serialized” forms.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s worth noting also that it’d be sound to key the cache against the
argv and the kwargs separately. So configure(["tb", "--logdir=x"])
and configure(["tb"], logdir="x") would map to different cache keys
(again, sound, but incomplete).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keying the cache against them separately seems fine, given that the goal is just to detect when the exact same execution is performed twice, not to dedupe logically equivalent ones (e.g. we don't handle argument re-orders, etc.).

Not a big fan of feeding in the kwargs to argument parsing as extra "argv" entries; it's messier and then you have to ensure all the kwarg values can be be roundtripped through a string conversion correctly, which would require some effort if we have e.g. list or dict-valued arguments in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, just saw this. Glad that we’re on the same page. :-)

Copy link
Contributor Author

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your suggestions. Will take a look at the configure
business on Monday.

srcs_version = "PY2AND3",
visibility = ["//tensorboard:internal"],
deps = [
":manager",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Returns:
A string representation of the provided `TensorboardInfo`.
"""
field_types = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, can do. The datetime/int conversion will require a bit of
special-casing, too, but I can handle that.

manager.write_info_file(info)
self.assertEqual(os.listdir(self.info_dir), [])

def test_tensorboardinfo_serde_roundtrip(self):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

db=self.flags.db,
cache_key=manager.cache_key(
working_directory=os.getcwd(),
arguments=sys.argv[1:],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point. I’ll have to think about the best way to do this,
particularly given the argparse/abseil duality that already exists in
configure (which may or may not actually be a problem; it’s just not
obvious to me what it does in all different environments).

What would you think about using kwargs to change the input argv
before parsing it, rather than using fixing up the flags object
afterward? The implementation is a bit messier, but the API is the same,
and it guarantees that there’s no skew in the “actual” vs.
“fake-serialized” forms.

@wchargin
Copy link
Contributor Author

wchargin commented Feb 4, 2019

Given that the high-level design for the purposes of Colab integration
seems to be unobjectionable, I’ll split this up into smaller, more
easily reviewed PRs.

wchargin added a commit that referenced this pull request Feb 4, 2019
Summary:
Currently, TensorBoard does not change the default signal dispositions
for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit
without running its `atexit` handlers, such as the one registered by the
DB import mode. As of this commit, we handle SIGTERM by exiting
gracefully. We leave SIGQUIT at the default disposition (temporary files
will not be cleaned up), in accordance with the GNU libc guidelines:
<https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT>

The implementation is not perfect. Ideally, we would perform our
graceful cleanup and then kill ourselves with the same signal to
properly inform our parent of the source of the exit: for details, see
<https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t
provide a function like “please run all registered handlers now but
don’t actually quit”; we might be able to implement this in Python 2.7
using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in
Python 3. If we want to do this right, we could implement our own
version of `atexit` (which would not be hard: the module is tiny). For
now, I’m comfortable with submitting this mostly-correct patch.

Supersedes part of #1795.

Test Plan:
Run `bazel build //tensorboard` and add the built binary to your PATH.
Then, run `tensorboard --logdir whatever && echo after`, wait for it to
print the “serving” message, and send SIGTERM via `kill(1)`. Note that
TensorBoard prints a message and exits cleanly, and that “`after`”
is printed to the console.

Patch `program.py` to add a pre-existing signal handler and `atexit`
cleanup from the top of `main`:

```diff
diff --git a/tensorboard/program.py b/tensorboard/program.py
index da59b4d1..c07cb855 100644
--- a/tensorboard/program.py
+++ b/tensorboard/program.py
@@ -201,6 +201,14 @@ class TensorBoard(object):

     :rtype: int
     """
+    def fake_handler(signum, frame):
+      print("Handling some signals...")
+    assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL
+    def fake_cleanup():
+      print("Cleaning everything up...")
+    import atexit
+    atexit.register(fake_cleanup)
+
     self._install_signal_handler(signal.SIGTERM, "SIGTERM")
     if self.flags.inspect:
       logger.info('Not bringing up TensorBoard, but inspecting event files.')
```

Then, re-run the above steps, and note that the signal handler and
`atexit` handler are both executed prior to cleanup:

```
$ tensorboard --logdir whatever && echo after
TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit)
TensorBoard caught SIGTERM; exiting...
Handling some signals...
Cleaning everything up...
after
```

Ideally, `after` should _not_ be printed; that it is is a consequence of
the fact that we don’t properly propagate the WIFSIGNALED flag as
described above.

wchargin-branch: handle-signals
wchargin added a commit that referenced this pull request Feb 4, 2019
Summary:
Currently, TensorBoard does not change the default signal dispositions
for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit
without running its `atexit` handlers, such as the one registered by the
DB import mode. As of this commit, we handle SIGTERM by exiting
gracefully. We leave SIGQUIT at the default disposition (temporary files
will not be cleaned up), in accordance with the GNU libc guidelines:
<https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT>

The implementation is not perfect. Ideally, we would perform our
graceful cleanup and then kill ourselves with the same signal to
properly inform our parent of the source of the exit: for details, see
<https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t
provide a function like “please run all registered handlers now but
don’t actually quit”; we might be able to implement this in Python 2.7
using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in
Python 3. If we want to do this right, we could implement our own
version of `atexit` (which would not be hard: the module is tiny). For
now, I’m comfortable with submitting this mostly-correct patch.

Supersedes part of #1795.

Test Plan:
Run `bazel build //tensorboard` and add the built binary to your PATH.
Then, run `tensorboard --logdir whatever && echo after`, wait for it to
print the “serving” message, and send SIGTERM via `kill(1)`. Note that
TensorBoard prints a message and exits cleanly, and that “`after`”
is printed to the console.

Patch `program.py` to add a pre-existing signal handler and `atexit`
cleanup from the top of `main`:

```diff
diff --git a/tensorboard/program.py b/tensorboard/program.py
index da59b4d1..c07cb855 100644
--- a/tensorboard/program.py
+++ b/tensorboard/program.py
@@ -201,6 +201,14 @@ class TensorBoard(object):

     :rtype: int
     """
+    def fake_handler(signum, frame):
+      print("Handling some signals...")
+    assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL
+    def fake_cleanup():
+      print("Cleaning everything up...")
+    import atexit
+    atexit.register(fake_cleanup)
+
     self._install_signal_handler(signal.SIGTERM, "SIGTERM")
     if self.flags.inspect:
       logger.info('Not bringing up TensorBoard, but inspecting event files.')
```

Then, re-run the above steps, and note that the signal handler and
`atexit` handler are both executed prior to cleanup:

```
$ tensorboard --logdir whatever && echo after
TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit)
TensorBoard caught SIGTERM; exiting...
Handling some signals...
Cleaning everything up...
after
```

Ideally, `after` should _not_ be printed; that it is is a consequence of
the fact that we don’t properly propagate the WIFSIGNALED flag as
described above.

wchargin-branch: handle-signals
wchargin added a commit that referenced this pull request Feb 5, 2019
Summary:
This type describes information about a running TensorBoard instance. In
a future commit, TensorBoard’s main function will be outfitted to write
an “info file” containing this structure to a well-known location, which
other processes can inspect to find all running TensorBoard instances,
the ports on which they’re listening, their arguments and context, etc.

The wire format is JSON as opposed to pbtxt for two reasons: it’s more
open-source friendly (tools like `jq` have no publicly available
protobuf equivalents), and the Python protobuf APIs are so frustrating
that it doesn’t feel right to force callers to use them, even if this
means that we have to implement our own validation and serialization.

Supersedes part of #1795.

Test Plan:
Unit tests included; run `bazel test //tensorboard:manager_test`.

wchargin-branch: tensorboardinfo
wchargin added a commit that referenced this pull request Feb 5, 2019
Summary:
This type describes information about a running TensorBoard instance. In
a future commit, TensorBoard’s main function will be outfitted to write
an “info file” containing this structure to a well-known location, which
other processes can inspect to find all running TensorBoard instances,
the ports on which they’re listening, their arguments and context, etc.

The wire format is JSON as opposed to pbtxt for two reasons: it’s more
open-source friendly (tools like `jq` have no publicly available
protobuf equivalents), and the Python protobuf APIs are so frustrating
that it doesn’t feel right to force callers to use them, even if this
means that we have to implement our own validation and serialization.

Supersedes part of #1795.

Test Plan:
Unit tests included; run `bazel test //tensorboard:manager_test`.

wchargin-branch: tensorboardinfo
wchargin added a commit that referenced this pull request Feb 5, 2019
Summary:
This function computes the (opaque) value of the `cache_key` field of
`TensorboardInfo` objects, which is used to determine whether it is safe
to reuse a TensorBoard instance. See docs for more details.

Test Plan:
Unit tests included; run `bazel test //tensorboard:manager_test`.

For a sanity check at the REPL:

```
>>> from tensorboard import manager
>>> import base64, json, os, pprint
>>> ck = manager.cache_key(
...     working_directory=os.getcwd(),
...     arguments=["--logdir", "foo"],
...     configure_kwargs={},
... )
>>> ck
'eyJhcmd1bWVudHMiOlsiLS1sb2dkaXIiLCJmb28iXSwiY29uZmlndXJlX2t3YXJncyI6e30sIndvcmtpbmdfZGlyZWN0b3J5IjoiL3Vzci9sb2NhbC9nb29nbGUv
aG9tZS93Y2hhcmdpbi9naXQvdGVuc29yYm9hcmQifQ=='
>>> pprint.pprint(json.loads(base64.b64decode(ck)))
{u'arguments': [u'--logdir', u'foo'],
 u'configure_kwargs': {},
 u'working_directory': u'/usr/local/google/home/wchargin/git/tensorboard'}
```

Supersedes part of #1795.

wchargin-branch: manager-cache-key
wchargin added a commit that referenced this pull request Feb 5, 2019
Summary:
This function computes the (opaque) value of the `cache_key` field of
`TensorboardInfo` objects, which is used to determine whether it is safe
to reuse a TensorBoard instance. See docs for more details.

Test Plan:
Unit tests included; run `bazel test //tensorboard:manager_test`.

For a sanity check at the REPL:

```
>>> from tensorboard import manager
>>> import base64, json, os, pprint
>>> ck = manager.cache_key(
...     working_directory=os.getcwd(),
...     arguments=["--logdir", "foo"],
...     configure_kwargs={},
... )
>>> ck
'eyJhcmd1bWVudHMiOlsiLS1sb2dkaXIiLCJmb28iXSwiY29uZmlndXJlX2t3YXJncyI6e30sIndvcmtpbmdfZGlyZWN0b3J5IjoiL3Vzci9sb2NhbC9nb29nbGUv
aG9tZS93Y2hhcmdpbi9naXQvdGVuc29yYm9hcmQifQ=='
>>> pprint.pprint(json.loads(base64.b64decode(ck)))
{u'arguments': [u'--logdir', u'foo'],
 u'configure_kwargs': {},
 u'working_directory': u'/usr/local/google/home/wchargin/git/tensorboard'}
```

Supersedes part of #1795.

wchargin-branch: manager-cache-key
@wchargin
Copy link
Contributor Author

wchargin commented Feb 5, 2019

Closing: improved and subsumed in #1802, #1804, #1805, #1806, and #1807.

@wchargin wchargin closed this Feb 5, 2019
@wchargin wchargin deleted the wchargin-write-tensorboardinfo branch February 5, 2019 01:31
wchargin added a commit that referenced this pull request Feb 6, 2019
Summary:
This commit implements functions `write_info_file`, `remove_info_file`,
and `get_all` on the `tensorboard.manager` module. See docs for details.

Supersedes part of #1795.

Test Plan:
Integration tests included; run `bazel test //tensorboard:manager_test`.

wchargin-branch: tensorboardinfo-io
wchargin added a commit that referenced this pull request Feb 6, 2019
Summary:
This commit implements functions `write_info_file`, `remove_info_file`,
and `get_all` on the `tensorboard.manager` module. See docs for details.

Supersedes part of #1795.

Test Plan:
Integration tests included; run `bazel test //tensorboard:manager_test`.

wchargin-branch: tensorboardinfo-io
wchargin added a commit that referenced this pull request Feb 7, 2019
Summary:
This commit implements functions `write_info_file`, `remove_info_file`,
and `get_all` on the `tensorboard.manager` module. See docs for details.

Supersedes part of #1795.

Test Plan:
Integration tests included; run `bazel test //tensorboard:manager_test`.

wchargin-branch: tensorboardinfo-io
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants