-
Notifications
You must be signed in to change notification settings - Fork 1.7k
manager: add draft of TensorboardInfo writing #1795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
96c6a49 to
fa9aa93
Compare
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! Will revise and resubmit.
|
I was planning to convert the Details/* Copyright 2019 The TensorFlow Authors. All Rights Reserved.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
==============================================================================*/
syntax = "proto3";
package tensorboard;
import "google/protobuf/timestamp.proto";
message TensorboardInfo {
// Value of `tensorboard.version.VERSION` from which this info was
// created.
string version = 1;
string cache_key = 2;
// Time at which the TensorBoard server started serving.
google.protobuf.Timestamp start_time = 3;
// PID of the process hosting the TensorBoard server.
uint64 pid = 4;
// TCP port on which the TensorBoard server is listening.
uint64 port = 5;
// Path to the main TensorBoard page within the web server (i.e.,
// effective value of the `--path_prefix` argument to `tensorboard`).
string path_prefix = 6;
oneof data_location {
// Path to the logdir from which TensorBoard is reading event files.
string logdir = 7;
// URI to the database backing the TensorBoard instance.
string db = 8;
}
} |
Summary:
As of this commit, TensorBoard instances will write “info files” to
`/tmp/.tensorboard-info/pid-${PID}.info` (where `/tmp` is a
platform-specific tempdir). The info files include the port number on
which the server is running and a description of the server’s
configuration, so that clients can check whether a suitable TensorBoard
instance exists (and, if so, on which port) before starting a new one.
A typical info file looks like this:
$ cat /tmp/.tensorboard-info/pid-92239.info
{
"cache_key": "<long base64 string; omitted>",
"db": "",
"logdir": "/usr/local/google/home/wchargin/tensorboard_data/",
"path_prefix": "",
"pid": 92239,
"port": 7012,
"start_time": 1549064188,
"version": "1.13.0a0"
}
Test Plan:
Launch two TensorBoard instances. Then, comment out the `tb_logging`
import in `tensorboard/manager.py`, and open a Python shell and run
from tensorboard import manager
infos = manager.get_all()
Note that `len(infos) == 2` and the info objects are correct. Then, send
SIGINT to one of the TensorBoards and SIGTERM to the other, and note
that each of them has its info file removed and that `get_all()` returns
an empty list.
wchargin-branch: write-tensorboardinfo
2d549c4 to
dfe0a84
Compare
|
Updated; PTAL. |
| """ | ||
| path = os.path.join(tempfile.gettempdir(), ".tensorboard-info") | ||
| if not os.path.exists(path): | ||
| os.mkdir(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Handle the case where mkdir, open, etc. fails. You can do this in a follow on. Let's not subject the user to the indignity of a stack trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. As discussed offline, this problem exists throughout TensorBoard;
the minimal repro is just to invoke tensorboard at the command line,
which gives a stack trace. It’s true that users shouldn’t see stack
traces, but it’s also true that these particular functions should
throw.
Created #1798 to tackle the root cause (i.e., that there is no top-level
exception handler).
| Returns: | ||
| A string representation of the provided `TensorboardInfo`. | ||
| """ | ||
| field_types = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems better to instead _TENSORBOARDINFO_FIELDS as a list of string, type tuples, then define TensorboardInfo = collections.namedtuple("TensorboardInfo", [k for (k, v) in_TENSORBOARDINFO_FIELDS]) so that they are forced to stay in sync and you don't need the asserts and duplication of field names.
On decode you can just tweak the python 2 special casing so that if the specified type is str then we expect to get unicode instead and convert it then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do. The datetime/int conversion will require a bit of
special-casing, too, but I can handle that.
| srcs_version = "PY2AND3", | ||
| visibility = ["//tensorboard:internal"], | ||
| deps = [ | ||
| ":manager", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add :version and @org_pythonhosted_six
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
| manager.write_info_file(info) | ||
| self.assertEqual(os.listdir(self.info_dir), []) | ||
|
|
||
| def test_tensorboardinfo_serde_roundtrip(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL serde is a term. Just roundtrip would suffice I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
| db=self.flags.db, | ||
| cache_key=manager.cache_key( | ||
| working_directory=os.getcwd(), | ||
| arguments=sys.argv[1:], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more correct to store the argv passed into configure() as a property of the server and then use that, since the whole point of configure() is to let you control the arguments explicitly vs using sys.argv. That raises the question of what to do if configure() is passed with explicit flags instead... I guess we could attempt to serialize them out to an argv-equivalent; we'd have to pick whether to be compatible with "--foo=bar" or "--foo bar".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. I’ll have to think about the best way to do this,
particularly given the argparse/abseil duality that already exists in
configure (which may or may not actually be a problem; it’s just not
obvious to me what it does in all different environments).
What would you think about using kwargs to change the input argv
before parsing it, rather than using fixing up the flags object
afterward? The implementation is a bit messier, but the API is the same,
and it guarantees that there’s no skew in the “actual” vs.
“fake-serialized” forms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s worth noting also that it’d be sound to key the cache against the
argv and the kwargs separately. So configure(["tb", "--logdir=x"])
and configure(["tb"], logdir="x") would map to different cache keys
(again, sound, but incomplete).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keying the cache against them separately seems fine, given that the goal is just to detect when the exact same execution is performed twice, not to dedupe logically equivalent ones (e.g. we don't handle argument re-orders, etc.).
Not a big fan of feeding in the kwargs to argument parsing as extra "argv" entries; it's messier and then you have to ensure all the kwarg values can be be roundtripped through a string conversion correctly, which would require some effort if we have e.g. list or dict-valued arguments in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, just saw this. Glad that we’re on the same page. :-)
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your suggestions. Will take a look at the configure
business on Monday.
| srcs_version = "PY2AND3", | ||
| visibility = ["//tensorboard:internal"], | ||
| deps = [ | ||
| ":manager", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
| Returns: | ||
| A string representation of the provided `TensorboardInfo`. | ||
| """ | ||
| field_types = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, can do. The datetime/int conversion will require a bit of
special-casing, too, but I can handle that.
| manager.write_info_file(info) | ||
| self.assertEqual(os.listdir(self.info_dir), []) | ||
|
|
||
| def test_tensorboardinfo_serde_roundtrip(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
| db=self.flags.db, | ||
| cache_key=manager.cache_key( | ||
| working_directory=os.getcwd(), | ||
| arguments=sys.argv[1:], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. I’ll have to think about the best way to do this,
particularly given the argparse/abseil duality that already exists in
configure (which may or may not actually be a problem; it’s just not
obvious to me what it does in all different environments).
What would you think about using kwargs to change the input argv
before parsing it, rather than using fixing up the flags object
afterward? The implementation is a bit messier, but the API is the same,
and it guarantees that there’s no skew in the “actual” vs.
“fake-serialized” forms.
|
Given that the high-level design for the purposes of Colab integration |
Summary: Currently, TensorBoard does not change the default signal dispositions for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit without running its `atexit` handlers, such as the one registered by the DB import mode. As of this commit, we handle SIGTERM by exiting gracefully. We leave SIGQUIT at the default disposition (temporary files will not be cleaned up), in accordance with the GNU libc guidelines: <https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT> The implementation is not perfect. Ideally, we would perform our graceful cleanup and then kill ourselves with the same signal to properly inform our parent of the source of the exit: for details, see <https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t provide a function like “please run all registered handlers now but don’t actually quit”; we might be able to implement this in Python 2.7 using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in Python 3. If we want to do this right, we could implement our own version of `atexit` (which would not be hard: the module is tiny). For now, I’m comfortable with submitting this mostly-correct patch. Supersedes part of #1795. Test Plan: Run `bazel build //tensorboard` and add the built binary to your PATH. Then, run `tensorboard --logdir whatever && echo after`, wait for it to print the “serving” message, and send SIGTERM via `kill(1)`. Note that TensorBoard prints a message and exits cleanly, and that “`after`” is printed to the console. Patch `program.py` to add a pre-existing signal handler and `atexit` cleanup from the top of `main`: ```diff diff --git a/tensorboard/program.py b/tensorboard/program.py index da59b4d1..c07cb855 100644 --- a/tensorboard/program.py +++ b/tensorboard/program.py @@ -201,6 +201,14 @@ class TensorBoard(object): :rtype: int """ + def fake_handler(signum, frame): + print("Handling some signals...") + assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL + def fake_cleanup(): + print("Cleaning everything up...") + import atexit + atexit.register(fake_cleanup) + self._install_signal_handler(signal.SIGTERM, "SIGTERM") if self.flags.inspect: logger.info('Not bringing up TensorBoard, but inspecting event files.') ``` Then, re-run the above steps, and note that the signal handler and `atexit` handler are both executed prior to cleanup: ``` $ tensorboard --logdir whatever && echo after TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit) TensorBoard caught SIGTERM; exiting... Handling some signals... Cleaning everything up... after ``` Ideally, `after` should _not_ be printed; that it is is a consequence of the fact that we don’t properly propagate the WIFSIGNALED flag as described above. wchargin-branch: handle-signals
Summary: Currently, TensorBoard does not change the default signal dispositions for any signal handlers, so if killed by SIGTERM or SIGQUIT it will exit without running its `atexit` handlers, such as the one registered by the DB import mode. As of this commit, we handle SIGTERM by exiting gracefully. We leave SIGQUIT at the default disposition (temporary files will not be cleaned up), in accordance with the GNU libc guidelines: <https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html#index-SIGQUIT> The implementation is not perfect. Ideally, we would perform our graceful cleanup and then kill ourselves with the same signal to properly inform our parent of the source of the exit: for details, see <https://www.cons.org/cracauer/sigint.html>. But `atexit` doesn’t provide a function like “please run all registered handlers now but don’t actually quit”; we might be able to implement this in Python 2.7 using `sys.exitfunc`, but that’s deprecated in Python 2.7 and removed in Python 3. If we want to do this right, we could implement our own version of `atexit` (which would not be hard: the module is tiny). For now, I’m comfortable with submitting this mostly-correct patch. Supersedes part of #1795. Test Plan: Run `bazel build //tensorboard` and add the built binary to your PATH. Then, run `tensorboard --logdir whatever && echo after`, wait for it to print the “serving” message, and send SIGTERM via `kill(1)`. Note that TensorBoard prints a message and exits cleanly, and that “`after`” is printed to the console. Patch `program.py` to add a pre-existing signal handler and `atexit` cleanup from the top of `main`: ```diff diff --git a/tensorboard/program.py b/tensorboard/program.py index da59b4d1..c07cb855 100644 --- a/tensorboard/program.py +++ b/tensorboard/program.py @@ -201,6 +201,14 @@ class TensorBoard(object): :rtype: int """ + def fake_handler(signum, frame): + print("Handling some signals...") + assert signal.signal(signal.SIGTERM, fake_handler) == signal.SIG_DFL + def fake_cleanup(): + print("Cleaning everything up...") + import atexit + atexit.register(fake_cleanup) + self._install_signal_handler(signal.SIGTERM, "SIGTERM") if self.flags.inspect: logger.info('Not bringing up TensorBoard, but inspecting event files.') ``` Then, re-run the above steps, and note that the signal handler and `atexit` handler are both executed prior to cleanup: ``` $ tensorboard --logdir whatever && echo after TensorBoard 1.13.0a0 at <hostname> (Press CTRL+C to quit) TensorBoard caught SIGTERM; exiting... Handling some signals... Cleaning everything up... after ``` Ideally, `after` should _not_ be printed; that it is is a consequence of the fact that we don’t properly propagate the WIFSIGNALED flag as described above. wchargin-branch: handle-signals
Summary: This type describes information about a running TensorBoard instance. In a future commit, TensorBoard’s main function will be outfitted to write an “info file” containing this structure to a well-known location, which other processes can inspect to find all running TensorBoard instances, the ports on which they’re listening, their arguments and context, etc. The wire format is JSON as opposed to pbtxt for two reasons: it’s more open-source friendly (tools like `jq` have no publicly available protobuf equivalents), and the Python protobuf APIs are so frustrating that it doesn’t feel right to force callers to use them, even if this means that we have to implement our own validation and serialization. Supersedes part of #1795. Test Plan: Unit tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo
Summary: This type describes information about a running TensorBoard instance. In a future commit, TensorBoard’s main function will be outfitted to write an “info file” containing this structure to a well-known location, which other processes can inspect to find all running TensorBoard instances, the ports on which they’re listening, their arguments and context, etc. The wire format is JSON as opposed to pbtxt for two reasons: it’s more open-source friendly (tools like `jq` have no publicly available protobuf equivalents), and the Python protobuf APIs are so frustrating that it doesn’t feel right to force callers to use them, even if this means that we have to implement our own validation and serialization. Supersedes part of #1795. Test Plan: Unit tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo
Summary:
This function computes the (opaque) value of the `cache_key` field of
`TensorboardInfo` objects, which is used to determine whether it is safe
to reuse a TensorBoard instance. See docs for more details.
Test Plan:
Unit tests included; run `bazel test //tensorboard:manager_test`.
For a sanity check at the REPL:
```
>>> from tensorboard import manager
>>> import base64, json, os, pprint
>>> ck = manager.cache_key(
... working_directory=os.getcwd(),
... arguments=["--logdir", "foo"],
... configure_kwargs={},
... )
>>> ck
'eyJhcmd1bWVudHMiOlsiLS1sb2dkaXIiLCJmb28iXSwiY29uZmlndXJlX2t3YXJncyI6e30sIndvcmtpbmdfZGlyZWN0b3J5IjoiL3Vzci9sb2NhbC9nb29nbGUv
aG9tZS93Y2hhcmdpbi9naXQvdGVuc29yYm9hcmQifQ=='
>>> pprint.pprint(json.loads(base64.b64decode(ck)))
{u'arguments': [u'--logdir', u'foo'],
u'configure_kwargs': {},
u'working_directory': u'/usr/local/google/home/wchargin/git/tensorboard'}
```
Supersedes part of #1795.
wchargin-branch: manager-cache-key
Summary:
This function computes the (opaque) value of the `cache_key` field of
`TensorboardInfo` objects, which is used to determine whether it is safe
to reuse a TensorBoard instance. See docs for more details.
Test Plan:
Unit tests included; run `bazel test //tensorboard:manager_test`.
For a sanity check at the REPL:
```
>>> from tensorboard import manager
>>> import base64, json, os, pprint
>>> ck = manager.cache_key(
... working_directory=os.getcwd(),
... arguments=["--logdir", "foo"],
... configure_kwargs={},
... )
>>> ck
'eyJhcmd1bWVudHMiOlsiLS1sb2dkaXIiLCJmb28iXSwiY29uZmlndXJlX2t3YXJncyI6e30sIndvcmtpbmdfZGlyZWN0b3J5IjoiL3Vzci9sb2NhbC9nb29nbGUv
aG9tZS93Y2hhcmdpbi9naXQvdGVuc29yYm9hcmQifQ=='
>>> pprint.pprint(json.loads(base64.b64decode(ck)))
{u'arguments': [u'--logdir', u'foo'],
u'configure_kwargs': {},
u'working_directory': u'/usr/local/google/home/wchargin/git/tensorboard'}
```
Supersedes part of #1795.
wchargin-branch: manager-cache-key
Summary: This commit implements functions `write_info_file`, `remove_info_file`, and `get_all` on the `tensorboard.manager` module. See docs for details. Supersedes part of #1795. Test Plan: Integration tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo-io
Summary: This commit implements functions `write_info_file`, `remove_info_file`, and `get_all` on the `tensorboard.manager` module. See docs for details. Supersedes part of #1795. Test Plan: Integration tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo-io
Summary: This commit implements functions `write_info_file`, `remove_info_file`, and `get_all` on the `tensorboard.manager` module. See docs for details. Supersedes part of #1795. Test Plan: Integration tests included; run `bazel test //tensorboard:manager_test`. wchargin-branch: tensorboardinfo-io
Summary:
As of this commit, TensorBoard instances will write “info files” to
/tmp/.tensorboard-info/pid-${PID}.info(where/tmpis aplatform-specific tempdir). The info files include the port number on
which the server is running and a description of the server’s
configuration, so that clients can check whether a suitable TensorBoard
instance exists (and, if so, on which port) before starting a new one.
A typical info file looks like this:
Test Plan:
Launch two TensorBoard instances. Then, comment out the
tb_loggingimport in
tensorboard/manager.py, and open a Python shell and runNote that
len(infos) == 2and the info objects are correct. Then, sendSIGINT to one of the TensorBoards and SIGTERM to the other, and note
that each of them has its info file removed and that
get_all()returnsan empty list.
wchargin-branch: write-tensorboardinfo