Conversation

Contributor

@wchargin wchargin commented Dec 1, 2020

Summary:
TensorBoard can now start RustBoard if `--load_fast` is given at runtime
and `--define=link_data_server=true` was given at build time. The
default configuration still has no Rust code, so our Pip packages are
still portable. When we actually deploy this, we can distribute the Rust
binary in a separate Pip package that TensorBoard imports, but the UX
can stay the same: “just add `--load_fast`.”

Test Plan:
Test in the following configurations:

  - with just a `--logdir`, everything works as normal;
  - with `--define=link_data_server=true -- --logdir ... --load_fast`,
    the subprocess is spawned and cleaned up on a successful exit or a
    SIGKILL to TensorBoard, and INFO logs are shown iff `--verbosity 0`
    is specified;
  - with `--logdir ... --load_fast` but no `--define`, TensorBoard fails
    to start the server and prints a message before exiting; and
  - with `--grpc_data_provider localhost:6806`, TensorBoard connects to
    an existing server without needing `--logdir` or `--define`.

To test the “data server died” case, comment out the `--logdir=%s` flag,
which will cause the server to fail with a usage message. That message
should appear in the logs.

To test the polling, add `thread::sleep(Duration::from_secs(3))` before
the server writes its port file, and run with `--verbosity 0` to note
the “Polling for data server port” messages.

This also works after syncing into Google, in all relevant
configurations; see http://cl/344900410 and http://cl/344955833.

wchargin-branch: grpc-ingester

Summary:
If `tensorboard` is to launch RustBoard as a subprocess rather than
relying on the user to launch it concurrently, then we should try hard
not to leave around a zombie server. An easy solution is an `atexit`
handler in Python TensorBoard, but this doesn’t handle the case when
TensorBoard crashes or gets SIGTERMed. On Linux, we can be notified when
our parent dies by setting the parent-death signal (`man 2 prctl`), but
this isn’t portable. On portable Unix, the child can poll its PPID to
see when it changes to `1` or the PID of some subreaper, but this isn’t
portable to Windows, which doesn’t update PPID when the parent dies.
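For concreteness, the parent-death-signal approach might look like this sketch in Python (illustrative only; the real server is Rust, and this mechanism is Linux-specific):

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def set_parent_death_signal(sig=signal.SIGTERM):
    """Best-effort: ask the kernel to send `sig` when our parent dies.

    Linux-only (`man 2 prctl`); returns None on other platforms.
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
    except OSError:
        return None  # e.g., musl-based systems name libc differently
    return libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0)


rc = set_parent_death_signal()  # 0 on success, None if unsupported
```

One more wrinkle: the death signal fires when the parent *thread* that spawned the child exits, not just the whole process, which adds to the portability headaches.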

Windows is not my strong suit, but some web searching didn’t turn up an
easy and clean solution portable to both Windows and Unix (any Windows
internals like “job objects” are a non-starter, since I can’t test
them). The one thing that I would expect to work everywhere is “just
wait until stdin closes and then exit”. If even that doesn’t work on
Windows, well, we can burn that bridge when we come to it.

Test Plan:
Build `bazel build //tensorboard/data/server`, then write a simple
Python driver:

```python
import subprocess
import time

# Path to the freshly built data server binary.
server = "bazel-bin/tensorboard/data/server/server"
p = subprocess.Popen(
    [server, "--logdir", "/tmp/nonexistent", "-v", "--die-after-stdin"],
    stdin=subprocess.PIPE,  # the server exits when this pipe closes
)
print(p.pid)
time.sleep(2)  # the pipe closes when this script exits
```

Run it with `python test.py`, and note that after 2 seconds the server
prints “Stdin closed; exiting”. Run it again, and suspend (`^Z`) the
process before it finishes sleeping. Run `kill -SIGCONT CHILD_PID` with
the PID printed by the Python script to resume server execution; this is
just needed because your `^Z` propagates to all processes in the group.
Note that the server continues to print logs as it starts and finishes
new load cycles. Then, run `fg` and wait for the sleep to complete, and
note that the server again exits, as desired.

wchargin-branch: rust-die-after-stdin
wchargin-source: ad925e07a2106ee37d268e32c1b997606326ae99

Summary:
TensorBoard can now start RustBoard if `--load_fast` is given at runtime
and `--define=link_data_server=true` was given at build time. The
default configuration still has no Rust code, so our Pip packages are
still portable. When we actually deploy this, we can distribute the Rust
binary in a separate Pip package that TensorBoard talks to via [entry
points], but the UX can stay the same: “just add `--load_fast`.”

[entry points]: https://packaging.python.org/specifications/entry-points/

Test Plan:
Test in the following configurations:

  - with just a `--logdir`, everything works as normal;
  - with `--define=link_data_server=true -- --logdir ... --load_fast`,
    the subprocess is spawned and cleaned up on a successful exit or a
    SIGKILL to TensorBoard, and INFO logs are shown iff `--verbosity 0`
    is specified;
  - with `--logdir ... --load_fast` but no `--define`, TensorBoard fails
    to start the server and prints a message before exiting; and
  - with `--grpc_data_provider localhost:6806`, TensorBoard connects to
    an existing server without needing `--logdir` or `--define`.

To test the “data server died” case, comment out the `--logdir=%s` flag,
which will cause the server to fail with a usage message. That message
should appear in the logs.

This also works after syncing into Google, in all relevant
configurations; see <http://cl/344900410> and <http://cl/344955833>.

wchargin-branch: grpc-ingester
wchargin-source: 57ed441a34d496055b4b75e6cd935f63b8ae2e09
@wchargin wchargin added core:backend core:rustboard //tensorboard/data/server/... labels Dec 1, 2020
@wchargin wchargin requested a review from nfelt December 1, 2020 07:19
@google-cla google-cla bot added the cla: yes label Dec 1, 2020
Comment on lines 58 to 61

```python
def test_load_fast(self):
    flags = make_flags(logdir="/some/logdir", load_fast=True)
    ingester = server_ingester.ServerDataIngester(flags)
    # Not much more that we can easily test here.
```
Contributor Author

Interested in your thoughts here? I could go to various degrees of
trouble to create fake executables that interact with stdout in various
ways, but it would be a fair amount of trouble, and it’s not obvious to
me that it’s worth it. I found it worth noting that your local data
ingester has not been touched since its creation. This case is different
because I’ll at least be forwarding things like `samples_per_plugin` as
we add support for them, but it’s still a useful indicator.

Feel free to send a quick review with just “more tests here, please”.

Contributor

My thoughts: I probably wouldn't bother creating a fake executable; if we're going to launch a process, it might as well be the real one, making this an integration test. I think that could be useful, but IMO not essential, especially since this is still experimental.

For testing flag propagation alone, I'd consider just mocking `subprocess.Popen` and checking that we passed the right args. This seems like a reasonable case for mocking: it's less of a pure implementation detail, since it's written against the CLI of the server binary, and the server is a relatively heavyweight component to replicate with an accurate fake or the real thing.
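A minimal sketch of that mocking approach (names like `launch_server` are hypothetical, not the PR's actual code):

```python
import subprocess
from unittest import mock


def launch_server(logdir):
    # Stand-in for the code under test, which spawns the data server.
    return subprocess.Popen(["server", "--logdir", logdir])


with mock.patch.object(subprocess, "Popen", autospec=True) as popen:
    launch_server("/tmp/logs")

# The test then asserts on the exact argv, i.e. the server's CLI.
popen.assert_called_once_with(["server", "--logdir", "/tmp/logs"])
```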

Contributor Author

Problem with integration testing is that we don’t have a real server
here within Google. (I guess we could just disable the test internally.)
Agreed on “useful but not essential”.

Okay, we can use mocking here.

Base automatically changed from wchargin-rust-die-after-stdin to master December 1, 2020 19:28
wchargin-branch: grpc-ingester
wchargin-source: 203ccff9eb6d57a90f94e3383c8162a5586cf658
wchargin-branch: grpc-ingester
wchargin-source: bc101e5cd0b10ed0764d358a28d2397948842ace
```
)

py_library(
    name = "ingester",
```
Contributor

I'm a bit reluctant to grow yet another top-level module. Could this go under `data`, perhaps? Especially since we're prefixing the ABC name with `Data`?

Contributor Author

Ah, yeah, sorry, forgot you didn’t like these. `data` is fine.

```python
"""Creates a data ingester from flags.

Args:
    flags: An argparse.Namespace containing TensorBoard CLI flags.
```
Contributor

So I realize this is my fault for doing this in `LocalDataIngester`, but per #2801, can I petition not to formalize this API as another untyped flags-passing mechanism? The issue provides some of the rationale, but happy to discuss further if you aren't sold.

I think to encapsulate this in a fully generic way, we'd probably want to adopt the same `PluginLoader`/`Plugin` model that the above issue discusses, where the former is responsible for defining a set of flags, extracting them from the `argparse.Namespace`, and constructing the latter according to its interpretation of the flags.

That might be a bit heavyweight, especially if we don't ever expect to have more data ingesters than these two, so perhaps it'd be easier to decree `program` to be the "owner" of all the data-ingestion-related flags and have it do the interpretation. This is easy for `ServerDataIngester`, which has only one flag anyway; for `LocalDataIngester`, we could refactor the constructor into some kind of helper function that maps flags to kwargs, or to some kind of typed `Options` struct that we then use to create `LocalDataIngester`.

WDYT?

Contributor Author

Yeah, that seems like a reasonable request to me. I’ll drop the
abstraction over `__init__` and just let each one take the flags that it
needs, and let `program` do the dispatching, as you say. I think that’s
simplest, and feels clean enough as long as `program` owns them.

This is also nice because it affords splitting the server ingesters into
two smaller ingesters for the “existing server” and “spawn new
subprocess” cases.

I’ve left `LocalDataIngester` as taking `flags`, but this is just a
special case of “each class defines a typed interface” (the typed
interface for one of them happens to be “bag of junk”) rather than
formalizing the thing that we don’t want. Now the diff to that file is
trivial.
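The “typed Options struct” idea could look roughly like this (hypothetical names and fields, for illustration only):

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalIngesterOptions:
    # Hypothetical typed options distilled from the CLI flags.
    logdir: str
    reload_interval: float = 5.0


def options_from_flags(flags):
    # The flag owner does the interpretation; the ingester never
    # sees the raw argparse.Namespace.
    return LocalIngesterOptions(
        logdir=flags.logdir,
        reload_interval=getattr(flags, "reload_interval", 5.0),
    )


flags = argparse.Namespace(logdir="/tmp/logs", reload_interval=2.5)
opts = options_from_flags(flags)
```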


```python
@abc.abstractmethod
def start(self):
    """Starts ingesting data based on the ingester flag configuration.
```
Contributor

Maybe worth specifying that this starts ingesting forever in a separate thread or process, i.e., that it's non-blocking? If it were blocking, this presumably would never return.

Contributor Author

Good point. The server-starting ingester does wait for the server to
start, so I don’t want “non-blocking” to imply “returns immediately”,
but we can clarify that. (Also, it takes milliseconds, so not too bad.)

```python
    "Failed to load data:\n%s"
    % "\n".join(" - %s" % e for e in errors)
)
# Stash ingester so that it can avoid GCing Windows file handles.
```
Contributor

Can you point the reader at ServerDataIngester to learn more about this? I read this line first and had written a comment asking for elaboration before I read the longer (and very helpful) comment in that file. Probably should have thought to check there but a pointer would have helped reinforce that.

Contributor Author

Certainly. Don’t you love reading CPython source specifically to look
for undocumented platform-specific differences in process handling, and
then finding some?

Contributor

Truly there are no greater joys in life.


```python
@property
@abc.abstractmethod
def deprecated_multiplexer(self):
```
Contributor

FWIW, it seems strange to me to put this on the `DataIngester` API surface when we know it only ever applies to the one legacy ingester (and will be removed when we can clean this up). I'd rather that `program` just special-case `LocalDataIngester` so we can contain the contagion.

Contributor Author

Yeah, can do now that `program` handles dispatching.
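Concretely, the special-casing might look like this sketch (placeholder classes, not the PR's actual code):

```python
class LocalDataIngester:
    # Legacy ingester: the only one with a multiplexer.
    deprecated_multiplexer = object()


class ServerDataIngester:
    pass


def multiplexer_for(ingester):
    # Contain the contagion: only the legacy local ingester is
    # special-cased, so the DataIngester ABC stays multiplexer-free.
    if isinstance(ingester, LocalDataIngester):
        return ingester.deprecated_multiplexer
    return None
```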

```python
flags = make_flags(grpc_data_provider="localhost:6806")
ingester = server_ingester.ServerDataIngester(flags)
ingester.start()
ctx = context.RequestContext()
```
Contributor

unused?

Contributor Author

Indeed.

```python
def setUp(self):
    super().setUp()
    self.enter_context(
        mock.patch.object(grpc, "insecure_channel", autospec=True)
```
Contributor

I'd be a bit happier only doing this where we call start() and using the context manager normally, even if it's a little more per-test-method boilerplate, vs always patching even when it's not needed.

Contributor Author

Sure; done.
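For illustration, patching inside only the tests that call `start()` might look like this (using a stand-in namespace for the `grpc` module so the sketch is self-contained):

```python
import types
from unittest import mock

# Hypothetical stand-in for the real grpc module.
grpc = types.SimpleNamespace(
    insecure_channel=lambda target: ("channel", target)
)


def test_fixed_address():
    # Patch via a context manager inside the test, not in setUp().
    with mock.patch.object(grpc, "insecure_channel") as m:
        grpc.insecure_channel("localhost:6806")  # stands in for start()
    m.assert_called_once_with("localhost:6806")
    return m


m = test_fixed_address()
```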

```python
def test_fixed_address(self):
    flags = make_flags(grpc_data_provider="localhost:6806")
    ingester = server_ingester.ServerDataIngester(flags)
    ingester.start()
```
Contributor

I must be missing something, but how does this pass without raising `RuntimeError` if we're running the test without `--define=link_data_server=true`?

Contributor Author

When `--grpc_data_provider` is set, it connected to an already running
server rather than starting one. Mechanically, it hit the early return
on the first line of `start`, because the data provider was set in the
constructor.

But I have sliced the two-headed `ServerDataIngester` monster in half,
so this is clearer now.

```python
"--load_fast",
action="store_true",
help="""\
Experimental. Use a data server to accelerate loading.
```
Contributor

Probably worth adding info about this flag and the `--define` one (basically what you have in your PR description) to `tensorboard/data/server/DEVELOPMENT.md`?

Contributor Author

Sure; good idea.

wchargin-branch: grpc-ingester
wchargin-source: fba72285f3ec0868039033a74bc326a80ad45de6
@wchargin wchargin requested a review from nfelt December 3, 2020 01:48
wchargin-branch: grpc-ingester
wchargin-source: 6dac61afb74d6456f9b7521aa67e04e0fd39a729
wchargin added a commit that referenced this pull request Dec 5, 2020
Summary:
Not all data providers support tensors and blob sequences. The methods
are optional in the `DataProvider` interface, and (e.g.) the gRPC data
provider does not yet implement them, nor does the RustBoard server.
This patch teaches the time series dashboard to gracefully omit image
and histogram data if the needed data classes aren’t available.

Test Plan:
Point TensorBoard at an MNIST dataset with all three kinds of data, and
note that scalars appear if `--load_fast` is passed, and all data
appears otherwise. Prior to this patch, the requests would 500 under
`--load_fast`, and so nothing would be displayed at all.

(Cherry-pick #4411 for `--load_fast`.)

wchargin-branch: metrics-degrade-tensors-blobs
wchargin-source: 0b486e4639b83c309b626770ae46527ec0b5460b
wchargin added a commit that referenced this pull request Dec 5, 2020
Summary:
The mesh plugin’s `is_active` method had a hard dependency on the event
multiplexer, which is not always available, as `TBContext` documents.
In fact, the entire `is_active` method can just be `return False`, since
TensorBoard core will treat the `"mesh"` plugin as active when `"mesh"`
data is available, and that’s all that it was checking, anyway.

Test Plan:
The mesh plugin appears active in a logdir with mesh data and inactive
without it, and when launching with `--load_fast` (and thus no
multiplexer) there’s no longer any log spam about “Plugin listing:
is_active() for mesh failed”.

(Cherry-pick #4411 for `--load_fast`.)

wchargin-branch: mesh-isactive
wchargin-source: fa80676693f66b52489b19f8041adb62f62660de
Contributor

@nfelt nfelt left a comment

Thanks for the changes and extended tests!

@wchargin wchargin merged commit 9ad981d into master Dec 5, 2020
@wchargin wchargin deleted the wchargin-grpc-ingester branch December 5, 2020 01:50