Conversation

Contributor

@wchargin wchargin commented Dec 1, 2020

Summary:
TensorBoard can now start RustBoard if `--load_fast` is given at runtime
and `--define=link_data_server=true` was given at build time. The
default configuration still has no Rust code, so our Pip packages are
still portable. When we actually deploy this, we can distribute the Rust
binary in a separate Pip package that TensorBoard imports, but the UX
can stay the same: “just add `--load_fast`.”

Test Plan:
Test in the following configurations:

  - with just a `--logdir`, everything works as normal;
  - with `--define=link_data_server=true -- --logdir ... --load_fast`,
    the subprocess is spawned and cleaned up on a successful exit or a
    SIGKILL to TensorBoard, and INFO logs are shown iff `--verbosity 0`
    is specified;
  - with `--logdir ... --load_fast` but no `--define`, TensorBoard fails
    to start the server and prints a message before exiting; and
  - with `--grpc_data_provider localhost:6806`, TensorBoard connects to
    an existing server without needing `--logdir` or `--define`.

To test the “data server died” case, comment out the `--logdir=%s` flag,
which will cause the server to fail with a usage message. That message
should appear in the logs.

To test the polling, add `thread::sleep(Duration::from_secs(3))` before
the server writes its port file, and run with `--verbosity 0` to note
the “Polling for data server port” messages.

This also works after syncing into Google, in all relevant
configurations; see http://cl/344900410 and http://cl/344955833.

wchargin-branch: grpc-ingester

Summary:
If `tensorboard` is to launch RustBoard as a subprocess rather than
relying on the user to launch it concurrently, then we should try hard
not to leave around a zombie server. An easy solution is an `atexit`
handler in Python TensorBoard, but this doesn’t handle the case when
TensorBoard crashes or gets SIGTERMed. On Linux, we can be notified when
our parent dies by setting the parent-death signal (`man 2 prctl`), but
this isn’t portable. On portable Unix, the child can poll its PPID to
see when it changes to `1` or the PID of some subreaper, but this isn’t
portable to Windows, which doesn’t update PPID when the parent dies.
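For concreteness, the parent-death-signal approach might look like this sketch in Python (illustrative only; the real server is Rust, and this mechanism is Linux-specific):

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # constant from <linux/prctl.h>


def set_parent_death_signal(sig=signal.SIGTERM):
    """Best-effort: ask the kernel to send `sig` when our parent dies.

    Linux-only (`man 2 prctl`); returns None on other platforms.
    """
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL("libc.so.6", use_errno=True)
    except OSError:
        return None  # e.g., musl-based systems name libc differently
    return libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0)


rc = set_parent_death_signal()  # 0 on success, None if unsupported
```

One more wrinkle: the death signal fires when the parent *thread* that spawned the child exits, not just the whole process, which adds to the portability headaches.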

Windows is not my strong suit, but some web searching didn’t turn up an
easy and clean solution portable to both Windows and Unix (any Windows
internals like “job objects” are a non-starter, since I can’t test
them). The one thing that I would expect to work everywhere is “just
wait until stdin closes and then exit”. If even that doesn’t work on
Windows, well, we can burn that bridge when we come to it.

Test Plan:
Build `bazel build //tensorboard/data/server`, then write a simple
Python driver:

```python
import subprocess
import time

# Path to the freshly built data server binary.
server = "bazel-bin/tensorboard/data/server/server"
p = subprocess.Popen(
    [server, "--logdir", "/tmp/nonexistent", "-v", "--die-after-stdin"],
    stdin=subprocess.PIPE,  # the server exits when this pipe closes
)
print(p.pid)
time.sleep(2)  # the pipe closes when this script exits
```

Run it with `python test.py`, and note that after 2 seconds the server
prints “Stdin closed; exiting”. Run it again, and suspend (`^Z`) the
process before it finishes sleeping. Run `kill -SIGCONT CHILD_PID` with
the PID printed by the Python script to resume server execution; this is
just needed because your `^Z` propagates to all processes in the group.
Note that the server continues to print logs as it starts and finishes
new load cycles. Then, run `fg` and wait for the sleep to complete, and
note that the server again exits, as desired.

wchargin-branch: rust-die-after-stdin
wchargin-source: ad925e07a2106ee37d268e32c1b997606326ae99

Summary:
TensorBoard can now start RustBoard if `--load_fast` is given at runtime
and `--define=link_data_server=true` was given at build time. The
default configuration still has no Rust code, so our Pip packages are
still portable. When we actually deploy this, we can distribute the Rust
binary in a separate Pip package that TensorBoard talks to via [entry
points], but the UX can stay the same: “just add `--load_fast`.”

[entry points]: https://packaging.python.org/specifications/entry-points/

Test Plan:
Test in the following configurations:

  - with just a `--logdir`, everything works as normal;
  - with `--define=link_data_server=true -- --logdir ... --load_fast`,
    the subprocess is spawned and cleaned up on a successful exit or a
    SIGKILL to TensorBoard, and INFO logs are shown iff `--verbosity 0`
    is specified;
  - with `--logdir ... --load_fast` but no `--define`, TensorBoard fails
    to start the server and prints a message before exiting; and
  - with `--grpc_data_provider localhost:6806`, TensorBoard connects to
    an existing server without needing `--logdir` or `--define`.

To test the “data server died” case, comment out the `--logdir=%s` flag,
which will cause the server to fail with a usage message. That message
should appear in the logs.

This also works after syncing into Google, in all relevant
configurations; see <http://cl/344900410> and <http://cl/344955833>.

wchargin-branch: grpc-ingester
wchargin-source: 57ed441a34d496055b4b75e6cd935f63b8ae2e09
@wchargin wchargin added core:backend core:rustboard //tensorboard/data/server/... labels Dec 1, 2020
@wchargin wchargin requested a review from nfelt December 1, 2020 07:19
@google-cla google-cla bot added the cla: yes label Dec 1, 2020
Comment on lines 58 to 61

```python
def test_load_fast(self):
    flags = make_flags(logdir="/some/logdir", load_fast=True)
    ingester = server_ingester.ServerDataIngester(flags)
    # Not much more that we can easily test here.
```
Contributor Author

Interested in your thoughts here? I could go to various degrees of
trouble to create fake executables that interact with stdout in various
ways, but it would be a fair amount of trouble, and it’s not obvious to
me that it’s worth it. I found it worth noting that your local data
ingester has not been touched since its creation. This case is different
because I’ll at least be forwarding things like `samples_per_plugin` as
we add support for them, but it’s still a useful indicator.

Feel free to send a quick review with just “more tests here, please”.

Contributor

My thoughts: I probably wouldn't bother creating a fake executable; if we're going to launch a process, it might as well be the real one, making this an integration test. I think that could be useful, but IMO not essential, especially since this is still experimental.

For testing flag propagation alone, I'd consider just mocking `subprocess.Popen` and checking that we passed the right args. This seems like a reasonable case for mocking: it's less of a pure implementation detail, since it's written against the CLI of the server binary, and the server is a relatively heavyweight component to replicate with an accurate fake or the real thing.
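A minimal sketch of that mocking approach (names like `launch_server` are hypothetical, not the PR's actual code):

```python
import subprocess
from unittest import mock


def launch_server(logdir):
    # Stand-in for the code under test, which spawns the data server.
    return subprocess.Popen(["server", "--logdir", logdir])


with mock.patch.object(subprocess, "Popen", autospec=True) as popen:
    launch_server("/tmp/logs")

# The test then asserts on the exact argv, i.e. the server's CLI.
popen.assert_called_once_with(["server", "--logdir", "/tmp/logs"])
```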

Contributor Author

Problem with integration testing is that we don’t have a real server
here within Google. (I guess we could just disable the test internally.)
Agreed on “useful but not essential”.

Okay, we can use mocking here.

Base automatically changed from wchargin-rust-die-after-stdin to master December 1, 2020 19:28
wchargin-branch: grpc-ingester
wchargin-source: 203ccff9eb6d57a90f94e3383c8162a5586cf658
wchargin-branch: grpc-ingester
wchargin-source: bc101e5cd0b10ed0764d358a28d2397948842ace
```
)

py_library(
    name = "ingester",
```
Contributor

I'm a bit reluctant to grow yet another top-level module. Could this go under `data`, perhaps? Especially since we're prefixing the ABC name with `Data`?

Contributor Author

Ah, yeah, sorry, forgot you didn’t like these. `data` is fine.

```python
"""Creates a data ingester from flags.

Args:
    flags: An argparse.Namespace containing TensorBoard CLI flags.
```
Contributor

So I realize this is my fault for doing this in `LocalDataIngester`, but per #2801, can I petition not to formalize this API as another untyped flags-passing mechanism? The issue provides some of the rationale, but happy to discuss further if you aren't sold.

I think to encapsulate this in a fully generic way, we'd probably want to adopt the same `PluginLoader`/`Plugin` model that the above issue discusses, where the former is responsible for defining a set of flags, extracting them from the `argparse.Namespace`, and constructing the latter according to its interpretation of the flags.

That might be a bit heavyweight, especially if we don't ever expect to have more data ingesters than these two, so perhaps it'd be easier to decree `program` to be the "owner" of all the data-ingestion-related flags and have it do the interpretation. This is easy for `ServerDataIngester`, which has only one flag anyway; for `LocalDataIngester`, we could refactor the constructor into some kind of helper function that maps flags to kwargs, or to some kind of typed `Options` struct that we then use to create `LocalDataIngester`.

WDYT?

Contributor Author

Yeah, that seems like a reasonable request to me. I’ll drop the
abstraction over `__init__` and just let each one take the flags that it
needs, and let `program` do the dispatching, as you say. I think that’s
simplest, and feels clean enough as long as `program` owns them.

This is also nice because it affords splitting the server ingesters into
two smaller ingesters for the “existing server” and “spawn new
subprocess” cases.

I’ve left `LocalDataIngester` as taking `flags`, but this is just a
special case of “each class defines a typed interface” (the typed
interface for one of them happens to be “bag of junk”) rather than
formalizing the thing that we don’t want. Now the diff to that file is
trivial.
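The “typed Options struct” idea could look roughly like this (hypothetical names and fields, for illustration only):

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalIngesterOptions:
    # Hypothetical typed options distilled from the CLI flags.
    logdir: str
    reload_interval: float = 5.0


def options_from_flags(flags):
    # The flag owner does the interpretation; the ingester never
    # sees the raw argparse.Namespace.
    return LocalIngesterOptions(
        logdir=flags.logdir,
        reload_interval=getattr(flags, "reload_interval", 5.0),
    )


flags = argparse.Namespace(logdir="/tmp/logs", reload_interval=2.5)
opts = options_from_flags(flags)
```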


```python
@abc.abstractmethod
def start(self):
    """Starts ingesting data based on the ingester flag configuration.
```
Contributor

Maybe worth specifying that this starts ingesting forever in a separate thread or process, i.e., that it's non-blocking? If it were blocking, this presumably would never return.

Contributor Author

Good point. The server-starting ingester does wait for the server to
start, so I don’t want “non-blocking” to imply “returns immediately”,
but we can clarify that. (Also, it takes milliseconds, so not too bad.)

```python
    "Failed to load data:\n%s"
    % "\n".join(" - %s" % e for e in errors)
)
# Stash ingester so that it can avoid GCing Windows file handles.
```
Contributor

Can you point the reader at ServerDataIngester to learn more about this? I read this line first and had written a comment asking for elaboration before I read the longer (and very helpful) comment in that file. Probably should have thought to check there but a pointer would have helped reinforce that.

Contributor Author

Certainly. Don’t you love reading CPython source specifically to look
for undocumented platform-specific differences in process handling, and
then finding some?

Contributor

Truly there are no greater joys in life.


```python
@property
@abc.abstractmethod
def deprecated_multiplexer(self):
```
Contributor

FWIW, it seems strange to me to put this on the `DataIngester` API surface when we know it only ever applies to the one legacy ingester (and will be removed when we can clean this up). I'd rather that `program` just special-case `LocalDataIngester` so we can contain the contagion.

Contributor Author

Yeah, can do now that `program` handles dispatching.
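Concretely, the special-casing might look like this sketch (placeholder classes, not the PR's actual code):

```python
class LocalDataIngester:
    # Legacy ingester: the only one with a multiplexer.
    deprecated_multiplexer = object()


class ServerDataIngester:
    pass


def multiplexer_for(ingester):
    # Contain the contagion: only the legacy local ingester is
    # special-cased, so the DataIngester ABC stays multiplexer-free.
    if isinstance(ingester, LocalDataIngester):
        return ingester.deprecated_multiplexer
    return None
```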

```python
flags = make_flags(grpc_data_provider="localhost:6806")
ingester = server_ingester.ServerDataIngester(flags)
ingester.start()
ctx = context.RequestContext()
```
Contributor

unused?

Contributor Author

Indeed.

```python
def setUp(self):
    super().setUp()
    self.enter_context(
        mock.patch.object(grpc, "insecure_channel", autospec=True)
```
Contributor

I'd be a bit happier only doing this where we call start() and using the context manager normally, even if it's a little more per-test-method boilerplate, vs always patching even when it's not needed.

Contributor Author

Sure; done.
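For illustration, patching inside only the tests that call `start()` might look like this (using a stand-in namespace for the `grpc` module so the sketch is self-contained):

```python
import types
from unittest import mock

# Hypothetical stand-in for the real grpc module.
grpc = types.SimpleNamespace(
    insecure_channel=lambda target: ("channel", target)
)


def test_fixed_address():
    # Patch via a context manager inside the test, not in setUp().
    with mock.patch.object(grpc, "insecure_channel") as m:
        grpc.insecure_channel("localhost:6806")  # stands in for start()
    m.assert_called_once_with("localhost:6806")
    return m


m = test_fixed_address()
```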

```python
def test_fixed_address(self):
    flags = make_flags(grpc_data_provider="localhost:6806")
    ingester = server_ingester.ServerDataIngester(flags)
    ingester.start()
```
Contributor

I must be missing something, but how does this pass without raising `RuntimeError` if we're running the test without `--define=link_data_server=true`?

Contributor Author

When `--grpc_data_provider` is set, it connected to an already running
server rather than starting one. Mechanically, it hit the early return
on the first line of `start`, because the data provider was set in the
constructor.

But I have sliced the two-headed `ServerDataIngester` monster in half,
so this is clearer now.

```python
"--load_fast",
action="store_true",
help="""\
Experimental. Use a data server to accelerate loading.
```
Contributor

Probably worth adding info about this flag and the `--define` one (basically what you have in your PR description) to `tensorboard/data/server/DEVELOPMENT.md`?

Contributor Author

Sure; good idea.

wchargin-branch: grpc-ingester
wchargin-source: fba72285f3ec0868039033a74bc326a80ad45de6
@wchargin wchargin requested a review from nfelt December 3, 2020 01:48
wchargin-branch: grpc-ingester
wchargin-source: 6dac61afb74d6456f9b7521aa67e04e0fd39a729
wchargin added a commit that referenced this pull request Dec 5, 2020
Summary:
Not all data providers support tensors and blob sequences. The methods
are optional in the `DataProvider` interface, and (e.g.) the gRPC data
provider does not yet implement them, nor does the RustBoard server.
This patch teaches the time series dashboard to gracefully omit image
and histogram data if the needed data classes aren’t available.

Test Plan:
Point TensorBoard at an MNIST dataset with all three kinds of data, and
note that scalars appear if `--load_fast` is passed, and all data
appears otherwise. Prior to this patch, the requests would 500 under
`--load_fast`, and so nothing would be displayed at all.

(Cherry-pick #4411 for `--load_fast`.)

wchargin-branch: metrics-degrade-tensors-blobs
wchargin-source: 0b486e4639b83c309b626770ae46527ec0b5460b
wchargin added a commit that referenced this pull request Dec 5, 2020
Summary:
The mesh plugin’s `is_active` method had a hard dependency on the event
multiplexer, which is not always available, as `TBContext` documents.
In fact, the entire `is_active` method can just be `return False`, since
TensorBoard core will treat the `"mesh"` plugin as active when `"mesh"`
data is available, and that’s all that it was checking, anyway.

Test Plan:
The mesh plugin appears active in a logdir with mesh data and inactive
without it, and when launching with `--load_fast` (and thus no
multiplexer) there’s no longer any log spam about “Plugin listing:
is_active() for mesh failed”.

(Cherry-pick #4411 for `--load_fast`.)

wchargin-branch: mesh-isactive
wchargin-source: fa80676693f66b52489b19f8041adb62f62660de
Contributor

@nfelt nfelt left a comment

Thanks for the changes and extended tests!

@wchargin wchargin merged commit 9ad981d into master Dec 5, 2020
@wchargin wchargin deleted the wchargin-grpc-ingester branch December 5, 2020 01:50