program: add gRPC data ingester #4411
Summary:
If `tensorboard` is to launch RustBoard as a subprocess rather than
relying on the user to launch it concurrently, then we should try hard
not to leave around a zombie server. An easy solution is an `atexit`
handler in Python TensorBoard, but this doesn’t handle the case when
TensorBoard crashes or gets SIGTERMed. On Linux, we can be notified when
our parent dies by setting the parent-death signal (`man 2 prctl`), but
this isn’t portable. On portable Unix, the child can poll its PPID to
see when it changes to `1` or the PID of some subreaper, but this isn’t
portable to Windows, which doesn’t update PPID when the parent dies.
Windows is not my strong suit, but some web searching didn’t turn up an
easy and clean solution portable to both Windows and Unix (any Windows
internals like “job objects” are a non-starter, since I can’t test
them). The one thing that I would expect to work everywhere is “just
wait until stdin closes and then exit”. If even that doesn’t work on
Windows, well, we can burn that bridge when we come to it.
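As a rough illustration of the “wait until stdin closes and then exit” approach, here is a Python sketch (the real server is Rust, and the function names here are hypothetical, not the actual implementation):

```python
import io
import os
import sys
import threading


def watch_stdin(stream, on_close):
    """Block until `stream` hits EOF, then invoke `on_close`.

    When the parent process dies, its end of the child's stdin pipe is
    closed, so reads on the child's end start returning EOF.
    """
    while stream.read(1024):
        pass  # discard any input; we only care about reaching EOF
    on_close()


def install_die_after_stdin():
    # Hypothetical wiring: run the watcher on a daemon thread so that it
    # doesn't block the main loop, and exit hard once stdin closes.
    def _exit():
        print("Stdin closed; exiting", file=sys.stderr)
        os._exit(0)

    threading.Thread(
        target=watch_stdin, args=(sys.stdin.buffer, _exit), daemon=True
    ).start()
```

The same pattern works on both Unix and Windows because it relies only on pipe semantics, not on process-tree introspection.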
Test Plan:
Build `bazel build //tensorboard/data/server`, then write a simple
Python driver:
```python
import subprocess
import time
server = "bazel-bin/tensorboard/data/server/server"
p = subprocess.Popen(
    [server, "--logdir", "/tmp/nonexistent", "-v", "--die-after-stdin"],
    stdin=subprocess.PIPE,
)
print(p.pid)
time.sleep(2)
```
Run it with `python test.py`, and note that after 2 seconds the server
prints “Stdin closed; exiting”. Run it again, and suspend (`^Z`) the
process before it finishes sleeping. Run `kill -SIGCONT CHILD_PID` with
the PID printed by the Python script to resume server execution; this is
just needed because your `^Z` propagates to all processes in the group.
Note that the server continues to print logs as it starts and finishes
new load cycles. Then, run `fg` and wait for the sleep to complete, and
note that the server again exits, as desired.
wchargin-branch: rust-die-after-stdin
wchargin-source: ad925e07a2106ee37d268e32c1b997606326ae99
wchargin-branch: rust-die-after-stdin
wchargin-source: 0be687779c1709eedf2114e402a4aebe7f35a647
Summary:
TensorBoard can now start RustBoard if `--load_fast` is given at runtime
and `--define=link_data_server=true` was given at build time. The
default configuration still has no Rust code, so our Pip packages are
still portable. When we actually deploy this, we can distribute the Rust
binary in a separate Pip package that TensorBoard talks to via [entry
points], but the UX can stay the same: “just add `--load_fast`.”

[entry points]: https://packaging.python.org/specifications/entry-points/

Test Plan:
Test in the following configurations:

- with just a `--logdir`, everything works as normal;
- with `--define=link_data_server=true -- --logdir ... --load_fast`, the
  subprocess is spawned and cleaned up on a successful exit or a SIGKILL
  to TensorBoard, and INFO logs are shown iff `--verbosity 0` is
  specified;
- with `--logdir ... --load_fast` but no `--define`, TensorBoard fails
  to start the server and prints a message before exiting; and
- with `--grpc_data_provider localhost:6806`, TensorBoard connects to an
  existing server without needing `--logdir` or `--define`.

To test the “data server died” case, comment out the `--logdir=%s` flag,
which will cause the server to fail with a usage message. That message
should appear in the logs.

This also works after syncing into Google, in all relevant
configurations; see <http://cl/344900410> and <http://cl/344955833>.

wchargin-branch: grpc-ingester
wchargin-source: 57ed441a34d496055b4b75e6cd935f63b8ae2e09
```python
def test_load_fast(self):
    flags = make_flags(logdir="/some/logdir", load_fast=True)
    ingester = server_ingester.ServerDataIngester(flags)
    # Not much more that we can easily test here.
```
Interested in your thoughts here? I could go to various degrees of
trouble to create fake executables that interact with stdout in various
ways, but it would be a fair amount of trouble, and it’s not obvious to
me that it’s worth it. I find it worth noting that your local data
ingester has not been touched since its creation. This case is different
because I’ll at least be forwarding things like `samples_per_plugin` as
we add support for them, but it’s still a useful indicator.
Feel free to send a quick review with just “more tests here, please”.
My thoughts: I probably wouldn't bother creating a fake executable; if we're going to launch a process, it might as well be the real one, making this an integration test. I think that could be useful, but IMO it's not essential, especially since this is still experimental.
For testing flag propagation alone, I'd consider perhaps just mocking subprocess.Popen and checking that we passed the right args. This seems like a reasonable case to use mocking since it's less of a pure implementation detail given that it's written against the CLI for the server binary, and it's a relatively more heavyweight component to replicate with an accurate fake or the real thing.
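A sketch of that mocking approach might look like this (the `launch_server` helper and its flags are hypothetical stand-ins for the ingester's actual launch code, not the PR's API):

```python
import subprocess
from unittest import mock


def launch_server(server_binary, logdir):
    """Stand-in for the ingester's subprocess launch (illustrative)."""
    return subprocess.Popen(
        [server_binary, "--logdir", logdir],
        stdin=subprocess.PIPE,
    )


def test_flag_propagation():
    with mock.patch.object(subprocess, "Popen", autospec=True) as popen:
        launch_server("/path/to/server", "/some/logdir")
    # The mock records the exact argv we built, so we can assert on the
    # flags without ever running a real binary.
    (argv,), kwargs = popen.call_args
    assert argv[0] == "/path/to/server"
    assert "--logdir" in argv and "/some/logdir" in argv
    assert kwargs["stdin"] is subprocess.PIPE
```

Patching `subprocess.Popen` directly keeps the test hermetic while still exercising the argv-building logic that is written against the server binary's CLI.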
The problem with integration testing is that we don’t have a real server
here within Google. (I guess we could just disable the test internally.)
Agreed on “useful but not essential”.
Okay, we can use mocking here.
wchargin-branch: grpc-ingester
wchargin-source: 203ccff9eb6d57a90f94e3383c8162a5586cf658
wchargin-branch: grpc-ingester
wchargin-source: bc101e5cd0b10ed0764d358a28d2397948842ace
tensorboard/BUILD (Outdated)

```
py_library(
    name = "ingester",
```
I'm a bit reluctant to grow yet another top-level module. Could this go under data perhaps? Especially since we're prefixing the ABC name with Data?
Ah, yeah, sorry, forgot you didn’t like these. data is fine.
tensorboard/ingester.py (Outdated)

```python
"""Creates a data ingester from flags.

Args:
  flags: An argparse.Namespace containing TensorBoard CLI flags.
```
So I realize this is my fault for doing this in `LocalDataIngester`, but per #2801 can I petition not to formalize this API as another untyped flags-passing mechanism? The issue provides some of the rationale, but I'm happy to discuss further if you aren't sold.
I think that to encapsulate this in a fully generic way, we'd probably want to adopt the same PluginLoader/Plugin model as the above issue discusses, where the former is responsible for defining a set of flags, extracting them from the `argparse.Namespace`, and constructing the latter according to the interpretation of the flags.
That might be a bit heavyweight, especially if we don't ever expect to have more data ingesters than these two, so perhaps it'd be easier to do something like decree `program` to be the "owner" of all the data-ingestion-related flags and have it do the interpretation. This is easy for `ServerDataIngester`, which has only one flag anyway; for `LocalDataIngester` we could refactor the constructor into some kind of helper function that maps flags to kwargs, or to some kind of typed Options struct that we then use to create `LocalDataIngester`.
WDYT?
Yeah, that seems like a reasonable request to me. I’ll drop the
abstraction over `__init__` and just let each one take the flags that it
needs, and let `program` do the dispatching, as you say. I think that’s
simplest, and it feels clean enough as long as `program` owns them.
This is also nice because it affords splitting the server ingester into
two smaller ingesters for the “existing server” and “spawn new
subprocess” cases.
I’ve left `LocalDataIngester` as taking `flags`, but this is just a
special case of “each class defines a typed interface” (the typed
interface for this one happens to be “bag of junk”) rather than
formalizing the thing that we don’t want. Now the diff to that file is
trivial.
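The “typed Options struct” idea floated above might look roughly like this sketch (the class and field names here are hypothetical, not the PR’s actual API):

```python
import argparse
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalIngesterOptions:
    """Typed bag of only the flags that the local ingester reads."""

    logdir: str
    reload_interval: float = 5.0


def options_from_flags(flags):
    # `program` owns flag interpretation: it extracts what it needs from
    # the argparse.Namespace and hands each ingester a typed struct
    # instead of the whole untyped bag.
    return LocalIngesterOptions(
        logdir=flags.logdir,
        reload_interval=flags.reload_interval,
    )


flags = argparse.Namespace(logdir="/tmp/logs", reload_interval=30.0)
opts = options_from_flags(flags)
```

The frozen dataclass makes the ingester’s real dependencies explicit and keeps the untyped `Namespace` confined to `program`.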
tensorboard/ingester.py (Outdated)

```python
@abc.abstractmethod
def start(self):
    """Starts ingesting data based on the ingester flag configuration.
```
Maybe worth specifying that this starts ingesting forever in a separate thread or process, i.e., that it's non-blocking? If it were blocking, this presumably would never return.
Good point. The server-starting ingester does wait for the server to
start, so I don’t want “non-blocking” to imply “returns immediately”,
but we can clarify that. (Also, it takes milliseconds, so not too bad.)
```python
        "Failed to load data:\n%s"
        % "\n".join(" - %s" % e for e in errors)
    )
    # Stash ingester so that it can avoid GCing Windows file handles.
```
Can you point the reader at ServerDataIngester to learn more about this? I read this line first and had written a comment asking for elaboration before I read the longer (and very helpful) comment in that file. Probably should have thought to check there but a pointer would have helped reinforce that.
Certainly. Don’t you love reading CPython source specifically to look
for undocumented platform-specific differences in process handling, and
then finding some?
Truly there are no greater joys in life.
tensorboard/ingester.py (Outdated)

```python
@property
@abc.abstractmethod
def deprecated_multiplexer(self):
```
FWIW, it seems strange to me to put this on the DataIngester API surface when we know it's only ever going to apply to the one legacy one (and will be removed when we can clean this up). I'd rather that program just special-case LocalDataIngester for this so we can contain the contagion.
Yeah, can do now that program handles dispatching.
```python
flags = make_flags(grpc_data_provider="localhost:6806")
ingester = server_ingester.ServerDataIngester(flags)
ingester.start()
ctx = context.RequestContext()
```
unused?
Indeed.
```python
def setUp(self):
    super().setUp()
    self.enter_context(
        mock.patch.object(grpc, "insecure_channel", autospec=True)
```
I'd be a bit happier only doing this where we call start() and using the context manager normally, even if it's a little more per-test-method boilerplate, vs always patching even when it's not needed.
Sure; done.
```python
def test_fixed_address(self):
    flags = make_flags(grpc_data_provider="localhost:6806")
    ingester = server_ingester.ServerDataIngester(flags)
    ingester.start()
```
I must be missing something, but how does this pass without raising RuntimeError if we're running the test without doing --define=link_data_server=true?
When `--grpc_data_provider` is set, it connected to an already-running
server rather than starting one. Mechanically, it hit the early return
on the first line of `start`, because the data provider was set in the
constructor.
But I have sliced the two-headed `ServerDataIngester` monster in half,
so this is clearer now.
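For illustration, the pre-split control flow being described might be sketched like this (a hypothetical skeleton, not the PR's actual class):

```python
class ServerDataIngester:
    """Sketch of the two-mode behavior: connect to a fixed address if
    one was given, else spawn a subprocess (hypothetical names)."""

    def __init__(self, grpc_data_provider=None):
        # A fixed address set in the constructor means "ingestion" is
        # just connecting; no server needs to be linked into the build.
        self._address = grpc_data_provider

    def start(self):
        if self._address is not None:
            # Early return: no subprocess, so no RuntimeError even in a
            # build without --define=link_data_server=true.
            return self._connect(self._address)
        self._address = self._spawn_server()
        return self._connect(self._address)

    def _connect(self, address):
        return "connected to %s" % address

    def _spawn_server(self):
        raise RuntimeError("no data server linked into this build")
```

Splitting this into two classes (one per mode) removes the branch and makes each class's preconditions obvious.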
```python
"--load_fast",
action="store_true",
help="""\
Experimental. Use a data server to accelerate loading.
```
Probably worth adding info about this flag and the --define one (basically what you have in your PR description) to tensorboard/data/server/DEVELOPMENT.md?
Sure; good idea.
wchargin-branch: grpc-ingester
wchargin-source: fba72285f3ec0868039033a74bc326a80ad45de6
wchargin-branch: grpc-ingester
wchargin-source: 6dac61afb74d6456f9b7521aa67e04e0fd39a729
Summary:
Not all data providers support tensors and blob sequences. The methods
are optional in the `DataProvider` interface, and (e.g.) the gRPC data
provider does not yet implement them, nor does the RustBoard server.
This patch teaches the time series dashboard to gracefully omit image
and histogram data if the needed data classes aren’t available.

Test Plan:
Point TensorBoard at an MNIST dataset with all three kinds of data, and
note that scalars appear if `--load_fast` is passed, and all data
appears otherwise. Prior to this patch, the requests would 500 under
`--load_fast`, and so nothing would be displayed at all.

(Cherry-pick #4411 for `--load_fast`.)

wchargin-branch: metrics-degrade-tensors-blobs
wchargin-source: 0b486e4639b83c309b626770ae46527ec0b5460b
Summary:
The mesh plugin’s `is_active` method had a hard dependency on the event
multiplexer, which is not always available, as `TBContext` documents. In
fact, the entire `is_active` method can just be `return False`, since
TensorBoard core will treat the `"mesh"` plugin as active when `"mesh"`
data is available, and that’s all that it was checking, anyway.

Test Plan:
The mesh plugin appears active in a logdir with mesh data and inactive
without it, and when launching with `--load_fast` (and thus no
multiplexer) there’s no longer any log spam about “Plugin listing:
is_active() for mesh failed”.

(Cherry-pick #4411 for `--load_fast`.)

wchargin-branch: mesh-isactive
wchargin-source: fa80676693f66b52489b19f8041adb62f62660de
nfelt left a comment:
Thanks for the changes and extended tests!
Summary:
TensorBoard can now start RustBoard if `--load_fast` is given at runtime
and `--define=link_data_server=true` was given at build time. The
default configuration still has no Rust code, so our Pip packages are
still portable. When we actually deploy this, we can distribute the Rust
binary in a separate Pip package that TensorBoard imports, but the UX
can stay the same: “just add `--load_fast`.”

Test Plan:
Test in the following configurations:

- with just a `--logdir`, everything works as normal;
- with `--define=link_data_server=true -- --logdir ... --load_fast`, the
  subprocess is spawned and cleaned up on a successful exit or a SIGKILL
  to TensorBoard, and INFO logs are shown iff `--verbosity 0` is
  specified;
- with `--logdir ... --load_fast` but no `--define`, TensorBoard fails
  to start the server and prints a message before exiting; and
- with `--grpc_data_provider localhost:6806`, TensorBoard connects to an
  existing server without needing `--logdir` or `--define`.

To test the “data server died” case, comment out the `--logdir=%s` flag,
which will cause the server to fail with a usage message. That message
should appear in the logs.

To test the polling, add `thread::sleep(Duration::from_secs(3))` before
the server writes its port file, and run with `--verbosity 0` to note
the “Polling for data server port” messages.

This also works after syncing into Google, in all relevant
configurations; see <http://cl/344900410> and <http://cl/344955833>.

wchargin-branch: grpc-ingester