rust: implement ListScalars and ReadScalars #4386

Merged
wchargin merged 53 commits into master from wchargin-rust-listscalars-readscalars on Nov 25, 2020

Conversation

@wchargin (Contributor) commented Nov 24, 2020

Summary:
The Rust gRPC server now implements the RPCs required to serve scalar
data to the TensorBoard frontend.

There is some repetition in the implementation: ListScalars and
ReadScalars have very similar bodies. There will be more repetition
with the tensor RPCs, and to a lesser degree with those for blob
sequences. I plan to consider refactoring that later, once all the raw
materials are there, but I’m open to feedback.

Test Plan:
Unit tests included. For an end-to-end test, cherry-pick #4356, launch
the data server (`bazel run //tensorboard/data/server -- LOGDIR`), and
concurrently run TensorBoard with the `--grpc_data_provider localhost:6806`
and `--generic_data true` flags.
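
Both handlers follow the same basic shape: resolve the request's run and tag filters, then walk the committed runs and skip anything that wasn't asked for. Below is a minimal, self-contained sketch of that shared filtering pattern, using a stand-in `Filter` type and plain maps; the real handlers operate on the proto-generated request/response types and the server's own filter (whose `All`/`Just` variants appear in the review threads below).

```
use std::collections::{HashMap, HashSet};

/// Stand-in for the server's run/tag filter (hypothetical; the real type
/// lives in the server crate).
enum Filter<T> {
    All,
    Just(HashSet<T>),
}

impl<T: std::hash::Hash + Eq> Filter<T> {
    fn want(&self, item: &T) -> bool {
        match self {
            Filter::All => true,
            Filter::Just(which) => which.contains(item),
        }
    }
}

fn main() {
    // Both ListScalars and ReadScalars walk the committed runs, skipping the
    // ones the caller didn't ask for, then do the same per tag within a run.
    let runs: HashMap<&str, Vec<&str>> = HashMap::from([
        ("train", vec!["loss", "accuracy"]),
        ("eval", vec!["loss"]),
    ]);
    let run_filter = Filter::Just(HashSet::from(["train"]));
    for (run, tags) in &runs {
        if !run_filter.want(run) {
            continue;
        }
        println!("{run}: {tags:?}");
    }
}
```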

wchargin-branch: rust-listscalars-readscalars

Summary:
This module performs conversions from legacy on-disk data formats. It
will correspond to both the `data_compat` and `dataclass_compat` modules
from Python TensorBoard. For now, we have just one function, to
determine the initial summary metadata of a time series. And, for now,
we only implement this for scalars, to keep things simple.
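
As a rough illustration of the idea (a sketch only; the types and the proto plumbing are simplified stand-ins, not this module's actual code): given the first value of a time series, decide which plugin owns it, treating a legacy `simple_value` with no metadata as a scalar.

```
/// Simplified stand-ins for the proto-generated metadata/value types.
struct SummaryMetadata {
    plugin_name: String,
}

enum ValueData {
    SimpleValue(f32), // legacy TF 1.x scalar
    Other,
}

/// Sketch of the core decision for the scalars case.
fn initial_metadata(md: Option<SummaryMetadata>, data: &ValueData) -> SummaryMetadata {
    match (md, data) {
        // Metadata already present on the first event: trust it.
        (Some(md), _) => md,
        // Bare legacy scalar: start a scalar time series.
        (None, ValueData::SimpleValue(_)) => SummaryMetadata {
            plugin_name: "scalars".to_string(),
        },
        // Everything else is left unclaimed for now.
        (None, _) => SummaryMetadata {
            plugin_name: String::new(),
        },
    }
}

fn main() {
    let md = initial_metadata(None, &ValueData::SimpleValue(0.25));
    assert_eq!(md.plugin_name, "scalars");
    let other = initial_metadata(None, &ValueData::Other);
    assert!(other.plugin_name.is_empty());
}
```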

Test Plan:
Unit tests included.

wchargin-branch: rust-data-compat
wchargin-source: ef510b422b00a66a23efc1406947844fa8d1a798
Summary:
Our errors now have nice string formatting, easier `From` conversions,
and `std::error::Error` implementations.
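
The branch name suggests the `thiserror` crate; assuming that dependency, a minimal sketch of the pattern with hypothetical error variants (not this crate's actual error types):

```
use thiserror::Error;

/// Hypothetical error type; `#[derive(Error)]` supplies `Display`,
/// `std::error::Error`, and `From` conversions via `#[from]`.
#[derive(Debug, Error)]
enum ReadError {
    #[error("failed to read event file: {0}")]
    Io(#[from] std::io::Error),
    #[error("bad length checksum: got {got}, want {want}")]
    BadLengthCrc { got: u32, want: u32 },
}

fn read_len(buf: &[u8]) -> Result<u64, ReadError> {
    use std::io::Read;
    let mut cursor = std::io::Cursor::new(buf);
    let mut len = [0u8; 8];
    cursor.read_exact(&mut len)?; // io::Error converts automatically via `#[from]`
    Ok(u64::from_le_bytes(len))
}

fn main() {
    println!("{}", read_len(&[1, 2, 3]).unwrap_err());
    println!("{}", ReadError::BadLengthCrc { got: 1, want: 2 });
}
```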

Test Plan:
Included some tests for the `Display` implementations. They’re often not
necessary—one benefit of deriving traits is that you can be confident in
the implementation without manually testing it. But sometimes, if the
format string is non-trivial, it can be nice to actually see the full
text written out.

wchargin-branch: rust-use-thiserror
wchargin-source: 0f8009c3b96619f2eb4649353487dc996b45a712
Summary:
We’d defined a `Step` [newtype] in the `reservoir` module, but that type
will be useful more broadly. We thus move it into a new `types` module,
and define `WallTime` and `Tag` siblings, which we’ll need shortly.

[newtype]: https://doc.rust-lang.org/rust-by-example/generics/new_types.html
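
Roughly, the pattern looks like this (a sketch; the real definitions in the `types` module carry more derives and invariants):

```
/// Sketch of the newtypes described above (not the exact definitions).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
struct Step(i64);

#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct WallTime(f64); // f64 can't be `Eq`/`Ord`, so fewer derives here

#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Tag(String);

fn main() {
    // The wrappers keep steps, wall times, and tags from being mixed up,
    // even though they're all "just numbers and strings" on disk.
    assert!(Step(0) < Step(7));
    let _when = WallTime(1_606_262_400.25);
    let _tag = Tag("loss".to_string());
}
```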

Test Plan:
Most behavior is just `#[derive]`d; unit tests included for the rest.

wchargin-branch: rust-types-module
wchargin-source: b0f5475b247ef8f9cb8bf2e0bf94d2b648b8009e
…rigin/wchargin-rust-use-thiserror' and 'origin/wchargin-rust-types-module' into HEAD
wchargin-branch: rust-run-loader
wchargin-source: 35359d6e233139973edac8b2c1440cc48ed4e710
wchargin-branch: rust-data-compat
wchargin-source: 467974fa4e0e55dea0872bb1394a768a38811ac5
wchargin-branch: rust-run-loader
wchargin-source: d190e3d45c304d976a3fc14a73d99572a6ad377f
Summary:
This patch refactors `run`’s `StagePayload` type into an `EventValue`
type in `data_compat`. This way, we can define enriching conversions
from `EventValue`s to committed values in a layer that’s not specific to
the run loader.

We also implement `initial_metadata` for the `GraphDef` variant, for
symmetry and to make the abstraction clearer. The run loader still
doesn’t handle graphs.
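
A simplified sketch of the shape being described (stand-in names; the real `EventValue` wraps the proto-generated types): the run loader hands raw payloads over, and the enrichment into class-specific committed values happens in one place.

```
/// Stand-in for the raw payload of one event value (hypothetical fields).
enum EventValue {
    GraphDef(Vec<u8>), // serialized graph proto; not yet handled by the run loader
    SimpleValue(f32),  // legacy TF 1.x scalar
}

/// The committed, class-specific representation for the scalars plugin.
#[derive(Debug, PartialEq)]
struct ScalarValue(f32);

impl EventValue {
    /// Enrich a raw payload into a committed scalar, if it is one.
    fn into_scalar(self) -> Option<ScalarValue> {
        match self {
            EventValue::SimpleValue(v) => Some(ScalarValue(v)),
            EventValue::GraphDef(_) => None,
        }
    }
}

fn main() {
    assert_eq!(
        EventValue::SimpleValue(0.25).into_scalar(),
        Some(ScalarValue(0.25))
    );
    assert!(EventValue::GraphDef(vec![]).into_scalar().is_none());
}
```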

Test Plan:
Unit tests expanded for the new code, and continue to cover the old.

wchargin-branch: rust-promote-eventvalue
wchargin-source: 96af44556af064cf9ef933154ca3a522a8829798
wchargin-branch: rust-run-loader
wchargin-source: 170c77aae53df6716520df779e4838d3fc3b93b9

# Conflicts:
#	tensorboard/data/server/data_compat.rs
wchargin-branch: rust-run-loader
wchargin-source: 170c77aae53df6716520df779e4838d3fc3b93b9
wchargin-branch: rust-promote-eventvalue
wchargin-source: 106b4519aba1a966d8273b6b2509f1798131dc5a
Summary:
This patch introduces the *commit*, which is the link between loading
threads and serving threads. The commit structure is designed for
concurrency: many threads can read from it at once, and when the commit
is updated, a loader acquires exclusive access only to its own run. This
improves latency for both readers and writers.

The commit itself doesn’t have much logic; it’s mostly just the correct
composition of substructures, most notably the reservoir basins.

See also the RustBoard technical design doc (#4350) for discussion.
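
A rough illustration of that locking structure (field and type names here are assumptions, not the exact definitions in `commit.rs`):

```
use std::collections::HashMap;
use std::sync::RwLock;

/// Readers take the outer lock (and a per-run lock) in read mode; a loader
/// takes only its own run's lock in write mode.
struct Commit {
    runs: RwLock<HashMap<String, RwLock<RunData>>>,
}

#[derive(Default)]
struct RunData {
    scalars: HashMap<String, Vec<(i64, f64)>>, // tag -> (step, value) points
}

fn main() {
    let commit = Commit {
        runs: RwLock::new(HashMap::new()),
    };
    // A loader registers its run once...
    commit
        .runs
        .write()
        .unwrap()
        .insert("train".to_string(), RwLock::new(RunData::default()));
    // ...and thereafter only locks that run, so readers of other runs are
    // never blocked by it.
    let runs = commit.runs.read().unwrap();
    let mut train = runs.get("train").unwrap().write().unwrap();
    train
        .scalars
        .entry("loss".to_string())
        .or_default()
        .push((0, 0.69));
}
```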

Test Plan:
There’s one method on `commit::TimeSeries`; it has tests.

wchargin-branch: rust-commit-type
wchargin-source: d69b4e0034b68d1d8dadc6aa0f6dc3cc48efccde
Summary:
This patch integrates `RunLoader` and `Commit`. A run loader now commits
all changes at the end of its load cycle. This isn’t sufficient—we also
need it to periodically commit while it’s still loading. But it’s a good
start for now.

Much of the weight is actually in `data_compat`, which owns the logic
for “enriching” raw events into their class-specific representations.
This way, that logic can be reused: e.g., for an alternate consumer that
exposes a stream of all events, without downsampling.

Test Plan:
The new enrichment functionality has dedicated tests, and existing tests
for the run loader have been updated to operate on a `Commit` rather
than poking into the internals.

wchargin-branch: rust-commit-run-loader
wchargin-source: 1e0ae52eb9cd9ead6f53c9716306a5fcce178e06
wchargin-branch: rust-run-loader
wchargin-source: 793ff3e287c2571fd1bc06d5c1940ed3b3bdc796
wchargin-branch: rust-promote-eventvalue
wchargin-source: 432923653efb6fe768bdd507acac833672ec3e06
wchargin-branch: rust-commit-type
wchargin-source: 1850f510bac1f515fa4b7f73c1ec9a7a47fea283
wchargin-branch: rust-commit-run-loader
wchargin-source: 14baca16a82293a24a1b7f58869305f3b4ff7a6f

# Conflicts:
#	tensorboard/data/server/run.rs
wchargin-branch: rust-commit-run-loader
wchargin-source: 14baca16a82293a24a1b7f58869305f3b4ff7a6f
wchargin-branch: rust-promote-eventvalue
wchargin-source: dd7917e2bd9084f8ff048657d94572168a221a7e

# Conflicts:
#	tensorboard/data/server/data_compat.rs
#	tensorboard/data/server/run.rs
wchargin-branch: rust-promote-eventvalue
wchargin-source: dd7917e2bd9084f8ff048657d94572168a221a7e
wchargin-branch: rust-commit-type
wchargin-source: 38179a4b9af7538f133ee04f771ac895a8db21a9
wchargin-branch: rust-commit-run-loader
wchargin-source: f84d5842282dba99385fba34ece75f6ac7f82d2e
wchargin-branch: rust-commit-type
wchargin-source: b62b0669a50b5086a1f6af18effe62fbfd1e5431
wchargin-branch: rust-commit-run-loader
wchargin-source: a8130d52ce4b02636f59cab02ebef5cb27055fc9
wchargin-branch: rust-commit-run-loader
wchargin-source: a8130d52ce4b02636f59cab02ebef5cb27055fc9
wchargin-branch: rust-commit-run-loader
wchargin-source: 869544fd17ac2d740e60f1501b9540490d06906b

# Conflicts:
#	tensorboard/data/server/commit.rs
wchargin-branch: rust-commit-run-loader
wchargin-source: 869544fd17ac2d740e60f1501b9540490d06906b
wchargin-branch: rust-commit-run-loader
wchargin-source: e0ee69aea5631ccc6f2197922249d766aaa84cb6
wchargin-branch: rust-commit-run-loader
wchargin-source: e0ee69aea5631ccc6f2197922249d766aaa84cb6
wchargin-branch: rust-commit-run-loader
wchargin-source: 3f18bbb73a892785e6dfe5255b2831289ac3e789

# Conflicts:
#	tensorboard/data/server/run.rs
wchargin-branch: rust-logdir-loader-discovery
wchargin-source: 761b320491045f4ceb6ced9eb596242ba860e7a9
wchargin-branch: rust-logdir-loader-data
wchargin-source: 5f6ccd3318e76347bc3f07d78c459070fe15e65d
Summary:
This patch connects the `//tensorboard/data/server` main module to the
`Commit` and `LogdirLoader` infrastructure, and implements `ListRuns` as
a proof of concept.

For now, we just take a logdir path as a positional command line
argument. A future patch will bring nicer command-line argument parsing.

Test Plan:
Run both the data server and a copy of TensorBoard concurrently:

```
bazel run //tensorboard/data/server -- ~/tensorboard_data/mnist &
bazel run //tensorboard -- \
    --bind_all \
    --generic_data true \
    --grpc_data_provider localhost:6806 \
    ;
```

Then navigate to `localhost:6006/data/runs` and note that the response
includes a correct list of runs. Or, for a manual `grpc_cli` test, start
just the data server, and run:

```
$ grpc_cli --channel_creds_type=insecure \
>   --protofiles tensorboard/data/proto/data_provider.proto \
>   call localhost:6806 TensorBoardDataProvider.ListRuns '' 2>/dev/null
runs {
  name: "lr_1E-03,conv=1,fc=2"
  start_time: 1563406327
}
// etc.
```

Unit tests also included. Note that testing these data provider methods
is nice and easy: you just call them. No mocking or networking required.
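
For example, such a test is roughly of this shape (`sample_handler()` is a hypothetical fixture that builds a handler over a pre-loaded commit; the request/response types stand in for the tonic-generated ones):

```
#[tokio::test]
async fn list_runs_returns_loaded_runs() {
    let handler = sample_handler(); // hypothetical test fixture
    let req = tonic::Request::new(data::ListRunsRequest::default());
    let rsp = handler.list_runs(req).await.unwrap().into_inner();
    assert!(!rsp.runs.is_empty());
}
```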

wchargin-branch: rust-listruns
wchargin-source: eb471eca310905eb45d730936279b57bdae70927
Summary:
The Rust gRPC server now implements the RPCs required to serve scalar
data to the TensorBoard frontend.

There is some repetition in the implementation: `ListScalars` and
`ReadScalars` have very similar bodies. There will be more repetition
with the tensor RPCs, and to a lesser degree with those for blob
sequences. I plan to consider refactoring that later, once all the raw
materials are there, but I’m open to feedback.

Test Plan:
Unit tests included. For an end-to-end test, launch the data server
(`bazel run //tensorboard/data/server -- LOGDIR`) and concurrently
launch TensorBoard with the `--grpc_data_provider localhost:6806` and
`--generic_data true` flags.

wchargin-branch: rust-listscalars-readscalars
wchargin-source: 00ca5a4502bbb5e7a2f71fc2d48e262a60d69191
@wchargin wchargin added the core:rustboard //tensorboard/data/server/... label Nov 24, 2020
@wchargin wchargin requested a review from nfelt November 24, 2020 08:56
@google-cla google-cla bot added the cla: yes label Nov 24, 2020
wchargin-branch: rust-logdir-loader-discovery
wchargin-source: 44cd7a803ba04324ce11a892d8a8414792dae95c
wchargin-branch: rust-logdir-loader-data
wchargin-source: 29289362b8ff8294883bd381401613a6269e2ebc
wchargin-branch: rust-listruns
wchargin-source: 8187059ac28210897b1c90e4a31cd8237993ec4b
wchargin-branch: rust-listscalars-readscalars
wchargin-source: d13d984fbe01390aaeb19c9322969289f899661f
wchargin-branch: rust-logdir-loader-data
wchargin-source: ab88209f4c0f5874b7c0e80a7c670a8f7cef810d

# Conflicts:
#	tensorboard/data/server/logdir.rs
wchargin-branch: rust-logdir-loader-data
wchargin-source: ab88209f4c0f5874b7c0e80a7c670a8f7cef810d
wchargin-branch: rust-listruns
wchargin-source: bf3865887dddcdf516bc73806ea40f1f51ca0738
wchargin-branch: rust-listscalars-readscalars
wchargin-source: ad93125f2159cbc427c29dcbdbd3470d3247dcfb
wchargin-branch: rust-listruns
wchargin-source: 805e652280c9c72e2f1ea5cd0be2ae74de1538bb
wchargin-branch: rust-listscalars-readscalars
wchargin-source: bead71240b2e82e0b81ad8c11598b6a0ce9e8261
wchargin-branch: rust-listruns
wchargin-source: 4314f515f85db55f912bccc1ea0eb22078e6d783
wchargin-branch: rust-listscalars-readscalars
wchargin-source: 865b3367329b12a9fc915f1848c82d3a596260dc
```
@@ -172,9 +251,69 @@ impl TensorBoardDataProvider for DataProviderHandler {
}
}

/// Parses a request plugin filter. Returns the desired plugin name, or an error if that's empty.
fn parse_plugin_filter(pf: Option<data::PluginFilter>) -> Result<String, Status> {
let want_plugin = pf.map(|x| x.plugin_name).unwrap_or_default();
```

Contributor:

This could also be `pf.unwrap_or_default().plugin_name`, right? (akin to `parse_rtf` below). That seems somewhat more straightforward, and if I have Done The Thing correctly it looks like it optimizes to the same code: https://rust.godbolt.org/z/YWsrf3
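
For what it's worth, the two forms also agree on the empty case; a tiny self-contained check, with a stand-in struct in place of the generated proto type:

```
#[derive(Default)]
struct PluginFilter {
    plugin_name: String,
}

fn main() {
    let pf: Option<PluginFilter> = None;
    let a = pf.map(|x| x.plugin_name).unwrap_or_default();
    let pf: Option<PluginFilter> = None;
    let b = pf.unwrap_or_default().plugin_name;
    assert_eq!(a, b); // both are "" when the filter is absent
}
```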

```
@@ -286,6 +425,36 @@ mod tests {
);
}

/// Converts a list-of-lists back into a map, for easy assertions that don't depend on
```

Contributor:

So I get why we have this: to abstract over the various run-tag-indexed response types and avoid having to do something like `res.runs.iter().find(|r| r.run_name == "train").unwrap()` inside the test, which would certainly be ugly. That said, I'm a bit lukewarm about the map conversion approach too; it just feels a bit clunky to have to pass in the conversion specifier closures each time in order to unwrap each response, and it's a little hard for me to look at the call site of `unroll_map()` and immediately understand what the parameters are doing.

What if instead we tucked the cumbersome indexing expression inside a tests-only `Index` impl for the protobuf response types?

```
impl Index<&str> for data::ListScalarsResponse {
    type Output = data::list_scalars_response::RunEntry;
    fn index(&self, key: &str) -> &Self::Output {
        self.runs.iter().find(|r| r.run_name == key).unwrap()
    }
}
```

This would allow us in the test itself to just go `let train_run = &rsp.runs["train"]`.

It's true that we would need to repeat this boilerplate once per Run + Tag for each response type, but for test code I don't mind boilerplate so much, and we would only need to write the impls once and then no matter how many test methods we have they all get the nice concise lookup syntax.

Thoughts?

Contributor (author):

That’s quite clever. I will also note that it’s “barely” allowed:
the orphan rules say that you can only `impl Trait for Type` if either
`Trait` or `Type` is defined in your crate, to ensure trait coherence
(i.e., that there aren’t ever conflicting impls). And the Prost bindings
are in our crate, but I don’t view that as set in stone. It’s true
that that’s the formulation that typical `prost-build`/`tonic-build`
users will use, but a more Bazel-idiomatic workflow would probably put
each `proto_library` in its own crate with proper dependencies among
them. And indeed `rules_rust` provides `rust_proto_library` of
this form, and bazelbuild/rules_rust#479 implements similar
functionality for Prost/Tonic (written by a Googler).

So I’m a little hesitant to depend on this. We could newtype it, but
that starts getting less ergonomic.

Here’s an approach that avoids having to pass in the conversion
specifier closures, and removes the type parameters entirely. We take
advantage of the shared `run_name`/`tag_name` structure via a macro.
Very Pythonic. What do you think?
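
(A hypothetical sketch of that idea, with stand-in response types; the actual `run_tag_map!` in this PR may differ in detail. The point is to flatten any response that follows the shared `run_name`/`tag_name` layout into nested maps, so tests can index by run and tag without depending on ordering.)

```
use std::collections::HashMap;

// Stand-ins for the proto-generated response types (field names assumed).
struct Response {
    runs: Vec<RunEntry>,
}
struct RunEntry {
    run_name: String,
    tags: Vec<TagEntry>,
}
struct TagEntry {
    tag_name: String,
    value: f64,
}

/// Hypothetical sketch of a `run_tag_map!`-style macro.
macro_rules! run_tag_map {
    ($response:expr) => {{
        let mut out = HashMap::new();
        for run in $response.runs {
            let tags: HashMap<_, _> = run
                .tags
                .into_iter()
                .map(|t| (t.tag_name.clone(), t))
                .collect();
            out.insert(run.run_name, tags);
        }
        out
    }};
}

fn main() {
    let rsp = Response {
        runs: vec![RunEntry {
            run_name: "train".to_string(),
            tags: vec![TagEntry {
                tag_name: "loss".to_string(),
                value: 0.69,
            }],
        }],
    };
    let map = run_tag_map!(rsp);
    assert_eq!(map["train"]["loss"].value, 0.69);
}
```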

Contributor:

Heh, I'll take a "clever" :) Yeah, I realize it depends on the orphan rule; originally I was thinking it would just be a new test-only trait with `fn get()`, but then I realized `Index` could be used and makes it even pithier.

But I think the macro version is quite nice and addresses the main things I was lukewarm about with `unroll_map`, so I'm perfectly happy to keep it the way you have it now.

@wchargin (author), Nov 25, 2020:

Yeah, I would also be totally fine with a test-only extension trait (no
orphan rule issues since the trait is in-crate). And I would probably
use a macro to implement that trait for each response type.

I still might switch to that. It has type information earlier, so it’s
easier to return a `HashMap<Run, HashMap<Tag, V>>` given a mapping
function `FnMut(Self::TagEntry) -> V`. And that would be nice to avoid
the newly introduced `.as_ref()` dances. You can do it with the macro,
but it’s a bit annoying because of inference details. (Basically: you
can’t write `(|x| x.foo())(some_x)` without explicitly specifying the
type for `x`, but you can work around this by capturing `x`’s type, Java
style: `std::iter::once(some_x).map(|x| x.foo()).nth(0).unwrap()` is
one awkward way to do so.)

```
}),
..Default::default()
});
handler.read_scalars(req).await.unwrap();
```

Contributor:

Do we not want to actually assert anything on the response? Seems like it might not hurt to check that there are indeed no points.

Contributor (author):

Well, there are points, because we don’t actually downsample yet. :-)

We can assert on the keys, though, and I’ll add a draft assertion to
enable once downsampling is implemented.

```
});
res.runs.push(run);
for (run, data) in runs.iter() {
if !run_filter.want(run) {
```

Contributor:

FWIW, iterating over the filter instead of the keyset would seem likely to save effort on net (assuming single-run lookups against a big experiment are fairly plausible, while sending a filter with many non-existent runs is unlikely; I guess most optimal would be to iterate over the shorter one). Thoughts? It probably does make the code a little messier so maybe better to wait.

@wchargin (author), Nov 25, 2020:

Agreed. It does complicate the code. My first attempt is:

```
// call site: nice and easy, modulo `&*` (for the `RwLockReadGuard`)
for (run, data) in run_filter.pluck(&*runs) {
    // ...
}

// implementation
impl<T: Hash + Eq> Filter<T> {
    fn pluck<'f, 'm: 'f, V: 'm>(
        &'f self,
        map: &'m HashMap<T, V>,
    ) -> impl Iterator<Item = (&'m T, &'m V)> + 'f
    where
        T: Eq + Hash + 'm,
    {
        match self {
            Filter::All => Either::Left(map.iter()),
            Filter::Just(which) => {
                let it = which.iter().filter_map(move |k| map.get_key_value(k));
                Either::Right(it)
            }
        }
    }
}

enum Either<A, B> {
    Left(A),
    Right(B),
}

impl<A, B, I> Iterator for Either<A, B>
where
    A: Iterator<Item = I>,
    B: Iterator<Item = I>,
{
    type Item = I;

    fn next(&mut self) -> Option<Self::Item> {
        match self {
            Either::Left(i) => i.next(),
            Either::Right(i) => i.next(),
        }
    }
}
```

…where the `Either` layer is needed to form the output type of `pluck`,
since the iterators in the two branches are of different types: one is
a `hash_map::Iter<'m, T, V>`, and the other is something like
an `iter::FilterMap<hash_set::Iter<'f, T>, C>`, where `C` is the
anonymous type of the closure.

(edited: in an earlier version of this comment, I had the right
structure, but forgot to actually make the important change and iterate
over `which` rather than just iterating over the input map. The main
point stands.)

Another approach involves not using `Iterator::filter_map` and instead
just implementing our own iterator struct explicitly. This might be
simpler (I haven’t written it out). But it’s still not trivial.

Another approach uses `Box<dyn Iterator<Item = (&K, &V)>>` in place of
the `Either` layer, pushing more dispatching to runtime.

I suspect that generators would make this a lot easier, but they’re
not yet stable (and not really close, afaict).

It’s hard for me to write that and then claim with a straight face that
Rust reads like Python. Given that this only affects RPC times (not load
throughput) and that even “large” experiments probably won’t have more
than a few thousand entries in these hash maps, I’m inclined to punt, at
least until we can measure it.

Contributor:

Thanks for the detailed thoughts! Didn't mean to make this too much of a digression, totally fine to leave as just a potential TODO for now. Maybe if I have any particularly good ideas I'll take a stab at this myself :)

Contributor (author):

Yep, TODO included. And, sure, feel free to grab it. :-)

```
None => continue,
Some((step, _, _)) => step,
};
let max_wall_time = ts
```

Contributor:

Maybe worth a TODO to track this on `TimeSeries`? Would be nice not to have to rescan every single scalar wall time each time we hit `ListScalars`. I guess it's hard to get this entirely for "free" on `TimeSeries` due to preemptions, but still seems overall cheaper to compute it there (where we could take a cheap path if no preemption happened).
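
(For illustration, the kind of bookkeeping being suggested, as a hypothetical sketch rather than code from this PR: keep a running maximum as points are written, so `ListScalars` can read it without rescanning.)

```
struct TimeSeries {
    points: Vec<(i64, f64, f64)>, // (step, wall_time, value)
    max_wall_time: Option<f64>,
}

impl TimeSeries {
    fn push(&mut self, step: i64, wall_time: f64, value: f64) {
        self.points.push((step, wall_time, value));
        // Cheap path: maintain the max on every write. A preemption that
        // discards points would need to recompute it from what remains.
        self.max_wall_time = Some(match self.max_wall_time {
            Some(m) if m >= wall_time => m,
            _ => wall_time,
        });
    }
}

fn main() {
    let mut ts = TimeSeries { points: vec![], max_wall_time: None };
    ts.push(0, 100.0, 0.9);
    ts.push(1, 50.0, 0.8); // out-of-order wall times still handled
    assert_eq!(ts.max_wall_time, Some(100.0));
}
```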

Contributor (author):

Done (as TODO). Agreed on all counts.

Base automatically changed from wchargin-rust-listruns to master November 25, 2020 17:22
wchargin-branch: rust-listscalars-readscalars
wchargin-source: 4ce71f913bc11f4dbf262405d0e3603630a33358

# Conflicts:
#	tensorboard/data/server/server.rs
wchargin-branch: rust-listscalars-readscalars
wchargin-source: 4ce71f913bc11f4dbf262405d0e3603630a33358
wchargin-branch: rust-listscalars-readscalars
wchargin-source: 9989e5c9f2f0c7197a756c4169afa1a194d94193

@wchargin (author) left a comment:

PTAL re: `run_tag_map!`; the rest is mostly just done.

@wchargin wchargin requested a review from nfelt November 25, 2020 20:44

@wchargin wchargin merged commit 179370a into master Nov 25, 2020
@wchargin wchargin deleted the wchargin-rust-listscalars-readscalars branch November 25, 2020 22:39

Labels: cla: yes, core:rustboard //tensorboard/data/server/...