rust: make RunLoader commit periodically while loading a large run #4604

nfelt · 2021-01-25T23:42:42Z

Addresses the online commits task from #4422.

RunLoader's reload() method now ensures that we commit all time series data staged from the run at a regular interval (which I've hardcoded to 5 seconds), rather than waiting until all event files in the run have been read to exhaustion before committing any data. I did a little refactoring to facilitate making this change which I've broken out into separate commits.

Based on a minimal amount of benchmarking against the ~20GiB event file from gs://tensorboard-bench-logs/nndiv_100m/events.out.tfevents.1549560561.hostname, it looks like this adds about 1% overhead in reading the file to completion (first is unmodified code; second is with this PR):

$ hyperfine -m 5 'bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/'
Benchmark #1: bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/
  Time (mean ± σ):     168.505 s ±  2.442 s    [User: 164.278 s, System: 4.212 s]
  Range (min … max):   166.091 s … 171.609 s    5 runs

$ hyperfine -m 5 'bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/'
Benchmark #1: bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/
  Time (mean ± σ):     171.296 s ±  2.586 s    [User: 167.006 s, System: 4.275 s]
  Range (min … max):   167.432 s … 174.439 s    5 runs

Test plan: ran against the aforementioned logdir and verified that we now see new data steadily appearing at 5 second intervals.

I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing reload() (or RunLoader at construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.

…orrowing them in helpers

wchargin

Lovely! Works on my machine. Some comments inline.

I’m okay with a ~1% overhead, especially since your μ±σ confidence
intervals overlap a bunch, and since this feature is clearly important
for non-small datasets.

(first is unmodified code; second is with this PR):

One trick is to copy the old and new binaries to ./bench-{old,new}
and then ask hyperfine to do a parameter sweep:

hyperfine '{tool} --logdir logs' -L tool ./bench-old,./bench-new

The main benefit is that it’ll give you a confidence interval around the
relative performance, something like “old is 1.02(±0.06)× faster”.

I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing reload() (or RunLoader at construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.

That’s about right. The standard technique often involves doing so with
type parameters such that it’s all optimized out: StageReservoir and
its ReservoirControl do this. As to clock objects specifically: I’ve
not done this, and I don’t know of common patterns, but I imagine that a
simple trait Clock { fn now(&self) -> Instant } would suffice here,
and let the test clock own an Rc<Instant> or AtomicI32 or something
that ticks forward each time that it’s called.

I’m okay with omitting dedicated tests for this. It seems unlikely to be
subtly broken, and the failure mode is not catastrophic (read-only, and
the right data still shows up, just later). Feel free to dive in if
you’re interested, but no action requested.

I did a little refactoring to facilitate making this change which I've broken out into separate commits.

Thanks; this was helpful. Feel free to merge however you want (rebase,
squash, separate PRs).

wchargin · 2021-01-26T01:03:46Z

tensorboard/data/server/run.rs


+/// Holds data staged by a RunLoader that will be commited to the Commit.
+#[derive(Debug)]
+struct RunLoaderData {


I like the idea of extracting this type.

When first writing this module, I considered separating RunLoader into
files and data and putting methods on data, to cut down on
free-floating functions taking a bunch of &mut arguments. It seems
like that’s about what you’re doing here, so, some observations:

You’ve written this semantically as “the data that will be
committed”, which makes sense. You could also write it more
structurally as “the data that isn’t in files, since that’s the
field into which we like to take long-lived mutable references”.
I.e.: you could include self.run and self.checksum here.

Doing so avoids the clone of run_name, which obviously isn’t a big
performance hit, but makes me wonder whether the structure might be
nice. It also simplifies the callback structure a bit (see below).

The free-standing fn commit_all and fn read_event functions are
nicely set up to be &mut self methods on RunLoaderData, which I
viewed as one of the benefits of this structure in the first place.

Thoughts?

Good point about how the functions could be methods, I hadn't thought that far. Can do.

Re: the scope of the data structure, I did briefly think about which fields to include and it seemed like a more semantically coherent boundary to define it as the data that will be committed. Defining as just "everything except files" felt odd to me even if functionally it's the most convenient. In particular even though cloning run is a bit ugly, semantically it just feels quite weird to make run a field on self.data instead of the top-level RunLoader, so I guess I preferred putting the weirdness inside the method bodies versus into the struct layout. Similarly, it seems like semantically self.checksum would be more naturally located alongside self.files since it only affects the file-reading half of RunLoader, not the data-buffering half.

I guess in the purest sense we want something like a tripartite structure in which we have "effectively constant metadata" like self.run and self.checksum grouped in one substructure, and and then two different mutable substructures, self.files and self.data, so that all three can be reborrowed independently from the parent struct, but that seems... overcomplicated? Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.

Thoughts?

Yep, that makes sense and seems reasonable to me. No strong opinion;
feel free to proceed as desired.

Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.

You may be interested in @nikomatsakis’s article on interprocedural
conflicts. (Context: “NLL” = “non-lexical lifetimes”, referring to
a set of changes to the borrow checker that taught it to accept more kinds of
valid code, making it much easier to write programs that compile.)

Converted the free functions into methods.

Yeah, that article pretty much sums up the understanding I had come to as well. The "View" thing is interesting, I could see that being fairly effective for a more complex situation but it definitely isn't lightweight. Thanks for sharing!

wchargin · 2021-01-26T01:30:15Z

tensorboard/data/server/run.rs

-        self.commit_all(run_data);
+        let mut n = 0;
+        let mut last_commit_time = Instant::now();
+        self.reload_files(|run_loader_data, event| {


Thoughts on inlining this callback into reload_files? I don’t mind
callbacks in general, but they are a bit more complicated than not using
them. If self.run moves to self.data.run, then you only need to pass
in self.reload_files(run_data, &start), which feels pretty reasonable
to me: it’s natural that reload_files should take the destination, and
start is just for instrumentation.

Then reload reads like:

let start = Instant::now(); self.update_file_set(filenames); self.reload_files(run_data, &run_name, &start); self.data.commit_all(run_data); // ^ maybe move this into `reload_files`, too? maybe; could go either way debug!( "Finished load for run {:?} ({:?})", self.data.run.0, start.elapsed() );

Hmm. So I arrived at this because I liked reload_files() better with the business logic of what to do with the event and how often to commit abstracted out of it. In particular part of my reasoning was that if we wanted to parameterize the 5 second interval or do something fancier than a fixed interval (like triggering based on the event details itself), that configuration would all need to be plumbed into reload_files() either as their own parameters or as fields on self, and encapsulating it via the callback felt prettier.

As a halfway solution, I originally did have commit_all() called in reload_files() and was trying to pass in some kind of smaller "check if I need to commit yet" predicate instead, but then I remembered that abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)

I'll grant it makes reload() a little messier; but I liked being able to read the sequence of commit steps and the timing details all in one function. So if you aren't particularly opposed I think I'm happier with it this way.

This also seems reasonable to me. Sold.

abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)

@chandlerc might contest that there are no zero-cost abstractions,
and that (Rust) closures have no run-time cost but do have human cost.
:-) But yeah, I’m okay with them here.

No such thing as a free lunch after all huh 😿

wchargin · 2021-01-26T01:35:31Z

tensorboard/data/server/run.rs

-    start_time: Option<WallTime>,
-
+    /// The run name associated with this loader. Used primarily for logging; the run name is
+    /// canonically defined by the map key under which this RunLoader is stored in LogdirLoader.


nit: Since Rustdoc is Markdown, it’s [convention][1] to use backticks
around code atoms like `RunLoader` and `Commit`. You can
also write doc crosslinks like [`RunLoader`], but that’s more at
your discretion as to whether it’s useful.

(Throughout.)

wchargin · 2021-01-26T01:36:06Z

tensorboard/data/server/run.rs

    Dead,
 }

+/// Holds data staged by a RunLoader that will be commited to the Commit.


sp.: s/commited/committed/

wchargin · 2021-01-26T01:53:00Z

tensorboard/data/server/run.rs

+    if data
+        .start_time
+        .map(|start| wall_time < start)
+        .unwrap_or(true)


Can be .map_or(true, |start| wall_time < start)? (Could have been
originally, but presumably I forgot; now, it makes the code short enough
to not have to spill onto four lines, so worth changing imho.)

wchargin · 2021-01-26T01:53:07Z

tensorboard/data/server/run.rs

+        .map(|start| wall_time < start)
+        .unwrap_or(true)
+    {
+        (*data).start_time = Some(wall_time);


Can just be data.start_time; no need to explicitly deref.

(i.e., C’s operator-> and operator. are basically both just ..)

Done. (I mechanically converted it to *data.start_time and got an operator precedence error and resolved it in the most obvious way, but yes this makes sense, thanks.)

wchargin · 2021-01-26T01:53:19Z

tensorboard/data/server/run.rs

+            data: RunLoaderData {
+                start_time: None,
+                time_series: HashMap::new(),
+            },


Optional: if you let RunLoaderData derive Default, then this is its
default value, so can be just data: RunLoaderData::default().

nfelt

Thanks for the review! Will address other comments after resolving the bigger questions.

nfelt · 2021-01-26T19:15:36Z

tensorboard/data/server/run.rs


+/// Holds data staged by a RunLoader that will be commited to the Commit.
+#[derive(Debug)]
+struct RunLoaderData {


Good point about how the functions could be methods, I hadn't thought that far. Can do.

Re: the scope of the data structure, I did briefly think about which fields to include and it seemed like a more semantically coherent boundary to define it as the data that will be committed. Defining as just "everything except files" felt odd to me even if functionally it's the most convenient. In particular even though cloning run is a bit ugly, semantically it just feels quite weird to make run a field on self.data instead of the top-level RunLoader, so I guess I preferred putting the weirdness inside the method bodies versus into the struct layout. Similarly, it seems like semantically self.checksum would be more naturally located alongside self.files since it only affects the file-reading half of RunLoader, not the data-buffering half.

I guess in the purest sense we want something like a tripartite structure in which we have "effectively constant metadata" like self.run and self.checksum grouped in one substructure, and and then two different mutable substructures, self.files and self.data, so that all three can be reborrowed independently from the parent struct, but that seems... overcomplicated? Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.

Thoughts?

nfelt · 2021-01-26T19:23:09Z

tensorboard/data/server/run.rs

-        self.commit_all(run_data);
+        let mut n = 0;
+        let mut last_commit_time = Instant::now();
+        self.reload_files(|run_loader_data, event| {


Hmm. So I arrived at this because I liked reload_files() better with the business logic of what to do with the event and how often to commit abstracted out of it. In particular part of my reasoning was that if we wanted to parameterize the 5 second interval or do something fancier than a fixed interval (like triggering based on the event details itself), that configuration would all need to be plumbed into reload_files() either as their own parameters or as fields on self, and encapsulating it via the callback felt prettier.

As a halfway solution, I originally did have commit_all() called in reload_files() and was trying to pass in some kind of smaller "check if I need to commit yet" predicate instead, but then I remembered that abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)

I'll grant it makes reload() a little messier; but I liked being able to read the sequence of commit steps and the timing details all in one function. So if you aren't particularly opposed I think I'm happier with it this way.

wchargin

GitHub is still showing this in my queue; not sure why. Responded here
earlier:

nfelt

PTAL

One trick is to copy the old and new binaries to ./bench-{old,new}
and then ask hyperfine to do a parameter sweep:
hyperfine '{tool} --logdir logs' -L tool ./bench-old,./bench-new
The main benefit is that it’ll give you a confidence interval around the
relative performance, something like “old is 1.02(±0.06)× faster”.

Thanks! Yeah, I realized after I had already run the first one that I could probably just copy the binaries into different places (unlike python...) but was too lazy to re-run it 😬

I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing reload() (or RunLoader at construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.

That’s about right. The standard technique often involves doing so with
type parameters such that it’s all optimized out: StageReservoir and
its ReservoirControl do this. As to clock objects specifically: I’ve
not done this, and I don’t know of common patterns, but I imagine that a
simple trait Clock { fn now(&self) -> Instant } would suffice here,
and let the test clock own an Rc<Instant> or AtomicI32 or something
that ticks forward each time that it’s called.

I’m okay with omitting dedicated tests for this. It seems unlikely to be
subtly broken, and the failure mode is not catastrophic (read-only, and
the right data still shows up, just later). Feel free to dive in if
you’re interested, but no action requested.

Thanks, that technique makes sense! I also agree with your last two points though, so if you're fine with it I'll omit the dedicated tests.

I did a little refactoring to facilitate making this change which I've broken out into separate commits.

Thanks; this was helpful. Feel free to merge however you want (rebase,
squash, separate PRs).

I'm fine just squashing if you are (mostly broke out commits for ease of review).

nfelt · 2021-01-27T03:23:26Z

tensorboard/data/server/run.rs

    Dead,
 }

+/// Holds data staged by a RunLoader that will be commited to the Commit.


nfelt · 2021-01-27T03:26:42Z

tensorboard/data/server/run.rs

+    if data
+        .start_time
+        .map(|start| wall_time < start)
+        .unwrap_or(true)


nfelt · 2021-01-27T03:26:50Z

tensorboard/data/server/run.rs

+        .map(|start| wall_time < start)
+        .unwrap_or(true)
+    {
+        (*data).start_time = Some(wall_time);


Done. (I mechanically converted it to *data.start_time and got an operator precedence error and resolved it in the most obvious way, but yes this makes sense, thanks.)

nfelt · 2021-01-27T03:34:09Z

tensorboard/data/server/run.rs

+            data: RunLoaderData {
+                start_time: None,
+                time_series: HashMap::new(),
+            },


nfelt · 2021-01-27T06:04:30Z

tensorboard/data/server/run.rs


+/// Holds data staged by a RunLoader that will be commited to the Commit.
+#[derive(Debug)]
+struct RunLoaderData {


Converted the free functions into methods.

Yeah, that article pretty much sums up the understanding I had come to as well. The "View" thing is interesting, I could see that being fairly effective for a more complex situation but it definitely isn't lightweight. Thanks for sharing!

nfelt · 2021-01-27T06:05:04Z

tensorboard/data/server/run.rs

-        self.commit_all(run_data);
+        let mut n = 0;
+        let mut last_commit_time = Instant::now();
+        self.reload_files(|run_loader_data, event| {


No such thing as a free lunch after all huh 😿

wchargin

Feels nice to enable “Hide whitespace changes” without worrying that I’m
missing a load-bearing control flow bug!

wchargin · 2021-01-27T08:58:29Z

tensorboard/data/server/run.rs

+    pub fn new(run: Run) -> Self {
        Self {
-            start_time: None,
+            run: run,


Clippy is warning that this can be punned to just run:
https://github.com/tensorflow/tensorboard/pull/4604/checks?check_run_id=1774686429

wchargin · 2021-01-27T09:07:29Z

tensorboard/data/server/run.rs

-    start_time: Option<WallTime>,
-
+    /// The run name associated with this loader. Used primarily for logging; the run name is
+    /// canonically defined by the map key under which this `RunLoader` is stored in LogdirLoader.


nit if you’re already updating: `LogdirLoader`

I thought I got'em all but no. Thanks.

nfelt added 6 commits January 25, 2021 15:11

propagate --verbosity=debug to tensorboard-data-server

e9e5186

log how long RunLoader.reload() takes and add Run field for that purpose

a65c632

refactor RunLoader to encapsulate data-staging fields to simplify reb…

5fac477

…orrowing them in helpers

refactor commit_all() to be standalone function

a30e5e4

refactor reload_files to parameterize how event is handled

14c1cfa

extend handle_event hook to commit every five seconds during run loading

106a082

nfelt added the core:rustboard //tensorboard/data/server/... label Jan 25, 2021

nfelt requested a review from wchargin January 25, 2021 23:42

google-cla bot added the cla: yes label Jan 25, 2021

Base automatically changed from nfelt-rust-propagate-verbosity-debug to master January 26, 2021 01:19

wchargin reviewed Jan 26, 2021

View reviewed changes

wchargin mentioned this pull request Jan 26, 2021

RustBoard tasks #4422

Closed

34 tasks

nfelt commented Jan 26, 2021

View reviewed changes

nfelt requested a review from wchargin January 26, 2021 19:24

wchargin reviewed Jan 26, 2021

View reviewed changes

nfelt added 2 commits January 26, 2021 19:33

cr: misc small fixes

d149838

cr: convert commit_all() and read_event() to be methods on RunLoaderData

69625de

nfelt commented Jan 27, 2021

View reviewed changes

nfelt requested a review from wchargin January 27, 2021 06:12

wchargin approved these changes Jan 27, 2021

View reviewed changes

cr: clippppppppyyyyyy

30fae46

nfelt merged commit ae1a867 into master Jan 27, 2021

nfelt deleted the nfelt-rust-online-commits branch January 27, 2021 20:12

rust: make RunLoader commit periodically while loading a large run #4604

rust: make RunLoader commit periodically while loading a large run #4604

Uh oh!

Conversation

nfelt commented Jan 25, 2021

Uh oh!

wchargin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin Jan 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin Jan 26, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

nfelt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin left a comment

Choose a reason for hiding this comment

Uh oh!

nfelt left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wchargin left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

wchargin Jan 26, 2021 •

edited

Loading

wchargin Jan 26, 2021 •

edited

Loading