Skip to content

Conversation

@nfelt
Copy link
Contributor

@nfelt nfelt commented Jan 25, 2021

Addresses the online commits task from #4422.

RunLoader's reload() method now ensures that we commit all time series data staged from the run at a regular interval (which I've hardcoded to 5 seconds), rather than waiting until all event files in the run have been read to exhaustion before committing any data. I did a little refactoring to facilitate making this change which I've broken out into separate commits.

Based on a minimal amount of benchmarking against the ~20GiB event file from gs://tensorboard-bench-logs/nndiv_100m/events.out.tfevents.1549560561.hostname, it looks like this adds about 1% overhead in reading the file to completion (first is unmodified code; second is with this PR):

$ hyperfine -m 5 'bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/'
Benchmark #1: bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/
  Time (mean ± σ):     168.505 s ±  2.442 s    [User: 164.278 s, System: 4.212 s]
  Range (min … max):   166.091 s … 171.609 s    5 runs

$ hyperfine -m 5 'bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/'
Benchmark #1: bazel-bin/tensorboard/data/server/bench --logdir ~/scratch/tensorboard-bench-logs/
  Time (mean ± σ):     171.296 s ±  2.586 s    [User: 167.006 s, System: 4.275 s]
  Range (min … max):   167.432 s … 174.439 s    5 runs

Test plan: ran against the aforementioned logdir and verified that we now see new data steadily appearing at 5 second intervals.

I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing reload() (or RunLoader at construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.

@nfelt nfelt added the core:rustboard //tensorboard/data/server/... label Jan 25, 2021
@nfelt nfelt requested a review from wchargin January 25, 2021 23:42
@google-cla google-cla bot added the cla: yes label Jan 25, 2021
Base automatically changed from nfelt-rust-propagate-verbosity-debug to master January 26, 2021 01:19
Copy link
Contributor

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely! Works on my machine. Some comments inline.

I’m okay with a ~1% overhead, especially since your μ±σ confidence
intervals overlap a bunch, and since this feature is clearly important
for non-small datasets.

(first is unmodified code; second is with this PR):

One trick is to copy the old and new binaries to ./bench-{old,new}
and then ask hyperfine to do a parameter sweep:

hyperfine '{tool} --logdir logs' -L tool ./bench-old,./bench-new

The main benefit is that it’ll give you a confidence interval around the
relative performance, something like “old is 1.02(±0.06)× faster”.

I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing reload() (or RunLoader at construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.

That’s about right. The standard technique often involves doing so with
type parameters such that it’s all optimized out: StageReservoir and
its ReservoirControl do this. As to clock objects specifically: I’ve
not done this, and I don’t know of common patterns, but I imagine that a
simple trait Clock { fn now(&self) -> Instant } would suffice here,
and let the test clock own an Rc<Instant> or AtomicI32 or something
that ticks forward each time that it’s called.

I’m okay with omitting dedicated tests for this. It seems unlikely to be
subtly broken, and the failure mode is not catastrophic (read-only, and
the right data still shows up, just later). Feel free to dive in if
you’re interested, but no action requested.

I did a little refactoring to facilitate making this change which I've broken out into separate commits.

Thanks; this was helpful. Feel free to merge however you want (rebase,
squash, separate PRs).


/// Holds data staged by a RunLoader that will be commited to the Commit.
#[derive(Debug)]
struct RunLoaderData {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of extracting this type.

When first writing this module, I considered separating RunLoader into
files and data and putting methods on data, to cut down on
free-floating functions taking a bunch of &mut arguments. It seems
like that’s about what you’re doing here, so, some observations:

  • You’ve written this semantically as “the data that will be
    committed”, which makes sense. You could also write it more
    structurally as “the data that isn’t in files, since that’s the
    field into which we like to take long-lived mutable references”.
    I.e.: you could include self.run and self.checksum here.

    Doing so avoids the clone of run_name, which obviously isn’t a big
    performance hit, but makes me wonder whether the structure might be
    nice. It also simplifies the callback structure a bit (see below).

  • The free-standing fn commit_all and fn read_event functions are
    nicely set up to be &mut self methods on RunLoaderData, which I
    viewed as one of the benefits of this structure in the first place.

Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about how the functions could be methods, I hadn't thought that far. Can do.

Re: the scope of the data structure, I did briefly think about which fields to include and it seemed like a more semantically coherent boundary to define it as the data that will be committed. Defining as just "everything except files" felt odd to me even if functionally it's the most convenient. In particular even though cloning run is a bit ugly, semantically it just feels quite weird to make run a field on self.data instead of the top-level RunLoader, so I guess I preferred putting the weirdness inside the method bodies versus into the struct layout. Similarly, it seems like semantically self.checksum would be more naturally located alongside self.files since it only affects the file-reading half of RunLoader, not the data-buffering half.

I guess in the purest sense we want something like a tripartite structure in which we have "effectively constant metadata" like self.run and self.checksum grouped in one substructure, and and then two different mutable substructures, self.files and self.data, so that all three can be reborrowed independently from the parent struct, but that seems... overcomplicated? Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.

Thoughts?

Copy link
Contributor

@wchargin wchargin Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that makes sense and seems reasonable to me. No strong opinion;
feel free to proceed as desired.

Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.

You may be interested in @nikomatsakis’s article on interprocedural
conflicts
. (Context: “NLL” = “non-lexical lifetimes”, referring to
a set of changes to the borrow checker that taught it to accept more kinds of
valid code, making it much easier to write programs that compile.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted the free functions into methods.

Yeah, that article pretty much sums up the understanding I had come to as well. The "View" thing is interesting, I could see that being fairly effective for a more complex situation but it definitely isn't lightweight. Thanks for sharing!

self.commit_all(run_data);
let mut n = 0;
let mut last_commit_time = Instant::now();
self.reload_files(|run_loader_data, event| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on inlining this callback into reload_files? I don’t mind
callbacks in general, but they are a bit more complicated than not using
them. If self.run moves to self.data.run, then you only need to pass
in self.reload_files(run_data, &start), which feels pretty reasonable
to me: it’s natural that reload_files should take the destination, and
start is just for instrumentation.

Then reload reads like:

let start = Instant::now();

self.update_file_set(filenames);
self.reload_files(run_data, &run_name, &start);
self.data.commit_all(run_data);
// ^ maybe move this into `reload_files`, too? maybe; could go either way

debug!(
    "Finished load for run {:?} ({:?})",
    self.data.run.0,
    start.elapsed()
);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. So I arrived at this because I liked reload_files() better with the business logic of what to do with the event and how often to commit abstracted out of it. In particular part of my reasoning was that if we wanted to parameterize the 5 second interval or do something fancier than a fixed interval (like triggering based on the event details itself), that configuration would all need to be plumbed into reload_files() either as their own parameters or as fields on self, and encapsulating it via the callback felt prettier.

As a halfway solution, I originally did have commit_all() called in reload_files() and was trying to pass in some kind of smaller "check if I need to commit yet" predicate instead, but then I remembered that abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)

I'll grant it makes reload() a little messier; but I liked being able to read the sequence of commit steps and the timing details all in one function. So if you aren't particularly opposed I think I'm happier with it this way.

Copy link
Contributor

@wchargin wchargin Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also seems reasonable to me. Sold.

abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)

@chandlerc might contest that there are no zero-cost abstractions,
and that (Rust) closures have no run-time cost but do have human cost.
:-) But yeah, I’m okay with them here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No such thing as a free lunch after all huh 😿

start_time: Option<WallTime>,

/// The run name associated with this loader. Used primarily for logging; the run name is
/// canonically defined by the map key under which this RunLoader is stored in LogdirLoader.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Since Rustdoc is Markdown, it’s [convention][1] to use backticks
around code atoms like `RunLoader` and `Commit`. You can
also write doc crosslinks like [`RunLoader`], but that’s more at
your discretion as to whether it’s useful.

(Throughout.)

Dead,
}

/// Holds data staged by a RunLoader that will be commited to the Commit.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sp.: s/commited/committed/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if data
.start_time
.map(|start| wall_time < start)
.unwrap_or(true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be .map_or(true, |start| wall_time < start)? (Could have been
originally, but presumably I forgot; now, it makes the code short enough
to not have to spill onto four lines, so worth changing imho.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.map(|start| wall_time < start)
.unwrap_or(true)
{
(*data).start_time = Some(wall_time);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just be data.start_time; no need to explicitly deref.

(i.e., C’s operator-> and operator. are basically both just ..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. (I mechanically converted it to *data.start_time and got an operator precedence error and resolved it in the most obvious way, but yes this makes sense, thanks.)

Comment on lines 168 to 171
data: RunLoaderData {
start_time: None,
time_series: HashMap::new(),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: if you let RunLoaderData derive Default, then this is its
default value, so can be just data: RunLoaderData::default().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@wchargin wchargin mentioned this pull request Jan 26, 2021
34 tasks
Copy link
Contributor Author

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Will address other comments after resolving the bigger questions.


/// Holds data staged by a RunLoader that will be commited to the Commit.
#[derive(Debug)]
struct RunLoaderData {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about how the functions could be methods, I hadn't thought that far. Can do.

Re: the scope of the data structure, I did briefly think about which fields to include and it seemed like a more semantically coherent boundary to define it as the data that will be committed. Defining as just "everything except files" felt odd to me even if functionally it's the most convenient. In particular even though cloning run is a bit ugly, semantically it just feels quite weird to make run a field on self.data instead of the top-level RunLoader, so I guess I preferred putting the weirdness inside the method bodies versus into the struct layout. Similarly, it seems like semantically self.checksum would be more naturally located alongside self.files since it only affects the file-reading half of RunLoader, not the data-buffering half.

I guess in the purest sense we want something like a tripartite structure in which we have "effectively constant metadata" like self.run and self.checksum grouped in one substructure, and and then two different mutable substructures, self.files and self.data, so that all three can be reborrowed independently from the parent struct, but that seems... overcomplicated? Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.

Thoughts?

self.commit_all(run_data);
let mut n = 0;
let mut last_commit_time = Instant::now();
self.reload_files(|run_loader_data, event| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. So I arrived at this because I liked reload_files() better with the business logic of what to do with the event and how often to commit abstracted out of it. In particular part of my reasoning was that if we wanted to parameterize the 5 second interval or do something fancier than a fixed interval (like triggering based on the event details itself), that configuration would all need to be plumbed into reload_files() either as their own parameters or as fields on self, and encapsulating it via the callback felt prettier.

As a halfway solution, I originally did have commit_all() called in reload_files() and was trying to pass in some kind of smaller "check if I need to commit yet" predicate instead, but then I remembered that abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)

I'll grant it makes reload() a little messier; but I liked being able to read the sequence of commit steps and the timing details all in one function. So if you aren't particularly opposed I think I'm happier with it this way.

@nfelt nfelt requested a review from wchargin January 26, 2021 19:24
Copy link
Contributor

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

One trick is to copy the old and new binaries to ./bench-{old,new}
and then ask hyperfine to do a parameter sweep:

hyperfine '{tool} --logdir logs' -L tool ./bench-old,./bench-new

The main benefit is that it’ll give you a confidence interval around the
relative performance, something like “old is 1.02(±0.06)× faster”.

Thanks! Yeah, I realized after I had already run the first one that I could probably just copy the binaries into different places (unlike python...) but was too lazy to re-run it 😬

I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing reload() (or RunLoader at construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.

That’s about right. The standard technique often involves doing so with
type parameters such that it’s all optimized out: StageReservoir and
its ReservoirControl do this. As to clock objects specifically: I’ve
not done this, and I don’t know of common patterns, but I imagine that a
simple trait Clock { fn now(&self) -> Instant } would suffice here,
and let the test clock own an Rc<Instant> or AtomicI32 or something
that ticks forward each time that it’s called.

I’m okay with omitting dedicated tests for this. It seems unlikely to be
subtly broken, and the failure mode is not catastrophic (read-only, and
the right data still shows up, just later). Feel free to dive in if
you’re interested, but no action requested.

Thanks, that technique makes sense! I also agree with your last two points though, so if you're fine with it I'll omit the dedicated tests.

I did a little refactoring to facilitate making this change which I've broken out into separate commits.

Thanks; this was helpful. Feel free to merge however you want (rebase,
squash, separate PRs).

I'm fine just squashing if you are (mostly broke out commits for ease of review).

Dead,
}

/// Holds data staged by a RunLoader that will be commited to the Commit.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

if data
.start_time
.map(|start| wall_time < start)
.unwrap_or(true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.map(|start| wall_time < start)
.unwrap_or(true)
{
(*data).start_time = Some(wall_time);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. (I mechanically converted it to *data.start_time and got an operator precedence error and resolved it in the most obvious way, but yes this makes sense, thanks.)

Comment on lines 168 to 171
data: RunLoaderData {
start_time: None,
time_series: HashMap::new(),
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


/// Holds data staged by a RunLoader that will be commited to the Commit.
#[derive(Debug)]
struct RunLoaderData {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converted the free functions into methods.

Yeah, that article pretty much sums up the understanding I had come to as well. The "View" thing is interesting, I could see that being fairly effective for a more complex situation but it definitely isn't lightweight. Thanks for sharing!

self.commit_all(run_data);
let mut n = 0;
let mut last_commit_time = Instant::now();
self.reload_files(|run_loader_data, event| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No such thing as a free lunch after all huh 😿

@nfelt nfelt requested a review from wchargin January 27, 2021 06:12
Copy link
Contributor

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels nice to enable “Hide whitespace changes” without worrying that I’m
missing a load-bearing control flow bug!

pub fn new(run: Run) -> Self {
Self {
start_time: None,
run: run,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

start_time: Option<WallTime>,

/// The run name associated with this loader. Used primarily for logging; the run name is
/// canonically defined by the map key under which this `RunLoader` is stored in LogdirLoader.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit if you’re already updating: `LogdirLoader`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I got'em all but no. Thanks.

@nfelt nfelt merged commit ae1a867 into master Jan 27, 2021
@nfelt nfelt deleted the nfelt-rust-online-commits branch January 27, 2021 20:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla: yes core:rustboard //tensorboard/data/server/...

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants