-
Notifications
You must be signed in to change notification settings - Fork 1.7k
rust: make RunLoader commit periodically while loading a large run #4604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…orrowing them in helpers
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely! Works on my machine. Some comments inline.
I’m okay with a ~1% overhead, especially since your μ±σ confidence
intervals overlap a bunch, and since this feature is clearly important
for non-small datasets.
(first is unmodified code; second is with this PR):
One trick is to copy the old and new binaries to ./bench-{old,new}
and then ask hyperfine to do a parameter sweep:
hyperfine '{tool} --logdir logs' -L tool ./bench-old,./bench-new
The main benefit is that it’ll give you a confidence interval around the
relative performance, something like “old is 1.02(±0.06)× faster”.
I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing
reload()(orRunLoaderat construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.
That’s about right. The standard technique often involves doing so with
type parameters such that it’s all optimized out: StageReservoir and
its ReservoirControl do this. As to clock objects specifically: I’ve
not done this, and I don’t know of common patterns, but I imagine that a
simple trait Clock { fn now(&self) -> Instant } would suffice here,
and let the test clock own an Rc<Instant> or AtomicI32 or something
that ticks forward each time that it’s called.
I’m okay with omitting dedicated tests for this. It seems unlikely to be
subtly broken, and the failure mode is not catastrophic (read-only, and
the right data still shows up, just later). Feel free to dive in if
you’re interested, but no action requested.
I did a little refactoring to facilitate making this change which I've broken out into separate commits.
Thanks; this was helpful. Feel free to merge however you want (rebase,
squash, separate PRs).
|
|
||
| /// Holds data staged by a RunLoader that will be commited to the Commit. | ||
| #[derive(Debug)] | ||
| struct RunLoaderData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of extracting this type.
When first writing this module, I considered separating RunLoader into
files and data and putting methods on data, to cut down on
free-floating functions taking a bunch of &mut arguments. It seems
like that’s about what you’re doing here, so, some observations:
-
You’ve written this semantically as “the data that will be
committed”, which makes sense. You could also write it more
structurally as “the data that isn’t infiles, since that’s the
field into which we like to take long-lived mutable references”.
I.e.: you could includeself.runandself.checksumhere.Doing so avoids the clone of
run_name, which obviously isn’t a big
performance hit, but makes me wonder whether the structure might be
nice. It also simplifies the callback structure a bit (see below). -
The free-standing
fn commit_allandfn read_eventfunctions are
nicely set up to be&mut selfmethods onRunLoaderData, which I
viewed as one of the benefits of this structure in the first place.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about how the functions could be methods, I hadn't thought that far. Can do.
Re: the scope of the data structure, I did briefly think about which fields to include and it seemed like a more semantically coherent boundary to define it as the data that will be committed. Defining as just "everything except files" felt odd to me even if functionally it's the most convenient. In particular even though cloning run is a bit ugly, semantically it just feels quite weird to make run a field on self.data instead of the top-level RunLoader, so I guess I preferred putting the weirdness inside the method bodies versus into the struct layout. Similarly, it seems like semantically self.checksum would be more naturally located alongside self.files since it only affects the file-reading half of RunLoader, not the data-buffering half.
I guess in the purest sense we want something like a tripartite structure in which we have "effectively constant metadata" like self.run and self.checksum grouped in one substructure, and and then two different mutable substructures, self.files and self.data, so that all three can be reborrowed independently from the parent struct, but that seems... overcomplicated? Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.
Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that makes sense and seems reasonable to me. No strong opinion;
feel free to proceed as desired.
Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.
You may be interested in @nikomatsakis’s article on interprocedural
conflicts. (Context: “NLL” = “non-lexical lifetimes”, referring to
a set of changes to the borrow checker that taught it to accept more kinds of
valid code, making it much easier to write programs that compile.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converted the free functions into methods.
Yeah, that article pretty much sums up the understanding I had come to as well. The "View" thing is interesting, I could see that being fairly effective for a more complex situation but it definitely isn't lightweight. Thanks for sharing!
| self.commit_all(run_data); | ||
| let mut n = 0; | ||
| let mut last_commit_time = Instant::now(); | ||
| self.reload_files(|run_loader_data, event| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on inlining this callback into reload_files? I don’t mind
callbacks in general, but they are a bit more complicated than not using
them. If self.run moves to self.data.run, then you only need to pass
in self.reload_files(run_data, &start), which feels pretty reasonable
to me: it’s natural that reload_files should take the destination, and
start is just for instrumentation.
Then reload reads like:
let start = Instant::now();
self.update_file_set(filenames);
self.reload_files(run_data, &run_name, &start);
self.data.commit_all(run_data);
// ^ maybe move this into `reload_files`, too? maybe; could go either way
debug!(
"Finished load for run {:?} ({:?})",
self.data.run.0,
start.elapsed()
);There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. So I arrived at this because I liked reload_files() better with the business logic of what to do with the event and how often to commit abstracted out of it. In particular part of my reasoning was that if we wanted to parameterize the 5 second interval or do something fancier than a fixed interval (like triggering based on the event details itself), that configuration would all need to be plumbed into reload_files() either as their own parameters or as fields on self, and encapsulating it via the callback felt prettier.
As a halfway solution, I originally did have commit_all() called in reload_files() and was trying to pass in some kind of smaller "check if I need to commit yet" predicate instead, but then I remembered that abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)
I'll grant it makes reload() a little messier; but I liked being able to read the sequence of commit steps and the timing details all in one function. So if you aren't particularly opposed I think I'm happier with it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also seems reasonable to me. Sold.
abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)
@chandlerc might contest that there are no zero-cost abstractions,
and that (Rust) closures have no run-time cost but do have human cost.
:-) But yeah, I’m okay with them here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No such thing as a free lunch after all huh 😿
tensorboard/data/server/run.rs
Outdated
| start_time: Option<WallTime>, | ||
|
|
||
| /// The run name associated with this loader. Used primarily for logging; the run name is | ||
| /// canonically defined by the map key under which this RunLoader is stored in LogdirLoader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Since Rustdoc is Markdown, it’s [convention][1] to use backticks
around code atoms like `RunLoader` and `Commit`. You can
also write doc crosslinks like [`RunLoader`], but that’s more at
your discretion as to whether it’s useful.
(Throughout.)
tensorboard/data/server/run.rs
Outdated
| Dead, | ||
| } | ||
|
|
||
| /// Holds data staged by a RunLoader that will be commited to the Commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sp.: s/commited/committed/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tensorboard/data/server/run.rs
Outdated
| if data | ||
| .start_time | ||
| .map(|start| wall_time < start) | ||
| .unwrap_or(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be .map_or(true, |start| wall_time < start)? (Could have been
originally, but presumably I forgot; now, it makes the code short enough
to not have to spill onto four lines, so worth changing imho.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tensorboard/data/server/run.rs
Outdated
| .map(|start| wall_time < start) | ||
| .unwrap_or(true) | ||
| { | ||
| (*data).start_time = Some(wall_time); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can just be data.start_time; no need to explicitly deref.
(i.e., C’s operator-> and operator. are basically both just ..)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. (I mechanically converted it to *data.start_time and got an operator precedence error and resolved it in the most obvious way, but yes this makes sense, thanks.)
tensorboard/data/server/run.rs
Outdated
| data: RunLoaderData { | ||
| start_time: None, | ||
| time_series: HashMap::new(), | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional: if you let RunLoaderData derive Default, then this is its
default value, so can be just data: RunLoaderData::default().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
nfelt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the review! Will address other comments after resolving the bigger questions.
|
|
||
| /// Holds data staged by a RunLoader that will be commited to the Commit. | ||
| #[derive(Debug)] | ||
| struct RunLoaderData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point about how the functions could be methods, I hadn't thought that far. Can do.
Re: the scope of the data structure, I did briefly think about which fields to include and it seemed like a more semantically coherent boundary to define it as the data that will be committed. Defining as just "everything except files" felt odd to me even if functionally it's the most convenient. In particular even though cloning run is a bit ugly, semantically it just feels quite weird to make run a field on self.data instead of the top-level RunLoader, so I guess I preferred putting the weirdness inside the method bodies versus into the struct layout. Similarly, it seems like semantically self.checksum would be more naturally located alongside self.files since it only affects the file-reading half of RunLoader, not the data-buffering half.
I guess in the purest sense we want something like a tripartite structure in which we have "effectively constant metadata" like self.run and self.checksum grouped in one substructure, and and then two different mutable substructures, self.files and self.data, so that all three can be reborrowed independently from the parent struct, but that seems... overcomplicated? Would be nice to have some way to help the borrow checker resolve this without as much reshuffling of the actual struct fields themselves.
Thoughts?
| self.commit_all(run_data); | ||
| let mut n = 0; | ||
| let mut last_commit_time = Instant::now(); | ||
| self.reload_files(|run_loader_data, event| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. So I arrived at this because I liked reload_files() better with the business logic of what to do with the event and how often to commit abstracted out of it. In particular part of my reasoning was that if we wanted to parameterize the 5 second interval or do something fancier than a fixed interval (like triggering based on the event details itself), that configuration would all need to be plumbed into reload_files() either as their own parameters or as fields on self, and encapsulating it via the callback felt prettier.
As a halfway solution, I originally did have commit_all() called in reload_files() and was trying to pass in some kind of smaller "check if I need to commit yet" predicate instead, but then I remembered that abstractions are supposed to be free in Rust so I figured hey maybe I can just get away with calling a callback closure on every event :)
I'll grant it makes reload() a little messier; but I liked being able to read the sequence of commit steps and the timing details all in one function. So if you aren't particularly opposed I think I'm happier with it this way.
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub is still showing this in my queue; not sure why. Responded here
earlier:
nfelt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
One trick is to copy the old and new binaries to
./bench-{old,new}
and then askhyperfineto do a parameter sweep:hyperfine '{tool} --logdir logs' -L tool ./bench-old,./bench-newThe main benefit is that it’ll give you a confidence interval around the
relative performance, something like “old is 1.02(±0.06)× faster”.
Thanks! Yeah, I realized after I had already run the first one that I could probably just copy the binaries into different places (unlike python...) but was too lazy to re-run it 😬
I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing
reload()(orRunLoaderat construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.That’s about right. The standard technique often involves doing so with
type parameters such that it’s all optimized out:StageReservoirand
itsReservoirControldo this. As to clock objects specifically: I’ve
not done this, and I don’t know of common patterns, but I imagine that a
simpletrait Clock { fn now(&self) -> Instant }would suffice here,
and let the test clock own anRc<Instant>orAtomicI32or something
that ticks forward each time that it’s called.I’m okay with omitting dedicated tests for this. It seems unlikely to be
subtly broken, and the failure mode is not catastrophic (read-only, and
the right data still shows up, just later). Feel free to dive in if
you’re interested, but no action requested.
Thanks, that technique makes sense! I also agree with your last two points though, so if you're fine with it I'll omit the dedicated tests.
I did a little refactoring to facilitate making this change which I've broken out into separate commits.
Thanks; this was helpful. Feel free to merge however you want (rebase,
squash, separate PRs).
I'm fine just squashing if you are (mostly broke out commits for ease of review).
tensorboard/data/server/run.rs
Outdated
| Dead, | ||
| } | ||
|
|
||
| /// Holds data staged by a RunLoader that will be commited to the Commit. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tensorboard/data/server/run.rs
Outdated
| if data | ||
| .start_time | ||
| .map(|start| wall_time < start) | ||
| .unwrap_or(true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tensorboard/data/server/run.rs
Outdated
| .map(|start| wall_time < start) | ||
| .unwrap_or(true) | ||
| { | ||
| (*data).start_time = Some(wall_time); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. (I mechanically converted it to *data.start_time and got an operator precedence error and resolved it in the most obvious way, but yes this makes sense, thanks.)
tensorboard/data/server/run.rs
Outdated
| data: RunLoaderData { | ||
| start_time: None, | ||
| time_series: HashMap::new(), | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
|
||
| /// Holds data staged by a RunLoader that will be commited to the Commit. | ||
| #[derive(Debug)] | ||
| struct RunLoaderData { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converted the free functions into methods.
Yeah, that article pretty much sums up the understanding I had come to as well. The "View" thing is interesting, I could see that being fairly effective for a more complex situation but it definitely isn't lightweight. Thanks for sharing!
| self.commit_all(run_data); | ||
| let mut n = 0; | ||
| let mut last_commit_time = Instant::now(); | ||
| self.reload_files(|run_loader_data, event| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No such thing as a free lunch after all huh 😿
wchargin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feels nice to enable “Hide whitespace changes” without worrying that I’m
missing a load-bearing control flow bug!
tensorboard/data/server/run.rs
Outdated
| pub fn new(run: Run) -> Self { | ||
| Self { | ||
| start_time: None, | ||
| run: run, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clippy is warning that this can be punned to just run:
https://github.com/tensorflow/tensorboard/pull/4604/checks?check_run_id=1774686429
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
tensorboard/data/server/run.rs
Outdated
| start_time: Option<WallTime>, | ||
|
|
||
| /// The run name associated with this loader. Used primarily for logging; the run name is | ||
| /// canonically defined by the map key under which this `RunLoader` is stored in LogdirLoader. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit if you’re already updating: `LogdirLoader`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought I got'em all but no. Thanks.
Addresses the online commits task from #4422.
RunLoader's
reload()method now ensures that we commit all time series data staged from the run at a regular interval (which I've hardcoded to 5 seconds), rather than waiting until all event files in the run have been read to exhaustion before committing any data. I did a little refactoring to facilitate making this change which I've broken out into separate commits.Based on a minimal amount of benchmarking against the ~20GiB event file from
gs://tensorboard-bench-logs/nndiv_100m/events.out.tfevents.1549560561.hostname, it looks like this adds about 1% overhead in reading the file to completion (first is unmodified code; second is with this PR):Test plan: ran against the aforementioned logdir and verified that we now see new data steadily appearing at 5 second intervals.
I haven't modified the unit tests to cover this behavior because I was a bit unsure about how to go about it. My main thought was parameterizing
reload()(orRunLoaderat construction) in order to inject some kind of fake clock object, but I wasn't sure if there was a standard technique used for this. Open to suggestions.