-
Notifications
You must be signed in to change notification settings - Fork 1.7k
rust: load runs in parallel #4634
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -16,6 +16,7 @@ limitations under the License. | |
| //! Loader for many runs under a directory. | ||
|
|
||
| use log::{error, warn}; | ||
| use rayon::prelude::{IntoParallelIterator, ParallelIterator}; | ||
| use std::collections::{HashMap, HashSet}; | ||
| use std::path::{Path, PathBuf}; | ||
| use walkdir::WalkDir; | ||
|
|
@@ -26,6 +27,7 @@ use crate::types::Run; | |
|
|
||
| /// A loader for a log directory, connecting a filesystem to a [`Commit`] via [`RunLoader`]s. | ||
| pub struct LogdirLoader<'a> { | ||
| thread_pool: rayon::ThreadPool, | ||
| commit: &'a Commit, | ||
| logdir: PathBuf, | ||
| runs: HashMap<Run, RunState>, | ||
|
|
@@ -76,8 +78,23 @@ const EVENT_FILE_BASENAME_INFIX: &str = "tfevents"; | |
|
|
||
| impl<'a> LogdirLoader<'a> { | ||
| /// Creates a new, empty logdir loader. Does not load any data. | ||
| pub fn new(commit: &'a Commit, logdir: PathBuf) -> Self { | ||
| /// | ||
| /// This constructor is heavyweight: it builds a new thread-pool. The thread pool will be | ||
| /// reused for all calls to [`Self::reload`]. If `reload_threads` is `0`, the number of threads | ||
| /// will be determined automatically, per [`rayon`] semantics. | ||
| /// | ||
| /// # Panics | ||
| /// | ||
| /// If [`rayon::ThreadPoolBuilder::build`] returns an error; should only happen if there is a | ||
| /// failure to create threads at the OS level. | ||
| pub fn new(commit: &'a Commit, logdir: PathBuf, reload_threads: usize) -> Self { | ||
| let thread_pool = rayon::ThreadPoolBuilder::new() | ||
| .num_threads(reload_threads) | ||
| .thread_name(|i| format!("Reloader-{:03}", i)) | ||
| .build() | ||
| .expect("failed to construct Rayon thread pool"); | ||
| LogdirLoader { | ||
| thread_pool, | ||
| commit, | ||
| logdir, | ||
| runs: HashMap::new(), | ||
|
|
@@ -246,22 +263,27 @@ impl<'a> LogdirLoader<'a> { | |
| .runs | ||
| .read() | ||
| .expect("could not acquire runs.data"); | ||
|
|
||
| let mut work_items = Vec::new(); | ||
| for (run, run_state) in self.runs.iter_mut() { | ||
| let event_files = discoveries | ||
| .remove(run) | ||
| .unwrap_or_else(|| panic!("run in self.runs but not discovered: {:?}", run)); | ||
| let filenames: Vec<PathBuf> = event_files.into_iter().map(|d| d.event_file).collect(); | ||
| run_state.loader.reload( | ||
| filenames, | ||
| commit_runs.get(run).unwrap_or_else(|| { | ||
| panic!( | ||
| "run in self.runs but not in commit.runs \ | ||
| let run_data = commit_runs.get(run).unwrap_or_else(|| { | ||
| panic!( | ||
| "run in self.runs but not in commit.runs \ | ||
| (is another client mutating this commit?): {:?}", | ||
| run | ||
| ) | ||
| }), | ||
| ); | ||
| run | ||
| ) | ||
| }); | ||
| work_items.push((&mut run_state.loader, filenames, run_data)); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just to check my understanding, this is a
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good question. Assuming that you’re talking about the closure argument diff --git a/tensorboard/data/server/logdir.rs b/tensorboard/data/server/logdir.rs
index 23ff65893..cbab00442 100644
--- a/tensorboard/data/server/logdir.rs
+++ b/tensorboard/data/server/logdir.rs
@@ -277,12 +277,11 @@ impl<'a> LogdirLoader<'a> {
run
)
});
- work_items.push((&mut run_state.loader, filenames, run_data));
+ let loader = &mut run_state.loader;
+ work_items.push(move || loader.reload(filenames, run_data));
}
self.thread_pool.install(|| {
- work_items
- .into_par_iter()
- .for_each(|(loader, filenames, run_data)| loader.reload(filenames, run_data));
+ work_items.into_par_iter().for_each(|f| f());
});
}
}As written, the closures desugar kind of like this: struct MyClosure; // actually doesn't close over anything
impl for<'a> FnMut<(&'a mut RunLoader, Vec<PathBuf>, &RwLock<commit::RunData>)> for MyClosure {
fn call_mut(&mut self, arg: (&'a mut RunLoader, ...)) {
let (loader, filenames, run_data) = arg;
loader.reload(filenames, run_data);
}
}(Since they don’t close over anything, this could be a free-standing If we put them in a struct YourClosure<'a> {
loader: &'a mut RunLoader,
filenames: Vec<PathBuf>,
run_data: &RwLock<commit::RunData>,
}
impl for<'a> FnMut<()> for MyClosure { // now doesn't need any args
fn call_mut(&mut self) {
self.loader.reload(self.filenames, self.run_data);
}
}Both work, and I wouldn’t be surprised if they yielded exactly the same Does this help? Personally, I find the “collect the arguments” approach
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Okay, got it, thanks. I think my natural inclination would be to do a vector of argument-less closures since that seems more encapsulated (the work item execution code doesn't need to know anything task-specific about how to unpack the vec), but I don't have a strong preference. |
||
| } | ||
| self.thread_pool.install(|| { | ||
| work_items | ||
| .into_par_iter() | ||
| .for_each(|(loader, filenames, run_data)| loader.reload(filenames, run_data)); | ||
| }); | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -319,7 +341,7 @@ mod tests { | |
| let test_relpath: PathBuf = ["mnist", "test"].iter().collect(); | ||
|
|
||
| let commit = Commit::new(); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf()); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf(), 1); | ||
|
|
||
| // Check that we persist the right run states in the loader. | ||
| loader.reload(); | ||
|
|
@@ -379,7 +401,7 @@ mod tests { | |
| )?; | ||
|
|
||
| let commit = Commit::new(); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf()); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf(), 1); | ||
|
|
||
| let get_run_names = || { | ||
| let runs_store = commit.runs.read().unwrap(); | ||
|
|
@@ -442,7 +464,7 @@ mod tests { | |
| } | ||
|
|
||
| let commit = Commit::new(); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf()); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf(), 1); | ||
| loader.reload(); | ||
|
|
||
| assert_eq!(loader.runs.keys().collect::<Vec<_>>(), vec![&run]); | ||
|
|
@@ -467,7 +489,7 @@ mod tests { | |
| File::create(train_dir.join(EVENT_FILE_BASENAME_INFIX))?; | ||
|
|
||
| let commit = Commit::new(); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf()); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf(), 1); | ||
| loader.reload(); | ||
|
|
||
| assert_eq!( | ||
|
|
@@ -489,7 +511,7 @@ mod tests { | |
| std::os::unix::fs::symlink(&dir2, &dir1)?; | ||
|
|
||
| let commit = Commit::new(); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf()); | ||
| let mut loader = LogdirLoader::new(&commit, logdir.path().to_path_buf(), 1); | ||
| loader.reload(); // should not hang | ||
| Ok(()) | ||
| } | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might just be tying myself in knots here, but do you have any thoughts for how I can better understand the logic the borrow checker uses to handle this kind of case where we mutably borrow each
run_state.loaderand push them all into another vector for later extraction? I guess the originalself.runslifetime is the outer bound for the individual loaders, and then theVecinherits their lifetime, but then we take the loaders back out ofVec- and I guess the lifetime follows them through to wherever they go? But presumably it's also somehow attached to theVecitself now, right, so even if we were to empty theVecentirely and then try to go refill it without dropping it in between, that would still prevent mutation ofself.runseven though the actual loader references are all gone?Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Let’s spelunk through the logic:
&mut self; call that elided lifetime'1, soit’s
fn load_runs<'1>(&'1 mut self, ...).self.runs.iter_mut()for lifetime'2, which must notoutlive
'1so that it doesn’t outlive borrowed content.&mut run_state.loader, for lifetime'3, whichmust not outlive
'3so that it doesn’t outlive that loan.'1: '2and'2: '3, where“
:” is a subtyping relation that you can pronounce “outlives (oris equal to)”. You’ll often see
T: 'staticto indicate thatThas only static references (“outlives everything”).
Vec::<(&'v mut RunLoader, ...)>for some lifetime'v.Since we
pushitems with lifetime'3, we must have'3: 'v: thereferenced content must outlive the vector.
work_itemsinto the closure arg toinstall. Fromhere on out, there aren’t really any more lifetimes: we call
into_par_iterandfor_each, both of which are simply typedmethods with by-value receivers. In particular, neither of them has
a lifetime bound on its closure type, unlike
std::thread::spawn.You can similarly analyze the lifetime flow for the
&RwLock<...>lifetime. You’ll get something like this:
(Here, an arrow from
atobmeansa: b.)Writing it out, the inference looks pretty unidirectional here: we
require
'1: '2: '3: 'v, but there’s nothing concrete at the end ofthat chain, so this is satisfiable by (e.g.) setting
'1 = '2 = '3 = 'vand the borrow checker is happy.
If we had called
std::thread::spawninstead ofinstall, we wouldthen require
'v: 'static, sincespawnrequires thatF: 'static(where
Fis the type of the closure), and we have'v: Fsince thevector must live as long as the closure itself. This transitively means
that the whole diagram must outlive
'static. But'1was arbitrary:it’s the lifetime to the receiver parameter. Thus, this isn’t
satisfiable in general and should be a borrowck error.
Let’s see what Rustc has to say. In this patch, I’ve dropped the
run_dataarg so that the compiler is forced to complain about the leftbranch of the graph above rather than the right branch (it’s the same
logic, just not the one that you asked about):
Rustc says:
This is pointing out the transitive
'1: '2: 'staticchain andcomplaining about its unsatisfiability.
So how can Rayon’s
installor Crossbeam’sscopeget away with thiswhile
thread::spawnneeds the'staticrequirement? The key is thatthe scoped thread implementations ensure that the callback finishes
before
install/scopeitself returns. Thus, the borrowed content onlyneeds to live as long as the call to
install, which is on the caller’sstack frame and is thus known to be outlived by
'1. Under the hood,both of these will call
thread::spawnafter usingunsafetotransmute your closure to
'static. (If “unsafe” is surprisinghere, remember that soundness is what we care about. “Sound” means “in
no real execution will it trigger undefined behavior”; “safe” means “the
compiler has statically proven that it is sound”; “unsafe” means “the
code author has statically proven that it is sound based on properties
too advanced to be represented in the base type system”. We use sound
unsafecode to build safe abstractions.)Yeah, definitely. You can get that kind of flexibility with
RefCell,which basically pushes the borrow checking to runtime. (Follows a Rust
pattern: you can have dynamic borrow checking, dynamic dispatch, dynamic
reference counting, dynamic typing (
std::any), but you have to opt in,and sometimes accept consequences like “your type is no longer
Sync”.)But a humble
Vec<_>isn’t that smart by default.Does this help?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, that helps! Thanks very much for the thorough explanation.