
Parallelize dwarfdump per-unit #285

Merged: 1 commit merged into gimli-rs:master on Mar 12, 2018
Conversation

@rocallahan (Contributor):

Maybe you won't like the memory usage increase, but this is a pretty nice speedup when you're not writing the results to a file. Fearless concurrency FTW.

I have a follow-up patch that adds a dwarfdump option to drop the output for compilation units whose output doesn't match a given regex, which makes this even more useful.

@rocallahan (Contributor, Author):

Forgot to mention: that 4x speedup is basically linear on my laptop (4 cores, 8 hyperthreads).

@coveralls commented Mar 9, 2018:


Coverage decreased (-0.01%) to 92.506% when pulling fef4e3d on rocallahan:parallel-dump into 71a1138 on gimli-rs:master.

@philipc (Collaborator) commented Mar 11, 2018:

> Forgot to mention: that 4x speedup is basically linear on my laptop (4 cores, 8 hyperthreads).

There's an initial speedup due to writing to a Vec instead of using a BufWriter. So I'm seeing 80s before this PR, 60s with 1 worker, 23s with 4 workers.

@philipc (Collaborator) left a review comment:

Thanks, nice speedup! I think the memory usage is fine. It would be nice if some other library could make parallel_output simpler, but I'm not aware of anything that does (I saw your comments in rayon about converting to a serial iterator).

@@ -65,6 +71,57 @@ impl From<io::Error> for Error {

pub type Result<T> = result::Result<T, Error>;

fn parallel_output<II: IntoIterator, F>(max_workers: usize, iter: II, f: F) -> Result<()>
@philipc (Collaborator) commented Mar 11, 2018:

Any benefit to using IntoIterator instead of Iterator? Also please move the bound into the where clause.

@rocallahan (Contributor, Author):

We could use Iterator here and move the into_iter call to the caller, but taking IntoIterator makes this function more reusable with, I think, no net increase in complexity.
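For reference, a minimal sketch of the signature with every bound moved into the where clause, as requested above. The `Result` alias and the serial body are stand-ins so the sketch compiles on its own; the real function hands items to worker threads.

```rust
// Stand-in for dwarfdump's Result alias, only so this sketch is self-contained.
type Result<T> = std::result::Result<T, std::io::Error>;

// Sketch: same shape as parallel_output, with all bounds in the where clause.
// The body is a serial placeholder, not the real implementation.
fn parallel_output<II, F>(_max_workers: usize, iter: II, f: F) -> Result<()>
where
    II: IntoIterator,
    II::IntoIter: Send,
    F: Sync + Fn(II::Item, &mut Vec<u8>) -> Result<()>,
{
    let mut buf = Vec::new();
    for item in iter {
        f(item, &mut buf)?;
    }
    Ok(())
}

fn main() -> Result<()> {
    parallel_output(4, 0..3u32, |unit: u32, buf: &mut Vec<u8>| {
        buf.extend_from_slice(format!("unit {}\n", unit).as_bytes());
        Ok(())
    })
}
```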

fn parallel_output<II: IntoIterator, F>(max_workers: usize, iter: II, f: F) -> Result<()>
where F: Sync + Fn(II::Item, &mut Vec<u8>) -> Result<()>,
II::IntoIter: Send {
let state = Mutex::new((iter.into_iter().fuse(), 0, Ok(())));
@philipc (Collaborator):

Having the fuse doesn't hurt, but is it actually needed for the current caller?

@philipc (Collaborator):

I think this code is complex enough that using a local struct type for state would improve readability.

@rocallahan (Contributor, Author):

We depend on fused behavior. If we remove fuse here then I would want to make parallel_output take a Vec<T> instead of a generic Iterator/IntoIterator. Is that preferable?
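To illustrate why the fused behaviour matters, here is a minimal sketch (not the PR's code) of the pattern: every worker keeps pulling from one shared iterator until it sees None, so next() can be called again after exhaustion, and only a fused iterator guarantees it keeps returning None.

```rust
use std::iter::Fuse;
use std::sync::Mutex;
use std::thread;

// Minimal sketch, not the PR's code: several workers drain one shared,
// fused iterator behind a Mutex.
fn drain_in_parallel<I>(iter: I, workers: usize)
where
    I: Iterator<Item = usize> + Send,
{
    let shared: Mutex<Fuse<I>> = Mutex::new(iter.fuse());
    thread::scope(|s| {
        for _ in 0..workers {
            s.spawn(|| loop {
                // Hold the lock only long enough to pull one item. After the
                // iterator is exhausted, racing workers may call next() again;
                // fuse() guarantees they just keep getting None.
                let item = shared.lock().unwrap().next();
                match item {
                    Some(unit) => println!("worker formatting unit {}", unit),
                    None => break,
                }
            });
        }
    });
}

fn main() {
    drain_in_parallel(0..8usize, 4);
}
```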

@philipc (Collaborator):

No, it's fine how it is.

src/reader.rs (Outdated)

@@ -179,7 +179,7 @@ impl ReaderOffset for usize {
 ///
 /// All read operations advance the section offset of the reader
 /// unless specified otherwise.
-pub trait Reader: Debug + Clone {
+pub trait Reader: Debug + Clone + Send + Sync {
@philipc (Collaborator):

I'm not sure about this (and same for Endianity). Nothing in gimli requires this. I think this should be up to the consumer to decide, but this isn't something I've had much experience with. We're already defining a Reader trait in dwarfdump that could specify this requirement for its use.
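For comparison, a hedged sketch of that consumer-side option, assuming gimli as a dependency; DumpReader is an illustrative name, not dwarfdump's actual trait.

```rust
// Sketch only: a dwarfdump-local trait that layers the Send + Sync
// requirement on top of gimli's Reader, leaving gimli itself unchanged.
trait DumpReader: gimli::Reader + Send + Sync {}

// Blanket impl: any gimli Reader that is already Send + Sync qualifies.
impl<R: gimli::Reader + Send + Sync> DumpReader for R {}
```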

@rocallahan (Contributor, Author):

Unfortunately this is tricky because there doesn't seem to be a way for dwarfdump::Reader to constrain its Endian to be Send + Sync. This is related to rust-lang/rust#38738.

@rocallahan (Contributor, Author):

I figured something out.

 if flags.eh_frame {
-    dump_eh_frame(w, eh_frame)?;
+    dump_eh_frame(&mut BufWriter::new(out.lock()), eh_frame)?;
@philipc (Collaborator):

These are a bit ugly, but I don't have a better idea. This relies on the drop to flush, right?

@rocallahan (Contributor, Author):

Yes.
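Spelled out as a minimal sketch (the Output type here is a hypothetical stand-in for dwarfdump's locked writer, not the PR's actual type): the BufWriter is a temporary dropped at the end of the statement, and its Drop impl flushes the buffered bytes before the lock guard is released.

```rust
use std::io::{self, BufWriter, Write};
use std::sync::{Mutex, MutexGuard};

// Hypothetical stand-in for the locked output handle in the PR.
struct Output(Mutex<Vec<u8>>);

impl Output {
    fn lock(&self) -> MutexGuard<'_, Vec<u8>> {
        self.0.lock().unwrap()
    }
}

fn dump_something(w: &mut dyn Write) -> io::Result<()> {
    writeln!(w, "some section dump")
}

fn main() -> io::Result<()> {
    let out = Output(Mutex::new(Vec::new()));
    // Temporaries are dropped in reverse creation order at the end of this
    // statement: the BufWriter drops first (flushing into the guard), then
    // the MutexGuard drops (releasing the lock).
    dump_something(&mut BufWriter::new(&mut *out.lock()))?;
    let buffered = out.lock();
    println!("{}", String::from_utf8_lossy(&buffered));
    Ok(())
}
```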

@rocallahan (Contributor, Author):

> There's an initial speedup due to writing to a Vec instead of using a BufWriter. So I'm seeing 80s before this PR, 60s with 1 worker, 23s with 4 workers.

Well spotted. That's interesting.

Before:

time ( target/release/examples/dwarfdump -i ~/mozilla-central/obj-ff-opt/dist/bin/libxul.so >& /dev/null )
real    1m39.153s
user    1m37.714s
sys     0m1.320s

After:

time ( target/release/examples/dwarfdump -i ~/mozilla-central/obj-ff-opt/dist/bin/libxul.so >& /dev/null )
real    0m25.641s
user    2m3.328s
sys     0m1.087s

This increases memory usage. Because the output is buffered, peak memory usage grows by roughly the combined size of the N largest per-compilation-unit outputs, where N is the minimum of 16 and num_cpus::get(). The larger compilation units in Firefox's libxul.so produce tens to hundreds of megabytes of output each. Then again, the speedup when processing such large files is important.
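The worker count described here corresponds to something like the following sketch (it assumes the num_cpus crate, which the paragraph above already refers to):

```rust
// Sketch of the worker-count choice described above: at most 16 per-unit
// output buffers are in flight, fewer on machines with fewer CPUs.
fn max_workers() -> usize {
    std::cmp::min(16, num_cpus::get())
}

fn main() {
    println!("dwarfdump would use up to {} workers", max_workers());
}
```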
@philipc merged commit 5183af0 into gimli-rs:master on Mar 12, 2018.
@rocallahan deleted the parallel-dump branch on March 22, 2018 at 21:48.