[ADAM-1164] Add parallel file merger. #1441

Merged
heuermh merged 1 commit into bigdatagenomics:master from fnothaft:issues/1164-parallel-merge
Mar 20, 2017

fnothaft
Member

Resolves #1164. I'd actually love to get this into 0.22.0 as well. Thoughts?

@fnothaft fnothaft added this to the 0.22.0 milestone Mar 19, 2017

coveralls commented Mar 19, 2017

Coverage increased (+0.03%) to 76.509% when pulling 1f5e03b on fnothaft:issues/1164-parallel-merge into cf39e6c on bigdatagenomics:master.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1884/

@fnothaft
Member Author

Just following up with runtime numbers. With this, saving the NA12878 234GB BAM back to a single BAM from Parquet runs in 4.4 minutes on 833 cores (2.4 minutes to go ADAM->BAM, 2.0 minutes to do the merge). Without this, it takes 44 minutes (2.3 minutes to go ADAM->BAM, the remainder to merge).
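
For reference, the speedups implied by those timings can be worked out with a few lines of Scala. This is only arithmetic on the numbers quoted above; the object and value names are made up for illustration and are not part of ADAM.

```scala
// Timings in minutes, copied from the comment above.
// All names here are illustrative, not ADAM code.
object MergeSpeedup {
  val convertWith = 2.4     // ADAM->BAM, with the parallel merger
  val mergeWith = 2.0       // merge step, with the parallel merger
  val totalWithout = 44.0   // whole job, without the parallel merger
  val convertWithout = 2.3  // ADAM->BAM, without the parallel merger

  // Total runtime with the parallel merger (about 4.4 minutes).
  val totalWith: Double = convertWith + mergeWith

  // "The remainder to merge": about 41.7 minutes.
  val mergeWithout: Double = totalWithout - convertWithout

  // Roughly 10x end to end, and roughly 21x on the merge step alone.
  val endToEndSpeedup: Double = totalWithout / totalWith
  val mergeSpeedup: Double = mergeWithout / mergeWith

  def main(args: Array[String]): Unit = {
    println(f"end-to-end speedup: $endToEndSpeedup%.1fx")
    println(f"merge-only speedup: $mergeSpeedup%.1fx")
  }
}
```

The conversion step is essentially unchanged (2.4 vs. 2.3 minutes), so nearly all of the improvement comes from the merge.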

//
// ideally, this would be a directory, however, fs.concat has the
// undocumented contract that the paths being merged must live in
// the same directory as the path they are being merged to
Member

what? hope you didn't have to find that out the hard way
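
A minimal sketch of how one might guard against the same-directory constraint described in the code comment above, assuming the Hadoop client library is on the classpath. The helper names and the `_part-` naming scheme are hypothetical and are not taken from the PR; only `Path.getParent`, `Path.getName`, and the `Path(parent, child)` constructor are real Hadoop calls.

```scala
import org.apache.hadoop.fs.Path

// Hypothetical helpers, not ADAM code.
object ConcatPreconditions {

  // True when every part file sits in the same directory as the merge
  // target, which is what the comment above says fs.concat expects.
  def sameDirectoryAsTarget(target: Path, parts: Seq[Path]): Boolean = {
    val targetDir = target.getParent
    parts.forall(_.getParent == targetDir)
  }

  // Assumed naming scheme: place each part beside the target rather than
  // in a subdirectory, e.g. /out/reads.bam_part-00000 for /out/reads.bam.
  def siblingPartPath(target: Path, idx: Int): Path =
    new Path(target.getParent, f"${target.getName}_part-$idx%05d")
}
```

Checking this before calling `concat` turns a file-system-specific failure into an early, readable error.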

// UNDOCUMENTED in hadoop fs API:
// all paths passed to the concat method must be qualified with
// full scheme and name node URI
val outputPaths = (0 until numBlocksToWrite).map(idx => {
Member

...and sigh
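
As an illustration of the qualification requirement in the code comment above, the following sketch qualifies a list of paths with `FileSystem.makeQualified` before they would be handed to `FileSystem.concat`. It assumes the Hadoop client library is on the classpath; `qualifyAll` is a hypothetical helper, and the example runs against the local file system only so that it needs no cluster. It does not reproduce the PR's own implementation.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{ FileSystem, Path }

// Hypothetical helper, not ADAM code.
object QualifyPaths {

  // Attach the file system's scheme and authority to every path.
  // The Array result matches the shape concat takes for its sources.
  def qualifyAll(fs: FileSystem, paths: Seq[Path]): Array[Path] =
    paths.map(p => fs.makeQualified(p)).toArray

  def main(args: Array[String]): Unit = {
    // Local file system, used here only to show the qualification step.
    val fs = FileSystem.getLocal(new Configuration())
    val parts = Seq(new Path("/tmp/reads.bam_part-00000"))
    qualifyAll(fs, parts).foreach(println)
    // On HDFS, the qualified parts would then be passed along the lines of
    //   fs.concat(qualifiedTarget, qualifiedParts)
    // Not every FileSystem implementation supports concat.
  }
}
```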

@heuermh heuermh merged commit 98b263f into bigdatagenomics:master Mar 20, 2017
Member

heuermh commented Mar 20, 2017

Thank you, @fnothaft!
