[ADAM-1164] Add parallel file merger. #1441
Test PASSed.
Just following up with runtime numbers. With this, saving the NA12878 234GB BAM back to a single BAM from Parquet runs in 4.4 minutes on 833 cores (2.4 minutes to go ADAM->BAM, 2.0 minutes to do the merge). Without this, it takes 44 minutes (2.3 minutes to go ADAM->BAM, the remainder to merge).
//
// ideally, this would be a directory, however, fs.concat has the
// undocumented contract that the paths being merged must live in
// the same directory as the path they are being merged to
what? hope you didn't have to find that out the hard way
// UNDOCUMENTED in hadoop fs API:
// all paths passed to the concat method must be qualified with
// full scheme and name node URI
val outputPaths = (0 until numBlocksToWrite).map(idx => {
...and sigh
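For anyone else who runs into these two constraints, here is a minimal sketch of how a caller might guard against them before invoking `fs.concat`. This is not the code from this PR: `ConcatSketch`, `inSameDirectory`, and `concatParts` are hypothetical names used only for illustration, and the sketch assumes a filesystem that actually implements `concat` (HDFS does; as far as I know the base `FileSystem` class and the local filesystem throw `UnsupportedOperationException`). It also assumes the target file already exists, which is my understanding of the HDFS behavior.

```scala
import org.apache.hadoop.fs.{ FileSystem, Path }

// Hypothetical helper object, for illustration only.
object ConcatSketch {

  // True if every part has the same parent directory as the target.
  // This mirrors the first undocumented contract discussed above.
  def inSameDirectory(target: Path, parts: Seq[Path]): Boolean = {
    val targetDir = target.getParent
    parts.forall(_.getParent == targetDir)
  }

  // Qualifies all paths, checks the directory constraint, then concats.
  def concatParts(fs: FileSystem, target: Path, parts: Seq[Path]): Unit = {
    // Second contract: paths must carry the full scheme and name node
    // URI. makeQualified resolves a path against the filesystem's URI
    // and working directory.
    val qualifiedTarget = fs.makeQualified(target)
    val qualifiedParts = parts.map(p => fs.makeQualified(p))

    require(
      inSameDirectory(qualifiedTarget, qualifiedParts),
      "all parts must live in the same directory as %s".format(qualifiedTarget)
    )

    // Assumption: the target file already exists; the parts are
    // appended to it by the concat call.
    fs.concat(qualifiedTarget, qualifiedParts.toArray)
  }
}
```

Failing fast in `require` gives a readable error on the driver instead of whatever the name node reports when the contract is violated.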
Thank you, @fnothaft!
Resolves #1164. I'd actually love to get this into 0.22.0 as well. Thoughts?