I'm not going to tag this with a specific milestone; this'll be a best-effort feature. Unfortunately, half of the org.apache.hadoop.fs.FileSystem interface is unofficially optional (implementations are free to throw UnsupportedOperationException for methods they don't support), which makes it a real PITA to do a general implementation. That being said, I'm thinking the implementation would look something like:
A block/"balancing" approach: do a mapPartitions call with a given number of partitions that resizes everything into fixed-size chunks. Then, depending on the scheme (see the sketch after this list):

- For scheme = HDFS, we write the fixed-size chunks and then call concat, and life is good, all is well, etc.
- For scheme = S3, we do something like what conductor does: one big ol' multipart upload.
- For scheme = file, we should be able to do random writes at fixed offsets, which should be OK.
- For anything else, we fall back on the current single-threaded functionality.
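A minimal sketch of what that scheme dispatch might look like, assuming the fixed-size chunks have already been written by the mapPartitions pass. The names `mergeParallel`, `multipartUpload`, and `randomWriteAtOffsets` are hypothetical; the only real API used here is `FileSystem.concat`, which HDFS implements but most other filesystems don't:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical sketch: stitch already-written fixed-size chunks into
// outputPath using whatever the target scheme supports.
def mergeParallel(outputPath: Path, chunks: Seq[Path], conf: Configuration): Unit = {
  val scheme = Option(outputPath.toUri.getScheme).getOrElse("file")
  scheme match {
    case "hdfs" =>
      // HDFS exposes concat(), which splices the chunk files onto a target
      // file without copying any bytes.
      val fs = outputPath.getFileSystem(conf)
      fs.rename(chunks.head, outputPath) // concat needs an existing target
      fs.concat(outputPath, chunks.tail.toArray)
    case "s3" | "s3a" | "s3n" =>
      // One big multipart upload, each chunk becoming a part (what conductor
      // does); this would go through the AWS SDK.
      multipartUpload(outputPath, chunks)
    case "file" =>
      // Local FS: each writer can seek to its fixed offset and write in place.
      randomWriteAtOffsets(outputPath, chunks)
    case _ =>
      // Anything else: fall back to the current single-threaded merge.
      mergeSingleThreaded(outputPath, chunks, conf)
  }
}

// Hypothetical helpers, left unimplemented in this sketch:
def multipartUpload(target: Path, parts: Seq[Path]): Unit = ???
def randomWriteAtOffsets(target: Path, parts: Seq[Path]): Unit = ???
def mergeSingleThreaded(target: Path, parts: Seq[Path], conf: Configuration): Unit = ??? // see below
```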
Related to #1161. Right now, we merge files as a big ol' single-threaded hunk of code (roughly the shape sketched below). However, this can be parallelized.
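For reference, the single-threaded merge is roughly this shape (a reconstruction for illustration, not the actual code): one thread streams every part file into the output in order, which is correct on any FileSystem but bounded by a single writer.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

// Roughly the current behavior: copy each part into the output sequentially.
// Assumes the parts live on the same FileSystem as the output.
def mergeSingleThreaded(outputPath: Path, parts: Seq[Path], conf: Configuration): Unit = {
  val fs = outputPath.getFileSystem(conf)
  val out = fs.create(outputPath)
  try {
    parts.foreach { part =>
      val in = fs.open(part)
      try {
        IOUtils.copyBytes(in, out, conf, false) // keep the output open between parts
      } finally {
        in.close()
      }
    }
  } finally {
    out.close()
  }
}
```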