Merging files should be multithreaded #1164

Closed
fnothaft opened this issue Sep 11, 2016 · 1 comment

@fnothaft
Member

Related to #1161. Right now, we merge files in a big ol' single-threaded hunk of code. However, this can be parallelized.

@fnothaft fnothaft self-assigned this Sep 11, 2016
@fnothaft
Member Author

I'm not going to tag this with a specific milestone; this'll be a best-effort feature. Unfortunately, half of the org.apache.hadoop.fs.FileSystem interface is unofficially optional, which makes it a real PITA to write a general implementation. That being said, I'm thinking the implementation would look something like:

  • A block/"balancing" approach. Here, we do a mapPartitions call with a given number of partitions that repartitions the data into fixed-size chunks:
    • For scheme = HDFS, we write the fixed-size chunks and then call concat and life is good, all is well, etc.
    • For scheme = S3, we do something like what conductor does and do a big ol' multipart upload.
  • For scheme = file, we should be able to do random writes at fixed offsets and that should be OK.
  • For whatever else, we fall back on the current single-threaded functionality.
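
To make the scheme = file case concrete, here is a minimal sketch of parallel writes at fixed offsets using only the Java standard library. It is an illustration under assumptions, not ADAM's implementation: the class name ParallelLocalMerge and its merge method are hypothetical, the part files are assumed to already exist on the local filesystem in merge order, and offsets are derived from the part sizes rather than from a fixed chunk size.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper, not part of ADAM: merges part files into one local
// file by having each worker write its part at a precomputed offset.
final class ParallelLocalMerge {

    static void merge(List<Path> parts, Path output, int threads)
            throws IOException, InterruptedException, ExecutionException {
        // Prefix sums of the part sizes give each part's starting offset.
        long[] offsets = new long[parts.size()];
        long total = 0L;
        for (int i = 0; i < parts.size(); i++) {
            offsets[i] = total;
            total += Files.size(parts.get(i));
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (FileChannel out = FileChannel.open(output,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            List<Future<?>> pending = new ArrayList<>();
            for (int i = 0; i < parts.size(); i++) {
                final Path part = parts.get(i);
                final long start = offsets[i];
                // Each task copies one part into its own byte range.
                pending.add(pool.submit(() -> {
                    copyAt(part, out, start);
                    return null;
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the task and rethrows any worker failure
            }
        } finally {
            pool.shutdown();
        }
    }

    // Copies one part into the shared output channel, starting at `start`.
    private static void copyAt(Path part, FileChannel out, long start)
            throws IOException {
        try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
            long pos = start;
            while (in.read(buf) != -1) {
                buf.flip();
                while (buf.hasRemaining()) {
                    // Positional write: does not touch the channel's own position.
                    pos += out.write(buf, pos);
                }
                buf.clear();
            }
        }
    }
}
```

Because every worker writes to a disjoint byte range through a positional FileChannel write, no locking is needed around the output. The HDFS and S3 cases would need their own paths (concat and a multipart upload, respectively) and are not covered by this sketch.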

@fnothaft fnothaft modified the milestones: 0.22.0, 0.23.0 Mar 3, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Mar 18, 2017
fnothaft added a commit to fnothaft/adam that referenced this issue Mar 19, 2017
@fnothaft fnothaft modified the milestones: 0.22.0, 0.23.0 Mar 19, 2017
heuermh pushed a commit that referenced this issue Mar 20, 2017