I'm not going to tag this with a specific milestone; this'll be a best-effort feature. Unfortunately, half of the org.apache.hadoop.fs.FileSystem interface is unofficially optional (implementations are free to throw UnsupportedOperationException for methods they don't support), which makes it a real PITA to do a general implementation. That being said, I'm thinking the implementation would look something like:
A block/"balancing" approach: do a mapPartitions call with a given number of partitions that resizes everything into fixed-size chunks. Then, depending on the scheme (see the sketch after this list):

- For scheme = HDFS, we write the fixed-size chunks and then call concat, and life is good, all is well, etc.
- For scheme = S3, we do something like what conductor does: one big ol' multipart upload.
- For scheme = file, we should be able to do random writes at fixed offsets, which should be OK.
- For anything else, we fall back on the current single-threaded functionality.
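A minimal sketch of what that scheme dispatch might look like, assuming the fixed-size chunks have already been written by the mapPartitions pass. The names `mergeParallel`, `multipartUpload`, and `randomWriteAtOffsets` are hypothetical; the only real API used here is `FileSystem.concat`, which HDFS implements but most other filesystems don't:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical sketch: stitch already-written fixed-size chunks into
// outputPath using whatever the target scheme supports.
def mergeParallel(outputPath: Path, chunks: Seq[Path], conf: Configuration): Unit = {
  val scheme = Option(outputPath.toUri.getScheme).getOrElse("file")
  scheme match {
    case "hdfs" =>
      // HDFS exposes concat(), which splices the chunk files onto a target
      // file without copying any bytes.
      val fs = outputPath.getFileSystem(conf)
      fs.rename(chunks.head, outputPath) // concat needs an existing target
      fs.concat(outputPath, chunks.tail.toArray)
    case "s3" | "s3a" | "s3n" =>
      // One big multipart upload, each chunk becoming a part (what conductor
      // does); this would go through the AWS SDK.
      multipartUpload(outputPath, chunks)
    case "file" =>
      // Local FS: each writer can seek to its fixed offset and write in place.
      randomWriteAtOffsets(outputPath, chunks)
    case _ =>
      // Anything else: fall back to the current single-threaded merge.
      mergeSingleThreaded(outputPath, chunks, conf)
  }
}

// Hypothetical helpers, left unimplemented in this sketch:
def multipartUpload(target: Path, parts: Seq[Path]): Unit = ???
def randomWriteAtOffsets(target: Path, parts: Seq[Path]): Unit = ???
def mergeSingleThreaded(target: Path, parts: Seq[Path], conf: Configuration): Unit = ??? // see below
```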
Related to #1161. Right now, we merge files as a big ol' single-threaded hunk of code (roughly the shape sketched below). However, this can be parallelized.
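For reference, the single-threaded merge is roughly this shape (a reconstruction for illustration, not the actual code): one thread streams every part file into the output in order, which is correct on any FileSystem but bounded by a single writer.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.IOUtils

// Roughly the current behavior: copy each part into the output sequentially.
// Assumes the parts live on the same FileSystem as the output.
def mergeSingleThreaded(outputPath: Path, parts: Seq[Path], conf: Configuration): Unit = {
  val fs = outputPath.getFileSystem(conf)
  val out = fs.create(outputPath)
  try {
    parts.foreach { part =>
      val in = fs.open(part)
      try {
        IOUtils.copyBytes(in, out, conf, false) // keep the output open between parts
      } finally {
        in.close()
      }
    }
  } finally {
    out.close()
  }
}
```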