PARQUET-382: Add methods to append encoded data to files. #278
rdblue wants to merge 1 commit into apache:master from rdblue:PARQUET-382-append-encoded-blocks
Conversation
Test failure is due to the flaky Hadoop MemoryManager test. See #269.
Need to make sure these are set properly by methods that append blocks.
Fixed and added tests.
This allows appending encoded data blocks to open ParquetFileWriters, which makes it possible to merge multiple Parquet files without re-encoding all of the records. This works by finding the column chunk for each column in the file schema and then streaming the encoded data from one file to the other. New starting offsets are tracked and the column chunk metadata in the footer is updated with the new starting positions.
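The core mechanism described above — streaming already-encoded chunk bytes into the output and recording each chunk's new starting position for the footer — can be sketched in plain Java. This is a minimal, hypothetical illustration using byte-array streams, not the actual `ParquetFileWriter` code; the class and method names here are invented for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the copy-and-rebase idea behind appending
// encoded blocks: chunk bytes are streamed verbatim to the output,
// and the chunk's recorded start offset becomes its position in the
// target file, which is what the footer metadata is updated with.
public class AppendSketch {
  // Copy `length` bytes of already-encoded chunk data to the output
  // and return the chunk's new starting offset in the target.
  static long copyChunk(InputStream from, ByteArrayOutputStream to, long length)
      throws IOException {
    long newStart = to.size();          // chunk now begins here in the target
    byte[] buf = new byte[8192];
    long remaining = length;
    while (remaining > 0) {
      int n = from.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (n < 0) throw new IOException("unexpected EOF");
      to.write(buf, 0, n);
      remaining -= n;
    }
    return newStart;                    // footer metadata records this offset
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream target = new ByteArrayOutputStream();
    target.write(new byte[]{1, 2, 3, 4});   // pretend 4 bytes already written
    byte[] encodedChunk = {9, 8, 7};
    long newStart = copyChunk(new ByteArrayInputStream(encodedChunk), target, 3);
    System.out.println(newStart);           // 4: the chunk's new starting offset
    System.out.println(target.size());      // 7
  }
}
```

Because the bytes are copied verbatim, no decoding or re-encoding happens; only the offsets in the metadata change.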
rdblue force-pushed from 54ec8a5 to cb98552
@spena could you review this when you get a chance? Thanks!
@rdblue What about file schema evolution cases? Can this patch merge 2 or more files with different schemas?
Does 'columnToCopy' need to be in order? I've seen ordering issues in Hive with HashMap on JDK 8; we use LinkedHashMap instead to avoid that. I'm not sure whether Parquet supports JDK 8, but it's something to consider if the order is critical.
No, this is just preparing a way to look up the source columns that will be copied. The order is determined by the output file's schema and columns are added in that order to columnsInOrder. Columns need to match the order of the output file's schema, so they may be reordered, but the copy section handles that and copies them in contiguous chunks.
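The point that the map is lookup-only, with the copy order driven by the output file's schema, can be illustrated with a small sketch. All names here (`columnsInOrder`, the string-keyed map) are hypothetical stand-ins for the patch's actual types, chosen to show why HashMap iteration order is irrelevant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the lookup map's iteration order does not matter
// because the output schema's column order drives the copy order; the
// map is only consulted to find each source chunk by column path.
public class ColumnOrderSketch {
  static List<String> columnsInOrder(List<String> outputSchemaPaths,
                                     Map<String, String> columnsToCopy) {
    List<String> ordered = new ArrayList<>();
    for (String path : outputSchemaPaths) {     // output schema drives order
      String chunk = columnsToCopy.get(path);   // map is lookup-only
      if (chunk != null) ordered.add(chunk);
    }
    return ordered;
  }

  public static void main(String[] args) {
    Map<String, String> toCopy = new HashMap<>(); // deliberately unordered
    toCopy.put("b", "chunk-b");
    toCopy.put("a", "chunk-a");
    System.out.println(columnsInOrder(Arrays.asList("a", "b"), toCopy));
    // [chunk-a, chunk-b] regardless of HashMap iteration order
  }
}
```

Since every access is a keyed `get`, swapping HashMap for LinkedHashMap would change nothing observable here.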
@spena, the schema for the output file is set when opening the file writer, so the desired output schema is known. If that schema doesn't match the incoming data file's schema (or block's schema) then there are a few cases:
@spena you're a committer now ;)
@rdblue Overall, the patch looks pretty good. I give a +1 to this.
Thanks, Sergio. I built a couple of tools in a Kite branch so we could use this to merge Impala files for performance testing (rdblue/kite@54ba1f2). The typical use is this: https://github.com/rdblue/kite/blob/54ba1f2a3977d0a8913bf2d8beb3b4b7aae72f9e/kite-tools-parent/kite-tools/src/main/java/org/kitesdk/cli/commands/MergeParquetCommand.java#L126
PARQUET-382: Add methods to append encoded data to files.

This allows appending encoded data blocks to open ParquetFileWriters, which makes it possible to merge multiple Parquet files without re-encoding all of the records. This works by finding the column chunk for each column in the file schema and then streaming the encoded data from one file to the other. New starting offsets are tracked and the column chunk metadata in the footer is updated with the new starting positions.

Author: Ryan Blue <blue@apache.org>

Closes apache#278 from rdblue/PARQUET-382-append-encoded-blocks and squashes the following commits:

cb98552 [Ryan Blue] PARQUET-382: Add methods to append encoded data to files.

Conflicts:
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
Resolution: Ignored problem with adjacent changes to imports.