PARQUET-382: Add methods to append encoded data to files. #278
rdblue wants to merge 1 commit into apache:master from rdblue:PARQUET-382-append-encoded-blocks
Conversation
Test failure is due to the flaky Hadoop MemoryManager test. See #269.
Need to make sure these are set properly by methods that append blocks.
Fixed and added tests.
This allows appending encoded data blocks to open ParquetFileWriters, which makes it possible to merge multiple Parquet files without re-encoding all of the records. This works by finding the column chunk for each column in the file schema and then streaming the encoded data from one file to the other. New starting offsets are tracked and the column chunk metadata in the footer is updated with the new starting positions.
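The core mechanism described above — streaming already-encoded chunk bytes into the output and recording each chunk's new starting position for the footer — can be sketched in plain Java. This is a minimal, hypothetical illustration using byte-array streams, not the actual `ParquetFileWriter` code; the class and method names here are invented for the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch of the copy-and-rebase idea behind appending
// encoded blocks: chunk bytes are streamed verbatim to the output,
// and the chunk's recorded start offset becomes its position in the
// target file, which is what the footer metadata is updated with.
public class AppendSketch {
  // Copy `length` bytes of already-encoded chunk data to the output
  // and return the chunk's new starting offset in the target.
  static long copyChunk(InputStream from, ByteArrayOutputStream to, long length)
      throws IOException {
    long newStart = to.size();          // chunk now begins here in the target
    byte[] buf = new byte[8192];
    long remaining = length;
    while (remaining > 0) {
      int n = from.read(buf, 0, (int) Math.min(buf.length, remaining));
      if (n < 0) throw new IOException("unexpected EOF");
      to.write(buf, 0, n);
      remaining -= n;
    }
    return newStart;                    // footer metadata records this offset
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream target = new ByteArrayOutputStream();
    target.write(new byte[]{1, 2, 3, 4});   // pretend 4 bytes already written
    byte[] encodedChunk = {9, 8, 7};
    long newStart = copyChunk(new ByteArrayInputStream(encodedChunk), target, 3);
    System.out.println(newStart);           // 4: the chunk's new starting offset
    System.out.println(target.size());      // 7
  }
}
```

Because the bytes are copied verbatim, no decoding or re-encoding happens; only the offsets in the metadata change.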
rdblue force-pushed from 54ec8a5 to cb98552
@spena could you review this when you get a chance? Thanks!
@rdblue What about file schema evolution cases? Can this patch merge 2 or more files with different schemas?
Does 'columnToCopy' need to be in order? I've seen ordering issues in Hive with HashMap on JDK 8; we use LinkedHashMap instead to avoid that. I'm not sure whether Parquet supports JDK 8, but it's something to consider if the order is critical.
No, this is just preparing a way to look up the source columns that will be copied. The order is determined by the output file's schema and columns are added in that order to columnsInOrder. Columns need to match the order of the output file's schema, so they may be reordered, but the copy section handles that and copies them in contiguous chunks.
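The point that the map is lookup-only, with the copy order driven by the output file's schema, can be illustrated with a small sketch. All names here (`columnsInOrder`, the string-keyed map) are hypothetical stand-ins for the patch's actual types, chosen to show why HashMap iteration order is irrelevant.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the lookup map's iteration order does not matter
// because the output schema's column order drives the copy order; the
// map is only consulted to find each source chunk by column path.
public class ColumnOrderSketch {
  static List<String> columnsInOrder(List<String> outputSchemaPaths,
                                     Map<String, String> columnsToCopy) {
    List<String> ordered = new ArrayList<>();
    for (String path : outputSchemaPaths) {     // output schema drives order
      String chunk = columnsToCopy.get(path);   // map is lookup-only
      if (chunk != null) ordered.add(chunk);
    }
    return ordered;
  }

  public static void main(String[] args) {
    Map<String, String> toCopy = new HashMap<>(); // deliberately unordered
    toCopy.put("b", "chunk-b");
    toCopy.put("a", "chunk-a");
    System.out.println(columnsInOrder(Arrays.asList("a", "b"), toCopy));
    // [chunk-a, chunk-b] regardless of HashMap iteration order
  }
}
```

Since every access is a keyed `get`, swapping HashMap for LinkedHashMap would change nothing observable here.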
@spena, the schema for the output file is set when opening the file writer, so the desired output schema is known. If that schema doesn't match the incoming data file's schema (or block's schema) then there are a few cases:
@spena you're a committer now ;)
@rdblue Overall, the patch looks pretty good. I give a +1 to this.
Thanks, Sergio. I built a couple of tools in a Kite branch so we could use this to merge Impala files for performance testing (rdblue/kite@54ba1f2). The typical use is this: https://github.com/rdblue/kite/blob/54ba1f2a3977d0a8913bf2d8beb3b4b7aae72f9e/kite-tools-parent/kite-tools/src/main/java/org/kitesdk/cli/commands/MergeParquetCommand.java#L126
PARQUET-382: Add methods to append encoded data to files.

This allows appending encoded data blocks to open ParquetFileWriters, which makes it possible to merge multiple Parquet files without re-encoding all of the records. This works by finding the column chunk for each column in the file schema and then streaming the encoded data from one file to the other. New starting offsets are tracked and the column chunk metadata in the footer is updated with the new starting positions.

Author: Ryan Blue <blue@apache.org>

Closes apache#278 from rdblue/PARQUET-382-append-encoded-blocks and squashes the following commits:

cb98552 [Ryan Blue] PARQUET-382: Add methods to append encoded data to files.

Conflicts:
parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
Resolution: Ignored problem with adjacent changes to imports.