
PARQUET-382: Add methods to append encoded data to files. #278

Closed
rdblue wants to merge 1 commit into apache:master from rdblue:PARQUET-382-append-encoded-blocks

Conversation

@rdblue
Contributor

@rdblue rdblue commented Sep 24, 2015

This allows appending encoded data blocks to open ParquetFileWriters,
which makes it possible to merge multiple Parquet files without
re-encoding all of the records.

This works by finding the column chunk for each column in the file
schema and then streaming the encoded data from one file to the other.
New starting offsets are tracked and the column chunk metadata in the
footer is updated with the new starting positions.
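
The mechanism described above can be sketched with plain java.io (this is an illustrative sketch, not the actual parquet-mr API; `ChunkAppender` and `appendChunk` are hypothetical names). A column chunk is treated as a contiguous byte range in the source file; its bytes are streamed to the output, and the chunk's new starting offset is returned so the footer metadata can be rewritten with the new position:

```java
import java.io.*;

// Minimal sketch, assuming a column chunk is a contiguous byte range
// [srcOffset, srcOffset + length) in the source file. Illustrative only.
class ChunkAppender {
    // Streams `length` bytes from `from` (starting at `srcOffset`) to `to`,
    // and returns the chunk's new starting offset in the output file
    // (the caller tracks the current output position).
    static long appendChunk(RandomAccessFile from, long srcOffset, long length,
                            OutputStream to, long currentOutPos) throws IOException {
        long newStart = currentOutPos;  // new starting offset for the footer metadata
        from.seek(srcOffset);
        byte[] buf = new byte[8192];
        long remaining = length;
        while (remaining > 0) {
            int n = from.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) throw new EOFException("unexpected end of source chunk");
            to.write(buf, 0, n);
            remaining -= n;
        }
        return newStart;
    }
}
```

Because the bytes are copied verbatim, the pages never pass through a decoder, which is what makes the merge cheap compared to re-encoding every record.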

@rdblue
Contributor Author

rdblue commented Sep 24, 2015

Test failure due to flaky Hadoop MemoryManager test. See #269.

Contributor Author

Need to make sure these are set properly by methods that append blocks.

Contributor Author

Fixed and added tests.

@rdblue rdblue force-pushed the PARQUET-382-append-encoded-blocks branch from 54ec8a5 to cb98552 on October 22, 2015 at 20:26
@rdblue
Contributor Author

rdblue commented Oct 22, 2015

@spena could you review this when you get a chance? Thanks!

@spena

spena commented Oct 26, 2015

@rdblue What about file schema evolution cases? Can this patch merge 2 or more files with different schemas?


Does 'columnToCopy' need to be in order? I've seen ordering issues in Hive with HashMap on JDK 8. We use LinkedHashMap instead to avoid that. Not sure if we support JDK 8 in Parquet, but it is something to consider if the order is critical.

Contributor Author

No, this is just preparing a way to look up the source columns that will be copied. The order is determined by the output file's schema and columns are added in that order to columnsInOrder. Columns need to match the order of the output file's schema, so they may be reordered, but the copy section handles that and copies them in contiguous chunks.
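
The lookup-versus-order distinction can be sketched like this (hypothetical names modeled on the discussion; the values here are stand-in strings, not real column chunk metadata). The map is only consulted to find a source column by its path; the copy order comes from iterating the output file's schema, so the map's iteration order never matters and a plain HashMap is safe:

```java
import java.util.*;

// Sketch of the lookup described above: columnsToCopy maps a column path
// to its source chunk, and the output schema's column order determines
// the order in which chunks are collected for copying.
class ColumnOrdering {
    static List<String> orderForOutput(Map<String, String> columnsToCopy,
                                       List<String> outputSchemaPaths) {
        List<String> columnsInOrder = new ArrayList<>();
        for (String path : outputSchemaPaths) {      // output schema drives the order
            String chunk = columnsToCopy.get(path);  // map is used for lookup only
            if (chunk != null) {
                columnsInOrder.add(chunk);
            }
        }
        return columnsInOrder;
    }
}
```
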

@rdblue
Contributor Author

rdblue commented Oct 28, 2015

@spena, the schema for the output file is set when opening the file writer, so the desired output schema is known. If that schema doesn't match the incoming data file's schema (or block's schema) then there are a few cases:

@julienledem
Member

@spena you're a committer now ;)

@spena

spena commented Dec 8, 2015

@rdblue Overall, the patch looks pretty good. I give a +1 to this.
Just one quick question. How are these methods going to be used? Will you add a follow-up JIRA to call this new implementation?

@rdblue
Contributor Author

rdblue commented Dec 8, 2015

Thanks, Sergio.

I built a couple tools in a Kite branch so we could use this to merge Impala files for performance testing (rdblue/kite@54ba1f2). The typical use is this: https://github.com/rdblue/kite/blob/54ba1f2a3977d0a8913bf2d8beb3b4b7aae72f9e/kite-tools-parent/kite-tools/src/main/java/org/kitesdk/cli/commands/MergeParquetCommand.java#L126

@asfgit asfgit closed this in b45c4bd Dec 8, 2015
piyushnarang pushed a commit to piyushnarang/parquet-mr that referenced this pull request Jun 15, 2016

Author: Ryan Blue <blue@apache.org>

Closes apache#278 from rdblue/PARQUET-382-append-encoded-blocks and squashes the following commits:

cb98552 [Ryan Blue] PARQUET-382: Add methods to append encoded data to files.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jul 13, 2016

Conflicts:
	parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
Resolution:
    Ignored problem with adjacent changes to imports.
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 6, 2017
rdblue added a commit to rdblue/parquet-mr that referenced this pull request Jan 10, 2017