write `ColumnMetadata` after the column chunk data, not the `ColumnChunk` #1947

liukun4515 · 2022-06-25T07:52:12Z

Which issue does this PR close?

Closes #1946

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter · 2022-06-25T08:11:34Z

Codecov Report

Merging #1947 (b104d64) into master (9f7b600) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1947   +/-   ##
=======================================
  Coverage   83.48%   83.48%           
=======================================
  Files         221      221           
  Lines       57054    57068   +14     
=======================================
+ Hits        47629    47641   +12     
- Misses       9425     9427    +2

| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/file/metadata.rs | `95.32% <100.00%> (+0.19%)` ⬆️ |
| parquet/src/file/writer.rs | `92.92% <100.00%> (ø)` |
| arrow/src/datatypes/datatype.rs | `65.42% <0.00%> (-0.38%)` ⬇️ |
| parquet_derive/src/parquet_field.rs | `65.98% <0.00%> (-0.23%)` ⬇️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

liukun4515 · 2022-06-27T05:49:02Z

@tustvold PTAL

tustvold

I think this makes sense; I left some minor nits. With a simple unit test, I think this can go in.

FWIW I did some digging and, tbh, I'm not sure anything ever actually reads this data: parquet-mr doesn't even write it, and arrow C++ writes it but doesn't appear to ever read it (apache/parquet-cpp#224). Hive and Impala allegedly make use of it, but I can't find out where/how/why...

I can't help wondering if this was an oversight in the original parquet specification, not collocating column chunk metadata in the footer, that has since been papered over. All readers I can find simply read the ColumnChunkMetadata from the footer and ignore everything else.

tustvold · 2022-06-27T10:08:56Z

parquet/src/file/metadata.rs

@@ -611,6 +611,29 @@ impl ColumnChunkMetaData {
            encrypted_column_metadata: None,
        }
    }
+
+    /// Method to convert to Thrift `ColumnMetaData`
+    pub fn to_column_metadata_thrift(&self) -> ColumnMetaData {


Could we change `to_thrift` above to use this method?
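
For illustration, a minimal sketch of what that delegation could look like. The structs here are simplified stand-ins with only a few fields, not the real Thrift-generated or parquet crate types, and the field selection is an assumption made for brevity; only the method names `to_thrift` and `to_column_metadata_thrift` come from the PR.

```rust
// Simplified stand-in for the Thrift `ColumnMetaData` struct
// (the real one has many more fields).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnMetaData {
    pub num_values: i64,
    pub total_compressed_size: i64,
    pub data_page_offset: i64,
}

// Simplified stand-in for the Thrift `ColumnChunk` struct.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnChunk {
    pub file_path: Option<String>,
    pub file_offset: i64,
    pub meta_data: Option<ColumnMetaData>,
}

// Simplified stand-in for the crate's in-memory `ColumnChunkMetaData`.
pub struct ColumnChunkMetaData {
    pub file_path: Option<String>,
    pub file_offset: i64,
    pub num_values: i64,
    pub total_compressed_size: i64,
    pub data_page_offset: i64,
}

impl ColumnChunkMetaData {
    /// Builds only the `ColumnMetaData` part.
    pub fn to_column_metadata_thrift(&self) -> ColumnMetaData {
        ColumnMetaData {
            num_values: self.num_values,
            total_compressed_size: self.total_compressed_size,
            data_page_offset: self.data_page_offset,
        }
    }

    /// Builds the full `ColumnChunk`, reusing the method above so the
    /// field mapping lives in one place.
    pub fn to_thrift(&self) -> ColumnChunk {
        ColumnChunk {
            file_path: self.file_path.clone(),
            file_offset: self.file_offset,
            meta_data: Some(self.to_column_metadata_thrift()),
        }
    }
}
```

With this shape, the `meta_data` embedded in the footer's `ColumnChunk` and the standalone `ColumnMetaData` cannot drift apart, since both come from the same conversion.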

tustvold · 2022-06-27T10:09:55Z

parquet/src/file/writer.rs

    /// Returns Ok() if there are not errors serializing and writing data into the sink.
    #[inline]
-    fn serialize_column_chunk(&mut self, chunk: parquet::ColumnChunk) -> Result<()> {
+    fn serialize_column_chunk(


Could we just remove this method and move its content into write_metadata?

I don't have any preferences

liukun4515 · 2022-06-28T03:34:02Z

> I can't help wondering if this was an oversight in the original parquet specification, not collocating column chunk metadata in the footer, that has since been papered over. All readers I can find simply read the ColumnChunkMetadata from the footer and ignore everything else.

I share your confusion about this metadata.
I went through parquet-mr (the Java implementation), which doesn't append this metadata at the end of each column chunk and instead reads it from the FileMetaData in the footer.

But from the format definition (https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L790), we can see that file_offset is a required field, while ColumnMetaData (https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L796) is an optional field.
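
To make the layout concrete, here is a rough sketch of the write order this PR is about: page bytes first, then the serialized `ColumnMetaData`. It assumes `file_offset` is meant to record where that trailing metadata begins. The function name, the `ChunkLocation` struct and the raw byte slices are hypothetical; the real writer serializes through Thrift rather than taking pre-serialized bytes.

```rust
use std::io::{self, Write};

/// Hypothetical record of where one column chunk ended up in the file.
#[derive(Debug, PartialEq)]
pub struct ChunkLocation {
    /// Offset of the first page byte of the chunk.
    pub data_offset: u64,
    /// Offset at which the trailing metadata bytes start; assumed here
    /// to be the value stored in `file_offset`.
    pub file_offset: u64,
}

/// Appends the page bytes and then the already-serialized metadata bytes
/// to `sink`. `start` is the number of bytes written to `sink` so far.
/// Returns the chunk location and the new end-of-file position.
pub fn write_chunk_then_metadata<W: Write>(
    sink: &mut W,
    start: u64,
    page_bytes: &[u8],
    metadata_bytes: &[u8],
) -> io::Result<(ChunkLocation, u64)> {
    // 1. Column chunk data (pages).
    sink.write_all(page_bytes)?;
    // 2. The metadata goes right behind the data.
    let file_offset = start + page_bytes.len() as u64;
    sink.write_all(metadata_bytes)?;
    let end = file_offset + metadata_bytes.len() as u64;
    Ok((
        ChunkLocation {
            data_offset: start,
            file_offset,
        },
        end,
    ))
}
```

A reader that only consults the footer never needs the trailing copy, which matches the observation above that most readers ignore it.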

liukun4515 · 2022-06-28T03:48:07Z

Many systems and readers just read the footer to get the metadata, but I think we should simply follow parquet-format.
Maybe this is just a historical issue or a historical design decision.

github-actions bot added the parquet Changes to the parquet crate label Jun 25, 2022

tustvold reviewed Jun 27, 2022

fix bug: write column metadata to the behind of the column chunk data

82c5534

liukun4515 force-pushed the parquet_column_append_metadata branch from b104d64 to 82c5534 Compare June 28, 2022 03:43

liukun4515 requested a review from tustvold June 28, 2022 06:10

tustvold approved these changes Jun 28, 2022

tustvold merged commit 464e8d1 into apache:master Jun 28, 2022

alamb changed the title ~~write columnmetadata to the behind of the column chunk data, not the ColumnChunk~~ write ColumnMetadata after the column chunk data, not the ColumnChunk Jul 7, 2022
