write ColumnMetadata after the column chunk data, not the ColumnChunk #1947
Codecov Report
@@ Coverage Diff @@
## master #1947 +/- ##
=======================================
Coverage 83.48% 83.48%
=======================================
Files 221 221
Lines 57054 57068 +14
=======================================
+ Hits 47629 47641 +12
- Misses 9425 9427 +2
@tustvold PTAL
I think this makes sense; I left some minor nits. With a simple unit test, this can go in.
FWIW I did some digging and tbh I'm not sure anything ever reads this data: parquet-mr doesn't even write it, and Arrow C++ writes it but doesn't appear to ever read it (apache/parquet-cpp#224). Hive and Impala allegedly make use of it, but I can't find out where/how/why...
I can't help wondering if this was an oversight in the original parquet specification, not collocating column chunk metadata in the footer, that has since been papered over. All readers I can find simply read the ColumnChunkMetadata from the footer and ignore everything else.
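To make the layout under discussion concrete, here is a minimal sketch of the write order this PR aims for: page bytes first, then a copy of the column metadata, with the footer entry recording where that copy starts. It assumes, as the parquet.thrift definition appears to say, that `file_offset` is the byte offset of the `ColumnMetaData`. The types `ColumnMetaDataSketch` and `ColumnChunkSketch`, the helper `encode_metadata`, and the function `write_column_chunk` are hypothetical stand-ins, not the crate's API, and the toy byte encoding replaces the Thrift compact protocol the real writer uses.

```rust
use std::io::{self, Write};

/// Hypothetical stand-in for the Thrift `ColumnMetaData` struct.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnMetaDataSketch {
    pub num_values: i64,
    pub data_page_offset: i64,
}

/// Hypothetical stand-in for the Thrift `ColumnChunk` kept in the footer.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnChunkSketch {
    /// Byte offset of the metadata copy written right after the chunk data.
    pub file_offset: i64,
    /// The same metadata embedded in the footer, which is what readers use.
    pub meta_data: ColumnMetaDataSketch,
}

/// Toy encoding (two little-endian i64 values). This is an assumption made
/// for the sketch; the real format serializes with Thrift.
fn encode_metadata(md: &ColumnMetaDataSketch) -> Vec<u8> {
    let mut out = Vec::with_capacity(16);
    out.extend_from_slice(&md.num_values.to_le_bytes());
    out.extend_from_slice(&md.data_page_offset.to_le_bytes());
    out
}

/// Writes the page bytes, then the metadata copy, and returns the footer
/// entry. `start` is the sink position before the pages are written.
pub fn write_column_chunk<W: Write>(
    sink: &mut W,
    start: i64,
    pages: &[u8],
    num_values: i64,
) -> io::Result<ColumnChunkSketch> {
    sink.write_all(pages)?;
    let md = ColumnMetaDataSketch {
        num_values,
        data_page_offset: start,
    };
    // The metadata copy begins immediately after the last page byte.
    let file_offset = start + pages.len() as i64;
    sink.write_all(&encode_metadata(&md))?;
    Ok(ColumnChunkSketch {
        file_offset,
        meta_data: md,
    })
}
```

A reader that only consults the footer would use `meta_data` directly and never seek to `file_offset`, which matches the behaviour described in the comment above.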
@@ -611,6 +611,29 @@ impl ColumnChunkMetaData {
            encrypted_column_metadata: None,
        }
    }

    /// Method to convert to Thrift `ColumnMetaData`
    pub fn to_column_metadata_thrift(&self) -> ColumnMetaData {
Could we change to_thrift above to use this method?
good catch
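The suggestion in this thread could look roughly like the sketch below, where the chunk-level conversion delegates to the metadata-level one so the field mapping lives in one place. Only the method names `to_thrift` and `to_column_metadata_thrift` come from the PR; the struct names and the reduced field set are hypothetical simplifications, not the crate's real definitions.

```rust
/// Hypothetical, reduced stand-in for the Thrift `ColumnMetaData`.
#[derive(Debug, Clone, PartialEq)]
pub struct ThriftColumnMetaData {
    pub num_values: i64,
    pub total_compressed_size: i64,
}

/// Hypothetical, reduced stand-in for the Thrift `ColumnChunk`.
#[derive(Debug, Clone, PartialEq)]
pub struct ThriftColumnChunk {
    pub file_offset: i64,
    pub meta_data: Option<ThriftColumnMetaData>,
}

/// Hypothetical, reduced stand-in for `ColumnChunkMetaData`.
pub struct ChunkMetaSketch {
    pub file_offset: i64,
    pub num_values: i64,
    pub total_compressed_size: i64,
}

impl ChunkMetaSketch {
    /// Single place where the metadata fields are mapped.
    pub fn to_column_metadata_thrift(&self) -> ThriftColumnMetaData {
        ThriftColumnMetaData {
            num_values: self.num_values,
            total_compressed_size: self.total_compressed_size,
        }
    }

    /// Builds the footer entry by reusing the method above instead of
    /// repeating the field-by-field construction.
    pub fn to_thrift(&self) -> ThriftColumnChunk {
        ThriftColumnChunk {
            file_offset: self.file_offset,
            meta_data: Some(self.to_column_metadata_thrift()),
        }
    }
}
```

With this shape, the copy written after the chunk data and the copy embedded in the footer come from the same conversion, so the two cannot drift apart.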
parquet/src/file/writer.rs
Outdated
/// Returns Ok() if there are not errors serializing and writing data into the sink.
#[inline]
fn serialize_column_chunk(&mut self, chunk: parquet::ColumnChunk) -> Result<()> {
fn serialize_column_chunk(
Could we just remove this method and move its content into write_metadata?
I don't have any preferences
I share your confusion about this metadata. But from the definition in the format https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L790, we can know the
b104d64 to 82c5534
Many systems and readers just read the footer to get the metadata; I think we should just follow the parquet-format.
Which issue does this PR close?
Closes #1946
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?