Add options to control various aspects of Parquet metadata decoding #8763
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,108 @@ | ||
| // Licensed to the Apache Software Foundation (ASF) under one | ||
| // or more contributor license agreements. See the NOTICE file | ||
| // distributed with this work for additional information | ||
| // regarding copyright ownership. The ASF licenses this file | ||
| // to you under the Apache License, Version 2.0 (the | ||
| // "License"); you may not use this file except in compliance | ||
| // with the License. You may obtain a copy of the License at | ||
| // | ||
| // http://www.apache.org/licenses/LICENSE-2.0 | ||
| // | ||
| // Unless required by applicable law or agreed to in writing, | ||
| // software distributed under the License is distributed on an | ||
| // "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| // KIND, either express or implied. See the License for the | ||
| // specific language governing permissions and limitations | ||
| // under the License. | ||
|
|
||
| //! Options used to control metadata parsing | ||
| use crate::schema::types::SchemaDescPtr; | ||
|
|
||
| /// Options that can be set to control what parts of the Parquet file footer | ||
| /// metadata will be decoded and made present in the [`ParquetMetaData`] returned | ||
| /// by [`ParquetMetaDataReader`] and [`ParquetMetaDataPushDecoder`]. | ||
| /// | ||
| /// [`ParquetMetaData`]: crate::file::metadata::ParquetMetaData | ||
| /// [`ParquetMetaDataReader`]: crate::file::metadata::ParquetMetaDataReader | ||
| /// [`ParquetMetaDataPushDecoder`]: crate::file::metadata::ParquetMetaDataPushDecoder | ||
| #[derive(Default, Debug, Clone)] | ||
| pub struct ParquetMetaDataOptions { | ||
| schema_descr: Option<SchemaDescPtr>, | ||
|
Member
Does this mean (1) a user-provided schema or (2) only (min, max, etc.) columns in
Contributor
Author
It's (1). Say you have a large number of files that share the same schema: there's no need to decode the schema from every one of them. Just grab the schema from the first file and use it for all the others.
Contributor
Here is a ticket that explains the use case a bit more; |
||
| } | ||
|
|
||
| impl ParquetMetaDataOptions { | ||
| /// Return a new default [`ParquetMetaDataOptions`]. | ||
| pub fn new() -> Self { | ||
| Default::default() | ||
| } | ||
|
|
||
| /// Returns an optional [`SchemaDescPtr`] to use when decoding. If this is not `None` then | ||
| /// the schema in the footer will be skipped. | ||
| pub fn schema(&self) -> Option<&SchemaDescPtr> { | ||
| self.schema_descr.as_ref() | ||
| } | ||
|
|
||
| /// Provide a schema to use when decoding the metadata. | ||
| pub fn set_schema(&mut self, val: SchemaDescPtr) { | ||
| self.schema_descr = Some(val); | ||
| } | ||
|
|
||
| /// Provide a schema to use when decoding the metadata. Returns `Self` for chaining. | ||
| pub fn with_schema(mut self, val: SchemaDescPtr) -> Self { | ||
| self.schema_descr = Some(val); | ||
| self | ||
| } | ||
| } | ||
|
||
|
|
||
| #[cfg(test)] | ||
| mod tests { | ||
| use bytes::Bytes; | ||
|
|
||
| use crate::{ | ||
| DecodeResult, | ||
| file::metadata::{ParquetMetaDataOptions, ParquetMetaDataPushDecoder}, | ||
| util::test_common::file_util::get_test_file, | ||
| }; | ||
| use std::{io::Read, sync::Arc}; | ||
|
|
||
| #[test] | ||
| fn test_provide_schema() { | ||
| let mut buf: Vec<u8> = Vec::new(); | ||
| get_test_file("alltypes_plain.parquet") | ||
| .read_to_end(&mut buf) | ||
| .unwrap(); | ||
|
|
||
| let data = Bytes::from(buf); | ||
| let mut decoder = ParquetMetaDataPushDecoder::try_new(data.len() as u64).unwrap(); | ||
| decoder | ||
| .push_range(0..data.len() as u64, data.clone()) | ||
| .unwrap(); | ||
|
|
||
| let expected = match decoder.try_decode().unwrap() { | ||
| DecodeResult::Data(m) => m, | ||
| _ => panic!("could not parse metadata"), | ||
| }; | ||
| let expected_schema = expected.file_metadata().schema_descr_ptr(); | ||
|
|
||
| let mut options = ParquetMetaDataOptions::new(); | ||
| options.set_schema(expected_schema); | ||
| let options = Arc::new(options); | ||
|
|
||
| let mut decoder = ParquetMetaDataPushDecoder::try_new(data.len() as u64) | ||
| .unwrap() | ||
| .with_metadata_options(Some(options)); | ||
| decoder.push_range(0..data.len() as u64, data).unwrap(); | ||
| let metadata = match decoder.try_decode().unwrap() { | ||
| DecodeResult::Data(m) => m, | ||
| _ => panic!("could not parse metadata"), | ||
| }; | ||
|
|
||
| assert_eq!(expected, metadata); | ||
|
||
| // the schema pointers should be the same | ||
| assert!(Arc::ptr_eq( | ||
| &expected.file_metadata().schema_descr_ptr(), | ||
| &metadata.file_metadata().schema_descr_ptr() | ||
| )); | ||
| } | ||
| } | ||
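The use case raised in the review thread on `schema_descr` (many files that share one schema, so the schema is decoded once and supplied to every later decode) can be sketched with standard-library types only. Everything below is an illustrative assumption rather than the parquet crate's API: `SchemaDescriptor`, `MetaDataOptions`, `Footer`, `decode_schema`, `read_schema`, and `decode_all` are hypothetical stand-ins. Only the `Option<SchemaDescPtr>` field and the shapes of `schema` and `with_schema` mirror the diff above.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the crate's schema descriptor (not the real type).
#[derive(Debug, PartialEq)]
pub struct SchemaDescriptor {
    pub columns: Vec<String>,
}

pub type SchemaDescPtr = Arc<SchemaDescriptor>;

/// Mirrors the shape of `ParquetMetaDataOptions` in the diff (assumption:
/// simplified to the single field this sketch needs).
#[derive(Default, Debug, Clone)]
pub struct MetaDataOptions {
    schema_descr: Option<SchemaDescPtr>,
}

impl MetaDataOptions {
    /// Returns the caller-supplied schema, if any.
    pub fn schema(&self) -> Option<&SchemaDescPtr> {
        self.schema_descr.as_ref()
    }

    /// Builder-style setter, returning `Self` for chaining.
    pub fn with_schema(mut self, val: SchemaDescPtr) -> Self {
        self.schema_descr = Some(val);
        self
    }
}

/// Hypothetical footer: a comma-separated column list stands in for the
/// encoded schema bytes.
pub struct Footer {
    pub encoded_schema: String,
}

/// Pretend-expensive schema decode; bumps `decode_count` on every call.
fn decode_schema(footer: &Footer, decode_count: &mut usize) -> SchemaDescPtr {
    *decode_count += 1;
    let columns = footer.encoded_schema.split(',').map(str::to_owned).collect();
    Arc::new(SchemaDescriptor { columns })
}

/// Use the supplied schema when present; otherwise decode the footer's schema.
fn read_schema(
    footer: &Footer,
    options: &MetaDataOptions,
    decode_count: &mut usize,
) -> SchemaDescPtr {
    match options.schema() {
        Some(schema) => Arc::clone(schema),
        None => decode_schema(footer, decode_count),
    }
}

/// Decode the schema from the first footer only and reuse it for the rest.
/// Returns one schema pointer per footer plus the number of decodes performed.
pub fn decode_all(footers: &[Footer]) -> (Vec<SchemaDescPtr>, usize) {
    let mut decode_count = 0;
    let mut options = MetaDataOptions::default();
    let mut schemas = Vec::with_capacity(footers.len());
    for footer in footers {
        let schema = read_schema(footer, &options, &mut decode_count);
        if options.schema().is_none() {
            // First file: remember its schema for every later decode.
            options = options.with_schema(Arc::clone(&schema));
        }
        schemas.push(schema);
    }
    (schemas, decode_count)
}
```

In this sketch, calling `decode_all` on several footers performs a single decode, and every returned pointer is a clone of the first `Arc`, which is the same pointer-equality property that `test_provide_schema` checks with `Arc::ptr_eq`. The sketch does not verify that later footers really carry the same schema; that responsibility is assumed to sit with the caller.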
I reviewed the `ArrowReaderOptions` and `ArrowReaderMetadata` structures and their use, and I agree this is the appropriate structure to add metadata parsing to.
Do you think it eventually makes sense to move the other fields from `ArrowReaderOptions` to `ParquetMetaDataOptions`? (e.g. `supplied_schema`)

I was thinking perhaps the `page_index_policy`, but the other things in `ArrowReaderOptions` are more Arrow specific rather than Parquet. That might get confusing.