-
Notifications
You must be signed in to change notification settings - Fork 807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Example of reading and writing parquet metadata outside the file #6081
Conversation
/// Specifically it: | ||
/// 1. It reads the metadata of a Parquet file | ||
/// 2. Removes some column statistics from the metadata (to make them smaller) | ||
/// 3. Stores the metadata in a separate file | ||
/// 4. Reads the metadata from the separate file and uses that to read the Parquet file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm I feel like we can simplify this example a bit. My use case is essentially along the lines of https://github.com/apache/datafusion/pull/10701/files#diff-81450b08df2ee29b3a9069865fc4f0c94883023c9d75bde729756c6bb4ec630d but instead of the metadata cache being in-memory you can imagine it's on disk (so that e.g. I can cache more metadata than would fit in memory).
Maybe this can be modeled something like:
struct KeyValueStore {
storage: HashMap<String, Vec<u8>>
}
impl KeyValueStore {
pub async fn get(&self, key: String) -> &[u8];
pub async fn set(&self, key: String, value: Vec<u8>);
}
The point being that we serialize the metadata to the key value store and then deserialize it from there, passing it into the reader instead of having the reader get it from the file itself. I don't think the editing of the metadata is necessary to get this example across.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, make sense. I agree maybe that is being overly ambitious
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think reading/writing to a file is pretty similar and actually using a kv store might make it more complicated so I kept a file for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
File is fine by me 😄. Maybe a comment about storing the metadata in a fast cache like Redis or in a metadata store will be enough to spark imagination?
ea603d4
to
ddd4240
Compare
/// This function reads the format written by `write_metadata_to_file` | ||
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData { | ||
let mut file = std::fs::File::open(file).unwrap(); | ||
// This API is kind of awkward compared to the writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also filled out this part of the PR showing how to read the metadata back -- it is (very) ugly compared to the nice ParquetMetadataWriter
@adriangb any interest in creating a ParquetMetadataReader
API similar to ParquetMetadataWriter
that handles these details? If so I can create a ticket / review a PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes certainly interested!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about this as a plan: #6002 (comment)
let file = std::fs::File::open(file).unwrap(); | ||
let options = ArrowReaderOptions::new() | ||
// tell the reader to read the page index | ||
.with_page_index(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is also kind of akward -- it would be great if the actual reading of the parquet metadata could do this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean loading the page index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, what I was trying to get at was that since the ColumnIndex
and OffsetIndex
(aka the "Page index structures") are not store inline, decode_metadata
doesn't read them -- the logic to do so is baked into this reader
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to describe this more on #6002 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we now get rid of with_page_index
? Didn't ParquetMetaDataReader
already load it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we can...it appears that options.page_index
is only used in ArrowReaderMetadata::load
, so it should have no effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is now pretty clear -- we still need to explicitly call for loading the page index, but it makes sense I think
We could potentially change the default to be "read the page index unless told not to"
/// This function reads the format written by `write_metadata_to_file` | ||
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData { | ||
let mut file = std::fs::File::open(file).unwrap(); | ||
// This API is kind of awkward compared to the writer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes certainly interested!
let file = std::fs::File::open(file).unwrap(); | ||
let options = ArrowReaderOptions::new() | ||
// tell the reader to read the page index | ||
.with_page_index(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean loading the page index?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
parquet/src/file/metadata/writer.rs
Outdated
let column_indexes = self.convert_column_indexes(); | ||
let offset_indexes = self.convert_offset_index(); | ||
|
||
let mut encoder = ThriftMetadataWriter::new( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would encoder
better serializer
here?
578b8be
to
9bcf552
Compare
ParquetMetaDataReader::new() | ||
.with_column_indexes(true) | ||
.with_offset_indexes(true) | ||
.parse_and_finish(&mut file) | ||
.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm lost but I thought you'd need to pass in the original file size here to adjust offsets.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet the problem is that the example file doesn't actually have page offsets / page indexes 🤔 to load so the problem isn't hit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because an actual File
is being passed, we can seek wherever we need to to find the page indexes. We only need to pass the original file size if we're passing a buffer with just the tail of the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But doesn't that file only have the tail of the original file? In other words, the file it's opening has for example 100 bytes but there's byte offsets referencing byte 101 to load the page index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, needed to read more of the example 😅. I think @alamb is correct...the original file has no page indexes anyway, so on the second read, even though we ask for them, since there are no offsets specified, there's no seeking done to find them. It would be interesting to see what happens if we start with a different file (alltypes_tiny_pages.parquet for instance), since we don't ask for the page indexes in the first read, what happens to the page index offsets when we write the metadata back out?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interestingly, this has also come up downstream in DataFusion with @progval on apache/datafusion#12593 (where the semi-automatic loading of page indexes causes unintended accesses and slowdowns)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData. Then again, I could see wanting to preserve the page index offsets of the original file if you only want to save the footer metadata externally...perhaps an option on the metadata writer to preserve page index offsets if desired?
This is an excellent point. This is why I think it is so important to have this example to motivate the API design
As I understand the usecase it
- store the parquet metadata as some bytes externally (e.g. in a database like redis, or some other location)
- Use that metadata both for various pruning as well as actually reading the parquet data when needed
I'll update the example to also have a file with page indexes and see what happens 🏃
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep that's exactly my use case. I've done it by implementing AsyncFileReader::get_metadata
so it works for both pruning and reading the file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wrote up some tests here: #6463
The good news is that as long as you load the page indexes with the initial metadata load, the ParquetMetadataWriter
will correctly update the offsets so the metadata can be read again
The bad news is that #6464 happens (precisely as @etseidl predicated above):
I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also made a PR with some tests for this usecase: #6463
ParquetMetaDataReader::new() | ||
.with_column_indexes(true) | ||
.with_offset_indexes(true) | ||
.parse_and_finish(&mut file) | ||
.unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I bet the problem is that the example file doesn't actually have page offsets / page indexes 🤔 to load so the problem isn't hit.
622cc7a
to
b6454ab
Compare
e1d54e4
to
3d6b976
Compare
Current status:
|
a112ae4
to
f73a96d
Compare
Ok, i think this PR is now basically ready to go. I have one final small API addition (for modifying ColumnChunkMetadata) here: #6523 but once that is merged then this PR will be ready for review |
f73a96d
to
0a125e9
Compare
0a125e9
to
9d926cf
Compare
//! and [`decode_metadata`] | ||
//! * Read from an `async` source to `ParquetMetaData`: [`MetadataLoader`] | ||
//! * Read from bytes or from an async source to `ParquetMetaData`: [`ParquetMetaDataReader`] | ||
//! * [`ParquetMetaDataReader`] for reading |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also updates the docs to point at the new APIs added in the previous releases
I can pull these changes out into their own PR if we prefer
Ok, 3 months later this PR is now ready for a good review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks for doing this, it really helped drive the development to have a concrete example.
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Which issue does this PR close?
Related to #6002
Closes #6504
Rationale for this change
To figure out a good API we need an example of what we are trying to do
What changes are included in this PR?
Are there any user-facing changes?
Not yet, just an example