Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example of reading and writing parquet metadata outside the file #6081

Merged
merged 2 commits into from
Oct 10, 2024

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 17, 2024

Which issue does this PR close?

Related to #6002
Closes #6504

Rationale for this change

To figure out a good API we need an example of what we are trying to do

What changes are included in this PR?

  1. Adds an example, with comments
  2. The example is based on my interpretation of @adriangb's description here API for encoding/decoding ParquetMetadata with more control #6002 (comment)

Are there any user-facing changes?

Not yet, just an example

Comment on lines 33 to 64
/// Specifically it:
/// 1. It reads the metadata of a Parquet file
/// 2. Removes some column statistics from the metadata (to make them smaller)
/// 3. Stores the metadata in a separate file
/// 4. Reads the metadata from the separate file and uses that to read the Parquet file
Copy link
Contributor

@adriangb adriangb Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm I feel like we can simplify this example a bit. My use case is essentially along the lines of https://github.com/apache/datafusion/pull/10701/files#diff-81450b08df2ee29b3a9069865fc4f0c94883023c9d75bde729756c6bb4ec630d but instead of the metadata cache being in-memory you can imagine it's on disk (so that e.g. I can cache more metadata than would fit in memory).

Maybe this can be modeled something like:

struct KeyValueStore {
    storage: HashMap<String, Vec<u8>>
}

impl KeyValueStore {
    pub async fn get(&self, key: String) -> &[u8];
    pub async fn set(&self, key: String, value: Vec<u8>);
}

The point being that we serialize the metadata to the key value store and then deserialize it from there, passing it into the reader instead of having the reader get it from the file itself. I don't think the editing of the metadata is necessary to get this example across.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, make sense. I agree maybe that is being overly ambitious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think reading/writing to a file is pretty similar and actually using a kv store might make it more complicated so I kept a file for now

Copy link
Contributor

@adriangb adriangb Aug 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

File is fine by me 😄. Maybe a comment about storing the metadata in a fast cache like Redis or in a metadata store will be enough to spark imagination?

@alamb alamb force-pushed the alamb/parquet-stats-example branch from ea603d4 to ddd4240 Compare August 6, 2024 22:10
parquet/examples/external_metadata.rs Outdated Show resolved Hide resolved
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
let mut file = std::fs::File::open(file).unwrap();
// This API is kind of awkward compared to the writer
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also filled out this part of the PR showing how to read the metadata back -- it is (very) ugly compared to the nice ParquetMetadataWriter

@adriangb any interest in creating a ParquetMetadataReader API similar to ParquetMetadataWriter that handles these details? If so I can create a ticket / review a PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes certainly interested!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about this as a plan: #6002 (comment)

let file = std::fs::File::open(file).unwrap();
let options = ArrowReaderOptions::new()
// tell the reader to read the page index
.with_page_index(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is also kind of akward -- it would be great if the actual reading of the parquet metadata could do this...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean loading the page index?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, what I was trying to get at was that since the ColumnIndex and OffsetIndex (aka the "Page index structures") are not store inline, decode_metadata doesn't read them -- the logic to do so is baked into this reader

https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to describe this more on #6002 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we now get rid of with_page_index? Didn't ParquetMetaDataReader already load it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we can...it appears that options.page_index is only used in ArrowReaderMetadata::load, so it should have no effect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is now pretty clear -- we still need to explicitly call for loading the page index, but it makes sense I think

We could potentially change the default to be "read the page index unless told not to"

parquet/examples/external_metadata.rs Outdated Show resolved Hide resolved
/// This function reads the format written by `write_metadata_to_file`
fn read_metadata_from_file(file: impl AsRef<Path>) -> ParquetMetaData {
let mut file = std::fs::File::open(file).unwrap();
// This API is kind of awkward compared to the writer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes certainly interested!

let file = std::fs::File::open(file).unwrap();
let options = ArrowReaderOptions::new()
// tell the reader to read the page index
.with_page_index(true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean loading the page index?

Copy link
Member

@mapleFU mapleFU left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let column_indexes = self.convert_column_indexes();
let offset_indexes = self.convert_offset_index();

let mut encoder = ThriftMetadataWriter::new(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would encoder better serializer here?

Comment on lines 125 to 166
ParquetMetaDataReader::new()
.with_column_indexes(true)
.with_offset_indexes(true)
.parse_and_finish(&mut file)
.unwrap()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm lost but I thought you'd need to pass in the original file size here to adjust offsets.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet the problem is that the example file doesn't actually have page offsets / page indexes 🤔 to load so the problem isn't hit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because an actual File is being passed, we can seek wherever we need to to find the page indexes. We only need to pass the original file size if we're passing a buffer with just the tail of the file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But doesn't that file only have the tail of the original file? In other words, the file it's opening has for example 100 bytes but there's byte offsets referencing byte 101 to load the page index.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, needed to read more of the example 😅. I think @alamb is correct...the original file has no page indexes anyway, so on the second read, even though we ask for them, since there are no offsets specified, there's no seeking done to find them. It would be interesting to see what happens if we start with a different file (alltypes_tiny_pages.parquet for instance), since we don't ask for the page indexes in the first read, what happens to the page index offsets when we write the metadata back out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interestingly, this has also come up downstream in DataFusion with @progval on apache/datafusion#12593 (where the semi-automatic loading of page indexes causes unintended accesses and slowdowns)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData. Then again, I could see wanting to preserve the page index offsets of the original file if you only want to save the footer metadata externally...perhaps an option on the metadata writer to preserve page index offsets if desired?

This is an excellent point. This is why I think it is so important to have this example to motivate the API design

As I understand the usecase it

  1. store the parquet metadata as some bytes externally (e.g. in a database like redis, or some other location)
  2. Use that metadata both for various pruning as well as actually reading the parquet data when needed

I'll update the example to also have a file with page indexes and see what happens 🏃

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep that's exactly my use case. I've done it by implementing AsyncFileReader::get_metadata so it works for both pruning and reading the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote up some tests here: #6463

The good news is that as long as you load the page indexes with the initial metadata load, the ParquetMetadataWriter will correctly update the offsets so the metadata can be read again

The bad news is that #6464 happens (precisely as @etseidl predicated above):

I wonder if the metadata writer needs to modify the page index offsets/lengths in the ColumnMetaData if the indexes are not present in the ParquetMetaData.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also made a PR with some tests for this usecase: #6463

parquet/examples/external_metadata.rs Show resolved Hide resolved
Comment on lines 125 to 166
ParquetMetaDataReader::new()
.with_column_indexes(true)
.with_offset_indexes(true)
.parse_and_finish(&mut file)
.unwrap()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I bet the problem is that the example file doesn't actually have page offsets / page indexes 🤔 to load so the problem isn't hit.

@alamb alamb force-pushed the alamb/parquet-stats-example branch 3 times, most recently from 622cc7a to b6454ab Compare September 26, 2024 14:41
@alamb alamb force-pushed the alamb/parquet-stats-example branch 3 times, most recently from e1d54e4 to 3d6b976 Compare October 3, 2024 13:10
@alamb
Copy link
Contributor Author

alamb commented Oct 3, 2024

Current status:

  • I am pretty excited about how this example is looking
  • I made a PR to improve some underlying documentaiton: Improve parquet MetadataFetch and AsyncFileReader docs #6505
  • The only thing remaining in my mind is to update the example to strip statistics from ColumnMetadata but I think that will need some additional APIs as well

@alamb alamb force-pushed the alamb/parquet-stats-example branch 2 times, most recently from a112ae4 to f73a96d Compare October 7, 2024 19:20
@alamb
Copy link
Contributor Author

alamb commented Oct 7, 2024

Ok, i think this PR is now basically ready to go.

I have one final small API addition (for modifying ColumnChunkMetadata) here: #6523 but once that is merged then this PR will be ready for review

@alamb alamb force-pushed the alamb/parquet-stats-example branch from f73a96d to 0a125e9 Compare October 7, 2024 19:30
@alamb alamb force-pushed the alamb/parquet-stats-example branch from 0a125e9 to 9d926cf Compare October 8, 2024 20:31
//! and [`decode_metadata`]
//! * Read from an `async` source to `ParquetMetaData`: [`MetadataLoader`]
//! * Read from bytes or from an async source to `ParquetMetaData`: [`ParquetMetaDataReader`]
//! * [`ParquetMetaDataReader`] for reading
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this also updates the docs to point at the new APIs added in the previous releases :bowtie:

I can pull these changes out into their own PR if we prefer

@alamb alamb marked this pull request as ready for review October 8, 2024 20:32
@alamb
Copy link
Contributor Author

alamb commented Oct 8, 2024

Ok, 3 months later this PR is now ready for a good review!

Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks for doing this, it really helped drive the development to have a concrete example.

parquet/src/file/metadata/mod.rs Outdated Show resolved Hide resolved
parquet/examples/external_metadata.rs Outdated Show resolved Hide resolved
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
@alamb
Copy link
Contributor Author

alamb commented Oct 10, 2024

Thanks again for the reviews and inspiration @adriangb and @etseidl -- I think these APIs are looking quite good now ❤️

@alamb alamb added the documentation Improvements or additions to documentation label Oct 10, 2024
@alamb alamb merged commit 77dcdc0 into apache:master Oct 10, 2024
17 checks passed
@alamb alamb deleted the alamb/parquet-stats-example branch October 10, 2024 12:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add example of how to use parquet metadata reader APIs for a local cache
4 participants