Way to share SchemaDescriptorPtr across ParquetMetadata objects #5999
Comments
Just curious, would this conflict a bit with something like schema evolution (https://iceberg.apache.org/docs/1.5.1/evolution/) in Iceberg across files? Or would it just reuse the schema when opening the same file?
In my mind this feature would work well in systems that support schema evolution like Iceberg. For example, imagine a query system tracking 50 parquet files where 20 share one schema and 30 share another:
- Without the feature described in this ticket, the query system today would need to retain 50 schema objects (20 of the first class and 30 of the second).
- With the feature described in this ticket, the query system could retain only 2 schema objects.

Depending on the number of files, I think this could be substantial memory savings.
Ah, so as an implementation, something somewhere should denote that "the file's schema version is xxx", and if we find an existing version we can reuse the schema pointer. If a new version is found, it should be parsed and perhaps made "public" so that other readers can reuse it. This would be useful for wide-column tables!
I think Iceberg has a schema_id to know whether data files share the same schema or not.
Yes, that is right. I don't think the schema_id concept belongs in parquet-rs, but having the underlying structures able to reuse the same pointers would, I think, be a good building block.
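As a sketch of that building block, a query engine (not parquet-rs itself) could keep a cache keyed by Iceberg's schema_id that hands out a shared `SchemaDescPtr` (the `Arc<SchemaDescriptor>` alias in `parquet::schema::types`). The `SchemaCache` type and its method names below are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use parquet::schema::types::SchemaDescPtr;

/// Hypothetical cache a query engine might maintain: one shared
/// SchemaDescriptor per Iceberg schema_id, reused across all the
/// ParquetMetaData objects for files written with that schema.
#[derive(Default)]
struct SchemaCache {
    by_schema_id: HashMap<i32, SchemaDescPtr>,
}

impl SchemaCache {
    /// Return the shared pointer for `schema_id`, inserting `parsed`
    /// the first time that id is seen.
    fn get_or_insert(&mut self, schema_id: i32, parsed: SchemaDescPtr) -> SchemaDescPtr {
        Arc::clone(self.by_schema_id.entry(schema_id).or_insert(parsed))
    }
}
```

Every `ParquetMetaData` built for a file with a known schema_id could then point at the cached `Arc` instead of a freshly parsed copy.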
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In low-latency parquet-based query applications, it is important to be able to cache / reuse the `ParquetMetaData` from parquet files (to supply via `ArrowReaderBuilder::new_with_metadata` instead of re-reading / parsing it from the parquet footer while reading the parquet data). For many such systems (including InfluxDB 3.0), many of the files have the same schema, so storing the same schema information for each parquet file is wasteful.
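For reference, a minimal sketch of the caching pattern described above, assuming the `ArrowReaderMetadata::try_new` and `ParquetRecordBatchReaderBuilder::new_with_metadata` APIs from `parquet::arrow::arrow_reader` (the helper function name is made up for illustration):

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::arrow_reader::{
    ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
};
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Build a reader from previously cached metadata instead of
/// re-reading and re-parsing the footer of `file`.
fn reader_from_cached_metadata(
    file: File,
    cached: Arc<ParquetMetaData>,
) -> Result<ParquetRecordBatchReaderBuilder<File>> {
    // Wrap the cached ParquetMetaData for use by the Arrow reader.
    let arrow_metadata = ArrowReaderMetadata::try_new(cached, ArrowReaderOptions::new())?;
    // new_with_metadata skips the footer read / parse entirely.
    Ok(ParquetRecordBatchReaderBuilder::new_with_metadata(file, arrow_metadata))
}
```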
Describe the solution you'd like
I would like a way to share `SchemaDescriptorPtr` -- e.g. the schema is already wrapped in an `Arc`, so it is likely possible to avoid storing the same schema over and over again: https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#197
Describe alternatives you've considered
Perhaps we could add an API like `with_schema` to `ParquetMetaData`. It could be used like this:
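A minimal sketch of how that might look; the `with_schema` method and its signature are hypothetical (no such API exists in parquet-rs today), while `schema_descr_ptr` is the existing accessor on `FileMetaData`:

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;
use parquet::schema::types::SchemaDescPtr;

// Hypothetical usage of the proposed API -- `with_schema` does not exist in
// parquet-rs today, so this does not compile against the current crate.
// `first` and `second` are metadata parsed from two files with the same schema.
fn share_schema(first: &ParquetMetaData, second: ParquetMetaData) -> ParquetMetaData {
    // Reuse the schema already parsed for the first file.
    let shared: SchemaDescPtr = first.file_metadata().schema_descr_ptr();

    // Swap the second file's schema pointer for the shared one; the method
    // would also need to rewrite the per-row-group references mentioned
    // under "Additional context" below.
    let second = second.with_schema(Arc::clone(&shared));

    // Both metadata objects now point at a single SchemaDescriptor allocation.
    debug_assert!(Arc::ptr_eq(&shared, &second.file_metadata().schema_descr_ptr()));
    second
}
```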
Additional context
This infrastructure is a natural follow-on to #1729 (tracking the memory used).
This API would likely be tricky to implement given there are several references to the schema in `ParquetMetaData` child fields (e.g. https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#299).
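As a rough illustration of why, assuming the `schema_descr_ptr` accessors on both `FileMetaData` and `RowGroupMetaData` in current parquet-rs, a check like the following would have to hold after swapping in a shared schema:

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;

/// Returns true only if the top-level schema pointer and every row group's
/// schema pointer are the same Arc -- the invariant a `with_schema`-style
/// API would need to restore after swapping in a shared schema.
fn schema_fully_shared(metadata: &ParquetMetaData) -> bool {
    let top = metadata.file_metadata().schema_descr_ptr();
    metadata
        .row_groups()
        .iter()
        .all(|rg| Arc::ptr_eq(&top, &rg.schema_descr_ptr()))
}
```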