Way to share SchemaDescriptorPtr across ParquetMetadata objects #5999
Comments
Just curious, would this conflict a bit with something like schema evolution (https://iceberg.apache.org/docs/1.5.1/evolution/) in Iceberg across files? Or would it just reuse the schema when opening the same file?
In my mind this feature would work well in systems that support schema evolution like Iceberg. For example, imagine a query system tracking 50 parquet files where 20 share one schema and 30 share another:
- Without the feature described in this ticket, the query system today would need to retain 50 schema objects (20 of the first class and 30 of the second).
- With the feature described in this ticket, the query system could retain only 2 schema objects.

Depending on the number of files, I think this could be substantial memory savings.
Ah, so as an implementation, something somewhere should denote that "the file's schema version is xxx", and if we find an existing version we can reuse the schema pointer. If a new version is found, it should be parsed and perhaps made "public" so that other readers can reuse it. This would be useful for wide-column tables!
I think Iceberg has a schema_id to know whether data files share the same schema or not.
Yes, that is right. I don't think the schema_id concept belongs in parquet-rs, but having the underlying structures able to reuse the same pointers would, I think, be a good building block.
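As a sketch of that building block, a query engine (not parquet-rs itself) could keep a cache keyed by Iceberg's schema_id that hands out a shared `SchemaDescPtr` (the `Arc<SchemaDescriptor>` alias in `parquet::schema::types`). The `SchemaCache` type and its method names below are hypothetical:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use parquet::schema::types::SchemaDescPtr;

/// Hypothetical cache a query engine might maintain: one shared
/// SchemaDescriptor per Iceberg schema_id, reused across all the
/// ParquetMetaData objects for files written with that schema.
#[derive(Default)]
struct SchemaCache {
    by_schema_id: HashMap<i32, SchemaDescPtr>,
}

impl SchemaCache {
    /// Return the shared pointer for `schema_id`, inserting `parsed`
    /// the first time that id is seen.
    fn get_or_insert(&mut self, schema_id: i32, parsed: SchemaDescPtr) -> SchemaDescPtr {
        Arc::clone(self.by_schema_id.entry(schema_id).or_insert(parsed))
    }
}
```

Every `ParquetMetaData` built for a file with a known schema_id could then point at the cached `Arc` instead of a freshly parsed copy.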
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
In low-latency parquet-based query applications, it is important to be able to cache / reuse the `ParquetMetaData` from parquet files (to supply via `ArrowReaderBuilder::new_with_metadata` instead of re-reading / parsing it from the parquet footer while reading the parquet data). For many such systems (including InfluxDB 3.0), many of the files have the same schema, so storing the same schema information for each parquet file is wasteful.
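For reference, a minimal sketch of the caching pattern described above, assuming the `ArrowReaderMetadata::try_new` and `ParquetRecordBatchReaderBuilder::new_with_metadata` APIs from `parquet::arrow::arrow_reader` (the helper function name is made up for illustration):

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::arrow::arrow_reader::{
    ArrowReaderMetadata, ArrowReaderOptions, ParquetRecordBatchReaderBuilder,
};
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Build a reader from previously cached metadata instead of
/// re-reading and re-parsing the footer of `file`.
fn reader_from_cached_metadata(
    file: File,
    cached: Arc<ParquetMetaData>,
) -> Result<ParquetRecordBatchReaderBuilder<File>> {
    // Wrap the cached ParquetMetaData for use by the Arrow reader.
    let arrow_metadata = ArrowReaderMetadata::try_new(cached, ArrowReaderOptions::new())?;
    // new_with_metadata skips the footer read / parse entirely.
    Ok(ParquetRecordBatchReaderBuilder::new_with_metadata(file, arrow_metadata))
}
```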
Describe the solution you'd like
I would like a way to share `SchemaDescriptorPtr` -- e.g. the schema is already wrapped in an `Arc`, so it is likely possible to avoid storing the same schema over and over again: https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#197
Describe alternatives you've considered
Perhaps we could add an API like `with_schema` to `ParquetMetaData`. It could be used like this:
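A minimal sketch of how that might look; the `with_schema` method and its signature are hypothetical (no such API exists in parquet-rs today), while `schema_descr_ptr` is the existing accessor on `FileMetaData`:

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;
use parquet::schema::types::SchemaDescPtr;

// Hypothetical usage of the proposed API -- `with_schema` does not exist in
// parquet-rs today, so this does not compile against the current crate.
// `first` and `second` are metadata parsed from two files with the same schema.
fn share_schema(first: &ParquetMetaData, second: ParquetMetaData) -> ParquetMetaData {
    // Reuse the schema already parsed for the first file.
    let shared: SchemaDescPtr = first.file_metadata().schema_descr_ptr();

    // Swap the second file's schema pointer for the shared one; the method
    // would also need to rewrite the per-row-group references mentioned
    // under "Additional context" below.
    let second = second.with_schema(Arc::clone(&shared));

    // Both metadata objects now point at a single SchemaDescriptor allocation.
    debug_assert!(Arc::ptr_eq(&shared, &second.file_metadata().schema_descr_ptr()));
    second
}
```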
Additional context
This infrastructure is a natural follow-on to #1729 (tracking the memory used).
This API would likely be tricky to implement given there are several references to the schema in `ParquetMetaData` child fields (e.g. https://docs.rs/parquet/latest/src/parquet/file/metadata.rs.html#299).
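As a rough illustration of why, assuming the `schema_descr_ptr` accessors on both `FileMetaData` and `RowGroupMetaData` in current parquet-rs, a check like the following would have to hold after swapping in a shared schema:

```rust
use std::sync::Arc;

use parquet::file::metadata::ParquetMetaData;

/// Returns true only if the top-level schema pointer and every row group's
/// schema pointer are the same Arc -- the invariant a `with_schema`-style
/// API would need to restore after swapping in a shared schema.
fn schema_fully_shared(metadata: &ParquetMetaData) -> bool {
    let top = metadata.file_metadata().schema_descr_ptr();
    metadata
        .row_groups()
        .iter()
        .all(|rg| Arc::ptr_eq(&top, &rg.schema_descr_ptr()))
}
```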