Closed
Labels
EPIC (a larger project, actively underway, with sub tasks), enhancement (new feature or request)
Description
Is your feature request related to a problem or challenge?
@nuno-faria implemented the core Parquet Metadata caching logic in the following PR:
However, it is not enabled by default; you have to turn it on explicitly:
set datafusion.execution.parquet.cache_metadata = true;
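As a rough sketch of what opting in might look like in a datafusion-cli session: the table name `hits` and the path `data/hits.parquet` are hypothetical, and the option is assumed to exist in the release being used (its removal is tracked in #17047).

```sql
-- Opt in to the Parquet metadata cache for this session
-- (the option is off by default at the time of this issue).
SET datafusion.execution.parquet.cache_metadata = true;

-- Hypothetical table name and file path, for illustration only.
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'data/hits.parquet';

-- Repeated queries against the same file are the case the cache targets:
-- the second query should be able to reuse the metadata read by the first.
SELECT COUNT(*) FROM hits;
SELECT COUNT(*) FROM hits;
```

Making the cache the default would remove the need for the first statement.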
Describe the solution you'd like
Turn the metadata cache on by default. Related tasks:
- Cache Parquet Metadata #15582
- [Parquet Metadata Cache]: Limit memory used #17001
- [Parquet Metadata Cache] Use the cached metadata for ListingTable statistics #17002
- [Parquet Metadata Cache] remove datafusion.execution.parquet.cache_metadata config #17047
- [Parquet Metadata Cache] Document the parquet metadata cache #17048
- [Parquet Metadata Cache] Add an API to review the contents of the Cache #17091
Describe alternatives you've considered
No response
Additional context
- Related issues (thanks to @nuno-faria's sleuthing)
- [EPIC] Improve the performance of ListingTable #9964
- Improve parquet ListingTable speed with parquet metadata (short clickbench queries) #11719
- parquet: Add support for user-provided metadata loaders #12592
- [DISCUSSION] Make it easier and faster to query remote files (S3, iceberg, etc) #13456
- Slowdown in ClickBench Q36-Q37 between DataFusion 43.0.0 and 44.0.0 #14481
- Make it easier to run TPCH queries with datafusion-cli #14608 (comment)
- Option to *always* enable page index for ParquetOpener #15179
- Reduce page metadata loading to only what is necessary for query execution in ParquetOpen #16200
- Improve performance of datafusion-cli when reading from remote storage #16365