[Parquet Metadata Cache] remove datafusion.execution.parquet.cache_metadata config #17047

@alamb

Description

Is your feature request related to a problem or challenge?

We now have a size-limited Parquet metadata cache for the built-in ListingTableProvider, thanks to @nuno-faria ❤️ in #17031.

As a result, two configuration options now control the caching behavior:

set datafusion.execution.parquet.cache_metadata = true;

And

set datafusion.runtime.file_metadata_cache_limit = 100M

Now that we have a cache limit, I think we should consider "always" trying to cache the parquet metadata

Describe the solution you'd like

I suggest we remove options.cache_metadata and always try to save the metadata (which will be a no-op if the cache is too small).
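To illustrate what "always try to save, no-op if the cache is too small" could look like, here is a minimal sketch in Python. It is not DataFusion's implementation: the `MetadataCache` class, its method names, and the least-recently-used eviction policy are all assumptions made for illustration.

```python
from collections import OrderedDict


class MetadataCache:
    """Toy size-limited cache. Hypothetical names; LRU eviction is an assumption."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0
        self._entries: "OrderedDict[str, bytes]" = OrderedDict()

    def put(self, path: str, metadata: bytes) -> bool:
        """Try to cache an entry; return False (a no-op) if it can never fit."""
        size = len(metadata)
        if size > self.limit_bytes:
            return False  # larger than the whole cache: skip silently
        if path in self._entries:
            # Replace an existing entry and release its bytes first.
            self.used_bytes -= len(self._entries.pop(path))
        # Evict least recently used entries until the new one fits.
        while self.used_bytes + size > self.limit_bytes:
            _, evicted = self._entries.popitem(last=False)
            self.used_bytes -= len(evicted)
        self._entries[path] = metadata
        self.used_bytes += size
        return True

    def get(self, path: str):
        """Return cached metadata, or None on a miss."""
        entry = self._entries.get(path)
        if entry is not None:
            self._entries.move_to_end(path)  # mark as recently used
        return entry


cache = MetadataCache(limit_bytes=10)
stored_small = cache.put("a.parquet", b"12345678")   # fits
stored_huge = cache.put("huge.parquet", b"x" * 11)   # exceeds the limit: no-op
stored_next = cache.put("b.parquet", b"1234")        # fits after evicting a.parquet
```

With this shape, callers never need a separate on/off flag: the size limit alone decides whether an entry is kept, and a limit of zero effectively disables caching for any non-empty metadata.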

As @nuno-faria says in a comment on #17031:

I think caching by default would be good. The only situation where it wouldn't help would be one-time scans of parquet files that do not require the page index, but for large files the scan should largely outweigh the page index retrieval anyway.

Especially if we limit the memory used to 50 or 100 MB, and people can opt out by turning off the cache, I think that would be the best "out of the box" experience for most users.

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)
