Provide a way to enable source level statistics for tables registered in the CLI #3774

Open
isidentical opened this issue Oct 9, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@isidentical
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
#1347 enabled collection of statistics by default in the ListingOptions constructor, but tables created with CREATE EXTERNAL TABLE still can't use this feature because they are created manually.
https://github.com/apache/arrow-datafusion/blob/e54110fb592e03704da5f6ebd832b8fe1c51123b/datafusion/core/src/execution/context.rs#L486-L488

Describe the solution you'd like
We already have per-file-extension listing option implementations for the read_ DataFrame APIs (e.g. CsvReadOptions, ParquetReadOptions), and they have sane defaults (e.g. collect_stats is false for CSV and true for Parquet). I wonder whether we can just use them here and obtain the ListingOptions directly from them.
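
To make the idea concrete, here is a rough sketch of what "obtain the ListingOptions from the read options" could look like. Everything in it is a simplified, hypothetical stand-in written for this comment: the structs only borrow the names ListingOptions, CsvReadOptions and ParquetReadOptions, and the `ReadOptions` trait, the `to_listing_options` method and the `listing_options_for` helper are assumptions, not DataFusion's actual API.

```rust
// Hypothetical stand-ins for illustration only; NOT DataFusion's real types.

/// Simplified listing options with just the fields relevant to this issue.
#[derive(Debug, Clone, PartialEq)]
pub struct ListingOptions {
    pub file_extension: String,
    pub collect_stats: bool,
}

/// Assumed trait: each per-format options type can produce listing options.
pub trait ReadOptions {
    fn to_listing_options(&self) -> ListingOptions;
}

pub struct CsvReadOptions {
    pub file_extension: String,
}

impl Default for CsvReadOptions {
    fn default() -> Self {
        Self { file_extension: ".csv".to_string() }
    }
}

impl ReadOptions for CsvReadOptions {
    fn to_listing_options(&self) -> ListingOptions {
        ListingOptions {
            file_extension: self.file_extension.clone(),
            // CSV files carry no embedded statistics, so the default is off.
            collect_stats: false,
        }
    }
}

pub struct ParquetReadOptions {
    pub file_extension: String,
}

impl Default for ParquetReadOptions {
    fn default() -> Self {
        Self { file_extension: ".parquet".to_string() }
    }
}

impl ReadOptions for ParquetReadOptions {
    fn to_listing_options(&self) -> ListingOptions {
        ListingOptions {
            file_extension: self.file_extension.clone(),
            // Parquet footers contain statistics, so the default is on.
            collect_stats: true,
        }
    }
}

/// Hypothetical helper for the CREATE EXTERNAL TABLE path: pick the
/// per-format defaults instead of building the options by hand.
pub fn listing_options_for(file_type: &str) -> Option<ListingOptions> {
    match file_type.to_ascii_uppercase().as_str() {
        "CSV" => Some(CsvReadOptions::default().to_listing_options()),
        "PARQUET" => Some(ParquetReadOptions::default().to_listing_options()),
        _ => None,
    }
}
```

With something along these lines, the CREATE EXTERNAL TABLE path would inherit whatever per-format defaults the read_ APIs already use, rather than hard-coding the flag in one place.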

Describe alternatives you've considered
Leaving it as is, or enabling statistics globally (instead of refactoring that part to use ReadOptions) by just setting the flag to true.

@Dandandan
Contributor

I wonder if it can ever be enabled by default for Parquet datasets.
The downside for Parquet is that when using remote object storage, collecting statistics takes quite a bit of IO, which slows down simple queries.

I guess at some point we have to switch to testing with Delta Lake or Apache Iceberg :)
