Clarify documentation about gathering statistics for parquet files #16157
davisp
left a comment
Apologies for reviewing from mobile and not linking directly, but in ddl.md I think that should be collect_statistics, not show_statistics.
docs/source/user-guide/sql/ddl.md
Outdated
| : By default, when a table is created, DataFusion will _NOT_ read the files
| to gather statistics, which can be expensive but can accelerate subsequent
| queries substantially. If you want to gather statistics
| when creating a table, set the `datafusion.explain.show_statistics`
datafusion.explain.collect_statistics?
comphead
left a comment
lgtm thanks @alamb
agree with @xudong963 that you probably mean another param to set for collecting stats?
in this PR
/// When set to true, the explain statement will print operator statistics
/// for physical plans
pub show_statistics: bool, default = false
was used, but I don't see any EXPLAIN statements
Co-authored-by: Oleks V <comphead@users.noreply.github.com>
| LOCATION '/mnt/nyctaxi/tripdata.parquet';
| ```
|
| :::{note}
Here is an example of what this looks like rendered
TIL
xudong963
left a comment
Thank you!

Which issue does this PR close?
SessionContext::register_parquet and CREATE EXTERNAL TABLE #15908 from @davisp
Rationale for this change
As noted by @davisp, it was not clear that statistics are not collected by default for ListingTables, which has a potentially substantial negative impact on performance. Let's at least document this.
What changes are included in this PR?
Document when statistics are (not) collected and add notes about how to enable them
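For illustration, here is a minimal sketch of enabling statistics collection before creating a table. It assumes the option the reviewers are pointing to is `datafusion.execution.collect_statistics` (the `execution` namespace is an assumption, not something stated in this thread), and that the setting has to be changed before the table is created. The table name `taxi` is hypothetical; the path is the one from the ddl.md excerpt.

```sql
-- Assumption: the option is named datafusion.execution.collect_statistics.
-- Change it before creating the table so the files are read for statistics.
SET datafusion.execution.collect_statistics = true;

-- Hypothetical table name; the location is taken from the ddl.md example.
CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```

Gathering statistics makes table creation slower because the files have to be read up front, which is the trade-off the documentation note describes.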
Are these changes tested?
Yes by CI
Are there any user-facing changes?
Docs only