Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented May 22, 2025

Which issue does this PR close?

Rationale for this change

As noted by @davisp it was not clear that statistics are not collected by default for ListingTables which has a potentially substantial negative impact on performance. Let's at least document this

What changes are included in this PR?

Document when statistics are (not) collected and add notes about how to enable them

Are these changes tested?

Yes by CI

Are there any user-facing changes?

Docs only

@alamb alamb changed the title Alamb/clarify statistics docs Clarify documentation about gathering statistics for parquet files May 22, 2025
@alamb alamb marked this pull request as ready for review May 22, 2025 19:35
@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate labels May 22, 2025
Copy link
Member

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for reviewing from mobile and not linking directly, but in ddl.md I think that should be collect_statistics, not show_statistics.

: By default, when a table is created, DataFusion will _NOT_ read the files
to gather statistics, which can be expensive but can accelerate subsequent
queries substantially. If you want to gather statistics
when creating a table, set the `datafusion.explain.show_statistics`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion.explain.collect_statistics?

Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks @alamb
agree with @xudong963 that you probably mean another param to set for collecting stats?

in this PR

        /// When set to true, the explain statement will print operator statistics
        /// for physical plans
        pub show_statistics: bool, default = false

was used but I dont see EXPLAIN stmts

Co-authored-by: Oleks V <comphead@users.noreply.github.com>
LOCATION '/mnt/nyctaxi/tripdata.parquet';
```

:::{note}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example of what this looks like rendered

Screenshot 2025-05-22 at 3 22 13 PM

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is an example of what this looks like rendered

TIL

Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@xudong963 xudong963 merged commit 33a2531 into apache:main May 28, 2025
28 checks passed
@alamb alamb deleted the alamb/clarify_statistics_docs branch May 28, 2025 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants