return absent stats when filters are pushed down #12471

Merged

waruto210
Contributor

@waruto210 waruto210 commented Sep 15, 2024

Which issue does this PR close?

Closes #12416.

Rationale for this change

Fix the bug mentioned in #12416.

What changes are included in this PR?

Unlike what was mentioned in #12416, I chose to return absent stats, because the selectivity of the pushed-down filters is hard to know. In addition, filters that can be resolved using only partition columns do not need to be pushed down to the TableScanExec; pushing them down would only produce a useless, unhandled pruning predicate.
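
A minimal sketch of the two decisions described above, using hypothetical stand-in types (`Precision`, `Stats`, `Scan`) and a hypothetical helper (`resolved_by_partition_cols`) rather than DataFusion's actual structs or APIs. It only illustrates the idea: report absent statistics whenever a pruning predicate exists, and treat a filter as partition-only when every column it references is a partition column.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a statistic that may be exact, estimated, or unknown.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

/// Hypothetical, heavily simplified statistics record.
#[derive(Debug, Clone, PartialEq)]
struct Stats {
    num_rows: Precision,
    total_byte_size: Precision,
}

impl Stats {
    /// Statistics where nothing is known.
    fn unknown() -> Self {
        Stats {
            num_rows: Precision::Absent,
            total_byte_size: Precision::Absent,
        }
    }
}

/// Hypothetical scan node; predicates are plain strings here for brevity.
struct Scan {
    file_stats: Stats,
    pruning_predicate: Option<String>,
    page_pruning_predicate: Option<String>,
}

impl Scan {
    /// File-level statistics describe the unfiltered data. Once any
    /// predicate is pushed down, the selectivity is unknown, so this
    /// sketch reports absent statistics instead of the file-level ones.
    fn statistics(&self) -> Stats {
        if self.pruning_predicate.is_some() || self.page_pruning_predicate.is_some() {
            Stats::unknown()
        } else {
            self.file_stats.clone()
        }
    }
}

/// Returns true when every column referenced by a filter is a partition
/// column, meaning the filter can be resolved while listing files and
/// does not need to reach the scan. A filter with no column references
/// is conservatively treated as not partition-only.
fn resolved_by_partition_cols(filter_cols: &[&str], partition_cols: &HashSet<&str>) -> bool {
    !filter_cols.is_empty() && filter_cols.iter().all(|c| partition_cols.contains(c))
}
```

With partition columns `year` and `month`, a filter that references only `year` would be resolved during file listing, while a filter that references `year` and a data column would still be pushed down, causing the scan to report absent stats.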

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added physical-expr Physical Expressions core Core DataFusion crate labels Sep 15, 2024
@waruto210 waruto210 force-pushed the return_inexact_stats_with_filter_pushdown branch 2 times, most recently from 90cd6b5 to 20c74b9 Compare September 15, 2024 18:48
@waruto210 waruto210 changed the title return inexact stats when filters are pushed down return absent stats when filters are pushed down Sep 15, 2024
Contributor

@alamb alamb left a comment

Thank you @waruto210 -- I think this is a great start

I left some comments -- let us know what you think

// When filters are pushed down, we have no way of knowing the exact statistics.
// Note that pruning predicate is also a kind of filter pushdown.
let stats = if self.pruning_predicate.is_some()
    || self.page_pruning_predicate.is_some()
Contributor

I think bloom filters should also belong in this list 🤔

Contributor Author

Bloom filters are used by pruning_predicate.

Contributor

Makes sense, maybe we can update the comments to help future readers. I left a suggestion

datafusion/core/tests/sql/path_partition.rs (outdated review thread, resolved)
@waruto210 waruto210 force-pushed the return_inexact_stats_with_filter_pushdown branch from 20c74b9 to a7606b4 Compare September 18, 2024 08:00
@waruto210
Contributor Author

@alamb PTAL

Contributor

@alamb alamb left a comment

Thanks @waruto210 this looks good to me now

I do think it is worth considering more explicit test coverage for partition column pushdown, but I don't think it is required.

Thanks again for the contribution

@alamb
Contributor

alamb commented Sep 23, 2024

Thank you again @waruto210

@alamb alamb merged commit 30d4368 into apache:main Sep 23, 2024
24 checks passed
@waruto210 waruto210 deleted the return_inexact_stats_with_filter_pushdown branch September 24, 2024 06:36
bgjackma pushed a commit to bgjackma/datafusion that referenced this pull request Sep 25, 2024
* do not pushdown filters that can be resolved only using partition cols
return absent stats when filters are pushed down

* fix and add test

* Update datafusion/core/src/datasource/physical_plan/parquet/mod.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* add test for partition pruning filters

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Labels
core Core DataFusion crate physical-expr Physical Expressions
Development

Successfully merging this pull request may close these issues.

TableScanExec return exact stats when it contain's filters
3 participants