TableScanExec returns exact stats when it contains filters #12416
Comments
@alamb PTAL
BTW I would love to learn more about this project, if you can make anything public
I think your description and proposed solution makes lots of sense to me. Thank you @waruto210 -- I am surprised we haven't hit it before 🤔
We're building a log query platform utilizing some components from datafusion. We'll keep the community posted on issues we find, and we'll also submit code back to datafusion if we have any work that can be put into datafusion repo as a generic component.
Describe the bug
I'm working on a project based on datafusion's `ListingTable` and `ParquetExec`. I've made some modifications to enable exact filter pushdown for parquet tables.
When I execute a statement like `select count(*) from table where Age > 10 limit 10`, I noticed that in the physical plan, `TableScanExec` is replaced with a placeholder, causing the query to directly return the total number of rows in the table.
Eventually, I found that `AggregateStatistics` was optimizing the query plan using stats, and `ParquetExec::statistics()` was returning stats as follows: `Rows=Exact(390616), Bytes=Absent, [(Col[0]: Min=Exact(Int16(0)) Max=Exact(Int16(55)) Null=Exact(0))]`.
However, `ParquetExec` contains some filters, so the stats should be inexact or absent.
Currently, datafusion's `ListingTable` supports inexact filter pushdown, so there would be a `FilterExec` outside the `TableScanExec`, which prevents incorrect optimization by `AggregateStatistics`. But, since filters can still exist within `ParquetExec`, returning exact stats is semantically incorrect.
To Reproduce
Use the following code and print stats in `ParquetExec::statistics()`
You can find that `Rows` in stats is `Exact(n)`, but `n` may exceed the actual number of rows returned by `ParquetExec`, which is semantically incorrect.
Expected behavior
I suggest adding the following code to the `statistics` method of some `TableScanExec` implementations
Additional context
No response
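The snippets referenced in this issue are not reproduced on this page. As a rough illustration only, here is a minimal, self-contained Rust sketch of the behavior described above. It does not use the DataFusion API: `Precision`, `ScanSketch`, `statistics_before`, `statistics_after` and `count_star_from_stats` are hypothetical stand-ins, and the shortcut function is a simplified model of what a stats-based rule such as `AggregateStatistics` might do, assuming it only trusts exact row counts.

```rust
/// Hypothetical, simplified stand-in for a "precision-tagged" row count.
/// This is NOT the DataFusion type, only a model of it for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Demote an exact value to an inexact one; other variants are unchanged.
    pub fn to_inexact(self) -> Precision {
        match self {
            Precision::Exact(n) => Precision::Inexact(n),
            other => other,
        }
    }
}

/// Hypothetical scan node: a row count taken from file metadata, plus a flag
/// saying whether a filter is evaluated inside the scan itself.
pub struct ScanSketch {
    pub file_num_rows: Precision,
    pub has_pushed_down_filter: bool,
}

impl ScanSketch {
    /// Behavior described in the issue: metadata stats are reported as-is,
    /// even though the filter may drop rows.
    pub fn statistics_before(&self) -> Precision {
        self.file_num_rows
    }

    /// Suggested behavior (sketch): with a filter inside the scan, the
    /// metadata row count is only an upper bound, so report it as inexact.
    pub fn statistics_after(&self) -> Precision {
        if self.has_pushed_down_filter {
            self.file_num_rows.to_inexact()
        } else {
            self.file_num_rows
        }
    }
}

/// Simplified model of a stats-based shortcut: `count(*)` is answered from
/// statistics only when the row count is exact; otherwise the scan must run.
pub fn count_star_from_stats(num_rows: Precision) -> Option<usize> {
    match num_rows {
        Precision::Exact(n) => Some(n),
        Precision::Inexact(_) | Precision::Absent => None,
    }
}
```

With `file_num_rows` set to `Exact(390616)` and a pushed-down filter, `statistics_before` lets the shortcut return the whole-table count, while `statistics_after` yields an inexact value, so the shortcut declines and the filtered scan has to be executed.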