Representing statistics in logical or physical plans (or in both of them) #4003
Comments
Thank you for this discussion @isidentical - the summary is quite nice. There are some cases where the physical plan will have better information available to it than the logical plan (e.g. it may have read the parquet metadata footer and thus have much more accurate statistics than can be derived from the file names alone). Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take best advantage of whatever information is available. My preferred solution is to keep statistics in both places, and to keep the code that operates on them (expression range analysis code (e.g. #3912), etc.) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).
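For concreteness, here is a minimal sketch of what a statistics representation shared by both plan levels (option 3) could look like. The field names are modeled loosely on DataFusion's `datafusion::physical_plan::Statistics` of the time, but the `StatisticsProvider` trait and the simplified min/max types are hypothetical, invented for this illustration:

```rust
/// Per-column statistics; every item is `Option` because it may be unknown.
#[derive(Debug, Clone, Default)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
    // In DataFusion these would be `ScalarValue`s; strings stand in here
    // to keep the sketch self-contained.
    pub min_value: Option<String>,
    pub max_value: Option<String>,
}

/// Plan-level statistics. `is_exact` distinguishes measured numbers
/// (e.g. read from parquet footers) from estimates propagated by rules.
#[derive(Debug, Clone, Default)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub column_statistics: Option<Vec<ColumnStatistics>>,
    pub is_exact: bool,
}

/// Hypothetical trait that both logical and physical plan nodes could
/// implement, so optimizer rules at either level consume one representation.
pub trait StatisticsProvider {
    fn statistics(&self) -> Statistics;
}
```

With a single shared type, a logical-plan rule and a physical-plan rule can consume the same numbers, while `is_exact` lets the physical side mark statistics it actually measured rather than estimated.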
Another benefit of keeping the CBOs at the physical level is that the costs can differ between implementations: e.g. hash join and sort merge join have different costs associated with them. For a hash join, swapping the order of the build and probe sides has a big impact, but a sort merge join (or another join type) might not benefit from this change (or at least would benefit in very different ways). If we moved the optimization to logical plans, we would need a way to expose this to the logical plan too.
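As an illustration of why this decision is inherently physical, here is a small sketch of a build/probe swap check for a hash join. The names and the heuristic are invented for this example, not DataFusion code; a sort merge join would simply never ask this question:

```rust
#[derive(Debug, Clone, Copy)]
struct SideStats {
    rows: usize,
    bytes: usize,
}

/// Returns true if swapping sides looks beneficial, i.e. the current
/// build side is estimated to be larger than the probe side. Hashing
/// the smaller input and streaming the larger one is the usual win.
fn should_swap_hash_join(build: SideStats, probe: SideStats) -> bool {
    // Prefer byte size when available; break ties on row counts.
    build.bytes > probe.bytes || (build.bytes == probe.bytes && build.rows > probe.rows)
}

fn main() {
    let build = SideStats { rows: 1_000_000, bytes: 64 << 20 };
    let probe = SideStats { rows: 10_000, bytes: 1 << 20 };
    // With these estimates the sides should be swapped.
    assert!(should_swap_hash_join(build, probe));
}
```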
Costs might be different for different implementations, like hash join vs. sort merge join, but the stats (output rows, size, min/max, ...) are the same. Even for different join orderings, [A inner join B inner join C] vs. [C inner join B inner join A], the stats are exactly the same.
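This symmetry shows up directly in the textbook equi-join cardinality estimate, |A ⋈ B| ≈ |A| · |B| / max(ndv(A.key), ndv(B.key)), which does not depend on which side comes first. A tiny sketch (illustrative only, not DataFusion's estimator):

```rust
/// Textbook equi-join cardinality estimate: the formula is symmetric
/// in its inputs, so the estimated stats are the same for A ⋈ B and B ⋈ A.
fn estimated_join_rows(rows_a: f64, rows_b: f64, ndv_a: f64, ndv_b: f64) -> f64 {
    (rows_a * rows_b) / ndv_a.max(ndv_b)
}

fn main() {
    let ab = estimated_join_rows(1_000_000.0, 50_000.0, 10_000.0, 5_000.0);
    let ba = estimated_join_rows(50_000.0, 1_000_000.0, 5_000.0, 10_000.0);
    assert_eq!(ab, ba); // same estimate regardless of join order
    println!("estimated output rows: {ab}");
}
```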
In the future, we can enhance DataFusion to provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then, in the logical planning phase, we can also get better column-level stats. And I think inferring stats directly from parquet file footers is not scalable: for example, if we hit a large table (>10k files), inferring stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.
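A sketch of the idea (the types and names are hypothetical, not a DataFusion API): collect per-file stats once in an offline ANALYZE-style pass, then answer planning-time questions from an in-memory cache instead of touching 10k+ footers per query:

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, Default)]
struct FileStats {
    rows: usize,
    bytes: usize,
}

#[derive(Default)]
struct StatsCache {
    per_file: HashMap<String, FileStats>,
}

impl StatsCache {
    /// Record stats for one file, e.g. parsed from its parquet footer
    /// during an offline ANALYZE pass rather than at query time.
    fn insert(&mut self, path: &str, stats: FileStats) {
        self.per_file.insert(path.to_string(), stats);
    }

    /// Planning-time lookup: a fold over in-memory summaries,
    /// with no I/O against the underlying files.
    fn table_stats(&self) -> FileStats {
        self.per_file.values().fold(FileStats::default(), |acc, s| FileStats {
            rows: acc.rows + s.rows,
            bytes: acc.bytes + s.bytes,
        })
    }
}

fn main() {
    let mut cache = StatsCache::default();
    cache.insert("part-00000.parquet", FileStats { rows: 1_000, bytes: 1 << 20 });
    cache.insert("part-00001.parquet", FileStats { rows: 2_000, bytes: 2 << 20 });
    let total = cache.table_stats();
    println!("rows={}, bytes={}", total.rows, total.bytes);
}
```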
Yes I agree -- and in general I don't think DataFusion should be handling the decision of "should the metadata be cached in some local catalog" -- I think that decision should be moved into the …
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There seem to be a few use cases that could be unlocked by having statistics available in the logical plan.
Describe the solution you'd like
There were 3 options proposed in the discussion: representing statistics only in the logical plan, only in the physical plan, or in both of them. As the comments above note, option 3 (both) is the preferred direction.
Describe alternatives you've considered
Keep them as they are and do nothing else.
Additional context
This is a spin-off from the cost calculations/estimations in #3929 (also related to #3983 and #3984). The original discussion, with @mingmwang, @isidentical, @Dandandan, @jackwener, @andygrove and @alamb, can be found here. It also includes a lot of material on what other query engine / database systems are doing, so I recommend reading it; this issue is just a summary to continue the discussion in a more structured/public place.