Representing statistics in logical or physical plans (or in both of them) #4003

Open
isidentical opened this issue Oct 28, 2022 · 5 comments
Labels
enhancement New feature or request

Comments

isidentical commented Oct 28, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There are a few use cases that would be unlocked by having statistics available in the logical plan:

  • Helping projects (e.g. Dask SQL) that only use logical plans to have a way to access statistics, so that they can leverage them directly without needing DataFusion's physical plan (by @andygrove); a sketch of what this could look like follows this list.
  • Choosing optimal physical plans at the planning stage, instead of during the physical plan optimization stage (by @mingmwang).
  • Easier join analysis when dealing with join trees (in terms of representation) and easier nested join reordering (in terms of the ease of doing complex rewrites).
  • A clear separation between logical cost (cost-based optimization independent of the specifics of any physical operator implementation, e.g. reducing join sizes with reordering) vs. physical cost (something that only relates to the implementation of our physical operators, e.g. hash join probe switching).
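
As a rough illustration of the first point, here is a minimal sketch of what per-node statistics exposed from a logical plan could look like. The type and trait names below are hypothetical, not DataFusion's actual API:

```rust
/// Hypothetical per-column statistics (not DataFusion's actual types).
#[derive(Debug, Clone, Default)]
pub struct ColumnStatistics {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
    // min/max values could be tracked here as well.
}

/// Hypothetical per-node statistics, mirroring the kind of information the
/// physical plan already carries (row count, byte size, per-column stats).
#[derive(Debug, Clone, Default)]
pub struct PlanStatistics {
    pub num_rows: Option<usize>,
    pub total_byte_size: Option<usize>,
    pub column_statistics: Vec<ColumnStatistics>,
    /// Whether the values above are exact or estimates.
    pub is_exact: bool,
}

/// If logical plan nodes exposed something like this, consumers that never
/// build a DataFusion physical plan (e.g. Dask SQL) could still read the
/// estimates and make their own cost-based decisions on top of them.
pub trait LogicalStatistics {
    fn statistics(&self) -> Option<PlanStatistics>;
}
```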

Describe the solution you'd like
There were 3 options proposed in the discussion:

  • Represent statistics solely in the logical planning stage (effectively a revert of Moving cost based optimizations to physical planning #962).
  • Keep them as they are (in the physical plan) and think about a more generic way to convert between logical <-> physical plans (this probably wouldn't directly help Dask SQL, but might have other use cases).
  • Represent them in both places, with utility toolkits to do generic join/filter selectivity analysis as well as other fundamental cost estimations (a sketch of such a shared toolkit follows this list).
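
For the third option, the shared toolkit could start as a handful of free functions that both the logical and the physical optimizers call. Below is a minimal sketch of a textbook equi-join cardinality estimate, using plain `Option<usize>` inputs rather than any real DataFusion statistics type:

```rust
/// Textbook equi-join cardinality estimate:
/// |A join B on A.k = B.k| ~= |A| * |B| / max(ndv(A.k), ndv(B.k)).
/// Returns None when any of the required inputs is unknown.
fn estimate_inner_join_rows(
    left_rows: Option<usize>,
    right_rows: Option<usize>,
    left_key_distinct: Option<usize>,
    right_key_distinct: Option<usize>,
) -> Option<usize> {
    let (l, r) = (left_rows?, right_rows?);
    // The larger distinct count of the two join keys is the usual denominator.
    let ndv = left_key_distinct?.max(right_key_distinct?).max(1);
    Some(l.saturating_mul(r) / ndv)
}
```

Because the estimate depends only on input statistics (and, as noted further down in the thread, is the same regardless of join implementation or ordering), the same function could back both a logical join-reordering rule and a physical optimizer rule.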

Describe alternatives you've considered
Keep them as they are and do nothing else.

Additional context
This is a spin-off from the cost calculations/estimations in #3929 (also related to #3983 and #3984). The original discussion, involving @mingmwang, @isidentical, @Dandandan, @jackwener, @andygrove and @alamb, can be found here. It also includes a lot of material regarding what other query engine / database systems are doing, so it is recommended reading; this issue is just a summary to continue the discussion in a more structured/public place.

isidentical added the enhancement (New feature or request) label on Oct 28, 2022
alamb commented Oct 28, 2022

Thank you for this discussion @isidentical - the summary is quite nice

There are some cases where the physical plan will have better information available to it (e.g. it may have read the parquet metadata header and have much more accurate statistics than just the file names) than the logical plan.

Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take best advantage of what information is available

My preferred solution is to keep statistics in both places, and keep the code that operates on them (e.g. the expression range analysis code in #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).
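
For illustration, here is a minimal sketch of the kind of range-based estimate such code performs, assuming uniformly distributed values and ignoring nulls (exactly the details that tie the real code to the physical expressions); the function name is hypothetical:

```rust
/// Estimate the selectivity of `col > literal` from the column's min/max
/// statistics, assuming values are uniformly distributed over [min, max].
/// This deliberately ignores nulls and data skew.
fn greater_than_selectivity(min: f64, max: f64, literal: f64) -> f64 {
    if max <= min {
        // Degenerate or unknown range: conservatively assume everything passes.
        return 1.0;
    }
    // Fraction of the [min, max] interval that lies above the literal.
    ((max - literal) / (max - min)).clamp(0.0, 1.0)
}

fn main() {
    // For a column with min = 0 and max = 100, `col > 75` is estimated to
    // keep roughly 25% of the rows.
    assert_eq!(greater_than_selectivity(0.0, 100.0, 75.0), 0.25);
}
```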

@Dandandan

Another benefit of keeping the CBOs at the physical level is that the costs might also differ between implementations. E.g. hash join and sort merge join have different costs associated with them: for a hash join, swapping the order of the build and probe side has a big impact, but a sort merge join (or other types of join) might not benefit from this change, or at least benefits in very different ways.

If we moved the optimization to the logical plan, we would need a way to expose this information to the logical plan too.
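
To make the hash-join-specific part concrete, here is a minimal sketch of the build/probe decision. The types and the byte-size estimates are hypothetical inputs, not DataFusion's actual optimizer rule:

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum JoinSide {
    Left,
    Right,
}

/// For a hash join we generally want to build the hash table on the smaller
/// input and probe with the larger one. A sort merge join has no equivalent
/// knob, which is why this particular cost only exists at the physical level.
fn choose_build_side(left_bytes: Option<usize>, right_bytes: Option<usize>) -> JoinSide {
    match (left_bytes, right_bytes) {
        // Swap only when we know the right side is smaller than the left.
        (Some(l), Some(r)) if r < l => JoinSide::Right,
        // Without statistics, keep the default build side.
        _ => JoinSide::Left,
    }
}
```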

@mingmwang

Costs might be different for different implementations like hash join vs. sort merge join, but the stats (output rows, size, min/max, ...) are the same. Even for different join orderings, [A inner join B inner join C] vs. [C inner join B inner join A], the stats are exactly the same.

@mingmwang

> Thank you for this discussion @isidentical - the summary is quite nice
>
> There are some cases where the physical plan will have better information available to it (e.g. it may have read the parquet metadata header and have much more accurate statistics than just the file names) than the logical plan.
>
> Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take best advantage of what information is available
>
> My preferred solution is to keep statistics in both places, and keep the code that operates on them (e.g. the expression range analysis code in #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).

In the future, we can enhance DataFusion and provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then in the logical planning phase we can also get better column-level stats. And I think inferring stats directly from parquet file headers is not scalable: for example, if we hit a large table (>10k files), inferring the stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.

alamb commented Nov 3, 2022

> In the future, we can enhance DataFusion and provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then in the logical planning phase we can also get better column-level stats. And I think inferring stats directly from parquet file headers is not scalable: for example, if we hit a large table (>10k files), inferring the stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.

Yes I agree -- and in general I don't think DataFusion should be handling the decision of "should the metadata be cached in some local catalog" -- I think that decision should be moved into the TableProvider / CatalogProvider implementations so that each system that uses DataFusion can make the optimal choice for it
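
As a rough sketch of what pushing that decision into the provider could look like, with hypothetical trait and type names rather than DataFusion's actual TableProvider / CatalogProvider API: the provider computes (or loads pre-analyzed) statistics once and hands the cached copy to the planner on every request.

```rust
use std::sync::OnceLock;

/// Hypothetical, simplified table-level statistics.
#[derive(Debug, Clone, Default)]
struct TableStatistics {
    num_rows: Option<usize>,
    total_byte_size: Option<usize>,
}

/// Hypothetical hook the planner would call; in DataFusion the equivalent
/// would live on the TableProvider / CatalogProvider implementations, which
/// get to decide whether and how to cache the metadata.
trait StatisticsProvider {
    fn statistics(&self) -> Option<TableStatistics>;
}

/// A provider that does the expensive work (reading parquet footers for a
/// 10k-file table, or consulting an external ANALYZE-style catalog) at most
/// once, so planning does not pay that cost on every query.
struct CachedStatsProvider {
    cached: OnceLock<TableStatistics>,
}

impl CachedStatsProvider {
    fn new() -> Self {
        Self { cached: OnceLock::new() }
    }

    fn load_stats(&self) -> TableStatistics {
        // Placeholder for footer scanning / catalog lookup.
        TableStatistics { num_rows: Some(1_000_000), total_byte_size: None }
    }
}

impl StatisticsProvider for CachedStatsProvider {
    fn statistics(&self) -> Option<TableStatistics> {
        Some(self.cached.get_or_init(|| self.load_stats()).clone())
    }
}
```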
