
Apply guarantee rewriter to sql workflow #10456

Open
jayzhan211 opened this issue May 11, 2024 · 17 comments
Labels
enhancement New feature or request

Comments

@jayzhan211
Contributor

jayzhan211 commented May 11, 2024

Is your feature request related to a problem or challenge?

While deprecating Expr::GetIndexedField, I found there are many test cases that are not covered in sqllogictest, for example test_inequalities_non_null_bounded. Since we hope to replace the field API with get_field, we could either move the tests to datafusion/core/tests or to sqllogictest. I prefer the latter, but then I found that the guarantee rewrite is not applied to the SQL workflow.

statement ok
create table t (c int) as values (1), (3), (5);

query TT
explain select struct(c) from t where c between 3 and 1;
----
logical_plan
01)Projection: struct(t.c)
02)--Filter: t.c >= Int32(3) AND t.c <= Int32(1)
03)----TableScan: t projection=[c]
physical_plan
01)ProjectionExec: expr=[struct(c@0) as struct(t.c)]
02)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
03)----CoalesceBatchesExec: target_batch_size=8192
04)------FilterExec: c@0 >= 3 AND c@0 <= 1
05)--------MemoryExec: partitions=1, partition_sizes=[1]

statement ok
drop table t;

I expect that the FilterExec should be removed or converted to something like `false`, since the condition here is always false.

Describe the solution you'd like

Apply guarantee_rewriter to the SQL workflow.
Including the simplification logic in Simplifier would be a plus.
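
For reference, #7467 already added guarantee support at the Expr level via ExprSimplifier::with_guarantees, so the building block exists; what is missing is wiring it into the SQL planning/optimization path. A minimal sketch of how that existing API is used (module paths and constructor names are from memory and may differ between DataFusion versions):

```rust
use arrow::datatypes::{DataType, Field, Schema};
use datafusion_common::{Result, ToDFSchema};
use datafusion_expr::execution_props::ExecutionProps;
use datafusion_expr::interval_arithmetic::{Interval, NullableInterval};
use datafusion_expr::simplify::SimplifyContext;
use datafusion_expr::{col, lit};
use datafusion_optimizer::simplify_expressions::ExprSimplifier;

fn main() -> Result<()> {
    // Schema for a table with a single Int32 column `c`.
    let schema =
        Schema::new(vec![Field::new("c", DataType::Int32, false)]).to_dfschema_ref()?;

    let props = ExecutionProps::new();
    let simplifier = ExprSimplifier::new(SimplifyContext::new(&props).with_schema(schema));

    // External guarantee: `c` is non-null and always within [1, 5]
    // (e.g. something that could be derived from table statistics).
    let guarantees = vec![(
        col("c"),
        NullableInterval::NotNull {
            values: Interval::make(Some(1_i32), Some(5_i32))?,
        },
    )];

    // `c < 0` contradicts the guarantee, so it simplifies to `false`.
    let expr = col("c").lt(lit(0_i32));
    let simplified = simplifier.with_guarantees(guarantees).simplify(expr)?;
    assert_eq!(simplified, lit(false));
    Ok(())
}
```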

Describe alternatives you've considered

No response

Additional context

PR that introduced the guarantee rewrite: #7467


jayzhan211 added the enhancement (New feature or request) label May 11, 2024
@yyy1000
Contributor

yyy1000 commented May 12, 2024

Update: #10463 is not what this issue expects; I will see what to do next.
I can help with it. :) An experimental PR is #10463.

@yyy1000
Contributor

yyy1000 commented May 12, 2024

On second glance, I feel it's difficult. 😥
When simplifying a LogicalPlan, it seems impossible to get the underlying data that could provide guarantees.

@alamb
Contributor

alamb commented May 12, 2024

I think this may be another example of what @samuelcolvin was suggesting on #10400

I think we could use ExecutionPlan::statistics to get the guarantee information

@demetribu
Contributor

@jayzhan211
Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?

@jayzhan211
Contributor Author

@jayzhan211 Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?

I think so.

@alamb
Contributor

alamb commented May 14, 2024

old version of the guarantee?

What does "old version of the guarantee?" refer to?

@demetribu
Contributor

What does "old version of the guarantee?" refer to?

impl<'a> TreeNodeRewriter for GuaranteeRewriter<'a> {

@alamb
Contributor

alamb commented May 14, 2024

As I understand it, the use case for GuaranteeRewriter when @wjones127 (maybe?) added it was providing external information (outside of information that came from SQL). I don't think we should just remove the ability to do so.

I would personally recommend adding code that translates Statistics into Guarantees to pass to GuaranteeRewriter.

We could discuss reworking how GuaranteeRewriter works as a follow-on PR.
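
A hypothetical helper along those lines might look roughly like the sketch below. The function name is made up, only exact statistics are trusted, and the ColumnStatistics / Precision / NullableInterval paths are from memory and may differ between DataFusion versions:

```rust
use datafusion_common::stats::Precision;
use datafusion_common::{ColumnStatistics, Result};
use datafusion_expr::interval_arithmetic::{Interval, NullableInterval};
use datafusion_expr::{col, Expr};

/// Hypothetical helper: turn per-column min/max statistics into guarantees
/// that could be passed to `ExprSimplifier::with_guarantees`.
fn guarantees_from_stats(
    columns: &[(&str, ColumnStatistics)],
) -> Result<Vec<(Expr, NullableInterval)>> {
    let mut guarantees = Vec::new();
    for (name, stats) in columns {
        // Only exact bounds are safe to turn into guarantees; inexact or
        // absent statistics are skipped.
        if let (Precision::Exact(min), Precision::Exact(max)) =
            (&stats.min_value, &stats.max_value)
        {
            guarantees.push((
                col(*name),
                // The statistics say nothing definitive about nullability
                // here, so MaybeNull is the conservative choice.
                NullableInterval::MaybeNull {
                    values: Interval::try_new(min.clone(), max.clone())?,
                },
            ));
        }
    }
    Ok(guarantees)
}
```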

demetribu added a commit to demetribu/datafusion that referenced this issue May 14, 2024
demetribu added a commit to demetribu/datafusion that referenced this issue May 15, 2024
@jayzhan211
Contributor Author

@dmitrybugakov are you working on #10510?

@jayzhan211
Contributor Author

@alamb Is it reasonable to evaluate columns in ConstEvaluator and collect statistics for the guarantee rewriter, or should we avoid evaluation in the logical optimization step and compute it in the physical planner?

I'm thinking of passing the schema and batch to ConstEvaluator to evaluate columns and updating the statistics on each pass for the guarantee rewriter.

@alamb
Contributor

alamb commented Jun 10, 2024

@alamb Is it reasonable to evaluate columns in ConstEvaluator and collect statistics for the guarantee rewriter, or should we avoid evaluation in the logical optimization step and compute it in the physical planner?

I don't quite follow what you are proposing here.

As I understand the idea on this ticket, it is to add a pass that knows how to use Statistics to simplify expressions: create a Simplifier and pass in the min and max values via with_guarantees.

The challenges I see are:

  1. Statistics are not available in the LogicalPlan, only in the ExecutionPlan via ExecutionPlan::statistics
  2. The ExprSimplifier::simplify API is in terms of Exprs (not PhysicalExprs)

One potential thing you could do is use PruningPredicate for FilterExecs and try to prove the predicate can never be true. However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance)

@jayzhan211
Contributor Author

jayzhan211 commented Jun 11, 2024

One potential thing you could do is use PruningPredicate for FilterExecs and try to prove the predicate can never be true.

It seems quite similar to the comments in #10400.

However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance)

Maybe I should work on other issues 🤔

@alamb
Contributor

alamb commented Jun 11, 2024

Maybe I should work on other issues 🤔

Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days 🏃

@jayzhan211
Contributor Author

Maybe I should work on other issues 🤔

Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days 🏃

I think #8708 is about 80% complete. I'm exploring the next interesting topic.

@alamb
Contributor

alamb commented Jun 11, 2024

I think #8708 is about 80% complete. I'm exploring the next interesting topic.

Let me know if you would like help breaking down the work and filing some more follow-on tickets (to organize getting some additional community help).

Depending on the kind of project you are interested in, here are some ideas (unsolicited) that I would love to help review:

  1. API design / extraction: [Epic] Extract catalog functionality from the core to make it more modular #10782 with Catalog APIs
  2. Grouping performance -- either Improve performance for grouping by variable length columns (strings) #9403 or Improve Memory usage + performance with large numbers of groups / High Cardinality Aggregates #6937 (both are pretty tricky low level optimizations)

Rock on!

@jayzhan211
Contributor Author

Improving grouping performance seems interesting!

@alamb
Contributor

alamb commented Jun 13, 2024

Improving grouping performance seems interesting!

I think it would be awesome -- thank you. How would you like to proceed? I personally think both #9403 and #6937 are super valuable.

For either one, I think the key will be to do some sort of POC to make sure we can actually improve performance before polishing too much.

Looking forward to working with you more
