docs/design: add proposal docs for constraint propagation enhancement #7648
@bb7133 Great! Thanks!
Related to #7098

1. Find `column = constant` expression and substitue the constant for column, as well as try to fold the substituted constant expression if possible, for example:

Given `a = b and a = 2 and b = 3`, it becomes `2 = 3` after substitution and lead to a final `false` constant.
s/lead/leads
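
For readers following the thread, here is a minimal Go sketch of the substitute-and-fold step quoted above. It is not TiDB's `propagateConstantSolver`; the `term`, `eq`, `subst`, and `foldEqualities` names are hypothetical, and the sketch assumes integer constants and equality predicates only.

```go
package main

import "fmt"

// term is either a column reference or an integer constant (hypothetical type).
type term struct {
	col string // column name; empty means the term is a constant
	val int    // constant value, used only when col is empty
}

func (t term) isConst() bool { return t.col == "" }

// eq represents the predicate "l = r".
type eq struct{ l, r term }

// subst replaces a column with its known constant, if one has been recorded.
func subst(t term, consts map[string]int) term {
	if v, ok := consts[t.col]; ok && !t.isConst() {
		return term{val: v}
	}
	return t
}

// foldEqualities reports whether a conjunction of equalities can still hold
// after repeatedly substituting "column = constant" bindings into it.
func foldEqualities(conds []eq) bool {
	consts := map[string]int{}
	for changed := true; changed; {
		changed = false
		for _, c := range conds {
			l, r := subst(c.l, consts), subst(c.r, consts)
			switch {
			case l.isConst() && r.isConst():
				if l.val != r.val {
					return false // e.g. 2 = 3 folds to a constant false
				}
			case l.isConst():
				consts[r.col], changed = l.val, true
			case r.isConst():
				consts[l.col], changed = r.val, true
			}
		}
	}
	return true
}

func main() {
	a, b := term{col: "a"}, term{col: "b"}
	// a = b and a = 2 and b = 3
	conds := []eq{{a, b}, {a, term{val: 2}}, {b, term{val: 3}}}
	fmt.Println(foldEqualities(conds))
}
```

Each pass either records a new column binding or leaves the map unchanged, so the loop stops once no further substitution is possible.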
* `ast.GE('>=')`
* `ast.NE('!=')`

The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rule. For example, in predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source(TiKV), and thus reduce the amount of data in the whole data path.
s/reduce/reduces

NOTE: in this case, `t1.a in (12, 13)` works on the result of the outer join and we have pushed it down to the outer table.

But we can further push this filter down to the inner table, since only the the records satisfy `t2.a in (12, 13)` could make join predicate `t1.a = t2.a` be positive in the join operator. So we can optimize this query to:
s/satisfy/satisfied ?

## Rationale

Constraint propagation is commonly used as logcial plan optimization in traditional databases, for example, [this doc](https://dev.mysql.com/doc/internals/en/optimizer-constant-propagation.html) explained some details of constant propagtions in MySQL. It is is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al, those engines usually query on huge amount of data.
s/It is is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al, those engines usually query on huge amount of data./It is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, etc. Those engines usually query on huge amount of data.
Constraint propagation is commonly used as logical plan optimization in traditional databases. For example, this doc explains some details of constant propagations in MySQL. It is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al. Those engines usually query on a huge amount of data.

### Advantages:

Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints we can filter data as early as possible, and thus reduce disc/network IO and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level(TiKV) as a coprocessor task, and lead to the following benefits:
s/lead/leads or s/lead/can lead
Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints, we can filter data as early as possible, and thus reduce disc/network I/O and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level (TiKV) as a Coprocessor task, and lead to the following benefits:
Thanks for all the grammar tips above!

Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints we can filter data as early as possible, and thus reduce disc/network IO and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level(TiKV) as a coprocessor task, and lead to the following benefits:

* Apply the filters at each TiKV instance, which make the calculation distributed.
"Apply the filters at each TiKV instance, which make the calculation distributed."
This may not be correct: we actually push cop to regions rather than TiKV. The benefit is indeed valid, though.
Got your point!

* Apply the filters at each TiKV instance, which make the calculation distributed.

* When loading data, skip some table partitions if its data range doesn't pass the filter
Some tables or some partitions? Those are two different things.
When loading data, skip some table partition if its data range doesn't pass the filter?
This was ambiguous. I was trying to say "skip some partitions of a table if the partitioning expression doesn't pass the filter"; it applies to tables with user-defined partitions. Thanks!

For a query `select * from t0, t1 on t0.a = t1.a where t1.a < 5`, we get a propagation `t0.a < 5`, but if all `t0.a` is greater than 5, applying the filter brings unnecessary overheads.

Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it useful.
s/Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it useful./Considering the trade-off, we still gain a lot of benefits from constraint propagation in most cases; hence it can still be treated as useful.
Sure~
Great work. Thanks for your contribution.

For example,

`t1.a = t2.a and t1.a < 5` => `t2.a < 5`
To simplify the examples, maybe it's better to s/t1.a/a/ and s/t2.a/b/? By the way, how about using a table to describe the examples, for example:

| origin filters | propagated filters |
|---|---|
| t1.a = t2.a and t1.a < 5 | t2.a < 5 |
| t1.a = t2.a and t1.a in (12, 13) and t2.a in (14, 15) | t1.a in (14, 15) and t2.a in (12, 13) |

Good idea, thanks!
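
The first row of the table above can be sketched in Go as follows. This is an illustration only, not the TiDB implementation: the `filter` type and `propagate` function are hypothetical, only direct column pairs are handled (no transitive closure over equalities), and the filters are assumed to be deterministic `column op constant` comparisons without side effects.

```go
package main

import "fmt"

// filter is a hypothetical "col op constant" predicate, e.g. t1.a < 5.
type filter struct {
	col string
	op  string
	val int
}

// propagate copies each filter across every direct column equality and
// returns only the newly derived filters.
func propagate(equalities [][2]string, filters []filter) []filter {
	seen := map[filter]bool{}
	for _, f := range filters {
		seen[f] = true
	}
	var derived []filter
	for _, e := range equalities {
		for _, f := range filters {
			// Try both directions of the equality: x = y and y = x.
			for _, p := range [][2]string{{e[0], e[1]}, {e[1], e[0]}} {
				if f.col != p[0] {
					continue
				}
				g := filter{col: p[1], op: f.op, val: f.val}
				if !seen[g] {
					seen[g] = true
					derived = append(derived, g)
				}
			}
		}
	}
	return derived
}

func main() {
	// t1.a = t2.a and t1.a < 5
	eqs := [][2]string{{"t1.a", "t2.a"}}
	in := []filter{{col: "t1.a", op: "<", val: 5}}
	for _, f := range propagate(eqs, in) {
		fmt.Println(f.col, f.op, f.val)
	}
}
```

The `seen` set keeps the function from re-emitting a filter that is already present, which matters when both sides of an equality carry the same constraint.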

`t1.a = t2.a and t1.a < sleep()` -- the expression has side effect

2. Infer NotNULL filters from comparison operator
How about adding a `not null` filter for the involved column once a scalar expression is null rejected? In fact, there are many other expressions that are null rejected.
Sure!

`a < 3 and 3 > a` -> `a < 3`

`a < 5 and a > 5` -> `False`
Do you mean constructing an expression which is always false, for example, `a = null`?
I think replacing them with a constant `False` is fine
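
The two folding examples above can be sketched in Go by tracking one exclusive lower and upper bound per column. The `pred`, `bounds`, and `fold` names are hypothetical, and the sketch covers only `<` and `>` against integer constants. It treats the bounds as real-valued, so it conservatively misses integer-only contradictions such as `a > 4 and a < 5`.

```go
package main

import "fmt"

// pred is a hypothetical comparison between one column and a constant.
type pred struct {
	op          string // "<" or ">"
	val         int
	constOnLeft bool // true for "val op col", e.g. 3 > a
}

// bounds holds exclusive lower and upper bounds for the column.
type bounds struct {
	lo, hi       int
	hasLo, hasHi bool
}

// fold merges the predicates into one range and reports whether the
// conjunction can still be satisfied.
func fold(preds []pred) (bounds, bool) {
	var b bounds
	for _, p := range preds {
		op := p.op
		if p.constOnLeft {
			// Rewrite "c > a" as "a < c" and "c < a" as "a > c".
			if op == "<" {
				op = ">"
			} else {
				op = "<"
			}
		}
		switch op {
		case "<":
			if !b.hasHi || p.val < b.hi {
				b.hi, b.hasHi = p.val, true
			}
		case ">":
			if !b.hasLo || p.val > b.lo {
				b.lo, b.hasLo = p.val, true
			}
		}
	}
	if b.hasLo && b.hasHi && b.lo >= b.hi {
		return b, false // empty range: replace with a constant false
	}
	return b, true
}

func main() {
	// a < 3 and 3 > a
	b, ok := fold([]pred{{op: "<", val: 3}, {op: ">", val: 3, constOnLeft: true}})
	fmt.Println(b.hasHi, b.hi, b.hasLo, ok)
	// a < 5 and a > 5
	_, ok = fold([]pred{{op: "<", val: 5}, {op: ">", val: 5}})
	fmt.Println(ok)
}
```

When `fold` reports an unsatisfiable range, the whole conjunction can be replaced by a constant false filter, as suggested in the reply above.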
@@ -0,0 +1,189 @@
# Proposal: Enhance constraint propagation in TiDB logical plan

- Author(s): [@bb7133](https://github.com/bb7133), [@zz-jason](https://github.com/zz-jason)
This proposal was mainly completed by you, please remove me from the author list 😂
Well, proposal 4 is all from your issue :)

## Background

For now, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, it does:
-> Currently, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, which does:

For now, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, it does:

1. Find `column = constant` expression and substitue the constant for column, as well as try to fold the substituted constant expression if possible, for example:
-> Find the `column = constant` expression and substitute the constant for the column, as well as try to fold the substituted constant expression if possible, for example:

Given `a = b and a = 2 and b = 3`, it becomes `2 = 3` after substitution and lead to a final `false` constant.

2. Find `column A = column B` expression(which happens in `join` statements mostly) and propagate expressions like `column op constant`(as well as `constant op column`) based on the equliaty relation, the supported operators are:
-> Find the `column A = column B` expression (which happens in `join` statements mostly) and propagate expressions like `column op constant` (as well as `constant op column`) based on the equality relation. The supported operators are:
Note:
- Do not combine two sentences with a comma
- Add a space before "("
* `ast.GE('>=')`
* `ast.NE('!=')`

The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rule. For example, in predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source(TiKV), and thus reduce the amount of data in the whole data path.
The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rules. For example, in the predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source (TiKV), and thus reduces the amount of data in the whole data path.

The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rule. For example, in predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source(TiKV), and thus reduce the amount of data in the whole data path.

We can further do the optimization by introduce more rules and infer/propagate more constraints from the existings ones, which helps us building better logical plan.
We can further do the optimization by introducing more rules and inferring/propagating more constraints from the existing ones, which helps us build a better logical plan.

For a query `select * from t0, t1 on t0.a = t1.a where t1.a < 5`, we get a propagation `t0.a < 5`, but if all `t0.a` is greater than 5, applying the filter brings unnecessary overheads.

Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it useful.
Considering the trade-off, most of the time we gain benefits from constraint propagation and still treat it as useful.

## Compatibility

All rules mentioned in this proposal are logical plan optimization, they should not change the semantic of a query, and thus dont't lead to any compatibility issue
All rules mentioned in this proposal are logical plan optimization, which does(?) not change the semantics of a query, and thus this proposal will not lead to any compatibility issue.

Here are rough ideas about possible implementations:

* For proposal #1, we can extend current `propagateConstantSolver` to support wider types of operators from column equiality.
For proposal #1, we can extend the current `propagateConstantSolver` to support wider types of operators from column equality.

* For proposal #1, we can extend current `propagateConstantSolver` to support wider types of operators from column equiality.

* For proposal #2, `propagateConstantSolver` is also a applicable way to add `NotNULL` filter(`not(isnull())`), but should examine if the column doesn't have `NotNULL` constraint.
"For proposal #2, `propagateConstantSolver` is also an applicable way to add `NotNULL` filter(`not(isnull())`), but should examine whether the column has the `NotNULL` constraint"?
or
"For proposal #2, `propagateConstantSolver` is also an applicable way to add `NotNULL` filter(`not(isnull())`), but should be examined if the column doesn't have the `NotNULL` constraint"?

* For proposal #2, `propagateConstantSolver` is also a applicable way to add `NotNULL` filter(`not(isnull())`), but should examine if the column doesn't have `NotNULL` constraint.

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collecting and folding comparison constraints
Hi @CaitinChen, thank you very much for all the comments. I will update the doc one by one. Again, your help is really appreciated.
@bb7133 My pleasure ^_^
35594f8 to 2f56960
hi @zhexuany @zz-jason @CaitinChen, comments addressed, and thanks for your help!

Here are proposed rules we can consider:

1. Infer more filters/constraints from column equality relation
We need to be EXTREMELY careful here. Consider the following examples:
Given `t1.a = t2.a and t1.a is null`, we should NOT infer `t2.a is null`; it is not logically equivalent because `null = null` is false.
Another example: given `t1.a = t2.a and cast(t1.a as char(10)) = '+0.0'`, we should NOT infer `cast(t2.a as char(10)) = '+0.0'`; it is not logically equivalent either, because this would filter out tuples with `t2.a = -0.0`.
These 2 cases should fail logically in the current master branch; we need to fix them.
Maybe there are other failing cases as well. Once more, we need to think this over carefully.
I see, you're right! I tried to find some failing corner cases but didn't come up with the ones you commented on. Thank you~
I ported this proposal back to the original Google doc: https://docs.google.com/document/d/1G3wVBaiza9GI5q9nwHLCB2I7r1HUbg6RpTNFpQWTFhQ/edit?usp=sharing, and made updates based on your cases. As you said, we should think very carefully about this part. Please let me know if you have any new ideas, thanks.
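
The NULL case raised in this thread can be illustrated with a small Go model of SQL's three-valued logic. This is a sketch, not TiDB code: `tri`, `nullableInt`, `eq`, `isNull`, and `and` are hypothetical names. Strictly speaking, `NULL = NULL` evaluates to NULL rather than false, but a `WHERE` filter keeps only rows whose condition is true, so the effect on filtering is the same.

```go
package main

import "fmt"

// tri models SQL's three-valued logic.
type tri int

const (
	False tri = iota
	True
	Null
)

// nullableInt models a SQL integer column value that may be NULL.
type nullableInt struct {
	val    int
	isNull bool
}

// eq implements "a = b": any NULL operand makes the result NULL.
func eq(a, b nullableInt) tri {
	if a.isNull || b.isNull {
		return Null
	}
	if a.val == b.val {
		return True
	}
	return False
}

// isNull implements "a IS NULL", which never returns NULL itself.
func isNull(a nullableInt) tri {
	if a.isNull {
		return True
	}
	return False
}

// and implements SQL AND: False dominates, then NULL.
func and(x, y tri) tri {
	switch {
	case x == False || y == False:
		return False
	case x == Null || y == Null:
		return Null
	}
	return True
}

func main() {
	null := nullableInt{isNull: true}
	// t1.a = t2.a AND t1.a IS NULL, evaluated with both columns NULL.
	cond := and(eq(null, null), isNull(null))
	fmt.Println(cond == True)
}
```

Because the equality never evaluates to true for NULL inputs, it cannot be used to carry a predicate such as `IS NULL` from one column to the other in the way a null-rejecting comparison like `a < 5` can be carried.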

* For proposal #2, `propagateConstantSolver` is also an applicable way to add `NotNULL` filter(`not(isnull())`), but should examine whether the column has the `NotNULL` constraint already.

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints
Currently, ranger only comes in when we are computing ranges for an index scan, or for a table scan with a filter on RowID. If we want to impose general expression simplification, we need to extract the infrastructure functionality in `util/ranger/points.go` into a general module.

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints

* For proposal #4, current rule [PredicatePushDown](https://github.com/pingcap/tidb/blob/b3d4ed79b978efadf2974f78db8eeb711509e545/plan/rule_predicate_push_down.go#L1) may be enhanced to archive it
PredicatePushDown is another story. What we need to do here is to derive more logically equivalent conditions by constant propagation, and then PredicatePushDown can consume them. PredicatePushDown itself is fine, I think. I am trying to figure out whether there is a general approach to correctly propagate these conditions over outer joins; it is still a work in progress.
Yeah, this part of the doc is just some rough ideas. It would be great if we had a better solution, thanks!
s/archive/achieve
982e7a3 to ca2a99e
hi @eurekaka, I've updated the docs to address your comments. Thanks.
rest LGTM

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints

* For proposal #4, current rule [PredicatePushDown](https://github.com/pingcap/tidb/blob/b3d4ed79b978efadf2974f78db8eeb711509e545/plan/rule_predicate_push_down.go#L1) may be enhanced to archive it
s/archive/achieve
ca2a99e to bf22488
Addressed, thanks a lot!
LGTM
What problem does this PR solve?
Add description for the proposal of constraint propagation enhancement
What is changed and how it works?
This is a doc change