-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs/design: add proposal docs for constraint propagation enhancement #7648
Changes from 4 commits
4d82254
877bd42
12b4726
bf22488
a1a3066
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,193 @@ | ||
# Proposal: Enhance constraint propagation in TiDB logical plan | ||
|
||
- Author(s): [@bb7133](https://github.com/bb7133), [@zz-jason](https://github.com/zz-jason) | ||
- Last updated: 2018-07-22 | ||
- Discussion at: https://docs.google.com/document/d/1G3wVBaiza9GI5q9nwHLCB2I7r1HUbg6RpTNFpQWTFhQ/edit# | ||
|
||
## Abstract | ||
|
||
This proposal tries to illustrate some rules that can be added to current constraint/filter propgation optimizations in TiDB logcial plan. | ||
|
||
## Background | ||
|
||
Currently, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, which does: | ||
|
||
1. Find the `column = constant` expression and substitute the constant for the column, as well as try to fold the substituted constant expression if possible, for example: | ||
|
||
Given `a = b and a = 2 and b = 3`, it becomes `2 = 3` after substitution and leads to a final `false` constant. | ||
|
||
2. Find the `column A = column B` expression (which happens in `join` statements mostly) and propagate expressions like `column op constant` (as well as `constant op column`) based on the equality relation. The supported operators are: | ||
|
||
* `ast.LT('<')` | ||
* `ast.GT('>')` | ||
* `ast.LE('<=')` | ||
* `ast.GE('>=')` | ||
* `ast.NE('!=')` | ||
|
||
The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rules. For example, in the predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source (TiKV), and thus reduces the amount of data in the whole data path. | ||
|
||
We can further do the optimization by introducing more rules and inferring/propagating more constraints from the existing ones, which helps us build a better logical plan. | ||
|
||
## Proposal | ||
|
||
Here are proposed rules we can consider: | ||
|
||
1. Infer more filters/constraints from column equality relation | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We need to be EXTREMELY careful here. Consider the following examples:
we should NOT infer another example is:
we should NOT infer These 2 cases should fail in current master branch logically, we need to fix them. Maybe there are other failed cases as well, once more, we need to think over this carefully. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I see, you're right! I tried to get some failed corner cases but didn't get the ones you commented. Thank you~ There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I ported this proposal back to original google doc: https://docs.google.com/document/d/1G3wVBaiza9GI5q9nwHLCB2I7r1HUbg6RpTNFpQWTFhQ/edit?usp=sharing, and made updates based on your cases. As you said, we should think very carefully about this part, please let me know if you have any new idea, thanks |
||
|
||
We should be able to infer a set of data constraints based on column equality, that is, any constraint on `a` now applies to `b` as long as: | ||
|
||
* The constraint is deterministic | ||
* The constraint doesn’t have any side effect | ||
* The constraint doesn't include `isnull()` expression: `isnull()` constraint cannot be propagated through equality | ||
* The constraint doesn't include `cast()`/`convert()` expression: type cast may break the equality relation | ||
|
||
For example, | ||
|
||
| original expressions | propagated filters | | ||
| -------------------- | ------------------ | | ||
| a = b and a < 5 | b < 5 | | ||
| a = b and a in (12, 13) and b in (14, 15) | a in (14, 15) and b in (12, 13) | | ||
| a = b and cast(a, varchar(20)) rlike 'abc' | cast(b, varchar(20)) rlike 'abc' | | ||
|
||
Equality propagation should also be included: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
I think the first one is right~ |
||
|
||
| original expressions | propagated filters | | ||
| -------------------- | ------------------ | | ||
| a = b and abs(a) = 5 | abs(b) = 5 | | ||
|
||
But following predicates cannot be propagated: | ||
|
||
| unpropagateable expressions | reason | | ||
| --------------------------- | ------ | | ||
| a = b and a < random() | the expression is non-deterministic | | ||
| a = b and a < sleep() | the expression has side effect | | ||
| a = b and isnull(a) | isnull() cannot be propagated | | ||
| a = b and cast(a as char(10)) = '+0.0' | type cast expression cannot be propagated | | ||
|
||
2. Infer `NotNULL` filters from null-rejected scalar expression | ||
|
||
We can infer `NotNULL` constraints from a scalar expression that doesn’t accept NULL, then we can know that involved columns cannot be NULL and add `NotNull` filter to them: | ||
|
||
| original expressions | inferred filters | | ||
| -------------------- | ---------------- | | ||
| a = b | not(isnull(a)) and not(isnull(b)) | | ||
| a != b | not(isnull(a)) and not(isnull(b)) | | ||
| a < 5 | not(isnull(a)) | | ||
| abs(a) < 3 | not(isnull(a)) | | ||
|
||
NOTE: Those columns should not have `NotNULL` constraint attribute, or the inferred filters are unnecessary. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. NOTE: Those columns should not have the NOTE: Those columns should not have the |
||
|
||
3. Fold the constraints based on their semantics to avoid redundancy | ||
|
||
After the propagations we may produce some duplicate predicates, and they can be combined. We should analyze all of the conditions and try to make the predicates clean: | ||
|
||
| original expressions | combined expression | | ||
| -------------------- | ------------------ | | ||
| a = b and b = a | a = b | | ||
| a < 3 and 3 > a | a < 3 | | ||
| a < 5 | not(isnull(a)) | | ||
| a < 5 and a > 5 | false | | ||
| a < 10 and a <= 5 | a <= 5 | | ||
| isnull(a) and not(isnull(a)) | false | | ||
| a < 3 or a >= 3 | true | | ||
| a in (1, 2) and a in (3, 5) | false | | ||
| a in (1, 2) or a in (3, 5) | a in (1, 2, 3, 5) | | ||
|
||
4. Filter propagation for outer join | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Filter propagation for the outer join |
||
|
||
When doing an equality outer join, we can propagate predicates on outer table in `where` condition to the inner table in `on` condition. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. When doing an equality outer join, we can propagate predicates on the outer table in the |
||
For example: | ||
|
||
`select * from t1 left join t2 on t1.a=t2.a where t1.a in (12, 13);` | ||
|
||
``` | ||
TiDB(localhost:4000) > desc select * from t1 left join t2 on t1.a=t2.a where t1.a in (12, 13); | ||
+-------------------------+----------+------+-------------------------------------------------------------------------+ | ||
| id | count | task | operator info | | ||
+-------------------------+----------+------+-------------------------------------------------------------------------+ | ||
| HashLeftJoin_7 | 25.00 | root | left outer join, inner:TableReader_12, equal:[eq(test.t1.a, test.t2.a)] | | ||
| ├─TableReader_10 | 20.00 | root | data:Selection_9 | | ||
| │ └─Selection_9 | 20.00 | cop | in(test.t1.a, 12, 13) | | ||
| │ └─TableScan_8 | 10000.00 | cop | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo | | ||
| └─TableReader_12 | 10000.00 | root | data:TableScan_11 | | ||
| └─TableScan_11 | 10000.00 | cop | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo | | ||
+-------------------------+----------+------+-------------------------------------------------------------------------+ | ||
6 rows in set (0.00 sec) | ||
``` | ||
|
||
NOTE: in this case, `t1.a in (12, 13)` works on the result of the outer join and we have pushed it down to the outer table. | ||
|
||
But we can further push this filter down to the inner table, since only the records satisfying `t2.a in (12, 13)` can make join predicate `t1.a = t2.a` positive in the join operator. So we can optimize this query to: | ||
|
||
`select * from t1 left join t2 on t1.a=t2.a and t2.a in (12, 13) where t1.a in (12, 13);` | ||
|
||
And the join predicate `t2.a in (12, 13)` can be pushed down: | ||
|
||
``` | ||
TiDB(localhost:4000) > desc select * from t1 left join t2 on t1.a=t2.a and t2.a in (12, 13) where t1.a in (12, 13); | ||
+-------------------------+-------+------+-------------------------------------------------------------------------+ | ||
| id | count | task | operator info | | ||
+-------------------------+-------+------+-------------------------------------------------------------------------+ | ||
| HashLeftJoin_7 | 0.00 | root | left outer join, inner:TableReader_13, equal:[eq(test.t1.a, test.t2.a)] | | ||
| ├─TableReader_10 | 0.00 | root | data:Selection_9 | | ||
| │ └─Selection_9 | 0.00 | cop | in(test.t1.a, 12, 13) | | ||
| │ └─TableScan_8 | 2.00 | cop | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo | | ||
| └─TableReader_13 | 0.00 | root | data:Selection_12 | | ||
| └─Selection_12 | 0.00 | cop | in(test.t2.a, 12, 13) | | ||
| └─TableScan_11 | 2.00 | cop | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo | | ||
+-------------------------+-------+------+-------------------------------------------------------------------------+ | ||
7 rows in set (0.00 sec) | ||
``` | ||
|
||
## Rationale | ||
|
||
Constraint propagation is commonly used as logical plan optimization in traditional databases. For example, [this doc](https://dev.mysql.com/doc/internals/en/optimizer-constant-propagation.html) explains some details of constant propagations in MySQL. It is also widely adopted in distributed analytical engines, like Apache Hive, Apache SparkSQL, Apache Impala, et al. Those engines usually query on a huge amount of data. | ||
|
||
### Advantages: | ||
|
||
Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints, we can filter data as early as possible, and thus reduce disk/network I/O and computational overheads during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level (TiKV), as a Coprocessor task, and leads to the following benefits: | ||
|
||
* Apply the filters at each storage segment (Region), which make the calculation distributed. | ||
|
||
* When loading data, skip some partitions of a table if the partitioning expression doesn't pass the filter. | ||
|
||
* For the columnar storage format to be supported in the future, we may apply some filters directly when accessing the raw storage. | ||
|
||
* Reduce the data transfered from TiKV to TiDB. | ||
|
||
### Disadvantages: | ||
|
||
Constraint propagation may bring unnecessary filters and lead to unnecessary overheads during a query. This is mostly due to the fact that logical optimization doesn't take data statistics into account, for example: | ||
|
||
For a query `select * from t0, t1 on t0.a = t1.a where t1.a < 5`, we get a propagation `t0.a < 5`, but if all `t0.a` is greater than 5, applying the filter brings unnecessary overheads. | ||
|
||
Considering the trade-off, we still gain a lot of benefits from constraint propagation in most of cases; hence it still can be treated as useful. | ||
|
||
## Compatibility | ||
|
||
All rules mentioned in this proposal are logical plan optimization, which do not change the semantics of a query, and thus this proposal will not lead to any compatibility issue | ||
|
||
## Implementation | ||
|
||
Here are rough ideas about possible implementations: | ||
|
||
* For proposal #1, we can extend the current `propagateConstantSolver` to support wider types of operators from column equality. | ||
|
||
* For proposal #2, `propagateConstantSolver` is also an applicable way to add `NotNULL` filter(`not(isnull())`), but should examine whether the column has the `NotNULL` constraint already. | ||
|
||
* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. currently, ranger only comes in when we are computing ranges for index scan, or for table scan with filter on RowID. If we want to impose general expression simplification, we need to extract infrastructure functionalities in |
||
|
||
* For proposal #4, current rule [PredicatePushDown](https://github.com/pingcap/tidb/blob/b3d4ed79b978efadf2974f78db8eeb711509e545/plan/rule_predicate_push_down.go#L1) may be enhanced to achieve it | ||
|
||
## Open issues (if applicable) | ||
|
||
Related issues: | ||
|
||
https://github.com/pingcap/tidb/issues/7098 - the very issue that inspires this proposal | ||
|
||
Related PRs: | ||
|
||
https://github.com/pingcap/tidb/pull/7276 - Related to proposal #1 | ||
|
||
https://github.com/pingcap/tidb/pull/7643 - Related to proposal #1 and #3 | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal was mainly completed by you, please remove me from the author list 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the proposal 4 is all from your issue :)