docs/design: add proposal docs for constraint propagation enhancement #7648

Merged (5 commits) on Sep 25, 2018. Showing changes from 4 commits.
193 changes: 193 additions & 0 deletions docs/design/2018-07-22-enhance-propagations.md
@@ -0,0 +1,193 @@
# Proposal: Enhance constraint propagation in TiDB logical plan

- Author(s): [@bb7133](https://github.com/bb7133), [@zz-jason](https://github.com/zz-jason)

Comment (Member):

This proposal was mainly completed by you, please remove me from the author list 😂

Comment (Member Author, @bb7133, Sep 10, 2018):

Well, the proposal 4 is all from your issue :)

- Last updated: 2018-07-22
- Discussion at: https://docs.google.com/document/d/1G3wVBaiza9GI5q9nwHLCB2I7r1HUbg6RpTNFpQWTFhQ/edit#

## Abstract

This proposal describes some rules that can be added to the current constraint/filter propagation optimizations in the TiDB logical plan.

## Background

Currently, most of the constraint propagation work in TiDB is done by `propagateConstantSolver`, which does:

1. Find `column = constant` expressions and substitute the constant for the column, then try to fold the substituted constant expression if possible, for example:

Given `a = b and a = 2 and b = 3`, it becomes `2 = 3` after substitution and leads to a final `false` constant.

2. Find `column A = column B` expressions (which mostly occur in `join` statements) and propagate expressions like `column op constant` (as well as `constant op column`) based on the equality relation. The supported operators are:

* `ast.LT('<')`
* `ast.GT('>')`
* `ast.LE('<=')`
* `ast.GE('>=')`
* `ast.NE('!=')`

The `propagateConstantSolver` makes more detailed/explicit filters/constraints, which can be used within other optimization rules. For example, in the predicate-pushdown optimization, it generates more predicates that can be pushed closer to the data source (TiKV), and thus reduces the amount of data in the whole data path.

We can take this optimization further by introducing more rules that infer and propagate additional constraints from the existing ones, which helps us build a better logical plan.
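
Step 1 above (substitution followed by folding) can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of TiDB: predicates are modelled as `(op, lhs, rhs)` tuples, columns are strings, constants are integers, and `propagate_constants` is a hypothetical name that does not mirror the real `propagateConstantSolver` code.

```python
import operator

# Comparison operators used when folding constant-only predicates.
OPS = {"=": operator.eq, "<": operator.lt, ">": operator.gt,
       "<=": operator.le, ">=": operator.ge, "!=": operator.ne}


def is_col(x):
    # Assumption of this sketch: columns are strings, constants are not.
    return isinstance(x, str)


def propagate_constants(conjuncts):
    """Return False if the conjunction folds to false, else the rewritten list."""
    # Collect `column = constant` bindings; conflicting bindings are a contradiction.
    consts = {}
    for op, lhs, rhs in conjuncts:
        if op == "=" and is_col(lhs) and not is_col(rhs):
            if consts.setdefault(lhs, rhs) != rhs:
                return False

    result = []
    for op, lhs, rhs in conjuncts:
        is_binding = op == "=" and is_col(lhs) and not is_col(rhs)
        if not is_binding:
            # Substitute known constants for columns.
            lhs = consts.get(lhs, lhs) if is_col(lhs) else lhs
            rhs = consts.get(rhs, rhs) if is_col(rhs) else rhs
        if not is_col(lhs) and not is_col(rhs):
            # Both sides are constants now: fold the comparison.
            if not OPS[op](lhs, rhs):
                return False
            continue  # a constant-true conjunct can be dropped
        result.append((op, lhs, rhs))
    return result


# `a = b and a = 2 and b = 3` becomes `2 = 3`, which folds to false.
print(propagate_constants([("=", "a", "b"), ("=", "a", 2), ("=", "b", 3)]))
```

The sketch makes a single substitution pass; a real solver would repeat the process until no new constants can be derived.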

## Proposal

Here are proposed rules we can consider:

1. Infer more filters/constraints from column equality relation

Comment (Contributor):

We need to be EXTREMELY careful here. Consider the following examples:

`t1.a = t2.a and t1.a is null`

We should NOT infer `t2.a is null`: the two are not logically equivalent, because `null = null` evaluates to NULL rather than true.

Another example is:

`t1.a = t2.a and cast(t1.a as char(10)) = '+0.0'`

We should NOT infer `cast(t2.a as char(10)) = '+0.0'` either: it is not logically equivalent, because it would filter out tuples with `t2.a = -0.0`.

These two cases should currently produce logically wrong results on the master branch; we need to fix them.

There may be other failing cases as well. Once more, we need to think this over carefully.

Comment (Member Author):

I see, you're right! I tried to get some failed corner cases but didn't get the ones you commented. Thank you~

Comment (Member Author):

I ported this proposal back to original google doc: https://docs.google.com/document/d/1G3wVBaiza9GI5q9nwHLCB2I7r1HUbg6RpTNFpQWTFhQ/edit?usp=sharing, and made updates based on your cases. As you said, we should think very carefully about this part, please let me know if you have any new idea, thanks


We should be able to infer a set of data constraints based on column equality, that is, any constraint on `a` now applies to `b` as long as:

* The constraint is deterministic
* The constraint doesn’t have any side effects
* The constraint doesn't include an `isnull()` expression: an `isnull()` constraint cannot be propagated through equality
* The constraint doesn't include a `cast()`/`convert()` expression: a type cast may break the equality relation

For example,

| original expressions | propagated filters |
| -------------------- | ------------------ |
| a = b and a < 5 | b < 5 |
| a = b and a in (12, 13) and b in (14, 15) | a in (14, 15) and b in (12, 13) |

Equality propagation should also be included:

Comment (Contributor):

  • If Equality propagation includes the following, then use
    "Equality propagation should also include:"
  • If the following is some Equality propagation, then use
    "The following equality propagation should also be included:"

I think the first one is right~


| original expressions | propagated filters |
| -------------------- | ------------------ |
| a = b and abs(a) = 5 | abs(b) = 5 |

But the following predicates cannot be propagated:

| non-propagatable expressions | reason |
| ---------------------------- | ------ |
| a = b and a < random() | the expression is non-deterministic |
| a = b and a < sleep() | the expression has a side effect |
| a = b and isnull(a) | isnull() cannot be propagated |
| a = b and cast(a as char(10)) = '+0.0' | a type cast expression cannot be propagated |
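
The safety conditions of rule 1 can be sketched as a small expression-tree check in Python. Everything here is an illustrative assumption rather than TiDB code: the `Col` and `Call` classes, the `UNSAFE_FUNCS` set (which simply mirrors the four restrictions listed above), and the `propagate` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Call:
    func: str
    args: tuple


# Hypothetical blacklist: non-deterministic, side-effecting, isnull and cast functions.
UNSAFE_FUNCS = {"random", "sleep", "isnull", "cast", "convert"}


def is_propagatable(expr):
    """True if no unsafe function appears anywhere in the expression tree."""
    if isinstance(expr, Call):
        return expr.func not in UNSAFE_FUNCS and all(is_propagatable(a) for a in expr.args)
    return True


def mentions(expr, col):
    if expr == col:
        return True
    return isinstance(expr, Call) and any(mentions(a, col) for a in expr.args)


def substitute(expr, src, dst):
    if expr == src:
        return dst
    if isinstance(expr, Call):
        return Call(expr.func, tuple(substitute(a, src, dst) for a in expr.args))
    return expr


def propagate(equality, predicate):
    """Given `a = b` as a (Col, Col) pair, rewrite the predicate for the other column.

    Returns None when the predicate must not be propagated."""
    a, b = equality
    if not is_propagatable(predicate):
        return None
    if mentions(predicate, a):
        return substitute(predicate, a, b)
    if mentions(predicate, b):
        return substitute(predicate, b, a)
    return None


a, b = Col("a"), Col("b")
print(propagate((a, b), Call("<", (a, 5))))           # rewritten for column b
print(propagate((a, b), Call("isnull", (a,))))        # not propagated
```

A blacklist is used only to keep the sketch short; a whitelist of functions known to be safe would be the more conservative design.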

2. Infer `NotNULL` filters from null-rejecting scalar expressions

We can infer `NotNULL` constraints from a scalar expression that doesn’t accept NULL: the columns involved cannot be NULL, so we can add `NotNULL` filters to them:

| original expressions | inferred filters |
| -------------------- | ---------------- |
| a = b | not(isnull(a)) and not(isnull(b)) |
| a != b | not(isnull(a)) and not(isnull(b)) |
| a < 5 | not(isnull(a)) |
| abs(a) < 3 | not(isnull(a)) |

NOTE: Those columns should not have `NotNULL` constraint attribute, or the inferred filters are unnecessary.

Comment (Contributor):

NOTE: Those columns should not have the NotNULL constraint attribute. Otherwise, the inferred filters will be unnecessary?
In Chinese, XXXX 否则 ("otherwise") XXX?

NOTE: Those columns should not have the NotNULL constraint attribute unless the inferred filters are unnecessary?
In Chinese, XXXX 除非 ("unless") XXX?
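
Rule 2 can be sketched in Python as a walk over the expression tree. The representation is an assumption of this sketch, not TiDB code: expressions are `(op, arg, ...)` tuples, `Col` is a hypothetical class carrying a `nullable` flag, and `NULL_PROPAGATING` is a hypothetical whitelist of operators that return NULL whenever any argument is NULL.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Col:
    name: str
    nullable: bool = True


# Hypothetical whitelist: operators/functions that yield NULL if any argument is NULL.
NULL_PROPAGATING = {"=", "!=", "<", "<=", ">", ">=", "abs"}


def null_rejected_columns(expr):
    """Columns that must be non-NULL for `expr` to evaluate to true."""
    if isinstance(expr, Col):
        return {expr}
    if isinstance(expr, tuple) and expr[0] in NULL_PROPAGATING:
        cols = set()
        for arg in expr[1:]:
            cols |= null_rejected_columns(arg)
        return cols
    # Constants and functions outside the whitelist (e.g. isnull) infer nothing.
    return set()


def infer_not_null(expr):
    """Build not(isnull(col)) filters, skipping columns already declared NotNULL."""
    cols = sorted(c for c in null_rejected_columns(expr) if c.nullable)
    return [("not", ("isnull", c)) for c in cols]


a, b = Col("a"), Col("b")
print(infer_not_null(("=", a, b)))            # filters for both a and b
print(infer_not_null(("<", ("abs", a), 3)))   # filter for a only
```

Columns whose `nullable` flag is false are skipped, which is one reading of the NOTE above: the inferred filter would be redundant for them.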


3. Fold the constraints based on their semantics to avoid redundancy

After the propagations we may produce some duplicate predicates, and they can be combined. We should analyze all of the conditions and try to make the predicates clean:

| original expressions | combined expression |
| -------------------- | ------------------ |
| a = b and b = a | a = b |
| a < 3 and 3 > a | a < 3 |
| a < 5 and not(isnull(a)) | a < 5 |
| a < 5 and a > 5 | false |
| a < 10 and a <= 5 | a <= 5 |
| isnull(a) and not(isnull(a)) | false |
| a < 3 or a >= 3 | not(isnull(a)), or true if a is declared NotNULL |
| a in (1, 2) and a in (3, 5) | false |
| a in (1, 2) or a in (3, 5) | a in (1, 2, 3, 5) |
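
Some of the folds in the table above amount to interval and set arithmetic on a single column. The Python sketch below shows that idea under assumptions of our own: `fold_range` and `fold_in_lists` are hypothetical helpers, bounds are `(value, inclusive)` pairs, and NULL handling is ignored.

```python
import math


def fold_range(conjuncts):
    """Fold a conjunction of (op, constant) comparisons on one column.

    Returns False if no value can satisfy it, else (lower, upper) bounds,
    each given as a (value, inclusive) pair."""
    lo, hi = (-math.inf, False), (math.inf, False)
    for op, c in conjuncts:
        if op in ("<", "<="):
            cand = (c, op == "<=")
            # Tighter upper bound: smaller value, or same value but exclusive.
            if cand[0] < hi[0] or (cand[0] == hi[0] and not cand[1]):
                hi = cand
        elif op in (">", ">="):
            cand = (c, op == ">=")
            # Tighter lower bound: larger value, or same value but exclusive.
            if cand[0] > lo[0] or (cand[0] == lo[0] and not cand[1]):
                lo = cand
        else:
            raise ValueError(f"unsupported operator: {op}")
    # The interval is empty if the bounds cross, or touch without both being inclusive.
    if lo[0] > hi[0] or (lo[0] == hi[0] and not (lo[1] and hi[1])):
        return False
    return lo, hi


def fold_in_lists(lists, conjunction=True):
    """Fold `a in (...)` lists joined by AND (intersection) or OR (union)."""
    sets = [set(values) for values in lists]
    merged = set.intersection(*sets) if conjunction else set.union(*sets)
    return sorted(merged) if merged else False


print(fold_range([("<", 10), ("<=", 5)]))                      # upper bound tightens to <= 5
print(fold_range([("<", 5), (">", 5)]))                        # empty interval
print(fold_in_lists([(1, 2), (3, 5)], conjunction=False))      # merged IN list
```

This covers only comparisons against constants on one column; deduplicating `a = b and b = a` or `a < 3 and 3 > a` would additionally need the predicates normalized to a canonical operand order first.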

4. Filter propagation for outer join

Comment (Contributor):

Filter propagation for the outer join


When doing an equality outer join, we can propagate predicates on outer table in `where` condition to the inner table in `on` condition.

Comment (Contributor):

When doing an equality outer join, we can propagate predicates on the outer table in the where condition to the inner table in the on condition.

For example:

`select * from t1 left join t2 on t1.a=t2.a where t1.a in (12, 13);`

```
TiDB(localhost:4000) > desc select * from t1 left join t2 on t1.a=t2.a where t1.a in (12, 13);
+-------------------------+----------+------+-------------------------------------------------------------------------+
| id | count | task | operator info |
+-------------------------+----------+------+-------------------------------------------------------------------------+
| HashLeftJoin_7 | 25.00 | root | left outer join, inner:TableReader_12, equal:[eq(test.t1.a, test.t2.a)] |
| ├─TableReader_10 | 20.00 | root | data:Selection_9 |
| │ └─Selection_9 | 20.00 | cop | in(test.t1.a, 12, 13) |
| │ └─TableScan_8 | 10000.00 | cop | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo |
| └─TableReader_12 | 10000.00 | root | data:TableScan_11 |
| └─TableScan_11 | 10000.00 | cop | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo |
+-------------------------+----------+------+-------------------------------------------------------------------------+
6 rows in set (0.00 sec)
```

NOTE: in this case, `t1.a in (12, 13)` works on the result of the outer join and we have pushed it down to the outer table.

But we can push this filter further down to the inner table, since only the records satisfying `t2.a in (12, 13)` can make the join predicate `t1.a = t2.a` true in the join operator. So we can optimize this query to:

`select * from t1 left join t2 on t1.a=t2.a and t2.a in (12, 13) where t1.a in (12, 13);`

And the join predicate `t2.a in (12, 13)` can be pushed down:

```
TiDB(localhost:4000) > desc select * from t1 left join t2 on t1.a=t2.a and t2.a in (12, 13) where t1.a in (12, 13);
+-------------------------+-------+------+-------------------------------------------------------------------------+
| id | count | task | operator info |
+-------------------------+-------+------+-------------------------------------------------------------------------+
| HashLeftJoin_7 | 0.00 | root | left outer join, inner:TableReader_13, equal:[eq(test.t1.a, test.t2.a)] |
| ├─TableReader_10 | 0.00 | root | data:Selection_9 |
| │ └─Selection_9 | 0.00 | cop | in(test.t1.a, 12, 13) |
| │ └─TableScan_8 | 2.00 | cop | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo |
| └─TableReader_13 | 0.00 | root | data:Selection_12 |
| └─Selection_12 | 0.00 | cop | in(test.t2.a, 12, 13) |
| └─TableScan_11 | 2.00 | cop | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo |
+-------------------------+-------+------+-------------------------------------------------------------------------+
7 rows in set (0.00 sec)
```
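
The rewrite in rule 4 can be sketched in Python as follows. The representation is an assumption of this sketch: conditions are `(column, op, value)` tuples, the join key is an `(outer_col, inner_col)` pair, and `propagate_to_inner` is a hypothetical helper that ignores the determinism, `isnull()` and `cast()` restrictions from rule 1.

```python
def propagate_to_inner(join_eq, where_conds):
    """Derive extra ON conditions for the inner table of an equality outer join.

    join_eq:     (outer_col, inner_col) taken from `outer_col = inner_col` in ON.
    where_conds: (column, op, value) conditions from WHERE on outer-table columns.

    The derived conditions belong in the ON clause. Adding them to WHERE
    would drop the NULL-extended rows and change the outer join result."""
    outer_col, inner_col = join_eq
    return [(inner_col, op, value)
            for col, op, value in where_conds
            if col == outer_col]


extra_on = propagate_to_inner(("t1.a", "t2.a"), [("t1.a", "in", (12, 13))])
print(extra_on)
```

Only conditions on the join-key column are propagated; a condition on another outer-table column, such as a hypothetical `t1.b`, says nothing about `t2.a` and is left untouched.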

## Rationale

Constraint propagation is commonly used as a logical plan optimization in traditional databases. For example, [this doc](https://dev.mysql.com/doc/internals/en/optimizer-constant-propagation.html) explains some details of constant propagation in MySQL. It is also widely adopted in distributed analytical engines such as Apache Hive, Apache SparkSQL, and Apache Impala, which usually query huge amounts of data.

### Advantages:

Constraint propagation brings more detailed, explicit constraints to each data source involved in a query. With those constraints, we can filter data as early as possible, and thus reduce disk/network I/O and computational overhead during the execution of a query. In TiDB, most propagated filters can be pushed down to the storage level (TiKV) as a Coprocessor task, which leads to the following benefits:

* Apply the filters at each storage segment (Region), which makes the computation distributed.

* When loading data, skip some partitions of a table if the partitioning expression doesn't pass the filter.

* For the columnar storage format to be supported in the future, we may apply some filters directly when accessing the raw storage.

* Reduce the data transferred from TiKV to TiDB.

### Disadvantages:

Constraint propagation may bring unnecessary filters and lead to unnecessary overheads during a query. This is mostly due to the fact that logical optimization doesn't take data statistics into account, for example:

For a query `select * from t0 join t1 on t0.a = t1.a where t1.a < 5`, we get the propagated filter `t0.a < 5`, but if every value of `t0.a` is already less than 5, the filter eliminates no rows and applying it only adds overhead.

Considering the trade-off, we still gain a lot of benefits from constraint propagation in most cases; hence it can still be treated as useful.

## Compatibility

All rules mentioned in this proposal are logical plan optimizations, which do not change the semantics of a query, so this proposal will not lead to any compatibility issues.

## Implementation

Here are rough ideas about possible implementations:

* For proposal #1, we can extend the current `propagateConstantSolver` to support wider types of operators from column equality.

* For proposal #2, `propagateConstantSolver` is also a suitable place to add the `NotNULL` filter (`not(isnull())`), but it should first check whether the column already has the `NotNULL` constraint.

* For proposal #3, the [ranger](https://github.com/pingcap/tidb/blob/6fb1a637fbfd41b2004048d8cb995de6442085e2/util/ranger/ranger.go#L14) may be useful to help us collect and fold comparison constraints.

Comment (Contributor):

Currently, ranger only comes in when we are computing ranges for an index scan, or for a table scan with a filter on RowID. If we want to impose general expression simplification, we need to extract the infrastructure functionality in util/ranger/points.go into a general module.


* For proposal #4, the current rule [PredicatePushDown](https://github.com/pingcap/tidb/blob/b3d4ed79b978efadf2974f78db8eeb711509e545/plan/rule_predicate_push_down.go#L1) may be enhanced to achieve it.

## Open issues (if applicable)

Related issues:

https://github.com/pingcap/tidb/issues/7098 - the very issue that inspires this proposal

Related PRs:

https://github.com/pingcap/tidb/pull/7276 - Related to proposal #1

https://github.com/pingcap/tidb/pull/7643 - Related to proposal #1 and #3