Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiDB logical optimization: extend outer join elimination #7559

Closed
tianjiqx opened this issue Aug 30, 2018 · 13 comments
Closed

TiDB logical optimization: extend outer join elimination #7559

tianjiqx opened this issue Aug 30, 2018 · 13 comments
Assignees
Labels
sig/planner SIG: Planner

Comments

@tianjiqx
Copy link
Contributor

tianjiqx commented Aug 30, 2018

TiDB(Release Version: v2.1.0-beta-171-g7223353) support outer join elimination, but it can be further expanded.

create table t1(c1 int primary key,c2 int,c3 int);
create table t2(c1 int primary key,c2 int,c3 int);
create table t3(c1 int primary key,c2 int,c3 int);
create table t4(c1 int primary key,c2 int,c3 int);
create table t5(c1 int primary key,c2 int,c3 int);


insert into t1 values(1,1,2);
insert into t1 values(2,2,4);
insert into t1 values(3,3,2);
insert into t1 values(4,4,4);
insert into t1 values(5,5,2);


insert into t2 values(1,1,2);
insert into t2 values(2,2,4);
insert into t2 values(3,3,2);
insert into t2 values(4,4,4);


insert into t3 values(1,1,2);
insert into t3 values(2,2,4);
insert into t3 values(3,3,2);


insert into t4 values(1,1,2);
insert into t4 values(2,2,4);
insert into t4 values(3,3,2);
insert into t4 values(4,4,4);

insert into t5 values(1,1,2);
insert into t5 values(2,2,4);
insert into t5 values(3,null,2);
insert into t5 values(4,null,4);
insert into t5 values(5,null,4);

-- outerJoin2innerJoin
-- left join as an example  
-- Example 1:
select t1.c1,t2.c1,t3.c1 from  t1 left join t2 on t1.c1=t2.c2 left join t3 on t2.c1=t3.c2 
where t3.c3< 5; 
+----+----+----+
| c1 | c1 | c1 |
+----+----+----+
|  1 |  1 |  1 |
|  2 |  2 |  2 |
|  3 |  3 |  3 |
+----+----+----+
3 rows in set (0.00 sec)


-- tidb only remove one left join (t2 and t3) =>
mysql> explain select t1.c1,t2.c1,t3.c1 from  t1 left join t2 on t1.c1=t2.c2 left join t3 on t2.c1=t3.c2 
    -> where t3.c3< 5; 
+------------------------------+-------+------+---------------------------------------------------------------------------+
| id                           | count | task | operator info                                                             |
+------------------------------+-------+------+---------------------------------------------------------------------------+
| Projection_8                 | 1.56  | root | test.t1.c1, test.t2.c1, test.t3.c1                                        |
| └─HashLeftJoin_9             | 1.56  | root | inner join, inner:TableReader_18, equal:[eq(test.t2.c1, test.t3.c2)]      |
|   ├─HashLeftJoin_11          | 5.00  | root | left outer join, inner:TableReader_15, equal:[eq(test.t1.c1, test.t2.c2)] |
|   │ ├─TableReader_13         | 5.00  | root | data:TableScan_12                                                         |
|   │ │ └─TableScan_12         | 5.00  | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   │ └─TableReader_15         | 4.00  | root | data:TableScan_14                                                         |
|   │   └─TableScan_14         | 4.00  | cop  | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   └─TableReader_18           | 1.00  | root | data:Selection_17                                                         |
|     └─Selection_17           | 1.00  | cop  | lt(test.t3.c3, 5)                                                         |
|       └─TableScan_16         | 3.00  | cop  | table:t3, range:[-inf,+inf], keep order:false, stats:pseudo               |
+------------------------------+-------+------+---------------------------------------------------------------------------+
10 rows in set (0.01 sec)

select t1.c1,t2.c1,t3.c1 from  t1 left join t2 on t1.c1=t2.c2 inner join t3 on t2.c1=t3.c2 
where t3.c3< 5; 
+----+----+----+
| c1 | c1 | c1 |
+----+----+----+
|  1 |  1 |  1 |
|  2 |  2 |  2 |
|  3 |  3 |  3 |
+----+----+----+
3 rows in set (0.00 sec)


-- in fact, the follow is also is right =>

select t1.c1,t2.c1,t3.c1 from  t1 inner join t2 on t1.c1=t2.c2 inner join t3 on t2.c1=t3.c2 
where t3.c3< 5; 
+----+----+----+
| c1 | c1 | c1 |
+----+----+----+
|  1 |  1 |  1 |
|  2 |  2 |  2 |
|  3 |  3 |  3 |
+----+----+----+
3 rows in set (0.00 sec)

-- be careful, There is different example =>

-- Example 2:
select t1.c1,t3.c1,t4.c1 from  t1 left join t3 on t1.c1=t3.c2 left join t4 on t1.c1=t4.c2 
where t4.c3< 5; 
+----+----+----+
| c1 | c1 | c1 |
+----+----+----+
|  1 |  1 |  1 |
|  2 |  2 |  2 |
|  3 |  3 |  3 |
|  4 | NULL |  4 |
+----+----+----+
4 rows in set (0.00 sec)


-- tidb, remove one left join is right =>

select t1.c1,t3.c1,t4.c1 from  t1 left join t3 on t1.c1=t3.c2 inner join t4 on t1.c1=t4.c2 
where t4.c3< 5;  
+----+----+----+
| c1 | c1 | c1 |
+----+----+----+
|  1 |  1 |  1 |
|  2 |  2 |  2 |
|  3 |  3 |  3 |
|  4 | NULL |  4 |
+----+----+----+
4 rows in set (0.00 sec)


-- remove two left join is wrong =>

select t1.c1,t3.c1,t4.c1 from  t1 inner join t3 on t1.c1=t3.c2 inner join t4 on t1.c1=t4.c2 
where t4.c3< 5; 
+----+----+----+
| c1 | c1 | c1 |
+----+----+----+
|  1 |  1 |  1 |
|  2 |  2 |  2 |
|  3 |  3 |  3 |
+----+----+----+
3 rows in set (0.01 sec)



-- why ? because example1 and example2 have join topology.
-- Example 1:
t1--left--t2--left--t3
--- Example 2:
t1--left--t3
|
--left--t4

-- so, a "chain rule" can be derived from Example 1 and Example 2:

taking the left outer join as an example, the outer table of left join(L2) is the inner table of the other left join(L1), 
and when the inner table of the left join(L2) satisfies the "NUll rejection", the other left join(L1) can also be eliminated.

-- such as (pay attention to, join condtion have A-B, B-C, C-D, D-E, table 'E' is satisfies the "NUll rejection"),
A--left join--B--left join--C--left join--D--left join--E
<=>
A--inner join--B--inner join--C--inner join--D--inner join--E

-- Extend
-- Example 3:

select t1.c1,t3.c1,t3.* from t1 left join t3 on  t1.c1=t3.c2 inner join t5 on t3.c1=t5.c2; 
+----+----+----+------+------+
| c1 | c1 | c1 | c2   | c3   |
+----+----+----+------+------+
|  1 |  1 |  1 |    1 |    2 |
|  2 |  2 |  2 |    2 |    4 |
+----+----+----+------+------+
2 rows in set (0.00 sec)


-- tidb will hold left join 
mysql> explain select t1.c1,t3.c1,t3.* from t1 left join t3 on  t1.c1=t3.c2 inner join t5 on t3.c1=t5.c2; 
+------------------------------+-------+------+---------------------------------------------------------------------------+
| id                           | count | task | operator info                                                             |
+------------------------------+-------+------+---------------------------------------------------------------------------+
| Projection_7                 | 6.25  | root | test.t1.c1, test.t3.c1, test.t3.c1, test.t3.c2, test.t3.c3                |
| └─HashLeftJoin_8             | 6.25  | root | inner join, inner:TableReader_16, equal:[eq(test.t3.c1, test.t5.c2)]      |
|   ├─HashLeftJoin_10          | 5.00  | root | left outer join, inner:TableReader_14, equal:[eq(test.t1.c1, test.t3.c2)] |
|   │ ├─TableReader_12         | 5.00  | root | data:TableScan_11                                                         |
|   │ │ └─TableScan_11         | 5.00  | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   │ └─TableReader_14         | 3.00  | root | data:TableScan_13                                                         |
|   │   └─TableScan_13         | 3.00  | cop  | table:t3, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   └─TableReader_16           | 5.00  | root | data:TableScan_15                                                         |
|     └─TableScan_15           | 5.00  | cop  | table:t5, range:[-inf,+inf], keep order:false, stats:pseudo               |
+------------------------------+-------+------+---------------------------------------------------------------------------+
9 rows in set (0.00 sec)

-- in fact, it's not neccessary maintain left join
select t1.c1,t3.c1,t3.* from t1 inner join t3 on  t1.c1=t3.c2 inner join t5 on t3.c1=t5.c2; 
+----+----+----+------+------+
| c1 | c1 | c1 | c2   | c3   |
+----+----+----+------+------+
|  1 |  1 |  1 |    1 |    2 |
|  2 |  2 |  2 |    2 |    4 |
+----+----+----+------+------+
2 rows in set (0.00 sec)

-- remove left join is is right, due to inner join isn't produce  result for null value
-- for example:
select t5.c2 from t5 a inner join t5 b on a.c2=b.c2; 
+------+
| c2   |
+------+
|    1 |
|    2 |
+------+
2 rows in set (0.00 sec)



-- conclusion, two class "NULL rejection" 
-- 1. the inner table of outer join has "not null" filter condition   
-- 2. inner join
@tianjiqx
Copy link
Contributor Author

If there are no other idle people, I can try to implement it.

@zhexuany
Copy link
Contributor

zhexuany commented Aug 30, 2018

this proposal is still being reviewed. We need all agree on design first and then implement it. We are really happy if you could jump in and contribute to our project.

@zz-jason
Copy link
Member

zz-jason commented Aug 30, 2018

@tianjiqx The first example:

TiDB(localhost:4000) > desc select t1.c1,t2.c1,t3.c1 from  t1 left join t2 on t1.c1=t2.c2 left join t3 on t2.c1=t3.c2  where t3.c3< 5;
+------------------------------+----------+------+---------------------------------------------------------------------------+
| id                           | count    | task | operator info                                                             |
+------------------------------+----------+------+---------------------------------------------------------------------------+
| Projection_8                 | 5192.71  | root | test.t1.c1, test.t2.c1, test.t3.c1                                        |
| └─HashLeftJoin_9             | 5192.71  | root | inner join, inner:TableReader_18, equal:[eq(test.t2.c1, test.t3.c2)]      |
|   ├─HashLeftJoin_11          | 12500.00 | root | left outer join, inner:TableReader_15, equal:[eq(test.t1.c1, test.t2.c2)] |
|   │ ├─TableReader_13         | 10000.00 | root | data:TableScan_12                                                         |
|   │ │ └─TableScan_12         | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   │ └─TableReader_15         | 10000.00 | root | data:TableScan_14                                                         |
|   │   └─TableScan_14         | 10000.00 | cop  | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   └─TableReader_18           | 3323.33  | root | data:Selection_17                                                         |
|     └─Selection_17           | 3323.33  | cop  | lt(test.t3.c3, 5)                                                         |
|       └─TableScan_16         | 10000.00 | cop  | table:t3, range:[-inf,+inf], keep order:false, stats:pseudo               |
+------------------------------+----------+------+---------------------------------------------------------------------------+
10 rows in set (0.00 sec)

Because the filter t3.c2 < 5 is NULL rejective and t3 is the inner table, the left outer join on t2 and t3 is simplified to inner join. After this simplification, we can safely add another two filters on t2 and t3 based on the on condition of the simplified inner join:

(t1 left outer join t2) inner join t3 on t2.c1=t3.c2

The two filters are:

  • t2.c1 is not null
  • t3.c2 is not null,

With the derived filter t2.c1 is not null, we can further simplify the second left outer join to inner join, because the filter t2.c1 is not null is also NULL rejective and t2 is the inner table of the left outer join on t1 and t2.

For now, the filter derivation is not adopted, so the second outer join simplification failed.

This can be improved in the current planner. The filter derivation based on the join condition has already proposed by @bb7133 in this PR: #7276

@zz-jason
Copy link
Member

@zhexuany This optimization can be done in the current planner, no need to wait for the new planner being implemented.

@tianjiqx
Copy link
Contributor Author

Good

@zz-jason
Copy link
Member

@tianjiqx It would be great if you can investigate the code logic about the predicate push down on the join operator and help us to make some improvement on the filter derivation and outer join simplification 😄

@tianjiqx
Copy link
Contributor Author

tianjiqx commented Sep 2, 2018

Addtional:

1. a predicate containing two tables on where is also a null value rejection.

-- Example4:
select * from t1 left join t2 on t1.c1=t2.c2 where t1.c3=t2.c3;
+----+------+------+----+------+------+
| c1 | c2   | c3   | c1 | c2   | c3   |
+----+------+------+----+------+------+
|  1 |    1 |    2 |  1 |    1 |    2 |
|  2 |    2 |    4 |  2 |    2 |    4 |
|  3 |    3 |    2 |  3 |    3 |    2 |
|  4 |    4 |    4 |  4 |    4 |    4 |
+----+------+------+----+------+------+
4 rows in set (0.00 sec)

tidb is not processed at present.

mysql> explain select * from t1 left join t2 on t1.c1=t2.c2 where t1.c3=t2.c3;
+-------------------------+-------+------+---------------------------------------------------------------------------+
| id                      | count | task | operator info                                                             |
+-------------------------+-------+------+---------------------------------------------------------------------------+
| Selection_7             | 4.00  | root | eq(test.t1.c3, test.t2.c3)                                                |
| └─HashLeftJoin_8        | 5.00  | root | left outer join, inner:TableReader_12, equal:[eq(test.t1.c1, test.t2.c2)] |
|   ├─TableReader_10      | 5.00  | root | data:TableScan_9                                                          |
|   │ └─TableScan_9       | 5.00  | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo               |
|   └─TableReader_12      | 4.00  | root | data:TableScan_11                                                         |
|     └─TableScan_11      | 4.00  | cop  | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo               |
+-------------------------+-------+------+---------------------------------------------------------------------------+
6 rows in set (0.00 sec)

It can be rewritten into the following form(reduce left join and from where push down to join on condition). =>

select * from t1 inner join t2 on t1.c1=t2.c2 and t1.c3=t2.c3;
+----+------+------+----+------+------+
| c1 | c2   | c3   | c1 | c2   | c3   |
+----+------+------+----+------+------+
|  1 |    1 |    2 |  1 |    1 |    2 |
|  2 |    2 |    4 |  2 |    2 |    4 |
|  3 |    3 |    2 |  3 |    3 |    2 |
|  4 |    4 |    4 |  4 |    4 |    4 |
+----+------+------+----+------+------+
4 rows in set (0.00 sec)

this SQL's explain plan is in PostgreSQL(hash join is represent inner join, otherwise hash right join/hash left join).

pgtest=# explain select * from t1 left join t2 on t1.c1=t2.c2 where t1.c3=t2.c3;
                            QUERY PLAN
-------------------------------------------------------------------
 Hash Join  (cost=61.00..122.51 rows=10 width=24)
   Hash Cond: ((t2.c2 = t1.c1) AND (t2.c3 = t1.c3))
   ->  Seq Scan on t2  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=30.40..30.40 rows=2040 width=12)
         ->  Seq Scan on t1  (cost=0.00..30.40 rows=2040 width=12)

sometimes it can be pushed down to the base table by equivalent analogy.

pgtest=# explain select * from t1 left join t2 on t1.c1=t2.c2 where t1.c2=t2.c2;
                           QUERY PLAN
-----------------------------------------------------------------
 Hash Join  (cost=35.63..91.77 rows=10 width=24)
   Hash Cond: (t2.c2 = t1.c1)
   ->  Seq Scan on t2  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=35.50..35.50 rows=10 width=12)
         ->  Seq Scan on t1  (cost=0.00..35.50 rows=10 width=12)
               Filter: (c1 = c2)

2. Maybe another thing to note is that the use of () changes the join order.

--Example 5
select * from (t1 left join t2 on t1.c1=t2.c2)  left join ( t3 left join t4 on t3.c2 =t4.c1) on t2.c2=t3.c2 where t4.c2 <10;

PostgreSQL can reduce all left join:

pgtest=# explain select * from (t1 left join t2 on t1.c1=t2.c2)  left join ( t3 left join t4 on t3.c2 =t4.c1) on t2.c2=t3.c2 where t4.c2 <10;
                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Hash Join  (cost=162.00..206.85 rows=680 width=48)
   Hash Cond: (t3.c2 = t1.c1)
   ->  Seq Scan on t3  (cost=0.00..30.40 rows=2040 width=12)
   ->  Hash  (cost=153.50..153.50 rows=680 width=36)
         ->  Hash Join  (cost=108.65..153.50 rows=680 width=36)
               Hash Cond: (t2.c2 = t1.c1)
               ->  Seq Scan on t2  (cost=0.00..30.40 rows=2040 width=12)
               ->  Hash  (cost=100.15..100.15 rows=680 width=24)
                     ->  Hash Join  (cost=44.00..100.15 rows=680 width=24)
                           Hash Cond: (t1.c1 = t4.c1)
                           ->  Seq Scan on t1  (cost=0.00..30.40 rows=2040 width=12)
                           ->  Hash  (cost=35.50..35.50 rows=680 width=12)
                                 ->  Seq Scan on t4  (cost=0.00..35.50 rows=680 width=12)
                                       Filter: (c2 < 10)

reduce left join process like this:

  • first, due to t4.c2 < 10, can elimilate t2--left--t3 => t2--inner--t3, then produce t2.c2 is not null,t3.c2 is not null
  • second, due to t2.c2 is not null, push down to 't1 left join t2', can eleimite t1--left--t2 => t1--inner--t2, then produce t1.c2 is not null
  • third, due to t4.c2 <10, can elimilate t3--left--t4 => t3--inner--t4, then produce t4.c1 is not null

tidb retains one left join:

mysql> explain select * from (t1 left join t2 on t1.c1=t2.c2)  left join ( t3 left join t4 on t3.c2 =t4.c1) on t2.c2=t3.c2 where t4.c2 <10;
+----------------------------+-------+------+------------------------------------------------------------------------------+
| id                         | count | task | operator info                                                                |
+----------------------------+-------+------+------------------------------------------------------------------------------+
| HashLeftJoin_11            | 2.60  | root | inner join, inner:IndexJoin_21, equal:[eq(test.t2.c2, test.t3.c2)]           |
| ├─HashLeftJoin_13          | 5.00  | root | left outer join, inner:TableReader_17, equal:[eq(test.t1.c1, test.t2.c2)]    |
| │ ├─TableReader_15         | 5.00  | root | data:TableScan_14                                                            |
| │ │ └─TableScan_14         | 5.00  | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                  |
| │ └─TableReader_17         | 4.00  | root | data:TableScan_16                                                            |
| │   └─TableScan_16         | 4.00  | cop  | table:t2, range:[-inf,+inf], keep order:false, stats:pseudo                  |
| └─IndexJoin_21             | 1.66  | root | inner join, inner:TableReader_20, outer key:test.t3.c2, inner key:test.t4.c1 |
|   ├─TableReader_25         | 3.00  | root | data:TableScan_24                                                            |
|   │ └─TableScan_24         | 3.00  | cop  | table:t3, range:[-inf,+inf], keep order:false, stats:pseudo                  |
|   └─TableReader_20         | 1.33  | root | data:Selection_19                                                            |
|     └─Selection_19         | 1.33  | cop  | lt(test.t4.c2, 10)                                                           |
|       └─TableScan_18       | 0.00  | cop  | table:t4, range: decided by [test.t3.c2], keep order:false, stats:pseudo     |
+----------------------------+-------+------+------------------------------------------------------------------------------+
12 rows in set (0.00 sec)

@tianjiqx
Copy link
Contributor Author

tianjiqx commented Sep 2, 2018

Sorry, I am a little busy recently. Only now can I reply to you. @zz-jason

I review the code logic about the predicate push down on the join operator in PostgreSQL(include reduce_outer_joins() and distribute_qual_to_rels()).However, I don't fully understand it.It seems that it first completes the outer join elimination and then makes the predicate push down.

I want to illustrate the elimination of outer join by examples. All of the following of push down case are correct in tidb.
I hope the following is useful.

about predicate(filter) push down

a predicate can appear in four places

  • select(can't push down)
  • join on
  • where
  • having

a predicate(filter) can mounted on the three places

  1. filter is mounted on base table level eg. select * from t1 where t1.c1=10;
  2. filter is mounted on join level, eg. select * from t1 left join t2 on t1.c1=t2.c2 and t1.c2 <10 and t2.c1 <10;
  3. filter is mounted on aggregate level, eg. select c1,count(c2) from t1 group by c1 hving c1 < 10 and count(c2)>1;

(I) for inner join:

  • all position single table predicate can push down to base table level(position 1),
  • other predicate can push down to join level or stay in aggregate level.

It means:

where:
select * from t1 inner join t2 on t1.c1=t2.c2 where t1.c1=10 ;
<=>
select * from (select * from t1 where t1.c1=10) a inner join (select * from t2 where t1.c2=10) b on a.c1=b.c2;

select * from t1 inner join t2 on t1.c1=t2.c2 where t1.c2=t2.c1;
<=>
select * from t1 inner join t2 on t1.c1=t2.c2 and t1.c2=t2.c1;

select * from t1 inner join t2 on t1.c1=t2.c2 where t1.c2=10 or t2.c1=10;
<=>
select * from t1 inner join t2 on t1.c1=t2.c2 and (t1.c2=10 or t2.c1=10);
join on:
select * from t1 inner join t2 on t1.c1=t2.c2 and t1.c2=10 and t2.c1=10;
<=>
select * from (select * from t1 where t1.c2=10) a inner join (select * from t2 where t1.c1=10) b on a.c1=b.c2;
having:
select t1.c2,t2.c1,sum(t1.c3) from t1 inner join t2 on  t1.c1=t2.c2 group by t1.c2,t2.c1 having t1.c2=10 and t2.c1=10 and sum(t1.c3)>0;
<=>
select t1.c2,t2.c1,sum(t1.c3) from  (select * from t1 where t1.c2=10) a inner join (select * from t2 where t1.c1=10) b on a.c1=b.c2 group by t1.c2,t2.c1 having sum(t1.c3)>0;

(II) for left join:

  • nullable side: the inner table of left jon

  • nonullable side: the outer table of left join

  • strict predicate: input NULL output NULL/false, eg. c2 is not null, c2<10

  • not strict predicate: input NULL output other/true, eg. c2 is null,COALESCE(c2, 0)

where:

-- can push down strict predicate on where to nullable side table level
-- and push down strict and not strict predicate on where to nonullable side table level
select * from t1 left join t5 on t1.c1=t5.c2 where t5.c3 is not null;
<=>
select * from t1 inner join (select * from t5 where t5.c3 is not null)  b on t1.c1=b.c2 ;

-- can push strict predicate to join on
select * from t1 left join t2 on t1.c1=t2.c2 where t1.c2=t2.c1;
<=>
select * from t1 inner join t2 on t1.c1=t2.c2 and t1.c2=t2.c1;

select * from t1 left join t5 on t1.c1=t5.c2 where t1.c3 is not null;
<=>
select * from (select * from t1 where t1.c3 is not null) a  left join t5 on a.c1=t5.c2 ;

select * from t1 left join t5 on t1.c1=t5.c2 where t1.c3 is null;
<=>
select * from (select * from t1 where t1.c3 is null) a  left join t5 on a.c1=t5.c2 ;

-- can't push down no strict predicate to nullable side
select * from t1 left join t5 on t1.c1=t5.c2 where t5.c3 is null;
select t5.c2 from t1 left join t5 on t1.c1=t5.c2 where  COALESCE(t5.c2, 2) > 1;

join on:
-- can  push down strict and not strict predicate on join level to nullable side base table level 
select * from t1 left join t2 on t1.c1=t2.c2 and t2.c3 <3;
<=>
select * from t1 left join (select * from t2 where t2.c3 <3) b on t1.c1=b.c2;

select * from t1 left join t2 on t1.c1=t2.c2 and t2.c3 is null;
<=>
select * from t1 left join (select * from t2 where t2.c3 <3) b on t1.c1=b.c2;

--can't push down predicate to nonullable side 
select * from t1 left join t2 on t1.c1=t2.c2 and t1.c3 is null;
select * from t1 left join t2 on t1.c1=t2.c2 and t1.c3 <10;
select * from t1 left join t2 on t1.c1=t2.c2 left join t3 on t1.c1=t3.c2 and t2.c3 <10;

-- having:(same as where)
-- can push strict and not strict predicate on having to nonullable side table level
-- can push strict predicate on having to nullable side table level
select t1.c2,t1.c3,count(t2.c1) c from t1 left join t2 on t1.c1=t2.c2 group by t1.c2,t1.c3 having t1.c2 <10 and t1.c3 <10 and count(t2.c1) >=1;
<=>
select a.c2,a.c3,count(t2.c1) c from (select * from t1 where t1.c2 <10 and t1.c3 <10) a left join t2 on a.c1=t2.c2 group by a.c2,a.c3 having count(t2.c1) >=1;

select t1.c2,t1.c3,count(t2.c1) c from t1 left join t2 on t1.c1=t2.c2 group by t1.c2,t1.c3 having t1.c2 is null and count(t2.c1) >=1;
<=>
select a.c2,a.c3,count(t2.c1) c from (select * from t1 where t1.c2 is null) a left join t2 on a.c1=t2.c2 group by a.c2,a.c3 having count(t2.c1) >=1;

select t2.c2,t2.c3,count(t2.c1) c from t1 left join t2 on t1.c1=t2.c2 group by t2.c2,t2.c3 having t2.c2 <10 and t2.c3 <10 and count(t2.c1) >=1;
<=>
select b.c2,b.c3,count(b.c1) c from t1 inner join (select * from t2 where t2.c2 <10 and t2.c3 <10) b on t1.c1=b.c2 group by b.c2,b.c3 having count(t2.c1) >=1;

-- can't push down no strict predicate to nullable side
select t2.c2,t2.c3,count(t2.c1) c from t1 left join t2 on t1.c1=t2.c2 group by t2.c2,t2.c3 having t2.c3 is null and count(t2.c1) >=1;

@zhexuany zhexuany added the sig/planner SIG: Planner label Sep 3, 2018
@tianjiqx
Copy link
Contributor Author

tianjiqx commented Sep 6, 2018

I analyzed the process of predicates push down in TiDB and then tried to improve the outer join elimination.It seems to work, hopefully bug free.

I didn't change too much code and only did two things.

  • One is to adjust the order in which the OuterJoin of the LogicalJoin and its child nodes is eliminated.
  • The other is a null-rejected expression on the join condition column generated for the inner join.
func simplifyOuterJoin(p *LogicalJoin, predicates []expression.Expression) {
	if p.JoinType != LeftOuterJoin && p.JoinType != RightOuterJoin && p.JoinType != InnerJoin {
		return
	}

	innerTable := p.children[0]
	outerTable := p.children[1]
	if p.JoinType == LeftOuterJoin {
		innerTable, outerTable = outerTable, innerTable
	}

	var fullConditions []expression.Expression

	// first simplify embedded outer join.
	// When trying to simplify an embedded outer join operation in a query,
	// we must take into account the join condition for the embedding outer join together with the WHERE condition.
	if innerPlan, ok := innerTable.(*LogicalJoin); ok {
		fullConditions = concatOnAndWhereConds(p, predicates)
		simplifyOuterJoin(innerPlan, fullConditions)
	}
	//del by qx [simplify outerJoin] 20180906 :b
	/*
		if outerPlan, ok := outerTable.(*LogicalJoin); ok {
			if fullConditions != nil {
				fullConditions = concatOnAndWhereConds(p, predicates)
			}
			simplifyOuterJoin(outerPlan, fullConditions)
		}
	*/
	//del :e

	if p.JoinType == InnerJoin {
                // also can generate join column is not null condition,omitted // add by qx 20180906
		return
	}
	// then simplify embedding outer join.
	canBeSimplified := false
	for _, expr := range predicates {
		isOk := isNullRejected(p.ctx, innerTable.Schema(), expr)
		if isOk {
			canBeSimplified = true
			break
		}
	}
	//add by qx [simplify outerJoin] 20180906:b
	var nullExprs = make([]expression.Expression, 0, 0)
	//add :e
	if canBeSimplified {
		p.JoinType = InnerJoin
		//add by qx [simplify outerJoin] 20180906:b
		//generate join column is not null predicate
		for _, expr := range p.EqualConditions {
			for _, arg := range expr.GetArgs() {
				col, ok := arg.(*expression.Column)
				if ok {
					args := make([]expression.Expression, 2)
					args[0] = col
					args[1] = expression.Null
					nullExpr := expression.NewFunctionInternal(p.ctx, "ne", types.NewFieldType(mysql.TypeTiny), args...)
					nullExprs = append(nullExprs, nullExpr)
					//predicates = append(predicates,nullExpr) // can't work
				}
			}
		}
		//add :e
	}
	//add by qx [simplify outerJoin] 20180906:b
	if outerPlan, ok := outerTable.(*LogicalJoin); ok {
		if fullConditions != nil {
			fullConditions = concatOnAndWhereConds(p, predicates)
		}

		if len(nullExprs) != 0 {
			if fullConditions != nil {
				fullConditions = append(fullConditions, nullExprs...)
			} else {
				fullConditions = nullExprs
			}
		}
		simplifyOuterJoin(outerPlan, fullConditions)
	}
	//add :e
}

Results:

ubuntu_64_pc1-2018-09-06-08-28-07

What I have changed is not perfect.

  • Still does not support join condition on where. eg.select * from t1 left join t2 on t1.c1=t2.c2 where t1.c3=t2.c3;
  • The null-rejected predicates generated is not pushed down to the lower level, only for outer join elimination. (Expection: select * from t1 left join t2 on t1.c1=t2.c2 where t2.c3=3; => select * from (select * from t1 where t1.c1 is not null ) a left join (select * fromr t2 where t2.c2 is not null and t2.c3=3) b on b.c1=b.c2;)
  • Not familiar with golang and tidb, the code is slightly ugly.

@eurekaka
Copy link
Contributor

eurekaka commented Sep 6, 2018

@tianjiqx Could you please open a pull request for your code change? GitHub issue is not that friendly for following code changes. Thanks.

@tianjiqx
Copy link
Contributor Author

tianjiqx commented Sep 6, 2018

@eurekaka
I saw that the task has been assigned to @ zhexuany, so I just provide suggestions without opening a pull request.

@zz-jason
Copy link
Member

zz-jason commented Sep 6, 2018

@tianjiqx The task is not assigned to @zhexuany, please feel free to contribute 😂

@LittleFall
Copy link
Contributor

It seems that these examples have been resolved in the current version.

MySQL [test]> select tidb_version();
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------+
| tidb_version()
                                                                                   |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------+
| Release Version: v4.0.0-beta.2-135-g2a31878c5
Git Commit Hash: 2a31878c530c050775a67ca8beb1c819e80c1764
Git Branch: master
UTC Build Time: 2020-03-31 07:58:52
GoVersion: go1.14.1
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

MySQL [test]>
MySQL [test]> -- example 1
MySQL [test]> explain select t1.c1, t2.c1, t3.c1 from t1 left join t2 on t1.c1=t2.c2 left join t3 on t2.c1=t3.c2 where t3.c3<5;
+--------------------------------+---------+-----------+---------------+------------------------------------------------+
| id                             | estRows | task      | access object | operator info                                  |
+--------------------------------+---------+-----------+---------------+------------------------------------------------+
| HashJoin_9                     | 1.56    | root      |               | inner join, equal:[eq(test.t2.c1, test.t3.c2)] |
| ├─TableReader_29(Build)        | 1.00    | root      |               | data:Selection_28                              |
| │ └─Selection_28               | 1.00    | cop[tikv] |               | lt(test.t3.c3, 5), not(isnull(test.t3.c2))     |
| │   └─TableFullScan_27         | 3.00    | cop[tikv] | table:t3      | keep order:false, stats:pseudo                 |
| └─HashJoin_20(Probe)           | 5.00    | root      |               | inner join, equal:[eq(test.t1.c1, test.t2.c2)] |
|   ├─TableReader_24(Build)      | 4.00    | root      |               | data:Selection_23                              |
|   │ └─Selection_23             | 4.00    | cop[tikv] |               | not(isnull(test.t2.c2))                        |
|   │   └─TableFullScan_22       | 4.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                 |
|   └─TableReader_26(Probe)      | 5.00    | root      |               | data:TableFullScan_25                          |
|     └─TableFullScan_25         | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                 |
+--------------------------------+---------+-----------+---------------+------------------------------------------------+
10 rows in set (0.00 sec)

MySQL [test]>
MySQL [test]> -- example 2
MySQL [test]> explain select t1.c1,t3.c1,t4.c1 from t1 left join t3 on t1.c1=t3.c2 left join t4 on t1.c1=t4.c2 where t4.c3<5;
+--------------------------------+---------+-----------+---------------+-----------------------------------------------------+
| id                             | estRows | task      | access object | operator info                                       |
+--------------------------------+---------+-----------+---------------+-----------------------------------------------------+
| HashJoin_9                     | 1.66    | root      |               | inner join, equal:[eq(test.t1.c1, test.t4.c2)]      |
| ├─TableReader_20(Build)        | 1.33    | root      |               | data:Selection_19                                   |
| │ └─Selection_19               | 1.33    | cop[tikv] |               | lt(test.t4.c3, 5), not(isnull(test.t4.c2))          |
| │   └─TableFullScan_18         | 4.00    | cop[tikv] | table:t4      | keep order:false, stats:pseudo                      |
| └─HashJoin_11(Probe)           | 5.00    | root      |               | left outer join, equal:[eq(test.t1.c1, test.t3.c2)] |
|   ├─TableReader_17(Build)      | 3.00    | root      |               | data:Selection_16                                   |
|   │ └─Selection_16             | 3.00    | cop[tikv] |               | not(isnull(test.t3.c2))                             |
|   │   └─TableFullScan_15       | 3.00    | cop[tikv] | table:t3      | keep order:false, stats:pseudo                      |
|   └─TableReader_14(Probe)      | 5.00    | root      |               | data:TableFullScan_13                               |
|     └─TableFullScan_13         | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                      |
+--------------------------------+---------+-----------+---------------+-----------------------------------------------------+
10 rows in set (0.00 sec)

MySQL [test]>
MySQL [test]> -- example 3
MySQL [test]> explain select t1.c1, t3.* from t1 left join t3 on t1.c1=t3.c2 inner join t5 on t3.c1=t5.c2;
+------------------------------------+---------+-----------+---------------+------------------------------------------------+
| id                                 | estRows | task      | access object | operator info                                  |
+------------------------------------+---------+-----------+---------------+------------------------------------------------+
| Projection_10                      | 4.68    | root      |               | test.t1.c1, test.t3.c1, test.t3.c2, test.t3.c3 |
| └─HashJoin_21                      | 4.68    | root      |               | inner join, equal:[eq(test.t3.c2, test.t1.c1)] |
|   ├─HashJoin_34(Build)             | 3.75    | root      |               | inner join, equal:[eq(test.t3.c1, test.t5.c2)] |
|   │ ├─TableReader_40(Build)        | 3.00    | root      |               | data:Selection_39                              |
|   │ │ └─Selection_39               | 3.00    | cop[tikv] |               | not(isnull(test.t3.c2))                        |
|   │ │   └─TableFullScan_38         | 3.00    | cop[tikv] | table:t3      | keep order:false, stats:pseudo                 |
|   │ └─TableReader_37(Probe)        | 5.00    | root      |               | data:Selection_36                              |
|   │   └─Selection_36               | 5.00    | cop[tikv] |               | not(isnull(test.t5.c2))                        |
|   │     └─TableFullScan_35         | 5.00    | cop[tikv] | table:t5      | keep order:false, stats:pseudo                 |
|   └─TableReader_42(Probe)          | 5.00    | root      |               | data:TableFullScan_41                          |
|     └─TableFullScan_41             | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                 |
+------------------------------------+---------+-----------+---------------+------------------------------------------------+
11 rows in set (0.00 sec)

MySQL [test]>
MySQL [test]> -- example 4
MySQL [test]> explain select * from t1 left join t2 on t1.c1=t2.c2 where t1.c3=t2.c3;
+------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------+
| id                           | estRows | task      | access object | operator info                                                             |
+------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------+
| HashJoin_18                  | 4.99    | root      |               | inner join, equal:[eq(test.t1.c1, test.t2.c2) eq(test.t1.c3, test.t2.c3)] |
| ├─TableReader_22(Build)      | 3.99    | root      |               | data:Selection_21                                                         |
| │ └─Selection_21             | 3.99    | cop[tikv] |               | not(isnull(test.t2.c2)), not(isnull(test.t2.c3))                          |
| │   └─TableFullScan_20       | 4.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                            |
| └─TableReader_25(Probe)      | 5.00    | root      |               | data:Selection_24                                                         |
|   └─Selection_24             | 5.00    | cop[tikv] |               | not(isnull(test.t1.c3))                                                   |
|     └─TableFullScan_23       | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                            |
+------------------------------+---------+-----------+---------------+---------------------------------------------------------------------------+
7 rows in set (0.00 sec)

MySQL [test]>
MySQL [test]> -- example 5
MySQL [test]> explain select * from (t1 left join t2 on t1.c1=t2.c2) left join (t3 left join t4 on t3.c2=t4.c1) on t2.c2=t3.c2 where t4.c2<10;
+----------------------------------+---------+-----------+---------------+------------------------------------------------------------------------------+
| id                               | estRows | task      | access object | operator info                                                                |
+----------------------------------+---------+-----------+---------------+------------------------------------------------------------------------------+
| HashJoin_11                      | 2.60    | root      |               | inner join, equal:[eq(test.t2.c2, test.t3.c2)]                               |
| ├─IndexMergeJoin_37(Build)       | 1.66    | root      |               | inner join, inner:TableReader_35, outer key:test.t3.c2, inner key:test.t4.c1 |
| │ ├─TableReader_44(Build)        | 3.00    | root      |               | data:Selection_43                                                            |
| │ │ └─Selection_43               | 3.00    | cop[tikv] |               | not(isnull(test.t3.c2))                                                      |
| │ │   └─TableFullScan_42         | 3.00    | cop[tikv] | table:t3      | keep order:false, stats:pseudo                                               |
| │ └─TableReader_35(Probe)        | 0.33    | root      |               | data:Selection_34                                                            |
| │   └─Selection_34               | 0.33    | cop[tikv] |               | lt(test.t4.c2, 10)                                                           |
| │     └─TableRangeScan_33        | 1.00    | cop[tikv] | table:t4      | range: decided by [test.t3.c2], keep order:true, stats:pseudo                |
| └─HashJoin_22(Probe)             | 5.00    | root      |               | inner join, equal:[eq(test.t1.c1, test.t2.c2)]                               |
|   ├─TableReader_26(Build)        | 4.00    | root      |               | data:Selection_25                                                            |
|   │ └─Selection_25               | 4.00    | cop[tikv] |               | not(isnull(test.t2.c2))                                                      |
|   │   └─TableFullScan_24         | 4.00    | cop[tikv] | table:t2      | keep order:false, stats:pseudo                                               |
|   └─TableReader_28(Probe)        | 5.00    | root      |               | data:TableFullScan_27                                                        |
|     └─TableFullScan_27           | 5.00    | cop[tikv] | table:t1      | keep order:false, stats:pseudo                                               |
+----------------------------------+---------+-----------+---------------+------------------------------------------------------------------------------+
14 rows in set (0.00 sec)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/planner SIG: Planner
Projects
None yet
Development

No branches or pull requests

5 participants