Pushdown single column predicates from ON join clauses #3578

HuSen8891 · 2022-09-21T18:08:41Z

Which issue does this PR close?

Rationale for this change

optimize join's performance when join has OR clauses in filter or on clause.

examine any OR clauses of join to see if any useful clauses can be extracted and push down to join's rel to filter more rows before join.

for TPCH q19

the logical plan before optimized

    "Explain [plan_type:Utf8, plan:Utf8]",
    "  Projection: #lineitem.l_partkey [l_partkey:Int64]",
    "    Projection: #part.p_size >= Int32(1) AS #part.p_size >= Int32(1)Int32(1)#part.p_size, #lineitem.l_partkey, #lineitem.l_quantity, #part.p_brand, #part.p_size [#part.p_size >= Int32(1)Int32(1)#part.p_size:Boolean;N, l_partkey:Int64, l_quantity:Decimal128(15, 2), p_brand:Utf8, p_size:Int32]",
    "      Filter: #part.p_brand = Utf8(\"Brand#12\") AND #lineitem.l_quantity >= Decimal128(Some(100),15,2) AND #lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND #part.p_size <= Int32(5) OR #part.p_brand = Utf8(\"Brand#23\") AND #lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND #part.p_size <= Int32(10) OR #part.p_brand = Utf8(\"Brand#34\") AND #lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND #part.p_size <= Int32(15) [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
    "        Inner Join: #lineitem.l_partkey = #part.p_partkey [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
    "          TableScan: lineitem projection=[l_partkey, l_quantity] [l_partkey:Int64, l_quantity:Decimal128(15, 2)]",
    "          Filter: #part.p_size >= Int32(1) [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
    "            TableScan: part projection=[p_partkey, p_brand, p_size], partial_filters=[#part.p_size >= Int32(1)] [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",

the logical plan after optimized

    "Explain [plan_type:Utf8, plan:Utf8]",
    "  Projection: #lineitem.l_partkey [l_partkey:Int64]",
    "    Projection: #part.p_size >= Int32(1) AS #part.p_size >= Int32(1)Int32(1)#part.p_size, #lineitem.l_partkey, #lineitem.l_quantity, #part.p_brand, #part.p_size [#part.p_size >= Int32(1)Int32(1)#part.p_size:Boolean;N, l_partkey:Int64, l_quantity:Decimal128(15, 2), p_brand:Utf8, p_size:Int32]",
    "      Filter: #part.p_brand = Utf8(\"Brand#12\") AND #lineitem.l_quantity >= Decimal128(Some(100),15,2) AND #lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND #part.p_size <= Int32(5) OR #part.p_brand = Utf8(\"Brand#23\") AND #lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND #part.p_size <= Int32(10) OR #part.p_brand = Utf8(\"Brand#34\") AND #lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND #part.p_size <= Int32(15) [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
    "        Inner Join: #lineitem.l_partkey = #part.p_partkey [l_partkey:Int64, l_quantity:Decimal128(15, 2), p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
    "          Filter: #lineitem.l_quantity >= Decimal128(Some(100),15,2) AND #lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR #lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR #lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(3000),15,2) [l_partkey:Int64, l_quantity:Decimal128(15, 2)]",
    "            TableScan: lineitem projection=[l_partkey, l_quantity], partial_filters=[#lineitem.l_quantity >= Decimal128(Some(100),15,2) AND #lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR #lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR #lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(3000),15,2)] [l_partkey:Int64, l_quantity:Decimal128(15, 2)]",
    "          Filter: #part.p_size >= Int32(1) AND #part.p_brand = Utf8(\"Brand#12\") AND #part.p_size <= Int32(5) OR #part.p_brand = Utf8(\"Brand#23\") AND #part.p_size <= Int32(10) OR #part.p_brand = Utf8(\"Brand#34\") AND #part.p_size <= Int32(15) [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
    "            TableScan: part projection=[p_partkey, p_brand, p_size], partial_filters=[#part.p_size >= Int32(1), #part.p_brand = Utf8(\"Brand#12\") AND #part.p_size <= Int32(5) OR #part.p_brand = Utf8(\"Brand#23\") AND #part.p_size <= Int32(10) OR #part.p_brand = Utf8(\"Brand#34\") AND #part.p_size <= Int32(15)] [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",

we extract new predicate and push down to join's rel, this predicate filters more rows before join , makes join more effective.

What changes are included in this PR?

add extract OR clasue in datafusion/optimizer/src/filter_push_down.rs.

codecov-commenter · 2022-09-21T20:01:19Z

Codecov Report

Merging #3578 (f65dd47) into master (0a2b0a7) will increase coverage by 0.00%.
The diff coverage is 98.16%.

@@           Coverage Diff           @@
##           master    #3578   +/-   ##
=======================================
  Coverage   85.92%   85.92%           
=======================================
  Files         301      300    -1     
  Lines       56249    56309   +60     
=======================================
+ Hits        48330    48383   +53     
- Misses       7919     7926    +7

Impacted Files	Coverage Δ
datafusion/core/src/execution/context.rs	`79.31% <ø> (-0.03%)`	⬇️
datafusion/core/tests/sql/joins.rs	`99.33% <ø> (ø)`
datafusion/core/tests/sql/predicates.rs	`100.00% <ø> (ø)`
datafusion/core/src/physical_plan/planner.rs	`77.47% <90.90%> (+0.11%)`	⬆️
datafusion/optimizer/src/filter_push_down.rs	`97.76% <98.75%> (+0.09%)`	⬆️
datafusion/core/src/physical_plan/sorts/sort.rs	`93.86% <100.00%> (-0.60%)`	⬇️
...e/src/physical_plan/sorts/sort_preserving_merge.rs	`93.49% <0.00%> (-0.36%)`	⬇️
datafusion/expr/src/logical_plan/plan.rs	`77.92% <0.00%> (-0.33%)`	⬇️
datafusion/expr/src/binary_rule.rs	`84.50% <0.00%> (-0.09%)`	⬇️
... and 3 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

alamb

Thank you @AssHero -- I need to think about this transformation carefully and will review the code in the next day or two

alamb

I don't think this transformation is correct, though I may be mistaken

alamb · 2022-09-24T11:22:32Z

datafusion/optimizer/src/filter_push_down.rs

@@ -248,6 +249,128 @@ fn get_pushable_join_predicates<'a>(
        .unzip()
 }

+// examine OR clause to see if any useful clauses can be extracted and push down.


I don't think this transformation is correct. In particular, I don't think the results will always be the same

Schematically, we have this type of predicate (that is being evaluated during the join)

(A AND B) OR (C AND D)

This transformation proposes adding another (A OR B) clause (evaluated before the join), so effectively

((A AND B) OR (C AND D)) AND (A OR B)

In order to do this transformation, the boolean statements must be equivalent for all inputs.

However, a counter example is

A: false, B: false, C: true, D: true

In this case, the original predicate would be true, but the rewrite would be false

Here is the program I wrote to generate the entire truth table: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=334938478775ba3cd55e7c400ea89b06

This transformation should extract at least one quals from each sub-clauses of OR, else do nothing.

(A AND B) OR (C AND D)

will be transformed to

((A AND B) OR (C AND D)) AND (A OR C)

OR

((A AND B) OR (C AND D)) AND ((A AND B) OR C)

OR

do nothing.

I see -- thanks -- I checked those rewrites and https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3b41b0409c8ecf4df0027f323668e0db they do look good to me

alamb

Thanks @AssHero -- this makes more sense now. I'll try and find some time to review it more carefully over the next day or two

alamb · 2022-09-28T20:39:13Z

datafusion/optimizer/src/filter_push_down.rs

+//          TableScan: projection=[c, d]
+fn extract_or_clauses_for_join(


Suggested change

// TableScan: projection=[c, d]

fn extract_or_clauses_for_join(

// TableScan: projection=[c, d]

//

// In general, predicates of this form:

//

// (A AND B) OR (C AND D)

//

// will be transformed to

//

// ((A AND B) OR (C AND D)) AND (A OR C)

//

// OR

//

// ((A AND B) OR (C AND D)) AND ((A AND B) OR C)

//

// OR

//

// do nothing.

//

fn extract_or_clauses_for_join(

I'll add this comments later. Thanks!

alamb · 2022-09-28T20:41:34Z

datafusion/optimizer/src/filter_push_down.rs

@@ -248,6 +249,128 @@ fn get_pushable_join_predicates<'a>(
        .unzip()
 }

+// examine OR clause to see if any useful clauses can be extracted and push down.


I see -- thanks -- I checked those rewrites and https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=3b41b0409c8ecf4df0027f323668e0db they do look good to me

alamb · 2022-10-07T10:25:18Z

I am sorry for the late review here -- it is on my list. Basically I am struggling to find time enough to sit down and convince myself that this is a correct transformation in all cases (esp with outer joins and nullability), and then also that we want to do this kind of pushdown. I just haven't had the time yet.

HuSen8891 · 2022-10-07T14:32:00Z

I am sorry for the late review here -- it is on my list. Basically I am struggling to find time enough to sit down and convince myself that this is a correct transformation in all cases (esp with outer joins and nullability), and then also that we want to do this kind of pushdown. I just haven't had the time yet.

Thanks!

alamb

Thank you for this PR again @AssHero and I apologize for the late review

I spent a while this afternoon thinking about the transformation it proposes and I have basically convinced myself that it is correct (and quite clever, I want to point out)

I think this optimization might actually slow some plans down (if the filters that are pushed down don't actually filter many rows, they will just consume CPU). I don't have a great suggestion about this

Given the potential for very subtle wrong results with this kind of optimization I am trying to be cautious. I would like to add some additional comments and tests (will point out where inline below) prior to merging this. Maybe we can even add a config option to disable it as a way to further derisk the process.

cc @Dandandan / @andygrove / @avantgardnerio / @xudong963 / @thinkharderdev in case you have some other thoughts.

alamb · 2022-10-12T17:51:35Z

datafusion/core/tests/sql/predicates.rs

-        "          TableScan: lineitem projection=[l_partkey, l_quantity] [l_partkey:Int64, l_quantity:Decimal128(15, 2)]",
-        "          Filter: #part.p_size >= Int32(1) [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
-        "            TableScan: part projection=[p_partkey, p_brand, p_size], partial_filters=[#part.p_size >= Int32(1)] [p_partkey:Int64, p_brand:Utf8, p_size:Int32]",
+        "          Filter: #lineitem.l_quantity >= Decimal128(Some(100),15,2) AND #lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR #lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR #lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND #lineitem.l_quantity <= Decimal128(Some(3000),15,2) [l_partkey:Int64, l_quantity:Decimal128(15, 2)]",


I went through this plan and I agree it seems correct (as in the pushed down filters don't filter out anything that would have passed the original filter)

alamb · 2022-10-12T17:58:18Z

datafusion/core/tests/sql/joins.rs

@@ -1484,7 +1484,8 @@ async fn reduce_left_join_2() -> Result<()> {
        "    Filter: CAST(#t2.t2_int AS Int64) < Int64(10) OR CAST(#t1.t1_int AS Int64) > Int64(2) AND #t2.t2_name != Utf8(\"w\") [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
        "      Inner Join: #t1.t1_id = #t2.t2_id [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
        "        TableScan: t1 projection=[t1_id, t1_name, t1_int] [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
-        "        TableScan: t2 projection=[t2_id, t2_name, t2_int] [t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",
+        "        Filter: CAST(#t2.t2_int AS Int64) < Int64(10) OR #t2.t2_name != Utf8(\"w\") [t2_id:UInt32;N, t2_name:Utf8;N, t2_int:UInt32;N]",


alamb · 2022-10-12T18:02:26Z

datafusion/optimizer/src/filter_push_down.rs

+// to new OR clause as predicate.
+//


I think we need to explain the conditions under which a qual can be extracted as it may not be obvious to someone when they initially look at this.

Suggested change

// to new OR clause as predicate.

//

// to new OR clause as predicate.

//

// A qual is extracted if it it contains (only) common set of column references with the other quals.

I am not sure that is correct

I would also like to see the return type documented here (as in what does the (Vec<Expr>, Vec<HashSet<Column>>) represent? I think it is the extracted quals and their column references but I am not sure

alamb · 2022-10-12T18:02:43Z

datafusion/optimizer/src/filter_push_down.rs

+    (exprs, expr_columns)
+}
+
+// extract qual from OR sub-clause.


Can you please add some additional comments under what conditions the OR clause is extracted? I tried to explain above

sure, I can add more comments for this.

alamb · 2022-10-12T18:05:15Z

I would be willing to help write some more tests / comments for this PR if others think it is a good idea.

Dandandan · 2022-10-12T19:58:40Z

@AssHero do you have some numbers, e.g. how much does it help q19?

@alamb one thing we might consider is to apply the optimization whenever there is an expensive parent node (joins/aggregate) that benefits from more filtering? Anyway, in most cases, filtering should be cheap compared to other operations as long as the expression is not very expensive (for example, no expensive UDF).

Transformation is OK after I thought about it more

HuSen8891 · 2022-10-13T08:54:09Z

@AssHero do you have some numbers, e.g. how much does it help q19?

@alamb one thing we might consider is to apply the optimization whenever there is an expensive parent node (joins/aggregate) that benefits from more filtering? Anyway, in most cases, filtering should be cheap compared to other operations as long as the expression is not very expensive (for example, no expensive UDF).

Before this commit 'Convert more cross joins to inner joins'(https://github.com/apache/arrow-datafusion/pull/3482)， this optimization helps much in q19, and I intruduce this optimization to solve q19's performance problem.
Now q19 is transformed to inner join, and this helps little in q19, but I think this will help other queries with OR clauses.

HuSen8891 · 2022-10-13T09:03:19Z

I think this optimization might actually slow some plans down (if the filters that are pushed down don't actually filter many rows, they will just consume CPU). I don't have a great suggestion about this

I have made some tests on q19 with this optimization on current version, it does not slow down the queries even through it helps little(may consume more CPU).
I think more better way is combined with statistic data(CBO). With the statistic data, we can evaluate the rows filtered by this predicate, then make the decision.

HuSen8891 · 2022-10-13T15:37:21Z

add some comments about extract_or_clause.

alamb

Thanks @AssHero

I think this is good enough and we can adjust / fix this this transformation if we find in the future the predicates are hurting more than helping

Another nice thing about pushing single column predicates down is that they can be applied during the scan as well (e.g. #3463 )

alamb · 2022-10-15T11:16:04Z

I took the liberty of merging this branch from master and resolving the merge conflict in fe582a7

alamb · 2022-10-15T11:16:18Z

Thanks again for sticking with this one @AssHero

alamb · 2022-10-15T11:30:58Z

The newly added TPCH plan benchmarks needed to be updated as well

alamb

cc @avantgardnerio and @andygrove who I know run the tpch benchmarks regularly

I think we should probably run https://github.com/apache/arrow-datafusion/tree/master/benchmarks#benchmark-derived-from-tpc-h before merging this in

(Kudos to @andygrove for adding the tpch plan benchmarks -- they are super helpful)

alamb · 2022-10-15T11:31:40Z

benchmarks/expected-plans/q7.txt

@@ -14,7 +14,9 @@ Sort: shipping.supp_nation ASC NULLS LAST, shipping.cust_nation ASC NULLS LAST,
                        TableScan: lineitem projection=[l_orderkey, l_suppkey, l_extendedprice, l_discount, l_shipdate]
                    TableScan: orders projection=[o_orderkey, o_custkey]
                  TableScan: customer projection=[c_custkey, c_nationkey]
-                SubqueryAlias: n1
+                Filter: n1.n_name = Utf8("FRANCE") OR n1.n_name = Utf8("GERMANY")


👍 these should be fine, though I expect the performance to be negligible as the number of rows in nation are very small

Think so too - although if it stays on the probe side (e.g. the HashBuildProbeOrder rule doesn't apply) - the joins might still quite a bit slower, and the input and output of the joins will be bigger as well making them more expensive.
FYI @andygrove this might change the perf in Ballista too.

alamb · 2022-10-15T11:33:16Z

benchmarks/expected-plans/q19.txt

@@ -3,7 +3,7 @@ Projection: SUM(lineitem.l_extendedprice * Int64(1) - lineitem.l_discount) AS re
    Projection: lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") AS lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")Utf8("DELIVER IN PERSON")lineitem.l_shipinstruct, lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AS lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")])Utf8("AIR REG")Utf8("AIR")lineitem.l_shipmode, part.p_size >= Int32(1) AS part.p_size >= Int32(1)Int32(1)part.p_size, lineitem.l_quantity, lineitem.l_extendedprice, lineitem.l_discount, part.p_brand, part.p_size, part.p_container
      Filter: part.p_brand = Utf8("Brand#12") AND part.p_container IN ([Utf8("SM CASE"), Utf8("SM BOX"), Utf8("SM PACK"), Utf8("SM PKG")]) AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) AND part.p_size <= Int32(5) OR part.p_brand = Utf8("Brand#23") AND part.p_container IN ([Utf8("MED BAG"), Utf8("MED BOX"), Utf8("MED PKG"), Utf8("MED PACK")]) AND lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) AND part.p_size <= Int32(10) OR part.p_brand = Utf8("Brand#34") AND part.p_container IN ([Utf8("LG CASE"), Utf8("LG BOX"), Utf8("LG PACK"), Utf8("LG PKG")]) AND lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2) AND part.p_size <= Int32(15)
        Inner Join: lineitem.l_partkey = part.p_partkey
-          Filter: lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON")
+          Filter: lineitem.l_shipmode IN ([Utf8("AIR"), Utf8("AIR REG")]) AND lineitem.l_shipinstruct = Utf8("DELIVER IN PERSON") AND lineitem.l_quantity >= Decimal128(Some(100),15,2) AND lineitem.l_quantity <= Decimal128(Some(1100),15,2) OR lineitem.l_quantity >= Decimal128(Some(1000),15,2) AND lineitem.l_quantity <= Decimal128(Some(2000),15,2) OR lineitem.l_quantity >= Decimal128(Some(2000),15,2) AND lineitem.l_quantity <= Decimal128(Some(3000),15,2)


These are probably good filters to have added as long as they are selective, though they may be expensive to evaluate because they will be evaluated against all rows in lineitem rather than only those rows which were not filtered out by the join with part.

To make this super fast, maybe could use "sideways information passing" to push down the part filtering into the lineitem scan as well

I am sometimes surprised when profiling how little CPU the filters consume - it is really cheap compared to parquet reads / joins / etc. so almost any filtering we can do probably helps as long as it filters out some data and a subsequent operators can benefit from it.

Dandandan · 2022-10-15T11:56:51Z

@alamb running the benchmarks now for q7 and q19 posting them here when done

Dandandan · 2022-10-15T12:09:40Z

~~Looks like performance regressed a bit :/ we probably have a better look before merging it @alamb~~

Looks like roughly a 1.35x speedup on q7 and a smaller 1.16x speedup on q19 🚀

Running with (on partitioned parquet data - 16 partitions, 16 vCPU):

cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path [path] --format parquet --query [query] --batch-size 8192 -n 16

Master:

Query 7 iteration 0 took 6488.3 ms and returned 4 rows
Query 7 iteration 1 took 6235.8 ms and returned 4 rows
Query 7 iteration 2 took 6293.0 ms and returned 4 rows
Query 7 avg time: 6339.04 ms

Query 19 iteration 0 took 1180.6 ms and returned 1 rows
Query 19 iteration 1 took 1012.3 ms and returned 1 rows
Query 19 iteration 2 took 1017.0 ms and returned 1 rows
Query 19 avg time: 1069.99 ms

This branch:

Query 7 iteration 0 took 4856.0 ms and returned 4 rows
Query 7 iteration 1 took 4645.7 ms and returned 4 rows
Query 7 iteration 2 took 4661.6 ms and returned 4 rows
Query 7 avg time: 4721.10 ms

Query 19 iteration 0 took 1021.7 ms and returned 1 rows
Query 19 iteration 1 took 866.5 ms and returned 1 rows
Query 19 iteration 2 took 872.9 ms and returned 1 rows
Query 19 avg time: 920.37 ms

Dandandan · 2022-10-15T12:27:00Z

Hm getting different results now - may have been something running in the background on my machine - will update the results accordingly

Dandandan · 2022-10-15T12:40:13Z

Ok - updated the benchmark results - looks good to go now!

Dandandan · 2022-10-15T13:47:16Z

Thanks @AssHero @alamb

ursabot · 2022-10-15T13:52:10Z

Benchmark runs are scheduled for baseline = fc5081d and contender = e02376d. e02376d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

alamb · 2022-10-16T10:02:25Z

Nice! Thanks everyone

extract OR clause for join

f65dd47

github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Sep 21, 2022

alamb reviewed Sep 23, 2022

View reviewed changes

alamb previously requested changes Sep 24, 2022

View reviewed changes

alamb reviewed Sep 28, 2022

View reviewed changes

add more comments

162d20b

HuSen8891 requested a review from alamb October 5, 2022 03:17

alamb reviewed Oct 12, 2022

View reviewed changes

add some comments

159980f

alamb approved these changes Oct 15, 2022

View reviewed changes

alamb changed the title ~~extract OR clause for join~~ Pushdown single column predicates from OR clause below joins Oct 15, 2022

alamb changed the title ~~Pushdown single column predicates from OR clause below joins~~ Pushdown single column predicates from ON join clauses Oct 15, 2022

Update text

fe582a7

Update TPCH plans

7a944ef

alamb reviewed Oct 15, 2022

View reviewed changes

Update test plan

4f540ca

Dandandan approved these changes Oct 15, 2022

View reviewed changes

Dandandan merged commit e02376d into apache:master Oct 15, 2022

HuSen8891 deleted the extract_or_clause branch October 17, 2022 03:29

		// TableScan: projection=[c, d]
		fn extract_or_clauses_for_join(

-//          TableScan: projection=[c, d]
-fn extract_or_clauses_for_join(
+//          TableScan: projection=[c, d]
+//
+// In general, predicates of this form:
+//
+// (A AND B) OR (C AND D)
+//
+// will be transformed to
+//
+// ((A AND B) OR (C AND D)) AND (A OR C)
+//
+// OR
+//
+// ((A AND B) OR (C AND D)) AND ((A AND B) OR C)
+//
+// OR
+//
+// do nothing.
+//
+fn extract_or_clauses_for_join(

Pushdown single column predicates from ON join clauses #3578

Pushdown single column predicates from ON join clauses #3578

Conversation

HuSen8891 commented Sep 21, 2022 • edited Loading

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

codecov-commenter commented Sep 21, 2022

Codecov Report

alamb left a comment • edited Loading

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Oct 7, 2022

HuSen8891 commented Oct 7, 2022

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

alamb commented Oct 12, 2022

Dandandan commented Oct 12, 2022

HuSen8891 commented Oct 13, 2022

HuSen8891 commented Oct 13, 2022

HuSen8891 commented Oct 13, 2022

alamb left a comment

Choose a reason for hiding this comment

alamb commented Oct 15, 2022

alamb commented Oct 15, 2022

alamb commented Oct 15, 2022

alamb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan Oct 15, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dandandan commented Oct 15, 2022

Dandandan commented Oct 15, 2022 • edited Loading

Dandandan commented Oct 15, 2022

Dandandan commented Oct 15, 2022

Dandandan commented Oct 15, 2022

ursabot commented Oct 15, 2022

alamb commented Oct 16, 2022

HuSen8891 commented Sep 21, 2022 •

edited

Loading

alamb left a comment •

edited

Loading

Dandandan Oct 15, 2022 •

edited

Loading

Dandandan commented Oct 15, 2022 •

edited

Loading