feat: support input reordering for `NestedLoopJoinExec` #9676

korowa · 2024-03-18T14:45:50Z

Which issue does this PR close?

Closes #8393.

Rationale for this change

Making plans that contain NestedLoopJoinExec optimizable by:

adding them to the join_selection rule of the physical optimizer
making NLJoin execution "type-agnostic" -- currently the NLJoin build side is chosen based on the logical join type, so reordering join inputs won't help much without this change

What changes are included in this PR?

NestedLoopJoinExec is now covered by the join_selection rule of the physical optimizer.
NestedLoopJoinExec always picks the left input as the build side (which makes it consistent with e.g. the HashJoinExec operator) and reuses the utility functions of other join implementations.
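
Since the build side is always the left input, an optimizer that exchanges the two inputs also has to exchange the join type so that the query keeps its meaning. The following is a minimal sketch of that mapping. The `JoinKind` enum and the `swap_join_kind` function are hypothetical names used for illustration only; they are not DataFusion's own types or functions.

```rust
/// Hypothetical, simplified join-type enum for illustration only;
/// it is not DataFusion's own `JoinType`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinKind {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    RightSemi,
    LeftAnti,
    RightAnti,
}

impl JoinKind {
    /// Every variant, handy for exhaustive checks.
    pub const ALL: [JoinKind; 8] = [
        JoinKind::Inner,
        JoinKind::Left,
        JoinKind::Right,
        JoinKind::Full,
        JoinKind::LeftSemi,
        JoinKind::RightSemi,
        JoinKind::LeftAnti,
        JoinKind::RightAnti,
    ];
}

/// Returns the join kind that keeps the same semantics once the left and
/// right inputs have traded places.
pub fn swap_join_kind(kind: JoinKind) -> JoinKind {
    match kind {
        // Symmetric joins are unaffected by the order of their inputs.
        JoinKind::Inner => JoinKind::Inner,
        JoinKind::Full => JoinKind::Full,
        // Side-specific joins flip to their mirror image.
        JoinKind::Left => JoinKind::Right,
        JoinKind::Right => JoinKind::Left,
        JoinKind::LeftSemi => JoinKind::RightSemi,
        JoinKind::RightSemi => JoinKind::LeftSemi,
        JoinKind::LeftAnti => JoinKind::RightAnti,
        JoinKind::RightAnti => JoinKind::LeftAnti,
    }
}
```

Swapping twice returns the original kind. Swapping inputs also changes the order of the output columns for joins that emit both sides, so a real optimizer would presumably need to restore the original column order as well; that part is not shown here.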

Are these changes tested?

Added tests for the physical optimizer, and added NestedLoopJoinExec to the join_fuzz tests.

Are there any user-facing changes?

When both inputs have proper statistics, the physical optimizer now picks the build side properly. In addition, it is now possible to disable the join_selection rule and specify the required join order manually.

korowa · 2024-03-18T16:01:16Z

Regarding benchmarks -- this PR affects q11 and q22 in tpch, but the results differ considerably between tpch and tpch_mem (the statistics estimates for tpch_mem differ from those for tpch over parquet files):

tpch_mem

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   master ┃ nl_join_reorder ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 11    │  40.20ms │         60.53ms │  1.51x slower │
│ QQuery 22    │  39.90ms │         48.82ms │  1.22x slower │
└──────────────┴──────────┴─────────────────┴───────────────┘

tpch

┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   master ┃ nl_join_reorder ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 11    │  65.86ms │         37.58ms │ +1.75x faster │
│ QQuery 22    │  69.01ms │         72.40ms │     no change │
└──────────────┴──────────┴─────────────────┴───────────────┘

Dandandan · 2024-03-18T20:23:50Z

let's run /benchmark

github-actions · 2024-03-18T20:34:42Z

Benchmark results

Benchmarks comparing 35ff7a6 (main) and 2da33f4 (PR)

Comparing 35ff7a6 and 2da33f4
--------------------
Benchmark tpch.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  35ff7a6 ┃  2da33f4 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 435.69ms │ 452.72ms │     no change │
│ QQuery 2     │  57.86ms │  60.13ms │     no change │
│ QQuery 3     │ 145.82ms │ 145.12ms │     no change │
│ QQuery 4     │  85.83ms │  88.60ms │     no change │
│ QQuery 5     │ 202.61ms │ 203.92ms │     no change │
│ QQuery 6     │ 108.07ms │ 105.53ms │     no change │
│ QQuery 7     │ 286.26ms │ 281.52ms │     no change │
│ QQuery 8     │ 199.76ms │ 197.78ms │     no change │
│ QQuery 9     │ 306.42ms │ 308.94ms │     no change │
│ QQuery 10    │ 242.59ms │ 239.58ms │     no change │
│ QQuery 11    │  63.12ms │  42.20ms │ +1.50x faster │
│ QQuery 12    │ 123.95ms │ 127.16ms │     no change │
│ QQuery 13    │ 178.68ms │ 178.25ms │     no change │
│ QQuery 14    │ 130.45ms │ 131.45ms │     no change │
│ QQuery 15    │ 194.89ms │ 189.04ms │     no change │
│ QQuery 16    │  51.72ms │  50.79ms │     no change │
│ QQuery 17    │ 328.88ms │ 311.22ms │ +1.06x faster │
│ QQuery 18    │ 454.21ms │ 440.76ms │     no change │
│ QQuery 19    │ 234.51ms │ 232.58ms │     no change │
│ QQuery 20    │ 198.28ms │ 194.71ms │     no change │
│ QQuery 21    │ 329.77ms │ 324.89ms │     no change │
│ QQuery 22    │  53.36ms │  72.41ms │  1.36x slower │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (35ff7a6)   │ 4412.72ms │
│ Total Time (2da33f4)   │ 4379.31ms │
│ Average Time (35ff7a6) │  200.58ms │
│ Average Time (2da33f4) │  199.06ms │
│ Queries Faster         │         2 │
│ Queries Slower         │         1 │
│ Queries with No Change │        19 │
└────────────────────────┴───────────┘

Dandandan · 2024-03-18T20:35:13Z

datafusion/sqllogictest/test_files/tpch/q11.slt.part

+--SortExec: TopK(fetch=10), expr=[value@1 DESC]
+----ProjectionExec: expr=[ps_partkey@0 as ps_partkey, SUM(partsupp.ps_supplycost * partsupp.ps_availqty)@1 as value]
+------NestedLoopJoinExec: join_type=Inner, filter=CAST(SUM(partsupp.ps_supplycost * partsupp.ps_availqty)@0 AS Decimal128(38, 15)) > SUM(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)@1
+--------CoalescePartitionsExec


This does seem to run less in parallel?

korowa · 2024-03-19

Yes, the inner (always left) input must be collected into a single partition, while the outer (always right) input is executed in parallel. In case reordering is not happening due to lacking / misestimated statistics -- it'll also cause lack of parallelism.
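
As a rough illustration of that execution model, the sketch below joins a single, fully collected build side against several probe partitions, one thread per partition. It uses plain `i64` values and standard-library threads instead of record batches and streams, and `nl_join_parallel` is a hypothetical name, so it should be read as a toy model rather than as the operator's actual code.

```rust
use std::thread;

/// Toy nested-loop inner join. `build` is materialized once and shared by
/// reference, while every probe partition is processed on its own thread.
/// Returns `(build_value, probe_value)` pairs that satisfy `filter`.
pub fn nl_join_parallel<F>(
    build: &[i64],
    probe_partitions: &[Vec<i64>],
    filter: F,
) -> Vec<(i64, i64)>
where
    F: Fn(i64, i64) -> bool + Sync,
{
    let filter = &filter;
    thread::scope(|s| {
        // One worker per probe partition; all workers read the same build side.
        let handles: Vec<_> = probe_partitions
            .iter()
            .map(|partition| {
                s.spawn(move || {
                    let mut out = Vec::new();
                    for &p in partition {
                        for &b in build {
                            if filter(b, p) {
                                out.push((b, p));
                            }
                        }
                    }
                    out
                })
            })
            .collect();

        // Concatenate the per-partition results in partition order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("probe thread panicked"))
            .collect()
    })
}
```

In this model the degree of parallelism equals the number of probe partitions, which is why placing the larger, partitioned input on the build side gives up parallelism.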

my-vegetable-has-exploded · 2024-03-21

> In case reordering is not happening due to lacking / misestimated statistics -- it'll also cause lack of parallelism.

should_swap_join_order is based on total_byte_size, which becomes Precision::Absent after operators such as AggregateExec and ProjectionExec. I don't know whether it would be a good idea to add a check on num_rows.

korowa · 2024-03-23

It already falls back to num_rows if either join input doesn't provide byte-size statistics.
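
The fallback described in this thread can be sketched as follows. `InputStats` and `should_swap` are hypothetical, simplified stand-ins (with `None` playing the role of an absent estimate); this is not the code of `should_swap_join_order` itself, only an illustration of the decision order discussed above.

```rust
/// Hypothetical, simplified statistics for one join input.
/// `None` stands for an absent estimate.
#[derive(Debug, Clone, Copy, Default)]
pub struct InputStats {
    pub total_byte_size: Option<usize>,
    pub num_rows: Option<usize>,
}

/// Decides whether the inputs should trade places so that the smaller one
/// ends up on the left (build) side.
pub fn should_swap(left: InputStats, right: InputStats) -> bool {
    match (left.total_byte_size, right.total_byte_size) {
        // Preferred signal: byte sizes, when both inputs report them.
        (Some(l), Some(r)) => l > r,
        // Otherwise fall back to row counts, when both inputs report them.
        _ => match (left.num_rows, right.num_rows) {
            (Some(l), Some(r)) => l > r,
            // With no comparable statistics, keep the inputs as written.
            _ => false,
        },
    }
}
```

Under this scheme an input whose statistics are entirely absent is never reordered, which matches the q11 and q22 observations in this thread.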

Dandandan · 2024-03-18T20:40:11Z

datafusion/sqllogictest/test_files/tpch/q22.slt.part

--------------------HashJoinExec: mode=Partitioned, join_type=LeftAnti, on=[(c_custkey@0, o_custkey@0)], projection=[c_phone@1, c_acctbal@2]
+----------------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
+------------------NestedLoopJoinExec: join_type=Inner, filter=CAST(c_acctbal@0 AS Decimal128(19, 6)) > AVG(customer.c_acctbal)@1
+--------------------CoalescePartitionsExec


Here as well. Before, we had no CoalescePartitionsExec here, so from the plan it looks like this join used more parallelism before?

korowa · 2024-03-19

Same as above -- the reason is that currently the result of the final aggregation (estimated as a single row) cannot be reordered with the result of the partitioned aggregation (absent statistics).

Dandandan · 2024-03-18T20:42:30Z

QQuery 22

Here, QQuery 22 seems to run slower (1.36x slower) for tpch (non-memory) as well.

korowa · 2024-03-19T19:40:13Z

> Here, QQuery 22 seems to run slower (1.36x slower) for tpch (non-memory) as well.

🤔 I'll recheck -- I obtained my results from an older version, and maybe the statistics output has been affected by some commit since then.

korowa · 2024-03-20T20:44:49Z

> I'll recheck

Well, it seems that I screwed up while benchmarking locally: there is no significant runtime diff, and the results are consistent with the ones produced by the GH action.

Dandandan · 2024-03-27T22:24:31Z

/benchmark

korowa · 2024-03-31T16:00:27Z

Update: after merging master, which now includes Semi/Anti join statistics, running tpch (parquet) no longer produces performance regressions.

metesynnada · 2024-04-05T11:36:44Z

/benchmark

github-actions · 2024-04-05T12:03:31Z

Benchmark results

Benchmarks comparing 2dad904 (main) and bdd5905 (PR)

Comparing 2dad904 and bdd5905
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  2dad904 ┃  bdd5905 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 289.48ms │ 291.06ms │     no change │
│ QQuery 2     │  41.13ms │  42.37ms │     no change │
│ QQuery 3     │  59.08ms │  61.88ms │     no change │
│ QQuery 4     │  77.32ms │  82.71ms │  1.07x slower │
│ QQuery 5     │ 128.18ms │ 109.04ms │ +1.18x faster │
│ QQuery 6     │  21.50ms │  16.55ms │ +1.30x faster │
│ QQuery 7     │ 232.67ms │ 250.17ms │  1.08x slower │
│ QQuery 8     │  44.14ms │  46.56ms │  1.05x slower │
│ QQuery 9     │ 123.62ms │ 126.51ms │     no change │
│ QQuery 10    │ 116.73ms │ 114.69ms │     no change │
│ QQuery 11    │  46.64ms │  76.38ms │  1.64x slower │
│ QQuery 12    │  60.57ms │  61.19ms │     no change │
│ QQuery 13    │ 104.27ms │ 117.77ms │  1.13x slower │
│ QQuery 14    │  19.37ms │  19.86ms │     no change │
│ QQuery 15    │  31.96ms │  32.22ms │     no change │
│ QQuery 16    │  49.26ms │  47.73ms │     no change │
│ QQuery 17    │ 143.90ms │ 148.92ms │     no change │
│ QQuery 18    │ 575.82ms │ 599.09ms │     no change │
│ QQuery 19    │  65.20ms │  66.44ms │     no change │
│ QQuery 20    │ 118.11ms │ 131.08ms │  1.11x slower │
│ QQuery 21    │ 351.78ms │ 339.24ms │     no change │
│ QQuery 22    │  39.74ms │  30.63ms │ +1.30x faster │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (2dad904)   │ 2740.46ms │
│ Total Time (bdd5905)   │ 2812.10ms │
│ Average Time (2dad904) │  124.57ms │
│ Average Time (bdd5905) │  127.82ms │
│ Queries Faster         │         3 │
│ Queries Slower         │         6 │
│ Queries with No Change │        13 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  2dad904 ┃  bdd5905 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 435.28ms │ 433.46ms │     no change │
│ QQuery 2     │  57.60ms │  60.11ms │     no change │
│ QQuery 3     │ 145.37ms │ 145.74ms │     no change │
│ QQuery 4     │  91.09ms │  90.39ms │     no change │
│ QQuery 5     │ 204.11ms │ 205.20ms │     no change │
│ QQuery 6     │ 108.91ms │ 108.76ms │     no change │
│ QQuery 7     │ 279.90ms │ 300.12ms │  1.07x slower │
│ QQuery 8     │ 195.70ms │ 202.59ms │     no change │
│ QQuery 9     │ 302.57ms │ 295.42ms │     no change │
│ QQuery 10    │ 234.85ms │ 242.60ms │     no change │
│ QQuery 11    │  63.57ms │  41.70ms │ +1.52x faster │
│ QQuery 12    │ 128.23ms │ 126.32ms │     no change │
│ QQuery 13    │ 182.22ms │ 187.13ms │     no change │
│ QQuery 14    │ 128.07ms │ 129.07ms │     no change │
│ QQuery 15    │ 194.21ms │ 197.92ms │     no change │
│ QQuery 16    │  50.48ms │  53.30ms │  1.06x slower │
│ QQuery 17    │ 308.33ms │ 319.43ms │     no change │
│ QQuery 18    │ 459.37ms │ 473.85ms │     no change │
│ QQuery 19    │ 235.09ms │ 232.97ms │     no change │
│ QQuery 20    │ 199.36ms │ 201.44ms │     no change │
│ QQuery 21    │ 335.03ms │ 335.01ms │     no change │
│ QQuery 22    │  55.36ms │  44.52ms │ +1.24x faster │
└──────────────┴──────────┴──────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (2dad904)   │ 4394.71ms │
│ Total Time (bdd5905)   │ 4427.03ms │
│ Average Time (2dad904) │  199.76ms │
│ Average Time (bdd5905) │  201.23ms │
│ Queries Faster         │         2 │
│ Queries Slower         │         2 │
│ Queries with No Change │        18 │
└────────────────────────┴───────────┘
--------------------
Benchmark tpch_sf10.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   2dad904 ┃   bdd5905 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 4261.78ms │ 4284.23ms │     no change │
│ QQuery 2     │  494.79ms │  537.10ms │  1.09x slower │
│ QQuery 3     │ 1750.07ms │ 1735.21ms │     no change │
│ QQuery 4     │  838.70ms │  832.06ms │     no change │
│ QQuery 5     │ 2240.24ms │ 2283.63ms │     no change │
│ QQuery 6     │ 1049.19ms │ 1046.87ms │     no change │
│ QQuery 7     │ 3777.45ms │ 3895.30ms │     no change │
│ QQuery 8     │ 2523.06ms │ 2522.64ms │     no change │
│ QQuery 9     │ 4261.89ms │ 4301.60ms │     no change │
│ QQuery 10    │ 2591.54ms │ 2610.27ms │     no change │
│ QQuery 11    │  572.57ms │  349.80ms │ +1.64x faster │
│ QQuery 12    │ 1221.10ms │ 1213.01ms │     no change │
│ QQuery 13    │ 2383.38ms │ 2394.47ms │     no change │
│ QQuery 14    │ 1286.08ms │ 1294.95ms │     no change │
│ QQuery 15    │ 1986.21ms │ 1981.72ms │     no change │
│ QQuery 16    │  520.46ms │  529.13ms │     no change │
│ QQuery 17    │ 5259.84ms │ 5349.47ms │     no change │
│ QQuery 18    │ 6952.12ms │ 7205.29ms │     no change │
│ QQuery 19    │ 2252.07ms │ 2239.19ms │     no change │
│ QQuery 20    │ 2642.42ms │ 2671.71ms │     no change │
│ QQuery 21    │ 4695.84ms │ 4525.04ms │     no change │
│ QQuery 22    │  573.27ms │  471.46ms │ +1.22x faster │
└──────────────┴───────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (2dad904)   │ 54134.08ms │
│ Total Time (bdd5905)   │ 54274.15ms │
│ Average Time (2dad904) │  2460.64ms │
│ Average Time (bdd5905) │  2467.01ms │
│ Queries Faster         │          2 │
│ Queries Slower         │          1 │
│ Queries with No Change │         19 │
└────────────────────────┴────────────┘

alamb · 2024-04-13T13:38:15Z

What is the current status of this PR? Is it ready to go?

korowa · 2024-04-14T15:30:16Z

> What is the current status of this PR? Is it ready to go?

Join behavior is now consistent with HJ, and it doesn't introduce any performance regressions for tpch. The only issue is incorrect left/right placement in tpch_mem q11, caused by the difference between the statistics of parquet and memory tables -- I don't think it is relevant to this PR, so I consider this PR "ready for review".

(Some clarification regarding "I don't think it is relevant to this PR": even when inputs are misplaced due to incorrect/absent statistics, this PR gives the option to disable the optimizer rule and specify join inputs as required. This option is not available in the current NLJ implementation, as the build side is picked based on the logical join type.)

alamb · 2024-04-17T17:19:15Z

Ok, thanks @korowa -- I will try and find time to review it over the next day or so

alamb

Thank you very much @korowa -- this looks like a very nice improvement to NestedLoopJoinExec. I am sorry it took so long to find time to review it (I should have known reviewing this would be straightforward given your past history of writing well documented and reviewed code 🙏 )

I think the unit test (big_col > small_col vs big_col > big_col) is worth double checking but otherwise I think this PR is good to go.

Thanks again

datafusion/core/src/physical_optimizer/join_selection.rs

alamb · 2024-04-21T16:13:15Z

datafusion/core/tests/fuzz_cases/join_fuzz.rs

@@ -73,7 +79,7 @@ async fn test_full_join_1k() {
 }

 #[tokio::test]
-async fn test_semi_join_1k() {
+async fn test_semi_join_10k() {


This is a nice drive-by cleanup.

datafusion/physical-plan/src/joins/nested_loop_join.rs

alamb · 2024-04-21T16:24:54Z

datafusion/physical-plan/src/joins/nested_loop_join.rs

+/// This step is also executed in parallel (once per probe input partition), and to avoid
+/// duplicate output of unmatched data (due to shared nature build-side data), each thread
+/// "reports" about probe phase completion (which means that "visited" bitmap won't be
+/// updated anymore), and only the last thread, reporting about completion, will return output.


Thank you for the documentation updates 👍
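
The "last thread reports" scheme from the doc comment above can be illustrated with a shared atomic counter. The sketch below is a simplified model under several assumptions: rows are plain `i64` values, the join condition is equality, and `BuildSide`, `probe`, and `unmatched_build_rows` are hypothetical names rather than the operator's actual API.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

/// Shared build-side state: the rows, one "visited" flag per row, and the
/// number of probe partitions that have not finished yet.
pub struct BuildSide {
    rows: Vec<i64>,
    visited: Vec<AtomicBool>,
    remaining_probes: AtomicUsize,
}

impl BuildSide {
    pub fn new(rows: Vec<i64>, probe_partitions: usize) -> Self {
        let visited = rows.iter().map(|_| AtomicBool::new(false)).collect();
        Self {
            rows,
            visited,
            remaining_probes: AtomicUsize::new(probe_partitions),
        }
    }

    /// Probes one partition, marking matched build rows as visited.
    /// Only the call that finishes last returns the unmatched build rows.
    pub fn probe(&self, partition: &[i64]) -> Option<Vec<i64>> {
        for &p in partition {
            for (i, &b) in self.rows.iter().enumerate() {
                if b == p {
                    self.visited[i].store(true, Ordering::Relaxed);
                }
            }
        }
        // `fetch_sub` returns the previous value, so 1 means "I was the last".
        // AcqRel makes the flags set by earlier finishers visible here.
        if self.remaining_probes.fetch_sub(1, Ordering::AcqRel) == 1 {
            let unmatched = self
                .rows
                .iter()
                .zip(&self.visited)
                .filter(|(_, v)| !v.load(Ordering::Relaxed))
                .map(|(&r, _)| r)
                .collect();
            Some(unmatched)
        } else {
            None
        }
    }
}

/// Runs one probe thread per partition and gathers whatever the threads
/// report; at most one of them reports unmatched rows.
pub fn unmatched_build_rows(rows: Vec<i64>, partitions: &[Vec<i64>]) -> Vec<Vec<i64>> {
    let build = BuildSide::new(rows, partitions.len());
    let build = &build;
    thread::scope(|s| {
        let handles: Vec<_> = partitions
            .iter()
            .map(|p| s.spawn(move || build.probe(p)))
            .collect();
        handles
            .into_iter()
            .filter_map(|h| h.join().expect("probe thread panicked"))
            .collect()
    })
}
```

Because the counter reaches zero exactly once, the unmatched build rows are emitted a single time regardless of how many probe partitions share the build side.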

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

alamb · 2024-04-22T18:38:13Z

Thanks again @korowa

Dandandan · 2024-04-22T18:48:11Z

Thank you @korowa 🙏

* support input reordering for NestedLoopJoinExec
* renamed variables and struct fields
* fixed nl join filter expression in tests
* Update datafusion/physical-plan/src/joins/nested_loop_join.rs

  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* typo fixed

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

github-actions bot added the "core" (Core DataFusion crate) and "sqllogictest" (SQL Logic Tests (.slt)) labels on Mar 18, 2024

korowa force-pushed the nl_join_reorder branch from 37a9099 to 03c778c Compare March 18, 2024 14:54

support input reordering for NestedLoopJoinExec

2da33f4

korowa force-pushed the nl_join_reorder branch from 03c778c to 2da33f4 Compare March 18, 2024 14:59

Dandandan reviewed Mar 18, 2024

This was referenced Mar 23, 2024

fix: duplicate output for HashJoinExec in CollectLeft mode #9757

Merged

Implement semi/anti join output statistics estimation #9800

Merged

Merge remote-tracking branch 'upstream/main' into nl_join_reorder

b04e2a7

Merge remote-tracking branch 'upstream/main' into nl_join_reorder

bdd5905

korowa added 2 commits April 17, 2024 21:49

renamed variables and struct fields

3914e1c

Merge remote-tracking branch 'upstream/main' into nl_join_reorder

06c9f03

alamb approved these changes Apr 21, 2024

korowa and others added 3 commits April 22, 2024 20:25

fixed nl join filter expression in tests

45f170d

Update datafusion/physical-plan/src/joins/nested_loop_join.rs

90a899d

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

typo fixed

cb31fa8

alamb merged commit 8f8e105 into apache:main Apr 22, 2024
23 checks passed

Dandandan mentioned this pull request Apr 22, 2024

Range/inequality joins are slow #8393

Open
