Sort Merge Join: Reduce batch concatenation, use `BatchCoalescer`, new benchmarks (TPC-H Q21 SMJ up to ~4000x faster) #18875

mbutrovich · 2025-11-21T22:27:18Z

Which issue does this PR close?

Closes #18487 (Sort Merge Join is extremely slow on LeftAnti joins).
Will eventually close datafusion-comet#901 (Performance regression after adding support for SMJ with join filter).

Rationale for this change

DataFusion Comet often uses Sort Merge Joins because DataFusion does not have a larger-than-memory Hash Join operator. Performance on TPC-H Q21 is quite bad when run natively, so Comet falls back to Spark for it by default. If you force Comet to use DataFusion's SMJ operator, performance is:

Profiling showed most of the time spent in concat_batches of single-digit rows:

What changes are included in this PR?

Use a BatchCoalescer both internally and to buffer final output. There was also some redundant concatenation of batches for filtered joins. One of the two coalescers made the biggest difference, but I switched both for consistency. Here are Comet results with the changes based on 50.3 (which is where Comet is):

TPC-H SF1 benchmark results are below (PREFER_HASH_JOIN=false ./bench.sh run tpch). I tried to run SF10 TPC-H but it seemed like it was going to take hours on my machine. It ran successfully on this PR.

./bench.sh compare_detail main smj        
Comparing main and smj
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Query        ┃                                           main ┃                               smj ┃           Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │                 44.37 / 48.67 ±4.54 / 55.68 ms │   41.63 / 55.88 ±19.83 / 95.24 ms │     1.15x slower │
│ QQuery 2     │                 45.18 / 47.44 ±2.39 / 51.74 ms │    45.26 / 47.29 ±3.56 / 54.39 ms │        no change │
│ QQuery 3     │                 52.59 / 56.15 ±2.65 / 59.79 ms │    50.93 / 52.39 ±1.35 / 54.46 ms │    +1.07x faster │
│ QQuery 4     │                 33.06 / 34.46 ±0.97 / 35.88 ms │    30.06 / 31.04 ±0.74 / 32.14 ms │    +1.11x faster │
│ QQuery 5     │                 84.50 / 87.63 ±2.06 / 90.58 ms │    78.33 / 80.62 ±2.96 / 86.32 ms │    +1.09x faster │
│ QQuery 6     │                 17.87 / 18.64 ±0.48 / 19.22 ms │    16.14 / 17.54 ±1.12 / 19.55 ms │    +1.06x faster │
│ QQuery 7     │              111.11 / 113.59 ±1.79 / 116.70 ms │ 112.43 / 115.85 ±2.55 / 118.96 ms │        no change │
│ QQuery 8     │                89.84 / 94.59 ±3.34 / 100.15 ms │    92.26 / 94.64 ±2.28 / 97.50 ms │        no change │
│ QQuery 9     │              128.36 / 133.12 ±3.46 / 138.00 ms │ 124.58 / 130.47 ±6.30 / 138.85 ms │        no change │
│ QQuery 10    │                 49.89 / 51.91 ±1.41 / 54.19 ms │    48.55 / 50.43 ±1.82 / 52.92 ms │        no change │
│ QQuery 11    │                 34.19 / 35.30 ±0.59 / 35.84 ms │    32.42 / 34.59 ±1.52 / 36.47 ms │        no change │
│ QQuery 12    │                 36.26 / 38.67 ±2.44 / 42.77 ms │    32.92 / 34.28 ±1.18 / 36.38 ms │    +1.13x faster │
│ QQuery 13    │                 31.32 / 34.13 ±2.29 / 38.22 ms │    28.66 / 29.84 ±1.11 / 31.94 ms │    +1.14x faster │
│ QQuery 14    │                 23.54 / 24.79 ±0.92 / 26.00 ms │    22.48 / 23.45 ±1.03 / 25.44 ms │    +1.06x faster │
│ QQuery 15    │                 26.66 / 27.47 ±0.86 / 29.05 ms │    26.23 / 28.64 ±1.72 / 31.48 ms │        no change │
│ QQuery 16    │                 17.63 / 18.94 ±0.97 / 20.20 ms │    16.82 / 18.11 ±1.33 / 20.60 ms │        no change │
│ QQuery 17    │                 94.36 / 96.41 ±1.62 / 98.44 ms │    91.47 / 93.47 ±1.70 / 96.54 ms │        no change │
│ QQuery 18    │               99.91 / 108.58 ±5.85 / 117.27 ms │ 104.25 / 106.40 ±2.42 / 110.47 ms │        no change │
│ QQuery 19    │                 35.23 / 36.68 ±1.46 / 39.23 ms │    32.98 / 36.03 ±1.88 / 38.57 ms │        no change │
│ QQuery 20    │                 40.66 / 41.84 ±1.20 / 44.05 ms │    38.12 / 39.20 ±0.92 / 40.45 ms │    +1.07x faster │
│ QQuery 21    │ 151142.04 / 246274.24 ±89682.07 / 358766.84 ms │ 216.09 / 218.73 ±2.03 / 221.31 ms │ +1125.94x faster │
│ QQuery 22    │                16.69 / 28.53 ±22.72 / 73.97 ms │    16.72 / 17.39 ±0.78 / 18.86 ms │    +1.64x faster │
└──────────────┴────────────────────────────────────────────────┴───────────────────────────────────┴──────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)      │ 247451.79ms │
│ Total Time (smj)       │   1356.29ms │
│ Average Time (main)    │  11247.81ms │
│ Average Time (smj)     │     61.65ms │
│ Queries Faster         │          10 │
│ Queries Slower         │           1 │
│ Queries with No Change │          11 │
│ Queries with Failure   │           0 │
└────────────────────────┴─────────────┘
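
To make the idea behind the change concrete, here is a minimal, dependency-free sketch of what a batch coalescer does: it accumulates rows from many tiny input batches and only emits a batch once a target size is reached. Everything in it is an assumption made for illustration: `ToyCoalescer`, its method names, and the `Vec<i64>` "batch" type are hypothetical stand-ins, not the arrow-rs `BatchCoalescer` API and not the code in this PR.

```rust
use std::collections::VecDeque;

/// Toy stand-in for a record batch: a single column of i64 values.
type Batch = Vec<i64>;

/// Hypothetical coalescer: buffers rows and emits batches of `target` rows.
pub struct ToyCoalescer {
    target: usize,
    buffer: Batch,
    completed: VecDeque<Batch>,
}

impl ToyCoalescer {
    pub fn new(target: usize) -> Self {
        assert!(target > 0, "target batch size must be positive");
        Self {
            target,
            buffer: Vec::with_capacity(target),
            completed: VecDeque::new(),
        }
    }

    /// Copy the rows of `batch` into the in-progress buffer, moving the
    /// buffer to the completed queue every time it reaches `target` rows.
    pub fn push_batch(&mut self, batch: Batch) {
        for row in batch {
            self.buffer.push(row);
            if self.buffer.len() == self.target {
                let full =
                    std::mem::replace(&mut self.buffer, Vec::with_capacity(self.target));
                self.completed.push_back(full);
            }
        }
    }

    /// Flush a partially filled buffer at end of input.
    pub fn finish(&mut self) {
        if !self.buffer.is_empty() {
            let rest = std::mem::take(&mut self.buffer);
            self.completed.push_back(rest);
        }
    }

    /// Pop the next completed batch, if any.
    pub fn next_completed(&mut self) -> Option<Batch> {
        self.completed.pop_front()
    }
}

/// Feed `n` single-row batches through the coalescer and return the
/// row counts of the batches that come out.
pub fn coalesce_single_rows(n: usize, target: usize) -> Vec<usize> {
    let mut coalescer = ToyCoalescer::new(target);
    for i in 0..n {
        coalescer.push_batch(vec![i as i64]);
    }
    coalescer.finish();
    let mut sizes = Vec::new();
    while let Some(batch) = coalescer.next_completed() {
        sizes.push(batch.len());
    }
    sizes
}
```

With ten single-row inputs and a target of four rows, `coalesce_single_rows(10, 4)` yields batches of 4, 4, and 2 rows, so downstream operators see three batches instead of ten.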

Are these changes tested?

Yes: by the existing Sort Merge Join unit tests, plus a new benchmark.

Are there any user-facing changes?

There should not be.


comphead · 2025-11-21T22:31:17Z

+1168.11x faster

mbutrovich · 2025-11-21T22:45:00Z

I have a bug somewhere that the extended tests expose. I'll try to track it down next week.


mbutrovich · 2025-12-02T15:53:09Z

I think I sorted out the corner-case failures by refactoring a bit. I removed direct member access to the JoinedRecordBatches fields and encapsulated their logic in functions sprinkled with debug_assert checks to make more sense of the control flow. The existing logic also had some redundant concat_batches calls; removing them already improved performance, and the BatchCoalescer makes it even better.
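
As a rough illustration of the refactoring style described in this comment (private fields, access only through methods, invariants checked with `debug_assert`), here is a small sketch. The `JoinedRows` type, its fields, and its methods are hypothetical and much simpler than the real `JoinedRecordBatches`; they only show the pattern.

```rust
/// Hypothetical, simplified stand-in for a joined-output buffer.
/// Fields are private so every mutation goes through a method
/// that re-checks the invariant.
#[derive(Default)]
pub struct JoinedRows {
    rows: Vec<(u32, u32)>,  // (left index, right index) pairs
    filter_mask: Vec<bool>, // one filter result per joined row
}

impl JoinedRows {
    /// Record one joined row together with its filter result.
    pub fn push(&mut self, left: u32, right: u32, passed_filter: bool) {
        self.rows.push((left, right));
        self.filter_mask.push(passed_filter);
        self.check_invariants();
    }

    pub fn len(&self) -> usize {
        self.rows.len()
    }

    pub fn is_empty(&self) -> bool {
        self.rows.is_empty()
    }

    /// Remove all buffered rows and return those that passed the filter.
    pub fn drain_passed(&mut self) -> Vec<(u32, u32)> {
        let mask = std::mem::take(&mut self.filter_mask);
        let passed: Vec<(u32, u32)> = self
            .rows
            .drain(..)
            .zip(mask)
            .filter(|(_, keep)| *keep)
            .map(|(row, _)| row)
            .collect();
        self.check_invariants();
        passed
    }

    /// The two vectors must always describe the same number of rows.
    /// Compiled out in release builds because it uses `debug_assert_eq!`.
    fn check_invariants(&self) {
        debug_assert_eq!(
            self.rows.len(),
            self.filter_mask.len(),
            "rows and filter_mask must stay the same length"
        );
    }
}
```

Because callers cannot touch `rows` or `filter_mask` directly, a broken invariant can only originate inside one of these methods, which is what makes the assertions useful for understanding control flow.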

comphead · 2025-12-02T23:15:10Z

We can probably also test it with #18985 once it is merged.


alamb · 2025-12-03T16:52:28Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing smj (36a73e5) to 9af6858 diff using: tpch
Results will be posted here when complete

alamb · 2025-12-03T16:52:49Z

I started the following on this branch

PREFER_HASH_JOIN=false BENCHMARKS="tpch" ./gh_compare_branch.sh https://github.com/apache/datafusion/pull/18875

I think that will effectively compare the merge join performance of main against this branch.

Omega359 · 2025-12-03T17:55:57Z

This is what I get on my AMD Ryzen 9 machine:

$ ./bench.sh compare_detail upstream_main smj
Comparing upstream_main and smj
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Query        ┃                                 upstream_main ┃                               smj ┃          Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │              91.48 / 109.29 ±9.59 / 119.07 ms │   85.45 / 99.39 ±7.45 / 105.42 ms │   +1.10x faster │
│ QQuery 2     │                78.59 / 80.25 ±2.76 / 85.74 ms │    75.24 / 80.58 ±3.81 / 85.16 ms │       no change │
│ QQuery 3     │               89.52 / 93.74 ±3.71 / 100.51 ms │   91.79 / 94.31 ±3.17 / 100.53 ms │       no change │
│ QQuery 4     │                51.04 / 52.51 ±1.40 / 54.56 ms │    49.81 / 50.59 ±0.67 / 51.38 ms │       no change │
│ QQuery 5     │             151.19 / 154.89 ±4.39 / 163.06 ms │ 151.96 / 159.35 ±4.93 / 165.01 ms │       no change │
│ QQuery 6     │                23.67 / 29.73 ±3.32 / 32.81 ms │    25.87 / 30.65 ±2.59 / 32.65 ms │       no change │
│ QQuery 7     │             209.97 / 214.53 ±3.55 / 220.02 ms │ 213.94 / 219.05 ±7.28 / 233.48 ms │       no change │
│ QQuery 8     │             191.34 / 198.65 ±5.17 / 203.92 ms │ 189.82 / 197.02 ±4.05 / 201.31 ms │       no change │
│ QQuery 9     │             270.53 / 275.98 ±6.12 / 283.91 ms │ 272.41 / 280.72 ±4.92 / 286.95 ms │       no change │
│ QQuery 10    │               92.68 / 96.99 ±4.49 / 103.76 ms │   96.97 / 99.41 ±1.74 / 101.62 ms │       no change │
│ QQuery 11    │                60.48 / 63.12 ±2.32 / 66.97 ms │    60.53 / 63.75 ±1.85 / 65.76 ms │       no change │
│ QQuery 12    │                54.61 / 56.53 ±1.91 / 59.79 ms │    56.49 / 58.06 ±1.46 / 60.02 ms │       no change │
│ QQuery 13    │                48.51 / 51.06 ±1.72 / 53.27 ms │    48.90 / 51.44 ±1.63 / 53.71 ms │       no change │
│ QQuery 14    │                38.22 / 43.60 ±3.26 / 47.31 ms │    42.90 / 44.19 ±0.81 / 45.45 ms │       no change │
│ QQuery 15    │                47.58 / 53.90 ±4.17 / 59.49 ms │    53.77 / 55.22 ±1.27 / 57.09 ms │       no change │
│ QQuery 16    │                31.89 / 33.13 ±0.68 / 33.73 ms │    32.06 / 34.95 ±2.17 / 38.40 ms │    1.05x slower │
│ QQuery 17    │             213.95 / 215.98 ±2.13 / 219.65 ms │ 216.23 / 218.27 ±2.02 / 221.31 ms │       no change │
│ QQuery 18    │             203.46 / 208.98 ±3.20 / 212.19 ms │ 226.90 / 236.01 ±6.11 / 243.46 ms │    1.13x slower │
│ QQuery 19    │                67.12 / 69.08 ±1.12 / 70.18 ms │    68.34 / 71.18 ±1.80 / 73.83 ms │       no change │
│ QQuery 20    │                74.62 / 77.51 ±1.66 / 79.76 ms │    71.51 / 81.10 ±5.82 / 88.99 ms │       no change │
│ QQuery 21    │ 194460.40 / 199334.85 ±6297.51 / 211607.77 ms │ 345.81 / 351.86 ±4.31 / 359.12 ms │ +566.52x faster │
│ QQuery 22    │                28.62 / 33.55 ±6.70 / 46.71 ms │    27.89 / 29.55 ±1.20 / 31.03 ms │   +1.14x faster │
└──────────────┴───────────────────────────────────────────────┴───────────────────────────────────┴─────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (upstream_main)   │ 201547.85ms │
│ Total Time (smj)             │   2606.64ms │
│ Average Time (upstream_main) │   9161.27ms │
│ Average Time (smj)           │    118.48ms │
│ Queries Faster               │           3 │
│ Queries Slower               │           2 │
│ Queries with No Change       │          17 │
│ Queries with Failure         │           0 │
└──────────────────────────────┴─────────────┘

alamb · 2025-12-03T19:49:34Z

FWIW the benchmarks are still running because Q21 took over an hour to run 🤯

Query 21 iteration 0 took 4665980.8 ms and returned 100 rows

rluvaton · 2025-12-03T20:09:46Z

Some of the debug_assert checks are so cheap that I think they should be regular asserts.
For example:

debug_assert_eq!(
    indices.len(),
    indices_len,
    "indices.len() should match indices_len parameter"
);

mbutrovich · 2025-12-03T21:46:56Z

> Some of the debug_assert checks are so cheap that I think they should be regular asserts. For example:
>
> debug_assert_eq!(
>     indices.len(),
>     indices_len,
>     "indices.len() should match indices_len parameter"
> );

I might remove some. They were mostly to help me understand control flow as I was learning the SMJ state machine: I'd try to codify my understanding with debug_asserts as I went, and if I broke something or otherwise changed behavior that I was convinced was an invariant, I'd have good safeguards.
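
For context on the trade-off being discussed, the sketch below contrasts the two macros. `assert_eq!` is always compiled in, while `debug_assert!` is only checked when debug assertions are enabled (by default, non-release builds). The `take_indices` function and its parameters are hypothetical and only loosely echo the snippet quoted above; this is not code from the PR.

```rust
/// Hypothetical helper: gather `values[i]` for each `i` in `indices`.
/// `indices_len` is a redundant length parameter, kept only to mirror
/// the shape of the check quoted in the review comment.
pub fn take_indices(values: &[i64], indices: &[usize], indices_len: usize) -> Vec<i64> {
    // O(1) check: cheap enough to keep in release builds.
    assert_eq!(
        indices.len(),
        indices_len,
        "indices.len() should match indices_len parameter"
    );

    // O(n) check: only evaluated when debug assertions are enabled.
    debug_assert!(
        indices.iter().all(|&i| i < values.len()),
        "every index must be in bounds"
    );

    indices.iter().map(|&i| values[i]).collect()
}
```

A constant-time length comparison costs essentially nothing per call, which is the argument for promoting it to a regular assert, whereas the linear scan is the kind of check that is usually left as a `debug_assert`.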

alamb · 2025-12-03T23:49:01Z

🤖: Benchmark completed

Details

Comparing HEAD and smj
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Query        ┃          HEAD ┃        smj ┃           Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │     230.23 ms │  227.23 ms │        no change │
│ QQuery 2     │     188.11 ms │  188.44 ms │        no change │
│ QQuery 3     │     244.92 ms │  249.80 ms │        no change │
│ QQuery 4     │     174.31 ms │  175.78 ms │        no change │
│ QQuery 5     │     408.15 ms │  402.66 ms │        no change │
│ QQuery 6     │      68.06 ms │   67.13 ms │        no change │
│ QQuery 7     │     488.35 ms │  502.39 ms │        no change │
│ QQuery 8     │     470.92 ms │  477.66 ms │        no change │
│ QQuery 9     │     682.45 ms │  684.11 ms │        no change │
│ QQuery 10    │     241.30 ms │  238.98 ms │        no change │
│ QQuery 11    │     171.28 ms │  168.36 ms │        no change │
│ QQuery 12    │     159.38 ms │  160.41 ms │        no change │
│ QQuery 13    │     264.82 ms │  265.14 ms │        no change │
│ QQuery 14    │      95.66 ms │   91.17 ms │        no change │
│ QQuery 15    │      99.71 ms │   98.49 ms │        no change │
│ QQuery 16    │      70.28 ms │   73.21 ms │        no change │
│ QQuery 17    │     504.18 ms │  501.81 ms │        no change │
│ QQuery 18    │     586.59 ms │  745.85 ms │     1.27x slower │
│ QQuery 19    │     138.05 ms │  152.12 ms │     1.10x slower │
│ QQuery 20    │     180.26 ms │  187.47 ms │        no change │
│ QQuery 21    │ 4422642.01 ms │ 1063.69 ms │ +4157.85x faster │
│ QQuery 22    │     104.78 ms │   99.21 ms │    +1.06x faster │
└──────────────┴───────────────┴────────────┴──────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Total Time (HEAD)      │ 4428213.77ms │
│ Total Time (smj)       │    6821.10ms │
│ Average Time (HEAD)    │  201282.44ms │
│ Average Time (smj)     │     310.05ms │
│ Queries Faster         │            2 │
│ Queries Slower         │            2 │
│ Queries with No Change │           18 │
│ Queries with Failure   │            0 │
└────────────────────────┴──────────────┘

mbutrovich · 2025-12-04T00:13:02Z

│ QQuery 21 │ 4422642.01 ms │ 1063.69 ms │ +4157.85x faster │

My goodness.

comphead · 2025-12-04T02:48:47Z

Small batches are evil. Sorry for the delay: I wanted to check the PR with TPC-DS, but because of the recent regression #19075 I cannot merge it right now.

rluvaton · 2025-12-04T06:58:46Z

Could you please align with main? I just merged a PR that fixed a bug in SMJ and updated the fuzz tests:

Fix: Align sort_merge_join filter output with join schema to fix right-anti panic #18800


Use BatchCoaleser in sort merge join instead of calling coalesce_batc…

cd6433c

…hes on vector of RecordBatches. Add benchmarks, update tests.

github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 21, 2025

Merge branch 'main' into smj

d29fd29

mbutrovich and others added 21 commits November 24, 2025 19:36

Merge branch 'main' into smj

c1b58b9

stash

a655212

Stash with assertions.

4ed5cd4

Stash with assertions.

4364656

encapsulate

7a41fe6

encapsulate

b986fd7

encapsulate

387c882

pre-refactor

efa2996

get rid of confusing output_size

a5c926f

refactor

f725308

refactor

4cc21e8

fix double concat for filtered joins

f6430db

more elided concats

32021cb

remove dead code

2e0f211

passes

37bb875

Merge branch 'main' into smj5

2ac80f6

# Conflicts: # Cargo.lock

comments

8c69056

clippy, comments

67877e6

Remove unused import

e7b94e5

optimize concat_batches call

7c55ad9

Merge branch 'main' into smj

ad583d2

mbutrovich marked this pull request as ready for review December 2, 2025 15:46

mbutrovich requested a review from comphead December 2, 2025 16:38

mbutrovich changed the title ~~Use BatchCoaleser in Sort Merge Join, new benchmarks~~ Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ 1000x faster) Dec 2, 2025

mbutrovich changed the title ~~Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ 1000x faster)~~ Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ ~1000x faster) Dec 2, 2025

fix metrics collection filtered joins

43a945f

mbutrovich changed the title ~~Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ ~1000x faster)~~ Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ ~1000x faster) Dec 2, 2025

mbutrovich and others added 2 commits December 2, 2025 19:32

pass through batches that are batch_size / 2 similar to LimitedBatchC…

6a4e664

…oalescer

Merge branch 'main' into smj

36a73e5

mbutrovich changed the title ~~Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ ~1000x faster)~~ Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~1000x faster) Dec 3, 2025

Merge branch 'main' into smj

1000afa

Dandandan reviewed Dec 4, 2025

datafusion/physical-plan/src/joins/sort_merge_join/stream.rs (outdated, resolved)

mbutrovich and others added 3 commits December 4, 2025 10:20

Address PR feedback.

66ea027

Merge branch 'main' into smj

eb5637e

Remove stray import.

86cbc5c

mbutrovich changed the title ~~Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~1000x faster)~~ Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~4000x faster) Dec 4, 2025
