@mbutrovich
Contributor

@mbutrovich mbutrovich commented Nov 21, 2025

Which issue does this PR close?

Rationale for this change

DataFusion Comet often uses Sort Merge Joins because DataFusion does not have a larger-than-memory Hash Join operator. Performance on TPC-H Q21 is quite bad when the query runs natively, so Comet falls back to Spark by default. If you force Comet to use DataFusion's SMJ operator, performance is:

Screenshot 2025-11-21 at 11 31 18 AM

Profiling showed most of the time spent in concat_batches of single-digit rows:

Screenshot 2025-11-20 at 6 49 20 PM
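
To illustrate one way that concatenating tiny batches can get expensive, here is a minimal sketch. It assumes a batch can be modelled as a plain `Vec<i64>` and that the accumulated output is rebuilt each time a new small batch arrives. `Batch`, `concat_incrementally`, and `concat_once` are hypothetical names, not DataFusion code, and this is not a claim about exactly where the operator spent its time.

```rust
/// Hypothetical stand-in for an Arrow RecordBatch: a single column of values.
type Batch = Vec<i64>;

/// Rebuild the accumulated output every time a new small batch arrives.
/// Returns the result and the total number of rows copied along the way.
fn concat_incrementally(batches: &[Batch]) -> (Batch, usize) {
    let mut acc: Batch = Vec::new();
    let mut rows_copied = 0;
    for b in batches {
        // Fresh allocation each round, as concatenating immutable arrays would require.
        let mut next = Vec::with_capacity(acc.len() + b.len());
        next.extend_from_slice(&acc);
        next.extend_from_slice(b);
        rows_copied += next.len();
        acc = next;
    }
    (acc, rows_copied)
}

/// Concatenate all batches in a single pass; every row is copied exactly once.
fn concat_once(batches: &[Batch]) -> (Batch, usize) {
    let total: usize = batches.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total);
    for b in batches {
        out.extend_from_slice(b);
    }
    (out, total)
}
```

Under this model, 1,000 batches of 4 rows cost 2,002,000 row copies incrementally versus 4,000 in a single pass, so the work grows quadratically with the number of small batches.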

What changes are included in this PR?

Use a BatchCoalescer both internally and to buffer final output. There was also some redundant concatenation of batches for filtered joins, which this PR reduces. A single coalescer made the biggest difference, but I switched to two to be consistent. Here are Comet results with the changes based on DataFusion 50.3 (the version Comet is currently on):

Screenshot 2025-11-21 at 11 43 57 AM
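
For readers unfamiliar with the coalescing idea, a minimal sketch follows. It assumes a batch is just a `Vec<i64>` and a fixed target row count. The `Coalescer` type and its method names are hypothetical and only mimic the general buffering pattern; this is not the Arrow `BatchCoalescer` API.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a RecordBatch: one column of values.
type Batch = Vec<i64>;

/// Buffers incoming rows and releases batches of exactly `target_rows` rows,
/// except for a possibly shorter final batch after `finish`.
struct Coalescer {
    target_rows: usize,
    buffer: Batch,
    completed: VecDeque<Batch>,
}

impl Coalescer {
    fn new(target_rows: usize) -> Self {
        assert!(target_rows > 0, "target_rows must be positive");
        Self {
            target_rows,
            buffer: Vec::with_capacity(target_rows),
            completed: VecDeque::new(),
        }
    }

    /// Append a (possibly tiny) batch; each row is copied once into the buffer.
    fn push(&mut self, batch: &[i64]) {
        let mut rest = batch;
        while !rest.is_empty() {
            let room = self.target_rows - self.buffer.len();
            let take = room.min(rest.len());
            self.buffer.extend_from_slice(&rest[..take]);
            rest = &rest[take..];
            if self.buffer.len() == self.target_rows {
                // Buffer is full: hand it off and start a new one.
                let full = std::mem::replace(
                    &mut self.buffer,
                    Vec::with_capacity(self.target_rows),
                );
                self.completed.push_back(full);
            }
        }
    }

    /// Flush a partially filled buffer at end of input.
    fn finish(&mut self) {
        if !self.buffer.is_empty() {
            self.completed.push_back(std::mem::take(&mut self.buffer));
        }
    }

    /// Take the next ready batch in FIFO order, if any.
    fn next_completed(&mut self) -> Option<Batch> {
        self.completed.pop_front()
    }
}
```

With a target of 4 rows, pushing three 3-row batches yields two full 4-row batches, and the leftover row only appears after `finish` is called.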

TPC-H SF1 benchmark results are below (PREFER_HASH_JOIN=false ./bench.sh run tpch). I also tried to run SF10 TPC-H on main, but it looked like it was going to take hours on my machine. It ran successfully on this PR.

./bench.sh compare_detail main smj        
Comparing main and smj
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Query        ┃                                           main ┃                               smj ┃           Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │                 44.37 / 48.67 ±4.54 / 55.68 ms │   41.63 / 55.88 ±19.83 / 95.24 ms │     1.15x slower │
│ QQuery 2     │                 45.18 / 47.44 ±2.39 / 51.74 ms │    45.26 / 47.29 ±3.56 / 54.39 ms │        no change │
│ QQuery 3     │                 52.59 / 56.15 ±2.65 / 59.79 ms │    50.93 / 52.39 ±1.35 / 54.46 ms │    +1.07x faster │
│ QQuery 4     │                 33.06 / 34.46 ±0.97 / 35.88 ms │    30.06 / 31.04 ±0.74 / 32.14 ms │    +1.11x faster │
│ QQuery 5     │                 84.50 / 87.63 ±2.06 / 90.58 ms │    78.33 / 80.62 ±2.96 / 86.32 ms │    +1.09x faster │
│ QQuery 6     │                 17.87 / 18.64 ±0.48 / 19.22 ms │    16.14 / 17.54 ±1.12 / 19.55 ms │    +1.06x faster │
│ QQuery 7     │              111.11 / 113.59 ±1.79 / 116.70 ms │ 112.43 / 115.85 ±2.55 / 118.96 ms │        no change │
│ QQuery 8     │                89.84 / 94.59 ±3.34 / 100.15 ms │    92.26 / 94.64 ±2.28 / 97.50 ms │        no change │
│ QQuery 9     │              128.36 / 133.12 ±3.46 / 138.00 ms │ 124.58 / 130.47 ±6.30 / 138.85 ms │        no change │
│ QQuery 10    │                 49.89 / 51.91 ±1.41 / 54.19 ms │    48.55 / 50.43 ±1.82 / 52.92 ms │        no change │
│ QQuery 11    │                 34.19 / 35.30 ±0.59 / 35.84 ms │    32.42 / 34.59 ±1.52 / 36.47 ms │        no change │
│ QQuery 12    │                 36.26 / 38.67 ±2.44 / 42.77 ms │    32.92 / 34.28 ±1.18 / 36.38 ms │    +1.13x faster │
│ QQuery 13    │                 31.32 / 34.13 ±2.29 / 38.22 ms │    28.66 / 29.84 ±1.11 / 31.94 ms │    +1.14x faster │
│ QQuery 14    │                 23.54 / 24.79 ±0.92 / 26.00 ms │    22.48 / 23.45 ±1.03 / 25.44 ms │    +1.06x faster │
│ QQuery 15    │                 26.66 / 27.47 ±0.86 / 29.05 ms │    26.23 / 28.64 ±1.72 / 31.48 ms │        no change │
│ QQuery 16    │                 17.63 / 18.94 ±0.97 / 20.20 ms │    16.82 / 18.11 ±1.33 / 20.60 ms │        no change │
│ QQuery 17    │                 94.36 / 96.41 ±1.62 / 98.44 ms │    91.47 / 93.47 ±1.70 / 96.54 ms │        no change │
│ QQuery 18    │               99.91 / 108.58 ±5.85 / 117.27 ms │ 104.25 / 106.40 ±2.42 / 110.47 ms │        no change │
│ QQuery 19    │                 35.23 / 36.68 ±1.46 / 39.23 ms │    32.98 / 36.03 ±1.88 / 38.57 ms │        no change │
│ QQuery 20    │                 40.66 / 41.84 ±1.20 / 44.05 ms │    38.12 / 39.20 ±0.92 / 40.45 ms │    +1.07x faster │
│ QQuery 21    │ 151142.04 / 246274.24 ±89682.07 / 358766.84 ms │ 216.09 / 218.73 ±2.03 / 221.31 ms │ +1125.94x faster │
│ QQuery 22    │                16.69 / 28.53 ±22.72 / 73.97 ms │    16.72 / 17.39 ±0.78 / 18.86 ms │    +1.64x faster │
└──────────────┴────────────────────────────────────────────────┴───────────────────────────────────┴──────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (main)      │ 247451.79ms │
│ Total Time (smj)       │   1356.29ms │
│ Average Time (main)    │  11247.81ms │
│ Average Time (smj)     │     61.65ms │
│ Queries Faster         │          10 │
│ Queries Slower         │           1 │
│ Queries with No Change │          11 │
│ Queries with Failure   │           0 │
└────────────────────────┴─────────────┘

Are these changes tested?

Yes, by the existing Sort Merge Join unit tests. This PR also adds a new benchmark.

Are there any user-facing changes?

There should not be.

…hes on vector of RecordBatches. Add benchmarks, update tests.
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Nov 21, 2025
@comphead
Contributor

+1168.11x faster

@mbutrovich
Contributor Author

I have a bug somewhere that the extended tests expose. I'll try to track it down next week.

@mbutrovich mbutrovich marked this pull request as ready for review December 2, 2025 15:46
@mbutrovich
Contributor Author

I think I sorted out the corner-case failures by refactoring a bit. I removed direct member access to the JoinedRecordBatches fields and encapsulated their logic in functions sprinkled with debug_assert, which made the control flow easier to follow. The existing logic also contained some redundant concat_batches calls; removing them already improved performance, and the BatchCoalescer makes it even better.
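
As a rough illustration of that refactoring pattern (not the actual JoinedRecordBatches code), the sketch below hides two parallel vectors behind methods and checks their invariant with `debug_assert_eq!`. The module, type, field, and method names are all hypothetical.

```rust
mod joined {
    /// Hypothetical, much-simplified stand-in for a joined-output buffer.
    #[derive(Default)]
    pub struct JoinedOutput {
        // Private to this module: callers must go through the methods below.
        rows: Vec<i64>,
        filter_mask: Vec<bool>,
    }

    impl JoinedOutput {
        /// Append rows together with their filter results, keeping both in lockstep.
        pub fn push(&mut self, rows: &[i64], mask: &[bool]) {
            debug_assert_eq!(
                rows.len(),
                mask.len(),
                "each row needs exactly one mask entry"
            );
            self.rows.extend_from_slice(rows);
            self.filter_mask.extend_from_slice(mask);
            self.check_invariants();
        }

        /// Drain the rows that passed the filter and reset the buffer.
        pub fn take_passing(&mut self) -> Vec<i64> {
            self.check_invariants();
            let rows = std::mem::take(&mut self.rows);
            let mask = std::mem::take(&mut self.filter_mask);
            rows.into_iter()
                .zip(mask)
                .filter(|(_, keep)| *keep)
                .map(|(row, _)| row)
                .collect()
        }

        pub fn is_empty(&self) -> bool {
            self.rows.is_empty()
        }

        /// The invariant every method relies on; compiled out without debug assertions.
        fn check_invariants(&self) {
            debug_assert_eq!(
                self.rows.len(),
                self.filter_mask.len(),
                "rows and filter_mask must stay the same length"
            );
        }
    }
}
```

Because the fields are private to the module, every mutation passes through a method that re-checks the invariant, so a violated assumption surfaces close to where it was introduced.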

@mbutrovich mbutrovich requested a review from comphead December 2, 2025 16:38
@mbutrovich mbutrovich changed the title Use BatchCoaleser in Sort Merge Join, new benchmarks Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ 1000x faster) Dec 2, 2025
@mbutrovich mbutrovich changed the title Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ 1000x faster) Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ ~1000x faster) Dec 2, 2025
@mbutrovich mbutrovich changed the title Use BatchCoaleser in Sort Merge Join, new benchmarks (TPC-H Q21 SMJ ~1000x faster) Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ ~1000x faster) Dec 2, 2025
@comphead
Contributor

comphead commented Dec 2, 2025

Probably we can also test it with #18985 once it is merged

@alamb
Contributor

alamb commented Dec 3, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing smj (36a73e5) to 9af6858 diff using: tpch
Results will be posted here when complete

@alamb
Contributor

alamb commented Dec 3, 2025

I started the following on this branch

PREFER_HASH_JOIN=false BENCHMARKS="tpch" ./gh_compare_branch.sh https://github.com/apache/datafusion/pull/18875

I think that will effectively compare the merge join performance of main against this branch.

@Omega359
Contributor

Omega359 commented Dec 3, 2025

This is what I get on my AMD Ryzen 9 machine:

$ ./bench.sh compare_detail upstream_main smj
Comparing upstream_main and smj
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┓
┃ Query        ┃                                 upstream_main ┃                               smj ┃          Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │              91.48 / 109.29 ±9.59 / 119.07 ms │   85.45 / 99.39 ±7.45 / 105.42 ms │   +1.10x faster │
│ QQuery 2     │                78.59 / 80.25 ±2.76 / 85.74 ms │    75.24 / 80.58 ±3.81 / 85.16 ms │       no change │
│ QQuery 3     │               89.52 / 93.74 ±3.71 / 100.51 ms │   91.79 / 94.31 ±3.17 / 100.53 ms │       no change │
│ QQuery 4     │                51.04 / 52.51 ±1.40 / 54.56 ms │    49.81 / 50.59 ±0.67 / 51.38 ms │       no change │
│ QQuery 5     │             151.19 / 154.89 ±4.39 / 163.06 ms │ 151.96 / 159.35 ±4.93 / 165.01 ms │       no change │
│ QQuery 6     │                23.67 / 29.73 ±3.32 / 32.81 ms │    25.87 / 30.65 ±2.59 / 32.65 ms │       no change │
│ QQuery 7     │             209.97 / 214.53 ±3.55 / 220.02 ms │ 213.94 / 219.05 ±7.28 / 233.48 ms │       no change │
│ QQuery 8     │             191.34 / 198.65 ±5.17 / 203.92 ms │ 189.82 / 197.02 ±4.05 / 201.31 ms │       no change │
│ QQuery 9     │             270.53 / 275.98 ±6.12 / 283.91 ms │ 272.41 / 280.72 ±4.92 / 286.95 ms │       no change │
│ QQuery 10    │               92.68 / 96.99 ±4.49 / 103.76 ms │   96.97 / 99.41 ±1.74 / 101.62 ms │       no change │
│ QQuery 11    │                60.48 / 63.12 ±2.32 / 66.97 ms │    60.53 / 63.75 ±1.85 / 65.76 ms │       no change │
│ QQuery 12    │                54.61 / 56.53 ±1.91 / 59.79 ms │    56.49 / 58.06 ±1.46 / 60.02 ms │       no change │
│ QQuery 13    │                48.51 / 51.06 ±1.72 / 53.27 ms │    48.90 / 51.44 ±1.63 / 53.71 ms │       no change │
│ QQuery 14    │                38.22 / 43.60 ±3.26 / 47.31 ms │    42.90 / 44.19 ±0.81 / 45.45 ms │       no change │
│ QQuery 15    │                47.58 / 53.90 ±4.17 / 59.49 ms │    53.77 / 55.22 ±1.27 / 57.09 ms │       no change │
│ QQuery 16    │                31.89 / 33.13 ±0.68 / 33.73 ms │    32.06 / 34.95 ±2.17 / 38.40 ms │    1.05x slower │
│ QQuery 17    │             213.95 / 215.98 ±2.13 / 219.65 ms │ 216.23 / 218.27 ±2.02 / 221.31 ms │       no change │
│ QQuery 18    │             203.46 / 208.98 ±3.20 / 212.19 ms │ 226.90 / 236.01 ±6.11 / 243.46 ms │    1.13x slower │
│ QQuery 19    │                67.12 / 69.08 ±1.12 / 70.18 ms │    68.34 / 71.18 ±1.80 / 73.83 ms │       no change │
│ QQuery 20    │                74.62 / 77.51 ±1.66 / 79.76 ms │    71.51 / 81.10 ±5.82 / 88.99 ms │       no change │
│ QQuery 21    │ 194460.40 / 199334.85 ±6297.51 / 211607.77 ms │ 345.81 / 351.86 ±4.31 / 359.12 ms │ +566.52x faster │
│ QQuery 22    │                28.62 / 33.55 ±6.70 / 46.71 ms │    27.89 / 29.55 ±1.20 / 31.03 ms │   +1.14x faster │
└──────────────┴───────────────────────────────────────────────┴───────────────────────────────────┴─────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary            ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (upstream_main)   │ 201547.85ms │
│ Total Time (smj)             │   2606.64ms │
│ Average Time (upstream_main) │   9161.27ms │
│ Average Time (smj)           │    118.48ms │
│ Queries Faster               │           3 │
│ Queries Slower               │           2 │
│ Queries with No Change       │          17 │
│ Queries with Failure         │           0 │
└──────────────────────────────┴─────────────┘

@mbutrovich mbutrovich changed the title Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ ~1000x faster) Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~1000x faster) Dec 3, 2025
@alamb
Contributor

alamb commented Dec 3, 2025

FWIW the benchmarks are still running because Q21 took over an hour to run 🤯

Query 21 iteration 0 took 4665980.8 ms and returned 100 rows

@rluvaton
Member

rluvaton commented Dec 3, 2025

Some of the debug_assert checks are so cheap that I think they should be regular asserts.
For example:

debug_assert_eq!(
    indices.len(),
    indices_len,
    "indices.len() should match indices_len parameter"
);
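
To make the trade-off concrete, here is a small sketch assuming a hypothetical `gather` helper (not a DataFusion function). The constant-time length check uses `assert_eq!` and so runs in every build, while the linear bounds scan stays behind `debug_assert!`.

```rust
/// Hypothetical helper: returns `values[i]` for each index in `indices`.
/// `indices_len` mirrors the redundant length parameter in the snippet above.
fn gather(values: &[i64], indices: &[usize], indices_len: usize) -> Vec<i64> {
    // Always checked, including release builds: a single integer comparison.
    assert_eq!(
        indices.len(),
        indices_len,
        "indices.len() should match indices_len parameter"
    );
    // Checked only when debug assertions are enabled: an O(n) scan per call.
    debug_assert!(
        indices.iter().all(|&i| i < values.len()),
        "every index must be in bounds"
    );
    indices.iter().map(|&i| values[i]).collect()
}

/// True when `debug_assert!` checks are compiled into this build.
fn debug_checks_enabled() -> bool {
    cfg!(debug_assertions)
}
```

The split keeps cheap checks active everywhere while the more expensive scan costs nothing in builds without debug assertions.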

@mbutrovich
Contributor Author

Some of the debug_assert checks are so cheap that I think they should be regular asserts. For example:

debug_assert_eq!(
    indices.len(),
    indices_len,
    "indices.len() should match indices_len parameter"
);

I might remove some. They were mostly to help me understand control flow as I was learning the SMJ state machine: I'd try to codify my understanding with debug_asserts as I went, and if I broke something or otherwise changed behavior that I was convinced was an invariant, I'd have good safeguards.

@alamb
Contributor

alamb commented Dec 3, 2025

🤖: Benchmark completed

Comparing HEAD and smj
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┓
┃ Query        ┃          HEAD ┃        smj ┃           Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━┩
│ QQuery 1     │     230.23 ms │  227.23 ms │        no change │
│ QQuery 2     │     188.11 ms │  188.44 ms │        no change │
│ QQuery 3     │     244.92 ms │  249.80 ms │        no change │
│ QQuery 4     │     174.31 ms │  175.78 ms │        no change │
│ QQuery 5     │     408.15 ms │  402.66 ms │        no change │
│ QQuery 6     │      68.06 ms │   67.13 ms │        no change │
│ QQuery 7     │     488.35 ms │  502.39 ms │        no change │
│ QQuery 8     │     470.92 ms │  477.66 ms │        no change │
│ QQuery 9     │     682.45 ms │  684.11 ms │        no change │
│ QQuery 10    │     241.30 ms │  238.98 ms │        no change │
│ QQuery 11    │     171.28 ms │  168.36 ms │        no change │
│ QQuery 12    │     159.38 ms │  160.41 ms │        no change │
│ QQuery 13    │     264.82 ms │  265.14 ms │        no change │
│ QQuery 14    │      95.66 ms │   91.17 ms │        no change │
│ QQuery 15    │      99.71 ms │   98.49 ms │        no change │
│ QQuery 16    │      70.28 ms │   73.21 ms │        no change │
│ QQuery 17    │     504.18 ms │  501.81 ms │        no change │
│ QQuery 18    │     586.59 ms │  745.85 ms │     1.27x slower │
│ QQuery 19    │     138.05 ms │  152.12 ms │     1.10x slower │
│ QQuery 20    │     180.26 ms │  187.47 ms │        no change │
│ QQuery 21    │ 4422642.01 ms │ 1063.69 ms │ +4157.85x faster │
│ QQuery 22    │     104.78 ms │   99.21 ms │    +1.06x faster │
└──────────────┴───────────────┴────────────┴──────────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Benchmark Summary      ┃              ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Total Time (HEAD)      │ 4428213.77ms │
│ Total Time (smj)       │    6821.10ms │
│ Average Time (HEAD)    │  201282.44ms │
│ Average Time (smj)     │     310.05ms │
│ Queries Faster         │            2 │
│ Queries Slower         │            2 │
│ Queries with No Change │           18 │
│ Queries with Failure   │            0 │
└────────────────────────┴──────────────┘
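
For reference, the Q21 speedups quoted in this thread can be reproduced from the reported times with the short sketch below. It assumes the "Change" column is the ratio of the baseline time to the branch time; the times are copied from the three tables in this thread, and the function and constant names are hypothetical. Because the tables show rounded times, the recomputed ratios can differ from the printed ones in the second decimal place.

```rust
/// Speedup factor, assumed to be baseline time divided by candidate time.
fn speedup(baseline_ms: f64, candidate_ms: f64) -> f64 {
    baseline_ms / candidate_ms
}

/// Q21 times in ms as (reporter, baseline, this PR), copied from the tables in this thread.
const Q21_RUNS: [(&str, f64, f64); 3] = [
    ("mbutrovich", 246274.24, 218.73),
    ("Omega359", 199334.85, 351.86),
    ("alamb", 4422642.01, 1063.69),
];

/// Recompute the Q21 speedup for each reported run.
fn q21_speedups() -> Vec<(&'static str, f64)> {
    Q21_RUNS
        .iter()
        .map(|&(who, base, cand)| (who, speedup(base, cand)))
        .collect()
}
```

The spread across machines mostly reflects how slow the baseline was on each one, since the branch finishes Q21 in roughly 0.2 to 1.1 seconds in all three runs.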

@mbutrovich
Contributor Author

mbutrovich commented Dec 4, 2025

│ QQuery 21 │ 4422642.01 ms │ 1063.69 ms │ +4157.85x faster │

My goodness.

@comphead
Contributor

comphead commented Dec 4, 2025

Small batches are evil. Sorry for the delay: I wanted to check the PR with TPC-DS, but because of the recent regression #19075 I cannot merge it right now.

@rluvaton
Member

rluvaton commented Dec 4, 2025

Could you please align with main? I just merged a PR that fixed a bug in SMJ and updated the fuzz tests.

@mbutrovich mbutrovich changed the title Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~1000x faster) Sort Merge Join: Reduce batch concatenation, use BatchCoalescer, new benchmarks (TPC-H Q21 SMJ up to ~4000x faster) Dec 4, 2025

Labels

physical-plan Changes to the physical-plan crate

Development

Successfully merging this pull request may close these issues.

Sort Merge Join is extremely slow on LeftAnti joins
Performance regression after adding support for SMJ with join filter

6 participants