Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated #17444

adriangb · 2025-09-05T15:27:32Z

This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity

cc @jonathanc-n

…ds instead of computing them after data is accumulated This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity

adriangb · 2025-09-05T15:31:51Z

cc @nuno-faria

nuno-faria

I tested this PR and see no regression in performance against main. When #17267 is eventually tackled it might be better to add a test showing the dynamic filter still being built when the join is spilled. Here is a potential example:

copy (select i as k, i as v from generate_series(1, 100000000) as t(i)) to 't1.parquet';
copy (select i as k, i as v from generate_series(1, 100000000) as t(i)) to 't2.parquet';

create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';

> SET datafusion.runtime.memory_limit = '40M';
0 row(s) fetched.
Elapsed 0.000 seconds.

> explain analyze select * from t1 join t2 on t1.k = t2.k where t2.v < 1000000;
Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:
  HashJoinInput[4]#586(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[3]#585(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[2]#584(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[5]#601(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[10]#603(can spill: false) consumed 3.4 MB, peak 3.4 MB.
Error: Failed to allocate additional 2.1 MB for HashJoinInput[8] with 1306.0 KB already allocated for this reservation - 1392.5 KB remain available for the total pool

jonathanc-n

lgtm thanks!

adriangb · 2025-09-08T12:47:19Z

@alamb could you kick off benchmarks on this PR and maybe take a stab at a brief review? FWIW I think it's more of a refactor than a behavior change.

alamb · 2025-09-09T14:23:00Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing accumulate-hash-bounds (b54f71f) to f5bdc2d diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb

Thanks @adriangb -- I think this looks good. I would really like to avoid the new dependency if possible, but we can also do it as a follow on PR

Thanks to @nuno-faria and @jonathanc-n for the review

FYI @2010YOUY01 given you have been looking at this code recently as well

alamb · 2025-09-09T14:23:26Z

datafusion/physical-plan/Cargo.toml

 datafusion-execution = { workspace = true }
 datafusion-expr = { workspace = true }
-datafusion-functions-aggregate-common = { workspace = true }
+datafusion-functions-aggregate = { workspace = true }


I worry about bringing in all of the aggregate functions for all physical plans

I would personally prefer to have Min/Max accumulators moved into datafusion-functions-aggregate-common and avoid this new dependency

Makes sense. I opened #17492 which also removes the dep for the parquet crate. However I might merge this PR first and do it as a followup since the followup will be just changing an import.

alamb · 2025-09-09T14:27:34Z

datafusion/physical-plan/src/joins/hash_join/exec.rs

+    /// The physical expression to evaluate for each batch
+    expr: Arc<dyn PhysicalExpr>,
+    /// Accumulator for tracking the minimum value across all batches
+    min: MinAccumulator,


I think using the min/max accumulators is a great idea

It may also help the symptoms @LiaCastaneda is reporting in

High CPU during dynamic filter bound computation: min_batch/max_batch #17486

As I believe I remember there is a better optimized implementation for Lists in those accumulators than what was here previously

alamb · 2025-09-09T14:28:17Z

datafusion/physical-plan/src/joins/hash_join/exec.rs

            let batch_size = get_record_batch_memory_size(&batch);
            // Reserve memory for incoming batch
-            acc.3.try_grow(batch_size)?;
+            state.reservation.try_grow(batch_size)?;


I really like this state.reservation rather than acc.2.

alamb · 2025-09-09T15:17:29Z

🤖: Benchmark completed

Details

Comparing HEAD and accumulate-hash-bounds
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ accumulate-hash-bounds ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2713.58 ms │             2774.05 ms │    no change │
│ QQuery 1     │  1216.87 ms │             1446.94 ms │ 1.19x slower │
│ QQuery 2     │  2372.79 ms │             2673.84 ms │ 1.13x slower │
│ QQuery 3     │  1185.56 ms │             1181.55 ms │    no change │
│ QQuery 4     │  2278.40 ms │             2240.44 ms │    no change │
│ QQuery 5     │ 27531.94 ms │            27913.72 ms │    no change │
│ QQuery 6     │  4317.62 ms │             4225.34 ms │    no change │
│ QQuery 7     │  3608.69 ms │             3661.81 ms │    no change │
└──────────────┴─────────────┴────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 45225.45ms │
│ Total Time (accumulate-hash-bounds)   │ 46117.68ms │
│ Average Time (HEAD)                   │  5653.18ms │
│ Average Time (accumulate-hash-bounds) │  5764.71ms │
│ Queries Faster                        │          0 │
│ Queries Slower                        │          2 │
│ Queries with No Change                │          6 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ accumulate-hash-bounds ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.31 ms │                2.26 ms │     no change │
│ QQuery 1     │    51.42 ms │               51.29 ms │     no change │
│ QQuery 2     │   139.23 ms │              136.03 ms │     no change │
│ QQuery 3     │   164.69 ms │              165.23 ms │     no change │
│ QQuery 4     │  1069.41 ms │             1116.11 ms │     no change │
│ QQuery 5     │  1544.21 ms │             1605.46 ms │     no change │
│ QQuery 6     │     2.17 ms │                2.18 ms │     no change │
│ QQuery 7     │    53.44 ms │               57.14 ms │  1.07x slower │
│ QQuery 8     │  1499.94 ms │             1504.72 ms │     no change │
│ QQuery 9     │  1952.85 ms │             1960.49 ms │     no change │
│ QQuery 10    │   383.37 ms │              402.73 ms │  1.05x slower │
│ QQuery 11    │   448.06 ms │              454.30 ms │     no change │
│ QQuery 12    │  1449.91 ms │             1499.32 ms │     no change │
│ QQuery 13    │  2170.20 ms │             2262.80 ms │     no change │
│ QQuery 14    │  1305.68 ms │             1332.44 ms │     no change │
│ QQuery 15    │  1239.58 ms │             1250.71 ms │     no change │
│ QQuery 16    │  2680.85 ms │             2764.62 ms │     no change │
│ QQuery 17    │  2654.51 ms │             2737.53 ms │     no change │
│ QQuery 18    │  5186.52 ms │             5027.45 ms │     no change │
│ QQuery 19    │   127.96 ms │              128.82 ms │     no change │
│ QQuery 20    │  2129.59 ms │             2014.96 ms │ +1.06x faster │
│ QQuery 21    │  2384.15 ms │             2376.14 ms │     no change │
│ QQuery 22    │  4195.18 ms │             4071.80 ms │     no change │
│ QQuery 23    │ 12984.05 ms │            13117.15 ms │     no change │
│ QQuery 24    │   223.66 ms │              210.57 ms │ +1.06x faster │
│ QQuery 25    │   518.38 ms │              532.27 ms │     no change │
│ QQuery 26    │   224.24 ms │              235.28 ms │     no change │
│ QQuery 27    │  2963.41 ms │             2926.84 ms │     no change │
│ QQuery 28    │ 23334.20 ms │            23409.83 ms │     no change │
│ QQuery 29    │   991.68 ms │              993.11 ms │     no change │
│ QQuery 30    │  1406.75 ms │             1371.44 ms │     no change │
│ QQuery 31    │  1368.31 ms │             1353.94 ms │     no change │
│ QQuery 32    │  4662.92 ms │             4888.18 ms │     no change │
│ QQuery 33    │  5892.25 ms │             5906.30 ms │     no change │
│ QQuery 34    │  6108.34 ms │             5883.00 ms │     no change │
│ QQuery 35    │  2094.88 ms │             2059.82 ms │     no change │
│ QQuery 36    │   123.71 ms │              120.35 ms │     no change │
│ QQuery 37    │    53.98 ms │               52.32 ms │     no change │
│ QQuery 38    │   125.41 ms │              121.70 ms │     no change │
│ QQuery 39    │   203.34 ms │              201.01 ms │     no change │
│ QQuery 40    │    44.74 ms │               44.34 ms │     no change │
│ QQuery 41    │    41.57 ms │               41.90 ms │     no change │
│ QQuery 42    │    32.60 ms │               34.87 ms │  1.07x slower │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 96233.64ms │
│ Total Time (accumulate-hash-bounds)   │ 96428.73ms │
│ Average Time (HEAD)                   │  2237.99ms │
│ Average Time (accumulate-hash-bounds) │  2242.53ms │
│ Queries Faster                        │          2 │
│ Queries Slower                        │          3 │
│ Queries with No Change                │         38 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ accumulate-hash-bounds ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 168.14 ms │              167.72 ms │     no change │
│ QQuery 2     │  25.20 ms │               25.76 ms │     no change │
│ QQuery 3     │  44.93 ms │               44.80 ms │     no change │
│ QQuery 4     │  26.48 ms │               26.68 ms │     no change │
│ QQuery 5     │  74.08 ms │               74.98 ms │     no change │
│ QQuery 6     │  19.25 ms │               19.69 ms │     no change │
│ QQuery 7     │ 145.99 ms │              149.34 ms │     no change │
│ QQuery 8     │  37.27 ms │               29.13 ms │ +1.28x faster │
│ QQuery 9     │ 103.17 ms │               84.67 ms │ +1.22x faster │
│ QQuery 10    │  59.08 ms │               58.92 ms │     no change │
│ QQuery 11    │  41.43 ms │               42.41 ms │     no change │
│ QQuery 12    │  49.94 ms │               50.06 ms │     no change │
│ QQuery 13    │  45.26 ms │               45.58 ms │     no change │
│ QQuery 14    │  13.86 ms │               13.80 ms │     no change │
│ QQuery 15    │  24.09 ms │               24.64 ms │     no change │
│ QQuery 16    │  24.26 ms │               23.95 ms │     no change │
│ QQuery 17    │ 150.11 ms │              146.25 ms │     no change │
│ QQuery 18    │ 326.95 ms │              326.51 ms │     no change │
│ QQuery 19    │  37.55 ms │               37.40 ms │     no change │
│ QQuery 20    │  49.57 ms │               49.19 ms │     no change │
│ QQuery 21    │ 216.75 ms │              223.29 ms │     no change │
│ QQuery 22    │  19.32 ms │               19.79 ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1702.68ms │
│ Total Time (accumulate-hash-bounds)   │ 1684.57ms │
│ Average Time (HEAD)                   │   77.39ms │
│ Average Time (accumulate-hash-bounds) │   76.57ms │
│ Queries Faster                        │         2 │
│ Queries Slower                        │         0 │
│ Queries with No Change                │        20 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

adriangb · 2025-09-09T15:29:29Z

Benchmark clickbench_extended.json
│ QQuery 1 │ 1216.87 ms │ 1446.94 ms │ 1.19x slower │

I think this is noise, Q1 is:

datafusion/benchmarks/queries/clickbench/extended/q1.sql

Line 4 in fc5888b

    
           SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")  FROM hits;

Which has no join.

Benchmark tpch_mem_sf1.json
QQuery 8 │ 37.27 ms │ 29.13 ms │ +1.28x faster

This looks real, lots of joins!

datafusion/benchmarks/queries/q8.sql

Lines 13 to 21 in fc5888b

    
           from 
        
               part, 
        
               supplier, 
        
               lineitem, 
        
               orders, 
        
               customer, 
        
               nation n1, 
        
               nation n2, 
        
               region

LiaCastaneda · 2025-09-10T15:00:15Z

datafusion/physical-plan/src/joins/hash_join/exec.rs

-            // Use Arrow kernels for efficient min/max computation
-            let min_val = min_batch(array)?;
-            let max_val = max_batch(array)?;


I think the accumulator update_batch also uses min_batch which uses min_max_batch_generic which was the expensive function, but I will try to bring this and see if it solves #17486

…ds instead of computing them after data is accumulated (apache#17444) (cherry picked from commit 5b833b9)

…ds instead of computing them after data is accumulated (apache#17444) (#46) (cherry picked from commit 5b833b9) Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>

Refactor HashJoinExec to progressively accumulate dynamic filter boun…

2740a7a

…ds instead of computing them after data is accumulated This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity

github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 5, 2025

remove unused crate

25be14a

adriangb requested review from 2010YOUY01 and xudong963 September 5, 2025 15:31

adriangb added 7 commits September 5, 2025 10:56

fix

9cc484f

fix

77787f2

docstrings

d16b011

fix typo

671949a

simplify

5564824

fmt

e6c2b75

lint

b54f71f

adriangb mentioned this pull request Sep 5, 2025

PROPOSAL Hash Join Spilling Proposal #17267

Open

nuno-faria approved these changes Sep 6, 2025

View reviewed changes

jonathanc-n approved these changes Sep 6, 2025

View reviewed changes

alamb approved these changes Sep 9, 2025

View reviewed changes

adriangb mentioned this pull request Sep 9, 2025

move MinAggregator and MaxAggregator to functions-aggregate-common #17492

Merged

adriangb merged commit 5b833b9 into apache:main Sep 9, 2025
30 checks passed

adriangb deleted the accumulate-hash-bounds branch September 9, 2025 17:58

This was referenced Sep 9, 2025

fix: synchronize partition bounds reporting in HashJoin #17452

Merged

High CPU during dynamic filter bound computation: min_batch/max_batch #17486

Closed

LiaCastaneda reviewed Sep 10, 2025

View reviewed changes

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 10, 2025

Refactor HashJoinExec to progressively accumulate dynamic filter boun…

6a78a66

…ds instead of computing them after data is accumulated (apache#17444) (cherry picked from commit 5b833b9)

LiaCastaneda mentioned this pull request Sep 10, 2025

Cherry pick min max progressive accumulation in hash joins DataDog/datafusion#44

Closed

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 10, 2025

Refactor HashJoinExec to progressively accumulate dynamic filter boun…

a016ed6

…ds instead of computing them after data is accumulated (apache#17444) (cherry picked from commit 5b833b9)

adriangb mentioned this pull request Sep 10, 2025

update physical-plan to use datafusion-functions-aggregate-common for Min/MaxAccumulator #17502

Merged

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 19, 2025

Refactor HashJoinExec to progressively accumulate dynamic filter boun…

d6c38b7

…ds instead of computing them after data is accumulated (apache#17444) (cherry picked from commit 5b833b9)

LiaCastaneda mentioned this pull request Sep 19, 2025

Refactor HashJoinExec to progressively accumulate dynamic filter bounds DataDog/datafusion#46

Merged

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 19, 2025

Refactor HashJoinExec to progressively accumulate dynamic filter boun…

ef3386c

…ds instead of computing them after data is accumulated (apache#17444) (cherry picked from commit 5b833b9)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated #17444

Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated #17444

Uh oh!

adriangb commented Sep 5, 2025

Uh oh!

adriangb commented Sep 5, 2025

Uh oh!

nuno-faria left a comment

Uh oh!

jonathanc-n left a comment

Uh oh!

adriangb commented Sep 8, 2025

Uh oh!

alamb commented Sep 9, 2025

Uh oh!

alamb left a comment

Uh oh!

alamb Sep 9, 2025

Uh oh!

adriangb Sep 9, 2025

Uh oh!

alamb Sep 9, 2025

Uh oh!

alamb Sep 9, 2025

Uh oh!

alamb commented Sep 9, 2025

Uh oh!

adriangb commented Sep 9, 2025 •

edited

Loading

Uh oh!

Uh oh!

LiaCastaneda Sep 10, 2025

Uh oh!

Uh oh!

Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated #17444

Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated #17444

Uh oh!

Conversation

adriangb commented Sep 5, 2025

Uh oh!

adriangb commented Sep 5, 2025

Uh oh!

nuno-faria left a comment

Choose a reason for hiding this comment

Uh oh!

jonathanc-n left a comment

Choose a reason for hiding this comment

Uh oh!

adriangb commented Sep 8, 2025

Uh oh!

alamb commented Sep 9, 2025

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

adriangb Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

alamb Sep 9, 2025

Choose a reason for hiding this comment

Uh oh!

alamb commented Sep 9, 2025

Uh oh!

adriangb commented Sep 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

LiaCastaneda Sep 10, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

adriangb commented Sep 9, 2025 •

edited

Loading