Skip to content

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Sep 5, 2025

This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity

cc @jonathanc-n

…ds instead of computing them after data is accumulated

This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Sep 5, 2025
@adriangb
Copy link
Contributor Author

adriangb commented Sep 5, 2025

cc @nuno-faria

Copy link
Contributor

@nuno-faria nuno-faria left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested this PR and see no regression in performance against main. When #17267 is eventually tackled it might be better to add a test showing the dynamic filter still being built when the join is spilled. Here is a potential example:

copy (select i as k, i as v from generate_series(1, 100000000) as t(i)) to 't1.parquet';
copy (select i as k, i as v from generate_series(1, 100000000) as t(i)) to 't2.parquet';

create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';

> SET datafusion.runtime.memory_limit = '40M';
0 row(s) fetched.
Elapsed 0.000 seconds.

> explain analyze select * from t1 join t2 on t1.k = t2.k where t2.v < 1000000;
Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:
  HashJoinInput[4]#586(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[3]#585(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[2]#584(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[5]#601(can spill: false) consumed 3.4 MB, peak 3.4 MB,
  HashJoinInput[10]#603(can spill: false) consumed 3.4 MB, peak 3.4 MB.
Error: Failed to allocate additional 2.1 MB for HashJoinInput[8] with 1306.0 KB already allocated for this reservation - 1392.5 KB remain available for the total pool

Copy link
Contributor

@jonathanc-n jonathanc-n left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm thanks!

@adriangb
Copy link
Contributor Author

adriangb commented Sep 8, 2025

@alamb could you kick off benchmarks on this PR and maybe take a stab at a brief review? FWIW I think it's more of a refactor than a behavior change.

@alamb
Copy link
Contributor

alamb commented Sep 9, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing accumulate-hash-bounds (b54f71f) to f5bdc2d diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb -- I think this looks good. I would really like to avoid the new dependency if possible, but we can also do it as a follow on PR

Thanks to @nuno-faria and @jonathanc-n for the review

FYI @2010YOUY01 given you have been looking at this code recently as well

datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions-aggregate-common = { workspace = true }
datafusion-functions-aggregate = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry about bringing in all of the aggregate functions for all physical plans

I would personally prefer to have Min/Max accumulators moved into datafusion-functions-aggregate-common and avoid this new dependency

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I opened #17492 which also removes the dep for the parquet crate. However I might merge this PR first and do it as a followup since the followup will be just changing an import.

/// The physical expression to evaluate for each batch
expr: Arc<dyn PhysicalExpr>,
/// Accumulator for tracking the minimum value across all batches
min: MinAccumulator,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using the min/max accumulators is a great idea

It may also help the symptoms @LiaCastaneda is reporting in

As I believe I remember there is a better optimized implementation for Lists in those accumulators than what was here previously

let batch_size = get_record_batch_memory_size(&batch);
// Reserve memory for incoming batch
acc.3.try_grow(batch_size)?;
state.reservation.try_grow(batch_size)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this state.reservation rather than acc.2.

@alamb
Copy link
Contributor

alamb commented Sep 9, 2025

🤖: Benchmark completed

Details

Comparing HEAD and accumulate-hash-bounds
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ accumulate-hash-bounds ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2713.58 ms │             2774.05 ms │    no change │
│ QQuery 1     │  1216.87 ms │             1446.94 ms │ 1.19x slower │
│ QQuery 2     │  2372.79 ms │             2673.84 ms │ 1.13x slower │
│ QQuery 3     │  1185.56 ms │             1181.55 ms │    no change │
│ QQuery 4     │  2278.40 ms │             2240.44 ms │    no change │
│ QQuery 5     │ 27531.94 ms │            27913.72 ms │    no change │
│ QQuery 6     │  4317.62 ms │             4225.34 ms │    no change │
│ QQuery 7     │  3608.69 ms │             3661.81 ms │    no change │
└──────────────┴─────────────┴────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 45225.45ms │
│ Total Time (accumulate-hash-bounds)   │ 46117.68ms │
│ Average Time (HEAD)                   │  5653.18ms │
│ Average Time (accumulate-hash-bounds) │  5764.71ms │
│ Queries Faster                        │          0 │
│ Queries Slower                        │          2 │
│ Queries with No Change                │          6 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ accumulate-hash-bounds ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.31 ms │                2.26 ms │     no change │
│ QQuery 1     │    51.42 ms │               51.29 ms │     no change │
│ QQuery 2     │   139.23 ms │              136.03 ms │     no change │
│ QQuery 3     │   164.69 ms │              165.23 ms │     no change │
│ QQuery 4     │  1069.41 ms │             1116.11 ms │     no change │
│ QQuery 5     │  1544.21 ms │             1605.46 ms │     no change │
│ QQuery 6     │     2.17 ms │                2.18 ms │     no change │
│ QQuery 7     │    53.44 ms │               57.14 ms │  1.07x slower │
│ QQuery 8     │  1499.94 ms │             1504.72 ms │     no change │
│ QQuery 9     │  1952.85 ms │             1960.49 ms │     no change │
│ QQuery 10    │   383.37 ms │              402.73 ms │  1.05x slower │
│ QQuery 11    │   448.06 ms │              454.30 ms │     no change │
│ QQuery 12    │  1449.91 ms │             1499.32 ms │     no change │
│ QQuery 13    │  2170.20 ms │             2262.80 ms │     no change │
│ QQuery 14    │  1305.68 ms │             1332.44 ms │     no change │
│ QQuery 15    │  1239.58 ms │             1250.71 ms │     no change │
│ QQuery 16    │  2680.85 ms │             2764.62 ms │     no change │
│ QQuery 17    │  2654.51 ms │             2737.53 ms │     no change │
│ QQuery 18    │  5186.52 ms │             5027.45 ms │     no change │
│ QQuery 19    │   127.96 ms │              128.82 ms │     no change │
│ QQuery 20    │  2129.59 ms │             2014.96 ms │ +1.06x faster │
│ QQuery 21    │  2384.15 ms │             2376.14 ms │     no change │
│ QQuery 22    │  4195.18 ms │             4071.80 ms │     no change │
│ QQuery 23    │ 12984.05 ms │            13117.15 ms │     no change │
│ QQuery 24    │   223.66 ms │              210.57 ms │ +1.06x faster │
│ QQuery 25    │   518.38 ms │              532.27 ms │     no change │
│ QQuery 26    │   224.24 ms │              235.28 ms │     no change │
│ QQuery 27    │  2963.41 ms │             2926.84 ms │     no change │
│ QQuery 28    │ 23334.20 ms │            23409.83 ms │     no change │
│ QQuery 29    │   991.68 ms │              993.11 ms │     no change │
│ QQuery 30    │  1406.75 ms │             1371.44 ms │     no change │
│ QQuery 31    │  1368.31 ms │             1353.94 ms │     no change │
│ QQuery 32    │  4662.92 ms │             4888.18 ms │     no change │
│ QQuery 33    │  5892.25 ms │             5906.30 ms │     no change │
│ QQuery 34    │  6108.34 ms │             5883.00 ms │     no change │
│ QQuery 35    │  2094.88 ms │             2059.82 ms │     no change │
│ QQuery 36    │   123.71 ms │              120.35 ms │     no change │
│ QQuery 37    │    53.98 ms │               52.32 ms │     no change │
│ QQuery 38    │   125.41 ms │              121.70 ms │     no change │
│ QQuery 39    │   203.34 ms │              201.01 ms │     no change │
│ QQuery 40    │    44.74 ms │               44.34 ms │     no change │
│ QQuery 41    │    41.57 ms │               41.90 ms │     no change │
│ QQuery 42    │    32.60 ms │               34.87 ms │  1.07x slower │
└──────────────┴─────────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 96233.64ms │
│ Total Time (accumulate-hash-bounds)   │ 96428.73ms │
│ Average Time (HEAD)                   │  2237.99ms │
│ Average Time (accumulate-hash-bounds) │  2242.53ms │
│ Queries Faster                        │          2 │
│ Queries Slower                        │          3 │
│ Queries with No Change                │         38 │
│ Queries with Failure                  │          0 │
└───────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ accumulate-hash-bounds ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 168.14 ms │              167.72 ms │     no change │
│ QQuery 2     │  25.20 ms │               25.76 ms │     no change │
│ QQuery 3     │  44.93 ms │               44.80 ms │     no change │
│ QQuery 4     │  26.48 ms │               26.68 ms │     no change │
│ QQuery 5     │  74.08 ms │               74.98 ms │     no change │
│ QQuery 6     │  19.25 ms │               19.69 ms │     no change │
│ QQuery 7     │ 145.99 ms │              149.34 ms │     no change │
│ QQuery 8     │  37.27 ms │               29.13 ms │ +1.28x faster │
│ QQuery 9     │ 103.17 ms │               84.67 ms │ +1.22x faster │
│ QQuery 10    │  59.08 ms │               58.92 ms │     no change │
│ QQuery 11    │  41.43 ms │               42.41 ms │     no change │
│ QQuery 12    │  49.94 ms │               50.06 ms │     no change │
│ QQuery 13    │  45.26 ms │               45.58 ms │     no change │
│ QQuery 14    │  13.86 ms │               13.80 ms │     no change │
│ QQuery 15    │  24.09 ms │               24.64 ms │     no change │
│ QQuery 16    │  24.26 ms │               23.95 ms │     no change │
│ QQuery 17    │ 150.11 ms │              146.25 ms │     no change │
│ QQuery 18    │ 326.95 ms │              326.51 ms │     no change │
│ QQuery 19    │  37.55 ms │               37.40 ms │     no change │
│ QQuery 20    │  49.57 ms │               49.19 ms │     no change │
│ QQuery 21    │ 216.75 ms │              223.29 ms │     no change │
│ QQuery 22    │  19.32 ms │               19.79 ms │     no change │
└──────────────┴───────────┴────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                     ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                     │ 1702.68ms │
│ Total Time (accumulate-hash-bounds)   │ 1684.57ms │
│ Average Time (HEAD)                   │   77.39ms │
│ Average Time (accumulate-hash-bounds) │   76.57ms │
│ Queries Faster                        │         2 │
│ Queries Slower                        │         0 │
│ Queries with No Change                │        20 │
│ Queries with Failure                  │         0 │
└───────────────────────────────────────┴───────────┘

@adriangb
Copy link
Contributor Author

adriangb commented Sep 9, 2025

Benchmark clickbench_extended.json
│ QQuery 1 │ 1216.87 ms │ 1446.94 ms │ 1.19x slower │

I think this is noise, Q1 is:

SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage") FROM hits;

Which has no join.

Benchmark tpch_mem_sf1.json
QQuery 8 │ 37.27 ms │ 29.13 ms │ +1.28x faster

This looks real, lots of joins!

from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region

@adriangb adriangb merged commit 5b833b9 into apache:main Sep 9, 2025
30 checks passed
@adriangb adriangb deleted the accumulate-hash-bounds branch September 9, 2025 17:58
Comment on lines -1204 to -1206
// Use Arrow kernels for efficient min/max computation
let min_val = min_batch(array)?;
let max_val = max_batch(array)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the accumulator update_batch also uses min_batch which uses min_max_batch_generic which was the expensive function, but I will try to bring this and see if it solves #17486

LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 10, 2025
…ds instead of computing them after data is accumulated (apache#17444)

(cherry picked from commit 5b833b9)
LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 10, 2025
…ds instead of computing them after data is accumulated (apache#17444)

(cherry picked from commit 5b833b9)
LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 19, 2025
…ds instead of computing them after data is accumulated (apache#17444)

(cherry picked from commit 5b833b9)
LiaCastaneda pushed a commit to DataDog/datafusion that referenced this pull request Sep 19, 2025
…ds instead of computing them after data is accumulated (apache#17444)

(cherry picked from commit 5b833b9)
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Sep 29, 2025
…ds instead of computing them after data is accumulated (apache#17444) (#46)

(cherry picked from commit 5b833b9)

Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
physical-plan Changes to the physical-plan crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants