Refactor HashJoinExec to progressively accumulate dynamic filter bounds instead of computing them after data is accumulated #17444
Conversation
…ds instead of computing them after data is accumulated

This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity.

cc @nuno-faria
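To make the idea concrete, here is a toy sketch (plain integers rather than the PR's Arrow arrays and accumulators, so every name is illustrative): fold the bounds as each batch arrives instead of re-scanning the buffered data after collection finishes, so nothing beyond the running (min, max) has to be retained for the dynamic filter.

```rust
/// Toy illustration only: accumulate min/max bounds incrementally per
/// "batch" of plain i32 values.
fn progressive_bounds<I>(batches: I) -> Option<(i32, i32)>
where
    I: IntoIterator<Item = Vec<i32>>,
{
    let mut bounds: Option<(i32, i32)> = None;
    for batch in batches {
        for v in batch {
            bounds = Some(match bounds {
                None => (v, v),
                Some((lo, hi)) => (lo.min(v), hi.max(v)),
            });
        }
        // The batch itself could now be spilled or dropped; the dynamic
        // filter only needs the accumulated (min, max).
    }
    bounds
}
```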
I tested this PR and see no regression in performance against main. When #17267 is eventually tackled, it might be better to add a test showing the dynamic filter still being built when the join is spilled. Here is a potential example:
copy (select i as k, i as v from generate_series(1, 100000000) as t(i)) to 't1.parquet';
copy (select i as k, i as v from generate_series(1, 100000000) as t(i)) to 't2.parquet';
create external table t1 stored as parquet location 't1.parquet';
create external table t2 stored as parquet location 't2.parquet';
> SET datafusion.runtime.memory_limit = '40M';
0 row(s) fetched.
Elapsed 0.000 seconds.
> explain analyze select * from t1 join t2 on t1.k = t2.k where t2.v < 1000000;
Resources exhausted: Additional allocation failed with top memory consumers (across reservations) as:
HashJoinInput[4]#586(can spill: false) consumed 3.4 MB, peak 3.4 MB,
HashJoinInput[3]#585(can spill: false) consumed 3.4 MB, peak 3.4 MB,
HashJoinInput[2]#584(can spill: false) consumed 3.4 MB, peak 3.4 MB,
HashJoinInput[5]#601(can spill: false) consumed 3.4 MB, peak 3.4 MB,
HashJoinInput[10]#603(can spill: false) consumed 3.4 MB, peak 3.4 MB.
Error: Failed to allocate additional 2.1 MB for HashJoinInput[8] with 1306.0 KB already allocated for this reservation - 1392.5 KB remain available for the total pool
lgtm thanks!
@alamb could you kick off benchmarks on this PR and maybe take a stab at a brief review? FWIW I think it's more of a refactor than a behavior change.
Thanks @adriangb -- I think this looks good. I would really like to avoid the new dependency if possible, but we can also do it as a follow-on PR.
Thanks to @nuno-faria and @jonathanc-n for the review
FYI @2010YOUY01 given you have been looking at this code recently as well
datafusion-execution = { workspace = true }
datafusion-expr = { workspace = true }
datafusion-functions-aggregate-common = { workspace = true }
datafusion-functions-aggregate = { workspace = true } |
I worry about bringing in all of the aggregate functions for all physical plans. I would personally prefer to have the Min/Max accumulators moved into datafusion-functions-aggregate-common and avoid this new dependency.
Makes sense. I opened #17492, which also removes the dependency for the parquet crate. However, I might merge this PR first and do it as a follow-up, since the follow-up will just be changing an import.
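For reference, a hedged sketch of what that import-only follow-up could look like, assuming the accumulators are currently imported from datafusion_functions_aggregate::min_max; the aggregate-common path is purely hypothetical until that move happens:

```rust
// What the build side uses today; this is what pulls the whole
// datafusion-functions-aggregate crate into datafusion-physical-plan.
use datafusion_functions_aggregate::min_max::{MaxAccumulator, MinAccumulator};

// If the accumulators were moved as suggested, only the import would change,
// roughly like this (hypothetical path, not an existing module):
// use datafusion_functions_aggregate_common::min_max::{MaxAccumulator, MinAccumulator};
```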
/// The physical expression to evaluate for each batch
expr: Arc<dyn PhysicalExpr>,
/// Accumulator for tracking the minimum value across all batches
min: MinAccumulator, |
I think using the min/max accumulators is a great idea. It may also help the symptoms @LiaCastaneda is reporting, as I believe I remember there is a better optimized implementation for Lists in those accumulators than what was here previously.
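A minimal sketch of the progressive-bounds pattern being discussed, assuming the Accumulator trait API (update_batch / evaluate) and the MinAccumulator / MaxAccumulator types from datafusion-functions-aggregate; the struct and method names are illustrative, not the PR's exact code:

```rust
use std::sync::Arc;

use arrow::record_batch::RecordBatch;
use datafusion_common::{Result, ScalarValue};
use datafusion_expr::Accumulator;
use datafusion_functions_aggregate::min_max::{MaxAccumulator, MinAccumulator};
use datafusion_physical_expr::PhysicalExpr;

/// Illustrative only: running bounds for one build-side join key.
struct ColumnBounds {
    /// Physical expression for the join key on the build side
    expr: Arc<dyn PhysicalExpr>,
    /// Running minimum across all batches seen so far
    min: MinAccumulator,
    /// Running maximum across all batches seen so far
    max: MaxAccumulator,
}

impl ColumnBounds {
    /// Fold one build-side batch into the running bounds.
    fn update(&mut self, batch: &RecordBatch) -> Result<()> {
        let values = self.expr.evaluate(batch)?.into_array(batch.num_rows())?;
        self.min.update_batch(std::slice::from_ref(&values))?;
        self.max.update_batch(std::slice::from_ref(&values))?;
        Ok(())
    }

    /// Current (min, max) bounds, e.g. for publishing the dynamic filter.
    fn bounds(&mut self) -> Result<(ScalarValue, ScalarValue)> {
        Ok((self.min.evaluate()?, self.max.evaluate()?))
    }
}
```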
let batch_size = get_record_batch_memory_size(&batch);
// Reserve memory for incoming batch
acc.3.try_grow(batch_size)?;
state.reservation.try_grow(batch_size)?; |
I really like this state.reservation rather than acc.2.
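A hedged sketch of the named-field reservation pattern being praised here; BuildSideState and push_batch are made-up names, and it uses arrow's RecordBatch::get_array_memory_size instead of the PR's get_record_batch_memory_size helper for simplicity:

```rust
use arrow::record_batch::RecordBatch;
use datafusion_common::Result;
use datafusion_execution::memory_pool::MemoryReservation;

/// Illustrative only: named fields instead of tuple slots like `acc.2`.
struct BuildSideState {
    /// Batches buffered from the build side
    batches: Vec<RecordBatch>,
    /// Memory accounted against the query's memory pool
    reservation: MemoryReservation,
}

impl BuildSideState {
    fn push_batch(&mut self, batch: RecordBatch) -> Result<()> {
        // Grow the reservation before buffering so the pool can fail the
        // allocation early under memory pressure.
        self.reservation.try_grow(batch.get_array_memory_size())?;
        self.batches.push(batch);
        Ok(())
    }
}
```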
🤖: Benchmark completed
I think this is noise; Q1 has no join.
This looks real, lots of joins! (datafusion/benchmarks/queries/q8.sql, lines 13 to 21 at fc5888b)
// Use Arrow kernels for efficient min/max computation
let min_val = min_batch(array)?;
let max_val = max_batch(array)?; |
I think the accumulator's update_batch also uses min_batch, which uses min_max_batch_generic, which was the expensive function, but I will try bringing this in and see if it solves #17486.
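To make "Arrow kernels" concrete, a tiny illustration of the kind of per-batch aggregate kernel that helpers like min_batch / max_batch ultimately dispatch to for primitive types (Int32Array is chosen arbitrarily here):

```rust
use arrow::array::Int32Array;
use arrow::compute::{max, min};

/// Batch-local bounds for a primitive array using arrow's aggregate kernels.
fn batch_bounds(values: &Int32Array) -> (Option<i32>, Option<i32>) {
    (min(values), max(values))
}
```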
…ds instead of computing them after data is accumulated (apache#17444) (cherry picked from commit 5b833b9)
…ds instead of computing them after data is accumulated (apache#17444) (#46) (cherry picked from commit 5b833b9) Co-authored-by: Adrian Garcia Badaracco <1755071+adriangb@users.noreply.github.com>
This should unblock attempts to do spilling and should be no better or worse than the current approach in terms of performance / complexity
cc @jonathanc-n