feat: Push down hashes to probe side in HashJoinExec #17529
Conversation
````rust
//! ```
//! The join portion of the query should look something like this:
//!
//! ```text
````
@adriangb I think this is probably ready for an initial look when you get a chance! I plan on adding unit + fuzz tests as well, but let me know if you have any other thoughts re: testing.

@alamb Would you be able to kick off benchmarks 🙏🏾? Specifically TPC-H against parquet files. We'd want the following configuration options set:
```rust
    /// Each element represents the column bounds computed by one partition.
    bounds: Vec<PartitionBounds>,
    /// Hashes from the left (build) side, if enabled
    left_hashes: NoHashSet<u64>,
```
I found using a `HashSet` here to yield better performance than using a `Vec<Arc<dyn JoinHashMapType>>`, though it does of course result in extra allocations.

My guess is that this is primarily because `Vec<Arc<dyn JoinHashMapType>>` results in more indirection and scattered memory accesses, which likely means worse cache locality.
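For illustration, the flat-set approach probes one contiguous table per row. A minimal sketch, assuming `NoHashSet` is a `HashSet` of precomputed hashes with a pass-through hasher (the hasher below is illustrative, not this PR's actual definition):

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher: the stored values are already hashes,
/// so re-hashing them would be wasted work.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("only u64 keys are expected")
    }
    fn write_u64(&mut self, v: u64) {
        self.0 = v;
    }
}

type NoHashSet<T> = HashSet<T, BuildHasherDefault<IdentityHasher>>;

/// One contiguous lookup per probe row, no pointer chasing across partitions.
fn probe(set: &NoHashSet<u64>, hash: u64) -> bool {
    set.contains(&hash)
}
```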
I worry that the extra memory cost is prohibitively expensive: there are going to be queries that ran just fine previously but will now OOM.
Since this is opt-in, hopefully it's not as much of an issue? I've mentioned this in the config documentation:
```rust
/// When set to true, hash joins will allow passing hashes from the build
/// side to the right side of the join. This can be useful to prune rows early on,
/// but may consume more memory.
```
In general, though, I agree that we shouldn't need extra allocations here. It gets a bit tricky, because even if we combine all the hash tables from each build partition into a single, shareable table, each probe stream partition needs to be able to validate that a lookup is localized to its partition. Otherwise I believe we'll see duplicate / incorrect results.

Then again, it may just take some tweaking to the existing data structure. I haven't thought about it enough.
I think the way you'd do it is something like `(col in hash_table_1) OR (col in hash_table_2) OR (...)`
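A minimal sketch of that shape, with plain `HashSet`s standing in for the per-partition build-side tables (the real ones would be `Arc<dyn JoinHashMapType>`):

```rust
use std::collections::HashSet;

/// Probe-side membership test expressed as
/// (hash in table_1) OR (hash in table_2) OR ...
/// `any` short-circuits on the first partition that contains the hash.
fn any_partition_contains(partition_tables: &[HashSet<u64>], hash: u64) -> bool {
    partition_tables.iter().any(|table| table.contains(&hash))
}
```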
Ah yeah, that's essentially the same as what I was referring to earlier with using a `Vec<Arc<dyn JoinHashMapType>>` (sorry, I probably should've clarified more).
👋 I'd be happy to test this in our service and see the memory implications in a prod environment. I guess for any query that uses this kind of filter, the hash table shouldn't be big? Otherwise this kind of filter won't be worth that much?
Hi @LiaCastaneda - that would be great! And apologies for the delayed response.
I am curious to see how the gains diminish as the build side grows larger. I'm not sure whether the drop-off is drastic or whether there are still gains to be seen for very large build sides. To me the biggest downside is memory usage; e.g. a build side with 1 billion hashes would be ~8GB of additional memory.

Currently there are no checks to determine whether it is worth it to build the hash set and push it down. That could, though, be gated via a separate configuration option, depending on the upper bound of memory consumption a consumer is comfortable with for this feature 🤔
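For illustration, such a check could be as simple as comparing the estimated set size against a configured byte budget. All names here are hypothetical, not part of this PR:

```rust
/// Each pushed-down hash is a u64, so 8 bytes per build-side row.
const BYTES_PER_HASH: usize = std::mem::size_of::<u64>();

/// Skip the pushdown when the estimated hash-set size exceeds the budget.
/// 1 billion rows * 8 bytes ≈ 8GB, matching the estimate above.
fn should_push_down_hashes(build_side_rows: usize, max_bytes: usize) -> bool {
    build_side_rows.saturating_mul(BYTES_PER_HASH) <= max_bytes
}
```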
That makes sense. Our current plan for using this API is to use some statistics we have available (like row count) and set the `enable_dynamic...` option based on that, so it happens even before DataFusion's logical and physical planning. Would it be possible to do something similar in DataFusion itself while running the optimizer rule?

Although the option you mention is more straightforward, are you suggesting that we set a configuration option for maximum memory consumption and, when building the hashes, stop producing them and free the memory if we exceed that limit?
```rust
}

impl SharedBuildAccumulator {
    /// Creates a new [SharedBuildAccumulator] configured for the given partition mode
```
Suggested change:

```diff
-    /// Creates a new [SharedBuildAccumulator] configured for the given partition mode
+    /// Creates a new [`SharedBuildAccumulator`] configured for the given partition mode
```
Very cool! Incidentally, we were just discussing today with @gabotechs and @robtandy how to make HashJoin dynamic filter pushdown more compatible with distributed DataFusion, and how to eliminate the latency of waiting until we have the full build side before creating filters. One idea that came up was to push something like:

But for this PR the big question in my mind is going to be: is the cost of the extra evaluation of the hash worth it?
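A hedged sketch of what that extra probe-side evaluation looks like, assuming the build-side hashes arrive as a flat `HashSet<u64>`. `create_hashes` is DataFusion's existing hashing helper, but the function itself and its shape are illustrative, not this PR's actual code:

```rust
use std::collections::HashSet;

use ahash::RandomState;
use arrow::array::ArrayRef;
use datafusion_common::hash_utils::create_hashes;
use datafusion_common::Result;

/// Re-hash the probe-side join key columns and test membership against
/// the build-side hashes.
fn probe_mask(
    keys: &[ArrayRef],
    random_state: &RandomState,
    build_hashes: &HashSet<u64>,
) -> Result<Vec<bool>> {
    let mut hashes = vec![0u64; keys[0].len()];
    // This per-row re-hash is the "extra evaluation" being weighed; it must
    // use the same RandomState as the build side to produce matching hashes.
    create_hashes(keys, random_state, &mut hashes)?;
    Ok(hashes.iter().map(|h| build_hashes.contains(h)).collect())
}
```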
I am not quite sure how to set these options in the benchmarks...
Ah okay, then we probably won't see any changes, since these options need to be enabled for the changes here to take effect. Posting a comparison I did locally:
🤖: Benchmark completed
@rkrishn7 sorry I haven't looped back here. I went on a bit of a tangent exploring #17632 and then had some vacation and a team offsite. This is overall very exciting work that I think will help a lot of people.

My main concern with this change is the overhead and making a CPU / memory tradeoff decision for users. I think we might be able to ship it as an experimental feature with the flag defaulting to false, as you've done in this PR, but long term I worry that an extra 8GB of RAM consumed might be too much. Do you have any numbers on how fast these 3 different scenarios are and how much RAM they use for some queries? I don't mean to ask you to run them all, but I do remember you mentioning that you already have.
My hypothesis is that the table will look something like this:
If that were the case (which is just a guess at this point), making a query 10x faster with no extra memory use is easy to justify; everyone wants that! Choosing to make some queries 17% faster for 100% more memory use is harder to justify. If the performance difference is larger and we think it is justified in some cases, maybe we can at least try to reserve the extra memory and fall back to re-using the existing hash tables? I also think it's worth thinking about integrating your suggestion from our conversation to use an
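A minimal sketch of the reserve-then-fall-back idea, using DataFusion's `MemoryConsumer`/`MemoryReservation` API; the function and consumer name are hypothetical:

```rust
use std::sync::Arc;

use datafusion::execution::memory_pool::{MemoryConsumer, MemoryPool};

/// Try to reserve room for the pushed-down hash set up front; if the pool
/// refuses, the join would keep its existing hash tables instead of
/// building the extra set.
fn try_reserve_hash_set(pool: &Arc<dyn MemoryPool>, num_hashes: usize) -> bool {
    let mut reservation = MemoryConsumer::new("hash_join_pushdown").register(pool);
    // Estimate: 8 bytes per u64 hash (ignores HashSet bucket overhead).
    reservation
        .try_grow(num_hashes * std::mem::size_of::<u64>())
        .is_ok()
}
```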
Which issue does this PR close?
What changes are included in this PR?
- New configuration option (`hash_join_sideways_hash_passing`) to enable passing hashes from the build side to probe side scans.
- New `HashComparePhysicalExpr` that is pushed down to supported right-side scans in hash join.

Are these changes tested?

Not yet. Plan to add unit + fuzz tests.
Are there any user-facing changes?
Yes, new configuration option for hash join execution.