You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
let random_state = RandomState::with_seeds(0,0,0,0);
This means that both operators compute the same hash functions. This is problematic when a HashJoinExec has a RepartitionExec as a child: Because the RepartitionExec partitions based on the lowest k bits of the hash, each HashJoinStream finds that all the hashes it computes have the same k lowest bits! In theory this could make the HashTable work less efficiently / have more collisions than expected.
Why in theory? Because despite my best attempts, I could not construct a benchmark that showed that changing the hash seed for HJ made a performance difference. I suspect this is due to a combination of the fact that the underlying hashtable uses open addressing, hashbrown storing a bitmask based on the highest bits in the buckets to do some early filtering and other bottlenecks in the HJ.
Describe the bug
HashJoinExec and RepartitionExec both instantiate a
ahash::RandomState
with the same seed:datafusion/datafusion/physical-plan/src/repartition/mod.rs
Lines 210 to 216 in 64889c0
datafusion/datafusion/physical-plan/src/joins/hash_join.rs
Lines 387 to 389 in 64889c0
This means that both operators compute the same hash functions. This is problematic when a HashJoinExec has a RepartitionExec as a child: Because the RepartitionExec partitions based on the lowest k bits of the hash, each HashJoinStream finds that all the hashes it computes have the same k lowest bits! In theory this could make the HashTable work less efficiently / have more collisions than expected.
Why in theory? Because despite my best attempts, I could not construct a benchmark that showed that changing the hash seed for HJ made a performance difference. I suspect this is due to a combination of the fact that the underlying hashtable uses open addressing, hashbrown storing a bitmask based on the highest bits in the buckets to do some early filtering and other bottlenecks in the HJ.
Is there anything I missed?
To Reproduce
Use this branch: https://github.com/ctsk/datafusion/tree/experiment/collision-reproducer
I patched the HashJoinExec so that it emits a warning when all row hashes of a batch share common least significant bits.
Run
RUST_LOG=warn ./bench.sh run tpch
and see lots of warning emitted.Expected behavior
No response
Additional context
Fix is simple: Change the hash seed in HashJoinExec - or don't provide any seed (use
Default::default()
like we do in AggregationExec.The text was updated successfully, but these errors were encountered: