Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Base.PersistentDict/HAMT: rehashing is ineffective #54740

Open
xitology opened this issue Jun 8, 2024 · 4 comments
Open

Base.PersistentDict/HAMT: rehashing is ineffective #54740

xitology opened this issue Jun 8, 2024 · 4 comments
Labels
bug Indicates an unexpected problem or unintended behavior

Comments

@xitology
Copy link
Contributor

xitology commented Jun 8, 2024

I'm looking at the implementation of PersistentDict and HAMT in https://github.com/JuliaLang/julia/blob/master/base/hamt.jl. This implementation attempts to avoid hash collisions by using rehashing, but I believe it does not provide the intended effect.

A comment in the code claims Perfect hash collisions should not occur in practice since we perform rehashing after using 55 bits (MAX_SHIFT) of hash. Here's how rehashing is explained in Bagwell's paper: The algorithm requires that the hash can be extended to an arbitrary number of bits. This was accomplished by rehashing the key combined with an integer representing the trie level, zero being the root. Hence if two keys do give the same initial hash then the rehash has a probability of 1 in 232 of a further collision. However the Julia implementation rehashes not the original key, but the previous hash value. If two hashes collide, so will the rehashed hashes.

Here's a test case:

julia> versioninfo()
Julia Version 1.12.0-DEV.689
Commit 94a0ee8637b (2024-06-08 14:18 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 16 × AMD Ryzen 7 PRO 6850U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 16 virtual cores)

julia> const KEY1 = 0;

julia> const KEY2 = 0xeb8449dc1393171b;

julia> @assert objectid(KEY1) == objectid(KEY2)

julia> Base.PersistentDict{Any,Any}(KEY1 => true, KEY2 => false)
ERROR: Perfect hash collision
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:44
 [2] insert!
   @ ./hamt.jl:207 [inlined]
 [3] _keyvalueset
   @ ./dict.jl:879 [inlined]
 [4] set(dict::Base.PersistentDict{Any, Any}, key::UInt64, val::Bool)
   @ Base ./dict.jl:874
 [5] PersistentDict
   @ ./dict.jl:951 [inlined]
 [6] Base.PersistentDict{Any, Any}(KV::Pair{Int64, Bool}, rest::Pair{UInt64, Bool})
   @ Base ./dict.jl:957
 [7] top-level scope
   @ REPL[8]:1

Note that IdDict can handle such keys:

julia> IdDict{Any,Any}(KEY1 => true, KEY2 => false)
IdDict{Any, Any} with 2 entries:
  0                  => true
  0xeb8449dc1393171b => false
@oscardssmith oscardssmith added the bug Indicates an unexpected problem or unintended behavior label Jun 8, 2024
@vchuravy
Copy link
Member

I will need to think about what the right answer is here. Before #52193 we indeed used the key itself here.

But yes an objectid collision does cause a perfect hash collision. I am kinda curious how IdDict survives that.

@xitology
Copy link
Contributor Author

I will need to think about what the right answer is here. Before #52193 we indeed used the key itself here.

Even with pre-#52193 implementation rehashing was ineffective for some key types, e.g. Symbols. That's because hash(s::Symbol) is defined as objectid(s) and hash(s, h) depends only on objectid(s):
https://github.com/JuliaLang/julia/blob/master/base/hashing.jl#L38-L40

@vchuravy
Copy link
Member

So looking at IDdict it also "just" uses objectid and then uses the typical probe + egal check and grow the table on conflict.

For HAMT that statregy wouldn't work. We would probably need to introduce a "PerfectConflict" node with a linear probe,
but as you noted with Symbols this is a general property of using objectid as source of the hash.

@oscardssmith
Copy link
Member

I'm pretty sure it is a bug if objectid has collisions. It seems like something that could be provably made to cause miscompiles.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants