
Swiss tables design for Dict #44513

Merged: 27 commits, Mar 18, 2022
Conversation

@petvana (Member) commented Mar 8, 2022

This extends my previous PR #44332 by using the Swiss tables design described at
https://abseil.io/about/design/swisstables#swiss-tables-design-notes

The performance gain starts to be really interesting, especially for abstract types. I created a separate PR because the Swiss tables can be implemented independently of #44332, though probably with a smaller performance gain. Changes related only to the Swiss tables are in the commit "Swiss tables based hashing".

The main idea is to store the 7 highest bits of the hash in the slots. These bits can be used to test whether a key may be equal before comparing it, which limits isequal calls to almost a single one per operation (with a high-quality hashing function).
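The scheme can be sketched in a few lines (illustrative only; the helper names here are hypothetical, not necessarily the ones used in the PR):

```julia
# Sketch of the 7-bit "short hash" idea. Names are illustrative, not the
# exact ones from the PR. Each slot holds one byte of metadata; here the
# high bit marks the slot as occupied and the remaining 7 bits carry the
# top 7 bits of the key's full hash.

shorthash7(h::UInt) = (h >> (8 * sizeof(UInt) - 7)) % UInt8 | 0x80

# A full `isequal` call is only needed when the stored byte matches the
# probe key's short hash, so a non-matching key is rejected with
# probability 127/128 per occupied slot.
tag_matches(slot::UInt8, h::UInt) = slot == shorthash7(h)
```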

CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
All times are in seconds.

Master (total time 52.320 s)

|   | Type                 | SET   | GET   | GET!empty | GET!full | ITERATE |
|---|----------------------|-------|-------|-----------|----------|---------|
| 1 | Dict{Int64, Int64}   | 0.702 | 0.445 | 0.718     | 0.507    | 0.048   |
| 2 | Dict{Any, Int64}     | 5.155 | 1.242 | 4.303     | 1.52     | 0.048   |
| 3 | Dict{Int64, Any}     | 1.412 | 1.345 | 2.546     | 1.765    | 1.342   |
| 4 | Dict{Any, Any}       | 5.801 | 3.472 | 6.117     | 3.755    | 0.934   |
| 5 | Dict{String, Int64}  | 3.321 | 1.373 | 2.915     | 1.485    | 0.049   |

PR (total time 35.070 s)

|   | Type                 | SET   | GET   | GET!empty | GET!full | ITERATE |
|---|----------------------|-------|-------|-----------|----------|---------|
| 1 | Dict{Int64, Int64}   | 0.637 | 0.381 | 0.638     | 0.486    | 0.049   |
| 2 | Dict{Any, Int64}     | 2.33  | 0.532 | 2.85      | 0.684    | 0.057   |
| 3 | Dict{Int64, Any}     | 1.571 | 1.156 | 2.388     | 1.258    | 0.885   |
| 4 | Dict{Any, Any}       | 4.239 | 1.213 | 5.671     | 1.245    | 0.888   |
| 5 | Dict{String, Int64}  | 1.991 | 0.805 | 1.953     | 1.104    | 0.056   |

PR together with #44332 (total time 33.552 s)

|   | Type                 | SET   | GET   | GET!empty | GET!full | ITERATE |
|---|----------------------|-------|-------|-----------|----------|---------|
| 1 | Dict{Int64, Int64}   | 0.597 | 0.356 | 0.59      | 0.444    | 0.051   |
| 2 | Dict{Any, Int64}     | 2.788 | 0.506 | 2.278     | 0.592    | 0.053   |
| 3 | Dict{Int64, Any}     | 0.773 | 1.113 | 2.376     | 1.091    | 1.66    |
| 4 | Dict{Any, Any}       | 3.461 | 1.127 | 4.876     | 1.119    | 2.119   |
| 5 | Dict{String, Int64}  | 1.815 | 0.711 | 1.908     | 1.084    | 0.063   |

SwissDict (total time 43.458 s) from DataStructures.jl

|   | Type                      | SET   | GET   | GET!empty | GET!full | ITERATE |
|---|---------------------------|-------|-------|-----------|----------|---------|
| 1 | SwissDict{Int64, Int64}   | 0.776 | 0.359 | 0.813     | 0.459    | 0.018   |
| 2 | SwissDict{Any, Int64}     | 3.955 | 1.946 | 3.867     | 1.355    | 0.018   |
| 3 | SwissDict{Int64, Any}     | 1.257 | 1.282 | 2.105     | 1.899    | 1.328   |
| 4 | SwissDict{Any, Any}       | 5.717 | 1.636 | 6.731     | 1.66     | 0.913   |
| 5 | SwissDict{String, Int64}  | 1.715 | 1.061 | 1.513     | 1.055    | 0.018   |
Testing code:

```julia
module TestDict

using Printf
using Random
using DataStructures
using DataFrames

const n = 10_000_000

function test_set(dict, x)
    xn = length(x)
    #sizehint!(dict, xn)
    for i in 1:xn
        dict[x[i]] = i
    end
end

test_get(dict, x) = sum(dict[x[i]] for i = 1:length(x))
test_get!(dict, x) = sum(get!(dict, x[i], i) for i = 1:length(x))
test_iterate(dict) = sum(v for v = values(dict))

for D in [SwissDict, Dict]
    df = DataFrame()
    for (A,B) in [(Int, Int), (Any, Int), (Int, Any), (Any, Any), (String, Int)]
        Random.seed!(42)
        if A == String
            keys = [randstring() for i = 1:n]
        else
            keys = rand(A == Any ? Int : A, n)
        end
        keys = unique(keys)
        correct_sum = sum([1:length(keys)...])

        dict = D{A, B}()
        test_set(dict, keys)
        test_get(dict, keys)
        dict = D{A, B}()
        time_set = @elapsed test_set(dict, keys)
        time_get = @elapsed getsum = test_get(dict, keys)
        @assert getsum == correct_sum

        dict = D{A, B}()
        test_get!(dict, keys)
        test_get(dict, keys)

        dict = D{A, B}()
        time_get!_empty = @elapsed getsum = test_get!(dict, keys)
        @assert getsum == correct_sum
        time_get!_full = @elapsed getsum = test_get!(dict, keys)
        @assert getsum == correct_sum

        test_iterate(dict)
        time_iterate = @elapsed getsum = test_iterate(dict)
        @assert getsum == correct_sum

        new_data = (
            Type = typeof(dict),
            SET = time_set,
            GET = time_get,
            GET!empty = time_get!_empty,
            GET!full = time_get!_full,
            ITERATE = time_iterate,
        )
        println(new_data)
        push!(df, new_data)
    end
    total = sum(sum(x) for x in eachcol(df[:,2:end]))
    df[:,2:end] = round.(df[:,2:end]; digits = 3)
    show(stdout, MIME("text/plain"), df)
    println("\n")
    show(stdout, MIME("text/html"), df; eltypes = false, summary = false)
    println("\n")
    @printf "Total time %.3f s\n\n\n" total
end

end
```

@KristofferC (Member):

@nanosoldier runbenchmarks(ALL, vs=":master")

@petvana (Member, Author) commented Mar 8, 2022

@KristofferC I'm sorry, I forgot to update set.jl, so this nanosoldier run can be terminated.

@vtjnash (Member) commented Mar 8, 2022

@nanosoldier (Collaborator):

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

@oscardssmith (Member):

Does this PR generate similar code to the LLVM-calls from the JuliaCollections dict? If so, that's really cool!

@petvana (Member, Author) commented Mar 8, 2022

> Would be great to hear how this compares in implementation and performance to https://juliacollections.github.io/DataStructures.jl/latest/swiss_dict/ https://nextjournal.com/eulerkochy/gsoc-20-in-datastructures.jl @eulerkochy

Thank you for the comment. I've added SwissDict to the comparison. I'm quite surprised that the PR seems to be faster (at least for abstract types). However, the two implementations are very different, because the PR uses only one idea: storing part of the hash in metadata (here, the slots).

@petvana (Member, Author) commented Mar 8, 2022

> Does this PR generate similar code to the LLVM-calls from the JuliaCollections dict? If so, that's really cool!

@oscardssmith Unfortunately the PR is not that cool. :-) The idea was to keep it as simple as possible and in pure Julia. It just takes advantage of the fact that iterating over a Vector{UInt8} is fast. If you store part of the hash (7 bits) in the slots, you can rule out non-equal keys with probability 127/128 without calling isequal. As a result, linear probing is fast and the number of isequal calls is limited to almost a single call per operation (~1.05 calls with an optimal hashing function).
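A lookup using this scheme might look roughly like the following sketch (the slot encoding and names here are assumptions for illustration, not the PR's exact code):

```julia
# Hypothetical linear-probing lookup: scan the UInt8 slot vector, and call
# `isequal` only when the stored short hash matches. `0x00` denotes an
# empty slot (which terminates the probe); `sh7` is the probe key's
# short hash (7 hash bits plus an occupied marker bit).
function lookup_sketch(slots::Vector{UInt8}, keys::Vector, key, sh7::UInt8, start::Int)
    sz = length(slots)
    i = start
    while true
        sl = slots[i]
        sl == 0x00 && return -1                        # empty slot: key absent
        sl == sh7 && isequal(keys[i], key) && return i # tag matched, keys equal
        i = i == sz ? 1 : i + 1                        # wrap-around linear probe
    end
end
```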

@JeffBezanson JeffBezanson added collections Data structures holding multiple items, e.g. sets performance Must go faster labels Mar 9, 2022
@petvana petvana marked this pull request as ready for review March 10, 2022 22:11
base/dict.jl (outdated review thread)

```diff
 h.count += 1
 h.age += 1
 if index < h.idxfloor
     h.idxfloor = index
 end

-sz = length(h.keys)
+sz = length(h.pairs)
 # Rehash now if necessary
 if h.ndel >= ((3*sz)>>2) || h.count*3 > sz*2
```
A reviewer (Member) commented on this hunk:

Should this heuristic be updated? As I understand it, most of the purpose of a Swiss table is to allow higher capacity.

@petvana (Member, Author) replied:

Thank you for the question. Generally yes, but here the primary motivation is different. The PR limits the number of isequal calls, and thus the number of allocations for abstract types and the pressure on GC. These coefficients should be updated to limit memory consumption, but that fine-tuning needs much more benchmarking on various sizes. There will always be some tradeoff between speed and memory use. I propose to move such a discussion into a separate PR; meanwhile, I'll try to prepare some microbenchmarks.
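For reference, the heuristic quoted in the hunk above rehashes when deleted entries (tombstones) reach three quarters of the capacity, or when the load factor exceeds two thirds. Restated as a standalone predicate:

```julia
# The condition from the quoted hunk, factored out for clarity: `sz` is the
# slot-table capacity, `ndel` the number of deleted entries, `count` the
# number of live entries.
needs_rehash(ndel, count, sz) = ndel >= ((3 * sz) >> 2) || count * 3 > sz * 2

# With 16 slots: rehash once 12 slots are tombstones, or once the table
# holds more than two thirds of capacity (i.e. count > 10 for sz = 16).
```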

@petvana (Member, Author) replied:

From what I've tested so far, we would need to use SIMD (as in DataStructures.jl) to increase the capacity without a significant performance drop (caused by unpredictable branching). The good news is that the metadata in the slots is prepared for that. Nevertheless, I'm not sure such low-level LLVM code should go into Base, because it will be harder to read and check, and it will be platform-specific.

(Resolved review thread on base/dict.jl.)
@oscardssmith (Member):

I think this is basically ready to merge. Can you add a benchmark for iterating over a Dict to the suite? That looks like the main type of test missing.

@oscardssmith (Member) commented Mar 11, 2022

One other optimization that we should include either here or in a follow-up PR: for get, we should do a linear probe over the slots for dictionaries with 32 or fewer elements. The linear lookup should be easy for LLVM to vectorize, and should be simpler and faster. (For reference, with a perfect hash function the linear scan will have a false positive about 23% of the time for a 32-element dict.)
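The quoted false-positive rate can be sanity-checked with a quick calculation (assuming the 7-bit tags behave like independent uniform values):

```julia
# Probability that scanning n occupied slots yields at least one 7-bit tag
# collision for a key that is NOT in the table: each slot's tag matches with
# probability 1/128, so no slot matches with probability (127/128)^n.
false_positive_prob(n) = 1 - (127 / 128)^n

false_positive_prob(32)  # ≈ 0.22, consistent with the roughly-23% figure
```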

@petvana (Member, Author) commented Mar 11, 2022

> I think this is basically ready to merge. Can you add a benchmark for iterating over a Dict to the suite? That looks like the main type of test missing.

I've added a benchmark for iterating over values. There is only one extra & operation per iteration step, which comes from:

```julia
@propagate_inbounds isslotfilled(h::Dict, i::Int) = (h.slots[i] & 0x80) == 0
```

Btw, if we increase the density in the future, iteration will become faster (and closer to SwissDict).

(Two more resolved review threads on base/dict.jl.)
@fredrikekre (Member):

I guess the breakage of this PR is the same as #44332, but let's check:

@nanosoldier runtests(ALL, vs = ":master")

@JeffBezanson (Member):

I like the idea of storing a vector of pairs, but it occurs to me this can have a large cost in alignment padding, for example in a Dict{Int64, Int8}.

@petvana (Member, Author) commented Mar 11, 2022

> I like the idea of storing a vector of pairs, but it occurs to me this can have a large cost in alignment padding, for example in a Dict{Int64, Int8}.

This is a design choice and I'm NOT the right one in this conversation to decide. I'll benchmark such combinations. Now, I see the following options:

  1. Merge as it is (both the vector of pairs and the Swiss table design). - Slightly breaking (for example, a single change in CSV.jl).
  2. Split the PR and merge only the Swiss table design. - Almost no breakage.
  3. Split the PR, merge only the Swiss table design now, and merge the vector of pairs in Julia 2.0. - The most breaking change would be postponed to a major release.

@oscardssmith (Member):

Assuming benchmarks for option 2 look good, that's probably what I would want. We definitely wouldn't wait until Julia 2.0 for the Vector{Pair} change: it's not breaking, so if it's better we'll merge it nowish, and if it's worse we won't merge it even for 2.0. I also discovered that Google has an Apache-licensed implementation of Swiss hash here.

@nanosoldier (Collaborator):

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@KristofferC (Member):

Personally, I like keeping fairly orthogonal things in different PRs. So, to me, focusing only on the Swiss design here and merging it relatively quickly is preferable to bundling it with the Pair discussion, which also has to take the padding and "breakage" into account. In my opinion, make this PR contain only the Swiss design and rebase the other one on top of it.

@petvana (Member, Author) commented Mar 14, 2022

> Personally, I like keeping fairly orthogonal things in different PRs [...] make this PR only have the Swiss and rebase the other one on top of this.

I agree, so I've focused the PR only on the Swiss design and updated the comparison. Further, I've changed 0x00 to be the empty slot. Finally, I've tried to break as little code as possible by preserving ht_keyindex2!; we can easily deprecate it later if you want. Tested on the AbstractAlgebra and CSV packages.

julia/base/dict.jl, lines 370 to 371 in 8fd9617:

```julia
# Only for better backward compatibility. It can be removed in the future.
ht_keyindex2!(h::Dict, key) = ht_keyindex2_shorthash!(h, key)[1]
```

@KristofferC (Member):

Great and thanks for putting in the extra effort of keeping things backward compatible!

IMO this is mergeable but maybe @JeffBezanson wants to look it over one last time.

@petvana (Member, Author) commented Mar 14, 2022

I've gone through the code one last time and reverted a single line (fill to zeros). Thus, ready to be merged from my point of view.

@KristofferC KristofferC merged commit 85eaf4e into JuliaLang:master Mar 18, 2022