Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make Dict ordered? #34265

Open
oxinabox opened this issue Jan 5, 2020 · 22 comments
Open

make Dict ordered? #34265

oxinabox opened this issue Jan 5, 2020 · 22 comments
Labels
collections Data structures holding multiple items, e.g. sets needs decision A decision on this change is needed

Comments

@oxinabox
Copy link
Contributor

oxinabox commented Jan 5, 2020

@bkamins pointed out on slack:

julia> using Serialization;
julia> d = Dict{Symbol, Vector{Int}}(Symbol.('a':'z') .=> Ref([1]));
julia> serialize("test.bin", d);
julia> d2 = deserialize("test.bin");
julia> hcat(collect.(keys.([d, d2]))...)
26×2 Array{Symbol,2}:
 :o  :j
 :b  :x
 :p  :d
 :n  :k
 :j  :g
 :e  :u
 :c  :r
 :h  :a
 :l  :m
 :w  :y
 :x  :i
 :d  :o
 :k  :b
 :s  :p
 :v  :n
 :g  :e
 :u  :c
 :q  :h
 :r  :l
 :z  :w
 :a  :s
 :f  :v
 :m  :z
 :y  :q
 :i  :f
 :t  :t

But it doesn't have to be this way.

If we redefine things so it remembers how many slots it should have,
then it comes out the same as it came in.


hintsize(dict::AbstractDict) = length(dict)
hintsize(dict::Dict) = length(dict.keys)

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    sz = read(s.io, Int32)
    t = T();
    sizehint!(t, sz)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

function Serialization.serialize_dict_data(s::AbstractSerializer, d::AbstractDict)
    write(s.io, Int32(length(d)))
    write(s.io, Int32(length(d.slots)))
    for (k,v) in d
        serialize(s, k)
        serialize(s, v)
    end
end

But this is annoying because it changes the serialization format.
I would rather change sizehint!(::Dict) or how we call it.
The problem is that sizehint!(Dict(), 26) gives it 32 slots,
but the d had 64 slots.


In python this was one of thing things that really caught me out.
Because python salts its hashes wiith a random salt selected each time it starts.
But julia doesn't.

@oxinabox
Copy link
Contributor Author

oxinabox commented Jan 5, 2020

A vaguely justifyable (from what we normally do when rehashing)
is

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    t = T();
    sizehint!(t, n > 64000 ? n : n*2)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

But I don't know that this will always be right -- it depends on the dictionaries history of adds and deletes I guess as how many times it has been resized.

Actually, I guess add and delete paterns might also screw up things even in the solution I posted in the OP.
But maybe we can solve those patterns by rehash!ing before serialization?

@bkamins
Copy link
Member

bkamins commented Jan 5, 2020

My thinking was that it should be rather an update of serialization of Dict, if we decided to change this (other AbstractDicts might have different fields), is just to serialize and deserialize its fields verbatim. This is slightly less efficient but guarantees reproducibility.

@KristofferC
Copy link
Member

What's the reason why one would care about this? Some caching optimization (tokens) that gets invalidated?

@oxinabox
Copy link
Contributor Author

oxinabox commented Jan 5, 2020

For my purposes is mostly just counter intuitive.
Its fine for the keys to be in arbitary order.
But i expect them to stay the same across operations like serialization.

Silly example (but not so different from something that actually messed me up in python).

results::Dict{String, Float64}

function convert_to_percent!(res::Dict{String, Float64},)
    for model in keys(res)
        res[model]*=100
    end
    return res
end

function znormalize!(res::Dict{String, Float64},)
    mean_ = mean(values(res))
    std_ = std(values(res))
    for model in keys(res)
        res[model] = (res[model]-mean_)/std_
    end
    return res
end

percent, zscore = pmap([convert_to_percent!, znormalize]) do f
    f(results)
end

df = DataFrame()
df.models = keys(results)
df.scores = values(results)
df.percent = values(percent)
df.zscore = values(zscore)

@bkamins
Copy link
Member

bkamins commented Jan 5, 2020

The reason to have it is reproducibility. A I have discussed with @pszufe a user can reasonably expect that in:

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> @everywhere using Distributed

julia> @everywhere using Random

julia> @everywhere Random.seed!(1234)

julia> d = Dict(Symbol.('a':'z') .=> 1:26);

julia> pmap(x -> myid() => rand(x), [d])
1-element Array{Pair{Int64,Pair{Symbol,Int64}},1}:
 2 => (:c => 3)

julia> pmap(x -> myid() => rand(x), [d], distributed=false)
1-element Array{Pair{Int64,Pair{Symbol,Int64}},1}:
 1 => (:u => 21)

you should be able to get the same results in both pmap calls. Otherwise the result depends on where you execute the task to be performed (which preferably should not be the case).

It would even more common if someone uses https://github.com/ChrisRackauckas/ParallelDataTransfer.jl.

@andyferris
Copy link
Member

Oh wow. Why don't we serialize and deserialize Dict as a simple struct containing arrays? It would seem to be the simplest solution. (But then again, I just realized that I don't know much about serialization in Julia... are there any problems with this?)

If we redefine things so it remembers how many slots it should have, then it comes out the same as it came in.

@oxinabox unfortunately with insertions and deletions it means that even with a given number of slots there is some freedom left in the ordering, so that won't quite work. :(

@JeffBezanson
Copy link
Member

We could make this backwards-compatible for reading using a trick like this: save the negative of the length, then the sizehint size. If the first number is negative we know we have the new format.

@JeffBezanson
Copy link
Member

Or we could just make Dicts ordered.

@andyferris
Copy link
Member

Or we could just make Dicts ordered.

We'd better follow JavaScript semantics then. You know, for consistency. :trollface:

The iteration order for objects follows a certain set of rules since ES2015, but it does not (always) follow the insertion order. Simply put, the iteration order is a combination of the insertion order for strings keys, and ascending order for number-like keys:

// key order: 1, foo, bar
const obj = { "foo": "foo", "1": "1", "bar": "bar" }

@andyferris
Copy link
Member

But it's an interesting point, the default Base dictionary could be ordered, it could be useful for supporting sorting of dictionary keys or values out-of-the box, etc.

@JeffBezanson
Copy link
Member

Unfortunately we can't just save the fields of the Dict struct verbatim, since some hashes might depend on address --- rehashing has to happen on the other machine anyway. So it seems to me that an ordered dict is strictly required to get this property.

@andyferris
Copy link
Member

since some hashes might depend on address

That's a good point - thanks.

In that case - does anyone know if there some good benchmarks for OrderedCollections.jl in comparison to Dict/Set? It would be interesting to know the performance characteristics.

@JeffBezanson
Copy link
Member

We did a bunch of work on it in #10116. Iteration is way faster of course (and the change might be worth it just for that). Some other workloads are a bit faster, and IIRC the main cost is that some deletion-heavy workloads are slower.

@kmsquire
Copy link
Member

kmsquire commented Jan 6, 2020

Just to add, the implementation in OrderedCollections was derived from #10116 (although the Dict implementation which #10116 was based off of has diverged some since then.)

@JeffBezanson
Copy link
Member

I'll rename this to reflect the fact that it's equivalent to making Dict ordered, which there is a PR for but no issue AFAICT.

@JeffBezanson JeffBezanson changed the title Deserializing a Dict gives a different order make Dict ordered? Jan 7, 2020
@JeffBezanson JeffBezanson added collections Data structures holding multiple items, e.g. sets needs decision A decision on this change is needed labels Jan 7, 2020
@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 25, 2020

@JeffBezanson Can we just make Dict to be OrderedDict, for Julia 1.6, to be done with it? Or at least add it as an extra, however we do it, either copy it (your, by now modified code) as stdlib, or the full OrderedCollections.jl, which has also an alternative ordered:

It also implements LittleDict which is a ordered dictionary, that is much faster than any other AbstractDict (ordered or not) for small collections.

My longer argument here: #37761 (comment)

I realize there are not many days left of the month, and Julia 1.6 is due this month. Can we at least change the implementation and merge to get packageeval, and see, it may actually be faster, even for Julia itself. We could always revert before the end of the month.

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 27, 2020

I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict.

I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

@andyferris
Copy link
Member

andyferris commented Sep 28, 2020

I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict.

At least I would be interested in seeing a version of Julia using OrderedDict.

I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

I never had any luck making OrderedDict faster for insert/delete heavy workloads.

If ordered hash maps are inherently slower at insertion/deletion then I'd say Julia might want actually both ordered and unordered versions built in. The ordered one being the default dictionary in Base. The unordered one, being better for cache-style workloads, would be used by the compiler (since we care about latency) and if it's in Core.Compiler and of good quality (which it already is) it may as well be exported from Base as well (since users may also have cache-style workloads to optimize). One possible version of this + #37761 that follows this line of reasoning would rename Dict to UnorderedDict globally (including compiler), then add the OrderedDict code to Base but calling it Dict. If packageeval passes on that, well, that would seem very interesting, to me.

@KristofferC
Copy link
Member

I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

That sounds more like an error in the methodology. I highly doubt that loading e.g. OhMyREPL would be an incredibly 2x slower just because of this change to dictionaries so it is likely something else going on.

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 28, 2020

Yes, I suspected something about precompiling missing, but I did try using the other package first, and to see if it was only the first using, and I did actually get 2x. I guess I still shouldn't rule out that theory, with only your package hitting some codepath for dicts.

Let's say the slowdown was limited to 6% for using and dicts, would that be ok with for the ordered guarantee?

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 28, 2020

Should I push my code as an early ("[WIP]") with the slowdown I get? Instead I was exploring other possibly faster options, and those are currently failing.* I probably should recover the working version, while I remember the changes I made before recompiling again.

I'd say Julia might want actually both ordered and unordered versions built in.

Yes, then it's less important if there is slowdown. I was just trying to to the simplest thing at first, one version, replacing the other, also to see the effect. AND only comparing to the status quo, the current unordered Dict, that's probably outdated:

SwissDict is available, based on Google's:
https://abseil.io/blog/20180927-swisstables

The “flat” Swiss tables should be your default choice.

also Ordered and regular RobinDict, but I wasn't sure about adding more recent code (less tested?).

* https://discourse.julialang.org/t/ordered-dict-in-base-is-littledict-not-thread-safe-and-possibly-the-reason-i-cant-add-it-to-base/47372

@adkabo
Copy link
Contributor

adkabo commented May 24, 2022

Regardless of what data structure gets the name Dict, it is important to keep a distinction between unordered dict and ordered dict.

In ordered dicts, since equality testing depends on order, od1 == od2 implies first(od1) == first(od2), which is good.

In unordered dicts, since iteration order is undefined, ud1 == ud2 makes no promises about which element is first(ud1), so first(ud1) != first(ud2) is fine.

What goes wrong is the "compromise" policy where iteration order is guaranteed but equality does not consider it. Equality should imply that every guaranteed part of the public interface is equal between the two dicts, but if iteration order is not considered then it's possible to have cd1 == cd2 but first(cd1) != first(cd2) even though all of those operations are well defined. This seems absurd to me.

Python 3.7+ dict uses the compromise policy but I wish it didn't.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets needs decision A decision on this change is needed
Projects
None yet
8 participants