Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make unique(f, itr) and unique!(f, itr) faster #30286

Merged
merged 2 commits into from
Dec 9, 2018
Merged

Conversation

andyferris
Copy link
Member

Avoid creation of a Set{Any} to make the function-providing forms faster, and address regression of unique(itr) perfermance from #30141.

cc @raghav9-97 @oxinabox @fredrikekre

Setup:

julia> x = rand(1:100_000, 1_000_000);

Before #30141:

julia> @btime unique(x);
  13.898 ms (49 allocations: 5.00 MiB)

After #30141:

julia> @btime unique(identity, x);
  40.192 ms (644863 allocations: 14.84 MiB)

After this PR:

julia> @btime unique(identity, x);
  14.092 ms (50 allocations: 5.00 MiB)

base/set.jl Outdated Show resolved Hide resolved
base/set.jl Outdated Show resolved Hide resolved
base/set.jl Outdated
else
seen2 = convert(Set{promote_typejoin(eltype(seen), typeof(x))}, seen)
push!(seen2, t)
return _unique!(out, f, C, seen2, i)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is really clever.

base/set.jl Outdated Show resolved Hide resolved
Avoid creation of a `Set{Any}`.
@oxinabox
Copy link
Contributor

oxinabox commented Dec 6, 2018

How much does the manual specialisation on AbstractArray help performance?

@andyferris
Copy link
Member Author

How much does the manual specialisation on AbstractArray help performance?

Where's this?

push!(out, x)
if y isa eltype(seen)
push!(seen, y)
else
Copy link
Contributor

@chethega chethega Dec 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do

else
    TT = promote_typejoin(eltype(seen), typeof(y))
    if eltype(seen) === TT
        push!(seen, y)
        else
             seen2 = ...
        end
end

Idea is that we don't grow the stack if f is e.g. type-unstable between Float32 and Float64.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eh, maybe disregard. I misunderstoof promote_typejoin, which apparently doesn't do that. Point remains, do we want collect-style widening here (integer and float merge to Float64) or do we want exact widening as you did?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, is this calling _unique! recursively each time a new value is encountered?! @chethega is right that this is bad, a call should only happen when encountering a new type.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - it simply pushes if the value is of type eltype of seen. If not, and only if not, then it constructs seen2 and makes a new call.

@oxinabox
Copy link
Contributor

oxinabox commented Dec 6, 2018

How much does the manual specialisation on AbstractArray help performance?

Where's this?

Nevermind, I misread the code.
I had thought unique(f, iter) and unique(f, arr::AbstractArray) were both defined.

@JeffBezanson
Copy link
Member

Adding to 1.1 since this fixes a perf regression introduced on master.

@JeffBezanson JeffBezanson added this to the 1.1 milestone Dec 7, 2018
@andyferris
Copy link
Member Author

I’m happy for this to merge now, unless someone else wants to review?

@oxinabox
Copy link
Contributor

oxinabox commented Dec 7, 2018

Can an test be added:

the make sure the behavour of the following has not changed.

v = Vector{Any}(); push!(v, 10_000); push!(v, 10_000.0);  push!(v, 10_000f0)
@test unique(v) == Any[10_000]

It is not the behavour I expect, but it is the behavour of 1.0

@KristofferC KristofferC mentioned this pull request Dec 7, 2018
52 tasks
@andyferris
Copy link
Member Author

What didn’t you expect? The eltype, or the length 1 result?

@oxinabox
Copy link
Contributor

oxinabox commented Dec 8, 2018

The length 1 result.
I was not hugely surprised, but I do think it is the kinda behavour that might be easily changed by mistake.
I take it that behavour is maintained in this PR?

(On further thought, it would actually be hard to not maintain that behavour given all such numbers both hash the same and are isequal)

@andyferris
Copy link
Member Author

The checks are always done via isequal (which is obviously different from == and ===, the first being bad for NaN and the latter bad for things like big numbers), and that part should be exactly the same.

@andyferris
Copy link
Member Author

The hidden fact here is that in for Set uses isequal.

@andyferris
Copy link
Member Author

Ok let’s merge - no point holding up 1.1.

@andyferris andyferris merged commit c2fb1dc into master Dec 9, 2018
@andyferris andyferris deleted the ajf/faster_unique_f branch December 9, 2018 07:02
KristofferC pushed a commit that referenced this pull request Dec 10, 2018
* Make `unique(f, itr)` and `unique!(f, itr)` faster

Avoid creation of a `Set{Any}`.

* Fix unique! for resizable OffsetVector

(cherry picked from commit c2fb1dc)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants