Optimized unique and union! methods #15009

gajomi · 2016-02-10T01:05:46Z

This PR implements performance improvements for unique and union! and adds an additional method signature for unique. The basic outline of the different performance issues can be found in #14649. There are four separates commits in the pull request which implement

optimized methods for associative collections
optimized methods for collections with "small" eltypes, as well as an additional method unique(c,k), which returns at most k unique elements from a collection c. This is used in the implementation of the aforementioned optimizations, but I think is a useful method in its own right and worth exposing to the user.
optimized unique method for range types
optimized unique and union methods for sparse-like arrays such as Diagonal,Bidiagonal,SparseVector etc.

Each of these optimizations have different regimes in which performance improvements become apparent. A benchmark that illustrates performance improvements for unique in action can be seen by looking at the diff in this gist: https://gist.github.com/gajomi/0c184160042ddfc329ab/revisions (results similar for union!). Generally speaking, one sees improvements in walltime performance, although there are some minor performance regressions in memory allocation for some methods.

gajomi · 2016-02-10T01:08:45Z

NOTE: The build generates warnings about ambiguous methods along the lines of

WARNING: New definition 
    union!(Base.Set, Base.LinAlg.Tridiagonal{#T<:Any}) at linalg/tridiag.jl:495
is ambiguous with: 
    union!(Base.Set{Int8}, AbstractArray{Int8, N<:Any}) at set.jl:64.
To fix, define 
    union!(Base.Set{Int8}, Base.LinAlg.Tridiagonal{Int8})
before the new definition.

On my system the right method seems to be called anyway

julia> A = Diagonal(Int8[1,2,3,4])
4x4 Diagonal{Int8}:
 1  ⋅  ⋅  ⋅
 ⋅  2  ⋅  ⋅
 ⋅  ⋅  3  ⋅
 ⋅  ⋅  ⋅  4

julia> @which union!(Set(),A)
union!{T}(s::Set{T<:Any}, D::Diagonal{T}) at linalg/diagonal.jl:240

but it would be preferable to get rid of these warnings. As far as I can tell the recommended approach would require duplicating the functionality for collections with small eltypes to all the structured array types. I would like to avoid this and was hoping to find suggestions to work around it.

timholy · 2016-02-10T01:40:39Z

From a brief skim, this looks amazingly thorough. I'm not sure I understand the reason for all the constructs like [v[1]; zero(T); v[2:end]], can you explain?

As far as the ambiguity warning goes: dispatching on both the element type and the container type is often a formula for trouble. I'm not sure which should take precedence, but let's suppose it's the container type. Then rather than "sharing an implicit common node" (which is what the ambiguity warning is suggesting you define), I think a better approach is to cascade to differently-named methods:

## Split on container type
function union!(s::Set, x::Diagonal)
    # code to handle Diagonal
end
function union!(s::Set, Tridiagonal)
    # code to handle Tridiagonal
end
union!(s::Set, x::AbstractArray) = _union!(s, x)  # fallback method

## Split on element type
function _union!{T<:SpecialElementTypes}(s::Set{T}, x::AbstactArray{T})
    ...
end
function _union!(s::Set, x::AbstractArray)
    # the generic method
    ...
end

(Stated otherwise, from a given node in the method "tree," split on either container type OR element type, not both simultaneously.)

gajomi · 2016-02-10T01:50:48Z

EDIT: I gave a very confusing answer and so thought it best to write a new one

The most straightforward way to unique methods for structured matrices like Diagonal, Bidiagonal, etc, would be to loop over over the elements of the matrix accumulating unique entries in column major order until one reached an entry that belonged to the structure zero indexes. At this point one would push the zero entry into the unique if it had not yet been seen. After this point one would continue iterating through the values of the matrix, but skipping over all indices corresponding to structured zeros, since this element is already known to be in the unique result.

The main problem with the above "straightforward" method is that is essentially requires reimplementing all the logic of the generic unique method across a bunch of specialized ones. Furthermore, if one wants to benefit from optimizations for small eltypes as described above this has to be reimplemented all over again.

So the idea behind the implementation in the pull request is to form a new matrix that only contains the data of the nonzero regions in the correct (column major) order plus one or more zero elements in appropriate places so that the result of calling generic unique on this new matrix is the same as calling generic unique on the full matrix. Gettting this matrix requires the funny looking constructs mentioned above.

This perhaps best explained with an example:

julia> A = SymTridiagonal([1,2,3,4,5],[6,7,8,1])
5x5 SymTridiagonal{Int64}:
 1  6  ⋅  ⋅  ⋅
 6  2  7  ⋅  ⋅
 ⋅  7  3  8  ⋅
 ⋅  ⋅  8  4  1
 ⋅  ⋅  ⋅  1  5

julia> B = hcat([A.dv[1],A.ev[1]],
                [zero(Int64),A.dv[2]],
                [A.ev[2:end] A.dv[3:end]]')
2x5 Array{Int64,2}:
 1  0  7  8  1
 6  2  3  4  5

julia> unique(B)'
1x9 Array{Int64,2}:
 1  6  0  2  7  3  8  4  5

This approach allows for zeros to be found before the structured zero portion of the matrix, but without the need to branch upon their detection. Furthermore, in the case that the matrix has small eltype, the intermediate matrix B will have this eltype so that the final call uses the unique method for small eltypes. In this way, one may possibly use both optimizations from the same top-level call.

gajomi · 2016-02-10T07:05:21Z

@timholy not sure I fully grok the cascading method approach, but I'll give it a try.

tkelman · 2016-02-10T09:12:33Z

base/sparse/sparsevector.jl

+    s
+end
+function unique{Tv,Ti<:Integer}(x::SparseVector{Tv,Ti})
+    inds = sort(x.nzind)


generally these should always be sorted as part of the data structure convention

Got it, this fact was documented for SparseMatrixCSC but I didn't see it for SparseVectors. I may elect to take the liberty of adding this to comments.

yes, please do

tkelman · 2016-02-10T09:14:50Z

could use some inline comments explaining the approach and constructions used. do these work for the case of 1x1 matrices of these types?

nalimilan · 2016-02-10T09:19:40Z

base/range.jl

@@ -786,3 +786,5 @@ end

 in{T<:Integer}(x, r::Range{T}) = isinteger(x) && !isempty(r) && x>=minimum(r) && x<=maximum(r) && (mod(convert(T,x),step(r))-mod(first(r),step(r)) == 0)
 in(x::Char, r::Range{Char}) = !isempty(r) && x >= minimum(r) && x <= maximum(r) && (mod(Int(x) - Int(first(r)), step(r)) == 0)
+
+unique(x::Range) = x


Maybe this should return an Array instead (i.e. call collect)? The documentation for unique isn't very clear about what kind of "array" it returns. Not sure.

This was the main idea of #14649.

IMO unique needn't always be an array, if the datastructure is already unique e.g. Set/Range/etc. then a no-op should be preferred.

I'm fine with that. But note that sets are applied collect, as they are not AbstractArrays.

I don't think unique on Sets need collect, and return an AbstractArray, that was also in 14649. Or maybe I'm misunderstanding you.

All I'm saying is that this is what happens with this PR, and that this sounds justified, as Sets are not AbstractArrays and e.g. cannot be indexed.

Isn't this a concern?

_unique(x::Range) = x length(unique(linspace(0,0,10))) 1 length(_unique(linspace(0,0,10))) 10

I was under the impression that Ranges are not in general unique.

This was brought up in #14649. I had thought that the above behavior behavior was supposed to be a bug but reading back on the comments it looks like its still ambiguous. This PR has lingered on a bit long, so I modified the method to accommodate either case. WRT to the larger issue of confusion about range with zero step sizes see #15218.

timholy · 2016-02-10T09:56:15Z

@gajomi, I edited my post above to hopefully make the intention clearer. Here's a little diagram that might help:

"Product-specialization" is the solution that's recommended in the ambiguity warning, but over time I've found that it's rarely a good idea to follow that advice. The "cascade" model is the one I usually find more productive.

gajomi · 2016-02-12T02:30:16Z

Thanks everyone for the extensive comments. I have mostly fixed things up and as a natural result of the process of refactoring have slightly expanded the situations in which the optimization apply, so I'll have to think a bit about upgrading benchmarks.

gajomi · 2016-02-19T06:49:59Z

OK, this took quite a bit longer than I expected (there were many forks in the road I'll to revisit later), but the latest commits should address all the issues raised above.

New performance benchmarks can be found here: https://gist.github.com/gajomi/617466a1502dbf920252/revisions?diff=split

Differences from first commits:

The various union! specializations to container types versus eltypes follows the strategy suggested by @timholy.
The new union! implementation distinguishes between diagonal dispatch on specialized eltypes in both set/array arguments and single dispatch on the eltype of the array, which extends the optimization to new cases.
Where appropriate Set annotation are generalized to AbstractSet.
The test coverage was expanded to cover union! methods.

gajomi · 2016-02-19T14:27:16Z

It seems that a test associated with IntSet failed on most architectures:

@test union!(IntSet(UInt8[]),[collect(0x00:0xff); 0x01]) == IntSet(collect(0x00:0xff))

This gave a deprecation warning about pushing zeros into IntSets on my system, but otherwise passed. I removed the test and the associated method in latest commits.

tkelman · 2016-02-20T00:37:21Z

base/linalg/bidiag.jl

+        return unique(hcat([dv[1],ev[1]],
+                            zeros(T,2),
+                            [dv[2:end-1] ev[2:end]]',
+                            [zero(T),dv[end]]))


can you add a few more comments on this?

Yeah I added some comments (and ended up refactoring lower branch)/

gajomi · 2016-02-25T01:32:42Z

This seems to be good to go (modulo some potentially needed fixups stemming from nearly orthogonal issue discussed at JuliaLang/LinearAlgebra.jl#310).

tkelman · 2016-02-25T04:18:59Z

base/intset.jl

-end
+union!(s::AbstractSet, xs) = _union!(s, xs)
+
+_union!(s::AbstractSet, xs) = rawunion!(s, xs)


what is the purpose of having the extra step here?

union! for Set types needs a size hint before calling rawunion!, whereas in IntSet one gets a size hint for every push! (as the underlying container size depends on largest value).

Keno · 2020-02-28T22:07:52Z

I'm gonna call this one stale. As a general comment, it seemed like this PR combined quite a number of different changes. Generally it's easier to get things in if they are done if very small pieces. New API surface in particular usually is quite slow because it's hard to change later. That said, if anybody ever wants to pick this up again, you're welcome to of course.

gajomi mentioned this pull request Feb 10, 2016

We need more unique and related methods #14649

Closed

4 tasks

tkelman reviewed Feb 10, 2016
View reviewed changes

nalimilan reviewed Feb 10, 2016
View reviewed changes

gajomi mentioned this pull request Feb 14, 2016

Proposal for function uniqueslices #14142

Closed

gajomi force-pushed the moreuniquemethods branch 2 times, most recently from fd2d12f to 300a496 Compare February 19, 2016 06:30

gajomi force-pushed the moreuniquemethods branch from 300a496 to a8a63a8 Compare February 19, 2016 14:27

tkelman reviewed Feb 20, 2016
View reviewed changes

gajomi force-pushed the moreuniquemethods branch from a8a63a8 to 7d5a4f6 Compare February 24, 2016 15:31

gajomi added 2 commits February 24, 2016 09:26

optimized unique for range types

6e09be2

optimized unique methods for associative collections

21c2708

gajomi added 2 commits February 24, 2016 09:26

optimized unique and Set methods for small eltypes

487e38d

optimized unique and union! methods for sparse-like arrays

44e275c

gajomi force-pushed the moreuniquemethods branch from 7d5a4f6 to 44e275c Compare February 24, 2016 18:16

tkelman reviewed Feb 25, 2016
View reviewed changes

DNF2 mentioned this pull request Apr 28, 2016

Add Base.allunique #15914

Merged

Keno closed this Feb 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimized unique and union! methods #15009

Optimized unique and union! methods #15009

gajomi commented Feb 10, 2016

gajomi commented Feb 10, 2016

timholy commented Feb 10, 2016

gajomi commented Feb 10, 2016

gajomi commented Feb 10, 2016

tkelman Feb 10, 2016

gajomi Feb 12, 2016

tkelman Feb 12, 2016

tkelman commented Feb 10, 2016

nalimilan Feb 10, 2016

hayd Feb 10, 2016

nalimilan Feb 10, 2016

hayd Feb 10, 2016

nalimilan Feb 10, 2016

DNF2 Feb 20, 2016

gajomi Feb 24, 2016

timholy commented Feb 10, 2016

gajomi commented Feb 12, 2016

gajomi commented Feb 19, 2016

gajomi commented Feb 19, 2016

tkelman Feb 20, 2016

gajomi Feb 24, 2016

gajomi commented Feb 25, 2016

tkelman Feb 25, 2016

gajomi Feb 26, 2016

Keno commented Feb 28, 2020

Optimized unique and union! methods #15009

Optimized unique and union! methods #15009

Conversation

gajomi commented Feb 10, 2016

gajomi commented Feb 10, 2016

timholy commented Feb 10, 2016

gajomi commented Feb 10, 2016

gajomi commented Feb 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tkelman commented Feb 10, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timholy commented Feb 10, 2016

gajomi commented Feb 12, 2016

gajomi commented Feb 19, 2016

gajomi commented Feb 19, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gajomi commented Feb 25, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Keno commented Feb 28, 2020