RFC: Implement unique consistently #29038

laborg · 2018-09-04T18:02:25Z

This pull request updates the unique[!]([f], itr) implementation in sets.jl and achieves a couple of things:

Fixes Add unique!(f, itr) #28415 by adding the missing unqiue!(f,itr)
Fixes unique(itr) doesn't infer types for a generator #20105 but this time for unique(f,itr) instead of unique(itr), meaning that unique(f, itr) will behave as unique(itr) regarding returned types.
Improves performance
Algorithms are implemented consistently for unique[!]([f], itr)

Remarks:

The documentation currently speaks of performance improvements if the data is sorted - this is not quite correct, as the performant sorted unqiue! algorithm is only available for a couple of types (<:AbstractString, <:Real, <:Symbol). Using hasmethod to assert sortability would be slow...
I've toyed around with merging the sorting detection into the uniquifying loop, but the code quickly gets untidy and is not worth the hassle.

Benchmarks

Datasets

Dict(
    "Sparse random Any" => repeat(Any[1.,2.2,2,3,4,0x00,"String","Strong"],1000),
    "Dense random Any" => convert(Vector{Any},rand(rand([Float64,Int,Int16]),10000)),
    "Sparse sorted Int" => sort!(rand(1:100, 10000)),
    "Dense sorted Int" => sort!(rand(1:10000, 10000)),
    "Sparse random Int" => rand(1:10, 10000),
    "Dense random Int" => rand(1:10000, 10000),
    "Small random Int" => rand(1:5, 30))

`unique(itr)` and `unique(f,itr)`

 Dense random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-57.02% => improvement)
  "unique(generator)" => TrialJudgement(-36.85% => improvement)
  "unique(x*x,vector)" => TrialJudgement(-52.84% => improvement)

 Sparse sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-65.58% => improvement)
  "unique(generator)" => TrialJudgement(-5.16% => improvement)
  "unique(x*x,vector)" => TrialJudgement(-67.83% => improvement)

 Small random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-45.96% => improvement)
  "unique(generator)" => TrialJudgement(+2.59% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-46.33% => improvement)

 Sparse random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(+5.89% => regression)
  "unique(generator)" => TrialJudgement(-20.46% => improvement)
  "unique(x*x,vector)" => TrialJudgement(+8.33% => regression)

 Dense random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-60.12% => improvement)
  "unique(generator)" => TrialJudgement(-0.13% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-57.74% => improvement)

 Dense sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-60.17% => improvement)
  "unique(generator)" => TrialJudgement(+2.06% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-57.98% => improvement)

 Sparse random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-41.09% => improvement)
  "unique(generator)" => TrialJudgement(-4.47% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-42.47% => improvement)

`unique!(itr)`

  "Dense random Any" => TrialJudgement(-59.70% => improvement)
  "Sparse sorted Int" => TrialJudgement(-67.74% => improvement)
  "Small random Int" => TrialJudgement(-8.58% => improvement)
  "Sparse random Any" => TrialJudgement(+8.40% => regression)
  "Dense random Int" => TrialJudgement(-7.46% => improvement)
  "Dense sorted Int" => TrialJudgement(-46.25% => improvement)
  "Sparse random Int" => TrialJudgement(-18.38% => improvement)

Discussion

Sparse Random Any regression for unqiue! is understandable, as the default element type for the seen Set is currently Any

laborg · 2018-09-07T07:54:09Z

I managed to use a single function body for unqiue and unique! without any performance loss and rebased all changes.

laborg · 2018-09-10T14:28:16Z

If I make all uniquifying algorithms explicit and accept the code duplication (https://gist.github.com/laborg/f9af0f5c2def4edfa410c5522afbff1d) the benchmarks look sligthley better:

unique

 Dense random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-47.21% => improvement)
  "unique(generator)" => TrialJudgement(-52.73% => improvement)
  "unique(x*x,vector)" => TrialJudgement(-57.46% => improvement)

 Sparse sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-70.58% => improvement)
  "unique(generator)" => TrialJudgement(-4.99% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-70.85% => improvement)

 Small random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-45.17% => improvement)
  "unique(generator)" => TrialJudgement(-4.20% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-44.53% => improvement)

 Sparse random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(+8.28% => regression)
  "unique(generator)" => TrialJudgement(-25.97% => improvement)
  "unique(x*x,vector)" => TrialJudgement(+0.87% => invariant)

 Dense random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-62.24% => improvement)
  "unique(generator)" => TrialJudgement(-1.68% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-60.58% => improvement)

 Dense sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-63.96% => improvement)
  "unique(generator)" => TrialJudgement(-4.15% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-62.77% => improvement)

 Sparse random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-52.92% => improvement)
  "unique(generator)" => TrialJudgement(-4.99% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-52.92% => improvement)

unique!

"Dense random Any" => TrialJudgement(-73.12% => improvement)
"Sparse sorted Int" => TrialJudgement(-63.64% => improvement)
"Small random Int" => TrialJudgement(-11.05% => improvement)
"Sparse random Any" => TrialJudgement(-3.39% => invariant)
"Dense random Int" => TrialJudgement(-11.63% => improvement)
"Dense sorted Int" => TrialJudgement(-42.74% => improvement)
"Sparse random Int" => TrialJudgement(-19.79% => improvement)

I'm not sure which version is better, but I feel the consistency between implementations for unique()/unique!()/unique(f,)/unqiue!(f,) is a good thing compared to the current state.

stevengj · 2018-11-27T23:03:25Z

What is the advantage of the new unique(f, itr) vs. unique!(map(f, itr))or unique!(f, A) vs. unique!(A .= f.(A))?

StefanKarpinski · 2018-11-28T00:53:50Z

@stevengj: you mean as an implementation or a way of calling this?

stevengj · 2018-11-28T01:15:42Z

Both. My suggestion involves no extra copies of the data, and I’m skeptical that it is much slower. But also, if that works as an implementation it’s not clear why we need a new API.

StefanKarpinski · 2018-11-28T01:20:27Z

Well, given that we have unique already it seems like it should work, no?

stevengj · 2018-11-28T04:00:34Z

Not all functions on arrays need to accept a function argument. We don't have sort(f, itr), and unique seems closely related to sort.

StefanKarpinski · 2018-11-28T05:03:20Z

We don't have sort(f, itr) because it's unclear if f is the lt or by function.

laborg · 2018-11-28T07:34:05Z

unique!(map(f, itr))

is not equivalent to unique[!](f,itr):

julia> unique(x->x^2, [-3,3,1])
2-element Array{Int64,1}:
 -3
  1

julia> unique(map(x->x^2, [-3,3,1]))
2-element Array{Int64,1}:
 9
 1

Nevertheless I don't think that unqiue[!](f,)! function has been used often. My "telemetry" data is the missing issue report of the existing unique(f,) not infering generator types (#20105).

stevengj · 2018-11-28T13:54:04Z

Oh, I see. This sounds more analogous to sort(itr, by=f), so maybe we should call it unique(itr, by=f)?

laborg · 2018-11-28T15:44:45Z

Yes by would look better to me too, but it would be breaking...

StefanKarpinski · 2018-11-28T17:02:30Z

What else would the function mean do? In the case of sort there are two possible function arguments: by or lt. In this case, there's only one; requiring a keyword to pass it seems unnecessary.

stevengj · 2018-11-30T15:40:57Z

What else would the function mean do?

What I initially thought when I read the description was that unique(f, itr) meant unique(map(f, itr)).

fredrikekre · 2018-12-05T23:17:11Z

Is there anything useful here now with #30141 merged?

laborg · 2018-12-06T08:05:21Z

Well, although I've haven't redone the benchmarks, I believe this PR should have a faster implementations for all unique functions (unique(),unique!(),unique(f,),unique!(f,)), because it implements proper widening (similar to collect). Btw.: Benchmarks of unique should include all kinds of data, not just sorted Int64, as was used as demonstration in #30141...

@andyferris was talking about a faster unique implementation - I would be interested to benchmark it against this one here.

laborg · 2020-06-27T04:51:44Z

unique and Julia have changed since this attempt so closing seems the best solution.

StefanKarpinski · 2020-06-27T12:59:34Z

Sorry it didn't get used. I feel bad every time I come across a PR like this that didn't get merged 😞

laborg · 2020-06-27T15:07:02Z

Sorry it didn't get used. I feel bad every time I come across a PR like this that didn't get merged disappointed

In this case I can clearly say you shouldn't. This was more an exercise to get to know Julia better and in this part it was a success! The recent implementation by Tim Holy is nicer too.

laborg changed the title ~~Implement unique consistently~~ RFC: Implement unique consistently Sep 4, 2018

Implement unique consistently

5659d8e

laborg force-pushed the consistentunique branch from f8d7653 to 5659d8e Compare September 7, 2018 07:38

laborg closed this Jun 27, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: Implement unique consistently #29038

RFC: Implement unique consistently #29038

laborg commented Sep 4, 2018 •

edited

Loading

laborg commented Sep 7, 2018

laborg commented Sep 10, 2018

stevengj commented Nov 27, 2018

StefanKarpinski commented Nov 28, 2018

stevengj commented Nov 28, 2018

StefanKarpinski commented Nov 28, 2018

stevengj commented Nov 28, 2018

StefanKarpinski commented Nov 28, 2018

laborg commented Nov 28, 2018

stevengj commented Nov 28, 2018 •

edited

Loading

laborg commented Nov 28, 2018

StefanKarpinski commented Nov 28, 2018

stevengj commented Nov 30, 2018

fredrikekre commented Dec 5, 2018

laborg commented Dec 6, 2018

laborg commented Jun 27, 2020

StefanKarpinski commented Jun 27, 2020

laborg commented Jun 27, 2020

RFC: Implement unique consistently #29038

RFC: Implement unique consistently #29038

Conversation

laborg commented Sep 4, 2018 • edited Loading

Benchmarks

Datasets

unique(itr) and unique(f,itr)

unique!(itr)

Discussion

laborg commented Sep 7, 2018

laborg commented Sep 10, 2018

unique

unique!

stevengj commented Nov 27, 2018

StefanKarpinski commented Nov 28, 2018

stevengj commented Nov 28, 2018

StefanKarpinski commented Nov 28, 2018

stevengj commented Nov 28, 2018

StefanKarpinski commented Nov 28, 2018

laborg commented Nov 28, 2018

stevengj commented Nov 28, 2018 • edited Loading

laborg commented Nov 28, 2018

StefanKarpinski commented Nov 28, 2018

stevengj commented Nov 30, 2018

fredrikekre commented Dec 5, 2018

laborg commented Dec 6, 2018

laborg commented Jun 27, 2020

StefanKarpinski commented Jun 27, 2020

laborg commented Jun 27, 2020

laborg commented Sep 4, 2018 •

edited

Loading

`unique(itr)` and `unique(f,itr)`

`unique!(itr)`

stevengj commented Nov 28, 2018 •

edited

Loading