Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Implement unique consistently #29038

Closed
wants to merge 1 commit into from

Conversation

laborg
Copy link
Contributor

@laborg laborg commented Sep 4, 2018

This pull request updates the unique[!]([f], itr) implementation in sets.jl and achieves a couple of things:

Remarks:

  • The documentation currently speaks of performance improvements if the data is sorted - this is not quite correct, as the performant sorted unqiue! algorithm is only available for a couple of types (<:AbstractString, <:Real, <:Symbol). Using hasmethod to assert sortability would be slow...
  • I've toyed around with merging the sorting detection into the uniquifying loop, but the code quickly gets untidy and is not worth the hassle.

Benchmarks

Datasets

Dict(
    "Sparse random Any" => repeat(Any[1.,2.2,2,3,4,0x00,"String","Strong"],1000),
    "Dense random Any" => convert(Vector{Any},rand(rand([Float64,Int,Int16]),10000)),
    "Sparse sorted Int" => sort!(rand(1:100, 10000)),
    "Dense sorted Int" => sort!(rand(1:10000, 10000)),
    "Sparse random Int" => rand(1:10, 10000),
    "Dense random Int" => rand(1:10000, 10000),
    "Small random Int" => rand(1:5, 30))

unique(itr) and unique(f,itr)

 Dense random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-57.02% => improvement)
  "unique(generator)" => TrialJudgement(-36.85% => improvement)
  "unique(x*x,vector)" => TrialJudgement(-52.84% => improvement)

 Sparse sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-65.58% => improvement)
  "unique(generator)" => TrialJudgement(-5.16% => improvement)
  "unique(x*x,vector)" => TrialJudgement(-67.83% => improvement)

 Small random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-45.96% => improvement)
  "unique(generator)" => TrialJudgement(+2.59% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-46.33% => improvement)

 Sparse random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(+5.89% => regression)
  "unique(generator)" => TrialJudgement(-20.46% => improvement)
  "unique(x*x,vector)" => TrialJudgement(+8.33% => regression)

 Dense random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-60.12% => improvement)
  "unique(generator)" => TrialJudgement(-0.13% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-57.74% => improvement)

 Dense sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-60.17% => improvement)
  "unique(generator)" => TrialJudgement(+2.06% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-57.98% => improvement)

 Sparse random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-41.09% => improvement)
  "unique(generator)" => TrialJudgement(-4.47% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-42.47% => improvement)

unique!(itr)

  "Dense random Any" => TrialJudgement(-59.70% => improvement)
  "Sparse sorted Int" => TrialJudgement(-67.74% => improvement)
  "Small random Int" => TrialJudgement(-8.58% => improvement)
  "Sparse random Any" => TrialJudgement(+8.40% => regression)
  "Dense random Int" => TrialJudgement(-7.46% => improvement)
  "Dense sorted Int" => TrialJudgement(-46.25% => improvement)
  "Sparse random Int" => TrialJudgement(-18.38% => improvement)

Discussion

  • Sparse Random Any regression for unqiue! is understandable, as the default element type for the seen Set is currently Any

@laborg laborg changed the title Implement unique consistently RFC: Implement unique consistently Sep 4, 2018
@laborg
Copy link
Contributor Author

laborg commented Sep 7, 2018

I managed to use a single function body for unqiue and unique! without any performance loss and rebased all changes.

@laborg
Copy link
Contributor Author

laborg commented Sep 10, 2018

If I make all uniquifying algorithms explicit and accept the code duplication (https://gist.github.com/laborg/f9af0f5c2def4edfa410c5522afbff1d) the benchmarks look sligthley better:

unique

 Dense random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-47.21% => improvement)
  "unique(generator)" => TrialJudgement(-52.73% => improvement)
  "unique(x*x,vector)" => TrialJudgement(-57.46% => improvement)

 Sparse sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-70.58% => improvement)
  "unique(generator)" => TrialJudgement(-4.99% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-70.85% => improvement)

 Small random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-45.17% => improvement)
  "unique(generator)" => TrialJudgement(-4.20% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-44.53% => improvement)

 Sparse random Any
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(+8.28% => regression)
  "unique(generator)" => TrialJudgement(-25.97% => improvement)
  "unique(x*x,vector)" => TrialJudgement(+0.87% => invariant)

 Dense random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-62.24% => improvement)
  "unique(generator)" => TrialJudgement(-1.68% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-60.58% => improvement)

 Dense sorted Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-63.96% => improvement)
  "unique(generator)" => TrialJudgement(-4.15% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-62.77% => improvement)

 Sparse random Int
3-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "unique(x*x,generator)" => TrialJudgement(-52.92% => improvement)
  "unique(generator)" => TrialJudgement(-4.99% => invariant)
  "unique(x*x,vector)" => TrialJudgement(-52.92% => improvement)


unique!

"Dense random Any" => TrialJudgement(-73.12% => improvement)
"Sparse sorted Int" => TrialJudgement(-63.64% => improvement)
"Small random Int" => TrialJudgement(-11.05% => improvement)
"Sparse random Any" => TrialJudgement(-3.39% => invariant)
"Dense random Int" => TrialJudgement(-11.63% => improvement)
"Dense sorted Int" => TrialJudgement(-42.74% => improvement)
"Sparse random Int" => TrialJudgement(-19.79% => improvement)

I'm not sure which version is better, but I feel the consistency between implementations for unique()/unique!()/unique(f,)/unqiue!(f,) is a good thing compared to the current state.

@stevengj
Copy link
Member

What is the advantage of the new unique(f, itr) vs. unique!(map(f, itr))or unique!(f, A) vs. unique!(A .= f.(A))?

@StefanKarpinski
Copy link
Member

@stevengj: you mean as an implementation or a way of calling this?

@stevengj
Copy link
Member

Both. My suggestion involves no extra copies of the data, and I’m skeptical that it is much slower. But also, if that works as an implementation it’s not clear why we need a new API.

@StefanKarpinski
Copy link
Member

Well, given that we have unique already it seems like it should work, no?

@stevengj
Copy link
Member

Not all functions on arrays need to accept a function argument. We don't have sort(f, itr), and unique seems closely related to sort.

@StefanKarpinski
Copy link
Member

We don't have sort(f, itr) because it's unclear if f is the lt or by function.

@laborg
Copy link
Contributor Author

laborg commented Nov 28, 2018

unique!(map(f, itr))

is not equivalent to unique[!](f,itr):

julia> unique(x->x^2, [-3,3,1])
2-element Array{Int64,1}:
 -3
  1

julia> unique(map(x->x^2, [-3,3,1]))
2-element Array{Int64,1}:
 9
 1

Nevertheless I don't think that unqiue[!](f,)! function has been used often. My "telemetry" data is the missing issue report of the existing unique(f,) not infering generator types (#20105).

@stevengj
Copy link
Member

stevengj commented Nov 28, 2018

Oh, I see. This sounds more analogous to sort(itr, by=f), so maybe we should call it unique(itr, by=f)?

@laborg
Copy link
Contributor Author

laborg commented Nov 28, 2018

Yes by would look better to me too, but it would be breaking...

@StefanKarpinski
Copy link
Member

What else would the function mean do? In the case of sort there are two possible function arguments: by or lt. In this case, there's only one; requiring a keyword to pass it seems unnecessary.

@stevengj
Copy link
Member

What else would the function mean do?

What I initially thought when I read the description was that unique(f, itr) meant unique(map(f, itr)).

@fredrikekre
Copy link
Member

Is there anything useful here now with #30141 merged?

@laborg
Copy link
Contributor Author

laborg commented Dec 6, 2018

Well, although I've haven't redone the benchmarks, I believe this PR should have a faster implementations for all unique functions (unique(),unique!(),unique(f,),unique!(f,)), because it implements proper widening (similar to collect). Btw.: Benchmarks of unique should include all kinds of data, not just sorted Int64, as was used as demonstration in #30141...

@andyferris was talking about a faster unique implementation - I would be interested to benchmark it against this one here.

@laborg
Copy link
Contributor Author

laborg commented Jun 27, 2020

unique and Julia have changed since this attempt so closing seems the best solution.

@laborg laborg closed this Jun 27, 2020
@StefanKarpinski
Copy link
Member

Sorry it didn't get used. I feel bad every time I come across a PR like this that didn't get merged 😞

@laborg
Copy link
Contributor Author

laborg commented Jun 27, 2020

Sorry it didn't get used. I feel bad every time I come across a PR like this that didn't get merged disappointed

In this case I can clearly say you shouldn't. This was more an exercise to get to know Julia better and in this part it was a success! The recent implementation by Tim Holy is nicer too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add unique!(f, itr) unique(itr) doesn't infer types for a generator
4 participants