Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for function uniqueslices #14142

Closed
wants to merge 7 commits into from
Closed
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
190 changes: 190 additions & 0 deletions base/multidimensional.jl
Original file line number Diff line number Diff line change
Expand Up @@ -863,3 +863,193 @@ If `dim` is specified, returns unique regions of the array `itr` along `dim`.
@nref $N A d->d == dim ? sort!(uniquerows) : (1:size(A, d))
end
end

"""
C, ia, ib, ic = uniqueind(itr, dim)

A function that operates similiarly to `unique(itr,dim)` but returns multiple
output arguments having the following properties:

C - the unique elements of the array `itr` along the selected dimension `dim`
ia - a Vector{Int} of indices such that:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide a general description of the meaning of this vector? Then the examples below will make it more concrete. Without this "and so forth for higher dimensional arrays." is a bit imprecise.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I wonder whether the names of the arguments couldn't be improved.Does ia stand for indices in A and ic for indices in C? In that case, as in MATLAB, you'd better rename itr to a, and explicit this meaning in parentheses.

It also means ib doesn't stand for anything. Actually, I think you'd better not return it at all, as it can trivially be computed from ic IIUC. At the very least, it should have a more explicit name and be returned as the last result, as it is quite different from ia and ib.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choice of names ia and ic were based on the MATLAB(R) names but can certainly be changed as desired by the community. Choice of itr was just based on making minimal changes from the existing unique function, but A would be a closer choice to the existing MATLAB name. EDIT: itr has been changed to A in the working copy of a doc string.

Along with C, the vector of vectors ib is essentially part of what is returned from the group function in q, also known as the left (=) operator in K. Returning this output is functionality that was specifically requested of me to implement, but is also why I mentioned above that it might be desired to have multiple functions returning different outputs instead of just one returning the four outputs included here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choice of names ia and ic were based on the MATLAB(R) names but can certainly be changed as desired by the community. Choice of itr was just based on making minimal changes from the existing unique function, but A would be a closer choice to the existing MATLAB name.

In Julia too, itr is usually reserved for general iterables, and a or A used for arrays.

Along with C, the vector of vectors ib is essentially part of what is returned from the group function in q, also known as the left (=) operator in K. Returning this output is functionality that was specifically requested of me to implement, but is also why I mentioned above that it might be desired to have multiple functions returning different outputs instead of just one returning the four outputs included here.

Maybe we can find a short way of computing it from ic, which could be given in the docs for those who need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can find a short way of computing it from ic, which could be given in the docs for those who need it?

For people coming to Julia from K/q, having to execute two functions to get back the results of what is a single character operator (=A) applied to an array is incredibly verbose. For that part of the audience, a single function call will likely be preferred.

`itr[ia] == C` returns `true` if `itr` is a one-dimensional array and `dim == 1`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would write this more simply as "itr[ia] == C", without " returns true"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested.

`itr[ia,:] == C` returns `true` if `itr` is a two-dimensional array and `dim == 1`
`itr[:,ia] == C` returns `true` if `itr` is a two-dimensional array and `dim == 2`
`itr[:,:,ia] == C` returns `true if `itr` is a three-dimensional array and `dim == 3`
and so forth for higher dimensional arrays.
ib - a Vector{Vector{Int}} where each Vector{Int} contains the indices associated with the
individual entries of `C` along the dimension `dim`
ic - A Vector{Int} of indices such that:
`C[ic] == itr` returns `true` if `itr` is a one-dimensional array and `dim == 1`
`C[ic,:] == itr` returns `true` if `itr` is a two-dimensional array and `dim == 1`
`C[:,ic] == itr` returns `true` if `itr` is a two-dimensional array and `dim == 2`
`C[:,:,ic] == itr` returns `true if `itr` is a three-dimensional array and `dim == 3`
and so forth for higher dimensional arrays.

Examples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these should probably be formatted as doctests via ```jldoctest


julia> itr = [1;2;3;2;3;5;6;5;7;1];
julia> C, ia, ib, ic = uniqueind(itr,1)
([1,2,3,5,6,7],[1,2,3,6,7,9],[[1,10],[2,4],[3,5],[6,8],[7],[9]],[1,2,3,2,3,4,5,4,6,1])
julia> C[ic] == itr
true
julia> itr[ia] == C
true
julia> ib
6-element Array{Array{Int64,1},1}:
[1,10]
[2,4]
[3,5]
[6,8]
[7]
[9]

julia> D = rand(Int,2,3,1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better always use the same values for doctests I think. That will allow running them as tests and arrays with no duplicated values would be less interesting to understand how this works [EDIT: nevermind, I see you duplicate the values when concatenating].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a call to srand prior to converting to a doctest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd simply create arrays from fixed inputs. This has the advantage of allowing to use small positive integers, making the output easier to read.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated as suggested.

julia> E = rand(Int,2,3,1);
julia> F = rand(Int,2,3,1);
julia> A = cat(3, D, F, E, E, D)
2x3x5 Array{Int8,3}:
[:, :, 1] =
2 -95 -25
-94 -60 -74

[:, :, 2] =
-125 71 -58
31 -71 -1

[:, :, 3] =
-79 -33 -46
80 76 -85

[:, :, 4] =
-79 -33 -46
80 76 -85

[:, :, 5] =
2 -95 -25
-94 -60 -74

julia> C, ia, ib, ic = uniqueind(A,3)
(
2x3x3 Array{Int8,3}:
[:, :, 1] =
2 -95 -25
-94 -60 -74

[:, :, 2] =
-125 71 -58
31 -71 -1

[:, :, 3] =
-79 -33 -46
80 76 -85,

[1,2,3],[[1,5],[2],[3,4]],[1,2,3,3,1])

julia> A[:,:,ia] == C
true
julia> C[:,:,ic] == A
true
julia> ib
3-element Array{Array{Int64,1},1}:
[1,5]
[2]
[3,4]

"""
@generated function uniqueind{T,N}(A::AbstractArray{T,N}, dim::Int)
quote
1 <= dim <= $N || return copy(A)
hashes = zeros(UInt, size(A, dim))

# Compute hash for each row
k = 0
@nloops $N i A d->(if d == dim; k = i_d; end) begin
@inbounds hashes[k] = hash(hashes[k], hash((@nref $N A i)))
end

# Collect index of first row for each hash
uniquerow = Array(Int, size(A, dim))
ic = Array(Int, size(A, dim))
ia = Int[]
firstrow = Dict{Prehashed,Int}()
icdict = Dict{Int,Int}()
iadict = Dict{UInt,Int}()
h = 0
for k = 1:size(A, dim)
tmp = get!(firstrow, Prehashed(hashes[k]), k)
uniquerow[k] = tmp
if !haskey(icdict,tmp)
h += 1
icdict[tmp] = h
ic[k] = h
else
ic[k] = icdict[tmp]
end
if !haskey(iadict,hashes[k])
iadict[hashes[k]] = k
push!(ia,k)
end
end
uniquerows = collect(values(firstrow))

# Check for collisions
collided = falses(size(A, dim))
@inbounds begin
@nloops $N i A d->(if d == dim
k = i_d
j_d = uniquerow[k]
else
j_d = i_d
end) begin
if (@nref $N A j) != (@nref $N A i)
collided[k] = true
end
end
end

if any(collided)
nowcollided = BitArray(size(A, dim))
while any(collided)
# Collect index of first row for each collided hash
empty!(firstrow)
for j = 1:size(A, dim)
collided[j] || continue
uniquerow[j] = get!(firstrow, Prehashed(hashes[j]), j)
end
for v in values(firstrow)
push!(uniquerows, v)
end

# Check for collisions
fill!(nowcollided, false)
@nloops $N i A d->begin
if d == dim
k = i_d
j_d = uniquerow[k]
(!collided[k] || j_d == k) && continue
else
j_d = i_d
end
end begin
if (@nref $N A j) != (@nref $N A i)
nowcollided[k] = true
end
end
(collided, nowcollided) = (nowcollided, collided)
end
end

C = @nref $N A d->d == dim ? sort!(uniquerows) : (1:size(A, d))

ib = Array(Vector{Int},length(ia))
for k = 1:length(ia)
ib[k] = Int[]
end
for h = 1:length(ic)
push!(ib[ic[h]], h)
end

return C, ia, ib, ic
end
end