-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for function uniqueslices #14142
Conversation
This commit introduces a function `uniqueind` that is a slight modification of the existing `unique` function accepting a `dim` argument but returns an additional three index vectors. The outputs, as described in the docstring, are as follows: `C` - an array of the unique elements/rows/columns/hyperplanes of the input AbstractArray `itr` along the dimension `dim` `ia` - a vector of index values such that slice indexing into `itr`along dimension `dim` re-produces `C`. For example, if `itr` is a vector, `itr[ia] == C` returns `true` `ib` - a vector of vectors of integers, where each vector of integers contains the index positions of the unique elements/rows/columns/planes within `C` `ic` - a vector of index values such that slice indexing into `C` along dimension `dim` re-rproduces `itr` For example, if `itr` is a vector, `C[ic] == itr` returns `true` The presence of `ia`, `ib`, and `ic` are meant to mimic the behaviors (but with potentially different output order) of various unique and group functions found in a few other technical computing languages. This function provides all three of those outputs from a single function, while some other languages might only produce one or two arrays of index vectors from a given function. It might be decided upon review that instead of a single function returning all four outputs, there should be multiple functions that return separate output arguments and more closely mimic similar functions from other languages. This PR is in relation to issue #1845.
`C[:,:,ic] == itr` returns `true if `itr` is a three-dimensional array and `dim == 3` | ||
and so forth for higher dimensional arrays. | ||
|
||
Examples: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these should probably be formatted as doctests via ```jldoctest
Removing a trailing whitespace that caused a build failure.
Missed a few last time.
One more trailing whitespace.
|
||
C - the unique elements of the array `itr` along the selected dimension `dim` | ||
ia - a Vector{Int} of indices such that: | ||
`itr[ia] == C` returns `true` if `itr` is a one-dimensional array and `dim == 1` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would write this more simply as "itr[ia] == C
", without " returns true
"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated as suggested.
Thanks for this. Here are a few general comments. One is about the naming. I would be fine with A more complex question concerns composability, as noted by @StefanKarpinski here: #1845 (comment) Do you have any idea about whether it would be doable to base Another point is that it would be great to have a generic implementation for iterables, which wouldn't support the Finally, you should definitely add tests for this. You can take inspiration from those exiting for |
In response to the four points from @nalimilan above:
|
I hadn't noticed this difference with |
CC: @simonster, who wrote |
The output of |
OK, I misunderstood your reply, I thought you said |
When the |
That seems to have more in common with |
Suggestions for separate names (and separate functions with different combinations of outputs) are welcome and encouraged. This issue provides a set of possible output arguments that have been found useful in different technical computing languages. Defining the appropriate function names and api(s) for each possible set of output arguments that might need to be returned is what will hopefully be the result that closes this issue and #1845. |
This commit makes the simple doc updates suggested in #14142 to change `itr` to `A` and use a fixed set of integers, instead of random integers, when constructing the multidimensional array examples.
Performing a search for synonyms of "unique", here are a few possible terms: synonyms: distinctive, distinct, individual, special, idiosyncratic; single, sole, lone, unrepeated, unrepeatable, solitary, exclusive, rare, uncommon, unusual, sui generis; What might people think of naming the function(s) |
|
Along with Since no separate conditionals or functions to apply are being passed here, I would guess that this is a bit less generic than most The Since all of the English words are taken, why not try using a similar word from another language (according to Google Translate): English (or French or Latin) "unique" to: Spanish - |
I like Seriously, ( |
@StefanKarpinski Thoughts on Needs tests and also travis to pass. |
Updating function name to `uniqueslices` primarily just to kick off a new round of Travis and AppVeyor tests, because the Travis tests failed last time for reasons seemingly unrelated to the contents of this file.
Updated the name of the function to I'll add tests once the name is approved (and more so once I wrap up some other work today). |
The current Travis failure is due to the test for |
more likely that error is the fault of recent libuv changes, not anything about needing a rebase
|
I can see the utility of this function but the signature just seems so fiddly and specific, I'm not sure... |
I'm wondering if we can return one thing that can be used to easily compute the other things. For example, if what's returned is the vector of vectors of indices where each vector corresponds to a unique slice and gives the indices where that slice occurs. Is this |
Yes, that is There are use cases where the output information one wants to use could be:
The inspiration for me doing this work is a code I am working at present that needs either What @StefanKarpinski states here:
is actually how to construct ia = map(first,ib) # Equivalent to what I am currently doing in this PR
ia = map(last,ib) # Definitely something that could be useful as well And C == A[:,:,ia,:] # For whatever dimension `dim` to which `ia` should correspond (in this example dim == 3) In looking at the code again, there is a way to construct A method for doing so could be the following (which I am sure could be optimized from the below): n = 0
for i = 1:length(ib)
n += length(ib[i])
end
ic = Array(Int,n)
for i = 1:length(ib)
for k = 1:length(ib[i])
ic[ib[i][k]] = i
end
end All in all, in terms of returned index vectors, one only needs to return Defining the APIs for the wrapper functions is a useful exercise, because there is no need for people to copy code around to calculate So I guess I am now back to my main question from earlier in this thread. What would be appropriate function names and argument signatures for calculating and returning different sets of these outputs? |
I created two separate gists with possibilities for the core function and various wrappers for function signatures having various outputs: The included function names are currently The difference between the two gists is mainly related to which outputs the core function Comments and function name suggestions greatly appreciated. |
Bump. What do we need to do to get this merged now? |
Same tasks as in December...
|
Some comments on the task list:
The proposal for the core function in the first gist makes sense to me, in that everything can be computed in that everything can be computed from I should like to note that if the current
While the exact tests might have to wait for the implementation, we can talk about test arrays beforehand. One of the main use case I see for unique slice like methods (I recall people writing similar Python function) is in a network theory context, where |
Thinking along these lines, I also wanted to ask a question (to @AndyGreenwell and maybe also to @simonster) about the algorithm design and potential performance consequences. Rather then eagerly computing a hash for an entire row and then checking for collisions, would it make sense to continuously accumulate partial hashes only while they are uncollided? For example, in the case of the array of julia> A
4x5 Array{Int8,2}:
107 28 22 121 106
8 51 5 60 116
36 41 57 106 49
86 62 97 13 8 One can see that each row/column slice is unique just by inspecting the first element in each column/row. In general, whenever an array is large and has "sufficiency diverse" entries it would be preferable to verify the absence of collisions earlier rather than later. Now in the case of a matrix of Is this too ambitious or does it seem sensible? |
I agree with @gajomi that merging the core function under the name Implementation-wise, do you really need generated functions? Cc: @timholy |
My suspicion is that the most fundamental operation here is assigning group numbers to each slice – i.e. computing ic = groupslices(A, dim)
ib = groupinds(ic)
ia1 = firstinds(ic)
ia2 = map(first,ib)
ia3 = map(last,ib) My hunch (unverified) is that computing |
This commit adds updated argument signatures for the functionality discussed in JuliaLang#14142 wherein each of these functions only returns one single output argument containing index vectors of various purposes related to the unique slices currently returned from the unique function taking two input arguments. Still needs tests and doc integration.
The function signatures suggested by @StefanKarpinski are implemented in the following commit in my current fork of Julia. The only modifications to the API he provides above are that I have two methods for The line The actual implementations of these functions cut out many of the operations used for calculating the output array ( Please review and comment, and then I will create an updated PR based on any requested changes. |
@AndyGreenwell, did you happen to benchmark to see if my hunches about combined vs separated speed were correct? |
I'm not sure that the code in that commit is right, since it doesn't seem to handle hash collisions. |
@StefanKarpinski I have not performed comparison timings as of yet, but will do so. @simonster The code in the commit to my current fork does not change the method I used for determining the entries in output vector Consequently, if modifications need to be made to the determination of the entries populating |
This commit alters the current groupslices function to return the vector uniquerow that was originally calculated within the existing unique function. The values contained within uniquerow for cases where there are no hash collisions are actually equal to what I was calculating in array ic. As @simonster pointed out in comment JuliaLang#14142 (comment) the previous commit was not taking into account hash collisions for the values in ic. As uniquerow within unique was already calculating the values in ic, taking into account hash collisons, and updating its values accordingly, we can just return uniquerow from groupslices. For continuity with the conversation in JuliaLang#14142, I currently have assigned ic as an alias for uniquerow, but that can certainly be removed.
So it seems that I have updated the version of
Please review that commit and let me know if there are any comments. If approved, then I will bring my fork up to date and submit a PR. |
Looks like it will need to be a new PR, did you delete your fork then recreate it or something? It now says "from |
This commit adds updated argument signatures for the functionality discussed in JuliaLang#14142 wherein each of these functions only returns one single output argument containing index vectors of various purposes related to the unique slices currently returned from the unique function taking two input arguments. Still needs tests and doc integration.
This commit alters the current groupslices function to return the vector uniquerow that was originally calculated within the existing unique function. The values contained within uniquerow for cases where there are no hash collisions are actually equal to what I was calculating in array ic. As @simonster pointed out in comment JuliaLang#14142 (comment) the previous commit was not taking into account hash collisions for the values in ic. As uniquerow within unique was already calculating the values in ic, taking into account hash collisons, and updating its values accordingly, we can just return uniquerow from groupslices. For continuity with the conversation in JuliaLang#14142, I currently have assigned ic as an alias for uniquerow, but that can certainly be removed.
This commit adds updated argument signatures for the functionality discussed in JuliaLang#14142 wherein each of these functions only returns one single output argument containing index vectors of various purposes related to the unique slices currently returned from the unique function taking two input arguments. Still needs tests and doc integration.
This commit alters the current groupslices function to return the vector uniquerow that was originally calculated within the existing unique function. The values contained within uniquerow for cases where there are no hash collisions are actually equal to what I was calculating in array ic. As @simonster pointed out in comment JuliaLang#14142 (comment) the previous commit was not taking into account hash collisions for the values in ic. As uniquerow within unique was already calculating the values in ic, taking into account hash collisons, and updating its values accordingly, we can just return uniquerow from groupslices. For continuity with the conversation in JuliaLang#14142, I currently have assigned ic as an alias for uniquerow, but that can certainly be removed. Change error to ArgumentError uniquerow is ic, and seems to handle hash collisions already Add initial tests for multidimensional.jl for groupslices and related functions This commit adds some initial tests for the functions groupslices, groupinds, firstinds, and lastinds. The initial tests are very basic and do not currently include a test for a multidimensional input array that would have hash collisions between slices.
New PR #15503 has been created to replace functionality in this PR. Closing this PR. |
This commit introduces a function
uniqueind
that is a slight modification of the existingunique
function accepting adim
argument but returns an additional three index vectors.The outputs, as described in the docstring, are as follows:
C
- an array of the unique elements/rows/columns/hyperplanes of the input AbstractArrayitr
along the dimensiondim
ia
- a vector of index values such that slice indexing intoitr
along dimensiondim
re-producesC
.For example, if
itr
is a vector,itr[ia] == C
returnstrue
ib
- a vector of vectors of integers, where each vector of integers contains the index positions of the unique elements/rows/columns/hyperplanes withinC
ic
- a vector of index values such that slice indexing intoC
along dimensiondim
re-producesitr
For example, if
itr
is a vector,C[ic] == itr
returnstrue
The presence of
ia
,ib
, andic
are meant to mimic the behaviors (but with potentially different output order) of various unique and group functions found in a few other technical computing languages. This function provides all three of those outputs from a single function, while some other languages might only produce one or two arrays of index vectors from a given function.It might be decided upon review that instead of a single function returning all four outputs, there should be multiple functions that return separate output arguments and more closely mimic similar functions from other languages.
This PR is in relation to issue #1845.