-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Julep: Generalize indexing with and by Associative
s
#24019
Comments
I should have also mentioned that "logical indexing" |
This looks really cool! I haven't had the chance to look through it thoroughly yet, but it's very appealing to me that arrays can be thought of as associative containers with a shape. |
Would it be a goal of AssociativeArray to support say , if |
@macd Yes, but does |
Very cool, does work as I expect. I didn't get the interoperability issue, but it makes perfect sense. |
As someone who has worked on this kind of stuff, I found it interesting that I was initially surprised by some of your choices, until you pointed out the guiding principle. I completely agree that |
Thanks @timholy. I should point out that this is 100% motivated by a few of your comments along the lines of "arrays are really just some special kind of associative container" (in regards to motivating/justifying offset arrays) and I've been trying to make sense of that since :) |
OK, so far I'm seeing positive feedback here, at least on the indexing behavior. I'd like to know people's opinions on my list of suggested breaking changes that IMO would make all this come together consistently. Enumerated:
Anyway, some design feedback here would help in creating an implementation. |
Very interesting indeed. Still pondering this. One more major breaking point: Allowing non-scalar indexing on Associatives means you'll no longer be able to have arrays or associatives as keys. That is,
|
@mbauman Yes this is an interesting case. Note that for cases where the indices are strongly typed (such as Also - it would help if I knew of a use case for such index types. I know we've worried about hashing of vectors and unit ranges and so-on, but I haven't come across a use for this yet... |
In any case we need a nicer way to express that an array should be treated as scalar in an expression where arrays usually are splashed, broadcasted or unboxed in other ways, compare the crutch |
Yes, this is also currently a problem for non-scalar indexed assignments for arrays of arrays. That is, I have a very strong distaste for changing semantics based upon eltype changes. |
That does seem like a tricky case... I'm wondering what the solution is. We could always favour scalar indexing unless we can prove it doesn't work, or something... (It seems to me that the scalar case is the more fundamental kind of indexing (at least of |
Then this might no longer be an Associative data-structure. It’s dangerous to generalize the behavior of a concretion, Wikipedia also notes some examples of common Associative data-structures which are not uniquely indexable, including HTML queries and header. They are ordered and can be iterated as pairs, but attempting to index them may return a non-unique result (one name for these is: ordered-multimaps). |
It's worth noting that if I were developing non-scalar indexing from scratch right now, I'd totally detangle the scalar and non-scalar behaviors with dot-broadcasting. Actually, this makes me realize that we could deprecate the non-scalar indexed assignment of many values to many indices to |
@vtjnash. It is interesting to consider all the possible data structures here. I guess in some senses I've mostly thought
I'm pretty sure the uniqueness of the keys is pretty important here - we'll need to make some assumptions to make a good indexing interface suitable for generic programming. For my uses, I don't see any problem with using the I consider the more "exotic" multi-maps and XML structures as being something a bit different to a Julia One interesting thing from the Wikipedia article that I have been thinking about is that there is a difference between updating the value corresponding to an existing index, and adding a new index to the collection. It's extremely convenient that I can add new indices to a |
This seems like a reasonable idea to me. (It reminds me, I've of done something like |
That should work just fine without the
|
Oh, that's nice to know :) You've now got me wondering if it should be |
The major issue with |
Only because |
I think there's two operations here being described as We can extend this function to multiple associative collections by making the definition varargs: |
I like this a lot. The disconnect between associatives and arrays has always bothered me. I like the notion of array as a specialization of associative collections. The most visible cost of changing this would be having to write
If we change the model for associative collections from being a collection of pairs to being an indexed collection of values and accordingly iterate values instead of pairs, that would clarify the behavior of |
Another question here: should we continue to treat But we may want to deprecate the implicit tuplization behavior here in order to make the two more similar. |
That syntax has always seemed to me like a convenient poor man's sparse matrix. |
That's the smoking gun. The usual structure of a |
Memoization is a big one.
Mappings and functions are the same thing, so why not replace your proposed indexing with function application: It would be error-prone to write generic code where |
I've been thinking about sets and keys. Here's a relatively sane property that I'm considering which may be desirable across arrays and associatives: a[keys(a)] == a That is, both the keys and values will be preserved by this indexing operation. (Perhaps this will be To achieve this:
I think if we have 1-3 for v1.0, things may be pretty consistent and tied together nicely and ready for 4 in the future. The final thing I'd suggest for v1.0 is to allow parsing (not lowering) of |
This property is true of arrays, but is not true of all associative containers. One counter-example is a multi-dicts, but another more commonly encountered counter-example is the min-heap. |
You keep talking about multi-dicts, but they are not, in my view, a thing. If |
You also keep trying to claim this, despite that Julia provides an implementation of one (https://github.com/JuliaLang/julia/blob/master/base/dict.jl#L747), and which also fails to conform to your suggestion that it's just a "Dict{K, Set{V}}" |
An ImmutableDict is "persistent" (I believe that term is used in Clojure) in that previous mappings aren't destroyed. But I think the fact that this is exposed by iteration is more of a bug. |
What does indexing mean here? Wouldn't the user-facing interface be that of a priority queue (i.e. deque- or bag-like) rather than as an indexable? The OP is specifically referring to collections which provide mappings from a unique set of keys to a single value, which is what I thought |
It's entirely reasonable to additionally define and create another persistent dict type without needing to first accuse a different data structure of being a bug. It's a feature that we can also implement a ordered multi-dict with the same interface as other associative mappings.
By restricting the classification, it'll be a tradeoff. As with any classification system, making the definition more specific ("a unique mapping from a key to a single value") sometimes makes it more useful (you can assume indexing is possible). But it also then excludes other possibilities (the counter-examples I gave), which now will form some new categories. In these cases, these perhaps might be a
There are alternative ways of writing the OP which don’t make this choice of definition. Here’s one generic implementation option that instead assumes the existence of an appropriate definition for function getindex.(iterable, keys::Associative)
newmap = similar(first(eltype(keys)) => last(eltype(pairs(iterable)), iterable)
for (newkey, oldkey) in keys
for (key, value) in pairs(iterable)
if key == oldkey
push!(newmap, newkey => value)
end
end
end
return newmap
end
function getindex.(iterable, keys::AbstractArray)
newarray = similar(last(eltype(pairs(iterable))), iterable)
for oldkey in keys
for (key, value) in pairs(iterable)
if key == oldkey
push!(newarray, value)
end
end
end
return newarray
end |
FYI I've started some prototyping at https://github.com/andyferris/Indexing.jl |
I think the Indexing.jl prototype is working pretty well now. It's not too much code (it probably needs a bit more code to make it really fast) so it doesn't seem too scary to port to It's probably a good time to think about future syntax, in case depredations are necessary. I would suggest ending up with something like: a[i] --> getindex(a, i) # scalar only
a.[inds] --> getindices(a, inds) # or view(a, inds)?
a[i] = v --> setindex!(a, v, i) # scalar only
a.[inds] = v --> setindices!(a, v, inds)
a[i] .= v --> broadcast!(identity, getindex(a, i), v)
a.[inds] .= values --> broadcast!(identity, view(a, inds), values) Note the lack of IIUC we'd need to deprecate |
FYI, the second bullet point has been implemented in n = NamedArrays.NamedArray(rand(3), ([:a, :b, :c],))
@show n[[:c, :a]];
n[[:c, :a]] = 2-element Named Array{Float64,1}
A │
───┼─────────
:c │ 0.410735
:a │ 0.253642 (ended up here searching for a way to extend REPL completion for Associative-like structures) |
Technically, the rules described further in the OP might imply that indexing with a
which isn't exactly ergonomic for your purposes! (The way I think about it is the rules described in the OP is basically equivalent to |
I wasn't really suggesting that NamedArray is a satisfying solution to your OP. In fact I struggled with how to implement I suppose that for type stability reasons (I probably am the most type instable programmer ever) NamedArrays (taking the role of the Associative) should implement your OP, and always produce the type of the indexing array with elements of the type of the indexed array. |
I thought it might be relevant to reference Dictionaries.jl which was recently released, and implements a system for dictionary non-scalar indexing in combination with Indexing.jl. |
I've always loved how in Julia (and MATLAB) one can create a new array from an old one, using what is now the APL indexing rules. Basically if you index a collection of values with a collection of indices, you get a new collection of the indexed values. Beautiful, simple. Indexing has also been extended by allowing arrays that don't use 1-based indexing by e.g. the OffsetArrays.jl package.
I'm not sure if this issue exists elsewhere as its own entity (cleaning up distinctions between arrays and associatives was surely mentioned in #20402 and this Julep seems to be a logical extension of #22907), but here I propose specifically that we extend indexing of and by
Associative
and make related changes so that the semantics are consistent across these two types of container. I prototyped ideas at https://github.com/andyferris/AssociativeArray.jl and basically came up with the ability to (with simple code):Associative{K,V}
with anAssociative{I,K}
to get anAssociative{I,V}
. E.g.Dict(:a=>1, :b=>2, c:=>3)[Dict("a"=>:a, "c"=>:c)] == Dict("a"=>1, "c"=>3)
.Associative{K,V}
with anAbstractArray{K,N}
to get anAbstractArray{V,N}
. E.g.Dict(:a=>1, :b=>2, c:=>3)[[:c, :a]] == [3,1]
.AbstractArray{T,N}
with anAssociative{K,I}
to get anAssociative{K,T}
(whereI
might beInt
for linear indexing, or aCartesianIndex{N}
for Cartesian indexing). E.g.[11,12,13][Dict(:a=>1, :c=>3)] == Dict(:a=>11, :c=>13)
.The semantics are consistent across arrays and dictionaries, and provide that for
out = a[b]
:out
shares the indices ofb
(note: these areCartesianRange
for arrays)out[i]
correspond toa[b[i]]
.This is fully consistent with both the
Base
arrays and the OffsetArrays.jl package (We can do something similar forsetindex!
).To make everything consistent, it helps to make the following associated changes:
Associative
s be containers of values, not ofindex=>value
pairs, so that arrays and dictionaries are consistent on this fundamental point. Use the existingpairs
function when necessary (and ideally make it preserve indexability).similar
always return a container with the same indices, even for dictionaries. Ideally, unifysimilar
acrossAssociative
s andArray
s (for example a dictionary which issimilar
to a distributed array might also be distributed) via use of the indices.empty
function that makes emptyDict
s andVector
s to which elements should be added. (Done, Addempty
and changesimilar(::Associative)
#24390).getindex
andsetindex!
with be calledindices
, rather thankeys
(and rename the currentindices(::AbstractArray)
to something else)view
work for the various combinations wheregetindex
works.The demonstration package also prototypes making
AbstractArray{T, N} <: Associative{CartesianIndex{N}, T}
- I don't think this is strictly necessary but it helped (me) to highlight which parts of the existing interface were inconsistent. The package does demonstrate that we can put something simple together without excessive amounts of code (some performance tuning is surely required).Finally, a word on what motivates this: lately I've been playing with what fundamental data operations (such as mapping, grouping, joining or filtering) would be useful for both generic data structures and tables/dataframes (that iterate rows), and I found whenever I created say a grouping (using a dictionary of groups), I immediately felt the loss of ability to do complex indexing and other operations with the result (as well have to worry whether the output iterates values or key-value pairs, etc).
The text was updated successfully, but these errors were encountered: