Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function to compute "fast array" with same "comparison" and "permutation" behavior as original #31606

Open
piever opened this issue Apr 3, 2019 · 14 comments
Labels
arrays [a, r, r, a, y, s]

Comments

@piever
Copy link
Contributor

piever commented Apr 3, 2019

This is a "companion issue" to #31601 in determining what abstractions would need to be in Julia Base (or let's say a stdlib) to allow to write packages for tabular data with no dependency. My intuition here is fuzzier than in #31601 so I hope the proposal makes sense, feel free to correct me if it does not. See this comment and this comment for a little bit of background.

There are at least three distinct array types (WeakRefStrings.StringArray, PooledArrays.PooledArray and CategoricalArrays.CategoricalArray) that each have a corresponding "array of references" (WeakRefStrings in the first case and say UInt8 in the latter two cases). The array of references (let's call it fast_array(v), though the name should be chosen with care) has the following two properties in these cases:

  • fast_array(v)[i] == fast_array(v)[j] if and only if v[i] == v[j] (but the first comparison is in general much faster)
  • permute!(v, p) and permute!(fast_array(v), p) have the same effect or similarly copyto!(v, v[p]) and copyto!(fast_array(v), fast_array(v)[p]) have the same effect (again, the fast_array version is much faster)

What I wanted to ask is whether being able to create a fast_array with the same guarantees as above is sufficiently general as to deserve to have a "fallback" in Base (simply fast_array(v::AbstractArray) = v, as it has the two properties defined above). Then the various packages could extend it as needed and external tabular data packages could have efficient row comparison (and efficient permute!(v) and copyto!(v, v[p]) or more generally sorting functionality) without having to depend on all these array packages.

If Base is not a good location for this interface, I'm also open to suggestion as to what could be a reasonable place for this thing to live (it could also be together with the defaultarray function from #31601).

@StefanKarpinski
Copy link
Member

cc @mbauman

@piever
Copy link
Contributor Author

piever commented Apr 4, 2019

To add even more ramblings, I wanted to point out that it would actually be helpful to have two things:

  • This fast_array(v) as above
  • A way to recover the original values from the fast ones (which are mostly references).

Basically the second point would be some function recover_original(v, ref) so that recover_original(v, fast_array(v)[i]) = v[i]. In this way for example when converting to a PooledArray (which pools elements together) a v::StringArray, one could just pool the fast_array(v) and then convert the (few) unique values to the actual string values. Not having access to this part of the interface led to this conditional definition in a Requires block in Tables and to this strange optimization in a Requires block in StructArrays.

@nalimilan
Copy link
Member

nalimilan commented Apr 4, 2019

+1, but I don't know whether this should be in Base or in a package. Also it would be nice to find a more specific name.

Do you think we also need levels as an O(1) way to get the list of possible values, in the order corresponding to references in fast_array(v)? Currently it lives in Missings.jl (because it skips missing values automatically). Without it, we would have to do something like v[indexin(unique(fast_array(v)), fast_array(v))] to get the list of unique values, which is O(N). There's a difference between CategoricalArrays/PooledArrays and WeakRefStringArrays in that regard, which is that the former guarantee that references are contiguous and point to a pool of values (returned by levels), while the latter only returns a custom struct which isn't even an integer. Knowing the range of possible integer values is essential e.g. to preallocate a vector mapping integer references in fast_array(v) to references ordered according to sortperm(unique(v)) when grouping (to avoid a second pass just for sorting).

(BTW, WeakRefStringArray is supposed to be deprecated in favor of StringArray (JuliaData/WeakRefStrings.jl#36), but that doesn't really affect this issue AFAICT.)

EDIT: Ooops, for some reason your comment only appeared when I posted mine. We're talking about the same thing: levels(v)[i] allows retrieving the value corresponding to value i in fast_array(v), but it only works when that array contains contiguous integer indices (i.e. not for WeakRefStringArrays).

@mbauman mbauman added the arrays [a, r, r, a, y, s] label Apr 4, 2019
@mbauman
Copy link
Member

mbauman commented Apr 4, 2019

Interesting. At first blush based upon your naming and use of indexing in the examples, I was thinking you were talking about a function that would allow an array to decide if it's faster to copy into a dense Array or to use directly. So I think we need a better name here — and ideally a more general construct.

In some senses, it's kinda like hash.(A), only with stricter guarantees, right? In fact, it's essentially perfecthash.(A), yeah? Of course, it can't really be a broadcast function because it's the array itself that knows the hashing scheme. Is there more language around perfect hashing that would help us describe levels and the mapping? My cursory perusal of https://en.wikipedia.org/wiki/Perfect_hash_function was unsatisfactory.

@piever
Copy link
Contributor Author

piever commented Apr 4, 2019

@nalimilan : glad to see we are on the same page (despite GitHub glitches)! It would be ideal to have a unique interface for StringArray and Categorical/PooledArray: I like the function because it encompasses both scenarios (being levels(v)[i] for the categorical case and just String for the WeakRefStrings case). The sortperm issue you mentioned was done in PooledArrays in a bit of a "hacky" way by feeding to Perm a modified version of the vector that has equivalent sorting but is a AbstractVector{<:Integer} which sorts more quickly. I'm still trying to wrap up my mind as to whether fast_array and recover_original could be enough, or if we need a levels dictionary?

A possible API in my mind would be:

refarray(v::AbstractArray) = v
valuefromref(v::AbstractArray, ref) = ref

with specializations for the various array types. Or we could have a custom RefArray type that incorporates both the refarray(v) and the closure ref -> valuefromref(v, ref)? Not sure.

As to where this should live, I personally have no preferences as long as PooledArrays, WeakRefStrings and CategoricalArrays accept it (so that I can remove "performance hacks" from StructArrays). Maybe it can be tried in a package and then moved to Base in the future?

@piever
Copy link
Contributor Author

piever commented Apr 4, 2019

In some senses, it's kinda like hash.(A), only with stricter guarantees, right?

It's a bit tricky, as in my mind the pooled arrays (arrays that represents data with a vector of hashes and a dictionary) would return hash.(v), ref -> hashdict[ref], whereas the StringArray would return a strange array of custom objects and the method String to turn them into actual string, and I would like there to be a fallback of v, identity: I'm afraid that the hash terminology applies to the first scenario but not to the others.

@nalimilan
Copy link
Member

I'm still trying to wrap up my mind as to whether fast_array and recover_original could be enough, or if we need a levels dictionary?

I guess it would work for some use cases, but (as I said) some optimizations only work when you know the references are contiguous integers (so not for WeakRefStringArray). In that case, you also need a way to know the number (or range) of the references, and the list of unique values. And of course you need a way to know whether you're in this case or not (some kind of trait?).

@piever
Copy link
Contributor Author

piever commented Apr 5, 2019

So in summary one could have (I prefer allowing refvaluedict to return nothing rather than adding a trait, but either way is fine for me):

refarray(v::AbstractArray)::AbstractArray = v
refvaluedict(v)::Union{AbstractDict, Nothing} = nothing # nothing if info is not easily available
function refvaluemap(v::AbstractArray)::Function # or callable type
    dict = refvaluedict(v) 
    dict === nothing ? identity : t -> dict[t] # hopefully all of this gets optimized away
end

We just need to come up with sensible names and decide where this should live. @mbauman does this look like something that would be useful in Base? Otherwise it may just be better to have a tiny package for this.

@nalimilan
Copy link
Member

Looks good. Not sure about the names, it's always the hardest part. :-)

We should probably start with a package, and maybe move it to Base later if we're happy with the API and it feels essential enough. Indeed Base doesn't contain functions which aren't used by any type defined in Base AFAIK.

Two remarks:

  • For refvaluedict, I think we should just require it to return an object which implements length and keys/values and can be indexed using values in refarray (or nothing). In practice that means PooledArrays and CategoricalArrays can return a vector, and WeakRefStringArrays a dict.
  • refvaluemap doesn't seem to be required, since you gave a very simple and generic definition that packages can copy.

@piever
Copy link
Contributor Author

piever commented Apr 5, 2019

In practice that means PooledArrays and CategoricalArrays can return a vector, and WeakRefStringArrays a dict.

That's a good point, vector makes a lot of sense for PooledArray and CategoricalArray. I have a basic confusion though: for me StringArray should return nothing here as there is no efficient small "refvaluedict" in the general case. However StringArray should somehow give the information that one should call String to get the concrete object. Should we just "force it" by annotating the return type, meaning the general function to get the actual values would be (here is an example function to recover values that we would not include in the code but maybe just in the README):

function ref2value(v::AbstractArray{T}, t)::T where T
    dict = refvaluedict(v) 
    dict === nothing ? t : dict[t]
end

So the rule would be: if there is a refvaluedict(v) then refarray(v)[i] is the key for refvaluedict(v) to get the correct value. If instead refvaluedict(v) is nothing, refarray(v)[i] should be something that converts v[i].

If indeed we only need this two functions, I find the PooledArrays terminology (refs and pool) to be the clearest. I made a small package here with these two definitions so that we can start trying out PRs at PooledArrays, CategoricalArrays and WeakRefStrings.

@nalimilan
Copy link
Member

If instead refvaluedict(v) is nothing, refarray(v)[i] should be something that converts v[i].

Can you develop? refarray(v)[i] would return an UInt64 offset for StringArray. How would you convert that to a String?

@piever
Copy link
Contributor Author

piever commented Apr 10, 2019

What I had in mind was something like:

refarray(v::StringArray) = convert(StringArray{WeakRefString{UInt8}}, v)

So basically the refarray would keep the buffer and then refarray(v)[i] would be a WeakRefString that could be converted to string easily. The advantage compared to the original array is that to compare two entries we don't need to allocate strings (indexing just returns the WeakRefString object without materializing a new string). For example:

julia> s = StringArray(["asda", "sada"]);

julia> w = convert(StringArray{WeakRefString{UInt8}}, s);

julia> @btime $s[1] == $s[2]
  34.966 ns (2 allocations: 64 bytes)
false

julia> @btime $w[1] == $w[2]
  4.773 ns (0 allocations: 0 bytes)
false

Drawback here is that things like a[1:2] = a[3:4] can't really be done efficiently because a[3:4] would copy the buffer, whereas I would like to simply work on the a.offsets and a.lengths arrays. Maybe this could be fixed by requiring that slicing doesn't copy the buffer for StringArray{<:WeakRefString} (which are a type that should be "hidden" to the user, so maybe this is OK).

I imagine you are thinking something along the lines of refarray(v) = (v.offsets, v.lengths) (you also need the length as FWIU in principle there is no guarantee that a string ends where the following begins) which solves the a[1:2] = a[3:4] problem but doesn't allow comparison, as strings may start at different offsets but be equal.

@nalimilan
Copy link
Member

Makes sense. I think we should wait until @quinnj is available to comment, though.

Drawback here is that things like a[1:2] = a[3:4] can't really be done efficiently because a[3:4] would copy the buffer, whereas I would like to simply work on the a.offsets and a.lengths arrays. Maybe this could be fixed by requiring that slicing doesn't copy the buffer for StringArray{<:WeakRefString} (which are a type that should be "hidden" to the user, so maybe this is OK).

Maybe that's not an issue? These arrays aren't supposed to be transformed anyway, and one would better use views to take subsets.

I imagine you are thinking something along the lines of refarray(v) = (v.offsets, v.lengths) (you also need the length as FWIU in principle there is no guarantee that a string ends where the following begins) which solves the a[1:2] = a[3:4] problem but doesn't allow comparison, as strings may start at different offsets but be equal.

Good point.

@piever
Copy link
Contributor Author

piever commented Apr 10, 2019

Maybe that's not an issue? These arrays aren't supposed to be transformed anyway, and one would better use views to take subsets.

Yes, actually by looking closer this is probably not needed. I'm getting convinced that refarray(v::StringArray) = convert(StringArray{WeakRefString{UInt8}}, v) (with some care for missings) should be the correct way forward.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s]
Projects
None yet
Development

No branches or pull requests

4 participants