Add function to compute "fast array" with same "comparison" and "permutation" behavior as original #31606
Comments
cc @mbauman |
To add even more ramblings, I wanted to point out that it would actually be helpful to have two things:
Basically the second point would be some function |
+1, but I don't know whether this should be in Base or in a package. Also it would be nice to find a more specific name. Do you think we also need (BTW, EDIT: Ooops, for some reason your comment only appeared when I posted mine. We're talking about the same thing: |
Interesting. At first blush based upon your naming and use of indexing in the examples, I was thinking you were talking about a function that would allow an array to decide if it's faster to copy into a dense In some senses, it's kinda like |
@nalimilan: glad to see we are on the same page (despite GitHub glitches)! It would be ideal to have a unique interface for A possible API in my mind would be:
    refarray(v::AbstractArray) = v
    valuefromref(v::AbstractArray, ref) = ref
with specializations for the various array types. Or we could have a custom As to where this should live, I personally have no preferences as long as PooledArrays, WeakRefStrings and CategoricalArrays accept it (so that I can remove "performance hacks" from StructArrays). Maybe it can be tried in a package and then moved to Base in the future? |
It's a bit tricky, as in my mind the pooled arrays (arrays that represent data with a vector of hashes and a dictionary) would return |
I guess it would work for some use cases, but (as I said) some optimizations only work when you know the references are contiguous integers (so not for |
So in summary one could have (I prefer allowing
    refarray(v::AbstractArray)::AbstractArray = v
    refvaluedict(v)::Union{AbstractDict, Nothing} = nothing # nothing if info is not easily available
    function refvaluemap(v::AbstractArray)::Function # or callable type
        dict = refvaluedict(v)
        dict === nothing ? identity : t -> dict[t] # hopefully all of this gets optimized away
    end
We just need to come up with sensible names and decide where this should live. @mbauman does this look like something that would be useful in Base? Otherwise it may just be better to have a tiny package for this. |
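To make the proposal above concrete, here is a minimal sketch of how a package might specialize these three functions. ToyDictVector and its fields are hypothetical names invented for this example (they are not the PooledArrays or CategoricalArrays implementation), and the function names are the tentative ones from the comment, not a settled API.

```julia
# Hypothetical toy type, for illustration only: data stored as integer
# references plus a dictionary mapping each reference to its value.
struct ToyDictVector{T} <: AbstractVector{T}
    refs::Vector{UInt32}
    lookup::Dict{UInt32,T}
end
Base.size(v::ToyDictVector) = size(v.refs)
Base.getindex(v::ToyDictVector, i::Int) = v.lookup[v.refs[i]]

# Fallbacks, as written in the comment above.
refarray(v::AbstractArray)::AbstractArray = v
refvaluedict(v)::Union{AbstractDict, Nothing} = nothing
function refvaluemap(v::AbstractArray)::Function
    dict = refvaluedict(v)
    dict === nothing ? identity : t -> dict[t]
end

# Specializations a package would add for its own array type.
refarray(v::ToyDictVector) = v.refs
refvaluedict(v::ToyDictVector) = v.lookup

v = ToyDictVector(UInt32[7, 3, 7], Dict(UInt32(3) => "x", UInt32(7) => "y"))
decode = refvaluemap(v)

# Decoding the references recovers the original values.
roundtrip = map(decode, refarray(v)) == collect(v)

# A plain array goes through the fallbacks untouched.
plain = [1.5, 2.5]
plain_ok = refvaluemap(plain) === identity && refarray(plain) === plain
```

With this shape, generic code only ever calls refarray and refvaluemap, so it needs no dependency on the package that defines the array type.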
Looks good. Not sure about the names, it's always the hardest part. :-) We should probably start with a package, and maybe move it to Base later if we're happy with the API and it feels essential enough. Indeed Base doesn't contain functions which aren't used by any type defined in Base AFAIK. Two remarks:
|
That's a good point, vector makes a lot of sense for
    function ref2value(v::AbstractArray{T}, t)::T where T
        dict = refvaluedict(v)
        dict === nothing ? t : dict[t]
    end
So the rule would be: if there is a If indeed we only need these two functions, I find the PooledArrays terminology ( |
Can you develop? |
What I had in mind was something like:
    refarray(v::StringArray) = convert(StringArray{WeakRefString{UInt8}}, v)
So basically the
    julia> s = StringArray(["asda", "sada"]);
    julia> w = convert(StringArray{WeakRefString{UInt8}}, s);
    julia> @btime $s[1] == $s[2]
      34.966 ns (2 allocations: 64 bytes)
    false
    julia> @btime $w[1] == $w[2]
      4.773 ns (0 allocations: 0 bytes)
    false
Drawback here is that things like I imagine you are thinking something along the lines of |
Makes sense. I think we should wait until @quinnj is available to comment, though.
Maybe that's not an issue? These arrays aren't supposed to be transformed anyway, and one would better use views to take subsets.
Good point. |
Yes, actually by looking closer this is probably not needed. I'm getting convinced that |
This is a "companion issue" to #31601 in determining what abstractions would need to be in Julia Base (or, let's say, a stdlib) to allow writing packages for tabular data with no dependencies. My intuition here is fuzzier than in #31601, so I hope the proposal makes sense; feel free to correct me if it does not. See this comment and this comment for a little bit of background.
There are at least three distinct array types (WeakRefStrings.StringArray, PooledArrays.PooledArray and CategoricalArrays.CategoricalArray) that each have a corresponding "array of references" (WeakRefStrings in the first case and say UInt8 in the latter two cases). The array of references (let's call it fast_array(v), though the name should be chosen with care) has the following two properties in these cases:
- fast_array(v)[i] == fast_array(v)[j] if and only if v[i] == v[j] (but the first comparison is in general much faster)
- permute!(v, p) and permute!(fast_array(v), p) have the same effect, or similarly copyto!(v, v[p]) and copyto!(fast_array(v), fast_array(v)[p]) have the same effect (again, the fast_array version is much faster)
What I wanted to ask is whether being able to create a fast_array with the same guarantees as above is sufficiently general as to deserve to have a "fallback" in Base (simply fast_array(v::AbstractArray) = v, as it has the two properties defined above). Then the various packages could extend it as needed and external tabular data packages could have efficient row comparison (and efficient permute!(v) and copyto!(v, v[p]) or more generally sorting functionality) without having to depend on all these array packages.
If Base is not a good location for this interface, I'm also open to suggestions as to what could be a reasonable place for this thing to live (it could also be together with the defaultarray function from #31601).
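The two properties can be checked on a small example. The sketch below assumes a hypothetical ToyPooledVector (a simplified stand-in, not the actual PooledArrays.PooledArray) and uses the placeholder name fast_array from the issue text.

```julia
# Hypothetical toy type, for illustration only.
struct ToyPooledVector{T} <: AbstractVector{T}
    refs::Vector{UInt8}   # one small integer code per element
    pool::Vector{T}       # distinct values; refs[i] indexes into pool
end

# Build from ordinary values (assumes at most 255 distinct values).
function ToyPooledVector(values::AbstractVector{T}) where T
    pool = unique(values)
    lookup = Dict(x => UInt8(i) for (i, x) in enumerate(pool))
    return ToyPooledVector{T}([lookup[x] for x in values], pool)
end

Base.size(v::ToyPooledVector) = size(v.refs)
Base.getindex(v::ToyPooledVector, i::Int) = v.pool[v.refs[i]]

# Fallback proposed in the issue: an ordinary array is its own "fast array".
fast_array(v::AbstractArray) = v
# Specialization a package could add: expose the integer codes.
fast_array(v::ToyPooledVector) = v.refs

v = ToyPooledVector(["b", "a", "b", "c"])
f = fast_array(v)
n = length(v)

# Property 1: comparing codes agrees with comparing values.
same = all((f[i] == f[j]) == (v[i] == v[j]) for i in 1:n, j in 1:n)

# Property 2: permuting the codes in place permutes the values.
p = [3, 1, 4, 2]
expected = collect(v)[p]
permute!(f, p)
permuted_ok = collect(v) == expected
```

Here the comparison in property 1 touches only UInt8 codes, and the permutation in property 2 moves only those codes while the pool of strings stays put, which is where the speedup described above would come from.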