CategoricalArray type not closed under `unique` method #129
It's not terribly useful to return a ... What's your use case? |
Hi Milan,
Thanks for your message.
I have code that relabels vectors of arbitrary type into integer vectors, based on "training" the code on one particular vector, which is presumed to take on all values likely to be encountered in new vectors. My vectors are usually columns of a DataFrame. I want to switch between the two representations of the vectors. I use `unique` to determine what the values are but my code won't work as expected if `unique` changes the element type.
I do have a workaround. It appears that `v -> collect(Set(v))` preserves the element type (at least for columns of DataFrames) and otherwise does the same thing.
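One way to check whether a function keeps the element type is to compare `eltype` before and after the call. The sketch below uses only Base Julia. `preserves_eltype` is a hypothetical helper name, not part of any package. The result for a `CategoricalArray` depends on the CategoricalArrays version in use, so only a plain `Vector{Char}` is shown here.

```julia
# Hypothetical helper (not from any package): does `f` keep the element type of `v`?
preserves_eltype(f, v::AbstractVector) = eltype(f(v)) == eltype(v)

v = ['a', 'b', 'a', 'c']

# For a plain Vector, both approaches keep the element type.
unique_ok = preserves_eltype(unique, v)
set_ok = preserves_eltype(x -> collect(Set(x)), v)

# Both give the same distinct values, though `Set` does not promise any order.
same_values = Set(unique(v)) == Set(collect(Set(v)))
```

Running the same check on a column of interest would show whether `unique` changes the element type in a given setup.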
Anthony
|
More specifically, could you explain briefly why the code doesn't work if the types differ? I'm trying to evaluate whether this pattern can be common. |
Thanks for your message.
I am developing a Julia machine learning environment and am wrapping a learning algorithm that expects categorical features to have Int type. In my environment, data is initially, by default, in DataFrame form; I must therefore transform the categorical columns into integer vectors. The initial eltype of the column is unknown; we just know it represents a categorical. Note that I must record the actual labelling used, so that I can transform new instances of data (test data) later on.
I realise that your CategoricalArrays is already doing something like this under the hood, but as a user I don't want to bother looking inside :-) . Also, I don't know ahead of time if my column is indeed a CategoricalArray or something else.
Here is a simplified version of my code for relabelling a vector (the columns of some DataFrame) with integers. Some obvious checks are missing.

    # the data structure for storing the relabelling dictionaries:
    struct ToIntScheme{T}
        int_given_T::Dict{T, Int}
        T_given_int::Dict{Int, T}
    end

    function fit(v::AbstractVector{T}) where T
        int_given_T = Dict{T, Int}()
        T_given_int = Dict{Int, T}()
        vals = collect(Set(v)) # <-- natural to use `unique(v)` here but then eltype(vals) != T
        i = 1
        for c in vals
            int_given_T[c] = i # <-- `c` must be type `T` here or I get an error
            T_given_int[i] = c
            i = i + 1
        end
        return ToIntScheme{T}(int_given_T, T_given_int)
    end

    # transform a scalar according to given scheme:
    transform(scheme::ToIntScheme{T}, x::T) where T = scheme.int_given_T[x]

    # demonstration:
    using CategoricalArrays
    v = [Char(rand(UInt8)) for i in 1:10^4];
    v[1:10]
    vcat = CategoricalVector(v);
    typeof(vcat)
    typeof(unique(vcat))
    scheme = fit(vcat[1:end-1]); # fit to all but last element of vcat
    y = transform(scheme, vcat[end]) # transform last element according to scheme
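
An alternative that does not rely on `unique` preserving the element type is to key the dictionaries on whatever element type `unique` returns. This is only a sketch under that assumption, not the approach used above. The names `ToIntSchemeU`, `fit_unique` and `transform_unique` are hypothetical, and the example is exercised on a plain `Vector{Char}` only.

```julia
# Hypothetical variant of `ToIntScheme`: the key type `U` is taken from
# `unique(v)` rather than from `eltype(v)`.
struct ToIntSchemeU{U}
    int_given_U::Dict{U, Int}
    U_given_int::Dict{Int, U}
end

function fit_unique(v::AbstractVector)
    vals = unique(v)   # distinct values in order of first occurrence
    U = eltype(vals)   # whatever element type `unique` returned
    int_given_U = Dict{U, Int}()
    U_given_int = Dict{Int, U}()
    for (i, c) in enumerate(vals)
        int_given_U[c] = i
        U_given_int[i] = c
    end
    return ToIntSchemeU{U}(int_given_U, U_given_int)
end

# No type restriction on `x`: the lookup relies on `hash` and `isequal` only.
transform_unique(scheme::ToIntSchemeU, x) = scheme.int_given_U[x]

scheme_u = fit_unique(['b', 'a', 'b', 'c'])
codes = [transform_unique(scheme_u, x) for x in ['a', 'c', 'b']]
```

The trade-off is that the scheme's type parameter no longer has to match the element type of the input vector, so code dispatching on `ToIntScheme{T}` would need adjusting.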
|
Update: My code has moved on and the use-case above no longer exists. On reflection, I'm not sure there is a compelling reason to favour different behaviour. Feel free to close. |
I have another case of code that is agnostic of the array representation and breaks if ... Suppose there is a function `nodes(edges::AbstractDataFrame) = DataFrame(id = sort!(unique(vcat(edges.source, edges.target))))` that works with the dataframe representation of a graph (dataframe ... But there are more annoying subtle bugs. E.g. ... |
When one applies the `unique` function to a categorical array, I would expect a categorical array of the same type to be returned, but this is not the case. I'm using Julia 0.6: