Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trouble with categorical(v) when v is a SentinelArrays.ChainedVector #361

Closed
ablaom opened this issue Jul 27, 2021 · 19 comments · Fixed by #369
Closed

Trouble with categorical(v) when v is a SentinelArrays.ChainedVector #361

ablaom opened this issue Jul 27, 2021 · 19 comments · Fixed by #369

Comments

@ablaom
Copy link

ablaom commented Jul 27, 2021

using SentinelArrays, CategoricalArrays
v = ChainedVector([1:3, 4:5])
@assert v isa AbstractVector{Int}

julia> categorical(v)
ERROR: MethodError: copyto!(::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}}, ::ChainedVector{Int64, UnitRange{Int64}}) is ambiguous. Candidates:
  copyto!(dest::Union{SubArray{var"#s5", N, var"#s4", I, L} where {var"#s5", var"#s4"<:(CategoricalArrays.CategoricalArray{T, var"#s3", R, V, C, U} where {var"#s3", V, C, U}), I, L}, CategoricalArrays.CategoricalArray{T, N, R, V, C, U} where {V, C, U}} where {T, N, R<:Integer}, src::AbstractArray) in CategoricalArrays at /Users/anthony/.julia/packages/CategoricalArrays/rDwMt/src/array.jl:615
  copyto!(dest::AbstractVector{T} where T, src::ChainedVector) in SentinelArrays at /Users/anthony/.julia/packages/SentinelArrays/1hMOA/src/chainedvector.jl:326
Possible fix, define
  copyto!(::Union{SubArray{T, 1, var"#s4", I, L} where {T, T1, R<:Integer, var"#s4"<:(CategoricalArrays.CategoricalArray{T1, var"#s3", R, V, C, U} where {var"#s3", V, C, U}), I, L}, CategoricalArrays.CategoricalVector{T, R, V, C, U} where {C, U, T, R<:Integer, V}}, ::ChainedVector)                                                                              
Stacktrace:
 [1] _convert(::Type{CategoricalArrays.CategoricalVector{Int64, UInt32, V, C, U} where {V, C, U}}, A::ChainedVector{Int64, UnitRange{Int64}}; levels::Nothing)
   @ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/array.jl:343
 [2] (CategoricalArrays.CategoricalVector{Int64, UInt32, V, C, U} where {V, C, U})(A::ChainedVector{Int64, UnitRange{Int64}}; levels::Nothing, ordered::Bool)                     
   @ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/array.jl:253
 [3] #categorical#72
   @ ~/.julia/packages/CategoricalArrays/rDwMt/src/array.jl:927 [inlined]
 [4] categorical(A::ChainedVector{Int64, UnitRange{Int64}})
   @ CategoricalArrays ~/.julia/packages/CategoricalArrays/rDwMt/src/array.jl:926
 [5] top-level scope
   @ REPL[83]:1
  [324d7699] CategoricalArrays v0.10.0
  [91c51154] SentinelArrays v1.3.5
julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin19.6.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_LTS_PATH = /Applications/Julia-1.0.app/Contents/Resources/julia/bin/julia
  JULIA_PATH = /Applications/Julia-1.6.app/Contents/Resources/julia/bin/julia
  JULIA_EGLOT_PATH = /Applications/Julia-1.5.app/Contents/Resources/julia/bin/julia
  JULIA_NUM_THREADS = 5
  JULIA_NIGHTLY_PATH = /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia
@bkamins
Copy link
Member

bkamins commented Jul 27, 2021

@quinnj and @nalimilan - this one is a tough issue. We need to decide which package should depend on which.

@quinnj
Copy link
Member

quinnj commented Aug 3, 2021

Hmmm, bummer. Yeah, I'd rather not have to have either depend on each other; I wonder if we can play with the method signatures to avoid the ambiguities here.

@bkamins
Copy link
Member

bkamins commented Aug 3, 2021

The problem is that on a conceptual level:

  • CategoricalArray wants to specify how copyto! should work when it is a TARGET
  • SentinelArray wants to specify how copyto! should work when it is a SOURCE

So there is no way to avoid ambiguity I feel. If we have to I would rather make CategoricalArrays.jl take SentinelArrays.jl as a dependency. Let us wait to hear what @nalimilan thinks.

@storopoli
Copy link

storopoli commented Aug 13, 2021

Yeah I am having issues with:

transform!(df, [:col1, :col2, :col3, :col4] .=> categorical; renamecols=false)

If I import a DataFrame as CSV.read(..., DataFrame). Had to import as CSV.File(...) |> DataFrame in order to work.

@nalimilan
Copy link
Member

I'd rather not yet another dependency to CategoricalArrays that isn't strictly needed to use it (we already have JSON, JSON3, StructTypes and RecipesBase). Also, the same issue will affect PooledArrays once we implement the same optimized copyto! implementation for sources that implement DataAPI.refpool (as we probably should). And any other array type that implements copyto! similar to ChainedVector will create the same bug.

It looks like this is a problem that should be fixed in Julia. In the past, some functions have been implemented in a way that avoids ambiguity issues. There might be a way to define internal methods so that priority is given to the destination over the source type (or the reverse). Ideally IIRC it's been discussed to allow each package to specify a number giving the priority level of the method, or to say that calling any method defined for AbstractArray is fine given that the interface is well defined.

@bkamins
Copy link
Member

bkamins commented Aug 13, 2021

It looks like this is a problem that should be fixed in Julia.

Yes - it should be hopefully fixed in Julia in the long term. However, we need to decide on something before that happens (and in practice it will not happen fast I am afraid). I will ask on Slack in #internal about the opinion what to do. Maybe there will be some reasonable suggestion.

@bkamins
Copy link
Member

bkamins commented Aug 14, 2021

Given the discussion on Slack I do not think this is going to be resolved soon.

@nalimilan - would you agree to take SentinelArrays.jl as a dependency then? (I do not like this solution, and I hope it is temporary, but I do not see any other work-around for the time being)

@nalimilan
Copy link
Member

I guess one solution would be to use Requires.jl for this, and also make JSON, JSON3, StructTypes and RecipesBase optional dependencies too.

@bkamins
Copy link
Member

bkamins commented Aug 17, 2021

It seems as a best solution for now.

One question regarding Requires.jl - does package loading sequence matter here or not?

@ablaom
Copy link
Author

ablaom commented Aug 17, 2021

I guess one solution would be to use Requires.jl for this, and also make JSON, JSON3, StructTypes and RecipesBase optional dependencies too.

Our experience in MLJModels using using Requires.jl for optional model-code loading was painful and ultimately abandoned. We had issues with latency issues (packages pre-compiling twice) and we suspected it responsible for issues we were never able to resolve before abandoning its use (eg, JuliaAI/MLJ.jl#321). I think Requires is a very clever piece of software, but is essentially a hack; and I have heard both its author and Jeff describe it so.

While it's likely the authors of CAs are better familiar with the correct use of Requires.jl than I, my suggestion is that they think two or three times before choosing that route. Personally, I'd rather live with the hard-wired dependencies until Julia comes up with a better solution to this kind of thing.

@timholy
Copy link
Contributor

timholy commented Aug 23, 2021

I can't speak for pre-compiling twice (that sounds bad...), but Requires got a bit less hacky when all "registration" moved into the module __init__ function. WRT the specific error in from JuliaAI/MLJModels.jl#22,

using MLJModels.XGBoost_

is XGBoost_ loaded conditionally? It may be an order-of-events issue. There might be something fixable about this, but it would probably take some investigation.

Requires hasn't changed its design much since Julia 1.0 came out, but it has gotten a lot more lightweight than it was back in 2019. I think it contributed ~0.5s to loading time back then, and we've gotten it down to about 1/10th of that. The latest was JuliaPackaging/Requires.jl#101, but JuliaLang/julia#37574 had already gotten it down considerably by that point.

@bkamins
Copy link
Member

bkamins commented Aug 23, 2021

@timholy - thank you for commenting on this. Also probably you are the best to answer my earlier question:

One question regarding Requires.jl - does package loading sequence matter here or not?

Thank you!

@nalimilan - shall we go for Requires.jl then? (and I assume it should go into CategoricalArrays.jl)

@quinnj
Copy link
Member

quinnj commented Aug 23, 2021

Yes, perhaps it's worth trying out Requires.jl again; which we could probably use for all those "extra" CategoricalArray dependencies (StructTypes.jl, JSON.jl, etc.).

@timholy
Copy link
Contributor

timholy commented Aug 23, 2021

It seems like it shouldn't, but... The reason for my thinking it this:

If there are other callbacks, though, perhaps that could mess some stuff up?

@timholy
Copy link
Contributor

timholy commented Aug 23, 2021

@timholy
Copy link
Contributor

timholy commented Aug 24, 2021

Upon further reflection, it seems possible that the following may not work (not sure, but be on the lookout for such issues):

module Inner
# stuff
function __init__()
    @require ...
end
end

module Outer
using Inner
# some top-level expression that expects the `@require` block of `Inner` to have already run
end

followed by

julia> using Outer

All would be fine if you typed using Inner, Outer, but the problem is that the callbacks may (?) only fire after loading the module specified in the using statement. If someone gets a MWE demonstrating this, it would presumably be a simple fix. And again, maybe it is already working, I'm just trying to guess what happened in JuliaAI/MLJModels.jl#22.

@bkamins
Copy link
Member

bkamins commented Aug 24, 2021

Thank you for thinking about it. Fortunately with CategoricalArrays.jl and SentinelArrays.jl we shall not have this problem.

@nalimilan
Copy link
Member

OK, so let's try Requires.jl then. If it creates problems it's easy to move to standard dependencies.

@bkamins
Copy link
Member

bkamins commented Sep 16, 2021

In JuliaData/DataFrames.jl#2883 the problem was reported again.
@nalimilan - would you be willing to make a fix in CategoricalArrays.jl and make a release?
Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants