Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loss of parametric type information for custom types #134

Closed
omus opened this issue Feb 23, 2021 · 4 comments
Closed

Loss of parametric type information for custom types #134

omus opened this issue Feb 23, 2021 · 4 comments

Comments

@omus
Copy link
Contributor

omus commented Feb 23, 2021

When using a table where a column contains a variety of types with different parameters this information can be lost:

julia> using Arrow, Intervals

julia> table = (col = [
           Interval{Closed,Closed}(1,2),
           Interval{Open,Closed}(1,2),
           Interval{Closed,Unbounded}(1,nothing),
       ],);

julia> table.col
3-element Array{Interval{Int64,L,R} where R<:Bound where L<:Bound,1}:
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Open,Closed}(1, 2)
 Interval{Int64,Closed,Unbounded}(1, nothing)

julia> Arrow.write("ex.arrow", table);

julia> Arrow.Table("ex.arrow").col
3-element Arrow.Struct{Interval{Int64,L,R} where R<:Bound where L<:Bound,Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Int64,Array{Int64,1}}}}:
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Closed,Closed}(1, 1)

For the particular Interval type the problem is worse as the undefined type parameters are inferred from the arguments:

julia> (Interval{Int64,L,R} where R<:Bound where L<:Bound)(1,2)
Interval{Int64,Closed,Closed}(1, 2)
@quinnj
Copy link
Member

quinnj commented Mar 25, 2021

Ok, this will now correctly error on #156 PR.

@quinnj quinnj closed this as completed Mar 25, 2021
quinnj added a commit that referenced this issue Mar 29, 2021
* Start work on overhauling type serialization architecture

* More work; serialization is pretty much done but not tested

* fix timetype ArrowTypes definitions

* more work to get tests passing

* get tests passing?

* fix

* Fix #75 by supporting Set serialization/deserialization

* Fix #85 by supporting tuple serialization/deserialization

* Lots of cleanup

* few more fixes

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>

* fix NullKind reading

* Fix #134 by requiring concrete or union of concrete element types for
all columns when serializing

* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata htat can be used in JuliaType

* Update manual

* tests

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this issue Apr 7, 2021
* Start work on overhauling type serialization architecture

* More work; serialization is pretty much done but not tested

* fix timetype ArrowTypes definitions

* more work to get tests passing

* get tests passing?

* fix

* Fix apache#75 by supporting Set serialization/deserialization

* Fix apache#85 by supporting tuple serialization/deserialization

* Lots of cleanup

* few more fixes

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>

* Update src/arrowtypes.jl

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>

* fix NullKind reading

* Fix apache#134 by requiring concrete or union of concrete element types for
all columns when serializing

* Add new ArrowTypes.arrowmetadata method for providing additional extension type metadata htat can be used in JuliaType

* Update manual

* tests

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
@omus omus reopened this Jul 7, 2021
@omus
Copy link
Contributor Author

omus commented Jul 7, 2021

Using ArrowTypes.arrowmetadata as shown in the Intervals example in the documentation you can only serialize a column where all of the type parameters are the same. Having mixture of type parameters does not work:

table = (col = [
    Interval{Closed,Unbounded}(1,nothing),
    Interval{Unbounded,Closed}(nothing,2),
],)

@omus
Copy link
Contributor Author

omus commented Jul 7, 2021

I attempted to work around this on Arrow 1.6 (not quite yet released) by storing the parametric information as part of the value as using ArrowTypes.arrowmetadata can't handle element variation. An implementation of this looks like:

using Arrow, ArrowTypes, Intervals

table = (;
    col=[
        Interval{Closed,Unbounded}(1,nothing),
        Interval{Unbounded,Closed}(nothing,2),
    ]
)

for T in (Closed, Open, Unbounded)
    name = QuoteNode(Symbol("JuliaLang.Intervals.$(string(T))"))

    @eval begin
        ArrowTypes.arrowname(::Type{$T}) = $name
        ArrowTypes.JuliaType(::Val{$name}) = $T
    end
end

let name = Symbol("JuliaLang.Intervals.Interval")
    ArrowTypes.arrowname(::Type{<:Interval{T}}) where T = name
    ArrowTypes.ArrowType(::Type{<:Interval{T}}) where T = NamedTuple{(:left, :right), Tuple{Tuple{String, T}, Tuple{String, T}}}
    function ArrowTypes.toarrow(x::Interval{T,L,R}) where {T,L,R}
        return (; left=(string(arrowname(L)), x.first), right=(string(arrowname(R)), x.last))
    end
    ArrowTypes.JuliaType(::Val{name}) = Interval
    function ArrowTypes.fromarrow(::Type{Interval}, left, right)
        T = typeof(left[2])
        L = ArrowTypes.JuliaType(Val(Symbol(left[1])))
        R = ArrowTypes.JuliaType(Val(Symbol(right[1])))
        return Interval{T,L,R}(
            L === Unbounded ? nothing : left[2],
            R === Unbounded ? nothing : right[2],
        )
    end
end

# ArrowTypes.fromarrow(Interval, ArrowTypes.toarrow(table.col[1]))

table.col
t = Arrow.Table(Arrow.tobuffer(table))
t.col
julia> table.col
2-element Vector{Interval{Int64, L, R} where {L<:Bound, R<:Bound}}:
 Interval{Int64, Closed, Unbounded}(1, nothing)
 Interval{Int64, Unbounded, Closed}(nothing, 2)

julia> t = Arrow.Table(Arrow.tobuffer(table))
Arrow.Table with 2 rows, 1 columns, and schema:
 :col  Interval

julia> t.col
2-element Arrow.Struct{Interval, Tuple{Arrow.Struct{Tuple{String, Int64}, Tuple{Arrow.List{String, Int32, Vector{UInt8}}, Arrow.Primitive{Int64, Vector{Int64}}}}, Arrow.Struct{Tuple{String, Int64}, Tuple{Arrow.List{String, Int32, Vector{UInt8}}, Arrow.Primitive{Int64, Vector{Int64}}}}}}:
 Interval{Int64, Closed, Unbounded}(1, nothing)
 Interval{Int64, Unbounded, Closed}(nothing, 2)

The main issue I had with implementing this is that the serialized instance as defined by toarrow is not what is passed into fromarrow. In my particular case I just needed to ensure that the NamedTuple I created in toarrow had the same fieldcount as Interval.

@quinnj
Copy link
Member

quinnj commented Jun 14, 2023

Closing as I think we have all the tools in place to support this kind of use-case, even if it's not the most convenient. i.e. the arrow format is really built for pretty homogenous data within the bounds of individual columns, but beyond that, it doesn't fare very well with mixed-type kinds of columns.

@quinnj quinnj closed this as completed Jun 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants