Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reconsidering the current type registration/serialization mechanism (and its internal usage) #88

Closed
ericphanson opened this issue Dec 23, 2020 · 12 comments

Comments

@ericphanson
Copy link
Member

E.g.

julia> using Arrow, UUIDs                           
                                                                             
julia> table = (;x = [uuid4() for _ = 1:5])                                                                               
(x = UUID[UUID("03fec02f-3d33-4cf8-88e0-1e152faccff7"), UUID("a789baa7-9059-4bc4-b0d9-2af1dca4bf88"), UUID("2a4ecfd7-049e-
468d-8fed-8f967b732ee0"), UUID("91b9cad6-b84c-4ee8-aeaf-7448778fb1d5"), UUID("39edead7-1670-430a-8e4d-fefe3615c2d0")],)   

julia> Arrow.write("uuids.arrow", table)                                                                                  
"uuids.arrow"                                                                                                             
                                                                                                                          
julia> Arrow.Table("uuids.arrow")
Arrow.Table: (x = UUID[UUID("03fec02f-3d33-4cf8-88e0-1e152faccff7"), UUID("a789baa7-9059-4bc4-b0d9-2af1dca4bf88"), UUID("2
a4ecfd7-049e-468d-8fed-8f967b732ee0"), UUID("91b9cad6-b84c-4ee8-aeaf-7448778fb1d5"), UUID("39edead7-1670-430a-8e4d-fefe361
5c2d0")],)

In a new session:

julia> using Arrow, UUIDs

julia> Arrow.Table("uuids.arrow")
┌ Warning: unsupported ARROW:extension:name type: "JuliaLang.UUID"
└ @ Arrow.ArrowTypes ~/.julia/packages/Arrow/CyJ4L/src/arrowtypes.jl:141
Arrow.Table: (x = NamedTuple{(:value,),Tuple{UInt128}}[(value = 0x03fec02f3d334cf888e01e152faccff7,), (value = 0xa789baa790594bc4b0d92af1dca4bf88,), (value = 0x2a4ecfd7049e468d8fed8f967b732ee0,), (value = 0x91b9cad6b84c4ee8aeaf7448778fb1d5,), (value = 0x39edead71670430a8e4dfefe3615c2d0,)],)

I believe this is due to the write-time registration of types at

https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86

@ericphanson
Copy link
Member Author

(The workaround is to just call Arrow.ArrowTypes.registertype!(UUID, UUID) before deserializing. But I think the hidden statefulness is still confusing / problematic).

@quinnj
Copy link
Member

quinnj commented Jan 31, 2021

The UUID case is now fixed (defined by default in Arrow) and we've updated the docs to mention the need to call registertype!. I'm considering some larger changes to type serializing and such, so this might be something we make easier with that.

@jrevels
Copy link
Contributor

jrevels commented Mar 14, 2021

@quinnj what's the motivation behind https://github.com/JuliaData/Arrow.jl/blob/3ab2b18829c1656198a85759360389b6bbb22ab3/src/arraytypes/struct.jl#L86? Is it just to give the "convenience behavior" listed in the OP or is there a deeper reason? If it's just the former, I wonder if it's better just to remove it...I ran into another related issue just now.

I'm essentially implementing the following (which is also why I needed #150):

struct Foo ... end

struct _FooArrow ... end

Foo(::_FooArrow) = ...

Arrow.ArrowTypes.registertype!(Foo, _FooArrow)

Arrow.ArrowTypes.arrowconvert(::Type{_FooArrow}, f::Foo) = ...

the above in theory would allow me to have full control over Arrow <-> Julia conversion for my Foo type.

The problem is that Arrow.jl is automatically calling ArrowTypes.registertype!(_FooArrow, _FooArrow) on write even though I don't want it to :( as a caller I can't really think of a scenario where I would want auto-registration, but I could be missing something.

Even if we can't get rid of it in general, would it be possible to gate this behavior behind a flag passed to Arrow.write (autoregister=true)?

@ericphanson
Copy link
Member Author

Just to add another reason in favor of removing it, mutating the global registration dict at write-time seems like it could be an issue for concurrent writing from different threads (ref #90 (comment) for other thread safety issues). Whereas the user could be sure to always manually register outside the threaded region of code.

@quinnj
Copy link
Member

quinnj commented Mar 14, 2021

Yeah, these are good points for removing the auto registering. The main reason for having it was convenience.

@quinnj
Copy link
Member

quinnj commented Mar 14, 2021

@jrevels , can you explain your use-case/example a bit more? What I dont' quite follow is how _FooArrow will be supported? The 2nd argument to registertype! should be a native arrow type that your custom type converts to.

@quinnj
Copy link
Member

quinnj commented Mar 14, 2021

Hold up, don't mind me. I'm digging back through all the code and in the structs.jl file we know how to serialize a _FooArrow, so yeah, I think I understand the example better now.

@quinnj
Copy link
Member

quinnj commented Mar 14, 2021

Wait, backsies again. So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple. Here's where my thinking is going, though I recognize the code itself doesn't currently reflect this vision:

  • Stop autoregistering StructTypes, we'd require users to call ArrowTypes.registertype! for custom types
  • By default, we'd assume that Arrow.Types.registertype!(::Type{T}) where {T} = registertype!(T, T), which means the custom struct would be serialized as-is, and when deserializing, we'd just call (essentially) T(serialized_fields...)
  • For more customizability when serializing custom structs, you would do something like: ArrowTypes.registertype!(T, @NamedTuple{field1::Int, field2::String}), where, e.g., you only want to serialize field1 and field2 of your custom type. This would then require a corresponding definition like: ArrowTypes.arrowconvert(::Type{@NamedTuple{field1::Int, field2::String}}, x::T) = (field1=x.field1, field2=x.field2), though I think we could provide some kind of auto-convert fallback, like: ArrowTypes.arrowconvert(::Type{T}, x) where {T <: NamedTuple} = (; nm=>getfield(x, nm) for nm in names(T))
  • In addition, you'd be able to define a custom ArrowTypes.arrowconvert(::Type{T}, x::@NamedTuple{field1::Int, field2::Strong}) = T(x.field1, x.field2); this would allow "hooking" into deserialization, to fix cases like Avoid assuming field values can be used in constructors #135

@jrevels
Copy link
Contributor

jrevels commented Mar 14, 2021

So the problem with not autoregistering, is that without ArrowTypes.registertype!(_FooArrow, _FooArrow), we don't know how to deserialize the struct, it would just deserialize as a NamedTuple.

Ah, but for me this is the desired behavior :) I want it to deserialize as NamedTuple unless I, the caller, tell it explicitly not to. Right now it feels like Arrow.jl is making the decision for me, and it's making the wrong one (AFAICT).

Reopening this issue as it seems like the discussion may lead to some action items :)

@jrevels jrevels reopened this Mar 14, 2021
@jrevels
Copy link
Contributor

jrevels commented Mar 14, 2021

ref beacon-biosignals/Onda.jl#68 for a motivating example.

my thoughts are very rough/not super well-considered yet, but off the top of my head, here are my big "wants" (some of these might already be possible w/ existing behavior):

  • For the current Dict-based registertype! mechanism to somehow be replaced/augmented by method dispatch (motivation here: https://github.com/beacon-biosignals/Onda.jl/pull/68/files#diff-9d1b70fd041b1dbbe08ff4096cf1c68daa131b7d249d2ba3101e9079e129f44cR505 ; if there's another way to do this that I'm not seeing with the current system, that'd be dope). The barriers here AFAICT are a) dynamic dispatch might be slower than the current Dict look up during deserialization, so we'd have to amortize it by doing it up front based on the present extension metadata (I think this should work?) and b) this mechanism would be less dynamic than registertype! currently is (I would be happy to make it less dynamic, but maybe there's a use case where the extra dynamism is useful?). If handling this starts to look like it requires @eval, we could consider a combined approach, e.g. keep the current *_MAPPING Dict but have it contain extension string => Julia function pairs, to hide Julia's "smart dispatch" behind a "dumb dispatch" tier. That way the mapping would be clear/resolvable without eval but callers could take advantage of Julia's dispatch for e.g. allowing full use of type parameters.
  • For there to be very clear/well-named/well-documented/separate "pre-serialization hook" and "post-deserialization hook" transformations. I'm not overly concerned with names as long as the duality is clear, e.g. lower/raise, toarrow/fromarrow, etc. I wouldn't want to use a single function for both as, IME, doing so forces the caller to "do more than they intend to do" by overloading that function. Like, I want to be able to define a pre-serialization hook to go from A in Julia -> B in Arrow, without implying anything about the transformation behavior of A in Arrow -> B in Julia.
  • For me to be able to specify/toggle behavior at the callsite. i.e. turn off the pre/post-hooks for a specific type during a specific read/write operation. Overriding these sorts of global behaviors is really useful in practice for e.g. migrating data between different representation versions w/o needing to adjust your running environment.

@jrevels
Copy link
Contributor

jrevels commented Mar 17, 2021

@jrevels jrevels changed the title Types deserialize differently during session in which they were written reconsidering the current type registration/serialization mechanism (and its internal usage) Mar 18, 2021
@quinnj
Copy link
Member

quinnj commented Mar 29, 2021

Closing now that #156 is merged/tagged

@quinnj quinnj closed this as completed Mar 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants