change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary #103

Merged: 3 commits, Jan 11, 2021
6 changes: 4 additions & 2 deletions docs/src/manual.md
@@ -53,10 +53,12 @@ Apart from letting other packages have all the fun, an `Arrow.Table` itself can

In the arrow data format, specific logical types are supported, a list of which can be found [here](https://arrow.apache.org/docs/status.html#data-types). These include booleans, integers of various bit widths, floats, decimals, time types, and binary/string. While most of these map naturally to types builtin to Julia itself, there are a few cases where the definitions are slightly different, and in these cases, by default, they are converted to more "friendly" Julia types (this auto conversion can be avoided by passing `convert=false` to `Arrow.Table`, like `Arrow.Table(file; convert=false)`). Examples of arrow to julia type mappings include:

-* `Date`, `Time`, `Timestamp`, and `Duration` all have natural Julia definitions in `Dates.Date`, `Dates.Time`, `TimeZones.ZonedDateTime`, and `Dates.Period` subtypes, respectively.
+* `Date`, `Time`, `Timestamp`, and `Duration` all have natural Julia definitions in `Dates.Date`, `Dates.Time`, `TimeZones.ZonedDateTime`, and `Dates.Period` subtypes, respectively.
* `Char` and `Symbol` Julia types are mapped to arrow string types, with additional metadata of the original Julia type; this allows deserializing directly to `Char` and `Symbol` in Julia, while other language implementations will see these columns as just strings
+* Similarly to the above, the `UUID` Julia type is mapped to a 128-bit `FixedSizeBinary` arrow type.
* `Decimal128` and `Decimal256` have no corresponding builtin Julia types, so they're deserialized using a compatible type definition in Arrow.jl itself: `Arrow.Decimal`

+
Note that when `convert=false` is passed, data will be returned in Arrow.jl-defined types that exactly match the arrow definitions of those types; the authoritative source for how each type represents its data can be found in the arrow [`Schema.fbs`](https://github.com/apache/arrow/blob/master/format/Schema.fbs) file.

#### Custom types
@@ -118,7 +120,7 @@ With `Arrow.write`, you provide either an `io::IO` argument or `file::String` to
What are some examples of Tables.jl-compatible sources? A few examples include:
* `Arrow.write(io, df::DataFrame)`: A `DataFrame` is a collection of indexable columns
* `Arrow.write(io, CSV.File(file))`: read data from a csv file and write out to arrow format
-* `Arrow.write(io, DBInterface.execute(db, sql_query))`: Execute an SQL query against a database via the [`DBInterface.jl`](https://github.com/JuliaDatabases/DBInterface.jl) interface, and write the query resultset out directly in the arrow format. Packages that implement DBInterface include [SQLite.jl](https://juliadatabases.github.io/SQLite.jl/stable/), [MySQL.jl](https://juliadatabases.github.io/MySQL.jl/dev/), and [ODBC.jl](http://juliadatabases.github.io/ODBC.jl/latest/).
+* `Arrow.write(io, DBInterface.execute(db, sql_query))`: Execute an SQL query against a database via the [`DBInterface.jl`](https://github.com/JuliaDatabases/DBInterface.jl) interface, and write the query resultset out directly in the arrow format. Packages that implement DBInterface include [SQLite.jl](https://juliadatabases.github.io/SQLite.jl/stable/), [MySQL.jl](https://juliadatabases.github.io/MySQL.jl/dev/), and [ODBC.jl](http://juliadatabases.github.io/ODBC.jl/latest/).
* `df |> @map(...) |> Arrow.write(io)`: Write the results of a [Query.jl](https://www.queryverse.org/Query.jl/stable/) chain of operations directly out as arrow data
* `jsontable(json) |> Arrow.write(io)`: Treat a json array of objects or object of arrays as a "table" and write it out as arrow data using the [JSONTables.jl](https://github.com/JuliaData/JSONTables.jl) package
* `Arrow.write(io, (col1=data1, col2=data2, ...))`: a `NamedTuple` of `AbstractVector`s or an `AbstractVector` of `NamedTuple`s are both considered tables by default, so they can be quickly constructed for easy writing of arrow data if you already have columns of data
36 changes: 26 additions & 10 deletions src/arrowtypes.jl
@@ -44,14 +44,6 @@ struct PrimitiveType <: ArrowType end
ArrowType(::Type{<:Integer}) = PrimitiveType()
ArrowType(::Type{<:AbstractFloat}) = PrimitiveType()

-arrowconvert(::Type{UInt128}, u::UUID) = UInt128(u)
-arrowconvert(::Type{UUID}, u::UInt128) = UUID(u)
-
-# This method is included as a deprecation path to allow reading Arrow files that may have
-# been written before Arrow.jl defined its own UUID <-> UInt128 mapping (in which case
-# a struct-based fallback `JuliaLang.UUID` extension type may have been utilized)
-arrowconvert(::Type{UUID}, u::NamedTuple{(:value,),Tuple{UInt128}}) = UUID(u.value)
-
struct BoolType <: ArrowType end
ArrowType(::Type{Bool}) = BoolType()

@@ -77,6 +69,30 @@ ArrowType(::Type{NTuple{N, T}}) where {N, T} = FixedSizeListType()
gettype(::Type{NTuple{N, T}}) where {N, T} = T
getsize(::Type{NTuple{N, T}}) where {N, T} = N

+ArrowType(::Type{UUID}) = FixedSizeListType()
+gettype(::Type{UUID}) = UInt8
+getsize(::Type{UUID}) = 16
+
+function _unsafe_cast(::Type{B}, a::A)::B where {B,A}
+    a = Ref(a)
+    b = Ref{B}()
+    GC.@preserve a b begin
+        ptra = Base.unsafe_convert(Ptr{A}, a)
+        ptrb = Base.unsafe_convert(Ptr{B}, b)
+        unsafe_copyto!(Ptr{A}(ptrb), ptra, 1)
+    end
+    return b[]
+end
+
+arrowconvert(::Type{NTuple{16,UInt8}}, u::UUID) = _unsafe_cast(NTuple{16,UInt8}, u.value)
+arrowconvert(::Type{UUID}, u::NTuple{16,UInt8}) = UUID(_unsafe_cast(UInt128, u))
+
+# These methods are included as deprecation paths to allow reading Arrow files that may have
+# been written before Arrow.jl's current UUID <-> NTuple{16,UInt8} mapping existed (in which case
+# a struct-based fallback `JuliaLang.UUID` extension type may have been utilized)
+arrowconvert(::Type{UUID}, u::NamedTuple{(:value,),Tuple{UInt128}}) = UUID(u.value)
+arrowconvert(::Type{UUID}, u::UInt128) = UUID(u)
+
struct StructType <: ArrowType end

ArrowType(::Type{<:NamedTuple}) = StructType()
@@ -125,7 +141,7 @@ default(::Type{NamedTuple{names, types}}) where {names, types} = NamedTuple{name
const JULIA_TO_ARROW_TYPE_MAPPING = Dict{Type, Tuple{String, Type}}(
    Char => ("JuliaLang.Char", UInt32),
    Symbol => ("JuliaLang.Symbol", String),
-    UUID => ("JuliaLang.UUID", UInt128),
+    UUID => ("JuliaLang.UUID", NTuple{16,UInt8}),
)

istyperegistered(::Type{T}) where {T} = haskey(JULIA_TO_ARROW_TYPE_MAPPING, T)
@@ -140,7 +156,7 @@ end
const ARROW_TO_JULIA_TYPE_MAPPING = Dict{String, Tuple{Type, Type}}(
    "JuliaLang.Char" => (Char, UInt32),
    "JuliaLang.Symbol" => (Symbol, String),
-    "JuliaLang.UUID" => (UUID, UInt128),
+    "JuliaLang.UUID" => (UUID, NTuple{16,UInt8}),
)

function extensiontype(f, meta)
5 changes: 4 additions & 1 deletion test/runtests.jl
@@ -194,9 +194,12 @@ tt = Arrow.Table(io)
@test length(tt) == length(t)
@test all(isequal.(values(t), values(tt)))

-# 89 - test deprecation path for old UUID autoconversion
+# 89 etc. - test deprecation paths for old UUID autoconversion + UUID FixedSizeListType overloads
u = 0x6036fcbd20664bd8a65cdfa25434513f
@test Arrow.ArrowTypes.arrowconvert(UUID, (value=u,)) === UUID(u)
+@test Arrow.ArrowTypes.arrowconvert(UUID, u) === UUID(u)
+@test Arrow.ArrowTypes.gettype(UUID) == UInt8
+@test Arrow.ArrowTypes.getsize(UUID) == 16

# 98
t = (a = [Nanosecond(0), Nanosecond(1)], b = [uuid4(), uuid4()], c = [missing, Nanosecond(1)])
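
The core of this change is a round trip between a `UUID` and an `NTuple{16,UInt8}`. As a rough illustration only, the sketch below performs the same kind of round trip with `reinterpret` instead of the pointer-based `_unsafe_cast` from the diff. The helper names `uuid_to_bytes` and `bytes_to_uuid` are hypothetical and are not part of Arrow.jl; the sketch assumes only the `UUIDs` standard library and the `value` field that the diff itself accesses.

```julia
using UUIDs

# Hypothetical helpers for illustration; these names are not part of Arrow.jl.

# View the UUID's UInt128 payload as 16 bytes in native memory order.
function uuid_to_bytes(u::UUID)
    bytes = reinterpret(UInt8, [u.value])
    return ntuple(i -> bytes[i], 16)
end

# Reassemble the 16 bytes into a UInt128 and wrap it in a UUID.
function bytes_to_uuid(b::NTuple{16,UInt8})
    words = reinterpret(UInt128, collect(b))
    return UUID(words[1])
end

# Round trip using the same literal as the test above.
u = UUID(0x6036fcbd20664bd8a65cdfa25434513f)
b = uuid_to_bytes(u)
@assert b isa NTuple{16,UInt8}
@assert bytes_to_uuid(b) === u
```

Both directions use the machine's native byte order, so the tuple contents depend on endianness; the round trip itself does not. The allocation of a temporary one-element array makes this slower than a pointer copy, which may be one reason to prefer an approach like `_unsafe_cast` in library code.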