`Vector{UInt8}` mis-represented when writing to disk #411
to show that
In [8]: import pyarrow.feather, numpy as np, pandas as pd
In [9]: df = pd.DataFrame({"x": [[np.uint8(0)], [np.uint8(1), np.uint8(2)]]})
In [11]: pyarrow.feather.write_feather(df, "/tmp/pyarrow.feather", compression="uncompressed")
In [12]: pyarrow.feather.read_table("/tmp/pyarrow.feather")["x"]
Out[12]:
<pyarrow.lib.ChunkedArray object at 0x7f80e3f93ec0>
[
[
[
0
],
[
1,
2
]
]
]
read it back from Julia
julia> Arrow.Table("/tmp/pyarrow.feather").x
2-element Arrow.List{Union{Missing, Vector{Union{Missing, UInt8}}}, Int32, Arrow.Primitive{Union{Missing, UInt8}, Vector{UInt8}}}:
 Union{Missing, UInt8}[0x00]
 Union{Missing, UInt8}[0x01, 0x02]
I did some digging
diff --git a/src/arraytypes/arraytypes.jl b/src/arraytypes/arraytypes.jl
index f3cee5d..a338004 100644
--- a/src/arraytypes/arraytypes.jl
+++ b/src/arraytypes/arraytypes.jl
@@ -34,7 +34,9 @@ Base.deleteat!(x::T, inds) where {T <: ArrowVector} = throw(ArgumentError("`$T`
function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
@debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(values(kw))"
@debugv 3 x
+ @show typeof(x)
A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
+ @show typeof(A)
if compression isa LZ4FrameCompressor
A = compress(Meta.CompressionTypes.LZ4_FRAME, compression, A)
elseif compression isa Vector{LZ4FrameCompressor}
julia> data = (; x = [[0x01, 0x02], UInt8[], [0x03]], y = [[0, 1], Int[], [2,3]])
(x = Vector{UInt8}[[0x01, 0x02], [], [0x03]], y = [[0, 1], Int64[], [2, 3]])
julia> Arrow.write("/tmp/bug411.feather", data)
typeof(x) = Vector{Vector{UInt8}}
typeof(A) = Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}}
typeof(x) = Vector{Vector{Int64}}
typeof(A) = Arrow.List{Vector{Int64}, Int32, Arrow.Primitive{Int64, Arrow.ToList{Int64, false, Vector{Int64}, Int32}}}
"/tmp/bug411.feather"
the question is why
arrow-julia/src/arraytypes/list.jl Lines 192 to 197 in c469151
this seems to be the reason, and one step back,
we also hit this part: Lines 405 to 407 in c469151
all in all it seems like a deliberate choice which I think is wrong, given pyarrow behavior and application of
I think it's a reasonable request to not treat
Fixes #411. Alternative to #419. This PR should be compatible with or without the ArrowTypes changes. I think it's fine to do compat things in Arrow like this as long as they don't get out of hand and we can eventually remove them as we bump required ArrowTypes versions and such. The PR consists of not treating `Vector{UInt8}` as the Arrow Binary type, which is meant for "binary string"s. Julia has a pretty good match for that in `Base.CodeUnits`, so instead, we use that to write Binary and `Vector{UInt8}` is treated as a regular List of Primitive UInt8 type.
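To make the Binary vs. List distinction concrete, here is a rough stdlib-only Python sketch (not Arrow.jl or pyarrow code). It assumes a simplified view of the Arrow columnar layout in which both a Binary column and a List of UInt8 column carry int32 offsets plus one contiguous byte buffer, with validity bitmaps left out because there are no nulls. The helper `encode_offsets_and_data` and the `"type"` labels are hypothetical names for illustration; in a real file the type lives in the schema metadata.

```python
import struct


def encode_offsets_and_data(rows):
    """Flatten byte sequences into little-endian int32 offsets plus one data buffer."""
    offsets = [0]
    data = bytearray()
    for row in rows:
        data.extend(row)
        offsets.append(len(data))  # end position of this row in the data buffer
    return struct.pack(f"<{len(offsets)}i", *offsets), bytes(data)


# Same values as the `x` column in the Julia session above.
rows = [bytes([1, 2]), b"", bytes([3])]

# Hypothetical stand-ins for the two ways the column could be declared in metadata.
as_binary = {"type": "Binary", "buffers": encode_offsets_and_data(rows)}
as_list = {"type": "List<UInt8>", "buffers": encode_offsets_and_data(rows)}

print(as_binary["buffers"] == as_list["buffers"])  # True
print(as_binary["type"] == as_list["type"])        # False
```

Under these assumptions the bytes on disk are the same either way, so what a reader sees depends on which type the writer records. That is consistent with the issue being about the metadata rather than the data.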
instead of `Vector{UInt8}`, it ended up being seen as byte-string