Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vector{UInt8} mis-represented when writing to disk #411

Closed
Moelf opened this issue Apr 2, 2023 · 5 comments · Fixed by #439
Closed

Vector{UInt8} mis-represented when writing to disk #411

Moelf opened this issue Apr 2, 2023 · 5 comments · Fixed by #439

Comments

@Moelf
Copy link
Contributor

Moelf commented Apr 2, 2023

julia> using Arrow, DataFrames

julia> df = DataFrame(; x = [[0x01, 0x02], UInt8[], [0x03]])
3×1 DataFrame
 Row │ x
     │ Array
─────┼───────────────────
   1 │ UInt8[0x01, 0x02]
   2 │ UInt8[]
   3 │ UInt8[0x03]

julia> Arrow.write("/tmp/julia.feather", df)
"/tmp/julia.feather"

instead of Vector{UInt8}, it ended up being seen as byte-string

In [1]: import pyarrow.feather

In [3]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
Out[3]:
<pyarrow.lib.ChunkedArray object at 0x7fb2994c86d0>
[
  [
    0102,
    ,
    03
  ]
]
@Moelf
Copy link
Contributor Author

Moelf commented Apr 4, 2023

to show that pyarrow does something different and consistent:

In [8]: import pyarrow.feather, numpy as np, pandas as pd

In [9]: df = pd.DataFrame({"x": [[np.uint8(0)], [np.uint8(1), np.uint8(2)]]})

In [11]: pyarrow.feather.write_feather(df, "/tmp/pyarrow.feather", compression="uncompressed")

In [12]: pyarrow.feather.read_table("/tmp/pyarrow.feather")["x"]
Out[12]:
<pyarrow.lib.ChunkedArray object at 0x7f80e3f93ec0>
[
  [
    [
      0
    ],
    [
      1,
      2
    ]
  ]
]

read it back from Julia

julia> Arrow.Table("/tmp/pyarrow.feather").x
2-element Arrow.List{Union{Missing, Vector{Union{Missing, UInt8}}}, Int32, Arrow.Primitive{Union{Missing, UInt8}, Vector{UInt8}}}:
 Union{Missing, UInt8}[0x00]
 Union{Missing, UInt8}[0x01, 0x02]

@Moelf Moelf changed the title Vector{UInt8} mis-represented when writing to disk Vector{UInt8} mis-represented in metadata when writing to disk Apr 4, 2023
@Moelf
Copy link
Contributor Author

Moelf commented Apr 7, 2023

I did some digging

diff --git a/src/arraytypes/arraytypes.jl b/src/arraytypes/arraytypes.jl
index f3cee5d..a338004 100644
--- a/src/arraytypes/arraytypes.jl
+++ b/src/arraytypes/arraytypes.jl
@@ -34,7 +34,9 @@ Base.deleteat!(x::T, inds) where {T <: ArrowVector} = throw(ArgumentError("`$T`
 function toarrowvector(x, i=1, de=Dict{Int64, Any}(), ded=DictEncoding[], meta=getmetadata(x); compression::Union{Nothing, Vector{LZ4FrameCompressor}, LZ4FrameCompressor, Vector{ZstdCompressor}, ZstdCompressor}=nothing, kw...)
     @debugv 2 "converting top-level column to arrow format: col = $(typeof(x)), compression = $compression, kw = $(values(kw))"
     @debugv 3 x
+    @show typeof(x)
     A = arrowvector(x, i, 0, 0, de, ded, meta; compression=compression, kw...)
+    @show typeof(A)
     if compression isa LZ4FrameCompressor
         A = compress(Meta.CompressionTypes.LZ4_FRAME, compression, A)
     elseif compression isa Vector{LZ4FrameCompressor}
julia> data = (; x = [[0x01, 0x02], UInt8[], [0x03]], y = [[0, 1], Int[], [2,3]])
(x = Vector{UInt8}[[0x01, 0x02], [], [0x03]], y = [[0, 1], Int64[], [2, 3]])

julia> Arrow.write("/tmp/bug411.feather", data)
typeof(x) = Vector{Vector{UInt8}}
typeof(A) = Arrow.List{Vector{UInt8}, Int32, Arrow.ToList{UInt8, false, Vector{UInt8}, Int32}}
typeof(x) = Vector{Vector{Int64}}
typeof(A) = Arrow.List{Vector{Int64}, Int32, Arrow.Primitive{Int64, Arrow.ToList{Int64, false, Vector{Int64}, Int32}}}
"/tmp/bug411.feather"

the question is why UInt8 is built ToList while Int64 is Primitive while both of them seem to be possible primitive https://arrow.apache.org/docs/python/generated/pyarrow.uint8.html#pyarrow.uint8

@Moelf
Copy link
Contributor Author

Moelf commented Apr 7, 2023

flat = ToList(x; largelists=largelists)
offsets = Offsets(UInt8[], flat.inds)
if eltype(flat) == UInt8 # binary or utf8string
data = flat
T = origtype(flat)
else

this seems to be the reason, and one step back, ToList() converts both into flat Vector{UInt8} so it's not distinguishable if you only look at variable flat

@Moelf
Copy link
Contributor Author

Moelf commented Apr 7, 2023

we also hit this part:

arrow-julia/src/eltypes.jl

Lines 405 to 407 in c469151

else # if Vector{UInt8}
if O == Int32
Meta.binaryStart(b)

all in all it seems like a deliberate choice which I think is wrong, given pyarrow behavior and application of Vector{UInt8} that's not byte-string

@Moelf Moelf changed the title Vector{UInt8} mis-represented in metadata when writing to disk Vector{UInt8} mis-represented when writing to disk Apr 13, 2023
@quinnj
Copy link
Member

quinnj commented May 18, 2023

I think it's a reasonable request to not treat Vector{UInt8} as the Binary arrow type and only have CodeUnits be treated that way.

quinnj added a commit that referenced this issue May 18, 2023
Fixes #411. Alternative to #419.

This PR should be compatible with or without the ArrowTypes changes.
I think it's fine to do compat things in Arrow like this as long as
they don't get out of hand and we can eventually remove them as we
bump required ArrowTypes versions and such.

The PR consists of not treating `Vector{UInt8}` as the Arrow Binary
type, which is meant for "binary string"s. Julia has a pretty good
match for that in `Base.CodeUnits`, so instead, we use that to write
Binary and `Vector{UInt8}` is treated as a regular List of Primitive UInt8
type.
quinnj added a commit that referenced this issue May 19, 2023
Fixes #411. Alternative to #419.

This PR should be compatible with or without the ArrowTypes changes. I
think it's fine to do compat things in Arrow like this as long as they
don't get out of hand and we can eventually remove them as we bump
required ArrowTypes versions and such.

The PR consists of not treating `Vector{UInt8}` as the Arrow Binary
type, which is meant for "binary string"s. Julia has a pretty good match
for that in `Base.CodeUnits`, so instead, we use that to write Binary
and `Vector{UInt8}` is treated as a regular List of Primitive UInt8
type.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants