Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix Vector{UInt8} writing #419

Closed
wants to merge 7 commits into from
Closed

fix Vector{UInt8} writing #419

wants to merge 7 commits into from

Conversation

Moelf
Copy link
Contributor

@Moelf Moelf commented Apr 7, 2023

fix #411

Setup:

julia> using Arrow

julia> df = (; x = [[0x01, 0x02], UInt8[], [0x03]])

julia> Arrow.write("/tmp/julia.feather", df)
"/tmp/julia.feather"

Before:

In [3]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
Out[3]:
<pyarrow.lib.ChunkedArray object at 0x7fb2994c86d0>
[
  [
    0102,
    ,
    03
  ]
]

Note: data = (; x = [b"12", b"", b"3"]) from Julia will give write out the exact same thing, which is IMO why this was wrong.

After:

In [39]: pyarrow.feather.read_table("/tmp/julia.feather")["x"]
Out[39]:
<pyarrow.lib.ChunkedArray object at 0x7f4568b376a0>
[
  [
    [
      1,
      2
    ],
    [],
    [
      3
    ]
  ]
]

@Moelf Moelf changed the title fix Vector{UInt8} writing fix Vector{UInt8} writing Apr 7, 2023
@Moelf
Copy link
Contributor Author

Moelf commented Apr 8, 2023

test wouldn't fully pass, could use some help and also this is technically breaking but I believe this is the correct thing to do.

If user wants byte-string, they should use b"" (which still works now) and not Vector{UInt8} @quinnj

@Moelf
Copy link
Contributor Author

Moelf commented Apr 11, 2023

CI waiting for #424

Meta.largUtf8Start(b)
return Meta.LargeUtf8, Meta.largUtf8End(b), nothing
end
elseif eltype(A) == UInt8 && ArrowTypes.isstringtype(T)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the only additional type cathed here is Base.CodeUnits.

Ideally, we want the first (above) used for long string and this one for short-string, but that's not a well defined concept, so for now we just say, "use CodeUnits if you want output byte-string in Arrow"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does "cathed" mean?

Copy link
Contributor Author

@Moelf Moelf May 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, "caught" (I probably tried to type catched, isn't a word), as in caught by the condition

@Moelf
Copy link
Contributor Author

Moelf commented Apr 13, 2023

The reason why this PR doesn't need to touch test is that our implementation was wrongly self-consistent -- it ignored an entire possible type composition in arrow:

Before this PR

@Moelf
Copy link
Contributor Author

Moelf commented May 10, 2023

@quinnj 👀 gentle bump?

@simsurace
Copy link
Contributor

@ericphanson or @baumgold would you be willing/able to review this?

@baumgold
Copy link
Member

@ericphanson or @baumgold would you be willing/able to review this?

Looks like some of the tests are failing. Would you mind investigating and fixing? Thanks.

@Moelf
Copy link
Contributor Author

Moelf commented May 13, 2023

@baumgold

some tests are failing because this changes both code base, you should look at this monorepo test I added in #424:

https://github.com/apache/arrow-julia/actions/runs/4685294460/jobs/8302259625?pr=419

@baumgold
Copy link
Member

I see. Probably we should split this up into 2 smaller PRs: one for ArrowTypes and one for Arrow. Assuming we make a new release of ArrowTypes then in the follow-on Arrow PR we can bump the compat of ArrowTypes to the newly released version.

@Moelf
Copy link
Contributor Author

Moelf commented May 14, 2023

making a stand alone PR adding a convenient function that doesn't get used and has no direct test makes 0 sense to me.

all the changes needed are in Arrow.jl, the convenient function to ArrowType is because I think it should be the same function. I could have added a function with some other name into Arrow.jl, which would be ugly

the "normal" tests fail just because how tests are set up (which brings us back to question: isn't the whole point of monorepo so that when we have 2 packages we don't have to make rapid release to fix tests for the other package?)

@ericphanson
Copy link
Member

isn't the whole point of monorepo so that when we have 2 packages we don't have to make rapid release to fix tests for the other package?

I’m not sure all the reasons for using this setup in this case but that definitely isn’t always a goal. E.g. https://discourse.julialang.org/t/how-beacon-packages-julia-code-in-a-monorepo/90822.

@Moelf
Copy link
Contributor Author

Moelf commented May 14, 2023

So what people think, I should do a separate PR to add convince function??

@simsurace
Copy link
Contributor

If this is what it takes to keep the ball moving, I would suggest doing so, but I am not a maintainer, just an observer interested to see this package continue getting improved.

@Moelf
Copy link
Contributor Author

Moelf commented May 15, 2023

just to be clear, in that case we will be doing:

  1. make a PR to add a convenient function ("will be used in next PR trust me")
  2. release a new ArrowTypes.jl for that, and potentially making a bunch of users re-precompile their stuff for no reason
  3. see tests in this PR turn all green?

Moelf added a commit to Moelf/arrow-julia that referenced this pull request May 18, 2023
@quinnj
Copy link
Member

quinnj commented May 18, 2023

@Moelf, do you mind if I make an alternative PR to fix the original issue here?

@Moelf
Copy link
Contributor Author

Moelf commented May 18, 2023

go ahead

quinnj added a commit that referenced this pull request May 18, 2023
Fixes #411. Alternative to #419.

This PR should be compatible with or without the ArrowTypes changes.
I think it's fine to do compat things in Arrow like this as long as
they don't get out of hand and we can eventually remove them as we
bump required ArrowTypes versions and such.

The PR consists of not treating `Vector{UInt8}` as the Arrow Binary
type, which is meant for "binary string"s. Julia has a pretty good
match for that in `Base.CodeUnits`, so instead, we use that to write
Binary and `Vector{UInt8}` is treated as a regular List of Primitive UInt8
type.
@quinnj
Copy link
Member

quinnj commented May 18, 2023

Ok, PR up: #439. Sorry to be so MIA lately; I've been tied up in some heavy other projects and it's been too hard to context-switch back here. A lot of that work (webstack-related) has wrapped up (mostly), I'm hoping to have more time to help review/fix stuff here.

This PR was definitely in the right direction @Moelf; thanks for the contribution. It was a great starting point. Couple of specific points:

  • I'd rather not pun the ArrowTypes.isstringtype function for non-ListKind types; it keeps that a little cleaner IMO
  • We can make Base.CodeUnits the main translation between the Arrow Binary type
  • We can add some compat shims in Arrow to avoid the awkward ArrowTypes/Arrow mis-compat

@quinnj quinnj closed this May 18, 2023
@Moelf Moelf deleted the uint8_jagged branch May 19, 2023 02:51
quinnj added a commit that referenced this pull request May 19, 2023
Fixes #411. Alternative to #419.

This PR should be compatible with or without the ArrowTypes changes. I
think it's fine to do compat things in Arrow like this as long as they
don't get out of hand and we can eventually remove them as we bump
required ArrowTypes versions and such.

The PR consists of not treating `Vector{UInt8}` as the Arrow Binary
type, which is meant for "binary string"s. Julia has a pretty good match
for that in `Base.CodeUnits`, so instead, we use that to write Binary
and `Vector{UInt8}` is treated as a regular List of Primitive UInt8
type.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Vector{UInt8} mis-represented when writing to disk
5 participants