Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary #103

Merged
merged 3 commits into from
Jan 11, 2021

Conversation

jrevels
Copy link
Contributor

@jrevels jrevels commented Jan 7, 2021

...to make tables written out by Arrow.jl more portable.

I'm dumb and didn't realize e.g. pyarrow doesn't support 128-bit ints, so currently UUID-containing tables will fail to be read by e.g. pa.feather.read_feather(...).

There's even pyarrow documentation that suggests to Python users that UUIDs can be written as FixedSizeBinary columns, so this seems like a more friendly/correct way to do this.

@quinnj
Copy link
Member

quinnj commented Jan 7, 2021

Failing tests?

@codecov
Copy link

codecov bot commented Jan 7, 2021

Codecov Report

Merging #103 (f278814) into main (cbf4456) will decrease coverage by 0.04%.
The diff coverage is 92.85%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #103      +/-   ##
==========================================
- Coverage   84.07%   84.02%   -0.05%     
==========================================
  Files          23       23              
  Lines        2669     2680      +11     
==========================================
+ Hits         2244     2252       +8     
- Misses        425      428       +3     
Impacted Files Coverage Δ
src/arrowtypes.jl 89.74% <92.85%> (+0.19%) ⬆️
src/eltypes.jl 84.32% <0.00%> (-0.85%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cbf4456...f278814. Read the comment docs.

@jrevels
Copy link
Contributor Author

jrevels commented Jan 7, 2021

Ah, was missing some methods for FixedSizeListType types it seems. Fixed IIUC.

It's a shame that this entails a significant perf hit for copy:

julia> @time copy(uuids_uint128);
  0.002593 seconds (2 allocations: 3.248 MiB)

julia> length(uuids_fixed)
212886

julia> typeof(uuids_fixed)
Arrow.FixedSizeList{Base.UUID,Array{UInt8,1}}

julia> length(uuids_uint128)
212886

julia> typeof(uuids_uint128)
Arrow.Primitive{Base.UUID,Array{UInt128,1}}

julia> @time copy(uuids_fixed);
  0.167757 seconds (1.28 M allocations: 94.203 MiB, 2.81% gc time)

julia> @time copy(uuids_uint128);
  0.001690 seconds (2 allocations: 3.248 MiB)

I imagine we could optimize further to lessen/remove the overhead, though I think it'd preferable to do that in a follow-up

Co-authored-by: SimonDanisch <sdanisch@protonmail.com>
@jrevels
Copy link
Contributor Author

jrevels commented Jan 9, 2021

Thanks to @SimonDanisch the UInt128 <-> NTuple{16,UInt8} cast employed here is now non-allocating 🔥

Reduces allocations for copy(uuids_fixed) a bit (see below) but looks like there's some additional performance still left on the table.

julia> @time copy(uuids_fixed);
  0.142450 seconds (1.06 M allocations: 74.713 MiB, 2.43% gc time)

I'll dig into it in a separate PR when I get a chance but AFAICT this is now good to go pending green CI.

jrevels added a commit to jrevels/Arrow.jl that referenced this pull request Jan 9, 2021
jrevels added a commit to jrevels/Arrow.jl that referenced this pull request Jan 9, 2021
@jrevels
Copy link
Contributor Author

jrevels commented Jan 9, 2021

@quinnj This is now ready for review, recommend looking at #104 at the same time (ref #104 (comment), which I can take care of once either is merged) 🙂 thanks!

@quinnj quinnj merged commit 31476c9 into apache:main Jan 11, 2021
@jrevels jrevels deleted the jr/uuidfixedsizebinary branch January 11, 2021 23:13
jrevels added a commit to jrevels/Arrow.jl that referenced this pull request Jan 12, 2021
quinnj pushed a commit that referenced this pull request Jan 12, 2021
* add isbitstype optimized path for FixedSizeList getindex

* reuse _unsafe_cast strategy from #103

* rebase + DRY _unsafe_cast
tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this pull request Apr 7, 2021
…izeBinary (apache#103)

* change UUID <-> Arrow mapping to (de)serialize to/from 16-byte FixedSizeBinary

* fix tests

* optimize UInt128 <-> NTuple{16,UInt8} casting

Co-authored-by: SimonDanisch <sdanisch@protonmail.com>

Co-authored-by: SimonDanisch <sdanisch@protonmail.com>
tanmaykm pushed a commit to tanmaykm/arrow-julia that referenced this pull request Apr 7, 2021
* add isbitstype optimized path for FixedSizeList getindex

* reuse _unsafe_cast strategy from apache#103

* rebase + DRY _unsafe_cast
@ericphanson
Copy link
Member

BTW the PR title here says FixedSizeBinary, but what was implemented was FixedSizeList; why is that?

Asking since arrow is considering a canonical extension for uuids that uses FixedSizeBinary.

@jrevels
Copy link
Contributor Author

jrevels commented Apr 26, 2024

BTW the PR title here says FixedSizeBinary, but what was implemented was FixedSizeList; why is that?

IIRC back when this PR was authored, Arrow.jl would (de)serialize Julia values with type FixedSizeList w/ UInt8 eltype as Arrow FixedSizeBinary values, and for w/e reason, that seemed like the right/easiest path to use in this PR (can't really remember the rationale for that, or whether there even was one 😅)

idk if that is still true nowadays though

here's the relevant arrowtype definition that existed when this PR was authored:

function arrowtype(b, x::FixedSizeList{T, A}) where {T, A}
    N = ArrowTypes.getsize(Base.nonmissingtype(T))
    if eltype(A) == UInt8
        Meta.fixedSizeBinaryStart(b)
        Meta.fixedSizeBinaryAddByteWidth(b, Int32(N))
        return Meta.FixedSizeBinary, Meta.fixedSizeBinaryEnd(b), nothing
    else
        children = [fieldoffset(b, "", x.data)]
        Meta.fixedSizeListStart(b)
        Meta.fixedSizeListAddListSize(b, Int32(N))
        return Meta.FixedSizeList, Meta.fixedSizeListEnd(b), children
    end
end

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants