Slower materialization Feather vs Arrow #131

Closed
Felix-Gauthier opened this issue Feb 16, 2021 · 1 comment · Fixed by #176
@Felix-Gauthier

I'm currently transitioning from Feather.jl to Arrow.jl, and I've noticed that reading and modifying a 525600x400 DataFrame, made up mostly of Union{Missing, Float}/Union{Missing, Bool} columns, takes nearly twice as long.

Consider:

using Feather, Arrow, DataFrames
tmp_A, tmp_F = tempname(), tempname()
df = DataFrame(:A => vcat([missing], rand(525_599)))
Arrow.write(tmp_A,df)
Feather.write(tmp_F,df)

for (fn,name,k) in [(Arrow.Table,tmp_A,(:A,)),(Feather.read,tmp_F,(!,:A))]
    vec = fn(name)[k...]
    @info typeof(vec)
    for i in 1:3
        out = Vector{Union{Float64, Missing}}(undef, length(vec))
        @time out .= vec .+ 0.1
    end
end

We get the following results:

[ Info: Arrow.Primitive{Union{Missing, Float64},Array{Float64,1}}
  0.107514 seconds (303.19 k allocations: 15.577 MiB, 8.51% gc time)
  0.006216 seconds (2 allocations: 160 bytes)
  0.006779 seconds (2 allocations: 160 bytes)
[ Info: Feather.Arrow.NullablePrimitive{Float64}
  0.136757 seconds (442.93 k allocations: 23.064 MiB)
  0.002040 seconds (2 allocations: 160 bytes)
  0.001903 seconds (2 allocations: 160 bytes)

This is with Julia v1.5.3, Feather v0.5.7, Arrow v1.2.4, and DataFrames v0.22.5, with similar results on Julia v1.6.0-rc1.

Is there a better way to do this with Arrow that wouldn't run slower than its predecessor?

@quinnj
Member

quinnj commented Apr 14, 2021

Sorry for the slow response; looks like it has to do w/ our nullability check, since the non-nullable case is actually faster in Arrow.jl:

julia> for (fn,name,k) in [(Arrow.Table,tmp_A,(:A,)),(Feather.read,tmp_F,(!,:A))]
           vec = fn(name)[k...]
           @info typeof(vec)
           for i in 1:3
               out = Vector{Union{Float64, Missing}}(undef, length(vec))
               @time out .= vec .+ 0.1
           end
       end
[ Info: Arrow.Primitive{Float64, Vector{Float64}}
  0.001771 seconds (2 allocations: 160 bytes)
  0.000543 seconds (2 allocations: 160 bytes)
  0.000323 seconds (2 allocations: 160 bytes)
[ Info: Feather.Arrow.Primitive{Float64}
  0.002012 seconds (2 allocations: 96 bytes)
  0.000795 seconds (2 allocations: 96 bytes)
  0.000607 seconds (2 allocations: 96 bytes)
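
For readers unfamiliar with the nullability check mentioned above: in the Arrow format, a nullable column carries a validity bitmap (one bit per element, least-significant bit first, set bit meaning "valid"), so every element access has to locate and test a bit before returning the value. The following is only a minimal sketch of that idea; `NullableColumn` and `make_column` are hypothetical names for illustration and are not Arrow.jl's actual types or implementation.

```julia
# Hypothetical stand-in for a nullable Arrow column; NOT Arrow.jl's real type.
struct NullableColumn{T} <: AbstractVector{Union{Missing, T}}
    validity::Vector{UInt8}   # 1 bit per element, least-significant bit first
    values::Vector{T}
end

Base.size(c::NullableColumn) = size(c.values)
Base.IndexStyle(::Type{<:NullableColumn}) = IndexLinear()

function Base.getindex(c::NullableColumn, i::Int)
    @boundscheck checkbounds(c.values, i)
    byte, bit = fldmod1(i, 8)                        # 1-based byte and bit position
    valid = ((c.validity[byte] >> (bit - 1)) & 0x01) == 0x01
    return valid ? c.values[i] : missing             # this branch runs per element
end

# Build a column whose first element is missing, mirroring the benchmark data.
function make_column(n)
    values = rand(n)
    validity = fill(0xff, cld(n, 8))                 # all bits set: everything valid
    validity[1] = 0xfe                               # clear bit 1: element 1 is missing
    return NullableColumn(validity, values)
end

col = make_column(16)
out = Vector{Union{Float64, Missing}}(undef, length(col))
out .= col .+ 0.1                                    # same broadcast as in the benchmark
```

A non-nullable column skips the bitmap lookup entirely, which is consistent with the non-nullable timings above being much lower.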

quinnj added a commit that referenced this issue Apr 14, 2021
Fixes #131. Dip me in mustard and call me a hotdog, cuz I can't tell
how/why `divrem(i - 1, 8) .+ (1, 1)` ends up being ~30% faster than
`fldmod1(i, 8)`. It'd probably be worth looking into it more, but it
works for now. The divrem code is @ExpandingMan 's code from his
arrow/feather code.
quinnj added a commit with the same message that referenced this issue Apr 15, 2021
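
The two expressions named in the commit message compute the same 1-based (byte, bit) position for any positive index, which is why one can replace the other in a bitmap lookup. The sketch below only checks that equivalence; `pos_fldmod1` and `pos_divrem` are hypothetical helper names, and nothing here explains or reproduces the ~30% speed difference, which is best measured on your own machine.

```julia
# Two ways to map a 1-based element index to a 1-based (byte, bit) position.
pos_fldmod1(i) = fldmod1(i, 8)
pos_divrem(i)  = divrem(i - 1, 8) .+ (1, 1)

# They agree for every positive index, the only range a 1-based array uses.
all_equal = all(pos_fldmod1(i) == pos_divrem(i) for i in 1:10_000)

# They disagree for i <= 0, because div truncates toward zero while fld rounds down.
differ_at_zero = pos_fldmod1(0) != pos_divrem(0)
```

Since array indices here are always at least 1, the disagreement at non-positive values does not matter for this use.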