Slower materialization Feather vs Arrow #131

Felix-Gauthier · 2021-02-16T20:27:05Z

I'm currently transitioning from Feather.jl to Arrow.jl, and I've noticed that reading and modifying a 525600x400 DataFrame, with columns mostly of type Union{Missing, Float64} or Union{Missing, Bool}, takes nearly twice as long.

Consider:

using Feather, Arrow, DataFrames
tmp_A, tmp_F = tempname(), tempname()
df = DataFrame(:A => vcat([missing], rand(525_599)))
Arrow.write(tmp_A,df)
Feather.write(tmp_F,df)

for (fn,name,k) in [(Arrow.Table,tmp_A,(:A,)),(Feather.read,tmp_F,(!,:A))]
    vec = fn(name)[k...]
    @info typeof(vec)
    for i in 1:3
        out = Vector{Union{Float64, Missing}}(undef, length(vec))
        @time out .= vec .+ 0.1
    end
end

We get the following results:

[ Info: Arrow.Primitive{Union{Missing, Float64},Array{Float64,1}}
  0.107514 seconds (303.19 k allocations: 15.577 MiB, 8.51% gc time)
  0.006216 seconds (2 allocations: 160 bytes)
  0.006779 seconds (2 allocations: 160 bytes)
[ Info: Feather.Arrow.NullablePrimitive{Float64}
  0.136757 seconds (442.93 k allocations: 23.064 MiB)
  0.002040 seconds (2 allocations: 160 bytes)
  0.001903 seconds (2 allocations: 160 bytes)

This is with Julia v1.5.3, Feather v0.5.7, Arrow v1.2.4, and DataFrames v0.22.5, with similar results on Julia v1.6.0-rc1.

Is there a better way to do this with Arrow, one that wouldn't run slower than its predecessor?

quinnj · 2021-04-14T21:37:35Z

Sorry for the slow response; looks like it has to do w/ our nullability check, since the non-nullable case is actually faster in Arrow.jl:

julia> for (fn,name,k) in [(Arrow.Table,tmp_A,(:A,)),(Feather.read,tmp_F,(!,:A))]
           vec = fn(name)[k...]
           @info typeof(vec)
           for i in 1:3
               out = Vector{Union{Float64, Missing}}(undef, length(vec))
               @time out .= vec .+ 0.1
           end
       end
[ Info: Arrow.Primitive{Float64, Vector{Float64}}
  0.001771 seconds (2 allocations: 160 bytes)
  0.000543 seconds (2 allocations: 160 bytes)
  0.000323 seconds (2 allocations: 160 bytes)
[ Info: Feather.Arrow.Primitive{Float64}
  0.002012 seconds (2 allocations: 96 bytes)
  0.000795 seconds (2 allocations: 96 bytes)
  0.000607 seconds (2 allocations: 96 bytes)
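
To picture where a per-element nullability check enters the broadcast, here is a minimal sketch using only Base Julia. `ToyNullable` is a hypothetical stand-in invented for illustration, not Arrow.jl's actual `Arrow.Primitive` type, and it assumes the Arrow format's least-significant-bit-first validity bitmap.

```julia
# Toy stand-in for a nullable Arrow column (NOT Arrow.jl's real type):
# raw values plus a packed validity bitmap, one bit per element.
struct ToyNullable{T} <: AbstractVector{Union{Missing,T}}
    data::Vector{T}
    validity::Vector{UInt8}
end

Base.size(v::ToyNullable) = size(v.data)
Base.IndexStyle(::Type{<:ToyNullable}) = IndexLinear()

# Every element access pays for a bitmap lookup before the value is read.
function Base.getindex(v::ToyNullable, i::Int)
    byte, bit = fldmod1(i, 8)  # 1-based byte and bit position
    valid = ((v.validity[byte] >> (bit - 1)) & 0x01) == 0x01
    return valid ? v.data[i] : missing
end

# Same shape as the column in the issue: first entry missing, the rest valid.
col = ToyNullable([0.0, 1.0, 2.0], UInt8[0b00000110])
out = Vector{Union{Float64,Missing}}(undef, length(col))
out .= col .+ 0.1

@assert ismissing(out[1])
@assert isapprox(out[2], 1.1) && isapprox(out[3], 2.1)
```

A non-nullable column can skip the bitmap lookup entirely, which is consistent with the non-nullable timings above being faster in both packages.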

@ExpandingMan

Fixes #131. Dip me in mustard and call me a hotdog, cuz I can't tell how/why `divrem(i - 1, 8) .+ (1, 1)` ends up being ~30% faster than `fldmod1(i, 8)`. It'd probably be worth looking into it more, but it works for now. The divrem code is @ExpandingMan's code from his arrow/feather code.
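
As a quick sanity check on the two expressions quoted above, the following sketch (Base Julia only) confirms that they produce the same 1-based (byte, bit) pair for positive indices. The helper names `bytebit_fldmod1` and `bytebit_divrem` are made up for illustration, and the snippet does not try to explain the timing difference.

```julia
# Two ways to map a 1-based element index i to a 1-based (byte, bit)
# position in a packed bitmap with 8 bits per byte.
bytebit_fldmod1(i::Integer) = fldmod1(i, 8)
bytebit_divrem(i::Integer) = divrem(i - 1, 8) .+ (1, 1)

# The two mappings agree for every positive index.
@assert all(bytebit_fldmod1(i) == bytebit_divrem(i) for i in 1:10_000)

# Spot checks at byte boundaries.
@assert bytebit_divrem(1) == (1, 1)
@assert bytebit_divrem(8) == (1, 8)
@assert bytebit_divrem(9) == (2, 1)

# They differ for non-positive i, because divrem truncates toward zero
# while fldmod1 floors; valid array indices never reach that case.
@assert bytebit_fldmod1(0) != bytebit_divrem(0)
```

Since array indices are always at least 1, swapping one form for the other does not change which validity bit is read.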

quinnj mentioned this issue Apr 14, 2021

Fix slight perf hit when checking validity bitmap #176

Merged

quinnj closed this as completed in #176 Apr 15, 2021
