Arrow dataset imported as DataFrames are not pure DataFrames? #127

eloualiche · 2021-02-09T19:54:51Z

I have encountered a small issue reading an Arrow file as a DataFrame.
This might not be a bug, but I think it's worth clarifying if it is expected behavior.

tl;dr;
Reading an arrow file into a DataFrame seems to produce results that are not entirely consistent with other DataFrames.

I start by reading an Arrow table into a DataFrame:

> df = Arrow.Table("table.arrow") |> DataFrame
> select(df, :V1, :V2)
4×2 DataFrame
 Row │ V1      V2
     │ String  String?
─────┼───────────────────────────────────────────────
   1 │ 1R4C    missing
   2 │ 6V1B    missing
   3 │ FAA8    2020-07-10T18:26
   4 │ KH9Z    2020-07-10T18:26

I want to replace the missing values of V2 with something else (another date):

> df[ ismissing.(df[!, :V2]), :V2] .= "2020-06-09T17:08"
ERROR: setindex! not defined for Arrow.List{Union{Missing, String},Int32,Array{UInt8,1}}

I looked carefully at the DataFrame and realized that the column was still an Arrow.List:

> df[!, :V2]
4-element Arrow.List{Union{Missing, String},Int32,Array{UInt8,1}}
...

Applying another DataFrame conversion, I got the "correct" type:

> DataFrame(df)[!, :V2]
4-element Array{Union{Missing, String},1}:
...

With that type, the initial transformation works. So one solution would be to apply the DataFrame conversion twice:

> df = Arrow.Table("table.arrow") |> DataFrame |> DataFrame;
> df[ ismissing.(df[!, :V2]), :V2] .= "2020-06-09T17:08"
4-element view(::Array{Union{Missing, String},1}, [1, 2, 3, 4]) with eltype Union{Missing, String}:
...

which accomplishes the desired change in place.

Is this expected behavior?

quinnj · 2021-02-10T04:27:05Z

Yes, this is expected. We tried in the documentation to convey that Arrow.Tables are immutable, as designed in the official Arrow spec. Arrow data in general is meant as an "analytical" format, to enable analysis workloads to process and share data at the highest possible speeds between implementations.

That said, when used as a storage format, it's often desirable to further process the data and mutate. You can make copies of the arrow data as mutable structures by doing something like df = DataFrame(Arrow.columntable(Arrow.Table(file))) to get normal Vector arrays.
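
For readers who want a runnable illustration of the "copy into mutable structures" idea, here is a minimal sketch. It is not the exact call suggested above: it assumes that applying `collect` to each column (via DataFrames' `mapcols`) is an acceptable way to materialize the Arrow vectors as ordinary `Vector`s. The temporary file path and the sample values are made up for the example.

```julia
using Arrow, DataFrames

# Write a small Arrow file so the example is self-contained.
# (The path and the values are illustrative, not the original data.)
path = joinpath(mktempdir(), "table.arrow")
Arrow.write(path, (V1 = ["1R4C", "6V1B", "FAA8", "KH9Z"],
                   V2 = [missing, missing, "2020-07-10T18:26", "2020-07-10T18:26"]))

# Columns coming straight from Arrow.Table may be read-only Arrow vectors.
df_arrow = DataFrame(Arrow.Table(path))

# Assumption: collecting each column yields a plain, mutable Vector.
df = mapcols(collect, df_arrow)

# The in-place replacement from the original post now has a mutable target.
df[ismissing.(df.V2), :V2] .= "2020-06-09T17:08"
```

The copy costs memory proportional to the table size, so it gives up the zero-copy benefit of Arrow in exchange for mutability.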

eloualiche · 2021-02-10T09:18:38Z

Thanks! This is very useful.

Thanks a lot for all the work that went into this amazing project.

quinnj closed this as completed Feb 10, 2021
