Arrow dataset imported as DataFrames are not pure DataFrames? #127

Closed
eloualiche opened this issue Feb 9, 2021 · 2 comments

@eloualiche

I have encountered a small issue reading an Arrow file as a DataFrame.
This might not be a bug, but I think it's worth clarifying if it is expected behavior.

tl;dr:
Reading an Arrow file into a DataFrame seems to produce a result that does not behave quite like other DataFrames.

I start by reading an Arrow table into a DataFrame:

> df = Arrow.Table("table.arrow") |> DataFrame
> select(df, :V1, :V2)
4×2 DataFrame
 Row │ V1      V2
     │ String  String?
─────┼───────────────────────────────────────────────
   1 │ 1R4C    missing
   2 │ 6V1B    missing
   3 │ FAA8    2020-07-10T18:26
   4 │ KH9Z    2020-07-10T18:26

I want to replace the missing values of V2 with something else (another date).

> df[ ismissing.(df[!, :V2]), :V2] .= "2020-06-09T17:08"
ERROR: setindex! not defined for Arrow.List{Union{Missing, String},Int32,Array{UInt8,1}}

Looking more closely at the DataFrame, I realized that the column was still an Arrow list:

> df[!, :V2]
4-element Arrow.List{Union{Missing, String},Int32,Array{UInt8,1}}
...

Applying another DataFrame "conversion", I got the "correct" type:

> DataFrame(df)[!, :V2]
4-element Array{Union{Missing, String},1}:
...

With that, the original assignment works. So one solution would be to apply the DataFrame conversion twice:

> df = Arrow.Table("table.arrow") |> DataFrame |> DataFrame;
>  df[ ismissing.(df[!, :V2]), :V2] .= "2020-06-09T17:08"
4-element view(::Array{Union{Missing, String},1}, [1, 2, 3, 4]) with eltype Union{Missing, String}:
...

which accomplishes the desired change in place.

Is this expected behavior?

@quinnj
Member

quinnj commented Feb 10, 2021

Yes, this is expected. We tried in the documentation to convey that Arrow.Tables are immutable, as is designed in the official arrow spec. Arrow data in general is meant as an "analytical" format, to enable analysis workloads to process and share data at highest possible speeds between implementations.

That said, when used as a storage format, it's often desirable to further process the data and mutate. You can make copies of the arrow data as mutable structures by doing something like df = DataFrame(Arrow.columntable(Arrow.Table(file))) to get normal Vector arrays.
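
A minimal sketch of the copy-then-mutate idea, under stated assumptions: it uses `collect` on the single column rather than the `columntable` call quoted above, since `collect` returns a plain `Vector` for any `AbstractVector`. The temporary file and its contents are reconstructed from the output in the original report so that the snippet is self-contained; the real `table.arrow` may differ.

```julia
using Arrow, DataFrames

# Recreate a small file resembling the one in the report
# (values copied from the printed output; the real file may differ).
path = joinpath(mktempdir(), "table.arrow")
Arrow.write(path, (V1 = ["1R4C", "6V1B", "FAA8", "KH9Z"],
                   V2 = [missing, missing, "2020-07-10T18:26", "2020-07-10T18:26"]))

# Read the file; depending on package versions the columns may still be
# read-only Arrow-backed vectors at this point.
df = DataFrame(Arrow.Table(path))

# Materialize the column to be mutated as an ordinary Julia Vector.
df.V2 = collect(df.V2)

# The in-place assignment from the report now targets a mutable column.
df[ismissing.(df.V2), :V2] .= "2020-06-09T17:08"
```

If no in-place update is needed, `df.V2 = coalesce.(df.V2, "2020-06-09T17:08")` replaces the whole column with a freshly allocated vector and so avoids writing into the Arrow data at all.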

@quinnj quinnj closed this as completed Feb 10, 2021
@eloualiche
Author

Thanks! This is very useful.

Thanks a lot for all the work that went into this amazing project.
