Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add Iterators.partition for DataFrameRows #3299

Merged
merged 3 commits into from
Apr 8, 2023
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Mar 12, 2023

Fixes #3298

@svilupp - could you please review this PR against your requirements?

@nalimilan - do you think it should be patch or should go to 1.6 release?

@bkamins bkamins requested a review from nalimilan March 12, 2023 20:16
@bkamins bkamins added this to the patch milestone Mar 12, 2023
@bkamins
Copy link
Member Author

bkamins commented Mar 12, 2023

@svilupp - a related decision is #3054. Would you need Tables.rows(df) and Tables.colums(df) to be a valid Tables.jl table, or it is enough for you that they are iterators?

@svilupp
Copy link
Contributor

svilupp commented Mar 13, 2023

Wow! So fast! Thank you, @bkamins!

This fixes apache/arrow-julia#403

Missigness is now retained through Arrow serialization and the test case succeeds:

df = Tables.rowtable((; x1 =["a","b",missing,"c"], x2 = 1:4)) |> DataFrame
Arrow.write(fn, Iterators.partition(Tables.rows(df), 2); compress = nothing);
t=Arrow.Table(fn)
t[:x1]|>eltype  
# Union{Missing, String}

Out of curiosity, it's not entirely clear what makes the default fallback lose the type information given that it's views. Is it this line that fixed it?


It would be awesome if this could be a patch, because there are other partitioning use cases where people rely on the general fallback for Tables-sources Iterators.partition(Tables.rows(df),chunksize) and DataFrames are arguably the most popular source.
I think there are other packages affected as well (eg, here)

The broader context is that most basic users aren't aware that they need to partition their datasets manually to leverage threading in Arrow (which makes Arrow pretty slow in comparison with PyArrow / Polars, see here).

I hope to change that by chunking data by default (PR here), which is something that PyArrow does automatically (reference).
For the automatic chunking to work, we'd need to have a consistent chunking method for all Tables-sources (hence, the need for the above call to work). I appreciate that we could add some specializations by type but that would required heavier dependencies.

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2023

CI error on nightly is unrelated (it is a problem with compilation time and I cannot reproduce it locally)

@svilupp
Copy link
Contributor

svilupp commented Mar 13, 2023

@svilupp - a related decision is #3054. Would you need Tables.rows(df) and Tables.colums(df) to be a valid Tables.jl table, or it is enough for you that they are iterators?

For the Arrow.write use case, I don't think we need more than what we have (take it with a pinch of salt -- it's my first week of getting to know Arrow.jl code base).
The only Tables methods I see used are: partitions, columns (and Columns), schema (and Schema), rowcount, columnnames, getcolumn and eachcolumn.

I think rowcount is the equivalent of nrow (mentioned in the issue 3054), but it already has a pretty decent fallback

Perhaps @quinnj is better suited to provide his opinion here?

@bkamins
Copy link
Member Author

bkamins commented Mar 13, 2023

Out of curiosity, it's not entirely clear what makes the default fallback lose the type information given that it's views.

No, the reason was that previously we relied on Julia inferring the type of elements of the collection by inspecting them (and a narrowest eltype was therefore picked). Now we pass object that carries over schema information.


Aside. The pattern:

Iterators.partition(Tables.rows(df),chunksize)

does not seem to be fully correct. The point is that Tables.rows contract does not guarantee to return a valid Tables.jl table. It only guarantees to return an AbstractRow-compatible iterator , which is a weaker requirement. However, @quinnj probably should verify this claim. However, this is the reason of your problem.

This is the situation before this PR:

julia> isnothing(Tables.schema(first(Iterators.partition(eachrow(df), 2))))
true

what the PR changes is that the above statement returns a schema of the underlying data frame (as this is doable, however, I believe that Tables.jl contract does not guarantee this).

src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
src/abstractdataframe/iteration.jl Outdated Show resolved Hide resolved
bkamins and others added 2 commits April 8, 2023 13:29
@bkamins bkamins merged commit 23a28b1 into main Apr 8, 2023
@bkamins bkamins deleted the bk/partition_DataFrameRows branch April 8, 2023 15:39
@bkamins
Copy link
Member Author

bkamins commented Apr 8, 2023

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

When partitioned, partition might lose the missingness eltype (in Tables.schema)
3 participants