Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nano-arrow #11179

Merged
merged 13 commits into from
Sep 19, 2023
Merged

nano-arrow #11179

merged 13 commits into from
Sep 19, 2023

Conversation

ritchie46
Copy link
Member

Polars has done great on arrow2, but now that Jorge has stepped back, the benefits of utilizing arrow(2) (with some choices not being ideal for our usecase) are much less prevalent. This will fork/continue on the great work of arrow2 and repel almost everything, keeping only the memory specification, IPC and interop with arrow-rs intact.

All compute and IO eventually will be implemented/integrated within polars. Arrow is such a big dependency of polars and we are so tightly integrated with this, that we want this in the same repo.

We can share dependencies within the same workspace and keep versions and CI tightly coupled.

@ritchie46 ritchie46 requested a review from orlp as a code owner September 18, 2023 12:54
@ritchie46 ritchie46 changed the title WIP: nano-arrow nano-arrow Sep 19, 2023
@ritchie46 ritchie46 merged commit 122b2ed into main Sep 19, 2023
18 checks passed
@ritchie46 ritchie46 deleted the nano-arrow branch September 19, 2023 08:41
@eitsupi
Copy link
Contributor

eitsupi commented Oct 9, 2023

Hi, given that the name "nanoarrow" is already in use by Apache Arrow project https://github.com/apache/arrow-nanoarrow, you might want to rename it before the first release to crates.io.

@ritchie46
Copy link
Member Author

That release already happened 21 days ago: https://crates.io/crates/nano-arrow.

The project you point to is C++. I don't believe there is rust counterpart of that. That's what I want for nano-arrow. A very minimal implementation that only implements the memory spec. Could rename it, but I don't think there is any conflict at the moment.

@ritchie46
Copy link
Member Author

The name isn't important to us. Let's move the code into polars-arrow. The name nano-arrow can be used by apache. If they want to have the crates.io handle, please ping me. :)

@eitsupi
Copy link
Contributor

eitsupi commented Oct 9, 2023

Disclaimer: I just wanted to let you know that "nanoarrow" already exists because I happened to find this PR, and I don't want to use this name myself.
I just thought the similar names were confusing. (For example, the R polars package has the nanoarrow R package as an optional dependency and is used for data conversion with the arrow package. (pola-rs/r-polars#5))

I think the name changes here are helpful to make things easier to understand for users. Thank you for your prompt reply.

@aldanor
Copy link
Contributor

aldanor commented Nov 11, 2023

@ritchie46 Just learned about this whole ordeal by stumbling into this after bumping to 0.34

error[E0277]: the trait bound `polars_core::frame::DataFrame: std::convert::From<(arrow2::chunk::Chunk<std::boxed::Box<dyn arrow2::array::Array>>, &[arrow2::datatypes::Field])>` is not satisfied
    |
    |         Ok(DataFrame::try_from((chunk, fields.as_slice()))?)
    |            ------------------- ^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `std::convert::From<(arrow2::chunk::Chunk<std::boxed::Box<dyn arrow2::array::Array>>, &[arrow2::datatypes::Field])>` is not implemented for `polars_core::frame::DataFrame`
    |            |

There's a subset of users (myself and my team included) who have code written for arrow2 that interops with polars-core on the Rust side (e.g. if you want to parallelize chunk loading in a particular way, where chunks are organized in some non-automatic problem-specific fashion etc), writing your own low-level arrow routines may become somewhat critical if dealing with very large amounts of arrow data. We also have Rust projects (e.g. ones writing arrow data) that don't depend on polars at all and simply use the low-level api of arrow2 for doing so due to the amount of data being processed in a streaming fashion (hence my recent prs in arrow2 in fixing mutable dicts that were not behaving correctly).

We are very concerned about it, given there's no explicit statements about it, hence a few questions:

  • Does polars-arrow become a purely 'internal' crate for polars/polars-core, which implies the polars team could then throw out unneeded parts, reshuffle the code as they see fit, so it aligns better with polars goals?
  • Or, does polars-arrow become an arrow2 replacement even for projects not using polars on the Rust side? If it will be better maintained than arrow2, this is a legit and a pretty important question. If true, than arrow2 repo should definitely be marked as deprecated with a link to polars-arrow.
  • If the latter is the case, is it worth moving polars-arrow out into a separate crate, so as not to mix commits and issues with the massive amount of polars/py-polars commits/issues?
  • Are there any guarantees that the arrow2/polars-arrow code will stay (somewhat) intact? (i.e. can we s/arrow2/polars_arrow/g and hope that this code will survive for non-zero amount of time?)

@ritchie46
Copy link
Member Author

ritchie46 commented Nov 15, 2023

Does polars-arrow become a purely 'internal' crate for polars/polars-core, which implies the polars team could then throw out unneeded parts, reshuffle the code as they see fit, so it aligns better with polars goals?

This one. There will be a public API, but it will be limited in goal. For polars we want to adhere to arrow memory, but have compute in polars. Consumers/producers of a different arrow implementation should still be able to move the data into polars zero copy. Either via polars-arrow or via the arrow C-ffi.

Are there any guarantees that the arrow2/polars-arrow code will stay (somewhat) intact? (i.e. can we s/arrow2/polars_arrow/g and hope that this code will survive for non-zero amount of time?)

no guarantees, but we want to keep all the builders for the data-types we support in polars. The compute and IO maybe removed. What do you use mostly?

@aldanor
Copy link
Contributor

aldanor commented Jan 3, 2024

@ritchie46 Totally forgot to reply to this one:

we want to keep all the builders for the data-types we support in polars. The compute and IO maybe removed. What do you use mostly?

Here's a sample use case - you have a custom parallelized chunk reader written in arrow2, you end up with a bunch of chunks and you want to create a polars dataframe out of them. So you can (a) rely on all the low-level tools available for io in arrow2 but (b) interface with the outer world via polars. This used to be possible but I believe now it's not (see my code snippet posted above).

I guess, to formalize this question: right now, for most of the low-level arrow2 code, if you simply replace s/arrow2/polars_arrow/g, the code will still work (i.e. array::*, datatypes::* and io::* are mostly untouched). Can one expect that at least array/datatypes/io functionality will remain more-or-less intact in polars-arrow?

@ritchie46
Copy link
Member Author

Yes, you can expect that. Though parquet is moved to polars-parquet crate. It's usage is the same. I think it is actually better, as we already learn that we maintain and develop our version of arrow much more actively since the fork.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants