Skip to content
This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

arrow<->arrow2 interopability conversion method? #629

Open
dbr opened this issue Nov 24, 2021 · 11 comments
Open

arrow<->arrow2 interopability conversion method? #629

dbr opened this issue Nov 24, 2021 · 11 comments
Labels
question Further information is requested

Comments

@dbr
Copy link
Contributor

dbr commented Nov 24, 2021

I'm using arrow2 in a project along with polars. I now also want to send the same data to datafusion, which uses arrow (ideally without having to send the data through via the IPC serialization format)

Before I fumble around with the FFI API, I figured I would check first: is there a method somewhere which handles the conversion of a RecordBatch from arrow to arrow2 and vice versa? Seems like something that might exist already in some kind of integration test or similar, but "arrow<->arrow2" is quite a tricky thing to search for

@jorgecarleitao jorgecarleitao added the question Further information is requested label Nov 24, 2021
@jorgecarleitao
Copy link
Owner

Nop, we do not have such a methods. That is because doing so requires arrow depending on arrow2 or vice-versa.

The arrow format was designed exactly to support this case, though, and the FFI is really the conversion. What happens is that a RecordBatch is not part of arrow in-memory specification, it is just a struct that exists in some implementations.

@ritchie46
Copy link
Collaborator

There could be a crate one level higher that implements the conversion.

    arrow-conv
         |
        /\
      /    \
arrow      arrow2 

It could maybe support conversion of the core types, ArrayRef and RecordBatch.

@houqp
Copy link
Collaborator

houqp commented Nov 27, 2021

off topic, but in case you missed it, there is also a fairly uptodate arrow2 branch for datafusion that's being worked on.

@renato2099
Copy link

Hi @houqp , may I ask what branch is that? and is that going to be under apache or also as part of a separate repo?

@alamb
Copy link
Collaborator

alamb commented Jan 2, 2022

I think it means apache/datafusion#68

@multimeric
Copy link

doing so requires arrow depending on arrow2 or vice-versa.

This could easily be feature-gated though, so there is no hard dependency. The FFI works but is very awkward to do manually since it involves working with raw pointers, versus the lovely abstraction of just calling .into().

@dbr
Copy link
Contributor Author

dbr commented Mar 7, 2022

This could easily be feature-gated though, so there is no hard dependency

True, although I think there is some benefits to having it be a separate crate, mainly the releases wouldn't be tied to any either project's release cycle (i.e a new release of the conv crate could be made whenever arrow or arrow2 is released)

It also could allow the crate to work on multiple versions of either version via feature flags (similar to how multiple winit versions are handled here)

The crate could also support the pyarrow interop via PyO3, which would have the same benefits (e.g currently arrow-rs 9 only supports pyo3 0.15, and we are blocked on updating pyo3 until next release of arrow)

Biggest drawback of it being a separate crate would be you can't as neatly implement some conversion traits (since they would be forgien types)

@alamb
Copy link
Collaborator

alamb commented Mar 7, 2022

I think a conversion crate would be valuable indeed, though I don't have time to work on such a thing now.

We could potentially host it in the https://github.com/datafusion-contrib organization and there might be others in the community who are interested in helping -- see the arrow2 milestone https://github.com/apache/arrow-datafusion/milestone/3

@jorgecarleitao
Copy link
Owner

designing a bit, I think that we would need 6 functions:

pub fn arrow_to_arrow2_error(error: arrow::error::ArrowError) -> arrow2::error::ArrowError;
pub fn arrow_to_arrow2_field(field: arrow::datatypes::Field) -> Result<arrow2::datatypes::Field, arrow2::error::ArrowError>;
pub fn arrow_to_arrow2_array(array: Arc<dyn arrow::array::Array>) -> Result<Arc<dyn arrow2::array::Array>, arrow2::error::ArrowError>;

pub fn arrow2_to_arrow_error(error: arrow2::error::ArrowError) -> arrow::error::ArrowError;
pub fn arrow2_to_arrow_field(field: arrow2::datatypes::Field) -> Result<arrow::datatypes::Field, arrow::error::ArrowError>;
pub fn arrow2_to_arrow_array(array: Arc<dyn arrow2::array::Array>) -> Result<Arc<dyn arrow::array::Array>, arrow::error::ArrowError>;

@ritchie46
Copy link
Collaborator

You miss out on a valid naming option: arrow2arrow 😜

@alamb
Copy link
Collaborator

alamb commented Mar 7, 2022

You miss out on a valid naming option: arrow2arrow 😜

🤯 lol

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

7 participants