Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api!): make ArrowArrayStreamReader Send #4232

Merged
merged 5 commits into from
May 27, 2023

Conversation

wjones127
Copy link
Member

@wjones127 wjones127 commented May 17, 2023

Which issue does this PR close?

Closes #4222

Rationale for this change

The underlying implementation is arguably Send, and making it not Send makes it extremely hard to work with. This drops support for Clone, but if someone wants to sacrifice Send for Clone, they can wrap it in an Arc themselves.

What changes are included in this PR?

  • Alters implementation of ArrowArrayStreamReader to use Box instead of Arc internally.
  • Refactors PyArrowConvert convert into a few different traits, so that we can differentiate between conversions that borrow vs those that take ownership.
  • Cleans up some of the ffi_stream code to remove some unnecessary heap allocations.

Are there any user-facing changes?

This is a breaking change to PyArrowConvert.

@github-actions github-actions bot added the arrow Changes to the arrow crate label May 17, 2023
@wjones127 wjones127 force-pushed the GH-4222-c-stream-send branch from 06753ea to c51d589 Compare May 21, 2023 03:27
@wjones127 wjones127 force-pushed the GH-4222-c-stream-send branch from c51d589 to 0f4aa77 Compare May 21, 2023 03:36
@wjones127 wjones127 marked this pull request as ready for review May 21, 2023 03:54
@tustvold
Copy link
Contributor

I'm not sure about the changes to the python interface, we've run into challenges in the past because python can't transfer ownership. I'll have a play later today and see if we can't make it Send without requiring a breaking change

Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, I like where this is headed. I think there is one memory safety issue, but otherwise mostly just nits

//!
//! // consumed and used by something else...
//!
//! // import it
//! let stream_reader = unsafe { ArrowArrayStreamReader::from_raw(stream_ptr).unwrap() };
//! let stream_reader = unsafe { ArrowArrayStreamReader::from_raw(&mut stream).unwrap() };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
//! let stream_reader = unsafe { ArrowArrayStreamReader::from_raw(&mut stream).unwrap() };
//! let stream_reader = ArrowArrayStreamReader::try_new(stream).unwrap();

I think we could deprecate from_raw

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still relevant. For example, there are cases where we need to implement a C API where we will be passed a pointer to an ArrowArrayStreamReader. For example, this is done in the "bind stream" function in ADBC:

https://github.com/apache/arrow-adbc/blob/75ba2cef9080b19f4d242d0a03c17db304609d85/rust/src/implement/internal.rs#L647-L664

@@ -231,8 +227,7 @@ impl ExportedArrayStream {
let struct_array = StructArray::from(batch);
let array = FFI_ArrowArray::new(&struct_array.to_data());

unsafe { std::ptr::copy(addr_of!(array), out, 1) };
std::mem::forget(array);
unsafe { std::ptr::write_unaligned(out, array) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will cause the array to be released when array goes out of scope, i.e. immediately. I think eventually this will result in a double-free

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question. However, the docs for write_unaligned say that src is guaranteed not to be dropped:

https://doc.rust-lang.org/std/ptr/fn.write_unaligned.html

Internally, it's calling intrinsics::forget(). So I think this is sound.
https://doc.rust-lang.org/src/core/ptr/mod.rs.html#1447

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIL, I had no idea that was the semantic - makes sense

let stream = Box::new(FFI_ArrowArrayStream::empty());
let stream_ptr = Box::into_raw(stream) as *mut FFI_ArrowArrayStream;

unsafe { export_reader_into_raw(Box::new(self.clone()), stream_ptr) };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why IntoPyArrow is needed, this needs to "consume" the stream

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. It's a little messy, but I did want to convey we are moving ownership, while all the other conversion are either creating entirely new objects (data types) or sharing ownership (arrays).


impl<'source, T: PyArrowConvert> FromPyObject<'source> for PyArrowType<T> {
impl<'source, T: FromPyArrow + IntoPyArrow> FromPyObject<'source> for PyArrowType<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl<'source, T: FromPyArrow + IntoPyArrow> FromPyObject<'source> for PyArrowType<T> {
impl<'source, T: FromPyArrow> FromPyObject<'source> for PyArrowType<T> {

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason both traits are there is because PyArrowType<T> requires. If I remove one we get

the trait bound `T: FromPyArrow` is not satisfied

fn extract(value: &'source PyAny) -> PyResult<Self> {
Ok(Self(T::from_pyarrow(value)?))
}
}

impl<'a, T: PyArrowConvert> IntoPy<PyObject> for PyArrowType<T> {
impl<T: FromPyArrow + IntoPyArrow> IntoPy<PyObject> for PyArrowType<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl<T: FromPyArrow + IntoPyArrow> IntoPy<PyObject> for PyArrowType<T> {
impl<T: IntoPyArrow> IntoPy<PyObject> for PyArrowType<T> {

}
}

impl<T: PyArrowConvert> From<T> for PyArrowType<T> {
impl<T: FromPyArrow + IntoPyArrow> From<T> for PyArrowType<T> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl<T: FromPyArrow + IntoPyArrow> From<T> for PyArrowType<T> {
impl<T> From<T> for PyArrowType<T> {

@tustvold tustvold added the api-change Changes to the arrow API label May 22, 2023
@tustvold
Copy link
Contributor

tustvold commented May 24, 2023

The integration test failure looks legitimate and would suggest something is off with this PR somewhere...

/home/runner/work/_temp/57085026-34c0-45a7-adab-58efe95c1cf7.sh: line 4: 9787 Aborted (core dumped) pytest -v .
tests/test_sql.py::test_record_batch_reader

@wjones127
Copy link
Member Author

Yes, thanks for the reminder. I'll look into this soon.

@tustvold tustvold merged commit 3e5b07a into apache:master May 27, 2023
alamb pushed a commit to alamb/arrow-rs that referenced this pull request May 30, 2023
* feat(api make ArrowArrayStreamReader Send

* simplify ptr handling

* rename pyarrow traits to conform to guidelines

* pr feedback

* remove dangling Box::from_raw
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make ArrowArrayStreamReader Send
2 participants