Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decouple buffer deallocation from ffi and allow creating buffers from rust vec #1494

Merged
merged 7 commits into from
Apr 7, 2022

Conversation

jhorstmann
Copy link
Contributor

@jhorstmann jhorstmann commented Mar 28, 2022

Which issue does this PR close?

Closes #1516

Rationale for this change

I'd like to process data that is coming from outside arrow using arrow compute kernels without copying the data.

The FFI api does not seem to support this use case at the moment. The only way I found to create an FFI_ArrowArray was with an already existing ArrayData struct. Creating ArrayData requires at least one Buffer, and creating a Buffer from foreign memory again needs an FFI_ArrowArray.

What changes are included in this PR?

The owner of an allocation is tracked via a trait object instead of an FFI_ArrowArray. This allows zero-copy interoperability between arrow buffers and most collection types from rust stdlib or other crates. The most common example would be Vec or String, and it should also be possible to access arrow2 buffers this way.

Are there any user-facing changes?

The function from_unowned was marked as deprecated and delegates to a new function from_custom_allocation. The "unowned" part is abit misleading since the data is owned by the FFI_ArrayArray.

The other changes should be api compatible. The bytes module is pub(crate), so moving the Deallocation enum should be fine.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Mar 28, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is really cool -- nice work @jhorstmann

In terms of naming, I would like to suggest we keep it backwards compatible for a while (so as not to have to bump major versions for a bit)

So that would mean

  1. Keep the old interface named from_unowned, mark it deprecated, change impl to call from_foreign
  2. Add new from_unowned() that does what you suggest

cc @viirya and @sunchao

arrow/src/alloc/mod.rs Outdated Show resolved Hide resolved
Buffer::from_foreign(
NonNull::new_unchecked(strings.as_mut_ptr()),
strings.len(),
Arc::new(strings),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems pretty neat to me that the ownership of the Vec is transferred to Arc which is then stored (as dyn Allocation)

@alamb
Copy link
Contributor

alamb commented Mar 29, 2022

BTW I ran this under MIRI and it looks good:

cargo +nightly miri test -p arrow -- test_string_data_from_foreigntest_string_data_from_foreign
...
running 1 test
test array::data::tests::test_string_data_from_foreign ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 1072 filtered out

   Doc-tests arrow

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 145 filtered out; finished in 0.03s

🎉


/// The owner of an allocation, that is not natively allocated.
/// The trait implementation is responsible for dropping the allocations once no more references exist.
pub trait Allocation: RefUnwindSafe {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The trait bound on RefUnwindSafe was needed to make one test compile. I'll need to check that in more detail and document why it is required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an explicit test that Buffer is UnwindSafe to make this requirement clearer.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good idea to me. Thanks.

arrow/src/alloc/mod.rs Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

Codecov Report

Merging #1494 (d23e744) into master (6bf3b3a) will increase coverage by 0.01%.
The diff coverage is 75.11%.

❗ Current head d23e744 differs from pull request most recent head 55ddb82. Consider uploading reports for the commit 55ddb82 to get more accurate results

@@            Coverage Diff             @@
##           master    #1494      +/-   ##
==========================================
+ Coverage   82.72%   82.74%   +0.01%     
==========================================
  Files         188      188              
  Lines       54286    54389     +103     
==========================================
+ Hits        44908    45002      +94     
- Misses       9378     9387       +9     
Impacted Files Coverage Δ
arrow/src/alloc/mod.rs 78.04% <0.00%> (-13.38%) ⬇️
arrow/src/array/array_binary.rs 92.62% <ø> (ø)
arrow/src/array/array_string.rs 97.65% <ø> (ø)
...ion-testing/src/bin/arrow-json-integration-test.rs 0.00% <0.00%> (ø)
arrow/src/array/equal/list.rs 79.38% <59.57%> (-9.85%) ⬇️
arrow/src/bytes.rs 73.07% <80.00%> (+16.82%) ⬆️
parquet/src/arrow/array_reader.rs 86.70% <80.00%> (+2.34%) ⬆️
arrow/src/buffer/immutable.rs 98.46% <92.85%> (-1.00%) ⬇️
arrow/src/array/equal/utils.rs 77.68% <95.45%> (+3.68%) ⬆️
arrow/src/array/data.rs 83.12% <100.00%> (+0.32%) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6bf3b3a...55ddb82. Read the comment docs.

@jhorstmann jhorstmann marked this pull request as ready for review April 1, 2022 13:27
@jhorstmann jhorstmann changed the title RFC: Decouple buffer deallocation from ffi and allow creating buffers from rust vec Decouple buffer deallocation from ffi and allow creating buffers from rust vec Apr 1, 2022
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks great @jhorstmann -- thank you. @viirya / @tustvold / @sunchao any last thoughts before I merge it in?

I believe it is an API change, so we will bump the major version of arrow to use it (but that is probably ok given stuff like #1510 which will require a new version as well

///
/// # Safety
///
/// This function is unsafe as there is no guarantee that the given pointer is valid for `len`
/// bytes and that the foreign deallocator frees the region.
#[deprecated(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb
Copy link
Contributor

alamb commented Apr 6, 2022

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thanks @jhorstmann

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2

@alamb
Copy link
Contributor

alamb commented Apr 7, 2022

I merged this PR with master and fixed a rustdoc issue with e6c6c07 https://github.com/apache/arrow-rs/runs/5756218032?check_suite_focus=true -- thanks @jhorstmann

@alamb
Copy link
Contributor

alamb commented Apr 7, 2022

I believe the clippy errors are due to rust 1.60 being released. I will create a new PR

@alamb alamb merged commit 688dd4c into apache:master Apr 7, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support creating arrays from externally owned memory like Vec or String
5 participants