ARROW-10109: [Rust] Add support to the C data interface for primitive types and utf8 #8401
Conversation
Thanks @jorgecarleitao, this is really interesting. I've had a first pass through to get familiar with this, and I will hopefully try building it locally sometime this weekend.
Just FTR, in C++ we have a global function that returns the number of currently Arrow-allocated bytes. This helps us write crude resource allocation tests (here through the Python wrapper). Another possibility would be to use the callback facility on your side.
Some changes since last time:
The tool counts all memory allocations and deallocations, like in C++. It is used as a test at the end of all tests, as a final validation that the test suite does not leak. I've placed it behind a feature gate, as it currently only works in single-threaded programs. Suggestions to improve it further are welcome. I think that this is now ready to review.
This is a major refactor of the `equal.rs` module. The rationale for this change is manifold:

* Currently, array comparison requires downcasting the array ref to its concrete types. This is painful and not very ergonomic, as the user must "guess" what to downcast for comparison. We can see this in the hacks around the `sort`, `take` and `concatenate` kernels' tests, and some of the tests of the builders.
* The code in array comparison is difficult to follow, given the amount of calls that it performs around offsets.
* The implementation currently indirectly uses many of the `unsafe` APIs that we have (via pointer arithmetic), which makes it risky to operate on and mutate.
* Some code is repeated.

This PR:

1. Adds `impl PartialEq for dyn Array`, to allow `Array` comparison based on `Array::data` (main change)
2. Makes array equality depend only on `ArrayData`, i.e. it no longer depends on concrete array types (such as `PrimitiveArray` and related API) to perform comparisons.
3. Significantly reduces the risk of panics and UB when composite arrays are of different types, by checking the types on `range` comparison
4. Makes array equality statically dispatched, via `match datatype`.
5. DRYs the code around array equality
6. Fixes an error in equality of dictionaries with equal values
7. Adds tests to equalities that were not tested (fixed binary, some edge cases of dictionaries)
8. Splits `equal.rs` into smaller, more manageable files.
9. Removes `ArrayListOps`, since it is no longer needed
10. Moves JSON equality to its own module, for clarity.
11. Removes the need to have two functions per type to compare arrays.
12. Adds the number of buffers and their respective width to datatypes from the specification. This was backported from #8401
13. Adds a benchmark for array equality

Note that this does not implement `PartialEq` for `ArrayData`, only `dyn Array`, as different data does not imply a different array (due to nullability). That implementation is being worked on in #8200.

IMO this PR significantly simplifies the code around array comparison, to the point where many implementations are 5 lines long. It also improves performance by 10-40%.

<details> <summary>Benchmark results</summary>

```
Previous HEAD position was 3dd3c69 Added bench for equality.
Switched to branch 'equal'
Your branch is up to date with 'origin/equal'.
   Compiling arrow v3.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/arrow)
    Finished bench [optimized] target(s) in 51.28s
     Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/equal-176c3cb11360bd12
Gnuplot not found, using plotters backend

equal_512               time:   [36.861 ns 36.894 ns 36.934 ns]
                        change: [-43.752% -43.400% -43.005%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe

equal_nulls_512         time:   [2.3271 us 2.3299 us 2.3331 us]
                        change: [-10.846% -9.0877% -7.7336%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

equal_string_512        time:   [49.219 ns 49.347 ns 49.517 ns]
                        change: [-30.789% -30.538% -30.235%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe

equal_string_nulls_512  time:   [3.7873 us 3.7939 us 3.8013 us]
                        change: [-8.2944% -7.0636% -5.4266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) high mild
  8 (8.00%) high severe
```

</details>

All tests are there, plus new tests for some of the edge cases and untested arrays.

This change is backward incompatible: `array1.equals(&array2)` no longer works; use `array1 == array2` instead, which is the idiomatic way of comparing structs and trait objects in Rust.

Closes #8541 from jorgecarleitao/equal

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Neville Dipale <nevilledips@gmail.com>
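The `impl PartialEq for dyn Array` idea in point 1 can be illustrated with a self-contained sketch. The `Array`/`ArrayData` types below are toy stand-ins, not arrow's actual definitions: the point is that equality is defined once on the trait object in terms of `data()`, so callers compare with `==` instead of downcasting to concrete array types.

```rust
// Toy stand-ins for arrow's `ArrayData` / `Array` (hypothetical shapes for illustration).
#[derive(PartialEq, Debug)]
struct ArrayData {
    values: Vec<i32>,
    validity: Vec<bool>,
}

trait Array {
    fn data(&self) -> &ArrayData;
}

struct PrimitiveArray {
    data: ArrayData,
}

impl Array for PrimitiveArray {
    fn data(&self) -> &ArrayData {
        &self.data
    }
}

// The pattern from the PR: equality lives on the trait object and is driven
// entirely by `data()`, so no downcasting is needed to compare two arrays.
impl PartialEq for dyn Array {
    fn eq(&self, other: &Self) -> bool {
        self.data() == other.data()
    }
}

fn main() {
    let a = PrimitiveArray { data: ArrayData { values: vec![1, 2], validity: vec![true, true] } };
    let b = PrimitiveArray { data: ArrayData { values: vec![1, 2], validity: vec![true, true] } };
    // Compare through trait objects: idiomatic `==` instead of `a.equals(&b)`.
    let (ra, rb): (&dyn Array, &dyn Array) = (&a, &b);
    assert!(ra == rb);
}
```

This is why the change is backward incompatible: once equality hangs off `dyn Array`, the per-type `equals` methods become redundant.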
@nevi-me, @andygrove, @pitrou, @alamb, I have rebased this PR. I need your guidance here:
Some ideas:
Sorry it took so long to review this PR -- there was a lot here.
Something I had missed earlier was the rationale for this work, which is well spelled out in https://issues.apache.org/jira/browse/ARROW-10109, but I hadn't read that before. For anyone else who is interested, I recommend reviewing that description to gain more context for this PR.
I am not an expert in any of these technologies, but I did read the PR carefully and it makes sense to me.
is this something that we still want to pursue, or should we close this?
I think it is definitely worth pursuing. Thank you for doing so!
Currently the memory tracking is done with a feature gate. This is faster, but requires a new compilation to run the tests with that feature gate.
I think we should always track memory; as @pitrou says, the overhead of updating a counter (especially as the semantics of doing so can be very relaxed) is likely to be noise compared to the actual work of the allocation itself.
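A minimal sketch of what always-on tracking could look like: a wrapping `GlobalAlloc` that maintains a relaxed atomic counter around the system allocator. This is an assumed approach for illustration, not arrow's actual `memory` module; `allocated_bytes` plays the role of the global function @pitrou described in C++.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Running total of live allocated bytes; relaxed ordering is enough because
// we only need an approximate, eventually-consistent counter.
static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

struct CountingAllocator;

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        System.alloc(layout)
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
        System.dealloc(ptr, layout)
    }
}

#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Bytes currently allocated, analogous to C++'s global counter.
pub fn allocated_bytes() -> usize {
    ALLOCATED.load(Ordering::Relaxed)
}

fn main() {
    let before = allocated_bytes();
    let v = vec![0u8; 1024];
    assert!(allocated_bytes() >= before + 1024);
    drop(v);
    // The vector's 1024 bytes are subtracted again on drop.
    assert_eq!(allocated_bytes(), before);
}
```

Because the counter updates are two relaxed atomic ops per allocation, the cost is dwarfed by the allocation itself, which is the argument for keeping it on unconditionally rather than behind a feature gate.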
Currently it tests memory leaks via a test at the end of all tests (and under the feature gate). This covers all tests implicitly, but tests that panic are intrinsically leaky, and thus there is a non-trivial interaction between tests and the memory-check test.
I don't understand the assertion that tests that `panic!` are leaky (the stack is still unwound in an orderly fashion and allocations are `Drop`'d, as I understand), so I think testing at the end is fine unless we get evidence to the contrary.
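The claim that unwinding still runs destructors can be checked with a few lines of std-only Rust (the `Tracked` type is hypothetical, for illustration); under the default `panic = "unwind"` setting, the destructor of a stack value runs even when the enclosing closure panics:

```rust
use std::panic;
use std::sync::atomic::{AtomicUsize, Ordering};

static DROPS: AtomicUsize = AtomicUsize::new(0);

// A type that records when its destructor runs.
struct Tracked;
impl Drop for Tracked {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

fn main() {
    let result = panic::catch_unwind(|| {
        let _t = Tracked; // lives on the panicking frame's stack
        panic!("boom");   // unwinding still runs `_t`'s destructor
    });
    assert!(result.is_err());
    // `Drop` ran despite the panic.
    assert_eq!(DROPS.load(Ordering::SeqCst), 1);
}
```

The caveat is builds with `panic = "abort"`, where no unwinding happens at all; the test harness uses unwinding, so end-of-suite leak checks should be sound there.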
The integration test with Python/C++ requires another compilation, with other compiler flags, which is an extra CI burden.
Maybe we could look into extending some of the existing Python tests, or only running the integration test when the ffi or Python binding code changes, to reduce the burden.
// specific language governing permissions and limitations
// under the License.

//! Contains functionality to load an ArrayData from the C Data Interface
Suggested change:
- //! Contains functionality to load an ArrayData from the C Data Interface
+ //! Contains functionality to load data to/from `ArrayData` from the C Data Interface
I made the following changes:
The CI will likely fail, as I will need to fiddle with it. There will be some spam. Sorry guys!
Codecov Report
@@ Coverage Diff @@
## master #8401 +/- ##
==========================================
- Coverage 84.54% 84.47% -0.08%
==========================================
Files 185 190 +5
Lines 46176 46836 +660
==========================================
+ Hits 39041 39565 +524
- Misses 7135 7271 +136
Continue to review full report at Codecov.
@kszucs, I am trying to fix the failing script. From what I understood, it is used to replace versions throughout the project. I added a new section for the new crate, but I can't understand why the test is failing.
@alamb, @pitrou, @paddyhoran, @andygrove, the integration with CI is in place, the tests pass, and all comments from @alamb and @pitrou have been addressed. If anyone wants to take a final pass, please let me know; otherwise, for me this is ready to merge.
/// Mode of deallocating memory regions
pub enum Deallocation {
    /// Native deallocation, using Rust deallocator with Arrow-specific memory alignment
    Native(usize),
I wonder whether it's worth optimizing the size of the `Bytes` struct. If we use `NonZeroUsize` here, the enum itself might become only the size of one `usize`, but I haven't tried whether that really works.
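The niche optimization being suggested can be demonstrated with a toy version of the enum (the unit `Foreign` variant is a hypothetical layout for illustration; the real variant carries data). Because `NonZeroUsize` can never be zero, the compiler can use the value 0 to encode the other variant, so no separate tag word is needed:

```rust
use std::mem::size_of;
use std::num::NonZeroUsize;

// Sketch of the proposed layout; rustc packs the discriminant into the
// forbidden zero value of `NonZeroUsize` (the "niche").
#[allow(dead_code)]
enum Deallocation {
    Native(NonZeroUsize),
    Foreign, // hypothetical unit variant for illustration
}

fn main() {
    // `Option<NonZeroUsize>` is the canonical, guaranteed example of the trick.
    assert_eq!(size_of::<Option<NonZeroUsize>>(), size_of::<usize>());
    // In practice rustc applies the same niche here, so the whole enum
    // fits in a single machine word.
    assert_eq!(size_of::<Deallocation>(), size_of::<usize>());
}
```

Only the `Option<NonZeroUsize>` case is guaranteed by the language; for user-defined enums the packing is a (very stable) implementation behavior, which matches the "I haven't tried whether that really works" hedge above.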
Deallocation::Native(capacity) => capacity,
// we cannot determine this in general,
// and thus we state that this is externally-owned memory
Deallocation::Foreign(_) => 0,
Returning `len` here might be better, as it would keep the invariant that `capacity() >= len()`.
Good point. Have you seen a good use case for having `capacity` public? IMO it is an implementation detail of the buffer, related to how much it expands on every `reserve` / `resize`. Since we perform the check internally for safety reasons, using it outside of `buffer` seems unnecessary.
Also a good point. It seems it is only used for reporting the memory usage of arrays, and you could argue that does not need to include foreign memory. For that use case the method could have a better name, like `memory_usage`; with the current name I think returning `len` sounds more logical.
    *(self.array.buffers as *mut *const u8).add(1) as *const i32
};
// get last offset
(unsafe { *offset_buffer.add(len / size_of::<i32>() - 1) }) as usize
I spent several minutes thinking about whether the `- 1` is correct. It seems it is, because `len` is the length of the offset buffer, which is one larger than the length of the corresponding array. Not sure how to phrase that in a small comment though.
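A safe, self-contained illustration of the same arithmetic (the `last_offset` helper is hypothetical, not the PR's code): for an array of `n` strings, the offsets buffer holds `n + 1` little `i32` entries, so the last entry, which is the total byte length of the values buffer, sits at index `len / size_of::<i32>() - 1`.

```rust
use std::mem::size_of;

// Read the final offset out of a raw offsets buffer given as bytes.
fn last_offset(offset_bytes: &[u8]) -> usize {
    let n_offsets = offset_bytes.len() / size_of::<i32>(); // array length + 1
    let start = (n_offsets - 1) * size_of::<i32>();
    let last: [u8; 4] = offset_bytes[start..start + 4].try_into().unwrap();
    i32::from_ne_bytes(last) as usize
}

fn main() {
    // Offsets for ["ab", "c", "def"]: 4 entries for 3 strings.
    let offsets: [i32; 4] = [0, 2, 3, 6];
    let bytes: Vec<u8> = offsets.iter().flat_map(|o| o.to_ne_bytes()).collect();
    // The last offset (6) is the total length of the values buffer.
    assert_eq!(last_offset(&bytes), 6);
}
```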
Thinking about this, an alternative is to safely cast all of this to i32 and perform the op on `&[i32]`. IMO this is pretty unsafe if someone sends us data with wrong alignments. :/
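The safe cast being proposed could look roughly like this sketch (an assumed approach, not the follow-up PR's code), using `slice::align_to` so the alignment hazard becomes an explicit, checkable condition instead of a silent UB risk:

```rust
// Reinterpret raw bytes as `&[i32]`, rejecting misaligned foreign data.
fn as_i32_slice(bytes: &[u8]) -> Option<&[i32]> {
    // `align_to` splits off any misaligned prefix/suffix; if either is
    // non-empty, the data was not properly aligned (or not a multiple of 4).
    let (prefix, mid, suffix) = unsafe { bytes.align_to::<i32>() };
    if prefix.is_empty() && suffix.is_empty() {
        Some(mid)
    } else {
        None
    }
}

fn main() {
    let offsets: [i32; 3] = [0, 2, 5];
    let bytes: Vec<u8> = offsets.iter().flat_map(|o| o.to_ne_bytes()).collect();
    // A Vec<u8> is not guaranteed i32-aligned, so handle both outcomes.
    if let Some(s) = as_i32_slice(&bytes) {
        assert_eq!(s, &[0, 2, 5]);
    }
    // A single byte can never form a whole, aligned i32.
    assert!(as_i32_slice(&[0u8]).is_none());
}
```

The `unsafe` in `align_to` is only about the type reinterpretation; for plain integers like `i32` every bit pattern is valid, so the alignment check is the only real hazard.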
@jhorstmann, I agree with you. I am working on this as a separate PR, but it is not forgotten.
I merged this and marked all associated issues as resolved. We can now FFI to and from C++ and pyarrow 🎉
Wahoo! This is a great milestone! You could announce it on the ML.
… types and utf8

This PR is a proposal to add support to the [C data interface](https://arrow.apache.org/docs/format/CDataInterface.html) by implementing the necessary functionality to both consume and produce structs with its ABI and lifetime rules. This is for now limited to primitive types and strings (utf8), but it is easily generalized for all types whose data is encapsulated in `ArrayData` (things with buffers and child data).

Some design choices:

* Import and export do not care about the type of the data that is in memory (previously `BufferData`, now `Bytes`) - they only care about how it should be converted between `ArrayData` and the C data interface.
* Import wraps incoming pointers in a struct behind an `Arc`, so that we thread-safely refcount them and can share them between buffers, arrays, etc.
* `export` places `Buffer`s in `private_data` for bookkeeping and releases them when the consumer releases the struct via `release`.

I do not expect this PR to be easy to review, as it touches sensitive (aka `unsafe`) code. However, based on the tests I did so far, I am sufficiently happy to PR it.

This PR has three main parts:

1. Addition of an `ffi` module that contains the import and export functionality
2. Some helpers to import and export an Array from the C Data Interface
3. A crate to test this against Python/C++'s API

It also does a small refactor of `BufferData`, renaming it to `Bytes` (motivated by the popular `bytes` crate) and moving it to a separate file.

What is tested:

* Round-trip `Python -> Rust -> Python` (new separate crate, `arrow-c-integration`)
* Round-trip `Rust -> Python -> Rust` (new separate crate, `arrow-c-integration`)
* Round-trip `Rust -> Rust -> Rust`
* Memory allocation counts

Finally, this PR has a large contribution from @pitrou, who took _a lot_ of his time to explain to me how the C++ side does it and the main things that I had to worry about here.

Closes apache#8401 from jorgecarleitao/arrow-c-inte

Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
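For reference, the struct crossing the FFI boundary can be mirrored in `#[repr(C)]` Rust following the `ArrowArray` definition in the C data interface specification (field names and order per the spec; arrow-rs's `ffi` module defines its own equivalent internally, so this is an illustrative sketch, not the crate's type):

```rust
use std::os::raw::c_void;

// `#[repr(C)]` mirror of `struct ArrowArray` from the Arrow C data
// interface specification.
#[repr(C)]
pub struct FFI_ArrowArray {
    pub length: i64,
    pub null_count: i64,
    pub offset: i64,
    pub n_buffers: i64,
    pub n_children: i64,
    pub buffers: *mut *const c_void,
    pub children: *mut *mut FFI_ArrowArray,
    pub dictionary: *mut FFI_ArrowArray,
    // Called by the consumer when it is done with the data; the producer
    // frees whatever bookkeeping it stashed in `private_data` here.
    pub release: Option<unsafe extern "C" fn(arg: *mut FFI_ArrowArray)>,
    pub private_data: *mut c_void,
}

fn main() {
    // Layout sanity check: 5 i64 fields plus 5 pointer-sized fields,
    // with no padding in between on common targets.
    assert_eq!(
        std::mem::size_of::<FFI_ArrowArray>(),
        5 * 8 + 5 * std::mem::size_of::<usize>()
    );
}
```

The lifetime rules described above hang off `release` and `private_data`: the exporter keeps its `Buffer`s alive in `private_data`, and the importer's `Arc` wrapper calls `release` exactly once when the last reference drops.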