-
Notifications
You must be signed in to change notification settings - Fork 819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
IPC format support for StringViewArray and BinaryViewArray #5525
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XiangpengHao -- this is looking great
I left a bunch of minor comments -- but the only thing that is missing from making this mergable in my opinion are some round trip tests (to ensure that we can write these arrays to an IPC file/stream and then read them back and get the same result)
I think we can use the same roundtrip
pattern here:
arrow-rs/arrow-ipc/src/reader.rs
Lines 1571 to 1604 in 72854c4
#[test] | |
fn test_roundtrip_stream_run_array_sliced() { | |
let run_array_1: Int32RunArray = vec!["a", "a", "a", "b", "b", "c", "c", "c"] | |
.into_iter() | |
.collect(); | |
let run_array_1_sliced = run_array_1.slice(2, 5); | |
let run_array_2_inupt = vec![Some(1_i32), None, None, Some(2), Some(2)]; | |
let mut run_array_2_builder = PrimitiveRunBuilder::<Int16Type, Int32Type>::new(); | |
run_array_2_builder.extend(run_array_2_inupt); | |
let run_array_2 = run_array_2_builder.finish(); | |
let schema = Arc::new(Schema::new(vec![ | |
Field::new( | |
"run_array_1_sliced", | |
run_array_1_sliced.data_type().clone(), | |
false, | |
), | |
Field::new("run_array_2", run_array_2.data_type().clone(), false), | |
])); | |
let input_batch = RecordBatch::try_new( | |
schema, | |
vec![Arc::new(run_array_1_sliced.clone()), Arc::new(run_array_2)], | |
) | |
.unwrap(); | |
let output_batch = roundtrip_ipc_stream(&input_batch); | |
// As partial comparison not yet supported for run arrays, the sliced run array | |
// has to be unsliced before comparing with the output. the second run array | |
// can be compared as such. | |
assert_eq!(input_batch.column(1), output_batch.column(1)); | |
let run_array_1_unsliced = unslice_run_array(run_array_1_sliced.into_data()).unwrap(); |
The cases to cover are:
- Basic BinaryView / Utf8View
- Sliced BinaryView / Utf8View
- Nested BinaryView/Utf8View in Dictionary/Struct/List (to cover the code in
set_variadic_buffer_counts
)
Currently implementation ignores the offset of Binary/Utf8View Array's offsets, meaning that the entire buffers will be serialize to the IPC buffer. This might write more data than necessary. Slicing and writing a view array to IPC buffer is non-trivial and was left as future work.
I think the IPC serializer should just serialize the raw arrays as given and not try to optimize anything. If users wants to "compact" the arrays prior to sending them over IPC I think it should be an explicit choice and they can do it via the gc
API suggested in #5513
In #5506, @alamb mentioned (3) the integration tests. I'm not entirely sure how to do this. Should we generate some arrow data and commit to the testing repository and then add more tests to the arrow-rs/arrow-integration-testing/tests /ipc_reader.rs?
Maybe @bkietz knows if we have added StringViewArrays to the integration test suite already. I did not see any commits in https://github.com/apache/arrow-testing/commits/master that have such files.
If we don't have such files, I think we should add them / work with the other language teams to add them for compatibility as a follow on task. I can file tickets to track this
Thank you @ariesdevil and @viirya for the revies
arrow-ipc/src/writer.rs
Outdated
// The spec is not clear on whether the view/null buffer should be included in the variadic buffer count. | ||
// But from C++ impl https://github.com/apache/arrow/blob/b448b33808f2dd42866195fa4bb44198e2fc26b9/cpp/src/arrow/ipc/writer.cc#L477 | ||
// we know they are not included. | ||
counts.push(array.to_data().buffers().len() as i64 - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
counts.push(array.to_data().buffers().len() as i64 - 1); | |
counts.push(array.data_buffers().len() as i64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data_buffers()
is only available when the array is casted down to GenericByteViewArray
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this pattern, it must be a GenericByteViewArray
, so using data_buffers()
here is right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but it's quite verbose to first cast each type in to BinaryView
or Utf8View
and then call data_buffers()
Had a brief look and I like where this is headed. I left some comments, but other than those already suggested by others, I wonder if we could integrate the variadicBuffer collection into the existing logic to traverse the nested types. This would be quicker, simpler and probably easier to maintain |
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Raphael Taylor-Davies <1781103+tustvold@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is left before we can merge this PR?
It seems like the two remaining items are:
- Roundtrip tests (just in this repo, as described in IPC format support for StringViewArray and BinaryViewArray #5525 (review))
- Maybe a follow up ticket to track adding archery testing (e.g. IPC format support for StringViewArray and BinaryViewArray #5525 (comment))
Is there anything else?
I think those are the two major todos. Sorry I got quite busy these days, will try to address them in a few days. |
No worries! I think we could merge this PR with just the first (round trip tests) and then do the integration test in a follow on PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @XiangpengHao -- I think this is really close.
I think the tests need a few tweaks and fix the CI but then this will be good to go.
Thank you again so much 🙏
Updated the tests! I believe the CI failure is not related to this pr here.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it looks good to me -- thank you @XiangpengHao 🙏
We can keep iterating in subsequent PRs I think
@@ -1247,6 +1291,22 @@ fn write_array_data( | |||
compression_codec, | |||
)?; | |||
} | |||
} else if matches!(data_type, DataType::BinaryView | DataType::Utf8View) { | |||
// Slicing the views buffer is safe and easy, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Update: the CI fails on main. as well, filed #5565 -- I'll try and look at it in a day or two if no one beats me to it |
I took the liberty of merging up from master to this branch to hopefully get a clean CI run |
Thanks @alamb the CI passed! |
Thanks again @XiangpengHao -- this is a very nice step forward |
Which issue does this PR close?
Part of #5506.
Rationale for this change
Added necessary changes to handle the BinaryView and Utf8View in IPC reader/writer.
What changes are included in this PR?
The changes are slightly larger than expected because the BinaryView and Utf8View has variadicBufferCounts which no other types had before.
Currently implementation ignores the offset of Binary/Utf8View Array's offsets, meaning that the entire buffers will be serialize to the IPC buffer. This might write more data than necessary. Slicing and writing a view array to IPC buffer is non-trivial and was left as future work.
In #5506, @alamb mentioned (3) the integration tests. I'm not entirely sure how to do this. Should we generate some arrow data and commit to the testing repository and then add more tests to the
arrow-rs/arrow-integration-testing/tests /ipc_reader.rs
?Are there any user-facing changes?
I just learned the arrow memory layout today, so I expect quite a lot of corner cases I didn't handle, please feel free to comment as you see anything non-intuitive :-)