
Alternative implementation of nullif kernel by slicing nested buffers #1499

Closed
wants to merge 6 commits

Conversation

jhorstmann
Contributor

Which issue does this PR close?

Closes #510.

Rationale for this change

This is an alternative implementation (to #521) of the nullif kernel that works by slicing the buffers of the array depending on the data type. For most data types this can be done without copying data. The exceptions are boolean arrays (when the offset is not a multiple of 8) and list arrays, because some code assumes that offsets start at zero.

There are several TODOs still left in the code to verify the correct slicing logic for some data types. It also seems a bit strange that this slicing logic is only needed for a single kernel, and I don't like that we still need to copy for some data types.
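
To illustrate the list-offset exception mentioned above, here is a small standalone sketch (not the PR's code; the helper name is made up) of rebasing i32 list offsets so that a sliced array's offsets start at 0 again. The subtraction is why a copy of the offsets buffer cannot be avoided with the current layout:

// Standalone illustration, not the PR's implementation: given the value
// offsets of a list array and a slice [offset, offset + len) of its elements,
// produce new offsets that start at 0.
fn rebase_list_offsets(offsets: &[i32], offset: usize, len: usize) -> Vec<i32> {
    // A list array with `len` elements has `len + 1` offsets.
    let window = &offsets[offset..offset + len + 1];
    let first = window[0];
    window.iter().map(|o| o - first).collect()
}

fn main() {
    // Offsets of the list array [[0], [1, 2], [], [3, 4, 5]].
    let offsets = [0, 1, 3, 3, 6];
    // Slicing off the first list: the remaining offsets must be rebased to 0.
    assert_eq!(rebase_list_offsets(&offsets, 1, 3), vec![0, 2, 2, 5]);
}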

A better design would require some larger refactorings of the current data model:

  • Remove the offset from ArrayData so that all slicing has to be pushed down into buffers
  • Add an offset field to Bitmap so that bitmaps can be sliced without copying (see the sketch after this list)
  • Also use the Bitmap as the data holder for boolean arrays
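
To make the second and third points concrete, here is a minimal sketch of a bitmap that carries its own bit offset, so slicing only adjusts the offset and never copies the underlying bytes. All names and fields are hypothetical illustrations of the idea, not the crate's API:

// Hypothetical sketch of the proposed design: a bitmap with its own bit
// offset, so that slicing is zero-copy.
use std::sync::Arc;

#[derive(Clone)]
struct Bitmap {
    bytes: Arc<[u8]>,
    bit_offset: usize, // offset into `bytes`, in bits
    bit_len: usize,    // length, in bits
}

impl Bitmap {
    // Zero-copy slice: only the offset and length change.
    fn slice(&self, offset: usize, len: usize) -> Bitmap {
        assert!(offset + len <= self.bit_len);
        Bitmap {
            bytes: self.bytes.clone(),
            bit_offset: self.bit_offset + offset,
            bit_len: len,
        }
    }

    // Read bit `i` relative to the bitmap's own offset (LSB-first, as in Arrow).
    fn is_set(&self, i: usize) -> bool {
        let bit = self.bit_offset + i;
        (self.bytes[bit / 8] >> (bit % 8)) & 1 == 1
    }
}

fn main() {
    let bitmap = Bitmap { bytes: vec![0b1010_0110u8].into(), bit_offset: 0, bit_len: 8 };
    let sliced = bitmap.slice(2, 4);
    let bits: Vec<bool> = (0..4).map(|i| sliced.is_set(i)).collect();
    assert_eq!(bits, vec![true, false, false, true]);
}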

What changes are included in this PR?

I added the kernel as a separate function to make the diff easier to read; if we decide to merge this, it should replace the existing nullif kernel.

The API changes because the nullif kernel now takes an ArrayRef instead of a PrimitiveArray.

@github-actions bot added the arrow (Changes to the arrow crate) label on Mar 29, 2022
@jhorstmann
Contributor Author

@bjchambers @alamb This is the alternative implementation I mentioned, thanks for reminding me about this topic.

@codecov-commenter

codecov-commenter commented Mar 29, 2022

Codecov Report

Merging #1499 (cbbfa77) into master (c5442cf) will increase coverage by 0.03%.
The diff coverage is 83.08%.

@@            Coverage Diff             @@
##           master    #1499      +/-   ##
==========================================
+ Coverage   82.68%   82.72%   +0.03%     
==========================================
  Files         188      189       +1     
  Lines       54361    54701     +340     
==========================================
+ Hits        44951    45253     +302     
- Misses       9410     9448      +38     
Impacted Files Coverage Δ
arrow/src/array/array_map.rs 81.81% <ø> (+0.31%) ⬆️
arrow/src/compute/kernels/boolean.rs 93.56% <80.45%> (-3.24%) ⬇️
arrow/src/array/array_list.rs 95.58% <100.00%> (+0.05%) ⬆️
arrow/src/compute/util.rs 98.90% <100.00%> (ø)
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.46%) ⬇️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/compute/kernels/filter.rs 88.00% <0.00%> (ø)
arrow/src/compute/kernels/length.rs 100.00% <0.00%> (ø)
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5442cf...cbbfa77.

/// The only buffers that need an actual copy are booleans (if they are not byte-aligned)
/// and list/binary/string offsets because the arrow implementation requires them to start at 0.
/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
fn slice_buffers(
Contributor

I am sorry for my ignorance here -- when implementing nullif, why don't we simply manipulate the bitmask and create a new ArrayData that is otherwise the same? As in, why do we want to create new buffers that start at offset zero?

let new_validity = compute::and(self.validity, condition);

let new_data = self
  .data()
  .clone()
  .replace_validity(new_validity)

With apologies for making up replace_validity, which doesn't really exist

Contributor Author

Ideally we would be able to do exactly that. The problem in the current code is that the offset in ArrayData applies to both the data and validity buffers. If the offset in the previous ArrayData is larger than zero, then we would either have to pad the start of the validity bitmap by the same number of bits (which was the approach tried in #510), or also slice the data buffers so we can access them with an offset of zero, the same as the newly created validity.

I don't fully like either of these approaches. A better alternative would require quite a bit of refactoring and API changes: removing offset from ArrayData and instead pushing it into Buffer (for primitive types) and Bitmap (for boolean and validity). I want to look into how much effort such a refactoring would be, but I'm not sure when I'll find time for it.
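
For illustration only (this is not the code from either PR, and the function name is made up), padding a freshly computed validity bitmap so it lines up with an existing non-zero offset could look roughly like this; the bit-by-bit copy is exactly the cost both approaches are trying to avoid:

// Illustration only: prepend `offset_bits` unused bits in front of a validity
// bitmap so it can be read with the same non-zero offset as the data buffers.
// The padding bits are never read, but producing them still copies every bit.
fn pad_validity(validity: &[u8], len_bits: usize, offset_bits: usize) -> Vec<u8> {
    let total_bits = offset_bits + len_bits;
    let mut padded = vec![0u8; (total_bits + 7) / 8];
    for i in 0..len_bits {
        if (validity[i / 8] >> (i % 8)) & 1 == 1 {
            let dst = offset_bits + i;
            padded[dst / 8] |= 1 << (dst % 8);
        }
    }
    padded
}

fn main() {
    // Validity bits 1, 0, 1, 1 padded by an offset of 3 bits.
    assert_eq!(pad_validity(&[0b0000_1101], 4, 3), vec![0b0110_1000]);
}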

Contributor

The problem in the current code is that the offset in ArrayData applies to both the data and validity buffers.

Got it. Thank you for the explanation

A better alternative would require quite some refactoring and api changes by removing offset from ArrayData and instead pushing it into Buffer

I agree this would be a better approach (possibly also with something like #1474)

}
}
DataType::Map(_, _) => {
// TODO: verify this is actually correct
Contributor Author

This is probably not correct: since Map has offsets like a list, only the offsets buffer should need to be sliced.

@alamb
Contributor

alamb commented Apr 5, 2022

@bjchambers do you have the time / inclination to review this PR? There are now ~1.5 weeks until I cut the next arrow-rs release candidate -- it would be cool to include this too

@bjchambers
Contributor

@bjchambers do you have the time / inclination to review this PR? There are now ~1.5 weeks until I cut the next arrow-rs release candidate -- it would be cool to include this too

I can take a quick look, but my review would likely be looking at the tests / etc. I'm not familiar enough with the more complex types (List, Map, etc.) to review the code in detail.

offset_buffer.into()
}

pub fn nullif_alternative(
Contributor

Could also have this one be generic_nullif (or something like that) if we wanted to keep them around. Does it behave differently than the primitive version on primitives? If so, that could be a reason to consider keeping both.

) -> Result<ArrayRef> {
if array.len() != condition.len() {
return Err(ArrowError::ComputeError(
"Inputs to conditional kernels must have the same length".to_string(),
Contributor

Maybe change "conditional kernels" to "nullif_alternative" to make it clearer what went wrong?

}
DataType::FixedSizeList(_, _size) => {
// TODO: should this actually slice the child arrays?
vec![]
Contributor

Here and below -- does this mean it doesn't support fixed size lists, structs, etc.?

If yes -- should this instead return an error (or panic) rather than returning the empty list?
If no -- should there be added tests for these cases?

Contributor Author

Sorry, I know this logic is a bit confusing. For most types we only need to slice the directly contained buffers, so returning these new buffers is basically the happy path here. Struct or FixedSizeList layouts don't directly contain buffers; instead the data is stored in child arrays, which need to be sliced.
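
To make the FixedSizeList part of this concrete (my own assumption about what the slicing should do, matching the TODO above rather than resolving it authoritatively): the values live in a single child array, so slicing the parent by (offset, len) corresponds to slicing that child by (offset * value_length, len * value_length). A tiny sketch of just the index arithmetic:

// Sketch of the index arithmetic only: for a FixedSizeList with
// `value_length` values per element, the child-array range corresponding to
// parent elements [offset, offset + len).
fn fixed_size_list_child_range(offset: usize, len: usize, value_length: usize) -> (usize, usize) {
    (offset * value_length, len * value_length)
}

fn main() {
    // A FixedSizeList<i32, 3> sliced to elements 2..5 covers child values 6..15.
    assert_eq!(fixed_size_list_child_range(2, 3, 3), (6, 9));
}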

condition.len(),
);

let (result_data_buffers, result_child_data) = slice_buffers(
Contributor

I wonder if this slice_buffers method should instead be a method on the array (given everything comes from that). And rather than "slice" (which has an existing meaning), I wonder if it could be something like "trim", to indicate that it "trims" the buffer to start at 0 (rather than at the offset). As your comment identifies, I could see this being useful with other kernels that expect their inputs to start at 0 (because it has created a new buffer).

Contributor

Perhaps slice() belongs on ArrayData (which has both the buffers and datatypes)

))
}

pub(super) fn combine_option_buffers(
Contributor

May be worth some comments on what this combination does?

/// The result is null if either of the buffers are null.
/// The resulting buffer is at offset `0`.
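
Something along these lines, as a standalone sketch of the semantics described in the suggested comment (the name and signature here are made up, and the sketch assumes both inputs already start at bit offset 0, which the real function also has to handle for non-zero offsets):

// Standalone sketch: AND two optional validity bitmaps over `len` bits.
// `None` means "all valid". An element of the result is null if it is null
// in either input, and the resulting bitmap starts at offset 0.
fn combine_option_bitmaps(left: Option<&[u8]>, right: Option<&[u8]>, len: usize) -> Option<Vec<u8>> {
    let num_bytes = (len + 7) / 8;
    match (left, right) {
        (None, None) => None,
        (Some(b), None) | (None, Some(b)) => Some(b[..num_bytes].to_vec()),
        (Some(l), Some(r)) => Some(
            l[..num_bytes]
                .iter()
                .zip(&r[..num_bytes])
                .map(|(a, b)| a & b)
                .collect(),
        ),
    }
}

fn main() {
    let left = [0b0000_1111u8];
    let right = [0b0000_0101u8];
    assert_eq!(
        combine_option_bitmaps(Some(&left[..]), Some(&right[..]), 8),
        Some(vec![0b0000_0101])
    );
    assert_eq!(combine_option_bitmaps(None, None, 8), None);
}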

}

#[test]
fn test_nullif_struct_sliced() {
Contributor

For many of these -- it looks like we test for the case of the data array being sliced, but not the boolean array being sliced. Should we add some additional tests for that?

Also note -- I've seen problems with specific values of slicing. Not sure if it's worth considering something like proptests for this?
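
As a sketch of what such a property test could look like with the proptest crate (purely illustrative; it does not call the kernel, it only checks a trivial slicing invariant on a plain vector to show the shape of a test that exercises many random offsets and lengths instead of a few hand-picked ones):

// Sketch of a property test over many random slice offsets and lengths.
use proptest::prelude::*;

proptest! {
    #[test]
    fn sliced_values_match_offset_arithmetic(
        values in prop::collection::vec(any::<i32>(), 1..100usize),
        offset in 0usize..100,
        len in 0usize..100,
    ) {
        // Clamp to the available range, mimicking arbitrary slice positions.
        let offset = offset % values.len();
        let len = len.min(values.len() - offset);
        let sliced = &values[offset..offset + len];
        for i in 0..len {
            prop_assert_eq!(sliced[i], values[offset + i]);
        }
    }
}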

Contributor

@alamb alamb left a comment

I took a high level look at this PR and it looks reasonable (and well tested!) to me. Thank you @jhorstmann!

I also tested this branch (locally) using the forced validation check from #1546

cargo test --features=force_validate -p arrow

And all checks passed 👍

So I wonder what the plan is for this PR (I am preparing for the arrow 12 release at the end of this week)? Shall we remove the old nullif implementation and replace it with this one?


@jhorstmann
Contributor Author

@alamb I'm also not sure what the best way forward is. It is a lot of code just to support one kernel, and it changes quite a bit how we work with Buffers. And it still does not completely avoid copying for boolean types. If these tradeoffs are OK, there are also some TODOs still left:

  • Map types are not yet supported
  • The previous PR (#521, Change nullif to support arbitrary arrays) has a few more test cases that could be integrated
  • The function needs to be renamed; the new name was only chosen to make the diff easier to read

Longer term I'd like to experiment with removing the offset from ArrayData and see if slicing could be implemented in general with a method like this. For the validity, an offset could be introduced in Bitmap so it can be sliced without copying. The open problem is then the BooleanArray, which would ideally need to contain a Bitmap instead of a Buffer for its data, so it could be sliced zero-copy. But there is no way to store this data bitmap in ArrayData.

@alamb
Contributor

alamb commented Apr 12, 2022

Longer term I'd like to experiment with removing the offset from ArrayData and see if slicing could be implemented in general with a method like this. For the validity, an offset could be introduced in Bitmap so it can be sliced without copying. The open problem is then the BooleanArray, which would ideally need to contain a Bitmap instead of a Buffer for its data, so it could be sliced zero-copy. But there is no way to store this data bitmap in ArrayData.

I think removing the explicit offset in the ArrayData would make a significant improvement.

As you mention, however, BooleanArray / Bitmaps would end up being copied without more thought

@tustvold
Contributor

One potential option to handle BooleanArray might be for Buffer to store the offset in terms of elements, as opposed to bytes. This would make it behave similarly to the current offset in ArrayData. We might even be able to combine this with a move to some sort of typed buffer implementation. The downside would be even more code churn...
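
A rough sketch of that idea (types and names are hypothetical, not a proposal for the actual API): a typed buffer whose offset is counted in elements, so slicing works the same way for every element width:

// Hypothetical sketch: a typed buffer that stores its offset in elements
// rather than bytes, so slicing is zero-copy regardless of the element width.
use std::sync::Arc;

struct TypedBuffer<T> {
    data: Arc<[T]>,
    offset: usize, // in elements, not bytes
    len: usize,    // in elements
}

impl<T> TypedBuffer<T> {
    fn slice(&self, offset: usize, len: usize) -> TypedBuffer<T> {
        assert!(offset + len <= self.len);
        TypedBuffer {
            data: self.data.clone(),
            offset: self.offset + offset,
            len,
        }
    }

    fn values(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let buffer = TypedBuffer { data: vec![1i32, 2, 3, 4, 5].into(), offset: 0, len: 5 };
    let sliced = buffer.slice(1, 3);
    assert_eq!(sliced.values(), &[2, 3, 4]);
}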

@tustvold
Contributor

tustvold commented May 6, 2022

I'm converting this to a draft as it appears to have stalled; feel free to unmark it if I am mistaken

@tustvold
Contributor

Another random thought would be to store a bit_offset on Buffer instead of a byte offset 🤔 I think this would allow removing the offset from ArrayData, and a max offset of 2^56 bytes is likely sufficient for any sane workload 😆

@alamb
Contributor

alamb commented May 23, 2022

Given the number of bugs associated with offsets in general, I really like the idea of switching to Bytes internally (which already handles offsets) and then handling bit_offsets specially for bitmasks

@tustvold
Contributor

Closing as #2940 has been merged

@tustvold tustvold closed this Nov 10, 2022