
Alternative implementation of nullif kernel by slicing nested buffers #1499

Closed
wants to merge 6 commits

Conversation

jhorstmann
Contributor

Which issue does this PR close?

Closes #510.

Rationale for this change

This is an alternative implementation (to #521) of the nullif kernel that works by slicing the buffers of the array depending on the data type. For most data types this can be done without copying data. The exceptions are boolean arrays (when the offset is not a multiple of 8) and list arrays, because some code assumes that offsets start at zero.

There are several TODOs still left in the code to verify the correct slicing logic for some data types. It also seems a bit strange that this slicing logic is only needed for a single kernel, and I don't like that we still need to copy for some data types.
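
To illustrate the list-offset exception mentioned above, here is a small standalone sketch (not the PR's code; the helper name is made up) of rebasing i32 list offsets so that a sliced array's offsets start at 0 again. The subtraction is why a copy of the offsets buffer cannot be avoided with the current layout:

// Standalone illustration, not the PR's implementation: given the value
// offsets of a list array and a slice [offset, offset + len) of its elements,
// produce new offsets that start at 0.
fn rebase_list_offsets(offsets: &[i32], offset: usize, len: usize) -> Vec<i32> {
    // A list array with `len` elements has `len + 1` offsets.
    let window = &offsets[offset..offset + len + 1];
    let first = window[0];
    window.iter().map(|o| o - first).collect()
}

fn main() {
    // Offsets of the list array [[0], [1, 2], [], [3, 4, 5]].
    let offsets = [0, 1, 3, 3, 6];
    // Slicing off the first list: the remaining offsets must be rebased to 0.
    assert_eq!(rebase_list_offsets(&offsets, 1, 3), vec![0, 2, 2, 5]);
}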

A better design would require some larger refactorings of the current data model:

  • Remove the offset from ArrayData so that all slicing has to be pushed down into buffers
  • Add an offset field to Bitmap so that bitmaps can be sliced without copying (see the sketch after this list)
  • Also use the Bitmap as the data holder for boolean arrays
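
To make the second and third points concrete, here is a minimal sketch of a bitmap that carries its own bit offset, so slicing only adjusts the offset and never copies the underlying bytes. All names and fields are hypothetical illustrations of the idea, not the crate's API:

// Hypothetical sketch of the proposed design: a bitmap with its own bit
// offset, so that slicing is zero-copy.
use std::sync::Arc;

#[derive(Clone)]
struct Bitmap {
    bytes: Arc<[u8]>,
    bit_offset: usize, // offset into `bytes`, in bits
    bit_len: usize,    // length, in bits
}

impl Bitmap {
    // Zero-copy slice: only the offset and length change.
    fn slice(&self, offset: usize, len: usize) -> Bitmap {
        assert!(offset + len <= self.bit_len);
        Bitmap {
            bytes: self.bytes.clone(),
            bit_offset: self.bit_offset + offset,
            bit_len: len,
        }
    }

    // Read bit `i` relative to the bitmap's own offset (LSB-first, as in Arrow).
    fn is_set(&self, i: usize) -> bool {
        let bit = self.bit_offset + i;
        (self.bytes[bit / 8] >> (bit % 8)) & 1 == 1
    }
}

fn main() {
    let bitmap = Bitmap { bytes: vec![0b1010_0110u8].into(), bit_offset: 0, bit_len: 8 };
    let sliced = bitmap.slice(2, 4);
    let bits: Vec<bool> = (0..4).map(|i| sliced.is_set(i)).collect();
    assert_eq!(bits, vec![true, false, false, true]);
}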

What changes are included in this PR?

I added the kernel as a separate function to make the diff easier to read; if we decide to merge this, it should replace the existing nullif kernel.

The API changes because the nullif kernel now takes an ArrayRef instead of a PrimitiveArray.

@github-actions bot added the arrow (Changes to the arrow crate) label on Mar 29, 2022
@jhorstmann
Contributor Author

@bjchambers @alamb This is the alternative implementation I mentioned, thanks for reminding me about this topic.

@codecov-commenter

codecov-commenter commented Mar 29, 2022

Codecov Report

Merging #1499 (cbbfa77) into master (c5442cf) will increase coverage by 0.03%.
The diff coverage is 83.08%.

@@            Coverage Diff             @@
##           master    #1499      +/-   ##
==========================================
+ Coverage   82.68%   82.72%   +0.03%     
==========================================
  Files         188      189       +1     
  Lines       54361    54701     +340     
==========================================
+ Hits        44951    45253     +302     
- Misses       9410     9448      +38     
Impacted Files Coverage Δ
arrow/src/array/array_map.rs 81.81% <ø> (+0.31%) ⬆️
arrow/src/compute/kernels/boolean.rs 93.56% <80.45%> (-3.24%) ⬇️
arrow/src/array/array_list.rs 95.58% <100.00%> (+0.05%) ⬆️
arrow/src/compute/util.rs 98.90% <100.00%> (ø)
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.46%) ⬇️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/compute/kernels/filter.rs 88.00% <0.00%> (ø)
arrow/src/compute/kernels/length.rs 100.00% <0.00%> (ø)
... and 10 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5442cf...cbbfa77.

/// The only buffers that need an actual copy are booleans (if they are not byte-aligned)
/// and list/binary/string offsets because the arrow implementation requires them to start at 0.
/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
fn slice_buffers(
Contributor

I am sorry for my ignorance here -- when implementing nullif, why don't we simply manipulate the bitmask and create a new ArrayData that is otherwise the same? As in, why do we want to create new buffers that start at offset zero?

let new_validity = compute::and(self.validity, condition);

let new_data = self
  .data()
  .clone()
  .replace_validity(new_validity)

With apologies for making up replace_validity, which doesn't really exist

Contributor Author

Ideally we would be able to do exactly that. The problem in the current code is that the offset in ArrayData applies to both the data and validity buffers. If the offset in the previous ArrayData is larger than zero, then we would either have to pad the start of the validity bitmap by the same number of bits (which was the approach tried in #510), or also slice the data buffers so we can access them with an offset of zero, the same as the newly created validity.

I don't fully like either of these approaches. A better alternative would require quite a bit of refactoring and API changes: removing offset from ArrayData and instead pushing it into Buffer (for primitive types) and Bitmap (for boolean and validity). I want to look into how much effort such a refactoring would be, but I'm not sure when I'll find time for it.
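
For illustration only (this is not the code from either PR, and the function name is made up), padding a freshly computed validity bitmap so it lines up with an existing non-zero offset could look roughly like this; the bit-by-bit copy is exactly the cost both approaches are trying to avoid:

// Illustration only: prepend `offset_bits` unused bits in front of a validity
// bitmap so it can be read with the same non-zero offset as the data buffers.
// The padding bits are never read, but producing them still copies every bit.
fn pad_validity(validity: &[u8], len_bits: usize, offset_bits: usize) -> Vec<u8> {
    let total_bits = offset_bits + len_bits;
    let mut padded = vec![0u8; (total_bits + 7) / 8];
    for i in 0..len_bits {
        if (validity[i / 8] >> (i % 8)) & 1 == 1 {
            let dst = offset_bits + i;
            padded[dst / 8] |= 1 << (dst % 8);
        }
    }
    padded
}

fn main() {
    // Validity bits 1, 0, 1, 1 padded by an offset of 3 bits.
    assert_eq!(pad_validity(&[0b0000_1101], 4, 3), vec![0b0110_1000]);
}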

Contributor

The problem in the current code is that the offset in ArrayData applies to both the data and validity buffers.

Got it. Thank you for the explanation

A better alternative would require quite some refactoring and api changes by removing offset from ArrayData and instead pushing it into Buffer

I agree this would be a better approach (possibly also with something like #1474)

}
}
DataType::Map(_, _) => {
// TODO: verify this is actually correct
Contributor Author

This is probably not correct: since Map has offsets like a list, only the offsets buffer should need to be sliced.

@alamb
Contributor

alamb commented Apr 5, 2022

@bjchambers do you have the time / inclination to review this PR? There are now ~1.5 weeks until I cut the next arrow-rs release candidate -- it would be cool to include this too

@bjchambers
Contributor

@bjchambers do you have the time / inclination to review this PR? There are now ~1.5 weeks until I cut the next arrow-rs release candidate -- it would be cool to include this too

I can take a quick look, but my review would likely be looking at the tests / etc. I'm not familiar enough with the more complex types (List, Map, etc.) to review the code in detail.

offset_buffer.into()
}

pub fn nullif_alternative(
Contributor

Could also have this one be generic_nullif (or something like that) if we wanted to keep them around. Does it behave differently than the primitive version on primitives? If so, that could be a reason to consider keeping both.

) -> Result<ArrayRef> {
if array.len() != condition.len() {
return Err(ArrowError::ComputeError(
"Inputs to conditional kernels must have the same length".to_string(),
Contributor

Maybe change "conditional kernels" to "nullif_alternative" to make it clearer what went wrong?

}
DataType::FixedSizeList(_, _size) => {
// TODO: should this actually slice the child arrays?
vec![]
Contributor

Here and below -- does this mean it doesn't support fixed size lists, structs, etc.?

If yes -- should this instead return an error (or panic) rather than returning the empty list?
If no -- should there be added tests for these cases?

Contributor Author

Sorry, I know this logic is a bit confusing. For most types we only need to slice the directly contained buffers, so returning these new buffers is basically the happy path here. Struct or FixedSizeList layouts don't directly contain buffers; instead the data is stored in child arrays, which need to be sliced.
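
To make the FixedSizeList part of this concrete (my own assumption about what the slicing should do, matching the TODO above rather than resolving it authoritatively): the values live in a single child array, so slicing the parent by (offset, len) corresponds to slicing that child by (offset * value_length, len * value_length). A tiny sketch of just the index arithmetic:

// Sketch of the index arithmetic only: for a FixedSizeList with
// `value_length` values per element, the child-array range corresponding to
// parent elements [offset, offset + len).
fn fixed_size_list_child_range(offset: usize, len: usize, value_length: usize) -> (usize, usize) {
    (offset * value_length, len * value_length)
}

fn main() {
    // A FixedSizeList<i32, 3> sliced to elements 2..5 covers child values 6..15.
    assert_eq!(fixed_size_list_child_range(2, 3, 3), (6, 9));
}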

condition.len(),
);

let (result_data_buffers, result_child_data) = slice_buffers(
Contributor

I wonder if this slice_buffers method should instead be a method on the array (given everything comes from that). And rather than "slice" (which has an existing meaning), I wonder if it could be something like "trim", to indicate that it "trims" the buffer to start at 0 (rather than at the offset). As your comment identifies, I could see this being useful with other kernels that expect their inputs to start at 0 (because it has created a new buffer).

Contributor

Perhaps slice() belongs on ArrayData (which has both the buffers and datatypes)

))
}

pub(super) fn combine_option_buffers(
Contributor

May be worth some comments on what this combination does?

/// The result is null if either of the buffers are null.
/// The resulting buffer is at offset `0`.
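
Something along these lines, as a standalone sketch of the semantics described in the suggested comment (the name and signature here are made up, and the sketch assumes both inputs already start at bit offset 0, which the real function also has to handle for non-zero offsets):

// Standalone sketch: AND two optional validity bitmaps over `len` bits.
// `None` means "all valid". An element of the result is null if it is null
// in either input, and the resulting bitmap starts at offset 0.
fn combine_option_bitmaps(left: Option<&[u8]>, right: Option<&[u8]>, len: usize) -> Option<Vec<u8>> {
    let num_bytes = (len + 7) / 8;
    match (left, right) {
        (None, None) => None,
        (Some(b), None) | (None, Some(b)) => Some(b[..num_bytes].to_vec()),
        (Some(l), Some(r)) => Some(
            l[..num_bytes]
                .iter()
                .zip(&r[..num_bytes])
                .map(|(a, b)| a & b)
                .collect(),
        ),
    }
}

fn main() {
    let left = [0b0000_1111u8];
    let right = [0b0000_0101u8];
    assert_eq!(
        combine_option_bitmaps(Some(&left[..]), Some(&right[..]), 8),
        Some(vec![0b0000_0101])
    );
    assert_eq!(combine_option_bitmaps(None, None, 8), None);
}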

}

#[test]
fn test_nullif_struct_sliced() {
Contributor

For many of these -- it looks like we test for the case of the data array being sliced, but not the boolean array being sliced. Should we add some additional tests for that?

Also note -- I've seen problems with specific values of slicing. Not sure if it's worth considering something like proptests for this?
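
As a sketch of what such a property test could look like with the proptest crate (purely illustrative; it does not call the kernel, it only checks a trivial slicing invariant on a plain vector to show the shape of a test that exercises many random offsets and lengths instead of a few hand-picked ones):

// Sketch of a property test over many random slice offsets and lengths.
use proptest::prelude::*;

proptest! {
    #[test]
    fn sliced_values_match_offset_arithmetic(
        values in prop::collection::vec(any::<i32>(), 1..100usize),
        offset in 0usize..100,
        len in 0usize..100,
    ) {
        // Clamp to the available range, mimicking arbitrary slice positions.
        let offset = offset % values.len();
        let len = len.min(values.len() - offset);
        let sliced = &values[offset..offset + len];
        for i in 0..len {
            prop_assert_eq!(sliced[i], values[offset + i]);
        }
    }
}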

Contributor

@alamb alamb left a comment

I took a high level look at this PR and it looks reasonable (and well tested!) to me. Thank you @jhorstmann!

I also tested this branch (locally) using the forced validation check from #1546

cargo test --features=force_validate -p arrow

And all checks passed 👍

So I wonder what the plan is for this PR (I am preparing for the arrow 12 release at the end of this week)? Shall we remove the old nullif implementation and replace it with this one?


@jhorstmann
Contributor Author

@alamb I'm also not sure what the best way forward is. It is a lot of code just to support one kernel, and it changes quite a bit how we work with Buffers. And it still does not completely avoid copying for boolean types. If these tradeoffs are OK, there are also some TODOs still left:

  • Map types are not yet supported
  • The previous PR (#521, Change nullif to support arbitrary arrays) has a few more test cases that could be integrated
  • The function needs to be renamed; the new name was only chosen to make the diff easier to read

Longer term I'd like to experiment with removing the offset from ArrayData and see if slicing could be implemented in general with a method like this. For the validity, an offset could be introduced in Bitmap so it can be sliced without copying. The open problem is then the BooleanArray, which would ideally need to contain a Bitmap instead of a Buffer for its data, so it could be sliced zero-copy. But there is no way to store this data bitmap in ArrayData.

@alamb
Contributor

alamb commented Apr 12, 2022

Longer term I'd like to experiment with removing the offset from ArrayData and see if slicing could be implemented in general with a method like this. For the validity, an offset could be introduced in Bitmap so it can be sliced without copying. The open problem is then the BooleanArray, which would ideally need to contain a Bitmap instead of a Buffer for its data, so it could be sliced zero-copy. But there is no way to store this data bitmap in ArrayData.

I think removing the explicit offset in the ArrayData would make a significant improvement.

As you mention, however, BooleanArray / Bitmaps would end up being copied without more thought

@tustvold
Contributor

One potential option to handle BooleanArray might be for Buffer to store the offset in terms of elements, as opposed to bytes. This would make it behave similarly to the current offset in ArrayData. We might even be able to combine this with a move to some sort of typed buffer implementation. The downside would be even more code churn...
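
A rough sketch of that idea (types and names are hypothetical, not a proposal for the actual API): a typed buffer whose offset is counted in elements, so slicing works the same way for every element width:

// Hypothetical sketch: a typed buffer that stores its offset in elements
// rather than bytes, so slicing is zero-copy regardless of the element width.
use std::sync::Arc;

struct TypedBuffer<T> {
    data: Arc<[T]>,
    offset: usize, // in elements, not bytes
    len: usize,    // in elements
}

impl<T> TypedBuffer<T> {
    fn slice(&self, offset: usize, len: usize) -> TypedBuffer<T> {
        assert!(offset + len <= self.len);
        TypedBuffer {
            data: self.data.clone(),
            offset: self.offset + offset,
            len,
        }
    }

    fn values(&self) -> &[T] {
        &self.data[self.offset..self.offset + self.len]
    }
}

fn main() {
    let buffer = TypedBuffer { data: vec![1i32, 2, 3, 4, 5].into(), offset: 0, len: 5 };
    let sliced = buffer.slice(1, 3);
    assert_eq!(sliced.values(), &[2, 3, 4]);
}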

@tustvold
Contributor

tustvold commented May 6, 2022

I'm converting this to a draft as it appears to have stalled; feel free to unmark it if I am mistaken

@tustvold
Contributor

Another random thought would be to store a bit_offset on Buffer instead of a byte offset 🤔 I think this would allow removing the offset from ArrayData, and a max offset of 2^56 bytes is likely sufficient for any sane workload 😆

@alamb
Contributor

alamb commented May 23, 2022

Given the number of bugs associated with offsets in general, I really like the idea of switching to Bytes internally (which already handles offsets) and then handling bit_offsets specially for bitmasks

@tustvold
Contributor

Closing as #2940 has been merged

@tustvold tustvold closed this Nov 10, 2022