Alternative implementation of nullif kernel by slicing nested buffers #1499
Conversation
@bjchambers @alamb This is the alternative implementation I mentioned, thanks for reminding me about this topic.
Codecov Report
@@            Coverage Diff             @@
##           master    #1499      +/-   ##
==========================================
+ Coverage   82.68%   82.72%   +0.03%
==========================================
  Files         188      189       +1
  Lines       54361    54701     +340
==========================================
+ Hits        44951    45253     +302
- Misses       9410     9448      +38

Continue to review full report at Codecov.
/// The only buffers that need an actual copy are booleans (if they are not byte-aligned)
/// and list/binary/string offsets, because the arrow implementation requires them to start at 0.
/// This is useful when a kernel calculates a new validity bitmap but wants to reuse other buffers.
fn slice_buffers(
I am sorry for my ignorance here -- when implementing nullif,
why don't we simply manipulate the bitmask and create a new ArrayData
that is otherwise the same? As in, why do we want to create new buffer(s) that start at offset zero?
let new_validity = compute::and(self.validity(), condition);
let new_data = self
    .data()
    .clone()
    .replace_validity(new_validity);
With apologies for making up replace_validity,
which doesn't really exist
Ideally we would be able to do exactly that. The problem in the current code is that the offset
in ArrayData applies to both the data and validity buffers. If the offset in the previous ArrayData is larger than zero then we would either have to pad the start of the validity bitmap by the same number of bits (which was the approach tried in #510) or also slice the data buffers so we can access them with an offset of zero, same as the newly created validity.
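To make that tradeoff concrete, here is a minimal standalone sketch (not arrow-rs code; the helper name `byte_align_bits` is made up) of the copy that becomes unavoidable when a validity bitmap has to be re-based to bit offset 0 and the current offset is not byte-aligned:

```rust
// Hypothetical helper, not part of arrow-rs: copy `len` bits starting at
// `bit_offset` into a fresh buffer whose first valid bit is bit 0.
// When `bit_offset % 8 != 0` every bit has to shift position, so a plain
// byte-level slice of the existing buffer cannot work and a copy is required.
fn byte_align_bits(bits: &[u8], bit_offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        let src = bit_offset + i;
        // Arrow bitmaps are LSB-first within each byte
        if bits[src / 8] & (1 << (src % 8)) != 0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    let bits = [0b1010_1010u8, 0b1100_1100u8];
    // offset 8 is byte-aligned: the result is just the second byte
    assert_eq!(byte_align_bits(&bits, 8, 8), vec![0b1100_1100]);
    // offset 3 is not byte-aligned: every bit moves to a new position
    assert_eq!(byte_align_bits(&bits, 3, 8), vec![0b1001_0101]);
    println!("ok");
}
```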
I don't fully like either of these approaches. A better alternative would require quite some refactoring and API changes: removing offset from ArrayData and instead pushing it into Buffer (for primitive types) and Bitmap (for boolean and validity). I want to look into how much effort such a refactoring would be, but I'm not sure when I'll find time for it.
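As a rough illustration of that refactoring direction (all names here are assumptions, not existing arrow-rs API), a Bitmap carrying its own bit-level offset could be sliced without copying:

```rust
use std::sync::Arc;

// Sketch of the proposed direction, not actual arrow-rs types: the bit-level
// offset lives inside the Bitmap itself, so slicing only adjusts `bit_offset`
// and `len` while the underlying bytes stay shared.
#[derive(Clone)]
struct Bitmap {
    bytes: Arc<Vec<u8>>,
    bit_offset: usize,
    len: usize,
}

impl Bitmap {
    fn is_set(&self, i: usize) -> bool {
        assert!(i < self.len);
        let bit = self.bit_offset + i;
        // LSB-first bit order, as in Arrow
        self.bytes[bit / 8] & (1 << (bit % 8)) != 0
    }

    // Zero-copy slice: only the offset and length change
    fn slice(&self, offset: usize, len: usize) -> Bitmap {
        assert!(offset + len <= self.len);
        Bitmap {
            bytes: Arc::clone(&self.bytes),
            bit_offset: self.bit_offset + offset,
            len,
        }
    }
}

fn main() {
    let bm = Bitmap {
        bytes: Arc::new(vec![0b0000_1111]),
        bit_offset: 0,
        len: 8,
    };
    let s = bm.slice(2, 4);
    // bits 2 and 3 of the original are set, bits 4 and 5 are not
    assert!(s.is_set(0) && s.is_set(1) && !s.is_set(2) && !s.is_set(3));
    println!("ok");
}
```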
> The problem in the current code is that the offset in ArrayData applies to both the data and validity buffers.
Got it. Thank you for the explanation
> A better alternative would require quite some refactoring and api changes by removing offset from ArrayData and instead pushing it into Buffer
I agree this would be a better approach (maybe also with something like #1474)
    }
}
DataType::Map(_, _) => {
    // TODO: verify this is actually correct
This is probably not correct; since Map has offsets like a list, only the offsets buffer should need to be sliced.
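For list-like layouts (List, Binary, Utf8, and presumably Map), rewriting the offsets buffer means re-basing it so the slice starts at 0. A minimal sketch of that rebasing (`rebase_offsets` is a made-up name, not the PR's actual code):

```rust
// Hypothetical helper: rebase a list/map offsets buffer so the slice starts
// at 0. A slice of `len` elements starting at `start` needs `len + 1`
// offsets, each shifted down by `offsets[start]`; the child values array
// must correspondingly be sliced to start at element `offsets[start]`.
fn rebase_offsets(offsets: &[i32], start: usize, len: usize) -> Vec<i32> {
    let base = offsets[start];
    offsets[start..=start + len].iter().map(|o| o - base).collect()
}

fn main() {
    // lists [a], [b, c], [], [d, e, f] have offsets [0, 1, 3, 3, 6]
    let offsets = [0, 1, 3, 3, 6];
    // a slice of 2 lists starting at index 1 covers [b, c] and []
    assert_eq!(rebase_offsets(&offsets, 1, 2), vec![0, 2, 2]);
    println!("ok");
}
```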
@bjchambers do you have the time / inclination to review this PR? There are now about 1.5 weeks until I cut the next arrow-rs release candidate -- it would be cool to include this too.
I can take a quick look, but my review would likely focus on the tests, etc. I'm not familiar enough with the more complex types (List, Map, etc.) to review the code in detail.
    offset_buffer.into()
}

pub fn nullif_alternative(
Could also have this one be generic_nullif (or something like that) if we wanted to keep them around. Does it behave differently than the primitive version on primitives? If so, that could be a reason to consider keeping both.
) -> Result<ArrayRef> {
    if array.len() != condition.len() {
        return Err(ArrowError::ComputeError(
            "Inputs to conditional kernels must have the same length".to_string(),
Maybe change "conditional kernels" to "nullif_alternative" to make it clearer what went wrong?
}
DataType::FixedSizeList(_, _size) => {
    // TODO: should this actually slice the child arrays?
    vec![]
Here and below -- does this mean it doesn't support fixed size lists, structs, etc.?
If yes -- should this instead return an error (or panic) rather than returning the empty list?
If no -- should there be added tests for these cases?
Sorry, I know this logic is a bit confusing. For most types we only need to slice the directly contained buffers, so returning these new buffers is basically the happy path here. Struct or FixedSizeList layouts don't directly contain buffers; instead the data is stored in child arrays, which need to be sliced.
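A small sketch of what slicing the child arrays could look like for FixedSizeList (names are illustrative, not the PR's code): since the layout has no offsets buffer, the parent's (offset, len) maps directly onto the child values array, scaled by the list size.

```rust
// Hypothetical helper: a FixedSizeList<T, size> sliced by (offset, len)
// corresponds to slicing its child values array by (offset * size, len * size).
// Struct children would instead be sliced by (offset, len) unchanged.
fn fixed_size_list_child_range(offset: usize, len: usize, size: usize) -> (usize, usize) {
    (offset * size, len * size)
}

fn main() {
    // a FixedSizeList with size 3, sliced to 2 elements starting at element 4
    let (child_offset, child_len) = fixed_size_list_child_range(4, 2, 3);
    assert_eq!((child_offset, child_len), (12, 6));
    println!("ok");
}
```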
    condition.len(),
);

let (result_data_buffers, result_child_data) = slice_buffers(
I wonder if this slice_buffers method should instead be a method on the array (given everything comes from that). And rather than "slice" (which has an existing meaning) I wonder if it could be something like "trim" to indicate that it "trims" the buffer to start at 0 (rather than the offset). As your comment identifies, I could see this being useful with other kernels that expect the inputs to start at 0 (because it's created a buffer).
Perhaps slice() belongs on ArrayData (which has both the buffers and datatypes).
))
}

pub(super) fn combine_option_buffers(
May be worth some comments on what this combination does?
/// The result is null if either of the buffers are null.
/// The resulting buffer is at offset `0`.
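The suggested doc comment can be backed by a tiny standalone model of the combination (an illustrative sketch only, not the actual combine_option_buffers implementation): None means "no validity buffer", i.e. all valid, so it acts as the identity for the bitwise AND.

```rust
// Illustrative model only: combine two optional validity bitmaps so that a
// slot is null in the result if it is null in either input. `None` stands
// for "no validity buffer", meaning every slot is valid.
fn combine_option_bitmaps(left: Option<&[u8]>, right: Option<&[u8]>) -> Option<Vec<u8>> {
    match (left, right) {
        (None, None) => None,
        (Some(b), None) | (None, Some(b)) => Some(b.to_vec()),
        (Some(l), Some(r)) => Some(l.iter().zip(r).map(|(a, b)| a & b).collect()),
    }
}

fn main() {
    let l = [0b1111_0000u8];
    let r = [0b1010_1010u8];
    // AND keeps only the slots valid on both sides
    assert_eq!(combine_option_bitmaps(Some(&l), Some(&r)), Some(vec![0b1010_0000]));
    assert_eq!(combine_option_bitmaps(None, Some(&r)), Some(vec![0b1010_1010]));
    assert_eq!(combine_option_bitmaps(None, None), None);
    println!("ok");
}
```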
}

#[test]
fn test_nullif_struct_sliced() {
For many of these -- it looks like we test for the case of the data array being sliced, but not the boolean array being sliced. Should we add some additional tests for that?
Also note -- I've seen problems with specific values of slicing. Not sure if it's worth considering something like proptests for this?
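Short of pulling in proptest, one cheap way to cover "specific values of slicing" is to loop over every (offset, len) pair against a simple reference, in the spirit of this standalone sketch (`copy_bits` and `get_bit` are made-up stand-ins for the real bitmap routines):

```rust
// Made-up stand-ins for the real bitmap routines, used only to show the
// exhaustive-offsets testing pattern.
fn get_bit(bits: &[u8], i: usize) -> bool {
    bits[i / 8] & (1 << (i % 8)) != 0
}

fn copy_bits(bits: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if get_bit(bits, offset + i) {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

fn main() {
    let bits = [0b1011_0010u8, 0b0101_1101u8];
    // check every possible (offset, len) combination, including the
    // non-byte-aligned ones where slicing bugs usually hide
    for offset in 0..16 {
        for len in 0..=(16 - offset) {
            let copied = copy_bits(&bits, offset, len);
            for i in 0..len {
                assert_eq!(get_bit(&copied, i), get_bit(&bits, offset + i));
            }
        }
    }
    println!("all offsets ok");
}
```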
I took a high level look at this PR and it looks reasonable (and well tested!) to me. Thank you @jhorstmann!
I also tested this branch (locally) using the forced validation check from #1546:

cargo test --features=force_validate -p arrow

And all checks passed 👍
So I wonder what is the plan for this PR (I am preparing for the arrow 12 release at the end of this week)? Shall we remove the old nullif implementation and replace it with this one?
@alamb I'm also not sure what is the best way forward. It is a lot of code just to support one kernel, and it changes quite a bit how we work with Buffers. And it still does not completely avoid copying for boolean types. If these tradeoffs are ok, there are also some TODOs still left:
Longer term I'd like to experiment with removing the offset.
I think removing the explicit offset sounds good. As you mention, however, BooleanArray / Bitmaps would end up being copied without more thought.
One potential option to handle …
I'm converting this to a draft as it appears to have stalled; feel free to unmark if I am mistaken.
Another random thought would be to store a bit_offset on …
Given the number of bugs associated with offsets in general, I really like the idea of switching to …
Closing as #2940 has been merged
Which issue does this PR close?
Closes #510.
Rationale for this change
This is an alternative implementation to #521 of the nullif kernel that works by slicing the buffers of the array depending on the data type. For most data types this can be done without copying data. The exceptions are boolean arrays (when the offset is not a multiple of 8) and list arrays, because there are some assumptions that offsets start at zero.

There are several TODOs still left in the code to verify the correct slicing logic for some data types. It also seems a bit strange that this slicing logic is only needed for a single kernel, and I also don't like that we still need to copy for some data types.
A better design would require some larger refactorings of the current data model:
- removing offset from ArrayData so that all slicing has to be pushed down into buffers
- adding an offset field to Bitmap so that bitmaps can be sliced without copying
- using Bitmap as the data holder for boolean arrays

What changes are included in this PR?
I added the kernel as a separate function to make the diff easier to read; if we decide to merge this, it should replace the existing nullif kernel.

The API changes, since the nullif kernel now takes an ArrayRef instead of a PrimitiveArray.