Fix for primitive and boolean take kernel for nullable indices with an offset #509

Merged
3 commits merged into apache:master on Jun 30, 2021

Conversation

jhorstmann
Contributor

Which issue does this PR close?

Closes #502.

While implementing the fix I noticed a similar issue in the boolean take kernel which is now also fixed.

Rationale for this change

When reusing the validity buffer of an array for a newly created array, the offset of the original array has to be taken into account. The original array might have an offset > 0, but the new array will usually start at 0, so a slice of the validity buffer has to be taken.
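
A minimal sketch of the problem described above, written against the standard library only rather than the arrow crate. The helpers `get_bit` and `slice_bits` are hypothetical names introduced for illustration, not arrow APIs; the LSB-first bit order is the one Arrow uses for validity bitmaps.

```rust
/// Returns bit `i` of an LSB-first packed bitmap.
fn get_bit(bytes: &[u8], i: usize) -> bool {
    bytes[i / 8] & (1u8 << (i % 8)) != 0
}

/// Copies `len` bits starting at bit `offset` into a new bitmap
/// whose first element sits at bit 0.
fn slice_bits(bytes: &[u8], offset: usize, len: usize) -> Vec<u8> {
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if get_bit(bytes, offset + i) {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

For an array sliced at offset 2 with length 2 over the bitmap `0b0000_1101`, `slice_bits(&[0b0000_1101], 2, 2)` copies bits 2 and 3 (both valid). Reading bits 0 and 1 of the unsliced buffer instead would report the second element as null, which is the kind of wrong result this PR addresses.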

What changes are included in this PR?

Are there any user-facing changes?

No

@github-actions github-actions bot added the "arrow" (Changes to the arrow crate) label on Jun 29, 2021
@codecov-commenter

codecov-commenter commented Jun 29, 2021

Codecov Report

Merging #509 (2423fad) into master (99b1c90) will increase coverage by 0.12%.
The diff coverage is 100.00%.

❗ Current head 2423fad differs from pull request most recent head 541b5ad. Consider uploading reports for the commit 541b5ad to get more accurate results

@@            Coverage Diff             @@
##           master     #509      +/-   ##
==========================================
+ Coverage   82.64%   82.76%   +0.12%     
==========================================
  Files         165      165              
  Lines       45703    45724      +21     
==========================================
+ Hits        37769    37845      +76     
+ Misses       7934     7879      -55     
Impacted Files                         Coverage Δ
arrow/src/array/array_binary.rs        92.23% <ø> (+2.10%) ⬆️
arrow/src/array/array_boolean.rs       94.01% <ø> (+3.10%) ⬆️
arrow/src/array/array_dictionary.rs    88.38% <ø> (+3.81%) ⬆️
arrow/src/array/array_list.rs          94.88% <ø> (+2.06%) ⬆️
arrow/src/array/array_primitive.rs     94.60% <ø> (-0.10%) ⬇️
arrow/src/array/array_string.rs        97.76% <ø> (+1.71%) ⬆️
arrow/src/array/array_struct.rs        89.24% <ø> (+1.39%) ⬆️
arrow/src/array/array_union.rs         89.26% <ø> (+2.33%) ⬆️
arrow/src/array/null.rs                83.78% <ø> (-2.89%) ⬇️
arrow/src/array/array.rs               80.90% <100.00%> (+4.04%) ⬆️
... and 10 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 99b1c90...541b5ad.

Contributor

@alamb alamb left a comment

I reviewed the logic and the tests carefully and this looks great to me. Thank you @jhorstmann

FYI @ritchie46

@@ -516,7 +522,7 @@ where
             nulls = match indices.data_ref().null_buffer() {
                 Some(buffer) => Some(buffer_bin_and(
                     buffer,
-                    0,
+                    indices.offset(),
                     &null_buf.into(),
                     0,
Contributor

Is it correct that this 0 is due to the fact that null_buf was constructed via

        let mut null_buf = MutableBuffer::new(num_byte).with_bitset(num_byte, true);

in the same else clause?

Contributor Author

Yes, null_buf is newly constructed and initialized starting from 0, while the first buffer and offset pair come from the indices array, which might have a non-zero offset.

There was a proposal before to push the offsets down into all buffers instead of storing them in the array. That way we wouldn't need to care about which array a buffer originally belonged to. But if we do that, we'd still need a better abstraction for the validity bitmap and only access it via (chunked) iterators. I'm also not sure whether such a change would have an effect on FFI usage.
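
To illustrate why each buffer needs its own offset, here is a simplified, std-only sketch of a bitwise AND over two bitmaps. The function `and_with_offsets` is a hypothetical stand-in loosely modelled on the argument order of the `buffer_bin_and` call in the diff; it is not the arrow implementation, which works on whole bytes or chunks rather than single bits.

```rust
/// Bitwise AND of `len` bits from two LSB-first bitmaps, each read
/// from its own bit offset. The result always starts at bit 0.
fn and_with_offsets(
    left: &[u8],
    left_offset: usize,
    right: &[u8],
    right_offset: usize,
    len: usize,
) -> Vec<u8> {
    // Reads a single bit; hypothetical helper for this sketch only.
    let bit = |bytes: &[u8], i: usize| bytes[i / 8] & (1u8 << (i % 8)) != 0;
    let mut out = vec![0u8; (len + 7) / 8];
    for i in 0..len {
        if bit(left, left_offset + i) && bit(right, right_offset + i) {
            out[i / 8] |= 1u8 << (i % 8);
        }
    }
    out
}
```

With an indices bitmap of `0b0000_0110` sliced at offset 2 and a freshly built all-valid bitmap at offset 0, passing `2` as the left offset selects bits 2 and 3 (valid, null). Passing `0` would select bits 0 and 1 (null, valid) and silently swap the nulls.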

);

test_take_primitive_arrays_non_null::<Int64Type>(
vec![0, 1, 2, 3, 4, 5, 6],
Contributor

Would it make sense to use different values here than the indices -- perhaps something like

Suggested change
- vec![0, 1, 2, 3, 4, 5, 6],
+ vec![0, 10, 20, 30, 40, 50, 60],

So it is clearer from this context alone that only 20 and 30 should be returned.
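
A small std-only sketch of why distinct values make the expectation easier to read. The `take` function below is a hypothetical simplification (plain slices and `Option` instead of arrow arrays), and the index values are assumed for illustration, since the test's actual indices are not shown here.

```rust
/// Gathers `values[i]` for every index; a null index yields a null output.
fn take(values: &[i64], indices: &[Option<usize>]) -> Vec<Option<i64>> {
    indices.iter().map(|&idx| idx.map(|i| values[i])).collect()
}
```

Taking the sliced indices `[Some(2), Some(3)]` from `[0, 10, 20, 30, 40, 50, 60]` gives `[Some(20), Some(30)]`. With values equal to their positions, the result `[Some(2), Some(3)]` would be indistinguishable from the indices themselves.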

Contributor Author

Good idea, done

@ritchie46
Contributor

Nice. Great that this is fixed before the 5.0 release!

Contributor

@alamb alamb left a comment

Looking good -- thanks again @jhorstmann

@alamb alamb merged commit b63c407 into apache:master Jun 30, 2021
alamb pushed a commit that referenced this pull request Jun 30, 2021
…n offset (#509)

* Fix for take kernel with nullable indices and nonnull values

* Fix for boolean take kernel when indices have an offset

* Use different values for data so they cannot be confused with the indices
alamb added a commit that referenced this pull request Jul 1, 2021
…n offset (#509) (#516)

* Fix for take kernel with nullable indices and nonnull values

* Fix for boolean take kernel when indices have an offset

* Use different values for data so they cannot be confused with the indices

Co-authored-by: Jörn Horstmann <git@jhorstmann.net>
Successfully merging this pull request may close these issues.

sliced null buffers lead to incorrect result in take kernel (and probably on other places)
4 participants