Add NullBuffer::from_unsliced_buffer helper and refactor call sites#9411
Add NullBuffer::from_unsliced_buffer helper and refactor call sites#9411alamb merged 12 commits intoapache:mainfrom
NullBuffer::from_unsliced_buffer helper and refactor call sites#9411Conversation
alamb
left a comment
There was a problem hiding this comment.
Thank you @Eyad3skr -- this looks nice to me
cc @liamzwbao in case you have time to help reivew
arrow-buffer/src/buffer/null.rs
Outdated
| pub fn buffer(&self) -> &Buffer { | ||
| self.buffer.inner() | ||
| } | ||
| /// Create a [`NullBuffer`] from an *unsliced* validity bitmap (`offset = 0`) of length `len`. |
There was a problem hiding this comment.
Could you also update this comment to reflect the offset is in bits (not bytes)? Mixing the units is a common mistake so making sure the documentation is as clear as possible would help
arrow-buffer/src/buffer/null.rs
Outdated
| /// Create a [`NullBuffer`] from an *unsliced* validity bitmap (`offset = 0`) of length `len`. | ||
| /// | ||
| /// Returns `None` if there are no nulls (all values valid). | ||
| pub fn try_from_unsliced(buffer: impl Into<Buffer>, len: usize) -> Option<Self> { |
There was a problem hiding this comment.
Maybe we should call it from_buffer? try_* probably should be reserved for functions that return a result, and its probably better to be direct that we're constructing directly from a buffer?
There was a problem hiding this comment.
a valid point tbh! on it.
|
@alamb I guess there is nothing else to take care after anymore? maybe if someone can just trigger/approve the CI pipeline workflows left that would be awesome. |
Done! |
|
Seems like the CI is broken |
|
true, my schedule this week is a bit busy because of university. I will be working on fixing the CI between today and Friday. On it. |
liamzwbao
left a comment
There was a problem hiding this comment.
Overall LGTM! thanks for the change.
It would be better to refactor the following call sites as well
arrow-rs/arrow-array/src/array/boolean_array.rs
Lines 537 to 542 in c129c7c
arrow-rs/arrow-string/src/regexp.rs
Lines 203 to 207 in c129c7c
arrow-rs/arrow-string/src/substring.rs
Lines 216 to 220 in c129c7c
arrow-rs/arrow-string/src/substring.rs
Lines 318 to 322 in c129c7c
arrow-rs/arrow-string/src/substring.rs
Lines 356 to 360 in c129c7c
| #[test] | ||
| fn test_from_unsliced_buffer_with_nulls() { | ||
| // Buffer with some nulls: 0b10110010 = valid, null, valid, valid, null, null, valid, null | ||
| let buf = Buffer::from([0b10110010u8]); |
There was a problem hiding this comment.
Arrow uses LSB numbering and that's why this test failed. Could refer to the doc to fix the test
There was a problem hiding this comment.
Oh!! Thanks for the info. going to check it now!
…op unused BooleanBuffer import in parquet view_buffer.rs
…op unused BooleanBuffer import in parquet view_buffer.rs
|
would appreciate if you can trigger/approve on CI workflows @alamb |
arrow-string/src/regexp.rs
Outdated
| let null_bit_buffer = array | ||
| .nulls() | ||
| .map(|n| n.inner().sliced()) | ||
| .and_then(|b| NullBuffer::from_unsliced_buffer(b, array.len())) | ||
| .map(|nb| nb.into_inner().into_inner()); |
There was a problem hiding this comment.
We don't need a new null_bit_buffer, let's refactor line 208 to 212 at the end of this func instead
arrow-string/src/substring.rs
Outdated
| // FixedSizeBinaryArray with size 0 requires a validity bitmap | ||
| if new_len == 0 && nulls.is_none() { | ||
| // FixedSizeBinaryArray::new takes length from the values buffer, except when size == 0. | ||
| // In that case it uses the null buffer length, so preserve the original length here. | ||
| // Example: ["", "", ""] -> substring(..., 1, Some(2)) should keep len=3; | ||
| // otherwise it collapses to an empty array (len=0). |
There was a problem hiding this comment.
We should keep the existing comment here and avoid adding the new comment on line 360, as it is unrelated to this PR.
refinement for comments and null_bit_buffer
| .and_then(|b| NullBuffer::from_unsliced_buffer(b, num_of_elements)); | ||
|
|
||
| // FixedSizeBinaryArray::new takes length from the values buffer, except when size == 0. | ||
| // In that case it uses the null buffer length, so preserve the original length here. | ||
| // Example: ["", "", ""] -> substring(..., 1, Some(2)) should keep len=3; | ||
| // otherwise it collapses to an empty array (len=0). | ||
| if new_len == 0 && nulls.is_none() { | ||
| // FixedSizeBinaryArray::new takes length from the values buffer, except when size == 0. | ||
| // In that case it uses the null buffer length, so preserve the original length here. | ||
| // Example: ["", "", ""] -> substring(..., 1, Some(2)) should keep len=3; | ||
| // otherwise it collapses to an empty array (len=0). |
There was a problem hiding this comment.
nit: can we move the comment to its original place?
liamzwbao
left a comment
There was a problem hiding this comment.
LGTM! But I guess CI will fail due to formatting/import issues.
You could run cargo fmt and cargo check to check it out
true, I forgot about 2 warning from unused imports.. thanks! |
NullBuffer::try_from_unsliced helper and refactor call sitesNullBuffer::from_unsliced_buffer helper and refactor call sites
|
would appreciate if you can trigger the CI workflows for me @alamb |
alamb
left a comment
There was a problem hiding this comment.
This looks great -- thanks a lot @Eyad3skr and @liamzwbao and @Jefffrey . A good team effort
Implements a helper to replace the pattern of creating a
BooleanBufferfrom an unsliced validity bitmap and filtering by null count. Previously this was done withBooleanBuffer::new(...)plusSome(NullBuffer::new(...)).filter(|n| n.null_count() > 0);now it is a single call toNullBuffer::try_from_unsliced(buffer, len), which returnsSome(NullBuffer)when there are nulls andNonewhen all values are valid.try_from_unslicedinarrow-buffer/src/buffer/null.rswith tests for nulls, all valid, all null, emptyFixedSizeBinaryArray::try_from_iter_with_sizeandtry_from_sparse_iter_with_sizeto use ittake_nullsinarrow-selectto use itCloses #9385