Add NullBuffer::from_unsliced_buffer helper and refactor call sites #9411

Merged
alamb merged 12 commits into apache:main from Eyad3skr:feat/null-buffer-try-from-unsliced
Feb 24, 2026

@Eyad3skr
Contributor

Implements a helper to replace the pattern of creating a BooleanBuffer from an unsliced validity bitmap and filtering by null count. Previously this was done with BooleanBuffer::new(...) plus Some(NullBuffer::new(...)).filter(|n| n.null_count() > 0); now it is a single call to NullBuffer::try_from_unsliced(buffer, len), which returns Some(NullBuffer) when there are nulls and None when all values are valid.

  • Added try_from_unsliced in arrow-buffer/src/buffer/null.rs with tests for nulls, all valid, all null, empty
  • Refactor FixedSizeBinaryArray::try_from_iter_with_size and try_from_sparse_iter_with_size to use it
  • Refactor take_nulls in arrow-select to use it

Closes #9385
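
The contract described above (Some when at least one value is null, None when every value is valid) can be sketched with standard-library types only. This is a minimal illustration, not the arrow-buffer implementation: `Validity` and its methods are hypothetical stand-ins, and the bit layout assumes Arrow's least-significant-bit numbering.

```rust
/// Hypothetical stand-in for a validity bitmap; NOT the arrow-buffer type.
#[derive(Debug, PartialEq)]
pub struct Validity {
    bytes: Vec<u8>,
    len: usize,
    null_count: usize,
}

impl Validity {
    /// Builds a bitmap over the first `len` bits of `bytes` (bit offset 0).
    /// Returns `None` when every one of those bits is set (no nulls).
    pub fn from_unsliced(bytes: Vec<u8>, len: usize) -> Option<Self> {
        assert!(len <= bytes.len() * 8, "bitmap too short for len");
        // A set bit means "valid"; bit i lives in byte i / 8 at position i % 8.
        let valid = (0..len)
            .filter(|&i| (bytes[i / 8] & (1u8 << (i % 8))) != 0)
            .count();
        let null_count = len - valid;
        (null_count > 0).then(|| Validity { bytes, len, null_count })
    }

    /// Number of null (unset) bits among the first `len` bits.
    pub fn null_count(&self) -> usize {
        self.null_count
    }

    /// Number of logical elements covered by the bitmap.
    pub fn len(&self) -> usize {
        self.len
    }

    /// True when element `i` is null.
    pub fn is_null(&self, i: usize) -> bool {
        (self.bytes[i / 8] & (1u8 << (i % 8))) == 0
    }
}
```

Returning `None` for the all-valid case lets callers skip storing a bitmap entirely, which is what the `Some(...).filter(|n| n.null_count() > 0)` pattern did by hand.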

github-actions bot added the parquet (Changes to the parquet crate) and arrow (Changes to the arrow crate) labels on Feb 13, 2026
Contributor

@alamb alamb left a comment

Thank you @Eyad3skr -- this looks nice to me

cc @liamzwbao in case you have time to help review

pub fn buffer(&self) -> &Buffer {
    self.buffer.inner()
}
/// Create a [`NullBuffer`] from an *unsliced* validity bitmap (`offset = 0`) of length `len`.
Contributor

Could you also update this comment to reflect that the offset is in bits (not bytes)? Mixing the units is a common mistake, so making the documentation as clear as possible would help.

/// Create a [`NullBuffer`] from an *unsliced* validity bitmap (`offset = 0`) of length `len`.
///
/// Returns `None` if there are no nulls (all values valid).
pub fn try_from_unsliced(buffer: impl Into<Buffer>, len: usize) -> Option<Self> {
Contributor

Maybe we should call it from_buffer? try_* should probably be reserved for functions that return a Result, and it's probably better to be direct that we're constructing directly from a buffer.

Contributor Author

a valid point tbh! on it.

@Eyad3skr
Contributor Author

@alamb I guess there is nothing else left to take care of? If someone could trigger/approve the remaining CI workflows, that would be awesome.

@alamb
Contributor

alamb commented Feb 16, 2026

@alamb I guess there is nothing else left to take care of? If someone could trigger/approve the remaining CI workflows, that would be awesome.

Done!

@alamb
Contributor

alamb commented Feb 17, 2026

Seems like the CI is broken

@Eyad3skr
Contributor Author

true, my schedule this week is a bit busy because of university. I will be working on fixing the CI between today and Friday. On it.

Contributor

@liamzwbao liamzwbao left a comment

Overall LGTM! thanks for the change.

It would be better to refactor the following call sites as well

#[test]
fn test_from_unsliced_buffer_with_nulls() {
    // Buffer with some nulls: 0b10110010 = valid, null, valid, valid, null, null, valid, null
    let buf = Buffer::from([0b10110010u8]);
Contributor

Arrow uses LSB numbering, and that's why this test failed. You could refer to the docs to fix the test.
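
To illustrate the numbering the reviewer refers to: with least-significant-bit (LSB) numbering, element 0 maps to the lowest bit of the first byte, so a binary literal reads right to left. The sketch below uses only the standard library; `bit_is_set` and `decode_validity` are hypothetical helper names, not arrow-rs APIs.

```rust
/// Returns true when bit `i` is set, using LSB numbering:
/// bit 0 is the lowest bit of byte 0, bit 8 is the lowest bit of byte 1.
fn bit_is_set(bitmap: &[u8], i: usize) -> bool {
    (bitmap[i / 8] & (1u8 << (i % 8))) != 0
}

/// Decodes the first `len` bits into a vector where `true` means valid.
fn decode_validity(bitmap: &[u8], len: usize) -> Vec<bool> {
    (0..len).map(|i| bit_is_set(bitmap, i)).collect()
}
```

Under this numbering, `0b10110010` decodes to null, valid, null, null, valid, valid, null, valid, which is the reverse of reading the literal left to right.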

Contributor Author

Oh!! Thanks for the info. going to check it now!

@Eyad3skr
Contributor Author

I would appreciate it if you could trigger/approve the CI workflows, @alamb.

Comment on lines 183 to 187
let null_bit_buffer = array
    .nulls()
    .map(|n| n.inner().sliced())
    .and_then(|b| NullBuffer::from_unsliced_buffer(b, array.len()))
    .map(|nb| nb.into_inner().into_inner());
Contributor

We don't need a new null_bit_buffer; let's refactor lines 208 to 212 at the end of this function instead.

Comment on lines 360 to 365
// FixedSizeBinaryArray with size 0 requires a validity bitmap
if new_len == 0 && nulls.is_none() {
    // FixedSizeBinaryArray::new takes length from the values buffer, except when size == 0.
    // In that case it uses the null buffer length, so preserve the original length here.
    // Example: ["", "", ""] -> substring(..., 1, Some(2)) should keep len=3;
    // otherwise it collapses to an empty array (len=0).
Contributor

We should keep the existing comment here and avoid adding the new comment on line 360, as it is unrelated to this PR.

Contributor Author

On it!

refinement for comments and null_bit_buffer
Comment on lines 358 to 365
.and_then(|b| NullBuffer::from_unsliced_buffer(b, num_of_elements));

// FixedSizeBinaryArray::new takes length from the values buffer, except when size == 0.
// In that case it uses the null buffer length, so preserve the original length here.
// Example: ["", "", ""] -> substring(..., 1, Some(2)) should keep len=3;
// otherwise it collapses to an empty array (len=0).
if new_len == 0 && nulls.is_none() {
    // FixedSizeBinaryArray::new takes length from the values buffer, except when size == 0.
    // In that case it uses the null buffer length, so preserve the original length here.
    // Example: ["", "", ""] -> substring(..., 1, Some(2)) should keep len=3;
    // otherwise it collapses to an empty array (len=0).
Contributor

nit: can we move the comment to its original place?

Contributor Author

fair enough

Contributor

@liamzwbao liamzwbao left a comment

LGTM! But I guess CI will fail due to formatting/import issues.

You could run cargo fmt and cargo check to check it out

@Eyad3skr
Contributor Author

LGTM! But I guess CI will fail due to formatting/import issues.

You could run cargo fmt and cargo check to check it out

True, I forgot about two warnings from unused imports. Thanks!

Jefffrey changed the title from "Add NullBuffer::try_from_unsliced helper and refactor call sites" to "Add NullBuffer::from_unsliced_buffer helper and refactor call sites" on Feb 21, 2026
@Eyad3skr
Contributor Author

I would appreciate it if you could trigger the CI workflows for me, @alamb.

Contributor

@alamb alamb left a comment

This looks great -- thanks a lot @Eyad3skr and @liamzwbao and @Jefffrey . A good team effort

@alamb alamb merged commit a2cffdb into apache:main Feb 24, 2026
28 checks passed

Development

Successfully merging this pull request may close these issues.

Introduce NullBuffer::try_from_unsliced to simplify array construction

4 participants