Skip to content

[Arrow] Add bitwise operations BooleanArray that potentially reuse the underlying allocaton #8809

@alamb

Description

@alamb

This is related to

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

It is common to apply successive filtering operations that progressively narrow a bitmap with rows that pass a predicate -- for example a AND b AND c AND d AND ...

This would be evaluated typically like

t1 = a AND b
t2 = t1 AND c
t3 = t2 AND d

Where t1, t2, and t3 are newly allocated arrays; t3 holds the final output

For a typical array size of 8192 elements, t1, t2 and t3 are eack 1KB allocations

The current boolean kernels in arrow pretty much require allocating a new buffer as they take a reference

Now that @rluvaton has added optimized code to apply mutable bitwise operations in the following PR

I think we are in the position to also provide the same operations on BooleanArray

Describe the solution you'd like

I would like a way to update the contents of a BooleanArray in place (assuming there is no other reference to the backing array)

Describe alternatives you've considered
I propose following the API of PrimitiveArray, such as PrimitiveArray::unary and PrimitiveArray::unary_mut

So that would mean adding new functions to BooleanArray, like

impl BooleanArray {
...
  // apply a bitwise operation to this BooleanArray, using u64 operations
  // returning a new BooleanArray
  fn bitwise_unary(&self, mut op:F) -> Self
where
    F: FnMut(u64) -> u64 {.. }

  // Try to apply a bitwise operation to this BooleanArray in place, using u64 operations
  // If the underlying buffer is shared, returns Err<self>
  fn bitwise_unary_mut(self, mut op:F) -> Result<Self, Self>
where
    F: FnMut(u64) -> u64 {.. }

  // Try to apply a bitwise operation to this BooleanArray in place, using u64 operations
  // If the underlying buffer is shared, returns a new 
  // BooleanArray
  fn bitwise_unary_mut_or_clone(self, mut op:F) -> Self
where
    F: FnMut(u64) -> u64 {.. }

And we would also add similar functions for binary_bitwise functions (that takes a second array argument and right_offset_in_bits)

These APIs are likely fairly straightforward if we implement the plan in

As then the new methods on BooleanArray would just call the relevant underlying buffer methods

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementAny new improvement worthy of a entry in the changelog

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions