
Improvements to BooleanBufferBuilder / BooleanBuilder #8561

@jhorstmann

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

BooleanBufferBuilder / BooleanBuilder currently seem to serve two quite distinct use cases, which makes it harder to get the best possible performance for either of them.

  • Building a new buffer by starting from an empty state and incrementally appending new bits (append_value, append_slice, append_packed_range and similar methods).
  • Starting from a buffer that is initialized to ones/zeroes or a copy of an existing buffer, modifying certain bits in it (set_bit, get_bit).

(This is based on my analysis of the arrow-rs code only; the assumption should also be verified against some bigger users like datafusion.)

The first use case can be optimized by collecting the bits to append in a u64 and only writing the corresponding bytes to memory once every 64 appended values. Any capacity checks are thus amortized over those 64 values. On the other hand, methods to get and set arbitrary bit positions would be somewhat less efficient to implement in this scheme, since the most recent bits may still live in the u64 rather than in the buffer.
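To make the idea concrete, here is a minimal sketch of such an append-only builder. The type `AppendOnlyBitBuilder` and its methods are hypothetical names chosen for illustration and are not part of arrow-rs. It assumes the usual Arrow bit order (least-significant bit first within each byte) and uses a plain `Vec<u8>` in place of a real buffer type.

```rust
/// Hypothetical append-only bit builder (illustration only, not arrow-rs code).
/// Bits are collected in a u64 and flushed to memory once per 64 appends.
pub struct AppendOnlyBitBuilder {
    /// Bytes of all fully flushed 64-bit words, in little-endian order.
    bytes: Vec<u8>,
    /// Bits appended since the last flush, least-significant bit first.
    current: u64,
    /// Number of valid bits in `current`; always less than 64.
    pending: u32,
}

impl AppendOnlyBitBuilder {
    pub fn new() -> Self {
        Self { bytes: Vec::new(), current: 0, pending: 0 }
    }

    /// Appends one bit. Memory (and thus any capacity check) is only
    /// touched on every 64th call.
    #[inline]
    pub fn append(&mut self, value: bool) {
        self.current |= (value as u64) << self.pending;
        self.pending += 1;
        if self.pending == 64 {
            self.bytes.extend_from_slice(&self.current.to_le_bytes());
            self.current = 0;
            self.pending = 0;
        }
    }

    /// Number of bits appended so far.
    pub fn len(&self) -> usize {
        self.bytes.len() * 8 + self.pending as usize
    }

    /// Flushes the remaining bits and returns the packed bytes
    /// together with the length in bits.
    pub fn finish(mut self) -> (Vec<u8>, usize) {
        let len = self.len();
        let tail_bytes = (self.pending as usize + 7) / 8;
        self.bytes
            .extend_from_slice(&self.current.to_le_bytes()[..tail_bytes]);
        (self.bytes, len)
    }
}
```

A `get_bit(i)` on this type would have to check whether `i` falls into the flushed bytes or into `current`, which is the extra cost for random access mentioned above.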

The second use case should not need any logic to resize the buffer and so could be much simpler.
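A rough sketch of what such a fixed-size type could look like follows. The name `MutableBooleanBuffer` comes from this proposal, but the constructors `new_zeroed` / `new_set`, the accessor `as_bytes`, and the bounds-checking behaviour are assumptions made for illustration, not an existing arrow-rs API.

```rust
/// Hypothetical fixed-length bit buffer (illustration only).
/// The length is set at construction, so there is no resize logic.
pub struct MutableBooleanBuffer {
    bytes: Vec<u8>,
    /// Length in bits.
    len: usize,
}

impl MutableBooleanBuffer {
    /// Creates a buffer of `len` bits, all unset.
    pub fn new_zeroed(len: usize) -> Self {
        Self { bytes: vec![0u8; (len + 7) / 8], len }
    }

    /// Creates a buffer of `len` bits, all set.
    /// Padding bits past `len` in the last byte are left unset.
    pub fn new_set(len: usize) -> Self {
        let mut bytes = vec![0xFFu8; (len + 7) / 8];
        if len % 8 != 0 {
            if let Some(last) = bytes.last_mut() {
                *last &= (1u8 << (len % 8)) - 1;
            }
        }
        Self { bytes, len }
    }

    /// Returns the bit at index `i`; panics if `i` is out of bounds.
    #[inline]
    pub fn get_bit(&self, i: usize) -> bool {
        assert!(i < self.len, "bit index out of bounds");
        (self.bytes[i / 8] & (1u8 << (i % 8))) != 0
    }

    /// Sets the bit at index `i` to `value`; panics if `i` is out of bounds.
    #[inline]
    pub fn set_bit(&mut self, i: usize, value: bool) {
        assert!(i < self.len, "bit index out of bounds");
        let mask = 1u8 << (i % 8);
        if value {
            self.bytes[i / 8] |= mask;
        } else {
            self.bytes[i / 8] &= !mask;
        }
    }

    /// The packed bytes, least-significant bit first within each byte.
    pub fn as_bytes(&self) -> &[u8] {
        &self.bytes
    }
}
```

A constructor that copies an existing buffer would fit the same shape; it is left out here to keep the sketch short.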

Describe the solution you'd like

I think it would make sense to also separate these use cases in code, by introducing separate BooleanBufferBuilder and MutableBooleanBuffer implementations.

Since this would involve breaking the existing API, it would probably have to be done in multiple steps: first introduce a separate MutableBooleanBuffer and deprecate the set_bit / get_bit functionality on BooleanBufferBuilder, and only afterwards refactor the BooleanBuilder.

Describe alternatives you've considered

Additional context

Noticed while looking into #8543

Labels: enhancement (any new improvement worthy of an entry in the changelog)
