Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
BooleanBufferBuilder / BooleanBuilder currently seem to serve two quite distinct use cases, which makes it more difficult to get the best possible performance for both of them:
- Building a new buffer by starting from an empty state and incrementally appending new bits (append_value, append_slice, append_packed_range and similar methods).
- Starting from a buffer that is initialized to ones/zeroes or a copy of an existing buffer, and modifying certain bits in it (set_bit, get_bit).
(This is based on my analysis of the arrow-rs code only; the assumption should also be verified against some bigger users like datafusion.)
The first use case can be optimized by collecting the bits to append in a u64 and only writing the corresponding bytes to memory once every 64 appended values. Any capacity checks are thus amortized over those 64 values. On the other hand, methods to get and set arbitrary bit positions would be somewhat less efficient to implement in this scheme, since a position may fall either in the already written bytes or in the pending u64.
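To make the idea concrete, here is a minimal sketch of such an append-only builder. The type `AppendOnlyBitBuilder` and its methods are hypothetical names for illustration only, not existing arrow-rs API, and the sketch uses a plain `Vec<u8>` instead of arrow's buffer types. Bits are packed least-significant-bit first, which is my assumption about the target layout.

```rust
/// Hypothetical append-only bit builder (illustration only, not arrow-rs API).
pub struct AppendOnlyBitBuilder {
    /// Bytes of all completed 64-bit words, stored little-endian.
    bytes: Vec<u8>,
    /// Bits appended since the last flush, LSB first.
    pending: u64,
    /// Number of valid bits in `pending`, always in 0..64.
    pending_len: u32,
}

impl AppendOnlyBitBuilder {
    pub fn new() -> Self {
        Self { bytes: Vec::new(), pending: 0, pending_len: 0 }
    }

    /// Appends one bit. The hot path only touches two fields;
    /// memory is written (and capacity checked) once per 64 bits.
    #[inline]
    pub fn append(&mut self, value: bool) {
        self.pending |= (value as u64) << self.pending_len;
        self.pending_len += 1;
        if self.pending_len == 64 {
            self.bytes.extend_from_slice(&self.pending.to_le_bytes());
            self.pending = 0;
            self.pending_len = 0;
        }
    }

    /// Total number of bits appended so far.
    pub fn len(&self) -> usize {
        self.bytes.len() * 8 + self.pending_len as usize
    }

    /// Flushes the partial word and returns the packed bytes and the bit length.
    pub fn finish(mut self) -> (Vec<u8>, usize) {
        let len = self.len();
        let tail_bytes = (self.pending_len as usize + 7) / 8;
        self.bytes
            .extend_from_slice(&self.pending.to_le_bytes()[..tail_bytes]);
        (self.bytes, len)
    }
}
```

Note that a `get_bit(i)` on this type would have to branch on whether `i` falls into `bytes` or into `pending`, which is the extra cost mentioned above.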
The second use case should not need any logic to resize the buffer, so its implementation could be much simpler.
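A rough sketch of what the fixed-size variant could look like is below. `FixedBitmap` and its constructors are hypothetical names chosen for this example and are not a proposal for the final API; the backing store is again a plain `Vec<u8>` with LSB-first bit order as an assumption.

```rust
/// Hypothetical fixed-length mutable bitmap (illustration only).
/// The length is chosen at construction and never changes.
pub struct FixedBitmap {
    bytes: Vec<u8>,
    len: usize,
}

impl FixedBitmap {
    /// Creates a bitmap of `len` bits, all set to `value`.
    /// Padding bits in the last byte take the same value.
    pub fn filled(len: usize, value: bool) -> Self {
        let fill = if value { 0xFF } else { 0x00 };
        Self { bytes: vec![fill; (len + 7) / 8], len }
    }

    /// Creates a bitmap by copying the first `len` bits of `src`.
    pub fn from_bytes(src: &[u8], len: usize) -> Self {
        let n = (len + 7) / 8;
        assert!(src.len() >= n, "source too short for requested length");
        Self { bytes: src[..n].to_vec(), len }
    }

    pub fn len(&self) -> usize {
        self.len
    }

    /// Reads bit `i`; panics if `i` is out of range.
    #[inline]
    pub fn get_bit(&self, i: usize) -> bool {
        assert!(i < self.len);
        (self.bytes[i / 8] >> (i % 8)) & 1 == 1
    }

    /// Writes bit `i`; panics if `i` is out of range. No resizing logic needed.
    #[inline]
    pub fn set_bit(&mut self, i: usize, value: bool) {
        assert!(i < self.len);
        let mask = 1u8 << (i % 8);
        if value {
            self.bytes[i / 8] |= mask;
        } else {
            self.bytes[i / 8] &= !mask;
        }
    }
}
```

Because there is no pending word and no growth path, `get_bit` and `set_bit` reduce to a bounds check plus a single byte access.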
Describe the solution you'd like
I think it would make sense to also separate these use cases in code, by introducing separate BooleanBufferBuilder and MutableBooleanBuffer implementations.
Since this would involve breaking the existing API, it would probably have to be done in multiple steps: first introduce a separate MutableBooleanBuffer and deprecate the set_bit / get_bit functionality on the builder, and only afterwards refactor the BooleanBuilder.
Describe alternatives you've considered
Additional context
Noticed while looking into #8543