WIP: special case bitwise ops when buffers are u64 aligned #8807
base: main
Conversation
```rust
    && left_suffix.is_empty()
    && right_suffix.is_empty()
{
    let result_u64s = left_u64s
```
I am pretty excited to see how much this actually helps with performance. This code should vectorize pretty spectacularly
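For context, here is a minimal sketch of the kind of word-at-a-time fast path under discussion (my own illustration with hypothetical names, not the PR's exact code): `align_to` reinterprets the bit-packed bytes as `&[u64]`, and the fast path applies only when both views have empty prefixes and suffixes:

```rust
// A minimal sketch, not the PR's exact code: apply `op` a word at a
// time when both bit-packed buffers reinterpret cleanly as u64 slices.
fn bitwise_bin_op_u64<F>(left: &[u8], right: &[u8], op: F) -> Option<Vec<u64>>
where
    F: Fn(u64, u64) -> u64,
{
    // SAFETY: u64 has no invalid bit patterns, so viewing aligned bytes
    // as u64 words is sound.
    let (left_prefix, left_u64s, left_suffix) = unsafe { left.align_to::<u64>() };
    let (right_prefix, right_u64s, right_suffix) = unsafe { right.align_to::<u64>() };

    if left_prefix.is_empty()
        && right_prefix.is_empty()
        && left_suffix.is_empty()
        && right_suffix.is_empty()
    {
        // A tight loop over u64 words: exactly the shape that
        // auto-vectorizes well.
        Some(left_u64s.iter().zip(right_u64s).map(|(&l, &r)| op(l, r)).collect())
    } else {
        None // fall back to the existing chunked/byte-wise path
    }
}
```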
TLDR: 30-50% faster 😎
🤖: Benchmark completed
30% - 50% faster 😎
It is strange that

Since I think no one really calls the bitwise kernels directly, I would be inclined to consolidate the benchmarks as part of #8806
```rust
#[allow(clippy::cast_ptr_alignment)]
let raw_data = self.buffer.as_ptr() as *const u64;

// bit-packed buffers are stored starting with the least-significant byte first
```
Wouldn't your change of using align_to fail to work on all bit alignments? (to_le)
As currently written I think this PR will work on both big and little endian -- as it only applies to buffers which are perfectly aligned and a multiple of a 64-bit boundary (aka there are no byte-wise operations occurring)
I think the endianness will come into play if we try to expand this technique to work on data that is not an exact multiple of 64 bits (aka the loop ends would likely be different)
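To make that concrete, a tiny property check of my own (not code from the PR): AND/OR/XOR act on each byte independently, so the result's bytes are the same whichever endianness the machine used to read the words:

```rust
fn main() {
    let a: u64 = 0x0123_4567_89ab_cdef;
    let b: u64 = 0xfedc_ba98_7654_3210;

    // A pure bitwise op commutes with any byte permutation, so ANDing
    // whole words and ANDing individual bytes give identical byte images
    // on both big- and little-endian machines.
    let word_and = (a & b).to_ne_bytes();
    let byte_and: Vec<u8> = a
        .to_ne_bytes()
        .iter()
        .zip(b.to_ne_bytes())
        .map(|(&x, y)| x & y)
        .collect();
    assert_eq!(word_and.to_vec(), byte_and);
}
```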
But the user expects the bits to be in a specific order when the callback is called, and this violates that, no?
I don't think there is any problem. But I clearly don't really understand the concern
consider the following (inefficient) code:

```rust
let mut index = 0;
bitwise_bin_op_helper(
    /* ...some args that are u64 aligned... */
    |left, right| {
        // peek at each bit of the operands, relying on their in-word order
        for bit in 0..64 {
            some_other_array[index] = (left >> bit) & (right >> bit) & 1;
            index += 1;
        }
        left & right
    },
)
```

Before and after your change, this will give a different order of bits on certain endiannesses.
Ah, I see
yes I agree that some operation that moves the bits around in the u64 word could give different results on different endiannesses
I think that is the case for the current bitwise operations too (they are supposed to be bitwise, not bit shuffling) 🤔
I will try and make a PR to update the docs to make this clearer
From a glance it does seem that all current uses of bitwise_bin_op_helper/bitwise_unary_op_helper would be safe even if the endianness was swapped.
It is a difference from the bit chunks implementation so I agree that updating the docs would be good.
An example kernel that would fail would be finding the position of the first unset bit in an array, although I'm not sure bitwise_unary_op_helper would be a great solution for that anyways since you'd want "break from loop once found" behavior.
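For illustration, a sketch of such a position-dependent kernel (my own example; `first_unset_bit` is not an arrow-rs function), written byte-wise; a u64-at-a-time version of this is exactly where a to_le-style conversion would become necessary on big-endian machines:

```rust
/// Find the index of the first 0 bit in a little-endian bit-packed
/// buffer. Unlike pure AND/OR/XOR, the answer depends on where each
/// byte sits within the word, so a u64 fast path could not simply
/// reinterpret the bytes on a big-endian machine.
fn first_unset_bit(packed: &[u8]) -> Option<usize> {
    for (byte_idx, &byte) in packed.iter().enumerate() {
        if byte != 0xFF {
            // trailing_zeros of the inverted byte locates the first 0
            // bit, counting from the least-significant bit.
            return Some(byte_idx * 8 + (!byte).trailing_zeros() as usize);
        }
    }
    None
}
```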
westonpace left a comment
Is this safe on a big-endian machine? I see the current bit_chunks has a to_le call; do you need something equivalent?
```rust
if left_prefix.is_empty()
    && right_prefix.is_empty()
    && left_suffix.is_empty()
    && right_suffix.is_empty()
```
Even if there is a prefix/suffix, could you do u64 operations on the aligned portion and fall back to bitwise operations on the unaligned portion?
That being said, this seems like a fine optimization on its own.
That is a good point (as long as the prefix and suffixes are the same length)
Ah, I see the other comment, my bad
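For the record, a sketch of what that mixed path could look like (my own sketch; `bitwise_and_mixed` is a hypothetical name, not the PR's code); as noted above, it only works when both buffers misalign by the same number of bytes:

```rust
/// Sketch: run the op byte-wise on the unaligned edges and word-wise on
/// the aligned middle. Only valid when both buffers split identically,
/// i.e. their prefixes (and hence suffixes) have the same length.
fn bitwise_and_mixed(left: &[u8], right: &[u8], out: &mut Vec<u8>) {
    assert_eq!(left.len(), right.len());
    // SAFETY: u64 has no invalid bit patterns.
    let (lp, lm, ls) = unsafe { left.align_to::<u64>() };
    let (rp, rm, rs) = unsafe { right.align_to::<u64>() };
    assert_eq!(lp.len(), rp.len(), "buffers must misalign identically");

    // Unaligned prefix: plain byte-wise fallback.
    out.extend(lp.iter().zip(rp).map(|(&l, &r)| l & r));
    // Aligned middle: word-at-a-time, the vectorizable fast path.
    for (&l, &r) in lm.iter().zip(rm) {
        out.extend_from_slice(&(l & r).to_ne_bytes());
    }
    // Unaligned suffix: byte-wise fallback again.
    out.extend(ls.iter().zip(rs).map(|(&l, &r)| l & r));
}
```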
Which issue does this PR close?
apply_unary_op and apply_binary_op bitwise operations #8619

Rationale for this change
While messing around with other bitwise operations, I am pretty sure we can optimize these operations more
Let's try using aligned u64 operations when possible
What changes are included in this PR?
Special case bitwise operations when the data is already aligned to u64 (a reasonably common special case)
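As a usage sketch (assuming arrow-buffer's `bitwise_bin_op_helper` keeps its current public signature and export path, which I have not re-verified here), callers see the same API and results; buffers with zero bit offsets and a length that is a multiple of 64 bits simply take the faster path:

```rust
use arrow_buffer::{buffer::bitwise_bin_op_helper, Buffer};

fn main() {
    // 128 bytes = 1024 bits per buffer: zero bit offsets and a length
    // that is a multiple of 64, so (given Arrow's aligned allocations)
    // this should qualify for the u64 fast path.
    let left = Buffer::from(vec![0b1010_1010u8; 128]);
    let right = Buffer::from(vec![0b1100_1100u8; 128]);
    let anded = bitwise_bin_op_helper(&left, 0, &right, 0, 1024, |l, r| l & r);
    assert_eq!(anded.as_slice()[0], 0b1000_1000);
}
```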
Are these changes tested?
Yes, by CI
Are there any user-facing changes?
No, just faster performance