Faster bitmask iteration #1228

Merged: 11 commits into apache:master, Feb 2, 2022
Conversation

@tustvold (Contributor) commented Jan 23, 2022

Which issue does this PR close?

Closes #1227.

Rationale for this change

This improves the filter benchmarks by roughly 2x, and will likely have similar benefits elsewhere

filter u8               time:   [140.44 us 140.61 us 140.76 us]                      
                        change: [-51.558% -51.392% -51.226%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.4576 us 2.4587 us 2.4601 us]
                        change: [-53.091% -52.977% -52.863%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.6956 us 1.6981 us 1.7010 us]
                        change: [-60.543% -60.284% -59.829%] (p = 0.00 < 0.05)
                        Performance has improved.
                        
filter f32              time:   [418.84 us 419.45 us 420.15 us]                       
                        change: [-26.414% -26.286% -26.141%] (p = 0.00 < 0.05)
                        Performance has improved.

filter single record batch                                                                            
                        time:   [183.96 us 188.44 us 193.16 us]
                        change: [-33.234% -31.549% -29.597%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: the filter context benchmarks don't see any change as the filter is computed outside the benchmark body.

What changes are included in this PR?

This adds an UnalignedBitChunkIterator and updates SlicesIterator to use it

Are there any user-facing changes?

No
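
For context, here is a self-contained sketch of the underlying technique (illustrative only, not the PR's actual code): decompose an arbitrary bit range into a masked first word, a run of whole aligned u64 words, and a masked last word, so the hot loop only ever operates on full u64 values. The function name and the little-endian, multiple-of-8-bytes assumptions are mine, chosen to keep the example short.

    fn count_set_bits(bytes: &[u8], offset: usize, len: usize) -> u32 {
        // Simplifying assumptions for the sketch: little-endian target,
        // buffer length a multiple of 8 bytes, range within bounds.
        assert!(bytes.len() % 8 == 0 && offset + len <= bytes.len() * 8);
        if len == 0 {
            return 0;
        }
        // Reinterpret the buffer as u64 words
        let words: Vec<u64> = bytes
            .chunks_exact(8)
            .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
            .collect();
        let (start, end) = (offset, offset + len);
        (start / 64..=(end - 1) / 64)
            .map(|w| {
                let mut word = words[w];
                if w == start / 64 {
                    word &= u64::MAX << (start % 64); // mask off lead padding
                }
                if w == (end - 1) / 64 && end % 64 != 0 {
                    word &= u64::MAX >> (64 - end % 64); // mask off trailing padding
                }
                word.count_ones()
            })
            .sum()
    }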

@github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels on Jan 23, 2022
Inline review comment from @tustvold (Contributor, Author) on this diff hunk:

    if byte_idx == 0 {
        let bit_length = bytes.len() * 8;
This wasn't ever really a bottleneck in the parquet parsing, but still sees a slight aggregate improvement of 2-3%. Updated for curiosity more than necessity.

@tustvold marked this pull request as draft on January 23, 2022 14:43
@codecov-commenter commented Jan 23, 2022

Codecov Report

Merging #1228 (f809d33) into master (aa71aea) will increase coverage by 0.03%.
The diff coverage is 93.91%.


@@            Coverage Diff             @@
##           master    #1228      +/-   ##
==========================================
+ Coverage   82.96%   83.00%   +0.03%     
==========================================
  Files         178      178              
  Lines       51522    51690     +168     
==========================================
+ Hits        42744    42904     +160     
- Misses       8778     8786       +8     
Impacted Files Coverage Δ
arrow/src/util/bit_chunk_iterator.rs 93.47% <92.43%> (-4.33%) ⬇️
arrow/src/compute/kernels/filter.rs 92.13% <96.77%> (+2.57%) ⬆️
arrow/src/buffer/immutable.rs 99.45% <100.00%> (+0.52%) ⬆️
parquet/src/arrow/bit_util.rs 100.00% <100.00%> (ø)
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (-0.43%) ⬇️
arrow/src/array/transform/mod.rs 84.64% <0.00%> (-0.13%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold (Contributor, Author) commented:
To test how much of the performance uplift came from the changes to SlicesIterator and how much from UnalignedBitChunk, I created a branch with the changes to SlicesIterator but using BitChunks instead of UnalignedBitChunks.

#1225

filter u8              
                        time:   [289.51 us 289.72 us 289.93 us]                      

filter u8 high selectivity                                                                             
                        time:   [5.2759 us 5.2786 us 5.2819 us]

filter u8 low selectivity                                                                             
                        time:   [3.8342 us 3.8385 us 3.8453 us]

Improved SlicesIterator only

filter u8              
                        time:   [175.13 us 175.16 us 175.20 us]                      

filter u8 high selectivity                                                                             
                        time:   [2.5869 us 2.5880 us 2.5892 us]

filter u8 low selectivity                                                                             
                        time:   [1.9370 us 1.9388 us 1.9407 us]

This PR

filter u8               
                        time:   [140.52 us 140.55 us 140.58 us]                      

filter u8 high selectivity                                                                             
                        time:   [2.5015 us 2.5023 us 2.5033 us]

filter u8 low selectivity                                                                             
                        time:   [1.7586 us 1.7591 us 1.7597 us]

So UnalignedBitChunk does yield a non-negligible performance benefit

@tustvold (Contributor, Author) commented:
I've rebased this so it is no longer on top of #1225, as that fix is pending some more thought. Without it the delta is smaller, but still significant:

filter u8               time:   [359.76 us 359.84 us 359.93 us]                      
                        change: [-27.199% -27.161% -27.119%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [8.8616 us 8.9026 us 8.9439 us]
                        change: [-29.670% -29.463% -29.275%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [2.4243 us 2.4249 us 2.4254 us]
                        change: [-45.812% -45.627% -45.478%] (p = 0.00 < 0.05)
                        Performance has improved.

Perhaps more importantly, I hope to use this to implement optimized filter kernels.

@tustvold changed the title from "Unaligned bit chunk" to "Faster bitmask iteration" on Jan 29, 2022

let unaligned = UnalignedBitChunk::new(buffer.as_slice(), 5, 27);

assert_eq!(unaligned.prefix(), Some(((1 << 32) - 1) - ((1 << 5) - 1)));
A reviewer (Contributor) commented:

These tests might be easier to understand by asserting against a binary literal.
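For illustration (my rewrite, not from the thread), the assertion above could read:

    // 27 one bits for the logical range, 5 zero bits of lead padding
    assert_eq!(
        unaligned.prefix(),
        Some(0b1111_1111_1111_1111_1111_1111_1110_0000)
    );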

I added a bit of debug output locally and I'm not sure whether this should be the expected output:

eprintln!("{:064b}", unaligned.prefix().unwrap());

// output: 0000000000000000000000000000000011111111111111111111111111100000 

I would have expected the prefix chunk not to have trailing zeroes, and instead to have a length of less than 64 bits. That does not make a difference when counting bits, but I find it a bit confusing.

I don't yet understand how the current behavior interacts in the advance_to_set_bit function. The way I understand it, offset would be 5 for this example, then trailing_zeros would also be 5, and then in SlicesIterator start_chunk + start_bit would be 10. But I'm probably missing something there.

@tustvold (Contributor, Author) replied on Jan 29, 2022:

I've updated these tests to hopefully be clearer, let me know if that helps.

Edit: I'm looking into SlicesIterator now

The reviewer (Contributor) replied:

Tests are much nicer to read now, thanks. My one concern is that the UnalignedBitChunkIterator itself is a bit difficult to use, because it requires special handling for the first element, which has to be shifted by lead_padding. I played around with a similar design, and one idea is to return an iterator of (start_offset, mask, len) tuples.

Example: for a bitmap consisting of 70 one bits starting at offset 60, the iterator would return

(0, 0b1111, 4)
(4, u64::MAX, 64)
(68, 0b11, 2)

Users would then need no special logic for prefix/suffix. When iterating over all set bits you could use trailing_zeros on each chunk and add the start_offset.
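
A sketch of that consumption pattern (hypothetical: the chunks value and its (start_offset, mask, len) tuples are the proposed design above, not an existing API):

    let mut set_positions: Vec<usize> = Vec::new();
    for (start_offset, mask, _len) in chunks {
        let mut bits: u64 = mask;
        while bits != 0 {
            // Position of the lowest set bit within this chunk
            let bit = bits.trailing_zeros() as usize;
            set_positions.push(start_offset + bit);
            bits &= bits - 1; // clear the lowest set bit
        }
    }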

@tustvold (Contributor, Author) replied:

Yeah, I agree that UnalignedBitChunkIterator isn't the easiest construction to use, although this is somewhat by design. It is not intended as a replacement for BitChunkIterator, but rather for cases where you are willing to pay the cost of more complex start and termination logic in exchange for simpler logic within the main loop itself...

Would it allay your concerns if I made it pub(crate) so that we can continue to iterate on it without introducing breaking changes?

The reviewer (Contributor) replied:

Marking it internal for now sounds good, I'm ok with merging it then. The performance improvements are really nice, and we can probably find more use cases for it.

The reviewer (Contributor) added:

Would moving the count_set_bits and iter_set_bits_rev functions to the arrow crate be an option, hiding UnalignedBitChunk as their implementation detail? I think they were added after the last release, so that would not even be a breaking change. On the other hand, iter_set_bits_rev seems very specific to the parquet use case.

@tustvold (Contributor, Author) replied:

They're currently in a crate-private module in parquet somewhat intentionally. I'll have a play with the iterator approach you propose and see if it has an impact on perf.

@tustvold (Contributor, Author) added:

So I ran into challenges making a reversible UnalignedBitChunkIterator, which is necessary for iter_set_bits_rev. That actually led me to a simpler solution: implement iter_set_bits_rev in terms of UnalignedBitChunk and make the iterator crate-private. Let me know what you think of this.
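
For readers, a minimal sketch of the reverse-iteration primitive involved (my own illustration of walking one u64 chunk's set bits from highest to lowest; not the PR's iter_set_bits_rev):

    fn rev_set_bits(mut chunk: u64) -> impl Iterator<Item = usize> {
        std::iter::from_fn(move || {
            if chunk == 0 {
                return None;
            }
            let bit = 63 - chunk.leading_zeros() as usize; // highest set bit
            chunk ^= 1 << bit; // clear it for the next iteration
            Some(bit)
        })
    }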

The reviewer (Contributor) replied:

Sounds good. For the SlicesIterator, the current definition of prefix/lead_padding seems to work better than my idea above. If we want to change the behavior at some later point we could rename those methods in a major release. A behavior change would have been bad with a public iter method.

@tustvold (Contributor, Author) replied:

Done

/// Yields an iterator of aligned u64, along with the leading and trailing
/// u64 necessary to align the buffer to an 8-byte boundary
///
/// This is unlike [`BitChunkIterator`] which only exposes a trailing u64,
A reviewer (Contributor) commented:

👍 good rationale

@tustvold (Contributor, Author) commented Jan 29, 2022:

I think SlicesIterator has a bug 😞 - fixing... It's applying the offset in the wrong direction 😅

@alamb (Contributor) left a comment:

The original implementation looks like it came in with apache/arrow#8960 cc @jorgecarleitao and @nevi-me

I made it through about half of this PR and now need to go do other things; I will keep reviewing it tomorrow.

Resolved review threads: arrow/src/compute/kernels/filter.rs (3), parquet/src/arrow/bit_util.rs (1)
@alamb (Contributor) left a comment:

I went through this code again and wrote some additional tests to convince myself it was correct. Nice work @tustvold

I will wait until @jhorstmann is satisfied with this PR before merging.

Resolved review thread: arrow/src/compute/kernels/filter.rs
tustvold and others added 3 commits February 1, 2022 14:47
@alamb (Contributor) left a comment:

I think @tustvold has made all the changes requested. @jhorstmann any last thoughts?

@jhorstmann (Contributor) commented:

👍 from my side

@alamb merged commit f055fb0 into apache:master on Feb 2, 2022
@alamb (Contributor) commented Feb 2, 2022:

Thanks @tustvold and @jhorstmann

Labels: arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)
Successfully merging this pull request may close these issues.

UnalignedBitChunkIterator that iterates through already aligned u64 blocks (#1227)
4 participants