Faster bitmask iteration #1228

Merged: 11 commits into apache:master, Feb 2, 2022
Conversation

@tustvold (Contributor) commented Jan 23, 2022

Which issue does this PR close?

Closes #1227.

Rationale for this change

This improves the filter benchmarks by roughly 2x, and will likely have similar benefits elsewhere

filter u8               time:   [140.44 us 140.61 us 140.76 us]                      
                        change: [-51.558% -51.392% -51.226%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [2.4576 us 2.4587 us 2.4601 us]
                        change: [-53.091% -52.977% -52.863%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [1.6956 us 1.6981 us 1.7010 us]
                        change: [-60.543% -60.284% -59.829%] (p = 0.00 < 0.05)
                        Performance has improved.
                        
filter f32              time:   [418.84 us 419.45 us 420.15 us]                       
                        change: [-26.414% -26.286% -26.141%] (p = 0.00 < 0.05)
                        Performance has improved.

filter single record batch                                                                            
                        time:   [183.96 us 188.44 us 193.16 us]
                        change: [-33.234% -31.549% -29.597%] (p = 0.00 < 0.05)
                        Performance has improved.

Note: the filter context benchmarks don't see any change as the filter is computed outside the benchmark body.

What changes are included in this PR?

This adds an UnalignedBitChunkIterator and updates SlicesIterator to use it

Are there any user-facing changes?

No
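
For context, here is a self-contained sketch of the underlying technique (illustrative only, not the PR's actual code): decompose an arbitrary bit range into a masked first word, a run of whole aligned u64 words, and a masked last word, so the hot loop only ever operates on full u64 values. The function name and the little-endian, multiple-of-8-bytes assumptions are mine, chosen to keep the example short.

    fn count_set_bits(bytes: &[u8], offset: usize, len: usize) -> u32 {
        // Simplifying assumptions for the sketch: little-endian target,
        // buffer length a multiple of 8 bytes, range within bounds.
        assert!(bytes.len() % 8 == 0 && offset + len <= bytes.len() * 8);
        if len == 0 {
            return 0;
        }
        // Reinterpret the buffer as u64 words
        let words: Vec<u64> = bytes
            .chunks_exact(8)
            .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
            .collect();
        let (start, end) = (offset, offset + len);
        (start / 64..=(end - 1) / 64)
            .map(|w| {
                let mut word = words[w];
                if w == start / 64 {
                    word &= u64::MAX << (start % 64); // mask off lead padding
                }
                if w == (end - 1) / 64 && end % 64 != 0 {
                    word &= u64::MAX >> (64 - end % 64); // mask off trailing padding
                }
                word.count_ones()
            })
            .sum()
    }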

@github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels on Jan 23, 2022
Inline review comment from @tustvold (Contributor, Author) on this diff hunk:

    if byte_idx == 0 {
        let bit_length = bytes.len() * 8;
This wasn't ever really a bottleneck in the parquet parsing, but still sees a slight aggregate improvement of 2-3%. Updated for curiosity more than necessity.

@tustvold marked this pull request as draft on January 23, 2022 14:43
@codecov-commenter commented Jan 23, 2022

Codecov Report

Merging #1228 (f809d33) into master (aa71aea) will increase coverage by 0.03%.
The diff coverage is 93.91%.


@@            Coverage Diff             @@
##           master    #1228      +/-   ##
==========================================
+ Coverage   82.96%   83.00%   +0.03%     
==========================================
  Files         178      178              
  Lines       51522    51690     +168     
==========================================
+ Hits        42744    42904     +160     
- Misses       8778     8786       +8     
Impacted Files Coverage Δ
arrow/src/util/bit_chunk_iterator.rs 93.47% <92.43%> (-4.33%) ⬇️
arrow/src/compute/kernels/filter.rs 92.13% <96.77%> (+2.57%) ⬆️
arrow/src/buffer/immutable.rs 99.45% <100.00%> (+0.52%) ⬆️
parquet/src/arrow/bit_util.rs 100.00% <100.00%> (ø)
arrow/src/datatypes/datatype.rs 66.38% <0.00%> (-0.43%) ⬇️
arrow/src/array/transform/mod.rs 84.64% <0.00%> (-0.13%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold (Contributor, Author) commented:
To test how much of the performance uplift came from the changes to SlicesIterator and how much from UnalignedBitChunk, I created a branch with the changes to SlicesIterator but using BitChunks instead of UnalignedBitChunks.

#1225

filter u8              
                        time:   [289.51 us 289.72 us 289.93 us]                      

filter u8 high selectivity                                                                             
                        time:   [5.2759 us 5.2786 us 5.2819 us]

filter u8 low selectivity                                                                             
                        time:   [3.8342 us 3.8385 us 3.8453 us]

Improved SlicesIterator only

filter u8              
                        time:   [175.13 us 175.16 us 175.20 us]                      

filter u8 high selectivity                                                                             
                        time:   [2.5869 us 2.5880 us 2.5892 us]

filter u8 low selectivity                                                                             
                        time:   [1.9370 us 1.9388 us 1.9407 us]

This PR

filter u8               
                        time:   [140.52 us 140.55 us 140.58 us]                      

filter u8 high selectivity                                                                             
                        time:   [2.5015 us 2.5023 us 2.5033 us]

filter u8 low selectivity                                                                             
                        time:   [1.7586 us 1.7591 us 1.7597 us]

So UnalignedBitChunk does yield a non-negligible performance benefit

@tustvold (Contributor, Author) commented:
I've rebased this so it is no longer on top of #1225, as that fix is pending some more thought. Without it the delta is smaller, but still significant:

filter u8               time:   [359.76 us 359.84 us 359.93 us]                      
                        change: [-27.199% -27.161% -27.119%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 high selectivity                                                                             
                        time:   [8.8616 us 8.9026 us 8.9439 us]
                        change: [-29.670% -29.463% -29.275%] (p = 0.00 < 0.05)
                        Performance has improved.

filter u8 low selectivity                                                                             
                        time:   [2.4243 us 2.4249 us 2.4254 us]
                        change: [-45.812% -45.627% -45.478%] (p = 0.00 < 0.05)
                        Performance has improved.

Perhaps more importantly, I hope to use this to implement optimized filter kernels.

@tustvold changed the title from "Unaligned bit chunk" to "Faster bitmask iteration" on Jan 29, 2022

let unaligned = UnalignedBitChunk::new(buffer.as_slice(), 5, 27);

assert_eq!(unaligned.prefix(), Some(((1 << 32) - 1) - ((1 << 5) - 1)));
A reviewer (Contributor) commented:

These tests might be easier to understand by asserting against a binary literal.
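For illustration (my rewrite, not from the thread), the assertion above could read:

    // 27 one bits for the logical range, 5 zero bits of lead padding
    assert_eq!(
        unaligned.prefix(),
        Some(0b1111_1111_1111_1111_1111_1111_1110_0000)
    );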

I added a bit of debug output locally and I'm not sure whether this should be the expected output:

eprintln!("{:064b}", unaligned.prefix().unwrap());

// output: 0000000000000000000000000000000011111111111111111111111111100000 

I would have expected the prefix chunk not to have trailing zeroes, and instead to have a length of less than 64 bits. That does not make a difference when counting bits, but I find it a bit confusing.

I don't yet understand how the current behavior interacts in the advance_to_set_bit function. The way I understand it, offset would be 5 for this example, then trailing_zeros would also be 5, and then in SlicesIterator start_chunk + start_bit would be 10. But I'm probably missing something there.

@tustvold (Contributor, Author) replied on Jan 29, 2022:

I've updated these tests to hopefully be clearer, let me know if that helps.

Edit: I'm looking into SlicesIterator now

The reviewer (Contributor) replied:

Tests are much nicer to read now, thanks. My one concern is that the UnalignedBitChunkIterator itself is a bit difficult to use, because it requires special handling for the first element, which has to be shifted by lead_padding. I played around with a similar design, and one idea is to return an iterator of (start_offset, mask, len) tuples.

Example: for a bitmap consisting of 70 one bits starting at offset 60, the iterator would return

(0, 0b1111, 4)
(4, u64::MAX, 64)
(68, 0b11, 2)

Users would then need no special logic for prefix/suffix. When iterating over all set bits you could use trailing_zeros on each chunk and add the start_offset.
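
A sketch of that consumption pattern (hypothetical: the chunks value and its (start_offset, mask, len) tuples are the proposed design above, not an existing API):

    let mut set_positions: Vec<usize> = Vec::new();
    for (start_offset, mask, _len) in chunks {
        let mut bits: u64 = mask;
        while bits != 0 {
            // Position of the lowest set bit within this chunk
            let bit = bits.trailing_zeros() as usize;
            set_positions.push(start_offset + bit);
            bits &= bits - 1; // clear the lowest set bit
        }
    }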

@tustvold (Contributor, Author) replied:

Yeah, I agree that UnalignedBitChunkIterator isn't the easiest construction to use, although this is somewhat by design. It is not intended as a replacement for BitChunkIterator, but rather for cases where you are willing to pay the cost of more complex start and termination logic in exchange for simpler logic within the main loop itself...

Would it allay your concerns if I made it pub(crate) so that we can continue to iterate on it without introducing breaking changes?

The reviewer (Contributor) replied:

Marking it internal for now sounds good, I'm ok with merging it then. The performance improvements are really nice, and we can probably find more use cases for it.

The reviewer (Contributor) added:

Would moving the count_set_bits and iter_set_bits_rev functions to the arrow crate be an option, hiding UnalignedBitChunk as their implementation detail? I think they were added after the last release, so that would not even be a breaking change. On the other hand, iter_set_bits_rev seems very specific to the parquet use case.

@tustvold (Contributor, Author) replied:

They're currently in a crate-private module in parquet somewhat intentionally. I'll have a play with the iterator approach you propose and see if it has an impact on perf.

@tustvold (Contributor, Author) added:

So I ran into challenges making a reversible UnalignedBitChunkIterator, which is necessary for iter_set_bits_rev. That actually led me to a simpler solution: implement iter_set_bits_rev in terms of UnalignedBitChunk and make the iterator crate-private. Let me know what you think of this.
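
For readers, a minimal sketch of the reverse-iteration primitive involved (my own illustration of walking one u64 chunk's set bits from highest to lowest; not the PR's iter_set_bits_rev):

    fn rev_set_bits(mut chunk: u64) -> impl Iterator<Item = usize> {
        std::iter::from_fn(move || {
            if chunk == 0 {
                return None;
            }
            let bit = 63 - chunk.leading_zeros() as usize; // highest set bit
            chunk ^= 1 << bit; // clear it for the next iteration
            Some(bit)
        })
    }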

The reviewer (Contributor) replied:

Sounds good. For the SlicesIterator, the current definition of prefix/lead_padding seems to work better than my idea above. If we want to change the behavior at some later point we could rename those methods in a major release. A behavior change would have been bad with a public iter method.

@tustvold (Contributor, Author) replied:

Done

/// Yields an iterator of aligned u64, along with the leading and trailing
/// u64 necessary to align the buffer to an 8-byte boundary
///
/// This is unlike [`BitChunkIterator`] which only exposes a trailing u64,
A reviewer (Contributor) commented:

👍 good rationale

@tustvold (Contributor, Author) commented Jan 29, 2022:

I think SlicesIterator has a bug 😞 - fixing... It's applying the offset in the wrong direction 😅

@alamb (Contributor) left a comment:

The original implementation looks like it came in with apache/arrow#8960 cc @jorgecarleitao and @nevi-me

I made it through about half of this PR and now need to go do other things; I will keep reviewing it tomorrow.

Resolved review threads: arrow/src/compute/kernels/filter.rs (3), parquet/src/arrow/bit_util.rs (1)
@alamb (Contributor) left a comment:

I went through this code again and wrote some additional tests to convince myself it was correct. Nice work @tustvold

I will wait until @jhorstmann is satisfied with this PR before merging.

Resolved review thread: arrow/src/compute/kernels/filter.rs
tustvold and others added 3 commits February 1, 2022 14:47
@alamb (Contributor) left a comment:

I think @tustvold has made all the changes requested. @jhorstmann any last thoughts?

@jhorstmann (Contributor) commented:

👍 from my side

@alamb merged commit f055fb0 into apache:master on Feb 2, 2022
@alamb (Contributor) commented Feb 2, 2022:

Thanks @tustvold and @jhorstmann

Labels: arrow (Changes to the arrow crate), parquet (Changes to the parquet crate)
Successfully merging this pull request may close these issues.

UnalignedBitChunkIterator that iterates through already aligned u64 blocks (#1227)
4 participants