Faster bitmask iteration #1228
Conversation
    }

    if byte_idx == 0 {
        let bit_length = bytes.len() * 8;
This wasn't ever really a bottleneck in the parquet parsing, but still sees a slight aggregate improvement of 2-3%. Updated for curiosity more than necessity.
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #1228      +/-   ##
    ==========================================
    + Coverage   82.96%   83.00%   +0.03%
    ==========================================
      Files         178      178
      Lines       51522    51690     +168
    ==========================================
    + Hits        42744    42904     +160
    - Misses       8778     8786       +8

Continue to review full report at Codecov.
In order to test how much of the performance uplift came from the changes to SlicesIterator and how much from UnalignedBitChunk, I created a branch with the changes to SlicesIterator but using BitChunks instead of UnalignedBitChunks.

Improved SlicesIterator only

This PR

So UnalignedBitChunk does yield a non-negligible performance benefit.
Force-pushed from 7959411 to d00cfcf (Compare)
I've rebased this so it is no longer on top of #1225, as that fix is pending some more thought; without it this change yields a smaller delta, but still not an insignificant one.

Perhaps more importantly, I hope to use this to implement optimized filter kernels.
Force-pushed from d00cfcf to df8f97c (Compare)
arrow/src/util/bit_chunk_iterator.rs (outdated)

    let unaligned = UnalignedBitChunk::new(buffer.as_slice(), 5, 27);

    assert_eq!(unaligned.prefix(), Some(((1 << 32) - 1) - ((1 << 5) - 1)));
These tests might be easier to understand by asserting against a binary literal.

I added a bit of debug output locally and I'm not sure whether this should be the expected output:

    eprintln!("{:064b}", unaligned.prefix().unwrap());
    // output: 0000000000000000000000000000000011111111111111111111111111100000

I would have expected the prefix chunk not to have trailing zeroes, and instead to have a length of less than 64 bits. That does not make a difference when counting bits, but I find it a bit confusing.

I don't yet understand how the current behavior interacts with the `advance_to_set_bit` function. The way I understand it, `offset` would be 5 for this example, then `trailing_zeros` would also be 5, and then in `SlicesIterator` `start_chunk + start_bit` would be 10. But I'm probably missing something there.
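A minimal sketch of how the zero-padded prefix shown above can still give the right logical position: because the low `lead_padding` bits of the word are forced to zero, `trailing_zeros()` lands on or past the padding boundary, and subtracting the padding yields the position relative to the requested offset. The helper name and signature below are made up for illustration; this is not the code under review.

```rust
/// Hypothetical helper: position of the first set bit in a prefix word whose
/// low `lead_padding` bits are guaranteed to be zero, relative to the
/// requested bit offset (logical position 0 is the first bit of the slice).
fn first_set_bit_in_prefix(prefix: u64, lead_padding: u32) -> Option<u32> {
    if prefix == 0 {
        return None;
    }
    // For the debug output above this is 5 == lead_padding, so the logical
    // position is 0: the very first bit of the requested range is set.
    Some(prefix.trailing_zeros() - lead_padding)
}

fn main() {
    // The prefix from the test: 27 set bits starting at bit 5 of the word.
    let prefix: u64 = ((1u64 << 32) - 1) - ((1u64 << 5) - 1);
    assert_eq!(first_set_bit_in_prefix(prefix, 5), Some(0));
}
```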
I've updated these tests to hopefully be clearer; let me know if that helps.
Edit: I'm looking into SlicesIterator now
Tests are much nicer to read now, thanks. My one concern is that the actual `UnalignedBitChunkIterator` is a bit difficult to use, because it requires special handling for the first element, which has to be shifted by `lead_padding`. I played around with a similar design, and one idea is to return an iterator of `(start_offset, mask, len)` tuples.

Example: for a bitmap consisting of 130 one bits, starting at offset 60, the iterator would return

    (0, 0b1111, 4)
    (4, u64::MAX, 64)
    (68, 0b11, 2)

Users would then need no special logic for prefix/suffix. When iterating over all set bits you could use `trailing_zeros` on each chunk and add the `start_offset`.
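For what it's worth, here is a short sketch of how a consumer could walk all set bits given such `(start_offset, mask, len)` tuples, using `trailing_zeros` plus the `start_offset` as described. The `iter_set_bits` helper and the chunk values in `main` are hypothetical, not an existing arrow API.

```rust
// Hypothetical consumer of the proposed (start_offset, mask, len) tuples:
// for each chunk, peel set bits off the bottom with trailing_zeros and
// offset them by start_offset.
fn iter_set_bits(
    chunks: impl Iterator<Item = (usize, u64, usize)>,
) -> impl Iterator<Item = usize> {
    chunks.flat_map(|(start_offset, mask, _len)| {
        std::iter::successors((mask != 0).then(|| mask), |m| {
            let next = m & (m - 1); // clear the lowest set bit
            (next != 0).then(|| next)
        })
        .map(move |m| start_offset + m.trailing_zeros() as usize)
    })
}

fn main() {
    // Two made-up chunks: bits {1, 3} in the first, bit 0 in the second.
    let chunks = vec![(0usize, 0b1010u64, 4usize), (4, 0b1, 64)];
    let set: Vec<usize> = iter_set_bits(chunks.into_iter()).collect();
    assert_eq!(set, vec![1, 3, 4]);
}
```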
Yeah, I agree that `UnalignedBitChunkIterator` isn't the easiest construction to use, although that is somewhat by design. It is not intended as a replacement for `BitChunkIterator`, but rather as something for cases where you are willing to pay the cost of more complex start and termination logic in exchange for simpler logic within the main loop itself.

Would it allay your concerns if I made it pub(crate), so that we can continue to iterate on it without introducing breaking changes?
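As an illustration of that trade-off, here is a minimal sketch of the pattern using a stand-in struct (not the actual UnalignedBitChunk API): all of the masking complexity lives in how the boundary words are built, so the hot loop over the aligned words needs no shifting at all.

```rust
// Stand-in for the idea, not arrow's UnalignedBitChunk: a partial leading word,
// a run of fully aligned words, and a partial trailing word.
struct Unaligned<'a> {
    prefix: Option<u64>, // unused low bits already zeroed
    chunks: &'a [u64],   // fully aligned 64-bit words
    suffix: Option<u64>, // unused high bits already zeroed
}

fn count_set_bits(u: &Unaligned<'_>) -> usize {
    // The boundary words carry all the alignment complexity...
    let boundary = u.prefix.map_or(0, u64::count_ones) + u.suffix.map_or(0, u64::count_ones);
    // ...so the main loop is a straight popcount over aligned words.
    let aligned: u32 = u.chunks.iter().map(|c| c.count_ones()).sum();
    (boundary + aligned) as usize
}

fn main() {
    let u = Unaligned {
        prefix: Some(0b1110_0000), // 3 set bits
        chunks: &[u64::MAX],       // 64 set bits
        suffix: Some(0b1),         // 1 set bit
    };
    assert_eq!(count_set_bits(&u), 68);
}
```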
Marking it internal for now sounds good; I'm OK with merging it then. The performance improvements are really nice, and we can probably find more use cases for it.
Would moving the `count_set_bits` and `iter_set_bits_rev` functions to the arrow crate be an option, and then hiding `UnalignedBitChunk` as their implementation detail? I think they were added after the last release, so that would not even be a breaking change. On the other hand, `iter_set_bits_rev` seems very specific to the parquet use case.
They're currently in a crate-private module in parquet, somewhat intentionally. I'll have a play with the iterator approach you propose and see whether it has an impact on performance.
So I ran into challenges making a reversible `UnalignedBitChunkIterator`, which is necessary for `iter_set_bits_rev`. That actually led me to a simpler solution: just implement `iter_set_bits_rev` in terms of `UnalignedBitChunk` and make the iterator crate-private. Let me know what you think of this.
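A loose sketch of the reverse-direction idea under discussion; names and signatures here are illustrative only, not the crate's actual `iter_set_bits_rev`. Instead of making the chunk iterator double-ended, walk already-materialised words from the back and peel set bits off the top with `leading_zeros`.

```rust
// Illustrative only: reverse iteration over the set bits of a word slice.
fn iter_set_bits_rev(words: &[u64]) -> impl Iterator<Item = usize> + '_ {
    words.iter().enumerate().rev().flat_map(|(i, &word)| {
        std::iter::successors((word != 0).then(|| word), |w| {
            // Clear the highest set bit each step.
            let top = 63 - w.leading_zeros();
            let next = w & !(1u64 << top);
            (next != 0).then(|| next)
        })
        .map(move |w| i * 64 + (63 - w.leading_zeros()) as usize)
    })
}

fn main() {
    // Bits 1 and 3 set in word 0, bit 0 set in word 1 => reversed: 64, 3, 1.
    let words = [0b1010u64, 0b1u64];
    assert_eq!(iter_set_bits_rev(&words).collect::<Vec<_>>(), vec![64, 3, 1]);
}
```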
Sounds good. For the `SlicesIterator` the current definition of `prefix`/`lead_padding` seems to work better than my idea above. If we want to change the behavior at some later point we could rename those methods in a major release; a behavior change would have been bad with a public `iter` method.
Done
    /// Yields an iterator of aligned u64, along with the leading and trailing
    /// u64 necessary to align the buffer to an 8-byte boundary
    ///
    /// This is unlike [`BitChunkIterator`] which only exposes a trailing u64,
👍 good rationale
I think SlicesIterator has a bug 😞 - fixing... It's applying the offset in the wrong direction 😅
The original implementation looks like it came in with apache/arrow#8960 cc @jorgecarleitao and @nevi-me
I made it through about half of this PR and now need to go do other things; I will keep reviewing it tomorrow.
I went through this code again and wrote some additional tests to convince myself it was correct. Nice work @tustvold
I will wait until @jhorstmann is satisfied with this PR before merging.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
I think @tustvold has made all the changes requested. @jhorstmann any last thoughts?
👍 from my side

Thanks @tustvold and @jhorstmann
Which issue does this PR close?
Closes #1227.
Rationale for this change
This improves the filter benchmarks by a factor of 2x and will likely have similar benefits elsewhere.
Note: the filter context benchmarks don't see any change as the filter is computed outside the benchmark body.
What changes are included in this PR?
This adds an UnalignedBitChunkIterator and updates SlicesIterator to use it.
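Conceptually, SlicesIterator turns runs of set bits in the filter mask into contiguous (start, end) ranges, so values can be copied in slices rather than bit by bit. A toy, bool-based sketch of that idea (not the actual arrow implementation, which works on the packed bitmask):

```rust
// Toy illustration of the slices idea: contiguous runs of selected rows as
// half-open (start, end) ranges.
fn slices(bits: &[bool]) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut run_start = None;
    for (i, &set) in bits.iter().enumerate() {
        match (set, run_start) {
            (true, None) => run_start = Some(i), // a run of selected rows begins
            (false, Some(start)) => {
                out.push((start, i)); // the run ends just before i
                run_start = None;
            }
            _ => {}
        }
    }
    if let Some(start) = run_start {
        out.push((start, bits.len())); // final run extends to the end
    }
    out
}

fn main() {
    assert_eq!(slices(&[true, true, false, true]), vec![(0, 2), (3, 4)]);
}
```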
Are there any user-facing changes?
No