-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Speed up bitmap iteration #125
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, it looks better, in terms of performance and in code clarity.
Thank you!
bors merge
125: Speed up bitmap iteration r=Kerollmops a=saik0 A low hanging fruit I found: Bitmap iteration currently tests every bit for each non-zero element This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures. ``` iter bitmap 1..10_000 time: [48.602 us 48.787 us 48.978 us] change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe iter bitmap sparse time: [4.8238 us 4.8478 us 4.8737 us] change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild iter bitmap dense time: [160.14 us 160.73 us 161.29 us] change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild iter bitmap minimal time: [18.447 us 18.512 us 18.579 us] change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) high mild 5 (5.00%) high severe iter bitmap full time: [312.97 us 314.10 us 315.35 us] change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe Benchmarking iter parsed: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50. iter parsed time: [1.5311 ms 1.5371 ms 1.5436 ms] change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe ``` Co-authored-by: saik0 <github@saik0.net>
Build failed: |
Hey @saik0, Could you please do a little |
Thank you! |
125: Speed up bitmap iteration r=Kerollmops a=saik0 A low hanging fruit I found: Bitmap iteration currently tests every bit for each non-zero element This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures. ``` iter bitmap 1..10_000 time: [48.602 us 48.787 us 48.978 us] change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe iter bitmap sparse time: [4.8238 us 4.8478 us 4.8737 us] change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild iter bitmap dense time: [160.14 us 160.73 us 161.29 us] change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild iter bitmap minimal time: [18.447 us 18.512 us 18.579 us] change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) high mild 5 (5.00%) high severe iter bitmap full time: [312.97 us 314.10 us 315.35 us] change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe Benchmarking iter parsed: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50. iter parsed time: [1.5311 ms 1.5371 ms 1.5436 ms] change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe ``` Co-authored-by: saik0 <github@saik0.net>
Build failed: |
Hum... It seems like you should use the latest version of |
Ah, benches is a different crate. I had to run clippy and format on it as well. |
Thank you :) |
125: Speed up bitmap iteration r=Kerollmops a=saik0 A low hanging fruit I found: Bitmap iteration currently tests every bit for each non-zero element This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures. ``` iter bitmap 1..10_000 time: [48.602 us 48.787 us 48.978 us] change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe iter bitmap sparse time: [4.8238 us 4.8478 us 4.8737 us] change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild iter bitmap dense time: [160.14 us 160.73 us 161.29 us] change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild iter bitmap minimal time: [18.447 us 18.512 us 18.579 us] change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) high mild 5 (5.00%) high severe iter bitmap full time: [312.97 us 314.10 us 315.35 us] change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe Benchmarking iter parsed: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50. iter parsed time: [1.5311 ms 1.5371 ms 1.5436 ms] change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe ``` Co-authored-by: saik0 <github@saik0.net>
Build failed: |
tests/iter.rs
Outdated
fn qc_iter(values: Vec<u32>) { | ||
let expected = { | ||
let mut vec = values.clone(); | ||
vec.sort(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vec.sort(); | |
vec.sort_unstable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clippy is not happy about using sort
instead of sort_unstable
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. BTreeSet impls Arbitrary
and is sorted and unique. Iterator
has an eq
method that does an element-wise comparison.
Thank you for the changes, it indeed makes the code more clear! |
Build succeeded: |
127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0 ### Purpose This PR adds some optimizations from CRoaring as outlined in arXiv:1709.07821 section 3 ### Overview * All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring) * Section 3.1 was already implemented, except for `BitmapIter`. This is covered in #125 * Implement Array-Bitset aggregates as outlined in section 3.2 * Also branchless 😎 * Tracks bitmap cardinality while performing bitmap-bitmap ops * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready * Curious to hear what you think about this `@lemire` * In order to track bitmap cardinality the len field had to moved into `Store::Bitmap` * This is unfortunately a cross cutting change * `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing `ops` * `Bitmap8K` keeps track of it's cardinality * `SortedU16Vec` maintains its sorting * `Store` now only delegates to these containers * My hope is that this will be useful when implementing run containers. 🤞 * Unfortunately so much code was moved this PR is _HUGE_ ### Out of scope * Inline ASM for Array-Bitset aggregates * Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned ### Notes * I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016 Co-authored-by: saik0 <github@saik0.net> Co-authored-by: Joel Pedraza <github@saik0.net>
125: Speed up bitmap iteration r=Kerollmops a=saik0 A low hanging fruit I found: Bitmap iteration currently tests every bit for each non-zero element This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures. ``` iter bitmap 1..10_000 time: [48.602 us 48.787 us 48.978 us] change: [-44.893% -44.519% -44.164%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe iter bitmap sparse time: [4.8238 us 4.8478 us 4.8737 us] change: [-3.6741% -3.1564% -2.6669%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild iter bitmap dense time: [160.14 us 160.73 us 161.29 us] change: [-27.912% -27.207% -26.571%] (p = 0.00 < 0.05) Performance has improved. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high mild iter bitmap minimal time: [18.447 us 18.512 us 18.579 us] change: [-2.8722% -2.1247% -1.3247%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 1 (1.00%) high mild 5 (5.00%) high severe iter bitmap full time: [312.97 us 314.10 us 315.35 us] change: [-18.450% -17.937% -17.471%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe Benchmarking iter parsed: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.8s, enable flat sampling, or reduce sample count to 50. iter parsed time: [1.5311 ms 1.5371 ms 1.5436 ms] change: [-22.496% -21.943% -21.450%] (p = 0.00 < 0.05) Performance has improved. Found 5 outliers among 100 measurements (5.00%) 4 (4.00%) high mild 1 (1.00%) high severe ``` Co-authored-by: saik0 <github@saik0.net>
127: Add scalar optimizations from CRoaring / arXiv:1709.07821 section 3 r=Kerollmops a=saik0 ### Purpose This PR adds some optimizations from CRoaring as outlined in arXiv:1709.07821 section 3 ### Overview * All inserts and removes are now branchless (!in arXiv:1709.0782, in CRoaring) * Section 3.1 was already implemented, except for `BitmapIter`. This is covered in RoaringBitmap#125 * Implement Array-Bitset aggregates as outlined in section 3.2 * Also branchless 😎 * Tracks bitmap cardinality while performing bitmap-bitmap ops * This is a deviation from CRoaring, and will need to be benchmarked further before this Draft PR is ready * Curious to hear what you think about this `@lemire` * In order to track bitmap cardinality the len field had to moved into `Store::Bitmap` * This is unfortunately a cross cutting change * `Store` was quite large (LoC) and had many responsibilities. The largest change in this draft is decomposing `Store` such hat it's field variants are two new container types: each responsible for maintaining their invariants and implementing `ops` * `Bitmap8K` keeps track of it's cardinality * `SortedU16Vec` maintains its sorting * `Store` now only delegates to these containers * My hope is that this will be useful when implementing run containers. 🤞 * Unfortunately so much code was moved this PR is _HUGE_ ### Out of scope * Inline ASM for Array-Bitset aggregates * Section 4 (explicit SIMD). As noted by the paper authors: The compiler does a decent job of autovectorization, though not as good as hand-tuned ### Notes * I attempted to emulate the inline ASM Array-Bitset aggregates by using a mix of unsafe ptr arithmetic and x86-64 intrinsics, hoping to compile to the same instructions. I was unable to get it under 13 instructions per iteration (compared to the papers 5). While it was an improvement, I abandoned the effort in favor of waiting for the `asm!` macro to stabilize. rust-lang/rust#72016 Co-authored-by: saik0 <github@saik0.net> Co-authored-by: Joel Pedraza <github@saik0.net>
A low hanging fruit I found:
Bitmap iteration currently tests every bit for each non-zero element
This PR skips runs of zeros by inspecting the least significant bit, which can be computed with one instruction on most architectures.