
Implement LinearGroupBy with almost only safe Rust #19

Merged
merged 2 commits into from
Feb 13, 2021

Conversation

Kerollmops
Owner

@Kerollmops Kerollmops commented Dec 24, 2020

Hey @Ten0,

Related to #18.
Here are the first benchmark tests I made on my computer by running:

cargo +nightly bench --features nightly -- linear_group::bench > 7b2fb19.bench
cargo benchcmp 7b2fb19.bench 0de2daa.bench
 name                                              7b2fb19.bench ns/iter  0de2daa.bench ns/iter  diff ns/iter  diff %  speedup
 linear_group::bench::rev_vector_16_000            37,808                 37,753                          -55  -0.15%   x 1.00
 linear_group::bench::rev_vector_16_000_one_group  23,993                 27,417                        3,424  14.27%   x 0.88
 linear_group::bench::vector_16_000                38,399                 51,859                       13,460  35.05%   x 0.74
 linear_group::bench::vector_16_000_one_group      18,412                 27,202                        8,790  47.74%   x 0.68
 linear_group::bench::vector_16_000_sorted         27,870                 27,803                          -67  -0.24%   x 1.00
 linear_group::bench::vector_little_sorted         65                     84                               19  29.23%   x 0.77
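For context, the safe implementation being benchmarked can be sketched roughly like this. This is an illustrative simplification (the names follow the crate, but the body is not the exact committed code): scan adjacent pairs with `windows(2)` until the predicate breaks, then `split_at` the boundary.

```rust
// Simplified sketch of a LinearGroupBy built on safe `split_at`,
// mirroring the approach in this PR (not the exact committed code).
struct LinearGroupBy<'a, T, P> {
    slice: &'a [T],
    predicate: P,
}

impl<'a, T, P> Iterator for LinearGroupBy<'a, T, P>
where
    P: FnMut(&T, &T) -> bool,
{
    type Item = &'a [T];

    fn next(&mut self) -> Option<&'a [T]> {
        if self.slice.is_empty() {
            return None;
        }
        // Count how many adjacent pairs satisfy the predicate.
        let mut len = 1;
        for pair in self.slice.windows(2) {
            if (self.predicate)(&pair[0], &pair[1]) {
                len += 1;
            } else {
                break;
            }
        }
        // Safe split: this is the bounds-checked call discussed below.
        let (head, tail) = self.slice.split_at(len);
        self.slice = tail;
        Some(head)
    }
}

fn main() {
    let data = [1, 1, 2, 2, 2, 3];
    let groups: Vec<&[i32]> = LinearGroupBy {
        slice: &data,
        predicate: |a: &i32, b: &i32| a == b,
    }
    .collect();
    assert_eq!(groups, vec![&[1, 1][..], &[2, 2, 2], &[3]]);
}
```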

@Ten0

Ten0 commented Jan 1, 2021

I got intrigued by the fact that this seems to show slightly lower performance than the unsafe version in cases where there was a single group, so I ran some compilation tests:

Contrary to what I expected from the benchmark results, in the safe version the compiler managed to optimize the loop inside next even better, reading only one item per iteration. However, that optimization did not happen when the array was hardcoded in the main function.
So overall I believe this is just a minor difference in what the optimizer precisely does, which may come and go depending on how the iterator is eventually used; the safe version does indeed not introduce any theoretical inability for the compiler to get rid of the bounds check within the loop in the next function.

However, as it turned out, the call to next itself introduced additional unnecessary bounds checking: the loop over the windows was written as something like for (i = 0; i != n; ++i) (line 25 of the assembly) instead of for (i = 0; i < n; ++i) (as per https://doc.rust-lang.org/1.48.0/src/core/slice/iter/macros.rs.html#8). That made the compiler unable to prove that the scenario where i > n was impossible when calling split_at, so there is still an overflow check (line 7 of the assembly) in the loop (which would hypothetically lead to a panic).
But since that did not show in your benchmark, I imagine there are also scenarios where the loop is compiled in such a way that this does not happen (similarly to the other loop, where the generated code for the safe version turned out to be smarter than the code for the unsafe version).
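The two loop shapes under discussion can be illustrated with a sketch (source-level, not the actual generated assembly). Both functions are semantically identical, but the `!=` form gives the optimizer a weaker invariant, since it cannot locally rule out `i` running past `n`:

```rust
// Sketch of the two loop shapes discussed above. They compute the same
// result, but the `i != n` condition only proves `i != n` inside the
// body, while `i < n` proves `i <= n` afterwards, which is what a later
// `split_at(i)` needs in order to drop its bounds check.
fn count_prefix_ne(slice: &[i32]) -> usize {
    let n = slice.len();
    let mut i = 0;
    while i != n && slice[i] == slice[0] {
        i += 1;
    }
    i
}

fn count_prefix_lt(slice: &[i32]) -> usize {
    let n = slice.len();
    let mut i = 0;
    while i < n && slice[i] == slice[0] {
        i += 1;
    }
    i
}

fn main() {
    let v = [5, 5, 5, 1, 2];
    assert_eq!(count_prefix_ne(&v), 3);
    assert_eq!(count_prefix_lt(&v), 3);
}
```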

Out of curiosity, I also wrote a slight change that encourages it not to use the len variable you declared to index the array, and instead to increment a pointer at each iteration; this is the result:
https://godbolt.org/z/9GPbfz
It does not seem to do better in either scenario with regard to the number of instructions, but I haven't run it through your benchmark.

As a side note, looking at the assembly made me realize that slices are actually not stored as (start, end) (as I mistakenly asserted here), but as (start, len) (references: 1, 2). I'm not sure why that choice was made, though I suspect it is because indexing can then use instructions such as cmp edx, dword ptr [rdi + 4*rcx], which can be very fast because structure sizes are usually aligned, making these just bit shifts, whereas a (start, end) representation would require additional instructions to compute the length. In any case, if anyone has a reference beyond speculation on that, I'm interested. :)
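The `(start, len)` layout is observable from safe code: a slice reference is two machine words, and `as_ptr`/`len` expose the two components. A small demonstration (this shows the size and the logical components, not a guarantee about the exact field order):

```rust
use std::mem::size_of;

fn main() {
    // A slice reference is a fat pointer: one data pointer plus a length.
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());

    let v = [10u32, 20, 30, 40];
    let s: &[u32] = &v[1..3];
    // The two logical components the (start, len) layout stores:
    assert_eq!(s.len(), 2);
    assert_eq!(s.as_ptr(), unsafe { v.as_ptr().add(1) });
    // A (start, end) representation would instead have to derive the
    // length by subtracting pointers on every `len()` call.
}
```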

@Ten0

Ten0 commented Jan 1, 2021

I'm not sure why that was the choice made, though I suspect that might be because........

Or maybe that wasn't ever discussed. ^^ graydon/rust-prehistory@958d12e#diff-e0b3bbcc496ac9c4177ca1bb4e2a2f365e1e58261d853b11e07ecbd5c830fb64R306-R307

@Kerollmops
Owner Author

Wow! What an in-depth code review :)

It seems to not do better in either of the scenarios with regards to the number of instructions, but I haven't benchmarked it using your benchmark.

You can easily try your changes by piping benchmark results to a file and then comparing them with the benchcmp crate (cargo install cargo-benchcmp).

cargo +nightly bench --features nightly -- linear_group::bench > base.bench
# make your changes...
cargo +nightly bench --features nightly -- linear_group::bench > update.bench

cargo benchcmp base.bench update.bench

I'm not sure why that was the choice made, though I suspect that might be because...

This could also be because it is easier to create an invalid slice by specifying an end pointer lower than the start pointer, whereas a length value cannot be invalid, as it lives between 0 and usize::MAX.

so I went to run some compilation tests.

So, regarding your analysis, the speed doesn't seem related to the fact that we create a Window every time we execute GroupBy::next, does it? So writing a dedicated safe-Rust window-of-2 implementation wouldn't make it faster?

@Ten0

Ten0 commented Jan 2, 2021

So, regarding your analysis, the speed doesn't seem related to the fact that we create a Window every time we execute GroupBy::next, does it? So writing a dedicated safe-Rust window-of-2 implementation wouldn't make it faster?

Yes indeed, that is completely optimized out by the compiler because the 2 is a constant.
The only overhead my compilation tests seem to show is on the call to split_at/split_at_mut, where an unchecked version may increase performance.
Though I'm not sure why that did not show in your tests (it should impact inputs with lots of groups, while your benchmarks seemed to show a regression mostly where there was a single group).
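For reference, a dedicated safe "window of 2" needs no iterator at all; slice patterns express it directly, and since the window size is the constant 2, the compiler sees essentially the same thing through `windows(2)`. A sketch (the function name is hypothetical, for illustration only):

```rust
// A safe "window of 2" via slice patterns: no Windows iterator needed.
// Because the window size is the constant 2, `windows(2)` compiles to
// the same direct pair access, matching the observation above.
fn first_group_len(slice: &[i32]) -> usize {
    let mut len = 1.min(slice.len());
    // Look at the pair (slice[len - 1], slice[len]) while both exist.
    while let [a, b, ..] = &slice[len.saturating_sub(1)..] {
        if a == b {
            len += 1;
        } else {
            break;
        }
    }
    len
}

fn main() {
    assert_eq!(first_group_len(&[7, 7, 7, 2, 2]), 3);
    assert_eq!(first_group_len(&[1, 2]), 1);
    assert_eq!(first_group_len(&[]), 0);
}
```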

@Kerollmops
Owner Author

Kerollmops commented Jan 3, 2021

Ok, so I copy/pasted the unexposed split_at/_mut/_unchecked slice methods to try out your solution. I also changed the computer I use (from a MacBook to an iMac), so here are the previous benchmarks too; you can see the change in the last commit pushed. Here is a godbolt split view of the generated assembly for the current release and the new unsafe unchecked split_at version you proposed.
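The unchecked variant being copied here can be sketched as follows. This is an assumption-level reconstruction of what such a helper looks like, built from `from_raw_parts` (a `split_at_unchecked` method was later stabilized on slices in the standard library):

```rust
/// Sketch of an unchecked split: same contract as `split_at`, but the
/// caller guarantees `mid <= slice.len()`, so no bounds check (and no
/// panic path) is emitted in the generated code.
///
/// Safety: `mid` must be less than or equal to `slice.len()`.
unsafe fn split_at_unchecked<T>(slice: &[T], mid: usize) -> (&[T], &[T]) {
    debug_assert!(mid <= slice.len());
    (
        std::slice::from_raw_parts(slice.as_ptr(), mid),
        std::slice::from_raw_parts(slice.as_ptr().add(mid), slice.len() - mid),
    )
}

fn main() {
    let v = [1, 2, 3, 4, 5];
    // In the iterator, `mid` comes from scanning the slice itself,
    // so it is always in bounds by construction.
    let (head, tail) = unsafe { split_at_unchecked(&v, 2) };
    assert_eq!(head, &[1, 2][..]);
    assert_eq!(tail, &[3, 4, 5][..]);
}
```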

Here you can find the test that doesn't give great results with the new version you proposed. The benchmarks are bad when there are many groups of variable (but short) length, but only for the rev version, not the classic one.

What's also quite strange is the vector_16_000_one_group benchmark: it is 1.5x faster with the fully safe version (safe split_at/_mut), but switching to the unsafe split_at/_mut/_unchecked version brings it back to the same speed as the unsafe version (the released version). I really don't see any significant diff between those (apart from the panic jumps); maybe it is related to some inlining in the test function?

I may have spotted a jump that is executed many times when the predicate returns false, at line 18 of the safe version (top right). This could explain the slowness for short groups, where this instruction is always executed, maybe dirtying the instruction cache (do jumps do that?).

From the fully unsafe version (the current released one) to the fully safe version:

 name                                              7b2fb19.bench ns/iter  0de2daa.bench ns/iter  diff ns/iter   diff %  speedup
 linear_group::bench::rev_vector_16_000            22,167                 31,765                        9,598   43.30%   x 0.70
 linear_group::bench::rev_vector_16_000_one_group  18,205                 15,822                       -2,383  -13.09%   x 1.15
 linear_group::bench::vector_16_000                29,064                 29,735                          671    2.31%   x 0.98
 linear_group::bench::vector_16_000_one_group      20,573                 13,742                       -6,831  -33.20%   x 1.50
 linear_group::bench::vector_16_000_sorted         21,096                 17,614                       -3,482  -16.51%   x 1.20
 linear_group::bench::vector_little_sorted         48                     49                                1    2.08%   x 0.98

From the new fully safe (simple split_at/_mut) version to the unsafe split_at/_mut/_unchecked version:

 name                                              0de2daa.bench ns/iter  37b959e.bench ns/iter  diff ns/iter   diff %  speedup
 linear_group::bench::rev_vector_16_000            31,765                 28,874                       -2,891   -9.10%   x 1.10
 linear_group::bench::rev_vector_16_000_one_group  15,822                 16,980                        1,158    7.32%   x 0.93
 linear_group::bench::vector_16_000                29,735                 25,590                       -4,145  -13.94%   x 1.16
 linear_group::bench::vector_16_000_one_group      13,742                 20,488                        6,746   49.09%   x 0.67
 linear_group::bench::vector_16_000_sorted         17,614                 20,930                        3,316   18.83%   x 0.84
 linear_group::bench::vector_little_sorted         49                     43                               -6  -12.24%   x 1.14

But more importantly: From the fully unsafe version (the current released one) to the unsafe split_at/_mut/_unchecked version 🎉 :

 name                                              7b2fb19.bench ns/iter  37b959e.bench ns/iter  diff ns/iter   diff %  speedup
 linear_group::bench::rev_vector_16_000            22,167                 28,874                        6,707   30.26%   x 0.77
 linear_group::bench::rev_vector_16_000_one_group  18,205                 16,980                       -1,225   -6.73%   x 1.07
 linear_group::bench::vector_16_000                29,064                 25,590                       -3,474  -11.95%   x 1.14
 linear_group::bench::vector_16_000_one_group      20,573                 20,488                          -85   -0.41%   x 1.00
 linear_group::bench::vector_16_000_sorted         21,096                 20,930                         -166   -0.79%   x 1.01
 linear_group::bench::vector_little_sorted         48                     43                               -5  -10.42%   x 1.12

@Ten0 Ten0 left a comment


Looks good. We should probably rename the PR to "Implement LinearGroupBy with almost only safe Rust" before merging it though :)

@Kerollmops Kerollmops changed the title Implement LinearGroupBy with only safe Rust Implement LinearGroupBy with almost only safe Rust Feb 13, 2021
@Kerollmops Kerollmops merged commit c12a394 into master Feb 13, 2021
@Kerollmops Kerollmops deleted the only-safe-rust branch February 13, 2021 16:49