
Implement LinearGroupBy with almost only safe Rust #19

Merged
merged 2 commits into from
Feb 13, 2021

Conversation

Kerollmops
Owner

@Kerollmops Kerollmops commented Dec 24, 2020

Hey @Ten0,

Related to #18.
Here are the first benchmark tests I made on my computer by running:

cargo +nightly bench --features nightly -- linear_group::bench > 7b2fb19.bench
cargo benchcmp 7b2fb19.bench 0de2daa.bench
 name                                              7b2fb19.bench ns/iter  0de2daa.bench ns/iter  diff ns/iter  diff %  speedup
 linear_group::bench::rev_vector_16_000            37,808                 37,753                          -55  -0.15%   x 1.00
 linear_group::bench::rev_vector_16_000_one_group  23,993                 27,417                        3,424  14.27%   x 0.88
 linear_group::bench::vector_16_000                38,399                 51,859                       13,460  35.05%   x 0.74
 linear_group::bench::vector_16_000_one_group      18,412                 27,202                        8,790  47.74%   x 0.68
 linear_group::bench::vector_16_000_sorted         27,870                 27,803                          -67  -0.24%   x 1.00
 linear_group::bench::vector_little_sorted         65                     84                               19  29.23%   x 0.77
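For context, the safe implementation being benchmarked can be sketched roughly like this. This is an illustrative simplification (the names follow the crate, but the body is not the exact committed code): scan adjacent pairs with `windows(2)` until the predicate breaks, then `split_at` the boundary.

```rust
// Simplified sketch of a LinearGroupBy built on safe `split_at`,
// mirroring the approach in this PR (not the exact committed code).
struct LinearGroupBy<'a, T, P> {
    slice: &'a [T],
    predicate: P,
}

impl<'a, T, P> Iterator for LinearGroupBy<'a, T, P>
where
    P: FnMut(&T, &T) -> bool,
{
    type Item = &'a [T];

    fn next(&mut self) -> Option<&'a [T]> {
        if self.slice.is_empty() {
            return None;
        }
        // Count how many adjacent pairs satisfy the predicate.
        let mut len = 1;
        for pair in self.slice.windows(2) {
            if (self.predicate)(&pair[0], &pair[1]) {
                len += 1;
            } else {
                break;
            }
        }
        // Safe split: this is the bounds-checked call discussed below.
        let (head, tail) = self.slice.split_at(len);
        self.slice = tail;
        Some(head)
    }
}

fn main() {
    let data = [1, 1, 2, 2, 2, 3];
    let groups: Vec<&[i32]> = LinearGroupBy {
        slice: &data,
        predicate: |a: &i32, b: &i32| a == b,
    }
    .collect();
    assert_eq!(groups, vec![&[1, 1][..], &[2, 2, 2], &[3]]);
}
```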

@Ten0

Ten0 commented Jan 1, 2021

I got intrigued by the fact that this seems to show slightly lower performance than the unsafe version in cases where there was a single group, so I ran some compilation tests:

Contrary to what I expected from the benchmark results, in the safe version the compiler managed to optimize the loop inside next even better, reading only one item per iteration. However, that optimization did not happen when the array was hardcoded in the main function.
So overall I believe this is just a minor difference in what the optimizer precisely does, which may come and go depending on how the iterator is eventually used; the safe version does indeed not introduce any theoretical inability for the compiler to get rid of the bounds check within the loop in the next function.

However, as it turned out, the call to next itself introduced additional unnecessary bounds checking: the loop over the windows was written as something like for (i = 0; i != n; ++i) (line 25 of the assembly) instead of for (i = 0; i < n; ++i) (as per https://doc.rust-lang.org/1.48.0/src/core/slice/iter/macros.rs.html#8). That made the compiler unable to prove that the scenario where i > n was impossible when calling split_at, so there is still an overflow check (line 7 of the assembly) in the loop (which would hypothetically lead to a panic).
But since that did not show in your benchmark, I imagine there are also scenarios where the loop is compiled in such a way that this does not happen (similarly to the other loop, where the generated code for the safe version turned out to be smarter than the code for the unsafe version).
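The two loop shapes under discussion can be illustrated with a sketch (source-level, not the actual generated assembly). Both functions are semantically identical, but the `!=` form gives the optimizer a weaker invariant, since it cannot locally rule out `i` running past `n`:

```rust
// Sketch of the two loop shapes discussed above. They compute the same
// result, but the `i != n` condition only proves `i != n` inside the
// body, while `i < n` proves `i <= n` afterwards, which is what a later
// `split_at(i)` needs in order to drop its bounds check.
fn count_prefix_ne(slice: &[i32]) -> usize {
    let n = slice.len();
    let mut i = 0;
    while i != n && slice[i] == slice[0] {
        i += 1;
    }
    i
}

fn count_prefix_lt(slice: &[i32]) -> usize {
    let n = slice.len();
    let mut i = 0;
    while i < n && slice[i] == slice[0] {
        i += 1;
    }
    i
}

fn main() {
    let v = [5, 5, 5, 1, 2];
    assert_eq!(count_prefix_ne(&v), 3);
    assert_eq!(count_prefix_lt(&v), 3);
}
```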

Out of curiosity, I also wrote a slight change that encourages it not to use the len variable you declared to index the array, and instead to increment a pointer at each iteration; this is the result:
https://godbolt.org/z/9GPbfz
It does not seem to do better in either scenario with regard to the number of instructions, but I haven't run it through your benchmark.

As a side note, looking at the assembly made me realize that slices are actually not stored as (start, end) (as I mistakenly asserted here), but as (start, len) (references: 1, 2). I'm not sure why that choice was made, though I suspect it is because indexing can then use instructions such as cmp edx, dword ptr [rdi + 4*rcx], which can be very fast because structure sizes are usually aligned, making these just bit shifts, whereas a (start, end) representation would require additional instructions to compute the length. In any case, if anyone has a reference beyond speculation on that, I'm interested. :)
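The `(start, len)` layout is observable from safe code: a slice reference is two machine words, and `as_ptr`/`len` expose the two components. A small demonstration (this shows the size and the logical components, not a guarantee about the exact field order):

```rust
use std::mem::size_of;

fn main() {
    // A slice reference is a fat pointer: one data pointer plus a length.
    assert_eq!(size_of::<&[u8]>(), 2 * size_of::<usize>());

    let v = [10u32, 20, 30, 40];
    let s: &[u32] = &v[1..3];
    // The two logical components the (start, len) layout stores:
    assert_eq!(s.len(), 2);
    assert_eq!(s.as_ptr(), unsafe { v.as_ptr().add(1) });
    // A (start, end) representation would instead have to derive the
    // length by subtracting pointers on every `len()` call.
}
```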

@Ten0

Ten0 commented Jan 1, 2021

I'm not sure why that was the choice made, though I suspect that might be because........

Or maybe that wasn't ever discussed. ^^ graydon/rust-prehistory@958d12e#diff-e0b3bbcc496ac9c4177ca1bb4e2a2f365e1e58261d853b11e07ecbd5c830fb64R306-R307

@Kerollmops
Owner Author

Wow! What an in-depth code review :)

It seems to not do better in either of the scenarios with regards to the number of instructions, but I haven't benchmarked it using your benchmark.

You can easily try your changes by piping benchmark results to a file and then comparing them with the benchcmp crate (cargo install cargo-benchcmp).

cargo +nightly bench --features nightly -- linear_group::bench > base.bench
# make your changes...
cargo +nightly bench --features nightly -- linear_group::bench > update.bench

cargo benchcmp base.bench update.bench

I'm not sure why that was the choice made, though I suspect that might be because...

This could also be because it is easier to create an invalid slice by specifying an end pointer lower than the start pointer, whereas a length value cannot be invalid, as it lives between 0 and usize::MAX.

so I went to run some compilation tests.

So, regarding your analysis, the speed doesn't seem related to the fact that we create a Window every time we execute GroupBy::next, does it? So writing a dedicated safe-Rust window-of-2 implementation wouldn't make it faster?

@Ten0

Ten0 commented Jan 2, 2021

So, regarding your analysis, the speed doesn't seem related to the fact that we create a Window every time we execute GroupBy::next, does it? So writing a dedicated safe-Rust window-of-2 implementation wouldn't make it faster?

Yes indeed, that is completely optimized out by the compiler because the 2 is a constant.
The only overhead my compilation tests seem to show is on the call to split_at/split_at_mut, where an unchecked version may increase performance.
Though I'm not sure why that did not show in your tests (it should impact inputs with lots of groups, while your benchmarks seemed to show a regression mostly where there was a single group).
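For reference, a dedicated safe "window of 2" needs no iterator at all; slice patterns express it directly, and since the window size is the constant 2, the compiler sees essentially the same thing through `windows(2)`. A sketch (the function name is hypothetical, for illustration only):

```rust
// A safe "window of 2" via slice patterns: no Windows iterator needed.
// Because the window size is the constant 2, `windows(2)` compiles to
// the same direct pair access, matching the observation above.
fn first_group_len(slice: &[i32]) -> usize {
    let mut len = 1.min(slice.len());
    // Look at the pair (slice[len - 1], slice[len]) while both exist.
    while let [a, b, ..] = &slice[len.saturating_sub(1)..] {
        if a == b {
            len += 1;
        } else {
            break;
        }
    }
    len
}

fn main() {
    assert_eq!(first_group_len(&[7, 7, 7, 2, 2]), 3);
    assert_eq!(first_group_len(&[1, 2]), 1);
    assert_eq!(first_group_len(&[]), 0);
}
```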

@Kerollmops
Owner Author

Kerollmops commented Jan 3, 2021

Ok, so I copy/pasted the unexposed split_at/_mut/_unchecked slice methods to try out your solution. I also changed the computer I use (from a MacBook to an iMac), so here are the previous benchmarks too; you can see the change in the last commit pushed. Here is a godbolt split view of the generated assembly for the current release and the new unsafe unchecked split_at version you proposed.
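The unchecked variant being copied here can be sketched as follows. This is an assumption-level reconstruction of what such a helper looks like, built from `from_raw_parts` (a `split_at_unchecked` method was later stabilized on slices in the standard library):

```rust
/// Sketch of an unchecked split: same contract as `split_at`, but the
/// caller guarantees `mid <= slice.len()`, so no bounds check (and no
/// panic path) is emitted in the generated code.
///
/// Safety: `mid` must be less than or equal to `slice.len()`.
unsafe fn split_at_unchecked<T>(slice: &[T], mid: usize) -> (&[T], &[T]) {
    debug_assert!(mid <= slice.len());
    (
        std::slice::from_raw_parts(slice.as_ptr(), mid),
        std::slice::from_raw_parts(slice.as_ptr().add(mid), slice.len() - mid),
    )
}

fn main() {
    let v = [1, 2, 3, 4, 5];
    // In the iterator, `mid` comes from scanning the slice itself,
    // so it is always in bounds by construction.
    let (head, tail) = unsafe { split_at_unchecked(&v, 2) };
    assert_eq!(head, &[1, 2][..]);
    assert_eq!(tail, &[3, 4, 5][..]);
}
```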

Here you can find the test that doesn't give great results with the new version you proposed. The benchmarks are bad when there are many groups of variable (but short) length, but only for the rev version, not the classic one.

What's also quite strange is the vector_16_000_one_group benchmark: it is 1.5x faster with the fully safe version (safe split_at/_mut), but switching to the unsafe split_at/_mut/_unchecked version brings it back to the same speed as the unsafe version (the released version). I really don't see any significant diff between those (apart from the panic jumps); maybe it is related to some inlining in the test function?

I may have spotted a jump that is executed many times when the predicate returns false, at line 18 of the safe version (top right). This could explain the slowness for short groups, where this instruction is always executed, maybe dirtying the instruction cache (do jumps do that?).

From the fully unsafe version (the current released one) to the fully safe version:

 name                                              7b2fb19.bench ns/iter  0de2daa.bench ns/iter  diff ns/iter   diff %  speedup
 linear_group::bench::rev_vector_16_000            22,167                 31,765                        9,598   43.30%   x 0.70
 linear_group::bench::rev_vector_16_000_one_group  18,205                 15,822                       -2,383  -13.09%   x 1.15
 linear_group::bench::vector_16_000                29,064                 29,735                          671    2.31%   x 0.98
 linear_group::bench::vector_16_000_one_group      20,573                 13,742                       -6,831  -33.20%   x 1.50
 linear_group::bench::vector_16_000_sorted         21,096                 17,614                       -3,482  -16.51%   x 1.20
 linear_group::bench::vector_little_sorted         48                     49                                1    2.08%   x 0.98

From the new fully safe (simple split_at/_mut) version to the unsafe split_at/_mut/_unchecked version:

 name                                              0de2daa.bench ns/iter  37b959e.bench ns/iter  diff ns/iter   diff %  speedup
 linear_group::bench::rev_vector_16_000            31,765                 28,874                       -2,891   -9.10%   x 1.10
 linear_group::bench::rev_vector_16_000_one_group  15,822                 16,980                        1,158    7.32%   x 0.93
 linear_group::bench::vector_16_000                29,735                 25,590                       -4,145  -13.94%   x 1.16
 linear_group::bench::vector_16_000_one_group      13,742                 20,488                        6,746   49.09%   x 0.67
 linear_group::bench::vector_16_000_sorted         17,614                 20,930                        3,316   18.83%   x 0.84
 linear_group::bench::vector_little_sorted         49                     43                               -6  -12.24%   x 1.14

But more importantly: From the fully unsafe version (the current released one) to the unsafe split_at/_mut/_unchecked version 🎉 :

 name                                              7b2fb19.bench ns/iter  37b959e.bench ns/iter  diff ns/iter   diff %  speedup
 linear_group::bench::rev_vector_16_000            22,167                 28,874                        6,707   30.26%   x 0.77
 linear_group::bench::rev_vector_16_000_one_group  18,205                 16,980                       -1,225   -6.73%   x 1.07
 linear_group::bench::vector_16_000                29,064                 25,590                       -3,474  -11.95%   x 1.14
 linear_group::bench::vector_16_000_one_group      20,573                 20,488                          -85   -0.41%   x 1.00
 linear_group::bench::vector_16_000_sorted         21,096                 20,930                         -166   -0.79%   x 1.01
 linear_group::bench::vector_little_sorted         48                     43                               -5  -10.42%   x 1.12

@Ten0 Ten0 left a comment


Looks good. We should probably rename the PR to "Implement LinearGroupBy with almost only safe Rust" before merging it though :)

@Kerollmops Kerollmops changed the title Implement LinearGroupBy with only safe Rust Implement LinearGroupBy with almost only safe Rust Feb 13, 2021
@Kerollmops Kerollmops merged commit c12a394 into master Feb 13, 2021
@Kerollmops Kerollmops deleted the only-safe-rust branch February 13, 2021 16:49