Faster slice PartialOrd #28436

ranma42 · 2015-09-16T12:10:30Z

This branch improves the performance of Ord and PartialOrd methods for slices compared to the iter-based implementation.
Based on the approach used in #26884.

Knowing the result of equality comparison can enable additional optimizations in LLVM. Additionally, this makes it obvious that `partial_cmp` on totally ordered types cannot return `None`.

Reusing the same idea as in rust-lang#26884, we can exploit the fact that the length of slices is known, hence we can use a counted loop instead of iterators, which means that we only need a single counter, instead of having to increment and check one pointer for each iterator. Using the generic implementation of the boolean comparison operators (`lt`, `le`, `gt`, `ge`) provides further speedup for simple types. This happens because the loop scans elements checking for equality and dispatches to element comparison or length comparison depending on the result of the prefix comparison. ``` test u8_cmp ... bench: 14,043 ns/iter (+/- 1,732) test u8_lt ... bench: 16,156 ns/iter (+/- 1,864) test u8_partial_cmp ... bench: 16,250 ns/iter (+/- 2,608) test u16_cmp ... bench: 15,764 ns/iter (+/- 1,420) test u16_lt ... bench: 19,833 ns/iter (+/- 2,826) test u16_partial_cmp ... bench: 19,811 ns/iter (+/- 2,240) test u32_cmp ... bench: 15,792 ns/iter (+/- 3,409) test u32_lt ... bench: 18,577 ns/iter (+/- 2,075) test u32_partial_cmp ... bench: 18,603 ns/iter (+/- 5,666) test u64_cmp ... bench: 16,337 ns/iter (+/- 2,511) test u64_lt ... bench: 18,074 ns/iter (+/- 7,914) test u64_partial_cmp ... bench: 17,909 ns/iter (+/- 1,105) ``` ``` test u8_cmp ... bench: 6,511 ns/iter (+/- 982) test u8_lt ... bench: 6,671 ns/iter (+/- 919) test u8_partial_cmp ... bench: 7,118 ns/iter (+/- 1,623) test u16_cmp ... bench: 6,689 ns/iter (+/- 921) test u16_lt ... bench: 6,712 ns/iter (+/- 947) test u16_partial_cmp ... bench: 6,725 ns/iter (+/- 780) test u32_cmp ... bench: 7,704 ns/iter (+/- 1,294) test u32_lt ... bench: 7,611 ns/iter (+/- 3,062) test u32_partial_cmp ... bench: 7,640 ns/iter (+/- 1,149) test u64_cmp ... bench: 7,517 ns/iter (+/- 2,164) test u64_lt ... bench: 7,579 ns/iter (+/- 1,048) test u64_partial_cmp ... bench: 7,629 ns/iter (+/- 1,195) ```

rust-highfive · 2015-09-16T12:10:42Z

r? @huonw

(rust_highfive has picked a reviewer for you, use r? to override)

ranma42 · 2015-09-16T12:12:18Z

Benchmarking data is based on https://gist.github.com/ranma42/51d46e4a4664e388a8f7

bluss · 2015-09-16T12:16:58Z

src/libcore/slice.rs

+        let l = cmp::min(self.len(), other.len());
+
+        for i in 0..l {
+            match self[i].cmp(&other[i]) {


Are bounds checks eliminated here? This suggests you need to slice self and other to length l for it to be.

(I guess this wasn't needed in the partialeq case since it requires equal lengths, and llvm understands that)

You are right, in order to remove all bound checks I need to re-slice self and other, which further improves performance :)

When you add this, maybe add a comment to indicate why you reslice.

ranma42 · 2015-09-16T12:34:40Z

@bluss In a previous version my code was using Some(self.cmp(other)), but I was not sure about which form was preferred. I will rewrite as you suggested.

bluss · 2015-09-16T12:43:44Z

I almost think the main win is not using the iterator comparators' 1) .zip() and 2) match on tuples, both which @dotdash pointed out have far from optimal codegen.

ranma42 · 2015-09-16T13:01:35Z

@bluss Another significant win (which kind of surprised me) was not specialising the boolean comparisons "eagerly". It is faster to only check for equality in the loop and then perform the dispatch "outside" than to have an explicit implementation like

    #[inline]
    fn le(&self, other: &[T]) -> bool {
        let l = cmp::min(self.len(), other.len());
        let lhs = &self[..l];
        let rhs = &other[..l];

        for i in 0..l {
            match lhs[i].partial_cmp(&rhs[i]) {
                Some(Ordering::Equal) => (),
                Some(Ordering::Less) => return true, // this
                _ => return false,  // makes the loop slower
            }
        }

        self.len().le(&other.len())
    }

bluss · 2015-09-16T13:23:51Z

Oh that's cool, thanks for the explanation!

Instead of manually defining it, `partial_cmp` can simply wrap the result of `cmp` for totally ordered types.

In order to get rid of all range checks, the compiler needs to explicitly see that the slices it iterates over are as long as the loop variable upper bound. This further improves the performance of slice comparison: ``` test u8_cmp ... bench: 4,761 ns/iter (+/- 1,203) test u8_lt ... bench: 4,579 ns/iter (+/- 649) test u8_partial_cmp ... bench: 4,768 ns/iter (+/- 761) test u16_cmp ... bench: 4,607 ns/iter (+/- 580) test u16_lt ... bench: 4,681 ns/iter (+/- 567) test u16_partial_cmp ... bench: 4,607 ns/iter (+/- 967) test u32_cmp ... bench: 4,448 ns/iter (+/- 891) test u32_lt ... bench: 4,546 ns/iter (+/- 992) test u32_partial_cmp ... bench: 4,415 ns/iter (+/- 646) test u64_cmp ... bench: 4,380 ns/iter (+/- 1,184) test u64_lt ... bench: 5,684 ns/iter (+/- 602) test u64_partial_cmp ... bench: 4,663 ns/iter (+/- 1,158) ```

ranma42 · 2015-09-16T13:45:47Z

Updated with improvements suggested by @bluss (and new benchmarking results).
As expected, the main part of the loop (which scans the slice looking for elements which are not equal) runs about as fast as the eq and ne implementation.
The commit messages provide some insight about how each commit affects performance and/or the results of codegen (the bounds check optimisation is mentioned in the fourth commit; the equals-only check is mentioned in the second one). Would it be better to mention it directly in the code (as comments)?

bluss · 2015-09-16T13:47:52Z

Yes, I think it's best to leave a comment in the code for the reslicing. It looks redundant for the reader.

ranma42 · 2015-09-16T13:48:40Z

It might also be a good idea to avoid declaring cmp and partial_cmp as inline (and it would be consistent with the eq and ne implementation). Any opitions on this?

bluss · 2015-09-16T13:50:27Z

They don't need inline, since the trait impl is already generic. I think it's fine to remove.. we are trying to be conservative with the inlines.

Be more conservative with inlining.

The explicit slicing is needed in order to enable additional range check optimizations in the compiler.

bluss · 2015-09-16T15:59:42Z

Thank you, it looks good! Great benchmark wins!

@bors r+

bors · 2015-09-16T15:59:43Z

📌 Commit 74dc146 has been approved by bluss

bors · 2015-09-16T15:59:45Z

⌛ Testing commit 74dc146 with merge 619d874...

bors · 2015-09-16T17:34:57Z

💔 Test failed - auto-mac-32-opt

bluss · 2015-09-16T17:52:28Z

@bors retry

This branch improves the performance of Ord and PartialOrd methods for slices compared to the iter-based implementation. Based on the approach used in #26884.

bors · 2015-09-16T21:45:47Z

⌛ Testing commit 74dc146 with merge 47d125d...

bors · 2015-09-17T00:22:47Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

ranma42 added 2 commits September 16, 2015 11:26

Specialize PartialOrd for totally ordered primitive types

1614173

Knowing the result of equality comparison can enable additional optimizations in LLVM. Additionally, this makes it obvious that `partial_cmp` on totally ordered types cannot return `None`.

rust-highfive assigned huonw Sep 16, 2015

bluss reviewed Sep 16, 2015
View reviewed changes

ranma42 added 2 commits September 16, 2015 15:25

Reuse cmp in totally ordered types

bf9254a

Instead of manually defining it, `partial_cmp` can simply wrap the result of `cmp` for totally ordered types.

ranma42 added 2 commits September 16, 2015 16:09

Remove inline attribute

08b9edf

Be more conservative with inlining.

Explain explicit slicing in slice cmp and partial_cmp methods

74dc146

The explicit slicing is needed in order to enable additional range check optimizations in the compiler.

bluss added the relnotes Marks issues that should be documented in the release notes of the next release. label Sep 16, 2015

bors added a commit that referenced this pull request Sep 16, 2015

Auto merge of #28436 - ranma42:faster-partialord, r=bluss

47d125d

This branch improves the performance of Ord and PartialOrd methods for slices compared to the iter-based implementation. Based on the approach used in #26884.

bors merged commit 74dc146 into rust-lang:master Sep 17, 2015

ranma42 deleted the faster-partialord branch September 23, 2015 13:25

Faster slice PartialOrd #28436

Faster slice PartialOrd #28436

Uh oh!

Conversation

ranma42 commented Sep 16, 2015

Uh oh!

rust-highfive commented Sep 16, 2015

Uh oh!

ranma42 commented Sep 16, 2015

Uh oh!

bluss Sep 16, 2015

Choose a reason for hiding this comment

Uh oh!

bluss Sep 16, 2015

Choose a reason for hiding this comment

Uh oh!

ranma42 Sep 16, 2015

Choose a reason for hiding this comment

Uh oh!

bluss Sep 16, 2015

Choose a reason for hiding this comment

Uh oh!

bluss Sep 16, 2015

Choose a reason for hiding this comment

Uh oh!

ranma42 commented Sep 16, 2015

Uh oh!

bluss commented Sep 16, 2015

Uh oh!

ranma42 commented Sep 16, 2015

Uh oh!

bluss commented Sep 16, 2015

Uh oh!

ranma42 commented Sep 16, 2015

Uh oh!

bluss commented Sep 16, 2015

Uh oh!

ranma42 commented Sep 16, 2015

Uh oh!

bluss commented Sep 16, 2015

Uh oh!

bluss commented Sep 16, 2015

Uh oh!

bors commented Sep 16, 2015

Uh oh!

bors commented Sep 16, 2015

Uh oh!

bors commented Sep 16, 2015

Uh oh!

bluss commented Sep 16, 2015

Uh oh!

bors commented Sep 16, 2015

Uh oh!

bors commented Sep 17, 2015

Uh oh!

Uh oh!