Optimize unnecessary check in Vec::retain #88060

TennyZhuang · 2021-08-15T18:47:16Z

The function vec::Vec::retain has only two stages:

Nothing was deleted.
Some elements were deleted.

There is an unnecessary check, if g.deleted_cnt > 0, inside the loop, and it is difficult for the compiler to optimize it away. I split the loop into two stages manually and kept the code clean using const generics.

I wrote a special but common benchmark case for this optimization: it calls retain on a vec but keeps all elements.

Before

test vec::bench_retain_whole_100000                      ... bench:      84,803 ns/iter (+/- 17,314)

After

test vec::bench_retain_whole_100000                      ... bench:      42,638 ns/iter (+/- 16,910)

The result is expected: there were two ifs per iteration before the optimization and there is one after.
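The two-stage split described above can be sketched roughly as follows. This is a simplified, safe-Rust illustration and not the code from the PR: the names `process_loop` and `retain_two_stage`, and the use of `swap` in place of raw pointer copies, are assumptions made for the example.

```rust
/// One pass of the retain loop, specialized at compile time on whether
/// anything has been deleted yet. All names here are illustrative.
fn process_loop<T, F, const DELETED: bool>(
    v: &mut Vec<T>,
    keep: &mut F,
    processed: &mut usize,
    deleted: &mut usize,
) where
    F: FnMut(&T) -> bool,
{
    let len = v.len();
    while *processed != len {
        let src = *processed;
        if !keep(&v[src]) {
            *processed += 1;
            *deleted += 1;
            if DELETED {
                continue;
            } else {
                // First rejection: hand over to the DELETED = true stage.
                return;
            }
        }
        if DELETED {
            // Move the kept element into the gap left by rejected ones.
            // (Assumption: swap stands in for a raw pointer copy.)
            v.swap(src - *deleted, src);
        }
        *processed += 1;
    }
}

/// Hypothetical retain built from the two specialized stages.
fn retain_two_stage<T, F>(v: &mut Vec<T>, mut keep: F)
where
    F: FnMut(&T) -> bool,
{
    let (mut processed, mut deleted) = (0usize, 0usize);
    // Stage 1: nothing deleted yet, so kept elements stay where they are.
    process_loop::<T, F, false>(v, &mut keep, &mut processed, &mut deleted);
    // Stage 2: at least one deletion, so every kept element is shifted.
    process_loop::<T, F, true>(v, &mut keep, &mut processed, &mut deleted);
    let new_len = v.len() - deleted;
    v.truncate(new_len);
}
```

Because `DELETED` is a const parameter, each instantiation contains only the branch it needs, so the stage that runs while nothing has been deleted never tests the deletion count when an element is kept.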

rust-highfive · 2021-08-15T18:47:19Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon.

Please see the contribution instructions for more information.

TennyZhuang · 2021-08-15T18:49:45Z

r? @oxalica A trivial but interesting optimization, PTAL

oxalica

The generated code seems kind of weird. The second-stage loop is generated into many basic blocks and there are many jumps around.

Anyway, the benchmark is quite convincing.

BTW: I actually found this PR from your Twitter post. 😄

TennyZhuang · 2021-08-16T02:17:10Z

I added a normal benchmark:

before

test vec::bench_retain_100000                            ... bench:     460,816 ns/iter (+/- 24,389)
test vec::bench_retain_whole_100000                      ... bench:      88,300 ns/iter (+/- 4,635)

after

test vec::bench_retain_100000                            ... bench:      62,627 ns/iter (+/- 6,692)
test vec::bench_retain_whole_100000                      ... bench:      46,326 ns/iter (+/- 4,133)

It's dramatically faster, though I can't explain why.

oxalica · 2021-08-16T07:18:05Z

I ran the benchmark on my R5 2400G PC and the results are quite different, and quite stable across multiple runs. The with-many-move case actually becomes slower for me.

Before

test vec::bench_retain_100000                            ... bench:      58,436 ns/iter (+/- 2,482)
test vec::bench_retain_whole_100000                      ... bench:      54,747 ns/iter (+/- 1,646)

After

test vec::bench_retain_100000                            ... bench:      53,703 ns/iter (+/- 2,905)
test vec::bench_retain_whole_100000                      ... bench:      61,588 ns/iter (+/- 689)

This might indicate that the generated code relies heavily on the behavior of the branch predictor. Are you on an Intel CPU? Intel's branch predictor may work differently from AMD's.

TennyZhuang · 2021-08-16T07:37:56Z

I tested it on my MacBook Pro 13 with a 2.0GHz quad-core 10th-generation Intel Core i5 processor (Turbo Boost up to 3.8GHz).

I can run the bench on another device.

@Xuanwo could you also run a benchmark for my optimization?

@oxalica I will also try manually expanding the two loops without const generics.

Xuanwo · 2021-08-16T07:55:36Z

I ran the same benchmark on my laptop:

xuanwo@thinkpad-x1-carbon
-------------------------
OS: Arch Linux x86_64
Host: 20KHCTO1WW ThinkPad X1 Carbon 6th
Kernel: 5.13.10-zen1-1-zen
Uptime: 1 day, 5 hours, 8 mins
Packages: 1144 (pacman)
Shell: zsh 5.8
Resolution: 1920x1080
DE: KDE5
WM: KWin
WM Theme: Breeze
Theme: Breeze Light [KDE5], Canta-light [GTK2/3]
Icons: breeze [KDE5], breeze [GTK2/3]
Terminal: tmux
CPU: Intel i7-8650U (8) @ 4.200GHz
GPU: Intel UHD Graphics 620
Memory: 10319MiB / 15754MiB

Before

test vec::bench_retain_100000                            ... bench:      74,452 ns/iter (+/- 2,484)
test vec::bench_retain_whole_100000                      ... bench:      59,383 ns/iter (+/- 2,675)

After

test vec::bench_retain_100000                            ... bench:      69,770 ns/iter (+/- 1,764)
test vec::bench_retain_whole_100000                      ... bench:      53,829 ns/iter (+/- 1,374)

TennyZhuang · 2021-08-16T08:00:10Z

I ran it on another Mac:

sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

The command:

./x.py bench -i library/alloc --test-args bench_retain

The result is almost the same:

Before

test vec::bench_retain_100000                            ... bench:     401,699 ns/iter (+/- 11,941)
test vec::bench_retain_whole_100000                      ... bench:      75,848 ns/iter (+/- 2,291)

After

test vec::bench_retain_100000                            ... bench:      52,742 ns/iter (+/- 1,948)
test vec::bench_retain_whole_100000                      ... bench:      38,554 ns/iter (+/- 3,745)

TennyZhuang · 2021-08-16T08:00:35Z

It seems that the original retain performs very poorly on macOS for an unknown reason.

TennyZhuang · 2021-08-16T08:22:13Z

@Xuanwo @oxalica Can you run a version without const generics?

For your convenience:

git remote add zty git@github.com:TennyZhuang/rust.git
git fetch zty
git checkout optimize-vec-retain-expand-manully

TennyZhuang · 2021-08-16T08:36:14Z

On my Mac, the performance is the same between the const-generics version and the manually expanded version:

test vec::bench_retain_100000                            ... bench:      52,471 ns/iter (+/- 3,921)
test vec::bench_retain_whole_100000                      ... bench:      38,560 ns/iter (+/- 2,666)

Xuanwo · 2021-08-16T08:50:35Z

The two versions perform similarly on my laptop:

const generics

test vec::bench_retain_100000                            ... bench:      76,786 ns/iter (+/- 7,531)
test vec::bench_retain_whole_100000                      ... bench:      62,311 ns/iter (+/- 19,721)

manually expanded

test vec::bench_retain_100000                            ... bench:      72,178 ns/iter (+/- 3,247)
test vec::bench_retain_whole_100000                      ... bench:      56,534 ns/iter (+/- 1,479)

TennyZhuang · 2021-08-16T08:58:49Z

OK, the conclusions are:

Const generics are OK.
It gives a small improvement (about 10%) in most cases and on most architectures. @Xuanwo
~~Performance is degraded a little (about 10%) in some cases on some specific architectures for an unknown reason~~ (Resolved)
It gives a large improvement (about 8x) on some specific architectures for an unknown reason. @TennyZhuang
In theory, it does remove an unnecessary check.
It introduces a little code complexity.

I tend to think this PR deserves to be merged.

@oxalica what's your opinion?

oxalica · 2021-08-17T07:11:50Z

I figured out why the code generation is weird after your change.
LLVM fails to deduce that g.processed_len == original_len after the second loop, so the drop(g) cannot be optimized away. That's why I see a never-called memmove in the generated code. In your manually expanded code without const generics, this issue still exists. I'm not sure whether it's an LLVM bug.

I'd suggest changing the while condition to g.processed_len != original_len; with that change, it is successfully optimized.
Generated code after

Well, this should not affect the efficiency of the main loop.
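To make the shape of that suggestion concrete, here is a minimal single-stage sketch of a drop guard combined with a `!=` loop condition. It is a safe-Rust stand-in, not the standard library code: `Guard`, `retain_guarded`, and the use of `swap` and `drain` in place of raw pointer copies are assumptions for illustration, and the sketch does not by itself show how LLVM treats either loop condition.

```rust
/// Illustrative guard: on drop it removes the rejected elements and
/// shifts any unprocessed tail down, even if the predicate panicked.
struct Guard<'a, T> {
    v: &'a mut Vec<T>,
    processed_len: usize,
    deleted_cnt: usize,
}

impl<T> Drop for Guard<'_, T> {
    fn drop(&mut self) {
        // In this sketch, rejected elements always occupy the range
        // [processed_len - deleted_cnt, processed_len).
        let end = self.processed_len;
        let start = end - self.deleted_cnt;
        // Draining that range drops them and moves the tail (if any) down.
        drop(self.v.drain(start..end));
    }
}

/// Hypothetical retain using the guard above.
fn retain_guarded<T, F>(v: &mut Vec<T>, mut keep: F)
where
    F: FnMut(&T) -> bool,
{
    let original_len = v.len();
    let mut g = Guard { v, processed_len: 0, deleted_cnt: 0 };
    // `!=` instead of `<`: when the loop exits normally, the exit
    // condition itself states that processed_len == original_len,
    // so there is no unprocessed tail left for the guard to move.
    while g.processed_len != original_len {
        let src = g.processed_len;
        if keep(&g.v[src]) {
            let dst = src - g.deleted_cnt;
            g.v.swap(dst, src);
        } else {
            g.deleted_cnt += 1;
        }
        g.processed_len += 1;
    }
    // `g` is dropped here and finishes the compaction.
}
```

With a `<` condition, the loop exit only establishes `processed_len >= original_len`, which is a weaker fact for an optimizer to work from than the equality that `!=` provides.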

TennyZhuang · 2021-08-17T07:24:58Z

Good catch, I have updated my code.

Current performance on an Intel(R) Core(TM) i5-8259U CPU @ 2.30GHz:

test vec::bench_retain_100000                            ... bench:      56,911 ns/iter (+/- 5,288)
test vec::bench_retain_whole_100000                      ... bench:      56,371 ns/iter (+/- 5,982)

Optimize unnecessary check in VecDeque::retain

This pr is highly inspired by rust-lang#88060 which shared the same idea: we can split the `for` loop into stages so that we can remove unnecessary checks like `del > 0`.

## Benchmarks

Before

```rust
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     290,125 ns/iter (+/- 8,717)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     291,588 ns/iter (+/- 9,621)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     287,426 ns/iter (+/- 9,009)
```

After

```rust
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     243,940 ns/iter (+/- 8,563)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     242,768 ns/iter (+/- 3,903)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     202,926 ns/iter (+/- 6,332)
```

Based on the current benchmark, this PR will improve the perf of `VecDeque::retain` by around 16%. For special cases, the improvement will be up to 30%.

Signed-off-by: Xuanwo <github@xuanwo.io>

TennyZhuang · 2021-08-22T02:24:09Z

@joshtriplett PTAL, Thanks

TennyZhuang · 2021-08-24T10:02:09Z

r? @dtolnay

It seems that @joshtriplett is busy. Could you also review my PR? It's similar to #88075.

the8472 · 2021-12-04T13:05:21Z

library/alloc/benches/vec.rs

+fn bench_retain_100000(b: &mut Bencher) {
+    let v = (1..=100000).collect::<Vec<u32>>();
+    b.iter(|| {
+        let mut v = v.clone();


Calling .clone() inside the benchmark loop should be avoided, since it can introduce allocator noise. It's better to just refill the vec.

TennyZhuang · 2021-12-04

Thanks, but the PR was already merged. Should I submit another PR to fix it?

the8472 · 2021-12-04

I'll take care of it. I just wanted to leave a note.
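As a rough illustration of the "refill instead of clone" suggestion, the sketch below reuses one allocation across iterations. The helpers `refill` and `run_iteration` are hypothetical names, and the benchmark harness itself is left out; this only shows the buffer-reuse pattern, not the follow-up change that was eventually made.

```rust
/// Hypothetical helper: reset `v` to 1..=n while reusing its allocation.
fn refill(v: &mut Vec<u32>, n: u32) {
    v.clear(); // drops the elements but keeps the capacity
    v.extend(1..=n); // no reallocation as long as capacity >= n
}

/// Hypothetical stand-in for the body of one benchmark iteration.
fn run_iteration(v: &mut Vec<u32>, n: u32) -> usize {
    refill(v, n);
    v.retain(|x| x % 2 == 0);
    v.len()
}
```

A harness would allocate the vec once with sufficient capacity outside the timed closure and call something like `run_iteration` inside it, so the allocator is not exercised on every iteration. The refill itself is still timed, just as the clone was.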

rust-highfive assigned joshtriplett Aug 15, 2021

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Aug 15, 2021

oxalica reviewed Aug 15, 2021

TennyZhuang force-pushed the optimize-vec-retain branch from 493df40 to 1388eb8 Compare August 16, 2021 02:16

TennyZhuang force-pushed the optimize-vec-retain branch from 1388eb8 to c34e154 Compare August 16, 2021 02:50

Xuanwo mentioned this pull request Aug 16, 2021

Optimize unnecessary check in VecDeque::retain #88075

Merged

TennyZhuang force-pushed the optimize-vec-retain branch from c34e154 to 6eb1549 Compare August 17, 2021 07:22

TennyZhuang force-pushed the optimize-vec-retain branch from 6eb1549 to 89d47b6 Compare August 19, 2021 12:27

rust-highfive assigned dtolnay and unassigned joshtriplett Aug 24, 2021

the8472 mentioned this pull request Dec 3, 2021

Vec::retain() is significantly slower than into_iter().filter().collect() #91497

Closed

the8472 reviewed Dec 4, 2021
