Optimize unnecessary check in Vec::retain #88060
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @joshtriplett (or someone else) soon. Please see the contribution instructions for more information.
r? @oxalica A trivial but interesting optimization, PTAL
The generated code seems kind of weird. The second-stage loop is generated into many basic blocks and there are many jumps around.
Anyway, the benchmark is quite convincing.
BTW, I actually found this PR from your Twitter post. 😄
Force-pushed from 493df40 to 1388eb8.
I added a normal benchmark. Before:
After:
It's much faster, though I can't explain why.
Force-pushed from 1388eb8 to c34e154.
I ran the benchmark on my R5 2400G PC. The results are quite different from yours and stable across multiple runs: the case with many moves actually becomes slower for me. Before:
After:
This might indicate that the generated code relies heavily on the behavior of the branch predictor. Are you on an Intel CPU? Intel's branch predictor may work differently from AMD's.
I tested it on my MacBook Pro 13; I can also run the bench on another device. @Xuanwo could you also run a benchmark for my optimization? @oxalica I will also try manually expanding the two loops without const generics.
I ran the same benchmark on my laptop (xuanwo@thinkpad-x1-carbon):
- OS: Arch Linux x86_64
- Host: 20KHCTO1WW ThinkPad X1 Carbon 6th
- Kernel: 5.13.10-zen1-1-zen
- Uptime: 1 day, 5 hours, 8 mins
- Packages: 1144 (pacman)
- Shell: zsh 5.8
- Resolution: 1920x1080
- DE: KDE5
- WM: KWin
- WM Theme: Breeze
- Theme: Breeze Light [KDE5], Canta-light [GTK2/3]
- Icons: breeze [KDE5], breeze [GTK2/3]
- Terminal: tmux
- CPU: Intel i7-8650U (8) @ 4.200GHz
- GPU: Intel UHD Graphics 620
- Memory: 10319MiB / 15754MiB
Before
After
I ran it on another Mac:
The command:
The result is almost the same. Before:
After:
It seems that the original
On my Mac, the performance is the same between the const-generics version and the manually expanded version:
The two versions perform similarly on my laptop.
Const generics:
test vec::bench_retain_100000 ... bench: 76,786 ns/iter (+/- 7,531)
test vec::bench_retain_whole_100000 ... bench: 62,311 ns/iter (+/- 19,721)
Manually expanded:
test vec::bench_retain_100000 ... bench: 72,178 ns/iter (+/- 3,247)
test vec::bench_retain_whole_100000 ... bench: 56,534 ns/iter (+/- 1,479)
OK, the conclusion is that:
I tend to think this PR deserves to be merged. @oxalica what's your opinion?
I figured out why the code generation is weird after your change. I'd suggest to change the
Well, this should not affect the efficiency of the main loop.
Force-pushed from c34e154 to 6eb1549.
Good catch, I have updated my code. Current performance on
Force-pushed from 6eb1549 to 89d47b6.
Optimize unnecessary check in VecDeque::retain

This PR is highly inspired by rust-lang#88060, which shares the same idea: we can split the `for` loop into stages so that we can remove unnecessary checks like `del > 0`.

## Benchmarks

Before

```rust
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     290,125 ns/iter (+/- 8,717)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     291,588 ns/iter (+/- 9,621)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     287,426 ns/iter (+/- 9,009)
```

After

```rust
test collections::vec_deque::tests::bench_retain_half_10000  ... bench:     243,940 ns/iter (+/- 8,563)
test collections::vec_deque::tests::bench_retain_odd_10000   ... bench:     242,768 ns/iter (+/- 3,903)
test collections::vec_deque::tests::bench_retain_whole_10000 ... bench:     202,926 ns/iter (+/- 6,332)
```

Based on the current benchmark, this PR improves the performance of `VecDeque::retain` by around 16%. For special cases, the improvement is up to 30%.

Signed-off-by: Xuanwo <github@xuanwo.io>
@joshtriplett PTAL, thanks.
r? @dtolnay It seems that @joshtriplett is busy. Could you also review my PR? It's similar to #88075.
fn bench_retain_100000(b: &mut Bencher) {
    let v = (1..=100000).collect::<Vec<u32>>();
    b.iter(|| {
        let mut v = v.clone();
Calling `.clone()` inside the benchmark loop should be avoided, since it can introduce allocator noise. It's better to just refill the vec.
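To illustrate the "refill instead of clone" suggestion, here is a minimal sketch. The helper name `refill_and_retain` and the even-number predicate are assumptions made for this example and do not come from the PR; the idea is only that `clear()` followed by `extend()` reuses the existing allocation, so the allocator is not exercised on every iteration.

```rust
/// Hypothetical benchmark body (not from the PR): reuse one buffer across
/// iterations instead of cloning a template vector each time.
fn refill_and_retain(buf: &mut Vec<u32>, n: u32) -> usize {
    // `clear` drops the old contents but keeps the allocation.
    buf.clear();
    // Refill in place; no reallocation happens once capacity >= n.
    buf.extend(1..=n);
    // The operation under measurement (predicate chosen for illustration).
    buf.retain(|x| x % 2 == 0);
    buf.len()
}
```

Inside a benchmark closure, the buffer would be created once outside `b.iter(...)` and passed in by mutable reference on every iteration. Note that the refill itself is still measured along with `retain`.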
Thanks, but the PR has already been merged. Should I submit another PR to fix it?
I'll take care of it. I just wanted to leave a note.
The function `vec::Vec::retain` only has two stages:

There is an unnecessary check `if g.deleted_cnt > 0` in the loop, and it's difficult for the compiler to optimize it away. I split the loop into two stages manually and keep the code clean using const generics.

I wrote a special but common bench case for this optimization: I call `retain` on a vec but keep all elements.

Before and after this optimization:

The result is as expected: there are two `if`s before the optimization and one `if` after.
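The two-stage split described above can be sketched roughly as follows. This is a simplified, safe-Rust illustration and not the code from this PR: the names `process_loop` and `retain_two_stage`, and the use of `swap` to close the gap, are assumptions made for the example (the standard library version works with raw pointers and a drop guard). The const generic `DELETED` lets the compiler emit one loop with no shifting work at all and a second loop that always shifts, so neither loop needs a runtime "has anything been deleted yet?" check.

```rust
/// Hypothetical helper: one loop body, specialized at compile time by `DELETED`.
fn process_loop<T, F, const DELETED: bool>(
    v: &mut Vec<T>,
    keep: &mut F,
    processed: &mut usize,
    deleted: &mut usize,
) where
    F: FnMut(&T) -> bool,
{
    let len = v.len();
    while *processed < len {
        if !keep(&v[*processed]) {
            *processed += 1;
            *deleted += 1;
            if DELETED {
                continue;
            } else {
                // First rejected element found: hand over to stage 2.
                break;
            }
        }
        if DELETED {
            // Move the kept element back into the first hole.
            // The rejected elements drift toward the tail.
            v.swap(*processed - *deleted, *processed);
        }
        *processed += 1;
    }
}

/// Hypothetical two-stage `retain` built on the helper above.
fn retain_two_stage<T, F>(v: &mut Vec<T>, mut keep: F)
where
    F: FnMut(&T) -> bool,
{
    let mut processed = 0;
    let mut deleted = 0;
    // Stage 1: nothing deleted yet, so kept elements stay where they are.
    process_loop::<_, _, false>(v, &mut keep, &mut processed, &mut deleted);
    // Stage 2: at least one hole exists, so every kept element is shifted.
    process_loop::<_, _, true>(v, &mut keep, &mut processed, &mut deleted);
    // The rejected elements now sit at the tail; drop them.
    let new_len = v.len() - deleted;
    v.truncate(new_len);
}
```

When every element is kept, stage 1 runs to the end and stage 2 exits immediately, which corresponds to the "keep all elements" bench case mentioned above.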