Hi!
Recently I tested Profile-Guided Optimization (PGO) on different projects in different software domains; all the results are available at https://github.com/zamazan4ik/awesome-pgo . Since PGO shows measurable improvements in many cases, I decided to run PGO benchmarks on this library (especially because I found some performance numbers). Here are my results. I hope they will be helpful for someone.
Test environment
Fedora 39
Linux kernel 6.8.9
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - rustc 1.78.0
prettyplease version: the latest at the time of writing, from the master branch at commit 179974cc93c8d54894483463c1eea4df9e70a694
Turbo boost disabled
Benchmark
For benchmarking, I use the test scenario built into the project, which reformats several projects (described below). I chose this tool because I needed an executable to optimize. As benchmark datasets, I used two projects, Vector and grafbase: both are big enough and were already checked out on my PC. Vector is on the master branch at commit 5a4a2b2a10131af7ef4ca32ff13b9040e231f5a6, and grafbase is on the main branch at commit 5605d62f69790f62a385e8155bddf838f977165b. For PGO I use the cargo-pgo tool.
The Release benchmark result was obtained with the taskset -c 0 prettyplease-update command. The PGO training phase runs taskset -c 0 prettyplease-update with the instrumented binary, and the PGO-optimized binary is run with the same command. taskset -c 0 pins the process to a single CPU core to reduce the OS scheduler's influence on the results. All measurements were done on the same machine, with the same background "noise" (as much as I can guarantee).
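The build, train, and optimize cycle described above can be sketched as a small script. This is only an illustration under stated assumptions: cargo pgo build and cargo pgo optimize are cargo-pgo subcommands, but the exact invocations used for this benchmark are not shown in this post. DRY_RUN, BENCH_BIN, step, and pgo_cycle are hypothetical names, and the BENCH_BIN path is a placeholder. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a PGO cycle with cargo-pgo (not the exact commands used in this post).
# DRY_RUN=1 (default) only prints each command; set DRY_RUN=0 to execute them.
# BENCH_BIN is a hypothetical placeholder for the built benchmark executable.
DRY_RUN="${DRY_RUN:-1}"
BENCH_BIN="${BENCH_BIN:-./path/to/prettyplease-update}"

# Print a command and, unless in dry-run mode, execute it.
step() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

pgo_cycle() {
    # 1. Build the instrumented binary.
    step cargo pgo build
    # 2. Training run, pinned to CPU core 0, to collect profiles.
    step taskset -c 0 "$BENCH_BIN"
    # 3. Rebuild using the collected profiles.
    step cargo pgo optimize
}

pgo_cycle
```

Running it as-is prints the three steps in order; running it with DRY_RUN=0 and BENCH_BIN pointing at the real executable would perform them.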
I also decided to test LTO on the project. I enabled it with the following lines in the root Cargo.toml:
[profile.release]
codegen-units = 1
lto = true
Results
I got the following results when formatting the grafbase sources, with the Vector sources as the PGO training set:
hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ../prettyplease/target/update_release' 'taskset -c 0 ../prettyplease/target/update_release_lto' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_instrumented' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_instrumented' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_optimized'
Benchmark 1: taskset -c 0 ../prettyplease/target/update_release
Time (mean ± σ): 775.1 ms ± 2.2 ms [User: 695.3 ms, System: 76.7 ms]
Range (min … max): 770.1 ms … 778.0 ms 10 runs
Benchmark 2: taskset -c 0 ../prettyplease/target/update_release_lto
Time (mean ± σ): 677.3 ms ± 2.2 ms [User: 597.9 ms, System: 76.7 ms]
Range (min … max): 674.4 ms … 680.2 ms 10 runs
Benchmark 3: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_instrumented
Time (mean ± σ): 869.7 ms ± 7.8 ms [User: 776.9 ms, System: 83.1 ms]
Range (min … max): 863.3 ms … 887.2 ms 10 runs
Benchmark 4: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized
Time (mean ± σ): 631.0 ms ± 2.3 ms [User: 543.7 ms, System: 79.3 ms]
Range (min … max): 627.1 ms … 635.1 ms 10 runs
Benchmark 5: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_instrumented
Time (mean ± σ): 1.489 s ± 0.006 s [User: 1.221 s, System: 0.238 s]
Range (min … max): 1.479 s … 1.503 s 10 runs
Benchmark 6: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_optimized
Time (mean ± σ): 623.4 ms ± 3.7 ms [User: 529.9 ms, System: 85.6 ms]
Range (min … max): 618.8 ms … 630.6 ms 10 runs
Summary
taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_optimized ran
1.01 ± 0.01 times faster than taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized
1.09 ± 0.01 times faster than taskset -c 0 ../prettyplease/target/update_release_lto
1.24 ± 0.01 times faster than taskset -c 0 ../prettyplease/target/update_release
1.40 ± 0.02 times faster than taskset -c 0 ../prettyplease/target/update_release_lto_pgo_instrumented
2.39 ± 0.02 times faster than taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_instrumented
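According to the results, LTO and PGO measurably improve performance, at least in the simple benchmark above: LTO reduces the mean time from 775.1 ms to 677.3 ms, and PGO reduces it further to 631.0 ms. BOLT also improves performance, but only slightly (623.4 ms, about 1% faster than PGO alone).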
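The relative gains can be recomputed from the mean times in the hyperfine output above. The sketch below divides a baseline mean by a candidate mean; the speedup helper is a hypothetical name introduced for illustration, and the inputs are the reported means in milliseconds.

```shell
#!/bin/sh
# Relative speedup = baseline mean / candidate mean.
# Inputs are the mean times (ms) from the hyperfine output above.
speedup() {
    awk -v base="$1" -v cand="$2" 'BEGIN { printf "%.2f\n", base / cand }'
}

echo "Release -> LTO:            $(speedup 775.1 677.3)x"
echo "LTO -> LTO + PGO:          $(speedup 677.3 631.0)x"
echo "LTO + PGO -> + BOLT:       $(speedup 631.0 623.4)x"
echo "Release -> LTO + PGO + BOLT: $(speedup 775.1 623.4)x"
```

The last ratio agrees with the 1.24 figure in the hyperfine summary, and the BOLT step on its own comes out at about 1.01.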
Further steps
I can suggest the following action points:
Perform more PGO benchmarks with other datasets (if you are interested enough). If they show improvements, add a note to the documentation (the README file?) about possible improvements in the library's performance with PGO.
You could also try to get some insight into how the code can be optimized further, based on the changes that the compiler made with PGO. This can be done by comparing flamegraphs before and after applying PGO. I don't think that anything valuable for this library can be improved in this way, though.
I would be happy to answer your questions about PGO and PLO (post-link optimization, such as BOLT).
P.S. Please do not treat this issue as a bug report. Since the "Discussions" functionality is disabled in this repo, I created an issue instead.
In the benchmark output above, the binary names denote the following build configurations:
update_release: Release
update_release_lto: Release + LTO
update_release_lto_pgo_optimized: Release + LTO + PGO
update_release_lto_pgo_optimized_bolt_optimized: Release + LTO + PGO + BOLT
update_release_lto_pgo_instrumented: Release + LTO + PGO instrumentation
update_release_lto_pgo_optimized_bolt_instrumented: Release + LTO + PGO + BOLT instrumentation