Profile-Guided Optimization (PGO) and Post-Link Optimization (PLO) benchmark results #74

zamazan4ik · 2024-05-13T12:28:11Z

Hi!

Recently I tested Profile-Guided Optimization (PGO) on many projects across different software domains - all the results are available at https://github.com/zamazan4ik/awesome-pgo . Since PGO shows measurable improvements in many cases, I decided to run PGO benchmarks on this library as well (especially because I found some performance numbers). Here are my results - I hope they will be helpful to someone.

Test environment

Fedora 39
Linux kernel 6.8.9
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.78.0
prettyplease version: the latest master branch at the time of testing, commit 179974cc93c8d54894483463c1eea4df9e70a694
Turbo Boost disabled

Benchmark

For benchmarking, I use the test scenario built into the project, which reformats several projects (described below). I used this tool since I needed an executable to optimize. As benchmark datasets, I used two projects, Vector and grafbase (both are big enough, and both were already checked out on my PC): Vector on the master branch at commit 5a4a2b2a10131af7ef4ca32ff13b9040e231f5a6, grafbase on the main branch at commit 5605d62f69790f62a385e8155bddf838f977165b. For PGO, I use the cargo-pgo tool.

I got the Release benchmark result with the taskset -c 0 prettyplease-update command. The PGO training phase runs taskset -c 0 prettyplease-update with the instrumented binary, and the PGO optimization phase uses the same taskset -c 0 prettyplease-update command. taskset -c 0 pins the process to a single CPU core to reduce the OS scheduler's influence on the results. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

I also decided to test LTO on the project. I enabled LTO with the following lines in the root Cargo.toml:

[profile.release]
codegen-units = 1
lto = true

Results

I got the following results when formatting the grafbase sources, with the Vector sources as the PGO training set:

hyperfine --warmup 5 --min-runs 10 'taskset -c 0 ../prettyplease/target/update_release' 'taskset -c 0 ../prettyplease/target/update_release_lto' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_instrumented' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_instrumented' 'taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_optimized'
Benchmark 1: taskset -c 0 ../prettyplease/target/update_release
  Time (mean ± σ):     775.1 ms ±   2.2 ms    [User: 695.3 ms, System: 76.7 ms]
  Range (min … max):   770.1 ms … 778.0 ms    10 runs

Benchmark 2: taskset -c 0 ../prettyplease/target/update_release_lto
  Time (mean ± σ):     677.3 ms ±   2.2 ms    [User: 597.9 ms, System: 76.7 ms]
  Range (min … max):   674.4 ms … 680.2 ms    10 runs

Benchmark 3: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_instrumented
  Time (mean ± σ):     869.7 ms ±   7.8 ms    [User: 776.9 ms, System: 83.1 ms]
  Range (min … max):   863.3 ms … 887.2 ms    10 runs

Benchmark 4: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized
  Time (mean ± σ):     631.0 ms ±   2.3 ms    [User: 543.7 ms, System: 79.3 ms]
  Range (min … max):   627.1 ms … 635.1 ms    10 runs

Benchmark 5: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_instrumented
  Time (mean ± σ):      1.489 s ±  0.006 s    [User: 1.221 s, System: 0.238 s]
  Range (min … max):    1.479 s …  1.503 s    10 runs

Benchmark 6: taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_optimized
  Time (mean ± σ):     623.4 ms ±   3.7 ms    [User: 529.9 ms, System: 85.6 ms]
  Range (min … max):   618.8 ms … 630.6 ms    10 runs

Summary
  taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_optimized ran
    1.01 ± 0.01 times faster than taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized
    1.09 ± 0.01 times faster than taskset -c 0 ../prettyplease/target/update_release_lto
    1.24 ± 0.01 times faster than taskset -c 0 ../prettyplease/target/update_release
    1.40 ± 0.02 times faster than taskset -c 0 ../prettyplease/target/update_release_lto_pgo_instrumented
    2.39 ± 0.02 times faster than taskset -c 0 ../prettyplease/target/update_release_lto_pgo_optimized_bolt_instrumented

where:

update_release: Release
update_release_lto: Release + LTO
update_release_lto_pgo_optimized: Release + LTO + PGO
update_release_lto_pgo_optimized_bolt_optimized: Release + LTO + PGO + BOLT
(just for reference) update_release_lto_pgo_instrumented: Release + LTO + PGO instrumentation
(just for reference) update_release_lto_pgo_optimized_bolt_instrumented: Release + LTO + PGO + BOLT instrumentation

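According to the results, LTO and PGO measurably improve performance, at least in the simple benchmark above: the mean time drops from 775.1 ms (Release) to 677.3 ms with LTO and to 631.0 ms with LTO + PGO. BOLT also improves performance, but only slightly (623.4 ms, about 1% faster than LTO + PGO).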
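To make the step-by-step contribution of each optimization easier to see, here is a minimal Python sketch that recomputes the speedups from the mean times reported by hyperfine above. The dictionary labels and the helper names (speedup, time_saved_percent) are my own, chosen for illustration; only the four mean values come from the benchmark output.

```python
# Mean wall-clock times in milliseconds, copied from the hyperfine output above.
# The labels follow the "where:" legend; the helper names are illustrative only.
mean_ms = {
    "Release": 775.1,
    "Release + LTO": 677.3,
    "Release + LTO + PGO": 631.0,
    "Release + LTO + PGO + BOLT": 623.4,
}


def speedup(baseline: str, candidate: str) -> float:
    """Return how many times faster `candidate` is than `baseline` (ratio of means)."""
    return mean_ms[baseline] / mean_ms[candidate]


def time_saved_percent(baseline: str, candidate: str) -> float:
    """Return the reduction in mean run time of `candidate` relative to `baseline`."""
    return 100.0 * (1.0 - mean_ms[candidate] / mean_ms[baseline])


if __name__ == "__main__":
    builds = list(mean_ms)
    # Incremental gain of each optimization over the previous build.
    for prev, cur in zip(builds, builds[1:]):
        print(f"{cur} vs {prev}: {speedup(prev, cur):.2f}x, "
              f"{time_saved_percent(prev, cur):.1f}% less time")
    # Cumulative gain of the fully optimized build over plain Release.
    first, last = builds[0], builds[-1]
    print(f"{last} vs {first}: {speedup(first, last):.2f}x, "
          f"{time_saved_percent(first, last):.1f}% less time")
```

Read this way, LTO gives roughly 1.14x over plain Release, PGO adds roughly 1.07x on top of LTO, and BOLT adds roughly 1.01x on top of PGO, for about 1.24x in total. These ratios ignore the standard deviations, so the last step in particular should be treated as close to the measurement noise.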

Further steps

I can suggest the following action points:

Perform more PGO benchmarks with other datasets (if you are interested enough). If they show improvements, add a note to the documentation (the README file?) about possible improvements in the library's performance with PGO.
You could probably get some insights into how the code can be optimized further, based on the changes that the compiler performed with PGO. This can be done by analyzing flamegraphs before and after applying PGO to understand the difference. I don't think that anything valuable for this library can be improved in this way, though.

I would be happy to answer your questions about PGO and PLO.

P.S. Please do not treat the issue like a bug or something like that. Since the "Discussions" functionality is disabled in this repo, I created the Issue instead.

dtolnay · 2024-05-13T16:07:16Z

Thanks!

dtolnay closed this as completed May 13, 2024
