Evaluate using Link-Time Optimization (LTO) and Profile-Guided Optimization (PGO) for the project #147

zamazan4ik · 2024-06-03T10:29:09Z

Hi!

I have evaluated compiler optimizations such as Profile-Guided Optimization (PGO) on many projects, including compilers. All the results are available at https://github.com/zamazan4ik/awesome-pgo. Since these optimizations often improve the performance of such projects, I decided to run some LTO and PGO tests with the Amber compiler. Below are the results.

Test environment

Fedora 40
Linux kernel 6.8.10
AMD Ryzen 9 5900X
48 GiB RAM
SSD Samsung 980 Pro 2 TiB
Compiler - Rustc 1.78
Amber version: master branch at commit 0a780488caa8f980a736bc1d1593728a121123d7 (the latest at the time of testing)
Turbo Boost disabled

Benchmark

I decided to perform PGO benchmarks on a simple scenario - the amber input.ab output.sh command. For PGO I use the cargo-pgo tool. The Release build is done with cargo build --release, the PGO-instrumented build with cargo pgo build, and the PGO-optimized build with cargo pgo optimize build. The training workload is the same command as the benchmark, amber input.ab output.sh, where input.ab is the script linked in the original post.
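
The build and training sequence described above can be scripted. Below is a minimal Python sketch, assuming cargo-pgo is already installed. The names pgo_steps and run_steps and the instrumented binary path are hypothetical; the real path depends on where cargo-pgo places its output on your machine. By default the script only prints the commands instead of running them.

```python
import shlex
import subprocess


def pgo_steps(instrumented_bin, input_script="input.ab", output_script="output.sh"):
    """Return the PGO workflow from the post as a list of argv lists."""
    return [
        # 1. Build the PGO-instrumented binary.
        ["cargo", "pgo", "build"],
        # 2. Training run: the instrumented binary records profile data.
        [instrumented_bin, input_script, output_script],
        # 3. Rebuild using the collected profiles.
        ["cargo", "pgo", "optimize", "build"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print("$", shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical placeholder path: replace it with the instrumented
    # binary that `cargo pgo build` produced on your machine.
    run_steps(pgo_steps("path/to/instrumented/amber"))
```

Keeping the steps as data makes it easy to swap the training input or to add more training runs before the optimize step.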

taskset -c 0 is used to reduce the OS scheduler's influence on the results during the benchmarks. All measurements are done on the same machine, with the same background "noise" (as much as I can guarantee).

LTO is enabled by adding the following lines to the [profile.release] section of the root Cargo.toml file:

lto = true
codegen-units = 1

Results

Here are the results:

hyperfine --warmup 50 --min-runs 200 -i -N 'taskset -c 0 ./amber_release uninstall.ab install.sh' 'taskset -c 0 ./amber_release_lto uninstall.ab install.sh' 'taskset -c 0 ./amber_lto_optimized uninstall.ab install.sh'
Benchmark 1: taskset -c 0 ./amber_release uninstall.ab install.sh
  Time (mean ± σ):      34.4 ms ±   0.3 ms    [User: 32.7 ms, System: 1.5 ms]
  Range (min … max):    34.2 ms …  37.0 ms    200 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 2: taskset -c 0 ./amber_release_lto uninstall.ab install.sh
  Time (mean ± σ):      34.4 ms ±   0.5 ms    [User: 32.8 ms, System: 1.3 ms]
  Range (min … max):    34.1 ms …  37.7 ms    200 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Benchmark 3: taskset -c 0 ./amber_lto_optimized uninstall.ab install.sh
  Time (mean ± σ):      33.9 ms ±   0.4 ms    [User: 32.2 ms, System: 1.4 ms]
  Range (min … max):    33.5 ms …  36.7 ms    200 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./amber_lto_optimized uninstall.ab install.sh ran
    1.01 ± 0.02 times faster than taskset -c 0 ./amber_release_lto uninstall.ab install.sh
    1.02 ± 0.02 times faster than taskset -c 0 ./amber_release uninstall.ab install.sh

where:

amber_release - Release build
amber_release_lto - Release + LTO build
amber_lto_optimized - Release + LTO + PGO optimized build

At least in the very simple test above, LTO alone shows no measurable performance improvement: both the Release and the Release + LTO builds average 34.4 ms. The improvement from PGO (about 0.5 ms, or 1-2%) is consistent across multiple tests (not only this one) but isn't huge.
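
To see how small the gain is relative to the noise, the ratios can be recomputed from the rounded means and standard deviations printed above. This is a minimal Python sketch: relative_speed is a hypothetical helper, and the uncertainty uses standard first-order error propagation for a quotient of independent measurements. That is an assumption about how such an estimate can be made, not a statement about hyperfine's internals. Because the inputs are rounded, the last digit can differ from hyperfine's summary.

```python
import math


def relative_speed(mean_ref, sd_ref, mean_fast, sd_fast):
    """Return (ratio, uncertainty) for mean_ref / mean_fast.

    The uncertainty is a first-order error propagation for a quotient,
    assuming the two measurements are independent.
    """
    ratio = mean_ref / mean_fast
    rel_err = math.sqrt((sd_ref / mean_ref) ** 2 + (sd_fast / mean_fast) ** 2)
    return ratio, ratio * rel_err


# Rounded (mean, standard deviation) pairs in ms, copied from the output above.
release = (34.4, 0.3)
release_lto = (34.4, 0.5)
lto_pgo = (33.9, 0.4)

for name, (mean, sd) in [("Release", release), ("Release + LTO", release_lto)]:
    ratio, err = relative_speed(mean, sd, *lto_pgo)
    print(f"LTO + PGO vs {name}: {ratio:.3f} ± {err:.3f}")
```

With these rounded inputs both ratios come out near 1.015, and the estimated uncertainty (roughly 0.015 to 0.02) is about as large as the gain itself, which matches the cautious reading above.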

Just for reference, the slowdown during the PGO training phase:

hyperfine --warmup 50 --min-runs 200 -i -N 'taskset -c 0 ./amber_release uninstall.ab install.sh' 'taskset -c 0 ./amber_lto_instrumented uninstall.ab install.sh'
Benchmark 1: taskset -c 0 ./amber_release uninstall.ab install.sh
  Time (mean ± σ):      34.4 ms ±   0.1 ms    [User: 32.7 ms, System: 1.4 ms]
  Range (min … max):    34.2 ms …  35.2 ms    200 runs

Benchmark 2: taskset -c 0 ./amber_lto_instrumented uninstall.ab install.sh
  Time (mean ± σ):      40.0 ms ±   0.5 ms    [User: 37.8 ms, System: 1.9 ms]
  Range (min … max):    39.5 ms …  45.0 ms    200 runs

  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.

Summary
  taskset -c 0 ./amber_release uninstall.ab install.sh ran
    1.16 ± 0.02 times faster than taskset -c 0 ./amber_lto_instrumented uninstall.ab install.sh

where:

amber_release - Release build
amber_lto_instrumented - Release + LTO + PGO instrumented build
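
The instrumentation overhead can also be stated in absolute and percentage terms. A short Python sketch using the rounded means from the output above (overhead is a hypothetical helper name):

```python
def overhead(baseline_ms, instrumented_ms):
    """Return (extra milliseconds per run, extra time in percent)."""
    extra_ms = instrumented_ms - baseline_ms
    extra_percent = (instrumented_ms / baseline_ms - 1.0) * 100.0
    return extra_ms, extra_percent


# Rounded means (ms) of the Release and the PGO-instrumented builds.
extra_ms, extra_percent = overhead(34.4, 40.0)
print(f"{extra_ms:.1f} ms per run, {extra_percent:.1f}%")
# prints: 5.6 ms per run, 16.3%
```

This overhead is only paid while collecting profiles; the PGO-optimized binary does not carry the instrumentation.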

For reference, the binary sizes:

Release: 1.7 MiB
Release + LTO: 1.3 MiB
Release + LTO + PGO instrumentation: 2.2 MiB
Release + LTO + PGO optimized: 1.2 MiB
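
For comparison, the sizes can be expressed relative to the plain Release build. A small Python sketch over the numbers listed above (size_change_percent is a hypothetical helper name):

```python
def size_change_percent(size, baseline):
    """Return the size change relative to the baseline, in percent."""
    return (size / baseline - 1.0) * 100.0


# Binary sizes in MiB, copied from the list above.
sizes_mib = {
    "Release": 1.7,
    "Release + LTO": 1.3,
    "Release + LTO + PGO instrumentation": 2.2,
    "Release + LTO + PGO optimized": 1.2,
}

baseline = sizes_mib["Release"]
for name, size in sizes_mib.items():
    print(f"{name}: {size} MiB ({size_change_percent(size, baseline):+.1f}%)")
```

LTO alone makes the binary roughly 24% smaller than the Release build, and adding PGO brings the reduction to roughly 29%. The instrumented binary is larger, but it is only used during training.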

Further steps

I can suggest the following action points:

Enable LTO. Even if it doesn't improve performance, it reduces the binary size significantly (from 1.7 MiB to 1.3 MiB in the tests above) without performance penalties for users.
Perform more PGO benchmarks with other datasets and other scenarios (if you are interested enough in it). If they show improvements, add a note to the documentation about the possible gains in the compiler's performance with PGO.
Try to get insights into how the code can be optimized further, based on the changes that the compiler made with PGO. This can be done by comparing flamegraphs before and after applying PGO.
Testing Post-Link Optimization (PLO) techniques (like LLVM BOLT) more intensively would be interesting too (Clang and Rustc already use BOLT in addition to PGO). However, I recommend starting with the usual PGO since it's a much more stable technology with fewer limitations.

I would be happy to answer your questions about PGO and PLO.

For now, I don't think there is any rush to integrate PGO into the Amber build. Later, when more features are integrated into the project and compiler performance becomes more critical (compared to other tasks), it may be worth revisiting. For now, I recommend at least enabling LTO in the build scripts for the Release builds.

b1ek (Maintainer) · 2024-06-03T11:22:44Z

just write up a PR with all of that implemented, i'd be happy to merge it if the only thing it affects is the output size.

also if PGO only saves 0.1 MB, and requires more than a line in Cargo.toml, perhaps it's not really worth the trouble? also i wonder if it (PGO) is available for cross compilation to macOS (and maybe one day Windows)

1 reply

zamazan4ik (Author) · Jun 3, 2024

> just write up a PR with all of that implemented, i'd be happy to merge it if the only thing it affects is the output size.

Got it. I posted the benchmarks above before opening any PR to collect feedback before the actual integration steps.

> also if PGO only saves 0.1 MB, and requires more than a line in Cargo.toml, perhaps it's not really worth the trouble?

The main motivation for PGO is improving the compiler's speed, not reducing the binary size. The tests show that the compiler's performance improves, but only by a few percent. At least for now, I don't see a strong reason to integrate PGO (but if the maintainers think that a <1 ms improvement is worth it - why not).

> also i wonder if it (PGO) is available for cross compilation to macOS (and maybe one day Windows)

PGO can be used for cross-compilation too. However, it's recommended to use "native" PGO profiles per OS.

Ph0enixKM (Maintainer) · 2024-06-03T15:32:28Z

@zamazan4ik Thanks for your research and contribution in this area! I think that LTO can be added right away since it doesn't require any additional work. We should focus more on feature-completeness and stability if the optimisations do not improve Amber's compile time by that much. Great work nonetheless! 🙌
